

Interacting with MapR-DB by Nicolas A Perez


Need a platform to help you with your problem solving when using large amounts of data?

MapR Platform is a great option: it is a highly efficient file system with a strong streaming API. Software Engineer Nicolas A Perez looks at what documents can be stored on this platform and how else we can make the most of it.

 

MapR Platform is an excellent choice for solving many of the problems associated with the humongous, continuously growing datasets of today's businesses.

The distributed, highly efficient file system, along with the powerful yet simple and standard streaming API, is a key component of the platform's success. However, one of its most celebrated pieces is its distributed, highly available, NoSQL JSON database.

MapR-DB supports the HBase API for backward compatibility, but the newer OJAI API is at its core.

Let’s look at an example of a document we could store in MapR-DB.

{
  "_id"  : "001",
  "name" : "Nick",
  "address": {
    "line1": "14500 Miami Lakeway",
    "line2": "Miami, FL",
    "zip"  : "10001"
  }
}

This is pure JSON that MapR-DB can store as-is. The document could be as complex as we want; there is virtually no limit on the document size, the number of fields, or the depth of nested fields.
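Although the client examples in this article use Java, the shape of such a document is easy to illustrate with any JSON tooling. The following is a minimal sketch in plain Python (not MapR-DB code) showing that the document above is ordinary JSON: it round-trips through standard serialization, and nested fields are reachable by path.

```python
import json

# Plain Python, not MapR-DB code: the document above is ordinary JSON.
document = {
    "_id": "001",
    "name": "Nick",
    "address": {
        "line1": "14500 Miami Lakeway",
        "line2": "Miami, FL",
        "zip": "10001",
    },
}

# Round-trip through standard serialization; nothing is lost.
restored = json.loads(json.dumps(document))
print(restored["address"]["zip"])  # -> 10001
```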

Documents are stored across the MapR cluster so that reading and writing from/to a table happens in parallel, distributing the workload and gaining impressive performance numbers as shown in some independent benchmarks.

Those benchmarks show that MapR-DB can perform significantly more operations per second than its rivals, while keeping latency low, constant, and predictable.

The entire comparison can be found here.

When reading or updating documents, MapR-DB knows which parts of a document need to be read or updated, and only those parts are actually touched. MapR-DB tries to manipulate documents, tables, and the underlying file system efficiently in order to keep performance at its best.
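The partial-update idea can be sketched in a toy form. The following plain-Python function is hypothetical, not MapR-DB code; it only mimics how a dotted field path (like the "trailers.teaser" path used by OJAI later in this article) addresses a single nested part of a document, leaving the rest untouched.

```python
# Toy sketch, not MapR-DB code: apply an update to one nested field
# addressed by a dotted path, leaving the rest of the document untouched.
def set_field(document, dotted_path, value):
    """Set a nested field in a dict using a dotted path such as
    'address.zip', creating intermediate objects as needed."""
    keys = dotted_path.split(".")
    node = document
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return document

doc = {"_id": "001", "name": "Nick", "address": {"line1": "14500 Miami Lakeway"}}
set_field(doc, "address.zip", "10001")
print(doc["address"]["zip"])  # -> 10001
```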

 

Querying MapR-DB

MapR-DB is a NoSQL database, so it does not support SQL natively. The OJAI API is the preferred way to interact with MapR-DB, and by using this API we can take advantage of every feature this database offers.

We can use any of the provided clients to run queries on MapR-DB. An example of creating a document using the Java API is the following.

public Document buildDocument() {
    return connection.newDocument()
        .setId("movie0000001")
        .set("title", "OJAI -- The Documentary")
        .set("studio", "MapR Technologies, Inc.")
        .set("release_date", Values.parseDate("2015-09-29"))
        .set("trailers.teaser", "https://10.10.21.90/trailers/teaser")
        .set("trailers.theatrical", "https://10.10.21.90/trailers/theatrical")
        .setArray("characters",
            ImmutableList.of(
                "Heroic Developer", "Evil Release Manager", "Mad Development Manager"))
        .set("box_office_gross", 1000000000L);
}

As we can see, the API allows us to manipulate objects in a friendly way, since they represent JSON documents.

Through the OJAI API, we can run all kinds of operations against MapR-DB, such as inserts, updates, and so on.

public class OJAI_002_GetStoreAndInsertDocuments {

  public static void main(String[] args) {

    System.out.println("==== Start Application ===");

    // Create an OJAI connection to the MapR cluster
    final Connection connection = DriverManager.getConnection("ojai:mapr:");

    // Get an instance of an OJAI DocumentStore
    final DocumentStore store = connection.getStore("/demo_table");

    for (final User someUser : Dataset.users) {
      // Create an OJAI Document from the Java bean (there are other ways too)
      final Document userDocument = connection.newDocument(someUser);

      System.out.println("\t inserting " + userDocument.getId());

      // Insert the OJAI Document into the DocumentStore
      store.insertOrReplace(userDocument);
    }

    // Close this instance of the OJAI DocumentStore
    store.close();

    // Close the OJAI connection and release any resources held by it
    connection.close();

    System.out.println("==== End Application ===");
  }
}

Basically, from any application that is able to use the OJAI API, we can do most of our work in MapR-DB. However, we could ask ourselves: what about other types of tools that require different processing capabilities?

Examples of these are BI tools doing aggregations such as counts, group-bys, and sums. On the other hand, we should also be able to quickly look at values in the database without the need to write applications. Is this possible with MapR-DB? Let's explore our options.

 

MapR-DB Shell

MapR-DB offers a tool called 'dbshell' that can be used to query the database using its native language.

➜  ~ mapr dbshell
====================================================
*                  MapR-DB Shell                   *
* NOTE: This is a shell for JSON table operations. *
====================================================
Version: 6.1.0-mapr


MapR-DB Shell
maprdb mapr:>

Using the 'dbshell' we can explore what tables we have, query them in every possible way, and more. Let's see some examples.

Let’s start by listing the tables we have under a 'path'.

maprdb mapr:> ls /user/mapr/tables
Found 1 items
tr--------   - 5000 5000          3 2018-12-14 09:31 /user/mapr/tables/users
maprdb mapr:>

Let’s insert some values into this table.

maprdb mapr:> insert --t /user/mapr/tables/users --v '{"_id": "2", "name": "Nick", "age": 29}'
maprdb mapr:> insert --t /user/mapr/tables/users --v '{"_id": "1", "name": "Martha", "age": 25}'

Now, let’s list the documents.

maprdb mapr:> find /user/mapr/tables/users
{"_id":"1","age":25,"name":"Martha"}
{"_id":"2","age":29,"name":"Nick"}
2 document(s) found.

We can query by 'id'.

maprdb mapr:> find /user/mapr/tables/users --id "1"
{"_id":"1","age":25,"name":"Martha"}
1 document(s) found.

Or we can use any other fields.

find /user/mapr/tables/users --q '{"$where": {"$eq": {"age": 29}}}'
{"_id":"2","age":29,"name":"Nick"}
1 document(s) found.

Notice how the query is done. This is the OJAI query language and API playing their roles. This is native to MapR-DB. Remember, it is not a SQL database.
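The condition passed to 'find --q' is just a JSON object, so it can be built programmatically instead of typed by hand. Here is a small sketch in plain Python for illustration only; 'eq_condition' is a made-up helper, not part of any MapR client library.

```python
import json

# Illustrative only: build the OJAI-style equality condition used above.
# 'eq_condition' is a hypothetical helper, not part of any MapR client.
def eq_condition(field, value):
    """Return an OJAI-style condition: {"$where": {"$eq": {field: value}}}."""
    return {"$where": {"$eq": {field: value}}}

query = eq_condition("age", 29)
print(json.dumps(query))  # -> {"$where": {"$eq": {"age": 29}}}
```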

As you can imagine, the 'dbshell' is a nice way to get a taste of how MapR-DB works and to do quick, simple explorations. However, it is hard to think of it as the preferred tool for large and complex queries.

Let’s continue to explore the options we have and how to use them.

 

MapR-DB Connector for Apache Spark

MapR offers a connector for Apache Spark that can be used for large data processing on top of MapR-DB.

The connector can be used with the different Spark APIs such as 'RDD[A]', 'DStream[A]', and 'DataFrame'/'Dataset[A]'.

To use the connector, we must first add the right dependencies to our Spark project. The following is a 'build.sbt' file from the 'Reactor' project.

...
resolvers += "MapR Releases" at "http://repository.mapr.com/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.2.1" % "provided",
  "org.apache.spark" % "spark-sql_2.11" % "2.2.1" % "provided",
  "org.apache.spark" % "spark-streaming_2.11" % "2.2.1" % "provided",

  "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.7",
  "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.9.7",

  "com.typesafe.play" % "play-json_2.11" % "2.3.8",

  "com.github.scopt" %% "scopt" % "3.7.0",

  "org.ojai" % "ojai-scala" % "2.0.1-mapr-1804",
  "org.apache.spark" % "spark-streaming-kafka-0-9_2.11" % "2.0.1-mapr-1611",
  "com.mapr.db" % "maprdb-spark" % "2.2.1-mapr-1803" % "provided"
)
...

Now, we should be able to use the connector without problems.

...
import com.mapr.db.spark.sql._
import com.mapr.db.spark.{MapRDBSpark, _}
...
val fromDBDF = sparkSession.loadFromMapRDB(appConfig.tableName, dbSchema)
...
finalDF.saveToMapRDB(appConfig.tableName)
...

The example above only shows a fragment of the app, but notice how the connector is used to load and save DataFrames from/to MapR-DB. The same can be done for other Apache Spark abstractions as mentioned before.

Using the MapR-DB Connector for Apache Spark opens up limitless possibilities: by combining the distributed natures of MapR-DB and Apache Spark, we are able to truly process data at scale.

Even though Apache Spark is one of the best tools we can have in our toolset, sometimes it is just not enough. We need to ask ourselves how users with no coding experience can use the powerful features of MapR-DB without going through the learning process of Spark, which, frankly, is neither short nor easy.

 

Distributed Processing using Apache Drill

When we need SQL, we have Drill.

Using Apache Drill, we can query almost any dataset living in the MapR Platform regardless of where it is stored, how it is formatted, or its size.

Interacting with Drill can be done through its different interfaces. Let's start with the Drill shell, since it offers a very simple, shell-based solution.

[mapr@psnode172 root]$ sqlline
apache drill 1.14.0-mapr
"a drill in the hand is better than two in the bush"
!connect jdbc:drill:drillbits=localhost:31010;auth=maprsasl
1: jdbc:drill:drillbits=localhost:31010> select * from dfs.`/user/mapr/tables/users`;
+------+-------+---------+
| _id  |  age  |  name   |
+------+-------+---------+
| 1    | 25.0  | Martha  |
| 2    | 29.0  | Nick    |
+------+-------+---------+
2 rows selected (0.409 seconds)

As we can see, we can query MapR-DB, a NoSQL database, using pure SQL through Apache Drill. The result, as expected, comes back as a table. As you might suspect, queries of all kinds can be executed; aggregations are especially interesting.

1: jdbc:drill:drillbits=localhost:31010> select count(*) from dfs.`/user/mapr/tables/users`;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+
1 row selected (0.463 seconds)
1: jdbc:drill:drillbits=localhost:31010> select name, count(*) as t from dfs.`/user/mapr/tables/users` group by name;
+---------+----+
|  name   | t  |
+---------+----+
| Martha  | 1  |
| Nick    | 1  |
+---------+----+
2 rows selected (0.383 seconds)

Running queries like this on top of MapR-DB is mind-blowing. Drill knows exactly how to transform the SQL queries to the underlying MapR-DB query language.

It is important to notice that Drill also runs distributed on the MapR cluster, so the same principles of data distribution and high performance continue to apply here.

 

Other Apache Drill Interfaces

The shell is not the only interface Drill supports. We can also use Drill through the REST interface.

curl -X POST -H "Content-Type: application/json" -d '{"queryType":"SQL", "query": "select * from dfs.`/user/mapr/tables/users`"}' http://localhost:8047/query.json
{
  "columns" : [ "_id", "age", "name" ],
  "rows" : [ {
    "_id" : "1",
    "age" : "25",
    "name" : "Martha"
  },
  {
    "_id" : "2",
    "age" : "29",
    "name" : "Nick"
  } ]
}
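Since the request body in the curl example above is plain JSON, any HTTP client can build and send it. Below is a minimal sketch in Python; the endpoint ('http://localhost:8047/query.json') and field names are taken from the example above, and only the payload is constructed here, since an actual POST would require a running Drill instance.

```python
import json

# Build the request body from the curl example above. Only the payload is
# constructed here; posting it would require a running Drill instance at
# http://localhost:8047/query.json (the endpoint used in the example).
payload = {
    "queryType": "SQL",
    "query": "select * from dfs.`/user/mapr/tables/users`",
}
body = json.dumps(payload)
print(body)
```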

Drill also offers a web interface for friendlier usage. Accompanying these interfaces are the JDBC and ODBC interfaces, which are very important for BI tools like Tableau, MicroStrategy, and others to connect to and interact with Drill.

The same ideas we discussed before apply here. For example, Tableau could connect to Drill through JDBC, and Drill would run distributed queries on top of MapR-DB. This makes MapR-DB a very versatile and capable database.

 

Conclusions

MapR-DB is one of the most capable NoSQL options out there. It offers HBase and JSON capabilities under the same platform, and it runs distributed on the MapR cluster, sharing most of the properties of the underlying platform (MapR-FS). MapR-DB can be queried in many forms: the OJAI API for applications, 'dbshell' for quick and simple interactions, Apache Spark for data processing at scale, and Apache Drill for SQL queries, data analytics, and BI tool integrations. Regardless of the tool being used, MapR-DB keeps performance a priority by maintaining low latency and a high rate of operations per second at any scale, which makes it perfect for the next generation of workloads.

Other tools for MapR-DB are independently developed, for instance 'maprdbcls', which can be found here. It allows deleting documents (records) based on queries.


This article was written by Nicolas A Perez and originally posted on Medium.