Review: RethinkDB rethinks real-time Web apps

RethinkDB's NoSQL document store delivers high-speed change notifications for super-responsive apps


Like MongoDB or Couchbase, RethinkDB is a clustered, document-oriented database that delivers the benefits of flexible schema, easy development and high scalability. Unlike those other document databases, RethinkDB supports “real-time” applications with the ability to continuously push updated query results to applications that subscribe to changes.

In this case, a “real-time” application is one that must support a large flow of client requests that alter the state of the database and keep all clients apprised of those changes. A common example of a real-time application is the multiplayer game: Hundreds or thousands of users are pushing buttons, those button pushes are changing the game state, and all of the users must see all of the changes in real time.

RethinkDB expends a great deal of effort ensuring that data change events are quickly dispatched throughout the cluster. And it provides this high-speed event processing mechanism while offering plenty of control over database consistency and durability.

Nevertheless, even the RethinkDB engineers admit that, if your primary consideration for a database is ACID compliance, RethinkDB probably shouldn't be your first choice. The principal reason: RethinkDB does not support transactions across multiple documents. However, within a single document, RethinkDB is fully ACID compliant.

Even so, RethinkDB’s “real-time push” technology (explained below), which keeps clients apprised of database changes as they happen, makes it ideal for underpinning applications that must present the most up-to-date view of database state. Further, RethinkDB’s easy-to-grasp query language -- embedded in a host of popular programming languages -- and its out-of-the-box management and monitoring GUI make for a smooth on-ramp to putting RethinkDB to work in such applications.

NoSQL, so no schemas

RethinkDB is a JSON document database. A JSON document represents a structured object consisting of key/value pairs. Each value can be a primitive data type (integer, floating-point number, string, Boolean), an array, or a nested JSON object. This means, of course, that JSON can describe arbitrarily complex objects.

RethinkDB stores documents in tables. While this might lead one to think that RethinkDB has relational database ingredients, a table is in fact simply a logical container; the RethinkDB engineers chose to call that container a “table” so that developers coming from a relational background would feel comfortable.

A table in RethinkDB places no real restrictions on the structure of its contained documents. In the relational world, all rows within a table necessarily have the same structure (that is, they have the same fields). But RethinkDB is schema-less: No two documents, not even documents within the same table, need to have the same structure. Of course, it’s generally beneficial for all documents in a table to have the same structure, as it simplifies organization and management. But the flexibility is there, if needed. The RethinkDB documentation provides a nice overview of data modeling options in its “yes it’s a table but not really” world.

Real time with changefeeds

In a typical database system, clients discover alterations to database contents by querying the database. To learn if customer X has updated her shopping cart, you fetch customer X’s shopping cart and look inside. Of course, you can improve the throughput if you associate a timestamp with the shopping cart. Rather than fetch the entire shopping cart object, you can fetch the timestamp. If it’s changed since the last time it was fetched, then you get the shopping cart. In both cases, though, an application becomes aware of database changes by polling it.
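The timestamp-polling pattern described above can be sketched in a few lines of Python. This is a generic illustration, not RethinkDB driver code; `fetch_timestamp` and `fetch_cart` are hypothetical helpers standing in for whatever calls your application would make.

```python
def poll_step(fetch_timestamp, fetch_cart, last_seen):
    """One polling round: make the cheap timestamp call first, and only
    make the expensive full-cart fetch if the timestamp has moved.
    Returns (new_last_seen, cart_or_None)."""
    ts = fetch_timestamp()          # cheap call: just the timestamp
    if ts != last_seen:
        return ts, fetch_cart()     # expensive call: the whole cart
    return last_seen, None          # nothing changed since last poll

# Usage: the application repeats poll_step on a timer, acting only
# when a non-None cart comes back.
state, cart = poll_step(lambda: 1, lambda: {"items": 2}, None)
```

The point of the sketch is the weakness it exposes: even with the timestamp optimization, the client still pays for a round trip on every poll, changed or not.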

RethinkDB provides a “real-time push” capability. RethinkDB client applications register themselves as listeners to specific database events. If something changes in the database, the database notifies the client, which doesn’t have to repeatedly poll the database.

This capability, called a changefeed, can be applied to a table, a document, or a query. Your application registers a changefeed via the changes() command. (RethinkDB’s query language is embedded, so commands are native language method calls, as described in more detail later.) The result of the changes() command is an “infinite cursor” -- that is, a cursor object that provides a more or less unending set of change documents. Typically, this is handled inside what amounts to a callback function, in which code iteratively fetches a change document from the cursor. The code looks like this:

rdb.table('<tablename>').changes().run(conn, function(err, cursor) {
  cursor.each( /* ... code to handle each change document ... */ );
});

The change document is a JSON document containing the previous and current value of the item that’s been changed. The cursor blocks when no new changes are pending.

You can configure changefeeds to throttle the delivery of information. For example, you might configure a changefeed to wait for N changes before sending a response to the listening application. RethinkDB will merge multiple changes together into a single response. This reduces network traffic. In addition, if many alterations occur between subsequent fetches from the changefeed cursor, RethinkDB will collapse the result so that intermediate changes are discarded, and only the previous and current values are reported.

Change documents also include informative state information, such as whether the item is an initial document (in which case the change is really an addition). Changes are buffered at the server, and if that buffer hits its limit, the server will discard early change documents and insert a special error document into the stream of changes. This error document will indicate how many documents were skipped on account of the buffer overrun.

RethinkDB admin

RethinkDB’s management dashboard provides many views into cluster performance. Here the dashboard shows real-time read and write throughput statistics for a table within the database.

Sharding in RethinkDB

As a distributed database, RethinkDB spreads data around the cluster by sharding. RethinkDB uses range sharding, assigning each document to a shard based on that document’s primary key. Documents whose keys are relatively contiguous -- in the same range -- are placed in the same shard. (This is in contrast to hash-based sharding, which assigns documents to shards more or less randomly, as the sharding is governed by a hashed value.)

While range sharding improves query responsiveness, it runs the risk of unbalancing the cluster, should primary keys (like last names in a customer database) be unevenly distributed. Fortunately, RethinkDB intelligently “slices” ranges, cutting overfilled ranges into more numerous segments to rebalance an unbalanced cluster. You can also manually rebalance a cluster through the management console, should the need arise.
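The range-sharding idea can be sketched with Python’s `bisect` module, given a sorted list of split points that divide the key space. This is a minimal illustration of the concept, not RethinkDB’s actual implementation:

```python
import bisect

def shard_for_key(key, split_points):
    """Range sharding: split_points is a sorted list of boundary keys.
    A document lands in the shard whose key range contains its primary key,
    so contiguous keys cluster in the same shard."""
    return bisect.bisect_right(split_points, key)

# Three shards over last names: keys below "H", keys in ["H", "P"),
# and keys from "P" upward.
splits = ["H", "P"]
shard_for_key("Garcia", splits)   # -> 0
shard_for_key("Nguyen", splits)   # -> 1
shard_for_key("Smith", splits)    # -> 2
```

Rebalancing, in these terms, amounts to moving or adding split points so that each shard holds roughly the same number of documents.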

Naturally, RethinkDB also supports replication. Each shard is assigned to a primary replica node, but also copied to one or more secondary replica nodes. Reads and writes for a given document are routed to the primary replica node for the shard within which the document resides. As long as the primary replica node is available, reads and writes are consistent. Otherwise, writes are deferred, and reads are served by one of the secondary replica nodes.

Ordinarily, application read operations are up to date. That is, if a read operation follows a write operation (on a particular document), the read is guaranteed to see the effects of the write because both will be served by the shard’s primary node. However, if you want to speed up your application’s read throughput, you can configure reads to be out of date, in which case reads will be directed to the nearest replica node. Recognize, of course, that you’re trading consistency for speed -- it’s possible that a read will return “old” data that has already been modified by a write on the primary replica node but not yet propagated to the secondary replica.

Finally, RethinkDB supports failover, which requires that the cluster have at least three nodes and that tables be configured with three or more replicas. If a node becomes unavailable and happens to host the primary replica for a table, then one of the secondary nodes is selected by RethinkDB to become the new primary. No data is lost. Should the lost node come back online, it will resume its position as primary. Note that, even if a majority of replicas for a given table are lost, data can still be retrieved, though it requires a special recovery operation.

ReQL, the RethinkDB query language

RethinkDB’s query language is called ReQL; it’s easy enough to guess what that stands for. Applications employ ReQL via a client driver library. Official drivers are available for Ruby, Python, and JavaScript. Community drivers are available for many other languages, including Node.js, PHP, C++, and C#. (See the RethinkDB website for a list of available drivers.)

The driver supplies the methods that a developer uses to build ReQL queries. ReQL is an “embedded query language,” meaning you create ReQL queries by chaining together function/method calls in whatever native language you’re using. You’re not writing queries as strings that are handed off to a query language process for parsing, as done in SQL.

An example ReQL query (written in Ruby) might look like this:

r.db("birds").table("sightings").insert({:name => "Warbler", :location => "North Street", :quantity => 2}).run(conn)

In the above, r is the RethinkDB namespace, and conn represents the connection to the server. Obviously, this is an insert operation on the sightings table within the birds database. A small, three-element JSON document is being added to the table.

Typically, data in a ReQL query flows from left to right. You begin with the RethinkDB namespace, make a call to reference a table, then issue a command -- for example, pluck() -- to filter the documents within that table, and so on. This sort of chaining is frequently seen in JavaScript, so JavaScript programmers should find the construction of ReQL queries easy to grasp.

Because ReQL queries are written in the native language of the client driver, they are immune to injection attacks. You are also free to use constructs of the driver language -- Python lambdas, for instance -- to make queries more expressive. The driver translates the query into a kind of pseudo-language and sends the translation to the server for execution. Thus, you can treat ReQL queries much like stored procedures. You can construct a query, put it in a variable, and pass it to the server to be executed later.
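The “query as a value” idea can be illustrated with a toy builder in Python. This is a sketch of the embedded-query pattern, not the actual RethinkDB driver: each chained call appends a step, and nothing executes until `run()` is called, mirroring ReQL’s connection-based execution.

```python
class Query:
    """Toy embedded query builder: chained calls build an immutable
    description of the query; run() executes it against a data source."""
    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def table(self, name):
        return Query(self.steps + (("table", name),))

    def filter(self, predicate):
        return Query(self.steps + (("filter", predicate),))

    def run(self, db):
        rows = []
        for op, arg in self.steps:
            if op == "table":
                rows = list(db[arg])       # start from the named table
            elif op == "filter":
                rows = [r for r in rows if arg(r)]
        return rows

db = {"sightings": [{"name": "Warbler", "quantity": 2},
                    {"name": "Heron", "quantity": 1}]}

# The query is just a value: build it now, store it, execute it later --
# much like a stored procedure.
q = Query().table("sightings").filter(lambda r: r["quantity"] > 1)
q.run(db)  # -> [{'name': 'Warbler', 'quantity': 2}]
```

Because the predicate is a native lambda rather than a string spliced into query text, there is nothing for an injection attack to exploit.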

Unfortunately, RethinkDB does not yet have a query optimizer, though the system is designed to support one. An optimizer is planned for a future release. However, RethinkDB will parallelize queries when possible. A RethinkDB cluster is symmetric in that a query can be sent to any cluster member, which will forward the query to the proper destination for processing.

You can actually perform the equivalent of a relational JOIN operation. In a relational system, a JOIN of two tables connects specified columns in each table. In RethinkDB, a JOIN of two tables matches specified fields in the documents of one table against fields in the documents of the other.

In addition, ReQL supports a variant of map-reduce called group-map-reduce (GMR). The map operation will fetch a sequence from the database, a sequence being a set of documents or document fields, and transform it into a different sequence. The reduce operation aggregates the results prior to delivery as the query’s response. The additional “group” step can be used to amass the sequences into partitions for which separate results are produced. (For example, you might use the group operation to gather the results of a map-reduce process by gender.)
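A minimal group-map-reduce sketch in plain Python, illustrating the idea rather than ReQL’s actual API, might look like this:

```python
from collections import defaultdict

def group_map_reduce(docs, group_field, map_fn, reduce_fn, init):
    """Group documents by a field, map each document to a value, then
    reduce the mapped values within each group to a single result."""
    groups = defaultdict(lambda: init)
    for doc in docs:
        key = doc[group_field]                       # group step
        groups[key] = reduce_fn(groups[key], map_fn(doc))  # map + reduce
    return dict(groups)

# Sum ages, partitioned by gender -- the example from the text.
people = [{"gender": "F", "age": 34},
          {"gender": "M", "age": 28},
          {"gender": "F", "age": 40}]
group_map_reduce(people, "gender",
                 lambda d: d["age"],      # map: document -> value
                 lambda a, b: a + b,      # reduce: combine two values
                 0)                       # -> {'F': 74, 'M': 28}
```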

RethinkDB’s GMR system is a low-level API into the database. Higher-level ReQL commands (JOIN, for example) are compiled into GMR queries and executed on the server by the GMR infrastructure. As with map-reduce operations in Hadoop, resources are allocated automatically, and the system manages the priorities of concurrent queries. More extensive control of resource allocation for GMR queries is planned in a future release of RethinkDB.

Managing RethinkDB

Right out of the box, RethinkDB supplies a browser-based management GUI. The GUI’s dashboard includes multiple views of various aspects of the cluster. For example, you can display cluster I/O performance (reads/writes per second), along with summaries such as the number of servers, number of tables, indexes, disk usage, and so on.

Select a table, and you can view statistics for that table (such as number of documents), as well as see how the data in the table is distributed across individual nodes. Similarly, you can drill down into information on individual server nodes in the cluster to glean the node’s uptime, its cache size, how many shards it is responsible for, whether it is the primary or secondary replica node for a given table, and so on.
