NoSQL showdown: MongoDB vs. Couchbase
MongoDB edges Couchbase Server with richer querying and indexing options, as well as superior ease-of-use
Document databases may be the most popular NoSQL database variant of them all. Their great flexibility -- schemas can be grown or changed with remarkable ease -- makes them suitable for a wide range of applications, and their object nature fits in well with current programming practices. In turn, Couchbase Server and MongoDB have become two of the more popular representatives of open source document databases, though Couchbase Server is a recent arrival among the ranks of document databases.
In this context, the word "document" does not mean a word processing file or a PDF. Rather, a document is a data structure defined as a collection of named fields. JSON (JavaScript Object Notation) is currently the most widely used notation for defining documents within document-oriented databases. JSON's advantage as an object notation is that, once you comprehend its syntax -- and JSON is remarkably easy to grasp -- then you have all you need to define what amounts to the schema of a document database. That's because, in a document database, each document carries its own schema -- unlike an RDBMS, in which every row in a given table must have the same columns.
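To make the "each document carries its own schema" point concrete, here is a minimal sketch in Python. The collection, field names, and values are invented for illustration and do not come from either product.

```python
import json

# Two documents in the same hypothetical collection.
# Each carries its own set of named fields; no shared table schema is required.
doc_a = {"name": "Alice", "email": "alice@example.com"}
doc_b = {"name": "Bob", "languages": ["Python", "Go"], "address": {"city": "Springfield"}}

collection = [doc_a, doc_b]

def field_names(document):
    """Return the sorted top-level field names of one document."""
    return sorted(document.keys())

for document in collection:
    # Serialize to JSON text, the notation a document database would accept.
    print(json.dumps(document), "->", field_names(document))
```

In an RDBMS, adding the languages or address field would mean altering the table for every row; here the second document simply has fields the first one lacks.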
The latest versions of Couchbase Server and MongoDB are both newly arrived. In December 2012, Couchbase released Couchbase Server 2.0, a version that makes Couchbase Server a full-fledged document database. Prior to that release, users could store JSON data in Couchbase, but the database wrote JSON data as a blob. Couchbase was, effectively, a key/value database.
10gen released MongoDB 2.4 just this week. MongoDB has been a document database from the get-go. This latest release incorporates numerous performance and usability enhancements.
Both databases are designed to run on commodity hardware, as well as for horizontal scaling via sharding (in Couchbase, the rough equivalent to a shard is called a partition). Both databases employ JSON as the document definition notation, though in MongoDB, the notation is BSON (Binary JSON), a binary-encoded superset of JSON that defines useful data types not found in JSON. While both databases employ JavaScript as the primary data manipulation language, both provide APIs for all the most popular programming languages to allow applications direct access to database operations.
Key differences
Of course there are differences. First, MongoDB's handling of documents is better developed. This becomes most obvious in the mongo shell, which serves the dual purpose of providing a management and development window into a MongoDB database. Databases, collections, and documents are first-class entities in the shell. Collections are actually properties on database objects.
This is not to say that Couchbase is hobbled. You can easily manage your Couchbase cluster -- adding, deleting, and fetching documents -- from the Couchbase Management GUI, for which MongoDB has no counterpart. Indeed, if you prefer management via GUI consoles, score one for Couchbase Server. If, however, you favor life at the command line, you will be tipped in MongoDB's direction.
The cloud-based MongoDB Monitoring Service (MMS), which gathers statistics, is not meant to be a full-blown database management interface. But MongoDB's environment provides a near seamless connection between the data objects abstracted in the mongo shell and the database entities they model. This is particularly apparent when you discover that MongoDB allows you to create indexes on a specific document field using a single function call, whereas indexes in Couchbase must be created by more complex mapreduce operations.
In addition, while Couchbase documents are described via JSON, MongoDB documents are described in BSON; the latter notation includes a richer set of useful data types, such as 32-bit and 64-bit integer types, date types, and byte arrays. Both support geospatial data and queries, but this support in Couchbase is currently in an experimental phase and likely won't stay there long. New in version 2.4, MongoDB's full text search capability is also integrated with the database. A similar capability is available in Couchbase Server, but requires a plug-in for the elasticsearch tool.
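The practical effect of the missing data types is easy to see with a plain JSON encoder. The sketch below uses only Python's standard json module, not a BSON library or either database's driver; the record and the to_plain_json helper are hypothetical.

```python
import json
from datetime import datetime, timezone

record = {
    "count": 2**40,  # JSON has one number type; no 32-bit/64-bit distinction
    "created": datetime(2013, 1, 1, tzinfo=timezone.utc),
    "thumbnail": b"\x89PNG",
}

def to_plain_json(doc):
    """Encode a document as JSON, converting unsupported types by hand."""
    def fallback(value):
        if isinstance(value, datetime):
            return value.isoformat()  # the date becomes an ordinary string
        if isinstance(value, bytes):
            return value.hex()        # the bytes become a hex string
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return json.dumps(doc, default=fallback)

# Plain JSON encoding fails because dates and byte arrays have no JSON form.
try:
    json.dumps(record)
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

encoded = to_plain_json(record)
decoded = json.loads(encoded)  # the date and bytes come back as strings
```

With JSON, the application has to agree on such string conventions itself; a notation with native date and binary types removes that step.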
Both Couchbase Server and MongoDB provide data safety via replication, both within a cluster (where live documents are protected from loss by the invisible creation of replica documents) and outside of a cluster (through cross data-center replication). Also, both provide access parallelism through sharding. However, where both Couchbase and MongoDB support hash sharding, MongoDB also supports range sharding and "tag" sharding. This is a two-edged sword. On the one hand, it puts a great deal of flexibility at a database administrator's fingertips. On the other hand, its misuse can result in an imbalanced cluster.
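The imbalance risk can be illustrated with a toy comparison of hash and range sharding. This is a sketch under stated assumptions, not either product's algorithm: the shard count, the CRC32 hash, the range boundaries, and the key pattern are all invented.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size

def hash_shard(key: str) -> int:
    """Hash sharding: the shard depends only on a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

# Range sharding: each shard owns a contiguous slice of the key space.
# Shard 0 holds keys < "g", shard 1 keys < "n", shard 2 keys < "t", shard 3 the rest.
RANGE_UPPER_BOUNDS = ["g", "n", "t"]

def range_shard(key: str) -> int:
    """Return the first shard whose upper bound is greater than the key."""
    for shard, upper in enumerate(RANGE_UPPER_BOUNDS):
        if key < upper:
            return shard
    return len(RANGE_UPPER_BOUNDS)

# A skewed workload: every key starts with "user", so all fall in one range.
keys = [f"user{i:04d}" for i in range(1000)]
hash_counts = Counter(hash_shard(k) for k in keys)
range_counts = Counter(range_shard(k) for k in keys)

print("hash: ", dict(hash_counts))
print("range:", dict(range_counts))
```

With these poorly chosen boundaries, every key lands on the last range shard, while the hash spreads the same keys across shards. A well-chosen range key avoids this, which is why the flexibility cuts both ways.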
Mapreduce is a key tool used in both Couchbase and MongoDB, but for different purposes. In MongoDB, mapreduce serves as the means of general data processing, information aggregating, and analytics. In Couchbase, it is the means of creating indexes for the purpose of querying data in the database. (We suspect that this, like the poorer document handling, is an effect of Couchbase's only having recently morphed into a document database.) As a result, it's easier to create indexes and perform ad hoc queries in MongoDB.
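For readers new to the pattern, the aggregation style of mapreduce looks roughly like the following. This is a pure-Python sketch of the general technique, not MongoDB's or Couchbase's engine; the orders data and the function names are hypothetical.

```python
from collections import defaultdict

orders = [
    {"customer": "alice", "total": 30},
    {"customer": "bob", "total": 12},
    {"customer": "alice", "total": 5},
]

def map_order(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["customer"], doc["total"]

def reduce_totals(key, values):
    """Reduce step: fold all values emitted for one key into a single result."""
    return sum(values)

def map_reduce(documents, map_fn, reduce_fn):
    """Group every emitted value by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

totals = map_reduce(orders, map_order, reduce_totals)
print(totals)  # {'alice': 35, 'bob': 12}
```

Used for analytics, the reduced result is the answer. Used for indexing, the emitted keys themselves become the sorted structure that later queries search.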
Couchbase's full incorporation of Memcached has no counterpart in MongoDB, and Memcached is a powerful adjunct as a general object caching system for high-throughput, data-intensive Internet and intranet applications. If your application needs a Memcached server with your database, then look no further than Couchbase.
In general, the two systems are neck-and-neck in terms of features provided, though the ways those features are implemented may differ. Further, the advantages that one might hold over the other will certainly come and go as development proceeds. Both provide database drivers and client frameworks in all the popular programming languages, both are open source, both are easily installed, and both enjoy plenty of online documentation and active community support. As is typical for such well-matched systems, the best advice anyone could give for choosing one over the other is to install them both and try them out.
Couchbase promotes Couchbase Server as a solution for real-time access, not data warehousing. Nor is Couchbase Server suitable for batch-oriented analytic processing -- it is designed to be an operational data store.
Though Couchbase Server is based on Apache CouchDB, it is more than CouchDB with incremental modifications. For starters, Couchbase is an amalgam of CouchDB and Memcached, the distributed, in-memory, key/value storage system. In fact, Couchbase can be used as a direct replacement for Memcached. The system provides a separate port that unmodified, legacy Memcached clients can use, as well as "smart SDK" and proxy tools that improve its performance as a Memcached server.
For example, you can use a "thick client" deployment model, which will place the continuously updated knowledge of Memcached node topology on the client tier. This speeds response, as any request for a particular Memcached object will be sent from the client directly to the caching node for that object. This thick-client approach also plays an important role in the Couchbase system's resilience to node crashes (described later).
Couchbase includes its own object-level caching system based on Memcached, though with enhancements. For example, Couchbase tracks working sets (the documents most frequently accessed on a given node) in its object cache using NRU (not recently used) algorithms. All I/O operations act on this in-memory cache. Updates to documents in the cache are eventually persisted to disk. In addition, for updates, locking is employed at the document level -- not at the node, database, or partition level (which would hobble throughput with numerous I/O waits), nor at the field level (which would snarl the system with memory and CPU cycles required to track the locks).
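The idea behind not-recently-used eviction can be sketched in a few lines. This is a generic illustration with one reference bit per entry, not Couchbase's actual cache; the NRUCache class and its methods are hypothetical.

```python
class NRUCache:
    """Tiny not-recently-used cache: one reference bit per entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}      # key -> cached value
        self.referenced = {}  # key -> True if touched since the bits were last cleared

    def get(self, key):
        if key not in self.values:
            return None
        self.referenced[key] = True
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            self._evict()
        self.values[key] = value
        self.referenced[key] = True

    def _evict(self):
        # Prefer a victim whose reference bit is clear.
        victim = next((k for k, used in self.referenced.items() if not used), None)
        if victim is None:
            # Everything was used recently: clear all bits and take the oldest entry.
            for k in self.referenced:
                self.referenced[k] = False
            victim = next(iter(self.referenced))
        del self.values[victim]
        del self.referenced[victim]
```

Entries that keep getting read keep their bit set and stay in memory, which is how a working set survives while rarely touched documents are pushed out.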
Couchbase accelerates access by using "append only" persistence. This is used not only with the data, but with indexes as well. Updated information is never overwritten; instead, it is appended to the end of whatever data structure is being modified. Further, deleted space is reclaimed by compaction, an operation that can be scheduled to take place during times of low activity. Append-only storage speeds updates and allows read operations to occur while writes are taking place.
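A minimal sketch of append-only persistence with compaction follows. It keeps the "file" in a Python list so it can run anywhere; the AppendOnlyStore class is a hypothetical stand-in and says nothing about Couchbase's on-disk format.

```python
class AppendOnlyStore:
    """In-memory stand-in for an append-only data file."""

    def __init__(self):
        self.log = []    # list of (key, value) records; existing records are never edited
        self.index = {}  # key -> position of the latest record for that key

    def put(self, key, value):
        # An update appends a new record instead of overwriting the old one.
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def delete(self, key):
        # A delete is also an append: a tombstone record with value None.
        self.log.append((key, None))
        self.index.pop(key, None)

    def get(self, key):
        position = self.index.get(key)
        return None if position is None else self.log[position][1]

    def compact(self):
        """Rewrite the log keeping only the live record for each key."""
        live = sorted(self.index.items(), key=lambda item: item[1])
        self.log = [self.log[position] for _, position in live]
        self.index = {key: i for i, (key, _) in enumerate(self.log)}
```

Stale records pile up until compact() runs, which is why the operation is worth scheduling for quiet periods.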
Couchbase scaling and replication
To facilitate horizontal scaling, Couchbase uses hash sharding, which ensures that data is distributed uniformly across all nodes. The system defines 1,024 partitions (a fixed number), and once a document's key is hashed into a specific partition, that's where the document lives. In Couchbase Server, the key used for sharding is the document ID, a unique identifier automatically generated and attached to each document. Each partition is assigned to a specific node in the cluster. If nodes are added or removed, the system rebalances itself by migrating partitions from one node to another.
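The two-level mapping described here (document ID to partition, partition to node) can be sketched as follows. The CRC32 hash, the round-robin placement, the node names, and the document ID are assumptions made for illustration; a real rebalancer would try to move far fewer partitions than this naive one does.

```python
import zlib

NUM_PARTITIONS = 1024  # the fixed partition count described in the text

def partition_for(doc_id: str) -> int:
    """Hash a document ID into one of the fixed partitions (CRC32 is an assumption)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS

def assign_partitions(nodes):
    """Spread partitions round-robin across nodes; returns partition -> node."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def moved_partitions(old_map, new_map):
    """Partitions whose owner changes, i.e. the ones a rebalance must migrate."""
    return [p for p in old_map if old_map[p] != new_map[p]]

before = assign_partitions(["node-a", "node-b", "node-c"])
after = assign_partitions(["node-a", "node-b", "node-c", "node-d"])

doc_id = "user::1001"
partition = partition_for(doc_id)  # never changes for this ID
owner_before = before[partition]
owner_after = after[partition]     # may differ once the new node joins
print(partition, owner_before, owner_after, len(moved_partitions(before, after)))
```

The point of the indirection is that adding a node changes only the partition-to-node table. The document's partition stays put, so no document has to be rehashed.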
There is no single point of failure in a Couchbase system. All partition servers in a Couchbase cluster are equal, with each responsible for only that portion of the data assigned to it. Each server in a cluster runs two primary processes: a data manager and a cluster manager. The data manager handles the actual data in the partition, while the cluster manager deals primarily with intranode operations.
System resilience is enhanced by document replication. The cluster manager process coordinates the communication of replication data with remote nodes, and the data manager process shepherds whatever replica data the cluster has assigned to the local node. Naturally, replica partitions are distributed throughout the cluster so that the replica copy of a partition is never on the same physical server as the active partition.
Like the documents themselves, replicas exist on a bucket basis -- a bucket being the primary unit of containment in Couchbase. Documents are placed into buckets, and documents in one bucket are isolated from documents in other buckets from the perspective of indexing and querying operations. When you create a new bucket, you are asked to specify the number of replicas (up to three) to create for that bucket. If a server crashes, the system will detect the crash, locate the replicas of the documents that lived on the crashed system, and promote those replicas to active status. The system maintains a cluster map, which defines the topology of the cluster, and this is updated in response to the crash.
Note that this scheme relies on thick clients -- embodied in the API libraries that applications use to communicate with Couchbase -- that are in constant communication with server nodes. These thick clients will fetch the updated cluster map, then reroute requests in response to the changed topology. In addition, the thick clients participate in load-balancing requests to the database. The work done to provide load balancing is actually distributed among the smart clients.
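The fail-over sequence from the last two paragraphs (detect the crash, promote replicas, update the cluster map, let clients reroute) can be sketched like this. The data structures, node names, and placement rule are hypothetical, and the sketch assumes there are more nodes than copies of each partition.

```python
def build_cluster_map(nodes, num_partitions, num_replicas):
    """Place each partition's active copy and its replicas on distinct nodes."""
    cluster_map = {}
    for p in range(num_partitions):
        owners = [nodes[(p + i) % len(nodes)] for i in range(num_replicas + 1)]
        cluster_map[p] = {"active": owners[0], "replicas": owners[1:]}
    return cluster_map

def fail_over(cluster_map, dead_node):
    """Promote a replica wherever the dead node held the active copy."""
    for entry in cluster_map.values():
        # The dead node can no longer serve as a replica either.
        entry["replicas"] = [n for n in entry["replicas"] if n != dead_node]
        if entry["active"] == dead_node:
            if not entry["replicas"]:
                raise RuntimeError("no replica left to promote")
            entry["active"] = entry["replicas"].pop(0)
    return cluster_map

def route(cluster_map, partition):
    """What a thick client does: look up which node to contact for a partition."""
    return cluster_map[partition]["active"]

nodes = ["node-a", "node-b", "node-c"]
cluster_map = build_cluster_map(nodes, num_partitions=8, num_replicas=1)
target_before = route(cluster_map, 0)  # "node-a" holds partition 0
fail_over(cluster_map, "node-a")
target_after = route(cluster_map, 0)   # the promoted replica on "node-b"
```

Because routing is a lookup in the client's copy of the map, a client that has fetched the updated map sends its next request straight to the promoted node.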
Changes in topology are coordinated by an orchestrator, which is a server node elected to be the single arbiter of cluster configuration changes. All topology changes are sent to all nodes in the cluster; even if the orchestrator node goes down, a new node can be elected to that position and system operation can continue uninterrupted.
Couchbase supports cross-data-center replication (XDCR), which provides live replication of database contents of one Couchbase cluster to a geographically remote cluster. Note that XDCR operates simultaneously with intracluster replication (the copying of live documents to their inactive replica counterparts on other cluster members), and all systems in an XDCR arrangement invisibly synchronize with one another. However, Couchbase does not provide automatic fail-over for XDCR arrangements, relying instead on techniques such as using a load-balancing mechanism to reroute traffic at the network layer, in which case the XDCR group will have been set up in a master-master configuration.
Couchbase indexing and queries
Queries on Couchbase Server are performed via "views," Couchbase terminology for indexes. Put another way, when you create an index, you're provided with a view that serves as your mechanism for querying Couchbase data. Views are new to Couchbase 2.0, as is the incremental mapreduce engine that powers the actual creation of views. Note that queries really didn't exist prior to Couchbase Server 2.0. Until this latest release, the database was a key/value storage system that simply did not understand the concept of a multifield document.
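Conceptually, a view is a sorted list of rows emitted by a map function, and a query is a range lookup in that list. The following sketch shows that idea in pure Python. It is not Couchbase's view API, it rebuilds the whole view rather than updating it incrementally, and the documents and function names are invented.

```python
import bisect

documents = {
    "beer::1": {"type": "beer", "name": "Pale Ale", "abv": 5.2},
    "beer::2": {"type": "beer", "name": "Stout", "abv": 7.5},
    "brewery::1": {"type": "brewery", "name": "Example Brewing"},
}

def map_by_abv(doc_id, doc):
    """Map function: emit an index row only for documents of the right kind."""
    if doc.get("type") == "beer":
        yield doc["abv"], doc_id

def build_view(docs, map_fn):
    """Run the map function over every document and sort the emitted rows by key."""
    rows = [row for doc_id, doc in docs.items() for row in map_fn(doc_id, doc)]
    return sorted(rows)

def query_view(view, start_key, end_key):
    """Return document IDs whose emitted key lies in [start_key, end_key]."""
    keys = [key for key, _ in view]
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_right(keys, end_key)
    return [doc_id for _, doc_id in view[lo:hi]]

abv_view = build_view(documents, map_by_abv)
strong = query_view(abv_view, 6.0, 10.0)
print(strong)  # ['beer::2']
```

The map function decides both which documents appear in the index and what they are keyed on, which is why a key/value store with no notion of fields could not offer queries of this kind.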