Review: Stretch your NoSQL database with MarkLogic 8

Enterprise-oriented document database brings powerful indexing and flexible querying to a broad range of data types

MarkLogic is a document-oriented, distributed NoSQL database from the company of the same name. In the world of MarkLogic, a document is principally an XML file, though MarkLogic can also handle JSON documents, text files, image files, audio files and more. If you can put it in a file, you can put it in a MarkLogic database. The system's ability to ingest JSON and manipulate it with the same ease as XML is new with the latest release, MarkLogic 8.

MarkLogic describes itself as schema-less, in that two documents in the same database can be composed of completely different structures. In addition to easy manipulation of text, MarkLogic's querying system also recognizes RDF (Resource Description Framework) and geospatial data.

Designed to run on commodity hardware, a single-instance MarkLogic server needs only 512MB of RAM (though at least 2GB is recommended). Versions exist for 64-bit Windows Server 2008 and Windows Server 2012, Solaris 10, Mac OS X, and various Linux distributions, including SUSE, Red Hat and CentOS. In addition, it can be deployed to Amazon EC2.

MarkLogic offers several licensing options. The developer license is free, and its features are pretty much identical to those available through the paid license editions. The exceptions: You cannot use the database in a commercial product, there's no support and you must renew the license every six months. (For license and feature details, see the MarkLogic website.)

MarkLogic is a very flexible database, both in the types of data natively supported and in the ways that data can be indexed and queried. Not surprisingly, the price you pay for that flexibility is a good deal of complexity. For example, administrators must determine which among 30-plus indexes best suits the intended application.

Forests and stands

As stated above, MarkLogic can natively store XML, JSON, RDF, text, geospatial data and binary data. Here, “natively store” means that applications accessing the data need not perform conversion operations to query or extract the various data types.

On the metal, MarkLogic persists data in one of two forms: compressed text or binary. XML and JSON (which is transformed into XML) are stored as compressed text. RDF and geospatial data are represented in their constituent documents as XML. Text documents are stored as parentless XML text nodes. Binary data is stored as separate files (described further below).

Of course, the text data types (XML and JSON) have both structure and content, and both must be persisted in the database. MarkLogic employs a compressed tree representation for such data, thereby conserving disk space while preserving the hierarchical internals of the original documents.

The outermost container on a MarkLogic cluster is a database, which is effectively identical to the database of the RDBMS world. A single cluster can manage multiple databases. Queries and transactions are confined to databases; they cannot reach across database boundaries.

Look inside a database, and you'll find one or more forests. A forest is a collection, roughly (very roughly) analogous to an RDBMS table. Aside from serving as containers, forests enhance performance, as they can be queried in parallel. A forest is empty until you start putting documents in it. When you do that, MarkLogic creates a “stand.”

A stand is a storage structure that can reside in memory or on disk. Typically, a stand begins life in memory, and is moved to disk as it grows (i.e., as documents are added). The contents of the stand are the database's actual data (XML, text, binary) and indexes, which have been converted into a compressed binary form and stored in files (when the stand is written to disk).

MarkLogic SPARQL

MarkLogic’s query console lets you browse and manage a database via several query languages. Here, a SPARQL query (upper pane) fetches what the database knows about Weird Al Yankovic. The results (bottom pane) can be displayed in Turtle (the terse RDF triple language), JSON, or raw text.

Memory and storage

When a document is initially written to a MarkLogic cluster, it is written first into an in-memory stand. At the same time, a record of the write operation is written to a journal on disk, so the operation can be recovered in the event of a hardware failure. As more writes take place, the in-memory stand fills and must ultimately be flushed to disk. Over time, as more and more stands are written to disk, the system must search through more and more files to satisfy query operations. Naturally, this hampers performance. MarkLogic will periodically merge on-disk stands to reduce fragmentation. Because this is a CPU-intensive operation, MarkLogic allows the admin to configure its frequency.

This mechanism of writing in-memory stands to disk is more or less identical to the memory-to-disk flow of a log-structured merge tree (commonly used in key-value database systems). One of the main advantages is that the disk I/O is primarily a series of sequential write operations, which makes it particularly suitable for use on solid-state drives (SSDs).
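The write path described above can be sketched as a toy model. This is a minimal illustration in Python, not MarkLogic code: the class name `ToyForest`, the `memory_limit` setting, and the use of plain dictionaries for stands are all assumptions made for the example.

```python
class ToyForest:
    """Toy model of a forest: one in-memory stand plus on-disk stands."""

    def __init__(self, memory_limit=2):
        self.memory_limit = memory_limit  # hypothetical cap on in-memory documents
        self.journal = []                 # stands in for the on-disk journal
        self.memory_stand = {}            # uri -> document
        self.disk_stands = []             # flushed stands, oldest first

    def write(self, uri, doc):
        self.journal.append((uri, doc))   # record the write so it could be replayed
        self.memory_stand[uri] = doc
        if len(self.memory_stand) >= self.memory_limit:
            self.flush()

    def flush(self):
        """Write the full in-memory stand out as a new on-disk stand."""
        self.disk_stands.append(dict(self.memory_stand))
        self.memory_stand = {}

    def read(self, uri):
        """Check memory first, then on-disk stands from newest to oldest."""
        if uri in self.memory_stand:
            return self.memory_stand[uri]
        for stand in reversed(self.disk_stands):
            if uri in stand:
                return stand[uri]
        return None

    def merge(self):
        """Combine all on-disk stands into one; newer versions win."""
        merged = {}
        for stand in self.disk_stands:    # oldest first, so later stands overwrite
            merged.update(stand)
        self.disk_stands = [merged]


forest = ToyForest(memory_limit=2)
for i in range(5):
    forest.write(f"/doc{i}.xml", f"<n>{i}</n>")
stands_before = len(forest.disk_stands)
forest.merge()
stands_after = len(forest.disk_stands)
```

The point of the sketch is the read cost: every extra on-disk stand is one more place `read` may have to look, which is why merging pays off.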

In that vein, MarkLogic supports a capability known as “tiered storage,” which amounts to its migrating data onto the persistent storage type (hard disk or SSD) that best suits the access patterns of that data. You can specify a “fast data directory” for a forest, and point that directory to an SSD. MarkLogic will then write that forest's journals, its smaller stands, and the results of its more frequent stand merges to the SSD-resident fast data directory. More frequently updated documents tend to reside in smaller stands, so they are natural fits for SSDs, where they can be retrieved more promptly.

You can even define a Document Assignment Policy, which specifies the forest a document is placed in, based on document criteria spelled out in the policy. For example, a financial services company might keep its most recent trade documents in a forest that has been pointed to an SSD. Past trades can be stored in a forest that has been assigned to slower, spinning media. Once you've created your Document Assignment Policy, MarkLogic handles the migration of documents automatically.
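The trade example can be pictured as a simple routing rule. The Python sketch below is only an illustration of the idea: the forest names, the `assign_forest` function, and the 30-day cutoff are hypothetical and do not reflect how a policy is actually configured in MarkLogic.

```python
from datetime import date

# Hypothetical forest names; a real deployment would define its own.
FAST_FOREST = "trades-recent-ssd"
SLOW_FOREST = "trades-archive-hdd"


def assign_forest(trade_date, today, recent_days=30):
    """Pick a forest for a trade document based on its age in days."""
    age_days = (today - trade_date).days
    return FAST_FOREST if age_days <= recent_days else SLOW_FOREST


today = date(2015, 3, 1)
recent = assign_forest(date(2015, 2, 20), today)  # 9 days old
old = assign_forest(date(2014, 6, 1), today)      # months old
```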

Concurrency control

MarkLogic is fully ACID compliant and supports XA transactions via the Java Transaction API. Thus a database transaction can span multiple statements, multiple MarkLogic databases, and a mixture of MarkLogic and other XA-compliant databases. Update transactions to the database are isolated not only from each other but from themselves. That is, an update will not actually “see” the updated data until the transaction commits. (You can think of updates like queued I/O requests; they are submitted at commit time.)
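The "queued I/O" analogy can be made concrete with a small Python sketch. It is a toy model under stated assumptions (a dictionary as the committed store, a hypothetical `ToyTransaction` class), not a description of MarkLogic's internals.

```python
class ToyTransaction:
    """Updates are queued and only applied to the store at commit time."""

    def __init__(self, store):
        self.store = store
        self.pending = []  # queued (uri, doc) updates

    def update(self, uri, doc):
        self.pending.append((uri, doc))

    def read(self, uri):
        # Reads see the committed store only, never this transaction's queue.
        return self.store.get(uri)

    def commit(self):
        for uri, doc in self.pending:
            self.store[uri] = doc
        self.pending = []


store = {"/a.xml": "old"}
txn = ToyTransaction(store)
txn.update("/a.xml", "new")
seen_before_commit = txn.read("/a.xml")  # still the committed value
txn.commit()
seen_after_commit = store["/a.xml"]
```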

System integrity is maintained via read/write locks, which are granted on a first-come, first-served basis. Locks are automatically released when the I/O request completes. Naturally, deadlock detection is built into MarkLogic. If a deadlock is discovered, the transaction that has proceeded the furthest in its database requests is allowed to proceed, while other “entangled” transactions are restarted.

Cluster member roles

Each member of a MarkLogic cluster runs identical software. As a result, a cluster's health does not depend on a single “master” member. Replication of cluster data is automatic, so transactions can be satisfied even if a cluster member is lost. Those components of the transaction headed for the missing member will be redirected to the members replicating the otherwise lost data.

Nevertheless, you can assign “roles” to each cluster member. Specifically, a member can act as either a D-node or an E-node. (Actually, a member can act as both, which is required in a single-member cluster. In multimember clusters, it is recommended that each member be assigned only one role.) A cluster member acting as a D-node is a “data manager node.” A D-node handles the storage and retrieval of a subset of the cluster's data. Meanwhile, an E-node member is an “evaluator node.” E-nodes handle database queries. They federate the queries around the cluster, sending requests to the D-node members and aggregating the returned results.

MarkLogic allows the administrator to assign D-node and E-node roles, so you can tune the performance of a cluster to its anticipated work profile. In a nutshell, more E-nodes permit the cluster to support more clients, while more D-nodes permit the cluster to support more data.
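The division of labor between the two roles resembles a scatter-gather pattern, sketched here in Python. The `DNode` and `ENode` classes and the whitespace-based word matching are simplifications invented for this example.

```python
class DNode:
    """Holds a subset of the cluster's documents."""

    def __init__(self, docs):
        self.docs = docs  # uri -> text

    def search(self, word):
        return [uri for uri, text in self.docs.items() if word in text.split()]


class ENode:
    """Fans a query out to every D-node and merges the partial results."""

    def __init__(self, d_nodes):
        self.d_nodes = d_nodes

    def search(self, word):
        hits = []
        for node in self.d_nodes:
            hits.extend(node.search(word))
        return sorted(hits)


d1 = DNode({"/a.txt": "star wars", "/b.txt": "star trek"})
d2 = DNode({"/c.txt": "wars of the roses"})
results = ENode([d1, d2]).search("wars")
```

In this picture, adding D-nodes spreads the documents thinner, while adding E-nodes lets more clients run the fan-out step at once.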

A forest of indexes

The key to using MarkLogic effectively is a good understanding of its indexing capabilities, which are extensive. MarkLogic employs numerous, specialized indexes that it deftly choreographs to resolve queries and accelerate data access.

The most important member of MarkLogic's panoply of indexes is the “universal index.” This index tracks words in a document, as well as pairings of document elements (or properties) and the words contained in those elements. If a document contained <title>Star Wars</title>, the universal index would have entries for Star, Wars, title/'Star', and title/'Wars'. This index not only supports simple text searches, but also helps MarkLogic satisfy XPath queries.
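A rough way to picture those entries is an inverted index keyed on both words and element/word pairs. The Python sketch below is an approximation for illustration only: the `index_document` function and the `tag/word` key format are assumptions, and MarkLogic's real index structures are considerably more elaborate.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_document(index, uri, xml_text):
    """Add word and element/word terms for one XML document to the index."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        for word in (element.text or "").split():
            index[word].add(uri)                     # plain word term
            index[f"{element.tag}/{word}"].add(uri)  # element/word term


index = defaultdict(set)
index_document(index, "/movies/1.xml", "<movie><title>Star Wars</title></movie>")
index_document(index, "/movies/2.xml", "<movie><title>Star Trek</title></movie>")

# Documents whose <title> contains both words: intersect the two term lists.
both = index["title/Star"] & index["title/Wars"]
```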

Other available indexes:

  • The range index, which is useful for sorting data, creating efficient ORDER BY queries, and performing what might be called “document joins” (queries that involve retrieving information from multiple documents that share linking data).
  • The triple index, which allows querying of RDF triples (subject, predicate, object). MarkLogic actually supports SPARQL (SPARQL Protocol and RDF Query Language) for querying RDF data.
  • The geospatial index, with which MarkLogic handles queries for points, circles, boxes, complex polygons, and other geospatial objects. MarkLogic's geospatial capabilities enable it to integrate with products like Google Maps, Bing Maps, and others. Furthermore, MarkLogic supports both WGS84 (World Geodetic System) and raw coordinate systems, and it provides a variety of built-in geospatial functions. For example, it can determine if two regions intersect, or whether a polygon contains a region, and more.
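To illustrate the kind of lookup the triple index in the list above serves, here is a minimal Python sketch of pattern matching over (subject, predicate, object) triples. The `match` function and the sample triples are made up for the example; this is not SPARQL and not MarkLogic's implementation.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Illustrative sample data only.
triples = [
    ("Weird Al Yankovic", "occupation", "musician"),
    ("Weird Al Yankovic", "instrument", "accordion"),
    ("Yo-Yo Ma", "instrument", "cello"),
]
about_al = match(triples, subject="Weird Al Yankovic")
players = match(triples, predicate="instrument")
```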

For multiword searches, you can configure MarkLogic to enable word positions, which means it will index not only words, but their positions in documents as well. This allows MarkLogic to quickly determine word adjacency. There are many more indexing capabilities in MarkLogic. You’ll find a thorough discussion of them in Inside MarkLogic Server, a downloadable PDF.
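Why positions help with adjacency can be seen in a small positional index. This Python sketch is a generic textbook technique, offered as an assumption about the general idea rather than MarkLogic's actual format; the function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {uri: [positions]} across all documents."""
    index = defaultdict(lambda: defaultdict(list))
    for uri, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][uri].append(position)
    return index


def phrase_search(index, first, second):
    """Return URIs where `second` appears immediately after `first`."""
    hits = []
    for uri, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(uri, []))
        if any(p + 1 in following for p in positions):
            hits.append(uri)
    return sorted(hits)


docs = {
    "/a.txt": "the star wars saga",
    "/b.txt": "wars among the star systems",
}
index = build_positional_index(docs)
adjacent = phrase_search(index, "star", "wars")
```

Both documents contain both words, but only the first has them side by side, and the index answers that without rereading either document.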

Working with MarkLogic

You manipulate data in MarkLogic primarily using XQuery (and XPath), XSLT, or JavaScript. (JavaScript support is new in MarkLogic 8.) MarkLogic's built-in Web server (protected via SSL) lets you invoke server-side JavaScript, XQuery, or XSLT in much the same way you would invoke PHP code on a typical Web server. Because the code executes on the server, the arrangement is somewhat analogous to the RDBMS world's stored procedures.

You can also access a database using MarkLogic's REST API, with which you can perform all the standard CRUD operations, as well as execute queries and management operations. MarkLogic provides client libraries for Java and Node.js. Of course, because access is through a straightforward REST API, you're free to call the database with any language that provides a RESTful client library.
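As a hedged sketch of what calling the REST API from another language might involve, the Python below only builds request URLs and sends nothing. It assumes the `/v1/documents` and `/v1/search` endpoints of MarkLogic's REST Client API; the host, port, and document URI are hypothetical, and authentication is omitted, so check the official documentation before relying on any of it.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def document_url(uri):
    """Build the URL that addresses a single document by its URI."""
    return f"{BASE_URL}/v1/documents?{urlencode({'uri': uri})}"


def search_url(query):
    """Build the URL for a simple string search."""
    return f"{BASE_URL}/v1/search?{urlencode({'q': query})}"


doc = document_url("/movies/star-wars.json")
search = search_url("star wars")
```

The same document URL would be the target of the HTTP verbs that map to create, read, update, and delete.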

MarkLogic's query console, accessible via your Web browser, lets you write and execute queries in XQuery, SPARQL, JavaScript, and even SQL. (Note that MarkLogic supports SQL in a not-so-obvious fashion. Rows and columns are abstractions, based on the contents of range indexes. See MarkLogic's documentation for details on its SQL support.) The console also provides a database explorer with which you can browse database documents. For the ultimate in document browsing convenience, however, you’ll want to use MarkLogic's WebDAV interface. The WebDAV UI presents the documents in your database as files in a file system and lets you access them with drag-and-drop ease.

Because MarkLogic supports server-side JavaScript (thanks to the Google V8 JavaScript engine), JavaScript code that you submit from the console is actually executed on the server. The JavaScript API includes a number of built-in objects for simplifying the manipulation of database document entities (for example, Node and Document objects).

Of course, before you can do all this querying, you have to get data into the database. That's the job of Information Studio, a browser-based XQuery API. Information Studio employs connectors that read data from an external data source, which might be as simple as a file or as complex as another database. A connector is effectively a software module plug-in. MarkLogic provides some connectors out of the box, and you can write your own as needed. One of the supplied connectors reads files from a specified directory for importing into the database. Incoming data is processed with XSLT, transforming it into whatever structure is suitable for the target database.
