Log management: Why we need to look beyond relational databases

Kevin Hanrahan, Addamark Technologies
 

November 6, 2003 (Computerworld) The term relational database is almost superfluous these days. After all, every major commercial database product, such as Oracle, Sybase and DB2, is based on the same underlying relational model. There are many good reasons for the dominance of the relational model over older models, such as hierarchical and network models, as well as more recent models such as object databases: The theory on which relational databases are based allows them to elegantly represent complex data sets and perform flexible and arbitrary queries. Formal schema design methodologies can go a long way toward eliminating data repetition and inconsistencies by splitting information into multiple tables with narrowly focused uses.
Given the ubiquity and obvious strengths of relational databases, it's understandable that application developers reflexively gravitate toward them whenever serious data storage is required. Security application vendors are no exception; a number of them have built and released products that attempt to store millions of firewall, VPN and intrusion-detection system records, in addition to server and application logs, in one of these relational database products.
So after making substantial investments in hardware and software licenses, customers are all too often dismayed to find that their security event analysis application fails to manage the large volumes of security-related data needed for proper incident investigation or to meet regulatory compliance.
The most immediate issue in log data management is finding that the events can't be inserted into the database as fast as they are generated. A number of factors contribute to the database insertion bottleneck, including index construction, commit and rollback space, queries, data deletion, and database management and maintenance.

Index issues
Queries against relational databases perform best when they access the data via an index. The natural tendency is for developers to try to optimize performance by constructing an index to handle many, if not most, of the anticipated user queries.
In log management, however, this strategy backfires. Every inserted record must also update each index on the table, and the volume of data being inserted far exceeds the number of queries run against it.
The efficiency of an index also degrades over time, as records are added to and deleted from the underlying table. A routine system administration task is the periodic rebuilding of indices to offset this problem. In high-volume log management applications, the interval between rebuilds may be surprisingly short, often as little as one week. Not surprisingly, rebuilding the indices on a table containing many millions of records leaves that table effectively unusable for both insertion and querying for many hours.
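The insertion cost of indices can be observed on a small scale. The following is a minimal sketch, not a benchmark: it assumes Python's built-in sqlite3 module as a stand-in for a commercial database, and the events table, its columns and the load_events function are hypothetical names chosen for illustration. It loads the same synthetic firewall-style records into a table with and without indices and reports the elapsed time for each.

```python
import sqlite3
import time


def load_events(num_rows, indexed_columns):
    """Insert synthetic log rows and return (rows_stored, seconds_elapsed).

    The table layout is an assumption made for this sketch only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (ts INTEGER, src_ip TEXT, dst_ip TEXT, action TEXT)"
    )
    # Each index created here must be updated on every later insert.
    for col in indexed_columns:
        conn.execute(f"CREATE INDEX idx_{col} ON events ({col})")

    rows = [
        (
            i,
            f"10.0.{i % 256}.{(i * 7) % 256}",
            f"192.168.{(i * 13) % 256}.{i % 256}",
            "deny" if i % 3 == 0 else "allow",
        )
        for i in range(num_rows)
    ]

    start = time.perf_counter()
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    elapsed = time.perf_counter() - start

    stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.close()
    return stored, elapsed


if __name__ == "__main__":
    for columns in ([], ["ts"], ["ts", "src_ip", "dst_ip", "action"]):
        stored, seconds = load_events(50_000, columns)
        print(f"{len(columns)} indices: {stored} rows in {seconds:.3f}s")
```

Absolute timings will vary by machine and say nothing about any particular commercial product. The point of the sketch is only that the insert path does more work as indices are added.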

Commit and rollback issues
The transactional nature of relational databases is indispensable in many applications. The textbook example of a financial transfer illustrates this requirement: Deducting an amount from one account and adding the amount to a different account involves two linked operations. Ideally, both operations should succeed, but if one fails, the other must fail as well. It is unacceptable for one to occur without the other.
In the context of log management, however, no such dependencies exist. The inability to load a log record into the system does not invalidate any other log record loaded along with it. Security applications that feed into relational databases may choose to minimize the number of transactions by loading fewer, larger batches of records. This consumes a large amount of rollback space and forces an expensive recovery if a single record in the large batch fails. Loading many small batches of fewer records has a much smaller recovery overhead but generally results in lower throughput, because the increased number of transactions slows the overall process.
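The batch-size trade-off can be sketched in a few lines. This example again assumes Python's built-in sqlite3 module rather than any specific commercial database, and the log table and the load_in_batches function are hypothetical. Each batch is loaded in its own transaction, so one bad record (here, a missing message that violates a NOT NULL constraint) causes every record in its batch to be rolled back.

```python
import sqlite3


def load_in_batches(records, batch_size):
    """Load records with one transaction per batch.

    Returns (rows_committed, rows_rolled_back).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (ts INTEGER NOT NULL, message TEXT NOT NULL)")

    rolled_back = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        try:
            # Commits if the whole batch succeeds; rolls the whole
            # batch back if any single record is rejected.
            with conn:
                conn.executemany("INSERT INTO log VALUES (?, ?)", batch)
        except sqlite3.IntegrityError:
            rolled_back += len(batch)

    committed = conn.execute("SELECT COUNT(*) FROM log").fetchone()[0]
    conn.close()
    return committed, rolled_back


if __name__ == "__main__":
    records = [(i, f"event {i}") for i in range(10)]
    records[7] = (7, None)  # one malformed record

    for size in (10, 5, 1):
        committed, lost = load_in_batches(records, size)
        print(f"batch size {size}: committed {committed}, rolled back {lost}")
```

With one large batch, the single malformed record discards all ten records. With batches of one, only the bad record is lost, at the price of ten separate transactions. A real loader would also need to retry or quarantine the rejected batch, which this sketch leaves out.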
A combination of filtering the amount of data sent to the database and buying sufficiently powerful hardware to handle the insertion of the reduced data raises other issues, including the following:

Conclusion
An enterprise security program must fulfill a variety of requirements driven by regulatory, operational and legal demands, including analyzing user or customer activity, generating alerts under certain conditions or simply preserving audit trails for possible legal evidence. No matter what a company's requirements are, however, the first and foremost priority is to collect and archive all relevant log files and event records generated throughout an organization. It's the only way to achieve complete visibility into business activities.
Having developed applications on top of relational databases for many years, I have gained a healthy respect for their power and their suitability for a wide variety of applications. But the problem of security log management is a very different beast from customer relationship management and requires a different approach to tame it.
Such an approach requires that data be stored in a format optimized for event logs, one that makes it economical to store huge amounts of data while providing redundancy and scalability. Log management solutions must address all of these issues while simultaneously enabling tens of terabytes of event data to be quickly loaded and queried in a cost-effective manner.