Increasing efforts by enterprises to glean business intelligence from the massive volumes of unstructured data generated by web logs, clickstream tools, social media products and the like have led to a surge of interest in open source Hadoop technology, analysts say.
Hadoop, an Apache data management software project with roots in Google's MapReduce software framework for distributed computing, is designed to support applications that use massive amounts of unstructured and structured data.
Unlike traditional relational database management systems, Hadoop is designed to work with multiple data types and data sources. Hadoop's Distributed File System (HDFS) technology allows large application workloads to be broken up into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
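The block-and-replica idea can be pictured with a short Python sketch. It is an illustration only, not HDFS code: the function names, the tiny block size and the round-robin placement rule are all assumptions made for the example, and real HDFS placement also weighs factors such as rack location.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign every block to `replication` distinct nodes.

    Hypothetical round-robin policy, chosen for simplicity: block k goes to
    nodes k, k+1, ... (wrapping around the node list).
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Example: a 10-byte "file", 4-byte blocks, three copies across four nodes.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
layout = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
for block_id, holders in layout.items():
    print(block_id, blocks[block_id], holders)
```

Because each block lives on several machines, losing one node in this toy layout still leaves copies of every block elsewhere, and different nodes can work on different blocks at the same time.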
The technology is already used widely by some of the world's largest Web properties, such as Facebook, EBay, Amazon, Baidu and Yahoo. Observers note that Yahoo has been one of the biggest contributors to Hadoop.
Increasingly, Hadoop technology is used in banks, advertising companies, life science firms, pharmaceutical companies and by other corporate IT operations, said Stephen O'Grady, an analyst with RedMonk.
What's driving Hadoop is the desire by companies to leverage massive amounts of different kinds of data to make business decisions, O'Grady said. The technology lets companies process terabytes and even petabytes of complex data relatively effectively and at substantially lower cost than conventional relational database management systems, experts say.
"The big picture is that with Hadoop you can have even a one and two person startup being able to process the same volume of data that some of the biggest companies in the world are," he said.
Hadoop user Tynt, a Web analytics firm, provides analytics services for more than 500,000 websites. Its primary offering is a service that lets content publishers get insight into how their content is being shared. On an average day Tynt collects and analyzes close to 1 terabyte of data from hundreds of millions of web interactions on the sites that it monitors.
The company switched to Hadoop about 18 months ago when its MySQL database infrastructure began collapsing under the sheer volume of data that Tynt was collecting.
"Philosophically, Hadoop is a whole different animal," said Cameron Befus, Tynt's vice president of engineering.
Relational database technologies focus on the speed of data retrieval, complex query support and transaction reliability, integrity and consistency. "What they don't do very well is to accept new data quickly," he said.
"Hadoop reverses that. You can put data into Hadoop at ridiculously fast rates," he said. Hadoop's file structure allows companies to essentially capture and consolidate pretty much any structured and complex data type, such as web server logs, metadata, audio and video files, unstructured e-mail content, Twitter stream data and social media content, he said.
The technology therefore is ideal for companies looking to analyze massive volumes of structured and unstructured data.
Retrieving raw data from the HDFS and processing it, however, is not nearly as easy or as convenient as it is with typical database systems, because the data is not organized or structured, Befus said. "Essentially what Hadoop does is to write data out in large files. It does not care what's in the files. It just manages them and makes sure that there are multiple copies of them."
Early on, users had to write jobs in a programming language like Java in order to parse and then query raw data in Hadoop. But tools are now available that can be used to write SQL-like queries for data stored in Hadoop, Befus said.
Tynt uses a popular tool called Pig for writing queries to Hadoop. Another widely used option is Hive.
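To give a sense of what a hand-written job has to do, the sketch below walks through the map, shuffle and reduce steps in plain Python, counting hits per URL in a handful of web log lines. It runs in memory on one machine and uses no Hadoop API; the log format ("<ip> <url> <status>") and every function name are assumptions made for illustration. Tools such as Pig and Hive let users express this kind of grouping and counting in a few lines of query text instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Parse each raw line and emit a (url, 1) pair; skip lines that do not parse."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            # Assumed layout: field 0 is the client IP, field 1 is the URL.
            yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def count_hits(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases end to end on an in-memory collection of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


# Made-up sample lines, including one that cannot be parsed.
sample_log = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /about 200",
    "10.0.0.3 /home 404",
    "malformed",
]
print(count_hits(sample_log))
```

In a real cluster the map and reduce functions run in parallel on many machines, each working on the blocks stored nearby; the structure of the computation is the same.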
According to Befus, Hadoop's architecture makes it ideal for running batch processing applications involving 'big data.'
Hadoop can be used for more real-time business intelligence applications as well.
Increasingly, companies like OpenLogic have begun using another open source technology called HBase on top of Hadoop to enable fast querying of the data in HDFS. HBase is a column-oriented Hadoop data store that enables real-time access and querying of the data in Hadoop.
OpenLogic offers enterprises a service for verifying that open source code is properly attributed and is in full compliance with open source licenses.
To deliver the service, OpenLogic maintains a comprehensive database of hundreds of thousands of open source packages. The company stores metadata, version numbers and revision histories on a Hadoop cluster. The data is accessed via HBase.
Rod Cope, CTO of OpenLogic, said the company gets the best of both worlds with Hadoop. "A lot of the data we have won't fit into a RDBMS like MySQL and Oracle. So the best option out there is Hadoop," he said.
By running HBase on top of Hadoop, OpenLogic has also been able to enable real-time data access in nearly the same manner as conventional database technologies, he said.
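The kind of keyed lookup this makes possible can be pictured with a toy column-oriented table in Python. This is a sketch of the general idea only, not the HBase API: the class, its methods, the "family:qualifier" column names and the package records are all hypothetical, and nothing here reflects how OpenLogic actually lays out its data.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class ToyColumnStore:
    """In-memory stand-in for a table addressed by row key and column name."""

    def __init__(self) -> None:
        self._row_keys: List[str] = []            # kept sorted to allow range scans
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Store one cell; rows may carry different sets of columns."""
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: str, column: str) -> Optional[str]:
        """Fetch a single cell directly by key, without scanning other rows."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start: str, stop: str) -> List[Tuple[str, Dict[str, str]]]:
        """Return rows whose keys fall in [start, stop), in key order."""
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, stop)
        return [(key, dict(self._rows[key])) for key in self._row_keys[lo:hi]]


# Hypothetical package records keyed as "<package>/<version>".
store = ToyColumnStore()
store.put("libfoo/1.0", "meta:license", "MIT")
store.put("libfoo/1.1", "meta:license", "MIT")
store.put("libfoo/1.1", "history:revisions", "42")
store.put("libbar/2.0", "meta:license", "Apache-2.0")

print(store.get("libfoo/1.1", "meta:license"))
print(store.scan("libfoo/", "libfoo0"))  # every stored version of libfoo
```

Keeping row keys sorted is what lets a single lookup or a short range scan return quickly, in contrast to a batch job that reads whole files.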
There are some caveats associated with the use of Hadoop, users note.
"The biggest challenge is that this is still young technology with a lot of moving parts," Cope said. "You have to configure and install and integrate a number of components and get them working just so, and that's a non-trivial process."
The relative lack of Hadoop expertise among IT professionals has been another big problem, Befus said.
"It's hard to find anybody with any experience with Hadoop," he said. The fact that Hadoop is not quite a mature technology yet also means that companies need top notch operations staff to handle potential glitches.
Both OpenLogic and Tynt are using Cloudera Hadoop support tools.
Cloudera offers technical support, implementation help, bug fixes and patches and other handholding services for Hadoop. It also offers a Cloudera distribution of the open source technology featuring core Apache Hadoop and nine related open source tools all integrated into one package.
Jaikumar Vijayan covers data security and privacy issues, financial services security and e-voting for Computerworld. Follow Jaikumar on Twitter at @jaivijayan.