Twitter solves its data formatting challenge
IDG News Service - Eschewing popular choices such as XML, CSV and JSON, Twitter has opted to format the back-end storage of its user and systems data with a relatively unknown format pioneered by Google, called Protocol Buffers.
With the company storing 12TB of this data each day for later use, the decision of which format to use was a crucial one.
"Getting your data formats right is everything," said Twitter analytics lead Kevin Weil, during a talk at the HadoopWorld conference in New York on Tuesday.
The company is planning for the time when it will have to house "a trillion Tweets," Weil said, and it wants tools in place to analyze this information. The combination of Protocol Buffers, along with Hadoop and other associated technologies, should streamline this job, Weil said.
When stored, each short message, or "tweet," consists of 17 fields, six of which have at least one subfield, he explained. And the company will probably add more fields to these schema in the years to come.
In addition to the tweets the company's users supply, Twitter keeps internal log data on more than 80 different types of operations that occur within its systems, Weil said. Much of this log data is aggregated by Facebook's open-source technology Scribe.
The choice of a format to store all this data was a difficult one. One obvious choice is XML (Extensible Markup Language), but that protocol is "very wordy," Weil said, referring to how the name of the tag accompanies each data element.
Under XML, "one petabyte for a trillion Tweets might become 10 petabytes for a trillion Tweets," he said.
JSON (JavaScript Object Notation), though it was designed to simplify XML, is also wordy, in that it also stores the name of the key with every entry.
At the other end of the spectrum is CSV (Comma Separated Values). As the name suggests, CSV separates each data element only with a comma. While simple, it is not good for nesting data elements in subfields, Weil explained. Also, if the schema is changed, the resulting programming it would take to accommodate data in the old schema would be considerable.
A downside to all of these protocols is that, in order to get the data in and out of applications, developers have to repeatedly create data structures to encode and parse the data, work Weil considers "rote."
Protocol Buffers, used widely within Google, is an extensible protocol for serializing data, one Google claims is simpler than XML. And it can automate the process of recreating the data structures within applications.
"You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages," a Google tutorial on Protocol Buffers states. "You can even update your data structure without breaking deployed programs that are compiled against the 'old' format."
- Google I/O 2013's Coolest Products and Services
- 10 Star Trek Technologies That are Almost Here
- 19 Generations of Computer Programmers
- 25 Must-Have Technologies for SMBs
- A walking tour: 33 questions to ask about your company's security
- 15 social media scams
- The 7 elements of a successful security awareness program
- IT Certification Study Tips
- Register for this Computerworld Insider Study Tip guide and gain access to hundreds of premium content articles, cheat sheets, product reviews and more.
- The Total Cost of Email In this white paper, we'll explore the true costs of fragmented email management and uncover how to reduce those costs with a cloud-based...
- The Shape of Email The shape of email is a starting point in helping us understand the qualify of the information residing in the inboxes of organizations...
- SaaS with a Face: User Satisfaction in Cloud-Based E-mail Management with Mimecast Learn how a carefully targeted SaaS approach can add value to your email environment and potentially result in better services within a much...
-
Your Data under Siege: Protection in the Age of BYODs
Download Kaspersky Lab's new whitepaper, Your Data under Siege: Protection in the Age of BYODs, to learn about:
- How a mobile workforce stretches...
- Live Webcast
Get an Integrated Approach to Data Management - This KnowledgeVault Exchange is your one-stop resource center for designing a winning data management strategy with quantifiable top-line gains and bottom-line savings.
- Live Webcast
MFT and FileXpress - An Overview - Business users and applications exchange files on a regular basis. File transfer is a core part of the flow of business activity.
- Live Webcast
Bridging HTTP and FTP with FileXpress Internet Server - What if you could take an FTP server on your internal network, and allow external users (partners or customers) to securely access it...
- 3 Reasons Why Sepaton is the World's Fastest Backup Solution Leading analyst, Storage Switzerland learns how Sepaton backs up and deduplicates massive data volumes while maintaining the industry's fastest performance - all in...
- Bridging HTTP and FTP with FileXpress Internet Server What if you could take an FTP server on your internal network, and allow external users (partners or customers) to securely access it... All Data Storage White Papers | Webcasts