Really big data: The challenges of managing mountains of information

Shops that shepherd petascale amounts of data have figured out some interesting methods for getting the job done.


Splitting files into manageable chunks

Amazon.com, the e-commerce giant that has ventured into cloud services, is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its own storage needs and those of its customers. Alyssa Henry, vice president of storage services at Amazon Web Services, says that translates to about 1,500 objects for each person in the United States, or one object for every star in the Milky Way galaxy.

Some of the stored objects are fairly massive -- up to 5TB each -- and could be databases in their own right. Henry says she expects single-object sizes to reach 500TB by 2016.

She says the secret to dealing with massive data is to split objects into chunks that can be processed in parallel, an approach called parallelization.

For its S3 public-cloud storage service, Amazon uses its own custom code to split files into 1,000MB pieces. This is a common practice, but what makes Amazon's approach unique is that the file-splitting process occurs in real time.
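Amazon's splitting code is proprietary, but the general technique is straightforward: carve a file into fixed-size pieces and push them concurrently. The Python sketch below illustrates the idea; CHUNK_SIZE, upload_chunk and the commented-out send_part call are hypothetical stand-ins, not Amazon's actual API.

```python
import concurrent.futures
import os

CHUNK_SIZE = 1000 * 1024 * 1024  # 1,000MB pieces, as described above

def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield (offset, length) pairs covering the whole file."""
    total = os.path.getsize(path)
    for offset in range(0, total, chunk_size):
        yield offset, min(chunk_size, total - offset)

def upload_chunk(path, offset, length):
    """Read one chunk from disk and hand it to a (hypothetical)
    transfer call, e.g. one part of a multipart upload."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    # send_part(data) would go here in a real system
    return offset, len(data)

def parallel_upload(path, workers=8):
    # Transfer chunks concurrently instead of streaming the file serially.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_chunk, path, off, ln)
                   for off, ln in split_into_chunks(path)]
        for fut in concurrent.futures.as_completed(futures):
            offset, sent = fut.result()
            print(f"chunk at offset {offset} uploaded ({sent} bytes)")
```

Because each chunk is independent, a failed transfer can be retried on its own rather than restarting a multi-terabyte upload from scratch.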

"This always-available storage architecture is a contrast with some storage systems which move data between what are known as 'archived' and 'live' states, creating a potential delay for data retrieval," Henry explains.

Corrupt files are another challenge that storage administrators have to face when dealing with massive amounts of data. Most companies don't worry about the occasional corrupt file, but when you have 449 billion objects, even low failure rates create a storage challenge.

Amazon uses custom software that checks every piece of data for bad memory allocations, calculates checksums and tracks how quickly an error can be repaired, all while delivering the throughput its cloud storage requires.
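Henry doesn't detail that software, but the core pattern -- record a checksum at write time, recompute it on read, and repair from a healthy copy on mismatch -- can be sketched in a few lines. The names checksum and verify_object below are illustrative, and the repair step is a placeholder, not Amazon's implementation:

```python
import hashlib

def checksum(data: bytes) -> str:
    # SHA-256 here; any strong hash works for corruption detection.
    return hashlib.sha256(data).hexdigest()

def verify_object(data: bytes, expected: str) -> bool:
    """Recompute the checksum on read and compare it with the value
    recorded at write time; a mismatch flags silent corruption."""
    return checksum(data) == expected

# At write time, store the checksum alongside the object...
stored = b"example object payload"
recorded = checksum(stored)

# ...and on every read (or a periodic background scrub), verify it.
if not verify_object(stored, recorded):
    # In a replicated store, the corrupt copy would be rewritten
    # from a healthy replica at this point.
    raise IOError("corruption detected; repair from replica")
```

At the scale of hundreds of billions of objects, even a tiny per-object corruption rate produces a steady stream of these repairs, which is why the scrubbing runs continuously rather than on demand.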

Henry says Amazon's data storage requirements are destined to grow significantly as its customers keep more and more data in its S3 systems. For instance, some users of the company's cloud-based services are storing massive data sets for genome sequencing, and a customer in the U.S. is using the service to store data collected from sensors implanted on cows to track their movements and health. Henry would not predict how big the data collection might get. Facing demands like those, Amazon is prepared to add nodes quickly to scale out as needed, says Henry.

Relying on virtualization

Mazda Motor Corp., with 800 employees in the U.S., manages around 90TB of stored information.

Barry Blakeley, the infrastructure architect in Mazda's North American Operations, says employees and some 900 Mazda car dealerships are generating ever-increasing amounts of data: analytics files, marketing materials, business intelligence databases, SharePoint data and more.

"We have virtualized everything, including storage," says Blakeley. The company uses tools from Compellent, now part of Dell, for storage virtualization and Dell PowerVault NX3100 as its SAN, along with VMware systems to host the virtual servers.

Barry Blakeley, the infrastructure architect in Mazda's North American Operations, says the automaker relies heavily on a tiered virtualization model to handle its big data.

Mazda's small IT staff -- Blakeley did not want to provide an exact head count -- is often hard-pressed to do any manual migrations, especially from disk to tape. But virtualization makes the task easier.

The key, says Blakeley, is to migrate "stale" data quickly onto tape. He says 80% of Mazda's stored data becomes stale within months, meaning those blocks of data are no longer accessed at all.

To accommodate these usage patterns, the virtual storage is arranged in tiers: the first tier, fast solid-state disks connected by Fibre Channel switches, handles the 20% of the company's data that stays active. The rest moves down to a second tier of 15,000-rpm disks on a Fibre Channel system, and then to a third tier of 7,200-rpm disks connected by serial-attached SCSI.
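A placement policy along these lines can be approximated by classifying data on how long ago it was last touched. The thresholds and function below are hypothetical (Mazda's actual rules aren't public), and using st_atime assumes the filesystem records access times:

```python
import os
import time

# Hypothetical thresholds: data untouched this long moves down a tier.
TIER2_AGE = 30 * 86400   # ~1 month without access -> 15,000-rpm disk
TIER3_AGE = 90 * 86400   # ~3 months -> 7,200-rpm disk (then tape)

def pick_tier(path, now=None):
    """Classify a file by the age of its last access."""
    now = now if now is not None else time.time()
    idle = now - os.stat(path).st_atime  # last-access timestamp
    if idle < TIER2_AGE:
        return 1   # SSD tier: the hot ~20% of data
    elif idle < TIER3_AGE:
        return 2   # fast spinning disk
    return 3       # slow disk, candidate for tape migration
```

A scheduled job sweeping the storage with a classifier like this is what lets a small IT staff avoid manual migrations: the policy, not a person, decides what moves where.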

Blakeley says Mazda is putting less and less data on tape -- about 17TB today -- as it continues to virtualize storage.
