Digital Economy Creating Big Headaches for Unstructured Data Storage

The sheer volume has many organizations struggling to keep up


The cascade of information flowing into organizations is growing at an incredible pace. The data is an exploding confusion of files, files, and even more files, most of it unstructured, ranging from emails to pictures, videos, and rich media. A growing number of organizations struggle simply to store this flood. It’s a chore to back up and even more of a chore to protect. And if you’re straining just to store the data, you probably don’t know what’s in it, how to classify it, or what’s needed to protect it. This problem affects the entire organization, not just the IT department.

This scenario is especially true where some of the principal assets of an organization are digital. For example, the health care industry handles complex 2D & 3D imaging, enormous amounts of patient data, financial records, and more. Some health care organizations deal with multiple petabytes of data on the datacenter floor every month.

Of course, health care is but one sector struggling to store and protect all that unstructured data for future use. Media and entertainment companies face similar challenges. Think of what happens when a new movie is prepared for release: The studio has to encode its final product into dozens of different formats for the various smartphones and tablets on the market, in addition to traditional broadcast formats.

Manufacturers using system-generated data to improve assembly lines

Consider manufacturing, where assembly lines produce petascale telemetry datasets used to analyze and then improve the manufacturing process. One HPE customer in the auto manufacturing industry collects thousands of hours of high-definition video from safety and crash testing. They’ve moved from collecting megabytes and gigabytes to now storing petabytes of data.

Then there is the Internet of Things (IoT), which is starting to gain momentum across the economy. With the IoT, you can monitor and measure the flow of data from large numbers of sensors embedded in a network. Many organizations want the ability to generate data at the edge of their networks, whether from machines, cars, assembly lines, or any other system that produces telemetry. Sensors pick up the information, collate it, and send it back. Should you ship all that data to a central site for processing, or compute at the edge where those devices are? Pushing petabyte after petabyte of data around the internet is definitely not the way to go. It takes up too much bandwidth and is ultimately cost-prohibitive.
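To see why, consider a rough back-of-the-envelope calculation. The Python sketch below compares how long it would take to push one site’s raw daily telemetry over a WAN link versus shipping only edge-reduced summaries. Every figure here is an illustrative assumption, not a measurement.

# Rough, illustrative estimate of moving raw sensor data to a central
# datacenter versus summarizing it at the edge first.
# All numbers below are assumptions for the sake of the example.

RAW_DATA_TB_PER_DAY = 50        # assumed raw telemetry produced at one site
WAN_BANDWIDTH_GBPS = 1.0        # assumed dedicated WAN uplink
EDGE_REDUCTION_RATIO = 0.02     # assume edge aggregation keeps ~2% of raw data

def transfer_hours(terabytes: float, gbps: float) -> float:
    """Hours needed to move `terabytes` of data over a `gbps` link."""
    bits = terabytes * 8e12              # 1 TB = 8e12 bits (decimal units)
    seconds = bits / (gbps * 1e9)
    return seconds / 3600

raw_hours = transfer_hours(RAW_DATA_TB_PER_DAY, WAN_BANDWIDTH_GBPS)
edge_hours = transfer_hours(RAW_DATA_TB_PER_DAY * EDGE_REDUCTION_RATIO,
                            WAN_BANDWIDTH_GBPS)

print(f"Shipping raw data centrally:  ~{raw_hours:.1f} hours of link time per day")
print(f"Shipping edge summaries only: ~{edge_hours:.1f} hours of link time per day")

Under these assumptions, the raw data would need roughly 111 hours of link time per day, far more than a day contains, while edge-reduced summaries fit in a couple of hours.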

The data tsunami is forcing organizations to look at new IT architectures that can start small and grow as the volume of incoming data grows. Some organizations are adding software-defined technology to their architecture, allowing them to deploy servers with integrated, software-based storage that scales out as data volumes increase. This delivers hardware continuity at a lower cost.

Adopting new automation and provisioning models

There has also been a shift toward automation and provisioning in the datacenter. Organizations can no longer afford to have a lone administrator coping with the flood of data as it hits the loading dock. With automation and provisioning, that admin can provision storage when and where it’s needed. Once data hits the datacenter floor, it becomes part of the ecosystem without direct human intervention. Capacity is identified and provisioned automatically.
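What that automation loop might look like in principle is sketched below in Python. The monitoring and provisioning calls are hypothetical placeholders, not any particular platform’s API; the thresholds are assumptions to be tuned.

import random
import time

UTILIZATION_THRESHOLD = 0.80   # expand a pool once it passes 80% full (assumed)
EXPANSION_STEP_TB = 100        # assumed increment of capacity to add
POLL_INTERVAL_SECONDS = 300

def get_pool_utilization(pool: str) -> float:
    """Stand-in for a monitoring query; returns the pool's used fraction."""
    return random.uniform(0.5, 0.95)   # replace with a real metrics call

def expand_pool(pool: str, additional_tb: int) -> None:
    """Stand-in for a provisioning request to the storage platform."""
    print(f"expanding {pool} by {additional_tb} TB")

def autoprovision(pools: list[str]) -> None:
    """Poll each pool and grow any that crosses the utilization threshold."""
    while True:
        for pool in pools:
            if get_pool_utilization(pool) >= UTILIZATION_THRESHOLD:
                expand_pool(pool, EXPANSION_STEP_TB)
        time.sleep(POLL_INTERVAL_SECONDS)

if __name__ == "__main__":
    autoprovision(["backup-pool", "media-pool"])   # hypothetical pool names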

To enhance data protection, an increasing number of organizations are moving from tape to disk-based backup systems. Scale-out architectures can start with a single system and expand to larger-capacity pools as the need arises. This is especially useful for organizations doing large volumes of backup consolidation. For those who require long-term data retention, including Tier 3, Tier 4, and archival data, incorporating an object-storage tier can provide the next horizon of scalability.
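As a simple illustration of how such a tiering policy might be expressed, here is a minimal sketch in Python. The 90-day cutoff, the backup record layout, and the tiering functions are assumptions for the example, not a specific product’s API.

from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER_DAYS = 90   # assumed cutoff; tune to your retention policy

def copy_to_object_tier(backup_id: str) -> None:
    """Placeholder: copy the backup into the object-storage tier."""
    print(f"archiving {backup_id} to object storage")

def delete_from_disk_tier(backup_id: str) -> None:
    """Placeholder: reclaim space on the disk-based backup pool."""
    print(f"removing {backup_id} from disk tier")

def tier_old_backups(backups: list[dict]) -> None:
    """Move backups older than the cutoff to the object-storage tier.

    Each backup is assumed to be a dict like
    {"id": "weekly-0012", "created": datetime, "size_bytes": int}.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=ARCHIVE_AFTER_DAYS)
    for backup in backups:
        if backup["created"] < cutoff:
            copy_to_object_tier(backup["id"])
            delete_from_disk_tier(backup["id"])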

Historic shift in storage media underway

While the volume of data explodes, we’re in the middle of a game-changing shift from disk to flash, which will help companies cope with the data inundation. More help will come from software-defined storage, which lets organizations choose a storage appliance whose costs and levels of service are predictable. With software-defined storage, businesses can deploy storage software on industry-standard, capacity-oriented server architectures, with the intelligence residing in the software. Instead of simply buying more appliances to keep up with storage needs, companies can use software to provision more capacity.

Other organizations are leaning towards disaggregated architectures where, for example, compute and memory are separated from storage. This strategy allows organizations to make sure they get full utilization of their compute, networking, and storage resources while providing the ability to scale each independently, depending on workload. There is also server-defined architecture, in which compute and storage are co-located.

This strategy allows for simple management by collapsing layers of the IT stack into one “hyperconverged” pool of resources. One example is a scale-out data platform like Hadoop, an open-source framework that co-locates compute and storage to process massive datasets for analytics. In the end, the workload and the operational capabilities of the IT staff determine the storage architecture that best fits the organization’s infrastructure.
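For readers unfamiliar with how a framework like Hadoop moves compute to the data, the classic word-count example under Hadoop Streaming gives a feel for the model: the mapper and reducer below run on the cluster nodes that already hold the HDFS blocks. The jar path and invocation details are assumptions that vary by distribution.

#!/usr/bin/env python3
# Classic word count for Hadoop Streaming, illustrating the MapReduce model:
# the map and reduce steps execute on the nodes that store the data blocks.
# Example invocation (paths and jar name are assumptions):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/text -output /data/wordcount \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

The same script can be tested locally without a cluster by piping a file through the mapper, sort, and reducer stages.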

Public cloud presents many challenges

Then, of course, there is the cloud. To be sure, moving from a CapEx model to the OpEx model offered by cloud services is appealing, but be aware of the tradeoffs involved in moving large amounts of unstructured data to the cloud. Understand how the data will be accessed and analyzed over its lifetime, along with the data-sovereignty, regulatory, and security implications for regulated industries. When it comes to unstructured data, the question becomes, “What can you confidently store in the cloud, and what can’t you?”

Some customers looking for the flexibility and automation that the cloud brings are certainly interested in being able to start small and scale from there. Others choose to architect systems that look and feel like a public cloud but are really a private cloud deployed and managed on-premises. This option provides all of the benefits of automation, provisioning, tiers of service, and an API-driven infrastructure. Many will take a hybrid approach; a number of HPE customers choose to deploy on-premises or off, depending on workload and function. Regardless, customers need to plan ahead to pick the most appropriate deployment model for their critical assets.

This is just a brief summary of the unstructured data challenge and some of the solutions organizations are turning to. I guarantee you’ll be one of the winners in the new digital economy if you increase the speed at which you can process the flood of incoming data and put a framework in place to tap it productively. Hewlett Packard Enterprise (HPE) can help, whether with studies that investigate the challenges of your current infrastructure or workshops where we dig into the issues you face.
