Every organization seeking to make sense of big data must determine which platforms and tools, in the sea of available options, will help them to meet their business goals.
Answering the following eight questions can help guide IT leaders to make the right data management choices for their organization’s future success.
1. Does your organization already have a big data platform?
If the answer is yes, running more than one rarely makes sense. As a next step, consider consolidating on a single, consistent, cross-functional architecture. This opens up further opportunities for cross-business analysis and makes the most of the scarce technical skills available for these leading-edge technologies.
2. What are the data platform drivers: storage or advanced analytics?
For organizations needing to store and process tens of terabytes of data, using an open-source distributed file system is a mature choice due to its predictable scalability over clustered hardware. Plus, it’s the base platform for many big data architectures already.
However, if you are looking to run analytics in online or real-time applications, consider hybrid architectures that combine distributed file systems with distributed database management systems, which have lower latency. Alternatively, look at large traditional relational systems to get real-time access to data that has already been through the heavy-lifting processes of a distributed file system.
3. Is low-latency, real-time application access needed?
Latency always needs to be balanced against the requirements for consistency. If ultra-low-latency access, or minimal delay, is a requirement for your business, key-value stores are hard to beat: they store each item as an arbitrary value under a key, which allows fast update and retrieval. An in-memory solution may be even faster, as it allows firms to process large datasets in near-real time.
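As a minimal sketch of why key-value access is fast, the Python snippet below models a store as a hash map from a key to an opaque byte string. The class name TinyKVStore and its put/get methods are hypothetical and for illustration only; real products add persistence, replication and eviction on top of this idea.

```python
from typing import Dict, Optional


class TinyKVStore:
    """Hypothetical in-memory key-value store; not any real product's API."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # One hash lookup on average, however many keys are stored.
        return self._data.get(key)


store = TinyKVStore()
store.put("session:42", b'{"user": "alice", "cart": [1, 2]}')
profile = store.get("session:42")   # the stored bytes, unchanged
missing = store.get("session:99")   # None, because the key was never written
```

Because the value is opaque, the store cannot answer questions such as "which sessions contain item 2" without scanning everything, which is part of the trade-off against richer query models.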
4. What are the availability and consistency requirements for the platform?
If a distributed system is needed, either because of the large scale of your system or because of a dispersed user population, you’ll need to consider the CAP theorem: a system can’t be both consistent and available when a break in the network causes it to be partitioned.
In such scenarios you’ll need to consider trading off consistency for availability. Many modern systems have developed strategies for dealing with partitioning scenarios, reducing the impact and improving the ability of systems to recover from partitions in a less disruptive way.
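The trade-off can be illustrated with a toy two-replica store. This is a sketch under simplifying assumptions, not a model of any particular product: the class TwoNodeStore, the "CP" and "AP" mode labels and the partitioned flag are all hypothetical names introduced for the example.

```python
class TwoNodeStore:
    """Toy store with two replicas, 'a' and 'b' (hypothetical example)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, node: str, key: str, value: int) -> bool:
        other = "b" if node == "a" else "a"
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: replicas stay identical, but the
                # system is unavailable for writes during the partition.
                return False
            # AP: accept locally; the replicas now disagree until repaired.
            self.replicas[node][key] = value
            return True
        # Healthy network: update both replicas together.
        self.replicas[node][key] = value
        self.replicas[other][key] = value
        return True

    def read(self, node: str, key: str):
        return self.replicas[node].get(key)


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for s in (cp, ap):
    s.write("a", "stock", 10)
    s.partitioned = True  # simulate a network break

cp_ok = cp.write("a", "stock", 9)  # rejected
ap_ok = ap.write("a", "stock", 9)  # accepted on replica 'a' only
```

In the AP case a read from replica 'b' still returns the old value, so a recovery strategy (for example, reconciling the replicas once the partition heals) is needed; that is the kind of mechanism the paragraph above refers to.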
5. How will your data be accessed by users and applications?
Many NoSQL databases require specific application programming interfaces (APIs) to access the data. You’ll therefore need to consider how visualization or other tools that need access to the data will integrate. If the tools being used with the big data platform need a SQL interface, choose a tool with mature SQL support.
NoSQL and big data platforms are evolving quickly. Businesses just starting to build custom applications on top of a big data platform may be able to build around the sometimes “raw” data access frameworks. By contrast, businesses with existing applications will need a more mature offering.
6. What is the shape of your data?
If your data is largely unstructured, or includes streaming data sources such as social media or video, look into data serialization technologies that allow capture, storage and representation of such high-velocity data.
How applications consume data should also be taken into consideration. For instance, some existing tools allow users to project different structures across the data store, giving flexibility to store data in one way and access it in another. Being flexible in how data is presented to consuming applications is a benefit, but performance may not be good enough for high-velocity data. To overcome this performance challenge, you may need to integrate with a more structured data store further downstream in your data architecture.
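A rough sketch of projecting different structures over the same stored data, using only the Python standard library, might look like the following. The sample events, the field names and the project helper are invented for illustration and do not correspond to any specific tool.

```python
import json

# Raw events stored once, as-is (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "alice", "text": "great service", "likes": 3, "ts": "2024-01-01T10:00:00"}',
    '{"user": "bob", "text": "slow delivery", "ts": "2024-01-01T10:05:00"}',
]


def project(raw_lines, fields, defaults=None):
    """Apply a structure at read time: keep only `fields`, fill in gaps."""
    defaults = defaults or {}
    for line in raw_lines:
        record = json.loads(line)  # parsed again on every read
        yield {f: record.get(f, defaults.get(f)) for f in fields}


# Two consumers see two different shapes of the same stored data.
engagement_view = list(project(RAW_EVENTS, ["user", "likes"], {"likes": 0}))
text_view = list(project(RAW_EVENTS, ["ts", "text"]))
```

Note that every read re-parses the raw records. That repeated work is the performance cost mentioned above, and it is why a more structured downstream store can help when data arrives quickly.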
7. Do you need to integrate with existing data warehouses?
If you are looking to extend your current data architecture by integrating a big data platform into an existing data warehouse, data integration tools can help. Many integration vendors that support big data platforms also have specialized support for integrating with SQL data warehouses and data marts.
8. What is the workload profile required of the solution: consistent flow or spikes?
If the platform has a spiky load profile, a Platform as a Service (PaaS) deployment might be appropriate. Or consider platform distributions that can be deployed on an Infrastructure as a Service (IaaS) cloud, as this allows you to pay for the platform only when you are processing. More consistent or predictable loads might be easier to serve with an on-premises deployment. And if workloads are mixed, consider a combined cloud and on-premises approach.
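To make the cost reasoning concrete, here is a back-of-the-envelope comparison in Python. Every number in it is an assumption chosen for illustration (the hourly node counts and both rates are invented, not real prices); the point is only the shape of the calculation.

```python
def pay_per_use_cost(hourly_nodes, rate_per_node_hour):
    """Cloud-style billing: pay only for the node-hours actually used."""
    return sum(hourly_nodes) * rate_per_node_hour


def fixed_capacity_cost(hourly_nodes, rate_per_node_hour):
    """Fixed deployment: capacity sized for the peak, paid for every hour."""
    return max(hourly_nodes) * len(hourly_nodes) * rate_per_node_hour


# Nodes needed in each of eight hours (hypothetical workloads).
spiky = [0, 0, 0, 40, 0, 0, 0, 0]
steady = [10, 10, 10, 10, 10, 10, 10, 10]

ON_DEMAND_RATE = 1.5  # assumed price per node-hour in the cloud
FIXED_RATE = 1.0      # assumed amortized price per node-hour on-premises

spiky_cloud = pay_per_use_cost(spiky, ON_DEMAND_RATE)      # 60.0
spiky_fixed = fixed_capacity_cost(spiky, FIXED_RATE)       # 320.0
steady_cloud = pay_per_use_cost(steady, ON_DEMAND_RATE)    # 120.0
steady_fixed = fixed_capacity_cost(steady, FIXED_RATE)     # 80.0
```

Under these assumed rates the spiky workload is far cheaper on pay-per-use capacity, while the steady workload favors fixed capacity. Real comparisons would also need to account for staffing, data transfer and commitment discounts.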
When selecting big data technologies, there are a lot of elements in the solution to consider. I hope the answers to these eight questions will help you to make effective data management decisions for your business.