The main reason behind the rising popularity of data science is the incredible amount of digital data that gets stored and processed daily. Usually, this abundant data is referred to as "big data" and it's no surprise that data science and big data are often paired in the same discussion and used almost synonymously. While the two are related, the existence of big data prompted the need for a more scientific approach – data science – to the consumption and analysis of this incredible wealth of data.
To grasp the greatest possibilities offered by big data and data science, cybersecurity professionals would ideally go Back to the Future to see how data insights will unfold. Lacking the time-travel expertise of that movie's Doc Brown, today’s data scientists must imagine the possibilities of how big-data analysis will inform and educate our world.
As I discussed in the first blog of this series, the application of data science techniques to cybersecurity relies on the prompt availability of massive amounts of data on which models can be built and tested to extract interesting insights.
How much data is enough?
To give you an idea of how much data needs to be processed, a medium-size network with 20,000 devices (laptops, smartphones and servers) will transmit more than 50 TB of data in a 24-hour period. That means that roughly 5 gigabits must be analyzed every second to detect cyberattacks, potential threats and malware attributed to malicious hackers! We can now understand Doc Brown’s amazement when he shouted “1.21 gigawatts!” in Back to the Future.
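The arithmetic behind that figure is easy to check. A back-of-the-envelope sketch (assuming decimal terabytes; the exact number shifts slightly if binary terabytes are meant):

```python
# Back-of-the-envelope check of the traffic figure above:
# 50 TB transmitted over a 24-hour period, expressed as a sustained bit rate.
bytes_per_day = 50 * 10**12      # 50 TB in bytes (decimal)
seconds_per_day = 24 * 60 * 60   # 86,400 seconds

bits_per_second = bytes_per_day * 8 / seconds_per_day
print(f"{bits_per_second / 10**9:.2f} Gbit/s")  # 4.63 Gbit/s sustained
```

In practice traffic is bursty rather than uniform, so peak rates a detection system must absorb will be higher still.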
While dealing with such volumes of data in real time poses difficult challenges, we should also remember that analyzing large volumes of data is necessary to create data-science models that can detect cyberattacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
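These two error types are measured from a detector's predictions against ground truth. A minimal sketch with illustrative counts (all numbers here are hypothetical):

```python
# Hypothetical confusion-matrix counts for a detector run on labeled traffic:
# tp = real threats flagged, fn = real threats missed,
# fp = benign events flagged (false alarms), tn = benign events passed.
tp, fn, fp, tn = 90, 10, 50, 9850

false_positive_rate = fp / (fp + tn)  # fraction of benign traffic flagged
false_negative_rate = fn / (fn + tp)  # fraction of real threats missed

print(f"FPR = {false_positive_rate:.4f}")  # 0.0051
print(f"FNR = {false_negative_rate:.4f}")  # 0.1000
```

Note the asymmetry in scale: because benign events vastly outnumber threats on a real network, even a tiny false positive rate can translate into thousands of alarms per day.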
The three V's of context
When discussing big data, the three big "V's" are often mentioned: Volume, Variety and Velocity. Let's see what these really mean in a cybersecurity context.
- Volume: large quantities of data are necessary to build robust models and properly test them. When is "large" large enough? The quote below by statistician Andrew Gelman from a 2005 blog entry is very relevant.
“Sample sizes are never large. If N (i.e. the sample size) is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data.”
If a data scientist is relying on machine learning to build a model, large data samples are necessary to understand and extract new features, and to properly estimate the performance of the model before deploying it in production environments. Likewise, when a given model is based on simple rules or heuristic findings, it is of paramount importance to test it on large data samples to assess its performance and its likely rate of false positives. When the data sample is "large" enough and, as I will discuss in the second point, has enough "variability," the data scientist can try different ways of categorizing the data, and unexpected properties of the data may become evident.
- Variety: in big data discussions, this term usually refers to the number of types of data available. From the point of view of data organization, this refers to structured data (i.e., data that follows a precise schema) versus unstructured data (e.g., log records or data that contains a lot of free text). The latter often doesn’t follow a precise schema and, while this poses some challenges, unstructured data often provides a richness of content that can be beneficial when building a data science model.
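The distinction is easy to see in code. A short sketch contrasting a schema-following record with a free-text log line (both records, and the regular expression used to extract fields, are purely illustrative):

```python
import json
import re

# Structured record: follows a precise schema with fixed fields and types,
# so the fields can be accessed directly after parsing.
structured = json.loads('{"src_ip": "10.0.0.5", "dst_port": 443, "bytes": 1024}')
print(structured["src_ip"])  # 10.0.0.5

# Unstructured record: a free-text firewall log line; fields must be
# extracted, e.g. with a regular expression (pattern is illustrative).
log_line = "Oct 12 14:03:22 fw01 DENY tcp 10.0.0.5 -> 203.0.113.9:22"
match = re.search(r"(\w+) tcp ([\d.]+) -> ([\d.]+):(\d+)", log_line)
action, src, dst, port = match.groups()
print(action, src, dst, port)  # DENY 10.0.0.5 203.0.113.9 22
```

The extraction step is where the extra effort (and the extra richness) of unstructured data lives: the log line carries context, like the DENY action, that a narrow schema might never have captured.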
For cybersecurity data science models, "Variability" really matters more than "Variety." Variability refers to the range of values that a given feature could take in a data set.
The importance of having data with enough variability in building cybersecurity models cannot be stressed enough, and it's often underestimated. Network deployments in organizations – businesses, government agencies and private institutions – vary greatly. Commercial network applications are used differently across organizations and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model’s performance is high. If a given machine learning model has been built properly (e.g., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data. However, if the original data set lacks variability, the chance of improper modeling (for example, misclassification of a given data sample) is higher.
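A toy example makes the risk concrete. Below, a hypothetical detection rule looks flawless on a narrow sample but generates false alarms once the sample gains variability; all data and thresholds are synthetic and illustrative:

```python
import random

random.seed(0)

# Hypothetical rule: flag any connection moving more than 1 MB as suspicious.
def model(bytes_transferred):
    return bytes_transferred > 1_000_000

# Narrow sample: office laptops only (small transfers, all benign here).
# Each entry is (bytes_transferred, is_threat).
narrow = [(random.randint(1_000, 500_000), False) for _ in range(1000)]

# Varied sample: adds backup servers that legitimately move large files.
varied = narrow + [(random.randint(5_000_000, 50_000_000), False)
                   for _ in range(200)]

def false_positive_rate(data):
    benign = [x for x, is_threat in data if not is_threat]
    return sum(model(x) for x in benign) / len(benign)

print(f"narrow sample FPR: {false_positive_rate(narrow):.3f}")  # 0.000
print(f"varied sample FPR: {false_positive_rate(varied):.3f}")  # 0.167
```

Tested only on the narrow sample, the rule looks perfect; on data that reflects the true variability of the network, one in six benign connections triggers an alarm.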
- Velocity: according to The Economist article "Data, data everywhere", the amount of digital information increases more than tenfold every five years. As I noted in the first post of this series, the analysis of large data samples is possible thanks to the nearly ubiquitous availability of low-cost compute and storage resources. If a data scientist had to analyze hundreds of millions of records and every single query to the data set required hours, building and testing models would be a cumbersome and tedious process. Being able to quickly iterate through the data, modify some parameters in a particular model and quickly assess its performance is crucial to the successful application of data science techniques to cybersecurity.
Volume, Variety, and Velocity (as well as Variability) are all essential characteristics of big data that have high relevance for applying data science to cybersecurity. More recent discussions on big data have also started to emphasize the concept of the "Value" of data.
In the next post in this series I will start to discuss how machine learning can be applied to cybersecurity and the value of your network’s data.
This article is published as part of the IDG Contributor Network.