Cybersecurity, data science and machine learning: Is all data equal?

To apply machine learning to cybersecurity data, it's important to understand that not all of the data used to build machine learning models has the same value.

In big-data discussions, the value of data sometimes refers to the predictive capability of a given data model and other times to the discovery of hidden insights that appear when rigorous analytical methods are applied to the data itself. From a cybersecurity point of view, I believe the value of data refers first to the "nature" of the data itself. Positive data, i.e., malicious network traffic data from malware and cyberattacks, have much more value here than positive data do in some other data science problems. To better understand this, let's start by discussing how a wealth of network traffic data can be used to build network security models through the use of machine learning techniques.

Machine learning, together with data science and big data, is gaining a lot of popularity due to its widespread use in many tech companies around the world. The applications of machine learning range from recommendation systems (e.g., Netflix, Amazon) to spam filtering by popular Web-based email providers to image and voice recognition and many other applications.

From a cybersecurity perspective, data models need to have predictive power to automatically distinguish between normal benign network traffic and abnormal, potentially malicious traffic that can be an indicator of an active cyberattack or malware infection. Machine learning can be used to build classifiers, as the goal of the models is to provide a binary response (e.g., good or bad) to the network traffic being analyzed. This is similar to the problem that spam filters need to address, since they are built to distinguish normal emails from ads, phishing, Trojan horses and other types of spam.

Classifiers: Separating normal from malicious

In order to build a classifier, large amounts of data are required. This data will be used to train a machine-learning algorithm and evaluate the classifier’s performance. Data falls into two categories: positive and negative samples. In the case of the spam classifier example, "positive" data refers to data showing the behavior that the classifier needs to be able to detect: real spam email. In the case of a network security model, "positive" data refers to traffic showing the behavior of real cyberattacks and malware infections. "Negative" data refers to "normal" data. In the case of the spam classifier, "negative" data are legitimate emails; for a network security model, it is normal network traffic data.
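As a minimal sketch of how labeled data might be divided between training and evaluation, the snippet below splits each class separately so that the scarce positive samples appear in both sets. The function name `stratified_split`, the `flow-N` sample identifiers, the 5-to-95 class ratio and the 20% test fraction are all illustrative assumptions, not details taken from any real dataset.

```python
import random


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split (sample, label) pairs into train and test sets, class by class.

    Hypothetical helper for illustration: each class is shuffled and split
    on its own, so both sets keep roughly the original class ratio.
    """
    rng = random.Random(seed)

    # Group the samples by their label (1 = positive, 0 = negative).
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    train, test = [], []
    for label, group in by_label.items():
        group = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        # Hold out at least one sample per class for evaluation.
        # Note: a class with a single sample would end up only in the test set.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend((s, label) for s in group[:n_test])
        train.extend((s, label) for s in group[n_test:])
    return train, test


# Made-up data: 5 "positive" (malicious) and 95 "negative" (benign) flows.
samples = [f"flow-{i}" for i in range(100)]
labels = [1] * 5 + [0] * 95

train, test = stratified_split(samples, labels)
train_positives = sum(label for _, label in train)
test_positives = sum(label for _, label in test)
print(len(train), len(test), train_positives, test_positives)
```

Splitting per class is one design choice among several; with very few positive samples, a purely random split could easily leave the evaluation set with no positives at all.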

From what I have discussed so far, it would seem the two problems described above are reasonably similar. We want a classifier to detect spam and only keep good emails, and we want a network security model that detects cyberattacks and malware infections without incorrectly judging benign network traffic to be harmful.

What is intrinsically different between these two problems is the ready availability of positive data. Positive data for spam emails are abundant and readily available for building a classifier. Despite the increase in news reports of cyberattacks that have affected organizations across a broad set of industries, positive data from real cyberattacks and malware infections are not easily accessible. This is particularly true for “targeted” attacks, where the attack is highly customized for a particular target.

While there are libraries of malware samples (to name just a couple of examples, Deep End Research and McAfee), hackers quickly modify their techniques, and attacks are increasingly sophisticated, making these libraries quickly obsolete. This is not surprising, as “targeted” malware is often custom-built for large-scale monetization via targeted attacks, where data is stolen or destroyed, and it is designed to remain stealthy for as long as possible. This applies to many of the data breaches reported in the last 18 months. Even for attacks that may seem similar in their goals (e.g., Target Corp, Neiman Marcus, Home Depot), the tactics of the attacks were always adapted to the particular victim, as was pointed out by The Washington Post in a recent article on the Anthem breach.

The value of the positive samples

It seems evident that positive samples used to build machine-learning models have an intrinsically high value and are of the utmost importance to ensure that the predictive power of the model generalizes well enough to identify new cyberattack and malware flavors. This condition is necessary but not sufficient, as the choice of features used to build the model also has an extremely high impact on the model's performance, as I will discuss in a future blog post. In fact, it would not make sense to try to collect extremely large amounts of positive samples before testing a given machine learning model, as feature selection and proper training techniques are also very important aspects of machine learning.


It should also be clear that, no matter how many positive samples are available, the training data for the machine-learning model will be highly unbalanced, as negative samples (e.g., benign network traffic) will always be many orders of magnitude more abundant than positive samples (e.g., cyberattacks, malware infections). The typical example presented in this context is a classification problem in which 99% of the data belongs to one class (e.g., benign traffic): one can then achieve 99% accuracy just by labeling everything as benign! This is a well-known problem and can be addressed through a proper choice of evaluation metric, proper balancing of the training dataset and the use of sophisticated sampling methods. The application of these techniques also allows you to determine if the right quantity of positive samples is available, or if more data is necessary.
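The 99% example can be worked through in a few lines. The sketch below uses made-up labels (10 malicious and 990 benign samples) and a trivial "classifier" that labels everything as benign; the helper names are hypothetical and chosen only for this illustration. It shows why accuracy alone is misleading on unbalanced data, and why a metric such as recall on the positive class is a more informative choice.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all samples that were labeled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of actual positives that were detected (0.0 if there are none)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Fraction of predicted positives that were correct (0.0 if none predicted)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels: 1 = malicious (positive), 0 = benign (negative).
y_true = [1] * 10 + [0] * 990

# A "classifier" that labels every sample as benign.
always_benign = [0] * len(y_true)

acc = accuracy(y_true, always_benign)
rec = recall(y_true, always_benign)
prec = precision(y_true, always_benign)
print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
```

The always-benign labeling scores 99% accuracy while detecting none of the ten malicious samples, which is exactly the failure that accuracy hides and recall exposes.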

The collection of positive samples is therefore one of the first and most important tasks that enables the use of machine-learning algorithms to build cybersecurity models, and this collection process can sometimes be lengthy. For example, it may be necessary to run a sample of malware in a dedicated sandbox and collect output from several different sources in order to extract the relevant features connected to that malware sample. This process could take several hours for just one sample.

. . . . . . .

This completes the discussion of the main aspects of big data and its relevance to cybersecurity, which I started in the previous post in this series, Big data sends cybersecurity back to the future, and have continued here by looking at the value of data. In the next post, I will discuss which issues should be considered in choosing the right set of features before training a given machine-learning model in the context of cybersecurity data.

Copyright © 2015 IDG Communications, Inc.
