Big data is all around us. Yet it is common to hear data scientists and researchers doing analytics say that they need more data. How is that possible, and where does this eagerness to get more data come from?
Very often, data scientists need lots of data to train sophisticated machine-learning models. The same applies when using machine-learning algorithms for cybersecurity. Lots of data is needed in order to build classifiers that identify, among many different targets, malicious behavior and malware infections. In this context, the eagerness to get vast amounts of data comes from the need to have enough positive samples — such as data from real threats and malware infections — that can be used to train machine-learning classifiers.
Is the need for large amounts of data really justified? It depends on the problem that machine learning is trying to solve. But exactly how much data is needed to train a machine-learning model should always be associated with the choice of features that are used.
Features are the set of information that’s provided to characterize a given data sample. Sometimes the number of features available is not directly under control because it comes from sophisticated data pipelines that can’t be easily modified. In other cases, it’s relatively easy to access new features from existing data samples, or properly pre-process data to build new and more interesting features. This process is sometimes known as "feature engineering."
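As a minimal sketch of what feature engineering can look like, the snippet below derives a few new features from a raw flow record. The `FlowRecord` fields and the derived feature names are hypothetical, chosen only for illustration; a real pipeline will expose different fields.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """Hypothetical raw flow record; field names are illustrative assumptions."""
    protocol: str        # e.g. "TCP" or "UDP"
    duration_s: float    # flow duration in seconds
    bytes_sent: int      # client -> server bytes
    bytes_received: int  # server -> client bytes
    packets: int         # total packets in the flow


def engineer_features(flow: FlowRecord) -> dict:
    """Derive new features from the raw fields of a single flow."""
    total_bytes = flow.bytes_sent + flow.bytes_received
    protocol = flow.protocol.upper()
    return {
        # One-hot style protocol indicators
        "is_tcp": 1.0 if protocol == "TCP" else 0.0,
        "is_udp": 1.0 if protocol == "UDP" else 0.0,
        # Ratios fall back to 0.0 for empty or zero-duration flows
        "bytes_per_packet": total_bytes / flow.packets if flow.packets else 0.0,
        "bytes_per_second": total_bytes / flow.duration_s if flow.duration_s > 0 else 0.0,
        "upload_fraction": flow.bytes_sent / total_bytes if total_bytes else 0.0,
    }


features = engineer_features(FlowRecord("tcp", 2.0, 3000, 1000, 8))
print(features)
```

None of the derived values adds information that was not already in the record; the point is only that ratios and indicators are often easier for a model to use than raw counters.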
Machine-learning books emphasize the importance of choosing the right features to train a machine-learning algorithm. This is an important consideration, because even an endless amount of training data, if paired with the wrong set of features, will not produce a reliable model.
This is especially true when feature choices for a machine-learning algorithm are applied to network traffic data to identify cybersecurity threats. For some models, knowing which protocol the traffic is using — such as TCP or UDP — could be relevant, although it might be a useless feature for other cases.
Applying natural language processing (NLP) techniques for feature extraction could be the right choice for models that involve HTTP data, such as parsing the URL field. However, it might not be relevant for models that look primarily at aggregate information about network traffic flows like client/server communications.
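To make the URL example concrete, here is a rough sketch of simple text-style feature extraction from a URL using only the Python standard library. The `url_features` function, the choice of tokens, and the use of character trigrams over the host name are assumptions made for illustration, not a recommended feature set.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def url_features(url: str, n: int = 3) -> dict:
    """Extract word-like tokens and character n-grams from a URL."""
    parts = urlsplit(url)
    # Word-like tokens from the path, split on non-alphanumeric characters
    path_tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", parts.path) if t]
    # Query parameter names only; values are kept out of this sketch
    param_names = [
        name.lower()
        for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # Character n-gram counts over the host name
    host = (parts.hostname or "").lower()
    host_ngrams = Counter(host[i:i + n] for i in range(len(host) - n + 1))
    return {
        "path_tokens": path_tokens,
        "param_names": param_names,
        "host_ngrams": host_ngrams,
        "url_length": len(url),
    }


feats = url_features("http://example.com/admin/login.php?user=a&id=1")
print(feats["path_tokens"], feats["param_names"])
```

Token and n-gram counts like these would typically be turned into a numeric vector before training, which is where the feature count can grow very quickly.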
In general, the number of features available depends on the ability to parse a given network protocol. Without a protocol parser, the amount of information that can be extracted from raw network traffic is fairly limited.
The discussion above could create the wrong impression that using an extremely large set of features would solve any machine-learning problem.
Many off-the-shelf machine-learning libraries provide easy-to-access methods to assess the importance of the different features used to train some algorithms. Such tools try to automate the process of choosing the right features, but they should not replace a careful inspection of the features being tested.
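One common way to assess feature importance is permutation importance: shuffle one feature column and measure how much accuracy drops. The sketch below is a hand-rolled, standard-library version on synthetic data with a toy nearest-centroid classifier; all function names and the data generator are assumptions for illustration, and it is not the API of any particular library.

```python
import random


def make_data(n, rng):
    """Synthetic samples: feature 0 carries the label, feature 1 is pure noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Toy nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, rng, repeats=10):
    """Mean drop in accuracy when a single feature column is shuffled."""
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / repeats)
    return importances


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(200, rng)
model = fit_centroids(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, rng)
print(importances)
```

A score near zero suggests the model barely uses that feature, but such scores can mislead when features are correlated, which is one reason manual inspection still matters.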
The quality of the features selected to solve a machine-learning problem is much more important than the number of features utilized. This important point can be seen as a very simple expression of the famous curse of dimensionality (R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, N.J., 1961).
A lot has been written about this topic, and several different definitions exist. One reasonably accurate, if slightly cryptic, statement is that as the feature dimensionality increases, the volume of the space increases so fast that the available data becomes sparse.
A different way to explain this is that as feature dimensionality increases, the distances between different samples in the feature space quickly converge to the same value.
This is fairly intuitive, because the sparsity of the data will push the different data samples to corners of the feature space that are asymptotically equidistant.
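The convergence of distances can be checked numerically. The following sketch draws random points uniformly from the unit hypercube and measures the relative contrast, (max - min) / min, over all pairwise Euclidean distances. The uniform distribution, the sample size, and the `relative_contrast` helper are assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=60, seed=0):
    """(max - min) / min over all pairwise Euclidean distances
    between points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(f"dim={dim:>4}  relative contrast={value:.2f}")
```

The contrast shrinks as the dimension grows: the nearest and farthest pairs end up almost the same distance apart, so "nearest neighbor" carries less and less information.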
Because many machine-learning algorithms rely on some definition of distance (e.g., Euclidean), they quickly lose predictive power as that distance becomes meaningless.
For a fixed amount of training data, an increasing number of features will lead to overfitting: classifiers that perform extremely well on the training dataset might have very poor predictive power on unseen data.
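A small experiment can illustrate this effect. The sketch below trains a 1-nearest-neighbor classifier on a fixed number of synthetic samples, first with a single informative feature and then with 300 additional pure-noise features. The data generator, the sample sizes, and the choice of 1-NN are all assumptions made to keep the example short.

```python
import math
import random


def make_samples(n, n_noise, rng):
    """One informative feature (class mean 0 or 3) plus n_noise noise features."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes
        informative = rng.gauss(3.0 * label, 1.0)
        X.append([informative] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)])
        y.append(label)
    return X, y


def predict_1nn(X_train, y_train, x):
    """Label of the closest training sample under Euclidean distance."""
    nearest = min(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    return y_train[nearest]


def accuracy(X_train, y_train, X, y):
    hits = sum(predict_1nn(X_train, y_train, x) == t for x, t in zip(X, y))
    return hits / len(y)


def experiment(n_noise, n_train=40, n_test=200, seed=0):
    """Return (training accuracy, test accuracy) for a given noise-feature count."""
    rng = random.Random(seed)
    X_train, y_train = make_samples(n_train, n_noise, rng)
    X_test, y_test = make_samples(n_test, n_noise, rng)
    train_acc = accuracy(X_train, y_train, X_train, y_train)
    test_acc = accuracy(X_train, y_train, X_test, y_test)
    return train_acc, test_acc


results = {n_noise: experiment(n_noise) for n_noise in (0, 300)}
for n_noise, (train_acc, test_acc) in results.items():
    print(f"noise features={n_noise:>3}  train={train_acc:.2f}  test={test_acc:.2f}")
```

Training accuracy stays perfect in both runs, because each training sample is its own nearest neighbor, while accuracy on unseen samples falls once the noise features dominate the distance.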
One possible solution in this case is to increase the size of the training dataset. But as we pointed out above for network traffic classifiers, this is sometimes impossible, or very expensive and time-consuming.
A potentially useful approach involves the proper selection of available features, identifying relationships among them, and using techniques like principal component analysis (PCA) to help reduce the feature dimensionality. But the new "reduced" feature sets run the risk of being less intuitive than the original ones.
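As a rough illustration of PCA, the sketch below finds the first principal component of a small synthetic dataset by power iteration on the covariance matrix. Power iteration is used here only because it fits in a few lines of standard-library Python; numerical libraries typically rely on a singular value decomposition instead. The data and all function names are assumptions for illustration.

```python
import math
import random


def center(X):
    """Subtract the per-feature mean from every sample."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, d = len(Xc), len(Xc[0])
    return [
        [sum(row[i] * row[j] for row in Xc) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalize."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Synthetic data: feature 1 is roughly twice feature 0, feature 2 is small noise
rng = random.Random(0)
X = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    X.append([t, 2.0 * t + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)])

Xc = center(X)
cov = covariance(Xc)
pc1 = first_component(cov)

# Project each sample onto the first component: three features become one
scores = [sum(a * b for a, b in zip(row, pc1)) for row in Xc]
score_var = sum(s * s for s in scores) / (len(scores) - 1)
explained = score_var / sum(cov[i][i] for i in range(len(cov)))
print(pc1, explained)
```

The single projected feature keeps almost all of the variance, but it is a weighted mix of the original features, which is exactly why reduced feature sets can be harder to interpret.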
As we discussed in a previous blog, the number of positive samples available is critical to successfully training a cybersecurity machine-learning model. The proper choice of features is equally important and plays a vital role in building classifiers that generalize well and work successfully on data never seen during training or cross-validation.
This article is published as part of the IDG Contributor Network.