Data mining techniques for network scan detection

  • Authors:
  • Vipin Kumar;Zhi-Li Zhang;Gyorgy J. Simon

  • Affiliations:
  • University of Minnesota;University of Minnesota;University of Minnesota

  • Venue:
  • Data mining techniques for network scan detection
  • Year:
  • 2008

Abstract

Many attacks on networks are preceded by a reconnaissance operation, more commonly referred to as a scan. Despite the vast amount of attention focused on scan detection, state-of-the-art methods suffer from a high rate of false alarms and a low rate of scan detection. In this thesis, we formalize scan detection as a data mining problem. We show how a network traffic data set can be converted into a data set that is appropriate for off-the-shelf classifiers. Our method demonstrates that data mining models can encapsulate expert knowledge to create an adaptable algorithm that substantially outperforms state-of-the-art methods for scan detection in both coverage and precision. Specifically, we show that our method is capable of very early detection (in many cases, as early as the first connection attempt on the specific port) without significantly compromising precision, and that it can distinguish P2P and backscatter traffic from scanners. Using off-the-shelf classifiers as scan detectors is very effective, but it requires a training data set whose instances are labeled with the correct class. In rapidly changing fields such as computer network traffic analysis, up-to-date labeled data sets are rarely available, primarily because having an expert manually label these large data sets is prohibitively expensive. We therefore also propose a method that labels the data set in a semi-supervised manner, with user-specified guarantees about the quality of the labeling. Finally, we propose a method for estimating the performance of the classifier (scan detector) when labeled data is unavailable.
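To make the idea of converting network traffic into classifier-ready data concrete, the following is a minimal sketch and not the thesis's actual feature set. It assumes a hypothetical `Flow` record and aggregates flows into one feature row per (source IP, destination port) pair; the field names and the three features (`n_attempts`, `n_distinct_dsts`, `failed_ratio`) are illustrative assumptions only.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow record; the fields are assumptions for illustration."""
    src_ip: str
    dst_ip: str
    dst_port: int
    established: bool  # True if the connection attempt succeeded


def build_feature_rows(flows):
    """Aggregate flows into one feature row per (src_ip, dst_port) pair."""
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow.src_ip, flow.dst_port)].append(flow)

    rows = {}
    for key, group in groups.items():
        n_attempts = len(group)
        n_failed = sum(1 for flow in group if not flow.established)
        rows[key] = {
            "n_attempts": n_attempts,
            # Number of distinct hosts contacted on this port.
            "n_distinct_dsts": len({flow.dst_ip for flow in group}),
            # Share of connection attempts that did not succeed.
            "failed_ratio": n_failed / n_attempts,
        }
    return rows


# Made-up example traffic: one source probing port 22 on several hosts,
# and one source holding ordinary connections to a single web server.
flows = [
    Flow("10.0.0.5", "192.168.1.1", 22, False),
    Flow("10.0.0.5", "192.168.1.2", 22, False),
    Flow("10.0.0.5", "192.168.1.3", 22, True),
    Flow("10.0.0.9", "192.168.1.1", 80, True),
    Flow("10.0.0.9", "192.168.1.1", 80, True),
]
rows = build_feature_rows(flows)
```

Each row in `rows` could then be paired with a label (scanner, P2P, backscatter, normal) and passed to any off-the-shelf classifier; the choice of aggregation key and features here is a guess at one plausible design, not a description of the method in the thesis.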