Exploring discrepancies in findings obtained with the KDD Cup '99 data set

  • Authors:
  • Vegard Engen; Jonathan Vincent; Keith Phalp

  • Affiliations:
  • Software Systems Research Centre, Bournemouth University, Fern Barrow, Talbot Campus, Poole, UK (all authors; corresponding author: Tel.: +44 1202 965503, e-mail: vengen@bournemouth.ac.uk)

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2011

Abstract

The KDD Cup '99 data set has been widely used for nearly a decade to evaluate intrusion detection prototypes, most of them based on machine learning techniques. The data set served well in the KDD Cup '99 competition to demonstrate that machine learning can be useful in intrusion detection systems. However, there are discrepancies among the findings reported in the literature. Furthermore, some researchers have published criticisms of the data (and of the DARPA data from which the KDD Cup '99 data was derived), questioning the validity of results obtained with it. Despite these criticisms, researchers continue to use the data due to a lack of better publicly available alternatives. It is therefore important to establish the value of the data set and of the findings from the extensive body of research based on it, a body of work largely ignored by the existing critiques. This paper reports on an empirical investigation demonstrating the impact of several methodological differences in the use of the publicly available subsets, which uncovers several underlying causes of the discrepancies in the results reported in the literature. These findings allow us to better interpret the current body of research, and inform recommendations for future use of the data set.