Internet traffic classification demystified: on the sources of the discriminative power

  • Authors:
  • Yeon-sup Lim;Hyun-chul Kim;Jiwoong Jeong;Chong-kwon Kim;Ted "Taekyoung" Kwon;Yanghee Choi

  • Affiliations:
  • University of Massachusetts, Amherst, MA;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea

  • Venue:
  • Proceedings of the 6th International COnference
  • Year:
  • 2010

Abstract

Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of why some algorithms achieve high accuracy. In pursuit of empirically grounded answers to this "why" question, which is critical to understanding and establishing a scientific ground for traffic classification research, this paper reveals three sources of discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one to two (for UDP flows) or four to five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs the best under any circumstances, and we identify the reason: the algorithm discretizes input features during classification operations. We also find that entropy-based Minimum Description Length discretization of the port and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%) and makes all of them achieve 93% accuracy on average without any algorithm-specific tuning. Our results indicate that treating the port and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of the classification algorithm used.
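
To make the discretization step concrete, the following is a minimal sketch of entropy-based Minimum Description Length discretization in the style commonly attributed to Fayyad and Irani. Whether the paper uses exactly this variant is an assumption; the function names (`mdl_cut_points`, `to_interval`) and the toy packet sizes and application labels are hypothetical and are not taken from the paper. The sketch recursively picks the cut that minimizes class entropy and keeps it only if the information gain passes the MDL acceptance test.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Index i that minimizes the weighted entropy of pairs[:i] and pairs[i:].

    Only boundaries between distinct feature values are candidates.
    Returns None when every value in the segment is identical.
    """
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        e = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    return None if best is None else best[1]


def mdl_accepts(labels, i):
    """MDL acceptance test for splitting `labels` at index i."""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    ent_s, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent_s - (len(left) * ent_l + len(right) * ent_r) / n
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return gain > (log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Return the sorted cut points chosen for one numeric feature."""
    cuts = []

    def split(segment):
        labs = [lab for _, lab in segment]
        if len(set(labs)) < 2:          # pure segment: nothing to separate
            return
        i = best_split(segment)
        if i is None or not mdl_accepts(labs, i):
            return
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)
        split(segment[:i])
        split(segment[i:])

    split(sorted(zip(values, labels)))
    return sorted(cuts)


def to_interval(value, cuts):
    """Map a raw value to the index of its discrete interval."""
    return bisect_right(cuts, value)


# Toy data (invented for illustration): first-packet sizes of two applications.
sizes = list(range(60, 70)) + list(range(100, 110))
apps = ["dns"] * 10 + ["p2p"] * 10
cuts = mdl_cut_points(sizes, apps)
intervals = [to_interval(s, cuts) for s in sizes]
```

Each raw value is replaced by an interval index, which a classifier can then treat as a nominal feature rather than a continuous number. This sketch recomputes entropies at every candidate boundary, so it is quadratic in the number of samples; it is meant to show the logic, not to be efficient.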