Decision tree network traffic classifier via adaptive hierarchical clustering for imperfect training dataset

  • Authors:
  • Ping Lin;Zhenming Lei;Luying Chen;Jie Yang;Fang Liu

  • Affiliations:
  • Key Laboratory of Information Processing and Intelligent Technology, Beijing University of Posts and Telecommunications, Beijing, China (all authors)

  • Venue:
  • WiCOM'09 Proceedings of the 5th International Conference on Wireless communications, networking and mobile computing
  • Year:
  • 2009

Abstract

Existing network traffic classifiers often assume that an ideal training dataset is available. In practice, however, the training dataset may contain a substantial number of flows labeled as 'unknown', including both flows from classes that the classifier does not model and unrecognized flows from modeled classes. Such a training dataset seriously degrades the recall rate and generalization capability of existing classifiers, which treat unknowns as just another class. In this paper, we propose a semi-supervised multivariate decision tree classification algorithm based on adaptive hierarchical clustering. Rather than using the Gini index or information gain, which rely on a perfect training dataset, we use adaptive hierarchical clustering to construct the decision tree. The clustering process can identify unknown flows that belong to modeled classes, avoiding the pitfall of existing algorithms, which treat them the same as real unknowns. After mapping each leaf cluster to a class based on its majority members and assigning decision rules based on cluster centers, we obtain a multivariate decision tree. The experimental results show that our algorithm significantly improves the recall rate of flows belonging to modeled classes compared with a decision tree classifier, with only a small impact on precision.
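The general idea in the abstract (cluster hierarchically, label each leaf by its majority members, route new flows by cluster centers) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the adaptive clustering procedure, so the sketch substitutes a deterministic recursive 2-means split and assumes a stopping rule (stop when a node holds at most one known class, or is small). The names `build`, `classify`, `two_means`, `min_size`, and the two-feature toy flows are all hypothetical.

```python
from collections import Counter
from math import dist

UNKNOWN = "unknown"

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, iters=10):
    """Deterministic 2-means split (an assumed stand-in for the paper's
    adaptive clustering). Returns a 0/1 group index per point."""
    a = max(points, key=lambda p: dist(p, points[0]))  # far from first point
    b = max(points, key=lambda p: dist(p, a))          # far from a
    centers = [a, b]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, g in zip(points, assign) if g == k]
            if members:
                centers[k] = centroid(members)
    return assign

def build(points, labels, min_size=2):
    """Recursively split until a node holds at most one known class
    (assumed stopping rule), then label the leaf by majority vote."""
    node = {"center": centroid(points), "children": [],
            "label": Counter(labels).most_common(1)[0][0]}
    known = {lab for lab in labels if lab != UNKNOWN}
    if len(known) <= 1 or len(points) <= min_size:
        return node
    assign = two_means(points)
    if len(set(assign)) < 2:  # split failed (e.g. identical points)
        return node
    for k in (0, 1):
        idx = [i for i, g in enumerate(assign) if g == k]
        node["children"].append(
            build([points[i] for i in idx], [labels[i] for i in idx], min_size))
    return node

def classify(node, x):
    """Multivariate decision rule: descend to the child with the nearest center."""
    while node["children"]:
        node = min(node["children"], key=lambda c: dist(x, c["center"]))
    return node["label"]

# Toy flows with two made-up features; some modeled-class flows are
# labeled 'unknown', and one group is genuinely unknown.
points = [(1, 1), (1, 2), (2, 1), (1.5, 1.5),
          (8, 8), (8, 9), (9, 8), (8.5, 8.5),
          (1, 20), (2, 20), (1, 19)]
labels = ["web", "web", "web", UNKNOWN,
          "p2p", "p2p", "p2p", UNKNOWN,
          UNKNOWN, UNKNOWN, UNKNOWN]
tree = build(points, labels)
print(classify(tree, (1.5, 1.5)), classify(tree, (1.5, 19.5)))
```

In this toy data, the 'unknown' flow at (1.5, 1.5) sits inside the web cluster, so the majority vote assigns it to the web leaf, whereas the isolated group near (1, 20) forms a leaf of its own that stays 'unknown'. A split criterion such as the Gini index would instead treat both kinds of 'unknown' as a single class.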