A scalable sampling scheme for clustering in network traffic analysis

Authors:
Abdun Mahmood;Christopher Leckie;Parampalli Udaya
Affiliations:
University of Melbourne, Melbourne, Australia;University of Melbourne, Melbourne, Australia;University of Melbourne, Melbourne, Australia
Venue:
Proceedings of the 2nd international conference on Scalable information systems
Year:
2007


Abstract

Sampling is a popular method for improving the scalability of analyzing massive datasets such as network traffic traces, web click traffic, and other forms of transaction data. However, in some cases, existing simple sampling strategies fail to capture the underlying distribution of the data. In particular, for network traffic, the sample is skewed by heavy traffic from flash crowds and Denial of Service (DoS) attacks. In such cases, the sample reveals little about the other, smaller traffic patterns, which may contain important information about the traffic. We propose an adaptive sampling technique that uses a buffer of frequently seen patterns and a combination of sampling steps to build a hierarchical tree of traffic clusters. We show that this sampling technique ensures that smaller and newer patterns are represented in the cluster tree while satisfying the maximum sampling rate imposed by resource constraints. This technique has two benefits: it preserves the underlying patterns of the data, and it improves efficiency by reducing the sampling of records from known patterns. Through an empirical evaluation on a benchmark dataset, we demonstrate the accuracy of our system in detecting certain types of rare attacks that are not detected by systematic sampling. We also demonstrate the efficiency of our system in terms of reducing the number of records that must be sampled to detect frequent patterns.
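
The contrast drawn in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the `hot_threshold` and `hot_stride` parameters, and the toy trace are all assumptions made for illustration, and the sketch omits both the hierarchical cluster tree and the enforcement of a maximum sampling rate. It only shows how a buffer of already-seen patterns might thin out records from a dominant pattern while still keeping rare ones, compared with systematic (every k-th record) sampling.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def systematic_sample(records: Iterable[Hashable], step: int) -> List[Hashable]:
    """Keep every `step`-th record, regardless of its content."""
    return [r for i, r in enumerate(records) if i % step == 0]


def buffered_sample(
    records: Iterable[Hashable],
    hot_threshold: int = 3,   # hypothetical: occurrences before a pattern counts as "known"
    hot_stride: int = 10,     # hypothetical: keep 1 in `hot_stride` records of a known pattern
) -> List[Hashable]:
    """Illustrative buffer-based sampling (an assumption, not the paper's method).

    Records of a pattern are always kept until the pattern has been seen
    `hot_threshold` times; after that, only one in `hot_stride` is kept.
    """
    seen = Counter()  # the "buffer" of patterns observed so far
    sample = []
    for r in records:
        seen[r] += 1
        count = seen[r]
        if count <= hot_threshold:
            sample.append(r)          # new or rare pattern: keep it
        elif (count - hot_threshold) % hot_stride == 0:
            sample.append(r)          # known pattern: keep only occasionally
    return sample


# Toy trace: 97 records of one heavy pattern and 3 records of a rare one.
trace = ["dos"] * 100
for i in (7, 33, 51):
    trace[i] = "rare"

print(Counter(systematic_sample(trace, step=10)))
print(Counter(buffered_sample(trace)))
```

In this toy trace the rare records sit at positions that systematic sampling with a step of 10 never visits, so they are missed entirely, whereas the buffered variant keeps all three while retaining only a small fraction of the heavy pattern.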