Preprocessing DNS log data for effective data mining

Authors:
Mark E. Snyder;Ravi Sundaram;Mayur Thakur
Affiliations:
Department of Computer Science, Missouri S&T, Rolla, MO;Department of Computer and Information Science, Northeastern University, Boston, MA;Google Inc., Mountain View, CA
Venue:
ICC'09 Proceedings of the 2009 IEEE international conference on Communications
Year:
2009

Abstract

The Domain Name Service (DNS) provides a critical function in directing Internet traffic. Defending DNS servers from bandwidth attacks is assisted by the ability to effectively mine DNS log data for statistical patterns. Processing DNS log data can be classified as a data-intensive problem, and as such presents challenges unique to this class of problem. When problems occur in capturing log data, or when the DNS server experiences an outage (scheduled or unscheduled), the normal pattern of traffic for that server becomes clouded. Simple linear interpolation of the holes in the data does not preserve features such as peaks in traffic (which can occur during an attack, making them of particular interest). We demonstrate a method for estimating values for missing portions of time-sensitive DNS log data. This method would be suitable for use with a variety of datasets containing time series values where certain portions are missing.
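
The abstract's remark about linear interpolation can be illustrated with a small sketch. This is not the authors' estimation method, which the abstract does not describe; it only shows the baseline they argue against. The function name `linear_fill` and the query counts are hypothetical, chosen for illustration.

```python
def linear_fill(series):
    """Fill runs of None by straight-line interpolation between known neighbours.

    Gaps that touch either end of the series are left as None, because there
    is no value on one side to interpolate from.
    """
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1          # last known index before the gap
            j = i
            while j < n and filled[j] is None:
                j += 1             # j ends at the first known index after the gap
            if start >= 0 and j < n:
                left, right = filled[start], filled[j]
                span = j - start
                for k in range(i, j):
                    filled[k] = left + (right - left) * (k - start) / span
            i = j
        else:
            i += 1
    return filled


# Hypothetical queries-per-minute counts; indices 3 to 5 hold a traffic burst.
true_counts = [100, 110, 105, 900, 950, 880, 108, 102]
# The same series as logged, with the burst falling inside an outage.
observed = [100, 110, 105, None, None, None, 108, 102]

estimate = linear_fill(observed)
print(max(true_counts), max(estimate))
```

In this made-up series the true peak is 950 queries per minute, while the interpolated values inside the gap stay between 105 and 108, so any later analysis that looks for peaks would miss the burst entirely.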