Data sanitization: improving the forensic utility of anomaly detection systems

Authors:
Gabriela F. Cretu;Angelos Stavrou;Salvatore J. Stolfo;Angelos D. Keromytis
Affiliations:
Department of Computer Science, Columbia University;Department of Computer Science, Columbia University;Department of Computer Science, Columbia University;Department of Computer Science, Columbia University
Venue:
HotDep'07 Proceedings of the 3rd workshop on Hot Topics in System Dependability
Year:
2007

Citing 8

Bagging predictors
Machine Learning

A decision-theoretic generalization of on-line learning and an application to boosting
EuroCOLT '95 Proceedings of the Second European Conference on Computational Learning Theory

"Why 6?" Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector
SP '02 Proceedings of the 2002 IEEE Symposium on Security and Privacy

Building a reactive immune system for software services
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference

Detecting targeted attacks using shadow honeypots
SSYM'05 Proceedings of the 14th conference on USENIX Security Symposium - Volume 14

Polymorphic blending attacks
USENIX-SS'06 Proceedings of the 15th conference on USENIX Security Symposium - Volume 15

Anomalous payload-based worm detection and signature generation
RAID'05 Proceedings of the 8th international conference on Recent Advances in Intrusion Detection

Anagram: a content anomaly detector resistant to mimicry attack
RAID'06 Proceedings of the 9th international conference on Recent Advances in Intrusion Detection

Cited 2

Behavior-Based Network Access Control: A Proof-of-Concept
ISC '08 Proceedings of the 11th international conference on Information Security

Adaptive Anomaly Detection via Self-calibration and Dynamic Updating
RAID '09 Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection

Abstract

Anomaly Detection (AD) sensors have become an invaluable tool for forensic analysis and intrusion detection. Unfortunately, the detection accuracy of all learning-based ADs depends heavily on the quality of the training data, which is often poor; this severely degrades their reliability as protection and forensic analysis tools. In this paper, we propose extending the training phase of an AD to include a sanitization phase that aims to improve the quality of unlabeled training data by making them as "attack-free" and "regular" as possible in the absence of absolute ground truth. Our proposed scheme is agnostic to the underlying AD, boosting its performance based solely on training-data sanitization. Our approach is to generate multiple AD models for content-based AD sensors, each trained on a small slice of the training data. These AD "micro-models" are then used to test the training data, producing alerts for each training input. We employ voting techniques to determine which of these training items are likely attacks. Our preliminary results show that sanitization increases 0-day attack detection while maintaining a low false positive rate, increasing confidence in the AD alerts. We perform an initial characterization of the performance of our system when we deploy sanitized versus unsanitized AD systems in combination with expensive host-based attack-detection systems. Finally, we provide some preliminary evidence that our system incurs only a modest initial cost, which can be amortized over time during online operation.
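
The micro-model and voting idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy n-gram set detector stands in for whatever content-based AD is actually used, slices are formed by item count, votes are unweighted, and the names `train_micro_model`, `is_alert`, `sanitize`, `max_unseen`, and `vote_threshold` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(payload: bytes, n: int = 3) -> Set[bytes]:
    """Return the distinct byte n-grams of a payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}


def train_micro_model(data_slice: Sequence[bytes], n: int = 3) -> Set[bytes]:
    """Toy micro-model (assumption): the set of n-grams seen in one slice."""
    model: Set[bytes] = set()
    for payload in data_slice:
        model |= ngrams(payload, n)
    return model


def is_alert(model: Set[bytes], payload: bytes, n: int = 3,
             max_unseen: float = 0.5) -> bool:
    """Alert when more than max_unseen of the payload's n-grams are new."""
    grams = ngrams(payload, n)
    if not grams:
        return False  # payload shorter than n: nothing to judge
    unseen = sum(1 for g in grams if g not in model)
    return unseen / len(grams) > max_unseen


def sanitize(training: Sequence[bytes], slice_size: int,
             vote_threshold: float = 0.5,
             n: int = 3) -> Tuple[List[bytes], List[bytes]]:
    """Split training data into (clean, suspect) by micro-model voting."""
    # One micro-model per small, contiguous slice of the training data.
    slices = [training[i:i + slice_size]
              for i in range(0, len(training), slice_size)]
    models = [train_micro_model(s, n) for s in slices]

    clean: List[bytes] = []
    suspect: List[bytes] = []
    for payload in training:
        # Every micro-model tests every training item and casts one vote.
        votes = sum(is_alert(m, payload, n) for m in models)
        if votes / len(models) > vote_threshold:
            suspect.append(payload)  # flagged by most micro-models
        else:
            clean.append(payload)
    return clean, suspect
```

Calling `sanitize(training, slice_size=2)` on a list of payloads returns the items to keep for training the final AD model and the items set aside as likely attacks. The sketch relies on an attack appearing in only a few slices, so that most micro-models have never seen it and vote against it; an item that recurs across most slices would pass as "regular" under this rule.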