Experimental challenges in cyber security: a story of provenance and lineage for malware

Authors:
Tudor Dumitras; Iulian Neamtiu
Affiliations:
Symantec Research Labs; University of California, Riverside
Venue:
CSET'11 Proceedings of the 4th conference on Cyber security experimentation and test
Year:
2011

Abstract

Rigorous experiments and empirical studies hold the promise of empowering researchers and practitioners to develop better approaches for cyber security. For example, understanding the provenance and lineage of polymorphic malware strains can lead to new techniques for detecting and classifying unknown attacks. Unfortunately, many challenges stand in the way: the lack of sufficient field data (e.g., malware samples and contextual information about their impact in the real world), the lack of metadata about how existing data sets were collected, the lack of ground truth, and the difficulty of developing tools and methods for rigorous data analysis. As a first step towards rigorous experimental methods, we introduce two techniques for reconstructing the phylogenetic trees and dynamic control-flow graphs of unknown binaries, inspired by research in software evolution, bioinformatics, and time series analysis. Our approach is based on the observation that the long evolution histories of open source projects provide an opportunity for creating precise models of lineage and provenance, which can also be used for detecting and clustering malware. As a second step, we present experimental methods that combine a representative corpus of malware and contextual information (gathered from end hosts rather than from network traces or honeypots) with sound data collection and analysis techniques. While our experimental methods serve a concrete purpose, namely understanding lineage and provenance, they also provide a general blueprint for addressing the threats to the validity of cyber security studies.
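
To make the idea of lineage reconstruction concrete, the following is a minimal sketch and not the technique introduced in the paper, which the abstract does not specify. It assumes a generic approach: each binary is summarized as a set of byte 4-grams, pairwise Jaccard distance stands in for evolutionary distance, and a minimum spanning tree grown from a known earliest sample serves as the inferred lineage. The function names (ngrams, jaccard_distance, infer_lineage), the feature choice, and the requirement of a known root are all assumptions made for illustration.

```python
def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of byte n-grams in data (illustrative feature choice)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two feature sets: 0.0 identical, 1.0 disjoint."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def infer_lineage(samples: dict, root: str) -> dict:
    """Grow a minimum spanning tree from root (Prim's algorithm).

    samples maps a sample name to its raw bytes; root is the name of the
    sample assumed to be the earliest version. Returns a dict mapping each
    other sample to its inferred parent.
    """
    feats = {name: ngrams(blob) for name, blob in samples.items()}
    parent = {}
    in_tree = {root}
    remaining = set(samples) - in_tree
    while remaining:
        # Attach the outside sample that is closest to any sample in the tree.
        _, p, c = min(
            (jaccard_distance(feats[p], feats[c]), p, c)
            for p in in_tree
            for c in remaining
        )
        parent[c] = p
        in_tree.add(c)
        remaining.remove(c)
    return parent
```

Calling infer_lineage with a dictionary of sample names to file contents and the name of the oldest known sample yields a child-to-parent mapping that can be read as a tree. On real malware, byte n-grams are easily defeated by packing and polymorphism, so a practical system would need more robust features; the sketch only illustrates the distance-then-tree structure of the problem.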