Supervised learning for provenance-similarity of binaries

Authors:
Sagar Chaki; Cory Cohen; Arie Gurfinkel
Affiliations:
Carnegie Mellon Software Engineering Institute, Pittsburgh, PA, USA (all authors)
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance -- two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of variation in existing binaries, particularly in malware. We propose and evaluate the applicability of classification to detect provenance-similarity. We evaluate a variety of classifiers, and different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware respectively. We present encouraging results indicating that classification is a viable approach for automated provenance-similarity detection, and as an aid for malware analysts in particular.
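
To make the idea of treating provenance-similarity as a classification problem concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the attributes or classifiers used, so everything below is an assumption for illustration. Functions are represented as hypothetical opcode sequences, the attributes of a pair are Jaccard similarities of opcode n-grams, pairs are labeled similar when they come from the same (made-up) source function, and a one-attribute threshold rule (a decision stump) stands in for the classifiers evaluated in the paper.

```python
from itertools import combinations

def ngrams(seq, n):
    """Return the set of length-n windows over an opcode sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def pair_features(ops_a, ops_b, max_n=3):
    """Attribute vector for a pair: n-gram Jaccard similarity for n = 1..max_n."""
    return [jaccard(ngrams(ops_a, n), ngrams(ops_b, n))
            for n in range(1, max_n + 1)]

def train_stump(X, y):
    """Pick the (attribute index, threshold) with the fewest training errors.

    The stump predicts "similar" when the chosen attribute is >= threshold.
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            errors = sum((row[j] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]

def predict(stump, features):
    j, t = stump
    return features[j] >= t

# Hypothetical toy corpus: (source function, compiler setting) -> opcodes.
# All names and opcode sequences are invented for this example.
corpus = {
    ("f", "O0"): ["push", "mov", "sub", "mov", "call", "add", "leave", "ret"],
    ("f", "O2"): ["push", "mov", "sub", "call", "add", "pop", "ret"],
    ("g", "O0"): ["push", "mov", "xor", "cmp", "jne", "inc", "jmp", "pop", "ret"],
    ("g", "O2"): ["xor", "cmp", "jne", "inc", "jmp", "ret"],
}

# Label every pair: similar iff both binaries come from the same source.
X, y = [], []
for (key_a, ops_a), (key_b, ops_b) in combinations(corpus.items(), 2):
    X.append(pair_features(ops_a, ops_b))
    y.append(key_a[0] == key_b[0])

stump = train_stump(X, y)
training_errors = sum(predict(stump, row) != label for row, label in zip(X, y))
print("chosen attribute index and threshold:", stump)
print("training errors:", training_errors)
```

On real binaries the pair labels would come from known build provenance, the attributes would be extracted from disassembly, and a stronger classifier evaluated on held-out pairs would replace the stump; the toy data here only shows the shape of the pipeline.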