On Improving the Accuracy and Performance of Content-Based File Type Identification

Authors:
Irfan Ahmed;Kyung-Suk Lhee;Hyunjung Shin;Manpyo Hong
Affiliations:
Digital Vaccine and Internet Immune System Lab, Graduate School of Information and Communication, Ajou University, South Korea;Digital Vaccine and Internet Immune System Lab, Graduate School of Information and Communication, Ajou University, South Korea;Department of Industrial and Information Systems Engineering, Ajou University, South Korea;Digital Vaccine and Internet Immune System Lab, Graduate School of Information and Communication, Ajou University, South Korea
Venue:
ACISP '09 Proceedings of the 14th Australasian Conference on Information Security and Privacy
Year:
2009

Abstract

File types (text, executables, JPEG images, etc.) can be identified from the file extension, the magic number, or other header information in the file. However, these indicators are easily tampered with or corrupted, so they cannot be trusted as a secure means of identifying file types. In the presence of adversaries, analyzing the file content may be a more reliable way to identify file types, but existing content-analysis approaches still need improvement in accuracy and speed. Most of them use the byte-frequency distribution as a feature for building a representative model of a file type, and apply a distance metric to compare the model with the byte-frequency distribution of the file in question. Mahalanobis distance is the most popular such metric. In this paper, we propose 1) cosine similarity as a better metric than Mahalanobis distance in terms of classification accuracy, model size, and detection speed, and 2) a new type-identification scheme that applies recursive steps to identify file types. We compare cosine similarity to Mahalanobis distance using Wei-Hen Li et al.'s single- and multi-centroid modeling techniques; cosine similarity showed 4.8% and 13.10% improvement in classification accuracy for the single- and multi-centroid models respectively. Cosine similarity also reduced the model size by about 90% and improved detection speed by 11%. Our proposed type-identification scheme showed 37.78% and 31.47% improvement over Wei-Hen Li's single- and multi-centroid modeling techniques respectively.
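
To make the comparison step concrete, the following is a minimal sketch of byte-frequency modeling with cosine similarity, under stated assumptions. It is not the authors' implementation: the function names (`byte_frequency`, `centroid`, `cosine_similarity`, `classify`) are hypothetical, the model is a plain average of training vectors (one simple reading of a single-centroid model), and the recursive identification scheme and the Mahalanobis baseline are not shown.

```python
from __future__ import annotations

import math
from collections import Counter


def byte_frequency(data: bytes) -> list[float]:
    """Return the 256-bin relative frequency of each byte value in data."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating bytes yields integers 0..255
    return [counts.get(value, 0) / total for value in range(256)]


def centroid(samples: list[bytes]) -> list[float]:
    """Average the byte-frequency vectors of the samples (assumed model form)."""
    vectors = [byte_frequency(sample) for sample in samples]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def classify(data: bytes, models: dict[str, list[float]]) -> str:
    """Return the name of the model most similar to the file's byte frequencies."""
    vector = byte_frequency(data)
    return max(models, key=lambda name: cosine_similarity(vector, models[name]))
```

A caller would build one centroid per file type from training files, for example `models = {"text": centroid(text_samples), "binary": centroid(binary_samples)}`, and then call `classify(file_bytes, models)`. Because cosine similarity is a similarity rather than a distance, the best match is the model with the largest score.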