Cloud-based malware detection for evolving data streams

Authors:
Mohammad M. Masud;Tahseen M. Al-Khateeb;Kevin W. Hamlen;Jing Gao;Latifur Khan;Jiawei Han;Bhavani Thuraisingham
Affiliations:
University of Texas at Dallas, TX;University of Texas at Dallas, TX;University of Texas at Dallas, TX;University of Illinois at Urbana-Champaign;University of Texas at Dallas, TX;University of Illinois at Urbana-Champaign;University of Texas at Dallas, TX
Venue:
ACM Transactions on Management Information Systems (TMIS)
Year:
2011

Abstract

Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. To address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, in which a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
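The multipartition, multichunk idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "v-fold partitioning" means each of the v classifiers is trained on all but one of v random partitions of the r pooled chunks, and it uses a toy nearest-centroid learner as a stand-in base classifier. The names `NearestCentroid`, `train_multipartition`, and `ensemble_predict` are hypothetical.

```python
import random
from collections import Counter


class NearestCentroid:
    """Toy base learner (an assumption; any classifier could take its place)."""

    def fit(self, rows):
        # rows: list of (feature_tuple, label) pairs
        sums, counts = {}, Counter()
        for features, label in rows:
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, x in enumerate(features):
                acc[i] += x
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def dist2(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))

        return min(self.centroids, key=lambda label: dist2(self.centroids[label]))


def train_multipartition(chunks, v, seed=0):
    """Pool r consecutive chunks, split them into v partitions, and train
    v classifiers, each on all partitions except one (assumed scheme)."""
    if v < 2:
        raise ValueError("v must be at least 2")
    rows = [row for chunk in chunks for row in chunk]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::v] for i in range(v)]
    models = []
    for held_out in range(v):
        train = [row for j, fold in enumerate(folds) if j != held_out for row in fold]
        models.append(NearestCentroid().fit(train))
    return models


def ensemble_predict(models, features):
    """Classify one instance by majority vote over the ensemble."""
    votes = Counter(model.predict(features) for model in models)
    return votes.most_common(1)[0][0]


# Usage: r = 2 labeled chunks (made-up data), v = 3 classifiers.
chunk_a = [((0.0, 0.1), "benign"), ((0.2, 0.0), "benign"),
           ((5.0, 5.1), "malicious"), ((5.2, 4.9), "malicious")]
chunk_b = [((0.1, 0.2), "benign"), ((0.0, 0.0), "benign"),
           ((4.8, 5.0), "malicious"), ((5.1, 5.2), "malicious")]
models = train_multipartition([chunk_a, chunk_b], v=3)
label = ensemble_predict(models, (4.9, 5.0))
```

In a streaming setting, the list passed to `train_multipartition` would be the r most recent labeled chunks, and the resulting classifiers would be merged into a bounded ensemble. How the ensemble is pruned as concepts drift is not specified in the abstract and is left out of this sketch.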