Cloud-based malware detection for evolving data streams

  • Authors:
  • Mohammad M. Masud; Tahseen M. Al-Khateeb; Kevin W. Hamlen; Jing Gao; Latifur Khan; Jiawei Han; Bhavani Thuraisingham

  • Affiliations:
  • University of Texas at Dallas, TX; University of Texas at Dallas, TX; University of Texas at Dallas, TX; University of Illinois at Urbana-Champaign; University of Texas at Dallas, TX; University of Illinois at Urbana-Champaign; University of Texas at Dallas, TX

  • Venue:
  • ACM Transactions on Management Information Systems (TMIS)
  • Year:
  • 2011

Abstract

Data stream classification for intrusion detection poses at least three major challenges. First, these data streams are typically infinite-length, making traditional multipass learning algorithms inapplicable. Second, they exhibit significant concept-drift as attackers react and adapt to defenses. Third, for data streams that do not have any fixed feature set, such as text streams, an additional feature extraction and selection task must be performed. If the number of candidate features is too large, then traditional feature extraction techniques fail. In order to address the first two challenges, this article proposes a multipartition, multichunk ensemble classifier in which a collection of v classifiers is trained from r consecutive data chunks using v-fold partitioning of the data, yielding an ensemble of such classifiers. This multipartition, multichunk ensemble technique significantly reduces classification error compared to existing single-partition, single-chunk ensemble approaches, wherein a single data chunk is used to train each classifier. To address the third challenge, a feature extraction and selection technique is proposed for data streams that do not have any fixed feature set. The technique's scalability is demonstrated through an implementation for the Hadoop MapReduce cloud computing architecture. Both theoretical and empirical evidence demonstrate its effectiveness over other state-of-the-art stream classification techniques on synthetic data, real botnet traffic, and malicious executables.
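The multipartition, multichunk idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `NearestCentroid` base learner, the class and function names, the `max_models` cap, and the policy of scoring models and keeping the lowest-error ones are all assumptions made for the example. Only the core step, training v classifiers from the r most recent chunks via v-fold partitioning, comes from the abstract.

```python
import random
from collections import Counter


class NearestCentroid:
    """Toy base learner (an assumption): predicts the class with the closest mean."""

    def fit(self, rows):
        sums, counts = {}, Counter()
        for x, y in rows:
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[y] += 1
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))


def error_rate(model, rows):
    """Fraction of (features, label) rows the model misclassifies."""
    if not rows:
        return 1.0
    return sum(1 for x, y in rows if model.predict(x) != y) / len(rows)


def train_partition_models(chunks, v, rng):
    """Merge the chunks, split them into v folds, and train v models.

    Model i is trained on every fold except fold i and scored on fold i.
    Requires v >= 2 so that each training set is non-empty.
    """
    data = [row for chunk in chunks for row in chunk]
    rng.shuffle(data)
    folds = [data[i::v] for i in range(v)]
    scored = []
    for i in range(v):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = NearestCentroid().fit(train)
        scored.append((error_rate(model, folds[i]), model))
    return scored


class MultiChunkEnsemble:
    """Keeps the lowest-error models seen so far (hypothetical retention policy)."""

    def __init__(self, v=3, r=2, max_models=6, seed=0):
        self.v, self.r, self.max_models = v, r, max_models
        self.rng = random.Random(seed)
        self.recent = []  # the r most recent labeled chunks
        self.models = []  # (error, model) pairs

    def update(self, chunk):
        self.recent = (self.recent + [chunk])[-self.r:]
        # Re-score existing models on the newest chunk so drifted ones rank lower.
        old = [(error_rate(model, chunk), model) for _, model in self.models]
        new = train_partition_models(self.recent, self.v, self.rng)
        self.models = sorted(old + new, key=lambda pair: pair[0])[: self.max_models]

    def predict(self, x):
        """Unweighted majority vote over the retained models."""
        votes = Counter(model.predict(x) for _, model in self.models)
        return votes.most_common(1)[0][0]
```

Each chunk here is a list of `(feature_tuple, label)` pairs. Calling `update` once per labeled chunk and `predict` on unlabeled instances mirrors the train-then-classify cycle of a stream classifier. The single-partition, single-chunk baseline mentioned in the abstract would instead train one classifier per chunk.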