Mining needle in a haystack: classifying rare classes via two-phase rule induction

Authors:
Mahesh V. Joshi;Ramesh C. Agarwal;Vipin Kumar
Affiliations:
Department of Computer Science, IBM T. J. Watson Research Center and University of Minnesota, Minneapolis;IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY;Department of Computer Science, University of Minnesota, Minneapolis, MN
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001

Abstract

Learning models to classify rarely occurring target classes is an important problem with applications in network intrusion detection, fraud detection, and deviation detection in general. In this paper, we analyze our previously proposed two-phase rule induction method in the context of learning complete and precise signatures of rare classes. The key feature of our method is that it separately conquers the objectives of achieving high recall and high precision for the given target class. The first phase of the method aims for high recall by inducing rules with high support and a reasonable level of accuracy. The second phase then tries to improve precision by learning rules that remove false positives from the collection of records covered by the first-phase rules. Existing sequential covering techniques try to achieve high precision for each individual disjunct learned. In this paper, we claim that such an approach is inadequate for rare classes because of two problems: splintered false positives and error-prone small disjuncts. Motivated by the strengths of our two-phase design, we design various synthetic data models to identify and analyze the situations in which two state-of-the-art methods, RIPPER and C4.5 rules, either fail to learn a model or learn a very poor model. In all these situations, our two-phase approach learns a model with significantly better recall and precision levels. We also present a comparison of the three methods on a challenging real-life network intrusion detection dataset. Our method is significantly better than, or comparable to, the best competitor in terms of achieving a better balance between recall and precision.
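
The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes single-condition rules of the form (attribute, value), a greedy "highest support subject to an accuracy floor" selection, and a made-up toy dataset. The names `learn_rules`, `covers`, `precision_recall`, the thresholds 0.5 and 0.9, and the `flag`/`svc` attributes are all hypothetical choices made for illustration.

```python
TARGET = "attack"  # hypothetical rare class label


def covers(rule, record):
    """A rule is a single (attribute, value) test; an assumption of this sketch."""
    attr, value = rule
    return record[attr] == value


def learn_rules(records, is_hit, min_accuracy):
    """Greedy sequential covering: repeatedly pick the rule that covers the most
    hits among rules whose accuracy meets the floor, then drop covered records."""
    remaining = list(records)
    rules = []
    while any(is_hit(r) for r in remaining):
        candidates = sorted(
            {(a, v) for r in remaining for a, v in r.items() if a != "label"}
        )
        best, best_key = None, None
        for rule in candidates:
            covered = [r for r in remaining if covers(rule, r)]
            hits = sum(is_hit(r) for r in covered)
            accuracy = hits / len(covered)
            if hits == 0 or accuracy < min_accuracy:
                continue
            key = (hits, accuracy)  # support first, accuracy as tie-break
            if best_key is None or key > best_key:
                best, best_key = rule, key
        if best is None:  # no rule meets the accuracy floor
            break
        rules.append(best)
        remaining = [r for r in remaining if not covers(best, r)]
    return rules


def precision_recall(records, predict):
    tp = sum(predict(r) and r["label"] == TARGET for r in records)
    fp = sum(predict(r) and r["label"] != TARGET for r in records)
    fn = sum(not predict(r) and r["label"] == TARGET for r in records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rec(flag, svc, label):
    return {"flag": flag, "svc": svc, "label": label}


# Toy data (invented): 3 target records, 7 non-target records.
data = (
    [rec("S0", "http", "attack")] * 3
    + [rec("S0", "ftp", "normal")] * 2
    + [rec("SF", "http", "normal")] * 5
)

# Phase 1: high-support rules for the target, with a lenient accuracy floor.
phase1 = learn_rules(data, lambda r: r["label"] == TARGET, min_accuracy=0.5)

# Phase 2: within the records covered by phase 1, learn rules that pick out
# the false positives, this time with a strict accuracy floor.
covered = [r for r in data if any(covers(rule, r) for rule in phase1)]
phase2 = learn_rules(covered, lambda r: r["label"] != TARGET, min_accuracy=0.9)


def predict_phase1(r):
    return any(covers(rule, r) for rule in phase1)


def predict_two_phase(r):
    return predict_phase1(r) and not any(covers(rule, r) for rule in phase2)


after_phase1 = precision_recall(data, predict_phase1)
after_phase2 = precision_recall(data, predict_two_phase)
print("phase 1 rules:", phase1, "precision/recall:", after_phase1)
print("phase 2 rules:", phase2, "precision/recall:", after_phase2)
```

On this toy data the first phase accepts a broad rule that captures every target record along with two false positives, and the second phase learns one rule that removes them. The rule representation and selection metric are deliberately simplistic; the abstract does not specify the ones the actual method uses.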