COG: local decomposition for rare class analysis

Authors:
Junjie Wu;Hui Xiong;Jian Chen
Affiliations:
Department of Information Systems, School of Economics and Management, Beihang University, Beijing, China;Department of Management Science and Information Systems, Rutgers Business School, Rutgers University, Newark, USA;Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing, China
Venue:
Data Mining and Knowledge Discovery
Year:
2010

Abstract

The problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because no natural way of handling imbalanced class distributions has been developed. This paper fills this void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. We then apply traditional supervised learning algorithms, such as support vector machines (SVMs), for classification. Along this line, we explore key properties of local clustering to better understand the effect of COG on rare class analysis, and we provide a systematic analysis of the time and space complexity of the COG method. Experimental results on various real-world data sets show that COG produces significantly higher prediction accuracies on rare classes than state-of-the-art methods, and that the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as case studies, we apply COG to two real-world applications: credit card fraud detection and network intrusion detection.
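The decomposition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm, how many clusters, or what rule decides that a class is "large", so the plain k-means routine, the parameter `k`, the mean-class-size threshold, and the names `kmeans` and `cog_relabel` are all assumptions made for illustration. The sketch covers only relabeling; any supervised learner (the abstract mentions SVMs) would then be trained on the sub-class labels, with its predictions mapped back to the original classes through `sub_to_class`.

```python
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def cog_relabel(X, y, k=2, large_threshold=None):
    """Split each 'large' class into up to k sub-classes by clustering within it.

    Returns (sub_labels, sub_to_class), where sub_labels[i] is a (class, cluster)
    pair and sub_to_class maps each pair back to its original class.
    """
    counts = Counter(y)
    if large_threshold is None:
        # Assumption: a class counts as 'large' when it exceeds the mean class size.
        large_threshold = sum(counts.values()) / len(counts)
    sub_labels = [None] * len(y)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == cls]
        if n > large_threshold and n >= k:
            clusters = kmeans([X[i] for i in idx], k)
        else:
            clusters = [0] * len(idx)  # small (rare) classes are left intact
        for i, c in zip(idx, clusters):
            sub_labels[i] = (cls, c)
    sub_to_class = {s: s[0] for s in set(sub_labels)}
    return sub_labels, sub_to_class

# Toy data (made up): a large "normal" class with two blobs and a rare "fraud" class.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0],
     [5.0, 5.1], [5.2, 5.0], [5.1, 5.2], [5.0, 5.0],
     [2.5, 2.4], [2.6, 2.5]]
y = ["normal"] * 8 + ["fraud"] * 2

sub_labels, sub_to_class = cog_relabel(X, y, k=2)
recovered = [sub_to_class[s] for s in sub_labels]  # back to the original classes
```

In this toy run the rare class keeps a single label, while the large class is broken into at most two sub-classes that are closer in size to the rare one. That is the balancing effect the abstract attributes to local clustering.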