Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners
IEEE Transactions on Pattern Analysis and Machine Intelligence
A training algorithm for optimal margin classifiers
COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
The nature of statistical learning theory
The nature of statistical learning theory
WebACE: a Web agent for document categorization and exploration
AGENTS '98 Proceedings of the second international conference on Autonomous agents
Machine Learning for the Detection of Oil Spills in Satellite Radar Images
Machine Learning - Special issue on applications of machine learning and the knowledge discovery process
MetaCost: a general method for making classifiers cost-sensitive
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
An introduction to support Vector Machines: and other kernel-based learning methods
An introduction to support Vector Machines: and other kernel-based learning methods
Mining needle in a haystack: classifying rare classes via two-phase rule induction
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Learning When Negative Examples Abound
ECML '97 Proceedings of the 9th European Conference on Machine Learning
AdaCost: Misclassification Cost-Sensitive Boosting
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
A decision-theoretic generalization of on-line learning and an application to boosting
EuroCOLT '95 Proceedings of the Second European Conference on Computational Learning Theory
Cost-Sensitive Learning by Cost-Proportionate Example Weighting
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
Mining with rarity: a unifying framework
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Data Mining and Knowledge Discovery Handbook
Data Mining and Knowledge Discovery Handbook
Introduction to Data Mining, (First Edition)
Introduction to Data Mining, (First Edition)
K-means clustering versus validation measures: a data distribution perspective
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Local decomposition for rare class analysis
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
SMOTE: synthetic minority over-sampling technique
Journal of Artificial Intelligence Research
The foundations of cost-sensitive learning
IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
Hellinger distance decision trees are robust and skew-insensitive
Data Mining and Knowledge Discovery
Divergence-based feature selection for separate classes
Neurocomputing
A scatter method for data and variable importance evaluation
Integrated Computer-Aided Engineering
Hi-index | 0.01 |
Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because there is no natural way developed for handling imbalanced class distributions. This paper thus fills this crucial void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional supervised learning algorithms, such as support vector machines (SVMs), for classification. Along this line, we explore key properties of local clustering for a better understanding of the effect of COG on rare class analysis. Also, we provide a systematic analysis of time and space complexity of the COG method. Indeed, the experimental results on various real-world data sets show that COG produces significantly higher prediction accuracies on rare classes than state-of-the-art methods and the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we show that COG can also improve the performances of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as two case studies, we have applied COG for two real-world applications: credit card fraud detection and network intrusion detection.