Class imbalance and the curse of minority hubs

Authors:
Nenad Tomašev;Dunja Mladenić
Venue:
Knowledge-Based Systems
Year:
2013

Abstract

Most machine learning tasks involve learning from high-dimensional data, which is often quite difficult to handle. Hubness is an aspect of the curse of dimensionality that has been shown to be highly detrimental to k-nearest neighbor methods in high-dimensional feature spaces. Hubs, points that occur very frequently as nearest neighbors of other points, emerge as centers of influence within the data and often act as semantic singularities. This paper evaluates the impact of hubness on learning under class imbalance with k-nearest neighbor methods. Our results suggest that, contrary to common belief, minority class hubs might be responsible for most misclassification in many high-dimensional datasets. The standard approaches to learning under class imbalance usually favor the instances of the minority class and are not well suited for handling such highly detrimental minority points. In our experiments, we evaluated several state-of-the-art hubness-aware kNN classifiers that are based on learning from the neighbor occurrence models calculated from the training data. The experiments included learning under severe class imbalance, class overlap and mislabeling, and the results suggest that the hubness-aware methods usually achieve promising results on the examined high-dimensional datasets. The improvements seem to be most pronounced when handling the difficult point types: borderline points, rare points and outliers. On most examined datasets, the hubness-aware approaches improve the classification precision of the minority classes and the recall of the majority class, which helps reduce the negative impact of minority hubs. We argue that it might prove beneficial to combine the extensible hubness-aware voting frameworks with the existing kNN classifiers designed for class imbalance, in order to properly handle class-imbalanced data in high-dimensional feature spaces.
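
The notion of a hub and of a detrimental neighbor occurrence can be made concrete with a small sketch. It is an illustration only, not the authors' implementation: the function names `k_occurrences` and `bad_occurrences_by_class` are hypothetical, Euclidean distance is an assumption, and a "bad" occurrence is taken here to mean that a point appears in the k-neighbor list of a point with a different label. The abstract does not spell out these definitions.

```python
import numpy as np

def k_occurrences(X, y, k):
    """Count how often each point appears in other points' k-neighbor lists.

    Returns (total, bad): total[j] is the number of k-neighbor lists that
    contain point j; bad[j] counts only those lists whose owner has a label
    different from y[j]. Both definitions are assumptions for illustration.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijd,ijd->ij", diff, diff)
    # A point is never counted as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Row i holds the indices of the k nearest neighbors of point i.
    knn = np.argsort(dist, axis=1, kind="stable")[:, :k]
    total = np.zeros(n, dtype=int)
    bad = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:
            total[j] += 1
            if y[j] != y[i]:
                bad[j] += 1
    return total, bad

def bad_occurrences_by_class(y, bad):
    """Sum the label-mismatched occurrences over the points of each class."""
    return {int(c): int(bad[y == c].sum()) for c in np.unique(y)}
```

Points with an unusually high `total` count would be the hubs. Comparing the per-class sums from `bad_occurrences_by_class` is one simple way to check whether minority or majority points induce more label mismatches in a given dataset.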