Class imbalance and the curse of minority hubs

  • Authors:
  • Nenad Tomašev;Dunja Mladenić

  • Affiliations:
  • -;-

  • Venue:
  • Knowledge-Based Systems
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Most machine learning tasks involve learning from high-dimensional data, which is often quite difficult to handle. Hubness is an aspect of the curse of dimensionality that was shown to be highly detrimental to k-nearest neighbor methods in high-dimensional feature spaces. Hubs, very frequent nearest neighbors, emerge as centers of influence within the data and often act as semantic singularities. This paper deals with evaluating the impact of hubness on learning under class imbalance with k-nearest neighbor methods. Our results suggest that, contrary to the common belief, minority class hubs might be responsible for most misclassification in many high-dimensional datasets. The standard approaches to learning under class imbalance usually clearly favor the instances of the minority class and are not well suited for handling such highly detrimental minority points. In our experiments, we have evaluated several state-of-the-art hubness-aware kNN classifiers that are based on learning from the neighbor occurrence models calculated from the training data. The experiments included learning under severe class imbalance, class overlap and mislabeling and the results suggest that the hubness-aware methods usually achieve promising results on the examined high-dimensional datasets. The improvements seem to be most pronounced when handling the difficult point types: borderline points, rare points and outliers. On most examined datasets, the hubness-aware approaches improve the classification precision of the minority classes and the recall of the majority class, which helps with reducing the negative impact of minority hubs. We argue that it might prove beneficial to combine the extensible hubness-aware voting frameworks with the existing class imbalanced kNN classifiers, in order to properly handle class imbalanced data in high-dimensional feature spaces.