CODE: a data complexity framework for imbalanced datasets

Authors:
Cheng G. Weng;Josiah Poon
Affiliations:
School of Information Technologies, University of Sydney, NSW, Australia;School of Information Technologies, University of Sydney, NSW, Australia
Venue:
PAKDD'09 Proceedings of the 13th Pacific-Asia international conference on Knowledge discovery and data mining: new frontiers in applied data mining
Year:
2009

Citing 10
Cited 0

Complexity Measures of Supervised Classification Problems

IEEE Transactions on Pattern Analysis and Machine Intelligence
Concept-Learning in the Presence of Between-Class and Within-Class Imbalances

AI '01 Proceedings of the 14th Biennial Conference of the Canadian Society on Computational Studies of Intelligence: Advances in Artificial Intelligence
Mining with rarity: a unifying framework

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Class imbalances versus small disjuncts

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
KBA: Kernel Boundary Alignment Considering Imbalanced Data Distribution

IEEE Transactions on Knowledge and Data Engineering
A Data Complexity Analysis on Imbalanced Datasets and an Alternative Imbalance Recovering Strategy

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
Improved heterogeneous distance functions

Journal of Artificial Intelligence Research
Balancing strategies and class overlapping

IDA'05 Proceedings of the 6th international conference on Advances in Intelligent Data Analysis
A new evaluation measure for imbalanced datasets

AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87

Quantified Score

Hi-index	0.00

Visualization

Abstract

Imbalanced datasets occur in many domains, such as fraud detection, cancer detection and web; and in such domains, the class of interest often concerns the rare occurring events. Thus it is important to have a good performance on these classes while maintaining a reasonable overall accuracy. Although imbalanced datasets can be difficult to learn, but in the previous researches, the skewed class distribution has been suggested to not necessarily being the one that poses problems for learning. Therefore, when the learning of the rare class becomes problematic, it does not imply that the skewed class distribution is the cause to blame, but rather that the imbalanced distribution may just be a byproduct of some other hidden intrinsic difficulties. This paper tries to shade some light on this issue of learning from imbalanced dataset. We propose to use data complexity models to profile datasets in order to make connections with imbalanced datasets; this can potentially lead to better learning approaches. We have extended from our previous work with an improved implementation of the CODE framework in order to tackle a more difficult learning challenge. Despite the increased difficulty, CODE still enables a reasonable performance on profiling the data complexity of imbalanced datasets.