A data reduction approach for resolving the imbalanced data issue in functional genomics

Authors:
Kihoon Yoon;Stephen Kwek
Affiliations:
University of Texas at San Antonio, Department of Computer Science, 78249, San Antonio, TX, USA;University of Texas at San Antonio, Department of Computer Science, 78249, San Antonio, TX, USA
Venue:
Neural Computing and Applications
Year:
2007

Abstract

Learning from imbalanced data is a frequent problem in machine learning applications. A ratio of one positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data. However, resampling tends to produce a high rate of false positive predictions. Hence, we propose preprocessing the training data by partitioning it into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, the technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our technique to the splice site prediction and protein subcellular localization problems, with significant improvements over previous predictors.
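
The abstract does not say which clustering algorithm is used or how the clusters are consumed afterwards, so the following is only a minimal sketch of the general idea, under stated assumptions: a plain k-means with deterministic farthest-first initialization stands in for the partitioning step, label 1 marks the minority class, and clusters containing no minority instance are simply discarded as a filter. The function names (`kmeans`, `imbalance_by_cluster`, `filter_clusters`) are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means (an assumed stand-in for the paper's partitioning step).

    Returns one cluster index per point. Initialization is farthest-first,
    starting from points[0], so the result is deterministic.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def imbalance_by_cluster(assign, labels):
    """Map cluster id -> (n_minority, n_majority); label 1 is the minority class."""
    counts = defaultdict(lambda: [0, 0])
    for a, y in zip(assign, labels):
        counts[a][0 if y == 1 else 1] += 1
    return {c: tuple(v) for c, v in counts.items()}


def filter_clusters(points, labels, assign):
    """Assumed filter rule: keep only clusters with at least one minority instance."""
    counts = imbalance_by_cluster(assign, labels)
    keep = {c for c, (pos, _) in counts.items() if pos > 0}
    return [(p, y) for p, y, a in zip(points, labels, assign) if a in keep]
```

Calling `kmeans` on the training vectors, inspecting `imbalance_by_cluster`, and then training a conventional classifier on the output of `filter_clusters` would be one way to use the sketch. On well-separated data, a cluster made up entirely of majority instances is dropped, so the remaining data has a much lower imbalance ratio than the original set.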