A data reduction approach for resolving the imbalanced data issue in functional genomics

  • Authors:
  • Kihoon Yoon;Stephen Kwek

  • Affiliations:
  • University of Texas at San Antonio, Department of Computer Science, 78249, San Antonio, TX, USA;University of Texas at San Antonio, Department of Computer Science, 78249, San Antonio, TX, USA

  • Venue:
  • Neural Computing and Applications
  • Year:
  • 2007

Abstract

Imbalanced data arise frequently in machine learning applications. In scientific applications, a ratio of one positive example to thousands of negative instances is common. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular remedy is to resample the training data. However, resampling tends to produce a high rate of false positive predictions. Hence, we propose preprocessing the training data by partitioning them into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, the technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our technique to splice site prediction and protein subcellular localization problems, with significant improvements over previous predictors.
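
The effect of partitioning on the class ratio can be illustrated with a small sketch. The abstract does not say which clustering algorithm or cluster count is used, so everything below is an assumption made for illustration: plain k-means with a deterministic farthest-point initialization, a hand-built two-dimensional toy data set, and hypothetical helper names (`kmeans`, `imbalance_by_cluster`). Dropping clusters that contain no minority instance is likewise only one possible reading of the "filter" role mentioned above, not the authors' stated procedure.

```python
from math import dist  # Python 3.8+


def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from every centroid chosen so far (an assumed choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels


def imbalance_by_cluster(labels, classes):
    """Map cluster index -> (number of negatives, number of positives)."""
    counts = {}
    for lab, cls in zip(labels, classes):
        neg, pos = counts.get(lab, (0, 0))
        counts[lab] = (neg + (cls == 0), pos + (cls == 1))
    return counts


# Toy data (invented): three tight blobs of negatives, with a few
# positives sitting next to one of the blobs.
offsets = [-0.2, -0.1, 0.0, 0.1, 0.2]


def blob(cx, cy):
    return [(cx + dx, cy + dy) for dx in offsets for dy in offsets]


negatives = blob(0, 0) + blob(10, 0) + blob(0, 10)   # 75 majority instances
positives = [(0.5, 10.0), (0.5, 10.1), (0.5, 9.9)]   # 3 minority instances
points = negatives + positives
classes = [0] * len(negatives) + [1] * len(positives)

labels = kmeans(points, k=3)
per_cluster = imbalance_by_cluster(labels, classes)

overall_ratio = len(negatives) / len(positives)
# Assumed filtering step: keep only clusters holding a minority instance.
kept = [i for i, lab in enumerate(labels) if per_cluster[lab][1] > 0]
kept_neg = sum(1 for i in kept if classes[i] == 0)
kept_pos = sum(1 for i in kept if classes[i] == 1)
filtered_ratio = kept_neg / kept_pos

print("overall negatives per positive:", overall_ratio)
print("per-cluster (negatives, positives):", per_cluster)
print("negatives per positive after filtering:", filtered_ratio)
```

In this toy setting, two of the three clusters contain only negatives, so a classifier trained on the remaining cluster sees a far smaller imbalance than one trained on the full set. Real data will rarely separate this cleanly, and the choice of the number of clusters is left open here.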