Supervised neural network modeling: an empirical investigation into learning from imbalanced data with labeling errors

  • Authors:
  • Taghi M. Khoshgoftaar; Jason Van Hulse; Amri Napolitano

  • Affiliations:
  • Data Mining and Machine Learning Laboratory, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL (all authors)

  • Venue:
  • IEEE Transactions on Neural Networks
  • Year:
  • 2010

Abstract

Neural network algorithms such as multilayer perceptrons (MLPs) and radial basis function networks (RBFNets) have been used to construct learners that exhibit strong predictive performance. Two data-related issues that can have a detrimental impact on supervised learning initiatives are class imbalance and labeling errors (or class noise). Imbalanced data can make it more difficult for neural network learning algorithms to distinguish between examples of the different classes, and class noise can lead to the formulation of incorrect hypotheses. Both class imbalance and labeling errors are pervasive problems encountered in a wide variety of application domains. Many studies have investigated these problems in isolation, but few have focused on their combined effects. This study presents a comprehensive empirical investigation of learning from imbalanced data with labeling errors using neural network algorithms. In particular, the first component of our study investigates the impact of class noise and class imbalance on two common neural network learning algorithms, while the second component considers the ability of data sampling (which is commonly used to address class imbalance) to improve their performance. Our results, based on over two million trained and evaluated models, show that conclusions drawn using the more commonly studied C4.5 classifier may not apply when using neural networks.
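
The sketch below is not the authors' experimental setup; it is a minimal illustration, using scikit-learn, of the kind of scenario the abstract describes: an imbalanced dataset with injected labeling errors, random undersampling of the majority class as the data-sampling step, and an MLP trained on the result. The dataset, noise rate, sampling ratio, and network size are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only: imbalance + class noise + undersampling + MLP.
# All specific values (5% minority class, 10% noise, 1:1 sampling) are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Imbalanced two-class data (assumed 5% minority class).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Inject labeling errors: flip an assumed 10% of training labels at random.
flip = rng.random(len(y_train)) < 0.10
y_noisy = np.where(flip, 1 - y_train, y_train)

# Random undersampling of the majority class to a 1:1 ratio,
# one common data-sampling strategy for class imbalance.
minority_idx = np.flatnonzero(y_noisy == 1)
majority_idx = np.flatnonzero(y_noisy == 0)
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
idx = np.concatenate([minority_idx, keep_majority])

# Train a multilayer perceptron on the sampled, noisy training data.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
mlp.fit(X_train[idx], y_noisy[idx])

# Evaluate on the clean test set with AUC, a threshold-independent
# metric often preferred for imbalanced data.
scores = mlp.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, scores))
```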