An exploration of learning when data is noisy and imbalanced

  • Authors:
  • Jason Van Hulse, Taghi M. Khoshgoftaar, Amri Napolitano

  • Affiliations:
  • Florida Atlantic University, Boca Raton, Florida, USA (all authors)
  • Corresponding author: Taghi M. Khoshgoftaar. Tel.: +1 561 297 3994; Fax: +1 561 297 2800; E-mail: taghi@cse.fau.edu

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2011

Abstract

Much of the research literature in data mining and machine learning has focused on developing classification models for various application-specific learning tasks. In contrast, the characteristics of the underlying data, and their impact on learning, have received much less attention. While it is generally understood that imbalanced, noisy and relatively small datasets make classification tasks more difficult, there has been, to our knowledge, no comprehensive examination of the impact of these important and commonly encountered dataset characteristics on the learning process. In this work, we present a comprehensive empirical analysis of learning from imbalanced, limited and noisy data. We report the performance of 11 commonly used learning algorithms and examine the effects of dataset size, class distribution, noise level and noise distribution on each learner. Based on experiments in which over one million classification models were built, we identify, using two different performance metrics, which learners are most robust to changes in each of these experimental factors. Our results show that each of these factors plays a critical role in learner performance, with some learners exhibiting much greater stability than others.
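
The abstract gives no implementation details, but the style of factorial experiment it describes (vary class distribution and label-noise level, train several learners, and score each with a threshold-independent metric) can be sketched roughly as follows. Everything in this sketch, including the synthetic data, the choice of three learners, the noise-injection helper, and the factor levels, is an illustrative assumption and not the authors' actual experimental setup.

```python
# Illustrative sketch only: NOT the paper's experimental code.
# It varies two of the abstract's factors (class distribution, noise level),
# trains a few learners, and compares them with one metric (AUC).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def inject_label_noise(y, rate, rng):
    """Flip the class labels of a random fraction `rate` of the examples."""
    y = y.copy()
    flip = rng.random(len(y)) < rate
    y[flip] = 1 - y[flip]
    return y

# Learners chosen for illustration; the paper evaluates 11 algorithms.
learners = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Factor levels below are placeholders, not the study's actual settings.
minority_fractions = [0.05, 0.20, 0.50]   # class distribution
noise_levels = [0.0, 0.10, 0.30]          # fraction of corrupted labels

for frac in minority_fractions:
    for noise in noise_levels:
        X, y = make_classification(n_samples=2000, n_features=20,
                                   weights=[1 - frac, frac], random_state=42)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=42)
        y_tr_noisy = inject_label_noise(y_tr, noise, rng)  # corrupt training labels only
        for name, clf in learners.items():
            clf.fit(X_tr, y_tr_noisy)
            auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
            print(f"minority={frac:.2f} noise={noise:.2f} {name:>12}: AUC={auc:.3f}")
```

Repeating such a grid over many datasets, learners, noise distributions and performance metrics is what drives the model count into the millions in a study of this kind.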