Software quality analysis by combining multiple projects and learners

  • Authors:
  • Taghi M. Khoshgoftaar;Pierre Rebours;Naeem Seliya

  • Affiliations:
  • Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA 33431;Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA 33431;Computer and Information Science, University of Michigan-Dearborn, Dearborn, USA 48128

  • Venue:
  • Software Quality Control
  • Year:
  • 2009

Abstract

Software quality models are often built by training data mining learners on a single fit dataset. Typically, this fit dataset contains software metrics collected during a past release of the software project whose quality we want to predict. To improve the predictive accuracy of such quality models, it is common practice to combine the predictions of multiple learners and so take advantage of their respective biases. Although multi-learner classifiers have proven successful in some cases, the improvement is not always significant because the information in the fit dataset can be insufficient. We present an innovative method to build software quality models that uses majority voting to combine the predictions of multiple learners induced on multiple training datasets. To our knowledge, no previous study in software quality has attempted to take advantage of multiple software project data repositories, which are generally spread across the organization. In a large-scale empirical study involving seven real-world datasets and seventeen learners, we show that, on average, combining the predictions of one learner trained on multiple datasets significantly improves predictive performance compared to one learner induced on a single fit dataset. We also demonstrate empirically that combining multiple learners trained on a single training dataset does not significantly improve the average predictive accuracy compared to a single learner induced on a single fit dataset.
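The core idea of the abstract, one learner induced separately on several fit datasets with the resulting predictions combined by majority voting, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class labels `"fp"` (fault-prone) and `"nfp"` (not fault-prone), the single-metric threshold learner, the toy datasets, and the rule that a tied vote resolves to `"fp"`.

```python
from collections import Counter

def majority_vote(votes, tie_break="fp"):
    """Return the most frequent label; on a tie, return tie_break (an assumption)."""
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]

def train_threshold_learner(fit_dataset):
    """Hypothetical toy learner: cut one metric midway between the class means.

    fit_dataset is a list of (metric_value, label) pairs and is assumed to
    contain at least one module of each class.
    """
    fp = [x for x, label in fit_dataset if label == "fp"]
    nfp = [x for x, label in fit_dataset if label == "nfp"]
    cut = (sum(fp) / len(fp) + sum(nfp) / len(nfp)) / 2

    def predict(x):
        return "fp" if x > cut else "nfp"

    return predict

def predict_with_ensemble(models, x):
    """Combine the predictions of all models for one module by majority voting."""
    return majority_vote([model(x) for model in models])

# Three made-up fit datasets standing in for different project repositories.
FIT_DATASETS = [
    [(10, "nfp"), (20, "nfp"), (60, "fp"), (70, "fp")],
    [(5, "nfp"), (15, "nfp"), (40, "fp"), (50, "fp")],
    [(30, "nfp"), (40, "nfp"), (90, "fp"), (100, "fp")],
]

# The same learner is induced once per fit dataset.
models = [train_threshold_learner(dataset) for dataset in FIT_DATASETS]
```

Calling `predict_with_ensemble(models, 35)` asks each of the three models to classify a module whose metric value is 35 and returns the label chosen by most of them. The same `majority_vote` function would also serve for the multi-learner, single-dataset setting the abstract compares against; only the way `models` is built would change.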