On the value of outlier elimination on software effort estimation research

  • Authors:
  • Yeong-Seok Seo;Doo-Hwan Bae

  • Affiliations:
  • Department of Computer Science, College of Information Science & Technology, KAIST, Daejeon, South Korea;Department of Computer Science, College of Information Science & Technology, KAIST, Daejeon, South Korea

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Producing accurate and reliable software effort estimation has always been a challenge for both academic research and software industries. Regarding this issue, data quality is an important factor that impacts the estimation accuracy of effort estimation methods. To assess the impact of data quality, we investigated the effect of eliminating outliers on the estimation accuracy of commonly used software effort estimation methods. Based on three research questions, we associatively analyzed the influence of outlier elimination on the accuracy of software effort estimation by applying five methods of outlier elimination (Least trimmed squares, Cook's distance, K-means clustering, Box plot, and Mantel leverage metric) and two methods of effort estimation (Least squares regression and Estimation by analogy with the variation of the parameters). Empirical experiments were performed using industrial data sets (ISBSG Release 9, Bank and Stock data sets that are collected from financial companies, and a Desharnais data set in the PROMISE repository). In addition, the effect of the outlier elimination methods is evaluated by the statistical tests (the Friedman test and the Wilcoxon signed rank test). The experimental results derived from the evaluation criteria showed that there was no substantial difference between the software effort estimation results with and without outlier elimination. However, statistical analysis indicated that outlier elimination leads to a significant improvement in the estimation accuracy on the Stock data set (in case of some combinations of outlier elimination and effort estimation methods). In addition, although outlier elimination did not lead to a significant improvement in the estimation accuracy on the other data sets, our graphical analysis of errors showed that outlier elimination can improve the likelihood to produce more accurate effort estimates for new software project data to be estimated. Therefore, from a practical point of view, it is necessary to consider the outlier elimination and to conduct a detailed analysis of the effort estimation results to improve the accuracy of software effort estimation in software organizations.