Investigating the use of Support Vector Regression for web effort estimation

  • Authors:
  • Anna Corazza; Sergio Di Martino; Filomena Ferrucci; Carmine Gravino; Emilia Mendes

  • Affiliations:
  • University of Napoli "Federico II", Naples, Italy 80126 (Corazza, Di Martino); University of Salerno, Fisciano, Italy 84084 (Ferrucci, Gravino); The University of Auckland, Auckland, New Zealand 92019 (Mendes)

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2011


Abstract

Support Vector Regression (SVR) is a new generation of Machine Learning algorithms suited to predictive data modeling problems. The objective of this paper is twofold: first, to investigate the effectiveness of SVR for Web effort estimation using a cross-company dataset; second, to compare different SVR configurations in order to identify the one with the best performance. In particular, we took into account three variable preprocessing strategies (no preprocessing, normalization, and logarithmic), in combination with two different dependent variables (effort and inverse effort). As a result, SVR was applied using six different data configurations. Moreover, to understand the suitability of kernel functions for handling non-linear problems, SVR was applied without a kernel and in combination with the Radial Basis Function (RBF) and Polynomial kernels, thus obtaining 18 different SVR configurations. To identify, for each configuration, the best values for each of the parameters, we defined a procedure based on a leave-one-out cross-validation approach. The dataset employed was the Tukutuku database, which has been adopted in many previous Web effort estimation studies. Three different training and test set splits were used, including 130 and 65 projects respectively. The SVR-based predictions were also benchmarked against predictions obtained using Manual StepWise Regression and Case-Based Reasoning. Our results showed that the configuration combining logarithmic feature preprocessing with the RBF kernel provided the best results for all three data splits. In addition, SVR provided significantly more accurate predictions than all the considered benchmarking techniques.
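The winning configuration described above (logarithmic preprocessing of the variables, an RBF kernel, and parameter values chosen by leave-one-out cross-validation) can be sketched as follows. This is not the authors' code: the dataset is synthetic stand-in data, and the parameter grid and feature names are illustrative assumptions; the actual study tuned parameters with its own search procedure over the Tukutuku data.

```python
# Illustrative sketch (not the paper's implementation): RBF-kernel SVR with
# log-transformed features and dependent variable, tuned by leave-one-out
# cross-validation. Data and parameter grid are hypothetical.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(40, 3))  # stand-in size drivers (e.g. web pages, images)
# Synthetic effort with multiplicative noise, so values stay positive
effort = 2.0 * X[:, 0] * np.sqrt(X[:, 1]) * np.exp(rng.normal(0, 0.1, 40))

# Logarithmic preprocessing of both features and dependent variable
X_log = np.log(X)
y_log = np.log(effort)

# Leave-one-out search over a small, assumed parameter grid
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": ["scale", 0.1]},
    cv=LeaveOneOut(),
    scoring="neg_mean_absolute_error",
)
grid.fit(X_log, y_log)

# Predictions are made in log space, then mapped back to the effort scale
pred_effort = np.exp(grid.predict(X_log))
print(grid.best_params_)
```

In a study setting, the fitted model would of course be evaluated on a held-out test split (as the paper does with its 130/65 project splits), not on the training data as in this minimal sketch.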