GA-based method for feature selection and parameters optimization for machine learning regression applied to software effort estimation

  • Authors:
  • Adriano L. I. Oliveira;Petronio L. Braga;Ricardo M. F. Lima;Márcio L. Cornélio

  • Affiliations:
  • Center of Informatics, Federal University of Pernambuco, Recife 50.732-970, Brazil;Center of Informatics, Federal University of Pernambuco, Recife 50.732-970, Brazil;Center of Informatics, Federal University of Pernambuco, Recife 50.732-970, Brazil;Center of Informatics, Federal University of Pernambuco, Recife 50.732-970, Brazil

  • Venue:
  • Information and Software Technology
  • Year:
  • 2010

Abstract

Context: In the software industry, project managers usually rely on their previous experience to estimate the number of man-hours required for each software project. The accuracy of such estimates is a key factor for the efficient application of human resources. Machine learning techniques such as radial basis function (RBF) neural networks, multi-layer perceptron (MLP) neural networks, support vector regression (SVR), bagging predictors and regression-based trees have recently been applied to estimating software development effort. Some works have demonstrated that the level of accuracy in software effort estimates strongly depends on the values of the parameters of these methods. In addition, it has been shown that the selection of the input features may also have an important influence on estimation accuracy.

Objective: This paper proposes and investigates the use of a genetic algorithm (GA) method for simultaneously (1) selecting an optimal input feature subset and (2) optimizing the parameters of machine learning methods, aiming at a higher level of accuracy for the software effort estimates.

Method: Simulations are carried out using six benchmark data sets of software projects, namely, Desharnais, NASA, COCOMO, Albrecht, Kemerer and Koten and Gray. The results are compared to those obtained by methods proposed in the literature using neural networks, support vector machines, multiple additive regression trees, bagging, and Bayesian statistical models.

Results: In all data sets, the simulations have shown that the proposed GA-based method was able to improve the performance of the machine learning methods. The simulations have also demonstrated that the proposed method outperforms some methods recently reported in the literature for software effort estimation. Furthermore, the use of GA for feature selection considerably reduced the number of input features for five of the data sets used in our analysis.

Conclusions: The combination of input feature selection and parameter optimization of machine learning methods improves the accuracy of software development effort estimates. In addition, it reduces model complexity, which may help in understanding the relevance of each input feature. Therefore, some input features can be ignored without loss of accuracy in the estimations.
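To illustrate the general idea of encoding feature selection and parameter tuning in a single chromosome, the sketch below shows one possible GA setup with SVR as the base learner. It is not the authors' implementation: the chromosome layout (binary feature mask followed by three real-valued genes for C, epsilon and gamma), the truncation selection, one-point crossover, mutation rates, and the MMRE-based fitness function are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of GA-based
# simultaneous feature selection and SVR hyper-parameter optimization.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)

def decode(chrom, n_features):
    """Split a chromosome into a feature mask and SVR parameters (C, epsilon, gamma)."""
    mask = chrom[:n_features].astype(bool)
    c, eps, gam = chrom[n_features:]
    # Map the [0, 1] genes onto log-scaled parameter ranges (assumed ranges).
    return mask, 10 ** (c * 4 - 1), 10 ** (eps * 3 - 3), 10 ** (gam * 4 - 3)

def fitness(chrom, X, y):
    """Negative MMRE of an SVR trained on the selected features (higher is better)."""
    mask, C, eps, gamma = decode(chrom, X.shape[1])
    if mask.sum() == 0:                      # at least one feature must be kept
        return -np.inf
    errors = []
    for train, test in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
        model = SVR(C=C, epsilon=eps, gamma=gamma)
        model.fit(X[train][:, mask], y[train])
        pred = model.predict(X[test][:, mask])
        errors.append(np.mean(np.abs(y[test] - pred) / np.maximum(y[test], 1e-9)))
    return -np.mean(errors)

def evolve(X, y, pop_size=30, generations=40, mutation_rate=0.1):
    n = X.shape[1]
    # Chromosome: n binary feature bits followed by 3 real-valued genes in [0, 1].
    pop = np.hstack([rng.integers(0, 2, (pop_size, n)).astype(float),
                     rng.random((pop_size, 3))])
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][:pop_size // 2]]   # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n + 3)                          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < mutation_rate                  # bit-flip mutation on the mask
            child[:n][flip] = 1 - child[:n][flip]
            child[n:] = np.clip(child[n:] + rng.normal(0, 0.05, 3), 0, 1)
            children.append(child)
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(ind, X, y) for ind in pop])]
    return decode(best, n)

# Example usage with synthetic data standing in for a software-project data set.
X = rng.random((60, 8))
y = 5 * X[:, 0] + 2 * X[:, 3] + rng.normal(0, 0.1, 60)
mask, C, eps, gamma = evolve(X, y)
print("selected features:", np.where(mask)[0],
      "C=%.3f eps=%.4f gamma=%.4f" % (C, eps, gamma))
```

In this kind of setup the fitness evaluation drives both decisions at once: a chromosome is rewarded only if its feature subset and its parameter values jointly yield low estimation error, which is the combined search the paper advocates.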