Building a second opinion: learning cross-company data

  • Authors:
  • Ekrem Kocaguneli;Bojan Cukic;Tim Menzies;Huihua Lu

  • Affiliations:
  • West Virginia University, Morgantown;West Virginia University, Morgantown;West Virginia University, Morgantown;West Virginia University, Morgantown

  • Venue:
  • Proceedings of the 9th International Conference on Predictive Models in Software Engineering
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Background: Developing and maintaining a software effort estimation (SEE) data set within a company (within data) is costly. Often times parts of data may be missing or too difficult to collect, e.g. effort values. However, information about the past projects-although incomplete- may be helpful, when incorporated with the SEE data sets from other companies (cross data). Aim: Utilizing cross data to aid within company estimates and local experts; Proposing a synergy between semi-supervised, active and cross company learning for software effort estimation. Method: The proposed method: 1) Summarizes existing unlabeled within data; 2) Uses cross data to provide pseudo-labels for the summarized within data; 3) Uses steps 1 and 2 to provide an estimate for the within test data as an input for the local company experts. We use 21 data sets and compare the proposed method to existing state-of-the-art within and cross company effort estimation methods subject to evaluation by 7 different error measures. Results: In 132 out of 147 settings (21 data sets X 7 error measures = 147 settings), the proposed method performs as well as the state-of-the-art methods. Also, the proposed method summarizes the past within data down to at most 15% of the original data. Conclusion: It is important to look for synergies amongst cross company and within-company effort estimation data, even when the latter is imperfect or sparse. In this research, we provide the experts with a method that: 1) is competent (performs as well as prior within and cross data estimation methods) 2) reflects on local data (estimates come from the within data); 3) is succinct (summarizes within data down to 15% or less); 4) cheap (easy to build).