How to Find Relevant Data for Effort Estimation?

Authors:
Ekrem Kocaguneli; Tim Menzies
Venue:
ESEM '11 Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement
Year:
2011

Cited by 7:
On the dataset shift problem in software engineering prediction models. Empirical Software Engineering.
Privacy and utility for defect prediction: experiments with MORPH. Proceedings of the 34th International Conference on Software Engineering.
Web effort estimation: the value of cross-company data set compared to single-company data set. Proceedings of the 8th International Conference on Predictive Models in Software Engineering.
Size doesn't matter?: on the value of software size features for effort estimation. Proceedings of the 8th International Conference on Predictive Models in Software Engineering.
Data science for software engineering. Proceedings of the 2013 International Conference on Software Engineering.
Better cross company defect prediction. Proceedings of the 10th Working Conference on Mining Software Repositories.
Building a second opinion: learning cross-company data. Proceedings of the 9th International Conference on Predictive Models in Software Engineering.

Abstract

Background: Building effort estimators requires training data. How can we find that data? It is tempting to cross the boundaries of development type, location, language, application and hardware and reuse the existing datasets of other organizations. However, prior results caution that such cross data may not be useful.

Aim: We test two conjectures: (1) instance selection can automatically prune irrelevant instances, and (2) retrieval from the remaining examples is useful for effort estimation, regardless of their source.

Method: We selected 8 cross-within divisions (21 pairs of within-cross subsets) out of 19 datasets and evaluated these divisions under different analogy-based estimation (ABE) methods.

Results: Between the within and cross experiments, there were few statistically significant differences in (i) the performance of the effort estimators or (ii) the number of instances retrieved for estimation.

Conclusion: For the purposes of effort estimation, there is little practical difference between cross and within data. After applying instance selection, the remaining examples (be they from within or from cross source divisions) can be used for effort estimation.
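
Illustration: The abstract does not spell out how the ABE methods retrieve analogies, so the following is only a minimal sketch of a generic analogy-based estimator, not the authors' implementation. It assumes min-max normalized features, Euclidean distance, and the median effort of the k nearest past projects as the estimate. The function names (normalize, abe_estimate), the two features, and the toy project data are all hypothetical and made up for illustration. The sketch pools within and cross projects and reports where the retrieved analogies came from, which is the kind of count the Results compare.

```python
import math
from statistics import median


def normalize(rows):
    """Min-max scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def abe_estimate(train, query, k=3):
    """Estimate effort for `query` from its k nearest analogies.

    train: list of (features, effort, source) tuples, where source is a
           label such as "within" or "cross" (hypothetical labels).
    Returns (estimate, sources of the k analogies, nearest first).
    """
    # Normalize the training projects and the query together so that
    # all features contribute on the same scale.
    feats = normalize([f for f, _, _ in train] + [query])
    q = feats[-1]
    ranked = sorted(range(len(train)), key=lambda i: math.dist(feats[i], q))
    nearest = ranked[:k]
    estimate = median(train[i][1] for i in nearest)
    sources = [train[i][2] for i in nearest]
    return estimate, sources


# Toy data, invented for illustration: ([size, team_size], effort, source).
projects = [
    ([10, 2], 100, "within"),
    ([12, 2], 120, "within"),
    ([50, 5], 900, "within"),
    ([11, 3], 110, "cross"),
    ([48, 6], 950, "cross"),
    ([90, 9], 2000, "cross"),
]

estimate, sources = abe_estimate(projects, [11, 2], k=3)
print(estimate, sources)
```

In this toy run the three retrieved analogies are a mix of within and cross projects. An instance selection step, whose details the abstract does not give, would prune the pooled projects before this retrieval.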