Can cross-company data improve performance in software effort estimation?

Authors: Leandro L. Minku; Xin Yao
Affiliations: The University of Birmingham, Birmingham, UK (both authors)
Venue: Proceedings of the 8th International Conference on Predictive Models in Software Engineering
Year: 2012

Abstract

Background: There has been a long debate in the software engineering literature about how useful cross-company (CC) data are for software effort estimation (SEE) compared with within-company (WC) data. Studies indicate that models trained on CC data perform either similarly to or worse than models trained solely on WC data.

Aims: We investigate whether CC data can improve performance and, if so, under what conditions.

Method: The work builds on the observation that SEE is a class of online learning tasks operating in changing environments, something most prior work has neglected. We analyse the performance of different approaches that use CC and WC data: (1) an approach not designed for changing environments, (2) approaches designed for changing environments, and (3) a new online learning approach able to identify when CC data are helpful or detrimental.

Results: The analysis reveals features of data sets commonly used in the SEE literature, showing that different subsets of CC data can be beneficial or detrimental depending on the moment in time. The newly proposed approach exploits this, successfully using CC data to improve performance over WC models.

Conclusions: This work shows that CC data can improve performance on SEE tasks and demonstrates that the online nature of software prediction tasks should be exploited, an important issue for future work.
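
The idea of identifying over time which CC data are helpful can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not specify. It is a generic performance-weighted online ensemble written under stated assumptions: each model is a hypothetical function from project size to effort, one model stands for each CC subset plus one for WC data, and `beta` is an illustrative penalty factor. All names below are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical model type: maps a project size to a predicted effort.
Model = Callable[[float], float]


@dataclass
class WeightedOnlineEnsemble:
    """Illustrative sketch only: weights models by how well they predict
    completed within-company (WC) projects as those projects arrive."""

    models: Dict[str, Model]
    beta: float = 0.5  # assumed penalty applied to every model except the best
    weights: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Start with equal trust in every model.
        n = len(self.models)
        self.weights = {name: 1.0 / n for name in self.models}

    def predict(self, size: float) -> float:
        # Weighted average of the individual predictions.
        return sum(self.weights[name] * m(size) for name, m in self.models.items())

    def update(self, size: float, actual_effort: float) -> None:
        # Called when a WC project finishes and its true effort is known.
        errors = {
            name: abs(m(size) - actual_effort) for name, m in self.models.items()
        }
        best = min(errors, key=errors.get)
        for name in self.models:
            if name != best:
                self.weights[name] *= self.beta
        # Renormalise so the weights sum to one.
        total = sum(self.weights.values())
        for name in self.weights:
            self.weights[name] /= total


# Toy usage with made-up models and a made-up stream of completed WC projects.
ensemble = WeightedOnlineEnsemble(
    models={
        "cc_subset_a": lambda size: 10.0 * size,
        "cc_subset_b": lambda size: 2.0 * size,
        "wc": lambda size: 5.0 * size,
    }
)
for size, effort in [(10.0, 20.0), (20.0, 40.0), (15.0, 30.0)]:
    ensemble.update(size, effort)
print(ensemble.weights)
```

In this toy stream the completed projects happen to match `cc_subset_b`, so its weight grows and the ensemble leans on that CC subset. If later projects matched a different model, the weights would shift again, which is the time-dependent behaviour the Results paragraph describes.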