An analysis of data sets used to train and validate cost prediction systems

  • Authors:
  • Carolyn Mair; Martin Shepperd; Magne Jørgensen

  • Affiliations:
  • Bournemouth University, United Kingdom; Bournemouth University, United Kingdom; Simula Labs, Norway

  • Venue:
  • PROMISE '05: Proceedings of the 2005 Workshop on Predictor Models in Software Engineering
  • Year:
  • 2005

Abstract

OBJECTIVE - to build up a picture of the nature and type of data sets being used to develop and evaluate different software project effort prediction systems. We believe this to be important since there is a growing body of published work that seeks to assess different prediction approaches.

METHOD - we performed an exhaustive search, from 1980 onwards, of three software engineering journals for research papers that used project data sets to compare cost prediction systems.

RESULTS - this identified a total of 50 papers that used, one or more times, a total of 71 unique project data sets. We observed that some of the better-known and more easily accessible data sets were used repeatedly, making them potentially disproportionately influential. Such data sets also tend to be amongst the oldest, with attendant problems of obsolescence. We also note that only about 60% of all data sets are in the public domain. Finally, extracting relevant information from research papers has been time consuming due to differing styles of presentation and levels of contextual information.

CONCLUSIONS - first, the community needs to consider the quality and appropriateness of the data sets being utilised; not all data sets are equal. Second, we need to assess the way results are presented in order to facilitate meta-analysis, and whether a standard protocol would be appropriate.
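The tallying the abstract describes can be illustrated with a minimal sketch. The records, field names, and example entries below are hypothetical placeholders, not figures or tooling from the paper; the sketch assumes the surveyed papers have been hand-coded into a list of the data sets each one uses, plus a public-availability flag per data set.

```python
from collections import Counter

# Hypothetical hand-coded records: each surveyed paper lists the
# project data sets it used. These entries are illustrative only.
papers = [
    {"paper": "P01", "datasets": ["COCOMO-81", "Desharnais"]},
    {"paper": "P02", "datasets": ["COCOMO-81"]},
    {"paper": "P03", "datasets": ["Company-X"]},
]

# Assumed availability flags for each unique data set.
public = {"COCOMO-81": True, "Desharnais": True, "Company-X": False}

# Count how many papers use each data set; heavy reuse is the
# signal for a potentially disproportionately influential set.
usage = Counter(ds for p in papers for ds in p["datasets"])
for name, n in usage.most_common():
    print(f"{name}: used by {n} paper(s)")

# Share of unique data sets that are in the public domain.
share = sum(public[ds] for ds in usage) / len(usage)
print(f"Public-domain share: {share:.0%}")
```

Run over an actual hand-extracted corpus rather than these placeholder rows, counts of this kind are what underpin headline figures such as the roughly 60% public-domain share reported above.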