Background: Most prediction models, e.g. for effort estimation, require data preprocessing. Some datasets, such as ISBSG, contain data quality meta-data that can be used to filter low-quality cases out of the analysis. However, researchers have not yet reached agreement on these data quality selection criteria. Aims: This paper analyzes the influence of data quality meta-data criteria on the number of selected projects, which in turn can influence the models obtained. A case study was selected to gain a more complete understanding of what future research should focus on. Method: First, the data quality meta-data selection criteria used by previous works proposing prediction models based on the ISBSG dataset were reviewed. Particular attention was paid to two data quality meta-data variables in ISBSG Release 11: Data Quality Rating and Unadjusted Function Point Rating. Secondly, data from 830 projects were collected from the ISBSG dataset after a preliminary screening; this first screening mainly yields a subset of projects with comparable definitions of size and effort. The data quality meta-data criteria were then applied in order to infer their influence. Results: Overall, the data selection criteria, even before any data quality meta-data concerns, involve an important reduction in sample size: of 5052 projects, only 830 are actually considered. Of these, 262 projects remain for analysis if the maximum quality rating is required for both data quality meta-data variables. Because the initial data preparation already addresses the problem of missing data for a given purpose, the data quality criteria do not appear to be decisive for the analysis results; however, some variability was observed. Conclusions: Whilst this analysis is supported by a single case study, it is hoped that it contributes to a better understanding of the subject.
Indeed, the results suggest that in studies where the project selection criteria are not applied very strictly, these data quality criteria must be taken into account carefully.
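The filtering step described in the Method above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the field names (`DataQualityRating`, `UFPRating`) and the A–D rating scale are assumptions modeled on ISBSG conventions, and the sample records are invented.

```python
# Hedged sketch of filtering projects by ISBSG-style data quality meta-data.
# Ratings run from "A" (highest quality) to "D" (lowest), so a simple string
# comparison works as an ordering. The records below are illustrative only.

projects = [
    {"id": 1, "DataQualityRating": "A", "UFPRating": "A"},
    {"id": 2, "DataQualityRating": "A", "UFPRating": "B"},
    {"id": 3, "DataQualityRating": "B", "UFPRating": "A"},
    {"id": 4, "DataQualityRating": "C", "UFPRating": "D"},
]

def filter_by_quality(records, max_dq="A", max_ufp="A"):
    """Keep only projects whose two quality ratings are at or above the
    given thresholds ("A" is best, so 'at or above' means <= lexically)."""
    return [r for r in records
            if r["DataQualityRating"] <= max_dq and r["UFPRating"] <= max_ufp]

strict = filter_by_quality(projects)             # both ratings must be "A"
relaxed = filter_by_quality(projects, "B", "B")  # accept "A" or "B"

print(len(strict), len(relaxed))  # 1 3
```

Varying the thresholds in this way makes the paper's observation concrete: tightening both ratings to the maximum sharply shrinks the usable sample, just as the reported drop from 830 to 262 projects does.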