Background: Most prediction models, e.g. for effort estimation, require data preprocessing. Some datasets, such as ISBSG, contain data quality meta-data that can be used to filter low-quality cases out of the analysis. However, researchers have not yet reached agreement on these data quality selection criteria. Aims: This paper analyzes the influence of data quality meta-data criteria on the number of selected projects, which in turn can influence the models obtained. A case study was selected to gain a more complete understanding of what future research should focus on. Method: First, the data quality meta-data selection criteria used by previous works proposing prediction models based on the ISBSG dataset were reviewed. Particular attention was paid to two data quality meta-data variables in ISBSG Release 11: Data Quality Rating and Unadjusted Function Point Rating. Secondly, data from 830 projects were collected from the ISBSG dataset after a preliminary screening; this first screening mainly yields a subset of projects with comparable definitions of size and effort. The data quality meta-data criteria were then applied in order to infer their influence. Results: Overall, the data selection criteria, even before any data quality meta-data concerns, involve an important reduction in sample size: of 5052 projects, only 830 are actually considered. Of these, 262 projects remain for analysis if the maximum quality rating is required for both data quality meta-data variables. Because the initial data preparation already addresses the problem of missing data for a given purpose, the data quality criteria do not appear to be decisive for the analysis results; however, some variability was observed. Conclusions: Whilst this analysis is supported by a single case study, it is hoped that it contributes to a better understanding of the subject.
Indeed, the results suggest that in studies where the project selection criteria are not applied very strictly, these data quality criteria must be taken into account carefully.
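The filtering step described in the Method above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the field names (`DataQualityRating`, `UFPRating`) and the A–D rating scale are assumptions modeled on ISBSG conventions, and the sample records are invented.

```python
# Hedged sketch of filtering projects by ISBSG-style data quality meta-data.
# Ratings run from "A" (highest quality) to "D" (lowest), so a simple string
# comparison works as an ordering. The records below are illustrative only.

projects = [
    {"id": 1, "DataQualityRating": "A", "UFPRating": "A"},
    {"id": 2, "DataQualityRating": "A", "UFPRating": "B"},
    {"id": 3, "DataQualityRating": "B", "UFPRating": "A"},
    {"id": 4, "DataQualityRating": "C", "UFPRating": "D"},
]

def filter_by_quality(records, max_dq="A", max_ufp="A"):
    """Keep only projects whose two quality ratings are at or above the
    given thresholds ("A" is best, so 'at or above' means <= lexically)."""
    return [r for r in records
            if r["DataQualityRating"] <= max_dq and r["UFPRating"] <= max_ufp]

strict = filter_by_quality(projects)             # both ratings must be "A"
relaxed = filter_by_quality(projects, "B", "B")  # accept "A" or "B"

print(len(strict), len(relaxed))  # 1 3
```

Varying the thresholds in this way makes the paper's observation concrete: tightening both ratings to the maximum sharply shrinks the usable sample, just as the reported drop from 830 to 262 projects does.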