Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods

Authors:
Ingunn Myrtveit; Erik Stensrud; Ulf H. Olsson
Venue:
IEEE Transactions on Software Engineering - Special section on the seventh international software metrics symposium
Year:
2001

Abstract

Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.
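
To make the techniques named in the abstract concrete, the following is a minimal Python sketch of listwise deletion (LD) and mean imputation (MI), plus a simplified donor-based imputation that is only loosely in the spirit of SRPI. It is an illustration under stated assumptions, not the authors' procedure: the toy records, the column meaning (size, team, effort), the function names, and the unscaled squared-distance donor rule are all hypothetical. FIML is not sketched here, because it estimates model parameters directly from the incomplete data rather than filling in values.

```python
from statistics import mean

# Hypothetical toy records: (size, team, effort). None marks a missing value.
rows = [
    (100.0, 5.0, 400.0),
    (200.0, None, 900.0),
    (150.0, 7.0, 650.0),
    (None, 4.0, 300.0),
]


def listwise_deletion(data):
    """LD: keep only the rows that have no missing values."""
    return [r for r in data if None not in r]


def mean_imputation(data):
    """MI: replace each missing value with the mean of the observed values in its column."""
    cols = list(zip(*data))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [tuple(m if v is None else v for v, m in zip(r, col_means)) for r in data]


def donor_imputation(data):
    """Simplified donor imputation (an assumption, not the exact SRPI procedure).

    Each incomplete row borrows its missing values from the complete row that
    is closest on the columns the incomplete row has observed. Distances are
    unscaled squared differences, so large-valued columns dominate.
    Raises ValueError if there is no complete row to act as donor.
    """
    donors = listwise_deletion(data)
    out = []
    for r in data:
        if None not in r:
            out.append(r)
            continue

        def dist(d, r=r):
            return sum((a - b) ** 2 for a, b in zip(r, d) if a is not None)

        best = min(donors, key=dist)
        out.append(tuple(b if a is None else a for a, b in zip(r, best)))
    return out


ld_rows = listwise_deletion(rows)
mi_rows = mean_imputation(rows)
donor_rows = donor_imputation(rows)
```

On this toy input LD keeps only two of the four records, which illustrates why the abstract treats imputation as an option when the LD data set would be too small to support a regression model. In practice the columns would normally be standardized before computing donor distances.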