A comprehensive empirical evaluation of missing value imputation in noisy software measurement data

  • Authors:
  • Jason Van Hulse; Taghi M. Khoshgoftaar

  • Affiliation (both authors):
  • Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, United States

  • Venue:
  • Journal of Systems and Software
  • Year:
  • 2008

Abstract

The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, an inability to measure certain attributes. Because software engineering datasets are often small, discarding observations (program modules) with incomplete data is usually undesirable: deleting data can result in a significant loss of potentially valuable information. This is especially true when the missing values occur in an attribute that measures the quality of a program module, such as the number of faults observed during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor-quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation of Bayesian multiple imputation has been conducted in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of this technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.
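
For readers unfamiliar with the techniques named above, the sketch below contrasts the two extremes reported in the abstract: simple mean imputation versus regression-based imputation. It is a minimal illustration only, not the authors' implementation or experimental setup; the toy software-metrics matrix, the choice of scikit-learn, and the default estimator settings are all assumptions made for demonstration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

# Hypothetical module-level measurements: columns could be lines of code,
# cyclomatic complexity, and fault count; np.nan marks missing entries.
X = np.array([
    [120.0,  8.0,    2.0],
    [ 45.0,  3.0,    0.0],
    [300.0, np.nan,  7.0],
    [ 80.0,  5.0,  np.nan],
    [210.0, 12.0,    5.0],
])

# Mean imputation: each missing value is replaced by its column mean,
# ignoring relationships between attributes.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-style imputation: each attribute with missing values is modeled
# as a function of the other attributes (IterativeImputer fits Bayesian ridge
# regressions by default) and the missing entries are predicted.
reg_filled = IterativeImputer(random_state=0).fit_transform(X)

print("mean imputation:\n", mean_filled)
print("regression imputation:\n", reg_filled)
```

In this toy example the regression-based imputer exploits the correlation between size, complexity, and fault counts, which is the kind of structure that mean imputation discards; this mirrors, at a much smaller scale, the comparison the paper carries out.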