A new imputation method for small software project data sets

Authors:
Qinbao Song; Martin Shepperd
Affiliations:
Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Brunel University, Uxbridge, UB8 3PH, UK
Venue:
Journal of Systems and Software
Year:
2007

Abstract

Effort prediction is an important part of software project management, and historical project data sets are frequently used to support it. These data sets often contain missing data, which makes prediction more difficult. One common practice is to ignore the cases with missing data, but this makes an already small software project database even smaller and can further reduce prediction accuracy. The alternative is missing data imputation. Many imputation methods exist, but sophisticated ones tend to need larger data sets, whereas software data sets are typically small. For this reason we explore simple methods for imputing missing data in small project effort data sets. We propose MINI, a k-NN hot deck imputation method based on class mean imputation (CMI), to impute both continuous and nominal missing data in small data sets. We use an incremental approach to increase the variance of the population. To evaluate MINI, with the k-NN and CMI methods as benchmarks, we use data sets of 50 and 100 cases sampled from a larger industrial data set, with 10%, 15%, 20% and 30% of the data missing. We also simulate the Missing Completely at Random (MCAR) and Missing at Random (MAR) missingness mechanisms. The results suggest that MINI outperforms both the CMI and k-NN methods. We conclude that this new imputation technique can be used to impute missing values in small data sets.
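
To make the two benchmark techniques named in the abstract more concrete, the following is a minimal Python sketch of class mean imputation and k-NN hot deck imputation, together with a simple MCAR simulation. It is not the authors' MINI method, whose details the abstract does not give. Everything here is an assumption made for illustration: the function names, the use of `None` for a missing value, the distance measure (root mean squared difference over the columns both rows have observed), and the choice to average the k donors. Only numeric columns are handled; nominal values would need a mode or a matching-based distance instead.

```python
import math
import random
from statistics import mean

MISSING = None  # assumption: a missing cell is represented as None


def inject_mcar(rows, rate, rng):
    """Blank each cell independently with probability `rate` (MCAR)."""
    return [[MISSING if rng.random() < rate else v for v in row] for row in rows]


def class_mean_impute(rows, labels):
    """Fill a missing cell with the mean of its column within the same class."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        for cls in set(labels):
            observed = [r[j] for r, c in zip(rows, labels)
                        if c == cls and r[j] is not MISSING]
            if not observed:
                continue  # no observed value in this class; leave the gap
            class_mean = mean(observed)
            for r, c in zip(filled, labels):
                if c == cls and r[j] is MISSING:
                    r[j] = class_mean
    return filled


def _distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_hot_deck_impute(rows, k=2):
    """Fill a missing cell with the mean of that column over the k nearest donors."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are other rows that have column j observed.
            donors = [(_distance(row, other), other[j])
                      for t, other in enumerate(rows)
                      if t != i and other[j] is not MISSING]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                filled[i][j] = mean(v for _, v in nearest)
    return filled


# Hypothetical toy data: two numeric features, one missing cell.
projects = [[1.0, 10.0], [2.0, MISSING], [3.0, 30.0], [10.0, 100.0]]
classes = ["a", "a", "a", "b"]
cmi_result = class_mean_impute(projects, classes)
knn_result = knn_hot_deck_impute(projects, k=2)
```

With k = 1 the k-NN variant copies a single donor's value, which is the classic hot deck behaviour; larger k trades that for a smoother estimate. A seeded generator such as `random.Random(0)` passed to `inject_mcar` keeps a simulated missingness pattern reproducible across runs.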