Handling missing data in software effort prediction with naive Bayes and EM algorithm

Authors:
Wen Zhang; Ye Yang; Qing Wang
Affiliations:
Institute of Software, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Software, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Software, Chinese Academy of Sciences, Beijing, P. R. China
Venue:
Proceedings of the 7th International Conference on Predictive Models in Software Engineering
Year:
2011

Abstract

Background: Missing data, which commonly appears in software effort datasets, is becoming an important problem in software effort prediction. Aims: In this paper, we adapt naïve Bayes and EM (Expectation Maximization) to software effort prediction and develop two embedded strategies, missing data toleration and missing data imputation, to handle the missing data in software effort datasets. Method: The missing data toleration strategy ignores missing values in software effort datasets, while the missing data imputation strategy uses observed values to impute missing values. Results: Experiments on the ISBSG and CSBSG datasets demonstrate that: 1) both proposed strategies outperform BPNN combined with classic imputation techniques such as MI and MINI, and the imputation strategy outperforms the toleration strategy in most cases and achieves the highest accuracy, 75.15%; 2) using unlabeled projects to train the prediction model significantly improves the effort prediction performance of naïve Bayes and EM under both strategies, especially when the ratio of the size of the training data to the size of the unlabeled data is at a relatively optimal level; 3) each class of software effort data corresponds exactly to a Gaussian component for both the ISBSG and CSBSG datasets. Conclusion: Although initial experiments on the ISBSG dataset demonstrate some promising aspects of the proposed strategies, we cannot conclude that they generalize to all other software effort datasets.
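
The two strategies named in the Method sentence can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature model, so the sketch assumes continuous features with one Gaussian per class and feature, marks missing values as None, and assumes every class has at least one labeled project with each feature observed. The function names (fit, posterior, em_naive_bayes, impute) and the toy data are hypothetical. Toleration is shown as skipping missing features in the likelihood; imputation is shown as filling a missing value with its posterior-weighted class mean, which is one plausible reading of "uses observed values to impute missing values".

```python
import math

def fit(X, R, k):
    """M-step: weighted Gaussian naive Bayes fit; None (missing) entries are skipped."""
    d = len(X[0])
    total = sum(sum(r) for r in R)
    prior = [sum(r[c] for r in R) / total for c in range(k)]
    mean = [[0.0] * d for _ in range(k)]
    var = [[0.0] * d for _ in range(k)]
    for c in range(k):
        for j in range(d):
            obs = [(x[j], r[c]) for x, r in zip(X, R) if x[j] is not None]
            w = sum(rc for _, rc in obs)  # assumed > 0 for every class/feature
            mean[c][j] = sum(v * rc for v, rc in obs) / w
            # Small constant keeps the variance strictly positive (an assumption).
            var[c][j] = sum(rc * (v - mean[c][j]) ** 2 for v, rc in obs) / w + 1e-6
    return prior, mean, var

def posterior(x, params):
    """E-step for one project: P(class | observed features of x)."""
    prior, mean, var = params
    logp = []
    for c in range(len(prior)):
        lp = math.log(prior[c])
        for j, v in enumerate(x):
            if v is None:  # toleration: a missing feature contributes no evidence
                continue
            lp -= 0.5 * (math.log(2 * math.pi * var[c][j])
                         + (v - mean[c][j]) ** 2 / var[c][j])
        logp.append(lp)
    top = max(logp)  # subtract the max before exponentiating for stability
    p = [math.exp(l - top) for l in logp]
    s = sum(p)
    return [pi / s for pi in p]

def em_naive_bayes(X_lab, y_lab, X_unl, k, n_iter=10):
    """Semi-supervised EM: labeled projects keep fixed one-hot responsibilities."""
    R_lab = [[1.0 if c == y else 0.0 for c in range(k)] for y in y_lab]
    params = fit(X_lab, R_lab, k)  # initialize from labeled projects only
    for _ in range(n_iter):
        R_unl = [posterior(x, params) for x in X_unl]   # E-step
        params = fit(X_lab + X_unl, R_lab + R_unl, k)   # M-step
    return params

def impute(x, params):
    """Imputation variant: fill each missing value with its posterior-weighted class mean."""
    _, mean, _ = params
    p = posterior(x, params)
    return [v if v is not None else sum(p[c] * mean[c][j] for c in range(len(p)))
            for j, v in enumerate(x)]

# Toy data (hypothetical): two effort classes, two features, None = missing.
X_lab = [[1.0, 10.0], [1.2, None], [0.8, 12.0],
         [5.0, 50.0], [5.3, 52.0], [None, 47.0]]
y_lab = [0, 0, 0, 1, 1, 1]
X_unl = [[0.9, 11.0], [None, 49.0], [5.1, None], [1.1, 9.5]]

params = em_naive_bayes(X_lab, y_lab, X_unl, k=2)
print(posterior([1.0, None], params))  # class probabilities from one observed feature
print(impute([5.2, None], params))     # second feature filled from class means
```

In this sketch the unlabeled projects influence the class parameters only through their soft class memberships, which is how EM lets them contribute to training. A real effort dataset would also contain categorical features, which would need a different per-feature likelihood than the Gaussian assumed here.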