Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Authors:
Qinbao Song;Martin Shepperd;Xiangru Chen;Jun Liu
Affiliations:
Department of Computer Science & Technology, Xi'an Jiaotong University, 28 Xian-Ning West Road, Xi'an, Shaanxi, 710049, China;School of IS, Computing & Maths, Brunel University, Uxbridge, UB8 3PH, United Kingdom;Department of Computer Science & Technology, Xi'an Jiaotong University, 28 Xian-Ning West Road, Xi'an, Shaanxi, 710049, China;Shaanxi Electric Power Training Center for the Staff Members, 21 Dian-Chang East Road, Xi'an, Shaanxi, 710038, China
Venue:
Journal of Systems and Software
Year:
2008

Abstract

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute the missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
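
To make the idea of k-NN imputation concrete, the following is a minimal Python sketch of one generic variant. It is not the authors' implementation: the abstract does not specify the distance measure, the value of k, or how neighbours are combined, so the choices here (Euclidean distance over jointly observed features, mean of the k nearest donors) are assumptions. The names `distance`, `knn_impute`, and `projects`, and the toy values, are hypothetical. Real project data would normally be normalized before distances are computed.

```python
import math


def distance(a, b):
    """Euclidean distance over features observed in both rows, scaled up to
    the full feature count; inf if the rows share no observed feature."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows in which each None is replaced by the mean of
    that feature over the k nearest rows that have it observed.
    Assumption: numeric features, None marks a missing value."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with feature j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: each row is a project, each column a numeric feature.
projects = [
    [10.0, 2.0, 100.0],
    [12.0, 3.0, 120.0],
    [11.0, None, 110.0],
    [50.0, 9.0, 500.0],
]
completed = knn_impute(projects, k=2)
```

In this toy example the third project's missing value is filled from the first two projects, which are its closest neighbours on the observed features. The completed table could then be passed to a decision tree learner, while the alternative studied in the abstract is to give the learner the incomplete table directly.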