Categorical missing data imputation for software cost estimation by multinomial logistic regression

Authors:
Panagiotis Sentas;Lefteris Angelis
Affiliations:
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece;Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
Venue:
Journal of Systems and Software
Year:
2006

Abstract

A common problem in software cost estimation is the handling of incomplete or missing data in the databases used to develop prediction models. In such cases, the simplest and most popular approach is to ignore either the projects or the attributes with missing observations. This technique discards valuable information and may therefore lead to inaccurate cost estimation models. Alternatively, various imputation methods can be used to estimate the missing values in a data set. These methods are applied mainly to numerical data and produce continuous estimates. However, it is well known that most cost data sets contain software projects described mostly by categorical attributes, many of which have missing values. It is therefore reasonable to use an estimation method that produces categorical rather than continuous values. The purpose of this paper is to investigate the possibility of using such a method to estimate categorical missing values in software cost databases. Specifically, multinomial logistic regression (MLR) is proposed for imputation and is applied to projects from the ISBSG multi-organizational software database. Comparisons of MLR with other techniques for handling missing data, such as listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI), under different patterns and percentages of missing data show the high efficiency of the proposed method.
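
The idea of imputing a categorical attribute with multinomial logistic regression can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names (size_scaled, dev_type), the toy projects, and the plain gradient-descent fit are all assumptions made for illustration. The model is trained on the complete cases and each missing category is replaced by the most probable class.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, xi):
    """Class probabilities for one project; index 0 of each weight row is the bias."""
    row = [1.0] + list(xi)
    return softmax([sum(w * v for w, v in zip(Wk, row)) for Wk in W])

def fit_mlr(X, y, classes, lr=0.1, epochs=500):
    """Fit a multinomial logistic regression by batch gradient descent."""
    n = len(X[0]) + 1
    W = [[0.0] * n for _ in classes]
    for _ in range(epochs):
        grad = [[0.0] * n for _ in classes]
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)
            probs = class_probs(W, xi)
            for k, c in enumerate(classes):
                err = probs[k] - (1.0 if yi == c else 0.0)
                for j, v in enumerate(row):
                    grad[k][j] += err * v
        for k in range(len(classes)):
            for j in range(n):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def impute_categorical(rows, target, predictors):
    """Fill missing (None) values of `target` with the most probable class."""
    complete = [r for r in rows if r[target] is not None]
    classes = sorted({r[target] for r in complete})
    X = [[r[p] for p in predictors] for r in complete]
    y = [r[target] for r in complete]
    W = fit_mlr(X, y, classes)
    result = []
    for r in rows:
        filled = dict(r)
        if filled[target] is None:
            probs = class_probs(W, [r[p] for p in predictors])
            filled[target] = classes[probs.index(max(probs))]
        result.append(filled)
    return result

# Hypothetical toy projects: one standardized numeric predictor and one
# categorical attribute with two missing values.
projects = [
    {"size_scaled": -1.0, "dev_type": "new"},
    {"size_scaled": -0.8, "dev_type": "new"},
    {"size_scaled": -1.2, "dev_type": "new"},
    {"size_scaled": 1.0, "dev_type": "enhancement"},
    {"size_scaled": 0.9, "dev_type": "enhancement"},
    {"size_scaled": 1.1, "dev_type": "enhancement"},
    {"size_scaled": -0.9, "dev_type": None},
    {"size_scaled": 1.2, "dev_type": None},
]
imputed = impute_categorical(projects, "dev_type", ["size_scaled"])
```

Because the imputed value is always one of the observed categories, the completed attribute stays categorical, which is the property the abstract contrasts with methods that produce continuous estimates. In a real data set, categorical predictors would first need to be dummy coded, and a library routine would normally replace the hand-written fit.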