Applying statistical methodology to optimize and simplify software metric models with missing data

  • Authors:
  • W. Eric Wong; Jin Zhao; Victor K. Y. Chan

  • Affiliations:
  • University of Texas at Dallas; University of Texas at Dallas; Macao Polytechnic Institute

  • Venue:
  • Proceedings of the 2006 ACM symposium on Applied computing
  • Year:
  • 2006

Abstract

During the construction of a software metric model, the decision on whether a particular predictor metric should be included is typically based on an intuitive or experience-based assumption that the predictor metric has a statistically significant impact on the target metric. However, a model constructed on such an assumption may contain redundant predictor metric(s) and/or unnecessary predictor metric complexity, because the assumption made before model construction is never verified after the model is built. To resolve the first problem (i.e., possible redundant predictor metrics), we propose a statistical hypothesis testing methodology to verify "retrospectively" the statistical significance of each predictor metric's impact on the target metric. If the variation of a predictor metric does not correlate sufficiently with the variation of the target metric, the predictor metric should be removed from the model. For the second problem (i.e., unnecessary predictor metric complexity), we use "goodness-of-fit" to determine whether certain categories of a categorical predictor metric should be combined. In addition, missing data often appear in the data sample used to construct the model; we use a modified k-nearest neighbors (k-NN) imputation method to deal with this problem. A study using data from the "Repository Data Disk - Release 6" is reported. The results indicate that our methodology can be useful in trimming redundant predictor metrics and in identifying unnecessary categories initially assumed for a categorical predictor metric in the model.
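
The abstract outlines a three-step workflow: impute missing values with a modified k-NN scheme, fit the metric model, and then retrospectively test each predictor metric's statistical significance. The sketch below (Python, using a plain k-NN mean imputation and ordinary least squares from statsmodels) illustrates that flow under stated assumptions; the column names, sample data, and the 0.05 significance threshold are hypothetical and not taken from the paper, and the imputation shown is a simple variant rather than the authors' modified k-NN method.

```python
# Minimal sketch of the workflow described in the abstract, not the authors'
# exact procedure: (1) fill missing values with a simple k-nearest-neighbors
# imputation, (2) fit a linear software-metric model, and (3) retrospectively
# test each predictor's significance, flagging predictors whose coefficients
# are not significant as candidates for removal.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def knn_impute(df, k=3):
    """Impute missing numeric values with the mean of the k nearest rows,
    measuring distance over the columns both rows have observed."""
    data = df.to_numpy(dtype=float)
    filled = data.copy()
    for i, row in enumerate(data):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        dists = []
        for j, other in enumerate(data):
            if j == i:
                continue
            shared = observed & ~np.isnan(other)
            if not shared.any():
                continue
            d = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        for col in np.where(missing)[0]:
            vals = data[neighbors, col]
            vals = vals[~np.isnan(vals)]
            if vals.size:
                filled[i, col] = vals.mean()
    return pd.DataFrame(filled, columns=df.columns, index=df.index)

# Hypothetical sample: predictor metrics with missing entries and a target metric.
sample = pd.DataFrame({
    "size_kloc":  [10.0, 25.0, np.nan, 40.0, 12.0, 33.0],
    "team_size":  [3.0, 5.0, 4.0, np.nan, 2.0, 6.0],
    "complexity": [1.2, 2.8, 2.1, 3.5, np.nan, 3.0],
    "effort_pm":  [12.0, 40.0, 30.0, 65.0, 15.0, 50.0],  # target metric
})

imputed = knn_impute(sample, k=2)
y = imputed["effort_pm"]
X = sm.add_constant(imputed.drop(columns="effort_pm"))
model = sm.OLS(y, X).fit()

# Retrospective check: predictors whose coefficients fail the significance
# test are candidates for trimming from the model.
alpha = 0.05
redundant = [name for name, p in model.pvalues.items()
             if name != "const" and p > alpha]
print("Candidate redundant predictor metrics:", redundant)
```

A real application would use a larger data sample and the paper's own goodness-of-fit step for merging categories of a categorical predictor; the toy data here are only meant to make the script runnable end to end.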