Controlling Overfitting in Software Quality Models: Experiments with Regression Trees and Classification

Authors:
Taghi M. Khoshgoftaar; Edward B. Allen; Jianyu Deng
Venue:
METRICS '01 Proceedings of the 7th International Symposium on Software Metrics
Year:
2001

Abstract

In this day of "faster, cheaper, better" release cycles, software developers must focus enhancement efforts on the modules that most need improvement. Predicting which modules are likely to have faults during operations is an important tool for guiding such improvement efforts during maintenance. Tree-based models are attractive because they readily model nonmonotonic relationships between a response variable and predictors. However, tree-based models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. A model that appears accurate on training data may, if overfitted, be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on the expected costs of misclassification rather than the total number of misclassifications. In this paper, we apply a regression-tree algorithm in the S-Plus system to the classification of software modules, using our classification rule that accounts for the preferred balance between misclassification rates. We conducted a case study of a very large legacy telecommunications system and investigated two parameters of the regression-tree algorithm. We found that minimum deviance was strongly related to overfitting and can be used to control it, but the effect of minimum node size on overfitting is ambiguous.
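The idea of measuring overfitting through misclassification costs, rather than raw error counts, can be illustrated with a minimal Python sketch. It is not the paper's S-Plus procedure: the threshold rule, the cost values (`cost_i`, `cost_ii`), the function names, and the toy data are all assumptions made for illustration. The sketch turns predicted fault counts from some regression model into fault-prone labels, computes Type I and Type II misclassification rates, and reports the gap in average misclassification cost between a training set and a later data set.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    type_i: float         # share of not fault-prone modules flagged as fault-prone
    type_ii: float        # share of fault-prone modules that were missed
    expected_cost: float  # average misclassification cost per module


def classify(predicted_faults, threshold):
    """Label a module fault-prone when its predicted fault count reaches the threshold.

    The threshold rule is an illustrative assumption, not the paper's rule.
    """
    return [p >= threshold for p in predicted_faults]


def evaluate(predicted_faults, actual_fault_prone, threshold,
             cost_i=1.0, cost_ii=10.0):
    """Compute misclassification rates and the average cost per module.

    cost_i and cost_ii are hypothetical unit costs; a missed fault-prone
    module (Type II) is assumed to cost more than a false alarm (Type I).
    """
    labels = classify(predicted_faults, threshold)
    pairs = list(zip(labels, actual_fault_prone))
    n_nfp = sum(1 for _, actual in pairs if not actual)
    n_fp = sum(1 for _, actual in pairs if actual)
    false_alarms = sum(1 for label, actual in pairs if label and not actual)
    misses = sum(1 for label, actual in pairs if not label and actual)
    type_i = false_alarms / n_nfp if n_nfp else 0.0
    type_ii = misses / n_fp if n_fp else 0.0
    expected_cost = (cost_i * false_alarms + cost_ii * misses) / len(pairs)
    return Rates(type_i, type_ii, expected_cost)


def overfitting_gap(train, test):
    """Positive values mean the model is costlier on new data than on training data."""
    return test.expected_cost - train.expected_cost


# Toy data (fabricated for illustration only): predicted fault counts and
# whether each module actually turned out to be fault-prone.
train_pred = [0.1, 0.2, 1.5, 2.0, 0.3, 1.8]
train_actual = [False, False, True, True, False, True]
test_pred = [0.2, 1.2, 0.4, 1.6, 0.1, 0.3]
test_actual = [False, False, True, True, False, False]

train_rates = evaluate(train_pred, train_actual, threshold=1.0)
test_rates = evaluate(test_pred, test_actual, threshold=1.0)
gap = overfitting_gap(train_rates, test_rates)
print(train_rates, test_rates, gap)
```

Under this sketch, one would refit the tree for each candidate value of a tuning parameter (such as minimum deviance or minimum node size) and compare the resulting gaps; a gap that grows as the tree is allowed to become larger would suggest overfitting.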