Controlling Overfitting in Classification-Tree Models of Software Quality

Authors:
Taghi M. Khoshgoftaar; Edward B. Allen
Affiliations:
Florida Atlantic University, Boca Raton, Florida USA; Mississippi State University, Mississippi USA
Venue:
Empirical Software Engineering
Year:
2001


Abstract

Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
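
The abstract does not give a formula for the cost-based overfitting measure, so the following is only an illustrative sketch. It assumes that the expected cost of misclassification is the cost-weighted count of both error types divided by the number of modules, and that overfitting can be read as the increase in that cost from the training data to a later data set. The function names, the cost values, and the exact definition of the gap are hypothetical and not taken from the paper.

```python
def expected_cost(actual, predicted, cost_type1, cost_type2):
    """Expected cost of misclassification per module (assumed definition).

    actual, predicted: equal-length sequences of booleans, True = fault-prone.
    cost_type1: assumed cost of flagging a not-fault-prone module.
    cost_type2: assumed cost of missing a fault-prone module.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(actual, predicted))
    # Type I: not fault-prone module predicted as fault-prone.
    type1 = sum(1 for a, p in pairs if not a and p)
    # Type II: fault-prone module predicted as not fault-prone.
    type2 = sum(1 for a, p in pairs if a and not p)
    return (cost_type1 * type1 + cost_type2 * type2) / len(pairs)


def overfitting_gap(train, test, cost_type1, cost_type2):
    """Hypothetical overfitting measure: test cost minus training cost.

    train, test: (actual, predicted) pairs for each data set.
    """
    train_cost = expected_cost(train[0], train[1], cost_type1, cost_type2)
    test_cost = expected_cost(test[0], test[1], cost_type1, cost_type2)
    return test_cost - train_cost


if __name__ == "__main__":
    # Toy data: the model is perfect on training data but makes one error
    # of each type on the later data set.
    train = ([True, False, False, False], [True, False, False, False])
    test = ([True, True, False, False], [False, True, True, False])
    print(overfitting_gap(train, test, cost_type1=1.0, cost_type2=10.0))
```

Weighting a missed fault-prone module more heavily than a false alarm is what makes this differ from a plain error count: two models with the same number of misclassifications can have very different expected costs.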