C4.5: programs for machine learning
Selected papers of the sixth annual Oregon workshop on Software metrics
Lazy learning
Robust Classification for Imprecise Environments
Machine Learning
Balancing Misclassification Rates in Classification-Tree Models of Software Quality
Empirical Software Engineering
Machine Learning
Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements
ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Tree-Based Software Quality Estimation Models For Fault Prediction
METRICS '02 Proceedings of the 8th International Symposium on Software Metrics
Predicting Fault-Prone Modules with Case-Based Reasoning
ISSRE '97 Proceedings of the Eighth International Symposium on Software Reliability Engineering
Editorial: special issue on learning from imbalanced data sets
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Mining with rarity: a unifying framework
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach
ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Detecting noisy instances with the rule-based classification model
Intelligent Data Analysis
Boosted Classification Trees and Class Probability/Quantile Estimation
The Journal of Machine Learning Research
Data Mining Static Code Attributes to Learn Defect Predictors
IEEE Transactions on Software Engineering
Adequate and Precise Evaluation of Quality Models in Software Engineering Studies
PROMISE '07 Proceedings of the Third International Workshop on Predictor Models in Software Engineering
Experimental perspectives on learning from imbalanced data
Proceedings of the 24th international conference on Machine learning
Cost-sensitive boosting for classification of imbalanced data
Pattern Recognition
The class imbalance problem: A systematic study
Intelligent Data Analysis
IEEE Transactions on Software Engineering
Automatically countering imbalance and its empirical relationship to cost
Data Mining and Knowledge Discovery
SMOTE: synthetic minority over-sampling technique
Journal of Artificial Intelligence Research
Learning when training data are costly: the effect of class distribution on tree induction
Journal of Artificial Intelligence Research
A brief introduction to boosting
IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2
Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning
ICIC'05 Proceedings of the 2005 international conference on Advances in Intelligent Computing - Volume Part I
Generalized Discrete Software Reliability Modeling With Effect of Program Size
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
Software Quality Analysis of Unlabeled Program Modules With Semisupervised Clustering
IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans
Data preparation techniques for improving rare class prediction
MAMECTIS/NOLASC/CONTROL/WAMUS '11: Proceedings of the 13th WSEAS International Conference on Mathematical Methods, Computational Techniques and Intelligent Systems (and co-located WSEAS conferences on non-linear analysis, dynamical systems and control, and wavelet analysis and multirate systems)
An in-depth study of the potentially confounding effect of class size in fault prediction
ACM Transactions on Software Engineering and Methodology (TOSEM)
Software-quality data sets tend to suffer from the class-imbalance problem that plagues many other application domains: the majority of faults in a software system, particularly a high-assurance system, usually lie in a very small percentage of the software modules. This imbalance between the number of fault-prone (fp) and not-fault-prone (nfp) modules can severely degrade a data-mining technique's ability to differentiate between the two. This paper addresses the class-imbalance problem as it pertains to software-quality prediction. We present a comprehensive empirical study of two methodologies, data sampling and boosting, for improving the performance of decision-tree models designed to identify fp software modules. We apply five data-sampling techniques and boosting to 15 software-quality data sets of varying size and level of imbalance, building nearly 50,000 models in total. Our results show that while data-sampling techniques are very effective in improving the performance of such models, boosting almost always outperforms even the best data-sampling techniques. This result, which to our knowledge has not been previously reported, has important consequences for practitioners developing software-quality classification models.
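To make the data-sampling side of the comparison concrete, here is a minimal sketch of random undersampling, a standard technique for rebalancing fp/nfp module data. The abstract does not name the five sampling techniques studied, so this example is illustrative rather than a reproduction of the paper's methods; the function name and `ratio` parameter are my own.

```python
import numpy as np

def random_undersample(X, y, minority_label=1, ratio=1.0, seed=None):
    """Randomly discard majority-class rows until the majority class
    has ratio * n_minority examples (ratio=1.0 gives a balanced set).

    Illustrative sketch only -- not the paper's exact procedure.
    """
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)   # fp modules
    maj_idx = np.flatnonzero(y != minority_label)   # nfp modules
    # keep a random subset of the majority class
    keep = rng.choice(maj_idx, size=int(ratio * len(min_idx)), replace=False)
    idx = np.concatenate([min_idx, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy data set: 5 fault-prone modules among 100 (5% minority class).
X = np.arange(100, dtype=float).reshape(100, 1)
y = np.array([1] * 5 + [0] * 95)

X_bal, y_bal = random_undersample(X, y, seed=0)
print(len(y_bal), y_bal.sum())  # 10 modules total, 5 of them fp
```

Boosting, by contrast, leaves the training distribution intact and instead reweights misclassified (often minority-class) examples on each round, which is consistent with the paper's finding that it copes with imbalance without discarding data.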