Nearest neighbor sampling for better defect prediction

  • Authors:
  • Gary D. Boetticher

  • Affiliations:
  • University of Houston - Clear Lake, Houston, Texas

  • Venue:
  • PROMISE '05 Proceedings of the 2005 workshop on Predictor models in software engineering
  • Year:
  • 2005

Abstract

An important step in building effective predictive models is applying one or more sampling techniques. Traditional sampling techniques include random, stratified, systematic, and cluster sampling. The problem with these techniques is that they focus on the class attribute rather than the non-class attributes. For example, if a test instance's nearest neighbor in the training set belongs to the opposite class, that instance seems doomed to misclassification. To illustrate this problem, this paper conducts 20 experiments on five NASA defect datasets (CM1, JM1, KC1, KC2, PC1) using two different learners (J48 and Naïve Bayes). Each dataset is divided into three groups: a training set and "nice"/"nasty" neighbor test sets. Under a nearest neighbor approach, "nice neighbors" are test instances whose nearest training instance belongs to the same class, while "nasty neighbors" are test instances whose nearest training instance belongs to the opposite class. The "nice" experiments average 94 percent accuracy and the "nasty" experiments average 20 percent accuracy. Based on these results, a new nearest neighbor sampling technique is proposed.
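The nice/nasty partition described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Euclidean distance over numeric attributes (the paper does not specify the metric here) and a toy 1-D dataset in place of the NASA defect data.

```python
import numpy as np

def split_nice_nasty(train_X, train_y, test_X, test_y):
    """Partition test instances by the class of their nearest training neighbor.

    'Nice': the nearest training instance shares the test instance's class.
    'Nasty': the nearest training instance belongs to the opposite class.
    """
    nice, nasty = [], []
    for x, y in zip(test_X, test_y):
        # Euclidean distance from this test instance to every training instance
        dists = np.linalg.norm(train_X - x, axis=1)
        nn_label = train_y[np.argmin(dists)]
        (nice if nn_label == y else nasty).append((x, y))
    return nice, nasty

# Toy example: two 1-D clusters, one per class (hypothetical data)
train_X = np.array([[0.0], [1.0], [10.0], [11.0]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.5], [10.5], [9.0]])
test_y = np.array([0, 1, 0])  # last instance sits in the other class's cluster

nice, nasty = split_nice_nasty(train_X, train_y, test_X, test_y)
print(len(nice), len(nasty))  # the mislabeled-looking instance lands in 'nasty'
```

A learner evaluated separately on each partition would be expected to score much higher on the "nice" set than the "nasty" set, mirroring the 94 percent versus 20 percent gap reported above.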