Detecting noisy instances with the rule-based classification model

Authors:
Taghi M. Khoshgoftaar;Naeem Seliya;Kehan Gao
Affiliations:
Florida Atlantic University, Boca Raton, Florida, USA;Florida Atlantic University, Boca Raton, Florida, USA;Florida Atlantic University, Boca Raton, Florida, USA
Venue:
Intelligent Data Analysis
Year:
2005

Abstract

The performance of a classification model is invariably affected by the characteristics of the measurement data on which it is built. If the quality of the data is generally poor, the classification model will perform poorly. The number of noisy instances in a given dataset is a good indicator of the quality of the data. Detecting and removing noisy instances improves the quality of the data and, consequently, the performance of the classification model. This study presents an attractive and user-friendly approach for detecting data noise based on Boolean rules generated from the measurement data. The approach is simple and replicable: it analyzes the rules to detect mislabeled instances in the training dataset. Such instances are treated as data noise and are removed to obtain a clean dataset. A case study of a software measurement dataset with known noisy instances is used to demonstrate the effectiveness of the approach. The dataset is obtained from a NASA software project developed for real-time predictions based on simulations. It is empirically demonstrated that the proposed approach is extremely effective in detecting noise in the dataset; in fact, it detected 100% of the known noisy instances. The proposed approach is compared with noise filtering based on five classification filters and an ensemble filter of five classifiers. We also demonstrate that the proposed approach shows excellent promise in detecting noisy instances in six independent, real-world software measurement datasets with unknown noisy instances.
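
The abstract does not spell out how the Boolean rules are built or analyzed, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that each software metric is turned into a Boolean by comparing it with a threshold, that instances sharing the same Boolean pattern form a rule, and that an instance is suspicious when its label disagrees with the dominant label of a sufficiently pure rule. The function names, the thresholds, the `min_purity` parameter, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def boolean_pattern(metrics, thresholds):
    """Map a metric vector to a tuple of Booleans (metric > threshold).

    Assumption: one fixed threshold per metric; the paper may derive
    its rules differently.
    """
    return tuple(m > t for m, t in zip(metrics, thresholds))


def flag_noisy(instances, labels, thresholds, min_purity=0.75):
    """Return sorted indices of instances whose label disagrees with the
    dominant label among instances sharing the same Boolean pattern.

    A pattern is only trusted when the dominant label covers at least
    `min_purity` of its members (a hypothetical cutoff).
    """
    groups = defaultdict(list)
    for i, metrics in enumerate(instances):
        groups[boolean_pattern(metrics, thresholds)].append(i)

    noisy = []
    for members in groups.values():
        counts = Counter(labels[i] for i in members)
        majority, n_major = counts.most_common(1)[0]
        if n_major / len(members) >= min_purity:
            noisy.extend(i for i in members if labels[i] != majority)
    return sorted(noisy)


# Hypothetical modules described by (lines of code, cyclomatic complexity).
# "fp" = fault-prone, "nfp" = not fault-prone.
THRESHOLDS = (100, 10)
INSTANCES = [
    (250, 20), (300, 25), (280, 18), (260, 22),  # large and complex
    (40, 3), (30, 2), (50, 4), (20, 1),          # small and simple
]
LABELS = ["fp", "fp", "fp", "nfp", "nfp", "nfp", "nfp", "fp"]

suspects = flag_noisy(INSTANCES, LABELS, THRESHOLDS)
print(suspects)  # [3, 7]
```

In this toy data, instance 3 is large and complex yet labeled not fault-prone, and instance 7 is small and simple yet labeled fault-prone, so both are flagged. Removing the flagged indices before training would give the "clean" dataset the abstract refers to, under the assumptions stated above.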