Detecting noisy instances with the rule-based classification model

  • Authors:
  • Taghi M. Khoshgoftaar; Naeem Seliya; Kehan Gao

  • Affiliations:
  • Florida Atlantic University, Boca Raton, Florida, USA (all authors)

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2005

Abstract

The performance of a classification model is invariably affected by the characteristics of the measurement data it is built upon. If the quality of the data is generally poor, then the classification model will demonstrate poor performance. The number of noisy instances present in a given dataset is a good reflection of the quality of the data. Detecting and removing noisy data instances improves the quality of the data and, consequently, the performance of the classification model. This study presents an attractive and user-friendly approach for detecting data noise based on Boolean rules generated from the measurement data. The approach follows a simple and replicable procedure that analyzes the rules to detect mislabeled instances in the training dataset. Such instances are treated as data noise and are removed to obtain a clean dataset. A case study of a software measurement dataset with known noisy instances is used to demonstrate the effectiveness of our approach. The dataset is obtained from a NASA software project developed for real-time predictions based on simulations. It is empirically demonstrated that the proposed approach is extremely effective in detecting noise in the dataset; in fact, the approach detected 100% of the known noisy instances. The proposed approach is compared with noise filtering based on five classification filters and an ensemble filter of five classifiers. We also demonstrate that the proposed approach shows excellent promise in detecting noisy instances in six independent, real-world software measurement datasets with unknown noisy instances.
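
The abstract does not say how the Boolean rules are generated or analyzed, so the following is only an illustrative sketch of the general idea, not the authors' procedure. It assumes one Boolean rule per software metric ("value exceeds the column median"), a label encoding of 1 for fault-prone and 0 for not fault-prone, and a flagging criterion under which an instance is treated as noise when every rule contradicts its label. The function names, the median thresholds, and the toy data are all hypothetical.

```python
from statistics import median


def boolean_pattern(instance, thresholds):
    """Return one Boolean per metric: True when the metric exceeds its threshold."""
    return tuple(value > t for value, t in zip(instance, thresholds))


def detect_noise(features, labels):
    """Return indices of instances whose label contradicts a unanimous rule pattern.

    Assumed label encoding: 1 = fault-prone, 0 = not fault-prone.
    """
    # Assumption: one threshold per metric, taken as the column median.
    thresholds = [median(column) for column in zip(*features)]
    noisy = []
    for index, (row, label) in enumerate(zip(features, labels)):
        pattern = boolean_pattern(row, thresholds)
        if all(pattern) and label == 0:
            # Every rule fires, yet the instance is labeled not fault-prone.
            noisy.append(index)
        elif not any(pattern) and label == 1:
            # No rule fires, yet the instance is labeled fault-prone.
            noisy.append(index)
    return noisy


# Toy data (hypothetical): two metrics per module.
features = [
    [10, 2], [12, 3], [11, 2],     # low metrics, labeled 0
    [90, 40], [95, 45], [88, 42],  # high metrics, labeled 1
    [93, 44],                      # high on both metrics but labeled 0
    [9, 1],                        # low on both metrics but labeled 1
]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

noisy = detect_noise(features, labels)
# Drop the flagged instances to obtain the cleaned training set.
clean = [(f, y) for i, (f, y) in enumerate(zip(features, labels)) if i not in noisy]
print(noisy)  # [6, 7]
```

Requiring all rules to disagree with the label is a deliberately conservative choice in this sketch: instances with mixed patterns are kept, so fewer clean instances are discarded at the cost of possibly missing some noise.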