Class noise detection using frequent itemsets

  • Authors:
  • Jason Van Hulse;Taghi M. Khoshgoftaar

  • Affiliations:
  • Florida Atlantic University, Boca Raton, FL 33431, USA;(Correspd. Tel.: +1 561 297 3994/ Fax: +1 561 297 2800/ E-mail: taghi@cse.fau.edu) Florida Atlantic University, Boca Raton, FL 33431, USA

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

The presence of a substantial number of noisy instances in a given dataset may adversely affect the hypothesis learnt from that data. Removing noisy instances prior to the construction of a classifier has been shown to improve the classification ability of a learner on new data. This paper introduces a novel technique for identifying observations with class noise in a dataset using frequent itemsets. For the given dataset, each instance is assigned a NoiseFactor, indicating a relative likelihood that it contains class noise. A frequent itemset is a set of instances with common attribute values which contains at least as many instances as a user-defined minimum support threshold. Consequently, the set of frequent itemsets contains information related to the structure and dependence between the attributes. Each frequent itemset is assigned a class, based on the proportion of instances within the itemset from each class. Instances that are contained in itemsets that have a large proportion of instances from the other class are identified as noisy. The technique proposed in this paper is analyzed in numerous case studies using real-world software measurement datasets with either inherent or injected noise. A comparison is provided with two well-known techniques for the identification of class noise: Classification Filter and Ensemble Filter. The results demonstrate that this new algorithm is very effective at identifying instances with class noise.