The pairwise attribute noise detection algorithm

  • Authors:
  • Jason D. Van Hulse; Taghi M. Khoshgoftaar; Haiying Huang

  • Affiliations:
  • Florida Atlantic University, Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, 33431, Boca Raton, FL, USA (all authors)

  • Venue:
  • Knowledge and Information Systems - Special Issue on Mining Low-Quality Data
  • Year:
  • 2007

Abstract

Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise, or labeling errors. In contrast, limited research has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest-neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches and determine whether they actually contain noise. The results show that PANDA provides better noise detection performance than the DM algorithm.