Identifying noisy features with the Pairwise Attribute Noise Detection Algorithm

  • Authors:
  • Taghi M. Khoshgoftaar; Jason Van Hulse

  • Affiliations:
  • Florida Atlantic University, Boca Raton, FL, USA; Florida Atlantic University, Boca Raton, FL, USA

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2005

Abstract

A critical issue in data mining and knowledge discovery is the problem of data quality. The amount of noise present in a dataset is often used as an indicator of data quality. While existing works have mostly focused on detecting class noise or mislabeling errors, very limited attention has been given to finding noisy attributes or features. Prior work in the area of noise handling has concentrated on the detection of observations that contain noise in either the attributes or the class labels. Methodologies that provide insight into the quality of an attribute can provide valuable knowledge to a domain expert when data analysis is being performed. We present a novel methodology for detecting noisy attributes. The procedure utilizes our recently proposed Pairwise Attribute Noise Detection Algorithm (PANDA) for detecting instances with attribute noise. From a data analyst's point of view, our approach provides a viable solution to the question: "Given a dataset, which attribute(s) contain the most noise?" The proposed methodology is investigated with multiple case studies of a real-world software measurement dataset. The empirical study is conducted by injecting simulated noise into one or more attributes of a dataset that has no class noise. Based on a domain expert's inspection of the obtained results, the effectiveness of our technique for detecting noisy attributes is demonstrated.
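
The abstract mentions injecting simulated noise into one or more attributes but does not say how the noise is generated. The following is only a minimal sketch of one plausible setup, not the authors' procedure: the function name `inject_attribute_noise`, the `fraction` and `seed` parameters, and the choice to redraw corrupted values uniformly from the attribute's observed range are all assumptions made for illustration.

```python
import random


def inject_attribute_noise(rows, attribute, fraction, seed=0):
    """Return (noisy_rows, corrupted_indices) for one attribute column.

    Assumption: a corrupted value is redrawn uniformly from the column's
    observed [min, max] range. The abstract does not specify the actual
    injection scheme, so this is a hypothetical stand-in.
    """
    rng = random.Random(seed)
    # Copy every row so the clean dataset is left untouched.
    noisy = [list(row) for row in rows]
    column = [row[attribute] for row in rows]
    low, high = min(column), max(column)
    # Choose which instances receive noise in this attribute.
    n_corrupt = round(fraction * len(rows))
    corrupted = sorted(rng.sample(range(len(rows)), n_corrupt))
    for i in corrupted:
        noisy[i][attribute] = rng.uniform(low, high)
    return noisy, corrupted


# Toy dataset with two numeric attributes; corrupt 25% of attribute 1.
clean = [[float(i), float(2 * i)] for i in range(20)]
noisy, corrupted = inject_attribute_noise(clean, attribute=1, fraction=0.25)
```

Because the function returns the indices of the corrupted instances, a noise detector's ranking can be checked against a known ground truth, which is the general idea behind a noise-injection study.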