Mining in Large Noisy Domains

  • Authors:
  • Manoranjan Dash;Ayush Singhania

  • Affiliations:
  • Nanyang Technological University, Singapore;Nanyang Technological University, Singapore

  • Venue:
  • Journal of Data and Information Quality (JDIQ)
  • Year:
  • 2009

Abstract

In this article we address the problem of mining large, noisy data efficiently. We propose an efficient sampling algorithm, Concise, as a solution. Concise is far superior to simple random sampling (SRS) in selecting a representative sample, and its gain over SRS is greatest when the data is very large and noisy. The two are compared in terms of their impact on subsequent data mining tasks, specifically classification, clustering, and association rule mining. We also compared Concise with a few existing noise removal algorithms followed by SRS. Although the accuracy of the mining results is similar, Concise takes far less time than the existing algorithms because it has linear time complexity.
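For readers unfamiliar with the baseline, the following is a minimal sketch of simple random sampling (SRS) without replacement, the method Concise is compared against. It is not the Concise algorithm, which the abstract does not describe. The function name `simple_random_sample`, its parameters, and the toy data are assumptions made for illustration only.

```python
import random


def simple_random_sample(records, sample_size, seed=None):
    """Draw sample_size records uniformly at random, without replacement.

    Illustrative SRS baseline only; this is NOT the Concise algorithm.
    """
    if sample_size > len(records):
        raise ValueError("sample_size cannot exceed the number of records")
    rng = random.Random(seed)  # local generator so runs are reproducible
    return rng.sample(records, sample_size)


# Hypothetical toy data: 1,000 integer record ids.
records = list(range(1000))
sample = simple_random_sample(records, 50, seed=0)
```

Because every record is equally likely to be chosen, SRS keeps noisy records in roughly the same proportion as in the full data, which is the weakness the abstract attributes to it on large, noisy inputs.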