Enhancing Data Analysis with Noise Removal

Authors: Hui Xiong; Gaurav Pandey; Michael Steinbach; Vipin Kumar
Affiliations: IEEE; IEEE; IEEE Computer Society; IEEE
Venue: IEEE Transactions on Knowledge and Data Engineering
Year: 2006

Abstract

Removing noisy objects is an important goal of data cleaning, because noise hinders most types of data analysis. Most existing data cleaning methods focus on noise produced by low-level data errors that arise from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques must also be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: a distance-based approach, a clustering-based approach, and an approach based on the Local Outlier Factor (LOF) of an object. The fourth technique, a new method that we propose, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher-quality association patterns as the amount of noise removed increases. For binary data, however, HCleaner generally leads to better clustering performance and higher-quality associations than the other three methods.
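
To make the distance-based idea concrete, the following is a minimal sketch of noise removal that discards a chosen fraction of the data. It is not the paper's implementation: the scoring rule (Euclidean distance to the k-th nearest neighbor), the function names `knn_distance_scores` and `remove_noise`, and the parameters `k` and `fraction` are all assumptions made for illustration.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Assumption: this k-NN distance stands in for a distance-based outlier
    score; the paper's exact definition may differ.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def remove_noise(points, k=2, fraction=0.25):
    """Drop the given fraction of points with the largest k-NN distance."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = knn_distance_scores(points, k)
    n_remove = int(len(points) * fraction)
    # Indices ordered from most isolated to least isolated.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_remove])
    # Preserve the original order of the surviving points.
    return [p for i, p in enumerate(points) if i not in dropped]


# Hypothetical toy data: two tight groups plus one isolated point.
data = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (20.0, -7.0),
]
cleaned = remove_noise(data, k=2, fraction=0.125)
```

Raising `fraction` removes more of the data, which mirrors the abstract's point that a noise removal technique may need to discard a large share of the objects. The pairwise distance computation here is quadratic in the number of points, so this sketch is only suitable for small data sets.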