Bridging Local and Global Data Cleansing: Identifying Class Noise in Large, Distributed Data Sets

  • Authors:
  • Xingquan Zhu;Xindong Wu;Qijun Chen

  • Affiliations:
  • Department of Computer Science, University of Vermont, Burlington, VT 05405, USA (all authors)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 2006

Abstract

To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and the classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it becomes either physically impossible or too time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or explicitly forbidden (for security or privacy reasons) to download data from other sites. Therefore, these approaches have severe limitations in conducting effective global data cleansing on large, distributed datasets.

In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed effort identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets, or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use these good rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise across all data subsets; instances with higher error counts have a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
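As a rough illustration of the scheme the abstract describes (local models learned on small subsets, error counts accumulated globally, and majority or non-objection thresholds), the sketch below uses a generic decision-tree learner from scikit-learn in place of the paper's good-rule construction. The function and parameter names (identify_class_noise, n_subsets, scheme) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of local analysis + global incorporation for class-noise
# identification. Assumes X and y are NumPy arrays; the decision tree stands
# in for the paper's "good rule" learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def identify_class_noise(X, y, n_subsets=5, scheme="majority", seed=0):
    """Return indices of instances flagged as likely mislabeled."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(X)), n_subsets)

    error_counts = np.zeros(len(X), dtype=int)  # times flagged as noise
    vote_counts = np.zeros(len(X), dtype=int)   # times evaluated by a local model

    for subset_idx in subsets:
        # Local analysis: train a model on one small subset only.
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X[subset_idx], y[subset_idx])

        # Global incorporation: evaluate every instance in the whole dataset
        # and accumulate disagreements with the given labels.
        predictions = model.predict(X)
        error_counts += (predictions != y).astype(int)
        vote_counts += 1

    if scheme == "majority":
        # Majority scheme: flagged by more than half of the local models.
        noisy = error_counts > vote_counts / 2
    else:
        # Non-objection scheme: flagged by every local model.
        noisy = error_counts == vote_counts
    return np.flatnonzero(noisy)
```

The non-objection threshold is the more conservative of the two: an instance is removed only when no local model "objects", which trades some noise recall for a lower risk of discarding correctly labeled examples.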