To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. This scheme has two obvious drawbacks: (1) as the underlying data volume keeps growing, it becomes either physically impossible or prohibitively time-consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be technically infeasible or explicitly forbidden (for security or privacy reasons) to download data from other sites. These approaches therefore have severe limitations for effective global data cleansing of large, distributed datasets.

In this paper, we propose a solution that bridges local and global analysis for noise cleansing. More specifically, the proposed approach identifies and eliminates mislabeled data items from large or distributed datasets through local analysis and global incorporation. To this end, we either use the distributed datasets directly or partition a large dataset into subsets, each of which is regarded as a local subset small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset and use these good rules to evaluate the whole dataset. For a given instance I_k, two error-count variables record the number of times it has been identified as noise across all data subsets; an instance with higher error counts has a higher probability of being mislabeled. Two threshold schemes, majority and non-objection, are then used to identify and eliminate the noisy examples.
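The voting scheme described above can be sketched in a few lines. The following is a minimal, self-contained illustration, not the paper's actual method: the "local model" here is a simple per-subset decision threshold standing in for the good rules induced from each subset, a single error counter per instance replaces the paper's two error-count variables, and all data, parameters, and function names are hypothetical.

```python
import random

def learn_threshold(subset):
    # Hypothetical local "model": the midpoint between the two class means,
    # a stand-in for the good-rule induction the paper performs per subset.
    zeros = [x for x, y in subset if y == 0]
    ones = [x for x, y in subset if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    return 0 if x < threshold else 1

random.seed(7)
# Synthetic 1-D data: true label is 0 for x < 0.5, 1 otherwise.
data = [(x, 0 if x < 0.5 else 1)
        for x in (random.random() for _ in range(300))]
# Inject 5% class noise by flipping the labels of 15 instances.
noisy_idx = set(random.sample(range(len(data)), 15))
data = [(x, 1 - y if i in noisy_idx else y)
        for i, (x, y) in enumerate(data)]

k = 5  # number of local subsets (distributed sites or partitions)
subsets = [data[i::k] for i in range(k)]
models = [learn_threshold(s) for s in subsets]

# Global incorporation: count how many local models disagree with each
# instance's given label (the instance's error count).
errors = [sum(predict(m, x) != y for m in models) for x, y in data]

# Majority scheme: flagged as noise by more than half of the subsets.
majority = {i for i, e in enumerate(errors) if e > k / 2}
# Non-objection scheme: flagged as noise by every subset.
non_objection = {i for i, e in enumerate(errors) if e == k}
```

Note how the two schemes trade off against each other: non-objection is strictly more conservative than majority (its flagged set is always a subset of majority's), so it removes fewer clean instances at the risk of leaving more mislabeled ones in the data.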
Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.