Error detection and impact-sensitive instance ranking in noisy datasets

  • Authors:
  • Xingquan Zhu, Xindong Wu, Ying Yang

  • Affiliations:
  • Department of Computer Science, University of Vermont, Burlington, VT (all authors)

  • Venue:
  • AAAI'04: Proceedings of the Nineteenth National Conference on Artificial Intelligence
  • Year:
  • 2004


Abstract

Given a noisy dataset, how to locate erroneous instances and attributes, and how to rank suspicious instances by their impact on system performance, is an interesting and important research issue. In this paper we propose an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address this problem. Given a noisy dataset D, we first train a benchmark classifier T from D. Instances that cannot be effectively classified by T are treated as suspicious and forwarded to a subset S. For each attribute Ai, we swap Ai with the class label C and train a classifier APi to predict Ai. Given an instance Ik in S, we use APi and the benchmark classifier T to locate the erroneous value of each attribute Ai. To quantitatively rank instances in S, we define an impact measure based on the Information-gain Ratio (IR). We calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the impact-sensitive weights of all located erroneous attributes of Ik gives its total impact value. The experimental results demonstrate the effectiveness of our strategies.
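
The sketch below illustrates the procedure described in the abstract. It is a minimal Python illustration, not the paper's implementation: it assumes decision-tree learners for both the benchmark classifier T and the attribute predictors APi, integer-coded categorical attributes, and a simple confirmation rule (an attribute value of Ik is flagged as erroneous when APi disagrees with the recorded value and substituting APi's prediction lets T classify Ik correctly). These choices, and all function names, are assumptions made for illustration only.

  # Minimal EDIR-style sketch (assumptions noted above; not the authors' exact method).
  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def gain_ratio(attr, label):
      """Information-gain ratio IRi between a categorical attribute and the class."""
      def entropy(values):
          _, counts = np.unique(values, return_counts=True)
          p = counts / counts.sum()
          return -(p * np.log2(p)).sum()
      h_label = entropy(label)
      cond = 0.0                      # conditional entropy H(C | Ai)
      for v in np.unique(attr):
          mask = attr == v
          cond += mask.mean() * entropy(label[mask])
      split_info = entropy(attr)      # split information H(Ai)
      gain = h_label - cond
      return gain / split_info if split_info > 0 else 0.0

  def edir_rank(X, y):
      """Return (instance index, impact value) pairs, highest impact first."""
      n, d = X.shape
      # 1. Train the benchmark classifier T; instances it misclassifies form S.
      T = DecisionTreeClassifier(random_state=0).fit(X, y)
      suspicious = np.where(T.predict(X) != y)[0]
      # 2. For each attribute Ai, swap Ai with the class label C and train APi.
      predictors = []
      for i in range(d):
          Z = np.column_stack([np.delete(X, i, axis=1), y])   # remaining attrs + class
          predictors.append(DecisionTreeClassifier(random_state=0).fit(Z, X[:, i]))
      # 3. Impact-sensitive weight of Ai: its gain ratio IRi with the class C.
      weights = [gain_ratio(X[:, i], y) for i in range(d)]
      # 4. Locate erroneous attribute values of each Ik in S and sum their weights.
      ranking = []
      for k in suspicious:
          impact = 0.0
          for i in range(d):
              z = np.append(np.delete(X[k], i), y[k]).reshape(1, -1)
              predicted_value = predictors[i].predict(z)[0]
              if predicted_value != X[k, i]:
                  # Assumed confirmation rule: the repaired value lets T classify Ik correctly.
                  repaired = X[k].copy()
                  repaired[i] = predicted_value
                  if T.predict(repaired.reshape(1, -1))[0] == y[k]:
                      impact += weights[i]
          ranking.append((k, impact))
      return sorted(ranking, key=lambda t: t[1], reverse=True)

Attribute values are assumed categorical and integer-coded; continuous attributes would need discretization before the gain ratio is meaningful, and in practice the classifiers would be evaluated with cross-validation rather than on the training data itself.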