Domain independent data discrepancy detection using ensemble learning

Authors:
Danico Lee; Costas Tsatsoulis
Affiliations:
Information and Telecommunication Technology Center, Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, KS; Information and Telecommunication Technology Center, Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, KS
Venue:
ICCOMP'08 Proceedings of the 12th WSEAS international conference on Computers
Year:
2008

Abstract

Data entry and acquisition are prone to errors and discrepancies. Data cleaning is an important process and a key challenge in data warehousing, pattern recognition, knowledge discovery in databases, data mining, and data quality management. In this paper, we present an ensemble that automates domain-independent data discrepancy detection. The system combines a set of predictors/classifiers into an ensemble and identifies potentially erroneous attributes in data sets. The ensemble learns which of its individual predictive algorithms is most accurate for each attribute in the data set. The system was trained and tested on five real-world data sets totaling over 25,000 data records, using varying error detection thresholds. The results show that, at the best error detection thresholds, data errors are identified with a median accuracy of 89% (average 87%, standard deviation 12.6) and with low false positive rates (median 8.4%, average 13.5%, standard deviation 9.5).
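
The per-attribute selection and thresholding described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not name the predictors used, so the two toy predictors here (`MajorityPredictor`, `ConditionalPredictor`), the confidence measure, the helper names, and the sample records are all assumptions made for illustration. For each attribute, the sketch keeps whichever candidate predicts that attribute best on held-out records, then flags an attribute when the chosen predictor disagrees with the recorded value with confidence at or above a threshold.

```python
from collections import Counter, defaultdict


class MajorityPredictor:
    """Toy predictor (assumption): always predicts the most frequent value."""

    def fit(self, rows, target):
        counts = Counter(r[target] for r in rows)
        self.value, n = counts.most_common(1)[0]
        self.conf = n / len(rows)
        return self

    def predict(self, row):
        return self.value, self.conf


class ConditionalPredictor:
    """Toy predictor (assumption): predicts the target from one other attribute."""

    def __init__(self, given):
        self.given = given

    def fit(self, rows, target):
        self.table = defaultdict(Counter)
        for r in rows:
            self.table[r[self.given]][r[target]] += 1
        self.fallback = MajorityPredictor().fit(rows, target)
        return self

    def predict(self, row):
        counts = self.table.get(row[self.given])
        if not counts:  # unseen value: fall back to the majority guess
            return self.fallback.predict(row)
        value, n = counts.most_common(1)[0]
        return value, n / sum(counts.values())


def accuracy(model, rows, target):
    return sum(model.predict(r)[0] == r[target] for r in rows) / len(rows)


def train_ensemble(train, valid, attributes):
    """For each attribute, keep the candidate with the best held-out accuracy."""
    best = {}
    for target in attributes:
        candidates = [MajorityPredictor()]
        candidates += [ConditionalPredictor(a) for a in attributes if a != target]
        fitted = [c.fit(train, target) for c in candidates]
        best[target] = max(fitted, key=lambda m: accuracy(m, valid, target))
    return best


def detect(best, row, threshold):
    """Flag attributes whose recorded value contradicts a confident prediction."""
    flagged = []
    for target, model in best.items():
        value, conf = model.predict(row)
        if value != row[target] and conf >= threshold:
            flagged.append(target)
    return flagged


# Hypothetical sample records, invented for this sketch.
train = (
    [{"city": "Lawrence", "state": "KS"}] * 3
    + [{"city": "Topeka", "state": "KS"}] * 2
    + [{"city": "Austin", "state": "TX"}] * 3
)
valid = [
    {"city": "Lawrence", "state": "KS"},
    {"city": "Austin", "state": "TX"},
    {"city": "Topeka", "state": "KS"},
]
best = train_ensemble(train, valid, ["city", "state"])
suspect = {"city": "Austin", "state": "KS"}
print(detect(best, suspect, threshold=0.9))
print(detect(best, suspect, threshold=0.5))
```

In this toy setting, lowering the threshold makes the detector flag more attributes, which illustrates why the choice of error detection threshold trades detection accuracy against false positives.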