Domain independent data discrepancy detection using ensemble learning

  • Authors:
  • Danico Lee; Costas Tsatsoulis

  • Affiliations:
  • Information and Telecommunication Technology Center, Department of Electrical Engineering and Computer Science, The University of Kansas, Lawrence, KS (both authors)

  • Venue:
  • ICCOMP'08 Proceedings of the 12th WSEAS international conference on Computers
  • Year:
  • 2008

Abstract

Data entry and acquisition are prone to errors and discrepancies. Data cleaning is therefore an important process and a key challenge in data warehousing, pattern recognition, knowledge discovery in databases, data mining, and data quality management. In this paper, we present an ensemble that automates domain-independent data discrepancy detection. The system combines a set of predictors/classifiers into an ensemble and identifies potentially erroneous attributes in data sets. The ensemble incorporates individual predictive algorithms and learns which is most accurate for each attribute in the data set. The system was trained and tested on five real-world data sets totaling over 25,000 data records, using varying error detection thresholds. The results show that, at the best error detection thresholds, data errors are identified with a median accuracy of 89% (average 87%, standard deviation 12.6) and with low false positive rates (median 8.4%, average 13.5%, standard deviation 9.5).
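
The idea of learning which predictor is most accurate for each attribute, and then flagging values that disagree with a confident prediction, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not name the predictors or define the threshold, so the two toy predictors (a majority-value baseline and a single-attribute lookup), the use of prediction confidence as the error detection threshold, and all function names here are assumptions.

```python
from collections import Counter, defaultdict


def fit_majority(rows, target):
    """Hypothetical baseline: always predict the most common target value."""
    counts = Counter(r[target] for r in rows)
    value, n = counts.most_common(1)[0]
    conf = n / len(rows)
    return lambda row: (value, conf)


def fit_lookup(rows, target, feature):
    """Hypothetical predictor: most common target value per feature value."""
    table = defaultdict(Counter)
    for r in rows:
        table[r[feature]][r[target]] += 1
    fallback = fit_majority(rows, target)

    def predict(row):
        counts = table.get(row[feature])
        if not counts:
            # Unseen feature value: fall back to the majority baseline.
            return fallback(row)
        value, n = counts.most_common(1)[0]
        return value, n / sum(counts.values())

    return predict


def accuracy(predict, rows, target):
    """Fraction of rows whose target value is predicted correctly."""
    return sum(predict(r)[0] == r[target] for r in rows) / len(rows)


def fit_ensemble(train, valid, attributes):
    """For each attribute, keep the candidate with the best validation accuracy."""
    chosen = {}
    for target in attributes:
        candidates = [fit_majority(train, target)]
        candidates += [fit_lookup(train, target, f) for f in attributes if f != target]
        chosen[target] = max(candidates, key=lambda p: accuracy(p, valid, target))
    return chosen


def detect(chosen, row, threshold):
    """Return attributes whose observed value disagrees with a confident prediction."""
    flagged = []
    for target, predict in chosen.items():
        value, conf = predict(row)
        if value != row[target] and conf >= threshold:
            flagged.append(target)
    return flagged
```

In this sketch, raising `threshold` flags fewer values (fewer false positives, more missed errors), which is one plausible reading of the trade-off the abstract reports across error detection thresholds. Note that a disagreement between two attributes can flag both of them, since each is predicted from the other.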