Data entry and acquisition are prone to errors and discrepancies. Data cleaning is an important process and a key challenge in data warehousing, pattern recognition, knowledge discovery in databases, data mining, and data quality management. In this paper, we present an ensemble-based system that automates domain-independent detection of data discrepancies. The system combines a set of predictors/classifiers into an ensemble and identifies potentially erroneous attributes in data sets. The ensemble incorporates individual predictive algorithms and learns which is most accurate for each attribute in the data set. The system was trained and tested on five real-world data sets totaling over 25,000 data records, across a range of error detection thresholds. The results show that, at the best error detection thresholds, data errors are identified with a median accuracy of 89% (average 87%, standard deviation 12.6) and with low false positive rates (median 8.4%, average 13.5%, standard deviation 9.5).
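The idea of learning which predictor is most accurate for each attribute, and then flagging values that disagree with a confident prediction, can be illustrated with a minimal sketch. The abstract does not name the predictors used or the way thresholds are applied, so everything below is an assumption made for illustration: the two toy predictors (a majority-value baseline and a single-attribute lookup), the function names `fit_ensemble` and `flag_discrepancies`, the use of a validation split for selection, and the confidence-based threshold are all hypothetical stand-ins, not the paper's method.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in predictors; the paper's actual algorithms are not given here.

def majority_predictor(train, target):
    """Always predict the most common value of `target`, with its relative frequency."""
    value, count = Counter(r[target] for r in train).most_common(1)[0]
    conf = count / len(train)
    return lambda record: (value, conf)

def lookup_predictor(train, target, source):
    """Predict `target` from the value of one other attribute, `source`."""
    table = defaultdict(Counter)
    for r in train:
        table[r[source]][r[target]] += 1
    fallback = majority_predictor(train, target)

    def predict(record):
        counts = table.get(record[source])
        if not counts:  # unseen source value: fall back to the baseline
            return fallback(record)
        value, count = counts.most_common(1)[0]
        return value, count / sum(counts.values())

    return predict

def accuracy(predict, records, target):
    """Fraction of records whose recorded `target` value matches the prediction."""
    return sum(predict(r)[0] == r[target] for r in records) / len(records)

def fit_ensemble(train, valid, attributes):
    """For each attribute, keep the candidate predictor that scores best on `valid`."""
    chosen = {}
    for target in attributes:
        candidates = [majority_predictor(train, target)]
        candidates += [lookup_predictor(train, target, s)
                       for s in attributes if s != target]
        chosen[target] = max(candidates, key=lambda p: accuracy(p, valid, target))
    return chosen

def flag_discrepancies(chosen, record, threshold=0.8):
    """Return attributes whose recorded value contradicts a confident prediction."""
    flagged = []
    for target, predict in chosen.items():
        value, conf = predict(record)
        if value != record[target] and conf >= threshold:
            flagged.append(target)
    return flagged
```

In this sketch, raising `threshold` trades missed errors for fewer false positives, which is the kind of trade-off the reported threshold sweep explores. With only pairwise lookups, a contradiction between two attributes flags both of them, since the toy predictors cannot tell which side is wrong.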