Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are now common, reducing cleansing time is essential. Data cleansing has two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time of each comparison. Building on TI-Similarity, we propose a new detection method, RAR, that further reduces the number of comparisons. With RAR and TI-Similarity, our approach to cleansing large databases consists of two processes: a filtering process and a pruning process. The filtering process performs a fast scan of the database using RAR and TI-Similarity; it is guaranteed to detect all potential duplicate records, but it may introduce false positives. The pruning process then eliminates these false positives from the filtering result using more trustworthy comparison methods. Our performance study shows that this approach is efficient and scalable for cleansing large databases, and that it is about an order of magnitude faster than existing cleansing methods.
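The two-process structure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `cheap_similarity` is a hypothetical stand-in for TI-Similarity (the real measure is defined in the paper), the pairwise scan stands in for RAR's detection strategy, and exact Levenshtein distance plays the role of the "more trustworthy" comparison used in pruning. All thresholds and names are illustrative.

```python
from collections import Counter

def cheap_similarity(a: str, b: str) -> float:
    """Cheap filtering proxy: character-multiset overlap.
    (Illustrative stand-in for TI-Similarity, not the paper's method.)"""
    ca, cb = Counter(a.lower()), Counter(b.lower())
    overlap = sum((ca & cb).values())
    return overlap / max(len(a), len(b), 1)

def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance: the slower, trustworthy check."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]

def cleanse(records, filter_threshold=0.8, max_edits=2):
    # Filtering process: fast scan with the cheap comparison.
    # Keeps every true duplicate pair but may admit false positives.
    candidates = [
        (i, j)
        for i in range(len(records))
        for j in range(i + 1, len(records))
        if cheap_similarity(records[i], records[j]) >= filter_threshold
    ]
    # Pruning process: re-check only the candidates with the
    # exact measure to eliminate false positives.
    return [(i, j) for i, j in candidates
            if edit_distance(records[i], records[j]) <= max_edits]

recs = ["John Smith", "Jhon Smith", "Jane Doe"]
print(cleanse(recs))  # [(0, 1)]
```

The point of the split is that the expensive comparison runs only on the small candidate set surviving the filter, which is where the order-of-magnitude speedup on large databases comes from.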