Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are now common, reducing cleansing time is essential. Data cleansing has two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time of each comparison. Building on TI-Similarity, we propose a new detection method, RAR, that further reduces the number of comparisons. With RAR and TI-Similarity, our approach to cleansing large databases consists of two processes: a filtering process and a pruning process. The filtering process performs a fast scan of the database using RAR and TI-Similarity; it is guaranteed to detect all potential duplicate records, but it may introduce false positives. The pruning process then eliminates these false positives from the filtering result using more trustworthy comparison methods. Our performance study shows that this approach is efficient and scalable for cleansing large databases, and that it is about an order of magnitude faster than existing cleansing methods.
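The two-process structure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `cheap_similarity` is a hypothetical stand-in for TI-Similarity (the real measure is defined in the paper), the pairwise scan stands in for RAR's detection strategy, and exact Levenshtein distance plays the role of the "more trustworthy" comparison used in pruning. All thresholds and names are illustrative.

```python
from collections import Counter

def cheap_similarity(a: str, b: str) -> float:
    """Cheap filtering proxy: character-multiset overlap.
    (Illustrative stand-in for TI-Similarity, not the paper's method.)"""
    ca, cb = Counter(a.lower()), Counter(b.lower())
    overlap = sum((ca & cb).values())
    return overlap / max(len(a), len(b), 1)

def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance: the slower, trustworthy check."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]

def cleanse(records, filter_threshold=0.8, max_edits=2):
    # Filtering process: fast scan with the cheap comparison.
    # Keeps every true duplicate pair but may admit false positives.
    candidates = [
        (i, j)
        for i in range(len(records))
        for j in range(i + 1, len(records))
        if cheap_similarity(records[i], records[j]) >= filter_threshold
    ]
    # Pruning process: re-check only the candidates with the
    # exact measure to eliminate false positives.
    return [(i, j) for i, j in candidates
            if edit_distance(records[i], records[j]) <= max_edits]

recs = ["John Smith", "Jhon Smith", "Jane Doe"]
print(cleanse(recs))  # [(0, 1)]
```

The point of the split is that the expensive comparison runs only on the small candidate set surviving the filter, which is where the order-of-magnitude speedup on large databases comes from.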