An efficient duplicate record detection using q-grams array inverted index

Authors:
Alfredo Ferro;Rosalba Giugno;Piera Laura Puglisi;Alfredo Pulvirenti
Affiliations:
Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania
Venue:
DaWaK'10 Proceedings of the 12th international conference on Data warehousing and knowledge discovery
Year:
2010

Citing 13
Cited 1

Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Data manipulation in heterogeneous databases

ACM SIGMOD Record
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
On Using q-Gram Locations in Approximate String Matching

ESA '95 Proceedings of the Third Annual European Symposium on Algorithms
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Fast Linkage Detection Scheme for Multi-Source Information Integration

WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data

The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data
Towards scalable real-time entity resolution using a similarity-aware inverted index approach

AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87

Instance-based 'one-to-some' assignment of similarity measures to attributes

OTM'11 Proceedings of the 2011th Confederated international conference on On the move to meaningful internet systems - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

Duplicate record detection is a crucial task for data cleaning process in data warehouse systems. Many approaches have been presented to address this problem: some of these rely on the accuracy of the resulted records, others focus on the efficiency of the comparison process. Following the first direction, we introduce two similarity functions based on the concept of q-grams that contribute to improve accuracy of duplicate detection process with respect to other well known measures. We also reduce the number and the running time of record comparisons by building an inverted index on a sorted list of q-grams, named q-grams array. Then, we extend this approach to perform a clustering process based on the proposed q-grams array. Finally, an experimental analysis on synthetic and real data shows the efficiency of the novel indexing method for both record comparison process and clustering.