An efficient duplicate record detection using q-grams array inverted index

  • Authors:
  • Alfredo Ferro;Rosalba Giugno;Piera Laura Puglisi;Alfredo Pulvirenti

  • Affiliations:
  • Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania;Dept. of Mathematics and Computer Sciences, University of Catania

  • Venue:
  • DaWaK'10 Proceedings of the 12th international conference on Data warehousing and knowledge discovery
  • Year:
  • 2010

Abstract

Duplicate record detection is a crucial task in the data cleaning process of data warehouse systems. Many approaches have been presented to address this problem: some focus on the accuracy of the resulting records, others on the efficiency of the comparison process. Following the first direction, we introduce two similarity functions based on the concept of q-grams that improve the accuracy of the duplicate detection process with respect to other well-known measures. We also reduce the number and the running time of record comparisons by building an inverted index on a sorted list of q-grams, named the q-grams array. We then extend this approach to perform a clustering process based on the proposed q-grams array. Finally, an experimental analysis on synthetic and real data shows the efficiency of the novel indexing method for both the record comparison process and clustering.
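
The abstract does not give the details of the q-grams array or of the two proposed similarity functions, so the following is only a minimal sketch of the general idea: an inverted index over a sorted list of q-grams, used to restrict comparisons to records that share at least one q-gram. All names (`qgrams`, `build_index`, `lookup`, `candidate_pairs`), the `#` padding, the default `q=2`, and the threshold are assumptions made for illustration. The Jaccard coefficient stands in for the paper's similarity functions and is not claimed to match them.

```python
from bisect import bisect_left
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of q-grams of a lowercased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_index(records, q=2):
    """Return a sorted q-gram list and a parallel list of posting lists.

    posting_lists[k] holds the sorted ids of records containing grams[k].
    """
    postings = defaultdict(set)
    for rid, text in enumerate(records):
        for gram in qgrams(text, q):
            postings[gram].add(rid)
    grams = sorted(postings)
    return grams, [sorted(postings[g]) for g in grams]


def lookup(grams, posting_lists, gram):
    """Binary-search the sorted q-gram list for one gram's posting list."""
    pos = bisect_left(grams, gram)
    if pos < len(grams) and grams[pos] == gram:
        return posting_lists[pos]
    return []


def candidate_pairs(records, q=2, threshold=0.5):
    """Score only the record pairs that share at least one q-gram.

    Uses Jaccard similarity as a stand-in measure (an assumption, not the
    similarity functions proposed in the paper).
    """
    grams, posting_lists = build_index(records, q)
    gram_sets = [qgrams(r, q) for r in records]
    shared = defaultdict(int)  # (id_a, id_b) -> number of common q-grams
    for plist in posting_lists:
        for i, a in enumerate(plist):
            for b in plist[i + 1:]:
                shared[(a, b)] += 1
    result = []
    for (a, b), common in sorted(shared.items()):
        union = len(gram_sets[a]) + len(gram_sets[b]) - common
        score = common / union
        if score >= threshold:
            result.append((a, b, round(score, 3)))
    return result
```

Calling `candidate_pairs(["John Smith", "Jon Smith", "Mary Jones"])` scores only the pairs that co-occur in some posting list, so records with no q-gram in common are never compared. This skipping of unrelated pairs is the kind of saving in the number of comparisons that the abstract attributes to the index.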