Tuning large scale deduplication with reduced effort

  • Authors:
  • Guilherme Dal Bianco; Renata Galante; Carlos A. Heuser; Marcos André Gonçalves

  • Affiliations:
  • Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil; Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil; Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil; Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brazil

  • Venue:
  • Proceedings of the 25th International Conference on Scientific and Statistical Database Management
  • Year:
  • 2013

Abstract

Deduplication is the task of identifying which objects in a data repository are potentially the same. It usually demands user intervention at several steps of the process, mainly to label some pairs as matches or non-matches. This information is then used to help identify other potentially duplicated records. When deduplication is applied to very large datasets, performance and matching quality depend on expert users configuring the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework called FS-Dedup that helps tune the deduplication process on large datasets with reduced user effort: the user is only required to label a small, automatically selected subset of pairs. FS-Dedup uses Signature-Based Deduplication (Sig-Dedup) algorithms as its deduplication core. Sig-Dedup is highly efficient and scalable on large datasets but requires an expert user to tune several parameters. FS-Dedup addresses this drawback by providing a framework that does not demand specialized user knowledge about the dataset or thresholds to achieve high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques, with reduced manual effort from the user.
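
To make the blocking and classification steps mentioned in the abstract more concrete, the following is a minimal illustrative sketch and not the FS-Dedup or Sig-Dedup implementation. It assumes whitespace tokens as record signatures, a shared-token inverted index as the blocking step, and a Jaccard similarity threshold as the classifier. The function names (`tokens`, `candidate_pairs`, `find_duplicates`) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased whitespace tokens; a stand-in for a real signature scheme."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(records):
    """Blocking (assumed scheme): pair only records sharing at least one token."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for tok in tokens(rec):
            index[tok].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def find_duplicates(records, threshold):
    """Classification: keep candidate pairs whose similarity meets the threshold."""
    toks = [tokens(r) for r in records]
    return sorted(
        (i, j)
        for i, j in candidate_pairs(records)
        if jaccard(toks[i], toks[j]) >= threshold
    )


# Hypothetical sample data, invented for this illustration.
records = [
    "john smith porto alegre",
    "john smith porto alegre rs",
    "maria souza belo horizonte",
    "jon smith porto alegre",
]

strict = find_duplicates(records, threshold=0.7)
loose = find_duplicates(records, threshold=0.5)
```

On this toy input, the strict threshold keeps only the pair (0, 1), while the looser one also admits the pairs involving the misspelled record 3. That sensitivity to the threshold is the kind of parameter tuning the abstract says normally requires an expert user.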