The impact of parameter setup on a genetic programming approach to record deduplication

  • Authors:
  • Moisés G. de Carvalho; Alberto H. F. Laender; Marcos André Gonçalves; Thiago C. Porto

  • Affiliations:
  • Federal University of Minas Gerais, Belo Horizonte, MG, Brazil; Federal University of Minas Gerais, Belo Horizonte, MG, Brazil; Federal University of Minas Gerais, Belo Horizonte, MG, Brazil; Federal University of Minas Gerais, Belo Horizonte, MG, Brazil

  • Venue:
  • SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases
  • Year:
  • 2008

Abstract

Several systems that rely on the integrity of their data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicate, quasi-replica, or near-duplicate entries in their repositories. Because of that, private and government organizations have made a huge effort to develop effective methods for removing replicas from large data repositories. Cleaned, replica-free repositories not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process the data. In this work, we extend the results of a genetic programming (GP) based approach to record deduplication that we proposed by performing a comprehensive set of experiments on its parameter setup. Our experiments show that some parameter choices can improve the results by up to 30%. Thus, the obtained results can be used as guidelines for the most effective way to set up the parameters of our GP-based approach to record deduplication.