Replica identification using genetic programming

  • Authors:
  • Moisés G. Carvalho;Alberto H. F. Laender;Marcos André Gonçalves;Altigran S. da Silva

  • Affiliations:
  • Federal University of Minas Gerais, MG Brazil;Federal University of Minas Gerais, MG Brazil;Federal University of Minas Gerais, MG Brazil;Federal University of Amazonas, Manaus - AM Brazil

  • Venue:
  • Proceedings of the 2008 ACM symposium on Applied computing
  • Year:
  • 2008

Abstract

Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Companies and governments have invested heavily in the development of effective methods for removing replicas from large databases. Typically, this investment has produced significant results, since cleaned, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain this data. In this paper, we propose an approach based on genetic programming (GP) to automatic replica identification that combines evidence based on the data content in order to find a similarity function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.
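
The idea of a similarity function that combines evidence and is compared against a replica identification boundary can be illustrated with a minimal sketch. Nothing below is taken from the paper: the nested-tuple tree encoding, the operator set, the use of `difflib.SequenceMatcher` as the per-attribute similarity, and the names `EVIDENCE`, `evaluate`, and `is_replica` are all assumptions made for illustration. The sketch only evaluates one hand-written candidate function; the evolutionary search that GP would perform over such candidates is not shown.

```python
import operator
from difflib import SequenceMatcher


def string_sim(a, b):
    """Similarity in [0, 1] between two strings (an assumed evidence function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical evidence: each item pairs a record attribute with a similarity function.
EVIDENCE = {
    "name_sim": ("name", string_sim),
    "addr_sim": ("address", string_sim),
}

# Hypothetical operator set available to candidate functions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evidence_values(rec_a, rec_b):
    """Compute every evidence value for a pair of records."""
    return {key: sim(rec_a[attr], rec_b[attr]) for key, (attr, sim) in EVIDENCE.items()}


def evaluate(tree, values):
    """Evaluate a candidate function encoded as nested tuples (op, left, right).

    Leaves are either evidence names (str) or numeric constants.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, values), evaluate(right, values))
    if isinstance(tree, str):
        return values[tree]
    return float(tree)


def is_replica(tree, rec_a, rec_b, boundary):
    """Flag a pair as replicas when the combined score reaches the boundary."""
    return evaluate(tree, evidence_values(rec_a, rec_b)) >= boundary


# One hand-written candidate: name_sim * name_sim + addr_sim.
candidate = ("+", ("*", "name_sim", "name_sim"), "addr_sim")

a = {"name": "John Smith", "address": "12 Main St"}
b = {"name": "john smith", "address": "12 Main St"}
c = {"name": "Maria Souza", "address": "Av. Brasil 900"}

print(is_replica(candidate, a, b, boundary=1.5))
print(is_replica(candidate, a, c, boundary=1.5))
```

In a full GP setting, a population of such trees would be scored against labeled record pairs and recombined over generations; a candidate that ignores some evidence items needs fewer similarity computations per pair, which is one way a function using less evidence can be cheaper to apply.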