The impact of parameter setup on a genetic programming approach to record deduplication

Authors:
Moisés G. de Carvalho;Alberto H. F. Laender;Marcos André Gonçalves;Thiago C. Porto
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, MG -- Brazil;Federal University of Minas Gerais, Belo Horizonte, MG -- Brazil;Federal University of Minas Gerais, Belo Horizonte, MG -- Brazil;Federal University of Minas Gerais, Belo Horizonte, MG -- Brazil
Venue:
SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases
Year:
2008

Citing 16
Cited 0

Genetic programming: on the programming of computers by means of natural selection

Genetic programming: on the programming of computers by means of natural selection
Genetic programming: an introduction: on the automatic evolution of computer programs and its applications

Genetic programming: an introduction: on the automatic evolution of computer programs and its applications
Learning object identification rules for information integration

Information Systems - Data extraction, cleaning and reconciliation
Modern Information Retrieval

Modern Information Retrieval
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Bayesian decision model for cost optimal record matching

The VLDB Journal — The International Journal on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Iterative record linkage for cleaning and integration

Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Adaptive Name Matching in Information Integration

IEEE Intelligent Systems
Learning to deduplicate

Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Merging the results of approximate match operations

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Replica identification using genetic programming

Proceedings of the 2008 ACM symposium on Applied computing
Probabilistic data generation for deduplication and data linkage

IDEAL'05 Proceedings of the 6th international conference on Intelligent Data Engineering and Automated Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several systems that rely on the integrity of the data in order to offer high quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicates entries in their repositories. Because of that, there has been a huge effort from private and government organizations in developing effective methods for removing replicas from large data repositories. This is due to the fact that cleaned, replica-free repositories not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in computational time and resources to process this data. In this work, we extend the results of a GP-based approach we proposed to record deduplication by performing a comprehensive set of experiments regarding its parameterization setup. Our experiments show that some parameter choices can improve the results to up 30%. Thus, the obtained results can be used as guidelines to suggest the most effective way to set up the parameters of our GP-based approach to record deduplication.