Replica identification using genetic programming

Authors:
Moisés G. Carvalho;Alberto H. F. Laender;Marcos André Gonçalves;Altigran S. da Silva
Affiliations:
Federal University of Minas Gerais, MG Brazil;Federal University of Minas Gerais, MG Brazil;Federal University of Minas Gerais, MG Brazil;Federal University of Amazonas, Manaus - AM Brazil
Venue:
Proceedings of the 2008 ACM symposium on Applied computing
Year:
2008

Abstract

Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Companies and governments have invested heavily in developing effective methods for removing replicas from large databases. Typically, this investment has produced significant results, since clean, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain this data. In this paper, we propose a genetic programming (GP) based approach to automatic replica identification that combines evidence extracted from the data content in order to find a similarity function able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding, since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.
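
As a rough illustration of the idea in the abstract, the sketch below shows how a candidate similarity function might combine per-attribute evidence and be compared against a replica identification boundary. It is not the authors' implementation: the Jaccard measure, the nested-tuple tree encoding, the F1-based fitness, and all names (jaccard, evidence, evaluate, is_replica, f1_fitness) are assumptions made for this example, and the evolutionary loop that would generate and vary such trees is omitted.

```python
from typing import Dict, List, Tuple, Union

Record = Dict[str, str]
# A candidate function is an expression tree: an attribute name (leaf),
# a numeric constant (leaf), or a tuple (operator, left, right).
Tree = Union[str, float, Tuple[str, "Tree", "Tree"]]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (assumed evidence measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def evidence(r1: Record, r2: Record, attrs: List[str]) -> Dict[str, float]:
    """One piece of evidence per attribute: the similarity of its two values."""
    return {a: jaccard(r1.get(a, ""), r2.get(a, "")) for a in attrs}


def evaluate(tree: Tree, ev: Dict[str, float]) -> float:
    """Evaluate an expression tree over the evidence values."""
    if isinstance(tree, str):
        return ev[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    lhs, rhs = evaluate(left, ev), evaluate(right, ev)
    if op == "+":
        return lhs + rhs
    if op == "*":
        return lhs * rhs
    raise ValueError(f"unknown operator: {op}")


def is_replica(tree: Tree, r1: Record, r2: Record,
               attrs: List[str], boundary: float) -> bool:
    """A pair is flagged as a replica when its score reaches the boundary."""
    return evaluate(tree, evidence(r1, r2, attrs)) >= boundary


def f1_fitness(tree: Tree, pairs: List[Tuple[Record, Record, bool]],
               attrs: List[str], boundary: float) -> float:
    """F1 over labeled pairs, used here as an assumed fitness measure."""
    tp = fp = fn = 0
    for r1, r2, label in pairs:
        pred = is_replica(tree, r1, r2, attrs, boundary)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif label and not pred:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labeled pairs (made up for illustration only).
ATTRS = ["title", "author"]
PAIRS = [
    ({"title": "genetic programming intro", "author": "koza"},
     {"title": "genetic programming intro", "author": "j koza"}, True),
    ({"title": "modern information retrieval", "author": "baeza yates"},
     {"title": "learning to deduplicate", "author": "carvalho"}, False),
]
CANDIDATE: Tree = ("+", "title", "author")
```

In a GP setting, many such trees would be generated, scored with a fitness function like f1_fitness, and recombined over generations. A tree that references fewer attributes needs fewer similarity computations per pair, which is the sense in which a function using less evidence is cheaper to apply.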