Learning to deduplicate

Authors:
Moisés G. de Carvalho;Marcos André Gonçalves;Alberto H. F. Laender;Altigran S. da Silva
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Amazonas, Manaus, Brazil
Venue:
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Year:
2006

Abstract

Identifying record replicas in digital libraries and other types of digital repositories is fundamental to improving the quality of their content and services, as well as to enabling eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine the evidence used to identify records as replicas. In this paper, we present the results of experiments with a novel machine learning approach that we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), automatically generates similarity functions to identify record replicas in a given repository. The generated similarity functions combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The experimental results show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
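To make the idea of a generated similarity function concrete, the sketch below shows one way such a function could be represented and evaluated. It is an illustration only, not the authors' implementation: the expression-tree encoding, the token-based Jaccard measure, the field names (`name`, `affiliation`), the sample records, and the decision threshold are all assumptions introduced here.

```python
import operator


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical encoding of a candidate similarity function as a tree:
#   ("const", value)   a numeric weight
#   ("field", name)    similarity evidence for one record field
#   (op, left, right)  an arithmetic combination of two subtrees
OPS = {"+": operator.add, "*": operator.mul}


def evaluate(tree, rec_a, rec_b) -> float:
    """Recursively compute the tree's similarity score for a record pair."""
    kind = tree[0]
    if kind == "const":
        return tree[1]
    if kind == "field":
        return jaccard(rec_a[tree[1]], rec_b[tree[1]])
    return OPS[kind](evaluate(tree[1], rec_a, rec_b),
                     evaluate(tree[2], rec_a, rec_b))


def is_replica(tree, rec_a, rec_b, threshold: float = 1.5) -> bool:
    """Flag the pair as replicas when the score reaches the (assumed) threshold."""
    return evaluate(tree, rec_a, rec_b) >= threshold


# An individual a GP run might produce: 2.0 * sim(name) + sim(affiliation).
candidate = ("+", ("*", ("const", 2.0), ("field", "name")),
             ("field", "affiliation"))

# Made-up records for illustration.
r1 = {"name": "Ana Maria Souza", "affiliation": "Federal University of Minas Gerais"}
r2 = {"name": "Ana Souza", "affiliation": "Federal University of Minas Gerais"}
r3 = {"name": "Pedro Lima", "affiliation": "Federal University of Amazonas"}

same_entity = is_replica(candidate, r1, r2)
different_entity = is_replica(candidate, r1, r3)
```

In a GP setting, a population of such trees would be varied by crossover and mutation and scored against record pairs with known labels, so that the weights and combinations are learned rather than chosen by hand. The abstract does not state the fitness measure or the field-level evidence used in the paper, so none is assumed here.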