A Genetic Programming Approach to Record Deduplication

Authors:
Moises G. de Carvalho;Alberto H. F. Laender;Marcos Andre Goncalves;Altigran S. da Silva
Affiliations:
Nokia INdT, Brazil;Federal University of Minas Gerais, Belo Horizonte;Federal University of Minas Gerais, Belo Horizonte;Federal University of Amazonas, Manaus
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2012

Cited by 7

Unsupervised learning of link discovery configuration
ESWC'12 Proceedings of the 9th international conference on The Semantic Web: research and applications

Learning expressive linkage rules using genetic programming
Proceedings of the VLDB Endowment

Active learning of expressive linkage rules for the web of data
ICWE'12 Proceedings of the 12th international conference on Web Engineering

Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal

An evolutionary approach to complex schema matching
Information Systems

Active learning of expressive linkage rules using genetic programming
Web Semantics: Science, Services and Agents on the World Wide Web

SBBS: A sliding blocking algorithm with backtracking sub-blocks for duplicate data detection
Expert Systems with Applications: An International Journal

Abstract

Many systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, can be affected by duplicates, quasi-replicas, or near-duplicate entries in their repositories. For this reason, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Clean, replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data. In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter.
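
The idea of combining pieces of evidence into a deduplication function that is compared against a fixed boundary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the similarity functions, the evidence names, the tuple-based expression tree, the boundary value of 1.0, the use of F1 as the fitness measure, and the sample records. The evolutionary search itself (selection, crossover, mutation) is omitted; the sketch only shows how one candidate function might be represented and scored.

```python
from difflib import SequenceMatcher

def edit_sim(a, b):
    """String similarity in [0, 1]; a stand-in for any string metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Hypothetical evidence table: each entry pairs an attribute with a
# similarity function applied to that attribute.
EVIDENCE = {
    "name_edit": ("name", edit_sim),
    "name_jac": ("name", jaccard),
    "city_edit": ("city", edit_sim),
}

OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "/": lambda x, y: x / y if abs(y) > 1e-9 else 1.0,  # protected division
}

def evaluate(tree, rec_a, rec_b):
    """Evaluate an expression tree: a number, an evidence name,
    or a tuple (operator, left_subtree, right_subtree)."""
    if isinstance(tree, (int, float)):
        return float(tree)
    if isinstance(tree, str):
        attr, sim = EVIDENCE[tree]
        return sim(rec_a[attr], rec_b[attr])
    op, left, right = tree
    return OPS[op](evaluate(left, rec_a, rec_b), evaluate(right, rec_a, rec_b))

def is_replica(tree, rec_a, rec_b, boundary=1.0):
    """Classify a pair as replicas when the function value reaches the boundary."""
    return evaluate(tree, rec_a, rec_b) >= boundary

def f1_fitness(tree, labeled_pairs, boundary=1.0):
    """Score a candidate function by F1 over labeled record pairs."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in labeled_pairs:
        pred = is_replica(tree, rec_a, rec_b, boundary)
        tp += pred and label
        fp += pred and not label
        fn += (not pred) and label
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up labeled pairs: (record, record, are_they_replicas).
PAIRS = [
    ({"name": "John Smith", "city": "Manaus"},
     {"name": "Jon Smith", "city": "Manaus"}, True),
    ({"name": "John Smith", "city": "Manaus"},
     {"name": "Maria Souza", "city": "Belo Horizonte"}, False),
]

good = ("+", "name_edit", "city_edit")  # adds two pieces of evidence
poor = ("-", "name_edit", "city_edit")  # subtracts them instead
print(f1_fitness(good, PAIRS), f1_fitness(poor, PAIRS))
```

In a full genetic programming run, a population of such trees would be generated and recombined, with the fitness score deciding which candidates survive. Because the boundary stays fixed, the search adapts the function to the boundary rather than asking the user to tune the boundary to the function.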