Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Companies and governments have invested heavily in the development of effective methods for removing replicas from large databases. Typically, this investment has produced significant results, since cleaned, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain this data. In this paper, we propose a genetic programming (GP) based approach to automatic replica identification that combines evidence based on the data content in order to find a similarity function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.
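To make the idea of a similarity function that combines evidence more concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes that each piece of evidence is a string similarity score for one attribute (computed here with Python's `difflib`), that a candidate function is a small expression tree of the kind a GP process might evolve, and that a pair is flagged as a replica when the tree's value reaches a chosen boundary. The names `string_sim`, `evidence`, `evaluate`, `is_replica`, the operator set, and the attribute names are all hypothetical. The evolutionary search itself (selection, crossover, mutation) is omitted.

```python
from difflib import SequenceMatcher
import operator


def string_sim(a, b):
    """Similarity in [0, 1] between two strings (illustrative choice of measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evidence(rec_a, rec_b, attributes):
    """One piece of evidence per attribute: the similarity of the two values."""
    return {attr: string_sim(rec_a[attr], rec_b[attr]) for attr in attributes}


# Hypothetical operator set available to candidate functions.
OPS = {"+": operator.add, "*": operator.mul, "max": max}


def evaluate(tree, ev):
    """Evaluate an expression tree over the evidence.

    A leaf is either an attribute name (looked up in `ev`) or a numeric constant.
    An inner node is a tuple (operator, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return ev[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, ev), evaluate(right, ev))


def is_replica(tree, rec_a, rec_b, attributes, boundary):
    """Flag the pair as a replica if the combined score reaches the boundary."""
    return evaluate(tree, evidence(rec_a, rec_b, attributes)) >= boundary


# Example candidate: title similarity plus twice the author similarity.
candidate = ("+", "title", ("*", "author", 2.0))
```

In this sketch, a candidate that references fewer attributes requires fewer similarity computations per record pair, which is one way to read the abstract's claim that functions using less evidence are cheaper to apply. The `boundary` argument stands in for the replica identification boundary mentioned above.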