Identifying record replicas in Digital Libraries and other types of digital repositories is fundamental to improving the quality of their content and services, as well as to enabling eventual sharing efforts. Several deduplication strategies are available, but most of them rely on manually chosen settings to combine the evidence used to identify records as replicas. In this paper, we present the results of experiments carried out with a novel Machine Learning approach that we have proposed for the deduplication problem. This approach, based on Genetic Programming (GP), automatically generates similarity functions to identify record replicas in a given repository. The generated similarity functions combine and weight the best evidence available among the record fields in order to tell when two distinct records represent the same real-world entity. The results of the experiments show that our approach outperforms the baseline method by Fellegi and Sunter by more than 12% when identifying replicas in a data set containing researchers' personal data, and by more than 7% in a data set with article citation data.
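To make the idea of a generated similarity function concrete, the following is a minimal sketch of how one candidate function could be represented as an expression tree over per-field evidence and scored against labeled record pairs. It is not the implementation used in the experiments: the Jaccard token similarity, the operator set, the decision threshold, the F1-based fitness, and the sample records are all assumptions chosen for illustration.

```python
import operator

# Hypothetical per-field evidence: Jaccard similarity over word tokens.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Assumed operator set for internal tree nodes.
OPS = {"+": operator.add, "*": operator.mul, "max": max}

def evaluate(tree, evidence):
    """Evaluate a tree whose leaves are field names or numeric constants."""
    if isinstance(tree, str):
        return evidence[tree]          # leaf: similarity of one field
    if isinstance(tree, (int, float)):
        return float(tree)             # leaf: constant weight
    op, left, right = tree             # internal node: (op, left, right)
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))

def evidence_for(rec_a, rec_b, fields):
    """Compute one similarity score per field for a pair of records."""
    return {f: jaccard(rec_a[f], rec_b[f]) for f in fields}

def is_replica(tree, rec_a, rec_b, fields, threshold):
    """Classify a pair as replicas when the tree's score reaches the threshold."""
    return evaluate(tree, evidence_for(rec_a, rec_b, fields)) >= threshold

def f1_fitness(tree, labeled_pairs, fields, threshold):
    """Assumed fitness: F1 of the tree's decisions on labeled pairs."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in labeled_pairs:
        predicted = is_replica(tree, rec_a, rec_b, fields, threshold)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative records and one candidate tree: 2 * sim(name) + sim(affiliation).
FIELDS = ["name", "affiliation"]
a = {"name": "Maria Silva", "affiliation": "UFMG Brazil"}
b = {"name": "Maria Silva", "affiliation": "UFMG"}
c = {"name": "Joao Souza", "affiliation": "UFMG Brazil"}
candidate = ("+", ("*", 2.0, "name"), "affiliation")

score_ab = evaluate(candidate, evidence_for(a, b, FIELDS))
score_ac = evaluate(candidate, evidence_for(a, c, FIELDS))
fitness = f1_fitness(candidate, [(a, b, True), (a, c, False)], FIELDS, 1.5)
```

In a GP setting, a population of such trees would be varied by crossover and mutation and selected by fitness, so that the weighting of fields is evolved rather than set by hand. That evolutionary loop is omitted here to keep the sketch short.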