An unsupervised heuristic-based approach for bibliographic metadata deduplication

Authors and affiliations:
Eduardo N. Borges (Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil)
Moisés G. de Carvalho (Computer Science Dept., Federal University of Minas Gerais, Belo Horizonte, Brazil)
Renata Galante (Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil)
Marcos André Gonçalves (Computer Science Dept., Federal University of Minas Gerais, Belo Horizonte, Brazil)
Alberto H. F. Laender (Computer Science Dept., Federal University of Minas Gerais, Belo Horizonte, Brazil)
Venue:
Information Processing and Management: an International Journal
Year:
2011

Abstract

Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and represented using several metadata standards, which may be heterogeneous in both content and structure. As a result, many records may be duplicated in the repository, degrading the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records efficiently and effectively. We propose similarity functions specifically designed for the digital library domain and evaluate them experimentally. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.
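
To make the general idea of training-free, similarity-based deduplication concrete, here is a minimal Python sketch. It is not the method of the article: the abstract does not specify the proposed similarity functions, so the token-based Jaccard measure, the year check, the 0.8 threshold, the record fields, and the toy records below are all assumptions chosen only for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def record_similarity(r1, r2):
    """Hypothetical score: title token overlap, gated on a matching year."""
    if r1["year"] != r2["year"]:
        return 0.0
    return jaccard(tokens(r1["title"]), tokens(r2["title"]))


def find_duplicates(records, threshold=0.8):
    """Return index pairs whose score reaches the (assumed) threshold."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if record_similarity(r1, r2) >= threshold
    ]


# Toy records: the first two differ only in case and punctuation.
records = [
    {"title": "An unsupervised heuristic-based approach for "
              "bibliographic metadata deduplication", "year": 2011},
    {"title": "AN UNSUPERVISED HEURISTIC BASED APPROACH FOR "
              "BIBLIOGRAPHIC METADATA DEDUPLICATION.", "year": 2011},
    {"title": "A different paper on schema matching", "year": 2011},
]
print(find_duplicates(records))  # [(0, 1)]
```

The sketch compares every pair of records, which is quadratic in the number of records; a repository of realistic size would likely need some form of candidate filtering before scoring, which is outside what this example shows.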