Uma abordagem efetiva e eficiente para deduplicação de metadados bibliográficos de objetos digitais (An effective and efficient approach for deduplicating bibliographic metadata of digital objects)

Authors:
Eduardo N. Borges; Renata M. Galante; Marcos A. Gonçalves
Affiliations:
Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brasil; Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brasil; Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, MG, Brasil
Venue:
SBBD '08: Proceedings of the 23rd Brazilian Symposium on Databases
Year:
2008

Abstract

Digital libraries hold collections of digital objects acquired from different sources and described using several metadata standards. As a result, the metadata are heterogeneous in both content and structure. This paper presents an approach for identifying duplicate metadata records that describe objects in digital libraries. We propose similarity functions, designed for the digital library domain, that compare the content of metadata. Experiments show that, compared with three different baselines, the proposed functions improve the quality of metadata deduplication by 0.64% to 31.5%, while using a linear-time algorithm to compare authors' names.
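
The abstract does not describe how the linear-time comparison of authors' names works, so the following is only an illustrative sketch of what such a comparison could look like, not the authors' function. It assumes names are written with the family name last, and the helper names `normalize` and `initials_match` are hypothetical. Two names are treated as compatible when their last tokens are equal and their given-name initials agree position by position; each string is scanned a constant number of times, so the cost is linear in the total length of the two names.

```python
import unicodedata


def normalize(name: str) -> list[str]:
    """Lowercase, strip accents and punctuation, and split into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "Gonçalves" and "Goncalves" compare equal.
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    return cleaned.split()


def initials_match(name_a: str, name_b: str) -> bool:
    """Hypothetical linear-time check (an assumption, not the paper's function).

    Returns True when both names share the same last token and their
    given-name initials agree, compared pairwise over the shorter name.
    """
    a, b = normalize(name_a), normalize(name_b)
    if not a or not b:
        return False
    if a[-1] != b[-1]:  # family names must match exactly after normalization
        return False
    # zip stops at the shorter list, so a missing middle initial is tolerated.
    return all(x[0] == y[0] for x, y in zip(a[:-1], b[:-1]))


if __name__ == "__main__":
    print(initials_match("Eduardo N. Borges", "E. N. Borges"))
    print(initials_match("Marcos A. Gonçalves", "M. Goncalves"))
    print(initials_match("R. Galante", "M. Galante"))
```

A check like this trades precision for speed: it avoids quadratic edit-distance computations, but it would also accept distinct authors who share a family name and initials, so in practice it would need to be combined with evidence from other metadata such as titles or years.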