De-duplication of aggregation authority files

Authors:
P. Manghi;M. Mikulicic;C. Atzori
Affiliations:
Istituto di Scienza e Tecnologie dell'Informazione 'Alessandro Faedo', Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy;Istituto di Scienza e Tecnologie dell'Informazione 'Alessandro Faedo', Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy;Istituto di Scienza e Tecnologie dell'Informazione 'Alessandro Faedo', Consiglio Nazionale delle Ricerche, Via G. Moruzzi 1, 56124, Pisa, Italy
Venue:
International Journal of Metadata, Semantics and Ontologies
Year:
2012



Abstract

This paper presents PACE (Programmable Authority Control Engine), an authority control tool conceived to maintain 'aggregation authority files'. These are obtained by continuously aggregating records from a variable set of information systems whose content is heterogeneous and duplicated. To facilitate record deduplication in the presence of such heterogeneity and dynamicity, the PACE user interfaces enable an iterative curation process, in which data curators can: (i) configure algorithms for identifying duplicate records; (ii) open work sessions in which algorithm configurations can be run and evaluated; (iii) merge the identified duplicates to disambiguate the authority file; and (iv) repeat this cycle several times. PACE supports a tunable probabilistic similarity measure and performs record matching with a customisable variation of the sorted neighbourhood heuristic. Finally, it addresses the underlying performance and scalability issues by exploiting multi-core parallel processing and the Cassandra storage system, to support I/O performance that scales linearly with the number of records.
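To make the matching step concrete, the following is a minimal sketch of the generic sorted neighbourhood heuristic combined with a weighted field similarity. It is not PACE's implementation: the record layout, the field names, the weights, the window size, the threshold, and the use of Python's `difflib.SequenceMatcher` as the string comparator are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b, weights):
    """Weighted average of per-field string similarities, in [0, 1].

    `weights` maps a field name to its relative importance (hypothetical
    tuning knobs standing in for a configurable similarity measure).
    """
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        left = a.get(field, "").lower()
        right = b.get(field, "").lower()
        score += weight * SequenceMatcher(None, left, right).ratio()
    return score / total


def sorted_neighbourhood(records, sort_key, window, weights, threshold):
    """Return candidate duplicate pairs as sorted tuples of record ids.

    Records are sorted by `sort_key`; each record is then compared only
    with the `window - 1` records that follow it in that order, which
    avoids comparing every record with every other one.
    """
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similarity(record, other, weights) >= threshold:
                pairs.append(tuple(sorted((record["id"], other["id"]))))
    return pairs


# Illustrative records only; not data from the paper.
records = [
    {"id": 1, "name": "Manghi, Paolo", "org": "CNR"},
    {"id": 2, "name": "Manghi, P.", "org": "CNR"},
    {"id": 3, "name": "Smith, John", "org": "MIT"},
]

candidates = sorted_neighbourhood(
    records,
    sort_key=lambda r: r["name"].lower(),
    window=2,
    weights={"name": 0.7, "org": 0.3},
    threshold=0.75,
)
print(candidates)
```

In this sketch a curator-style iteration would amount to changing `weights`, `window`, `threshold`, or `sort_key`, rerunning the match, and inspecting the candidate pairs before merging them.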