This paper presents PACE (Programmable Authority Control Engine), an authority control tool designed to maintain 'aggregation authority files'. These files are obtained by continuously aggregating records from a variable set of information systems whose content is heterogeneous and duplicated. To facilitate record deduplication in the presence of such heterogeneity and dynamicity, the PACE user interfaces support an iterative curation process in which data curators can: (i) configure algorithms for identifying duplicate records; (ii) open work sessions in which algorithm configurations can be run and evaluated; (iii) merge the identified duplicates to disambiguate the authority file; and (iv) repeat this cycle as many times as needed. PACE supports a tunable probabilistic similarity measure and performs record matching with a customisable variation of the sorted neighbourhood heuristic. Finally, it addresses the underlying performance and scalability issues by exploiting multi-core parallel processing and the Cassandra storage system, so that I/O performance scales linearly with the number of records.
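
To make the matching step more concrete, the following is a minimal sketch of the textbook sorted neighbourhood heuristic combined with a weighted similarity score. It is not PACE's implementation: the function names, the record fields (`id`, `name`, `city`), the weights, the window size and the threshold are all hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever per-field comparison functions PACE actually configures.

```python
from difflib import SequenceMatcher


def similarity(a, b, weights):
    """Weighted average of per-field string similarities, in [0, 1].

    `weights` maps field names to tunable weights (an assumption made
    for illustration; PACE's actual measure is not specified here).
    """
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        ratio = SequenceMatcher(None, a.get(field, ""), b.get(field, "")).ratio()
        score += weight * ratio
    return score / total


def sorted_neighbourhood(records, key, weights, window=3, threshold=0.85):
    """Return pairs of record ids judged to be duplicates.

    Records are sorted by `key`; each record is then compared only with
    the next `window - 1` records in sort order, instead of with all others.
    """
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similarity(rec, other, weights) >= threshold:
                pairs.append(tuple(sorted((rec["id"], other["id"]))))
    return pairs


# Hypothetical sample data, not taken from the paper.
records = [
    {"id": 1, "name": "John Smith", "city": "Pisa"},
    {"id": 2, "name": "Jon Smith", "city": "Pisa"},
    {"id": 3, "name": "Maria Rossi", "city": "Rome"},
    {"id": 4, "name": "Maria Rosi", "city": "Rome"},
]
weights = {"name": 0.8, "city": 0.2}

duplicates = sorted_neighbourhood(
    records, key=lambda r: r["name"].lower(), weights=weights
)
```

In this sketch the sorting key, the window size, the field weights and the threshold play the role of the knobs a curator would tune between work sessions. A larger window finds more candidate pairs at the cost of more comparisons, while a poorly chosen key can place true duplicates too far apart to be compared at all.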