Similarity-aware indexing for real-time entity resolution

Authors:
Peter Christen; Ross Gayler; David Hawking
Affiliations:
Australian National University, Canberra, Australia; Veda Advantage, Melbourne, Australia; Funnelback Pty Ltd, Dickson, Australia
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing 7
Cited 4

The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Methods for evaluating and creating data quality
Information Systems - Special issue: Data quality in cooperative information systems

A Fast Linkage Detection Scheme for Multi-Source Information Integration
WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration

Inverted files for text search engines
ACM Computing Surveys (CSUR)

Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web

A Comparison of Personal Name Matching: Techniques and Practical Issues
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops

Record linkage performance for large data sets
Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases

Efficient similarity search: arbitrary similarity measures, arbitrary composition
Proceedings of the 20th ACM international conference on Information and knowledge management

Flexible and efficient distributed resolution of large entities
FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems

Cost-aware query planning for similarity search
Information Systems

Abstract

Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch mode and on static databases. However, organisations increasingly hold large databases of entity records that must be matched in real time against a stream of query records, such that the best-matching records are retrieved for each query. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. This paper presents a novel inverted-index-based approach for real-time entity resolution. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The approach differs from other approaches to approximate query matching in that it allows any similarity comparison function and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this index grows sub-linearly with the size of the database, and that it allows query records to be matched in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
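
The build-time and query-time split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name, the three dictionaries, the single-attribute records, the first-letter blocking function, and the difflib-based similarity function are all assumptions chosen only to show the idea of precomputing similarities between attribute values that share a block, with both functions left pluggable.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def first_letter_block(value):
    """Hypothetical blocking (encoding) function: lowercased first character."""
    return value[:1].lower()


def ratio_similarity(a, b):
    """Hypothetical similarity function in [0, 1], based on difflib."""
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    """Illustrative single-attribute index; all names here are assumptions."""

    def __init__(self, block_fn=first_letter_block, sim_fn=ratio_similarity):
        self.block_fn = block_fn          # any blocking function can be plugged in
        self.sim_fn = sim_fn              # any similarity function can be plugged in
        self.records = defaultdict(set)   # attribute value -> record identifiers
        self.blocks = defaultdict(set)    # block key -> attribute values in that block
        self.sims = defaultdict(dict)     # value -> {other value in block: similarity}

    def add(self, record_id, value):
        """Build time: store the record and precompute in-block similarities."""
        is_new = value not in self.records
        self.records[value].add(record_id)
        if not is_new:
            return  # similarities for this value were already computed
        key = self.block_fn(value)
        for other in self.blocks[key]:
            score = self.sim_fn(value, other)
            self.sims[value][other] = score
            self.sims[other][value] = score
        self.blocks[key].add(value)

    def query(self, value, top_k=3):
        """Query time: look up stored similarities; compute only for unseen values."""
        if value in self.records:
            candidates = dict(self.sims.get(value, {}))
            candidates[value] = 1.0  # exact match
        else:
            key = self.block_fn(value)
            candidates = {v: self.sim_fn(value, v) for v in self.blocks.get(key, ())}
        scores = {}
        for other, score in candidates.items():
            for record_id in self.records[other]:
                scores[record_id] = max(scores.get(record_id, 0.0), score)
        # Highest similarity first; ties broken by record identifier.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


index = SimilarityAwareIndex()
for rid, surname in [("r1", "smith"), ("r2", "smyth"), ("r3", "jones"), ("r4", "smith")]:
    index.add(rid, surname)
print(index.query("smith"))
```

In this sketch, a query value already present in the index needs no similarity computations at query time, which is the property the abstract attributes to storing similarities at build time. An unseen query value falls back to computing similarities against the values in its block only.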