Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim of achieving large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. The task thus becomes similar to querying large document collections, as Web search engines do, but over a different type of document: structured database records that contain, for example, personal information such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost: matching quality for larger data sets can be lower than with standard blocking, and thus more work is required.
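To make the general idea concrete, the following is a minimal sketch of a plain inverted index used for approximate record matching. It is not one of the two variations presented in the paper, which the abstract does not describe. The sketch assumes character bigrams as index keys and the Dice coefficient as the similarity measure; the names `bigrams`, `InvertedIndex`, `add` and `query` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def bigrams(value):
    """Return the set of character bigrams of a lower-cased string."""
    s = value.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


class InvertedIndex:
    """Illustrative bigram inverted index (an assumption, not the paper's design)."""

    def __init__(self):
        self.postings = defaultdict(set)  # bigram -> ids of records containing it
        self.grams = {}                   # record id -> bigram set of that record

    def add(self, rec_id, fields):
        """Index one record given as a list of attribute strings."""
        grams = set()
        for value in fields:
            grams |= bigrams(value)
        self.grams[rec_id] = grams
        for g in grams:
            self.postings[g].add(rec_id)

    def query(self, fields, top_k=5):
        """Return up to top_k (record id, Dice score) pairs, best match first."""
        q = set()
        for value in fields:
            q |= bigrams(value)
        # Count shared bigrams by walking only the posting lists the query touches.
        overlap = defaultdict(int)
        for g in q:
            for rec_id in self.postings.get(g, ()):
                overlap[rec_id] += 1
        scored = []
        for rec_id, shared in overlap.items():
            dice = 2.0 * shared / (len(q) + len(self.grams[rec_id]))
            scored.append((dice, rec_id))
        scored.sort(key=lambda t: (-t[0], t[1]))
        return [(rec_id, round(score, 3)) for score, rec_id in scored[:top_k]]


# Made-up example records: (given name, surname).
index = InvertedIndex()
index.add(1, ["john", "smith"])
index.add(2, ["jon", "smyth"])
index.add(3, ["mary", "jones"])

matches = index.query(["john", "smith"])
print(matches)
```

Because a query only visits the posting lists of its own bigrams, its cost depends on the number of records sharing at least one bigram with it rather than on the size of the whole database, which is what makes this style of index attractive for answering a stream of query records.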