Fast and accurate incremental entity resolution relative to an entity knowledge base

Authors:
Michael J. Welch;Aamod Sane;Chris Drome
Affiliations:
Barnes & Noble, Palo Alto, CA, USA;Yahoo!, Sunnyvale, CA, USA;Yahoo!, Sunnyvale, CA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012



Abstract

User-facing topical web applications such as events or shopping sites rely on large collections of data records about real-world entities that are updated at latencies ranging from days to seconds. For example, event venue details change relatively infrequently, whereas ticket pricing and availability for an event are often updated in near real time. Users regard these sites as high quality if they seldom show duplicates, their URLs are stable, and their content is fresh, so it is important to resolve duplicate entity records with high quality and low latency. High-quality entity resolution typically evaluates the entire record corpus for clusters of similar records, at the cost of latency, while low-latency resolution examines as few entities as possible to minimize time, even at the cost of quality. In this paper we show how to keep latency low while achieving high quality, combining the best of both approaches: given an entity to be resolved, our incremental Fastpath system makes, in a matter of milliseconds, approximately the same decisions that the underlying batch system would have made. Our experiments show that the Fastpath system makes matching decisions for previously unseen entities with 90% precision and 98% recall relative to batch decisions, with latencies under 20 ms on commodity hardware.
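
To make "precision and recall relative to batch decisions" concrete, here is a minimal sketch of one way such numbers could be computed. The abstract does not describe the evaluation procedure, so everything here is an assumption for illustration: the function name `precision_recall`, the representation of a decision as a mapping from record id to matched entity id (or `None` for "no match, new entity"), and the treatment of the batch decisions as the reference.

```python
def precision_recall(fast_decisions, batch_decisions):
    """Score incremental decisions against batch decisions.

    Both arguments are hypothetical: dicts mapping a record id to the
    entity id it was matched to, or None if it was judged a new entity.
    The batch decisions are treated as the reference.
    """
    tp = fp = fn = 0
    for record_id, batch_match in batch_decisions.items():
        fast_match = fast_decisions.get(record_id)
        if fast_match is not None and fast_match == batch_match:
            tp += 1  # incremental system reproduced the batch match
        else:
            if fast_match is not None:
                fp += 1  # incremental system asserted a match batch did not make
            if batch_match is not None:
                fn += 1  # batch match that the incremental system missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up toy data, not results from the paper.
batch = {"r1": "e1", "r2": "e2", "r3": None, "r4": "e4"}
fast = {"r1": "e1", "r2": "e9", "r3": None, "r4": None}
precision, recall = precision_recall(fast, batch)
```

Under this definition, a record matched to the wrong entity counts against both precision and recall, while a record that both systems treat as new affects neither.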