Mining document collections to facilitate accurate approximate entity matching

Authors:
Surajit Chaudhuri; Venkatesh Ganti; Dong Xin
Affiliations:
Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA; Microsoft Research, Redmond, WA
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Many entity extraction techniques rely on large reference entity tables to identify entities in documents. Often, a document refers to an entity differently from the way it appears in the reference entity table. We therefore study the problem of determining whether a substring "approximately" matches a reference entity. Similarity measures that exploit the correlation between candidate substrings and reference entities across a large number of documents are known to be more robust than traditional standalone string-based similarity functions. However, such an approach poses significant efficiency challenges. In this paper, we adopt a new architecture and propose new techniques to address these challenges. We mine document collections and expand a given reference entity table with variations of each of its entities. The problem of approximately matching an input string against reference entities thus reduces to exact matching against the expanded reference table, which can be implemented efficiently. In an extensive experimental evaluation, we demonstrate the accuracy and scalability of our techniques.
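
The architecture described in the abstract (expand the reference table offline, then match exactly at lookup time) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the use of token subsequences as candidate variations, the co-occurrence ratio used as a stand-in for the correlation measure, and the thresholds `min_support` and `min_ratio`.

```python
import re
from itertools import combinations


def normalize(text):
    """Lowercase and keep word characters only, joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def candidate_variations(entity):
    """Hypothetical candidate generator: proper token subsequences of the name."""
    tokens = normalize(entity).split()
    for size in range(1, len(tokens)):
        for combo in combinations(tokens, size):
            yield " ".join(combo)


def expand_reference_table(entities, documents, min_support=2, min_ratio=0.6):
    """Map each accepted variation (and each full name) to its reference entity.

    A candidate is accepted when it occurs as a phrase in at least
    `min_support` documents and at least `min_ratio` of those documents
    also contain every token of the full entity name (an assumed,
    simplified correlation test).
    """
    docs = [normalize(d) for d in documents]
    doc_tokens = [set(d.split()) for d in docs]
    expanded = {normalize(e): e for e in entities}
    for entity in entities:
        entity_tokens = set(normalize(entity).split())
        for cand in candidate_variations(entity):
            if cand in expanded:
                continue  # keep the first mapping; ambiguity handling is out of scope
            mentions = [i for i, d in enumerate(docs) if f" {cand} " in f" {d} "]
            if len(mentions) < min_support:
                continue
            correlated = sum(1 for i in mentions if entity_tokens <= doc_tokens[i])
            if correlated / len(mentions) >= min_ratio:
                expanded[cand] = entity
    return expanded


def find_mentions(text, table, max_tokens=5):
    """Exact-match lookup of token n-grams against the expanded table."""
    tokens = normalize(text).split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest n-gram first so longer variations win.
        for length in range(min(max_tokens, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + length])
            if gram in table:
                found.append((gram, table[gram]))
                i += length
                break
        else:
            i += 1
    return found


entities = ["Canon EOS Digital Rebel XTi", "Microsoft Office Excel"]
documents = [
    "The Canon EOS Digital Rebel XTi is great. I love my Canon XTi.",
    "Canon XTi review: the Canon EOS Digital Rebel XTi SLR camera",
    "Canon printers are cheap",
    "Canon makes lenses too",
    "Canon cameras on sale",
]
table = expand_reference_table(entities, documents)
mentions = find_mentions("Just bought a Canon XTi and Microsoft Office Excel", table)
print(mentions)
```

In this toy corpus, a generic token such as "canon" appears in many documents that never mention the full entity, so the assumed ratio test rejects it, while "canon xti" only appears alongside the full name and is kept. Once the table is built, matching an input string needs only dictionary lookups, which is the efficiency argument the abstract makes for reducing approximate matching to exact matching.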