Many entity extraction techniques leverage large reference entity tables to identify entities in documents. Documents often refer to an entity differently from the way it appears in the reference entity table. We therefore study the problem of determining whether a substring "approximately" matches a reference entity. Similarity measures that exploit the correlation between candidate substrings and reference entities across a large number of documents are known to be more robust than traditional standalone string-based similarity functions. However, such an approach poses significant efficiency challenges. In this paper, we adopt a new architecture and propose new techniques to address these challenges. We mine document collections and expand a given reference entity table with variations of each of its entities. The problem of approximately matching an input string against reference entities thus reduces to exact matching against the expanded reference table, which can be implemented efficiently. In an extensive experimental evaluation, we demonstrate the accuracy and scalability of our techniques.
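The reduction from approximate matching to exact matching can be illustrated with a minimal Python sketch. It is not the paper's implementation: the variations are supplied by hand rather than mined from a document collection, and the names `build_expanded_table` and `extract_entities`, the whitespace tokenization, and the longest-match-first scan are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]


def build_expanded_table(variants_by_entity: Dict[str, List[str]]) -> Dict[Key, str]:
    """Map every known surface form (as a token tuple) to its reference entity.

    The variants are assumed to come from an offline mining step, which this
    sketch does not implement.
    """
    table: Dict[Key, str] = {}
    for entity, variants in variants_by_entity.items():
        for name in [entity, *variants]:
            table[tuple(name.lower().split())] = entity
    return table


def extract_entities(text: str, table: Dict[Key, str]) -> List[Tuple[str, str]]:
    """Scan the text and return (matched substring, reference entity) pairs.

    Each lookup is an exact hash-table probe; at every position the longest
    matching token sequence wins.
    """
    tokens = text.lower().split()
    max_len = max((len(key) for key in table), default=0)
    matches: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                matches.append((" ".join(key), table[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no entity starts at this token
    return matches


# Hypothetical reference entity with two hand-written variations.
table = build_expanded_table(
    {"Canon EOS Digital Rebel XTi SLR Camera": ["Canon Rebel XTi", "Rebel XTi"]}
)
print(extract_entities("I bought a Rebel XTi last week", table))
```

In this sketch all of the cost of approximate matching is moved into building the table; extraction itself only performs exact lookups, which is the property the abstract relies on for efficiency.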