Scalable ad-hoc entity extraction from text collections

Authors:
Sanjay Agrawal;Kaushik Chakrabarti;Surajit Chaudhuri;Venkatesh Ganti
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research;Microsoft Research
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Entity extraction from large document collections enables a variety of important data analysis tasks. In this paper, we introduce the "ad-hoc" entity extraction task, in which the entities of interest are constrained to come from a list of entities that is specific to the task. In such scenarios, traditional entity extraction techniques, which process all the documents for each ad-hoc entity extraction task, can be very expensive. We propose an efficient approach that leverages the inverted index on the documents to identify the subset of documents relevant to the task and processes only those documents. We demonstrate the efficiency of our techniques on real datasets.
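
The idea of using an inverted index to narrow down the documents before extraction can be illustrated with a minimal Python sketch. This is not the algorithm from the paper: the tokenizer, the function names (`build_inverted_index`, `candidate_docs`, `extract`), and the exact-phrase matching step are all assumptions made for illustration. The sketch intersects the posting lists of an entity's tokens to get candidate documents, then runs the phrase check only on those candidates instead of scanning the whole collection.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def candidate_docs(index, entity):
    # Documents containing every token of the entity (order ignored).
    tokens = tokenize(entity)
    if not tokens:
        return set()
    postings = sorted((index.get(t, set()) for t in tokens), key=len)
    result = set(postings[0])  # start from the shortest posting list
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return result


def contains_phrase(doc_tokens, entity_tokens):
    # True if entity_tokens occurs as a contiguous run in doc_tokens.
    n = len(entity_tokens)
    return any(
        doc_tokens[i:i + n] == entity_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def extract(docs, index, entities):
    # Return {doc_id: set of entities mentioned}, touching only candidates.
    mentions = defaultdict(set)
    token_cache = {}
    for entity in entities:
        entity_tokens = tokenize(entity)
        for doc_id in candidate_docs(index, entity):
            if doc_id not in token_cache:
                token_cache[doc_id] = tokenize(docs[doc_id])
            if contains_phrase(token_cache[doc_id], entity_tokens):
                mentions[doc_id].add(entity)
    return dict(mentions)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    docs = {
        1: "The Canon EOS 400D is a camera.",
        2: "Canon makes printers; EOS is a product line.",
        3: "Nothing relevant here.",
    }
    index = build_inverted_index(docs)
    print(extract(docs, index, ["Canon EOS 400D", "Nikon D80"]))
```

In this sketch the index is built once and reused across tasks, so each new entity list costs only posting-list intersections plus verification on the candidate documents. A document that contains all of an entity's tokens but not as a contiguous phrase is still a candidate, which is why the verification step is kept.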