Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Efficient string matching: an aid to bibliographic search
Communications of the ACM
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Information Extraction: Techniques and Challenges
SCIE '97 International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology
Factorizing complex predicates in queries to exploit indexes
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Web-scale information extraction in knowitall: (preliminary results)
Proceedings of the 13th international conference on World Wide Web
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A search engine for natural language applications
WWW '05 Proceedings of the 14th international conference on World Wide Web
Efficient Batch Top-k Search for Dictionary-based Entity Recognition
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
To search or to crawl?: towards a query optimizer for text-centric tasks
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
An efficient filter for approximate membership checking
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Exploiting web search to generate synonyms for entities
Proceedings of the 18th international conference on World wide web
Efficient approximate entity extraction with edit distance constraints
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Mining document collections to facilitate accurate approximate entity matching
Proceedings of the VLDB Endowment
Data-oriented content query system: searching for data into text on the web
Proceedings of the third ACM international conference on Web search and data mining
Query portals: dynamically generating portals for entity-oriented web queries
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Online annotation of text streams with structured entities
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient fuzzy full-text type-ahead search
The VLDB Journal — The International Journal on Very Large Data Bases
Pass-join: a partition-based method for similarity joins
Proceedings of the VLDB Endowment
A framework for robust discovery of entity synonyms
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
A partition-based method for string similarity joins with edit-distance constraints
ACM Transactions on Database Systems (TODS)
Extending string similarity join to tolerant fuzzy token matching
ACM Transactions on Database Systems (TODS)
Hi-index | 0.00 |
Supporting entity extraction from large document collections is important for enabling a variety of important data analysis tasks. In this paper, we introduce the "ad-hoc" entity extraction task where entities of interest are constrained to be from a list of entities that is specific to the task. In such scenarios, traditional entity extraction techniques that process all the documents for each ad-hoc entity extraction task can be significantly expensive. In this paper, we propose an efficient approach that leverages the inverted index on the documents to identify the subset of documents relevant to the task and processes only those documents. We demonstrate the efficiency of our techniques on real datasets.