Index structures for efficiently searching natural language text

Authors:
Pirooz Chubak;Davood Rafiei
Affiliations:
University of Alberta, Edmonton, AB, Canada;University of Alberta, Edmonton, AB, Canada
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Citing 16
Cited 0

Succinct indexable dictionaries with applications to encoding k-ary trees and multisets

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient phrase querying with an auxiliary index

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
Fast phrase querying with combined indexes

ACM Transactions on Information Systems (TOIS)
A search engine for natural language applications

WWW '05 Proceedings of the 14th international conference on World Wide Web
Indexing compressed text

Journal of the ACM (JACM)
Optimizing scoring functions and indexes for proximity search in type-annotated corpora

Proceedings of the 15th international conference on World Wide Web
A picture of search

InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Lightweight natural language text compression

Information Retrieval
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Compressed permuterm index

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Heavy-tailed distributions and multi-keyword queries

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval

Introduction to Information Retrieval
Data extraction from the web using wild card queries

Proceedings of the 18th ACM conference on Information and knowledge management
Suffix arrays on words

CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many existing indexes on text work at the document granularity and are not effective in answering the class of queries where the desired answer is only a term or a phrase. In this paper, we study some of the index structures that are capable of answering the class of queries referred to here as wild card queries and perform an analysis of their performance. Our experimental results on a large class of queries from different sources (including query logs and parse trees) and with various datasets reveal some of the performance barriers of these indexes. We then present Word Permuterm Index (WPI) which is an adaptation of the permuterm index for natural language text applications and show that this index supports a wide range of wild card queries, is quick to construct and is highly scalable. Our experimental resultS comparing WPI to alternative methods on a wide range oF wild card queries show a few orders of magnitude performancE improvements for WPI while the memory usage is kept the same for all compared systems.