Indexing Word Sequences for Ranked Retrieval

Authors:
Samuel Huston;J. Shane Culpepper;W. Bruce Croft
Affiliations:
University of Massachusetts Amherst;RMIT University;University of Massachusetts Amherst
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2014

Citing 53
Cited 0

Query evaluation: strategies and optimizations

Information Processing and Management: an International Journal
A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The space complexity of approximating the frequency moments

Journal of Computer and System Sciences
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Efficient algorithms for document retrieval problems

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
New directions in traffic measurement and accounting

Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Efficient query evaluation using a two-level retrieval process

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Fast phrase querying with combined indexes

ACM Transactions on Information Systems (TOIS)
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Term proximity scoring for ad-hoc retrieval on very large text collections

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Succinct data structures for flexible text retrieval systems

Journal of Discrete Algorithms
Accurate discovery of co-derivative documents via duplicate text detection

Information Systems
A taxonomy of suffix array construction algorithms

ACM Computing Surveys (CSUR)
Efficient document retrieval in main memory

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Incorporating term dependency in the dfr framework

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
A unified and discriminative model for query refinement

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Discovering key concepts in verbose queries

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Finding frequent items in data streams

Proceedings of the VLDB Endowment
Out of the Box Phrase Indexing

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
Space-optimal heavy hitters with strong error bounds

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On single-pass indexing with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Learning concept importance using a weighted dependence model

Proceedings of the third ACM international conference on Web search and data mining
Methods for finding frequent items in data streams

The VLDB Journal — The International Journal on Very Large Data Bases
Space-Efficient Framework for Top-k String Retrieval Problems

FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Information Retrieval: Implementing and Evaluating Search Engines

Information Retrieval: Implementing and Evaluating Search Engines
Using various term dependencies according to their utilities

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Engineering basic algorithms of an in-memory text search engine

ACM Transactions on Information Systems (TOIS)
Top-k ranked document search in general text databases

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Efficient indexing of repeated n-grams

Proceedings of the fourth ACM international conference on Web search and data mining
Modeling term proximity for probabilistic information retrieval models

Information Sciences: an International Journal
A cascade ranking model for efficient ranked retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Inverted indexes for phrases and strings

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Faster top-k document retrieval using block-max indexes

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Upper-bound approximations for dynamic pruning

ACM Transactions on Information Systems (TOIS)
A quasi-synchronous dependence model for information retrieval

Proceedings of the 20th ACM international conference on Information and knowledge management
Word-based self-indexes for natural language text

ACM Transactions on Information Systems (TOIS)
High-performance processing of text queries with tunable pruned term and term pair indexes

ACM Transactions on Information Systems (TOIS)
Field-weighted XML retrieval based on BM25

INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval
Approximate scalable bounded space sketch for large data NLP

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A five-level static cache architecture for web search engines

Information Processing and Management: an International Journal
Efficient in-memory top-k document retrieval

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Modeling higher-order term dependencies in information retrieval using query hypergraphs

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Space-Efficient top-k document retrieval

SEA'12 Proceedings of the 11th international conference on Experimental Algorithms
Towards an optimal space-and-query-time index for top-k document retrieval

CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Sketch algorithms for estimating point queries in NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch-based indexing of n-words

Proceedings of the 21st ACM international conference on Information and knowledge management
Compact query term selection using topically related text

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirements for the vocabulary of the index is only logarithmically linked to the size of the vocabulary. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.