Sketch-based indexing of n-words

Authors:
Samuel Huston;J. Shane Culpepper;W. Bruce Croft
Affiliations:
University of Massachusetts Amherst, Amherst, MA, USA;RMIT University, Melbourne, Australia;University of Massachusetts Amherst, Amherst, MA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Citing 11
Cited 1

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Discovering key concepts in verbose queries

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Finding frequent items in data streams

Proceedings of the VLDB Endowment
Space-optimal heavy hitters with strong error bounds

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Methods for finding frequent items in data streams

The VLDB Journal — The International Journal on Very Large Data Bases
Efficient indexing of repeated n-grams

Proceedings of the fourth ACM international conference on Web search and data mining
Modeling term proximity for probabilistic information retrieval models

Information Sciences: an International Journal
Field-weighted XML retrieval based on BM25

INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval
A five-level static cache architecture for web search engines

Information Processing and Management: an International Journal

Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing these types of statistics using standard inverted indexes requires unreasonable processing time or incurs a substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this paper, we present and analyze a new index structure designed to improve query efficiency in term dependency retrieval models, with bounded space requirements. By adapting a class of (ε,δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate various statistics important in term dependency models with low, probabilistically bounded error rates. The space requirements of the sketch index structure is largely independent of this size and the number of phrase term dependencies. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of all n-grams consisting of between 1 and 5 words extracted from the Clueweb-Part-B collection to less than 0.2% of the requirements of an equivalent full index. We show that n-gram queries of 5 words can be processed more efficiently than in current alternatives, such as next-word indexes. We show retrieval using the sketch index to be up to 400 times faster than with positional indexes, and 15 times faster than next-word indexes.