Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
A Markov random field model for term dependencies
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Data streams: algorithms and applications
Foundations and Trends® in Theoretical Computer Science
Discovering key concepts in verbose queries
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Finding frequent items in data streams
Proceedings of the VLDB Endowment
Space-optimal heavy hitters with strong error bounds
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Methods for finding frequent items in data streams
The VLDB Journal — The International Journal on Very Large Data Bases
Efficient indexing of repeated n-grams
Proceedings of the fourth ACM international conference on Web search and data mining
Modeling term proximity for probabilistic information retrieval models
Information Sciences: an International Journal
Field-weighted XML retrieval based on BM25
INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval
A five-level static cache architecture for web search engines
Information Processing and Management: an International Journal
Indexing Word Sequences for Ranked Retrieval
ACM Transactions on Information Systems (TOIS)
Hi-index | 0.00 |
Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing these types of statistics using standard inverted indexes requires unreasonable processing time or incurs a substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance. In this paper, we present and analyze a new index structure designed to improve query efficiency in term dependency retrieval models, with bounded space requirements. By adapting a class of (ε,δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate various statistics important in term dependency models with low, probabilistically bounded error rates. The space requirements of the sketch index structure is largely independent of this size and the number of phrase term dependencies. Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of all n-grams consisting of between 1 and 5 words extracted from the Clueweb-Part-B collection to less than 0.2% of the requirements of an equivalent full index. We show that n-gram queries of 5 words can be processed more efficiently than in current alternatives, such as next-word indexes. We show retrieval using the sketch index to be up to 400 times faster than with positional indexes, and 15 times faster than next-word indexes.