Compressing term positions in web indexes

Authors:
Hao Yan; Shuai Ding; Torsten Suel
Affiliation (all authors):
Polytechnic Institute of NYU, Brooklyn, NY, USA
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Abstract

Large search engines process thousands of queries per second over billions of pages, making query processing a major factor in their operating costs. This has motivated extensive research on improving query throughput through techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has addressed the compression of docID and frequency data, or of position information in other types of text collections. Compressing term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, which makes clustering effects harder to exploit. In addition, typical access patterns for position data differ from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.
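
To make the clustering point concrete, here is a minimal Python sketch of a generic baseline: positions are stored as gaps that restart at every document boundary, and each gap is written with a variable-byte code. This is not one of the techniques studied in the paper; the function names, the byte layout (high bit marks the last byte of a number), and the sample data are assumptions chosen only for illustration.

```python
from typing import Dict, List


def vbyte_encode(n: int) -> bytes:
    """Variable-byte code: 7 payload bits per byte, low-order bits first.

    Assumed convention: the high bit is set on the final byte of a number.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)  # continuation byte, high bit clear
        n >>= 7
    out.append(n | 0x80)      # final byte, high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Decode a concatenation of variable-byte codes."""
    numbers: List[int] = []
    value, shift = 0, 0
    for byte in data:
        if byte & 0x80:       # final byte of the current number
            value |= (byte & 0x7F) << shift
            numbers.append(value)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


def encode_positions(positions_by_doc: Dict[int, List[int]]) -> bytes:
    """Gap-encode the sorted positions of one term, document by document.

    The gap counter restarts at each document, because a position in one
    document says nothing about positions in the next.
    """
    out = bytearray()
    for doc_id in sorted(positions_by_doc):
        previous = 0
        for pos in positions_by_doc[doc_id]:  # assumed sorted ascending
            out += vbyte_encode(pos - previous)
            previous = pos
    return bytes(out)


def decode_positions(data: bytes, freqs: List[int]) -> List[List[int]]:
    """Rebuild per-document positions, given each document's term frequency."""
    gaps = vbyte_decode(data)
    result: List[List[int]] = []
    start = 0
    for freq in freqs:
        previous = 0
        doc_positions = []
        for gap in gaps[start:start + freq]:
            previous += gap
            doc_positions.append(previous)
        result.append(doc_positions)
        start += freq
    return result


if __name__ == "__main__":
    # Hypothetical postings for one term: docID -> word offsets in that document.
    postings = {1: [3, 5, 6, 400], 7: [2, 900, 903]}
    encoded = encode_positions(postings)
    print(len(encoded), "bytes")
    print(decode_positions(encoded, [len(p) for p in postings.values()]))
```

In this sketch the clustered occurrences (offsets 3, 5, 6 or 900, 903) produce one-byte gaps, while the first gap after a jump costs two bytes. Decoding also depends on the per-document frequencies, which reflects how position data is typically read alongside docID and frequency data.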