Inverted index compression via online document routing

Authors:
Gal Lavee;Ronny Lempel;Edo Liberty;Oren Somekh
Affiliations:
Technion, Haifa, Israel;Yahoo! Labs, Haifa, Israel;Yahoo! Labs, Haifa, Israel;Yahoo! Labs, Haifa, Israel
Venue:
Proceedings of the 20th international conference on World wide web
Year:
2011

Abstract

Modern search engines are expected to make documents searchable shortly after they appear on the ever-changing Web. To satisfy this requirement, the Web is crawled frequently. Because of the sheer size of their indexes, search engines distribute the crawled documents among thousands of servers in a scheme called local index partitioning, in which each server indexes only several million pages. Documents are commonly routed to servers at random so that, for load-balancing purposes, documents from the same host (e.g., www.nytimes.com) are distributed uniformly over the servers. To shorten the time between crawling a document and making it searchable, documents may simply be appended to the existing index partitions. However, indexing by merely appending documents results in larger indexes, since document reordering for index compactness is no longer performed. This, in turn, degrades query processing performance, which depends heavily on index size. A possible way to balance quick document indexing with efficient query processing is to deploy online document routing strategies that are designed to reduce index sizes. This work considers the effects of several online document routing strategies on the aggregate size of the partitioned index. We show that there is a tradeoff between the compression of a partitioned index and the distribution of documents from the same host across the index partitions (i.e., host distribution). We suggest and evaluate several online routing strategies with regard to their compression, host distribution, and complexity. In particular, we present a term-based routing algorithm that is shown analytically to provide better compression than the industry-standard random routing scheme. In addition, our algorithm demonstrates compression performance and host distribution comparable to those of other document routing heuristics while having much better running-time complexity. Our findings are validated by an experimental evaluation performed on a large benchmark collection of Web pages.
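
The abstract does not spell out the term-based routing algorithm, so the following is only a toy sketch of the general idea, not the authors' method. It assumes an append-only partitioned index in which each document receives the next local ID of its partition, and it uses the total Elias gamma length of the posting-list gaps as a rough stand-in for index size. The greedy overlap rule, the tie-break on partition size, and all function names (`route_random`, `route_by_term_overlap`, `build_partitions`, `index_bits`) are illustrative assumptions.

```python
import random
from collections import defaultdict


def gamma_bits(x):
    """Bits used by the Elias gamma code for an integer x >= 1."""
    return 2 * x.bit_length() - 1


def route_random(rng):
    """Content-agnostic baseline: pick a partition uniformly at random."""
    def choose(terms, vocab, sizes):
        return rng.randrange(len(vocab))
    return choose


def route_by_term_overlap(terms, vocab, sizes):
    """Hypothetical greedy rule (an assumption, not taken from the paper):
    prefer the partition whose vocabulary shares the most terms with the
    document, breaking ties in favour of the partition with fewer documents."""
    return max(range(len(vocab)),
               key=lambda p: (len(terms & vocab[p]), -sizes[p]))


def build_partitions(docs, num_partitions, choose):
    """Append each document (a set of terms) to the partition that choose() picks.

    Returns the per-partition posting lists and the partition sizes."""
    postings = [defaultdict(list) for _ in range(num_partitions)]
    vocab = [set() for _ in range(num_partitions)]
    sizes = [0] * num_partitions
    for terms in docs:
        p = choose(terms, vocab, sizes)
        local_id = sizes[p]  # append-only: next free ID in this partition
        for term in terms:
            postings[p][term].append(local_id)
        vocab[p] |= terms
        sizes[p] += 1
    return postings, sizes


def index_bits(postings):
    """Size proxy: gamma-coded gaps summed over all posting lists."""
    total = 0
    for partition in postings:
        for doc_ids in partition.values():
            previous = -1  # so the first gap is local_id + 1 >= 1
            for doc_id in doc_ids:
                total += gamma_bits(doc_id - previous)
                previous = doc_id
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic two-topic stream; the vocabulary is made up for illustration.
    topics = [{"news", "election", "vote"}, {"goal", "match", "league"}]
    docs = [set(rng.choice(topics)) | {f"w{rng.randrange(50)}"}
            for _ in range(200)]
    routers = [("random", route_random(rng)),
               ("term overlap", route_by_term_overlap)]
    for name, choose in routers:
        postings, sizes = build_partitions(docs, 2, choose)
        print(name, "bits:", index_bits(postings), "partition sizes:", sizes)
```

Printing the partition sizes next to the bit count gives a crude view of the tension the abstract describes: a rule that groups similar documents can shrink the gaps inside each partition, but it may also spread documents less evenly than random routing does.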