On-line routing of virtual circuits with applications to load balancing and machine scheduling
Journal of the ACM (JACM)
Online computation and competitive analysis
Online computation and competitive analysis
Cluster-based language models for distributed retrieval
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
WebBase: a repository of Web pages
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
ACM Transactions on Internet Technology (TOIT)
Binary Interpolative Coding for Effective Index Compression
Information Retrieval
Inverted file compression through document identifier reassignment
Information Processing and Management: an International Journal
Index Compression through Document Reordering
DCC '02 Proceedings of the Data Compression Conference
Assigning document identifiers to enhance compressibility of Web Search Engines indexes
Proceedings of the 2004 ACM symposium on Applied computing
Inverted Index Compression Using Word-Aligned Binary Codes
Information Retrieval
Inverted files for text search engines
ACM Computing Surveys (CSUR)
Query-driven document partitioning and collection selection
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
TSP and cluster-based solutions to the reassignment of document identifiers
Information Retrieval
Performance of compressed inverted list caching in search engines
Proceedings of the 17th international conference on World Wide Web
Inverted index compression and query processing with optimized document ordering
Proceedings of the 18th international conference on World wide web
Proceedings of the 18th international conference on World wide web
Sorting out the document identifier assignment problem
ECIR'07 Proceedings of the 29th European conference on IR research
Scalable techniques for document identifier assignment in inverted indexes
Proceedings of the 19th international conference on World wide web
Hi-index | 0.00 |
Modern search engines are expected to make documents searchable shortly after they appear on the ever changing Web. To satisfy this requirement, the Web is frequently crawled. Due to the sheer size of their indexes, search engines distribute the crawled documents among thousands of servers in a scheme called local index-partitioning, such that each server indexes only several million pages. To ensure documents from the same host (e.g., www.nytimes.com) are distributed uniformly over the servers, for load balancing purposes, random routing of documents to servers is common. To expedite the time documents become searchable after being crawled, documents may be simply appended to the existing index partitions. However, indexing by merely appending documents, results in larger index sizes since document reordering for index compactness is no longer performed. This, in turn, degrades search query processing performance which depends heavily on index sizes. A possible way to balance quick document indexing with efficient query processing, is to deploy online document routing strategies that are designed to reduce index sizes. This work considers the effects of several online document routing strategies on the aggregated partitioned index size. We show that there exists a tradeoff between the compression of a partitioned index and the distribution of documents from the same host across the index partitions (i.e., host distribution). We suggest and evaluate several online routing strategies with regard to their compression, host distribution, and complexity. In particular, we present a term based routing algorithm which is shown analytically to provide better compression results than the industry standard random routing scheme. In addition, our algorithm demonstrates comparable compression performance and host distribution while having much better running time complexity than other document routing heuristics. Our findings are validated by experimental evaluation performed on a large benchmark collection of Web pages.