Compressed web indexes

Authors:
Flavio Chierichetti;Ravi Kumar;Prabhakar Raghavan
Affiliations:
Sapienza University of Rome, Rome, Italy;Yahoo! Research, Sunnyvale, CA, USA;Yahoo! Research, Sunnyvale, CA, USA
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009

Abstract

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
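To make the setting concrete, the following is a minimal illustrative sketch, not the authors' method or analysis. It assumes terms are drawn independently from a Zipfian distribution, builds a Boolean inverted index (each term maps to the sorted list of documents containing it), and measures its size when document-identifier gaps are written with the Elias gamma code, a standard technique chosen here only for illustration. All function names, parameters, and default values are hypothetical.

```python
import random
from collections import defaultdict


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalised Zipfian weights: the term of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def build_boolean_index(num_docs, doc_len, vocab_size, exponent=1.0, seed=0):
    """Draw doc_len terms per document (assumed i.i.d. Zipfian) and record,
    for each term, the increasing list of documents that contain it."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size, exponent)
    terms = range(vocab_size)
    index = defaultdict(list)
    for doc_id in range(num_docs):
        # A Boolean index stores presence only, so repeated terms collapse.
        for term in set(rng.choices(terms, weights=weights, k=doc_len)):
            index[term].append(doc_id)
    return index


def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n."""
    return 2 * n.bit_length() - 1


def index_size_bits(index):
    """Total bits needed to gamma-encode the gaps of every posting list."""
    total = 0
    for postings in index.values():
        previous = -1  # so that the first gap, doc_id + 1, is positive
        for doc_id in postings:
            total += gamma_bits(doc_id - previous)
            previous = doc_id
    return total


if __name__ == "__main__":
    index = build_boolean_index(num_docs=2000, doc_len=100, vocab_size=5000)
    postings = sum(len(p) for p in index.values())
    bits = index_size_bits(index)
    print(f"postings: {postings}, bits per posting: {bits / postings:.2f}")
```

Changing `exponent` or `vocab_size` and re-running gives an informal feel for how the term distribution affects the number of bits per posting; it does not reproduce the bounds stated in the abstract.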