Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of building compressed indexes that permit such fast retrieval has a long history. We consider the following question: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes while still allowing fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
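To make the question above concrete, the following is a minimal sketch of an entropy-style space estimate for a Boolean index. It is not the analysis or the bounds of the paper. It assumes, purely for illustration, that each document consists of `doc_length` independent draws from a Zipfian distribution over `vocab_size` terms, and that the presence of each term in a document is treated as an independent bit. The function names, the exponent, and the sizes in the usage note are all hypothetical.

```python
import math


def zipf_probabilities(vocab_size, exponent=1.0):
    """Return term probabilities p_i proportional to 1 / i**exponent (i = rank)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def binary_entropy(q):
    """Entropy in bits of a single bit that is 1 with probability q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def estimated_bits_per_document(term_probs, doc_length):
    """Sum of per-term presence entropies for one document.

    Assumption (illustrative only): a document is doc_length independent
    draws from term_probs, and the presence bits of different terms are
    treated as independent of each other.
    """
    bits = 0.0
    for p in term_probs:
        # Probability that the term occurs at least once in the document.
        q = 1.0 - (1.0 - p) ** doc_length
        bits += binary_entropy(q)
    return bits
```

As a usage example, `estimated_bits_per_document(zipf_probabilities(100000), 500)` gives an estimate, in bits per document, for a hypothetical collection with 100,000 distinct terms and 500-term documents. Dividing that figure by the expected number of distinct terms per document gives a rough cost per posting under the same assumptions.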