On compressing the textual web

Authors:
Paolo Ferragina;Giovanni Manzini
Affiliations:
Università di Pisa, Pisa, Italy;Università del Piemonte Orientale, Alessandria, Italy
Venue:
Proceedings of the third ACM international conference on Web search and data mining
Year:
2010

Citing 41
Cited 9

Potential benefits of delta encoding and data compression for HTTP

SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Data compression with long repeated strings

Information Sciences: an International Journal - Dictionary based compression
Modern Information Retrieval

Modern Information Retrieval
Cluster-Based Delta Compression of a Collection of Files

WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Web Search for a Planet: The Google Cluster Architecture

IEEE Micro
Winnowing: local algorithms for document fingerprinting

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
PPM: One Step to Practicality

DCC '02 Proceedings of the Data Compression Conference
Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
The webgraph framework I: compression techniques

Proceedings of the 13th international conference on World Wide Web
Locality-sensitive hashing scheme based on p-stable distributions

SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
UbiCrawler: a scalable fully distributed web crawler

Software—Practice & Experience
Lexical and semantic clustering by web links

Journal of the American Society for Information Science and Technology - Special issue: Webometrics
Dictionaries using variable-length keys and data, with applications

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Boosting textual compression in optimal linear time

Journal of the ACM (JACM)
A web-based kernel function for measuring the similarity of short text snippets

Proceedings of the 15th international conference on World Wide Web
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Redundancy elimination within large collections of files

ATEC '04 Proceedings of the USENIX Annual Technical Conference
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Finding similar files in a large file system

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference
The engineering of a compression boosting library: theory vs practice in BWT compression

ESA'06 Proceedings of the 14th conference on Annual European Symposium - Volume 14
Fast generation of result snippets in web search

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Compressed permuterm index

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Index compression is good, especially for random access

Proceedings of the sixteenth ACM conference on information and knowledge management
A scalable pattern mining approach to web graph compression with communities

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Data challenges at Yahoo!

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Bigtable: A Distributed Storage System for Structured Data

ACM Transactions on Computer Systems (TOCS)
Performance of compressed inverted list caching in search engines

Proceedings of the 17th international conference on World Wide Web
Reorganizing compressed text

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Compressed text indexes: From theory to practice

Journal of Experimental Algorithmics (JEA)
Reducing the Storage Burden via Data Deduplication

Computer
On the bit-complexity of Lempel-Ziv compression

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
Inverted index compression and query processing with optimized document ordering

Proceedings of the 18th international conference on World wide web
On compressing social networks

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Compressing term positions in web indexes

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Sorting out the document identifier assignment problem

ECIR'07 Proceedings of the 29th European conference on IR research
A fast and compact web graph representation

SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Grammar-based codes: a new class of universal lossless source codes

IEEE Transactions on Information Theory

Data structures: time, I/Os, entropy, joules!

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Medium-space algorithms for inverse BWT

ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part I
Sample selection for dictionary-based corpus compression

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Distribution-aware compressed full-text indexes

ESA'11 Proceedings of the 19th European conference on Algorithms
Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections

Proceedings of the VLDB Endowment
Faster approximate pattern matching in compressed repetitive texts

ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
To index or not to index: time-space trade-offs in search engines with positional ranking functions

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Grammar precompression speeds up Burrows-Wheeler compression

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Trends in suffix sorting: a survey of low memory algorithms

ACSC '12 Proceedings of the Thirty-fifth Australasian Computer Science Conference - Volume 122

Abstract

Nowadays we know how to effectively compress most basic components of any modern search engine, such as the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study that has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms (e.g. gzip, or word-based Move-to-Front or Huffman coders) and conclude that, even compressed, raw data take more space than inverted lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider fast access to individual pages of a compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors, thus indicating the pros and cons of the available solutions. For the second scenario, we compare known compressed-storage solutions with the new algorithmic technology of compressed self-indexes [NM07]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of the datasets, and provide a threefold contribution: a non-trivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on inverted-list compression achieved by [Yan et al., SIGIR 09 and WWW 09].
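
The trade-off between the two scenarios in the abstract can be illustrated with a minimal sketch. It is not the authors' method or experimental setup: it uses Python's standard zlib module as a stand-in compressor, and the synthetic pages, the template, and the function names (make_pages, compress_per_page, compress_as_stream) are all assumptions made for illustration. Compressing the concatenated pages as one stream suits sequential scanning and can exploit redundancy shared across pages, while compressing each page separately allows any single page to be decoded on its own, usually at a cost in space.

```python
import zlib


def make_pages(n=50):
    """Build synthetic pages that share boilerplate (hypothetical data, not from the paper)."""
    template = (
        "<html><head><title>Page {i}</title></head><body>"
        "<div class='nav'>Home | About | Contact</div>"
        "<p>Item number {i} in the catalogue.</p></body></html>"
    )
    return [template.format(i=i).encode("utf-8") for i in range(n)]


def compress_per_page(pages):
    """Random-access style: each page is compressed alone and can be decoded independently."""
    return [zlib.compress(p, 9) for p in pages]


def compress_as_stream(pages):
    """Scanning style: all pages are concatenated and compressed as a single stream."""
    return zlib.compress(b"".join(pages), 9)


pages = make_pages()
raw_size = sum(len(p) for p in pages)

per_page = compress_per_page(pages)
per_page_size = sum(len(c) for c in per_page)

stream = compress_as_stream(pages)
stream_size = len(stream)

print("raw bytes:          ", raw_size)
print("per-page compressed:", per_page_size)
print("single stream:      ", stream_size)

# Random access: page 7 is recovered without touching any other page.
page_7 = zlib.decompress(per_page[7])
```

On this toy input the single stream is much smaller than the sum of the per-page outputs, because the shared boilerplate is encoded once rather than once per page. Real collections, compressors, and block sizes will behave differently.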