Potential benefits of delta encoding and data compression for HTTP
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Data compression with long repeated strings
Information Sciences: an International Journal - Dictionary based compression
Modern Information Retrieval
Cluster-Based Delta Compression of a Collection of Files
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
DCC '02 Proceedings of the Data Compression Conference
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
The webgraph framework I: compression techniques
Proceedings of the 13th international conference on World Wide Web
Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
UbiCrawler: a scalable fully distributed web crawler
Software—Practice & Experience
Lexical and semantic clustering by web links
Journal of the American Society for Information Science and Technology - Special issue: Webometrics
Dictionaries using variable-length keys and data, with applications
SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Boosting textual compression in optimal linear time
Journal of the ACM (JACM)
A web-based kernel function for measuring the similarity of short text snippets
Proceedings of the 15th international conference on World Wide Web
ACM Computing Surveys (CSUR)
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
The engineering of a compression boosting library: theory vs practice in BWT compression
ESA'06 Proceedings of the 14th conference on Annual European Symposium - Volume 14
Fast generation of result snippets in web search
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Index compression is good, especially for random access
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
A scalable pattern mining approach to web graph compression with communities
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Bigtable: A Distributed Storage System for Structured Data
ACM Transactions on Computer Systems (TOCS)
Performance of compressed inverted list caching in search engines
Proceedings of the 17th international conference on World Wide Web
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Compressed text indexes: From theory to practice
Journal of Experimental Algorithmics (JEA)
On the bit-complexity of Lempel-Ziv compression
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Detecting the origin of text segments efficiently
Proceedings of the 18th international conference on World wide web
Inverted index compression and query processing with optimized document ordering
Proceedings of the 18th international conference on World wide web
On compressing social networks
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Compressing term positions in web indexes
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Sorting out the document identifier assignment problem
ECIR'07 Proceedings of the 29th European conference on IR research
A fast and compact web graph representation
SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Grammar-based codes: a new class of universal lossless source codes
IEEE Transactions on Information Theory
Data structures: time, I/Os, entropy, joules!
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Medium-space algorithms for inverse BWT
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part I
Sample selection for dictionary-based corpus compression
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Distribution-aware compressed full-text indexes
ESA'11 Proceedings of the 19th European conference on Algorithms
Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections
Proceedings of the VLDB Endowment
Faster approximate pattern matching in compressed repetitive texts
ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
To index or not to index: time-space trade-offs in search engines with positional ranking functions
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Grammar precompression speeds up burrows---wheeler compression
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Trends in suffix sorting: a survey of low memory algorithms
ACSC '12 Proceedings of the Thirty-fifth Australasian Computer Science Conference - Volume 122
Hi-index | 0.00 |
Nowadays we know how to effectively compress most basic components of any modern search engine, such as, the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms--- e.g. gzip, or word-based Move-to-Front or Huffman coders-and conclude that, even compressed, raw data take more space than Inverted Lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider the fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors thus indicating pros and cons of the available solutions. For the second scenario, we compare known compressed-storage solutions with the new algorithmic technology of compressed self-indexes [NM07]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a non-trivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on InvertedList-compression achieved by [Yan et al, sigir 09 and www 09].