Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections

Authors:
Christopher Hoobin;Simon J. Puglisi;Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University;School of Computer Science and Information Technology, RMIT University and King's College London;University of Melbourne
Venue:
Proceedings of the VLDB Endowment
Year:
2011

Abstract

Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of the block that contains it. The choice of block size is critical: it trades compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods and in general gives much better compression.
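The dictionary-relative encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`build_dictionary`, `longest_match`, `rlz_encode`, `rlz_decode`), the evenly spaced sampling policy, the naive substring search, and the convention of storing an unmatched byte as a factor of length 0 are all assumptions made for illustration. A practical system would presumably use a suffix array or similar index over the dictionary and a compact encoding of the factors.

```python
from typing import List, Tuple

Factor = Tuple[int, int]  # (dictionary position, length); length 0 marks a literal byte


def build_dictionary(collection: bytes, dict_size: int, sample_len: int) -> bytes:
    """Concatenate evenly spaced samples of the collection (assumed sampling policy)."""
    n_samples = max(1, dict_size // sample_len)
    stride = max(1, len(collection) // n_samples)
    samples = (collection[i:i + sample_len] for i in range(0, len(collection), stride))
    return b"".join(samples)[:dict_size]


def longest_match(dictionary: bytes, text: bytes, start: int) -> Factor:
    """Return (position, length) of the longest prefix of text[start:] found in dictionary."""
    pos, length = 0, 0
    while start + length < len(text):
        # Any occurrence of the longer prefix is also an occurrence of the
        # shorter one, so the search can resume from the current position.
        nxt = dictionary.find(text[start:start + length + 1], pos)
        if nxt == -1:
            break
        pos, length = nxt, length + 1
    return pos, length


def rlz_encode(dictionary: bytes, doc: bytes) -> List[Factor]:
    """Greedily factorize doc into substrings of the dictionary."""
    factors: List[Factor] = []
    i = 0
    while i < len(doc):
        pos, length = longest_match(dictionary, doc, i)
        if length == 0:
            factors.append((doc[i], 0))  # byte absent from dictionary: store as literal
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(dictionary: bytes, factors: List[Factor]) -> bytes:
    """Rebuild a document from its factors using only the dictionary."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal byte value stored in the position field
        else:
            out += dictionary[pos:pos + length]
    return bytes(out)


# Toy data: two small "web pages" and a hand-made dictionary.
docs = [
    b"<html><body>hello world</body></html>",
    b"<html><body>goodbye world!</body></html>",
]
dictionary = b"<html><body></body></html> world"
encoded = [rlz_encode(dictionary, d) for d in docs]
decoded = [rlz_decode(dictionary, f) for f in encoded]
```

Because every factor refers only to the dictionary, each document is decoded independently of all others. This is what lets the approach avoid the block-size trade-off: retrieving one document needs the dictionary in memory and that document's factors, not the decompression of neighbouring documents.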