Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompressing the entire block that contains it. The choice of block size is critical: it trades off compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, in which each document is encoded relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods and in general gives much better compression.
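The idea of encoding documents relative to a dictionary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`longest_match`, `encode`, `decode`), the `(offset, length)` factor layout, and the convention that a length of 0 marks a literal character are all assumptions made for illustration. The match search here is a naive substring scan; a practical system would presumably index the dictionary (for example with a suffix array) and pack the factors into a compact binary format.

```python
from typing import List, Tuple

# A factor is (dictionary offset, length). By the convention assumed in this
# sketch, length 0 marks a literal and the first field holds its code point.
Factor = Tuple[int, int]


def longest_match(dictionary: str, text: str, start: int) -> Factor:
    """Return (offset, length) of the longest prefix of text[start:]
    that occurs somewhere in the dictionary (length 0 if none)."""
    best_pos, best_len = 0, 0
    length = 1
    # If a prefix of length k occurs in the dictionary, so does every shorter
    # prefix, so extending one character at a time finds the longest match.
    while start + length <= len(text):
        pos = dictionary.find(text[start:start + length])
        if pos < 0:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def encode(dictionary: str, text: str) -> List[Factor]:
    """Greedily factorize one document into references to the dictionary."""
    factors: List[Factor] = []
    i = 0
    while i < len(text):
        pos, length = longest_match(dictionary, text, i)
        if length == 0:
            # Character absent from the dictionary: emit it as a literal.
            factors.append((ord(text[i]), 0))
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(dictionary: str, factors: List[Factor]) -> str:
    """Rebuild one document using only the dictionary and its own factors."""
    parts = []
    for value, length in factors:
        parts.append(chr(value) if length == 0 else dictionary[value:value + length])
    return "".join(parts)


if __name__ == "__main__":
    dictionary = "compression of large document collections"
    docs = ["large collections", "document compression", "fast access"]
    encoded = [encode(dictionary, doc) for doc in docs]
    # Random access: document 1 is decoded without touching documents 0 or 2.
    print(decode(dictionary, encoded[1]))
```

Because every factor points only into the shared dictionary and never into other documents, each document can be decoded independently. That independence is what removes the block-size trade-off described above: retrieving one document needs no neighbouring data to be decompressed.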