A general-purpose compression scheme for large collections

Authors:
Adam Cannane;Hugh E. Williams
Affiliations:
RMIT University, Melbourne, Victoria, Australia;RMIT University, Melbourne, Victoria, Australia
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2002

Abstract

Compression of large collections can improve retrieval times, because the savings in seeking and retrieving data from disk can offset the CPU cost of decompression. We propose a semistatic phrase-based approach called xray that builds a model offline using sample training data extracted from a collection, and then compresses the entire collection online in a single pass. The particular benefits of xray are that it can be used in applications where individual records or documents must be decompressed, and that decompression is fast. The xray scheme also allows new data to be added to a collection without modifying the semistatic model. Moreover, xray can be used to compress general-purpose data such as genomic, scientific, image, and geographic collections without prior knowledge of the structure of the data. We show that xray is effective on both text and general-purpose collections. In general, xray is more effective than the popular gzip and compress schemes, while being marginally less effective than bzip2. We also show that xray is efficient: of the popular schemes we tested, it is typically slower only than gzip in decompression. Moreover, the query evaluation cost of retrieving documents from a large collection with our search engine is improved by more than 30% when xray is incorporated, compared to an uncompressed approach. We use simple techniques for obtaining the training data from the collection to be compressed and show that, with a sample of just over 4% of the data, the entire collection can be effectively compressed. We also propose four schemes for phrase-match selection during the single-pass compression of the collection. We conclude that with these novel approaches xray is a fast and effective scheme for compression and decompression of large general-purpose collections.
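To make the semistatic idea concrete, the following is a minimal sketch of a phrase-based compressor trained on a sample; it is not the xray algorithm itself. The names `PhraseModel` and `build_model`, the parameters `max_len` and `max_phrases`, the frequency-based phrase ranking, and the greedy longest-match selection are all assumptions made for illustration. The abstract does not specify how xray builds its model or which four phrase-match schemes it uses. The sketch also emits a list of integer symbols rather than a bit-level code.

```python
from collections import Counter


class PhraseModel:
    """Static phrase table (hypothetical helper, not xray's actual model)."""

    def __init__(self, phrases, max_len):
        self.phrases = phrases                        # symbol id -> bytes
        self.index = {p: i for i, p in enumerate(phrases)}
        self.max_len = max_len

    def compress(self, record):
        """Encode one record with greedy longest-match (an assumed strategy)."""
        out, pos = [], 0
        while pos < len(record):
            # Try the longest candidate first; a single byte always matches.
            for n in range(min(self.max_len, len(record) - pos), 0, -1):
                sym = self.index.get(record[pos:pos + n])
                if sym is not None:
                    out.append(sym)
                    pos += n
                    break
        return out

    def decompress(self, symbols):
        """Decode one record independently of every other record."""
        return b"".join(self.phrases[s] for s in symbols)


def build_model(sample_records, max_len=4, max_phrases=256):
    """Offline step: count short byte n-grams in the sample, keep the best."""
    counts = Counter()
    for rec in sample_records:
        for n in range(2, max_len + 1):
            for i in range(len(rec) - n + 1):
                counts[rec[i:i + n]] += 1
    # Rank by a rough estimate of bytes saved; ties broken by phrase value.
    ranked = sorted(counts, key=lambda p: (-(len(p) - 1) * counts[p], p))
    # All 256 single bytes are included so unseen data can still be encoded.
    phrases = [bytes([b]) for b in range(256)] + ranked[:max_phrases]
    return PhraseModel(phrases, max_len)


# Train on a small sample, then compress records in a single pass.
sample = [b"the cat sat on the mat", b"the dog sat on the log"]
model = build_model(sample)

record = b"the cat sat on the log"          # resembles the sample
encoded = model.compress(record)
restored = model.decompress(encoded)

unseen = b"\x00\xffxyz"                     # added later, model unchanged
unseen_restored = model.decompress(model.compress(unseen))
```

Because the table is fixed after training, each record decodes on its own, and records added later can be encoded without rebuilding the model. Those are the two properties the abstract highlights. How well the sketch compresses depends entirely on how representative the sample is.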