A text compression scheme that allows fast searching directly in the compressed file

Authors:
Udi Manber
Affiliations:
Univ. of Arizona, Tucson
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
1997


Abstract

A new text compression scheme is presented in this article. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified, and only the outcome of matching the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster than a search in the original file, in terms of both I/O time and processing time. For typical text files, we achieve about a 30% reduction in space and a slightly smaller reduction in search time. A 30% space saving is not competitive with good text compression schemes, and thus the scheme should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.
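
The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea (modify the pattern, search the compressed bytes with an unmodified matcher, decompress only around candidate matches). It assumes ASCII input and a pair-substitution code in which no byte can be both the first byte of one pair and the second byte of another. The pair table `PAIRS` and the helper names `compress`, `decompress`, `modify_pattern`, and `search` are all hypothetical and chosen for illustration; Python's built-in `bytes.find` stands in for the black-box string matcher.

```python
# Sketch only: the pair table is a hypothetical example, not taken from the article.
# Assumes ASCII text and patterns, so byte values 128..255 are free to act as pair codes.
PAIRS = [b"th", b"in", b"an", b"er", b"en", b"ed"]
FIRST = {p[0] for p in PAIRS}    # bytes that may start a pair
SECOND = {p[1] for p in PAIRS}   # bytes that may end a pair
assert not (FIRST & SECOND), "pairs must not overlap"

ENCODE = {pair: 128 + i for i, pair in enumerate(PAIRS)}
DECODE = {code: pair for pair, code in ENCODE.items()}


def compress(data: bytes) -> bytes:
    """Replace each listed pair with its one-byte code, scanning left to right."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = ENCODE.get(data[i:i + 2])
        if code is not None:
            out.append(code)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress(data: bytes) -> bytes:
    """Expand every pair code back into its two original bytes."""
    out = bytearray()
    for b in data:
        out += DECODE.get(b, bytes([b]))
    return bytes(out)


def modify_pattern(pattern: bytes):
    """Return (needle, lead, trail): the compressed core plus the stripped edge bytes.

    A leading byte that could end a pair, or a trailing byte that could start
    one, may be merged with a neighbour in the file, so it is set aside and
    checked after the search instead of being searched for.
    """
    lead = trail = b""
    core = pattern
    if core and core[0] in SECOND:
        lead, core = core[:1], core[1:]
    if core and core[-1] in FIRST:
        core, trail = core[:-1], core[-1:]
    if not core:
        raise ValueError("pattern is too short for this sketch")
    return compress(core), lead, trail


def search(compressed: bytes, pattern: bytes) -> list:
    """Return compressed-file offsets of the pattern's core for each occurrence."""
    needle, lead, trail = modify_pattern(pattern)
    hits = []
    start = compressed.find(needle)  # black box: any exact matcher works here
    while start != -1:
        end = start + len(needle)
        # Decompress only the single code on each side of the candidate.
        before = decompress(compressed[max(0, start - 1):start])
        after = decompress(compressed[end:end + 1])
        if before.endswith(lead) and after.startswith(trail):
            hits.append(start)
        start = compressed.find(needle, start + 1)
    return hits


text = b"the catalog entry mentions another thin antenna"
compressed = compress(text)
print(len(text), len(compressed))
print(search(compressed, b"thin"), search(compressed, b"nte"))
```

Because the pairs do not overlap in this sketch, the interior of a pattern compresses the same way wherever it appears in the file; only its first and last bytes depend on context, which is why they are verified separately rather than included in the search string.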