Efficient String Matching in Huffman Compressed Texts

Authors:
Kimmo Fredriksson;Jorma Tarhio
Affiliations:
Department of Computer Science, University of Joensuu, PO Box 111, 80101 Joensuu, Finland;Department of CSE, Helsinki University of Technology, PO Box 5400, 02015 HUT, Finland
Venue:
Fundamenta Informaticae
Year:
2004

Abstract

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $O(b / (H \log_2 \sigma))$ characters, where $H$ is the entropy of the text. Each super-character is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b 2^b)$ preprocessing time. The method can easily be augmented with auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / (H w))$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O(n (\log_2 \sigma / b) \lceil (m+s-1)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / (H \log_2 \sigma))$. The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language, taking the words (the vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.
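
The idea of reading b bits at a time and treating the decoded symbols as one variable-size super-character can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_window_table`, `encode`, `decode`), the toy code table, and the use of a string of '0'/'1' characters instead of packed bytes are all assumptions made for readability. The sketch also assumes a complete prefix code and a window width b no shorter than the longest codeword, so that every window contains at least one whole codeword.

```python
def build_window_table(code, b):
    """Precompute, for each of the 2**b possible b-bit windows, the symbols
    of the complete codewords the window starts with and the number of bits
    those codewords occupy. Takes O(b * 2**b) steps."""
    if b < max(len(cw) for cw in code.values()):
        raise ValueError("b must be at least the longest codeword length")
    by_codeword = {cw: sym for sym, cw in code.items()}
    table = []
    for value in range(1 << b):
        bits = format(value, "0{}b".format(b))
        symbols, start, consumed = [], 0, 0
        for end in range(1, b + 1):
            # In a prefix code the first match from `start` is the codeword.
            sym = by_codeword.get(bits[start:end])
            if sym is not None:
                symbols.append(sym)
                start = consumed = end
        table.append(("".join(symbols), consumed))
    return table


def encode(text, code):
    """Concatenate the codewords of the characters of `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits, n_symbols, table, b):
    """Decode `n_symbols` symbols with one table lookup per window.
    Each lookup yields a variable number of symbols (a super-character)
    and a variable shift, so the next window starts on a codeword boundary."""
    padded = bits + "0" * b  # padding so the last window is b bits wide
    out, pos, count = [], 0, 0
    while count < n_symbols:
        symbols, consumed = table[int(padded[pos:pos + b], 2)]
        out.append(symbols)
        count += len(symbols)
        pos += consumed
    # Padding bits may decode to extra symbols; keep only the real ones.
    return "".join(out)[:n_symbols]


# Hypothetical toy code (complete and prefix-free), chosen for illustration.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
TABLE = build_window_table(CODE, 4)
```

With this toy code and b = 4, the 14-bit encoding of "abacabad" is decoded in four table lookups rather than eight per-symbol steps. An auxiliary function such as an automaton transition could be applied once per lookup instead of appending to the output, which is the kind of augmentation the abstract describes.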