Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts

Authors: Dana Shapira; Ajay Daptardar
Affiliations: Computer Science Department, Brandeis University, Waltham, MA (both authors)
Venue: Information Processing and Management: an International Journal
Year: 2006

Citing 11

Bidirectional Huffman coding
The Computer Journal

Fast text searching: allowing errors
Communications of the ACM

Let sleeping files lie: pattern matching in Z-compressed files
Journal of Computer and System Sciences

A text compression scheme that allows fast searching directly in the compressed file
ACM Transactions on Information Systems (TOIS)

Fast and flexible word searching on compressed text
ACM Transactions on Information Systems (TOIS)

Generating a canonical prefix encoding
Communications of the ACM

Skeleton Trees for the Efficient Decoding of Huffman Encoded Texts
Information Retrieval

In-Place Calculation of Minimum-Redundancy Codes
WADS '95 Proceedings of the 4th International Workshop on Algorithms and Data Structures

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts
SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval

A New Compression Method for Compressed Matching
DCC '00 Proceedings of the Conference on Data Compression

Pattern Matching in Huffman Encoded Texts
DCC '01 Proceedings of the Data Compression Conference

Abstract

In this work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098-1101]. A modified Knuth-Morris-Pratt (KMP) algorithm is used to overcome the problem of false matches, i.e., occurrences of the encoded pattern in the encoded text that do not correspond to occurrences of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected; the table is defined so that the start of the encoded pattern can always be aligned with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes that handle more than a single bit per machine operation: skeleton trees, defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7-23], and numerical comparisons between special canonical values and portions of a sliding window, presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200-1207]. Experiments show that our algorithms search rapidly compared to the "decompress then search" method; files can therefore be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
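
The false-match problem described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it runs a textbook KMP search over the bits and then filters the hits by walking the encoded text to find codeword boundaries, whereas the paper builds the alignment into the KMP table itself. The toy prefix code `CODE` and all function names are assumptions made for illustration.

```python
# Toy prefix code (hypothetical, chosen only for this illustration).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(text, code):
    """Concatenate the codewords of each character into one bit string."""
    return "".join(code[ch] for ch in text)


def failure_table(pattern):
    """Standard KMP failure function: longest proper border of each prefix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return every start offset of a non-empty pattern in text."""
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


def codeword_starts(bits, code):
    """Map each bit offset where a codeword begins to its character index."""
    decode = {cw: sym for sym, cw in code.items()}
    starts, begin, n_chars = {}, 0, 0
    for end in range(1, len(bits) + 1):
        if bits[begin:end] in decode:
            starts[begin] = n_chars
            n_chars += 1
            begin = end
    return starts


def search_encoded(text_bits, pattern_bits, code):
    """Keep only bit-level hits that start on a codeword boundary.

    Because the code is prefix-free, a hit aligned to a codeword start
    decodes to exactly the pattern, so it is a true match.
    Returns (bit_offset, character_index) pairs.
    """
    starts = codeword_starts(text_bits, code)
    return [(pos, starts[pos])
            for pos in kmp_find_all(text_bits, pattern_bits)
            if pos in starts]
```

With this code, `encode("cabad", CODE)` is `"1100100111"` and the encoded pattern `"ba"` is `"100"`. The plain bit search reports offsets 1 and 4, but offset 1 lies inside the codeword for `c` and is a false match; `search_encoded` keeps only offset 4, which is character index 2 in the original text. The boundary walk here amounts to a full decoding pass, which is the cost the paper's preprocessed table is designed to avoid.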