Text compression
Bounds on the Complexity of the Longest Common Subsequence Problem
Journal of the ACM (JACM)
An experimental study of an opportunistic index
SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Pattern Matching in Text Compressed with the ID Heuristic
DCC '98 Proceedings of the Conference on Data Compression
Multiple Pattern Matching in LZW Compressed Text
DCC '98 Proceedings of the Conference on Data Compression
Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search
DCC '02 Proceedings of the Data Compression Conference
Indexing text using the Ziv-Lempel trie
Journal of Discrete Algorithms - SPIRE 2002
Hi-index | 0.00 |
Data compression is popularly applied to computer systems and communication systems in order to reduce storage size and communication time, respectively. Since large data are used frequently, string matching for such data takes a long time. If the data are compressed, the time gets much longer because decompression is necessary. Long string matching time makes computer virus scan time longer and gives serious influence to the security of data. From this, CPM (Compression Pattern Matching) methods for several compression methods have been proposed. This paper proposes CPM method for PPM which achieves fast virus scan and improves dependability of the compressed data, where PPM is based on a Markov model, uses a context information, and achieves a better compression ratio than BW transform and Ziv-Lempel coding. The proposed method encodes the context information, which is generated in the compression process, and appends the encoded data at the beginning of the compressed data as a header. The proposed method uses only the header information. Computer simulation says that augmentation of the compression ratio is less than 5 percent if the order of the PPM is less than 5 and the source file size is more than 1 M bytes, where order is the maximum length of the context used in PPM compression. String matching time is independent of the source file size and is very short, less than 0.3 micro seconds in the PC used for the simulation.