We present techniques for processing text files "as is", that is, without first modifying them. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the "as-is" principle. Another example is string matching over multi-byte character texts, a significant problem common to East Asian languages such as Japanese, Korean, Chinese, and Taiwanese. A text file in such a language is a mixture of single-byte and multi-byte characters. A naive solution would be either (1) to convert the given text into a fixed-length encoding and then apply any string matching routine to it, or (2) to search the text file directly, byte by byte, for the encoding of the pattern, which requires extra work for synchronization to avoid false detections. Both solutions, however, sacrifice search speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detections. The technique is applicable to any prefix code, such as the Huffman code and variants of Unicode. We also generalize the technique to handle structured texts such as XML documents. Using this technique, we can avoid false detection of a keyword even when it occurs as a substring of a tag name or of an attribute description, without any sacrifice in search speed.
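
To make the false-detection problem concrete, the following is a minimal sketch of naive solution (2): a byte-wise search followed by a synchronization check. It is not the algorithm the abstract refers to. The encoding is a toy prefix code assumed purely for illustration (a byte below 0x80 is a one-byte character, any other lead byte starts a two-byte character), and the function names `char_len`, `char_boundaries`, `bytewise_find`, and `synchronized_find` are hypothetical.

```python
def char_len(lead):
    """Byte length of the character whose first byte is `lead` (toy code, assumed)."""
    return 1 if lead < 0x80 else 2


def char_boundaries(text):
    """Offsets at which a character begins, found by decoding from the start."""
    starts = set()
    i = 0
    while i < len(text):
        starts.add(i)
        i += char_len(text[i])
    return starts


def bytewise_find(text, pattern):
    """All byte offsets where `pattern` occurs, ignoring character boundaries."""
    hits = []
    i = text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def synchronized_find(text, pattern):
    """Keep only the occurrences that begin on a character boundary."""
    starts = char_boundaries(text)
    return [i for i in bytewise_find(text, pattern) if i in starts]


# Characters: 41 | B0 A4 | A2 C1 | A4 A2 | 42
text = bytes([0x41, 0xB0, 0xA4, 0xA2, 0xC1, 0xA4, 0xA2, 0x42])
pattern = bytes([0xA4, 0xA2])  # one two-byte character

raw_hits = bytewise_find(text, pattern)       # includes a hit straddling two characters
true_hits = synchronized_find(text, pattern)  # only the occurrence at a character start
```

In this example the byte-wise search reports offsets 2 and 5, but offset 2 straddles the second and third characters, so only offset 5 is a true occurrence. The separate boundary pass over the whole text is the kind of extra synchronization work that slows this approach down.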