On bit-parallel processing of multi-byte text

Authors:
Heikki Hyyrö;Jun Takaba;Ayumi Shinohara;Masayuki Takeda
Affiliations:
PRESTO, Japan Science and Technology Agency (JST);Department of Informatics, Kyushu University 33, Fukuoka, Japan;PRESTO, Japan Science and Technology Agency (JST);Department of Informatics, Kyushu University 33, Fukuoka, Japan
Venue:
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Year:
2004

Citing 14
Cited 2

A bit-string longest-common-subsequence algorithm

Information Processing Letters
A new approach to text searching

Communications of the ACM
Text algorithms

Text algorithms
A fast bit-vector algorithm for approximate string matching based on dynamic programming

Journal of the ACM (JACM)
A fast string searching algorithm

Communications of the ACM
Efficient string matching: an aid to bibliographic search

Communications of the ACM
A technique for computer detection and correction of spelling errors

Communications of the ACM
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Fast and flexible string matching by combining bit-parallelism and suffix automata

Journal of Experimental Algorithmics (JEA)
NR-grep: a fast and flexible pattern-matching tool

Software—Practice & Experience
A fast and practical bit-vector algorithm for the longest common subsequence problem

Information Processing Letters
Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences

Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences
Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Faster Bit-Parallel Approximate String Matching

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching

Flexible Framework for Time-Series Pattern Matching over Multi-dimension Data Stream

New Frontiers in Applied Data Mining
Efficient longest common subsequence computation using bulk-synchronous parallelism

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V

Abstract

Practical bit-parallel algorithms exist for several types of pair-wise string processing, such as longest common subsequence computation and approximate string matching. These algorithms typically use a table of σ match bit-vectors, where σ is the alphabet size and the bits in the vector for a character λ identify the positions at which λ occurs in one of the processed strings. The time and space cost of computing the match table is not prohibitive for reasonably small alphabets such as ASCII. For general Unicode text, however, the range of possible character codes is roughly one million, which makes a simple table impractical. In this paper we evaluate three schemes for overcoming this problem. First, we propose replacing the character code table with a character code automaton. We then compare this method with two other schemes: a hash table, and the binary-search-based solution proposed by Wu, Manber and Myers [25]. We find that the best choice is either the automaton-based method or a hash table.
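To make the idea of a match table concrete, the following is a minimal sketch of the hash-table scheme mentioned in the abstract, not the authors' implementation. It assumes Python, whose built-in dict serves as the hash table and whose unbounded integers serve as bit-vectors. The function names `build_match_table` and `lcs_length` are hypothetical. The update step is the well-known bit-vector recurrence for longest common subsequence length; the paper's automaton-based and binary-search-based schemes are not shown.

```python
def build_match_table(pattern):
    """Map each character of `pattern` to a bit-vector of its positions.

    Only characters that actually occur in `pattern` get an entry, so the
    table size depends on the pattern, not on the size of the code range.
    """
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, 0) | (1 << i)
    return table


def lcs_length(a, b):
    """Length of the longest common subsequence of `a` and `b`.

    Bit i of `v` stays 1 while row i of the dynamic programming matrix
    has not yet contributed an increment; each 0 bit counts one unit of
    LCS length.
    """
    m = len(a)
    mask = (1 << m) - 1          # keep only the low m bits
    match = build_match_table(a)
    v = mask                     # all ones: LCS length is 0 so far
    for ch in b:
        pm = match.get(ch, 0)    # characters absent from `a` match nowhere
        u = v & pm
        v = ((v + u) | (v & ~pm)) & mask
    return m - bin(v).count("1")


if __name__ == "__main__":
    # The two strings differ in one character, so six characters are shared.
    print(lcs_length("日本語テキスト", "日本のテキスト"))
```

Because the lookup key is the character itself, the table never needs an entry for the roughly one million codes that do not appear in the pattern, which is the property that makes a hash table a candidate replacement for a direct-indexed table.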