The Burrows-Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. Full-text search on BWT-encoded text can be performed very efficiently using backward search. This paper studies different approaches for applying BWT to multi-byte encoded (e.g. UTF-16) text documents. Previous work has studied BWT on word-based models, and BWT can be applied directly to multi-byte encodings by treating the document as single-byte coded, but there has been no extensive study of how to use BWT on multi-byte encoded documents for efficient full-text search. We therefore propose several ways to backward search multi-byte text documents efficiently, and demonstrate our findings using Chinese text documents. Our experimental results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.
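To make the two views of a multi-byte document concrete, here is a minimal sketch of backward search (counting pattern occurrences) over a BWT. It is not the method proposed in the paper: the function names `bwt`, `build_tables`, and `backward_count` are hypothetical, the transform is built by naive suffix sorting, and the occurrence table is stored uncompressed, so it only suits small inputs. The same routines are run once with whole characters as symbols and once with the individual bytes of the UTF-16 (big-endian) encoding as symbols.

```python
from collections import Counter


def bwt(symbols, sentinel):
    """Naive BWT of a symbol sequence (illustration only, not efficient).

    `sentinel` is an assumed end marker that must compare smaller than
    every symbol in the input.
    """
    s = list(symbols) + [sentinel]
    # Sort suffix start positions by the suffix they begin.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT is the symbol preceding each sorted suffix.
    return [s[i - 1] for i in order]


def build_tables(last):
    """Build the two tables that backward search needs.

    first[c]  = number of symbols in `last` that sort before c
    occ[c][i] = number of occurrences of c in last[:i]
    """
    counts = Counter(last)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    occ = {c: [0] for c in counts}
    for ch in last:
        for c, column in occ.items():
            column.append(column[-1] + (c == ch))
    return first, occ


def backward_count(pattern, last):
    """Count occurrences of `pattern`, consuming it right to left."""
    first, occ = build_tables(last)
    lo, hi = 0, len(last)  # half-open range of matching sorted suffixes
    for c in reversed(list(pattern)):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


text = "中文文本中文"

# View 1: each character is one symbol.
char_bwt = bwt(text, "\0")
print(backward_count("中文", char_bwt))  # 2

# View 2: the UTF-16 encoding treated as a single-byte coded document.
byte_bwt = bwt(text.encode("utf-16-be"), -1)
print(backward_count("中文".encode("utf-16-be"), byte_bwt))  # 2
```

In this sketch the character view needs one backward-search step per character, while the byte view needs two steps per UTF-16 code unit over a smaller alphabet. The byte view can also report hits that straddle a character boundary: the bytes `2D 65` of U+2D65 occur between 中 (`4E 2D`) and 文 (`65 87`), so a byte-level search may need an alignment check that the character-level search does not.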