Full-text search on multi-byte encoded documents

Authors:
Raymond K. Wong;Fengming Shi;Nicole Lam
Affiliations:
University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia
Venue:
Proceedings of the 2012 ACM symposium on Document engineering
Year:
2012

Abstract

The Burrows-Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. Full-text search on BWT-encoded text can be performed very efficiently using backward search. This paper studies different approaches for applying the BWT to multi-byte encoded (e.g., UTF-16) text documents. While previous work has studied the BWT on word-based models, and the BWT can be applied directly to multi-byte encodings (by treating the document as a sequence of single bytes), there has been no extensive study of how to use the BWT on multi-byte encoded documents for efficient full-text search. We therefore propose several ways to perform efficient backward search on multi-byte text documents, and we demonstrate our findings using Chinese text documents. Our experimental results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.
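
To make the idea of backward search concrete, the following is a minimal sketch of the textbook technique. It is not the authors' implementation or any of their proposed extensions. It builds the BWT naively by sorting rotations and counts pattern occurrences with a linear-scan rank function, which is only practical for tiny inputs. The names `SENTINEL`, `bwt`, `backward_search`, `as_bytes`, and `as_code_units` are hypothetical helpers introduced for illustration. The same search routine is run over two views of a UTF-16 document: a sequence of single bytes, and a sequence of 16-bit code units.

```python
from collections import Counter

SENTINEL = -1  # hypothetical end marker; sorts before every real symbol


def bwt(symbols):
    """Return the BWT (last column) of a symbol sequence.

    Naive construction: sort all rotations of the text plus a sentinel.
    """
    s = list(symbols) + [SENTINEL]
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    return [s[(i - 1) % n] for i in order]


def backward_search(last, pattern):
    """Count occurrences of `pattern` in the original text, given its BWT."""
    counts = Counter(last)
    # C[c] = number of symbols in the text that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Number of occurrences of c in last[:i]; a linear scan for clarity.
        return sum(1 for x in last[:i] if x == c)

    lo, hi = 0, len(last)  # half-open range of matching sorted rotations
    for c in reversed(list(pattern)):  # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


def as_bytes(text):
    """View a string as single bytes of its UTF-16 (big-endian) encoding."""
    return list(text.encode("utf-16-be"))


def as_code_units(text):
    """View a string as 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    doc = "中文搜索和中文检索"
    for view in (as_bytes, as_code_units):
        index = bwt(view(doc))
        print(view.__name__, backward_search(index, view("中文")))
```

One design point this sketch exposes, stated here as an observation rather than a result from the paper: in the byte view a pattern can match across a character boundary. For example, the UTF-16 bytes of "一丁" (U+4E00, U+4E01) contain the byte pair `00 4E`, which is also the encoding of "N", so a byte-level search for "N" reports a hit that the code-unit view does not.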