A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays

Authors:
Wing-Kai Hon; Tak-Wah Lam; Kunihiko Sadakane; Wing-Kin Sung; Siu-Ming Yiu
Affiliations:
Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong; Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong; Department of Computer Science and Communication Engineering, Kyushu University, Kyushu, Japan; School of Computing, National University of Singapore, Singapore, Singapore; Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
Venue:
Algorithmica
Year:
2007

Abstract

With the first human DNA decoded into a sequence of about 2.8 billion characters, much biological research has centered on analyzing this sequence. Theoretically speaking, it is now feasible to hold an index for human DNA in main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduce the space requirement from O(n log n) bits to O(n) bits. However, constructing a compressed suffix array is still not easy, because the suffix array must be computed first, which needs a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory and runs in O(n log n) time. The algorithm is also time and space efficient for texts with large alphabets, such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), independent of |Σ|.
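
To give a feel for the gap between O(n log n) bits and O(n log |Σ|) bits, the following is a rough back-of-the-envelope sketch, not the paper's algorithm or its accounting. It assumes a plain suffix array stored with ⌈log2 n⌉ bits per entry and a text packed with ⌈log2 |Σ|⌉ bits per character; the function names are illustrative only. Constants hidden by the O-notation are ignored, so the output does not attempt to reproduce the abstract's figure of more than 13 gigabytes of total working memory.

```python
def bits_for(values: int) -> int:
    """Smallest number of bits that can distinguish `values` different items."""
    return max(1, (values - 1).bit_length())


def suffix_array_bits(n: int) -> int:
    """Bits for a plain suffix array: n entries, each a text position in [0, n)."""
    return n * bits_for(n)


def packed_text_bits(n: int, sigma: int) -> int:
    """Bits for the text itself, packed at ceil(log2 sigma) bits per character."""
    return n * bits_for(sigma)


def to_gigabytes(bits: int) -> float:
    """Convert bits to decimal gigabytes (10**9 bytes)."""
    return bits / 8 / 1e9


if __name__ == "__main__":
    n = 2_800_000_000   # text length from the abstract (human DNA)
    sigma = 4           # assumed DNA alphabet {A, C, G, T}

    sa = suffix_array_bits(n)
    text = packed_text_bits(n, sigma)
    print(f"plain suffix array: {to_gigabytes(sa):.1f} GB")
    print(f"packed text:        {to_gigabytes(text):.1f} GB")
```

Under these assumptions the array alone is an order of magnitude larger than the packed text, which is the gap a construction algorithm working in O(n log |Σ|) bits avoids.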