Lightweight BWT construction for very large string collections

Authors:
Markus J. Bauer; Anthony J. Cox; Giovanna Rosone
Affiliations:
Illumina Cambridge Ltd., United Kingdom; Illumina Cambridge Ltd., United Kingdom; University of Palermo, Dipartimento di Matematica e Informatica, Palermo, Italy
Venue:
CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching
Year:
2011

Abstract

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the Burrows-Wheeler transform (BWT) in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in two senses: the first needs O(m log m) bits of memory to process m strings, and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.
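To make the object being computed concrete, here is a minimal sketch of the BWT of a string collection, built directly from the definition. It is not either of the two algorithms the abstract refers to. It assumes a common convention that the abstract does not spell out: each string gets its own end marker, every marker sorts below all alphabet symbols, and markers are ordered by the position of their string in the collection. The function name `naive_collection_bwt` is hypothetical, and all markers are printed as the same `$` character.

```python
def naive_collection_bwt(strings, end_marker="$"):
    """Definition-level BWT of a string collection (illustrative only).

    Assumed convention: string i is terminated by its own marker $_i,
    with $_0 < $_1 < ... and every marker smaller than any alphabet symbol.
    All markers are written as `end_marker` in the output.
    """
    entries = []
    for i, s in enumerate(strings):
        if end_marker in s:
            raise ValueError("end marker must not occur inside a string")
        for j in range(len(s) + 1):
            # The suffix starting at position j is s[j:] followed by $_i.
            # Its preceding symbol is s[j-1], or the marker when j == 0
            # (the suffix is the whole string, so we wrap around to $_i).
            preceding = s[j - 1] if j > 0 else end_marker
            # Sorting by (s[j:], i) matches the marker convention above:
            # Python orders a proper prefix before its extensions, which is
            # what a smallest-possible marker gives, and equal suffix texts
            # are tie-broken by string index, which is the marker order.
            entries.append((s[j:], i, preceding))
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(e[2] for e in entries)


if __name__ == "__main__":
    print(naive_collection_bwt(["banana"]))
    print(naive_collection_bwt(["AC", "AG"]))
```

This sketch materializes every suffix of every string, so its memory use grows quadratically with string length and linearly with m. That is exactly the cost that makes a definition-level approach impractical at the scale of a billion fragments, and that lightweight construction is meant to avoid.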