Lightweight algorithms for constructing and inverting the BWT of string collections

Authors:
Markus J. Bauer;Anthony J. Cox;Giovanna Rosone
Affiliations:
Illumina Cambridge Ltd., United Kingdom;Illumina Cambridge Ltd., United Kingdom;University of Palermo, Dipartimento di Matematica e Informatica, Via Archirafi 34, 90123 Palermo, Italy
Venue:
Theoretical Computer Science
Year:
2013


Abstract

Recent progress in the field of DNA sequencing motivates us to consider the problem of computing the Burrows-Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the BWT for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the BWT of a string collection in a lightweight fashion. A first implementation of our algorithm needs O(m log m) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve RAM usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on collections containing any number of strings, of any length. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the BWT a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the BWT of the original collection can be efficiently updated to obtain the BWT of the revised collection.
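
To make the two central objects of the abstract concrete, the following is a minimal Python sketch of the BWT of a string collection and of recovering the collection from that BWT. It is a naive reference construction that sorts every suffix explicitly, not the lightweight algorithms the abstract describes, so its memory use is far from the bounds stated there. The function names `bwt_collection` and `invert_bwt_collection` are hypothetical. The sketch also assumes a convention not spelled out in the abstract: each string gets its own end marker, the markers are ordered by string index and sort before every other character, and all of them are written as a single `$` in the output.

```python
from collections import Counter


def bwt_collection(strings):
    """Naive BWT of a string collection (illustration only, not lightweight)."""
    suffixes = []
    for i, s in enumerate(strings):
        if "$" in s:
            raise ValueError("'$' is reserved for the end markers")
        for j in range(len(s) + 1):
            # Sort key: the suffix's characters followed by the end marker $_i.
            # (0, i) sorts before every (1, ord(c)), and $_i < $_j when i < j.
            key = [(1, ord(c)) for c in s[j:]] + [(0, i)]
            # Character that cyclically precedes this suffix within S_i$_i.
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((key, prev))
    suffixes.sort(key=lambda item: item[0])
    return "".join(prev for _, prev in suffixes)


def invert_bwt_collection(bwt):
    """Recover the strings, in their original order, by backward LF-steps."""
    m = bwt.count("$")  # one end marker per string
    counts = Counter(bwt)
    # first_row[c]: first sorted row whose suffix starts with c.
    # The m end-marker suffixes occupy rows 0..m-1.
    first_row, total = {}, m
    for c in sorted(ch for ch in counts if ch != "$"):
        first_row[c] = total
        total += counts[c]
    # rank[r]: number of occurrences of bwt[r] strictly before row r.
    seen, rank = Counter(), []
    for c in bwt:
        rank.append(seen[c])
        seen[c] += 1
    strings = []
    for i in range(m):
        # Row i holds the suffix "$_i"; bwt[i] is the last character of S_i.
        row, chars = i, []
        while bwt[row] != "$":
            c = bwt[row]
            chars.append(c)
            row = first_row[c] + rank[row]  # LF-mapping step
        strings.append("".join(reversed(chars)))
    return strings


reads = ["ACG", "AG"]
transformed = bwt_collection(reads)
print(transformed)                         # GG$$ACA
print(invert_bwt_collection(transformed))  # ['ACG', 'AG']
```

Because the end markers are ordered by string index, the first m rows of the sorted suffix list correspond to the m strings in input order, which is what lets the inversion start string i from row i without storing any extra information.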