Recent progress in the field of DNA sequencing motivates us to consider the problem of computing the Burrows-Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the BWT for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the BWT of a string collection in a lightweight fashion. A first implementation of our algorithm needs O(m log m) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve RAM usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on collections containing any number of strings of any length. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the BWT a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the BWT of the original collection can be efficiently updated to obtain the BWT of the revised collection.
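To make the object being computed concrete, here is a minimal Python sketch of the BWT of a string collection and of recovering the strings from it. It is not the lightweight method described in the abstract: it sorts all suffixes explicitly in memory, which is only feasible for toy inputs. The function names `collection_bwt` and `invert_collection_bwt` are hypothetical. The sketch assumes that each string gets its own end marker, that the markers are ordered by string index and sort before every other symbol, that all markers are written as `$` in the output, and that the input strings do not contain `$`.

```python
from collections import Counter


def collection_bwt(strings, end="$"):
    """Naive BWT of a collection: sort every suffix of every string explicitly."""
    suffixes = []
    for i, s in enumerate(strings):
        # Encode symbols as tuples so that end markers (kind 0) sort before
        # ordinary characters (kind 1), and marker i sorts before marker j if i < j.
        t = [(1, c) for c in s] + [(0, i)]
        for j in range(len(t)):
            # Pair each suffix with the symbol that precedes it; for the
            # whole-string suffix (j == 0) this wraps around to the end marker.
            suffixes.append((t[j:], t[j - 1]))
    suffixes.sort(key=lambda pair: pair[0])
    return "".join(end if kind == 0 else sym for _, (kind, sym) in suffixes)


def invert_collection_bwt(bwt, end="$"):
    """Recover the strings, in their original order, from the collection BWT."""
    counts = Counter(bwt)
    m = counts[end]  # number of strings in the collection

    # first[c]: number of symbols in the BWT that sort strictly before c.
    first = {}
    total = m
    for c in sorted(k for k in counts if k != end):
        first[c] = total
        total += counts[c]

    # rank[p]: number of occurrences of bwt[p] in bwt[:p].
    seen = Counter()
    rank = []
    for c in bwt:
        rank.append(seen[c])
        seen[c] += 1

    strings = []
    for i in range(m):
        # Row i is the suffix consisting only of the end marker of string i,
        # so bwt[i] is the last character of that string. Walk backwards.
        chars = []
        p = i
        while bwt[p] != end:
            c = bwt[p]
            chars.append(c)
            p = first[c] + rank[p]
        strings.append("".join(reversed(chars)))
    return strings


reads = ["ACG", "TCG"]
bwt = collection_bwt(reads)
print(bwt)                         # GG$ATCC$
print(invert_collection_bwt(bwt))  # ['ACG', 'TCG']
```

The explicit suffix list takes space quadratic in the string length, which is exactly the cost that a lightweight construction has to avoid at the scale of a billion reads. The inversion relies on the first m rows of the sorted order being the bare end markers, so string i can be read backwards starting from row i.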