Efficient parallel construction of suffix trees for genomes larger than main memory

Authors:
Matteo Comin; Montse Farreras
Affiliations:
University of Padova, Italy; Barcelona Supercomputing Center, Barcelona, Spain
Venue:
Proceedings of the 20th European MPI Users' Group Meeting
Year:
2013



Abstract

The construction of suffix trees for very long sequences is essential for many applications and plays a central role in bioinformatics. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become more complex, requiring fast queries on multiple genomes. In this paper we present Parallel Continuous Flow (PCF), a parallel suffix tree construction method suitable for very long strings. We tested our method by constructing the suffix tree of the entire human genome, about 3 GB. We show that PCF scales gracefully as the size of the input string grows. Our method achieves an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 nodes.
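To put the efficiency figures in perspective, the short sketch below converts them into implied speedups using the standard definition of parallel efficiency, E = S / p, so S = E * p. It is only a back-of-the-envelope reading of the numbers quoted in the abstract, not a reproduction of the authors' measurements. Two assumptions are made here that the abstract does not state: that efficiency is measured against a single-processor run of the same method, and that the "172 nodes" of the timing result correspond to the "172 processors" of the efficiency result. The function and variable names are illustrative.

```python
def speedup_from_efficiency(efficiency: float, processors: int) -> float:
    """Speedup implied by the definition E = S / p, i.e. S = E * p."""
    return efficiency * processors


# Figures quoted in the abstract: (efficiency, processor count).
reported = [(0.90, 36), (0.55, 172)]
speedups = [speedup_from_efficiency(e, p) for e, p in reported]

for (e, p), s in zip(reported, speedups):
    print(f"p={p:3d}  efficiency={e:.0%}  implied speedup={s:.1f}x")

# ASSUMPTION (not stated in the abstract): the 7-minute run on 172 nodes is
# the same configuration as the 55%-efficiency measurement, and the baseline
# is a single-processor run. Under that reading, the implied single-processor
# time is the parallel time multiplied by the speedup.
parallel_minutes = 7
implied_single_minutes = parallel_minutes * speedups[1]
print(f"implied single-processor time: {implied_single_minutes / 60:.1f} hours")
```

Under these assumptions the quoted efficiencies correspond to speedups of roughly 32x on 36 processors and 95x on 172 processors, which would put a single-processor run at around 11 hours. The drop from 90% to 55% means that adding processors beyond a few dozen still shortens the run, but with diminishing returns per processor.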