Assembling large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient, on-demand algorithms that consume only linear space, and strategies that reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. We developed the framework as part of the ongoing maize genome sequencing effort and applied it to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, totaling over 1.25 billion nucleotides, into genomic islands in under 2 hours on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole-genome shotgun sequencing and for the assembly of environmental sequences.
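The abstract does not spell out how alignments are avoided, so the following is only an illustrative sketch of one common way to reduce pairwise alignments when partitioning fragments into clusters: keep the clusters in a union-find structure and skip any candidate pair whose fragments already share a cluster. Everything here is an assumption made for illustration. The names `DisjointSet`, `suffix_prefix_overlap`, and `cluster_fragments` are hypothetical, an exact suffix-prefix match stands in for a real sequence alignment, and the all-pairs loop is not the on-demand pair generation the paper refers to.

```python
from itertools import combinations


class DisjointSet:
    """Union-find over fragment indices (hypothetical helper)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def suffix_prefix_overlap(a, b, min_len):
    """Stand-in for an alignment: True if a suffix of `a` exactly
    matches a prefix of `b` over at least `min_len` characters."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return True
    return False


def cluster_fragments(fragments, min_len=4):
    """Return (clusters, alignments_performed).

    Pairs whose fragments are already in the same cluster are skipped,
    so the expensive comparison is never run for them.
    """
    ds = DisjointSet(len(fragments))
    alignments = 0
    # Assumption: brute-force pair enumeration, for illustration only.
    for i, j in combinations(range(len(fragments)), 2):
        if ds.find(i) == ds.find(j):
            continue  # already clustered together; alignment avoided
        alignments += 1
        if (suffix_prefix_overlap(fragments[i], fragments[j], min_len)
                or suffix_prefix_overlap(fragments[j], fragments[i], min_len)):
            ds.union(i, j)
    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(ds.find(i), []).append(i)
    return sorted(clusters.values()), alignments


if __name__ == "__main__":
    reads = ["ACGTACGG", "ACGGTTCA", "GGATACGT", "CCCCCCCC"]
    clusters, performed = cluster_fragments(reads)
    print(clusters, performed)
```

In this toy input, fragments 1 and 2 each overlap fragment 0, so the pair (1, 2) is never compared. The saving is small here but grows as clusters become larger, which is the general motivation for ordering and filtering candidate pairs before aligning them.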