Assembling genomes on large-scale parallel computers

Authors:
A. Kalyanaraman;S. J. Emrich;P. S. Schnable;S. Aluru
Affiliations:
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA;Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA and Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, ...;Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, USA and Departments of Agronomy, and Genetics, Development and Cell Biology, Iowa State University ...;Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA and Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, ...
Venue:
Journal of Parallel and Distributed Computing
Year:
2007

Abstract

Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. We developed the framework as part of the ongoing efforts in maize genome sequencing and applied it to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, totaling over 1.25 billion nucleotides, into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole-genome shotgun sequencing and assembly of environmental sequences.
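
The abstract does not say how the number of pairwise alignments is reduced. The following is only a minimal serial sketch of one plausible way to do it, under two assumptions that are not taken from the abstract: candidate pairs are limited to fragments sharing an exact k-mer, and a pair is skipped when a union-find structure already places both fragments in the same cluster. All names (`cluster_fragments`, `overlaps`, `DisjointSet`) and the parameters `k` and `min_overlap` are hypothetical, the exact suffix-prefix check merely stands in for a real alignment, and the toy k-mer index makes no attempt at the linear-space or parallel behavior the abstract describes.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class DisjointSet:
    """Union-find over fragment indices, with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def overlaps(a, b, min_len):
    """Stand-in for an alignment (assumption): exact suffix-prefix overlap."""
    def suffix_prefix(x, y):
        longest = min(len(x), len(y))
        return any(x[-n:] == y[:n] for n in range(min_len, longest + 1))
    return suffix_prefix(a, b) or suffix_prefix(b, a)


def cluster_fragments(fragments, k=6, min_overlap=6):
    """Group fragments into clusters and count the alignments performed."""
    # Index fragments by k-mer so only pairs sharing a k-mer are considered.
    index = defaultdict(set)
    for i, frag in enumerate(fragments):
        for km in kmers(frag, k):
            index[km].add(i)

    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    clusters = DisjointSet(len(fragments))
    alignments = 0
    for i, j in sorted(candidates):
        # Skip the expensive check if the pair is already clustered together.
        if clusters.find(i) == clusters.find(j):
            continue
        alignments += 1
        if overlaps(fragments[i], fragments[j], min_overlap):
            clusters.union(i, j)

    labels = [clusters.find(i) for i in range(len(fragments))]
    return labels, alignments
```

Calling `cluster_fragments(reads)` on a list of DNA strings returns one cluster label per fragment and the number of overlap checks actually run, which can be compared with the number of all possible pairs to see how much work the filter and the cluster test avoid.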