Assembling genomes on large-scale parallel computers

  • Authors:
  • A. Kalyanaraman; S. J. Emrich; P. S. Schnable; S. Aluru

  • Affiliations:
  • School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA;Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA and Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, ...;Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, USA and Departments of Agronomy, and Genetics, Development and Cell Biology, Iowa State University ...;Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA and Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011, ...

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2007

Abstract

Assembling large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient, on-demand algorithms that consume only linear space, and strategies that reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. We developed the framework as part of ongoing maize genome sequencing efforts and applied it to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, totaling over 1.25 billion nucleotides, into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole-genome shotgun sequencing and for the assembly of environmental sequences.
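
The abstract does not say how the number of pairwise alignments is reduced. As a purely illustrative sketch, and not the authors' implementation, one common way to save alignments during clustering is to keep fragments in a union-find structure and skip any candidate pair whose members already belong to the same cluster. Everything below is an assumption made for illustration: the names `DisjointSet`, `overlaps`, and `cluster_fragments` are hypothetical, the exact suffix-prefix test is a toy stand-in for a real sequence alignment, and the all-pairs loop stands in for whatever candidate-pair generation a real framework would use.

```python
from itertools import combinations


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def overlaps(a, b, min_len):
    """Toy stand-in for alignment (assumption): exact suffix-prefix
    overlap of at least min_len characters, in either order."""
    def suffix_prefix(x, y):
        longest = min(len(x), len(y))
        return any(x[-k:] == y[:k] for k in range(min_len, longest + 1))

    return suffix_prefix(a, b) or suffix_prefix(b, a)


def cluster_fragments(fragments, min_len=4):
    """Group fragments into clusters; return (clusters, alignments_run)."""
    ds = DisjointSet(len(fragments))
    alignments_run = 0
    for i, j in combinations(range(len(fragments)), 2):
        if ds.find(i) == ds.find(j):
            continue  # already clustered together: skip the costly test
        alignments_run += 1
        if overlaps(fragments[i], fragments[j], min_len):
            ds.union(i, j)
    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(ds.find(i), []).append(i)
    return sorted(clusters.values()), alignments_run
```

For the four toy fragments `["AAAACGTA", "CGTATTTT", "CGTAGGGG", "CCCCCCCC"]`, the first fragment overlaps both the second and the third, so the second and third are already in one cluster when their pair comes up and that test is skipped: five of the six possible pairs are evaluated. The union-find structure also stores only one parent and one size entry per fragment, so its memory use is linear in the number of fragments.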