Highly scalable genome assembly on campus grids

Authors:
Christopher Moretti;Michael Olson;Scott Emrich;Douglas Thain
Affiliations:
University of Notre Dame;University of Notre Dame;University of Notre Dame;University of Notre Dame
Venue:
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Year:
2009

Abstract

Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a candidate selector that exploits the distributed memory capacity of a campus grid, and an aligner that exploits its distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments, using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
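
The speedup and efficiency figures quoted for the largest run can be related with a short calculation. The sketch below assumes the conventional definition of parallel efficiency (speedup divided by the number of workers); the abstract does not state which definition the authors used, and the function names are illustrative only.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Efficiency under the assumed definition: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def implied_workers(speedup: float, efficiency: float) -> float:
    """Invert the assumed definition: workers = speedup / efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return speedup / efficiency


if __name__ == "__main__":
    # Figures quoted in the abstract for the largest run.
    workers = implied_workers(speedup=927.0, efficiency=0.713)
    print(f"implied worker count: {workers:.0f}")
```

Under that assumption, a 927x speedup at 71.3 percent efficiency corresponds to roughly 1,300 workers, which is consistent with the abstract's "more than a thousand nodes".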