Parallel genomic sequence-search on a massively parallel system

  • Authors:
  • Oystein Thorsen;Brian Smith;Carlos P. Sosa;Karl Jiang;Heshan Lin;Amanda Peters;Wu-chun Feng

  • Affiliations:
  • IBM - Rochester, Rochester, MN;IBM - Rochester, Rochester, MN;University of Minnesota, Minneapolis, MN;IBM - Rochester, Rochester, MN;North Carolina State University, Raleigh, NC;IBM - Rochester, Rochester, MN;Virginia Tech, Blacksburg, VA

  • Venue:
  • Proceedings of the 4th international conference on Computing frontiers
  • Year:
  • 2007

Abstract

In the life sciences, genomic databases for sequence search have been growing exponentially in size. As a result, faster sequence-search algorithms to search these databases continue to evolve to cope with algorithmic time complexity. The ubiquitous tool for such search is the Basic Local Alignment Search Tool (BLAST) [1] from the National Center for Biotechnology Information (NCBI). Despite continued algorithmic improvements in BLAST, it cannot keep up with the rate at which the database is exponentially increasing in size. Therefore, parallel implementations such as mpiBLAST have emerged to address this problem. The performance of such implementations depends on a myriad of factors including algorithmic, architectural, and mapping of the algorithm to the architecture. This paper describes modifications and extensions to a parallel and distributed-memory version of BLAST called mpiBLAST-PIO and how it maps to a massively parallel system, specifically IBM Blue Gene/L (BG/L). The extensions include a virtual file manager, a "multiple master" run-time model, efficient fragment distribution, and intelligent load balancing. In this study, we have shown that our optimized mpiBLAST-PIO on BG/L using a query with 28014 sequences and the NR and NT databases scales to 8192 nodes (two cores per node). The cases tested here are well suited for a massively parallel system.