Integrating bioinformatics, distributed data management, and distributed computing for applied training in high performance computing

  • Authors:
  • Michael D. Kane; John A. Springer

  • Affiliations:
  • Purdue University, West Lafayette, IN; Purdue University, West Lafayette, IN

  • Venue:
  • Proceedings of the 8th ACM SIGITE conference on Information technology education
  • Year:
  • 2007

Abstract

The use of multi-core and multi-node parallel high performance computing (HPC) systems is growing rapidly to meet computational demands in scientific computing. For example, the exponential growth of genomic data has outpaced increases in single-CPU clock speeds by 15-fold over the last 20 years, placing great value on parallel processing systems in bioinformatics. Fortunately, increased demand for multi-node architectures has lowered the cost of distributed computing components, making these architectures more affordable to organizations and institutions. As the demand for HPC architectures grows, so does the demand for professionals skilled in the implementation, utilization, and administration of these systems. To train undergraduate and graduate students to meet this demand, a model HPC training module has been developed and implemented that integrates bioinformatics, distributed data management, and distributed computing. In this module, bioinformatics provides exposure to applied scientific computing as well as the rationale for using multi-processor computing to overcome large computational problems. In addition, the parallelization of computing is explored from the classic divide-and-conquer approach as well as from the distributed data management perspective, which emphasizes network bandwidth and disk paging as detractors from HPC performance. Students participate in the module through hands-on interactions with three different HPC cluster types: (1) Beowulf, (2) blade servers, and (3) multi-processor shared memory systems. The results of this training module include exploratory student projects to determine mathematical relationships between HPC performance and (1) processing nodes, (2) cluster type, (3) database size and segmentation methods, (4) bioinformatics application type, (5) RAM per node, and (6) network bandwidth. The outcome of this training module is hands-on training in HPC across multiple cluster types and across multiple computer and information technology perspectives.
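
The divide-and-conquer approach and the database segmentation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy sequence list, the motif, and the function names (segment_database, count_matches, parallel_search) are hypothetical, and worker threads stand in for cluster nodes so that the example runs on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_database(records, num_segments):
    """Split records into num_segments contiguous, nearly equal slices."""
    size, extra = divmod(len(records), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments each take one additional record.
        end = start + size + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments


def count_matches(segment, motif):
    """Count the records in one segment that contain the motif."""
    return sum(1 for seq in segment if motif in seq)


def parallel_search(records, motif, num_workers):
    """Divide the database, search the segments concurrently, combine counts."""
    segments = segment_database(records, num_workers)
    # Assumption: threads play the role of nodes; on a real cluster each
    # segment would be stored on, or shipped to, a separate machine.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(
            pool.map(lambda seg: count_matches(seg, motif), segments)
        )
    return sum(partial_counts)


# Hypothetical toy database of short DNA sequences.
toy_db = ["ACGTAC", "TTGACA", "GGACGT", "CATCAT", "ACGACG"]
total = parallel_search(toy_db, "ACG", num_workers=3)
```

Varying num_workers and the size of toy_db while timing parallel_search is one simple way to explore how run time depends on node count and segmentation, in the spirit of the student projects listed above. In this sketch the segments live in shared memory, so the network bandwidth and disk paging costs the abstract highlights do not appear.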