Integrating bioinformatics, distributed data management, and distributed computing for applied training in high performance computing

Authors:
Michael D. Kane;John A. Springer
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN
Venue:
Proceedings of the 8th ACM SIGITE conference on Information technology education
Year:
2007

Abstract

The use of multi-core and multi-node parallel high performance computing (HPC) systems is growing rapidly to meet computational demands in scientific computing. For example, the exponential growth of genomic data has outpaced increases in single-CPU clock speeds by 15-fold over the last 20 years, placing great value on the use of parallel processing systems in bioinformatics. Fortunately, increased demand for multi-node architectures has reduced the cost of distributed computing components, making these architectures more affordable to organizations and institutions. As the demand for HPC architectures grows, so does the demand for professionals skilled in the implementation, use, and administration of these systems. With the goal of training undergraduate and graduate students to meet this demand, a model HPC training module has been developed and implemented that integrates bioinformatics, distributed data management, and distributed computing. In this module, bioinformatics provides exposure to applied scientific computing as well as the rationale for using multi-processor computing to solve large computational problems. In addition, the parallelization of computing is explored from the classic divide-and-conquer perspective as well as the distributed data management perspective, which emphasizes network bandwidth and disk paging as factors that degrade HPC performance. Students participate in the HPC module through hands-on interactions with three different HPC cluster types: (1) Beowulf, (2) blade servers, and (3) multi-processor shared memory systems. The results of this training module include exploratory student projects to determine mathematical relationships between HPC performance and (1) processing nodes, (2) cluster type, (3) database size and segmentation methods, (4) bioinformatics application type, (5) RAM per node, and (6) network bandwidth.
The outcome of this training module is hands-on training in HPC across multiple cluster types, and across multiple computer and information technology perspectives.
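
The divide-and-conquer approach and database segmentation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`segment_database`, `count_hits`, `parallel_search`) and the substring-matching "search" are hypothetical stand-ins for a real bioinformatics application, and a local thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_database(records, n_segments):
    """Split records into n_segments contiguous, near-equal chunks."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    base, extra = divmod(len(records), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional record each.
        end = start + base + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments


def count_hits(segment, motif):
    """Count the records in one segment that contain the motif."""
    return sum(1 for seq in segment if motif in seq)


def parallel_search(records, motif, n_workers):
    """Divide the database, search each segment, then combine the counts."""
    segments = segment_database(records, n_workers)
    # Assumption: threads stand in for cluster nodes. On a real cluster each
    # segment would be shipped to, or already stored on, a separate node.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_hits, segments, [motif] * n_workers))
    return sum(partial_counts)
```

For a toy database such as `["ACGTAC", "TTGACA", "GGGACG", "ACGACG", "TTTTTT"]`, calling `parallel_search(db, "ACG", 2)` returns 3, the same count a serial scan would give. In a student project of the kind described, the number of workers and the segmentation method would be the variables measured against run time.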