Heterogeneous architectures are becoming an important way to build massively parallel computer systems, for example the CPU-GPU heterogeneous systems ranked in the Top500 list. However, it is challenging to efficiently exploit the massive parallelism of both applications and architectures on such systems. In this paper we present a case study on how to exploit and orchestrate parallelism at the algorithm level to take advantage of the underlying parallelism at the architecture level. Cryo-EM 3D reconstruction, a potential Petaflops application, is selected as the example. We exploit all available parallelism in cryo-EM 3D reconstruction and use a self-adaptive dynamic scheduling algorithm to map the application's parallelism onto the architecture. The parallelized programs are evaluated on a subsystem of the Dawning Nebulae supercomputer, each node of which has two six-core Intel Xeon CPUs and one Nvidia Fermi GPU. The experiments confirm that hierarchical parallelism is an efficient parallel programming pattern for using the capabilities of both the CPU and the GPU in a heterogeneous system. The CUDA kernels run more than 3 times faster than their OpenMP counterparts running on 12 cores (12 threads). Compared with the GPU-only version, the hybrid CPU-GPU program further improves whole-application performance by 30% on average.
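The abstract does not describe the self-adaptive dynamic scheduling algorithm in detail. As a rough illustration of the general idea only, the sketch below simulates one common approach: work is handed out in batches, each batch is split between devices in proportion to the throughput measured on the previous batch, and the split therefore adapts to the actual speed of each device. The function name `simulate_adaptive_schedule`, the batch size, and the 3:1 GPU-to-CPU throughput ratio are all assumptions made for illustration; this is not the paper's algorithm.

```python
def simulate_adaptive_schedule(total_tasks, true_rates, batch_size=120):
    """Simulate throughput-proportional batch splitting across devices.

    true_rates maps a device name to its (hypothetical) tasks per time unit.
    The scheduler does not know these rates up front; it starts from an
    equal-throughput guess and re-estimates after every batch.
    Returns (tasks_done_per_device, total_elapsed_time).
    """
    devices = list(true_rates)
    estimated = {d: 1.0 for d in devices}  # initial guess: all devices equal
    done = {d: 0 for d in devices}
    elapsed = 0.0
    remaining = total_tasks

    while remaining > 0:
        batch = min(batch_size, remaining)

        # Split the batch in proportion to the current throughput estimates.
        total_est = sum(estimated.values())
        shares = {d: int(batch * estimated[d] / total_est) for d in devices}

        # Give tasks lost to integer rounding to the fastest-looking device.
        leftover = batch - sum(shares.values())
        fastest = max(devices, key=lambda d: estimated[d])
        shares[fastest] += leftover

        # Devices run concurrently, so a batch ends when the slowest finishes.
        times = {d: shares[d] / true_rates[d] for d in devices}
        elapsed += max(times.values())

        # Adapt: replace each estimate with the throughput just measured.
        for d in devices:
            done[d] += shares[d]
            if shares[d] > 0:
                estimated[d] = shares[d] / times[d]

        remaining -= batch

    return done, elapsed


if __name__ == "__main__":
    # Hypothetical rates: the GPU processes tasks 3x as fast as the CPU cores.
    rates = {"gpu": 3.0, "cpu": 1.0}
    done, hybrid_time = simulate_adaptive_schedule(1200, rates)
    gpu_only_time = 1200 / rates["gpu"]
    print(done, hybrid_time, gpu_only_time)
```

Under the assumed 3:1 ratio, a perfectly balanced hybrid run can be at most (3 + 1) / 3, or about 1.33 times, faster than the GPU-only run. The simulation falls slightly short of that bound because the first batch is split evenly before any throughput has been measured.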