Single-particle 3D reconstruction from cryo-electron microscopy (cryo-EM) images is a core application in the analysis of biological molecules, and its computational requirement for a high-resolution 3D structure now exceeds the petaflop scale. In this paper, we quantitatively analyze the workload, computational intensity, and memory performance of the application, and parallelize it on GPU-CUDA, an emerging multicore architecture. Further, we apply a percolation technique to decouple computation from memory operations, and we orchestrate the thread-data mapping to reduce the overhead of off-chip memory operations. Finally, we test our optimization strategy by porting the popular open-source package EMAN to GPU-CUDA, which achieves a speedup of about 10X relative to the original CPU-only EMAN. The experimental results also show that the proposed percolation programming greatly improves the utilization of memory bandwidth and floating-point units.