Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations, owing to advances in the performance and flexibility of GPU hardware and to the availability of GPU software development tools targeting general-purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely used parallel molecular dynamics simulation package, and present performance results for a 64-core, 64-GPU cluster.