Applications on emerging paradigms in parallel computing
Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance close to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas of scientific computing. Although they appear easy to parallelize, since each pairwise interaction can be computed independently of all others, obtaining high performance requires techniques that address multi-layered memory hierarchies, mappings that work within the restrictions imposed by the small, low-latency on-chip memories, and a careful balance among concurrency, data reuse, and memory traffic. We present a hierarchical decomposition scheme for GPUs based on decomposing the output matrix and the input data. We demonstrate that careful tuning of the associated decomposition parameters is essential to achieving high efficiency on GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor and with multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.
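The hierarchical decomposition described above can be illustrated with a minimal sketch: the n-by-n output matrix is partitioned into tiles, and for each tile the corresponding slices of the input are staged into small buffers (standing in for a GPU's on-chip shared memory) before the pairwise kernel runs on that tile. The names here (`TILE`, `pairwise_tiles`, `euclidean`) are illustrative, not taken from the paper, and a serial Python loop is used purely to show the data movement pattern; a real GPU implementation would map tiles to thread blocks.

```python
# Hypothetical sketch of a tiled all-pairs computation: the output matrix
# is decomposed into TILE x TILE blocks, and the input slices for each
# block are staged once (modeling on-chip reuse) before the inner kernel.

import math

TILE = 4  # block edge; on a GPU this would match the thread-block size


def euclidean(a, b):
    """Example pairwise kernel: Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_tiles(points, kernel=euclidean, tile=TILE):
    """Compute the full n x n matrix of kernel(points[i], points[j]),
    iterating over output tiles rather than individual entries."""
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            # Stage the input slices for this output tile once;
            # every entry of the tile reuses these staged buffers.
            rows = points[bi:bi + tile]
            cols = points[bj:bj + tile]
            for i, p in enumerate(rows):
                for j, q in enumerate(cols):
                    out[bi + i][bj + j] = kernel(p, q)
    return out
```

The tile size plays the role of one of the decomposition parameters the paper tunes: larger tiles increase reuse of staged input but must fit within the limited on-chip memory.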