Exploring weak scalability for FEM calculations on a GPU-enhanced cluster

Authors:
Dominik Göddeke (Institute of Applied Mathematics, University of Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany)
Robert Strzodka (Stanford University, Max Planck Center, United States)
Jamaludin Mohd-Yusof (Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory, United States)
Patrick McCormick (Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory, United States)
Sven H. M. Buijssen (Institute of Applied Mathematics, University of Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany)
Matthias Grajewski (Institute of Applied Mathematics, University of Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany)
Stefan Turek (Institute of Applied Mathematics, University of Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany)
Venue:
Parallel Computing
Year:
2007

Abstract

The first part of this paper surveys co-processor approaches for commodity-based clusters in general, considering not only raw performance but also system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach on a large-scale system with up to 160 nodes. Starting from a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. As a result, even adding low-end, out-of-date GPUs improves both performance- and power-related metrics.
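The argument above is that aggregate memory bandwidth determines performance, so even slow GPUs help. An idealized model can illustrate this. The Python sketch below assumes a purely bandwidth-bound kernel, perfectly concurrent CPU and GPU work on disjoint parts of the data, and no PCIe transfer or network communication cost. The concurrent CPU/GPU data split is a simplification, not the authors' scheme. All function names are hypothetical, and the numbers in the usage block are placeholders rather than measurements from the paper.

```python
"""Idealized bandwidth-bound cost model (illustrative sketch only).

Assumptions: the kernel is purely memory-bandwidth bound, CPU and GPU
work concurrently on disjoint parts of the data, and PCIe transfers and
network communication are ignored.
"""


def _check_positive(name: str, value: float) -> None:
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")


def _check_gpu_bw(gpu_bw: float) -> None:
    if gpu_bw < 0:
        raise ValueError(f"gpu_bw must be non-negative, got {gpu_bw}")


def optimal_split(cpu_bw: float, gpu_bw: float) -> float:
    """Fraction of the data placed on the GPU so both devices finish together."""
    _check_positive("cpu_bw", cpu_bw)
    _check_gpu_bw(gpu_bw)
    return gpu_bw / (cpu_bw + gpu_bw)


def split_times(data_bytes: float, cpu_bw: float, gpu_bw: float) -> tuple[float, float]:
    """Per-device streaming times (s) under optimal_split; equal when gpu_bw > 0."""
    f = optimal_split(cpu_bw, gpu_bw)
    gpu_time = f * data_bytes / gpu_bw if gpu_bw > 0 else 0.0
    cpu_time = (1.0 - f) * data_bytes / cpu_bw
    return cpu_time, gpu_time


def bandwidth_bound_time(data_bytes: float, cpu_bw: float, gpu_bw: float = 0.0) -> float:
    """Seconds to stream data_bytes once; bandwidths are in bytes/s.

    With the optimal split the effective bandwidth is cpu_bw + gpu_bw,
    so in this model any GPU with gpu_bw > 0 shortens the run.
    """
    _check_positive("cpu_bw", cpu_bw)
    _check_gpu_bw(gpu_bw)
    return data_bytes / (cpu_bw + gpu_bw)


def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Per-node problem size held fixed; 1.0 means perfect weak scaling."""
    _check_positive("t_one_node", t_one_node)
    _check_positive("t_n_nodes", t_n_nodes)
    return t_one_node / t_n_nodes


def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy to solution: average power draw times runtime."""
    return power_watts * seconds


if __name__ == "__main__":
    # Hypothetical placeholder values, not measurements from the paper.
    data = 1.2e9                  # bytes streamed per solver iteration on one node
    cpu_bw, gpu_bw = 6e9, 18e9    # sustained bandwidths in bytes/s
    t_cpu = bandwidth_bound_time(data, cpu_bw)
    t_hyb = bandwidth_bound_time(data, cpu_bw, gpu_bw)
    print(f"GPU share of data: {optimal_split(cpu_bw, gpu_bw):.2f}")
    print(f"CPU only: {t_cpu:.3f} s, CPU+GPU: {t_hyb:.3f} s")
    print(f"Energy: {energy_joules(200.0, t_cpu):.1f} J vs {energy_joules(280.0, t_hyb):.1f} J")
```

In this model a hybrid node is faster whenever `gpu_bw > 0`, which mirrors the observation that even low-end GPUs help. Real gains are smaller because of bus transfers and the solver stages that remain on the CPU. Energy to solution can still drop despite the higher power draw, provided the runtime shrinks proportionally more.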