Towards large-scale multi-socket, multicore parallel simulations: Performance of an MPI-only semiconductor device simulator

  • Authors:
  • Paul T. Lin; John N. Shadid

  • Affiliations:
  • Sandia National Laboratories, P.O. Box 5800, MS 0316, Albuquerque, NM 87185-0316, USA (both authors)

  • Venue:
  • Journal of Computational Physics
  • Year:
  • 2010

Abstract

This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a set of multi-socket, multicore architectures with nonuniform memory access (NUMA) compute nodes. The platforms include two Linux clusters with multicore processors, one with quad-socket, quad-core AMD Opteron nodes and one with dual-socket, quad-core Intel Xeon Nehalem nodes, as well as a dual-socket, six-core AMD Opteron workstation. These platforms have complex memory hierarchies: core-local cache, socket-local memory, memory attached to other sockets on the same mainboard, and memory on other nodes reached across network links. The semiconductor device simulator used in this study employs a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling results presented include a large-scale problem with more than 100 million unknowns on 4096 cores and a comparison with the Cray XT3/4 Red Storm capability platform. Although the MPI-only device simulator employed for this work can take advantage of all the cores of the quad-core and six-core CPUs, the efficiency of the linear system solve decreases as the core count grows, and eventually a different programming paradigm will be needed.
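
The "MPI-only" programming model discussed in the abstract treats every core of every socket as an independent MPI rank; the NUMA memory hierarchy is not visible in the programming model itself. The following C sketch is a minimal illustration of that flat, one-rank-per-core view, not code from the simulator studied in the paper; the "residual" value and its reduction are hypothetical stand-ins for the per-rank contribution to the global residual norm that a Newton-Krylov solve evaluates on every nonlinear iteration.

/*
 * Minimal sketch (not the authors' simulator): in an MPI-only model,
 * every core of every socket runs its own MPI rank, and the NUMA
 * hierarchy is invisible to the code. The "local_sq" value is a
 * hypothetical stand-in for a rank's piece of a distributed nonlinear
 * residual; the MPI_Allreduce mirrors the global norm computation a
 * Newton-Krylov convergence test performs.
 */
#include <math.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, size, name_len;
  char node_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(node_name, &name_len);

  /* One rank per core: several ranks report the same node name and
     share that node's sockets and memory controllers. */
  printf("rank %d of %d on node %s\n", rank, size, node_name);

  /* Stand-in for the local contribution to a distributed residual norm. */
  double local_sq = (double)(rank + 1);
  double global_sq = 0.0;
  MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  if (rank == 0)
    printf("||r|| = %g (global reduction over %d ranks)\n",
           sqrt(global_sq), size);

  MPI_Finalize();
  return 0;
}

Built with an MPI compiler wrapper (for example, mpicc sketch.c -lm) and launched with one rank per core (for example, mpirun -np 16 ./a.out on a quad-socket, quad-core node), several ranks land on each node and compete for its sockets and memory controllers. How the MPI runtime pins ranks to sockets is one reason NUMA effects surface at scale even though the code itself never refers to the memory hierarchy, which is part of the motivation for eventually moving beyond a pure MPI paradigm.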