Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Authors:
Aparna Chandramowlishwaran;Kamesh Madduri;Richard Vuduc
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Abstract

Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.