Quantitative system performance: computer system analysis using queueing network models
Quantitative system performance: computer system analysis using queueing network models
A fast algorithm for particle simulations
Journal of Computational Physics
A parallel hashed Oct-Tree N-body algorithm
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Guest Editors' Introduction: The Top 10 Algorithms
Computing in Science and Engineering
Computing in Science and Engineering
A framework for performance modeling and prediction
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A kernel-independent adaptive fast multipole algorithm in two and three dimensions
Journal of Computational Physics
A New Parallel Kernel-Independent Fast Multipole Method
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Massively parallel implementation of a fast multipole method for distributed memory machines
Journal of Parallel and Distributed Computing
The Tau Parallel Performance System
International Journal of High Performance Computing Applications
Characterization of simultaneous multithreading (SMT) efficiency in POWER5
IBM Journal of Research and Development - POWER5 and packaging
An experimental comparison of cache-oblivious and cache-conscious programs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
High performance BLAS formulation of the multipole-to-local operator in the fast multipole method
Journal of Computational Physics
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Fast, parallel, GPU-based construction of space filling curves and octrees
Proceedings of the 2008 symposium on Interactive 3D graphics and games
A genetic algorithms approach to modeling the performance of memory-bound computations
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A regression-based approach to scalability prediction
Proceedings of the 22nd annual international conference on Supercomputing
Memory hierarchy performance measurement of commercial dual-core desktop processors
Journal of Systems Architecture: the EUROMICRO Journal
Fast multipole methods on graphics processors
Journal of Computational Physics
Adapting a message-driven parallel application to GPU-accelerated clusters
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Roofline: an insightful visual performance model for multicore architectures
Communications of the ACM - A Direct Path to Dependable Software
A massively parallel adaptive fast-multipole method on heterogeneous architectures
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
An exploration of performance attributes for symbolic modeling of emerging processing devices
HPCC'07 Proceedings of the Third international conference on High Performance Computing and Communications
Scalable fast multipole methods on distributed heterogeneous architectures
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
A CPU: GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of- the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end- user programmers) and automated performance analysis and tuning tools could themselves apply.