Evaluating multicore algorithms on the unified memory model

Authors:
John E. Savage; Mohammad Zubair
Affiliations:
Brown University, Providence, RI 02912, USA (corresponding author; tel.: +1 401 863 76 42, fax: +1 401 863 76 57, e-mail: jes@cs.brown.edu); Old Dominion University, Norfolk, VA 23529, USA
Venue:
Scientific Programming - Software Development for Multi-core Computing Systems
Year:
2009

Abstract

One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is already an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multiple-core processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem based on the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a system with two quad-core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPS, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.
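
For readers unfamiliar with the trinomial model named in the abstract, the following is a minimal sketch of the plain backward-induction computation on a trinomial lattice. It is not the cache-efficient algorithm developed in the paper. The function name, the parameter names, the choice of a European call payoff, and the particular lattice parameterization (one trinomial step built from two half-length binomial steps) are all assumptions made for illustration. The sketch shows the data dependency that makes memory traffic matter: each node at one time level is computed from three adjacent nodes at the next level, so a straightforward implementation streams an entire level through the cache at every step.

```python
import math


def trinomial_european_call(s0, strike, rate, sigma, maturity, steps):
    """Price a European call by naive backward induction on a trinomial lattice.

    Illustrative only: names and parameterization are assumptions, and no
    cache blocking or parallelism is attempted.
    """
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(2.0 * dt))   # up factor; down factor is 1/up
    growth = math.exp(rate * dt / 2.0)           # risk-free growth over dt/2
    half_up = math.exp(sigma * math.sqrt(dt / 2.0))
    half_dn = 1.0 / half_up

    # Risk-neutral branch probabilities (assumed standard parameterization).
    p_up = ((growth - half_dn) / (half_up - half_dn)) ** 2
    p_dn = ((half_up - growth) / (half_up - half_dn)) ** 2
    p_mid = 1.0 - p_up - p_dn
    discount = math.exp(-rate * dt)

    # Payoffs at maturity: 2*steps + 1 leaves, leaf j has price s0 * up**(j - steps).
    values = [max(s0 * up ** (j - steps) - strike, 0.0)
              for j in range(2 * steps + 1)]

    # Sweep back one level at a time; node j reads children j, j+1, j+2.
    for level in range(steps - 1, -1, -1):
        values = [discount * (p_dn * values[j]
                              + p_mid * values[j + 1]
                              + p_up * values[j + 2])
                  for j in range(2 * level + 1)]
    return values[0]


if __name__ == "__main__":
    print(trinomial_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200))
```

Every level in this sketch is rebuilt from the whole of the previous one, so once a level no longer fits in cache, each step pays for a full pass over memory. Reducing that traffic between cache levels is the kind of problem the abstract says the UMM is used to analyze.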