A memory access model for highly-threaded many-core architectures

Authors:
Lin Ma;Kunal Agrawal;Roger D. Chamberlain
Affiliations:
-;-;-
Venue:
Future Generation Computer Systems
Year:
2014

Citing 44
Cited 1

A model for hierarchical memory

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
The input/output complexity of sorting and related problems

Communications of the ACM
A bridging model for parallel computation

Communications of the ACM
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Large-scale sorting in uniform memory hierarchies

Journal of Parallel and Distributed Computing - Special issue on parallel I/O systems
The Tera computer system

ICS '90 Proceedings of the 4th international conference on Supercomputing
Efficient Algorithms for Shortest Paths in Sparse Networks

Journal of the ACM (JACM)
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
Introduction to Algorithms

Introduction to Algorithms
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Parallelism in random access machines

STOC '78 Proceedings of the tenth annual ACM symposium on Theory of computing
A Survey of Parallel Algorithms for Shared-Memory Machines

A Survey of Parallel Algorithms for Shared-Memory Machines
Visualizing computer memory architectures

VIS '90 Proceedings of the 1st conference on Visualization '90
Concurrent cache-oblivious b-trees

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Performance Predictions for General-Purpose Computation on GPUs

ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Streaming Algorithms for Biological Sequence Alignment on GPUs

IEEE Transactions on Parallel and Distributed Systems
Provably good multicore cache performance for divide-and-conquer algorithms

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Fundamental parallel algorithms for private-cache chip multiprocessors

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Cache-efficient dynamic programming algorithms for multicores

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Hierarchical memory with block transfer

SFCS '87 Proceedings of the 28th Annual Symposium on Foundations of Computer Science
A performance study of general-purpose applications on graphics processors using CUDA

Journal of Parallel and Distributed Computing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
All-pairs shortest-paths for large graphs on the GPU

Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture
Designing efficient sorting algorithms for manycore GPUs

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Real-time parallel hashing on the GPU

ACM SIGGRAPH Asia 2009 papers
An adaptive performance modeling tool for GPU architectures

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Model-driven autotuning of sparse matrix-vector multiply on GPUs

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Fast tridiagonal solvers on the GPU

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Scheduling irregular parallel computations on hierarchical caches

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
A quantitative performance analysis model for GPU architectures

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
CuMAPz: a tool to analyze memory access patterns in CUDA

Proceedings of the 48th Design Automation Conference
Blocked All-Pairs Shortest Paths Algorithm for Hybrid CPU-GPU System

HPCC '11 Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications
Bloom Filter Performance on Graphics Engines

ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
A performance analysis framework for identifying potential benefits in GPGPU applications

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Scalable GPU graph traversal

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Performance Estimation of GPUs with Cache

IPDPSW '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
A Performance Model for Memory Bandwidth Constrained Applications on Graphics Engines

ASAP '12 Proceedings of the 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors
A Memory Access Model for Highly-threaded Many-core Architectures

ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems

Theoretical analysis of classic algorithms on highly-threaded many-core GPUs

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

A number of highly-threaded, many-core architectures hide memory-access latency by low-overhead context switching among a large number of threads. The speedup of a program on these machines depends on how well the latency is hidden. If the number of threads were infinite, theoretically, these machines could provide the performance predicted by the PRAM analysis of these programs. However, the number of threads per processor is not infinite, and is constrained by both hardware and algorithmic limits. In this paper, we introduce the Threaded Many-core Memory (TMM) model which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to provide a more fine-grained and accurate performance prediction than the PRAM analysis. We analyze 4 algorithms for the classic all pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the dynamic programming algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the dynamic programming algorithm performs better on these machines. We validate several predictions made by our model using empirical measurements on an instantiation of a highly-threaded, many-core machine, namely the NVIDIA GTX 480.