Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Authors:
Jae-Seung Yeom; Dimitrios S. Nikolopoulos
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Abstract

Multi-core processors with explicitly managed local memories provide advanced capabilities to optimize data caching and prefetching in software. Unfortunately, these capabilities are neither easily accessible to programmers, nor exploited to their maximum potential by current language, compiler, or runtime frameworks. We present Strider, a runtime framework for optimizing compilers on multi-core processors with software-managed memories. Strider transparently optimizes grouping, decomposition, and scheduling of explicit software-managed accesses to multi-dimensional arrays in nested loops, given a high-level specification of loops and their data access patterns. In particular, Strider contributes new methods to improve temporal locality, optimize the critical path of scheduling data transfers for multi-stride accesses in regular nested parallel loops, and distribute accesses between cores. The prototype of Strider on the IBM Cell processor performs competitively with hand-optimized code and better than contemporary language frameworks, in both non-trivial parallel applications and important application kernels.
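
To make the access pattern in the abstract concrete, the following is a minimal sketch of a strided access to a multi-dimensional array that is staged through a small local buffer. It is not Strider's interface or algorithm, which the abstract does not specify. It assumes a row-major array stored in a flat list, and every name in it (`strided_gather`, `process_column_in_chunks`, `chunk_rows`, `kernel`) is hypothetical. It walks one column of the array in chunks that would fit a local memory, alternating between two buffers in the style of double buffering.

```python
from typing import Callable, List, Optional


def strided_gather(memory: List[int], base: int, stride: int, count: int) -> List[int]:
    """Model one strided transfer: copy `count` elements spaced `stride` apart.

    Hypothetical helper; it stands in for an explicit transfer from main
    memory into a software-managed local memory.
    """
    return [memory[base + i * stride] for i in range(count)]


def process_column_in_chunks(
    memory: List[int],
    n_rows: int,
    n_cols: int,
    col: int,
    chunk_rows: int,
    kernel: Callable[[int], int],
) -> List[int]:
    """Apply `kernel` to one column of a row-major n_rows x n_cols array.

    The column is fetched in chunks of at most `chunk_rows` elements, using
    two alternating local buffers. In this sequential model the "prefetch"
    does not overlap with computation; on hardware with asynchronous
    transfers it could.
    """
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")

    def fetch(row_start: int) -> List[int]:
        count = min(chunk_rows, n_rows - row_start)
        # Consecutive column elements are n_cols apart in a row-major layout.
        return strided_gather(memory, row_start * n_cols + col, n_cols, count)

    results: List[int] = []
    starts = list(range(0, n_rows, chunk_rows))
    if not starts:
        return results

    buffers: List[Optional[List[int]]] = [None, None]
    buffers[0] = fetch(starts[0])  # initial fetch before the loop
    for k in range(len(starts)):
        current = k % 2
        if k + 1 < len(starts):
            # Request the next chunk into the other buffer.
            buffers[1 - current] = fetch(starts[k + 1])
        chunk = buffers[current]
        assert chunk is not None
        results.extend(kernel(x) for x in chunk)
    return results


if __name__ == "__main__":
    # 4 x 3 array holding 0..11 in row-major order; column 1 is 1, 4, 7, 10.
    array = list(range(12))
    print(process_column_in_chunks(array, 4, 3, 1, 3, lambda x: 2 * x))
    # prints [2, 8, 14, 20]
```

In this sketch the chunk size and buffer alternation are written by hand. According to the abstract, decisions of this kind (grouping, decomposition, and scheduling of transfers) are what Strider makes automatically from a high-level specification of the loops and their access patterns.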