SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Authors:
Chunyang Gou;Georgi Kuzmanov;Georgi N. Gaydadjiev
Affiliations:
Delft University of Technology, The Netherlands;Delft University of Technology, The Netherlands;Delft University of Technology, The Netherlands
Venue:
Proceedings of the 24th ACM International Conference on Supercomputing
Year:
2010

Citing 22
Cited 4

On the effective bandwidth of interleaved memories in vector processor systems

IEEE Transactions on Computers
A Simulation Study of the CRAY X-MP Memory System

IEEE Transactions on Computers
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Conflict-Free Vector Access Using a Dynamic Storage Scheme

IEEE Transactions on Computers
Increased Memory Performance During Vector Accesses Through the Use of Linear Address Transformations

IEEE Transactions on Computers
Evaluating stream buffers as a secondary cache replacement

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A performance study of software and hardware data prefetching schemes

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Conflict-Free Access for Streams in Multimodule Memories

IEEE Transactions on Computers
Block, Multistride Vector, and FFT Accesses in Parallel Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Impulse: Building a Smarter Memory Controller

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost

PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems

IEEE Transactions on Computers
Conflict-Free Accesses to Strided Vectors on a Banked Cache

IEEE Transactions on Computers
The vector fixed point unit of the synergistic processor element of the cell architecture processor

Proceedings of the conference on Design, automation and test in Europe: Designers' forum
Optimizing data permutations for SIMD devices

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
The Organization and Use of Parallel Memories

IEEE Transactions on Computers
Sams: single-affiliation multiple-stride parallel memory scheme

Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Implementing Wilson-Dirac operator on the cell broadband engine

Proceedings of the 22nd annual international conference on Supercomputing
Outer-loop vectorization: revisited for short SIMD architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
The PARSEC benchmark suite: characterization and architectural implications

Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Elastic pipeline: addressing GPU on-chip shared memory bank conflicts

Proceedings of the 8th ACM International Conference on Computing Frontiers
An efficient vectorization of linked-cell particle simulations

Proceedings of the 9th conference on Computing Frontiers
Data layout optimization for multi-valued containers in OpenCL

Journal of Parallel and Distributed Computing
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose to bridge the discrepancy between data representations in memory and those favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level in the memory hierarchy. This level of memory provides both Array of Structures (AoS) and Structure of Arrays (SoA) views for the structured data to the processor, appearing to have maintained multiple layouts for the same data. With such multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90nm CMOS technology indicate that the SAMS Multi-Layout Memory system has efficient hardware implementation, with a critical path delay of less than 1ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach is able to decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.