On the effective bandwidth of interleaved memories in vector processor systems
IEEE Transactions on Computers
A Simulation Study of the CRAY X-MP Memory System
IEEE Transactions on Computers
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Conflict-Free Vector Access Using a Dynamic Storage Scheme
IEEE Transactions on Computers
IEEE Transactions on Computers
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Conflict-Free Access for Streams in Multimodule Memories
IEEE Transactions on Computers
Block, Multistride Vector, and FFT Accesses in Parallel Memory Systems
IEEE Transactions on Parallel and Distributed Systems
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Architectural Support for Uniprocessor and Multiprocessor Active Memory Systems
IEEE Transactions on Computers
Conflict-Free Accesses to Strided Vectors on a Banked Cache
IEEE Transactions on Computers
The vector fixed point unit of the synergistic processor element of the cell architecture processor
Proceedings of the conference on Design, automation and test in Europe: Designers' forum
Optimizing data permutations for SIMD devices
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
The Organization and Use of Parallel Memories
IEEE Transactions on Computers
Sams: single-affiliation multiple-stride parallel memory scheme
Proceedings of the 2008 workshop on Memory access on future processors: a solved problem?
Implementing Wilson-Dirac operator on the cell broadband engine
Proceedings of the 22nd annual international conference on Supercomputing
Outer-loop vectorization: revisited for short SIMD architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Proceedings of the 8th ACM International Conference on Computing Frontiers
An efficient vectorization of linked-cell particle simulations
Proceedings of the 9th conference on Computing Frontiers
Data layout optimization for multi-valued containers in OpenCL
Journal of Parallel and Distributed Computing
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
We propose to bridge the discrepancy between data representations in memory and those favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level in the memory hierarchy. This level of memory provides both Array of Structures (AoS) and Structure of Arrays (SoA) views for the structured data to the processor, appearing to have maintained multiple layouts for the same data. With such multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90nm CMOS technology indicate that the SAMS Multi-Layout Memory system has efficient hardware implementation, with a critical path delay of less than 1ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach is able to decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.