Array organization in parallel memories

Authors:
Mayez Al-Mouhamed
Affiliations:
Computer Engineering Department, College of Computer Science and Engineering, King Fahd University of Petroleum and Minerals, P.O. Box 787, Dhahran 31261, Saudi Arabia
Venue:
International Journal of Parallel Programming
Year:
2004

Abstract

The bandwidth mismatch between processor and main memory is a major throughput-limiting problem. Although streamed computations have predictable access patterns, their references have little temporal locality and are generally too long to cache. We present a memory and compiler co-optimization that reduces low-level memory accesses through software and hardware locality optimizations. We propose a scalable and predictable parallel memory based on compiler synthesis of storage schemes for multi-dimensional arrays that are accessed by an arbitrary but known set of data access patterns. Using the algebra of non-singular Boolean matrices, we analyze conflict-free access to (1) parallel memories and (2) alignment networks. Finding a multi-pattern storage scheme is an NP-complete problem. We propose an effective compiler heuristic for finding a storage matrix that minimizes overall memory access time. The heuristic applies to arbitrary linear patterns and arbitrary alignment networks. It is shown to find an optimal storage scheme for parallel (1) FFT and (2) bitonic sorting. The proposed storage scheme outperforms statically optimized storage schemes in the case of power-of-2 multi-stride access. The case of non-power-of-2 strides is also addressed. The performance and scalability of the proposed parallel memory, and its predictable access time, are evaluated using numerical and multimedia algorithms. Our storage scheme achieves a memory utilization above 83% for 64 memories, substantially outperforming previous proposals. Our approach provides a tool for matching the storage pattern to the data access patterns of embedded systems running streamed computations with predictable data access patterns.
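To make the idea of a Boolean-matrix storage scheme concrete, here is a minimal sketch of a generic XOR-based bank mapping. It is not the paper's heuristic or its storage matrix. The names `bank_of`, `conflicts`, `interleaved`, and `xor_scheme`, the choice of 8 banks, and the specific row masks are all assumptions made for illustration. Each row of the bank-selection matrix is stored as a bit mask, and each bank-index bit is the GF(2) dot product (parity) of that row with the address.

```python
def bank_of(addr, rows):
    """Map an address to a bank: bit i of the bank index is the
    parity of (addr AND rows[i]), i.e. a matrix-vector product over GF(2)."""
    bank = 0
    for i, mask in enumerate(rows):
        parity = bin(addr & mask).count("1") & 1
        bank |= parity << i
    return bank


def conflicts(rows, stride, count, base=0):
    """Count accesses that land on a bank already used by an earlier
    access in the same strided request of `count` elements."""
    banks = [bank_of(base + k * stride, rows) for k in range(count)]
    return count - len(set(banks))


M = 3  # log2 of the number of banks (8 banks, an illustrative choice)

# Plain low-order interleaving: bank = addr mod 8.
interleaved = [1 << i for i in range(M)]

# Hypothetical XOR scheme: bank bit i = addr bit i XOR addr bit (i + M).
xor_scheme = [(1 << i) | (1 << (i + M)) for i in range(M)]

if __name__ == "__main__":
    for stride in (1, 2, 4, 8):
        print(stride,
              conflicts(interleaved, stride, 8),
              conflicts(xor_scheme, stride, 8))
```

With an aligned base address of 0, the illustrative XOR mapping places all eight elements of a stride-1, 2, 4, or 8 access in distinct banks, whereas plain interleaving sends every element of a stride-8 access to the same bank. Larger strides and unaligned bases are not conflict-free under this toy mapping. Handling a known set of patterns of that kind is what the abstract's compiler-synthesized storage matrix is for.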