Architecture and Compiler Co-Optimization for High Performance Computing
IWIA '02 Proceedings of the International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'02)
The bandwidth mismatch between processor and main memory is a major throughput-limiting problem. Although streamed computations have predictable access patterns, their references have little temporal locality and the data streams are generally too long to cache. We present a memory and compiler co-optimization that reduces low-level memory accesses through software and hardware locality optimizations. We propose a scalable and predictable parallel memory based on compiler synthesis of storage schemes for multi-dimensional arrays that are accessed by an arbitrary but known set of data access patterns. Using the algebra of non-singular Boolean matrices, we analyze conflict-free access to (1) parallel memories and (2) alignment networks. Finding a multi-pattern storage scheme is an NP-complete problem. We therefore propose an effective compiler heuristic for finding a storage matrix that minimizes overall memory access time; it applies to arbitrary linear patterns and arbitrary alignment networks. The heuristic is shown to find an optimal storage scheme for parallel (1) FFT and (2) bitonic sorting. The resulting storage scheme outperforms statically optimized storage schemes for power-of-2 multi-stride access. The case of non-power-of-2 strides is also addressed. The performance and scalability of the proposed parallel memory, and its predictable access time, are evaluated using numerical and multimedia algorithms. Our storage scheme achieves a memory utilization above 83% for 64 memories, which largely outperforms previous proposals. Our approach provides a tool for matching the storage pattern to the data access patterns of embedded systems running streamed computations with predictable data access patterns.
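As a minimal sketch of what a Boolean-matrix storage scheme can look like, the Python fragment below maps an address to a memory bank by multiplying the address bits by a Boolean matrix over GF(2), then checks whether a strided access touches each bank at most once. The abstract gives no concrete matrices, so everything here is an assumption made for illustration: the 4-bank, 4-bit-address configuration, the two example matrices, and the names `bank_of`, `is_conflict_free`, `INTERLEAVED` and `XOR_SCHEME`. Only the rows that produce the bank number are modelled, not a full non-singular address transformation, and the paper's compiler heuristic is not reproduced.

```python
def parity(x: int) -> int:
    """Return the XOR of all bits of x (a dot product over GF(2))."""
    return bin(x).count("1") & 1


def bank_of(addr: int, rows: list[int]) -> int:
    """Map an address to a bank number.

    Each entry of `rows` is one row of a hypothetical Boolean storage
    matrix, packed as a bit mask; bank bit i is the parity of the
    address bits selected by row i.
    """
    bank = 0
    for i, row in enumerate(rows):
        bank |= parity(addr & row) << i
    return bank


def is_conflict_free(rows: list[int], base: int, stride: int, count: int) -> bool:
    """True if `count` accesses starting at `base` hit distinct banks."""
    banks = [bank_of(base + k * stride, rows) for k in range(count)]
    return len(set(banks)) == count


# Illustrative 4-bank schemes for 4-bit addresses (assumed, not from the paper).
INTERLEAVED = [0b0001, 0b0010]  # bank = two low-order address bits
XOR_SCHEME = [0b0101, 0b1010]   # bank bit 0 = a0 ^ a2, bank bit 1 = a1 ^ a3

for name, rows in (("interleaved", INTERLEAVED), ("xor", XOR_SCHEME)):
    for stride in (1, 4):
        ok = is_conflict_free(rows, base=0, stride=stride, count=4)
        print(f"{name:11s} stride {stride}: conflict-free = {ok}")
```

In this toy configuration, plain low-order interleaving serves stride 1 without conflicts but sends every stride-4 access to bank 0, while the XOR-based matrix serves both patterns from base address 0. It does not handle every starting address, which is the kind of trade-off a multi-pattern storage matrix has to balance.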