Efficient and exact data dependence analysis
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Efficient support for irregular applications on distributed-memory machines
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
An integrated compile-time/run-time software distributed shared memory system
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Space-time scheduling of instruction-level parallelism on a raw machine
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Pointer analysis for multithreaded programs
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Very Long Instruction Word architectures and the ELI-512
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
R. Barua, W. Lee, S. Amarasinghe and A. Agarwal
HIPC '98 Proceedings of the Fifth International Conference on High Performance Computing
Pointer analysis for multithreaded programs
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Bidwidth analysis with application to silicon compilation
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Attacking the semantic gap between application programming languages and configurable hardware
FPGA '01 Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays
A decade of reconfigurable computing: a visionary retrospective
Proceedings of the conference on Design, automation and test in Europe
Coarse grain reconfigurable architecture (embedded tutorial)
Proceedings of the 2001 Asia and South Pacific Design Automation Conference
Automatic Code Mapping on an Intelligent Memory Architecture
IEEE Transactions on Computers
A compiler approach to fast hardware design space exploration in FPGA-based systems
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
An interleaved cache clustered VLIW processor
ICS '02 Proceedings of the 16th international conference on Supercomputing
Pointer analysis for structured parallel programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
FlexCache: A Framework for Flexible Compiler Generated Data Caching
IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Automatic Allocation of Arrays to Memories in FPGA Processors with Multiple Memory Banks
FCCM '99 Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Near Fine Grain Parallel Processing Using Static Scheduling on Single Chip Multiprocessors
IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
Custom Data Layout for Memory Parallelism
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
SAGE: an automatic analyzing system for a new high-performance SoC architecture-processor-in-memory
Journal of Systems Architecture: the EUROMICRO Journal
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures
Proceedings of the 18th annual international conference on Supercomputing
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Proceedings of the 31st annual international symposium on Computer architecture
IEEE Transactions on Parallel and Distributed Systems
A compiler approach to managing storage and memory bandwidth in configurable architectures
ACM Transactions on Design Automation of Electronic Systems (TODAES)
A framework for low energy data management in reconfigurable multi-context architectures
Journal of Systems Architecture: the EUROMICRO Journal
Strength reduction of integer division and modulo operations
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Optimizing data locality using array tiling
Proceedings of the International Conference on Computer-Aided Design
Hi-index | 0.00 |
This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.