Maps: a compiler-managed memory system for raw machines

Authors:
Rajeev Barua;Walter Lee;Saman Amarasinghe;Anant Agarwal
Affiliations:
M.I.T. Laboratory for Computer Science, Cambridge, MA;M.I.T. Laboratory for Computer Science, Cambridge, MA;M.I.T. Laboratory for Computer Science, Cambridge, MA;M.I.T. Laboratory for Computer Science, Cambridge, MA
Venue:
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Year:
1999

Abstract

This paper describes Maps, a compiler-managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," in which machine performance is limited by high memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate memory accesses for low latency and parallelism while enforcing proper dependences. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields a roughly 20-fold speedup on 32 tiles for our regular applications and about a 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.
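
The abstract names modulo unrolling as one of the techniques behind static promotion but does not describe it. The sketch below gives one possible reading of the idea, not the paper's implementation. It assumes that an array is interleaved element by element across the memory banks of the tiles, so that element `i` lives in bank `i mod N`. That layout is an assumption made here for illustration. Under it, a reference `A[i]` inside an ordinary loop touches every bank over time, whereas unrolling the loop by `N` leaves each reference in the loop body tied to a single bank that is known before the program runs. The names `N_BANKS`, `bank_of`, `banks_touched_rolled`, and `banks_touched_unrolled` are hypothetical.

```python
# Illustrative sketch only. The interleaving scheme and all names below are
# assumptions made for this example; they are not taken from the paper.

N_BANKS = 4  # hypothetical number of tiles, each with one memory bank


def bank_of(index, n_banks=N_BANKS):
    """Assumed layout: element `index` is stored in bank `index % n_banks`."""
    return index % n_banks


def banks_touched_rolled(length, n_banks=N_BANKS):
    """Rolled loop `for i in range(length): use(A[i])`.

    Returns the set of banks reached by the single reference A[i].
    More than one bank means its home bank is not a fixed constant.
    """
    return {bank_of(i, n_banks) for i in range(length)}


def banks_touched_unrolled(length, n_banks=N_BANKS):
    """Same loop unrolled by `n_banks`; the body holds A[i+0] .. A[i+n_banks-1].

    Returns, for each reference position k in the unrolled body, the set of
    banks that reference reaches. Leftover iterations (when `length` is not
    a multiple of `n_banks`) are skipped here for brevity.
    """
    per_reference = {k: set() for k in range(n_banks)}
    for i in range(0, length - length % n_banks, n_banks):
        for k in range(n_banks):
            per_reference[k].add(bank_of(i + k, n_banks))
    return per_reference


if __name__ == "__main__":
    print(banks_touched_rolled(16))
    print(banks_touched_unrolled(16))
```

In this toy model, each reference in the unrolled body maps to exactly one bank, which is the property a compiler would need in order to route that access over a statically scheduled network rather than a dynamic one.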