SimpleFit: A Framework for Analyzing Design Trade-Offs in Raw Architectures

Authors:
Csaba Andras Moritz;Donald Yeung;Anant Agarwal
Affiliations:
Univ. of Massachusetts, Amherst;Univ. of Maryland, College Park;MIT, Cambridge, MA
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2001


Abstract

The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules intertile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, each can be viewed as a gigantic FPGA with coarse-grained tiles in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and off-chip global I/O. If the total chip area is fixed, higher processing power per tile requires larger tiles and hence reduces the total number of tiles on the chip. This paper presents SimpleFit, a novel analytical framework that designers can use to reason about the design space of Raw microprocessors. Our model is also generalizable to multiprocessors on a chip. Based on an architectural model, an application model, and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively. Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar.
Accordingly, for the applications studied, assuming an area equivalent to one billion logic transistors, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
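
The grain-size and balance trade-off described in the abstract can be illustrated with a toy calculation. The Python sketch below is not the SimpleFit model: it only assumes that a tile's area is the sum of its processing, memory, communication, and global I/O areas, and that a fixed chip area is filled with identical tiles. The names `TileDesign`, `tiles_that_fit`, `balance`, and `best_design`, the abstract area units, and the caller-supplied performance function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class TileDesign:
    """Per-tile area split, in arbitrary (hypothetical) area units."""
    proc_area: float  # processing element
    mem_area: float   # local memory
    comm_area: float  # local communication
    io_area: float    # share of off-chip global I/O

    @property
    def area(self) -> float:
        # Grain size: the total area of one tile.
        return self.proc_area + self.mem_area + self.comm_area + self.io_area


def tiles_that_fit(chip_area: float, tile: TileDesign) -> int:
    """Number of whole tiles that fit in a fixed chip area."""
    return int(chip_area // tile.area)


def balance(tile: TileDesign) -> Dict[str, float]:
    """Balance: the fraction of tile area devoted to each resource."""
    total = tile.area
    return {
        "processing": tile.proc_area / total,
        "memory": tile.mem_area / total,
        "communication": tile.comm_area / total,
        "global_io": tile.io_area / total,
    }


def best_design(
    chip_area: float,
    candidates: Iterable[TileDesign],
    perf: Callable[[TileDesign, int], float],
) -> TileDesign:
    """Pick the candidate with the highest score under a caller-supplied
    performance estimate perf(tile, number_of_tiles). Exhaustive search
    is an assumption of this sketch, not the paper's optimization process."""
    return max(candidates, key=lambda t: perf(t, tiles_that_fit(chip_area, t)))


if __name__ == "__main__":
    small = TileDesign(proc_area=4, mem_area=2, comm_area=1, io_area=1)
    big = TileDesign(proc_area=12, mem_area=2, comm_area=1, io_area=1)
    for tile in (small, big):
        print(tile.area, tiles_that_fit(1000, tile), balance(tile))
```

With a chip area of 1000 units, tripling the processing area of the tile in this example doubles the grain size and roughly halves the tile count. Which side of that trade wins depends entirely on the performance function passed to `best_design`, which is where an application model such as the one the paper describes would enter.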