High performance, energy efficiency, and scalability with GALS chip multiprocessors

Authors:
Zhiyi Yu; Bevan M. Baas
Affiliations:
Microelectronics Department, Fudan University, Shanghai, China; Electrical and Computer Engineering Department, University of California at Davis, CA
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2009

Abstract

Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to apply clock and voltage scaling jointly in system submodules to achieve high energy efficiency, and can also result in easily scalable clocking systems. However, its use typically introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show that, for many DSP workloads, its performance (throughput) is on average less than 1% lower than that of the corresponding synchronous system. Furthermore, adaptive clock and voltage scaling for each processor provides approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported performance (throughput) reduction of greater than 10% and energy savings of approximately 25% using dynamic clock and voltage scaling for many general-purpose applications.
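
The claim that large FIFOs hide clock-domain crossing latency can be illustrated with a toy cycle-level model. The sketch below is not the authors' simulator: it assumes a single one-way producer-to-consumer link, a fixed synchronization latency in cycles, and credit-based flow control (the only loop in the model). The function name `simulate_throughput` and all parameters are hypothetical, chosen only for illustration.

```python
from collections import deque


def simulate_throughput(fifo_depth, sync_latency, cycles=10_000):
    """Return words consumed per cycle on one producer -> consumer link.

    Assumed model (not from the paper): each word and each returning
    flow-control credit takes `sync_latency` extra cycles to cross the
    clock-domain boundary.
    """
    credits = fifo_depth         # free FIFO slots known to the producer
    data_in_flight = deque()     # arrival cycle of each word in transit
    credits_in_flight = deque()  # arrival cycle of each returning credit
    fifo_occupancy = 0
    consumed = 0

    for cycle in range(cycles):
        # Credits that finished crossing back become usable again.
        while credits_in_flight and credits_in_flight[0] <= cycle:
            credits_in_flight.popleft()
            credits += 1
        # Words that finished crossing land in the consumer-side FIFO.
        while data_in_flight and data_in_flight[0] <= cycle:
            data_in_flight.popleft()
            fifo_occupancy += 1
        # The consumer takes one word per cycle when one is available.
        if fifo_occupancy > 0:
            fifo_occupancy -= 1
            consumed += 1
            credits_in_flight.append(cycle + sync_latency)
        # The producer sends one word per cycle while it holds a credit.
        if credits > 0:
            credits -= 1
            data_in_flight.append(cycle + 1 + sync_latency)

    return consumed / cycles


if __name__ == "__main__":
    for depth in (2, 4, 8, 16):
        print(depth, round(simulate_throughput(depth, sync_latency=4), 3))
```

In this model a shallow FIFO leaves the producer waiting for credits to complete the round trip, so throughput drops as latency grows, while a FIFO deeper than the round-trip delay keeps the link busy every cycle. This mirrors, under the stated assumptions only, the abstract's argument that the penalty shrinks with larger FIFOs and fewer communication loops.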