Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Authors:
Enric Gibert;Jesús Sánchez;Antonio González
Affiliations:
Department of Computer Architecture, Universitat Politècnica de Catalunya, Barcelona - SPAIN;Intel Barcelona Research Center, Intel Labs - Universitat Politècnica de Catalunya, Barcelona - SPAIN;Department of Computer Architecture, Universitat Politècnica de Catalunya, Barcelona - SPAIN and Intel Barcelona Research Center, Intel Labs - Universitat Politècnica de Catalunya, Barce ...
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Abstract

Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache will increase, leading to an important performance impact. In this paper we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible for deciding which memory instructions make use of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and a clustered processor with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.
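
The latency argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not taken from the paper: the function name, its parameters, and the assumption that an L0 miss pays the L0 latency plus the L1 latency are all hypothetical, chosen only to show why routing some loads through a small low-latency buffer can lower the average load latency.

```python
def effective_load_latency(l0_latency, l1_latency, l0_fraction, l0_hit_rate):
    """Average load latency in cycles under a toy model (an assumption,
    not the paper's evaluation methodology).

    l0_fraction: share of loads the compiler directs to the L0 buffer.
    l0_hit_rate: share of those loads that hit in the L0 buffer.
    An L0 miss is assumed to cost l0_latency + l1_latency.
    """
    if not (0.0 <= l0_fraction <= 1.0 and 0.0 <= l0_hit_rate <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    # Loads sent through the buffer: fast on a hit, slower than L1 on a miss.
    via_l0 = (l0_hit_rate * l0_latency
              + (1.0 - l0_hit_rate) * (l0_latency + l1_latency))
    # Remaining loads bypass the buffer and pay the full L1 latency.
    return l0_fraction * via_l0 + (1.0 - l0_fraction) * l1_latency
```

Under this model, sending loads with a poor hit rate through the buffer makes them slower than going to L1 directly, which is one way to see why the choice of which memory instructions use the buffers matters.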