An interleaved cache clustered VLIW processor

Authors:
Enric Gibert;Jesús Sánchez;Antonio González
Affiliations:
Universitat Politècnica de Catalunya, Barcelona - SPAIN;Universitat Politècnica de Catalunya, Barcelona - SPAIN;Universitat Politècnica de Catalunya, Barcelona - SPAIN
Venue:
ICS '02 Proceedings of the 16th international conference on Supercomputing
Year:
2002

Abstract

Clustered microarchitectures are becoming a common organization due to their potential to reduce the penalties caused by wire delays and power consumption. Fully-distributed architectures are particularly effective at dealing with these constraints, and they are also very scalable. However, distributing the data cache poses a significant challenge and may be critical for performance. This work proposes a distributed data cache VLIW architecture based on an interleaved cache organization, together with cyclic scheduling techniques for it. It also introduces Attraction Buffers for such an architecture. Attraction Buffers are a novel hardware mechanism that increases the percentage of local accesses by allowing some data to move towards the clusters that need it.

Performance results for 9 Mediabench benchmarks show that our scheduling techniques are able to hide the increased memory latency of accessing data mapped to a remote cluster. In addition, the local hit ratio increases by 15% and stall time is reduced by 30% when the same scheduling techniques are used on an interleaved cache clustered processor with Attraction Buffers. Finally, the proposed architecture is compared with a state-of-the-art distributed architecture, the multiVLIW. Results show that an interleaved cache clustered VLIW processor with Attraction Buffers performs similarly to the multiVLIW architecture while having lower hardware complexity.
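
As a rough illustration of what "local" and "remote" accesses mean in an interleaved cache organization, the sketch below maps each address to a home cluster and measures the fraction of accesses that are local to the issuing cluster. The abstract does not state the interleaving granularity or the number of clusters, so WORD_BYTES, NUM_CLUSTERS, and all function names here are assumptions chosen only for the example, not the authors' design.

```python
# Hypothetical parameters: the abstract does not specify either value.
WORD_BYTES = 4      # assumed interleaving granularity (one word)
NUM_CLUSTERS = 4    # assumed number of clusters


def home_cluster(address: int) -> int:
    """Cluster whose cache module holds this address under the assumed interleaving."""
    return (address // WORD_BYTES) % NUM_CLUSTERS


def is_local(address: int, issuing_cluster: int) -> bool:
    """True when the access is served by the issuing cluster's own cache module."""
    return home_cluster(address) == issuing_cluster


def local_ratio(addresses, issuing_cluster: int) -> float:
    """Fraction of the given accesses that are local to the issuing cluster."""
    addresses = list(addresses)
    local = sum(is_local(a, issuing_cluster) for a in addresses)
    return local / len(addresses)


if __name__ == "__main__":
    # Unit-stride word accesses spread evenly over all clusters.
    print(local_ratio(range(0, 64, WORD_BYTES), issuing_cluster=0))
    # A stride of NUM_CLUSTERS words always lands on the same cluster.
    print(local_ratio(range(0, 64, WORD_BYTES * NUM_CLUSTERS), issuing_cluster=0))
```

Under these assumptions, a loop whose memory stride is a multiple of the cluster count keeps every access in one cluster, which is the kind of property a scheduler can exploit when assigning memory instructions to clusters.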