Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

Authors:
Chenjie Yu;Peter Petrov
Affiliations:
University of Maryland;University of Maryland
Venue:
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Year:
2010

Citing 44
Cited 0

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Compiler code transformations for superscalar-based high performance systems

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Undecidability of static analysis

ACM Letters on Programming Languages and Systems (LOPLAS)
The undecidability of aliasing

ACM Transactions on Programming Languages and Systems (TOPLAS)
Lazy release consistency for hardware-coherent multiprocessors

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Maximizing parallelism and minimizing synchronization with affine transforms

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Pointer analysis for multithreaded programs

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Selective cache ways: on-demand cache resource allocation

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The directory-based cache coherence protocol for the DASH multiprocessor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Unification-based pointer analysis with directional assignments

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Pointer and escape analysis for multithreaded programs

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Pointer analysis: haven't we solved this problem yet?

PASTE '01 Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors

Proceedings of the 2002 international symposium on Low power electronics and design
Imagine: Media Processing with Streams

IEEE Micro
Points-to analysis using BDDs

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
A highly configurable cache architecture for embedded systems

Proceedings of the 30th annual international symposium on Computer architecture
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
The future of multiprocessor systems-on-chips

Proceedings of the 41st annual Design Automation Conference
A comparison of sequential consistency with home-based lazy release consistency for software distributed shared memory

Proceedings of the 18th annual international conference on Supercomputing
Coherence decoupling: making use of incoherence

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions

Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Exploring the energy efficiency of cache coherence protocols in single-chip multi-processors

GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Proceedings of the 32nd annual international symposium on Computer Architecture
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Overview of the MPSoC design challenge

Proceedings of the 43rd annual Design Automation Conference
Cache coherence tradeoffs in shared-memory MPSoCs

ACM Transactions on Embedded Computing Systems (TECS)
The M5 Simulator: Modeling Networked Systems

IEEE Micro
Proximity-aware directory-based coherence for multi-core processor architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Software behavior oriented parallelization

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
A self-tuning configurable cache

Proceedings of the 44th annual Design Automation Conference
A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors

Proceedings of the 13th international symposium on Low power electronics and design
An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors

Transactions on High-Performance Embedded Architectures and Compilers I

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a framework for performance-, bandwidth-, and energy-efficient intercore communication in embedded MultiProcessor Systems-on-a-Chip (MPSoC). The methodology seamlessly integrates compiler, operating system, and hardware support to achieve a low-cost communication between synchronized producers and consumers. The technique is especially beneficial for data-streaming applications exploiting pipeline parallelism with computational phases mapped to separate cores. Code transformations utilizing a simple ISA support ensure that producer writes are propagated to consumers with a single interconnect transaction per cache block just prior to the producer exiting its synchronization region. Furthermore, in order to completely eliminate misses to shared data caused by interference with private data and also to minimize the cache energy, we integrate to the proposed framework a cache way partitioning policy based on a simple cache configurability support, which isolates the shared buffers from other cache traffic. This mechanism results in significant power savings since only a subset of the cache ways needs to be looked up for each cache access. The end result of the proposed framework is a single communication transaction per shared cache block between a producer and a consumer with no coherence misses on the consumer caches. Our experiments demonstrate significant reductions in interconnect traffic, cache misses, and energy for a set of multiprocessor benchmarks.