Latency and bandwidth efficient communication through system customization for embedded multiprocessors

Authors:
Chenjie Yu;Peter Petrov
Affiliations:
University of Maryland College Park;University of Maryland College Park
Venue:
Proceedings of the 45th annual Design Automation Conference
Year:
2008

Citing 10
Cited 3

MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors

Proceedings of the 2002 international symposium on Low power electronics and design
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Exploring the energy efficiency of cache coherence protocols in single-chip multi-processors

GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Proceedings of the 32nd annual international symposium on Computer Architecture
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
The M5 Simulator: Modeling Networked Systems

IEEE Micro
Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis

Broadcast filtering: Snoop energy reduction in shared bus-based low-power MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems

Journal of Parallel and Distributed Computing
Energy and throughput efficient transactional memory for embedded multicore systems

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a cross-layer customization methodology for latency and bandwidth efficient inter-core communication in embedded multiprocessors. The methodology integrates compiler, operating system, and hardware support to achieve a bandwidth efficient, snoop-free, and coherence cache miss-free shared memory communication between synchronized producer and consumers cores. A compiler-driven code transformation is introduced that utilizes a simple ISA support in the form of a special write-through store instruction. It ensures that producer writes are propagated to the consumers with a single bus transaction per cache block when the producer performs the last write to that cache line before exiting its synchronization region. Information regarding the shared buffers involved in the communications is captured by the OS and provided to the cores with the purpose of filtering bus traffic and performing remote updates when necessary. The end result of the proposed methodology is a single bus transaction per shared cache block and snoop-free communication between a producer and a set of consumers with no intervening coherence misses on the consumer caches. Our experiments demonstrate the significant reductions in both bus traffic and cache misses for a set of multiprocessor benchmarks.