Vertical stealing: robust, locality-aware do-all workload distribution for 3D MPSoCs

Authors:
Andrea Marongiu;Paolo Burgio;Luca Benini
Affiliations:
University of Bologna, Bologna, Italy;University of Bologna, Bologna, Italy;University of Bologna, Bologna, Italy
Venue:
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Year:
2010


Abstract

In this paper we address the issue of efficient do-all workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency, high-bandwidth access to multiple large memory banks in close spatial proximity. In our implementation, one silicon layer contains multiple processors, whereas one or more DRAM layers stacked on top host a NUMA memory subsystem. To obtain high locality and a balanced workload, we consider a two-step approach. First, a compiler pass analyzes the memory references in a loop and schedules each iteration on the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from their neighbors by implementing runtime support for work stealing.
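
The two-step approach can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's compiler pass or runtime: the function names `assign_iterations` and `run_with_stealing`, the input format (for each iteration, the list of processors that own the data it references), the ring-shaped neighborhood, and the choice to steal from the tail of the victim's queue are all hypothetical.

```python
from collections import Counter, deque


def assign_iterations(accesses, num_procs):
    """Step 1 (static): place each iteration on the processor that owns
    the data it references most often.

    accesses[i] is a non-empty list holding, for every memory reference
    made by iteration i, the id of the processor whose local bank owns
    that data (assumed input format).
    """
    queues = [deque() for _ in range(num_procs)]
    for iteration, owners in enumerate(accesses):
        # Owner with the highest reference count for this iteration.
        best_owner = Counter(owners).most_common(1)[0][0]
        queues[best_owner].append(iteration)
    return queues


def run_with_stealing(queues):
    """Step 2 (dynamic): simulate execution in lock-step rounds.

    A processor with local work runs the next iteration from the head of
    its own queue. An idle processor steals one iteration from the tail
    of the more heavily loaded of its two ring neighbors (assumed
    topology). Returns executed[p], the iterations run by processor p.
    """
    n = len(queues)
    executed = [[] for _ in range(n)]
    while any(queues):
        for p in range(n):
            if queues[p]:
                executed[p].append(queues[p].popleft())
                continue
            neighbors = [(p - 1) % n, (p + 1) % n]
            victim = max(neighbors, key=lambda q: len(queues[q]))
            if queues[victim]:
                executed[p].append(queues[victim].pop())
    return executed


if __name__ == "__main__":
    # Eight iterations, two processors; most data lives on processor 0.
    accesses = [[0, 0, 1], [0], [0, 0], [0, 1, 0], [1, 1, 0], [0], [0], [0]]
    queues = assign_iterations(accesses, num_procs=2)
    print("static assignment:", [list(q) for q in queues])
    print("after stealing:   ", run_with_stealing(queues))
```

In this example the locality-driven assignment leaves processor 1 with a single iteration, so it spends the remaining rounds stealing from processor 0. Stealing trades some locality for balance, since stolen iterations access data in a remote bank.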