Data distribution support on distributed shared memory multiprocessors
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Extending OpenMP for NUMA machines
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Performance of Cluster-enabled OpenMP for the SCASH Software Distributed Shared Memory System
CCGRID '03 Proceedings of the 3st International Symposium on Cluster Computing and the Grid
Optimal topology exploration for application-specific 3D architectures
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory
Proceedings of the 33rd annual international symposium on Computer Architecture
Die Stacking (3D) Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Locality-Aware Distributed Loop Scheduling for Chip Multiprocessors
VLSID '07 Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference: Embedded Systems
3D-Stacked Memory Architectures for Multi-core Processors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Adaptive work-stealing with parallelism feedback
ACM Transactions on Computer Systems (TOCS)
Proceedings of the conference on Design, automation and test in Europe
PicoServer: Using 3D stacking technology to build energy efficient servers
ACM Journal on Emerging Technologies in Computing Systems (JETC)
Deque-Free Work-Optimal Parallel STL Algorithms
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Is 3D chip technology the next growth engine for performance improvement?
IBM Journal of Research and Development
Efficient OpenMP support and extensions for MPSoCs with explicitly managed memory hierarchy
Proceedings of the Conference on Design, Automation and Test in Europe
System-level power/performance evaluation of 3D stacked DRAMs for mobile applications
Proceedings of the Conference on Design, Automation and Test in Europe
PRO3D: programming for future 3D manycore architectures
Proceedings of the 2012 Interconnection Network Architecture: On-Chip, Multi-Chip Workshop
Application task and data placement in embedded many-core NUMA architectures
Proceedings of the 10th Workshop on Optimizations for DSP and Embedded Systems
Hi-index | 0.00 |
In this paper we address the issue of efficient doall workload distribution on a embedded 3D MPSoC. 3D stacking technology enables low latency and high bandwidth access to multiple, large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated unbalanced workload we allow idle processors to execute part of the remaining work from neighbors by implementing runtime support for work stealing.