Locality-conscious workload assignment for array-based computations in MPSOC architectures

Authors:
Feihui Li;Mahmut Kandemir
Affiliations:
The Pennsylvania State University;The Pennsylvania State University
Venue:
Proceedings of the 42nd annual Design Automation Conference
Year:
2005

Citing 15
Cited 6

A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The case for a single-chip multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Memory exploration for low power, embedded systems

Proceedings of the 36th annual ACM/IEEE Design Automation Conference
A Chip-Multiprocessor Architecture with Speculative Multithreading

IEEE Transactions on Computers
An energy saving strategy based on adaptive loop parallelization

Proceedings of the 39th annual Design Automation Conference
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design

Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design
Access ordering and memory-conscious cache utilization

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Exploiting Processor Workload Heterogeneity for Reducing Energy Consumption in Chip Multiprocessors

Proceedings of the conference on Design, automation and test in Europe - Volume 2
The energy efficiency of CMP vs. SMT for multimedia workloads

Proceedings of the 18th annual international conference on Supercomputing
The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture

Supporting task migration in multi-processor systems-on-chip: a feasibility study

Proceedings of the conference on Design, automation and test in Europe: Proceedings
MPSoC memory optimization using program transformation

ACM Transactions on Design Automation of Electronic Systems (TODAES)
A hybrid memory organization to enhance task migration and dynamic task allocation in NoC-based MPSoCs

Proceedings of the 20th annual conference on Integrated circuits and systems design
Assessing task migration impact on embedded soft real-time streaming multimedia applications

EURASIP Journal on Embedded Systems - Operating System Support for Embedded Real-Time Applications
Multiprocessor, Multithreading and Memory Optimization for On-Chip Multimedia Applications

Journal of Signal Processing Systems
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

While the past research discussed several advantages of multiprocessor-system-on-a-chip (MPSOC) architectures from both area utilization and design verification perspectives over complex single core based systems, compilation issues for these architectures have relatively received less attention. Programming MPSOCs can be challenging as several potentially conflicting issues such as data locality, parallelism and load balance across processors should be considered simultaneously. Most of the compilation techniques discussed in the literature for parallel architectures (not necessarily for MPSOCs) are loop based, i.e., they consider each loop nest in isolation. However, one key problem associated with such loop based techniques is that they fail to capture the interactions between the different loop nests in the application. This paper takes a more global approach to the problem and proposes a compiler-driven data locality optimization strategy in the context of embedded MPSOCs. An important characteristic of the proposed approach is that, in deciding the workloads of the processors (i.e., in parallelizing the application) it considers all the loop nests in the application simultaneously. Our experimental evaluation with eight embedded applications shows that the global scheme brings significant power/performance benefits over the conventional loop based scheme.