MPSoC memory optimization using program transformation

Authors:
Youcef Bouchebaba;Bruno Girodias;Gabriela Nicolescu;El Mostapha Aboulhamid;Bruno Lavigueur;Pierre Paulin
Affiliations:
École Polytechnique de Montréal;École Polytechnique de Montréal;École Polytechnique de Montréal;Université de Montréal;STMicroelectronics;STMicroelectronics
Venue:
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Year:
2007

Abstract

Multiprocessor system-on-a-chip (MPSoC) architectures have received considerable attention in recent years, but few advances in compilation techniques target them, particularly for the exploitation of data locality. Most compilation techniques for parallel architectures discussed in the literature are based on a single loop nest. This article presents new techniques that apply loop fusion and tiling to several loop nests and parallelize the resulting code across different processors. Fusion and tiling reduce the number of memory accesses, but they also increase dependencies and thereby reduce the exploitable parallelism in the code. This article addresses that trade-off. To reduce the memory space used by temporary arrays, the arrays are replaced with smaller buffers, and different strategies are studied to reduce the processing time spent accessing these buffers. The experiments show that these techniques yield a significant reduction in the number of data cache misses (30%) and in processing time (50%).
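
As a rough illustration of the ideas named in the abstract, the sketch below fuses two loop nests and replaces the temporary array between them with a small buffer. It is a minimal example under stated assumptions, not the article's algorithm: the function names (unfused, fused_with_buffer), the producer computation, and the two-entry circular buffer are all hypothetical, and tiling and the distribution across processors are left out.

```python
def unfused(a):
    """Two separate loop nests that communicate through a full temporary array."""
    n = len(a)
    tmp = [0] * n                      # temporary array with n entries
    for i in range(n):                 # nest 1 (producer): fills tmp
        tmp[i] = 2 * a[i]
    out = [0] * n
    for i in range(1, n):              # nest 2 (consumer): reads tmp[i] and tmp[i - 1]
        out[i] = tmp[i] + tmp[i - 1]
    return out


def fused_with_buffer(a):
    """Hypothetical fused version: one loop, tmp replaced by a 2-entry buffer.

    The consumer only ever needs the current and the previous value of tmp,
    so two slots addressed modulo 2 are enough (an assumption that holds for
    this particular access pattern, not in general).
    """
    n = len(a)
    buf = [0, 0]                       # small circular buffer instead of tmp
    out = [0] * n
    for i in range(n):
        buf[i % 2] = 2 * a[i]          # producer statement
        if i >= 1:                     # consumer statement, guarded at the boundary
            out[i] = buf[i % 2] + buf[(i - 1) % 2]
    return out


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(unfused(data))
    print(fused_with_buffer(data))
```

Both functions return the same result, but the fused version touches each produced value while it is still live in a two-entry buffer instead of writing and re-reading an n-entry array. The modulo indexing in the fused loop is one plausible example of the buffer-access overhead the abstract refers to; how the article actually reduces that cost is not stated in the abstract.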