Coarse grain task parallel processing with cache optimization on shared memory multiprocessor

Authors:
Kazuhisa Ishizaka; Motoki Obata; Hironori Kasahara
Affiliations:
Dept. EECE, Waseda University, Tokyo, Japan; Dept. EECE, Waseda University, Tokyo, Japan; Dept. EECE, Waseda University, Tokyo, Japan
Venue:
LCPC'01 Proceedings of the 14th international conference on Languages and compilers for parallel computing
Year:
2001

Abstract

In multiprocessor systems, the gap between peak and effective performance has been growing. To close this gap, it is important to exploit multigrain parallelism in addition to ordinary loop-level parallelism. Effective use of the memory hierarchy is also important for improving multiprocessor performance, because the speed gap between processors and memories is widening. This paper describes coarse grain task parallel processing that exploits parallelism among macro-tasks, such as loops and subroutines, while optimizing cache use through a data localization scheme. The proposed scheme is implemented in the OSCAR automatic multigrain parallelizing compiler, which generates an OpenMP FORTRAN program realizing the scheme from a sequential FORTRAN77 program. Its performance is evaluated on an 8-processor IBM RS6000 SP 604e High Node SMP machine using the SPEC95fp benchmarks tomcatv, swim, and mgrid. In the evaluation, the proposed coarse grain task parallel processing scheme with cache optimization achieves speedups over sequential execution of up to 1.3 times on 1 PE, 4.7 times on 4 PEs, and 8.8 times on 8 PEs.
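
The general idea of pairing coarse grain tasks with data localization can be illustrated with a toy example. The following is a minimal sketch, not the OSCAR compiler's algorithm: it assumes each macro-task mainly touches one data partition, ignores dependences between tasks, and places all tasks sharing a partition on the same processing element (PE) so that data brought into that PE's cache may be reused. The task names, partition numbers, costs, and the `localize` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical macro-tasks: (name, data partition mainly accessed, estimated cost).
# In this toy setup each loop has been split into two parts, one per partition.
MACRO_TASKS = [
    ("loop1_part0", 0, 4), ("loop1_part1", 1, 4),
    ("loop2_part0", 0, 3), ("loop2_part1", 1, 3),
    ("loop3_part0", 0, 2), ("loop3_part1", 1, 2),
]


def localize(tasks, num_pes):
    """Group tasks by data partition, then give each group to the least-loaded PE.

    Assumption: tasks in a group have no unmet dependences on other groups,
    so each group can run back to back on a single PE.
    """
    groups = defaultdict(list)
    for name, partition, cost in tasks:
        groups[partition].append((name, cost))

    schedule = {pe: [] for pe in range(num_pes)}
    load = {pe: 0 for pe in range(num_pes)}

    # Place the most expensive groups first to keep the PE loads balanced.
    ordered = sorted(groups.values(), key=lambda g: -sum(c for _, c in g))
    for members in ordered:
        pe = min(load, key=load.get)  # PE with the smallest load so far
        schedule[pe].extend(name for name, _ in members)
        load[pe] += sum(cost for _, cost in members)
    return schedule, load


if __name__ == "__main__":
    schedule, load = localize(MACRO_TASKS, num_pes=2)
    for pe in schedule:
        print(f"PE{pe}: {schedule[pe]} (load {load[pe]})")
```

With two PEs, every task touching partition 0 lands on one PE and every task touching partition 1 on the other, so consecutive tasks on a PE work on the same data. A real scheduler would also have to respect dependences among macro-tasks and size the partitions to fit the cache.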