A study of a software cache implementation of the OpenMP memory model for multicore and manycore architectures

Authors:
Chen Chen;Joseph B. Manzano;Ge Gan;Guang R. Gao;Vivek Sarkar
Affiliations:
Tsinghua University, Beijing, P.R. China;University of Delaware, Newark, DE;University of Delaware, Newark, DE;University of Delaware, Newark, DE;Rice University, Houston, TX
Venue:
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Year:
2010

Abstract

This paper is motivated by the desire to provide an efficient and scalable software cache implementation of OpenMP on multicore and manycore architectures in general, and on the IBM Cell architecture in particular. We propose an instantiation of the OpenMP memory model with the following advantages: (1) It prohibits undefined values, which can cause problems for safety, security, programming, and debugging. (2) It scales with the number of threads because it relies neither on communication among threads nor on a centralized directory that maintains consistency among multiple copies of each shared variable. (3) It avoids the ambiguity of the original memory model definition given in the OpenMP Specification 3.0. We also introduce a new cache protocol for this instantiation, which can be implemented as a software-controlled cache. Experimental results on the Cell Broadband Engine show that our instantiation achieves nearly linear speedup with respect to the number of threads for a number of NAS Parallel Benchmarks. The results also show a clear advantage over a software cache design derived from a stronger memory model that maintains a global total ordering among flush operations.
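
The abstract does not spell out the cache protocol, so the following is only a minimal illustrative sketch of the general idea it describes: each thread keeps a private software cache, and consistency is reached through flush operations against shared memory rather than through messages between threads or a central directory. The class names `SharedMemory` and `SoftwareCache`, the method names, and the write-back-then-invalidate behavior of `flush` are all assumptions made for illustration, not the paper's actual design.

```python
class SharedMemory:
    """Hypothetical stand-in for main memory shared by all threads."""

    def __init__(self, size):
        self.cells = [0] * size  # every cell starts with a defined value


class SoftwareCache:
    """Hypothetical per-thread cache; it never contacts other caches
    or a directory, only the shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> locally cached value
        self.dirty = set()  # addresses written since the last flush

    def read(self, addr):
        # On a miss, fetch from shared memory; on a hit, use the local copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory.cells[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Writes stay local until this thread flushes.
        self.lines[addr] = value
        self.dirty.add(addr)

    def flush(self):
        # Assumed flush semantics: write back dirty values, then drop
        # all local copies so later reads refetch from shared memory.
        for addr in sorted(self.dirty):
            self.memory.cells[addr] = self.lines[addr]
        self.dirty.clear()
        self.lines.clear()


def demo():
    mem = SharedMemory(4)
    t0, t1 = SoftwareCache(mem), SoftwareCache(mem)
    t1.read(0)           # t1 caches the initial value
    t0.write(0, 42)      # visible only in t0's cache
    before = t1.read(0)  # t1 still sees its old, but defined, copy
    t0.flush()           # t0 writes 42 back to shared memory
    stale = t1.read(0)   # t1 has not flushed, so its copy is unchanged
    t1.flush()           # t1 discards its local copies
    after = t1.read(0)   # refetched from shared memory
    return before, stale, after
```

In this sketch a thread observes another thread's write only after the writer and the reader have each flushed, and no read ever returns an undefined value. The flushes of different threads are not ordered against each other by any global mechanism, which is the kind of weaker guarantee the abstract contrasts with a global total ordering among flush operations.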