Landing OpenMP on Cyclops-64: an efficient mapping of OpenMP to a many-core system-on-a-chip

Authors:
Juan del Cuvillo;Weirong Zhu;Guang Gao
Affiliations:
University of Delaware, Newark, DE;University of Delaware, Newark, DE;University of Delaware, Newark, DE
Venue:
Proceedings of the 3rd conference on Computing frontiers
Year:
2006

Abstract

This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. Such a unique architecture presents new opportunities for optimization. Specifically, we consider three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; and (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations reduce the overhead of OpenMP language constructs by 80%. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
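
For readers unfamiliar with spin locks built on atomic read-modify-write operations, the general idea can be illustrated with a minimal sketch. This is a generic ticket lock written with portable C11 atomics. It is not the algorithm from the paper, and it does not use C64 in-memory atomic instructions or thread-level hardware support. The type `ticket_lock_t` and the `ticket_lock_*` functions are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock for illustration only; not the paper's algorithm. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket number currently allowed to enter */
} ticket_lock_t;

static inline void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0u);
    atomic_init(&l->now_serving, 0u);
}

static inline void ticket_lock_acquire(ticket_lock_t *l) {
    /* One atomic fetch-and-add hands each caller a distinct ticket. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1u,
                                            memory_order_relaxed);
    /* Spin with plain loads until this caller's ticket is being served. */
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        /* busy-wait */
    }
}

static inline void ticket_lock_release(ticket_lock_t *l) {
    /* Only the lock holder writes now_serving, so load-then-store is safe. */
    unsigned cur = atomic_load_explicit(&l->now_serving,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1u, memory_order_release);
}
```

A caller would bracket its critical section with `ticket_lock_acquire(&lock)` and `ticket_lock_release(&lock)`. The ticket scheme grants the lock in arrival order and needs only one atomic read-modify-write per acquisition, which is why it is a common baseline when evaluating lock overhead.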