This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. This unique architecture presents new opportunities for optimization. Specifically, we consider three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization, based on in-memory atomic instructions and native support for thread-level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations reduce the overhead of OpenMP language constructs by 80%. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.