Evaluation of a Multithreaded Architecture for Cellular Computing

Authors:
Jose G. Castanos;Luis Ceze;Karin Strauss;Henry S. Warren Jr
Affiliations:
-;-;-;-
Venue:
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Year:
2002

Citing 0
Cited 20

Dissecting Cyclops: a detailed analysis of a multithreaded architecture

ACM SIGARCH Computer Architecture News
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP

ACM Transactions on Architecture and Code Optimization (TACO)
TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 14 - Volume 15
Fast and fair: data-stream quality of service

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Chip multiprocessing and the cell broadband engine

Proceedings of the 3rd conference on Computing frontiers
Real-time rendering systems in 2010

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures

Proceedings of the 34th annual international symposium on Computer architecture
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

International Journal of Parallel Programming
The multikernel: a new OS architecture for scalable multicore systems

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Evaluation of OpenMP for the cyclops multithreaded architecture

WOMPAT'03 Proceedings of the OpenMP applications and tools 2003 international conference on OpenMP shared memory parallel programming
Hierarchical multithreading: programming model and system software

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
ReMAP: A Reconfigurable Heterogeneous Multicore Architecture

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Performance evaluation of intel's quad core processors for embedded applications

WSEAS Transactions on Computers
Performance evaluation of OpenMP benchmarks on intel's quad core processors

ICCOMP'10 Proceedings of the 14th WSEAS international conference on Computers: part of the 14th WSEAS CSCC multiconference - Volume I
The challenges of efficient code-generation for massively parallel architectures

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Performance modelling and optimization of memory access on cellular computer architecture cyclops64

NPC'05 Proceedings of the 2005 IFIP international conference on Network and Parallel Computing
Factory: an object-oriented parallel programming substrate for deep multiprocessors

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications
Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Design of a collective communication infrastructure for barrier synchronization in cluster-based nanoscale MPSoCs

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe

Quantified Score

Hi-index	0.00

Visualization

Abstract

Cyclops is a new architecture for high performance parallel computers being developed at the IBM T. J. Watson Research Center. The basic cell of this architecture is a single-chip SMP system with multiple threads of execution, embedded memory, and integrated communications hardware. Massive intra-chip parallelism is used to tolerate memory and functional unit latencies. Large systems with thousands of chips can be built by replicating this basic cell in a regular pattern. In this paper we describe the Cyclops architecture and evaluate two of its new hardware features: memory hierarchy with flexible cache organization and fast barrier hardware. Our experiments with the STREAM benchmark show that a particular design can achieve a sustainable memory bandwidth of 40\,GB/s, equal to the peak hardware bandwidth and similar to the performance of a 128-processor SGI Origin 3800. For small vectors, we have observed in-cache bandwidth above 80 GB/s. We also show that the fast barrier hardware can improve the performance of the Splash-2 FFT kernel by up to 10%. Our results demonstrate that the Cyclops approach of integrating a large number of simple processing elements and multiple memory banks in the same chip is an effective alternative for designing high performance systems.