Synchronization and communication in the T3E multiprocessor

Authors:
Steven L. Scott
Affiliations:
Cray Research, Inc.
Venue:
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Year:
1996

Abstract

This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization.

The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
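
Why keeping dozens of requests outstanding matters can be illustrated with Little's law (requests in flight = request rate × latency). The sketch below is a back-of-the-envelope model only, not a description of the T3E hardware: the function names are hypothetical, and the 1000 ns latency, 25 ns issue interval, and 8-byte request size are made-up illustrative values, not measurements from the paper.

```python
def sustained_bandwidth_gb_s(outstanding: int, bytes_per_request: int,
                             latency_ns: float) -> float:
    """Bandwidth one processor can sustain when `outstanding` requests
    overlap, each taking `latency_ns` to complete (Little's law).
    Bytes per nanosecond is numerically equal to GB/s."""
    return outstanding * bytes_per_request / latency_ns


def requests_in_flight_needed(latency_ns: float, issue_interval_ns: float) -> float:
    """Number of overlapping requests required to issue one request every
    `issue_interval_ns` without stalling on a `latency_ns` round trip."""
    return latency_ns / issue_interval_ns


if __name__ == "__main__":
    # ASSUMED values for illustration only; not T3E measurements.
    latency_ns = 1000.0
    word_bytes = 8
    for n in (1, 8, 32):
        bw = sustained_bandwidth_gb_s(n, word_bytes, latency_ns)
        print(f"{n:2d} outstanding -> {bw:.3f} GB/s")
    print(requests_in_flight_needed(latency_ns, 25.0), "requests to hide latency")
```

Under these assumed numbers, a single blocking request uses only a small fraction of the bandwidth that the same latency permits with many requests in flight, which is the general motivation for a deeply pipelined interface to remote memory.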