This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution unit and an 8-pipe vector unit capable of 20.8 Gflops for 64-bit operations and 41.6 Gflops for 32-bit operations at the prototype operating frequency of 1.3 GHz. Global memory is directly accessible with processor loads and stores and is globally coherent. The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives. Each BlackWidow node is implemented as a 4-way SMP with up to 128 Gbytes of DDR2 main memory capacity. The system supports common programming models such as MPI and OpenMP, as well as global address space languages such as UPC and CAF. We describe the system architecture and microarchitecture of the processor, memory controller, and router chips. We give preliminary performance results and discuss design tradeoffs.
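The peak rates quoted above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope check, not a description of the actual pipeline: it derives the flops each vector pipe would have to complete per cycle to reach the stated peaks. The abstract does not say how those flops are issued, so no mechanism is assumed. The system-scale figures rest on two further assumptions that the abstract does not confirm: that "32K" means 32 × 1024 processors, and that every 4-way node carries the maximum 128 Gbytes. All function and constant names are illustrative.

```python
# Back-of-envelope check of the figures quoted in the abstract.
CLOCK_GHZ = 1.3          # prototype operating frequency (from the abstract)
VECTOR_PIPES = 8         # vector pipes per processor (from the abstract)
PEAK_GFLOPS_64 = 20.8    # stated 64-bit peak per processor
PEAK_GFLOPS_32 = 41.6    # stated 32-bit peak per processor


def flops_per_pipe_per_cycle(peak_gflops, pipes=VECTOR_PIPES, clock_ghz=CLOCK_GHZ):
    """Flops each pipe must complete per cycle to reach the stated peak."""
    return peak_gflops / (pipes * clock_ghz)


def system_totals(processors, procs_per_node=4, gbytes_per_node=128):
    """Aggregate 64-bit peak (Tflops) and maximum memory (Tbytes)."""
    nodes = processors // procs_per_node
    peak_tflops = processors * PEAK_GFLOPS_64 / 1000.0
    memory_tbytes = nodes * gbytes_per_node / 1024.0
    return nodes, peak_tflops, memory_tbytes


if __name__ == "__main__":
    print("64-bit flops/pipe/cycle:", round(flops_per_pipe_per_cycle(PEAK_GFLOPS_64), 3))
    print("32-bit flops/pipe/cycle:", round(flops_per_pipe_per_cycle(PEAK_GFLOPS_32), 3))

    # ASSUMPTION: "32K processors" is read as 32 * 1024, fully populated nodes.
    nodes, tflops, tbytes = system_totals(32 * 1024)
    print("nodes:", nodes)
    print("aggregate 64-bit peak (Tflops):", round(tflops, 1))
    print("maximum memory (Tbytes):", round(tbytes, 1))
```

Under these assumptions the stated peaks correspond to 2 flops per pipe per cycle for 64-bit operations and 4 for 32-bit operations. That is consistent with the 32-bit rate being exactly twice the 64-bit rate.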