Empirical evaluation of the CRAY-T3D: a compiler perspective

Authors:
Remzi H. Arpaci;David E. Culler;Arvind Krishnamurthy;Steve G. Steinberg;Katherine Yelick
Affiliations:
Computer Science Division, University of California, Berkeley;Computer Science Division, University of California, Berkeley;Computer Science Division, University of California, Berkeley;Computer Science Division, University of California, Berkeley;Computer Science Division, University of California, Berkeley
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995

Abstract

Most recent MPP systems employ a fast microprocessor surrounded by a shell of communication and synchronization logic. The CRAY-T3D provides an elaborate shell to support global-memory access, prefetch, atomic operations, barriers, and block transfers. We provide a detailed empirical performance characterization of these primitives using micro-benchmarks and evaluate their utility in compiling for a parallel language. We have found that the raw performance of the machine is quite impressive and the most effective forms of communication are prefetch and write. Other shell provisions, such as the bulk transfer engine and the external Annex register set, are cumbersome and of little use. By evaluating the system in the context of a language implementation, we shed light on important trade-offs and pitfalls in the machine architecture.