Evaluating titanium SPMD programs on the Tera MTA
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Parallelization of a Dynamic Unstructured Algorithm Using Three Leading Programming Paradigms
IEEE Transactions on Parallel and Distributed Systems
Demonstrating the Scalability of a Molecular Dynamics Application on a Petaflops Computer
International Journal of Parallel Programming
Dissecting Cyclops: a detailed analysis of a multithreaded architecture
ACM SIGARCH Computer Architecture News
Early Experience with Scientific Programs on the Cray MTA-2
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Optimizing NANOS OpenMP for the IBM Cyclops Multithreaded Architecture
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Evaluating the impact of simultaneous multithreading on network servers using real hardware
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Proceedings of the 19th annual international conference on Supercomputing
Extensible control architectures
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
A case study of multi-threading in the embedded space
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Irregular computations in Fortran - expression and implementation strategies
Scientific Programming
Perflint: A Context Sensitive Performance Advisor for C++ Programs
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Evaluation of OpenMP for the cyclops multithreaded architecture
WOMPAT'03 Proceedings of the OpenMP applications and tools 2003 international conference on OpenMP shared memory parallel programming
Improving SMT performance scheduling processes
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
Reassortment Networks and the Evolution of Pandemic H1N1 Swine-Origin Influenza
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Exploring irregular memory accesses on FPGAs
Proceedings of the first workshop on Irregular applications: architectures and algorithm
Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Parallel solution of the subset-sum problem: an empirical study
Concurrency and Computation: Practice & Experience
Compiled multithreaded data paths on FPGAs for dynamic workloads
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Hi-index | 0.00 |
The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and to keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA.In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes.We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes range from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads.Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near-perfect, but the others are reduced accordingly.Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in a single processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead with running on multiple processors. We characterize the overhead of running "frays" (a collection of threads running on a single processor) and "crews" (a collection of frays, one per processor.)