Early experiences with large-scale Cray XMT systems

Authors:
David Mizell; Kristyn Maschhoff
Affiliations:
Cray Inc., Seattle, WA, USA; Cray Inc., Seattle, WA, USA
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Citing 0
Cited 9

Scalable Graph Exploration on Multicore Processors
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

A highly parallel implementation of k-means for multithreaded architecture
Proceedings of the 19th High Performance Computing Symposia

Parallel breadth-first search on distributed memory systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Large-scale continuous subgraph queries on streams
Proceedings of the first annual workshop on High performance computing meets databases

Critical path-based thread placement for NUMA systems
ACM SIGMETRICS Performance Evaluation Review

Breaking the speed and scalability barriers for graph exploration on distributed-memory machines
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Understanding parallelism in graph traversal on multi-core clusters
Computer Science - Research and Development

Massive data analytics: the graph 500 on IBM Blue Gene/Q
IBM Journal of Research and Development

A GPU-based method for computing eigenvector centrality of gene-expression networks
AusPDC '13 Proceedings of the Eleventh Australasian Symposium on Parallel and Distributed Computing - Volume 140

Abstract

Several 64-processor XMT systems have now been shipped to customers, and 128-, 256-, and 512-processor systems have been tested in Cray's development lab. We describe some techniques we have used for tuning performance in the hope that applications would continue to scale on these larger systems. We discuss how the programmer must work with the XMT compiler to extract maximum parallelism and performance, especially from multiply nested loops, and how the performance tools provide vital information about whether and how the compiler has parallelized loops and where performance bottlenecks may be occurring. We also show data indicating that the maximum performance of a given application on an XMT system of a given size is limited by memory or network bandwidth, in a way that is somewhat independent of the number of processors used.