Intel recently introduced the Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture, which features 60 cores and a peak performance of 1.0 Tflop/s. NASA has deployed a 128-node SGI Rackable system in which each node has two 8-core Intel Xeon E5-2670 (Sandy Bridge) processors and two Xeon Phi 5110P coprocessors. We conducted an early performance evaluation of the Xeon Phi. We used microbenchmarks to measure memory and interconnect latency and bandwidth, I/O rates, and the performance of OpenMP directives and MPI functions. We also used the OpenMP and MPI versions of the NAS Parallel Benchmarks, along with two production CFD applications, to test four programming modes: offload, processor native, coprocessor native, and symmetric (processor plus coprocessor). In this paper we present preliminary results from this evaluation of a Phi-based system.
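As a rough illustration of where a peak figure of about 1.0 Tflop/s for a 60-core coprocessor can come from, the sketch below multiplies core count, clock rate, and floating-point operations per cycle. The clock rate (1.053 GHz) and the 16 double-precision flops per cycle per core (an 8-wide vector unit with fused multiply-add) are assumptions taken from the commonly quoted Xeon Phi 5110P specification, not from the abstract, and the function name `peak_tflops` is hypothetical.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in Tflop/s: cores x clock (GHz) x flops per cycle per core."""
    gflops = cores * clock_ghz * flops_per_cycle  # GHz x flops/cycle gives Gflop/s
    return gflops / 1000.0                        # convert Gflop/s to Tflop/s


# Assumed values for one Xeon Phi 5110P (not stated in the abstract):
# 60 cores, 1.053 GHz, 16 double-precision flops per cycle per core.
PHI_PEAK = peak_tflops(cores=60, clock_ghz=1.053, flops_per_cycle=16)

if __name__ == "__main__":
    print(f"Estimated peak per coprocessor: {PHI_PEAK:.2f} Tflop/s")
```

This is a theoretical upper bound only; sustained rates of the kind measured by microbenchmarks and applications are typically lower.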