Assessing the performance of OpenMP programs on the Intel Xeon Phi

Authors:
Dirk Schmidl;Tim Cramer;Sandra Wienke;Christian Terboven;Matthias S. Müller
Affiliations:
Center for Computing and Communication, RWTH Aachen University, Aachen, Germany,JARA High-Performance Computing, Aachen, Germany;Center for Computing and Communication, RWTH Aachen University, Aachen, Germany,JARA High-Performance Computing, Aachen, Germany;Center for Computing and Communication, RWTH Aachen University, Aachen, Germany,JARA High-Performance Computing, Aachen, Germany;Center for Computing and Communication, RWTH Aachen University, Aachen, Germany,JARA High-Performance Computing, Aachen, Germany;Center for Computing and Communication, RWTH Aachen University, Aachen, Germany,Chair for High Performance Computing, RWTH Aachen University, Aachen, Germany,JARA High-Performance Computing, Aache ...
Venue:
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Year:
2013

Abstract

The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well established in the HPC community, namely MPI and OpenMP, thus removing the need to refactor codes for accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi can also be used stand-alone, meaning codes can be executed directly on the device without any interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip when its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the stand-alone Xeon Phi in terms of scalability and performance using OpenMP codes. Considered as individual SMP systems, both come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture are applicable to the Xeon Phi many-core architecture and which challenges the changing ratio of core count to single-core performance poses for the application programmer.