It is remarkably easy to offload processing to Intel's newest manycore coprocessor, the Xeon Phi: it supports a popular ISA (x86), a popular OS (Linux), and a popular programming model (OpenMP). Unfortunately, easy portability does not automatically ensure high performance. Additional programmer effort is necessary to leverage the new performance-oriented hardware features, but programmer optimizations alone are insufficient. Multiprocessing is also necessary to improve hardware utilization, and Linux makes it easy for processes to share the manycore coprocessor. However, multiprocessing inefficiencies can easily offset the gains made by the programmer. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that multiprocessing on coprocessors not only slows down the processes but also introduces unreliability: some processes crash unexpectedly.

We propose a new user-level middleware called COSMIC that improves the performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC fits seamlessly into the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically carry programmer-driven performance directives such as thread and affinity requirements. Unlike the existing Xeon Phi software stack, COSMIC fairly schedules both processes and offloads, and takes into account the conflicting requirements of offloads belonging to different processes. This yields two clear benefits. First, COSMIC improves multiprocessing performance by preventing thread and memory oversubscription, avoiding inter-offload interference, and reducing load imbalance across coprocessors and cores. Second, it increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes.
Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3x, reduces makespan by up to 52%, reduces average process latency (turnaround time) by 70%, and completely eliminates process crashes.
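The core idea described above, admitting an offload onto a coprocessor only when its declared thread and memory requirements fit within the remaining budget, can be sketched as a small admission-control scheduler. This is a hypothetical illustration, not COSMIC's actual implementation: the `Coprocessor` and `Scheduler` classes, their fields, and the FIFO dispatch policy are all assumptions made for the sake of the example.

```python
from collections import deque

class Coprocessor:
    """Hypothetical model of one coprocessor's free thread and memory budgets."""
    def __init__(self, threads, mem_gb):
        self.free_threads = threads
        self.free_mem = mem_gb

class Scheduler:
    """Sketch of COSMIC-style admission control: an offload is dispatched only
    if its programmer-declared thread and memory requirements fit the free
    budget, so threads and memory are never oversubscribed."""
    def __init__(self, device):
        self.device = device
        self.pending = deque()   # FIFO queue preserves fairness across processes
        self.running = []

    def submit(self, name, threads, mem_gb):
        self.pending.append((name, threads, mem_gb))
        self._dispatch()

    def _dispatch(self):
        while self.pending:
            name, t, m = self.pending[0]
            if t > self.device.free_threads or m > self.device.free_mem:
                break  # head-of-line offload must wait: no oversubscription
            self.pending.popleft()
            self.device.free_threads -= t
            self.device.free_mem -= m
            self.running.append((name, t, m))

    def finish(self, name):
        for job in self.running:
            if job[0] == name:
                self.running.remove(job)
                self.device.free_threads += job[1]
                self.device.free_mem += job[2]
                break
        self._dispatch()  # freed capacity may admit waiting offloads
```

For example, on a device modeled with 240 hardware threads and 8 GB of memory, an offload requesting all 240 threads runs immediately, while a second offload requesting 120 threads waits until the first finishes, rather than oversubscribing the cores.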