It is remarkably easy to offload processing to Intel's newest manycore coprocessor, the Xeon Phi: it supports a popular ISA (x86), a popular OS (Linux), and a popular programming model (OpenMP). Unfortunately, easy portability does not automatically ensure high performance. Additional programmer effort is necessary to leverage the new performance-oriented hardware features, but programmer optimizations alone are insufficient. Multiprocessing is also necessary to improve hardware utilization, and Linux makes it easy for processes to share the manycore coprocessor. However, multiprocessing inefficiencies can easily offset the gains made by the programmer. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that multiprocessing on coprocessors not only slows down the processes but also introduces unreliability: some processes crash unexpectedly.

We propose a new user-level middleware called COSMIC that improves the performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC fits seamlessly into the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically carry programmer-driven performance directives such as thread and affinity requirements. Unlike the existing Xeon Phi software stack, COSMIC fairly schedules both processes and offloads, and takes into account the conflicting requirements of offloads belonging to different processes. This yields two clear benefits. First, COSMIC improves multiprocessing performance by preventing thread and memory oversubscription, avoiding inter-offload interference, and reducing load imbalance across coprocessors and cores. Second, it increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes.
Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3x, reduces makespan by up to 52%, reduces average process latency (turnaround time) by 70%, and completely eliminates process crashes.
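The core idea described above, admitting an offload onto a coprocessor only when its declared thread and memory requirements fit within the remaining budget, can be sketched as a small admission-control scheduler. This is a hypothetical illustration, not COSMIC's actual implementation: the `Coprocessor` and `Scheduler` classes, their fields, and the FIFO dispatch policy are all assumptions made for the sake of the example.

```python
from collections import deque

class Coprocessor:
    """Hypothetical model of one coprocessor's free thread and memory budgets."""
    def __init__(self, threads, mem_gb):
        self.free_threads = threads
        self.free_mem = mem_gb

class Scheduler:
    """Sketch of COSMIC-style admission control: an offload is dispatched only
    if its programmer-declared thread and memory requirements fit the free
    budget, so threads and memory are never oversubscribed."""
    def __init__(self, device):
        self.device = device
        self.pending = deque()   # FIFO queue preserves fairness across processes
        self.running = []

    def submit(self, name, threads, mem_gb):
        self.pending.append((name, threads, mem_gb))
        self._dispatch()

    def _dispatch(self):
        while self.pending:
            name, t, m = self.pending[0]
            if t > self.device.free_threads or m > self.device.free_mem:
                break  # head-of-line offload must wait: no oversubscription
            self.pending.popleft()
            self.device.free_threads -= t
            self.device.free_mem -= m
            self.running.append((name, t, m))

    def finish(self, name):
        for job in self.running:
            if job[0] == name:
                self.running.remove(job)
                self.device.free_threads += job[1]
                self.device.free_mem += job[2]
                break
        self._dispatch()  # freed capacity may admit waiting offloads
```

For example, on a device modeled with 240 hardware threads and 8 GB of memory, an offload requesting all 240 threads runs immediately, while a second offload requesting 120 threads waits until the first finishes, rather than oversubscribing the cores.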