Proceedings of the 27th annual international symposium on Computer architecture
Memory Controller Optimizations for Web Servers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Complexity effective memory access scheduling for many-core accelerator architectures
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Programming Massively Parallel Processors: A Hands-on Approach
Programming Massively Parallel Processors: A Hands-on Approach
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
Improving GPU performance via large warps and two-level warp scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Parallel application memory scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
The case for GPGPU spatial multitasking
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Fine-grained resource sharing for concurrent GPGPU kernels
HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
IEEE Computer Architecture Letters
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Improving GPGPU concurrency with elastic kernels
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Cache-Conscious Wavefront Scheduling
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Orchestrated scheduling and prefetching for GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
The available computing resources in modern GPUs are growing with each new generation. However, as many general purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute resources might not be optimally utilized. To address this, modern GPUs will need to execute multiple kernels simultaneously. As current generations of GPUs (e.g., NVIDIA Kepler, AMD Radeon) already enable concurrent execution of kernels from the same application, in this paper we address the next logical step: executing multiple concurrent applications in GPUs. We show that while this paradigm has a potential to improve the overall system performance, negative interactions among concurrently executing applications in the memory system can severely hamper the performance and fairness among applications. We show that the current application agnostic GPU memory system design can (1) lead to sub-optimal GPU performance; and (2) create significant imbalance in performance slowdowns across kernels. Thus, we argue that GPU memory system should be augmented with application awareness. As one example to the applicability of this concept, we augment the memory system hardware with application awareness such that requests from different applications can be scheduled in a round robin (RR) fashion while still preserving the benefits of the current first-ready FCFS (FR-FCFS) memory scheduling policy. Evaluations with different multi-application workloads demonstrate that the proposed memory scheduling policy, first-ready round-robin FCFS (FR-RR-FCFS), improves fairness and delivers better system performance compared to the existing FR-FCFS memory scheduling scheme.