Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors

Authors:
Xi E. Chen;Tor Aamodt
Affiliations:
NVIDIA, Portland;University of British Columbia, Vancouver
Venue:
IEEE Transactions on Computers
Year:
2012

Citing 0
Cited 5

Efficient techniques for predicting cache sharing and throughput

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Data-Intensive Workload Consolidation for the Hadoop Distributed File System

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A queueing theoretic approach for performance evaluation of low-power multi-core embedded systems

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	14.98

Visualization

Abstract

This paper proposes an analytical model for accurately predicting the impact of contention on cache miss rates. The focus is multiprogrammed workloads running on multithreaded manycore architectures. This work addresses a key challenge facing earlier cache contention models as the number of concurrent threads exceeds the associativity of shared caches. The memory access characteristics of individual applications are obtained in isolation by profiling their circular sequences and two new measures of access locality are proposed. An evaluation of this model in the context of a Niagara processor shows that it achieves an average 8.7 percent error in miss rate predictions which improves upon the best prior model by 48.1x. This paper also presents a novel Markov chain throughput model. When combining the contention model with the Markov chain model, throughput is estimated with an average error of 8.3 percent compared to detailed simulation. Moreover, the combined model tracks throughput sufficiently well to find the same optimized design point for application-specific workloads 65 times faster than detailed simulation. This paper also shows that the models accurately predict cache contention and throughput trends across various workloads on real hardware.