Reuse distance based performance modeling and workload mapping

Authors:
Sai Prashanth Muralidhara;Mahmut Kandemir;Orhan Kislal
Affiliations:
Pennsylvania State University, State College, PA, USA;Pennsylvania State University, State College, PA, USA;Pennsylvania State University, State College, PA, USA
Venue:
Proceedings of the 9th conference on Computing Frontiers
Year:
2012

Abstract

Modern multicore architectures connect multiple cores to a hierarchical cache structure, so different subsets of cores share different caches. In these systems, overall throughput and efficiency depend heavily on a careful mapping of applications to the available cores. In this paper, we study application-to-core mapping with the goal of improving overall cache performance in the presence of a hierarchical, multi-level cache structure. We propose sampling the memory access patterns of individual applications to build their reuse distance distributions, and then using these distributions to compute an application-to-core mapping that aims to improve overall cache performance and, consequently, overall throughput. We show that the proposed mapping scheme is effective in practice, yielding throughput benefits of about 39% over the worst-case mapping and about 30% over the default operating-system-based mapping. As larger chip multiprocessors with deeper cache hierarchies are projected to become the norm, we believe efficient mapping of applications to cores will become vital for extracting the maximum possible performance from these systems.
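
To make the notion of a reuse distance distribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes the sampled trace is a list of cache-block addresses and that the cache is fully associative with LRU replacement, so an access hits exactly when its reuse distance is smaller than the cache capacity in blocks. The function names (`reuse_distances`, `reuse_histogram`, `estimated_miss_ratio`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the reuse (LRU stack) distance of every access in the trace.

    The distance is the number of distinct addresses touched since the
    previous access to the same address. None marks a first (cold) access.
    """
    stack = []  # LRU stack, most recently used address at the end
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Everything above addr in the stack was touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def reuse_histogram(trace):
    """Build the reuse distance distribution as a distance -> count histogram."""
    return Counter(reuse_distances(trace))


def estimated_miss_ratio(hist, capacity_blocks):
    """Estimate the miss ratio of a fully associative LRU cache (assumption).

    Cold accesses and accesses whose reuse distance is at least the cache
    capacity are counted as misses.
    """
    total = sum(hist.values())
    if total == 0:
        return 0.0
    misses = sum(
        count
        for dist, count in hist.items()
        if dist is None or dist >= capacity_blocks
    )
    return misses / total


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "b"]
    hist = reuse_histogram(trace)
    for capacity in (1, 2, 3):
        print(capacity, estimated_miss_ratio(hist, capacity))
```

A histogram of this kind could be built once per application and then evaluated against the capacity of each cache level that a candidate group of cores shares. How the paper combines the distributions of co-located applications is not stated in the abstract, so that step is deliberately left out of the sketch.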