Evaluating Associativity in CPU Caches
IEEE Transactions on Computers
Analytical cache models with applications to cache partitioning
ICS '01 Proceedings of the 15th international conference on Supercomputing
HPCVIEW: A Tool for Top-down Analysis of Node Performance
The Journal of Supercomputing
On the effectiveness of set associative page mapping and its application to main memory management
ICSE '76 Proceedings of the 2nd international conference on Software engineering
Cross-architecture performance predictions for scientific applications using parameterized models
Proceedings of the joint international conference on Measurement and modeling of computer systems
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Fast data-locality profiling of native execution
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Multiple Page Size Modeling and Optimization
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Locality approximation using time
Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
IEEE Transactions on Software Engineering
All-window profiling of concurrent executions
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Sampling-based program locality approximation
Proceedings of the 7th international symposium on Memory management
Open | SpeedShop: An open source infrastructure for parallel performance analysis
Scientific Programming - Large-Scale Programming Tools and Environments
A Prediction Based CMP Cache Migration Policy
HPCC '08 Proceedings of the 2008 10th IEEE International Conference on High Performance Computing and Communications
Analysis and approximation of optimal co-scheduling on chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Program locality analysis using reuse distance
ACM Transactions on Programming Languages and Systems (TOPLAS)
Evaluation techniques for storage hierarchies
IBM Systems Journal
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Addressing shared resource contention in multicore processors via scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org
Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
Accelerating multicore reuse distance analysis with sampling and parallelization
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Comparing scalability prediction strategies on an SMP of CMPs
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Characterizing the Relation Between Apex-Map Synthetic Probes and Reuse Distance Distributions
ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
All-window profiling and composable models of cache sharing
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Fast modeling of shared caches in multicore systems
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Discovery of locality-improving refactorings by reuse path analysis
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Linear-time Modeling of Program Working Set in Shared Cache
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Is reuse distance applicable to data locality analysis on chip multiprocessors?
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Pinpointing data locality problems using data-centric analysis
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
FractalMRC: Online Cache Miss Rate Curve Prediction on Commodity Systems
IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
A higher order theory of locality
Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
HOTL: a higher order theory of locality
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Heterogeneous makespan and energy-constrained DAG scheduling
Proceedings of the 2013 workshop on Energy efficient high performance parallel and distributed computing
Hi-index | 0.00 |
Because of the interference in the shared cache on multicore processors, the performance of a program can be severely affected by its co-running programs. If job scheduling does not consider how a group of tasks utilize cache, the performance may degrade significantly, and the degradation usually varies sizably and unpredictably from run to run. In this paper, we use trace-based program locality analysis and make it efficient enough for dynamic use. We show a complete on-line system for periodically measuring the parallel execution, predicting and ranking cache interference for all co-run choices, and reorganizing programs based on the prediction. We test our system on floating-point and mixed integer and floating-point workloads composed of SPEC 2006 benchmarks and compare with the default Linux job scheduler to show the benefit of the new system in improving performance and reducing performance variation.