Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

  • Authors:
  • Eddy Z. Zhang; Yunlian Jiang; Xipeng Shen

  • Affiliation:
  • The College of William and Mary, Williamsburg, VA, USA

  • Venue:
  • Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2010


Abstract

Most modern Chip Multiprocessors (CMP) feature a shared on-chip cache. For multithreaded applications, the sharing reduces communication latency among co-running threads, but it also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared caches rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by their reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of that influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors at the program, OS, and architecture levels taken into account. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.
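To make the "assignments of threads to cores" concrete, the sketch below shows one common way to pin threads to specific cores on Linux, using the pthread_setaffinity_np extension. It is a minimal illustration, not the paper's actual transformation: the core IDs and the assumed cache topology (cores 0 and 1 sharing a cache) are hypothetical and machine-specific.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core. Core numbering and
   which cores share a cache are machine-specific assumptions. */
static int pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    int core_id = *(int *)arg;
    if (pin_to_core(core_id) != 0)
        fprintf(stderr, "failed to pin thread to core %d\n", core_id);
    /* ... this thread's share of the parallel work ... */
    return NULL;
}

int main(void) {
    /* Hypothetical topology: cores 0 and 1 share a cache. Placing
       two data-sharing threads on them favors reuse; placing two
       contending threads on cores with separate caches avoids
       contention. */
    pthread_t t[2];
    int cores[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Whether such a placement helps depends on whether the co-located threads actually share data at a granularity the cache can exploit, which is the mismatch the abstract identifies.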