Scheduling threads for constructive cache sharing on CMPs

Authors:
Shimin Chen;Phillip B. Gibbons;Michael Kozuch;Vasileios Liaskovitis;Anastassia Ailamaki;Guy E. Blelloch;Babak Falsafi;Limor Fix;Nikos Hardavellas;Todd C. Mowry;Chris Wilkerson
Affiliations:
Intel Research Pittsburgh;Intel Research Pittsburgh;Intel Research Pittsburgh;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Intel Research Pittsburgh;Carnegie Mellon University;Carnegie Mellon University and Intel Research Pittsburgh;Intel Microprocessor Research Lab
Venue:
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Year:
2007

Citing 31
Cited 42

Footprints in the cache

ACM Transactions on Computer Systems (TOCS)
An analytical cache model

ACM Transactions on Computer Systems (TOCS)
Analysis of multithreaded architectures for parallel computing

SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Thread scheduling for cache locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
An analysis of dag-consistent distributed shared-memory algorithms

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
Space-efficient scheduling of parallelism with synchronization variables

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Provably efficient scheduling for languages with fine-grained parallelism

Journal of the ACM (JACM)
Space-efficient scheduling of nested parallelism

ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling multithreaded computations by work stealing

Journal of the ACM (JACM)
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Analytical cache models with applications to cache partitioning

ICS '01 Proceedings of the 15th international conference on Supercomputing
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Design Challenges of Technology Scaling

IEEE Micro
Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator

FCRC '96/WACG '96 Selected papers from the Workshop on Applied Computational Geormetry, Towards Geometric Engineering
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Exploiting Choice in Resizable Cache Design to Optimize Deep-Submicron Processor Energy-Delay

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Dynamic Partitioning of Shared Cache Memory

The Journal of Supercomputing
Design and implementation of the POWER5™ microprocessor

Proceedings of the 41st annual Design Automation Conference
Effectively sharing a cache among threads

Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
Inspector joins

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Maximizing CMP Throughput with Mediocre Cores

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Parallel Real-Time Task Scheduling on Multicore Platforms

RTSS '06 Proceedings of the 27th IEEE International Real-Time Systems Symposium
Performance of multithreaded chip multiprocessors and implications for operating system design

ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
A multithreaded PowerPC processor for commercial servers

IBM Journal of Research and Development
A dynamically tunable memory hierarchy

IEEE Transactions on Computers

High performance dense linear algebra on a spatially distributed processor

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimistic parallelism benefits from data partitioning

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Provably good multicore cache performance for divide-and-conquer algorithms

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Enhancing storage system availability on multi-core architectures with recovery-conscious scheduling

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Fundamental parallel algorithms for private-cache chip multiprocessors

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Scheduling strategies for optimistic parallel execution of irregular programs

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Main-memory scan sharing for multi-core CPUs

Proceedings of the VLDB Endowment
Hash Join Optimization Based on Shared Cache Chip Multi-processor

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Dynamic thread and data mapping for NoC based CMPs

Proceedings of the 46th Annual Design Automation Conference
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Intra-application shared cache partitioning for multithreaded applications

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Flexible architectural support for fine-grain scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Contention aware execution: online contention detection and response

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
An adaptive task creation strategy for work-stealing scheduling

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Cache topology aware computation mapping for multicores

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Hardware/software support for adaptive work-stealing in on-chip multiprocessor

Journal of Systems Architecture: the EUROMICRO Journal
Reinventing scheduling for multicore systems

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Corey: an operating system for many cores

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Proposal and evaluation of APIs for utilizing inter-core time aggregation scheduler

JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
ULCC: a user-level facility for optimizing shared cache performance on multicores

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Scheduling task parallelism on multi-socket multicore systems

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Loaf: a framework and infrastructure for creating online adaptive solutions

Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Balance principles for algorithm-architecture co-design

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
FACT: a framework for adaptive contention-aware thread migrations

Proceedings of the 8th ACM International Conference on Computing Frontiers
Energy efficient scheduling for multithreaded programs on general-purpose processors

Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
REEact: a customizable virtual execution manager for multicore platforms

VEE '12 Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments
Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Parallel application memory scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Improving web cache server performance through arbitral thread and delayed caching

Cluster Computing
Neighborhood-aware data locality optimization for NoC-based multicores

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Reuse distance based performance modeling and workload mapping

Proceedings of the 9th conference on Computing Frontiers
OpenMP task scheduling strategies for multicore NUMA systems

International Journal of High Performance Computing Applications
CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Proceedings of the 26th ACM international conference on Supercomputing
Cache-efficient parallel isosurface extraction for shared cache multicores

EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Measuring interference between live datacenter applications

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Characterizing and mitigating work time inflation in task parallel programs

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Locality-aware task management for unstructured parallelism: a quantitative limit study

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
An empirical model for predicting cross-core performance interference on multicore processors

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity

ACM Transactions on Architecture and Code Optimization (TACO)
Characterizing and mitigating work time inflation in task parallel programs

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.00

Visualization

Abstract

In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.