Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

Authors:
David Tam; Reza Azimi; Michael Stumm
Affiliations:
University of Toronto, Toronto, Canada (all authors)
Venue:
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Year:
2007

Abstract

The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared-memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations and, as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper, we describe the design and implementation of a scheme that schedules threads based on sharing patterns detected online using features of the standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
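
The idea of sharing-aware placement can be illustrated with a small sketch. The abstract does not specify the clustering algorithm or the data structures used, so everything below is an assumption made for illustration: each thread is given a hypothetical "sharing signature" (a map from a memory-region id to the number of sampled cross-chip accesses), signatures are compared with cosine similarity, threads are grouped by a greedy one-pass heuristic, and clusters are placed on the least-loaded chip. The function names, the threshold, and the sample data are all hypothetical.

```python
from math import sqrt


def similarity(a, b):
    """Cosine similarity of two sparse signatures (region id -> sample count)."""
    dot = sum(count * b.get(region, 0) for region, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_threads(signatures, threshold=0.5):
    """Greedy one-pass clustering (an assumed heuristic, not the paper's).

    A thread joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters = []  # each cluster is a list of thread ids
    for tid, sig in signatures.items():
        for cluster in clusters:
            if similarity(sig, signatures[cluster[0]]) >= threshold:
                cluster.append(tid)
                break
        else:
            clusters.append([tid])
    return clusters


def assign_to_chips(clusters, num_chips):
    """Place the largest clusters first, each on the least-loaded chip."""
    load = [0] * num_chips
    placement = {}
    for cluster in sorted(clusters, key=len, reverse=True):
        chip = load.index(min(load))
        for tid in cluster:
            placement[tid] = chip
        load[chip] += len(cluster)
    return placement


# Hypothetical sampled data: t0/t1 touch one set of regions, t2/t3 another.
signatures = {
    "t0": {0x10: 9, 0x11: 4},
    "t1": {0x10: 7, 0x11: 5},
    "t2": {0x80: 6, 0x81: 8},
    "t3": {0x80: 5, 0x81: 9},
}
clusters = cluster_threads(signatures)
placement = assign_to_chips(clusters, num_chips=2)
print(clusters)
print(placement)
```

In this toy input, threads that touch the same regions end up in the same cluster and therefore on the same chip, which is the effect the scheme aims for. A real implementation would obtain the signatures from PMU samples and would also have to balance load within each chip.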