Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

Authors:
Achille Peternier;Danilo Ansaloni;Daniele Bonetta;Cesare Pautasso;Walter Binder
Venue:
Future Generation Computer Systems
Year:
2014


Abstract

Modern processor architectures are increasingly complex and heterogeneous, and software often has to be tailored to the specific hardware characteristics of each processor model. In this article, we address this problem for two processors featuring Simultaneous MultiThreading (SMT), the AMD Bulldozer and the IBM POWER7, which serve as case studies for hardware-oriented performance optimizations. SMT improves the occupancy of a core's internal execution units through a sustained stream of instructions coming from more than one thread; our optimizations increase the variety of instructions sent to each core in order to maximize the occupancy of all its execution units. We present WorkOver, a user-space monitoring tool for Linux-based operating systems that improves the performance of floating-point-intensive workloads through better thread scheduling. WorkOver automatically identifies FPU-intensive threads and schedules them more efficiently, without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved simply by letting WorkOver monitor applications and schedule their threads, without any modification of the workload.
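The scheduling idea described in the abstract can be illustrated with a minimal sketch. This is not WorkOver's actual algorithm, which the abstract does not detail. It assumes that a per-thread floating-point instruction ratio has already been obtained (for example from hardware performance counters) and that the CPUs sharing execution units are known. The names `plan_placement`, `fp_ratio`, `modules`, the 0.3 threshold, and `apply_placement` are all hypothetical. The sketch places FPU-intensive threads on different groups of sibling CPUs first and then fills the remaining CPUs with the other threads, so that each group receives a more varied instruction mix.

```python
import os
from typing import Dict, List


def plan_placement(fp_ratio: Dict[int, float],
                   modules: List[List[int]],
                   threshold: float = 0.3) -> Dict[int, int]:
    """Return a thread id -> CPU id map that spreads FPU-heavy threads.

    fp_ratio: assumed fraction of floating-point instructions per thread.
    modules:  groups of CPU ids that share execution units (assumed known).
    threshold: hypothetical cut-off above which a thread counts as FPU-heavy.
    """
    # FPU-heavy threads, most intensive first.
    heavy = sorted((t for t, r in fp_ratio.items() if r >= threshold),
                   key=lambda t: -fp_ratio[t])
    # Remaining threads, in a stable order.
    light = sorted(t for t, r in fp_ratio.items() if r < threshold)

    # Slot order: the first CPU of every module, then the second CPU of
    # every module, and so on. Consecutive slots land on different modules.
    width = max(len(m) for m in modules)
    slots = [m[i] for i in range(width) for m in modules if i < len(m)]

    # Heavy threads take the leading slots (one per module while possible);
    # light threads fill the rest. Slots are reused if threads outnumber CPUs.
    placement = {}
    for k, tid in enumerate(heavy + light):
        placement[tid] = slots[k % len(slots)]
    return placement


def apply_placement(placement: Dict[int, int]) -> None:
    """Pin each thread to its planned CPU from user space (Linux only)."""
    for tid, cpu in placement.items():
        os.sched_setaffinity(tid, {cpu})
```

With two groups of two sibling CPUs, `plan_placement({101: 0.8, 102: 0.7, 103: 0.05, 104: 0.1}, [[0, 1], [2, 3]])` puts the two FPU-heavy threads (101 and 102) on CPUs 0 and 2, which belong to different groups, and pairs each with one of the lighter threads. Pinning through `os.sched_setaffinity` needs no kernel changes, which matches the user-space approach the abstract describes, although the real tool may use a different mechanism.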