Scheduling and page migration for multiprocessor compute servers

Authors:
Rohit Chandra;Scott Devine;Ben Verghese;Anoop Gupta;Mendel Rosenblum
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford CA;Computer Systems Laboratory, Stanford University, Stanford CA;Computer Systems Laboratory, Stanford University, Stanford CA;Computer Systems Laboratory, Stanford University, Stanford CA;Computer Systems Laboratory, Stanford University, Stanford CA
Venue:
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Year:
1994

Citing 18
Cited 52

Process control and scheduling issues for multiprogrammed shared-memory multiprocessors

SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
Scheduling Support for Concurrency and Parallelism in the Mach Operating System

Computer
Distributed Hierarchical Control for Parallel Processing

Computer
The performance of multiprogrammed multiprocessor scheduling algorithms

SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
NUMA policies and their relation to memory architecture

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The impact of operating system scheduling policies and synchronization methods of performance of parallel applications

SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Analysis of task migration in shared-memory multiprocessor scheduling

SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Experimental comparison of memory management policies for NUMA multiprocessors

ACM Transactions on Computer Systems (TOCS)
The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Scheduler activations: effective kernel support for the user-level management of parallelism

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Comparative performance evaluation of cache-coherent NUMA and COMA architectures

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The DASH prototype: implementation and performance

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Benefits of cache-affinity scheduling in shared-memory multiprocessors: a summary

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Efficient scheduling on multiprogrammed shared-memory multiprocessors

Efficient scheduling on multiprogrammed shared-memory multiprocessors
COOL: An Object-Based Language for Parallel Programming

Computer
Using Processor-Cache Affinity Information in Shared-Memory Multiprocessor Scheduling

IEEE Transactions on Parallel and Distributed Systems
Making effective use of shared-memory multiprocessors: the process control approach

Making effective use of shared-memory multiprocessors: the process control approach

Memory system performance of UNIX on CC-NUMA multiprocessors

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
The interaction of parallel and sequential workloads on a network of workstations

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Informing memory operations: providing memory performance feedback in modern processors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Effective distributed scheduling of parallel workloads

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Integrating performance monitoring and communication in parallel computers

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Operating system support for improving data locality on CC-NUMA compute servers

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The SHRIMP performance monitor: design and applications

SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Informing memory operations: memory performance feedback mechanisms and their applications

ACM Transactions on Computer Systems (TOCS)
Performance isolation: sharing and isolation in shared-memory multiprocessors

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Performance experiences on Sun's Wildfire prototype

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
A case for user-level dynamic page migration

Proceedings of the 14th international conference on Supercomputing
Memory Conscious Scheduling for Cluster-based NUMA Multiprocessors

The Journal of Supercomputing
Symbiotic jobscheduling for a simultaneous mutlithreading processor

ACM SIGPLAN Notices
Is data distribution necessary in OpenMP?

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Symbiotic jobscheduling with priorities for a simultaneous multithreading processor

SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Automatic data migration for reducing energy consumption in multi-bank memory systems

Proceedings of the 39th annual Design Automation Conference
Dynamic page placement to improve locality in CC-NUMA multiprocessors for TPC-C

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Compiler Support for Array Distribution onNUMA Shared Memory Multiprocessors

The Journal of Supercomputing
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

International Journal of Parallel Programming
Performance Analysys of a CC-NUMAOperating System

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Distributed Job Scheduling in SCI Local-Area MultiProcessors

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Virtual memory on data diffusion architectures

Parallel Computing
Quantifying contention and balancing memory load on hardware DSM multiprocessors

Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
LODS: locality-oriented dynamic scheduling for on-chip multiprocessors

Proceedings of the 41st annual Design Automation Conference
X-RAY: A Non-Invasive Exclusive Caching Mechanism for RAIDs

Proceedings of the 31st annual international symposium on Computer architecture
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
The diffusion space of data diffusion architectures

Parallel Computing
NUMA-Aware Java Heaps for Server Applications

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system

Proceedings of the 19th annual international conference on Supercomputing
Page migration with dynamic space-sharing scheduling policies: the case of the SGI 02000

International Journal of Parallel Programming - Special issue II: The 17th annual international conference on supercomputing (ICS'03)
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Balancing power consumption in multiprocessor systems

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
A transparent runtime data distribution engine for OpenMP

Scientific Programming
Memory bank aware dynamic loop scheduling

Proceedings of the conference on Design, automation and test in Europe
Speeding-up multiprocessors running DBMS workloads through coherence protocols

International Journal of High Performance Computing and Networking
Efficient operating system scheduling for performance-asymmetric multi-core architectures

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Hardware monitors for dynamic page migration

Journal of Parallel and Distributed Computing
Integrating Dynamic Memory Placement with Adaptive Load-Balancing for Parallel Codes on NUMA Multiprocessors

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Impact of NUMA effects on high-speed networking with multi-opteron machines

PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Micro-pages: increasing DRAM efficiency with locality-aware data placement

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Handling the problems and opportunities posed by multiple on-chip memory controllers

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Feedback-directed page placement for ccNUMA via hardware-generated memory traces

Journal of Parallel and Distributed Computing
Dual-layered file cache on cc-NUMA system

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A hybrid strategy based on data distribution and migration for optimizing memory locality

LCPC'02 Proceedings of the 15th international conference on Languages and Compilers for Parallel Computing
Reducing memory interference in multicore systems via application-aware memory channel partitioning

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
ADAPT: A framework for coscheduling multithreaded programs

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel application workloads. Process scheduling and memory management, however, remain challenging due to the distributed main memory found on such machines. This paper examines the effects of OS scheduling and page migration policies on the performance of such compute servers. Our experiments are done on the Stanford DASH, a distributed-memory cache-coherent multiprocessor. We show that for our multiprogramming workloads consisting of sequential jobs, the traditional Unix scheduling policy does very poorly. In contrast, a policy incorporating cluster and cache affinity along with a simple page-migration algorithm offers up to two-fold performance improvement. For our workloads consisting of multiple parallel applications, we compare space-sharing policies that divide the processors among the applications to time-slicing policies such as standard Unix or gang scheduling. We show that space-sharing policies can achieve better processor utilization due to the operating point effect, but time-slicing policies benefit strongly from user-level data distribution. Our initial experience with automatic page migration suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.