Feedback-directed thread scheduling with memory considerations

Authors:
Fengguang Song;Shirley Moore;Jack Dongarra
Affiliations:
University of Tennessee;University of Tennessee;University of Tennessee
Venue:
Proceedings of the 16th international symposium on High performance distributed computing
Year:
2007

Citing 14
Cited 4

A multilevel algorithm for partitioning graphs

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Message passing versus distributed shared memory on networks of workstations

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Thread scheduling for cache locality

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Parallel sorting on cache-coherent DSM multiprocessors

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP

IEEE Transactions on Parallel and Distributed Systems
Partitioned parallel radix sort

Journal of Parallel and Distributed Computing
Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Restructuring computations for temporal data cache locality

International Journal of Parallel Programming
The Potential of Computation Regrouping for Improving Locality

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
A New Technique to Reduce False Sharing in Parallel Irregular Codes Based on Distance Functions

ISPAN '05 Proceedings of the 8th International Symposium on Parallel Architectures,Algorithms and Networks
Hardware profile-guided automatic page placement for ccNUMA systems

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
POWER5 System microarchitecture

IBM Journal of Research and Development - POWER5 and packaging

Exposing tunable parameters in multi-threaded numerical code

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
JavaSymphony: a programming and execution environment for parallel and distributed many-core architectures

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Understanding stencil code performance on multicore architectures

Proceedings of the 8th ACM International Conference on Computing Frontiers
Dynamic thread pinning for phase-based OpenMP programs

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper describes a novel approach to generate an optimized schedule to run threads on distributed shared memory (DSM) systems. The approach relies upon a binary instrumentation tool to automatically acquire the memory sharingrelationship between user-level threads by analyzing their memory trace. We introduce the concept of Affinity Graph to model the relationship. Expensive I/O for large trace files is completely eliminated by using an online graph creation scheme. We apply the technique of hierarchical graph partitioning and thread reordering to the affinity graph to determine an optimal thread schedule. We have performed experiments on an SGI Altix system. The experimental results show that our approach is able to reduce the totalexecution time by 10% to 38% for a variety of applications through the maximization of the data reuse within a single processor, minimization of the data sharing between processors, and a good load balance.