Impact of sharing-based thread placement on multithreaded architectures

Authors:
R. Thekkath;S. J. Eggers
Affiliations:
Dept. of Computer Science & Engg., University of Washington, Seattle;Dept. of Computer Science & Engg., University of Washington, Seattle
Venue:
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Year:
1994

Abstract

Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce the interconnect traffic is to co-locate threads that share data on the same processor. Multiple threads sharing in the cache should reduce compulsory and invalidation misses, thereby improving execution time.

To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained nearly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the critical factor that affected performance. Our results were due to one or both of the following reasons: (1) the sequential and uniform access of shared data by the application's threads and (2) the insignificant number of data references that require interconnect access, relative to the total number of instructions.
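The contrast between sharing-based and load-based placement can be pictured with a toy sketch. This is not the paper's simulator or any of its placement algorithms; the function names, the greedy heuristics, the per-thread address sets (`footprints`), and the instruction counts (`lengths`) are all hypothetical, chosen only to show how the two objectives can pull in different directions.

```python
# Toy illustration only: hypothetical heuristics, not the algorithms
# evaluated in the paper. Assumes threads <= n_procs * contexts.

def place_by_sharing(footprints, n_procs, contexts):
    """Greedily co-locate threads whose address sets overlap the most."""
    unplaced = sorted(footprints)
    procs = []
    for _ in range(n_procs):
        if not unplaced:
            break
        group = [unplaced.pop(0)]  # seed each processor with the next thread
        while len(group) < contexts and unplaced:
            # Pick the thread sharing the most addresses with this group.
            best = max(
                unplaced,
                key=lambda t: sum(len(footprints[t] & footprints[g]) for g in group),
            )
            unplaced.remove(best)
            group.append(best)
        procs.append(group)
    return procs


def place_by_load(lengths, n_procs, contexts):
    """Longest thread first, onto the least-loaded processor with a free context."""
    procs = [[] for _ in range(n_procs)]
    load = [0] * n_procs
    for t in sorted(lengths, key=lambda t: (-lengths[t], t)):
        free = [p for p in range(n_procs) if len(procs[p]) < contexts]
        p = min(free, key=lambda p: load[p])
        procs[p].append(t)
        load[p] += lengths[t]
    return procs


# Hypothetical data: addresses touched and instructions executed per thread.
footprints = {"t0": {1, 2, 3}, "t1": {2, 3, 4}, "t2": {10, 11}, "t3": {11, 12}}
lengths = {"t0": 100, "t1": 90, "t2": 20, "t3": 10}

by_sharing = place_by_sharing(footprints, n_procs=2, contexts=2)
by_load = place_by_load(lengths, n_procs=2, contexts=2)
print(by_sharing)  # [['t0', 't1'], ['t2', 't3']]
print(by_load)     # [['t0', 't3'], ['t1', 't2']]
```

With this made-up input, the sharing-based grouping puts both long threads on one processor (190 versus 30 instructions), while the load-based grouping evens the work out (110 versus 110). That imbalance is the kind of effect the abstract identifies as critical, though the numbers here are illustrative and say nothing about the paper's measured results.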