Computer
Distributing Hot-Spot Addressing in Large-Scale Multiprocessors
IEEE Transactions on Computers
Two algorithms for barrier synchronization
International Journal of Parallel Programming
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Analysis of multithreaded architectures for parallel computing
SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
Comparative evaluation of latency reducing and tolerating techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Processor coupling: integrating compile time and runtime scheduling for parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Waiting algorithms for synchronization in large-scale multiprocessors
ACM Transactions on Computer Systems (TOCS)
Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Register relocation: flexible contexts for multithreading
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Interleaving: a multithreading technique targeting multiprocessors and workstations
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The effectiveness of multiple hardware contexts
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Empirical studies of competitve spinning for a shared-memory multiprocessor
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
The Effect of Scheduling Discipline on Spin Overhead in Shared Memory Parallel Systems
IEEE Transactions on Parallel and Distributed Systems
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
CACHE COHERENCE PROTOCOLS FOR LARGE-SCALE MULTIPROCESSORS
CACHE COHERENCE PROTOCOLS FOR LARGE-SCALE MULTIPROCESSORS
PROTEUS: A HIGH-PERFORMANCE PARALLEL-ARCHITECTURE SIMULATOR
PROTEUS: A HIGH-PERFORMANCE PARALLEL-ARCHITECTURE SIMULATOR
LOW-COST SUPPORT FOR FINE-GRAIN SYNCHRONIZATION IN MULTIPROCESSORS
LOW-COST SUPPORT FOR FINE-GRAIN SYNCHRONIZATION IN MULTIPROCESSORS
Algorithms for Scalable Synchronization on Shared-Memory Multiproceessors
Algorithms for Scalable Synchronization on Shared-Memory Multiproceessors
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Proceedings of the 25th annual international symposium on Computer architecture
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Hi-index | 0.00 |
Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically, and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.