Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
IEEE Transactions on Computers
PRESTO: a system for object-oriented parallel programming
Software—Practice & Experience
MASA: a multithreaded processor architecture for parallel symbolic computing
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Analysis of multithreaded architectures for parallel computing
SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
Techniques for efficient inline tracing on a shared-memory multiprocessor
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
LimitLESS directories: A scalable cache coherence scheme
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Comparative evaluation of latency reducing and tolerating techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Strategies for achieving improved processor throughput
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
Improved multithreading techniques for hiding communication latency in multiprocessors
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Using processor affinity in loop scheduling on shared-memory multiprocessors
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Impact of sharing-based thread placement on multithreaded architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Complexity/performance tradeoffs with non-blocking loads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
ICS '90 Proceedings of the 4th international conference on Supercomputing
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Limits on Interconnection Network Performance
IEEE Transactions on Parallel and Distributed Systems
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
Computing Per-Process Summary Side-Effect Information
Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing
Static Analysis of Barrier Synchronization in Explicitly Parallel Programs
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Informing memory operations: providing memory performance feedback in modern processors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Limits on the performance benefits of multithreading and prefetching
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Fine-grain multithreading with the EM-X multiprocessor
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading
ACM Transactions on Computer Systems (TOCS)
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS)
Simultaneous multithreading: maximizing on-chip parallelism
25 years of the international symposia on Computer architecture (selected papers)
Effects of Multithreading on Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Latency Tolerance: A Metric for Performance Analysis of Multithreaded Architectures
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Data locality sensitivity of multithreaded computations on a distributed-memory multiprocessor
CASCON '96 Proceedings of the 1996 conference of the Centre for Advanced Studies on Collaborative research
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Optimizing NANOS OpenMP for the IBM Cyclops Multithreaded Architecture
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Mambo: a full system simulator for the PowerPC architecture
ACM SIGMETRICS Performance Evaluation Review - Special issue on tools for computer architecture research
Software thread integration for embedded system display applications
ACM Transactions on Embedded Computing Systems (TECS)
Fairness and Throughput in Switch on Event Multithreading
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Fairness enforcement in switch on event multithreading
ACM Transactions on Architecture and Code Optimization (TACO)
Exploiting multilevel parallelism using OpenMP on a massive multithreaded architecture
Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Enhancing Microkernel Performance on VLIW DSP Processors via Multiset Context Switch
Journal of Signal Processing Systems
Memory-level parallelism aware fetch policies for simultaneous multithreading processors
ACM Transactions on Architecture and Code Optimization (TACO)
Service level agreement for multithreaded processors
ACM Transactions on Architecture and Code Optimization (TACO)
Improving SMT performance scheduling processes
EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
A processor extension for cycle-accurate real-time software
EUC'06 Proceedings of the 2006 international conference on Embedded and Ubiquitous Computing
Multi-criteria checkpointing strategies: response-time versus resource utilization
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Hi-index | 0.00 |
Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time.The usefulness of multiple hardware contexts depends on: program data locality, cache organization and degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts, over widely varying architectures. With unoptimized applications, multiple contexts had limited value. The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably with variations in cache associativity and cache hierarchy, unlike the optimized programs.As a mechanism for exploiting program parallelism, an additional processor is clearly better than another context. However, there were many configurations for which the addition of a few hardware contexts brought as much or greater performance than a larger multiprocessor with fewer than the optimal number of contexts.