Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Analysis of multithreaded architectures for parallel computing
SPAA '90 Proceedings of the second annual ACM symposium on Parallel algorithms and architectures
New CPU benchmark suites from SPEC
COMPCON '92 Proceedings of the thirty-seventh international conference on COMPCON
An elementary processor architecture with simultaneous instruction issuing from multiple threads
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Processor coupling: integrating compile time and runtime scheduling for parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Microarchitecture support for dynamic scheduling of acyclic task graphs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
MISC: a Multiple Instruction Stream Computer
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiflow trace scheduling compiler
The Journal of Supercomputing - Special issue on instruction-level parallelism
Fast and accurate instruction fetch and branch prediction
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Exploring the design space for a shared-cache multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
The effectiveness of multiple hardware contexts
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Increasing superscalar performance through multistreaming
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Proceedings of the 28th annual international symposium on Microarchitecture
Evaluation of design alternatives for a multiprocessor microprocessor
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Techniques for extracting instruction level parallelism on MIMD architectures
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
Portable Programs for Parallel Processors
Portable Programs for Parallel Processors
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Performance Tradeoffs in Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
A Fine-Grain Threaded Abstract Machine
PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
The effects of STEF in finely parallel multithreaded processors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Design and performance evaluation of a multithreaded architecture
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Performance Study of a Multithreaded Superscalar Microprocessor
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Tuning compiler optimizations for simultaneous multithreading
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Hardware-software trade-offs in a direct Rambus implementation of the RAMpage memory hierarchy
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Effects of Multithreading on Cache Performance
IEEE Transactions on Computers - Special issue on cache memory and related problems
Instruction fetch mechanisms for multipath execution processors
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Software-Directed Register Deallocation for Simultaneous Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Tuning Compiler Optimizations for Simultaneous Multithreading
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Symbiotic jobscheduling for a simultaneous mutlithreading processor
ACM SIGPLAN Notices
Analytical cache models with applications to cache partitioning
ICS '01 Proceedings of the 15th international conference on Supercomputing
Symbiotic jobscheduling for a simultaneous multithreaded processor
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
An analysis of operating system behavior on a simultaneous multithreaded architecture
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Asynchrony in parallel computing: from dataflow to multithreading
Progress in computer research
SMT Layout Overhead and Scalability
IEEE Transactions on Parallel and Distributed Systems
Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Symbiotic jobscheduling with priorities for a simultaneous multithreading processor
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Asynchrony in parallel computing: from dataflow to multithreading
Progress in computer research
The Need for Fast Communication in Hardware-Based Speculative Chip Multiprocessors
International Journal of Parallel Programming
A survey of processors with explicit multithreading
ACM Computing Surveys (CSUR)
Effects of Memory Performance on Parallel Job Scheduling
JSSPP '01 Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing
A framework for performance modeling and prediction
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Improving server software support for simultaneous multithreaded processors
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Mini-Threads: Increasing TLP on Small-Scale SMT Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Large-Scale TCP Models Using Optimistic Parallel Simulation
Proceedings of the seventeenth workshop on Parallel and distributed simulation
Power-Sensitive Multithreaded Architecture
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Simultaneous Multithreading-Based Routers
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
An evaluation of speculative instruction execution on simultaneous multithreaded processors
ACM Transactions on Computer Systems (TOCS)
A Clustered Approach to Multithreaded Processors
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Large-scale network simulation techniques: examples of TCP and OSPF models
ACM SIGCOMM Computer Communication Review
Dynamic Partitioning of Shared Cache Memory
The Journal of Supercomputing
The need for adaptive dynamic thread scheduling
High performance scientific and engineering computing
The energy efficiency of CMP vs. SMT for multimedia workloads
Proceedings of the 18th annual international conference on Supercomputing
Architectural Support for Enhanced SMT Job Scheduling
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Area and System Clock Effects on SMT/CMP Throughput
IEEE Transactions on Computers
Efficient Direct User Level Sockets for an Intel® Xeon" Processor Based TCP On-Load Engine
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
Scalable cache memory design for large-scale SMT architectures
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Evaluating the impact of simultaneous multithreading on network servers using real hardware
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
How Well Can Simple Metrics Represent the Performance of HPC Applications?
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
An Efficient Way of Passing of Data in a Multithreaded Scheduled Dataflow Architecture
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Chip multithreading systems need a new operating system scheduler
Proceedings of the 11th workshop on ACM SIGOPS European workshop
A performance prediction framework for scientific applications
Future Generation Computer Systems
Throttling-Based Resource Management in High Performance Multithreaded Architectures
IEEE Transactions on Computers
Thread-associative memory for multicore and multithreaded computing
Proceedings of the 2006 international symposium on Low power electronics and design
Online power-performance adaptation of multithreaded programs using hardware event-based prediction
Proceedings of the 20th annual international conference on Supercomputing
CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
SMP-SoC is the answer if you ask the right questions
SAICSIT '06 Proceedings of the 2006 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
Journal of Parallel and Distributed Computing
ACM Transactions on Computer Systems (TOCS)
Fuce: the continuation-based multithreading processor
Proceedings of the 4th international conference on Computing frontiers
Proceedings of the 4th international conference on Computing frontiers
Performance of multithreaded chip multiprocessors and implications for operating system design
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
Managing The Complexity Of Performance Monitoring Hardware: The Brink Andabyss Approach
International Journal of High Performance Computing Applications
Exploring the performance limits of simultaneous multithreading for memory intensive applications
The Journal of Supercomputing
Dynamic tiling for effective use of shared caches on multithreaded processors
International Journal of High Performance Computing and Networking
A genetic algorithms approach to modeling the performance of memory-bound computations
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A SMT-ARM simulator and performance evaluation
SEPADS'06 Proceedings of the 5th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems
DLL-conscious instruction fetch optimization for SMT processors
Journal of Systems Architecture: the EUROMICRO Journal
Towards achieving reliable and high-performance nanocomputing via dynamic redundancy allocation
ACM Journal on Emerging Technologies in Computing Systems (JETC)
A continuation-based noninterruptible multithreading processor architecture
The Journal of Supercomputing
Source level merging of independent programs
Journal of Parallel and Distributed Computing
Performance evaluation of the sparse matrix-vector multiplication on modern architectures
The Journal of Supercomputing
A performance prediction framework for scientific applications
Future Generation Computer Systems
A performance prediction framework for scientific applications
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Shared Register File Based ILP for Multicore
GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Managing SMT resource usage through speculative instruction window weighting
ACM Transactions on Architecture and Code Optimization (TACO)
PMPS(3): a performance model of parallel systems
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Low power microprocessor design for embedded systems
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part IV
PATMOS'06 Proceedings of the 16th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
HELIX: automatic parallelization of irregular programs for chip multiprocessing
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Hazard driven test generation for SMT processors
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
ACM Transactions on Architecture and Code Optimization (TACO)
The sharing architecture: sub-core configurability for IaaS clouds
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.01 |
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue super-scalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to complete for and share all of the processor's resources every cycle.The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor. The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding additional thread contexts on an SMT (relative to adding additional processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware.We find the multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting threading-level parallelism, however, SMT hides these additional latencies, so that they only have a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.