Tuning Compiler Optimizations for Simultaneous Multithreading

Authors:
Jack L. Lo;Susan J. Eggers;Henry M. Levy;Sujay S. Parekh;Dean M. Tullsen
Affiliations:
-;-;-;-;-
Venue:
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Year:
1999

Citing 37
Cited 0

Highly concurrent scalar processing

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Optimal loop parallelization

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-Offs

Computer
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
New CPU benchmark suites from SPEC

COMPCON '92 Proceedings of the thirty-seventh international conference on COMPCON
Limits of control flow on parallelism

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Effective compiler support for predicated execution using the hyperblock

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
The multiflow trace scheduling compiler

The Journal of Supercomputing - Special issue on instruction-level parallelism
The superblock: an effective technique for VLIW and superscalar compilation

The Journal of Supercomputing - Special issue on instruction-level parallelism
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Unifying data and control transformations for distributed shared-memory machines

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Compiler-directed page coloring for multiprocessors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The Tera computer system

ICS '90 Proceedings of the 4th international conference on Supercomputing
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems (TOCS)
Task selection for a multiscalar processor

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Tera hardware-software cooperation

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Conversion of control dependence to data dependence

POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Portable Programs for Parallel Processors

Portable Programs for Parallel Processors
Scaling Parallel Programs for Multiprocessors: Methodology and Examples

Computer
Maximizing Multiprocessor Performance with the SUIF Compiler

Computer
Simultaneous Multithreading: A Platform for Next-Generation Processors

IEEE Micro
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
Hierarchical tiling for improved superscalar performance

IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution

Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing

MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Out-of-Order Execution may not be Cost-Effective on Processors Featuring Simultaneous Multithreading

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SM T processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from finegrained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked algorithm; non-loop programs should not be software speculated, and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.