Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to reduce costly inter-processor communication. This paper reexamines several compiler optimizations in the context of simultaneous multithreading (SMT), a processor architecture that issues instructions from multiple threads to the functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current multiprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding latencies. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: loop iterations should be distributed among threads with a cyclic, rather than a blocked, algorithm; non-loop programs should not be software speculated; and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
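To make the first guideline concrete, here is a minimal sketch of the two loop-iteration scheduling policies the abstract contrasts. It is not the paper's compiler implementation; the function names `blocked_schedule` and `cyclic_schedule` and the parameters `n_iters` and `n_threads` are hypothetical, chosen only for illustration.

```python
def blocked_schedule(n_iters, n_threads):
    """Blocked: each thread receives one contiguous chunk of iterations."""
    chunk = -(-n_iters // n_threads)  # ceiling division
    return [
        list(range(t * chunk, min((t + 1) * chunk, n_iters)))
        for t in range(n_threads)
    ]


def cyclic_schedule(n_iters, n_threads):
    """Cyclic: iteration i goes to thread i mod n_threads."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


if __name__ == "__main__":
    # Illustrative sizes only (assumption): 8 iterations, 2 hardware threads.
    print("blocked:", blocked_schedule(8, 2))
    print("cyclic: ", cyclic_schedule(8, 2))
```

With a blocked assignment, threads running at the same time work on iterations that are far apart, which suits machines where sharing is expensive. With a cyclic assignment, concurrently running threads work on neighboring iterations, so on an SMT processor, where threads share the memory system, they are more likely to touch the same regions of data at the same time.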