A study of source-level compiler algorithms for automatic construction of pre-execution code

Authors:
Dongkeun Kim;Donald Yeung
Affiliations:
University of Maryland at College Park, MD;University of Maryland at College Park, MD
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2004

Citing 33
Cited 8

Advanced compiler optimizations for supercomputers

Communications of the ACM - Special issue on parallelism
The program dependence graph and its use in optimization

ACM Transactions on Programming Languages and Systems (TOPLAS)
Supporting dynamic data structures on distributed-memory machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Predictability of load/store instruction latencies

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Improving data cache performance by pre-executing instructions under a cache miss

ICS '97 Proceedings of the 11th international conference on Supercomputing
Tolerating latency in multiprocessors through compiler-inserted prefetching

ACM Transactions on Computer Systems (TOCS)
Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Automatic I/O hint generation through speculative execution

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Improving virtual function call target prediction via dependence-based pre-computation

ICS '99 Proceedings of the 13th international conference on Supercomputing
Code transformations to improve memory parallelism

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Understanding the backward slices of performance degrading instructions

Proceedings of the 27th annual international symposium on Computer architecture
Slice-processors: an implementation of operation-based prediction

ICS '01 Proceedings of the 15th international conference on Supercomputing
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Execution-based prediction using speculative slices

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Speculative precomputation: long-range prefetching of delinquent loads

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Data prefetching by dependence graph precomputation

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Post-pass binary adaptation for software-based speculative precomputation

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Difficult-path branch prediction using subordinate microthreads

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Design and evaluation of compiler algorithms for pre-execution

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
A Study of a Simultaneous Multithreaded Processor Implementation

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Master/slave speculative parallelization

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A quantitative framework for automated pre-execution thread selection

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Memory Latency-Tolerance Approaches for Itanium Processors: Out-of-Order Execution vs.Speculative Precomputation

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Shade: A Fast Instruction Set Simulator for Execution Profiling

Shade: A Fast Instruction Set Simulator for Execution Profiling

Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors

LCR '04 Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems
Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Automatic Thread Extraction with Decoupled Software Pipelining

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Hiding I/O latency with pre-execution prefetching for parallel applications

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
On reducing load/store latencies of cache accesses

Journal of Systems Architecture: the EUROMICRO Journal
Cashing in on hints for better prefetching and caching in PVFS and MPI-IO

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Multicore performance optimization using partner cores

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
smt-SPRINTS: software precomputation with intelligent streaming for resource-constrained SMTs

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This article investigates several source-to-source C compilers for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. We present an aggressive profile-driven compiler that employs three powerful algorithms for code extraction. First, program slicing removes non-critical code for computing cache-missing memory references. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, speculative loop parallelization generates thread-level parallelism to tolerate the latency of blocking loads. In addition, we present four "reduced" compilers that employ less aggressive algorithms to simplify compiler implementation. Our reduced compilers rely on back-end code optimizations rather than program slicing to remove non-critical code, and use compile-time heuristics rather than profiling to approximate runtime information (e.g., cache-miss and loop-trip counts).We prototype our algorithms on the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [Lyle and Wallace 1997]. Using our prototype, we undertake a performance evaluation of our compilers on a detailed architectural simulator of an 8-way out-of-order SMT processor with 4 hardware contexts, and 13 applications selected from the SPEC and Olden benchmark suites. Our most aggressive compiler improves the performance of 10 out of 13 applications, reducing execution time by 20.9%. Across all 13 applications, our aggressive compiler achieves a harmonic average speedup of 17.6%. For our reduced compilers, eliminating program slicing and relying on back-end optimizations degrades performance minimally, suggesting that effective pre-execution compilers can be built without program slicing. Furthermore, without cache-miss profiles, we still achieve good speedup, 15.5%, but without loop-trip count profiles, we achieve a speedup of only 7.7%. Finally, our results show compiler-based pre-execution can benefit multiprogrammed workloads. Simultaneously executing applications achieve higher throughput with pre-execution compared to no pre-execution. Due to contention for hardware contexts, however, time-slicing outperforms simultaneous execution in some cases where individual applications make heavy use of pre-execution threads.