The program dependence graph and its use in optimization
ACM Transactions on Programming Languages and Systems (TOPLAS)
The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Critical path reduction for scalar programs
Proceedings of the 28th annual international symposium on Microarchitecture
New faster Kernighan-Lin-type graph-partitioning algorithms
ICCAD '93 Proceedings of the 1993 IEEE/ACM international conference on Computer-aided design
ICS '98 Proceedings of the 12th international conference on Supercomputing
Task selection for a multiscalar processor
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Removing architectural bottlenecks to the scalability of speculative parallelization
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Master/slave speculative parallelization
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Thread-Spawning Schemes for Speculative Multithreading
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Program slices: formal, psychological, and practical investigations of an automatic program abstraction method
Min-cut program decomposition for thread-level speculation
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
A cost-driven compilation framework for speculative parallelization of sequential programs
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Memory Ordering: A Value-Based Approach
Proceedings of the 31st annual international symposium on Computer architecture
Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation
Proceedings of the 19th annual international conference on Supercomputing
Automatic Thread Extraction with Decoupled Software Pipelining
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research
IEEE Computer Architecture Letters
POSH: a TLS compiler that exploits program structure
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Proceedings of the 33rd annual international symposium on Computer Architecture
Hardware atomicity for reliable software speculation
Proceedings of the 34th annual international symposium on Computer architecture
Core fusion: accommodating software diversity in chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Speculative Decoupled Software Pipelining
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Exploiting Postdominance for Speculative Parallelization
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices
IEEE Transactions on Parallel and Distributed Systems
A new foundation for control-dependence and slicing for modern program structures
ESOP'05 Proceedings of the 14th European conference on Programming Languages and Systems
ISAMAP: instruction mapping driven by dynamic binary translation
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Limits of region-based dynamic binary parallelization
Proceedings of the 9th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
HW/SW co-designed acceleration of dynamic languages
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Proceedings of the 6th International Systems and Storage Conference
ASC: automatically scalable computation
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
The performance of single-threaded programs and legacy binary code is of critical importance in many everyday applications. However, neither can hardware multi-core processors directly speed up single-threaded programs, nor can software automatic parallelizing compilers effectively parallelize legacy binary code and irregular applications. In this paper, we propose a framework and a set of algorithms to dynamically parallelize single-threaded binary programs. Our parallelization is based on program slicing and explores both instruction-level parallelism (ILP) and thread-level parallelism (TLP). To significantly reduce the critical path of the parallel slices, our slicing algorithms exploit speculation to cut rare dependences, and use well-designed program transformations to expose parallelism. Furthermore, because we transparently parallelize binary code at runtime, we perform slicing only on program hot regions. Our experiments demonstrate that the proposed speculative slicing approach extracts more parallelism than any known slicing based parallelization schemes. For the SPEC2000 benchmarks, we can achieve 3x parallelism with infinite number of threads, and 1.8x parallelism with 4 threads.