Chip multiprocessors (CMPs), or multi-core processors, have become a common way of reducing chip complexity and power consumption while maintaining high performance. Speculative CMPs use hardware to enforce dependences, allowing a parallelizing compiler to generate multithreaded code without needing to prove independence. In these systems, a sequential program is decomposed into threads to be executed in parallel; dependent threads cause performance degradation, but do not affect correctness. Thread decomposition attempts to reduce the run-time overheads of data dependence, thread misprediction, and load imbalance. Because these overheads depend on the run times of the threads that are being created by the decomposition, reducing the overheads while creating the threads is a circular problem. Static compile-time decomposition handles this problem by estimating the run times of the candidate threads, but is limited by the estimates' inaccuracy. Dynamic execution-time decomposition in hardware has better run-time information, but is limited by the decomposition hardware's complexity and run-time overhead. We propose a third approach where a compiler instruments a profile run of the application to search through candidate threads and pick the best threads as the profile run executes. The resultant decomposition is compiled into the application so that a production run of the application has no instrumentation and does not incur any decomposition overhead. We avoid static decomposition's estimation accuracy problem by using actual profile-run execution times to pick threads, and we avoid dynamic decomposition's overhead by performing the decomposition at profile time. Because we allow candidate threads to span arbitrary sections of the application's call graph and loop nests, an exhaustive search of the decomposition space is prohibitive, even in profile runs.
To address this issue, we make the key observation that the run-time overhead of a thread depends, to first order, only on the threads that overlap with it in execution (e.g., in a four-core CMP, a given thread can overlap with at most three preceding and three following threads). This observation implies that a given thread affects only a few other threads, allowing pruning of the search space. Using a CMP simulator, we achieve an average speedup of 3.51 on four cores for five of the SPEC CFP2000 benchmarks, which compares favorably to recent static techniques. We also discuss experiments with CINT2000.
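
The overlap observation can be made concrete with a minimal sketch. It assumes, for illustration only, that threads are numbered in sequential program order and that at most as many consecutive threads as there are cores are in flight at once; the function name `overlapping_threads` and its parameters are hypothetical and do not come from the paper.

```python
def overlapping_threads(index, num_threads, num_cores):
    """Return the indices of threads that may overlap thread `index` in execution.

    Illustrative assumptions (not taken from the paper's implementation):
    threads are numbered in sequential program order, and at most
    `num_cores` consecutive threads run concurrently.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if not 0 <= index < num_threads:
        raise IndexError("thread index out of range")

    # A thread can overlap with at most (num_cores - 1) predecessors
    # and (num_cores - 1) successors.
    reach = num_cores - 1
    first = max(0, index - reach)
    last = min(num_threads - 1, index + reach)
    return [i for i in range(first, last + 1) if i != index]


if __name__ == "__main__":
    # Four cores: three preceding and three following threads.
    print(overlapping_threads(5, 100, 4))  # [2, 3, 4, 6, 7, 8]
```

Under these assumptions, the number of threads whose overhead a candidate can influence is bounded by 2 × (cores − 1), independent of the total number of threads, which is what makes pruning possible.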