As performance improvements are increasingly sought through coarse-grained parallelism, established expectations of continued sequential performance increases are not being met. Current trends in computing point towards platforms that seek performance improvements through various degrees of parallelism, with coarse-grained parallel features becoming commonplace even in entry-level systems.

Yet the broad variety of available multiprocessor configurations, which differ in their number of processing elements, will make it difficult to statically create a single parallel version of a program that performs well across the whole range of such hardware. As a result, there will soon be a vast number of multiprocessor systems that are significantly under-utilized for lack of software that harnesses their power effectively. This problem is exacerbated by the growing inventory of legacy programs that exist only in binary executable form and whose source code may no longer be available.

We present a system that improves the performance of optimized sequential binaries through dynamic recompilation. Leveraging observations made at runtime, a thin software layer recompiles executing code that was compiled for a uniprocessor and generates parallelized and/or vectorized code segments that exploit the available parallel resources. Among the techniques employed are control speculation, loop distribution across several threads, and automatic parallelization of recursive routines.

Our solution is entirely software-based and can be ported to existing hardware platforms that have parallel processing capabilities. Our performance results are obtained on real hardware, without simulation.

In preliminary benchmarks on only modestly parallel (2-way) hardware, our system already provides speedups of up to 40% on SPEC CPU benchmarks, and near-optimal speedups on more obviously parallelizable benchmarks.