Decoupled architectures provide a key to the problem of sustained supercomputer performance through their ability to hide large memory latencies. When a program executes in a decoupled mode, the perceived memory latency at the processor is zero; effectively, the entire physical memory has an access time equivalent to that of the processor's register file, and latency is completely hidden. However, the asynchronous functional units within a decoupled architecture must occasionally synchronize, incurring a high penalty. The goal of compiling and optimizing for decoupled architectures is to partition the program between the asynchronous functional units in such a way that latencies are hidden but synchronization events are executed infrequently. This paper describes a model for decoupled compilation and explains the effectiveness of compilation for decoupled systems. A number of new compiler optimizations are introduced and evaluated quantitatively using the Perfect Club scientific benchmarks. We show that, with a suitable repertoire of optimizations, it is possible to hide large latencies most of the time for most of the programs in the Perfect Club.
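The trade-off between latency hiding and synchronization can be illustrated with a toy timing model. The sketch below is not the compilation model or the evaluation method of the paper; it assumes a hypothetical machine in which an access unit issues one load per cycle, an execute unit consumes one loaded value per cycle, and a synchronization event forces the access unit to wait until the execute unit has drained. The function name `run_cycles` and its parameters are illustrative assumptions.

```python
def run_cycles(n_items, latency, sync_every=None):
    """Toy model: cycle at which the execute unit finishes the last item.

    n_items    -- number of loaded values the execute unit must consume
    latency    -- memory latency in cycles (assumed constant)
    sync_every -- assumed synchronization interval in items; None means
                  the two units never synchronize
    """
    issue_time = 0  # cycle at which the access unit issues the next load
    exec_free = 0   # cycle at which the execute unit is next free
    for i in range(n_items):
        if sync_every and i > 0 and i % sync_every == 0:
            # Synchronization: the access unit cannot run ahead, so the
            # next load is issued only after the execute unit catches up.
            issue_time = max(issue_time, exec_free)
        arrival = issue_time + latency   # value reaches the load queue
        issue_time += 1                  # access unit moves on immediately
        start = max(arrival, exec_free)  # execute unit stalls only on an empty queue
        exec_free = start + 1            # one cycle of computation per item
    return exec_free


if __name__ == "__main__":
    print(run_cycles(1000, 100))      # never synchronizes: 1100
    print(run_cycles(1000, 100, 10))  # synchronizes every 10 items: 11000
    print(run_cycles(1000, 100, 1))   # synchronizes on every item: 101000
```

In this model the fully decoupled run pays the memory latency once, at start-up, while each synchronization event exposes the full latency again. That is why the partitioning aims to make such events infrequent.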