One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation
Proceedings of the 26th ACM international conference on Supercomputing
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Software-level scheduling to exploit non-uniformly shared data cache on GPGPU
Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Hi-index | 0.00 |
Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to improve the data locality for the next period of execution. In this work, we examine the implications that modern heterogeneous Chip Multiprocessors (CMP) architecture imposes on the optimization paradigm. We develop three techniques to enhance the optimizations. The first is asynchronous data transformation, which moves data reordering off the critical path through dependence circumvention. The second is a novel data transformation algorithm, named TLayout, designed specially to take advantage of modern throughput-oriented processors. Together they provide two complementary ways to attack a benefit-overhead dilemma inherited in traditional techniques. Working with a dynamic adaptation scheme, the techniques produce significant performance improvement for a set of dynamic simulation benchmarks.