SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
IEEE Transactions on Parallel and Distributed Systems
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Parallel Programming with Polaris
Computer
IEEE Transactions on Parallel and Distributed Systems
Code Generation in the Polyhedral Model Is Easier Than You Think
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
The STAMPede approach to thread-level speculation
ACM Transactions on Computer Systems (TOCS)
Optimizing memory transactions
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Accelerator: using data parallelism to program GPUs for general-purpose uses
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Optimistic parallelism requires abstractions
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Software transactional memory for large scale clusters
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Software thread-level speculation: an optimistic library implementation
Proceedings of the 1st international workshop on Multicore software engineering
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
Iterative optimization in the polyhedral model: part ii, multidimensional time
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
DiSTM: A Software Transactional Memory Framework for Clusters
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
How much parallelism is there in irregular applications?
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Copy or Discard execution model for speculative parallelization on multicores
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Automatic parallelization for graphics processing units
PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
NePaLTM: Design and Implementation of Nested Parallelism for Transactional Memory Systems
Genoa Proceedings of the 23rd European Conference on ECOOP 2009 --- Object-Oriented Programming
D2STM: Dependable Distributed Software Transactional Memory
PRDC '09 Proceedings of the 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing
Implementing the PGI Accelerator model
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
hiCUDA: High-Level GPGPU Programming
IEEE Transactions on Parallel and Distributed Systems
A domain-specific approach to heterogeneous parallelism
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Divergence Analysis and Optimizations
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Hardware transactional memory for GPU architectures
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Adaptive input-aware compilation for graphics engines
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Automatic speculative DOALL for clusters
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Simultaneous branch and warp interweaving for sustained GPU performance
Proceedings of the 39th Annual International Symposium on Computer Architecture
iGPU: exception support and speculative execution on GPUs
Proceedings of the 39th Annual International Symposium on Computer Architecture
Towards a software transactional memory for graphics processors
EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Hi-index | 0.00 |
Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity computer systems that frequently go unused by most applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs remains a difficult challenge for a variety of reasons, including the inherent algorithmic characteristics and data structure choices used by applications as well as the tedious performance optimization cycle that is necessary to achieve high performance. The goal of this work is to increase the applicability of GPUs beyond CUDA/OpenCL to implicitly data-parallel applications written in C/C++ using speculative parallelization. To achieve this goal, we propose Paragon: a static/dynamic compiler platform to speculatively run possibly data-parallel portions of sequential applications on the GPU while cooperating with the system CPU. For such loops, Paragon utilizes the GPU in an opportunistic way while orchestrating a cooperative relation between the CPU and GPU to reduce the overhead of miss-speculations. Paragon monitors the dependencies for the loops running speculatively on the GPU and nonspeculatively on the CPU using a lightweight distributed conflict detection designed specifically for GPUs, and transfers the execution to the CPU in case a conflict is detected. Paragon resumes the execution on the GPU after the CPU resolves the dependency. Our experiments show that Paragon achieves 4x on average and up to 30x speedup compared to unsafe CPU execution with four threads and 7x on average and up to 64x speedup versus sequential execution across a set of sequential but implicitly data-parallel applications.