Approximate computing, in which computation accuracy is traded for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing overabundance of information. In particular domains such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to providing transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because of its high performance capabilities and its programming challenges, which proper automation can alleviate. Our approach, SAGE, combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a run-time system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average 2.5x speedup with less than 10% quality loss compared to accurate execution on an NVIDIA GTX 560 GPU.
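The run-time side of this design can be illustrated with a small sketch. The following Python snippet is a hypothetical simplification of SAGE's selection mechanism, not the paper's actual implementation: it assumes the compiler has emitted a list of kernel variants whose runtime and output quality have already been measured, and it greedily picks the fastest variant that still meets the user's quality target, falling back to the most accurate variant if none qualifies.

```python
def select_kernel(variants, target_quality):
    """Pick the fastest kernel variant meeting the quality target.

    variants: list of dicts with illustrative keys
      'name'    -- variant identifier (hypothetical)
      'time'    -- measured run time, e.g. in milliseconds
      'quality' -- measured output quality relative to exact execution (0..1)
    target_quality: minimum acceptable quality set by the user.
    """
    # Keep only variants that satisfy the user's quality constraint.
    acceptable = [v for v in variants if v["quality"] >= target_quality]
    if acceptable:
        # Among acceptable variants, prefer the fastest one.
        return min(acceptable, key=lambda v: v["time"])
    # No variant meets the target: fall back to the most accurate kernel.
    return max(variants, key=lambda v: v["quality"])


# Example: an exact kernel plus two approximate variants (invented numbers).
variants = [
    {"name": "exact",        "time": 1.00, "quality": 1.00},
    {"name": "data_packing", "time": 0.50, "quality": 0.95},
    {"name": "thread_fuse",  "time": 0.30, "quality": 0.85},
]
print(select_kernel(variants, 0.90)["name"])  # data_packing
```

In the real system the selection is iterative and quality is monitored online rather than known up front; this one-shot version only conveys the trade-off being navigated.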