State-of-the-art graphics processing units (GPUs) provide very high memory bandwidth, but the performance of many general-purpose GPU (GPGPU) workloads is still bounded by memory bandwidth. Although compression techniques have been adopted by commercial GPUs, they are used only for compressing texture and color data, not data for GPGPU workloads. Furthermore, the microarchitectural details of GPU compression are proprietary, and its performance benefits have not been previously published. In this paper, we first investigate the microarchitectural changes required to support lossless compression of data transferred between the GPU and its off-chip memory, providing higher effective bandwidth. Second, by exploiting characteristics of the floating-point numbers in many GPGPU workloads, we propose applying lossless compression to floating-point numbers after truncating their least-significant bits (i.e., lossy compression). This reduces bandwidth usage even further with very little impact on overall computational accuracy. Finally, we demonstrate that a GPU with our lossless and lossy compression techniques improves the performance of memory-bound GPGPU workloads by 26% and 41% on average, respectively.
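The lossy step described above, dropping the least-significant mantissa bits of each floating-point value before applying a lossless compressor, can be sketched in software as follows. This is a hypothetical illustration of the general idea, not the paper's hardware design; the bit count `drop_bits` and the function name are assumptions for the example.

```python
import struct

def truncate_float32(x: float, drop_bits: int) -> float:
    """Zero out the `drop_bits` least-significant mantissa bits of a
    float32 value. The resulting trailing zeros make the bit pattern
    more compressible for a downstream lossless compressor, at the
    cost of a small, bounded relative error (illustrative sketch)."""
    # Reinterpret the float32 as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Mask off the low `drop_bits` bits of the 23-bit mantissa.
    mask = 0xFFFFFFFF & ~((1 << drop_bits) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]
```

Dropping 8 of the 23 mantissa bits, for example, bounds the relative error near 2^-15 (about 0.003%), which matches the paper's observation that modest truncation has very little impact on overall accuracy.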