Single-chip CPU/GPU architectures are being adopted in high-end embedded systems such as smartphones and tablet PCs. Because DRAM is becoming difficult to scale, the main memory subsystem of such chips is expected to combine DRAM with phase-change RAM (PRAM). In this work, we address performance optimization of a hybrid DRAM/PRAM main memory for a single-chip CPU/GPU. The CPU requires low memory latency, whereas the GPU is relatively tolerant of long latency, so DRAM is allocated to the CPU first and PRAM, which has a longer write latency, is allocated to the GPU. To improve the write performance of GPU traffic, we then propose (1) an in-DRAM write buffer that absorbs GPU write traffic, (2) dynamic hot data management to improve the efficiency of the write buffer, (3) runtime-adaptive adjustment of the write buffer size to meet a given CPU performance bound, and (4) CPU-aware DRAM access scheduling to give low latency to CPU traffic. Experiments show that the proposed method improves GPU performance by 1.02x to 44.2x with modest CPU performance overhead, which becomes negligible when compute-intensive CPU programs run.
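The interplay of ideas (1) to (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the hot-data policy or the resizing rule, so the sketch assumes a least-recently-used (LRU) policy for keeping hot lines in the buffer and a simple step-up/step-down rule driven by a measured CPU slowdown. The names `WriteBuffer`, `adjust_capacity`, `cpu_slowdown`, `bound`, and `step` are all hypothetical.

```python
from collections import OrderedDict


class WriteBuffer:
    """Toy model of an in-DRAM write buffer in front of PRAM (assumed LRU policy)."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of lines the buffer may hold
        self.lines = OrderedDict()    # address -> True, ordered from cold to hot
        self.dram_hits = 0            # GPU writes absorbed by a buffered line
        self.pram_writes = 0          # lines written back to PRAM on eviction

    def _evict_to_fit(self):
        # Write the coldest lines back to PRAM until the buffer fits its capacity.
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
            self.pram_writes += 1

    def write(self, addr):
        # A write to a buffered line stays in DRAM and marks the line as hot.
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.dram_hits += 1
            return
        self.lines[addr] = True
        self._evict_to_fit()

    def resize(self, capacity):
        self.capacity = capacity
        self._evict_to_fit()


def adjust_capacity(buf, cpu_slowdown, bound, step, min_capacity=1, max_capacity=1024):
    """Assumed rule: shrink the buffer when the CPU slowdown exceeds the bound, else grow it."""
    if cpu_slowdown > bound:
        new_capacity = max(min_capacity, buf.capacity - step)
    else:
        new_capacity = min(max_capacity, buf.capacity + step)
    buf.resize(new_capacity)
    return new_capacity


if __name__ == "__main__":
    buf = WriteBuffer(capacity=2)
    for addr in [1, 2, 1, 3]:
        buf.write(addr)
    print(buf.dram_hits, buf.pram_writes)
    print(adjust_capacity(buf, cpu_slowdown=0.10, bound=0.05, step=1))
```

In this toy model, repeated writes to the same address are absorbed in DRAM and only evictions reach PRAM, which is the effect the write buffer is meant to have on GPU write traffic. Shrinking the buffer returns DRAM capacity to the CPU at the cost of more PRAM writes.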