While programmable accelerators such as application-specific processors and reconfigurable architectures can dramatically speed up the compute-intensive kernels of an application, application performance can still be severely limited by communication between processors. To minimize this communication overhead, a shared memory such as a scratchpad memory may be placed between the main processor and the accelerator coprocessor. However, this setup poses a significant challenge to the main processor, which must now manage data on the scratchpad explicitly, and the inflexibility of a scratchpad leads to superfluous data copying. In this article, we present an enhancement of the scratchpad, Configurable Range Memory (CRM), whose address range can be reprogrammed to minimize unnecessary data copying between processors and thereby promote data reuse on the accelerator, together with a software management algorithm for the CRM. Our experimental results, based on detailed simulation of full multimedia applications, demonstrate that the CRM architecture reduces communication overhead, cutting kernel execution time by up to 28% and application runtime by up to 12.8%, in addition to a considerable reduction in system energy, compared to a conventional scratchpad-based architecture.