Modern Chip Multiprocessors (CMPs) composed of accelerators and on-chip scratchpad memories are emerging as power-efficient architectures. However, these architectures are hard to program because they require efficient data allocation. In addition, legacy applications cannot benefit from their high computational power unless their code is adapted to use the distributed memory architecture. In this paper, we propose FELI, a set of operating system mechanisms that allocate application data to on-chip memories without any user intervention. FELI automatically maps data to on-chip memories using the address translation mechanism. It relies on a set of TLB counters and on dynamic migration of pages from off-chip memory to on-chip memory. We also introduce virtually tagged L0 caches to alleviate the address translation overhead. Moreover, we compare performance and power consumption against a homogeneous cache-based CMP design. Our evaluation shows a 50% average improvement in power consumption with the scratchpad-based CMP compared to a cache-based CMP, and a 10% improvement in average memory access time, even accounting for the cost of page migrations and TLB invalidations. FELI can automatically allocate on-chip memory to an average of 90% of the applications' working set.
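
The counter-driven migration idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not FELI's actual policy: the class name `PageMigrator`, the fixed `threshold`, the page-granular `onchip_capacity`, and the "evict the coldest resident page" rule are all hypothetical choices made for illustration. It models per-page access counters (standing in for the TLB counters) and promotes a page to on-chip memory once it has been accessed often enough.

```python
from dataclasses import dataclass, field


@dataclass
class PageMigrator:
    """Toy model: count accesses per page and promote hot pages on-chip."""
    onchip_capacity: int   # pages the on-chip memory can hold (assumed)
    threshold: int         # accesses before a page counts as hot (assumed)
    counters: dict = field(default_factory=dict)  # page -> access count
    onchip: set = field(default_factory=set)      # pages currently on-chip
    migrations: int = 0

    def access(self, page: int) -> bool:
        """Record one access; return True if it was served on-chip."""
        hit = page in self.onchip
        self.counters[page] = self.counters.get(page, 0) + 1
        if not hit and self.counters[page] >= self.threshold:
            self._migrate(page)
        return hit

    def _migrate(self, page: int) -> None:
        """Move a hot page on-chip, evicting the coldest page if full."""
        if len(self.onchip) >= self.onchip_capacity:
            victim = min(self.onchip, key=lambda p: self.counters[p])
            if self.counters[victim] >= self.counters[page]:
                return  # resident pages are at least as hot; skip migration
            # A real system would also invalidate the victim's TLB entry here.
            self.onchip.remove(victim)
        self.onchip.add(page)
        self.migrations += 1


migrator = PageMigrator(onchip_capacity=2, threshold=3)
hits = [migrator.access(p) for p in [1, 1, 1, 1, 2, 2, 2, 2, 3]]
print(sum(hits), migrator.migrations, sorted(migrator.onchip))
```

In this trace, pages 1 and 2 each cross the threshold on their third access and are served on-chip afterwards, while page 3 is touched only once and stays off-chip. The model ignores migration cost and TLB invalidation latency, which the evaluation described above does account for.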