FELI: HW/SW support for on-chip distributed shared memory in multicores

Authors:
Carlos Villavieja;Yoav Etsion;Alex Ramirez;Nacho Navarro
Affiliations:
Universitat Politecnica de Catalunya and Barcelona Supercomputing Center, Barcelona, Spain;Barcelona Supercomputing Center, Barcelona, Spain;Universitat Politecnica de Catalunya and Barcelona Supercomputing Center, Barcelona, Spain;Universitat Politecnica de Catalunya and Barcelona Supercomputing Center, Barcelona, Spain
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Year:
2011

Abstract

Modern Chip Multiprocessors (CMPs) composed of accelerators and on-chip scratchpad memories are emerging as power-efficient architectures. However, these architectures are hard to program because they require efficient data allocation. In addition, legacy applications cannot benefit from the high computational power of these architectures unless their code is adapted to use the distributed memory architecture. In this paper, we propose FELI, a set of operating system mechanisms that allocate application data to on-chip memories without any user intervention. FELI automatically maps data to on-chip memories using the address translation mechanism. It relies on a set of TLB counters and on dynamic migration of pages from off-chip memory to on-chip memory. We also introduce virtually tagged L0 caches to alleviate the address translation overhead. Moreover, we compare the performance and power consumption of this design against a homogeneous cache-based CMP design. Our evaluation shows a 50% average improvement in power consumption with the scratchpad-based CMP compared to a cache-based CMP, and a 10% improvement in average memory access time, even accounting for the cost of page migrations and TLB invalidations. FELI can automatically allocate on-chip memory to an average of 90% of the applications' working set.
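
The abstract describes the mechanism only at a high level: TLB counters identify frequently used pages, which are then migrated from off-chip to on-chip memory. The following Python sketch is a toy model of that general idea, not the policy used in the paper. The class name PageMigrator, the page size, the access threshold, and the scratchpad capacity are all assumptions made for illustration, and the sketch does not model eviction, the L0 caches, or the cost of TLB invalidations.

```python
from collections import Counter

PAGE_SIZE = 4096      # assumed page size in bytes (not from the paper)
HOT_THRESHOLD = 3     # assumed access count that triggers a migration
ON_CHIP_PAGES = 2     # assumed scratchpad capacity, in pages


class PageMigrator:
    """Toy model: count accesses per virtual page, migrate hot pages on chip."""

    def __init__(self, capacity=ON_CHIP_PAGES, threshold=HOT_THRESHOLD):
        self.capacity = capacity
        self.threshold = threshold
        self.counters = Counter()  # stands in for per-entry TLB counters
        self.on_chip = set()       # virtual page numbers mapped to scratchpad
        self.migrations = 0

    def access(self, vaddr):
        """Record one memory access and report where it was served from."""
        vpn = vaddr // PAGE_SIZE
        if vpn in self.on_chip:
            return "on-chip"
        self.counters[vpn] += 1
        hot = self.counters[vpn] >= self.threshold
        if hot and len(self.on_chip) < self.capacity:
            # A real system would copy the page and invalidate the stale
            # TLB entry here; the toy model only updates the mapping.
            self.on_chip.add(vpn)
            self.migrations += 1
        return "off-chip"
```

With the default parameters, the first three accesses to a page are served off-chip, the third one triggers the migration, and later accesses to that page are served on-chip until the assumed capacity is exhausted.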