Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems

Authors:
Michael Marchetti;Leonidas I. Kontothanassis;Ricardo Bianchini;Michael L. Scott
Venue:
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Year:
1995

Cited by 24

Decoupled hardware support for distributed shared memory

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
VM-based shared memory on low-latency, remote-memory-access networks

Proceedings of the 24th annual international symposium on Computer architecture
Reactive NUMA: a design for unifying S-COMA and CC-NUMA

Proceedings of the 24th annual international symposium on Computer architecture
Cashmere-2L: software coherent shared memory on a clustered remote-write network

Proceedings of the sixteenth ACM symposium on Operating systems principles
Hardware Support for Flexible Distributed Shared Memory

IEEE Transactions on Computers
Optimal replacements in caches with two miss costs

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
A case for user-level dynamic page migration

Proceedings of the 14th international conference on Supercomputing
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters

Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
Is data distribution necessary in OpenMP?

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
The trade-off between implicit and explicit data distribution in shared-memory programming paradigms

ICS '01 Proceedings of the 15th international conference on Supercomputing
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

International Journal of Parallel Programming
A Study of Implicit Data Distribution Methods for OpenMP Using the SPEC Benchmarks

WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Using memory-mapped network interfaces to improve the performance of distributed shared memory

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors

ICPP '00 Proceedings of the 2000 International Conference on Parallel Processing
Shared memory computing on clusters with symmetric multiprocessors and system area networks

ACM Transactions on Computer Systems (TOCS)
Page migration with dynamic space-sharing scheduling policies: the case of the SGI O2000

International Journal of Parallel Programming - Special issue II: The 17th annual international conference on supercomputing (ICS'03)
A transparent runtime data distribution engine for OpenMP

Scientific Programming
Scaling non-regular shared-memory codes by reusing custom loop schedules

Scientific Programming - OpenMP
Experience distributing objects in an SMMP OS

ACM Transactions on Computer Systems (TOCS)
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Dual-layered file cache on cc-NUMA system

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Deadlock-free fine-grained thread migration

NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
Hybrid OpenMP-MPI turbulent boundary layer code over 32k cores

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface

Abstract

The cost of a cache miss depends heavily on the location of the main memory that backs the missing line. For certain applications, this cost is a major factor in overall performance. We report on the utility of OS-based page placement as a mechanism to increase the frequency with which cache fills access local memory in distributed shared memory multiprocessors. Even with the very simple policy of first-use placement, we find significant improvements over round-robin placement for many applications on both hardware- and software-coherent systems. For most of our applications, first-use placement allows 35 to 75 percent of cache fills to be performed locally, resulting in performance improvements of up to 40 percent with respect to round-robin placement. We were surprised to find no performance advantage in more sophisticated policies, including page migration and page replication. In fact, in many cases the performance of our applications suffered under these policies.
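
The difference between the two placement policies named in the abstract can be illustrated with a small sketch. It is not the authors' simulator or methodology: the trace is synthetic, every access is treated as a cache fill, and all names (`round_robin_homes`, `first_use_homes`, `local_fraction`, `example_trace`) are hypothetical. Round-robin assigns page p to node p mod N regardless of use; first-use places each page on the node that touches it first.

```python
from typing import Dict, Iterable, List, Tuple

# One access = (node that issues it, page it touches).
# Assumption: every access counts as one cache fill.
Access = Tuple[int, int]


def round_robin_homes(pages: Iterable[int], n_nodes: int) -> Dict[int, int]:
    """Place page p on node p mod n_nodes, ignoring who uses it."""
    return {p: p % n_nodes for p in pages}


def first_use_homes(trace: Iterable[Access]) -> Dict[int, int]:
    """Place each page on the node that touches it first in the trace."""
    homes: Dict[int, int] = {}
    for node, page in trace:
        homes.setdefault(page, node)  # only the first toucher sets the home
    return homes


def local_fraction(trace: List[Access], homes: Dict[int, int]) -> float:
    """Fraction of accesses whose page lives on the accessing node."""
    local = sum(1 for node, page in trace if homes[page] == node)
    return local / len(trace)


def example_trace(n_nodes: int = 4, pages_per_node: int = 8,
                  private_touches: int = 3) -> List[Access]:
    """Synthetic trace: each node works on its own block of pages, then
    reads one boundary page from its neighbour's block."""
    trace: List[Access] = []
    # Phase 1: every node touches its own contiguous block of pages.
    for node in range(n_nodes):
        for i in range(pages_per_node):
            page = node * pages_per_node + i
            trace.extend([(node, page)] * private_touches)
    # Phase 2: every node reads the first page of its neighbour's block.
    for node in range(n_nodes):
        neighbour = (node + 1) % n_nodes
        trace.append((node, neighbour * pages_per_node))
    return trace


if __name__ == "__main__":
    trace = example_trace()
    pages = {page for _, page in trace}
    rr = local_fraction(trace, round_robin_homes(pages, n_nodes=4))
    fu = local_fraction(trace, first_use_homes(trace))
    print(f"round-robin local fraction: {rr:.2f}")
    print(f"first-use   local fraction: {fu:.2f}")
```

On this particular trace, first-use keeps 96 of 100 accesses local and round-robin keeps 25 of 100. These figures are properties of the synthetic trace only and say nothing about the percentages reported in the abstract. The sketch also assumes that the first toucher of a page is its main user; if another node touched a page first, first-use would place it poorly.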