This paper describes and evaluates a system of dynamic memory migration for codes executing in a Non-Uniform Memory Access (NUMA) environment. The system uses information about load imbalance within a workload to determine the affinity between application threads and regions of memory. This affinity information then serves as the basis for migration decisions, with the aim of minimising the NUMA distance between code and the memory it accesses. Results are presented that demonstrate the effectiveness of this technique in reducing the runtime of a set of representative HPC kernels.
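The stated objective, minimising the NUMA distance between code and the memory it accesses, can be illustrated with a small sketch. This is not the system evaluated in the paper: the abstract does not say how affinity is derived from load imbalance, so the sketch simply assumes that per-region access counts, grouped by the node on which the accessing threads run, are already available. The names `placement_cost` and `plan_migrations`, the data layout, and the distance values are all hypothetical.

```python
from typing import Dict, Hashable, List


def placement_cost(accesses: Dict[int, int],
                   distance: List[List[int]],
                   home: int) -> int:
    """Access-weighted NUMA distance if the region lives on node `home`.

    `accesses` maps the node of the accessing threads to an access count
    (an assumed input; the paper derives affinity from load imbalance).
    """
    return sum(count * distance[src][home] for src, count in accesses.items())


def plan_migrations(region_home: Dict[Hashable, int],
                    region_accesses: Dict[Hashable, Dict[int, int]],
                    distance: List[List[int]]) -> Dict[Hashable, int]:
    """Return {region: target_node} for regions that would benefit from moving."""
    moves = {}
    for region, home in region_home.items():
        accesses = region_accesses.get(region, {})
        current = placement_cost(accesses, distance, home)
        # Pick the cheapest node; migrate only on a strict improvement,
        # so ties never trigger a pointless move.
        target = min(range(len(distance)),
                     key=lambda node: placement_cost(accesses, distance, node))
        if placement_cost(accesses, distance, target) < current:
            moves[region] = target
    return moves


# Hypothetical two-node machine: local distance 10, remote distance 20.
distance = [[10, 20], [20, 10]]
region_home = {"A": 0, "B": 1, "C": 0}
region_accesses = {
    "A": {0: 5, 1: 100},   # mostly accessed from node 1, but homed on node 0
    "B": {1: 50},          # already local to its only accessor
    "C": {0: 10, 1: 10},   # evenly shared, so no node is strictly better
}
moves = plan_migrations(region_home, region_accesses, distance)
print(moves)
```

In this example only region `A` is selected for migration, since moving it to node 1 lowers its weighted distance. A real system would also have to weigh the cost of the migration itself against the expected saving, which this sketch ignores.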