NUMA policies and their relation to memory architecture
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Scheduling and page migration for multiprocessor compute servers
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Operating system support for improving data locality on CC-NUMA compute servers
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors
Proceedings of the 25th annual international symposium on Computer architecture
Scaling application performance on a cache-coherent multiprocessor
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Eliminating synchronization overhead in automatically parallelized programs using dynamic feedback
ACM Transactions on Computer Systems (TOCS)
Performance experiences on Sun's Wildfire prototype
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
Maximizing Speedup through Self-Tuning of Processor Allocation
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
The evolution of the HP/Convex Exemplar
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
WildFire: A Scalable Path for SMPs
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Is data distribution necessary in OpenMP?
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
The trade-off between implicit and explicit data distribution in shared-memory programming paradigms
ICS '01 Proceedings of the 15th international conference on Supercomputing
Scaling irregular parallel codes with minimal programming effort
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models
International Journal of Parallel Programming
Quantifying and Resolving Remote Memory Access Contention on Hardware DSM Multiprocessors
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Leveraging Transparent Data Distribution in OpenMP via User-Level Dynamic Page Migration
ISHPC '00 Proceedings of the Third International Symposium on High Performance Computing
User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Compiler Techniques for the Distribution of Data and Computation
IEEE Transactions on Parallel and Distributed Systems
Quantifying contention and balancing memory load on hardware DSM multiprocessors
Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
A transparent runtime data distribution engine for OpenMP
Scientific Programming
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
OpenMP and NUMA architectures I: Investigating memory placement on the SGI origin 3000
ICCS'03 Proceedings of the 2003 international conference on Computational science
Dual-layered file cache on cc-NUMA system
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A QHD-capable parallel H.264 decoder
Proceedings of the international conference on Supercomputing
Matching memory access patterns and data placement for NUMA systems
Proceedings of the Tenth International Symposium on Code Generation and Optimization
A flexible and dynamic page migration infrastructure based on hardware counters
The Journal of Supercomputing
Hi-index | 0.00 |
This paper presents user-level dynamic page migration, a runtime technique which transparently enables parallel programs to tune their memory performance on distributed shared memory multiprocessors, with feedback obtained from dynamic monitoring of memory activity. Our technique exploits the iterative nature of parallel programs and information available to the program both at compile time and at runtime in order to improve the accuracy and the timeliness of page migrations, as well as amortize better the overhead, compared to page migration engines implemented in the operating system. We present an adaptive page migration algorithm based on a competitive and a predictive criterion. The competitive criterion is used to correct poor page placement decisions of the operating system, while the predictive criterion makes the algorithm responsive to scheduling events that necessitate immediate page migrations, such as preemptions and migrations of threads. We also present a new technique for preventing page pingpong and a mechanism for monitoring the performance of page migration algorithms at runtime and tuning their sensitive parameters accordingly. Our experimental evidence on a SGI Origin2000 shows that unmodified OpenMP codes linked with our runtime system for dynamic page migration are effectively immune to the page placement strategy of the operating system and the associated problems with data locality. Furthermore, our runtime system achieves solid performance improvements compared to the IRIX 6.5.5 page migration engine, for single parallel OpenMP codes and multiprogrammed workloads.