In modern shared-memory systems, the communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and hardware threads. The performance of multithreaded applications therefore varies with the mapping of software threads to logical processors. In our study we observe large variations in application performance under different mappings, and applications with irregular access patterns perform poorly under the default mapping. We maximize application performance by balancing communication overhead against available resources. Remote-access overhead in irregular applications dominates execution time and cannot be reduced by mapping alone on NUMA systems when the logical processors span multiple chips. In addition to new data replication and distribution optimizations, we improve geographical locality by matching the access pattern to the data layout. We further propose a locality-centric optimization that simultaneously reduces remote accesses and improves cache performance. Our approach achieves better performance than prior NUMA-specific techniques.