We investigate dynamic methods to improve the power and performance profiles of large irregular applications on modern multi-core systems. In this context, we study a large sparse graph application, Betweenness Centrality, and focus on its memory behavior as the core count scales. We introduce new techniques to efficiently map computational demands onto non-uniform memory access (NUMA) architectures. Our dynamic design adapts to the hardware topology and substantially improves both energy consumption and performance, with larger gains at higher core counts. We implement a scheme for adaptive data layout, which reorganizes the graph after observing parallel access patterns, and a dynamic task scheduler that encourages data sharing between neighboring cores. We measure performance and energy consumption on a modern multi-core machine and observe that mean execution time is reduced by 51.2% and energy consumption by 52.4%.
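To make the scheduling idea concrete, the following is a minimal sketch of one way a scheduler could favor neighboring cores: when a core runs out of work, it tries to take tasks from cores on its own NUMA node before looking further away. This is an illustration under stated assumptions, not the scheduler evaluated in the paper. The function `steal_order`, the `core_to_node` mapping, and the two-node topology are all hypothetical names and values chosen for the example, and the fixed ordering is a simplification (work-stealing runtimes commonly randomize victim choice).

```python
from typing import Dict, List


def steal_order(thief: int, core_to_node: Dict[int, int]) -> List[int]:
    """Return candidate victim cores for `thief`, nearest first.

    Hypothetical helper: cores on the thief's own NUMA node come first,
    then all remaining cores. Ties are broken by core id so the result
    is deterministic.
    """
    home = core_to_node[thief]
    others = [core for core in core_to_node if core != thief]
    # False sorts before True, so same-node cores lead the list.
    return sorted(others, key=lambda core: (core_to_node[core] != home, core))


# Assumed topology for illustration: cores 0-3 on node 0, cores 4-7 on node 1.
topology = {core: core // 4 for core in range(8)}
order = steal_order(5, topology)
```

With this assumed topology, core 5 would first try cores 4, 6, and 7, which share its node, and only then cores 0 through 3. Tasks stolen from a same-node victim are more likely to touch data that is already resident in local memory.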