NUMA-aware graph mining techniques for performance and energy efficiency

Authors:
Michael Frasca; Kamesh Madduri; Padma Raghavan
Affiliations:
The Pennsylvania State University, University Park, Pennsylvania (all authors)
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

We investigate dynamic methods to improve the power and performance profiles of large irregular applications on modern multi-core systems. In this context, we study a large sparse graph application, Betweenness Centrality, and focus on memory behavior as the core count scales. We introduce new techniques to efficiently map the computational demands onto non-uniform memory access (NUMA) architectures. Our dynamic design adapts to the hardware topology and dramatically improves both energy consumption and performance, with larger gains at higher core counts. We implement two components: an adaptive data layout, which reorganizes the graph after observing parallel access patterns, and a dynamic task scheduler that encourages data sharing between neighboring cores. Measuring performance and energy consumption on a modern multi-core machine, we observe that mean execution time is reduced by 51.2% and energy consumption by 52.4%.
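The abstract does not say exactly how the graph is reorganized. Purely as an illustration of the general idea of an access-pattern-driven layout, the Python sketch below assumes a profiling pass that records (thread, vertex) accesses. It then assigns each vertex to the thread that touched it most often and relabels the vertices of a CSR (compressed sparse row) graph so that each thread's vertices are stored contiguously. The log format and the function names `dominant_owner`, `relabel_by_owner`, and `permute_csr` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def dominant_owner(access_log, num_vertices):
    """Map each vertex to the thread that accessed it most often.

    access_log: iterable of (thread_id, vertex) pairs from a profiling pass
    (hypothetical format). Vertices that were never accessed get owner -1.
    """
    counts = defaultdict(Counter)
    for tid, v in access_log:
        counts[v][tid] += 1
    owner = []
    for v in range(num_vertices):
        if counts[v]:
            owner.append(counts[v].most_common(1)[0][0])
        else:
            owner.append(-1)
    return owner


def relabel_by_owner(owner):
    """Return new_id[old] so that vertices with the same owner are contiguous.

    Owned vertices are grouped by thread id, keeping their original order
    within a group; untouched vertices (owner -1) are placed last.
    """
    n = len(owner)
    order = sorted(range(n), key=lambda v: (owner[v] < 0, owner[v], v))
    new_id = [0] * n
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id


def permute_csr(offsets, targets, new_id):
    """Rebuild a CSR graph (offsets, targets) under the relabeling new_id."""
    n = len(offsets) - 1
    old_of = [0] * n  # inverse permutation: new id -> old id
    for old, new in enumerate(new_id):
        old_of[new] = old
    new_offsets, new_targets = [0], []
    for new in range(n):
        old = old_of[new]
        row = targets[offsets[old]:offsets[old + 1]]
        new_targets.extend(sorted(new_id[u] for u in row))
        new_offsets.append(len(new_targets))
    return new_offsets, new_targets


if __name__ == "__main__":
    # 4-cycle 0-1-2-3-0; thread 0 touched vertices 0 and 2, thread 1 touched 1 and 3.
    offsets = [0, 2, 4, 6, 8]
    targets = [1, 3, 0, 2, 1, 3, 0, 2]
    log = [(0, 0), (1, 1), (0, 2), (1, 3), (0, 0)]
    new_id = relabel_by_owner(dominant_owner(log, 4))
    print(new_id)                                 # [0, 2, 1, 3]
    print(permute_csr(offsets, targets, new_id))  # ([0, 2, 4, 6, 8], [2, 3, 2, 3, 0, 1, 0, 1])
```

After relabeling, each thread's vertices occupy one contiguous range of the CSR arrays. Under a first-touch page placement policy (the Linux default), having each thread initialize its own range would tend to place those pages on that thread's local NUMA node. The abstract does not state whether the authors rely on this mechanism.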