Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing

Authors:
Nadathur Satish;Changkyu Kim;Jatin Chhugani;Pradeep Dubey
Affiliations:
Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Citing 24
Cited 1

Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Transactions on Computers
Randomized algorithms. Randomized algorithms
Software pipelining. ACM Computing Surveys (CSUR)
A query language for a Web-site management system. ACM SIGMOD Record
A multithreaded message passing interface (MPI) architecture: performance and program issues. Journal of Parallel and Distributed Computing
On the Architectural Requirements for Efficient Execution of Graph Algorithms. ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Measurement and analysis of online social networks. Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
Efficient Breadth-First Search on the Cell/BE Processor. IEEE Transactions on Parallel and Distributed Systems
FAWN: a fast array of wimpy nodes. Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment
What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World wide web
Approximating betweenness centrality. WAW'07 Proceedings of the 5th international conference on Algorithms and models for the web-graph
FAST: fast architecture sensitive tree search on modern CPUs and GPUs. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Overlapping communication and computation by using a hybrid MPI/SMPSs approach. Proceedings of the 24th ACM International Conference on Supercomputing
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Scalable Graph Exploration on Multicore Processors. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Accelerating CUDA graph algorithms at maximum warp. Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
The International Exascale Software Project roadmap. International Journal of High Performance Computing Applications
Parallel breadth-first search on distributed memory systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Better benchmarking for supercomputers. IEEE Spectrum
The entropy of ordered sequences and order statistics. IEEE Transactions on Information Theory
Performance characteristics of Graph500 on large-scale distributed environment. IISWC '11 Proceedings of the 2011 IEEE International Symposium on Workload Characterization
Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency. IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium

On fast parallel detection of strongly connected components (SCC) in small-world graphs. SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Abstract

Graph traversal is widely used in a variety of fields, including social networks, business analytics, and high-performance computing. There has been a push for HPC machines to be rated not just in Petaflops, but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high-latency data communication with large volumes of transfers across nodes, leading to inefficiency in both performance and energy consumption. In this work, we show that these constraints can be overcome by combining efficient low-overhead data compression techniques, which reduce transfer volumes, with latency-hiding techniques. Using an optimized single-node graph traversal algorithm [1], our novel cluster optimizations result in over 6.6X performance improvement over state-of-the-art data transfer techniques, and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120-core Intel® Endeavor cluster with E5-2700 Sandy Bridge nodes, which matches the second-ranked result in the most recent (November 2011) Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations incur only a 1.8X overhead relative to the performance of the optimized single-node implementation, and allow for near-linear scaling with the number of nodes. On 1024 nodes with Intel® Xeon® X5670 Westmere processors (which have lower per-node performance), our algorithm attained 195 GigaTEPS on a large multi-Terabyte graph, demonstrating its high scalability. Our per-node performance is the highest in the top 10 of the November 2011 Graph500 list.
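
The abstract does not say which compression scheme is used to reduce transfer volumes. As a purely illustrative sketch (an assumption, not the authors' method), one common low-overhead approach is to sort the vertex IDs destined for a remote node, store only the gaps between consecutive IDs, and write each gap as a variable-length integer. The function names below are hypothetical.

```python
from typing import Iterable, List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def compress_vertex_ids(vertex_ids: Iterable[int]) -> bytes:
    """Hypothetical helper: sort and deduplicate IDs, then store gaps as varints."""
    out = bytearray()
    previous = 0
    for vid in sorted(set(vertex_ids)):
        out += encode_varint(vid - previous)  # small gaps need few bytes
        previous = vid
    return bytes(out)


def decompress_vertex_ids(buf: bytes) -> List[int]:
    """Inverse of compress_vertex_ids: rebuild IDs as a running sum of gaps."""
    ids: List[int] = []
    current = 0
    gap = 0
    shift = 0
    for byte in buf:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            ids.append(current)
            gap = 0
            shift = 0
    return ids


if __name__ == "__main__":
    frontier = [1000003, 1000001, 1000010, 1000001]
    packed = compress_vertex_ids(frontier)
    print(len(packed), decompress_vertex_ids(packed))
```

In this sketch the receiver gets a sorted, deduplicated list, which is acceptable for a breadth-first frontier because the order in which vertices of one level are visited does not change the resulting distances. Whether such a scheme pays off depends on how clustered the IDs are and on the cost of sorting relative to the network time saved.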