Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing

  • Authors:
  • Nadathur Satish;Changkyu Kim;Jatin Chhugani;Pradeep Dubey

  • Affiliations:
  • Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation

  • Venue:
  • SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing among others. There has been a push for HPC machines to be rated not just in Petaflops, but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high latency data communication with large volumes of transfers across nodes, leading to inefficiency in performance and energy consumption. In this work, we show that we can overcome these constraints using a combination of efficient low-overhead data compression techniques to reduce transfer volumes along with latency-hiding techniques. Using an optimized single node graph traversal algorithm [1], our novel cluster optimizations result in over 6.6X performance improvements over state-of-the-art data transfer techniques, and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120 core Intel® Endeavor cluster with E5-2700 Sandybridge nodes, which matches the second ranked result in the most recent November 2011 Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations only have a 1.8X overhead in overall performance from the performance of the optimized single-node implementation, and allows for near-linear scaling with number of nodes. Our algorithm on 1024 nodes on an Intel® Xeon® X5670 Westmere processor (with lower per-node performance) for a large multi-Terabyte graph attained 195 GigaTEPS in performance, proving the high scalability of our algorithm. Our per-node performance is the highest in the top 10 of the Nov 2011 Graph500 list.