Fat-trees: universal networks for hardware-efficient supercomputing
IEEE Transactions on Computers
Randomized algorithms
ACM Computing Surveys (CSUR)
A query language for a Web-site management system
ACM SIGMOD Record
A multithreaded message passing interface (MPI) architecture: performance and program issues
Journal of Parallel and Distributed Computing
On the Architectural Requirements for Efficient Execution of Graph Algorithms
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Measurement and analysis of online social networks
Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
Efficient Breadth-First Search on the Cell/BE Processor
IEEE Transactions on Parallel and Distributed Systems
FAWN: a fast array of wimpy nodes
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units
Proceedings of the VLDB Endowment
What is Twitter, a social network or a news media?
Proceedings of the 19th international conference on World wide web
Approximating betweenness centrality
WAW'07 Proceedings of the 5th international conference on Algorithms and models for the web-graph
FAST: fast architecture sensitive tree search on modern CPUs and GPUs
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Overlapping communication and computation by using a hybrid MPI/SMPSs approach
Proceedings of the 24th ACM International Conference on Supercomputing
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Scalable Graph Exploration on Multicore Processors
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Accelerating CUDA graph algorithms at maximum warp
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
Parallel breadth-first search on distributed memory systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Better benchmarking for supercomputers
IEEE Spectrum
The entropy of ordered sequences and order statistics
IEEE Transactions on Information Theory
Performance characteristics of Graph500 on large-scale distributed environment
IISWC '11 Proceedings of the 2011 IEEE International Symposium on Workload Characterization
Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency
IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
On fast parallel detection of strongly connected components (SCC) in small-world graphs
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Graph traversal is widely used in a variety of fields, including social networks, business analytics, and high-performance computing. There has been a push for HPC machines to be rated not just in petaflops but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark was established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high-latency data communication and large transfer volumes across nodes, which hurts both performance and energy efficiency. In this work, we show that these constraints can be overcome by combining efficient, low-overhead data compression techniques, which reduce transfer volumes, with latency-hiding techniques. Using an optimized single-node graph traversal algorithm [1], our novel cluster optimizations yield over 6.6X performance improvement over state-of-the-art data transfer techniques and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120-core Intel® Endeavor cluster with E5-2700 Sandy Bridge nodes, which matches the second-ranked result in the most recent November 2011 Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations incur only a 1.8X overhead relative to the optimized single-node implementation and allow near-linear scaling with the number of nodes. On 1024 nodes with Intel® Xeon® X5670 Westmere processors (which have lower per-node performance), our algorithm attained 195 GigaTEPS on a large multi-terabyte graph, demonstrating its high scalability. Our per-node performance is the highest in the top 10 of the November 2011 Graph500 list.
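The abstract does not say which compression scheme is used, so the following is only an illustrative sketch of how the volume of vertex IDs exchanged between nodes in a distributed breadth-first search might be reduced. It assumes that a batch of frontier vertex IDs can be sorted and deduplicated before sending, and it uses delta encoding plus variable-length byte encoding, a common generic technique that is not necessarily the authors' method. The function names `encode_varint`, `compress_vertex_ids`, and `decompress_vertex_ids` are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def compress_vertex_ids(vertex_ids):
    """Sort and deduplicate the IDs, then store the gaps between neighbours.

    Assumption: the receiver does not care about the order or multiplicity
    of the vertex IDs in a batch, so sorting and deduplicating is safe.
    """
    out = bytearray()
    previous = 0
    for vertex in sorted(set(vertex_ids)):
        out += encode_varint(vertex - previous)  # small gaps need few bytes
        previous = vertex
    return bytes(out)


def decompress_vertex_ids(payload):
    """Invert compress_vertex_ids: rebuild the sorted list of vertex IDs."""
    ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in payload:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # gap continues in the next byte
        else:
            current += gap           # gap complete: advance to the next ID
            ids.append(current)
            gap = 0
            shift = 0
    return ids


if __name__ == "__main__":
    batch = [1000500, 1000000, 1000010, 1000003, 1000003]
    payload = compress_vertex_ids(batch)
    print(len(payload), "bytes instead of", 8 * len(batch))
    print(decompress_vertex_ids(payload))
```

With fixed 64-bit IDs each entry costs 8 bytes, whereas closely spaced sorted IDs often fit in one or two bytes per gap. The trade-off is the extra CPU time for sorting and encoding, which is why the abstract's emphasis on low-overhead compression matters.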