On the Architectural Requirements for Efficient Execution of Graph Algorithms
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
GraphStep: A System Architecture for Sparse-Graph Algorithms
FCCM '06 Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
Evaluating synchronization techniques for light-weight multithreaded/multicore architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Efficient Breadth-First Search on the Cell/BE Processor
IEEE Transactions on Parallel and Distributed Systems
Early experiences with large-scale Cray XMT systems
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Accelerating CUDA graph algorithms at maximum warp
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Parallel breadth-first search on distributed memory systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
An overview of Medusa: simplified graph processing on GPUs
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Green-Marl: a DSL for easy and efficient graph analysis
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Highly scalable graph search for the Graph500 benchmark
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Managing large graphs on multi-cores with graph awareness
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
A yoke of oxen and a thousand chickens for heavy lifting graph processing
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Direction-optimizing breadth-first search
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Breaking the speed and scalability barriers for graph exploration on distributed-memory machines
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Approximate weighted matching on emerging manycore and multithreaded architectures
International Journal of High Performance Computing Applications
Multi-core reachability for timed automata
FORMATS'12 Proceedings of the 10th international conference on Formal Modeling and Analysis of Timed Systems
Ligra: a lightweight graph processing framework for shared memory
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Fast exact shortest-path distance queries on large networks by pruned landmark labeling
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Understanding parallelism in graph traversal on multi-core clusters
Computer Science - Research and Development
Massive data analytics: the graph 500 on IBM Blue Gene/Q
IBM Journal of Research and Development
Scale-up graph processing: a storage-centric view
First International Workshop on Graph Data Management Experiences and Systems
Efficient breadth first search on multi-GPU systems
Journal of Parallel and Distributed Computing
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
X-Stream: edge-centric graph processing using streaming partitions
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
The energy case for graph processing on hybrid CPU and GPU systems
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Data Parallel Implementation of Belief Propagation in Factor Graphs on Multi-core Platforms
International Journal of Parallel Programming
Direction-optimizing breadth-first search
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.00 |
Many important problems in computational sciences, social network analysis, security, and business analytics, are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a highlevel algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processorspecific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we prove that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 millions edges, (2) capable of processing 550 million edges per second with an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA-2 with 40 processors and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.