How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly, without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse-graph domain are usually highly concurrent and communicate along graph edges. Exposing this concurrency and communication structure enables the scheduling of parallel operations and the management of communication that are necessary for performance on a spatial computer. We introduce a language syntax for GraphStep applications and study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups of 3 to 30 times over a spatially oblivious mapping.
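
To make the style of computation concrete, the following is a minimal sketch of a step-synchronized shortest-path computation on a static sparse graph, in which nodes communicate only along graph edges. It is not GraphStep's language syntax or the implementation studied here: the function name `graph_step_shortest_path`, the adjacency-dictionary input format, and the send/receive structure of each step are assumptions made for illustration, and the steps are emulated sequentially rather than run on a spatial computer. Non-negative edge weights are also assumed.

```python
import math
from collections import defaultdict


def graph_step_shortest_path(edges, source):
    """Single-source shortest paths computed in synchronized steps.

    edges:  dict mapping node -> list of (neighbor, weight) pairs
            (hypothetical input format, chosen for this sketch).
    source: starting node.
    Returns (dist, steps): a dict of node -> distance and the number
    of steps executed.
    """
    # Collect every node, including those that only appear as targets.
    nodes = set(edges)
    for outs in edges.values():
        for dst, _ in outs:
            nodes.add(dst)

    dist = {n: math.inf for n in nodes}
    dist[source] = 0
    active = {source}  # nodes whose state changed in the previous step
    steps = 0

    while active:
        # Send phase: each active node sends a message along each out-edge.
        inbox = defaultdict(list)
        for n in active:
            for dst, w in edges.get(n, []):
                inbox[dst].append(dist[n] + w)

        # Receive phase: each node reduces its messages with min and
        # updates its state; only changed nodes are active next step.
        active = set()
        for n, msgs in inbox.items():
            best = min(msgs)
            if best < dist[n]:
                dist[n] = best
                active.add(n)
        steps += 1

    return dist, steps


if __name__ == "__main__":
    example = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": [("d", 1)]}
    distances, n_steps = graph_step_shortest_path(example, "a")
    print(distances, n_steps)
```

Within a step, the per-node sends and the per-node reductions do not depend on one another, which is the kind of concurrency and edge-local communication the model is meant to expose; the sketch itself leaves placement and scheduling out entirely.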