Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
What have we learnt from using real parallel machines to solve real problems?
C3P Proceedings of the third conference on Hypercube concurrent computers and applications - Volume 2
Vector models for data-parallel computing
Vector models for data-parallel computing
Run-Time Parallelization and Scheduling of Loops
IEEE Transactions on Computers
Software—Practice & Experience
Maximizing parallelism and minimizing synchronization with affine transforms
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Digital systems engineering
A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
SIAM Journal on Scientific Computing
Monsoon: an explicit token-store architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Vector instruction set support for conditional operations
Proceedings of the 27th annual international symposium on Computer architecture
Communications of the ACM - Special issue on computer architecture
Efficient conditional operations for data-parallel architectures
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Architecture of the Atlas Chip-Multiprocessor: Dynamically Parallelizing Irregular Applications
IEEE Transactions on Computers
Conversion of control dependence to data dependence
POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
Asynchronous Problems on SIMD Parallel Computers
IEEE Transactions on Parallel and Distributed Systems
The design and implementation of a parallel array operator for the arbitrary remapping of data
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Program improvement by source to source transformation
POPL '76 Proceedings of the 3rd ACM SIGACT-SIGPLAN symposium on Principles on programming languages
Iterative Methods for Sparse Linear Systems
Iterative Methods for Sparse Linear Systems
Task Pool Teams for Implementing Irregular Algorithms on Clusters of SMPs
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
A Case for Economy Grid Architecture for Service Oriented Grid Computing
IPDPS '01 Proceedings of the 10th Heterogeneous Computing Workshop â"" HCW 2001 (Workshop 1) - Volume 2
Programmable Stream Processors
Computer
Scatter-Add in Data Parallel Architectures
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Power Efficient Processor Architecture and The Cell Processor
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Automatic Support for Irregular Computations in a High-Level Language
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Stream Register Files with Indexed Access
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
The design space of data-parallel memory systems
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Impulse: Memory system support for scientific applications
Scientific Programming
Tradeoff between data-, instruction-, and thread-level parallelism in stream processors
Proceedings of the 21st annual international conference on Supercomputing
Vectorized sparse matrix multiply for compressed row storage format
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Comparative evaluation of memory models for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Languages and Compilers for Parallel Computing
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU
International Journal of Computational Science and Engineering
Analysis and performance results of computing betweenness centrality on IBM Cyclops64
The Journal of Supercomputing
Scalable heterogeneous parallelism for atmospheric modeling and simulation
The Journal of Supercomputing
Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters
The Journal of Supercomputing
Proceedings of the 8th ACM International Conference on Computing Frontiers
I/O streaming evaluation of batch queries for data-intensive computational turbulence
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Scalability study of molecular dynamics simulation on Godson-T many-core architecture
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
The recent emergence of compute-intensive stream processors such as the Cell Broadband Engine, Stanford's Merrimac, and Clear-Speed's CSX600 has made them attractive platforms for scientific high-performance computing. Unstructured mesh and graph applications are an important class of numerical algorithms used in the scientific computing domain, which are particularly challenging for stream architectures. These codes have irregular structures where nodes have a variable number of neighbors, resulting in irregular memory access patterns and irregular control. We study four representative sub-classes of irregular algorithms, including finite-element and finite-volume methods for modeling physical systems, direct methods for n-body problems, and computations involving sparse algebra. We propose a framework for representing the diverse characteristics of these algorithms in the context of the unique properties of stream architectures, and demonstrate it using one representative application from each sub-class. We then develop techniques for mapping the applications onto a stream processor, placing emphasis on data-localization and parallelizations. Our simulations show that efficient stream hardware with restricted control abilities can effectively run challenging irregular applications with, for example, a finite element method and a molecular dynamic code sustaining 69GFLOP/s and 46GFLOP/s (64-bit) respectively using a single chip that measures 12mm on a side and consumes less than 70W in 90nm technology.