The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time, it presents new challenges for the development of numerical algorithms. One is effective exploitation of the differential between the speed of single and double precision arithmetic; the other is efficient parallelization across the short-vector SIMD cores. The first challenge is addressed by applying the well-known technique of iterative refinement to the solution of a dense symmetric positive definite system of linear equations. The result is a mixed-precision algorithm that delivers double precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge through successful thread-level parallelization, exploiting fine-grained task granularity and lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of peak floating point performance, while the implementation of the memory intensive sections reaches within 90 percent of peak memory bandwidth. On a single CELL processor, the algorithm achieves over 170 Gflop/s when solving a symmetric positive definite system of linear equations in single precision and over 150 Gflop/s when delivering the result in double precision accuracy.
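The mixed-precision idea in the abstract can be illustrated with a minimal sketch. This is not the paper's CELL implementation: it is plain Python without SIMD or threading, and it emulates single precision by rounding every arithmetic result to binary32 with the standard `struct` module. The function names (`f32`, `cholesky_f32`, `solve_f32`, `residual`, `refine`) and the parameters `max_iter` and `tol` are hypothetical, chosen only for this illustration. The Cholesky factorization and the triangular solves run in emulated single precision, while the residual and the solution update stay in double precision.

```python
import math
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_f32(a):
    """Lower Cholesky factor of an SPD matrix, each operation rounded to single."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(a[j][j])
        for k in range(j):
            s = f32(s - f32(l[j][k] * l[j][k]))
        l[j][j] = f32(math.sqrt(s))
        for i in range(j + 1, n):
            s = f32(a[i][j])
            for k in range(j):
                s = f32(s - f32(l[i][k] * l[j][k]))
            l[i][j] = f32(s / l[j][j])
    return l


def solve_f32(l, b):
    """Solve L L^T x = b by forward and back substitution in emulated single."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        s = f32(b[i])
        for k in range(i):
            s = f32(s - f32(l[i][k] * y[k]))
        y[i] = f32(s / l[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(l[k][i] * x[k]))  # l[k][i] is entry (i, k) of L^T
        x[i] = f32(s / l[i][i])
    return x


def residual(a, x, b):
    """Compute r = b - A x in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]


def refine(a, b, max_iter=10, tol=1e-14):
    """Iterative refinement: single-precision factorization, double-precision result."""
    l = cholesky_f32(a)        # O(n^3) work, done once in single precision
    x = solve_f32(l, b)        # initial single-precision solution
    for _ in range(max_iter):
        r = residual(a, x, b)  # O(n^2) work in double precision
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = solve_f32(l, r)    # correction, reusing the single-precision factor
        x = [xi + di for xi, di in zip(x, d)]  # update accumulated in double
    return x
```

Calling `refine(a, b)` on a small, well-conditioned symmetric positive definite system should return a solution close to double precision accuracy, whereas `solve_f32(cholesky_f32(a), b)` alone is limited to roughly single precision. The design point is that the cubic-cost factorization is done once in the faster precision, and each refinement step adds only quadratic-cost work.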