The impact of multicore on math software

Authors:
Alfredo Buttari;Jack Dongarra;Jakub Kurzak;Julien Langou;Piotr Luszczek;Stanimire Tomov
Affiliations:
Innovative Computing Laboratory, University of Tennessee, Knoxville, TN;Innovative Computing Laboratory, University of Tennessee, Knoxville, TN and Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN;Innovative Computing Laboratory, University of Tennessee, Knoxville, TN;Department of Mathematical Sciences, University of Colorado at Denver and Health Sciences Center, CO;Innovative Computing Laboratory, University of Tennessee, Knoxville, TN;Innovative Computing Laboratory, University of Tennessee, Knoxville, TN
Venue:
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Year:
2006



Abstract

Power consumption and heat dissipation issues are pushing the microprocessor industry towards multicore design patterns. Given the cubic dependence of power consumption on core frequency, multicore technologies leverage the idea that doubling the number of cores and halving the core frequency gives roughly the same performance while reducing power consumption by a factor of four. With the number of cores on multicore chips expected to reach tens in a few years, efficient implementations of numerical libraries using shared memory programming models are of high interest. The current message passing paradigm used in ScaLAPACK and elsewhere introduces unnecessary memory overhead and memory copy operations, which degrade performance, and it makes it harder to schedule operations that could be done in parallel. Limiting the use of shared memory to fork-join parallelism (perhaps with OpenMP) or to its use within the BLAS does not address all these issues.
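
The factor-of-four claim can be checked with a few lines of arithmetic. The sketch below is only an idealized model, not something taken from the paper: it assumes per-core power scales exactly with the cube of the frequency, that throughput scales linearly with both core count and frequency (perfect parallel efficiency), and the function names are hypothetical.

```python
# Idealized model (assumption): power per core ~ f^3, throughput ~ cores * f.
# All values are relative to a single core running at the baseline frequency.

def relative_power(cores: int, freq_scale: float) -> float:
    """Total power relative to one core at the baseline frequency."""
    return cores * freq_scale ** 3


def relative_throughput(cores: int, freq_scale: float) -> float:
    """Aggregate throughput relative to one core at the baseline frequency,
    assuming the workload parallelizes perfectly across cores."""
    return cores * freq_scale


# Two cores at half the frequency: same throughput, one quarter of the power.
power = relative_power(cores=2, freq_scale=0.5)            # 2 * (1/2)^3 = 0.25
throughput = relative_throughput(cores=2, freq_scale=0.5)  # 2 * (1/2)   = 1.0
print(f"power x{power}, throughput x{throughput}")
```

In practice the throughput term holds only if the software keeps both cores busy, which is the motivation for the shared memory library designs the abstract argues for.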