Power consumption and heat dissipation issues are pushing the microprocessor industry towards multicore design patterns. Given the cubic dependence between core frequency and power consumption, multicore technologies leverage the idea that doubling the number of cores and halving the core frequency gives roughly the same performance while reducing power consumption by a factor of four. With the number of cores on multicore chips expected to reach tens in a few years, efficient implementations of numerical libraries using shared memory programming models are of high interest. The current message passing paradigm used in ScaLAPACK and elsewhere introduces unnecessary memory overhead and memory copy operations, which degrade performance, and makes it harder to schedule operations that could be done in parallel. Limiting the use of shared memory to fork-join parallelism (perhaps with OpenMP) or to its use within the BLAS does not address all these issues.
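The factor-of-four claim can be checked with a short worked calculation. This is only a sketch under simplifying assumptions that the text does not state: per-core power is taken to be exactly proportional to the cube of the frequency, and performance is taken to scale perfectly with the product of core count and frequency. The symbols $P$, $\mathrm{Perf}$, $n$, $f$ and the constant $c$ are introduced here for illustration.

```latex
% Assumed model: n cores, each running at frequency f, c a constant.
\begin{align*}
  P(n, f) &= n \, c \, f^{3}, &
  \mathrm{Perf}(n, f) &\propto n \, f. \\[4pt]
  % Doubling the cores and halving the frequency:
  P\!\left(2, \tfrac{f}{2}\right)
    &= 2 \, c \left(\tfrac{f}{2}\right)^{3}
     = \tfrac{1}{4} \, c \, f^{3}
     = \tfrac{1}{4} \, P(1, f), &
  \mathrm{Perf}\!\left(2, \tfrac{f}{2}\right)
    &\propto 2 \cdot \tfrac{f}{2} = f .
\end{align*}
```

Under this model the two-core configuration matches the single-core throughput at a quarter of the power, but only if the workload can actually keep both cores busy, which is why efficient parallel scheduling matters.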