The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Limitations of cache prefetching on a bus-based multiprocessor
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
ScaLAPACK user's guide
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
ACM Computing Surveys (CSUR)
Implementation in ScaLAPACK of Divide-and-Conquer Algorithms forBanded and Tridiagonal Linear Systems
Automatic tiling of iterative stencil loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
A parallel hybrid banded system solver: the SPIKE algorithm
Parallel Computing - Parallel matrix algorithms and applications (PMAA'04)
CCGRID '07 Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid
Parallelization and Performance Analysis of Video Feature Extractions on Multi-Core Based Systems
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Improving parallelism and locality with asynchronous algorithms
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Overview of Multicore Requirements towards Real-Time Communication
SEUS '09 Proceedings of the 7th IFIP WG 10.2 International Workshop on Software Technologies for Embedded and Ubiquitous Systems
A compiler-automated array compression scheme for optimizing memory intensive programs
Proceedings of the 24th ACM International Conference on Supercomputing
Tackling cache-line stealing effects using run-time adaptation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Performance models for the Spike banded linear system solver
Scientific Programming
Capacity metric for chip heterogeneous multiprocessors
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Can manycores support the memory requirements of scientific applications?
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Hi-index | 0.00 |
As the shared memory bus becomes a major performance bottleneck for many numerical applications on multicore chips, understanding how the increased parallelism on chip strains the memory bandwidth and hence affects the efficiency of parallel codes becomes a critical issue. This paper introduces the notion of memory access intensity to facilitate quantitative analysis of program's memory behavior on multicores which employ state-of-the-art prefetching hardware. Three numerical solvers for large scale sparse linear systems are used to demonstrate the estimation of memory access intensity and its effect on program performance.