Cache Performance and Algorithm Optimization

Authors:
Qiao Xiangzhen
Venue:
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Year:
1997


Abstract

This paper proposes a technique to enhance the cache performance of some blocked algorithms. Drawing on results from number theory, we present a principle for array padding so that accesses to array sub-blocks do not generate conflict misses. The technique is applied to LU factorization and matrix multiplication. The principle is tested on a shared-memory multiprocessor. The practical results agree with the theoretical analysis, and performance increases of 20% to 150% are achieved.
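
The abstract does not state the padding principle itself, so the following is only a minimal sketch of the general idea, not the author's method. It assumes a direct-mapped cache and a column-major array, and it finds a padded leading dimension by brute-force search rather than by the number-theoretic rule the paper derives. All names (`block_lines`, `is_conflict_free`, `padded_leading_dim`) and the cache parameters are hypothetical.

```python
def block_lines(ld, b, line_elems):
    """Memory lines touched by a b x b sub-block at the array origin.

    Assumes column-major storage with leading dimension ld (in elements),
    so element (i, j) lives at offset i + j * ld.
    """
    return {(i + j * ld) // line_elems for j in range(b) for i in range(b)}


def is_conflict_free(ld, b, cache_lines, line_elems):
    """True if no two memory lines of the sub-block share a cache line.

    Assumes a direct-mapped cache: memory line m maps to slot m % cache_lines.
    """
    lines = block_lines(ld, b, line_elems)
    slots = {m % cache_lines for m in lines}
    return len(slots) == len(lines)


def padded_leading_dim(n, b, cache_lines, line_elems):
    """Smallest leading dimension >= n whose b x b sub-block is conflict-free.

    Brute-force search over one cache size worth of candidate paddings;
    this stands in for the closed-form rule, which the abstract does not give.
    """
    cache_elems = cache_lines * line_elems
    for ld in range(n, n + cache_elems + 1):
        if is_conflict_free(ld, b, cache_lines, line_elems):
            return ld
    raise ValueError("sub-block does not fit in the cache without conflicts")


if __name__ == "__main__":
    # Hypothetical cache: 256 lines of 4 elements each (1024 elements total).
    # With n = 1024, every column of the block starts on the same cache slot.
    print(is_conflict_free(1024, 16, 256, 4))
    print(padded_leading_dim(1024, 16, 256, 4))
```

In this model, a leading dimension equal to the cache size makes every column of a sub-block collide on the same few cache slots, and a modest amount of padding spreads the columns over distinct slots. A real cache with set associativity or a different block placement would need a correspondingly adjusted model.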