Because numerical codes exhibit significant spatial locality, the potential for performance improvement is substantial. However, large cache lines cannot be used in current on-chip data caches because of the severe cache pollution they induce. In this paper, we propose a hardware design, called the Virtual Line Scheme, that fetches data from memory in large virtual cache lines to better exploit spatial locality, while the actual physical cache line is smaller than currently used cache lines to better exploit temporal locality. Simulations show that a 17% to 64% reduction of the average memory access time can be obtained for a 20-cycle memory latency. We also show how simple software hints can significantly reduce memory traffic, a drawback associated with the use of large cache lines.
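The core idea — small physical lines for storage, a larger virtual granularity for fetches — can be illustrated with a toy simulation. This is a minimal sketch under assumed parameters (direct-mapped cache, 16-byte physical lines, 64-byte virtual lines, a sequential access stream), not the paper's actual hardware design or its simulation methodology; the `Cache` class and all sizes are hypothetical.

```python
# Toy model (hypothetical, for illustration only): a direct-mapped cache
# that stores small physical lines, but on a miss fetches every physical
# line inside the enclosing aligned "virtual line".

class Cache:
    def __init__(self, phys_line, n_sets, virt_line=None):
        self.phys_line = phys_line               # physical line size (bytes)
        self.virt_line = virt_line or phys_line  # fetch granularity (bytes)
        self.n_sets = n_sets
        self.lines = {}                          # set index -> tag
        self.misses = 0

    def _insert(self, addr):
        line = addr // self.phys_line
        self.lines[line % self.n_sets] = line // self.n_sets

    def access(self, addr):
        line = addr // self.phys_line
        idx, tag = line % self.n_sets, line // self.n_sets
        if self.lines.get(idx) == tag:
            return  # hit
        self.misses += 1
        # On a miss, fetch the whole aligned virtual line, one
        # physical line at a time.
        base = (addr // self.virt_line) * self.virt_line
        for a in range(base, base + self.virt_line, self.phys_line):
            self._insert(a)

# Sequential byte sweep: strong spatial locality.
small = Cache(phys_line=16, n_sets=256)
virt = Cache(phys_line=16, n_sets=256, virt_line=64)
for addr in range(4096):
    small.access(addr)
    virt.access(addr)
print(small.misses, virt.misses)  # 256 64
```

On this access pattern the virtual-line fetch cuts misses fourfold (one miss per 64-byte virtual line instead of one per 16-byte physical line), while eviction and replacement still operate at the small physical-line granularity — the property the scheme relies on to limit pollution. The same sketch also hints at the traffic drawback the abstract mentions: for a stream with poor spatial locality, the extra physical lines fetched per miss would be wasted bandwidth.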