Data prefetching for high-performance processors
Data prefetching for high-performance processors
Tolerating latency in multiprocessors through compiler-inserted prefetching
ACM Transactions on Computer Systems (TOCS)
Scheduling of uniform multidimensional systems under resource constraints
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Loop Scheduling and Partitions for Hiding Memory Latencies
Proceedings of the 12th international symposium on System synthesis
Optimal partitioning and balanced scheduling with the maximal overlap of data footprints
GLSVLSI '01 Proceedings of the 11th Great Lakes symposium on VLSI
Scheduling and partitioning for multiple loop nests
Proceedings of the 14th international symposium on Systems synthesis
Combined partitioning and data padding for scheduling multiple loop nests
CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Loop scheduling and bank type assignment for heterogeneous multi-bank memory
Journal of Parallel and Distributed Computing
Improving the memory bandwidth utilization using loop transformations
PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Hi-index | 0.00 |
The large latency of memory accesses in modern computers is a key obstacle in achieving high processor utilization. To hide this latency, this paper proposes a new memory management technique that can be applied to computer architectures with three levels of memory. The technique takes advantage of access pattern information that is available at compile time by prefetching certain data elements from the higher level memory. It as well maintains certain data for a period of time to prevent unnecessary data swapping. Data locality is much improved compared with the usual pattern by partitioning the iteration space and reducing execution in each partition. These combined approaches lead to improvements in average execution times of approximately 35% over the one-level partition algorithm and more than 80% over list scheduling and hardware prefetching.