Register requirements of pipelined processors
ICS '92 Proceedings of the 6th international conference on Supercomputing
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Integration, the VLSI Journal
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimum register requirements for a modulo schedule
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Data relocation and prefetching for programs with large data sets
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Data prefetching for high-performance processors
Data prefetching for high-performance processors
Resource-constrained loop list scheduler for DSP algorithms
Journal of VLSI Signal Processing Systems - Special issue on VLSI design methodologies for digital signal processing systems
Cache miss heuristics and preloading techniques for general-purpose programs
Proceedings of the 28th annual international symposium on Microarchitecture
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Achieving Full Parallelism Using Multidimensional Retiming
IEEE Transactions on Parallel and Distributed Systems
Data prefetching for software DSMs
ICS '98 Proceedings of the 12th international conference on Supercomputing
Modeled and Measured Instruction Fetching Performance for Superscalar Microprocessors
IEEE Transactions on Parallel and Distributed Systems
Scheduling of uniform multidimensional systems under resource constraints
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
An adaptive sequential prefetching scheme in shared-memory multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Optimal Software Pipelining of Nested Loops
Proceedings of the 8th International Symposium on Parallel Processing
Optimal code size reduction for software-pipelined and unfolded loops
Proceedings of the 15th international symposium on System Synthesis
Code size reduction technique and implementation for software-pipelined DSP applications
ACM Transactions on Embedded Computing Systems (TECS)
Partitioning and scheduling DSP applications with maximal memory access hiding
EURASIP Journal on Applied Signal Processing
Loop scheduling and bank type assignment for heterogeneous multi-bank memory
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Over the last 20 years, the performance gap between CPU and memory has been steadily increasing. As a result, a variety of techniques has been devised to hide that performance gap, from intermediate fast memories (caches) to various prefetching and memory management techniques for manipulating the data present in these caches. In this paper we propose a new memory management technique that takes advantage of access pattern information that is available at compile time by prefetching certain data elements before explicitly being requested by the CPU, as well as maintaining certain data in the local memory over a number of iterations. In order to better take advantage of the locality of reference present in loop structures, our technique also uses a new approach to memory by partitioning it and reducing execution to each partition, so that information is reused at much smaller time intervals than if execution followed the usual pattern. These combined approaches—using a new set of memory instructions as well as partitioning the memory—lead to improvements in total execution time of approximately 25% over existing methods.