Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching

Authors:
Zhong Wang;Timothy W. O'Neil;Edwin H.-M. Sha
Affiliations:
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA;Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA;Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083
Venue:
Journal of VLSI Signal Processing Systems
Year:
2001

Abstract

Over the last 20 years, the performance gap between CPU and memory has steadily widened. As a result, a variety of techniques have been devised to hide that gap, from intermediate fast memories (caches) to various prefetching and memory management techniques for manipulating the data held in these caches. In this paper we propose a new memory management technique that exploits access pattern information available at compile time: it prefetches certain data elements before the CPU explicitly requests them, and it keeps certain data in local memory across a number of iterations. To better exploit the locality of reference present in loop structures, our technique also partitions the memory and restricts execution to one partition at a time, so that information is reused at much shorter time intervals than under the usual execution order. Together, these two approaches (a new set of memory instructions and memory partitioning) improve total execution time by approximately 25% over existing methods.
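
The effect of partitioning on reuse intervals can be illustrated with a small sketch. This is not the paper's algorithm. It assumes a hypothetical two-dimensional loop in which iteration (i, j) consumes the value produced by iteration (i-1, j), and it assumes the partitioning is applied as vertical strips of the iteration space, a detail the abstract does not spell out. The function names (row_major_order, partitioned_order, max_reuse_interval) are illustrative only, and prefetching itself is not modeled.

```python
def row_major_order(rows, cols):
    """Usual execution pattern: finish a whole row before starting the next."""
    return [(i, j) for i in range(rows) for j in range(cols)]


def partitioned_order(rows, cols, width):
    """Hypothetical partitioned pattern: execute one strip of `width` columns
    over all rows, then move on to the next strip."""
    order = []
    for start in range(0, cols, width):
        for i in range(rows):
            for j in range(start, min(start + width, cols)):
                order.append((i, j))
    return order


def max_reuse_interval(order):
    """Largest number of iterations between producing a value at (i-1, j)
    and consuming it at (i, j), under the assumed dependence."""
    position = {cell: t for t, cell in enumerate(order)}
    worst = 0
    for (i, j), t in position.items():
        producer = (i - 1, j)
        if producer in position:
            worst = max(worst, t - position[producer])
    return worst


if __name__ == "__main__":
    rows, cols, width = 4, 16, 4
    print(max_reuse_interval(row_major_order(rows, cols)))
    print(max_reuse_interval(partitioned_order(rows, cols, width)))
```

Under these assumptions, the reuse interval in the usual order equals the row length, while in the partitioned order it equals the strip width, so a value needs to stay in local memory for fewer iterations before it is reused.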