A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

  • Authors:
  • Minas Dasygenis;Erik Brockmeyer;Bart Durinck;Francky Catthoor;Dimitrios Soudris;Antonios Thanailakis

  • Affiliations:
  • VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece;Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium;Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium;Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium;VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece;VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece

  • Venue:
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Year:
  • 2006

Abstract

Memory latency has always been a major issue in embedded systems that execute memory-intensive applications, and the problem worsens as the gap between processor and memory speeds continues to grow. Hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in large off-chip memories; however, both types of prefetching have their shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions, which, if not computed carefully, may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield significant application performance improvement. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism and a customized on-chip memory hierarchy mapping. Compared to previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications demonstrate the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.
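
To illustrate the general idea of overlapping DMA transfers with computation when the access pattern is known in advance, the following is a minimal double-buffering sketch in C. It is not the authors' implementation: `dma_start`, `dma_wait`, `TILE`, and `sum_double_buffered` are hypothetical names, and `memcpy` stands in for a real DMA channel, which would run asynchronously on actual hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64  /* hypothetical on-chip buffer size, in elements */

/* Stand-in for a DMA transfer. On real hardware this would program a
 * DMA channel and return immediately; here memcpy models the copy. */
static void dma_start(int *dst, const int *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
}

/* Stand-in for waiting on DMA completion. It is a no-op here because
 * the modelled copy above is synchronous. */
static void dma_wait(void) { }

/* Sum an off-chip array using two on-chip buffers: while the core
 * processes buf[cur], the next tile is fetched into buf[1 - cur]. */
long sum_double_buffered(const int *off_chip, size_t n)
{
    static int buf[2][TILE];
    long total = 0;
    int cur = 0;

    assert(n % TILE == 0);  /* assumption: size is a multiple of TILE */
    if (n == 0)
        return 0;

    dma_start(buf[cur], off_chip, TILE);      /* prologue: first tile */
    for (size_t base = 0; base < n; base += TILE) {
        dma_wait();                           /* buf[cur] is now ready */
        if (base + TILE < n)                  /* prefetch the next tile */
            dma_start(buf[1 - cur], off_chip + base + TILE, TILE);
        for (size_t i = 0; i < TILE; i++)     /* compute on current tile */
            total += buf[cur][i];
        cur = 1 - cur;                        /* swap buffer roles */
    }
    return total;
}
```

Because the next tile's address is computable before the current tile is consumed, the transfer can be issued early; on a real DMA engine the copy would then proceed while the inner loop runs, which is the latency-hiding effect the abstract describes.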