A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

Authors:
Minas Dasygenis; Erik Brockmeyer; Bart Durinck; Francky Catthoor; Dimitrios Soudris; Antonios Thanailakis
Affiliations:
VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium
Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium
Design Technology for Integrated Information and Communication Systems (DESICS), Inter-University Micro-Electronics Center (IMEC), Heverlee, Belgium
VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
VLSI Design and Testing Center, Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2006

Abstract

Memory latency has always been a major issue in embedded systems that execute memory-intensive applications, and it becomes more pressing as the gap between processor and memory speed continues to grow. Hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in large off-chip memories; however, both types of prefetching have their shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions which, if not placed carefully, may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield a significant improvement in application performance. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism and a customized on-chip memory hierarchy mapping. Unlike previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications demonstrate the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.
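
The abstract does not spell out the mechanism, so the following is only an illustrative sketch of the general idea of overlapping DMA transfers with computation through double buffering. It is not the authors' implementation. The names `dma_start`, `dma_wait`, `sum_with_prefetch`, and the tile size `TILE` are hypothetical; the DMA engine is modelled with a synchronous `memcpy`, whereas a real platform would program a DMA channel and return immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 64 /* hypothetical on-chip buffer size, in elements */

/* Hypothetical stand-in for launching a DMA transfer. Real hardware would
   return at once and copy in the background; memcpy only models the result. */
static void dma_start(int32_t *dst, const int32_t *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}

/* Hypothetical stand-in for waiting on DMA completion. It is a no-op here
   because the modelled transfer above is synchronous. */
static void dma_wait(void) {}

/* Sums an "off-chip" array tile by tile. While tile t is being processed
   from one on-chip buffer, tile t+1 is fetched into the other buffer.
   The static buffers make this sketch non-reentrant. */
int64_t sum_with_prefetch(const int32_t *offchip, size_t n) {
    static int32_t onchip[2][TILE]; /* models two on-chip buffers */
    assert(offchip != NULL || n == 0);

    size_t ntiles = (n + TILE - 1) / TILE;
    if (ntiles == 0) return 0;

    int cur = 0;
    int64_t total = 0;

    /* Prime the pipeline: fetch the first tile before the loop starts. */
    dma_start(onchip[cur], offchip, n < TILE ? n : TILE);

    for (size_t t = 0; t < ntiles; ++t) {
        size_t base = t * TILE;
        size_t len = (n - base < TILE) ? n - base : TILE;

        dma_wait(); /* tile t is now resident in onchip[cur] */

        /* Prefetch the next tile into the other buffer, if there is one. */
        if (t + 1 < ntiles) {
            size_t nbase = base + TILE;
            size_t nlen = (n - nbase < TILE) ? n - nbase : TILE;
            dma_start(onchip[cur ^ 1], offchip + nbase, nlen);
        }

        /* Compute on the current tile while the next transfer is in flight. */
        for (size_t i = 0; i < len; ++i) total += onchip[cur][i];

        cur ^= 1; /* swap buffer roles */
    }
    return total;
}
```

The sketch relies on the access pattern being known in advance, which is the property of multimedia kernels that the abstract highlights: the address of the next tile can be computed before the current tile has been consumed, so the transfer can be issued early.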