PATMOS'07 Proceedings of the 17th international conference on Integrated Circuit and System Design: power and timing modeling, optimization and simulation
Memory latency has always been a major issue in embedded systems that execute memory-intensive applications, and it becomes more pressing as the gap between processor and memory speed continues to grow. Hardware and software prefetching have both been shown to be effective in tolerating the large latencies inherent in large off-chip memories; however, each has its shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes insert prefetch instructions which, if not placed carefully, may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield a significant improvement in application performance. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism and a customized on-chip memory hierarchy mapping. Compared to previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications demonstrate the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.
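As a rough illustration of why overlapping DMA transfers with computation hides latency when the access pattern is known in advance, the sketch below models generic double buffering. It is not the estimation method of the paper: the function names, the per-block cycle counts, and the simple cost model are all assumptions made for illustration, and the slice copies merely stand in for DMA transfers into on-chip buffers.

```python
def cycles_without_prefetch(n_blocks, transfer, compute):
    """Hypothetical model: each block is fetched, then processed.
    The processor stalls for the whole transfer of every block."""
    return n_blocks * (transfer + compute)


def cycles_with_double_buffering(n_blocks, transfer, compute):
    """Hypothetical model: the transfer of block i+1 overlaps with the
    computation on block i, so each overlapped step costs the longer of the two."""
    if n_blocks == 0:
        return 0
    # The first transfer cannot be hidden, and the last computation
    # has no following transfer to overlap with.
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


def process_double_buffered(data, block_size, kernel):
    """Functional sketch: two buffers alternate between being filled and
    being processed. Slice copies stand in for DMA transfers (assumption)."""
    n_blocks = (len(data) + block_size - 1) // block_size
    results = []
    if n_blocks == 0:
        return results
    buffers = [None, None]
    buffers[0] = data[0:block_size]  # initial fill of the first buffer
    for i in range(n_blocks):
        cur = i % 2
        nxt = 1 - cur
        if i + 1 < n_blocks:
            # On real hardware this would start an asynchronous DMA transfer
            # that proceeds while the kernel below is running.
            start = (i + 1) * block_size
            buffers[nxt] = data[start:start + block_size]
        results.append(kernel(buffers[cur]))
    return results
```

Under this model, four blocks with 100 transfer cycles and 150 compute cycles each take 1000 cycles without overlap and 700 with it. When the transfer is the longer of the two, the total is bounded by transfer time instead, which is one reason the mapping of data to on-chip memory matters.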