Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions

Authors:
Garo Bournoutian;Alex Orailoglu
Affiliations:
University of California, San Diego, La Jolla, CA, USA;University of California, San Diego, La Jolla, CA, USA
Venue:
CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Year:
2009

Citing 19
Cited 1

Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Implementation of precise interrupts in pipelined processors

ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
The memory gap and the future of high performance memories

ACM SIGARCH Computer Architecture News
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Energy-effective issue logic

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Increasing processor performance by implementing deeper pipelines

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
SimpleScalar: An Infrastructure for Computer System Modeling

Computer
Out-of-Order Execution may not be Cost-Effective on Processors Featuring Simultaneous Multithreading

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Cheap Out-of-Order Execution Using Delayed Issue

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Miss reduction in embedded processors through dynamic, power-friendly cache design

Proceedings of the 45th annual Design Automation Conference
Hiding cache miss penalty using priority-based execution for embedded processors

Proceedings of the conference on Design, automation and test in Europe

Minimizing accumulative memory load cost on multi-core DSPs with multi-level memory

Journal of Systems Architecture: the EUROMICRO Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Today, embedded processors are expected to be able to run complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. As such, the impact of memory latencies on the execution time increasingly becomes evident. All the while, it is also expected that embedded processors be power-conscientious as well as of minimal area impact. As a result, traditional methods for addressing performance and memory latencies, such as multiple issue, out-of-order execution and large, associative caches, are not aptly suited for the embedded domain due to the significant area and power overhead. This paper explores a novel approach to mitigating execution delays caused by memory latencies that would otherwise not be possible in a regular in-order, single-issue embedded processor without large, power-hungry constructs like a Reorder Buffer (ROB). The concept relies on both compile-time and run-time information to safely allow non-data-dependent instructions to continue executing while a memory stall has occurred. The simulation results show significant improvement in execution throughput of approximately 11%, while having a minimal impact on area overhead and power.