Exceeding the dataflow limit via value prediction
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Dynamic speculation and synchronization of data dependences
Proceedings of the 24th annual international symposium on Computer architecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Advanced performance features of the 64-bit PA-8000
COMPCON '95 Proceedings of the 40th IEEE Computer Society International Conference
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor
COMPCON '97 Proceedings of the 42nd IEEE International Computer Conference
A study of branch prediction strategies
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Predictive techniques for aggressive load speculation
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Speculation techniques for improving load related instruction scheduling
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Improving the performance of speculatively parallel applications on the Hydra CMP
ICS '99 Proceedings of the 13th international conference on Supercomputing
Cyclic dependence based data reference prediction
ICS '99 Proceedings of the 13th international conference on Supercomputing
Clustered speculative multithreaded processors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Access region locality for high-bandwidth processor memory system design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Dynamic memory disambiguation in the presence of out-of-order store issuing
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Read-after-read memory dependence prediction
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Understanding the backward slices of performance degrading instructions
Proceedings of the 27th annual international symposium on Computer architecture
Early load address resolution via register tracking
Proceedings of the 27th annual international symposium on Computer architecture
Speculative Memory Cloaking and Bypassing
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Predictor-directed stream buffers
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Register integration: a simple and efficient implementation of squash reuse
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Load and store reuse using register file contents
ICS '01 Proceedings of the 15th international conference on Supercomputing
Power and energy reduction via pipeline balancing
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
A High-Bandwidth Memory Pipeline for Wide Issue Processors
IEEE Transactions on Computers
Reducing Memory Latency via Read-after-Read Memory Dependence Prediction
IEEE Transactions on Computers
Bloom filtering cache misses for accurate data speculation and prefetching
ICS '02 Proceedings of the 16th international conference on Supercomputing
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A scalable instruction queue design using dependence chains
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The predictability of load address
ACM SIGARCH Computer Architecture News
Direct load: dependence-linked dataflow resolution of load address and cache coordinate
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Control-Flow Speculation through Value Prediction
IEEE Transactions on Computers
A survey of processors with explicit multithreading
ACM Computing Surveys (CSUR)
A Decoupled Predictor-Directed Stream Prefetching Architecture
IEEE Transactions on Computers
Amir Roth: Speculative Multithreaded Processors
HiPC '00 Proceedings of the 7th International Conference on High Performance Computing
Cost Effective Memory Dependence Prediction using Speculation Levels and Color Sets
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
High Performance and Energy Efficient Serial Prefetch Architecture
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Decoupling Recovery Mechanism for Data Speculation from Dynamic Instruction Scheduling Structure
Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Applying Machine Learning for Ensemble Branch Predictors
IEA/AIE '02 Proceedings of the 15th international conference on Industrial and engineering applications of artificial intelligence and expert systems: developments in applied artificial intelligence
Improving the Performance of Heterogeneous DSMs via Multithreading
VECPAR '00 Selected Papers and Invited Talks from the 4th International Conference on Vector and Parallel Processing
Microprocessors - 10 Years Back, 10 Years Ahead
Informatics - 10 Years Back. 10 Years Ahead.
Exploiting data-width locality to increase superscalar execution bandwidth
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Power-aware issue queue design for speculative instructions
Proceedings of the 40th annual Design Automation Conference
Phi-Predication for light-weight if-conversion
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Using thread-level speculation to simplify manual parallelization
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Cyclone: a broadcast-free dynamic instruction scheduler with selective replay
Proceedings of the 30th annual international symposium on Computer architecture
Proceedings of the 30th annual international symposium on Computer architecture
An Experimental Study of Polylogarithmic, Fully Dynamic, Connectivity Algorithms
Journal of Experimental Algorithmics (JEA)
Scalable Hardware Memory Disambiguation for High ILP Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Design Complexity of the Load/Store Queue
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures
Proceedings of the 18th annual international conference on Supercomputing
Memory Ordering: A Value-Based Approach
Proceedings of the 31st annual international symposium on Computer architecture
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Scalable selective re-execution for EDGE architectures
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
RIFLE: An Architectural Framework for User-Centric Information-Flow Security
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Using a serial cache for energy efficient instruction fetching
Journal of Systems Architecture: the EUROMICRO Journal
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
Exposing speculative thread parallelism in SPEC2000
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
RENO: A Rename-Based Instruction Optimizer
Proceedings of the 32nd annual international symposium on Computer Architecture
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
Proceedings of the 32nd annual international symposium on Computer Architecture
Store Buffer Design in First-Level Multibanked Data Caches
Proceedings of the 32nd annual international symposium on Computer Architecture
Fast branch misprediction recovery in out-of-order superscalar processors
Proceedings of the 19th annual international conference on Supercomputing
Instruction Based Memory Distance Analysis and its Application
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Incremental Commit Groups for Non-Atomic Trace Processing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Scalable Store-Load Forwarding via Store Queue Index Prediction
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Address-Indexed Memory Disambiguation and Store-to-Load Forwarding
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiling for EDGE Architectures
Proceedings of the International Symposium on Code Generation and Optimization
Program Counter-Based Prediction Techniques for Dynamic Power Management
IEEE Transactions on Computers
Decomposing the load-store queue by function for power reduction and scalability
IBM Journal of Research and Development
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Selective writeback: exploiting transient values for energy-efficiency and performance
Proceedings of the 2006 international symposium on Low power electronics and design
Feedback-directed memory disambiguation through store distance analysis
Proceedings of the 20th annual international conference on Supercomputing
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoSQ: Store-Load Communication without a Store Queue
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A comparison of two policies for issuing instructions speculatively
Journal of Systems Architecture: the EUROMICRO Journal
ACM Transactions on Computer Systems (TOCS)
Core fusion: accommodating software diversity in chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Transparent control independence (TCI)
Proceedings of the 34th annual international symposium on Computer architecture
Program-counter-based pattern classification in buffer caching
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Transient fault prediction based on anomalies in processor events
Proceedings of the conference on Design, automation and test in Europe
Working with process variation aware caches
Proceedings of the conference on Design, automation and test in Europe
IEEE Transactions on Computers
TaP: table-based prefetching for storage caches
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Improving single-thread performance with fine-grain state maintenance
Proceedings of the 5th conference on Computing frontiers
Counting Dependence Predictors
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Fetch-Criticality Reduction through Control Independence
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
A Two-Level Load/Store Queue Based on Execution Locality
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Selective writeback: reducing register file pressure and energy consumption
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Using age registers for a simple load-store queue filtering
Journal of Systems Architecture: the EUROMICRO Journal
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Accurate Instruction Pre-scheduling in Dynamically Scheduled Processors
Transactions on High-Performance Embedded Architectures and Compilers II
Proceedings of the eighteenth international symposium on Software testing and analysis
Design and optimization of the store vectors memory dependence predictor
ACM Transactions on Architecture and Code Optimization (TACO)
Reusing cached schedules in an out-of-order processor with in-order issue logic
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
The potential of using dynamic information flow analysis in data value prediction
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Get the parallelism out of my cloud
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
SAMIE-LSQ: set-associative multiple-instruction entry load/store queue
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
CRIB: consolidated rename, issue, and bypass
Proceedings of the 38th annual international symposium on Computer architecture
SYRANT: SYmmetric resource allocation on not-taken and taken paths
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction
ACM Transactions on Architecture and Code Optimization (TACO)
Predicting timing violations through instruction-level path sensitization analysis
Proceedings of the 49th Annual Design Automation Conference
Distributed replay protocol for distributed uniprocessors
Proceedings of the 26th ACM international conference on Supercomputing
Disjoint out-of-order execution processor
ACM Transactions on Architecture and Code Optimization (TACO)
Tuning the continual flow pipeline architecture
Proceedings of the 27th international ACM conference on International conference on supercomputing
Virtual register renaming: energy efficient substrate for continual flow pipelines
Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI
Tuning the continual flow pipeline architecture with virtual register renaming
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.02 |
For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.