Available instruction-level parallelism for superscalar and superpipelined machines
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
IBM RISC System/6000 processor architecture
IBM Journal of Research and Development
Design of the IBM RISC System/6000 floating-point execution unit
IBM Journal of Research and Development
Limits of instruction-level parallelism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Single instruction stream parallelism is greater than two
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
The SPARC architecture manual: version 8
The SPARC architecture manual: version 8
Limits of control flow on parallelism
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Dynamic dependency analysis of ordinary programs
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Interlock collapsing ALU for increased instruction-level parallelism
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
SCISM: a scalable compound instruction set machine
IBM Journal of Research and Development
IBM Power and PowerPC
Zero-cycle loads: microarchitecture support for reducing load latency
Proceedings of the 28th annual international symposium on Microarchitecture
Value locality and load value prediction
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Examination of a memory access classification scheme for pointer-intensive and numeric programs
ICS '96 Proceedings of the 10th international conference on Supercomputing
Computer
IEEE Transactions on Computers
High-Performance 3-1 Interlock Collapsing ALU's
IEEE Transactions on Computers
Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers
Speculative execution via address prediction and data prefetching
ICS '97 Proceedings of the 11th international conference on Supercomputing
The design and performance of a conflict-avoiding cache
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Microarchitecture support for improving the performance of load target prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The predictability of data values
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The potential of data value speculation to boost ILP
ICS '98 Proceedings of the 12th international conference on Supercomputing
Load execution latency reduction
ICS '98 Proceedings of the 12th international conference on Supercomputing
Speculative multithreaded processors
ICS '98 Proceedings of the 12th international conference on Supercomputing
The effect of instruction fetch bandwidth on value prediction
Proceedings of the 25th annual international symposium on Computer architecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Compiler-directed early load-address generation
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Randomized Cache Placement for Eliminating Conflicts
IEEE Transactions on Computers - Special issue on cache memory and related problems
Correlated load-address predictors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cyclic dependence based data reference prediction
ICS '99 Proceedings of the 13th international conference on Supercomputing
Clustered speculative multithreaded processors
ICS '99 Proceedings of the 13th international conference on Supercomputing
Control independence in trace processors
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Value prediction for speculative multithreaded architectures
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Memory Renaming: Fast, Early and Accurate Processing of Memory Communication
International Journal of Parallel Programming
Limits of Data Value Predictability
International Journal of Parallel Programming
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
Proceedings of the 27th annual international symposium on Computer architecture
Speculative Memory Cloaking and Bypassing
International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Reducing wire delay penalty through value prediction
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Designing domain-specific processors
Proceedings of the ninth international symposium on Hardware/software codesign
An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors
International Journal of Parallel Programming
Load Redundancy Removal through Instruction Reuse
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism
IEEE Transactions on Computers
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures
Proceedings of the 18th annual international conference on Supercomputing
Proceedings of the 31st annual international symposium on Computer architecture
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Exploring the design space of LUT-based transparent accelerators
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Speculative parallelization of multipath radiosity algorithm
SPECTS'09 Proceedings of the 12th international conference on Symposium on Performance Evaluation of Computer & Telecommunication Systems
The potential of using dynamic information flow analysis in data value prediction
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction
ACM Transactions on Architecture and Code Optimization (TACO)
Aggressive Value Prediction on a GPU
International Journal of Parallel Programming
Hi-index | 0.01 |
Two hardware methods for remedying the effects of true data dependences are studied. The first method dependence speculation, is used to eliminate address generation-load dependences. This is enabled by address prediction that permits load instructions to proceed speculatively without waiting for their address operands. The second technique, dependence collapsing, is used to eliminate data dependences by combining a dependence among multiple instructions into one instruction. The potential of these techniques for improving processor performance is demonstrated via trace-driven simulation. When both techniques are used with maximum issue widths of 4, 8, 16, and 32, the overall speedups in comparison to a base instruction level parallel machine are 1.20, 1.35, 1.51, and 1.66, respectively. In general, dependence collapsing contributes the majority of the improvement in performance. Under the dependence collapsing model, 298 to 478 of the total number of instructions in a trace may be collapsed. The distance separating the collapsed instructions is nearly always less than 8. Our experimentation also suggests that further performance improvements can be achieved by incorporating mechanisms that increase the address prediction rate.