ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Examination of a memory access classification scheme for pointer-intensive and numeric programs
ICS '96 Proceedings of the 10th international conference on Supercomputing
Memory-system design considerations for dynamically-scheduled processors
Proceedings of the 24th annual international symposium on Computer architecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Dependence based prefetching for linked data structures
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Simultaneous subordinate microthreading (SSMT)
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Improving virtual function call target prediction via dependence-based pre-computation
ICS '99 Proceedings of the 13th international conference on Supercomputing
Understanding the backward slices of performance degrading instructions
Proceedings of the 27th annual international symposium on Computer architecture
Proceedings of the 27th annual international symposium on Computer architecture
Execution-based prediction using speculative slices
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Decoupled access/execute computer architectures
ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
The Alpha 21264 Microprocessor Architecture
ICCD '98 Proceedings of the International Conference on Computer Design
Speculative Data-Driven Multithreading
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Memory dependence prediction
Post-pass binary adaptation for software-based speculative precomputation
PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Dynamic speculative precomputation
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Decoupled Predictor-Directed Stream Prefetching Architecture
IEEE Transactions on Computers
A Programmable Memory Hierarchy for Prefetching Linked Data Structures
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Pointer cache assisted prefetching
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Microarchitectural support for precomputation microthreads
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A quantitative framework for automated pre-execution thread selection
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A framework for modeling and optimization of prescient instruction prefetch
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A Simple Mechanism for Detecting Ineffectual Instructions in Slipstream Processors
IEEE Transactions on Computers
Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
ACM Transactions on Computer Systems (TOCS)
Data forwarding through in-memory precomputation threads
Proceedings of the 18th annual international conference on Supercomputing
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
A study of source-level compiler algorithms for automatic construction of pre-execution code
ACM Transactions on Computer Systems (TOCS)
Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection
Proceedings of the 32nd annual international symposium on Computer Architecture
Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework
Proceedings of the International Symposium on Code Generation and Optimization
Efficient emulation of hardware prefetchers via event-driven helper threading
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Speculative pre-execution assisted by compiler (SPEAR)
Journal of Parallel and Distributed Computing - Special issue on parallel bioinspired algorithms
Block-aware instruction set architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Design and evaluation of a hierarchical decoupled architecture
The Journal of Supercomputing
SlicK: slice-based locality exploitation for efficient redundant multithreading
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads
ACM Transactions on Architecture and Code Optimization (TACO)
Accelerating sequential programs on Chip Multiprocessors via Dynamic Prefetching Thread
Microprocessors & Microsystems
Ubiquitous memory introspection
Proceedings of the International Symposium on Code Generation and Optimization
Optimization of data prefetch helper threads with path-expression based statistical modeling
Proceedings of the 21st annual international conference on Supercomputing
A low-complexity microprocessor design with speculative pre-execution
Journal of Systems Architecture: the EUROMICRO Journal
A performance-correctness explicitly-decoupled architecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
OUTRIDER: efficient memory latency tolerance with decoupled strands
Proceedings of the 38th annual international symposium on Computer architecture
Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
A hybrid hardware/software generated prefetching thread mechanism on chip multiprocessors
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Hi-index | 0.01 |
We describe the Slice Processor micro-architecture that implements a generalized operation-based prefetching mechanism. Operation-based prefetchers predict the series of operations, or the computation slice that can be used to calculate forthcoming memory references. This is in contrast to outcome-based predictors that exploit regularities in the (address) outcome stream. Slice processors are a generalization of existing operation-based prefetching mechanisms such as stream buffers where the operation itself is fixed in the design (e.g., address + stride). A slice processor dynamically identifies frequently missing loads and extracts on-the-fly the relevant address computation slices. Such slices are then executed in-parallel with the main sequential thread prefetching memory data. We describe the various support structures and emphasize the design of the slice detection mechanism. We demonstrate that a relatively simple organization can significantly improve performance over an aggressive, dynamically-scheduled processor and for a set of pointer-intensive programs and for some integer applications from the SPEC'95 suite. In particular, a slice processor that can detect slices of up to 8 instructions extracted over of a region of up to 32 instructions improves performance by 11% on the average (even if slice detection requires up to 32 cycles). Allowing slices of up to 16 instructions results in an average performance improvement of 15%. Finally, we study how our operation-based predictor interacts with an outcome-based one and find them mutually beneficial.