Cache Refill/Access Decoupling for Vector Machines
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization (TACO)
Temporal Streaming of Shared Memory
Proceedings of the 32nd annual international symposium on Computer Architecture
Techniques for Efficient Processing in Runahead Execution Engines
Proceedings of the 32nd annual international symposium on Computer Architecture
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Improving single-thread performance with fine-grain state maintenance
Proceedings of the 5th conference on Computing frontiers
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Memory management thread for heap allocation intensive sequential applications
Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
Forwardflow: a scalable core for power-constrained CMPs
Proceedings of the 37th annual international symposium on Computer architecture
Scalable memory registration for high performance networks using helper threads
Proceedings of the 8th ACM International Conference on Computing Frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
An instruction window that can tolerate latencies to DRAM memory is prohibitively complex and power hungry. To avoid having to build such large windows, runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes. This technique incurs only a small hardware cost and does not significantly increase the processor's complexity.