Implementing Precise Interrupts in Pipelined Processors
IEEE Transactions on Computers
Estimating interlock and improving balance for pipelined architectures
Journal of Parallel and Distributed Computing
Loop quantization of unwinding done right
Proceedings of the 1st International Conference on Supercomputing
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
Balanced scheduling: instruction scheduling when memory latency is uncertain
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Parallel programming in Split-C
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Tolerating latency through software-controlled data prefetching
Tolerating latency through software-controlled data prefetching
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors
IEEE Transactions on Computers - Special issue on cache memory and related problems
Improving memory hierarchy performance for irregular applications
ICS '99 Proceedings of the 13th international conference on Supercomputing
High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study
IEEE Transactions on Computers
Combining Optimization for Cache and Instruction-Level Parallelism
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Dynamically allocating processor resources between nearby and distant ILP
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
MIST: an algorithm for memory miss traffic management
Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design
International Journal of Parallel Programming
ACM Transactions on Computer Systems (TOCS)
Microarchitecture Optimizations for Exploiting Memory-Level Parallelism
Proceedings of the 31st annual international symposium on Computer architecture
A study of source-level compiler algorithms for automatic construction of pre-execution code
ACM Transactions on Computer Systems (TOCS)
Queue Usage and Memory-Level Parallelism Sensitive Scheduling
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd annual international symposium on Computer Architecture
Cache miss clustering for banked memory systems
Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
An analysis of the effects of miss clustering on the cost of a cache miss
Proceedings of the 4th international conference on Computing frontiers
Proceedings of the 2007 workshop on Experimental computer science
ecs'07 Experimental computer science on Experimental computer science
Memory-level parallelism aware fetch policies for simultaneous multithreading processors
ACM Transactions on Architecture and Code Optimization (TACO)
Studying compiler optimizations on superscalar processors through interval analysis
HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Conservative open-page policy for mixed time-criticality memory controllers
Proceedings of the Conference on Design, Automation and Test in Europe
Hi-index | 0.00 |
Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls.This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show substantial increases in memory parallelism, leading to execution time reductions averaging 23% in a multiprocessor and 30% in a uniprocessor. We see similar benefits on a Convex Exemplar.