The rice parallel processing testbed
SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Performance-Measurement Tools in a Multiprocessor Environment
IEEE Transactions on Computers
Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
New CPU benchmark suites from SPEC
COMPCON '92 Proceedings of the thirty-seventh international conference on COMPCON
Active messages: a mechanism for integrated communication and computation
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
SPARC architecture, assembly language programming, and C
SPARC architecture, assembly language programming, and C
Architectural support for performance tuning: a case study on the SPARCcenter 2000
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Complexity/performance tradeoffs with non-blocking loads
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Scheduling and page migration for multiprocessor compute servers
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Hardware and software support for efficient exception handling
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Avoiding conflict misses dynamically in large direct-mapped caches
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Interleaving: a multithreading technique targeting multiprocessors and workstations
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
The effectiveness of multiple hardware contexts
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor
Digital Technical Journal - Special 10th anniversary issue
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ICS '90 Proceedings of the 4th international conference on Supercomputing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Mtool: An Integrated System for Performance Debugging Shared Memory Multiprocessor Applications
IEEE Transactions on Parallel and Distributed Systems
Informing Loads: Enabling Software to Observe and React to Memory Behavior
Informing Loads: Enabling Software to Observe and React to Memory Behavior
Software methods for improvement of cache performance on supercomputer applications
Software methods for improvement of cache performance on supercomputer applications
Integrating performance monitoring and communication in parallel computers
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Wrong-path instruction prefetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Shared-memory performance profiling
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
ProfileMe: hardware support for instruction-level profiling on out-of-order processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Predicting data cache misses in non-numeric applications through correlation profiling
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS)
Tolerating late memory traps in ILP processors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Simultaneous subordinate microthreading (SSMT)
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Concurrent Event Handling through Multithreading
IEEE Transactions on Computers
Performance experiences on Sun's Wildfire prototype
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
A fully associative software-managed cache design
Proceedings of the 27th annual international symposium on Computer architecture
Performance analysis using the MIPS R10000 performance counters
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Hardware Versus Software Implementation of COMA
ICPP '97 Proceedings of the international Conference on Parallel Processing
Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps
ICPP '97 Proceedings of the international Conference on Parallel Processing
METRIC: tracking down inefficiencies in the memory hierarchy via binary rewriting
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Fighting the memory wall with assisted execution
Proceedings of the 1st conference on Computing frontiers
Tolerating Late Memory Traps in Dynamically Scheduled Processors
IEEE Transactions on Computers
Compiler orchestrated prefetching via speculation and predication
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
TMA: a trap-based memory architecture
Proceedings of the 20th annual international conference on Supercomputing
METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies
ACM Transactions on Programming Languages and Systems (TOPLAS)
A case for low-complexity MP architectures
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A novel migration-based NUCA design for chip multiprocessors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Hi-index | 0.01 |
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, we propose a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. We describe two different implementations of informing memory operations---one based on a cache-outcome condition code and another based on low-overhead traps---and find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.