ACM Computing Surveys (CSUR)
Computer Structures: Principles and Examples
Computer Structures: Principles and Examples
Architecture of a VLSI instruction cache for a RISC
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Using cache memory to reduce processor-memory traffic
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
RISC I: A Reduced Instruction Set VLSI Computer
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Cache memories for PDP-11 family computers
ISCA '76 Proceedings of the 3rd annual symposium on Computer architecture
S-1 architecture manual
ATUM: a new technique for capturing address traces using microcode
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
A class of compatible cache consistency protocols and their support by the IEEE futurebus
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Multiprocessor cache synchronization: issues, innovations, evolution
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Line (block) size choice for CPU cache memories
IEEE Transactions on Computers
An architectural perspective on a memory access controller
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Multiprocessor cache design considerations
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Performance evaluation of on-chip register and cache organizations
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Efficient (stack) algorithms for analysis of write-back and sector memories
ACM Transactions on Computer Systems (TOCS)
ACM Transactions on Computer Systems (TOCS)
The effects of processor architecture on instruction memory traffic
ACM Transactions on Computer Systems (TOCS)
Classification and performance evaluation of instruction buffering techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
On reconfigurable on-chip data caches
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Cache behavior of combinator graph reduction
ACM Transactions on Programming Languages and Systems (TOPLAS)
Processor Architecture and Data Buffering
IEEE Transactions on Computers
Optimal allocation of on-chip memory for multiple-API operating systems
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Surpassing the TLB performance of superpages with less operating system support
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
A new page table for 64-bit address spaces
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Memory bandwidth limitations of future microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Instruction fetch mechanisms for VLIW architectures with compressed encodings
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
The selection of optimal cache lines for microprocessor-based controllers
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th annual international symposium on Computer architecture
A Performance Study of Instruction Cache Prefetching Methods
IEEE Transactions on Computers
An evaluation of staged run-time optimizations in DyC
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
The pool of subsectors cache design
ICS '99 Proceedings of the 13th international conference on Supercomputing
An empirical evaluation of two memory-efficient directory methods
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Cache evaluation and the impact of workload choice
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Calpa: a tool for automating selective dynamic compilation
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
The benefits and costs of DyC's run-time optimizations
ACM Transactions on Programming Languages and Systems (TOPLAS)
The Effect of Code Expanding Optimizations on Instruction Cache Design
IEEE Transactions on Computers
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
A retrospective on: "an evaluation of staged run-time optimizations in DyC"
ACM SIGPLAN Notices - Best of PLDI 1979-1999
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Proceedings of the 32nd annual international symposium on Computer Architecture
Increasing cache capacity through word filtering
Proceedings of the 21st annual international conference on Supercomputing
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.01 |
Advances in integrated circuit density are permitting the implementation on a single chip of functions and performance enhancements beyond those of a basic processors. One performance enhancement of proven value is a cache memory; placing a cache on the processor chip can reduce both mean memory access time and bus traffic. In this paper we use trace driven simulation to study design tradeoffs for small (on-chip) caches. Miss ratio and traffic ratio (bus traffic) are the metrics for cache performance. Particular attention is paid to sub-block caches (also known as sector caches), in which address tags are associated with blocks, each of which contains multiple sub-blocks; sub-blocks are the transfer unit. Using traces from two 16-bit architectures (Z8000, PDP-11) and two 32-bit architectures (VAX-11, System/370), we find that general purpose caches of 64 bytes (net size) are marginally useful in some cases, while 1024-byte caches perform fairly well; typical miss and traffic ratios for a 1024 byte (net size) cache, 4-way set associative with 8 byte blocks are: PDP-11: .039, .156, Z8000: .015, .060, VAX 11: .080, .160, Sys/370: .244, .489. (These figures are based on traces of user programs and the performance obtained in practice is likely to be less good.) The use of sub-blocks allows tradeoffs between miss ratio and traffic ratio for a given cache size. Load forward is quite useful. Extensive simulation results are presented.