A performance analysis of automatically managed top of stack buffers
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
A Case for Direct-Mapped Caches
Computer
Coloring heuristics for register allocation
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
IEEE Transactions on Computers
The priority-based coloring approach to register allocation
ACM Transactions on Programming Languages and Systems (TOPLAS)
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Increasing the instruction fetch rate via multiple branch prediction and a branch address cache
ICS '93 Proceedings of the 7th international conference on Supercomputing
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor
Digital Technical Journal - Special 10th anniversary issue
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Spill code minimization via interference region spilling
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Designing high bandwidth on-chip caches
Proceedings of the 24th annual international symposium on Computer architecture
Dynamic speculation and synchronization of data dependences
Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
On high-bandwidth data cache design for multi-issue processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
A three dimensional register file for superscalar processors
HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
Register allocation for free: The C machine stack cache
ASPLOS I Proceedings of the first international symposium on Architectural support for programming languages and operating systems
Register allocation & spilling via graph coloring
SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
A Characterization of Processor Performance in the vax-11/780
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Access region locality for high-bandwidth processor memory system design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Early load address resolution via register tracking
Proceedings of the 27th annual international symposium on Computer architecture
Performance improvement with circuit-level speculation
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
L1 data cache decomposition for energy efficiency
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
A High-Bandwidth Memory Pipeline for Wide Issue Processors
IEEE Transactions on Computers
Partitioned first-level cache design for clustered microarchitectures
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Dynamically managing the communication-parallelism trade-off in future clustered processors
Proceedings of the 30th annual international symposium on Computer architecture
Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning
Proceedings of the 2003 international symposium on Low power electronics and design
Power-efficient prefetching via bit-differential offset assignment on embedded processors
Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Load elimination for low-power embedded processors
GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Scalable network-based buffer overflow attack detection
Proceedings of the 2006 ACM/IEEE symposium on Architecture for networking and communications systems
Power-efficient prefetching for embedded processors
ACM Transactions on Embedded Computing Systems (TECS)
I-cache multi-banking and vertical interleaving
Proceedings of the 17th ACM Great Lakes symposium on VLSI
Specializing Cache Structures for High Performance and Energy Conservation in Embedded Systems
Transactions on High-Performance Embedded Architectures and Compilers I
Stack oriented data cache filtering
CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Access region cache with register guided memory reference partitioning
Journal of Systems Architecture: the EUROMICRO Journal
Stack filter: Reducing L1 data cache power consumption
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.01 |
Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied based on simulations. The potential performance of the proposed scheme is measured using various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.