Decoupling local variable accesses in a wide-issue superscalar processor

Authors:
Sangyeun Cho;Pen-Chung Yew;Gyungho Lee
Affiliations:
MCU Team, System LSI Div., Samsung Electronics Co., Yongin-City, Korea;Dept. of Comp. Sci. and Eng., University of Minnesota, Minneapolis, MN;Division of Engineering, Univ. of Texas at San Antonio, San Antonio, TX
Venue:
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Year:
1999

Citing 25
Cited 18

A performance analysis of automatically managed top of stack buffers

ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
A Case for Direct-Mapped Caches

Computer
Coloring heuristics for register allocation

PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers

IEEE Transactions on Computers
The priority-based coloring approach to register allocation

ACM Transactions on Programming Languages and Systems (TOPLAS)
High-bandwidth data memory systems for superscalar processors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Increasing the instruction fetch rate via multiple branch prediction and a branch address cache

ICS '93 Proceedings of the 7th international conference on Supercomputing
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor

Digital Technical Journal - Special 10th anniversary issue
Increasing cache port efficiency for dynamic superscalar microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Spill code minimization via interference region spilling

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Designing high bandwidth on-chip caches

Proceedings of the 24th annual international symposium on Computer architecture
Dynamic speculation and synchronization of data dependences

Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
On high-bandwidth data cache design for multi-issue processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
One Billion Transistors, One Uniprocessor, One Chip

Computer
Superspeculative Microarchitecture for Beyond AD 2000

Computer
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
A three dimensional register file for superscalar processors

HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
Register allocation for free: The C machine stack cache

ASPLOS I Proceedings of the first international symposium on Architectural support for programming languages and operating systems
Register allocation & spilling via graph coloring

SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
A Characterization of Processor Performance in the vax-11/780

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture

Access region locality for high-bandwidth processor memory system design

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Early load address resolution via register tracking

Proceedings of the 27th annual international symposium on Computer architecture
Performance improvement with circuit-level speculation

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
L1 data cache decomposition for energy efficiency

ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
A High-Bandwidth Memory Pipeline for Wide Issue Processors

IEEE Transactions on Computers
Partitioned first-level cache design for clustered microarchitectures

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Dynamically managing the communication-parallelism trade-off in future clustered processors

Proceedings of the 30th annual international symposium on Computer architecture
Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning

Proceedings of the 2003 international symposium on Low power electronics and design
Power-efficient prefetching via bit-differential offset assignment on embedded processors

Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Load elimination for low-power embedded processors

GLSVLSI '05 Proceedings of the 15th ACM Great Lakes symposium on VLSI
How to Fake 1000 Registers

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Scalable network-based buffer overflow attack detection

Proceedings of the 2006 ACM/IEEE symposium on Architecture for networking and communications systems
Power-efficient prefetching for embedded processors

ACM Transactions on Embedded Computing Systems (TECS)
I-cache multi-banking and vertical interleaving

Proceedings of the 17th ACM Great Lakes symposium on VLSI
Specializing Cache Structures for High Performance and Energy Conservation in Embedded Systems

Transactions on High-Performance Embedded Architectures and Compilers I
Stack oriented data cache filtering

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Access region cache with register guided memory reference partitioning

Journal of Systems Architecture: the EUROMICRO Journal
Stack filter: Reducing L1 data cache power consumption

Journal of Systems Architecture: the EUROMICRO Journal

Quantified Score

Hi-index	0.01

Visualization

Abstract

Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add to the hardware complexity significantly. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Some of the important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied based on simulations. The potential performance of the proposed scheme is measured using various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.