Available instruction-level parallelism for superscalar and superpipelined machines

Authors:
N. P. Jouppi; D. W. Wall
Affiliations:
Digital Equipment Corporation, Western Research Lab; Digital Equipment Corporation, Western Research Lab
Venue:
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Year:
1989

Abstract

Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
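
The abstract names the average degree of superpipelining but does not define it. As a rough illustration only, the sketch below assumes the metric is the frequency-weighted mean of operation latencies, measured in machine cycles. The function name, the operation classes, and every number in `HYPOTHETICAL_MIX` are invented for this example and are not taken from the paper's measurements.

```python
def average_degree_of_superpipelining(mix):
    """Frequency-weighted mean latency over operation classes.

    `mix` maps a class name to (dynamic frequency, latency in cycles).
    Assumed definition: sum(freq * latency) / sum(freq).
    """
    total_freq = sum(freq for freq, _ in mix.values())
    if total_freq <= 0:
        raise ValueError("operation frequencies must sum to a positive value")
    weighted = sum(freq * latency for freq, latency in mix.values())
    return weighted / total_freq


# Hypothetical instruction mix: (fraction of dynamic instructions, latency).
HYPOTHETICAL_MIX = {
    "alu": (0.50, 1),
    "load": (0.20, 2),
    "store": (0.10, 1),
    "branch": (0.15, 2),
    "fp": (0.05, 4),
}

if __name__ == "__main__":
    degree = average_degree_of_superpipelining(HYPOTHETICAL_MIX)
    print(f"average degree of superpipelining: {degree:.2f}")
```

Under this assumed definition, a value above 1 means that some operations take longer than one issue cycle, so part of the available instruction-level parallelism is already consumed in covering those latencies. That is consistent with the abstract's claim that the metric is already high for many machines.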