By compiling ordinary scientific applications programs with a radical technique called trace scheduling, we are generating code for a parallel machine that will run these programs faster than an equivalent sequential machine—we expect 10 to 30 times faster. Trace scheduling generates code for machines called Very Long Instruction Word architectures. In Very Long Instruction Word machines, many statically scheduled, tightly coupled, fine-grained operations execute in parallel within a single instruction stream. VLIWs are more parallel extensions of several current architectures. These current architectures have never cracked a fundamental barrier. The speedup they get from parallelism is never more than a factor of 2 to 3. Not that we couldn't build more parallel machines of this type; but until trace scheduling we didn't know how to generate code for them. Trace scheduling finds sufficient parallelism in ordinary code to justify thinking about a highly parallel VLIW. At Yale we are actually building one. Our machine, the ELI-512, has a horizontal instruction word of over 500 bits and will do 10 to 30 RISC-level operations per cycle [Patterson 82]. ELI stands for Enormously Longword Instructions; 512 is the size of the instruction word we hope to achieve. (The current design has a 1200-bit instruction word.) Once it became clear that we could actually compile code for a VLIW machine, some new questions appeared, and answers are presented in this paper. How do we put enough tests in each cycle without making the machine too big? How do we put enough memory references in each cycle without making the machine too slow?
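To make the idea of "many statically scheduled operations in a single instruction stream" concrete, the following is a minimal sketch of packing straight-line operations into wide instruction words at compile time. It is not trace scheduling as described in the paper: it handles only one branch-free block, assumes every operation takes one cycle, and assumes any operation can occupy any slot. The names `Op`, `depends`, and `pack` are hypothetical and chosen for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    """One RISC-level operation: writes `dest`, reads `srcs`."""
    dest: str
    srcs: Tuple[str, ...]


def depends(later: Op, earlier: Op) -> bool:
    """True if `later` must be placed in a later word than `earlier`."""
    raw = earlier.dest in later.srcs   # read after write
    waw = earlier.dest == later.dest   # write after write
    war = later.dest in earlier.srcs   # write after read (treated conservatively)
    return raw or waw or war


def pack(ops: List[Op], width: int) -> List[List[int]]:
    """Greedily pack ops, taken in program order, into words of at most `width` ops.

    Returns a list of words; each word lists the indices of the ops it issues.
    """
    words: List[List[int]] = []
    cycle_of: List[int] = []  # cycle_of[i] is the word index chosen for ops[i]
    for i, op in enumerate(ops):
        # The op may not issue before every op it depends on has issued.
        earliest = 0
        for j in range(i):
            if depends(op, ops[j]):
                earliest = max(earliest, cycle_of[j] + 1)
        # Take the first word at or after `earliest` that still has a free slot.
        c = earliest
        while c < len(words) and len(words[c]) >= width:
            c += 1
        while len(words) <= c:
            words.append([])
        words[c].append(i)
        cycle_of.append(c)
    return words


if __name__ == "__main__":
    block = [
        Op("a", ("x", "y")),  # 0: a = x + y
        Op("b", ("z", "w")),  # 1: b = z * w
        Op("c", ("a", "b")),  # 2: c = a + b
        Op("d", ("x", "w")),  # 3: d = x - w
        Op("e", ("c", "d")),  # 4: e = c * d
    ]
    for width in (1, 2, 4):
        print(width, pack(block, width))
```

In this toy block, widening the word from one slot to two shortens the schedule from five words to three, and a four-slot word gains nothing further because the dependence chain through `c` and `e` limits it. That limit inside a single block is the kind of barrier the abstract refers to, and it is why the paper looks for parallelism beyond block boundaries.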