Efficient compilation for queue size constrained queue processors

Authors:
Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa
Affiliations:
IBM, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-Shi, Kanagawa-Ken 242-8502, Japan; University of Aizu, Aizu-Wakamatsu, Fukushima-Ken 965-8580, Japan; University of Electro-Communications, Graduate School of Information Systems, Chofugaoka 1-5-1, Chofu-Shi 182-8585, Japan
Venue:
Parallel Computing
Year:
2009

Abstract

Queue computers use a FIFO data structure for data processing. The essential characteristics of a queue-based architecture suit the demands of embedded systems: a compact instruction set, simple hardware logic, high parallelism, and low power consumption. The size of the queue is an important concern in the design of a realizable embedded queue processor. We describe the relationship between parallelism, the length of data dependency edges in data flow graphs, and queue utilization requirements. This paper presents a technique that makes the compiler aware of the size of the queue register file so that it can optimize programs to use the available hardware effectively. The compiler examines the data flow graph of a program and partitions it into clusters whenever it exceeds the queue limits of the target architecture. The presented algorithm deals with the two factors that affect queue utilization, namely parallelism and the length of variables' reaching definitions. We analyze how the quality of the generated code is affected for SPEC CINT95 benchmark programs under different queue size configurations. Our results show that, for reasonable queue sizes, the compiler generates code that is comparable to the code generated for infinite resources in terms of instruction count, static execution time, and instruction-level parallelism.
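
To make the link between parallelism and queue utilization concrete, the following is a minimal sketch of a FIFO evaluation model. It is not the paper's compiler algorithm or instruction set: the instruction names (`ld`, `add`, `sub`, `mul`), the tuple encoding, and the function `run_queue_program` are assumptions chosen for illustration. Operands are read from the head of the queue and results are written to the tail, and the simulator records the peak number of occupied queue entries.

```python
from collections import deque
import operator

# Hypothetical binary operations for this illustration only.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_queue_program(program, env):
    """Run a toy queue program and return (result, peak queue occupancy)."""
    queue = deque()
    peak = 0
    for instr in program:
        if instr[0] == "ld":
            # Loads write their value to the tail of the queue.
            queue.append(env[instr[1]])
        else:
            # Binary operations read two operands from the head
            # and write the result to the tail.
            lhs = queue.popleft()
            rhs = queue.popleft()
            queue.append(OPS[instr[0]](lhs, rhs))
        peak = max(peak, len(queue))
    if len(queue) != 1:
        raise ValueError("program must leave exactly one value in the queue")
    return queue[0], peak


env = {"a": 1, "b": 2, "c": 7, "d": 3}

# (a + b) * (c - d): all four loads sit on one level, so all four
# values are live in the queue at the same time.
wide = [("ld", "a"), ("ld", "b"), ("ld", "c"), ("ld", "d"),
        ("add",), ("sub",), ("mul",)]

# ((a + b) + c) + d: a sequential chain keeps at most two values live.
narrow = [("ld", "a"), ("ld", "b"), ("add",),
          ("ld", "c"), ("add",), ("ld", "d"), ("add",)]

wide_result, wide_peak = run_queue_program(wide, env)
narrow_result, narrow_peak = run_queue_program(narrow, env)
print(wide_result, wide_peak)      # 12 4
print(narrow_result, narrow_peak)  # 13 2
```

In this toy model the wider expression needs four queue entries while the sequential chain needs two, which is the kind of pressure a queue-size-aware compiler has to keep within the hardware limit.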