A distributed, simultaneously multi-threaded (SMT) processor with clustered scheduling windows for scalable DSP performance

Authors:
Mladen Berekovic;Tim Niggemeier
Affiliations:
IMEC, Eindhoven, The Netherlands and TU Delft, Delft, The Netherlands;IBM Deutschland Entwicklung GmbH, Böblingen, Germany
Venue:
Journal of Signal Processing Systems - Special Issue: Embedded computing systems for DSP
Year:
2008

Abstract

A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallel processing. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads, i.e. simultaneous multi-threading (SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of functional units and the number of thread units, without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.
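
The contrast the abstract draws between tag broadcasting and explicit operand transfers can be illustrated with a toy model. The sketch below is not the authors' protocol or simulator: the names `Slot`, `make_window`, `broadcast_wakeup` and `explicit_transfer` are hypothetical, and the model only counts how much work each wakeup style does for one produced result, assuming each waiting instruction holds one pending operand tag.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One waiting instruction in a scheduling window (hypothetical model)."""
    waiting_tags: list                      # operand tags not yet received
    operands: dict = field(default_factory=dict)


def make_window(size, consumers, tag):
    """Build a window where only the slots in `consumers` wait on `tag`."""
    return [
        Slot(waiting_tags=[tag] if i in consumers else [f"x{i}"])
        for i in range(size)
    ]


def broadcast_wakeup(window, tag, value):
    """Centralised style: the result tag is compared against every pending
    tag in the window. Returns the number of tag comparisons performed."""
    comparisons = 0
    for slot in window:
        for pending in list(slot.waiting_tags):   # copy: we remove while looping
            comparisons += 1
            if pending == tag:
                slot.waiting_tags.remove(pending)
                slot.operands[tag] = value
    return comparisons


def explicit_transfer(window, consumers, tag, value):
    """Decentralised style: the producer already knows its consumers and
    sends the operand to each one directly. Returns the number of transfers."""
    transfers = 0
    for index in consumers:
        slot = window[index]
        slot.waiting_tags.remove(tag)
        slot.operands[tag] = value
        transfers += 1
    return transfers


if __name__ == "__main__":
    consumers = [0, 3]                      # assumed consumer slots
    w_broadcast = make_window(16, consumers, "r1")
    w_explicit = make_window(16, consumers, "r1")
    print("tag comparisons:", broadcast_wakeup(w_broadcast, "r1", 42))
    print("operand transfers:", explicit_transfer(w_explicit, consumers, "r1", 42))
```

In this toy model the broadcast cost grows with the window size, while the explicit-transfer cost grows only with the number of consumers of a result. That is one plausible reading of why the abstract can claim a cycle time that is independent of issue and execution bandwidth; the real design's network and protocol details are not given in this record.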