This paper proposes a fully programmable vertex shader based on the Transport Triggered Architecture (TTA), aimed at high performance efficiency and efficient datapath connectivity for embedded applications. At the architecture level, fine-grained data transport in the TTA datapath and multi-threading are adopted to exploit instruction-level and data-level parallelism, respectively, in graphics applications. Datapath connectivity is optimized mainly through TTA's native, architecturally visible bypassing and hybrid result re-collection schemes. At the shader core level, a novel SIMD multi-functional dot-product unit and an area-efficient special function unit are introduced for floating-point processing. The proposed processor achieves a peak capacity of 1.5 GFLOPS and 125 Mvertices/s. Compared with a previous expanded VLIW vertex processor, it achieves a 17.6% reduction in hardware cost and a 1.3 times improvement in performance per logic cost on real graphics benchmarks in a 0.18 μm CMOS process.
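To make the idea of transport-triggered execution and architecturally visible bypassing more concrete, the following is a minimal sketch in Python. It is not the proposed shader's datapath: the `FunctionUnit` class, its port names, and the serial `dot4` routine are hypothetical illustrations, and the paper's dot-product unit is a SIMD design rather than the multiply-accumulate loop shown here. The sketch only illustrates the general TTA principle that a program consists of data moves, that an operation fires when a value is moved to a unit's trigger port, and that a result can be moved directly to another unit without passing through a register file.

```python
class FunctionUnit:
    """Toy TTA-style function unit (hypothetical model, not the paper's design).

    It has one operand port, one trigger port, and one result port.
    Moving a value to the trigger port fires the operation.
    """

    def __init__(self, op):
        self.op = op          # two-input operation performed by this unit
        self.operand = 0.0    # operand port (latched, does not fire)
        self.result = 0.0     # result port (readable by later moves)

    def write_operand(self, value):
        # A move to the operand port only stores the value.
        self.operand = value

    def write_trigger(self, value):
        # A move to the trigger port starts the operation.
        self.result = self.op(self.operand, value)


def dot4(a, b):
    """4D dot product expressed as a sequence of data moves."""
    mul = FunctionUnit(lambda x, y: x * y)
    add = FunctionUnit(lambda x, y: x + y)

    add.write_operand(0.0)                 # clear the accumulator
    for x, y in zip(a, b):
        mul.write_operand(x)               # move x -> mul.operand
        mul.write_trigger(y)               # move y -> mul.trigger (fires multiply)
        add.write_trigger(mul.result)      # bypass: mul.result -> add.trigger
        add.write_operand(add.result)      # feed the running sum back
    return add.result


if __name__ == "__main__":
    print(dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))
```

In this toy model the move `mul.result -> add.trigger` stands in for a software-visible bypass: the intermediate product never needs a register-file write, which is the kind of connectivity saving the abstract attributes to TTA.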