Multithreading and multicore processing are powerful ways to exploit application parallelism and boost system performance. However, exploring sufficient parallelism and achieving data locality with low communication overhead remain important research issues in embedded multithreading/multicore design. This paper introduces a fast data-switching mechanism between multilevel storage structures in a new multicore architecture and makes three contributions to the development of sophisticated multimedia applications based on advanced standards such as H.264. The first contribution, collaborative multithreading, tightly unifies a reduced instruction set computer (RISC) with a collaborative-multithreading digital signal processor (DSP) to exploit high parallelism and provide sufficient computing power to applications. Each collaborative thread of our DSP is built from a heterogeneous simultaneous-multithreading single-instruction, multiple-data (SIMD) structure and four media processing cores, which are connected by a fast switch that provides fast data exchange among correlated streams at the thread level. Our second contribution is one-stop streaming processing, which keeps data in the system until it is no longer needed, making data access more efficient. Our third contribution is a chunk-threading programming model, including a thread management library and threading communication directives that reduce data communication and synchronization overhead. By combining coarse-grained and fine-grained threading, programmers can choose among threading levels based on the amount of data exchange in a program.
With the proposed techniques and an appropriate programming model, we reduce processing time by 54.9% in H.264 video encoding (common intermediate format (CIF) video at 16.574 f/s) with the 1-virtual independent and streaming processing by open collaborative multithreading configuration, compared with the Texas Instruments C62 core, which has eight functional units. We implemented the design as a prototype chip fabricated in the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 µm process. The die size of the processor core is 16.12 mm², including 414k logic transistors and 34.4 kB of on-chip static random access memory. The processor runs at 180 MHz/1.2 V and consumes 245 mW according to postsimulation results.