Multimedia processing on embedded devices requires an architecture that delivers high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, the vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory.

We demonstrate that executable code for VIRAM is up to 10 times smaller than VLIW code and comparable to x86 CISC code. The simple, cache-less VIRAM chip is 2 times faster than a 4-way superscalar RISC processor that runs at a 5 times higher clock frequency and consumes 10 times more power. VIRAM is also 10 times faster than cache-based VLIW processors. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.
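The superscalar comparison combines three ratios (speed, clock frequency, power), and it can help to work out what they imply together. The sketch below is not from the paper; it is plain arithmetic on the figures quoted in the abstract, and the function name `relative_metrics` and its parameters are hypothetical. It assumes that "2 times faster" refers to execution time on the same workload and that the quoted power figures are averages over that run.

```python
from fractions import Fraction


def relative_metrics(speedup, clock_ratio, power_ratio):
    """Derive implied ratios for a processor relative to a baseline.

    All arguments are (processor / baseline) ratios:
      speedup     -- performance ratio (baseline time / processor time)
      clock_ratio -- clock frequency ratio
      power_ratio -- average power ratio
    """
    # Work done per clock cycle: performance divided by clock frequency.
    work_per_cycle = speedup / clock_ratio
    # Performance per watt: performance divided by power.
    perf_per_watt = speedup / power_ratio
    # Energy for the same task: power multiplied by run time (1 / speedup).
    energy_per_task = power_ratio / speedup
    return work_per_cycle, perf_per_watt, energy_per_task


# Figures from the abstract, VIRAM relative to the 4-way superscalar:
# 2x faster, 1/5 the clock frequency, 1/10 the power.
work, perf_watt, energy = relative_metrics(
    Fraction(2), Fraction(1, 5), Fraction(1, 10)
)
print(f"work per cycle:  {work}x")
print(f"perf per watt:   {perf_watt}x")
print(f"energy per task: {energy} of baseline")
```

Under these assumptions, the quoted figures imply roughly 10 times more work per clock cycle, 20 times better performance per watt, and about one twentieth of the energy for the same task. These are derived ratios, not results reported in the abstract.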