This paper aims to provide a quantitative understanding of the performance of image and video processing applications on general-purpose processors, with and without media ISA extensions. We use detailed simulation of 12 benchmarks to study the effectiveness of current architectural features and to identify future challenges for these workloads.

Our results show that the conventional techniques used in current processors to enhance instruction-level parallelism (ILP) provide a 2.3X to 4.2X performance improvement. The Sun VIS media ISA extensions provide an additional 1.1X to 4.2X performance improvement. Together, the ILP features and media ISA extensions significantly reduce the CPU component of execution time, making 5 of the image processing benchmarks memory-bound.

The memory behavior of our benchmarks is characterized by large working sets and streaming data accesses. Increasing the cache size has no impact on 8 of the benchmarks. The remaining benchmarks require relatively large caches (with the required size dependent on the display size) to exploit data reuse, but gain less than 1.2X in performance from the larger caches. Software prefetching provides a 1.4X to 2.5X performance improvement in the image processing benchmarks where memory stall time is a significant problem. With the addition of software prefetching, all our benchmarks revert to being compute-bound.
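To make the idea of software prefetching for a streaming access pattern concrete, the following is a minimal sketch in C. It is not the code evaluated in the paper: it uses the GCC/Clang `__builtin_prefetch` builtin rather than SPARC prefetch instructions, and the function name `brighten`, the 64-byte `CACHE_LINE`, and the `PREFETCH_AHEAD` distance are all illustrative assumptions that would need tuning on a real machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes (hypothetical; varies by machine). */
#define CACHE_LINE 64
/* Assumed prefetch distance in bytes (hypothetical; must be tuned). */
#define PREFETCH_AHEAD (4 * CACHE_LINE)

/* Add delta to every 8-bit pixel with saturation at 255.
 * The loop streams over src and dst once, so there is no reuse for a
 * cache to exploit; prefetching a fixed distance ahead tries to overlap
 * memory latency with computation instead. */
void brighten(uint8_t *dst, const uint8_t *src, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one prefetch per cache line rather than per element. */
        if (i % CACHE_LINE == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0); /* read  */
            __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1, 0); /* write */
        }
#endif
        unsigned sum = (unsigned)src[i] + delta;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The prefetches are only hints, so the result is the same whether or not the compiler or hardware honors them. The low temporal-locality argument (`0`) reflects the streaming nature of the data, which is touched once and not reused.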