As the clock frequency of silicon chips levels off, the computer architecture community is looking for other ways to continue scaling application performance. One such solution is the multicore approach, i.e., using multiple simple cores that together deliver higher performance than a wide superscalar processor, provided that the workload can exploit the parallelism. Another emerging alternative is the use of customized designs (accelerators) at different levels within the system. These may be specialized functional units integrated with the core, specialized cores, attached processors, or attached appliances. The design tradeoff is compelling because current processor chips have billions of transistors, but they cannot all be activated or switched at the same time at high frequencies. Specialized designs provide increased power efficiency but cannot be used as general-purpose compute engines. Therefore, architects trade area for power efficiency by adding units to the design that are known to be active at different times. The resulting system is a heterogeneous architecture, with the potential for specialized execution that accelerates different workloads. While designing and building such hardware systems is attractive, writing and porting software to a heterogeneous platform is even more challenging than exploiting parallelism on homogeneous multicore systems. In this paper, we propose a taxonomy that defines classes of accelerators, with the goal of focusing on a small set of programming models for them. We discuss several types of currently popular accelerators and identify challenges to exploiting such accelerators in current software stacks. This paper serves as a guide both for hardware designers, by providing a view of how software best exploits specialization, and for software programmers, by focusing research efforts on addressing parallelism and heterogeneity.