Larrabee: a many-core x86 architecture for visual computing

Authors:
Larry Seiler;Doug Carmean;Eric Sprangle;Tom Forsyth;Michael Abrash;Pradeep Dubey;Stephen Junkins;Adam Lake;Jeremy Sugerman;Robert Cavin;Roger Espasa;Ed Grochowski;Toni Juan;Pat Hanrahan
Affiliations:
Intel® Corporation;Intel® Corporation;Intel® Corporation;Intel® Corporation;RAD Game Tools;Intel® Corporation;Intel® Corporation;Intel® Corporation;Stanford University;Intel® Corporation;Intel® Corporation;Intel® Corporation;Intel® Corporation;Stanford University
Venue:
ACM SIGGRAPH 2008 papers
Year:
2008

Citing 36
Cited 214

Pixel-planes 5: a heterogeneous multiprocessor graphics system using processor-enhanced memories

SIGGRAPH '89 Proceedings of the 16th annual conference on Computer graphics and interactive techniques
PixelFlow: high-speed rendering using image composition

SIGGRAPH '92 Proceedings of the 19th annual conference on Computer graphics and interactive techniques
A scalable hardware render accelerator using a modified scanline algorithm

SIGGRAPH '92 Proceedings of the 19th annual conference on Computer graphics and interactive techniques
A Sorting Classification of Parallel Rendering

IEEE Computer Graphics and Applications
Hardware accelerated rendering of CSG and transparency

SIGGRAPH '94 Proceedings of the 21st annual conference on Computer graphics and interactive techniques
I-COLLIDE: an interactive and exact collision detection system for large-scale environments

I3D '95 Proceedings of the 1995 symposium on Interactive 3D graphics
Computer graphics (2nd ed. in C): principles and practice

Computer graphics (2nd ed. in C): principles and practice
Hierarchical polygon tiling with coverage masks

SIGGRAPH '96 Proceedings of the 23rd annual conference on Computer graphics and interactive techniques
Talisman: commodity realtime 3D graphics for the PC

SIGGRAPH '96 Proceedings of the 23rd annual conference on Computer graphics and interactive techniques
Cilk: an efficient multithreaded runtime system

Journal of Parallel and Distributed Computing - Special issue on multithreading for multiprocessors
Simple models of the impact of overlap in bucket rendering

HWWS '98 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware
New microarchitecture challenges in the coming generations of CMOS process technologies (keynote address) (abstract only)

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Parallel programming in OpenMP

Parallel programming in OpenMP
A parallel algorithm for polygon rasterization

SIGGRAPH '88 Proceedings of the 15th annual conference on Computer graphics and interactive techniques
Lightning-2: a high-performance display subsystem for PC clusters

Proceedings of the 28th annual conference on Computer graphics and interactive techniques
Real-Time Rendering

Real-Time Rendering
ZR: a 3D API transparent technology for chunk rendering

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Architecture of the Pentium Microprocessor

IEEE Micro
Imagine: Media Processing with Streams

IEEE Micro
Cg: a system for programming graphics hardware in a C-like language

ACM SIGGRAPH 2003 Papers
Designing graphics architectures around scalability and communication

Designing graphics architectures around scalability and communication
OpenGL(R) Shading Language

OpenGL(R) Shading Language
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Best of Both Latency and Throughput

ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Hardware-Assisted Visibility Sorting for Unstructured Volume Rendering

IEEE Transactions on Visualization and Computer Graphics
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
GPU-accelerated high-quality hidden surface removal

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Multi-level ray tracing algorithm

ACM SIGGRAPH 2005 Papers
The irregular Z-buffer: Hardware acceleration for irregular data structures

ACM Transactions on Graphics (TOG)
The Direct3D 10 system

ACM SIGGRAPH 2006 Papers
Multi-fragment effects on the GPU using the k-buffer

Proceedings of the 2007 symposium on Interactive 3D graphics and games
Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Practical logarithmic rasterization for low-error shadow maps

Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Scalable Parallel Programming with CUDA

Queue - GPU Computing
Intel threading building blocks

Intel threading building blocks
Alias-free shadow maps

EGSR'04 Proceedings of the Fifteenth Eurographics conference on Rendering Techniques

Real-time Reyes-style adaptive surface subdivision

ACM SIGGRAPH Asia 2008 papers
Logarithmic perspective shadow maps

ACM Transactions on Graphics (TOG)
Efficient implementation of sorting on multi-core SIMD CPU architecture

Proceedings of the VLDB Endowment
GRAMPS: A programming model for graphics pipelines

ACM Transactions on Graphics (TOG)
Soft irregular shadow mapping: fast, high-quality, and robust soft shadows

Proceedings of the 2009 symposium on Interactive 3D graphics and games
Light interaction with human skin: from believable images to predictable models

ACM SIGGRAPH ASIA 2008 courses
Accelerating critical section execution with asymmetric multi-core architectures

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
StreamRay: a stream filtering architecture for coherent ray tracing

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Evaluation of memory performance on the Cell BE with the SARC programming model

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
GViM: GPU-accelerated virtual machines

Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing
Toward a multicore architecture for real-time ray-tracing

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
An efficient GPU-based approach for interactive global illumination

ACM SIGGRAPH 2009 papers
High-performance regular expression scanning on the Cell/B.E. processor

Proceedings of the 23rd international conference on Supercomputing
Using many-core hardware to correlate radio astronomy signals

Proceedings of the 23rd international conference on Supercomputing
Programming model for a heterogeneous x86 platform

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Time-predictable computer architecture

EURASIP Journal on Embedded Systems - FPGA supercomputing platforms, architectures, and techniques for accelerating computationally complex algorithms
AnySP: anytime anywhere anyway signal processing

Proceedings of the 36th annual international symposium on Computer architecture
Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Proceedings of the 36th annual international symposium on Computer architecture
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Thread motion: fine-grained power management for multi-core systems

Proceedings of the 36th annual international symposium on Computer architecture
Practical Random Linear Network Coding on GPUs

NETWORKING '09 Proceedings of the 8th International IFIP-TC 6 Networking Conference
A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Viewpoint: Face the inevitable, embrace parallelism

Communications of the ACM - The Status of the P versus NP Problem
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Data-parallel rasterization of micropolygons with defocus and motion blur

Proceedings of the Conference on High Performance Graphics 2009
Morphological antialiasing

Proceedings of the Conference on High Performance Graphics 2009
Selective and adaptive supersampling for real-time ray tracing

Proceedings of the Conference on High Performance Graphics 2009
Efficient ray traced soft shadows using multi-frusta tracing

Proceedings of the Conference on High Performance Graphics 2009
Faster incoherent rays: Multi-BVH ray stream tracing

Proceedings of the Conference on High Performance Graphics 2009
Efficient stream compaction on wide SIMD many-core architectures

Proceedings of the Conference on High Performance Graphics 2009
Stream compaction for deferred shading

Proceedings of the Conference on High Performance Graphics 2009
Optimizing total power of many-core processors considering voltage scaling limit and process variations

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
ClearPath: highly parallel collision avoidance for multi-agent simulation

Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation
Programmable and Scalable Architecture for Graphics Processing Units

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
A Data Parallel Algorithm for XML DOM Parsing

XSym '09 Proceedings of the 6th International XML Database Symposium on Database and XML Technologies
Efficient Multiplication of Polynomials on Graphics Hardware

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
SSE Implementation of Multivariate PKCs on Modern x86 CPUs

CHES '09 Proceedings of the 11th International Workshop on Cryptographic Hardware and Embedded Systems
Ray casting of multiple volumetric datasets with polyhedral boundaries on manycore GPUs

ACM SIGGRAPH Asia 2009 papers
GPU virtualization on VMware's hosted I/O architecture

ACM SIGOPS Operating Systems Review
Achieving high memory performance from heterogeneous architectures with the SARC programming model

Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
The multikernel: a new OS architecture for scalable multicore systems

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Helios: heterogeneous multiprocessing with satellite kernels

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Massively parallel processing: it's déjà vu all over again

Proceedings of the 46th Annual Design Automation Conference
APRON: a cellular processor array simulation and hardware design tool

EURASIP Journal on Advances in Signal Processing - CNN technology for spatiotemporal signal processing
Reducing Query Latencies in Web Search Using Fine-Grained Parallelism

World Wide Web
Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
PFunc: modern task parallelism for modern high performance computing

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Towards a framework for abstracting accelerators in parallel applications: experience with Cell

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
An adaptative game loop architecture with automatic distribution of tasks between CPU and GPU

Computers in Entertainment (CIE) - SPECIAL ISSUE: Games
Interactive sound rendering

ACM SIGGRAPH 2009 Courses
Complexity effective memory access scheduling for many-core accelerator architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Introduction to GPU programming for EDA

Proceedings of the 2009 International Conference on Computer-Aided Design
Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs

Proceedings of the VLDB Endowment
Utilizing predictors for efficient thermal management in multiprocessor SoCs

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
TRaX: a multicore hardware architecture for real-time ray tracing

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects

Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games
Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming

International Journal of High Performance Computing Applications
MacroSS: macro-SIMDization of streaming applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Flexible architectural support for fine-grain scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Scalability of relaxed consistency models in NoC based multicore architectures

ACM SIGARCH Computer Architecture News
A self-adaptive scheduler for asymmetric multi-cores

Proceedings of the 20th symposium on Great lakes symposium on VLSI
Porting existing cache-oblivious linear algebra HPC modules to Larrabee architecture

Proceedings of the 7th ACM international conference on Computing frontiers
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
State-of-the-art in heterogeneous computing

Scientific Programming
Towards dense linear algebra for hybrid GPU accelerated manycore systems

Parallel Computing
FAST: fast architecture sensitive tree search on modern CPUs and GPUs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Small-ruleset regular expression matching on GPGPUs: quantitative performance analysis and optimization

Proceedings of the 24th ACM International Conference on Supercomputing
OpenMP extensions for FPGA accelerators

SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
WiDGET: Wisconsin decoupled grid execution tiles

Proceedings of the 37th annual international symposium on Computer architecture
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Web search using mobile cores: quantifying and mitigating the price of efficiency

Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
Speeding up homomorphic hashing using GPUs

ICC'09 Proceedings of the 2009 IEEE international conference on Communications
A new multi-core pipelined architecture for executing sequential programs for parallel geospatial computing

Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research & Application
Efficient fault simulation on many-core processors

Proceedings of the 47th Design Automation Conference
Technical Section: Efficient volume rendering on the body centered cubic lattice using box splines

Computers and Graphics
Memory efficient ray tracing with hierarchical mesh quantization

Proceedings of Graphics Interface 2010
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
Remote Process Execution and Remote File I/O for Heterogeneous Processors in Cluster Systems

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
A Memory Centric Kernel Framework for Accelerating Short-Range, Interactive Particle Simulation

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
directCell: hybrid systems with tightly coupled accelerators

IBM Journal of Research and Development
The reverse-acceleration model for programming petascale hybrid systems

IBM Journal of Research and Development
Introduction to the wire-speed processor and architecture

IBM Journal of Research and Development
A multi-streaming SIMD multimedia computing engine

Microprocessors & Microsystems
PacketShader: a GPU-accelerated software router

Proceedings of the ACM SIGCOMM 2010 conference
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proximity coherence for chip multiprocessors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
An OpenCL framework for heterogeneous multicores with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A programmable parallel accelerator for learning and classification

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Revisiting sorting for GPGPU stream architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
GPU virtualization on VMware's hosted I/O architecture

WIOV'08 Proceedings of the First conference on I/O virtualization
Fast field solver for the simulation of large-area OLEDs

Microelectronics Journal
A balanced programming model for emerging heterogeneous multicore systems

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Real-time collision culling of a million bodies on graphics processing units

ACM SIGGRAPH Asia 2010 papers
Detail-preserving fully-Eulerian interface tracking framework

ACM SIGGRAPH Asia 2010 papers
Fast parallel surface and solid voxelization on GPUs

ACM SIGGRAPH Asia 2010 papers
Many-core virtual machines: decoupling abstract from concrete concurrency

Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
MEDEA: a hybrid shared-memory/message-passing multiprocessor NoC-based architecture

Proceedings of the Conference on Design, Automation and Test in Europe
Compilation of stream programs for multicore processors that incorporate scratchpad memories

Proceedings of the Conference on Design, Automation and Test in Europe
Destination-based adaptive routing on 2D mesh networks

Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
A link arbitration scheme for quality of service in a latency-optimized network-on-chip

Proceedings of the Conference on Design, Automation and Test in Europe
Latency criticality aware on-chip communication

Proceedings of the Conference on Design, Automation and Test in Europe
A memory interface for multi-purpose multi-stream accelerators

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Rank based dynamic voltage and frequency scaling for tiled graphics processors

CODES/ISSS '10 Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Comparing last-level cache designs for CMP architectures

Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
Weighted random oblivious routing on torus networks

Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Power-efficient spilling techniques for chip multiprocessors

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Efficient address mapping of shared cache for on-chip many-core architecture

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Efficient throughput-guarantees for latency-sensitive networks-on-chip

Proceedings of the 2010 Asia and South Pacific Design Automation Conference
Coherent depth test scheme in FreePipe

Proceedings of the 9th ACM SIGGRAPH Conference on Virtual-Reality Continuum and its Applications in Industry
A capabilities-aware framework for using computational accelerators in data-intensive computing

Journal of Parallel and Distributed Computing
A lazy object-space shading architecture with decoupled sampling

Proceedings of the Conference on High Performance Graphics
Task management for irregular-parallel workloads on the GPU

Proceedings of the Conference on High Performance Graphics
Parallel SAH k-D tree construction

Proceedings of the Conference on High Performance Graphics
Efficient bounding of displaced Bézier patches

Proceedings of the Conference on High Performance Graphics
Parallel-vector algorithms for particle simulations on shared-memory multiprocessors

Journal of Computational Physics
Erasing Core Boundaries for Robust and Configurable Performance

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
LOFT: A High Performance Network-on-Chip Providing Quality-of-Service Support

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
The future of microprocessors

Communications of the ACM
Attaining system performance points: revisiting the end-to-end argument in system design for heterogeneous many-core systems

ACM SIGOPS Operating Systems Review
Bothnia: a dual-personality extension to the Intel integrated graphics driver

ACM SIGOPS Operating Systems Review
Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform

ACM SIGOPS Operating Systems Review
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Exascale computing technology challenges

VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Applying parallel design techniques to template matching with GPUs

VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Decoupled sampling for graphics pipelines

ACM Transactions on Graphics (TOG)
Modeling and Evaluating Non-shared Memory CELL/BE Type Multi-core Architectures for Local Image and Video Processing

Journal of Signal Processing Systems
Programming heterogeneous clusters with accelerators using object-based programming

Scientific Programming
SSLShader: cheap SSL acceleration with commodity processors

Proceedings of the 8th USENIX conference on Networked systems design and implementation
A programming model for GPU-based parallel computing with scalability and abstraction

Proceedings of the 25th Spring Conference on Computer Graphics
Mind the gap!: bridging the dichotomy of design and implementation

Proceedings of the 4th International Workshop on Software Engineering for Computational Science and Engineering
A minimalist cache coherent MPSoC designed for FPGAs

International Journal of High Performance Systems Architecture
Automatic SIMD vectorization of fast fourier transforms for the Larrabee and AVX instruction sets

Proceedings of the international conference on Supercomputing
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
Considerations when evaluating microprocessor platforms

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topics in parallelism
Mobile processors for energy-efficient web search

ACM Transactions on Computer Systems (TOCS)
High-performance software rasterization on GPUs

Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics
Razor: An architecture for dynamic multiresolution ray tracing

ACM Transactions on Graphics (TOG)
T&I engine: traversal and intersection engine for hardware accelerated ray tracing

Proceedings of the 2011 SIGGRAPH Asia Conference
Stylization-based ray prioritization for guaranteed frame rates

Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Non-Photorealistic Animation and Rendering
Green challenges to system software in data centers

Frontiers of Computer Science in China
Designing fast architecture-sensitive tree search on modern multicore/many-core processors

ACM Transactions on Database Systems (TODS)
Obsidian: a domain specific embedded language for parallel programming of graphics processors

IFL'08 Proceedings of the 20th international conference on Implementation and application of functional languages
High-performance 3D compressive sensing MRI reconstruction using many-core architectures

Journal of Biomedical Imaging - Special issue on Parallel Computation in Medical Imaging Applications
A memory accelerator with gather functions for bandwidth-bound irregular applications

Proceedings of the first workshop on Irregular applications: architectures and algorithms
A Hoare calculus for the verification of synchronous languages

PLPV '12 Proceedings of the sixth workshop on Programming languages meets program verification
PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation

Parallel Computing
Exploring high throughput computing paradigm for global routing

Proceedings of the International Conference on Computer-Aided Design
Design and analysis of adaptive processor

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification

ACM Transactions on Architecture and Code Optimization (TACO)
Extending a C-like language for portable SIMD programming

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Region scheduling: efficiently using the cache architectures via page-level affinity

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Hardware support for OpenMP collective operations

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Hardware transactional memory for GPU architectures

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Topology-Aware OpenMP process scheduling

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
Multi core design for chip level multiprocessing

Advanced Lectures on Software Engineering
A parallelizing compiler cooperative heterogeneous multicore processor architecture

Transactions on High-Performance Embedded Architectures and Compilers IV
Whole-function vectorization

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The case for elastic operating system services in fos

Proceedings of the 49th Annual Design Automation Conference
Extending a highly parallel data mining algorithm to the Intel® many integrated core architecture

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Dynamic compilation of data-parallel kernels for vector processors

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Parallel Computing
Exploring cross-layer power management for PGAS applications on the SCC platform

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Proceedings of the 26th ACM international conference on Supercomputing
3D rasterization: a bridge between rasterization and ray casting

Proceedings of Graphics Interface 2012
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Proceedings of the 39th Annual International Symposium on Computer Architecture
Viper: virtual pipelines for enhanced reliability

Proceedings of the 39th Annual International Symposium on Computer Architecture
Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Proceedings of the 39th Annual International Symposium on Computer Architecture
Special Section on CANS: Ray prioritization using stylization and visual saliency

Computers and Graphics
Softshell: dynamic scheduling on GPUs

ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia 2012
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Riposte: a trace-driven compiler and parallel VM for vector code in R

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Power-efficient computing for compute-intensive GPGPU applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Fragment-parallel composite and filter

EGSR'10 Proceedings of the 21st Eurographics conference on Rendering
Simulation of radio wave propagation by beam tracing

EG PGV'09 Proceedings of the 9th Eurographics conference on Parallel Graphics and Visualization
Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
NUMA-aware graph mining techniques for performance and energy efficiency

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Improving Data Locality for Efficient In-Core Path Tracing

Computer Graphics Forum
Tsunami: massively parallel homomorphic hashing on many-core GPUs

Concurrency and Computation: Practice & Experience
Graphics processing unit (GPU) programming strategies and trends in GPU computing

Journal of Parallel and Distributed Computing
GPP-Grep: high-speed regular expression processing engine on general purpose processors

RAID'12 Proceedings of the 15th international conference on Research in Attacks, Intrusions, and Defenses
A Simple Compressive Sensing Algorithm for Parallel Many-Core Architectures

Journal of Signal Processing Systems
Vector Extensions for Decision Support DBMS Acceleration

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
A sort-based deferred shading architecture for decoupled sampling

ACM Transactions on Graphics (TOG) - SIGGRAPH 2013 Conference Proceedings
Fast deformation of volume data using tetrahedral mesh rasterization

Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation
Exploring memory consistency for massively-threaded throughput-oriented processors

Proceedings of the 40th Annual International Symposium on Computer Architecture
Locality-aware task management for unstructured parallelism: a quantitative limit study

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Distributed run-time resource management for malleable applications on many-core platforms

Proceedings of the 50th Annual Design Automation Conference
An energy and bandwidth efficient ray tracing architecture

Proceedings of the 5th High-Performance Graphics Conference
A divide and conquer based distributed run-time mapping methodology for many-core platforms

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

International Journal of High Performance Computing Applications
Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Destination-based congestion awareness for adaptive routing in 2D mesh networks

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
On supernode transformations and multithreading for the longest common subsequence problem

AusPDC '12 Proceedings of the Tenth Australasian Symposium on Parallel and Distributed Computing - Volume 127
RSVM: a region-based software virtual memory for GPU

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Expandable process networks to efficiently specify and explore task, data, and pipeline parallelism

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Optimization of interconnects between accelerators and shared memories in dark silicon

Proceedings of the International Conference on Computer-Aided Design
Boosting CUDA Applications with CPU-GPU Hybrid Computing

International Journal of Parallel Programming
A Case Study of Implementing Supernode Transformations

International Journal of Parallel Programming

Quantified Score

Hi-index	0.02

Abstract

This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. On highly parallel workloads, this provides dramatically higher performance per watt and per unit of area than out-of-order CPUs. It also greatly increases the flexibility and programmability of the architecture compared with standard GPUs. A coherent on-die second-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. Task scheduling in Larrabee is performed entirely in software rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis of those applications demonstrates Larrabee's potential for a broad range of parallel computation.
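
To make the binning idea in the abstract concrete, the following is a minimal sketch of bounding-box tile binning in Python. It is not the Larrabee renderer: the function name `bin_triangles`, the 64-pixel tile size, and the triangle representation (three `(x, y)` screen-space vertices) are all assumptions chosen for illustration. Each triangle is assigned to every screen tile its bounding box overlaps, so that each tile's list can later be processed on its own.

```python
from collections import defaultdict

TILE = 64  # assumed tile edge in pixels; illustrative, not taken from the paper


def bin_triangles(triangles, width, height, tile=TILE):
    """Map each tile (tx, ty) to the indices of triangles whose
    screen-space bounding box overlaps that tile."""
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen before converting to tile indices.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for ty in range(int(y0) // tile, int(y1) // tile + 1):
            for tx in range(int(x0) // tile, int(x1) // tile + 1):
                bins[(tx, ty)].append(index)
    return bins


# Hypothetical input: one small triangle, one straddling four tiles, one off screen.
example = [
    [(10, 10), (20, 10), (10, 20)],
    [(60, 60), (70, 60), (60, 70)],
    [(-50, -50), (-10, -50), (-50, -10)],
]
tile_bins = bin_triangles(example, width=128, height=128)
```

In this sketch, the per-tile lists are independent of one another, so separate workers could each take a tile without sharing a render target. A bounding-box test is conservative: it may place a triangle in a tile it does not actually cover, which a later per-tile rasterization step would need to handle.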