Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both the sequential, control-intensive phases and the data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space through which CPUs can access objects in accelerator physical memory, but not vice versa. This asymmetry allows lightweight implementations that avoid common pitfalls of symmetric distributed shared memory systems. ADSM lets programmers assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in accelerator physical memory and is transparently accessible to methods executing on CPUs. We argue that ADSM reduces programming effort for heterogeneous computing systems and enhances application portability. We present GMAC, a software implementation of ADSM built on top of CUDA in a GNU/Linux environment, and evaluate different design choices. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.
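The asymmetry described in the abstract can be illustrated with a small toy model. The sketch below is not GMAC and does not use its API. The names `Accelerator`, `SharedObject`, `run`, and `saxpy` are hypothetical, and plain Python lists stand in for accelerator physical memory. It only shows the shape of the idea: a data object is allocated once in accelerator memory, CPU code reads and writes it through the shared view without explicit copies, and accelerator code is handed accelerator-hosted objects only.

```python
class Accelerator:
    """Hypothetical stand-in for a device with its own physical memory."""

    def __init__(self):
        self.memory = {}      # handle -> buffer held in "accelerator memory"
        self._next_handle = 0

    def alloc(self, n):
        """Reserve a zero-filled buffer of n elements and return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self.memory[handle] = [0.0] * n
        return handle

    def run(self, kernel, *objects):
        """Run a kernel that may only touch accelerator-hosted objects."""
        for obj in objects:
            if not isinstance(obj, SharedObject):
                # Asymmetry: accelerator code cannot reach CPU-only data.
                raise TypeError("kernel arguments must live in accelerator memory")
        kernel(*(self.memory[obj.handle] for obj in objects))


class SharedObject:
    """CPU-side view of an object hosted in accelerator memory."""

    def __init__(self, accelerator, n):
        self._accelerator = accelerator
        self.handle = accelerator.alloc(n)
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        # The CPU reads accelerator-hosted data directly, with no explicit copy.
        return self._accelerator.memory[self.handle][i]

    def __setitem__(self, i, value):
        # The CPU writes accelerator-hosted data directly, with no explicit copy.
        self._accelerator.memory[self.handle][i] = value


def saxpy(a):
    """Build a data-parallel style kernel computing y = a * x + y."""
    def kernel(x, y):
        for i in range(len(x)):
            y[i] = a * x[i] + y[i]
    return kernel


acc = Accelerator()
x = SharedObject(acc, 4)
y = SharedObject(acc, 4)

# The CPU phase initializes the data through the shared view.
for i in range(4):
    x[i] = float(i)
    y[i] = 1.0

# The accelerator phase operates on the same objects in place.
acc.run(saxpy(2.0), x, y)

# The CPU phase reads the results back through the same shared view.
result = [y[i] for i in range(len(y))]
print(result)
```

In this toy model the CPU and the kernel share one copy of the data, so nothing needs to be kept coherent. A real system with separate CPU and accelerator memories has to do that work behind the shared view, which is where the design choices mentioned in the abstract come in.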