Sequoia: programming the memory hierarchy

Authors:
Kayvon Fatahalian;Daniel Reiter Horn;Timothy J. Knight;Larkhoon Leem;Mike Houston;Ji Young Park;Mattan Erez;Manman Ren;Alex Aiken;William J. Dally;Pat Hanrahan
Affiliations:
Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Citing 24
Cited 123

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
Chores: enhanced run-time support for shared-memory parallel computing

ACM Transactions on Computer Systems (TOCS)
Parallel programming in Split-C

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Recursion leads to automatic variable blocking for dense linear-algebra algorithms

IBM Journal of Research and Development
Co-array Fortran for parallel programming

ACM SIGPLAN Fortran Forum
A fast Fourier transform compiler

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
An annotation language for optimizing software libraries

Proceedings of the 2nd conference on Domain-specific languages
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
OpenMP: An Industry-Standard API for Shared-Memory Programming

IEEE Computational Science & Engineering
Program Structuring for Effective Parallel Portability

IEEE Transactions on Parallel and Distributed Systems
External memory algorithms

Handbook of massive data sets
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
I/O complexity: The red-blue pebble game

STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Space-limited procedures: a methodology for portable high-performance

PMMP '95 Proceedings of the conference on Programming Models for Massively Parallel Computers
The Imagine Stream Processor

ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
A programming system for the imagine media processor

A programming system for the imagine media processor
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
The Stream Virtual Machine

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Cache oblivious stencil computations

Proceedings of the 19th annual international conference on Supercomputing
X10: an object-oriented approach to non-uniform cluster computing

OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
ClawHMMER: A Streaming HMMer-Search Implementation

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Programming for parallelism and locality with hierarchically tiled arrays

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Compilation for explicitly managed memory hierarchies

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell

Journal of VLSI Signal Processing Systems
Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

Parallel Computing
Parallelization schemes for memory optimization on the cell processor: a case study of image processing algorithm

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
CellSs: making it easier to program the cell broadband engine processor

IBM Journal of Research and Development
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Programming with tiles

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
A portable runtime interface for multi-level memory hierarchies

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Merge: a programming model for heterogeneous multi-core systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Data exploration of turbulence simulations using a database cluster

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Multi-level tiling: M for the price of one

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Cell-SWat: modeling and scheduling wavefront computations on the cell broadband engine

Proceedings of the 5th conference on Computing frontiers
DMA-based prefetching for I/O-intensive workloads on the cell architecture

Proceedings of the 5th conference on Computing frontiers
Visions for application development on hybrid computing systems

Parallel Computing
Orchestrating data transfer for the cell/B.E. processor

Proceedings of the 22nd annual international conference on Supercomputing
Efficient computation of sum-products on GPUs through software-managed cache

Proceedings of the 22nd annual international conference on Supercomputing
Harmony: an execution model and runtime for heterogeneous many core systems

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations

International Journal of Parallel, Emergent and Distributed Systems
A lightweight streaming layer for multicore execution

ACM SIGARCH Computer Architecture News
A Buffered-Mode MPI Implementation for the Cell BE™ Processor

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
A Real-Time Programming Model for Heterogeneous MPSoCs

SAMOS '08 Proceedings of the 8th international workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Radioastronomy Image Synthesis on the Cell/B.E.

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Revisiting SIMD Programming

Languages and Compilers for Parallel Computing
A tuning framework for software-managed memory hierarchies

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
COMIC: a coherent shared memory interface for cell be

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Comparative evaluation of memory models for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
SPENK: adding another level of parallelism on the cell broadband engine

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
Certified Reasoning in Memory Hierarchies

APLAS '08 Proceedings of the 6th Asian Symposium on Programming Languages and Systems
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Accuracy and performance of graphics processors: A Quantum Monte Carlo application case study

Parallel Computing
Vector stream processing for effective application of heterogeneous parallelism

Proceedings of the 2009 ACM symposium on Applied Computing
Celling SHIM: compiling deterministic concurrency to a heterogeneous multicore

Proceedings of the 2009 ACM symposium on Applied Computing
Scheduling dynamic parallelism on accelerators

Proceedings of the 6th ACM conference on Computing frontiers
Evaluating multi-core platforms for HPC data-intensive kernels

Proceedings of the 6th ACM conference on Computing frontiers
Compile-Time and Run-Time Issues in an Auto-Parallelisation System for the Cell BE Processor

Euro-Par 2008 Workshops - Parallel Processing
A Unified Runtime System for Heterogeneous Multi-core Architectures

Euro-Par 2008 Workshops - Parallel Processing
DBDB: optimizing DMATransfer for the cell be architecture

Proceedings of the 23rd international conference on Supercomputing
Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Proceedings of the 36th annual international symposium on Computer architecture
Hierarchical Task-Based Programming With StarSs

International Journal of High Performance Computing Applications
Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Exposing non-standard architectures to embedded software using compile-time virtualisation

CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU

International Journal of Computational Science and Engineering
Towards a framework for abstracting accelerators in parallel applications: experience with cell

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Design and use of htalib: a library for hierarchically tiled arrays

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
State-of-the-art in heterogeneous computing

Scientific Programming
Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Low depth cache-oblivious algorithms

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Supporting islands of coherency for highly-parallel embedded architectures using compile-time virtualisation

Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems
FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability

SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
MapReduce for the cell broadband engine architecture

IBM Journal of Research and Development
A locality model for the real-time specification for Java

Proceedings of the 8th International Workshop on Java Technologies for Real-Time and Embedded Systems
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A balanced programming model for emerging heterogeneous multicore systems

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Task superscalar: using processors as functional units

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Efficient OpenMP data mapping for multicore platforms with vertically stacked memory

Proceedings of the Conference on Design, Automation and Test in Europe
Recursion-driven parallel code generation for multi-core platforms

Proceedings of the Conference on Design, Automation and Test in Europe
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Towards metaprogramming for parallel systems on a chip

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Automatic calibration of performance models on heterogeneous multicore architectures

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Algorithm engineering: bridging the gap between algorithm theory and practice

Algorithm engineering: bridging the gap between algorithm theory and practice
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
Task Superscalar: An Out-of-Order Task Pipeline

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
The future of microprocessors

Communications of the ACM
Programming the memory hierarchy revisited: supporting irregular parallelism in sequoia

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
SpiceC: scalable parallelism via implicit copying and explicit commit

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
DDM-VMc: the data-driven multithreading virtual machine for the cell processor

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A parallel numerical solver using hierarchically tiled arrays

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Unified parallel C for GPU clusters: language extensions and compiler implementation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Targeting complex embedded architectures by combining the multicore communications API (mcapi) with compile-time virtualisation

Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

The Journal of Supercomputing
Programming heterogeneous clusters with accelerators using object-based programming

Scientific Programming
Parallelization schemes for memory optimization on the cell processor: a case study on the Harris corner detector

Transactions on high-performance embedded architectures and compilers III
Improving programmability of heterogeneous many-core systems via explicit platform descriptions

Proceedings of the 4th International Workshop on Multicore Software Engineering
A programming model for deterministic task parallelism

Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain

Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Scalable heterogeneous parallelism for atmospheric modeling and simulation

The Journal of Supercomputing
A runtime implementation of OpenMP tasks

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Automatic analysis of DMA races using model checking and k-induction

Formal Methods in System Design
Green challenges to system software in data centers

Frontiers of Computer Science in China
A fast, GPU based, dictionary attack to OpenPGP secret keyrings

Journal of Systems and Software
Obsidian: a domain specific embedded language for parallel programming of graphics processors

IFL'08 Proceedings of the 20th international conference on Implementation and application of functional languages
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Using explicit platform descriptions to support programming of heterogeneous many-core systems

Parallel Computing
Programmable data dependencies and placements

DAMP '12 Proceedings of the 7th workshop on Declarative aspects and applications of multicore programming
DOJ: dynamically parallelizing object-oriented programs

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
BDDT: block-level dynamic dependence analysis for deterministic task-based parallelism

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Hierarchical place trees: a portable abstraction for task parallelism and data movement

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Tagged procedure calls (TPC): efficient runtime support for task-based parallelism on the cell processor

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Offload – automating code migration to heterogeneous multicore systems

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Towards a codelet-based runtime for exascale computing: position paper

Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
Decoupling algorithms from schedules for easy optimization of image processing pipelines

ACM Transactions on Graphics (TOG) - SIGGRAPH 2012 Conference Proceedings
The myrmics memory allocator: hierarchical, message-passing allocation for global address spaces

Proceedings of the 2012 international symposium on Memory Management
Integrated in-system storage architecture for high performance computing

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
CHARM: a composable heterogeneous accelerator-rich microprocessor

Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
Extendable pattern-oriented optimization directives

ACM Transactions on Architecture and Code Optimization (TACO)
A multi-objective auto-tuning framework for parallel codes

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Legion: expressing locality and independence with logical regions

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Designing a unified programming model for heterogeneous machines

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scheduling streaming applications on a complex multicore platform

Concurrency and Computation: Practice & Experience
Parallelization strategies for the points of interests algorithm on the cell processor

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
A synchronous mode MPI implementation on the cell BE™ architecture

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
A type-based approach to separating protocol from application logic: a case study in hybrid computer programming

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

The Journal of Supercomputing
Abstractions and Middleware for Petascale Computing and Beyond

International Journal of Distributed Systems and Technologies
Work-stealing with configurable scheduling strategies

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Implementing OmpSs support for regions of data in architectures with multiple address spaces

Proceedings of the 27th international ACM conference on International conference on supercomputing
Locality-aware task management for unstructured parallelism: a quantitative limit study

Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Language support for dynamic, hierarchical data partitioning

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Analysis of Recursively Parallel Programs

ACM Transactions on Programming Languages and Systems (TOPLAS)
RSVM: a region-based software virtual memory for GPU

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms

Microprocessors & Microsystems
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.02

Abstract

We present Sequoia, a programming language designed to facilitate the development of memory-hierarchy-aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communication vertically through the machine and to localize computation to particular memory locations within it. We have implemented a complete programming system, including a compiler and runtime systems for Cell processor-based blade systems and distributed memory clusters, and demonstrate efficient performance running Sequoia programs on both of these platforms.