Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Authors:
Shih-wei Liao;Zhaohui Du;Gansha Wu;Guei-Yuan Lueh
Affiliations:
Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation
Venue:
Proceedings of the International Symposium on Code Generation and Optimization
Year:
2006

Citing 20
Cited 39

Interprocedural dependence analysis and parallelization

SIGPLAN '86 Proceedings of the 1986 SIGPLAN symposium on Compiler construction
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
Detecting coarse-grain parallelism using an interprocedural parallelizing compiler

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Maximizing parallelism and minimizing synchronization with affine transforms

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Maximizing parallelism and minimizing synchronization with affine partitions

Parallel Computing - Special issues on languages and compilers for parallel computers
A bandwidth-efficient architecture for media processing

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
FORTRAN 95 Handbook

FORTRAN 95 Handbook
Ray tracing on programmable graphics hardware

Proceedings of the 29th annual conference on Computer graphics and interactive techniques
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Maximizing Multiprocessor Performance with the SUIF Compiler

Computer
How Multimedia Workloads Will Change Processor Design

Computer
The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs

IEEE Micro
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Power Efficient Processor Architecture and The Cell Processor

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture

Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Hierarchical coarse-grained stream compilation for software defined radio

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Optimizing scientific application loops on stream processors

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Energy efficient streaming applications with guaranteed throughput on MPSoCs

EMSOFT '08 Proceedings of the 8th ACM international conference on Embedded software
Exploiting loop-dependent stream reuse for stream processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

Languages and Compilers for Parallel Computing
Stream processing for fast and efficient rotated Haar-like features using rotated integral images

International Journal of Intelligent Systems Technologies and Applications
Stream Compilation for Real-Time Embedded Multicore Systems

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Mapping stream programs onto heterogeneous multiprocessor systems

CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Minimizing communication in rate-optimal software pipelining for stream programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
Partitioning streaming parallelism for multi-cores: a machine learning based approach

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Compilation of stream programs for multicore processors that incorporate scratchpad memories

Proceedings of the Conference on Design, Automation and Test in Europe
memCUDA: map device memory to host memory on GPGPU platform

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Memory Latency Reduction via Thread Throttling

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Streaming Data Movement for Real-Time Image Analysis

Journal of Signal Processing Systems
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Compilation of stream programs onto scratchpad memory based embedded multicore processors through retiming

Proceedings of the 48th Design Automation Conference
PLDS: Partitioning linked data structures for parallelism

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Parallelizing user-defined and implicit reductions globally on multiprocessors

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Mapping streaming languages to general purpose processors through vectorization

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Characteristics of workloads using the pipeline programming model

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Unrolling and retiming of stream applications onto embedded multicore processors

Proceedings of the 49th Annual Design Automation Conference
Integrating Memory Optimization with Mapping Algorithms for Multi-Processors System-on-Chip

ACM Transactions on Embedded Computing Systems (TECS)
Dynamic scheduling of stream programs on embedded multi-core processors

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Sigma*: symbolic learning of input-output specifications

POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Mapping of streaming applications considering alternative application specifications

ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
Kernel Partitioning of Streaming Applications: A Statistical Approach to an NP-complete Problem

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Combining module selection and replication for throughput-driven streaming programs

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Using machine learning to partition streaming programs

ACM Transactions on Architecture and Code Optimization (TACO)
Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Combining computation and communication optimizations in system synthesis for streaming applications

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Quantified Score

Hi-index	0.00

Visualization

Abstract

Multicore processors are about to become prevalent in the PC world. Meanwhile, over 90% of the computing cycles are estimated to be consumed by streaming media applications [24]. Although stream programming exposes parallelism naturally, we found that achieving high performance on multiprocessors is challenging. Therefore, we develop a parallel compiler for the Brook streaming language with aggressive data and computation transformations. First, we formulate fifteen Brook stream operators in terms of systems of inequalities. Our compiler optimizes the modeled operators to improve memory footprint and performance. Second, the stream computation including both kernels and operators is mapped to the affine partitioning model by modeling each kernel as an implicit loop nest over stream elements. Note that our general abstraction is not limited to Brook. Our modeling and transformations yield high performance on uniprocessors as well. The geometric mean of speedups is 4.7 on ten streaming applications on a Xeon. On multiprocessors, we show that exploiting the standard intra-kernel data parallelism is inferior to our general modeling. The former yields a speedup of 1.5 for ten applications on a 4-way Xeon, while the latter achieves a speedup of 6.4 over the same baseline. We show that our compiler effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues.