An introduction to the theory of lists
Proceedings of the NATO Advanced Study Institute on Logic of programming and calculi of discrete design
Why functional programming matters
The Computer Journal - Special issue on Lazy functional programming
Algorithm 659: Implementing Sobol's quasirandom sequence generator
ACM Transactions on Mathematical Software (TOMS)
Programming parallel algorithms
Communications of the ACM
GUM: a portable parallel implementation of Haskell
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Constraint-based array dependence analysis
ACM Transactions on Programming Languages and Systems (TOPLAS)
Evaluation of predicated array data-flow analysis for automatic parallelization
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Composing contracts: an adventure in financial engineering (functional pearl)
ICFP '00 Proceedings of the fifth ACM SIGPLAN international conference on Functional programming
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Efficient and precise array access analysis
ACM Transactions on Programming Languages and Systems (TOPLAS)
Research Directions in Parallel Functional Programming
Research Directions in Parallel Functional Programming
Systematic Extraction and Implementation of Divide-and-Conquer Parallelism
PLILP '96 Proceedings of the 8th International Symposium on Programming Languages: Implementations, Logics, and Programs
Systematic Efficient Parallelization of Scan and Other List Homomorphisms
Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing-Volume II
Analysis of Irregular Single-Indexed Array Accesses and Its Applications in Compiler Optimizations
CC '00 Proceedings of the 9th International Conference on Compiler Construction
The R-LRPD Test: Speculative Parallelization of Partially Parallel Loops
IPDPS '02 Proceedings of the 16th International Symposium on Parallel and Distributed Processing
Compiler Optimization of Implicit Reductions for Distributed Memory Multiprocessors
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Performance transformations for irregular applications
Performance transformations for irregular applications
Hybrid analysis: static & dynamic memory reference analysis
International Journal of Parallel Programming
Parallel functional programming in Eden
Journal of Functional Programming
Interprocedural parallelization analysis in SUIF
ACM Transactions on Programming Languages and Systems (TOPLAS)
Data parallel Haskell: a status report
Proceedings of the 2007 workshop on Declarative aspects of multicore programming
Automatic inversion generates divide-and-conquer parallel programs
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Paradise: a two-stage DSL embedded in Haskell
Proceedings of the 13th ACM SIGPLAN international conference on Functional programming
Application of Automatic Parallelization to Modern Challenges of Scientific Computing Industries
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
CUDA-Lite: Reducing GPU Programming Complexity
Languages and Compilers for Parallel Computing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A lightweight in-place implementation for software thread-level speculation
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Nikola: embedding compiled GPU functions in Haskell
Proceedings of the third ACM Haskell symposium on Haskell
Accelerating Haskell array codes with multicore GPUs
Proceedings of the sixth workshop on Declarative aspects of multicore programming
Loop transformations: convexity, pruning and optimization
Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A monad for deterministic parallelism
Proceedings of the 4th ACM symposium on Haskell
Automatic C-to-CUDA code generation for affine programs
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Compiling a high-level language for GPUs: (via language support for architectures and compilers)
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Logical inference techniques for loop parallelization
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
A T2 graph-reduction approach to fusion
Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing
Hi-index | 0.00 |
This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient execution on graphics-processing units (GPU). The language issues refer to preserving algorithmic invariants, e.g., inherent parallelism made explicit by map-reduce-scan functional combinators. Efficient execution is achieved by manually; applying a series of generally-applicable compiler transformations that allows the generated-OpenCL code to yield speedups as high as 70x and 540x on a commodity mobile and desktop GPU, respectively. Apart from the concrete speed-ups attained, our contributions are twofold: First, from a language perspective;, we illustrate that even state-of-the-art auto-parallelization techniques are incapable of discovering all the requisite data parallelism when rendering the functional code in Fortran-style imperative array processing form. Second, from a performance perspective;, we study which compiler transformations are necessary to map the high-level functional code to hand-optimized OpenCL code for GPU execution. We discover a rich optimization space with nontrivial trade-offs and cost models. Memory reuse in map-reduce patterns, strength reduction, branch divergence optimization, and memory access coalescing, exhibit significant impact individually. When combined, they enable essentially full utilization of all GPU cores. Functional programming has played a crucial double role in our case study: Capturing the naturally data-parallel structure of the pricing algorithm in a transparent, reusable and entirely hardware-independent fashion; and supporting the correctness of the subsequent compiler transformations to a hardware-oriented target language by a rich class of universally valid equational properties. Given the observed difficulty of automatically parallelizing imperative sequential code and the inherent labor of porting hardware-oriented and -optimized programs, our case study suggests that functional programming technology can facilitate high-level; expression of leading-edge performant portable; high-performance systems for massively parallel hardware architectures.