Financial software on GPUs: between Haskell and Fortran

Authors:
Cosmin E. Oancea;Christian Andreetta;Jost Berthold;Alain Frisch;Fritz Henglein
Affiliations:
University of Copenhagen, Copenhagen, Denmark;University of Copenhagen, Copenhagen, Denmark;University of Copenhagen, Copenhagen, Denmark;LexiFi, Paris, France;University of Copenhagen, Copenhagen, Denmark
Venue:
Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
Year:
2012


Abstract

This paper presents a real-world pricing kernel for financial derivatives and evaluates the language and compiler tool chain that would allow expressive, hardware-neutral algorithm implementation and efficient execution on graphics processing units (GPUs). The language issues concern preserving algorithmic invariants, e.g., inherent parallelism made explicit by map-reduce-scan functional combinators. Efficient execution is achieved by manually applying a series of generally applicable compiler transformations that allow the generated OpenCL code to yield speedups as high as 70x and 540x on a commodity mobile and desktop GPU, respectively. Apart from the concrete speedups attained, our contributions are twofold: First, from a language perspective, we illustrate that even state-of-the-art auto-parallelization techniques are incapable of discovering all the requisite data parallelism when the functional code is rendered in Fortran-style imperative array-processing form. Second, from a performance perspective, we study which compiler transformations are necessary to map the high-level functional code to hand-optimized OpenCL code for GPU execution. We discover a rich optimization space with nontrivial trade-offs and cost models. Memory reuse in map-reduce patterns, strength reduction, branch divergence optimization, and memory access coalescing each exhibit significant impact individually. When combined, they enable essentially full utilization of all GPU cores. Functional programming has played a crucial double role in our case study: capturing the naturally data-parallel structure of the pricing algorithm in a transparent, reusable, and entirely hardware-independent fashion; and supporting the correctness of the subsequent compiler transformations to a hardware-oriented target language by a rich class of universally valid equational properties.
Given the observed difficulty of automatically parallelizing imperative sequential code and the inherent labor of porting hardware-oriented and hardware-optimized programs, our case study suggests that functional programming technology can facilitate high-level expression of leading-edge, portable, high-performance systems for massively parallel hardware architectures.
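
To make the phrase "parallelism made explicit by map-reduce-scan functional combinators" concrete, the following is a minimal sketch in Python and is not the paper's pricing kernel. The function names (`path`, `payoff`, `price`), the additive price model, and the call-option payoff are all assumptions chosen for illustration. The point is structural: each path is built by a scan, paths are processed by a map with no cross-path dependences, and the results are combined by a reduce over an associative operator, so the data parallelism is visible in the program text rather than recovered by dependence analysis.

```python
from functools import reduce
from itertools import accumulate
from operator import add

def path(start, increments):
    """Scan: running prices along one path (hypothetical additive model)."""
    return list(accumulate(increments, add, initial=start))

def payoff(strike, prices):
    """Illustrative call-option payoff on the final price of one path."""
    return max(prices[-1] - strike, 0.0)

def price(start, strike, all_increments):
    """Map over independent paths, then reduce with an associative operator."""
    payoffs = map(lambda incs: payoff(strike, path(start, incs)), all_increments)
    total = reduce(add, payoffs, 0.0)
    return total / len(all_increments)

# Fixed, hand-written increments stand in for a quasi-random generator.
increments = [[1.0, 2.0, -0.5], [-1.0, -2.0, 0.5], [0.5, 0.5, 0.5]]
result = price(100.0, 100.0, increments)
```

Because the map has independent iterations and addition is associative, a compiler or programmer may distribute paths across GPU threads and combine partial sums in any grouping. A loop-based rendering of the same computation would require array dependence analysis to establish these facts.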