The proliferation of algorithmic trading, derivative usage, and highly leveraged hedge funds necessitates faster market Value-at-Risk (VaR) estimation to measure the severity of portfolio losses. This paper demonstrates that relying solely on advances in computer hardware to accelerate market VaR estimation overlooks significant opportunities for acceleration. We use a simulation-based delta-gamma VaR estimate and compute the loss function using Basic Linear Algebra Subprograms (BLAS). Our baseline implementation on an NVIDIA GeForce GTX 280 graphics processing unit (GPU) is a straightforward port of the CPU implementation and achieves only an 8.21x speed advantage over a quad-core Intel Core 2 Q9300 central processing unit (CPU) implementation. We demonstrate three approaches that gain additional speedup over this baseline GPU implementation. First, we reformulate the loss function to reduce the amount of computation required, achieving a 60.3x speedup. Second, we select among functionally equivalent distribution conversion modules for the best convergence rate, providing an additional 2x speedup. Third, we merge data-parallel computational kernels to remove redundant load/store operations, yielding an additional 1.85x speedup. Overall, we achieve a 148x speedup over the baseline GPU implementation, reducing the time of a VaR estimation with a standard error of 0.1% from minutes to less than one second.
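The abstract does not reproduce the loss function itself. The standard delta-gamma formulation that this family of estimators builds on approximates the portfolio loss over the VaR horizon as a quadratic form in the risk-factor changes, and defines VaR as a quantile of that loss. The following is a sketch of the conventional form; the symbols (Θ, δ, Γ, Σ) are standard notation, not taken from the paper:

```latex
% Standard delta-gamma approximation of the portfolio loss over horizon \Delta t
% (conventional symbols; the paper's exact formulation is not given in the abstract):
%   \Theta = time decay, \delta = first-order sensitivities,
%   \Gamma = second-order sensitivity matrix, \Sigma = risk-factor covariance.
L \;\approx\; -\left( \Theta\,\Delta t \;+\; \delta^{\mathsf T} \Delta S
      \;+\; \tfrac{1}{2}\, \Delta S^{\mathsf T} \Gamma\, \Delta S \right),
\qquad \Delta S \sim \mathcal{N}(\mathbf{0}, \Sigma)

% VaR at confidence level \alpha is the \alpha-quantile of the loss distribution:
\mathrm{VaR}_{\alpha} \;=\; \inf\{\, \ell \in \mathbb{R} : \mathbb{P}(L \le \ell) \ge \alpha \,\}
```

Evaluating the quadratic form per simulated scenario is a matrix-vector product plus inner products, which is why the abstract mentions BLAS. The "reformulation" step plausibly corresponds to the usual trick of diagonalizing the quadratic form offline (one eigendecomposition of Γ against Σ), so that each scenario costs O(m) rather than O(m²) in the number of risk factors m; the abstract does not state this explicitly.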
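The second and third optimizations (distribution conversion and kernel fusion to cut redundant load/store traffic) can be illustrated together. Below is a minimal CUDA sketch, not the authors' code: kernel and variable names are invented, the CUDA math-library inverse normal CDF `normcdfinvf` stands in for whichever conversion module the paper selects, and Γ is assumed already diagonalized so that each factor contributes independently.

```cuda
// Hypothetical sketch of the kernel-fusion idea described in the abstract:
// rather than one kernel converting uniform draws to normal variates and a
// second kernel evaluating the delta-gamma loss (with the normals making a
// round trip through global memory), a single fused kernel keeps the
// intermediate value in registers.
#include <cuda_runtime.h>

// --- Unfused version: normals are written to and re-read from global memory.
__global__ void uniformToNormal(const float* u, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = normcdfinvf(u[i]);   // inverse-CDF distribution conversion
}

__global__ void deltaGammaLoss(const float* z, const float* delta,
                               const float* gamma, float* loss,
                               int nScenarios, int nFactors) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nScenarios) return;
    float l = 0.0f;
    for (int k = 0; k < nFactors; ++k) {
        float dS = z[s * nFactors + k];                  // re-load from global memory
        l -= delta[k] * dS + 0.5f * gamma[k] * dS * dS;  // diagonalized Gamma
    }
    loss[s] = l;
}

// --- Fused version: conversion and loss evaluation in one kernel; the
// intermediate normal variate never leaves registers.
__global__ void fusedLoss(const float* u, const float* delta,
                          const float* gamma, float* loss,
                          int nScenarios, int nFactors) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nScenarios) return;
    float l = 0.0f;
    for (int k = 0; k < nFactors; ++k) {
        float dS = normcdfinvf(u[s * nFactors + k]);     // stays in a register
        l -= delta[k] * dS + 0.5f * gamma[k] * dS * dS;
    }
    loss[s] = l;
}
```

The fused kernel eliminates one full write and one full read of the nScenarios x nFactors array of normal variates through global memory, which is the kind of redundant load/store traffic the abstract's 1.85x figure refers to.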