Stream HPC

Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work of porting the CUDA code to HIP (see the previous article), significant progress was made in tackling the low-hanging fruit in the kernels and in addressing potential structural problems outside the kernels.

Additionally, since the last article, we’ve been in touch with the authors of the original repository. They’ve even invited us to update their repository too; for now the changes will live in our repository only. We also learnt that the group’s lead, Professor John Cavazos, passed away two years ago. We hope he would have liked that his work has been revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI: https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we needed to understand them better. With the existing GPU code we can quickly assess where the problems of each algorithm lie, and see whether we can reach high performance without too much effort. In this blog we show these steps.

As a refresher, besides porting the CUDA code to HIP, some restructuring of the code and build system was also done. Such improvements are a standard phase in all projects we do, to make sure we spend the minimum time on building, testing and benchmarking.

  1. CMake is now used to build the binaries. This allows the developer to choose their own IDE
  2. Benchmarks and Unit Tests are now implemented for each algorithm
  3. Google Benchmark and GoogleTest are used as the benchmarking and unit-test frameworks. These integrate well with our automated testing and benchmarking environment
  4. The unit tests are designed to compare the OpenMP and HIP implementations against the standard C++ implementation (see the sketch after this list)
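
As an illustration of point 4, here is a minimal sketch of such a comparison test. The pricing functions are invented placeholders, not the repository’s actual entry points:

```cpp
#include <gtest/gtest.h>
#include <cmath>
#include <vector>

// Placeholder pricing routines: in the real project these would be the
// standard C++ implementation and the HIP implementation of the same algorithm.
static std::vector<double> priceReferenceCpp(const std::vector<double>& spot) {
    std::vector<double> out(spot.size());
    for (size_t i = 0; i < spot.size(); ++i)
        out[i] = std::log(spot[i]);          // stand-in for the actual pricing math
    return out;
}

static std::vector<double> priceHip(const std::vector<double>& spot) {
    // Would copy the inputs to the GPU, launch the HIP kernel and copy back.
    return priceReferenceCpp(spot);
}

TEST(FinanceBench, HipMatchesReferenceCpp) {
    const std::vector<double> spot(1024, 100.0);    // identical input for both back-ends
    const auto reference = priceReferenceCpp(spot);
    const auto gpu = priceHip(spot);

    ASSERT_EQ(reference.size(), gpu.size());
    for (size_t i = 0; i < reference.size(); ++i)
        EXPECT_NEAR(reference[i], gpu[i], 1e-9);    // element-wise tolerance check
}
```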

The original code only measured the compute times. In the new benchmarks, compute times and transfer times are measured separately.
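
As a rough sketch of how compute and transfer times can be measured separately (this is not the repository’s actual benchmark code; the kernel and sizes are invented), HIP events can time the copy and the kernel launch independently:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Stand-in for the real pricing kernel.
__global__ void dummyKernel(double* data, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const int n = 262144;
    std::vector<double> host(n, 1.0);
    double* device = nullptr;
    hipMalloc((void**)&device, n * sizeof(double));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Transfer time: the host-to-device copy measured on its own.
    hipEventRecord(start, 0);
    hipMemcpy(device, host.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);
    float transferMs = 0.0f;
    hipEventElapsedTime(&transferMs, start, stop);

    // Compute time: the kernel launch measured on its own.
    hipEventRecord(start, 0);
    hipLaunchKernelGGL(dummyKernel, dim3(n / 256), dim3(256), 0, 0, device, n);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);
    float computeMs = 0.0f;
    hipEventElapsedTime(&computeMs, start, stop);

    printf("transfer: %.3f ms, compute: %.3f ms\n", transferMs, computeMs);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(device);
    return 0;
}
```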

Note: for the new benchmarks we used more recent AMD and Nvidia drivers (ROCm 3.7 and CUDA 10.2).

Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)

Below are the results of the original code ported to HIP without any initial optimisations.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | – | 437.140 | – | 877.425 | – |
| Titan V | 24.712 | 25.295 | 542.859 | 543.930 | 1595.674 | 1597.240 |
| GTX 980 | 75.832 | 76.852 | 1754.586 | 1755.851 | 5120.555 | 5127.255 |
| Vega 20 | 11.07 | 12.161 | 19.408 | 22.578 | 36.112 | 40.575 |
| MI25 (Vega 10) | 12.91 | 13.964 | 25.247 | 26.662 | 49.551 | 51.983 |
| S9300 (Fiji) | 42.909 | 42.839 | 85.739 | 89.463 | 169.858 | 174.248 |

Benchmark results of the original Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms).

The Monte Carlo algorithm was observed to be compute-bound, which made it easy to identify the low-hanging fruit in the kernel.

  • The original implementation initialised the random states in a separate kernel; this initialisation can actually be done in the same compute kernel
  • Instead of calculating the normal distribution of the random number manually, it’s faster to use the HIP-provided function (which we built for AMD)
  • On Nvidia GPUs, the default cuRAND state (XORWOW) is pretty slow; switching to Philox improves performance significantly (see the sketch after this list)
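
A minimal sketch of what such a kernel can look like (the path/payoff math is omitted and all names are invented; this is not the repository’s actual code): the Philox generator state is initialised inside the compute kernel itself, and the library-provided normal-distribution routine produces the variates. With hipRAND these calls map onto cuRAND on Nvidia hardware, where the equivalent state type is curandStatePhilox4_32_10_t.

```cpp
#include <hip/hip_runtime.h>
#include <hiprand/hiprand_kernel.h>

// Sketch of a Monte Carlo kernel: random-state initialisation and sampling
// happen in the same kernel, using the Philox generator and the
// library-provided normal distribution.
__global__ void monteCarloKernel(double* results, int nPaths,
                                 unsigned long long seed) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nPaths) return;

    hiprandStatePhilox4_32_10_t state;
    hiprand_init(seed, tid, 0, &state);                  // no separate init kernel

    double sum = 0.0;
    for (int step = 0; step < 256; ++step) {
        const double z = hiprand_normal_double(&state);  // N(0,1) sample
        sum += z;                                        // placeholder for the path update
    }
    results[tid] = sum / 256.0;
}
```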

A big speed-up can be observed on the Nvidia GPUs, and a considerable speed-up can be observed on the AMD GPUs as well.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 215.311 | – | 437.140 | – | 877.425 | – |
| Titan V | 9.467 | 9.831 | 17.638 | 18.466 | 34.281 | 35.765 |
| GTX 980 | 22.578 | 23.223 | 44.923 | 46.013 | 90.108 | 100.584 |
| Vega 20 | 4.456 | 5.003 | 8.570 | 10.117 | 17.003 | 19.626 |
| MI25 (Vega 10) | 9.118 | 9.663 | 17.995 | 18.918 | 35.760 | 42.621 |
| S9300 (Fiji) | 17.802 | 18.477 | 35.086 | 36.795 | 69.746 | 73.015 |

Benchmark results of the improved Monte Carlo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Less is better.

Black Scholes

Black-Scholes-Merton Process with Analytic European Option engine

Below are the results of the original code ported to HIP without any initial optimisations.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | – | 22.194 | – | 43.994 | – |
| Titan V | 0.095 | 5.181 | 0.407 | 25.111 | 0.792 | 49.890 |
| GTX 980 | 2.214 | 7.294 | 10.662 | 34.534 | 19.946 | 68.626 |
| Vega 20 | 0.201 | 7.105 | 0.894 | 33.693 | 1.596 | 70.732 |
| MI25 (Vega 10) | 1.090 | 7.657 | 3.453 | 41.343 | 6.000 | 74.387 |
| S9300 (Fiji) | 1.048 | 11.524 | 4.635 | 53.246 | 9.170 | 129.262 |

Benchmark results of the original Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms).

The Black-Scholes algorithm was observed to be memory-bound, so there were some low-hanging fruits and structural problems to tackle.

  • Use the CUDA/HIP-provided erf function instead of the custom error function (see the sketch below)
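
In other words, a hand-rolled approximation of the cumulative normal distribution can be replaced by the error function that both CUDA and HIP provide as a built-in device function. A minimal sketch (not the repository’s exact code):

```cpp
#include <hip/hip_runtime.h>

// Cumulative standard normal distribution via the built-in error function:
// N(x) = 0.5 * erfc(-x / sqrt(2)), instead of a hand-rolled polynomial approximation.
__device__ double cumulativeNormal(double x) {
    return 0.5 * erfc(-x * 0.7071067811865476);   // 0.7071... = 1 / sqrt(2)
}
```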

The first step was to tackle the low hanging fruits in the kernel. A decent speed-up in the compute times could be observed on most GPUs (except for the Titan V).

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | – | 22.194 | – | 43.994 | – |
| Titan V | 0.084 | 5.092 | 0.367 | 24.549 | 0.722 | 49.956 |
| GTX 980 | 1.458 | 6.281 | 6.402 | 30.584 | 12.790 | 61.965 |
| Vega 20 | 0.114 | 6.995 | 0.397 | 33.571 | 0.775 | 69.858 |
| MI25 (Vega 10) | 0.517 | 6.929 | 1.836 | 30.490 | 3.021 | 64.961 |
| S9300 (Fiji) | 0.423 | 10.621 | 1.931 | 48.417 | 3.733 | 81.898 |

Benchmark results of the improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms).

With the algorithm being memory-bound, the next step was to tackle the structural problems.

  • Given that the original code required an Array of Structs to be transferred to the GPU, the next step was to restructure the input data into linear arrays (illustrated below)
  • This prevents transferring an entire struct when not all of its fields are used
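
A simplified illustration of the restructuring (the field names are invented; the real FinanceBench option data has different members):

```cpp
// Before: Array of Structs. Every transfer copies whole structs, including
// fields the kernel never reads.
struct OptionAoS {
    double spot, strike, rate, volatility, maturity;
    double otherFields[4];   // example of data the kernel does not use
};

// After: one contiguous array per input field (Struct of Arrays). Each
// transfer now contains only the values the kernel actually needs, and
// neighbouring threads read neighbouring memory locations.
struct OptionsSoA {
    double* spot;
    double* strike;
    double* rate;
    double* volatility;
    double* maturity;
};
```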

The results can be found below, where transfer times on all GPUs improved.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 5.005 | – | 22.194 | – | 43.994 | – |
| Titan V | 0.068 | 3.937 | 0.286 | 18.258 | 0.565 | 36.479 |
| GTX 980 | 1.290 | 5.096 | 6.387 | 24.337 | 12.758 | 48.578 |
| Vega 20 | 0.121 | 5.067 | 0.447 | 25.809 | 0.827 | 47.541 |
| MI25 (Vega 10) | 0.506 | 4.861 | 1.841 | 23.580 | 3.115 | 53.006 |
| S9300 (Fiji) | 0.444 | 7.416 | 2.002 | 36.859 | 3.922 | 64.056 |

Benchmark results of the further improved Black Scholes code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Here we changed the data structures. Less is better.

Repo

Securities repurchase agreement

Below are the results of the original code ported to HIP without any initial optimisations.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 186.928 | – | 369.718 | – | 732.446 | – |
| Titan V | 19.678 | 32.241 | 35.727 | 60.951 | 70.673 | 120.402 |
| GTX 980 | 387.308 | 402.682 | 767.263 | 793.159 | 1520.351 | 1578.572 |
| Vega 20 | 14.771 | 37.174 | 28.595 | 69.743 | 56.699 | 131.652 |
| MI25 (Vega 10) | 46.461 | 71.191 | 91.742 | 143.673 | 182.137 | 277.597 |
| S9300 (Fiji) | 77.615 | 107.822 | 153.334 | 217.205 | 306.206 | 418.602 |

Benchmark results of the original Repo code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). The Repo code did not allow for easy improvements and needs more extensive rewriting.

The Repo algorithm was observed to be compute-bound, but it also relies on pure double-precision operations. There were no obvious low-hanging fruits in the kernel, and the structure of the data was found to be rather complex (a mixture of Struct-of-Arrays and Array-of-Structs that are intertwined). Additionally, there are so many transfer calls for the different inputs and outputs that saturating the transfers with multiple non-blocking streams isn’t effective. Also, the current CUDA/HIP implementation works best on GPUs with good double-precision performance.

There are improvements possible, but these need a larger effort.

Bonds

Fixed-rate bond valuation with flat forward curve

Below are the results of the original code ported to HIP without any initial optimisations.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | – | 482.187 | – | 952.058 | – |
| Titan V | 31.918 | 49.618 | 61.209 | 97.502 | 123.750 | 195.225 |
| GTX 980 | 746.728 | 1117.349 | 1494.679 | 2233.761 | 2976.876 | 4470.009 |
| Vega 20 | 40.112 | 66.460 | 77.123 | 127.623 | 152.657 | 250.067 |
| MI25 (Vega 10) | 141.908 | 215.855 | 278.618 | 425.969 | 553.423 | 844.268 |
| S9300 (Fiji) | 229.011 | 340.212 | 451.539 | 699.059 | 891.284 | 1361.436 |

Benchmark results of the original Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms).

The Bonds algorithm was observed to be even more compute-bound than the Repo algorithm, and it also relies on pure double-precision operations. The same problems as with the Repo algorithm were observed: no low-hanging fruit could easily be identified, and the structure of the data is complex. That said, unlike the Repo algorithm, there aren’t as many transfers of inputs and outputs, making it possible to use multiple streams.
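
As a minimal sketch of that idea (buffer names are invented and error handling is omitted; this is not the actual Bonds code), the independent input buffers can be copied with hipMemcpyAsync on two non-blocking streams, so the transfers can overlap with each other and with kernel execution:

```cpp
#include <hip/hip_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(double);

    // Pinned host buffers, so hipMemcpyAsync can actually run asynchronously.
    double *hostA = nullptr, *hostB = nullptr;
    hipHostMalloc((void**)&hostA, bytes, hipHostMallocDefault);
    hipHostMalloc((void**)&hostB, bytes, hipHostMallocDefault);

    double *devA = nullptr, *devB = nullptr;
    hipMalloc((void**)&devA, bytes);
    hipMalloc((void**)&devB, bytes);

    // Two non-blocking streams: independent inputs go to different streams.
    hipStream_t stream0, stream1;
    hipStreamCreateWithFlags(&stream0, hipStreamNonBlocking);
    hipStreamCreateWithFlags(&stream1, hipStreamNonBlocking);

    hipMemcpyAsync(devA, hostA, bytes, hipMemcpyHostToDevice, stream0);
    hipMemcpyAsync(devB, hostB, bytes, hipMemcpyHostToDevice, stream1);

    // ... launch the kernel(s) on stream0/stream1 here ...

    hipStreamSynchronize(stream0);
    hipStreamSynchronize(stream1);

    hipStreamDestroy(stream0);
    hipStreamDestroy(stream1);
    hipFree(devA);
    hipFree(devB);
    hipHostFree(hostA);
    hipHostFree(hostB);
    return 0;
}
```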

The results can be found below, where 2 streams are used to transfer all the data.

| | 262144 Compute | 262144 + Transfers | 524288 Compute | 524288 + Transfers | 1048576 Compute | 1048576 + Transfers |
|---|---|---|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 (OpenMP) | 241.248 | – | 482.187 | – | 952.058 | – |
| Titan V | 31.918 | 45.988 | 61.209 | 89.198 | 123.750 | 178.688 |
| GTX 980 | 746.728 | 770.180 | 1494.679 | 1527.538 | 2976.876 | 3043.102 |
| Vega 20 | 40.112 | 59.216 | 77.123 | 113.009 | 152.657 | 216.965 |
| MI25 (Vega 10) | 141.908 | 156.164 | 278.618 | 310.922 | 553.423 | 637.924 |
| S9300 (Fiji) | 229.011 | 256.373 | 451.539 | 493.679 | 891.284 | 981.604 |

Benchmark results of the improved Bonds code on a selected set of GPUs and a dual CPU, measured in milliseconds (ms). Less is better.

Next steps

The changes described above produced good results: speed-ups could be observed across all algorithms except Repo. In combination with newer drivers from AMD and Nvidia, general improvements can also be seen compared to the results obtained in the previous article.

That said, there is a bug in AMD’s current drivers that makes data transfers slower; we will update this blog with new results once this is fixed in a future driver release.

What’s next? The next step is to go after the high-hanging fruit in both the CPU and GPU implementations of the algorithms, as we’ve hit the limit of optimising the current implementations.

Milestones we have planned:

  1. Get it started + low-hanging fruit in the kernels (Done)
  2. Looking for structural problems outside the kernels + fixes (Done)
  3. High-hanging fruit for both CPU and GPU
  4. OpenCL / SYCL port
  5. Extending algorithms. Any financial company can sponsor such a development

What we can do for you

Finance algorithms on GPUs are often not really optimised for performance, which is quite unexpected. A company that can react in minutes instead of a day is more competitive, and especially in finance this is crucial.

As you have seen, we achieved considerable speed-ups with a relatively small investment. When we design code from scratch, we can move in the right direction even more quickly. The difficulty with financial software is that it often needs a holistic approach.

We can help your organization:

  • Select the best hardware
  • Build libraries that work with existing software, even Excel
  • Architect larger scale software with performance in mind
  • Improve performance of existing algorithms

Feel free to contact us for inquiries, but also about sponsoring algorithms for the benchmark.