If you’re into computational finance, you might have heard of FinanceBench.
It’s a benchmark developed at the University of Delaware, aimed at those who work with financial code and want to see how certain code paths can be targeted for accelerators. It uses the original QuantLib software framework and samples to port four existing quantitative-finance applications: Black-Scholes, Monte-Carlo, Bonds, and Repo. Each of them can be run on the CPU and the GPU.
The problem is that it has not been maintained for 5 years, and there were good opportunities for improvement. Even though the paper was already written, we think the benchmark can still be of good use within computational finance. As we were looking for a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point. We emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.
The first version of the code will be released on GitHub early next month. Below we discuss some design choices and preliminary results.
Of course, the first step was selecting good algorithms that both need a lot of compute and are representative of the finance industry. We could not have done this as well as the research team did, which was the main reason to choose this project.
The work done for making the code, according to the team itself:
The original QuantLib samples were written in C++. QuantLib is a C++ library. Unfortunately, languages like OpenCL, CUDA, and OpenACC cannot directly operate on C++ data structures, and virtual function calls are not possible. Because of this problem, all of the existing code had to be “flattened” to C code. We used a debugger to step through the code paths of each application and see what lines of QuantLib code are executed for each application, and manually flattened all of the QuantLib code.
This is typical work when porting CPU code to the GPU. Complex C++ code can be very hard to bend in directions it was not designed for. Simplification is then the first thing to do, so the code can be split into manageable parts. As this can be time-consuming, we were happy it was already done, though it is often easier to also have the original simplified CPU code for reference.
Our focus differs from that of a research group, and logically this results in code changes. We’re now in phase 1.
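To make the idea of flattening concrete, here is a minimal sketch. The class and function names (`Payoff`, `FlatPayoff`, `flat_payoff_value`) are hypothetical and are not taken from QuantLib or FinanceBench; they only illustrate how a virtual call can be turned into plain data plus a `switch`.

```cpp
#include <algorithm>
#include <cassert>

// Before: object-oriented CPU code. These classes are hypothetical and only
// mimic the style of a C++ library; they are not QuantLib's actual types.
struct Payoff {
    virtual ~Payoff() = default;
    virtual double value(double spot) const = 0;
};

struct CallPayoff : Payoff {
    double strike;
    explicit CallPayoff(double k) : strike(k) {}
    double value(double spot) const override { return std::max(spot - strike, 0.0); }
};

struct PutPayoff : Payoff {
    double strike;
    explicit PutPayoff(double k) : strike(k) {}
    double value(double spot) const override { return std::max(strike - spot, 0.0); }
};

// After: plain data plus a tag, with the virtual dispatch replaced by a switch.
// A struct like this holds no pointers or vtable, so it can be copied as raw bytes.
enum PayoffKind { PAYOFF_CALL = 0, PAYOFF_PUT = 1 };

struct FlatPayoff {
    int kind;       // one of PayoffKind
    double strike;
};

inline double flat_payoff_value(const FlatPayoff* p, double spot) {
    assert(p != nullptr);
    switch (p->kind) {
        case PAYOFF_CALL: return spot > p->strike ? spot - p->strike : 0.0;
        case PAYOFF_PUT:  return p->strike > spot ? p->strike - spot : 0.0;
        default:          return 0.0;  // unknown tag: treated as worthless in this sketch
    }
}
```

Calling `flat_payoff_value(&flat, spot)` on a `FlatPayoff{PAYOFF_CALL, 100.0}` gives the same result as `CallPayoff(100.0).value(spot)`. Keeping both versions side by side is one way to retain the CPU reference mentioned above.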
As it’s difficult to focus on multiple languages while improving performance and project quality, we focused on OpenMP and CUDA first. Then we ported the project to HIP and made sure that the translation from CUDA to HIP could be done quickly. This way we could make sure CPUs and the fastest GPUs could be benchmarked, leaving OpenCL and OpenACC out for now. We have no intention of keeping HMPP in, and have chosen to introduce SYCL to prepare for Intel Xe. We are more interested in benchmarking different types of algorithms on all kinds of hardware than in comparing programming languages.
The project has also been cleaned up, CMake was introduced, and Google Benchmark was added, to make it easier for us to work on. We did not look into QuantLib for improvements or read papers on the latest advancements. So the goal was really to get it started.
We picked a few broadly available AMD and Nvidia GPUs, and chose a dual-socket Xeon (40 cores in total) for the CPU benchmarks. The times INCLUDE the transfer times for the GPUs. The original benchmark unfortunately showed compute times only, so we might get some Nvidia Kepler GPUs back in a server to re-benchmark these.
## Monte Carlo

QMC (Sobol) Monte-Carlo method (Equity Option Example)
Monte Carlo is often seen when HPC is applied in the finance domain. Its strengths are the easy interpretation and straightforward implementation, which make it easy to explain to HPC developers while showing the performance advantage to quants. It returns a distribution of future asset prices by running thousands to millions of simulations.
The results below are from the code as provided, with a direct port to HIP to include AMD GPUs. As you can see, the Titan V and GTX980 numbers don’t look good.
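As a minimal sketch of the technique, the function below prices a European call by simulating terminal prices under geometric Brownian motion. It is an assumption-laden illustration: it uses the pseudo-random `std::mt19937_64` generator instead of the Sobol quasi-random sequence used by the benchmark, and the names `OptionParams` and `mc_european_call` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical parameter bundle for this sketch.
struct OptionParams {
    double spot;        // current asset price
    double strike;      // strike price
    double rate;        // risk-free rate, continuously compounded
    double volatility;  // annualised volatility
    double maturity;    // time to expiry in years
};

// Estimates the price of a European call as the discounted mean payoff over
// `paths` simulated terminal prices. Pseudo-random, not Sobol.
double mc_european_call(const OptionParams& o, std::size_t paths, std::uint64_t seed) {
    assert(paths > 0 && o.volatility >= 0.0 && o.maturity > 0.0);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gauss(0.0, 1.0);

    const double drift = (o.rate - 0.5 * o.volatility * o.volatility) * o.maturity;
    const double diffusion = o.volatility * std::sqrt(o.maturity);

    double payoff_sum = 0.0;
    for (std::size_t i = 0; i < paths; ++i) {
        // One simulated price at maturity; every path is independent of the others.
        const double terminal = o.spot * std::exp(drift + diffusion * gauss(rng));
        payoff_sum += std::max(terminal - o.strike, 0.0);
    }
    return std::exp(-o.rate * o.maturity) * payoff_sum / static_cast<double>(paths);
}
```

Because the paths are independent, the loop body is what typically ends up in a GPU kernel or an OpenMP parallel loop, with each thread drawing from its own part of the random sequence.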
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.140 | 877.425 |
| Titan V | 25.162 | 544.829 | 1599.428 |
| GTX980 | 76.456 | 1753.816 | 5120.598 |
| Vega 20 | 15.286 | 30.140 | 62.110 |
| MI25 (Vega 10) | 13.694 | 26.733 | 52.971 |
| s9300 (Fiji) | 25.853 | 51.403 | 98.484 |

Here are the results after fixing the obvious problems and low-hanging fruit. This benefited the Nvidia GPUs a lot, but also the AMD GPUs. There was no low-hanging fruit in the OpenMP code, so no speedup there.

| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 215.311 | 437.14 | 877.425 |
| Titan V | 9.885 | 18.501 | 35.714 |
| GTX980 | 23.427 | 45.81 | 91.721 |
| Vega 20 | 10.864 | 21.467 | 42.995 |
| MI25 (Vega 10) | 10.857 | 21.147 | 41.214 |
| s9300 (Fiji) | 20.806 | 41.633 | 80.465 |

## Black Scholes
Black-Scholes-Merton Process with Analytic European Option engine
Black Scholes models how the price of financial instruments such as stocks varies over time, and uses the volatility of the underlying asset to derive the price of a call option. Again, it is compute-intensive.
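For reference, the standard closed-form Black-Scholes price of a European call can be sketched in a few lines. This is the textbook formula, not the flattened QuantLib engine used in the benchmark, and the function names `norm_cdf` and `black_scholes_call` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function, via the complementary error function.
inline double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Textbook Black-Scholes price of a European call without dividends.
// rate is continuously compounded, maturity is in years.
double black_scholes_call(double spot, double strike, double rate,
                          double volatility, double maturity) {
    assert(spot > 0.0 && strike > 0.0 && volatility > 0.0 && maturity > 0.0);
    const double sd = volatility * std::sqrt(maturity);
    const double d1 = (std::log(spot / strike)
                       + (rate + 0.5 * volatility * volatility) * maturity) / sd;
    const double d2 = d1 - sd;
    return spot * norm_cdf(d1) - strike * std::exp(-rate * maturity) * norm_cdf(d2);
}
```

For example, `black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)` evaluates to roughly 10.45. Each option is priced independently from a handful of inputs, so a batch of millions of options needs little arithmetic per byte moved to the device.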
The performance of the original code looked good at first sight, but the transfers took 95% of the time. That’s for the next phase.
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.959 | 38.852 | 77.443 |
| GTX980 | 10.051 | 48.038 | 95.568 |
| Vega 20 | 5.907 | 26.468 | 53.342 |
| MI25 (Vega 10) | 7.827 | 36.947 | 71.499 |
| s9300 (Fiji) | 9.642 | 37.07 | 80.118 |

On some projects it’s better to focus on the largest bottleneck first. For this project we chose to go through the code in a structured way. It is sometimes difficult to explain that improvements will only be “activated” very late in the project; luckily, “experience” is often accepted as an explanation.
So the applied fixes had a good effect, but it is hardly noticeable right now.
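A rough way to see why, assuming the 95% transfer share mentioned above and assuming that the fixes only speed up the remaining compute part, is Amdahl's law. With a compute fraction $p$ that is accelerated by a factor $s$, the overall speedup is:

$$
S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad p = 0.05 \;\Rightarrow\; \lim_{s \to \infty} S(s) = \frac{1}{0.95} \approx 1.05
$$

So under these assumptions even an infinitely fast kernel would improve the total time by only about 5%, until the transfers themselves are addressed.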
| Size | 1048576 | 5242880 | 10485760 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 5.005 | 22.194 | 43.994 |
| Titan V | 7.841 | 37.957 | 77.382 |
| GTX980 | 9.12 | 44.473 | 89.847 |
| Vega 20 | 5.87 | 27.663 | 51.828 |
| MI25 (Vega 10) | 7.037 | 31.737 | 64.898 |
| s9300 (Fiji) | 7.295 | 32.866 | 79.558 |

## WIP: Repo
Fixed-rate bond valuation with flat forward curve
Only ported to HIP. We have not done any optimisations yet, as we first have stability problems with two AMD GPUs to focus on. With the current code, Vega 20 is faster than Titan V.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 186.928 | 369.718 | 732.446 |
| Titan V | 38.328 | 72.444 | 141.664 |
| GTX980 | 404.359 | 796.748 | 1599.416 |
| Vega 20 | 36.399 | 67.299 | 128.133 |

## WIP: Bonds
Securities repurchase agreement
Only ported to HIP. The code for the Bonds benchmark still needs some attention. You can see that the GTX980 is too slow in comparison.
| Size | 262144 | 524288 | 1048576 |
|---|---|---|---|
| 2x Intel Xeon E5-2650 v3 OpenMP | 241.248 | 482.187 | 952.058 |
| Titan V | 46.172 | 88.865 | 177.188 |
| GTX980 | 761.382 | 1518.756 | 3030.603 |
| Vega 20 | 63.937 | 122.484 | 241.917 |

## Next steps
As you can see, this is really work in progress. Why show it already? The reason is that you can see how a project goes. We clean up the code in every project, to avoid delays later on. Adding good tests and benchmarks is another foundational step. Most of the time has gone into these preparations, and only limited time into the improvements.
Milestones we have planned for now:
We intend to release a milestone every 6 to 8 weeks. You can get notified by following us on Twitter or LinkedIn.
Feel free to contact us with any question.