At StreamHPC we take on very different types of projects, but this one really stood out. For a start, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.
They wanted to solve a simple problem: avoiding slow conversations like this one with the HR departments of large companies:
Yes, I can answer your questions.
For that I need to do a test-run.
Please come back tomorrow.
The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but said that they were looking for a faster solution first and foremost.
Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now takes only 20 seconds, including loading all data from and writing it to the database – a speedup of about 250 times. A run with 100k employees finishes in under 2 minutes – the above HR department would have liked that.
The code base consisted of several hundred megabytes of Delphi code that had been built up over the years, which made the project far from standard.
The interesting part is that payroll software needs a lot of flexibility to cope with continuously changing laws and rulings within a country. This is a main reason why most payroll companies do not work internationally. Intersoft’s software can be configured to work in any industry or country.
Code that offers both correctness and flexibility is not easy to port to the GPU, which expects very straightforward instructions – we needed to be very creative to convert the internal scripting language to OpenCL.
After about 2 weeks the test input was processed correctly in seconds. After that, the client worked with us to implement redesigns of the database loading and storing, so that the data could be fed to the OpenCL calculations fast enough.
The code now runs correctly on AMD and NVIDIA GPUs, and (for backwards compatibility) on CPUs. We have not tested on Intel Arc. Nor did we benchmark it formally, as this type of software has too many interdependencies to define a good benchmark case. Hence we only mention the ~250x speedup.
Most software we write is in CUDA and HIP. OpenCL kernels are written in a C-based language and are less convenient, so why did we choose it? The reason is the built-in compiler. Every time the script or configuration changes, OpenCL code can be generated for it and compiled on the spot. Even just-in-time optimizations, based on the provided data, can be applied. If we had chosen CUDA, HIP or SYCL, this would have taken much more effort.
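To make the idea of generating OpenCL on the fly a bit more concrete, here is a minimal sketch in Python. It is not Intersoft's generator: the rule names (`hourly_rate`, `overtime_factor`), the employee records and the kernel itself are made up for illustration. It only shows the principle that kernel source is plain text, so it can be rebuilt whenever the configuration changes and specialised for the data of a particular run.

```python
from string import Template

# Skeleton of an OpenCL C kernel; only the payroll expression is generated.
# The kernel name and arguments are hypothetical, chosen for this sketch.
KERNEL_TEMPLATE = Template("""\
__kernel void payslip(__global const float *hours,
                      __global const float *overtime,
                      __global float *gross)
{
    size_t i = get_global_id(0);
    gross[i] = ${expression};
}
""")


def generate_kernel(rules, employees):
    """Return OpenCL C source with the current rule values baked in."""
    # Rule values become float literals, so the device compiler sees constants.
    rate = f"{float(rules['hourly_rate'])!r}f"
    expression = f"hours[i] * {rate}"

    # Data-dependent specialisation: emit the overtime term only when at
    # least one employee in this run actually has overtime hours.
    if any(e["overtime"] > 0 for e in employees):
        factor = f"{float(rules['overtime_factor'])!r}f"
        expression += f" + overtime[i] * {rate} * {factor}"

    return KERNEL_TEMPLATE.substitute(expression=expression)


if __name__ == "__main__":
    rules = {"hourly_rate": 12.5, "overtime_factor": 1.5}
    employees = [{"overtime": 0.0}, {"overtime": 3.0}]
    # In a real host program this string would be passed to the OpenCL
    # runtime (clCreateProgramWithSource followed by clBuildProgram).
    print(generate_kernel(rules, employees))
```

In a run where nobody has overtime, the generated kernel contains no overtime arithmetic at all. This is the kind of data-based specialisation that is cheap to do when the compiler is available at runtime.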
You might say that OpenCL kernels can be reverse engineered more easily than code written in C++-based languages. In this project the real IP is in the OpenCL code generation and in the software for visually changing how payroll calculations are done, so reverse engineering the kernels does not make much sense. An offender would need to reverse engineer the software again with every change.
Intersoft is a provider of Payrolling Systems that can be adapted to any country and industry, with a specialization in the cleaning industry. They also create software for Time & Task Tracking, Staff Planning & Scheduling, Forecasting & Workload Monitoring, Customer Surveys & Analysis and Incentive Programs.
Intersoft says that their customers are attracted by the combination of flexibility and correctness of the solution – unique in its market. Performance is now added to that list, which means companies with over 10,000 employees can be served as well.
Where do you see software that is unexpectedly slow and makes it difficult to provide a good service?