Stream HPC

Learning both OpenCL and CUDA

Be sure to read Taking on OpenCL where I’ve put my latest insights – also for CUDA.

The two¹ “camps”, OpenCL and CUDA, both claim that you should learn their language first, after which the other will be easy to pick up. I’m from the OpenCL camp, so I say you should learn OpenCL first, but with a strong emphasis on understanding the hardware architecture. Had I chosen CUDA, I would have said the opposite; in other words, it does not matter which one you do first. But psychology tells us that you will probably like your first language more, since that is where you discovered the magic. Also, most people do not like to learn a second language that is very similar and does not make a real difference. Most programmers just want to get the job done, and both camps know that. Be aware of that.

NVIDIA is very good at marketing its products; AMD has, to put it mildly, a lower budget for GPGPU marketing. As a programmer you should be aware of this difference.

OpenCL offers more possibilities than CUDA, because it has task-parallel programming and supports far more architectures. On the other hand, CUDA is much more user-friendly and has a lot of convenience built in.

Why learn both at the same time?

If you learn CUDA and OpenCL at the same time, you have the advantage of a better overview. Once you add OpenCL to your knowledge, you can no longer stick to the more convenient approach; you actually have to learn a lot more about the hardware. In my opinion that is a good investment; please check out the Self-Study material in our education section for resources. If you choose CUDA only, you miss out on that understanding and on support for several architectures.

It is also very easy to learn both at the same time, because the differences expose a lot of (dis)advantages that otherwise go unmentioned.

How to do it?

You have to make your own study-plan based on the following steps:

  1. Learning the basics of GPGPU, like kernels, host, device, memory-transfers, basic commands, etc.
  2. Building your first program. Altering programs in the SDKs.
  3. Understanding the architectures of NVIDIA, AMD, ARM and IBM Cell.
  4. Learning how to make larger and multi-kernel programs, and actually write them.
  5. Learning about optimisation for the architectures you target.
  6. Getting to know the design patterns for integrating GPGPU in existing software.

So there is no mention of CUDA or OpenCL here, nor of any other GPGPU environment.

The first two steps form the base: you need to get a grasp of the technology and a few ways of looking at it. Current OpenCL and CUDA compilers do not do as much for you as the x86 compilers you are accustomed to, so you need to understand more about the hardware. NVIDIA has a lot of information available about its architectures, but the same information for the other hardware is harder to find.
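To make the vocabulary of step 1 a bit more tangible, here is a minimal sketch in plain C++ that only emulates the host/device split on the CPU. It is deliberately neither CUDA nor OpenCL, and every name in it (`DeviceBuffer`, `copy_to_device`, `copy_to_host`, `vector_add_kernel`, `vector_add`) is made up for this illustration. A real GPU would run the work-items in parallel; the loop here is just a stand-in for the launch.

```cpp
// CPU-only emulation of the host / device / kernel vocabulary.
// Nothing here is a CUDA or OpenCL call; all names are hypothetical.
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for device memory: the host only reaches it through explicit copies.
struct DeviceBuffer {
    std::vector<float> storage;
};

// "Memory transfer" from host to device.
DeviceBuffer copy_to_device(const std::vector<float>& host_data) {
    return DeviceBuffer{host_data};
}

// "Memory transfer" from device back to host.
std::vector<float> copy_to_host(const DeviceBuffer& buffer) {
    return buffer.storage;
}

// The "kernel": the code for ONE work-item, identified by its global id.
void vector_add_kernel(std::size_t gid, const DeviceBuffer& a,
                       const DeviceBuffer& b, DeviceBuffer& out) {
    out.storage[gid] = a.storage[gid] + b.storage[gid];
}

// Host side: copy in, "launch" one work-item per element, copy out.
std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    DeviceBuffer da = copy_to_device(a);
    DeviceBuffer db = copy_to_device(b);
    DeviceBuffer dout{std::vector<float>(a.size())};
    // On a GPU these iterations would run in parallel; a loop emulates that.
    for (std::size_t gid = 0; gid < a.size(); ++gid) {
        vector_add_kernel(gid, da, db, dout);
    }
    return copy_to_host(dout);
}
```

Calling `vector_add({1, 2, 3}, {10, 20, 30})` walks through the same four phases you will meet in either environment: allocate and copy in, launch, compute per work-item, copy out. Only the spelling of those phases differs between the two camps.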

Understanding architectures is a very important part. CUDA books have the best descriptions of NVIDIA architectures (covering concepts like “warps”), whereas descriptions of other architectures from a GPGPU perspective are not that widely available. When you understand how GPUs work, you have a better idea of how to optimise for them. Be aware of the existence of Multiscalelab’s SWAN CUDA-to-OpenCL converter; this tool will help you when converting CUDA code to OpenCL. OpenCL also has an embedded profile for low-power devices like smartphones, which you should take into account when studying the different ARM architectures.
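As an illustration of why a concept like “warps” matters for optimisation, here is a rough cost model in plain C++. It assumes a heavily simplified picture of NVIDIA hardware: 32 threads form a warp and run in lock-step, and when threads within one warp disagree on an `if`/`else`, the warp pays for both sides one after the other. Real hardware is more nuanced than this, and the function names and cost numbers are invented for the sketch.

```cpp
// Simplified lock-step model of branch divergence within a warp.
// This is an assumption-laden sketch, not a description of any specific GPU.
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // warp width on NVIDIA GPUs

// Total cost of one if/else over all threads: each warp pays for every
// branch side that at least one of its threads takes.
std::size_t branch_cost(const std::vector<bool>& takes_then,
                        std::size_t then_cost, std::size_t else_cost) {
    std::size_t total = 0;
    for (std::size_t start = 0; start < takes_then.size(); start += kWarpSize) {
        bool any_then = false;
        bool any_else = false;
        for (std::size_t t = start;
             t < start + kWarpSize && t < takes_then.size(); ++t) {
            if (takes_then[t]) any_then = true; else any_else = true;
        }
        if (any_then) total += then_cost;
        if (any_else) total += else_cost;
    }
    return total;
}

// Even threads take the "then" side, odd threads the "else" side.
std::vector<bool> alternating(std::size_t n) {
    std::vector<bool> p(n);
    for (std::size_t t = 0; t < n; ++t) p[t] = (t % 2 == 0);
    return p;
}

// The first half of the threads takes "then", the second half "else".
std::vector<bool> grouped(std::size_t n) {
    std::vector<bool> p(n);
    for (std::size_t t = 0; t < n; ++t) p[t] = (t < n / 2);
    return p;
}
```

With 64 threads and a cost of 10 per branch side, `branch_cost(alternating(64), 10, 10)` gives 40 while `branch_cost(grouped(64), 10, 10)` gives 20: the same work, arranged so that each warp agrees on its branch, costs half as much in this model. That kind of insight is what the architecture descriptions give you.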

The last three steps are about upgrading your current software to use GPU power. These steps require well-written software to start from, so you might need to study some books on design patterns before getting too advanced in GPGPU.

Task-parallel programming is supported by OpenCL and not by CUDA. Currently the market focuses on data-parallel programming, and therefore GPUs are not really optimised for task-parallel programming. So you can actually skip this for 2010; expect the market to follow the moment NVIDIA adds it to CUDA.

Final words

I’m in the OpenCL camp for a reason: I can choose between NVIDIA, AMD, IBM, ARM² and upcoming FPGAs. As soon as the price per FLOP for an HPC application is lowest on one architecture, I can choose that architecture. But I learnt NVIDIA’s architecture best by learning CUDA.


¹ I did not consider the third camp, Microsoft’s DirectCompute, because in my opinion it is more a re-branded OpenCL than an alternative. But I do like a good discussion…

² See my previous blog post, where you can read that ARM has no OpenCL yet.