Stream HPC

OpenCL vs CUDA Misconceptions


Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)


Last year I explained the main differences between CUDA and OpenCL. Now I want to get some old (and partly false) stories around CUDA-vs-OpenCL out of this world. While it has been claimed too often that one technique is simply better, it should also be said that CUDA is better in some aspects, whereas OpenCL is better in others.

Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I’ve written before, the potential market for dedicated graphics cards is shrinking, and I am therefore forecasting the end of CUDA on the desktop. Not having this discussion opens the door to closed standards and delays the innovation that can happen on top of OpenCL. The sooner people and companies start choosing a standard that gives equal competitive advantages, the more we can expect from the upcoming hardware.

Let’s stand by what we learnt at school about gathering information sources: don’t put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and underpin!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

I also found that Google Insights agrees with what I constructed manually.

The trends

The word “CUDA” existed for a long time as slang for “could have”, and there is a party-bar with that name in Canada, a ’71 car (once made by Plymouth, a US car manufacturer based close to Canada) and an upcoming documentary (also from Canada… What would South Park say?!). If you peek at Google Trends, the first thing you see is that CUDA (red) is much bigger than OpenCL (blue). Not paying too much attention gives the common idea that OpenCL is just cute in comparison to CUDA. I fixed the graph by setting the pre-2007 values to zero (see the image above). Then you see clearly that CUDA is not as huge as it seemed, and that it has even been going down for the last 2 years. At the end of the year, you might see the two lines much closer together than NVIDIA wants you to see. In other words: if you had the feeling that CUDA was only rising, then note how OpenCL grew even faster according to Google Trends.

SimplyHired gives a comparable view on CUDA vs OpenCL (OpenMP is there for comparison; MPI is much bigger). Though CUDA is still bigger, the two are comparable and the lines have sometimes even touched (might it be love?). Nice to see: you can recognise the dates of CUDA releases in the peaks. I can’t explain the big decline for both CUDA and OpenCL that started in March ’11.

Then there is the potential R&D budget that can be put into developing new techniques. At Yahoo Finance I found the annual spending on R&D (based on the last quarter). For the most important x86 companies in OpenCL this is:

You understand that once the time is right, there’s no match for NVIDIA. Not that all R&D will be put into OpenCL; NVIDIA doesn’t put all its R&D into CUDA either.

Toolset

CUDA and OpenCL mostly do the same thing. It’s like Italians and French fighting over who has the most beautiful language, while both come from the same Latin/Romance branch. There are some differences, though. CUDA tries to be an all-in-one package for developers, while OpenCL is mostly a language description only. For OpenCL the SDK, IDE, debugger, etc. all come from different vendors. So if you have an Intel Sandy Bridge and an AMD Radeon, you need even more software when performance-optimizing kernels for different hardware. This is not ideal, but everything you need really is there. You have to go to different places, but it is not true that the software is unavailable, as is claimed much too often.

Currently Visual Studio support is very good from NVIDIA, AMD and Intel. On OS X, Xcode covers all developer needs around OpenCL. Last year the developer support for CUDA was better, but the catch-up here is finished.

Libraries

Where CUDA comes in strong and OpenCL needs a lot of catch-up is in what has been built on top of the language. CUDA has support for templates, which brings nice advantages. Then there are the math libraries, which come for free:

  • cuFFT – Fast Fourier Transforms Library
  • cuBLAS – Complete BLAS Library
  • cuSPARSE – Sparse Matrix Library
  • cuRAND – Random Number Generation (RNG) Library
  • NPP – Performance Primitives for Image & Video Processing
  • Thrust – Templated Parallel Algorithms & Data Structures
  • math.h – C99 floating-point Library

For most of these there are alternatives, or you can easily build your own, but there is nothing comparable as a whole. This will of course come in time for each architecture, but for now this is the big win for CUDA.

One example of free math software for OpenCL is ViennaCL, a full linear algebra library with iterative solvers. More about the CUDA math libraries in an upcoming article.

Heterogeneous Programming

OpenCL works on different hardware, but the software needs to be adapted for each architecture. That should not blow any minds: you need different types of cars to be fastest on different kinds of terrain. If CUDA could work on both Intel CPUs and NVIDIA GPUs, it would have the same problem: a GPU-optimized kernel will not perform well on a CPU. Just as with OpenCL, you need to write the code specifically for CPUs. The claim that CUDA is better because its performance is about the same on each piece of hardware it runs on is bogus. It just does not touch the problem OpenCL tries to solve: having one programming concept for many types of hardware.

This makes you wonder why we never saw the x86 implementation of CUDA that was developed last year. Actually, it was announced as a public beta recently, but it is still not performance-optimized and costs $299,- as part of the Portland Group compiler suite. A performance-optimized version will be released at the end of 2011, so let’s have another look then.

Performance-comparisons

OpenCL 1.1 brings some speed-ups with, for example, strided copies. Comparing CUDA to OpenCL 1.0 (since NVIDIA’s 1.1 drivers are a year old and have not been updated since) is just not fair. What is fair to say is that one piece of hardware is faster than another, and that certain compilers can be more advanced in optimizing. But since CUDA and OpenCL as languages are so much alike, it is impossible to give a verdict on which language is (potentially) faster. It would be like saying that Objective-C is faster than C++. No, again it’s the compiler (and the programmer) that makes it faster.
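To make “strided copies” a bit more concrete, here is a minimal host-side sketch in plain Python of what such a copy does: it moves a rectangle out of a larger row-major buffer in one operation instead of one call per row. Presumably the author has the rectangular buffer commands in mind that OpenCL 1.1 added (such as clEnqueueCopyBufferRect), but that is an assumption. The function `copy_rect` and its parameters are hypothetical names for illustration only and are not part of any OpenCL or CUDA API.

```python
def copy_rect(src, src_pitch, dst, dst_pitch, width, height,
              src_origin=(0, 0), dst_origin=(0, 0)):
    """Copy a width x height rectangle between two flat, row-major buffers.

    Hypothetical host-side model of a strided (rectangular) copy.
    Pitches and origins are counted in elements, not bytes.
    """
    src_x, src_y = src_origin
    dst_x, dst_y = dst_origin
    for row in range(height):
        # Start index of this row of the rectangle in each buffer.
        s = (src_y + row) * src_pitch + src_x
        d = (dst_y + row) * dst_pitch + dst_x
        dst[d:d + width] = src[s:s + width]
    return dst


if __name__ == "__main__":
    image = list(range(16))   # a 4x4 "image", values 0..15
    tile = [0] * 4            # room for a 2x2 tile
    copy_rect(image, 4, tile, 2, width=2, height=2, src_origin=(1, 1))
    print(tile)               # the 2x2 block starting at column 1, row 1
```

Without such a command, each row of the rectangle needs its own copy call, which is where the speed-up would come from.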

I also still see some comparisons to RADEON HD4000-series, which are not really fit for GPGPU. The 5000 and 6000 series are. This problem will slowly fade away with more benchmarking, but not as fast as I hoped it would.

Bang per buck

A Tesla C2050 with 3GB of RAM costs $2400,-, giving 1 TFLOPS single precision (0.5 TFLOPS double precision). The fastest AMD Radeon, the HD6990 with 4GB, costs $715,- and gives 5.1 TFLOPS single precision (1.2 TFLOPS double precision). Three of them give more than 15 TFLOPS for $2145,-. Of course these are theoretical numbers and we still have the issue of the limits of PCIe. But for many problems, Radeons are just much faster than Tesla/GeForce for GPGPU. Teslas have higher transfer rates and can have 6GB of memory, so they are a better fit for other problems. FFT and similar computations, for instance, still rock on NVIDIA hardware.
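The arithmetic behind this comparison can be written out as a small sketch. It only uses the theoretical peak numbers and prices quoted above; the `Card` class and the `scale` helper are made-up names for illustration, and nothing here measures real-world performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """Price and theoretical peak throughput as quoted in the text."""
    name: str
    price_usd: float
    sp_tflops: float   # single precision
    dp_tflops: float   # double precision

    def sp_gflops_per_usd(self) -> float:
        return self.sp_tflops * 1000.0 / self.price_usd

    def dp_gflops_per_usd(self) -> float:
        return self.dp_tflops * 1000.0 / self.price_usd


TESLA_C2050 = Card("Tesla C2050", 2400.0, 1.0, 0.5)
RADEON_HD6990 = Card("Radeon HD6990", 715.0, 5.1, 1.2)


def scale(card: Card, count: int):
    """Total price and theoretical single-precision TFLOPS of `count` cards.

    Ignores PCIe limits and scaling losses, just like the text does.
    """
    return card.price_usd * count, card.sp_tflops * count


if __name__ == "__main__":
    for card in (TESLA_C2050, RADEON_HD6990):
        print(f"{card.name}: "
              f"{card.sp_gflops_per_usd():.2f} GFLOPS/$ single precision, "
              f"{card.dp_gflops_per_usd():.2f} GFLOPS/$ double precision")
    price, tflops = scale(RADEON_HD6990, 3)
    print(f"3x {RADEON_HD6990.name}: {tflops:.1f} TFLOPS for ${price:.0f}")
```

On these paper numbers the Radeon delivers roughly 17 times more single-precision GFLOPS per dollar, which is the marketing point being made, not a benchmark result.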

Edit 28-1-2012: There were comments on the above comparison of Tesla to Radeon and GeForce. This is not a technical comparison between the graphics cards but more a marketing perspective. Many serious research and financial institutes were buying Tesla cards because the marketing suggested that, being so expensive, they must be the best. People who chose GPGPU but did not know what to buy, bought Tesla cards since that was the obvious choice according to the marketing stories. The reason to buy one is that you want, for example, ECC, not that you want the fastest card (highest memory bandwidth + processor power).

Books

Books are a very good measure of expected popularity. But they are also used to push technologies (books published by the company that makes the software/hardware).

Since there are more (English) books on CUDA than on OpenCL, you might think CUDA is the bigger one. A nice one is the recently released GPU Gems. But the only to-be-released-soon book I could find that mentioned CUDA was Multi-core programming with CUDA and OpenCL, while there are 3 books in the making for OpenCL (or three and a half, if you count that one). I also understood that UK-based CUDA Developer is working on a book.

Edit 21-07-2011: Elsevier releases “CUDA” in August.
Edit 1-02-2012: As I mentioned on Twitter, “Multi-core programming with CUDA and OpenCL” was pulled back from release.

4.0 > 1.1?

This claim was made not long ago, and they were being serious: 4.0 is bigger than 1.1, so CUDA is much more advanced. This reminds me of the browser discussions, where it was said that Firefox was behind since it had only reached version 4. But I understand; 1.0 sounds so new and just finished; 1.1 sounds like the first bugfix release. In reality OpenCL 1.1 has support for cloud computing, which CUDA only added recently. And as said, CUDA still supports graphics cards only, whereas OpenCL has supported other device types since 1.0.

It is often said that CUDA has a 2-year advantage, but ATI had already done a lot of research on GPGPU (Close to Metal) years before AMD eventually chose OpenCL, and almost a year before CUDA 1.0 was launched. Close to Metal was replaced by AMD’s Stream and then by OpenCL. Don’t think all projects started from scratch, and be aware that OpenCL was co-designed by NVIDIA, AMD and others. This means that it has all the best the predecessors (including CUDA and AMD Stream) had to offer.

CUDA is said to be more mature, but since the languages are comparably mature, this must refer to the drivers and the programming environments (IDEs). This is the OpenCL driver status:

  • AMD-drivers are mature (both GPU and CPU).
  • NVIDIA-drivers are still on 1.0.
  • Intel-drivers are in beta.
  • IBM-drivers are stable (POWER), but still in ‘alphaWorks’.
  • ARM-drivers (various) are in closed beta.

So CUDA-drivers are as mature as AMD OpenCL drivers. Also, since many companies have put all their knowledge from other products into OpenCL, the technique is much older than the name and the version-number.

Conclusion (2011)

You might be missing the differences in the API completely. There are language differences between CUDA 4.0, OpenCL 1.0 and OpenCL 1.1, but I will give an overview of those later (and I’ll put the link here). We think we have enough to tell you how to port your CUDA software to OpenCL.

My verdict:

CUDA

    • is marketed better.
    • has developer-support in one package.
    • has more built-in functions and features.
  • – only works on NVIDIA GPUs.

OpenCL

    • has support for more types of processor architectures.
    • is a completely open standard.
  • – only AMD’s and NVIDIA’s OpenCL-drivers are mature; Intel and IBM are expected to mature their drivers soon.
  • – is supplied by many vendors, not provided as one package or centrally orchestrated.

I hope you found that OpenCL is not a cute alternative to CUDA, but an equal standard which offers more potential. OpenCL has some catching up to do, yes, but it will all happen soon, this year.

http://www.google.com/products/catalog?client=ubuntu&channel=fs&q=tesla+s2050&oe=utf-8&um=1&ie=UTF-8&tbm=shop&cid=8721727382375528152&sa=X&ei=lQcBTuecNcWfOteR1YAO&ved=0CDIQ8wIwAw