CUDA Compute Capability 6.1 Features in OpenCL 2.0

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double-checking support for AMD Fiji GPUs (like the Radeon Nano and FirePro S9300X2), I got curious how much support is still missing in OpenCL. For Fiji, it looks like all features are supported. For OpenCL 2.0, read on.

CUDA features per Compute Capability on Wikipedia

# Feature overview

The table below does not discuss performance, which is of course also a factor.

| **CUDA 3.5 or higher** | **OpenCL 2.0** |
| --- | --- |
| Integer atomic functions operating on 32-bit words in global memory | [yes](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/atomicFunctions.html) |
| atomicExch() operating on 32-bit floating point values in global memory | function: [atomic\_xchg()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/atomic_xchg.html) |
| Integer atomic functions operating on 32-bit words in shared memory | [yes](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/atomicFunctions.html) |
| atomicExch() operating on 32-bit floating point values in shared memory | function: [atomic\_xchg()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/atomic_xchg.html) |
| Integer atomic functions operating on 64-bit words in global memory | extensions: [cl\_khr\_int64\_base\_atomics](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_int64_base_atomics.html) and [cl\_khr\_int64\_extended\_atomics](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_int64_extended_atomics.html) |
| Double-precision floating-point operations | Supported if the device info CL\_DEVICE\_DOUBLE\_FP\_CONFIG is not empty. For backwards compatibility the extension cl\_khr\_fp64 is still available. |
| Atomic functions operating on 64-bit integer values in shared memory | extensions: [cl\_khr\_int64\_base\_atomics](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_int64_base_atomics.html) and [cl\_khr\_int64\_extended\_atomics](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_int64_extended_atomics.html) |
| Floating-point atomic addition operating on 32-bit words in global and shared memory | N/A; [see this post](https://streamhpc.com/blog/2016-02-09/atomic-operations-for-floats-in-opencl-improved/) for a hack. |
| Warp vote functions | Implemented in the new Work-group Functions; [see this post by Intel](https://software.intel.com/en-us/articles/using-opencl-20-work-group-functions). |
| \_\_ballot() | Hack: [work\_group\_all()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_all.html) with bit-shift using get\_local\_id(). |
| \_\_threadfence\_system() | Hack: needs a sync from the host. |
| \_\_syncthreads\_count() | Hack: [work\_group\_reduce\_sum()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_reduce.html) + barrier() |
| \_\_syncthreads\_and() | Hack: [work\_group\_all()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_all.html) + [work\_group\_barrier()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_barrier.html) |
| \_\_syncthreads\_or() | Hack: [work\_group\_any()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_any.html) + [work\_group\_barrier()](https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/work_group_barrier.html) |
| Surface functions | Images |
| 3D grid of thread blocks | 3-dimensional work-groups |
| Warp shuffle functions | N/A; see the notes below |
| Funnel shift | This is a bit-shift where the vacated bits are not filled with zeroes but with bits from the second integer. Hack: bit-shift both integers (one left by N bits and the other right by (32-N) bits) and then combine them with a bit-wise OR. |
| Dynamic parallelism | Nested Parallelism |

As you can see, OpenCL covers almost everything CUDA offers. The most notable gap is the work-group shuffle; the other missing functions can be implemented in two steps.
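To make the "two steps" concrete, here is a minimal sketch of the funnel-shift workaround from the table, written as plain C so it can be checked on the host. The helper name `funnel_shift_left` and its argument order are assumptions chosen for illustration, not part of CUDA or OpenCL. The same expression should carry over to an OpenCL kernel with `uint` operands.

```c
#include <stdint.h>

/* Hypothetical helper (name and argument order are assumptions):
 * treats hi:lo as one 64-bit value, shifts it left by n bits and
 * returns the upper 32 bits. */
static uint32_t funnel_shift_left(uint32_t lo, uint32_t hi, uint32_t n)
{
    n &= 31u;               /* keep the shift amount in 0..31 */
    if (n == 0u) {
        return hi;          /* avoids lo >> 32, which is undefined in C */
    }
    /* Step 1: shift both words. Step 2: merge them.
     * The two halves never overlap, so OR and addition give the same result. */
    return (hi << n) | (lo >> (32u - n));
}
```

The explicit check for a zero shift is there because shifting a 32-bit value by 32 bits is undefined behaviour in C. Whether the same guard is needed on a given OpenCL device is something to verify against its compiler.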

If you want to know what is new in OpenCL (including features not existing in CUDA, like pipes), see this blog post.