Contrary to what you see around (on slides like these), AMD and Intel also support RDMA. A while ago I found the slide at the right, claiming that AMD did not have any direct GPU-GPU communication. Several sources told me otherwise, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network cards, SSDs and FPGAs), but not much information can be found on PCI+SDI. More often the term RDMA is used: Remote Direct Memory Access (wikipedia).
Questions I try to answer below: which GPU vendors actually support RDMA under OpenCL, and how do you use it?
Enjoy reading!
You might know of clEnqueueMigrateMemObjects(), which comes close to what RDMA does. It is defined as "Enqueues a command to indicate which device a set of memory objects should be associated with". The big difference is that clEnqueueMigrateMemObjects only operates within a single OpenCL context, whereas RDMA works between different contexts.
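As a reminder, here is a minimal sketch of such a migration within one context (the context and queues are assumed to be set up elsewhere; the names are mine):

    #include <CL/cl.h>

    /* Sketch: hand a buffer over to another device in the SAME context.
     * ctx and queue_on_dev2 (a queue on the target device) are assumed
     * to exist already. */
    static cl_mem migrate_to_device(cl_context ctx,
                                    cl_command_queue queue_on_dev2,
                                    size_t size)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

        /* Associate the buffer with the device behind queue_on_dev2;
         * the runtime migrates the backing storage. */
        err = clEnqueueMigrateMemObjects(queue_on_dev2, 1, &buf,
                                         0 /* default: migrate contents */,
                                         0, NULL, NULL);
        return buf;
    }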
We can be short: NVidia doesn't support GPUDirect under OpenCL. If you need it, file a bug report and mention how big a customer you are – you might get access to some special program.
I'd like to discuss NVidia's solution first, as it's better known than AMD's. From NVidia's GPUDirect page, one can conclude that their solution consists of three categories:
1) GPU-GPU communications: peer-to-peer transfers and peer-to-peer memory access between GPUs.
2) GPU-PCIcard communications: GPUDirect for Video, aimed at frame grabbers and video I/O cards.
3) GPU-PCI-network-PCI-GPU communications (GPUDirect RDMA). This includes InfiniBand, if the network driver supports it.
As multi-GPU is a typical professional feature (like double precision compute), all this only works on Tesla and Quadro cards.
Intel considers OpenMP and MPI the only working solutions, and doesn't promote OpenCL on Xeon Phi anymore. If you need more information on Intel's alternative to GPUDirect and DirectGMA, check Intra-node and inter-node MPI using OFED.
Although it has existed since 2011, not many know about it. You might have seen AMD's demos of video software, where camera input is fed directly into the GPU. Or maybe you already use recent FirePro GPUs on a cluster.
AMD supports the first two categories (GPU-GPU and GPU-PCI) on FirePro GPUs. I heard that RDMA support is coming in Q2/Q3. GPU-GPU communication is called DirectGMA, and GPU-PCIcard goes under the name SDILink. The two names are often mixed up.
Support is on FirePro S-series (S7xx0, S9xx0, S10xx0) and W-series (W5xx0, W7xx0, W8xx0 and W9xx0).
The first step is to check whether the extension "cl_amd_bus_addressable_memory" is available; a sketch of such a check follows the list below. The extension enables three functions:
- clEnqueueMakeBuffersResidentAMD()
- clEnqueueWaitSignalAMD()
- clEnqueueWriteSignalAMD()
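A simple way to do that check (a sketch; device is an already selected cl_device_id):

    #include <CL/cl.h>
    #include <stdlib.h>
    #include <string.h>

    /* Returns 1 when the device exposes cl_amd_bus_addressable_memory. */
    static int has_directgma(cl_device_id device)
    {
        size_t size = 0;
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, NULL, &size);
        char *ext = (char *)malloc(size);
        clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, size, ext, NULL);
        int found = strstr(ext, "cl_amd_bus_addressable_memory") != NULL;
        free(ext);
        return found;
    }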
You also need functions that are standard in OpenCL, such as clCreateBuffer() and clEnqueueCopyBuffer().
Extra flags for clCreateBuffer():
- CL_MEM_BUS_ADDRESSABLE_AMD: the buffer can be made resident and exposed on the PCIe bus.
- CL_MEM_EXTERNAL_PHYSICAL_AMD: the buffer wraps a bus address that belongs to another device.
The struct:
typedef struct _cl_bus_address_amd {
    cl_long surface_bus_address;
    cl_long marker_bus_address;
} cl_bus_address_amd;
You'll find this struct in cl_ext.h from the AMD APP SDK.
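Because the three functions above are extension functions, their entry points have to be fetched at run time. A sketch of that (the function-pointer types follow the extension spec; verify them against the cl_ext.h that ships with your SDK):

    #include <CL/cl.h>
    #include <CL/cl_ext.h>

    /* Function-pointer types per the cl_amd_bus_addressable_memory spec. */
    typedef cl_int (CL_API_CALL *PFN_MakeBuffersResident)(
        cl_command_queue, cl_uint, cl_mem *, cl_bool,
        cl_bus_address_amd *, cl_uint, const cl_event *, cl_event *);
    typedef cl_int (CL_API_CALL *PFN_WaitSignal)(
        cl_command_queue, cl_mem, cl_uint,
        cl_uint, const cl_event *, cl_event *);
    typedef cl_int (CL_API_CALL *PFN_WriteSignal)(
        cl_command_queue, cl_mem, cl_uint, cl_ulong,
        cl_uint, const cl_event *, cl_event *);

    static PFN_MakeBuffersResident pfn_MakeBuffersResident;
    static PFN_WaitSignal          pfn_WaitSignal;
    static PFN_WriteSignal         pfn_WriteSignal;

    /* Extension entry points must be resolved per platform. */
    static void load_directgma(cl_platform_id platform)
    {
        pfn_MakeBuffersResident = (PFN_MakeBuffersResident)
            clGetExtensionFunctionAddressForPlatform(
                platform, "clEnqueueMakeBuffersResidentAMD");
        pfn_WaitSignal = (PFN_WaitSignal)
            clGetExtensionFunctionAddressForPlatform(
                platform, "clEnqueueWaitSignalAMD");
        pfn_WriteSignal = (PFN_WriteSignal)
            clGetExtensionFunctionAddressForPlatform(
                platform, "clEnqueueWriteSignalAMD");
    }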
Limitations of DirectGMA: there is no support for map/unmap and image objects.
The actual steps are done in two parts, one per GPU. The example below shows a single round of reading/writing, but with more signals you can build much more complex schemes.
The first GPU is passive in the communication: it hosts the memory buffer, but cannot initiate transfers. A sketch of this side follows.
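This uses the extension entry points loaded earlier; the function and variable names are mine, the marker value is just an agreed constant, and error handling is omitted:

    /* Passive GPU: expose a buffer on the PCIe bus, hand out its bus
     * address, and wait until the active GPU signals that data arrived. */
    static cl_mem serve_buffer(cl_context ctx, cl_command_queue queue,
                               size_t size, cl_bus_address_amd *bus_addr)
    {
        cl_int err;
        cl_mem shared = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                       size, NULL, &err);

        /* Pin the buffer and obtain the bus address for the other GPU. */
        pfn_MakeBuffersResident(queue, 1, &shared, CL_TRUE,
                                bus_addr, 0, NULL, NULL);

        /* Block until the active GPU writes the agreed marker value. */
        pfn_WaitSignal(queue, shared, 1 /* marker value */, 0, NULL, NULL);
        return shared;
    }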
The other GPU is active in the communication: it performs the transfers from and to the GPU that hosts the memory buffer, as sketched below.
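The matching active side, again a sketch under the same assumptions (bus_addr must be passed over from the passive process, for instance via a socket; the migrate call follows AMD's sample code):

    /* Active GPU: wrap the remote bus address, DMA data into it,
     * then raise the marker so the passive side stops waiting. */
    static void write_to_remote(cl_context ctx, cl_command_queue queue,
                                cl_mem local_buf, size_t size,
                                cl_bus_address_amd bus_addr)
    {
        cl_int err;
        cl_mem remote = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                       size, &bus_addr, &err);

        /* Associate the wrapped buffer with this device. */
        clEnqueueMigrateMemObjects(queue, 1, &remote, 0, 0, NULL, NULL);

        /* Copy local data straight into the other GPU's memory. */
        clEnqueueCopyBuffer(queue, local_buf, remote, 0, 0, size,
                            0, NULL, NULL);

        /* Write the agreed marker value to signal completion. */
        pfn_WriteSignal(queue, remote, 1 /* marker value */, 0 /* offset */,
                        0, NULL, NULL);
        clFinish(queue);
    }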
You need to manually align the buffer address to the 4 KB page size. The code to do that is as follows:

    cl_ulong alignedAddr = ulMarkerAddr & ~0xFFFULL;  /* round down to the page */
    cl_ulong offset      = ulMarkerAddr & 0xFFFULL;   /* remainder within the page */
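How the aligned address and offset could then be used is sketched below. This is my own hypothetical reconstruction; the sample code linked further down is authoritative:

    /* Hypothetical sketch: wrap the aligned page in an external-physical
     * buffer and remember the offset for every subsequent access. */
    cl_bus_address_amd addr = { 0 };
    addr.surface_bus_address = (cl_long)(ulMarkerAddr & ~0xFFFULL);
    addr.marker_bus_address  = (cl_long)(ulMarkerAddr & ~0xFFFULL);

    cl_mem wrapped = clCreateBuffer(ctx, CL_MEM_EXTERNAL_PHYSICAL_AMD,
                                    size + offset, &addr, &err);
    /* All reads and writes into 'wrapped' must now add 'offset'. */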
If you need multiple buffers, you have to keep track of the offsets yourself. How that's done exactly, you can find in the code linked below.
Download the fully working sources of DirectGMA and start playing around (found at the Devgurus forum). The package includes code for both OpenCL and OpenGL – check the CL sources and part of the common code. It contains a VS solution and a Makefile. See also this more recent article from AMD.
In CLSink::processFrame() a buffer is created that another device can write into. In CLSource::processFrame() that buffer on the other device is written to. A CLTransferBuffer object is created for both the Sink and the Source. To test, you need two FirePro GPUs or a dual-GPU card like the S10000.
Want more information? Read this blog by AMD's Bruno Stefanazzi, or contact us. We can provide more insight into this mechanism than covered here, and can upgrade your existing OpenCL software with fast multi-GPU transfers.