Using async_work_group_copy() on 2D data

When copying data from global to local memory, you often see code like below (1D data):
[raw]

if (get_local_id(0)==0) { // let one work-item per workgroup do the copy
  for (int i=0; i < N; i++) {
      data_local[i] = data_global[offset+i];
  }
}
barrier(CLK_LOCAL_MEM_FENCE); // make the copied data visible to all work-items

[/raw]
This can be replaced with an asynchronous copy using the function async_work_group_copy, which results in cleaner and more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C and C++.

[raw]

event_t async_work_group_copy(__local gentype *dst, const __global gentype *src,
          size_t data_size, event_t event)

event_t async_work_group_copy(__global gentype *dst, const __local gentype *src,
          size_t data_size, event_t event)

[/raw]

Note that data_size is the number of gentype elements to copy, not the number of bytes.
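As an example, the manual 1D loop from above can be replaced by a single call (a minimal sketch; N is the number of elements, as in the loop version). Keep in mind that all work-items of the workgroup must encounter the call with the same arguments:
[raw]

event_t event = async_work_group_copy(data_local, &data_global[offset], N, 0);
// ... independent work can be done here while the copy is in flight ...
wait_group_events(1, &event);

[/raw]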

The Khronos registry entry for async_work_group_copy describes asynchronous copies between global and local memory and a prefetch from global memory. This makes it much easier to hide the latency of the data transfer. In the example below, you effectively get the time to run do_other_stuff() for free, which results in faster code.
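The prefetch variant only hints the cache and involves no local buffer or event to wait on; a minimal sketch (the names next_row and row_size are illustrative, not from the kernel below):
[raw]

// Hint that the next row will be needed soon; there is nothing to wait for.
prefetch(&dataIn[next_row * row_size], row_size);

[/raw]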

As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.

The copy is executed at workgroup level, so there is no need to write code that makes sure it's only executed by one work-item.

[raw]

// 'offset' is assumed to be defined at compile time, e.g. via the build option "-D offset=1".
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    event_t event = 0; // 0 lets async_work_group_copy create a new event
    const int dataInLocalWidth = (offset*2 + get_local_size(0));

    for (int i=0; i < (offset*2 + get_local_size(1)); i++) {
        // Passing 'event' back in chains the copies, so one wait covers all rows.
        // Note: border workgroups read outside the image; see the remark below.
        event = async_work_group_copy(
             &dataInLocal[i*dataInLocalWidth],
             &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                 + (get_group_id(0)*get_local_size(0)) - offset],
             dataInLocalWidth,
             event);
    }
    do_other_stuff(); // code that you can execute for free
    wait_group_events(1, &event); // waits until the copies have finished
    use_data(dataInLocal);
}

[/raw]

On the host (C++), this is the most important part:
[raw]

cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY, sizeof(float)
          * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x+2*offset)
          * (lsize_y+2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer&, cl::LocalSpaceArg> kernel_using_local(
          cl::Kernel(*program, "using_local", &error));
cl::EnqueueArgs eargs(queue, cl::NullRange, cl::NDRange(gsize_x, gsize_y),
          cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);

[/raw]
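Since the kernel expects offset to be defined at compile time, one way to supply it is via a build option (a sketch, assuming the same program object and an integer offset on the host):
[raw]

std::string options = "-D offset=" + std::to_string(offset);
program->build(options.c_str()); // compiles the kernel with 'offset' defined

[/raw]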
This should work. Some prefer to declare the local buffer inside the kernel instead of passing it as an argument, but I prefer not to fix its size at (JIT) compile time.
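For completeness, that alternative would look roughly like this (a sketch; LSIZE_X, LSIZE_Y and OFFSET would then have to be compile-time constants, e.g. passed via -D):
[raw]

kernel void using_local(const global float* dataIn) {
    // The size must be known when the kernel is (JIT-)compiled.
    local float dataInLocal[(LSIZE_X + 2*OFFSET) * (LSIZE_Y + 2*OFFSET)];
    // ... same copy loop as above ...
}

[/raw]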

This code might not be optimal if you use special tricks for handling the outer border. If you see any improvement, please share it via the comments.