March 2013
Writing Efficient OpenCL Code for Android Using PGCL
In a previous article, we introduced the basic steps required to create an OpenCL-enabled Android App running in parallel across multiple ARM CPU cores with NEON/SIMD extensions as an OpenCL compute device. In this article we focus on writing efficient OpenCL code for multi-core ARM using PGCL, an OpenCL compiler framework for ARM processor-based Systems-on-Chip (SoCs).
You need to understand the basic concepts of the OpenCL execution and programming model to write an optimized application for an OpenCL device. The first part of this article will give an overview of the OpenCL execution and programming models. In addition, the PGCL implementation of the OpenCL execution model for multi-core CPUs will be described in detail. The second part of this article describes how to write efficient OpenCL kernels using PGCL. In particular, tips and techniques to write efficient multi-core CPU-friendly OpenCL code will be presented.
OpenCL Execution and Programming Model
OpenCL defines a platform model which consists of a host connected to one or more OpenCL compute devices. An OpenCL compute device is divided into one or more compute units, which are further divided into one or more processing elements. Execution on an OpenCL compute device is controlled and driven by the host. For this reason, the OpenCL programming model splits code into two parts: host code, which uses a set of standardized C API function calls dedicated to compute device management and to launching execution of code on a compute device, and device code, written in the OpenCL-C language, which defines the operations performed by each processing element.
OpenCL defines a parallel programming model where computations are described in terms of kernels. A kernel written in the OpenCL-C language defines the computations performed by a work-item executed by a single processing element. Work-items are grouped into work-groups, each of which is executed by a single compute unit. Work-items execute in parallel, and can cooperate with other work-items in the same work-group using synchronization primitives and a local memory shared between all processing elements in a given compute unit.
In contrast, no synchronization is possible between work-groups. For a given computation, the programmer must write a kernel describing a work-item task, then specify how the work-items are grouped into work-groups and over which domain the computations will be applied by defining an OpenCL NDRange. For a given kernel, a work-item can retrieve its coordinates (which correspond to the coordinates of the domain point to which its computations apply) by querying its index within the work-group, its work-group index, and its global index within the NDRange.
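As a concrete illustration of these queries (a hypothetical kernel, not taken from the article's examples), the following code simply records each work-item's local index, work-group index and global index into an output buffer; for dimension 0 these satisfy global index = work-group index * work-group size + local index:

/* Illustrative kernel: each work-item writes its own coordinates
   into a hypothetical output buffer 'ids' (three ints per work-item). */
__kernel void show_indices(__global int* ids)
{
    int lid = get_local_id(0);   /* index within the work-group         */
    int gid = get_group_id(0);   /* work-group index within the NDRange */
    int id  = get_global_id(0);  /* global index within the NDRange     */

    /* For dimension 0: id == gid * get_local_size(0) + lid */
    ids[3*id+0] = lid;
    ids[3*id+1] = gid;
    ids[3*id+2] = id;
}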
This programming/execution model maps well onto recent CPU+GPU architectures, with the CPU as the OpenCL host and the GPU as the OpenCL compute device. However, the OpenCL standard doesn't restrict execution of kernels to a GPU. It also defines other types of OpenCL compute devices: GPU-like, CPU-like and HW Accelerator-like, but the programming model remains the same regardless of the type of compute device.
Let's consider a simple function f(x) that we want to apply to each element of a matrix A. In C, such a computation might be written as follows:
for (j=0; j<dimY; j++)
for (i=0; i<dimX; i++)
B[j][i] = f(A[j][i]) ;
As shown in figure 1, a parallelizing/optimizing compiler targeting a multi-core CPU with SIMD capability can distribute the j loop iterations across the cores of the CPU, and vectorize the i loop iterations if the SIMD unit provides support for the function f(x) operating on a vector of n elements.
Fig 1: Distribution of computation on a dual core CPU
In OpenCL, there are many possible variants for mapping kernels to a compute device and the programmer must define how to efficiently group computations (e.g. 1D, 2D or 3D). The optimal choice depends on the computations to be performed as well as on the OpenCL compute device architecture and capabilities.
For instance, the programmer may decide that each work-item will compute a row of a matrix. In this case the programmer will define an OpenCL kernel code as follows:
int i, j ;
j = get_global_id(0) ; /* global coordinate for the work-item, here the row number */
for (i=0; i<dimX; i++)
B[j][i] = f(A[j][i]) ;
and the host code should specify the work-group size and shape: for example, the number of rows to be processed per work-group and a 1D domain for the NDRange, which together define how many rows of the matrix will be processed.
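As a sketch of what that host code might look like (not taken from the article's sample application; the command queue, kernel and the buffers for A and B are assumed to have been created elsewhere, and dimY is assumed to be a multiple of the chosen work-group size), the kernel could be launched over a 1D NDRange as follows:

/* Hypothetical host-side launch of the row-per-work-item kernel.
   Kernel arguments (A, B, dimX) are assumed to have been set
   beforehand with clSetKernelArg. */
size_t global_size = dimY;   /* one work-item per matrix row        */
size_t local_size  = 16;     /* 16 rows per work-group (an example) */

clEnqueueNDRangeKernel(queue, kernel,
                       1,             /* 1D NDRange                 */
                       NULL,          /* no global offset           */
                       &global_size,  /* total number of work-items */
                       &local_size,   /* work-group size            */
                       0, NULL, NULL);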
If the programmer decides that each work-item should compute a matrix point, the OpenCL kernel code might be written as follows:
int i, j ;
i = get_global_id(0) ; /* global X coordinate for work-item */
j = get_global_id(1) ; /* global Y coordinate for work-item */
B[j][i] = f(A[j][i]) ;
and the host code should specify the work-group size and shape: for example, the number of points to be processed per work-group and a 2D domain for the NDRange, which together define how many points of the matrix will be processed. As shown in figure 2, work-groups will be distributed across the OpenCL device compute units, and work-items will be distributed across the processing elements in a given compute unit. The NDRange defines how many work-groups will be processed and how they are distributed across compute units.
Fig 2: Distribution of computations on an OpenCL device with 4 compute units
Distributing computations as described in figure 2 matches well with current GPGPU architectures.
When we designed PGCL, we realized that most publicly available OpenCL codes are written for GPGPU devices. They use large work-group sizes with 1D, 2D or 3D shapes, and they take advantage of parallelism between work-groups as well as parallelism between work-items within each work-group. On multi-core Cortex-A9 CPUs with NEON SIMD units, we might have defined an OpenCL device where each core is a compute unit and relied on the NEON unit to provide parallelism between work-items. However, the NEON unit currently doesn't provide the semantics for all operations a work-item might execute.
A second option would have been to limit the work-group size to one and only one work-item. This solution is potentially very efficient, matches the multi-core CPU hardware capabilities, and is allowed in the OpenCL standard. However, it also would have prevented us from compiling and running most existing OpenCL code. As a result, we decided to implement the PGCL OpenCL runtime so that the OpenCL compute device can be customized using environment variables to support any work-group size up to 256 work-items. One pthread per CPU core is created to manage work-group execution, and work-items are emulated using user-level pseudo-threads.
This approach is flexible enough to run almost any GPGPU-like OpenCL application without modification, but at the cost of potentially poor performance. The lack of any real parallelism between work-items, and the significant cost of synchronization between work-items, can result in very poor performance in some cases. To address this issue, three execution modes have been implemented in the PGCL runtime:
- For any work-group size, if a kernel doesn't use synchronization primitives between work-items, execution is serialized. Each work-item is executed to completion before the next one is started, and the cores are in essence running free throughout execution of the whole work-group.
- For any work-group size, if a kernel uses synchronization primitives between work-items, parallel execution is emulated using user-level pseudo-threads. A context switch occurs when a work-item reaches a synchronization primitive, jumping to execution of the next work-item that has not reached the synchronization point. Once all work-items have reached the synchronization point, execution resumes for the first work-item until the next synchronization point is reached.
- For a work-group size of one work-item, if a kernel uses synchronization primitives, execution of the primitives is nop-ified (essentially the synchronization is deleted because it has no logical effect on execution of the kernel).
These execution modes allow the programmer to run any GPGPU-friendly OpenCL code in a mode that is not necessarily high-performance, but will compile and run correctly, and to incrementally adapt kernels to a CPU-friendly form that will run at optimal speed.
Writing Efficient OpenCL Kernels for Multi-core CPU Devices
In the previous sections, we have introduced the OpenCL execution model and its implementation in the PGCL runtime. Understanding the PGCL execution engine helps us to define some guidelines for writing efficient code for a multi-core Cortex-A9 CPU as an OpenCL compute device. Indeed, these rules will apply to any upcoming multi-core ARM CPU compute device supported by the PGCL runtime.
It is important to remember that with the PGCL runtime there is no real parallelism between work-items for a multi-core CPU OpenCL compute device. With this in mind, the following rules can be applied to write efficient code for a multi-core CPU compute device.
- It's always better to write a kernel that does a lot of computation and is executed by few work-items than a kernel that does very little computation and is executed by a lot of work-items. OpenCL kernels written for a GPU device usually group computations so that work-items operate on a 'pixel' basis; for a CPU device it is recommended that work-items operate on 'lines', for instance. Returning to our matrix example, for a multi-core CPU it's better to write kernel code that computes a matrix row and is executed by few work-items than kernel code that computes a single matrix point and is executed by many work-items.
For instance, a simple image horizontal swap kernel can be written as follows:
__kernel void inverse_image_pixel (__global unsigned int* pixels)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int w = 2*get_global_size(0);
    int i0 = x+y*w;
    int i1 = (y+1)*w-x-1;
    unsigned int tmp = pixels[i1];
    pixels[i1] = pixels[i0];
    pixels[i0] = tmp;
}

In this case, each work-item swaps just two pixels, so the work-item task is very limited. The overhead of computing each pixel coordinate requires more operations than swapping the pixels, so this code is likely to perform very poorly. For better performance, it is better to perform the swapping on blocks of pixels. So, for a multi-core CPU device it would be better to rewrite the kernel code as follows:
__kernel void inverse_image_block (__global unsigned int* pixels,
                                   int blk_size_x, int blk_size_y,
                                   int image_width)
{
    int x = get_global_id(0)*blk_size_x;
    int y = get_global_id(1)*blk_size_y;
    int i0 = x+y*image_width;
    int i1 = (y+1)*image_width-x-1;
    int i, j;
    for (j=0; j<blk_size_y; j++) {
        for (i=0; i<blk_size_x; i++) {
            unsigned int tmp = pixels[i1-i];
            pixels[i1-i] = pixels[i0+i];
            pixels[i0+i] = tmp;
        }
        i0 += image_width ;
        i1 += image_width ;
    }
}

Each work-item then performs a swap of a block of pixels, reducing the indexing cost relative to the pixel swapping cost. The code is flexible enough to be launched with any work-group size (even with only one work-item) and any number of work-groups to match the number of available cores on the target CPU device.
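For example, a host-side launch of this kernel might look like the following sketch (not taken from the article's sample application; the command queue, kernel, pixel buffer, image dimensions and num_cores are assumed to exist, and the image height is assumed to be a multiple of num_cores). It creates one work-group of one work-item per core, with each work-item swapping one horizontal band of rows:

/* Hypothetical host-side launch of inverse_image_block. */
cl_int blk_size_x = image_width / 2;           /* each block spans half a row  */
cl_int blk_size_y = image_height / num_cores;  /* one band of rows per core    */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &pixel_buf);
clSetKernelArg(kernel, 1, sizeof(cl_int), &blk_size_x);
clSetKernelArg(kernel, 2, sizeof(cl_int), &blk_size_y);
clSetKernelArg(kernel, 3, sizeof(cl_int), &image_width);

size_t global[2] = { 1, (size_t)num_cores };   /* num_cores work-groups        */
size_t local[2]  = { 1, 1 };                   /* one work-item per work-group */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);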
- Use of work-item synchronization primitives should be avoided if possible. Synchronization primitives, i.e., OpenCL barriers, are usually used to synchronize work-items after writing to local memory. On a GPGPU device, use of local memory is required in order to minimize the impact of high-latency device global memory, or to perform reduction operations between work-items.
- On a multi-core CPU OpenCL compute device as supported by the PGCL runtime, there is no hardware difference between OpenCL local and global memories. Both are physically located in the same memory, in this case DDR memory accessed through the CPU L2 cache. As a consequence, there is no benefit to using local memory accesses rather than global memory accesses, since both go through the same shared L2 cache to the same physical memory.
- Reduction operations, such as a sum of all elements of an array, are usually implemented on a GPGPU OpenCL device using a logarithmic split summation technique, as shown in the following code snippet:
/* locSumArray: local array of resulting sums
 * wiIndex: Work-Item index
 * wgSize: Work-Group size, size of local array
 * After a call to this routine locSumArray[0] = SUM(locSumArray[0..wgSize-1])
 */
void sumAll(__local int* locSumArray, int wiIndex, int wgSize)
{
    int stride ;
    for (stride = wgSize/2; stride>0; stride>>=1) {
        if (wiIndex < stride)
            locSumArray[wiIndex] += locSumArray[wiIndex+stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

In the first stage, N/2 partial sums are computed by N/2 work-items and stored into a local array. The work-items synchronize before execution of the next stage, where N/4 work-items compute partial sums over the previous results and again store them in the local array. This is repeated until the last stage, when the final sum is computed by a single work-item from the last two partial sums. This technique requires log2(N) synchronizations for an N-element sum.
To execute efficiently on a multi-core CPU device using PGCL, it is better to have only one work-item perform the N-element sum and then to synchronize after completion:
/* locSumArray: local array of resulting sums
 * wiIndex: Work-Item index
 * wgSize: Work-Group size, size of local array
 * After a call to this routine locSumArray[0] = SUM(locSumArray[0..wgSize-1])
 */
void sumAll(__local int* locSumArray, int wiIndex, int wgSize)
{
    int i ;
    int sum = 0;
    if (wiIndex == 0) {
        for (i=0; i<wgSize; i++) {
            sum += locSumArray[i] ;
        }
        locSumArray[0] = sum ;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
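For context, a hypothetical caller of either version of sumAll might look like the sketch below (the kernel name, buffer layout and fixed 256-element local array are assumptions, not part of the article's example); each work-item first deposits one value into local memory before the reduction is performed:

/* Hypothetical caller of sumAll: each work-group reduces one chunk
   of 'input' and writes its partial result to 'partial'. */
__kernel void reduce_chunk(__global const int* input, __global int* partial)
{
    __local int locSumArray[256];      /* assumes work-group size <= 256    */
    int wiIndex = get_local_id(0);
    int wgSize  = get_local_size(0);

    locSumArray[wiIndex] = input[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);      /* all values written before the sum */

    sumAll(locSumArray, wiIndex, wgSize);

    if (wiIndex == 0)
        partial[get_group_id(0)] = locSumArray[0];
}

With a work-group size of one work-item, the PGCL runtime nop-ifies the barriers as described earlier, so the same source also runs efficiently in the single-work-item configuration recommended below.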
- If possible, it is recommended that you write kernels that will be executed by a limited number of work-groups with a work-group size of one work-item. The ideal situation is to limit the number of work-groups to equal the number of available CPU cores on your platform. However, be aware that if you hard code the number of work-groups your code performance won't scale when running on a platform with more CPU cores. Using more than one work-item per work-group can give reasonable performance as long as there is no synchronization between work-items and each work-item performs a significant amount of computation.
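One way to avoid hard coding the number of work-groups is to query the device at run time. The following host-side fragment (a sketch; the device handle is assumed to be valid and error checking is omitted) reads the number of compute units, which on a multi-core CPU device can be expected to correspond to the number of cores, although that mapping should be verified on your platform:

/* Query the device so the work-group count scales with the hardware. */
cl_uint num_compute_units = 1;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(num_compute_units), &num_compute_units, NULL);

/* For example, launch one work-group of one work-item per compute unit. */
size_t global_size = num_compute_units;
size_t local_size  = 1;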
- Further optimization of your OpenCL code on multi-core ARM can be achieved using SIMD instructions executed by the NEON unit. PGCL generates code that uses SIMD instructions for OpenCL kernel code that contains vectorizable loops or when the OpenCL code makes use of vector arithmetic. To enable vectorization, the compiler options -fast -Mvect should be specified when compiling OpenCL kernels:
# pgcl --opencl-flags -fast -Mvect -- sum.cl
When the compiler is applied to the following loop from the summation code described above,
…
for (i=0; i<wgSize; i++) {
    sum += locSumArray[i] ;
}
…

PGCL automatically generates the following ARM NEON SIMD code:
…
.LBB0_5:                  @ %L.B0007
                          @ =>This Inner Loop Header: Depth=1
    vldmia   r12, {d18, d19}
    add      r2, r12, #16
    sub      lr, lr, #8
    add      r12, r12, #32
    vadd.i32 q8, q9, q8
    cmp      lr, #0
    vldmia   r2, {d18, d19}
    vadd.i32 q8, q8, q9
    bgt      .LBB0_5
…

In this code, the loop has been unrolled by a factor of eight and four sums are computed in parallel. While details of handling the final summation and residual iterations (for loops with iteration counts that are not a multiple of eight) are not shown here, they are of course correctly handled by the PGCL compiler.
If you want finer control over vector code generation, you can use OpenCL vector data types. For instance, the summation kernel can be rewritten as follows:
void sumAll4(__local int* locSumArray, int wiIndex, int wgSize)
{
    int i = 0;
    int4 sum4 = 0;
    if (wiIndex == 0) {
        for (i=0; i<wgSize; i+=4) {
            sum4 += *((int4 *)&locSumArray[i]) ;
        }
        locSumArray[0] = sum4.x + sum4.y + sum4.z + sum4.w ;
    }
    barrier(CLK_LOCAL_MEM_FENCE) ;
}

Then, when the PGCL -O2 command-line option is used, the main loop is generated as follows:
…
.LBB0_4:                  @ %L.B0001
                          @ =>This Inner Loop Header: Depth=1
    vldmia   r3, {d18, d19}
    sub      r2, r2, #1
    add      r3, r3, #16
    cmp      r2, #0
    vadd.i32 q8, q9, q8
    bgt      .LBB0_4
…

Note that for correct execution this code assumes that locSumArray is aligned on an int4 boundary. In this case it is the programmer's responsibility to guarantee array alignment, or the code is likely to fail. Note also that the -fast -Mvect options were not used, because in that case the loop would have been vectorized automatically.
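If you cannot guarantee that alignment, one possible alternative (a sketch under that assumption, not part of the article's example) is to load the vectors with the standard OpenCL vload4 built-in, which does not require the source pointer to be vector-aligned:

/* Variant of sumAll4 using vload4, which places no alignment
   requirement on locSumArray (wgSize is assumed to be a multiple of 4). */
void sumAll4_vload(__local int* locSumArray, int wiIndex, int wgSize)
{
    int i;
    int4 sum4 = 0;
    if (wiIndex == 0) {
        for (i=0; i<wgSize; i+=4) {
            sum4 += vload4(0, &locSumArray[i]);  /* alignment-safe vector load */
        }
        locSumArray[0] = sum4.x + sum4.y + sum4.z + sum4.w;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

Whether this produces the same NEON sequence depends on the compiler, so treat it as a portability fallback rather than a guaranteed optimization.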
Conclusion
In this article we have given an overview of the OpenCL execution model and detailed the PGCL runtime implementation for multi-core CPUs as an OpenCL compute device. Understanding the PGCL kernel execution model helps us to define the following guidelines for writing efficient code for CPU-like devices:
- Do a lot of computation per work-item
- Avoid use of OpenCL barrier operations in your OpenCL kernels
- Try to minimize the number of work-items per work-group (one is optimal)
- Try to reduce the number of work-groups, and in particular try to use a number of work-groups that matches the number of CPU cores on the target device
- On multi-core ARM CPUs, take advantage of the NEON SIMD unit either by writing vectorizable loops in your kernels or by using OpenCL vector datatypes
In a future article we will detail how these rules have been applied to port and optimize an OpenCL-enabled application written for a GPGPU compute device to multi-core ARM as a compute device using PGCL.