Technical News from The Portland Group
Targeting AVX-Enabled Processors Using PGI Compilers and Tools
Vectorization for x64 processors identifies and transforms loops to take advantage of packed SSE instructions, which perform wide operations on data drawn from multiple iterations of a loop. PGI produced the first x86/SSE automatic vectorizing compiler in 1999. At that time, Intel was providing and promoting a set of SSE intrinsics. Today, there are nine different header files of intrinsics in the latest Intel compiler distribution, covering each generation of hardware updates going back to the MMX days. In total, they define 781 different routines. Unlike these intrinsics, vectorization is a technology that provides both code and performance portability.
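To make the contrast concrete, here is a small C fragment written both ways (our illustration, not taken from any PGI example): once with SSE2 intrinsics tied to a particular ISA generation, and once as a plain loop that an auto-vectorizing compiler can retarget to SSE, AVX, or anything newer with a simple recompile.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_pd, _mm_add_pd, ... */

/* ISA-specific: hard-wired to 128-bit SSE2, two doubles at a time.
   Assumes n is even. Moving this to AVX means rewriting it. */
void add_sse2(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
    }
}

/* Portable: the compiler vectorizes this for whatever SIMD width
   the target supports; moving to AVX is just a recompile. */
void add_portable(double *c, const double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both produce identical results; only the second survives an ISA change without source changes.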
This article explores the new 256-bit SIMD AVX instructions available in the latest CPUs coming from Intel and AMD. AVX is the successor to 128-bit SIMD SSE instructions in previous generation processors. Using automatic vectorization of loops you can migrate to using AVX instructions through a simple re-compile, rather than a re-write of explicitly coded intrinsics.
The VEX Prefix
AMD and Intel are releasing new microprocessors in 2011 with extended AVX support. Both new processor architectures, the Sandy Bridge from Intel and the Bulldozer from AMD, support the new instruction encoding format called VEX. This instruction prefix allows a set of extensions to the traditional x86 instruction set architecture (ISA) that enable most of the new chip features that are important to compiler designers, library developers, and applications experts who are interested in optimizing performance at the micro-architecture level.
The VEX prefix enables three and four operand syntax. As we will explain later, even if your code does not take advantage of the wider SIMD floating point units, you still possibly can see performance gains due to the flexibility of these new operations.
Traditional x86 instructions contain two operands. One operand usually plays the role of both a source and destination, thus the instruction is destructive to one of its operands. As an example, suppose you need to keep two inputs x and y in registers xmm0 and xmm1, while adding x+y. Prior to VEX, the compiler would generate assembly code that looks like this:
movsd %xmm1, %xmm2
addsd %xmm0, %xmm2
With VEX enablement, the compiler generates only one instruction:
vaddsd %xmm0, %xmm1, %xmm2
256-bit SIMD Support
Probably the most anticipated feature of the new chips for scientists and engineers is the doubling of the SIMD vector width and the width of the floating point pipeline. The new floating point units from both AMD and Intel are now capable of performing four double precision floating-point multiplies and four double precision floating-point adds every cycle. For single precision data, these numbers double. In rough terms, depending on the clock frequency, tuned libraries such as MKL from Intel and ACML from AMD could see 20 or more GFlops/core for double precision routines such as dgemm, and 40 or more GFlops/core for single precision routines.
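As a back-of-the-envelope check (our arithmetic, with an assumed clock frequency, not a vendor figure): four double-precision multiplies plus four adds per cycle is 8 flops/cycle, so a core clocked at 2.5 GHz peaks at 20 GFlops double precision, and twice that in single precision.

```c
/* Peak GFlops = clock frequency (GHz) x floating-point ops per cycle.
   For the new AVX units: 8 DP flops/cycle (4 multiplies + 4 adds),
   16 SP flops/cycle. */
double peak_gflops(double ghz, int flops_per_cycle)
{
    return ghz * flops_per_cycle;
}
```

At an assumed 2.5 GHz, `peak_gflops(2.5, 8)` gives 20.0 DP GFlops/core and `peak_gflops(2.5, 16)` gives 40.0 SP GFlops/core, consistent with the library numbers quoted above.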
To accommodate the wider SIMD vectors, a new set of registers has been introduced. The new ymm registers overlay the current SSE xmm registers that were introduced with the Pentium III. Instructions operating on the ymm registers are enabled by the VEX prefix. Diagram 1 shows how they are mapped onto the existing hardware.
A complicating factor for compiler designers is that the new instructions available for manipulating ymm registers are somewhat limited. The 256-bit SIMD units are more or less divided into two separate lanes of 128-bits each. There are only four instructions which can move data across lane boundaries, and three of the four only operate on 128-bit chunks.
To illustrate potential complexity, consider the following simple Fortran subroutine:
subroutine sum5(a, c, n)
  real*8 a(n+4), c(n)
  do i = 1, n
    c(i) = a(i) + a(i+1) + a(i+2)*2.d0 + a(i+3) + a(i+4)
  end do
end
Compiling this for execution on a second generation x64 processor yields this CCFF message:
sum5:
     3, 2 loop-carried redundant expressions removed with 2 operations and 4 arrays
        Generated 4 alternate versions of the loop
        Generated vector sse code for the loop
        Generated a prefetch instruction for the loop
Using loop redundancy elimination (LRE) optimization, the compiler finds redundant expressions in the intermediate sums and carries those around to the next iteration. When using xmm registers, the generated code for this loop requires one shuffle, but then produces two results for every three packed add operations.
Now with AVX and the dual-lane nature of the floating point units, the processor can perform twice as many additions per cycle, but the staging and manipulation of the data is more complex. To enable the same operation-reducing strategy requires the new VEX vperm2f128 instruction to move data between lanes.
With the latest generation processors, we can produce four results for every three packed add operations, at the cost of one cross-lane transfer and one shuffle within lanes. Contrast this with the straightforward scalar method, which requires five scalar adds (or four adds and one multiply) for every element, and you can begin to see the gains that are possible with an optimizing compiler. With countless possible loop constructs, heuristics and strategies are necessary to ensure optimal code generation.
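The LRE idea can be sketched in scalar C (a hand illustration of the transformation, not the compiler's actual output): the pair sum a(i+3)+a(i+4) computed in one iteration is exactly the a(j)+a(j+1) term needed three iterations later, so each pair can be computed once and reused.

```c
#include <stdlib.h>

/* Naive sum5: 4 adds and 1 multiply per element. */
void sum5_naive(double *c, const double *a, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + a[i+1] + a[i+2]*2.0 + a[i+3] + a[i+4];
}

/* With redundant pair sums hoisted: pair[i] = a[i] + a[i+1] is
   computed once and used twice (as the leading pair of c[i] and
   the trailing pair of c[i-3]), cutting the adds per element from
   4 to roughly 3. A rolling buffer of four pairs would avoid the
   temporary array; a full array keeps the sketch readable. */
void sum5_lre(double *c, const double *a, int n)
{
    double *pair = malloc((n + 3) * sizeof *pair);
    for (int i = 0; i < n + 3; i++)
        pair[i] = a[i] + a[i+1];          /* each pair computed once */
    for (int i = 0; i < n; i++)
        c[i] = pair[i] + a[i+2]*2.0 + pair[i+3];
    free(pair);
}
```

Note the two versions associate the additions differently, so floating-point results can differ in the last bit for general data; the compiler's heuristics weigh exactly this kind of trade-off.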
Another complicating factor for AVX code generation is the difficulty of generating aligned loads and stores. For ymm registers, the optimal alignment is 32 bytes. Using current tool chains, 32 byte alignment can be forced for static data, but it is not produced by most malloc() implementations, nor required of the stack by the ABIs. Thus, it is not easy to force 32 byte alignment for local variables. As was the case in earlier generation processors, unaligned loads and stores, even when the data resides in L1 cache, run substantially slower than aligned loads. The logical fallback position is for the compiler to generate a loop schedule for the sum5 subroutine similar to what is shown in Diagram 5. However, we've found that on both Intel and AMD processors, this runs slower than the original non-AVX code. In fact, we've verified that in some cases breaking up a single 256-bit unaligned load or store (a movupd operation) into two 128-bit unaligned operations can actually run faster.
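Since malloc() gives no 32-byte guarantee, heap data intended for ymm-width accesses can be aligned explicitly. A minimal sketch using C11's aligned_alloc (posix_memalign is the POSIX alternative); the helper name is ours:

```c
#include <stdlib.h>

/* Allocate n doubles on a 32-byte boundary, suitable for aligned
   256-bit (ymm) loads and stores. aligned_alloc requires the size
   to be a multiple of the alignment, so round it up. Free with
   free(). Returns NULL on failure. */
double *alloc_avx(size_t n)
{
    size_t bytes = ((n * sizeof(double) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);
}
```

Static data can instead be aligned with compiler-specific attributes; stack data remains the hard case, as noted above.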
Finally we should mention the new AVX masked move instruction. While we've made some use of this capability in our hand-coded AVX-enabled library routines, because the vmaskmov load operation loads a zero if the mask is zero and because there is no corresponding way to mask the exception generation, another level of analysis would be needed for the compiler to make general use of this for residual loops. Performance issues currently make it unappealing for commonly executed code paths.
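For context, a residual loop is the scalar cleanup a vectorizer emits when the trip count is not a multiple of the vector width; masked loads are one way to absorb it, and a plain scalar tail is the conservative fallback. A sketch of the structure in C (our illustration):

```c
/* Main loop in chunks of 4 (one ymm register of doubles), followed
   by a scalar residual loop for the leftover n % 4 elements -- the
   case a masked move could otherwise handle in vector form. */
void scale_add(double *y, const double *x, double a, int n)
{
    int i;
    int limit = n - (n % 4);
    for (i = 0; i < limit; i += 4) {   /* main loop: full vectors   */
        y[i]   += a * x[i];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)                 /* residual loop: 0-3 elements */
        y[i] += a * x[i];
}
```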
AMD-Specific Code Generation
The AMD Bulldozer chip not only implements the AVX instruction set, but extends it by adding FMA4 instructions as well. FMA, which stands for fused multiply-add, has not, up to this point, been a part of the x86 instruction set, but has long been an instrumental part of traditional embedded processors. In an FMA operation, the result of the multiplier directly feeds the adder, without being rounded or moved to a register. This results in a shorter total pipeline for common multiply-followed-by-add operations found in signal and image processing as well as in matrix multiplication, dot products, and other BLAS routines, and in FFTs. The FMA4 instructions use four operands total, three for sources, and one destination.
On the Bulldozer chips, peak floating-point performance requires using FMA4 instructions. It is easiest to model the performance by thinking of a single pipelined floating-point unit with three inputs, and on each cycle it can initiate some variant of an add, a multiply, or a multiply-add. The results may come out of the unit at different points (cycles) in the pipe, but basically the input ports control the rate of execution.
Because of this design, legacy objects and executables written or compiled using a style of vector packed adds and packed multiply operations won't run at peak performance on a Bulldozer chip. It is important to recompile using the proper target processor flags, and to use Bulldozer-tuned math libraries for best performance.
As an added wrinkle, the floating point unit on a Bulldozer chip is shared between two integer cores. Each integer core can independently perform address generation and sequencer control for the floating-point operations.
One interesting aspect of the Bulldozer chip is that you can compile and tune your floating-point codes to run in 256-bit wide "AVX Mode", where each thread assumes control of, and schedules ymm-based operations over the entire floating point unit. Alternatively, you can compile for a "Shared Mode", where each thread uses either traditional SSE code generation, or newer VEX-based operations on xmm registers, and uses just one lane of the shared floating point resource. VEX-based FMA4 opcodes working on xmm registers can be used to achieve peak performance on multi-threaded codes.
Diagram 7 shows two versions of code generated for a daxpy routine and the FMA4 operations that are used. The PGI compiler can produce code for either mode based on compiler command-line options. It's still too early to determine which mode will be default.
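For reference, the daxpy kernel behind both listings is just the standard BLAS-style loop below (unit stride assumed); the xmm version works through it two doubles per fmadd, the ymm version four.

```c
/* daxpy: y = a*x + y, the loop both schedules below implement.
   One multiply-add per element, which is why FMA4 maps onto it
   so directly. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```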
.LB1_427:
        vmovapd  (%rax,%r9), %xmm1
        vfmaddpd %xmm1,(%rax,%r10),%xmm0,%xmm3
        vmovapd  %xmm3, (%rax,%r9)
        vmovapd  16(%rax,%r9), %xmm2
        vfmaddpd %xmm2,16(%rax,%r10),%xmm0,%xmm3
        vmovapd  %xmm3, 16(%rax,%r9)
        . . .
        addq     $64, %rax
        subl     $8, %edi
        testl    %edi, %edi
        jg       .LB1_427
------------------------------
.LB1_427:
        vmovapd  (%rax,%r9), %ymm2
        vfmaddpd %ymm2,(%rax,%r10),%ymm0,%ymm3
        vmovapd  %ymm3, (%rax,%r9)
        vmovapd  32(%rax,%r9), %ymm4
        vfmaddpd %ymm4,32(%rax,%r10),%ymm0,%ymm3
        vmovapd  %ymm3, 32(%rax,%r9)
        . . .
        addq     $128, %rax
        subl     $16, %edi
        testl    %edi, %edi
        jg       .LB1_427
Reconsidering the discussions in the previous section about the difficulties manipulating data between lanes, and guaranteeing aligned loads and stores of 256-bit data, it might turn out that for Bulldozer targets, 128-bit VEX enabled code is the best approach. Also, for codes where the data is not all in registers or the lowest level caches, the floating point unit might be idle for many cycles. Tighter code might run more efficiently than wider-vector code, especially when it is multi-threaded. We will continue our experiments in this area.
Users should be aware that results obtained using FMA operations may differ in the lowest bits from results obtained on other x64 processors. The intermediate result fed from the multiplier to the adder is not rounded to 64 bits.
Intel-Specific Code Generation
Running either of the above loops on an Intel Sandy Bridge machine, or any Intel-based processor for that matter, will produce this result, because Intel processors do not implement the FMA4 instructions:
bash-4.1$ ./a.out
Illegal instruction (core dumped)
There seems to be no question that on a Sandy Bridge processor it is best to run simple, cache-based loops using 256-bit wide AVX enabled code. Sandy Bridge optimized code generated for the same daxpy operation is shown below in Diagram 8.
.LB1_485:
        vmulpd  (%rdi,%r10), %ymm0, %ymm1
        vaddpd  (%rdi,%r9), %ymm1, %ymm2
        vmovapd %ymm2, (%rdi,%r9)
        vmulpd  32(%rdi,%r10), %ymm0, %ymm1
        vaddpd  32(%rdi,%r9), %ymm1, %ymm2
        vmovapd %ymm2, 32(%rdi,%r9)
        . . .
        addq    $128, %rdi
        subl    $16, %eax
        testl   %eax, %eax
        jg      .LB1_485
This version runs fairly well on an AMD Bulldozer system, but it is not optimal. On both chips, loops like this are limited by the L1 bandwidth. For instance, the Sandy Bridge has two 128-bit ports for loading data from L1 cache, and so it is capable of loading four double precision floating-point values every cycle. This daxpy operation using ymm registers could consume eight inputs each cycle: four for the multiplier and four for the adder. And on Sandy Bridge, as is typical of many x64 processors, the bandwidth storing to L1 cache is half of that for loads.
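A rough model of that limit (our arithmetic, using the assumed Sandy Bridge-like figures above of 32 bytes/cycle of L1 load bandwidth and half that for stores):

```c
/* Minimum L1-bound cycles per ymm daxpy iteration (4 doubles):
   load 32 B of x plus 32 B of y, store 32 B of y back, against
   assumed peaks of 32 B/cycle loaded and 16 B/cycle stored. */
double daxpy_min_cycles_per_ymm(void)
{
    double load_cycles  = (32.0 + 32.0) / 32.0;  /* x and y in    */
    double store_cycles = 32.0 / 16.0;           /* y back out    */
    /* Loads and stores issue on separate ports, so the slower
       stream sets the floor. */
    return load_cycles > store_cycles ? load_cycles : store_cycles;
}
```

At two or more cycles per four elements, the memory ports, not the 256-bit floating-point units, bound this loop, which is why widening the arithmetic alone does not double daxpy performance.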
One final issue worth mentioning concerning Sandy Bridge code generation is the need for the vzeroupper instruction. This instruction zeroes out the upper 128 bits of all of the ymm registers and marks them as clean. If you mix 256-bit AVX instructions with legacy SSE instructions that use xmm registers, you will incur performance penalties of roughly one hundred cycles at each transition between legacy code and AVX code. This is because the processor must save the state of the upper 128-bits of the ymm registers at the first legacy instruction so that they can be restored upon issuing another AVX instruction. To avoid this problem, use vzeroupper before the transition between these two code sequences can occur. This instruction seems to have minimal overhead.
In the PGI compiler, when generating AVX instruction sequences, we generate a vzeroupper instruction right before a call is made. We do this because we cannot always guarantee or assume how the callee has been compiled. Also, when writing library routines or compiling functions that perform AVX instruction sequences, we generate a vzeroupper instruction right before returning, again because we cannot make assumptions about how the caller was compiled. This is always safe because the ABI specifies that parameters are never passed to, or returned in, the high half of the ymm registers.
The new AVX-enabled processors require substantial changes in the compiler generated code to achieve optimal performance. The major points of difference are well understood, but tuning will continue for the foreseeable future, and continue to evolve as new revisions of these architectures are released. An incompatible mix of code can lead to disastrous performance penalties. The PGI Unified Binary technology can assist ISVs and other users who need to produce code optimized for all possible x64 architectures in a single executable. Differences in processor architectures have narrowed in the last few years, but with the release of the Bulldozer processor and its use of the FMA4 instruction, it looks as though we're headed for another period of divergence. Care must be taken when choosing a set of target processors to avoid performance incompatibilities or even worse, instruction faults.