October 2011
A Look at the Impact of AMD Bulldozer FMA4 Instructions on Data Precision
Introduction
Back in 2008, AMD and Intel proposed an extension to the x86 instruction set. AVX, or Advanced Vector Extensions, introduces a three-operand SIMD instruction format in which the destination register is distinct from the two source operands. Intel's Sandy Bridge architecture and AMD's Bulldozer architecture both support the new 256-bit SIMD AVX instructions. AMD extended the AVX instruction set by adding FMA4 instructions to Bulldozer. These FMA4 instructions (FMA stands for fused multiply-add) are four-operand floating-point instructions. They perform a multiply and an add in one step, avoiding the intermediate rounding performed by earlier processors.
For example, consider the operation d = a + b * c. Using a non-fused multiply-add sequence, the product b * c is computed first and rounded to N significant bits. That rounded product is then added to a, and the final result is rounded again to N significant bits before being stored in d. A fused multiply-add instead computes the entire term a + b * c to full precision and rounds only once, when producing the final result. FMA4 instructions are also usually faster than the equivalent non-fused sequence, which is the main reason for their existence.
The FMA4 instructions can speed up and improve the accuracy of many computations that involve the accumulation of products like:
- Dot product
- Matrix multiplication
- Polynomial evaluation
An Example
Users should be aware, however, that the use of these FMA4 instructions can cause slight differences in program results. This can be an issue for applications that require bit-for-bit agreement with results from previous-generation x86 processors. For example, consider the following program, which computes d = a + b * c, where a, b, c and d are double-precision scalar variables. The values for a, b and c are given in hexadecimal.
> cat testfma.f90
program testfma
  double precision :: a, b, c, d
  a = Z'3c54c9b71a0e6500'   ! 4.507E-018
  b = Z'bf43a04556d864ae'   ! -5.989E-004
  c = Z'bfc55364b6b08299'   ! -0.166
  d = 0.0d0
  call nofma(a, b, c, d)
  write(6,100) "Without FMA4: ", d, "(", d, ")"
  d = 0.0d0
  call fma(a, b, c, d)
  write(6,100) "With FMA4: ", d, "(", d, ")"
100 format (" ", a15, Z16, a1, e22.16, a1)
end program
> cat nofma.s
.text
.align 16
.globl nofma_
nofma_:
vmovsd (%rdx), %xmm0 #fetch c
vmulsd (%rsi), %xmm0, %xmm1 #b * c
vaddsd (%rdi), %xmm1, %xmm2 #a + b * c
vmovsd %xmm2, (%rcx) #store result in d
ret
.type nofma_,@function
.size nofma_,.-nofma_
> cat fma.s
.text
.align 16
.globl fma_
fma_:
vmovsd (%rdx), %xmm0 #fetch c
vmovsd (%rdi), %xmm2 #fetch a
vfmaddsd %xmm2, (%rsi), %xmm0, %xmm3 #a + b * c
vmovsd %xmm3, (%rcx) #store result in d
ret
.type fma_,@function
.size fma_,.-fma_
To compile, link and execute the test case:
> as -o fma.o fma.s
> as -o nofma.o nofma.s
> pgfortran -o testfma testfma.f90 fma.o nofma.o
> ./testfma
 Without FMA4:  3F1A28A5F3777EAD(0.9978783497599049E-04)
 With FMA4:     3F1A28A5F3777EAC(0.9978783497599047E-04)
The above program computes d = a + b * c two ways. Function nofma first computes b * c, then adds the rounded product to a, giving d = 0x3F1A28A5F3777EAD. If the FMA4 instruction is used instead, by calling function fma, the result becomes d = 0x3F1A28A5F3777EAC. The two results differ in the last bit.
Consider, though, if the value for a is changed:
!a = Z'3c54c9b71a0e6500'
a = z'bF1A28A5F3777D60'

 Without FMA4:  0000000000000000(0.0000000000000000E+00)
 With FMA4:     BBBAD89127ADE008(-.5684854190555145E-20)
The absolute error is roughly the same, but the two results are now completely different: one is exactly zero and the other is not, a difference large enough to perturb many programs. As a developer, it can be difficult to predict where your answers will fall. One tool at your disposal is the support for changing rounding modes, now available in most languages. For instance, in Fortran (via the intrinsic ieee_arithmetic module):
call ieee_set_rounding_mode(ieee_up)
call nofma(a,b,c,d)
write(6,100) "UP Without FMA4: ",d,"(",d,")"
call ieee_set_rounding_mode(ieee_down)
call nofma(a,b,c,d)
write(6,100) "DOWN Without FMA4:",d,"(",d,")"
UP Without FMA4: 0000000000000000(0.0000000000000000E+00)
DOWN Without FMA4: BBD0000000000000(-.1355252715606881E-19)
With FMA4: BBBAD89127ADE008(-.5684854190555145E-20)
The result of an individual FMA4 operation should always fall between the rounded-up and rounded-down results of the corresponding non-fused multiply-followed-by-add. The FMA result is the most accurate of the three; it just may not be what you expect, or what you have been led to believe is correct.
This kind of discrepancy has arisen before. For example, when targeting the Pentium III, the x87 floating-point unit holds intermediates in extended precision, which produces yet another answer. You can try coding the operation in Fortran or C and compiling like this:
> pgfortran -tp piii testfma.f90 -o tstack
> ./tstack
 Without FMA4:  BBBAD80000000000(-.5684385169464406E-20)
In another experiment, we ran one million points through the fma and nofma code. The b and c values were randomly spread over the interval [ -1, 1 ) and the a values were randomly spread over [ -0.5, 0.5 ). This is the distribution in bits of difference we saw from the two resulting d values:
| Bits of difference | Number in Bin |
|---|---|
| 0 | 765215 |
| 1 | 179440 |
| 2 | 33027 |
| 3 | 12477 |
| 4 | 5213 |
| 5 | 2307 |
| 6 | 1167 |
| 7 | 593 |
| 8 | 308 |
| 9 | 125 |
| 10 | 67 |
| 11 | 23 |
| 12 | 19 |
| 13 | 6 |
| 14 | 3 |
| 15 | 6 |
| 16 | 1 |
| 17 | 1 |
| 21 | 1 |
| 25 | 1 |
Of course, all of this analysis is for a single FMA operation. In real programs, this is just one contributing calculation among millions or billions of others. But understanding the fundamental differences in generated code and resulting precision is important.
Disabling FMA4 Code Generation
Users may want to disable the use of the FMA4 instructions in order to achieve bit-for-bit agreement with results previously generated on non-FMA4 processors. This can be accomplished with the following PGI compiler switch:
-nofma
By default, the PGI 2011 version 11.9 and later compilers will enable FMA4 code generation when targeting Bulldozer CPU cores.
It should also be noted that current GPUs from NVIDIA, such as the Tesla C1060 and C2070, also support fused multiply-add instructions. As with the Bulldozer CPU cores, generation of these instructions can be disabled with the following compiler flag:
-ta=nvidia,nofma
Conclusion
Because the intermediate product is no longer rounded, and effectively enters the adder at full precision during an FMA4 operation, users of AMD Bulldozer chips may see different results than on previous-generation x86 processors. These differences may be slight, or they may be major, depending on how conditional checks are done, how tolerances are determined, and on the order of operations. Most of those issues are outside the scope of this paper, but here we have shown some ways to determine what difference can be expected from a single FMA4 operation.