# BASM for Beginners

## BASM for beginners, lesson 7

Welcome to lesson number 7. Today’s subject is floating point BASM. This was also the subject of an earlier lesson, but this lesson will add new information. We will look at how to code scalar SSE2 code and how instructions are scheduled in the fp pipelines. Today’s example function is evaluating a third order polynomial.

function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;

Instead of just analyzing and optimizing this function we will see an actual use of it. A third order polynomial can approximate the ArcSin function on its entire interval, [-1;1], with a maximal absolute error of 0.086. This is not impressive but what we develop in this lesson will extend to higher order polynomials in a straightforward manner, yielding higher precision.
The parameters A, B, C and D define the shape of the curve for the function and the values for a fit to the ArcSin approximation with minimal error must be found. For this purpose we develop an optimizer, which is also used as the benchmark. Because ArcSin(0) = 0 we immediately see that D=0 and D can be left out of optimization. We also know that ArcSin is an odd function and therefore the second order term B*X*X is of no use in the approximation. This is because a second order term is even and has symmetry around the Y-axis. Odd order functions have anti symmetry around the Y-axis with f(X) = -f(-X). All this means that our function could be reduced to

Result := A*X*X*X + C*X;

We do however not do that because it is more illustrative to use the full function. ArcSin is a special case and we want to be as general as possible.

Function number 1a has 6 multiplications and 3 additions. Writing it on Horner form

Result := ((A*X + B)*X + C)*X + D;

reduces this to 3 multiplications and 3 additions.

Another form is

Result := (A*X + B)*(X*X)+(C*X + D);

which has 4 multiplications and 3 additions.

On modern processors it is as important how much parallelism can be extracted from the formula as it is how many multiplications and additions it has. Modern processors such as AMD Athlon, Intel P4 and P3 have pipelines. Pipelines are a necessity on processors running at high frequencies, because the amount of work for an addition, subtraction, multiplication or division can not be done in a single clock cycle. On P4 there is a pipeline called FP_ADD that handles addition and subtraction. This pipeline has 5 stages, which means that the process of performing an addition or subtraction is broken down into 5 sub tasks. Therefore addition and subtraction is done in 5 clock cycles. The advantage of having a pipeline is that even though it takes 5 cycles to complete an addition a new one can be started in every clock cycle. This is because the first add in a series leaves the first stage of the pipeline in the second clock cycle and then this stage can accept add number 2. If we have this series of adds the first one will leave the pipeline in cycle 5, number 2 will leave in cycle 6 etc. Throughput is one add per cycle. Parallelism is achieved by having up to 5 additions/subtractions in flight in the pipeline at the same time. The drawback is that if a second of two following additions need the output from the first addition it has to wait for the first one to complete. We say that there is a data dependency between the two instructions and we see the full latency of the additions which is 2 times 5 cycles.
Let’s use our function as an example to see how pipelines work.

Result := A*X*X*X + B*X*X + C*X + D;

It is seen that the 4 terms can be evaluated in parallel and then be added as the final act Of course term number 4 is not "evaluated". A*X is the first operation ready to be scheduled to run in the F_MUL pipeline. The latency for FMUL on P4 is 7 cycles and A*X will be ready in cycle 7. FMUL has a throughput of 2 cycles. From this we see that FMUL is not fully pipelined. The pipeline will accept a new instruction in cycle 3 and not in cycle 2. B*X is the second instruction ready to execute and it will start in cycle 3. In cycle 5 the pipeline is again ready to receive a new instruction and this is C*X. In cycle 7 A*X is complete and (A*X)*X can start in cycle 8. In cycle 10 B*X is finished and (B*X)*X will start. In cycle 12 C*X is ready to go to the F_ADD pipeline to be added to D. In cycle 15 is (A*X)*X finished and (A*X*X)*X can start. In cycle 17 are (B*X)*X and (C*X) + D complete and they can start in the F_ADD pipeline. This addition completes in cycle 21, where (A*X*X)*X is also ready Then the last addition can start in cycle 22. Now there is only one operation in flight and we must lean back and wait for the full latency of FADD, which is 5 cycles. In clock 27 the last addition is finished and the job is done.
These tables give the details. The left column symbolizes the F_MUL pipeline with 7 stages and the right column symbolizes the F_ADD pipeline with 5 stages.

A*X

Cycle 1

A*X

Cycle 2

B*X

A*X

Cycle 3

B*X

A*X

Cycle 4

C*X

B*X

A*X

Cycle 5

C*X

B*X

A*X

Cycle 6

C*X

B*X

A*X
Cycle 7

(A*X)*X

C*X

B*X

Cycle 8

(A*X)*X

C*X

B*X
Cycle 9

(B*X)*X

(A*X)*X

C*X

Cycle 10

(B*X)*X

(A*X)*X

C*X
Cycle 11

(C*X)+D

(B*X)*X

(A*X)*X

Cycle 12

(C*X)+D

(B*X)*X

(A*X)*X

Cycle 13

(C*X)+D

(B*X)*X

(A*X)*X
Cycle 14

(A*X*X)*X

(C*X)+D

(B*X)*X

Cycle 15

(A*X*X)*X

(C*X)+D

(B*X)*X
Cycle 16

(B*X*X)+(C*X+D)

(A*X*X)*X

Cycle 17

(B*X*X)+(C*X+D)

(A*X*X)*X

Cycle 18

(B*X*X)+(C*X+D)

(A*X*X)*X

Cycle 19

(B*X*X)+(C*X+D)

(A*X*X)*X

Cycle 20

(B*X*X)+(C*X+D)

(A*X*X)*X
Cycle 21

(A*X*X*X)+ (B*X*X+C*X+D)

Cycle 22

(A*X*X*X)+ (B*X*X+C*X+D)

Cycle 23

(A*X*X*X)+ (B*X*X+C*X+D)

Cycle 24

(A*X*X*X)+ (B*X*X+C*X+D)

Cycle 25

(A*X*X*X)+ (B*X*X+C*X+D)

Cycle 26

Finished

Cycle 27

An out of order core schedules instructions to run as soon as data and resources are ready. Resources are registers and execution pipelines. I do not know for sure, but I think that instructions are scheduled to run in program order except when an instruction stalls. In this situation the next ready instruction in program order is scheduled. The stalled instruction will be started as soon as the stall reason is removed. It can stall because of missing resources or not ready data.

After having seen how instructions are scheduled to run in the pipelines of a P4 we establish the benchmark. The benchmark is an optimizer that search for the best possible fit of our polynomial to ArcSin. It is based on the most simple of all optimization algorithms, which is the exhaustive search. We simply try a lot of combinations of parameter values and remember the set of parameters which give the best fit. A and C are run in the intervals [AStart; AEnd] and [CStart; CEnd], with the stepsizes AStepSize and CStepsize. This is done in two nested while loops.

StartA := 0;
StartC := -1;
EndA := 1;
EndC := 1;
AStepSize := 1E-2;
CStepSize := 1E-3;
OptA := 9999;
OptC := 9999;
A := StartA;
while A <= EndA do
begin
C := StartC;
while C <= EndC do
begin
Inc(NoOfIterations);
MaxAbsError := CalculateMaxAbsError(A,C, ArcSinArray);
if MaxAbsError <= MinMaxAbsError then
begin
MinMaxAbsError := MaxAbsError;
OptA := A;
OptC := C;
end;
C := C + CStepSize;
end;
A := A + AStepSize;
end;

The CalculateMaxAbsError function calculates a number of points on the X interval [-1;1], which is the definition interval of the ArcSin function

TMainForm.CalculateMaxAbsError(A, C : Double; ArcSinArray : TArcSinArray) : Double;
var
X, Y, D, B, Yref, Error, AbsError, MaxAbsError : Double;

begin
B := 0;
D := 0;
MaxAbsError := 0;
X := -1;
repeat
Yref := ArcSin (X);
Y := ArcSinApproxFunction(X, A, B, C, D);
Error := Yref-Y;
AbsError := Abs(Error);
MaxAbsError := Max(MaxAbsError, AbsError);
X := X + XSTEPSIZE;
until(X > 1);
Result := MaxAbsError;
end;

At each point we calculate the error by subtracting the Y value from our approximation function from the reference Y value obtained from a call to the Delphi RTL ArcSin function. The error can be positive or negative, but we are interested in the absolute value. We remember the biggest absolute error by taking the maximum of the two values MaxAbsError and AbsError assigning it to MaxAbsError. MaxAbsError is initialized to zero and in the first evaluation it will get the value of the first error (if it is bigger than zero). The MaxAbsError is returned as the result from the function after a full sweep has been completed. In the optimizer function the two values of A and C that gave the smallest maximum error are remembered together with the actual MinMaxAbsError.
All that matters in an optimizer of this kind is to be able to evaluate as many parameter combinations as possible. For this purpose we must optimize the optimizer ;-) and the functions we evaluate. In this lesson the purpose is slightly different because all we want is some valid benchmarks for the functions we optimize. The means are however the same, the code of the optimizer must take as few cycles as possible such that the cycles the functions use is the biggest part of the total number of cycles used. The first optimizer optimization that is done is to realize that there is no need to evaluate the reference function over and over again. It returns, of course, the same values no matter which values A and C have. This is done by sweeping the reference function once and storing the Yref values in an array.
The next optimization is to compact the lines that evaluate the MaxAbsError

Long version

Yref := ArcSinArray[I];
Error := Yref-Y;
AbsError := Abs(Error);

To the short version

AbsError := Abs(ArcSinArray[I]-Y);

This helps because Delphi creates a lot of redundant code, when compiling FP code.

The long version compiles into this

Yref := ArcSinArray[I];

mov eax,[ebp-\$14]
mov edx,[eax+ebx*8]
mov [ebp-\$48],edx
mov edx,[eax+ebx*8+\$04]
mov [ebp-\$44],edx

Error := Yref-Y;

fld qword ptr [ebp-\$48]
fsub qword ptr [ebp-\$30]
fstp qword ptr [ebp-\$50]
wait

AbsError := Abs(Error);

fld qword ptr [ebp-\$50]
fabs
fstp qword ptr [ebp-\$10]
wait

There are a lot of redundancies in this code and we must conclude that Delphi is doing a bad job on optimizing floating point code. Let us add some explanations to the code. The first line of Pascal is an assignment of one double variable to another. This is done by to pairs of mov, one for the lower 4 byte of the variables and one for the upper 4 byte. The first line of asm loads the address of the array into eax, which is used as base for addressing into the array. Ebx is I and it is multiplied by 8 because each entry in the array is 8 byte. The 4 byte offset in the last two lines (in the last line it is hidden!) is moving the reference into the middle of element I.
Yref is located in the stack frame at [ebp-\$48] and is loaded by the first line of FP code. Y is located at [ebp-\$30] and is subtracted from Yref by fsub. The result is Error and this is stored at [ebp-\$50].
The last line of Pascal compiles into four lines of asm of which the first starts loading Error. Saving and then loading Error is redundant and the optimizer should have removed it. Fabs is the Abs function and is probably one of the shortest function implementations seen ;-) The Delphi compiler does not have the inlining optimization, but it applies “compiler magic” to a few functions, one of which is Abs. The last line stores AbsError on the stack.

The short version compiles into this

mov eax,[ebp-\$14]
fld qword ptr [eax+ebx*8]
fsub qword ptr [ebp-\$30]
fabs
fstp qword ptr [ebp-\$10]
wait

Here there is no redundancy and the compiler should have emitted this code as result of the long Pascal version as well. All lines of code are present in the long version, but all redundant lines have been removed. The first line loads the base address of the array into eax. The second line loads element I, I is in ebx, unto the fp stack. The third line subtracts Y from Yref. The fourth line is the Abs function. The fift line stores the result in the AbsError variable.

There is a peculiarity with this benchmark that I have no explanation of. The benchmark values are heavily influenced by the way it is activated. If the keyboard is used to hit the button we get a different score from the one we get by clicking the button with the mouse! The one who can explain this will probably get the Nobel prize in Delphi;-)

Another irritating thing is that Delphi does not align double variables properly. They shall be 8 byte aligned but Delphi only 4 byte aligns them. The penalty we can get from this is a level 1 cache line split (and L2 cache line splits as well). Loading a variable that splits over two cache lines takes the double time of loading one that does not. Because double variables are 8 byte and the L1 cache line of P4 is 64 byte at most 1 of 8 variables can have a split. On P3 that has 32 byte L1 cache lines it is 1 of 4.
The ideal situation is that variables are aligned at 4 byte if they are 4 byte big, 8 if 8 byte etc. To make things simple let us imagine that the first line in the level one cache is where our variables are loaded. The first line starts at address 0, that is - memory from address 0 is loaded into it. Our first variable is aligned and occupies the first 8 bytes at line 1. Variable number two occupies byte 9-16 …. Variable number 8 occupies byte 57-64 and does not cross the line boundary. If variables are only 4 byte aligned the first one could start at byte 4 and number 8 could start at byte 61. The first 4 byte of it will be in line 1 and the next 4 bytes will be in line 2. The processor loads it by first loading the lower part and then loading the higher part instead of loading it all in one go.

Because of this misalignment of double variables in Delphi our benchmark will not be as stable as we could wish. Alignment can change when we recompile especially if code is changed. I have chosen (a bad choice) to not include code to align variables in the benchmark, but will give an example of it in a later lesson.

Let us dive into the first function optimization. We start with the function that uses the naive formula in textbook format.

function ArcSinApprox1a(X, A, B, C, D : Double) : Double;
begin
Result := A*X*X*X + B*X*X + C*X + D;
end;

This function implementation scores the benchmark 43243 on my P4 1600 MHz clocked at 1920 MHz

Delphi compiled it like this

function ArcSinApprox1b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
}
Result := A*X*X*X + B*X*X + C*X + D;
{
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$18]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;

The code from the CPU view will not compile because of the instruction faddp st(1) and we remove st(1). As default the faddp instruction operates on st(0), st(1) and there is no need to write it out.

function ArcSinApprox1c(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$18]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
pop ecx
pop ecx
end;

First we observe that there is no need to set up a stackframe. The stack is actually used for storing the result temporarily and reloading it again in the lines

fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]

but the basepointer and not the stackpointer is used for this. The lines that use ebp plus a value are accessing the parameters which are located above the basepointer, which is in the calling functions stackframe. The stackpointer is not used at all in the function and changing its value is meaningless. The mov ebp,esp instruction added by the compiler together with the line add esp,-\$08 creates an 8 byte stackframe. Because these lines change the ebp register it is necessary to back it up by pushing it to the stack. Unfortunately we can only remove the add esp, 8 line and the two pop ecx lines that has the purpose of subtracting 8 bytes from the stackpointer, esp.

function ArcSinApprox1d(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$18]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
//pop ecx
//pop ecx
end;

This function implementation scores the benchmark 42391 and performance actually dipped a little.

The compiler inserts the line mov ebp,esp and we can make it redundant by using esp instead of ebp.

function ArcSinApprox1e(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld qword ptr [ebp+\$20]
fld qword ptr [esp+\$20]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fld qword ptr [ebp+\$18]
fld qword ptr [esp+\$18]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fld qword ptr [ebp+\$10]
fld qword ptr [esp+\$10]
//fmul qword ptr [ebp+\$28]
fmul qword ptr [esp+\$28]
//fstp qword ptr [ebp-\$08]
fstp qword ptr [esp-\$08]
wait
//fld qword ptr [ebp-\$08]
fld qword ptr [esp-\$08]
end;

Unfortunately the compiler still inserts the mov instruction and we performed a copy propagation that gave no optimization because it is not followed by a dead code removal. Therefore performance is almost the same – 43094.

Without investigating whether the result stored on the stack is used we can optimize the lines coping it there and reloading it. The result of them is that there is a copy of Result left on the stack. They redundantly pop the results of the FP stack and reload Result from the stack. This single line has the same effect, but redundancy is removed.

fst qword ptr [ebp-\$08]

This optimization is very often possible on Delphi generated code and is important to remember.

function ArcSinApprox1f(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+\$20]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fld qword ptr [esp+\$18]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fld qword ptr [esp+\$10]
fmul qword ptr [esp+\$28]
//fstp qword ptr [esp-\$08]
fst qword ptr [esp-\$08]
wait
//fld qword ptr [esp-\$08]
end;

This function implementation scores the benchmark 47939 and the improvement is 11 %

The next question to ask is: Is the copy of the Result on the stack ever used? To answer it we must inspect the code at the location of the call to the function.

Y := ArcSinApproxFunction(X, A, B, C, D);

call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp-\$30]
wait

The first line after the call stores the result in Y and pops the stack. Seeing this we assume that the result on the stack is not used, but to be sure we must scan through the rest of the code too. If the rule for the Register calling convention is that FP results are transferred on the FP stack it is weird that a copy is also placed on the stack. We conclude that it is redundant to copy the Result to the stack and remove the line doing it.

function ArcSinApprox1g(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+\$20]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fld qword ptr [esp+\$18]
fmul qword ptr [esp+\$28]
fmul qword ptr [esp+\$28]
fld qword ptr [esp+\$10]
fmul qword ptr [esp+\$28]
//fst qword ptr [esp-\$08]
wait
end;

This function implementation scores the benchmark 47405

Instead of writing all the qword ptr [esp+\$xx] lines we can write the names of the variables and let the compiler translate them into addresses. This actually makes the code more robust. If the location of the variables should change then the code breaks if we use hard coded addressing. This will however only happen if the calling convention is changed and this is not likely to happen very often.

function ArcSinApprox1g_2(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
//fld qword ptr [esp+\$20]
fld A
//fmul qword ptr [esp+\$28]
fmul X
//fmul qword ptr [esp+\$28]
fmul X
//fmul qword ptr [esp+\$28]
fmul X
//fld qword ptr [esp+\$18]
fld B
//fmul qword ptr [esp+\$28]
fmul X
//fmul qword ptr [esp+\$28]
fmul X
//fld qword ptr [esp+\$10]
fld C
//fmul qword ptr [esp+\$28]
fmul X
wait
end;

Try having both types of lines enabled

fld qword ptr [esp+\$20]
fld A

and see in the CPU view how the compiler generated exactly the same code for both versions.

X is used in a lot of lines and it is referenced on the stack. Therefore it is loaded from the stack into the internal FP registers every time. It should be faster to load it once into the FP stack and let all uses reference the FP stack.

function ArcSinApprox1h(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+\$20]
fld qword ptr [esp+\$28] //New
fxch
//fmul qword ptr [esp+\$28]
fmul st(0),st(1)
//fmul qword ptr [esp+\$28]
fmul st(0),st(1)
//fmul qword ptr [esp+\$28]
fmul st(0),st(1)
fld qword ptr [esp+\$18]
//fmul qword ptr [esp+\$28]
fmul st(0),st(2)
//fmul qword ptr [esp+\$28]
fmul st(0),st(2)
fld qword ptr [esp+\$10]
//fmul qword ptr [esp+\$28]
fmul st(0),st(2)
ffree st(2)
fst qword ptr [esp-\$08]
wait
end;

The second line is one we added and it loads X once and for all. Because it places X on the top of the stack in st(0) and this position is needed as temp for further operations we exchange st(0) with st(1) with the fxch instruction. We could of course have changed the position of line 1 and 2 and obtained the same effect. All the lines multiplying st(0) with X

fmul qword ptr [esp+\$28]

are changed to

fmul st(0),st(1)

after the FP copy of X has been used for the last time we remove it with the instruction ffree.

This function implementation scores the benchmark 46882 and performance is decreased by 1 %. This was a surprise. Fxch is claimed by Intel to be nearly free, because it works by renaming the internal registers. Let us check that by removing it

function ArcSinApprox1i(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+\$28]
fld qword ptr [esp+\$20]
//fld qword ptr [esp+\$28]
//fxch
fmul st(0),st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fld qword ptr [esp+\$18]
fmul st(0),st(2)
fmul st(0),st(2)
fld qword ptr [esp+\$10]
fmul st(0),st(2)
ffree st(2)
wait
end;

This function implementation scores the benchmark 45393 and performance is decreased by 3 %. Fxch is surely not to blame because performance once more went down. What is going on?

The wait instruction was discussed in an earlier lesson and this time we just remove it.

function ArcSinApprox1j(X, A, B, C, D : Double) : Double;
asm
//Result := A*X*X*X + B*X*X + C*X + D;
fld qword ptr [esp+\$28]
fld qword ptr [esp+\$20]
fmul st(0),st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fld qword ptr [esp+\$18]
fmul st(0),st(2)
fmul st(0),st(2)
fld qword ptr [esp+\$10]
fmul st(0),st(2)
ffree st(2)
//wait
end;

Performance went down to 44140.

Let us cross check these surprising results by running the functions on a P3.

ArcSinApprox1a 63613
ArcSinApprox1b 64412
ArcSinApprox1c 64433
ArcSinApprox1d 65062
ArcSinApprox1e 64830
ArcSinApprox1f 62598
ArcSinApprox1g 79586
ArcSinApprox1h 85361
ArcSinApprox1i 80515
ArcSinApprox1j 80192

First of all we see that ArcSinApprox1h is the fastest function on P3. Thereby it is seen that loading data from the level 1 data cache is more expensive on P3 than on P4, because changing the code such that X is loaded only once improved performance on P3 and not on P4. On the other hand we could also say that it must always be slower to get data from the cache than from internal registers if the architecture is sound and this is only true for the P6 architecture here. P4 has a fast L1 data cache which can be read in only 2 clock cycles but an internal register read should have a latency of only one cycle. It however looks like reads from registers are 2 clock cycles.

Then we see that a P3 at 1400 nearly 80 % faster than a P4 at 1920 on this code. We know that latencies on P3 are shorter, but this is not enough to explain the huge difference.

The latencies and throughput of the used instructions are on P3

Fadd latency is 3 clock cycles and throughput is 1
Fmul latency is 5 clock cycles and throughput is 1

On P4

Fadd latency is 5 clock cycles and throughput is 1
Fmul latency is 7 clock cycles and throughput is 2

I could not find numbers for fld

The explanation for the very bad performance of P4 on this code must be the 2 cycle throughput on fmul together with the slow FP register access. The fmul pipeline only accepts a new instruction on every second cycle where the P3 pipeline accepts one every cycle.

Scaling the results by clock frequency

47939 / 1920 = 25
85361 / 1400 = 61

reveals that clock by clock on the fastest function version for each processor P3 is nearly 2.5 times faster than P4. This is truly astonishing. If P4 should have a slight chance against a P3 we must remove some of those multiplications. This leads us to the Horner version of our function.

function ArcSinApprox3a(X, A, B, C, D : Double) : Double;
begin
Result := ((A*X + B)*X + C)*X + D;
end;

Which is compiled into

function ArcSinApprox3b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
}
Result := ((A*X + B)*X + C)*X + D;
{
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;

The first three versions of this function are identical and they surprisingly score the same benchmark. Our benchmark is not perfect but it was precise this time ;-)

ArcSinApprox3a 45076
ArcSinApprox3b 45076
ArcSinApprox3c 45076

Optimization follows the same pattern as on the first function. Here is the first BASM version with no optimizations. The out commented code is supplied by the compiler.

function ArcSinApprox3c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
pop ecx
pop ecx
//pop ebp
end;

First thing is to remove the add esp,-\$08 line and the two pop ecx. They are setting up a stackframe and does nothing but manipulate the stackpointer, which is not used at all.

function ArcSinApprox3d(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
//pop ecx
//pop ecx
end;

This function implementation scores the benchmark 43535.

Both of the redundant lines copying the result to stack and back are removed at the same time.

function ArcSinApprox3e(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
//fstp qword ptr [ebp-\$08]
wait
//fld qword ptr [ebp-\$08]
end;

This function implementation scores the benchmark 47237 and the improvement is 8.5 %

Then we change the code such that X is loaded only once.

function ArcSinApprox3f(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fld qword ptr [ebp+\$28]
fxch
//fmul qword ptr [ebp+\$28]
fmul st(0),st(1)
//fmul qword ptr [ebp+\$28]
fmul st(0),st(1)
//fmul qword ptr [ebp+\$28]
fmul st(0),st(1)
ffree st(1)
wait
end;

This function implementation scores the benchmark 47226 and performance is unchanged.

The ffree instruction can be removed by using fmulp instead of fmul, but to do this we must interchange the two registers used. Only these two registers are in use and A*B = B*A so there is no problem doing that. We are not removing any redundancy by this and the two ways of coding the same thing should perform identically.

function ArcSinApprox3g(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fld qword ptr [ebp+\$28]
fxch st(1)
fmul st(0),st(1)
fmul st(0),st(1)
//fmul st(0),st(1)
fmulp st(1),st(0)
//ffree st(1)
wait
end;

This function implementation scores the benchmark 47416.

Then we remove the wait instruction.

function ArcSinApprox3h(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$20]
fld qword ptr [ebp+\$28]
fxch st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fmulp st(1),st(0)
//wait
end;

This function implementation scores the benchmark 47059.

The last thing to do is interchanging the lines that load X and A, and remove the fxch instruction.

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
fld qword ptr [ebp+\$28]
fld qword ptr [ebp+\$20]
//fld qword ptr [ebp+\$28]
//fxch st(1)
fmul st(0),st(1)
fmul st(0),st(1)
fmulp st(1),st(0)
end;

This function implementation scores the benchmark 46544 and performance went down!

Let us compare the performance of the Horner style function with the naive one by picking the fastest implementations of both on P4.

ArcSinApprox1g 47939
ArcSinApprox3g 47416

On P3

ArcSinApprox1h 85361
ArcSinApprox3h 87604

There difference is not big, but the naive approach is a little faster on P4 and slower on P3. The naive approach has more calculations, but parallelism makes up for it. The Horner way has very little parallelism and latencies are fully exposed. This is especially bad on P4.

Having this in mind we continue to the third possible approach, which looks like this.

function ArcSinApprox4b(X, A, B, C, D : Double) : Double;
begin
{
push ebp
mov ebp,esp
}
Result := (A*X + B)*(X*X)+(C*X + D);
{
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmulp st(1)
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
}
{
pop ecx
pop ecx
pop ebp
}
end;

Experienced as we are now optimizing this function is going to be easy and fast ;-)

This version is as Delphi made it

function ArcSinApprox4c(X, A, B, C, D : Double) : Double;
asm
//push ebp
//mov ebp,esp
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmulp //st(1)
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
fstp qword ptr [ebp-\$08]
wait
fld qword ptr [ebp-\$08]
pop ecx
pop ecx
//pop ebp
end;

Removing the stack frame and the two lines that store the result in the stack frame

function ArcSinApprox4d(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$20]
fmul qword ptr [ebp+\$28]
fld qword ptr [ebp+\$28]
fmul qword ptr [ebp+\$28]
fmulp //st(1)
fld qword ptr [ebp+\$10]
fmul qword ptr [ebp+\$28]
//fstp qword ptr [ebp-\$08]
wait
//fld qword ptr [ebp-\$08]
//pop ecx
//pop ecx
end;

function ArcSinApprox4e(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$20]
fld qword ptr [ebp+\$28]
//fmul qword ptr [ebp+\$28]
fxch
fmul st(0),st(1)
//fld qword ptr [ebp+\$28]
fld st(1)
//fmul qword ptr [ebp+\$28]
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+\$10]
//fmul qword ptr [ebp+\$28]
fmul st(0),st(2)
ffree st(1)
wait
end;

Remove fxch and wait.

function ArcSinApprox4f(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$28]
fld qword ptr [ebp+\$20]
//fxch
fmul st(0),st(1)
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+\$10]
fmul st(0),st(2)
ffree st(1)
//wait
end;

Reschedule ffree st(1)

function ArcSinApprox4g(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$28]
fld qword ptr [ebp+\$20]
fmul st(0),st(1)
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+\$10]
fmul st(0),st(2)
ffree st(2)
//ffree st(1)
end;

Replace fmul/ffree by fmulp

function ArcSinApprox4h(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$28]
fld qword ptr [ebp+\$20]
fmul st(0),st(1)
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+\$10]
//fmul st(0),st(2)
fmulp st(2),st(0)
//ffree st(2)
end;

Cleaning up and observing that the compiler still backs up ebp and modifies esp redundantly.

function ArcSinApprox4i(X, A, B, C, D : Double) : Double;
asm
//Result := (A*X + B)*(X*X)+(C*X + D);
fld qword ptr [ebp+\$28]
fld qword ptr [ebp+\$20]
fmul st(0),st(1)
fld st(1)
fmul st(0),st(2)
fmulp
fld qword ptr [ebp+\$10]
fmulp st(2),st(0)
end;

The big question is now how well this function version performs.

ArcSinApprox4a 45228
ArcSinApprox4b 45239
ArcSinApprox4c 45228
ArcSinApprox4d 51813
ArcSinApprox4e 49044
ArcSinApprox4f 48674
ArcSinApprox4g 48852
ArcSinApprox4h 44914
ArcSinApprox4i 44914

We see that “optimizations” from function d to i are “deoptimizations” on P4 except for g.

On P3

ArcSinApprox4a 68871
ArcSinApprox4b 68871
ArcSinApprox4c 68634
ArcSinApprox4d 86806
ArcSinApprox4e 85727
ArcSinApprox4f 83542
ArcSinApprox4g 80548
ArcSinApprox4h 88378
ArcSinApprox4i 85324

We see that optimizations d and h are very good and optimizations e, f g and I are bad. It is quite possible that the optimal function implementation is none of the ones we have made. We could pick version h and remove the bad optimizations or simply make some more variants and this way get a faster implementation.

Which function approach is the winner? To find out we pick the fastest implementation of each approach

On P4

ArcSinApprox1f 47939
ArcSinApprox3g 47416
ArcSinApprox4d 51813

The last version is the fastest. Parallelization is very important on a modern processor and version 4 beats the others by 9 %.

On P3

ArcSinApprox1h 85361
ArcSinApprox3h 87604
ArcSinApprox4h 88378

The function version 4 is a winner on P3 as well, but with a less margin.

The P4 has an SSE2 instruction set, which contains instructions for double precision floating point calculations. The main idea with this instruction set is SIMD calculations. SIMD is an acronym for Single Instruction Multiple Data. Multiple data is here two FP double precision variables (64 bit) and two sets of these data can be added, subtracted, multiplied or divided with one instruction. SSE2 also have some instructions for scalar calculations, which are calculations on one pair of data just like ordinary FP math in the FPU. The biggest difference between ordinary FP math and SSE2 scalar math is that FP math is performed in extended precision and results are rounded to double precision when copied to double precision variables in RAM/cache. SSE2 math is double precision calculation and double precision registers. The code examples of this lesson have relatively few calculations and precision on the FPU will be double. If we load data, perform all calculations and store the result, the result will only bee a little less than extended precision when still on the FPU stack, and will be rounded to exact double precision when copied to a memory location. SSE2 calculations on the other hand are a little less than double precision and the result in a register is a little less than double precision too. If there is only one calculation the result will be in double precision, but when performing additional calculations the error from each will sum up. Because the FPU does all calculations in extended precision and can hold temporary results in registers, there can be done many calculations before precision declines below double.
We have seen that the drawback of SSE2 is that precision is double or less versus the double precision of IA32 FP. What is the advantage? There are two advantages. Registers are not ordered as a stack, which makes it easier to arrange code and secondly calculations in double precision are faster than calculations in extended precision. We must expect scalar SSE2 instructions to have shorter latencies than their IA32 FP counterparts.

Fsub latency is 5
Fmul latency is 7
Fdiv latency is 38

Subsd latency is 4
Mulsd
Divsd latency is 35

The P4 optimizations reference manual has no latency and throughput information for the Mulsd instruction!

We see that latencies are one cycle less for scalar SSE2 in general, but 3 cycles less for division.

Throughput is

Fsub throughput is 1
Fmul throughput is 2
Fdiv throughput is 38

Subsd throughput is 2
Mulsd
Divsd latency is 35

We see that throughput for addsd and subsd surprisingly are the double of fadd and fsub.
All that think SSE2 has dedicated hardware and that SIMD is calculation on two sets of data in parallel raise your hands!
From the manual “Optimizations for Intel P4 and Intel Xeon” latency and throughput tables at page C-1 it is seen that all SSE2 FP instructions are executed in the same pipelines as old time FP instructions. This means that an SIMD addition as example is generating two microinstructions that execute in the F_ADD pipeline. At clock cycle one the first one enters the pipeline at the second cycle number 2 enters the pipeline. Because latency is 4 cycles the first one leaves the pipeline at clock cycle 3 and the second one leaves at cycle four. This leads us to expect that a scalar SSE2 add should generate one microinstruction of the same type and have a latency of 3 cycles and a throughput of 1. From the tables it is seen that the SIMD version of add, addpd, has the same latency and throughput as the scalar version, addsd. Either there is an error in the tables or the scalar instruction also generates two microinstructions of which one is “blind”, that is have no effect. Come on Intel!

To verify the numbers from the table we create some dedicated code and time the instructions.

var
RunNo, ClockFrequency : Cardinal;
StartTime, EndTime, RunTime : TDateTime;
NoOfClocksPerRun, RunTimeSec : Double;
const
ONE : Double = 1;
NOOFINSTRUCTIONS : Cardinal = 895;

begin
Update;
StartTime := Time;
for RunNo := 1 to MAXNOOFRUNS do
begin
asm
movsd xmm0, ONE
movsd xmm1, xmm0
movsd xmm2, xmm0
movsd xmm3, xmm0
movsd xmm4, xmm0
movsd xmm5, xmm0
movsd xmm6, xmm0
movsd xmm7, xmm0

//Repeat the addsd block of code such that there are 128 blocks

end;
end;
EndTime := Time;
RunTime := EndTime - StartTime;
RunTimeSec := (24 * 60 *60 * RunTime);
ClockFrequency := StrToInt(ClockFrequencyEdit.Text);
NoOfClocksPerRun := (RunTimeSec / MaxNoOfRuns) * ClockFrequency * 1000000 / NOOFINSTRUCTIONS;
ADDSDThroughputEdit.Text := FloatToStrF(NoOfClocksPerRun, ffFixed, 9, 1);
Update;
end;

The addsd instructions all operate on the same two registers and therefore they cannot execute in parallel. The second instruction has to wait for the first to finish and the full latency of the instruction is exposed.

For measuring throughput insert this block 128 times

Here there are no data dependencies between instructions and they can execute in parallel. Xmm0 is used as source in every line but this does not create a data dependency.

Results from run of the code show us that latency is 4 cycles and throughput is 2 cycles. This is in consistency with the table numbers.

Let us code the three functions in scalar SSE2 and perform some benchmarks. The 8 SSE2 registers are called xmm0-xmm7 and Delphi has no register view for them. So we must create our own, by creating a global (or local) variable for each register, put a watch on them and add code in the function to copy register contents to variables. It is somewhat cumbersome to do all this and I am looking forward to Borland creating a xmm register view. This code shows how I do it.

var
XMM0reg, XMM1reg, XMM2reg, XMM3reg, XMM4reg : Double;

function ArcSinApprox3i(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;

fld qword ptr [ebp+\$20]
movsd xmm0,qword ptr [ebp+\$20]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fld qword ptr [ebp+\$28]
movsd xmm1,qword ptr [ebp+\$28]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fxch st(1)
fmul st(0),st(1)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fmul st(0),st(1)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

fmulp st(1),st(0)
mulsd xmm0,xmm1

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

movsd [esp-8],xmm0
fld qword ptr [esp-8]

movsd XMM0reg,xmm0
movsd XMM1reg,xmm1
movsd XMM2reg,xmm2
movsd XMM3reg,xmm3

wait
end;

The code is not using xmm4-xmm7 and there was no need to create a register view for them. There is added xmm view code after each line of SSE2 code. All lines but the last two are the FP code with the SSE2 code added such that every operation is done in FP as well as in SSE2. This way it is possible to trace through the code and control that the SSE2 version is doing the same as the classic version. Open the FPU view and see how the FP stack is updated and control that xmm registers are updated in the same way. I developed the SSE2 code simply by adding an SSE2 instruction after each line of FP code.

fld qword ptr [ebp+\$20]
movsd xmm0,qword ptr [ebp+\$20]

movsd copy one double precision variable from the memory location at [ebp+\$20] into a xmm register. “qword ptr” is not needed but I kept it to emphasize the pattern between SSE2 code and FP code.
A big difference between FP code and scalar SSE2 code is that the FP registers are organized as a stack and SSE2 registers are not. At first while coding the SSE2 code I just ignored this and then after having made all the SSE2 lines I went back and traced through the lines one by one and corrected them to work on the correct variable/register. Activate the function with some variable values that are easy to follow in the two views (eg. X=2, A=3, B=4, C=5, D=6), and see that first “2” is loaded, then “3”, then 2 is multiplied by “3” and “2” is overwritten by “6” etc.

The scalar SSE2 counterpart for fmul is mulsd. The sd prefix means Scalar – Double.

fxch st(1)
fmul st(0),st(1)
mulsd xmm0,xmm1

Continue this way line by line.

The FP code leaves the result in st(0), but the SSE2 code leaves the result in an xmm register. Then the result has to be copied from xmm to st(0) via a memory location on the stack.

movsd [esp-8],xmm0
fld qword ptr [esp-8]

These two lines do this. At esp-8, 8 bytes above the top of the stack, there is some place we can use as the temporary location for the result. The first line copies xmm0 to temp and then the last line loads temp on the FP stack. These two lines are overhead that will make small SSE2 functions less effective than their FP cousins.

After having double checked the SSE2 code we can remove the instrumentation code as well as the old FP code, leaving a nice scalar SSE2 function with only a little extra overhead.

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0,qword ptr [ebp+\$20]
movsd xmm1,qword ptr [ebp+\$28]
mulsd xmm0,xmm1
mulsd xmm0,xmm1
mulsd xmm0,xmm1
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;

It can be even nicer if we remove the not needed “qword ptr” text.

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, [ebp+\$20]
movsd xmm1, [ebp+\$28]
mulsd xmm0,xmm1
mulsd xmm0,xmm1
mulsd xmm0,xmm1
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;

Change the pointers with the parameter names

function ArcSinApprox3j(X, A, B, C, D : Double) : Double;
asm
//Result := ((A*X + B)*X + C)*X + D;
movsd xmm0, A
movsd xmm1, X
mulsd xmm0,xmm1
mulsd xmm0,xmm1
mulsd xmm0,xmm1
movsd [esp-8],xmm0
fld qword ptr [esp-8]
end;

Well how does this version perform?

The benchmark is 45882

This version is somewhat slower than the FP version which scored 48292. We need to investigate what is the reason for this. Is it the overhead of the last two lines or is it due to the 2 cycle throughput of addsd and mulsd? The overhead can be removed by transferring the result as an out parameter or we can inline the function. It would be interesting for us to see how big an advantage it is to inline this relatively small function. After all there is a good deal of overhead of copying 5 double precision parameters each 8 byte big. Let us see how much code is actually needed for this.

push dword ptr [ebp+\$14]
push dword ptr [ebp+\$10]
push dword ptr [ebp+\$34]
push dword ptr [ebp+\$30]
push dword ptr [ebp+\$2c]
push dword ptr [ebp+\$28]
push dword ptr [ebp+\$24]
push dword ptr [ebp+\$20]
push dword ptr [ebp+\$1c]
push dword ptr [ebp+\$18]
call dword ptr [ArcSinApproxFunction]
fstp qword ptr [ebp+\$08]

No less than 10 push instructions each pushing a 4 byte half of each parameter onto the stack. Imagine that the register calling convention took its name seriously and transferred the parameters on the FP stack instead. Then we would have 5 fld, which would also remove the need to load parameters from the stack in the function. That is – 5 fld in the function would be replaced by 5 fld at the call place of the function and 10 push instructions would go away. This would lead to a dramatic increase in performance. Inlining the function would also remove the overhead of the call/ret pair which is a lot less than the overhead of the mega push, and this would give as a clue about the performance of the updated register2 calling convention ;-).

Inlined ArcSinApprox3i 156006
Inlined ArcSinApprox3j 160000

The improvement is an astonishing 400 %.
I truly wish Borland introduces a true register calling convention for floating point parameters in the near future.

The SSE2 version is only 3 % faster than the IA32 version. This could be more on a decent SSE2 processor implementation.

Lesson 7 has hereby come to an end. You now know nearly all about floating point programming ;-)

Regards
Dennis