# BASM for Beginners

## BASM for beginners, lesson 4

In this lesson we will learn about branching, illustrated by an if-else construct. Conditional floating point move will also be introduced.

The example function of this lesson is the Min function from the Delphi Math unit.

function Min1(const A, B: Single) : Single;

begin

if A < B then

Result := A

else

Result := B;

end;

The compiler generated asm for this function looks like this

function Min2(const A, B: Single) : Single;

begin

{

00452458 55               push ebp

00452459 8BEC             mov ebp,esp

0045245B 51               push ecx

}

if A < B then

{

0045245C D9450C           fld dword ptr [ebp+\$0c]

0045245F D85D08           fcomp dword ptr [ebp+\$08]

00452462 DFE0             fstsw ax

00452464 9E               sahf

00452465 7308             jnb +\$08

}

Result := A

{

00452467 8B450C           mov eax,[ebp+\$0c]

0045246A 8945FC           mov [ebp-\$04],eax

0045246D EB06             jmp +\$06

}

else

Result := B;

{

0045246F 8B4508           mov eax,[ebp+\$08]

00452472 8945FC           mov [ebp-\$04],eax

}

{

00452475 D945FC           fld dword ptr [ebp-\$04]

00452478 59               pop ecx

00452479 5D               pop ebp

}

end;

This time I included the address and opcode columns, because we need them later. Let’s analyze it line by line, like we always do.

function Min3(const A, B: Single) : Single;

begin

{

push ebp                     // Save ebp on stack

mov ebp,esp                  // New basepointer is the old stackpointer

push ecx                     // subtract 4 from esp

}

if A < B then

{

fld dword ptr [ebp+\$0c]      // Load A on FP stack

fcomp dword ptr [ebp+\$08]    // FP compare A to B and pop A from stack

fstsw ax                     // Store FP statusword in ax

sahf                         // Store ah into EFlags register

jnb +\$08                     // If not below jump 8 bytes forward

}

Result := A

{

mov eax,[ebp+\$0c]           // Copy A into eax

mov [ebp-\$04],eax           // Copy A into stackframe

jmp +\$06                    // Jmp 6 bytes forward

}

else

Result := B;

{

mov eax,[ebp+\$08]           // Copy B into eax

mov [ebp-\$04],eax           // Copy B into stackframe

}

{

fld dword ptr [ebp-\$04]     // Load A or B from stackframe onto FP stack

pop ecx                     // Add 4 to esp

pop ebp                     // Restore ebp

}

end;

This time I commented every line of BASM. Read the details there. The first new instruction introduced by this example is fcomp. F says floating point instruction as always. Com says compare and p says pop FP stack. Fcom compares two floating point values and sets the condition code flags of the floating point unit, named C0, C1, C2 and C3. These flags are the equivalents of the EFlags register of the CPU. The flags are checked by conditional jump instructions, which jump or not depending on the type of jump and the status of the flags. Conditional jump instructions check the CPU flags and not the FPU flags and therefore it is necessary to copy the flags from the FPU to the CPU. This is done by the next two instructions. fstsw stores the FP flags into the ax register and sahf copy the 8 bits from ah to the EFlags register. This is a long way for the flags to travel before they can be used by the jnb instruction. Jnb is short for jump not below. In the Intel SW Developers Manual Vol. 2 at page 394 there is a table with all the different jump instructions and a description of the flags they test. Here it is seen that jnb jumps if CF=1 and ZF=1. Try tracing through the code with the FPU view and the CPU view open. Watch how the FPU flags are set, their values are copied into ah and then copied to the CPU EFlags register.

If the jnb jump is not taken execution continues in the line after it. This is the if-part of the if-else construct. If the jump is taken execution is continued at the line 8 bytes ahead. This is at the start of the else part. The if- and the else-part is very much the same. As seen in the Pascal code A or B is copied to the Result variable depending on the if-condition. Instead of copying A or B directly on the top of the FP stack, which is the place of a FP Result from a function due to the register calling convention, the Delphi compiler chose to use the stack as a temporary location. The fld dword ptr [ebp-\$04] instruction copies the result in the right place.

Observe that an unconditional jump is needed at the end of the if-block such that execution does not continue in the else block. This way there is a jump no matter with branch is taken.

I will write short word on branch prediction. Branch prediction comes in two flavors, static and dynamic. The first time a branch is executed the CPU can have no knowledge about the probability it will be taken or not. In this situation it uses the static predictor, which says that forward jumps are not taken and backward branches are taken. Our example jump no 1 is a forward branch and it will be predicted not taken the first time through. If we have knowledge about the values A and B, we can use it to code the if-else such that the if-part is the most often executed and static prediction is optimized. An unconditional jump needs no prediction - it is always taken ;-) A backward jump is often part of a loop and most loops will run more than once (why else have a loop?). This way it makes sense to statically predict backward jumps taken. Dynamic branch prediction tries to accumulate knowledge about the probability that a certain jump is taken or not and then predict it correctly as often as possible.

Now it is time to transform the function into a pure asm one.

function Min4(const A, B: Single) : Single;

asm

//push  ebp

//mov   ebp,esp

push  ecx

//if A < B then

fld   dword ptr [ebp+\$0c]

fcomp dword ptr [ebp+\$08]

fstsw ax

sahf

jnb   @ElseBegin

//Result := A

mov   eax,[ebp+\$0c]

mov   [ebp-\$04],eax

jmp   @ElseEnd

//else

//Result := B;

@ElseBegin :

mov   eax,[ebp+\$08]

mov   [ebp-\$04],eax

@ElseEnd :

fld   dword ptr [ebp-\$04]

pop   ecx

//pop   ebp

end;

The new thing this time is that we need some labels. Our analysis of the function made it clear where we jumped to when, I hope. In general when placing labels it is a good thing to use the knowledge we have about the structure of the code. You can open the FPU view and just trace through the code and see where you go when jumps are taken. If you want to place labels without having to think, then use some math. Example follows

We have this jump

00452465 7308             jnb +\$08

This is the line after it

00452467 8B450C           mov eax,[ebp+\$0c]

This is the line 8 byte after that line

0045246F 8B4508           mov eax,[ebp+\$08]

Take the address of the line after the jump and add the jump offset to it and you have the jump target address. This is the math: 00452467 + 8 = 0045246F.

Why do we add the jump offset to the address of the line after the jump and not to the address of the line with the jump?

It is time to optimize.

function Min5(const A, B: Single) : Single;

asm

push  ecx

//if A < B then

fld   dword ptr [ebp+\$0c]

fcomp dword ptr [ebp+\$08]

fstsw ax

sahf

jnb   @ElseBegin

//Result := A

mov   eax,[ebp+\$0c]

mov   [ebp-\$04],eax

jmp   @ElseEnd

//else

//Result := B;

@ElseBegin :

mov   eax,[ebp+\$08]

mov   [ebp-\$04],eax

@ElseEnd :

fld   dword ptr [ebp-\$04]

pop   ecx

end;

This is just a cleaned up version of the function. Let’s change the push ecx, pop ecx instructions with instructions that manipulate the esp register directly and do not move data between ecx and the stack.

function Min6(const A, B: Single) : Single;

asm

//push  ecx

sub   esp, 4

//if A < B then

fld   dword ptr [ebp+\$0c]

fcomp dword ptr [ebp+\$08]

fstsw ax

sahf

jnb   @ElseBegin

//Result := A

mov   eax,[ebp+\$0c]

mov   [ebp-\$04],eax

jmp   @ElseEnd

//else

//Result := B;

@ElseBegin :

mov   eax,[ebp+\$08]

mov   [ebp-\$04],eax

@ElseEnd :

fld   dword ptr [ebp-\$04]

//pop   ecx

end;

When the code was analyzed we saw that the flags were traveling a long way and 2 instructions were needed for this. What about a floating point compare instruction that sets the EFlags directly? This is the fcomi instruction which was introduced in the P6 architecture. Lets use it and abandon those CPU's older than the Pro. It would be fair to believe that these lines

fcomp dword ptr [ebp+\$08]

fstsw ax

sahf

could be substituted by this

fcomip dword ptr [ebp+\$08]

The fcomi instruction does however not accept a memory reference as operand. Therefore it is necessary to load data before issuing it.

fld    dword ptr [ebp+\$0c]

fcomip st(0), st(1)

Then because we loaded data we have to remove them again with the ffree instruction. A fcomipp instruction would have been nice to have.

fld    dword ptr [ebp+\$0c]

fcomip st(0), st(1)

ffree  st(0)

This is a hell of an optimization, substituting three lines with 3 other. It is necessary to time the two versions to know whether it was an optimization. Now the function looks like this.

function Min7(const A, B: Single) : Single;

asm

sub    esp, 4

//if A < B then

fld    dword ptr [ebp+\$08]

fld    dword ptr [ebp+\$0c]

fcomip st(0), st(1)

ffree  st(0)

//fstsw ax

//sahf

jnb    @ElseBegin

//Result := A

mov    eax,[ebp+\$0c]

mov    [ebp-\$04],eax

jmp    @ElseEnd

//else

//Result := B;

@ElseBegin :

mov    eax,[ebp+\$08]

mov    [ebp-\$04],eax

@ElseEnd :

fld    dword ptr [ebp-\$04]

end;

Now we apply some logic thinking. Why copy the result around? Both A & B is on the stack for use by the comparison by fcom and the result should be delivered on the stack. The only thing needed is to delete either A or B and leave the smallest one of them on the stack

function Min8(const A, B: Single) : Single;

asm

sub    esp, 4

//if A < B then

fld    dword ptr [ebp+\$08]

fld    dword ptr [ebp+\$0c]

//fcomip  st(0), st(1)

fcomi  st(0), st(1)

//ffree  st(0)

jnb    @ElseBegin

//Result := A

//mov    eax,[ebp+\$0c]

//mov    [ebp-\$04],eax

ffree st(1)

jmp    @ElseEnd

//else

//Result := B;

@ElseBegin :

//mov    eax,[ebp+\$08]

//mov    [ebp-\$04],eax

fxch

ffree st(1)

@ElseEnd :

//fld    dword ptr [ebp-\$04]

end;

Fcomip is replaced with fcomi because we do not want to remove B from the stack at this point. Ffree is deleted because it deleted A. Then all the lines that copied the result around are removed. In the if-block A is the result and B should be deleted. B is in st(1) and ffree st(1) does the job. In the else block we want to delete A and leave B in st(0). This is done by swapping A and B with the fxch instruction and then delete A in st(1) with ffree. Fxch is nearly free (takes 0 cycles), because if works by register renaming instead of actually copying data.

function Min9(const A, B: Single) : Single;

asm

//sub    esp, 4

//if A < B then

fld    dword ptr [ebp+\$08]

fld    dword ptr [ebp+\$0c]

fcomi  st(0), st(1)

jnb    @ElseBegin

//Result := A

ffree st(1)

jmp    @ElseEnd

//else

//Result := B;

@ElseBegin :

fxch

ffree st(1)

@ElseEnd :

end;

Now the stackframe is not needed and we delete the code that set it up.

function Min10(const A, B: Single) : Single;

asm

//if A < B then

fld    dword ptr [ebp+\$08]

fld    dword ptr [ebp+\$0c]

fcomi  st(0), st(1)

jnb    @ElseBegin

//Result := A

ffree st(1)

jmp    @ElseEnd

//else

//Result := B;

@ElseBegin :

fxch

ffree st(1)

@ElseEnd :

end;

This is a pretty nice function, but somebody in the newsgroup has told us about conditional moves. fcmovnb is such a thing - floating point conditional move not below. It copies data from st(1)-st(7) to st(0) if the condition is fulfilled. The Eflags are tested for the condition. Fcmov was introduced in the P6 architecture as well fcomi was.

function Min11(const A, B: Single) : Single;

asm

fld     dword ptr [ebp+\$08]

fld     dword ptr [ebp+\$0c]

fcomi   st(0), st(1)

fcmovnb st(0), st(1)

ffree   st(1)

end;

Instead of all the jumping around we copy A to the top of the stack where B is, but only if A is the smaller of the two. Delete B and done.

This is a nice clean function with only one piece of redundancy left. The compiler still pushes and pops ebp even though it is not modified in the function.