Previous Table of Contents Next


On the 386, ROR was the only way to split a 32-bit register into two 16-bit registers. On the 486, however, BSWAP can not only do the job, but can do it better, because BSWAP executes in just one cycle. BSWAP has the added benefit of not affecting any flags, unlike ROR. With BSWAP-based code like that in Listing 13.6, the upper 16 bits of a register can be accessed with only 2 cycles of overhead and without altering any flags, making the technique of packing two 16-bit registers into one 32-bit register much more useful.

LISTING 13.6 L13-6.ASM

      mov    cx,[initialskip]
      bswap  ecx        ;put skip value in upper half of ECX
      mov    cx,100     ;put loop count in CX
looptop:
       :
      bswap  ecx        ;make skip value word accessible in CX
      add    bx,cx      ;skip BX ahead
      inc    cx         ;set next skip value
      bswap  ecx        ;put loop count in CX
      dec    cx         ;count down loop
      jnz    looptop

Pushing and Popping Memory

Pushing or popping a memory location, as in PUSH WORD PTR [BX] or POP [MemVar], is a compact, easy way to get a value onto or off of the stack, especially when pushing parameters for calling a C-compatible function. However, on a 486, these are unattractive instructions from a performance perspective. Pushing a memory location takes four cycles; by contrast, loading a memory location into a register takes only one cycle, and pushing a register takes just 1 more cycle, for a total of two cycles. Therefore,

mov   ax,[bx]
push  ax

is twice as fast as

push   word ptr [bx]

and the only cost is that the previous contents of AX are destroyed.

Likewise, popping a memory location takes six cycles, but popping a register and writing it to memory takes only two cycles combined. The i486 Microprocessor Programmer’s Reference Manual lists a 4-cycle execution time for popping a register, but pay that no mind; popping a register takes only 1 cycle.

Why is it that such a convenient operation as pushing or popping memory is so slow? The rule on the 486 is that simple operations, which can be executed in a single cycle by the 486’s RISC core, are fast; whereas complex operations, which must be carried out in microcode just as they were on the 386, are almost all relatively slow. Slow, complex operations include all the string instructions except REP MOVS, as well as XLAT, LOOP, and, of course, PUSH mem and POP mem.

Whenever possible, try to use the 486’s 1-cycle instructions, including MOV, ADD, SUB, CMP, ADC, SBB, XOR, AND, OR, TEST, LEA, and PUSH reg and POP reg. These instructions have an added benefit in that it’s often possible to rearrange them for maximum pipeline efficiency, as is the case with Terje’s optimization described earlier in this chapter.

Optimal 1-Bit Shifts and Rotates

On a 486, the n-bit forms of the shift and rotate instructions—as in ROR AX,2 and SHL BX,9—are 2-cycle instructions, but the 1-bit forms—as in ROR AX,1 and SHL BX,1—are 3-cycle instructions. Go figure.

Assemblers default to the 1-bit instruction for 1-bit shifts and rotates. That’s not unreasonable since the 1-bit form is a byte shorter and is just as fast as the n-bit forms on a 386 and faster on a 286, and the n-bit form doesn’t even exist on an 8088. In a really critical loop, however, it might be worth hand-assembling the n-bit form of a single-bit shift or rotate in order to save that cycle. The easiest way to do this is to assemble a 2-bit form of the desired instruction, as in SHL AX,2, then look at the hex codes that the assembler generates and use DB to insert them in your program code, with the value two replaced with the value one. For example, you could determine that SHL AX,2 assembles to the bytes 0C1H 0E0H 002H, either by looking at the disassembly in a debugger or by having the assembler generate a listing file. You could then insert the n-bit version of SHL AX,1 in your code as follows:

mov   ax,1
db    0c1h, 0e0h, 001h
mov   dx,ax

At the end of this sequence, DX will contain 2, and the fast n-bit version of SHL AX,1 will have executed. If you use this approach, I’d recommend using a macro, rather than sticking DBs in the middle of your code.

Again, this technique is advantageous only on a 486. It also doesn’t apply to RCL and RCR, where you definitely want to use the 1-bit versions whenever you can, because the n-bit versions are horrendously slow. But if you’re optimizing for the 486, these tidbits can save a few critical cycles—and Lord knows that if you’re optimizing for the 486—that is, if you need even more performance than you get from unoptimized code on a 486—you almost certainly need all the speed you can get.


Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash