Previous Table of Contents Next


You don’t need to understand every corner of the 486 universe unless you’re a diehard ASMhead who does this stuff for fun. Just learn enough to be able to speed up the key portions of your programs, and spend the rest of your time on a fast design and overall implementation.

More Fun with Byte Registers

Rule #4: Don’t load any byte register exactly 2 cycles before using any register to address memory.

This, the last of this chapter’s rules, is the strangest of the lot. If any byte register is loaded, and then two cycles later any register is used to point to memory, one cycle is lost. So, for example, this code

mov    al,bl
mov    cx,dx
mov    si,[di]

takes four rather than the expected three cycles to execute. Note that it is not required that the byte register be part of the register used to address memory; any byte register will do the trick.

Worse still, loading byte registers both one and two cycles before a register is used to address memory costs two cycles, as in

mov    bl,al
mov    cl,3
mov    bx,[si]

which takes five rather than three cycles to run. However, there is no penalty if a byte register is loaded one cycle but not two cycles before a register is used to address memory. Therefore,

mov    cx,3
mov    dl,al
mov    si,[bx]

runs in the expected three cycles.

In truth, I do not know why this happens. Clearly, it has something to do with interrupting the start of the addressing pipeline, and I have my theories about how this works, but at this point they’re pure speculation. Whatever the reason for this rule, ignorance of it—and of its interaction with the other rules—could lead to considerable performance loss in seemingly air-tight code. For instance, a casual observer would expect the following code to run in 3 cycles:

mov    bx,offset MemVar
mov    cl,al
mov    ax,[bx]

A more sophisticated programmer would expect to lose one cycle, because BX is loaded two cycles before being used to address memory. In fact, though, this code takes 5 cycles—2 cycles, or 67 percent, longer than normal. Why? Well, under normal conditions, loading a byte register—CL in this case—one cycle before using a register to address memory produces no penalty; loading 2 cycles ahead is the only case that normally incurs a penalty. However, think of Rule #4 as meaning that loading a byte register disrupts the memory addressing pipeline as it starts up. Viewed that way, we can see that MOV BX,OFFSET MemVar interrupts the addressing pipeline, forcing it to start again, and then, presumably, MOV CL,AL interrupts the pipeline again because the pipeline is now on its first cycle: the one that loading a byte register can affect.

I know—it seems awfully complicated. It isn’t, really. Generally, try not to use byte destinations exactly two cycles before using a register to address memory, and try not to load a register either one or two cycles before using it to address memory, and you’ll be fine.

Timing Your Own 486 Code

In case you want to do some 486 performance analysis of your own, let me show you how I arrived at one of the above conclusions; at the same time, I can warn you of the timing hazards of the cache. Listings 12.1 and 12.2 show the code I ran through the Zen timer in order to establish the effects of loading a byte register before using a register to address memory. Listing 12.1 ran in 120 µs on a 33 MHz 486, or 4 cycles per repetition (120 µs/1000 repetitions = 120 ns per repetition; 120 ns per repetition/30 ns per cycle = 4 cycles per repetition); Listing 12.2 ran in 90 µs, or 3 cycles, establishing that loading a byte register costs a cycle only when it’s performed exactly 2 cycles before addressing memory.

LISTING 12.1 LST12-1.ASM

; Measures the effect of loading a byte register 2 cycles before
; using a register to address memory.
    mov    bp,2    ;run the test code twice to make sure
                   ; it’s cached
    sub    bx,bx
CacheFillLoop:
    call    ZTimerOn ;start timing
    rept    1000
    mov     dl,cl
    nop
    mov     ax,[bx]
    endm
    call    ZTimerOff ;stop timing
    dec     bp
    jz      Done
    jmp     CacheFillLoop
Done:

LISTING 12.2 LST12-2.ASM

; Measures the effect of loading a byte register 1 cycle before
; using a register to address memory.
    mov    bp,2   ;run the test code twice to make sure
                  ; it’s cached
    sub    bx,bx
CacheFillLoop:
    call   ZTimerOn ;start timing
    rept   1000
    nop
    mov    dl,cl
    mov    ax,[bx]
    endm
    call   ZTimerOff ;stop timing
    dec    bp
    jz     Done
    jmp    CacheFillLoop
Done:

Note that Listings 12.1 and 12.2 each repeat the timing of the code under test a second time, to make sure that the instructions are in the cache on the second pass, the one for which results are displayed. Also note that the code is less than 8K in size, so that it can all fit in the 486’s 8K internal cache. If I double the REPT value in Listing 12.2 to 2,000, making the test code larger than 8K, the execution time more than doubles to 224 µs, or 3.7 cycles per repetition; the extra seven-tenths of a cycle comes from fetching non-cached instruction bytes.

Whenever you see non-integral timing results of this sort, it’s a good bet that the test code or data isn’t cached.

The Story Continues

There’s certainly plenty more 486 lore to explore, including the 486’s unique prefetch queue, more optimization rules, branching optimizations, performance implications of the cache, the cost of cache misses for reads, and the implications of cache write-through for writes. Nonetheless, we’ve covered quite a bit of ground in this chapter, and I trust you’ve gotten a feel for the considerable extent to which 486 optimization differs from what you’re used to. Odd as 486 optimization is, though, it’s well worth mastering, for the 486 is, at its best, so staggeringly fast that carefully crafted 486 code can do more than twice as much per cycle as the best 386 code—which makes it perhaps 50 times as fast as optimized code for the original PC.

Sometimes it is hard to believe we’re still in Kansas!


Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash