

As on the 486, you should keep a careful eye out for AGIs involving the stack pointer. Implicit modifiers of ESP, such as PUSH and POP, are special-cased so you don’t have to worry about AGIs. However, if you explicitly modify ESP with this instruction

sub esp,100h

for example, or with the popular

mov esp,ebp

you can then get AGIs if you attempt to use ESP to address memory, either explicitly with instructions like this one

mov eax,[esp+20h]

or via PUSH, POP, or other instructions that implicitly use ESP as an addressing register.
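One way to sidestep such an AGI, sketched here under the assumption that an address register must not have been written in the preceding cycle, is to schedule a full cycle of unrelated work between the explicit change to ESP and the first ESP-based address. Count, SrcPtr, and DestPtr are hypothetical memory variables used only for illustration, and the pipe assignments in the comments are the intended ones, assuming SUB starts in the U-pipe; as always, time the real code.

```asm
sub esp,100h       ;U-pipe cycle 1: explicit write to ESP
mov ebx,[Count]    ;V-pipe cycle 1: unrelated load (hypothetical variable)
mov esi,[SrcPtr]   ;U-pipe cycle 2: unrelated load (hypothetical variable)
mov edi,[DestPtr]  ;V-pipe cycle 2: unrelated load (hypothetical variable)
mov eax,[esp+20h]  ;U-pipe cycle 3: ESP was last written two cycles ago
```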

On the 486, any instruction that had both a constant value and an addressing displacement, such as

mov dword ptr [ebp+16],1

suffered a 1-cycle penalty, taking a total of 2 cycles. Such instructions take only one cycle on the Pentium, but they cannot pair, so they’re still the most expensive sort of MOV. Knowing this can speed up something as simple as zeroing two memory variables, as in

sub eax,eax        ;U-pipe 1
                   ;any V-pipe pairable
                   ; instruction can go here,
                   ; or SUB could be in V-pipe
mov [MemVar1],eax  ;U-pipe 2
mov [MemVar2],eax  ;V-pipe 2

which should never be slower, could potentially be 0.5 cycles faster, and is six bytes smaller than this sequence:

mov [MemVar1],0 ;U-pipe 1
mov [MemVar2],0 ;U-pipe 2

Note, however, that my experiments thus far indicate that the two writes in the first case don’t actually pair (possibly because the memory variables have never been read into the internal cache), so you might want to insert an instruction between the two MOVs—and, of course, this is yet another reason why you should always measure your code’s actual performance.
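A minimal sketch of that suggestion follows. INC EDX is only a hypothetical stand-in for whatever independent, pairable instruction the surrounding code happens to need; whether separating the two writes actually helps is exactly the sort of thing that has to be measured rather than assumed.

```asm
sub eax,eax        ;zero EAX once
mov [MemVar1],eax  ;first write
inc edx            ;hypothetical unrelated instruction between the writes
mov [MemVar2],eax  ;second write
```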

Register Contention

Finally, we come to the last major component of superscalar optimization: register contention. The basic premise here is simple: You can’t use the same register in two inherently sequential ways in a single cycle. For example, you can’t execute

inc eax     ;U-pipe cycle 1
            ;V-pipe idle cycle 1
            ; due to dependency
and ebx,eax ;U-pipe cycle 2

in a single cycle; AND EBX,EAX can’t execute until the value in EAX is known, and that can’t happen until INC EAX is done. Consequently, the V-pipe idles while INC EAX executes in the U-pipe. We saw this in the last chapter when we discussed splitting instructions into simple instructions, and it is by far the most common sort of register contention, known as read-after-write register contention. Read-after-write register contention is the primary reason we have to interleave independent operations in order to get maximum V-pipe usage.
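As a sketch of what such interleaving might look like, suppose a second, independent INC/AND pair on EDX and ECX needs to be done as well (that second pair is an assumption made purely for illustration). Alternating the two chains gives each instruction a partner that doesn't depend on it, so all four instructions should fit in two cycles:

```asm
inc eax     ;U-pipe cycle 1
inc edx     ;V-pipe cycle 1 (independent of INC EAX)
and ebx,eax ;U-pipe cycle 2 (EAX was written last cycle)
and ecx,edx ;V-pipe cycle 2 (EDX was written last cycle)
```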

The other sort of register contention is known as write-after-write. Write-after-write register contention happens when two instructions try to write to the same register on the same cycle. While that may not seem like a particularly useful operation in general, it can happen when subregisters are being set, as in the following

sub eax,eax   ;U-pipe cycle 1
              ;V-pipe idle cycle 1
              ; due to register contention
mov al,[Var]  ;U-pipe cycle 2

where an attempt is made to set both EAX and its AL subregister on the same cycle. Write-after-write contention implies that the two instructions comprising the above substitute for MOVZX should have at least one unrelated instruction between them when SUB EAX,EAX executes in the U-pipe.
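A sketch of the MOVZX substitute with an unrelated instruction separating the two writes to EAX might look like this. OtherVar is a hypothetical variable; any pairable instruction that doesn't touch EAX would do, and the pipe assignments shown are the intended ones rather than measured results.

```asm
sub eax,eax       ;U-pipe cycle 1: writes all of EAX
mov edx,[OtherVar] ;V-pipe cycle 1: unrelated load (hypothetical variable)
mov al,[Var]      ;U-pipe cycle 2: writes AL a cycle after EAX was set
```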

Exceptions to Register Contention

Intel has special-cased some very useful exceptions to register contention. Happily, write-after-read operations do not cause contention. Such operations, as in

mov eax,edx ;U-pipe cycle 1
sub edx,edx ;V-pipe cycle 1

are free of charge.

Also, stack-related instructions that modify ESP only implicitly (without ESP as part of any explicit operand) do not cause AGIs, and neither do they cause register contention with other instructions that use ESP only implicitly; such instructions include PUSH reg/immed, POP reg, and CALL. (However, these instructions do cause register contention on ESP—but not AGIs—with instructions that use ESP explicitly, such as MOV EAX,[ESP+4].) Without this special case, the following sequence would hardly use the V-pipe at all:

mov  eax,[MemVar] ;U-pipe cycle 1
push esi          ;V-pipe cycle 1
push eax          ;U-pipe cycle 2
push edi          ;V-pipe cycle 2
push ebx          ;U-pipe cycle 3
call FooTilde     ;V-pipe cycle 3

But in fact, all the instructions pair, even though ESP is modified five times in the space of six instructions.

The final register-contention special case is both remarkable and remarkably important. There is exactly one sort of instruction that can pair only in the V-pipe: branches. Any near call or conditional or unconditional near jump can execute in the V-pipe paired with any pairable U-pipe instruction, as illustrated by this sequence:

LoopTop:
   mov [esi],eax ;U-pipe cycle 1
   add esi,4     ;V-pipe cycle 1
   dec ecx       ;U-pipe cycle 2
   jnz LoopTop   ;V-pipe cycle 2

Branches can’t pair in the U-pipe; a branch that executes in the U-pipe runs alone, with the V-pipe idle. If a call or jump is correctly predicted by the Pentium’s branch prediction circuitry (as discussed in the last chapter), it executes in a single cycle, pairing if it runs in the V-pipe; if mispredicted, conditional jumps take 4 cycles in the U-pipe and 5 cycles in the V-pipe, and mispredicted calls and unconditional jumps take 3 cycles in either pipe. Note that RET can’t pair.

Who’s in First?

One of the trickiest things about superscalar optimization is that a given instruction stream can execute at a different speed depending on the pipe where it starts execution, because which instruction goes through the U-pipe first determines which of the following instructions will be able to pair. If we take the last example and add one more instruction, the other instructions will go through different pipes than previously, and cause the loop as a whole to take 50 percent longer, even though we only added 25 percent more instructions:

LoopTop:
   inc edx           ;U-pipe cycle 1
   mov [esi],eax     ;V-pipe cycle 1
   add esi,4         ;U-pipe cycle 2
   dec ecx           ;V-pipe cycle 2
   jnz LoopTop       ;U-pipe cycle 3
                     ;V-pipe idle cycle 3
                     ; because JNZ can’t
                     ; pair in the U-pipe
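
Five instructions need three cycles however they are arranged, so one possible response, offered as a sketch rather than a measured result, is to put the otherwise idle slot to work so that JNZ lands in the V-pipe again. INC EBX below is a hypothetical placeholder for any pairable instruction the loop could use that is independent of its neighbors; the pipe assignments assume each pass through the loop starts with INC EDX in the U-pipe.

```asm
LoopTop:
   inc edx           ;U-pipe cycle 1
   mov [esi],eax     ;V-pipe cycle 1
   add esi,4         ;U-pipe cycle 2
   inc ebx           ;V-pipe cycle 2 (hypothetical extra work)
   dec ecx           ;U-pipe cycle 3
   jnz LoopTop       ;V-pipe cycle 3
```

If that holds up when timed, the sixth instruction comes at no cost in cycles, since the loop takes three cycles either way.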



Graphics Programming Black Book © 2001 Michael Abrash