

Official Execution Times Are Only Part of the Story

The sequence of 5 SHR instructions in the last example is 10 bytes long. That means that it can never execute in less than 24 cycles even if the 4-byte prefetch queue is full when it starts, since 6 instruction bytes would still remain to be fetched, at 4 cycles per fetch. If the prefetch queue is empty at the start, the sequence could take 40 cycles. In short, thanks to instruction fetching, the code won’t run at its documented speed, and could take up to four times longer than it is supposed to.
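
To make the arithmetic concrete, here is the 10-byte sequence with its cycle budget worked out in the comments, a sketch assuming the 2-byte shr ax,1 encoding used in this chapter's listings:

      shr   ax,1  ; 2 bytes; official EU execution time: 2 cycles
      shr   ax,1  ; 2 bytes
      shr   ax,1  ; 2 bytes
      shr   ax,1  ; 2 bytes
      shr   ax,1  ; 2 bytes
;
; 10 bytes in all, for an official EU time of 5 x 2 = 10 cycles.
; Queue full at the start:  10 - 4 = 6 bytes left to fetch, so
;   6 x 4 = 24 cycles minimum.
; Queue empty at the start: 10 x 4 = 40 cycles, four times the
;   official 10-cycle time.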

Why does Intel document Execution Unit execution time rather than overall instruction execution time, which includes both instruction fetch time and Execution Unit (EU) execution time? Well, instruction fetching isn’t performed as part of instruction execution by the Execution Unit, but instead is carried on in parallel by the Bus Interface Unit (BIU) whenever the external data bus isn’t in use or whenever the EU runs out of instruction bytes to execute. Sometimes the BIU is able to use spare bus cycles to prefetch instruction bytes before the EU needs them, so in those cases instruction fetching takes no time at all, practically speaking. At other times the EU executes instructions faster than the BIU can fetch them, and instruction fetching then becomes a significant part of overall execution time. As a result, the effective fetch time for a given instruction varies greatly depending on the code mix preceding that instruction. Similarly, the state in which a given instruction leaves the prefetch queue affects the overall execution time of the following instructions.

In other words, while the execution time for a given instruction is constant, the fetch time for that instruction depends heavily on the context in which the instruction is executing—the amount of prefetching the preceding instructions allowed—and can vary from a full 4 cycles per instruction byte to no time at all.

As we’ll see later, other cycle-eaters, such as DRAM refresh and display memory wait states, can cause prefetching variations even during different executions of the same code sequence. Given that, it’s meaningless to talk about the prefetch time of a given instruction except in the context of a specific code sequence.

So now you know why the official instruction execution times are often wrong, and why Intel can’t provide better specifications: effective execution time depends on the context in which an instruction runs. You also know why you must time your code if you want to know how fast it really is.

There Is No Such Beast as a True Instruction Execution Time

The effect of the code preceding an instruction on that instruction’s execution time makes the Zen timer trickier to use than you might expect, and complicates the interpretation of its results. For one thing, the Zen timer is best used to time code sequences that are more than a few instructions long; below 10 µs or so, prefetch queue effects and the limited resolution of the clock driving the timer can cause problems.
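
By way of illustration, here’s a sketch of the sort of too-short measurement just warned against; the prefetch disturbance caused by the timer calls, together with the timer’s resolution, would swamp the instruction being measured:

      call  ZTimerOn
      shr   ax,1        ; a lone SHR: even at the 8-plus cycles
                        ;  we'll see it take with fetching, that's
                        ;  under 2 µs, far below the 10 µs floor
      call  ZTimerOff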

Some slight prefetch queue-induced inaccuracy usually exists even when the Zen timer is used to time longer code sequences, since the calls to the Zen timer usually alter the code’s prefetch queue from its normal state. (Branches—jumps, calls, returns and the like—empty the prefetch queue.) Ideally, the Zen timer is used to measure the performance of an entire subroutine, so the prefetch queue effects of the branches at the start and end of the subroutine are similar to the effects of the calls to the Zen timer when you’re measuring the subroutine’s performance.
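
Here’s a minimal sketch of that ideal arrangement, with DrawBox standing in as a purely hypothetical subroutine under test:

      call  ZTimerOn    ; this call empties the prefetch queue,
                        ;  much as the call that reaches the
                        ;  subroutine in normal use would
      call  DrawBox     ; hypothetical subroutine being timed
      call  ZTimerOff   ; likewise mirrors the queue-emptying
                        ;  effect of the subroutine's ret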

Another way in which the prefetch queue cycle-eater complicates the use of the Zen timer involves the practice of timing the performance of a few instructions over and over. I’ll often repeat one or two instructions 100 or 1,000 times in a row in listings in this book in order to get timing intervals that are long enough to provide reliable measurements. However, as we just learned, the actual performance of any 8088 instruction depends on the code mix preceding any given use of that instruction, which in turn affects the state of the prefetch queue when the instruction starts executing. Alas, the execution time of an instruction preceded by dozens of identical instructions reflects just one of many possible prefetch states (and not a very likely state at that), and some of the other prefetch states may well produce distinctly different results.

For example, consider the code in Listings 4.5 and 4.6. Listing 4.5 shows our familiar SHR case. Here, because the prefetch queue is always empty, execution time should work out to about 4 cycles per byte, or 8 cycles per SHR, as shown in Figure 4.3. (Figure 4.3 illustrates the relationship between instruction fetching and execution in a simplified way, and is not intended to show the exact timings of 8088 operations.) That’s quite a contrast to the official 2-cycle execution time of SHR. In fact, the Zen timer reports that Listing 4.5 executes in 1.81 µs per instruction, or slightly more than 4 cycles per byte. (The extra time is the result of the dynamic RAM refresh cycle-eater, which we’ll discuss shortly.) Going by Listing 4.5, we would conclude that the “true” execution time of SHR is 8.64 cycles (that is, 1.81 µs at 4.77 MHz).

LISTING 4.5 LST4-5.ASM

; Measures the performance of 1,000 SHR instructions
; in a row. Since SHR executes in 2 cycles but is
; 2 bytes long, the prefetch queue is always empty,
; and prefetching time determines the overall
; performance of the code.
;
      call  ZTimerOn
      rept  1000
      shr   ax,1
      endm
      call  ZTimerOff

LISTING 4.6 LST4-6.ASM

; Measures the performance of 1,000 MUL/SHR instruction
; pairs in a row. The lengthy execution time of MUL
; should keep the prefetch queue from ever emptying.
;
      sub   ax,ax
      call  ZTimerOn
      rept  1000
      mul   ax
      shr   ax,1
      endm
      call  ZTimerOff


Figure 4.3
  Execution and instruction prefetching sequence for Listing 4.5.

Now let’s examine Listing 4.6. Here each SHR follows a MUL instruction. Since MUL instructions take so long to execute that the prefetch queue is always full when they finish, each SHR should be ready and waiting in the prefetch queue when the preceding MUL ends. As a result, we’d expect each SHR to execute in 2 cycles; together with the 118-cycle execution time of multiplying 0 times 0, the total execution time should come to 120 cycles per MUL/SHR pair, as shown in Figure 4.4. And, by God, when we run Listing 4.6 we get an execution time of 25.14 µs per MUL/SHR pair, or exactly 120 cycles! According to these results, the “true” execution time of SHR would seem to be 2 cycles, quite a change from the conclusion we drew from Listing 4.5.
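
A quick check of that arithmetic, assuming the IBM PC’s standard 4.77 MHz 8088 clock:

; Cycle budget for one MUL/SHR pair in Listing 4.6:
;   mul ax       118 cycles of EU time for 0 times 0, ample
;                time for the BIU to refill the 4-byte queue
;   shr ax,1       2 cycles of EU time, with both instruction
;                bytes already waiting in the queue
;                ---
;                120 cycles per pair, which at 4.77 MHz is the
;                25.14 µs the Zen timer reports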

The key point is this: We’ve seen one code sequence in which SHR took 8-plus cycles to execute, and another in which it took only 2 cycles. Are we talking about two different forms of SHR here? Of course not—the difference is purely a reflection of the differing states in which the preceding code left the prefetch queue. In Listing 4.5, each SHR after the first few follows a slew of other SHR instructions which have sucked the prefetch queue dry, so overall performance reflects instruction fetch time. By contrast, each SHR in Listing 4.6 follows a MUL instruction which leaves the prefetch queue full, so overall performance reflects Execution Unit execution time.


