Previous Table of Contents Next

By the way, don’t fall victim to the lures of JCXZ and do something like this:

and     cx,ofh          ;Isolate the desired field
jcxz    SkipLoop        ;If field is 0, don’t bother

The AND instruction has already set the Zero flag, so this

and     cx,0fh           ;Isolate the desired field
jz      SkipLoop         ;If field is 0, don’t bother

will do just fine and is faster on all processors. Use JCXZ only when the Zero flag isn’t already set to reflect the status of CX.

The Lessons of LOOP and JCXZ

What can we learn from LOOP and JCXZ? First, that a single instruction that is intended to do a complex task is not necessarily faster than several instructions that together do the same thing. Second, that the relative merits of instructions and optimization rules vary to a surprisingly large degree across the x86 family.

In particular, if you’re going to write 386 protected mode code, which will run only on the 386, 486, and Pentium, you’d be well advised to rethink your use of the more esoteric members of the x86 instruction set. LOOP, JCXZ, the various accumulator-specific instructions, and even the string instructions in many circumstances no longer offer the advantages they did on the 8088. Sometimes they’re just not any faster than more general instructions, so they’re not worth going out of your way to use; sometimes, as with LOOP, they’re actually slower, and you’d do well to avoid them altogether in the 386/486 world. Reviewing the instruction cycle times in the MASM or TASM manuals, or looking over the cycle times in Intel’s literature, is a good place to start; published cycle times are closer to actual execution times on the 386 and 486 than on the 8088, and are reasonably reliable indicators of the relative performance levels of x86 instructions.

Avoiding LOOPS of Any Stripe

Cycle counting and directly substituting instructions (DEC CX/JNZ for LOOP, for example) are techniques that belong at the lowest level of optimization. It’s an important level, but it’s fairly mechanical; once you’ve learned the capabilities and relative performance levels of the various instructions, you should be able to select the best instructions fairly easily. What’s more, this is a task at which compilers excel. What I’m saying is that you shouldn’t get too caught up in counting cycles because that’s a small (albeit important) part of the optimization picture, and not the area in which your greatest advantage lies.

Local Optimization

One level at which assembly language programming pays off handsomely is that of local optimization; that is, selecting the best sequence of instructions for a task. The key to local optimization is viewing the 80x86 instruction set as a set of building blocks, each with unique characteristics. Your job is to sequence those blocks so that they perform well. It doesn’t matter what the instructions are intended to do or what their names are; all that matters is what they do.

Our discussion of LOOP versus DEC/JNZ is an excellent example of optimization by cycle counting. It’s worth knowing, but once you’ve learned it, you just routinely use DEC/JNZ at the bottom of loops in 386/486-specific code, and that’s that. Besides, you’ll save at most a few cycles each time, and while that helps a little, it’s not going to make all that much difference.

Now let’s step back for a moment, and with no preconceptions consider what the x86 instruction set can do for us. The bulk of the time with both LOOP and DEC/JNZ is taken up by branching, which just happens to be one of the slowest aspects of every processor in the x86 family, and the rest is taken up by decrementing the count register and checking whether it’s zero. There may be ways to perform those tasks a little faster by selecting different instructions, but they can get only so fast, and branching can’t even get all that fast.

The trick, then, is not to find the fastest way to decrement a count and branch conditionally, but rather to figure out how to accomplish the same result without decrementing or branching as often. Remember the Kobiyashi Maru problem in Star Trek?The same principle applies here: Redefine the problem to one that offers better solutions.

Consider Listing 7.1, which searches a buffer until either the specified byte is found, a zero byte is found, or the specified number of characters have been checked. Such a function would be useful for scanning up to a maximum number of characters in a zero-terminated buffer. Listing 7.1, which uses LOOP in the main loop, performs a search of the sample string for a period (‘.’) in 170 µs on a 20 MHz cached 386.

When the LOOP in Listing 7.1 is replaced with DEC CX/JNZ, performance improves to 168 µs, less than 2 percent faster than Listing 7.1. Actually, instruction fetching, instruction alignment, cache characteristics, or something similar is affecting these results; I’d expect a slightly larger improvement—around 7 percent—but that’s the most that counting cycles could buy us in this case. (All right, already; LOOPNZ could be used at the bottom of the loop, and other optimizations are surely possible, but all that won’t add up to anywhere near the benefits we’re about to see from local optimization, and that’s the whole point.)

Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash