

Truth to tell, I didn’t expect a three-times speedup; around two times was what I had in mind. Which just goes to show that any code can be made faster than you’d expect, if you think about it long enough and from many different perspectives. (The most potent word-counting technique seems to be a 64K lookup table that allows handling two bytes simultaneously. This is not the sort of technique one comes up with by brute-force optimization.) Thinking (or, worse yet, boasting) that your code is the fastest possible is roller-skating on a tightrope in a hurricane; you’re due for a fall, if you catch my drift. Case in point: Terje Mathisen’s word-counting program.

Blinding Yourself to a Better Approach

Not so long ago, Terje Mathisen, whom I introduced earlier in this book, wrote a very fast word-counting program and posted it on Bix. When I say it was fast, I mean fast; this code was optimized like nobody’s business. We’re talking top-quality code here.

When the topic of optimizing came up in one of the Bix conferences, Terje’s program was mentioned, and he posted the following message: “I challenge BIXens (and especially mabrash!) to speed it up significantly. I would consider 5 percent a good result.” The clear implication was, “That code is as fast as it can possibly be.”

Naturally, it wasn’t; there ain’t no such thing as the fastest code (TANSTATFC? I agree, it doesn’t have the ring of TANSTAAFL). I pored over Terje’s 386 native-mode code, and found the critical inner loop, which was indeed as tight as one could imagine, consisting of just a few 386 native-mode instructions. However, one of the instructions was this:

 
 CMP   DH,[EBX+EAX]
 

Harmless enough, save for two things. First, EBX happened to be zero at this point (a leftover from an earlier version of the code, as it turned out), so it was superfluous as a memory-addressing component; this made it possible to use base-only addressing ([EAX]) rather than base+index addressing ([EBX+EAX]), which saves a cycle on the 386. Second: Changing the instruction to CMP [EAX],DH saved 2 cycles—just enough, by good fortune, to speed up the whole program by 5 percent.

CMP reg,[mem] takes 6 cycles on the 386, but CMP [mem],reg takes only 5 cycles; you should always perform CMP with the memory operand on the left on the 386.

(Granted, CMP [mem],reg is 1 cycle slower than CMP reg,[mem] on the 286, and they’re both the same on the 8088; in this case, though, the code was specific to the 386. In case you’re curious, both forms take 2 cycles on the 486; quite a lot faster, eh?)

Watch Out for Luggable Assumptions!

The first lesson to be learned here is not to lug assumptions that may no longer be valid from the 8088/286 world into the wonderful new world of 386 native-mode programming. The second lesson is that after you’ve slaved over your code for a while, you’re in no shape to see its flaws, or to be able to get the new perspectives needed to speed it up. I’ll bet Terje looked at that [EBX+EAX] addressing a hundred times while trying to speed up his code, but he didn’t really see what it did; instead, he saw what it was supposed to do. Mental shortcuts like this are what enable us to deal with the complexities of assembly language without overloading after about 20 instructions, but they can be a major problem when looking over familiar code.

The third, and most interesting, lesson is that a far more fruitful optimization came of all this, one that nicely illustrates that cycle counting is not the key to happiness, riches, and wondrous performance. After getting my 5 percent speedup, I mentioned to Terje the possibility of using a 64K lookup table. (This predated the arrival of entries for the optimization contest.) He said that he had considered it, but it didn’t seem to him to be worthwhile. He couldn’t shake the thought, though, and started to poke around, and one day, voila, he posted a new version of his word count program, WC50, that was much faster than the old version. I don’t have exact numbers, but Terje’s preliminary estimate was 80 percent faster, and word counting—including disk cache access time—proceeds at more than 3 MB per second on a 33 MHz 486. Even allowing for the speed of the 486, those are very impressive numbers indeed.
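To give a rough idea of how a 64K lookup table lets you handle two bytes at a time, here is a minimal C sketch of the general approach. It assumes a word character is simply anything other than whitespace (WC50 actually lets you choose the separator characters), and the table layout, flag names, and helper functions are my own illustrative assumptions; Terje’s WC50 does this in hand-optimized 386 assembly and is organized quite differently.

/*
 * Sketch only: count words two bytes at a time via a 64K pair table.
 * The flag encoding and helper names below are assumptions made for
 * illustration, not WC50's actual data structures.
 */
#include <ctype.h>
#include <stddef.h>

#define FIRST_IS_WORD  0x01  /* first byte of the pair is a word character  */
#define SECOND_IS_WORD 0x02  /* second byte of the pair is a word character */
#define INTERNAL_START 0x04  /* a word starts at the second byte            */

static unsigned char pair_info[65536];   /* the 64K lookup table */

static int is_word_char(unsigned char c)
{
    return !isspace(c);      /* sketch assumption: non-whitespace == word char */
}

/* Precompute word-boundary information for every possible byte pair. */
static void build_pair_table(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            unsigned char info = 0;
            if (is_word_char((unsigned char)a)) info |= FIRST_IS_WORD;
            if (is_word_char((unsigned char)b)) info |= SECOND_IS_WORD;
            if (!is_word_char((unsigned char)a) && is_word_char((unsigned char)b))
                info |= INTERNAL_START;
            pair_info[(a << 8) | b] = info;
        }
    }
}

/*
 * Count the words in buf, consuming two bytes per loop iteration.
 * *in_word carries the "currently inside a word" state across calls,
 * so a file can be processed in buffer-sized chunks.
 */
static long count_words(const unsigned char *buf, size_t len, int *in_word)
{
    long count = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        unsigned char info = pair_info[(buf[i] << 8) | buf[i + 1]];

        if (!*in_word && (info & FIRST_IS_WORD))
            count++;                      /* word starts at the first byte  */
        if (info & INTERNAL_START)
            count++;                      /* word starts at the second byte */
        *in_word = (info & SECOND_IS_WORD) != 0;
    }

    if (i < len) {                        /* odd byte left over at the end  */
        int w = is_word_char(buf[i]);
        if (!*in_word && w)
            count++;
        *in_word = w;
    }
    return count;
}

The payoff is that the per-byte classification and boundary detection collapse into a single table lookup for every two bytes of input; an assembly version can go further by folding the packed flags into the count with little or no branching.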

The point I want to make, though, is that the biggest optimization barrier that Terje faced was that he thought he had the fastest code possible. Once he opened up the possibility that there were faster approaches, and looked beyond the specific approach that he had so carefully optimized, he was able to come up with code that was a lot faster. Consider the incongruity of Terje’s willingness to consider a 5 percent speedup significant in light of his later near-doubling of performance.

Don’t get stuck in the rut of instruction-by-instruction optimization. It’s useful in key loops, but very often, a change in approach will work far greater wonders than any amount of cycle counting can.

By the way, Terje’s WC50 program is a full-fledged counting program; it counts characters, words, and lines, can handle multiple files, and lets you specify the characters that separate words, should you so desire. Source code is provided as part of the archive WC50 comes in. All in all, it’s a nice piece of work, and you might want to take a look at it if you’re interested in really fast assembly code. I wouldn’t call it the fastest word-counting code, though, because I would of course never be so foolish as to call anything the fastest.

The Astonishment of Right-Brain Optimization

As it happened, the challenge I issued to my PC TECHNIQUES readers was a smashing success, with dozens of good entries. I certainly enjoyed it, even though I did have to look at a lot of tricky assembly code that I didn’t write—hard work under the best of circumstances. It was worth the trouble, though. The winning entry was an astonishing example of what assembly language can do in the right hands; on my 386, it was four times faster at word counting than the nice, tight assembly code I provided as a starting point—and about 13 times faster than the original C implementation. Attention, high-level language chauvinists: Is the speedup getting significant yet? Okay, maybe word counting isn’t the most critical application, but how would you like to have that kind of improvement in your compression software, or in your real-time games—or in Windows graphics?

The winner was David Stafford, who at the time was working for Borland International; his entry is shown in Listing 16.5. Dave Methvin, whom some of you may recall as a tech editor of the late, lamented PC Tech Journal, was a close second, and Mick Brown, about whom I know nothing more than that he is obviously an extremely good assembly language programmer, was a close third, as shown in Table 16.2, which precedes Listing 16.5. Those three were out ahead of the pack; the fourth-place entry, good as it was (twice as fast as my original code), was twice as slow as David’s winning entry, so you can see that David, Dave, and Mick attained a rarefied level of optimization indeed.



Graphics Programming Black Book © 2001 Michael Abrash