## Chapter 9Hints My Readers Gave Me

### Optimization Odds and Ends from the Field

Back in high school, I took a pre-calculus class from Mr. Bourgeis, whose most notable characteristics were incessant pacing and truly enormous feet. My friend Barry, who sat in the back row, right behind me, claimed that it was because of his large feet that Mr. Bourgeis was so restless. Those feet were so heavy, Barry hypothesized, that if Mr. Bourgeis remained in any one place for too long, the floor would give way under the strain, plunging the unfortunate teacher deep into the mantle of the Earth and possibly all the way through to China. Many amusing cartoons were drawn to this effect.

Unfortunately, Barry was too busy drawing cartoons, or, alternatively, sleeping, to actually learn any math. In the long run, that didn’t turn out to be a handicap for Barry, who went on to become vice-president of sales for a ham-packing company, where presumably he was rarely called upon to derive the quadratic equation. Barry’s lack of scholarship caused some problems back then, though. On one memorable occasion, Barry was half-asleep, with his eyes open but unfocused and his chin balanced on his hand in the classic “if I fall asleep my head will fall off my hand and I’ll wake up” posture, when Mr. Bourgeis popped a killer problem:

“Barry, solve this for X, please.” On the blackboard lay the equation:

```X - 1 = 0
```

“Minus 1,” Barry said promptly.

Mr. Bourgeis shook his head mournfully. “Try again.” Barry thought hard. He knew the fundamental rule that the answer to most mathematical questions is either 0, 1, infinity, -1, or minus infinity (do not apply this rule to balancing your checkbook, however); unfortunately, that gave him only a 25 percent chance of guessing right.

“One,” I whispered surreptitiously.

“One,” I whispered louder. Barry looked still more thoughtful—a bad sign—so I whispered “one” again, even louder. Barry looked so thoughtful that his eyes nearly rolled up into his head, and I realized that he was just doing his best to convince Mr. Bourgeis that Barry had solved this one by himself.

As Barry neared the climax of his stirring performance and opened his mouth to speak, Mr. Bourgeis looked at him with great concern. “Barry, can you hear me all right?”

“Yes, sir,” Barry replied. “Why?”

“Well, I could hear the answer all the way up here. Surely you could hear it just one row away?”

The class went wild. They might as well have sent us home early for all we accomplished the rest of the day.

I like to think I know more about performance programming than Barry knew about math. Nonetheless, I always welcome good ideas and comments, and many readers have sent me a slew of those over the years. So in this chapter, I think I’ll return the favor by devoting a chapter to reader feedback.

#### Another Look at LEA

Several people have pointed out that while LEA is great for performing certain additions (see Chapter 6), it isn’t a perfect replacement for ADD. What’s the difference? LEA, an addressing instruction by trade, doesn’t affect the flags, while the arithmetic ADD instruction most certainly does. This is no problem when performing additions that involve only quantities that fit in one machine word (32 bits in 386 protected mode, 16 bits otherwise), but it renders LEA useless for multiword operations, which use the Carry flag to tie together partial results. For example, these instructions

```ADD  EAX,EBX
```

could not be replaced

```LEA  EAX,[EAX+EBX]
```

because LEA doesn’t affect the Carry flag.

The no-carry characteristic of LEA becomes a distinct advantage when performing pointer arithmetic, however. For instance, the following code uses LEA to advance the pointers while adding one 128-bit memory variable to another such variable:

```   MOV   ECX,4   ;# of 32-bit words to add
CLC
;no carry into the initial ADC

MOV   EAX,[ESI]    ;get the next element of one array
LEA   ESI,[ESI+4]  ;advance one array’s pointer
LEA   EDI,[EDI+4]  ;advance the other array’s pointer
```

(Yes, I could use LODSD instead of MOV/LEA; I’m just illustrating a point here. Besides, LODS is only 1 cycle faster than MOV/LEA on the 386, and is actually more than twice as slow on the 486.) If we used ADD rather than LEA to advance the pointers, the carry from one ADC to the next would have to be preserved with either PUSHF/POPF or LAHF/SAHF. (Alternatively, we could use multiple INCs, since INC doesn’t affect the Carry flag.)

In short, LEA is indeed different from ADD. Sometimes it’s better. Sometimes not; that’s the nature of the various instruction substitutions and optimizations that will occur to you over time. There’s no such thing as “best” instructions on the x86; it all depends on what you’re trying to do.

But there sure are a lot of interesting options, aren’t there?

#### The Kennedy Portfolio

Reader John Kennedy regularly passes along intriguing assembly programming tricks, many of which I’ve never seen mentioned anywhere else. John likes to optimize for size, whereas I lean more toward speed, but many of his optimizations are good for both purposes. Here are a few of my favorites:

John’s code for setting AX to its absolute value is:

```CWD
XOR   AX,DX
SUB   AX,DX
```

This does nothing when bit 15 of AX is 0 (that is, if AX is positive). When AX is negative, the code “nots” it and adds 1, which is exactly how you perform a two’s complement negate. For the case where AX is not negative, this trick usually beats the stuffing out of the standard absolute value code:

```   AND   AX,AX        ;negative?
JNS   IsPositive   ;no
NEG   AX           ;yes,negate it
IsPositive:
```

However, John’s code is slower on a 486; as you’re no doubt coming to realize (and as I’ll explain in Chapters 12 and 13), the 486 is an optimization world unto itself.

Here’s how John copies a block of bytes from DS:SI to ES:DI, moving as much data as possible a word at a time:

```SHR   CX,1      ;word count
REP   MOVSW     ;copy as many words as possible
ADC   CX,CX     ;CX=1 if copy length was odd,
;0 else
REP   MOVSB     ;copy any odd byte
```

(ADC CX,CX can be replaced with RCL CX,1; which is faster depends on the processor type.) It might be hard to believe that the above is faster than this:

```   SHR   CX,1      ;word count
REP   MOVSW     ;copy as many words as
;possible
JNC   CopyDone  ;done if even copy length
MOVSB           ;copy the odd byte
CopyDone:
```

Graphics Programming Black Book © 2001 Michael Abrash