Previous Table of Contents Next


LISTING 16.4 L16-4.ASM

 ; Assembly subroutine for Listing 16.2. Scans through Buffer, of
 ; length BufferLength, counting words and updating WordCount as
 ; appropriate, using a lookup table-based approach. BufferLength
 ; must be > 0. *CharFlag and *WordCount should equal 0 on the
 ; first call. Tested with TASM.
 ; C near-callable as:
 ; void ScanBuffer(char *Buffer, unsigned int BufferLength,
 ; char *CharFlag, unsigned long *WordCount);
 
 parms   struc
         dw      2 dup(?)        ;pushed return address & BP
 Buffer  dw      ?               ;buffer to scan
 BufferLength dw ?               ;length of buffer to scan
 CharFlag dw     ?               ;pointer to flag for state of last
                                 ;char processed on entry (0 on
                                 ;initial call). Updated on exit
 WordCount dw    ?               ;pointer to 32-bit count of words
                                 ; found (0 on initial call)
 parms   ends
 
         .model  small
         .data
 ; Table of char/not statuses for byte values 0-255 (128-255 are
 ; duplicates of 0-127 to effectively mask off bit 7, which some
 ; word processors set as an internal flag).
 CharStatusTable label   byte
         REPT    2
         db      39 dup(0)
         db      1               ;apostrophe
         db      8 dup(0)
         db      10 dup(1)       ;0-9
         db      7 dup(0)
         db      26 dup(1)       ;A-Z
         db      6 dup(0)
         db      26 dup(1)       ;a-z
         db      5 dup(0)
         ENDM
 
         .code
         public  _ScanBuffer
 _ScanBuffer     proc    near
         push    bp              ;preserve caller’s stack frame
         mov     bp,sp           ;set up local stack frame
         push    si              ;preserve caller’s register vars
         push    di
 
         mov     si,[bp+Buffer]  ;point to buffer to scan
         mov     bx,[bp+WordCount]
         mov     di,[bx]         ;get current 32-bit word count
         mov     dx,[bx+2]
         mov     bx,[bp+CharFlag]
         mov     al,[bx]         ;get current CharFlag
         mov     cx,[bp+BufferLength] ;get # of bytes to scan
         mov     bx,offset CharStatusTable
 ScanLoop:
         and     al,al           ;ZF=0 if last byte was a char,
                                 ; ZF=1 if not
         lodsb                   ;get the next byte
                                 ;***doesn’t change flags***
         xlat                    ;look up its char/not status
                                 ;***doesn’t change flags***
         jz      ScanLoopBottom  ;don’t count a word if last byte was
                                 ; not a character
         and     al,al           ;last byte was a character; is the
                                 ; current byte a character?
         jz      CountWord       ;no, so count a word
 ScanLoopBottom:
         dec     cx              ;count down buffer length
         jnz     ScanLoop
 Done:
         mov     si,[bp+CharFlag]
         mov     [si],al         ;set new CharFlag
         mov     bx,[bp+WordCount]
         mov     [bx],di         ;set new word count
         mov     [bx+2],dx
 
         pop     di              ;restore caller’s register vars
         pop     si
         pop     bp              ;restore caller’s stack frame
         ret
 
         align   2
 CountWord:
         add     di,1            ;increment the word count
         adc     dx,0
         dec     cx              ;count down buffer length
         jnz     ScanLoop
         jmp     Done
 _ScanBuffer     endp
         end
 

Listing 16.4 features several interesting tricks. First, it uses LODSB and XLAT in succession, a very neat way to get a pointed-to byte, advance the pointer, and look up the value indexed by the byte in a table, all with just two instruction bytes. (Interestingly, Listing 16.4 would probably run quite a bit better still on an 8088, where LODSB and XLAT have a greater advantage over conventional instructions. On the 486 and Pentium, however, LODSB and XLAT lose much of their appeal, and should be replaced with MOV instructions.) Better yet, LODSB and XLAT don’t alter the flags, so the Zero flag status set before LODSB is still around to be tested after XLAT .

Finally, if you look closely, you will see that Listing 16.4 jumps out of the loop to increment the word count in the case where a word is actually found, with a duplicate of the loop-bottom code placed after the code that increments the word count, to avoid an extra branch back into the loop; this replaces the more intuitive approach of jumping around the incrementing code to the loop bottom when a word isn’t found. Although this incurs a branch every time a word is found, a word is typically found only once every 5 or 6 bytes; on average, then, a branch is saved about two-thirds of the time. This is an excellent example of how understanding the nature of the data you’re processing allows you to optimize in ways the compiler can’t. Know your data!

So, gosh, Listing 16.4 is the best word-counting code in the universe, right? Not hardly. If there’s one thing my years of toil in this vale of silicon have taught me, it’s that there’s never a lack of potential for further optimization. Never! Off the top of my head, I can think of at least three ways to speed up Listing 16.4; and, since Turbo Profiler reports that even in Listing 16.4, 88 percent of the time is spent scanning the buffer (as opposed to reading the file), there’s potential for those further optimizations to improve performance significantly. (However, it is true that when access is performed to a hard rather than RAM disk, disk access jumps to about half of overall execution time.) One possible optimization is unrolling the loop, although that is truly a last resort because it tends to make further changes extremely difficult.

Exhaust all other optimizations before unrolling loops.

Challenges and Hazards

The challenge I put to the readers of PC TECHNIQUES was to write a faster module to replace Listing 16.4. The author of the code that counted the words in my secret test file fastest on my 20 MHz cached 386 would be the winner and receive Numerous Valuable Prizes.

No listings were to be longer than 200 lines. No complete programs were to be accepted; submissions had to be plug-compatible with Listing 16.4. (This was to encourage people not to waste time optimizing outside the inner loop.) Finally, the code had to produce the same results as Listing 16.4; I didn’t want to see functions that approximated the word count by dividing the number of characters by six instead of counting actual words!

So how did the entrants in this particular challenge stack up? More than one claimed a speed-up over my assembly word-counting code of more than three times. On top of the three-times speedup over the original C code that I had already realized, we’re almost up to an order of magnitude faster. You are, of course, entitled to your own opinion, but I consider an order of magnitude to be significant.


Previous Table of Contents Next

Graphics Programming Black Book © 2001 Michael Abrash