### The Boyer-Moore Algorithm

All our a priori knowledge of string searching is stated above, but there’s another sort of knowledge—knowledge that’s generated dynamically. As we search through the buffer, we acquire information each time we check for a match. One sort of information that we acquire is based on partial matches; we can often skip ahead after partial matches because (take a deep breath!) by partially matching, we have already implicitly done a comparison of the partially matched buffer characters with all possible pattern start locations that overlap those partially-matched bytes.

If that makes your head hurt, it should—and don’t worry. This line of thinking, which is the basis of the Knuth-Morris-Pratt algorithm and half the basis of the Boyer-Moore algorithm, is what gives Boyer-Moore its reputation for inscrutability. That reputation is well deserved for this aspect (which I will not discuss further in this book), but there’s another part of Boyer-Moore that’s easily understood, easily implemented, and highly effective.

Consider this: We’re searching for the pattern “ABC,” beginning the search at the start (offset 0) of a buffer containing “ABZABC.” We match on ‘A,’ we match on ‘B,’ and we mismatch on ‘C’; the buffer contains a ‘Z’ in this position. What have we learned? Why, we’ve learned not only that the pattern doesn’t match the buffer starting at offset 0, but also that it can’t possibly match starting at offset 1 or offset 2, either! After all, there’s a ‘Z’ in the buffer at offset 2; since the pattern doesn’t contain a single ‘Z,’ there’s no way that the pattern can match starting at any location from which it would span the ‘Z’ at offset 2. We can just skip straight from offset 0 to offset 3 and continue, saving ourselves two comparisons.

Unfortunately, this approach only pays off big when a near-complete partial match is found; if the comparison fails on the first pattern character, as often happens, we can only skip ahead 1 byte, as usual. Look at it differently, though: What if we compare the pattern starting with the last (rightmost) byte, rather than the first (leftmost) byte? In other words, what if we compare from high memory toward low, in the direction in which string instructions go after the STD instruction? After all, we’re comparing one set of bytes (the pattern) to another set of bytes (a portion of the buffer); it doesn’t matter in the least in what order we compare them, so long as all the bytes in one set are compared to the corresponding bytes in the other set.

 Why on earth would we want to start with the rightmost character? Because a mismatch on the rightmost character tells us a great deal more than a mismatch on the leftmost character.

We learn nothing new from a mismatch on the leftmost character, except that the pattern can’t match starting at that location. A mismatch on the rightmost character, however, tells us about the possibilities of the pattern matching starting at every buffer location from which the pattern spans the mismatch location. If the mismatched character in the buffer doesn’t appear in the pattern, then we’ve just eliminated not one potential match, but as many potential matches as there are characters in the pattern; that’s how many locations there are in the buffer that might have matched, but have just been shown not to, because they overlap the mismatched character that doesn’t belong in the pattern. In this case, we can skip ahead by the full pattern length in the buffer! This is how we can outperform even REPNZ SCASB; REPNZ SCASB has to check every byte in the buffer, but Boyer-Moore doesn’t.

Figure 14.1 illustrates the operation of a Boyer-Moore search when the rightcharacter of the search pattern (which is the first character that’s compared at each location because we’re comparing backwards) mismatches with a buffer character that appears nowhere in the pattern. Figure 14.2 illustrates the operation of a partial match when the mismatch occurs with a character that’s not a pattern member. In this case, we can only skip ahead past the mismatch location, resulting in an advance of fewer bytes than the pattern length, and potentially as little as the same single byte distance by which the standard search approach advances.

Figure 14.1
Mismatch on first character checked.

What if the mismatch occurs with a buffer character that does occur in the pattern? Then we can’t skip past the mismatch location, but we can skip to whatever location aligns the rightmost occurrence of that character in the pattern with the mismatch location, as shown in Figure 14.3.

Basically, we exercise our right as members of a free society to compare strings in whichever direction we choose, and we choose to do so right to left, rather than the more intuitive left to right. Whenever we find a mismatch, we see what we can learn from the buffer character that failed to match the pattern. Imagine that we move the pattern to the right across the mismatch location until we find a start location that the mismatch does not eliminate as a possible match for the pattern. If the mismatch character doesn’t appear in the pattern, the pattern can move clear past the mismatch location. Otherwise, the pattern moves until a matching pattern byte lies atop the mismatch. That’s all there is to it!

Figure 14.2
Mismatch on third character checked.

### Boyer-Moore: The Good and the Bad

The worst case for this version of Boyer-Moore is that the pattern mismatches on the leftmost character—the last character compared—every time. Again, not very likely, but it is true that this version of Boyer-Moore performs better as there are fewer and shorter partial matches; ideally, the rightmost character would never match until the full match location was reached. Longer patterns, which make for longer skips, help Boyer-Moore, as does a long distance to the match location, which helps diffuse the overhead of building the table of distances to skip ahead on all the possible mismatch values.

Figure 14.3
Mismatch on character that appears in pattern.

The best case for Boyer-Moore is good indeed: About N/M comparisons are required, where N is the buffer length and M is the pattern length. This reflects the ability of Boyer-Moore to skip ahead by a full pattern length on a complete mismatch.

Graphics Programming Black Book © 2001 Michael Abrash