Algorithm showcase: The two-way string matching algorithm
Posted: Sat Jul 01, 2023 1:40 am
It is interesting to see which algorithms are getting writeups and videos made about them on the Internet. When you search for string matching algorithms, you can find lecture after lecture about the naive algorithm, the Rabin-Karp algorithm, the Boyer-Moore algorithm, and the Knuth-Morris-Pratt algorithm. But the de-facto standard one, the two-way algorithm (which, to stay in the naming convention, we should probably call the Crochemore-Perrin algorithm) gets an unfinished Wikipedia page and the odd Sewage Outflow question about it. I guess because of Computer Science text books. But it's understandable that those cannot contain every innovation. The algorithm is only about as old as me, after all (and to my knowledge, this board has many members younger than it).
Crochemore and Perrin have published the algorithm in this paper. But I found that one not the easiest read, especially for practical implementations. So I'm going to try to explain it in a simpler way.
Chapter 1: The problem
We want to find the first occurrence of one string inside another. The simplest solution for this is probably the naive algorithm:You can also try to improve on this by searching for the next possible match with strchr() and doing the inner compare with strncmp(). But it doesn't change anything about the complexity of the algorithm. And this algorithm has problems: It does not use the position of the first mismatch for anything. Mismatch on the first or the last character, it won't matter, the outer loop simply goes one step forward. In this way there is the well known pathological case if h is "aaaaaaa..." and n is "aaaa...b". So if we could change the algorithm in such a way that the outer loop skips more than one character, we could improve upon this massively. This is precisely what basically all string matching algorithms try to do.
Chapter 2: Magic pseudo-code
At this point the paper goes into a lot of detail on periods and local periods. I honestly didn't quite absorb everything. But we can skip forward a bit to where it gets interesting again. So we have established that it is important to do something like the above code is doing, but maximize the step width in the outer loop each time, as the situation allows. To that end, the designers of the algorithm have defined a magic split, which they call a critical factorization, of the pattern string. And they are going to check the right side first and the left side after. Because this allows for some clever tricks. So the code so far looks like this:So what's going on here? We have split the string into two parts, and we verify that the right part part matches first. If the right part doesn't match, we can move forward our checking window such that the mismatch goes left of the split (which just means we can move forward one more step than we could successfully match). Then we move on to match the left side. If a mismatch occurs there, then we can move the needle forward by one entire period.
So what's up with that "mem" business? Well, if a mismatch occurs on the left side, then we could successfully match the right side. Which means next iteration we already know that the first so many characters are matching. Which means we don't need to check them again.
Now, there are two magic pieces in there. For one, calculating the period of a string is not actually simple. For two, there's the algorithm for getting a critical factorization, that is not given right now. Thankfully, both problems are linked, because Crochemore and Perrin noticed that if you calculate the maximum suffix of a string, and you do it for both possible orderings of the alphabet, then the shorter suffix of the two is at a critical factorization. And the algorithm that produces the maximum suffix also produces the period of the suffix. And then they noticed that it is easy to check if that is also the period of the string itself, and if not then they don't know the period, but can give a lower bound.
In the above algorithm, if we don't know the period of the needle exactly, then we can still execute the algorithm with a lower bound. We move the window forward less efficiently than we could, but we wouldn't skip a possible solution. And the argument for the "mem" stuff wouldn't work then, so in that case we need to keep "mem" set to 0.
Chapter 3: The maximum suffix algorithm
That is the most magical part of the algorithm. It is hard to understand its result, and also its way of calculating it. I'll just present my way of writing the algorithm first:This algorithm corresponds to the function "maximal-suffix" in the paper. I only add the twist that I can flip the comparison with a parameter. The XOR gate is very versatile, and here I am using it in its role of the programmable NOT. By ordering the cases such that equality is taken care of already, the only cases remaining are the less-than and greater-than. It also does not matter if I've got them the wrong way around from the paper, since I am going to run both directions (setting rev to 1 flips the cases again, and in the end I don't care about the ordering that wins, I only care about which suffix is shorter).
Incidentally, this is where we reach almost the end, because all that is left is using it.
Chapter 4: Putting it all together
The final (pure) algorithm then looks like this:As I said, the important thing is which suffix is shorter, so which offset to the suffix is larger. That one wins. And this is getting pretty close to a practical implementation already, but there a couple of things we can still do.
Chapter 5: Optimizations
Which I'm not putting into source code, mind you. You can read up on those yourself in practical implementations (such as in musl). But here are some ideas:
Crochemore and Perrin have published the algorithm in this paper. But I found that one not the easiest read, especially for practical implementations. So I'm going to try to explain it in a simpler way.
Chapter 1: The problem
We want to find the first occurrence of one string inside another. The simplest solution for this is probably the naive algorithm:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h), nl = strlen(n);
if (hl < nl) return 0;
for (size_t i = 0; i <= hl - nl; i++) {
size_t j;
for (j = 0; j < nl && h[i+j] == n[j]; j++);
if (j == nl) return (char *)h + i;
}
return 0;
}
Chapter 2: Magic pseudo-code
At this point the paper goes into a lot of detail on periods and local periods. I honestly didn't quite absorb everything. But we can skip forward a bit to where it gets interesting again. So we have established that it is important to do something like the above code is doing, but maximize the step width in the outer loop each time, as the situation allows. To that end, the designers of the algorithm have defined a magic split, which they call a critical factorization, of the pattern string. And they are going to check the right side first and the left side after. Because this allows for some clever tricks. So the code so far looks like this:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h);
size_t nl = strlen(n);
if (hl < nl) return 0;
size_t p = period(n);
size_t split = critical_factorization(n);
size_t mem = 0;
size_t k;
for (size_t i = 0; i <= hl - nl;)
{
for (k = MAX(mem, split); k < nl && h[i+k] == n[k]; k++);
if (k < nl) {
i += k - split + 1;
mem = 0;
continue;
}
for (k = split; k > mem && h[i+k-1] == n[k-1]; k--);
if (k <= mem) return (char *)h + i;
i += p;
mem = nl - p;
}
return 0;
}
So what's up with that "mem" business? Well, if a mismatch occurs on the left side, then we could successfully match the right side. Which means next iteration we already know that the first so many characters are matching. Which means we don't need to check them again.
Now, there are two magic pieces in there. For one, calculating the period of a string is not actually simple. For two, there's the algorithm for getting a critical factorization, that is not given right now. Thankfully, both problems are linked, because Crochemore and Perrin noticed that if you calculate the maximum suffix of a string, and you do it for both possible orderings of the alphabet, then the shorter suffix of the two is at a critical factorization. And the algorithm that produces the maximum suffix also produces the period of the suffix. And then they noticed that it is easy to check if that is also the period of the string itself, and if not then they don't know the period, but can give a lower bound.
In the above algorithm, if we don't know the period of the needle exactly, then we can still execute the algorithm with a lower bound. We move the window forward less efficiently than we could, but we wouldn't skip a possible solution. And the argument for the "mem" stuff wouldn't work then, so in that case we need to keep "mem" set to 0.
Chapter 3: The maximum suffix algorithm
That is the most magical part of the algorithm. It is hard to understand its result, and also its way of calculating it. I'll just present my way of writing the algorithm first:
Code: Select all
struct factorization {
size_t ms; /* position of the maximum suffix */
size_t p; /* period of the suffix */
};
static struct factorization maximum_suffix(const unsigned char *n, size_t nl, int rev)
{
size_t ip = -1; /* candidate suffix position */
size_t jp = 0; /* candidate compare position */
size_t k = 1; /* check index inside the window */
size_t p = 1; /* period / window length */
while (jp+k < nl) {
if (n[ip+k] == n[jp+k]) {
if (k == p) {
jp += k;
k = 1;
} else k++;
} else if ((n[ip+k] < n[jp+k]) ^ rev) {
ip = jp++;
k = p = 1;
} else {
jp += k;
k = 1;
p = jp - ip;
}
}
return (struct factorization){ip+1, p};
}
Incidentally, this is where we reach almost the end, because all that is left is using it.
Chapter 4: Putting it all together
The final (pure) algorithm then looks like this:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h);
size_t nl = strlen(n);
if (hl < nl) return 0;
struct factorization crit, rev;
size_t mem = 0, mem0 = 0;
size_t k;
crit = maximum_suffix((void *)n, nl, 0);
rev = maximum_suffix((void *)n, nl, 1);
if (rev.ms > crit.ms)
crit = rev;
/* is crit.p the period of the needle? */
if (!memcmp(n, n+crit.p, crit.ms) {
/* yes */
mem0 = nl - crit.p;
} else {
/* no */
crit.p = MAX(crit.ms, nl-crit.ms-1) + 1; /* lower bound, so leave mem0 at 0 */
}
for (size_t i = 0; i <= hl - nl;)
{
for (k = MAX(mem, crit.ms); k < nl && h[i+k] == n[k]; k++);
if (k < nl) {
i += k - split + 1;
mem = 0;
continue;
}
for (k = crit.ms; k > mem && h[i+k-1] == n[k-1]; k--);
if (k <= mem) return (char *)h + i;
i += crit.p;
mem = mem0;
}
return 0;
}
Chapter 5: Optimizations
Which I'm not putting into source code, mind you. You can read up on those yourself in practical implementations (such as in musl). But here are some ideas:
- If the needle is short enough to fit in a register, you can perform a big-endian decode on it. Then do something similar with a sliding view of the haystack, and you have yourself a strstr algorithm that is guaranteed linear in time.
- If the above algorithm is called with a needle consisting of characters entirely not contained in the haystack, then it will still perform hl-nl comparisons. But if the alphabet is relatively small (say, if it really is a strstr() implementation and CHAR_BIT is 8, so the alphabet consists of 256 characters), then the memory costs for a single round of the Boyer-Moore algorithm are not only constant but justifiable to put on the stack. That way, the algorithm will perform hl/nl comparisons in the case mentioned. In all others, it ensures that the last character of the haystack window we are currently looking at matches before the rest of the algorithm is engaged.
- It is not actually necessary to calculate either the length of the needle or the haystack. Doing so could indeed take a long time. It is only necessary to calculate the smaller of both lengths. If the needle is longer than the haystack, we can terminate early. And in the other case, the only thing we actually need in the loop is that the string starting at h+i is at least nl characters long. So we could just look for a null byte in that range of bytes only.