Crochemore and Perrin have published the algorithm in this paper. But I found that one not the easiest read, especially for practical implementations. So I'm going to try to explain it in a simpler way.
Chapter 1: The problem
We want to find the first occurrence of one string inside another. The simplest solution for this is probably the naive algorithm:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h), nl = strlen(n);
if (hl < nl) return 0;
for (size_t i = 0; i <= hl - nl; i++) {
size_t j;
for (j = 0; j < nl && h[i+j] == n[j]; j++);
if (j == nl) return (char *)h + i;
}
return 0;
}
Chapter 2: Magic pseudo-code
At this point the paper goes into a lot of detail on periods and local periods. I honestly didn't quite absorb everything. But we can skip forward a bit to where it gets interesting again. So we have established that it is important to do something like the above code is doing, but maximize the step width in the outer loop each time, as the situation allows. To that end, the designers of the algorithm have defined a magic split, which they call a critical factorization, of the pattern string. And they are going to check the right side first and the left side after. Because this allows for some clever tricks. So the code so far looks like this:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h);
size_t nl = strlen(n);
if (hl < nl) return 0;
size_t p = period(n);
size_t split = critical_factorization(n);
size_t mem = 0;
size_t k;
for (size_t i = 0; i <= hl - nl;)
{
for (k = MAX(mem, split); k < nl && h[i+k] == n[k]; k++);
if (k < nl) {
i += k - split + 1;
mem = 0;
continue;
}
for (k = split; k > mem && h[i+k-1] == n[k-1]; k--);
if (k <= mem) return (char *)h + i;
i += p;
mem = nl - p;
}
return 0;
}
So what's up with that "mem" business? Well, if a mismatch occurs on the left side, then we could successfully match the right side. Which means next iteration we already know that the first so many characters are matching. Which means we don't need to check them again.
Now, there are two magic pieces in there. For one, calculating the period of a string is not actually simple. For two, there's the algorithm for getting a critical factorization, that is not given right now. Thankfully, both problems are linked, because Crochemore and Perrin noticed that if you calculate the maximum suffix of a string, and you do it for both possible orderings of the alphabet, then the shorter suffix of the two is at a critical factorization. And the algorithm that produces the maximum suffix also produces the period of the suffix. And then they noticed that it is easy to check if that is also the period of the string itself, and if not then they don't know the period, but can give a lower bound.
In the above algorithm, if we don't know the period of the needle exactly, then we can still execute the algorithm with a lower bound. We move the window forward less efficiently than we could, but we wouldn't skip a possible solution. And the argument for the "mem" stuff wouldn't work then, so in that case we need to keep "mem" set to 0.
Chapter 3: The maximum suffix algorithm
That is the most magical part of the algorithm. It is hard to understand its result, and also its way of calculating it. I'll just present my way of writing the algorithm first:
Code: Select all
struct factorization {
size_t ms; /* position of the maximum suffix */
size_t p; /* period of the suffix */
};
static struct factorization maximum_suffix(const unsigned char *n, size_t nl, int rev)
{
size_t ip = -1; /* candidate suffix position */
size_t jp = 0; /* candidate compare position */
size_t k = 1; /* check index inside the window */
size_t p = 1; /* period / window length */
while (jp+k < nl) {
if (n[ip+k] == n[jp+k]) {
if (k == p) {
jp += k;
k = 1;
} else k++;
} else if ((n[ip+k] < n[jp+k]) ^ rev) {
ip = jp++;
k = p = 1;
} else {
jp += k;
k = 1;
p = jp - ip;
}
}
return (struct factorization){ip+1, p};
}
Incidentally, this is where we reach almost the end, because all that is left is using it.
Chapter 4: Putting it all together
The final (pure) algorithm then looks like this:
Code: Select all
char *strstr(const char *h, const char *n)
{
size_t hl = strlen(h);
size_t nl = strlen(n);
if (hl < nl) return 0;
struct factorization crit, rev;
size_t mem = 0, mem0 = 0;
size_t k;
crit = maximum_suffix((void *)n, nl, 0);
rev = maximum_suffix((void *)n, nl, 1);
if (rev.ms > crit.ms)
crit = rev;
/* is crit.p the period of the needle? */
if (!memcmp(n, n+crit.p, crit.ms) {
/* yes */
mem0 = nl - crit.p;
} else {
/* no */
crit.p = MAX(crit.ms, nl-crit.ms-1) + 1; /* lower bound, so leave mem0 at 0 */
}
for (size_t i = 0; i <= hl - nl;)
{
for (k = MAX(mem, crit.ms); k < nl && h[i+k] == n[k]; k++);
if (k < nl) {
i += k - split + 1;
mem = 0;
continue;
}
for (k = crit.ms; k > mem && h[i+k-1] == n[k-1]; k--);
if (k <= mem) return (char *)h + i;
i += crit.p;
mem = mem0;
}
return 0;
}
Chapter 5: Optimizations
Which I'm not putting into source code, mind you. You can read up on those yourself in practical implementations (such as in musl). But here are some ideas:
- If the needle is short enough to fit in a register, you can perform a big-endian decode on it. Then do something similar with a sliding view of the haystack, and you have yourself a strstr algorithm that is guaranteed linear in time.
- If the above algorithm is called with a needle consisting of characters entirely not contained in the haystack, then it will still perform hl-nl comparisons. But if the alphabet is relatively small (say, if it really is a strstr() implementation and CHAR_BIT is 8, so the alphabet consists of 256 characters), then the memory costs for a single round of the Boyer-Moore algorithm are not only constant but justifiable to put on the stack. That way, the algorithm will perform hl/nl comparisons in the case mentioned. In all others, it ensures that the last character of the haystack window we are currently looking at matches before the rest of the algorithm is engaged.
- It is not actually necessary to calculate either the length of the needle or the haystack. Doing so could indeed take a long time. It is only necessary to calculate the smaller of both lengths. If the needle is longer than the haystack, we can terminate early. And in the other case, the only thing we actually need in the loop is that the string starting at h+i is at least nl characters long. So we could just look for a null byte in that range of bytes only.