1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.
2. The classical solution
I once saw the following in a discussion about beautiful code. Well, beauty is in the eye of the beholder.
Code: Select all
int match(const char *pattern, const char *string)
{
switch (*pattern) {
case 0: return !*string;
case '?': return *string && match(pattern + 1, string + 1);
case '*': return match(pattern + 1, string) || (*string && match(pattern, string + 1));
default: return *string == *pattern && match(pattern + 1, string + 1);
}
}
At its heart, this algorithm implements a backtracking approach to regular-expression matching, albeit with a very simple RE language: A star is matched by either ignoring it completely, or, if that fails, backtracking to the most recent star (by way of the call stack), then advancing the string.
The algorithm I'm showcasing today will keep that backtracking behavior, but massively improve the auxiliary storage needed (dropping down to O(1)).
3. The solution
Note one thing: Even if you add back in the complexity I removed earlier, every element of the pattern string but the star can match exactly one character in the string. This means that for every pattern that has no star in it, we can know how many characters a matching string needs to have. Thus we can actually successfully match a string backwards, which is normally impossible in RE matching. The algorithm therefore takes the following general structure:
- Match the beginning of the string against the part of the pattern before the first star. A mismatch here is a mismatch of the entire string. Reaching the end of the pattern disposes of the entire function at this point. Remove the already-matched parts of the string from consideration.
- Count the number of tokens in the pattern past the last star in it.
- Match the tail of the string against the part of the pattern past the last star. A mismatch once again is a mismatch of the entire pattern. But reaching the end is obviously no longer dispositive. Finally, remove the already-matched tail of the string from consideration.
- You now have a pattern that starts and ends with a star. Repeatedly attempt to match the part between the first and last star against the pattern. On mismatch, remove the first character of the string and retry. If in so doing you reach yet another star, discard the now newly-matched string part and the pattern up to the new star. Do this until you run out of string or match everything up to the last star. The former case is a matching failure, the latter is a success.