Algorithm showcase: The Sea of Stars fnmatch algorithm
Posted: Wed Apr 09, 2025 11:17 am
Thought I'd do another one of these. I have no idea if the algorithm is actually called that or if Rich Felker just made it up. The only reference to it I could find was the musl FAQ just listing it, but there is no description of it there. And searching the web is made difficult thanks to the release of a video game by the name "Sea of Stars", which apparently takes precedence.
1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.
2. The classical solution
I once saw the following in a discussion about beautiful code. Well, beauty is in the eye of the beholder.The code is very simple, easy to understand and to read, but unfortunately has a very big security flaw: It requires linear stack (i.e. the number of recursive calls is O(n)). In many real-world use-cases, this will mean stack overflows, global state corruption, heap corruption, and just general nastiness. Even if it does just lead to a crash, that is still an attacker-controlled denial-of-service opportunity. And while the compiler can turn three of these recursive calls into tail-calls, there is still at least one remaining where that is impossible (the first call in the '*' case). Meaning an attacker can crash the system if they can make the pattern string just be a long string of stars.
At its heart, this algorithm implements a backtracking approach to regular-expression matching, albeit with a very simple RE language: A star is matched by either ignoring it completely, or, if that fails, backtracking to the most recent star (by way of the call stack), then advancing the string.
The algorithm I'm showcasing today will keep that backtracking behavior, but massively improve the auxiliary storage needed (dropping down to O(1)).
3. The solution
Note one thing: Even if you add back in the complexity I removed earlier, every element of the pattern string but the star can match exactly one character in the string. This means that for every pattern that has no star in it, we can know how many characters a matching string needs to have. Thus we can actually successfully match a string backwards, which is normally impossible in RE matching. The algorithm therefore takes the following general structure:
1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.
2. The classical solution
I once saw the following in a discussion about beautiful code. Well, beauty is in the eye of the beholder.
Code: Select all
int match(const char *pattern, const char *string)
{
switch (*pattern) {
case 0: return !*string;
case '?': return *string && match(pattern + 1, string + 1);
case '*': return match(pattern + 1, string) || (*string && match(pattern, string + 1));
default: return *string == *pattern && match(pattern + 1, string + 1);
}
}
At its heart, this algorithm implements a backtracking approach to regular-expression matching, albeit with a very simple RE language: A star is matched by either ignoring it completely, or, if that fails, backtracking to the most recent star (by way of the call stack), then advancing the string.
The algorithm I'm showcasing today will keep that backtracking behavior, but massively improve the auxiliary storage needed (dropping down to O(1)).
3. The solution
Note one thing: Even if you add back in the complexity I removed earlier, every element of the pattern string but the star can match exactly one character in the string. This means that for every pattern that has no star in it, we can know how many characters a matching string needs to have. Thus we can actually successfully match a string backwards, which is normally impossible in RE matching. The algorithm therefore takes the following general structure:
- Match the beginning of the string against the part of the pattern before the first star. A mismatch here is a mismatch of the entire string. Reaching the end of the pattern disposes of the entire function at this point. Remove the already-matched parts of the string from consideration.
- Count the number of tokens in the pattern past the last star in it.
- Match the tail of the string against the part of the pattern past the last star. A mismatch once again is a mismatch of the entire pattern. But reaching the end is obviously no longer dispositive. Finally, remove the already-matched tail of the string from consideration.
- You now have a pattern that starts and ends with a star. Repeatedly attempt to match the part between the first and last star against the pattern. On mismatch, remove the first character of the string and retry. If in so doing you reach yet another star, discard the now newly-matched string part and the pattern up to the new star. Do this until you run out of string or match everything up to the last star. The former case is a matching failure, the latter is a success.