Algorithm showcase: The Sea of Stars fnmatch algorithm

Programming, for all ages and all languages.
Post Reply
nullplan
Member
Member
Posts: 1880
Joined: Wed Aug 30, 2017 8:24 am

Algorithm showcase: The Sea of Stars fnmatch algorithm

Post by nullplan »

Thought I'd do another one of these. I have no idea if the algorithm is actually called that or if Rich Felker just made it up. The only reference to it I could find was the musl FAQ just listing it, but there is no description of it there. And searching the web is made difficult thanks to the release of a video game by the name "Sea of Stars", which apparently takes precedence.

1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.

2. The classical solution
I once saw the following in a discussion about beautiful code. Well, beauty is in the eye of the beholder.

Code: Select all

int match(const char *pattern, const char *string)
{
  switch (*pattern) {
    case 0: return !*string;
    case '?': return *string && match(pattern + 1, string + 1);
    case '*': return match(pattern + 1, string) || (*string && match(pattern, string + 1));
    default: return *string == *pattern && match(pattern + 1, string + 1);
  }
}
The code is very simple, easy to understand and to read, but unfortunately has a very big security flaw: It requires linear stack (i.e. the number of recursive calls is O(n)). In many real-world use-cases, this will mean stack overflows, global state corruption, heap corruption, and just general nastiness. Even if it does just lead to a crash, that is still an attacker-controlled denial-of-service opportunity. And while the compiler can turn three of these recursive calls into tail-calls, there is still at least one remaining where that is impossible (the first call in the '*' case). Meaning an attacker can crash the system if they can make the pattern string just be a long string of stars.

At its heart, this algorithm implements a backtracking approach to regular-expression matching, albeit with a very simple RE language: A star is matched by either ignoring it completely, or, if that fails, backtracking to the most recent star (by way of the call stack), then advancing the string.

The algorithm I'm showcasing today will keep that backtracking behavior, but massively improve the auxiliary storage needed (dropping down to O(1)).

3. The solution
Note one thing: Even if you add back in the complexity I removed earlier, every element of the pattern string but the star can match exactly one character in the string. This means that for every pattern that has no star in it, we can know how many characters a matching string needs to have. Thus we can actually successfully match a string backwards, which is normally impossible in RE matching. The algorithm therefore takes the following general structure:
  1. Match the beginning of the string against the part of the pattern before the first star. A mismatch here is a mismatch of the entire string. Reaching the end of the pattern disposes of the entire function at this point. Remove the already-matched parts of the string from consideration.
  2. Count the number of tokens in the pattern past the last star in it.
  3. Match the tail of the string against the part of the pattern past the last star. A mismatch once again is a mismatch of the entire pattern. But reaching the end is obviously no longer dispositive. Finally, remove the already-matched tail of the string from consideration.
  4. You now have a pattern that starts and ends with a star. Repeatedly attempt to match the part between the first and last star against the pattern. On mismatch, remove the first character of the string and retry. If in so doing you reach yet another star, discard the now newly-matched string part and the pattern up to the new star. Do this until you run out of string or match everything up to the last star. The former case is a matching failure, the latter is a success.
Code for this to be added later...
Carpe diem!
User avatar
eekee
Member
Member
Posts: 920
Joined: Mon May 22, 2017 5:56 am
Location: Kerbin
Discord: eekee
Contact:

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm

Post by eekee »

nullplan wrote: Wed Apr 09, 2025 11:17 am 1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.
Common names for this are "glob" in the Unix world and "wildcard" in MS-DOS and 8-bit worlds (to just list the ones I know), or more fully, "glob pattern" and "[filename|string] matching with wildcards". I love "sea of stars" though. :D

I once saw it implemented quite simply in 6502 assembly language, but only at the cost of never being able to use a star at the start of a pattern, if I read the code right. It simply ignored text after the star, treating `*foo*` as `*`. Most versions of MS-DOS have the same bug, at least with 8.3 filenames.
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
Post Reply