OSDev.org

Posted: **Wed Apr 09, 2025 11:17 am**

Thought I'd do another one of these. I have no idea if the algorithm is actually called that or if Rich Felker just made it up. The only reference to it I could find was the musl FAQ just listing it, but there is no description of it there. And searching the web is made difficult thanks to the release of a video game by the name "Sea of Stars", which apparently takes precedence.

1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.

2. The classical solution
I once saw the following in a discussion about beautiful code. Well, beauty is in the eye of the beholder.

Code: Select all

int match(const char *pattern, const char *string)
{
  switch (*pattern) {
    case 0: return !*string;
    case '?': return *string && match(pattern + 1, string + 1);
    case '*': return match(pattern + 1, string) || (*string && match(pattern, string + 1));
    default: return *string == *pattern && match(pattern + 1, string + 1);
  }
}

The code is very simple, easy to understand and to read, but unfortunately has a very big security flaw: It requires linear stack (i.e. the number of recursive calls is O(n)). In many real-world use-cases, this will mean stack overflows, global state corruption, heap corruption, and just general nastiness. Even if it does just lead to a crash, that is still an attacker-controlled denial-of-service opportunity. And while the compiler can turn three of these recursive calls into tail-calls, there is still at least one remaining where that is impossible (the first call in the '*' case). Meaning an attacker can crash the system if they can make the pattern string just be a long string of stars.

At its heart, this algorithm implements a backtracking approach to regular-expression matching, albeit with a very simple RE language: A star is matched by either ignoring it completely, or, if that fails, backtracking to the most recent star (by way of the call stack), then advancing the string.

The algorithm I'm showcasing today will keep that backtracking behavior, but massively improve the auxiliary storage needed (dropping down to O(1)).

3. The solution
Note one thing: Even if you add back in the complexity I removed earlier, every element of the pattern string but the star can match exactly one character in the string. This means that for every pattern that has no star in it, we can know how many characters a matching string needs to have. Thus we can actually successfully match a string backwards, which is normally impossible in RE matching. The algorithm therefore takes the following general structure:

Match the beginning of the string against the part of the pattern before the first star. A mismatch here is a mismatch of the entire string. Reaching the end of the pattern disposes of the entire function at this point. Remove the already-matched parts of the string from consideration.
Count the number of tokens in the pattern past the last star in it.
Match the tail of the string against the part of the pattern past the last star. A mismatch once again is a mismatch of the entire pattern. But reaching the end is obviously no longer dispositive. Finally, remove the already-matched tail of the string from consideration.
You now have a pattern that starts and ends with a star. Repeatedly attempt to match the part between the first and last star against the pattern. On mismatch, remove the first character of the string and retry. If in so doing you reach yet another star, discard the now newly-matched string part and the pattern up to the new star. Do this until you run out of string or match everything up to the last star. The former case is a matching failure, the latter is a success.

Code for this to be added later...

Posted: **Thu Apr 24, 2025 2:10 pm**

nullplan wrote: ↑Wed Apr 09, 2025 11:17 am 1. The problem
The problem is matching a UNIX shell pattern string with the special characters * and ? against a file name and figuring out if it matches or not. * matches any number of characters (even 0), ? matches any character, but there must be one, and all other characters match themselves. For simplicity I am ignoring the additional complexity incurred by bracket expressions, pathname matching, and the possibility of escaping that the actual fnmatch offers.

Common names for this are "glob" in the Unix world and "wildcard" in MS-DOS and 8-bit worlds (to just list the ones I know), or more fully, "glob pattern" and "[filename|string] matching with wildcards". I love "sea of stars" though.

I once saw it implemented quite simply in 6502 assembly language, but only at the cost of never being able to use a star at the start of a pattern, if I read the code right. It simply ignored text after the star, treating `*foo*` as `*`. Most versions of MS-DOS have the same bug, at least with 8.3 filenames.

Posted: **Fri May 09, 2025 10:57 pm**

eekee wrote: ↑Thu Apr 24, 2025 2:10 pm Common names for this are "glob" in the Unix world and "wildcard" in MS-DOS and 8-bit worlds (to just list the ones I know), or more fully, "glob pattern" and "[filename|string] matching with wildcards". I love "sea of stars" though.

Globbing is what the algorithm attempts to do (although more exactly, it is attempting to match, and globbing is what you get when you list all names that match). "Sea of stars" is how it tries to do it.

Actually implementing the algorithm as given is complicated in a modern environment by the presence of multibyte strings. Because the second step requires you to find the last n characters of a string. In ASCII mode, this is simple (since every character is exactly one byte), but how to do it with multibyte characters? If you can assume UTF-8, then you know a lead byte when you see it, but in general, you can parse a multibyte string only going forwards. So you would have to save the starts of the last n multibyte characters to do it correctly.

Posted: **Sat May 10, 2025 5:21 am**

nullplan wrote: ↑Fri May 09, 2025 10:57 pm
eekee wrote: ↑Thu Apr 24, 2025 2:10 pm Common names for this are "glob" in the Unix world and "wildcard" in MS-DOS and 8-bit worlds (to just list the ones I know), or more fully, "glob pattern" and "[filename|string] matching with wildcards". I love "sea of stars" though.
Globbing is what the algorithm attempts to do (although more exactly, it is attempting to match, and globbing is what you get when you list all names that match). "Sea of stars" is how it tries to do it.

Ah, thanks.

nullplan wrote: ↑Fri May 09, 2025 10:57 pm Actually implementing the algorithm as given is complicated in a modern environment by the presence of multibyte strings. Because the second step requires you to find the last n characters of a string. In ASCII mode, this is simple (since every character is exactly one byte), but how to do it with multibyte characters? If you can assume UTF-8, then you know a lead byte when you see it, but in general, you can parse a multibyte string only going forwards. So you would have to save the starts of the last n multibyte characters to do it correctly.

This must be one of the reasons the inventors of UTF-8 converted characters to their Rune type in many of their programs. Rune is the result of a trivial conversion of UTF-8 to a fixed 32 bits per character. They made no syntax for Rune constants, likely because Runes are meant for bulk text you might want to search, not the short strings they put into their C code.

Posted: **Sat May 10, 2025 10:30 am**

eekee wrote: ↑Sat May 10, 2025 5:21 am This must be one of the reasons the inventors of UTF-8 converted characters to their Rune type in many of their programs. Rune is the result of a trivial conversion of UTF-8 to a fixed 32 bits per character. They made no syntax for Rune constants, likely because Runes are meant for bulk text you might want to search, not the short strings they put into their C code.

Sounds like they were simply widening the MB string to a wide string. That is also a possibility, but will likely eat even more memory. The tradeoff is between ca. 4 * string length vs. 8 * tail length. And I mean, for UTF-8, the problem doesn't even exist. It is for other MB encodings where I'd have a problem, such as EUC-CN, where you can't reliably tell a lead byte apart.

Posted: **Mon May 12, 2025 11:57 am**

I'll be happy if I can get through life without ever working directly on other MB encodings.

Even UTF-16 can be a problem if you lose track of code page changes.

OSDev.org

Algorithm showcase: The Sea of Stars fnmatch algorithm

Algorithm showcase: The Sea of Stars fnmatch algorithm

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm

Re: Algorithm showcase: The Sea of Stars fnmatch algorithm