How to represent large constant sets?

nullplan
Member
Posts: 1840
Joined: Wed Aug 30, 2017 8:24 am

How to represent large constant sets?

Post by nullplan »

Hi all,

my libc effort has grown far enough to require me to implement part of Unicode for the wctype functions. The isw* functions basically just ask whether a number is in a set. So I am wondering how to represent those large but eclectic sets I have to carve out of Unicode, in a way that doesn't eat too much memory and can still be queried quickly.

My initial idea is to use run-length encoding. Since we are just enumerating a part of the natural numbers, I simply write down how many code points are not in the set, starting from 0, then how many are from that point on, and then back to how many are not. Each number is encoded as ULEB128 to allow dense packing without limiting the length. So for the alphabetic characters, the list would start with 65, 26, 6, 26, ... (because the first 65 code points starting from 0 are all not alphabetic, then come 26 that are, then come 6 that aren't, then another 26 that are, and so on).

This is simple and relatively densely packed (I'll check again how big the alphabetic list was, but I think it was 2kB or so), but to query whether a number is in the set, I can only iterate over the list, summing up entries until I exceed the queried number. It's not the worst thing in the world, but I wonder if I can do better.
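The query loop I have in mind looks roughly like this in C (a sketch, not my actual code; names are made up, and the decoder assumes well-formed input that fits in 32 bits):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one ULEB128 number from *p, advancing the cursor. */
static uint32_t uleb128_decode(const unsigned char **p)
{
    uint32_t v = 0;
    int shift = 0;
    unsigned char b;
    do {
        b = *(*p)++;
        v |= (uint32_t)(b & 0x7f) << shift;
        shift += 7;
    } while (b & 0x80);
    return v;
}

/* Runs alternate "not in set" / "in set", starting with "not in".
 * Returns nonzero iff code point c is a member. */
static int is_in_set(uint32_t c, const unsigned char *rle, size_t len)
{
    const unsigned char *p = rle, *end = rle + len;
    uint32_t pos = 0;
    int in = 0;

    while (p < end) {
        pos += uleb128_decode(&p);
        if (c < pos)
            return in;
        in = !in;
    }
    return 0; /* past the final run: not a member */
}
```

For the start of the alphabetic list, the stream would just be the bytes {65, 26, 6, 26} (each value fits in a single ULEB128 byte), covering 'A'..'Z', the gap '['..'`', and 'a'..'z'.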

The obvious alternative is a bit field. But for Unicode, that would be 136kB in size, per category. I could condense it down, since not all planes are equally densely packed, then it would only be 8kB per plane and category. Since of the 17 defined planes, only the first 4 contain alphabetic characters, that would only be 32kB for the alphabetic list. But I am worried about the list of printing characters, since that is most Unicode.
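For illustration, the per-plane variant could be queried like this (just a sketch; the 17-entry pointer table and the NULL-for-empty-plane convention are one way to do it, not settled design):

```c
#include <stdint.h>

/* One bitmap per 64K-code-point plane; a NULL entry means the
 * category has no members in that plane, so its 8kB bitmap is
 * simply not stored. */
static int bitmap_test(uint32_t c, const uint8_t *const planes[17])
{
    uint32_t plane = c >> 16;
    uint32_t idx = c & 0xFFFF;

    if (plane >= 17 || !planes[plane])
        return 0;
    return (planes[plane][idx >> 3] >> (idx & 7)) & 1;
}
```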

Another question is how to implement case folding. See, my above scheme can also be understood as defining a set in terms of ranges by providing two numbers: The offset of the range (relative to the end of the previous) and the length of the range. I can implement case folding by extending the idea: A casefold range is defined as a triplet: The offset of the range, the length of the range, and the distance between upper case and lower case characters. But again, this means that to casefold a character is to iterate over the whole thing. Ideas?
Carpe diem!
Demindiro
Member
Posts: 110
Joined: Fri Jun 11, 2021 6:02 am
Libera.chat IRC: demindiro
Location: Belgium
Contact:

Re: How to represent large constant sets?

Post by Demindiro »

An interesting technique I've seen is the use of a state machine, where each token (character, byte, or something else) is a transition: https://burntsushi.net/transducers/. I don't know how effective it'll be for your use case, though.

For run-length encoding: you may get slightly denser packing if you do delta encoding on top:
- set xdelta = 0
- for every (signed) number x, do xdelta += x, then return xdelta.
(I got that idea from https://aras-p.info/blog/2016/09/01/SPIR-V-Compression/)
However, given you're working with lengths only (which in a way already is delta encoding...) it'll only be effective if many of the lengths are close to each other. Easy to try, though.
To reduce the penalty of linear iteration you can use a table with (id, offset) entries to skip forward quickly.
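Roughly, in C (a sketch; in a real table the deltas would be SLEB128-encoded, since they can be negative, rather than stored as plain int32_t):

```c
#include <stddef.h>
#include <stdint.h>

/* Reconstruct run lengths from their deltas by keeping a
 * running sum. */
static void delta_decode(const int32_t *deltas, uint32_t *lengths, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += deltas[i];
        lengths[i] = (uint32_t)acc;
    }
}
```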
My OS is Norost B (website, Github, sourcehut)
My filesystem is NRFS (Github, sourcehut)
^ defunct
Demindiro
Member
Posts: 110
Joined: Fri Jun 11, 2021 6:02 am
Libera.chat IRC: demindiro
Location: Belgium
Contact:

Re: How to represent large constant sets?

Post by Demindiro »

I actually started to wonder about the state-machine approach myself, since a paper I'm reading happens to mention automata, and I also wanted to test whether I actually understand how it works.

I (somewhat clumsily) hacked together a prototype based on each token being a single bit and applied some optimizations, like exploiting the fact that all entries have the same depth (= bit length). In any case, I've managed to make something that seems to work. Results:

Code:

char::is_lowercase: 1 bits -> 544 nodes ([1, 1, 1, 1, 1, 2, 4, 6, 8, 10, 16, 19, 24, 36, 56, 77, 97, 89, 76, 15, 3, 1])
char::is_uppercase: 1 bits -> 475 nodes ([1, 1, 1, 1, 1, 2, 4, 6, 8, 11, 14, 16, 20, 29, 46, 68, 78, 79, 70, 15, 3, 1])
char::is_alphanumeric: 1 bits -> 1585 nodes ([1, 1, 1, 1, 2, 4, 7, 12, 20, 29, 43, 60, 89, 133, 210, 298, 307, 235, 113, 15, 3, 1])
char::is_alphabetic: 1 bits -> 1491 nodes ([1, 1, 1, 1, 2, 4, 7, 12, 20, 28, 41, 57, 83, 126, 196, 278, 274, 229, 111, 15, 3, 1])
char::is_whitespace: 1 bits -> 69 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 3, 1])
|c| c.is_ascii(): 1 bits -> 22 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
|c| c.is_ascii_punctuation(): 1 bits -> 41 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 5, 6, 6, 3, 1])
|c| c.is_ascii_hexdigit(): 1 bits -> 32 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 1])
|c| c.is_digit(10): 1 bits -> 24 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1])
wrap(iswalnum): 1 bits -> 58 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 3, 1])
wrap(iswblank): 1 bits -> 54 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1])
wrap(iswgraph): 1 bits -> 55 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1])
wrap(iswprint): 1 bits -> 50 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1])
wrap(iswspace): 1 bits -> 55 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 3, 1])
wrap(iswxdigit): 1 bits -> 56 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 3, 1])
wrap(iswalpha): 1 bits -> 53 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1])
wrap(iswcntrl): 1 bits -> 50 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1])
wrap(iswdigit): 1 bits -> 48 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1])
wrap(iswlower): 1 bits -> 53 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1])
wrap(iswpunct): 1 bits -> 65 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 3, 4, 6, 6, 6, 3, 1])
wrap(iswupper): 1 bits -> 53 nodes ([1, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 1])
I have tested the state machines on all characters immediately afterwards, so the results should be correct.
What I did is:
- Iterate over all valid Unicode characters (char::MIN..=char::MAX) and generate a trie with them
- Deduplicate nodes from leaf to root, layer by layer.
- Done! Test it and print out bits per token, the total number of nodes, and the nodes per layer.
Some of the layers have more than 2^8 nodes, which is annoying, but I suppose some clever tricks (delta encoding?) could avoid spending whole 16-bit integers on them.
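For what it's worth, the lookup itself is just a table walk. A toy C version with depth 2 and hand-written tables (the real thing would have 21 layers and generated tables):

```c
#include <stdint.h>

enum { DEPTH = 2 };

/* trans[layer][node][bit] = next node index; the entry reached
 * after the last layer is the 0/1 answer. */
static const uint8_t trans[DEPTH][2][2] = {
    { {0, 1}, {0, 0} }, /* layer 0: bit 1 leads to the "live" node */
    { {0, 0}, {1, 1} }, /* layer 1: node 1 accepts everything */
};

/* Feed the value MSB-first, one bit per layer. This toy machine
 * accepts exactly the 2-bit values with the top bit set: {2, 3}. */
static int dfa_contains(uint32_t v)
{
    uint8_t node = 0;
    for (int layer = 0; layer < DEPTH; layer++) {
        int bit = (v >> (DEPTH - 1 - layer)) & 1;
        node = trans[layer][node][bit];
    }
    return node;
}
```

Deduplication only shrinks the tables; the walk itself never changes.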
It seems suspect to me that iswalnum and char::is_alphanumeric have such different results, but they both pass so *shrug*. Maybe some env variable I need to set? I tried messing with LC_ALL, LC_CTYPE etc but no results.

Actually, it bothers me a lot, so I'm going to feed the functions with the whole 32-bit integer range, but that'll take a while.

I originally was going to do multiple bits too, but something is broken in my very hacky implementation, so I couldn't test that :oops:
eekee
Member
Posts: 907
Joined: Mon May 22, 2017 5:56 am
Location: Kerbin
Discord: eekee
Contact:

Re: How to represent large constant sets?

Post by eekee »

nullplan wrote: Sun Mar 02, 2025 3:34 pm The obvious alternative is a bit field. But for Unicode, that would be 136kB in size, per category. I could condense it down, since not all planes are equally densely packed, then it would only be 8kB per plane and category. Since of the 17 defined planes, only the first 4 contain alphabetic characters, that would only be 32kB for the alphabetic list. But I am worried about the list of printing characters, since that is most Unicode.
You could use a different algorithm for printing characters, perhaps a per-plane RLE test.
Kaph — a modular OS intended to be easy and fun to administer and code for.
"May wisdom, fun, and the greater good shine forth in all your work." — Leo Brodie
nullplan
Member
Posts: 1840
Joined: Wed Aug 30, 2017 8:24 am

Re: How to represent large constant sets?

Post by nullplan »

Demindiro wrote: Mon Mar 03, 2025 2:22 am It seems suspect to me that iswalnum and char::is_alphanumeric have such different results, but they both pass so *shrug*. Maybe some env variable I need to set? I tried messing with LC_ALL, LC_CTYPE etc but no results.
What is your source of truth for these sets? I am comparing standards, implementations, and getting stuff directly out of the unicode.org data files. But even if you just query your host libc, that wouldn't explain why your iswblank() needs that much space. That set is 2 characters in musl, and something like 10 in glibc.

I have now settled on a Unicode<->POSIX mapping I am going to use:
iswalnum(x) = iswalpha(x) || iswdigit(x)
iswalpha(x) = __is_in_set(x, alpha)
alpha = Unicode "Alphabetic" (DerivedCoreProperties.txt) + Category "Nd" from UnicodeData.txt - ASCII digits
iswdigit(x) = ASCII digits (POSIX requirement)
iswxdigit(x) = ASCII hex digits (another POSIX requirement)

This means that iswalnum(x) returns true for all characters in Nd. POSIX requiring iswdigit() to match only the ASCII digits is a bit nasty, but understandable: otherwise I'd need multiple digit<->value mappings for strtoul() and similar.

iswprint(x) = __is_in_set(x, printable)
printable = All characters in UnicodeData.txt, except categories "Cc","Cs","Zl", and "Zp"
iswspace(x) = Unicode property "White_Space"
iswcntrl(x) = Categories "Cc","Zl", and "Zp" (i.e. the C0 range, the C1 range, U+2028, and U+2029)
iswgraph(x) = iswprint(x) && !iswspace(x)

iswblank(x) = x == '\t' || x == ' '
This is the standard set, and also what musl uses. The standard says these are supposed to be the white-space characters that break lines into words. glibc has more characters in there, but I don't know how the cultures these characters come from use them. It accepts U+1680 OGHAM SPACE MARK, but I don't know Ogham. I don't even know what that is, and I especially don't know whether they even write in lines.

iswlower(x) = towupper(x) != x (I think most people expect this)
iswupper(x) = towlower(x) != x
iswpunct(x) = iswgraph(x) && !iswalnum(x)

And yes, this means that category "Cs" is excluded from everything. The surrogates are not Unicode, as far as I'm concerned.

OK, so for the sets I am now going with the original idea (RLE sets with ULEB128). This encoding for the alphabetic characters clocks in at 1676 bytes. Run time I will have to check. On the other hand, I doubt too many applications call iswalpha() in their critical path, right? Although they might... Oh well, cross that bridge when I come to it.

Printable characters in this encoding only make up 1564 bytes, because there are huge ranges at the end that can be defined in just a few bytes, but there are a lot of missing code points earlier that increase the size. Oh well, price of correctness. (musl's iswprint(0x378) will return 1, for example, and mine won't).

Now I am stuck designing the case folding functions. I have dumped the common and simple case folding from CaseFolding.txt, and looking through it I see mostly these types of case folding blocks:
  • Classic blocks. Upper case characters and lower case characters are contiguous each and just a constant away from each other.
  • Striped blocks: Lower case characters directly follow their upper case counterparts.
  • Special cases: Single translation pairs that have nothing to do with any pair before or after them, completely out of left field.
So each block can be defined in terms of four parameters: Start, length, stride, and offset. In striped blocks, offset is 1 and stride is 2, in the others stride is 1, and in the special cases, length is also 1.

I will encode the start relative to the end of the previous block, as per usual. For the length, I shall take the distance from the highest to the lowest upper case code point plus 1, and then comes the offset. The stride is implicit in the offset: Stride is 2 if offset is 1, else stride is 1.

That should encode all three cases with reasonable efficiency. We'll see.
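As a sketch of the decode (plain structs standing in for the ULEB128 stream, names made up, and untested against the real data):

```c
#include <stddef.h>
#include <stdint.h>

/* One casefold block. start_delta is relative to the end of the
 * previous block; len spans the lowest to the highest upper-case
 * code point plus 1; the stride is implied: 2 if off == 1, else 1. */
struct fold_block {
    uint32_t start_delta;
    uint32_t len;
    int32_t off;
};

static uint32_t fold(uint32_t c, const struct fold_block *b, size_t n)
{
    uint32_t start = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t stride = (b[i].off == 1) ? 2 : 1;
        start += b[i].start_delta;
        if (c >= start && c < start + b[i].len
            && (c - start) % stride == 0)
            return (uint32_t)(c + b[i].off);
        start += b[i].len;
    }
    return c; /* no mapping: folds to itself */
}
```

A classic block like A-Z is {start_delta, 26, 32}; a striped block like Latin Extended-A is {start_delta, 0x37, 1}, where the stride of 2 skips the interleaved lower-case characters; a special case is just a block of length 1.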
Demindiro
Member
Posts: 110
Joined: Fri Jun 11, 2021 6:02 am
Libera.chat IRC: demindiro
Location: Belgium
Contact:

Re: How to represent large constant sets?

Post by Demindiro »

nullplan wrote: Sun Mar 09, 2025 3:11 am What is your source of truth for these sets? I am comparing standards, implementations, and getting stuff directly out of the unicode.org data files. But even if you just query your host libc, that wouldn't explain why your iswblank() needs that much space. That set is 2 characters in musl, and something like 10 in glibc.
I'm just using whatever functions are provided by my (g)libc. Since it was easy to test all the functions, I did just that to get an idea of the overall efficiency.

The state machine approach mainly shines if you have many, many entries with no immediate obvious pattern. For functions like iswblank and |c| c.is_ascii() it makes no sense to use a state machine since it's trivial to implement without, but it should work well for the other cases like char::is_alphanumeric (AKA iswalnum, or so I thought until testing it..)

And there was indeed a bug in my testing of the isw* functions (testing == 0 instead of != 0), but even after fixing it the node count remains very low, so I'm not sure what is up with that. Note that if the number of nodes per layer stays below 256, you need at most two bytes per node, so iswalnum would take only 116 bytes. char::is_alphanumeric is trickier, but can be handled reasonably efficiently if you use a bit array to extend to 9-bit integers without wasting too much space, hence 1585 * 9/8 * 2 -> about 3568 bytes, without taking into account other optimizations such as the fact that the first layers each have only a single node.

And of course, state machines have O(k) time complexity, where k is the length of your input. Since the inputs all have a 21-bit length it's O(1).

Also, here's the layer counts for the isw* functions with the fixed version:

Code:

wrap(iswalnum): 1 bits -> 34 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 3, 1])
wrap(iswblank): 1 bits -> 27 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1])
wrap(iswgraph): 1 bits -> 33 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 1])
wrap(iswprint): 1 bits -> 28 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1])
wrap(iswspace): 1 bits -> 29 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 1])
wrap(iswxdigit): 1 bits -> 32 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 3, 1])
wrap(iswalpha): 1 bits -> 29 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 1])
wrap(iswcntrl): 1 bits -> 28 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1])
wrap(iswdigit): 1 bits -> 24 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1])
wrap(iswlower): 1 bits -> 29 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 1])
wrap(iswpunct): 1 bits -> 41 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 5, 6, 6, 3, 1])
wrap(iswupper): 1 bits -> 29 nodes ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 1])
Quite a few fewer nodes, at least, but cases like iswalnum remain mysterious.
nullplan
Member
Posts: 1840
Joined: Wed Aug 30, 2017 8:24 am

Re: How to represent large constant sets?

Post by nullplan »

Demindiro wrote: Sun Mar 09, 2025 4:25 am I'm just using whatever functions are provided by my (g)libc. Since it was easy to test all functions I did just that to get an idea the overall efficiency.
Well, glibc has some omissions in its list of alphabetic characters (for those curious, the list is in localedata/locales/i18n_ctype in the glibc source) compared to the Unicode "Alphabetic" property. Not sure what is up with that, and I wasn't about to find out. The source code lists an entirely unrelated ISO standard as their source, which I was not going to read. I have enough on my plate already.

But at least I understand your approach better now. And it does explain a lot about why your iswblank() needs so many nodes. glibc's set of blank characters, translated to binary is:

Code:

0009_16 = 0000000000001001_2
0020_16 = 0000000000100000_2
1680_16 = 0001011010000000_2
2000_16 = 0010000000000000_2
2001_16 = 0010000000000001_2
2002_16 = 0010000000000010_2
2003_16 = 0010000000000011_2
2004_16 = 0010000000000100_2
2005_16 = 0010000000000101_2
2006_16 = 0010000000000110_2
2008_16 = 0010000000001000_2
2009_16 = 0010000000001001_2
200A_16 = 0010000000001010_2
205F_16 = 0010000001011111_2
3000_16 = 0011000000000000_2
That is quite a few different bits. Especially the Ogham Space Mark will contribute to that count, I suppose. That has a lot of bits different from all the others.

Anyway, I am done. The results can be seen here: https://github.com/nullplan/sirius/tree/main/src/wchar

And now I hope I won't have to look at Unicode again for a while.