How to represent large constant sets?
Posted: Sun Mar 02, 2025 3:34 pm
Hi all,
my libc effort has grown far enough to require me to implement part of Unicode for the wctype functions. The isw* functions are basically just asking if a number is in a set. And I am wondering how to represent those large but eclectic sets I have to carve out of Unicode. How to do it in a way that it doesn't eat overly much memory and can still be queried quickly.
My initial idea is to use run-length encoding. Since we just enumerate a part of the natural numbers, I simply write down how many codepoints are not in the set, starting from 0, then how many are from that point on, and then back to how many are not. Each number is encoded as ULEB128 to allow dense packing without limiting the length. So for the alphabetic characters, the list would start with 65, 26, 5, 26, ... (because the first 65 code points starting from 0 are all not alphabetic, then come 26 that are, then come 5 that aren't, then another 26 that are, and so on).
This is simple and relatively densely packed (I'll check again how big the alphabetic list was, but I think it was 2kB or so), but to query whether a number is in the set, I can only iterate over the list, summing up entries until I exceed the queried number. It's not the worst thing in the world, but I wonder if I can do better.
The obvious alternative is a bit field. But for Unicode, that would be 136kB in size, per category. I could condense it down, since not all planes are equally densely packed, then it would only be 8kB per plane and category. Since of the 17 defined planes, only the first 4 contain alphabetic characters, that would only be 32kB for the alphabetic list. But I am worried about the list of printing characters, since that is most Unicode.
Another question is how to implement case folding. See, my above scheme can also be understood as defining a set in terms of ranges by providing two numbers: The offset of the range (relative to the end of the previous) and the length of the range. I can implement case folding by extending the idea: A casefold range is defined as a triplet: The offset of the range, the length of the range, and the distance between upper case and lower case characters. But again, this means that to casefold a character is to iterate over the whole thing. Ideas?
my libc effort has grown far enough to require me to implement part of Unicode for the wctype functions. The isw* functions are basically just asking if a number is in a set. And I am wondering how to represent those large but eclectic sets I have to carve out of Unicode. How to do it in a way that it doesn't eat overly much memory and can still be queried quickly.
My initial idea is to use run-length encoding. Since we just enumerate a part of the natural numbers, I simply write down how many codepoints are not in the set, starting from 0, then how many are from that point on, and then back to how many are not. Each number is encoded as ULEB128 to allow dense packing without limiting the length. So for the alphabetic characters, the list would start with 65, 26, 5, 26, ... (because the first 65 code points starting from 0 are all not alphabetic, then come 26 that are, then come 5 that aren't, then another 26 that are, and so on).
This is simple and relatively densely packed (I'll check again how big the alphabetic list was, but I think it was 2kB or so), but to query whether a number is in the set, I can only iterate over the list, summing up entries until I exceed the queried number. It's not the worst thing in the world, but I wonder if I can do better.
The obvious alternative is a bit field. But for Unicode, that would be 136kB in size, per category. I could condense it down, since not all planes are equally densely packed, then it would only be 8kB per plane and category. Since of the 17 defined planes, only the first 4 contain alphabetic characters, that would only be 32kB for the alphabetic list. But I am worried about the list of printing characters, since that is most Unicode.
Another question is how to implement case folding. See, my above scheme can also be understood as defining a set in terms of ranges by providing two numbers: The offset of the range (relative to the end of the previous) and the length of the range. I can implement case folding by extending the idea: A casefold range is defined as a triplet: The offset of the range, the length of the range, and the distance between upper case and lower case characters. But again, this means that to casefold a character is to iterate over the whole thing. Ideas?