In the FAT file system, file names are not case sensitive, although the directory set entries for FAT32 LFN and exFAT, which hold Unicode (16-bit wide character) names,
are stored in the original case and must be converted to upper case for comparison. The easiest way to do this is with a Unicode upcase lookup table.
Every exFAT formatted image comes complete with just such a table. It's a 6K compressed (128K when decompressed) unnamed hidden file in the root
directory, identified by the special directory set entry type 0x82. Even though the table is easy enough to decompress, you only need the first half (3K), since
the last lower case character (in code point order) that gets converted sits near the end of that half, and even though this half could be compressed, it's not.
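For anyone who wants the full table anyway, here is a minimal sketch of the decompression step. It assumes the identity-run scheme as I understand it from the exFAT specification: a 0xFFFF entry is followed by a count of characters that map to themselves, and every other entry is the upper case value for the next character in sequence. The names `upcase_decompress` and `UPCASE_FULL_ENTRIES` are made up for this example, and the input is assumed to be already converted from on-disk little-endian to host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UPCASE_FULL_ENTRIES 65536u  /* one entry per 16-bit character */

/* Expand a compressed up-case table (src_len 16-bit entries) into a full
   65536-entry table. Returns 0 on success, -1 if a run overflows the table. */
int upcase_decompress(const uint16_t *src, size_t src_len, uint16_t *dst)
{
    size_t in = 0;
    uint32_t ch = 0;  /* character currently being filled in */

    assert(src != NULL && dst != NULL);

    while (in < src_len && ch < UPCASE_FULL_ENTRIES) {
        uint16_t v = src[in++];
        if (v == 0xFFFF && in < src_len) {
            /* Identity run: the next entry says how many characters
               map to themselves. */
            uint32_t run = src[in++];
            if (run > UPCASE_FULL_ENTRIES - ch)
                return -1;
            while (run--) {
                dst[ch] = (uint16_t)ch;
                ch++;
            }
        } else {
            /* Literal entry: upper case value for character ch. */
            dst[ch++] = v;
        }
    }

    /* Anything the compressed data did not cover maps to itself. */
    while (ch < UPCASE_FULL_ENTRIES) {
        dst[ch] = (uint16_t)ch;
        ch++;
    }
    return 0;
}
```

The output buffer needs room for 65536 entries (128K), which is the decompressed size mentioned above.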
Simply check the Unicode character and, if it is less than or equal to 0x586 (Armenian small letter 'FEH'), use the table to convert it to upper case. Otherwise the
character is either already upcased or (as is the case with all the characters in the second half of the complete table) has no upper case equivalent (for 16-bit
Unicode wide characters).
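A rough sketch of that lookup and of a case-insensitive name compare built on it. The function names are hypothetical, and `demo_table_init` only fills in a stand-in table (identity plus ASCII a-z) so the example is self-contained; a real driver would load the entries from the volume instead. The bound is passed in as the number of table entries, here 0x587 to cover characters up to 0x586.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entries kept from the start of the table: characters 0x0000..0x0586. */
#define UPCASE_HALF_ENTRIES 0x587u

/* Stand-in table for illustration only: identity mapping plus ASCII a-z. */
void demo_table_init(uint16_t *table, size_t entries)
{
    assert(table != NULL);
    for (size_t i = 0; i < entries; i++) {
        if (i >= 0x61 && i <= 0x7A)          /* 'a'..'z' */
            table[i] = (uint16_t)(i - 0x20); /* 'A'..'Z' */
        else
            table[i] = (uint16_t)i;
    }
}

/* Upcase one 16-bit character; anything past the table is returned as-is. */
uint16_t upcase_char(const uint16_t *table, size_t entries, uint16_t ch)
{
    return (ch < entries) ? table[ch] : ch;
}

/* Compare two 16-bit names without regard to case. Returns 1 if equal. */
int names_equal(const uint16_t *table, size_t entries,
                const uint16_t *a, size_t alen,
                const uint16_t *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++) {
        if (upcase_char(table, entries, a[i]) !=
            upcase_char(table, entries, b[i]))
            return 0;
    }
    return 1;
}
```

Because the mapping is strictly one character to one character, the names can be compared in place without building upcased copies.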
If you are not working with exFAT simply grab a copy of the table.
Simple Unicode toupper
Re: Simple Unicode toupper
This is interesting. My OS has a case-insensitive (supposed to be case-preserving, but the current iteration is deficient) filesystem as I believe this to be more "user-friendly". I'm looking to do a major rewrite/re-structure of the filesystem layer in the near future and having a decent solution to Unicode case-insensitivity would be nice. This seems like it might be a thing to look at (I'll be adding exFAT support soon too).
By storing this table on-disk, I assume that it's locale-specific (I know a similar table used in NTFS is). While lowercase to uppercase is quite simple in English, it's not so easy in other languages/locales. Some examples:
- In Germany, the "ß" is usually replaced with "SS" when capitalised, while in some parts of Austria "SZ" is more common, but a capital "ẞ" has recently been standardised... (Note that this case results in a single lowercase character becoming two characters in uppercase). Similarly, "ü" can sometimes be rendered "UE".
- In Turkish, capital 'i' is 'İ', not 'I' ("I" has its own Turkish lowercase, 'ı').
- In French, it's common to drop accents when uppercasing a word; (e.g. "résumé" becomes "RESUME") but in other languages that use accents this is not true.
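To make the Turkish case concrete, here is one way a per-locale layer could sit on top of a generic one-to-one table. This is only a sketch: `struct upcase_override`, `turkish_overrides` and `upcase_with_locale` are names invented for the example, and the base table is assumed to be a plain array indexed by character. Note that a one-to-one scheme like this still cannot express "ß" becoming "SS", since that turns one character into two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One locale-specific exception to the generic upcase table. */
struct upcase_override {
    uint16_t from;
    uint16_t to;
};

/* Turkish: 'i' upcases to U+0130 (dotted capital I),
   and U+0131 (dotless small i) upcases to 'I'. */
const struct upcase_override turkish_overrides[] = {
    { 0x0069, 0x0130 },
    { 0x0131, 0x0049 },
};

/* Check the locale overrides first, then fall back to the generic table.
   Characters beyond the table are returned unchanged. */
uint16_t upcase_with_locale(const uint16_t *base, size_t base_len,
                            const struct upcase_override *ov, size_t ov_len,
                            uint16_t ch)
{
    assert(base != NULL || base_len == 0);
    for (size_t i = 0; i < ov_len; i++) {
        if (ov[i].from == ch)
            return ov[i].to;
    }
    return (ch < base_len) ? base[ch] : ch;
}
```

With an empty override list this behaves exactly like the plain table lookup, so the locale layer is optional.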
Re: Simple Unicode toupper
You could look at how HFS+ does it (normalising in a single way for fs storage).
Learn to read.
Re: Simple Unicode toupper
mallard wrote: By storing this table on-disk, I assume that it's locale-specific (I know a similar table used in NTFS is).

Actually, the table only maps each of the 65536 16-bit wide Unicode characters to its upper case equivalent or, if there is no upper case, to itself.
mallard wrote: While lowercase to uppercase is quite simple in English, it's not so easy in other languages/locales. Some examples:
- In Germany, the "ß" is usually replaced with "SS" when capitalised, while in some parts of Austria "SZ" is more common, but a capital "ẞ" has recently been standardised... (Note that this case results in a single lowercase character becoming two characters in uppercase). Similarly, "ü" can sometimes be rendered "UE".
- In Turkish, capital 'i' is 'İ', not 'I' ("I" has its own Turkish lowercase, 'ı').
- In French, it's common to drop accents when uppercasing a word; (e.g. "résumé" becomes "RESUME") but in other languages that use accents this is not true.
Note that before even thinking about converting Unicode to a different case, it must be normalised (so that things like "Å" and "Å" are actually the same).

Primarily the table is used by the exFAT driver to upcase the directory set entry file name.
mallard wrote: I'm thinking about having a solution that involves having a "system locale" with a "system" capitalisation table and allowing the user to change this. The FS layer would take care of ensuring the filenames remain unique (e.g. by appending numbers to conflicting names; whether that's because they differed only in case originally or because they conflict when capitalised in the current locale).

That's a very good point when the table is used for a localized conversion. You would, however, have to generate
multiple lookup names and search for each. The problem is: say you find a file name in Germany with "SS" in it,
how do you know whether it is an upcased "ß" or just two upper case "S"s? Probably more trouble than it's worth. You have
to draw the line somewhere. For example, in FAT32 LFN and exFAT any character (except a handful of punctuation marks)
can be used in the file name, including any number of special characters which are not part of anyone's locale.
I just use the table as the standard conversion.
mallard wrote: Extracting tables from exFAT sounds like it might be worth trying, although I'd worry about the copyright status...

Even though the table is not officially from http://unicode.org, all Unicode tables and algorithms are free to use
and the upcase table could be/was generated from these free tables and algorithms.
Re: Simple Unicode toupper
Hi,
mikegonta wrote:Every exFAT formatted image comes complete with just such a table. It's a 6K compressed (128K when decompressed) unnamed hidden file in the root
directory identified by the special directory set entry type 0x82. You only need the first half (3K) (even though the table is easy enough to decompress) since
it contains the last in order unicode lower case converted character near the end and even though this half could be compressed it's not.
mikegonta wrote: If you are not working with exFAT simply grab a copy of the table.

I allow for Murphy's law. More specifically, if there's supposed to be an upper case conversion table on disk then I would:
- Assume that it's wrong and/or corrupt
- Write code that uses "known good" table/s instead
- Use the "known good" table/s to check if the table on disk is/isn't correct
- Use the "known good" table/s to auto-correct the table on disk if necessary
- Wish there wasn't a table on disk
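As a rough sketch of the check and auto-correct steps in that list, assuming both tables are already in memory as arrays of 16-bit entries of the same length (`upcase_table_repair` is a hypothetical name, and where the "known good" table comes from is left open):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the table read from disk against a known good table and
   overwrite any entry that differs. Returns the number of entries fixed,
   so 0 means the on-disk table was already correct. */
size_t upcase_table_repair(uint16_t *disk, const uint16_t *good,
                           size_t entries)
{
    size_t fixed = 0;

    assert(disk != NULL && good != NULL);

    for (size_t i = 0; i < entries; i++) {
        if (disk[i] != good[i]) {
            disk[i] = good[i];
            fixed++;
        }
    }
    return fixed;
}
```

A non-zero return value tells the caller that the corrected table still has to be written back to the volume.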
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.