Simple Unicode toupper

mikegonta
Member
Posts: 229
Joined: Thu May 19, 2011 5:13 am

Simple Unicode toupper

Post by mikegonta »

In the FAT file system, file names are not case sensitive. However, the directory entry sets for FAT32 LFN and exFAT store names as Unicode (16-bit wide
characters) in their original case, so they must be converted to upper case for comparison. The easiest way to do this is with a Unicode upcase lookup table.

Every exFAT-formatted image comes complete with just such a table. It's a 6K compressed (128K when decompressed) unnamed hidden file in the root
directory, identified by the special directory entry type 0x82. Although the table is easy enough to decompress, you only need the first half (3K): the last
lower case character (in code point order) that gets converted sits near the end of that half, and although this half could be compressed, it's not.
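
For anyone who does want the full table, here is a minimal decompression sketch in C. It assumes the compressed format is the usual one for exFAT up-case tables: a word of 0xFFFF followed by a count N means "the next N characters map to themselves", and any other word is the mapping for the current character. The function name `upcase_decompress` and the macro `UPCASE_ENTRIES` are made up for illustration, and the input is assumed to be already converted from on-disk little-endian to host byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define UPCASE_ENTRIES 65536u   /* one entry per 16-bit character */

/* Expand a compressed up-case table into a full 65536-entry table.
 * Assumed format: 0xFFFF followed by a count N is a run of N identity
 * mappings; any other word is the upper-case value of the current
 * character. Returns 0 on success, -1 if the input is malformed. */
int upcase_decompress(const uint16_t *in, size_t in_words,
                      uint16_t out[UPCASE_ENTRIES])
{
    size_t i = 0;      /* position in the compressed input */
    uint32_t ch = 0;   /* character whose mapping comes next */

    while (i < in_words && ch < UPCASE_ENTRIES) {
        uint16_t w = in[i++];
        if (w == 0xFFFF) {
            if (i >= in_words)
                return -1;                  /* run marker without a count */
            uint32_t run = in[i++];
            if (run > UPCASE_ENTRIES - ch)
                return -1;                  /* run overflows the table */
            while (run--) {
                out[ch] = (uint16_t)ch;
                ch++;
            }
        } else {
            out[ch++] = w;
        }
    }

    /* Characters not covered by the input map to themselves. */
    while (ch < UPCASE_ENTRIES) {
        out[ch] = (uint16_t)ch;
        ch++;
    }
    return 0;
}
```

The output buffer is 128K, so in a kernel it is probably better allocated statically or from the heap than placed on the stack.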

Simply check the Unicode character and, if it is less than or equal to 0x586 (Armenian small letter 'FEH'), use the table to convert it to upper case. Otherwise
the character is either already upcased or (as is the case with all the characters in the second half of the complete table) has no upper case equivalent (for
16-bit Unicode wide characters).
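
The check described above could look roughly like this in C. It is only a sketch: `upcase_char` and `UPCASE_LAST` are illustrative names, and `table` is assumed to be the uncompressed first part of the table, already loaded into memory in host byte order with at least 0x587 entries.

```c
#include <stdint.h>

/* Last character looked up in the table: Armenian small letter FEH,
 * the cut-off given in the post. */
#define UPCASE_LAST 0x0586u

/* Upcase one 16-bit character. `table` is assumed to hold at least
 * UPCASE_LAST + 1 entries; anything above the cut-off is returned
 * unchanged. */
uint16_t upcase_char(const uint16_t *table, uint16_t ch)
{
    return (ch <= UPCASE_LAST) ? table[ch] : ch;
}
```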

If you are not working with exFAT, simply grab a copy of the table.
Mike Gonta
look and see - many look but few see

https://mikegonta.com
mallard
Member
Posts: 280
Joined: Tue May 13, 2014 3:02 am
Location: Private, UK

Re: Simple Unicode toupper

Post by mallard »

This is interesting. My OS has a case-insensitive (supposed to be case-preserving, but the current iteration is deficient) filesystem as I believe this to be more "user-friendly". I'm looking to do a major rewrite/re-structure of the filesystem layer in the near future and having a decent solution to Unicode case-insensitivity would be nice. This seems like it might be a thing to look at (I'll be adding exFAT support soon too).

By storing this table on-disk, I assume that it's locale-specific (I know a similar table used in NTFS is). While lowercase to uppercase is quite simple in English, it's not so easy in other languages/locales. Some examples:
  1. In Germany, the "ß" is usually replaced with "SS" when capitalised, while in some parts of Austria "SZ" is more common, but a capital "ẞ" has recently been standardised... (Note that this case results in a single lowercase character becoming two characters in uppercase). Similarly, "ü" can sometimes be rendered "UE".
  2. In Turkish, capital 'i' is 'İ', not 'I' ("I" has its own Turkish lowercase, 'ı').
  3. In French, it's common to drop accents when uppercasing a word; (e.g. "résumé" becomes "RESUME") but in other languages that use accents this is not true.
Note that before even thinking about converting Unicode to a different case, it must be normalised (so that things like "Å" and "Å" are actually the same). I'm thinking about having a solution that involves having a "system locale" with a "system" capitalisation table and allowing the user to change this. The FS layer would take care of ensuring the filenames remain unique (e.g. by appending numbers to conflicting names; whether that's because they differed only in case originally or because they conflict when capitalised in the current locale). Extracting tables from exFAT sounds like it might be worth trying, although I'd worry about the copyright status...
dozniak
Member
Posts: 723
Joined: Thu Jul 12, 2012 7:29 am
Location: Tallinn, Estonia

Re: Simple Unicode toupper

Post by dozniak »

You could look at how HFS+ does it (normalising in a single way for fs storage).
Learn to read.
mikegonta
Member
Posts: 229
Joined: Thu May 19, 2011 5:13 am

Re: Simple Unicode toupper

Post by mikegonta »

mallard wrote:By storing this table on-disk, I assume that it's locale-specific (I know a similar table used in NTFS is).
Actually, the table only maps each of the 65536 16-bit Unicode wide characters to its upper case equivalent or, if there
is none, to itself.
mallard wrote: While lowercase to uppercase is quite simple in English, it's not so easy in other languages/locales. Some examples:
  1. In Germany, the "ß" is usually replaced with "SS" when capitalised, while in some parts of Austria "SZ" is more common, but a capital "ẞ" has recently been standardised... (Note that this case results in a single lowercase character becoming two characters in uppercase). Similarly, "ü" can sometimes be rendered "UE".
  2. In Turkish, capital 'i' is 'İ', not 'I' ("I" has its own Turkish lowercase, 'ı').
  3. In French, it's common to drop accents when uppercasing a word; (e.g. "résumé" becomes "RESUME") but in other languages that use accents this is not true.
Note that before even thinking about converting Unicode to a different case, it must be normalised (so that things like "Å" and "Å" are actually the same).
Primarily the table is used by the exFAT driver to upcase the directory set entry file name.
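
As a rough illustration of that use, a case-insensitive name comparison in C might look like the sketch below. The function `names_equal_nocase` and its parameters are hypothetical, not taken from any real driver; names are assumed to be arrays of 16-bit characters with explicit lengths, and characters beyond `table_len` are left unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two 16-bit file names for equality after upcasing each
 * character through `table` (which has `table_len` entries).
 * Characters outside the table are compared as-is. */
bool names_equal_nocase(const uint16_t *table, size_t table_len,
                        const uint16_t *a, size_t a_len,
                        const uint16_t *b, size_t b_len)
{
    if (a_len != b_len)
        return false;   /* the table maps one character to one character */

    for (size_t i = 0; i < a_len; i++) {
        uint16_t ca = (a[i] < table_len) ? table[a[i]] : a[i];
        uint16_t cb = (b[i] < table_len) ? table[b[i]] : b[i];
        if (ca != cb)
            return false;
    }
    return true;
}
```

Because the table is strictly one-to-one, a length mismatch can be rejected immediately; that shortcut would not hold for a locale-aware conversion such as "ß" to "SS".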
mallard wrote:I'm thinking about having a solution that involves having a "system locale" with a "system" capitalisation table and allowing the user to change this. The FS layer would take care of ensuring the filenames remain unique (e.g. by appending numbers to conflicting names; whether that's because they differed only in case originally or because they conflict when capitalised in the current locale).
That's a very good point when the table is used for a localized conversion. You would, however, have to generate
multiple lookup names and search for each. The problem is, say you find a file name in Germany with "SS":
how do you know whether it is an upcased "ß" or just two upper case "S"es? Probably more trouble than it's worth. You have
to draw the line somewhere; for example, in FAT32 LFN and exFAT any character (except a handful of punctuation marks)
can be used in the file name, including any number of special characters which are not part of anyone's locale.
I just use the table as the standard conversion.
mallard wrote:Extracting tables from exFAT sounds like it might be worth trying, although I'd worry about the copyright status...
Even though the table is not officially from http://unicode.org, all Unicode tables and algorithms are free to use,
and the upcase table could be/was generated from these free tables and algorithms.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Simple Unicode toupper

Post by Brendan »

Hi,
mikegonta wrote:Every exFAT formatted image comes complete with just such a table. It's a 6K compressed (128K when decompressed) unnamed hidden file in the root
directory identified by the special directory set entry type 0x82. You only need the first half (3K) (even though the table is easy enough to decompress) since
it contains the last in order unicode lower case converted character near the end and even though this half could be compressed it's not.
mikegonta wrote:If you are not working with exFAT simply grab a copy of the table.
I allow for Murphy's law. More specifically, if there's supposed to be an upper case conversion table on disk then I would:
  • Assume that it's wrong and/or corrupt
  • Write code that uses "known good" table/s instead
  • Use the "known good" table/s to check if the table on disk is/isn't correct
  • Use the "known good" table/s to auto-correct the table on disk if necessary
  • Wish there wasn't a table on disk
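
The check-and-correct steps in the list above might be sketched in C as follows. This is only an illustration under assumptions: `upcase_check` is a made-up name, both tables are assumed to be fully decompressed and in host byte order, and the repair only touches the in-memory copy (writing it back to disk is left to the caller).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the table read from disk against a built-in "known good"
 * table. Returns the number of entries that differ. If `repair` is
 * true, differing entries in `on_disk` are overwritten in memory with
 * the known-good values. */
size_t upcase_check(uint16_t *on_disk, const uint16_t *known_good,
                    size_t entries, bool repair)
{
    size_t bad = 0;

    for (size_t i = 0; i < entries; i++) {
        if (on_disk[i] != known_good[i]) {
            bad++;
            if (repair)
                on_disk[i] = known_good[i];
        }
    }
    return bad;
}
```

A non-zero return value with `repair` set to false is enough to decide whether to ignore the on-disk table and fall back to the built-in one.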

Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.