
Confused about FAT32 long filename character encoding

Posted: Sat May 06, 2023 5:37 pm
by manny
I am confused about the character encoding for long filenames in FAT32. I am adding FAT32 support to my kernel, and it's going very well. The FAT32 specification says that long filenames are encoded as UTF-16, but when I read a long filename from disk, it doesn't look like UTF-16 to me. Here is what I am seeing.

When I read the directory entry for the file cmsis_armcc.h from disk, the long filename is encoded as follows (shown in hexadecimal):


0000: 63 -- 6D -- 73 -- 69 -- 73 -- 5F -- 61 -- 72 --
0010: 6D -- 63 -- 63 -- 2E -- 68 -- -- -- -- -- -- --
But that can't be UTF-16, because the encoding contains nulls, right? Perhaps my understanding is incorrect, but I thought Unicode did not contain nulls? Please help me get back on the right track. :)

Thank you!

Re: Confused about FAT32 long filename character encoding

Posted: Sun May 07, 2023 8:12 am
by JAAman
UTF-16 uses 16 bits for each character, so each hex row has only 8 characters:
0063: 'c'
006D: 'm'
0073: 's'
0069: 'i'
0073: 's'
005F: '_'
0061: 'a'
0072: 'r'
006D: 'm'
0063: 'c'
0063: 'c'
002E: '.'
0068: 'h'
0000: this is likely a null terminator (I'd have to check the documentation to be sure)

there are no null characters here, only 13 16-bit characters, with the upper 8 bits of each character 0, a null terminator (unless long filenames use counted strings), and unused space at the end of the filename

the one thing to be aware of: originally Unicode was fully contained in 16 bits, however when they decided to start including rarely used characters/character sets, they had to extend this, so some characters in UTF-16 require 2 adjacent 16-bit entries to define a single character (called surrogate pairs). So while most characters are 16 bits, some (very rarely used) characters are actually 32 bits (using specially designated pairs of 16-bit values)
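as a rough sketch of how a surrogate pair combines into a single code point (the function name here is just illustrative, not from any spec):

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into one Unicode code point.
 * The high surrogate is in 0xD800..0xDBFF, the low surrogate in
 * 0xDC00..0xDFFF; together they encode U+10000..U+10FFFF. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi - 0xD800u) << 10)
                     | (uint32_t)(lo - 0xDC00u));
}
```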

Re: Confused about FAT32 long filename character encoding

Posted: Mon May 08, 2023 6:48 pm
by manny
JAAman, that certainly matches what I am seeing. So when I scan a directory for a long filename, what is the conversion process so I can do a compare? Do I just do a bitwise OR of the 8-bit ASCII value with a 16-bit value to make it "UTF-16"? I really do not "need" support for other character sets as my FAT32 driver is for my RTOS. Suggestions?

Re: Confused about FAT32 long filename character encoding

Posted: Mon May 08, 2023 8:02 pm
by Octocontrabass
You can convert 7-bit ASCII to UTF-16 by filling the remaining 9 bits with zeroes.

If you're using an 8-bit character encoding, it's not ASCII. The conversion will be more complex depending on exactly which character encoding you're using.
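As a minimal sketch of that zero-extension, assuming the LFN entries have already been reassembled into a null-terminated array of 16-bit code units (the helper name is hypothetical, and real FAT drivers additionally compare long names case-insensitively, which is omitted here):

```c
#include <stdint.h>

/* Compare a 7-bit ASCII name against UTF-16 code units from an LFN.
 * Zero-extending each ASCII byte to 16 bits is exactly "fill the
 * remaining 9 bits with zeroes". Returns 1 on match, 0 otherwise. */
static int lfn_matches_ascii(const uint16_t *lfn, const char *ascii)
{
    while (*ascii != '\0') {
        if (*lfn++ != (uint16_t)(unsigned char)*ascii++)
            return 0;
    }
    return *lfn == 0;   /* both names must end together */
}
```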

Re: Confused about FAT32 long filename character encoding

Posted: Tue May 09, 2023 12:49 am
by rdos
I use UTF-8, and that conversion is pretty simple. UTF-8, unlike UTF-16, works with normal C string operations that use a single null-terminator. UTF-8 also can express anything that UTF-16 can express.

Re: Confused about FAT32 long filename character encoding

Posted: Tue May 09, 2023 9:14 am
by nullplan
manny wrote:JAAman, that certainly matches what I am seeing. So when I scan a directory for a long filename, what is the conversion process so I can do a compare? Do I just do a bitwise OR of the 8-bit ASCII value with a 16-bit value to make it "UTF-16"? I really do not "need" support for other character sets as my FAT32 driver is for my RTOS. Suggestions?
The normal process is to decode the directory entries one by one and then check whether each entry matches the name you are seeking. And typically this does entail a character set conversion. Most operating systems coming from the Unix corner do not want to deal with UTF-16 everywhere, and instead use something like UTF-8. You can try to remain charset-agnostic, like Linux, but that is also more complicated. Linux allows file names in any character set that is 8-bit and null-free, so for VFAT it MUST convert the character set. And then the question is: convert it to what? You can set it as a mount option. So this also means that there is a full charset library inside Linux. That's the price of agnosticism.

I have made the decision to use UTF-8 for everything. UTF-8 is 8-bits and null free, and UTF-16 (and actually any other character set) can be losslessly converted to it. All device drivers that happen to know what character set a string is in must convert it to UTF-8 before use. The nice thing is that for many things, you can just push a UTF-8 string through all the usual APIs and it really doesn't matter so often what the string actually means.

I would counsel against just using ASCII and hardcoding that to UTF-16, since then you get into trouble as soon as any characters show up in strings coming from outside your system that contain non-ASCII characters.
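For reference, the UTF-16-to-UTF-8 conversion is short enough to sketch in full. This version handles surrogate pairs but does no validation of unpaired surrogates, which a real driver should reject:

```c
#include <stdint.h>
#include <stddef.h>

/* Convert in_len UTF-16 code units to UTF-8. Returns the number of
 * bytes written, or 0 if the output buffer is too small. Unpaired
 * surrogates are not checked for in this sketch. */
static size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                            char *out, size_t out_cap)
{
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = in[i];
        /* combine a surrogate pair into one code point */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in_len) {
            uint32_t lo = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }
        if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
            if (o + 1 > out_cap) return 0;
            out[o++] = (char)cp;
        } else if (cp < 0x800) {              /* 2 bytes */
            if (o + 2 > out_cap) return 0;
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            /* 3 bytes */
            if (o + 3 > out_cap) return 0;
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {                              /* 4 bytes */
            if (o + 4 > out_cap) return 0;
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```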

Re: Confused about FAT32 long filename character encoding

Posted: Wed May 10, 2023 6:47 am
by manny
Octocontrabass wrote:You can convert 7-bit ASCII to UTF-16 by filling the remaining 9 bits with zeroes.

If you're using an 8-bit character encoding, it's not ASCII. The conversion will be more complex depending on exactly which character encoding you're using.
This is what I will do. Thank you.

Re: Confused about FAT32 long filename character encoding

Posted: Wed May 10, 2023 6:56 am
by manny
rdos wrote:I use UTF-8, and that conversion is pretty simple. UTF-8, unlike UTF-16, works with normal C string operations that use a single null-terminator. UTF-8 also can express anything that UTF-16 can express.
My reply isn't to disagree. But my understanding was that UTF-16 was designed to work with null terminated char arrays in C because no single byte (in a multi-byte char) was null. Here is how I understood UTF-16. The character 'A' would be encoded as follows with the most significant bit telling you that it's multibyte.

10000000 01000001

Thoughts? :D

Re: Confused about FAT32 long filename character encoding

Posted: Wed May 10, 2023 7:17 am
by manny
nullplan wrote:
manny wrote:JAAman, that certainly matches what I am seeing. So when I scan a directory for a long filename, what is the conversion process so I can do a compare? Do I just do a bitwise OR of the 8-bit ASCII value with a 16-bit value to make it "UTF-16"? I really do not "need" support for other character sets as my FAT32 driver is for my RTOS. Suggestions?
The normal process is to decode the directory entries one by one and then check whether each entry matches the name you are seeking. And typically this does entail a character set conversion. Most operating systems coming from the Unix corner do not want to deal with UTF-16 everywhere, and instead use something like UTF-8. You can try to remain charset-agnostic, like Linux, but that is also more complicated. Linux allows file names in any character set that is 8-bit and null-free, so for VFAT it MUST convert the character set. And then the question is: convert it to what? You can set it as a mount option. So this also means that there is a full charset library inside Linux. That's the price of agnosticism.

I have made the decision to use UTF-8 for everything. UTF-8 is 8-bits and null free, and UTF-16 (and actually any other character set) can be losslessly converted to it. All device drivers that happen to know what character set a string is in must convert it to UTF-8 before use. The nice thing is that for many things, you can just push a UTF-8 string through all the usual APIs and it really doesn't matter so often what the string actually means.

I would counsel against just using ASCII and hardcoding that to UTF-16, since then you get into trouble as soon as any characters show up in strings coming from outside your system that contain non-ASCII characters.
I definitely cannot afford the Linux approach in an embedded operating system. Sticking with UTF-8 throughout makes the most sense in a highly power and memory constrained device.

Re: Confused about FAT32 long filename character encoding

Posted: Wed May 10, 2023 8:09 am
by rdos
manny wrote:
rdos wrote:I use UTF-8, and that conversion is pretty simple. UTF-8, unlike UTF-16, works with normal C string operations that use a single null-terminator. UTF-8 also can express anything that UTF-16 can express.
My reply isn't to disagree. But my understanding was that UTF-16 was designed to work with null terminated char arrays in C because no single byte (in a multi-byte char) was null. Here is how I understood UTF-16. The character 'A' would be encoded as follows with the most significant bit telling you that it's multibyte.

10000000 01000001

Thoughts? :D
I think UTF-8 was "invented" so you didn't need to create new string functions with a double null terminator. UTF-8 works with normal C string functions, but UTF-16 won't, since the basic type is two bytes. If you use English only, then UTF-8 strings will occupy only half the size compared to UTF-16, which wastes a zero byte for each character. The original motivation for UTF-16 was that the size of the object matched the number of characters in the string, but this is no longer the case, so that argument no longer holds. For UTF-8 it's a bit complicated to determine the number of characters, but all C string functions work without fixes.
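Determining the number of characters isn't too bad either. A sketch, assuming the input is valid UTF-8 (this counts code points, not displayed glyphs):

```c
#include <stddef.h>

/* Count code points in a null-terminated UTF-8 string by counting
 * every byte that is NOT a continuation byte (10xxxxxx). Assumes
 * valid UTF-8; this is not a validator. */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```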

Re: Confused about FAT32 long filename character encoding

Posted: Wed May 10, 2023 10:46 am
by Octocontrabass
manny wrote:But my understanding was that UTF-16 was designed to work with null terminated char arrays in C because no single byte (in a multi-byte char) was null.
No, that's incorrect. Unicode was originally a 16-bit character set, and software would use 16-bit wchar_t arrays instead of 8-bit char arrays. You would never examine individual bytes in a Unicode string. Later, Unicode was extended beyond 16 bits, and UTF-16 was designed to encode the new characters using the original 16-bit wchar_t strings.

You might be thinking of UTF-8, which was designed to work with char arrays in C.

Re: Confused about FAT32 long filename character encoding

Posted: Thu May 11, 2023 12:15 pm
by thewrongchristian
Octocontrabass wrote:
manny wrote:But my understanding was that UTF-16 was designed to work with null terminated char arrays in C because no single byte (in a multi-byte char) was null.
No, that's incorrect. Unicode was originally a 16-bit character set, and software would use 16-bit wchar_t arrays instead of 8-bit char arrays. You would never examine individual bytes in a Unicode string. Later, Unicode was extended beyond 16 bits, and UTF-16 was designed to encode the new characters using the original 16-bit wchar_t strings.

You might be thinking of UTF-8, which was designed to work with char arrays in C.
An added complication of UTF-16 is our old friend, byte order. Not so much a problem for internal strings, but anything exported to external storage must define the byte order, hence an optional Byte Order Mark code point from which the software reading the UTF-16 string can infer the byte order.

VFAT doesn't use a BOM, and instead defines little-endian byte order.
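So when reading LFN entries, it's safest to assemble each code unit explicitly from bytes rather than casting the on-disk buffer to a 16-bit pointer; that stays correct regardless of host byte order (and sidesteps alignment issues):

```c
#include <stdint.h>

/* Read one little-endian 16-bit code unit from an on-disk buffer,
 * independent of host byte order and alignment. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```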

Re: Confused about FAT32 long filename character encoding

Posted: Fri May 12, 2023 12:44 pm
by linguofreak
manny wrote:My reply isn't to disagree. But my understanding was that UTF-16 was designed to work with null terminated char arrays in C because no single byte (in a multi-byte char) was null. Here is how I understood UTF-16. The character 'A' would be encoded as follows with the most significant bit telling you that it's multibyte.

10000000 01000001

Thoughts? :D
That's (basically) how UTF-8 works, so you're likely thinking of UTF-8.

The difference between UTF-8 and what you have above is that UTF-8 uses more than one bit to flag multi-byte characters:

A one-byte character just has the high bit zero (so any valid ASCII string is also a valid UTF-8 string):

0xxxxxxx

A multi-byte character has a (variable length) prefix indicating the number of following bytes on the first byte, and then a two-bit prefix of "10" on following bytes (so that if you're looking at one of the following bytes, you know it isn't the start of a character).

110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

These features simplify error recovery: the prefix on the first byte allows UTF-8 decoders to determine from that byte alone how long the character is (instead of just keeping on going until they find a new byte with the high bit 1, which might cause trouble if the high bit has been corrupted). The prefix on the following bytes makes sure that if a random number of bytes is lost from the beginning of a string (with the remaining part potentially beginning in the middle of a character), you can still recover the next full character after the lost data, and everything after it. If the following bytes could be any arbitrary byte, then they might look like a single-byte character or the start of a multi-byte character in this situation, and so you'd start decoding at the wrong point and get garbage.
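The bit patterns above translate directly into a length lookup on the lead byte, something like:

```c
#include <stddef.h>

/* Determine the length of a UTF-8 sequence from its first byte,
 * following the prefixes described above. Returns 0 for a
 * continuation byte (10xxxxxx) or an invalid lead byte. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)          return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                         /* continuation or invalid */
}
```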

Re: Confused about FAT32 long filename character encoding

Posted: Sat May 13, 2023 12:07 am
by nullplan
Octocontrabass wrote: Later, Unicode was extended beyond 16 bits, and UTF-16 was designed to encode the new characters using the original 16-bit wchar_t strings.
BTW, for this reason these days you should expect wchar_t to be at least 21 bits (so, 32 bits in all realistic cases). If you do need to interoperate with software using UTF-16, then since C11, there's char16_t for that, and you can prefix string literals with a small u to make them UTF-16 (if you put a capital U, they become UTF-32).

The thing is that C's wide character API simply cannot deal with UTF-16, because some characters consist of multiple wide characters (surrogate pairs). C only allows multibyte strings (of type char[], which must be null-free) and wide character strings (where each element is a complete character). UTF-16 is the worst of both worlds.
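For completeness, the C11 literal prefixes mentioned above look like this (char16_t and char32_t come from <uchar.h>):

```c
#include <uchar.h>
#include <stddef.h>

/* C11 Unicode string literals: u"" yields a char16_t[] (UTF-16),
 * U"" yields a char32_t[] (UTF-32). Both are null-terminated. */
static const char16_t name16[] = u"cmsis_armcc.h";
static const char32_t name32[] = U"cmsis_armcc.h";
```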

Re: Confused about FAT32 long filename character encoding

Posted: Sat May 13, 2023 1:29 pm
by Octocontrabass
nullplan wrote:BTW, for this reason these days you should expect wchar_t to be at least 21 bits (so, 32 bits in all realistic cases).
Good point - systems that widely adopted Unicode after it was extended beyond 16 bits tend to have 32-bit wchar_t. However, early adopters like Windows (and anything following the Windows ABI, like UEFI) still use 16-bit wchar_t.