Hi guys,
Octocontrabass wrote:It's a very safe assumption that non-Windows hosts will use UTF-8. UTF-16 exists pretty much exclusively for backwards compatibility with early Unicode adopters.
Agreed, but it is still an assumption. Also, if I am going to support UTF-8, via an "encode_to()" function, how difficult is it to allow other encodings?
Octocontrabass wrote:But you still need to know the length in bytes to parse the directory. And how do you figure out the length of the directory entry if the name contains invalid bytes?
No, you don't. The nameLen field has nothing to do with the length of the directory entry, only the length of the name (in characters) that resides in this entry. The recLen field indicates the length of the whole record, which by definition also gives the length of the buffer holding the name. Traversing the directory uses the recLen field, not the nameLen field. The nameLen field is only used when you need to extract the name. If the name contains invalid bytes, that is on the faulty driver that placed these invalid bytes in the entry, not on the specification of the file system.
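To illustrate, here is a minimal sketch in C of a traversal that follows recLen only. The record layout used here (a 2-byte little-endian recLen counted in bytes, then a 2-byte nameLen, then the name buffer) is an assumption for illustration, not the actual on-disk format, and count_records() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout, for illustration only:
   bytes 0-1: recLen (whole record, in bytes, little-endian)
   bytes 2-3: nameLen (characters, little-endian)
   bytes 4..: name buffer, padded */
#define REC_HEADER_SIZE 4u

static uint16_t read_le16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Walks a directory buffer by following recLen only.
   Returns the number of records, or -1 if a recLen is malformed.
   nameLen and the name's encoding are never consulted. */
int count_records(const uint8_t *dir, size_t dir_size) {
    assert(dir != NULL); /* debug-only sanity check */
    size_t off = 0;
    int count = 0;
    while (off + REC_HEADER_SIZE <= dir_size) {
        uint16_t rec_len = read_le16(dir + off);
        /* A record must at least hold its header and must fit in the buffer. */
        if (rec_len < REC_HEADER_SIZE || rec_len > dir_size - off)
            return -1;
        /* The name buffer is rec_len - REC_HEADER_SIZE bytes long,
           whatever the name contains. */
        off += rec_len;
        count++;
    }
    return count;
}
```

Note that a name containing invalid bytes cannot derail this loop, since the loop never looks inside the name buffer.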
BenLunt wrote:4) What happens when one of the characters in the filename is a non-standard character and now uses a two-byte UTF-8 character? The count of bytes used is now one byte longer than the count of characters used. A non-UTF-8 compliant driver may not encode the last character, thinking that the count is a count of bytes used, not characters used.
Octocontrabass wrote:This is not a problem if only one encoding is allowed. All drivers must be compliant with the only allowed encoding.
Unless I am missing your point, this has nothing to do with allowing multiple encodings. If an (incorrectly written) driver assumes that the UTF-8 name will only contain 8-bit chars, when a multi-byte char is included, that driver will fail to parse the filename correctly. This can happen with any type of encoding, no matter the type or size of the encoding.
Octocontrabass wrote:If the host uses UTF-8 already, there's no encoding overhead, but you still have the overhead from turning a character count into a byte count to allocate your buffer.
Again, I disagree. When extracting the name, the byte count of the buffer holding the name can be easily found by the recLen field. When creating an entry, the UTF-8 encoder will know how many bytes were needed (possibly a byte count greater than the char count) to encode the name, then you can create the directory entry accordingly.
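As a rough sketch of the "encoder knows the byte count" point, the C function below encodes an array of code points as UTF-8 and returns how many bytes it produced. The name utf8_encode() and its signature are hypothetical, not taken from any existing driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes `count` Unicode code points as UTF-8 into `out` (capacity `cap`).
   Returns the number of bytes written, or 0 on an invalid code point or
   insufficient space (an empty name also yields 0). */
size_t utf8_encode(const uint32_t *cps, size_t count, uint8_t *out, size_t cap) {
    assert(cps != NULL && out != NULL); /* debug-only sanity check */
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t c = cps[i];
        size_t need;
        if (c < 0x80) need = 1;
        else if (c < 0x800) need = 2;
        else if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogates are not characters */
            need = 3;
        }
        else if (c <= 0x10FFFF) need = 4;
        else return 0;
        if (need > cap - n) return 0;
        switch (need) {
        case 1:
            out[n] = (uint8_t)c;
            break;
        case 2:
            out[n]     = (uint8_t)(0xC0 | (c >> 6));
            out[n + 1] = (uint8_t)(0x80 | (c & 0x3F));
            break;
        case 3:
            out[n]     = (uint8_t)(0xE0 | (c >> 12));
            out[n + 1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
            out[n + 2] = (uint8_t)(0x80 | (c & 0x3F));
            break;
        default:
            out[n]     = (uint8_t)(0xF0 | (c >> 18));
            out[n + 1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
            out[n + 2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
            out[n + 3] = (uint8_t)(0x80 | (c & 0x3F));
            break;
        }
        n += need;
    }
    return n;
}
```

When creating an entry, the character count passed in would go to nameLen, and the returned byte count would be used to size the record.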
Octocontrabass wrote:(re: Windows Wide Char)...What does this have to do with your filesystem's design?
If I make my driver in Windows and Windows is using Wide Chars, I must somehow convert the UTF-8 to this Wide Char. I was using this as an example of why you must have an encoder within your driver. Since you must have an encoder for your driver, having multiple encodings is no big deal.
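For what the conversion involves, here is a minimal, self-contained sketch in C that turns a UTF-8 name into UTF-16 code units (the form Windows Wide Chars take). The function name utf8_to_utf16() and the UTF_ERR constant are hypothetical. In user-mode Windows code the existing MultiByteToWideChar() API with CP_UTF8 does the same job.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF_ERR ((size_t)-1)

/* Decodes `len` bytes of UTF-8 into UTF-16 code units in `out` (capacity `cap`).
   Returns the number of code units written, or UTF_ERR on malformed input
   or insufficient space. */
size_t utf8_to_utf16(const uint8_t *in, size_t len, uint16_t *out, size_t cap) {
    assert(in != NULL && out != NULL); /* debug-only sanity check */
    size_t i = 0, n = 0;
    while (i < len) {
        uint32_t c = in[i];
        uint32_t min;  /* smallest code point allowed for this sequence length */
        size_t extra;  /* number of continuation bytes expected */
        if (c < 0x80)                { extra = 0; min = 0; }
        else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; min = 0x10000; }
        else return UTF_ERR;                     /* stray continuation or invalid lead byte */
        if (extra > len - i - 1) return UTF_ERR; /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((in[i + k] & 0xC0) != 0x80) return UTF_ERR;
            c = (c << 6) | (in[i + k] & 0x3F);
        }
        /* Reject overlong forms, out-of-range values, and encoded surrogates. */
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return UTF_ERR;
        i += extra + 1;
        if (c < 0x10000) {
            if (cap - n < 1) return UTF_ERR;
            out[n++] = (uint16_t)c;
        } else {
            /* Code points above U+FFFF need a surrogate pair in UTF-16. */
            if (cap - n < 2) return UTF_ERR;
            c -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (c >> 10));
            out[n++] = (uint16_t)(0xDC00 | (c & 0x3FF));
        }
    }
    return n;
}
```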
Octocontrabass wrote:Again, I don't think anyone is suggesting that you limit the encoding to ASCII.
I am not suggesting that anyone is. I am suggesting that as soon as a "non-standard" character is used, i.e.: Non-ASCII, your driver now must understand UTF-8, *unless* you assume that the host is already UTF-8. Again, assume.
klange wrote:And further, I would strongly caution against storing character counts in any context. Traversing directory entries should not depend on the encoding, and character counts will require parsing even for UTF-16 which is not a fixed 2 bytes per codepoint - don't confuse it with UCS-2.
Again, unless I am completely missing your point, the encoding of the name has nothing to do with traversing the directory. Traversing the directory is done with the recLen field, not the nameLen field. The nameLen field is only used when you need to extract the name. The length of the buffer holding the name can easily be found with the recLen field.
Again, I appreciate the comments and hopefully you understand that I am not trying to be argumentative, but I think you guys don't understand that traversing the directory uses the recLen field, not the nameLen field. By doing so, the length of the record has nothing to do with the encoding of the name nor the nameLen field. The recLen field points to the next record used. The nameLen field tells you how many characters are in the buffer in this record. The buffer len is from the recLen field. So that you don't assume the host and the volume use the same encoding, your driver should have an encoder/decoder for UTF-8. If this is the case, having an encoder/decoder for other types is a simple task.
Again, unless I am completely missing your point, I think (and I could be wrong) you guys are missing the point that the recLen field is used to traverse the directory, *not* the nameLen field.
When you traverse the directory, you are pointing to records, not names. Each record points to the next record, no matter the length of the name or how it is encoded. Each record contains a buffer holding an encoded string, and the length of that buffer is easily found from the recLen field. The count of characters in the string is given by nameLen. The buffer will always be at least the exact length needed to hold the name, with extra bytes only to pad to a dword boundary or, in rare cases, to pad to a record more than 16 bytes away (i.e.: when the next record is only 16 bytes long, the current record and that 16-byte record can be combined).
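The relationship between recLen, the buffer, and nameLen can be sketched in C as follows. The 4-byte header size and all three helper names are assumptions for illustration only, and name_byte_len() assumes the buffer holds well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REC_HEADER_SIZE 4u /* ASSUMED: 2-byte recLen + 2-byte nameLen */

/* Smallest record length for a name of `name_bytes` encoded bytes,
   rounded up to a dword (4-byte) boundary. */
size_t padded_rec_len(size_t name_bytes) {
    return (REC_HEADER_SIZE + name_bytes + 3u) & ~(size_t)3u;
}

/* Length of the name buffer, derived from recLen alone. */
size_t name_buffer_len(size_t rec_len) {
    assert(rec_len >= REC_HEADER_SIZE); /* debug-only sanity check */
    return rec_len - REC_HEADER_SIZE;
}

/* Number of bytes occupied by the first `name_len` characters in `buf`.
   Counts UTF-8 lead bytes (anything that is not a 10xxxxxx continuation
   byte) and stops at the start of character number name_len + 1, so
   trailing padding is never included. */
size_t name_byte_len(const uint8_t *buf, size_t buf_len, size_t name_len) {
    assert(buf != NULL); /* debug-only sanity check */
    size_t i = 0, chars = 0;
    while (i < buf_len) {
        if ((buf[i] & 0xC0) != 0x80) { /* start of a new character */
            if (chars == name_len)
                break;
            chars++;
        }
        i++;
    }
    return i;
}
```

For example, the three-character name "né!" takes four bytes in UTF-8, so with these assumptions the record needs at least 8 bytes, and a reader given nameLen = 3 recovers the four name bytes from the buffer without touching the padding.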
As for my Windows example, when writing a driver for this file system, if the Windows compiler is set to use Wide Chars, a UTF-8 encoder/decoder is a must. Therefore, why does it matter if there are options for other encodings when you must have an encoder/decoder to begin with?
Again, I am not trying to be argumentative and I hope it isn't perceived that way. Thank you for your comments, you guys are much appreciated. This forum, and those who frequent it, have been a great help in my hobby.
Thank you,
Ben