OSDev.org

Posted: **Wed Dec 07, 2022 2:13 pm**

BenLunt wrote:For announcement and for your information, I have released version 1.0.0-rc0 of this Lean File System specification.

It is in "request for comments" stage. This is a major release and I don't plan on changing it much after this, unless someone finds a flaw in my implementation. Hopefully the way I have written it, functionality can be added without breaking existing implementations (those starting with version 1.0).

Thanks again for all the comments I have received before. My Ultimate app is version 1.0.0 compliant and should be uploaded to the github within a day or so.

I personally think the character encoding option is a mistake. I'd make it UTF-8 or UTF-16 (LE, I assume; the spec doesn't actually say), but not either/or. Pick a lane, I suggest; UTF-8 takes endianness out of the equation.

Posted: **Wed Dec 07, 2022 3:07 pm**

thewrongchristian wrote:
BenLunt wrote:For announcement and for your information, I have released version 1.0.0-rc0 of this Lean File System specification.

It is in "request for comments" stage. This is a major release and I don't plan on changing it much after this, unless someone finds a flaw in my implementation. Hopefully the way I have written it, functionality can be added without breaking existing implementations (those starting with version 1.0).

Thanks again for all the comments I have received before. My Ultimate app is version 1.0.0 compliant and should be uploaded to the github within a day or so.

I personally think the character encoding option is a mistake. I'd make it UTF-8 or UTF-16 (LE, I assume; the spec doesn't actually say), but not either/or. Pick a lane, I suggest; UTF-8 takes endianness out of the equation.

I'll agree with that; UTF-8 in particular appears to be the best solution for storing file and directory names. It adds no overhead for standard (ASCII) characters, unlike UTF-16, which doubles their size.
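As a rough illustration of the size argument, here is a minimal Python sketch that compares the encoded size of a few file names. The sample names and the helper `encoded_sizes` are made up for this example and do not come from the LeanFS specification.

```python
# Compare the encoded size of a few (made-up) file names in UTF-8 and UTF-16LE.

def encoded_sizes(name):
    """Return (UTF-8 byte count, UTF-16LE byte count) for a name."""
    return len(name.encode("utf-8")), len(name.encode("utf-16-le"))

# One pure-ASCII name, one with a single accented letter, one with CJK text.
names = ["kernel.sys", "na\u00efve.txt", "\u65e5\u672c\u8a9e.txt"]

for name in names:
    utf8_len, utf16_len = encoded_sizes(name)
    print(f"{name!r}: UTF-8 = {utf8_len} bytes, UTF-16LE = {utf16_len} bytes")
```

ASCII-only names take exactly half the space in UTF-8. Names dominated by CJK characters can be slightly larger in UTF-8 (three bytes per character instead of two), so the saving depends on the text being stored.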

Posted: **Wed Dec 07, 2022 8:38 pm**

Hi guys, and thanks for the thoughts.

I put a bit of thought into this as well, though decided to allow alternative encodings, and here is why.

1) If you truly have a UTF-8 compliant driver, don't you have to send the "text" to your encoder anyway, unless you are assuming your host is UTF-8?
2) Only allowing "standard characters", as you say, means that you are assuming that the user will only use the ASCII equivalent characters (0 to 127), correct? If so, why not just make it an ASCIIZ string?
3) Giving a length in characters instead of bytes allows for any encoding.
4) What happens when one of the characters in the filename is a non-standard character and now uses a two-byte UTF-8 character? The count of bytes used is now one byte longer than the count of characters used. A non-UTF-8 compliant driver may not encode the last character, thinking that the count is a count of bytes used, not characters used.
5) If you truly have a UTF-8 compliant driver, you have to have a UTF-8 encoder to encode to the host's encoding. What more effort/overhead is needed to call a different encoder? You have a buffer in and a buffer out with a static character count.
6) Microsoft has stated that they are moving toward wide character strings (CStringW instead of CStringA) in their compilers. If you use CString and it overrides to CStringW[1], you have to have some sort of encoder from UTF-8 to a Wide Char to use CString. Also in reverse.
7) Using UTF-8 and a non-English language now requires "multi-byte" UTF-8 characters, correct? Why should I only allow English?

I guess what I am saying is, allowing only one encoding (UTF-8 or UTF-16 or whatever), assumes the host is using the exact same encoding. This is not the case, even when using ASCII characters, as proven with entry 6) above and its footnote below.

Please don't think I am being argumentative with my comments, I was just asking questions without expecting an answer.

Anyway, thanks so much for the comments, I do appreciate them, and would gladly read more if you are so inclined.

Ben

[1] If _UNICODE is defined, while _MBCS is not, the compiler will use CStringW, 16-bit wide chars. Some form of encoding will need to be done to encode the UTF-8 string to CStringW. 'MultiByteToWideChar()' and 'WideCharToMultiByte()' are currently being used.

Posted: **Wed Dec 07, 2022 10:22 pm**

BenLunt wrote:1) If you truly have a UTF-8 compliant driver, don't you have to send the "text" to your encoder anyway, unless you are assuming your host is UTF-8?

It's a very safe assumption that non-Windows hosts will use UTF-8. UTF-16 exists pretty much exclusively for backwards compatibility with early Unicode adopters.

BenLunt wrote:2) Only allowing "standard characters", as you say, means that you are assuming that the user will only use the ASCII equivalent characters (0 to 127), correct? If so, why not just make it an ASCIIZ string?

I don't think anyone is suggesting that you limit the encoding to ASCII. I do agree with the others that you should choose only one encoding and stick with it instead of allowing a choice of encodings.

BenLunt wrote:3) Giving a length in characters instead of bytes allows for any encoding.

But you still need to know the length in bytes to parse the directory. And how do you figure out the length of the directory entry if the name contains invalid bytes?

BenLunt wrote:4) What happens when one of the characters in the filename is a non-standard character and now uses a two-byte UTF-8 character? The count of bytes used is now one byte longer than the count of characters used. A non-UTF-8 compliant driver may not encode the last character, thinking that the count is a count of bytes used, not characters used.

This is not a problem if only one encoding is allowed. All drivers must be compliant with the only allowed encoding.

BenLunt wrote:5) If you truly have a UTF-8 compliant driver, you have to have a UTF-8 encoder to encode to the host's encoding. What more effort/overhead is needed to call a different encoder? You have a buffer in and a buffer out with a static character count.

If the host uses UTF-8 already, there's no encoding overhead, but you still have the overhead from turning a character count into a byte count to allocate your buffer.

BenLunt wrote:6) Microsoft has stated that they are moving toward wide character strings (CStringW instead of CStringA) in their compilers. If you use CString and it overrides to CStringW[1], you have to have some sort of encoder from UTF-8 to a Wide Char to use CString. Also in reverse.

...What does this have to do with your filesystem's design?

BenLunt wrote:7) Using UTF-8 and a non-English language now requires "multi-byte" UTF-8 characters, correct? Why should I only allow English?

Again, I don't think anyone is suggesting that you limit the encoding to ASCII.

Posted: **Thu Dec 08, 2022 1:23 am**

I am going to passively link this: http://utf8everywhere.org/

And further, I would strongly caution against storing character counts in any context. Traversing directory entries should not depend on the encoding, and character counts will require parsing even for UTF-16 which is not a fixed 2 bytes per codepoint - don't confuse it with UCS-2.

Posted: **Thu Dec 08, 2022 11:34 am**

Hi guys,

Octocontrabass wrote:It's a very safe assumption that non-Windows hosts will use UTF-8. UTF-16 exists pretty much exclusively for backwards compatibility with early Unicode adopters.

Agreed, but it is still an assumption. Also, if I am going to support UTF-8, via an "encode_to()" function, how difficult is it to allow other encodings?

Octocontrabass wrote:But you still need to know the length in bytes to parse the directory. And how do you figure out the length of the directory entry if the name contains invalid bytes?

No, you don't. The nameLen field has nothing to do with the length of the directory entry, only the length of the name (in characters) that resides in this entry. The recLen field indicates the length of the whole record which, by definition, indicates the length of the buffer holding the name. Traversing the directory uses the recLen field, not the nameLen field. The nameLen field is only used when you need to extract the name. If the name contains invalid bytes, that is on the faulty driver that placed these invalid bytes in the entry, not on the specification of the file system.

BenLunt wrote:4) What happens when one of the characters in the filename is a non-standard character and now uses a two-byte UTF-8 character? The count of bytes used is now one byte longer than the count of characters used. A non-UTF-8 compliant driver may not encode the last character, thinking that the count is a count of bytes used, not characters used.

Octocontrabass wrote:This is not a problem if only one encoding is allowed. All drivers must be compliant with the only allowed encoding.

Unless I am missing your point, this has nothing to do with allowing multiple encodings. If an (incorrectly written) driver assumes that the UTF-8 name will only contain 8-bit chars, when a multi-byte char is included, that driver will fail to parse the filename correctly. This can happen on any type of encoding, no matter the type or size of the encoding.

Octocontrabass wrote:If the host uses UTF-8 already, there's no encoding overhead, but you still have the overhead from turning a character count into a byte count to allocate your buffer.

Again, I disagree. When extracting the name, the byte count of the buffer holding the name can be easily found by the recLen field. When creating an entry, the UTF-8 encoder will know how many bytes were needed (possibly a byte count greater than the char count) to encode the name, then you can create the directory entry accordingly.

Octocontrabass wrote:(re: Windows Wide Char)...What does this have to do with your filesystem's design?

If I make my driver in Windows and Windows is using Wide Chars, I must somehow convert the UTF-8 to this Wide Char. I was using this as an example of why you must have an encoder within your driver. Since you must have an encoder for your driver, having multiple encodings is no big deal.

Octocontrabass wrote:Again, I don't think anyone is suggesting that you limit the encoding to ASCII.

I am not suggesting that anyone is. I am suggesting that as soon as a "non-standard" character is used, i.e.: Non-ASCII, your driver now must understand UTF-8, *unless* you assume that the host is already UTF-8. Again, assume.

klange wrote:And further, I would strongly caution against storing character counts in any context. Traversing directory entries should not depend on the encoding, and character counts will require parsing even for UTF-16 which is not a fixed 2 bytes per codepoint - don't confuse it with UCS-2.

Again, unless I am completely missing your point, the encoding of the name has nothing to do with traversing the directory. Traversing the directory is done with the recLen field, not the nameLen field. The nameLen field is only used when you need to extract the name. The length of the buffer holding the name can easily be found with the recLen field.

Again, I appreciate the comments and hopefully you understand that I am not trying to be argumentative, but I think you guys don't understand that traversing the directory uses the recLen field, not the nameLen field. By doing so, the length of the record has nothing to do with the encoding of the name nor the nameLen field. The recLen field points to the next record used. The nameLen field tells you how many characters are in the buffer in this record. The buffer len is from the recLen field. So that you don't assume the host and the volume use the same encoding, your driver should have an encoder/decoder for UTF-8. If this is the case, having an encoder/decoder for other types is a simple task.

Again, unless I am completely missing your point, I think (and I could be wrong) you guys are missing the point that the recLen field is used to traverse the directory, *not* using the nameLen field.

When you traverse the directory, you are pointing to records, not names. Each record points to the next record, no matter the length of the name or how it is encoded. Each record contains a buffer, whose length is easily found from the recLen field, holding an encoded string. The count of the characters in the string is given by nameLen. The buffer will always be at least the exact length it needs to be to hold the name, with extra bytes only to pad to a dword boundary or, in a rare case, to pad to a record more than 16 bytes away (i.e., when the next record is only 16 bytes long, the current record and that 16-byte record can be combined).
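The traversal described above can be sketched in Python as follows. The record layout is an assumption made purely for illustration (an 8-byte inode number, a 1-byte type, a 1-byte recLen counted in 16-byte units, a 2-byte nameLen, then the name buffer); check the actual LeanFS specification for the real layout. As in the description at this point in the thread, nameLen counts code points and the buffer size is derived from recLen.

```python
import struct

# Hypothetical record header: inode, type, recLen (in 16-byte units), nameLen.
HEADER = struct.Struct("<QBBH")
UNIT = 16  # assumed record granularity in bytes

def make_entry(inode, kind, name):
    """Build one padded record; nameLen is stored as a code point count."""
    encoded = name.encode("utf-8")
    units = -(-(HEADER.size + len(encoded)) // UNIT)  # round up to whole units
    header = HEADER.pack(inode, kind, units, len(name))
    return (header + encoded).ljust(units * UNIT, b"\0")

def walk(directory):
    """Yield (inode, name) pairs, stepping between records with recLen only."""
    offset = 0
    while offset < len(directory):
        inode, kind, rec_len, name_len = HEADER.unpack_from(directory, offset)
        if rec_len == 0:
            break  # a zero length would loop forever; treat it as corruption
        size = rec_len * UNIT
        # The name buffer is whatever remains of the record after the header.
        buffer = directory[offset + HEADER.size:offset + size]
        # Decode the whole buffer, then keep the first nameLen code points.
        name = buffer.decode("utf-8", errors="replace")[:name_len]
        yield inode, name
        offset += size

directory = (make_entry(1, 1, "boot")
             + make_entry(2, 1, "na\u00efve.txt")
             + make_entry(3, 1, "a" * 30))
for inode, name in walk(directory):
    print(inode, name)
```

Note that stepping from record to record never looks at the name, but extracting the name does require decoding the buffer, because nameLen alone does not say where the name ends in bytes.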

As for my Windows example, when writing a driver for this file system, if the Windows compiler is set for wide chars, a UTF-8 encoder/decoder is a must. Therefore, why does it matter if there are options for other encodings when you must have an encoder/decoder to begin with?

Again, not trying to be argumentative and I hope it isn't perceived that way. Thank you for your comments, you guys are much appreciated. This forum, and those who frequent it, have been a great help in my hobby and it is much appreciated.

Thank you,
Ben

Posted: **Thu Dec 08, 2022 11:52 am**

I am going to chime in and say that I am with Octocontrabass and Klange here. Octo's answers are all spot on. You should just use UTF-8 everywhere. Anything else is much harder than you think it is.

If you read what he linked, it's pretty clear that "counting characters" is not meaningful in any context as "characters" can have many meanings. From your reply it sounds like you are using the term "character" to mean "codepoint". I am not sure. This is confusing. This is why it's easier to just use UTF-8. Once you have UTF-8 everywhere, you don't ever need to count "characters". You only deal with bytes. It makes the code easier to read, simpler and less buggy. Counting codepoints in a filesystem implementation has no use whatsoever. You want to use byte counts, not codepoints or "characters" counts.
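A short Python sketch of why "character count" is ambiguous: the same visible letter can be stored as one code point or two, with different sizes in every encoding. The helper `describe` is invented for this example.

```python
import unicodedata

def describe(text):
    """Return the different 'lengths' one string can have."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

composed = "\u00e9"     # 'é' as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

print(describe(composed))    # 1 code point, 2 UTF-8 bytes, 1 UTF-16 unit
print(describe(decomposed))  # 2 code points, 3 UTF-8 bytes, 2 UTF-16 units

# Both render as the same letter and are equal after NFC normalization.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A byte count is the only one of these numbers that is unambiguous for a given stored name.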

If someone wants to use your file system and UTF-16, let them deal with the complexities of encoding/decoding strings in different formats. You don't have to. UCS-2/UTF-16/UTF-32 only exist for legacy reasons and I believe should not be used in any new spec / code going forward.

I personally switched to "UTF-8 everywhere" years ago and it has made my life much simpler. It's efficient (space and processing time), it's easy and it just works. Years of Windows TCHAR programming have caused irreparable damage to my brain.

Posted: **Thu Dec 08, 2022 2:17 pm**

Same here. I've used UTF-8 for years too. There is no need to support wide character encodings, and when handling other encodings like resource files or long FAT entries, I build the converter into the access functions / file systems.

Posted: **Thu Dec 08, 2022 3:02 pm**

BenLunt wrote:Also, if I am going to support UTF-8, via an "encode_to()" function, how difficult is it to allow other encodings?

Very difficult, if you're supporting any non-Unicode encodings. But even if you stick to Unicode only, that's still making it unnecessarily complicated to write a driver since the driver must now handle multiple different character encodings instead of only one. And what do you gain from allowing multiple encodings?

BenLunt wrote:The recLen field indicates the length of the whole record

Aha, I must have missed that part. So, you can still parse the rest of the directory if one name is corrupt, but the rest of the problems are still present.

BenLunt wrote:If the name contains invalid bytes, that is on the faulty driver that placed these invalid bytes in the entry, not on the specification of the file system.

Faulty drivers are not the only source of filesystem corruption.

BenLunt wrote:If an (incorrectly written) driver assumes that the UTF-8 name will only contain 8-bit chars, when a multi-byte char is included, that driver will fail to parse the filename correctly.

Why would anyone make this assumption when the encoding is UTF-8?

BenLunt wrote:When extracting the name, the byte count of the buffer holding the name can be easily found by the recLen field. When creating an entry, the UTF-8 encoder will know how many bytes were needed (possibly a byte count greater than the char count) to encode the name, then you can create the directory entry accordingly.

I still don't understand why the character count is so important that you'd store it instead of the byte count, especially when OS APIs typically care more about the byte count.

BenLunt wrote:If I make my driver in Windows and Windows is using Wide Chars, I must somehow convert the UTF-8 to this Wide Char. I was using this as an example of why you must have an encoder within your driver. Since you must have an encoder for your driver, having multiple encodings is no big deal.

Any filesystem that uses an encoding different from your OS APIs will require character set conversion anyway; allowing more character sets just makes your driver more complicated for no benefit.

BenLunt wrote:I am suggesting that as soon as a "non-standard" character is used, i.e.: Non-ASCII, your driver now must understand UTF-8, *unless* you assume that the host is already UTF-8.

Except you must still parse the name to figure out how many bytes it is before you can pass it along to any OS APIs, so actually your driver needs to understand UTF-8 no matter what.
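To show what "parse the name to figure out how many bytes it is" involves, here is a minimal Python sketch that converts a code point count into a byte count by scanning UTF-8 lead bytes. The function name `utf8_byte_length` is invented for the example, and the sketch deliberately does not validate continuation bytes.

```python
def utf8_byte_length(buffer, code_points):
    """Return how many bytes the first `code_points` code points occupy.

    Only lead bytes are inspected; continuation bytes are not validated.
    """
    offset = 0
    for _ in range(code_points):
        if offset >= len(buffer):
            raise ValueError("buffer ends before the requested code point count")
        lead = buffer[offset]
        if lead < 0x80:
            width = 1  # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            width = 2  # 110xxxxx
        elif lead >> 4 == 0b1110:
            width = 3  # 1110xxxx
        elif lead >> 3 == 0b11110:
            width = 4  # 11110xxx
        else:
            raise ValueError("invalid UTF-8 lead byte")
        offset += width
    if offset > len(buffer):
        raise ValueError("last code point is truncated")
    return offset

# A padded name buffer holding 9 code points, one of them two bytes long.
padded = "na\u00efve.txt".encode("utf-8") + b"\0" * 6
print(utf8_byte_length(padded, 9))
```

With a stored byte count, none of this scanning is needed before handing the name to an API that expects a pointer and a length in bytes.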

Posted: **Thu Dec 08, 2022 3:35 pm**

Octocontrabass wrote:
...source of filesystem corruption.

BenLunt wrote:If an (incorrectly written) driver assumes that the UTF-8 name will only contain 8-bit chars, when a multi-byte char is included, that driver will fail to parse the filename correctly.

Why would anyone make this assumption when the encoding is UTF-8?

And why would the driver fail to parse the filename correctly? Why does the driver need to parse anything here? Shouldn't the driver just compare array of bytes representing filenames?

Posted: **Thu Dec 08, 2022 3:51 pm**

Hi guys,

kzinti wrote:From your reply it sounds like you are using the term "character" to mean "codepoint".

Yes, I apologize, I do mean codepoint when Unicode is used, character when ASCII is used. Forgive me, I will clarify that.

The specification states that UTF-8 should be used by default. If a driver finds something else, it can simply refuse to mount it.

Another example of a non-UTF-8 environment is UEFI. If I write a LEAN driver for UEFI, I still must write an encoder/decoder for the UEFI LEAN driver.

kzinti wrote:And why would the driver fail to parse the filename correctly? Why does the driver need to parse anything here? Shouldn't the driver just compare array of bytes representing filenames?

My concern is when the host (UEFI for example) and the driver (UTF-8) use different encodings. If the host uses a wide char encoding (like UEFI) but the volume is encoded in UTF-8, then, to answer your question: no, you cannot just compare an array of bytes representing the filenames. You must convert one of the strings to the other's encoding, in either direction.

Again, I appreciate everyone's comments, I will take them all under consideration.

Ben

Posted: **Thu Dec 08, 2022 4:18 pm**

If the driver sits between a UCS-2 host API (like UEFI) and the filesystem, you will need encoding/decoding functions in the driver. This is true whether the filesystem only supports UTF-8 or multiple formats.

Supporting only UTF-8 means you only need an encoder/decoder for UTF-8. If your host happens to support UTF-8 you don't need any encoder/decoder.

Supporting multiple formats means every driver will need one or more encoders/decoders. This is more work for everyone. Also, Unicode is hard.
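A minimal Python sketch of the UTF-8-only case with a wide-character host: the driver converts the host's name to UTF-8 once at the API boundary, and every comparison against the volume is then a plain byte comparison. The function names and the in-memory list of stored names are hypothetical, and the sketch assumes the host passes UTF-16LE.

```python
def host_name_to_volume(host_name_utf16le):
    """Convert a name from the host's UTF-16LE form to the volume's UTF-8 form."""
    return host_name_utf16le.decode("utf-16-le").encode("utf-8")

def find_entry(stored_names, host_name_utf16le):
    """Return the index of the matching stored name, or None if absent."""
    wanted = host_name_to_volume(host_name_utf16le)  # one conversion per lookup
    for index, stored in enumerate(stored_names):
        if stored == wanted:  # raw byte comparison, no decoding of the volume
            return index
    return None

# Names as they would sit on a UTF-8 volume.
stored_names = [b"boot", "na\u00efve.txt".encode("utf-8")]

print(find_entry(stored_names, "na\u00efve.txt".encode("utf-16-le")))
print(find_entry(stored_names, "missing".encode("utf-16-le")))
```

A byte comparison like this treats differently normalized or differently cased names as distinct; whether that is acceptable is a separate design decision. On a host that already uses UTF-8, the conversion step disappears entirely.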

Posted: **Thu Dec 08, 2022 4:27 pm**

The filesystem I'm designing does not explicitly use any encoding. The driver only uses raw bytes that may contain any values. It does not have any notion of a path separator either.

That said, it is assumed that names are all UTF-8, since it is compatible with ASCII and the most efficient for English words, and / is reserved for path separators. I'll leave it up to the OS to enforce that though.

Posted: **Thu Dec 08, 2022 5:58 pm**

After much consideration and thought, I have taken your advice and removed the option of multiple text encodings. UTF-8 is now specified for all text strings.

May I quote something I saw over on hackaday.org today:

"but it better be better...to falling so in love with your idea, that you lose sight of what it really means to be better."

:-)

Thank you all for your comments. The new specs are up, with an addition. This addition makes all aspects of the volume have a checksum (when extended checksums are used). The Superblock, Inodes, Indirects, Extents, and now the bitmap.

Thanks again,
Ben

Posted: **Thu Dec 08, 2022 7:45 pm**

BenLunt wrote:After much consideration and thought, I have taken your advice and removed the option of multiple text encodings. UTF-8 is now specified for all text strings.

Hurray!

About LeanFS
