UTF-8 support in a File System

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
User avatar
XCHG
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

UTF-8 support in a File System

Post by XCHG »

I am starting to write basic drafts for the implementation of my own File System and I am thinking of supporting the UTF-8 encoding format. I have one question though: should all characters take 2 or 4 bytes of space even if they fall in the range 0x00 to 0x7F, or should I detect which range a character falls into and allocate only as many bytes as it needs, with the appropriate marker bits set in the most significant bits of the high-order byte?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
mathematician
Member
Posts: 437
Joined: Fri Dec 15, 2006 5:26 pm
Location: Church Stretton Uk

Post by mathematician »

As I understand it, ASCII characters should be encoded as single-byte characters in order to ensure backwards compatibility.
jnc100
Member
Posts: 775
Joined: Mon Apr 09, 2007 12:10 pm
Location: London, UK
Contact:

Post by jnc100 »

XCHG wrote:should all the characters take 2 or 4 bytes of space
If all characters took 2 or 4 bytes then it wouldn't be UTF-8. Several different ways of encoding a Unicode character exist. For example, UTF-32 uses 4 bytes for every character, UTF-16 uses at least 2 and at most 4, whereas UTF-8 uses at least one and at most 4. Although the definitions have become a bit muddled, the encodings whose names start with UTF are generally able to describe the entire Unicode range. This differs somewhat from Microsoft's standard implementation of wchar_t, which is two bytes but not extensible like UTF-16, and is therefore only able to represent the 65,536 code points of the Unicode Basic Multilingual Plane (BMP). Linux, on the other hand, uses UTF-8.
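To illustrate the ranges, here is a minimal C sketch (the name utf8_length is purely illustrative) that gives the number of bytes UTF-8 needs for a given code point:

Code: Select all

#include <stdint.h>

/* Number of bytes UTF-8 needs to encode a given Unicode code point.
   Returns 0 for values above 0x10FFFF, which are not encodable. */
static int utf8_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;   /* plain ASCII, stored unchanged  */
    if (cp <= 0x7FF)    return 2;   /* 110xxxxx 10xxxxxx              */
    if (cp <= 0xFFFF)   return 3;   /* 1110xxxx 10xxxxxx 10xxxxxx     */
    if (cp <= 0x10FFFF) return 4;   /* 11110xxx plus 3 continuations  */
    return 0;
}
(The surrogate range 0xD800-0xDFFF is not valid in any UTF; the sketch ignores that detail.)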

When working with disk files, you also need to define whether the data is written in little-endian or big-endian byte order. I believe UTF-8 algorithms do this by making the first character a non-breaking space, which can only be expressed in one way.

If you're interested in UTF-8, then the encoding algorithm can be found here.

Regards,
John.
urxae
Member
Posts: 149
Joined: Sun Jul 30, 2006 8:16 am
Location: The Netherlands

Post by urxae »

jnc100 wrote:When working with disk files, you also need to define whether the data is written in little-endian or big-endian. I believe UTF-8 algorithms do this by making the first character a non-breaking space which can only be expressed in one way.
Not exactly. If you're using UTF-16 or UTF-32 you need to define whether it's little-endian or big-endian. If you're using UTF-8 it doesn't matter, since UTF-8 encodes to single-byte code units, not to multi-byte code units like UTF-16 and UTF-32.
To be specific, there are 3 commonly used variants of both UTF-16 and UTF-32: UTF-16LE, UTF-16BE and UTF-16 with BOM (and similarly for UTF-32). The first two are simply fixed little-endian and big-endian, respectively. The latter starts with a BOM, which is the encoding of the zero-width no-break space character (U+FEFF). Its bytes come out in a different order in little-endian and big-endian, which is why it can serve as a byte-order mark.
However, for a file system a byte order can simply be defined if using UTF-16/32, or UTF-8 can be used without needing such an order to be specified.

(Note: Although a BOM is not necessary for UTF-8 text to be interpreted correctly (since the byte order is strictly defined), the UTF-8 encoding of U+FEFF is often used at the start of a file as a signature to indicate that the file is UTF-8 encoded at all. Again, this isn't necessary for a file system: you can simply define that UTF-8 is used and store the raw UTF-8 encoded text as filenames or whatever.)
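For illustration, here is a minimal C sketch of BOM sniffing (utf_bom_name is just an illustrative helper name, not a standard function); note that the UTF-32LE test has to come before the UTF-16LE one, because FF FE 00 00 also begins with FF FE:

Code: Select all

#include <stddef.h>
#include <string.h>

/* Returns the name of the encoding indicated by a BOM at the start of
   buf, or NULL if no recognisable BOM is present. */
static const char *utf_bom_name(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return NULL;
}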
jnc100
Member
Posts: 775
Joined: Mon Apr 09, 2007 12:10 pm
Location: London, UK
Contact:

Post by jnc100 »

:oops: It's been a while since I read the specifications. Sorry about that...

Regards,
John.
User avatar
XCHG
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Thank you guys. I had read that wiki article, and I have just coded a function that can encode a character into UTF-8, using up to 4 bytes depending on the character code. The prototype for the function is:

Code: Select all

DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
Where the [UTF8CharacterCode] parameter is the character code to be converted to its UTF-8 encoding, which is returned as the function result. I will also post the function itself, because it might help somebody having difficulties encoding characters to UTF-8 in (N)ASM:

Code: Select all

; --------------------------------------------------
  __UTF8EncodeCharacter:
    ; DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
    ; Returns the encoded UTF-8 compliant character for the entered character code
    ; Returns 0xFFFFFFFF upon failure
    
    PUSH    EBX
    PUSH    ECX
    PUSH    EDX
    PUSH    EBP
    MOV     EBP , ESP
    MOV     EAX , 0xFFFFFFFF
    ; [EBP + 0x14] = UTF8CharacterCode
    MOV     EBX , DWORD PTR [EBP + 0x14]
    ; See if the character is valid (Must be less than
    ; or equal to 0x0010FFFF)
    CMP     EBX , 0x0010FFFF
    JBE     .DetectCodeRange1
    JMP     .EP
    
    
    .DetectCodeRange1:
      ; EBX = UTF8CharacterCode
      ; Code Range 1 in UTF-8 must be at most 0x0000007F
      ; and all other bits must be zero
      TEST    EBX , ~(0x0000007F)
      JNZ     .DetectCodeRange2
      MOV     EAX , EBX
      JMP     .EP
    
    
    .DetectCodeRange2:
      ; EBX = UTF8CharacterCode
      ; Code Range 2 in UTF-8 must be at most 0x000007FF
      ; and all other bits must be zero
      ; General format of a 2-bytes-long UTF-8 code is
      ; in binary: 00000000, 00000000, 110xxxxx, 10xxxxxx
      TEST    EBX , ~(0x000007FF)
      JNZ     .DetectCodeRange3
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b 
      ; EAX = Least Significant Byte of the character
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000001111100000000b
      OR      ECX , 00000000000000001100000000000000b
      OR      EAX , ECX
      JMP     .EP

    .DetectCodeRange3:
      ; EBX = UTF8CharacterCode
      ; Code Range 3 in UTF-8 must be at most 0x0000FFFF
      ; and all other bits must be zero
      ; General format of a 3-bytes-long UTF-8 code is
      ; in binary: 00000000, 1110xxxx, 10xxxxxx, 10xxxxxx
      TEST    EBX , ~(0x0000FFFF)
      JNZ     .DetectCodeRange4
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b
      ; EAX = Byte1
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000011111100000000b
      OR      ECX , 00000000000000001000000000000000b
      ; ECX = Byte2
      MOV     EDX , EBX
      SHL     EDX , 0x00000004
      AND     EDX , 00000000000011110000000000000000b
      OR      EDX , 00000000111000000000000000000000b
      ; Now mix them all together
      OR      EAX , ECX
      OR      EAX , EDX
      JMP     .EP
      
    
    
    .DetectCodeRange4:
      ; EBX = UTF8CharacterCode
      ; Code Range 4 in UTF-8 must be at most 0x0010FFFF and
      ; all other bits must be zero
      ; General format of a 4-bytes-long UTF-8 code is
      ; in binary: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b
      ; EAX = Byte1
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000011111100000000b
      OR      ECX , 00000000000000001000000000000000b
      ; ECX = Byte2
      MOV     EDX , EBX
      SHL     EDX , 0x00000004
      AND     EDX , 00000000001111110000000000000000b
      OR      EDX , 00000000100000000000000000000000b
      ; EDX = Byte3
      SHL     EBX , 0x00000006
      AND     EBX , 00000111000000000000000000000000b
      OR      EBX , 11110000000000000000000000000000b
      ; EBX = Byte4
      OR      EAX , EBX
      OR      EAX , ECX
      OR      EAX , EDX

    .EP:
      POP     EBP
      POP     EDX
      POP     ECX
      POP     EBX
    RET     0x04
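For anyone who wants to cross-check the output in C, here is a small sketch (utf8_encode_dword is just an illustrative name, not part of my OS) that packs the bytes into a DWORD the same way the assembly above does, with the lead byte in the most significant used byte:

Code: Select all

#include <stdint.h>

/* Mirrors __UTF8EncodeCharacter: returns the UTF-8 bytes of cp packed
   into a DWORD (lead byte highest), or 0xFFFFFFFF if cp > 0x10FFFF. */
static uint32_t utf8_encode_dword(uint32_t cp)
{
    if (cp <= 0x7F)
        return cp;                                    /* 1 byte  */
    if (cp <= 0x7FF)
        return ((0xC0 | (cp >> 6)) << 8)              /* 2 bytes */
             |  (0x80 | (cp & 0x3F));
    if (cp <= 0xFFFF)
        return ((0xE0 | (cp >> 12)) << 16)            /* 3 bytes */
             | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
             |  (0x80 | (cp & 0x3F));
    if (cp <= 0x10FFFF)
        return ((uint32_t)(0xF0 | (cp >> 18)) << 24)  /* 4 bytes */
             | ((0x80 | ((cp >> 12) & 0x3F)) << 16)
             | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
             |  (0x80 | (cp & 0x3F));
    return 0xFFFFFFFF;                                /* invalid */
}
For example, utf8_encode_dword(0xE9) ('é', U+00E9) returns 0x0000C3A9, i.e. the lead byte 0xC3 followed by the continuation byte 0xA9.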
Last edited by XCHG on Sat Jun 16, 2007 11:53 am, edited 1 time in total.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Post by Colonel Kernel »

jnc100 wrote:Linux, on the other hand, uses UTF-8.
In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
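For example, a quick check (the result depends on the compiler and its flags):

Code: Select all

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Typically prints 4 with GCC on Linux (UTF-32 code units)
       and 2 with MSVC on Windows (UTF-16 code units). */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}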
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
User avatar
os64dev
Member
Posts: 553
Joined: Sat Jan 27, 2007 3:21 pm
Location: Best, Netherlands

Post by os64dev »

Colonel Kernel wrote:
jnc100 wrote:Linux, on the other hand, uses UTF-8.
In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
Or UTF-16, when wchar_t is set to 2 bytes on the command line (GCC's -fshort-wchar).
Author of COBOS
User avatar
XCHG
Member
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Suppose I want to write the character 'A' followed by a character with character code 0x83 as a file name. Now 'A' is, in binary:

Code: Select all

(A)
0100,0001 (UTF-8)
And 0x83 is:

Code: Select all

(BYTE1)      (BYTE2)      (BYTE3)
1100,0000    1000,0010    1000,0011 (UTF-8)

Now how should these characters appear in the file? Should they be:

Code: Select all

(A)          (BYTE1)      (BYTE2)      (BYTE3)
0100,0001    1100,0000    1000,0010    1000,0011
or like this:

Code: Select all

(A)          (BYTE3)      (BYTE2)      (BYTE1)
0100,0001    1000,0011    1000,0010    1100,0000
Could anyone please give me a hint related to the order in which these bytes should be written to the volume?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

First, the encoding is wrong: 0x80-0x7FF encodes to two bytes:
110xxxxx 10xxxxxx
and they appear in that order.
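Worked out for the 0x83 example above:

Code: Select all

0x83              = 10000011 binary (top 5 bits = 00010, low 6 bits = 000011)
lead byte         = 110 00010 = 0xC2
continuation byte = 10 000011 = 0x83
written to disk as 0xC2 0x83, in that order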
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
XCHG
Member
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Oh my bad. Thank you so much. Appreciations.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.