UTF-8 support in a File System

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
User avatar
XCHG
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

UTF-8 support in a File System

Post by XCHG »

I am starting to write basic drafts for the implementation of my own File System and I am thinking of supporting the UTF-8 encoding format. I have one question though: should all characters take 2 or 4 bytes of space even if they fall in the range 0x00 to 0x7F, or should I detect which range a character falls into and allocate only as many bytes as it needs, with the appropriate marker bits set in the most significant bits of the high-order byte?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
mathematician
Member
Posts: 437
Joined: Fri Dec 15, 2006 5:26 pm
Location: Church Stretton Uk

Post by mathematician »

As I understand it, ASCII characters should be encoded as single-byte characters in order to ensure backwards compatibility.
jnc100
Member
Posts: 775
Joined: Mon Apr 09, 2007 12:10 pm
Location: London, UK
Contact:

Post by jnc100 »

XCHG wrote:should all the characters take 2 or 4 bytes of space
If all characters took 2 or 4 bytes then it wouldn't be UTF-8. Several different ways of encoding a Unicode character exist. For example, UTF-32 uses 4 bytes for every character, UTF-16 uses at least 2 and at most 4, whereas UTF-8 uses at least one and at most 4. Although the definitions have become a bit muddled, the encodings whose names start with UTF are generally able to describe the entire Unicode range. This differs somewhat from Microsoft's standard implementation of wchar_t, which is two bytes but not extensible like UTF-16, and is therefore only able to represent the 65,536 code points of the Unicode Basic Multilingual Plane (BMP). Linux, on the other hand, uses UTF-8.
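To illustrate the ranges, here is a minimal C sketch (the name utf8_length is purely illustrative) that gives the number of bytes UTF-8 needs for a given code point:

Code: Select all

#include <stdint.h>

/* Number of bytes UTF-8 needs to encode a given Unicode code point.
   Returns 0 for values above 0x10FFFF, which are not encodable. */
static int utf8_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;   /* plain ASCII, stored unchanged  */
    if (cp <= 0x7FF)    return 2;   /* 110xxxxx 10xxxxxx              */
    if (cp <= 0xFFFF)   return 3;   /* 1110xxxx 10xxxxxx 10xxxxxx     */
    if (cp <= 0x10FFFF) return 4;   /* 11110xxx plus 3 continuations  */
    return 0;
}
(The surrogate range 0xD800-0xDFFF is not valid in any UTF; the sketch ignores that detail.)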

When working with disk files, you also need to define whether the data is written in little-endian or big-endian byte order. I believe UTF-8 algorithms do this by making the first character a non-breaking space, which can only be expressed in one way.

If you're interested in UTF-8, then the encoding algorithm can be found here.

Regards,
John.
urxae
Member
Posts: 149
Joined: Sun Jul 30, 2006 8:16 am
Location: The Netherlands

Post by urxae »

jnc100 wrote:When working with disk files, you also need to define whether the data is written in little-endian or big-endian. I believe UTF-8 algorithms do this by making the first character a non-breaking space which can only be expressed in one way.
Not exactly. If you're using UTF-16 or UTF-32 you need to define whether it's little-endian or big-endian. If you're using UTF-8 it doesn't matter, since UTF-8 encodes to single-byte code units, not to multi-byte code units like UTF-16 and UTF-32.
To be specific, there are 3 commonly used variants of both UTF-16 and UTF-32: UTF-16LE, UTF-16BE and UTF-16 with BOM (and similarly for UTF-32). The first two are simply fixed little-endian and big-endian, respectively. The latter starts with a BOM, which is the encoding of the zero-width no-break space character (U+FEFF). Its bytes come out in a different order in little-endian and big-endian, which is why it can serve as a byte-order mark.
However, for a file system a byte order can simply be defined if using UTF-16/32, or UTF-8 can be used without needing such an order to be specified.

(Note: Although a BOM is not necessary for UTF-8 text to be interpreted correctly (since the byte order is strictly defined), the UTF-8 encoding of U+FEFF is often used at the start of a file as a signature to indicate that the file is UTF-8 encoded at all. Again, this isn't necessary for a file system: you can simply define that UTF-8 is used and store the raw UTF-8 encoded text as filenames or whatever.)
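For illustration, here is a minimal C sketch of BOM sniffing (utf_bom_name is just an illustrative helper name, not a standard function); note that the UTF-32LE test has to come before the UTF-16LE one, because FF FE 00 00 also begins with FF FE:

Code: Select all

#include <stddef.h>
#include <string.h>

/* Returns the name of the encoding indicated by a BOM at the start of
   buf, or NULL if no recognisable BOM is present. */
static const char *utf_bom_name(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return NULL;
}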
jnc100
Member
Posts: 775
Joined: Mon Apr 09, 2007 12:10 pm
Location: London, UK
Contact:

Post by jnc100 »

:oops: It's been a while since I read the specifications. Sorry about that...

Regards,
John.
User avatar
XCHG
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Thank you guys. I had read that wiki article, and I have just coded a function that can encode a character into UTF-8, using up to 4 bytes depending on the character code. The prototype for the function is:

Code: Select all

DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
Where the [UTF8CharacterCode] parameter is the character code to be converted to its UTF-8 encoding, which is returned as the function result. I will also post the function itself, because it might help somebody having difficulties encoding characters to UTF-8 in (N)ASM:

Code: Select all

; --------------------------------------------------
  __UTF8EncodeCharacter:
    ; DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
    ; Returns the encoded UTF-8 compliant character for the entered character code
    ; Returns 0xFFFFFFFF upon failure
    
    PUSH    EBX
    PUSH    ECX
    PUSH    EDX
    PUSH    EBP
    MOV     EBP , ESP
    MOV     EAX , 0xFFFFFFFF
    ; [EBP + 0x14] = UTF8CharacterCode
    MOV     EBX , DWORD PTR [EBP + 0x14]
    ; See if the character is valid (Must be less than
    ; or equal to 0x0010FFFF)
    CMP     EBX , 0x0010FFFF
    JBE     .DetectCodeRange1
    JMP     .EP
    
    
    .DetectCodeRange1:
      ; EBX = UTF8CharacterCode
      ; Code Range 1 in UTF-8 must be at most 0x0000007F
      ; and all other bits must be zero
      TEST    EBX , ~(0x0000007F)
      JNZ     .DetectCodeRange2
      MOV     EAX , EBX
      JMP     .EP
    
    
    .DetectCodeRange2:
      ; EBX = UTF8CharacterCode
      ; Code Range 2 in UTF-8 must be at most 0x000007FF
      ; and all other bits must be zero
      ; General format of a 2-bytes-long UTF-8 code is
      ; in binary: 00000000, 00000000, 110xxxxx, 10xxxxxx
      TEST    EBX , ~(0x000007FF)
      JNZ     .DetectCodeRange3
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b 
      ; EAX = Least Significant Byte of the character
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000001111100000000b
      OR      ECX , 00000000000000001100000000000000b
      OR      EAX , ECX
      JMP     .EP

    .DetectCodeRange3:
      ; EBX = UTF8CharacterCode
      ; Code Range 3 in UTF-8 must be at most 0x0000FFFF
      ; and all other bits must be zero
      ; General format of a 3-bytes-long UTF-8 code is
      ; in binary: 00000000, 1110xxxx, 10xxxxxx, 10xxxxxx
      TEST    EBX , ~(0x0000FFFF)
      JNZ     .DetectCodeRange4
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b
      ; EAX = Byte1
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000011111100000000b
      OR      ECX , 00000000000000001000000000000000b
      ; ECX = Byte2
      MOV     EDX , EBX
      SHL     EDX , 0x00000004
      AND     EDX , 00000000000011110000000000000000b
      OR      EDX , 00000000111000000000000000000000b
      ; Now mix them all together
      OR      EAX , ECX
      OR      EAX , EDX
      JMP     .EP
      
    
    
    .DetectCodeRange4:
      ; EBX = UTF8CharacterCode
      ; Code Range 4 in UTF-8 must be at most 0x0010FFFF and
      ; all other bits must be zero
      ; General format of a 4-bytes-long UTF-8 code is
      ; in binary: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx
      MOV     EAX , EBX
      AND     EAX , 00000000000000000000000000111111b
      OR      EAX , 00000000000000000000000010000000b
      ; EAX = Byte1
      MOV     ECX , EBX
      SHL     ECX , 0x00000002
      AND     ECX , 00000000000000000011111100000000b
      OR      ECX , 00000000000000001000000000000000b
      ; ECX = Byte2
      MOV     EDX , EBX
      SHL     EDX , 0x00000004
      AND     EDX , 00000000001111110000000000000000b
      OR      EDX , 00000000100000000000000000000000b
      ; EDX = Byte3
      SHL     EBX , 0x00000006
      AND     EBX , 00000111000000000000000000000000b
      OR      EBX , 11110000000000000000000000000000b
      ; EBX = Byte4
      OR      EAX , EBX
      OR      EAX , ECX
      OR      EAX , EDX

    .EP:
      POP     EBP
      POP     EDX
      POP     ECX
      POP     EBX
    RET     0x04
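For anyone who wants to cross-check the output in C, here is a small sketch (utf8_encode_dword is just an illustrative name, not part of my OS) that packs the bytes into a DWORD the same way the assembly above does, with the lead byte in the most significant used byte:

Code: Select all

#include <stdint.h>

/* Mirrors __UTF8EncodeCharacter: returns the UTF-8 bytes of cp packed
   into a DWORD (lead byte highest), or 0xFFFFFFFF if cp > 0x10FFFF. */
static uint32_t utf8_encode_dword(uint32_t cp)
{
    if (cp <= 0x7F)
        return cp;                                    /* 1 byte  */
    if (cp <= 0x7FF)
        return ((0xC0 | (cp >> 6)) << 8)              /* 2 bytes */
             |  (0x80 | (cp & 0x3F));
    if (cp <= 0xFFFF)
        return ((0xE0 | (cp >> 12)) << 16)            /* 3 bytes */
             | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
             |  (0x80 | (cp & 0x3F));
    if (cp <= 0x10FFFF)
        return ((uint32_t)(0xF0 | (cp >> 18)) << 24)  /* 4 bytes */
             | ((0x80 | ((cp >> 12) & 0x3F)) << 16)
             | ((0x80 | ((cp >> 6) & 0x3F)) << 8)
             |  (0x80 | (cp & 0x3F));
    return 0xFFFFFFFF;                                /* invalid */
}
For example, utf8_encode_dword(0xE9) ('é', U+00E9) returns 0x0000C3A9, i.e. the lead byte 0xC3 followed by the continuation byte 0xA9.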
Last edited by XCHG on Sat Jun 16, 2007 11:53 am, edited 1 time in total.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Post by Colonel Kernel »

jnc100 wrote:Linux, on the other hand, uses UTF-8.
In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
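For example, a quick check (the result depends on the compiler and its flags):

Code: Select all

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Typically prints 4 with GCC on Linux (UTF-32 code units)
       and 2 with MSVC on Windows (UTF-16 code units). */
    printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}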
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
User avatar
os64dev
Member
Posts: 553
Joined: Sat Jan 27, 2007 3:21 pm
Location: Best, Netherlands

Post by os64dev »

Colonel Kernel wrote:
jnc100 wrote:Linux, on the other hand, uses UTF-8.
In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
Or UTF-16, when wchar_t is set to 2 bytes on the command line (GCC's -fshort-wchar).
Author of COBOS
User avatar
XCHG
Member
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Suppose I want to write the character 'A' followed by a character with character code 0x83 as a file name. Now 'A' is, in binary:

Code: Select all

(A)
0100,0001 (UTF-8)
And 0x83 is:

Code: Select all

(BYTE1)      (BYTE2)      (BYTE3)
1100,0000    1000,0010    1000,0011 (UTF-8)

Now how should these characters appear in the file? Should they be:

Code: Select all

(A)          (BYTE1)      (BYTE2)      (BYTE3)
0100,0001    1100,0000    1000,0010    1000,0011
or like this:

Code: Select all

(A)          (BYTE3)      (BYTE2)      (BYTE1)
0100,0001    1000,0011    1000,0010    1100,0000
Could anyone please give me a hint related to the order in which these bytes should be written to the volume?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
User avatar
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Post by Combuster »

First, the encoding is wrong: 0x80-0x7FF encodes to two bytes:
110xxxxx 10xxxxxx
and they appear in that order.
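Worked out for the 0x83 example above:

Code: Select all

0x83              = 10000011 binary (top 5 bits = 00010, low 6 bits = 000011)
lead byte         = 110 00010 = 0xC2
continuation byte = 10 000011 = 0x83
written to disk as 0xC2 0x83, in that order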
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
XCHG
Member
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin
Contact:

Post by XCHG »

Oh my bad. Thank you so much. Appreciations.
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.