UTF-8 support in a File System
I am starting to write basic drafts for the implementation of my own File System and I am thinking of supporting the UTF-8 encoding format. I have one question though: should all the characters take 2 or 4 bytes of space even if they fall in the range 0x00 to 0x7F, or should I detect which range each character code falls in and allocate just enough bytes for it, with the appropriate marker bits set in the most significant bits of each byte?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
XCHG wrote: should all the characters take 2 or 4 bytes of space
If all characters took 2 or 4 bytes then it wouldn't be UTF-8. Several different methods of encoding a Unicode character exist. For example, UTF-32 uses 4 bytes for each character. UTF-16 uses either 2 or 4 bytes, whereas UTF-8 uses at least 1 and at most 4. Although the definitions have become a bit muddled, generally the encodings whose names start with UTF are able to describe the entire Unicode range. This differs somewhat from Microsoft's standard implementation of a wchar, which is two bytes; treated as a fixed-width unit rather than as extensible UTF-16, it can represent only the 65,536 code points of the Unicode Basic Multilingual Plane (BMP). Linux, on the other hand, uses UTF-8.
When working with disk files, you also need to define whether the data is written in little-endian or big-endian. I believe UTF-8 algorithms do this by making the first character a non-breaking space which can only be expressed in one way.
If you're interested in UTF-8, then the encoding algorithm can be found here.
Regards,
John.
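To make the byte counts above concrete, here is a minimal C sketch of how many bytes a single code point occupies in each encoding. The function names (`utf8_encoded_size`, `utf16_encoded_size`, `utf32_encoded_size`, `utf8_string_size`) are made up for this illustration and do not come from any library; the sketch also assumes the file system would want to reject surrogates (0xD800 to 0xDFFF) and anything above 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Highest valid Unicode code point. */
#define UNICODE_MAX 0x10FFFFu

/* Returns 1 if cp is in range and is not a UTF-16 surrogate. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= UNICODE_MAX && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Bytes needed to store cp in UTF-8 (1 to 4), or 0 if cp is not encodable. */
size_t utf8_encoded_size(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    if (cp <= 0x7Fu)   return 1;
    if (cp <= 0x7FFu)  return 2;
    if (cp <= 0xFFFFu) return 3;
    return 4;
}

/* UTF-16: 2 bytes inside the BMP, 4 bytes (a surrogate pair) above it. */
size_t utf16_encoded_size(uint32_t cp)
{
    if (!is_scalar_value(cp)) return 0;
    return cp <= 0xFFFFu ? 2 : 4;
}

/* UTF-32: always 4 bytes. */
size_t utf32_encoded_size(uint32_t cp)
{
    return is_scalar_value(cp) ? 4 : 0;
}

/* Total UTF-8 bytes for an array of code points, e.g. a file name.
   Returns 0 if any code point is not encodable (or if count is 0). */
size_t utf8_string_size(const uint32_t *cps, size_t count)
{
    assert(cps != NULL || count == 0);
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        size_t n = utf8_encoded_size(cps[i]);
        if (n == 0) return 0;
        total += n;
    }
    return total;
}
```

So a name made of 'A' (0x41) and 0x83 would need 1 + 2 = 3 bytes in UTF-8, versus 4 bytes in UTF-16 and 8 bytes in UTF-32.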
jnc100 wrote: When working with disk files, you also need to define whether the data is written in little-endian or big-endian. I believe UTF-8 algorithms do this by making the first character a non-breaking space which can only be expressed in one way.
Not exactly. If you're using UTF-16 or UTF-32 you need to define whether it's little-endian or big-endian. If you're using UTF-8 it doesn't matter, since UTF-8 encodes to single-byte code units, not to multi-byte code units like UTF-16 and UTF-32.
To be specific, there are 3 commonly used variants of both UTF-16 and UTF-32: UTF-16LE, UTF-16BE and UTF-16 with a BOM (and similarly for UTF-32). The first two are simply fixed little-endian and big-endian, respectively. The last starts with a BOM, which is the encoding of the zero-width no-break space character, U+FEFF (the byte-swapped value, 0xFFFE, is not a valid character). Its bytes appear as FE FF in big-endian UTF-16 and as FF FE in little-endian UTF-16, which is why it can serve as a byte-order mark.
However, for a file system a byte order can simply be defined if using UTF-16/32, or UTF-8 can be used without needing such an order to be specified.
(Note: Although a BOM is not necessary for UTF-8 text to be interpreted correctly (since the byte order is strictly defined), the UTF-8 encoding of U+FEFF (the bytes EF BB BF) is often used at the start of a file as a signature to indicate that the file is encoded in UTF-8 at all. Again, this isn't necessary for a file system: you can simply define that UTF-8 is used and use the raw UTF-8 encoded text as filenames or whatever.)
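For anyone who does need to sniff the byte order of text data, the following is a rough C sketch of BOM detection based on the byte patterns of U+FEFF. The `bom_kind` enum and `detect_bom` function are hypothetical names invented for this example. One caveat worth stating: the bytes FF FE 00 00 are ambiguous (a UTF-32LE BOM, or a UTF-16LE BOM followed by U+0000), and this sketch simply assumes UTF-32LE in that case.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    BOM_NONE,
    BOM_UTF8,      /* EF BB BF    */
    BOM_UTF16_BE,  /* FE FF       */
    BOM_UTF16_LE,  /* FF FE       */
    BOM_UTF32_BE,  /* 00 00 FE FF */
    BOM_UTF32_LE   /* FF FE 00 00 */
} bom_kind;

/* Inspects the first bytes of buf and reports which BOM, if any, is present. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
       begins with the same two bytes as the UTF-16LE BOM. */
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
                    buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32_BE;
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
                    buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32_LE;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A file system that fixes its name encoding in the on-disk format never needs this; it only matters for data whose encoding is not known in advance.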
Thank you guys. I had read that Wiki article and I just coded a function that can encode a character into UTF-8 up to 4 bytes depending on the character code. The prototype for the function is:
Code:
DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
Where the [UTF8CharacterCode] parameter is the character code that must be converted to a UTF-8 character in the result of the function. I will also put the function here because it might help somebody having difficulties encoding characters to UTF-8 in (N)ASM:
Code:
; ——————————————————————————————————————————————————
__UTF8EncodeCharacter:
; DWORD __UTF8EncodeCharacter (DWORD UTF8CharacterCode); StdCall;
; Returns the UTF-8 encoding of the given character code, packed into EAX
; with the lead byte in the most significant occupied byte
; Returns 0xFFFFFFFF upon failure
PUSH EBX
PUSH ECX
PUSH EDX
PUSH EBP
MOV EBP , ESP
MOV EAX , 0xFFFFFFFF
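; [EBP + 0x14] = UTF8CharacterCode
; (4 saved registers + return address = 0x14 bytes above EBP)
; NASM takes the size keyword without MASM's PTR
MOV EBX , DWORD [EBP + 0x14]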
; See if the character is valid (Must be less than
; or equal to 0x0010FFFF); otherwise return 0xFFFFFFFF
CMP EBX , 0x0010FFFF
JA .EP
.DetectCodeRange1:
; EBX = UTF8CharacterCode
; Code Range 1 in UTF-8 must be at most 0x0000007F
; and all other bits must be zero
TEST EBX , ~(0x0000007F)
JNZ .DetectCodeRange2
MOV EAX , EBX
JMP .EP
.DetectCodeRange2:
; EBX = UTF8CharacterCode
; Code Range 2 in UTF-8 must be at most 0x000007FF
; and all other bits must be zero
; General format of a 2-bytes-long UTF-8 code is
; in binary: 00000000, 00000000, 110xxxxx, 10xxxxxx
TEST EBX , ~(0x000007FF)
JNZ .DetectCodeRange3
MOV EAX , EBX
AND EAX , 00000000000000000000000000111111b
OR EAX , 00000000000000000000000010000000b
; EAX = Least Significant Byte of the character
MOV ECX , EBX
SHL ECX , 0x00000002
AND ECX , 00000000000000000001111100000000b
OR ECX , 00000000000000001100000000000000b
OR EAX , ECX
JMP .EP
.DetectCodeRange3:
; EBX = UTF8CharacterCode
; Code Range 3 in UTF-8 must be at most 0x0000FFFF
; and all other bits must be zero
; General format of a 3-bytes-long UTF-8 code is
; in binary: 00000000, 1110xxxx, 10xxxxxx, 10xxxxxx
TEST EBX , ~(0x0000FFFF)
JNZ .DetectCodeRange4
MOV EAX , EBX
AND EAX , 00000000000000000000000000111111b
OR EAX , 00000000000000000000000010000000b
; EAX = Byte1
MOV ECX , EBX
SHL ECX , 0x00000002
AND ECX , 00000000000000000011111100000000b
OR ECX , 00000000000000001000000000000000b
; ECX = Byte2
MOV EDX , EBX
SHL EDX , 0x00000004
AND EDX , 00000000000011110000000000000000b
OR EDX , 00000000111000000000000000000000b
; Now mix them all together
OR EAX , ECX
OR EAX , EDX
JMP .EP
.DetectCodeRange4:
; EBX = UTF8CharacterCode
; Code Range 4 in UTF-8 must be at most 0x0010FFFF and
; all other bits must be zero
; General format of a 4-bytes-long UTF-8 code is
; in binary: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx
MOV EAX , EBX
AND EAX , 00000000000000000000000000111111b
OR EAX , 00000000000000000000000010000000b
; EAX = Byte1
MOV ECX , EBX
SHL ECX , 0x00000002
AND ECX , 00000000000000000011111100000000b
OR ECX , 00000000000000001000000000000000b
; ECX = Byte2
MOV EDX , EBX
SHL EDX , 0x00000004
AND EDX , 00000000001111110000000000000000b
OR EDX , 00000000100000000000000000000000b
; EDX = Byte3
SHL EBX , 0x00000006
AND EBX , 00000111000000000000000000000000b
OR EBX , 11110000000000000000000000000000b
; EBX = Byte4
OR EAX , EBX
OR EAX , ECX
OR EAX , EDX
.EP:
POP EBP
POP EDX
POP ECX
POP EBX
RET 0x04
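As a cross-check for the routine above, here is a C sketch that is meant to produce the same packed DWORD for valid input. The names `utf8_encode_packed`, `utf8_unpack` and `UTF8_ENCODE_FAILED` are invented for this example. One deliberate difference, which is my assumption and not part of the assembly: the sketch also rejects the surrogate range 0xD800 to 0xDFFF, since those values are not valid in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UTF8_ENCODE_FAILED 0xFFFFFFFFu

/* Packs the UTF-8 bytes of cp into one 32-bit value, lead byte in the
   most significant occupied byte (e.g. 0x83 becomes 0x0000C283). */
uint32_t utf8_encode_packed(uint32_t cp)
{
    if (cp > 0x10FFFFu)
        return UTF8_ENCODE_FAILED;
    /* Not in the assembly version: surrogates are rejected here. */
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return UTF8_ENCODE_FAILED;

    if (cp <= 0x7Fu)                       /* 0xxxxxxx */
        return cp;
    if (cp <= 0x7FFu)                      /* 110xxxxx 10xxxxxx */
        return ((0xC0u | (cp >> 6)) << 8)
             |  (0x80u | (cp & 0x3Fu));
    if (cp <= 0xFFFFu)                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        return ((0xE0u | (cp >> 12)) << 16)
             | ((0x80u | ((cp >> 6) & 0x3Fu)) << 8)
             |  (0x80u | (cp & 0x3Fu));
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    return ((0xF0u | (cp >> 18)) << 24)
         | ((0x80u | ((cp >> 12) & 0x3Fu)) << 16)
         | ((0x80u | ((cp >> 6) & 0x3Fu)) << 8)
         |  (0x80u | (cp & 0x3Fu));
}

/* Splits a packed value into bytes in stream order (lead byte first).
   Returns the number of bytes written to out, or 0 on failure. */
size_t utf8_unpack(uint32_t packed, unsigned char out[4])
{
    assert(out != NULL);
    if (packed == UTF8_ENCODE_FAILED)
        return 0;
    size_t n = packed > 0xFFFFFFu ? 4
             : packed > 0xFFFFu   ? 3
             : packed > 0xFFu     ? 2 : 1;
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(packed >> (8 * (n - 1 - i)));
    return n;
}
```

Because the lead byte sits in the most significant position, storing the returned DWORD directly on a little-endian x86 machine would reverse the bytes; something like `utf8_unpack` is needed before the bytes go to disk.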
Last edited by XCHG on Sat Jun 16, 2007 11:53 am, edited 1 time in total.
- Colonel Kernel
jnc100 wrote: Linux, on the other hand, uses UTF-8.
In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
jnc100 wrote: Linux, on the other hand, uses UTF-8.
Colonel Kernel wrote: In the filesystem it does, but in-memory Unicode strings in C/C++ programs are usually UTF-32 because GCC defines wchar_t to be 4 bytes.
Or UTF-16, when wchar_t is specified as 2 bytes on the command line (GCC's -fshort-wchar option).
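To show what a 2-byte wchar_t implies in practice, here is a small C sketch of UTF-16 encoding: code points inside the BMP fit in one 16-bit unit, while anything above needs a surrogate pair. The function name `utf16_encode` is hypothetical, and the sketch uses `uint16_t` rather than `wchar_t` so that it behaves the same regardless of compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes cp as UTF-16 code units. Returns the number of units written
   to out (1 or 2), or 0 if cp is a surrogate or above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp <= 0xFFFFu) {
        out[0] = (uint16_t)cp;          /* BMP: one unit, value unchanged */
        return 1;
    }
    cp -= 0x10000u;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00u | (cp & 0x3FFu));  /* low surrogate  */
    return 2;
}
```

For example, 0x1F600 comes out as the pair D83D DE00, which is exactly the kind of character a fixed 2-byte representation cannot hold in a single unit.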
Author of COBOS
Suppose I should write the character 'A' followed by a character with the character code of 0x83 as a file name. Now 'A' is, in binary:
Code:
(A)
0100,0001 (UTF-8)
And 0x83 is:
Code:
(BYTE1) (BYTE2) (BYTE3)
1100,0000 1000,0010 1000,0011 (UTF-8)
Now how should these characters appear in the file? Should they be:
Code:
(A) (BYTE1) (BYTE2) (BYTE3)
0100,0001 1100,0000 1000,0010 1000,0011
or like this:
Code:
(A) (BYTE3) (BYTE2) (BYTE1)
0100,0001 1000,0011 1000,0010 1100,0000
Could anyone please give me a hint related to the order in which these bytes should be written to the volume?
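One way to reason about the byte order is to look at it from the reader's side: a UTF-8 decoder learns the length of a sequence from its first byte, so the lead byte has to come first in the stream, followed by the continuation bytes in order. The C sketch below illustrates that; `utf8_sequence_length` and `utf8_decode` are names invented for this example. Note also that 0x83 falls in the two-byte range (0x80 to 0x7FF), so as far as I can tell its encoding is C2 83 (1100,0010 1000,0011) rather than the three bytes listed above; a lead byte of C0 would be an overlong form, which strict decoders reject.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length announced by a lead byte, or 0 if the byte cannot
   start a sequence (continuation byte, overlong lead C0/C1, or F5..FF). */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)                  return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Decodes one code point from buf. Returns the number of bytes consumed,
   or 0 if the bytes are not a well-formed UTF-8 sequence. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};

    assert(buf != NULL && cp != NULL);
    if (len == 0) return 0;

    size_t n = utf8_sequence_length(buf[0]);
    if (n == 0 || n > len) return 0;

    uint32_t value = buf[0] & lead_mask[n];
    for (size_t i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
        value = (value << 6) | (buf[i] & 0x3Fu);
    }
    /* Reject overlong forms, surrogates and out-of-range values. */
    if (value < min_value[n] || value > 0x10FFFFu ||
        (value >= 0xD800u && value <= 0xDFFFu))
        return 0;

    *cp = value;
    return n;
}
```

With this sketch, the bytes 41 C2 83 decode as 'A' followed by 0x83, while the reversed order 41 83 C2 fails at the second byte because 0x83 is a continuation byte and cannot start a sequence.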