UTF-8 Encoding
Posted: Sat Aug 04, 2007 2:19 am
I just found a bug in my [__UTF8EncodeCharacter] function, but when I compare the results to what the Windows OS gives me, either my code or Windows is wrong. I am trying to encode the character • with the character code 0x95.
According to the UTF-8 encoding, this character falls into Code Range 2 and needs 2 bytes to be encoded (while Windows uses 3 bytes to encode it). 0x95 is 10010101 in binary. Taking the 6 rightmost bits (010101) and ORing them with 10000000b gives 10010101 as the high-order byte. Taking the leftmost 2 bits (10) of the character code and ORing them with 11000000b gives 11000010 as the low-order byte.
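To make the steps above concrete, here is the same bit arithmetic as a small C sketch (just an illustration of the calculation, not my actual [__UTF8EncodeCharacter] code; the function name and the packing into a 32-bit value are only there for the example):
Code:
#include <stdio.h>

/* Illustration only: 2-byte UTF-8 encoding for a character code in the
   range 0x80..0x7FF, packed into a 32-bit value the same way as in the
   table below (continuation byte in byte #1, lead byte in byte #0). */
unsigned int utf8_encode_2bytes(unsigned int code)
{
    unsigned char cont = (unsigned char)(0x80 | (code & 0x3F));        /* 10xxxxxx */
    unsigned char lead = (unsigned char)(0xC0 | ((code >> 6) & 0x1F)); /* 110xxxxx */
    return ((unsigned int)cont << 8) | lead;
}

int main(void)
{
    printf("0x%08X\n", utf8_encode_2bytes(0x95));   /* prints 0x000095C2 */
    return 0;
}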
So now we should have:
Code:
        +-------------------------------------------+
        |  Byte#3  |  Byte#2  |  Byte#1  |  Byte#0  |
        |----------|----------|----------|----------|
Binary  | 00000000 | 00000000 | 10010101 | 11000010 |
        |----------|----------|----------|----------|
Hex     |   0x00   |   0x00   |   0x95   |   0xC2   |
        +-------------------------------------------+

0x000095C2

That is exactly the value that my function returns, but Windows encodes the character • in 3 bytes: 0x00E280A2.
Could anybody tell me whether I am doing something wrong or whether it is Windows' fault?
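For anyone who wants to reproduce the Windows result, something like the following should show the bytes Windows itself produces (a minimal sketch using WideCharToMultiByte with CP_UTF8; the buffer size and exact parameters are just assumptions for illustration, not taken from my actual test code):
Code:
#include <stdio.h>
#include <windows.h>

int main(void)
{
    const wchar_t *bullet = L"•";   /* the bullet character as a wide (UTF-16) string */
    char utf8[8] = {0};

    /* Ask Windows to convert the wide string to UTF-8 (code page 65001). */
    int n = WideCharToMultiByte(CP_UTF8, 0, bullet, -1, utf8, sizeof(utf8), NULL, NULL);

    /* n includes the terminating null because the input length was passed as -1. */
    for (int i = 0; i + 1 < n; i++)
        printf("0x%02X ", (unsigned char)utf8[i]);
    printf("\n");                   /* prints 0xE2 0x80 0xA2 */
    return 0;
}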