
UTF-8 Encoding

Posted: Sat Aug 04, 2007 2:19 am
by XCHG
I just found a bug in my [__UTF8EncodeCharacter] function, but when I compare the results to what the Windows OS gives me, either I or Windows is wrong. I am trying to encode the character • with the character code of 0x95.

According to the UTF-8 encoding, this character falls into Code Range 2 and needs 2 bytes to be encoded (while Windows is using 3 bytes to encode this character). 0x95 is 10010101 in binary. Taking the 6 bits from the right side (010101) and ORing them with 10000000b gives us 10010101 (the continuation byte) as the High Order Byte. Taking the remaining leftmost 2 bits (10) from the character code and ORing them with 11000000b gives us 11000010 (the lead byte) as the Low Order Byte.

So now we should have:

          +-------------------------------------------+
          |  Byte#3  |  Byte#2  |  Byte#1  |  Byte#0  |
          |----------|----------|----------|----------|
  Binary  | 00000000 | 00000000 | 10010101 | 11000010 |
          |----------|----------|----------|----------|
  Hex     |   0x00   |   0x00   |   0x95   |   0xC2   |
          +-------------------------------------------+
                           0x000095C2
              
That is exactly the value that my function returns but Windows encodes the character • in 3 bytes 0x00E280A2.
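For reference, the two-byte arithmetic described above can be sketched in Python. This is only an illustration of the bit manipulation, not the [__UTF8EncodeCharacter] routine itself; the name `encode_two_byte` is made up for this example.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    # Lead byte: 110xxxxx, carrying the upper 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: 10xxxxxx, carrying the lower 6 bits.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])


encoded = encode_two_byte(0x95)
print(encoded.hex(" "))  # c2 95

# Packed with the lead byte in the lowest position, as in the table above.
print(hex(int.from_bytes(encoded, "little")))  # 0x95c2
```

So for the code point 0x95 the arithmetic itself holds up: the byte sequence is C2 95.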

Could anybody tell me whether I am doing something wrong or it is Windows' fault?

Re: UTF-8 Encoding

Posted: Sat Aug 04, 2007 3:06 am
by urxae
XCHG wrote: I just found a bug in my [__UTF8EncodeCharacter] function, but when I compare the results to what the Windows OS gives me, either I or Windows is wrong. I am trying to encode the character • with the character code of 0x95.

[snip encoding of 0x95 to 0x000095C2]

That is exactly the value that my function returns but Windows encodes the character • in 3 bytes 0x00E280A2.

Could anybody tell me whether I am doing something wrong or it is Windows' fault?
Well, according to my little program (written in D, which supports this stuff in the language/standard library) you're correct that character code 0x95 encodes to 'c2 95', and Windows is correct that "•" encodes to 'e2 80 a2'. :shock:
What you seem to have gotten wrong is either the character code or the character: it appears that "•" has character code 0x2022, and that character code 0x95 is something different. I've attached part of a screenshot showing how my terminal (gnome-terminal on Ubuntu) shows it, since my browser doesn't like to show it (understandable, as it's supposed to be a control code, see below).
According to the Unicode code charts (Latin-1 and General Punctuation), 0x95 is a control code named "MESSAGE WAITING", and 0x2022 is called "BULLET" (which is how the character you posted shows up).
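The same check can be sketched in Python instead of D, using only the standard library. The hand-built three-byte sequence at the end is added purely for illustration and assumes the usual 1110xxxx 10xxxxxx 10xxxxxx layout for code points from 0x800 to 0xFFFF.

```python
import unicodedata

# Code point 0x95 is a control character; it takes two bytes in UTF-8.
print(chr(0x95).encode("utf-8").hex(" "))  # c2 95

# The bullet is code point 0x2022, which falls in the three-byte range.
bullet = "\u2022"
print(unicodedata.name(bullet))            # BULLET
print(bullet.encode("utf-8").hex(" "))     # e2 80 a2

# The same three bytes built by hand: 1110xxxx 10xxxxxx 10xxxxxx.
cp = ord(bullet)
manual = bytes([
    0xE0 | (cp >> 12),          # top 4 bits
    0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
    0x80 | (cp & 0x3F),         # low 6 bits
])
print(manual.hex(" "))                     # e2 80 a2
```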

Posted: Sat Aug 04, 2007 7:38 am
by XCHG
Thank you so much, urxae. I didn't have the General Punctuation Character Set list and, to be honest with you, I wouldn't have looked even if I had it. I was just so confused about why Windows was encoding the Bullet character in that way that I completely forgot about looking it up in the Unicode references.

So that says it. Windows uses ASCII characters bigger than 0x7F to represent some important UTF characters. Is that close to being on the money?
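One way to test that guess is to decode the single byte 0x95 with a legacy Windows code page and see which code point comes out. The sketch below assumes the code page involved is Windows-1252 (`cp1252` in Python); the thread does not say which code page is actually active on the machine.

```python
# Assumption: the byte 0x95 is being interpreted as Windows-1252.
raw = bytes([0x95])
decoded = raw.decode("cp1252")

print(f"U+{ord(decoded):04X}")            # U+2022
print(decoded.encode("utf-8").hex(" "))   # e2 80 a2
```

Under that assumption, 0x95 is a code page byte value rather than a Unicode code point: it maps to 0x2022, which is why the UTF-8 result is three bytes long.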