UTF-8 Encoding

XCHG
Member
Posts: 416
Joined: Sat Nov 25, 2006 3:55 am
Location: Wisconsin

UTF-8 Encoding

Post by XCHG »

I just found a bug in my [__UTF8EncodeCharacter] function, but when I compare its results to what the Windows OS gives me, either I am wrong or Windows is. I am trying to encode the character • with the character code 0x95.

According to the UTF-8 encoding, this character falls into Code Range 2 (0x80 through 0x7FF), which needs 2 bytes to encode (while Windows is using 3 bytes for this character). 0x95 is 10010101 in binary. Taking the 6 bits from the right side (010101) and ORing them with 10000000b gives us 10010101 as the High Order Byte. Taking the remaining leftmost 2 bits (10) of the character code and ORing them with 11000000b gives us 11000010 as the Low Order Byte.

So now we should have:

          +-------------------------------------------+
          |  Byte#3  |  Byte#2  |  Byte#1  |  Byte#0  |
          |----------|----------|----------|----------|
  Binary  | 00000000 | 00000000 | 10010101 | 11000010 |
          |----------|----------|----------|----------|
  Hex     |   0x00   |   0x00   |   0x95   |   0xC2   |
          +-------------------------------------------+
                           0x000095C2
              
That is exactly the value that my function returns but Windows encodes the character • in 3 bytes 0x00E280A2.
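The two-byte steps described above can be sketched in Python as follows. This is only an illustration, not the actual [__UTF8EncodeCharacter] code; the names `utf8_encode_two_byte` and `pack_little_endian` are made up for this example, and the packing assumes the first encoded byte goes into Byte#0 as in the table.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 only covers 0x80..0x7FF")
    # Lead byte: 110xxxxx, carrying the bits above the low six.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: 10xxxxxx, carrying the low six bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


def pack_little_endian(encoded: bytes) -> int:
    """Pack the encoded bytes into one integer, first byte in Byte#0."""
    return int.from_bytes(encoded, "little")


encoded = utf8_encode_two_byte(0x95)
print(encoded.hex())                        # c295
print(f"0x{pack_little_endian(encoded):08X}")  # 0x000095C2
```

In the byte stream the lead byte 0xC2 comes first and 0x95 second; the value 0x000095C2 is the same pair viewed as a little-endian integer.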

Could anybody tell me whether I am doing something wrong or it is Windows' fault?
On the field with sword and shield amidst the din of dying of men's wails. War is waged and the battle will rage until only the righteous prevails.
urxae
Member
Posts: 149
Joined: Sun Jul 30, 2006 8:16 am
Location: The Netherlands

Re: UTF-8 Encoding

Post by urxae »

XCHG wrote: I just found a bug in my [__UTF8EncodeCharacter] function, but when I compare its results to what the Windows OS gives me, either I am wrong or Windows is. I am trying to encode the character • with the character code 0x95.

[snip encoding of 0x95 to 0x000095C2]

That is exactly the value that my function returns but Windows encodes the character • in 3 bytes 0x00E280A2.

Could anybody tell me whether I am doing something wrong or it is Windows' fault?
Well, according to my little program (written in D, which supports this stuff in the language/standard library) you're correct that character code 0x95 encodes to 'c2 95', and Windows is correct that "•" encodes to 'e2 80 a2'. :shock:
What you seem to have gotten wrong is either the character code or the character: it appears that "•" has character code 0x2022, and that character code 0x95 is something different. I've attached part of a screenshot showing how my terminal (gnome-terminal on Ubuntu) shows it, since my browser doesn't like to show it (understandable, as it's supposed to be a control code; see below).
According to the Unicode code charts (Latin-1 and General Punctuation) 0x95 is a control code named "MESSAGE WAITING", and 0x2022 is called "BULLET" (which is how the character you posted shows up).
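The same comparison can be reproduced with a few lines of Python (not the D program mentioned above, just a rough equivalent). The helper name `describe` is made up for this sketch; it reports the Unicode general category and the UTF-8 bytes for a code point.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str]:
    """Return (general category, UTF-8 bytes as hex) for a code point."""
    character = chr(code_point)
    category = unicodedata.category(character)  # "Cc" means control code
    return category, character.encode("utf-8").hex()


for code_point in (0x95, 0x2022):
    category, utf8_hex = describe(code_point)
    print(f"U+{code_point:04X} category={category} utf-8={utf8_hex}")
# U+0095 category=Cc utf-8=c295
# U+2022 category=Po utf-8=e280a2
```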
Attachment: u0095.png (930 Bytes)

Post by XCHG »

Thank you so much, urxae. I didn't have the General Punctuation code chart, and to be honest with you, I wouldn't have looked even if I had it. I was so confused about why Windows was encoding the Bullet character that way that I completely forgot about looking it up in the Unicode references.

So that settles it. Windows uses ASCII characters bigger than 0x7F to represent some important Unicode characters. Is that close to being on the money?
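One way to test that guess is sketched below. It assumes the 0x95 value came from the Windows-1252 code page, which the thread does not confirm; the helper name `utf8_via_code_page` is made up for this example. Strictly speaking ASCII stops at 0x7F, so values above that depend on whichever code page is in use.

```python
def utf8_via_code_page(byte_value: int, code_page: str) -> bytes:
    """Decode one byte with the given code page, then re-encode as UTF-8."""
    character = bytes([byte_value]).decode(code_page)
    return character.encode("utf-8")


# Assumption: Windows-1252 is the code page in question.
# There, byte 0x95 is the bullet, U+2022.
print(utf8_via_code_page(0x95, "cp1252").hex())   # e280a2

# Under Latin-1, byte 0x95 maps straight to the control code U+0095.
print(utf8_via_code_page(0x95, "latin-1").hex())  # c295
```

Under that assumption both results in the thread come out: three bytes when 0x95 is read as a Windows-1252 byte, two bytes when it is read as the Unicode character code 0x95.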