manny wrote: My reply isn't to disagree. But my understanding was that UTF-16 was designed to work with null-terminated char arrays in C because no single byte (in a multi-byte char) was null. Here is how I understood UTF-16. The character 'A' would be encoded as follows, with the most significant bit telling you that it's multibyte.
10000000 01000001
Thoughts?
That's (basically) how UTF-8 works, so you're likely thinking of UTF-8.
The difference between UTF-8 and what you have above is that UTF-8 uses more than one bit to flag multi-byte characters:
A one-byte character just has the high bit zero (so any valid ASCII string is also a valid UTF-8 string):
0xxxxxxx
A multi-byte character has a variable-length prefix on the first byte indicating the number of following bytes, and then a two-bit prefix of "10" on each following byte (so that if you're looking at one of the following bytes, you know it isn't the start of a character); see the sketch after the bit patterns below.
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
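To make that concrete, here's a minimal sketch in C (my own illustration, not from the original discussion; the name utf8_seq_len is made up) of how a decoder reads the sequence length straight off the lead byte:

#include <stddef.h>

/* Hypothetical helper: return the total length in bytes of the UTF-8
   sequence whose lead byte is b, or 0 if b cannot start a sequence
   (i.e. it is a 10xxxxxx continuation byte or an invalid lead byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII, one byte     */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: one byte follows    */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: two bytes follow    */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: three bytes follow  */
    return 0;                         /* continuation or invalid byte  */
}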
These features simplify error recovery. The prefix on the first byte lets a UTF-8 decoder determine, from that byte alone, how long the character is (instead of just continuing until it finds a new byte with the high bit 1, which might cause trouble if the high bit has been corrupted). The prefix on the following bytes ensures that if some number of bytes is lost from the beginning of a string (so the remaining part may begin in the middle of a character), you can still recover the next full character after the lost data, and everything after it. If the following bytes could be arbitrary bytes, they might look like a single-byte character or the start of a multi-byte character in this situation, so you'd start decoding at the wrong point and get garbage.
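To illustrate the resynchronization part, here's another small sketch (again my own, with a made-up name utf8_resync): after data loss or corruption, a decoder can skip any 10xxxxxx continuation bytes and resume decoding at the next byte that can legally start a character:

/* Hypothetical helper: advance p past any continuation bytes (10xxxxxx)
   so the returned pointer is either at the start of the next character
   or at end. Pairs with utf8_seq_len above. */
static const unsigned char *utf8_resync(const unsigned char *p,
                                        const unsigned char *end)
{
    while (p < end && (*p & 0xC0) == 0x80)
        p++;
    return p;
}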