Bi-directional text (Unicode)

Antti · Post by **Antti** » Fri Aug 07, 2015 3:00 am

Have you added support for bi-directional text? I tried, but it got too complicated at this point. Even if it had worked somehow, the implementation would have been a hack. I will try again later, but it requires more research and study. One problem is my lack of knowledge of any language written right-to-left. Another is deciding on the proper internal representation of text, i.e. whether to store the logical sequence of characters or the correct visual presentation. There are other problems too.

I will certainly not have a glyph for every character, but I would like the replacement characters to be ordered as they should be.

Combuster · Post by **Combuster** » Fri Aug 07, 2015 8:53 am

A lot of scripts across the Middle East and South Asia are RTL (Arabic, Hebrew and Urdu, among others). There are also top-to-bottom conventions in CJK, which complicate matters significantly further, although I only see current computer-native Japanese in its horizontal form. One advantage of supporting only RTL and LTR is that you only need to mirror your user interface to keep everything in its logical position. I've seen various UI libraries that strictly separate left/right from start/end.

AndrewAPrice · Post by **AndrewAPrice** » Fri Aug 07, 2015 9:05 am

Antti wrote:Have you added support for bi-directional text?

Not in my operating system, but I've dealt with i18n issues before.

Antti wrote:I tried, but it got too complicated at this point. Even if it had worked somehow, the implementation would have been a hack. I will try again later, but it requires more research and study. One problem is my lack of knowledge of any language written right-to-left. Another is deciding on the proper internal representation of text, i.e. whether to store the logical sequence of characters or the correct visual presentation. There are other problems too.

The internal representation is exactly the same. The characters in a word or sentence should still be stored in reading order (for a five-character word, store the first character first, then the second, and so on). The only difference is that when you go to print, you start at the right and decrease X until you reach the end of the line.
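
A minimal sketch of that idea, assuming a monospaced font where every character occupies one cell (the function name `cell_positions` and its parameters are made up for illustration). The string stays in reading order; only the starting column and the step direction change.

```python
def cell_positions(text, line_width, rtl=False):
    """Pair each character (kept in reading order) with its drawing column.

    Assumptions: one fixed-width cell per character, the text fits on one
    line of line_width cells, no combining marks and no wrapping.
    """
    if rtl:
        # Start at the rightmost column and decrease X.
        return [(line_width - 1 - i, ch) for i, ch in enumerate(text)]
    # Start at column 0 and increase X.
    return [(i, ch) for i, ch in enumerate(text)]


print(cell_positions("abc", 10))            # columns 0, 1, 2
print(cell_positions("abc", 10, rtl=True))  # columns 9, 8, 7
```

This only covers a line that is entirely one direction; it says nothing about mixed text.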

The real kicker here is when you try to deal with Bi-directional text. If the text is rendered separately (e.g. imagine a GUI with two text labels), you could probably set one to render left-to-right and the other to render right-to-left, and because each one is being drawn separately, it should work.

But what if you tried to mix the two? For example, having Arabic text yet inserting an English word or vice versa? Two solutions:
1. Insert the quoted text backwards.
2. Handle Unicode direction control characters to switch printing direction mid-string (U+200E and U+200F are the left-to-right and right-to-left marks; the explicit overrides are U+202D and U+202E, closed by U+202C). This complicates your printing functionality slightly: when you encounter the control code that switches direction, you'll need to read ahead to find the closing control code, then print the part of the string enclosed between them backwards.
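
The read-ahead approach in option 2 could be sketched roughly as below. This is a toy under stated assumptions, not the Unicode Bidirectional Algorithm: it assumes a left-to-right line, uses U+202E (RIGHT-TO-LEFT OVERRIDE) as the opening control and U+202C (POP DIRECTIONAL FORMATTING) as the closing one, and ignores nesting. The function name `apply_rlo_spans` is invented for the example.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE: opens a reversed span
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING: closes it


def apply_rlo_spans(text):
    """Return the display order for an LTR line, reversing each RLO..PDF span.

    Simplifications: no nesting, no other control characters, and an
    unterminated span runs to the end of the string.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == RLO:
            # Read ahead to the closing control code.
            end = text.find(PDF, i + 1)
            if end == -1:
                end = len(text)
            out.append(text[i + 1:end][::-1])  # print the span backwards
            i = end + 1
        elif ch == PDF:
            i += 1  # stray closing code: drop it
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(apply_rlo_spans("ab\u202ecde\u202cfg"))
```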

Antti wrote:I will certainly not have a glyph for every character, but I would like the replacement characters to be ordered as they should be.

Unicode is pretty much the universal standard for encoding text. (UTF-8 for storage and transmission, and perhaps UTF-32 internally.) Because Unicode has far too many glyphs for you to want to draw your own font, you're going to want to find a font you can already use. Here's a monospaced bitmap Unicode font, but most fonts you find online are in a vector format like TrueType (which brings in the complex topic of font rasterization). If a font is missing a glyph, then usually a '□' or '?' will be printed instead.

I'm not sure what you mean by:

Antti wrote:have replacement characters ordered

Do you mean that Unicode is so large that you don't want all characters loaded into memory at once? Your font rasterizer will probably only want to load and store recently used glyphs. You could probably get away with a glyph cache that only stores about 1000 rasterized characters. Every time a glyph is used, it's either loaded in and added to the front or, if it's already in the cache, just bumped to the front. If you're trying to load a glyph when there are already 1000 characters in the glyph cache, unload the one at the very end.
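
A small sketch of such a cache, assuming the rasterizer is just a callable that takes a code point and returns some bitmap object (the class name `GlyphCache` and the `rasterize` callback are placeholders, not any real font library's API):

```python
from collections import OrderedDict


class GlyphCache:
    """Keeps the most recently used glyphs; the front is the newest entry."""

    def __init__(self, rasterize, capacity=1000):
        self._rasterize = rasterize  # hypothetical callback: code point -> bitmap
        self._capacity = capacity
        self._glyphs = OrderedDict()

    def get(self, codepoint):
        if codepoint in self._glyphs:
            # Already cached: just bump it to the front.
            self._glyphs.move_to_end(codepoint, last=False)
            return self._glyphs[codepoint]
        if len(self._glyphs) >= self._capacity:
            # Cache is full: unload the glyph at the very end.
            self._glyphs.popitem(last=True)
        bitmap = self._rasterize(codepoint)
        self._glyphs[codepoint] = bitmap
        self._glyphs.move_to_end(codepoint, last=False)
        return bitmap


if __name__ == "__main__":
    cache = GlyphCache(lambda cp: f"<bitmap U+{cp:04X}>", capacity=2)
    print(cache.get(0x41), cache.get(0x42), cache.get(0x41))
```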

Antti · Post by **Antti** » Fri Aug 07, 2015 10:14 am

MessiahAndrw wrote:you could probably set one to render left-to-right and the other to render right-to-left, and because each one is being drawn separately, it should work.

It is exactly the bi-directional text that gave me a headache. All the overrides and other control characters, and whatever else I do not know about yet that will surprise me. I need to study more. If I had a UTF-8 text file and my simple "TYPE" command printed it, I would like to see something like this (the replacement character here is '?'):

Code:

>TYPE sometextfile.txt


                                                   ????? ????
                                         ????? ???? ???? ????

                              ????  ??? ??? (English) ??? ???
                             ??? "OSDev" ??? ??? ??? ????????
                                 ???? ????? ?? "123456789" ??

Look at how numbers and quotations are still left-to-right. Please note that I do not actually have a "TYPE" command and I will probably not have a command line at all (but that is not the point here). Here I have an example screenshot (Windows).

AndrewAPrice · Post by **AndrewAPrice** » Fri Aug 07, 2015 11:03 am

Antti wrote:Here I have an example screenshot (Windows).

I played with editing that Wikipedia page, and noticed that when I pressed 'a' 'b' 'c' it would be written backwards as 'cba', yet if I typed it inside an English word, it would appear as 'abc', even though that English word was inlined in right-to-left Arabic. I don't know how it's doing this, but I assume there are control characters telling it what to do.

I'm curious - do you input text in Arabic? How would you insert a left-to-right word or number in the middle of some right-to-left text?

Octocontrabass · Post by **Octocontrabass** » Fri Aug 07, 2015 11:09 am

MessiahAndrw wrote:But what if you tried to mix the two? For example, having Arabic text yet inserting an English word or vice versa?

The Unicode standard (version 7, since version 8 isn't available online yet) has the following to say about the rendering of bidirectional text:

A process that displays text containing supported right-to-left characters or embedding codes shall display all visible representations of characters (excluding format characters) in the same order as if the Bidirectional Algorithm had been applied to the text, unless tailored by a higher-level protocol as permitted by the specification.

The Bidirectional Algorithm is also part of the Unicode specification, as Unicode Standard Annex #9 (UAX #9).

Antti · Post by **Antti** » Fri Aug 07, 2015 11:12 am

MessiahAndrw wrote:Do you mean that Unicode is so large that you don't want all characters loaded into memory at once?

I have not thought that much yet. I just want to have solid foundations for this so it could be extended if needed. This kind of thing could be very hard to retrofit later so I would like to take this into account from the very beginning.

MessiahAndrw wrote:do you input text in Arabic?

I do not understand a single character. I was just playing with copy-and-paste.

Antti · Post by **Antti** » Fri Aug 07, 2015 11:29 am

...in the same order as if the Bidirectional Algorithm had been applied to the text...

I had already thought of having an internal representation of characters (like an array of Unicode code points, UTF-32) that I would send into some black box. It would return an array of code points ("the algorithm applied") in the correct order. Like I said, there are probably other control characters that make things even more complicated. It is very confusing and takes time to figure out.
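
For what it's worth, the last step of such a black box can be sketched in a few lines. In UAX #9 every character ends up with an embedding level (even for left-to-right, odd for right-to-left), and rule L2 then reverses runs from the highest level down to the lowest odd level. The sketch below only does that reversal; the levels are supplied by hand, whereas the real algorithm derives them from character classes and control characters. The function name is made up, and uppercase Latin letters stand in for right-to-left characters.

```python
def reorder_by_levels(chars, levels):
    """Reorder one line from logical to visual order, given embedding levels.

    Implements only the run-reversal step (rule L2 of UAX #9); computing
    the levels is the hard part and is not attempted here.
    """
    chars = list(chars)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return chars  # purely left-to-right: nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]  # reverse this run in place
                i = j
            else:
                i += 1
    return chars


# RTL paragraph (level 1) containing a number (level 2).
logical = "ABC 12 DEF"
levels = [1, 1, 1, 1, 2, 2, 1, 1, 1, 1]
print("".join(reorder_by_levels(logical, levels)))
```

Note how the number keeps its left-to-right digit order while the surrounding text is reversed.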

Octocontrabass · Post by **Octocontrabass** » Fri Aug 07, 2015 12:25 pm

Automatic line breaks (for wrapping long lines) have to be inserted in logical order, not display order, so performing the Bidirectional Algorithm the way you've described would make it very difficult to do that.
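
A toy illustration of why the order of operations matters, assuming a pure right-to-left paragraph where "reordering" is simply reversing the characters (uppercase Latin stands in for RTL text, and both function names are invented for the example):

```python
def break_lines(text, width):
    """Greedy word wrap; a single over-long word is left to overflow."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def to_visual(line):
    """Stand-in for the Bidirectional Algorithm on an all-RTL line."""
    return line[::-1]


logical = "ABC DEF GHI"  # reading order: ABC first, GHI last

# Break in logical order, then reorder each line.
good = [to_visual(line) for line in break_lines(logical, 7)]
# Reorder the whole paragraph first, then break.
bad = break_lines(to_visual(logical), 7)

print(good)  # first line holds the first two words, drawn right-to-left
print(bad)   # first line holds the END of the text
```

With the second approach the reader meets the last words of the paragraph on the first line.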

AndrewAPrice · Post by **AndrewAPrice** » Fri Aug 07, 2015 12:56 pm

That Bidirectional Algorithm looks pretty awesome.

It looks like you feed it a paragraph of text, and it can:
- determine if the paragraph needs to be printed left-to-right or right-to-left.
- rearrange inlined LtR/RtL text to fit the printing order.
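
The first of those two steps can be approximated with Python's standard `unicodedata` module: take the direction of the first strongly directional character. This is a simplified reading of the paragraph-level rules in UAX #9; it ignores isolates and any higher-level override, and the function name is made up.

```python
import unicodedata


def base_direction(paragraph, default="ltr"):
    """Guess the paragraph direction from its first strong character."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (e.g. Hebrew, Arabic)
            return "rtl"
    # No strong character at all (digits, punctuation, spaces only).
    return default


print(base_direction("OSDev \u05e9\u05dc\u05d5\u05dd"))
print(base_direction("\u05e9\u05dc\u05d5\u05dd OSDev"))
```

Digits and punctuation are not strong characters, so a paragraph that starts with "123" takes its direction from whatever letter comes first after it.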

Brendan · Post by **Brendan** » Fri Aug 07, 2015 1:17 pm

Hi,

I've always imagined it as 3 separate layers.

The lowest layer is the font engine, which gets a string in "left to right" order and converts it to a picture (e.g. "alpha per pixel" data). This is extremely complicated all by itself (especially when you start looking at things like ligatures).

The middle layer is a "layout engine". This handles things like paragraphs, word wrap, indenting, etc. It also splits strings (e.g. paragraphs) into lines and applies Unicode's bidirectional text algorithm to the resulting sub-strings (to convert to "glyphs always left to right" for the font engine). This is also extremely complicated all by itself.

The highest layer is internationalisation - whether to display 1,234 or 1.234, whether to display 1:23 AM or 13:23, etc. For strings, I imagine each application provides a file for each locale/language containing an array of "format strings" where each string has an ID. Software supplies a string ID and any data it would need; and the internationalisation layer fetches the format string from the current locale/language's file, scans for escape codes, and replaces escape codes with (internationalised versions of) the other data (sort of like "printf()", but not). This would also do the reverse (e.g. converting the string "1,234" into the value 1234, converting the date string "1/2/3456 local time" into a ticks since epoch value, etc). This is also extremely complicated all by itself.
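
A very rough sketch of that highest layer, assuming a per-locale table of format strings keyed by string ID. The table layout, the string ID and the function names are all invented for illustration; in the design described above the tables would live in per-application locale files rather than in code.

```python
# Hypothetical per-locale data, inlined here to keep the sketch self-contained.
LOCALES = {
    "en": {"thousands": ",",
           "strings": {"inbox.count": "You have {count} new messages"}},
    "de": {"thousands": ".",
           "strings": {"inbox.count": "Sie haben {count} neue Nachrichten"}},
}


def localise_number(value, locale):
    """Format an integer with the locale's thousands separator."""
    return f"{value:,}".replace(",", LOCALES[locale]["thousands"])


def parse_number(text, locale):
    """The reverse direction: localised digits back to an integer."""
    return int(text.replace(LOCALES[locale]["thousands"], ""))


def render(locale, string_id, **data):
    """Fetch the format string and substitute localised versions of the data."""
    template = LOCALES[locale]["strings"][string_id]
    localised = {
        key: localise_number(value, locale) if isinstance(value, int) else str(value)
        for key, value in data.items()
    }
    return template.format(**localised)


print(render("en", "inbox.count", count=1234))
print(render("de", "inbox.count", count=1234))
```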

Cheers,

Brendan

Antti · Post by **Antti** » Fri Aug 07, 2015 10:53 pm

Brendan wrote:The lowest layer is the font engine, which gets a string in "left to right" order and converts it to a picture

This is just what I wanted to hear. Instead of using the word "string", I would think of it as a glyph array sent to a renderer, setting aside the fact that things I do not know yet may change this vision.

A question. If your middle layer (the "layout engine") handles word wrapping, does it ask the font engine how much screen space it takes to render the text?

Brendan · Post by **Brendan** » Sat Aug 08, 2015 1:39 am

Hi,

Antti wrote:
Brendan wrote:The lowest layer is the font engine, which gets a string in "left to right" order and converts it to a picture
This is just what I wanted to hear. Instead of using the word "string", I would think of it as a glyph array sent to a renderer, setting aside the fact that things I do not know yet may change this vision.

A question. If your middle layer (the "layout engine") handles word wrapping, does it ask the font engine how much screen space it takes to render the text?

You'd have to have some way for the layout engine to know how much screen space an array of glyphs would consume (one that takes into account the current font, font size, bold/italic and proportional fonts).
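
One simple shape for that query is a per-glyph advance-width table that the layout engine sums up. The widths below are invented numbers, and the sketch ignores kerning, ligatures and font size; a real font engine would expose one such table per font, size and style.

```python
# Hypothetical advance widths in pixels for one proportional font.
ADVANCES = {"i": 3, "l": 3, "m": 9, "w": 9, " ": 4}
DEFAULT_ADVANCE = 6  # used for any glyph missing from the table


def text_width(glyphs, advances=ADVANCES, default=DEFAULT_ADVANCE):
    """Total horizontal space the glyph array would take, in pixels."""
    return sum(advances.get(g, default) for g in glyphs)


print(text_width("mill"))  # wide 'm' plus three narrow glyphs
print(text_width("abcd"))  # four default-width glyphs
```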

Also; I'd be very tempted to implement "bare minimum" (e.g. mono-spaced ASCII only bitmap font renderer, left to right paragraphs only layout engine, English only internationalisation) and then worry about complicating things more after you've got that working. Word wrap alone is messy.

For example, consider a sentence like "The hippopotamus ate." where the "hippopotamus" fills a line. You could do this:

Code:

The
hippopotamus
ate.

Or this:

Code:

The hippo-
potamus ate.
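
Both outcomes can be reproduced with a small greedy wrapper, sketched here for a monospaced font with a 12-column line. The `hyphen_points` argument is an assumption: a real layout engine would get break positions from a hyphenation dictionary or algorithm, which is not attempted here.

```python
def wrap(text, width, hyphen_points=None):
    """Greedy word wrap with optional hyphenation.

    hyphen_points maps a word to the index where it may be split
    (hypothetical input). A word longer than the line simply overflows.
    """
    hyphen_points = hyphen_points or {}
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= width:
            current = candidate
            continue
        split_at = hyphen_points.get(word)
        if current and split_at is not None:
            head = f"{current} {word[:split_at]}-"
            if len(head) <= width:
                # The first half fits on the current line: hyphenate.
                lines.append(head)
                current = word[split_at:]
                continue
        if current:
            lines.append(current)
        current = word
    if current:
        lines.append(current)
    return lines


sentence = "The hippopotamus ate."
print("\n".join(wrap(sentence, 12)))
print("\n".join(wrap(sentence, 12, {"hippopotamus": 5})))
```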

Cheers,

Brendan

Antti · Post by **Antti** » Sat Aug 08, 2015 9:13 am

Brendan wrote:Also; I'd be very tempted to implement "bare minimum" (e.g. mono-spaced ASCII only bitmap font renderer, left to right paragraphs only layout engine, English only internationalisation) and then worry about complicating things more after you've got that working. Word wrap alone is messy.

Totally agreed. In this case it is not very important to spend weeks implementing the bi-directional algorithm; what matters is having a place reserved for it. In practical terms this means avoiding assumptions such as being able to push characters into some data sink without maintaining any state or looking back at what has been pushed previously, and avoiding anything that makes it impossible to add the feature later. If I had a procedure reserved for "text processing" with enough information to do processing like this, it would be possible to add features like bi-directional text without rewriting everything from scratch. The three separate layers could be the way to go even for a "bare minimum" hobby OS. For example, drawing a single glyph and incrementing an x variable until the line wraps is something that could never support bi-directional text elegantly.

Of course, it is not that simple and I need to learn more. I also need to find a good balance so that the project does not suffer from creeping featurism.

OSDev.org
