OSDev.org

Posted: **Tue Sep 12, 2017 7:54 am**

My OS has a relatively simple GUI, but I have many performance issues, and I'm wondering if this is an inherent flaw in my design somewhere.

It all starts with a simple library "libddi" which handles 2D graphics. It uses FreeType for rendering text. Everything is represented as a "surface", and we can render text, images, load images from files, and blit. It can perform approximately 3000 blits of a 500x500 surface every second.

On top of this, there is a window manager, which gets sole control of the screen, which is represented as a "screen surface" after setting the video mode. It organises windows into a hierarchy, where the desktop is the root of the tree. An application can use the library "libgwm" to talk to the window manager. The communication happens via Unix/Local sockets. Widgets such as buttons etc are "windows" too.

Each window is represented by some metadata like the title, icon, etc, but most importantly, there are 3 surfaces associated with it: a back canvas, a front canvas, and a "display". The back canvas is in shared memory, accessible to the application that created it, and to the window manager.

When an application draws something in the back buffer, it swaps buffers by sending a command to the WM. The WM then copies the back canvas into the front canvas and then renders the "display": this is done by copying the front canvas into the display, then blitting all child windows' displays into the parent display. Then, the parent of the window being swapped becomes "dirty", and its display (and just the display) is re-rendered. This eventually cascades all the way to the desktop window, whose display is then blitted onto the screen surface (the display also includes the mouse cursor, put on the very top).

The problem is that the mouse lags a little bit when I move it, and complex GUIs take an extremely long time (a few seconds) to open, and they respond rather slowly.

Furthermore, when doing operations that require a lot of refresh (like typing in a text area) it gets unacceptably laggy.

Is there something wrong with this design, or do I just need to debug and try to find out what might be causing the latency?

Posted: **Wed Sep 13, 2017 1:22 am**

I won't ask you to port valgrind to your OS, so instead I will ask if your GUI code can run on linux in a test environment (perhaps to the framebuffer device)?

If you can do this then....

use the cachegrind (http://valgrind.org/docs/manual/cg-manual.html) tool of the valgrind tool distribution....

and then use the KCacheGrind tool from KDE to easily drill down to see just what code is eating up so much time....

failing the above - if you have gcc on your OS - attempt using the profiling command line settings and get the data produced to a tool which can visualize it....

cheers

Posted: **Wed Sep 13, 2017 2:10 am**

disregard my previous post....

ok.... here are some questions...

in: https://github.com/madd-games/glidix/bl ... /src/ptr.c

You spawn a separate thread to draw the mouse as it moves where:

ptrThreadFunc loops on a read() and then calls wndMouseMove which calls wndDrawScreen

1) is your OS blocking on read()? or is the <0 saying nothing was available so loop again?

2) why not use select() or poll() if your OS implements them?

I ask because it looks like your mouse thread beats the hell out of the CPU and thrashes the cache.

in: https://github.com/madd-games/glidix/bl ... c/window.c

in wndMouseMove() you lock mouseLock, unlock mouseLock, and then call wndDrawScreen() which locks mouseLock and unlocks mouseLock

in wndMouseMove() you lock wincacheLock, unlock wincacheLock, and then call wndDrawScreen() which locks wincacheLock and unlocks wincacheLock

3) could you create merge wndDrawScreen into wndMouseMove to get rid of the redundant locks?

4a) in fact you have a ton of locks... like the desktop lock.... not sure why the desktop has to be locked to draw a mouse sprite....

4b) so many locks that I can't help but think many of them are being used where they are not necessarily required or redundant

5) you call ddiOverlay() which passes a ton of args to a function which is linked from the libddi library - unless you have link time optimizations then this function should be in a header for inlining - just to get rid of the function call...

6) you use stock memcpy() and memset() alot.... since this is a x86_64 OS perhaps using SSE or even AVX versions of these functions will help... certainly a 128 bit memcpy could help.... here is an MIT licensed set of examples.... seriously good read: https://github.com/skywind3000/BasicBit ... p_SSE2.cpp

There is probably more issues - but the locks and memcpy and the mouse thread hammering the CPU are the worst offenders.... (in that order if your read() function blocks)

I want to thank you though... your code is SUPER EASY to understand and very well written.

Cheers

Posted: **Wed Sep 13, 2017 2:31 am**

dchapiesky wrote:disregard my previous post....

ok.... here are some questions...

in: https://github.com/madd-games/glidix/bl ... /src/ptr.c

You spawn a separate thread to draw the mouse as it moves where:

ptrThreadFunc loops on a read() and then calls wndMouseMove which calls wndDrawScreen

1) is your OS blocking on read()? or is the <0 saying nothing was available so loop again?

2) why not use select() or poll() if your OS implements them?

I ask because it looks like your mouse thread beats the hell out of the CPU and thrashes the cache.

The read() is blocking. The "<0" check just ensures that there was no error in the read.

dchapiesky wrote:in: https://github.com/madd-games/glidix/bl ... c/window.c

in wndMouseMove() you lock mouseLock, unlock mouseLock, and then call wndDrawScreen() which locks mouseLock and unlocks mouseLock

in wndMouseMove() you lock wincacheLock, unlock wincacheLock, and then call wndDrawScreen() which locks wincacheLock and unlocks wincacheLock

3) could you create merge wndDrawScreen into wndMouseMove to get rid of the redundant locks?

4a) in fact you have a ton of locks... like the desktop lock.... not sure why the desktop has to be locked to draw a mouse sprite....

4b) so many locks that I can't help but think many of them are being used where they are not necessarily required or redundant

-mouseLock synchronizes access to the global mouse position variables (mouseX, mouseY).
-wincacheLock synchronizes access to the "window cache"; which remembers the currently "hovered" window (the window that the mouse currently points at), the window in focus, and the "active" window (currently clicked).
-the "desktop lock" is just the lock for the desktop window, so a special case of the window lock. Furthermore, it synchronizes access to the framebuffer, and wndDrawScreen() draws everything (i.e. the contents of the desktop window's display) plus the mouse cursor.
-the window lock obviously synchroniizes access to an individual Window structure. Each window has its own lock and a reference counting is used so that the parent's lock does not need to be held when modifying most properties of its child.

In general, you should see that all locks are held for a very short time. (If you see that they are not, you can point that out to me).

wndMouseMove() and wndDrawScreen() aren't merged because they do separate things, and wndDrawScreen() is called from other parts of the code. wndMouseMove() reports mouse motion and then calls wndDrawScreen(), which just draws the desktop window and cursor on top.

dchapiesky wrote:5) you call ddiOverlay() which passes a ton of args to a function which is linked from the libddi library - unless you have link time optimizations then this function should be in a header for inlining - just to get rid of the function call...

6) you use stock memcpy() and memset() alot.... since this is a x86_64 OS perhaps using SSE or even AVX versions of these functions will help... certainly a 128 bit memcpy could help.... here is an MIT licensed set of examples.... seriously good read: https://github.com/skywind3000/BasicBit ... p_SSE2.cpp

There is probably more issues - but the locks and memcpy and the mouse thread hammering the CPU are the worst offenders.... (in that order if your read() function blocks)

I want to thank you though... your code is SUPER EASY to understand and very well written.

Cheers

ddiOverlay() is an optimised memcpy() essentially, it copies a surface onto another without alpha blending etc. As stated before, ddiBlit() (which DOES do alpha blending) can perform 3000 operations on 500x500 surfaces in one second; so ddiOverlay() does at least as much.

Inlining ddiOverlay() would probably be a bad idea since libddi is supposed to, some day, support hardware acceleration.

Posted: **Wed Sep 13, 2017 10:33 am**

I like the fact that the components have well defined responsibilities. Some of the burden can be attributed to memory transfers that the CPU has to perform, but there are also some inefficiencies, I think.

You perform full copies of the entire window/screen content for all graphics events. For example, when the mouse cursor moves, you redraw the entire desktop (i.e. the screen buffer), just to reposition the tiny cursor sprite. Am I right? The slow repainting of the cursor can interact badly with the mouse movement response of the application. If the application reacts by changing appearance, it will send GWM_CMD_POST_DIRTY, the screen will have been first redrawn to repaint the cursor, then the window hierarchy will be updated, and finally the screen will be redrawn again. If the cursor was displayed using a partial update it would have been at least a smaller penalty.

When compositing, the entire canvas is transferred to the front buffer of the window and then only part of that front buffer is painted on the display buffer. First, I am not even sure that the canvas management has to be dealt with on the server side, especially since it does not involve the window heirarchy. But even if it is, it can be done more efficiently. The front buffers of each window are always repainted on their corresponding display buffer using the entire displayed area. To begin with, GWM_CMD_POST_DIRTY provides no area information (i.e. invalidation region), which prevents efficient partial updates. Furthermore, when a child changes, there is no need to repaint the entire display surface of the parent onto its parent, and so forth. Actually, why not paint onto the screen buffer from the window front buffers directly. For non-alpha blended windows, this will be significantly cheaper, because the parent window content in the invalidated region is fully overlapped and thus it has no involvement. Not many applications actively use alpha blending.

This is probably a so called reparenting window manager. In other words, different processes can draw descendant windows inside the same window hierarchy. But there is no technical need to have separate buffers for overlapped nested windows originating from the same process. One surface buffer shared with the server per floating window per process should be enough. The rest can be composited on the client side. The reason why I bring this up is that this change may decrease the traffic with the server, which will increase the responsiveness for situations where a number of children have to be updated simultaneously. Depends on your design objectives.

You may want to consider graphical buffer tiling to improve the memory locality. Especially if you decide to implement partial updates. But you have to evaluate how this will interact with the code complexity, especially that of the drawing primitives. Assuming that the aspect ratio of the invalidation regions will tend to comparatively balanced, for page level locality (not cache level), you could break the surface in say 32 x 32 x 4 bpp blocks for 32 bit formats, or 64 x 64 x 1 for 8 bit, 64 x 32 x 2 for 16 bit, and... I am not sure for 24 bit - 42 x 32 x 3 with 64 bytes residue stride will be space optimal, but will make the width non-power-of-two, whereas the nearest exact powers will spill or straddle pages. And you may want to cap the update rate in order to colesce GWM_CMD_POST_DIRTY messages and other events.

Finally, I don't know how fast the socket communication is, but you may want to consider shared memory IPC. At least for local clients.

Edit: In ddiOverlay, there is loop-based compensation of negative destX and destY. I don't know if it is used, but it could be a problem.

Posted: **Wed Sep 13, 2017 12:37 pm**

Aside from simonz' point that only top-level windows need to be managed by the compositor (doing that will reduce IPC and context-switches), storing three buffers per window seems wasteful. If the compositor draws the window decorations, why can't it draw the decorations to a global backbuffer instead of rendering to wnd->display? I'm not sure what the purpose of wnd->front really is.

EDIT: It seems to me that you use wnd->front as a backbuffer so that client can begin drawing the next frame while the compositor is copying the window. I'd suggest to let clients do their own backbuffering. Let GWM_CMD_POST_DIRTY take a new canvas ID so that clients can do a per-window page flip and implement single/double/tripple buffering as they wish. You might want to consider implementing something like FD passing over sockets though, so that open() does not need to be called that often.

The shared memory and IPC architecture is well though out though.

Posted: **Wed Sep 13, 2017 12:39 pm**

OK, so when implemnting partial updates, let me get some things right:

1. when the display of a child is updated, i blit only that child onto the parent's display ,right? But then, how do I account for alpha blending?
2. How do I perform partial updates of the display if there are overlapping children? (mainly a concern for the desktop window of course).
3. so I can move the cursor by remembering what was under it, then overlaying it on the cursor, and blitting the cursor elsewhere. In this case, is it still preferable to have double buffering? As in, I perform the move on the "back" surface, and then overlay the whole thing onto the front buffer?

@Korona: the purpose of wnd->front is to allow the application to draw into a back buffer, and the WM can draw the previous frame (in the front buffer) to the screen if it must do so before the back buffer has been updated by the application.

Posted: **Wed Sep 13, 2017 12:45 pm**

I see. I ninja edited my previous post. Posting again for visibility

.

Posted: **Wed Sep 13, 2017 1:18 pm**

mariuszp wrote:1. when the display of a child is updated, i blit only that child onto the parent's display ,right? But then, how do I account for alpha blending?

I will try to illustrate the two issues. Suppose that a 16x16 region was just updated with new content, and that all windows - children/siblings/parents - that have non-trivial intersection with that region have translucency effects. Then, you need to only alpha-blit those 16x16 buffer parts from all intersecting windows, not their entire content. This is the too much buffer bytes and window area are blit-ed issue. Second, suppose that a maximized application has solid background and is currently the focused top window. Then none of its siblings in the desktop (other application windows), or indeed, the desktop background itself, will be visible. They must be eliminated from the drawing process, because they are hidden at the moment by the top window that occludes the entire display. This is very important, because an OS does not draw windows that are not visible. The applications may update the window content (although they may also prefer to avoid this fully or partially based on visibility), but the OS does not blit that content. Of course, you need to have separate cases for alpha blended and solid surfaces when computing occlusion, but always assuming translucency is a big performance hit.

mariuszp wrote:2. How do I perform partial updates of the display if there are overlapping children? (mainly a concern for the desktop window of course).

You need to compute the visible part of the updated region. You need to blit only the rectilinear polygon (i.e. set of rectangles) which remains after removing the occluded regions, whether by child or sibling windows. Edit: Occlusion assumes solid surface, without translucency. Translucency naturally does not produce occlusion. Which means, you need to handle two separate cases.

mariuszp wrote:3. so I can move the cursor by remembering what was under it, then overlaying it on the cursor, and blitting the cursor elsewhere. In this case, is it still preferable to have double buffering? As in, I perform the move on the "back" surface, and then overlay the whole thing onto the front buffer?

Yes - you got the cursor idea. Regarding the buffers, I didn't review deep enough to say if the front surface is actually the framebuffer of the bga, but the point of double buffering is to have scratch memory where you draw and then to transfer to the visible surface with maximal speed, without producing visual artifacts. With triple buffering you put another frame between the scratch buffer and the visible surface, which keeps the most recent finished content, such that the visible surface can use that on demand without waiting for in-progress work on the back buffer to finish. However, you don't need scratch buffer for the scratch buffer for the scratch buffer, unless you pursue some specific separation, such as one process transferring to another and so forth. Aside from specific separation boundaries, one scratch buffer for dynamic work should be enough. For asynchronous time accurate operation - such as in capped frame rate mode - one more for triple buffering may be desirable.

Edit: Basically, keep "screen" and frontBuffer (if frontBuffer is the bga memory), but remove wnd->display and blit the wnd->front directly to the screen. Keep the double buffering of the process in the process (as Korona suggested).

Posted: **Wed Sep 13, 2017 1:38 pm**

Regarding the BGA: You can do "real" double buffering by allocating multiple buffers in the BGA's VRAM (suitably aligned to their pitch) and changing the Y offset register. This will avoid one software blit.

Posted: **Mon Sep 18, 2017 3:08 pm**

One more question: when doing a partial update, what should happen if the cursor overlaps the "dirty rectangle"? Do I "undraw" the cursor (by re-blitting what's under it), do the update, and draw the cursor again? Or is there a faster way to handle this?

Posted: **Mon Sep 18, 2017 11:02 pm**

You could erase (reblit) only the difference between the new rectangle area and the old one. Something like this:

Code: Select all

int diff, max;

if (new.y > old.y)
{
  diff = new.y - old.y;
  max = new.y;
  if (diff < height)
    erase(old.x, old.y, width, diff);
  else
    erase(old.x, old.y, width, height);
}
else
{
  diff = old.y - new.y;
  max = old.y;
  if (new.y < old.y)
    if (diff < height)
      erase(old.x, new.y + height, width, diff);
    else
      erase(old.x, old.y, width, height);
}
if (diff < height)
{
  if (new.x > old.x)
    if (new.x - old.x < width)
      erase(old.x, max, new.x - old.x, height - diff);
    else
      erase(old.x, max, width, height - diff);
  else if (new.x < old.x)
    if (old.x - new.x < width)
      erase(new.x + width, max, old.x - new.x, height - diff);
    else
      erase(old.x, max, width, height - diff);
}

P.S. I am not saying that you should use "new" as identifier, but think of this as pseudo-code

Edit: Bug! The code did not handle non-intersecting rectangles. Which is a good opportunity to remark that the code is for illustrative purposes only and is completely untested. Something else that should be noted is that it can split the erase in the case of horizontally non-intersecting rectangles in two when the vertical projections collide. This is a performance aspect, which could be optimized.

Posted: **Tue Sep 19, 2017 7:27 am**

This is probably not your main issue but I'm just going to throw this out there because it's a mistake that I've made in the past: don't draw intermediate mouse positions. What I mean by this is that if the mouse is at, for example, position (10, 15) and moves, in one cycle, to position (13, 17) and then moves again in the next cycle to position (16, 19), you don't need to draw it at position (13, 17).

This can happen if each mouse movement is posted to an event queue to be consumed later by the GUI subsystem. A better way is to not post mouse movement as events but to have a way where the GUI subsystem can request the last known position of the mouse at any moment in time. So when the GUI subsystem comes to draw the mouse, it simply says "where's the mouse" and the input subsystem says "the mouse is at (16, 19)" and the GUI subsystem can draw it straight away, no matter how many movement events/cycles it would've otherwise gone through to get there. Also include position information in your button press and release events so that the GUI subsystem doesn't need to rely on receiving mouse movement events to figure out where the pointer was at the time of a button press/release.

Posted: **Tue Sep 19, 2017 2:24 pm**

onlyonemac wrote:This is probably not your main issue but I'm just going to throw this out there because it's a mistake that I've made in the past: don't draw intermediate mouse positions. What I mean by this is that if the mouse is at, for example, position (10, 15) and moves, in one cycle, to position (13, 17) and then moves again in the next cycle to position (16, 19), you don't need to draw it at position (13, 17).

This can happen if each mouse movement is posted to an event queue to be consumed later by the GUI subsystem. A better way is to not post mouse movement as events but to have a way where the GUI subsystem can request the last known position of the mouse at any moment in time. So when the GUI subsystem comes to draw the mouse, it simply says "where's the mouse" and the input subsystem says "the mouse is at (16, 19)" and the GUI subsystem can draw it straight away, no matter how many movement events/cycles it would've otherwise gone through to get there. Also include position information in your button press and release events so that the GUI subsystem doesn't need to rely on receiving mouse movement events to figure out where the pointer was at the time of a button press/release.

Yes, I had the problem with an event queue before, and fixed it in (almost) exactly the same way as you just described.

Posted: **Tue Sep 19, 2017 2:36 pm**

mariuszp wrote:Yes, I had the problem with an event queue before, and fixed it in (almost) exactly the same way as you just described.

Good. How did you fix it?

OSDev.org

Is my GUI design terrible?

Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?

Re: Is my GUI design terrible?