UEFI+BIOS bootloader combination

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: UEFI+BIOS bootloader combination

Post by rdos »

Antti wrote: I think it is much more flexible to store information like "Drawline" and "DrawBox". It is like a high-level description of what we would like to draw. If it is faster to keep a backbuffer copy of the application working area (pixels) when doing focus switches, nothing would stop us from doing that. We can do whatever we want once we have a "high-level" description of what to draw. There could even be different levels of rendering quality, e.g. fast = poor, slow = excellent.
It sure is more flexible, but flexibility often costs in terms of performance.

Similar examples were presented here before, like the view-port concept. I decided not to support transformations at all in my API (and thus applications use pixel-coordinates directly) because that is faster. OTOH, I did decide to support clipping (with a rectangle only), because that was thought to be indispensable even if it cost some performance. Many things are trade-offs between flexibility and performance, and these decisions usually make a big difference for graphics.
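For illustration, a minimal sketch of that kind of rectangle-only clipping (the types and names here are hypothetical, not the actual RDOS API):

Code:
/* Clip a destination rectangle against a single clip rectangle before
   any pixels are touched. Half-open ranges: [x0,x1) x [y0,y1). */
typedef struct { int x0, y0, x1, y1; } rect_t;

static int clip_rect(const rect_t *clip, rect_t *r)
{
    if (r->x0 < clip->x0) r->x0 = clip->x0;
    if (r->y0 < clip->y0) r->y0 = clip->y0;
    if (r->x1 > clip->x1) r->x1 = clip->x1;
    if (r->y1 > clip->y1) r->y1 = clip->y1;
    return (r->x0 < r->x1) && (r->y0 < r->y1);   /* anything left to draw? */
}

Everything after a test like this works on plain pixel coordinates, which is where the speed comes from.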

I also decided not to mix GUI functions into the graphics API, so the graphics API can be used both for a GUI app and for some simple graphics without a GUI. That also means that things like key and mouse notifications are not in the graphics API, and neither are icons, windows or menus. Those instead are part of the application class-library, which today contains advanced widgets for building things like dialog-boxes and form-based applications. The PNG/JPEG/BMP loaders/savers are also in the class-library.
Antti
Member
Posts: 923
Joined: Thu Jul 05, 2012 5:12 am
Location: Finland

Re: UEFI+BIOS bootloader combination

Post by Antti »

rdos wrote:applications use pixel-coordinates directly
This will make them extremely dependent on particular resolutions. It probably is faster but I would not do that anymore.

When it comes to games, I am a fan of pixel art. Vector graphics looks somehow too clean. Maybe it is just good memories of old games that make it seem like that. Anyway, I think it is superior to have resolution-independent vector graphics. It is even possible to "emulate" a pixelated look. There should be no advantage to keeping control of individual pixels. Maybe the efficiency... but it will not turn out to be a good decision in the long run.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
Owen wrote:What do you care more about, saving minute proportions of RAM, or performance?
Graphics isn't (or shouldn't be) as simple as "application generates graphics and video driver displays them". Typically there are layers of software between the applications and the video drivers, and these middle layers do a significant amount of processing.

For an example; imagine an application called "Foo" that wants to display a blue rectangle in the middle of a dark green semi-transparent background. Foo generates the data for this (in whatever format) and sends it to the GUI. The GUI puts Foo's data into a window, which involves transposing and possibly scaling Foo's data, and may include clipping Foo's data to the window's edges (and adding a scroll bar); then adding window decoration (borders, title, etc) to it. Of course there may be 10 different windows and dialog boxes on screen at the same time; so the GUI also has to combine all of the graphics from all of the windows and dialog boxes; which includes handling things like Foo's dark green semi-transparent background. Of course for modern GUIs there's also special effects; like maybe each window casts some sort of shadow onto any window/s behind it.

Once the GUI has done all of this processing; it might send the final result to a "virtual screen" layer. The virtual screen layer has to determine which pieces of the GUI's virtual screen get sent to which video driver. Let's assume there's one monitor running at 768 * 1024 with 32-bit colour that's connected to an ATI video card, and another monitor running at 1920*1600 with 16-bit colour that's connected to an Nvidia video card; and the left half of the GUI's virtual screen needs to be sent to the first monitor and the right half of the GUI's virtual screen needs to be sent to the second monitor. This means that the virtual screen layer needs to clip the GUI's graphics to the edges of each monitor, and scale the graphics (and convert colours?) to suit each monitor. Also assume that the "768 * 1024" monitor is actually using 1024 * 768 (the monitor is on its side) and the graphics data for that monitor also needs to be rotated 90 degrees.

Finally (after all of this processing in between the application/s and the video drivers) the virtual screen layer sends its graphics data (in whatever format) to both of the video drivers. Assume the video drivers support all sorts of hardware acceleration (or assume the OS is designed to allow video drivers to do all sorts of hardware acceleration one day).

Now; with all of this "graphics processing between the application and video driver/s" in mind (and without neglecting the fact that all video cards for the last 20 years are quite capable of doing things like drawing blue rectangles using hardware acceleration): An application called "Foo" generates the data for a blue rectangle in the middle of a dark green semi-transparent background and sends this data to a GUI. What format should the data that Foo sends to the GUI be in? Raw pixels, or something like "draw_rectangle (x1, y1, x2, y2, blue)"?
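To make the second option concrete, here is a minimal sketch of what such a command list could look like (all of the types and names are hypothetical, not an existing API):

Code:
#include <stdint.h>

/* One small record per drawing operation; the application appends
   records instead of touching pixels. Hypothetical format. */
enum cmd_op { CMD_FILL_RECT = 1 };

typedef struct {
    uint8_t  op;                 /* enum cmd_op */
    int16_t  x1, y1, x2, y2;     /* in the application's own coordinate space */
    uint32_t argb;               /* e.g. 0xFF0000FF = opaque blue */
} draw_cmd_t;

typedef struct { draw_cmd_t cmd[256]; int count; } cmd_list_t;

static void emit_fill_rect(cmd_list_t *l, int x1, int y1, int x2, int y2, uint32_t argb)
{
    if (l->count < 256) {
        draw_cmd_t c = { CMD_FILL_RECT, (int16_t)x1, (int16_t)y1,
                         (int16_t)x2, (int16_t)y2, argb };
        l->cmd[l->count++] = c;
    }
}

Foo's whole frame fits in a handful of such records, and the GUI (or the virtual screen layer) can translate, clip, scale or discard them before any pixels exist.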


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
Owen
Member
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: UEFI+BIOS bootloader combination

Post by Owen »

Brendan wrote:Hi,
Owen wrote:What do you care more about, saving minute proportions of RAM, or performance?
Graphics isn't (or shouldn't be) as simple as "application generates graphics and video driver displays them". Typically there are layers of software between the applications and the video drivers, and these middle layers do a significant amount of processing.
You're certain of that? Have you looked into how Wayland, the Windows DWM, Android's SurfaceFlinger and similar work?

For full screen applications, the window server allocates a scanout capable buffer as the application's backing store and points the framebuffer hardware at it. This works perfectly and with perfect efficiency until a notification (or other overlay) shows up. At that point, the window manager switches back to the standard compositing pathway it uses when multiple windows are on screen.
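Roughly, the per-frame decision looks like this (a minimal sketch with hypothetical names; the real DWM/SurfaceFlinger logic is of course far more involved):

Code:
/* Hypothetical compositor types: just enough to show the control flow. */
typedef struct { void *backing; int fullscreen; } window_t;
typedef struct {
    window_t *windows[16];
    int       window_count;
    int       overlay_visible;
    void     *compositor_target;
} display_t;

/* Assumed to be provided by the display driver / compositor: */
void set_scanout_buffer(display_t *d, void *buf);
void composite_all_windows(display_t *d, void *target);

void present_frame(display_t *d)
{
    if (d->window_count == 1 && d->windows[0]->fullscreen && !d->overlay_visible) {
        /* Zero-copy path: flip the display straight to the app's buffer. */
        set_scanout_buffer(d, d->windows[0]->backing);
    } else {
        /* Normal path: blend every window into one buffer, then flip. */
        composite_all_windows(d, d->compositor_target);
        set_scanout_buffer(d, d->compositor_target);
    }
}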

Why would you not do this? Why would you sacrifice the efficiency that is trivially gained?
Brendan wrote:For an example; imagine an application called "Foo" that wants to display a blue rectangle in the middle of a dark green semi-transparent background. Foo generates the data for this (in whatever format) and sends it to the GUI. The GUI puts Foo's data into a window, which involves transposing and possibly scaling Foo's data, and may include clipping Foo's data to the window's edges (and adding a scroll bar); then adding window decoration (borders, title, etc) to it. Of course there may be 10 different windows and dialog boxes on screen at the same time; so the GUI also has to combine all of the graphics from all of the windows and dialog boxes; which includes handling things like Foo's dark green semi-transparent background. Of course for modern GUIs there's also special effects; like maybe each window casts some sort of shadow onto any window/s behind it.

Once the GUI has done all of this processing; it might send the final result to a "virtual screen" layer. The virtual screen layer has to determine which pieces of the GUI's virtual screen get sent to which video driver. Let's assume there's one monitor running at 768 * 1024 with 32-bit colour that's connected to an ATI video card, and another monitor running at 1920*1600 with 16-bit colour that's connected to an Nvidia video card; and the left half of the GUI's virtual screen needs to be sent to the first monitor and the right half of the GUI's virtual screen needs to be sent to the second monitor. This means that the virtual screen layer needs to clip the GUI's graphics to the edges of each monitor, and scale the graphics (and convert colours?) to suit each monitor. Also assume that the "768 * 1024" monitor is actually using 1024 * 768 (the monitor is on its side) and the graphics data for that monitor also needs to be rotated 90 degrees.

Finally (after all of this processing in between the application/s and the video drivers) the virtual screen layer sends its graphics data (in whatever format) to both of the video drivers. Assume the video drivers support all sorts of hardware acceleration (or assume the OS is designed to allow video drivers to do all sorts of hardware acceleration one day).

Now; with all of this "graphics processing between the application and video driver/s" in mind (and without neglecting the fact that all video cards for the last 20 years are quite capable of doing things like drawing blue rectangles using hardware acceleration): An application called "Foo" generates the data for a blue rectangle in the middle of a dark green semi-transparent background and sends this data to a GUI. What format should the data that Foo sends to the GUI be in? Raw pixels, or something like "draw_rectangle (x1, y1, x2, y2, blue)"?
Consider the application Bar. Bar draws a complex 3D scene involving complex per-pixel dynamic lighting fragment shaders, tessellation hull and domain shaders for improved fidelity, and geometry shaders for efficient particle effects based upon taking a series of points and generating a camera-facing quadrilateral in their place. In addition, Bar uses dynamic occlusion queries where it tells the GPU to take a chunk of geometry and pretend to draw it, and asks it "If you had drawn it, would it have been visible?", which it uses for conservative visibility determination. It also uses render-to-texture in order to calculate shadow volumes.

Now, what format should this be in: should the GUI server receive an unyielding series of commands it has no insight into and which it can't modify without breaking the app, or should it just take the image buffer that the application produced each frame and use that as the input for whatever compositing it might do?

What if the system has a crappy Intel HD Graphics 2000 IGP with one monitor plugged into it, and an AMD Radeon HD 7990 with the other plugged into it? Should the app be crippled to the capabilities of the former, or should the system (or application) intelligently decide that attempting to do anything complex on the former is a terrible idea?
Antti wrote:
rdos wrote:applications use pixel-coordinates directly
This will make them extremely dependent on particular resolutions. It probably is faster but I would not do that anymore.

When it comes to games, I am a fan of pixel art. Vector graphics looks somehow too clean. Maybe it is just good memories of old games that make it seem like that. Anyway, I think it is superior to have resolution-independent vector graphics. It is even possible to "emulate" a pixelated look. There should be no advantage to keeping control of individual pixels. Maybe the efficiency... but it will not turn out to be a good decision in the long run.
We are 3 decades from the invention of the GUI. It is not a stretch to say that the hardware we have today, even in mobile and embedded in CPUs, is millions of times more capable than the hardware used by the first GUIs.

Yet... efficiency matters more than ever. The pixel density of our displays has started climbing, and yet is nowhere near the vernier acuity of our eyes, which is what would be required to make pixels imperceptible. We now have complex 3D games that run on the battery powered devices that fit in our pockets. On our desktops and games consoles, the relentless march of games continues to demand every bit of efficiency that can be extracted from the hardware. 3D engines, in particular, are more optimized than ever: they're starting to fit all of the "critical information" into contiguous buffers for fast SIMD processing; they're beginning to make use of intelligent ordering designs to split the culling work (and similar) between multiple threads. It's only in the last 4 years or so that graphics drivers and APIs have started to support multithreaded rendering.

I would not want to bet against efficiency in my low level graphics system. I would also not completely abstract away the pixel grid: pixel addressability is sometimes required; consider font rendering.

I would, instead, build a highly efficient low level graphics system. I would then build a high level GUI system on top of that. At the same time, game developers and similar would code directly against the low level system (which, for hardware acceleration, would probably use OpenGL or similar - for modern systems, I'd suggest supporting just OpenGL 3.2 Core and later, and for downlevel cards OpenGL ES 1.0/1.1/2.0)

That high level system would probably use the EM as the primary layout unit, which would take DPI scaling and the user's desired font size into account. It would be based entirely upon floating point coordinates, to enable subpixel layout and flexible transforms and scaling.
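As a rough sketch of what that resolution step means in practice (a hypothetical helper, assuming 1 em = the user's font size in points and 1 point = 1/72 inch):

Code:
/* Resolve an EM-based length to device pixels for a given display. */
static float em_to_pixels(float ems, float font_size_pt, float dpi)
{
    return ems * font_size_pt * dpi / 72.0f;   /* points -> inches -> pixels */
}

/* Example: a 10 em wide button with a 12 pt user font size is
   160 px at 96 dpi and 320 px at 192 dpi - the same physical size,
   just rendered onto a denser pixel grid. */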
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
Owen wrote:
Brendan wrote:
Owen wrote:What do you care more about, saving minute proportions of RAM, or performance?
Graphics isn't (or shouldn't be) as simple as "application generates graphics and video driver displays them". Typically there are layers of software between the applications and the video drivers, and these middle layers do a significant amount of processing.
You're certain of that? Have you looked into how Wayland, the Windows DWM, Android's SurfaceFlinger and similar work?

For full screen applications, the window server allocates a scanout capable buffer as the application's backing store and points the framebuffer hardware at it. This works perfectly and with perfect efficiency until a notification (or other overlay) shows up. At that point, the window manager switches back to the standard compositing pathway it uses when multiple windows are on screen.

Why would you not do this?
Because it's retarded?
Owen wrote:Why would you sacrifice the efficiency that is trivially gained?
Because the efficiency you "gain" from failing to provide adequate abstractions between hardware and software is negligible at best (and a significant efficiency loss at worst); and the flexibility you sacrifice in a misguided attempt to "gain" that efficiency is far more important.
Owen wrote:Consider the application Bar. Bar draws a complex 3D scene involving complex per-pixel dynamic lighting fragment shaders, tessellation hull and domain shaders for improved fidelity, and geometry shaders for efficient particle effects based upon taking a series of points and generating a camera-facing quadrilateral in their place. In addition, Bar uses dynamic occlusion queries where it tells the GPU to take a chunk of geometry and pretend to draw it, and asks it "If you had drawn it, would it have been visible?", which it uses for conservative visibility determination. It also uses render-to-texture in order to calculate shadow volumes.

Now, what format should this be in: should the GUI server receive an unyielding series of commands it has no insight to and which it can't modify without breaking the app, or should it just take the image buffer that the application produced each frame and use that as the input for whatever compositing it might do?
The GUI should receive a series of standardised commands that it has a complete understanding of; which it can easily modify without breaking the app (and without a massive amount of "repeatedly reprocesses individual pixels that may or may not be visible anyway" overhead). These standardised commands should describe what should be drawn, not how it should be drawn (as "how it should be drawn" is up to the renderer and may be extremely different for different cases - e.g. "lower quality" real-time hardware accelerated video vs. "extremely high quality" ray tracing used by a printer driver). For example; an application would describe a complex 3D scene (textures, meshes, particles, lights, camera, etc), and the renderer should decide how shaders (if any) should be used, what is occluded and what isn't, whether or not to use render-to-texture for shadow volumes or any other technique instead, and all of the other implementation details that the application should never have needed to care about to begin with.
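As a minimal sketch of that separation (hypothetical names; nothing here is a real driver interface), the same scene description could simply be handed to whichever renderer is appropriate:

Code:
/* The application describes *what* to draw; each back end decides *how*. */
typedef struct scene scene_t;   /* meshes, textures, lights, camera, ... (opaque here) */

typedef struct {
    void (*render)(const scene_t *s, void *target);
} renderer_t;

/* Assumed back ends with very different "how" for the same "what": */
void raster_render(const scene_t *s, void *target);    /* real-time: shaders, occlusion, etc. */
void raytrace_render(const scene_t *s, void *target);  /* slow/high quality, e.g. printer driver */

void draw_scene(const renderer_t *r, const scene_t *s, void *target)
{
    /* The app never chose shaders, shadow techniques or render-to-texture;
       whichever renderer sits behind r makes those decisions. */
    r->render(s, target);
}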


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: UEFI+BIOS bootloader combination

Post by Owen »

Brendan wrote:The GUI should receive a series of standardised commands that it has a complete understanding of; which it can easily modify without breaking the app (and without a massive amount of "repeatedly reprocesses individual pixels that may or may not be visible anyway" overhead). These standardised commands should describe what should be drawn, not how it should be drawn (as "how it should be drawn" is up to the renderer and may be extremely different for different cases - e.g. "lower quality" real-time hardware accelerated video vs. "extremely high quality" ray tracing used by a printer driver). For example; an application would describe a complex 3D scene (textures, meshes, particles, lights, camera, etc), and the renderer should decide how shaders (if any) should be used, what is occluded and what isn't, whether or not to use render-to-texture for shadow volumes or any other technique instead, and all of the other implementation details that the application should never have needed to care about to begin with.


Cheers,

Brendan
So you think that you can produce the one final retained mode 3D engine to rule all 3D engines?

When Microsoft failed at that with early versions of DirectX - and not without trying?

Where SGI failed at that - again, not without trying (remember the days when all 3D was done on SGIs)?

When Electronic Arts maintains two in house engines for most games, plus more specialized engines for others, and all their competitors do similar?

Do you know why this is delegated to the application developer? Because they're the only people that know. You can't "just" switch out shadow techniques as you wish (for example), because they all have different tradeoffs, performance models and side effects (e.g. a game with 20 dynamic lights might want faster but uglier shadows, while one with 2 will probably go for the prettier option; some shadow techniques interact badly with certain types of geometry).

Unreal Engine 4's global illumination system is just one example: the artist has to exclude certain objects or performance and quality both significantly degrade.

Oh, and you plan to use the same assets for raytracing? You're deluded. Rasterization and raytracing take very different geometry formats.

If you think you can write the 3D engine to end all 3D engines when companies investing billions into them can't... you're deluded.
Brendan wrote:
Owen wrote: You're certain of that? Have you looked into how Wayland, the Windows DWM, Android's SurfaceFlinger and similar work?

For full screen applications, the window server allocates a scanout capable buffer as the application's backing store and points the framebuffer hardware at it. This works perfectly and with perfect efficiency until a notification (or other overlay) shows up. At that point, the window manager switches back to the standard compositing pathway it uses when multiple windows are on screen.

Why would you not do this?
Because it's retarded?
So efficiency is retarded now?

Wow, you must live in a messed up world.

Cheers,

Owen
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
Brendan wrote:
Owen wrote:Why would you not do this?
Because it's retarded?
I thought I should probably clarify what I meant; with a little experiment.

Start "glxgears" in a large window, and write down the frame rate you get out of it (I get about 2000 frames per second).

Now open some other window (doesn't matter what it is as long as there's no transparency - I used a text editor) and arrange that other window so that it covers almost all of the "glxgears" window, so that only a thin strip of black pixels from "glxgears" is showing. Based on the original "entire window visible" frame rate and the amount that is now invisible, write down an estimate of how much you think the frame rate should've improved. My guess was about 10000 times faster (given that all of the polygons should've been clipped and nothing needed to be drawn at all). Check what frame rate you're actually getting now (I get about 4000 frames per second) and see how much the frame rate improved (about 2 times faster). Is this anywhere near what you expected (no); or is it many orders of magnitude worse than what I should've been able to expect (yes)?

Now arrange the other window so that the entire "glxgears" window is entirely covered. Write down a new estimate of how much you think the frame rate should've improved. I've got 16 CPUs that are capable of doing "nothing" about 4 billion times per second, so my estimate was 64 billion frames per second. Check what frame rate you're actually getting now (I still get about 4000 frames per second). Is this anywhere near what you expected; or have you realised why I call it retarded? ;)

Now minimise the "glxgears" window and try again. My estimate was that it should've reported the frame rate as "NaN". I still get about 4000 frames per second. I was expecting "retarded" from the start, so does this mean I'm an optimist?
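For comparison, a minimal sketch of what a visibility-aware render loop could do instead (the query/notification functions here are hypothetical):

Code:
/* Do nothing at all while the compositor reports the window as
   minimised or fully covered. */
typedef struct { int visible_w, visible_h, minimised; } window_state_t;

/* Assumed to be provided by the windowing system: */
void get_window_state(window_state_t *out);
void wait_for_visibility_change(void);
void render_frame(void);

void render_loop(void)
{
    window_state_t st;
    for (;;) {
        get_window_state(&st);
        if (st.minimised || st.visible_w == 0 || st.visible_h == 0) {
            wait_for_visibility_change();   /* burn no CPU and no GPU */
            continue;
        }
        render_frame();
    }
}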

[EDIT]
I've got 2 completely separate ATI video cards (both driving separate monitors). For fun, I positioned a "glxgears" window so that half was on each monitor. With 2 entire video cards doing half the work each, the efficiency "improved" from 4000 frames per second all the way "up" to 200 frames per second! This is obvious proof that I'm deluded, and that it's impossible to do anything more efficiently... 8)
[/EDIT]


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
Owen wrote:So you think that you can produce the one final retained mode 3D engine to rule all 3D engines?
No; that's why I want to provide an adequate abstraction, where applications only describe what to draw and the 3D rendering engine (built into video drivers, etc) can be improved without breaking everything.
Owen wrote:Oh, and you plan to use the same assets for raytracing? You're deluded. Rasterization and raytracing take very different geometry formats.
Is there any technical reason why the same "description of what to draw" can't be used by both rasterisation and ray tracing? If there is, I'd like to know what they might be.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: UEFI+BIOS bootloader combination

Post by rdos »

Brendan wrote:Now open some other window (doesn't matter what it is as long as there's no transparency - I used a text editor) and arrange that other window so that it covers almost all of the "glxgears" window, so that only a thin strip of black pixels from "glxgears" is showing. Based on the original "entire window visible" frame rate and the amount that is now invisible, write down an estimate of how much you think the frame rate should've improved. My guess was about 10000 times faster (given that all of the polygons should've been clipped and nothing needed to be drawn at all). Check what frame rate you're actually getting now (I get about 4000 frames per second) and see how much the frame rate improved (about 2 times faster). Is this anywhere near what you expected (no); or is it many orders of magnitude worse than what I should've been able to expect (yes)?
Yes, it is what we expect. We certainly wouldn't expect a large difference since the only thing that didn't happen was the actual video-output.
Brendan wrote: Now arrange the other window so that the entire "glxgears" window is entirely covered. Write down a new estimate of how much you think the frame rate should've improved. I've got 16 CPUs that are capable of doing "nothing" about 4 billion times per second, so my estimate was 64 billion frames per second. Check what frame rate you're actually getting now (I still get about 4000 frames per second). Is this anywhere near what you expected; or have you realised why I call it retarded? ;)

Now minimise the "glxgears" window and try again. My estimate was that it should've reported the frame rate as "NaN". I still get about 4000 frames per second. I was expecting "retarded" from the start, so does this mean I'm an optimist?
Also expected. 4000 frames per second is what the algorithm can produce with no video output.
Brendan wrote: I've got 2 completely separate ATI video cards (both driving separate monitors). For fun, I positioned a "glxgears" window so that half was on each monitor. With 2 entire video cards doing half the work each, the efficiency "improved" from 4000 frames per second all the way "up" to 200 frames per second! This is obvious proof that I'm deluded, and that it's impossible to do anything more efficiently... 8)
It is very possible to improve these special cases (at the expense of the typical case), but why would anybody want to do that? Who cares about the speed of a hidden or partially hidden animation?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
rdos wrote:
Brendan wrote:Now open some other window (doesn't matter what it is as long as there's no transparency - I used a text editor) and arrange that other window so that it covers almost all of the "glxgears" window, so that only a thin strip of black pixels from "glxgears" is showing. Based on the original "entire window visible" frame rate and the amount that is now invisible, write down an estimate of how much you think the frame rate should've improved. My guess was about 10000 times faster (given that all of the polygons should've been clipped and nothing needed to be drawn at all). Check what frame rate you're actually getting now (I get about 4000 frames per second) and see how much the frame rate improved (about 2 times faster). Is this anywhere near what you expected (no); or is it many orders of magnitude worse than what I should've been able to expect (yes)?
Yes, it is what we expect. We certainly wouldn't expect a large difference since the only thing that didn't happen was the actual video-output.
I would expect that people here are smart enough to realise that the "list of commands" approach can easily fix 95% of the "wasting time rendering stuff that isn't visible anyway" problem (and drastically improve the efficiency of "post-processing" an application's graphics if/when it is visible).
rdos wrote:It is very possible to improve these special cases (at the expense of the typical case), but why would anybody want to do that? Who cares about the speed of a hidden or partially hidden animation?
Except they aren't special cases. For normal use (e.g. GUI with several windows, browser tabs, etc) most of the graphics generated by most applications (and most of the GUI's desktop/background) isn't visible.
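A minimal sketch of that culling step (reusing the hypothetical command-record idea from earlier in the thread):

Code:
#include <stdint.h>

typedef struct { int x0, y0, x1, y1; } rect_t;                 /* visible region, half-open */
typedef struct { int x1, y1, x2, y2; uint32_t argb; } draw_cmd_t;

static int overlaps(const draw_cmd_t *c, const rect_t *vis)
{
    return c->x1 < vis->x1 && c->x2 > vis->x0 &&
           c->y1 < vis->y1 && c->y2 > vis->y0;
}

/* Keep only the commands that touch the visible region; everything
   else is dropped before any rendering happens. */
static int cull_commands(const draw_cmd_t *in, int n, draw_cmd_t *out, const rect_t *vis)
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (overlaps(&in[i], vis))
            out[kept++] = in[i];
    return kept;
}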


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: UEFI+BIOS bootloader combination

Post by rdos »

Brendan wrote: I would expect that people here are smart enough to realise that the "list of commands" approach can easily fix 95% of the "wasting time rendering stuff that isn't visible anyway" problem (and drastically improve the efficiency of "post-processing" an application's graphics if/when it is visible).
They can't. Even if the commands are completely ignored (the completely hidden case), the application will still do a lot of work to create the scenes, and your extremely slow command lists will still be allocated and sent to the video driver. In the partial case, your commands must employ some smart tactics as to which commands to ignore, which to modify and which to send. This requires extra processing (slows down the fully visible case). It is also possible to do the same thing without commands, but that also slows down the typical case.
Brendan wrote: Except they aren't special cases. For normal use (e.g. GUI with several windows, browser tabs, etc) most of the graphics generated by most applications (and most of the GUI's desktop/background) isn't visible.
That's the problem of typical GUIs, not the graphics API. This is easily avoided by not having any windows, only supporting full-screen applications.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: UEFI+BIOS bootloader combination

Post by Combuster »

rdos wrote:your extremely slow command lists
Biased remark based on your own ignorance. In fact, manual pixelpumping in my OS is slower than sending render commands since the driver gets to convert that to accelerator instructions.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
rdos wrote:
Brendan wrote:I would expect that people here are smart enough to realise that the "list of commands" approach can easily fix 95% of the "wasting time rendering stuff that isn't visible anyway" problem (and drastically improve the efficiency of "post-processing" an application's graphics if/when it is visible).
They can't. Even if the commands are completely ignored (the completely hidden case), the application will still do a lot of work to create the scenes, and your extremely slow command lists will still be allocated and sent to the video driver.
If the graphics are completely ignored; then in both cases the application decides what to draw. After that, in my case the application appends some tiny little commands to a buffer (which is sent, then ignored). In your case the application does a massive amount of pointless pixel pounding (which is sent, then ignored). If you think my way is slower, you're either a moron or a troll.
rdos wrote:In the partial case, your commands must employ some smart tactics as to which commands to ignore, which to modify and which to send. This requires extra processing (slows down the fully visible case). It is also possible to do the same thing without commands, but that also slows down the typical case.
The "smart tactics" are a small amount of work that avoids a much larger amount of work. For the fully visible case, the only extra processing is appending commands to a list and decoding commands from the list (all the rendering will cost the same, it just happens in a different place). If nothing happens to the graphics after the application creates it, then your way would avoid the negligible extra work of creating the list of commands and decoding it. However, if anything does needs to happen to the data afterwards (e.g. rotating, scaling, etc; or even just copying it "as is" from the application's buffer into a GUI's buffer) then your way ends up slower.

Basically, for the common case (where something does need to happen to the application's graphics after the application has created it, regardless of what that "something" is), your way is slower. However, for your special case (e.g. where the OS is so lame that all applications are full screen and nothing can possibly happen to the application's graphics afterwards), your way might be unnoticeably faster.

Also note that you are completely ignoring the massive amount of extra flexibility that "list of commands" provides (e.g. trivial to send the "list of commands" to multiple video cards, over a network, to a printer, to a file, etc).
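A minimal sketch of that fan-out (the sink interface is hypothetical):

Code:
#include <stddef.h>

/* The same small command buffer goes to any number of consumers:
   video driver(s), a network connection, a file, a printer driver. */
typedef struct sink {
    void (*submit)(struct sink *self, const void *cmds, size_t bytes);
} sink_t;

void broadcast_commands(sink_t **sinks, int n, const void *cmds, size_t bytes)
{
    for (int i = 0; i < n; i++)
        sinks[i]->submit(sinks[i], cmds, bytes);   /* each sink clips/renders/stores as it sees fit */
}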
rdos wrote:That's the problem of typical GUIs, not the graphics API. This is easily avoided by not having any windows, only supporting full-screen applications.
I heard that it's possible to avoid testicular cancer by chopping your testicles off with a rusty knife. This might sound painful; but it makes a lot more sense than not having any GUI and not having any windows.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: UEFI+BIOS bootloader combination

Post by Owen »

Combuster wrote:
rdos wrote:your extremely slow command lists
Biased remark based on your own ignorance. In fact, manual pixelpumping in my OS is slower than sending render commands since the driver gets to convert that to accelerator instructions.
GPU accelerated path is faster than non accelerated path? What a surprise.

Sending generic commands to a graphics driver, which must then translate them into device-specific commands, is slower than calling a driver function which just places a GPU command into a command buffer and, when that is full, asks the driver to take it and pass it to the hardware (and give it another buffer)? Sounds only logical.
Brendan wrote:I thought I should probably clarify what I meant; with a little experiment.

Start "glxgears" in a large window, and write down the frame rate you get out of it (I get about 2000 frames per second).

Now open some other window (doesn't matter what it is as long as there's no transparency - I used a text editor) and arrange that other window so that it covers almost all of the "glxgears" window, so that only a thin strip of black pixels from "glxgears" is showing. Based on the original "entire window visible" frame rate and the amount that is now invisible, write down an estimate of how much you think the frame rate should've improved. My guess was about 10000 times faster (given that all of the polygons should've been clipped and nothing needed to be drawn at all). Check what frame rate you're actually getting now (I get about 4000 frames per second) and see how much the frame rate improved (about 2 times faster). Is this anywhere near what you expected (no); or is it many orders of magnitude worse than what I should've been able to expect (yes)?

Now arrange the other window so that the entire "glxgears" window is entirely covered. Write down a new estimate of how much you think the frame rate should've improved. I've got 16 CPUs that are capable of doing "nothing" about 4 billion times per second, so my estimate was 64 billion frames per second. Check what frame rate you're actually getting now (I still get about 4000 frames per second). Is this anywhere near what you expected; or have you realised why I call it retarded? ;)

Now minimise the "glxgears" window and try again. My estimate was that it should've reported the frame rate as "NaN". I still get about 4000 frames per second. I was expecting "retarded" from the start, so does this mean I'm an optimist?
So you're taking GLXGears - the height of 1992 3D graphics - on top of X11 - the height of windowing systems in 1984, and declaring that because it sucks (and people who don't think X11 sucks are a rarity), all graphics systems suck?

GLXGears has one mission in life: push as many frames as possible. It's dumb, it's stupid, it doesn't listen for minimize events, etc. Yeah. We get it. Way to choose a terrible example.

Yes, I expected it to do 4000 frames a second. GLXGears is terrible: it's entirely CPU bound and it's using crappy old OpenGL immediate mode from 1992. Your GLXGears framerate is a better proxy for how quickly your machine can context switch than anything graphics related.

If you'd chosen a decent example (say, something built upon a widget toolkit), you would have found that they stop rendering when they're not visible because they never receive paint events.

Of course, if you were implementing a modern GUI system you might decide to be smart and take away GLXGears' OpenGL context whenever it was minimized (or a full screen application was running in front of it). You know, like Android does. Actually, at that point GLXGears would crash, but that's only because it's a crappy program (and because X11 never takes away your context).
Brendan wrote:I've got 2 completely separate ATI video cards (both driving separate monitors). For fun, I positioned a "glxgears" window so that half was on each monitor. With 2 entire video cards doing half the work each, the efficiency "improved" from 4000 frames per second all the way "up" to 200 frames per second! This is obvious proof that I'm deluded, and that it's impossible to do anything more efficiently... 8)
So I take it you're using an old and crappy non-compositing window manager then? Because that's the only situation in which it would become relevant that the window was interposed between two displays.

Of course, in that case each graphics card isn't doing half the work each. Assuming the commands are going to both cards, they're both rendering the scene. Even if you managed to spatially divide the work up precisely between the cards... they'd still be rendering more than half of the triangles each (because of overlap).

Whatever is going on, it's not smart, and it says more about the X server than GLXGears.

What I suspect is happening is that the "primary" card is rendering the image. Because it's split between the two screens, X11 can't do its normal non-composited thing and render directly into a scissor-rect of the desktop. Therefore, it's falling back to rendering to a buffer, and it's probably picked something crappy like a pixmap (allocated in system memory), then copying out of that pixmap to the framebuffer on the CPU.

Even if you're using a compositing window manager, X11 barely counts as modern. If you're not using a compositing window manager, yeah, expect these problems because that's just how X11 is.
Brendan wrote:Hi,
Owen wrote:So you think that you can produce the one final retained mode 3D engine to rule all 3D engines?
No; that's why I want to provide an adequate abstraction, where applications only describe what to draw and the 3D rendering engine (built into video drivers, etc) can be improved without breaking everything.
OK, so you do expect to make the one final retained mode 3D engine to rule all 3D engines then.

[quote="Brendan"
Owen wrote:Oh, and you plan to use the same assets for raytracing? You're deluded. Rasterization and raytracing take very different geometry formats.
Is there any technical reason why the same "description of what to draw" can't be used by both rasterisation and ray tracing? If there is, I'd like to know what they might be.[/quote]

Because efficient and pretty rasterization uses triangle meshes, normal and/or displacement maps, fragment shaders with atomic counters for real time global illumination, stencil buffer "hacks" for shadows, etc, all of which are engineered for the rasterization pipeline.

"Efficient" and pretty raytracing, on the other hand, uses NURBS and other mathematical models of shapes which avoid the discontinuities of such vertex based geometry.

(Also note that, for a lot of materials, the "rasterization plus a lot of trickery" model works a lot better than raytracing, because raytracing breaks down for diffuse illumination)
Brendan wrote:If the graphics are completely ignored; then in both cases the application decides what to draw. After that, in my case the application appends some tiny little commands to a buffer (which is sent, then ignored). In your case the application does a massive amount of pointless pixel pounding (which is sent, then ignored). If you think my way is slower, you're either a moron or a troll.
GLXGears is hardly doing any pixel pounding. In fact, it's doing none at all. It's building command lists which, if your driver is good, go straight to the hardware. If your driver is mediocre, then... well, all sorts of shenanigans can occur.


[quote="Brendan"
rdos wrote:In the partial case, your commands must employ some smart tactics as to which commands to ignore, which to modify and which to send. This requires extra processing (slows down the fully visible case). It is also possible to do the same thing without commands, but that also slows down the typical case.
The "smart tactics" are a small amount of work that avoids a much larger amount of work. For the fully visible case, the only extra processing is appending commands to a list and decoding commands from the list (all the rendering will cost the same, it just happens in a different place). If nothing happens to the graphics after the application creates it, then your way would avoid the negligible extra work of creating the list of commands and decoding it. However, if anything does needs to happen to the data afterwards (e.g. rotating, scaling, etc; or even just copying it "as is" from the application's buffer into a GUI's buffer) then your way ends up slower.[/quote]

Assuming that the application just draws its geometry to the window with no post processing, yes, you will get a negligible speed boost (any 3D accelerator can scale and rotate bitmaps at a blistering speed).

Of course, for any modern 3D game (where modern covers the last 7 years or so) what you'll find is that the geometry you're scaling is a quad with one texture attached and a fragment shader doing the last post-processing step.

Of course, that speed boost will only apply if you're drawing the contents of that window once. Of course, most non-game windows don't change every frame (and we have established already that for game windows it's pointless).

Basically, for the common case (where something does need to happen to the application's graphics after the application has created it, regardless of what that "something" is), your way is slower. However, for the special case (e.g. where the application is full screen)... oh wait, your way is slower.
Brendan wrote:Also note that you are completely ignoring the massive amount of extra flexibility that "list of commands" provides (e.g. trivial to send the "list of commands" to multiple video cards, over a network, to a printer, to a file, etc).
What's more efficient:
  • Executing the same commands on multiple independent GPUs
  • Executing them on one GPU (or multiple coupled GPUs, i.e. CrossFire/SLI)
Answer: The latter, because it uses less power (and because it lets you use the second GPU for other things, maybe accelerating physics or running the GUI)

Also, I don't know how you intend to do buffer readbacks when you're slicing the commands between the GPUs, or deal with when one GPU only supports the fixed function pipeline and the other supports all the latest shiny features.

Any GUI lets you send the commands over a network or to a file; you just need to implement the appropriate "driver"

To a printer is a special case. Actually, an OpenVG-like API to drive a printer isn't a bad idea... Cairo essentially implements that (with the PostScript/PDF backends). It makes sense for 2D... not so much for 3D. But that's nothing new; Microsoft managed that abstraction decades ago; you've used GDI to talk to printers in Windows since eternity.

3D rendering to a printer? There is no "one size fits all" solution, just like there isn't for converting a 3D scene to an image today.

Every good system is built upon layers. For graphics, I view three:
  • The low level layer; your OpenGL and similar; direct access for maximum performance. Similarly, you might allow direct PostScript access to PostScript printers (for example)
  • The mid level layer; your vector 2D API; useful for drawing arbitrary graphics. Cairo or OpenVG.
  • The high level layer; your GUI library; uses the vector API to draw widgets
You'll find all competent systems are built this way today, and all incompetent systems evolve to look like it (e.g. X11 has evolved Cairo, because X11's rendering APIs suck)

Cheers,

Owen
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: UEFI+BIOS bootloader combination

Post by Brendan »

Hi,
Owen wrote:
Brendan wrote:I thought I should probably clarify what I meant; with a little experiment.

Start "glxgears" in a large window, and write down the frame rate you get out of it (I get about 2000 frames per second).

Now open some other window (doesn't matter what it is as long as there's no transparency - I used a text editor) and arrange that other window so that it covers almost all of the "glxgears" window, so that only a thin strip of black pixels from "glxgears" is showing. Based on the original "entire window visible" frame rate and the amount that is now invisible, write down an estimate of how much you think the frame rate should've improved. My guess was about 10000 times faster (given that all of the polygons should've been clipped and nothing needed to be drawn at all). Check what frame rate you're actually getting now (I get about 4000 frames per second) and see how much the frame rate improved (about 2 times faster). Is this anywhere near what you expected (no); or is it many orders of magnitude worse than what I should've been able to expect (yes)?

Now arrange the other window so that the entire "glxgears" window is entirely covered. Write down a new estimate of how much you think the frame rate should've improved. I've got 16 CPUs that are capable of doing "nothing" about 4 billion times per second, so my estimate was 64 billion frames per second. Check what frame rate you're actually getting now (I still get about 4000 frames per second). Is this anywhere near what you expected; or have you realised why I call it retarded? ;)

Now minimise the "glxgears" window and try again. My estimate was that it should've reported the frame rate as "NaN". I still get about 4000 frames per second. I was expecting "retarded" from the start, so does this mean I'm an optimist?
So you're taking GLXGears - the height of 1992 3D graphics - on top of X11 - the height of windowing systems in 1984, and declaring that because it sucks (and people who don't think X11 sucks are a rarity), all graphics systems suck?
I was lazy and it was an easy test to do. I downloaded "glxgears" for Windows (Vista) and did the experiment again - exactly the same speed when everything except a thin black strip is covered, and 5 times faster when the window is minimised. This shows that both X and Vista suck (but doesn't rule out the fact that glxgears sucks).

I don't have any 3D games for Linux. Instead I tested the frame rate of Minecraft on Windows. I set the rendering distance to "far" and stood on top of a very tall tower looking out towards the horizon, so that there's lots of graphics on the bottom of the screen and nothing but blue sky at the top of the screen. With the game paused and the window entirely visible I got 25 frames per second, and with the game paused and everything except for part of the sky (and the FPS display) covered up I got the same 25 frames per second - no speedup at all. I didn't/couldn't measure it minimised. Then I decided to try Crysis and got the same frame rates (for both the menu and while playing the game) regardless of how much of the game's window is obscured.

Finally (for fun) I decided to try Minecraft and Crysis running (and both visible) at the same time. Minecraft was paused and Crysis seemed playable. After loading a saved game in Crysis, I ran up to a beach and decided to unload some ammo into a passing turtle. As soon as I pressed the trigger the graphics froze, then Minecraft and Crysis both crashed at the same time. Three cheers for mouldy-tasking OSs! ;)

Feel free to conduct your own tests using whatever games and/or 3D applications you like, running on whatever OS you like, if you need further confirmation that existing graphics systems suck.
Owen wrote:
Brendan wrote:I've got 2 completely separate ATI video cards (both driving separate monitors). For fun, I positioned a "glxgears" window so that half was on each monitor. With 2 entire video cards doing half the work each, the efficiency "improved" from 4000 frames per second all the way "up" to 200 frames per second! This is obvious proof that I'm deluded, and that it's impossible to do anything more efficiently... 8)
So I take it you're using an old and crappy non-compositing window manager then? Because that's the only situation in which it would become relevant that the window was interposed in between two displays
I'm using KDE 4, which is meant to support compositing (but every time I attempt to enable compositing effects it complains that something else is using the graphics accelerator and refuses). To be honest; it was hard enough just getting it to run OpenGL in a window (different versions of X and ATI drivers and libraries and whatever, all with varying degrees of "unstable"); and once I found a magic combination that seemed to work I stopped updating any of it in case I upset something.
Owen wrote:Of course, in that case each graphics card isn't doing half the work each. Assuming the commands are going to both cards, they're both rendering the scene. Even if you managed to spatially divide the work up precisely between the cards... they'd still be rendering more than half of the triangles each (because of overlap).
True; but I'd expect worst case to be the same frame rate when the window is split across 2 video cards, rather than 50 times slower.
Owen wrote:Whatever is going on, it's not smart, and it says more about the X server than GLXGears.

What I suspect is happening is that the "primary" card is rendering the image. Because it's split between the two screens, X11 can't do its normal non-composited thing and render directly into a scissor-rect of the desktop. Therefore, it's falling back to rendering to a buffer, and it's probably picked something crappy like a pixmap (allocated in system memory), then copying out of that pixmap to the framebuffer on the CPU.

Even if you're using a compositing window manager, X11 barely counts as modern. If you're not using a compositing window manager, yeah, expect these problems because that's just how X11 is.
Yes, but I doubt Wayland (or Windows) will be better. The problem is that the application expects to draw to a buffer in video display memory, and that buffer can't be in 2 different video cards at the same time. There's many ways to make it work (e.g. copy from one video card to another) but the only way that doesn't suck is my way (make the application create a "list of commands", create 2 copies and do clipping differently for each copy).
Owen wrote:OK, so you do expect to make the one final retained mode 3D engine to rule all 3D engines then.
I expect to create a standard set of commands for describing the contents of (2D) textures and (3D) volumes; where the commands say where (relative to the origin of the texture/volume being described) different things (primitive shapes, other textures, other volumes, lights, text, etc) should be, and some more commands set attributes (e.g. ambient light, etc). Applications create these lists of commands (but do none of the rendering).
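As a minimal sketch of what one record in such a list might look like (every name and field here is hypothetical, just to make the idea concrete):

Code:
#include <stdint.h>

/* One record per "thing placed" in the texture/volume being described. */
enum desc_op {
    DESC_PLACE_PRIMITIVE,   /* box, sphere, referenced mesh, ... */
    DESC_PLACE_TEXTURE,     /* reference to another described texture */
    DESC_PLACE_VOLUME,      /* reference to another described volume */
    DESC_PLACE_LIGHT,
    DESC_PLACE_TEXT,
    DESC_SET_ATTRIBUTE      /* e.g. ambient light */
};

typedef struct {
    uint16_t op;            /* enum desc_op */
    uint32_t ref;           /* id of the primitive/texture/volume/font being placed */
    float    x, y, z;       /* position relative to this description's origin */
    float    params[4];     /* op-specific: scale, rotation, colour, intensity, ... */
} desc_cmd_t;

The renderer walks such a list and makes every decision about shaders, occlusion and quality on its own.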
Owen wrote:
Brendan wrote:
Owen wrote:Oh, and you plan to use the same assets for raytracing? You're deluded. Rasterization and raytracing take very different geometry formats.
Is there any technical reason why the same "description of what to draw" can't be used by both rasterisation and ray tracing? If there is, I'd like to know what they might be.
Because efficient and pretty rasterization uses triangle meshes, normal and/or displacement maps, fragment shaders with atomic counters for real time global illumination, stencil buffer "hacks" for shadows, etc, all of which are engineered for the rasterization pipeline.

"Efficient" and pretty raytracing, on the other hand, uses NURBS and other mathematical models of shapes which avoid the discontinuities of such vertex based geometry.
So you're saying that it's entirely possible to use ray tracing on data originally intended for rasterization, and that the ray traced version can be much higher quality than the "real time rasterized" version; and your only complaint is that you'd get even higher quality images from ray tracing if mathematical models of shapes were used instead of things like meshes?
Owen wrote:
Brendan wrote:If the graphics are completely ignored; then in both cases the application decides what to draw. After that, in my case the application appends some tiny little commands to a buffer (which is sent, then ignored). In your case the application does a massive amount of pointless pixel pounding (which is sent, then ignored). If you think my way is slower, you're either a moron or a troll.
GLXGears is hardly doing any pixel pounding. In fact, it's doing none at all. It's building command lists which, if your driver is good, go straight to the hardware. If your driver is mediocre, then... well, all sorts of shenanigans can occur.
I don't think we were talking about GLXGears here; but anyway...

For GLXGears, the application sends a list of commands to the video card's hardware, and the video card's hardware does a massive pile of pointless pixel pounding, then the application sends the results to the GUI (which ignores/discards it)?

Why not just send the list of commands to the GUI instead, so that the GUI can ignore it before the prodigious pile of pointless pixel pounding occurs?
Owen wrote:
Brendan wrote:
rdos wrote:In the partial case, your commands must employ some smart tactics as to which commands to ignore, which to modify and which to send. This requires extra processing (slows down the fully visible case). It is also possible to do the same thing without commands, but that also slows down the typical case.
The "smart tactics" are a small amount of work that avoids a much larger amount of work. For the fully visible case, the only extra processing is appending commands to a list and decoding commands from the list (all the rendering will cost the same, it just happens in a different place). If nothing happens to the graphics after the application creates it, then your way would avoid the negligible extra work of creating the list of commands and decoding it. However, if anything does needs to happen to the data afterwards (e.g. rotating, scaling, etc; or even just copying it "as is" from the application's buffer into a GUI's buffer) then your way ends up slower.
Assuming that the application just draws its geometry to the window with no post processing, yes, you will get a negligible speed boost (any 3D accelerator can scale and rotate bitmaps at a blistering speed).

Of course, for any modern 3D game (where modern covers the last 7 years or so) what you'll find is that the geometry you're scaling is a quad with one texture attached and a fragment shader doing the last post-processing step.
So for a modern 3D game; you start with a bunch of vertexes and textures, rotate/scale/whatever the vertexes wrong (e.g. convert "3D world-space co-ords" into "wrong 2D screen space co-ords"), then use these wrong vertexes to draw the screen wrong, then rotate/scale/whatever a second time to fix the mess you made; and for some amazing reason that defies any attempt at logic, doing it wrong and then fixing your screw-up is "faster" than just doing the original rotating/scaling/whatevering correctly to begin with? Yay!
Owen wrote:Of course, that speed boost will only apply if you're drawing the contents of that window once. Of course, most non-game windows don't change every frame (and we have established already that for game windows it's pointless).
...and of course the idea of caching things is older than I am.
Owen wrote:Basically, for the common case (where something does need to happen to the application's graphics after the application has created it, regardless of what that "something" is), your way is slower.
...because doing work you throw away, and then doing things wrong and wasting time fixing your screw-up is faster than doing things right and not wasting that time. In the same way, the fastest way to build a 3-bedroom house is to build five 2-bedroom houses, then demolish four of them, then add an extension to the remaining house.
Owen wrote:
Brendan wrote:Also note that you are completely ignoring the massive amount of extra flexibility that "list of commands" provides (e.g. trivial to send the "list of commands" to multiple video cards, over a network, to a printer, to a file, etc).
What's more efficient:
  • Executing the same commands on multiple independent GPUs
  • Executing them on one GPU (or multiple coupled GPUs, i.e. CrossFire/SLI)
Answer: The latter, because it uses less power (and because it lets you use the second GPU for other things, maybe accelerating physics or running the GUI)
I don't know what your point is. Are you suggesting that "list of commands" is so flexible that one GPU can execute the list of commands once and generate graphics data in 2 different resolutions (with 2 different pixel formats) and transfer data to a completely separate GPU instantly?
Owen wrote:Also, I don't know how you intend to do buffer readbacks when you're slicing the commands between the GPUs, or deal with when one GPU only supports the fixed function pipeline and the other supports all the latest shiny features.
I have no intention of supporting buffer readbacks. When one GPU only supports the fixed function pipeline and the other supports all the latest shiny features; one GPU's device driver will do what its programmer programmed it to do and the other GPU's device driver will do what its programmer programmed it to do.
Owen wrote:Any GUI lets you send the commands over a network or to a file; you just need to implement the appropriate "driver"
Yes; and sending 2 KiB of "list of commands" 60 times per second is a lot more efficient than sending 8 MiB of pixel data 60 times per second; so that "appropriate driver" can just do nothing.
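The rough arithmetic behind those figures (assuming a 1920*1080 display at 32 bits per pixel, which is close to 8 MiB per frame):

Code:
#include <stdio.h>

int main(void)
{
    double frame_bytes = 1920.0 * 1080.0 * 4.0;        /* ~7.9 MiB per frame */
    double pixel_rate  = frame_bytes * 60.0;           /* ~475 MiB/s of raw pixels */
    double cmd_rate    = 2048.0 * 60.0;                /* ~120 KiB/s of commands */
    printf("pixels:   %.1f MiB/s\n", pixel_rate / (1024.0 * 1024.0));
    printf("commands: %.1f KiB/s\n", cmd_rate / 1024.0);
    return 0;
}

That's a difference of roughly four thousand times, before any compression.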
Owen wrote:To a printer is a special case. Actually, an OpenVG-like API to drive a printer isn't a bad idea... Cairo essentially implements that (with the PostScript/PDF backends). It makes sense for 2D... not so much for 3D. But that's nothing new; Microsoft managed that abstraction decades ago; you've used GDI to talk to printers in Windows since eternity.

3D rendering to a printer? There is no "one size fits all" solution, just like there isn't for converting a 3D scene to an image today.
I'll just slap a "rendering quality" slider at the bottom of the "printer properties" dialog box.
Owen wrote:Every good system is built upon layers. For graphics, I view three:
  • The low level layer; your OpenGL and similar; direct access for maximum performance. Similarly, you might allow direct PostScript access to PostScript printers (for example)
  • The mid level layer; your vector 2D API; useful for drawing arbitrary graphics. Cairo or OpenVG.
  • The high level layer; your GUI library; uses the vector API to draw widgets
You'll find all competent systems are built this way today, and all incompetent systems evolve to look like it (e.g. X11 has evolved Cairo, because X11's rendering APIs suck)
Every bad system is also built on layers. The most important thing is that application developers get a single consistent interface for all graphics (not just video cards), rather than a mess of several APIs and libraries that all do similar things in different ways.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.