More faster putpixel than 0ch?

monobogdan · Post by **monobogdan** » Sat Jan 28, 2017 3:24 am

My shell runs too slowly.
Filling a rectangle is very slow (about 10 seconds).

Here is the code:

for i := x to x + w - 1 do      { assumes w is the width in pixels }
	begin
	  for j := y to y + h - 1 do  { assumes h is the height in pixels }
	  begin
		PutPixel(i, j, Color);
	  end;
	end;

How can I make this code faster?

dchapiesky · Post by **dchapiesky** » Sat Jan 28, 2017 3:29 am

Hmmmm.... this is a hard one...

1) Don't use pascal
2) get rid of the p-code interpreter
3) use assembly
4) specifically 512 bit VMX instructions
5) in parallel on multiple cores
6) with prisma-chromatic fiber interconnects between your VRAM and the combustinator

Cheers and good luck!

monobogdan · Post by **monobogdan** » Sat Jan 28, 2017 3:33 am

dchapiesky wrote:Hmmmm.... this is a hard one...

1) Don't use pascal
2) get rid of the p-code interpreter
3) use assembly
4) specifically 512 bit VMX instructions
5) in parallel on multiple cores
6) with prisma-chromatic fiber interconnects between your VRAM and the combustinator

Cheers and good luck!

TP generates small and fast code.

TP compiles to native code; it is not an interpreter.

Yes, I'm using inline assembler for interrupts.

Multiple cores on a 486?

monobogdan · Post by **monobogdan** » Sat Jan 28, 2017 3:34 am

monobogdan wrote:My shell runs too slowly.
Filling a rectangle is very slow (about 10 seconds).

Here is the code:
for i := x to x + w - 1 do      { assumes w is the width in pixels }
	begin
	  for j := y to y + h - 1 do  { assumes h is the height in pixels }
	  begin
		PutPixel(i, j, Color);
	  end;
	end;
How can I make this code faster?

PutPixel is not the Graph unit's function. I implemented it myself.

dchapiesky · Post by **dchapiesky** » Sat Jan 28, 2017 3:45 am

please explain what TP means...

alexfru · Post by **alexfru** » Sat Jan 28, 2017 3:46 am

dchapiesky wrote:please explain what TP means...

Borland Turbo Pascal.

dchapiesky · Post by **dchapiesky** » Sat Jan 28, 2017 3:49 am

Well then, from Wikipedia:

Several versions of Turbo Pascal, including the latest version 7, include a CRT unit used by many fullscreen text mode applications. This unit contains code in its initialization section to determine the CPU speed and calibrate delay loops. This code fails on processors with a speed greater than about 200 MHz and aborts immediately with a "Runtime error 200" message.[25] (the error code 200 had nothing to do with the CPU speed 200 MHz). This is caused because a loop runs to count the number of times it can iterate in a fixed time, as measured by the real-time clock. When Turbo Pascal was developed it ran on machines with CPUs running at 1 to 8 MHz, and little thought was given to the possibility of vastly higher speeds, so from about 200 MHz enough iterations can be run to overflow the 16-bit counter.[26] A patch was produced when machines became too fast for the original method, but failed as processor speeds increased yet further, and was superseded by others.

dchapiesky · Post by **dchapiesky** » Sat Jan 28, 2017 4:07 am

You are using integers and not floating point variables?

Turbo Pascal had slow floating point code...

alexfru · Post by **alexfru** » Sat Jan 28, 2017 4:19 am

monobogdan wrote: TP generates small and fast code.

Sorry, it does not. It compiles fast (which is why I mainly used Turbo Pascal instead of Turbo C++ in the 90's when my computers weren't fast enough), but it doesn't produce fast code. I had to write quite a bit of assembly code (with various hacks) to make rendering fast on my machines.

You either need to do the same (rendering individual pixels in a loop is a bad idea when your compiler doesn't do a good job at optimizing) or use a better compiler. I hear Free Pascal is good. Btw, no compiler will be able to optimize this kind of loop for 16-color modes where you have to switch between the four pixel planes.

I should also add that for fast rendering you need not only optimized rendering code, you also need to avoid rendering the same pixel more than once, and learning and implementing the algorithms for this is a lot of fun (both for 2d and 3d).

alexfru · Post by **alexfru** » Sat Jan 28, 2017 4:24 am

dchapiesky wrote:Turbo Pascal had slow floating point code...

The 6-byte Real type is not supported by the x87 FPU directly, so quite a bit of 16-bit code from the system library would be involved.

iansjack · Post by **iansjack** » Sat Jan 28, 2017 5:29 am

Making a BIOS interrupt call for every pixel that you plot is horrendously inefficient. What should be a simple "mov" instruction is translated into hundreds of instructions.

You need to write a routine to address the display directly. This should be pretty simple in real mode.
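As a rough illustration of what such a routine could look like, here is a minimal C sketch. It assumes a 320x200, 8-bpp mode (the thread never says which mode is in use), and it writes to an ordinary buffer standing in for video memory so that it runs anywhere; in a real-mode program `fb` would instead point at the window at segment 0xA000. The names `put_pixel`, `SCREEN_W` and `SCREEN_H` are made up for this example.

```c
#include <stdint.h>

enum { SCREEN_W = 320, SCREEN_H = 200 };   /* assumed mode: 320x200, 8 bpp */

/* Plot one pixel with a single store: no interrupt, no mode checks.
   fb is assumed to point at SCREEN_W * SCREEN_H bytes of pixel memory. */
void put_pixel(uint8_t *fb, int x, int y, uint8_t color)
{
    if (x < 0 || x >= SCREEN_W || y < 0 || y >= SCREEN_H)
        return;                            /* clip off-screen writes */
    fb[y * SCREEN_W + x] = color;          /* one byte per pixel */
}
```

With `fb` pointing at a 64000-byte buffer, `put_pixel(fb, 10, 5, 42)` stores 42 at offset 5 * 320 + 10 = 1610. The same idea carries over to Turbo Pascal or inline assembly; only the way the video segment is addressed changes.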

Brendan · Post by **Brendan** » Sat Jan 28, 2017 6:00 am

Hi,

First, "int 0x10, ah = 0x0C" is completely unusable. To understand why, here's a detailed break-down of what it actually does:

* It starts with a software interrupt, which is relatively expensive all by itself because it involves micro-code and typically flushes the CPU's pipeline. This is pure pointless bloat.
* Then the BIOS has a whole bunch of tests to figure out which function you actually wanted, which is typically an insanely poor sequence of comparisons and branches (each with potential branch misprediction). This is pure pointless bloat.
* Once you reach the code you actually wanted, it has to figure out which video mode and what the pixel format is (to figure out how to write a pixel for the current video mode). This is pure pointless bloat.
* Then it has to calculate an address in the frame buffer from your coordinates. This is almost pure pointless bloat (more on that later).
* Then it does a write to the frame buffer. This is the only part that actually matters, and is probably faster than every single step of pure pointless bloat that occurred before and after.
* Then it has to unwind all the crud it had to spew all over the stack from earlier. This is pure pointless bloat.
* Finally, it returns ("iret"), which is relatively expensive all by itself because it involves micro-code and typically flushes the CPU's pipeline. This is pure pointless bloat.

Overall, there's about 100 times more pure pointless bloat than there is actual useful work.

Second, "putpixel()" is almost never sane. The problem is that you end up doing an "address = x * bytes_per_pixel + y * bytes_per_line" calculation for every single pixel; and there's almost always a way to avoid that. For a simple example, to draw a horizontal line you only need to calculate the "starting address", and after that you know that the next pixel will be at the next highest address after the previous pixel. More specifically, to draw a line you can typically calculate the address once and then do a "rep stosb" (if it's an 8-bpp mode), "rep stosw" (if it's a 15-bpp or 16-bpp mode) or "rep stosd" (if it's a 32-bpp mode). The same applies to rectangles: you do one horizontal line (as already described), add the "bytes between the end of one line and the start of the next line", and do the next line; so you only calculate "address = x * bytes_per_pixel + y * bytes_per_line" once for the entire rectangle.
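To make the rectangle case concrete, here is a minimal C sketch of that approach. It assumes a 320x200, 8-bpp layout and an ordinary buffer in place of real video memory; `fill_rect`, `SCREEN_W` and `SCREEN_H` are names invented for the example. `memset` plays the role of "rep stosb", and stepping the row pointer by one full scanline is equivalent to adding the gap between the end of one line and the start of the next.

```c
#include <stdint.h>
#include <string.h>

enum { SCREEN_W = 320, SCREEN_H = 200 };   /* assumed 8-bpp mode */

/* Fill a w-by-h rectangle whose top-left corner is (x, y).
   fb is assumed to point at SCREEN_W * SCREEN_H bytes of pixel memory. */
void fill_rect(uint8_t *fb, int x, int y, int w, int h, uint8_t color)
{
    /* Clip to the screen so the row fills never run out of bounds. */
    if (x < 0) { w += x; x = 0; }
    if (y < 0) { h += y; y = 0; }
    if (x + w > SCREEN_W) w = SCREEN_W - x;
    if (y + h > SCREEN_H) h = SCREEN_H - y;
    if (w <= 0 || h <= 0)
        return;

    /* The only address calculation: the first pixel of the rectangle. */
    uint8_t *row = fb + (size_t)y * SCREEN_W + x;

    for (int j = 0; j < h; j++) {
        memset(row, color, (size_t)w);     /* stands in for "rep stosb" */
        row += SCREEN_W;                   /* move down one scanline */
    }
}
```

For a 100x100 rectangle this does one multiplication and 100 row fills instead of 10,000 separate pixel calls, each with its own address calculation.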

Third, for any video mode that a user won't mind looking at (which excludes ancient "320*200" nonsense) you can't use the legacy/deprecated "VGA area" without bank switching, and bank switching makes everything slow (not just the bank switching itself, but the checking to determine if you do/don't need to switch banks ruins most other optimisations). For this reason any OS that isn't worthless trash will use a "linear frame buffer" (and therefore must use protected mode or long mode).

Cheers,

Brendan

alexfru · Post by **alexfru** » Sat Jan 28, 2017 6:32 am

Brendan wrote: * Then it does a write to the frame buffer. This is the only part that actually matters, and is probably faster than every single step of pure pointless bloat that occurred before and after.

You may find that in planar 16-color modes (e.g. the VGA 640x480x16 mode) the entire screen can be updated no faster than ~30 times per second using the most optimal code. On a 1GHz+ CPU. Which is roughly 100 CPU clocks per pixel (1 GHz / (640 * 480 * 30)). Ouch. I'm not sure which part of the video hardware is to blame (non-planar VGA and SVGA modes don't have this problem). I never looked into it. Slow port I/O for switching planes and setting pixel masks? Some weird compatibility feature?

Brendan · Post by **Brendan** » Sat Jan 28, 2017 7:11 am

Hi,

alexfru wrote:
Brendan wrote: * Then it does a write to the frame buffer. This is the only part that actually matters, and is probably faster than every single step of pure pointless bloat that occurred before and after.
You may find that in planar 16-color modes (e.g. the VGA 640x480x16 mode) the entire screen can be updated no faster than ~30 times per second using the most optimal code. On a 1GHz+ CPU. Which is roughly 100 CPU clocks per pixel (1 GHz / (640 * 480 * 30)). Ouch. I'm not sure which part of the video hardware is to blame (non-planar VGA and SVGA modes don't have this problem). I never looked into it. Slow port I/O for switching planes and setting pixel masks? Some weird compatibility feature?

For this case I pre-arrange it all as "buffer per plane" in RAM; then do "switch to plane 0; blit everything for plane 0; switch to plane 1; blit everything for plane 1; ..." (and I set the pixel mask and write mode once when setting the mode). I've never had any kind of performance problem.
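A minimal C sketch of this "buffer per plane" idea follows, assuming the 640x480, 16-colour layout (four planes, one bit per pixel per plane, leftmost pixel in bit 7). Video memory is simulated with a plain array, and `select_plane`, `shadow_put_pixel` and `blit_planes` are hypothetical names. On real hardware `select_plane` would program the sequencer's Map Mask register (index 2 via ports 0x3C4/0x3C5), and the copy would target the window at segment 0xA000. The sketch also assumes the card was left in a state where plain byte writes land unmodified in the selected plane.

```c
#include <stdint.h>
#include <string.h>

enum {
    WIDTH = 640, HEIGHT = 480,             /* assumed mode: 640x480, 16 colours */
    BYTES_PER_LINE = WIDTH / 8,            /* one bit per pixel in each plane */
    PLANE_SIZE = BYTES_PER_LINE * HEIGHT,  /* 38400 bytes per plane */
    NUM_PLANES = 4
};

/* Simulated video memory. On a real VGA all four planes share one address
   window and the Map Mask register decides which plane a write reaches. */
uint8_t vram[NUM_PLANES][PLANE_SIZE];
int current_plane;

/* Hypothetical helper; a stand-in for the port writes described above. */
void select_plane(int plane) { current_plane = plane; }

/* Set one pixel in the RAM-side plane buffers; video memory is untouched. */
void shadow_put_pixel(uint8_t shadow[NUM_PLANES][PLANE_SIZE],
                      int x, int y, uint8_t color)
{
    size_t offset = (size_t)y * BYTES_PER_LINE + (size_t)(x / 8);
    uint8_t mask = (uint8_t)(0x80 >> (x % 8));   /* leftmost pixel = bit 7 */

    for (int p = 0; p < NUM_PLANES; p++) {
        if (color & (1 << p))
            shadow[p][offset] |= mask;           /* colour bit p set */
        else
            shadow[p][offset] &= (uint8_t)~mask; /* colour bit p clear */
    }
}

/* Copy the whole frame with only four plane switches. */
void blit_planes(uint8_t shadow[NUM_PLANES][PLANE_SIZE])
{
    for (int p = 0; p < NUM_PLANES; p++) {
        select_plane(p);
        memcpy(vram[current_plane], shadow[p], PLANE_SIZE);
    }
}
```

All drawing goes into `shadow` in ordinary RAM, so the slow video memory is only touched by four large sequential copies per frame, with one plane switch before each.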

Cheers,

Brendan

alexfru · Post by **alexfru** » Sat Jan 28, 2017 7:24 am

Brendan wrote:
alexfru wrote:
Brendan wrote: * Then it does a write to the frame buffer. This is the only part that actually matters, and is probably faster than every single step of pure pointless bloat that occurred before and after.
You may find that in planar 16-color modes (e.g. the VGA 640x480x16 mode) the entire screen can be updated no faster than ~30 times per second using the most optimal code. On a 1GHz+ CPU. Which is roughly 100 CPU clocks per pixel (1 GHz / (640 * 480 * 30)). Ouch. I'm not sure which part of the video hardware is to blame (non-planar VGA and SVGA modes don't have this problem). I never looked into it. Slow port I/O for switching planes and setting pixel masks? Some weird compatibility feature?
For this case I pre-arrange it all as "buffer per plane" in RAM; then do "switch to plane 0; blit everything for plane 0; switch to plane 1; blit everything for plane 1; ..." (and I set the pixel mask and write mode once when setting the mode). I've never had any kind of performance problem.

I did the same, except I probably switched the planes for every scanline, not just four times per frame. It would be interesting to test this on different machines.

OSDev.org
