X86 asm experts: moving large blocks of data.

bloodline · Post by **bloodline** » Mon Nov 02, 2020 12:50 pm

I’m trying to composite my GUI using only the VBE interface Grub2 provides... I’ve already established that there is no VBlanking interrupt, but it seems a 16ms timer interrupt gives reasonable quality (occasional tearing/flickering... but it does work), so the next problem I face is moving large blocks of graphics data from the various offscreen buffers to the display without hardware acceleration.

Each window in my GUI has a list of non-overlapping rectangles which describe the visible and non-visible areas of that window. So as an optimisation, I only ever blit data from a visible rectangle covering an area of the window that has changed since the last update. This works fine when only a new line of text has been added or the window above has moved to reveal a small area, but sometimes a really large area needs to be updated.

Without any blitter or DMA hardware I am reduced to a simple CPU-bound copy loop, from one buffer to the other. My question is: do any of the fancy MMX/SSE instructions apparently available offer some kind of performance advantage for copies like this?

-edit- I’m using 32bit bpp, so each transfer is 4 bytes...

alexfru · Post by **alexfru** » Mon Nov 02, 2020 1:02 pm

See Agner Fog's doc on optimization.
Also look at memcpy() implementations for inspiration. They may be using multiple large (e.g. XMM) registers to load and store data in large contiguous chunks.
But before all that you may want to try using write combining for the video buffer (see PAT and MTRRs).

bloodline · Post by **bloodline** » Mon Nov 02, 2020 1:08 pm

alexfru wrote:See Agner Fog's doc on optimization.
Also look at memcpy() implementations for inspiration. They may be using multiple large (e.g. XMM) registers to load and store data in large contiguous chunks.
But before all that you may want to try using write combining for the video buffer (see PAT and MTRRs).

Cheers, I’ll read about write combining now!

Also, now I’m wondering if writing to the framebuffer might be the slowest part of the operation, and whether it might be quicker to composite the entire display to a RAM buffer and then copy the whole thing to the framebuffer in one operation...

-edit-
Searching for PAT and MTRR information, I found this thread: viewtopic.php?f=1&t=28428

A nice suggestion there is to keep track of “dirty scan lines”, so that you only copy horizontal lines which need updating... I guess I’m going to have to try these out and see what works best.
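
For illustration, the "dirty scan lines" idea could be sketched roughly as below. This is a minimal sketch, not code from the linked thread: the names `dirty_lines_t`, `mark_dirty` and `flush_dirty` and the tiny `SCREEN_W`/`SCREEN_H` values are hypothetical, and it assumes a back buffer in RAM with the same layout as the framebuffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 4   /* hypothetical tiny screen, in pixels */
#define SCREEN_H 8   /* hypothetical tiny screen, in scanlines */

typedef struct {
    uint8_t dirty[SCREEN_H];   /* one flag per scanline */
} dirty_lines_t;

/* Mark scanlines [y0, y1) as needing a copy to the framebuffer. */
static void mark_dirty(dirty_lines_t *d, int y0, int y1)
{
    assert(d != NULL);
    if (y0 < 0) y0 = 0;
    if (y1 > SCREEN_H) y1 = SCREEN_H;
    for (int y = y0; y < y1; y++)
        d->dirty[y] = 1;
}

/* Copy only the dirty scanlines from the back buffer to the framebuffer,
 * clear their flags, and return how many lines were copied. */
static int flush_dirty(dirty_lines_t *d, uint32_t *fb, const uint32_t *back)
{
    int copied = 0;
    assert(d != NULL && fb != NULL && back != NULL);
    for (int y = 0; y < SCREEN_H; y++) {
        if (!d->dirty[y])
            continue;
        memcpy(fb + (size_t)y * SCREEN_W,
               back + (size_t)y * SCREEN_W,
               SCREEN_W * sizeof(uint32_t));
        d->dirty[y] = 0;
        copied++;
    }
    return copied;
}
```

Drawing code would call `mark_dirty` for the lines it touches, and the timer tick would call `flush_dirty` once per frame. Whether whole lines or rectangles work better probably depends on how wide the typical update is.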

8infy · Post by **8infy** » Mon Nov 02, 2020 1:55 pm

Don't worry about this. I recently made a GUI for my OS as well, and without write combining I was getting like 5 frames per second at full HD on my laptop. After setting up write combining it got like 20 times faster, and I don't have to worry about rendering speed at all now. (Obviously I don't redraw the entire screen every frame, only the dirty rects, but nonetheless, it's super fast.) So yeah, just set up write combining, it's pretty easy.

bloodline · Post by **bloodline** » Mon Nov 02, 2020 2:10 pm

8infy wrote:Don't worry about this. I recently made a GUI for my OS as well, and without write combining I was getting like 5 frames per second at full hd on my laptop, after setting up write combining it got like 20 times faster and I don't have to worry about rendering speed at all now. (Obviously I dont redraw the entire screen every frame, only the dirty rects, but nonetheless, its super fast) So yeah, just set up write combining, its pretty easy.

Have you any links to some good resources? I’m totally new to X86.

I’m reading this Intel document, but it reads like an advert more than a technical document https://download.intel.com/design/Penti ... 442201.pdf

8infy · Post by **8infy** » Mon Nov 02, 2020 2:58 pm

bloodline wrote:
8infy wrote:Don't worry about this. I recently made a GUI for my OS as well, and without write combining I was getting like 5 frames per second at full hd on my laptop, after setting up write combining it got like 20 times faster and I don't have to worry about rendering speed at all now. (Obviously I dont redraw the entire screen every frame, only the dirty rects, but nonetheless, its super fast) So yeah, just set up write combining, its pretty easy.
Have you any links to some good resources? I’m totally new to X86.

I’m reading this Intel document, but it reads like an advert more than a technical document https://download.intel.com/design/Penti ... 442201.pdf

Page 3243 of the PDF intel manual.

Basically you have an MSR (IA32_PAT, at 0x277) that holds a 64-bit table of 8 entries (the PAT). Each entry occupies one byte, of which the low 3 bits select the memory type; there are 6 possible caching settings, and write-combining is 0b001.
Each page table entry can select one of the 8 entries in the PAT, using 3 specific bits (PAT, PCD and PWT). You just need to set all the page table entries of the framebuffer to use the entry from the PAT that you set to write combining. The PAT is enabled by default, and the first 4 entries should be left unchanged to keep backwards compatibility.
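
As a rough sketch of the arithmetic involved (not the poster's code): the helpers below compute a new PAT value with one entry set to write-combining, and the flag bits that make a 4 KiB page table entry select a given PAT entry. The function names are hypothetical, the choice of entry 4 in the test is only an example, and the actual `rdmsr`/`wrmsr` on MSR 0x277 has to happen in ring 0 and is not shown. Large pages keep their PAT bit in a different position, which this sketch does not cover.

```c
#include <assert.h>
#include <stdint.h>

#define IA32_PAT_MSR 0x277u
#define PAT_TYPE_WC  0x01u          /* write-combining encoding */
#define PTE_PWT      (1ull << 3)    /* PAT index bit 0 */
#define PTE_PCD      (1ull << 4)    /* PAT index bit 1 */
#define PTE_PAT      (1ull << 7)    /* PAT index bit 2 (4 KiB entries only) */

/* Return `pat` with entry `index` (0..7) replaced by memory type `type`.
 * Each PAT entry occupies one byte of the 64-bit MSR value. */
static uint64_t pat_set_entry(uint64_t pat, unsigned index, uint8_t type)
{
    assert(index < 8);
    unsigned shift = index * 8u;
    pat &= ~(0xFFull << shift);
    pat |= (uint64_t)(type & 0x7u) << shift;
    return pat;
}

/* Flag bits that make a 4 KiB page table entry select PAT entry `index`. */
static uint64_t pte_flags_for_pat_index(unsigned index)
{
    uint64_t flags = 0;
    assert(index < 8);
    if (index & 1u) flags |= PTE_PWT;
    if (index & 2u) flags |= PTE_PCD;
    if (index & 4u) flags |= PTE_PAT;
    return flags;
}
```

Starting from the power-on default PAT value `0x0007040600070406`, setting entry 4 to write-combining leaves entries 0 to 3 untouched, and the framebuffer's page table entries then only need the PAT bit set.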

bloodline · Post by **bloodline** » Mon Nov 02, 2020 3:29 pm

8infy wrote:
bloodline wrote:
8infy wrote:Don't worry about this. I recently made a GUI for my OS as well, and without write combining I was getting like 5 frames per second at full hd on my laptop, after setting up write combining it got like 20 times faster and I don't have to worry about rendering speed at all now. (Obviously I dont redraw the entire screen every frame, only the dirty rects, but nonetheless, its super fast) So yeah, just set up write combining, its pretty easy.
Have you any links to some good resources? I’m totally new to X86.

I’m reading this Intel document, but it reads like an advert more than a technical document https://download.intel.com/design/Penti ... 442201.pdf
Page 3243 of the PDF intel manual.

Basically u have an MSR (at 0x277), that keeps a 64 bit table of 8 entries (PAT) where each one is 3 bits, and there are 6 possible caching settings, write-combining is 0b001.
Each page table entry can select one of the 8 entries in the PAT, using 3 specific bits. You just need to set all the page table entries of the framebuffer to use the entry from the PAT that you
set to write combining. It is enabled by default and first 4 entries should be left unchanged to keep backwards compatibility.

I'm not using paging, I'm using the memory as a single flat address space. Can I just set the physical address range of the framebuffer to be write-combined?

Reading the manual (Page 3230), it seems the MTRRs can be set up without needing the PAT.

This site seems to be quite helpful: https://www.kernel.org/doc/html/latest/x86/mtrr.html
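
For reference, a variable-range MTRR is described by a base/mask register pair. A minimal sketch of how those two values could be computed for a write-combining framebuffer follows. The names `mtrr_pair_t` and `mtrr_wc_range` are hypothetical, and the sketch assumes the range size is a power of two, the base is aligned to that size, and `phys_bits` (the physical address width) has been read from CPUID. Writing the MSRs themselves requires ring 0 and the update sequence described in the Intel manual, which is not shown, and any existing MTRR that already covers the range would also need checking.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_TYPE_WC          0x01u          /* write-combining */
#define MTRR_PHYSMASK_VALID   (1ull << 11)   /* enables the register pair */
#define MSR_MTRR_PHYSBASE(n)  (0x200u + 2u * (n))
#define MSR_MTRR_PHYSMASK(n)  (0x201u + 2u * (n))

typedef struct {
    uint64_t base;   /* value for IA32_MTRR_PHYSBASEn */
    uint64_t mask;   /* value for IA32_MTRR_PHYSMASKn */
} mtrr_pair_t;

/* Compute the register pair that marks [phys_base, phys_base + size)
 * as write-combining. */
static mtrr_pair_t mtrr_wc_range(uint64_t phys_base, uint64_t size,
                                 unsigned phys_bits)
{
    mtrr_pair_t p;
    uint64_t addr_mask;

    assert(size >= 4096 && (size & (size - 1)) == 0);  /* power of two */
    assert((phys_base & (size - 1)) == 0);             /* aligned base */
    assert(phys_bits >= 32 && phys_bits <= 52);

    addr_mask = (1ull << phys_bits) - 1;
    p.base = (phys_base & addr_mask) | MTRR_TYPE_WC;
    p.mask = (~(size - 1) & addr_mask) | MTRR_PHYSMASK_VALID;
    return p;
}
```

For example, a 16 MiB framebuffer at 0xFD000000 on a CPU with 36 physical address bits would give base `0xFD000001` and mask `0xFFF000800`.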

bzt · Post by **bzt** » Mon Nov 02, 2020 9:53 pm

bloodline wrote:My question is do any of the fancy MMX/SSE instructions apparently available offer some kind of performance advantage for copies like this?

Definitely. Here's my CC-BY-NC-SA licensed implementation with SSE2 (which is always available in long mode). As you can see, it is quite complicated because it checks for lots of things and, if possible, copies 256 aligned bytes per iteration with data prefetch. For the best performance you should use AVX on aligned addresses. Google for "github fast memcpy". Many have an MIT license, and you can also find C implementations using intrinsics instead of assembly (see here).
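
To give a flavour of the intrinsics approach, here is a much simpler sketch than the implementation linked above (it is not that code). It copies 64 bytes per iteration using SSE2 loads and non-temporal stores when the destination is 16-byte aligned, and falls back to `memcpy` for the tail or when SSE2 is unavailable. The name `copy_stream_sse2` is hypothetical, and using non-temporal stores is an assumption that suits a write-only target such as a framebuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Copy n bytes from src to dst (the buffers must not overlap). */
static void *copy_stream_sse2(void *dst, const void *src, size_t n)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__SSE2__)
    /* _mm_stream_si128 requires a 16-byte aligned destination. */
    if (((uintptr_t)d & 15u) == 0) {
        while (n >= 64) {
            __m128i a = _mm_loadu_si128((const __m128i *)(s));
            __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
            __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
            _mm_stream_si128((__m128i *)(d), a);
            _mm_stream_si128((__m128i *)(d + 16), b);
            _mm_stream_si128((__m128i *)(d + 32), c);
            _mm_stream_si128((__m128i *)(d + 48), e);
            s += 64;
            d += 64;
            n -= 64;
        }
        _mm_sfence();   /* order the streaming stores before later writes */
    }
#endif
    memcpy(d, s, n);    /* remaining bytes, or the whole copy as fallback */
    return dst;
}
```

In a kernel, the SIMD state would have to be saved and restored around a routine like this.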

On modern Intel processors (Ivy Bridge and up) the simplest "REP MOVSB" should be the fastest (see ERMSB, Intel Optimization Manual Section 3.7.7), however many have reported it does not outperform SSE and AVX based memcpy implementations (see for example here). The Linux kernel also uses ERMSB if possible, but it does not use SSE/AVX, just GPR based copy with 32 bytes per iteration otherwise. The reason for this is that the Linux kernel does not save SSE/AVX registers on syscalls, so kernel code must leave the user's SIMD state intact and is therefore not allowed to use those registers.

Finally, if you're about to use this memcpy in a GUI compositor with 32 bit pixels, then you're looking for a SSSE3/SSE4 implementation which can calculate alpha blending effectively and very-very fast (see here). This is required to have semi-transparent windows. (Not a typo: 3 times "S" in SSSE3, which is different to SSE3.)

Cheers,
bzt
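
As a plain scalar reference for the alpha blending mentioned above (not the SSSE3/SSE4 version, which processes several pixels at once), a "source over" blend for 32-bit pixels might look like the sketch below. It assumes non-premultiplied ARGB with alpha in the top byte and treats the result as opaque. The names `blend_over` and `blend_span` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one non-premultiplied ARGB pixel over an opaque destination pixel:
 * out = (src * a + dst * (255 - a)) / 255 per colour channel, rounded. */
static uint32_t blend_over(uint32_t src, uint32_t dst)
{
    uint32_t a = src >> 24;
    uint32_t out = 0xFF000000u;   /* result is treated as fully opaque */

    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t s = (src >> shift) & 0xFFu;
        uint32_t d = (dst >> shift) & 0xFFu;
        uint32_t c = (s * a + d * (255u - a) + 127u) / 255u;
        out |= c << shift;
    }
    return out;
}

/* Blend a horizontal run of `count` source pixels onto the destination. */
static void blend_span(uint32_t *dst, const uint32_t *src, size_t count)
{
    assert(count == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < count; i++)
        dst[i] = blend_over(src[i], dst[i]);
}
```

A SIMD version performs the same per-channel arithmetic, just on several pixels per instruction.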

nullplan · Post by **nullplan** » Mon Nov 02, 2020 10:32 pm

bloodline wrote:My question is do any of the fancy MMX/SSE instructions apparently available offer some kind of performance advantage for copies like this?

Very little to add to what bzt said. Apparently, aligned copies are faster with SSE/AVX (I wouldn't bother with MMX if I were you). GCC can now also optimize code into using SSE instructions, even when there is no floating-point involved. But using SSE in kernel has the same problem using FPU in kernel has: If you do that, you have to save/restore FPU state before/after using it. If you allow GCC to generate SSE instructions, that means you have to save FPU state before entering the kernel. It's your call whether increased switching costs are worth it. Especially since you would have to pay the price on every entry into the kernel, but reap the benefits only sometimes (when those routines are called).

Alternative (which I am using): Allow the frame buffer to be mapped into user space. Then a user space program can deal with drawing things on screen however it wants, including with SSE. If you then also use MTRRs to set the physical memory to write-combining, you are basically at peak performance, without requiring an FPU save on every kernel entry (FPU save on interrupt is especially a big problem for systems under load).

moonchild · Post by **moonchild** » Mon Nov 02, 2020 11:17 pm

An in-depth analysis is available here. The consensus is: rep movsb is fast enough for most purposes. The optimal memcpy is maybe 10-20% faster than rep movsb (depending on CPU), and is a lot more code/work; so unlikely to be worth it unless you've confirmed it's a bottleneck.
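
For anyone who has not used it from C, a `rep movsb` copy can be written with GCC/Clang extended inline asm roughly as follows. This is a minimal sketch, not taken from the linked analysis: the name `copy_rep_movsb` is hypothetical, it assumes the direction flag is clear (the usual ABIs guarantee this on function entry, but a freestanding kernel should make sure of it), and it falls back to `memcpy` on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst with a single `rep movsb`. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* The instruction advances the destination, source and count
     * registers, so all three operands are read-write ("+"). */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);
#endif
    return ret;
}
```

The `"memory"` clobber tells the compiler that memory is read and written behind its back, so it does not reorder or cache accesses around the copy.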

bloodline · Post by **bloodline** » Tue Nov 03, 2020 2:58 am

Fantastic advice guys, Cheers!

moonchild wrote:An in-depth analysis is available here. The consensus is: rep movsb is fast enough for most purposes. The optimal memcpy is maybe 10-20% faster than rep movsb (depending on CPU), and is a lot more code/work; so unlikely to be worth it unless you've confirmed it's a bottleneck.

Really interesting page... I was wondering about the movsb instruction, but all the official documentation said leave it alone...

bloodline · Post by **bloodline** » Tue Nov 03, 2020 3:28 am

ok, I replaced the horizontal line copy part of my "blitting" functions with an inline asm "rep movsl" and the speed improvement is mind blowing! Literally several orders of magnitude!

I suppose replacing a uint32_t copy loop which iterates maybe several hundred times, with a single instruction is always going to speed things up a bit!

Thanks guys, now to see where else I can use this instruction!
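
For illustration only (this is not the poster's actual code), a rectangle blit whose per-line copy is a `rep movsl` might look like the sketch below. The names `copy_row32` and `blit_rect32` are hypothetical, pitches are assumed to be in pixels rather than bytes, and the inline asm assumes GCC/Clang with the default AT&T syntax, where `movsl` moves one 32-bit pixel per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `pixels` 32-bit pixels from src to dst. */
static void copy_row32(uint32_t *dst, const uint32_t *src, size_t pixels)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile ("rep movsl"
                      : "+D"(dst), "+S"(src), "+c"(pixels)
                      :
                      : "memory");
#else
    while (pixels--)
        *dst++ = *src++;
#endif
}

/* Blit a w x h rectangle. dst and src point at the rectangle's top-left
 * pixel in each buffer; the pitches give each buffer's row length. */
static void blit_rect32(uint32_t *dst, size_t dst_pitch,
                        const uint32_t *src, size_t src_pitch,
                        size_t w, size_t h)
{
    assert(w <= dst_pitch && w <= src_pitch);
    for (size_t y = 0; y < h; y++)
        copy_row32(dst + y * dst_pitch, src + y * src_pitch, w);
}
```

An optimised build of a plain C loop may get closer to this than the speedup above suggests, so it is worth comparing both with optimisations enabled.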

8infy · Post by **8infy** » Tue Nov 03, 2020 3:58 am

bloodline wrote:ok, I replaced the horizontal line copy part of my "blitting" functions with an inline asm "rep movsl" and the speed improvement is mind blowing! Literally several orders of magnitude!

I suppose replacing a uint32_t copy loop which iterates maybe several hundred times, with a single instruction is always going to speed things up a bit!

Thanks guys now to see where else I can use this instruction

Maybe consider using smarter drawing algorithms as well, where you only redraw the invalidated rects. Redrawing entire scanlines is a bit crazy; imagine doing this on a 4K screen.

bloodline · Post by **bloodline** » Tue Nov 03, 2020 4:12 am

8infy wrote:
bloodline wrote:ok, I replaced the horizontal line copy part of my "blitting" functions with an inline asm "rep movsl" and the speed improvement is mind blowing! Literally several orders of magnitude!

I suppose replacing a uint32_t copy loop which iterates maybe several hundred times, with a single instruction is always going to speed things up a bit!

Thanks guys now to see where else I can use this instruction
Maybe consider using smarter drawing algorithms where you only redraw certain invalidated rects as well, redrawing entire scanlines is a bit crazy as well, imagine doing this on a 4k screen.

Already done.

As stated in my original post, each window is backed by an offscreen buffer and described by a list of visible and non-visible rectangles. When a drawing operation hits a window's buffer, the GUI checks which visible rectangles the operation touches and marks each of them as needing an update at the next VBL (a faked 16ms timer, as I'm using VBE). Each VBL, the GUI traverses the list of visible rectangles per window and blits any that need a refresh to the screen from the window's buffer. The only time this was a problem was when a large window was moved: the GUI task would just block the whole system as it slowly copied the data... Now even massive windows move smoothly!
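
The scheme described above could be sketched along these lines. This is an illustration under assumptions rather than the actual GUI code: the types `rect_t` and `window_t` and the functions `window_mark_drawn` and `window_refresh` are hypothetical, rectangle coordinates are taken to be window-relative, and the rectangles are assumed to be already clipped to the screen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int x, y, w, h;       /* window-relative position and size */
    int visible;          /* non-zero if not covered by another window */
    int needs_update;     /* set by drawing, cleared by the refresh */
} rect_t;

typedef struct {
    const uint32_t *buffer;   /* offscreen window buffer */
    int buf_pitch;            /* buffer row length in pixels */
    int screen_x, screen_y;   /* window position on screen */
    rect_t *rects;
    int rect_count;
} window_t;

/* Drawing code calls this: flag every visible rectangle that the
 * drawn area (x, y, dw, dh) overlaps. */
static void window_mark_drawn(window_t *win, int x, int y, int dw, int dh)
{
    assert(win != NULL);
    for (int i = 0; i < win->rect_count; i++) {
        rect_t *r = &win->rects[i];
        if (!r->visible)
            continue;
        if (x < r->x + r->w && r->x < x + dw &&
            y < r->y + r->h && r->y < y + dh)
            r->needs_update = 1;
    }
}

/* The timer tick calls this: blit flagged rectangles to the framebuffer
 * and return how many were blitted. */
static int window_refresh(window_t *win, uint32_t *fb, int fb_pitch)
{
    int blitted = 0;
    assert(win != NULL && fb != NULL);
    for (int i = 0; i < win->rect_count; i++) {
        rect_t *r = &win->rects[i];
        if (!r->visible || !r->needs_update)
            continue;
        for (int row = 0; row < r->h; row++)
            memcpy(fb + (win->screen_y + r->y + row) * fb_pitch
                      + win->screen_x + r->x,
                   win->buffer + (r->y + row) * win->buf_pitch + r->x,
                   (size_t)r->w * sizeof(uint32_t));
        r->needs_update = 0;
        blitted++;
    }
    return blitted;
}
```

The per-row `memcpy` is only a placeholder; any faster row copy can be dropped in without changing the bookkeeping.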

alexfru · Post by **alexfru** » Tue Nov 03, 2020 4:15 am

bloodline wrote:ok, I replaced the horizontal line copy part of my "blitting" functions with an inline asm "rep movsl" and the speed improvement is mind blowing! Literally several orders of magnitude!

Which makes me wonder whether there was unnecessary stuff in your loops (or loops were too short) or you were compiling with compiler optimizations disabled.
Properly structuring the code (and, of course, using effective algorithms) and enabling optimizations usually works quite well.

OSDev.org
