Brendan wrote:
The "rep movsd" instruction is (sort of) the assembly language equivalent of "memcpy()"; except it works on dwords not bytes.
Indeed, the
'move string to string' instructions (or the equivalent AVX and SSE instructions) are sometimes used in implementing memcpy().
The basic instruction ('movsw', which moves a single 16-bit word) takes no explicit operands and is written in the form
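Code:
    movsw        ; copy the word at DS:[SI] to ES:[DI] and adjust SI and DI by 2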
and copies the word value pointed to by SI (or ESI or RSI, depending on the CPU mode) to the location pointed to by DI (EDI, RDI). There are 8-, 16-, 32-, and 64-bit variants (movsb/movsw/movsd/movsq), and it can also be written with explicit operands to indicate the size instead (though this is rarely done):
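In MASM-style syntax that looks something like

Code:
    movs word ptr es:[di], word ptr ds:[si]    ; same operation as movsw above

where the operands only establish the operand size (and any segment override for the source); the addresses still come implicitly from SI and DI.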
The 'repeat string operation' prefixes ('rep', 'repz', 'repnz', etc.) are a special class of instruction prefix for the string operations, indicating that the operation should be repeated until some condition is met. For example, REP checks for a non-zero value in [E|R]CX, performs the string operation, decrements [E|R]CX, and starts again, in effect performing a definite loop.
Since the larger the word size, the fewer memory cycles the copy takes, you can often speed up a memcpy() function by using the largest string-move operation that the processor mode permits. However, if you want to implement memcpy() with this, you do need to watch for two things. First, as Brendan says,
movsd operates on doublewords (there is a similar
movsq that moves quadwords, added for 64-bit mode), so a memcpy() function using one needs to check whether the data is a) a multiple of the data size in length, and b) aligned on the data size in memory. If it isn't aligned, it needs to copy the leading bytes until the pointer is aligned before performing the 'rep movs'; then, if there are trailing bytes which don't make up a full word (doubleword, quadword), set the count register to the number of full data elements, perform the repeated string instruction, and finish by using 'movsb' to copy the trailing bytes.
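A rough sketch of that approach (32-bit NASM-style code, untested; it assumes flat segments, the source and destination already in ESI and EDI, a byte count in ECX, and it aligns on the destination - the label memcpy_rep is just a name I picked):

Code:
    memcpy_rep:
        cld                      ; make sure the pointers increment
    .head:
        test edi, 3              ; is the destination dword-aligned yet?
        jz   .bulk
        test ecx, ecx            ; out of bytes?
        jz   .done
        movsb                    ; copy one byte, advance ESI/EDI
        dec  ecx
        jmp  .head
    .bulk:
        mov  edx, ecx
        shr  ecx, 2              ; number of whole dwords
        rep  movsd               ; copy the aligned bulk a dword at a time
        mov  ecx, edx
        and  ecx, 3              ; bytes left over that don't fill a dword
        rep  movsb               ; copy the trailing bytes
    .done:
        ret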
Second, the performance of the repeated string ops is highly dependent on the processor implementation, meaning that different CPU models may do better or worse with them compared to a conventional for()-loop memcpy(). While this is pretty much a thing of the past, since newer CPUs should all have the repeats optimized for the memory pipeline, it is still something to be aware of. The same applies to the SSE and AVX instructions, so this is one place where - if you are up to it - you might want to have multiple versions of a given function (memcpy() in this case) and have the OS installer select the right one for the CPU model at installation time. It's not something most of us here have to concern ourselves with, but it is definitely a factor if you think you can move into the big leagues.
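If you do go down that road, the selection itself can be as simple as testing a CPUID feature bit once and parking the result in a function pointer that callers go through - the same test works whether the installer does it or the kernel does it at boot. A rough sketch (32-bit NASM-style; memcpy_sse2 and memcpy_rep_movsd are placeholders for the actual routines, and I'm assuming CPUID leaf 1 is available):

Code:
    extern memcpy_sse2, memcpy_rep_movsd

    select_memcpy:
        mov  eax, 1                                ; CPUID leaf 1: feature flags
        cpuid
        test edx, 1 << 26                          ; EDX bit 26 = SSE2
        jz   .plain
        mov  dword [memcpy_ptr], memcpy_sse2       ; placeholder SSE2 routine
        ret
    .plain:
        mov  dword [memcpy_ptr], memcpy_rep_movsd  ; placeholder rep movsd routine
        ret

    memcpy_ptr: dd 0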
Mind you, I have always wondered why the standard memory subsystem doesn't have a hardware memory-to-memory DMA transfer; it would seem like an obvious way to bypass transfers to and from the CPU, though it would probably require some sort of interlock mechanism to prevent the CPU from trying to access memory in the process of being copied. It would be similar to but simpler than the
Bit BLIT hardware in most graphics cards, so it isn't as if it would be difficult.
Write-combining is (AFAIK) similar to, but not quite the same as, what I mean, but I don't know enough about it to really say how applicable it would be.