OSDev.org

Posted: **Sat Feb 01, 2014 4:08 pm**

PearOS, is your memory bottleneck related to the slowness of video RAM? It is often more than an order of magnitude slower than normal RAM.

Posted: **Sat Feb 01, 2014 4:14 pm**

gerryg400 wrote: PearOS, is your memory bottleneck related to the slowness of video RAM? It is often more than an order of magnitude slower than normal RAM.

I believe it is a video memory bottleneck, as I have to read from video memory, read from RAM, then write back to video memory after the alpha blend, whereas copying directly from RAM to video memory seems instant.

- Matt

Posted: **Sat Feb 01, 2014 4:21 pm**

Reading from video memory is probably even slower than writing. In general I would say you should never read from video memory.

I have found the best solution is to keep 2 back-buffers, one of which is an exact copy of the video buffer and one which is the new frame. You can diff the 2 buffers and write the diffs to video RAM. Then switch the pointers to the buffers because now the video is identical to the 2nd one.

I do most of my alpha blending in plain old C and have no issues getting 50 fps.
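The two-back-buffer scheme described above could be sketched roughly as follows. This is only a minimal illustration, assuming 32-bit pixels and a linear framebuffer without scanline padding; `struct screen`, `present_diff` and the field names are hypothetical and not taken from anyone's code in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping: `shadow` mirrors what is currently in video
 * RAM, `next` is the frame being drawn. Both live in system RAM. */
struct screen {
    volatile uint32_t *vram; /* mapped framebuffer (assumed 32 bpp, no padding) */
    uint32_t *shadow;        /* exact copy of the current video contents */
    uint32_t *next;          /* new frame */
    size_t pixels;           /* width * height */
};

/* Write only the pixels that changed, then swap the two back buffers.
 * Video RAM is never read. Returns the number of pixels written. */
static size_t present_diff(struct screen *s)
{
    size_t written = 0;
    assert(s && s->vram && s->shadow && s->next);

    for (size_t i = 0; i < s->pixels; i++) {
        if (s->next[i] != s->shadow[i]) {
            s->vram[i] = s->next[i];
            written++;
        }
    }

    /* Video RAM now matches `next`, so it becomes the new shadow copy. */
    uint32_t *tmp = s->shadow;
    s->shadow = s->next;
    s->next = tmp;
    return written;
}
```

After the swap, `next` still holds the frame before the one just shown, so in this sketch the renderer has to redraw it completely (or at least every region that changed in the last two frames) before the next call.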

Posted: **Sat Feb 01, 2014 4:28 pm**

gerryg400 wrote: Reading from video memory is probably even slower than writing. In general I would say you should never read from video memory.

I have found the best solution is to keep 2 back-buffers, one of which is an exact copy of the video buffer and one which is the new frame. You can diff the 2 buffers and write the diffs to video RAM. Then switch the pointers to the buffers because now the video is identical to the 2nd one.

I do most of my alpha blending in plain old C and have no issues getting 50 fps.

Ah, brilliant! I never thought of that. You're absolutely right; I think it's around 2-6 times slower than writing. So I'll just create a third buffer, use it for all rendering, and then blit the whole thing into video memory when I'm ready, which will speed things up if writing directly to video memory is slow.

I'll try that, and then maybe I won't have to use MMX or SSE; I can just use my optimized assembly routine.

Thanks a bunch! I'll try this and post my results - Matt
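For anyone following along, the "render in system RAM, then blit once" step might look something like the sketch below. It assumes a 32 bpp linear framebuffer whose scanlines are `pitch` bytes apart; `blit_frame` and its parameters are made-up names, and in a freestanding kernel `memcpy` has to be supplied by the kernel itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a finished frame from system RAM to the framebuffer, one scanline
 * at a time. `pitch` is the framebuffer's bytes per scanline, which may be
 * larger than width * 4 (assumed 32 bpp). The framebuffer is only written. */
static void blit_frame(void *vram, size_t pitch,
                       const uint32_t *back, size_t width, size_t height)
{
    assert(pitch >= width * sizeof(uint32_t));
    uint8_t *dst = vram;
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + y * pitch, back + y * width, width * sizeof(uint32_t));
    }
}
```

All blending then reads and writes `back` only, and the framebuffer sees one sequential write pass per frame.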

Posted: **Sat Feb 01, 2014 6:35 pm**

How slow video memory reads are is highly variable, and depends upon caching parameters. If you're very lucky, it might be as fast as system RAM; if you're very unlucky, it might be orders of magnitude slower.

As a random but illustrative example, the PS3 topology is System RAM <-> CPU <-> GPU <-> GPU RAM. There is ~22 GBit/s each way between the CPU and system RAM, and another ~22 GBit/s between the GPU and its RAM. However, while there is 20 GBit/s of bandwidth in the CPU-to-GPU transfer direction, there is only 16 MBit/s of bandwidth in the reverse direction.

As might be obvious, in that kind of situation reads from video RAM are death.

Posted: **Sat Feb 01, 2014 7:12 pm**

Oh my gosh, guys! Writing directly to video memory and reading from it is about 10x slower than writing into system RAM and then copying the buffer to the video buffer at the end. I was able to achieve 30 fps, no problem, alpha blending a 640x480 image at 800x600 resolution. So here's a tip for anyone reading this: DO NOT WORK DIRECTLY WITH VIDEO MEMORY, EVER! Unless you're accelerated, of course.

Well, this has been a learning curve, guys, and I have enjoyed it a lot. If anyone wants to know how I achieved my alpha blending using a byte lookup table, just let me know and I'd be happy to give you some examples.

Thanks, Matt
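Matt's routine was not posted, so the following is not his code, just one possible shape a byte lookup table for alpha blending could take. It assumes ARGB8888 pixels and a 64 KiB table of pre-multiplied values; `mul_table`, `init_mul_table` and `blend_lut` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* mul_table[a][c] = round(a * c / 255); 64 KiB, filled once at start-up. */
static uint8_t mul_table[256][256];
static int mul_table_ready;

static void init_mul_table(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned c = 0; c < 256; c++)
            mul_table[a][c] = (uint8_t)((a * c + 127) / 255);
    mul_table_ready = 1;
}

/* Blend one ARGB8888 pixel `src` over `dst` using src's alpha.
 * The result's alpha byte is set to 0xFF (opaque). */
static uint32_t blend_lut(uint32_t dst, uint32_t src)
{
    assert(mul_table_ready);
    const uint8_t *sa = mul_table[src >> 24];         /* scale by alpha */
    const uint8_t *da = mul_table[255 - (src >> 24)]; /* scale by 255 - alpha */
    uint32_t r = sa[(src >> 16) & 0xFF] + da[(dst >> 16) & 0xFF];
    uint32_t g = sa[(src >> 8) & 0xFF] + da[(dst >> 8) & 0xFF];
    uint32_t b = sa[src & 0xFF] + da[dst & 0xFF];
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

The table replaces every per-channel multiply and divide with two loads and an add, at the cost of 64 KiB that has to stay reasonably warm in cache.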

Posted: **Tue Feb 04, 2014 4:23 am**

I wrote this code a while ago, and I hope it is useful.

Code:

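#include <stdint.h>

/* Per byte: ((source * A) >> 8) + ((overlap * (255 - A)) >> 8),
 * where A is the alpha byte of `overlap`. */
uint32_t MMXBlend(uint32_t source, uint32_t overlap){
	uint32_t result;
	/* One statement with explicit operands and clobbers, so the compiler
	 * knows which registers are used and cannot reorder the pieces. */
	asm volatile(
		"movd %1, %%mm0;"
		"movd %2, %%mm1;"
		"pxor %%mm2, %%mm2;"
		"punpcklbw %%mm2, %%mm0;"   /* source bytes -> words */
		"punpcklbw %%mm2, %%mm1;"   /* overlap bytes -> words */
		"movq %%mm1, %%mm3;"
		"punpckhwd %%mm3, %%mm3;"
		"punpckhdq %%mm3, %%mm3;"   /* mm3 = A in all four words */
		"movd %3, %%mm2;"
		"punpckldq %%mm2, %%mm2;"   /* mm2 = 0xFF in all four words */
		"psubw %%mm3, %%mm2;"       /* mm2 = 255 - A */
		"pmullw %%mm3, %%mm0;"
		"pmullw %%mm2, %%mm1;"
		"psrlw $8, %%mm0;"
		"psrlw $8, %%mm1;"
		"paddw %%mm1, %%mm0;"
		"packuswb %%mm0, %%mm0;"
		"movd %%mm0, %0;"
		: "=r"(result)
		: "r"(source), "r"(overlap), "r"(0x00FF00FFu)
		: "mm0", "mm1", "mm2", "mm3");
	return result;
}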

PS. Don't forget to `asm volatile("emms")` after processing all pixels.

Posted: **Tue Feb 04, 2014 8:03 pm**

You should use <mmintrin.h> for MMX instructions (such headers exist for other instruction set extensions) rather than inline assembly. It cooperates considerably better with the compiler and there is less room for error. Additionally, I think you are really abusing inline assembly and the volatile keyword in it. I'm not certain, but that looks like something that is going to explode, as you don't appear to use clobber lists. As a rule of thumb, if you do inline assembly with numerous instructions in a single statement, you are doing it wrong.

Posted: **Wed Feb 05, 2014 4:38 am**

Code:

uint32_t MMXBlend(uint32_t source, uint32_t overlap)

And seriously, you shouldn't write a function to process a single pixel: with that many clobbered registers the compiler is unable to vectorize or unroll anything, or to make effective use of data fetch / memory throughput.
I suggest writing a function that processes a large block of pixels, and then either using compiler intrinsics or moving it out to a standalone assembly routine.
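As a rough illustration of both suggestions (intrinsics from `<mmintrin.h>` and a routine that works on a whole block), here is a sketch that keeps the arithmetic of the posted `MMXBlend`, including its `>> 8` approximation. `blend_px` and `blend_span` are hypothetical names, and the code assumes an x86 target where MMX is available and the FPU/MMX state may be used.

```c
#include <assert.h>
#include <mmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the posted MMXBlend, for one pixel:
 * each byte = ((source * A) >> 8) + ((overlap * (255 - A)) >> 8),
 * where A is the alpha byte of `overlap`. */
static inline __m64 blend_px(__m64 source, __m64 overlap)
{
    __m64 zero = _mm_setzero_si64();
    __m64 s = _mm_unpacklo_pi8(source, zero);  /* bytes -> 16-bit words */
    __m64 o = _mm_unpacklo_pi8(overlap, zero);

    __m64 a = _mm_unpackhi_pi16(o, o);         /* words: R, R, A, A */
    a = _mm_unpackhi_pi32(a, a);               /* words: A, A, A, A */
    __m64 inv = _mm_sub_pi16(_mm_set1_pi16(0xFF), a);

    s = _mm_srli_pi16(_mm_mullo_pi16(s, a), 8);
    o = _mm_srli_pi16(_mm_mullo_pi16(o, inv), 8);
    __m64 sum = _mm_add_pi16(s, o);
    return _mm_packs_pu16(sum, sum);           /* words -> bytes, saturating */
}

/* Hypothetical block routine: blend `count` pixels in place into `source`. */
static void blend_span(uint32_t *source, const uint32_t *overlap, size_t count)
{
    assert(source && overlap);
    for (size_t i = 0; i < count; i++) {
        __m64 s = _mm_cvtsi32_si64((int)source[i]);
        __m64 o = _mm_cvtsi32_si64((int)overlap[i]);
        source[i] = (uint32_t)_mm_cvtsi64_si32(blend_px(s, o));
    }
    _mm_empty(); /* EMMS once, after the whole span */
}
```

Because the compiler sees the intrinsics, it does the register allocation itself and is free to unroll the loop; the single `_mm_empty()` at the end takes the place of the `emms` reminder.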

Posted: **Wed Feb 05, 2014 6:21 am**

...And if you're using MMX, remember you NEED to (F)EMMS afterwards, otherwise your next x87 instruction is going to raise hell.

Alpha Blending - A painful process