Enabling write-combining through PAT in long mode.

M2004
Member
Member
Posts: 65
Joined: Sun Mar 07, 2010 2:12 am

Post by M2004 »

Let's presume that the LFB is located at 0xD0000000. The MTRRs
are untouched, left in whatever state the BIOS set them up.


This is what I do:
-------------------
1) I set up long mode with 2 MiB paging.

2) I check for PAT support and get a positive result.

3) The PA2 entry of the IA32_PAT MSR is set to write-combining (== 0x1).

4) I calculate 0xD0000000 / 0x200000 = 0x680 (= 1664 decimal).

5) Then I convert the result from step 4 to a byte offset and subtract
8 bytes: ((0x680*0x8)-0x8)

6) I add the offset from step 5 to the start of the PDE table (page directory).

7) I change the PAT, PCD and PWT bits of the PDE table entries from that point on
so that they select the PA2 entry (PAT==0, PCD==1, PWT==0).


I am still unable to get write combining to work properly.
Everything else works, but there is no speed improvement.

So what am I doing wrong?

regards
M2004
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
M2004 wrote: 4) I calculate 0xD0000000 /0x200000 = 0x680 = (1664 dec)

5) Then I convert the result from stage 4 to bytes and subtract
8 bytes: ((0x680*0x8)-0x8)

6) I add the offset from stage 5 to PDE (Page directory table) start.
If 0xD0000000 is the virtual address you want to change (and not the physical address), then "0xD0000000 / 0x200000 = 0x680" gives you the page directory entry number. Each page directory entry is 8 bytes, so you want to set the relevant bits in the page directory entry at offset 0x680*8. This is offset 0x3400 from the start of the first page directory if your page directories are contiguous in memory (or offset 0x0400 within the fourth page directory, the one referenced by PDPT entry 3). I don't know why you're subtracting 8, and I suspect that may be why it doesn't work.
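The arithmetic above can be sketched in C. This is only an illustration: the helper names (pde_number, pde_flat_offset, pdpt_index, pd_offset) are hypothetical, and the flat offset assumes that the page directories sit back to back in memory and that the address lies below 512 GiB (a single PML4 entry).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_2M_SHIFT 21u   /* 2 MiB = 1 << 21 */
#define PDE_SIZE      8u    /* each long mode paging entry is 8 bytes */

/* Zero-based PDE number counted across all page directories. */
static inline uint64_t pde_number(uint64_t vaddr)
{
    assert(vaddr < (1ull << 39));   /* assumption: covered by one PML4 entry */
    return vaddr >> PAGE_2M_SHIFT;
}

/* Byte offset from the start of the first page directory.
 * Only meaningful if the directories are contiguous in memory. */
static inline uint64_t pde_flat_offset(uint64_t vaddr)
{
    return pde_number(vaddr) * PDE_SIZE;
}

/* Index of the PDPT entry, i.e. which page directory is used. */
static inline uint64_t pdpt_index(uint64_t vaddr)
{
    return (vaddr >> 30) & 0x1FF;
}

/* Byte offset of the PDE inside that page directory. */
static inline uint64_t pd_offset(uint64_t vaddr)
{
    return ((vaddr >> PAGE_2M_SHIFT) & 0x1FF) * PDE_SIZE;
}
```

For 0xD0000000 these give entry 0x680, flat offset 0x3400, PDPT index 3 and in-directory offset 0x400, with no subtraction needed anywhere.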
M2004 wrote:3) PA2 entry from IA32_PAT MSR is set to write-combining (==0x1)
Minor note: I'd be tempted to leave the first 4 PAT entries at their default settings (e.g. use PA4 for write combining), so that if the PAT bit is clear then the PCD and PWT bits do the same as they always have, and if the PAT bit is set you get OS-defined caching. This is a little easier to understand for systems that do support PAT, and makes it easier for the OS to support systems that don't support PAT.
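A minimal sketch of that idea in C, assuming the documented power-on default of IA32_PAT (0x0007040600070406) and the usual entry layout of one byte per PA entry. The helper names are hypothetical. One detail worth checking against the manuals: in a PDE that maps a 2 MiB page the PAT bit is bit 12, because bit 7 is the page size bit there (bit 7 is the PAT bit only in 4 KiB page table entries). Writing the MSR (0x277) with wrmsr and flushing TLBs and caches afterwards is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings used in IA32_PAT entries. */
enum pat_type {
    PAT_UC = 0x00, PAT_WC = 0x01, PAT_WT = 0x04,
    PAT_WP = 0x05, PAT_WB = 0x06, PAT_UC_MINUS = 0x07
};

#define PAT_DEFAULT   0x0007040600070406ull  /* PA0..PA7 after reset */

#define PDE_PWT       (1ull << 3)
#define PDE_PCD       (1ull << 4)
#define PDE_PS        (1ull << 7)    /* set for a 2 MiB page */
#define PDE_PAT_LARGE (1ull << 12)   /* PAT bit of a 2 MiB PDE */

/* Return a copy of 'pat' with entry 'index' (0..7) set to 'type'. */
static inline uint64_t pat_set_entry(uint64_t pat, unsigned index,
                                     enum pat_type type)
{
    assert(index < 8);
    unsigned shift = index * 8;
    return (pat & ~(0xFFull << shift)) | ((uint64_t)type << shift);
}

/* Return a copy of a 2 MiB PDE whose PAT/PCD/PWT bits select PA<index>. */
static inline uint64_t pde_select_pat(uint64_t pde, unsigned index)
{
    assert(index < 8);
    assert((pde & PDE_PS) != 0);     /* this sketch handles 2 MiB pages only */
    pde &= ~(PDE_PWT | PDE_PCD | PDE_PAT_LARGE);
    if (index & 1) pde |= PDE_PWT;
    if (index & 2) pde |= PDE_PCD;
    if (index & 4) pde |= PDE_PAT_LARGE;
    return pde;
}
```

With these definitions, pat_set_entry(PAT_DEFAULT, 4, PAT_WC) yields 0x0007040100070406, and selecting PA4 sets only bit 12 in the PDE, so entries with the PAT bit clear keep their traditional PCD/PWT meaning.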
M2004 wrote:Everything else works but no speed is improved,
If you (e.g.) draw everything in a buffer in RAM and (when you've finished drawing everything) copy from the buffer to display memory using something like "rep movsd" (where CPU typically copies cache line by cache line); then write combining won't make any difference because the CPU is writing cache lines anyway. If you (e.g.) use SSE instead, then write-combining would combine four 16-byte writes into one 64-byte write, and while this might help a little the difference may not be noticeable.

If you draw directly into video display memory; then for things like drawing vertical lines (or scattered dots, or...) the CPU will run out of "write combining buffers" before it can combine any writes; and write combining won't make any difference. Also, if you read from display memory (e.g. maybe you're searching for every blue pixel and changing them to green pixels) then you should be shot (and write-combining won't help the reads at all, and may not help with the writes either).

However, if you're (e.g.) writing sequential bytes (without using something like "rep stosd" or SSE), then your code is bad, and write combining should help a lot.

Mostly what I'm saying is that (depending on how you write to display memory) write combining may not help, and if it does help then it's probably just hiding the fact that your code to write to display memory wasn't very good. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Post by Owen »

Brendan wrote:If you (e.g.) draw everything in a buffer in RAM and (when you've finished drawing everything) copy from the buffer to display memory using something like "rep movsd" (where CPU typically copies cache line by cache line); then write combining won't make any difference because the CPU is writing cache lines anyway. If you (e.g.) use SSE instead, then write-combining would combine four 16-byte writes into one 64-byte write, and while this might help a little the difference may not be noticeable.

REP MOVS doesn't copy by cache lines unless caching is enabled on the destination memory (because if you're REP MOVSDing, etc, to uncached memory, there might be side effects and so on). Therefore, write combining is incredibly useful for such copies.
M2004
Member
Member
Posts: 65
Joined: Sun Mar 07, 2010 2:12 am

Post by M2004 »

OK, time to give some background info.

Firstly, I use double buffering, and all the "hard work" is done in the back buffer.
LFB memory is only touched when the finished result is copied there.

Secondly, a linear (identity) mapping is used to make things simpler.
The virtual and physical addresses of the LFB should be the same.

Thirdly, the -8 is meant to make the PDE offset value zero based. Entry 0 of
the PDE table starts at offset zero and each entry is 8 bytes long.

Fourthly, I use simple qword-sized write and copy loops,
somewhat like the ones below (these are neither the actual nor the complete loops,
but they should give the idea...)

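Code:

; Set up everything for the write loop before entering:
; rdi = back buffer address, rax = qword to store, rcx = qword count (non-zero).

.....

write_to_back_buffer:
  mov [rdi],rax              ; store one qword into the back buffer
  add rdi,8                  ; advance to the next qword
  dec rcx
  jnz write_to_back_buffer   ; loop until rcx reaches zero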

Code:

; Set up everything for the copy loop before entering:
; rsi = back buffer address, rdi = LFB address, rcx = qword count (non-zero).

......

Copy_from_back_buffer_to_LFB:
  mov rax,[rsi]              ; load one qword from the back buffer
  mov [rdi],rax              ; store it to the LFB
  add rdi,8
  add rsi,8
  dec rcx
  jnz Copy_from_back_buffer_to_LFB

Fifthly, I know this kind of concept works, because I wrote
a similar test program which uses protected mode (no paging).
In that case MTRRs were used to set up write combining.
The speed gain was dramatic (20x or more) on several of my test PCs.
It is worth noting that dword-sized accesses to the back buffer and LFB were used.

The problem was that porting a similar MTRR setup to long mode proved
to be much harder than I thought it would be. Maybe long mode paging has
something to do with it.

Has anyone been able to use MTRRs in long mode for write combining?
If yes, are there any special tweaks to be made?


Regards,
M2004
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Post by Brendan »

Hi,
Owen wrote:REP MOVS doesn't copy by cache lines unless caching is enabled on the destination memory (because if you're REP MOVSDing, etc, to uncached memory, there might be side effects and so on). Therefore, write combining is incredibly useful for such copies.
D'oh - you're right. I remember seeing warnings ("The stores produced by fast-string operation may appear to execute out of order. Software dependent upon sequential store ordering should not use string operations for the entire data structure to be stored.") and assumed the same applies to all types of caching.
M2004 wrote:Thirdly, -8 thing is purposed to make the PDE offset value zero based. Entry0 of
PDE starts from offset zero and entry is 8 bytes long.
Except that "0xD0000000 /0x200000 = 0x680" already produces a zero based value (in the same way that "0x00000000 /0x200000 = 0x0000" produces a zero based value). Your -8 is an off by one error (specifically, an off by one 8-byte entry error).
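To see where the -8 leads, the mapping can be run in reverse: given a byte offset into the (assumed contiguous) page directories, which virtual address does that entry cover? The sketch below is illustrative only and the helper name is made up.

```c
#include <assert.h>
#include <stdint.h>

#define PDE_SIZE     8u
#define PAGE_2M_SIZE 0x200000ull

/* Start of the 2 MiB virtual range mapped by the PDE at 'offset' bytes
 * from the start of the first page directory (directories assumed
 * contiguous, single PML4 entry). */
static inline uint64_t vaddr_for_pde_offset(uint64_t offset)
{
    assert(offset % PDE_SIZE == 0);   /* must point at the start of an entry */
    return (offset / PDE_SIZE) * PAGE_2M_SIZE;
}
```

Offset 0x680*8 maps 0xD0000000 as intended, while offset (0x680*8)-8 maps 0xCFE00000, which is the 2 MiB page just below the LFB.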
M2004 wrote:Fourthly, I use simple qword sized write and copy loops
little similar ones below (these are not the actual nor whole ones,
but they should give the idea.... )
One of the things I always do is avoid writing pixels that didn't change. For example, often (if the code doing the drawing doesn't redraw the entire frame all the time) I'll have a "line changed" flag for each horizontal line: code that draws things sets the flags for the lines it touches, and the blitting code skips the entire horizontal line of pixels if its flag wasn't set. Typically my blitting code also uses 2 buffers in RAM (one containing the old data and one containing the new data), checks whether the pixels in the new buffer differ from the pixels in the old buffer, and only writes to display memory (and updates the old buffer) when necessary. This usually gives a massive performance improvement (even though it's doing more reads/writes from RAM) because RAM bandwidth is so much higher than PCI bus bandwidth.

Of course there are many other ways of doing this (e.g. "dirty rectangles", etc; my way is just a very easy way to do it). The point is, if only a handful of pixels changed (e.g. someone is typing one letter at a time and almost all pixels stay the same) then you don't want to send 1920*1200 pixels (roughly 8.8 MiB at 32 bits per pixel) for no reason.
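A rough C sketch of the "line changed flag plus old/new buffer" approach described above. It is not anyone's actual blitter: the function name, the 32 bits per pixel format and the parameter layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Copy only changed pixels from new_buf to the framebuffer.
 * - line_changed[y] is set by drawing code for every line it touches.
 * - old_buf mirrors what is currently on screen.
 * Returns the number of pixels actually written to display memory. */
static size_t blit_changed(volatile uint32_t *fb, size_t fb_pitch_pixels,
                           uint32_t *old_buf, const uint32_t *new_buf,
                           size_t width, size_t height, bool *line_changed)
{
    assert(fb_pitch_pixels >= width);
    size_t written = 0;

    for (size_t y = 0; y < height; y++) {
        if (!line_changed[y])
            continue;                 /* untouched line: skip it entirely */
        line_changed[y] = false;

        volatile uint32_t *dst = fb + y * fb_pitch_pixels;
        uint32_t *old_row = old_buf + y * width;
        const uint32_t *new_row = new_buf + y * width;

        for (size_t x = 0; x < width; x++) {
            if (new_row[x] != old_row[x]) {
                dst[x] = new_row[x];      /* slow write to display memory */
                old_row[x] = new_row[x];  /* keep the shadow copy in sync */
                written++;
            }
        }
    }
    return written;
}
```

The extra reads all hit ordinary cached RAM, so the trade is more cheap memory traffic for fewer expensive writes across the bus. A real version would probably compare and write wider units than one pixel at a time.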
M2004 wrote:Has anyone been able to use MTRRs in long mode for write combining?
If yes, are there any special tweaks to be made?
For MTRRs it makes no difference if you're in long mode or not. The MTRRs work at a lower level (the physical address space) and they don't care what type of paging you're using (if any) or what mode the CPU is in.
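Because the MTRRs describe physical ranges, the values programmed into a variable-range pair are the same in any CPU mode. As a hedged sketch, the C below computes the IA32_MTRR_PHYSBASEn / IA32_MTRR_PHYSMASKn values (MSRs 0x200+2n and 0x201+2n) for a write-combining range. The struct and function names are invented for the example, and the LFB size and physical address width used in the note are assumptions. The actual update sequence (disabling caching, wrmsr, flushing) is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_TYPE_WC    0x01ull
#define MTRR_MASK_VALID (1ull << 11)

struct mtrr_pair {
    uint64_t base;   /* value for IA32_MTRR_PHYSBASEn */
    uint64_t mask;   /* value for IA32_MTRR_PHYSMASKn */
};

/* Compute a variable-range MTRR pair marking [phys_base, phys_base+size)
 * as write-combining. 'phys_bits' is the CPU's physical address width
 * (reported by CPUID leaf 0x80000008). */
static inline struct mtrr_pair mtrr_wc_range(uint64_t phys_base,
                                             uint64_t size,
                                             unsigned phys_bits)
{
    assert(size >= 0x1000 && (size & (size - 1)) == 0); /* power of two */
    assert((phys_base & (size - 1)) == 0);              /* base aligned to size */
    assert(phys_bits >= 32 && phys_bits <= 52);

    uint64_t addr_mask = ((1ull << phys_bits) - 1) & ~0xFFFull;
    struct mtrr_pair p;
    p.base = (phys_base & addr_mask) | MTRR_TYPE_WC;
    p.mask = (~(size - 1) & addr_mask) | MTRR_MASK_VALID;
    return p;
}
```

For a hypothetical 16 MiB LFB at 0xD0000000 on a CPU with 36 physical address bits, this gives base 0xD0000001 and mask 0xFFF000800.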


Cheers,

Brendan