Disable MTRRs
Hi,
So I've recently changed my memory management and paging system, and while doing so I wanted to see how much difference WC memory would make to the linear frame buffer. (It was huge, even more than I expected: it pushed 1920x1080x32 from 5fps up to 1500fps!)
I was doing that purely as a quick test so to accomplish it I quickly changed 3gb-4gb large page to be write combining just to check the performance, obviously it's a terrible idea as there are many other things in that range that need to be uncached and a bit of space which can be WB.. but I digress..
I thought never mind, I'll map in the LFB range using one of the variable MTRRs (I happen to have 10 on my test machine, with one free for use as planned).
The problem I ran into is that the MTRR setup from boot happens to cover the range where the LFB is as uncached, I don't have enough free ranges to split that and still add the write combining range.
Another issue is that when you want to start using large pages (2mb, 1gb) there are some issues with pages that span multiple mtrr ranges leading to undefined behaviour (which you really want to avoid).
So given that my minimum requirement is support for PAT and I have that setup, should I (or could I) just completely disable the MTRRs altogether and just transfer their memory types/ranges into the paging table setup?
I was also curious if switching off MTRRs might improve memory access performance as in theory it should be less for the cpu to check accesses against.
I'm running in longmode exclusively, so I wouldn't expect any legacy bios or firmware calls to happen that might require the MTRRs to still be present. I'm not sure about ACPI/SMI.. but I would think if the page tables are set up to map the same ranges the same way as the MTRRs it should be seamless?
It just seems cleaner to have one system responsible for memory typing than two different setups which have a bit of an impedance mismatch when it comes to larger pages, and MTRRs are also pretty out-dated in comparison to PAT.
So the question is :
1) Can I disable MTRRs completely (is it safe to do in longmode / with regards to acpi/smi/smm etc)? - Assuming that I remap pages into paging tables that mirror the mtrr settings?
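To see what the firmware actually set up before deciding anything, it helps to decode each variable MTRR pair into a plain address range. The following is a minimal sketch in C, not a definitive implementation: `mtrr_decode` and `struct mtrr_range` are hypothetical names, the two MSR values are assumed to have been read already (reading them needs ring 0), and `phys_bits` is assumed to be MAXPHYADDR as reported by CPUID, somewhere between 36 and 52.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of one variable-range MTRR (PHYSBASEn/PHYSMASKn pair). */
struct mtrr_range {
    bool     valid;  /* PHYSMASK bit 11 */
    uint8_t  type;   /* 0 = UC, 1 = WC, 4 = WT, 5 = WP, 6 = WB */
    uint64_t base;   /* first physical address covered */
    uint64_t size;   /* length in bytes; 0 if the mask is not contiguous */
};

/*
 * Hypothetical helper: turn raw PHYSBASE/PHYSMASK values into a range.
 * phys_bits is assumed to be 36..52 (MAXPHYADDR).
 */
struct mtrr_range mtrr_decode(uint64_t physbase, uint64_t physmask,
                              unsigned phys_bits)
{
    /* Address bits [phys_bits-1:12] are the only ones that take part. */
    uint64_t addr_mask = ((1ULL << phys_bits) - 1) & ~0xFFFULL;
    uint64_t inv = ~physmask & addr_mask;  /* bits ignored when matching */
    uint64_t len = inv + 0x1000;

    struct mtrr_range r;
    r.valid = ((physmask >> 11) & 1) != 0;
    r.type  = (uint8_t)(physbase & 0xFF);
    r.base  = physbase & addr_mask;
    /* A usable mask leaves a single power-of-two sized window. */
    r.size  = ((len & (len - 1)) == 0) ? len : 0;
    return r;
}
```

For example, a PHYSBASE of 0xC0000000 with a PHYSMASK of 0xFC0000800 on a CPU with 36 physical address bits decodes to a valid 1 GiB uncached range starting at 3 GiB, which is the kind of entry described above.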
Re: Disable MTRRs
Hi,
johnsa wrote: So the question is: 1) Can I disable MTRRs completely (is it safe to do in longmode / with regards to acpi/smi/smm etc)? - Assuming that I remap pages into paging tables that mirror the mtrr settings?
It's "safe" as long as all CPUs have the same MTRR configuration (and as long as you follow the sequence that Intel describes to change MTRRs on all CPUs at the same time).
However, with MTRRs disabled everything ends up as "uncached", and PAT can't change that to anything except "write combining" (and can't change it to "write-back"); so the performance of everything will be destroyed.
Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.
Also, for old CPUs, some don't support PAT (but do support MTRRs), and some ancient CPUs (Cyrix) had "something like MTRRs" (I think Cyrix called them "address range registers" or something) before Intel did (and Intel mostly stole the idea and implemented it in an incompatible way). For newer CPUs, AMD extended MTRRs in some way (I can't remember - something to do with IOMMU and/or virtualisation I think). This means that you (potentially) have about 4 different cases to worry about (nothing like MTRRs, "address range registers", MTRRs, and "extended MTRRs"); and could/should provide an abstraction (e.g. some kind of "change base cache-ability for physical region" function in your physical memory manager) to hide the differences.
Finally; there are 2 different use cases to worry about:
- Cache-ability being changed because of memory mapped device (because a device driver asked for it)
- Cache-ability being changed for performance reasons (because a normal user-space process asked for it - e.g. possibly to avoid cache pollution in rare cases where "least recently used" doesn't make sense).
For the second case (normal "write-back" RAM being changed for performance reasons by user-space) you only want to use PAT (and don't want to use MTRRs or equivalent) and if PAT can't be used (e.g. not supported) then do nothing (consider it a "performance hint" that can be ignored).
johnsa wrote: So I've recently changed my memory management and paging system and while doing so wanted to see how much difference WC memory would make to the linear frame buffer. (It was huge, even more than I expected: it pushed 1920x1080x32 from 5fps up to 1500fps!)
If your code is extremely bad (e.g. not well optimised and does lots of tiny writes, possibly including writing the same pixel/s multiple times), then WC can make a huge difference.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Disable MTRRs
Hi,
Thanks for the info.
I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with WC about 1500fps.. so no individual pixel writes etc. I was surprised by the increase myself; it runs about the same speed as the same code (software rendering) does under windows 7 64bit, in fact it's about 20% faster.. but I take that with a pinch of salt without a full stack of drivers running, firing off interrupts, and 50 odd tasks switching.
The problem I have with the mtrr's is that the configuration as it stands leaves me with only a single free variable mtrr, and the range I need for the LFB to be marked as write combining is already in a range marked uncached. If I split the range (uncached -> write combine -> uncached again) I'd need two free mtrrs to map it, which I don't have. On top of that, mtrr's only allow you to have overlapping areas when the types are writeback and uncached, and even if I could overlap them the uncached would take priority over write combining where they overlap, so that doesn't seem to be an option.
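The cost of splitting a range comes from each variable MTRR only describing a naturally aligned power-of-two block. A rough way to estimate it is to count how many such blocks a range decomposes into. This is a sketch only: `mtrr_blocks_needed` and `pow2_floor` are hypothetical helpers, the count is for a simple greedy, non-overlapping layout, and the LFB address used in the example is made up for illustration.

```c
#include <stdint.h>

/* Largest power of two that is <= x (x must be non-zero). */
uint64_t pow2_floor(uint64_t x)
{
    uint64_t p = 1;
    while (x >>= 1)
        p <<= 1;
    return p;
}

/*
 * Count the naturally aligned power-of-two blocks needed to describe
 * [base, base + size) without overlapping anything else. Each block
 * would consume one variable MTRR.
 */
unsigned mtrr_blocks_needed(uint64_t base, uint64_t size)
{
    unsigned count = 0;

    while (size != 0) {
        uint64_t block = pow2_floor(size);
        uint64_t align = base & (~base + 1);  /* lowest set bit; 0 if base is 0 */

        /* The block can't be bigger than the alignment of its start. */
        if (align != 0 && align < block)
            block = align;

        base += block;
        size -= block;
        count++;
    }
    return count;
}
```

With a hypothetical 16 MiB LFB at 0xE0000000 inside a single 1 GiB uncached entry at 0xC0000000, the part below the LFB (0xC0000000 to 0xE0000000) is one block, but the part above it (0xE1000000 up to 4 GiB) decomposes into five, so a layout without overlaps can need noticeably more registers than the single entry it replaces.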
I didn't realise that the PAT entries couldn't convert accesses to write back if the mtrr's were disabled, that is bad news indeed! As now.. even if I wanted to mark the LFB as write combining with the PAT, that PAT entry will overlap an MTRR saying it's uncached.. which I assume will take precedence..
The only thing I can think of doing is setting the default mtrr type to uncached (which it should be anyway) and then removing all ranges from the mtrr's that are uncached, as that would then be the default for any memory not covered by an mtrr.
Then the MTRR's would only serve to map areas as Write Back or Write Combine .. thus freeing up a bunch of mtrr's .. the other way would be to make the default cached and then only map areas that are uncached or write-back (which is opposite to the intel manual suggestion of the default being uncached).. but either way my mtrr's contain a mix of type 6 and 0 .. which seems silly as one of those should be the default and we should be able to remove the others.
Why this has to be such a mess and a pain grrr
Re: Disable MTRRs
johnsa wrote: Even if I wanted to mark the LFB as write combining with the PAT, that PAT entries will overlap an MTRR saying it's uncached.. which I assume will take precedence..
Nope! According to Intel, if the MTRR says memory should be UC and the PAT says memory should be WC, the result is WC. (I believe the same applies to AMD, but I haven't checked AMD's manuals to confirm.)
With that said, you still shouldn't map a large page across multiple MTRRs. That tends to make bad things happen.
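One way to guard against this when building page tables is to refuse a large page if any MTRR range boundary falls inside it, and fall back to smaller pages there. The sketch below is deliberately conservative and makes a few assumptions: `struct phys_range` and `page_within_one_mtrr_region` are hypothetical names, the range list is assumed to have been decoded from the MTRRs beforehand, and a boundary between two ranges of the same type is still treated as a boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded MTRR range; only the extent matters for this check. */
struct phys_range {
    uint64_t base;
    uint64_t size;
};

/*
 * Returns true if no range starts or ends strictly inside the page,
 * i.e. the whole page sees one consistent MTRR type.
 */
bool page_within_one_mtrr_region(uint64_t page_base, uint64_t page_size,
                                 const struct phys_range *ranges, size_t n)
{
    uint64_t page_end = page_base + page_size;

    for (size_t i = 0; i < n; i++) {
        uint64_t start = ranges[i].base;
        uint64_t end   = start + ranges[i].size;

        if (start > page_base && start < page_end)
            return false;  /* a range begins part-way through the page */
        if (end > page_base && end < page_end)
            return false;  /* a range ends part-way through the page */
    }
    return true;
}
```

A page that fails the check can be mapped with 2 MiB or 4 KiB pages instead, so that each page lies entirely on one side of the boundary.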
Re: Disable MTRRs
Do you know off-hand where in the manual it mentions that? I can't find any mention of a PAT WC entry taking precedence over an MTRR UC type.
Thanks!
I see in the AMD Manual section 7.8.5 it says a PAT entry of WC and - for the MTRR type (which I assume to mean any) results in WC .. I would imagine Intel must be the same for consistency.
Re: Disable MTRRs
johnsa wrote: Do you know off-hand where in the manual it mentions that ?
Volume 3, chapter 11, section 11.5.2.2.
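The combinations discussed in this thread can be written down as a small lookup. This is a partial sketch, not a transcription of the whole table: `effective_type` is a hypothetical helper that only covers the rows talked about here (MTRR type UC or WB, plus the rule that a PAT entry of WC yields WC) and returns -1 for everything else, in which case the manual should be consulted.

```c
#include <stdint.h>

/* Memory type encodings as used by the MTRRs and the PAT. */
enum mem_type {
    MT_UC       = 0,
    MT_WC       = 1,
    MT_WT       = 4,
    MT_WP       = 5,
    MT_WB       = 6,
    MT_UC_MINUS = 7   /* PAT only */
};

/*
 * Effective type for a page given the MTRR type of its physical range
 * and the PAT entry selected by its page-table flags. Returns -1 for
 * combinations this sketch deliberately does not cover.
 */
int effective_type(enum mem_type mtrr, enum mem_type pat)
{
    if (pat == MT_WC)
        return MT_WC;                 /* PAT WC wins, even over MTRR UC */
    if (mtrr == MT_UC)
        return MT_UC;                 /* nothing else upgrades MTRR UC */
    if (mtrr == MT_WB)
        return (pat == MT_UC_MINUS) ? MT_UC : (int)pat;
    return -1;                        /* WC/WT/WP MTRR rows: see the manual */
}
```

The first two branches are the point of the exchange above: with the MTRR type at UC, the PAT can produce WC but cannot produce WB.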
Re: Disable MTRRs
Hi,
johnsa wrote: I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with about 1500fps..
That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory }" loop)? Are the writes aligned on an 8-byte boundary?
johnsa wrote: I didn't realise that the PAT entries couldn't convert accesses to write back if the mtrr's were disabled, that is bad news in deed! As now.. Even if I wanted to mark the LFB as write combining with the PAT, that PAT entries will overlap an MTRR saying it's uncached.. which I assume will take precedence..
In general, PAT can only make caching worse (e.g. "write-back" can be changed to "write-through" or "uncached"; "write-through" can be changed to "uncached", etc). Write-combining is a special case, and (if the CPU supports both PAT and WC) you can use PAT to convert "uncached" to "write-combining".
johnsa wrote: The only thing I can think of doing is setting the default mtrr type to uncached (which it should be anyway) and then removing all ranges from the mtrr's that are uncached, as that would then be the default for any memory not covered by an mtrr. Then the MTRR's would only serve to map areas as Write Back or Write Combine .. thus freeing up a bunch of mtrr's .. the other way would be to make the default cached and then only map areas that are uncached or write-back (which is opposite to the intel manual suggestion of the default being uncached).. but either way my mtrr's contain a mix of type 6 and 0 .. which seems silly as one of those should be the default and we should be able to remove the others.
If firmware's default is currently "write-back", then changing the default to "uncached" means that all RAM will become "uncached" (and PAT can't change that to "write-back" so the performance of everything involving RAM would be severely crippled). To fix that you'd have to create new MTRRs to describe RAM as "write-back". Maybe this means that you can delete 2 existing MTRRs for "uncached" and have to create 3 new entries for "write-back", and then you'd have no MTRRs left over for the "write-combining" anyway.
Note that some firmware simply isn't very good at creating the MTRRs; and maybe you can save some MTRRs by reconfiguring all of them. This is a relatively risky proposition - it's hard to write code that generates "ideal MTRRs" that works for all possible cases (with different memory maps, different CPUs with different numbers of MTRRs and different "overlapping" rules), and it's impossible to test all possible cases; which means there's a good chance of "works on some computers" (which is the same as "fails on other computers but you don't know that").
johnsa wrote: Why this has to be such a mess and a pain grrr
Improving performance or capabilities always increases complexity, so (unless you're willing to accept "bad but simple") everything is always complicated.
Cheers,
Brendan
Re: Disable MTRRs
Brendan wrote: Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.
Are you sure about that (specifically the performance hit when using the PAT)? Intel explicitly states that mapping the same page with different memory types is undefined behavior:
Section 11.12.4: Programming the PAT wrote: The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single physical page mapped to two or more different linear addresses, each with different memory types. Intel does not support this practice because it may lead to undefined operations that can result in a system failure. In particular, a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.
The manual also states that you must issue a wbinvd (i.e. do the same thing you would do if you reprogrammed the PAT) before you use a WC mapping of a previously cached page unless the processor supports self-snooping. I guess that scenarios different from cached -> WC might work by chance (i.e. because no support in the cache coherency protocol is required for them) but are not architecturally supported either.
As the PAT cannot be disabled on processors that support it I always use it when it is available (e.g. on x86_64). AFAIR this is the same strategy Linux uses. As you said, reprogramming MTRRs is hard in general and might be impossible to do correctly for unknown chipsets.
You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).
Brendan wrote: That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?
What performance should be expected then? WC allows the CPU to buffer writes which should not improve the performance of rep movsq. However it also allows the CPU to issue delayed writes. With UC each iteration of rep movsq has to wait until the previous iteration hit main memory. With WC the CPU can issue many iterations of rep movsq concurrently and does not have to wait for main memory.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Re: Disable MTRRs
Korona wrote: Are you sure about that (specifically the performance hit when using the PAT)?
That's also described in Volume 3, chapter 11, section 11.5.2.2. Check the footnotes for table 11-7.
Re: Disable MTRRs
Korona wrote: Are you sure about that (specifically the performance hit when using the PAT)?
Octocontrabass wrote: That's also described in Volume 3, chapter 11, section 11.5.2.2. Check the footnotes for table 11-7.
Ah I see, thank you. That means you pay a performance penalty if you use the PAT to force pages to be UC that are not UC in the MTRRs. Performance never suffers if you mark pages as WC in the PAT compared to marking them WC in the MTRRs.
The part I cited from the SDM about cached -> WC makes a lot more sense now and transitions cached <-> UC are indeed architecturally supported.
That is nice because you generally want to set DMA regions to WC and only have UC for memory mapped device registers. So you can forget about the MTRRs (and use the PAT without performance penalties) if you assume that your firmware gets at least the memory mapped registers right. Only if your firmware messes up the MTRRs for memory mapped registers you need to reprogram MTRRs to get full performance. It does not even matter if your firmware gets the MTRRs for the frame buffer right.
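Using the PAT for this comes down to two pieces of bit manipulation: putting WC into one of the eight PAT entries, and setting the page-table flags that select that entry. The sketch below only computes the values and is not a complete setup routine. `pat_set_entry` and `pat_index_to_pte_flags` are hypothetical helpers, the choice of entry 4 in the usage note is arbitrary, and actually writing IA32_PAT (on every CPU, with the cache and TLB flushing the manuals describe) is left out.

```c
#include <stdint.h>

/* Power-on value of IA32_PAT: WB, WT, UC-, UC, repeated twice. */
#define PAT_DEFAULT 0x0007040600070406ULL
#define PAT_TYPE_WC 0x01u

/* Return a copy of the PAT value with entry `index` (0..7) set to `type`. */
uint64_t pat_set_entry(uint64_t pat, unsigned index, uint8_t type)
{
    unsigned shift = (index & 7u) * 8u;
    return (pat & ~(0xFFULL << shift)) | ((uint64_t)type << shift);
}

/*
 * Page-table flag bits that select PAT entry `index`.
 * Index = PAT:PCD:PWT. The PAT bit is bit 7 in a 4 KiB page entry and
 * bit 12 in a 2 MiB or 1 GiB page entry.
 */
uint64_t pat_index_to_pte_flags(unsigned index, int large_page)
{
    uint64_t flags = 0;

    if (index & 1u)
        flags |= 1ULL << 3;                      /* PWT */
    if (index & 2u)
        flags |= 1ULL << 4;                      /* PCD */
    if (index & 4u)
        flags |= 1ULL << (large_page ? 12 : 7);  /* PAT */
    return flags;
}
```

Putting WC in entry 4, for example, gives a PAT value of 0x0007040100070406, and a frame buffer mapping would then set only the PAT bit (0x80 for 4 KiB pages, 0x1000 for large pages). Entries 0 to 3 are left alone in that layout, so existing mappings that only use PCD and PWT keep their meaning.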
Re: Disable MTRRs
Brendan wrote: That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?
The rendering code itself is a checkerboard raytrace, it determines the intersection against the board for each pixel (1920x1080). It's single-core using AVX, it takes about 5ms for the actual render code.
Under Windows using a GDI bitmap and 16byte aligned buffer (in both cases) it achieves about 140fps. Without the write-combining it was getting about 5fps for me, with write-combining on the LFB I get 180fps (render + copy to lfb with rep movsq).
Re: Disable MTRRs
Hi,
Brendan wrote: Also note that the CPU has to assume that the same page might be mapped multiple times with different PAT values (in different virtual address spaces or in the same virtual address space), and therefore the data could (e.g.) be cached even when the PAT says "uncached". The result of this assumption is that (for most cases) the CPU has to do extra work when the PAT is used to modify the "base cache-ability" described by MTRRs. Basically, it's better to use MTRRs if you can (so that CPU doesn't have to do extra work), and only use PAT if you can't use MTRRs.
Korona wrote: Are you sure about that (specifically the performance hit when using the PAT)?
Yes. Take a look at the notes for "Table 11-7. Effective Page-Level Memory Types for Pentium III and More Recent Processor Families", where it says:
1. The UC attribute comes from the MTRRs and the processors are not required to snoop their caches since the data could never have been cached. This attribute is preferred for performance reasons.
2. The UC attribute came from the page-table or page-directory entry and processors are required to check their caches because the data may be cached due to page aliasing, which is not recommended.
Korona wrote: Intel explicitly states that mapping the same page with different memory types is undefined behavior: [Section 11.12.4: Programming the PAT wrote: The PAT allows any memory type to be specified in the page tables, and therefore it is possible to have a single physical page mapped to two or more different linear addresses, each with different memory types. Intel does not support this practice because it may lead to undefined operations that can result in a system failure. In particular, a WC page must never be aliased to a cacheable page because WC writes may not check the processor caches.] The manual also states that you must issue a wbinvd (i.e. do the same thing you would do if you reprogrammed the PAT) before you use a WC mapping of a previously cached page unless the processor supports self-snooping. I guess that scenarios different from cached -> WC might work by chance (i.e. because no support in the cache coherency protocol is required for them) but are not architecturally supported either.
That section is slightly badly worded. For this they are only talking about WC and not talking about other caching types. This is because WC is not like any of the other caching types and is not really a true caching type - WC is more accurately described as "uncached caching type as far as normal caches are concerned, but where CPU combines writes in a buffer on the side that has nothing to do with normal caches". It's this "buffer on the side that has nothing to do with normal caches" that causes potential problems.
Korona wrote: As the PAT cannot be disabled on processors that support it I always use it when it is available (e.g. on x86_64).
PAT can't be disabled in the same way that segmentation (in protected mode) can't be disabled - in both cases you can achieve "effectively disabled" by configuring it to do nothing (e.g. "base = 0, limit = 4 GiB" for segments). For PAT, "configured to do nothing/effectively disabled" is the default setting (where it behaves identically to older CPUs that didn't have PAT and only had the PCD and PWT flags).
Korona wrote: AFAIR this is the same strategy Linux uses. As you said, reprogramming MTRRs is hard in general and might be impossible to do correctly for unknown chipsets.
Normally "same strategy Linux uses" means that it's bad; however for this case I don't think that applies (if the PAT exists there's no real reason not to use it). Note that Linux does support fully reconfiguring MTRRs during boot (for when firmware doesn't configure MTRRs well) and will use MTRRs (and not PAT) for things like device's memory mapped IO areas if it can.
Korona wrote: You'll have to use the PAT (or the corresponding PT flags if the PAT is not supported) anyway at some point because you won't have enough MTRRs for each device that wants to perform DMA (PCI DMA to cached pages is not allowed; PCIe does allow it by issuing cache snoops which kill performance especially on NUMA systems).
That's very much wrong. There are no problems with PCI devices doing DMA to normal RAM configured as write-back, write-through, etc. There would be a potential problem with DMA to RAM configured as WC caused by "buffer on the side that has nothing to do with normal caches", but even in that case you can probably work around it with fences to ensure the data is out of that "buffer on the side" before you begin the DMA.
Brendan wrote: That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?
Korona wrote: What performance should be expected then? WC allows the CPU to buffer writes which should not improve the performance of rep movsq. However it also allows the CPU to issue delayed writes. With UC each iteration of rep movsq has to wait until the previous iteration hit main memory. With WC the CPU can issue many iterations of rep movsq concurrently and does not have to wait for main memory.
When unknown software running on an unknown number of unknown CPUs (with unknown cache sizes, speed, etc) is spending an unknown amount of time to generate pixel data in unknown ways in a buffer of unknown size in unknown RAM, and then doing unknown things (with unknown alignment, etc) to blit that data across an unknown bus to an unknown video controller; I would be "extremely shocked" if it didn't take exactly 1234.5678 nanoseconds.
For code that is well optimised (e.g. avoids writing pixels that didn't change to display memory, uses SSE or AVX, uses non-temporal stores, etc) I would expect that using WC (in the MTRR) makes no difference at all.
For code that is less well optimised (e.g. avoids writing pixels that didn't change to display memory, but doesn't use SSE or AVX or non-temporal stores) I would expect that using WC (in the MTRR) might make it no more than 10 times faster (and that the time spent to blit the data is negligible compared to the time spent generating that pixel data in the first place).
For code that is even less well optimised (e.g. writes all pixels to display memory regardless of whether they changed or not, and doesn't use SSE or AVX or non-temporal stores) I'd still expect that using WC (in the MTRR) might make it no more than 10 times faster (because "pointless writes that should've been avoided" would be affecting "with WC" and "without WC" the same).
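To make the "avoid writing pixels that didn't change" idea concrete, here is a minimal sketch in C. It is not anyone's actual code from this thread: the function name, the shadow copy `prev` and the `lfb_pitch` parameter are all assumptions for illustration. The comparison is done against a shadow copy in normal RAM so the frame buffer itself is never read, and it uses plain `memcpy` rather than SSE/AVX or non-temporal stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy only the rows that changed since the previous frame.
 *   lfb:   linear frame buffer (hypothetical pointer; its pitch may be
 *          larger than the visible row size)
 *   back:  freshly rendered frame in normal (write-back) RAM
 *   prev:  shadow copy of what is currently on screen, also in normal RAM
 * Returns the number of rows actually written to the frame buffer. */
size_t blit_dirty_rows(uint8_t *lfb, size_t lfb_pitch,
                       const uint8_t *back, uint8_t *prev,
                       size_t row_bytes, size_t rows)
{
    size_t written = 0;

    assert(lfb_pitch >= row_bytes);

    for (size_t y = 0; y < rows; y++) {
        const uint8_t *src = back + y * row_bytes;
        uint8_t *shadow = prev + y * row_bytes;

        /* Compare against the shadow copy; never read display memory. */
        if (memcmp(src, shadow, row_bytes) == 0)
            continue;

        memcpy(lfb + y * lfb_pitch, src, row_bytes); /* write to display */
        memcpy(shadow, src, row_bytes);              /* remember new row */
        written++;
    }
    return written;
}
```

For a renderer that changes every pixel every frame this check buys nothing, but for a typical desktop most rows are skipped and the memory type of the frame buffer matters much less.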
If you assume that johnsa's code is spending 0.33333 ms to generate the pixel data, 0.33333 ms to blit all pixel data with a single "rep movsq" when WC is being used, and 199.666666 ms to blit all pixel data with a single "rep movsq" when WC is not being used; then that would imply that WC makes it 600 times faster. That's so far beyond the expected performance differences that something else must be causing misleading results.
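The arithmetic behind that "600 times" figure can be sketched as below. The function names are made up for illustration, and the 0.33333 ms generation time is the same assumption as in the paragraph above, not a measurement.

```c
#include <assert.h>

/* Frame time in milliseconds for a given frame rate. */
double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* How many times faster the blit alone must be with WC, given the two
 * measured frame rates and an assumed time to generate the pixel data
 * (which is the same in both cases). */
double implied_blit_speedup(double fps_wc, double fps_no_wc, double generate_ms)
{
    double blit_wc    = frame_ms(fps_wc)    - generate_ms;
    double blit_no_wc = frame_ms(fps_no_wc) - generate_ms;

    assert(blit_wc > 0.0);
    return blit_no_wc / blit_wc;
}
```

Calling `implied_blit_speedup(1500.0, 5.0, 1.0 / 3.0)` gives a ratio of roughly 600, which is the number being questioned here.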
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Disable MTRRs
So it looks like my approach is as follows based on the above information:
The firmware has correctly mapped all my device MMIO ranges; I'm not convinced it's optimal, but hey. I can use the PAT to create the write-combining area, thanks to the special provision that MTRR=UC can be forced to WC by the PAT without penalty (so I will use that for the LFB as well as any DMA accesses/buffers that require WC). So in theory, apart from transferring the BSP's MTRR settings to the other cores in the trampoline, I shouldn't have to look at them again.
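As a rough sketch of the PAT side of that plan: the helpers below build a PAT value with one entry changed to WC and work out which page table flag bits select a given PAT entry. The helper names and the choice of entry 4 are assumptions for illustration; the memory type encodings, the power-on default PAT value and the PWT/PCD/PAT bit positions are the architectural ones. Actually writing the value to the IA32_PAT MSR (0x277) with `wrmsr`, on every CPU, is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings used in the IA32_PAT MSR (0x277). */
enum pat_type {
    PAT_UC = 0x00, PAT_WC = 0x01, PAT_WT = 0x04,
    PAT_WP = 0x05, PAT_WB = 0x06, PAT_UC_MINUS = 0x07
};

/* Power-on default: WB, WT, UC-, UC, repeated for entries 4..7. */
#define PAT_DEFAULT    0x0007040600070406ull

/* Page table entry bits that together select one of the 8 PAT entries. */
#define PTE_PWT        (1ull << 3)
#define PTE_PCD        (1ull << 4)
#define PTE_PAT_4K     (1ull << 7)   /* PAT bit in a 4 KiB PTE          */
#define PTE_PAT_LARGE  (1ull << 12)  /* PAT bit in a 2 MiB / 1 GiB entry */

/* Return a copy of 'pat' with entry 'index' replaced by 'type'. */
uint64_t pat_set_entry(uint64_t pat, unsigned index, enum pat_type type)
{
    assert(index < 8);
    pat &= ~(0xFFull << (index * 8));
    return pat | ((uint64_t)type << (index * 8));
}

/* Flag bits to OR into a page table entry so it uses PAT entry 'index'.
 * index = (PAT << 2) | (PCD << 1) | PWT. */
uint64_t pat_index_to_pte_flags(unsigned index, int large_page)
{
    uint64_t flags = 0;

    assert(index < 8);
    if (index & 1) flags |= PTE_PWT;
    if (index & 2) flags |= PTE_PCD;
    if (index & 4) flags |= large_page ? PTE_PAT_LARGE : PTE_PAT_4K;
    return flags;
}
```

With this, `pat_set_entry(PAT_DEFAULT, 4, PAT_WC)` gives the value to load into the MSR, and `pat_index_to_pte_flags(4, 0)` gives the extra flag for each 4 KiB page of the LFB. Note that the PAT bit moves from bit 7 to bit 12 for large pages, which is easy to get wrong.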
Re: Disable MTRRs
Hi,
johnsa wrote: I was basically just writing to a normal cached buffer for the screen then transferring it to the LFB with a rep movsq. Without WC it was about 5fps, with about 1500fps..
Brendan wrote: That doesn't add up. How much time do you spend generating the data to blit, and how much time (e.g. in nanoseconds or something) do you spend doing the blitting itself for each case? How are you blitting (e.g. a simple "for each row of pixels { if row of pixels changed, copy row of pixels from buffer to display memory" loop)? Are the writes aligned on a 8-byte boundary?
johnsa wrote: The rendering code itself is a checkerboard raytrace, it determines the intersection against the board for each pixel (1920x1080). It's single-core using AVX, it takes about 5ms for the actual render code. Under Windows using a GDI bitmap and 16byte aligned buffer (in both cases) it achieves about 140fps. Without the write-combining it was getting about 5fps for me, with write-combining on the LFB I get 180fps (render + copy to lfb with rep movsq).
Originally you said "1500 fps with WC" and now you're saying "180 fps with WC".
At 180 fps it's one frame every 5.555 ms, and with 5 fps it's one frame every 200 ms. That means with WC it's 5 ms to generate the data and 0.555 ms to blit; and without WC it's the same 5 ms to generate the data and 195 ms to blit. That implies it's 0.555 vs 195 ms, or around 350 times faster with WC (call it 400 as a round figure for the estimate that follows).
Now think of those writes as packets across a bus, where each packet has a header (saying that it's a write, which address is being written, the number of bytes being written, etc) and the data itself. The amount of data is the same in both cases - the difference is the amount of bandwidth consumed by "per packet overhead" (those headers, etc). Essentially it becomes "total_traffic = packets * per_packet_overhead + total_bytes". If WC has no packet overhead then it'd be "total_traffic = 1920*1080*4", and without WC it would have to be 400 times worse and therefore it'd have to be "400 * 1920*1080*4 = packets * per_packet_overhead + 1920*1080*4". That means "packets * per_packet_overhead = 399 * 1920*1080*4". We know that you're writing 8 bytes at a time, and that we're looking at "1920*1080*4 / 8 = 1036800" packets. Therefore we can estimate "per_packet_overhead = 399 * 1920*1080*4 / 1036800 = 3192".
Essentially; for what you're saying to be believable, you have to assume that sending 8 bytes across the PCI bus costs the equivalent of at least (because we ignored all packet overhead for WC when we probably shouldn't have) 3192 bytes of overhead.
That is simply not believable.
Believable might be more like 16 bytes of per packet overhead (e.g. a 1-byte packet type field, an 8 byte "address of write" field, a 16-bit size field, and 5 extra bytes for no particular reason). That would work out to a maximum performance difference of "0 + 1920*1080*4" with WC (and no packet overhead) vs. "16*1036800 + 1920*1080*4"; or "8294400 vs 24883200"; or WC being (no more than) 3 times faster.
In that case, assuming 5 ms to generate the pixel data, it would be the same 5.555 ms (180 fps) for WC, and "5+3*0.555 = 6.665 ms" (150 fps) without WC.
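The packet model above is simple enough to check in a few lines of C. This is only a sketch of that back-of-the-envelope model, with made-up function names; it says nothing about how any real bus frames its transactions.

```c
#include <assert.h>
#include <stdint.h>

/* total_traffic = packets * per_packet_overhead + total_bytes */
uint64_t bus_traffic(uint64_t total_bytes, uint64_t bytes_per_write,
                     uint64_t per_packet_overhead)
{
    assert(bytes_per_write > 0 && total_bytes % bytes_per_write == 0);

    uint64_t packets = total_bytes / bytes_per_write;
    return packets * per_packet_overhead + total_bytes;
}

/* Per-packet overhead needed for the traffic to be 'slowdown' times the
 * payload alone. From slowdown * total = packets * overhead + total, and
 * packets = total / bytes_per_write, the frame size cancels out:
 * overhead = (slowdown - 1) * bytes_per_write. */
uint64_t overhead_for_slowdown(uint64_t slowdown, uint64_t bytes_per_write)
{
    assert(slowdown >= 1);
    return (slowdown - 1) * bytes_per_write;
}
```

`overhead_for_slowdown(400, 8)` gives the 3192 bytes from above, and `bus_traffic(1920 * 1080 * 4, 8, 16)` gives 24883200 against 8294400 bytes of payload, which is the "no more than 3 times" case. The frame size dropping out of the second function is the useful part: the implied overhead depends only on the claimed speedup and the write size.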
Cheers,
Brendan
Re: Disable MTRRs
Hi,
johnsa wrote: So it looks like my approach is as follows based on the above information: The firmware has correctly mapped all my device mmio ranges, I'm not convinced it's optimal but hey.
Firmware mostly only sets up MTRRs for RAM and "firmware special areas", and doesn't/shouldn't set up any MTRRs for any MMIO devices.
johnsa wrote: I can use the PAT to create the Write combining area due to the special provision that MTRR=UC can be forced to WC by PAT without penalty
Yes.
johnsa wrote: (So I will use that for LFB as well as any DMA accesses/buffers that require WC).
Don't use WC for any DMA - it's bizarre and complicates things far too much (because WC involves those "special side-buffers that normal caches don't know about"). For DMA to normal RAM (the only case that's actually likely to occur) it's simpler and usually faster to leave the RAM as "write-back".
johnsa wrote: So in theory, apart from transferring the BSP MTRR settings to the other cores on trampoline I shouldn't have to look at them again.
I wouldn't want to assume that AP CPUs have caches fully disabled properly; which means that (for "defensive programming" against potentially buggy firmware, etc) I'd start AP CPUs before touching MTRRs, so that I can update all CPU's MTRRs at the same time.
Cheers,
Brendan