An efficient way to fetch a single instruction from cache

limp · Post by **limp** » Sun Jul 22, 2012 5:44 am

Hi all,

I am running some benchmark tests and for that, I want to have only the RDTSC in cache and nothing else. So, only this instruction will be fetched from cache and all the others from the main memory. Another requirement is that this applies to both cores of my Intel Atom 330 target.

One way I thought about doing it is to do the following from the BSP, (before it boots the AP):
- Invalidate cache lines using WBINVD
- Set the cache to "Normal cache mode"
- Execute a "RDTSC" instruction
- Set the cache to "No-fill mode"

After the end of the above procedure, I will just keep the cache enabled as it is by default.
My understanding is that only the RDTSC instruction will then be cached by both cores, and everything else will be fetched from main memory.
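For reference, the two cache modes named in the steps above are selected by the CD and NW bits of CR0. The following is only a sketch of the bit manipulation: the helper names `cr0_normal_cache` and `cr0_no_fill` are made up for illustration, and the privileged read and write of CR0 is deliberately left out.

```c
#include <stdint.h>

/* CR0 cache-control bits as documented in the Intel SDM. */
#define CR0_NW (UINT32_C(1) << 29) /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30) /* Cache Disable */

/* "Normal cache mode": CD = 0, NW = 0. */
static inline uint32_t cr0_normal_cache(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* "No-fill mode": CD = 1, NW = 0. Lines already in the cache can still
 * hit, but misses do not allocate new lines. */
static inline uint32_t cr0_no_fill(uint32_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}
```

On real hardware the returned value would have to be written back to CR0 from ring 0. CR0 is per-processor state, so each core would presumably need to run the sequence itself rather than rely on the BSP alone.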

What do you guys think? Do you think that this will work as expected and, if so, do you have a more efficient way of doing it in mind?

I look forward to your comments/suggestions.

Regards,
limp

Nable · Post by **Nable** » Sun Jul 22, 2012 8:46 am

CPU doesn't cache single instructions, cache consists of cache-lines.
You're doing it wrong, think twice about what you really want to achieve.

limp · Post by **limp** » Sun Jul 22, 2012 9:23 am

Thanks for your reply Nable,

Nable wrote:CPU doesn't cache single instructions, cache consists of cache-lines.
You're doing it wrong, think twice about what you really want to achieve.

I know that the CPU caches cache-lines and what I want is to have only a cache-line cached which will contain the RDTSC instruction. When you just say "you're doing it wrong", you're not really helping. Which part seems wrong to you?
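To make the cache-line point concrete, here is a small sketch that works out which line an instruction lives in and whether it straddles two lines. It assumes a 64-byte line size; the helper names are hypothetical. RDTSC itself encodes as the two bytes 0F 31.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumed line size */
#define RDTSC_LENGTH     2u /* RDTSC encodes as 0F 31 */

/* Base address of the cache line containing addr. */
static inline uintptr_t cache_line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* True if an instruction of len bytes (len >= 1) starting at addr
 * lies entirely within one cache line. */
static inline bool fits_in_one_line(uintptr_t addr, unsigned len)
{
    return cache_line_base(addr) == cache_line_base(addr + len - 1);
}
```

Whatever else shares that 64-byte line (neighbouring instructions or padding) is cached along with the RDTSC, which is why placing the instruction on an aligned, otherwise empty line matters for this kind of setup.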

Owen · Post by **Owen** » Sun Jul 22, 2012 10:10 am

You set the cache to "normal cache mode" and the CPU will issue a bunch of speculative prefetch requests for unpredictable addresses. There's no way to say precisely "just have this address in cache".

I don't see what you're trying to do here, and I see no benchmark which requires this kind of setup. That said, if I wanted to do it, I would:

- Disable the CPU caches
- Execute a WBINVD to flush them completely
- Completely rewrite the MTRRs to set the whole of RAM to non cache
- Set one MTRR to cache the page of RAM which contained my RDTSCP
- Re-enable the CPU caches
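The MTRR part of these steps boils down to three register values, which can be computed without touching hardware. This is a sketch only: `mtrr_encode` and `mtrr_def_type_all_uncached` are hypothetical helpers, the physical address width is passed in as a parameter because it is CPU specific, and the actual WRMSR calls (ring 0 only) are omitted.

```c
#include <stdint.h>

/* MSR numbers and field encodings from the Intel SDM. */
#define MSR_IA32_MTRR_PHYSBASE0 0x200u
#define MSR_IA32_MTRR_PHYSMASK0 0x201u
#define MSR_IA32_MTRR_DEF_TYPE  0x2FFu

#define MTRR_TYPE_UC 0x00u /* uncacheable */
#define MTRR_TYPE_WB 0x06u /* write-back  */
#define MTRR_DEF_TYPE_ENABLE (UINT64_C(1) << 11) /* E bit     */
#define MTRR_PHYSMASK_VALID  (UINT64_C(1) << 11) /* valid bit */

struct mtrr_pair {
    uint64_t physbase;
    uint64_t physmask;
};

/* Encode one variable-range MTRR. size must be a power of two >= 4096,
 * base must be aligned to size, and phys_bits must be below 64. */
static inline struct mtrr_pair mtrr_encode(uint64_t base, uint64_t size,
                                           unsigned type, unsigned phys_bits)
{
    uint64_t addr_mask = ((UINT64_C(1) << phys_bits) - 1) & ~UINT64_C(0xFFF);
    struct mtrr_pair p;
    p.physbase = (base & addr_mask) | type;
    p.physmask = (~(size - 1) & addr_mask) | MTRR_PHYSMASK_VALID;
    return p;
}

/* IA32_MTRR_DEF_TYPE value: MTRRs enabled, fixed ranges disabled,
 * default memory type uncacheable. */
static inline uint64_t mtrr_def_type_all_uncached(void)
{
    return MTRR_DEF_TYPE_ENABLE | MTRR_TYPE_UC;
}
```

For example, a write-back 4 KiB page at the (arbitrarily chosen) address 0x100000 with an assumed 36-bit physical address width gives a base value of 0x100006 and a mask value of 0xFFFFFF800.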

Still, I see no meaningful benchmark which will come of this

limp · Post by **limp** » Sun Jul 22, 2012 5:15 pm

Thanks for your reply Owen,

I assumed that by setting the cache to "No-fill mode", no speculative execution will take place.
Will your workaround guarantee that only the RDTSC is used or speculative execution prefetches may still occur?

By the way, I am not using paging so I guess that the size of the region that contains the RDTSC instruction can be quite small...do you happen to know the absolute minimum for an MTRR region?

Thanks in advance.

Brendan · Post by **Brendan** » Mon Jul 23, 2012 5:42 am

Hi,

limp wrote:I assumed that by setting the cache to "No-fill mode", no speculative execution will take place.
Will your workaround guarantee that only the RDTSC is used or speculative execution prefetches may still occur?

In theory, speculative execution would still take place. The only difference is that instructions would be fetched from RAM instead of being fetched from cache. However, I don't think Intel's Atom does very much speculative execution anyway (it's a very simple core designed for low power not high performance; where hyper-threading is used to try to hide stalls).

limp wrote:By the way, I am not using paging so I guess that the size of the region that contains the RDTSC instruction can be quite small...do you happen to know the absolute minimum for an MTRR region?

Minimum size for variable range MTRRs is 4096 bytes. The fixed range MTRRs only cover the first 1 MiB, in ranges of 64 KiB, 16 KiB and 4 KiB.
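The constraints on a variable range can be written as a quick check. This is a sketch under the usual rules (size a power of two, at least 4096 bytes, base aligned to the size); the function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MTRR_MIN_RANGE UINT64_C(4096) /* smallest variable-range size */

/* True if (base, size) can be described by a single variable-range MTRR. */
static inline bool mtrr_variable_range_ok(uint64_t base, uint64_t size)
{
    bool power_of_two = size != 0 && (size & (size - 1)) == 0;
    return power_of_two
        && size >= MTRR_MIN_RANGE
        && (base & (size - 1)) == 0;
}
```

So even without paging, the region holding the RDTSC cannot be made smaller than one 4 KiB aligned block this way.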

Also note that if you execute a lot of code (e.g. 1 million instructions) plus one RDTSC, then the time taken by the RDTSC is going to be negligible regardless of what you do. I'd consider executing a lot of code (e.g. ensure that the ratio of "RDTSC" to other instructions is tiny) and disable all caches without bothering with MTRRs at all.
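The ratio argument can be put into numbers with a one-line calculation. The cycle counts used in the example afterwards are purely illustrative assumptions, not measurements from an Atom 330.

```c
/* Fraction of the total measured cycles that is spent in RDTSC itself,
 * given the cost of one RDTSC and the cost of the surrounding code. */
static inline double rdtsc_overhead_fraction(double rdtsc_cycles,
                                             double other_instructions,
                                             double cycles_per_instruction)
{
    double work = other_instructions * cycles_per_instruction;
    return rdtsc_cycles / (rdtsc_cycles + work);
}
```

With a hypothetical 100-cycle RDTSC and one million other instructions at one cycle each, the RDTSC accounts for less than 0.01% of the total, whether or not it was fetched from cache.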

Finally, Nable was entirely correct - whatever you think you're doing doesn't make any sense. Essentially you'd be using instruction fetch to benchmark RAM speed (and not benchmarking anything to do with CPU performance at all); and if you actually wanted to benchmark RAM speed properly there's far better (and much more accurate) ways to do that.

Cheers,

Brendan

limp · Post by **limp** » Mon Jul 23, 2012 10:53 am

Hi Brendan,

Thanks for your reply.

Brendan wrote:The only difference is that instructions would be fetched from RAM instead of being fetched from cache.

You mean that this will be the only difference if I have cache to "no-fill mode" rather than "normal cache mode"?

That is, if I have declared all memory as uncached, apart from a single page that contains only my RDTSC instruction (and the rest of it is empty), then with the cache in "normal cache mode" speculative prefetching will fetch from cache, and with "no-fill mode" enabled it will fetch from RAM? Is that what you're saying here?

Thanks

Brendan · Post by **Brendan** » Wed Jul 25, 2012 1:26 am

Hi,

limp wrote:
Brendan wrote:The only difference is that instructions would be fetched from RAM instead of being fetched from cache.
You mean that this will be the only difference if I have cache to "no-fill mode" rather than "normal cache mode"?

Yes. If the caches are enabled the CPU will use caches. If the caches are disabled (e.g. "no-fill mode" with any previous cache contents flushed via WBINVD or something) then the CPU won't use caches.

There is no way to disable speculative execution (regardless of what you do with caches), because there's no sane reason to disable speculative execution.

Cheers,

Brendan

Owen · Post by **Owen** » Wed Jul 25, 2012 1:47 am

Also note that disabling the instruction cache will have other unwanted effects - for example, on recent Intel CPUs, disabling the microcode trace cache; for most AMD CPUs, you're going to dramatically cut the instruction decode rate (because the ICache contains instruction boundary markers)

OSDev.org
