Brendan wrote:
> Alternatively, maybe you write your code to handle misaligned arrays; with an initial loop to handle from the start of the array up to the first page boundary, then a middle loop for the majority of the array (that is page aligned), then another loop for the (partially used) last page of the array.

embryo wrote:
> It works for big arrays; but when the predictor catches the direction of a sequential memory access, wouldn't it preload the next page when memory reads hit somewhere close to the predictor's threshold? Then only the initial loop seems viable.

If "predictor" means the hardware prefetcher's stream detector (which needs a few cache misses to get started, and which stops every time it reaches a page boundary), then either you're relying on hardware prefetching and shouldn't do software prefetching, or you're relying on software prefetching and there shouldn't be any cache misses left for the hardware prefetcher to detect.
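The three-loop structure described above can be sketched like this (a minimal C sketch, not from the original post; `__builtin_prefetch` is a GCC/Clang builtin, and real code would normally prefetch a few cache lines ahead inside the inner loop rather than one address per page):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Sum a byte array in three phases: a head loop from the (possibly
   misaligned) start up to the first page boundary, a page-aligned middle
   loop that software-prefetches the next page, and a tail loop for the
   partially used last page. */
uint64_t sum_bytes(const uint8_t *data, size_t len)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Head: up to the first page boundary. */
    size_t head = (PAGE_SIZE - ((uintptr_t)data & (PAGE_SIZE - 1))) & (PAGE_SIZE - 1);
    if (head > len)
        head = len;
    for (; i < head; i++)
        total += data[i];

    /* Middle: whole pages; hint the next page before walking this one. */
    while (len - i >= PAGE_SIZE) {
        if (len - i > PAGE_SIZE)
            __builtin_prefetch(data + i + PAGE_SIZE, 0, 0);
        size_t end = i + PAGE_SIZE;
        for (; i < end; i++)
            total += data[i];
    }

    /* Tail: the partially used last page. */
    for (; i < len; i++)
        total += data[i];

    return total;
}
```

The prefetch hint is purely advisory; the correctness of the sum doesn't depend on it, which is why the head/middle/tail split is safe for any alignment and length.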
embryo wrote:
> But if there is an opportunity to work with data in chunks (like usage of cached data in more than one operation), then the predictor will be unable to preload the data, because the memory read pattern tends to exhibit locality instead of sequentiality. Here we need to use the prefetch instruction. And as I understand it, it should be executed at least twice to initiate the predictor's preload action, right?

For the prefetch instruction, if there's a TLB miss it does nothing at all. If you have a loop that tries to prefetch the same data 12 billion times and there's a TLB miss, then it will do nothing 12 billion times and still won't prefetch anything.
To force the TLB to be loaded you need something else - e.g. maybe "cmp dword [address],0". The important thing is to make sure this instruction doesn't depend on any other instructions and that no other instructions depend on it; so that a typical "out of order" CPU can/will do other instructions while it waits for the "cmp dword [address],0" to complete (and so you don't end up with the CPU stalled until the "cmp dword [address],0" completes).

Note: To be more accurate, the CPU will speculatively execute instructions until it knows the "cmp dword [address],0" won't cause a page fault (and until it knows the results of the speculatively executed instructions can be committed); and the CPU may still stall due to limits on the number of speculatively executed instructions it can keep track of; but this has less impact than a normal "TLB miss with dependent instructions" plus the cache misses caused by failing to prefetch cache lines.
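In C terms the trick might look like this (a sketch; the helper name is invented here, and a compiler gives no hard guarantee about scheduling the way hand-written asm does):

```c
#include <stdint.h>

/* Sketch of the "independent dummy read" idea (the helper name is mine,
   not from the post). Nothing consumes the loaded value, so no later
   instruction depends on it; an out-of-order CPU can resolve the TLB
   miss (and detect any page fault) in the background while it keeps
   executing other work. */
static inline void warm_tlb(const void *p)
{
    (void)*(volatile const uint8_t *)p;   /* discarded, dependency-free load */
}
```

In hand-written assembly the same effect is the literal "cmp dword [address],0", placed well before the instructions that actually use the data at that address.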
Brendan wrote:
> Ideally, the length of an instruction's opcode should depend on how often it's used - the most frequently used instructions should have the shortest opcodes possible.

embryo wrote:
> But if instruction read bandwidth is measured in gigabytes per second, then it is possible to have just a slightly more complex decoder and forget about instruction length. Or are there situations when the processor stalls because of an empty instruction pipeline?

There are many situations where the processor stalls because of an empty instruction pipeline. A simple example is a branch mis-prediction.
Improving instruction density (getting more instructions into less RAM) reduces the time it takes to recover from stalls and also increases the number of instructions you can fit in the caches.
embryo wrote:
> As I see it, an instruction cache is independent and instruction reads should have more priority than data reads, so it is hardly possible to stall a processor because of a high data load.

For an instruction to complete you need both the instruction and its data. Having one and not the other doesn't help, regardless of whether you have the instruction and not the data, or the data and not the instruction. If you make instruction fetch a higher priority, then you'd just spend more time stalled waiting for data.
Brendan wrote:
> Once we complete the transition to UEFI, real mode will become removable.

embryo wrote:
> I suppose such a transition will take us as far as the 2030s or even later. What has become removable in the last 15-20 years? Or in 35 years?

In the last 20 years: the PIC got replaced with the IO APIC, the PIT got replaced with HPET, ISA got replaced with PCI, serial and parallel and floppy got replaced with USB, PATA got replaced by AHCI, FPU/MMX/3DNow got replaced by SSE, protected mode got replaced by long mode, plain 32-bit paging got replaced by PAE which got replaced by long mode, hardware task switching got replaced by nothing, several exceptions (coprocessor segment overrun, bound, overflow) got deprecated, etc.
For all of these cases, whether or not they're "removable" depends on how much backward compatibility you really want. For some things (smartphones, tablets, high-end servers) you don't care and all of it is removable. For "BIOS compatible" systems you care more (and often you end up with "technically removed but emulated"). If you replace BIOS with UEFI you have a lot less reason to care.
What I expect is that it'll happen in stages - e.g. smartphones, tablets, high-end servers, 80x86 Apple stuff will be first to start removing the obsolete trash (already happening) and mainstream laptop/desktop/server will take far longer; and maybe in 20 years time most of it will be gone (but even then there's still going to be some hobo somewhere manufacturing systems with real mode, BIOS and ISA bus for an obscure niche market).
Brendan wrote:
> If an OS doesn't use paging, then it won't suffer from TLB misses at all; but will suffer from other problems (related to efficient memory management).

embryo wrote:
> What kind of problems of memory management (without virtual memory) can you think of in the case of managed code?

Dealing with physical address space fragmentation, supporting memory mapped files efficiently, supporting shared memory efficiently, supporting swap space efficiently, doing various "copy on write" tricks efficiently, etc. All of these things are more efficient because paging provides an extra layer of indirection.
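One of those paging tricks can be demonstrated from user space (a POSIX sketch; the file name and function name are mine, not from the post): a MAP_PRIVATE mapping gives copy-on-write semantics, so the first write through the mapping page-faults, the kernel copies the page for this process only, and the underlying file is left untouched.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a small file, map it copy-on-write, scribble on the mapping,
   and check the file itself was not modified. Returns 1 on the expected
   copy-on-write behaviour, 0 on any failure. */
int cow_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 0;
    if (write(fd, "original", 8) != 8) {
        close(fd);
        return 0;
    }

    /* MAP_PRIVATE: writes hit a private copy of the page, not the file. */
    char *map = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return 0;
    }
    memcpy(map, "modified", 8);           /* triggers the copy-on-write fault */

    char buf[9] = {0};
    int unchanged = (pread(fd, buf, 8, 0) == 8) && (strcmp(buf, "original") == 0);

    munmap(map, 8);
    close(fd);
    unlink(path);
    return unchanged;
}
```

Without paging, an OS would have to implement this by physically copying the data up front (or by intercepting every access in software), which is exactly the kind of efficiency loss Brendan is pointing at.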
Managed code could also provide an extra layer of indirection in software; but it'd have the same issues as paging (e.g. cache misses caused by a "virtual reference to physical address" lookup in managed code, which is the equivalent of a TLB miss), and it'd be slower than the "already hardware accelerated" equivalent. More likely is that managed code wouldn't add the extra layer of indirection and wouldn't get any of the advantages; and then the developers would try to pretend everything is OK by publishing research papers showing the results of carefully selected (and heavily biased/misleading) micro-benchmarks, in an attempt to hide the fact that the OS sucks for all practical purposes.
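What that software indirection layer looks like in miniature (illustrative C; all names are invented here): every access through a "managed reference" pays an extra memory access for the table lookup, which is the same cost class as a TLB miss, minus the hardware doing it for free.

```c
#include <stddef.h>

/* A software "virtual reference -> object" table: the managed-code
   analogue of a page table. Dereferencing a handle always costs one
   extra lookup (and, on real hardware, a potential extra cache miss). */
enum { MAX_HANDLES = 256 };

static void *handle_table[MAX_HANDLES];

/* Hand out the first free slot; returns a handle, or -1 if full. */
static int handle_alloc(void *obj)
{
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (handle_table[h] == NULL) {
            handle_table[h] = obj;
            return h;
        }
    }
    return -1;
}

/* The extra indirection step every "managed" access must pay. */
static void *handle_resolve(int h)
{
    return (h >= 0 && h < MAX_HANDLES) ? handle_table[h] : NULL;
}
```

Paging does the equivalent lookup in hardware, with the TLB caching the translations; a software table gets neither the dedicated hardware nor the dedicated cache.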
Cheers,
Brendan