Brendan wrote:
> Alternatively, maybe you write your code to handle misaligned arrays; with an initial loop to handle from the start of the array up to the first page boundary, then a middle loop for the majority of the array (that is page aligned), then another loop for the (partially used) last page of the array.

embryo wrote:
> It works for big arrays; but when the predictor catches the direction of a sequential memory access, wouldn't it preload the next page when memory reads hit somewhere close to the predictor's threshold? Then only the initial loop seems viable.

If "predictor" means the hardware prefetcher's stream detector (which needs a few cache misses to get started, and which stops every time it reaches a page boundary), then either you're relying on hardware prefetching and shouldn't do software prefetching, or you're relying on software prefetching and there shouldn't be any cache misses left for the hardware prefetcher to detect.
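The three-loop structure described above can be sketched like this (a minimal C sketch, not from the original post; `__builtin_prefetch` is a GCC/Clang builtin, and real code would normally prefetch a few cache lines ahead inside the inner loop rather than one address per page):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Sum a byte array in three phases: a head loop from the (possibly
   misaligned) start up to the first page boundary, a page-aligned middle
   loop that software-prefetches the next page, and a tail loop for the
   partially used last page. */
uint64_t sum_bytes(const uint8_t *data, size_t len)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Head: up to the first page boundary. */
    size_t head = (PAGE_SIZE - ((uintptr_t)data & (PAGE_SIZE - 1))) & (PAGE_SIZE - 1);
    if (head > len)
        head = len;
    for (; i < head; i++)
        total += data[i];

    /* Middle: whole pages; hint the next page before walking this one. */
    while (len - i >= PAGE_SIZE) {
        if (len - i > PAGE_SIZE)
            __builtin_prefetch(data + i + PAGE_SIZE, 0, 0);
        size_t end = i + PAGE_SIZE;
        for (; i < end; i++)
            total += data[i];
    }

    /* Tail: the partially used last page. */
    for (; i < len; i++)
        total += data[i];

    return total;
}
```

The prefetch hint is purely advisory; the correctness of the sum doesn't depend on it, which is why the head/middle/tail split is safe for any alignment and length.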
embryo wrote:
> But if there is an opportunity to work with data in chunks (like usage of cached data in more than one operation), then the predictor will be unable to preload the data, because the memory read pattern tends to exhibit locality instead of sequentiality. Here we need to use the prefetch instruction. And as I understand it, it should be executed at least twice to initiate the predictor's preload action, right?

For the prefetch instruction, if there's a TLB miss it does nothing at all. If you have a loop that tries to prefetch the same data 12 billion times and there's a TLB miss, then it will do nothing 12 billion times and still won't prefetch anything.
To force the TLB to be loaded you need something else - e.g. maybe "cmp dword [address],0". The important thing is to make sure this instruction doesn't depend on any other instructions and that no other instructions depend on it; so that a typical "out of order" CPU can/will do other instructions while it waits for the "cmp dword [address],0" to complete (and so you don't end up with the CPU stalled until the "cmp dword [address],0" completes).

Note: To be more accurate, the CPU will speculatively execute instructions until it knows the "cmp dword [address],0" won't cause a page fault (and until it knows the results of the speculatively executed instructions can be committed); and the CPU may still stall due to limits on the number of speculatively executed instructions it can keep track of; but this has less impact than a normal "TLB miss with dependent instructions" plus the cache misses caused by failing to prefetch cache lines.
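In C terms the trick might look like this (a sketch; the helper name is invented here, and a compiler gives no hard guarantee about scheduling the way hand-written asm does):

```c
#include <stdint.h>

/* Sketch of the "independent dummy read" idea (the helper name is mine,
   not from the post). Nothing consumes the loaded value, so no later
   instruction depends on it; an out-of-order CPU can resolve the TLB
   miss (and detect any page fault) in the background while it keeps
   executing other work. */
static inline void warm_tlb(const void *p)
{
    (void)*(volatile const uint8_t *)p;   /* discarded, dependency-free load */
}
```

In hand-written assembly the same effect is the literal "cmp dword [address],0", placed well before the instructions that actually use the data at that address.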
Brendan wrote:
> Ideally, the length of an instruction's opcode should depend on how often it's used - the most frequently used instructions should have the shortest opcodes possible.

embryo wrote:
> But if instruction read bandwidth is measured in gigabytes per second, then it is possible to have just a slightly more complex decoder and forget about instruction length. Or are there situations when the processor stalls because of an empty instruction pipeline?

There are many situations where the processor stalls because of an empty instruction pipeline. A simple example is a branch mis-prediction.
Improving instruction density (getting more instructions into less RAM) reduces the time it takes to recover from stalls and also increases the number of instructions you can fit in the caches.
embryo wrote:
> As I see it, an instruction cache is independent and instruction reads should have more priority than data reads, so it is hardly possible to stall a processor because of a high data load.

For an instruction to complete you need both the instruction and its data. Having one and not the other doesn't help, regardless of whether you have the instruction and not the data, or the data and not the instruction. If you make instruction fetch a higher priority, then you'd just spend more time stalled waiting for data.
Brendan wrote:
> Once we complete the transition to UEFI, real mode will become removable.

embryo wrote:
> I suppose such a transition will take us as far as the 2030s or even later. What has become removable in the last 15-20 years? Or in 35 years?

In the last 20 years: the PIC got replaced with the IO APIC, the PIT got replaced with HPET, ISA got replaced with PCI, serial and parallel and floppy got replaced with USB, PATA got replaced by AHCI, FPU/MMX/3DNow got replaced by SSE, protected mode got replaced by long mode, plain 32-bit paging got replaced by PAE which got replaced by long mode, hardware task switching got replaced by nothing, several exceptions (coprocessor segment overrun, bound, overflow) got deprecated, etc.
For all of these cases, whether or not they're "removable" depends on how much backward compatibility you really want. For some things (smartphones, tablets, high-end servers) you don't care and all of it is removable. For "BIOS compatible" systems you care more (and often you end up with "technically removed but emulated"). If you replace BIOS with UEFI you have a lot less reason to care.
What I expect is that it'll happen in stages - e.g. smartphones, tablets, high-end servers, 80x86 Apple stuff will be first to start removing the obsolete trash (already happening) and mainstream laptop/desktop/server will take far longer; and maybe in 20 years time most of it will be gone (but even then there's still going to be some hobo somewhere manufacturing systems with real mode, BIOS and ISA bus for an obscure niche market).
Brendan wrote:
> If an OS doesn't use paging, then it won't suffer from TLB misses at all; but will suffer from other problems (related to efficient memory management).

embryo wrote:
> What kind of problems of memory management (without virtual memory) can you think of in the case of managed code?

Dealing with physical address space fragmentation, supporting memory mapped files efficiently, supporting shared memory efficiently, supporting swap space efficiently, doing various "copy on write" tricks efficiently, etc. All of these things are more efficient because paging provides an extra layer of indirection.
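One of those paging tricks can be demonstrated from user space (a POSIX sketch; the file name and function name are mine, not from the post): a MAP_PRIVATE mapping gives copy-on-write semantics, so the first write through the mapping page-faults, the kernel copies the page for this process only, and the underlying file is left untouched.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a small file, map it copy-on-write, scribble on the mapping,
   and check the file itself was not modified. Returns 1 on the expected
   copy-on-write behaviour, 0 on any failure. */
int cow_demo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return 0;
    if (write(fd, "original", 8) != 8) {
        close(fd);
        return 0;
    }

    /* MAP_PRIVATE: writes hit a private copy of the page, not the file. */
    char *map = mmap(NULL, 8, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return 0;
    }
    memcpy(map, "modified", 8);           /* triggers the copy-on-write fault */

    char buf[9] = {0};
    int unchanged = (pread(fd, buf, 8, 0) == 8) && (strcmp(buf, "original") == 0);

    munmap(map, 8);
    close(fd);
    unlink(path);
    return unchanged;
}
```

Without paging, an OS would have to implement this by physically copying the data up front (or by intercepting every access in software), which is exactly the kind of efficiency loss Brendan is pointing at.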
Managed code could also provide an extra layer of indirection in software; but it'd have the same issues as paging (e.g. cache misses caused by a "virtual reference to physical address" lookup in managed code, which is the equivalent of a TLB miss), and it'd be slower than the "already hardware accelerated" equivalent. More likely is that managed code wouldn't add the extra layer of indirection and wouldn't get any of the advantages; and then the developers would try to pretend everything is OK by publishing research papers showing the results of carefully selected (and heavily biased/misleading) micro-benchmarks, in an attempt to hide the fact that the OS sucks for all practical purposes.
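What that software indirection layer looks like in miniature (illustrative C; all names are invented here): every access through a "managed reference" pays an extra memory access for the table lookup, which is the same cost class as a TLB miss, minus the hardware doing it for free.

```c
#include <stddef.h>

/* A software "virtual reference -> object" table: the managed-code
   analogue of a page table. Dereferencing a handle always costs one
   extra lookup (and, on real hardware, a potential extra cache miss). */
enum { MAX_HANDLES = 256 };

static void *handle_table[MAX_HANDLES];

/* Hand out the first free slot; returns a handle, or -1 if full. */
static int handle_alloc(void *obj)
{
    for (int h = 0; h < MAX_HANDLES; h++) {
        if (handle_table[h] == NULL) {
            handle_table[h] = obj;
            return h;
        }
    }
    return -1;
}

/* The extra indirection step every "managed" access must pay. */
static void *handle_resolve(int h)
{
    return (h >= 0 && h < MAX_HANDLES) ? handle_table[h] : NULL;
}
```

Paging does the equivalent lookup in hardware, with the TLB caching the translations; a software table gets neither the dedicated hardware nor the dedicated cache.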
Cheers,
Brendan