Scrolling terminal in software-emulated text mode

Octocontrabass · Post by **Octocontrabass** » Fri Dec 11, 2020 7:25 pm

bzt wrote:"CALL" instruction does not use branch predictions as conditional near jumps like "JE", "JNE" etc.

Yes it does. See Agner Fog's microarchitecture guide, which explains how the Branch Target Buffer (part of the branch predictor) tracks the destination of both conditional and unconditional jumps and calls, and how some CPUs have an additional branch target predictor for unconditional direct jumps and calls to minimize the penalty when the BTB hasn't already cached the destination address.

Or, refer to Intel's optimization manual (which you linked already), section E.2.2.3 on pages E-13 and E-14, which notes that the branch predictor in Sandy Bridge CPUs is used for "Direct calls and jumps" and "Indirect calls and jumps".

bzt wrote:The distance of the call also matters, see Intel Software Developer Manual Vol 2A page 3-126. See section "Operation" with the microcode.

I don't see anything about the distance of the call in the pseudocode. "Near" refers to instructions that don't load CS, and "far" refers to instructions that do.

bzt wrote:Also read about cache handling with and without LFENCE for both near, normal and far calls (hint: there's a difference).

Intel says speculative execution of an indirect near CALL may mispredict the branch destination as the subsequent instruction instead of the actual destination, and LFENCE can be used to prevent those instructions from being speculatively executed while the CPU catches up and determines the correct branch destination. From a performance perspective, a speculative execution barrier will prevent cache pollution in the case where the branch is mispredicted, but it may also hurt performance when the branch is correctly predicted.

bzt wrote:
eekee wrote:Note the "prediction" part - this is for conditional branches; unconditional calls must surely have been optimized long before.
If that were true, then the compiler wouldn't inline certain functions or unroll loops as a speed optimization. But it does (long before execution; those are done at compile time).

I suspect the rest of the function call overhead will outweigh the CALL instruction, thanks to the branch predictor. Do you have any benchmarks of this?

bzt wrote:Good read on the topic: http://www.ece.uah.edu/~milenka/docs/mi ... WDDD02.pdf (explains why jump distance matters, and some other things as well)

Perhaps I'm missing something, but I don't see where the distance between the jump instruction and its destination is mentioned in that paper. It talks a lot about the distance between different jump instructions.

bzt · Post by **bzt** » Sat Dec 12, 2020 7:34 am

eekee wrote:@bzt: I put myself in a difficult position by challenging you here because I don't exactly want to argue with you right now, but... I don't know, the things you say are just too weird.

Yes, the whole topic of optimization is weird.

It is not an exact science, and the deeper you dive into it, the more CPU-family- and model-specific it becomes.

I don't want to argue either, so I say measure it! Create some simple test cases, and measure how long they run! I can guarantee you that a loop without a CALL will be faster than a loop with it, but see it for yourself!

inlined function - fastest
function call - slower
shared library call (through PLT) - slowest
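
For anyone who wants to try the measurement suggested above, here is a minimal sketch in C. It is not taken from the thread: the function names (`step_inline`, `step_called`, `loop_inlined`, `loop_called`, `time_loop`, `report`) and the workload are hypothetical, `__attribute__((noinline))` assumes GCC or Clang, and `clock_gettime` assumes a POSIX system. It only compares an inlined helper with a non-inlined one; the shared-library (PLT) case would need a separate build. Build with optimization (for example `-O2`), otherwise the "inlined" variant is unlikely to be inlined at all.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical workload: one cheap mixing step, small enough that
 * call overhead is not hidden by the work itself. */
static inline uint32_t step_inline(uint32_t x, uint32_t i)
{
    return (x ^ i) * 2654435761u;
}

/* Same body, but the compiler is asked not to inline it
 * (GCC/Clang-specific attribute; an assumption of this sketch). */
__attribute__((noinline)) uint32_t step_called(uint32_t x, uint32_t i)
{
    return (x ^ i) * 2654435761u;
}

/* Loop whose body can be inlined by the optimizer. */
uint32_t loop_inlined(uint32_t n)
{
    uint32_t x = 1;
    for (uint32_t i = 0; i < n; i++)
        x = step_inline(x, i);
    return x;
}

/* Loop that performs a real function call on every iteration. */
uint32_t loop_called(uint32_t n)
{
    uint32_t x = 1;
    for (uint32_t i = 0; i < n; i++)
        x = step_called(x, i);
    return x;
}

/* Monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Runs one loop variant and returns the elapsed nanoseconds.
 * The loop's result is stored through `result` so the work is observable. */
uint64_t time_loop(uint32_t (*loop)(uint32_t), uint32_t n, uint32_t *result)
{
    uint64_t start = now_ns();
    *result = loop(n);
    return now_ns() - start;
}

/* Times both variants once and prints the elapsed times. */
void report(uint32_t n)
{
    uint32_t r_inl, r_call;
    uint64_t t_inl = time_loop(loop_inlined, n, &r_inl);
    uint64_t t_call = time_loop(loop_called, n, &r_call);

    assert(r_inl == r_call); /* both variants must compute the same value */
    printf("inlined: %llu ns\n", (unsigned long long)t_inl);
    printf("called:  %llu ns\n", (unsigned long long)t_call);
}
```

Calling `report(100000000u)` from `main` a few times gives a rough comparison. A single run is noisy, and the numbers depend on the compiler, its flags, and the CPU model, so treat any difference as specific to that setup.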

You might think I'm a know-it-all, but that's not true. I'm certain because I constantly take measurements during development. For example, see here and here; these are for my font renderer and my 3D model loader. I only use the specs to get some idea of how to speed things up, and then I always verify empirically whether a modification actually improves performance. And that's a trick that anybody can do.

Octocontrabass wrote:
bzt wrote:"CALL" instruction does not use branch predictions as conditional near jumps like "JE", "JNE" etc.
Yes it does.

Maybe my phrasing was not the best, sorry. I was trying to say they are not using the same predictions as conditional near jumps. Using a different branch prediction for CALL does not mean they can't have any.

Cheers,
bzt

Octocontrabass · Post by **Octocontrabass** » Sat Dec 12, 2020 4:33 pm

bzt wrote:Maybe my phrasing was not the best, sorry. I was trying to say they are not using the same predictions as conditional near jumps. Using a different branch prediction for CALL does not mean they can't have any.

The original Pentium (without MMX) uses the same branch predictor for both conditional and unconditional jumps, and it can mispredict unconditional jumps as not taken. Most other CPUs share the target predictor between conditional and unconditional jumps, but only use the taken/not-taken predictor for conditional jumps.

bzt · Post by **bzt** » Mon Dec 14, 2020 6:22 am

Octocontrabass wrote:The original Pentium (without MMX) uses the same branch predictor for both conditional and unconditional jumps

But modern CPUs do not. Look, it doesn't matter what you say when the measurements say otherwise. Inlining a function is faster than calling it.

Cheers,
bzt

OSDev.org
