OSDev.org

Posted: **Wed Aug 11, 2010 1:05 pm**

Artlav wrote:
bewing wrote:If you do a REP and ECX is 0, it should repeat 4 billion times.
It should? I couldn't find a mention in the Intel docs.

If you look at the pseudocode for REP in the manual, it clearly shows that ECX is predecremented before testing against 0. I will try my own tests on real hardware. I never use REP with ECX = 0.
And rebochs is not locked up AFAIK -- it is just busy repeating your instruction 4 billion times, which takes a while.

Not quite stupid - it's a quick and easy way to write out debug information, in two commands you can send a mark into the console that something went wrong or right. It's like a serial port.

I support that, but I also use port e9 for many other things. Point ESI at your output string, EDI at the string that should appear inside the [], set AL to [0 - 4] (info to fatal) and do an OUT 0xe9, AL. Your string will appear in the logfile.

By breakpoint i meant this sequence: ...
Which in Bochs breaks execution and gives you debug prompt. Very handy.

I see. I will probably support that with a special port 0xe9 call. Why not use XCHG EBX, EBX instead? That's a much simpler "magic breakpoint".

Posted: **Wed Aug 11, 2010 1:17 pm**

bewing wrote:If you look at the pseudocode for REP in the manual, it clearly shows that ECX is predecremented before testing against 0. I will try my own tests on real hardware. I never use REP with ECX = 0.

Strange.
Case in question:
-Real pc, step-by-step debugger
-mov ecx,0; rep stosb
-Nothing is written to the given address, debugger shows nothing done

Special case?
Something i missed?

bewing wrote:Point ESI at your output string, EDI at the string that should appear inside the [], set AL to [0 - 4] (info to fatal) and do an OUT 0xe9, AL. Your string will appear in the logfile.

Handy by being somewhat higher-level. But that needs a known memory location and a predefined string, while a simple OUT can be done independent of position and situation.

bewing wrote: I see. I will probably support that with a special port 0xe9 call. Why not use XCHG EBX, EBX instead? That's a much simpler "magic breakpoint".

I agree, that might be better.
Regardless of how it is done, it being done would be a good thing.

Posted: **Wed Aug 11, 2010 1:24 pm**

Handy by being somewhat higher-level. But that needs a known memory location and a predefined string, while a simple OUT can be done independent of position and situation.

Not really.

Code: Select all

push 0x414243
mov esi, esp
push 0x444546
mov edi, esp
mov al, 3
out 0xe9, al
pop esi   ; fix the stack
pop esi

Just allocate enough stack space for a local byte array and fill it at runtime. Nothing needs to be predefined.

Regardless of how it is done, it being done would be a good thing.

The XCHG REG, REG magic breakpoints are already supported, on both bochs and rebochs.

Posted: **Wed Aug 11, 2010 1:28 pm**

Owen wrote:AMD says that "Repetition is terminated when rCX reaches zero", which implies to me that it is terminated if rCX is zero *after* the decrement. In fact, if Bochs is not doing the repetition in that case, it is very, very buggy and will choke randomly on a lot of optimized code, because "rep ret" is a very common optimization

Ok, quite nice way to misunderstand the documentation

If the count is ZERO at the beginning of a string instruction, it won't execute any iterations. The actual Intel manual pseudocode matches this.
The Intel manual says:

Code: Select all

WHILE CountReg ≠ 0
DO
...
OD;

which is supposed to be analogous to a plain C while loop: the exit condition is checked BEFORE each loop iteration is executed.
Try to argue with that: Bochs is verified against real hardware in this case.
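
To make the while-loop reading concrete, here is a minimal C sketch of that pseudocode applied to "rep stosb". The function name `model_rep_stosb` is made up for illustration, and the model is deliberately simplified: it assumes the direction flag is clear and ignores interrupts, segmentation and address-size overrides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of "rep stosb" (not emulator code).
 * The count register is tested BEFORE each iteration, so an initial
 * count of zero stores nothing. Returns the number of bytes written. */
size_t model_rep_stosb(uint8_t *dst, uint8_t al, uint32_t *ecx)
{
    size_t written = 0;

    while (*ecx != 0) {          /* WHILE CountReg != 0 */
        dst[written++] = al;     /* one STOSB iteration */
        (*ecx)--;                /* CountReg <- CountReg - 1 */
    }
    return written;
}
```

Called with `*ecx == 0`, the loop body never runs and the buffer is left untouched, which is the behaviour Artlav reported from the real PC.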

Owen wrote:(Why is "rep ret" used? If your ret is a branch target, then K8s will utterly fail to predict it; AMD's recommended optimization is to prefix the ret with a rep. This obviously also implies that branches should disable the repetition)

The REP prefix has no effect on the ret instruction, or on many others. For example, no single-byte opcodes except the string instructions are affected by the REP prefix.
The REP prefix is used as an opcode extension for many MMX/SSE instructions; for all others it is simply ignored.
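
As a rough illustration of that point, a decoder could decide whether an F3 prefix means "repeat" with a lookup like the one below. The function name `rep_repeats_opcode` is hypothetical, and the sketch covers only the one-byte string opcodes; it says nothing about two-byte (0F-prefixed) opcodes, where F3 selects a different instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a REP prefix in front of this
 * one-byte opcode actually repeats the instruction. Only the string
 * instructions qualify; RET (0xC3) does not. */
bool rep_repeats_opcode(uint8_t opcode)
{
    switch (opcode) {
    case 0x6C: case 0x6D:   /* INS  */
    case 0x6E: case 0x6F:   /* OUTS */
    case 0xA4: case 0xA5:   /* MOVS */
    case 0xA6: case 0xA7:   /* CMPS */
    case 0xAA: case 0xAB:   /* STOS */
    case 0xAC: case 0xAD:   /* LODS */
    case 0xAE: case 0xAF:   /* SCAS */
        return true;
    default:
        return false;
    }
}
```

With this table, `rep stosb` (F3 AA) is a repeated instruction, while the C3 in `rep ret` (F3 C3) is not, so the count register is neither tested nor decremented.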

Stanislav

Posted: **Wed Aug 11, 2010 1:45 pm**

bewing wrote:Just allocate enough stack space for a local byte array and fill it at runtime. Nothing needs to be predefined.

What if there is no stack set up?
What if there's just enough place to fit a few bytes?
What if i want to write out several chars at large intervals, and then read what it spells out (keyboard input vs video driver testing)?
What, after all, if i want a simple command to write out?

bewing wrote:The XCHG REG, REG magic breakpoints are already supported, on both bochs and rebochs.

Nice.
Then it's a case of missing documentation, or of the question asker not reading it.
Either way, not a priority at this stage.

Posted: **Tue Aug 17, 2010 4:37 am**

Owen wrote:because "rep ret" is a very common optimization

Common perhaps in a very small subset of code, since it is only for a very specific branch of AMD CPUs. See here.

Why is "rep ret" used? If your ret is a branch target, then K8s will utterly fail to predict it; AMD's recommended optimization is to prefix the ret with a rep. This obviously also implies that branches should disable the repetition

REP is only defined for a small set of instructions. It doesn't do anything for a RET (not even decrease CX, which is a good thing, or else CX would always be destroyed on return from such an "optimized" function).

JAL

Posted: **Tue Aug 17, 2010 10:42 am**

jal wrote:
Owen wrote:because "rep ret" is a very common optimization
Common perhaps in a very small subset of code, since it is only for a very specific branch of AMD CPUs. See here.

It's the kind of optimization that compilers should have on when generating generic code (after all, it doesn't harm other CPUs).

Posted: **Tue Aug 17, 2010 11:35 am**

It costs more space, and it only goes wrong when you branch to the ret (in which case you can just invert the branch and add a ret at that specific location).

Pure architectural bloat.

Posted: **Tue Aug 17, 2010 12:44 pm**

Code: Select all

.. some other code ...

theRet: 
    rep ret
.. some more code...
    cmp eax, ecx
    jz theRet
    ... more code

vs

Code: Select all

.. some other code ...

theRetThatsThereAnyway: 
    ret
.. some more code...
    cmp eax, ecx
    jnz dontRet
    ret
dontRet:
    ... more code

So, it's no smaller. It also penalizes the not-returning case with a branch.
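
The size claim can be checked by writing out the bytes of just the instructions shown in the two listings. This is a sketch under assumptions: 32-bit code, short (rel8) jumps, and a placeholder displacement for `jz theRet`, since the distance back to the label depends on the elided code. The array and function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant 1: one shared "rep ret" plus a conditional jump to it. */
static const uint8_t variant_rep_ret[] = {
    0xF3, 0xC3,   /* theRet: rep ret                              */
    0x39, 0xC8,   /* cmp eax, ecx                                 */
    0x74, 0x00,   /* jz theRet (rel8; 0x00 is only a placeholder) */
};

/* Variant 2: plain ret, inverted branch, and a second ret. */
static const uint8_t variant_inverted[] = {
    0xC3,         /* theRetThatsThereAnyway: ret                  */
    0x39, 0xC8,   /* cmp eax, ecx                                 */
    0x75, 0x01,   /* jnz dontRet (skips the 1-byte ret below)     */
    0xC3,         /* ret                                          */
};

size_t size_rep_ret(void)  { return sizeof variant_rep_ret; }
size_t size_inverted(void) { return sizeof variant_inverted; }
```

Both come to 6 bytes: the extra F3 byte in the first variant is balanced by the second one-byte ret in the other.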

Oh, and if you end up doing this

Code: Select all

    cmp eax, ecx
    jnz dontRet1
    ret
dontRet1:
    cmp ebx, ecx
    jnz dontRet2
    ret
dontRet2:
    ... more code

You probably just blew out the K8 branch predictor, since you squashed 4 branches into a 16 byte block. Congratulations: The processor will now completely fail to predict the final ret branch.
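
The branch count is easy to verify by hand or with a small tally like the one below. It is only a sketch: the instruction lengths assume 32-bit code with short jumps, the sequence is assumed to start at the beginning of a 16-byte block, and the `struct insn` table and function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes and whether the instruction is a branch. */
struct insn {
    uint8_t length;
    uint8_t is_branch;
};

/* The double compare-and-return sequence from the listing above. */
static const struct insn double_check[] = {
    {2, 0},   /* cmp eax, ecx  (39 C8)   */
    {2, 1},   /* jnz dontRet1  (75 rel8) */
    {1, 1},   /* ret           (C3)      */
    {2, 0},   /* cmp ebx, ecx  (39 CB)   */
    {2, 1},   /* jnz dontRet2  (75 rel8) */
    {1, 1},   /* ret           (C3)      */
};

/* Counts branches whose first byte lies within the first `window` bytes. */
unsigned branches_in_window(const struct insn *code, size_t n, size_t window)
{
    size_t offset = 0;
    unsigned count = 0;

    for (size_t i = 0; i < n && offset < window; i++) {
        if (code[i].is_branch)
            count++;
        offset += code[i].length;
    }
    return count;
}

unsigned double_check_branches(void)
{
    return branches_in_window(double_check,
                              sizeof double_check / sizeof double_check[0],
                              16);
}
```

The six instructions occupy 10 bytes in total, so under the alignment assumption all four branches (two jnz, two ret) land in a single 16-byte block.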

Optimization at this level is non-trivial. rep ret is ugly... but it costs you just one byte, can be reused quite heavily, and is the shortest and most general option.

Posted: **Sat Aug 21, 2010 10:31 pm**

Hi,

Owen wrote:Its the kind of optimization that compilers should have on when generating generic code (after all, it doesn't harm other CPUs)

Intel are quite specific about this - "The behaviour of the REP prefix is undefined when used with non-string instructions".

The reason is that for x86 the opcode map is almost full. When Intel or AMD want to add new instructions they need to be clever to find unused opcodes; and can easily decide to use the "REP with non-string instructions" encodings for entirely new instructions. This has happened already (some encodings that used to be "REP with non-string instructions and undefined behaviour" were defined as new MMX and SSE instructions). Basically, "REP RET" should never be used (at least not without ensuring it's an AMD CPU with borked return prediction), and there's no guarantee it won't become something entirely different in future CPUs.

Cheers,

Brendan

Posted: **Sun Aug 22, 2010 7:12 am**

If it does... then Intel will be breaking a massive quantity of applications. If there is one rep-prefixed instruction they're not going to reuse, it's "rep ret".

Additionally, Intel just opened up a massive chunk of coding space for themselves (Yes, themselves; they refuse to share it) with VEX.

(Crap like that is one of the reasons I think some anti trust commission should require that development of the x86 architecture be handed over to an independent organization, modelled somewhat like Power.org)

Posted: **Sun Aug 22, 2010 8:57 am**

Owen wrote:If it does... then Intel will be breaking a massive quantity of applications.

Such as? Can you at least point to one compiler that generates this "optimization"?

Posted: **Sun Aug 22, 2010 9:47 am**

Love4Boobies wrote:
Owen wrote:If it does... then Intel will be breaking a massive quantity of applications.
Such as? Can you at least point to one compiler that generates this "optimization"?

Every single GCC release post-K8.

Posted: **Sun Aug 22, 2010 2:08 pm**

Owen wrote:Additionally, Intel just opened up a massive chunk of coding space for themselves (Yes, themselves; they refuse to share it) with VEX.

What's VEX?

Posted: **Sun Aug 22, 2010 3:52 pm**

A prefix used in Intel's new 256-bit vector instructions (AVX).

bewing's complete bochs rewrite: test request
