Calling-Conventions

ErikVikinger · Post by **ErikVikinger** » Thu Jan 14, 2010 2:04 am

Hello,

I am trying to develop the calling conventions for a new platform, and I have run into some problems.

1. I want to pass as many parameters as possible in registers (I planned 32 registers for this), but some functions have too many parameters, and then I must either push them on the stack (top of stack, like cdecl) or create a memory area (mostly on the stack too, but inside the caller's stack frame) and give the called function a pointer to this memory area (like FastCall). I think the second solution is inferior because it needs an additional parameter (the pointer) that uses an additional register, and accessing the pointed-to parameters can sometimes cost additional address calculations. In the first solution all parameters are directly accessible, either in a register or through an SP-relative memory access. I would prefer the first solution, but the second solution is used by the Linux kernel for its SYSCALLs, and it could be (I think so) that these people know what they are doing.
Are there any pro/contra arguments for these solutions? Is there any other/better way?

2. What about the this-pointer for classes? I have not found relevant information about this calling type, except thisCall from MS, and that is not a solution. My idea is to reserve a register specially for this use; if methods of one class call each other, this is a benefit, because this register does not need to be modified.
Is it a good idea? Is there any other/better way?

3. I want to pass small structs directly in registers for call-by-value (call-by-reference always uses a pointer). It could be a big benefit for types such as div_t or complex.
What do you think about this? Should I always use a pointer (to a copy of the original, made by the caller) for non-fundamental types?

Thanks for your help!
Erik

--
sorry for me terrible english, my favorite languages are assembly and VHDL (followed by german)

NickJohnson · Post by **NickJohnson** » Thu Jan 14, 2010 6:40 am

ErikVikinger wrote:I want to pass as many parameters as possible in registers (I planned 32 registers for this)

What architecture are you using? x86 has only 8 or 16 "general purpose" registers depending on the mode.

gravaera · Post by **gravaera** » Thu Jan 14, 2010 7:00 am

ErikVikinger wrote:Hello, I'm trying to develop a Calling-Convention for a new platform, but I'm having some Problems.

What job do you do? With what company? If the architecture isn't still under development, and you're able to release details, what kind of market is it supposed to target? Embedded? How successful do you think it would be? Do you think it would be useful for an OSDever to produce a port for it? Are the manuals out?

Sorry for the storm of counter questions, but I'm hoping to help you out, too

1. I want to pass as many parameters as possible in registers (I planned 32 registers for this), but some functions have too many parameters, and then I must either push them on the stack (top of stack, like cdecl) or create a memory area (mostly on the stack, but inside the caller's stack frame) and give the called function a pointer to this memory area (like FastCall).

This sounds like an interesting assignment. The most important thing to do when designing a calling convention seems to be ensuring that the convention is predictable and well documented. While I'm not a System Engineer, you may find it useful to look at the ABI specifications for architectures like ARM, etc.

I've not read that widely on this issue, but even on architectures with lots of registers, it seems like they rarely use more than 8 registers for procedure linkage. On ARM, for example, only r0 - r3 are used as parameter holders. The rest goes on the stack, or for floating point arguments, the FPU regs. PowerPC is almost the same, but it uses (IIRC) 8 registers for normal parameters.

AFAICT, rarely ever do you see procedures which take 32 arguments. Of course you stated that sometimes you run out of registers, so it seems like you know that the people whom this architecture is targeting like to program with huge sets of arguments.

I think the second solution is inferior because it needs an additional parameter (the pointer) that uses an additional register, and accessing the pointed-to parameters can sometimes cost additional address calculations. In the first solution all parameters are directly accessible, either in a register or through an SP-relative memory access. I would prefer the first solution, but the second solution is used by the Linux kernel for its SYSCALLs, and it could be (I think so) that these people know what they are doing.
Are there any pro/contra arguments for these solutions? Is there any other/better way?

I myself don't know about the linux thing, but for a register load, the computation isn't normally that taxing on the processor. It's usually the memory access itself that is considered the bottleneck. Technically, an architecture that has slow address calculation is probably not very well designed...

And RISCs are usually designed to have very fast addressing mechanisms, and other perks while loading registers since they favour loads and use of registers. They try to allow information to be moved into registers as fast as they can to compensate for their lack of all those fancy things CISCs like x86 have, such as quad-pumped buses, and whatnot. So I don't think that address calculation is going to be too much of an issue.

2. What about the this-pointer for classes? I have not found relevant information about this calling type, except thisCall from MS, and that is not a solution. My idea is to reserve a register specially for this use; if methods of one class call each other, this is a benefit, because this register does not need to be modified.
Is it a good idea? Is there any other/better way?

Although 'thiscall' is not in any way standardised, it would be best if you ensured that whether you reserve a register for it, or have it pushed on the stack, that it should be placed into the called procedure's frame as if it was an invisible first argument.

3. I want to pass small structs directly in registers for call-by-value (call-by-reference always uses a pointer). It could be a big benefit for types such as div_t or complex.
What do you think about this? Should I always use a pointer (to a copy of the original, made by the caller) for non-fundamental types?

One thing I have never liked about this thing with the 'pass small structs by value in registers' is that they never tell you how small the struct must be to qualify as a 'small struct'. The other thing is when you try to test it, they still pass it on the stack anyway. It never really made sense to me.

Anyway, I'm sure someone who's programming concurrently for multiple architectures can help a lot more.

--Best of luck; you must have a really interesting job,
gravaera

ErikVikinger · Post by **ErikVikinger** » Thu Jan 14, 2010 8:42 am

Hello,

gravaera wrote:If the architecture isn't still under development

Yes. It is in a late specification phase; there is a simulator with part of the instruction set and a first set of (not yet working) VHDL code.

gravaera wrote:and you're able to release details

Sorry, not today. It will be a 32-bit and/or 64-bit platform. Once I have a working release (VHDL and simulator) with a first alpha-state OS, the project will go public.

gravaera wrote:Are the manuals out?

I am writing the platform specification at this time. I want to have as good a specification as possible, with all relevant parts, before I start developing software.

gravaera wrote:ensuring that the convention is predictable and well documented.

ACK! I want to have one calling convention that fits all cases (with maximum performance).

gravaera wrote:I've not read that widely on this issue, but even on architectures with lots of registers, it seems like they rarely use more than 8 registers for procedure linkage. On ARM, for example, only r0 - r3 are used as parameter holders. The rest goes on the stack, or for floating point arguments, the FPU regs. PowerPC is almost the same, but it uses (IIRC) 8 registers for normal parameters.

My CPU has a base register file with 64 registers: 60 general purpose and 4 special purpose (flags, link-register, stack-pointer and instruction-pointer). I think I can use up to 32 registers for procedure linkage (keep in mind that most functions will not use all of them); otherwise I have no problem with 24 (or 16, but not fewer) registers.

gravaera wrote:AFAICT, rarely ever do you see procedures which take 32 arguments.

Not in real-life code. This big number of registers is caused by question 3 of my last post.

gravaera wrote:It's usually the memory access itself that is considered the bottleneck. Technically, an architecture that has slow address calculation is probably not very well designed...

ACK for both. My problem is the additional register for the additional hidden argument in the FastCall solution. I want to avoid it.

gravaera wrote:it would be best if you ensured that whether you reserve a register for it, or have it pushed on the stack, that it should be placed into the called procedure's frame as if it was an invisible first argument.

Okay. If the register range for parameter passing goes from R4 up to R35, then the this-pointer should be in R4 and the normal parameters start at R5. Is that what you mean?

gravaera wrote:never tell you how small the struct must be to qualify as a 'small struct'.

I think a good limit is a size (in the normal memory layout of the struct) of 8 registers, which means 32 bytes on the 32-bit CPU version and 64 bytes on the 64-bit CPU version. That is enough to benefit small mathematical functions.
Classes with a non-trivial copy constructor must always be copied.

gravaera wrote:It never really made sense to me.

Why?

gravaera wrote:Anyway, I'm sure someone who's programming concurrently for multiple architectures can help a lot more.

I am looking forward to any interesting answer or suggestion.

Thanks
Erik

Combuster · Post by **Combuster** » Thu Jan 14, 2010 9:39 am

Well, a calling convention not only specifies how you pass arguments to the function, but also from the function - in fact, the calling convention specifies all hardware preconditions and postconditions around a function call, so you should not only specify what are the inputs, but also what registers should contain return values, and what registers are preserved across calls (that may include parts of accessible control registers, like the FPU control, and some parts of a flags register like x86's D bit - apply where appropriate)

The idea of preserved registers is that you can store locals, and that other functions that do not need to use a large register file can stay off a fixed set, to save register-intensive functions the time pushing and popping values off the stack.
A few of those registers are preserved on all architectures - stackpointer, link pointer, and instruction pointer. From what I've seen, the balance between argument registers, scratch registers, and preserved registers have always been around a third each with register calling conventions, and fifty-fifty for stack-passing conventions, which is probably a decent balance between register-heavy and small register file functions.

In some cases, you want to have some registers take on globals for value - I know an architecture with 256 registers (which btw includes all i/o registers and various predefined constants like all ones/zeroes, so you have much less general purpose registers), but large spaces may allow you to use a portion of the register space as (frequently used) globals. If you don't have PC-relative addressing, you may want to specify one predefined register as the base register. Similarly, you could use a register as the this pointer. In any case, you should take care that you don't waste too much on things people might not use.

And while we are discussing structs, you may want to define how structures are passed as return values.

Owen · Post by **Owen** » Thu Jan 14, 2010 9:46 am

My policy with structures would be that each member is passed in registers, up to X registers. For example, for the struct

long long z;
int x;
short y;
char b;

You might allocate:
R4: z.high
R5: z.low
R6: x
R7: y
R8: b

I say this on the basis that if you simply load the structure into registers in the same format that it is in memory, things are going to get very messy (Perhaps this is why ARM & co always align members on word boundaries?)

fronty · Post by **fronty** » Thu Jan 14, 2010 10:28 am

gravaera wrote:all those useless things CISCs like x86 have

Fixed that for you.

Load-store architecture with register renaming and nice calling convention via it is just perfection.

ErikVikinger · Post by **ErikVikinger** » Thu Jan 14, 2010 10:49 am

Hello,

Combuster wrote:Well, a calling convention not only specifies how you pass arguments to the function, but also from the function - in fact, the calling convention specifies all hardware preconditions and postconditions around a function call, so you should not only specify what are the inputs, but also what registers should contain return values, and what registers are preserved across calls

Yes, I know.

Combuster wrote:(that may include parts of accessible control registers, like the FPU control, and some parts of a flags register like x86's D bit - apply where appropriate)

On my CPU only the base registers R0...R63 are accessible (directly and indirectly) to user-mode software. The special registers, including the flags, are also located in the base register file (in the non-general-purpose sections R0...R3 and R56...R63), like on ARM. R4...R55 are free for use by user-mode software.

Combuster wrote:From what I've seen, the balance between argument registers, scratch registers, and preserved registers have always been around a third each with register calling conventions, and fifty-fifty for stack-passing conventions, which is probably a decent balance between register-heavy and small register file functions.

Thank you for this tip.

Combuster wrote:In any case, you should take care that you don't waste too much on things people might not use.

Yes, I am careful. I am here for a discussion about my ideas.

Combuster wrote:And while we are discussing structs, you may want to define how structures are passed as return values.

In the same way. Small structs are passed in registers, and for larger structs the caller must give a pointer (as a hidden parameter) to the callee.

My present idea is:
(from R4 to R55)
- this-pointer : only if it is a class method; if not, no register is wasted
- return value(s) : if there is no return value, no register is wasted; if the return value is a big struct, a pointer from the caller goes here
- parameters : up to RX
- unused registers between R4..RX are caller-save
- RX+1..R55 are callee-save
Now I need a good value for X.

Owen wrote:You might allocate:
R4: z.high
R5: z.low
R6: x
R7: y
R8: b

Yes, this is my idea, except I will use little endian.

Erik

Owen · Post by **Owen** » Thu Jan 14, 2010 11:10 am

Why allocate the return value registers at call time? I (In the architecture I'm designing) say that the callee is permitted to trash the parameter values in registers (And, indeed, the parameter passing registers are a subset of the caller save registers). Therefore, the return value also uses the same registers, and is allocated starting at r1 in the same way as a parameter of the same type would be (Exception being when returning non-register-passable structs, in which case the function gets a hidden parameter)

Combuster · Post by **Combuster** » Thu Jan 14, 2010 12:25 pm

- unused registers between R4..RX are caller-save
- RX+1..R55 are callee-save

If you used up all arguments, you'd get serious trouble in moving things around - you might want to compute something, which needs a few registers, but you can't because you already need to preserve all registers (either as argument or callee-saved). Which is most likely why you have a subset of registers that is neither saved nor argument - to create breathing space and avoid the huge register pressure and consequent stalls.

You don't want to have no callee-saved registers either - you'd lose your stack and everything. Nor do you want to do without register arguments, for the obvious reason. That is probably where the third-a-third-a-third ratio comes from. While the exact boundaries can be arbitrary without further detail, it happens that when either share tends to zero, your effort tends toward infinity.

ErikVikinger · Post by **ErikVikinger** » Thu Jan 14, 2010 2:53 pm

Hello,

Combuster wrote:If you used up all arguments, you'd get serious trouble in moving things around - you might want to compute something, which needs a few registers, but you can't because you already need to preserve all registers (either as argument or callee-saved).

I cannot see the problem. The push/pop instructions (for saving the callee-save registers in the function prologue/epilogue) can work without needing any register other than the SP register (R62 in my CPU). If the callee needs any (or all) of the registers inside the "working range" RX+1...R55, it can save/restore all of them, or a part, without modifying any other register. Okay, this saving needs stack space and memory access time, but in the extremely rare circumstance of a function with the maximum number of register-based parameters these costs are IMHO acceptable.

Combuster wrote:Which is most likely why you have a subset of registers that is neither saved nor argument - to create breathing space and avoid the huge register pressure and consequent stalls.

For this purpose the register ranges R0...R4 and R56...R59 are always caller-saved scratch registers (only R60...R63 are hardware-defined special purpose, and also caller-save). I'm sorry, I forgot this information in my last post. Pipeline stalls in the CPU are reduced by the cache when saving (with write allocation) and restoring (if the cache is big enough).

Combuster wrote:You don't want to have no callee-saved registers either - you'd lose your stack and everything. Nor do you want to do without register arguments, for the obvious reason. That is probably where the third-a-third-a-third ratio comes from. While the exact boundaries can be arbitrary without further detail, it happens that when either share tends to zero, your effort tends toward infinity.

I do not understand you. Please explain it to me.

Owen wrote:Why allocate the return value registers at call time?

Why not?

Owen wrote:I (In the architecture I'm designing) say that the callee is permitted to trash the parameter values in registers (And, indeed, the parameter passing registers are a subset of the caller save registers).

Me too, except for the this-pointer, if present, in R4 (in front of the return value and the parameters). I think the this-pointer should be callee-save.

Owen wrote:... the return value also uses the same registers ...

Why? I think it can be a problem if you use the overwritten parameters to calculate/generate the return value.
An instruction that destroys its input values cannot simply be restarted in the case of an exception. During its processing it must check for any exception that could be raised before it can write its results (and destroy its inputs). I prefer non-destructive instructions.

Regards
Erik

--
i hope my english is good enough for this forum

gravaera · Post by **gravaera** » Thu Jan 14, 2010 3:41 pm

ErikVikinger wrote:Hello,

gravaera wrote:it would be best if you ensured that whether you reserve a register for it, or have it pushed on the stack, that it should be placed into the called procedure's frame as if it was an invisible first argument.
Okay. If the register range for parameter passing goes from R4 up to R35, then the this-pointer should be in R4 and the normal parameters start at R5. Is that what you mean?

Yup. Pretty much.

gravaera wrote:It never really made sense to me.
Why?

For the same reasons I outlined in my first post: people are never really clear about how they implement it. Anyway, you seem to know what you're doing, so good luck

--All the best
gravaera

Owen · Post by **Owen** » Thu Jan 14, 2010 4:39 pm

ErikVikinger wrote:
Owen wrote:Why allocate the return value registers at call time?
Why not?

Because then they occupy registers which would be better used for storing parameters

Owen wrote:I (In the architecture I'm designing) say that the callee is permitted to trash the parameter values in registers (And, indeed, the parameter passing registers are a subset of the caller save registers).
Me too, except for the this-pointer, if present, in R4 (in front of the return value and the parameters). I think the this-pointer should be callee-save.

I just have the this parameter behave as a hidden first parameter to a function. This has a few advantages:

- C style functions can masquerade as class functions (This kind of thing is used a lot for dynamic bridging of C++ to dynamic languages)
- This becomes a valid argument for the va_start C builtin

Additionally, classes call their own members relatively rarely; allowing trashing of the register means that code has an extra scratch register. In fact, since most calls are inter-class, you're going to end up wasting more instructions reloading your this register than you would if it was trashable.

Owen wrote:... the return value also uses the same registers ...
Why? I think it can be a problem if you use the overwritten parameters to calculate/generate the return value.
An instruction that destroys its input values cannot simply be restarted in the case of an exception. During its processing it must check for any exception that could be raised before it can write its results (and destroy its inputs). I prefer non-destructive instructions.

What kind of exception are we talking about here - processor or high level language?

In the first case, the function will be restarted right at the instruction that was interrupted. As interrupts/exceptions can (normally!) only be taken at instruction boundaries, this isn't a problem. If an exception is taken outside an instruction boundary, then you hit major problems and this should only really occur when there is no way this is a transient problem.

In the second case, then exceptions can be triggered from anywhere and the on-exception code path tends to diverge significantly from the no-exception one anyway. Additionally, the unwind logic for exceptions tends to be horrendously slow; any optimizations you make will be a drop in the ocean.

You shouldn't have to worry about register pressure as much as me though; I have 29 fewer integer general purpose registers than you (I have a 32 entry general purpose register file; r0 is a hard zero/bitbucket; things like the stack pointer are in a separate special function register file, also 32-entry but not fully populated)

On a tangent, I suspect our instruction encodings look radically different - I don't know about yours, but mine is pretty esoteric; it has 4 operand fields, conditional flag updates (CMP, for example, is a SUB with "NZ NC" specified, i.e don't update zero or carry), and full predication (Each instruction can be conditionally executed). Oh, and those 4 operand fields? They get used, for example in ADD r1, r2, 3, r4, which means "r4 = r1 + r2 + 3", among other instructions.

i hope my english is good enough for this forum

You're coming across quite clearly

js · Post by **js** » Thu Jan 14, 2010 5:24 pm

Just my two cents, but you might consider splitting your 64 registers in say, four groups, and allow flipping through them while providing efficient push group / pop group (to/from the stack, or maybe a specialized cache).

The "active" group (used for parameters) is the first one, so when you call a function, just flip the group to which you wrote parameters / expect return values to position one.

This way, for functions with few parameters (less than one group), you get 4 levels of calls for free without requiring the caller / callee to save some registers, and once you get past that limit, it's just a matter of pushing/popping the whole group, like this:

func_1:
use parameters in group 1
set parameters in group 2
flip 1, 2
call func_2
flip 1, 3
call func_3
; now our own parameters are in group 2, func_2's return values are in group 3, and func_3's return values are in group 1.
; if we need to save group n°x :
pushgrp x
; and restore it
popgrp x

Some FPUs have similar register banks (if you want to know which, I can tell you once I borrow the library's big assembly book again).

Also, you could consider a pushpop instruction, which pops a register and pushes its current value on the stack (exchanging its value with the one on the stack). It would be very useful when you run short of registers and want to dump one temporarily (although I doubt you'll run short of registers with 64 of them...), or when you need to load a parameter from the stack and don't want to lose the register's contents.

ErikVikinger · Post by **ErikVikinger** » Fri Jan 15, 2010 1:14 pm

Hello,

js wrote:Just my two cents, but you might consider splitting your 64 registers in say, four groups, and allow flipping through them .... <--- snip a cool idea ---->

It is a really cool idea, but I have LDRM/STRM for load/store of multiple registers (PUSHM/POPM for saving/restoring to/from the stack) and MOVM to move multiple registers inside the register file (MOVM R36,R10,10 copies the 10 registers R10...R19 into R36...R45; MOVM also works correctly with overlapping ranges, can never raise an exception during its execution, and following instructions do not need to wait for it to finish). I hope these instructions are powerful enough.

js wrote:(although I doubt you'll run short of registers with 64 of them...)

I hope so. This is the reason for this big base register file. I want good performance without the need for register renaming.

Owen wrote:
ErikVikinger wrote:
Owen wrote:Why allocate the return value registers at call time?
Why not?
Because then they occupy registers which would be better used for storing parameters

The return value uses at most 8 registers, so the maximum number of parameter registers is reduced from 32 (R4...R35) to 24 (R12...R35). I think this is not a big problem. It is enough for

complex_t calc_foo(complex_t a, complex_t b, double additional_param);

Owen wrote:I just have the this parameter behave as a hidden first parameter to a function.

Where exactly is the this-pointer placed?

Owen wrote:This has a few advantages:

- C style functions can masquerade as class functions (This kind of thing is used a lot for dynamic bridging of C++ to dynamic languages)
- This becomes a valid argument for the va_start C builtin

Please describe this in a little more detail.

Owen wrote:Additionally, classes call their own members relatively rarely;

Are you sure? I have looked into vector.h and there are many method calls through this.
I will try to find out.

Owen wrote:What kind of exception are we talking about here - processor or high level language?

Processor exceptions, for example on memory access (a multiple-register memory read that crosses a page boundary) or during mathematical calculations. If such instructions are destructive, they must be careful and must detect all possible exceptions before they write their results into the registers. Or you have register renaming and can allocate new virtual registers for the results; I will not implement register renaming in my CPU. Non-destructive instructions can raise an exception and, after it has been handled, simply be restarted. High-level language exceptions are not my problem; the compiler can do this job, if needed.

Owen wrote:I suspect our instruction encodings look radically different - I don't know about yours, but mine is pretty esoteric;

Yes, our instructions really look totally different, but mine are esoteric too. I have 4 complete flag-sets, and every instruction that modifies flags must specify which one is used; this avoids a single point of dependency and helps with loop unrolling. I do not have indirect accesses: each usage of a register or flag-set must be specified in the assembly instruction. A PUSHM.F2:Z R10,10 is expanded to STRMIA.W,F2:Z [R62!],R10,10 which means:
- STRM = store multiple
- I = increment address
- A = after each register
- W = word size (32 bit) for all registers, so the increment of R62 is 4 after each register
- F2 = execution depends on flag-set 2
- Z = the instruction is executed if Z is set (in flag-set 2)
- R62 is the stack-pointer
- ! = the new value of R62 after all memory writes is written back to R62
- R10 is the first register that is written to memory
- 10 = ten registers, R10...R19, are written
So if the instruction is executed, the registers R10...R19 (40 bytes) are written to [R62] and R62 += 40. Another example is ADD.H,F1:GE R7,R6,R5,F0 which means: if the instruction is executed (the GE condition in flag-set 1 is true), it does a half-word (16 bit) based R7 = R6 + R5 and sets the flags in flag-set 0 for the 16 bit result.

Owen wrote:
i hope my english is good enough for this forum
You're coming across quite clearly

We will see.

gravaera wrote:Anyway, you seem to know what you're doing,

I really really hope you are right!

Thanks
Erik

OSDev.org
