ENTER creates a stack frame and LEAVE discards it. See the Intel manuals (Intel 2A) for a detailed description.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Well, the manuals don't do a very good job of telling you why you might want to create a stack frame.
When a function gets called in C, for example, the arguments get pushed onto the stack. Then the function gets called. The stack will be used a lot more in just a moment, so many programmers think it is a good idea to save a copy of ESP at this point. The EBP register was made for doing exactly that. So you do "PUSH EBP; MOV EBP, ESP". That is called "setting up a stack frame pointer", and EBP is that pointer. That is what the ENTER instruction does -- it's pretty much a replacement for those two instructions. While the function is running, you can then change ESP as much as you want and leave it trashed, since you saved a good copy of the pointer in EBP. You can also use EBP to easily access the arguments that were pushed onto the stack. LEAVE does a "MOV ESP, EBP; POP EBP", which restores both registers.
bewing wrote:When a function gets called in C, for example, the arguments get pushed onto the stack. Then the function gets called. The stack will be used a lot more in just a moment, so many programmers think it is a good idea to save a copy of ESP at this point. The EBP register was made for doing exactly that. So you do "PUSH EBP; MOV EBP, ESP". That is called "setting up a stack frame pointer", and EBP is that pointer. That is what the ENTER instruction does -- it's pretty much a replacement for those two instructions.
Actually, up to 3 instructions ("PUSH EBP; MOV EBP, ESP; SUB ESP, <space_for_local_variables>").
Ironically, on most CPUs ENTER/LEAVE are implemented in micro-code and it's faster to use 2 or 3 smaller/simpler instructions instead, so most compilers don't use ENTER/LEAVE at all.
Also note that once you have replaced ENTER/LEAVE with the faster/simpler alternative instructions, you can optimise the assembly further: replace "MOV ESP,EBP" with "ADD ESP,<space_for_local_variables>", remove the "MOV EBP,ESP", and use ESP instead of EBP to access local variables and input parameters. That frees up EBP for normal use (and if the function never touches EBP, the "PUSH EBP"/"POP EBP" pair can go too), so you end up with smaller/faster code with no stack frame.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan wrote:Ironically, on most CPUs ENTER/LEAVE are implemented in micro-code
JamesM wrote:Every instruction is microcoded on every CPU since the 1990s.
Simple instructions are decoded directly into a small number of micro-ops (typically 1 micro-op). Complex instructions aren't, and (for the sake of over-simplifying) are a little bit like miniature subroutines stored in microcode ROM (or "microcoded") rather than actual instructions that are executed directly (quickly).
From Intel's Optimisation Reference Manual: "Assembler/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than 4 uops and require multiple cycles to decode. Use sequences of simple instructions instead.
Complex instructions may save architectural registers, but incur a penalty of 4 uops to set up parameters for the microcode ROM."
Cheers,
Brendan
Depending on where you draw the line between microcoded instructions and the rest: on AMD's Athlon series there are two distinct decoder systems, DirectPath and VectorPath. The manuals suggest that the decoding operations are hardwired into the DirectPath unit, while the VectorPath unit generates internal opcodes from an internal memory (ROM is technically not the best description).
Thing is, out-of-order execution always causes a need to save some internal state to be dispatched to the various ALU components - at this point, the distinction between microcode and storage of "simple" control signals is kind of blurred. An ancient processor like a 6502 simply grabs an instruction, loads the operands when needed, performs the ALU op, then stores the results where needed. The moment you start pipelining that, you can do the loads where possible in one cycle, the operation in the next, and the store in the third. If you do that out of order, you can just save some control signals for later use. In all cases, there is no technical difference between having a "discrete" lookup table that we label a "microcode ROM" that converts an opcode into signal batches, and having the same conversion done by a more efficient logic network that takes advantage of the similarities in instruction formatting - the net effect is, at this level, the same.
Therefore the statement "microcode rom is slow" or "microcoded instructions are slow" is, as a generalisation, wrong.
The difference is how much load is put on the so-called microcode unit. If it can always respond with the same number of operations, there's no difference. If it has to respond with a variable number of operations, then it can become a bottleneck the moment the number of operations it emits is high compared to the input. The execution engine will then see bunches of operations belonging to one instruction, and then go idle because there is no other source instruction in the queue that it could work on in parallel. And that is the situation behind the microcode myth: "complex microcoded instructions reduce the amount of independent work available to the processor, so that it can no longer do more than one thing at the same time".
Also note that, IIRC, simple LEAVEs are fast (at least as good as the simple instructions), but ENTER and complex LEAVEs are expensive and to be avoided.
(This would appear to be corroborated by GCC generating LEAVEs quite often when it doesn't elide the frame pointer altogether)
berkus wrote:Just don't forget to add you're speaking about Intel cpus, not all cpus.
JamesM works for a very successful designer of semiconductors.
This alone still doesn't mean that ALL cpus are microcoded. If you want superscalar and out-of-order, then yes, unit-specific mops make more sense, but for the microcontroller sort of cpus just dumb direct execution may be more efficient.
Prove me wrong, James.
I'd like to point out that I do not work in the processor division of said company, so my statements have as much research behind them as any of yours.
Every CPU with a pipeline requires one operation to be broken down into multiple micro-ops - LOAD, EXECUTE, WRITEBACK for a very simple system (ignoring instruction fetch because although it is a pipeline stage it obviously doesn't depend on instruction content).
As Combuster rightly mentions, to split an insn into mops, you need what is functionally equivalent to a lookup table. A request to ROM is functionally equivalent to just "a combinatorial function" - there's no constraint on how that combinatorial function is implemented. In the x86 it seems some instructions fall through to a ROM (ENTER et al) and others are special-cased for speed. This is what I would expect.
But they're all microcoded, because they all use a pipeline. Even the Cortex-M3 is pipelined, so yes, it is microcoded too.
How the microcode lookup is implemented is an implementation detail!
Why not just transform the instruction into the equivalent fast preamble?
Is it because of a Transmeta patent?