Converting MOV instructions to machine code issue

iman · Post by **iman** » Wed Jan 29, 2020 6:13 am

Hi.

I am trying to understand how to encode assembly instructions into machine code.
At the moment I'm focusing only on the MOV mnemonic.
The CPU is in 32-bit protected mode, so no REX prefix is involved (REX.W exists only in 64-bit mode).
Besides, I am not going to discuss the ModR/M, SIB, and prefix bytes in what follows.

To learn how a simple encoding is done, I worked through several test cases. They are:

MOV EAX, DWORD[0xAABBCCDD] : 8B 05 DD CC BB AA
MOV AX, WORD[0xAABBCCDD]   : 66 8B 05 DD CC BB AA
MOV AL, BYTE[0xAABBCCDD]   : 8A 05 DD CC BB AA
MOV EAX, imm32             : B8 DD CC BB AA (but why this and NOT 8A DD CC BB AA?)

To build up the encoding, I look at bit 0 of the opcode byte.
I assume that if it is clear, an 8-bit register is involved, and if it is set, a 16/32-bit register is involved.

Then I look at bit 1 of the opcode byte.
If it is set, a register is the destination; if it is clear, memory is the destination of the instruction.
Bits [2...7] should then hold the opcode itself. I found that for the MOV mnemonic it is 100010 in binary.
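
To make the bit layout described above concrete, here is a minimal Python sketch that splits an opcode byte into those three fields. The function name `split_mov_opcode` and the field names `base`, `d`, and `w` are made up for illustration. The d/w reading only makes sense for the 88..8B group; the sketch applies it blindly to B8 as well, just to show that the upper six bits are no longer 100010 there.

```python
def split_mov_opcode(opcode):
    """Split an opcode byte into the fields used by the 0x88..0x8B MOV group."""
    return {
        "base": opcode >> 2,      # bits 7..2, 0b100010 for the 88..8B group
        "d": (opcode >> 1) & 1,   # bit 1: 1 = register is the destination
        "w": opcode & 1,          # bit 0: 0 = 8-bit operand, 1 = 16/32-bit
    }

# 8B and 8A share the same upper six bits; B8 does not.
for op in (0x8B, 0x8A, 0xB8):
    f = split_mov_opcode(op)
    print(f"{op:02X}: base={f['base']:06b} d={f['d']} w={f['w']}")
```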

Based on the assumptions above, and using the proper ModR/M byte, I could correctly encode the first three instructions.
But when it comes to the 4th instruction, the bit pattern of the opcode byte no longer holds. I cannot assume that the MOV opcode is 100010.
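
For reference, the first three encodings can be reproduced with a short Python sketch. It assumes 32-bit mode and a plain 32-bit absolute address, which corresponds to a ModR/M byte with mod=00 and r/m=101 (displacement only). The `REGS` table and the function name `encode_mov_reg_from_abs` are hypothetical and cover only a handful of registers.

```python
import struct

# Register name -> (ModR/M reg number, operand size in bits).
# Illustrative subset only, not a complete register table.
REGS = {
    "al": (0, 8), "cl": (1, 8), "dl": (2, 8), "bl": (3, 8),
    "ax": (0, 16), "cx": (1, 16), "dx": (2, 16), "bx": (3, 16),
    "eax": (0, 32), "ecx": (1, 32), "edx": (2, 32), "ebx": (3, 32),
}

def encode_mov_reg_from_abs(reg, addr):
    """Encode MOV reg, [addr] for a 32-bit absolute address in 32-bit mode."""
    num, size = REGS[reg.lower()]
    prefix = b"\x66" if size == 16 else b""    # operand-size override prefix
    opcode = 0x8A if size == 8 else 0x8B       # w bit clear for 8-bit operands
    modrm = (0b00 << 6) | (num << 3) | 0b101   # mod=00, r/m=101: disp32 only
    return prefix + bytes([opcode, modrm]) + struct.pack("<I", addr)

for r in ("eax", "ax", "al"):
    print(r, encode_mov_reg_from_abs(r, 0xAABBCCDD).hex(" "))
```

The address is stored little-endian, which is why 0xAABBCCDD shows up as DD CC BB AA in the byte stream.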

Now the question:
Does this mean that to encode a MOV instruction properly, I have to keep two MOV patterns in mind? One for when no immediate is involved and one only for the immediate case?

Best regards.
Iman.

bzt · Post by **bzt** » Wed Jan 29, 2020 7:02 am

Hi,

iman wrote:Does this mean that to encode a MOV instruction properly, I have to keep two MOV patterns in mind?

A lot more patterns, actually. And there are two-byte opcode versions too. Here are a few examples from my disassembler (note that it targets long mode, not protected mode, but most of the opcodes should match):
L335 - the two byte opcode versions
L838 - opcodes starting from 0x80
L865 - opcodes starting from 0xA0
L883 - opcodes starting from 0xB0
L892 - opcodes starting from 0xB8
L907 - opcodes starting from 0xC6

For a 32-bit disassembler, I'd suggest taking a look at OpenBSD.
The Intel Manual "Instruction Set" lists all the combinations (for protected mode too) and that should be your primary source of information.

Cheers,
bzt

iman · Post by **iman** » Wed Jan 29, 2020 8:15 am

bzt wrote:A lot more patterns, actually. And there are also two byte operand code

Yes, I see that.

So the opcode for the MOV mnemonic alone turns out not to be just the binary value 100010. Right?
So if I'm on the right track, to assemble a MOV (or any other) instruction into machine code, you have to parse the operands to figure out which MOV opcode to use.
In my examples specifically, 100010 is the right opcode for the first three, while for the 4th I should have recognized that an immediate is involved, which demands a different MOV opcode.

Is the procedure I described correct?
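
The selection step described above can be sketched as a lookup on the operand kinds. This is only an illustration: the function name `pick_mov_form` and the "reg"/"mem"/"imm" labels are invented here, and the table covers just the general-purpose forms (no segment, control, or debug register MOVs, and no A0..A3 accumulator shortcuts).

```python
def pick_mov_form(dest, src):
    """Return the MOV opcode family for a (destination, source) operand pair.

    Operand kinds are the strings "reg", "mem", or "imm".
    """
    forms = {
        ("reg", "reg"): "88/89 or 8A/8B with ModR/M",
        ("reg", "mem"): "8A/8B with ModR/M",
        ("mem", "reg"): "88/89 with ModR/M",
        ("reg", "imm"): "B0+r/B8+r, or C6/C7 with ModR/M",
        ("mem", "imm"): "C6/C7 with ModR/M",
    }
    try:
        return forms[(dest, src)]
    except KeyError:
        raise ValueError(f"no MOV form for {dest} <- {src}") from None

print(pick_mov_form("reg", "mem"))   # the first three examples
print(pick_mov_form("reg", "imm"))   # the fourth example
```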

Best.
Iman.

iansjack · Post by **iansjack** » Wed Jan 29, 2020 9:09 am

As bzt said, the encodings are comprehensively described in the Intel® 64 and IA-32 Architectures Software Developer’s Manual (Vol 2, Appendixes A and B).

Candy · Post by **Candy** » Wed Jan 29, 2020 9:33 am

MOV is the *worst* opcode to start with. There are a couple of hundred different encodings for MOV alone.

MOV r32, imm32 is encoded with B[8-F], with the 8 variations selecting the register. B[0-7] are the equivalent MOV r8, imm8 for the 8-bit variants. 8B is a ModR/M encoding for MOV (MOV r32, r/m32), allowing you to move into a 32-bit register from another register or from memory; 89 is the same in the opposite direction. You can select the SIB byte variants in ModR/M to get even more options.
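
A small Python sketch of the B0+r / B8+r forms mentioned above, assuming 32-bit mode. The register lists are in hardware numbering order, so the list index is the value added to the base opcode; the function name `encode_mov_reg_imm` is hypothetical.

```python
import struct

# Index in each list = register number added to the base opcode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]

def encode_mov_reg_imm(reg, imm):
    """Encode MOV r32, imm32 (B8+r) or MOV r8, imm8 (B0+r)."""
    reg = reg.lower()
    if reg in REG32:
        opcode = 0xB8 + REG32.index(reg)
        return bytes([opcode]) + struct.pack("<I", imm & 0xFFFFFFFF)
    if reg in REG8:
        opcode = 0xB0 + REG8.index(reg)
        return bytes([opcode, imm & 0xFF])
    raise ValueError(f"unsupported register: {reg}")

print(encode_mov_reg_imm("eax", 0xAABBCCDD).hex(" "))
print(encode_mov_reg_imm("bl", 0x12).hex(" "))
```

No ModR/M byte is needed here, because the register is folded into the low three bits of the opcode itself.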

Then there are the segment register MOVs, debug register movs, control register MOVs, address size prefix (67), operand size prefix (66), REX prefix (40-4F)... and that's all still just plain MOVs.
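
As a rough illustration of how the mode affects those prefix bytes, the following Python sketch classifies a single byte. The function name `classify_prefix` and its `long_mode` flag are assumptions made for this example. The 40-4F range only acts as a REX prefix in 64-bit mode; in 32-bit protected mode those bytes are the one-byte INC/DEC r32 opcodes, so the sketch returns None for them there.

```python
def classify_prefix(byte, long_mode=False):
    """Name the prefix a byte represents, or return None if it is not one
    of the three prefixes handled by this sketch."""
    if byte == 0x66:
        return "operand-size override"
    if byte == 0x67:
        return "address-size override"
    if long_mode and 0x40 <= byte <= 0x4F:
        w = (byte >> 3) & 1          # REX.W is bit 3 of the REX byte
        return f"REX (W={w})"
    return None

print(classify_prefix(0x66))                  # operand-size override
print(classify_prefix(0x48, long_mode=True))  # REX with W set
print(classify_prefix(0x48))                  # not a prefix in 32-bit mode
```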

nullplan · Post by **nullplan** » Wed Jan 29, 2020 9:48 am

Well yeah, mov is Turing-complete (google keyword: movfuscator). In all honesty, though, just look into the Architecture Manual. The opcode encodings for MOV take up more than one page.

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Jan 29, 2020 12:38 pm

I definitely recommend the Intel manuals, as they detail the sub-fields of each instruction type¹, the opcodes within the instructions², the various addressing modes each can accept³, and the modifiers⁴ which can be applied to them.

I'd also recommend going through the instruction spreadsheet at x86asm.net, as that is in some ways better organized for quick lookups. The way it lays out the various fields side-by-side also helps with comparisons and in understanding the various field formats, though you'll really need to have read through the manuals first for most of it to make sense.

If you don't mind me asking, why are you studying the instruction binaries? For most OS devs, it really isn't necessary - it is more something which would come up in designing an assembler, usually. If it is because you are curious about them, or want to write your own toolchain, or are planning to write your OS in raw machine code (we do have a few who are doing that, so it isn't as absurd as it might sound), then all well and good, but whatever the reason, it may be relevant to whatever answers we give you since this is the sort of thing which is prone to XY problems - it does you little good if we answer a question Y, when you were asking it because you were trying to use that answer to solve a different problem X.

Thus, if you could please say more about your goals - even if they are just 'I was curious' - we may be able to give a better answer.

Footnotes

¹ There are generally a few instruction types which define the fields within the instruction - opcode, source(s), destination, modifiers - and the type determines which instruction bits the fields are located at. However, in older ISAs where there has been a lot of accretion over the years the original patterns may be blurred; this has definitely happened with x86 and x86-64, and to a lesser extent with ARM and MIPS.
² While they are usually treated as synonyms, in this instance differentiating between 'instruction' and 'opcode' is useful. Technically, in all but some of the oldest ISAs, the term 'opcode' only refers to a specific field in the instruction, not the instruction as a whole.
³ Older CISC architectures such as x86 usually have several addressing modes for arithmetic and logic instructions, whereas a newer ISA such as MIPS, ARM, or RISC-V will generally be a load/store architecture; ordinary instructions only have reg/reg and reg/imm modes as a rule, and only those instructions which require memory access will have any other addressing methods.
⁴ By 'modifiers' I mean anything which changes how the instruction is decoded and executed; e.g., instruction size markers, word-size mode prefixes, segment prefixes, LOCK and REP prefixes, etc. Some of these are considered to be part of the instruction, while others are quasi-independent of the instruction but may not be applicable to all instructions.

iman · Post by **iman** » Wed Jan 29, 2020 3:34 pm

Schol-R-LEA wrote:If you don't mind me asking, why are you studying the instruction binaries? ... if you could please say more about your goals

I'm curious to see how far I can go to implement a custom assembler for my own OS.
This is how I learn things: start by writing something simple and build up the rest step by step. If I can grasp how an assembler encodes some simple instructions, I will be motivated to continue. Apparently, starting with MOV gives quite a good foundation for encoding the other instructions.
I only recently started looking into implementing an assembler, so working with the Intel manual is still tough and will take me more time to understand.
That is why I decided to ask for some easier-to-grasp explanations.

iman · Post by **iman** » Wed Jan 29, 2020 3:37 pm

Candy wrote:MOV r32, imm32 is encoded with B[8-F], with the 8 variations selecting the register. B[0-7] are the equivalent MOV r8, imm8 for the 8-bit variants. 8B is a ModR/M ...

Yes, I see that in the Intel manual.
The assembler's parser then has to take care to select the right MOV encoding.

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Jan 29, 2020 9:45 pm

iman wrote:
Schol-R-LEA wrote:If you don't mind me asking, why are you studying the instruction binaries? ... if you could please say more about your goals
I'm curious to see how far I can go to implement a custom assembler for my own OS.
This is how I learn things: start by writing something simple and build up the rest step by step. If I can grasp how an assembler encodes some simple instructions, I will be motivated to continue. Apparently, starting with MOV gives quite a good foundation for encoding the other instructions.
I only recently started looking into implementing an assembler, so working with the Intel manual is still tough and will take me more time to understand.
That is why I decided to ask for some easier-to-grasp explanations.

Fair enough, both regarding motives and your reason in asking for help.

You may want to take a bit of time to look at other instruction set architectures before diving into the assembler proper, however, as the comparison between x86 (and x86-64, which is mostly a superset of x86 but does change some things) and, say, ARM, or MIPS, or RISC-V - or maybe even an older, simpler ISA such as 6502 (or conversely, a contemporary one such as Motorola 68000, which came out a few months after the 8086 did) - may give you more of a context for what the different terms mean, and how the instructions' fields are arranged, if you can compare it to how other designs work.

I'd also mention that the x86 is a particularly difficult ISA to understand and implement a toolchain for; it came at an awkward period in the history of microprocessors, falling between the very simple accumulator machines such as the 6502 and the M6800, and the even more complex but more regular and consistent designs such as the M68000 (which, despite the name, was not a related design to the 6800) or the Z8000 - and certainly before the re-alignment and retrenching of the design space that came with the RISC architectures in the mid-1980s. Also, the 8086 (and subsequent 8088 used on the original PC) was designed as an extension of the 8080, so the x86 ISA already had some historical baggage even before it came out.

You may find it easier to get a brief overview of a different architecture - or better still, a few different ones - so you have a bit more context to put the weirdness of the x86 into. I'd suggest MIPS, as it is a solid, if somewhat simple and abstract, RISC ISA which was partly developed for teaching computer architecture courses and thus particularly easy to understand. RISC-V might be a good choice too, being a later development of the same overall design but with a lot of extra work done in order to make it more extensible. ARM is a good choice too, and live hardware for it may be easier to come by as most of the popular single-board computers - such as the Raspberry Pi, Asus Tinkerboard, Libre Renegade, Pine Rock64, and several ODROID models - use some implementation of the ARM for their CPUs, and a suitable one can be had (including the necessary accessories, not counting a monitor or television) for under $50 US.

Or you can go old school with some retro hardware like a Commodore 64 or Apple II, if you can find one, or neo-retro with something like the Maximite or the Commander X16. The advantage of those is that the 6502 most of them use(d) is dead simple compared to any of the others, and the systems themselves (or the emulator for the upcoming system, in the case of the Commander) are simple single-tasking systems and generally have a built-in debugger for manually coding in machine language.

Still, if your main interest is in PCs, you probably won't want to linger on the other CPU designs too long. A little may go a long way.

Korona · Post by **Korona** » Thu Jan 30, 2020 1:01 pm

I don't think x86 is so horrible to write a toolchain for. Yes, the register allocation is a bit more complicated than on other archs but that's already done in the compiler. And an LLVM-style greedy register allocator can handle constraints just fine. RISC vs. CISC also seems to be a dead issue today since apparently, instruction decoders are not a bottleneck anymore (but L1 cache is).

bzt · Post by **bzt** » Thu Jan 30, 2020 1:32 pm

Korona wrote:RISC vs. CISC also seems to be a dead issue today

Yep. Also because Intel (which is supposed to be CISC) is built on top of a RISC processor (programmed by microcode), and ARM (which is supposed to be RISC) has a lot more instructions than Intel (around 1200 if you don't count different encodings such as Thumb; if you do, then above 3000). So those classifications don't mean a thing these days.

To the OP: it just occurred to me that you should also check flatassembler's source. It's written in Asm (downside), but on the bright side, it is extremely small, simply and logically structured (OS-related stuff and output format are clearly separated from the code generation) and handles all instruction encodings very well (protmode and longmode alike). It has been in development for more than twenty years now, and its code is mature and very well tested. It also has the best macro capabilities I have ever encountered (for example, using macros you can calculate checksums on the fly as it generates the code and things like that). Oh, and the main developer, Tomasz, is very active and helpful on the board; you can ask him, see this for example.

Cheers,
bzt

Schol-R-LEA · Post by **Schol-R-LEA** » Fri Jan 31, 2020 6:34 pm

bzt wrote:
Korona wrote:RISC vs. CISC also seems to be a dead issue today
Yep. Also because Intel (which is supposed to be CISC) is built on top of a RISC processor (programmed by microcode), and ARM (which is supposed to be RISC) has a lot more instructions than Intel (around 1200 if you don't count different encodings such as Thumb; if you do, then above 3000). So those classifications don't mean a thing these days.

I would argue that the terms 'RISC' and 'CISC' have always been misleading, since the size of the instruction set was never really the main issue - the instruction decoding, the difference in fixed versus variable instruction sizes and its impact on fetch operations, and (most of all) the cost of supporting memory-direct operations versus enforcing load/store discipline, were always the definitive differences.

Register file sizes were a part of it too, I suppose, though there were a number of classic CISC designs which had as many registers as most RISC designs. The designs with the really gargantuan register files managed via register windows (e.g., SPARC) were always an outlier even among RISC systems, while the persistence of the small register set in later x86 implementations (prior to the introduction of the r8-r15 registers in x86-64) made it something of an anomaly among later CISC systems, too.

As Korona points out, all of those differences are more or less moot with Out-of-Order superscalar implementations; techniques such as register renaming on large shadow register files, multi-instruction fetching and instruction caching, multi-path predictive execution, and hardware-level instruction reordering, separation, and merging operations, all eliminate the main arguments in favor of RISC, save for their generally lower TDP and energy draw (and even that is mostly lost if the implementation applies all of the same superscalar techniques that the x86 does).

However, in regards to understanding the instruction encoding and targeting them for an assembler, compiler, debugger, etc, I would argue that the lack of orthogonality and regularity in the x86 ISA specifically (not CISC in general, since ISAs such as M68K were far better in this regard than some RISCs) makes it less than ideal to learn these things on. Once you understand it, yeah, it isn't nearly as bad as I tend to make it out to be, but the learning curve is there.

I can't speak for everyone, but for me, knowing how MIPS does things made it a lot easier for me to go back and figure out aspects of x86 which I found quite puzzling when I initially studied it (though part of that too may have been that, IME, the third-party documentation for MIPS is often a lot easier to understand than that for x86, despite there being an order of magnitude more docs available describing x86 - or maybe because of that, I dunno). YMMV.

EDIT: To be more specific, the MIPS third-party docs I had in mind were MIPS Assembly Language Programming by Britton, See MIPS Run by Sweetman, and the third edition of Computer Organization and Design by Patterson and Hennessy (which now also has RISC-V and ARM editions, but curiously enough no x86 one). The first of those was used as the main text for an assembly course, as the target ISA reference for a compiler design course, and as a reference again together with the third in two Computer Arch courses in which we implemented a subset of MIPS I in a circuit emulator called LogicWorks. I read the second one out of my own interest much later. All three of them worked really well for my purposes, IMAO.

iman · Post by **iman** » Sat Feb 01, 2020 1:06 pm

bzt wrote:it popped into my mind, you should also check flatassembler's source. It's written in Asm (downside), but on the bright side, it is extremely small, simply and logically structured

Thanks for your suggestion. I had a look at the source code of the flatassembler. There's a lot that one can learn from.

OSDev.org
