Writing an Assembler

WIASUOM · Post by **WIASUOM** » Fri Feb 07, 2025 3:02 pm

Hi osdev community, I'm opening this post to get your opinions. As a side project, I want to write an x86 assembler in C. I have already done the lexical analysis. Since the most recent post I found is from 2011, I wanted to ask for your opinions and recommendations, and I also have a question:

Is there any simple assembler written in C that I can study?
I mean, I could study NASM too, but it is too complex. All I want is to convert my assembly code to a flat binary. Maybe I will add more complex things in the future, but since it is nothing serious right now, I don't really care.

This is also my first post on osdev, so please don't be harsh on me if I haven't adapted to this forum yet.

Demindiro · Post by **Demindiro** » Fri Feb 07, 2025 5:47 pm

I've written a (very incomplete) JIT compiler for x86-64 in Rust. Some tips:

- Look at the instructions in octal, not hexadecimal. The original x86 ISA was designed in an era where octal was very common.
- Make lots and lots of small functions for handling each case. It'll save your sanity.
- If you don't intend to be compatible with other assemblers, I suggest inventing a new set of mnemonics that map more explicitly to their corresponding instructions.
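To illustrate the octal tip, here is a minimal sketch (the helper names `octal_fields`, `ALU_OPS`, and `alu_mnemonic` are made up for this example). A ModRM byte splits into its mod, reg, and r/m fields exactly along octal digit boundaries, and the eight classic ALU opcodes in the range 0x00..=0x3F are selected by the middle octal digit:

```rust
/// Split a byte into its three octal digits: (bits 7-6, bits 5-3, bits 2-0).
/// For a ModRM byte these are exactly the mod, reg, and r/m fields.
pub fn octal_fields(byte: u8) -> (u8, u8, u8) {
    (byte >> 6, (byte >> 3) & 0o7, byte & 0o7)
}

/// The eight classic ALU operations, indexed by the middle octal digit
/// of their opcodes (0o0X0 through 0o0X5).
pub const ALU_OPS: [&str; 8] = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"];

/// Return the ALU mnemonic for a one-byte opcode, or `None` if the opcode
/// is not one of the 0o0X0..=0o0X5 ALU forms.
pub fn alu_mnemonic(opcode: u8) -> Option<&'static str> {
    let (hi, mid, lo) = octal_fields(opcode);
    if hi == 0 && lo <= 5 {
        Some(ALU_OPS[mid as usize])
    } else {
        None
    }
}
```

For example, `0x09` is `0o011` (the `or` row) and `0x21` is `0o041` (the `and` row), while the ModRM byte `0xD8` is `0o330`: mod 3 (register direct), reg 3 (ebx), r/m 0 (eax).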

As for what I mean with small functions: stuff like this:

```rust
fn op_32rr(op: u8, dst: GpReg, src: GpReg) -> Instr {
    *Instr::new().rr32_rex(dst, src).push(op).rr32_arg(dst, src)
}
pub fn mov32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x89, dst, src)
}
pub fn or32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x09, dst, src)
}
pub fn and32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x21, dst, src)
}
```
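The snippet above depends on `Instr` and `GpReg` types that are not shown in the post. For readers who want something they can compile, here is a self-contained sketch of the same idea. It is not the original code: `GpReg` as a plain register number and the `Vec<u8>` return type are assumptions made for illustration, and the REX handling assumes 64-bit mode.

```rust
/// Hypothetical register type: the value is the 4-bit hardware register number.
#[derive(Clone, Copy)]
pub struct GpReg(pub u8);

pub const EAX: GpReg = GpReg(0);
pub const EBX: GpReg = GpReg(3);
pub const R8D: GpReg = GpReg(8);

/// Encode `op r/m32, r32` with both operands in registers (ModRM mod = 0b11).
fn op_32rr(op: u8, dst: GpReg, src: GpReg) -> Vec<u8> {
    let mut bytes = Vec::new();
    // REX prefix is 0b0100WRXB: R extends ModRM.reg (src), B extends ModRM.rm (dst).
    let rex = 0x40 | ((src.0 >> 3) << 2) | (dst.0 >> 3);
    if rex != 0x40 {
        // Only emitted when one of r8d..r15d is involved.
        bytes.push(rex);
    }
    bytes.push(op);
    // ModRM in octal: 3 (register direct), then src, then dst.
    bytes.push(0o300 | ((src.0 & 7) << 3) | (dst.0 & 7));
    bytes
}

pub fn mov32rr(dst: GpReg, src: GpReg) -> Vec<u8> {
    op_32rr(0x89, dst, src) // MOV r/m32, r32
}

pub fn or32rr(dst: GpReg, src: GpReg) -> Vec<u8> {
    op_32rr(0x09, dst, src) // OR r/m32, r32
}

pub fn and32rr(dst: GpReg, src: GpReg) -> Vec<u8> {
    op_32rr(0x21, dst, src) // AND r/m32, r32
}
```

With these definitions, `mov32rr(EAX, EBX)` gives the bytes `89 D8` and `mov32rr(EAX, R8D)` gives `44 89 C0`.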

alexfru · Post by **alexfru** » Fri Feb 07, 2025 9:15 pm

Look for assemblers for simpler CPUs, e.g. i8080/8085, Z80, i8051. They have fewer instructions and simpler memory addressing in instructions and yet all have variable-length instructions like the x86.
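To give a feel for how small such an encoder can be, here is a rough sketch for three Intel 8080 instructions. The `Reg8` enum and the function names are invented for this example; the encodings are the standard 8080 ones. The three instructions are one, two, and three bytes long, so the variable-length aspect is already there.

```rust
/// Intel 8080 register codes (3 bits each); `M` means "memory at address HL".
#[derive(Clone, Copy)]
pub enum Reg8 {
    B = 0,
    C = 1,
    D = 2,
    E = 3,
    H = 4,
    L = 5,
    M = 6,
    A = 7,
}

/// MOV dst, src: one byte, 0o1DS in octal.
/// (The slot for MOV M, M is actually HLT; this sketch does not reject it.)
pub fn mov(dst: Reg8, src: Reg8) -> Vec<u8> {
    vec![0o100 | ((dst as u8) << 3) | (src as u8)]
}

/// MVI dst, imm8: two bytes, 0o0D6 followed by the immediate.
pub fn mvi(dst: Reg8, imm: u8) -> Vec<u8> {
    vec![0o006 | ((dst as u8) << 3), imm]
}

/// JMP addr16: three bytes, 0o303 followed by the address, low byte first.
pub fn jmp(addr: u16) -> Vec<u8> {
    let [lo, hi] = addr.to_le_bytes();
    vec![0o303, lo, hi]
}
```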

WIASUOM · Post by **WIASUOM** » Sat Feb 08, 2025 12:01 am

Demindiro wrote: ↑Fri Feb 07, 2025 5:47 pm

- Look at the instructions in octal, not hexadecimal. The original x86 ISA was designed in an era where octal was very common.
- Make lots and lots of small functions for handling each case. It'll save your sanity.
- If you don't intend to be compatible with other assemblers, I suggest inventing a new set of mnemonics that map more explicitly to their corresponding instructions.

Good to know, I am sure that will help me out!

alexfru wrote: ↑Fri Feb 07, 2025 9:15 pm Look for assemblers for simpler CPUs, e.g. i8080/8085, Z80, i8051. They have fewer instructions and simpler memory addressing in instructions and yet all have variable-length instructions like the x86.

OK, I will look into it. Thanks for the advice!

neon · Post by **neon** » Tue Feb 11, 2025 9:32 pm

Hi,

I wrote an assembler a few years ago here. I'm not quite sure what details you are seeking, but I can provide some insight.

The assembler goes through 2 passes. The first pass determines the offsets of the labels with respect to the current segment. The second pass generates the code. This allows forward references and further allows additional error checking.
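As a rough illustration of the two-pass idea (not the actual assembler's code), the sketch below handles a toy instruction set: `nop` and a near `jmp` with a 32-bit relative displacement. The `Line` enum and `assemble` function are made up for this example, and it assumes every instruction has a fixed size, which sidesteps the harder problem of choosing between short and near jumps.

```rust
use std::collections::HashMap;

/// A toy input line. A real assembler would get this from its parser.
pub enum Line {
    Label(&'static str),
    Nop,               // 0x90, 1 byte
    Jmp(&'static str), // 0xE9 rel32, 5 bytes
}

/// Size in bytes of the code a line produces (labels produce none).
fn size_of(line: &Line) -> usize {
    match line {
        Line::Label(_) => 0,
        Line::Nop => 1,
        Line::Jmp(_) => 5,
    }
}

pub fn assemble(lines: &[Line]) -> Result<Vec<u8>, String> {
    // Pass 1: walk the program and record the offset of every label.
    let mut labels = HashMap::new();
    let mut offset = 0usize;
    for line in lines {
        if let Line::Label(name) = line {
            if labels.insert(*name, offset).is_some() {
                return Err(format!("duplicate label `{}`", name));
            }
        }
        offset += size_of(line);
    }

    // Pass 2: emit code. Every label is known now, including forward ones.
    let mut out = Vec::new();
    for line in lines {
        match line {
            Line::Label(_) => {}
            Line::Nop => out.push(0x90),
            Line::Jmp(name) => {
                let target = *labels
                    .get(name)
                    .ok_or_else(|| format!("undefined label `{}`", name))?;
                // rel32 is measured from the end of the 5-byte jump.
                let next = out.len() + 5;
                let rel = target as i64 - next as i64;
                out.push(0xE9);
                out.extend_from_slice(&(rel as i32).to_le_bytes());
            }
        }
    }
    Ok(out)
}
```

A forward jump such as `jmp end; nop; end: nop` assembles to `E9 01 00 00 00 90 90`, and a jump to an undefined label is reported as an error in the second pass.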

Front end (Input > Parsed INSTR.)
- Consider different components and layers and keep them separate. Scanner > Preprocessor > Parser to generate an INSTR. Then pass that to the middle and back ends.
- The parser is recursive descent with an error token for recovery.
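Here is one possible reading of the scanner and parser layering, as a minimal sketch with no preprocessor. All names (`Token`, `scan`, `Parser`, `Instr`, `Operand`) are invented for the example. The scanner emits an error token for characters it does not understand, and the recursive descent parser recovers from any error by skipping to the end of the line, so one bad line does not hide errors on later lines.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token { Ident(String), Number(i64), Comma, Newline, Error(char) }

/// Scanner: unknown characters become `Token::Error` instead of aborting.
pub fn scan(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_ascii_alphanumeric() {
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_ascii_alphanumeric() { break; }
                word.push(d);
                chars.next();
            }
            // Decimal numbers only in this sketch; anything else is an identifier.
            tokens.push(match word.parse::<i64>() {
                Ok(n) => Token::Number(n),
                Err(_) => Token::Ident(word),
            });
            continue;
        }
        chars.next();
        match c {
            '\n' => tokens.push(Token::Newline),
            ',' => tokens.push(Token::Comma),
            c if c.is_whitespace() => {}
            other => tokens.push(Token::Error(other)),
        }
    }
    tokens
}

#[derive(Debug, PartialEq)]
pub enum Operand { Reg(String), Imm(i64) }

#[derive(Debug, PartialEq)]
pub struct Instr { pub mnemonic: String, pub operands: Vec<Operand> }

pub struct Parser { tokens: Vec<Token>, pos: usize, pub errors: Vec<String> }

impl Parser {
    pub fn new(tokens: Vec<Token>) -> Self { Parser { tokens, pos: 0, errors: Vec::new() } }
    fn peek(&self) -> Option<&Token> { self.tokens.get(self.pos) }
    fn at_line_end(&self) -> bool { matches!(self.peek(), None | Some(Token::Newline)) }

    /// program := { line Newline }
    pub fn parse_program(&mut self) -> Vec<Instr> {
        let mut instrs = Vec::new();
        while let Some(tok) = self.peek() {
            if *tok == Token::Newline { self.pos += 1; continue; }
            match self.parse_line() {
                Ok(instr) => instrs.push(instr),
                Err(msg) => {
                    // Recovery: record the error and resynchronize at the line end.
                    self.errors.push(msg);
                    while !self.at_line_end() { self.pos += 1; }
                }
            }
        }
        instrs
    }

    /// line := Ident [ operand { Comma operand } ]
    fn parse_line(&mut self) -> Result<Instr, String> {
        let mnemonic = match self.peek() {
            Some(Token::Ident(name)) => name.clone(),
            other => return Err(format!("expected mnemonic, found {:?}", other)),
        };
        self.pos += 1;
        let mut operands = Vec::new();
        if !self.at_line_end() {
            operands.push(self.parse_operand()?);
            while self.peek() == Some(&Token::Comma) {
                self.pos += 1;
                operands.push(self.parse_operand()?);
            }
        }
        if !self.at_line_end() {
            return Err(format!("unexpected token {:?}", self.peek()));
        }
        Ok(Instr { mnemonic, operands })
    }

    /// operand := Ident | Number
    fn parse_operand(&mut self) -> Result<Operand, String> {
        let operand = match self.peek() {
            Some(Token::Ident(name)) => Operand::Reg(name.clone()),
            Some(Token::Number(n)) => Operand::Imm(*n),
            other => return Err(format!("expected operand, found {:?}", other)),
        };
        self.pos += 1;
        Ok(operand)
    }
}
```

Feeding it `mov eax, 42`, then a broken line such as `add eax, , ebx`, then `nop` yields two parsed instructions and one recorded error.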

Middle end (Parsed INSTR > INSTR.)
- Our assembler uses a lookup table for instructions. Each entry is basically (mnemonic, opflags1, opflags2, ...).
- The middle end goes through a pass over INSTR with OperandFlags to narrow down the flags for matching the instruction.
- Expressions are complicated and need to support forward references, returning symbol offsets, registers, segment overrides, externals, etc. This was, I think, the part that was rewritten the most.
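A table-driven matcher along these lines could look like the sketch below. The flag names, the `Entry` layout, and the `find` function are assumptions for illustration and are not taken from the actual assembler. Operand kinds are bit flags, so a table entry can accept several kinds (`RM32` is register or memory) and a parsed operand can advertise several kinds (a small immediate fits both `IMM8` and `IMM32`).

```rust
/// Operand classes as bit flags.
pub const REG32: u8 = 1 << 0;
pub const MEM32: u8 = 1 << 1;
pub const IMM8: u8 = 1 << 2;
pub const IMM32: u8 = 1 << 3;
pub const RM32: u8 = REG32 | MEM32;

/// One instruction form: (mnemonic, opflags1, opflags2, opcode).
/// Opcode extensions such as /0 are left out to keep the sketch short.
pub struct Entry {
    pub mnemonic: &'static str,
    pub op1: u8,
    pub op2: u8,
    pub opcode: u8,
}

/// Entries are tried in order, so shorter encodings are listed first.
pub static TABLE: &[Entry] = &[
    Entry { mnemonic: "mov", op1: RM32, op2: REG32, opcode: 0x89 },  // MOV r/m32, r32
    Entry { mnemonic: "mov", op1: REG32, op2: MEM32, opcode: 0x8B }, // MOV r32, r/m32
    Entry { mnemonic: "mov", op1: REG32, op2: IMM32, opcode: 0xB8 }, // MOV r32, imm32 (B8+rd)
    Entry { mnemonic: "add", op1: RM32, op2: REG32, opcode: 0x01 },  // ADD r/m32, r32
    Entry { mnemonic: "add", op1: RM32, op2: IMM8, opcode: 0x83 },   // ADD r/m32, imm8
    Entry { mnemonic: "add", op1: RM32, op2: IMM32, opcode: 0x81 },  // ADD r/m32, imm32
];

/// Find the first entry whose mnemonic matches and whose operand flags
/// overlap with the flags of the parsed operands.
pub fn find(mnemonic: &str, op1: u8, op2: u8) -> Option<&'static Entry> {
    TABLE
        .iter()
        .find(|e| e.mnemonic == mnemonic && (e.op1 & op1) != 0 && (e.op2 & op2) != 0)
}
```

For instance, `find("add", REG32, IMM8 | IMM32)` picks the `0x83` form because it comes first, while `find("add", REG32, IMM32)` falls through to `0x81`.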

Back end (INSTR > x86.)
- The back end only runs on the final pass and writes the INSTR to the Target. This involves writing the instruction as object code and adding relocations and symbols as needed. For a flat binary, relocations would simply be resolved automatically before the file is written.
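For the flat binary case, resolving relocations before writing the file might look roughly like this. `Reloc`, `link_flat`, and the `origin` parameter are hypothetical names, and the sketch only handles absolute 32-bit little-endian fix-ups.

```rust
use std::collections::HashMap;

/// A pending fix-up: "write the absolute 32-bit address of `symbol` at `offset`".
pub struct Reloc {
    pub offset: usize,
    pub symbol: &'static str,
}

/// Resolve every relocation in place, assuming the flat binary will be
/// loaded at address `origin` and `symbols` maps names to offsets in `code`.
pub fn link_flat(
    code: &mut [u8],
    relocs: &[Reloc],
    symbols: &HashMap<&str, u32>,
    origin: u32,
) -> Result<(), String> {
    for r in relocs {
        let sym_offset = symbols
            .get(r.symbol)
            .ok_or_else(|| format!("undefined symbol `{}`", r.symbol))?;
        let addr = origin.wrapping_add(*sym_offset);
        // Overwrite the 4-byte placeholder with the final address.
        let slot = code
            .get_mut(r.offset..r.offset + 4)
            .ok_or_else(|| format!("relocation at {} is out of range", r.offset))?;
        slot.copy_from_slice(&addr.to_le_bytes());
    }
    Ok(())
}
```

For example, with the code bytes `B8 00 00 00 00` (`mov eax, imm32` with a placeholder), a relocation at offset 1 for a symbol at offset 0x10, and an origin of 0x7C00, the patched bytes are `B8 10 7C 00 00`. An object file target would instead keep the relocation records and write them out alongside the code.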
