Writing an Assembler

WIASUOM
Posts: 2
Joined: Fri Feb 07, 2025 2:28 pm
Libera.chat IRC: WIASUOM

Writing an Assembler

Post by WIASUOM »

Hi osdev community, I'm opening this thread to get your opinions. As a side project, I want to write an x86 assembler in C. I have already done the lexical analysis. The most recent thread I found on this topic is from 2011, so I wanted to ask for your opinions and recommendations. I also have a question:

Is there a simple assembler written in C that I could study?
I mean, I could look at NASM too, but it is too complex. All I want is to convert my assembly code to a flat binary. Maybe I will add more complex things in the future, but since it is nothing serious right now, I don't really care.

This is also my first post on osdev, so please don't be harsh on me if I haven't adapted to this forum yet :lol:
Demindiro
Member
Posts: 110
Joined: Fri Jun 11, 2021 6:02 am
Libera.chat IRC: demindiro
Location: Belgium

Re: Writing an Assembler

Post by Demindiro »

I've written a (very incomplete) JIT compiler for x86-64 in Rust. Some tips:

- Look at the instructions in octal, not hexadecimal. The original x86 ISA was designed in an era where octal was very common.
- Make lots and lots of small functions for handling each case. It'll save your sanity.
- If you don't intend to be compatible with other assemblers, I suggest inventing a new set of mnemonics that map more explicitly to their corresponding instructions.
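
To make the octal tip concrete, here is a minimal C sketch. The names (`ModRM`, `modrm_split`, `AluOp`, `alu_rm32_r32_opcode`) are made up for illustration and do not come from any existing assembler. It assumes 32-bit operand size. The point is that a ModRM byte written in octal reads directly as its three fields, and that the eight classic ALU instructions share one opcode pattern that is easy to see in octal.

```c
#include <stdint.h>

/* Split a ModRM byte into its three fields. Written in octal, the byte
   reads directly as the digits mod, reg, rm (for example 0xC8 == 0310). */
typedef struct { uint8_t mod, reg, rm; } ModRM;

ModRM modrm_split(uint8_t b) {
    ModRM m = { (uint8_t)(b >> 6), (uint8_t)((b >> 3) & 7), (uint8_t)(b & 7) };
    return m;
}

/* The eight classic ALU operations, in the order their opcodes use. */
enum AluOp { ALU_ADD, ALU_OR, ALU_ADC, ALU_SBB, ALU_AND, ALU_SUB, ALU_XOR, ALU_CMP };

/* Opcode of "op r/m32, r32": 0X1 in octal, where X selects the operation. */
uint8_t alu_rm32_r32_opcode(enum AluOp op) {
    return (uint8_t)((op << 3) | 01);
}
```

For instance, `alu_rm32_r32_opcode(ALU_OR)` gives 011 octal, which is the 0x09 used for OR, and `ALU_AND` gives 041 octal, which is 0x21.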

As for what I mean by small functions, stuff like this:

Code:

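// Shared helper for 32-bit register-to-register instructions
// that differ only in their opcode byte.
fn op_32rr(op: u8, dst: GpReg, src: GpReg) -> Instr {
    *Instr::new().rr32_rex(dst, src).push(op).rr32_arg(dst, src)
}
// 0x89: MOV r/m32, r32
pub fn mov32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x89, dst, src)
}
// 0x09: OR r/m32, r32
pub fn or32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x09, dst, src)
}
// 0x21: AND r/m32, r32
pub fn and32rr(dst: GpReg, src: GpReg) -> Instr {
    op_32rr(0x21, dst, src)
}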
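
Since the original question is about an assembler written in C, the same "many small functions" pattern might look roughly like this in C. This is only a sketch under stated assumptions: the `Instr` and `Reg32` types are hypothetical, and it only handles registers 0-7 in register-direct form, so it skips the REX prefix that the Rust helper's `rr32_rex` step presumably takes care of.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration. Registers 0-7 only, so no REX prefix. */
enum Reg32 { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

typedef struct { uint8_t bytes[15]; size_t len; } Instr;

/* Shared helper: opcode followed by a register-direct ModRM byte (mod = 3).
   For "op r/m32, r32" the source goes in the reg field, the destination in rm. */
static Instr op_32rr(uint8_t opcode, enum Reg32 dst, enum Reg32 src) {
    Instr i = { {0}, 0 };
    i.bytes[i.len++] = opcode;
    i.bytes[i.len++] = (uint8_t)(0300 | (src << 3) | dst);
    return i;
}

/* One tiny function per instruction form; each only supplies its opcode. */
Instr mov32rr(enum Reg32 dst, enum Reg32 src) { return op_32rr(0x89, dst, src); }
Instr or32rr(enum Reg32 dst, enum Reg32 src)  { return op_32rr(0x09, dst, src); }
Instr and32rr(enum Reg32 dst, enum Reg32 src) { return op_32rr(0x21, dst, src); }
```

With this, `mov32rr(EAX, ECX)` produces the two bytes 89 C8, the usual encoding of `mov eax, ecx`.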
alexfru
Member
Posts: 1113
Joined: Tue Mar 04, 2014 5:27 am

Re: Writing an Assembler

Post by alexfru »

Look for assemblers for simpler CPUs, e.g. i8080/8085, Z80, i8051. They have fewer instructions and simpler memory addressing in instructions and yet all have variable-length instructions like the x86.
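
To give a feel for how small such an encoder can be, here is a hedged C sketch for two Intel 8080 instructions. The function and enum names are invented for this example. MOV is a single byte and MVI is two bytes, so even this tiny subset already shows variable-length instructions.

```c
#include <stdint.h>

/* Intel 8080 register codes as they appear in opcode fields
   (M means memory addressed through HL). */
enum Reg8080 { R_B, R_C, R_D, R_E, R_H, R_L, R_M, R_A };

/* MOV dst, src is one byte: 01 ddd sss.
   Note: the MOV M, M slot (0x76) is actually HLT, so callers should reject it. */
uint8_t i8080_mov(enum Reg8080 dst, enum Reg8080 src) {
    return (uint8_t)(0100 | (dst << 3) | src);
}

/* MVI dst, imm8 is two bytes: 00 ddd 110, then the immediate.
   Writes the bytes into out and returns the instruction length. */
int i8080_mvi(enum Reg8080 dst, uint8_t imm, uint8_t out[2]) {
    out[0] = (uint8_t)((dst << 3) | 06);
    out[1] = imm;
    return 2;
}
```

For example, `i8080_mov(R_A, R_B)` yields 0x78 (MOV A,B), and `i8080_mvi(R_A, 0x12, buf)` fills `buf` with 3E 12.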
WIASUOM
Posts: 2
Joined: Fri Feb 07, 2025 2:28 pm
Libera.chat IRC: WIASUOM

Re: Writing an Assembler

Post by WIASUOM »

Demindiro wrote:
- Look at the instructions in octal, not hexadecimal. The original x86 ISA was designed in an era where octal was very common.
- Make lots and lots of small functions for handling each case. It'll save your sanity.
- If you don't intend to be compatible with other assemblers, I suggest inventing a new set of mnemonics that map more explicitly to their corresponding instructions.
Good to know, I am sure that will help me out! :D
alexfru wrote: Fri Feb 07, 2025 9:15 pm Look for assemblers for simpler CPUs, e.g. i8080/8085, Z80, i8051. They have fewer instructions and simpler memory addressing in instructions and yet all have variable-length instructions like the x86.
OK, I will look into it. Thanks for the advice! :D
neon
Member
Posts: 1568
Joined: Sun Feb 18, 2007 7:28 pm

Re: Writing an Assembler

Post by neon »

Hi,

I wrote an assembler a few years ago (here). I'm not quite sure what details you are looking for, but I can provide some insight.

The assembler makes two passes. The first pass determines the offsets of the labels relative to the current segment. The second pass generates the code. This allows forward references and also allows additional error checking.
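
The two-pass idea can be sketched in C on a toy input. Everything here is an assumption made for illustration, not the design of the assembler described in this post: the "program" is a list of items that are either a label, a 1-byte NOP, or a 5-byte `JMP rel32` (32-bit code assumed), and the names `Item`, `Symbol`, `pass1` and `pass2` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy input: a label definition, a 1-byte NOP, or a 5-byte JMP rel32. */
typedef enum { ITEM_LABEL, ITEM_NOP, ITEM_JMP } ItemKind;
typedef struct { ItemKind kind; const char *name; } Item;  /* name: label or jump target */
typedef struct { const char *name; uint32_t offset; } Symbol;

static size_t item_size(const Item *it) {
    return it->kind == ITEM_NOP ? 1 : it->kind == ITEM_JMP ? 5 : 0;
}

/* Pass 1: record the offset of every label. Returns the symbol count. */
size_t pass1(const Item *items, size_t n, Symbol *syms) {
    size_t nsyms = 0;
    uint32_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        if (items[i].kind == ITEM_LABEL) {
            syms[nsyms].name = items[i].name;
            syms[nsyms].offset = offset;
            nsyms++;
        }
        offset += (uint32_t)item_size(&items[i]);
    }
    return nsyms;
}

/* Pass 2: emit bytes. All labels are known now, so forward jumps resolve.
   Returns the number of bytes written, or -1 on an undefined label. */
long pass2(const Item *items, size_t n, const Symbol *syms, size_t nsyms, uint8_t *out) {
    uint32_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        if (items[i].kind == ITEM_NOP) {
            out[offset] = 0x90;
        } else if (items[i].kind == ITEM_JMP) {
            size_t s = 0;
            while (s < nsyms && strcmp(syms[s].name, items[i].name) != 0) s++;
            if (s == nsyms) return -1;                     /* extra error check */
            uint32_t rel = syms[s].offset - (offset + 5);  /* relative to next instruction */
            out[offset] = 0xE9;
            for (int b = 0; b < 4; b++)                    /* little-endian rel32 */
                out[offset + 1 + b] = (uint8_t)(rel >> (8 * b));
        }
        offset += (uint32_t)item_size(&items[i]);
    }
    return (long)offset;
}
```

Assembling `jmp end; nop; end: nop` this way, pass 1 places `end` at offset 6 and pass 2 emits E9 01 00 00 00 90 90. A real assembler also has to deal with instructions whose size depends on the distance to a label, which this sketch sidesteps by using fixed sizes.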

Front end (Input > Parsed INSTR.)
- Keep the different components and layers separate: Scanner > Preprocessor > Parser, which generates an INSTR. That INSTR is then passed to the middle and back ends.
- The parser is recursive descent with an error token for recovery.
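
As a rough illustration of recursive descent with simple error recovery, here is a C sketch that parses one arithmetic expression per line. The grammar, the `Parser` struct and the function names are assumptions for this example only. It works on characters rather than scanner tokens to stay short, and it uses an error flag plus "skip to end of line" as a stand-in for a real error token.

```c
#include <ctype.h>

/* Grammar (an assumption for illustration):
     expr := term (('+' | '-') term)*
     term := NUMBER | '(' expr ')'                 */
typedef struct { const char *p; int error; } Parser;

static void skip_ws(Parser *ps) { while (*ps->p == ' ' || *ps->p == '\t') ps->p++; }

static long parse_expr(Parser *ps);

static long parse_term(Parser *ps) {
    skip_ws(ps);
    if (isdigit((unsigned char)*ps->p)) {
        long v = 0;
        while (isdigit((unsigned char)*ps->p)) v = v * 10 + (*ps->p++ - '0');
        return v;
    }
    if (*ps->p == '(') {
        ps->p++;
        long v = parse_expr(ps);
        skip_ws(ps);
        if (*ps->p == ')') ps->p++; else ps->error = 1;
        return v;
    }
    ps->error = 1;   /* unexpected input: flag it and let the caller recover */
    return 0;
}

static long parse_expr(Parser *ps) {
    long v = parse_term(ps);
    for (;;) {
        skip_ws(ps);
        if (*ps->p == '+')      { ps->p++; v += parse_term(ps); }
        else if (*ps->p == '-') { ps->p++; v -= parse_term(ps); }
        else return v;
    }
}

/* Parse one line. On error, skip to the end of the line so parsing can
   resume with the next one. Returns 0 on success, -1 on error. */
int parse_line(const char **src, long *value) {
    Parser ps = { *src, 0 };
    *value = parse_expr(&ps);
    skip_ws(&ps);
    if (*ps.p != '\n' && *ps.p != '\0') ps.error = 1;
    while (*ps.p != '\n' && *ps.p != '\0') ps.p++;   /* recovery: resync at line end */
    if (*ps.p == '\n') ps.p++;
    *src = ps.p;
    return ps.error ? -1 : 0;
}
```

Resynchronizing at the end of the line is a natural choice for assembly, since each statement normally sits on its own line, so one bad line does not hide errors in the lines after it.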

Middle end (Parsed INSTR > INSTR.)
- Our assembler uses a lookup table for instructions. Each entry is basically (mnemonic, opflags1, opflags2, ...).
- The middle end goes through a pass over INSTR with OperandFlags to narrow down the flags for matching the instruction.
- Expressions are complicated and need to support forward references, returning symbol offsets, registers, segment overrides, externals, etc. I think this was the part that was rewritten the most.
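
A table-driven matcher along these lines could look like the following C sketch. The flag names, the `InstrEntry` layout and the four table rows are assumptions chosen for illustration; the actual table format of the assembler described here is not shown in the post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Operand class flags (hypothetical; a real table needs many more). */
enum { OP_NONE = 0, OP_REG32 = 1 << 0, OP_MEM32 = 1 << 1, OP_IMM32 = 1 << 2,
       OP_RM32 = OP_REG32 | OP_MEM32 };

typedef struct {
    const char *mnemonic;
    unsigned op1, op2;      /* operand classes each slot accepts */
    uint8_t opcode;
} InstrEntry;

static const InstrEntry table[] = {
    { "mov", OP_RM32,  OP_REG32, 0x89 },   /* MOV r/m32, r32 */
    { "mov", OP_REG32, OP_RM32,  0x8B },   /* MOV r32, r/m32 */
    { "mov", OP_REG32, OP_IMM32, 0xB8 },   /* MOV r32, imm32 (register added to opcode) */
    { "ret", OP_NONE,  OP_NONE,  0xC3 },   /* RET */
};

/* An operand of class `have` fits a slot accepting `want` if its flag is
   set there; a missing operand only fits an empty slot. */
static int fits(unsigned have, unsigned want) {
    return have == OP_NONE ? want == OP_NONE : (have & want) != 0;
}

/* Return the first entry matching the mnemonic and operand classes, or NULL. */
const InstrEntry *find_instr(const char *mnemonic, unsigned op1, unsigned op2) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        const InstrEntry *e = &table[i];
        if (strcmp(e->mnemonic, mnemonic) == 0 && fits(op1, e->op1) && fits(op2, e->op2))
            return e;
    }
    return NULL;
}
```

Here `find_instr("mov", OP_REG32, OP_MEM32)` selects the 0x8B row, while `find_instr("mov", OP_MEM32, OP_MEM32)` returns NULL, which is how an invalid operand combination turns into an error message.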

Back end (INSTR > x86.)
- The back end only runs on the final pass and writes INSTR to the Target. This involves writing the instruction as object code and adding relocations and symbols as needed. For a flat binary, relocations would just be computed automatically before writing the file.
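
For the flat binary case, resolving relocations before writing the file could be sketched in C like this. The `Reloc` layout and `apply_relocs` are hypothetical, and the sketch assumes 32-bit absolute addresses and a known load address (`origin`), in the spirit of an `org` directive.

```c
#include <stddef.h>
#include <stdint.h>

/* A relocation: "the 32-bit field at `offset` should hold the absolute
   address of the symbol located at `symbol_offset`". Hypothetical layout. */
typedef struct { uint32_t offset; uint32_t symbol_offset; } Reloc;

/* Patch every relocation in place, assuming the image is loaded at `origin`
   (for example 0x7C00 for a boot sector). Returns 0 on success, or -1 if a
   field would fall outside the image. */
int apply_relocs(uint8_t *image, size_t size, uint32_t origin,
                 const Reloc *relocs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (size < 4 || relocs[i].offset > size - 4) return -1;
        uint32_t addr = origin + relocs[i].symbol_offset;
        for (int b = 0; b < 4; b++)                 /* little-endian store */
            image[relocs[i].offset + b] = (uint8_t)(addr >> (8 * b));
    }
    return 0;
}
```

After this step the buffer contains only final addresses and can be written to disk as is. An object-file target would instead keep the relocation records and write them out for the linker.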