Hi,
I wrote an assembler a few years ago
here. I'm not quite sure what details you are seeking, but I can provide some insight.
The assembler makes two passes. The first pass determines the offsets of the labels with respect to the current segment. The second pass generates the code. This allows forward references and also allows additional error checking.
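To make the two-pass idea concrete, here is a minimal sketch in Python. It is not the code from my assembler: the three-instruction set, the sizes, the opcode bytes, and the `assemble` function are all made up for illustration, and operands are just 16-bit absolute label offsets.

```python
# Toy two-pass assembler. The instruction set below is hypothetical:
# each instruction is one opcode byte, optionally followed by a
# 16-bit little-endian label offset.
SIZES = {"nop": 1, "jmp": 3, "call": 3}          # bytes per instruction
OPCODES = {"nop": 0x01, "jmp": 0x02, "call": 0x03}

def assemble(lines):
    # Pass 1: track the running offset and record where each label lands.
    labels = {}
    offset = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            name = line[:-1]
            if name in labels:
                raise ValueError(f"duplicate label: {name}")
            labels[name] = offset
        else:
            offset += SIZES[line.split()[0]]

    # Pass 2: every label is known now, so forward references resolve.
    out = bytearray()
    for line in lines:
        line = line.strip()
        if not line or line.endswith(":"):
            continue
        parts = line.split()
        out.append(OPCODES[parts[0]])
        if len(parts) > 1:
            target = parts[1]
            if target not in labels:
                raise ValueError(f"undefined label: {target}")
            out += labels[target].to_bytes(2, "little")
    return bytes(out), labels
```

Note that `jmp end` can appear before `end:` is defined, because pass 1 has already recorded the offset by the time pass 2 emits the jump. The undefined-label check in pass 2 is an example of the extra error checking the second pass makes possible.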
Front end (Input > Parsed INSTR.)
- Think about the different components and layers and keep them separate: Scanner > Preprocessor > Parser, which generates an INSTR. Then pass that to the middle and back ends.
- The parser is recursive descent with an error token for recovery.
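As a rough sketch of the front end, here is a tiny scanner and recursive descent parser in Python. This is only one way to read "error token for recovery": a bad line produces an error record and the parser skips to the end of the line so later lines still get parsed. The grammar, the names (`scan`, `Parser`, `ParseError`, `parse`), and the dict used as the INSTR are assumptions for illustration, and the preprocessor is left out.

```python
import re

TOKEN_RE = re.compile(r"(?P<ident>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[,:\[\]+])")

class ParseError(Exception):
    pass

def scan(line):
    """Scanner: split one source line into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(line, pos)
        if not m:
            tokens.append(("error", line[pos]))  # unknown character
            pos += 1
            continue
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eol", "")

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_line(self):
        # line := mnemonic [operand {"," operand}]
        try:
            kind, mnemonic = self.next()
            if kind != "ident":
                raise ParseError(f"expected mnemonic, got {mnemonic!r}")
            operands = []
            if self.peek()[0] != "eol":
                operands.append(self.parse_operand())
                while self.peek() == ("punct", ","):
                    self.next()
                    operands.append(self.parse_operand())
            if self.peek()[0] != "eol":
                raise ParseError(f"unexpected {self.peek()[1]!r}")
            return {"mnemonic": mnemonic.lower(), "operands": operands}
        except ParseError as exc:
            # Recovery: drop the rest of the line and hand back an error record.
            self.pos = len(self.tokens)
            return {"error": str(exc)}

    def parse_operand(self):
        # operand := ident | num | "[" operand "]"
        kind, text = self.next()
        if kind == "ident":
            return ("name", text.lower())
        if kind == "num":
            return ("imm", int(text))
        if (kind, text) == ("punct", "["):
            inner = self.parse_operand()
            if self.next() != ("punct", "]"):
                raise ParseError("expected ']'")
            return ("mem", inner)
        raise ParseError(f"expected operand, got {text!r}")

def parse(source):
    return [Parser(scan(line)).parse_line()
            for line in source.splitlines() if line.strip()]
```

For example, `parse("mov ax, 5\nmov , 1\nadd bx, [si]")` gives a normal record for the first and third lines and an error record for the second, so one bad line does not stop the rest of the file from being checked.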
Middle end (Parsed INSTR > INSTR.)
- Our assembler uses a lookup table for instructions. Each entry is basically (mnemonic, opflags1, opflags2, ...).
- The middle end makes a pass over each INSTR, using OperandFlags to narrow down the flags of each operand so the instruction can be matched against the table.
- Expressions are complicated and need to support forward references, returning symbol offsets, registers, segment overrides, externals, etc. I think this was the part that was rewritten the most.
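Here is a small Python sketch of the table lookup idea. The flag names, the table entries, and the `classify` and `match` helpers are invented for illustration and are not the real table. Each operand is classified into the set of flags it could satisfy, and an entry matches when every operand shares at least one flag with the corresponding slot.

```python
from enum import IntFlag

class Op(IntFlag):
    REG8 = 1
    REG16 = 2
    IMM8 = 4
    IMM16 = 8
    MEM = 16

# (mnemonic, operand flags per slot, chosen form) -- hypothetical entries
TABLE = [
    ("mov", (Op.REG8, Op.IMM8), "mov r8, imm8"),
    ("mov", (Op.REG16, Op.IMM16), "mov r16, imm16"),
    ("mov", (Op.REG16, Op.REG16 | Op.MEM), "mov r16, r/m16"),
    ("ret", (), "ret"),
]

REGS8 = {"al", "bl", "cl", "dl", "ah", "bh", "ch", "dh"}
REGS16 = {"ax", "bx", "cx", "dx", "si", "di", "bp", "sp"}

def classify(operand):
    """Return every flag this operand could satisfy."""
    kind, value = operand
    if kind == "name":
        if value in REGS8:
            return Op.REG8
        if value in REGS16:
            return Op.REG16
        return Op.IMM16          # assumption: a bare symbol is a 16-bit address
    if kind == "imm":
        flags = Op.IMM16
        if -128 <= value <= 255:
            flags |= Op.IMM8     # small values also fit the 8-bit form
        return flags
    if kind == "mem":
        return Op.MEM
    raise ValueError(f"unknown operand kind: {kind}")

def match(mnemonic, operands):
    """Pick the first table entry whose slots accept all operands."""
    have = [classify(op) for op in operands]
    for name, want, form in TABLE:
        if name != mnemonic or len(want) != len(have):
            continue
        if all(h & w for h, w in zip(have, want)):
            return form
    raise LookupError(f"no form of {mnemonic!r} accepts these operands")
```

With this table, `match("mov", [("name", "ax"), ("imm", 5)])` skips the 8-bit entry because `ax` is not a `REG8` and lands on the 16-bit immediate form. A real table would also carry the opcode bytes and encoding details per entry.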
Back end (INSTR > x86.)
- The back end only runs on the final pass and writes each INSTR to the Target. This involves writing the instruction as object code and adding relocations and symbols as needed. For a flat binary, relocations would just be computed automatically before writing the file.
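A minimal sketch of a flat binary target in Python, to show what "relocations computed before writing the file" can look like. The `FlatBinaryTarget` class and its methods are hypothetical, and it only handles 16-bit absolute addresses. An object file target would keep the relocation and symbol records in the output instead of patching them.

```python
class FlatBinaryTarget:
    """Collects code bytes, symbols, and relocations; patches them at the end."""

    def __init__(self, origin=0):
        self.origin = origin        # load address of the first byte
        self.code = bytearray()
        self.symbols = {}           # name -> offset into self.code
        self.relocations = []       # (offset of placeholder, symbol name)

    def define(self, name):
        self.symbols[name] = len(self.code)

    def emit(self, data):
        self.code += data

    def emit_address(self, symbol):
        # Reserve two bytes and remember that they need the symbol's address.
        self.relocations.append((len(self.code), symbol))
        self.code += b"\x00\x00"

    def finish(self):
        # A flat binary has no loader, so resolve every relocation now.
        for offset, symbol in self.relocations:
            if symbol not in self.symbols:
                raise ValueError(f"unresolved symbol: {symbol}")
            address = self.origin + self.symbols[symbol]
            self.code[offset:offset + 2] = address.to_bytes(2, "little")
        return bytes(self.code)
```

Usage, assuming a 16-bit image loaded at 0x100:

```python
t = FlatBinaryTarget(origin=0x100)
t.emit(b"\xB8")          # mov ax, imm16
t.emit_address("msg")    # placeholder, patched in finish()
t.emit(b"\xC3")          # ret
t.define("msg")
t.emit(b"hi")
image = t.finish()       # B8 04 01 C3 68 69
```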