OSDev.org

Posted: **Sat Jul 26, 2014 12:37 pm**

I have split my assembler, Assiah, into a separate project from Thelema proper, and have made a separate Github repo for it. The intention is to have a (mostly) table-driven assembler, which should eventually support x86, x86-64 (which at this time I am considering a distinct ISA for design purposes), ARM, and MIPS, at the very least. I am focusing the majority of my work on the support for x86, for the simple reason that it and x86-64 are the most complicated ISAs I expect to support, and I expect that I might paint myself into a corner if I start with the easier ISAs; by tackling x86 now, I can anticipate the majority of the elaborations needed for the others.

However, I could use some assistance, or at the very least criticism and advice. The current design, which is written in R6RS Scheme, calls for a separate data file and register definition file for each of the supported iterations of the ISA - that is to say, there is a pair of files for the 8086, another for the 80186, the 80286, etc. The register file format consists of a series of lists, each containing a list of the register name(s) followed by the bit width of the register in bytes (this does mean that architectures which aren't aligned on multiples of bytes would be unsupported, but how many such architectures are still in use?) . For example, the file 'i8086.regs' reads as follows:

Code: Select all

'(("AX" "Accumulator") 2)
'(("AH" "Accumulator-Upper-Half") 1)
'(("AL" "Accumulator-Lower-Half") 1)
'(("BX" "Base-Register" "Index") 2)
'(("BH" "Base-Upper-Half" "Index-Upper-Half") 1)
'(("BL" "Base-Lower-Half" "Index-Lower-Half") 1)
'(("CX" "Counter") 2)
'(("CH" "Counter-Upper-Half") 1)
'(("CL" "Counter-Lower-Half") 1)
'(("DX" "Data-Register") 2)
'(("DH" "Data-Upper-Half") 1)
'(("DL" "Data-Lower-Half") 1)
'(("DI" "Dest-Index") 2)
'(("SI" "Source-Index") 2) 
'(("BP" "Base-Pointer" "Stack-Frame-Pointer") 2)
'(("SP" "Stack") 2)
'(("IP" "Instruction-Pointer") 2)
'(("FLAGS") 2)
'(("CS" "Code-Segment") 2)
'(("DS" "Data-Segment") 2)
'(("SS" "Stack-Segment") 2)
'(("ES" "Extra-Segment") 2)

The allowance for register and instruction aliases is meant to allow for more readable code, though it also handles to case of operations with multiple mnemonics (e.g., the JZ and JE instructions on the x86) quite handily.

The format for instructions a good deal more complex; it again consists of a series of lists, each with three outer fields, the list of mnemonics, the list of the possible fields of the opcode, and a text description of the instruction. The fields section is itself a list consisting (in the case of the x86 ISA) of the four common opcode prefixes, the primary opcode (which I will describe shortly), the MOD-R/M type, an enum indicating whether the field can accept the LOCK prefix (and under what circumstances), and an enum indicating the minimum security ring of the instruction. The opcode field is itself broken down into the size of the opcode field, the opcode itself, and the sub-fields which select various conditions or states (it gets complicated - really complicated). A few examples would be:

Code: Select all

'(("AAA" "ASCII-Adjust-After-Addition") 
  ((NONE NONE NONE NONE (8 #x37 (NONE)) NONE NO RING-3))
  "ASCII Adjust AL After Addition")

'(("ADD")
  ((NONE NONE NONE NONE (6 #x00 (D W)) reg REG-DEST-ONLY RING-3)
   (NONE NONE NONE NONE (7 #x04 (W)) NONE NO RING-3)
   (NONE NONE NONE NONE (4 #x80 (S W)) 0 ALLOWED RING-3))
  "Add")

'(("ADDPD")
  (#x66 #x0F NONE NONE (8 #x58 (NONE) reg NO RING-3))
  "Add Packed Double-FP Values")

'(("BOUND" "Check-Array-Bounds")
   (('NONE 'NONE '(8 #x62 (D)) 'NONE 'reg INT-FLAG))
   "Check Array Index Against Bounds")

'(("CMOVB" "CMOVNAE" "CMOVC") 
  ((NONE #x0F NONE NONE (5 #x40 (B)) reg NO RING-3))
  "Conditional Move - below/not above or equal/carry (CF=1)")

As complex as this is, I am not sure that I have captured enough information about this ISA to make a truly table-driven assembler. I have repeatedly had to expand upon it already to handle edge cases I hadn't foreseen, and the absurd, irrational complexity of the x86 design and difficulty of following the manuals and the various (often contradictory) web pages documenting it means I could easily miss something important. Even now, I am uncertain enough about how to represent the multiplicity of argument formats that I am have not tried to add that information to the data files; I am hoping I won't need to do so explicitly.

Thus, against my better judgment, I am asking the good folks at this forum for three things: first, a review of the code and data, and of the data formats, to see if there is something I a have overlooked; two, advice on how best to represent the needed data; and third, assistance in entering the volumes of instruction data I am trying to cope with (most of my information is coming from http://ref.x86asm.net/, an excellent if somewhat opaque reference page to whom I am greatly indebted). I have barely scratched the surface at this point, and the outrageous number of instructions and variants thereof are threatening to drive me crazier than I already am. Can anyone give me some good advice on this matter?

Posted: **Sun Jul 27, 2014 2:09 am**

I am focusing the majority of my work on the support for x86, for the simple reason that it and x86-64 are the most complicated ISAs I expect to support, and I expect that I might paint myself into a corner if I start with the easier ISAs;

I've always wondered if it wouldn't be easier to turn language definitions into simple logical operations (forgive me if I got the shifts/bitfield/values wrong):

Code: Select all

template GPR32 : u8 = {"EAX" -> 0, "EDX" -> 1, "ECX" -> 2, "EBX" -> 3}
template RM32 : u8 = {GPR32@reg -> (reg << 3 + 0xC0), 
                      "[" GPR32@reg "]" -> (reg << 3 + 0x00)}
"ADD" RM32@rm "," GPR32@reg -> [0x00, rm + reg]
"ADD" GPR32@reg "," RM32@rm -> [0x01, rm + reg]

The thing is mostly that having one huge struct that necessarily contains all possible information is bound to give issues for instruction sets which don't want to map to the scheme you designed for x86 - as it might look complicated, but there are complications in various other forms of assembly languages that don't look like anything you find in x86 and you more likely want to be provably sure you can handle any arbitrary case.

Posted: **Sun Jul 27, 2014 6:58 am**

Combuster wrote:I've always wondered if it wouldn't be easier to turn language definitions into simple logical operations

It is possible to define program semantics like that, though it isn't usually done for implementing language translators, and if you can come up with a feasible method you would be a candidate for a Turing Award. What you're describing sounds like a combination of axiomatic and operational semantics, with some aspects of attribute grammars perhaps. Axiomatic semantics are usually applied to theorem proving and program correctness, rather than translation; denotational semantics (including action semantics) are more often used for that purpose, though no really good notation for automating the process has ever come out of it.

It is a very important area, at least in research, but most undergraduate courses these days are too bogged down in covering the details of the languages being used to get to much theory.

<soapbox>
One of the reasons I favor Scheme as a basis for teaching programming is because the language is simple enough (even taking into account the expansions in R6RS, though the libraries can be a bit tough slogging) that it can be taught to a complete novice in a couple of hours, leaving plenty of room for learning how to apply it. I still think that Structure and Interpretation of Computer Programs is an excellent textbook, though it takes an extraordinary instructor to really make it work - the authors went a little too far in paring down what is already a very small language, and it can be difficult to connect the highly abstract textbook with actual programming practice.
</soapbox>

Posted: **Sun Jul 27, 2014 7:50 am**

Combuster wrote:The thing is mostly that having one huge struct that necessarily contains all possible information is bound to give issues for instruction sets which don't want to map to the scheme you designed for x86 - as it might look complicated, but there are complications in various other forms of assembly languages that don't look like anything you find in x86 and you more likely want to be provably sure you can handle any arbitrary case.

Valid point, and one I am acutely aware of, and quite concerned about.

I designed the record types such that the types for the specific instruction sets are sub-types of an abstract parent that would be the basis for the assembler - in theory. For example, the parent record types for instructions is:

Code: Select all

(define-record-type (instruction make-instruction instruction?)
   (fields mnemonic opcode-format-list))
 
 (define-record-type (opcode-format make-opcode-format opcode-format?)
   (fields opcode opcode-sub-field-map)))
 
 (define-record-type (opcode-sub-fields make-opcode-sub-fields opcode-sub-fields?)
   (fields size bit-index))

The concrete forms are found in the x86.ss library file, which defines the format(s) specific to x86 instructions.

Unfortunately, record types (which were only introduced to the official library in R6RS) aren't classes, and don't support a lot of things one would expect coming from an OO background, such as function polymorphism (or generics in the Common Lisp sense, which is more or less the same thing). There are ways to handle this, that have the same basic effect, though they are more explicit and require some extra coding.

The point is, I am trying to keep the assembler's core more or less abstract, and use dispatch tables and such-like to direct the code through the concrete parts.

Posted: **Sun Jul 27, 2014 8:24 am**

What you're describing sounds like a combination of axiomatic and operational semantics

It's still a "simple" assembler, and unlike those terms seem to suggest, there's no formal verificaton involved at all here. Basically it defines a bunch of lexical to data structures conversions, as well as operations on how to map the resulting structure to an array of bytes. The assembler would just try the parse rules in order (because there are multiple encoding options for many instructions), convert the parse tree into an operation tree, then evaluate it to get the bytes. The assembler can easily make complaints in the order of "expected 'foo', found 'bar' near 'bar foo' " because it has a list of rules that indicate what could have been parsed next.

Basically, the definition has become a program written in a (turing-incomplete) functional programming language designed purely for notational efficiency, that the "assembler" subsequently interprets.

Such a file would look like a bunch of partial rules for the language intricacies, followed by the list of opcodes which do little other than defining the opcode and their primitive bytes, then running it through the previously defined patterns. Even for x86, you need very little corner cases once you have the basics laid out properly. The trick is to define a language that's expressive enough to perform all the steps a hand-written assembler would do, without making it fall back to a full blown programming language.

Thing is, I have personally pondered the subject before since I still have a use for an assembler for a target platform that lacks one altogether, and the obvious absence of anything that doesn't need 101 binaries for "supports all platforms" *cough*binutils*cough*.

Posted: **Sun Jul 27, 2014 9:28 am**

Combuster wrote:Basically, the definition has become a program written in a (turing-incomplete) functional programming language designed purely for notational efficiency, that the "assembler" subsequently interprets.

I think that's exactly what I need to do, come to think of it. I'll need to consider the best way to do this for a bit before proceeding; but it is actually an excellent fit for the problem, especially since notational programming has always been the forte of Lisp (yes, I know, I need to lay off of the gratuitous language plugs, but here it really does make sense). I'm just surprised I didn't think of it earlier, it's the sort of thing that is obvious in hindsight I suppose. Thank you for the suggestion.

Posted: **Sun Jul 27, 2014 9:50 am**

I need to lay off of the gratuitous language plugs

Lots of ideas blatantly stolen from Haskell, mind you

Posted: **Sun Jul 27, 2014 11:05 am**

Combuster wrote:
I need to lay off of the gratuitous language plugs
Lots of ideas blatantly stolen from Haskell, mind you

Hey, I'll take it, wherever it came from

If you're going to steal from something, Haskell is as good as any and better than most..

Posted: **Sun Jul 27, 2014 8:32 pm**

OK, I have added a file x86.atmpl with a few experimental examples of my version of you template assembly definition idea. To get a feel for what I have in mind, here is the equivalent of your earlier example (with some elaborations):

Code: Select all

(define-field REG-32 (width 3) (index 3)
  (("EAX" => #b000)
   ("ECX" => #b001)
   ("EDX" => #b010)
   ("EBX" => #b100)
   ("ESP" => #b100)
   ("EBP" => #b101)
   ("ESI" => #b110)
   ("EDI" => #b111)))

; several details elided, if you want them see the version in the repo

(define-field MOD (width 2) (index 6)                      
  ((sub-instruction ("Ref" INDEX-BASE-OR-DISPLACEMENT => 00) ; (ADD EBX (REF EAX))
   (sub-instruction ("Ref" INDEX-BASE INDEX-8 => 01))        ; (ADD EBX (REF AL 2))
   (sub-instruction ("Ref" INDEX-BASE INDEX-32 => 10))       ; (ADD EBX (REF EAX 512))
   (REG => 11)))                                             ; (ADD EBX EAX)

I am thinking in terms of implementing it as an embedded language, using a combination of functions and macros, rather than interpreting it per se.

Posted: **Tue Jul 29, 2014 8:21 am**

Mind you, the assembly language itself is going to be a bit unusual, as it will be written in s-expressions. For example, the following piece of my old MASM boot loader:

Code: Select all

%define zero(x) xor x, x

%macro write 1
   mov si, %1
   call printstr
%endmacro


[bits 16]
[org boot_offset]

entry:
  jmp short redirect
  nop

; skipping the BPB...

redirect:
  jmp 0x0000:start

start:
  mov ax, stack_seg
  cli
  mov ss, ax          ; set the stack at an arbitrarily high point past ES.
  mov sp,  stack_top  ; put the stack pointer to the top of SS
  sti                 ; reset ints so BIOS calls can be used
  mov ax, cs
  mov ds, ax          ; set DS == CS

  write testnumber
  mov [bootdrv], dl   ; save boot drive info for later use
  write reset

would be something like the following in the Assiah notation:

Code: Select all

(macro (zero x) (xor x x))

(macro (write str)
   (mov si str)
   (call printstr))

(bit-width 16)
(org boot-offset)

(defproc entry
  (jmp short redirect)
  (no-op)

; skipping the BPB again...

:redirect
  (jmp boot-base:start)

:start
  (mov ax stack-seg)
  (cli)
  (mov ss ax)          ; set the stack at an arbitrarily high point past ES.
  (mov sp stack_top)   ; put the stack pointer to the top of SS
  (sti)                ; reset ints so BIOS calls can be used
  (mov ax cs)
  (mov ds ax)          ; set DS == CS

  (write testnumber)
  (mov (ref bootdrv) dl)   ; save boot drive info for later use
  (write reset)

So, you might ask, why go to the trouble of reinventing the wheel? In this case, the answer is: macros. Lisp-style macrology is much more sophisticated than those typically found in assemblers, and the thing that makes this possible is the s-expression syntax. I intend to experiment with the ability to write a full-fledged language in macros, a la SNOBOL, with the goal of writing the Thelema compiler/interpreter entirely in macro expansions. Crazy? Maybe. But hey, it's my time to waste, and I might actually learn something along the way.

OSDev.org

Assiah assembler - req for code reviews and data entry help

Assiah assembler - req for code reviews and data entry help

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h

Re: Assiah assembler - req for code reviews and data entry h