Assiah assembler - req for code reviews and data entry help
Posted: Sat Jul 26, 2014 12:37 pm
I have split my assembler, Assiah, into a separate project from Thelema proper, and have made a separate GitHub repo for it. The intention is to have a (mostly) table-driven assembler, which should eventually support x86, x86-64 (which at this time I am treating as a distinct ISA for design purposes), ARM, and MIPS, at the very least. I am focusing the majority of my work on the support for x86, for the simple reason that it and x86-64 are the most complicated ISAs I expect to support, and I expect that I might paint myself into a corner if I start with the easier ISAs; by tackling x86 now, I can anticipate the majority of the elaborations needed for the others.
The allowance for register and instruction aliases in both file formats is meant to allow for more readable code, though it also handles the case of operations with multiple mnemonics (e.g., the JZ and JE instructions on the x86) quite handily.
However, I could use some assistance, or at the very least criticism and advice. The current design, which is written in R6RS Scheme, calls for a separate instruction data file and register definition file for each of the supported iterations of the ISA - that is to say, there is one pair of files for the 8086, another for the 80186, another for the 80286, etc. The register file format consists of a series of lists, each containing a list of the register name(s) followed by the width of the register in bytes (this does mean that architectures whose register widths aren't multiples of a byte would be unsupported, but how many such architectures are still in use?). For example, the file 'i8086.regs' reads as follows:
Code:
'(("AX" "Accumulator") 2)
'(("AH" "Accumulator-Upper-Half") 1)
'(("AL" "Accumulator-Lower-Half") 1)
'(("BX" "Base-Register" "Index") 2)
'(("BH" "Base-Upper-Half" "Index-Upper-Half") 1)
'(("BL" "Base-Lower-Half" "Index-Lower-Half") 1)
'(("CX" "Counter") 2)
'(("CH" "Counter-Upper-Half") 1)
'(("CL" "Counter-Lower-Half") 1)
'(("DX" "Data-Register") 2)
'(("DH" "Data-Upper-Half") 1)
'(("DL" "Data-Lower-Half") 1)
'(("DI" "Dest-Index") 2)
'(("SI" "Source-Index") 2)
'(("BP" "Base-Pointer" "Stack-Frame-Pointer") 2)
'(("SP" "Stack") 2)
'(("IP" "Instruction-Pointer") 2)
'(("FLAGS") 2)
'(("CS" "Code-Segment") 2)
'(("DS" "Data-Segment") 2)
'(("SS" "Stack-Segment") 2)
'(("ES" "Extra-Segment") 2)
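As a rough illustration of how a file in this format might be consumed, here is a minimal sketch. It is not the code from the repo: the names read-entries, find-register, register-width and canonical-name are made up for the example, and the sample text is a three-line excerpt of the file above held in a string rather than read from disk. It also assumes names are matched case-sensitively, which a real assembler probably would not want.

```scheme
(import (rnrs))

;; Read every datum from PORT. Entries in the file are written with a
;; leading quote, so `read` returns (quote <entry>); strip that wrapper.
(define (read-entries port)
  (let loop ((acc '()))
    (let ((datum (read port)))
      (cond ((eof-object? datum) (reverse acc))
            ((and (pair? datum) (eq? (car datum) 'quote))
             (loop (cons (cadr datum) acc)))
            (else (loop (cons datum acc)))))))

;; Hypothetical excerpt of 'i8086.regs', kept in a string for the example.
(define sample-regs-text
  (string-append
   "'((\"AX\" \"Accumulator\") 2)\n"
   "'((\"AL\" \"Accumulator-Lower-Half\") 1)\n"
   "'((\"BP\" \"Base-Pointer\" \"Stack-Frame-Pointer\") 2)\n"))

(define sample-registers
  (read-entries (open-string-input-port sample-regs-text)))

;; Return the entry whose name list contains NAME, or #f if none does.
(define (find-register name table)
  (find (lambda (entry) (member name (car entry))) table))

;; Return the width in bytes of the register called NAME, or #f.
(define (register-width name table)
  (let ((entry (find-register name table)))
    (and entry (cadr entry))))

;; Return the first-listed name for NAME (which may be an alias), or #f.
(define (canonical-name name table)
  (let ((entry (find-register name table)))
    (and entry (caar entry))))
```

With this, (register-width "Stack-Frame-Pointer" sample-registers) and (register-width "BP" sample-registers) both resolve to the same entry, which is the point of storing the aliases alongside the primary name.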
The format for instructions is a good deal more complex. It again consists of a series of lists, each with three outer fields: the list of mnemonics, the list of the possible fields of the opcode, and a text description of the instruction. The fields section is itself a list of one or more encodings, each consisting (in the case of the x86 ISA) of the four common opcode prefixes, the primary opcode (which I will describe shortly), the MOD-R/M type, an enum indicating whether the instruction can accept the LOCK prefix (and under what circumstances), and an enum indicating the minimum security ring of the instruction. The opcode field is itself broken down into the size of the opcode field, the opcode itself, and the sub-fields which select various conditions or states (it gets complicated - really complicated). A few examples would be:
Code:
'(("AAA" "ASCII-Adjust-After-Addition")
((NONE NONE NONE NONE (8 #x37 (NONE)) NONE NO RING-3))
"ASCII Adjust AL After Addition")
'(("ADD")
((NONE NONE NONE NONE (6 #x00 (D W)) reg REG-DEST-ONLY RING-3)
(NONE NONE NONE NONE (7 #x04 (W)) NONE NO RING-3)
(NONE NONE NONE NONE (6 #x80 (S W)) 0 ALLOWED RING-3))
"Add")
'(("ADDPD")
((#x66 #x0F NONE NONE (8 #x58 (NONE)) reg NO RING-3))
"Add Packed Double-FP Values")
'(("BOUND" "Check-Array-Bounds")
((NONE NONE (8 #x62 (D)) NONE reg INT-FLAG))
"Check Array Index Against Bounds")
'(("CMOVB" "CMOVNAE" "CMOVC")
((NONE #x0F NONE NONE (5 #x40 (B)) reg NO RING-3))
"Conditional Move - below/not above or equal/carry (CF=1)")
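To make the opcode descriptor more concrete, here is a small sketch of how an assembler could turn one of these entries into an opcode byte. This is an illustration only, under assumptions that the format itself does not state: that every sub-field is one bit wide, and that sub-fields occupy the low-order bits of the byte in the order listed (so in (D W), W is bit 0 and D is bit 1). The accessor names and opcode-byte are hypothetical, and the size element of the descriptor is ignored here.

```scheme
(import (rnrs))

;; Two of the ADD encodings, copied from the examples above.
(define add-entry
  '(("ADD")
    ((NONE NONE NONE NONE (6 #x00 (D W)) reg REG-DEST-ONLY RING-3)
     (NONE NONE NONE NONE (7 #x04 (W)) NONE NO RING-3))
    "Add"))

;; Accessors for the three outer fields of an entry.
(define (entry-mnemonics entry) (car entry))
(define (entry-forms entry) (cadr entry))
(define (entry-description entry) (caddr entry))

;; The opcode descriptor is the fifth element of an encoding,
;; after the four prefix slots.
(define (form-opcode form) (list-ref form 4))

;; Combine the base opcode with the chosen sub-field bits.
;; OPCODE-SPEC is (size base (sub-field ...)); BITS is an association
;; list such as ((D . 1) (W . 0)). Sub-fields missing from BITS
;; default to 0, and a sub-field list of (NONE) means no sub-fields.
(define (opcode-byte opcode-spec bits)
  (let ((base (cadr opcode-spec))
        (fields (caddr opcode-spec)))
    (let loop ((fields (if (equal? fields '(NONE)) '() fields))
               (acc 0))
      (if (null? fields)
          (bitwise-ior base acc)
          (let* ((pair (assq (car fields) bits))
                 (bit (if pair (cdr pair) 0)))
            ;; Shift the accumulated bits left and append this one.
            (loop (cdr fields) (+ (* 2 acc) bit)))))))
```

For example, (opcode-byte (form-opcode (car (entry-forms add-entry))) '((D . 1) (W . 1))) combines #x00 with both bits set, giving #x03, and the second encoding with W set gives #x05. Whether the one-bit assumption holds for every sub-field (the (B) in CMOVB, say) is exactly the kind of thing the format would need to spell out.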
As complex as this is, I am not sure that I have captured enough information about this ISA to make a truly table-driven assembler. I have repeatedly had to expand upon it already to handle edge cases I hadn't foreseen, and the absurd, irrational complexity of the x86 design, together with the difficulty of following the manuals and the various (often contradictory) web pages documenting it, means I could easily miss something important. Even now, I am uncertain enough about how to represent the multiplicity of argument formats that I have not tried to add that information to the data files; I am hoping I won't need to do so explicitly.
Thus, against my better judgment, I am asking the good folks at this forum for three things: first, a review of the code and data, and of the data formats, to see if there is something I have overlooked; second, advice on how best to represent the needed data; and third, assistance in entering the volumes of instruction data I am trying to cope with (most of my information is coming from http://ref.x86asm.net/, an excellent if somewhat opaque reference page to which I am greatly indebted). I have barely scratched the surface at this point, and the outrageous number of instructions and variants thereof is threatening to drive me crazier than I already am. Can anyone give me some good advice on this matter?