I agree on the usefulness of this link. However, it would take time to understand it and get something out of it that could be implemented from scratch, at least for me.
----------------------------------------------
Thanks also for the parser you have attached, Hangin10.
So far, the parser part of the compiler I'm attempting to build for RealC has the following components:
- It has an allTrim function which removes spaces, tabs, newlines, carriage returns and any other dummy characters.
- It has a getTypifiedString function which gathers each run of consecutive characters of the same type into a single element of an array. For example, if we have:
It will generate an array like this (of course, this is only the raw function, and it still needs more sanity checks):
Code:
[0] = "variable1"
[1] = "="
[2] = "_2variable"
[3] = "++++"
[4] = "5"
[5] = ";"
SandeepMathew wrote:I also made an artificial limit to make lexical analysis easier ,ie all the token's should be seperated by spaces .
is correct but ....
will give an error .
The approach above should solve that problem by classifying "a" as an ordinary alphanumeric name, ":=" as a whole operator, "b" as another variable name, "+" as an operator, "c" as a variable name and ";" as the end of the instruction. It simply ignores any interfering spaces or even multiline comments in the middle of an expression.
- It has a way of producing code for nestings (not complete yet). The base register used to address structures automatically is written with the $ character, which represents the BP, EBP or RBP register on x86 platforms. For example:
Will generate:
So:
Will generate:
Code:
mov dword[ebp+8],0 ;32-bit or Unreal Mode
OR
mov word[ebp+8],0 ;32-bit
OR
mov word[bp+8],0 ;16-bit
OR
mov qword[rbp+8],0 ;64-bit
As can be seen, the 16-bit code generation needs more attention.
- It can skip comments: on finding // it skips up to the newline or carriage return; it can also skip multiline comments /**/, but it still needs a way of properly handling something ambiguous such as /*/ /*/.
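The trimming, grouping and comment-skipping steps listed above can be sketched roughly as follows. This is only an illustration in Python, not the actual RealC code: the function names, the three character classes and the input strings are assumptions. For the ambiguous /*/ /*/ case it uses the rule C compilers use (the closing */ is searched for after the two-character opener), which is one possible resolution among others.

```python
def char_class(ch):
    """Classify one character; these class names are illustrative only."""
    if ch.isspace():
        return "blank"
    if ch.isalnum() or ch == "_":
        return "alnum"
    if ch == ";":
        return "end"
    return "operator"


def skip_blanks_and_comments(src, i):
    """Return the index of the next character that is neither blank nor comment."""
    n = len(src)
    while i < n:
        if src[i].isspace():
            i += 1
        elif src.startswith("//", i):
            # Line comment: skip up to the newline or carriage return.
            while i < n and src[i] not in "\r\n":
                i += 1
        elif src.startswith("/*", i):
            # Block comment: look for the closer after the opener, so "/*/"
            # is an opener followed by "/", not a complete comment.
            end = src.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            break
    return i


def typified_tokens(src):
    """Group consecutive characters of the same class into one array element."""
    tokens = []
    i = skip_blanks_and_comments(src, 0)
    while i < len(src):
        cls = char_class(src[i])
        start = i
        i += 1
        # ";" always stands alone; other classes extend while the class repeats
        # and no comment starts in the middle of the run.
        while (cls != "end" and i < len(src)
               and char_class(src[i]) == cls
               and not src.startswith(("//", "/*"), i)):
            i += 1
        tokens.append(src[start:i])
        i = skip_blanks_and_comments(src, i)
    return tokens


# Hypothetical inputs, with and without separating spaces.
print(typified_tokens("a:=b+c;"))
print(typified_tokens("a := b /* note */ + c ; // tail"))
```

Both calls give the same six elements, since spaces and comments never reach the token array. A real lexer would still need to split runs such as "++++" into individual operators.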
-------------------------------------------
JamesM wrote:Yep, define it as a YACC or EBNF (extended bacckus-nauer form (sp?)), and for crying out loud WHY WHY WHY use byte, word, dword, qword et al?
They're just relics from DOS days, a word is no long 16 bits long, It's been 32 bits long, generally, for almost a decade and its hitting 64 bits now. GAH! It really annoys be because it lowers portability and makes discussions with people who don't use these archaic datatype names and use "word" to mean something different, (like, i don't know, the width of the data bus?!).
Rant over.
To address the full width of the data bus, either don't specify a size, as in:
Or use the wideword keyword, as in:
Of course, the other data types can be used, but they are only needed when an override is necessary, since the compiler automatically uses the full addressing width of the target it is generating code for.
The point is just to bring over the simplicity and self-documentation of assembler, which constantly has to state whether it is going to move, or in general handle, bytes, 16-bit words, 32-bit double words or 64-bit quad words (this was considered simplicity, not obsolete relics). That in itself should make some intricate operations clearer, something valuable for tasks like OS development.
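To make the size selection concrete, here is a minimal sketch of how a store relative to $ could be emitted. It is written in Python purely for illustration. The function name emit_store, the x86_64 platform name and the rule that wideword maps to the native width of the target are assumptions, chosen so that the output matches the generated lines shown earlier; the real compiler may work differently.

```python
# Assumed tables: which register "$" stands for, and which size "wideword"
# (or an omitted size) resolves to on each target.
BASE_REGISTER = {"x86_16": "bp", "x86_32": "ebp", "x86_64": "rbp"}
NATIVE_SIZE = {"x86_16": "word", "x86_32": "dword", "x86_64": "qword"}
EXPLICIT_SIZES = ("byte", "word", "dword", "qword")


def emit_store(platform, offset, value, size="wideword"):
    """Return one NASM-style line storing an immediate at [$ + offset]."""
    if size == "wideword":
        # No override given: use the full width of the target.
        size = NATIVE_SIZE[platform]
    elif size not in EXPLICIT_SIZES:
        raise ValueError("unknown size keyword: " + size)
    base = BASE_REGISTER[platform]
    # "+d" keeps the sign, so negative offsets come out as [ebp-4].
    return f"mov {size}[{base}{offset:+d}],{value}"


print(emit_store("x86_32", 8, 0))          # no size given
print(emit_store("x86_32", 8, 0, "word"))  # explicit override
print(emit_store("x86_16", 8, 0))
print(emit_store("x86_64", 8, 0))
```

The sketch does not check whether an override is legal for the target (for example qword on x86_16); that check would belong in the same place.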
Below is an example of how a program using this RealC would look (it should be easier to understand from the start):
Code:
#org 0x7C00
#platform x86_16 //Generate 16-bit Intel opcodes
goto Start;
GDT:
#define SELNull 0 //As you may see, we have used
GDT_size: //the Null Selector as the GDT
word GDTsize; //pointer to load with LGDT.
GDT_actualPtr: //
dword GDT; //You can see that the WORD at
word 0x0000; //the end of these 8 bytes are
//2 padding bytes with 0x0000.
#define SELCod32 8
word 0FFFFh; // bits 0-15 length (Bytes 0-1)
word 00000h; // bits 0-15 base addr (Bytes 2-3)
byte 0; // bits 16-23 base addr (Byte 4)
byte 10011010b; // bits 0-3 Type, 4 DT, 5-6 DPL & 7 P (Byte 5)
byte 11001111b; // bits 16-19 length into 0-3, 6 D & 7 G (Byte 6)
byte 0; // bits 24-31 base addr (Byte 7)
#define SELDat32 16
word 0FFFFh; // bits 0-15 length (Bytes 0-1)
word 00000h; // bits 0-15 base addr (Bytes 2-3)
byte 0; // bits 16-23 base addr (Byte 4)
byte 10010010b; // bits 0-3 Type, 4 DT, 5-6 DPL & 7 P (Byte 5)
byte 11001111b; // bits 16-19 length into 0-3, 6 D & 7 G (Byte 6)
byte 0; // bits 24-31 base addr (Byte 7)
GDT_end:
#define GDTsize (GDT_end-GDT)-1
Start:
asm{
cli ;Disable interrupts
lgdt[cs:GDT] ;Load GDT
mov eax,cr0
inc ax
mov cr0,eax ;Set PE bit
push byte SELDat32 ;Set data selectors
pop ds
push ds
pop es
}
asm jmp dword SELCod32:_32bit ;Far jump
#platform x86_32
#align 4
_32bit:
/*
Stop Floppy A:
The same as:
xor ax,ax
mov dx,3F2h
out dx,al
*/
/////////
writeByteReg(0x3F2,0); //It's an internal pseudofunction/pseudomacro
asm jmp $ ;Spinning loop
The best part is that all of the above code is meant to be translated by a single compiler tool, with no extra steps needed to mix this C and assembler.
It's also worth noting that C compilers don't seem to accept binary notation for numbers such as 0000b, but as can be seen here, RealC does support it.
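As an illustration of the number notations used in the listing (0x7C00, 0FFFFh, 10011010b and plain decimal), a literal reader could look roughly like this. It is a Python sketch under assumptions: the name parse_number and the order in which the suffixes are tested are not taken from the RealC source.

```python
def parse_number(text):
    """Convert one numeric literal in the notations of the listing to an int."""
    t = text.lower()
    if t.startswith("0x"):
        return int(t[2:], 16)   # C-style hexadecimal: 0x7C00
    if t.endswith("h"):
        return int(t[:-1], 16)  # assembler-style hexadecimal: 0FFFFh
    if t.endswith("b"):
        return int(t[:-1], 2)   # binary: 10011010b
    return int(t, 10)           # plain decimal


for literal in ("0x7C00", "0FFFFh", "10011010b", "16"):
    print(literal, "=", parse_number(literal))
```

The "h" suffix is tested before "b" so that a hexadecimal literal such as 1Bh is not taken for a binary one. Malformed digits raise ValueError from int().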
The code I have written so far can fully and correctly interpret all of that code, but it is still unable to handle ifs and expressions in general, as well as function calls, and for now it doesn't generate a binary, just assembly source code for NASM/YASM.
-------------------------------------------
SandeepMathew wrote:We normally do not use 'modified compilers for os developement ' . A compiler is defined as a program that translates source program from one form to other .. (eg C to Assembly ) . It is upto the assembler and linker to make the final executable for you . We give the required options to get the work done . In fact, developing a language for os development is a foolish idea.
I neither understand nor see how or why developing a new language would be a foolish thing. It was originally conceived to avoid problems such as having to ask or solve hundreds of times, individually or collectively, "where's the problem in my build/linker script?" or "how can I get the objects to link?" (which complicates things like OS development, the goal here, a bit more), or having to use multiple tools to combine assembly and C instead of simply using both, practically in one stage.
If it develops well and becomes capable of building DOS, Windows and Linux executables as well as raw binaries (the original goal), it would be very useful. But that would certainly be a major addition, not planned for the near future. And since it produces assembly source code, it could be used to write just small pieces of an application, keeping it maintainable while still making it practical to program in assembly where that's absolutely required.
Later on it can be improved to also build the final binary instead of just generating NASM/YASM source code.