OSDev.org

Posted: **Thu Mar 06, 2008 9:50 am**

JamesM wrote:However I think that for C, the ambiguities are so small that a context free parser with little bits bolted on could (and does) do the trick.

Indeed. C++ did (and does) give 'em headaches, though, and was one of the major reasons for the sweeping changes done to the engine in GCC v3 / v4.

Posted: **Thu Mar 06, 2008 10:55 am**

There are actually 2 approaches to make your own compiler ...

1) handcraft your compiler :- This is the most enjoyable way of writing a compiler , you can either make use of a hard coded brute force lexical analyzer or hard coded dfa to do the tokenizing for you . After this you can develop a simple LL-1 grammar and build a simple recusive descent parser with a semantic analyzer and code generator smashed up in the parser itself.(hint : it is easier and efficent to parse expressions using an operator precedence parser , (see dijikstra's shunting yard algorithm ) ) . Many hobby compilers follow this pattern . eg Small C compiler , initial vesions of Tiny C compiler ,Crenshaw's compiler tutorial . You can write small compilers in this manner easily . In fact i made a crap compiler in a day this way. It is buggy and generates inefficent code . It generates MASM style x86 instructions. I also made an artificial limit to make lexical analysis easier ,ie all the token's should be seperated by spaces .

Code: Select all


a := b + c ;

is correct but ....

Code: Select all

a:=b+c;

will give an error . My aim was to implement a resonable subset of pascal . It's just 800 lines of code .

2) Make use of automated tools like lex,yacc etc : This reduces your burden on designing the frontend . You can concentrate more on the backend now . The grammar however should be a LALR type grammar . After the parsing you should get a parse tree as output . Traverse this tree to perform sematic analysis.This followed by building assembly trees and code generation . This is a more involved procedure and requires a lot of time for completion . Most of the commercial compilers are made this way . See Dragon Book and Tiger Book for more details . I find the dragon book better than the tiger book .

We normally do not use 'modified compilers for os developement ' . A compiler is defined as a program that translates source program from one form to other .. (eg C to Assembly ) . It is upto the assembler and linker to make the final executable for you . We give the required options to get the work done . In fact, developing a language for os development is a foolish idea.

Regards

Sandeep Mathew

Posted: **Thu Mar 06, 2008 11:39 am**

This site is good ... Have a look at ...
http://www.scifac.ru.ac.za/compilers/index.htm

Posted: **Fri Mar 07, 2008 8:41 am**

I agree on the usefullness of this link. However it would take time to understand and get something out of it to be implemented from scratch, at least for me I must say.

----------------------------------------------

Thanks also for the parser you have attached, Hangin10.

By now the parser part of the attempting compiler I'm building for RealC has the following components:

- It has an allTrim function which removes spaces, tabs, newlines, car returns and any other dummy character.

- It has a getTypifiedString function which will gather all of the characters of one same type in the same index of one array at once. For example, if we have:

Code: Select all

variable1=_2variable++++5;

It will generate an array like this (of course, it's only that raw function, and it needs more sanity checks):

Code: Select all

[0] = variable1
[1] = "="
[2] = "_2variable"
[3] = "++++"
[4] = "5"
[5] = ";"

SandeepMathew wrote:I also made an artificial limit to make lexical analysis easier ,ie all the token's should be seperated by spaces .
Code: Select all
a := b + c ;
is correct but ....
Code: Select all
a:=b+c;
will give an error .

The explanation above should solve that problem by defining "a" as a casual alphanumeric, ":=" as an whole operator, "b" another variable name, "+" an operator, "c" a variable name and ";" the end of the instruction. That just will ignore whether there are interfering spaces or even multiline comments in the middle of an expression.

- It has a way of producing code for nestings (not complete yet). The base address through registers to address automatically structures is made using the $ character, which represents the BP, EBP or RBP register for x86 platforms. For example:

Code: Select all

$ = my_structure;

Willl generate:

Code: Select all

mov ebp,my_structure

So:

Code: Select all

[$+8]=0

Will generate:

Code: Select all

mov dword[ebp+8],0  ;32-bit or Unreal Mode
   OR
mov word[ebp+8],0   ;32-bit
   OR
mov word[bp+8],0    ;16-bit
   OR
mov qword[rbp+8],0  ;64-bit

As can be seen, the 16-bit code generation needs more attention.

- It can skip comments, when finding // up to newline, car return; it can skip multiline comments /**/ but it still needs a way of properly understanding something ambiguous such as /*/ /*/.

-------------------------------------------

JamesM wrote:Yep, define it as a YACC or EBNF (extended bacckus-nauer form (sp?)), and for crying out loud WHY WHY WHY use byte, word, dword, qword et al?

They're just relics from DOS days, a word is no long 16 bits long, It's been 32 bits long, generally, for almost a decade and its hitting 64 bits now. GAH! It really annoys be because it lowers portability and makes discussions with people who don't use these archaic datatype names and use "word" to mean something different, (like, i don't know, the width of the data bus?!).

Rant over.

For addressing the width of the data bus either don't specify it as in:

Code: Select all

[myvariable]=5;

Or use the wideword keyword as in:

Code: Select all

wideword[myvariable]=5;

Of course, other data types can be used, but they aren't needed since that's only if overriding is necessary and the compiler takes into account automatically the full addressing width depending on the target bytecode generated.

It's just to bring the simplicity and documentation of assembler when it needs constantly to define whether it's going to move or in general handle bytes, 16-bit words, 32-bit double words, 64-bit quad words (it just was considered simplicity, not obsolete relics). That for itself should make some more clearer intrincate operations, something valuable for tasks like OS development.

Below is an example on how would look a program using this RealC (it should be easier to understand from the start):

Code: Select all

#org      0x7C00
#platform x86_16   //Generate 16-bit Intel opcodes



goto Start;



GDT:
#define SELNull 0       //As you may see, we have used
        GDT_size:       //the Null Selector as the GDT
          word GDTsize; //pointer to load with LGDT.
        GDT_actualPtr:  //
          dword GDT;    //You can see that the WORD at
          word 0x0000;  //the end of these 8 bytes are
                        //2 padding bytes with 0x0000.

#define SELCod32 8
        word 0FFFFh;    // bits 0-15 length (Bytes 0-1)
        word 00000h;    // bits 0-15 base addr (Bytes 2-3)
        byte 0;         // bits 16-23 base addr (Byte 4)
        byte 10011010b; // bits 0-3 Type, 4 DT, 5-6 DPL & 7 P (Byte 5)
        byte 11001111b; // bits 16-19 length into 0-3, 6 D & 7 G (Byte 6)
        byte 0;         // bits 24-31 base addr (Byte 7)

#define SELDat32 16
        word 0FFFFh;    // bits 0-15 length (Bytes 0-1)
        word 00000h;    // bits 0-15 base addr (Bytes 2-3)
        byte 0;         // bits 16-23 base addr (Byte 4)
        byte 10010010b; // bits 0-3 Type, 4 DT, 5-6 DPL & 7 P (Byte 5)
        byte 11001111b; // bits 16-19 length into 0-3, 6 D & 7 G (Byte 6)
        byte 0;         // bits 24-31 base addr (Byte 7)
      GDT_end:

#define GDTsize (GDT_end-GDT)-1





Start:


 asm{
  cli            ;Disable interrupts
  lgdt[cs:GDT]   ;Load GDT

   mov eax,cr0
    inc ax
   mov cr0,eax   ;Set PE bit

   push byte SELDat32  ;Set data selectors
   pop ds
   push ds
   pop es
 }


 asm jmp dword SELCod32:_32bit   ;Far jump


 #platform x86_32
 #align 4
 _32bit:


  /*
   Stop Floppy A:
   Stop Floppy A:
   Stop Floppy A:

     The same as:
      xor ax,ax 
      mov dx,3F2h
      out dx,al
  */
   /////////
    writeByteReg(0x3F2,0);   //It's an internal pseudofunction/pseudomacro



  asm jmp $  ;Spinning loop

The best part is that all of the above code is supposed to be translated using one only compiler tool readily, no need to make more things to mix this C and assembler.

And it's good to note that the C compilers don't seem to accept binary notation for numbers just like 0000b, but as can be seen here, it does support it.

The code I have written until now can fully and correctly interpret all of that code, but it still is unable to handle if's and expressions in general, as well as function calls, and by now it doesn't generate a binary, just assembly source code for NASM/YASM.

-------------------------------------------

SandeepMathew wrote:We normally do not use 'modified compilers for os developement ' . A compiler is defined as a program that translates source program from one form to other .. (eg C to Assembly ) . It is upto the assembler and linker to make the final executable for you . We give the required options to get the work done . In fact, developing a language for os development is a foolish idea.

I neither understand nor see how or why developing a new language is a foolish thing. It has been originally thought to avoid such problems as asking or having to solve hundreds of times either individually or collectively "where's the problem in my build/linker script?", "how/why can I get to link the objects?" (complicating a bit more things like OS development which is the goal here), or having to use multiple tools to use assembly and C instead of just using it simply, practically in one stage.

But if it gets to develop well and were capable of building DOS, Windows, Linux executables as well as raw binaries (the original goal) it would be very useful. But that certainly would be a major addition, not planned in the near future. And since it produces assembly source code it could be used to write just small pieces of an application if one wants it to be maintainable and still be able to practically program it in assembly if that's absolutely required.

Later on it can be improved to build also the final binary instead of just the generated NASM/YASM source code.

Posted: **Fri Mar 07, 2008 9:09 am**

Wow. I unfortunately have to say that I am underwhelmed.

The language snippet you posted of "RealC" actually only contains one line of C - the function call! Everything else is either preprocessor macros or inline asm. Hmm.

The parsing functions you gave seem woefully inefficient - you don't have a proper lexer, you don't have a proper grammar, you're designing everything as you meet it, seemingly.

This language will never stand the test of time, and unfortunately it seems you have a massive lack of knowledge about compilers in general. The fact that you couldn't work out what the YACC grammar meant...

Posted: **Fri Mar 07, 2008 9:39 am**

Well, the goto directive and the asm directives, both multiline and one-line are C as well, also the ";" present which need to be interpreted, as well as comments (it's necessary to be able to skip them in any circumstance).

Not to mention that to handle preprocessor directives such as #define the parser has to take that string and based on dummy characters and actual parameters tell whether its corresponding syntax is correct or not, and then translate it to assembly directives such as %define or db/dw/dd/dq (you're right, there are some missing ';' so I will update the snippet to correct that).

Also, the byte and the other fields are actually un-named variables/fields, exactly like in assembly. A named variable would be something like:

Code: Select all

byte myvariable=0;

And the parser functions I mentioned were just a simplification of the ideas, I'm trying to make it fully understandable for those who can offer help if they wish. The actual implementations are rather different but those are the principles, and of course when implementing them it results in a little amount of recursive functions.

And yes, you are right. I'm not a YACC programmer at all, it would take me time to learn it, and it would only help me mainly in using a generator. But until now what I don't know about YACC and grammar tools, I have got to implement what I have, as I said, being able to implement a parser that in the beginning may not be optimized but that does the parsing correctly, that at least is capable of correctly interpreting the previous snippet.

It isn't that I don't have a grammar defined at all, but I cannot simply post it; that would mean that everything is already almost done and wouldn't make much sense.

I would like to know properly argumented why it wouldn't succeed at least slightly. The lack of knowledge is just what is being fixed here by sharing what already is done and with what has been shared until now, and it has been helping to successfuly speed up a bit the development, even if it doesn't seem to be the case.

-------
By the way, it finally brings to attention an ambiguity which must be fixed somehow, and suggestions would be gladly and greatly appreciated:

Code: Select all

word GDT;

That could be either an unnamed variable or an actual variable declaration. What about using assembly style and decide that:

Code: Select all

word GDT;  //it will always result in an unnamed field 
           //with value "GDT", whatever that label is 

word GDT=_ANYTHING_;  //it's the only way to define a normal, 
                      //named variable/field: define its width, 
                      //specify a name, then '=' and then a 
                      //value or just 0.

Wouldn't that be reasonable, stable, convenient and easy enough, taking into account that this language is actually a mix between C an assembly?

Note that I'm just looking for suggestions about that; if there aren't any then it will just be implemented that way, no complications.

Posted: **Fri Mar 07, 2008 10:03 am**

Yes, but, the design of the parser shows that you have no knowledge of parsers, which means it won't succeed without a full rewrite.

In fact, what you're describing isn't even a context-free parser, what you've done could be simply implemented using regular expressions - there is no nesting involved yet, and that's where things get seriously nasty (bracket matching, and even worse, error handling/recovery).

I would post more, but I'm about to leave work for home. Yippee!!

Posted: **Fri Mar 07, 2008 5:44 pm**

As the Crenshaw tutorial says, it should be possible to just add functionalities to the global parser, avoiding a full rewrite.

But now into the actual ways of doing things, it looks like when solving a nested expression, sometimes it's necessary to preserve the contents of the previous results.

The easiest way is to PUSH the previous results as needed, and then POP them to operate when it's proper.

For example:

Code: Select all

(1+7*(8*2))+(6+9*(5*3))

Of course that could be solved without generating opcodes because all of the values are direct and static.

It would generate code such as:

Code: Select all

xor edx,edx
mov eax,8
mov ebx,2
mul ebx    ;do (8*2) and preserve only EAX (16)

mov ebx,7
mul ebx    ;do 7*(8*2) and preserve EAX (112)

add eax,1  ;do 1+7*(8*2) into EAX (113)

So until now we would have the first part of the expression, (1+7*(8*2)).

-------
Now, there follows a + so we would be supposed to store it and operate it until after we solve the second expression nest.

For test purposes, the second nest would basically generate the same opcodes but with different values. The question is, is there a better way of preserving the previous result in EAX other than pushing it in the stack, or to avoid even having to use temporary variables or stack space and avoid the remote but actual possibility of stack overflow just for solving an expression?

Maybe implementing optimizations to determine whether the stack will be needed, or if the result can be directly and gradually be saved into the variable itself (in case it's being saved to a variable) or if it's a temporary expression for an if or the like, then either use the stack if necessary or a combination of that and general registers recycling.

Posted: **Sat Mar 08, 2008 12:26 am**

Not to mention that to handle preprocessor directives such as #define the parser has to take that string and based on dummy characters and actual parameters tell whether its corresponding syntax is correct or not, and then translate it to assembly directives such as %define or db/dw/dd/dq (you're right, there are some missing ';' so I will update the snippet to correct that).

Parser doesnt do the preprocessing ! . Its the job of the macro processor

The explanation above should solve that problem by defining "a" as a casual alphanumeric, ":=" as an whole operator, "b" another variable name, "+" an operator, "c" a variable name and ";" the end of the instruction. That just will ignore whether there are interfering spaces or even multiline comments in the middle of an expression.

Sorry , I do not mix semantic analysis with lexical analysis . It's better to do one thing at a time ! . That limit can be resolved easily . I developed it simply for fun , I do not have to show it to anybody .

I neither understand nor see how or why developing a new language is a foolish thing. It has been originally thought to avoid such problems as asking or having to solve hundreds of times either individually or collectively "where's the problem in my build/linker script?", "how/why can I get to link the objects?" (complicating a bit more things like OS development which is the goal here), or having to use multiple tools to use assembly and C instead of just using it simply, practically in one stage.

I dont see the word 'compiler'. I never said developing a new language is bad. Develping a new compiler for os - development appears to be useless proposition . The crenshaw tutorial is not worth imitating when it comes to developing a real compiler.

. Have a good look at the link i posted . It seems that you might also be interested in HLA by Ryndall Hyde.

OSDev.org

Building a C-like Language