OSDev.org

Posted: **Sun May 20, 2007 7:50 pm**

Parsing is nt that hard....,you can make use of paser generators like
yacc or bison....,May be you can make you generic code more readable
by allowing higher level constructs(This requires a Parser)......this is what Ryndall Hyde has done with hla....But it'is written in hla itself and very diffcult to port....... , if you make something like that in C or pascal ... it will be far
easier to port .... and also compiler's can target your translator as a target
than the bare machine......This makes making portable compilers quite
easier...... I presented this as a paper ...but they did'nt like it very much.............

Posted: **Mon May 21, 2007 1:21 am**

Hi,

hckr83 wrote:ok so to code for optimizations on some archs, you use 100s of registers, so on those few archs that do support, it will be fast...but what do you do when the target arch only has 16?
My idea is to use memory in order to store missing registers..

A better idea would be for the language to have no registers at all (where all variables are in memory), and for the compiler/optimizer to automatically decide to use CPU registers instead of the variables in memory where appropriate.

For e.g. you'd code something like:

Code: Select all

    mov [variable],0xFOODBABE
    add [variable],3

And the compiler might generate:

Code: Select all

    mov eax,0xFOODBABE
    add eax,3

hckr83 wrote:lets say you wanted to program on 32bit for the generic assembly, but wanted to compile to use a 16bit arch. For example sake, we'll use 8086

The generic code:

Code: Select all

;generic
mov dword r0,0xFFFF FFFE
add dword r0,1
sub dword r0,2
div dword r0,3
mul dword r0,4

Would produce something like:

Code: Select all

;produced
mov ax,0xFFFF
mov bx,0xFFFF

add ax,1
adc bx,0

sub ax,2
sbb bx,0

push cx
push dx
push ax
mov ax,bx    ;ax = high word
mov dx,0     ;dx:ax = high word
mov cx,3
div cx           ;ax = high word / 3, dx = remainder
pop ax         ;ax = low word
add ax,dx
mov dx,0
adc dx,0      ;dx:ax = low word + remainder
div cx          ;ax = (low word + remainder) / 3
pop dx
pop cx

push dx
push cx
push si
push di
mov cx,4
mul cx        ;dx:ax = low word * 4
mov si,ax
mov di,dx  ;di:si = low word * 4
mov ax,bx
mul cx        ;dx:ax = high word * 4
mov bx,di
add bx,ax
mov ax,si   ;bx:ax = ((high word * 4) << 16) + (low word * 4)
pop di
pop si
pop cx
pop dx

Of course this is extremely unoptimized (and not tested)...

If the language had no registers at all, then you'd produce something like this:

Code: Select all

;produced
mov word [variable],0xFFFF
mov word [variable+2],0xFFFF

add word [variable],1
adc word [variable+2],0

sub word [variable],2
sbb word [variable+2],0

mov ax,[variable+2]
mov dx,0
mov cx,3
div cx                         ;ax = high word / 3, dx = remainder
mov [variable+2],ax
mov ax,[variable]
add ax,dx
mov dx,0
adc dx,0
div cx                         ;ax = (low word + remainder) / 3
mov [variable],ax

mov ax,[variable]
mov cx,4
mul cx                       ;dx:ax = low dword * 4
mov [variable],ax
mov ax,[variable+2]
mov [variable+2],dx
mul cx
add [variable+2],ax

This is also extremely unoptimized, but it'd be easier for the compiler to see that SI, DI and BP are unused and decide to use SI:DI instead of "[variable]" and "[variable+2]".

Cheers,

Brendan

Posted: **Fri Jun 01, 2007 9:52 pm**

hmmm...that'd be smart...

though optimizers are hard to make..lol

really though, what if you knew you wanted something stored in a register for speed, and though your not using it right hten, you will use it later..something no optimizer can detect...

hmm...what about a "register" flag or something you could put for a certain address of memory where "if possible put this in a register"

Posted: **Sat Jun 02, 2007 1:15 am**

Hi,

hckr83 wrote:really though, what if you knew you wanted something stored in a register for speed, and though your not using it right hten, you will use it later..something no optimizer can detect...

Using registers is faster, and not using registers is slower. Putting something in a register now that you intend to use much later on means that the register can't be used to speed up all the code between "now" and "later on". In this case you could improve performance by pushing the register on the stack, using the register to speed things up, then popping it off the stack much later on.

Think of it like an excercise in caching, where CPU registers are used to cache the most frequently used variables. As soon as you're caching something that isn't the most frequently used, then you start to lose efficiency.

BTW you'd be surprised what an optimizer could detect. It'd recommend doing some research into common optimizations (e.g. see this wiki page) before designing your language, so you can decide which features of your language will make it easier or harder to optimize later.

One thing I've been considering is replacing complete functions with table lookups. For example, consider a function like this:

Code: Select all

int myFunction(char a, char b) {
    int result;

    result = ((a * b) / (a + b)) * b;
    result &= (a * b) / MY_CONSTANT;
    return result;
}

This could be optimized to:

Code: Select all

    movzx eax,byte [a]
    mov ah,[b]
    mov eax,[__myFunction_table + eax * 4]
    ret

Of course to do this type of thing you need to implement an interpretter to generate the table, and it'd be nice if you could tell the compiler the exact range of a variable (so that if "b" only ever contains a value from 0 to 3, the optimizer could generate a 1024 byte table instead of a full 65536 byte table).

hckr83 wrote:hmm...what about a "register" flag or something you could put for a certain address of memory where "if possible put this in a register"

If your optimizer/s are any good, the "register" flag should be considered a suggestion to the optimizer (and mostly ignored), just like it is in C. It's more important to have it's opposite ("volatile") which tells the optimizer that the variable can't be cached in a register (which the optimizer must not ignore).

Optimizing is hard, and optimizing well is a lot harder. My suggestion is to forget about optimizing while you're implementing the compiler, but leave places where you could insert optimizers later (e.g. between "tokenising" and "compiling", and between "compiling" and "assembling").

Cheers,

Brendan

Posted: **Sat Jun 02, 2007 7:47 am**

Making the tokenizer is what I can't do...I have tried like 10 times to make something convert a plain text file in a language convert to tokens...it just doesn't work for me...

hmm...
what about a registers only language?
I think it simplifies things a LOT if you only have to worry about registers or memory...kinda one or the other..

course, I don't think registers only could have many uses practically.. So probably memory only would be best..what about storing addresses in registers though? would that be best done by the optimizer?

another bit, what about having portable bits? like be able to choose between one amount of bits to make the code in, and then that can be compiled to any other amount of bits? course...this also brings up the problem of endianess..and also the way the computer stores memory(most significant byte first or least significant byte first)

I just don't see anyway to combat those problems without major speed penalties..

Posted: **Sat Jun 02, 2007 8:36 am**

hckr83 wrote:Making the tokenizer is what I can't do...I have tried like 10 times to make something convert a plain text file in a language convert to tokens...it just doesn't work for me...

Tokenizers aren't really that difficult. You could even use flex to do it for you.

If you write it yourself, you might want to look into finite state automata's as they do help.

Posted: **Sat Jun 02, 2007 9:53 am**

Hi,

hckr83 wrote:Making the tokenizer is what I can't do...I have tried like 10 times to make something convert a plain text file in a language convert to tokens...it just doesn't work for me...

A quick crash course....

First you need some output routines which would send the tokenized data somewhere. For now just send the tokens to the screen and worry about the details later:

Code: Select all

void sendSimpleToken(unsigned short token) {
    printf("%d\n", token);
}

void sendStringToken(unsigned short token, char *string) {
    printf("%d '%s'\n", token, string);
}

void sendNumberToken(unsigned short token, long int number) {
    printf("%d %d\n", token, number);
}

Next you want some framework for your tokenizer. I'm going to assume it's a line-oriented language (like assembly) and your tokenizer is sent lines of ASCII by the code that reads the source file/s from disk:

Code: Select all

void tokenize(char *input) {
    while(*input != 0) {
        printf("Syntax error!\n");
    }
}

Now you probably want to ignore any white space between things. That's easy enough:

Code: Select all

void tokenize(char *input) {
    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        }
        printf("Syntax error!\n");
    }
}

Every language has numerical constants, so let's add them next:

Code: Select all

enum {
    TOKEN_NUMBER,
}

void tokenize(char *input) {
    long int number;
    char *temp;

    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        } else {
            number = strtol(input, &temp, 0);
            if(*temp != input) {
                sendNumberToken(TOKEN_NUMBER, number);
            }
        }
        printf("Syntax error!\n");
    }
}

Then there's identifiers (labels, keywords, etc):

Code: Select all

enum {
    TOKEN_NUMBER,
    TOKEN_IDENTIFIER,
}

void tokenize(char *input) {
    long int number;
    char *temp;
    char identifier[256];
    int i;

    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        } else {
            number = strtol(input, &temp, 0);
            if(temp != input) {
                sendNumberToken(TOKEN_NUMBER, number);
                input = temp;
            } else {
                i = 0;
                while( (input[i] > 'A') && (input[i] < 'Z') ) {
                    identifier[i] = input[i];
                    i++;
                }
                if(i != 0) {
                    identifier[i] = 0;
                    sendStringToken(TOKEN_IDENTIFIER, identifier);
                    input += i;
                }
            }
        }
        printf("Syntax error!\n");
    }
}

You'd probably also want to handle operators, punctuation, etc. I'll only do a few:

Code: Select all

enum {
    TOKEN_NUMBER,
    TOKEN_IDENTIFIER,
    TOKEN_LEFTBRACKET,
    TOKEN_RIGHTBRACKET,
    TOKEN_MULTIPLY,
    TOKEN_ADD,
}

void tokenize(char *input) {
    long int number;
    char *temp;
    char identifier[256];
    int i;

    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        } else if(*input == '*') {
            sendSimpleToken(TOKEN_MULTIPLY);
        } else if(*input == '+') {
            sendSimpleToken(TOKEN_ADD);
        } else if(*input == '(') {
            sendSimpleToken(TOKEN_LEFTBRACKET);
        } else if(*input == ')') {
            sendSimpleToken(TOKEN_RIGHTBRACKET);
        } else {
            number = strtol(input, &temp, 0);
            if(temp != input) {
                sendNumberToken(TOKEN_NUMBER, number);
                input = temp;
            } else {
                i = 0;
                while( (input[i] > 'A') && (input[i] < 'Z') ) {
                    identifier[i] = input[i];
                    i++;
                }
                if(i != 0) {
                    identifier[i] = 0;
                    sendStringToken(TOKEN_IDENTIFIER, identifier);
                    input += i;
                }
            }
        }
        printf("Syntax error!\n");
    }
}

Handling string constants would be a good idea. Also if your language has comments like assembly you could probably stop tokenizing when you get a comment character:

Code: Select all

enum {
    TOKEN_NUMBER,
    TOKEN_IDENTIFIER,
    TOKEN_STRING,
    TOKEN_LEFTBRACKET,
    TOKEN_RIGHTBRACKET,
    TOKEN_MULTIPLY,
    TOKEN_ADD,
}

void tokenize(char *input) {
    long int number;
    char *temp;
    char identifier[256];
    int i;

    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        } else if(*input == ';') {
            return;
        } else if(*input == '*') {
            sendSimpleToken(TOKEN_MULTIPLY);
        } else if(*input == '+') {
            sendSimpleToken(TOKEN_ADD);
        } else if(*input == '(') {
            sendSimpleToken(TOKEN_LEFTBRACKET);
        } else if(*input == ')') {
            sendSimpleToken(TOKEN_RIGHTBRACKET);
        } else if(*input == '"') {
            i = 1;
            while( (input[i] != '"') && (input[i] != 0) {
                i++;
            }
            if(input[i] == 0) {
                printf("Hey - there's no right quote character on that string!\n");
                exit(EXIT_FAILURE);
            }
            input[i] = 0;
            sendStringToken(TOKEN_STRING, input);
            input += i + 1;
        } else {
            number = strtol(input, &temp, 0);
            if(temp != input) {
                sendNumberToken(TOKEN_NUMBER, number);
                input = temp;
            } else {
                i = 0;
                while( (input[i] > 'A') && (input[i] < 'Z') ) {
                    identifier[i] = input[i];
                    i++;
                }
                if(i != 0) {
                    identifier[i] = 0;
                    sendStringToken(TOKEN_IDENTIFIER, identifier);
                    input += i;
                }
            }
        }
        printf("Syntax error!\n");
        exit(EXIT_FAILURE);
    }
}

Lastly, there's keywords. At the moment a keyword would end up being tokenized as an identifier. If your compiler handles several languages (e.g. C and inline assembly, or NASM and AT&T) it's probably better to leave them for the parser to sort out. Otherwise, you could start adding them:

Code: Select all

enum {
    TOKEN_NUMBER,
    TOKEN_IDENTIFIER,
    TOKEN_STRING,
    TOKEN_LEFTBRACKET,
    TOKEN_RIGHTBRACKET,
    TOKEN_MULTIPLY,
    TOKEN_ADD,
    TOKEN_FOO_KEYWORD,
    TOKEN_BAR_KEYWORD,
}

void tokenize(char *input) {
    long int number;
    char *temp;
    char identifier[256];
    int i;

    while(*input != 0) {
        if( (*input = ' ')  || (*input = '\t') || (*input = '\r') || (*input = '\n') ) {
            input++;
        } else if(*input == ';') {
            return;
        } else if(*input == '*') {
            sendSimpleToken(TOKEN_MULTIPLY);
        } else if(*input == '+') {
            sendSimpleToken(TOKEN_ADD);
        } else if(*input == '(') {
            sendSimpleToken(TOKEN_LEFTBRACKET);
        } else if(*input == ')') {
            sendSimpleToken(TOKEN_RIGHTBRACKET);
        } else if(*input == '"') {
            i = 1;
            while( (input[i] != '"') && (input[i] != 0) {
                i++;
            }
            if(input[i] == 0) {
                printf("Hey - there's no right quote character on that string!\n");
                exit(EXIT_FAILURE);
            }
            input[i] = 0;
            sendStringToken(TOKEN_STRING, input);
            input += i + 1;
        } else {
            number = strtol(input, &temp, 0);
            if(temp != input) {
                sendNumberToken(TOKEN_NUMBER, number);
                input = temp;
            } else {
                i = 0;
                while( (input[i] > 'A') && (input[i] < 'Z') ) {
                    identifier[i] = input[i];
                    i++;
                }
                if(i != 0) {
                    identifier[i] = 0;
                    if(strcasecmp(identifier, "FOO") == 0) {
                        sendSimpleToken(TOKEN_FOO_KEYWORD);
                    } else if(strcasecmp(identifier, "BAR") == 0) {
                        sendSimpleToken(TOKEN_BAR_KEYWORD);
                    } else {
                        sendStringToken(TOKEN_IDENTIFIER, identifier);
                    }
                    input += i;
                }
            }
        }
        printf("Syntax error!\n");
        exit(EXIT_FAILURE);
    }
}

Most of the code above is fairly dodgy (untested, and intended as an example only), but it should be able to tokenize something like this:

Code: Select all

 ; A comment
    FOO 1 * (3 + 4)
    BAR "Hello World"

Of course it'd also accept some gibberish like this:

Code: Select all

   (FOO BAR) FOO + BAR "Hello" * 3

That's OK - the next stage of your compiler can work out what is gibberish and what isn't.

There's also a lot missing from the example code. For e.g. you'd want to be able to handle larger integers and floating point constants (I store all numerical constants as arbitrary precision floating point). It's also a good idea to use an array of keywords and write a function that checks each entry in the array to see if a string is a keyword or not, instead of using a "strcasecmp" for each possible keyword (it's easier to maintain that way, and means you can use hash tables or something to improve performance later on). You'd also want to add a generic error handling function, so that the compiler will tell you which line in which file it didn't like.

After writing your tokenizer you'd want to replace those output routines. For this I tend to add the output to a buffer, and send the buffer to another thread when it gets full or when there's nothing left to tokenize (but that's just how I do things - most people would just collect it all in memory).

I'd also be tempted to write the reverse - some code to convert the tokens back into ASCII. This just makes it easier to debug your compiler later.

Cheers,

Brendan

Posted: **Thu Jun 07, 2007 4:07 pm**

Brendan wrote:Optimizing is hard, and optimizing well is a lot harder. My suggestion is to forget about optimizing while you're implementing the compiler, but leave places where you could insert optimizers later

I completely agree with the premise, but I say that the conclusion logically contradicts the premise.

If you have something like the old Whitesmith C compiler that will generate 100 lines of asm from a particular source code -- and GCC that will emit 500 lines of asm for the same code -- and if you ignore optimization completely and get 3000 lines of asm for the same code out of yours -- where an asm hacker can do it all in 40 lines of asm ...

you will NEVER significantly compress the code down later with an optimizer (because optimizing really is damned hard!), for any of the inefficient compilers. The one that will optimize best, later, is the one that was the best optimized to begin with.

And every piece of code that you compile with your inefficient compiler will be inefficient -- bogging down any system you ever run that code on.

Developer time is NOT always the most important factor. There are two pieces of software that are the fulcrum of the entire system -- the kernel, and the compiler. Spending the extra time to polish both of those foundation stones into slick shiny jewels will produce endless rewards, later.

Posted: **Thu Jun 07, 2007 7:21 pm**

bewing wrote:
Brendan wrote:Optimizing is hard, and optimizing well is a lot harder. My suggestion is to forget about optimizing while you're implementing the compiler, but leave places where you could insert optimizers later
I completely agree with the premise, but I say that the conclusion logically contradicts the premise.

If you have something like the old Whitesmith C compiler that will generate 100 lines of asm from a particular source code -- and GCC that will emit 500 lines of asm for the same code -- and if you ignore optimization completely and get 3000 lines of asm for the same code out of yours -- where an asm hacker can do it all in 40 lines of asm ...

you will NEVER significantly compress the code down later with an optimizer (because optimizing really is damned hard!), for any of the inefficient compilers. The one that will optimize best, later, is the one that was the best optimized to begin with.

And every piece of code that you compile with your inefficient compiler will be inefficient -- bogging down any system you ever run that code on.

Developer time is NOT always the most important factor. There are two pieces of software that are the fulcrum of the entire system -- the kernel, and the compiler. Spending the extra time to polish both of those foundation stones into slick shiny jewels will produce endless rewards, later.

umm... ok. I agree with what you are saying, but not in the context of this thread and the thread's original post's author. You have to slow down and realize that the guy is actually trying to learn how to write a compiler not revolutionize the world with a multi-million dollar compiler, which I think is the point Brendan had in mind. Notice he said he was having trouble learning to parse syntax of some language I think that is grounds for him to think and start small. Since as you know every prototype gets rebuilt.

...but I say that the conclusion logically contradicts the premise... if we were writing a production compiler instead of for a educational factor.

Posted: **Fri Jun 08, 2007 8:48 pm**

Kevin McGuire wrote: umm... ok. I agree with what you are saying, but not in the context of this thread and the thread's original post's author. You have to slow down and realize that the guy is actually trying to learn how to write a compiler not revolutionize the world with a multi-million dollar compiler, which I think is the point Brendan had in mind. Notice he said he was having trouble learning to parse syntax of some language I think that is grounds for him to think and start small. Since as you know every prototype gets rebuilt.

You are correct, and my post was not as well phrased as it could have been. *My* OS is dead-serious, and sometimes I have trouble remembering that many/most people here are doing this for fun or as a hobby. My post was also in reply to several intermediate posts in this thread that I did not bother going back to quote.

Posted: **Sat Jun 09, 2007 12:42 am**

Hi,

bewing wrote:
Kevin McGuire wrote: umm... ok. I agree with what you are saying, but not in the context of this thread and the thread's original post's author. You have to slow down and realize that the guy is actually trying to learn how to write a compiler not revolutionize the world with a multi-million dollar compiler, which I think is the point Brendan had in mind. Notice he said he was having trouble learning to parse syntax of some language I think that is grounds for him to think and start small. Since as you know every prototype gets rebuilt.
You are correct, and my post was not as well phrased as it could have been. *My* OS is dead-serious, and sometimes I have trouble remembering that many/most people here are doing this for fun or as a hobby. My post was also in reply to several intermediate posts in this thread that I did not bother going back to quote.

Even in a more general context I'm not sure I agree...

For most high level languages you'd preprocess and tokenize, convert to some sort of intermediate language, optimize (dead code elimination, loop optimizations, function inlining, etc), compile from intermediate language to intermediate assembly, optimize again (peephole, cache management, instruction scheduling), then assemble, and output a file.

Ignoring preprocessing and all the optimizing stages, this only leaves optimizing register allocation and (depending on how good your peephole optimizer is) some instruction selection in the "compile from intermediate language to intermediate assembly" stage.

Of course you'd want to add the preprocessing and optimizing stages later, but until the other stages are working right they just get in the way and make things harder IMHO....

Cheers,

Brendan

Posted: **Sat Jun 23, 2007 8:53 am**

We have system software lab......., I have done a fair bit of reaseach on tokenizing ....... , Here are the few methods that will help you in tokenzing

1) Simplest and the dirtiest :
make use of the function str_tok function available in the C - libraty
(This is not recomended ..check the man pages.....)

2) Make an adhoc lexer : Something which i initially did.......
like this (pseudo code ..not entirely correct)
char token[80];
char *exp_ptr=token;
while(!end_of_file)
{
ch=fgetc(inp_file);
if(ch==' ')ch=fgetc(inp_file); // skip white space.....
*exp_ptr=ch;
while(ch!=' ')
{
exp_ptr++;
*exp_ptr=ch;
}
*exp_ptr='\0';
return token;
}

3) Much better method is to hard code a DFA (deterministic finite automata) table into your lexer.....This is cleaner and better than the
earler methods.there are many ways of doing this
one way is to use a array for implementing the transistion function,
(see adrew s appel )or else you can again hardcode the transistion function also...

4) Easier way is to use lex ... which will automatically generate and implement a DFA (NFA) for you .......It s used like this
{%
/*put declarations here*/
%}
%%
/*
eg
a(a|b)* { printf("ab string");}

/* put rules here*/
%%

/* other functions*/

I recommend method three coz ... i am wriing a compiler to lean not to
outperform gcc or microsoft.... My compiler is not yet compleate...In fact
it's not even begun well.....:)Gee ..it works sporadically......It uses method 2... Yea i am pragmatic guy...

Posted: **Mon Jun 25, 2007 1:18 am**

It's actually strtok not str_tok() ....also plse forgive my spell errors and
other grammar mistakes(I apologize for my ambigous grammar !!!

,
which makes parsing difficult

.....

)....

Posted: **Sun Jul 08, 2007 1:51 am**

Posted: **Sun Jul 08, 2007 4:45 pm**

ok?
what do you mean "Placed in Microsoft" ?

OSDev.org

PORTABLE OBJECT CODE -- plz post ur comments

Dont worry......

Breif guide on tokeninzing

Slight correction....

Hi Guys..................