Language Design

glauxosdever · Post by **glauxosdever** » Sat Oct 01, 2016 10:15 am

Hi,

In order to free myself from C and its interfaces, I decided to design a new language.

However, I have stumbled upon a design issue. I'm wondering what the size of the dereferenced value should be in this case.

Code: Select all

a = *0x000B8000;

If I do...

Code: Select all

u8* b = 0x000B8000;
a = *b;

...then I know the size is u8.

Maybe the first case should be disallowed? Maybe I should add casts? What do you think?
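For comparison, here is a minimal C sketch of why the pointee type has to be known before a dereference. It is not G code: `fake_memory`, `read_u8` and `read_u16` are hypothetical names, and a small local array stands in for the real address so that the example can run anywhere.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a memory region (assumption: the real VGA address is not touched). */
static const uint8_t fake_memory[4] = {0x41, 0x07, 0x42, 0x07};

/* Read one byte: the uint8_t pointee type fixes the access width at 1 byte. */
uint8_t read_u8(const void *addr) {
    return *(const uint8_t *)addr;
}

/* Read two bytes from the same address; memcpy sidesteps alignment and
   aliasing problems. The resulting value depends on the host byte order. */
uint16_t read_u16(const void *addr) {
    uint16_t value;
    memcpy(&value, addr, sizeof value);
    return value;
}
```

The same address yields a one-byte or a two-byte value depending only on the type attached to the access, which is why a bare `*0x000B8000` has no well-defined size.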

Thanks in advance!

Regards,
glauxosdever

Schol-R-LEA · Post by **Schol-R-LEA** » Sat Oct 01, 2016 11:03 am

Assuming that a is strictly a value-containing typed variable (which based on the C-style declaration syntax it presumably would be, as it implies variables have type rather than values), then I would definitely say that it should be disallowed, at least as it is given. For an untyped reference value (which a literal address would be if there isn't some sort of explicit constraint), the type of the value needs to be established in some manner, either explicitly or implicitly - and implicit type coercion is fraught with problems, especially if it defaults to a specific type.

While you could have some kind of type coercion based on the type of the variable it is being sent to, that would mean that semantic analysis of pointer values would need contextual information about the operations they are going to be used by in order to dereference the values correctly. Doing that even just for primitive values is a pretty hairy prospect; doing it for user-defined types would require meta-programming, and have far too many potential pitfalls.

(For a language where type is a property of the value rather than the variable, the question would be moot - there's simply no way that I know of to have a compiler-defined general syntax for dereferencing a literal address, because it would result in either a typeless value, or a value in a fixed type that would need to be coerced anyway. One could have a user-programmable syntax, but again you would be talking about metaprogramming, not something typically found in Algol family languages.)

MichaelFarthing · Post by **MichaelFarthing** » Sat Oct 01, 2016 11:29 am

I have designed and (mostly) implemented a language*.
My strong advice is that you should draw up something like a Backus-Naur** complete syntax statement at the start.
It might get modified as you go along, but it provides a good discipline. [Actually my start version held up quite well]

It also helps considerably with the writing of a compiler: You know you need a parser for each syntactic definition and large chunks of the work just drop out without effort. Particularly useful when dealing with the nasty recursive things.

Ideally, of course, the compiler will eventually be written in the language itself, which is a superb test. However, what do you intend to use to get up the compiler initially?

*I never got round to some of the planned 64 bit operations. It was prior to 64-bit machines.

**Naturally, of course, I didn't just use Backus-Naur. I had to alter it.
One quite useful little addition was this
<expression> :: introduces the definition of an expression. Expressions can have white between constituent parts of the definition
<<denary_positive> > :: introduces the definition of a base 10 positive integer: no white is allowed between the start and end

Kevin · Post by **Kevin** » Sat Oct 01, 2016 11:58 am

glauxosdever wrote:In order to free myself from C and its interfaces, I decided to design a new language.

Knowing what you don't want is a good first step, but did you also decide what you do want? Only if you know what your general goals for the language are can you decide the details. So what are the specific problems that you see with C and want to improve on?

glauxosdever wrote:Maybe the first case should be disallowed? Maybe I should add casts? What do you think?

Myself, I would forbid both cases and require explicit casts between integers and pointers. But my ideas of a good programming language could be different from yours.

glauxosdever · Post by **glauxosdever** » Sat Oct 01, 2016 1:06 pm

Hi,

I think I will do what Schol-R-LEA suggested, since it seems the most sensible to me at the moment. The type is indeed a property of the value.

However, semantic analysis is something I will need to do, since the compiler has to ensure, for example, that you are not dividing by zero, or that the returned value of the expression at the right side of the "=" fits in the variable at the left side of the "=". I know this can and will be complex when eventually writing the compiler, but the benefits are overwhelming.

MichaelFarthing wrote:Ideally, of course, the compiler will eventually be written in the language itself, which is a superb test. However, what do you intend to use to get up the compiler initially?

Since I plan to use the language to write the OS eventually, I will need to make a cross-compiler that runs from Linux. When the OS is somewhat mature, I will write the native compiler, which will be written in this language.

Since the initial question has been answered, I think I could speak a bit about the language as a whole. To start, I am going to call it G, since there is no systems programming language called like that, as far as I can tell. There are however other languages called G, but they are mostly domain specific and not well-known.

My general intention is to make programmer errors harder. I am aware this may annoy programmers when trying to get used to it, but I am also aware it will reduce debugging time, since errors will be more rare. Consider a divide-by-zero error. If the compiler can't ensure the divisor is non-zero, it will error right at compilation time. Consider now an out-of-bounds error, which involves using a variable as an index to access an element of a 12-element array. If the variable has the value 12 or greater, or is negative, it will definitely result in an error which, unlike the divide-by-zero error, may not even be evident at runtime. The compiler should be able to ensure the variable is in range in order to compile the code.

There should also be as much as possible well-defined behaviour. Out in the wild there are many programmers relying on undefined behaviour, and this can cause breakage of their programs on different compilers, or even on different versions of the same compiler. It is evident that even experienced programmers put much time into writing code carefully in order not to invoke undefined behaviour. A common case for undefined behaviour is uninitialised variables, and this is something I would rather forbid right from its roots (except for accessing values through pointers, where the compiler can't do anything at compile time). Another option would be to implicitly initialise to zero.

I am thinking of having allowed ranges for variables. A variable representing a weekday would have a range equal to [0, 6] or [1, 7], depending on what you like. Trying to assign the value 8 to it would result in an error, since 8 is out of range.
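To make the idea concrete, here is a rough C sketch of a range-limited weekday. C has no built-in range types, so the check happens at run time in a constructor function. `weekday_t` and `weekday_from_int` are hypothetical names, and the range [0, 6] is one of the two options mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range-limited type: a weekday in [0, 6]. */
typedef struct {
    uint8_t value;
} weekday_t;

/* Stores v in *out and returns true only when v lies in [0, 6].
   On failure *out is left untouched. */
bool weekday_from_int(int v, weekday_t *out) {
    if (v < 0 || v > 6) {
        return false;
    }
    out->value = (uint8_t)v;
    return true;
}
```

A compiler with real range types could reject a constant such as 8 at compile time and would only need a check like this one when the value is not known until run time.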

It should somehow be possible to have bounded arrays starting at some hardcoded address. An example of this is the VGA text buffer, which always starts at 0x000B8000, and is of bounded size. Maybe it could be specified if the curly brackets were omitted in case of specifying an address instead of array elements.

Having multiple types that eventually represent the same type is something I also want to get rid of. In the G language there will be only two cases when something like this will be needed; usize which is a type sized equally to the native machine word size will either be the same as u16, u32 or u64 depending on the target architecture, and isize which is a type sized equally to the native machine word size will either be the same as i16, i32 or i64 depending on the target architecture.

Booleans should not be built on top of integers like in C. Consider the "a + (b == c)" expression. Is there any real use for it?

I plan on having no standard language API as we know it. In C, there are many interfaces that may influence many aspects of OS design. I aim instead for the G language to be mostly independent from the OS-specific standard interfaces.

And, to top it off, I plan to have functions that can easily return multiple values. Some people may argue this is syntactic sugar, but I would rather disagree.
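For reference, the usual C workaround is to return a struct, roughly as in this sketch. `div_mod` and `div_mod_result` are hypothetical names, and the caller is assumed to pass a non-zero divisor and to avoid the INT_MIN / -1 case.

```c
/* Hypothetical pair of results returned together. */
typedef struct {
    int quotient;
    int remainder;
} div_mod_result;

/* Returns quotient and remainder in one value.
   Assumption: b != 0 and the division does not overflow. */
div_mod_result div_mod(int a, int b) {
    div_mod_result r;
    r.quotient = a / b;
    r.remainder = a % b;
    return r;
}
```

This works, but every such function needs its own struct declaration, which is the overhead that language-level multiple return values would remove.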

This is, in a nutshell, the proposed design. Feel free to discuss about it. I would like to get some feedback and/or more ideas.

Regards,
glauxosdever

Antti · Post by **Antti** » Sat Oct 01, 2016 1:34 pm

glauxosdever wrote:This is, in a nutshell, the proposed design. Feel free to discuss about it.

There are so many ideas that are basically exactly what Brendan has described. There is nothing wrong if both of you are comfortable with that. I apologize for bringing this up but this would be a "grey area" if you were taking full credit for this in the future.

glauxosdever · Post by **glauxosdever** » Sat Oct 01, 2016 1:45 pm

Hi,

Antti wrote:There are so many ideas that are basically exactly what Brendan has described. There is nothing wrong if both of you are comfortable with that. I apologize for bringing this up but this would be a "grey area" if you were taking full credit for this in the future.

Indeed, I have designed this through many discussions with Brendan. However, even if there were fewer ideas borrowed from others, I would not take the full credit.

Regards,
glauxosdever

glauxosdever · Post by **glauxosdever** » Sat Oct 01, 2016 2:08 pm

Hi,

Now that I think of the initial problem again, it seems that it is more complicated than I thought.

Imagine a case like...

Code: Select all

u8* b = 0x000B8000;
u8 a = *(b + 10);

...where b is a pointer, and 10 is an integer. Should it be valid or not?

I tend to call it valid because I'm used to C, although I can't justify it otherwise. Could someone give me some ideas about this?

Thanks in advance.

Regards,
glauxosdever

MichaelFarthing · Post by **MichaelFarthing** » Sat Oct 01, 2016 3:19 pm

glauxosdever wrote:Hi,
Now that I think of the initial problem again, it seems that it is more complicated than I thought.

Imagine a case like...
Code: Select all
u8* b = 0x000B8000;
u8 a = *(b + 10);
...where b is a pointer, and 10 is an integer. Should it be valid or not?

I tend to call it valid because I'm used to C, although I can't justify it otherwise. Could someone give me some ideas about this?

Remember your philosophy!
C was explicitly designed for assembler programmers who wanted high language features without divorce from the machine. It intentionally allowed programmers to do dangerous things and expected them to take the consequences.

You want an environment that is safe for programmers and that means telling them what they can and can't do.

Now there are arguments both ways. [Actually, there aren't. There are languages for different people and different circumstances].

However, you have clearly expressed what you intend your target market to be and therefore the answer to your question should be clear.
Further, you might consider whether (within your approach) you should largely ditch pointers and concentrate on arrays.

P.S. On a previous question about uninitialised variables: rather than forbid them, I chose to zeroise them. Forbidding is quite messy. Zeroising is a few bytes of code in the memory allocation routine and is surprisingly cheap on time. It could be overridden at the express request of the programmer by an optional parameter, which is useful for large memory allocations.

P.P.S. Your idea for bound checking:
This is really a Pascal concept rather than C (particularly when you start from anything other than zero). Fair enough. BUT it is only really useful if you can name the elements: not [1..4] but [Clubs, Diamonds, Hearts, Spades]. A lot of work (I don't mean the naming: I mean the bound checking and the elegant failure). And it pisses off the programmers when the machine complains that it doesn't recognise No Trumps.

[Having said that, my language did do it but I never felt entirely happy with it]

Schol-R-LEA · Post by **Schol-R-LEA** » Sat Oct 01, 2016 3:35 pm

glauxosdever wrote:Hi,

Now that I think of the initial problem again, it seems that it is more complicated than I thought.

Imagine a case like...
Code: Select all
u8* b = 0x000B8000;
u8 a = *(b + 10);
...where b is a pointer, and 10 is an integer. Should it be valid or not?

I tend to call it valid because I'm used to C, although I can't justify it otherwise. Could someone give me some ideas about this?

That depends on a large number of things in your design, such as:

strong typing - if you are looking to make the whole language type-safe (or at least restrict the potential easements of type-safety), then you will want to have a cast, or better still, an in-place constructor (more on this shortly), or else the first line would represent a type-safety hole.
type compatibility and subtyping - is u8 a subtype or range of a more general 'big_int' or 'system_int' type (which may or may not have an actual implementation of its own), and if so, do the operations of the base type apply to the subtypes automatically?
operator typing - do you see an operator (e.g., '+') as part of the type interface of a primitive type, or as a compiler-generated action that takes arguments and generates inline code? If you say the latter, then having an 'add one address to one integer' operation can be justified as syntactic sugar for something the compiler does by magic, in either order. However, if the plus operator is a part of the type, then (b + 10) would not be the same thing as (10 + b), because one would be part of the u8* type, while the other would have to be for the 'generic int literal' type, which (again depending on how you arrange it) doesn't even necessarily mean it is valid to assign to a u8, even if that is how all of the familiar languages do it, and despite the fact that the implementation (in terms of generated code) would probably be exactly the same in both cases.
user-defined types/abstract data types/classes - how do you treat user-defined structs or types, and are they 'first class' relative to primitive types? This is relevant in that, for example, the foo* pointer type is not the same as int* or any other primitive type, which means that in a strongly typed language, the plus operator wouldn't automagically apply to them unless pointers were all either sub-types (or auto-generated sub-classes) of a generic or void pointer type, or you did some kind of typeclass system like in Haskell (which would be very odd indeed for a language of the type you are talking about).
user-defined operators - can you overload an operator so as to take a user type? That opens up a number of possibilities, but also leads to some very complicated places in terms of transparency, compiling, and efficiency.

The reason I bring all this up is that you need to think through these issues, or at the very least, know enough about them to dismiss the ones which don't apply.

Mind you, I'm not sure you entirely understood the statement about value-focused typing versus variable-focused-typing (versus type-by-contract, etc.). Most languages that apply type by value don't require a type declaration on the variables, as it is seen as redundant and overly restrictive (since you are already checking the value's type, checking the variable's type doesn't gain you anything except in passing arguments to functions/methods). Value-typing usually is associated with either duck typing (e.g., Python), type inference (e.g., Haskell), or generic typing (e.g., most Lisps, where the default argument value is a cons cell with a pair of void pointers, and the typing is at yet another remove, in the values pointed to by the cons cell - immediate values and stack-local variables are seen as optimizations in Lisp compilers), and it would be unusual to add mandatory type declarations (or requiring explicit declarations of any kind for anything other than ambiguity resolution) on top of that.

My recommendation is to forbid automatic conversion of integer literals to pointer types outright, and a) have a separate syntax for address literals, such as @0x000B8000;, and/or b) provide what I called an 'in-place constructor' earlier. This basically would be a way of saying, "there exists an integer value matching the starting address of a block of memory; that block of memory is a Foo value, but is already correctly formed, so you just need to make a Foo pointer from that integer value and proceed". You should probably let the user-defined types define their own in-place c'tors to allow them to do consistency checking. If you use in-place c'tors for user types, then I would still use the address literal syntax, as otherwise you would have a problem where if you had a standard c'tor that takes a single integer value, it would be ambiguous.

Given this, then, you would write:

Code: Select all

u8* b = u8(@0x000B8000);
u8 a = *(b + @10);

Kevin · Post by **Kevin** » Sat Oct 01, 2016 3:52 pm

glauxosdever wrote:
Code: Select all
u8* b = 0x000B8000;
u8 a = *(b + 10);
...where b is a pointer, and 10 is an integer. Should it be valid or not?

You said that you want to have a language that catches as many errors as possible at compile time. This means that you want a really strict type system. b is a pointer to a single u8. There is no way that b + 10 could be a valid expression.

Things might look different if b was a pointer to an array of 20 u8. Then using C pointer arithmetics would mean that b + 10 is a pointer to the 11th element of the array. For clarity I would forbid this anyway and require writing &b[10], which makes it obvious that array bound checking will be applied.
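The difference between a bare pointer and a bounded array can be sketched in C with a pointer-plus-length pair. This is only an illustration of the bounds-checked access described above. `u8_slice` and `slice_get` are hypothetical names, and the check runs at run time rather than at compile time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounded view: a pointer that carries its element count. */
typedef struct {
    uint8_t *data;
    size_t len;
} u8_slice;

/* Copies element `index` into *out and returns true when the index is in range;
   returns false and leaves *out untouched otherwise. */
bool slice_get(u8_slice s, size_t index, uint8_t *out) {
    if (index >= s.len) {
        return false;
    }
    *out = s.data[index];
    return true;
}
```

With a 20-element buffer, index 10 succeeds and index 20 is rejected, whereas `b + 20` on a plain `u8*` would compile without complaint in C.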

I think you don't want something like C pointers with pointer arithmetic and no bounds anyway, but rather something more like references to specific objects and possibly even automatic refcounting for heap allocated objects. Bugs related to manual memory management and pointers (like buffer overflows, dangling pointers etc.) play an important role in C, and you want to avoid such problems as much as you can in a language that is designed to be safe. Of course, for a system programming language, such things can't be completely avoided, but you can try to make them much less common.

Schol-R-LEA · Post by **Schol-R-LEA** » Sat Oct 01, 2016 5:40 pm

Kevin wrote:
glauxosdever wrote:
Code: Select all
u8* b = 0x000B8000;
u8 a = *(b + 10);
...where b is a pointer, and 10 is an integer. Should it be valid or not?
You said that you want to have a language that catches as many errors as possible at compile time. This means that you want a really strict type system. b is a pointer to a single u8. There is no way that b + 10 could be a valid expression.

Things might look different if b was a pointer to an array of 20 u8. Then using C pointer arithmetics would mean that b + 10 is a pointer to the 11th element of the array.

This brings to mind something else I forgot to mention: given the specific address used in the example, I am guessing that the correct type declarations probably should be more along the lines of:

Code: Select all

type TextColor = range<u8>(0..7);
type TextIntensity = enum {Low, High};
type TextSetting = enum {Off, On};

type TextAttrib = bitfield {
    bit 0..2: bit_union<TextSetting, TextColor> {
                      bit 0:    TextSetting underline;
                      bit 0..2: TextColor fg_color;
              };
    bit 3:    TextIntensity intensity;
    bit 4..6: TextColor bg_color;
    bit 7:    TextSetting blink;
};

type TextCell = struct {
   uchar8 char; 
   TextAttrib attrib;
};

type TextVideoBuffer = TextCell[];   // a declarably-sized array of TextCells

TextVideoBuffer *text_buffers[4];

text_buffers[0] = TextVideoBuffer[80 * 25](@0x000B8000);

This syntax is just something off of the top of my head, but it should give you a general sense of what you could use.
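For readers more used to C, the bit layout in the declarations above corresponds roughly to the following shifts and masks. This is a sketch only: `make_attrib` and `make_cell` are hypothetical names, and placing the character in the low byte of the 16-bit cell assumes a little-endian machine such as x86.

```c
#include <stdint.h>

/* Pack an attribute byte using the layout from the declarations above:
   bits 0..2 foreground colour, bit 3 intensity, bits 4..6 background colour,
   bit 7 blink. Out-of-range inputs are masked rather than rejected. */
uint8_t make_attrib(uint8_t fg, uint8_t intensity, uint8_t bg, uint8_t blink) {
    return (uint8_t)((fg & 0x07u)
                     | ((intensity & 0x01u) << 3)
                     | ((bg & 0x07u) << 4)
                     | ((blink & 0x01u) << 7));
}

/* Combine a character and an attribute into one 16-bit cell value
   (character in the low byte, attribute in the high byte). */
uint16_t make_cell(uint8_t ch, uint8_t attrib) {
    return (uint16_t)(ch | ((uint16_t)attrib << 8));
}
```

The masking here silently truncates bad values, which is exactly the kind of error a range type such as TextColor would instead reject.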

alexfru · Post by **alexfru** » Sat Oct 01, 2016 7:55 pm

glauxosdever wrote: In order to free myself from C and its interfaces, I decided to design a new language.

Have you worked through all of the problematic areas of C and have you considered how these problems are dealt with in other languages?

glauxosdever wrote: ... the compiler has to ensure, for example, that you are not dividing by zero,

You can't do that in all cases, which brings the question, how is the programmer supposed to deal with it? Should the compiler insert checks into the code when it can't prove impossibility of division by zero? If so, what should that code do when it detects division by zero? Should there be a special syntax (or other construct) to tell the compiler that the situation is impossible? Or should there be a special syntax (or other construct) to check for division by zero, which the compiler will always be able to understand? (See, if you can't always prove impossibility of division by zero, the check for zero may too be misunderstood by the compiler if it's overly complex, too far away from the division operator and so on).
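One of the options listed above, a check that the compiler could always recognise, might look like this in C. It is a sketch under assumptions: `checked_div` is a hypothetical helper, and reporting failure through a boolean return value is just one possible answer to the "what should that code do" question.

```c
#include <limits.h>
#include <stdbool.h>

/* Divides a by b and stores the quotient in *out.
   Returns false, leaving *out untouched, when the division is not defined:
   a zero divisor, or INT_MIN / -1, whose quotient does not fit in an int. */
bool checked_div(int a, int b, int *out) {
    if (b == 0) {
        return false;
    }
    if (a == INT_MIN && b == -1) {
        return false;
    }
    *out = a / b;
    return true;
}
```

Because the test and the division sit in the same small function, neither the programmer nor the compiler has to reason about a check that is far away from the division operator.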

glauxosdever wrote: or that the returned value of the expression at the right side of the "=" fits in the variable at the left side of the "=".

Via a mandatory cast (e.g. Java and Go) or by employing arbitrary precision arithmetic (e.g. Python)?

What about comparing signed and unsigned integers? Are you going to do it half-assed as in C/C++ and Go, requiring multiple checks and/or casts or are you going to restore the mathematical sense for once? This is a frequent problem, often with security implications.
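To illustrate the problem: in C, `-1 < 1u` evaluates to false when both operands are plain `int` and `unsigned int`, because -1 is converted to a large unsigned value before the comparison. A mathematically correct comparison needs an extra check, as in this sketch (`less_i32_u32` is a hypothetical name).

```c
#include <stdbool.h>
#include <stdint.h>

/* Mathematically correct "a < b" for a signed and an unsigned 32-bit value. */
bool less_i32_u32(int32_t a, uint32_t b) {
    if (a < 0) {
        return true;            /* any negative number is below any unsigned value */
    }
    return (uint32_t)a < b;     /* a is non-negative, so the conversion is lossless */
}
```

A language that "restores the mathematical sense" would have its comparison operator generate this two-step test itself whenever the operand signedness differs.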

glauxosdever wrote: My general intention is to make programmer errors harder. I am aware this may annoy programmers when trying to get used to it, but I am also aware it will reduce debugging time, since errors will be more rare.

Properly learning one's tools (saws and programming languages and school-grade math) can achieve that.

glauxosdever wrote: Consider now an out-of-bounds error, which involves using a variable as an index to access an element of a 12-element array. If the variable has the value 12 or greater, or is negative, it will definitely result in an error which, unlike the divide-by-zero error, may not even be evident at runtime. The compiler should be able to ensure the variable is in range in order to compile the code.

But it can't always do that. For example, in Java you can't have an arbitrary reference (pointer). It can only point at a live object or be null. When the compiler can't prove non-nullness, it has to check for it. One way is an explicit compare instruction. Another is for array elements (and object fields) that are not farther away than a page or a few pages from the beginning of the containing object, when the page 0 (and possibly a few more) can be left unmapped, causing page faults on accesses through null references. However, if you don't know the index bounds at compile time, you still need an explicit check against the element count at run time. Again, how do you detect the problem and how do you propose to deal with it by the programmer and the generated code?

glauxosdever wrote: There should also be as much as possible well-defined behaviour.

Agreed. C has a bit too much of undefined and unspecified behavior. Java got rid of some of the absolutely unnecessary ones.

glauxosdever wrote: It is evident that even experienced programmers put much time into writing code carefully in order not to invoke undefined behaviour.

I've seen many in Android source code. And one would think that Google's got the best programmers. Apparently, not.

Ditto for Microsoft (been there) and Amazon (had a chance to be interviewed by someone who did know about undefined behavior but still insisted on it being somehow predictable or possible to reason about, lol) and the rest of the world is no better.

glauxosdever wrote: I am thinking of having allowed ranges for variables. A variable representing a weekday would have a range equal to [0, 6] or [1, 7], depending on what you like. Trying to assign the value 8 to it would result in an error, since 8 is out of range.

Again, same questions, how do you detect it and what do you propose the programmer or the generated code do when the detection succeeds?

glauxosdever wrote: Booleans should not be built on top of integers like in C. Consider the "a + (b == c)" expression. Is there any real use for it?

I use it when not disallowed to. I don't think this is an example of a big or important problem, though. I should probably repeat the initial questions so you're not sidetracked into stuff of secondary or tertiary importance or into adding features...

Have you worked through all of the problematic areas of C and have you considered how these problems are dealt with in other languages? Take the language standard and go through it if you haven't yet. Write down the problems (some are conveniently grouped in the annex devoted to undefined behavior). Then read up on other languages.

Speaking of adding useful features, have you heard of C++ proposing array/string views/spans?

glauxosdever · Post by **glauxosdever** » Sun Oct 02, 2016 1:34 am

Hi,

I think you are right.

Code: Select all

u8* b = 0x000B8000;
u8 a = *(b + 10);

should be forbidden.

Code: Select all

u8 b[80*25] = 0x000B8000;
u8 a = b[10];

should not be forbidden.

As for prefixing pointer literals with @, I am not sure whether it is really needed or not. I will think about this.

Regards,
glauxosdever

glauxosdever · Post by **glauxosdever** » Sun Oct 02, 2016 2:04 am

Hi,

alexfru wrote:You can't do that in all cases, which brings the question, how is the programmer supposed to deal with it?

The correct way (even for C) is to write...

Code: Select all

if (c != 0)
{
    a = b / c;
}

...in order not to cause a division-by-zero error on occasions.

alexfru wrote:
glauxosdever wrote:or that the returned value of the expression at the right side of the "=" fits in the variable at the left side of the "=".
Via a mandatory cast (e.g. Java and Go) or by employing arbitrary precision arithmetic (e.g. Python)?

I am mostly going for arbitrary precision arithmetic, but I need to see for myself if it is efficient too.

alexfru wrote:What about comparing signed and unsigned integers? Are you going to do it half-assed as in C/C++ and Go, requiring multiple checks and/or casts or are you going to restore the mathematical sense for once? This is a frequent problem, often with security implications.

This is something I didn't think about yet.

alexfru wrote:Properly learning one's tools (saws and programming languages and school-grade math) can achieve that.

Learning the language doesn't assure that the programmer will write good code. Besides, reading the language specifications (701 pages for C) is something even most experienced programmers haven't done.

alexfru wrote:
glauxosdever wrote:Consider now an out-of-bounds error, which involves using a variable as an index to access an element of a 12-element array. If the variable has the value 12 or greater, or is negative, it will definitely result in an error which, unlike the divide-by-zero error, may not even be evident at runtime. The compiler should be able to ensure the variable is in range in order to compile the code.
But it can't always do that. For example, in Java you can't have an arbitrary reference (pointer). It can only point at a live object or be null. When the compiler can't prove non-nullness, it has to check for it. One way is an explicit compare instruction. Another is for array elements (and object fields) that are not farther away than a page or a few pages from the beginning of the containing object, when the page 0 (and possibly a few more) can be left unmapped, causing page faults on accesses through null references. However, if you don't know the index bounds at compile time, you still need an explicit check against the element count at run time. Again, how do you detect the problem and how do you propose to deal with it by the programmer and the generated code?

If the compiler can't prove that an array access uses a valid index, the programmer will have to add checks for that.

alexfru wrote:Have you worked through all of the problematic areas of C and have you considered how these problems are dealt with in other languages? Take the language standard and go through it if you haven't yet. Write down the problems (some are conveniently grouped in the annex devoted to undefined behavior). Then read up on other languages.

I haven't worked through all of the problematic aspects of C yet. But I agree, it is something I should already have done.

Regards,
glauxosdever

OSDev.org
