Language Design

Kevin · Post by **Kevin** » Sun Oct 02, 2016 3:14 pm

glauxosdever wrote:Everything that is created by the code is being initialised. Therefore...
Code: Select all
u8 a;
...is an error, or causes a to be implicitly initialised to 0. I'm undecided.

What about:

Code: Select all

u8 a;
if (condition) {
    a = getFoo();
} else {
    a = getBar();
}

Would you have to do a useless initialisation that is immediately overwritten?

Rusky · Post by **Rusky** » Sun Oct 02, 2016 3:16 pm

glauxosdever wrote:As for the lot of extra work for something that should be the default (in this case the pointer that is never null), it would be nice to hear how to specify something that is not the default (in this case the pointer that can be null).

I briefly mentioned using an Option<T> type or a T? type: given a type T, Option<T> can be either Some(t) or None, a sort of tagged union. Using a non-nullable pointer type for T, you get a nullable pointer. If the compiler exploits its knowledge that 0 is an invalid value for the pointer, Option<T*> even gets the same representation as a C-style nullable pointer.
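The representation claim can be checked in Rust, where references are the non-nullable pointer type. This is only a small sketch; the two helper function names are made up for illustration.

```rust
use std::mem::size_of;

// Hypothetical helper: true if Option<&u8> takes no more space than a
// C-style raw pointer, i.e. None is stored as the (otherwise invalid) null address.
fn option_ref_is_pointer_sized() -> bool {
    size_of::<Option<&u8>>() == size_of::<*const u8>()
}

// Hypothetical helper: true if Option<u64> is larger than u64. Every bit
// pattern of a u64 is a valid value, so None needs a separate tag.
fn option_u64_needs_tag() -> bool {
    size_of::<Option<u64>>() > size_of::<u64>()
}

fn main() {
    assert!(option_ref_is_pointer_sized());
    assert!(option_u64_needs_tag());
    println!("Option<&u8> occupies {} bytes", size_of::<Option<&u8>>());
}
```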

This forces the programmer to check for null (or at least acknowledge its possibility) because they can't get at the value without doing so. Most languages that do this use pattern matching: just as Option<T> is like a tagged union, pattern matching is like a superpowered switch statement. For example (in Rust):

Code: Select all

let maybe_bob: Option<&Person> = get_person("bob");
match maybe_bob {
    Some(bob) => { process(bob); } // here, bob is a pointer to a Person
    None => { panic!("bob doesn't exist"); } // here, bob is not in scope because maybe_bob is None
}

You can wrap up common patterns like this into helper functions: abort on missing values, provide a default value for the missing case, call a function on an existing value but pass missing values through, etc. Some languages that do this are OCaml (and other MLs), Haskell, Rust, Swift, etc.
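A minimal sketch of those three helper patterns, using methods that Rust's standard Option type provides (map, unwrap_or, expect). The Person fields, the slice-based get_person signature, and age_or_zero are assumptions made for this example only.

```rust
#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    age: u32,
}

// Hypothetical lookup over a slice; returns None when nobody has that name.
fn get_person<'a>(people: &'a [Person], name: &str) -> Option<&'a Person> {
    people.iter().find(|p| p.name == name)
}

// "Provide a default value for the missing case".
fn age_or_zero(people: &[Person], name: &str) -> u32 {
    get_person(people, name).map(|p| p.age).unwrap_or(0)
}

fn main() {
    let people = vec![Person { name: "bob".to_string(), age: 42 }];

    // "Call a function on an existing value but pass missing values through".
    let bob_age: Option<u32> = get_person(&people, "bob").map(|p| p.age);
    assert_eq!(bob_age, Some(42));
    let alice_age: Option<u32> = get_person(&people, "alice").map(|p| p.age);
    assert_eq!(alice_age, None);

    // Default for the missing case.
    assert_eq!(age_or_zero(&people, "alice"), 0);

    // "Abort on missing values": expect panics with the message if the value is None.
    let bob = get_person(&people, "bob").expect("bob doesn't exist");
    assert_eq!(bob.age, 42);
}
```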

simeonz · Post by **simeonz** » Sun Oct 02, 2016 4:55 pm

The programmer should exclude 0 from the pointer's range then. The way to do it portably is yet to be specified unfortunately.

What if all list types utilize common list logic? This logic will not be equally type safe, because it is shared by the circular and non-circular variants. For example, add_node(what, where) is a good candidate. The node structure of the "base" list type will allow null for its next pointer type. Thus, by forwarding processing to the "general" functions, the circular list inherits the unsafety of the base list. This is a limited example, but reuse and passing control between generalist and specific code are virtually everywhere. Generics may solve the problem sometimes, but not always.

glauxosdever wrote:Everything that is created by the code is being initialised.

If a string is read from a configuration file at the start of the program, whether the respective global object is statically initialized to null or non-null is beside the point. If, say, it is initialized to an empty string, it will be just as useless, even though the dereference will be a valid memory access. On the other hand, once the configuration parsing is complete, the pointer will become truly valid and the contents will become meaningful. This also introduces another nuance: correctness vs. definedness. A program that always invokes defined behavior, such as valid memory accesses, is not automatically correct. In fact, if the programmer is forced to use "defined" state for the sake of it, that initial state may end up being a set of dummy values. Although it might encourage good programming practices, in my opinion it will not contribute as much to the program's correctness. In contrast, discovering whether an invalid access actually occurs is of much greater value to the programmer.

I will look into that.

The interactive environment thing is a personal vision, inspired by some instruments. Don't read too much into it. What I was hinting at is that you are not constrained by the edit, compile, debug cycle. The classical view is that those instruments should be separate, but I think that this point can be challenged. So, basically, I think that a stricter, safer language is very much desirable, but I doubt that automatic correctness will be possible.

Sik · Post by **Sik** » Mon Oct 03, 2016 4:08 am

glauxosdever wrote:Since the initial question has been answered, I think I could speak a bit about the language as a whole. To start, I am going to call it G, since there is no systems programming language called like that, as far as I can tell. There are however other languages called G, but they are mostly domain specific and not well-known.

If you intend this to ever take off then please give it a name that would be easier to search for =P

glauxosdever wrote:My general intention is to make programmer errors harder. I am aware this may annoy programmers when trying to get used to it, but I am also aware it will reduce debugging time, since errors will be more rare. Consider a divide-by-zero error. If the compiler can't ensure the divisor is non-zero, it will error right at compilation time. Consider now an out-of-bounds error, which involves using a variable as an index to access an element of a 12-element array. If the variable has the value 12 or greater, or is negative, it will definitely result in an error which, unlike the divide-by-zero error, may not even be evident at runtime. The compiler should be able to ensure the variable is in range in order to compile the code.

Random comment: try to make it as painless as possible. A lot of language designers see programmers not adding error checks and think the problem is that they're not forced to do it, rather than wondering if it's a usability issue and what can be done to reduce the need for error checks as much as possible (and, where they are needed, to make it likely that errors get handled in a reasonable way instead of doing something dumb like aborting because the programmer is rushing to meet a deadline). Especially since having checks everywhere can add clutter.

This is not an easy problem to solve, mind you (and I'm sure there are many opinions on how to achieve this). Just asking you to first figure out where you can help simplify it while still retaining a reasonable outcome. Maybe you can find something many people didn't think of =P
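One existing way to cut down on scattered checks, shown here for the divide-by-zero case from the quote, is to put "non-zero" into the divisor's type so the check happens once at the boundary. This sketch uses Rust's standard NonZeroU32; the function names average and average_checked are invented for the example, and it is not a claim about how G would do it.

```rust
use std::num::NonZeroU32;

// The divisor's type rules out zero, so no check is needed here and the
// division cannot fail.
fn average(total: u32, count: NonZeroU32) -> u32 {
    total / count
}

// The single zero check lives where untrusted input enters the program.
fn average_checked(total: u32, count: u32) -> Option<u32> {
    NonZeroU32::new(count).map(|c| average(total, c))
}

fn main() {
    assert_eq!(average_checked(10, 2), Some(5));
    // A zero count is rejected before any division happens.
    assert_eq!(average_checked(10, 0), None);
}
```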

glauxosdever wrote:A common case for undefined behaviour is uninitialised variables, and this is something I would rather forbid right from its roots (except for accessing values through pointers, where the compiler can't do anything at compile time). Another option would be to implicitly initialise to zero.

Yeah, I'd say each type should just have a "default" value (e.g. 0 or false) that becomes the initialized value when a programmer doesn't specify one (and if it later turns out it isn't used, the compiler can easily just optimize it out). I think a lot of beginners expect variables to hold a value like that by default, so enforcing it in the language would help make it reliable.
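For comparison, Rust attaches a per-type default through its Default trait, although the programmer still has to ask for it explicitly rather than getting it on a bare declaration. A small sketch, with a hypothetical Cursor struct:

```rust
// Hypothetical struct; deriving Default gives every field its type's default.
#[derive(Debug, Default, PartialEq)]
struct Cursor {
    row: u8,
    col: u8,
    visible: bool,
}

fn main() {
    // Built-in types default to 0 and false.
    assert_eq!(u8::default(), 0);
    assert!(!bool::default());

    // The struct default is assembled from the field defaults.
    let c = Cursor::default();
    assert_eq!(c, Cursor { row: 0, col: 0, visible: false });
}
```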

glauxosdever wrote:Booleans should not be built on top of integers like in C. Consider the "a + (b == c)" expression. Is there any real use for it?

If you have an array with two elements representing an on/off state, you can use a boolean as an index to the relevant value immediately (I've done this in some cases, yeah, mostly when dealing with rendering interfaces).

Honestly though lack of that feature would be a minor problem, so just get rid of it. If you're worried, you can just add a way to explicitly cast a boolean into an integer (yielding 0 or 1), since the cast is explicit the code would already make the intention clear. This could possibly help in some other situations too.
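As a sketch of what the explicit cast looks like in a language that keeps booleans separate from integers, here is the two-element lookup in Rust. The COLOURS table and indicator_colour are hypothetical names and values.

```rust
// Hypothetical off/on colours for an indicator.
const COLOURS: [u32; 2] = [0x0040_4040, 0x0000_FF00];

fn indicator_colour(on: bool) -> u32 {
    // Explicit conversion: false becomes 0, true becomes 1.
    COLOURS[usize::from(on)]
}

fn main() {
    assert_eq!(indicator_colour(false), 0x0040_4040);
    assert_eq!(indicator_colour(true), 0x0000_FF00);
    // The `as` cast yields the same 0 or 1.
    assert_eq!(true as u8, 1);
    assert_eq!(false as u8, 0);
}
```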

While we're on casting: do not allow implicit casting between signed and unsigned. This is already a significant source of surprises in C. If somebody needs to mix together signed and unsigned integers, just require it be done with an explicit cast instead (which also lets the programmer decide which of the two types is best to use).
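A short Rust sketch of the same rule: mixing u32 and i32 in one expression does not compile there, so the programmer has to pick a conversion. The add_offset function is a made-up example.

```rust
use std::convert::TryFrom;

// Hypothetical helper: apply a signed offset to an unsigned base.
// Both operands are widened to i64 (lossless), then converted back with a check.
fn add_offset(base: u32, offset: i32) -> Option<u32> {
    let sum = i64::from(base) + i64::from(offset);
    u32::try_from(sum).ok()
}

fn main() {
    // The reinterpreting cast C performs implicitly must be spelled out here.
    assert_eq!(-1i32 as u32, 4_294_967_295);

    assert_eq!(add_offset(10, -3), Some(7));
    // A result below zero does not fit in u32, so it is reported, not wrapped.
    assert_eq!(add_offset(2, -3), None);
}
```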

alexfru wrote:And one would think that Google's got the best programmers. Apparently, not.

After fighting non-stop with some stupid Android limitations in a tablet, I can confirm.

glauxosdever · Post by **glauxosdever** » Mon Oct 03, 2016 5:20 am

Hi,

Kevin wrote:What about:
Code: Select all
u8 a;
if (condition) {
    a = getFoo();
} else {
    a = getBar();
}
Would you have to do a useless initialisation that is immediately overwritten?

The compiler should be smart enough to optimise it out.

Regards,
glauxosdever

glauxosdever · Post by **glauxosdever** » Mon Oct 03, 2016 7:05 am

Hi,

For arrays at known addresses, I decided to allow pointers to be specified with the optional keyword "indexrange", which is similar to the "range" keyword, except that it specifies the range of indexes the programmer is allowed to use when specifying an address relative to this pointer, instead of the range of values the programmer is allowed to assign.

This way, one could write...

Code: Select all

u16* indexrange[0, 80*25) b = 0x000B8000;

...in order to specify an array at a specific address.

Then, one could write...

Code: Select all

*(b + 10) = 0x65;

...or...

Code: Select all

b[10] = 0x65;

...in order to access the 11th member of the array.

Meanwhile...

Code: Select all

*(b + 80*25 + 100) = 0x65;

...and...

Code: Select all

b[80*25 + 100] = 0x65;

...would be invalid, since b was defined to allow indexes from 0 inclusive to 80*25 exclusive.
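For readers who want to experiment with the idea in an existing language, the following Rust sketch approximates indexrange with an index type whose bound is part of the type. It is only an analogy: Index, TextBuffer and CELLS are invented names, the buffer is an ordinary array rather than memory at 0x000B8000, and the range check happens at run time when the index is built, not at compile time as proposed for G.

```rust
// Hypothetical index type: a value of Index<N> always lies in 0..N.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Index<const N: usize>(usize);

impl<const N: usize> Index<N> {
    // The only way to build an Index; out-of-range values are rejected.
    fn new(i: usize) -> Option<Self> {
        if i < N { Some(Index(i)) } else { None }
    }
}

const CELLS: usize = 80 * 25;

// Stand-in for the text buffer.
struct TextBuffer {
    cells: [u16; CELLS],
}

impl TextBuffer {
    // Accepts only indexes already known to be in range.
    fn write(&mut self, at: Index<CELLS>, value: u16) {
        self.cells[at.0] = value;
    }
}

fn main() {
    let mut b = TextBuffer { cells: [0; CELLS] };

    // Counterpart of b[10] = 0x65.
    let at = Index::<CELLS>::new(10).expect("10 is inside [0, 80*25)");
    b.write(at, 0x65);
    assert_eq!(b.cells[10], 0x65);

    // Counterpart of b[80*25 + 100]: no such Index can be constructed.
    assert!(Index::<CELLS>::new(80 * 25 + 100).is_none());
}
```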

Note: I remember that in an earlier post I said I would forbid *(pointer + integer), but G is destined to be mainly a systems programming language. Therefore pointers of different forms can be useful. I am however wondering whether every relative pointer access should always happen through an index like b[10] instead of *(b + 10).

Regards,
glauxosdever

Kevin · Post by **Kevin** » Mon Oct 03, 2016 9:52 am

glauxosdever wrote:
Kevin wrote:Would you have to do a useless initialisation that is immediately overwritten?
The compiler should be smart enough to optimise it out.

If the compiler is smart enough to optimise it out, it should also be smart enough to allow me to leave out the useless explicit initialisation.
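Rust is one language whose compiler already does this analysis: a variable may be declared without a value as long as every path assigns it before it is read. A sketch of the example in Rust, with get_foo and get_bar as placeholder functions:

```rust
// Placeholder functions standing in for getFoo() and getBar().
fn get_foo() -> u8 { 1 }
fn get_bar() -> u8 { 2 }

fn pick(condition: bool) -> u8 {
    let a: u8; // declared, but deliberately not initialised
    if condition {
        a = get_foo();
    } else {
        a = get_bar();
    }
    // Accepted because both branches assign `a`; removing either assignment
    // turns this read into a compile-time error.
    a
}

fn main() {
    assert_eq!(pick(true), 1);
    assert_eq!(pick(false), 2);
}
```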

About your pointers: What is the type of b + 10 in your example? Is it u16* indexrange[-10, 80*25 - 10)? If you use a variable instead of 10, you don't necessarily know the resulting type at compile time, though. Which is fine if you're only doing runtime checks anyway, but something to be aware of.
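The arithmetic behind that question can be modelled directly: if a pointer accepts indexes in [lo, hi), then after adding k to the pointer it accepts indexes in [lo - k, hi - k). The IndexRange type below is a hypothetical model written in Rust, not part of G. Since k is an ordinary run-time value here, the resulting range is also only known at run time, which is the caveat raised above.

```rust
// Hypothetical model of an indexrange as a half-open interval [lo, hi).
#[derive(Clone, Copy, Debug, PartialEq)]
struct IndexRange {
    lo: i64,
    hi: i64,
}

impl IndexRange {
    // Moving the pointer forward by k elements shifts the valid indexes back by k.
    fn after_offset(self, k: i64) -> IndexRange {
        IndexRange { lo: self.lo - k, hi: self.hi - k }
    }

    fn allows(self, index: i64) -> bool {
        self.lo <= index && index < self.hi
    }
}

fn main() {
    let b = IndexRange { lo: 0, hi: 80 * 25 };
    let b_plus_10 = b.after_offset(10);

    // b + 10 has the range [-10, 80*25 - 10).
    assert_eq!(b_plus_10, IndexRange { lo: -10, hi: 80 * 25 - 10 });
    assert!(b_plus_10.allows(-10)); // same cell as b[0]
    assert!(!b_plus_10.allows(80 * 25 - 10)); // same cell as b[80*25], out of range
}
```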

glauxosdever · Post by **glauxosdever** » Mon Oct 03, 2016 10:02 am

Hi,

Kevin wrote:About your pointers: What is the type of b + 10 in your example? Is it u16* indexrange[-10, 80*25 - 10)? If you use a variable instead of 10, you don't necessarily know the resulting type at compile time, though. Which is fine if you're only doing runtime checks anyway, but something to be aware of.

That's why I thought of disallowing accesses through *(pointer + integer) and *(pointer + pointer). Accesses through pointer[integer] would however be allowed as long as integer fits in indexrange.

As for runtime checks, I don't plan to specify them as a language feature. Say a pointer is found to be null: does the compiler know what the programmer intends in this case? The programmer should be responsible for these checks and for specifying what happens in either case.

Regards,
glauxosdever

glauxosdever · Post by **glauxosdever** » Mon Oct 03, 2016 1:03 pm

Hi,

While discussing this in #osdev, people raised an issue I hadn't thought of. What happens if...

Code: Select all

u8 a = 10;
u8 range[0, 20]* b = &a;
a = 30;

...therefore *b not being in range anymore? This is a simple example the compiler can detect, but what happens if, for example, some other thread writes to a and the compiler can't detect it? This is something that needs to be resolved too.

Thanks in advance.

Regards,
glauxosdever

Kevin · Post by **Kevin** » Mon Oct 03, 2016 3:47 pm

The question here is which type &a has. For a simple example like this one, where a is a static variable whose address is known at build time, the type could possibly be treated as u8* range[<address of a>], which would be compatible with the type you're assigning to (extending the range is always possible). Of course, this means that you need to do this kind of check after the binary has already been linked, as some kind of sanity checking of the finished binary (or as a run-time check). With less trivial examples, it will be the programmer's job to explicitly check the pointer value of an unrestricted u8* before assigning it to a variable of smaller range.

I'm not sure if the hassle of dealing with ranges is worth it for pointers. Just distinguishing nullable and non-nullable pointers would be a lot easier and would already get you the most important advantage.

Brendan · Post by **Brendan** » Mon Oct 03, 2016 6:31 pm

Hi,

glauxosdever wrote:While discussing this in #osdev, people raised an issue I hadn't thought of. What happens if...
Code: Select all
u8 a = 10;
u8 range[0, 20]* b = &a;
a = 30;
...therefore *b not being in range anymore? This is a simple example the compiler can detect, but what happens if, for example, some other thread writes to a and the compiler can't detect it? This is something that needs to be resolved too.

This is not fine because you know "a" is changed elsewhere:

Code: Select all

u8 a = 10;
u8 range[0, 20]* b = &a;

void foo(void) {
    a = 99;
}

This is always fine because you know "a" is constant and can't be changed elsewhere:

Code: Select all

const u8 a = 10;
u8 range[0, 20]* b = &a;

This is the same as one of the cases above:

Code: Select all

u8 a = 10;
u8 range[0, 20]* b = &a;

The question is which?

If you can prove that nothing can modify "a" then you can treat it as a constant (even though the programmer didn't say it is), and this is very useful for other reasons anyway (e.g. optimisation - constant folding).

If you can do whole program optimisation or if "a" can't be accessed outside the compilation unit (a "static global" in C); and if nothing takes the address of "a"; then it's trivial to prove that nothing can modify "a". Otherwise, it might be possible to prove "a" isn't modified in theory but it's too hard in practice (see note), and you'd assume "a" might be modified.

The "u8 range[0, 20]* b = &a" takes the address of "a"; so you'd assume "a" might be modified because something took its address, so you'd assume that "u8 range[0, 20]* b = &a" is not fine (even if it is fine).

Note: "possible to prove in theory but too hard in practice" is something that will (hopefully) improve as the compiler gets more sophisticated. Ignore this (for now). Focus on writing a compiler that actually works, and only worry about improving it (e.g. sophisticated methods of avoiding "false negatives", better optimisation, syntactical sugar, etc) after it works.

Kevin wrote:I'm not sure if the hassle of dealing with ranges is worth it for pointers. Just distinguishing nullable and non-nullable pointers would be a lot easier and would already get you the most important advantage.

For my work/research, I decided it wasn't worth the hassle. More specifically, I decided it was better to encourage the use of arrays (where array indices are checked), and that for "rare use" of pointers (e.g. in system code) there's a good chance (for an OS using a micro-kernel like mine) that any dodgy pointer problems would result in a page fault that makes the problem easy to find/debug anyway.

Cheers,

Brendan

Kevin · Post by **Kevin** » Tue Oct 04, 2016 2:07 am

glauxosdever wrote:
Code: Select all
u8 a = 10;
u8 range[0, 20]* b = &a;
a = 30;
...therefore *b not being in range anymore?

Oops, I think I misinterpreted this as being a pointer with a range of addresses instead of a pointer to an integer with range. Maybe copying C syntax really isn't the best idea.

u8 range[0, 20]* isn't completely compatible with u8*. You can cast the former to the latter, but not the other way round. This is basically leading to the principles that you would call covariance and contravariance in OOP.
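A rough Rust model of that asymmetry, using a wrapper type in place of G's ranged integer. Ranged20 and its methods are hypothetical. The sketch also shows why the safe direction only holds for read-only views: handing out a writable plain u8 pointer to the same storage would allow exactly the a = 30 problem, so writes go through a checked setter instead.

```rust
// Hypothetical stand-in for G's `u8 range[0, 20]`.
#[derive(Debug, PartialEq)]
struct Ranged20(u8);

impl Ranged20 {
    // Plain u8 to ranged: needs a check, because not every u8 is in 0..=20.
    fn new(v: u8) -> Option<Ranged20> {
        if v <= 20 { Some(Ranged20(v)) } else { None }
    }

    // Ranged to plain, read-only: always fine, every value in 0..=20 is a valid u8.
    fn as_u8(&self) -> &u8 {
        &self.0
    }

    // Writes are checked so the stored value can never leave the range.
    // Returns false and leaves the value untouched if v is out of range.
    fn set(&mut self, v: u8) -> bool {
        if v <= 20 {
            self.0 = v;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut a = Ranged20::new(10).expect("10 is within 0..=20");

    let view: &u8 = a.as_u8();
    assert_eq!(*view, 10);

    // The equivalent of `a = 30` is refused and the old value is kept.
    assert!(!a.set(30));
    assert_eq!(*a.as_u8(), 10);

    // The other direction requires a check that can fail.
    assert!(Ranged20::new(30).is_none());
}
```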

glauxosdever · Post by **glauxosdever** » Tue Oct 04, 2016 8:57 am

Hi,

Thank you for your answers!

I think I will simply forbid assignments of addresses of variables to pointers, unless the ranges are the same.

However, for code that has no knowledge of the allowed range and writes to the address where the ranged integer/float lives, the compiler can't do anything, and I'll probably have to treat that as undefined behaviour.

Anyway, no sane programmer would try to do that.

Regards,
glauxosdever

Schol-R-LEA · Post by **Schol-R-LEA** » Tue Oct 04, 2016 5:45 pm

glauxosdever wrote: Anyway, no sane programmer would try to do that.

If you'd like, I can point you in the directions of some insane programmers, starting with myself. Both TempleOS and SpectateSwamp have been mentioned here before, but they are just the tip of that iceberg.

In any case, history has proven that assuming some programmer, sane or not, would never do some particularly idiotic thing almost guarantees that one will. It usually takes nothing more than a slight error in judgment at 3AM when you have a deadline to be met at 9:30. It rarely takes even that; simple curiosity or carelessness is easily enough, especially if it seems to be working on a cursory inspection. Since this sort of thing can often escape the notice of even a detailed testing regimen, that assumption starts to look very bad indeed.

Sik · Post by **Sik** » Wed Oct 05, 2016 2:36 am

Or just the program being complex enough that you do it without realizing (because the outcome is obscured by lots of layers of complexity). Incidentally, this also tends to be the reason why undefined behavior in C is dangerous: you can run into it by pure accident in extremely subtle ways due to how everything interacts.
