Unions, program proofs and the Halting Problem

User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
Rusky wrote:
Brendan wrote:I need to support assembly language or it'd be useless; and there won't be any standard library (or any standard library lock implementation). I'd also want to ensure that it's possible to use structure/union members in block-free code.
Rust supports inline assembly, and the lock part of its standard library should be trivially portable to or reimplemented in a freestanding environment. There's no reason you couldn't use structures/unions in block-free code either. A lock is only required if the variable is mutable and shared between threads (actually this is enforced by the type signatures of the library functions for threads).
Rust supports inline assembly [and therefore there's no guarantee that the "type" was set correctly when any member of the union was set], and the lock part of its standard library should be trivially portable to or reimplemented in a freestanding environment [which is entirely irrelevant for me given that there are no libraries at all and therefore no difference between hosted and freestanding]. There's no reason you couldn't use structures/unions in block-free code either [except that you'd have to atomically write both the union member and the "type" at the same time, which is simply not possible in most cases]. A lock is only required if the variable is mutable and shared between threads (actually this is enforced by the type signatures of the library functions for threads [which means it's not enforced at all because anyone can write their own alternatives to the library functions instead]).
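Brendan's atomicity objection can be made concrete. A lock-free update of a tag plus a union member only works when both fit into a single atomic word; the sketch below (the names store_tagged/load_tagged are illustrative, not from the thread) packs an 8-bit tag and a 32-bit payload into one AtomicU64 so the pair always changes together. With members wider than the largest atomic type, this packing is impossible, which is exactly his point.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Pack the tag into the high bits and the payload into the low bits, so
// that one atomic store updates both at once.
fn store_tagged(cell: &AtomicU64, tag: u8, value: u32) {
    let word = ((tag as u64) << 32) | value as u64;
    cell.store(word, Ordering::SeqCst);
}

// Read the tag and payload back out of the single word.
fn load_tagged(cell: &AtomicU64) -> (u8, u32) {
    let word = cell.load(Ordering::SeqCst);
    ((word >> 32) as u8, word as u32)
}
```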
Rusky wrote:
Brendan wrote:Given the choice between failing to co-locate structure members when it is possible, and making programmers learn and use an extra union thing plus an explicit tag; then the former is the "least worst" alternative.

For co-locating structure members; the halting problem is a broken analogy. It assumes that I must have a correct decision (no "false positives" and no "false negatives") and this is a very wrong assumption. I do need "no false positives", because a false positive leads to co-locating structure members that shouldn't be co-located (a compiler that automatically inserts bugs into perfect source code). However; I do not need "no false negatives", because a false negative only means a missed opportunity for optimisation. If I only co-locate structure members when it's trivial to prove that it's safe, then that's good enough (but obviously "better" would be better).

The only real question here is; how much better than "only when trivial" is possible?
The main uses of unions are generally not the trivial cases, and I suspect they are not the provable cases either. The main uses of unions always involve tags anyway (or they're bitcasting, which you can do in other ways). Thus, trying to auto-detect potentially overlapping members in structs is probably not worth it.
So you're suggesting that, given that I will not bother with unions at all, I should also not bother trying to co-locate structure members?
Rusky wrote:The alternative, enforcing safe tagged unions, is not really anything new (although I would ask, why is giving programmers a new tool a bad thing per se?). Not only that, it enables several useful patterns that are otherwise either unenforceable or overly verbose. In Rust, it looks like this:

Code: Select all

enum Shape {
    Circle { center: Point, radius: f64 },
    Rectangle { top_left: Point, bottom_right: Point }
}
It is accessed like this:

Code: Select all

match shape {
    Circle { radius: radius, .. } => f64::consts::PI * square(radius),
    Rectangle { top_left: top_left, bottom_right: bottom_right } => {
        (bottom_right.x - top_left.x) * (top_left.y - bottom_right.y)
    }
}
As far as a programmer should care, that's functionally identical to a structure with a "type" field and a "switch(something->type) {". Why do you think programmers using a high level language should need to do explicit micro-optimisations by hand? A programmer shouldn't have to think about all of the potential use cases of their data structure before being forced to decide whether to use a structure or union.

What happens in 3 years' time when the programmer who decided to use your unions realises that they need to add a "type = both_circle_and_rectangle" to the enum? Do they have to find all of the existing code that does anything with the old union and rewrite all of it so it works for structures instead before they can start thinking about adding any new code for the "both_circle_and_rectangle" case?
Rusky wrote:One example of a new feature (that's been mentioned before in this thread) is null safety using Option types. In Rust, pointers cannot be null. Thus, if you need to store a nullable pointer, you wrap the usual pointer type in a union:
If it allows assembly then pointers can be null. Of course null is only one of a very large number of invalid values that a pointer could contain; so caring about null and not caring about all of the other invalid values is relatively short-sighted.
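For context, the Option pattern Rusky refers to looks like this in post-1.0 Rust (the function first_byte is an illustrative name, not code from the thread). None reuses the null bit pattern, so the wrapper costs nothing at runtime while still forcing safe code to handle the "no value" case:

```rust
use std::mem::size_of;

// Option<&T> compiles to an ordinary pointer: the compiler uses the null
// representation for None, so there is no extra tag word.
fn first_byte(data: &[u8]) -> Option<&u8> {
    data.first()
}
```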


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Owen »

Brendan wrote:Hi,
Rusky wrote:
Brendan wrote:I need to support assembly language or it'd be useless; and there won't be any standard library (or any standard library lock implementation). I'd also want to ensure that it's possible to use structure/union members in block-free code.
Rust supports inline assembly, and the lock part of its standard library should be trivially portable to or reimplemented in a freestanding environment. There's no reason you couldn't use structures/unions in block-free code either. A lock is only required if the variable is mutable and shared between threads (actually this is enforced by the type signatures of the library functions for threads).
Rust supports inline assembly [and therefore there's no guarantee that the "type" was set correctly when any member of the union was set], and the lock part of its standard library should be trivially portable to or reimplemented in a freestanding environment [which is entirely irrelevant for me given that there are no libraries at all and therefore no difference between hosted and freestanding]. There's no reason you couldn't use structures/unions in block-free code either [except that you'd have to atomically write both the union member and the "type" at the same time, which is simply not possible in most cases]. A lock is only required if the variable is mutable and shared between threads (actually this is enforced by the type signatures of the library functions for threads [which means it's not enforced at all because anyone can write their own alternatives to the library functions instead]).
Rust supports inline assembly in unsafe blocks. You know, the bits which get access to raw pointers and all. The bits which should immediately jump out at you because they say unsafe right next to them!

The bits which let you implement useful, safe-to-use libraries. The bits which are included in the language so you can build new useful components the standard library didn't anticipate (or because you are doing freestanding work and don't have access to the standard library).

Rust has "unsafe" because there are always going to be edge cases where it is tricky or impossible for the compiler to check. Instead, we have a bunch of humans check, and then as long as those humans were correct the rest of our code is safe. Or, in other words, it allows us to contain our dangerous bits of code in self-contained, easy to find units.

Yes, you can break things from the unsafe blocks. "Doctor, it hurts when I do this" "So don't do it?"
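Owen's point, as a minimal sketch (get_first is a hypothetical function, not from the thread): the unsafe operation lives inside a safe function whose check establishes the precondition, so callers never need unsafe themselves.

```rust
// A safe API over an unsafe operation: the bounds check above the unsafe
// block is exactly the human-verified reasoning Owen describes.
fn get_first(slice: &[u8]) -> Option<u8> {
    if slice.is_empty() {
        None
    } else {
        // SAFETY: the slice is non-empty, so index 0 is in bounds.
        Some(unsafe { *slice.get_unchecked(0) })
    }
}
```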
Brendan wrote:
Rusky wrote:
Brendan wrote:Given the choice between failing to co-locate structure members when it is possible, and making programmers learn and use an extra union thing plus an explicit tag; then the former is the "least worst" alternative.

For co-locating structure members; the halting problem is a broken analogy. It assumes that I must have a correct decision (no "false positives" and no "false negatives") and this is a very wrong assumption. I do need "no false positives", because a false positive leads to co-locating structure members that shouldn't be co-located (a compiler that automatically inserts bugs into perfect source code). However; I do not need "no false negatives", because a false negative only means a missed opportunity for optimisation. If I only co-locate structure members when it's trivial to prove that it's safe, then that's good enough (but obviously "better" would be better).

The only real question here is; how much better than "only when trivial" is possible?
The main uses of unions are generally not the trivial cases, and I suspect they are not the provable cases either. The main uses of unions always involve tags anyway (or they're bitcasting, which you can do in other ways). Thus, trying to auto-detect potentially overlapping members in structs is probably not worth it.
So you're suggesting that, given that I will not bother with unions at all, I should also not bother trying to co-locate structure members?
Rusky wrote:The alternative, enforcing safe tagged unions, is not really anything new (although I would ask, why is giving programmers a new tool a bad thing per se?). Not only that, it enables several useful patterns that are otherwise either unenforceable or overly verbose. In Rust, it looks like this:

Code: Select all

enum Shape {
    Circle { center: Point, radius: f64 },
    Rectangle { top_left: Point, bottom_right: Point }
}
It is accessed like this:

Code: Select all

match shape {
    Circle { radius: radius, .. } => f64::consts::PI * square(radius),
    Rectangle { top_left: top_left, bottom_right: bottom_right } => {
        (bottom_right.x - top_left.x) * (top_left.y - bottom_right.y)
    }
}
As far as a programmer should care, that's functionally identical to a structure with a "type" field and a "switch(something->type) {". Why do you think programmers using a high level language should need to do explicit micro-optimisations by hand? A programmer shouldn't have to think about all of the potential use cases of their data structure before being forced to decide whether to use a structure or union.

What happens in 3 years' time when the programmer who decided to use your unions realises that they need to add a "type = both_circle_and_rectangle" to the enum? Do they have to find all of the existing code that does anything with the old union and rewrite all of it so it works for structures instead before they can start thinking about adding any new code for the "both_circle_and_rectangle" case?
You need to rewrite it all anyway, because you just added a new value of type. Or was there no "type" member, and your code was doing random things to an uninitialized "circle" or "rectangle" member?
Brendan wrote:
Rusky wrote:One example of a new feature (that's been mentioned before in this thread) is null safety using Option types. In Rust, pointers cannot be null. Thus, if you need to store a nullable pointer, you wrap the usual pointer type in a union:
If it allows assembly then pointers can be null. Of course null is only one of a very large number of invalid values that a pointer could contain; so caring about null and not caring about all of the other invalid values is relatively short-sighted.

Cheers,

Brendan
In Rust, unless unsafe blocks are used, all pointers are valid. Of course, every program contains some unsafe code that some human must check (hopefully only the unsafe code in the standard library; Servo, Mozilla's experimental HTML rendering engine, manages this, for example), but as long as those humans were right, the compiler's assertions hold.
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
Owen wrote:Rust supports inline assembly in unsafe blocks. You know, the bits which get access to raw pointers and all. The bits which should immediately jump out at you because they say unsafe right next to them!

The bits which let you implement useful, safe-to-use libraries. The bits which are included in the language so you can build new useful components the standard library didn't anticipate (or because you are doing freestanding work and don't have access to the standard library)
So, I'll just write myself a small function in assembly that accepts a pair of 64-bit integers and uses the first integer as an address to write the second integer, and then mark it as "unsafe". Now; either:
  • code that calls this function also has to be marked as unsafe (and therefore most code in most projects ends up correctly marked as "unsafe" due to "internally unsafe" library functions; and everyone learns to just slap "unsafe" on everything and ignore it); or
  • code that calls a function that's marked as unsafe can pretend that it's still safe even though it's not, and therefore most code ends up being unsafe but not marked as unsafe (creating a false sense of safety that's worse than nothing)
Owen wrote:Rust has "unsafe" because there are always going to be edge cases where it is tricky or impossible for the compiler to check. Instead, we have a bunch of humans check, and then as long as those humans were correct the rest of our code is safe. Or, in other words, it allows us to contain our dangerous bits of code in self-contained, easy to find units.
Excellent. I'll have to remember that Rust developers provide this service the next time I write thousands of lines of assembly and want someone to check it all. :)
Owen wrote:
Brendan wrote:As far as a programmer should care, that's functionally identical to a structure with a "type" field and a "switch(something->type) {". Why do you think programmers using a high level language should need to do explicit micro-optimisations by hand? A programmer shouldn't have to think about all of the potential use cases of their data structure before being forced to decide whether to use a structure or union.

What happens in 3 years' time when the programmer who decided to use your unions realises that they need to add a "type = both_circle_and_rectangle" to the enum? Do they have to find all of the existing code that does anything with the old union and rewrite all of it so it works for structures instead before they can start thinking about adding any new code for the "both_circle_and_rectangle" case?
You need to rewrite it all anyway, because you just added a new value of type. Or was there no "type" member, and your code was doing random things to an uninitialized "circle" or "rectangle" member?
I mostly use "default" to handle missing cases if they actually occur. This means I can add a new value for the "type" member without breaking anything, and then worry about implementing the new code after.


Cheers,

Brendan
User avatar
Rusky
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: The Mill: a new low-power, high-performance CPU design

Post by Rusky »

Brendan wrote:So, I'll just write myself a small function in assembly that accepts a pair of 64-bit integers and uses the first integer as an address to write the second integer, and then mark it as "unsafe". Now; either:
  • code that calls this function also has to be marked as unsafe (and therefore most code in most projects ends up correctly marked as "unsafe" due to "internally unsafe" library functions; and everyone learns to just slap "unsafe" on everything and ignore it); or
  • code that calls a function that's marked as unsafe can pretend that it's still safe even though it's not, and therefore most code ends up being unsafe but not marked as unsafe (creating a false sense of safety that's worse than nothing)
Owen wrote:Rust has "unsafe" because there are always going to be edge cases where it is tricky or impossible for the compiler to check. Instead, we have a bunch of humans check, and then as long as those humans were correct the rest of our code is safe. Or, in other words, it allows us to contain our dangerous bits of code in self-contained, easy to find units.
Excellent. I'll have to remember that Rust developers provide this service the next time I write thousands of lines of assembly and want someone to check it all. :)
In Rust, certain operations are considered unsafe: raw pointers (i.e. non-owned and non-borrowed; it's nothing to do with runtime representation), inline assembly, etc. Functions can also be marked unsafe, in which case they can use unsafe operations and call other unsafe functions internally. If that were it, you would be right.

Instead, there are also unsafe blocks. These allow unsafe operations inside, but are a message to the compiler and the programmer that "I've made sure what's in here is safe, even though the compiler doesn't know it." APIs that wrap unsafe operations use safe functions with unsafe blocks inside, so that if you do run into a bug like a dangling pointer or a memory leak or whatever, you know the problem is in an unsafe block.
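The distinction can be sketched like this (write_at and store are hypothetical names, not from any library): an unsafe fn pushes the proof obligation onto every caller, while a safe function containing an unsafe block discharges the obligation once, internally.

```rust
// Callers of this function must themselves write `unsafe` and justify it.
unsafe fn write_at(addr: *mut u64, value: u64) {
    unsafe { *addr = value; }
}

// This wrapper is safe to call: the obligation is discharged here, once.
fn store(slot: &mut u64, value: u64) {
    // SAFETY: a &mut u64 is always a valid, aligned, exclusive pointer.
    unsafe { write_at(slot as *mut u64, value) }
}
```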

This way, you write the majority of your code with the safe features of the language and the safe APIs from libraries (remember the language features are verified as safe by the humans who designed the language, just like the libraries are verified by the humans who wrote them).
Brendan wrote:I mostly use "default" to handle missing cases if they actually occur. This means I can add a new value for the "type" member without breaking anything, and then worry about implementing the new code after.
Rust has a default case too...

Code: Select all

match shape {
    Circle { radius: radius, .. } => f64::consts::PI * square(radius),
    _ => /* default case here */
}
This still provides the benefits of option types for null safety, because you have to acknowledge the possibility of some other value, even if you don't care what it is. It also enforces safe access to the union more than just a switch on shape->type, because each arm of the match only scopes in the members for that variant of the union.
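For readers trying the thread's examples today: they use pre-1.0 syntax. A compilable modern-Rust version of the Shape match (Point's layout is assumed here) looks like:

```rust
struct Point { x: f64, y: f64 }

enum Shape {
    Circle { center: Point, radius: f64 },
    Rectangle { top_left: Point, bottom_right: Point },
}

// Each arm only scopes in the fields of its own variant, which is the
// enforcement Rusky describes.
fn area(shape: &Shape) -> f64 {
    match shape {
        Shape::Circle { radius, .. } => std::f64::consts::PI * radius * radius,
        Shape::Rectangle { top_left, bottom_right } => {
            (bottom_right.x - top_left.x) * (top_left.y - bottom_right.y)
        }
    }
}
```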
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
Rusky wrote:In Rust, certain operations are considered unsafe: raw pointers (i.e. non-owned and non-borrowed; it's nothing to do with runtime representation), inline assembly, etc. Functions can also be marked unsafe, in which case they can use unsafe operations and call other unsafe functions internally. If that were it, you would be right.
So, something like this (that doesn't use pointers, inline assembly, etc) is "safe":

Code: Select all

char myArray[123];

char foo(char x, char y) {
     clock_t myTime;

     myTime = clock();
     x = x/y;             // UNSAFE (potential division by zero)
     myArray[y] = x;      // UNSAFE (potential array index out of bounds)
     x = sqrt(y);         // UNSAFE (potential imaginary numbers)
     y = y & myTime;      // UNSAFE (failed to check for error returned by "clock()")
     return y*2;          // UNSAFE (potential overflow)
}
Please note that over half of the problems above will result in dodgy values silently propagating throughout the code; and are far worse than (e.g.) using a null pointer and getting an instant "SIGSEGV".
Rusky wrote:
Brendan wrote:I mostly use "default" to handle missing cases if they actually occur. This means I can add a new value for the "type" member without breaking anything, and then worry about implementing the new code after.
Rust has a default case too...

Code: Select all

match shape {
    Circle { radius: radius, .. } => f64::consts::PI * square(radius),
    _ => /* default case here */
}
This still provides the benefits of option types for null safety, because you have to acknowledge the possibility of some other value, even if you don't care what it is. It also enforces safe access to the union more than just a switch on shape->type, because each arm of the match only scopes in the members for that variant of the union.
I think you may have missed the point.

Code: Select all

enum Shape {
  Circle { center: Point, radius: f64 },
  Rectangle { top_left: Point, bottom_right: Point },
  Both { void }
}

match shape {
    Circle { radius: radius, .. } => f64::consts::PI * square(radius),
    Rectangle { top_left: top_left, bottom_right: bottom_right } => {
        (bottom_right.x - top_left.x) * (top_left.y - bottom_right.y)
    },
    Both { radius: radius, top_left: top_left, bottom_right: bottom_right } => {
        f64::consts::PI * square(radius)
            + (bottom_right.x - top_left.x) * (top_left.y - bottom_right.y)
    },
    _ => panic("Missing case detected!")
}

Cheers,

Brendan
User avatar
Rusky
Member
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: The Mill: a new low-power, high-performance CPU design

Post by Rusky »

Brendan wrote:So, something like this (that doesn't use pointers, inline assembly, etc) is "safe":

Code: Select all

char myArray[123];

char foo(char x, char y) {
     clock_t myTime;

     myTime = clock();
     x = x/y;             // UNSAFE (potential division by zero)
     myArray[y] = x;      // UNSAFE (potential array index out of bounds)
     x = sqrt(y);         // UNSAFE (potential imaginary numbers)
     y = y & myTime;      // UNSAFE (failed to check for error returned by "clock()")
     return y*2;          // UNSAFE (potential overflow)
}
Please note that over half of the problems above will result in dodgy values silently propagating throughout the code; and are far worse than (e.g.) using a null pointer and getting an instant "SIGSEGV".
Rust array indexing immediately fails on out of bounds. An even nicer implementation would use dependent types to check array access at compile time.
Functions that can return errors (e.g. clock()) return Option<T> instead of just T, forcing the programmer to handle the possibility of failure.
Division by zero, square roots, and overflow are the fault of the number specification. Rust does have versions of those operations that immediately fail on errors (or allow the programmer to handle them). However, I will admit this is a weaker area in Rust's particular implementation of this style of safety, if not in this style of safety in general (a safer way: sqrt could return a complex type, division could return an Option type, and multiplication could return a wider type).
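Rusky's points correspond to concrete standard-library operations. As a sketch, each of the silent behaviours in Brendan's C snippet becomes an explicit Option (or a panic) in Rust:

```rust
fn checked_ops() {
    // Out-of-bounds access: slice::get returns None instead of reading
    // past the end (plain indexing would panic, not corrupt memory).
    let arr = [10u8, 20, 30];
    assert_eq!(arr.get(7), None);

    // Overflow: checked_mul returns None instead of silently wrapping.
    let x: i8 = 100;
    assert_eq!(x.checked_mul(2), None);

    // Division by zero: checked_div returns None instead of trapping.
    assert_eq!(6i8.checked_div(0), None);
}
```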
Brendan wrote:I think you may have missed the point.
Nah, just didn't feel like typing out all of your examples. You clearly knew what I meant (although Both would have to specify that it's got a Circle and a Rect in it, not just void).
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
Rusky wrote:
Brendan wrote:So, something like this (that doesn't use pointers, inline assembly, etc) is "safe":

Code: Select all

char myArray[123];

char foo(char x, char y) {
     clock_t myTime;

     myTime = clock();
     x = x/y;             // UNSAFE (potential division by zero)
     myArray[y] = x;      // UNSAFE (potential array index out of bounds)
     x = sqrt(y);         // UNSAFE (potential imaginary numbers)
     y = y & myTime;      // UNSAFE (failed to check for error returned by "clock()")
     return y*2;          // UNSAFE (potential overflow)
}
Please note that over half of the problems above will result in dodgy values silently propagating throughout the code; and are far worse than (e.g.) using a null pointer and getting an instant "SIGSEGV".
Rust array indexing immediately fails on out of bounds. An even nicer implementation would use dependent types to check array access at compile time.
Functions that can return errors (e.g. clock()) return Option<T> instead of just T, forcing the programmer to handle the possibility of failure.
Division by zero, square roots, and overflow are the fault of the number specification. Rust does have versions of those operations that immediately fail on errors (or allow the programmer to handle them). However, I will admit this is a weaker area in Rust's particular implementation of this style of safety, if not in this style of safety in general (a safer way: sqrt could return a complex type, division could return an Option type, and multiplication could return a wider type).
What I'm saying here is that pointers are like a magician's shiny trinket - they draw attention away from the stuff that's really unsafe.

Division by zero, square roots and overflow are the fault of poorly specified data types. Because programmers can't specify exact ranges for variables in languages like C it becomes impossible for compilers to check if problems are possible or not. For example, I'd do "sqrt(value as f53range 0 to f53e11.max)" to say the parameter can't be negative, so you get a compile time error (not run-time error) when someone tries this function with an out of range value.
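Brendan's range types belong to his own hypothetical language; the closest analogue in an existing language is moving the check to a type boundary. A rough Rust sketch (NonNegative and safe_sqrt are made-up names), checked at construction time rather than compile time as his design would:

```rust
#[derive(Clone, Copy)]
struct NonNegative(f64);

impl NonNegative {
    // The only way to obtain a NonNegative is through this check.
    fn new(v: f64) -> Option<NonNegative> {
        if v >= 0.0 { Some(NonNegative(v)) } else { None }
    }
}

// sqrt can no longer be handed a negative input, so it never produces
// NaN for that reason.
fn safe_sqrt(v: NonNegative) -> f64 {
    v.0.sqrt()
}
```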

Basically; all of the potential problems in my example would cause compile time errors. To get it to compile, you'd have to change it to something like:

Code: Select all

myArray[123] as s8

functiondef foo(s8 x, y as range 1 to myArray.items-1) (output as range y.min*2 to y.max*2)  {
     clock_t myTime
     temp as auto        ;Actual type determined by return type of "sqrt()"

     myTime = clock()
     x = x\y             ;Provably safe
     myArray[y] = x      ;Provably safe
     temp = sqrt(y)      ;Provably safe
     if(temp <= x.max) {
         x = temp        ;Provably safe
     }
     y = y & myTime      ;Provably safe
     output = y*2        ;Provably safe
}
Note: the '\' operator is integer division. The original code used normal division ('/') which would've caused a "potential precision loss" compile time error.
Rusky wrote:
Brendan wrote:I think you may have missed the point.
Nah, just didn't feel like typing out all of your examples. You clearly knew what I meant (although Both would have to specify that it's got a Circle and a Rect in it, not just void).
You can't have an additional Circle and an additional Rectangle for Both without breaking things everywhere. For example, consider:

Code: Select all

//  if(type == Rectangle) {
    if( (type == Rectangle) || (type == Both) ) {
        area = calcRectangleArea( shape.Rectangle )
    }
Of course this would have been clearer if there were actual structures in your "union of structures". ;)


Cheers,

Brendan
User avatar
Rusky
Member
Member
Posts: 792
Joined: Wed Jan 06, 2010 7:07 pm

Re: The Mill: a new low-power, high-performance CPU design

Post by Rusky »

Brendan wrote:What I'm saying here is that pointers are like a magician's shiny trinket - they draw attention away from the stuff that's really unsafe.

Division by zero, square roots and overflow are the fault of poorly specified data types. Because programmers can't specify exact ranges for variables in languages like C it becomes impossible for compilers to check if problems are possible or not. For example, I'd do "sqrt(value as f53range 0 to f53e11.max)" to say the parameter can't be negative, so you get a compile time error (not run-time error) when someone tries this function with an out of range value.
According to this paper and this paper, buffer overflows (and the related null/dangling pointer dereference) are the most common vulnerabilities, and are even more common in critical security bugs, so I'm not sure pointers are just a red herring.

I will agree that tracking ranges is at least as important, though. I would love to see Rust add dependent types. If you're interested in existing methods for implementing them, this design is pretty good for a general-purpose language (as opposed to other dependently-typed languages that have undecidable type systems and other such nonsense).
Brendan wrote:You can't have an additional Circle and an additional Rectangle for Both without breaking things everywhere. For example, consider:

Code: Select all

//  if(type == Rectangle) {
    if( (type == Rectangle) || (type == Both) ) {
        area = calcRectangleArea( shape.Rectangle )
    }
Of course this would have been clearer if there were actual structures in your "union of structures". ;)
If you want Both to use the structures for Circle and Rectangle directly, then it's no longer a union because they can't overlap. Your example could become:

Code: Select all

struct Rect { ... } struct Circle { ... }
enum Shape { Rectangle(Rect), Circle(Circle), Both(Rect, Circle) }
area = match shape {
  Rectangle(rect) => calcRectangleArea(rect),
  Both(rect, _) => calcRectangleArea(rect),
  _ => fail!("missing case")
}
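In compilable modern Rust, this sketch would look roughly like the following (the field layouts of Rect and Circle are assumed, since the thread elides them):

```rust
struct Rect { width: f64, height: f64 }
struct Circle { radius: f64 }

enum Shape {
    Rectangle(Rect),
    Circle(Circle),
    Both(Rect, Circle),
}

fn calc_rectangle_area(r: &Rect) -> f64 {
    r.width * r.height
}

// Both carries its own Rect, so the rectangular part is matched out
// rather than read through an overlapping union member.
fn rectangular_area(shape: &Shape) -> Option<f64> {
    match shape {
        Shape::Rectangle(rect) | Shape::Both(rect, _) => Some(calc_rectangle_area(rect)),
        Shape::Circle(_) => None,
    }
}
```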
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: The Mill: a new low-power, high-performance CPU design

Post by Brendan »

Hi,
Rusky wrote:
Brendan wrote:What I'm saying here is that pointers are like a magician's shiny trinket - they draw attention away from the stuff that's really unsafe.

Division by zero, square roots and overflow are the fault of poorly specified data types. Because programmers can't specify exact ranges for variables in languages like C it becomes impossible for compilers to check if problems are possible or not. For example, I'd do "sqrt(value as f53range 0 to f53e11.max)" to say the parameter can't be negative, so you get a compile time error (not run-time error) when someone tries this function with an out of range value.
According to this paper and this paper, buffer overflows (and the related null/dangling pointer dereference) are the most common vulnerabilities, and are even more common in critical security bugs, so I'm not sure pointers are just a red herring.
Buffer overflows; and not dangling, undefined, or null pointers?

Let's look at a simple buffer overflow:

Code: Select all

char myString[123];

void foo(void) {
    int pos = 0;
    int c;

    while( ( (c = getc(stdin)) != EOF) && (c != '\n') ) {
        myString[pos] = c;          // UNSAFE (potential array index out of bounds
                                    //         + potential overflow: 'c' doesn't fit in a signed 8-bit integer)
        pos++;                      // UNSAFE (potential overflow)
    }
}
If a compiler tracks the ranges of values that are possible in each variable, those unsafe things end up being compile time errors.

Let's use a pointer:

Code: Select all

#include <stdio.h>

char myString[123];

void bar(char *string);

void foo(void) {
    bar(myString);
}

void bar(char *string) {
    int c;

    while( ( (c = getc(stdin)) != EOF) && (c != '\n') ) {
        *string = c;                // UNSAFE (potential overflow: 'c' doesn't fit in a signed 8-bit integer)
        string++;                   // UNSAFE (potential overflow)
    }
}
The compiler doesn't know the range of the pointer and has to assume the pointer can point anywhere; and therefore it can't detect the problem. How do you solve this? Integers have a range; pointers are just integers; pointers can have a range too.
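In today's C, the closest approximation is to make the range travel alongside the pointer. A sketch (the struct and function names are mine, illustrating the idea rather than Brendan's actual syntax) with the range check done at run time instead of compile time:

```c
#include <assert.h>
#include <stddef.h>

/* A "bounded pointer" emulated in C: the pointer is carried together
 * with the number of elements it may still legally access. */
struct bounded_ptr {
    char  *ptr;
    size_t remaining;
};

/* Write one byte through the bounded pointer and advance it.
 * Returns 0 on success, -1 if the write would leave the range -
 * the run-time version of the check Brendan wants at compile time. */
int bp_put(struct bounded_ptr *bp, char c) {
    if (bp->remaining == 0) {
        return -1;
    }
    *bp->ptr = c;
    bp->ptr++;
    bp->remaining--;
    return 0;
}
```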

I actually have 2 types of pointers: unbounded pointers (like pointers in C), and bounded pointers. A bounded pointer has a range. Here's the code above written with a bounded pointer:

Code: Select all

myString1[123] as s8;
myString2[999] as s8;

functiondef foo(void) (void) {
    tempUnboundedPtr as @s8           ;Note: "tempUnboundedPtr" is an unbounded pointer to a signed 8-bit integer

    bar(myString1)                    ;UNSAFE (bad type conversion, array of 123 items converted to pointer to 256 items)

    tempUnboundedPtr = #myString2[5]
    bar(tempUnboundedPtr)             ;UNSAFE (bad type conversion, unbounded pointer converted to bounded pointer)
    bar(#myString2[5])

    tempUnboundedPtr = malloc(999)
    bar(tempUnboundedPtr)             ;UNSAFE (bad type conversion, unbounded pointer converted to bounded pointer)
}

functiondef bar(string as [256]s8) (void) {  ;Note: "string" is a pointer to a group of 256 signed 8-bit integers
    pos as range 0 to string.items
    c as s8
    EOF as bool

    for(;;) {
        (EOF, c) = getc(stdin)
        if(EOF) {
            return
        }
        [pos]string = c
        pos++                         ;UNSAFE (potential overflow)
    }
}
Now let's think about "malloc()". It sucks for several reasons - it has to return an unbounded pointer with no type, it has poor cache locality, it's hard to detect problems (e.g. "free(malloc(123)+88)"), if you have an infinite loop that allocates memory you can't set a reasonable limit (you have to wait until it consumes the entire heap), and it's less fun if you want to figure out what all your RAM is being used for ("Oh my, the profiler says I'm using 99 KiB for strings!"). Let's fix all those problems:

Code: Select all

myStringPool[999][256] as s8          ;OS uses "allocation on demand" (doesn't cost 250 KiB of RAM unless all used)
    
lastStringNumber as range 0 to myStringPool.items - 1 = 0

enum StringPoolEntryState = {
    free = 0
    used
}

myStringPoolEntryStates[myStringPool.items] as StringPoolEntryState


functiondef allocString(void) (status as bool = true, string as [256]s8) {
    auto initialEntry;

    initialEntry = lastStringNumber
    while(myStringPoolEntryStates[lastStringNumber] != StringPoolEntryState.free) {
        lastStringNumber = (lastStringNumber + 1) % myStringPool.items
        if(lastStringNumber == initialEntry) {
            status = false
            return
        }
    }
    myStringPoolEntryStates[lastStringNumber] = StringPoolEntryState.used
    string = #myStringPool[lastStringNumber]
}

functiondef foo(void) (void) {
    status as bool
    string as [256]s8

    (status, string) = allocString()
    if(status) {
        bar(string)
    }
}
That's probably not the best string allocator (I'd use a stack but I was too lazy to write an "initStringPool()" for it) but you get the idea. The important thing is that all of it can be proven safe by the compiler.
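For comparison, here is roughly the same next-fit pool allocator in plain C (the sizes and names are mine). It works the same way; the difference is that nothing in C stops a caller from writing past the 256-byte buffer it gets back, which is exactly what the bounded `[256]s8` return type in Brendan's version guarantees:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_ITEMS 999
#define STRING_LEN 256

static char myStringPool[POOL_ITEMS][STRING_LEN];
static unsigned char poolEntryUsed[POOL_ITEMS];  /* 0 = free, 1 = used */
static size_t lastStringNumber = 0;

/* Next-fit scan for a free slot; returns a 256-byte buffer,
 * or NULL when the whole pool is in use. */
char *allocString(void) {
    size_t initialEntry = lastStringNumber;
    while (poolEntryUsed[lastStringNumber]) {
        lastStringNumber = (lastStringNumber + 1) % POOL_ITEMS;
        if (lastStringNumber == initialEntry) {
            return NULL;   /* pool exhausted */
        }
    }
    poolEntryUsed[lastStringNumber] = 1;
    return myStringPool[lastStringNumber];
}

/* Return a buffer obtained from allocString() to the pool. */
void freeString(char *s) {
    size_t i = (size_t)(s - &myStringPool[0][0]) / STRING_LEN;
    poolEntryUsed[i] = 0;
}
```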

There is still one problem - the caller could ignore the returned status and end up using a null bounded pointer (and even if they don't, error handling ends up on the critical path). Let's fix that using "alternative exits":

Code: Select all

functiondef allocString(exit noMemory) (string as [256]s8) {
    auto initialEntry;

    initialEntry = lastStringNumber
    while(myStringPoolEntryStates[lastStringNumber] != StringPoolEntryState.free) {
        lastStringNumber = (lastStringNumber + 1) % myStringPool.items
        if(lastStringNumber == initialEntry) {
            return noMemory                         ;Return to the alternative exit
        }
    }
    myStringPoolEntryStates[lastStringNumber] = StringPoolEntryState.used
    string = #myStringPool[lastStringNumber]
}

functiondef foo(void) (void) {
    string as [256]s8

    string = allocString(handleNoMemory)
    bar(string)
    return

handleNoMemory:
    ; Do any cleanup, etc
}
Rusky wrote:
Brendan wrote:You can't have have an additional Circle and an addition Rectangle for Both without breaking things everywhere. For example, consider:

Code: Select all

//  if(type == Rectangle) {
    if( (type == Rectangle) || (type == Both) ) {
        area = calcRectangleArea( shape.Rectangle )
    }
Of course this would have been clearer if there were actual structures in your "union of structures". ;)
If you want Both to use the structures for Circle and Rectangle directly, then it's no longer a union because they can't overlap. Your example could either be:

Code: Select all

struct Rect { ... } struct Circle { ... }
enum Shape { Rectangle(Rect), Circle(Circle), Both(Rect, Circle) }
area = match shape {
  Rectangle(rect) => calcRectangleArea(rect),
  Both(rect, _) => calcRectangleArea(rect),
  _ => fail!("missing case")
}


This is a continuation of the same problem - the offset of the Rect structure within the union is different for the enumeration's Rectangle and Both, so you're forced to have an ugly switch/case thing inside "calcRectangleArea()". If you used a shape structure instead of a shape enum, then you could've added "Both" to the structure without changing or adding any of the code in the "calcRectangleArea()" function at all.

Let's add a "square":

Code: Select all

typedef struct { double radius; } CIRCLE;
typedef struct { double width, height; } RECTANGLE;
typedef struct { double width; } SQUARE;

typedef struct {
    unsigned int hasCircle    : 1;
    unsigned int hasRectangle : 1;
    unsigned int hasSquare    : 1;
    CIRCLE circle;
    RECTANGLE rectangle;
    SQUARE square;
} SHAPE;

double getArea(SHAPE shape) {
    double area = 0;

    if(shape.hasCircle) {
        area += getCircleArea(shape.circle);
    }
    if(shape.hasRectangle) {
        area += getRectangleArea(shape.rectangle);
    }
    if(shape.hasSquare) {
        area += getSquareArea(shape.square);
    }
    return area;
}
If the compiler can determine that circle, rectangle and square are never used at the same time; then it can co-locate the structure members and the structure can be just as good as a union. If the programmer makes a minor change (e.g. so that rectangle and circle might be used at the same time) then with no additional changes whatsoever everything continues to work perfectly.
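The result of that co-location is, in effect, a compiler-generated union. A sketch of the layout the compiler could pick once it has proven that circle and rectangle are never live at the same time (the type names follow the SHAPE example above; a real compiler would do this invisibly, without changing the source):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double radius; } CIRCLE;
typedef struct { double width, height; } RECTANGLE;

/* Layout after co-location: circle and rectangle share storage,
 * so the struct is no bigger than its largest member plus the flags. */
typedef struct {
    unsigned int hasCircle    : 1;
    unsigned int hasRectangle : 1;
    union {
        CIRCLE circle;
        RECTANGLE rectangle;
    } u;
} SHAPE_COLOCATED;
```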

More specifically:
  • Step 1: Scan through the program and find out which structure members are never read (note: taking the address of a structure member is counted as a "read")
  • Step 2: Remove any writes to structure members that are never read
  • Step 3: Remove structure members that are never read (to reduce the amount of work needed for the next steps)
  • Step 4: For each variable that is a potential "key condition" (initially, every variable is a potential "key condition"), determine the state of that variable at each structure member access (e.g. remember the ranges of values of each variable at the time of the access, for all variables that have a known range of values); and remove from the list any variables that can't serve as "key conditions"
  • Step 5: Find structure members with mutually exclusive "key conditions", and co-locate them
Please note that for trivial cases step 4 is trivial (assuming the compiler is already tracking the values in variables for the purpose of making sure basic arithmetic, etc is done safely); and for non-trivial cases (e.g. too much state, halting problem, etc) step 4 won't be possible and I'll end up with "false negatives" and missed optimisations.

For example:

Code: Select all

void setShapeToCircle(SHAPE *shape, double radius) {
    shape->hasCircle = 1;
    shape->hasRectangle = 0;
    shape->hasSquare = 0;
    shape->circle.radius = radius;
}

double getCircleArea(SHAPE *shape) {
    if(shape->hasCircle) {
        return (shape->circle.radius * shape->circle.radius) * PI;
    }
    return 0;
}
From this we can see that the member "shape.circle" has the key conditions "shape->hasCircle != 0; shape->hasRectangle == 0; shape->hasSquare == 0". In a similar way we might be able to determine that the member "shape.rectangle" has the key conditions "shape->hasCircle == 0; shape->hasRectangle != 0; shape->hasSquare == 0". In that case, we know the key conditions are mutually exclusive and that these structure members can be co-located.

If someone adds this:

Code: Select all

void addCircle(SHAPE *shape, double radius) {
    shape->hasCircle = 1;
    shape->circle.radius = radius;
}
Then the previous key conditions "shape->hasRectangle == 0; shape->hasSquare == 0" get filtered out, leaving us with the key condition "shape->hasCircle != 0". In that case, there are no mutually exclusive key conditions for "shape.circle" and "shape.rectangle", and we know that these structure members cannot be co-located.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Unions, program proofs and the Halting Problem

Post by Combuster »

That sounds like ignoring what such languages have to offer in the first place. Either you expect multiple shapes to be overlayed in the average case, in which case you'd rather encode it as

Code: Select all

Shape = Collection (Option Rectangle) (Option Circle) (Option ...)
Or you're wasting memory and the typical case is just one shape with occasional composition:

Code: Select all

Shape = Rectangle
      | Circle
      | (...)
      | Composite [Shape]
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Unions, program proofs and the Halting Problem

Post by Brendan »

Hi,
Combuster wrote:That sounds like ignoring what such languages have to offer in the first place.
I'm not ignoring what languages have to offer. I've considered untagged unions and found that it's impossible to prove they've been used safely. I've also considered tagged unions and found they make the language harder to learn, make software harder to maintain/extend, and are still only partially safe (e.g. not safe at all when people are allowed to use the whole language, including inline assembly).
Combuster wrote:Either you expect multiple shapes to be overlayed in the average case, in which case you'd rather encode it as

Code: Select all

Shape = Collection (Option Rectangle) (Option Circle) (Option ...)
Or you're wasting memory and the typical case is just one shape with occasional composition:

Code: Select all

Shape = Rectangle
      | Circle
      | (...)
      | Composite [Shape]
Either the compiler's optimiser is bad and programmers need to care if their data is a structure or union; or the compiler's optimiser is good and programmers don't need to care.

Today, if I had to use a compiler that couldn't (e.g.) replace a multiplication with a shift, or remove dead code, or do common sub-expression elimination, or inline functions; I'd laugh at how crappy the compiler is. Some time in the future, maybe people will laugh at crappy compilers that can't do structure member co-location.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Unions, program proofs and the Halting Problem

Post by Combuster »

Brendan wrote:
I've also considered tagged unions and found they make the language harder to learn, make software harder to maintain/extend
Mind you, I've only experienced the exact opposite.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Unions, program proofs and the Halting Problem

Post by Brendan »

Hi,
Combuster wrote:
I've also considered tagged unions and found they make the language harder to learn, make software harder to maintain/extend
Mind you, I've only experienced the exact opposite.
You've probably spent almost 10 years having your head filled with far more idiotic nonsense than unions; and I seriously doubt that you are still able to see things from the perspective of a beginner learning their first language.

However; despite your curse, I would've hoped you'd be intelligent enough to understand that learning something is always harder than not needing to learn something; and having to take something into account is always harder than not needing to take something into account.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo

Re: Unions, program proofs and the Halting Problem

Post by embryo »

Brendan wrote:Either the compiler's optimiser is bad and programmers need to care if their data is a structure or union; or the compiler's optimiser is good and programmers don't need to care
It's not about a good or bad optimiser - or more correctly, not only about that. There needs to be information about execution-time constraints. If that information exists, then a compiler can have a good optimiser; but if there's no information, any compiler sucks.

And by the way - Java has a means to provide a compiler with such information: annotations. C/C++ have no such means. One more plus for Java :)
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Unions, program proofs and the Halting Problem

Post by Brendan »

Hi,
embryo wrote:
Brendan wrote:Either the compiler's optimiser is bad and programmers need to care if their data is a structure or union; or the compiler's optimiser is good and programmers don't need to care
It's not about a good or bad optimiser - or more correctly, not only about that. There needs to be information about execution-time constraints. If that information exists, then a compiler can have a good optimiser; but if there's no information, any compiler sucks.
Worst: No optimisation
Average: Programmer has to assist optimiser
Best: Programmer doesn't have to assist the optimiser
embryo wrote:And by the way - Java has a means to provide a compiler with such information: annotations. C/C++ have no such means. One more plus for Java :)
C/C++ have "#pragma", which mostly serves the same purpose as Java's annotations.

Of course you're right - if I try to get it right and fail, then I can try to hide my failure behind pragmas/annotations later on. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Post Reply