Safe Systems-Programming Language
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Candy wrote:
Colonel Kernel wrote: I've been working in the software industry for over a decade. I've seen code written by lots of people, most of whom are very smart and not lazy. They make mistakes. They are human beings, after all. Those mistakes cost time and money, and I would rather have the compiler catch those mistakes than my test team, thank you very much.
Most typos and mistakes I've seen didn't involve pointer arithmetic of the variant of adding a random number to a pointer, nor have I seen reinterpret_cast's used at all. Nor C casts with the same effect.
The example was not one of a typical mistake. It was merely meant to demonstrate that pointers are type-unsafe.
I work on database drivers and all things related to them. This involves a lot of low-level buffer manipulation involving raw pointers, casts, etc. It is very easy to get something wrong in such code, and very hard to notice until it's too late. The issue ultimately stems from the type-unsafe interfaces to these drivers, which we have no control over.
I suppose the canonical example of a typical pointer-related mistake would be an array access that's out-of-bounds. Here is another one I've seen (note that the classes involved are not STL, but a home-brewed class library written for platforms to which the STL has not been ported):
Code: Select all
// Vector is just like std::vector.
Vector v = getAVectorFromSomewhere();
assert( !v.isEmpty() ); // Assume for this example that it's not empty.
someOldCFunction( &v[0] ); // Legal in STL, and for our Vector.
// The above trick is explained in Scott Meyers' Effective STL.
// String is not like std::string -- it encapsulates an immutable buffer
// of characters that is managed by a reference count.
String str = getAStringFromSomewhere();
assert( !str.isEmpty() ); // Again, assume it's not empty...
anotherOldCFunctionThatFillsABuffer( &str[0] ); // Aaaaaarrggh!
BTW, here is another more subtle example:
Code: Select all
Foo* foo = getAFooFromSomewhere();
// Assume for the sake of this example that foo really points to an instance of Bar.
Bar* bar = (Bar*) foo; // Bar is derived from Foo.
bar->doSomething(); // May crash in the presence of MI.
Code: Select all
+-----+ <-- Where bar *should* point.
| Baz |
+-----+ <-- Where foo points.
| Foo |
+-----+
| Bar |
+-----+
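To make the failure mode concrete, here is a small, self-contained C++ sketch (class names follow the diagram; the members are made up) showing that a static_cast adjusts the pointer back to the start of the Bar object, while a reinterpret_cast, which is what the buggy C-style cast amounted to, does not:
Code: Select all
#include <cstdio>

struct Baz { int b = 0; virtual ~Baz() {} };
struct Foo { int f = 0; virtual ~Foo() {} };

// Bar derives from Baz first, then Foo, so the Foo subobject lives at a
// non-zero offset inside every Bar object (as in the diagram above).
struct Bar : Baz, Foo {
    int extra = 0;
    void doSomething() { std::printf("extra = %d\n", extra); }
};

int main() {
    Bar b;
    Foo* foo = &b;                          // implicit upcast adjusts the pointer
    Bar* ok  = static_cast<Bar*>(foo);      // adjusts back: ok == &b
    Bar* bad = reinterpret_cast<Bar*>(foo); // keeps the Foo address: points mid-object
    std::printf("static_cast ok:      %d\n", ok == &b);
    std::printf("reinterpret_cast ok: %d\n", bad == &b);
    // Calling bad->doSomething() here would be undefined behaviour.
    return 0;
}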
If the answer were as simple as hiring the world's best C++ programmers, I'm sure we would have done it by now. The problem with this solution is that it is expensive because the supply of such programmers is very limited.
Most of the time we're stuck with the equivalent of asking a taxi driver to drive a Formula-1. They're good drivers, but way out of their league. But calling them "lazy" is just stupid IMO.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
- os64dev
Note the '' around lazy, which should have given it a less harsh meaning, but OK, lazy might have been the wrong word. Nonetheless, I have also been working in the software industry for over a decade and have also seen these mistakes.
Mostly I agree with Candy that 'most typos and mistakes didn't involve pointer arithmetic or the variant of adding a random number to a pointer', but that might be how software is developed at our companies. However, I did find that junior C/C++ developers make those mistakes more often, mainly due to lack of experience and lack of teaching on the matter. In general we do a code review and search (or ask) for the locations of heavy pointer usage. If a person is tagged twice for making trivial mistakes, he or she is gently brought up to a level where they no longer make that mistake.
In some of your examples you can already see a potential problem, for instance anotherOldCFunctionThatFillsABuffer(&str[0]). The problem here, IMHO, is that you have designed/used a function that takes a pointer to the primitive type char *, so what do you expect? If you want to do string manipulation, then only have string parameters. If you need to link to old C code, as the example does, write a wrapper function, or better yet don't do it unless you rewrite the code to use strings. Here we enter the gray area, because not reusing existing code means more effort, which means more money, which means more delay, and the worlds of development and sales clash.
@Zekrazey1
Why waste effort on something that you can get the computer to handle?
1) To let developers understand what they are doing.
2) To get flexibility, which might lead to better performance.
PS. If this all sounds vague, I am sorry; I have just finished 1.5 liters of beer.
Author of COBOS
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
os64dev wrote: note the '' around lazy, which should have given a less hard meaning but ok, lazy might be wrongly chosen here.
Ok. Sorry for the harsh reaction... you're one of many to use that word in this context.
os64dev wrote: in some of your examples you already see a potential problem, for instance anotherOldCFunctionThatFillsABuffer(&str[0]), the problem here IMHO is that you have designed/used here a function that has a pointer to primitive type char *, so what do you expect.
It's a legacy C function that we have to use.
os64dev wrote: If you want to do string manipulation then only have string parameters. If you need to link to old c code as the example does, write a wrapper function or better yet don't do it unless you rewrite the code to use strings.
The problem is that someone has to write the wrapper, and that's where this kind of thing can happen.
os64dev wrote: Why waste effort on something that you can get the computer to handle?
1) To let developers understand what they are doing.
I would rather let the developers think about the domain problem and how they're going to solve it than about the nitty-gritty details of memory management.
os64dev wrote: 2) To get flexibility, which might lead to better performance.
When you really, really need it, it's good to have that flexibility. However, as I mentioned before, there are advances in static type systems and compiler optimizations (e.g. dependent types, better whole-program optimization) that will mean really good performance and type safety at the same time. I'm interested to see how these things pan out in the coming years.
os64dev wrote: PS. If this all sound vague i am sorry, i have just finished 1.5 liter of beer.
Sounds like a good idea.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
os64dev wrote: 1) To let developers understand what they are doing.
2) To get flexibility, which might lead to better performance.
I see a safe language as being one in which you can specify that certain requirements must be met at a lower level, and have 'errors' (in quotes because you could incorrectly define what an error is) prevented through some language mechanism when you switch to a higher level.
A simple example is the private keyword. The programmer labels something private and the compiler enforces what it means to be private. It's safe in the sense that once you've decided that something should be private and labelled it so within a particular scope, you can go about your business outside of that scope without worrying that you're going to break whatever requirements led you to decide it should be private. It's not safe in the sense that you are absolutely stopped from stuffing around with it because you can always switch context (mentally) and change it to public.
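A minimal C++ illustration of that point (the class and its invariant here are made up):
Code: Select all
class Account {
    long balance_ = 0;            // private: the "never negative" invariant lives here
public:
    bool withdraw(long amount) {
        if (amount < 0 || amount > balance_) return false;
        balance_ -= amount;
        return true;
    }
    void deposit(long amount) { if (amount > 0) balance_ += amount; }
};

int main() {
    Account a;
    a.deposit(100);
    a.withdraw(30);
    // a.balance_ = -1;           // error: 'balance_' is private within this context
    return 0;
}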
Colonel Kernel wrote:
Code: Select all
// Vector is just like std::vector.
Vector v = getAVectorFromSomewhere();
assert( !v.isEmpty() ); // Assume for this example that it's not empty.
someOldCFunction( &v[0] ); // Legal in STL, and for our Vector.
// The above trick is explained in Scott Meyers' Effective STL.
// String is not like std::string -- it encapsulates an immutable buffer
// of characters that is managed by a reference count.
String str = getAStringFromSomewhere();
assert( !str.isEmpty() ); // Again, assume it's not empty...
anotherOldCFunctionThatFillsABuffer( &str[0] ); // Aaaaaarrggh!
That should give a compile error, or one out of four people screwed up. In increasing order of likelihood:
1. The compiler writer, for not checking const correctness.
2. The function writer for taking an argument of type const char * and stripping const
3. You, for stripping const off explicitly (which I don't see here, so it's not likely)
4. The author of String who knows that his buffer is CONST and returns a non-const pointer to it.
Colonel Kernel wrote: I know that down-casting is typically a dubious practice anyway, but imagine for the sake of argument that this is one of the rare instances when it's necessary. Clearly this code should be using dynamic_cast, or at least static_cast if RTTI can't be used for whatever reason (portability, performance, etc.). For whatever reason, the knob who wrote this code didn't know that C-style casts could very well be interpreted as reinterpret_cast in this context. Imagine that Bar is derived from Foo, but also from Baz. What happens if the object layout looks like this?
Code: Select all
Foo* foo = getAFooFromSomewhere();
// Assume for the sake of this example that foo really points to an instance of Bar.
Bar* bar = (Bar*) foo; // Bar is derived from Foo.
bar->doSomething(); // May crash in the presence of MI.
Your compiler should / must handle this case properly to be C++ compliant. It can never assume that the base class is at the same place if it even barely touches MI.
Colonel Kernel wrote: static_cast or dynamic_cast will do the appropriate pointer adjustment for you, but reinterpret_cast (or possibly the C-style cast, depending on your compiler) will not. Yes, I've seen this happen (with an older version of GCC).
You should have a checkin script that makes people that use reinterpret_cast ask you personally for agreement. It should never be used unless you're hacking - and when you're hacking you shouldn't be working on a product in an archive.
Colonel Kernel wrote: If the answer were as simple as hiring the world's best C++ programmers, I'm sure we would have done it by now. The problem with this solution is that it is expensive because the supply of such programmers is very limited. Most of the time we're stuck with the equivalent of asking a taxi driver to drive a Formula-1. They're good drivers, but way out of their league. But calling them "lazy" is just stupid IMO.
Did you consider sidewheels?
On your "must use library, library is evil" note - wrap it. Basics of OO programming - if you have ANYTHING non-trivial, encapsulate it and hide the complexity. If your library can write to a buffer, wrap it so you can use the string class for holding a result, or some other class. If it offers only a buffer-overflow unsafe function, wrap it with code that makes it impossible (effectively) to overflow the buffer. If it offers only a excruciating interface to use, design your own and wrap the library into that interface.
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Candy wrote: That should give a compile error, or one out of four people screwed up. In increasing order of likelihood:
1. The compiler writer, for not checking const correctness.
2. The function writer for taking an argument of type const char * and stripping const
3. You, for stripping const off explicitly (which I don't see here, so it's not likely)
4. The author of String who knows that his buffer is CONST and returns a non-const pointer to it.
You forgot option #5 -- I mis-remembered the example. The C function actually did take a const char* and didn't attempt to modify it. The problem is actually that our String class does not null-terminate its internal buffer, because it is optimized for taking sub-strings efficiently. Each instance stores its own length, and several instances can share the same buffer but point to different parts of it.
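To show why that bites, here is a toy model of such a String (purely illustrative, not the real class): &str[0] is a perfectly valid pointer to the first character, but there is no '\0' at the logical end, so any C function that assumes one reads too far.
Code: Select all
#include <cstddef>
#include <cstdio>
#include <cstring>

// Toy model: a shared, immutable character buffer plus an offset and a
// length, with NO terminating '\0' guaranteed at the logical end.
class String {
    const char* buf_;
    std::size_t off_, len_;
public:
    String(const char* buf, std::size_t off, std::size_t len)
        : buf_(buf), off_(off), len_(len) {}
    const char& operator[](std::size_t i) const { return buf_[off_ + i]; }
    std::size_t length() const { return len_; }
};

int main() {
    static const char shared[] = "HELLOWORLD";   // one buffer backing two substrings
    String hello(shared, 0, 5);                  // logically "HELLO"
    // A C function doing strlen(&hello[0]) walks past the logical end and
    // sees "HELLOWORLD" -- or, without this demo's trailing '\0', far worse.
    std::printf("length() = %zu, strlen sees %zu\n",
                hello.length(), std::strlen(&hello[0]));
    return 0;
}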
Candy wrote: Your compiler should / must handle this case properly to be C++ compliant. It can never assume that the base class is at the same place if it even barely touches MI.
Do you mean it should interpret the C-style cast as a static_cast or dynamic_cast instead of a reinterpret_cast in this case? I would tend to agree. It was a very old version of GCC, and it's probably been fixed by now.
Candy wrote: You should have a checkin script that makes people that use reinterpret_cast ask you personally for agreement. It should never be used unless you're hacking - and when you're hacking you shouldn't be working on a product in an archive.
As I said, we write database drivers. I'm pretty sure it's impossible to do type-unsafe things with buffers without using reinterpret_cast. Your idea is good though... maybe it's time to bring out the handcuffs.
To bring this back to the larger discussion though, my point is that this kind of holding-by-the-hand and watching-over-the-shoulder solution is irritating and expensive.
You mean "training wheels", but yes, I did. There was no time. The API we're implementing (defined by M$, not us) is 16 years old and has several hundred violations of basic type safety baked right in. Next time I will insist on extra time in the schedule for such wrapping though...Did you consider sidewheels?
Candy wrote: On your "must use library, library is evil" note - wrap it. Basics of OO programming - if you have ANYTHING non-trivial, encapsulate it and hide the complexity. If your library can write to a buffer, wrap it so you can use the string class for holding a result, or some other class. If it offers only a buffer-overflow unsafe function, wrap it with code that makes it impossible (effectively) to overflow the buffer. If it offers only an excruciating interface to use, design your own and wrap the library into that interface.
All good advice, but it doesn't help if the only developers available to do the wrapping make mistakes like the ones I mentioned above...
My point has been and continues to be this -- why put up with this extra cost in terms of training, testing, and review, when the compiler can help out more?
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Since I have more expertise in programming languages than in operating systems, I am interested in doing a type-safe systems programming language.
TYPE SYSTEM
The language I dream of shall have a strong, static type system, i.e. types are checked at compile time, and arbitrary transformations of one type to another are not allowed.
The language shall have the following basic types:
- bit
- int8 (8-bit signed integer)
- int16 (16-bit signed integer)
- int32 (32-bit signed integer)
- int64 (64-bit signed integer)
- uint8 (8-bit unsigned integer)
- uint16 (16-bit unsigned integer)
- uint32 (32-bit unsigned integer)
- uint64 (64-bit unsigned integer)
- float32 (32-bit floating point, IEEE format)
- float64 (64-bit floating point, IEEE format)
- int (platform-specific signed integer)
- uint (platform-specific unsigned integer)
- float (platform-specific floating point)
- char (8-bit unsigned integer)
- pointer
- function pointer
- array (not pointer; statically declared dimensions)
- array pointer (pointer to array plus length)
- struct (as in C)
- union (as in C, but no overlapping of pointers)
- tuple (like structs, but with unnamed members)
- volatile (as in C)
- atomic (use atomic instructions to access the variable)
- enum (as in C, but typed)
- set (as in Pascal)
- endianness type.
- constant (a non modifiable memory location)
- logical union (algebraic type)
- value (for avoiding invalid constant value mixing)
- range (from .. to)
- strong type (copy a type to another type)
- state type (for avoiding invalid states)
- null
- cpu register (for accessing CPU registers)
- alias (as in C typedef)
The type system is strong, i.e. uint8 is not char.
Pointers are not nullable by default. In order to make a pointer nullable, one has to use a logical union:
Code: Select all
typedef Point = {x:int, y:int};
typedef NullablePointPtr = @Point | null;
This is extremely important, because it would force checking of nullable pointers at compile-time.
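In today's C++ one can only approximate the non-nullable default with a small wrapper; this is a sketch (NotNull and draw are made-up names), but it shows the kind of guarantee the proposal would build into the language itself:
Code: Select all
#include <cstdio>

struct Point { int x, y; };

// A pointer-like wrapper that can only be built from a real object, so it
// can never hold null; the nullable case must be spelled out separately.
template <typename T>
class NotNull {
    T* p_;
public:
    explicit NotNull(T& ref) : p_(&ref) {}
    T* operator->() const { return p_; }
    T& operator*() const { return *p_; }
};

void draw(NotNull<Point> p) {                 // callee never needs a null check
    std::printf("(%d, %d)\n", p->x, p->y);
}

int main() {
    Point pt{3, 4};
    draw(NotNull<Point>(pt));
    // draw(NotNull<Point>(nullptr));         // does not compile: no such constructor
    return 0;
}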
There is a clear distinction between pointers and arrays: array types are either statically declared with specific sizes, or array references, i.e. a pointer to a element plus length.
The compiler will force the programmer, through pattern matching, to statically manage cases where the array index is not within the appropriate range.
Atomic types are those types that are handled using the CPU's atomic instructions.
Endianness types define the endianness of a value. By default, the endianness is little (least significant byte first), but the programmer should be able to declare the endianness of a value.
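As an illustration of what an endianness type buys you, here is a C++ sketch (be_uint32 is a made-up name) of a field that is stored big-endian in memory but always accessed as a native integer, so byte order can never be forgotten at a call site:
Code: Select all
#include <cstdint>
#include <cstdio>

struct be_uint32 {
    std::uint8_t b[4];                       // most significant byte first
    void set(std::uint32_t v) {
        b[0] = std::uint8_t(v >> 24); b[1] = std::uint8_t(v >> 16);
        b[2] = std::uint8_t(v >> 8);  b[3] = std::uint8_t(v);
    }
    std::uint32_t get() const {
        return (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
               (std::uint32_t(b[2]) << 8)  |  std::uint32_t(b[3]);
    }
};

int main() {
    be_uint32 len;
    len.set(0x12345678);                     // always laid out as 12 34 56 78
    std::printf("0x%08x\n", len.get());      // reads back as 0x12345678 on any host
    return 0;
}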
Value types are used in the cases where a distinct value shall only be used at specific places. For example, if I have a struct with A and B members and A takes only X, Y and B takes only N, M values, I should not be allowed to mix them. Example:
Code: Select all
typedef VALUE1 = 0;
typedef VALUE2 = 1;
typedef VALUE3 = 0;
typedef VALUE4 = 1;
typedef foo = struct {
x:VALUE1 | VALUE2,
y:VALUE3 | VALUE4
};
The above code declares specific values allowable on members x and y. It is not possible to do x = VALUE3, for example.
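A rough C++ analogue of the same restriction, using scoped enums (the names mirror the example above):
Code: Select all
// x and y each accept only their own named values, so mixing them is a
// compile error rather than a silently wrong integer.
enum class XValue { VALUE1, VALUE2 };
enum class YValue { VALUE3, VALUE4 };

struct foo {
    XValue x;
    YValue y;
};

int main() {
    foo f;
    f.x = XValue::VALUE1;     // fine
    f.y = YValue::VALUE4;     // fine
    // f.x = YValue::VALUE3;  // does not compile: wrong value set for this member
    return 0;
}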
Ranges are types which can be instantiated either by a static value within the range or a variable which is proven statically to contain a value within the range; again, pattern matching is used for this functionality.
Strong types allow the definition of new types from existing ones, but the two are different and can not be mixed in the same expression.
State types manage state transitions: they allow only specific changes to a variable. This is important because it ensures that a resource can be used in only one way and not another. State types are not limited to data; they apply to code as well. For example, in order to declare a value which flip-flops between 0 and 1, the following code would be required:
Code: Select all
typedef flip_flop = 0 => 1 => 0;
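A very rough C++ sketch of the same idea, where illegal transitions simply fail to compile (FlipFlop and toggle are made-up names):
Code: Select all
// The only operation is toggle, and it can only take you 0 -> 1 or 1 -> 0,
// because those are the only overloads that exist.
template <int State> struct FlipFlop {};

FlipFlop<1> toggle(FlipFlop<0>) { return {}; }
FlipFlop<0> toggle(FlipFlop<1>) { return {}; }

int main() {
    FlipFlop<0> off;
    FlipFlop<1> on   = toggle(off);   // 0 -> 1
    FlipFlop<0> off2 = toggle(on);    // 1 -> 0
    // FlipFlop<1> bad = toggle(on);  // does not compile: toggle(FlipFlop<1>) yields FlipFlop<0>
    (void)off2;
    return 0;
}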
CPU registers will be mapped to types like this:
Code: Select all
type EAX_INTEGER = EAX :> int32;
type EAX_UNSIGNED_INTEGER = EAX :> uint32;
When instantiating a register type, the compiler will use the declared register with the type signature implied by the type declaration. It will not be possible to use incompatible types for registers.
In case of multiple CPUs, a predefined type CONTEXT would define a CPU context, and a predefined array of CONTEXTs would define each CPU.
BINARY INTERFACE
The parameters of the binary interface would be definable using compiler switches.
Data structures would not contain implicit spare bytes inserted to align fields properly; this is something I have been bitten by many times. Based on compiler flags, the compiler would not accept data structures and variables which are not properly packed and aligned.
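For anyone who has not been bitten by this yet, a small C++ example of the implicit padding in question (the struct is made up; exact sizes depend on the target ABI):
Code: Select all
#include <cstdio>

// The compiler quietly inserts three padding bytes after 'tag' so that
// 'value' is 4-byte aligned on typical 32/64-bit ABIs.
struct OnDisk {
    char tag;          // 1 byte, then (typically) 3 hidden padding bytes
    unsigned value;    // 4 bytes
};

int main() {
    // Usually prints 8, not the 5 bytes a naive reading of the fields suggests,
    // so writing this struct straight to disk or the wire is not portable.
    std::printf("sizeof(OnDisk) = %zu\n", sizeof(OnDisk));
    return 0;
}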
The default calling convention would be 'cdecl', but the programmer would also be able to define 'naked' (i.e. no calling convention) or 'interrupt'.
ASM
Assembly would be directly embedded into the language (a compiler switch would define the architecture), with direct access to all of the CPU's instructions, but the type rules declared for the assembly block would still be enforced. In other words, if you declare EAX to be of type 'UNSIGNED INTEGER', then it would not be possible to do 'mov EAX, -1' within the block.
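For comparison, GCC-style extended inline asm already binds registers to typed variables through constraints, but it does not enforce the signedness rule described above; a small sketch (GCC or Clang on x86 assumed):
Code: Select all
#include <cstdint>
#include <cstdio>

// "=a" and "=d" bind EAX and EDX to the typed C++ variables lo and hi.
// The binding is typed, but unlike the proposal nothing stops you from
// declaring them signed and storing -1.
int main() {
    std::uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    std::printf("tsc = %llu\n",
                (static_cast<unsigned long long>(hi) << 32) | lo);
    return 0;
}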
STATEMENTS
The usual C statements apply, but with some differences:
- variables can be declared anywhere in a block, as in C++.
- the goto statements 'break', and 'continue' can have an optional label to indicate the jump target.
- switch cases do not need break; cases can accept multiple expressions.
Operators would be overloadable, as in C++ (even the operators (), [], -> and . (yes, dot)).
EXPRESSIONS
The usual C expressions apply, but with some differences:
- bitwise operations would have precedence over arithmetic operations.
- assignment would no longer be an expression, only a statement; hence postfix ++/-- would no longer be allowed.
- there is no implicit conversion from and to any type (to avoid type mixing problems).
For example, a range-typed variable can only be assigned from a value that is statically known, or proven by pattern matching, to lie within the range:
Code: Select all
var x : 10 .. 20 = 15;
fun main(args : string[]) : int {
var y : int = 0;
match (y) {
case 10..20:
x = y;
other:
}
return 0;
}
The language would have templates, as in C++.
MODULES
No include files, only symbol files which are imported (à la D). Each translation unit would have public, private and protected members. Protected members would be visible only to designated friend modules. Example:
Code: Select all
import foo.bar.cee, a.b.c, mymodules.myunit;
public:
var x : int = 0, y : int = 0;
private:
var data : @int | null = null;
protected foo.bar.cee, a.b.c:
var internals : char = 'a';
Types would be mapped either to memory addresses or to input/output addresses, creating memory mapped types and I/O mapped types, respectively. The keyword 'new' maps a variable into a specific memory address, whereas the keywords 'input' and 'output' map a variable into the I/O address space. Examples:
Code: Select all
typedef POINT = {x : int, y : int};
fun main(args : string[]) : int {
var p1 : @POINT = new (0xfff0000) POINT;
var io1 : int8 = input (0x60) int8;
var io2 : int8 = output (0x60) int8;
}
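The closest existing C++ analogue to mapping an object at a chosen address is placement new; a minimal sketch (using a local buffer so it is safe to run, where kernel code would instead use a pointer to a mapped MMIO region):
Code: Select all
#include <new>
#include <cstdio>

struct Point { int x, y; };

int main() {
    // Placement new constructs an object at a caller-chosen address, roughly
    // what the proposal's "new (address) POINT" expresses in the type system.
    alignas(Point) unsigned char storage[sizeof(Point)];
    Point* p = new (storage) Point{1, 2};
    std::printf("(%d, %d)\n", p->x, p->y);
    p->~Point();
    return 0;
}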
The operators 'new', 'input' and 'output' are overloadable, and typed (unlike in C++). For example:
Code: Select all
fun new() : @POINT {
var buffer : byte[] | @POINT = alloc(sizeof(POINT));
match (typeof(buffer)) {
case typeof(byte[]):
for(i : uint8 = 0; i < sizeof(POINT); ++i) {
buffer[i] = 0;
}
}
return buffer;
}
CONCLUSION
Well, the above is by no means complete, but it is a start. What do you think? Does it cover the needs of systems development?
The only tweaks I'd make would be to take out CPU register types and I/O-port types. Those tie the language too closely to a single architecture.
Other than that and the lack of a runtime-exception system (though I can see why it doesn't have one), it's a damn nice language. Will you start working on a formal definition so a compiler can be built?
Oh, and take a look at D's template system. It looks much cleaner than that of C++.