C and undefined behavior

Programming, for all ages and all languages.
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

C and undefined behavior

Post by vvaltchev »

Recently, I read this blog post about C and undefined behavior:
https://www.yodaiken.com/2021/05/19/und ... ing-error/

In it, the author essentially argues that the way modern compilers treat UB is the result of a misinterpretation of the C89 standard.
What do you think about that?

IMHO, he's probably right. Even if the new fancy compiler optimizations are great, "modern C" has probably broken the philosophy behind the language.
Check out Dennis Ritchie's essay about noalias: https://www.yodaiken.com/2021/03/19/den ... uage-1988/

Really, it's not fair to claim that the millions and millions of lines of C code written in the '90s are just wrong, all written by incompetent programmers who didn't read the standard. No. Everybody agreed on what C was back then. And then things started to change, step by step; the boundaries of the C89 standard have been pushed ever since. In C99, different wording around UB allowed compilers even more freedom.

And the consequences of that are very remarkable. Let me quote the blog's author:
https://www.yodaiken.com/ wrote:[...] over time the Standard and the common compilers have made C an unsuitable language for developing a range of applications, from memory allocators, to cryptography applications, to threading libraries and, especially operating systems. We have the absurd situation that C, specifically constructed to write the UNIX kernel, cannot be used to write operating systems. In fact, Linux and other operating systems are written in an unstable dialect of C that is produced by using a number of special flags that turn off compiler transformations based on undefined behavior (with no guarantees about future “optimizations”). The Postgres database also needs some of these flags as does the libsodium encryption library and even the machine learning tensor-flow package.
C used to be the "portable assembler" where any kind of hacks/casts were allowed. If you wanted a type-safe language, you simply had to use something else. At the end of the day, I don't want to question the good intentions of the people who pushed those changes forward. Actually, for a language like C++, I think it's mostly for the good. In C++ you're not supposed to do tricky casts or other unsafe low-level tricks. Even if you can, there are 1,000 things you have to consider before doing so; having to be careful when you go there is part of what we consider "idiomatic C++". Rust goes one step further, making low-level stuff impossible without super-explicit annotations etc. See the Rust kernel projects out there.

But that shouldn't be the case for C. The language was designed for the purpose of writing operating systems. It's exactly where you're supposed to do things like that, from time to time. C was the right tool for that kind of software. It was the portable assembler, really. Now, it feels more like a downgraded high-level language: a C++ without most of its features.

Still, when I write C code I'm super-pedantic and careful about UB. The fact that I don't like how modern compilers behave in some cases doesn't mean I deny UB's existence or ignore the ISO documents etc. It's the opposite: I'm obsessed with avoiding UB, because I hate dealing with UB bugs.

-----------------------------------------------------------------------------------------------------------------
EDIT[1]: This text is rough and seriously rushed. Please check the whole discussion.
Changes: I added the word "probably" in two places and fixed some typos like missing "lines of code".

EDIT[2]: For future readers ending up here, who might not desire to read a LONG and unstructured discussion about undefined behavior, I believe it's worth sharing a few articles on the same topic:
EDIT[3]: Fix typos: because -> before, does -> goes
Last edited by vvaltchev on Mon Aug 16, 2021 11:19 am, edited 3 times in total.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: C and undefined behavior

Post by Korona »

I think it's not as simple as the blog post depicts it. My understanding is that the author wants to replace undefined behavior with implementation-defined behavior (the compiler may do what it wants, but it must document it) or unspecified behavior (the compiler may do what it wants but the program remains well-formed).
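
To make the distinction concrete, here is a minimal sketch (my own examples, not from the blog post) of the three categories:

Code: Select all

#include <limits.h>

int f( int );
int g( int );

/* implementation-defined: right-shifting a negative value is allowed,
   but the implementation must document the result it produces */
int ex_impl_defined( int x ) { return x >> 1; }

/* unspecified: f() and g() may be called in either order; some valid
   result is produced and the program remains well-formed */
int ex_unspecified( int x )  { return f( x ) - g( x ); }

/* undefined: if the addition overflows, the Standard imposes no
   requirements at all on the program's behavior */
int ex_undefined( int x )    { return x + INT_MAX; }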

However, UB is used in almost every optimization. It's not just fancy and tricky ones. For example, the existence of UB allows the compiler to rewrite (a + b) + c = a + (b + c) for signed integers (for example, if b + c already has been computed before), because it knows that neither expression overflows in any valid C program, regardless of the target architecture. If we remove the UB that is triggered by signed overflow, then this optimization is still valid on x86, but invalid on other (legacy) archs.
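
A minimal sketch of that reuse situation (my example, not actual compiler output):

Code: Select all

/* Because signed overflow is UB, the compiler may assume that neither
   ( a + b ) + c nor a + ( b + c ) overflows in any valid program, so
   it is free to reassociate and reuse the already-computed t = b + c,
   saving one addition. Without that assumption, the rewrite could
   change behavior on a target whose signed addition traps or
   saturates. */
int reassoc( int a, int b, int c )
{
    int t = b + c;                   /* imagine t is needed anyway */
    return ( ( a + b ) + c ) - t;    /* may become ( a + t ) - t */
}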
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: C and undefined behavior

Post by Solar »

I, too, think that this is mostly a failure to use the correct terminology.

"Undefined behavior" is what allows C/C++ to be as effective as it is. It is what allows the implementor of strcpy() to assume that the source string is zero-terminated, that the destination holds enough space, and that the two areas do not overlap.

Code: Select all

char * strcpy( char * restrict s1, const char * restrict s2 )
{
    char * rc = s1;

    while ( ( *s1++ = *s2++ ) )
    {
        /* EMPTY */
    }

    return rc;
}
If violations of these preconditions were required to lead to specified behavior, the implementation of strcpy() would have to check for those conditions.

Code: Select all

char * strcpy( char * s1, const char * s2 )
{
    char * rc = s1;
    size_t len;

    if ( s1 == NULL || s2 == NULL )
    {
        return NULL;
    }

    /* Let us assume that strlen() somehow, magically, can check
    whether s2 is properly zero-terminated, then returns SIZE_MAX */
    if ( ( len = strlen( s2 ) ) == SIZE_MAX )
    {
        return NULL;
    }

    /* Let us assume the existence of a freemem() function that
    somehow, magically, can check the available space at a memory
    location */
    if ( freemem( s1 ) <= len )
    {
        return NULL;
    }

    memmove( s1, s2, len + 1 );

    return rc;
}
Aside from the severe impact on performance, it requires additional plumbing both on string objects and on memory objects to even work. And the program could still invoke UB if it does not handle a NULL return properly.

So, to quote that blog,
...“the Standard imposes no requirements” on how compilers can implement undefined behavior...
...which is not a "misinterpretation by self-appointed experts on comp.lang.c", but exactly the point that makes C/C++ as efficient as they are. The precondition, the agreement between the creator of the program, the creator of the compiler, and the creator of the library/libraries, is that source that contains UB is no longer standard compliant, and thus the compiler / library is not required to "implement" anything because the precondition is not met (as far as the language standard is concerned). Otherwise, a compiler would be required to provide an exhaustive static analysis of the program to ensure no UB is present, or be considered faulty. Which is, obviously, ridiculous when looking at a language that runs on actual hardware instead of inside a well-defined virtual machine.

What all this does not say is that the creator of compiler/library X could not expand on the standard. And indeed, virtually every compiler / library manufacturer does. This is perfectly permissible, in parts even desirable. It is just outside the scope of the language standard. Such compiler / library extensions, as well as any source making use of them, are no longer "strictly compliant" to the standard; they now represent a dialect of C. As long as all involved are aware of this and its implications, that is fine. Just don't expect your code to port that easily.
Every good solution is obvious once you've found it.
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

Korona wrote:However, UB is used in almost every optimization. It's not just fancy and tricky ones.
Respectfully, I disagree. From the '70s until 2005-2006, UB wasn't a "thing". Clearly, it was mentioned in the ISO documents, compiler engineers knew about it etc., but at the time C programmers didn't worry about it, just as they hadn't for the previous 30 years of C history. Compilers didn't generate "surprising" instructions, no matter what you did. If there was an integer overflow, whatever had to happen on the given arch and OS simply happened. I hope we can agree on that.

UB started slowly to become "a thing to worry about" probably around that time. This bug (about integer overflow etc.):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 is from 2007.

Now, if we agree that before that time compilers didn't generate surprising code taking advantage of UB, can we agree that they were still able to optimize? In 2005 we had something like 50 years of compiler history: plenty of optimizations that did nothing surprising, like constant propagation, unrolling, inlining, statement reordering, etc. I'm not a compiler engineer, so I can't list all the stuff that was possible without abusing UB, but I remember that even at the time there was a 2x-3x speed-up between -O0 and -O3, depending on the code, of course.
Korona wrote:For example, the existence of UB allows the compiler to rewrite (a + b) + c = a + (b + c) for signed integers (for example, if b + c already has been computed before), because it knows that neither expression overflows in any valid C program, regardless of the target architecture. If we remove the UB that is triggered by signed overflow, then this optimization is still valid on x86, but invalid on other (legacy) archs.
Well, I personally consider that a "fancy" optimization. Actually, I know about an even fancier one that extends a signed 32-bit index to a signed 64-bit one in order to generate faster code (e.g. by folding the index into the addressing mode): it assumes that the 32-bit index will never overflow. I have no problem admitting the power of such optimizations. They really made C code faster, but that came at a high price as well: C no longer does the most intuitive (for the programmer) thing. Depending on the context, that might be good or bad. I'd say that in most cases it's good, but there are some pathological cases where it's bad.
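
A sketch of the kind of loop I mean (my example; the function name is made up):

Code: Select all

/* Since overflow of the signed 32-bit index is UB, the compiler may
   assume `i` never wraps, widen it to 64 bits once, and fold it into
   the addressing mode instead of sign-extending it on every
   iteration. */
long sum_ints( const int * p, int n )
{
    long s = 0;

    for ( int i = 0; i < n; i++ )
        s += p[i];    /* 64-bit address computed from the 32-bit i */

    return s;
}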

What I'm thinking now is that during that time CPUs stopped gaining (single-core) performance at the same rate as in the past, so there was a need in the industry to "magically" make existing software faster without spending an incredible amount of resources hand-optimizing the code. That makes perfect sense, actually. It's just that it slowly changed the C language as "we knew it". Modern compilers are much, much smarter than even five-year-old ones, let alone the ones from 2005, but there's a price we are paying for that. UB bugs are hidden everywhere in decades of C code written with different assumptions. And the sad thing is that in many cases we cannot even get a warning saying: "hey, I used an optimization here that relies on UB xyz, please check that xyz can never happen, or test your code with UBSAN". In many cases, runtime testing is the only way.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: C and undefined behavior

Post by Korona »

(Not in response to your post above, you ninja'd me.)

Also note that

Code: Select all

There is No Reliable Way to Determine if a Large Codebase Contains Undefined Behavior.
is not quite true. Yes, there is no reliable way to prove the absence of UB, but there are ways to (dynamically) verify that there is no UB that the compiler exploits in optimization decisions. Namely, you can run dynamic analyzers such as UBSAN, ASAN, TSAN and MSAN.
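
For reference, these are enabled with compiler flags along these lines (UBSAN, ASAN and TSAN exist in both GCC and Clang; MSAN is Clang-only):

Code: Select all

cc    -g -fsanitize=undefined prog.c   # UBSAN: catches UB at runtime
cc    -g -fsanitize=address   prog.c   # ASAN: heap/stack buffer errors
cc    -g -fsanitize=thread    prog.c   # TSAN: data races
clang -g -fsanitize=memory    prog.c   # MSAN: uninitialized reads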
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

Korona wrote:(Not in response to your post above, you ninja'd me.)

Also note that

Code: Select all

There is No Reliable Way to Determine if a Large Codebase Contains Undefined Behavior.
is not quite true. Yes, there is no reliable way to prove the absence of UB, but there are ways to (dynamically) verify that there is no UB that the compiler exploits in optimization decisions. Namely, you can run dynamic analyzers such as UBSAN, ASAN, TSAN and MSAN.
I hope we can agree that dynamic checking is way worse than compile-time checking. With dynamic checking you need automated tests with 100% coverage to get some degree of confidence (not an absolute guarantee) that your code is not affected by UB. Since only very few projects have that many tests, it is true that "There is No Reliable Way to Determine if a Large Codebase Contains Undefined Behavior".

What the Linux kernel uses is an incredible amount of beta testing done by an unreasonable number of users and a big part of the software industry. The same thing applies to GCC and other major open source projects. If your 20-year-old legacy non-open-source project doesn't fall into that category, things get bad.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

Solar wrote:I, too, think that this is mostly a failure to use the correct terminology.

"Undefined behavior" is what allows C/C++ to be as effective as it is. It is what allows the implementor of strcpy() to assume that the source string is zero-terminated, that the destination holds enough space, and that the two areas do not overlap.

Code: Select all

char * strcpy( char * restrict s1, const char * restrict s2 )
{
    char * rc = s1;

    while ( ( *s1++ = *s2++ ) )
    {
        /* EMPTY */
    }

    return rc;
}
If strcpy() is just a regular C function, not a built-in or anything the compiler knows about, it has NOTHING to do with UB. You can assume that the strings are NUL-terminated. If they aren't, the function will produce some kind of incorrect output, possibly crashing the program in case it touches an invalid vmem area. We can call that "undefined behavior", but it's not necessarily UB from the compiler's point of view. If strcpy() is a regular function, the compiler won't see any UB, because it doesn't understand the semantics of that function. In the real-world implementation of strcpy(), though, a non-NUL-terminated string IS indeed UB according to the compiler, because strcpy() is specially tagged/uses built-ins, so the compiler can make assumptions that rely on UB never happening. But, in the general case of my_strcpy(), you don't need to involve UB at all. Just implement your function with your assumptions and add comments. No need to add an insane amount of runtime checks. At most, use ASSERTs.
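
Something like this is what I have in mind (a sketch; the contract lives in comments and debug-only ASSERTs, not in anything the compiler can exploit):

Code: Select all

#include <assert.h>

/* Preconditions (documented, not enforced): s2 is NUL-terminated,
   s1 points to a buffer large enough for the copy, and the two
   buffers do not overlap. The compiler knows nothing special about
   this function's contract; violating it just produces garbage or
   a crash, like any other buggy code. */
char * my_strcpy( char * s1, const char * s2 )
{
    char * rc = s1;

    assert( s1 != NULL && s2 != NULL );   /* cheap debug-only checks */

    while ( ( *s1++ = *s2++ ) )
    {
        /* EMPTY */
    }

    return rc;
}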
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: C and undefined behavior

Post by Korona »

vvaltchev wrote:UB started slowly to become "a thing to worry about" probably around that time. This bug (about integer overflow etc.):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 is from 2007.
By the way, the most remarkable thing about that bug is how disrespectful the poster is, wow. Suggesting "your's [your copy of GCC] is apparently some butchered distro version." to Richard Biener (= the long-time release manager of GCC) is really, ... not productive if you want to convince people that your bug is important.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
nullplan
Member
Posts: 1769
Joined: Wed Aug 30, 2017 8:24 am

Re: C and undefined behavior

Post by nullplan »

The single most damaging assumption in current (well, from the last decade or so) compilers is that undefined behavior does not happen. That assumption is false, but the compiler writers parade it around like a shield whenever you say something. That assumption makes problems non-local. You can have one mistake in your giant loop, and the compiler throws it all out. De jure, of course, the compiler writers are perfectly within their rights to do so. But the developers of clang must have gotten enough fan mail over this issue alone that they backtracked on it, and now will no longer prune the entire path and the decisions leading up to it, but rather place a UD2 instruction in that path.

The argument "just write error-free code" is a bit galling, especially coming from GCC authors, who have been known to get these things wrong themselves. Not so long ago GCC had a bug where it would call qsort() with a broken comparison operator (that wasn't transitive), leading to code that worked on glibc but not on musl. And I happen to know that glibc switches to a different sorting algorithm when it runs out of memory, so it is conceivable the code would have failed on glibc as well under high load. Those issues are always the best to debug.
Korona wrote:For example, the existence of UB allows the compiler to rewrite (a + b) + c = a + (b + c) for signed integers (for example, if b + c already has been computed before), because it knows that neither expression overflows in any valid C program, regardless of the target architecture
That is false. For one, addition in C is not associative. It is perfectly valid to have an expression in your code that would overflow if transformed with associativity in mind. For two, it is not UB that makes that optimization possible, but rather the knowledge of the compiler authors about the target architecture. It would be undefined behavior for a programmer to rearrange, for example, INT_MIN + INT_MAX + 1, based on the above rule, but a compiler that happens to know the target uses the same operation for signed and unsigned addition can make that rearrangement as any overflow occurring will be cancelled out by an overflow backwards. If the target architecture used addition with saturation, this would not be possible.
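
The cancellation can be demonstrated directly (my sketch, using unsigned arithmetic, which is well-defined, to emulate a wrapping two's-complement target):

Code: Select all

#include <limits.h>
#include <stdio.h>

int main( void )
{
    /* emulate wrapping signed addition via unsigned arithmetic */
    unsigned a = (unsigned)INT_MIN, b = (unsigned)INT_MAX, c = 1u;

    /* both associations of INT_MIN + INT_MAX + 1 print 0: the
       intermediate overflows cancel each other out */
    printf( "%u %u\n", ( a + b ) + c, a + ( b + c ) );
    return 0;
}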
Solar wrote:"Undefined behavior" is what allows C/C++ to be as effective as it is
Not specific enough. That is like saying "a traffic accident is a disruption in your itinerary". Yes, but that is not the whole story. And nobody was arguing for additional checks in strcpy(). What the blogger was arguing for was that "i << 32" should not cause the compiler to delete the entire code path. Essentially, you are in the wrong chapter. This is about language UB, not library UB (chapters 5 and 6 instead of chapter 7 in C99).
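
For reference, the construct in question is of this shape (a sketch, assuming a 32-bit unsigned int):

Code: Select all

/* C99 6.5.7 makes a shift count greater than or equal to the width
   of the (promoted) left operand undefined, so an optimizer may
   treat this path as unreachable instead of emitting the "obvious"
   shift instruction. */
unsigned shift_all( unsigned i )
{
    return i << 32;   /* language UB when unsigned int is 32 bits */
}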
Carpe diem!
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: C and undefined behavior

Post by Korona »

nullplan wrote:That is false. For one, addition in C is not associative. It is perfectly valid to have an expression in your code that would overflow if transformed with associativity in mind. For two, it is not UB that makes that optimization possible, but rather the knowledge of the compiler authors about the target architecture.
My bad, you're right. Regardless, my argument was essentially: the existence of UB makes it possible to perform these optimizations *without* knowing the specifics of the target architecture, and I think that argument still holds.

Note that I am all for sanitizers that check for the presence of UB. In an ideal world, the compiler would have a mode that instruments each place where it exploits UB (I think UBSAN is not quite there yet, but it does support a lot of checks). Regarding the non-locality: UB basically stops at the translation unit boundary (unless you enable LTO). I guess the argument can be made that if you want to keep the possibility of UB localized, the trade-off is that you cannot use LTO. That argument is similar in spirit to the "compilers should not optimize based on UB" argument -- your code is slower but also exhibits more reproducible behavior.

I also agree that it would be nicer if the absence of UB could be verified at compile time, but C is just not the right language for that, whether you turn UB into unspecified behavior or not. ("Unspecified behavior" is the C++ standard's term for "any result may be returned, but some result must be returned and the program remains well-formed".)
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

Korona wrote:
vvaltchev wrote:UB started slowly to become "a thing to worry about" probably around that time. This bug (about integer overflow etc.):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 is from 2007.
By the way, the most remarkable thing about that bug is how disrespectful the poster is, wow. Suggesting "your's [your copy of GCC] is apparently some butchered distro version." to Richard Biener (= the long-time release manager of GCC) is really, ... not productive if you want to convince people that your bug is important.
I totally agree with that. The author was too aggressive. But in substance he had a point: something that had worked for many years in a ton of code stopped working.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

nullplan wrote:The single most damaging assumption in current (well, from the last decade or so) compilers is that undefined behavior does not happen. That assumption is false, but the compiler writers parade it around like a shield whenever you say something. That assumption makes problems non-local. You can have one mistake in your giant loop, and the compiler throws it all out. De jure, of course, the compiler writers are perfectly within their rights to do so. But the developers of clang must have gotten enough fan mail over this issue alone that they backtracked on it, and now will no longer prune the entire path and the decisions leading up to it, but rather place a UD2 instruction in that path.
I totally agree on that.
nullplan wrote:The argument "just write error-free code" is a bit galling, especially coming from GCC authors, who have been known to get these things wrong themselves. Not so long ago GCC had a bug where it would call qsort() with a broken comparison operator (that wasn't transitive), leading to code that worked on glibc but not on musl. And I happen to know that glibc switches to a different sorting algorithm when it runs out of memory, so it is conceivable the code would have failed on glibc as well under high load. Those issues are always the best to debug.
The argument "just write error-free code" is bad not only because UB strikes in very subtle ways sometimes, but because it doesn't help with millions of lines of legacy code. We cannot just magically fix them all, even if we could write error-free code. Of course, in reality, nobody can.
nullplan wrote:And nobody was arguing for additional checks in strcpy(). What the blogger was arguing for was that "i << 32" should not cause the compiler to delete the entire code path.
Yep. That's what I call "abuse of UB".
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck
alexfru
Member
Posts: 1111
Joined: Tue Mar 04, 2014 5:27 am

Re: C and undefined behavior

Post by alexfru »

The standard didn't try to contain undefined behavior in any way.
It didn't define that, say, signed integer overflow is fully contained within the (say, multiplication) operator that causes it and that the operator only produces a wrong value and there are no other ill side effects.

Early computers and compilers, though, couldn't do much in terms of code analysis and optimizations and therefore a lot of instances of UB seemed contained and rarely surprising. Technological progress contributed to said analysis and optimizations and made UB bleed, creep and spread further beyond simple operations causing it.
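
A tiny sketch of that "bleeding" (my example): the UB sits in the multiplication, yet its effect escapes the operator entirely.

Code: Select all

/* Since signed overflow is UB, a compiler may fold x * 2 / 2 down to
   plain x. On a wrapping machine the unoptimized code would return a
   different value for large x, so the consequence of the UB is no
   longer contained in the multiplication itself. */
int half_double( int x )
{
    return x * 2 / 2;   /* commonly optimized to just `x` */
}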

My position is that a lot of code is indeed nonconformant (almost all of it). And it has happened because people have been getting away with it. Perhaps, today they don't get away with some of the tricks of the past, but they still do with others or they rely (often, unknowingly) on implementation-specific behavior, thinking that's *the* language, whereas it's just an extension to it.

You can embrace the language standard and enable all kinds of "sanitizers" to reveal various obscure UBs in the code.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: C and undefined behavior

Post by Solar »

vvaltchev wrote:UB started slowly to become "a thing to worry about" probably around that time. This bug (about integer overflow etc.):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 is from 2007.

Now, if we agree that before that time compilers didn't generate surprising code taking advantage of UB...
I don't agree. That example program in that bug report is stupid, and it was stupid right from Ritchie's first iteration of C.

assert( X + Y > X ) is always true in correct C (given Y > 0). The compiler may rely on that, and as the statement is a no-op, it may be optimized away. The correct way to check is well-known and established (assert( INT_MAX - Y >= X )). Again, what we see here is somebody who has not understood UB.
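
In code (a sketch of the two checks, assuming Y >= 0):

Code: Select all

#include <assert.h>
#include <limits.h>

void add_guarded( int x, int y )   /* precondition: y >= 0 */
{
    /* assert( x + y > x ) is a tautology in UB-free C, so the
       compiler may legitimately optimize it away */

    assert( INT_MAX - y >= x );    /* well-defined overflow guard */

    /* ... x + y is now safe to compute ... */
}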
vvaltchev wrote:We can call that "undefined behavior", but it's not necessarily UB from the compiler's point of view.
But that is the point of UB, isn't it? There is no "compiler's point of view" on UB. UB is a condition a compiler has no obligation to detect or handle. Which does, of course, not mean it cannot generate a warning. Just that it does not have to (because it rightly couldn't for all cases).
vvaltchev wrote:But, in the general case of my_strcpy(), you don't need to involve UB at all. Just implement your function with your assumptions and add comments.
No I won't. Those are not "my" assumptions, they are preconditions for using strcpy() correctly.
nullplan wrote:The single most damaging assumption in current (well, from the last decade or so) compilers is that undefined behavior does not happen. That assumption is false, but the compiler writers parade it around like a shield whenever you say something.
No. Actually, UB was part and parcel of C (and many other languages) right from the beginning. If you remember what kind of hardware C was developed on, and for, it should become obvious that the kind of elaborate static analysis you are calling for was unrealistic at the time. UB was, right from the start, to explicitly allow compilers / libraries (and their creators) to take shortcuts.

If you don't like that, you will have to switch to a VM'ed language.
nullplan wrote:Yes, but that is not the whole story. And nobody was arguing for additional checks in strcpy(). What the blogger was arguing for was that "i << 32" should not cause the compiler to delete the entire code path. Essentially, you are in the wrong chapter. This is about language UB, not library UB (chapters 5 and 6 instead of chapter 7 in C99).
I am perfectly aware of that. I picked strcpy() as a simple and everyday example. I still uphold that the blogger (as well as vvaltchev and yourself) is barking up the wrong tree.

Has he checked with the creators of that optimization to actually understand why it tosses the code path, how it positively affects correct code? Apparently not. So what remains is some "mimimi" about "my broken code doesn't work".

Oh, and by the way, GCC emits a warning about i << 32, even without -Wall. So...???
Every good solution is obvious once you've found it.
vvaltchev
Member
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: C and undefined behavior

Post by vvaltchev »

alexfru wrote:The standard didn't try to contain undefined behavior in any way.
It didn't define that, say, signed integer overflow is fully contained within the (say, multiplication) operator that causes it and that the operator only produces a wrong value and there are no other ill side effects.

Early computers and compilers, though, couldn't do much in terms of code analysis and optimizations and therefore a lot of instances of UB seemed contained and rarely surprising. Technological progress contributed to said analysis and optimizations and made UB bleed, creep and spread further beyond simple operations causing it.

My position is that a lot of code is indeed nonconformant (almost all of it). And it has happened because people have been getting away with it. Perhaps, today they don't get away with some of the tricks of the past, but they still do with others or they rely (often, unknowingly) on implementation-specific behavior, thinking that's *the* language, whereas it's just an extension to it.

You can embrace the language standard and enable all kinds of "sanitizers" to reveal various obscure UBs in the code.
That has been exactly my position for many years. But recently I've started to question that idea. How can we claim that nobody, not even the early C programmers, ever actually knew the language properly? How do you define "the language"? You can argue that the language is exclusively described by the first ISO standard, but:
  • How can you prove that no misinterpretation ever occurred and disprove the theory according to which an update of the standard in the transition C89 -> C99 really changed the spirit of the language?
  • Even if no misinterpretation ever occurred, how can you be sure that the wording of the first C standard correctly expressed the intention of its creators? In other words, while trying to understand the philosophy of the language, are you willing to consider as evidence not just the first ISO standard but also the code written by its creators and the opinions they expressed over time?
IMHO, formally the language is now defined solely by the ISO standard, but that doesn't prove that the language hasn't fundamentally changed since its inception. What we call "C" today is no longer what "C" was meant to be by its original creators; that's ultimately my point here. I'm not absolutely sure about this theory, I'm just starting to believe in it.

Some clues about that misinterpretation theory? Well, check Dennis Ritchie's comments in his essay about "noalias" which, fortunately, was never included in the language, at least not in that form:
Dennis Ritchie wrote: Let me begin by saying that I’m not convinced that even the pre-December qualifiers (`const’ and `volatile’) carry their weight; I suspect that what they add to the cost of learning and using the language is not repaid in greater expressiveness. `Volatile,’ in particular, is a frill for esoteric applications, and much better expressed by other means. Its chief virtue is that nearly everyone can forget about it. `Const’ is simultaneously more useful and more obtrusive; you can’t avoid learning about it, because of its presence in the library interface. Nevertheless, I don’t argue for the extirpation of qualifiers, if only because it is too late.

The fundamental problem is that it is not possible to write real programs using the X3J11 definition of C. The committee has created an unreal language that no one can or will actually use. While the problems of `const’ may owe to careless drafting of the specification, `noalias’ is an altogether mistaken notion, and must not survive.
He is very skeptical about const and volatile, let alone "noalias". Now look at this other statement:
Dennis Ritchie wrote: 2. Noalias is an abomination
`Noalias’ is much more dangerous; the committee is planting timebombs that are sure to explode in people’s faces. Assigning an ordinary pointer to a pointer to a `noalias’ object is a license for the compiler to undertake aggressive optimizations that are completely legal by the committee’s rules, but make hash of apparently safe programs.
He was deeply concerned about giving compilers the license to make aggressive optimizations.

Finally:
Dennis Ritchie wrote: Noalias must go. This is non-negotiable.

It must not be reworded, reformulated or reinvented. The draft’s description is badly flawed, but that is not the problem. The concept is wrong from start to finish. It negates every brave promise X3J11 ever made about codifying existing practices, preserving the existing body of code, and keeping (dare I say it?) `the spirit of C.’
So, he wanted the ISO document to codify the existing practices and preserve the spirit of C, while other people obviously pushed in another direction. After the ISO C89 standard was released (which, in my view, was a big compromise between all the parties), people belonging to the other school of thought continued pushing harder and harder. In the end, one small battle after another, they won completely. Maybe for good, maybe for bad; I don't want to judge.

Because of that, today C is indeed much more optimizable than it was in the past. I totally agree with that, and for many things I'd agree that's for the good, actually.

Still, I care to remark that modern C is not what the C language was meant to be, and that the theory according to which for decades the UNIX programmers (even the ones working closely with K&R) had no idea what C really was is wrong and unfair to them. It somehow belittles their work and them as programmers.

-------------
EDIT: link to Dennis Ritchie's essay:
https://www.yodaiken.com/2021/03/19/den ... uage-1988/

Second link in case the first gets broken one day:
https://www.lysator.liu.se/c/dmr-on-noalias.html
Last edited by vvaltchev on Mon Jun 07, 2021 7:03 am, edited 1 time in total.
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck