Korona wrote:One reason is that the "trivial cases" only truly appear after inlining and other optimizations.
True.
Korona wrote:You do not gain anything by replacing x/0 by ud2. The source code will not contain a literal x/0 anyway. Only after inlining and constant folding (+ potentially other optimizations), such statements will appear. And it only makes sense to assume that they don't happen if you can also act on that, not only by inserting a fault, but also by propagating the information that the code path is dead out of a conditional or a loop etc.
I disagree. You do gain something valuable: you prevent such wrong code from spreading, which matters if you take the standard seriously and consider that one day the compiler might act differently.
Let me give an example from the CPU world. Copying the text from Wikipedia:
wikipedia wrote: the AMD specification requires that the most significant 16 bits of any virtual address, bits 48 through 63, must be copies of bit 47 (in a manner akin to sign extension). If this requirement is not met, the processor will raise an exception. [...]
This feature eases later scalability to true 64-bit addressing.
That's a perfect example. While the 16 most significant bits were completely useless at the time, AMD required them to be copies of bit 47 precisely to prevent software from stuffing arbitrary values into them and then dereferencing the pointer. Now that there are systems with 56 bits of indexable virtual address space, we can benefit from that early restriction. Why did apparently no compiler do the same?
Korona wrote: it didn't make sense to tell the compiler that x/0 == __builtin_unreachable.
It would have made sense to me, just to prevent wrong code from spreading.
It's the same story for unaligned access. Consider the example:
Code:
int x = 1;
char *p = (char *)&x + 1;
*(int *)p = 2;   /* line 3: misaligned int store -- UB */
Why did it take decades to get a warning like -Wcast-align=strict for line 3? It would have been extremely useful to have such a warning to prevent the unaligned-access UB, right? Nobody even considered it for something like 40 years, even though it was fairly simple to implement: it requires just comparing the alignment of the source and destination types of the cast. Char pointers cannot be (safely) cast to int pointers because "char" has alignment 1, while "int" has alignment 4 (or at least something > 1). Moreover, when -Wcast-align itself was finally introduced, it still didn't emit ANY warning on architectures that actually allow unaligned access. We had to wait a few more years to get -Wcast-align=strict. Isn't that weird?
Why is that? IMHO, nobody thought of doing anything with that form of UB; it was fine as it was. There was no need to force developers to use the uglier memcpy(), because the semantics of unaligned access were well defined for both developers and compiler engineers, no matter what the standard said about that form of UB. It feels like the standard wasn't taken so literally in the past, doesn't it? Later, new ideas for optimizations came along, so a way had to be found to allow a whole class of optimizations without "breaking the law" and... BOOM: hidden pearls of the standard that almost nobody cared about started to be used as pillars on which a whole generation of optimizations would rest. Does it make any sense?
P.S. Check this SO question from more than 6 years ago:
https://stackoverflow.com/questions/257 ... int-on-x86 Nobody even mentioned the ISO standard and UB until an update on 2020-05-10, which recommends using -Wcast-align=strict on x86 in order to avoid UB. It's not hard proof of anything, but the whole -Wcast-align story shows, like many of my other examples, developers' and compiler engineers' mindset and how it evolved.