Page 1 of 1

When the problem is not with your code...

Posted: Wed Jul 01, 2020 9:50 pm
by bzt
Guys... I've been hunting this bug for months now! Finally I've figured it out and I think I must share so that others can learn from it... It was not trivial to solve, and it turned out the bug wasn't in my code, but in the linker!

First, you should know that I compile my OS with Clang and gcc for x86_64 and AArch64 as well. All 4 combinations worked perfectly for years. Then at some point, I've upgraded my toolchains to the latest version, and all tests ran fine (so that I've thought).

I didn't cared about the update until I noticed that if I compile on certain days, then I got page faults. Not always, just sometimes. So I've added debug prints to my code, and to my surprise, the page faults never happened! I've tracked this bloody bug for months because of this.

Then, I've hardcoded a printf that printed to the serial port, and worked even when I compiled without debugging functions. And I've noticed something interesting in the dynamic linker, the following showed up on the console:

Code: Select all

  import 41A6D0 (304 bytes):
    41B000 D lastlink
    41B008 D fsck
    41B010 D nfcb
    41B018 D fcb
    41B038 T strcpy
    41B040 T fsdrv_reg
    41B048 T memcpy
    41B050 T pathpop
    41B058 T writeblock
    41B060 T memzero
    41B068 T strcat
    41B070 T bzt_free
    41B078 T memcmp
    41B080 T readblock
    41B088 T strdup
    41B090 T pathpush
    41B098 T strlen
    41B0A0 T crc32c_calc
    419000 D base+5010102464C457F
    419000 D base+419000
    419000 D base+419000
    419000 D base+419000
That's right, a bad relocation entry was responsible for the page faults!!! Now it makes sense: when I used the debug functions in my libc, the relocation table changed and become bigger, as it contained more records (dbg_printf for one), and the relocation problem wasn't triggered at all.

After many painful trial-and-error iterations I've narrowed the problem down to gcc and GNU ld. It always worked with Clang and lld. It popped into my mind that a few years ago I've reported a linkage problem in lld and they have fixed that. So I looked up that ticket and did all those tests again with gcc.

Then I've spent hours and hours debugging my dynamic linker, and I just couldn't find the problem! Everything looked fine and I couldn't find any problems with dumping the ELF relocations using readelf either.

HERE'S THE FIRST TAKE AWAY: every single tool, nm, objdump, readelf, etc. dumps the relocation SECTION (which goes by the name .rela.dyn), and NOT what's in the actual dynamic relocation section. For all architectures, and for all linkers, the .rela.dyn section is always correct! However my dynamic linker used the data in the dynamic section, which used to be correct with ld, but not any more!

Check this out: here's a readelf output compiled for AArch64 and linked with GNU ld:

Code: Select all

Dynamic section at offset 0x40b0 contains 15 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so]
 0x0000000000000004 (HASH)               0x3448
 0x0000000000000005 (STRTAB)             0x36e0
 0x0000000000000006 (SYMTAB)             0x34e8
 0x000000000000000a (STRSZ)              147 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x4028
 0x0000000000000002 (PLTRELSZ)           336 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x38b0
 0x0000000000000007 (RELA)               0x3778
 0x0000000000000008 (RELASZ)             312 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffff9 (RELACOUNT)          9
 0x0000000000000000 (NULL)               0x0

Relocation section '.rela.dyn' at offset 0x3778 contains 27 entries:
As you can see the .rela.dyn section starts at offset 0x3778 and it is 27*24 = 648 bytes. If we take a closer look at the dynamic section, we can see that PLTRELSZ is 336 bytes, and RELASZ is 312, and that's 336+312 = 648 indeed. PLTREL immediately follows RELA (can't be otherwise, because they must be in the same table for .rela.dyn, no gaps allowed).

That's okay, but now let's take a look at the same file, compiled for x86_64 and linked with GNU ld:

Code: Select all

Dynamic section at offset 0x20a8 contains 14 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so]
 0x0000000000000004 (HASH)               0x13d8
 0x0000000000000005 (STRTAB)             0x1638
 0x0000000000000006 (SYMTAB)             0x1470
 0x000000000000000a (STRSZ)              147 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x2020
 0x0000000000000002 (PLTRELSZ)           432 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x16d0
 0x0000000000000007 (RELA)               0x16d0
 0x0000000000000008 (RELASZ)             96 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x0000000000000000 (NULL)               0x0

Relocation section '.rela.dyn' at offset 0x16d0 contains 18 entries:
Can you spot the difference? The .rela.dyn starts at offset 0x16d0, and its size is 18*24 = 432 bytes. But in the dynamic section, PLTRELSZ is also 432 bytes! Therefore PLTRELSZ plus RELASZ is 528 bytes, which is BIGGER than the actual relocation table! JMPREL contains data relocations too, WTF? No wonder that my poor dynamic linker read garbage!

SECOND TAKE AWAY: never trust your toolchain! Even though you expect differences between compilers and linkers, as this example shows there could be SIGNIFICANT differences using the same linker and same linker script on different architectures too!

Solution: I've added an extra check to my dynamic linker if the RELA section is included in the JMPREL section or not. Ugly as hell, but that's the best workaround I could come up with. I couldn't find no command line flags nor linker script magic to make GNU ld consistent on both x86_64 and AArch64.

Cheers,
bzt

Re: When the problem is not with your code...

Posted: Thu Jul 02, 2020 6:04 am
by linuxyne
bzt wrote: because they must be in the same table for .rela.dyn
That doesn't force one to store the input .rela.plt and .rela.dyn sections in a single output .rela.dyn section. On my x86_64 Linux, /usr/bin/ls shows two sections, .rela.dyn and .rela.plt, although the table is continguous.

It seems that your linker script is bundling the input .rela.dyn and .rela.plt sections into a single output section .rela.dyn. In what order does the script lists those two sections? Is it possible for your linker script to keep them separate?

See [1] for some details.

Edit:
The llvm fixes cited below appears to be in response to your bug [6], which is almost 2 years old.
So the issue was found then, and has been debugged already.
bzt wrote: I couldn't find no command line flags nor linker script magic to make GNU ld consistent on both x86_64 and AArch64.
How about keeping the two sections separate? And was a bug filed with binutils?


[1] https://reviews.llvm.org/rG7dc5af75ae5b ... cc2e4a61f0
[2] https://reviews.llvm.org/D54759
[3] https://github.com/bminor/glibc/blob/5f ... ink.h#L103
[4] https://binutils.sourceware.narkive.com ... verlapping
[5] https://sourceware.org/legacy-ml/binuti ... 00140.html
[6] https://bugs.llvm.org/show_bug.cgi?id=39678

Re: When the problem is not with your code...

Posted: Thu Jul 02, 2020 8:41 am
by bzt
linuxyne wrote:That doesn't force one to store the input .rela.plt and .rela.dyn sections in a single output .rela.dyn section. On my x86_64 Linux, /usr/bin/ls shows two sections, .rela.dyn and .rela.plt, although the table is continguous.
It is always going to be contiguous.
linuxyne wrote:It seems that your linker script is bundling the input .rela.dyn and .rela.plt sections into a single output section .rela.dyn. In what order does the script lists those two sections? Is it possible for your linker script to keep them separate?
You're missing the point. Same script, same linker, different behaviours for architectures.
linuxyne wrote:The llvm fixes cited below appears to be in response to your bug [6], which is almost 2 years old.
So the issue was found then, and has been debugged already.
With LLVM. Now this problem is with GNU ld, which was working correctly at the time of the LLVM bug. I should know because it was me who debugged that particular issue :-)
linuxyne wrote:How about keeping the two sections separate? And was a bug filed with binutils?
The original SCO ELF spec says, DT_RELA must contain all relocation tables, so having .rela.dyn and .rela.plt in one output section is fine, this is not the problem. Btw, some platforms mandate one relocation section, so separating is not always possible.

First, the area in the file pointed by the section that can be found using the Section Headers is correct (for all architectures and all linkers). It is only the area pointed by the Dynamic Program Header that's incorrect and differs (to its own Section too).

Second, what we here see is, that the DT_JMPREL section includes DT_RELA and not the other way around, which never was meant nor allowed by either the original SCO ELF, nor by the psABI ELF spec. DT_RELA should contain both data and plt relocations, but DT_JMPREL should never contain any data relocation, just plt.

And nope, I haven't filed a binutils bug because I know from first-hand experience that binutils developers are sadly @ssh0les. They simply doesn't worth my time. :-(

(This could be another topic of its own, but I've noticed that ever since big money corps (eg. MS, Oracle, IBM, FB etc.) started to use and contribute to Linux, the attitude of the kernel and the GNU userland developers have changed. Not in the good direction. I wasn't the only one noticing this, see this and this for example. Last time I had a fight with two of the main developers of a library, because they did not know how C++ "thread_local" keyword works (part of the language since C++11), and that -1 is stored in 2's complement (meaning that's full 1 bits), not to mention they critised stb's code that I borrowed for inflation, and they thought using libc is evil, and said I should implement my own floating number parser... (just for the records, it took YEARS for the musl developers to get atof right). Thankfully the lead developer come to the rescue, but that's extremely rare, leaders are usually part of the problem.)

Cheers,
bzt

Re: When the problem is not with your code...

Posted: Thu Jul 02, 2020 11:20 am
by linuxyne
bzt wrote: You're missing the point. Same script, same linker, different behaviours for architectures.
Right. From a cursory investigation, bfd seems to pass control of managing the dynrelocs to the files private to individual architectures.

One can try keeping the sections separate and see if the latest ld works the same on both x86_64 and aarch64.
bzt wrote: Now this problem is with GNU ld, which was working correctly at the time of the LLVM bug.
One of your comments on the bug say:
So I was mistaken, GNU ld for AArch64 was not according to the book at all, in fact it's wrong in all cases, and LLVM is correct for all four cases regarding DT_RELA and DT_RELASZ, and only DT_PLTRELASZ needs to be fixed.

At least some problem with ld was found, I think.
bzt wrote: DT_RELA must contain all relocation tables, so having .rela.dyn and .rela.plt in one output section is fine
For some reason, keeping them separate has been the convention/default.
bzt wrote: Btw, some platforms mandate one relocation section
Does ld for such a target platform work okay with your linker script and source file? I think it will.
bzt wrote: what we here see is, that the DT_JMPREL section includes DT_RELA
That is one way to put it, although I doubt that such an opposite encapsulation is intentional - it manifests on x86_64 likely because of forcing a single section for relocs.


This issue seems like a corner case with a workaround available.
Since it was working before, there must be some recent change that exposed it, or caused a regression. I will try to see if I can find the root cause.

Re: When the problem is not with your code...

Posted: Thu Jul 02, 2020 6:52 pm
by bzt
linuxyne wrote:For some reason, keeping them separate has been the convention/default.
I guess because programmers in general are getting more and more stupider. Not sure about the reason, but you can notice this almost everywhere, solutions are getting more complex, slower, harder to maintain, and they call that "progress" (see Wirth's law for example). Just think about this: on your desktop computer, where there are plenty of RAM and network bandwidth is high, you use a single client (a browser) to watch webpages. On your phone, where the storage is limited, RAM is precious, and network bandwidth is often limited, you have to install applications for each and every webpage...
linuxyne wrote:Since it was working before, there must be some recent change that exposed it, or caused a regression. I will try to see if I can find the root cause.
It was definitely an update that let the dogs out. Not sure which one, because the problem was only triggered occasionally. If you can find the cause that's great, but it is not so important (to me at least), because I've came up with a workaround already. Now my kernel compiles perfectly with gcc and Clang on x86_64 and AArch64 too, plus I've added an ELF verification tool to the build environment, so that I can see instantly if the devs break the linker again.

Just for your information, there was an issue with LLVM lld too :-) It is extremely lame, I can't imagine what could went wrong. Again, only x86_64 affected, AArch64 worked as expected (and this worked with gcc ld too). So I had this in my linker script:

Code: Select all

SECTIONS {
  . = LOAD_ADDRESS;
  bootboot = .; . += 4096;
  environment = .; . += 4096;
  .text  ...
and after compilation nm showed that "bootboot" is at 0xffffffffe00000 and "environment" is at 0xffffffffe01020.. WTF? How? Only with lld and only for x86_64, there was an additional 32 bytes. Then I changed the script to

Code: Select all

SECTIONS {
  bootboot = LOAD_ADDRESS;
  enviroment = LOAD_ADDRESS + 4096;
  . = LOAD_ADDRESS + 8192;
  .text  ...
and everything got back to normal...

Honestly, I don't understand why they are keep developing this part of ld and lld. We have a saying, "if it's not broken, then don't try to fix it!". I seriously doubt that new processor features added to the compilers should effect the section creation and address calculation in linkers.

Cheers,
bzt