When the problem is not with your code...
Posted: Wed Jul 01, 2020 9:50 pm
Guys... I've been hunting this bug for months now! I've finally figured it out, and I think I must share it so that others can learn from it... It was not trivial to solve, and it turned out the bug wasn't in my code, but in the linker!
First, you should know that I compile my OS with both Clang and gcc, for both x86_64 and AArch64. All 4 combinations worked perfectly for years. Then at some point I upgraded my toolchains to the latest versions, and all tests ran fine (or so I thought).
I didn't care about the update until I noticed that when I compiled on certain days, I got page faults. Not always, just sometimes. So I added debug prints to my code, and to my surprise, the page faults never happened! I've been tracking this bloody bug for months because of this.
Then I hardcoded a printf that printed to the serial port and worked even when I compiled without the debugging functions. And I noticed something interesting in the dynamic linker; the following showed up on the console:
Code:
import 41A6D0 (304 bytes):
41B000 D lastlink
41B008 D fsck
41B010 D nfcb
41B018 D fcb
41B038 T strcpy
41B040 T fsdrv_reg
41B048 T memcpy
41B050 T pathpop
41B058 T writeblock
41B060 T memzero
41B068 T strcat
41B070 T bzt_free
41B078 T memcmp
41B080 T readblock
41B088 T strdup
41B090 T pathpush
41B098 T strlen
41B0A0 T crc32c_calc
419000 D base+5010102464C457F
419000 D base+419000
419000 D base+419000
419000 D base+419000
That's right, a bad relocation entry was responsible for the page faults!!! (Look at the last four lines: the first addend is nothing but the bytes of the ELF magic, 7F 45 4C 46...) Now it makes sense: when I used the debug functions in my libc, the relocation table changed and became bigger, as it contained more records (dbg_printf for one), and the relocation problem wasn't triggered at all.
After many painful trial-and-error iterations I've narrowed the problem down to gcc and GNU ld. It always worked with Clang and lld. It popped into my mind that a few years ago I reported a linkage problem in lld, and they fixed it. So I looked up that ticket and did all those tests again with gcc.
Then I spent hours and hours debugging my dynamic linker, and I just couldn't find the problem! Everything looked fine, and I couldn't find any problem when dumping the ELF relocations with readelf either.
HERE'S THE FIRST TAKEAWAY: by default every single tool (nm, objdump, readelf, etc.) dumps the relocation SECTION (which goes by the name .rela.dyn), and NOT the ranges that the RELA/RELASZ and JMPREL/PLTRELSZ entries of the dynamic section actually describe. For all architectures, and for all linkers, the .rela.dyn section is always correct! However, my dynamic linker used the data in the dynamic section, which used to be correct with ld, but not any more!
Check this out: here's a readelf output compiled for AArch64 and linked with GNU ld:
Code:
Dynamic section at offset 0x40b0 contains 15 entries:
  Tag                Type         Name/Value
 0x0000000000000001 (NEEDED)     Shared library: [libc.so]
 0x0000000000000004 (HASH)       0x3448
 0x0000000000000005 (STRTAB)     0x36e0
 0x0000000000000006 (SYMTAB)     0x34e8
 0x000000000000000a (STRSZ)      147 (bytes)
 0x000000000000000b (SYMENT)     24 (bytes)
 0x0000000000000003 (PLTGOT)     0x4028
 0x0000000000000002 (PLTRELSZ)   336 (bytes)
 0x0000000000000014 (PLTREL)     RELA
 0x0000000000000017 (JMPREL)     0x38b0
 0x0000000000000007 (RELA)       0x3778
 0x0000000000000008 (RELASZ)     312 (bytes)
 0x0000000000000009 (RELAENT)    24 (bytes)
 0x000000006ffffff9 (RELACOUNT)  9
 0x0000000000000000 (NULL)       0x0
Relocation section '.rela.dyn' at offset 0x3778 contains 27 entries:
As you can see, the .rela.dyn section starts at offset 0x3778 and is 27*24 = 648 bytes long. If we take a closer look at the dynamic section, we can see that PLTRELSZ is 336 bytes and RELASZ is 312, and that's 336+312 = 648 indeed. JMPREL immediately follows RELA (0x3778 + 312 = 0x38b0); it can't be otherwise, because both must be in the same .rela.dyn table, with no gaps allowed.
That's okay, but now let's take a look at the same file, compiled for x86_64 and linked with GNU ld:
Code:
Dynamic section at offset 0x20a8 contains 14 entries:
  Tag                Type         Name/Value
 0x0000000000000001 (NEEDED)     Shared library: [libc.so]
 0x0000000000000004 (HASH)       0x13d8
 0x0000000000000005 (STRTAB)     0x1638
 0x0000000000000006 (SYMTAB)     0x1470
 0x000000000000000a (STRSZ)      147 (bytes)
 0x000000000000000b (SYMENT)     24 (bytes)
 0x0000000000000003 (PLTGOT)     0x2020
 0x0000000000000002 (PLTRELSZ)   432 (bytes)
 0x0000000000000014 (PLTREL)     RELA
 0x0000000000000017 (JMPREL)     0x16d0
 0x0000000000000007 (RELA)       0x16d0
 0x0000000000000008 (RELASZ)     96 (bytes)
 0x0000000000000009 (RELAENT)    24 (bytes)
 0x0000000000000000 (NULL)       0x0
Relocation section '.rela.dyn' at offset 0x16d0 contains 18 entries:
Can you spot the difference? The .rela.dyn section starts at offset 0x16d0, and its size is 18*24 = 432 bytes. But in the dynamic section, PLTRELSZ is also 432 bytes, and JMPREL starts at the very same offset as RELA! Therefore PLTRELSZ plus RELASZ is 432+96 = 528 bytes, which is BIGGER than the actual relocation table! JMPREL contains the data relocations too, WTF? No wonder that my poor dynamic linker read garbage!
SECOND TAKEAWAY: never trust your toolchain! Even though you expect differences between compilers and linkers, as this example shows, there can be SIGNIFICANT differences when using the same linker and the same linker script on different architectures too!
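To make the mismatch concrete, here is a minimal sketch in C that feeds the numbers from the two readelf dumps into a consistency check. The function name `tags_tile_table` and its parameters are made up for illustration, and it assumes the layout described in this post: the RELA records come first and the JMPREL records sit right behind them in one table.

```c
#include <stdbool.h>
#include <stdint.h>

/* RELAENT as reported by readelf in both dumps */
#define RELA_ENTRY_SIZE 24u

/*
 * Hypothetical helper: returns true when the RELA range starts at the
 * table, the JMPREL range starts exactly where RELA ends, and the two
 * sizes add up to the size of the whole relocation table.
 */
bool tags_tile_table(uint64_t table_off, uint64_t entries,
                     uint64_t rela, uint64_t relasz,
                     uint64_t jmprel, uint64_t pltrelsz)
{
    uint64_t table_size = entries * RELA_ENTRY_SIZE;

    return rela == table_off
        && jmprel == rela + relasz
        && relasz + pltrelsz == table_size;
}
```

With the AArch64 values (table at 0x3778, 27 entries, RELA 0x3778/312, JMPREL 0x38b0/336) the check passes. With the x86_64 values (table at 0x16d0, 18 entries, RELA 0x16d0/96, JMPREL 0x16d0/432) it fails, because JMPREL does not start where RELA ends and 96+432 is more than 432.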
Solution: I've added an extra check to my dynamic linker to see whether the RELA range is included in the JMPREL range or not. Ugly as hell, but that's the best workaround I could come up with. I couldn't find any command line flag or linker script magic to make GNU ld consistent on both x86_64 and AArch64.
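A check of this kind could look roughly like the following C sketch. It is not the code from the actual dynamic linker: the type and function names (`dyn_relocs_t`, `reloc_range_t`, `reloc_ranges`) are hypothetical, and instead of only testing for inclusion it merges the two ranges whenever they overlap or touch, so that every record gets processed exactly once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the four relevant dynamic section values */
typedef struct {
    uint64_t rela;      /* RELA: start of the data relocations */
    uint64_t relasz;    /* RELASZ: their size in bytes */
    uint64_t jmprel;    /* JMPREL: start of the PLT relocations */
    uint64_t pltrelsz;  /* PLTRELSZ: their size in bytes */
} dyn_relocs_t;

/* One contiguous run of relocation records to process */
typedef struct {
    uint64_t start;
    uint64_t size;
} reloc_range_t;

/*
 * Fills out[] with non-overlapping ranges that cover both tables and
 * returns how many ranges were written (0, 1 or 2).
 */
size_t reloc_ranges(const dyn_relocs_t *d, reloc_range_t out[2])
{
    uint64_t a0 = d->rela,   a1 = d->rela + d->relasz;
    uint64_t b0 = d->jmprel, b1 = d->jmprel + d->pltrelsz;

    if (d->relasz == 0 && d->pltrelsz == 0)
        return 0;
    if (d->relasz == 0) {
        out[0] = (reloc_range_t){ b0, d->pltrelsz };
        return 1;
    }
    if (d->pltrelsz == 0) {
        out[0] = (reloc_range_t){ a0, d->relasz };
        return 1;
    }

    /* Overlapping or back to back: walk them as one single range */
    if (a0 <= b1 && b0 <= a1) {
        uint64_t lo = a0 < b0 ? a0 : b0;
        uint64_t hi = a1 > b1 ? a1 : b1;
        out[0] = (reloc_range_t){ lo, hi - lo };
        return 1;
    }

    /* Disjoint with a gap in between: keep them as two separate ranges */
    out[0] = (reloc_range_t){ a0, d->relasz };
    out[1] = (reloc_range_t){ b0, d->pltrelsz };
    return 2;
}
```

Fed with the values from the two dumps, this yields a single range in both cases: 0x3778 with 648 bytes for AArch64 and 0x16d0 with 432 bytes for x86_64, which matches the size of .rela.dyn each time.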
Cheers,
bzt