First, you should know that I compile my OS with both Clang and gcc, for both x86_64 and AArch64. All 4 combinations worked perfectly for years. Then at some point I upgraded my toolchains to the latest version, and all tests ran fine (or so I thought).
I didn't care about the update until I noticed that when I compiled on certain days, I got page faults. Not always, just sometimes. So I added debug prints to my code, and to my surprise, the page faults never happened! I chased this bloody bug for months because of this.
Then I hardcoded a printf that wrote to the serial port and worked even when I compiled without the debugging functions. That's when I noticed something interesting in the dynamic linker. The following showed up on the console:
import 41A6D0 (304 bytes):
41B000 D lastlink
41B008 D fsck
41B010 D nfcb
41B018 D fcb
41B038 T strcpy
41B040 T fsdrv_reg
41B048 T memcpy
41B050 T pathpop
41B058 T writeblock
41B060 T memzero
41B068 T strcat
41B070 T bzt_free
41B078 T memcmp
41B080 T readblock
41B088 T strdup
41B090 T pathpush
41B098 T strlen
41B0A0 T crc32c_calc
419000 D base+5010102464C457F
419000 D base+419000
419000 D base+419000
419000 D base+419000
After many painful trial-and-error iterations I narrowed the problem down to gcc and GNU ld. It always worked with Clang and lld. It occurred to me that a few years ago I had reported a linkage problem in lld, which they fixed. So I looked up that ticket and ran all those tests again with gcc.
Then I spent hours and hours debugging my dynamic linker, and I just couldn't find the problem! Everything looked fine, and dumping the ELF relocations with readelf didn't reveal any problems either.
HERE'S THE FIRST TAKE AWAY: every single tool (nm, objdump, readelf, etc.) dumps the relocation SECTION found through the section headers (it goes by the name .rela.dyn), and NOT what the RELA and RELASZ entries of the dynamic section actually describe. For all architectures, and for all linkers, the .rela.dyn section is always correct! However, my dynamic linker used the data in the dynamic section, which used to be correct with ld, but not any more!
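To make the distinction concrete, here is a minimal sketch of how a loader might collect the relocation ranges from the dynamic section instead of from the section headers. It is not the code from the linker discussed in this post: `dyn_entry`, `dyn_relocs` and `collect_relocs` are hypothetical names, and only the tag numbers (the ones visible in the readelf dumps in this post) are taken as given.

```c
#include <stdint.h>
#include <stddef.h>

/* Tag numbers as printed in the first column of `readelf -d`. */
#define DT_NULL     0
#define DT_PLTRELSZ 2
#define DT_RELA     7
#define DT_RELASZ   8
#define DT_RELAENT  9
#define DT_JMPREL   23

/* Hypothetical type with the layout of a 64-bit dynamic entry:
 * a signed tag followed by a value or address. */
typedef struct {
    int64_t  tag;
    uint64_t val;
} dyn_entry;

/* Hypothetical summary of the relocation-related entries. */
typedef struct {
    uint64_t rela, relasz, relaent; /* RELA, RELASZ, RELAENT */
    uint64_t jmprel, pltrelsz;      /* JMPREL, PLTRELSZ */
} dyn_relocs;

/* Walk the dynamic table up to the terminating NULL entry and pick
 * out the values that describe where the relocation records live. */
dyn_relocs collect_relocs(const dyn_entry *dyn)
{
    dyn_relocs r = {0};

    for (; dyn->tag != DT_NULL; dyn++) {
        switch (dyn->tag) {
        case DT_RELA:     r.rela     = dyn->val; break;
        case DT_RELASZ:   r.relasz   = dyn->val; break;
        case DT_RELAENT:  r.relaent  = dyn->val; break;
        case DT_JMPREL:   r.jmprel   = dyn->val; break;
        case DT_PLTRELSZ: r.pltrelsz = dyn->val; break;
        default:          break; /* not relocation-related */
        }
    }
    return r;
}
```

A loader that works this way never looks at the .rela.dyn section header at all, which is why a dump of that section can look perfectly fine while the loader still processes the wrong records.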
Check this out: here's the readelf output for a file compiled for AArch64 and linked with GNU ld:
Dynamic section at offset 0x40b0 contains 15 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so]
0x0000000000000004 (HASH) 0x3448
0x0000000000000005 (STRTAB) 0x36e0
0x0000000000000006 (SYMTAB) 0x34e8
0x000000000000000a (STRSZ) 147 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x4028
0x0000000000000002 (PLTRELSZ) 336 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x38b0
0x0000000000000007 (RELA) 0x3778
0x0000000000000008 (RELASZ) 312 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffff9 (RELACOUNT) 9
0x0000000000000000 (NULL) 0x0
Relocation section '.rela.dyn' at offset 0x3778 contains 27 entries:
That's okay: JMPREL (0x38b0) starts exactly where the RELA table ends (0x3778 + 312 bytes). But now let's take a look at the same file, compiled for x86_64 and linked with GNU ld:
Dynamic section at offset 0x20a8 contains 14 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so]
0x0000000000000004 (HASH) 0x13d8
0x0000000000000005 (STRTAB) 0x1638
0x0000000000000006 (SYMTAB) 0x1470
0x000000000000000a (STRSZ) 147 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x2020
0x0000000000000002 (PLTRELSZ) 432 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x16d0
0x0000000000000007 (RELA) 0x16d0
0x0000000000000008 (RELASZ) 96 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x0000000000000000 (NULL) 0x0
Relocation section '.rela.dyn' at offset 0x16d0 contains 18 entries:
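Here RELA and JMPREL both point to 0x16d0, so the 96 bytes of the RELA table lie inside the 432 bytes of the JMPREL table instead of in front of it.
SECOND TAKE AWAY: never trust your toolchain! You expect differences between compilers and linkers, but as this example shows, there can be SIGNIFICANT differences with the same linker and the same linker script on different architectures too!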
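The difference is easier to see once the sizes are turned into record counts. The sketch below only redoes the arithmetic on the numbers from the two readelf dumps; `reloc_layout` and the helper functions are hypothetical names introduced for this illustration.

```c
#include <stdint.h>

/* Hypothetical summary of the relocation-related dynamic entries. */
typedef struct {
    uint64_t rela, relasz, relaent; /* RELA, RELASZ, RELAENT */
    uint64_t jmprel, pltrelsz;      /* JMPREL, PLTRELSZ */
} reloc_layout;

/* Values copied from the two readelf dumps in this post. */
const reloc_layout aarch64_layout = { 0x3778, 312, 24, 0x38b0, 336 };
const reloc_layout x86_64_layout  = { 0x16d0,  96, 24, 0x16d0, 432 };

/* Number of records described by RELASZ. */
uint64_t rela_count(const reloc_layout *l)
{
    return l->relasz / l->relaent;
}

/* Number of records described by PLTRELSZ (PLTREL is RELA in both
 * dumps, so the record size is the same as RELAENT). */
uint64_t plt_count(const reloc_layout *l)
{
    return l->pltrelsz / l->relaent;
}

/* Non-zero when the JMPREL table starts exactly where RELA ends. */
int plt_follows_rela(const reloc_layout *l)
{
    return l->rela + l->relasz == l->jmprel;
}
```

For AArch64 this gives 13 + 14 records, back to back, which adds up to the 27 entries readelf reports for .rela.dyn. For x86_64 it gives 4 and 18 records that start at the same address, and 18 is already the full entry count of .rela.dyn.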
Solution: I added an extra check to my dynamic linker that tests whether the RELA table is included in the JMPREL table. Ugly as hell, but that's the best workaround I could come up with. I couldn't find any command-line flag or linker script magic to make GNU ld consistent on both x86_64 and AArch64.
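Such a check could look roughly like the sketch below. This is an assumption about its shape, not the actual workaround from the linker: `rela_inside_jmprel` is a hypothetical name, and what the loader should do once the check fires is left to the caller, since the post does not spell it out.

```c
#include <stdint.h>

/* Returns non-zero when the byte range [rela, rela + relasz) lies
 * entirely inside [jmprel, jmprel + pltrelsz). The comparison is
 * written with subtractions so it cannot overflow on bogus input. */
int rela_inside_jmprel(uint64_t rela, uint64_t relasz,
                       uint64_t jmprel, uint64_t pltrelsz)
{
    if (relasz == 0 || relasz > pltrelsz)
        return 0;               /* empty, or too big to fit */
    if (rela < jmprel)
        return 0;               /* starts before the PLT table */
    return rela - jmprel <= pltrelsz - relasz;
}
```

With the x86_64 values from the dump (RELA 0x16d0, RELASZ 96, JMPREL 0x16d0, PLTRELSZ 432) the function reports containment; with the AArch64 values (RELA 0x3778, RELASZ 312, JMPREL 0x38b0, PLTRELSZ 336) it does not.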
Cheers,
bzt