Schol-R-LEA wrote:First, it is entirely possible for the compiler or assembler to emit blocks of data inside of the code sections, but IIUC, in the majority of cases, doing so would be counter-productive.
Placing constants around the functions is possible, but the features are there to encourage separation. I am not against memory protection through separation; I was just curious whether that is the entire reason why we keep code and data separate today, or whether there are other reasons as well, such as compiler architecture, OS architecture, CPU architecture, etc.
Schol-R-LEA wrote:Now, to be fair, I am not entirely clear if by 'interleaved data' you mean just immediate operands, or non-immediate constants embedded in code, with the flow of execution going around or stopping prior to reaching the data block. Immediate addressing for most ISAs means that the data is part of the opcode, and does not need to be fetched separately, so it wouldn't be a concern for the object format at all.
I was thinking about placing constants between the functions, and maybe variables as well, if we do not care about memory protection at all. Scattering data inside the function body at high frequency would cause inefficiencies, but placing constants around the functions, or between big branches of the code, cushioned by a few nops for proper alignment, should be acceptable. This could improve TLB performance, memory performance, and virtual memory performance.
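To make it concrete, here is a minimal sketch of what I have in mind, assuming a GNU toolchain on x86-64 (the symbol name and the alignment choice are made up, and whether anything actually ends up adjacent also depends on options like -ffunction-sections):

Code:
/* Emit a 64-bit constant directly into .text, aligned to 64 bytes so it
   sits in its own cache line next to the code that reads it.  .p2align
   pads with nops in a code section, which is the "cushioning" I mean. */
__asm__(
    ".text\n"
    ".globl magic_constant\n"
    ".p2align 6\n"                     /* pad to a 64-byte boundary */
    "magic_constant:\n"
    ".quad 0x0123456789abcdef\n"
);

extern const unsigned long long magic_constant;

unsigned long long get_magic(void)
{
    return magic_constant;             /* data fetched from the code pages */
}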
Schol-R-LEA wrote:EDIT: apparently, the way 32-bit ARM does immediates is rather different from what I misunderstood it to be; it actually uses 12 bits, four for position in a 32 final value and 8 for the value itself. So, loading a 32-bit value could take up to four MOV or MVN instructions, but doesn't need any explicit shift instructions. If anyone can clarify this further, please do so.
I recently discovered that gcc places constants after the functions on ARM, which is probably the only place where I have seen this done on Linux. With your clarification it makes sense. The code generator introduced the constant for the sole purpose of loading it into a register; it was an implicit address constant, not a constant pointer that the program explicitly defined. The compiler either had to build the value with up to four instructions using immediate operands, or use a single pc-relative load from a nearby memory location holding the constant.
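For reference, this is the shape of what I saw, reconstructed from memory rather than pasted verbatim:

Code:
/* A constant that cannot be encoded as a single rotated 8-bit ARM immediate. */
unsigned int get_const(void)
{
    return 0xDEADBEEFu;
}

/* Schematic of the 32-bit ARM output (reconstructed, not verbatim):
 *
 *   get_const:
 *       ldr   r0, .L2        @ pc-relative load from the literal pool
 *       bx    lr
 *   .L2:
 *       .word 0xdeadbeef     @ the constant, emitted right after the function
 *
 * With -march=armv7-a and a newer gcc it may instead emit a movw/movt pair
 * and skip the literal pool entirely.
 */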
Schol-R-LEA wrote:Few instruction sets have 'short' addressing for data accesses, the way some do for jumps. As I understand it, the way associative caching works, there can be separate locality for instructions and data, or even multiple locality 'nodes' for both. Indeed, it is my understanding that most modern CPUs use separate caching for instructions and data, so I don't know if data locality relative to the instruction stream is even a factor, regardless of whether the data is writable or not. I may be wrong about this, however, so any corrections on this would be welcome.
Here is what I know: the L1 instruction cache is virtually indexed and virtually tagged, while the L1 data cache is virtually indexed and physically tagged. The two caches are apparently very different, including in what they store, how they handle aliasing, etc. We can assume that new CPU designs expect the data and instruction streams to be separate; if code and data alias the same cache line, that will decrease the overall L1 efficiency. Hence my earlier remark about 64-byte alignment, so that the code and data still occupy separate cache lines. I am not sure whether there would be other contention issues as well.
Edit: The L1 data cache is virtually indexed, but the indexing component is taken from the page offset, i.e. bits 6-11 of the address, which coincide with the same bits of the physical address.
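To make the bit arithmetic concrete, here is a trivial sketch assuming a typical 32 KiB, 8-way, 64-byte-line L1 data cache (those sizes are an assumption, not a universal fact):

Code:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t addr = 0x00007ffd1234abc0ULL;    /* arbitrary example address */
    uint64_t line = 64;                       /* bytes per cache line      */
    uint64_t sets = (32 * 1024) / (64 * 8);   /* size / (line * ways) = 64 */

    uint64_t offset = addr & (line - 1);      /* byte within the line: bits 0-5 */
    uint64_t index  = (addr / line) % sets;   /* set index: bits 6-11           */

    printf("offset = %llu, set index = %llu\n",
           (unsigned long long)offset, (unsigned long long)index);

    /* Bits 6-11 lie entirely inside the 4 KiB page offset (bits 0-11), so
       the set index comes out the same whether it is computed from the
       virtual or the physical address. */
    return 0;
}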
Schol-R-LEA wrote:Second, most systems use the same overall object format for several different kinds of object files, such as intermediate linkable object files, executable object files, static library files, and shared library files, with the header indicating which type a given file is. This is certainly the case with ELF, which has several different sub-types. The sections are designed so that, when linking several object files together, only those parts which the linker needs to patch for relocations and other link-time changes need to be altered, with the rest simply copied to the final executable file.
Aside from a few specific cases, the sections seem to me to be designed primarily to pack code and data separately. So it is all part of the same overall effort to support separation.
Schol-R-LEA wrote:Third, depending on the memory architecture, the operating system, the program's needs, and several other factors, the loader may need to perform load time relocation, of either the code or data in the executable file, or of some form of shared/dynamic-link libraries. While most current systems use the paging mechanism to eliminate load-time relocation in the majority or cases, by mapping the code into the correct memory locations independent of the physical addressing, and relative addressing modes can reduce the need for it even further, there are usually at least some programs where load-time relocation is needed.
ELF types ET_EXEC and ET_DYN, i.e. executables and shared objects, should always be laid out contiguously in memory; this is per the ELF specification. RIP-relative code relies on this, and would be very inefficient if it could not. In other words, on x86-like targets, separation does not provide greater relocation freedom. For other architectures there may be specification addenda, e.g. how to execute code from flash memory in separate address ranges, etc.
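A small sketch of why that matters on x86-64 (the assembly in the comment is schematic, not pasted from a real compile):

Code:
#include <stdio.h>
#include <stdint.h>

static long long answer = 0x1122334455667788LL;   /* lands in .data */

long long read_answer(void)
{
    /* On x86-64 this is typically a single RIP-relative load, roughly:
     *   mov rax, QWORD PTR answer[rip]
     * The displacement between the instruction and 'answer' is fixed at
     * link time, so the loader can slide the whole ET_DYN image by one
     * delta but cannot move the code and data segments independently. */
    return answer;
}

int main(void)
{
    /* The code-to-data distance is constant for the life of the mapping.
       (Casting a function pointer to an integer is implementation-defined,
       but fine for this illustration on the usual platforms.) */
    printf("code/data delta = %lld bytes\n",
           (long long)((intptr_t)&answer - (intptr_t)&read_answer));
    return read_answer() == 0x1122334455667788LL ? 0 : 1;
}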
Schol-R-LEA wrote:Fourth, most relocatable object formats - not just ELF - allow read-only data sections, read-write data, and BSS (uninitialized data space allocation declarations) and even auxiliary sections such as comments, stack frames, etc., to be separated from the text (code) sections in order to allow the linker to treat them separately for the purposes of managing relocation (among other things). the loader also needs to be able to manipulate them independently for similar reasons.
Special sections like bss, ctors, dtors, and eh_frame are indeed designed to solve specific problems, and bss is a relevant exception to my argument, to some extent. But for sections like code, data, and rodata, the contents are arbitrary blobs as far as the ELF format is concerned. My point is that the format encourages separation by offering the means for split output, but does not really make it necessary in those cases. As for the tracking of static relocations in the intermediate files, it involves the section metadata because that is necessary in order to support sections; but sections are not necessary for having relocations.
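For reference, here is all that a static relocation amounts to, using the glibc <elf.h> definitions (the concrete values below are made up):

Code:
#include <elf.h>
#include <stdio.h>

int main(void)
{
    /* One relocation record: "patch this location using that symbol".
       In an ET_REL file r_offset is an offset into the section being
       patched - that is the only place where section identity enters
       the picture.  In a final executable or DSO it is just a virtual
       address, and no section metadata is needed to apply it. */
    Elf64_Rela r = {
        .r_offset = 0x1000,                         /* where to patch        */
        .r_info   = ELF64_R_INFO(3, R_X86_64_64),   /* symbol #3, 64-bit abs */
        .r_addend = 8,                              /* constant added on top */
    };

    printf("sym = %lu, type = %lu, addend = %ld\n",
           (unsigned long)ELF64_R_SYM(r.r_info),
           (unsigned long)ELF64_R_TYPE(r.r_info),
           (long)r.r_addend);
    return 0;
}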
Schol-R-LEA wrote:Finally, there is no reason why the loader could not merge them at run time, assuming that it would provide an advantage, and doing so would not add any overhead not inherent in a relocatable executable format already (no matter what else, the loader has to parse the headers of the object file, and find the individual sections, compute the sizes and assign addresses for the BSS variables, determine if there are any load-time relocations for either the code or the data, request that the paging system map any sections of the file that don't need relocation into memory pages, figure out which non-relocating sections need to be paged in immediately and which can be kept paged out, generate temporary pages for any that do need relocation, and do several other housekeeping tasks).
I suppose the loader could process the static relocations if the linker left them in the final executable. This could allow intelligent merging/copying of constants near the code. It would be slow, but more importantly, either no one ever thought of it, or it was never desired.
Edit: My point when I say that the loader could "process .. relocations" is that it could use them to gather information about the referential structure of the code and place each datum close to its point of reference. But on second thought, for composite objects such as arrays this would be difficult, because the relocation entries lack enough type detail.
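To be concrete, the pass I had in mind would look roughly like this; it is purely hypothetical, the function name is made up, and as far as I know no real loader does anything like it:

Code:
#include <elf.h>
#include <stddef.h>

/* Given the static relocations the linker left against the code, count how
   many times each symbol is referenced from the text.  A loader could use
   such counts to decide which constants to copy close to which functions.
   Note what is missing: the entries say nothing about the referenced
   object's type or layout, which is the problem with arrays and other
   composite objects mentioned above. */
void count_code_references(const Elf64_Rela *rela, size_t nrela,
                           size_t *counts, size_t nsyms)
{
    for (size_t i = 0; i < nrela; i++) {
        size_t sym = ELF64_R_SYM(rela[i].r_info);
        if (sym < nsyms)
            counts[sym]++;
    }
}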