Schol-R-LEA wrote:First, it is entirely possible for the compiler or assembler to emit blocks of data inside of the code sections, but IIUC, in the majority of cases, doing so would be counter-productive.
Placing constants around the functions is possible, but the features are there to encourage separation. I am not against memory protection through separation; I was just curious whether that is the entire reason why we keep code and data separate today, or whether there are other reasons as well, such as compiler architecture, OS architecture, CPU architecture, etc.
Schol-R-LEA wrote:Now, to be fair, I am not entirely clear if by 'interleaved data' you mean just immediate operands, or non-immediate constants embedded in code, with the flow of execution going around or stopping prior to reaching the data block. Immediate addressing for most ISAs means that the data is part of the opcode, and does not need to be fetched separately, so it wouldn't be a concern for the object format at all.
I was thinking about placing constants between the functions, and maybe variables as well, if we do not care about memory protection at all. Scattering data inside the function body at high frequency would cause inefficiencies, but placing constants around the functions, or between big branches of the code, cushioned by a few nops for proper alignment, should be acceptable. This could improve TLB performance, memory performance, and virtual memory performance.
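To make it concrete, here is a minimal sketch of what I have in mind, assuming a GNU toolchain on x86-64 (the symbol name and the alignment choice are made up, and whether anything actually ends up adjacent also depends on options like -ffunction-sections):

Code:
/* Emit a 64-bit constant directly into .text, aligned to 64 bytes so it
   sits in its own cache line next to the code that reads it.  .p2align
   pads with nops in a code section, which is the "cushioning" I mean. */
__asm__(
    ".text\n"
    ".globl magic_constant\n"
    ".p2align 6\n"                     /* pad to a 64-byte boundary */
    "magic_constant:\n"
    ".quad 0x0123456789abcdef\n"
);

extern const unsigned long long magic_constant;

unsigned long long get_magic(void)
{
    return magic_constant;             /* data fetched from the code pages */
}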
Schol-R-LEA wrote:EDIT: apparently, the way 32-bit ARM does immediates is rather different from what I misunderstood it to be; it actually uses 12 bits, four for position in a 32 final value and 8 for the value itself. So, loading a 32-bit value could take up to four MOV or MVN instructions, but doesn't need any explicit shift instructions. If anyone can clarify this further, please do so.
I recently discovered that gcc places constants after the functions on ARM, which is probably the only place where I have seen this done on Linux. With your clarification it makes sense. The code generator introduced the constant for the sole purpose of loading it into a register; it was an implicit address constant, not a constant pointer that the program explicitly defined. The compiler either had to build the value with up to four instructions using immediate operands, or use a single pc-relative load from a nearby memory location holding the constant.
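For reference, this is the shape of what I saw, reconstructed from memory rather than pasted verbatim:

Code:
/* A constant that cannot be encoded as a single rotated 8-bit ARM immediate. */
unsigned int get_const(void)
{
    return 0xDEADBEEFu;
}

/* Schematic of the 32-bit ARM output (reconstructed, not verbatim):
 *
 *   get_const:
 *       ldr   r0, .L2        @ pc-relative load from the literal pool
 *       bx    lr
 *   .L2:
 *       .word 0xdeadbeef     @ the constant, emitted right after the function
 *
 * With -march=armv7-a and a newer gcc it may instead emit a movw/movt pair
 * and skip the literal pool entirely.
 */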
Schol-R-LEA wrote:Few instruction sets have 'short' addressing for data accesses, the way some do for jumps. As I understand it, the way associative caching works, there can be separate locality for instructions and data, or even multiple locality 'nodes' for both. Indeed, it is my understanding that most modern CPUs use separate caching for instructions and data, so I don't know if data locality relative to the instruction stream is even a factor, regardless of whether the data is writable or not. I may be wrong about this, however, so any corrections on this would be welcome.
Here is what I know: the L1 instruction cache is virtually indexed and virtually tagged, while the L1 data cache is virtually indexed and physically tagged. The two caches are apparently very different, including in what they store, how they handle aliasing, etc. We can assume that new CPU designs expect the data and instruction streams to be separate; if code and data alias the same cache line, that will decrease the overall L1 efficiency. Hence my earlier remark about 64-byte alignment, so that the code and data still occupy separate cache lines. I am not sure whether there would be other contention issues as well.
Edit: The L1 data cache is virtually indexed, but the indexing component is taken from the page offset, i.e. bits 6-11 of the address, which coincide with the same bits of the physical address.
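To make the bit arithmetic concrete, here is a trivial sketch assuming a typical 32 KiB, 8-way, 64-byte-line L1 data cache (those sizes are an assumption, not a universal fact):

Code:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t addr = 0x00007ffd1234abc0ULL;    /* arbitrary example address */
    uint64_t line = 64;                       /* bytes per cache line      */
    uint64_t sets = (32 * 1024) / (64 * 8);   /* size / (line * ways) = 64 */

    uint64_t offset = addr & (line - 1);      /* byte within the line: bits 0-5 */
    uint64_t index  = (addr / line) % sets;   /* set index: bits 6-11           */

    printf("offset = %llu, set index = %llu\n",
           (unsigned long long)offset, (unsigned long long)index);

    /* Bits 6-11 lie entirely inside the 4 KiB page offset (bits 0-11), so
       the set index comes out the same whether it is computed from the
       virtual or the physical address. */
    return 0;
}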
Schol-R-LEA wrote:Second, most systems use the same overall object format for several different kinds of object files, such as intermediate linkable object files, executable object files, static library files, and shared library files, with the header indicating which type a given file is. This is certainly the case with ELF, which has several different sub-types. The sections are designed so that, when linking several object files together, only those parts which the linker needs to patch for relocations and other link-time changes need to be altered, with the rest simply copied to the final executable file.
Aside from a few specific cases, the sections seem to me to be designed primarily to pack code and data separately. So it is all part of the same overall effort to support separation.
Schol-R-LEA wrote:Third, depending on the memory architecture, the operating system, the program's needs, and several other factors, the loader may need to perform load time relocation, of either the code or data in the executable file, or of some form of shared/dynamic-link libraries. While most current systems use the paging mechanism to eliminate load-time relocation in the majority or cases, by mapping the code into the correct memory locations independent of the physical addressing, and relative addressing modes can reduce the need for it even further, there are usually at least some programs where load-time relocation is needed.
ELF types ET_EXEC and ET_DYN, i.e. executables and shared objects, should always be laid out contiguously in memory; this is per the ELF specification. RIP-relative code relies on this, and would be very inefficient if it could not. In other words, on x86-like targets, separation does not provide greater relocation freedom. For other architectures there may be specification addenda, e.g. how to execute code from flash memory in separate address ranges, etc.
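A small sketch of why that matters on x86-64 (the assembly in the comment is schematic, not pasted from a real compile):

Code:
#include <stdio.h>
#include <stdint.h>

static long long answer = 0x1122334455667788LL;   /* lands in .data */

long long read_answer(void)
{
    /* On x86-64 this is typically a single RIP-relative load, roughly:
     *   mov rax, QWORD PTR answer[rip]
     * The displacement between the instruction and 'answer' is fixed at
     * link time, so the loader can slide the whole ET_DYN image by one
     * delta but cannot move the code and data segments independently. */
    return answer;
}

int main(void)
{
    /* The code-to-data distance is constant for the life of the mapping.
       (Casting a function pointer to an integer is implementation-defined,
       but fine for this illustration on the usual platforms.) */
    printf("code/data delta = %lld bytes\n",
           (long long)((intptr_t)&answer - (intptr_t)&read_answer));
    return read_answer() == 0x1122334455667788LL ? 0 : 1;
}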
Schol-R-LEA wrote:Fourth, most relocatable object formats - not just ELF - allow read-only data sections, read-write data, and BSS (uninitialized data space allocation declarations) and even auxiliary sections such as comments, stack frames, etc., to be separated from the text (code) sections in order to allow the linker to treat them separately for the purposes of managing relocation (among other things). the loader also needs to be able to manipulate them independently for similar reasons.
Special sections like bss, ctors, dtors, and eh_frame are indeed designed to solve specific problems, and bss is a relevant exception to my argument, to some extent. But for sections like code, data, and rodata, the contents are arbitrary blobs as far as the ELF format is concerned. My point is that the format encourages separation by offering the means for split output, but does not really make it necessary in those cases. As for the tracking of static relocations in the intermediate files, it involves the section metadata because that is necessary in order to support sections; but sections are not necessary for having relocations.
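For reference, here is all that a static relocation amounts to, using the glibc <elf.h> definitions (the concrete values below are made up):

Code:
#include <elf.h>
#include <stdio.h>

int main(void)
{
    /* One relocation record: "patch this location using that symbol".
       In an ET_REL file r_offset is an offset into the section being
       patched - that is the only place where section identity enters
       the picture.  In a final executable or DSO it is just a virtual
       address, and no section metadata is needed to apply it. */
    Elf64_Rela r = {
        .r_offset = 0x1000,                         /* where to patch        */
        .r_info   = ELF64_R_INFO(3, R_X86_64_64),   /* symbol #3, 64-bit abs */
        .r_addend = 8,                              /* constant added on top */
    };

    printf("sym = %lu, type = %lu, addend = %ld\n",
           (unsigned long)ELF64_R_SYM(r.r_info),
           (unsigned long)ELF64_R_TYPE(r.r_info),
           (long)r.r_addend);
    return 0;
}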
Schol-R-LEA wrote:Finally, there is no reason why the loader could not merge them at run time, assuming that it would provide an advantage, and doing so would not add any overhead not inherent in a relocatable executable format already (no matter what else, the loader has to parse the headers of the object file, and find the individual sections, compute the sizes and assign addresses for the BSS variables, determine if there are any load-time relocations for either the code or the data, request that the paging system map any sections of the file that don't need relocation into memory pages, figure out which non-relocating sections need to be paged in immediately and which can be kept paged out, generate temporary pages for any that do need relocation, and do several other housekeeping tasks).
I suppose the loader could process the static relocations if the linker left them in the final executable. This could allow intelligent merging/copying of constants near the code. It would be slow, but more importantly, either no one ever thought of it, or it was never desired.
Edit: My point when I say that the loader could "process .. relocations" is that it could use them to gather information about the referential structure of the code and place each datum close to its point of reference. But on second thought, for composite objects such as arrays this would be difficult, because the relocation entries lack enough type detail.
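To be concrete, the pass I had in mind would look roughly like this; it is purely hypothetical, the function name is made up, and as far as I know no real loader does anything like it:

Code:
#include <elf.h>
#include <stddef.h>

/* Given the static relocations the linker left against the code, count how
   many times each symbol is referenced from the text.  A loader could use
   such counts to decide which constants to copy close to which functions.
   Note what is missing: the entries say nothing about the referenced
   object's type or layout, which is the problem with arrays and other
   composite objects mentioned above. */
void count_code_references(const Elf64_Rela *rela, size_t nrela,
                           size_t *counts, size_t nsyms)
{
    for (size_t i = 0; i < nrela; i++) {
        size_t sym = ELF64_R_SYM(rela[i].r_info);
        if (sym < nsyms)
            counts[sym]++;
    }
}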