
structure padding and data alignment

Posted: Thu Feb 27, 2014 10:15 pm
by icealys
Why is structure padding needed?
Can someone explain in detail what data alignment does?

Re: structure padding and data alignment

Posted: Thu Feb 27, 2014 10:33 pm
by thepowersgang
*sniff* smells like homework... but I'll bite.

Padding is needed to align data to its "natural size". In most cases, you can only read/write memory at aligned addresses (e.g. you can only read/write a 32-bit value when the bottom two bits of the address are zero). If you attempt to access a non-aligned address, the processor will do one of three things: 1) Fault (which is what ARM CPUs tend to do, and what x86 does if you set the AC flag in EFLAGS), 2) Ignore the lower bits, and read memory you didn't want, or 3) Emulate it (by reading the data byte-by-byte, which is far slower than an aligned access).

Because of the above, compilers will usually add padding to structures to ensure that each member is aligned correctly (avoiding either an alignment fault or a very expensive unaligned access).

Re: structure padding and data alignment

Posted: Thu Feb 27, 2014 10:47 pm
by icealys
Please see my other post about DMA word transfers too. I'm still not sure why data alignment is necessary... isn't it just starting at an address divisible by the number of bytes you are writing? What is its significance in a little more detail? How do the CPU registers play a role in this? I also heard that it is used for bus efficiency... please explain?

Re: structure padding and data alignment

Posted: Thu Feb 27, 2014 10:52 pm
by thepowersgang
/me points to the Forum Rules, Required Knowledge, and general courtesy.

That's exactly what I said. All accesses should be aligned to the size of the data you're manipulating (which is what the compiler does for you).

Re: structure padding and data alignment

Posted: Thu Feb 27, 2014 11:02 pm
by icealys
I'm not exactly sure how 8-, 16-, 24-, 32-, and 64-bit words are aligned... so you say the bottom two bits of the address have to be 0 for a 32-bit access... what about 8, 16, 24, and 64? Why does the processor fault, or take two reads instead of one, if the data is not aligned? Are there certain sections of RAM reserved for 8-, 16-, 24-, 32-, and 64-bit words?

Re: structure padding and data alignment

Posted: Fri Feb 28, 2014 12:27 am
by Brendan
Hi,
icealys wrote:i'm not exactly sure how 8,16,24,32,64 words are aligned...so you say the bottom two bits of the address have to be 0 for a 32 bit address...what about 8,16,24,64?
For 80x86, the CPU only expects "natural alignment". A 16-bit value would consume 2 bytes of RAM; and would be aligned if it doesn't cross a 2-byte boundary (e.g. the address is even). A 32-bit value would consume 4 bytes of RAM; and would be aligned if it doesn't cross a 4-byte boundary (e.g. the address is a multiple of 4). A 64-bit value would consume 8 bytes of RAM; and would be aligned if it doesn't cross an 8-byte boundary (e.g. the address is a multiple of 8).

For other CPUs the conditions for alignment can be more severe. For example, "aligned on a 4-byte boundary" might be the minimum alignment a CPU supports, even for smaller data.
icealys wrote:why does the processor fault or take two reads instead of one if the data is not aligned?
CPUs typically have caches, and work with "cache lines". If a 16-bit piece of data is not aligned, then half of it can be in one cache line while the other half is in a different cache line. This can be expensive to deal with in hardware, especially when you consider all the corner-cases (e.g. what if one cache line is in RAM and the other cache line is in ROM?), and a lot of the hassle is determining whether there's a problem or not. The fastest way to determine if there might be a problem (which is different from determining if there actually is a problem) is to test the lowest bits of the address. For example, for a 32-bit read or write, if either of the lowest 2 bits of the address is set then the data is misaligned, and you have to do more costly tests to determine whether or not it is split across 2 different cache lines.

Of course the simplest way for a CPU to handle this is to not support it in the first place - if the data is not aligned generate a trap/exception so that the CPU doesn't have to worry about it.
icealys wrote:are there certain sections of RAM reserved for 8,16,24,32,64 words?
No.


Cheers,

Brendan

Re: structure padding and data alignment

Posted: Fri Feb 28, 2014 1:42 am
by icealys
Thanks for your replies. Do cache lines only work with data fetched from RAM?

Re: structure padding and data alignment

Posted: Fri Feb 28, 2014 2:05 am
by Combuster
It looks like "learn to read" applies to you more often than appreciated:
icealys wrote:Does the cache line only work with getting data from RAM?
Brendan wrote:e.g. what if one cache line is in RAM and the other cache line is in ROM?
Other than it being obvious if you were aware of how caches are meant to work.

Re: structure padding and data alignment

Posted: Fri Feb 28, 2014 3:27 pm
by icealys
So basically, if it does work with components other than RAM, the cache should be flushed, right?

Re: structure padding and data alignment

Posted: Fri Feb 28, 2014 5:11 pm
by Combuster
A cache is only defined to hold a number of copies of some originals, where the copy is more quickly accessible than its original.

Homework question:
1: When would you need to remove items from a cache?

Re: structure padding and data alignment

Posted: Sat Mar 01, 2014 3:46 am
by bwat
Combuster wrote:A cache is only defined to hold a number of copies of some originals, where the copy is more quickly accessible than its original.
In the case of a write with a write-allocate copy-back cache, the cache contains the original, and while it is inconsistent with the next level in the memory hierarchy there are no copies.

Re: structure padding and data alignment

Posted: Sun Mar 02, 2014 4:42 pm
by Pancakes
The cache I don't know about.

But the interface between the CPU and memory is a bit different than you would expect. I used to have a great example of this, but have since lost it. The guys toward the end of this thread start to touch on exactly why. It is different for different hardware.

http://stackoverflow.com/questions/3655 ... d-boundary

This is SDRAM, but it starts to show you why the CPU cannot access an arbitrary address without performing two reads or writes.

http://en.wikipedia.org/wiki/Synchronou ... ess_memory

Basically, even though you might have 32 effective address pins, you can't access across multiple... I don't know what word to use here... but it is very much like the problem the guys are describing with the cache. The memory is divided into rows and columns and you have to kind of preselect one, then read from it (at least with SDRAM). Like I said, I swear I used to know the easy way to explain it, but I have forgotten.

Just know that the limitation (aside from the cache) is actually with the memory chips.

This guy here basically touched right on the reason. If you want to really dig down, you're going to have to follow the pin-outs of the RAM module and the timing diagram of its signals:
Word alignment is not only featured by CPUs

On the hardware level, most RAM modules have a given word size with respect to the number of bits that can be accessed per read/write cycle.

On a module I had to interface with on an embedded device, addressing was implemented through three parameters: the module was organized in four banks, which could be selected prior to the R/W operation. Each of these banks was essentially a large table of 32-bit words, which could be addressed through a row and column index.

In this design, access was only possible per cell, so every read operation returned 4 bytes, and every write operation expected 4 bytes.

A memory controller hooked up to this RAM chip could be designed in two ways: either allowing unrestricted access to the memory chip, using several cycles to split/merge unaligned data to/from several cells (with additional logic), or imposing some restrictions on how memory can be accessed, with the gain of reduced complexity.

As complexity can impede maintainability and performance, most designers chose the latter.

Re: structure padding and data alignment

Posted: Sun Mar 02, 2014 5:22 pm
by Pancakes
I got a good one. =)

Imagine 4 RAM chips installed in your motherboard. Now, access the boundary between two. What is going to happen?

As in, make a 32-bit/64-bit (whatever they support as their channel width) read straddling the middle between them. Are they both going to respond at the same time? (Not possible.) Or is the CPU or memory controller going to issue two reads, one to each, separately? Now imagine each RAM chip having individual modules inside of it (which they do). Same problem. Even if they both responded, one would have to wait for the other, so it would be like two reads, and then the CPU has to be prepared for a multiple response, which adds complexity and latency to the hardware (including the latency of the extra transistors switching on and off)... but anyway, that is basically the problem in a nutshell.

Oh, and just in case someone thinks "why do we not just stagger the read across multiple modules, so each module responds with one byte?" Well, say a read at 0x0 does just that. Now try to read a 32-bit value at 0x1.

Code:

|address   |ram module/physical-chip/stick
0000        module 0 (picks byte 0 in its array of bytes)
0000        module 1 (picks byte 0 in its array of bytes)
0000        module 2 (picks byte 0 in its array of bytes)
0000        module 3 (picks byte 0 in its array of bytes)

0001        module 0 (picks byte 1 in its array of bytes)
0001        module 1 (picks byte 1 in its array of bytes)
0001        module 2 (picks byte 1 in its array of bytes)
0001        module 3 (picks byte 1 in its array of bytes)
Did not turn out like you wanted?

So you say, well, we can do better, right? Why not add some complexity into the chip (first problem) and have it more intelligently decide? So let's try it again, the "smart" way.

Code:

0001        module 0 (picks byte 1 in its array of bytes)
0001        module 1 (picks byte 0 in its array of bytes)
0001        module 2 (picks byte 0 in its array of bytes)
0001        module 3 (picks byte 0 in its array of bytes)

Code:

0010       module 0 byte 1
0010       module 1 byte 1
0010       module 2 byte 0
0010       module 3 byte 0
Of course this works, except we need to swap the first byte to the end and shift the others down. So we just added some extra cycles of work, which makes it slower. Yep, on CPUs that do not support unaligned access, this is exactly what the software does already when it knows it will have to make an access like this.

But... let's do it in hardware, so let's add in some dedicated circuitry to do the swaps and shifts. Well, we need a sub-clock which runs at an even faster frequency. I thought our main clock was already fast; well, now we have to go even faster than it to keep up. You know, eventually our main clock is going to reach some electromagnetic limit in frequency because of inductive reactance (needing more voltage) and capacitance (causing cross-talk), so this sub-clock is going to run into the same problem. And reactance scales with frequency, so eventually you're going to approach the limit between the two...

So we are going to have to back off the main clock just to give the sub-clock more time, which in effect just made all memory access slower. So why not just have the programmer work with aligned data and enjoy high performance, and if needed have a separate mechanism with a performance penalty just for that one unaligned instruction, or, even simpler, just throw an exception and keep the cost and heat down (complexity)?

(For certain architectures...) You know, I just thought about the FSB being slower than the CPU frequency, so this translation circuit swapping the bytes would not have to run as fast as the CPU, since memory fetches can only occur at most twice during each cycle (or some multiple of it). So hell... now I am confused about why this approach does not work. So it must be the cache that is the reason unaligned memory access does not work in certain situations, because it will have fetches happening a lot faster. Basically, the same concept as the memory above, but for the cache (which is a type of memory, just closer to the CPU).

And, yeah, this could be wrong, because I cannot find it anywhere on the internet, but it is just the only understanding I can make of it.

And what is this doing in General Programming? By now we should be in the electrical engineering forum that we do not have, LOL.

<edit>
Forgot to add transistor on/off delay with inductive reactance and capacitance.
</edit>