structure padding and data alignment
why is structure padding needed?
can someone explain in detail what data alignment does?
- thepowersgang
- Member
- Posts: 734
- Joined: Tue Dec 25, 2007 6:03 am
- Libera.chat IRC: thePowersGang
- Location: Perth, Western Australia
- Contact:
Re: structure padding and data alignment
*sniff* smells like homework... but I'll bite.
Padding is needed to align data to its "natural size". In most cases, you can only read/write memory at aligned addresses (e.g. you can only read/write a 32-bit value when the bottom two bits of the address are zero). If you attempt to read a non-aligned address, the processor will either 1) fault (which is what ARM CPUs tend to do, and what x86 does if you enable alignment checking via the AC flag), 2) ignore the lower bits and read memory you didn't want, or 3) emulate it (by reading the data byte-by-byte, which is far slower than an aligned access).
Because of the above, compilers will usually add padding to structures to ensure that each member is aligned correctly (avoiding either an alignment fault or a very expensive unaligned access).
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc
Re: structure padding and data alignment
Please see my other post about DMA word transfers too. I'm still not sure why data alignment is necessary... isn't it just starting at an address divisible by the number of bytes you are writing? What is its significance, in a little more detail? How do the CPU registers play a role in this? I also heard that it is used for bus efficiency... please explain?
Last edited by icealys on Thu Feb 27, 2014 10:54 pm, edited 1 time in total.
- thepowersgang
- Member
- Posts: 734
- Joined: Tue Dec 25, 2007 6:03 am
- Libera.chat IRC: thePowersGang
- Location: Perth, Western Australia
- Contact:
Re: structure padding and data alignment
/me points to the Forum Rules, Required Knowledge, and general courtesy.
That's exactly what I said. All accesses should be aligned to the size of the data you're manipulating (which is what the compiler does for you).
Kernel Development, It's the brain surgery of programming.
Acess2 OS (c) | Tifflin OS (rust) | mrustc - Rust compiler
Currently Working on: mrustc
Re: structure padding and data alignment
I'm not exactly sure how 8, 16, 24, 32 and 64-bit words are aligned... so you say the bottom two bits of the address have to be 0 for a 32-bit value... what about 8, 16, 24 and 64? Why does the processor fault or take two reads instead of one if the data is not aligned? Are there certain sections of RAM reserved for 8, 16, 24, 32 and 64-bit words?
Re: structure padding and data alignment
Hi,
icealys wrote:i'm not exactly sure how 8,16,24,32,64 words are aligned...so you say the bottom two bits of the address have to be 0 for a 32 bit address...what about 8,16,24,64?
For 80x86, the CPU only expects "natural alignment". A 16-bit value consumes 2 bytes of RAM, and is aligned if it doesn't cross a 2-byte boundary (i.e. the address is even). A 32-bit value consumes 4 bytes of RAM, and is aligned if it doesn't cross a 4-byte boundary (i.e. the address is a multiple of 4). A 64-bit value consumes 8 bytes of RAM, and is aligned if it doesn't cross an 8-byte boundary (i.e. the address is a multiple of 8).
For other CPUs the conditions for alignment can be more severe. For example, "aligned on a 4-byte boundary" might be the minimum alignment a CPU supports, even for smaller data.
icealys wrote:why does the processor fault or take two reads instead of one if the data is not aligned?
CPUs typically have caches, and work with "cache lines". If a 16-bit piece of data is not aligned, then half of it can be in one cache line while the other half is in a different cache line. This can be expensive to deal with in hardware, especially when you consider all the corner cases (e.g. what if one cache line is in RAM and the other cache line is in ROM?), and a lot of the hassle is determining whether there's a problem at all. The fastest way to determine whether there might be a problem (which is different from determining whether there actually is one) is to test the lowest bits of the address. For example, for a 32-bit read or write, if the lowest 2 bits of the address are nonzero then the access is misaligned, and you have to do more costly tests to determine whether the data is split across 2 different cache lines or not.
Of course the simplest way for a CPU to handle this is to not support it in the first place - if the data is not aligned, generate a trap/exception so that the CPU doesn't have to worry about it.
icealys wrote:are there certain sections of RAM reserved for 8,16,24,32,64 words?
No.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: structure padding and data alignment
thanks for your replies. Does the cache line only work with getting data from RAM?
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: structure padding and data alignment
It looks like "learn to read" applies to you more often than appreciated:
icealys wrote:Does the cache line only work with getting data from RAM?
Brendan wrote:e.g. what if one cache line is in RAM and the other cache line is in ROM?
Other than it being obvious if you were aware of how caches are meant to work.
Re: structure padding and data alignment
so basically if it does work with components other than RAM, the cache should be flushed right?
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: structure padding and data alignment
A cache is only defined to hold a number of copies of some originals, where the copy is more quickly accessible than its original.
Homework question:
1: When would you need to remove items from a cache?
Re: structure padding and data alignment
Combuster wrote:A cache is only defined to hold a number of copies of some originals, where the copy is more quickly accessible than its original.
In the case of a write with a write-allocate copy-back cache, the cache contains the original, and while it is inconsistent with the next level in the memory hierarchy there are no copies.
Every universe of discourse has its logical structure --- S. K. Langer.
Re: structure padding and data alignment
The cache I do not know about.
But the interface between the CPU and memory is a bit different than you would expect. I used to know a great example of this, but have since lost it. The guys toward the end of this thread start to touch on exactly why; it is different for different hardware:
http://stackoverflow.com/questions/3655 ... d-boundary
This is SDRAM, but it starts to show you why the CPU cannot access an arbitrary address without performing two reads or writes:
http://en.wikipedia.org/wiki/Synchronou ... ess_memory
Basically, even though you might have 32 effective address pins, you can't read across multiple rows; it is very much the same problem the guys above are talking about with the cache. The memory is divided into rows and columns, and you have to preselect a row and then read from it (at least with SDRAM). Like I said, I swear I used to know the easy way to explain it, but I have forgotten.
Just know that the limitation (aside from the cache) is actually with the memory chips.
This guy here basically touched right on the reason. If you want to really dig down, you're going to have to follow the pin-out of the RAM module and the timing diagram of its signals:
Word alignment is not only a feature of CPUs.
On the hardware level, most RAM modules have a given word size, in respect to the number of bits that can be accessed per read/write cycle.
On a module I had to interface with on an embedded device, addressing was implemented through three parameters: the module was organized into four banks which could be selected prior to the R/W operation, and each of these banks was essentially a large table of 32-bit words which could be addressed through a row and a column index.
In this design, access was only possible per cell, so every read operation returned 4 bytes and every write operation expected 4 bytes.
A memory controller hooked up to this RAM chip could be designed in two ways: either allowing unrestricted access to the memory chip, using several cycles to split/merge unaligned data to/from several cells (with additional logic), or imposing some restrictions on how memory can be accessed, with the gain of reduced complexity.
As complexity can impede maintainability and performance, most designers choose the latter.
Re: structure padding and data alignment
I got a good one. =)
Imagine 4 RAM chips installed in your motherboard. Now access the boundary between two of them. What is going to happen?
As in, make a 32-bit/64-bit read (whatever they support as their channel width) with the middle of it between them. Are they both going to respond at the same time? (Not possible.) Or is the CPU or memory controller going to issue two reads, one to each separately? Now imagine each RAM chip having individual modules inside of it (which they do). Same problem. Even if they both responded, one would have to wait for the other, so it would be like two reads, and the CPU has to be prepared for a multiple response, which adds complexity and latency to the hardware (including the latency of the extra transistors switching on and off). But anyway, that is basically the problem in a nutshell.
Oh, and just in case someone thinks "why not just stagger the read across multiple modules, so each module responds with one byte?" - well, say a read at 0x0 does just that. Now try to read a 32-bit value at 0x1.
Code: Select all
|address |ram module/physical-chip/stick
0000 module 0 (picks byte 0 in its array of bytes)
0000 module 1 (picks byte 0 in its array of bytes)
0000 module 2 (picks byte 0 in its array of bytes)
0000 module 3 (picks byte 0 in its array of bytes)
0001 module 0 (picks byte 1 in its array of bytes)
0001 module 1 (picks byte 1 in its array of bytes)
0001 module 2 (picks byte 1 in its array of bytes)
0001 module 3 (picks byte 1 in its array of bytes)
Did not turn out like you wanted?
So you say we can do better, right? Why not add some complexity to the chip (first problem) and have it decide more intelligently. So let's try it again, the "smart" way:
Code: Select all
0001 module 0 (picks byte 1 in its array of bytes)
0001 module 1 (picks byte 0 in its array of bytes)
0001 module 2 (picks byte 0 in its array of bytes)
0001 module 3 (picks byte 0 in its array of bytes)
Code: Select all
0010 module 0 byte 1
0010 module 1 byte 1
0010 module 2 byte 0
0010 module 3 byte 0
Of course this works, except we need to swap the first byte to the end and shift the others down. So we just added some extra cycles of work, which makes it slower. Yep - on CPUs that do not support unaligned access, this is exactly what the software does already when it knows it will have to make an access like this.
But... let's do it in hardware, so let's add in some dedicated circuitry to do the swaps and shifts. Well, then we need a sub-clock which runs at an even faster frequency. I thought our main clock was already fast; now we have to go even faster than it to keep up. Eventually our main clock is going to reach some electromagnetic limit in frequency because of inductive reactance (needing more voltage) and capacitance (causing crosstalk), so this sub-clock is going to run into the same problem. And reactance scales exponentially, so eventually you're going to approach the limit between the two...
So we are going to have to back off the main clock just to give the sub-clock more time, which in theory just made all memory access slower. So why not just have the programmer work with aligned data and enjoy high performance, and if needed have a separate mechanism with a performance penalty just for that one unaligned instruction - or, even simpler, just throw an exception and keep the cost and heat down (complexity).
(For certain architectures...) You know, I just thought about the FSB frequency being lower than the CPU frequency, so this translation circuit swapping the bytes would not have to run as fast as the CPU, since memory fetches can only occur at most twice during each cycle (or some multiple of it). So hell... now I am confused why this approach does not work. So it must be the cache that is the reason unaligned memory access does not work in certain situations, because its fetches happen a lot faster. Basically the same concept as the memory above, but for the cache (which is a type of memory, just closer to the CPU).
And yeah, this could be wrong, because I cannot find it anywhere on the internet, but it is the only understanding I can make of it.
And what is this doing in General Programming - by now we should be in the electrical engineering forum that we do not have, LOL.
<edit>
Forgot to add transistor on/off delay with inductive reactance and capacitance.
</edit>