Hi everyone, I have a rather odd question for anyone willing to answer.
We all know that at the very basic level, C/C++ datatypes like int, char or long are really just ways of allocating a place in memory to store data, which is often abstracted into decimal numbers, strings and other useful stuff like that. We also know that you can only allocate a set number of bytes, ranging from 1 byte to 8 bytes, in a progression something like:
char - 1 byte
short int - 2 bytes
int - 4 bytes
double - 8 bytes
That is a very basic list and is just intended to illustrate the point. So, for the sake of argument, let's say I want some silly, arbitrary number of bytes like 6 or 3. I am a horrible person, aren't I? Now comes the question: performance-wise, what is more costly, padding on the fly in registers or just allocating the next size up (i.e. for 3 bytes I allocate 4, and so on)? I have a feeling it makes no difference, but I thought I'd ask anyway.
At Iðavoll met the mighty gods,
Shrines and temples they timbered high;
Forges they set, and they smithied ore,
Tongs they wrought, and tools they fashioned.
~Verse 9 of Völuspá
Venn wrote:
char - 1 byte
short int - 2 bytes
int - 4 bytes
double - 8 bytes
Although this is usually true, it's not actually guaranteed that these types are these sizes, especially on 64-bit platforms. If you need an integer to be a specific number of bytes, use (u)int8_t, (u)int16_t, (u)int32_t etc. from stdint.h. And, of course, double isn't an integer type... (IIRC, it doesn't have to be 8 bytes either.)
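For example, a minimal sketch of pinning the widths down with the fixed-width types (assuming a C99-or-later compiler that provides them):

Code:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The exact-width types are guaranteed to be exactly this wide
       (when they exist at all), unlike plain char/short/int/long. */
    uint8_t  a = 0;   /* exactly 1 byte  */
    uint16_t b = 0;   /* exactly 2 bytes */
    uint32_t c = 0;   /* exactly 4 bytes */
    uint64_t d = 0;   /* exactly 8 bytes */

    printf("%zu %zu %zu %zu\n",
           sizeof a, sizeof b, sizeof c, sizeof d);   /* prints 1 2 4 8 */

    /* The plain types are the ones that may vary by platform/compiler. */
    printf("int: %zu, long: %zu, double: %zu\n",
           sizeof(int), sizeof(long), sizeof(double));
    return 0;
}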
Venn wrote:
That is a very basic list and is just intended to illustrate the point. So, for the sake of argument, let's say I want some silly, arbitrary number of bytes like 6 or 3. I am a horrible person, aren't I? Now comes the question: performance-wise, what is more costly, padding on the fly in registers or just allocating the next size up (i.e. for 3 bytes I allocate 4, and so on)? I have a feeling it makes no difference, but I thought I'd ask anyway.
In general, it makes the most sense just to use a 4-byte integer to store 3 bytes. The processor only does reads and writes to memory in power-of-two chunks anyway (i.e. a three-byte load takes longer than a one-dword load), so you won't save any time by trying to load three bytes with bit shifts or something. If you need an integer to behave arithmetically like it has fewer bytes, just AND it with 0xFFFFFF (or whatever is appropriate) whenever it might overflow those artificial bounds.
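A minimal sketch of that masking trick (the constant, the helper name and the 24-bit width are just for illustration):

Code:
#include <stdint.h>
#include <stdio.h>

/* Emulate a 3-byte (24-bit) unsigned integer stored in an ordinary
   uint32_t: after any operation that might carry past bit 23, mask
   the value back into range. */
#define MASK24 0xFFFFFFu

static uint32_t add24(uint32_t a, uint32_t b)
{
    return (a + b) & MASK24;   /* wraps at 2^24, like a real 24-bit int would */
}

int main(void)
{
    uint32_t x = 0xFFFFFFu;                       /* largest 24-bit value */
    printf("%06X\n", (unsigned)add24(x, 1));      /* prints 000000 */
    return 0;
}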
There are rare cases where you do want to store integers as a non-power-of-two number of bytes. For example, if you have a lot of pixel data in 24-bit RGB, you would be wasting a quarter of your memory aligning each pixel value to a dword. In general, however, the compiler will work against you when you try to pack data like that, adding padding to force alignment (alignment being good for performance) unless you explicitly tell it not to.
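To make that padding visible, here is a hypothetical two-field struct (GCC/Clang attribute syntax; MSVC spells it #pragma pack) rather than the pixel case itself:

Code:
#include <stdio.h>
#include <stdint.h>

/* Default layout: the compiler typically inserts 3 bytes of padding
   after 'tag' so that 'value' lands on a 4-byte boundary, making
   sizeof usually 8. */
struct padded_hdr {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, sizeof is 5, but 'value' may now sit at
   an unaligned address and be slower to access. */
struct __attribute__((packed)) packed_hdr {
    uint8_t  tag;
    uint32_t value;
};

int main(void)
{
    printf("padded: %zu bytes, packed: %zu bytes\n",
           sizeof(struct padded_hdr), sizeof(struct packed_hdr));
    return 0;
}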
Although this is usually true, it's not actually guaranteed that these types are these sizes, especially on 64-bit platforms. If you need an integer to be a specific number of bytes, use (u)int8_t, (u)int16_t, (u)int32_t etc. from stdint.h. And, of course, double isn't an integer type... (IIRC, it doesn't have to be 8 bytes either.)
The datatype 'double' (like all the other floating-point data types) is a strange creature, and if I recall correctly it, too, depends on the processor and indeed the compiler for its size. I believe it has something to do with how much memory is actually allocated versus how much is actually used (pointing back to my question! lol) depending on precision and accuracy, utilizing something like 50-some bits with the rest being "breathing room". It's actually quite strange and hazy when one gets down to the bit-by-bit aspect of it.
But anyway, your answer does actually help me quite a bit, which brings me to the purpose of asking such a question. In C, I would probably (and rightfully) be laughed out of existence for trying to work with such arbitrary sizes, and of course, as you said, the compiler would fight me tooth and nail over such things. I've been contemplating writing a very simple (which will evolve into not-so-simple over time) programming language in which the user specifies variables with specific lengths. I came here because this place probably has a higher concentration of x86 assembly programmers than most others, aaaand in time I do wish to write a very basic bootloader or other low-level program with said language. Actually, the final, end purpose of the language is to be something you write kernels and other very low-level programs/libraries with. Why? Well...to learn, mostly. I don't plan on becoming the next Bill Gates or Linus Torvalds; I'm here simply to learn and create.
At Iðavoll met the mighty gods,
Shrines and temples they timbered high;
Forges they set, and they smithied ore,
Tongs they wrought, and tools they fashioned.
~Verse 9 of Völuspá
Venn wrote: The datatype 'double' (like all the other floating-point data types) is a strange creature, and if I recall correctly it, too, depends on the processor and indeed the compiler for its size. I believe it has something to do with how much memory is actually allocated versus how much is actually used (pointing back to my question! lol) depending on precision and accuracy, utilizing something like 50-some bits with the rest being "breathing room". It's actually quite strange and hazy when one gets down to the bit-by-bit aspect of it.
The double type's storage format almost always corresponds to the "binary64" format of the IEEE 754 standard, which is indeed 8 bytes in memory. The problem is that the C standard doesn't actually guarantee this format, so you can't completely rely on it. That said, any sane C compiler for x86 or x86_64 uses that format, because the x87 FPU can manipulate it natively, so in practice it doesn't vary between compilers or platforms. long double, on the other hand, can be anything from 80 to 128 bits in a variety of formats (although on x86 it's usually the x87's 80-bit internal format, stored in 96 or 128 bits in memory.)
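A quick way to see what your own toolchain does (output will vary by platform, which is rather the point; the IEC 559 check uses the standard-defined macro):

Code:
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Typical x86/x86_64 output: double is 8 bytes with a 53-bit
       mantissa (IEEE binary64); long double is 12 or 16 bytes in
       memory holding the x87 80-bit format with a 64-bit mantissa.
       None of this is guaranteed by the C standard, though. */
    printf("double:      %zu bytes, %d mantissa bits\n",
           sizeof(double), DBL_MANT_DIG);
    printf("long double: %zu bytes, %d mantissa bits\n",
           sizeof(long double), LDBL_MANT_DIG);

#ifdef __STDC_IEC_559__
    puts("Compiler claims IEC 60559 (IEEE 754) floating point.");
#endif
    return 0;
}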
FPs are always a pain to deal with. I've experienced particularly troublesome issues with their precision when writing a hockey simulator. Which reminds me: we all know that registers only come in 8-, 16-, 32- and 64-bit sizes, so weird sizes really only serve to conserve memory. How much extra work would the CPU really have to do to handle these sizes, and would it cancel out the memory savings? Shaving off 128 bytes' worth of unused bits in large, contiguous segments of memory like an array, vector or some other such thing would seem advantageous, but does the computational overhead outweigh it? Perhaps that is a better way of putting my question.
At Iðavoll met the mighty gods,
Shrines and temples they timbered high;
Forges they set, and they smithied ore,
Tongs they wrought, and tools they fashioned.
~Verse 9 of Völuspá
The key is in how the processor actually reads data from memory. For normal loads to general-purpose registers (I'm explicitly omitting SSE stuff here for simplicity,) the processor reads data in certain-size chunks along certain-size boundaries (x86 has a 64-bit data bus IIRC, so this size is 8 bytes.) If you read from address 0x1000 into register EAX, for example, it does a single read of the bytes from 0x1000 to 0x1007 and stores the bytes from 0x1000 to 0x1003 directly in EAX. If you want to read address 0x1001 into AL, it still reads the bytes from 0x1000 to 0x1007, but only stores the byte at 0x1001 in the register (this shifting has no overhead.) However, if you want to read address 0x0FFF into EAX, the processor must read the bytes from 0x0FF8 to 0x1007--two reads--and then only use the ones from 0x0FFF to 0x1002. This is why the compiler tries its hardest to align things within data structures, why the heap returns word-aligned pointers, and why the stack is word-aligned: there is a significant overhead to unaligned reads/writes. On RISC architectures, the processor may even refuse to do unaligned memory operations.
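In C terms (rather than raw addresses), a sketch of the portable way to read a value that might not be aligned; the buffer and offsets here are made up:

Code:
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Read a 32-bit value from an arbitrary byte offset in a buffer.
   A direct cast like *(uint32_t *)(buf + off) assumes the address is
   suitably aligned (and breaks aliasing rules); going through memcpy
   lets the compiler emit whatever load -- possibly a slower unaligned
   one -- is correct for the target. */
static uint32_t load_u32(const unsigned char *buf, size_t off)
{
    uint32_t v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}

int main(void)
{
    unsigned char buf[16] = {0};
    buf[1] = 0xEF; buf[2] = 0xBE; buf[3] = 0xAD; buf[4] = 0xDE;  /* unaligned dword */
    printf("%08X\n", (unsigned)load_u32(buf, 1));  /* DEADBEEF on little-endian */
    return 0;
}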
This is also why doing memory operations with wider types is faster: they use the full width of the data bus. SSE is the best for this, but requires 16-byte alignment. The compiler will usually try to do this sort of thing for you.
The other issue is that if you have data that is completely packed and of odd sizes, you have to perform extra reads and writes. Imagine there is a 1-byte field starting at 0x1000 and a 3-byte field starting at 0x1001. To write the contents of EBX to the 3-byte field, you would have to load the dword at 0x1000 into EAX, left-shift EBX by 8, AND EAX with 0x000000FF, OR EBX into EAX, then store EAX back at 0x1000. Even though this is an aligned read and an aligned write, it is still clearly slow. You could also do three byte-width writes, or one byte-width plus one word-width write, but I don't know whether that's any faster, and it is clearly slower than a single write.
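Roughly the same sequence in C (a sketch only; the field layout matches the hypothetical one above, with the 1-byte field at offset 0 and the 3-byte field at offsets 1-3, little-endian):

Code:
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Buffer layout: byte 0 is a 1-byte field, bytes 1..3 are a packed
   3-byte field. Writing the 3-byte field turns into a read-modify-write
   of the whole 32-bit word that contains it. */
static void write_24bit_field(unsigned char *buf, uint32_t value)
{
    uint32_t word;
    memcpy(&word, buf, sizeof word);        /* load the containing dword  */
    word &= 0x000000FFu;                    /* keep only the 1-byte field */
    word |= (value & 0x00FFFFFFu) << 8;     /* merge in the 3-byte value  */
    memcpy(buf, &word, sizeof word);        /* store the dword back       */
}

int main(void)
{
    unsigned char buf[4] = {0xAA, 0, 0, 0};
    write_24bit_field(buf, 0x123456);
    printf("%02X %02X %02X %02X\n", buf[0], buf[1], buf[2], buf[3]);
    /* expected on little-endian: AA 56 34 12 */
    return 0;
}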
This is all ignoring the cache, but that's a whole other can of worms.
Again though, if you really need to save space, or are only using the format for storage (where access speed is less important,) it may make sense to use packed structures. From a cache standpoint, using less memory is also better, which could make up for some of the speed loss; for example, it would actually be faster to use packed structures if all you were doing was copying the data from one place to another with memcpy(), since memcpy() only does aligned accesses. The only place I can think of where this applies is the 24-bit pixel data I mentioned.
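For example, a hypothetical size comparison for the pixel case (just showing the storage saving; the copy isn't timed here):

Code:
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* 24-bit pixels stored packed (3 bytes each) vs. padded out to a dword. */
typedef struct { uint8_t r, g, b; }       pixel24;  /* 3 bytes, no padding needed */
typedef struct { uint8_t r, g, b, _pad; } pixel32;  /* 4 bytes, dword-sized       */

#define NPIXELS 1024

int main(void)
{
    static pixel24 src[NPIXELS], dst[NPIXELS];

    /* As a pure storage format the packed array is simply 25% smaller,
       and a bulk memcpy() doesn't care how the individual elements are
       laid out inside it. */
    memcpy(dst, src, sizeof src);

    printf("packed: %zu bytes, padded: %zu bytes\n",
           sizeof(pixel24) * NPIXELS, sizeof(pixel32) * NPIXELS);
    return 0;
}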
Most of that kind of data manipulation is done by the GPU, which is specifically designed to handle such operations efficiently. So it seems there really are no serious benefits to an "only as much as you need" approach to memory allocation for variables outside of the graphics niche, due to the additional read/write operations needed to handle the odd bytes. Having done more research on structure packing, it seems it would actually be undesirable, and I would get better overall performance by keeping data well aligned. How the data is managed in memory is probably the biggest performance booster. Just an FYI so everyone knows what I'm driving at: my development is focused on Beowulf cluster applications.
At Iðavoll met the mighty gods,
Shrines and temples they timbered high;
Forges they set, and they smithied ore,
Tongs they wrought, and tools they fashioned.
~Verse 9 of Völuspá