MEMSET / MEMCPY Question
Ah, Kyretzn here for another episode of Whaa?
Anywho, I got told by the aforementioned friend that the MEMSET and MEMCPY functions have to be on a 4-byte boundary, or something? Something to do with a CPU limitation?
And the solution is apparently an ALIGN parameter to GCC.
Would any of you guys mind clarifying this for me?
Thanks!
~Kyretzn
Re:MEMSET / MEMCPY Question
Hi,
Kyretzn wrote: Anywho, I got told by the aforementioned friend that the MEMSET and MEMCPY functions have to be on a 4-byte boundary, or something? Something to do with a CPU limitation?
And the solution is apparently an ALIGN parameter to GCC.
Would any of you guys mind clarifying this for me?
The CPU will happily do MEMSET or MEMCPY on unaligned regions, but it is faster if the memory region is aligned. Usually alignment is used to improve performance only (there is no CPU limitation on any 80x86 CPU - other CPUs vary).
However, it is possible to enable "alignment check exceptions" by setting the "AC" bit (bit 18) in EFLAGS, and the "AM" bit (bit 18) in CR0. With these bits set, any mis-aligned read/write will cause an alignment check exception (exception 17) if the CPU is in protected mode and running at CPL=3. IMHO alignment check should only be used for performance tuning during testing (not when end users are running the final code).
Then there's SSE (and SSE2) instructions which often have 2 different opcodes for the same instruction - one for aligned data and another for possibly mis-aligned data. If an SSE instruction expects aligned data but the data isn't aligned then you'll get a general protection exception (this can't be disabled - use an SSE instruction that doesn't expect aligned data instead). This is mainly for performance too: SSE uses 128-bit values, and a 16-byte aligned value never crosses a cache line boundary, which allows the CPU manufacturer to cut some corners for the SSE instructions that only work on aligned data (like the 128-bit non-temporal stores, which write data out without polluting the CPU's cache on the way). Most code doesn't use SSE though (and it'd be a 16-byte boundary, not a 4-byte boundary)...
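To make the aligned/unaligned opcode pair a bit more concrete, here is a rough sketch using the SSE2 intrinsics from `<emmintrin.h>`. The helper name `copy16_to_aligned` is made up for this example, and the non-SSE2 fallback is only there so the snippet builds elsewhere. `_mm_loadu_si128` accepts any address, while `_mm_store_si128` is the variant that faults on a mis-aligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy exactly 16 bytes.
   dst must be 16-byte aligned; src may sit at any address. */
static void copy16_to_aligned(void *dst, const void *src)
{
    /* Catch the mistake early instead of taking a #GP on the aligned store. */
    assert(((uintptr_t)dst & 15u) == 0);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)src); /* unaligned load is fine  */
    _mm_store_si128((__m128i *)dst, v);                /* needs 16-byte alignment */
#else
    memcpy(dst, src, 16);                              /* fallback without SSE2   */
#endif
}
```

One way to get a suitably aligned destination is C11's `_Alignas(16) unsigned char out[16];` (GCC also accepts `__attribute__((aligned(16)))`), which is probably the kind of "ALIGN" setting the original question was about.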
Depending on your compiler/optimizer/library, code to minimize mis-alignment may be generated. For example, MEMSET might use "rep stosb" for the first 0 to 3 bytes (until it's aligned), then use "rep stosd" until it gets near the end, and then use another "rep stosb" for the last 0 to 3 bytes to finish it off (all depending on the size and location of the area being set). I'm not sure how common this is though...
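In plain C the head/body/tail idea might look roughly like the sketch below. `fill_aligned` is a made-up name, not any library's real memset, and the 4-byte word size is assumed to match the "rep stosd" case. The body uses a fixed-size memcpy for each word store so the example stays within the language rules; compilers typically turn that into a single 32-bit store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of an alignment-aware fill (not a drop-in memset). */
static void *fill_aligned(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char  b = (unsigned char)value;
    uint32_t    word = 0x01010101u * b;   /* same byte in all four positions */

    assert(dst != NULL || n == 0);

    /* Head: 0 to 3 single bytes until p sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = b;
        n--;
    }
    /* Body: whole 32-bit words at aligned addresses. */
    while (n >= 4) {
        memcpy(p, &word, 4);
        p += 4;
        n -= 4;
    }
    /* Tail: the remaining 0 to 3 bytes. */
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```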
Code to minimize mis-alignment is probably more typical for graphics code - for example drawing a horizontal line from (3,5) to (17,5) in 256 colour mode, where you can't simply move the destination to an aligned address without producing wrong results.
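For the horizontal line example, the split into unaligned edges and an aligned middle can be worked out from the pixel offset alone. This is only a sketch under assumptions that are not in the post: one byte per pixel, a frame buffer whose base is 4-byte aligned, and a `pitch` (bytes per row) passed in by the caller. `split_hline` and `struct span_split` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* How many bytes go to each phase of the fill. */
struct span_split {
    size_t head;   /* single bytes before the first 4-byte boundary */
    size_t words;  /* whole aligned 32-bit words                     */
    size_t tail;   /* single bytes after the last whole word         */
};

/* Hypothetical helper: split the span x0..x1 (inclusive) on row y. */
static struct span_split split_hline(size_t pitch, size_t x0, size_t x1, size_t y)
{
    struct span_split s;
    size_t offset, len;

    assert(x0 <= x1);
    offset = y * pitch + x0;        /* byte offset of the first pixel */
    len    = x1 - x0 + 1;           /* pixels (= bytes) to draw       */

    s.head = (4 - (offset & 3)) & 3;
    if (s.head > len)
        s.head = len;               /* very short lines never get aligned */
    s.words = (len - s.head) / 4;
    s.tail  = (len - s.head) % 4;
    return s;
}
```

With a pitch of 320 bytes (an assumed value), the line from (3,5) to (17,5) starts at offset 1603 and is 15 pixels long, so it splits into 1 head byte, 3 aligned words and 2 tail bytes.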
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re:MEMSET / MEMCPY Question
Brendan wrote: I'm not sure how common this is though...
All C libs I have checked while researching for my own PDCLib do this, at least optionally. (newlib, for example, offers "PREFER_SIZE_OVER_SPEED" and omits the alignment punting if that is defined.)
Every good solution is obvious once you've found it.
Re:MEMSET / MEMCPY Question
It's worth noting that some CPUs will not allow misaligned reads and writes. SPARC machines come to mind, and I've been bitten by this one.
proxy
Re:MEMSET / MEMCPY Question
Uh... you mean unaligned access on integers, right? As in, accessing a four-byte int at an unaligned address?
memcpy() and memset() are defined to work on char arrays, and as such work fine regardless of the alignment requirements of the host machine (really exotic ones notwithstanding).
The naive memcpy() below does work on a SUN E15k.
Code:
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive byte-wise copy: it only ever touches chars, so alignment never matters. */
void * copy( void * s1, const void * s2, size_t n )
{
    char * dest = (char *) s1;
    const char * src = (const char *) s2;
    while ( n-- )
    {
        *dest++ = *src++;
    }
    return s1;
}

int main( void )
{
    char dest[] = "0123";
    const char src[] = "xy";
    /* dest + 1 is deliberately not word-aligned. */
    copy( dest + 1, src, 2 );
    assert( ! strcmp( dest, "0xy3" ) );
    return 0;
}
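The case that does bite on strict-alignment machines is reading a multi-byte integer through a cast pointer, e.g. `*(uint32_t *)(buf + 1)`. A minimal sketch of the usual workaround is to let memcpy() do the byte-wise work; `load_u32` is just a name picked for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value (host byte order)
   from an address that may not be 4-byte aligned. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* defined on bytes, so no alignment trap */
    return v;
}
```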
- Pype.Clicker
Re:MEMSET / MEMCPY Question
The thing is, you can do a memcpy much faster if you use something like rep movsd between items that are aligned on 32-bit boundaries. True. That's not always possible, however, as soon as you might wish to copy anything anywhere.
Re:MEMSET / MEMCPY Question
Pype.Clicker wrote: That's not always possible, however, as soon as you might wish to copy anything anywhere.
It is always possible. Below is the code from newlib. However, with small unaligned memory areas it can actually be slower, and in any case it adds to the code size of the function, which might not be an acceptable trade-off. (Which is why they made it optional.)
Code:
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop. */
#define BIGBLOCKSIZE (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop. */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshold for punting to the byte copier. */
#define TOO_SMALL(LEN) ((LEN) < BIGBLOCKSIZE)

_PTR
_DEFUN (memcpy, (dst0, src0, len0),
        _PTR dst0 _AND
        _CONST _PTR src0 _AND
        size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  _PTR save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  _CONST char *src = src0;
  long *aligned_dst;
  _CONST long *aligned_src;
  int len = len0;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop. This should be rare. */
  if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible. */
      while (len >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible. */
      while (len >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len -= LITTLEBLOCKSIZE;
        }

      /* Pick up any residual with a byte copier. */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}
- Pype.Clicker
Re:MEMSET / MEMCPY Question
Maybe I should have made it clearer that "That" in "That's not always possible" was referring to the whole "make it faster by moving 32 bits at once rather than 8 bits at once".
You can of course manipulate unsigned longs (*src and *dst) so that they now refer to unaligned memory and happily keep using "*dst++=*src++" in your main loop: you're just losing the benefit.
So basically what you wish for is a builtin memcpy that is able to detect if its arguments are const or not, if they're aligned or not, if the size to copy is small or large or unknown, and generate on the fly what's needed.
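GCC and Clang already treat memcpy as a builtin and commonly expand small fixed-size calls inline, so part of that wish is granted without any effort. Purely as an illustration of dispatching on "is the size known at compile time", here is a sketch using the GCC/Clang extension `__builtin_constant_p`. The `COPY` macro, the `copy_bytes` helper and the 16-byte threshold are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: plain byte loop, easy for the compiler to unroll
   when n is a small compile-time constant. */
static inline void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(dst != NULL && src != NULL);
    while (n--)
        *d++ = *s++;
}

#if defined(__GNUC__)
/* Small constant sizes take the inline loop, everything else calls memcpy. */
#define COPY(dst, src, n) \
    ((__builtin_constant_p(n) && (n) <= 16) \
        ? copy_bytes((dst), (src), (n)) \
        : (void)memcpy((dst), (src), (n)))
#else
#define COPY(dst, src, n) ((void)memcpy((dst), (src), (n)))
#endif
```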
Re:MEMSET / MEMCPY Question
Pype.Clicker wrote: Maybe I should have made it clearer that "That" in "That's not always possible" was referring to the whole "make it faster by moving 32 bits at once rather than 8 bits at once".
You can of course manipulate unsigned longs (*src and *dst) so that they now refer to unaligned memory and happily keep using "*dst++=*src++" in your main loop: you're just losing the benefit.
So basically what you wish for is a builtin memcpy that is able to detect if its arguments are const or not, if they're aligned or not, if the size to copy is small or large or unknown, and generate on the fly what's needed.
Code:
template <typename arg1type, typename arg2type, bool aligned, int size, bool constness>
memcpy(...) {
    .....
}
Re:MEMSET / MEMCPY Question
Clarification on 4-byte CPU limitations.
Pentium 4: 32-bit processor (32 bits = 4 bytes, the "limitation" you were told about)
Athlon 64: 64-bit processor (64 bits = 8 bytes)
RAM
..................
12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3
The above is how the RAM addressing looks.
If you're moving a byte at a time then it wouldn't matter if the memory you're moving is aligned or not, although it would take a really long time to do that.
If you're moving a word (2 bytes) you would only notice if the start address is an odd number.
If you're moving a double word (4 bytes) then it would be best for it to be aligned.
If you're making a copy of an array into another part of memory, I think it would be best to use 2 pointers.
Set the first to the starting (source) memory address.
Set the second to the destination memory address.
Set the pointers to the closest aligned memory address and start moving double words. Just keep that offset in mind.
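A rough C sketch of the two-pointer approach described above, under a few assumptions of mine: 4-byte double words, a byte-wise fallback when the two pointers cannot be aligned together, and fixed-size memcpy calls standing in for the double-word moves (compilers usually reduce those to a single load and store). `copy_words` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy n bytes, using 4-byte moves where possible. */
static void *copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;          /* second pointer: destination */
    const unsigned char *s = src;    /* first pointer: source       */

    assert((dst != NULL && src != NULL) || n == 0);

    /* Word moves only help if both pointers share the same misalignment. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Copy the leading bytes until both pointers are aligned. */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Move aligned double words. */
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    /* Leftover bytes, or the whole copy if alignment was not possible. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```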