MEMSET / MEMCPY Question

Kyretzn · Post by **Kyretzn** » Thu Sep 15, 2005 3:23 am

Ah, Kyretzn here for another episode of Whaa?

Anywho, I got told by the aforementioned friend that the MEMSET and MEMCPY functions have to be on a 4-byte boundary, or something? Something to do with a CPU limitation?

And the solution is apparently an ALIGN parameter to GCC.
Would any of you guys mind clarifying this for me?

Thanks!
~Kyretzn

Brendan · Post by **Brendan** » Thu Sep 15, 2005 4:40 am

Hi,

Kyretzn wrote: Anywho, I got told by the aforementioned friend that the MEMSET and MEMCPY functions have to be on a 4-byte boundary, or something? Something to do with a CPU limitation?

And the solution is apparently an ALIGN parameter to GCC.
Would any of you guys mind clarifying this for me?

The CPU will happily do MEMSET or MEMCPY on unaligned regions, but it is faster if the memory region is aligned. Usually alignment is used to improve performance only (there is no CPU limitation on any 80x86 CPU - other CPUs vary).
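To make "aligned" concrete, here is a minimal sketch, assuming a C11 compiler. The helper `is_aligned` and the array `buffer` are hypothetical names for illustration; `alignas` is the standard C11 way to request a boundary, which may or may not be what the "ALIGN parameter" in the question referred to.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is a multiple of `align`.
   `align` must be a power of two (4 for a dword boundary). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* C11 alignas asks the compiler to place this buffer on a 4-byte boundary. */
static alignas(4) unsigned char buffer[16];
```

With this, `is_aligned(buffer, 4)` holds while `is_aligned(buffer + 1, 4)` does not; memset or memcpy would still work on either address.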

However, it is possible to enable "alignment check exceptions" by setting the "AC" bit (bit 18) in EFLAGS, and the "AM" bit (bit 18) in CR0. With these bits set, any mis-aligned read/write will cause an alignment check exception (exception 17) if the CPU is in protected mode and running at CPL=3. IMHO alignment check should only be used for performance tuning during testing (not when end users are running the final code).
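The conditions above can be written down as a small predicate. This is only a sketch of the rule as described in this post, not code that touches the real registers: `alignment_check_active` and the macro names are hypothetical, and the caller is assumed to already have the CR0 and EFLAGS values and to be in protected mode.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_AM    (UINT32_C(1) << 18)  /* "AM" (alignment mask), bit 18 of CR0 */
#define EFLAGS_AC (UINT32_C(1) << 18)  /* "AC" (alignment check), bit 18 of EFLAGS */

/* Hypothetical predicate: would a mis-aligned access raise exception 17?
   Assumes protected mode; cpl is the current privilege level (0..3). */
static int alignment_check_active(uint32_t cr0, uint32_t eflags, unsigned cpl)
{
    assert(cpl <= 3);
    return (cr0 & CR0_AM) != 0 && (eflags & EFLAGS_AC) != 0 && cpl == 3;
}
```

Note that with both bits set the check still does not apply to kernel code running at CPL=0.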

Then there are the SSE (and SSE2) instructions, which often have 2 different opcodes for the same operation - one for aligned data and another for possibly mis-aligned data. If an SSE instruction expects aligned data but the data isn't aligned then you'll get a general protection exception (this can't be disabled - use an SSE instruction that doesn't expect aligned data instead). This is mainly for performance too, because SSE uses 128-bit (16-byte) values and an aligned 16-byte value never crosses a cache line boundary, which allows the CPU manufacturer to cut some corners for the SSE instructions that only work on aligned data (like loading data from RAM directly into the SSE register without polluting the CPU's cache on the way). Most code doesn't use SSE though (and it'd be a 16 byte boundary, not a 4 byte boundary)...
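As a rough illustration of the two flavours, the sketch below uses the SSE2 intrinsics from `<emmintrin.h>`: `_mm_load_si128`/`_mm_store_si128` expect 16-byte alignment, `_mm_loadu_si128`/`_mm_storeu_si128` accept any address. The function `copy16` is a hypothetical name, and the `memcpy` fallback is only there so the sketch still builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy exactly 16 bytes, picking the aligned or the
   unaligned SSE2 load/store depending on each pointer's alignment. */
static void copy16(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    __m128i v;
    if (((uintptr_t)src & 15u) == 0)
        v = _mm_load_si128((const __m128i *)src);   /* aligned form: faults on a mis-aligned address */
    else
        v = _mm_loadu_si128((const __m128i *)src);  /* unaligned form: any address */
    if (((uintptr_t)dst & 15u) == 0)
        _mm_store_si128((__m128i *)dst, v);
    else
        _mm_storeu_si128((__m128i *)dst, v);
#else
    memcpy(dst, src, 16);  /* assumption: plain fallback when SSE2 is unavailable */
#endif
}
```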

Depending on your compiler/optimizer/library, code to minimize mis-alignment may be generated. For example, MEMSET might use "rep stosb" for the first 0 to 3 bytes (until it's aligned) and then use "rep stosd" until it gets near the end, where another "rep stosb" covering the last 0 to 3 bytes might be used to finish it off (all depending on the size and location of the area being set). I'm not sure how common this is though...
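In C, that head/body/tail pattern might look roughly like the sketch below. `fill_aligned` and `word32` are hypothetical names, and this is not any particular library's memset. It assumes a 4-byte word and GCC or Clang, since the `may_alias` attribute is a compiler extension used here to make the word store legal on byte buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang extension so a 32-bit store may alias byte data. */
typedef uint32_t __attribute__((__may_alias__)) word32;

/* Hypothetical memset-like routine: bytes, then aligned dwords, then bytes. */
static void *fill_aligned(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;
    uint32_t word = (uint32_t)byte * UINT32_C(0x01010101); /* byte repeated 4 times */

    assert(dst != NULL || n == 0);

    /* Head: 0 to 3 single bytes until p is on a 4-byte boundary ("rep stosb"). */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }
    /* Body: one aligned 32-bit store per step ("rep stosd"). */
    while (n >= 4) {
        *(word32 *)p = word;
        p += 4;
        n -= 4;
    }
    /* Tail: the last 0 to 3 bytes ("rep stosb" again). */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

For very small sizes the extra checks can cost more than they save, which is the trade-off discussed later in the thread.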

Code to minimize mis-alignment is probably more typical for graphics code - for example drawing a horizontal line from (3,5) to (17, 5) in 256 colour mode, where aligning the destination isn't possible without producing wrong results.
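To put numbers on the horizontal line example: the sketch below splits a run of pixels into unaligned head bytes, aligned dword bytes and tail bytes. `split_span` and `struct span_split` are hypothetical names, and the offsets are assumed to be relative to a 4-byte-aligned framebuffer. The usage note further assumes a 320-byte pitch (as in the common 320x200 256 colour mode), which the post does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Byte counts for the three phases of an aligned fill. */
struct span_split {
    size_t head;  /* single bytes before the first 4-byte boundary */
    size_t body;  /* bytes written as whole aligned dwords */
    size_t tail;  /* single bytes after the last whole dword */
};

/* Hypothetical helper: split `len` bytes starting at byte offset `start`. */
static struct span_split split_span(size_t start, size_t len)
{
    struct span_split s;
    s.head = (4 - (start & 3)) & 3;       /* 0..3 bytes to reach alignment */
    if (s.head > len)
        s.head = len;                     /* run ends before the boundary */
    s.body = (len - s.head) & ~(size_t)3; /* largest multiple of 4 that fits */
    s.tail = len - s.head - s.body;
    assert(s.head + s.body + s.tail == len);
    return s;
}
```

For the line from (3,5) to (17,5) with a 320-byte pitch, the start offset is 5 * 320 + 3 = 1603 and the length is 15 pixels, so `split_span(1603, 15)` gives 1 head byte, 12 body bytes (3 dwords) and 2 tail bytes.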

Cheers,

Brendan

Solar · Post by **Solar** » Thu Sep 15, 2005 4:59 am

Brendan wrote: I'm not sure how common this is though...

All C libs I have checked while researching for my own PDCLib do this, at least optionally. (newlib, for example, offers a "PREFER_SIZE_OVER_SPEED" define and omits the alignment punting if that is defined.)

proxy · Post by **proxy** » Thu Sep 15, 2005 8:28 am

It's worth noting that some CPUs will not allow misaligned reads and writes. SPARC machines come to mind, and I've been bitten by this one.

proxy

Solar · Post by **Solar** » Thu Sep 15, 2005 8:43 am

Uh... you mean unaligned access on integers, right? As in, accessing a four-byte int at an unaligned address?

memcpy() and memset() are defined to work on char arrays, and as such work fine regardless of the alignment requirements of the host machine (really exotic ones notwithstanding).

The naive memcpy() below does work on a SUN E15k.

Code:

#include <assert.h>
#include <string.h>

void * copy( void * s1, const void * s2, int n )
{
   char * dest = (char *) s1;
   const char * src = (const char *) s2;
   while ( n-- )
   {
      *dest++ = *src++;
   }
   return s1;
}

int main()
{
   char dest[] = "0123";
   const char src[] = "xy";
   copy( dest + 1, src, 2 );
   assert( ! strcmp( dest, "0xy3" ) );
   return 0;
}
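
On the distinction drawn above between an unaligned int access and a byte-wise copy: one common way to read or write a four-byte value at an arbitrary address is to go through memcpy() instead of casting the pointer. A minimal sketch, where `load_u32` and `store_u32` are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any address.
   memcpy() has byte-wise semantics, so p needs no particular alignment.
   By contrast, *(const uint32_t *)p on an unaligned p is the kind of
   access that can trap on machines that require aligned integers. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write a 32-bit value to any address. */
static void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

For example, `store_u32(buf + 1, x)` followed by `load_u32(buf + 1)` returns `x` even though `buf + 1` is generally not on a 4-byte boundary.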

Pype.Clicker · Post by **Pype.Clicker** » Thu Sep 15, 2005 10:21 am

The thing is, you can do a memcpy much faster if you use something like rep movsd between items that are aligned on 32-bit boundaries. True. That's not always possible, however, as soon as you might wish to copy anything anywhere.

Solar · Post by **Solar** » Fri Sep 16, 2005 2:04 am

Pype.Clicker wrote: That's not always possible, however, as soon as you might wish to copy anything anywhere.

It is always possible. Below is the code from newlib. However, with small unaligned memory areas it can actually be slower, and in any case it adds to the code size of the function, which might not be an acceptable trade-off. (Which is why they made it optional.)

Code:

/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop.  */
#define BIGBLOCKSIZE    (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop.  */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)

_PTR
_DEFUN (memcpy, (dst0, src0, len0),
   _PTR dst0 _AND
   _CONST _PTR src0 _AND
   size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  _PTR save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  _CONST char *src = src0;
  long *aligned_dst;
  _CONST long *aligned_src;
  int   len =  len0;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare.  */
  if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}

Pype.Clicker · Post by **Pype.Clicker** » Fri Sep 16, 2005 3:48 am

Maybe I should have made clearer that "That" in "That is not always possible" was referring to the whole "make it faster by moving 32 bits at once rather than 8 bits at once".

You can of course manipulate unsigned longs (*src and *dst) so that they now refer to unaligned memory and happily keep using "*dst++=*src++" in your mainloop: you're just losing the benefit.

So basically what you wish is a builtin memcpy that is able to detect if its arguments are const or not, if they're aligned or not, if the size to copy is small or large or unknown and generate on the fly what's needed.

Candy · Post by **Candy** » Fri Sep 16, 2005 5:14 am

Pype.Clicker wrote: Maybe I should have made clearer that "That" in "That is not always possible" was referring to the whole "make it faster by moving 32 bits at once rather than 8 bits at once".

You can of course manipulate unsigned longs (*src and *dst) so that they now refer to unaligned memory and happily keep using "*dst++=*src++" in your mainloop: you're just losing the benefit.

So basically what you wish is a builtin memcpy that is able to detect if its arguments are const or not, if they're aligned or not, if the size to copy is small or large or unknown and generate on the fly what's needed.

Code:

template <typename arg1type, typename arg2type, bool aligned, int size, bool constness>
memcpy(...) {
.....
}

Ninja Rider · Post by **Ninja Rider** » Fri Sep 16, 2005 11:09 am

Clarification on 4-byte CPU limitations:

Pentium 4: 32-bit processor (32 bits, 4 bytes, your limitation)
Athlon 64: 64-bit processor (64 bits, 8 bytes)

ram
..................
12 13 14 15
8 9 10 11
4 5 6 7
0 1 2 3

The above is how the RAM addressing looks.
If you're moving a byte at a time then it wouldn't matter if the memory you're moving is aligned or not, although it would take a really long time to do that.

If you're moving a word (2 bytes) you would only notice if the start address is an odd number.

If you're moving a double word (4 bytes) then it would be best for it to be aligned.

If you're making a copy of an array into another part of memory, I think it would be best to use 2 pointers.
Set the first to the starting (source) memory address.
Set the second to the destination memory address.
Advance the pointers to the closest aligned memory address and start moving double words. Just keep that offset in mind.
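
A rough sketch of the two-pointer steps above, assuming 4-byte double words and GCC or Clang (the `may_alias` attribute is a compiler extension). `copy_aligned` and `word32` are hypothetical names. The sketch also shows the limit raised earlier in the thread: double word moves are only used when source and destination share the same offset within a double word, otherwise it falls back to bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang extension so 32-bit accesses may alias byte data. */
typedef uint32_t __attribute__((__may_alias__)) word32;

/* Hypothetical memcpy-like routine for non-overlapping regions. */
static void *copy_aligned(void *dst0, const void *src0, size_t n)
{
    unsigned char *dst = dst0;
    const unsigned char *src = src0;

    assert((dst0 != NULL && src0 != NULL) || n == 0);

    /* Both pointers can only become aligned together if they have the
       same offset within a 4-byte double word. */
    if ((((uintptr_t)dst ^ (uintptr_t)src) & 3u) == 0) {
        /* Move single bytes up to the closest aligned address. */
        while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
            *dst++ = *src++;
            n--;
        }
        /* Move aligned double words. */
        while (n >= 4) {
            *(word32 *)dst = *(const word32 *)src;
            dst += 4;
            src += 4;
            n -= 4;
        }
    }
    /* Remaining bytes, or the whole copy when the offsets differ. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
    return dst0;
}
```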

OSDev.org
