OSDev.org

Posted: **Fri Dec 19, 2008 2:37 am**

Hi,

LoseThos wrote: Suppose you have two cores and a global memory variable which is cached. They both have accessed it recently, so each core has it in its local cache. Core #0 changes it. Core #0 can write-back invalidate its cache, writing everything out of cache. (Just pushing one value from cache to mem would be good, but WbInvd does the job.) Now, you have core #1... how do you fetch the updated value sitting in memory when you have a different value in your local cache?

For 80x86, if the value is in both CPUs' caches then the cache line would be in the "shared" state. If the first CPU wants to modify the cache line, then it has to tell the other CPU/s to invalidate their copy of the cache line first (to get the cache line into the "exclusive" state), and after the first CPU has modified the cache line it would be set to the "modified" state in the first CPU. Now, if the other CPU wants to read that cache line again it'd look in its cache and find nothing (as the cache line was invalidated) and it'd need to ask for the cache line again. The current copy of the cache line would be sent from the first CPU to the other CPU and the cache line would return to the "shared" state. You would not get a stale value from RAM.

Basically what I'm saying is that the hardware uses MESI cache states to ensure that all caches remain coherent, and an instruction that bypasses the MESI states and gets a value directly from RAM would usually get a wrong/stale value from RAM (instead of getting the right/current value from wherever it currently is).
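To make the state transitions concrete, here is a toy C model of one cache line shared by two CPUs. It is only a sketch of the sequence described above: the names (`line_model`, `cpu_read`, `cpu_write`) are made up for illustration, real hardware does this in the cache controllers rather than in software, and details such as whether a modified line is also written back to RAM on a snoop hit vary between implementations.

```c
/* Toy model of MESI for ONE cache line and TWO CPUs.
   All names here are illustrative; nothing below is a hardware API. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    mesi_state state[2];  /* per-CPU state of the line */
    int value[2];         /* per-CPU cached copy of the data */
    int ram;              /* copy in main memory (may be stale) */
} line_model;

/* CPU `cpu` reads the line and returns the value it observes. */
static int cpu_read(line_model *m, int cpu)
{
    int other = 1 - cpu;
    if (m->state[cpu] == INVALID) {            /* miss: must ask for the line */
        if (m->state[other] == MODIFIED)
            m->ram = m->value[other];          /* owner supplies the current data */
        if (m->state[other] != INVALID) {
            m->state[other] = SHARED;          /* both copies end up shared */
            m->state[cpu] = SHARED;
        } else {
            m->state[cpu] = EXCLUSIVE;         /* nobody else has it */
        }
        m->value[cpu] = m->ram;
    }
    return m->value[cpu];
}

/* CPU `cpu` writes the line: the other copy is invalidated first. */
static void cpu_write(line_model *m, int cpu, int v)
{
    int other = 1 - cpu;
    if (m->state[other] == MODIFIED)
        m->ram = m->value[other];              /* write back before giving it up */
    m->state[other] = INVALID;
    m->value[cpu] = v;
    m->state[cpu] = MODIFIED;                  /* RAM is deliberately left stale */
}
```

In this model, after CPU 0 writes, `ram` still holds the old value, yet a read by CPU 1 returns the new one because the data comes from CPU 0's modified copy. An instruction that went straight to RAM would see the stale value.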

Cheers,

Brendan

Posted: **Fri Dec 19, 2008 11:07 am**

I guess it's about time I chime in here. I'd really like to see some hardware support for RNGs, encryption, and one-way hashes. MD5, the SHA family, and AES would be great. VIA at least does have hardware support for AES, but I think it's limited to 128-bit keys. Oh, let's throw in CRC32 support as well, since Intel finally introduced that with SSE4.2.

Posted: **Fri Dec 19, 2008 12:47 pm**

Ever tried implementing randomness from completely nonrandom components?

As for MD5, SHA and AES, they're too complex to do with instructions. You're better off with some kind of coprocessor which you tell to DMA over a region and do the work. CRC32 is simple enough that you could possibly implement it in a CPU as an instruction you call for each byte (or word, or long).

Posted: **Fri Dec 19, 2008 1:10 pm**

Brendan wrote: For 80x86, if the value is in both CPUs' caches then the cache line would be in the "shared" state. If the first CPU wants to modify the cache line, then it has to tell the other CPU/s to invalidate their copy of the cache line first (to get the cache line into the "exclusive" state), and after the first CPU has modified the cache line it would be set to the "modified" state in the first CPU. Now, if the other CPU wants to read that cache line again it'd look in its cache and find nothing (as the cache line was invalidated) and it'd need to ask for the cache line again. The current copy of the cache line would be sent from the first CPU to the other CPU and the cache line would return to the "shared" state. You would not get a stale value from RAM.

Basically what I'm saying is that the hardware uses MESI cache states to ensure that all caches remain coherent, and an instruction that bypasses the MESI states and gets a value directly from RAM would usually get a wrong/stale value from RAM (instead of getting the right/current value from wherever it currently is).

Oh, that's what the CLFLUSH instruction is good for. Umm... each core has a nonshared L1 cache and a shared L2 cache. So... are you saying the L1 caches are connected to each other?

Posted: **Fri Dec 19, 2008 1:50 pm**

Yes, the caches all "snoop" each other (that's actually a technical term). But if we're dreaming, I also want to do away with MESI, and add a 5th state to it. "Shared from a modified cache."

Posted: **Fri Dec 19, 2008 2:28 pm**

Owen wrote: Ever tried implementing randomness from completely nonrandom components?

As for MD5, SHA and AES, they're too complex to do with instructions. You're better off with some kind of coprocessor which you tell to DMA over a region and do the work. CRC32 is simple enough that you could possibly implement it in a CPU as an instruction you call for each byte (or word, or long).

I call ignorance here. MD5, SHA, AES, and CRC32 are all easily implemented in hardware. As I pointed out in my previous post, VIA already does this for AES in their x86 processors. It turns out their AES implementation does handle keys of up to 256 bits, and they support SHA-1 and SHA-256, RSA algorithms AND a hardware RNG. Details are here: http://www.via.com.tw/en/initiatives/pa ... rdware.jsp

Also, as I said, Intel introduced CRC32 instructions with SSE4.2, so I really don't know what the "possibly" is about. The Linux kernel already takes advantage of it.
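For reference, the SSE4.2 `CRC32` instruction computes CRC-32C (the Castagnoli polynomial), not the CRC-32 used by zip or Ethernet. Below is a minimal bit-at-a-time software sketch of the same checksum in C, useful as a fallback or for checking results. The function name `crc32c_update` is made up for this example, and the initial and final inversion shown here is the usual CRC-32C convention that software applies around the raw accumulate step.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
   Pass 0 as `crc` for the first chunk, or a previous result to continue. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;                               /* conventional initial inversion */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            uint32_t mask = 0u - (crc & 1u);  /* all ones if low bit set, else 0 */
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return ~crc;                              /* conventional final inversion */
}
```

The standard check input for CRC algorithms is the ASCII string "123456789", for which CRC-32C is 0xE3069283. The inner loop of eight shifts per byte is what a per-byte hardware instruction replaces.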

Many other processors do similar things here. The AMD Geode is another good example.

Posted: **Fri Dec 19, 2008 2:40 pm**

You know... they have lock prefixes on read-modify-write instructions and those bypass cache. Why can't they have a lock prefix for MOV that bypasses cache? (I don't care if it's separate opcodes.) You have to admit my proposed instructions -- write straight to RAM and read straight from RAM -- would be handy, even if there are other ways to do it. Getting all cores to invalidate a location is not exactly convenient! I could use a locked XCHG instruction... but that puts a crazy value into the location which might create havoc.

Locked instructions do bypass cache, right?

EDIT: The lock doesn't bypass cache but takes exclusive use of shared values, or something like that. In effect, though, it's like bypassing... at least that's how I think of it. I know if you lock a BTS instruction, it works in a multiprocessing environment even if the location's page table entry is not cached.

All I care about is bypassing L1 cache. If L2 is shared, it might as well be RAM unless you actually have two CPU chips.

Posted: **Fri Dec 19, 2008 4:10 pm**

LOCK just stops the other processors from taking the cache line in the middle of the instruction, nothing more.

As for the AES/etc, they're very implementable in hardware. But they're not sanely implementable as CPU instructions (except for CRC32). AFAIK, VIA's AES is on the chipset. In any case, encryption algorithms are better off operating in parallel with the CPU, or else they'd end up being 10-cycle instructions or such.

Posted: **Fri Dec 19, 2008 6:05 pm**

Owen wrote: LOCK just stops the other processors from taking the cache line in the middle of the instruction, nothing more.

I can tell you read the Intel manual and haven't tried it. You're clueless.

Code:

int cnt1,cnt2,lock,done;

void CPU1Job()
{
  while (!done) {

    //Unlocked Btr
    while (!Btr(&lock,0) && !done);
    if (!done)
      cnt1++;
  }

  done=FALSE;

  while (!done) {

    //Locked Btr
    while (!LBtr(&lock,0) && !done);
    if (!done)
      cnt2++;
  }

  done=FALSE;

}

void CPU0Job()
{
  I8 i;

  done=FALSE;
  lock=0;
  cnt1=0;
  cnt2=0;
  WbInvd;

  MPJob(&CPU1Job); //Start job on CPU1

  for (i=0;i<100;i++) {
    Bts(&lock,0);
    Sleep(1);
  }
  done=TRUE;

  for (i=0;i<100;i++) {
    Bts(&lock,0);
    Sleep(1);
  }
  done=TRUE;

  PrintF("Count1:%d  Count2:%d\r\n",cnt1,cnt2);
}

The locked version has a count of 100, while the unlocked version comes out at 9, 7, or 10, varying from run to run.
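For readers without the helpers used above, here is a rough C11 sketch of what locked bit-test-and-set and bit-test-and-reset provide. The names `locked_bts` and `locked_btr` are hypothetical stand-ins, not the thread's actual `LBtr` implementation. One plausible reading of the low unlocked count is that an unlocked read-modify-write on one core can overwrite a bit that the other core set between its read and its write. The atomic versions rule that interleaving out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a locked BTS:
   atomically set `bit` in *word and return its previous value. */
static bool locked_bts(atomic_uint *word, unsigned bit)
{
    unsigned mask = 1u << bit;
    return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Hypothetical stand-in for a locked BTR:
   atomically clear `bit` in *word and return its previous value. */
static bool locked_btr(atomic_uint *word, unsigned bit)
{
    unsigned mask = 1u << bit;
    return (atomic_fetch_and(word, ~mask) & mask) != 0;
}
```

Each call is a single indivisible read-modify-write, so a set performed by one thread is either seen by the next reset or still present afterwards. It cannot be lost in between.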

Posted: **Fri Dec 19, 2008 6:55 pm**

Oops maybe you're right.

This works:

Code:

int cnt,lock,done;

void CPU1Job()
{
  while (!done) {
    while (!lock && !done);
    lock=FALSE;
    if (!done)
      cnt++;
  }

  done=FALSE;

}

void CPU0Job()
{
  I8 i;

  done=FALSE;
  lock=0;
  cnt =0;
  WbInvd;

  MPJob(&CPU1Job); //Start job on CPU1

  for (i=0;i<100;i++) {
    lock=TRUE;
    Sleep(1);
  }
  done=TRUE;

  PrintF("Count1:%d\r\n",cnt);
}

Posted: **Fri Dec 19, 2008 7:19 pm**

No ****... this works too. I was wrong.

Code:

int cnt,lock,done;

void CPU1Job()
{
  while (!done) {
    while (!lock && !done);
    lock=FALSE;
    if (!done)
      cnt++;
  }
  done=FALSE;
}

void CPU0Job()
{
  I8 i;

  done=FALSE;
  lock=FALSE;
  cnt =0;
  MPJob(&CPU1Job); //Start job on CPU1

  for (i=0;i<100;i++) {
    lock=TRUE;
    while (lock);
  }
  done=TRUE;
  PrintF("Count:%d\r\n",cnt);
}

CPU0Job;

So, except for read-modify-write issues, there are no worries with multicore? I was forcing stuff out of cache, thinking each core kept its own cache and they were not synced up.
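The same ping-pong handshake can be sketched in portable C11 with POSIX threads, as below. This is an illustration under assumptions, not a translation of the code above: the names (`flag`, `worker`, `run_handshake`) are made up, `pthread` stands in for `MPJob`, and the worker counts before it acknowledges so that the final count is exact. In C the flags need to be atomic (or at least volatile) for a different reason than cache coherency: without that, the compiler is allowed to hoist the load out of the spin loop.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag;   /* plays the role of "lock" in the post above */
static atomic_bool done;
static int cnt;            /* written only by the worker, read after join */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Spin until the main thread signals a round or asks us to stop. */
        while (!atomic_load(&flag) && !atomic_load(&done))
            sched_yield();
        if (atomic_load(&done))
            break;
        cnt++;
        atomic_store(&flag, false);   /* acknowledge the round */
    }
    return NULL;
}

/* Runs `rounds` handshakes and returns how many the worker counted,
   or -1 if the thread could not be created. */
static int run_handshake(int rounds)
{
    pthread_t t;

    atomic_store(&flag, false);
    atomic_store(&done, false);
    cnt = 0;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;

    for (int i = 0; i < rounds; i++) {
        atomic_store(&flag, true);    /* signal */
        while (atomic_load(&flag))    /* wait for the acknowledgement */
            sched_yield();
    }
    atomic_store(&done, true);
    pthread_join(t, NULL);
    return cnt;
}
```

No cache flushing appears anywhere. The stores become visible to the other core through the normal coherency protocol, and `run_handshake(100)` counts every round.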

Posted: **Fri Dec 19, 2008 8:49 pm**

Hi,

LoseThos wrote: You know... they have locked prefixes on read-modify-write instructions and those bypass cache. Why can't they have a lock prefix for MOV that bypasses cache? (I don't care if it's separate opcodes)

In theory, locked operations lock the bus to prevent other things on the bus (e.g. other CPUs) from messing things up (e.g. so a different CPU can't modify the contents of a memory location while you're doing a "read-modify-write", after you've done the read but before you've done the write). In practice, modern CPUs take advantage of the cache coherency protocols to do the entire operation in the cache without locking the bus at all; which is important for scalability (imagine several CPUs repeatedly trying to lock the bus for different reasons, where other bus traffic is severely crippled).
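As a small illustration of an indivisible read-modify-write, here is a C11 sketch in which two threads increment a shared counter with `atomic_fetch_add`. On x86 compilers typically emit a LOCK-prefixed instruction for this, though that is a property of the compiler and target rather than something the C standard promises. The names `bump`, `run_counter` and `ITERS` are made up for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

enum { ITERS = 100000 };        /* increments per thread (arbitrary) */

static atomic_long counter;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1);   /* one indivisible read-modify-write */
    return NULL;
}

/* Starts two incrementing threads and returns the final counter value,
   or -1 if a thread could not be created. */
static long run_counter(void)
{
    pthread_t a, b;

    atomic_store(&counter, 0);
    if (pthread_create(&a, NULL, bump, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, bump, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```

Because no other CPU can modify the location between the read and the write, no increment is lost and the total is always twice `ITERS`. With a plain `long` and `counter++` the result would usually fall short.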

Also note that some operations are guaranteed to be atomic without locking the bus. An example of this is a MOV that writes or reads a 32-bit value to/from an address that's aligned on a 4-byte boundary. There are similar guarantees for 16-bit values with 16-bit alignment, etc. Some (newer) CPUs go further (e.g. any 64-bit or smaller read or write that doesn't cross a cache line boundary is guaranteed to be done atomically, regardless of whether or not it's aligned).
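The two conditions mentioned above (natural alignment, and not crossing a cache line) are easy to check in code. The helpers below are a sketch: the names are invented for this example, and the 64-byte line size is an assumption that holds for many x86 CPUs but should really be queried (e.g. via CPUID) on a real system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE = 64 };   /* ASSUMED cache line size in bytes */

/* True if an access of `size` bytes (a power of two) at `addr`
   sits on a multiple of its own size. */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    return size != 0
        && (size & (size - 1)) == 0        /* size is a power of two */
        && (addr & (size - 1)) == 0;       /* low bits of the address are zero */
}

/* True if the bytes [addr, addr + size) span two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    return size != 0
        && (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}
```

For example, a 4-byte access at 0x1039 is not naturally aligned but stays inside one 64-byte line, while a 4-byte access at 0x103E straddles two lines and so falls outside both guarantees.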

The XCHG instruction is special because the LOCK is implied (no explicit LOCK prefix needed). Because of this you should avoid the XCHG instruction and use some other way if you don't need the LOCK. For example, "mov ebx,[address]; xchg ebx,eax; mov [address],ebx" can be better than "xchg [address],eax" if the LOCK isn't needed.
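In C11 terms the distinction looks roughly like the sketch below. On x86, compilers commonly implement `atomic_exchange` with a memory-operand `xchg` (and therefore with the implied LOCK), whereas the plain version is just a separate load and store. That mapping is typical compiler behaviour, not a guarantee, and the function names are made up for illustration.

```c
#include <stdatomic.h>

/* Atomic swap: store `new_value` and return the old contents as one
   indivisible operation. Safe when other CPUs touch *slot concurrently. */
static int swap_atomic(atomic_int *slot, int new_value)
{
    return atomic_exchange(slot, new_value);
}

/* Plain swap: a separate load followed by a store, with no locking.
   Cheaper, but only correct if nothing else accesses *slot meanwhile. */
static int swap_plain(int *slot, int new_value)
{
    int old = *slot;
    *slot = new_value;
    return old;
}
```

Both return the previous value. The choice between them is exactly the one described above: pay for the lock only when another CPU might be touching the location at the same time.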

LoseThos wrote: You have to admit my proposed instructions -- write straight to RAM and read straight from RAM -- would be handy, even if there are other ways to do it. Getting all cores to invalidate a location is not exactly convenient! I could use a locked XCHG instruction... but that puts a crazy value into the location which might create havoc.

Can you think of any situation where these "read/write direct to RAM" instructions would be used, where a normal read/write won't work? The only situation I can think of is reading/writing unaligned data, where the read/write is used for multi-CPU synchronization (but the best solution here would be to align the data).

Cheers,

Brendan

Posted: **Fri Dec 19, 2008 9:15 pm**

Three words: LOCK CMPXCHG8B EAX

Posted: **Tue Dec 23, 2008 5:56 am**

Troy Martin wrote: Three words: LOCK CMPXCHG8B EAX

ROFL

Posted: **Tue Dec 23, 2008 9:21 am**

Owen wrote: As for the AES/etc, they're very implementable in hardware. But they're not sanely implementable as CPU instructions (except for CRC32). AFAIK, VIA's AES is on the chipset. In any case, encryption algorithms are better off operating in parallel with the CPU, or else they'd end up being 10-cycle instructions or such.

No, they're actual CPU instructions: the xcrypt family (xcryptecb, xcryptcbc and friends) for AES, and xsha1/xsha256 for SHA. Reading the documentation helps; there's a reason I linked to it.

I don't believe that crypto is any better off as part of the chipset or even a PCI card, as then you'd have much lower encryption throughput since the data would have to traverse the PCI bus. You should really read the VIA docs at least before dismissing crypto on the CPU as inefficient.

What features would you like in a CPU?
