OSDev.org

Posted: **Mon Jun 01, 2020 7:58 am**

Checkout this code segment:

void put_zmm_value()
{
  struct zmm_value buffer;

  asm volatile(
		"vmovdqa64	%%zmm0, (%[buffer]) \n": :[buffer]"r"(buffer):"%zmm0");

  for(int i=0;i<8;i++)
           printf("%lx ",buffer.word[i]);
}

I'm wondering whether my inline assembly code is correct, and will work for all optimizations.

Posted: **Mon Jun 01, 2020 8:30 am**

Probably not, since it doesn't appear to do anything useful. What is it supposed to do?

Posted: **Mon Jun 01, 2020 8:40 am**

Correction:

Code: Select all

void put_zmm_value()
{
  struct zmm_value buffer;

  asm volatile(
      "vmovdqa64   %%zmm0, %[buffer] \n": :[buffer]"m"(buffer):"%zmm0");

  for(int i=0;i<8;i++)
           printf("%lx ",buffer.word[i]);
}

The previous snippet doesn't even compile.
The purpose of this code is to print out the value of the zmm0 register. Nothing else!

Note: buffer is cache line aligned... No issues there... I won't get segfault

Posted: **Mon Jun 01, 2020 9:04 am**

sunnysideup wrote:The purpose of this code is to print out the value of the zmm0 register.

Why? According to the ABI, there is nothing useful in zmm0 at this point in time.

Posted: **Mon Jun 01, 2020 9:31 am**

I'm want to get familiar with zmm0 as is important for implementing fast memcpy and so on. I also manually 'fill' zmm0 using:

Code: Select all

struct zmm_value
{
  uint64_t word[8];
} __attribute__((packed)) __attribute__ ((aligned(64)));

void set_zmm_value(struct zmm_value* val_address)
{
  asm volatile
    ("vmovntdqa (%[val_address]),%%zmm0\n":: [val_address]"r"(val_address):"%zmm0");
}

Posted: **Mon Jun 01, 2020 10:32 am**

That won't work. The compiler is free to do whatever it wants with zmm0 outside your asm block.

If you want to move data through zmm registers, you must either load and store within the same asm block, or you must tell the compiler to do the load/store on your behalf by passing around __m512/__m512d/__m512i values.

Posted: **Mon Jun 01, 2020 10:57 am**

Makes sense

Posted: **Sun Jun 28, 2020 6:21 am**

Alright, moving on to something new here: I've often seen this:

Code: Select all

  asm volatile("" ::: "memory");

in C code that is compiled using gcc. What is its significance and what is the concept here? Is this extended asm? I can't wrap my head around inline assembly in gcc. Any good resources?

Also, what's the difference between __asm and just asm?

Posted: **Sun Jun 28, 2020 9:41 am**

That's a memory barrier for the compiler.¹ Yes, it is extended asm. The memory clobber forces all loads/stores to globally visible variables that occur before/after the barrier in program order to happen before/after the barrier.

__asm can be used in contexts where asm is not available (e.g., because somebody chose to #define asm) but other than that, there is no difference.

¹ But not for the CPU! On some architectures (e.g., ARM), that makes a vast difference.

Posted: **Sun Jun 28, 2020 11:35 am**

Korona wrote:That's a memory barrier for the compiler.¹ Yes, it is extended asm. The memory clobber forces all loads/stores to globally visible variables that occur before/after the barrier in program order to happen before/after the barrier.

Alright, I understand that why it's used for now - as a way for the compiler to ensure that no compile time reordering occurs across this 'barrier'

However, why does it work this way? Or as a mathematician would say - can you derive it from first principles?

I've also read this piece of code:

Code: Select all

static void force_read(uint8_t *p) {
    asm volatile("" : : "r"(*p) : "memory");
}

It's supposed to force a read from memory location p. But how does it work, i.e. why does gcc make it work that way?

Posted: **Sun Jun 28, 2020 2:06 pm**

sunnysideup wrote:However, why does it work this way?

It's an assembler statement with a memory clobber. The fact that it's empty is incidental to this. The memory clobber tells GCC that this statement will change "memory", but not which memory and in what way it will be changed. Therefore, GCC cannot assume anything about the state of memory, and must write all changes to memory before the statement, and read all things that are in memory again after the statement.

sunnysideup wrote:I've also read this piece of code:
Code: Select all
static void force_read(uint8_t *p) {
    asm volatile("" : : "r"(*p) : "memory");
}
It's supposed to force a read from memory location p. But how does it work, i.e. why does gcc make it work that way?

This time it is a memory clobber and an input constraint. So in addition to the above, this statement requires that the value of "*p" be put iinto a register beforehand. The statement is empty and doesn't do anything with the value, but GCC doesn't know that, and is therefore forced to emit a read of this memory location. And since memory is clobbered, even multiple reads of this location have to be read, since they might have changed now.

Posted: **Sun Jun 28, 2020 2:47 pm**

sunnysideup wrote:However, why does it work this way? Or as a mathematician would say - can you derive it from first principles?

Because the GCC developers say so.

Using the "memory" clobber effectively forms a read/write memory barrier for the compiler.

Here's the part of the manual that explains it.

nullplan wrote:And since memory is clobbered, even multiple reads of this location have to be read, since they might have changed now.

But this holds true only if you use functions like this one with memory barriers to access that location. If you also access it without a memory barrier and the function gets inlined, the read may be combined with prior accesses. There is also no guarantee that the read will occur after all prior statements if the function is inlined.

Posted: **Mon Aug 03, 2020 8:37 am**

Hello people, It's been a long time....
I've been experimenting with inline assembly for a bit (again). I just realized that the force_read() function isn't really correct (according to my understanding at least) when optimizations come into the picture.

Here is the function again:

Code: Select all

static void force_read(uint8_t *p) 
{
  asm volatile("" : : "r"(*p) : "memory");
}

Let us assume that the compiler inlines it, and we finally get:

Code: Select all

static inline void force_read(uint8_t *p) 
{
  asm volatile("" : : "r"(*p) : "memory");
}

It is my understanding that a call to force_read would not really guarantee a memory read. Consider this program:

Code: Select all

inline void force_read(int* address)
{
    asm volatile ("": :"r"(*address):"memory" );
}

int main()
{
    int a;
    scanf("%d",&a);
//I'm doing some calculations with a
    a *= 12432; 
    a += 1231;

    force_read(&a); //This shouldn't actually compile to a memory read
}

As expected, Godbolt gives me:

Code: Select all

.LC0:
  .string "%d"
main:
  subq $24, %rsp
  movl $.LC0, %edi
  xorl %eax, %eax
  leaq 12(%rsp), %rsi
  call __isoc99_scanf
  imull $12432, 12(%rsp), %eax
  addl $1231, %eax
  movl %eax, 12(%rsp)
  xorl %eax, %eax
  addq $24, %rsp
  ret

Clearly, there is no "forced read"...
I would have guessed that a better function would have been:

Code: Select all

inline void force_read(int* address)
{
    asm volatile ("mov  (%[addr]) , %%rax": :[addr]"m"(address):"memory","rax" ); //Do we even need the memory clobber??
}

Is my understanding correct?

Posted: **Mon Aug 03, 2020 9:12 am**

sunnysideup wrote:Clearly, there is no "forced read"...

Well, there is. The snippet (empty string) is instantiated with the value you wanted in a register, namely in EAX. Question is what this is even supposed to prove and how it is useful in any way. You are calling force_read() on a local variable. But local variables are in normal memory, where a force read is both unnecessary and not useful. Or if it is, I don't know how. force_read() is useful for MMIO, where you sometimes need to read a register because that has side effects, even if you don't need the value. But that means, in terms of C, that you are reading a non-local object through a pointer. For example:

Code: Select all

struct whatever *foo = get_foo();
force_read(&foo->reg);

Result: https://godbolt.org/z/ss6nda
See, it does what it is supposed to.

Your replacement function has at least the problem that it is reading 8 bytes from a 4-byte pointer, and it is using one specific register that is used very often, thereby increasing register pressure. Mind you, my solution is to not use inline assembly at all, thereby removing all optimization opportunities. But less of a headache in the long run.

Posted: **Mon Aug 03, 2020 10:53 pm**

I'm afraid I didn't explain myself very well.

Let me explain why I'd want a force_read function first: You can use it for cache side-channel attacks, where you'd measure the time that it takes to access a variable from memory to figure out whether it was accessed from the cache, or main memory.

So I'd expect that, whenever I call force_read(), It would compile to a memory load. In my example, I purposefully do some arithmetic before calling force_read(), because I want the value to be loaded into a register anyway, and when I call force_read(), the variable isn't really forcefully read.... and the compiler uses that register that was already used for the arithmetic...

Btw, the memory clobber is messing things up a bit... removing it makes it a lot more clearer in my example.
I hope what I am saying makes sense XD

OSDev.org

Proper way to write inline assembly

Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly

Re: Proper way to write inline assembly