e.g.
e0 = 0;
e1 = 8;
e2 = 16;
...
What's the best assembly sequences for vm64z indexing?
Re: What's the best assembly sequences for vm64z indexing?
You are going to have to write a little bit more than that. What is it you wish to do? If I have to google it, I can't answer your question.
Carpe diem!
Re: What's the best assembly sequences for vm64z indexing?
to form [ gpr_base + zmm0 + displacement ], vm64z.
It's a bit slow to use instruction mov zmm0, [ vm64z_index_from_memory ]
Re: What's the best assembly sequences for vm64z indexing?
blackoil wrote:
to form [ gpr_base + zmm0 + displacement ], vm64z.
It's a bit slow to use instruction mov zmm0, [ vm64z_index_from_memory ]

I lost you there. If zmm0 is supposed to be a floating point / SIMD register, then you can't use it for indexing and you can't use "mov". You have to use special instructions like "movaps" with those registers. Otherwise you can speed up the read by using only aligned values and prefetch.
Btw, with indexed addressing you can encode 3 bit shifts and a base in a single mov instruction (like [rbx + rax*8]), and reading memory with it into a gpr is not slow at all. Read about addressing modes in Intel spec.
Cheers,
bzt
Re: What's the best assembly sequences for vm64z indexing?
That was pseudo-code. The actual sequence is:
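kxnorw k1, k1, k1 ; set all mask bits; AVX-512 gathers require a write mask (k0 is not allowed)
vmovdqa64 zmm0, [index64] ; the vindex instruction from armv8 can do this without a memory read
vgatherqpd zmm1{k1}, [ rbx + zmm0 ] ; zmm0 contains the byte offset for each element of zmm1
index64: ; must be 64-byte aligned for vmovdqa64
dq 0
dq 16
dq 32
dq 48
dq 64
dq 80
dq 96
dq 112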
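
For readers unfamiliar with vm64z addressing, here is a minimal scalar sketch in C of what a qword-indexed gather of doubles computes per lane. It is only a model, not the instruction itself: the function name gather_qpd_model and its mask parameter are hypothetical, the scale is assumed to be 1, and the displacement is assumed to be 0, as in the snippet above.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of an 8-lane gather with 64-bit indices (vm64z form).
 * Assumptions for illustration: scale = 1, displacement = 0, so each
 * idx[lane] is a plain byte offset from base. The name and the mask
 * parameter are hypothetical, not part of any real API. */
static void gather_qpd_model(double dst[8], const void *base,
                             const int64_t idx[8], uint8_t mask)
{
    for (int lane = 0; lane < 8; lane++) {
        /* Only lanes whose mask bit is set are loaded; others keep dst. */
        if (mask & (1u << lane)) {
            /* memcpy avoids alignment and aliasing assumptions. */
            memcpy(&dst[lane], (const char *)base + idx[lane],
                   sizeof(double));
        }
    }
}
```

Called with the offsets from index64 (0, 16, ..., 112) on an array of 16 doubles, this model reads elements 0, 2, 4, ..., 14: each double is 8 bytes wide, so a 16-byte stride skips every other element.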
Re: What's the best assembly sequences for vm64z indexing?
I don't think x86 has any way to do that without an extra memory access to load the indices.
Why is there an 8-byte gap between each of the values you want to load?