Most optimized function to copy/store every 4th byte (SSE2)
Posted: Thu Dec 16, 2021 3:49 pm
by amyipdev
Essentially, I'm looking for the most optimized way (within SSE2) to copy every fourth byte of a data stream into its own output, i.e. to split one interleaved stream into four separate streams.
If the data stream were:
ABCDABCDABCDABCD
Then the four new outputs would be:
AAAA
BBBB
CCCC
DDDD
Likewise, if I had four data streams:
EEEE
FFFF
GGGG
HHHH
They would be merged into:
EFGHEFGHEFGHEFGH
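As a baseline, the two transformations described above can be sketched in plain C. This is only a scalar reference to compare faster versions against, not an optimized answer; the function names `split4_scalar` and `merge4_scalar` and the parameter `pixels` (the number of 4-byte groups) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split ABCDABCD... into four planes: AAAA..., BBBB..., CCCC..., DDDD...
   `pixels` is the number of 4-byte groups in `src` (hypothetical name). */
static void split4_scalar(const uint8_t *src, size_t pixels,
                          uint8_t *c0, uint8_t *c1, uint8_t *c2, uint8_t *c3)
{
    assert(src && c0 && c1 && c2 && c3);
    for (size_t i = 0; i < pixels; i++) {
        c0[i] = src[4 * i + 0];
        c1[i] = src[4 * i + 1];
        c2[i] = src[4 * i + 2];
        c3[i] = src[4 * i + 3];
    }
}

/* Inverse operation: interleave four planes back into one stream. */
static void merge4_scalar(const uint8_t *c0, const uint8_t *c1,
                          const uint8_t *c2, const uint8_t *c3,
                          size_t pixels, uint8_t *dst)
{
    assert(dst && c0 && c1 && c2 && c3);
    for (size_t i = 0; i < pixels; i++) {
        dst[4 * i + 0] = c0[i];
        dst[4 * i + 1] = c1[i];
        dst[4 * i + 2] = c2[i];
        dst[4 * i + 3] = c3[i];
    }
}
```

Splitting and then merging with the same `pixels` value should reproduce the input byte for byte, which makes this pair handy as a test oracle.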
Re: Most optimized function to copy/store every 4th byte (SSE2)
Posted: Thu Dec 16, 2021 8:25 pm
by Octocontrabass
Why do you want to do this?
If it has anything to do with your other question about image processing, you shouldn't separate the color channels at all - process them in parallel instead.
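One possible reading of "process them in parallel": when the same per-byte operation applies to every channel, a single SSE2 instruction can work on 16 interleaved bytes (four whole pixels) without any separation step. The sketch below averages two interleaved buffers with `_mm_avg_epu8`; the function name `average_interleaved` and the choice of averaging as the operation are assumptions for illustration, not something taken from the thread.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two interleaved buffers, all channels at once.
   Hypothetical helper; `bytes` is the buffer length in bytes. */
static void average_interleaved(const uint8_t *a, const uint8_t *b,
                                uint8_t *dst, size_t bytes)
{
    assert(a && b && dst);
    size_t i = 0;
    /* 16 bytes = 4 pixels x 4 channels per iteration (pavgb). */
    for (; i + 16 <= bytes; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
    /* Scalar tail using the same rounding as pavgb: (x + y + 1) >> 1. */
    for (; i < bytes; i++) {
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
}
```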
Re: Most optimized function to copy/store every 4th byte (SSE2)
Posted: Fri Dec 17, 2021 12:32 am
by amyipdev
Some of the algorithms I've found (like the Gaussian blur one) work on single color channels; I could modify them to work with offsets for each channel, but I can also see this being useful in the future for other algorithms.
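To illustrate the "offsets for each channel" idea, here is a minimal scalar sketch of a single-channel filter that reads and writes the interleaved buffer directly, using a stride of 4 and a channel offset. A 3-tap [1 2 1]/4 kernel stands in for a real Gaussian blur, and the name `blur3_channel` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Horizontal [1 2 1]/4 blur of one channel of an interleaved 4-channel row.
   `src` and `dst` must be different buffers; edge pixels are clamped.
   Other channels in `dst` are left untouched. */
static void blur3_channel(const uint8_t *src, uint8_t *dst,
                          size_t pixels, size_t channel)
{
    assert(src && dst && src != dst && channel < 4);
    for (size_t i = 0; i < pixels; i++) {
        size_t l = (i == 0) ? 0 : i - 1;           /* clamp left edge  */
        size_t r = (i + 1 == pixels) ? i : i + 1;  /* clamp right edge */
        unsigned sum = src[4 * l + channel]
                     + 2u * src[4 * i + channel]
                     + src[4 * r + channel];
        dst[4 * i + channel] = (uint8_t)((sum + 2) / 4);  /* round to nearest */
    }
}
```

Calling it once per channel index 0..3 processes the whole row without ever building separate planes, at the cost of strided memory access.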
Re: Most optimized function to copy/store every 4th byte (SSE2)
Posted: Fri Dec 17, 2021 7:05 am
by Solar
The most optimized way to copy is to avoid the copy, so... yes, context matters.
Re: Most optimized function to copy/store every 4th byte (SSE2)
Posted: Fri Dec 17, 2021 10:50 am
by Gigasoft
This is the best I could come up with.
Extracting:
Code:
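; Initial conditions: ecx = pixel count (nonzero multiple of 4), esi = source (16-byte aligned),
; edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2                ; four pixels per iteration
sub ebx, edx              ; turn channel 2..4 pointers into offsets from channel 1
sub edi, edx
sub ebp, edx
sub edx, 4                ; pre-bias; the loop adds 4 first
mov eax, 0ffh
movd xmm4, eax            ; build the mask 000000FFh in every dword of xmm4
punpckldq xmm4, xmm4
punpckldq xmm4, xmm4
extractloop:
add edx, 4
movdqa xmm0, [esi]        ; load four pixels
movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm3, xmm0
psrld xmm1, 8             ; move channels 2..4 down to the low byte of each dword
psrld xmm2, 16
psrld xmm3, 24
pand xmm0, xmm4           ; keep only the low byte of each dword
pand xmm1, xmm4
pand xmm2, xmm4
pand xmm3, xmm4
packssdw xmm0, xmm0       ; dwords -> words (values are 0..255, so no saturation)
packssdw xmm1, xmm1
packssdw xmm2, xmm2
packssdw xmm3, xmm3
packuswb xmm0, xmm0       ; words -> bytes; the low dword now holds four channel bytes
packuswb xmm1, xmm1
packuswb xmm2, xmm2
packuswb xmm3, xmm3
add esi, 16
dec ecx                   ; sets ZF for the jnz below; movd does not touch the flags
movd [edx], xmm0
movd [edx+ebx], xmm1
movd [edx+edi], xmm2
movd [edx+ebp], xmm3
jnz extractloop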
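For anyone working from C, the same mask, shift and pack sequence can be written with SSE2 intrinsics. This is a rough sketch rather than a drop-in replacement: the names `split4_sse2` and `store_low_bytes` are made up, and it uses unaligned loads (`_mm_loadu_si128`) instead of `movdqa` so the source does not have to be 16-byte aligned. It still assumes the pixel count is a multiple of 4.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Narrow four dwords (each holding 0..255) to four bytes and store them. */
static void store_low_bytes(uint8_t *dst, __m128i dwords)
{
    __m128i words = _mm_packs_epi32(dwords, dwords);    /* packssdw */
    __m128i bytes = _mm_packus_epi16(words, words);     /* packuswb */
    uint32_t low = (uint32_t)_mm_cvtsi128_si32(bytes);  /* movd     */
    memcpy(dst, &low, sizeof low);
}

/* Split interleaved 4-byte pixels into four planes, four pixels at a time. */
static void split4_sse2(const uint8_t *src, size_t pixels,
                        uint8_t *c0, uint8_t *c1, uint8_t *c2, uint8_t *c3)
{
    assert(pixels % 4 == 0);
    const __m128i mask = _mm_set1_epi32(0xFF);  /* 000000FFh in every dword */
    for (size_t i = 0; i < pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        store_low_bytes(c0 + i, _mm_and_si128(v, mask));
        store_low_bytes(c1 + i, _mm_and_si128(_mm_srli_epi32(v, 8), mask));
        store_low_bytes(c2 + i, _mm_and_si128(_mm_srli_epi32(v, 16), mask));
        store_low_bytes(c3 + i, _mm_and_si128(_mm_srli_epi32(v, 24), mask));
    }
}
```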
Merging:
Code:
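; Initial conditions: ecx = pixel count (nonzero multiple of 4), esi = destination (16-byte aligned),
; edx = channel 1, ebx = channel 2, edi = channel 3, ebp = channel 4
shr ecx, 2                ; four pixels per iteration
sub esi, 16               ; pre-bias; the loop adds 16 first
sub ebx, edx              ; turn channel 2..4 pointers into offsets from channel 1
sub edi, edx
sub ebp, edx
mergeloop:
add esi, 16
movd xmm0, [edx]          ; four bytes of channel 1
movd xmm2, [edx+ebx]      ; four bytes of channel 2
punpcklbw xmm0, xmm2      ; bytes 1 2 1 2 1 2 1 2
movd xmm1, [edx+edi]      ; four bytes of channel 3
movd xmm3, [edx+ebp]      ; four bytes of channel 4
punpcklbw xmm1, xmm3      ; bytes 3 4 3 4 3 4 3 4
punpcklwd xmm0, xmm1      ; bytes 1 2 3 4, four times
movdqa [esi], xmm0        ; store four merged pixels
add edx, 4
dec ecx
jnz mergeloop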
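The merge direction maps onto the unpack intrinsics in the same way. Again this is only a sketch under assumptions: `merge4_sse2` and `load4` are hypothetical names, the store is unaligned (`_mm_storeu_si128`) rather than `movdqa`, and the pixel count is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load four bytes into the low dword of an XMM register (movd). */
static __m128i load4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return _mm_cvtsi32_si128((int)v);
}

/* Interleave four planes into 4-byte pixels, four pixels at a time. */
static void merge4_sse2(const uint8_t *c0, const uint8_t *c1,
                        const uint8_t *c2, const uint8_t *c3,
                        size_t pixels, uint8_t *dst)
{
    assert(pixels % 4 == 0);
    for (size_t i = 0; i < pixels; i += 4) {
        /* punpcklbw: bytes 0 1 0 1 0 1 0 1 and 2 3 2 3 2 3 2 3 */
        __m128i lo = _mm_unpacklo_epi8(load4(c0 + i), load4(c1 + i));
        __m128i hi = _mm_unpacklo_epi8(load4(c2 + i), load4(c3 + i));
        /* punpcklwd: bytes 0 1 2 3, four times */
        _mm_storeu_si128((__m128i *)(dst + 4 * i), _mm_unpacklo_epi16(lo, hi));
    }
}
```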
Re: Most optimized function to copy/store every 4th byte (SSE2)
Posted: Fri Dec 17, 2021 12:35 pm
by amyipdev
Gigasoft wrote: This is the best I could come up with.
Thank you so much!