Regardless, I do want to finish what I've been working out, and hear what he has to say about it.
Here is the hand-coded assembly version I wrote:
Code: Select all
// factArray in hand-coded assembly language
.globl factArray
.extern factArray
factArray:
// in the x86-64 calling convention the first six
// arguments are passed in registers, so there is no
// need to build a stack frame.
// %rdi == ptr to array of 64-bit quadwords
// %rsi == array size
//
// %rax is the return value
// %rbx is the value being computed
// %rcx is the number whose factorial is being computed
// %rdi is the pointer to the current insertion point
// %rsi is the endpoint of the array
// set the starting point aside to return it later
mov %rdi, %rax
// find an offset to the endpoint of the array
// then compute the actual endpoint
imulq $8, %rsi
addq %rdi, %rsi
// initialize the accumulated value and the counter
movq $1, %rbx
movq $0, %rcx
jmp .loop_test
.populate_loop:
inc %rcx
imulq %rcx, %rbx
.loop_test:
movq %rbx, (%rdi)
addq $8, %rdi
cmp %rdi, %rsi
jg .populate_loop
.loop_exit:
ret
And here is the gprof results for it:
Code: Select all
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
89.39 1.78 1.78 factArray
11.68 2.01 0.23 main
As you can see, this is close to, but not quite as fast as, the optimized results for the second C version. Let's take a look at the source output for that:
Code: Select all
.p2align 4,,15
.globl factArray
factArray:
.LFB0:
.cfi_startproc
movl %esi, %esi
leaq 8(%rdi), %rcx
movq %rdi, %rax
leaq (%rdi,%rsi,8), %rdx
movq $1, (%rdi)
cmpq %rcx, %rdx
jbe .L6
subq %rdi, %rdx
movl $1, %ecx
leaq -9(%rdx), %rsi
movl $1, %edx
shrq $3, %rsi
addq $2, %rsi
.p2align 4,,10
.p2align 3
.L3:
imulq %rdx, %rcx
movq %rcx, (%rax,%rdx,8)
addq $1, %rdx
cmpq %rsi, %rdx
jne .L3
.L6:
rep ret
.cfi_endproc
While this seems pretty different on the surface, it is actually quite similar, just doing basically the same things using slightly different instructions and with some tweaks to things such as instruction alignments.
compare this to the other unoptimized version, and the other versions (opt and unopt)
version 1, no opt:
Code: Select all
.globl factArray
.type factArray, @function
factArray:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movl %esi, -28(%rbp)
movq -24(%rbp), %rax
movq $1, (%rax)
movl $1, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rax, %rdx
movl -4(%rbp), %ecx
movl -4(%rbp), %eax
subl $1, %eax
movl %eax, %eax
leaq 0(,%rax,8), %rsi
movq -24(%rbp), %rax
addq %rsi, %rax
movq (%rax), %rax
imulq %rcx, %rax
movq %rax, (%rdx)
addl $1, -4(%rbp)
.L2:
movl -4(%rbp), %eax
cmpl -28(%rbp), %eax
jbe .L3
movq -24(%rbp), %rax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
version 1, opt:
Code: Select all
.p2align 4,,15
.globl factArray
.type factArray, @function
factArray:
.LFB0:
.cfi_startproc
testl %esi, %esi
movq %rdi, %rax
movq $1, (%rdi)
movl $1, %edx
je .L7
.p2align 4,,10
.p2align 3
.L5:
leal -1(%rdx), %edi
movl %edx, %ecx
addl $1, %edx
movq %rcx, %r8
imulq (%rax,%rdi,8), %r8
cmpl %edx, %esi
movq %r8, (%rax,%rcx,8)
jnb .L5
.L7:
rep ret
.cfi_endproc
version 2, no opt:
Code: Select all
.globl factArray
.type factArray, @function
factArray:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -40(%rbp)
movl %esi, -44(%rbp)
movq $1, -24(%rbp)
movq -40(%rbp), %rax
movq -24(%rbp), %rdx
movq %rdx, (%rax)
movq -40(%rbp), %rax
addq $8, %rax
movq %rax, -32(%rbp)
movl -44(%rbp), %eax
leaq 0(,%rax,8), %rdx
movq -40(%rbp), %rax
addq %rdx, %rax
movq %rax, -8(%rbp)
movq $1, -16(%rbp)
jmp .L2
.L3:
movq -24(%rbp), %rax
imulq -16(%rbp), %rax
movq %rax, -24(%rbp)
movq -32(%rbp), %rax
movq -24(%rbp), %rdx
movq %rdx, (%rax)
addq $8, -32(%rbp)
addq $1, -16(%rbp)
.L2:
movq -32(%rbp), %rax
cmpq -8(%rbp), %rax
jb .L3
movq -40(%rbp), %rax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
Now, I won't pretend that I am an expert on OISC, or even on x86, but I am pretty certain that the way your compiler would implement these would be to emit blocks of code corresponding to the individual C operations. While this makes sense when said operations have exact or near-exact analogs in the instruction set, it makes
no sense to do this if each operations requires a page of code to implement, and a jury-rigged 'procedure call' operation to invoke.
For an implicit-state machine, imitating a register machine is a dead end approach. You need to be generating code that works
with the state machine, not against it. The factorial operation I've been could probably be generated as less than a two pages of OISC states - but only if it is written in a notation that doesn't try to mimic something else when doing so.
I can't promise you will find a better programming model for OISC, nor am I convinced that doing so will be sufficient to make OISC practical as a hardware platform, but even just trying to find one would be well worth your trouble if you are unwavering in your decision to follow this path. Even if it fails, it will have contributed to your knowledge of what is and isn't practical with an OISC.