Best way to implement MMU on x86_64 Higher Half kernel?

kzinti · Post by **kzinti** » Tue May 10, 2022 12:30 pm

Ethin wrote:Right, but if segmentation was truly the superior option, the C standard could've been changed back when people were trying to decide what would be dominant.

The C standard doesn't need to be changed to support segmentation. You can perfectly implement a C compiler that generates segmented 16-bit code/data using so-called "huge" pointers (Borland C++ 3.1 is one I used for years). In fact 16 bits x86 compilers can generate code for different memory models depending on what you want. The C standard doesn't in any way prevent this. If anything, the design of C takes segmented memory models into account.

Also the 8086 was released in 1978. C was invented in 1972-1973.

rdos · Post by **rdos** » Tue May 10, 2022 1:48 pm

kzinti wrote:
Ethin wrote:Right, but if segmentation was truly the superior option, the C standard could've been changed back when people were trying to decide what would be dominant.
The C standard doesn't need to be changed to support segmentation. You can perfectly implement a C compiler that generates segmented 16-bit code/data using so-called "huge" pointers (Borland C++ 3.1 is one I used for years). In fact 16 bits x86 compilers can generate code for different memory models depending on what you want. The C standard doesn't in any way prevent this. If anything, the design of C takes segmented memory models into account.

Also the 8086 was released in 1978. C was invented in 1972-1973.

16-bit segmentation is not of much interest. You should use 32-bit segmentation.

The major problem with segmentation is that if you allocate a selector for every new/malloc call, you will run out of selectors. So, to really make segmentation useful, Intel / AMD should have extended selectors to 32-bits. A minor problem is that C is not so smart about how to handle pointers, and so does too many segment register loads, which slows down code. The 32-bit compact memory model is the best one (particularly for devices) since all local data can be accessed with ds and all code is in the same segment. Only external calls needs to use 48-bit far pointers. This is the memory model I use, and it works pretty well with C too.

rdos · Post by **rdos** » Tue May 10, 2022 1:58 pm

Ethin wrote:
rdos wrote: The inner loop consist of multiplying a sin() function (using an 18-bit table) with an array of 14-bit integers. This is typically done on hundreds of values per operation. The code uses assembly as I don't trust the compiler would be able to provide better optimizations than my hand-coded assembly code.
Trust me when I say that you can't write better assembly than the compiler. The compiler knows far more about your program than you do. It after all can hold your entire program in memory and analyze it in its entirety. Good luck doing that in your mind on a program that's 15000 LoC. When it comes to "who can write better assembly", the compiler always wins. Its an extremely rare moment when you can outsmart a compiler these days. And I'm not talking about intrinsics, just normal code.

I don't think so. The C code is something like this:

int CalcPower(double startp, double phaseinc, int *arr, int count)
{
int i;
double sum = 0;

for (i = 0; i < count; i++)
sum += arr * sin(startp + i * phaseinc);

return (int)sum;
}

I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.

Demindiro · Post by **Demindiro** » Tue May 10, 2022 2:04 pm

rdos wrote:
Ethin wrote:
rdos wrote: The inner loop consist of multiplying a sin() function (using an 18-bit table) with an array of 14-bit integers. This is typically done on hundreds of values per operation. The code uses assembly as I don't trust the compiler would be able to provide better optimizations than my hand-coded assembly code.
Trust me when I say that you can't write better assembly than the compiler. The compiler knows far more about your program than you do. It after all can hold your entire program in memory and analyze it in its entirety. Good luck doing that in your mind on a program that's 15000 LoC. When it comes to "who can write better assembly", the compiler always wins. Its an extremely rare moment when you can outsmart a compiler these days. And I'm not talking about intrinsics, just normal code.
I don't think so. The C code is something like this:

int CalcPower(double startp, double phaseinc, int *arr, int count)
{
int i;
double sum = 0;

for (i = 0; i < count; i++)
sum += arr * sin(startp + i * phaseinc);

return (int)sum;
}

I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.

Of course it's going to call the FP version of sin if you explicitly tell it to. If you want it to work with integers and a look-up table you have to implement it yourself, as you're already doing in assembly anyways.

rdos · Post by **rdos** » Tue May 10, 2022 2:15 pm

nullplan wrote:He didn't write that either. He wrote that having lots of address space available makes people recklessly map all of physical memory into kernel space.

And it is a fair point that an error in the kernel can corrupt more if more physical memory is mapped. Absolutely. An error in the USB driver would ordinarily only affect USB performance, but with all of physical memory mapped, you could end up writing USB disk contents into the page tables of some process, thereby breaking it. My counterpoint to that is that an error in the kernel is pretty much always catastrophic (so kernel errors must be fixed promptly), and the benefits of always having all of physical memory mapped (trivial mapping from physical to virtual address; not so many TLB shootdowns, which are always a barrel of fun, especially these days with hundreds of cores being a possibility; less TLB pressure due to lack of temporary mapping) are too big to forego, but that is a judgment call.

I'm more concerned with the kernel being "bug-free" (which never is possible, of course) than applications. Bugs in applications can typically only take down the application, and not the whole system. Bugs in kernel, OTOH, can easily kill your system. Today, we run very complicated software in kernel space, and making sure it is bug-free is more or less impossible. Microkernels tend to provide better isolation that can keep errors more local, while the monolithic flat kernels Linux and Windows use are open to all kinds of device bugs taking down the whole system. Isolating drivers using a 32-bit compact memory model keep many bugs local, just like in a microkernel, but only require selector reloads, not TLB shootdowns.

So, I use a 32-bit compact memory model in kernel and a flat memory model for applications. The new filesystem has some resemblance with a microkernel since the filesystems are isolated in their own address spaces and use a flat memory model internally.

Octocontrabass · Post by **Octocontrabass** » Tue May 10, 2022 2:23 pm

rdos wrote:I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.

Er, no, it's plenty smart enough to call a vector math library function instead of using fsin.

But it's not a fair comparison, since this C version is not using a table lookup or integer operations.

rdos · Post by **rdos** » Tue May 10, 2022 2:29 pm

Ethin wrote: Third: paging doesn't allow processes to access one anothers memory without it being mapped first. That's literally the whole point: to make a process think that its the only thing that the computer is doing. If process 1 and 2 are in separate virtual address spaces, they can only access the addresses mapped in those address spaces. If I don't map any of process 2's memory into process 1's address space, process 1 can't access process 2's memory at all -- that'll either cause a page fault or process 1 will read its own memory. If process 1 and 2 are kernel-mode processes (and therefore can change the page tables at will), then the risk is higher -- either process can screw around and break things or read arbitrary memory, but that risk occurs with segmentation too: if a process can muck about with the segment registers, your screwed. Paging also has the advantage of being able to share memory without having to duplicate it. If I want to send data to process 2 from process 1, all I have to do is map the page containing the data I'd like to share into process 2's address space. The underlying physical address doesn't need to change, and I don't have to copy anything around. If I want both processes to be able to access the same memory, all I have to do is map it twice. Again, the underlying address in memory doesn't have to change. Or I could map the memory into process 2's address space and unmap it in process 1's. You can't do that with segmentation.

Nobody claimed the issue was paging vs segmentation. I use both paging and segmentation. I want the advantage of both. I run processes in their own address spaces, and can also create new address spaces for kernel use (the debug monitor runs in it's own address space).

rdos · Post by **rdos** » Tue May 10, 2022 2:40 pm

Octocontrabass wrote:
rdos wrote:I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.
Er, no, it's plenty smart enough to call a vector math library function instead of using fsin.

But it's not a fair comparison, since this C version is not using a table lookup or integer operations.

That's still a lot of code, and the sin() function is a function call with unknown number of operations.

This is basically how it looks like in assembly (there is a 256kx2 entries sin table in the same source file, sin_tab):

Code: Select all

;       PARAMETERS:     Data
;                       Size
;                       InitPhase
;                       PhasePerSample
;                       Power
;
;       RETURNS:        End phase

_CalcFreqPowerA  Proc near
    push ebx
    push ecx
    push edx
    push esi
    push edi
    push ebp
;
    mov esi,[esp+1Ch]
    mov ecx,[esp+20h]
    mov ebp,[esp+24h]
    mov edi,[esp+2Ch]
;
    xor eax,eax
    mov [edi].pow_c,eax
    mov [edi].pow_c+4,eax

cfpaLoop:
    mov ebx,ebp
    shr ebx,13
    and bl,0FEh
;
    mov ax,[ebx].sin_tab
    imul word ptr [esi]
    movsx edx,dx
    add word ptr [edi].pow_c,ax
    adc dword ptr [edi].pow_c+2,edx
;
    add esi,4
    add ebp,[esp+28h]
    loop cfpaLoop
;
    movsx eax,word ptr [edi].pow_c+4
    mov [edi].pow_c+4,eax
;
    mov eax,ebp
;
    pop ebp
    pop edi
    pop esi
    pop edx
    pop ecx
    pop ebx
    ret 20
_CalcFreqPowerA    Endp

The 64-bit version would be one less instruction in the inner loop since it could use a 64-bit register.

The C code using integer parameters would look like this:

Code: Select all

long long CalcPower(int startp, int phaseinc, int *arr, int count)
{
int i;
double sum = 0;
double phase = (double)startp * 2.0 * PI / 0x100000000L;
double incr = (double)phaseinc * 2.0 * PI / 0x100000000L;

for (i = 0; i < count; i++)
{
  sum += arr[i] * sin(phase);
  phase += incr;
}

return (long long)sum;
}

Naturally, the C compiler cannot redo this to use table-lookups and integer operations. Additionally, in the example above, phase can get out of the optimal range, something the assembly code solves by the registers "rolling over".

Ethin · Post by **Ethin** » Tue May 10, 2022 3:10 pm

rdos wrote:
Ethin wrote:
rdos wrote: The inner loop consist of multiplying a sin() function (using an 18-bit table) with an array of 14-bit integers. This is typically done on hundreds of values per operation. The code uses assembly as I don't trust the compiler would be able to provide better optimizations than my hand-coded assembly code.
Trust me when I say that you can't write better assembly than the compiler. The compiler knows far more about your program than you do. It after all can hold your entire program in memory and analyze it in its entirety. Good luck doing that in your mind on a program that's 15000 LoC. When it comes to "who can write better assembly", the compiler always wins. Its an extremely rare moment when you can outsmart a compiler these days. And I'm not talking about intrinsics, just normal code.
I don't think so. The C code is something like this:

int CalcPower(double startp, double phaseinc, int *arr, int count)
{
int i;
double sum = 0;

for (i = 0; i < count; i++)
sum += arr * sin(startp + i * phaseinc);

return (int)sum;
}

I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.

As Octocontrobass said, a compiler is significantly smarter than you are at this. I just tested your code and I got SSE/AVX instructions all the way through -- even with no optimizations enabled. And I tested on ISPC, GCC and Clang. A good high-quality SIN function is actually not that hard to find -- they're fully open source (most of them anyway). I'll see about making the comparison fair but I'm sure that even if I were to write the function by hand you (still) would have zero FPU instructions anywhere.

Demindiro · Post by **Demindiro** » Tue May 10, 2022 3:16 pm

rdos wrote:
Octocontrabass wrote:
rdos wrote:I can almost garantee you that it will use fsin and many other floating point instructions, which would result in horrible performance compared to table-lookup and using integer operations.
Er, no, it's plenty smart enough to call a vector math library function instead of using fsin.

But it's not a fair comparison, since this C version is not using a table lookup or integer operations.
That's still a lot of code, and the sin() function is a function call with unknown number of operations.

This is basically how it looks like in assembly (there is a 256kx2 entries sin table in the same source file, sin_tab):
Code: Select all
;       PARAMETERS:     Data
;                       Size
;                       InitPhase
;                       PhasePerSample
;                       Power
;
;       RETURNS:        End phase

_CalcFreqPowerA  Proc near
    push ebx
    push ecx
    push edx
    push esi
    push edi
    push ebp
;
    mov esi,[esp+1Ch]
    mov ecx,[esp+20h]
    mov ebp,[esp+24h]
    mov edi,[esp+2Ch]
;
    xor eax,eax
    mov [edi].pow_c,eax
    mov [edi].pow_c+4,eax

cfpaLoop:
    mov ebx,ebp
    shr ebx,13
    and bl,0FEh
;
    mov ax,[ebx].sin_tab
    imul word ptr [esi]
    movsx edx,dx
    add word ptr [edi].pow_c,ax
    adc dword ptr [edi].pow_c+2,edx
;
    add esi,4
    add ebp,[esp+28h]
    loop cfpaLoop
;
    movsx eax,word ptr [edi].pow_c+4
    mov [edi].pow_c+4,eax
;
    mov eax,ebp
;
    pop ebp
    pop edi
    pop esi
    pop edx
    pop ecx
    pop ebx
    ret 20
_CalcFreqPowerA    Endp
The 64-bit version would be one less instruction in the inner loop since it could use a 64-bit register.

One thing that immediately stands out is the use of a loop instruction. While this is fast on AMD processors, it's still very slow on Intel processors for some reason (7 cycles on Ice Lake according to Agner Fog's tables). It also seems to me it would be a good idea to put phase_per_sample in a register instead of loading it every iteration.

Anyhow, if I interpreted your code correctly the equivalent C version would look like this:

Code: Select all

static long *sin_table = { 0, /* ... */ };

long CalcPower(int *data, int size, int init_phase, int phase_per_sample, int power) {
	long end_phase = 0;

	for (int i = 0; i < size; i++) {
		int p = (init_phase >> 13) & (~1);
		end_phase += sin_table[p] * data[i];
		init_phase += phase_per_sample;
	}

	return end_phase;
}

The corresponding generated assembly with `gcc -O3 sin.c -c -m32` is:

Code: Select all

00000000 <CalcPower>:
   0:	57                   	push   edi
   1:	56                   	push   esi
   2:	53                   	push   ebx
   3:	8b 54 24 14          	mov    edx,DWORD PTR [esp+0x14]
   7:	8b 4c 24 18          	mov    ecx,DWORD PTR [esp+0x18]
   b:	8b 74 24 1c          	mov    esi,DWORD PTR [esp+0x1c]
   f:	85 d2                	test   edx,edx
  11:	7e 35                	jle    48 <CalcPower+0x48>
  13:	8b 44 24 10          	mov    eax,DWORD PTR [esp+0x10]
  17:	31 db                	xor    ebx,ebx
  19:	8d 3c 90             	lea    edi,[eax+edx*4]
  1c:	8d 74 26 00          	lea    esi,[esi+eiz*1+0x0]
  20:	89 ca                	mov    edx,ecx
  22:	83 c0 04             	add    eax,0x4
  25:	01 f1                	add    ecx,esi
  27:	c1 fa 0d             	sar    edx,0xd
  2a:	83 e2 fe             	and    edx,0xfffffffe
  2d:	8b 14 95 00 00 00 00 	mov    edx,DWORD PTR [edx*4+0x0]
  34:	0f af 50 fc          	imul   edx,DWORD PTR [eax-0x4]
  38:	01 d3                	add    ebx,edx
  3a:	39 c7                	cmp    edi,eax
  3c:	75 e2                	jne    20 <CalcPower+0x20>
  3e:	89 d8                	mov    eax,ebx
  40:	5b                   	pop    ebx
  41:	5e                   	pop    esi
  42:	5f                   	pop    edi
  43:	c3                   	ret    
  44:	8d 74 26 00          	lea    esi,[esi+eiz*1+0x0]
  48:	31 db                	xor    ebx,ebx
  4a:	89 d8                	mov    eax,ebx
  4c:	5b                   	pop    ebx
  4d:	5e                   	pop    esi
  4e:	5f                   	pop    edi
  4f:	c3                   	ret

With `gcc -O3 sin.c -c -march=znver1` (I use a 2700X) the generated assembly is:

Code: Select all

0000000000000000 <CalcPower>:
   0:	89 f0                	mov    eax,esi
   2:	89 d6                	mov    esi,edx
   4:	85 c0                	test   eax,eax
   6:	7e 48                	jle    50 <CalcPower+0x50>
   8:	ff c8                	dec    eax
   a:	45 31 c0             	xor    r8d,r8d
   d:	4c 8d 4c 87 04       	lea    r9,[rdi+rax*4+0x4]
  12:	66 66 2e 0f 1f 84 00 	data16 nop WORD PTR cs:[rax+rax*1+0x0]
  19:	00 00 00 00 
  1d:	0f 1f 00             	nop    DWORD PTR [rax]
  20:	89 f0                	mov    eax,esi
  22:	48 63 17             	movsxd rdx,DWORD PTR [rdi]
  25:	48 83 c7 04          	add    rdi,0x4
  29:	01 ce                	add    esi,ecx
  2b:	c1 f8 0d             	sar    eax,0xd
  2e:	83 e0 fe             	and    eax,0xfffffffe
  31:	48 98                	cdqe   
  33:	48 0f af 14 c5 00 00 	imul   rdx,QWORD PTR [rax*8+0x0]
  3a:	00 00 
  3c:	49 01 d0             	add    r8,rdx
  3f:	49 39 f9             	cmp    r9,rdi
  42:	75 dc                	jne    20 <CalcPower+0x20>
  44:	4c 89 c0             	mov    rax,r8
  47:	c3                   	ret    
  48:	0f 1f 84 00 00 00 00 	nop    DWORD PTR [rax+rax*1+0x0]
  4f:	00 
  50:	45 31 c0             	xor    r8d,r8d
  53:	4c 89 c0             	mov    rax,r8
  56:	c3                   	ret

My GCC version is `gcc (Debian 10.2.1-6) 10.2.1 20210110`

To see whether it's correct & to benchmark it I'll need the lookup table though.

rdos · Post by **rdos** » Tue May 10, 2022 3:31 pm

Demindiro wrote:

Code: Select all

static long *sin_table = { 0, /* ... */ };

long CalcPower(int *data, int size, int init_phase, int phase_per_sample, int power) {
	long end_phase = 0;

	for (int i = 0; i < size; i++) {
		int p = (init_phase >> 13) & (~1);
		end_phase += sin_table[p] * data[i];
		init_phase += phase_per_sample;
	}

	return end_phase;
}

OSDev.org

Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?

Re: Best way to implement MMU on x86_64 Higher Half kernel?