Help in Tasking - That's hideously bad/unmaintainable

tsdnz · Post by **tsdnz** » Thu Feb 11, 2016 1:21 am

Edit: This is from post: http://forum.osdev.org/viewtopic.php?f=1&t=30069
I thought I would start another so as not to clutter up someone else's post.

Brendan wrote:Hi,

tsdnz wrote:Hi, I am in 64-bit long mode, QWORDS/UINT64 only for me, but I hope this helps you out.
A) That's hideously bad/unmaintainable

B) For IRQs the "HandlerAddress" should be the same for all IRQs; and it's better to have a common assembly stub that calls a generic (C, C++) IRQ handler that handles things like how many completely different/unrelated device drivers happen to be sharing that IRQ

C) For all other types of interrupts it's better to have a specific assembly stub for each thing and not have a generic assembly stub (unless you're writing a tutorial and don't want to complicate it by doing things right)

D) Interrupts never have anything to do with task switching in the first place (unless the kernel/scheduler is a massive design failure).

Cheers,

Brendan

LOL, good times.

A) That's hideously bad/unmaintainable
For me I wanted to remove a lookup and call using a Generic Handler.
I was not sure how to do this using asm macros, so I wrote it as I wanted it.
I only have two files in my Kernel, the main file and a Generic file.
Yup, not the design method you guys would like, but using Windows Visual Studio C++ IDE it works a treat.

B) For IRQs the "HandlerAddress" should be the same for all IRQs; and it's better to have a common assembly stub that calls a generic (C, C++) IRQ handler that handles things like how many completely different/unrelated device drivers happen to be sharing that IRQ
I was after speed. I had the design you are talking about, but I found that I was losing a few cycles.
Finding out which CPU the interrupt was on required reading the local APIC ID register at (LocalAPICAddress + 0x20) and shifting the value right by 24.
In my design each interrupt has only one driver; my OS is not a generic OS, it is specific to a task.
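
For readers following along, the lookup described here is a read of the local APIC ID register (offset 0x20 from the APIC base in xAPIC mode) followed by a shift, because the ID sits in bits 31:24 of that register. A minimal sketch in C, assuming the APIC is already mapped; the names `lapic_read`, `apic_id_from_register` and `current_apic_id` are hypothetical and not taken from tsdnz's kernel:

```c
#include <stdint.h>

/* Offset of the local APIC ID register from the APIC base (xAPIC mode). */
#define LAPIC_ID_OFFSET 0x20u

/* Extract the 8-bit APIC ID, which lives in bits 31:24 of the register. */
static inline uint32_t apic_id_from_register(uint32_t reg_value)
{
    return reg_value >> 24;
}

/* Read a 32-bit local APIC register through an already-mapped base
 * address. `base` is assumed to be a valid, uncached MMIO mapping. */
static inline uint32_t lapic_read(volatile const uint8_t *base, uint32_t offset)
{
    return *(volatile const uint32_t *)(base + offset);
}

/* The CPU number as described in the post: read the ID register,
 * then shift right by 24. */
static inline uint32_t current_apic_id(volatile const uint8_t *base)
{
    return apic_id_from_register(lapic_read(base, LAPIC_ID_OFFSET));
}
```

The MMIO read is the part that costs cycles, which is presumably why a per-vector handler that already knows its CPU looked attractive.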

D) Interrupts never have anything to do with task switching in the first place (unless the kernel/scheduler is a massive design failure).
Very interesting, how do you guys time-slice a running program?
For instance a program running in an infinite loop.
On my 48 core server I am time-slicing 196,608 times a second, 48 * 4096.
Each core helps the scheduler.

Brendan wrote:Hi,

ashishkumar4 wrote:and the switch task function:
Never, under any circumstances, do anything in inline assembly that touches or modifies the stack, or relies on any specific stack layout. The stack belongs to the compiler and it will do whatever it likes with its stack; it is not yours to mess with, you gave up the right to touch the stack when you chose to use a compiler.

You must use external assembly and not inline assembly for the (tiny) piece of code that does the final task switch.

Cheers,

Brendan

Never, under any circumstances, do anything in inline assembly that touches or modifies the stack
Although I agree I break the rules.
Changing the stack is always the last line of code to execute before functionality is passed back to user-space in my OS.

You must use external assembly and not inline assembly for the (tiny) piece of code that does the final task switch.
Again, very interesting.
I use inline not external, it works nicely for me, very nicely.
For example: If the scheduler interrupt wants to switch a task, I load the float data, set up the pages, etc...
Then do this,

Code:

/* Point RSP at the saved register block, restore the general-purpose
   registers, then return to the interrupted context with IRETQ. */
asm volatile (
    "movq   %0, %%rsp;"
    "popq   %%r15;"
    "popq   %%r14;"
    "popq   %%r13;"
    "popq   %%r12;"
    "popq   %%r11;"
    "popq   %%r10;"
    "popq   %%r9;"
    "popq   %%r8;"
    "popq   %%rbp;"
    "popq   %%rdi;"
    "popq   %%rsi;"
    "popq   %%rdx;"
    "popq   %%rcx;"
    "popq   %%rbx;"
    "popq   %%rax;"
    "iretq;"    /* pops RIP, CS, RFLAGS, RSP, SS */
    : : "r"((QWORD)&uk->gpr)
);
/*!! __builtin_unreachable();!!*/
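
The thread never shows what `uk->gpr` points at, so the following is only an inferred sketch: a C struct whose fields follow the pop order above, with the five words that `iretq` consumes in 64-bit mode placed after the registers. The names `switch_frame` and `frame_returns_to_user` are hypothetical. The helper checks the low two bits of the saved CS, which hold the privilege level the `iretq` will return to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the block that `uk->gpr` would have to point at
 * for the pop sequence to work: fields appear in pop order, so the first
 * pop (r15) reads the lowest address. */
struct switch_frame {
    uint64_t r15, r14, r13, r12, r11, r10, r9, r8;
    uint64_t rbp, rdi, rsi, rdx, rcx, rbx, rax;
    /* Consumed by iretq, which in 64-bit mode always pops all five. */
    uint64_t rip, cs, rflags, rsp, ss;
};

_Static_assert(offsetof(struct switch_frame, r15) == 0, "r15 is popped first");
_Static_assert(offsetof(struct switch_frame, rax) == 14 * 8, "rax is popped last");
_Static_assert(offsetof(struct switch_frame, rip) == 15 * 8, "iretq words follow");
_Static_assert(sizeof(struct switch_frame) == 20 * 8, "15 GPRs + 5 iretq words");

/* Nonzero if returning through this frame lands in CPL=3: the low two
 * bits of the saved CS selector give the target privilege level. */
int frame_returns_to_user(const struct switch_frame *f)
{
    return (f->cs & 3u) == 3u;
}
```

A frame like this only exists if something built it, either the CPU on interrupt entry plus matching pushes, or the kernel by hand when a task is created.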

I am very interested to hear the bad/massive/terrible mistakes I am making, nothing like learning.

I am compiling a list of performance tests between my OS and windows, it would be great to see how you guys compare.
I only started this today. Here are 3.

How about I start a new topic and we test our server speeds out?
I would very much like to gauge my OS against others and get some feedback.

Both Windows and my OS are on the same server: 48 cores, 1.9 GHz, 128 GB RAM.

1) Inside an infinite loop I increment a volatile QWORD and display it every 1/2 second.
Running on a single core.
My OS shows 474 million per half second = 948 million per second. 1.9 GHz / 948 million ≈ 2 cycles per increment.
Windows shows 160 million per half second = 320 million per second. 1.9 GHz / 320 million ≈ 5.94 cycles per increment.

2) Inside an infinite loop I increment a volatile QWORD and display it every 1/2 second.
Running 48 tasks, one on each core. Separate QWORD for each task.
My OS shows 474 million per half second = 948 million per second. 1.9 GHz / 948 million ≈ 2 cycles per increment.
My OS shows a total count of 22.7 billion (the sum of the QWORDs for all tasks on all CPUs).
Windows shows 160 million per half second = 320 million per second. 1.9 GHz / 320 million ≈ 5.94 cycles per increment.
Windows shows a total count of 7.6 billion (the sum of the QWORDs for all tasks on all CPUs).

3) Inside an infinite loop I increment a volatile QWORD and display it every 1/2 second.
Running 8192 tasks per core = 393,216 tasks. Separate QWORD for each task.
My OS shows a total count of 21.0 billion, a little loss (the sum of the QWORDs for all tasks on all CPUs).
And each task is allocated a percentage of the time.
Windows has trouble updating the screen with 96 threads, so I cannot get accurate readings, and it cannot handle much more without completely stopping.
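
The cycles-per-increment figures in tests 1 and 2 are the clock rate divided by the increments per second. A small sketch of that arithmetic in C; the function name `cycles_per_increment` is made up for illustration, and the inputs are the numbers quoted in the tests:

```c
#include <stdint.h>

/* Average cycles per loop iteration, given the counter value observed
 * after `interval_s` seconds on a core running at `clock_hz`. */
double cycles_per_increment(double clock_hz, uint64_t count, double interval_s)
{
    double per_second = (double)count / interval_s;
    return clock_hz / per_second;
}
```

For example, `cycles_per_increment(1.9e9, 474000000, 0.5)` gives a little over 2, and `cycles_per_increment(1.9e9, 160000000, 0.5)` gives about 5.94. The result depends on the code the compiler generated for the loop as much as on the OS.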

As always, thanks to everyone for their feedback, it is greatly appreciated.

Alistair.

Brendan · Post by **Brendan** » Thu Feb 11, 2016 4:30 am

Hi,

tsdnz wrote:B) For IRQs the "HandlerAddress" should be the same for all IRQs; and it's better to have a common assembly stub that calls a generic (C, C++) IRQ handler that handles things like how many completely different/unrelated device drivers happen to be sharing that IRQ
I was after speed. I had the design you are talking about, but I found that I was losing a few cycles.
Finding out which CPU the interrupt was on required reading the local APIC ID register at (LocalAPICAddress + 0x20) and shifting the value right by 24.
In my design each interrupt has only one driver; my OS is not a generic OS, it is specific to a task.

Most OSs use GS (and the "swapgs" instruction on kernel entry) to access per CPU data.

If your OS is not a generic OS and is specific to a task (e.g. you never plan to support PCI devices without MSI and don't care about IRQ sharing); then it's useless as an example for someone else (e.g. ashishkumar4) unless the other person also happens to be writing an OS that's specific to the same task as yours.

By presenting your code like you did, you make it sound suitable for others when it's not; causing them to implement something that might be right for you but is completely wrong for them.

tsdnz wrote:D) Interrupts never have anything to do with task switching in the first place (unless the kernel/scheduler is a massive design failure).
Very interesting, how do you guys time-slice a running program?
For instance a program running in an infinite loop.
On my 48 core server I am time-slicing 196,608 times a second, 48 * 4096.
Each core helps the scheduler.

If there's only one program (or one thread per CPU) that happens to be running in an infinite loop; you should have zero task switches because there's nothing else to switch to. Most OSs would even disable the scheduler's timer for this case to remove the overhead of unnecessary interrupts.

What I was getting at is that something (kernel API call, IRQ, exception, whatever) causes a privilege level switch from CPL=3 to CPL=0 (e.g. kernel API call); the kernel does some stuff; then the kernel returns from CPL=0 back to CPL=3. This has nothing (directly) to do with task switching.

In the middle of "kernel does some stuff", the kernel may or may not decide to call a scheduler function (to block, unblock, spawn or terminate a task) and the scheduler might or might not do a task switch. This has nothing (directly) to do with whatever happened to cause the privilege level switch from CPL=3 to CPL=0.

Note that it's common for beginners to make the mistake of assuming all task switches are caused by IRQs (often, by scheduler's timer and nothing else). These people end up paying for the mistake later (e.g. doing "HLT" in a loop whenever a task blocks or terminates for whatever reason, and wasting a huge amount of CPU time until an IRQ happens to come along).
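
The separation described above can be sketched as a toy scheduler in plain C. This is a simulation under stated assumptions, not anyone's kernel: the names (`struct scheduler`, `sched_block_current`, `sched_unblock`, `sched_timer_needed`) are hypothetical, there is a single CPU, and "switching" just updates an index. The point it illustrates is that blocking picks the next ready task immediately instead of waiting for a timer IRQ, and that the timer is only worth running when another task is waiting.

```c
enum task_state { TASK_READY, TASK_RUNNING, TASK_BLOCKED };

#define MAX_TASKS 8
#define NO_TASK   (-1)

struct scheduler {
    enum task_state state[MAX_TASKS];
    int ntasks;   /* number of valid entries in state[] */
    int current;  /* index of the running task, or NO_TASK when idle */
};

/* Pick the next READY task round-robin after `from`, mark it RUNNING and
 * make it current. Returns NO_TASK (CPU goes idle) if nothing is ready. */
static int sched_pick_next(struct scheduler *s, int from)
{
    for (int i = 1; i <= s->ntasks; i++) {
        int t = (from + i) % s->ntasks;
        if (s->state[t] == TASK_READY) {
            s->state[t] = TASK_RUNNING;
            s->current = t;
            return t;
        }
    }
    s->current = NO_TASK;
    return NO_TASK;
}

/* Called from any kernel path (API call, IRQ handler, exception) when the
 * current task must wait. The switch happens now, not at the next tick. */
int sched_block_current(struct scheduler *s)
{
    int prev = s->current;
    if (prev != NO_TASK)
        s->state[prev] = TASK_BLOCKED;
    return sched_pick_next(s, prev);
}

/* Make a blocked task runnable; if the CPU was idle, run it immediately. */
void sched_unblock(struct scheduler *s, int t)
{
    if (s->state[t] != TASK_BLOCKED)
        return;
    s->state[t] = TASK_READY;
    if (s->current == NO_TASK)
        sched_pick_next(s, t - 1);
}

/* The time-slice timer is only useful while some other task is waiting
 * for the CPU. */
int sched_timer_needed(const struct scheduler *s)
{
    for (int t = 0; t < s->ntasks; t++)
        if (s->state[t] == TASK_READY)
            return 1;
    return 0;
}
```

In a real kernel the only place that would idle with "HLT" is the `NO_TASK` case, where nothing at all is ready to run.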

tsdnz wrote:You must use external assembly and not inline assembly for the (tiny) piece of code that does the final task switch.
Again, very interesting.
I use inline not external, it works nicely for me, very nicely.
For example: If the scheduler interrupt wants to switch a task, I load the float data, set up the pages, etc...
Then do this,
Code:
/* Point RSP at the saved register block, restore the general-purpose
   registers, then return to the interrupted context with IRETQ. */
asm volatile (
    "movq   %0, %%rsp;"
    "popq   %%r15;"
    "popq   %%r14;"
    "popq   %%r13;"
    "popq   %%r12;"
    "popq   %%r11;"
    "popq   %%r10;"
    "popq   %%r9;"
    "popq   %%r8;"
    "popq   %%rbp;"
    "popq   %%rdi;"
    "popq   %%rsi;"
    "popq   %%rdx;"
    "popq   %%rcx;"
    "popq   %%rbx;"
    "popq   %%rax;"
    "iretq;"    /* pops RIP, CS, RFLAGS, RSP, SS */
    : : "r"((QWORD)&uk->gpr)
);
/*!! __builtin_unreachable();!!*/
I am very interested to hear the bad/massive/terrible mistakes I am making, nothing like learning.

You're assuming that task switches always cause an immediate return to CPL=3 and that the stack was set up by an IRQ.

tsdnz wrote:I am compiling a list of performance tests between my OS and windows, it would be great to see how you guys compare.
I only started this today. Here are 3.

Are you benchmarking infinite loops(!); or benchmarking the difference between printing characters in a window in graphics mode with full font support vs. printing characters directly to 0xB8000 in text mode?

A relatively standard/common test is the "ping pong" test. The idea is that one task sends something to another task and waits to receive something back (causing the second task to unblock and the first task to block); and the other task waits to receive something and sends something back (causing the first task to unblock and the second task to block). This causes a massive number of task switches; and gives you a good idea of task switching overhead (and/or IPC overhead - how fast your pipes, networking, message passing, whatever is). Typically people do this test before they bother to get IRQs working.
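
A hosted approximation of the ping-pong test can be written with two POSIX pipes and `fork()`. This is only a sketch under assumptions: pipes stand in for whatever IPC the OS under test provides, the function name `ping_pong` is made up, and timing is left to the caller (for example by reading a clock before and after the call and dividing by the number of round trips).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bounce one byte between a parent and a child process `rounds` times.
 * Each read() blocks the caller until the peer writes, so every round
 * trip blocks one task and unblocks the other. Returns the number of
 * completed round trips, or -1 on setup failure. */
long ping_pong(long rounds)
{
    int to_child[2], to_parent[2];

    if (pipe(to_child) != 0)
        return -1;
    if (pipe(to_parent) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        close(to_parent[0]);
        close(to_parent[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: wait for a ping, answer with a pong, until EOF. */
        char byte;
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1)
            if (write(to_parent[1], &byte, 1) != 1)
                break;
        _exit(0);
    }

    /* Parent: send a ping, then block until the pong comes back. */
    close(to_child[0]);
    close(to_parent[1]);
    long done = 0;
    char byte = 'x';
    while (done < rounds) {
        if (write(to_child[1], &byte, 1) != 1)
            break;
        if (read(to_parent[0], &byte, 1) != 1)
            break;
        done++;
    }
    close(to_child[1]);   /* EOF makes the child leave its loop */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    return done;
}
```

On a multi-core machine the two processes may run on different cores, in which case the result reflects IPC wake-up latency more than pure task-switch cost; pinning both to one CPU is one way to isolate the latter.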

Cheers,

Brendan

ashishkumar4 · Post by **ashishkumar4** » Thu Feb 11, 2016 6:56 am

Lol all this stuff is so interesting for me to read :p because I am still far back in OS dev (just started a month ago :p )

tsdnz · Post by **tsdnz** » Thu Feb 11, 2016 12:30 pm

Brendan wrote:Most OSs use GS (and the "swapgs" instruction on kernel entry) to access per CPU data.

Yes, I was using SwapGS but found it faster my way.

Brendan wrote:If your OS is not a generic OS and is specific to a task (e.g. you never plan to support PCI devices without MSI and don't care about IRQ sharing); then it's useless as an example for someone else (e.g. ashishkumar4) unless the other person also happens to be writing an OS that's specific to the same task as yours.

That is a great point, never thought about the other person and what their goal was and how my response could be confusing to them, thanks.

Brendan wrote:By presenting your code like you did, you make it sound suitable for others when it's not; causing them to implement something that might be right for you but is completely wrong for them.

Very true, great point.

Brendan wrote:Note that it's common for beginners to make the mistake of assuming all task switches are caused by IRQs (often, by scheduler's timer and nothing else). These people end up paying for the mistake later (e.g. doing "HLT" in a loop whenever a task blocks or terminates for whatever reason, and wasting a huge amount of CPU time until an IRQ happens to come along).

I see the light, great points.

Brendan wrote:Are you benchmarking infinite loops(!); or benchmarking the difference between printing characters in a window in graphics mode with full font support vs. printing characters directly to 0xB8000 in text mode?

Infinite loops, also the code generated by the loop. Not a true apple vs apple test, but it will highlight to my audience what I am trying to show them.

Brendan wrote:A relatively standard/common test is the "ping pong" test. The idea is that one task sends something to another task and waits to receive something back (causing the second task to unblock and the first task to block); and the other task waits to receive something and sends something back (causing the first task to unblock and the second task to block). This causes a massive number of task switches; and gives you a good idea of task switching overhead (and/or IPC overhead - how fast your pipes, networking, message passing, whatever is). Typically people do this test before they bother to get IRQs working.

Thanks, I have tests for the task switching overhead. I have other tests like you suggest, message passing, networking, number of tasks created per second with different sizes, memory allocation, etc....

As you stated above my OS is designed for a specific task.
I have a large number of tests that are not applicable to the general OS.

It is all very exciting.

Thanks for your feedback.

Alistair

max · Post by **max** » Fri Feb 12, 2016 12:28 pm

tsdnz wrote:
Brendan wrote:Are you benchmarking infinite loops(!); or benchmarking the difference between printing characters in a window in graphics mode with full font support vs. printing characters directly to 0xB8000 in text mode?
Infinite loops, also the code generated by the loop. Not a true apple vs apple test, but it will highlight to my audience what I am trying to show them.

That makes no sense. As Brendan said you can't benchmark your tasking like this properly. Also you can't benchmark anything with loops that do nothing.

tsdnz · Post by **tsdnz** » Fri Feb 12, 2016 5:49 pm

max wrote:
tsdnz wrote:
Brendan wrote:Are you benchmarking infinite loops(!); or benchmarking the difference between printing characters in a window in graphics mode with full font support vs. printing characters directly to 0xB8000 in text mode?
Infinite loops, also the code generated by the loop. Not a true apple vs apple test, but it will highlight to my audience what I am trying to show them.
That makes no sense. As Brendan said you can't benchmark your tasking like this properly. Also you can't benchmark anything with loops that do nothing.

Are you sure?
I can spin up 4,800 threads or processes in Windows that are blocking.
Then unblock them into an infinite loop that just increments a volatile QWORD.
I can then add the QWORDs together and find out how much time is spent in the scheduler.

I can then try 48,000 or 480,000, etc...

I can also just spin up the threads directly into an infinite loop and see how many threads are created in 5 minutes.
This will show me how the OS works when the threads are fully loaded and a new request is needed.
Again showing me the performance of the scheduler.
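
The "add the QWORDs together" idea amounts to comparing the summed counters under heavy load against the sum from a run with one task per core. A sketch of that calculation in C, with a hypothetical function name, using the totals from the first post (22.7 billion with 48 tasks, 21.0 billion with 393,216 tasks):

```c
/* Fraction of CPU time that did not reach the counting loops, estimated
 * by comparing the summed counters under load against the summed
 * counters from a baseline run with one task per core. */
double scheduler_overhead_fraction(double baseline_total, double loaded_total)
{
    return 1.0 - loaded_total / baseline_total;
}
```

With those totals, `scheduler_overhead_fraction(22.7e9, 21.0e9)` comes out at roughly 0.075, i.e. about 7.5% of the increments were lost. That figure lumps together scheduler code, context-switch cost and cache or TLB effects, so it is an upper bound on time spent in the scheduler itself rather than a direct measurement of it.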

Ideas?

Alistair
