can IPI be cancelled?

xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

can IPI be cancelled?

Post by xeyes »

For example, suppose the destination core has already acted on the event without taking the interrupt. It also knows that an IPI is coming its way, maybe just an sti away.

Is there a good way to cancel the IPI to avoid the cost of getting into the handler just to return from it right away?

A related case: if the CPU is already in the IPI handler and notices that more IPIs are arriving on the same vector, can it simply EOI multiple times?
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: can IPI be cancelled?

Post by thewrongchristian »

xeyes wrote:For example, suppose the destination core has already acted on the event without taking the interrupt. It also knows that an IPI is coming its way, maybe just an sti away.

Is there a good way to cancel the IPI to avoid the cost of getting into the handler just to return from it right away?

A related case: if the CPU is already in the IPI handler and notices that more IPIs are arriving on the same vector, can it simply EOI multiple times?
You're assuming IPIs queue. Does an x86 LAPIC do that? Do PICs on other architectures do that?

Do you have an example IPI use case?

An obvious one would be TLB shootdown, in which case, you might be batching PTE entries to shoot down on other CPUs. In your case, you might be invalidating a page on CPU0, sending the IPI to CPU1 to do the same, then invalidating the next page on CPU0, IPI to CPU1 etc. Is that what you mean?

If you treat your IPI just like any other interrupt, it's just a signal that something has happened, but not exactly what. If you then use shared memory to communicate the intent of the IPI, your IPI handler can just look for outstanding work in this shared area, access to which can be synchronized, and complete any outstanding work.

If the TLB shootdown example above is batched into this shared area, then CPU1 can respond to the IPI by processing all the batched invalid PTEs in the shared area, then ack the IPI. The next IPI that comes in will do the same, and if the work has already been done, it can just ack the IPI after doing nothing.

The other thing to bear in mind is that if the other processor has already processed whatever it is you're communicating, that state can be recorded (perhaps by the state of the shared memory indicating there is no outstanding work) and the IPI skipped as not necessary. The other CPU would have to be polling this shared structure though, which might have cache and bus contention issues, so it might be best just to avoid polling and rely on IPIs.

Note, this is theoretical. I've not done SMP or IPIs yet, but is just an example of how I might manage a generic IPI mechanism in a manner that handles your use case.
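A minimal sketch of what that shared work area might look like for the shootdown case (untested, and every name here, including the spinlock and the invlpg/EOI wrappers, is made up):

Code: Select all

#define SHOOTDOWN_BATCH 16

struct shootdown_inbox {
	spinlock_t lock;			/* synchronises senders and the owner */
	unsigned   count;			/* pending entries; 0 = no outstanding work */
	uintptr_t  pages[SHOOTDOWN_BATCH];	/* virtual addresses to invalidate */
};

/* IPI handler on the receiving CPU: drain everything, then ack.
 * A second IPI arriving after the drain finds count == 0 and does nothing. */
void tlb_shootdown_ipi(struct shootdown_inbox *inbox)
{
	spin_lock(&inbox->lock);
	for (unsigned i = 0; i < inbox->count; i++)
		invlpg(inbox->pages[i]);
	inbox->count = 0;
	spin_unlock(&inbox->lock);
	lapic_eoi();
}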
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

thewrongchristian wrote: You're assuming IPIs queue. Does an x86 LAPIC do that? Do PICs on other architectures do that?
The Intel manual isn't very clear on this. From my reading, it should be able to queue at least two: one in the ISR and another in the IRR. But if more arrive, it isn't clear what should happen.

My experiments showed unreliable behaviors as well. I tried bursting IPIs in small batches, like 16 at a regular cadence. Sometimes the handler was called 16 times, sometimes only once, and sometimes some number in between.

So the hardware or KVM might be batching IPIs, unless one of them dropped some, which seems unlikely.
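For what it's worth, the destination can at least observe a pending IPI by peeking at the IRR/ISR bits. A rough, untested sketch, going by the xAPIC MMIO layout in the SDM (ISR at offset 0x100, IRR at 0x200, eight 32-bit registers each, spaced 16 bytes apart), with the lapic mapping assumed:

Code: Select all

#include <stdint.h>

static int lapic_bit(volatile uint32_t *lapic, uint32_t base, uint8_t vec)
{
	uint32_t reg = lapic[(base + (vec / 32) * 0x10) / 4];
	return (reg >> (vec % 32)) & 1;
}

/* raised at this core but not yet taken */
int ipi_pending(volatile uint32_t *lapic, uint8_t vec)
{
	return lapic_bit(lapic, 0x200, vec);	/* IRR */
}

/* taken, handler running, EOI not yet sent */
int ipi_in_service(volatile uint32_t *lapic, uint8_t vec)
{
	return lapic_bit(lapic, 0x100, vec);	/* ISR */
}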
thewrongchristian wrote: Do you have an example IPI use case?
Yes, I use IPI to notify another core to resume a blocked task.

But there's a twist: if the destination core is idle enough, it does some extra checking in scheduler code, so it can discover the resume request before the IPI arrives.
thewrongchristian wrote: that state can be recorded (perhaps by the state of the shared memory indicating there is no outstanding work) and the IPI skipped as not necessary.
That doesn't seem to work for the resume case, though: the sending core posts the resume request via shared memory and then sends the IPI. It can't afford to wait before sending the IPI, as the destination core could be busy and might not discover the request in shared memory for a long time.

So the destination core has no chance to prevent the IPI from being sent. But it can and does win the race against the IPI itself (20-30% of the time from my observations), and that's why I want to cancel an IPI that's already en route.
thewrongchristian wrote: TLB shootdown
I'm actually avoiding complex things as much as possible as I have enough race conditions to debug already :lol:

So my TLB invalidation is very simple: whenever I discover a case where a stale TLB messes things up, I put an invlpg or a CR3 reload there, depending on how big the affected area is. It is relatively slow but seems acceptable.

But task resume is different: even if the destination core is completely idle, it only runs scheduler code once in a while, for example at 60Hz. That means a minimum average delay of 1/120 second to resume a remote task, and much longer in reality when the core isn't completely idle :(
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: can IPI be cancelled?

Post by thewrongchristian »

xeyes wrote:
thewrongchristian wrote: Do you have an example IPI use case?
Yes, I use IPI to notify another core to resume a blocked task.

But there's a twist: if the destination core is idle enough, it does some extra checking in scheduler code, so it can discover the resume request before the IPI arrives.
thewrongchristian wrote: that state can be recorded (perhaps by the state of the shared memory indicating there is no outstanding work) and the IPI skipped as not necessary.
That doesn't seem to work for the resume case, though: the sending core posts the resume request via shared memory and then sends the IPI. It can't afford to wait before sending the IPI, as the destination core could be busy and might not discover the request in shared memory for a long time.

So the destination core has no chance to prevent the IPI from being sent. But it can and does win the race against the IPI itself (20-30% of the time from my observations), and that's why I want to cancel an IPI that's already en route.
But do you really want an otherwise idle processor busy polling shared structures?

Consider:
  • Impact on core power - A busy polling core will push up both the power draw and the temperature of the core, which will affect not only power usage, but potentially the performance of other cores. Features such as turbo boost depend on idle cores creating temperature and power headroom to boost the performance of individual cores and improve the performance of single threads.
  • Impact on cache/bus usage - Busy polling a shared data structure will create cache and bus contention. This can impact the performance of the entire system.
If a core is idle, then let it idle in a hlt, waiting for an interrupt.
xeyes wrote:
TLB shootdown
I'm actually avoiding complex things as much as possible as I have enough race conditions to debug already :lol:

So my TLB invalidation is very simple, whenever I discover a case where stale TLB messes things up, I put a invlpg or cr3 reload there depending on how big the affected area is. It is relatively slow but seems acceptable.

But task resume is different as even the dest core is completely idle, it only runs scheduler code once in a while, for example 60hz. That means a minimal 1/120 second delay on average to resume a remote task, and much longer in reality when the core isn't completely idle :(
Why not do your scheduling decisions as part of your interrupt return path?

Time slicing can be as it is now: on your timer interrupt, set a pre-empt flag indicating that scheduling is required. Then, when you return from the timer interrupt, the pre-empt flag is checked and you switch tasks if it is set.

But the beauty of this is that you can do it on all interrupts, so any interrupt that results in a scheduling decision being required (for example, an IO completion unblocking a high-priority thread waiting for that IO to finish) will similarly set the pre-empt flag, and all the pre-emption is done in a single place with a single piece of logic. Less to debug.

IPIs can be the same. CPU0 running thread0 marks thread1 as ready to run. But CPU1 is idle, so in your case, you want to reschedule thread1 on CPU1. You can communicate that with an IPI, with an instruction to reschedule, which the IPI duly handles on CPU1 by setting the pre-empt flag. On return from the IPI on CPU1, the flag now indicates the scheduler needs to be called, and we call it and pick up the newly runnable thread1.
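As a sketch (the per-CPU structure and the helper names are made up, not from any real codebase):

Code: Select all

struct cpu {
	volatile int preempt;	/* set by timer, IO and resume-IPI handlers */
};

/* Resume IPI handler: nothing to decode, just request a reschedule. */
void resched_ipi_handler(struct cpu *cpu)
{
	cpu->preempt = 1;
	lapic_eoi();		/* hypothetical EOI wrapper */
}

/* Common tail of every interrupt return path, run just before iret. */
void interrupt_return_check(struct cpu *cpu)
{
	if (cpu->preempt) {
		cpu->preempt = 0;
		schedule();	/* may find nothing to do if the work already happened */
	}
}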
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

thewrongchristian wrote: But do you really want an otherwise idle processor busy polling shared structures?

Consider:
  • Impact on core power - A busy polling core will push up both the power draw and the temperature of the core, which will affect not only power usage, but potentially the performance of other cores. Features such as turbo boost depend on idle cores creating temperature and power headroom to boost the performance of individual cores and improve the performance of single threads.
  • Impact on cache/bus usage - Busy polling a shared data structure will create cache and bus contention. This can impact the performance of the entire system.
If a core is idle, then let it idle in a hlt, waiting for an interrupt.
The idle core only polls at the switch frequency, for example 60Hz; it is otherwise in a hlt loop.

The infrequent polling is the main reason I felt an IPI is needed to poke it into action.
thewrongchristian wrote: Why not do your scheduling decisions as part of your interrupt return path?

Time slicing can be as it is now: on your timer interrupt, set a pre-empt flag indicating that scheduling is required. Then, when you return from the timer interrupt, the pre-empt flag is checked and you switch tasks if it is set.
My flow is similar for this part: the timer interrupt runs at a higher frequency than the switch frequency, and it only invokes the scheduler at the switch frequency.
thewrongchristian wrote:
But the beauty of this is that you can do it on all interrupts, so any interrupt that results in a scheduling decision being required (for example, an IO completion unblocking a high-priority thread waiting for that IO to finish) will similarly set the pre-empt flag, and all the pre-emption is done in a single place with a single piece of logic. Less to debug.
Interesting idea. I currently have some hard-coded fast paths for IO completion; for example, the disk interrupt is pointed at the core that programmed the disk, and the disk ISR can directly iret to the task that is blocked waiting for the disk. It seems like a better idea to have a uniform way of doing things.

There are quite a few different paths of entry toward a task switch, though: user code can synchronously request to yield the CPU or exit via a syscall, take a synchronous exception like a page fault, take an asynchronous exception like any ISR, or receive an asynchronous signal (like Ctrl+C, or another task calling kill on it).

I have some unification in place, but not all the paths enter the scheduler the same way. Does it make sense to make these, and all future ways of getting into a task switch, share a uniform front end?
thewrongchristian wrote:
IPIs can be the same. CPU0 running thread0 marks thread1 as ready to run. But CPU1 is idle, so in your case, you want to reschedule thread1 on CPU1. You can communicate that with an IPI, with an instruction to reschedule, which the IPI duly handles on CPU1 by setting the pre-empt flag. On return from the IPI on CPU1, the flag now indicates the scheduler needs to be called, and we call it and pick up the newly runnable thread1.
The original question still stands: if other ISRs can also invoke the scheduler, wouldn't that only make CPU1 more likely to act on the resume before the IPI, and thus make me want to cancel it even more?
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: can IPI be cancelled?

Post by thewrongchristian »

xeyes wrote:
thewrongchristian wrote: But do you really want an otherwise idle processor busy polling shared structures?

Consider:
  • Impact on core power - A busy polling core will push up both the power draw and the temperature of the core, which will affect not only power usage, but potentially the performance of other cores. Features such as turbo boost depend on idle cores creating temperature and power headroom to boost the performance of individual cores and improve the performance of single threads.
  • Impact on cache/bus usage - Busy polling a shared data structure will create cache and bus contention. This can impact the performance of the entire system.
If a core is idle, then let it idle in a hlt, waiting for an interrupt.
The idle core only polls at the switch frequency, for example 60Hz; it is otherwise in a hlt loop.

The infrequent polling is the main reason I felt an IPI is needed to poke it into action.
Yes, exactly. The way you described it before, it sounded like the scheduler only ran at all at a rate of 60Hz.

The question then becomes, if your CPU is idle, why do you even need a switch frequency? You'll only wake a thread in response to a stimulus, either an interrupt or a schedule request from another CPU. But with no other stimulus, a 60Hz scheduler is just wasting cycles. Why not just rely on IPI for rescheduling when idle?
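An idle loop without a tick could be as simple as this sketch (reusing the made-up names from my sketch above): the important detail is the sti;hlt pair, since sti delays interrupt delivery until after the next instruction, a wake-up IPI cannot slip in between the flag check and the hlt:

Code: Select all

void idle_loop(struct cpu *cpu)
{
	for (;;) {
		asm volatile("cli" ::: "memory");	/* check the flag with interrupts off */
		if (cpu->preempt) {
			asm volatile("sti" ::: "memory");
			schedule();
			continue;
		}
		asm volatile("sti; hlt" ::: "memory");	/* wake on any interrupt, incl. IPIs */
	}
}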
xeyes wrote:
thewrongchristian wrote:
But the beauty of this is that you can do it on all interrupts, so any interrupt that results in a scheduling decision being required (for example, an IO completion unblocking a high-priority thread waiting for that IO to finish) will similarly set the pre-empt flag, and all the pre-emption is done in a single place with a single piece of logic. Less to debug.
Interesting idea. I currently have some hard-coded fast paths for IO completion; for example, the disk interrupt is pointed at the core that programmed the disk, and the disk ISR can directly iret to the task that is blocked waiting for the disk. It seems like a better idea to have a uniform way of doing things.

There are quite a few different paths of entry toward a task switch, though: user code can synchronously request to yield the CPU or exit via a syscall, take a synchronous exception like a page fault, take an asynchronous exception like any ISR, or receive an asynchronous signal (like Ctrl+C, or another task calling kill on it).

I have some unification in place, but not all the paths enter the scheduler the same way. Does it make sense to make these, and all future ways of getting into a task switch, share a uniform front end?
By front-end, do you mean the entry point into the scheduler itself? I would certainly recommend a single entry point to the scheduler.

My scheduler is called expecting:
  • The scheduler lock to be held.
  • The current thread queued wherever already. It can be queued on a wait queue (such as waiting for IO), or it can be queued onto the run queue.
  • The call to the scheduler then selects the next thread to run and dispatches it (note, this can be the same thread we just put back onto the run queue, which we can special-case by doing nothing but removing it from the run queue again).
I have three conditions where I need to call the scheduler, all with their own calls. This might be what you mean by front end:
  • Co-operative yield - Lock the scheduler. Put the current thread onto the end of the current priority run queue, and call the scheduler.
  • Pre-emptive yield - Lock the scheduler. Put the current thread onto the front of the current priority run queue, and call the scheduler.
  • Sleep - Lock the scheduler. Put the current thread onto whatever wait queue we're waiting on, and call the scheduler.
The last one, Sleep, is implemented in my primary synchronisation primitive, and is the only way to put a thread to sleep.

Resuming a thread currently just locks the scheduler, puts the woken thread onto the run queue, then unlocks the scheduler again. Then we set the pre-empt flag if the newly resumed thread has a higher priority than the current thread, but that pre-emption is not done until we return from the current interrupt.

My interrupt return then checks whether the pre-empt flag is set, and calls the pre-empt front end.
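In code, the front ends come out something like this (an illustrative sketch with made-up names; thread_schedule() is assumed to release the scheduler lock once it has dispatched):

Code: Select all

void thread_yield(void)			/* co-operative yield */
{
	scheduler_lock();
	runqueue_push_back(current_thread());
	thread_schedule();
}

void thread_preempt(void)		/* pre-emptive yield */
{
	scheduler_lock();
	runqueue_push_front(current_thread());
	thread_schedule();
}

void thread_sleep(struct waitqueue *wq)	/* the only way to block */
{
	scheduler_lock();
	waitqueue_add(wq, current_thread());
	thread_schedule();
}

void thread_resume(struct thread *t)	/* wake, but defer the switch */
{
	scheduler_lock();
	runqueue_push_back(t);
	scheduler_unlock();
	if (t->priority > current_thread()->priority)
		set_preempt_flag();	/* acted on at interrupt return */
}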
xeyes wrote:
thewrongchristian wrote:
IPIs can be the same. CPU0 running thread0 marks thread1 as ready to run. But CPU1 is idle, so in your case, you want to reschedule thread1 on CPU1. You can communicate that with an IPI, with an instruction to reschedule, which the IPI duly handles on CPU1 by setting the pre-empt flag. On return from the IPI on CPU1, the flag now indicates the scheduler needs to be called, and we call it and pick up the newly runnable thread1.
The original question still stands: if other ISRs can also invoke the scheduler, wouldn't that only make CPU1 more likely to act on the resume before the IPI, and thus make me want to cancel it even more?
I don't know of a way to cancel it. Probably the best you can hope for is to make the no-op IPI path as efficient as possible.

Using my scheme, that would be processing such a pre-empt IPI simply by setting the pre-empt flag, then handling the case of no actual pre-emption as efficiently as possible (don't bother changing contexts).

It may not be absolutely optimal, but I think it's certainly more manageable.
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

thewrongchristian wrote: if your CPU is idle, why do you even need a switch frequency?
Historical reasons :wink:.

The cores exchange status among each other. For example, if one core asserts or panics, the other cores pick up the hint and also stop rather than keep going. So the timer tick needs to run at a frequency that isn't 'perceived by humans as too slow'. This predates IPIs, and it won't always be possible for a panicking core to IPI the others either.

The scheduler part, as you can easily imagine, is very fast. It determines that the runqueue is empty and then returns without switching. So the overhead of doing this at 60Hz seems small enough not to warrant the difficulty of making the kernel tickless at this time.
thewrongchristian wrote:By front-end, do you mean the entry point into the scheduler itself?
I meant the various entry points into kernel code that could eventually lead to a context switch, like the code that saves registers before calling a C ISR, and the code that does housekeeping (taking timestamps, etc.) before a syscall.

It seems all of these need to be the same for "pre-emption is done in a single place with a single piece of logic"? I do it in a single place, but it fans out into many branches to take care of the different ways the switch target last entered kernel code (or the lack thereof, for new tasks), so it is far from a single piece of logic :(

Unifying all kernel entry points seems inefficient. For example, a timer ISR knows exactly what needs to be saved and doesn't want the heavy housekeeping logic used for syscall entries. Yet without that, the context switch code would always have to take care of the differences.

Also, I'm curious why you lock the scheduler. Can it be called recursively?
thewrongchristian wrote: no-op IPI path as efficient as possible.
Yes, that's the plan. It would be nice if I could cancel it, but if not, it still works.
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: can IPI be cancelled?

Post by thewrongchristian »

xeyes wrote:
thewrongchristian wrote:By front-end, do you mean the entry point into the scheduler itself?
I meant the various entry points into kernel code that could eventually lead to a context switch, like the code that saves registers before calling a C ISR, and the code that does housekeeping (taking timestamps, etc.) before a syscall.

It seems all of these need to be the same for "pre-emption is done in a single place with a single piece of logic"? I do it in a single place, but it fans out into many branches to take care of the different ways the switch target last entered kernel code (or the lack thereof, for new tasks), so it is far from a single piece of logic :(

Unifying all kernel entry points seems inefficient. For example, a timer ISR knows exactly what needs to be saved and doesn't want the heavy housekeeping logic used for syscall entries. Yet without that, the context switch code would always have to take care of the differences.
Because kernel threads only switch tasks under complete control of the kernel, using a C interface, the only state you actually have to save when switching kernel threads is the callee-saved register set defined by the ABI. I'm 32-bit x86 only for now, so the ABI dictates that my call to thread_schedule() will preserve EBX, EDI, ESI, EBP, and ESP. So my task state is essentially just a setjmp/longjmp jmp_buf. No interrupted state is saved directly by the scheduler; that interrupted state is left at the bottom of the kernel stack, with a pointer to it passed to the ISR as an argument.

So, I have a single common ISR C routine, called by all the interrupt vectors, which takes as an argument an ISR number, and a pointer to the interrupted state.

My common ISR:

Code: Select all

void i386_isr(uint32_t num, arch_trap_frame_t * state)
{
	isr_t isr = itable[num] ? itable[num] : unhandled_isr;

	isr(num, state);

	/* Do any post-return processing */
	intr_beforereturn();

	/* For return to usermode, do any processing such as delivering signals */
	if (state->cs == 0x1b) {	/* 0x1b = user code segment: GDT selector 3, RPL 3 */
		intr_beforereturntouser();
	}
}
For a system call, we use the interrupted state as the user state to select what to do in the system call.

For a hardware interrupt, we simply ignore the interrupted state entirely; it is of no significance to the interrupt handling context. Your timer is just a hardware interrupt, and the interrupted context has to be saved no matter what anyway.

But in neither case does the interrupted state have any bearing on what we switch when we call the scheduler to switch tasks. That is dictated entirely by the C ABI.
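To illustrate (an untested sketch; a kernel would supply its own setjmp/longjmp rather than the hosted ones, and the names are hypothetical):

Code: Select all

#include <setjmp.h>

struct thread {
	jmp_buf context;	/* on i386: EBX, ESI, EDI, EBP, ESP, EIP */
	/* ... */
};

void thread_switch(struct thread *from, struct thread *to)
{
	if (setjmp(from->context) == 0)
		longjmp(to->context, 1);	/* resumes 'to' inside its own setjmp */
	/* reaching here means 'from' was switched back in later */
}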

xeyes wrote: Also curious why you lock the scheduler? It can be recursively called?
While I don't do SMP yet, I want the code ready to be SMP capable with as little change as possible.

Also, as a side-effect of locking the scheduler, I also block out interrupts, so the scheduling process can't be interrupted by a hardware interrupt, which might itself have scheduling ramifications (interrupt return being a pre-emption point).
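Sketched out (made-up names, not nestable as written, and the saved flags would become per-CPU once SMP arrives):

Code: Select all

#include <stdint.h>

static uint32_t sched_saved_flags;

void scheduler_lock(void)
{
	uint32_t flags;
	asm volatile("pushf; pop %0; cli" : "=r"(flags) :: "memory");
	/* SMP: spin_lock(&sched_lock) would go here */
	sched_saved_flags = flags;
}

void scheduler_unlock(void)
{
	/* SMP: spin_unlock(&sched_lock) would go here */
	if (sched_saved_flags & (1u << 9))	/* restore IF only if it was set */
		asm volatile("sti" ::: "memory");
}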
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

thewrongchristian wrote: a single common ISR C routine, called by all the interrupt vectors, which takes as an argument an ISR number, and a pointer to the interrupted state.
That's a neat idea. I have a similar common dispatcher for "slow" device ISRs, but I also have special cases for "fast" ISRs, like the timer and the IPI, which don't go through the dispatching. So there are many loose ends to keep track of.
thewrongchristian wrote: dictated entirely by the C ABI
I have places where ring 0 code calls a syscall directly, using register parameter passing and the software int, and the other end has to do the whole state saving until it reaches the point of reading the saved CS and noticing that it was called by ring 0 code. Maybe it's a good idea to move away from such usage.
thewrongchristian wrote: SMP capable with as little change as possible.
You plan to have all CPUs read/write the same set of run queues? It would probably work, and it wouldn't have issues like "resuming a remote task".

But moving to SMP without much code change is a tall order even on the super SMP friendly x86 architecture.

Let me share a fascinating little example:

Code: Select all

    // v1, v2 and sum are int pointers into shared memory,
    // with *v1, *v2 and *sum all 0 before either thread starts
    // launch 2 threads with thread id 0 and 1; tmp is thread-local
    if(thread_id == 1)
    {
        while(1)
        {
            *v1 = 0;
            // your choice of barrier wait
            *v1 = 1;
            tmp = *v2;
            if(tmp == 1) atomic_increment(sum);
            // another barrier wait
        }
    }
    else
    {
        while(1)
        {
            *v2 = 0; *sum = 0;
            // your choice of barrier wait
            *v2 = 1;
            tmp = *v1;
            if(tmp == 1) atomic_increment(sum);
            // another barrier wait
            if (*sum == 0) printf("sum is 0 ?!\n");
        }
    }
Can sum ever be 0?

If you launch these threads on separate x86 CPUs or cores, it happens very often.

But a single CPU won't show this no matter how fast it preempts and switches between the 2 threads.

I learned this from https://preshing.com/20120515/memory-re ... n-the-act/ I didn't need the random delay, and it already happens very often.
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: can IPI be cancelled?

Post by thewrongchristian »

xeyes wrote:
thewrongchristian wrote: SMP capable with as little change as possible.
You plan to have all CPUs read/write the same set of run queues? It would probably work, and it wouldn't have issues like "resuming a remote task".
The plan is for baby steps. Make it work first, then make it work fast (if needs be).

In the SMP/scheduler case, it's not necessarily locks per se that kill performance, it's lock contention. I'd have to measure lock contention on the scheduler lock before making a decision on how to remedy any bottlenecks.

A separate run-queue per CPU is also about affinity, which has other benefits such as cache usage optimisation, so it might be that I have a run-queue per CPU, into which the main scheduler periodically deposits tasks. Such a queue might be accessed in a lock-less manner using atomics, if implemented as a simple linked list.

But I'd still have the remote resume issue you have even with a single run queue. The resumer on CPU0 still has to send an IPI to CPU1 for the resumee to be scheduled on CPU1. In my case, though, it'll just be an otherwise empty IPI, which CPU1 will handle simply by setting its preempt flag, and subsequently scheduling on return from the IPI.
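For that lock-less per-CPU queue, I'm thinking of something along these lines: a multi-producer/single-consumer push list using C11 atomics (an untested sketch, with made-up names and an intrusive task structure assumed):

Code: Select all

#include <stdatomic.h>
#include <stddef.h>

struct task {
	struct task *next;	/* link used only while on the mailbox list */
	/* ... rest of the task state ... */
};

/* One mailbox per CPU: any CPU may push, only the owner drains. */
struct mailbox {
	_Atomic(struct task *) head;
};

/* Lock-free push of a newly runnable task. */
void mailbox_push(struct mailbox *mb, struct task *t)
{
	struct task *old = atomic_load_explicit(&mb->head, memory_order_relaxed);
	do {
		t->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&mb->head, &old, t,
			memory_order_release, memory_order_relaxed));
}

/* The owning CPU takes the whole list at once, then splices it into its run queue. */
struct task *mailbox_drain(struct mailbox *mb)
{
	return atomic_exchange_explicit(&mb->head, NULL, memory_order_acquire);
}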
xeyes wrote: But moving to SMP without much code change is a tall order even on the super SMP friendly x86 architecture.

Let me share a fascinating little example:

Code: Select all

    // v1, v2 and sum are int pointers into shared memory,
    // with *v1, *v2 and *sum all 0 before either thread starts
    // launch 2 threads with thread id 0 and 1; tmp is thread-local
    if(thread_id == 1)
    {
        while(1)
        {
            *v1 = 0;
            // your choice of barrier wait
            *v1 = 1;
            tmp = *v2;
            if(tmp == 1) atomic_increment(sum);
            // another barrier wait
        }
    }
    else
    {
        while(1)
        {
            *v2 = 0; *sum = 0;
            // your choice of barrier wait
            *v2 = 1;
            tmp = *v1;
            if(tmp == 1) atomic_increment(sum);
            // another barrier wait
            if (*sum == 0) printf("sum is 0 ?!\n");
        }
    }
Can sum ever be 0?

If you launch these threads on separate x86 CPUs or cores, it happens very often.

But a single CPU won't show this no matter how fast it preempts and switches between the 2 threads.

I learned this from https://preshing.com/20120515/memory-re ... n-the-act/ I didn't need the random delay, and it already happens very often.
Yeah, there is a reason I've not tackled SMP yet :)

Fascinating link, BTW. Another page on this blog caught my eye:

https://preshing.com/20111118/locks-are ... ention-is/

Which looks like it backs up my lock contention comment above. It very much encapsulates my view of optimisation: you have to measure before you optimise, not simply guess where the bottlenecks are.

I have this issue at work all the time: code reviews failing because I've "copied a string unnecessarily into dynamic memory, that's not optimal", even though the code in question will be run once, at initialisation. But the same code base is littered with painfully sub-optimal SQL queries, and sometimes a transaction might be composed of triple-digit numbers of separate SQL read queries (many of which are the same, just with different parameters!)
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: can IPI be cancelled?

Post by nullplan »

xeyes wrote:But moving to SMP without much code change is a tall order even on the super SMP friendly x86 architecture.
Which is why I would recommend going SMP as early as possible, since that minimizes the pain of transition. Start thinking about what resources are shared between tasks immediately when starting to write the kernel. And never, ever, use a "cli" in place of an actual mutex lock. And that should take care of most of the problem.
xeyes wrote:Can sum ever be 0?

If you launch these threads on separate x86 CPUs or cores, it happens very often.

But a single CPU won't show this no matter how fast it preempts and switches between the 2 threads.
Sure it can. If thread 1 never runs at all, or only runs until the barrier, then thread 2 will not see v1 being 1, will not increment sum, and will see sum being zero.

But precisely because parallel code is so conducive to headaches, you should probably keep it simple. Use locks for everything but the most simple of atomic operations. That is, just use locks except maybe for the odd shared counter. My "spurious interrupt" handler is to just increase a counter that can be queried later. That's about it.

I have read a lot about lockless algorithms, and I do find them fascinating, but they are really hard to do without a garbage collector, and I really have little use for singly-linked lists. And with most other data structures, the lockless way to do it seems to be to always create a copy of the data structure with the change included, then atomically update the reference to it if it didn't change. That seems a bit wasteful to me, when you could just add a lock, make everyone queue up, and get it over with.
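To be concrete, the spurious-interrupt counter I mentioned above amounts to no more than this (sketched with the C11 intrinsics; the handler name is made up):

Code: Select all

#include <stdatomic.h>

static atomic_ulong spurious_count;

/* spurious interrupt handler: just count the event, query the total later */
void spurious_isr(void)
{
	atomic_fetch_add_explicit(&spurious_count, 1, memory_order_relaxed);
}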
Carpe diem!
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

thewrongchristian wrote: The plan is for baby steps. Make it work first, then make it work fast (if needs be).
Agreed, that's the best way. I probably moved too fast in certain areas and left a bunch of "subtly broken" code paths that are now costing me more time to track down and fix :(
thewrongchristian wrote: In the SMP/scheduler case, it's not necessarily locks per se that kill performance, it's lock contention. I'd have to measure lock contention on the scheduler lock before making a decision on how to remedy any bottlenecks.

A separate run-queue per CPU is also about affinity, which has other benefits such as cache usage optimisation, so it might be that I have a run-queue per CPU, into which the main scheduler periodically deposits tasks. Such a queue might be accessed in a lock-less manner using atomics, if implemented as a simple linked list.

But I'd still have the remote resume issue you have even with a single run queue. The resumer on CPU0 still has to send an IPI to CPU1 for the resumee to be scheduled on CPU1. In my case, though, it'll just be an otherwise empty IPI, which CPU1 will handle simply by setting its preempt flag, and subsequently scheduling on return from the IPI.
Yes, there are many considerations here. In fact I think lock contention may not be a big issue, as the schedulers usually run very infrequently (like 60Hz), and it's not like we all have a few dual-EPYC servers sitting idle waiting for us to see how the scheduler code scales on 256 cores.

I thought about a really simple way at first: 1 runqueue, 1 central lock, for all CPUs to "randomly" pull tasks from. The resumer only needs to lock the queue and add the task to it. It would cause random migration without extra logic, so probably not very good for split caches, but there are no shenanigans like "IPI for resume".

The reason I said lock contention at 60Hz may not be a big deal is that I've seen sustained 10k+ resume IPIs per second per CPU. There's little doubt that taking that many interrupts, with the associated CPU state flushing and register saving, is way worse than waiting for a central lock at 60Hz, which is also why I wanted to cancel IPIs in the first place. But perhaps it's a bad case either way, as it might then translate to all the CPUs trying to grab that central lock 10k times per second.
thewrongchristian wrote: Yeah, there is a reason I've not tackled SMP yet :)

Fascinating link, BTW. Another page on this blog caught my eye:

https://preshing.com/20111118/locks-are ... ention-is/
Yes, a very interesting blog to read. Useful as well: the example I pasted might look very contrived, but I found that page because I was debugging a very similar issue.
thewrongchristian wrote: code reviews
One of the most controversial pieces of process for sure :wink:
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

nullplan wrote: Which is why I would recommend going SMP as early as possible, since that minimizes the pain of transition. Start thinking about what resources are shared between tasks immediately when starting to write the kernel. And never, ever, use a "cli" in place of an actual mutex lock. And that should take care of most of the problem.
Right now most of my issues are "cross functional", like a thread that holds a few locks and then decides to sleep #-o

Apparently that wasn't the original intention, but a byproduct of situations like "a few layers down, it needs to wait for the hard disk".

Both routes work though; maybe it's just a choice of what interests you most?

I can't imagine many things that definitely need MP; it's mostly a throughput or performance enhancement.

An SP kernel would do fine for a long time as well. Some parts might need an overhaul when adding MP, but those prices are paid along the way of MP development anyway?

That's also my excuse for not touching 64-bit yet :lol:
nullplan wrote:
xeyes wrote:Can sum ever be 0?

If you launch these threads on separate x86 CPUs or cores, it happens very often.

But a single CPU won't show this no matter how fast it preempts and switches between the 2 threads.
Sure it can. If thread 1 never runs at all, or only runs until the barrier, then thread 2 will not see v1 being 1, will not increment sum, and will see sum being zero.
I probably included too much 'supporting' code and wasn't clear in the comments.

The barrier waits are between the 2 threads. Without both reaching the same point and ready to continue, neither can continue.

So the gist is: these 2 pieces start at around the same time, with sum, v1 and v2 all cleared to 0 before either can start. Can the 2 tmps both be 0 after both of them finish? The atomic increments are only used for communication between the threads, to enable easy detection of both tmps being 0.

Code: Select all

*v1 = 1;
tmp = *v2;

Code: Select all

*v2 = 1;
tmp = *v1;
:D If you believe that there's causality in this universe and time only flows in one direction, sum == 0 shouldn't have happened.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: can IPI be cancelled?

Post by nullplan »

xeyes wrote:Right now most of my issues are "cross functional", like a thread that holds a few locks and then decides to sleep #-o

Apparently that wasn't the original intention, but a byproduct of situations like "a few layers down, it needs to wait for the hard disk".
I see. Yeah, that will be fun once I get to file system code.
xeyes wrote:I can't imagine many things that definitely need MP; it's mostly a throughput or performance enhancement.
Mostly, yes. However, consider that single-threaded performance has been mostly stagnating for the last decade. We will likely see more and more cores, and only using a single one means only using a smaller and smaller fraction of the available hardware.
xeyes wrote:That's also my excuse for not touching 64-bit yet :lol:
Your choice. I simply started my kernel as a 64-bit one. I have no intent of ever supporting 32-bit mode for anything other than legacy booting.
xeyes wrote:I probably included too much 'supporting' code and wasn't clear in the comments.

The barrier waits are between the 2 threads. Without both reaching the same point and ready to continue, neither can continue.
[...]
:D If you believe that there's causality in this universe and time only flows in one direction, sum == 0 shouldn't have happened.
Having now read the link you provided, I understand a bit better. In that case, you have undefined behavior due to a data race on v1 and v2. According to C11's memory model, a data race occurs when multiple threads access the same variable unsynchronized and at least one of those accesses is a write. This is the case for v1 and v2 here, which are both written by one thread and read by the other. To resolve the data race, you would have to make the store and the load atomic. And the compiler-intrinsic atomic load and store probably contain the required memory barriers.

Which is also why you really ought to either read up on memory ordering or - my preference - use compiler intrinsics for atomic operations, since those already contain the required barriers. I always find it hard to reason about barriers, since sequentially, they don't do anything. It's a similar problem I have with cache management instructions (cache is supposed to be transparent).

So anyway, I tried the test program presented in that link and got the following output:

Code: Select all

1 reorders detected after 608520 iterations
2 reorders detected after 4988844 iterations
3 reorders detected after 11804037 iterations
So I guess my Core i7 is not as reorder-happy as the blog author's Core 2 Duo. However, with atomic instructions in place, i.e. the "transaction" part replaced with

Code: Select all

        __atomic_store_n(&X, 1, __ATOMIC_SEQ_CST);
        r1 = __atomic_load_n(&Y, __ATOMIC_SEQ_CST);
in thread 1, and similar for thread 2, the compiler just inserts the mfence instruction where the blog author originally put it. And indeed, running the program now for a few minutes fails to produce any output. Since two different memory locations are involved here, sequential consistency is the required order, as that is the only ordering that synchronizes with operations on different addresses.

I do find it telling that the guy chose to "solve" the problem first by restricting process CPU affinity. Instead of fixing the bug (the data race in this case), he chose to make the whole program worse. That's like integrating a garbage collector into your program because you can't figure out a memory allocation bug. It's just shoddy craftsmanship. If a bridge builder added a pillar to an otherwise free-hanging bridge because he couldn't figure out the statics, we'd call him raving mad, yet here we are, essentially doing the same.
Carpe diem!
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: can IPI be cancelled?

Post by xeyes »

nullplan wrote:We will likely see more and more cores, and only using a single one means only using a smaller and smaller fraction of the available hardware.
Things could also head a different way. Did you know, 10 years ago, that a normal desktop would have dozens of CPUs (cores) these days?

I also see 2 big issues with the more and more cores approach.

Firstly, it is very hard to write MP programs that are 100% correct, and much, much harder to write MP programs that are 100% correct and scale well beyond a dozen cores. The hardware side is also challenged by scalability issues; this is probably why many high-core-count CPUs these days, even the consumer ones, are "NUMA in one socket" designs.

Secondly, not all workloads scale well. Can you list 10 things you do at least once per month on your computer that scale well with MP? As in, when run on a 36-core machine they are more than 30x faster than on an SP machine, or 40~50x faster if hyper-threading is enabled.
nullplan wrote:I have no intent of ever supporting 32-bit mode for anything other than legacy booting.
You don't want to play with v86 mode and get a taste of hypervisor/VMM before your kernel can be a true hypervisor/VMM that boots mainstream modern OSes?
nullplan wrote:So I guess my Core i7 is not as reorder-happy as the blog author's Core 2 Duo.
It happens more frequently for me on real hardware than in QEMU; I can get 10% of the iterations to show this. An i7 should actually be a lot more aggressive. You didn't think it is faster than a Core 2 only because of Intel's "industry leading 10nm technology", right?

But maybe my barrier wait has subtle bugs, so that while it keeps the iterations in sync, it somehow also allows one of the threads to sneak through from time to time.

I would certainly expect it to hit less frequently under a host OS, but it should still happen, as the host OS can do many things but can't decide how an x86 CPU works.
nullplan wrote:my preference - use compiler intrinsics for atomic operations
nullplan wrote: integrating a garbage collector into your program because you can't figure out a memory allocation bug.
Do you see any parallel between "I don't want to deal with the details of memory management so I use GC" and "I don't want to deal with the details of memory ordering so I target the high-level C++ machine"?
nullplan wrote:I do find it telling that the guy chose to "solve" the problem first by restricting process CPU affinity. Instead of fixing the bug (the data race in this case), he chose to make the whole program worse.
I'm more than a bit surprised that what you get out of that post is something like "the guy doesn't know how to set two variables both to 1 and has bugs in such a simple program".

Perhaps you can also write (or find) a similarly interesting "bug" to share?