Opinions on scheduler (C++ coroutines and NodeJS)

startrail
Posts: 12
Joined: Tue Jan 05, 2016 9:10 am

Post by startrail »

I am professionally a web developer (ewwwww) and quite like JS' async/await and its event loop. I am modelling my scheduler with the NodeJS event loop as inspiration. I also use C++23 coroutines heavily. So far it has been working quite well.

Say I want to read from an AHCI device. The flow goes roughly like this:

Code: Select all

device::readFile(...) {
    // Control will be yielded to the caller as soon as co_await happens
    auto result = co_await Command(this, freeSlot);
}

Command::setResult(bool result) noexcept {
    // Called within the context of IRQ handler
    scheduler.queueEvent(<whoever issued the AHCI command>);
}
    
Command::await_resume() {
    // IRETQ has already been executed; this runs in the context of the scheduler event loop
    return commandResultToTheAwaiter;
}
And this is happening for key events as well (IRQ handled on a different CPU).

Currently my scheduler runs on only 1 CPU which listens to the HPET IRQ and does this

Code: Select all

void Kernel::Scheduler::timerLoop() {
	// Dispatch events synchronously while the queue is non-empty, up to SCHEDULER_EVENT_DISPATCH_LIMIT events per loop
	// FIXME: should lock the event queue once all CPUs get timer broadcast IRQ
	size_t dispatchedEventsCount = 0;
	while (!eventQueue.empty() && dispatchedEventsCount < SCHEDULER_EVENT_DISPATCH_LIMIT) {
		std::coroutine_handle<> x = eventQueue.front();
		if (x && !x.done()) {
			x.resume();
		}
		eventQueue.pop();
		++dispatchedEventsCount;
	}

	// TODO: do rest of scheduling/context switching
}
I am going to make the timer IRQ broadcast to all CPUs. Linux, according to my current understanding, maintains a task queue for each CPU. I however don't want to do that and think the current model of centralized task and event queue should work fine.

Do you guys foresee any problems I might run into?
One problem I can think of is CPU affinity but even that can be navigated (I guess).
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

startrail wrote:I am professionally a web developer (ewwwww) and quite like JS' async/await and its event loop. I am modelling my scheduler with the NodeJS event loop as inspiration. I also use C++23 coroutines heavily. So far it has been working quite well.
Async/await is only one function of the scheduler. You also need critical sections and semaphores.

Generally, the scheduler has two quite different functions: real-time handling of events, which happens in the context of some thread, and longer-term decisions on where to run threads (which CPU core), which use their own thread. So there is no "event loop" in the real-time handling; rather, this should happen in the context of the current thread.
startrail wrote: I am going to make the timer IRQ broadcast to all CPUs. Linux, according to my current understanding, maintains a task queue for each CPU. I however don't want to do that and think the current model of centralized task and event queue should work fine.
Multicore operation becomes extremely complicated if you don't have task queues per CPU core. It also scales poorly, since real-time events need to use system-wide spinlocks. For local task queues, spinlocks are not needed since only one CPU core uses each queue.

I still have a transient global task queue, but running tasks are assigned to a particular CPU core and are kept in that core's task queue.
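A minimal sketch of that arrangement, under assumptions: `RunQueues`, `TaskId`, and the least-loaded placement policy are hypothetical and not taken from any real kernel. New tasks pass through a transient global queue, a balancing step assigns them to a core, and each core then pulls only from its own queue.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

using TaskId = int;  // hypothetical task identifier

class RunQueues {
public:
    // cpuCount is assumed to be at least 1.
    explicit RunQueues(std::size_t cpuCount) : local_(cpuCount) {}

    // New tasks enter the transient global queue (this part would need a lock).
    void submit(TaskId t) { global_.push_back(t); }

    // Longer-term decision: move each global task to the least-loaded core.
    void balance() {
        while (!global_.empty()) {
            std::size_t best = 0;
            for (std::size_t cpu = 1; cpu < local_.size(); ++cpu)
                if (local_[cpu].size() < local_[best].size()) best = cpu;
            local_[best].push_back(global_.front());
            global_.pop_front();
        }
    }

    // Fast path: a core touches only its own queue.
    std::optional<TaskId> next(std::size_t cpu) {
        if (local_[cpu].empty()) return std::nullopt;
        TaskId t = local_[cpu].front();
        local_[cpu].pop_front();
        return t;
    }

private:
    std::deque<TaskId> global_;
    std::vector<std::deque<TaskId>> local_;
};
```

With two cores and tasks 1, 2, 3 submitted in order, `balance()` places 1 and 3 on core 0 and 2 on core 1, because ties go to the lowest-numbered core.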
Octocontrabass
Member
Posts: 5513
Joined: Mon Mar 25, 2013 7:01 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by Octocontrabass »

startrail wrote:I am going to make the timer IRQ broadcast to all CPUs.
This is terrible for battery life. You'll be waking CPUs that have no work to do, just to put them back to sleep.
startrail wrote:Linux, according to my current understanding, maintains a task queue for each CPU. I however don't want to do that and think the current model of centralized task and event queue should work fine.
Linux has to scale to machines with hundreds of CPUs. A single queue that can only be accessed by one CPU at a time doesn't scale, and it especially doesn't scale when every CPU is trying to access it at the same time.
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

I still think the event handling concept is useful; however, it is not the scheduler that runs the "event loop" but the server/device. The scheduler needs to implement some mechanism that allows a device to implement the event loop (without busy-polling) and allows IRQs to signal events. This is a very common scenario in the kernel. Most device drivers can be implemented with a handler thread and an IRQ.

I implement event loops with the Signal/WaitForSignal pair. Each thread control block has a boolean "signalled" value. The Signal function sets the signalled indicator and wakes up the thread if it is blocked. The WaitForSignal function implements the waiting half of the event loop: if the signalled indicator is clear, it blocks; otherwise it clears the indicator and returns. The device calls WaitForSignal in a loop and handles events after each call.
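The Signal/WaitForSignal pair can be approximated in hosted C++ as below. This is a sketch under assumptions, not the RDOS implementation: `ThreadSignal` and `demo` are hypothetical names, and a mutex plus condition variable stands in for the kernel's block/wake mechanism.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

class ThreadSignal {
public:
    // Signal: set the indicator and wake the thread if it is blocked.
    void signal() {
        {
            std::lock_guard<std::mutex> guard(m_);
            signalled_ = true;
        }
        cv_.notify_one();
    }

    // WaitForSignal: block while the indicator is clear, then clear it.
    void waitForSignal() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return signalled_; });
        signalled_ = false;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signalled_ = false;
};

// A handler thread processes one event per signal without busy-polling.
int demo(int rounds) {
    ThreadSignal toHandler, toProducer;
    int handled = 0;
    std::thread handler([&] {
        for (int i = 0; i < rounds; ++i) {
            toHandler.waitForSignal();
            ++handled;            // "handle events" after each wakeup
            toProducer.signal();  // acknowledgement keeps the demo deterministic
        }
    });
    for (int i = 0; i < rounds; ++i) {
        toHandler.signal();       // stand-in for an IRQ
        toProducer.waitForSignal();
    }
    handler.join();
    return handled;
}
```

Because the indicator is a boolean, several signals that arrive before the next wait collapse into one wakeup, so the handler has to drain all pending work each time it runs.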

If nested interrupts are allowed, then a scheduler lock counter needs to be kept that tracks IRQ nesting. Threads are woken up when the nesting count goes to zero. A list of pending threads to wake up is kept per CPU core. On multicore, the IRQ might not happen on the core where the thread is blocked, and then an IPI needs to be sent to the core where the thread is blocked. To avoid this, the scheduler keeps track of connections between IRQs and server threads, and when it moves a thread it reroutes the IRQs connected to that thread as well.
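The nesting rule can be sketched as follows, assuming one instance per CPU core. `IrqWakeups` and `ThreadId` are hypothetical names, and a real kernel would likely use a fixed-size list rather than `std::vector`, which may allocate.

```cpp
#include <vector>

using ThreadId = int;  // hypothetical thread identifier

class IrqWakeups {
public:
    // Called on IRQ entry: one more level of nesting.
    void enterIrq() noexcept { ++nesting_; }

    // Called by a handler that wants a thread woken; the wakeup is deferred.
    void requestWake(ThreadId t) { pending_.push_back(t); }

    // Called on IRQ exit. Returns the threads to wake, which is empty
    // unless this exit leaves the outermost IRQ.
    std::vector<ThreadId> leaveIrq() {
        std::vector<ThreadId> ready;
        if (--nesting_ == 0) ready.swap(pending_);
        return ready;
    }

private:
    unsigned nesting_ = 0;
    std::vector<ThreadId> pending_;
};
```

With two nested IRQs that each request a wakeup, the inner exit returns nothing and the outer exit returns both threads in request order.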

The HPET is a different thing. It can be used for keeping track of time or implementing timers. These functions are not suitable for event loops. Timers and timeouts should be general functions of the scheduler that device drivers can use for whatever needs they have. These functions can also be implemented in different ways based on available hardware (HPET, PIT, RTC, APIC timer).
AndrewAPrice
Member
Posts: 2299
Joined: Mon Jun 05, 2006 11:00 pm
Location: USA (and Australia)

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by AndrewAPrice »

I love fibers.

Any time in my OS I hit a locked mutex, or call an RPC synchronously (which includes the underlying implementation behind fread, etc.), I switch to the next running fiber, only sleeping the thread if there are no awake fibers to run.

They can be implemented completely in user space. Here's my context switching code: fibers.asm. Fibers are cooperatively interrupted so you only have to save the callee-saved registers. My fiber data structures live in an object pool so it's super fast to create and destroy them.
Here's the C++ code for my fibers: fibers.cc fibers.h

And my user-space scheduler: scheduler scheduler.h

Since I'm in a microkernel environment, communication happens through both RPCs being sent out (e.g. file IO) and RPCs coming in (e.g. window manager says mouse moved over my window), and if my scheduler sees there are no awake fibers, it sleeps the thread until an incoming message from the kernel (RPCs, interrupts, timer events), which then creates a fiber that calls the handler.

My scheduler offers a few options, such as HandOverControl(), which sleeps the main fiber and never returns unless the user calls TerminateProcess(). HandOverControl() is intended to be called at the end of main(). So for example, in main() you'd create your UI windows, set up some handlers for incoming messages, then call HandOverControl(), which keeps sleeping and waiting until there are incoming messages and fibers to run, then sleeps again. Most applications on my OS should call HandOverControl() unless they're command line utilities that do one thing then quit.

I also provide some other options, such as FinishAnyPendingWork(), which returns rather than sleeps if there's nothing to do immediately. For example, you'd call this from a game loop that you want continuously running.

Code: Select all

// Sleeps the current fiber - running any other fibers or handling incoming messages - until the duration has passed.
SleepForDuration(std::chrono::seconds(1));

// Creates a fiber that wakes up after the given time, then destroys itself once it's finished running.
AfterDuration(std::chrono::seconds(2), [&]() {
    // Do something
  });
I'm thinking of providing a function similar to std::async that allows the programmer to create a future and explicitly state whether its resolver runs in an async_thread, async_fiber, defer_thread, or defer_fiber, because it would be super awesome if I could do something like:

Code: Select all

// Kick off work:

std::future<Result> future_result_1 = Defer(std::launch::async_thread, [&]() {
 // Code that runs pre-emptive multitasking in its own thread.
});
std::future<Result> future_result_2 = Defer(std::launch::async_fiber, [&]() {
 // Code that runs cooperatively in the same thread but its own fiber.
});
std::future<Result> future_result_3 = Defer(std::launch::async_fiber, [&]() {
 // Code that runs cooperatively in the same thread but its own fiber.
});

Result result_1 = future_result_1.get();
Result result_2 = future_result_2.get();
Result result_3 = future_result_3.get();
My OS is Perception.
bellezzasolo
Member
Posts: 110
Joined: Sun Feb 20, 2011 2:01 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by bellezzasolo »

AndrewAPrice wrote:I love fibers.
I'm thinking of going this way, my threading code is flaky and needs an overhaul.

Currently I have the fairly standard pattern:

Code: Select all

static uint8_t xhci_interrupt(size_t vector, void* info)
{
	XHCI* inf = reinterpret_cast<XHCI*>(info);
	if (inf->event_available)
	{
		signal_semaphore(inf->event_available, 1);
	}
	if(!inf->interrupt_msi)
		inf->Operational.USBSTS.EINT = 1;
	inf->Runtime.Interrupter(0).IMAN.InterruptPending = 1;
	inf->Runtime.Interrupter(0).IMAN.InterruptEnable = 1;
	return 1;
}

void eventThread()
{
	while (1)
	{
		wait_semaphore(event_available, 1, TIMEOUT_INFINITY);
		...
	}
}
co_await would be lovely to work with, although it doesn't really work in the interrupt context. Really I should be prepared to do context switches after all manner of interrupts to execute the deferred handler more expediently.

I'd want to make sure that the C++ niceness was running on a C compatible ABI (like COM) to avoid interop issues, though, even if realistically I'm the only one who's going to be writing drivers for my OS.
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS
qookie
Member
Posts: 72
Joined: Sun Apr 30, 2017 12:16 pm
Libera.chat IRC: qookie
Location: Poland

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by qookie »

Managarm makes extensive use of C++20 coroutines, both in user-space servers and in the kernel.

In user-space, it is used both to asynchronously perform IPC, and for example to handle IRQs in drivers. As for the aforementioned XHCI example, the managarm driver does the following:

Code: Select all

async::detached Controller::handleIrqs() {
	uint64_t sequence = 0;

	while(1) {
		auto await = co_await helix_ng::awaitEvent(_irq, sequence);
		HEL_CHECK(await.error());
		sequence = await.sequence();

		// ...

		HEL_CHECK(helAcknowledgeIrq(_irq.getHandle(), kHelAckAcknowledge, sequence));

		_eventRing.processRing();
	}
}
(from https://github.com/managarm/managarm/bl ... #L306-L327)

In the kernel, they are used to asynchronously complete work started in a system call, and also for IRQ logic in one place. For example, "helSubmitProtectMemory" does:

Code: Select all

	[](smarter::shared_ptr<AddressSpace, BindableHandle> space,
			smarter::shared_ptr<IpcQueue> queue,
			VirtualAddr pointer, size_t length,
			uint32_t protectFlags, uintptr_t context,
			enable_detached_coroutine = {}) -> void {
		auto outcome = co_await space->protect(pointer, length, protectFlags);
		// TODO: handle errors after propagating them through VirtualSpace::protect.
		assert(outcome);

		HelSimpleResult helResult{.error = kHelErrNone};
		QueueSource ipcSource{&helResult, sizeof(HelSimpleResult), nullptr};
		co_await queue->submit(&ipcSource, context);
	}(std::move(space), std::move(queue), reinterpret_cast<VirtualAddr>(pointer),
			length, protectFlags, context);
(from https://github.com/managarm/managarm/bl ... #L895-L908)

Another example, of IRQ handling in the kernel, is for the dmalog device used for a gdbstub for servers:

Code: Select all

		// ...
		bool inIrq = false, outIrq = false;
		while (true) {
			irqSeq_ = co_await irqEvent_.async_wait(irqSeq_);

			// Schedule on the work queue in order to return from the IRQ handler
			co_await WorkQueue::generalQueue()->schedule();

			// ...
		}
		// ...

	// Called within the IRQ context by the IRQ plumbing when one arrives
	IrqStatus raise() override {
		// ...

		if (/* ... */) {
			// ...
			irqEvent_.raise();

			return IrqStatus::acked;
		} else {
			return IrqStatus::nacked;
		}
	}
(from https://github.com/managarm/managarm/bl ... dmalog.cpp)

For the record, a detached coroutine here is a coroutine that runs independently of other coroutines, and does not produce a value, and as such can't be awaited, and can be started from outside a coroutine (like in the "helSubmitProtectMemory" example).
Working on managarm.
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

I want full control of IRQs, and using C, C++, exception handling and co-routines in IRQs seems like a nightmare to me.
ArsenArsen
Posts: 9
Joined: Wed Aug 05, 2020 3:38 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by ArsenArsen »

rdos wrote:I want full control of IRQs, and using C, C++, exception handling and co-routines in IRQs seems like a nightmare to me.
you can disable exceptions both locally and globally.

I'm unsure what the rest of your post is referring to - all code that you don't put in the IRQ handler does not end up in the IRQ handler.
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

ArsenArsen wrote:
rdos wrote:I want full control of IRQs, and using C, C++, exception handling and co-routines in IRQs seems like a nightmare to me.
you can disable exceptions both locally and globally.

I'm unsure what the rest of your post is referring to - all code that you don't put in the IRQ handler does not end up in the IRQ handler.
Well, you typically want to enable & disable interrupts in IRQs, you might need spin-locks, and AFAIK, C doesn't support any of that without various hacks. If you run IRQs with interrupts disabled, your system will get awful interrupt latency, and if you enable interrupts, you need to prevent the scheduler from switching threads. Of course, you cannot access anything that might block. Which means that you need to know what your generated code is doing.
Octocontrabass
Member
Posts: 5513
Joined: Mon Mar 25, 2013 7:01 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by Octocontrabass »

rdos wrote:Well, you typically want to enable & disable interrupts in IRQs, you might need spin-locks, and AFAIK, C doesn't support any of that without various hacks.
If you really want to avoid using non-standard inline assembly extensions, you can call a separate function written entirely in assembly. Linking C and assembly together is not a hack.

Spin locks can be implemented using atomic_flag from <stdatomic.h>, which has been part of C since C11.
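For illustration, the same idea written with the C++ counterpart, `std::atomic_flag` from `<atomic>`. The `SpinLock` class and `demo` function are hypothetical names, and the threads exist only to exercise the lock in a hosted environment. The sketch does not touch the interrupt flag, which neither standard C nor standard C++ can express.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SpinLock {
public:
    void lock() noexcept {
        // test_and_set returns the previous value: true means already held.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Spin. A kernel would typically add a pause/yield hint here.
        }
    }
    void unlock() noexcept { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Several threads increment a plain counter under the lock.
long demo(int threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```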
rdos wrote:Of course, you cannot access anything that might block. Which means that you need to know what your generated code is doing.
Why would you need to look at the generated code to know whether you're trying to do something that might block?
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

Octocontrabass wrote:
rdos wrote:Well, you typically want to enable & disable interrupts in IRQs, you might need spin-locks, and AFAIK, C doesn't support any of that without various hacks.
If you really want to avoid using non-standard inline assembly extensions, you can call a separate function written entirely in assembly. Linking C and assembly together is not a hack.

Spin locks can be implemented using atomic_flag from <stdatomic.h>, which has been part of C since C11.
I prefer to code this in assembly, not in C. There is no advantage in having IRQ handlers or spinlocks in C. They will be cluttered by stuff that makes them less readable than pure assembly code.

My opinion is that if you cannot write C code without using tricks or fancy constructs, write it in assembly instead. Atomic variables or variables where the optimizer is not allowed to remove references are good examples of code that is better done in assembly. Simply because the assembler won't try to remove constructs it doesn't find useful.
Octocontrabass wrote:
rdos wrote:Of course, you cannot access anything that might block. Which means that you need to know what your generated code is doing.
Why would you need to look at the generated code to know whether you're trying to do something that might block?
The discussion was more about C++, where you never know the side-effects of even trivial code. C without exception handling is more appropriate, as then you know the side effects of the code without needing to look at the generated code.

In fact, in the C based drivers I have, I decided that the syscall interface is better done in assembly, and then I would define C based handler procedures with a specific register call interface. That works well for complicated drivers like hid, audio codecs and font utilities.
nullplan
Member
Posts: 1769
Joined: Wed Aug 30, 2017 8:24 am

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by nullplan »

Octocontrabass wrote:Spin locks can be implemented using atomic_flag from <stdatomic.h>, which has been part of C since C11.
Well yes, but also no. In order for spinlocks to be useful, they really need to disable interrupts before taking the spinlock (and restore the prior interrupt state after release). Otherwise it is possible to take a spinlock, be interrupted, and have the interrupt handler try to take the same spinlock, deadlocking the kernel. And it is not possible to disable interrupts in C.

A variant that does not disable interrupts is possible if the spinlock is never shared with interrupt handlers, but that is an optimization and not the rule.

I prefer making small building blocks in assembler, with a portable interface. E.g. interface:

Code: Select all

unsigned long a_irqdisable(void);
void a_irqrestore(unsigned long);
int a_swap(volatile int *, int);
void a_store(volatile int *, int);
void a_spin(void);
void a_exit_spin(void);
AMD64:

Code: Select all

a_irqdisable:
  pushfq
  popq %rax
  retq
a_irqrestore:
  pushq %rdi
  popfq
  retq
a_swap:
  lock xchgl %esi, (%rdi)
  movl %esi, %eax
  retq
a_store:
  xorl %eax, %eax
  movl %esi, (%rdi)
  lock cmpxchgl %eax, (%rsp) # prevent processor-side load-reorders across the store instruction above
  retq
a_spin:
  pause
  retq
a_exit_spin:
  retq
PowerPC:

Code: Select all

a_irqdisable:
  mfmsr 3
  rlwinm 4,3,0,~MSR_EE
# skip mtmsr instruction if possible. It is slow.
  cmplw 4, 3
  beq 1f
  mtmsr 4
1: blr
a_irqrestore:
  andi. 0, 3, MSR_EE
  beq 1f
  mtmsr 3
1: blr
a_swap:
  sync
  lwarx 5, 0, 3
  stwcx. 4, 0, 3
  bne- a_swap
  isync
  mr 3, 5
  blr
a_store:
  sync
  stw 4, 0(3)
  sync
  blr
a_spin:
  or 4, 4, 4
  blr
a_exit_spin:
  or 2, 2, 2
  blr
And then it is possible to use those again to build the spinlock in C. Using external functions rather than inline asm has the benefit of creating a well-defined ABI boundary, rather than whatever inline assembler is doing. Yes, it is nice that the compiler can inline and reorder this stuff, but getting the constraints (especially the clobbers) just right is not a thing I want to waste time on. Anyway, the functions are small and easy to verify.

The spinlock code could then be something like

Code: Select all

typedef volatile int spinlock_t;
unsigned long spinlock_irqsave(spinlock_t *lock)
{
  unsigned long flg = a_irqdisable();
  while (a_swap(lock, 1)) a_spin();
  a_exit_spin();
  return flg;
}

void spinunlock_irqrestore(spinlock_t * lock, unsigned long flg)
{
  a_store(lock, 0);
  a_irqrestore(flg);
}

C11 atomics also have the significant drawback of utilizing the C memory model. Which is fine if you want to tune it all for the best performance, but also easy to get wrong. I tend to write my atomics simply with a full memory barrier, as that is way easier to understand. Might not perform as well, but as stated in the past I take readable and understandable code that works over fast code that fails sometimes any day of the week, and twice on Sundays.
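As a hedged illustration of why spinlock_irqsave() returns the previous state, the sketch below wraps the same pairing in a C++ scope guard and runs it against simulated primitives. Everything here is an assumption made for a hosted demo: `g_irqEnabled` fakes the CPU interrupt flag, `a_irqdisable`/`a_irqrestore` are stand-ins rather than the assembly routines, and `IrqSpinGuard` is a hypothetical name. The atomics use the default sequentially consistent ordering, in line with the full-barrier preference above.

```cpp
#include <atomic>

// Simulated interrupt flag (stand-in for the real CPU state).
static bool g_irqEnabled = true;

// Simulated primitive: disable interrupts, return the previous state.
unsigned long a_irqdisable() {
    unsigned long old = g_irqEnabled ? 1UL : 0UL;
    g_irqEnabled = false;
    return old;
}

// Simulated primitive: restore a previously saved state.
void a_irqrestore(unsigned long flags) { g_irqEnabled = (flags != 0); }

using spinlock_t = std::atomic<int>;

class IrqSpinGuard {
public:
    explicit IrqSpinGuard(spinlock_t& lock)
        : lock_(lock), flags_(a_irqdisable()) {  // interrupts off first
        while (lock_.exchange(1) != 0) {
            // Spin until the previous owner stores 0.
        }
    }
    ~IrqSpinGuard() {
        lock_.store(0);        // release the lock
        a_irqrestore(flags_);  // then restore the saved interrupt state
    }
    IrqSpinGuard(const IrqSpinGuard&) = delete;
    IrqSpinGuard& operator=(const IrqSpinGuard&) = delete;

private:
    spinlock_t& lock_;
    unsigned long flags_;
};

// Nested guards: releasing the inner lock must not re-enable interrupts.
int demo() {
    g_irqEnabled = true;
    spinlock_t a{0}, b{0};
    int observed = 0;
    {
        IrqSpinGuard outer(a);
        {
            IrqSpinGuard inner(b);
        }
        if (!g_irqEnabled) observed |= 1;  // still disabled inside outer
    }
    if (g_irqEnabled) observed |= 2;       // restored after outer
    return observed;
}
```

If the release path enabled interrupts unconditionally instead of restoring the saved value, the first check in `demo()` would fail, because the inner release would switch interrupts back on while the outer lock is still held.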
Carpe diem!
Octocontrabass
Member
Posts: 5513
Joined: Mon Mar 25, 2013 7:01 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by Octocontrabass »

nullplan wrote:And it is not possible to disable interrupts in C.
Right, thus the external assembly function (or inline assembly "hack").
nullplan wrote:I prefer making small building blocks in assembler, with a portable interface.
And that's a valid choice. I'm not saying you have to use stdatomic.h, just that using it is not a hack.
nullplan wrote:C11 atomics also have the significant drawback of utilizing the C memory model. Which is fine if you want to tune it all for the best performance, but also easy to get wrong. I tend to write my atomics simply with a full memory barrier, as that is way easier to understand. Might not perform as well, but as stated in the past I take readable and understandable code that works over fast code that fails sometimes any day of the week, and twice on Sundays.
The C memory model defaults to full memory barriers on all atomic accesses, including accesses outside atomic_function() calls. Relaxing the memory order is optional.
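A small sketch of that point using the C++ equivalents in `<atomic>`, where the same defaults apply. The function names `handOff`, `withDefaults`, and `withAcquireRelease` are hypothetical, and the thread is there only to make the hand-off observable.

```cpp
#include <atomic>
#include <thread>

// Publishes a plain int through an atomic flag using the given orders.
int handOff(std::memory_order storeOrder, std::memory_order loadOrder) {
    std::atomic<bool> ready{false};
    int payload = 0;
    std::thread producer([&] {
        payload = 42;                   // plain write
        ready.store(true, storeOrder);  // publish
    });
    while (!ready.load(loadOrder)) {
        std::this_thread::yield();      // wait for the publish
    }
    int seen = payload;                 // ordered after the flag read
    producer.join();
    return seen;
}

// What you get when no order is written: store(x) and load() without an
// argument, as well as plain assignment and reads, are seq_cst.
int withDefaults() {
    return handOff(std::memory_order_seq_cst, std::memory_order_seq_cst);
}

// The optional relaxation: still correct for this hand-off, weaker globally.
int withAcquireRelease() {
    return handOff(std::memory_order_release, std::memory_order_acquire);
}
```

Both variants return 42. The difference only shows up in how much reordering the compiler and CPU are allowed around the atomic accesses.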
rdos
Member
Posts: 3279
Joined: Wed Oct 01, 2008 1:55 pm

Re: Opinions on scheduler (C++ coroutines and NodeJS)

Post by rdos »

nullplan wrote:
I prefer making small building blocks in assembler, with a portable interface.
I think I would prefer to do the complete spinlock in assembly. It's more compact and the code is pretty short.

Besides, your x64 disable procedure seems to lack "cli", and you don't need lock for xchg on x86.