Optimal multi-logical-processor scheduling

Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
Colonel Kernel wrote:
Brendan wrote:Show me how this can be improved by "centralizing the RFO in the kernel"....
That's easy. By putting it in the kernel, you don't have to entrust programmers who are far less talented and experienced than you to write that kind of code ever again. Therefore, it is improved. ;)

A library would work too... As long as you're providing some kind of abstraction that doesn't force app developers to twiddle with counters in shared memory. I guarantee you that a significant portion of them will mess it up.
I guarantee you that a significant portion of them will still mess it up - e.g. calling your function much more than necessary and/or using it in an inappropriate way.

Then there's the problem that all forms of abstraction have - inefficiency caused by a mismatch between intended usage and desired usage. For example, if I were using POSIX threads to do the same as the example above then I'd use "pthread_join()", but in that case there'd be a main thread that would be constantly blocking and restarting while it waits for each of the other threads to finish (and the overhead/inefficiency caused by the additional thread switches).
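(For concreteness, that pthread_join() pattern looks something like this - a minimal sketch with a hypothetical doSomeWork() and a fixed thread count:)

Code: Select all

#include <pthread.h>

#define NUM_WORKERS 4

/* hypothetical worker function - stands in for the real work */
static void *doSomeWork(void *arg)
{
	/* ... this thread's share of the work ... */
	return NULL;
}

int main(void)
{
	pthread_t workers[NUM_WORKERS];
	int i;

	for (i = 0; i < NUM_WORKERS; i++)
		pthread_create(&workers[i], NULL, doSomeWork, NULL);

	/* The main thread blocks and wakes once per worker - the extra
	   thread switches described above. */
	for (i = 0; i < NUM_WORKERS; i++)
		pthread_join(workers[i], NULL);

	/* ... combine the results ... */
	return 0;
}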

Of course there's a huge number of different things programmers can do that will hurt performance in some way. An OS can't stop programmers from writing bad code without becoming a bloated mess itself.

Out of curiosity, what sort of features are you planning to support in your kernel to assist with profiling and performance monitoring?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimal multi-logical-processor scheduling

Post by bewing »

Colonel Kernel wrote: That's easy. By putting it in the kernel, you don't have to entrust programmers who are far less talented and experienced than you to write that kind of code ever again. Therefore, it is improved. ;)
Plus, if your kernel is running on one and only one CPU, then all those lock sub statements happen on just one CPU. So there aren't any RFOs on those statements at all.

-- Let me rephrase that. The thrust of my argument is that all RFOs are not created equal. There is a performance penalty on RFOs that is related to the total number of CPUs that are trying to modify one particular memory address. In fact, after thinking about it, I am pretty sure that in the worst case the performance penalty is n*(n-1) -- or, in other words O(n^2).

IPC between two CPUs creates RFOs, of course. However, those RFOs are only between two CPUs. Therefore, the performance penalty is O(1). A lock like the one Brendan posted is very pretty -- however, the memory address is "popular" -- it is being shared and modified by n CPUs -- therefore the penalty is O(n^2) on the RFOs.

So, to directly answer his question on how it could be improved: Replace the one shared memory address with N separate memory addresses (in separate cache lines), and combine the results on one "master" CPU. This replaces an O(N^2) RFO penalty with N O(1) penalties -- and therefore creates a scalable solution from a non-scalable one.
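(A rough sketch of that layout - hypothetical names, assuming 64-byte cache lines and GCC's aligned attribute:)

Code: Select all

#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define NUM_CPUS        8

/* One slot per CPU, padded to a full cache line, so only the owning
   CPU ever writes to that line. */
struct percpu_slot {
	volatile int32_t done;    /* written only by the owning CPU */
	volatile int32_t result;  /* that CPU's partial result */
	char pad[CACHE_LINE_SIZE - 2 * sizeof(int32_t)];
};

static struct percpu_slot slots[NUM_CPUS]
		__attribute__((aligned(CACHE_LINE_SIZE)));

/* The "master" CPU only reads the other CPUs' lines (taking them to the
   shared state) and combines the partial results. */
static int32_t combine_results(void)
{
	int32_t total = 0;
	for (int cpu = 1; cpu < NUM_CPUS; cpu++) {
		while (!slots[cpu].done)
			;                      /* read-only spin */
		total += slots[cpu].result;
	}
	return total;
}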

(And please note that this argument applies to EVERY kind of multi-CPU locking mechanism, from spinlocks on down.)
Virtlink
Member
Posts: 34
Joined: Thu Jun 05, 2008 3:53 pm
Location: The Netherlands
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Virtlink »

Brendan wrote:An OS should never need to manage applications at the machine-code level. The OS's kernel should only need to care about processes and "kernel level threads" (if supported). Things like automatically doing loops in parallel and "user level threads" should be handled by languages/compilers, and mapped onto whatever the kernel supports by the language/compiler.
Why is this? Generally, a process has no need to know the number of CPUs in the system or their characteristics and priorities, and it has no means to influence the number of CPUs on which it is executed. When the application wants to use parallel loops, it might use a structure provided and managed by the scheduler, say a 'fiber'.

When I read the posts in this thread, I understand it as this:
  • No process should be limited in the number of CPUs that they use at one time.
  • CPUs are very likely to share caches.
  • Modifying shared memory will cause RFOs. Bewing's solution above seems a nice enough way to deal with it.
Is it feasible for the OS to know which caches are shared? If so, the scheduler could schedule multiple threads of the same process to CPUs with a shared TLB. Are there other means to optimize scheduling for the other caches (L1, L2, etc...)?
Colonel Kernel
Member
Posts: 1437
Joined: Tue Oct 17, 2006 6:06 pm
Location: Vancouver, BC, Canada
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Colonel Kernel »

Brendan wrote:I guarantee you that a significant portion of them will still mess it up - e.g. calling your function much more than necessary and/or using it in an inappropriate way.
That's a risk with any abstraction, but the existence of the risk doesn't justify not having an abstraction in the first place.
Then there's the problem that all forms of abstraction have - inefficiency caused by a mismatch between intended usage and desired usage. For example, if I were using POSIX threads to do the same as the example above then I'd use "pthread_join()", but in that case there'd be a main thread that would be constantly blocking and restarting while it waits for each of the other threads to finish (and the overhead/inefficiency caused by the additional thread switches).
I don't think all abstractions suffer from this problem, although many of them do. I think the problem stems from having abstractions at the wrong level -- for example, shoe-horning shared memory concurrency into C, which was designed with a single-threaded process model in mind.

Your pthread_join() example is a good one on a few levels. You're right about the performance issues it causes, but the reason those issues exist is not because of pthread_join(). It is because of the way that C forces the programmer to call pthread_join() in an imperative rather than a declarative manner. If you want to give the system the most freedom to schedule things intelligently for better performance, you have to restrict the programmer to saying as little as possible about how things are to be done. They should only care about what needs to be done, not how. This is just good sense, since the system knows the hardware topology better than the application. It knows how many CPUs there are, how "close together" they are in NUMA terms, how busy they currently are with other threads, etc. IMO, threads are simply too low-level to be a good abstraction for application programmers.

Here is how I would imagine a performant solution to your background-task pattern could be expressed using a higher abstraction level than threads:

Code: Select all

results = fork_sync_n( doSomeWork, n, combineResults );
It says everything that the application programmer should care about saying, while leaving the rest up to the system to decide. It says "call doSomeWork() n times concurrently, use combineResults() to combine the results of all doSomeWork() calls, block until all n invocations are done (allowing the calling thread to participate as well, if it makes sense to the scheduler to do so), then return the combined results." This is very difficult to get wrong compared to inventing everything from scratch.
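(A rough sketch of how a fork_sync_n() like that could be layered on plain pthreads - the names and signatures are invented, and a real runtime would let the scheduler pick the actual degree of concurrency rather than blindly creating n threads:)

Code: Select all

#include <pthread.h>

typedef int (*work_fn)(int index);           /* invented worker signature */
typedef int (*combine_fn)(int acc, int x);   /* invented combiner signature */

struct fork_arg {
	work_fn work;
	int index;
	int result;
};

static void *fork_trampoline(void *p)
{
	struct fork_arg *a = p;
	a->result = a->work(a->index);
	return NULL;
}

/* Call work() n times concurrently, combine the results, block until done. */
int fork_sync_n(work_fn work, int n, combine_fn combine)
{
	pthread_t tids[n];
	struct fork_arg args[n];
	int acc = 0;

	for (int i = 0; i < n; i++) {
		args[i].work = work;
		args[i].index = i;
		pthread_create(&tids[i], NULL, fork_trampoline, &args[i]);
	}
	for (int i = 0; i < n; i++) {
		pthread_join(tids[i], NULL);
		acc = combine(acc, args[i].result);
	}
	return acc;
}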
Out of curiosity, what sort of features are you planning to support in your kernel to assist with profiling and performance monitoring?
I never got that far, unfortunately. My kernel project has been stalled for almost three years now. Life (new job, moving, getting married) has really gotten in the way. :P
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
Virtlink wrote:When I read the posts in this thread, I understand it as this:
  • No process should be limited in the number of CPUs that they use at one time.
Hmm - I'd say "No process should be limited in the number of CPUs that they could use at one time". There's plenty of good reasons for a process to be limited to a subset of all available CPUs, either by the OS/kernel or by the process itself. These reasons could include situations where the process uses features that only some of the CPUs support (e.g. a process that uses MMX on a "mixed Pentium" system where only some of the CPUs support MMX), NUMA efficiency (where a process is limited to CPUs that belong to the same NUMA domain), etc.
Virtlink wrote:
  • CPUs are very likely to share caches.
As a rough guide, logical CPUs within the same core will share L2 and L3 caches and maybe the L1 data cache, and cores within the same chip might share L2 and/or L3 caches; but no caches are ever shared between CPUs that are in different chips (except in rare/unusual NUMA systems where there's some sort of "cache line state" caching done by the chipset to reduce cache miss and RFO costs).
Virtlink wrote:Is it feasible for the OS to know which caches are shared?
Yes. For modern Intel CPUs you can use "CPUID, EAX=0x00000004" to get the "Deterministic Cache Parameters", which includes the number of logical CPUs using each cache and a lot more information (cache size, associativity, etc). For AMD CPUs there isn't a way to get this information, but AFAIK you can assume that L1 and L2 caches aren't shared, and the L3 cache (if present) is shared by all cores within the chip.
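(For illustration, a user-space sketch using GCC's <cpuid.h>; a kernel would issue CPUID directly, and note that the "sharing" field is really a "maximum number of addressable IDs", so treat it as an upper bound:)

Code: Select all

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	for (unsigned int index = 0; ; index++) {
		__cpuid_count(4, index, eax, ebx, ecx, edx);

		unsigned int type = eax & 0x1F;          /* 0 = no more caches */
		if (type == 0)
			break;

		unsigned int level   = (eax >> 5) & 0x7;
		unsigned int sharing = ((eax >> 14) & 0xFFF) + 1; /* logical CPUs sharing this cache (upper bound) */
		unsigned int ways    = ((ebx >> 22) & 0x3FF) + 1;
		unsigned int parts   = ((ebx >> 12) & 0x3FF) + 1;
		unsigned int line    = (ebx & 0xFFF) + 1;
		unsigned int sets    = ecx + 1;

		printf("L%u %s cache: %u KiB, %u-way, %u-byte lines, shared by up to %u logical CPUs\n",
		       level,
		       type == 1 ? "data" : type == 2 ? "instruction" : "unified",
		       (ways * parts * line * sets) / 1024,
		       ways, line, sharing);
	}
	return 0;
}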
Virtlink wrote:If so, the scheduler could schedule multiple threads of the same process to CPUs with a shared TLB.
AFAIK TLB entries are never shared.

For hyper-threading, logical CPUs that belong to the same core "competitively share" the TLB, which means that one logical CPU can't use a TLB entry that is owned by another logical CPU (even if CR3 is the same in both logical CPUs, and even if the TLB entry is marked as "global"). This means that running threads that belong to the same process on logical CPUs within the same core or on cores that belong to the same chip won't reduce the chance of TLB misses. Note: "won't reduce the chance of TLB misses" is not the same as "won't reduce the cost of TLB misses"....
Virtlink wrote:Are there other means to optimize scheduling for the other caches (L1, L2, etc...)?
Yes - the general idea is to keep caches "warm", i.e. schedule threads on CPUs that may still have cached data that the thread may need. For example, if a thread was running on CPU#0 for a while and blocks, then when that thread is started again its data might still be in CPU#0's cache, and might also still be in other CPUs' caches if those caches are shared with CPU#0.

However, if you always schedule threads to run on the same CPU that they ran on last time (or the same group of CPUs that share cache) then you won't be able to effectively balance CPU load - for example, you could have 100 threads waiting to use CPU#0 while other CPUs are doing nothing, which is bad for performance. Therefore it's a good idea to take into account the chance of a thread's data still being in the CPU's cache. CPU caches use a "least recently used" algorithm. So, if a thread was running on CPU#0 but CPU#0 has done a lot of other work since, then that thread's data probably won't be in CPU#0's cache anymore, and it may be a bad idea to schedule that thread on CPU#0 again if other CPUs have less load.

To complicate this more, threads that belong to the same process probably share some data. If 2 threads that belong to the same process are running at the same time it would improve performance to run those threads on CPUs that share cache; and if a CPU was used to run a thread recently then it might still have data in its cache that a different thread that belongs to the same process might need.
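(As a toy illustration of how a scheduler might weigh these factors when picking a CPU for a waking thread - the names and weights below are invented:)

Code: Select all

/* Toy per-CPU score for deciding where to run a waking thread.
   Higher is better. All weights are arbitrary and would need tuning. */
struct cpu_state {
	int runnable_threads;        /* current load on this CPU */
	unsigned long work_since;    /* work this CPU has done since the thread last ran here */
	int shares_cache_with_last;  /* 1 if this CPU shares a cache with the CPU the thread
	                                (or a sibling thread of the same process) last ran on */
};

static int affinity_score(const struct cpu_state *cpu)
{
	int score = 0;

	score -= cpu->runnable_threads * 100;   /* prefer lightly loaded CPUs */

	/* The more work the CPU has done since the thread last ran there, the
	   less likely the thread's data survived LRU eviction. */
	if (cpu->work_since < 1000)
		score += 50;                        /* cache probably still warm */

	if (cpu->shares_cache_with_last)
		score += 25;                        /* shared cache may still hold useful data */

	return score;
}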

Note: this sort of cache optimization can indirectly affect TLB miss costs. For example, if 2 CPUs share the same L3 cache but don't share TLBs, then data needed for a TLB miss on one CPU might be satisfied by data that was cached in the L3 cache because of another CPU's TLB miss. Basically, even though TLBs aren't shared, optimizing for L1/L2/L3 cache sharing can also indirectly improve the amount of time it takes to handle TLB misses.

For load balancing on CPUs with hyper-threading things get even more complicated, because work done on one logical CPU affects the performance of the other logical CPU. For example, if you've got 2 threads that belong to the same process and nothing else that's ready to run, and 2 separate chips that both support hyper-threading (4 logical CPUs total); then it's better for performance to schedule one thread on each chip (with no chance of cache sharing) instead of scheduling both threads on the same chip (where they can share cache) and leaving the other chip idle.

Of course all of the above is about optimizing performance. Power management (e.g. optimizing heat) is a different problem. It's not an unrelated problem though; most modern CPUs will reduce their performance when they get too hot (it's "self defense" ;) ) and can suddenly drop down to 12.5% performance or 25% performance. To optimize for heat, you want to schedule threads on the coolest CPU (or maybe on the least loaded CPU if you can't get temperature information), which means forgetting about all of the performance optimizations. A "very good" scheduler would take into account both performance and power management and optimize for both depending on a variety of factors (and not just optimize for performance only). For power management, other factors include whether or not the computer is running on batteries (mobile systems like laptops *and* servers running from a UPS during power failures), and whether or not the computer is in a crowded office or someone's home (where CPU fan noise/speed may be more important than performance).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
bewing wrote:
Colonel Kernel wrote: That's easy. By putting it in the kernel, you don't have to entrust programmers who are far less talented and experienced than you to write that kind of code ever again. Therefore, it is improved. ;)
Plus, if your kernel is running on one and only one CPU, then all those lock sub statements happen on just one CPU. So there aren't any RFOs on those statements at all.

-- Let me rephrase that. The thrust of my argument is that all RFOs are not created equal. There is a performance penalty on RFOs that is related to the total number of CPUs that are trying to modify one particular memory address. In fact, after thinking about it, I am pretty sure that in the worst case the performance penalty is n*(n-1) -- or, in other words O(n^2).

IPC between two CPUs creates RFOs, of course. However, those RFOs are only between two CPUs. Therefore, the performance penalty is O(1). A lock like the one Brendan posted is very pretty -- however, the memory address is "popular" -- it is being shared and modified by n CPUs -- therefore the penalty is O(n^2) on the RFOs.
Ummm.... If all threads need to tell all other threads that they've reached a certain point, then for IPC you'd need "n * (n - 1)" messages, probably with at least 2 RFOs per message (depending on implementation), and probably with "CPL=3 -> CPL=0 -> CPL=3" context switches.

Now, about those RFO penalties, what are they? For bus bandwidth it's at most 2 "bus transactions" over each bus (FSB or hyper-transport link) - a request to all CPUs followed by a response containing the cache line from one CPU. For cache lines, it's one cache line invalidation per CPU (except the sender) but only if the CPU has that cache line in its cache; and if the cache line is invalidated it only matters if the CPU needs that cache line again.

Let's look at the example I posted. At the beginning the cache line containing the "remainingWorkerThreads" counter isn't in any CPU's cache except for the initial thread's CPU's cache. When a thread reaches the "lock sub [remainingWorkerThreads],1" instruction the cache line is marked invalidated in (at most) one CPU's cache (because it's not present in any other CPU's cache at any stage). For worst case that adds up to a total of "n" cache line invalidations (best case is "n - 1" cache line invalidations, which only happens if the initial thread is the first thread to reach the "lock sub [remainingWorkerThreads],1" instruction).
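(For readers without the earlier page handy, the kind of shared-counter pattern being analysed looks roughly like this - a reconstruction from the snippets quoted here, using a GCC atomic builtin in place of the original assembly; it is not the original code:)

Code: Select all

#include <pthread.h>

#define NEEDED_THREADS 4

static int remainingWorkerThreads;

static void doSomeWork(void)
{
	/* ... one thread's share of the work ... */
}

static void finishedWork(void)
{
	/* equivalent of "lock sub [remainingWorkerThreads],1" */
	if (__sync_sub_and_fetch(&remainingWorkerThreads, 1) == 0) {
		/* last thread to finish - handle/signal completion */
	}
}

static void *workerThread(void *arg)
{
	doSomeWork();
	finishedWork();
	return NULL;
}

int main(void)
{
	pthread_t tids[NEEDED_THREADS - 1];

	remainingWorkerThreads = NEEDED_THREADS;   /* written once by the initial thread */

	for (int i = 0; i < NEEDED_THREADS - 1; i++)
		pthread_create(&tids[i], NULL, workerThread, NULL);

	/* the initial thread does a share of the work too and decrements the
	   same counter, so it may or may not be the first to reach it */
	doSomeWork();
	finishedWork();

	for (int i = 0; i < NEEDED_THREADS - 1; i++)
		pthread_join(tids[i], NULL);
	return 0;
}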

Of course most of the cache line invalidations don't matter, because none of the threads need that cache line again after the "lock sub [remainingWorkerThreads],1" instruction. The only invalidation that does matter is for the initial thread, where the cache line is invalidated after the "remainingWorkerThreads = neededThreads;" line, because it's needed later by the "lock sub [remainingWorkerThreads],1" instruction.

The main thing that does matter here is the latency of the RFO (but that's less than the latency for a normal cache miss) and the effects of the LOCK prefix on CPU pipelines. Both of these things can't be avoided (unless you give up on improving performance by doing work in parallel).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimal multi-logical-processor scheduling

Post by bewing »

Brendan wrote: Ummm.... If all threads need to tell all other threads that they've reached a certain point, then for IPC you'd need "n * (n - 1)" messages, probably with at least 2 RFOs per message (depending on implementation), and probably with "CPL=3 -> CPL=0 -> CPL=3" context switches.
Not necessarily. The way I suggested uses 0 RFOs, 4*(n-1) cache line SHARES, 2*(n-1) cache line invalidations (when shared cache lines get modified), and has n*(n-1)-2*(n-1) cache hits (or many more, depending on the algorithm) on shared cache lines -- but who cares how many cache hits you have. -- At least, I *think* I got my math right on those.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimal multi-logical-processor scheduling

Post by bewing »

Here is some code. Please, anyone who knows exactly what an RFO is, and what a shared cache line is -- count all the RFOs and tell me exactly how many there are, and where they occur.

This is an application that performs any task, on any number of CPUs, with full synchronization.

Assume that main is started on cpu # 0, and that the do_my_assigned_task (MyCpu); functions are completely independent of each other -- this example is for the purpose of demonstrating RFO-less signalling/locking only.

Code: Select all

main()
{
	int CpuNum, flag;
	char *CacheLine[NUM_CPUS];
	void generic_task_thread();

	// allocate the cache lines -- one for each cpu, cpu0 is the "master"
	CacheLine[0] = malloc_on_cacheline_boundary (NUM_CPUS * CACHE_LINE_SIZE);
	CpuNum = 0;
	while (++CpuNum < NUM_CPUS)
		CacheLine[CpuNum] = CacheLine[CpuNum - 1] + CACHE_LINE_SIZE;
	// I am the master cpu -- init/modify ONLY my cacheline
	// -- indicate that the CPUs are still in "initialization mode" -> state = -1
	CacheLine[0][0] = -1;
	// cache line #0 is now "dirty" and I OWN it

	// Now start all the slave CPUs with their task threads.
	CpuNum = NUM_CPUS;
	while (--CpuNum > 0)
	{
		start_thread_on_cpu (CpuNum, generic_task_thread, CacheLine);
	}

	// synchronize/wait until all slave CPUs have initialized their cache lines to 0
	flag = 1;
	while (flag != 0)
	{
		CpuNum = NUM_CPUS;
		flag = 0;
		while (--CpuNum > 0 && flag == 0)
		{
			// READ (but do not modify!) each slave CPUs cacheline
			// Since I am not modifying them, I do not OWN them
			// this will generate up to NUM_CPUS -1 invalidations, and twice that # of cacheline SHARES
			// all the rest of the accesses are cacheline hits to a valid, shared cacheline
			if (CacheLine[CpuNum][0] != 0)
				flag = 1;	// the slave CPU has not initialized its byte yet
		}
	}

	// all the slave CPUs have initted their byte to 0 -- it's time to GO!
	CacheLine[0][0] = 0;	// set my state = 0 = go
	// synchronize/wait until all slave CPUs have finished their tasks
	flag = 1;
	while (flag != 0)
	{
		CpuNum = NUM_CPUS;
		flag = 0;
		while (--CpuNum > 0 && flag == 0)
		{
			// READ (but do not modify!) each slave CPUs cacheline again
			// ditto on invalidations, SHARES, and shared cacheline hits
			if (CacheLine[CpuNum][0] == 0)
				flag = 1;	// the slave CPU has not completed its task
		}
	}
	CacheLine[0][0] = 1;	// Done!
	... ;
}

void generic_task_thread(char *CacheLine[])
{
	int MyCpu = get_thread_cpu_num();
		// init my CacheLine byte
	CacheLine[MyCpu][0] = 0;
		// my cacheline is now modified/dirty and I OWN it

		// synchronize/wait until it's time to GO!
		// READ (but do not modify!) the master CPUs cacheline
		// this memory address in the master CPU's cache line is only modified ONCE,
		// which will result in two cacheline shares, and one invalidation,
		// NUM_CPUS - 1 times, altogether -- once for all the slave CPUs
		// all the rest of the accesses are valid, shared cacheline hits
	while (**CacheLine != 0);
		// Now, do my assigned task
	do_my_assigned_task (MyCpu);
		// Tell the master CPU that I'm done
	CacheLine[MyCpu][0] = 1;
		// just for fun, I could synchronize again, here -- waiting for **CacheLine != 0
	exit_thread();
}
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
bewing wrote:Here is some code. Please, anyone who knows exactly what an RFO is, and what a shared cache line is -- count all the RFOs and tell me exactly how many there are, and where they occur.

This is an application that performs any task, on any number of CPUs, with full synchronization.

Assume that main is started on cpu # 0, and that the do_my_assigned_task (MyCpu); functions are completely independent of each other -- this example is for the purpose of demonstrating RFO-less signalling/locking only.
An RFO occurs when a CPU needs to modify part of a cache line that isn't already in its cache in the "modified" or "exclusive" state. For example, an RFO can occur when you push something onto the stack, regardless of whether or not any other CPUs have the cache line in their cache.

You're not counting RFOs, you're counting cache line evictions caused by RFOs.


So let's summarize. You suggested limiting processes to one core as a way of reducing RFOs. My opinion is that RFOs in user-space code are a user-space problem and that it's not appropriate for a kernel to place such limitations on processes; and I posted user-space code showing reasonably efficient cache-line usage without kernel intervention. Now you're posting user-space code showing reasonably efficient cache-line usage without kernel intervention too?

I guess this means we agree - kernel intervention isn't strictly necessary... :D

However, I should admit that I am leaning a little more towards Colonel Kernel's suggestion - hiding the underlying details in a library could help to reduce the chance of "bad programming" (but more importantly, it would also make it possible to write portable code that works efficiently under different architectures, with different underlying details).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Combuster »

RFO-less is kind of difficult (without CD=1 :P)

Code: Select all

while (**CacheLine != 0);
prior to this statement cache line 0 is exclusive to CPU 0 - when the APs execute this, the line must be posted back to the other processors, setting it to shared for all processors.
When CPU 0 consequently resets it, it must issue an RFO because it is not exclusive to the processor but shared amongst other processors (i.e. the line must go from either the O/S/I state to the M state, which by definition is an RFO.)

similarly, RFOs happen too between the cache lines dedicated to each of the APs because the BSP reads them which causes them to go non-exclusive, after which they must be followed by an RFO.

I can see 2*#cpu-1 RFOs occurring: the BSP reads out the cache lines before the APs start the thread, which then set their lines (#ap RFOs), then read the BSP's line (-> shared state), the BSP in turn checks all cache lines for initialisation (AP cache -> shared state). The BSP then signals go (1 RFO, followed by immediate resharing), after which the APs come back and tell the BSP they're done (#ap RFOs)

I'll stick with my atomic fetch-and-increment bakery algorithm (1 RFO per lock, 1 RFO per unlock)
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimal multi-logical-processor scheduling

Post by bewing »

Brendan wrote: You're not counting RFOs, you're counting cache line evictions caused by RFOs.
I'm trying to count both because they both take time.
You suggested limiting processes to one core as a way of reducing RFOs. ... Now you're posting user-space code showing reasonably efficient cache-line usage without kernel intervention too?
Yes. Precisely. Limiting processes to one cpu is, as you say, "one way" of solving the problem -- the intermediate degree of difficulty solution, as it were. Then the N cpu method is the advanced solution.
I guess this means we agree - kernel intervention isn't strictly necessary... :D
Yes. I do agree with that. Not strictly necessary.
However, I should admit that I am leaning a little more towards Colonel Kernel's suggestion - hiding the underlying details in a library could help to reduce the chance of "bad programming" (but more importantly, it would also make it possible to write portable code that works efficiently under different architectures, with different underlying details).
Agreed. Me too. In my case, I'm thinking about blocking user-written locks (but not kernel-supplied locks) by default, on machines with more than 4 CPUs, with an override if the programmer swears that he knows what he's doing -- or something like that.
Combuster wrote: prior to this statement cache line 0 is exclusive to CPU 0 - ...
... must go from either the O/S/I state to the M state
I think I disagree. I will look into it further -- but CPU0 already modified that cache line, so it is already in the M state, I think -- before it is shared.
because the BSP reads them which causes them to go non-exclusive, after which they must be followed by an RFO.
Depending on whether the BSP spins off the threads synchronously, or asynchronously, the BSP may read those cachelines before they go into the M state -- as each slave CPU modifies its cacheline. So I agree, it is possible for some RFOs to occur there, or not, depending on timing issues.
However, that is in the one-time initialization section of a program that sets up an entire cacheline of locking flags for each child process to use repeatedly for the lifetime of the app.
I'll stick with my atomic fetch-and-increment bakery algorithm (1 RFO per lock, 1 RFO per unlock)
I bet if you look at it more closely, you will see that it's not 1 and 1 -- it's n*(n-1) worst case for both -- when the lock is accessed by n CPUs all at the same time -- which becomes more likely when you have 2048 cpus on a machine. But have it your way. :wink:
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
bewing wrote:
Brendan wrote:You're not counting RFOs, you're counting cache line evictions caused by RFOs.
I'm trying to count both because they both take time.
Everything costs something. An occasional RFO on a single-bus system (e.g. all CPUs connected by the same front-side bus) costs less than an occasional cache miss (same bus bandwidth, better latency).

I'm not too sure about "multi-bus" systems (e.g. AMD's hyper-transport and Intel's upcoming Quickpath) - I haven't investigated the implications. AFAIK the bandwidth usage is no worse (from a "transactions per bus" perspective or a "total bus utilization" perspective) but in some situations it could be improved (e.g. if the RFO is satisfied because a "close" CPU had the cache line in a modified state, then the RFO doesn't need to be forwarded to "distant" buses). The worst case latency would be worse, because that gets worse with the number of hops. However, for anything more than 8-way NUMA there's typically some sort of directory-based cache coherence chip, so that the state and/or location of a cache line can be found quickly. Examples of this include the Newisys Horus chipset (for Opterons) and Sequent NUMA-Q (for Intel P6). Note: If you're curious, good internet search keywords seem to be "directory cache coherency".

Also, IMHO a single-bus system with multi-core CPUs isn't necessarily different from a multi-bus system, as there's a shared bus interface within each chip that acts in a similar way to a high-speed bus between cores. For example, an RFO may be satisfied without any activity on the front-side bus because a CPU in the same chip had the cache line in a modified state.

IMHO occasional RFOs aren't a problem and won't make a significant difference to performance. I wouldn't even worry about frequent RFOs for different cache lines. IMHO the problem is frequent RFOs for the same cache line, otherwise known as cache line thrashing.
bewing wrote:
Combuster wrote: prior to this statement cache line 0 is exclusive to CPU 0 - ...
... must go from either the O/S/I state to the M state
I think I disagree. I will look into it further -- but CPU0 already modified that cache line, so it is already in the M state, I think -- before it is shared.
It'd be in the modified state, but that doesn't make any difference to Combuster's argument.
bewing wrote:
because the BSP reads them which causes them to go non-exclusive, after which they must be followed by a RFO.
Depending on whether the BSP spins off the threads synchronously, or asynchronously, the BSP may read those cachelines before they go into the M state -- as each slave CPU modifies its cacheline. So I agree, it is possible for some RFOs to occur there, or not, depending on timing issues.
However, that is in the one-time initialization section of a program that sets up an entire cacheline of locking flags for each child process to use repeatedly for the lifetime of the app.
Depending on exact timing I see a worst case of "2 * (n - 1) + 1" RFOs, where the main thread's final "CacheLine[0][0] = 1;" causes the extra RFO because that cache line would be in the shared state in other CPUs (terminating a worker thread won't invalidate the worker thread's CPU's cache). This final RFO isn't needed though (the "CacheLine[0][0] = 1;" line could be deleted). Best case is "(n - 1)".

Actually, I'm completely wrong. Every single cache line that's used may have been in the modified state in a different CPU's cache before the process even started. Calling "main()" and setting up the main function's local variables could account for another RFO, and worker thread stacks could cost another "n - 1" RFOs. That brings the worst case to at least "3 * (n - 1) + 2" RFOs....

Of course once you start thinking like that it'll send you insane. For example, for a computer with 4 CPUs with 4 MiB of cache per CPU, 3 of those CPUs may have been used to fill 4 MiB of RAM (each) with zeros, and that 12 MiB of zeroed RAM may have been mapped into the ".bss" section of a single-threaded process that is run on the remaining CPU, resulting in up to 196608 RFOs. 8)

If you take this sort of thinking to the extreme, as long as physical pages are being allocated and freed by different processes (including processes starting and terminating) then you can achieve a massive number of RFOs, even if all processes are single-threaded, and even if each process is limited to a single CPU...


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Optimal multi-logical-processor scheduling

Post by bewing »

You think it's not a big deal, and I think that it will be. Therefore we base our OSes on differing concepts, and may the OS that handles the real world the best, win.

The thing I notice about all your worst-case estimates of my algorithm is that they are all linear with N -- even if we disagree on the other terms. Linear on N is my goal.
It'd be in the modified state, but that doesn't make any difference to Combuster's argument.
As combuster himself quoted: "the line must go from either the O/S/I state to the M state, which by definition is an RFO".
But if the line is already in the M state, then there is no transition, and there is no RFO. By definition. So I disagree.

My algorithm is designed to put all the cache lines in the M state, in their own CPU, permanently. After the lines are in the M state, there is never another RFO again, on those cache lines. So I disagree with all your estimates, of course.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Brendan »

Hi,
bewing wrote:
It'd be in the modified state, but that doesn't make any difference to Combuster's argument.
As combuster himself quoted: "the line must go from either the O/S/I state to the M state, which by definition is an RFO".
But if the line is already in the M state, then there is no transition, and there is no RFO. By definition. So I disagree.
The main thread does "CacheLine[0][0] = -1;" putting the cache line in the modified state on CPU0, then one or more worker threads do "while (**CacheLine != 0);" which puts the cache line into the shared state on CPU0, then:
Combuster with minor editing for emphasis wrote:When CPU 0 consequently resets it [at the "CacheLine[0][0] = 0;" line in the main thread], it must issue an RFO because it is not exclusive to the processor but shared amongst other processors (i.e. the line must go from either the O/S/I state to the M state, which by definition is an RFO.)
At the point Combuster is talking about, the cache line is no longer in the modified state on CPU0 (and is in the shared state), and needs to go from the shared state to the exclusive state for the "CacheLine[0][0] = 0;" line; and going from the shared state to the exclusive state really would be an RFO.

Combuster's original words said "exclusive" instead of "modified", but that makes no difference because it's still shifted to the shared state by one or more worker threads doing "while (**CacheLine != 0);" before the RFO occurs.
bewing wrote:My algorithm is designed to put all the cache lines in the M state, in their own CPU, permanently. After the lines are in the M state, there is never another RFO again, on those cache lines. So I disagree with all your estimates, of course.
Your algorithm puts the cache line in the M state "permanently", until another CPU reads from it and puts it in the shared state, which causes an RFO when it's modified a second time....
bewing wrote:You think it's not a big deal, and I think that it will be. Therefore we base our OSes on differing concepts, and may the OS that handles the real world the best, win.

The thing I notice about all your worst-case estimates of my algorithm is that they are all linear with N -- even if we disagree on the other terms. Linear on N is my goal.
The example code I posted previously is also linear on N (with "N" RFOs), except it doesn't waste CPU time doing "while(other_threads_still_running) {}" busy waiting. To be fair, mine isn't portable (but you get that from assembly language programmers occasionally), and I guess you could add some "pthread_cond_wait()" calls in there if you really wanted to avoid busy waiting. :)
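(Something like this, for what it's worth - a sketch of the counter guarded by a mutex/condvar pair so the waiting thread sleeps instead of spinning:)

Code: Select all

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  all_done = PTHREAD_COND_INITIALIZER;
static int remaining;            /* set to the number of workers before they start */

/* called by each worker when its share of the work is finished */
void worker_done(void)
{
	pthread_mutex_lock(&lock);
	if (--remaining == 0)
		pthread_cond_signal(&all_done);   /* wake the waiting thread */
	pthread_mutex_unlock(&lock);
}

/* called by the waiting thread - sleeps instead of busy waiting */
void wait_for_workers(void)
{
	pthread_mutex_lock(&lock);
	while (remaining != 0)
		pthread_cond_wait(&all_done, &lock);
	pthread_mutex_unlock(&lock);
}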


And now for an admission - that stuff I was saying about physical pages being allocated and freed by different processes/CPUs and causing heaps of RFOs wasn't just a silly example, it's a silly example from my previous kernel's code. The idea was that freed physical pages are "cleaned" (filled with zeros and optionally tested for RAM faults) in idle time, so that they don't need to be cleaned while a higher priority thread is allocating them. This was implemented with pairs of page stacks. A busy CPU would free a page and push it on the dirty free page stack; another idle CPU would pop it from the dirty free page stack, clean it and push it onto the clean free page stack; and then a busy CPU would allocate it again by popping it off the clean free page stack. In hindsight, they should've been "per chip" free page stacks rather than "per NUMA domain" free page stacks...
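(Roughly this shape - a much simplified sketch with invented names, a single pair of stacks and no locking, just to show the flow:)

Code: Select all

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 1024

typedef uintptr_t paddr_t;

/* trivial page stack (not thread-safe) - the real thing was one pair of
   stacks per NUMA domain, with proper locking */
struct page_stack { paddr_t pages[MAX_PAGES]; int top; };

static int stack_pop(struct page_stack *s, paddr_t *out)
{
	if (s->top == 0)
		return 0;
	*out = s->pages[--s->top];
	return 1;
}

static void stack_push(struct page_stack *s, paddr_t p)
{
	s->pages[s->top++] = p;
}

static struct page_stack dirty_pages;   /* freed, still holding old data */
static struct page_stack clean_pages;   /* zeroed during idle time */

/* busy CPU freeing a page: cheap, no zeroing on the hot path */
void free_page(paddr_t page)
{
	stack_push(&dirty_pages, page);
}

/* idle CPU: drain the dirty stack, cleaning pages as it goes */
void idle_clean_pages(void)
{
	paddr_t page;
	while (stack_pop(&dirty_pages, &page)) {
		memset((void *)page, 0, PAGE_SIZE);   /* assumes the page is mapped at this address */
		stack_push(&clean_pages, page);
	}
}

/* busy CPU allocating: prefer a page that was already cleaned */
paddr_t alloc_page(void)
{
	paddr_t page;
	if (stack_pop(&clean_pages, &page))
		return page;
	if (stack_pop(&dirty_pages, &page)) {
		memset((void *)page, 0, PAGE_SIZE);
		return page;
	}
	return 0;   /* out of free pages */
}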



Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Optimal multi-logical-processor scheduling

Post by Combuster »

bewing wrote:
I'll stick with my atomic fetch-and-increment bakery algorithm (1 RFO per lock, 1 RFO per unlock)
I bet if you look at it more closely, you will see that it's not 1 and 1 -- it's n*(n-1) worst case for both -- when the lock is accessed by n CPUs all at the same time -- which becomes more likely when you have 2048 cpus on a machine. But have it your way. :wink:
Hmm lemme see:

Code: Select all

lock:
  MOV EAX, 1 
    ; no cache stuff
  LOCK XADD [cacheline1], EAX  
    ; 1 likely RFO: the cache line must be fetched, then modified immediately afterwards.
    ; no RFO if the previous call was from the same CPU and the line still cached.
    ; will at most invalidate one cache line (the previous caller will have it in the M state or written back)
.wait:
  CMP EAX, [cacheline2]
    ; shares cache line 2, no RFOs
  JE .done
  ; busy wait or yield or whatever
  JMP .wait
.done:
  RET

unlock:
  LOCK INC [cacheline2]
     ; 1 likely RFO, might trash line in all caches (n-1 invalidations)
  RET
If I got the math right, I was wrong: it's 1/1 RFO worst case :D
Also, there are exactly #cpu invalidations worst case: 1 in the lock, n-1 in the unlock. (*wonders whether this is the theoretical minimum*)
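(For comparison, the same fetch-and-increment ticket scheme sketched in C11 atomics, with the two counters forced onto separate cache lines like [cacheline1]/[cacheline2] above:)

Code: Select all

#include <stdatomic.h>

struct ticket_lock {
	_Alignas(64) atomic_uint next_ticket;   /* like cacheline1 */
	_Alignas(64) atomic_uint now_serving;   /* like cacheline2 */
};

void ticket_lock(struct ticket_lock *l)
{
	/* one likely RFO: fetch-and-increment to take a ticket */
	unsigned int my_ticket = atomic_fetch_add(&l->next_ticket, 1);

	/* read-only spin: the line stays shared, no further RFOs from waiters */
	while (atomic_load(&l->now_serving) != my_ticket)
		;
}

void ticket_unlock(struct ticket_lock *l)
{
	/* second likely RFO: invalidates the waiters' shared copies */
	atomic_fetch_add(&l->now_serving, 1);
}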
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]