x86 multiprocessing in assembly?

CPLH · Post by **CPLH** » Mon Jan 26, 2009 4:42 pm

Nowadays, and increasingly in the future, computers have more than one processor. Assembly, especially for x86, usually seems to assume a single processor. Does anybody know of a source that explains how to use more than one processor in assembly?

Thank you,
Veniamin

Zenith · Post by **Zenith** » Mon Jan 26, 2009 5:00 pm

Well, the reason for this is that multi-processing is ideally transparent to a software application, and support for multiple processing cores is supposed to be managed by the OS. Nowadays, developers are becoming more aware of the parallel processing trend, so applications take advantage of these cores via multi-threading. It's the operating system's job to take these threads and execute them on separate cores, and the code for doing so is usually not compiled into the application.

If you want information on utilizing multiple cores in your OS, have a look at the Intel Manuals. Note that each processor core will have to be separately initialized by your OS, interrupts have to be managed, etc. I haven't implemented such support myself yet, so other forum members may be able to correct me or provide additional resources.

Good luck!

DrLink · Post by **DrLink** » Mon Jan 26, 2009 5:07 pm

The way you worded it, wouldn't that mean one core could be run in protected mode and one in real mode, allowing for a good bit more flexibility/laziness?

AndrewAPrice · Post by **AndrewAPrice** » Mon Jan 26, 2009 6:08 pm

For system programming:
I don't see any advantages of having a real mode and protected mode processor (I'm not even sure if that is possible?). If you have multiple processors/cores on a system then generally you would want the end applications to run as fast as possible taking advantage of the extra processors/cores.

In a perfect scenario, the most computationally heavy threads should not be running on the same processor. Think of processors as a precious resource: they're mathematical beasts that are good at calculating. With 2 processors, you can potentially do double the calculating when the work is correctly distributed across them.

When a person is programming a multithreaded application, they're not going to want to write two versions of their code (one to run in real mode, one to run in protected mode). You're cutting the potential performance of those multithreaded applications in half, since they then cannot be distributed across processors. You'll also have to write 2 backends for your system libraries, scheduler, and IPC (one for 16-bit real mode applications, one for 32-bit protected mode applications).

You might be thinking you get the best of both worlds by having real mode and protected mode at the same time, but what can you do in real mode that you can't do in protected mode? (Set up a virtual 8086 task if you need your BIOS interrupts.)

For application programming:
There are experimental concurrent languages that supposedly optimise code to run over multiple processors, but nothing has caught on in the mainstream.

I have done heavy multithreaded work within game frameworks. The majority of applications will only run in a single thread, so there is not much you have to worry about. The easiest way to get your head around multithreading is to divide the work that needs to be done into tasks that can be run in parallel. E.g. in a simple pong game we have 3 tasks:
- Update the ball position
- Update the paddle/input
- Update the score (sleeping unless woken)

Each of these tasks has a pointer to a function (the code to think/update), and there is a pool of awake tasks.

The best abstraction for splitting tasks across threads is a construct called a "parallel for". Say you have 2 threads. (If the code is computationally heavy you generally want as many threads as you have cores, but if your threads block it may pay to have more; the secret is benchmarking.) When you call your parallel for over your pool of tasks, it handles distributing them between threads (2 tasks to thread 1, 1 task to thread 2). When the pool runs empty, all threads (except the one calling the parallel for) are sent to sleep (automatic synchronization!), and then you can repeat (or, in the pong example, we'd call a post-update pool of tasks like render, update physics, update input, play sounds).
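To make the "parallel for" idea concrete, here is a minimal Python sketch. It uses the standard library's `concurrent.futures.ThreadPoolExecutor` as a stand-in for a hand-written thread pool. The `parallel_for` helper and the pong task names are hypothetical, chosen only to match the example above. Note that in CPython the global interpreter lock limits real parallelism for CPU-bound work, so this illustrates the structure rather than the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(tasks, num_threads=2):
    """Run every callable in `tasks` on a pool of worker threads.

    Returns only after all tasks have finished, which gives the
    implicit synchronization point described above.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() hands each task to whichever worker is free and
        # yields results in the order the tasks were submitted.
        return list(pool.map(lambda task: task(), tasks))

# Hypothetical pong state and tasks, for illustration only.
# Each task touches its own key, so no locking is needed here.
state = {"ball_x": 0, "paddle_y": 5, "score": 0}

def update_ball():
    state["ball_x"] += 1
    return "ball"

def update_paddle():
    state["paddle_y"] -= 1
    return "paddle"

def update_score():
    state["score"] += 0  # nothing scored this frame
    return "score"

update_pool = [update_ball, update_paddle, update_score]
finished = parallel_for(update_pool, num_threads=2)
```

After `parallel_for` returns, a second pool (render, physics, sound) could be run the same way, since every task from the first pool is guaranteed to have completed.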

With this kind of thinking you can write code which is scalable and modular, and you don't need any special kind of parallel language; you just need to know the correct code to fork a thread.

I don't know if this is the kind of advice you're looking for.

CPLH · Post by **CPLH** » Mon Jan 26, 2009 6:30 pm

Compilers generate code in a way that supports parallelization nowadays, no?

Is someone missing something here? (Including me, of course?)

For example, if you've got a "loop".. a compiler would try to divide up the for loop to execute whatever you need using both processors... Makes it more efficient as you slice the time needed to do the loop in half.
There are of course architectures that can try to do this for you, but it takes up more power.
An OS can try to do such a thing, but it seems there would be a lot of overhead because of the various checks it would have to make. It is much easier for an OS to only manage processes, as you said. Are you saying that each time a process wants to be compiled with parallelization, it has to use a system library? That doesn't make sense for gcc, as it would require a library function to be natively implemented within the optimization routine. This kind of goes against how GCC is made to work.

bewing · Post by **bewing** » Mon Jan 26, 2009 8:11 pm

There are quite a few threads on this forum about optimal ways of spreading processing over multiple CPUs. I've had quite a few arguments -- lost several to Brendan and Combuster, and I think I succeeded in getting draws out of some others.

In any case, you have to decide what your vision is, for programmers and users of your machine.

Brendan, jal, etc. envision a multi-cpu machine that is almost entirely "empty". A programmer/user can create and run an app on this machine, and it can greedily snatch every last instruction cycle. Therefore, the OS/scheduler they visualize is capable of taking a) application level processes, b) sub-application level threads, and perhaps c) microthreads -- and allocating all those things onto the "best" choice of core, at runtime.

This has a big drawback, in that when you have a single address space spread across multiple cores (or moving between cores), you get cache problems, locality problems, and synchronization problems.

On the other hand, I envision a machine that always has plenty of work to do. (Visualize that there are always 5 times more processes scheduled than there are CPU cores.) All applications always have to wait their turn. No app can ever grab even a significant fraction of the computing power of the machine. So, in my vision, clusters of application level processes are fixed onto particular cores, except in unusual circumstances. Threads and microthreads are never moved to a different CPU than their parent application is running on. This greatly simplifies all the above address space problems.
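As a rough illustration of this "pin the process, keep its threads with it" policy, here is a small Python simulation. Everything in it is hypothetical: the `Process` and `PinnedPlacement` names are made up, and picking the least-loaded core at process creation is an assumption for the sketch, not something stated above.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Process:
    pid: int
    core: int                      # fixed at creation, never changed
    threads: list = field(default_factory=list)

class PinnedPlacement:
    """Assign each process to a core once; its threads never leave that core."""

    def __init__(self, num_cores):
        self.load = [0] * num_cores    # number of threads placed on each core
        self._pids = count(1)

    def spawn_process(self):
        # Assumed policy: place the new process on the least-loaded core.
        core = self.load.index(min(self.load))
        proc = Process(pid=next(self._pids), core=core)
        self.spawn_thread(proc)        # every process starts with a main thread
        return proc

    def spawn_thread(self, proc):
        # Threads inherit the parent's core, so the address space
        # is only ever touched from one CPU.
        tid = len(proc.threads)
        proc.threads.append((tid, proc.core))
        self.load[proc.core] += 1
        return tid

placement = PinnedPlacement(num_cores=2)
app_a = placement.spawn_process()      # lands on core 0
app_b = placement.spawn_process()      # lands on core 1
placement.spawn_thread(app_a)          # stays on core 0
placement.spawn_thread(app_a)          # stays on core 0
app_c = placement.spawn_process()      # core 1 is now the lighter one
```

The sketch also shows the cost described in the next paragraph: however many threads `app_a` spawns, they all compete for core 0.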

But it has a big drawback. You have this machine with all this theoretical horsepower, but if you have some app that needs a tremendous amount of horsepower -- you either have to code it very carefully (as multiple independent standalone applications with minimal inter-communication), or the most you are ever going to get is one CPU of horsepower per application.

But to relate this back to your original question: the only time that the compiler is involved in multi-CPU design decisions is if you (the OS designer) allow tiny little fractions (microthreads) of an application to run completely asynchronously from the rest of the application.
If you only farm out processes and threads over your CPU array, then that is entirely handled by your scheduler, and the compiler has nothing to do with it.

Solar · Post by **Solar** » Tue Jan 27, 2009 6:17 am

CPLH wrote:For example, if you've got a "loop".. a compiler would try to divide up the for loop to execute whatever you need using both processors...

I certainly hope it wouldn't, at least not unless specifically told to...

That loop is most likely working on one or more data containers, each of which would reside in a given location of memory. Spreading that loop across multiple cores would mean that each core would have to load that memory location into its cache, and if you're not lucky, having to synchronise accesses to those containers with the other cores.

Multithreading is usually done at a higher level. You have 1,000,000 data records to process, and four cores to do the work, so you give each core 250,000 records to process. (Instead of having the compiler and / or the OS automatically decide for each record which core it should go to.)
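A minimal Python sketch of that coarse-grained split, assuming the per-record work is a simple sum (a placeholder) and using hypothetical helper names (`split_into_chunks`, `process_chunk`, `process_all`). It uses threads from the standard library for brevity. In CPython a process pool would be needed for real CPU parallelism, but the partitioning logic is the same.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, num_workers):
    """Split `records` into `num_workers` contiguous slices of near-equal size."""
    base, extra = divmod(len(records), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)   # spread any remainder evenly
        chunks.append(records[start:start + size])
        start += size
    return chunks

def process_chunk(chunk):
    # Placeholder work: each worker walks only its own contiguous
    # slice, so workers never contend for the same records.
    return sum(chunk)

def process_all(records, num_workers=4):
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # The only synchronization is combining the per-worker results.
    return sum(partial_results)

records = range(1_000_000)
total = process_all(records, num_workers=4)
```

Each worker gets 250,000 consecutive records, which keeps its working set local instead of interleaving records between cores.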

To come back to the OP's subject, that can be done conveniently in Assembler, too. (Well, as much as anything can be done "conveniently" in Assembler...)

CPLH · Post by **CPLH** » Tue Jan 27, 2009 7:04 am

Solar, you seem to discredit the abilities of multiple cores to execute a loop at the same time because of problems with synchronization and memory. In that case, what does it mean for a compiler to support Automatic parallelization? After all it seems the compiler is trying to get it to do just that.

In any case, is transferring between processors hard? So far I haven't found a specific way to do it. The wiki page Multi_Processors is tied up with the old way of doing it. The Intel manuals have a chapter called "Multiple-Processor Management" which talks a lot about the APIC, checking and monitoring various things, and a general high-level view of initialization and synchronization; however, I have yet to find the specific way to actually transfer to the other processor.. other than that, it seems few people have written about transferring between multiple processors...

Solar · Post by **Solar** » Tue Jan 27, 2009 7:33 am

CPLH wrote:Solar, you seem to discredit the abilities of multiple cores to execute a loop at the same time because of problems with synchronization and memory. In that case, what does it mean for a compiler to support Automatic parallelization?

It means that "Though the quality of automatic parallelization has improved in the past several decades, fully automatic parallelization of sequential programs by compilers remains a grand challenge due to its need for complex program analysis and the unknown factors (such as input data range) during compilation.".

Have you ever heard of one of the compilers mentioned in the article? I mean, before you read about AP?

If you are ignorant of MT requirements, an AP compiler won't help you that much. If you know how to parallelize your algorithms, I don't think an AP compiler will help performance that much. Either way, I don't really see the benefit of the effort.

Then again, I claimed a JIT could never perform as well as native code, so what do I know.

CPLH wrote: In any case, is transferring between processors hard?

It's usually done by the threading library you use, so you don't have to worry about it much...

CPLH wrote: ...other than that, it seems few people have written about transferring between multiple processors...

Because most people leave the specifics to libpthread or whatever they use... and even the pthread maintainers won't write much about it because you usually want to avoid having to switch cores, because of caching and switch overhead...

bewing · Post by **bewing** » Tue Jan 27, 2009 1:48 pm

How you switch threads between cores:

Either the thread is running, or it is "swapped out".
If it is running, you wait until it uses up its timeslice, and swaps out.

When a thread is swapped out, it exists only as a structure in memory.
There are 2 possibilities: either you have one scheduler for all your threads on all your cores, or you have one scheduler per core.
Scenario #1: One scheduler for everything. When a new timeslice is available on a particular core, the scheduler looks through the queue of processes assigned to that core, and takes the next one (usually). So, to move a thread from one core to another, you delete the thread from the queue of core A, and add it to the end of the queue for core B. Done.
Scenario #2: One scheduler per core. The process is similar. The scheduler for core A deletes the thread from its queue. It then needs to communicate with the scheduler for core B, and tell it to add the thread (based on its ProcessID) to the core B queue -- and this message needs to be 100% reliable, or you get a zombie process.
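The two scenarios can be sketched in Python as plain queue manipulation. This is a simplified model, not kernel code: the class and method names are hypothetical, thread IDs are bare integers, and `queue.Queue` stands in for whatever reliable inter-core message mechanism a real OS would use. A real scheduler would also need locking around the shared queues in scenario #1.

```python
from collections import deque
import queue

class GlobalScheduler:
    """Scenario #1: one scheduler that owns a run queue per core."""

    def __init__(self, num_cores):
        self.run_queues = [deque() for _ in range(num_cores)]

    def add(self, core, thread_id):
        self.run_queues[core].append(thread_id)

    def migrate(self, thread_id, src, dst):
        # The thread is swapped out, so it is just an entry in a queue:
        # delete it from core A's queue, append it to core B's queue.
        self.run_queues[src].remove(thread_id)
        self.run_queues[dst].append(thread_id)

    def next_thread(self, core):
        q = self.run_queues[core]
        return q.popleft() if q else None

class CoreScheduler:
    """Scenario #2: one scheduler per core, talking via messages."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.run_queue = deque()
        # Unbounded thread-safe queue: put() never drops a message,
        # which models the "100% reliable" requirement.
        self.mailbox = queue.Queue()

    def send_thread_to(self, thread_id, other):
        self.run_queue.remove(thread_id)
        other.mailbox.put(thread_id)

    def drain_mailbox(self):
        # Called by the receiving core, e.g. at its next timeslice.
        while True:
            try:
                thread_id = self.mailbox.get_nowait()
            except queue.Empty:
                break
            self.run_queue.append(thread_id)

# Scenario #1 usage
sched = GlobalScheduler(num_cores=2)
sched.add(0, 101)
sched.add(0, 102)
sched.add(1, 201)
sched.migrate(101, src=0, dst=1)

# Scenario #2 usage
core_a, core_b = CoreScheduler(0), CoreScheduler(1)
core_a.run_queue.extend([301, 302])
core_a.send_thread_to(301, core_b)
core_b.drain_mailbox()
```

In scenario #2 the thread is in neither run queue between `send_thread_to` and `drain_mailbox`. That window is exactly where a lost message would leave a zombie.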

OSDev.org
