Looking for volunteer to develop parallel Bochs architecture
Looking for volunteer to develop parallel Bochs architecture
Looking for volunteer to develop parallel Bochs architecture.
Required knowledge in multi-treaded programming, both for Win32 and using pthreads.
No need to know about current Bochs emulation architecture - I could teach all required things in few e-mails.
BTW, I am not planning to rewrite all things from scratch, as bewing aims and therefore hope to finish and release the developments sometimes
Stanislav
Required knowledge in multi-treaded programming, both for Win32 and using pthreads.
No need to know about current Bochs emulation architecture - I could teach all required things in few e-mails.
BTW, I am not planning to rewrite all things from scratch, as bewing aims and therefore hope to finish and release the developments sometimes
Stanislav
Re: Looking for volunteer to develop parallel Bochs architecture
In case Bochs emulates multicore/multiprocessor machine I would like to be able to run every emulated core/processor as separate thread.berkus wrote:What's the idea of parallel bochs, i.e. what it aims to achieve and how exactly threads fit in there?
In current Bochs model they all running on the same thread and therefore total performance is SINGLE_THREAD_IPS/N_THREADS.
8 processor simulation runs >8 times slower than single threaded.
Some students team in China started this as university project many time ago and never finished - the model functionally broken.
They have website describing changes:
http://grid.hust.edu.cn/cluster/Virtual ... /main.html
Stanislav
Re: Looking for volunteer to develop parallel Bochs architecture
hi!
I've never really understood the bochs source.. but I have wrote my own emulation library and I have some knowledge of pthreads.. no experience with win32 however...
What exactly is your model on how to achieve this without completely rewriting bochs though? as multi-threading it would have device problems wouldn't it?
I've never really understood the bochs source.. but I have wrote my own emulation library and I have some knowledge of pthreads.. no experience with win32 however...
What exactly is your model on how to achieve this without completely rewriting bochs though? as multi-threading it would have device problems wouldn't it?
Re: Looking for volunteer to develop parallel Bochs architecture
I'll try to explain the main idea in few words:earlz wrote:hi!
I've never really understood the bochs source.. but I have wrote my own emulation library and I have some knowledge of pthreads.. no experience with win32 however...
What exactly is your model on how to achieve this without completely rewriting bochs though? as multi-threading it would have device problems wouldn't it?
- I always thought that analogy with real hardware worlds is the best idea. In the real hardware world you have CPU and buses going outside - Front Side Bus, PCI bus and etc.
The CPU(s) and memory conected through Front Side Bus. Once one of the CPUs doing something that have to affect main memory or other CPUs (like APIC messages) - it sent on FSB.
Devices connected through PCI, if some device read/write DRAM they end up accessing the same buses as well.
- I want to develop memory object abstaction for Bochs. Current memory object is freely accessable for everybody - you just call memory function and it writes data/returns you data.
This is fine and works but completely not thread safe. When two CPUs access the data in the memory - it would not necessary work.
The idea is to put memory object aside and make user/consumer interface to it.
A CPU which wants to read/write data would put a request to memory object and get a data from it. All the iteractions with memory through thread-safe proxy queues.
The serialization could be made more fine grained later - say one proxy queue per every 2M of emulated RAM - this way many emulated CPUs could access RAM in parallel.
The same with devices.
- A technical side of the rest is very simple. Leave all devices and memory in main Bochs thread, create a new thread for every CPU. In future - new threads for devices as well.
- Bochs CPU actually has very few links to the outside memory. Yes, it is duplicated many times accross the code but still could be collected back with single grep in the CPU code.
The links are: codefetch (in FetchDecode and BoundaryFetch, could be eliminated), direct memory access through host physical address (mainstream), mem object call with physical address (mainstream, not performance sensitive).
This is 10000 fit of the idea. Probably not perfect but should work
If you have better solution - you are welcome to suggest.
Stanislav
Re: Looking for volunteer to develop parallel Bochs architecture
As I am working on a multithreaded rewrite of bochs right now, I have a different perception of the problem than Stanislav.
To me, the problem looks like this:
You need to choose which piece of the code is responsible for the threads. Threads need to be created/started/paused/and destroyed based on signals, user commands, panics, and fatal sim errors. This needs to be accomplished "invisibly" from the point of view of the simulation. The thread code will certainly be OS-specific.
Generic memory operations are not a problem at all. All of the CPUs in an actual piece of hardware can all modify memory all at the same time (in a completely "unCPU thread-unsafe" way). To model this properly, memory *should* be modeled using thread-unsafe code.
However, there are some arithmetic opcodes that perform "read-modify-write" operations with the LOCK prefix on memory. Those operations need to be transactional, or to have complete ATOMIC locks -- which is rather hard to do in standard C++. This might mean that the instruction simulation code may need to include OS-specific locking mechanisms.
Devices are not much of a problem. Yes, they should be threadsafe.
Possibility #1: bochs needs to have a model built into it of the FSB. The FSB model will serialize all accesses to devices by all the CPUs. To some extent this just shifts the problem -- instead of making each device threadsafe, now the FSB model needs to be threadsafe and transactional. This would require a complete rewrite of every single device -- to route all accesses through the new FSB device.
Possibility #2: just put a CPU lock on each device. A well-built OS must be designed to prevent access conflicts anyway, so all you need to do is detect the times when access conflicts happen anyway, and then fail spectacularly with a PANIC or something.
Possibility #3: because of the reasoning in #2, ignore the problem. If two CPUs try to access a device simultaneously, then the OS is seriously buggy anyway and just let the conflicting accesses produce whatever weird results they will (this was Combuster's suggestion, but I personally don't like it).
"Time" and synchronization. You cannot predict what timeslices will be given to your threads, in what order. One thread may be given 10 seconds of simulation time, and all the rest may be given 0. You do not want the simulations on the threads to get too far out of sync with each other, so some threads may need to put themselves back to sleep, to wait for the laggard sim threads to catch up. So there needs to be a mechanism for each thread to know "how far ahead" of the rear-most thread it has gotten.
Bochs has asynchronous events that need to be "executed" at a particular simulation time. With multiple threads all having (slightly) different individual simulation times, this becomes problematic.
To me, the problem looks like this:
You need to choose which piece of the code is responsible for the threads. Threads need to be created/started/paused/and destroyed based on signals, user commands, panics, and fatal sim errors. This needs to be accomplished "invisibly" from the point of view of the simulation. The thread code will certainly be OS-specific.
Generic memory operations are not a problem at all. All of the CPUs in an actual piece of hardware can all modify memory all at the same time (in a completely "unCPU thread-unsafe" way). To model this properly, memory *should* be modeled using thread-unsafe code.
However, there are some arithmetic opcodes that perform "read-modify-write" operations with the LOCK prefix on memory. Those operations need to be transactional, or to have complete ATOMIC locks -- which is rather hard to do in standard C++. This might mean that the instruction simulation code may need to include OS-specific locking mechanisms.
Devices are not much of a problem. Yes, they should be threadsafe.
Possibility #1: bochs needs to have a model built into it of the FSB. The FSB model will serialize all accesses to devices by all the CPUs. To some extent this just shifts the problem -- instead of making each device threadsafe, now the FSB model needs to be threadsafe and transactional. This would require a complete rewrite of every single device -- to route all accesses through the new FSB device.
Possibility #2: just put a CPU lock on each device. A well-built OS must be designed to prevent access conflicts anyway, so all you need to do is detect the times when access conflicts happen anyway, and then fail spectacularly with a PANIC or something.
Possibility #3: because of the reasoning in #2, ignore the problem. If two CPUs try to access a device simultaneously, then the OS is seriously buggy anyway and just let the conflicting accesses produce whatever weird results they will (this was Combuster's suggestion, but I personally don't like it).
"Time" and synchronization. You cannot predict what timeslices will be given to your threads, in what order. One thread may be given 10 seconds of simulation time, and all the rest may be given 0. You do not want the simulations on the threads to get too far out of sync with each other, so some threads may need to put themselves back to sleep, to wait for the laggard sim threads to catch up. So there needs to be a mechanism for each thread to know "how far ahead" of the rear-most thread it has gotten.
Bochs has asynchronous events that need to be "executed" at a particular simulation time. With multiple threads all having (slightly) different individual simulation times, this becomes problematic.
Re: Looking for volunteer to develop parallel Bochs architecture
Personally I don't think this will ever be a problem - all (host) OS's have algorithms to deal with thread starvation - usually simple ones like using a round-robin scheduler."Time" and synchronization. You cannot predict what timeslices will be given to your threads, in what order. One thread may be given 10 seconds of simulation time, and all the rest may be given 0. You do not want the simulations on the threads to get too far out of sync with each other, so some threads may need to put themselves back to sleep, to wait for the laggard sim threads to catch up. So there needs to be a mechanism for each thread to know "how far ahead" of the rear-most thread it has gotten.
I personally feel that the host OS can be relied upon to avoid starvation. Obviously there is no guarantee that each thread will get exactly the same amount of runtime, but that wasn't your point...
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Looking for volunteer to develop parallel Bochs architecture
I'd have the processors sync back every X emulated instructions. If some are getting ahead too much, you can put those to sleep. You can also use the sync point to temporarily suspend the other cores to properly deal with LOCKed instructions on any random platform. (of course, with the option to use host-specific assembly for x86 hosts)
You can also use the sync point for running more cores with less threads, i.e. use N threads where N optimally is the amount of logical real CPUs, and each run several cycles on any of the emulated cores in turn, pretty much like it works now but with some custom "scheduler". That'd fix issues coming from timestamp-synchronous operations, especially when one thread "gets ahead" in seconds because it's doing number crunching (many completed instructions per real second) while the other's writing to the VGA (few completed instructions / real second) and then they start reading TSCs somewhere...
Although, that problem happens to some extent on real hardware too (try running UT 1 on an Athlon64x2, it completely chokes on the bad scheduling and the unsynchronized TSCs unless you force it onto one core)
As mentioned, putting a lock on the virtual southbridge/APIC bus would be a good idea to fix reentrancy issues with emulated hardware as well.
As for volunteering, I'm too random to actually stick long enough around voluntary projects, but I can try to help when I have some spare time and lively grey cells.
You can also use the sync point for running more cores with less threads, i.e. use N threads where N optimally is the amount of logical real CPUs, and each run several cycles on any of the emulated cores in turn, pretty much like it works now but with some custom "scheduler". That'd fix issues coming from timestamp-synchronous operations, especially when one thread "gets ahead" in seconds because it's doing number crunching (many completed instructions per real second) while the other's writing to the VGA (few completed instructions / real second) and then they start reading TSCs somewhere...
Although, that problem happens to some extent on real hardware too (try running UT 1 on an Athlon64x2, it completely chokes on the bad scheduling and the unsynchronized TSCs unless you force it onto one core)
As mentioned, putting a lock on the virtual southbridge/APIC bus would be a good idea to fix reentrancy issues with emulated hardware as well.
As for volunteering, I'm too random to actually stick long enough around voluntary projects, but I can try to help when I have some spare time and lively grey cells.
Re: Looking for volunteer to develop parallel Bochs architecture
So may be I could help you and you could share some of your work !bewing wrote: As I am working on a multithreaded rewrite of bochs right now, I have a different perception of the problem than Stanislav.
Yes and no. There is a problem in this observation. Intel architecture defines that all naturally aligned memory accesses of 1,2,4 and 8 bytes length are atomic.Generic memory operations are not a problem at all. All of the CPUs in an actual piece of hardware can all modify memory all at the same time (in a completely "unCPU thread-unsafe" way). To model this properly, memory *should* be modeled using thread-unsafe code.
This mean that if you read 2 bytes on one thread and write two bytes on another thread - the interleaved result is prohibited.
When load sees one byte written by store and another byte from original memory - this is Intel atomicity violation.
Allowing such violations are not so good idea - Windows won't boot for you example (checked) - many software assuming that this is how Intel works.
Many people build their thread safe queues without locks in software assuming that this assumption is correct.
Aligned 8 byte read in Bochs finally looks like (I intentionally took 8 byte example, it is hardest case):
Code: Select all
#define ReadHostQWordFromLittleEndian(hostPtr, nativeVar64) \
(nativeVar64) = *((Bit64u*)(hostPtr))
Bopchs aligns host memory by the same way as guest memory (until you changed that) so hostPtr is aligned. One problem solved.
But what if some particular compiler will break this line to two 4-byte loads ? You already not atomic. But, wait, almost ANY 32-bit compiler will do it !
Now look at the same macro in Big Endian machines (remember - Bochs is multiplatform):
Code: Select all
#define ReadHostQWordFromLittleEndian(hostPtr, nativeVar64) { \
(nativeVar64) = ((Bit64u) ((Bit8u *)(hostPtr))[0]) | \
(((Bit64u) ((Bit8u *)(hostPtr))[1])<<8) | \
(((Bit64u) ((Bit8u *)(hostPtr))[2])<<16) | \
(((Bit64u) ((Bit8u *)(hostPtr))[3])<<24) | \
(((Bit64u) ((Bit8u *)(hostPtr))[4])<<32) | \
(((Bit64u) ((Bit8u *)(hostPtr))[5])<<40) | \
(((Bit64u) ((Bit8u *)(hostPtr))[6])<<48) | \
(((Bit64u) ((Bit8u *)(hostPtr))[7])<<56); \
}
The another idea of doing memory access - software transactional memory.However, there are some arithmetic opcodes that perform "read-modify-write" operations with the LOCK prefix on memory. Those operations need to be transactional, or to have complete ATOMIC locks -- which is rather hard to do in standard C++. This might mean that the instruction simulation code may need to include OS-specific locking mechanisms.
Every read/write memory is a transaction. Might be good idea to look on - unfortunatelly I don't know what software transactional memory is
Yeah, I agree devices are not such a problem. Any of written above is good.Devices are not much of a problem. Yes, they should be threadsafe.
Possibility #1: bochs needs to have a model built into it of the FSB. The FSB model will serialize all accesses to devices by all the CPUs. To some extent this just shifts the problem -- instead of making each device threadsafe, now the FSB model needs to be threadsafe and transactional. This would require a complete rewrite of every single device -- to route all accesses through the new FSB device.
Possibility #2: just put a CPU lock on each device. A well-built OS must be designed to prevent access conflicts anyway, so all you need to do is detect the times when access conflicts happen anyway, and then fail spectacularly with a PANIC or something.
Possibility #3: because of the reasoning in #2, ignore the problem. If two CPUs try to access a device simultaneously, then the OS is seriously buggy anyway and just let the conflicting accesses produce whatever weird results they will (this was Combuster's suggestion, but I personally don't like it).
Actually I think I have a solution here."Time" and synchronization. You cannot predict what timeslices will be given to your threads, in what order. One thread may be given 10 seconds of simulation time, and all the rest may be given 0. You do not want the simulations on the threads to get too far out of sync with each other, so some threads may need to put themselves back to sleep, to wait for the laggard sim threads to catch up. So there needs to be a mechanism for each thread to know "how far ahead" of the rear-most thread it has gotten. Bochs has asynchronous events that need to be "executed" at a particular simulation time. With multiple threads all having (slightly) different individual simulation times, this becomes problematic.
BTW, "events that need to be "executed" at a particular simulation time" is a bug. I am trying to fix them all.
What do you think these events are ?
Stanislav
Re: Looking for volunteer to develop parallel Bochs architecture
Currently a virtual southbridge/APIC bus is not exists in Bochs yet. I need your help to define it.As mentioned, putting a lock on the virtual southbridge/APIC bus would be a good idea to fix reentrancy issues with emulated hardware as well.
Stanislav
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
Re: Looking for volunteer to develop parallel Bochs architecture
By my interpretation, that is not what it means. It says that individual, aligned, memory accesses are atomic. So if you read a aligned dword, then you get exactly that dword, not interleaved with another write that may or may not happen. if you read two separate bytes, then you have two accesses, each of which is atomic per definition, however not atomic as a pair.Yes and no. There is a problem in this observation. Intel architecture defines that all naturally aligned memory accesses of 1,2,4 and 8 bytes length are atomic.
This mean that if you read 2 bytes on one thread and write two bytes on another thread - the interleaved result is prohibited.
IIRC The manuals say that any read or write that fits a cache line is atomic, where being aligned is mentioned as a special, easy-to-understand case. But I'd have to look that up to be sure.
The apic bus is the bus that routes interrupt requests. It needs a lock to prevent race conditions when multiple interrupts happen at the same time (they are serialized by the bus semantics)Currently a virtual southbridge/APIC bus is not exists in Bochs yet. I need your help to define it.
The southbridge (although not entirely true) is where accesses go that are not part of system memory. So if you have the memory accessing functions, you can check if they address a part of memory, or a part of device space. if they access device space, the lock must be taken during the remainder of the access. (which would in turn serialize accesses over the system bus to the southbridge, and consequently, the ISA and PCI buses)
In essence, it means that the bus as an object does not exist, but that its semantics are "magically" enforced somewhere in the code that accesses something across that bus.
Re: Looking for volunteer to develop parallel Bochs architecture
This is exactly that I mean as wellCombuster wrote:By my interpretation, that is not what it means. It says that individual, aligned, memory accesses are atomic. So if you read a aligned dword, then you get exactly that dword, not interleaved with another write that may or may not happen. if you read two separate bytes, then you have two accesses, each of which is atomic per definition, however not atomic as a pair.Yes and no. There is a problem in this observation. Intel architecture defines that all naturally aligned memory accesses of 1,2,4 and 8 bytes length are atomic.
This mean that if you read 2 bytes on one thread and write two bytes on another thread - the interleaved result is prohibited.
I will refrase the above:
if you read a word (a dword/a qword) on one thread and write a word (a dword/a qword) on another thread - the interleaved result is prohibited.
Intel SDM Vol3a, section 7.2.3.1:IIRC The manuals say that any read or write that fits a cache line is atomic, where being aligned is mentioned as a special, easy-to-understand case. But I'd have to look that up to be sure.
The principles underlying the examples in this section apply to individual memory accesses and to locked read-modify-write instructions. The Intel-64 memory-ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
•Instructions that read or write a single byte.
•Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
•Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
•Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
In the real life, you are right, any read or write that fits a cache line would like to be atomic - but manual not requires this explicitly.
Actually I could show you bunch of real CPUs where a unaligned access of 16-byte is not atomic at all.
Even if it fits in a cache line perfectly.
Yeah, I know what it should do. Only need to translate from the definition to OOP model in C++The apic bus is the bus that routes interrupt requests. It needs a lock to prevent race conditions when multiple interrupts happen at the same time (they are serialized by the bus semantics)
The southbridge (although not entirely true) is where accesses go that are not part of system memory. So if you have the memory accessing functions, you can check if they address a part of memory, or a part of device space. if they access device space, the lock must be taken during the remainder of the access. (which would in turn serialize accesses over the system bus to the southbridge, and consequently, the ISA and PCI buses)
Re: Looking for volunteer to develop parallel Bochs architecture
As soon as I get the basic framework of it defined, you will be free to contribute with everyone else.stlw wrote: So may be I could help you and you could share some of your work !
All compilers already make 1, 2, and 4 byte accesses atomic. So that is not an issue. Yes, atomic qword accesses on 32bit CPUs is a problem -- but I don't think that it is worth destroying the memory model over. It is better just to put atomic locks in all the qword accessing instructions.Yes and no. There is a problem in this observation. Intel architecture defines that all naturally aligned memory accesses of 1,2,4 and 8 bytes length are atomic.
Which is EXACTLY why that macro is completely foolish and must be deleted! You do NOT compute byteswapping on the fly, from memory. You load from memory FIRST (atomically) to a local variable, and THEN byteswap your local variable.Now look at the same macro in Big Endian machines (remember - Bochs is multiplatform): ....
Forget about atomic reads.
Heh. Last time I suggested that, you said I must be crazy.Yeah, I agree devices are not such a problem. Any of written above is good.Possibility #2: just put a CPU lock on each device. A well-built OS must be designed to prevent access conflicts anyway, so all you need to do is detect the times when access conflicts happen anyway, and then fail spectacularly with a PANIC or something.
Interrupts.What do you think these events are?
Re: Looking for volunteer to develop parallel Bochs architecture
Defined or implemented ?bewing wrote:As soon as I get the basic framework of it defined, you will be free to contribute with everyone else.stlw wrote: So may be I could help you and you could share some of your work !
I don't know how far are you shooting but it seems me that you gonna lose multi-platform host support in Bochs and end up supporting only x86 hosts ...
Could you describe your architecture and why you want/would "rewrite Bochs completely" for it ?All compilers already make 1, 2, and 4 byte accesses atomic. So that is not an issue. Yes, atomic qword accesses on 32bit CPUs is a problem -- but I don't think that it is worth destroying the memory model over. It is better just to put atomic locks in all the qword accessing instructions.
If the things done right - it is going to be no more that incremental change. Even relatively small incremental change.
Currently this macro is fighting with one ugly artifact of big endian systems - unaligned memory access causes a fault and DO NOT WORK.Which is EXACTLY why that macro is completely foolish and must be deleted! You do NOT compute byteswapping on the fly, from memory. You load from memory FIRST (atomically) to a local variable, and THEN byteswap your local variable.
You can't blindly read a dword and when byteswap - if the dword appear to be unaligned in the guest memory - you gonna crash.
Some compilers support modifiers to avoid that crash - but not all of them.
Not the devices part was crazy - the CPU part and the way you planned to do it. But I see your way is changing with the time as well - you become more realistic.Heh. Last time I suggested that, you said I must be crazy.Yeah, I agree devices are not such a problem. Any of written above is good.
Or it is me starting to understand what are you suggestions
You mean external interrupts ? Internal interrupts are not a problem.Interrupts.What do you think these events are?
And external interrupts never were required to be delievered "at certain time and place".
When interrupts is ready the pic/apic sets INTR pin for CPU and waits for CPU to acknowledge the interrupt. The event are asyncronyous.
When CPU is ready to serve that interrupt it will set INTA pin back to pic/apic.
Stanislav
Re: Looking for volunteer to develop parallel Bochs architecture
They are basically the same thing for now. The implementation that I get working is the definition.stlw wrote: Defined or implemented ?
I am certain that at first Bochs itself only supported x86 hosts. Multi-platform support was added later, as patches. What I am shooting for is to make the entire program simpler, tighter, more readable, and slicker. If I destroy any features in the process, they can be added back in, later -- if anyone really wants them.I don't know how far are you shooting but it seems me that you gonna lose multi-platform host support in Bochs and end up supporting only x86 hosts ...
The Siminterface has been deleted. Most of textconfig has been deleted. Access to variables through the param tree is done for user display only, and the param tree has been completely redefined and simplified. All the CPU memory structures have been completely redefined for efficiency (as arrays). The selected debugger now is completely in charge of driving the simulation. The interface between the init code and the debuggers has been standardized. All the multithreaded mods above have been implemented. The instruction decoder has been made more efficient than the instruction cache, and therefore the instruction cache has been deleted. BX_EVENTS have been deleted. Breakpoints have been simplified and are managed for efficiency. All the code is written in pure C for portability and simplicity.Could you describe your architecture and why you want/would "rewrite Bochs completely" for it ?
If the things done right - it is going to be no more that incremental change. Even relatively small incremental change.
All devices are always compiled into the code, and "enabled" at runtime. Plugins have been eliminated. Most conditional code has been eliminated. Almost all compile-time options have been eliminated, and turned into runtime options (bochsrc or commandline). One GUI and one textmode debugger are selected at compile time (only).
Next, I am going to redefine and simplify the interface between the sim and the devices.
(And if my code doesn't run 5 times faster than yours, I'll eat my hat. High-efficiency code is my specialty, and I am even published once on the topic.)
You are confusing two separate issues here. First you say that ALIGNED accesses must be atomic, and now you are talking about UNALIGNED accesses. Clearly, you test for alignment first -- if the access is aligned then you do an atomic read to a local variable, and byteswap -- else you do the macro, and it is not atomic, and there is no problem. Almost all accesses will be aligned, because an unaligned access on x86 is usually a fault, and is always slow, so no OS or app should ever do it.Currently this macro is fighting with one ugly artifact of big endian systems - unaligned memory access causes a fault and DO NOT WORK.
You can't blindly read a dword and then byteswap - if the dword appear to be unaligned in the guest memory - you gonna crash.
Re: Looking for volunteer to develop parallel Bochs architecture
Hi,
Bochs is portable, and other architectures provide different memory coherency guarantees (often less strict than 80x86). Therefore Bochs can't rely on the host machine for correct guest machine behaviour. Also, if the host machine is an 80486 or older Pentium and the emulated guest is newer, there is no way for the host machine to emulate an 8-byte (64-bit) atomic read or write. Basically, Bochs has to handle memory coherency itself. More on this further down...
With one thread per emulated CPU, if some threads/CPUs are getting ahead of others then you could put those threads to sleep, but without OS specific code you can't sleep for less than 1 second (e.g. the "sleep()" function) and you probably won't be able to get all CPUs within about 500 ms of each other; and even with OS specific code you'd wouldn't have good synchronization in a lot of cases.
Alternatively you could have one thread per host CPU and use a list of "agents" that's sorted by how far behind it is; where each thread picks the agent that's the furtherest behind, does some work for that agent, then puts the agent back on the list and picks the next agent. That way you can make sure all emulated CPUs are always within N instruction of each other. This would also reduce the chance of lock contention (because you'll have less threads), and will improve performance for other reasons - a thread can switch to a different emulated CPU faster than the OS or a library can switch between threads; and the emulator knows which emulated CPU/s are more important and can make better decisions than an OS or library (unless the OS or library supports thread priorities and the emulator constantly changes thread priorities). Note: In theory POSIX/pthreads does support thread priorities, but it's a complete mess, especially for portable code.
I used the word "agents" instead of "CPUs" because devices need to be synchronized too. IMHO devices aren't very different from CPUs - they read and write memory and generate interrupts too. RAM isn't implemented as an agent and has no thread.
The same sorted list of agents would also be used for timers. Basically, each agent has a "last time" and a "next time", where "next time" determines where it is inserted on the list of agents. If an agent is not currently running, then another agent can change it's "next time" to any value that is higher than or equal to it's "last time" and higher than or equal to the other agent's current "last time" (and reposition the effected agent on the sorted list).
For an example of this, imagine a CPU that is idle ("CPU_A->last_time" is 10000) with it's local APIC timer set to go off ("CPU_A->next_time" is 30000). Another CPU is on the list with an earlier "next time" ("CPU_B->next_time" is 20000). A thread gets the second CPU from the list (because it's "next time" is earlier) and does "CPU_B->last_time = CPU_B->next_time", then emulates 10 instructions while incrementing "last time" ("CPU_B->last_time" is now 20010), then sends an IPI to the first CPU and does "if(CPU_B->last_time > CPU_A->next_time) {CPU_A->next_time = CPU_B->last_time} else {CPU_A->next_time = CPU_A->last_time}" (so that "CPU_A->next_time" is 20010 now) and changes the position of the first CPU on the sorted list.
For synchronizing with real time, I'd do something similar to the existing behaviour of "clock: sync=slowdown". When a thread goes to get the next agent from the sorted list of agents it checks if the agent's "next_time" is too far ahead of real time, and if it is then that thread does "sleep(real_time - next_time)".
Currently Bochs keeps track of cycles. This means you can't emulate software controlled throttling, thermal throttling or Intel's new "turbo boost" feature correctly. I think the emulator should keep track of 1/16ths of a cycle (e.g. with fixed point maths). Under normal conditions, when a CPU emulates one instruction it should add 16 sixteens of a cycle to it's time stamp (or "CPU_A->last_time += 16"). Then, if you want to emulate "turbo boost" you'd add 12 sixteens to the time stamp when the core is getting a boost. If the core is using software controlled throttling then you'd add 18, 21, 25, 32, 43, 64 or 128 sixteens of a cycle per instruction (for 87.5%, 75%, 63.5%, 50%, 37.5%, 25% or 12.5% duty cycle, respectively). You could also apply some sort of scaling factor for hyper-threading too (e.g. if both logical CPUs in the core are busy, then add 10 sixteens of a cycle per instruction, otherwise add 16 sixteens of a cycle per instruction). This can be done with lazy code - keep track of the number of instructions emulated at the current speed, then do "CPU_A->last_time += instructions * scaling_factor" when you need "CPU_A->last_time" to be correct or when the CPU's speed changes. Of course for RDTSC you'd do "EDX:EAX = CPU_A->last_time >> 4". Note: I know Bochs currently doesn't emulate any of this, and I'm not saying throttling and "turbo boost" should be implemented now - I just like the idea of being able to implement this sort of thing in the future.
For locks, because it'd only be using one thread per host CPU, and because critical sections would/should be short, I think it'd be better to spin until the lock becomes free (e.g. use spinlocks and/or the baker's algorithm, and *don't* use mutexes or semaphores). If you combine this with all of the above, it means that the only platform specific code you'd need is during startup (e.g. where the emulator gets it's CPU affinity and spawns one thread per host CPU), maybe when the emulator terminates threads, and a few (host CPU specific rather than platform specific) primitives for atomic operations.
[EDIT] Also note that for modern 80x86 hosts, it is possible to emulate the RAM without using any locks (as suggested by Bewing). This means RAM locking needs to be a compile-time option.[/EDIT]
AFAIK it's still entirely possible to implement a trace cache (or even do dynamic translation) on top of everything above.
I realize what I'm suggesting might not be easy to add to the current Bochs code. From my perspective, if it's worth doing it's worth doing it right, and there's plenty of things in the current Bochs code that needs to be rewritten anyway (for a very simple example, why can't I emulate a computer with 2 video cards and use PCI BARs to map each video card's ROM at different physical addresses?). What I'm suggesting is to build a proof-of-concept prototype, and if that seems promising, stop work on Bochs and release it as "Bochs 2.3.8", then start work on "Bochs 3.0" with the prototype (while doing bug fixes for "Bochs 2.3.8" but adding nothing new).
Cheers,
Brendan
Bochs is portable, and other architectures provide different memory coherency guarantees (often less strict than 80x86). Therefore Bochs can't rely on the host machine for correct guest machine behaviour. Also, if the host machine is an 80486 or older Pentium and the emulated guest is newer, there is no way for the host machine to emulate an 8-byte (64-bit) atomic read or write. Basically, Bochs has to handle memory coherency itself. More on this further down...
With one thread per emulated CPU, if some threads/CPUs are getting ahead of others then you could put those threads to sleep, but without OS specific code you can't sleep for less than 1 second (e.g. the "sleep()" function) and you probably won't be able to get all CPUs within about 500 ms of each other; and even with OS specific code you'd wouldn't have good synchronization in a lot of cases.
Alternatively you could have one thread per host CPU and use a list of "agents" that's sorted by how far behind it is; where each thread picks the agent that's the furtherest behind, does some work for that agent, then puts the agent back on the list and picks the next agent. That way you can make sure all emulated CPUs are always within N instruction of each other. This would also reduce the chance of lock contention (because you'll have less threads), and will improve performance for other reasons - a thread can switch to a different emulated CPU faster than the OS or a library can switch between threads; and the emulator knows which emulated CPU/s are more important and can make better decisions than an OS or library (unless the OS or library supports thread priorities and the emulator constantly changes thread priorities). Note: In theory POSIX/pthreads does support thread priorities, but it's a complete mess, especially for portable code.
I used the word "agents" instead of "CPUs" because devices need to be synchronized too. IMHO devices aren't very different from CPUs - they read and write memory and generate interrupts too. RAM isn't implemented as an agent and has no thread.
The same sorted list of agents would also be used for timers. Basically, each agent has a "last time" and a "next time", where "next time" determines where it is inserted on the list of agents. If an agent is not currently running, then another agent can change it's "next time" to any value that is higher than or equal to it's "last time" and higher than or equal to the other agent's current "last time" (and reposition the effected agent on the sorted list).
For an example of this, imagine a CPU that is idle ("CPU_A->last_time" is 10000) with it's local APIC timer set to go off ("CPU_A->next_time" is 30000). Another CPU is on the list with an earlier "next time" ("CPU_B->next_time" is 20000). A thread gets the second CPU from the list (because it's "next time" is earlier) and does "CPU_B->last_time = CPU_B->next_time", then emulates 10 instructions while incrementing "last time" ("CPU_B->last_time" is now 20010), then sends an IPI to the first CPU and does "if(CPU_B->last_time > CPU_A->next_time) {CPU_A->next_time = CPU_B->last_time} else {CPU_A->next_time = CPU_A->last_time}" (so that "CPU_A->next_time" is 20010 now) and changes the position of the first CPU on the sorted list.
For synchronizing with real time, I'd do something similar to the existing behaviour of "clock: sync=slowdown". When a thread goes to get the next agent from the sorted list of agents it checks if the agent's "next_time" is too far ahead of real time, and if it is then that thread does "sleep(real_time - next_time)".
Currently Bochs keeps track of cycles. This means you can't emulate software controlled throttling, thermal throttling or Intel's new "turbo boost" feature correctly. I think the emulator should keep track of 1/16ths of a cycle (e.g. with fixed point maths). Under normal conditions, when a CPU emulates one instruction it should add 16 sixteens of a cycle to it's time stamp (or "CPU_A->last_time += 16"). Then, if you want to emulate "turbo boost" you'd add 12 sixteens to the time stamp when the core is getting a boost. If the core is using software controlled throttling then you'd add 18, 21, 25, 32, 43, 64 or 128 sixteens of a cycle per instruction (for 87.5%, 75%, 63.5%, 50%, 37.5%, 25% or 12.5% duty cycle, respectively). You could also apply some sort of scaling factor for hyper-threading too (e.g. if both logical CPUs in the core are busy, then add 10 sixteens of a cycle per instruction, otherwise add 16 sixteens of a cycle per instruction). This can be done with lazy code - keep track of the number of instructions emulated at the current speed, then do "CPU_A->last_time += instructions * scaling_factor" when you need "CPU_A->last_time" to be correct or when the CPU's speed changes. Of course for RDTSC you'd do "EDX:EAX = CPU_A->last_time >> 4". Note: I know Bochs currently doesn't emulate any of this, and I'm not saying throttling and "turbo boost" should be implemented now - I just like the idea of being able to implement this sort of thing in the future.
For locks, because it'd only be using one thread per host CPU, and because critical sections would/should be short, I think it'd be better to spin until the lock becomes free (e.g. use spinlocks and/or the baker's algorithm, and *don't* use mutexes or semaphores). If you combine this with all of the above, it means that the only platform specific code you'd need is during startup (e.g. where the emulator gets it's CPU affinity and spawns one thread per host CPU), maybe when the emulator terminates threads, and a few (host CPU specific rather than platform specific) primitives for atomic operations.
Here's a full quote from (my copy of) Intel SDM Vol3a, section 7.1.1, "Guaranteed Atomic Operations" (highlighting added by me):stlw wrote:Intel SDM Vol3a, section 7.2.3.1:IIRC The manuals say that any read or write that fits a cache line is atomic, where being aligned is mentioned as a special, easy-to-understand case. But I'd have to look that up to be sure.
The principles underlying the examples in this section apply to individual memory accesses and to locked read-modify-write instructions. The Intel-64 memory-ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
•Instructions that read or write a single byte.
•Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
•Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
•Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
In the real life, you are right, any read or write that fits a cache line would like to be atomic - but manual not requires this explicitly.
Actually I could show you bunch of real CPUs where a unaligned access of 16-byte is not atomic at all.
Even if it fits in a cache line perfectly.
With POSIX/pthreads each lock is relatively large (I tested it - for my system "sizof(pthread_mutex_t)" is 40 bytes) and you can't really do fine-grained locking without using a lot of host RAM for locks. Screw that. Use one bit per lock per cache line (with spinlocks) - it works out to 2 MiB of locks for 1 GiB of emulated RAM. It also means the chance of lock contention will be very low (if the emulator is emulating 255 CPUs and if the host computer has 4 CPUs, then the chance of lock contention is about the same as the chance of 4 guest CPUs trying to access the same cache line at the same time). Basically, when an agent (CPU or device) accesses RAM it'd repeatedly try to set the lock bit for each cache line until all needed cache lines are locked, then do the read or write, then clear the lock bit/s. For most accesses only one cache line needs to be locked. If an access crosses a cache line boundary then the access can be split into 2 separate accesses. If the instruction uses the "LOCK" prefix *and* crosses a cache line boundary, then you'd need to lock (at most) 2 cache lines, and you'd have to do them in order (e.g. lowest cache line first, then highest cache line) to avoid deadlocks.Intel wrote:The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
- Reading or writing a byte
- Reading or writing a word aligned on a 16-bit boundary
- Reading or writing a doubleword aligned on a 32-bit boundary
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
- Reading or writing a quadword aligned on a 64-bit boundary
- 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
- Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
[EDIT] Also note that for modern 80x86 hosts, it is possible to emulate the RAM without using any locks (as suggested by Bewing). This means RAM locking needs to be a compile-time option.[/EDIT]
AFAIK it's still entirely possible to implement a trace cache (or even do dynamic translation) on top of everything above.
I realize what I'm suggesting might not be easy to add to the current Bochs code. From my perspective, if it's worth doing it's worth doing it right, and there's plenty of things in the current Bochs code that needs to be rewritten anyway (for a very simple example, why can't I emulate a computer with 2 video cards and use PCI BARs to map each video card's ROM at different physical addresses?). What I'm suggesting is to build a proof-of-concept prototype, and if that seems promising, stop work on Bochs and release it as "Bochs 2.3.8", then start work on "Bochs 3.0" with the prototype (while doing bug fixes for "Bochs 2.3.8" but adding nothing new).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.