Hi,
Bochs is portable, and other architectures provide different memory coherency guarantees (often less strict than 80x86). Therefore Bochs can't rely on the host machine for correct guest machine behaviour. Also, if the host machine is an 80486 (or older) and the emulated guest is a Pentium or newer, there is no way for the host machine to emulate an 8-byte (64-bit) atomic read or write. Basically, Bochs has to handle memory coherency itself. More on this further down...
With one thread per emulated CPU, if some threads/CPUs are getting ahead of others then you could put those threads to sleep; but without OS specific code you can't sleep for less than 1 second (e.g. with the "sleep()" function) and you probably won't be able to get all CPUs within about 500 ms of each other; and even with OS specific code you wouldn't have good synchronization in a lot of cases.
Alternatively you could have one thread per host CPU and use a list of "agents" that's sorted by how far behind each agent is; where each thread picks the agent that's the furthest behind, does some work for that agent, then puts the agent back on the list and picks the next agent (see the sketch below). That way you can make sure all emulated CPUs are always within N instructions of each other. This would also reduce the chance of lock contention (because you'll have fewer threads), and will improve performance for other reasons - a thread can switch to a different emulated CPU faster than the OS or a library can switch between threads; and the emulator knows which emulated CPU/s are more important and can make better decisions than an OS or library (unless the OS or library supports thread priorities and the emulator constantly changes thread priorities).
Note: In theory POSIX/pthreads does support thread priorities, but it's a complete mess, especially for portable code.
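As a very rough sketch, the main loop of each worker thread might look something like this (all the names here are hypothetical - none of this is existing Bochs code, and "agent_run_slice" stands in for whatever chunk of emulation the agent actually does):
[code]
#include <stdint.h>

struct Agent;   /* an emulated CPU or device */

/* Hypothetical helpers: */
Agent *agent_list_take_earliest(void);        /* remove the agent with the smallest "next_time" */
void agent_list_insert_sorted(Agent *agent);  /* re-insert the agent, sorted by "next_time" */
void agent_run_slice(Agent *agent, int max_instructions); /* emulate up to N instructions */

/* One of these loops runs per host CPU (not per emulated CPU) */
void worker_thread_main(void)
{
    for (;;) {
        Agent *agent = agent_list_take_earliest();  /* the agent that's furthest behind */
        agent_run_slice(agent, 1000);               /* advances the agent's "last time" */
        agent_list_insert_sorted(agent);            /* reposition it by its new "next time" */
    }
}
[/code]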
I used the word "agents" instead of "CPUs" because devices need to be synchronized too. IMHO devices aren't very different from CPUs - they read and write memory and generate interrupts too. RAM isn't implemented as an agent and has no thread.
The same sorted list of agents would also be used for timers. Basically, each agent has a "last time" and a "next time", where "next time" determines where it is inserted on the list of agents. If an agent is not currently running, then another agent can change its "next time" to any value that is higher than or equal to its "last time" and higher than or equal to the other agent's current "last time" (and reposition the affected agent on the sorted list).
For an example of this, imagine a CPU that is idle ("CPU_A->last_time" is 10000) with its local APIC timer set to go off ("CPU_A->next_time" is 30000). Another CPU is on the list with an earlier "next time" ("CPU_B->next_time" is 20000). A thread gets the second CPU from the list (because its "next time" is earlier) and does "CPU_B->last_time = CPU_B->next_time", then emulates 10 instructions while incrementing "last time" ("CPU_B->last_time" is now 20010), then sends an IPI to the first CPU and does "if(CPU_B->last_time > CPU_A->last_time) {CPU_A->next_time = CPU_B->last_time} else {CPU_A->next_time = CPU_A->last_time}" (so that "CPU_A->next_time" is 20010 now) and changes the position of the first CPU on the sorted list.
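In code, that IPI case might look something like this (a sketch using the field names from the example above - not existing Bochs code):
[code]
#include <stdint.h>

struct CPU {
    uint64_t last_time;   /* how far this CPU has been emulated so far */
    uint64_t next_time;   /* when it next needs attention; the sort key for the agent list */
};

/* When a CPU sends an IPI to an idle target CPU, wake the target no earlier
   than its own "last time" and no earlier than the sender's current "last
   time" (so the interrupt can't arrive in the target's past) */
void forward_time_on_ipi(CPU *sender, CPU *target)
{
    if (sender->last_time > target->last_time) {
        target->next_time = sender->last_time;
    } else {
        target->next_time = target->last_time;
    }
    /* ...then reposition "target" on the sorted list of agents */
}
[/code]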
For synchronizing with real time, I'd do something similar to the existing behaviour of "clock: sync=slowdown". When a thread goes to get the next agent from the sorted list of agents it checks if the agent's "next_time" is too far ahead of real time, and if it is then that thread does "sleep(next_time - real_time)".
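For example (a sketch - "get_real_time()" and "do_sleep()" are hypothetical platform wrappers that use the same time units as "next_time"):
[code]
#include <stdint.h>

/* Hypothetical platform wrappers: */
uint64_t get_real_time(void);
void do_sleep(uint64_t duration);

/* Before running an agent, stall if it has gotten too far ahead of real time */
void throttle_to_real_time(uint64_t next_time)
{
    uint64_t now = get_real_time();
    if (next_time > now) {
        do_sleep(next_time - now);
    }
}
[/code]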
Currently Bochs keeps track of cycles. This means you can't emulate software controlled throttling, thermal throttling or Intel's new "turbo boost" feature correctly. I think the emulator should keep track of 1/16ths of a cycle (e.g. with fixed point maths). Under normal conditions, when a CPU emulates one instruction it should add 16 sixteenths of a cycle to its time stamp (or "CPU_A->last_time += 16"). Then, if you want to emulate "turbo boost" you'd add 12 sixteenths to the time stamp when the core is getting a boost. If the core is using software controlled throttling then you'd add 18, 21, 25, 32, 43, 64 or 128 sixteenths of a cycle per instruction (for 87.5%, 75%, 62.5%, 50%, 37.5%, 25% or 12.5% duty cycle, respectively). You could also apply some sort of scaling factor for hyper-threading too (e.g. if only one logical CPU in the core is busy, then add 10 sixteenths of a cycle per instruction, otherwise add 16 sixteenths of a cycle per instruction). This can be done with lazy code - keep track of the number of instructions emulated at the current speed, then do "CPU_A->last_time += instructions * scaling_factor" when you need "CPU_A->last_time" to be correct or when the CPU's speed changes. Of course for RDTSC you'd do "EDX:EAX = CPU_A->last_time >> 4".
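A sketch of the lazy version (hypothetical names; "scaling_factor" is in 1/16ths of a cycle per instruction, so 16 means full speed and 32 means a 50% duty cycle):
[code]
#include <stdint.h>

struct CPU {
    uint64_t last_time;       /* in 1/16ths of a cycle */
    uint64_t instructions;    /* instructions emulated since "last_time" was updated */
    uint64_t scaling_factor;  /* 1/16ths of a cycle per instruction at the current speed */
};

/* Fold pending instructions into "last_time"; call this whenever "last_time"
   needs to be correct, or just before "scaling_factor" changes */
void update_last_time(CPU *cpu)
{
    cpu->last_time += cpu->instructions * cpu->scaling_factor;
    cpu->instructions = 0;
}

/* RDTSC reports whole cycles, so discard the 4 fractional bits */
uint64_t read_tsc(CPU *cpu)
{
    update_last_time(cpu);
    return cpu->last_time >> 4;   /* "EDX:EAX = CPU_A->last_time >> 4" */
}
[/code]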
Note: I know Bochs currently doesn't emulate any of this, and I'm not saying throttling and "turbo boost" should be implemented now - I just like the idea of being able to implement this sort of thing in the future.
For locks, because it'd only be using one thread per host CPU, and because critical sections would/should be short, I think it'd be better to spin until the lock becomes free (e.g. use spinlocks and/or Lamport's bakery algorithm, and *don't* use mutexes or semaphores). If you combine this with all of the above, it means that the only platform specific code you'd need is during startup (e.g. where the emulator gets its CPU affinity and spawns one thread per host CPU), maybe when the emulator terminates threads, and a few (host CPU specific rather than platform specific) primitives for atomic operations.
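For example, with GCC those primitives could be wrapped up like this (a sketch - these two functions would be the only host-CPU-specific part, and other compilers would need their own versions):
[code]
typedef volatile int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1)) {
        while (*lock != 0) {
            /* spin (read-only) until the lock looks free, then retry */
        }
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);
}
[/code]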
stlw wrote: IIRC the manuals say that any read or write that fits in a cache line is atomic, where being aligned is mentioned as a special, easy-to-understand case. But I'd have to look that up to be sure.
Intel SDM Vol3a, section 7.2.3.1:
The principles underlying the examples in this section apply to individual memory accesses and to locked read-modify-write instructions. The Intel-64 memory-ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
In real life you are right - any read or write that fits within a cache line is likely to be atomic - but the manual doesn't require this explicitly.
Actually, I could show you a bunch of real CPUs where an unaligned 16-byte access is not atomic at all.
Even if it fits in a cache line perfectly.
Here's a full quote from (my copy of) Intel SDM Vol3a, section 7.1.1, "Guaranteed Atomic Operations" (highlighting added by me):
Intel wrote: The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
- Reading or writing a byte
- Reading or writing a word aligned on a 16-bit boundary
- Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
- Reading or writing a quadword aligned on a 64-bit boundary
- 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
- *Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line*
With POSIX/pthreads each lock is relatively large (I tested it - for my system "sizeof(pthread_mutex_t)" is 40 bytes) and you can't really do fine-grained locking without using a lot of host RAM for locks. Screw that. Use one lock bit per cache line (with spinlocks) - it works out to 2 MiB of locks for 1 GiB of emulated RAM. It also means the chance of lock contention will be very low (if the emulator is emulating 255 CPUs and the host computer has 4 CPUs, then the chance of lock contention is about the same as the chance of 4 guest CPUs trying to access the same cache line at the same time). Basically, when an agent (CPU or device) accesses RAM it'd repeatedly try to set the lock bit for each cache line until all needed cache lines are locked, then do the read or write, then clear the lock bit/s. For most accesses only one cache line needs to be locked. If an access crosses a cache line boundary then the access can be split into 2 separate accesses. If the instruction uses the "LOCK" prefix *and* crosses a cache line boundary, then you'd need to lock (at most) 2 cache lines, and you'd have to do them in order (e.g. lowest cache line first, then highest cache line) to avoid deadlocks.
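A sketch of that (hypothetical code - assumes 64-byte cache lines, and a "lock_bitmap" with one bit per cache line of emulated RAM; the GCC builtins are the same sort of host-CPU-specific primitives mentioned earlier):
[code]
#include <stdint.h>

#define CACHE_LINE_SHIFT 6     /* assume 64-byte cache lines */

extern uint8_t lock_bitmap[];  /* 1 bit per cache line; 2 MiB covers 1 GiB of emulated RAM */

/* Atomically try to set the lock bit for one cache line; returns nonzero
   if the bit was already set (i.e. some other agent holds the lock) */
static int try_lock_line(uint64_t paddr)
{
    uint64_t line = paddr >> CACHE_LINE_SHIFT;
    uint8_t mask = (uint8_t)(1 << (line & 7));
    return __sync_fetch_and_or(&lock_bitmap[line >> 3], mask) & mask;
}

static void lock_line(uint64_t paddr)
{
    while (try_lock_line(paddr)) {
        /* spin until this cache line's lock bit is free */
    }
}

static void unlock_line(uint64_t paddr)
{
    uint64_t line = paddr >> CACHE_LINE_SHIFT;
    uint8_t mask = (uint8_t)(1 << (line & 7));
    __sync_fetch_and_and(&lock_bitmap[line >> 3], (uint8_t)~mask);
}

/* For a LOCKed access that crosses a cache line boundary, lock the lowest
   cache line first (then the highest) to avoid deadlocks */
[/code]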
[EDIT] Also note that for modern 80x86 hosts, it is possible to emulate the RAM without using any locks (as suggested by Bewing). This means RAM locking needs to be a compile-time option.[/EDIT]
AFAIK it's still entirely possible to implement a trace cache (or even do dynamic translation) on top of everything above.
I realize what I'm suggesting might not be easy to add to the current Bochs code. From my perspective, if it's worth doing, it's worth doing right; and there are plenty of things in the current Bochs code that need to be rewritten anyway (for a very simple example, why can't I emulate a computer with 2 video cards and use PCI BARs to map each video card's ROM at different physical addresses?). What I'm suggesting is to build a proof-of-concept prototype, and if that seems promising, stop work on Bochs and release it as "Bochs 2.3.8", then start work on "Bochs 3.0" based on the prototype (while doing bug fixes for "Bochs 2.3.8" but adding nothing new).
Cheers,
Brendan