Hi,
wanted:
- OS for the newest AMD-s, lots of RAM, 64 bit mode only
- microkernel design
- supa-fast IPC avoiding TLB flushing if possible
idea:
- use AMD SVM
- the kernel pretends to be a VMM (supervisor)
- user processes pretend to be guest OS-s
- instead of SYSCALL we use VMEXIT
- instead of SYSRET we use VMRUN
- each of the 63 most recently run processes can have its own ASID,
  so the TLB is preserved on "context switches"
- maybe we can use a single VMCB for all non-privileged processes,
  making use of VMCB caching;
  before VMRUN the kernel would just change the ASID
questions:
- anything horribly wrong here?
- is VMRUN insanely slow?
- is anything else going to be slow enough to eat the benefit?
- is MOV from mem going to be any slower when ASID-s are used?
Best,
Process switching via VMSTART/VMEXIT to avoid TLB flushing?
Re: Process switching via VMSTART/VMEXIT to avoid TLB flushi
Hi,
You already answered all your questions yourself:
1. Yes, it will be slower than the classic approach.
2. VMRUN is significantly slower than SYSRET, by an order of magnitude.
3. Similarly, VMEXIT is significantly slower than SYSCALL, by an order of magnitude.
4. Any MOV to/from memory will be slower due to the extra levels in the virtualized page walk.
5. There are quite a few cases where you will hit a VMEXIT even when you don't want to, and this is going to be much slower.
And, btw, you can just use PCID, recently introduced by Intel, which does exactly this (it does not flush TLBs between task switches). It is similar to ASID but works without virtualization as well.
Stanislav
Re: Process switching via VMSTART/VMEXIT to avoid TLB flushi
Thx a lot Stas for your kind response
stlw wrote:
> and, btw, you can just use PCID recently introduced by Intel which does exactly this (do not flush TLBs between task switches), it is similar to ASID but works without virtualization as well
This would be ideal.
Would you have any pointers on how to use it?
CPU instruction mnemonics?
Half an hour of frantic googling for "itel context id" has yielded 0 useful info :-/
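For reference, a minimal sketch of how PCID is driven, as far as the Intel SDM describes it: there is no dedicated "switch context ID" instruction. Once CR4.PCIDE (bit 17) is set, the low 12 bits of CR3 carry the PCID, and setting bit 63 of the value written to CR3 asks the CPU to keep the TLB entries tagged with that PCID. The helper name make_cr3 and its parameters are made up for illustration; the actual MOV to CR3 has to happen in ring 0 and is not shown.

```c
#include <stdint.h>

#define CR4_PCIDE      (1ULL << 17)  /* CR4 bit 17: enable PCIDs (long mode only) */
#define CR3_PCID_MASK  0xFFFULL      /* CR3 bits 11:0 hold the PCID when CR4.PCIDE = 1 */
#define CR3_NOFLUSH    (1ULL << 63)  /* bit 63 of the MOV-to-CR3 operand: keep this PCID's TLB entries */

/* Hypothetical helper: compose the value a kernel would load into CR3.
 * pml4_phys is the 4 KiB aligned physical address of the top-level page table. */
static uint64_t make_cr3(uint64_t pml4_phys, uint16_t pcid, int keep_tlb)
{
    uint64_t cr3 = (pml4_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
    if (keep_tlb)
        cr3 |= CR3_NOFLUSH;   /* do not invalidate entries tagged with this PCID */
    return cr3;
}
```

Support is reported by CPUID leaf 1, ECX bit 17, so a kernel would check that before setting CR4.PCIDE.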
Re: Process switching via VMSTART/VMEXIT to avoid TLB flushi
Update: have found sections 4.10.1 and 4.10.4.1 in "Intel (R) 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 1, 2A, 2B, 3A and 3B". Looking forward to the joy of reading
A quick question though:
- how likely do you think AMD is to introduce PCID?
- it's not coming in the Bulldozers, is it?
Re: Process switching via VMSTART/VMEXIT to avoid TLB flushi
Hi,
atagunov wrote:
> - how likely do you think AMD is to introduce PCID?
I'd assume that it's very likely that AMD (and VIA) will also implement PCID when they can.
atagunov wrote:
> - it's not coming in the Bulldozers, is it?
It takes a relatively long time (several years) for a CPU to go from "concept" to "initial design", then through all the testing/validation, into reference chips, and then end up in production. It's cheap and easy to make changes in the beginning, and hard and expensive to make changes near the end.
I wouldn't be too surprised if AMD didn't find out about PCID soon enough, and when they did find out about PCID it was too late (and too expensive to add to Bulldozer). I'd also assume that if it's not in "1st Generation Bulldozer" it will be in "2nd Generation Bulldozer".
However...
TLBs aren't infinitely large. If one process does enough work, then the "least recently used" eviction algorithm will cause TLB entries for other processes to be evicted (despite PCID). The advantages of PCID may not be as much as you're expecting (and really depends on how many TLB entries each process uses, the order that processes use the same CPU, the size of the TLB, etc - it's hard to predict how much it might help under various workloads).
When there are multiple CPUs, you need to keep the TLBs synchronised. If one CPU changes the paging structures, it needs to invalidate the affected TLB entries on that CPU, but it also needs to tell the other CPUs to invalidate the affected TLB entries. This is called "multi-CPU TLB shootdown", and is typically done with IPIs. It's also expensive. There are a lot of ways to avoid it in certain situations. For example, (without PCID) if you change a page table entry for a single-threaded process that is currently running on one CPU, then you know that no other CPU can have that process' TLB entries, and therefore you can safely avoid the "multi-CPU TLB shootdown". With PCID, the number of situations where "multi-CPU TLB shootdown" can be avoided is reduced. For the same "single-threaded process" example, you can't assume that other CPUs don't have the affected TLB entries (even though you know that the process can't be running on other CPUs), and therefore you can't easily avoid the expensive "multi-CPU TLB shootdown".
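The difference can be sketched as a question of which CPU set has to receive the IPI. This is only an illustrative model: the struct, its two masks and the function name are hypothetical bookkeeping, not anything the hardware provides, and it assumes user mappings are not marked global.

```c
#include <stdint.h>

typedef uint64_t cpumask_t;   /* one bit per CPU, up to 64 CPUs in this sketch */

/* Hypothetical per-address-space bookkeeping kept by the kernel. */
struct addr_space {
    cpumask_t running_on;   /* CPUs executing this address space right now */
    cpumask_t may_cache;    /* CPUs that ran it and have not flushed its tagged entries since */
};

/* CPUs (other than the caller) that must be sent a shootdown IPI after a
 * page table change. Without PCID a CR3 reload drops the old entries, so only
 * CPUs currently running the address space can hold them. With PCID, entries
 * survive the switch, so every CPU that may still cache them is a target. */
static cpumask_t shootdown_targets(const struct addr_space *as,
                                   unsigned self_cpu, int pcid_enabled)
{
    cpumask_t holders = pcid_enabled ? as->may_cache : as->running_on;
    return holders & ~((cpumask_t)1 << self_cpu);
}
```

For a single-threaded process running on CPU 2 that previously ran on CPUs 0 and 5, the first case yields an empty mask (no IPI), while the second still has to reach CPUs 0 and 5.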
Also, the "process context identifiers" are 12-bit numbers, so you get 4096 process context IDs. If the OS supports more than 4096 processes at the same time, then it has to have some sort of dynamic ID management (e.g. use the IDs for the most recently used processes, where less recently used processes have no ID, and reassign IDs when a less recently used process is given some CPU time). Even if the OS doesn't support more than 4096 processes at the same time, it would still need to track which IDs are currently in use, because it still needs to support 4096 processes at different times. This "ID management" adds some overhead to the OS somewhere.
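One possible shape for that ID management is a table that hands the least recently used ID to a process that has none. The sketch below is an assumption-laden illustration: the names id_table and id_acquire are invented, process id 0 is reserved to mean "free slot", and the linear scan is exactly the kind of overhead mentioned above (a real kernel would likely want something cheaper).

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_IDS 4096   /* size of the 12-bit PCID space */

struct id_table {
    unsigned capacity;            /* number of IDs handed out, <= MAX_IDS */
    uint64_t clock;               /* logical time, bumped on every acquire */
    uint32_t owner[MAX_IDS];      /* process holding each ID, 0 = free (pids must be nonzero) */
    uint64_t last_used[MAX_IDS];  /* clock value of the last acquire, 0 = never used */
};

/* Return the ID to use for `pid`. *must_flush is set when the ID was free or
 * taken from another process, i.e. entries tagged with it must be invalidated
 * before the new owner runs. */
static unsigned id_acquire(struct id_table *t, uint32_t pid, bool *must_flush)
{
    unsigned victim = 0;
    t->clock++;
    for (unsigned i = 0; i < t->capacity; i++) {
        if (t->owner[i] == pid) {            /* still holds an ID: cached entries stay valid */
            t->last_used[i] = t->clock;
            *must_flush = false;
            return i;
        }
        if (t->last_used[i] < t->last_used[victim])
            victim = i;                      /* track the least recently used slot */
    }
    t->owner[victim] = pid;                  /* reassign a free or stale ID */
    t->last_used[victim] = t->clock;
    *must_flush = true;
    return victim;
}
```

With a capacity of 2 and processes 10, 20, 10, 30 scheduled in that order, process 30 takes over the ID of process 20 (the least recently used), and process 20 needs a fresh ID plus a flush the next time it runs.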
Basically (depending on a very large number of things), the disadvantages of PCID might outweigh the advantages, and PCID may actually make performance worse in some situations.
The best way to avoid overhead is to avoid task switches.
atagunov wrote:
> wanted:
> - supa-fast IPC avoiding TLB flushing if possible
"Supa-fast IPC" (e.g. rendezvous/synchronous messaging) typically doesn't avoid task switches. Slower IPC (e.g. asynchronous, where the message data is only put onto the receiver's queue) can avoid task switches. Therefore an OS that uses "Supa-fast IPC" may be slower than an OS that uses "slower IPC".
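A toy count makes the point concrete. The model is an assumption, not a measurement: a synchronous send is taken to cost one switch to the receiver and one back per message, while an asynchronous sender is taken to keep running until a queue of queue_len messages is full before the receiver gets the CPU. Both function names are hypothetical.

```c
/* Task switches needed to deliver n one-way messages under the toy model. */

/* Rendezvous: sender -> receiver -> sender for every message. */
static unsigned long sync_switches(unsigned long n)
{
    return 2 * n;
}

/* Queued: one round trip per full (or final partial) queue.
 * queue_len must be greater than zero. */
static unsigned long async_switches(unsigned long n, unsigned long queue_len)
{
    unsigned long batches = (n + queue_len - 1) / queue_len;  /* ceiling division */
    return 2 * batches;
}
```

Under these assumptions 1000 messages cost 2000 switches synchronously and 32 with a 64-entry queue, so even a much cheaper individual switch may not make up the difference.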
Cheers,
Brendan
Re: Process switching via VMSTART/VMEXIT to avoid TLB flushi
Brendan wrote:
> With PCID, the number of situations where "multi-CPU TLB shootdown" can be avoided is reduced ...
> "Supa-fast IPC" typically doesn't avoid task switches ... an OS that uses "Supa-fast IPC" may be slower than an OS that uses "slower IPC".
Thx a lot for your helpful insights, Brendan.
Feels good to have my brain pipeline stuffed