After implementing new/delete, I am now working on tasks. Unfortunately, I haven't found any documentation on task switching in 64-bit mode, so I read AMD's manual and tried to come up with my own design. I plan to do it like this:
1. allocate memory for elf64 image
2. parse and load elf64 image into memory.
3. build new page map (In kernel space, I use identity paging)
4. load STAR and LSTAR
5. load CR3
6. save GPRs, RFLAGS, load new RSP, RFLAGS
7. SYSRET
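For step 4, here is a minimal sketch of how the STAR value could be composed. The GDT layout (0x08 kernel code, 0x10 kernel data, 0x18 user data, 0x20 user 64-bit code) and the helper names are assumptions made for illustration, not something given in this thread. Note that SYSRET takes the new RIP from RCX and the new RFLAGS from R11, and does not touch RSP, so the user stack pointer has to be loaded by hand just before step 7.

```c
#include <assert.h>
#include <stdint.h>

/* MSR numbers as documented in the AMD64 manual. */
#define MSR_EFER   0xC0000080u  /* bit 0 (SCE) enables SYSCALL/SYSRET */
#define MSR_STAR   0xC0000081u  /* segment selector bases */
#define MSR_LSTAR  0xC0000082u  /* 64-bit SYSCALL entry point */
#define MSR_SFMASK 0xC0000084u  /* RFLAGS bits cleared on SYSCALL */

/* Assumed GDT layout (hypothetical, pick your own). */
#define KERNEL_CS   0x08u  /* kernel data follows at 0x10 */
#define SYSRET_BASE 0x10u  /* user data at +8, user 64-bit code at +16 */

/* STAR[47:32] is the SYSCALL base: CS = base, SS = base + 8.
 * STAR[63:48] is the SYSRET base: when returning to 64-bit code,
 * SS = base + 8 and CS = base + 16, both with RPL forced to 3. */
static inline uint64_t make_star(uint16_t syscall_base, uint16_t sysret_base)
{
    assert((syscall_base & 7u) == 0);  /* kernel selector: GDT, RPL 0 */
    return ((uint64_t)sysret_base << 48) | ((uint64_t)syscall_base << 32);
}

/* Selectors the CPU will load on a SYSRET to 64-bit user code. */
static inline uint16_t sysret_user_cs(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 16) | 3);
}

static inline uint16_t sysret_user_ss(uint64_t star)
{
    return (uint16_t)(((star >> 48) + 8) | 3);
}
```

The kernel would write the result with WRMSR to MSR_STAR, put the syscall entry address in MSR_LSTAR, and set EFER.SCE; those privileged steps are left out so the sketch can be checked in user space.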
But I have a few questions on these steps:
First, and most important: are these steps correct? I don't want to spend days coding and debugging only to find out that the approach is completely wrong.
Second, loading CR3 is costly; how can I reduce the cost? I have to load CR3 twice when switching tasks (first with the kernel page map, then with the new task's page map), and task switches happen hundreds of times a second. If I cannot reduce the cost, my kernel will be very slow.
Cheers.
torshie
How to start a userspace task in 64-bit mode?
Re: How to start a userspace task in 64-bit mode?
torshie wrote: Second, load CR3 is costly, how can I reduce the cost? Because I have to load CR3 twice when switching tasks (the first time is load CR3 with kernel page map, the second time is load CR3 with new task page map)
You would typically load CR3 only once, by mapping the kernel pages to all processes. You can't avoid loading CR3 once of course, unless you don't care about memory protection.
JAL
Re: How to start a userspace task in 64-bit mode?
torshie wrote: Second, load CR3 is costly, how can I reduce the cost? Because I have to load CR3 twice when switching tasks (the first time is load CR3 with kernel page map, the second time is load CR3 with new task page map)
jal wrote: You would typically load CR3 only once, by mapping the kernel pages to all processes.
This is a great idea, thank you, JAL.
Re: How to start a userspace task in 64-bit mode?
To create a task, load the image into memory using whatever executable format you wish and then jump to the entry point. Of course this is an oversimplification, and you also have to take care of things such as dynamic linking, relocations, and initializing memory. However, once the image you want is loaded into memory, create a new address space, load an empty processor state (e.g. zero all general purpose registers, or set them to some other well-defined value), and jump to the userspace entry point.
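A minimal sketch of such an "empty processor state", assuming the task is entered with SYSRET as in the original plan. The struct layout, the function name, and the choice of initial RFLAGS are assumptions for illustration; the only hardware facts relied on are that SYSRET loads RIP from RCX and RFLAGS from R11, and that RFLAGS bit 1 is always set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RFLAGS_RESERVED1 (1ull << 1)  /* bit 1 always reads as 1 */
#define RFLAGS_IF        (1ull << 9)  /* interrupts enabled in user mode */

/* Hypothetical saved-register layout; the field order is arbitrary. */
struct user_frame {
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
};

/* Build a deterministic initial state for a new user task. */
static inline void init_user_frame(struct user_frame *f,
                                   uint64_t entry, uint64_t stack_top)
{
    assert((stack_top & 0xF) == 0);  /* assumption: 16-byte aligned stack */
    memset(f, 0, sizeof *f);         /* every other register starts at zero */
    f->rcx = entry;                  /* SYSRET loads RIP from RCX */
    f->r11 = RFLAGS_RESERVED1 | RFLAGS_IF;  /* SYSRET loads RFLAGS from R11 */
    f->rsp = stack_top;              /* SYSRET leaves RSP alone; load it manually */
}
```

Because RCX and R11 are consumed by SYSRET itself, they cannot be "zero" on entry in this scheme, which is the kind of detail worth documenting for application programmers.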
Software task switching on amd64 works the same way as software task switching on other processors: save the processor state, switch stacks and the address space, load the new processor state, and jump to the new task. How you do it is up to you.
To avoid the double CR3 load, simply map the kernel into every address space. For example, on a 32-bit machine you could load the kernel in the top 1 GB and keep 3 GB for the actual user space process. Then you can use memory protection schemes such as rings or segmentation to keep the process from accessing kernel resources directly. On 64-bit machines you have much more free address space, so using, say, 4 GB of virtual memory for the kernel shouldn't be all that big a deal.
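One way to sketch "map the kernel into every address space" on amd64 is to copy the kernel's top-level (PML4) entries into each new task's PML4. This assumes the kernel lives in the upper half of the address space (slots 256 to 511), which differs from the identity-mapped kernel in the original post; the function and macro names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PML4_ENTRIES      512
#define KERNEL_FIRST_SLOT 256  /* assumption: kernel owns the upper half */

/* Prepare a fresh PML4 for a task: empty user half, shared kernel half. */
static inline void init_task_pml4(uint64_t *task_pml4,
                                  const uint64_t *kernel_pml4)
{
    assert(task_pml4 != kernel_pml4);

    /* The user half starts out unmapped. */
    memset(task_pml4, 0, KERNEL_FIRST_SLOT * sizeof(uint64_t));

    /* The kernel half points at the same lower-level tables everywhere. */
    memcpy(task_pml4 + KERNEL_FIRST_SLOT,
           kernel_pml4 + KERNEL_FIRST_SLOT,
           (PML4_ENTRIES - KERNEL_FIRST_SLOT) * sizeof(uint64_t));
}
```

Since only the top-level entries are copied, later kernel mappings made below an existing PML4 entry show up in every task automatically. The kernel entries should leave the user/supervisor bit clear so that ring 3 code cannot touch them.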
However, on amd64, no matter how you do it, if you want to maintain memory protection you must give each task a separate address space, and therefore you have to switch page tables on every switch between tasks in different address spaces. This is a large part of why task switching is slow: loading CR3 flushes the TLB (the processor's cache of address translations), so the new task starts with nothing cached. There are ways to mitigate this, such as intelligent scheduling. Also, if I am not mistaken, amd64 and x86 both have global pages, whose TLB entries are not thrown away when CR3 is reloaded. You can use this feature to mark all kernel pages as global (since they exist in every address space), so accessing kernel memory right after a switch won't be as slow. Just make sure that all your global pages are indeed mapped identically into every address space, or you'll get bizarre behavior on a real processor (but maybe the expected behavior on an emulator). But I wouldn't worry too much about optimizing CR3 loads (unless the optimization is extremely obvious, like avoiding a reload when you're switching to the same task), because all operating systems have to do it. Your operating system will suffer no more from this slowdown than any mainstream kernel like Linux, Darwin, or Windows.
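As a rough illustration of both points (global kernel pages and skipping a pointless CR3 reload), here is a sketch of the page table entry flags involved. The helper names are hypothetical. The bit positions are the architectural ones for a 4 KiB page table entry, and the global bit only takes effect once CR4.PGE (bit 7 of CR4) has been set.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT   (1ull << 0)
#define PTE_WRITABLE  (1ull << 1)
#define PTE_USER      (1ull << 2)  /* accessible from ring 3 */
#define PTE_GLOBAL    (1ull << 8)  /* TLB entry survives a CR3 reload */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ull

static inline uint64_t make_pte(uint64_t phys, uint64_t flags)
{
    assert((phys & ~PTE_ADDR_MASK) == 0);  /* 4 KiB aligned, at most 52 bits */
    return phys | flags;
}

/* Kernel pages: supervisor-only and global. */
static inline uint64_t kernel_pte(uint64_t phys)
{
    return make_pte(phys, PTE_PRESENT | PTE_WRITABLE | PTE_GLOBAL);
}

/* User pages: ring 3 accessible, never global. */
static inline uint64_t user_pte(uint64_t phys)
{
    return make_pte(phys, PTE_PRESENT | PTE_WRITABLE | PTE_USER);
}

/* The "extremely obvious" optimization: skip the reload when the
 * next task uses the address space that is already active. */
static inline int needs_cr3_reload(uint64_t current_cr3, uint64_t next_cr3)
{
    return current_cr3 != next_cr3;
}
```

The check in needs_cr3_reload also covers threads of the same process, which share one page map and so never need a reload when switching among themselves.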
- AndreaOrru
Re: How to start a userspace task in 64-bit mode?
iammisc wrote: To create a task, load the image into memory using whatever executable format you wish and then jump to the entry point. Of course this is an oversimplification and you have to take care of things such as dynamic linking, relocations, or initializing memory, etc. However, once the image you want is loaded into memory, create a new address space, load an empty processor state (e.g. zeroing all general purpose registers or some other well-defined value), and jump to the userspace entry point.
Is it really required? Does the process's code make assumptions about the registers' values?
Or is it just for security reasons?
Close the world, txEn eht nepO
Re: How to start a userspace task in 64-bit mode?
It's more or less so that all applications begin running in a deterministic state. Every application then starts in the same environment, which is one less thing to worry about when debugging.
Re: How to start a userspace task in 64-bit mode?
Hi,
iammisc wrote: To create a task, load the image into memory using whatever executable format you wish and then jump to the entry point. Of course this is an oversimplification and you have to take care of things such as dynamic linking, relocations, or initializing memory, etc. However, once the image you want is loaded into memory, create a new address space, load an empty processor state (e.g. zeroing all general purpose registers or some other well-defined value), and jump to the userspace entry point.
andreaorru wrote: Is it really required? Does the process's code make assumptions over registers' values? Or maybe is it just for security reasons?
The contents of all general registers, etc. during process startup should be well documented by OS developers, for the same reason that the contents of all general registers, etc. during CPU startup are well documented by CPU manufacturers. It's mostly so programmers know what they can and can't rely on, but also partly because some registers may be used to transfer information (either now or in future versions). For example, "eax = zero" is entirely different to "eax = reserved for future use (must be zero)". Of course this doesn't preclude something like "eax = undefined", as long as it's documented properly so people know what to expect.
The part that isn't really necessary is this part:
iammisc wrote: To create a task, load the image into memory using whatever executable format you wish and then jump to the entry point. Of course this is an oversimplification and you have to take care of things such as dynamic linking, relocations, or initializing memory, etc.
Typically it's much better to create a dummy/empty process (e.g. with the kernel and nothing else in the address space) and then return to the caller. Once the scheduler gives this dummy/empty process CPU time, it starts running the kernel's "process loader", which loads (or maps) the executable file into the address space (and loads/maps any shared libraries, does linking, etc). This means a high priority task (or a "real time" task) can spawn many low priority tasks very quickly (the file I/O, task switches, etc. are postponed), while a low priority task that spawns a high priority task will be immediately preempted by that high priority task. Of course the normal Unix way also splits this into 2 separate steps: you "fork()" (with no file I/O, etc), then after the new task gets CPU time it does "exec()" (which is the time consuming part).
Not only is this better for performance (especially for "real time" systems), it's also better for multi-CPU systems. For example, a task that uses CPU affinity to make sure it can only run on CPU #1 can create a new task that uses CPU affinity that can only run on CPU #2. Another example is NUMA; where the OS decides that (for load balancing reasons) the new task should be run in a different NUMA domain.
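To make the split between "create" and "load" concrete, here is a small user-space simulation of the idea, not kernel code. The task table, the state names, and the functions spawn and run_slice are all invented for the example: spawn only reserves a slot and returns, and the expensive loading step happens later, on the new task's own time slice.

```c
#include <assert.h>
#include <stddef.h>

enum task_state { TASK_UNUSED, TASK_NEEDS_LOAD, TASK_RUNNABLE };

struct task {
    enum task_state state;
    const char *image_path;  /* what the loader should map in later */
    int priority;
};

#define MAX_TASKS 8
static struct task tasks[MAX_TASKS];  /* zero-initialized: all TASK_UNUSED */

/* Cheap part: reserve a slot and return at once. No file I/O here. */
static inline int spawn(const char *image_path, int priority)
{
    assert(image_path != NULL);
    for (int i = 0; i < MAX_TASKS; i++) {
        if (tasks[i].state == TASK_UNUSED) {
            tasks[i] = (struct task){ TASK_NEEDS_LOAD, image_path, priority };
            return i;
        }
    }
    return -1;  /* table full */
}

/* Expensive part: runs when the scheduler first picks the new task,
 * so the cost is charged to the new task's priority, not the caller's. */
static inline void run_slice(int id)
{
    struct task *t = &tasks[id];
    if (t->state == TASK_NEEDS_LOAD) {
        /* A real kernel would parse and map the executable here. */
        t->state = TASK_RUNNABLE;
    }
}
```

A caller would do something like `int id = spawn("/bin/init", 1);` (the path is just a placeholder) and continue immediately; the task only becomes TASK_RUNNABLE after the scheduler has called run_slice for it.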
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.