mrjbom wrote:
bellezzasolo wrote:
Yeah, Long Mode is a bit more work to get working, but, once it works, you get the 64-bit address space to play with. You lose stuff like PUSHAD, but you should optimize that according to your ABI anyway (only save registers that C won't save / will trash).
SWAPGS is pretty handy too. Also, you don't need to tweak the GDT to implement per-CPU / per-task data structures; you just write to the IA32_KERNEL_GS_BASE, IA32_GS_BASE, and IA32_FS_BASE MSRs. Not to mention the NX bit and Interrupt Stack Tables.
E.g. FS is my per-CPU data, with GS for TLS, so SWAPGS switches between user and kernel TLS. At least it will, once I implement a user mode...
It doesn't look bad.
However, I don't think my knowledge is enough to implement long mode at this point. I should probably implement the protected mode version first, and then rewrite it for long mode.
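(Just to show how little the MSR route actually involves - a paraphrased sketch rather than code from my tree; wrmsr and set_cpu_bases are made-up names for the post:)
Code:
#include <stdint.h>

// Architectural MSR numbers for the long mode segment base registers
#define IA32_FS_BASE        0xC0000100
#define IA32_GS_BASE        0xC0000101
#define IA32_KERNEL_GS_BASE 0xC0000102

static inline void wrmsr(uint32_t msr, uint64_t value)
{
    asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

// FS -> this CPU's per-CPU block, GS -> the TLS block currently in use.
// SWAPGS exchanges IA32_GS_BASE with IA32_KERNEL_GS_BASE, so keeping the
// kernel TLS base in one and the user TLS base in the other means a single
// instruction at kernel entry/exit flips between the two.
static void set_cpu_bases(void* per_cpu_block, void* kernel_tls, void* user_tls)
{
    wrmsr(IA32_FS_BASE, (uint64_t)(uintptr_t)per_cpu_block);
    wrmsr(IA32_GS_BASE, (uint64_t)(uintptr_t)kernel_tls);
    wrmsr(IA32_KERNEL_GS_BASE, (uint64_t)(uintptr_t)user_tls);
}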
Yeah, I never had any success with long mode at first - page faults where the x86 version of my OS would work fine! But the current version is x64 only, with the CPU specifics abstracted away - it should be fairly easy to port to another architecture. At this point, though, I'd probably go for something more exciting like AArch64!
I'd take the time to make sure that "rewriting" means writing just a support layer rather than rewriting the whole OS. I've been there numerous times - whether it's supporting 64-bit, VBE, SMP, ... it gets tiring after a while.
Here's the current version of my CPU "driver" interface:
Code:
void arch_cpu_init();
size_t arch_read_port(size_t port, uint8_t width);
void arch_write_port(size_t port, size_t value, uint8_t width);
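// Atomic compare-and-swap (bool for C++ callers, int for C)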
#ifdef __cplusplus
CHAIKRNL_FUNC bool arch_cas(volatile size_t* loc, size_t oldv, size_t newv);
#else
CHAIKRNL_FUNC int arch_cas(volatile size_t* loc, size_t oldv, size_t newv);
#endif
CHAIKRNL_FUNC void arch_pause(); //Hyperthreading hint
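// Interrupt enable/disable; the returned cpu_status_t can be handed back to arch_restore_state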
typedef size_t cpu_status_t;
cpu_status_t arch_disable_interrupts();
cpu_status_t arch_enable_interrupts();
void arch_restore_state(cpu_status_t val);
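// Debug breakpoints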
#define BREAKPOINT_CODE 0
#define BREAKPOINT_WRITE 1
#define BREAKPOINT_READ_WRITE 3
CHAIKRNL_FUNC void arch_set_breakpoint(void* addr, size_t length, size_t type);
CHAIKRNL_FUNC void arch_enable_breakpoint(size_t enabled);
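// Interrupt setup: IRQ levels and pluggable interrupt subsystems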
void arch_setup_interrupts();
#define INTERRUPT_SUBSYSTEM_NATIVE 0
#define INTERRUPT_SUBSYSTEM_DISPATCH 1
#define INTERRUPT_SUBSYSTEM_IRQ 2
#define IRQL_TIMER 0xFFFFFFFF
#define IRQL_INTERRUPT 1
#define IRQL_KERNEL 0
typedef void(*arch_register_irq_func)(size_t vector, uint32_t processor, void* fn, void* param);
typedef void(*arch_register_irq_postevt)(size_t vector, uint32_t processor, void(*evt)());
typedef struct _arch_interrupt_subsystem {
    arch_register_irq_func register_irq;
    arch_register_irq_postevt post_evt;
} arch_interrupt_subsystem;
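// Per-CPU data (FS-based) and TLS (GS-based) accessors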
uint64_t arch_read_per_cpu_data(uint32_t offset, uint8_t width);
void arch_write_per_cpu_data(uint32_t offset, uint8_t width, uint64_t value);
void arch_write_tls_base(void* tls, uint8_t user);
uint64_t arch_read_tls(uint32_t offset, uint8_t user, uint8_t width);
void arch_write_tls(uint32_t offset, uint8_t user, uint64_t value, uint8_t width);
typedef struct _per_cpu_data {
    struct _per_cpu_data* cpu_data;
    void* running_thread;
    uint64_t cpu_ticks;
    uint32_t cpu_id;
} per_cpu_data;
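// C++ convenience wrapper: each member proxies one per-CPU field through the accessors above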
#ifdef __cplusplus
static class _cpu_data {
    static const uint32_t offset_ptr = 0;
    static const uint32_t offset_thread = 0x8;
    static const uint32_t offset_ticks = 0x10;
    static const uint32_t offset_id = 0x18;
    static const uint32_t offset_irql = 0x1C;
    static const uint32_t offset_max = 0x20;
public:
    static const size_t data_size = 0x38;
    class cpu_id {
    public:
        uint32_t operator = (uint32_t i) { arch_write_per_cpu_data(offset_id, 32, i); return i; }
        operator uint32_t() const { return arch_read_per_cpu_data(offset_id, 32); }
    } cpuid;
    class cpu_data {
    public:
        operator per_cpu_data*() const { return (per_cpu_data*)arch_read_per_cpu_data(offset_ptr, 64); }
    } cpudata;
    class running_thread {
    public:
        void* operator = (void* i) { arch_write_per_cpu_data(offset_thread, 64, (size_t)i); return i; }
        operator void*() const { return (void*)arch_read_per_cpu_data(offset_thread, 64); }
    } runningthread;
    class cpu_ticks {
    public:
        uint64_t operator = (uint64_t i) { arch_write_per_cpu_data(offset_ticks, 64, i); return i; }
        operator uint64_t() const { return arch_read_per_cpu_data(offset_ticks, 64); }
    } cputicks;
    class cpu_irql {
    public:
        uint32_t operator = (uint32_t i) { arch_write_per_cpu_data(offset_irql, 32, i); return i; }
        operator uint32_t() const { return arch_read_per_cpu_data(offset_irql, 32); }
    } irql;
} pcpu_data;
uint64_t arch_msi_address(uint64_t* data, size_t vector, uint32_t processor, uint8_t edgetrigger = 1, uint8_t deassert = 0);
#endif
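// Interrupt handler registration and vector allocation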
CHAIKRNL_FUNC void arch_register_interrupt_subsystem(uint32_t subsystem, arch_interrupt_subsystem* system);
typedef uint8_t(*dispatch_interrupt_handler)(size_t vector, void* param);
#define INTERRUPT_ALLCPUS (-1)
#define INTERRUPT_CURRENTCPU (-2)
CHAIKRNL_FUNC void arch_register_interrupt_handler(uint32_t subsystem, size_t vector, uint32_t processor, void* fn, void* param);
CHAIKRNL_FUNC void arch_install_interrupt_post_event(uint32_t subsystem, size_t vector, uint32_t processor, void(*evt)());
CHAIKRNL_FUNC uint32_t arch_allocate_interrupt_vector();
CHAIKRNL_FUNC void arch_reserve_interrupt_range(uint32_t start, uint32_t end);
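// Paging root, SMP startup, halt and local interrupt controller EOI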
void arch_set_paging_root(size_t root);
uint32_t arch_current_processor_id();
uint8_t arch_startup_cpu(uint32_t processor, void* address, volatile size_t* rendezvous, size_t rendezvousval);
uint8_t arch_is_bsp();
void arch_halt();
void arch_local_eoi();
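// Context save/jump and kernel thread stacks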
typedef void* context_t;
context_t context_factory();
void context_destroy(context_t ctx);
int save_context(context_t ctxt);
void jump_context(context_t ctxt, int value);
typedef void* kstack_t;
kstack_t arch_create_kernel_stack();
void arch_destroy_kernel_stack(kstack_t stack);
void* arch_init_stackptr(kstack_t stack);
void arch_new_thread(context_t ctxt, kstack_t stack, void* entrypt);
void arch_go_usermode(void* userstack, void (*ufunc)(void*), size_t bitness);
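// TLB/cache maintenance, memory barrier and endian helpers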
void arch_flush_tlb(void*);
CHAIKRNL_FUNC void arch_flush_cache();
void arch_memory_barrier();
CHAIKRNL_FUNC uint16_t arch_swap_endian16(uint16_t);
CHAIKRNL_FUNC uint32_t arch_swap_endian32(uint32_t);
CHAIKRNL_FUNC uint64_t arch_swap_endian64(uint64_t);
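// Cache topology queries (C++ only)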
#ifdef __cplusplus
enum ARCH_CACHE_TYPE {
    CACHE_TYPE_UNKNOWN,
    CACHE_TYPE_DATA,
    CACHE_TYPE_INSTRUCTION,
    CACHE_TYPE_UNIFIED
};
#define CACHE_FULLY_ASSOCIATIVE SIZE_MAX
size_t cpu_get_cache_size(uint8_t cache_level, ARCH_CACHE_TYPE type);
size_t cpu_get_cache_associativity(uint8_t cache_level, ARCH_CACHE_TYPE type);
size_t cpu_get_cache_linesize(uint8_t cache_level, ARCH_CACHE_TYPE type);
typedef void(*cpu_cache_callback)(uint8_t, ARCH_CACHE_TYPE);
size_t iterate_cpu_caches(cpu_cache_callback callback);
#endif
void cpu_print_information();
CHAIKRNL_FUNC uint64_t arch_get_system_timer();
It may look a bit intimidating, but a big chunk of it is the per-CPU data class, plus my interrupt dispatcher code - which perhaps should live elsewhere, but it does interface tightly with the CPU stuff.
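For a flavour of how the generic side ends up consuming it, here's a simplified, from-memory sketch rather than real code from the repo (the lock type, kernel_timer_tick and install_tick_handler are invented for the post, and I'm assuming arch_cas returns true on success, arch_disable_interrupts returns the state arch_restore_state expects, and the dispatch subsystem takes handlers with the dispatch_interrupt_handler signature):
Code:
// Minimal spinlock over the abstraction layer
typedef volatile size_t spinlock_t;

static cpu_status_t acquire_spinlock(spinlock_t* lock)
{
    cpu_status_t st = arch_disable_interrupts();
    while (!arch_cas(lock, 0, 1))
        arch_pause();              // be nice to the sibling hyperthread
    return st;
}

static void release_spinlock(spinlock_t* lock, cpu_status_t st)
{
    arch_memory_barrier();         // make protected writes visible before the release
    *lock = 0;
    arch_restore_state(st);
}

// Hypothetical dispatcher-registered handler bumping the per-CPU tick count
// through the pcpu_data accessor class
static uint8_t kernel_timer_tick(size_t vector, void* param)
{
    (void)vector; (void)param;
    pcpu_data.cputicks = pcpu_data.cputicks + 1;
    return 1;                      // assumed convention: nonzero == handled
}

static void install_tick_handler()
{
    uint32_t vec = arch_allocate_interrupt_vector();
    arch_register_interrupt_handler(INTERRUPT_SUBSYSTEM_DISPATCH, vec,
                                    INTERRUPT_CURRENTCPU,
                                    (void*)kernel_timer_tick, nullptr);
}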