Read, and understand, "Portable Multithreading", the paper that is the basis of GNU Pth.
GNU Pth is a user-level threading library, and the paper details how it creates its initial thread context in a mostly portable manner.
Basically, if you use the setjmp/longjmp method: in the creating thread, you temporarily switch to the stack of the thread being created, save some state (including the stack pointer on the new stack) using setjmp (which returns 0), then switch back to the old stack.
Then, when you want to actually switch to the new thread, you save the current thread state using setjmp, then longjmp using the jmp_buf set up above for the new thread. Your code will now be running on the new stack, in the bootstrap function, with setjmp returning non-zero. That is your signal to jump to the new thread's code.
Once the initial thread context is created, switching threads is quite simple, using the existing C setjmp/longjmp primitives (or the POSIX context primitives, in the paper). A task switch becomes:
Code:
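if (setjmp(currentthread->context) == 0) {
    /* First return from setjmp: our state is saved, so resume the next thread. */
    longjmp(nextthread->context, 1);
}
/* setjmp returned non-zero: another thread has switched back to us. */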
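If you want to try the same switch in user space before writing any architecture-specific code, here is a minimal sketch using the POSIX context primitives instead of setjmp/longjmp. It assumes a system that still ships <ucontext.h> (glibc does). The names thread_t, thread_switch, worker, record and run_demo are made up for illustration and are not from the paper or from Pth. Here makecontext stands in for the "bootstrap on a new stack" step, and swapcontext plays the role of the setjmp/longjmp pair.

```c
#include <ucontext.h>

/* Hypothetical thread type: only the saved context matters here. */
typedef struct thread {
    ucontext_t context;
} thread_t;

static thread_t main_thread, worker_thread;
static char worker_stack[64 * 1024];   /* stack for the new thread */
static int order;                       /* records the order of execution */

static void record(int step)
{
    order = order * 10 + step;
}

/* Save the current context and resume the next one. This is the
   swapcontext equivalent of the setjmp/longjmp pair. */
static void thread_switch(thread_t *current, thread_t *next)
{
    swapcontext(&current->context, &next->context);
}

/* Entry point of the new thread; runs on worker_stack. */
static void worker(void)
{
    record(2);
    thread_switch(&worker_thread, &main_thread);
    record(4);
    /* Returning resumes uc_link, i.e. the main thread's saved context. */
}

int run_demo(void)
{
    order = 0;

    /* Create the initial context for the new thread on its own stack. */
    getcontext(&worker_thread.context);
    worker_thread.context.uc_stack.ss_sp = worker_stack;
    worker_thread.context.uc_stack.ss_size = sizeof worker_stack;
    worker_thread.context.uc_link = &main_thread.context;
    makecontext(&worker_thread.context, worker, 0);

    record(1);
    thread_switch(&main_thread, &worker_thread);   /* worker runs step 2 */
    record(3);
    thread_switch(&main_thread, &worker_thread);   /* worker runs step 4, then returns */
    return order;
}
```

Calling run_demo() returns 1234, i.e. the two threads ran alternately. In a kernel you would not have makecontext, which is where the setjmp-on-the-new-stack trick comes in.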
I used this idea as the basis of my kernel threads. All kernel thread switching is implemented using setjmp/longjmp, which saves and restores the compiler-visible state. That is all you need for a kernel thread. Any user-visible state, such as the address space (cr3) or the kernel stack pointer in the TSS, can be managed separately, and your interrupt handlers should already save the user-level register state on the kernel stack. Kernel esp and cr3 management can live in separate code, as suggested by @Octocontrabass.
All you need then is the architecture-specific code to execute your thread bootstrap code with some arbitrary stack pointer.