My apologies, I didn't notice the note about how you prefer fewer worker threads. Here's another solution, which may or may not work depending on your design. It assumes that your kernel stacks are allocated on the kernel heap.
1. Make sure the handle of the thread description and the stack base are both held in registers.
2. Disable interrupts.
3. Remove the thread from the queue.
4. Free the thread description.
5. Free the stack by marking its heap block as free, without using the stack itself to do so. In my case this works because the block header, which stores whether or not the block is free, sits right before the start of the block, so it's just a constant offset from the stack base.
6. Switch to the next process without using the old stack; this includes actually jumping onto another thread's stack.
7. You may now safely re-enable interrupts; the thread has been removed.
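To make step 5 a bit more concrete, here's a rough sketch of the "constant offset" trick. The header layout and the names (block_header, header_of, mark_block_free) are made up for illustration, so adapt them to whatever your allocator actually uses. It only shows the pointer arithmetic; in a real kernel you'd want this part in assembly (or at least check the compiler output) to be sure it doesn't touch the stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical heap block header, stored immediately before the payload. */
struct block_header {
    size_t size;  /* payload size in bytes */
    bool   free;  /* true once the block may be reused */
};

/* Recover the header from a payload pointer: a constant negative offset. */
static inline struct block_header *header_of(void *payload)
{
    return (struct block_header *)((uint8_t *)payload - sizeof(struct block_header));
}

/*
 * Step 5: mark the stack's heap block as free with a single store.
 * stack_base is assumed to be the pointer the allocator originally returned.
 * No allocator bookkeeping (coalescing, free lists) is done here; that is
 * an assumption of this sketch, not a requirement.
 */
static inline void mark_block_free(void *stack_base)
{
    header_of(stack_base)->free = true;
}
```

With interrupts already disabled (step 2), nothing else should be able to grab the block between this store and the switch in step 6, at least on a single CPU.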
If your stack is allocated not on the kernel heap but as a list of pages, then you'd have to unmap the pages instead, again using only registers and not the stack.
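For the page-based case, the unmapping could look roughly like this. It's only a sketch: the 32-bit entry format, the PTE_PRESENT bit and the unmap_stack_pages name are assumptions modelled on x86-style page tables, and the table is passed in as a plain array so the logic can be read on its own.

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u  /* assumed "present" bit, as on x86 */

/*
 * Clear the present bit of every page table entry backing the stack.
 * ptes  : the page table, viewed as an array of entries (hypothetical layout)
 * first : index of the first entry that maps the stack
 * count : number of stack pages
 */
static inline void unmap_stack_pages(uint32_t *ptes, size_t first, size_t count)
{
    for (size_t i = 0; i < count; i++)
        ptes[first + i] &= ~(uint32_t)PTE_PRESENT;
}
```

On real hardware you'd also need to invalidate the matching TLB entries (invlpg on x86) and hand the physical frames back to your frame allocator. As with the heap version, the loop has to run from registers only, so it's worth writing it in assembly or checking what the compiler generates.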