thread-specific storage

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 18, 2006 6:40 am

One of the things that makes clicker most bound to the x86 segmentation feature is the thread-local storage (well, i call it "glocal" storage, but "TLS" is the terminology found in ancient ELF scrolls, so ...)

TLS offers each thread a way to keep data that need global visibility but should have a different value for each thread. The "errno" variable is the most common example, though clicker has more and more of these, such as the display buffer, the "current action" tracking log, etc., as well as pointers to the current thread, process, dispatcher and the like.

So far, in Clicker32, the TLS data are pointed to from the thread's stack, using the fact that each thread has its own stack and that each stack has a GDT/LDT entry that describes its own limit. Since the limit can be easily retrieved through the asm "LSL" instruction, you can also retrieve the address of the glocals block.

Of course, if i ever wish to move to Clicker64, i'll need something else (no more segmentation). So, what do you guys have? Ever thought about TLS in your OS? Or maybe "per-cpu" state (which is probably as fine) ... already implemented it? And how?

Candy · Post by **Candy** » Wed Jan 18, 2006 6:48 am

For clicker64 on AMD64 and compatible (oh yes, that sounds good... might be slightly fanboyish to say though) you can use the FS and GS registers. Their base addresses are still honored in long mode, and you are advised to keep the thread pointer in GS. There's also an instruction, SWAPGS, which swaps the userland GS base with the kernel-land one, so you can store the kernel half of the thread there as well. Works really well, except that trying to emulate this on 32-bit (as I do right now) looks silly.
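To make the SWAPGS idea concrete, here is a small simulation in plain C. It only models the exchange between the active GS base and the parked one (the IA32_KERNEL_GS_BASE MSR on real hardware). The `sim_*` names and the entry/exit hooks are made up for illustration; a real kernel would execute the instruction itself in its syscall and interrupt paths.

```c
#include <stdint.h>

/* Simulated per-CPU register state. On real hardware these are GS.base and
   the IA32_KERNEL_GS_BASE MSR, not fields of a struct. */
struct sim_cpu {
    uint64_t gs_base;         /* base used for %gs-relative accesses right now */
    uint64_t kernel_gs_base;  /* the other base, parked until the next swap */
};

/* Model of SWAPGS: exchange the active GS base with the parked one. */
static void sim_swapgs(struct sim_cpu *cpu)
{
    uint64_t tmp = cpu->gs_base;
    cpu->gs_base = cpu->kernel_gs_base;
    cpu->kernel_gs_base = tmp;
}

/* Hypothetical hooks: swap once when entering the kernel from userland,
   and once more on the way back out. */
void sim_kernel_entry(struct sim_cpu *cpu) { sim_swapgs(cpu); }
void sim_kernel_exit(struct sim_cpu *cpu)  { sim_swapgs(cpu); }
```

After `sim_kernel_entry()` the kernel's per-thread (or per-CPU) block is reachable through GS, and `sim_kernel_exit()` puts the userland base back.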

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 18, 2006 6:58 am

gcc-3.3.info.gz redirects the curious reader to http://people.redhat.com/drepper/tls.pdf, for those who wished to know

Brendan · Post by **Brendan** » Wed Jan 18, 2006 7:33 am

Hi,

Pype.Clicker wrote: Of course, if i ever wish to move to Clicker64, i'll need something else (no more segmentation). So, what do you guys have? Ever thought about TLS in your OS? Or maybe "per-cpu" state (which is probably as fine) ... already implemented it? And how?

I use paging. Specifically, I split address spaces into 3 parts - one part for the kernel, one part for the process and one part for the TLS.

For 32 bit paging, every thread has its own page directory, and all page tables used for kernel space and process space are mapped into each thread's page directory.

For 36 bit paging (PAE), each thread has its own page directory pointer table, and all page directories used for kernel space and process space are mapped into each thread's page directory pointer table.

For 32 bit processes in long mode, each thread has its own PML4 and "user-level" page directory pointer table. All page directory pointer tables used for kernel space and all page directories used for process space are mapped into each thread's PML4 or user-level page directory pointer table.

For 64 bit processes in long mode, each thread has its own PML4, and all page directory pointer tables used for kernel space and process space are mapped into each thread's PML4.

The boundary between process space and thread space (or TLS) is determined by the process's executable header, which (for 32 bit processes) allows process space to be 1 GB or 2 GB (with thread space being whatever is left over).
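For the plain 32 bit case, building a thread's page directory could look roughly like the sketch below. The split of the 1024 entries (1 GB of process space at the bottom, kernel space from entry 640 up, thread space in between) is an assumption picked for illustration, as is the name `build_thread_pd`. The post does not say where the kernel or thread regions actually sit.

```c
#include <stdint.h>
#include <string.h>

#define PD_ENTRIES    1024u  /* 32 bit paging: each entry covers 4 MB */

/* Assumed layout, for illustration only:
     entries [0, 256)      process space (1 GB), shared by all threads
     entries [256, 640)    thread space (TLS), private to each thread
     entries [640, 1024)   kernel space, shared by everything          */
#define PROCESS_END   256u
#define KERNEL_START  640u

/* Fill a new thread's page directory: share the process and kernel page
   tables by copying their directory entries, leave thread space unmapped. */
void build_thread_pd(uint32_t thread_pd[PD_ENTRIES],
                     const uint32_t process_pd[PD_ENTRIES])
{
    memset(thread_pd, 0, PD_ENTRIES * sizeof(uint32_t));
    memcpy(thread_pd, process_pd, PROCESS_END * sizeof(uint32_t));
    memcpy(thread_pd + KERNEL_START, process_pd + KERNEL_START,
           (PD_ENTRIES - KERNEL_START) * sizeof(uint32_t));
}
```

Because only directory entries are copied, every thread points at the same page tables for the shared regions, and only the thread-space entries differ from one thread to the next.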

This has advantages and disadvantages. For advantages, security is better (one thread can't mess up another thread's data), fault tolerance is better (the OS can terminate a thread and reclaim its resources without worrying about other threads that belong to the same process), and the total amount of address space a process can use is much much larger.

For example, the total address space a process can use can be calculated with "total_space = process_space_size + number_of_threads * thread_space_size". A 32 bit process running on a 32 bit kernel (without PAE) with 200 threads and 1 GB of process space can use up to 301 GB of address space.
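The formula is easy to check in code. Working backwards from the numbers quoted, 301 GB with 200 threads and 1 GB of process space implies 1.5 GB of thread space per thread; that 1.5 GB figure is inferred from the arithmetic, not stated in the post. Sizes are in MB (1 GB = 1024 MB) so everything stays in integers, and `total_space_mb` is just an illustrative name.

```c
#include <stdint.h>

/* total_space = process_space_size + number_of_threads * thread_space_size
   All sizes are in MB so that 1.5 GB can be written as 1536. */
uint64_t total_space_mb(uint64_t process_space_mb,
                        uint64_t number_of_threads,
                        uint64_t thread_space_mb)
{
    return process_space_mb + number_of_threads * thread_space_mb;
}
```

With 1024 MB of process space, 200 threads and 1536 MB per thread, this gives 308224 MB, which is 301 GB.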

For disadvantages, the isolation also gets in the way (one thread can't access data on another thread's stack), it consumes a bit more memory for the paging structures, and a thread switch is as slow as a process switch.

For the thread switch times, it doesn't matter much because the scheduler switches to the highest priority thread regardless of which process it belongs to (i.e. most of the time the scheduler is switching between threads that belong to different processes anyway).

Cheers,

Brendan

durand · Post by **durand** » Wed Jan 18, 2006 8:33 am

Wow...

I use a pointer which is specified by the process and I update this pointer on each thread switch. So you could do something like the following:

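Code:

/* Assumed prototypes for the kernel calls used below. */
void set_tls_location( void **location );
void set_tls( int thread_id, void *tls_block );

void *TLS;            /* rewritten by the kernel on every thread switch */

void *thread1_tls;    /* value TLS takes while thread 1 runs */
void *thread2_tls;    /* value TLS takes while thread 2 runs */

int main( int argc, char *argv[] )
{
    /* Pass the address of TLS: the kernel must know where the pointer lives. */
    set_tls_location( &TLS );

    set_tls( 1, thread1_tls );
    set_tls( 2, thread2_tls );

    // spawn threads and let them run...
    return 0;
}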

So, the first call will tell the kernel where the process' TLS pointer is. The second and third calls will tell the kernel what to set TLS to for each thread. On each thread switch, the kernel will just update TLS to point to the configured location.

So whenever thread 1 references TLS, it's actually referencing thread1_tls and whenever thread 2 references TLS, it's actually referencing thread2_tls.
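The kernel side of this scheme can be simulated in a few lines. This is only a sketch of the bookkeeping described above: the `sim_*` functions stand in for the real kernel calls, the fixed-size table is an arbitrary choice, and it covers a single CPU only.

```c
#include <stddef.h>

#define SIM_MAX_THREADS 8   /* arbitrary table size for the sketch */

/* Simulated kernel state for one process. */
static void **sim_tls_location;                 /* where the process keeps its TLS pointer */
static void  *sim_thread_tls[SIM_MAX_THREADS];  /* configured value for each thread id */

/* The process tells the kernel where its TLS pointer lives. */
void sim_set_tls_location(void **location)
{
    sim_tls_location = location;
}

/* The process tells the kernel what the pointer should be for one thread.
   Returns 0 on success, -1 if the thread id is out of range. */
int sim_set_tls(int tid, void *block)
{
    if (tid < 0 || tid >= SIM_MAX_THREADS)
        return -1;
    sim_thread_tls[tid] = block;
    return 0;
}

/* Called by the (simulated) scheduler on every thread switch:
   overwrite the process's TLS pointer with the incoming thread's value. */
void sim_switch_to(int tid)
{
    if (sim_tls_location != NULL && tid >= 0 && tid < SIM_MAX_THREADS)
        *sim_tls_location = sim_thread_tls[tid];
}
```

Since there is exactly one pointer per process, two threads of the same process running at the same time on different CPUs would fight over it.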

That's a simple way to get TLS and it provides large room for movement to layer different stuff on there. For example:

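Code:

/* pseudo-code-ish, tidied into C; set_tls_location() and set_tls()
   are the kernel calls from the previous example */

#include <stdio.h>

struct TLS_struct
{
   int          TID;
   char       **ENVIRONMENT;   /* environment variables */
   int          ERRNO;
   const char  *NAME;
};

struct TLS_struct *TLS;        /* points at the running thread's block */

struct TLS_struct thread1_tls;
struct TLS_struct thread2_tls;

void my_thread( void )
{
   printf( "My name is %s\n", TLS->NAME );
}

int main( int argc, char *argv[] )
{
    set_tls_location( (void **)&TLS );
    set_tls( 1, &thread1_tls );
    set_tls( 2, &thread2_tls );

    thread1_tls.NAME = "THREAD ONE";
    thread2_tls.NAME = "THREAD TWO";

    // spawn threads and let them run...
    return 0;
}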

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 18, 2006 9:09 am

durand wrote: Wow...

I use a pointer which is specified by the process and I update this pointer on each thread switch. So you could do something like the following:

yep, that's a quite convenient technique too, as long as you have only one cpu ... of course, maybe there's a way i'm not aware of to know on what specific CPU we're running (local APIC, someone ? ) that we could use to index an array of current_thread* ...

durand · Post by **durand** » Wed Jan 18, 2006 9:15 am

arg! I never thought about that. I'm in the middle of an SMP re-write as well but I haven't re-implemented TLS yet.

kataklinger · Post by **kataklinger** » Wed Jan 18, 2006 10:04 am

Pype.Clicker wrote: yep, that's a quite convenient technique too, as long as you have only one cpu ... of course, maybe there's a way i'm not aware of to know on what specific CPU we're running (local APIC, someone ? ) that we could use to index an array of current_thread* ...

I use the local APIC ID as an index into an array of objects that describe CPU state (current thread, whether the CPU is in an ISR, CPUID stuff and other things)
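A sketch of that lookup, with the APIC ID passed in as a parameter so it stays portable C. On real hardware the ID would be read from the local APIC ID register (bits 31:24 in xAPIC mode). The struct fields and `sim_*` names are placeholders, not anybody's actual kernel structures.

```c
#include <stdint.h>

#define SIM_MAX_CPUS 256    /* an 8 bit xAPIC ID can take 256 values */

struct sim_thread;          /* opaque here; only pointers to it are stored */

/* Per-CPU state, one slot per possible APIC ID. */
struct sim_cpu_state {
    struct sim_thread *current_thread;  /* thread running on this CPU */
    int                in_isr;          /* nonzero while handling an interrupt */
};

static struct sim_cpu_state sim_cpus[SIM_MAX_CPUS];

/* Look up the state block of the CPU with the given APIC ID.
   The uint8_t parameter keeps the index inside the table. */
struct sim_cpu_state *sim_cpu_by_apic_id(uint8_t apic_id)
{
    return &sim_cpus[apic_id];
}
```

APIC IDs are not guaranteed to be contiguous, so a table sized for every possible ID wastes some slots; mapping IDs to dense CPU numbers at boot would be one way around that.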

Pype.Clicker · Post by **Pype.Clicker** » Wed Jan 18, 2006 11:46 am

local APIC ID you use hmm ? meditate on this i will ...

Brendan wrote: For 32 bit paging, every thread has its own page directory, and all page tables used for kernel space and process space are mapped into each thread's page directory.

those threads sound terribly like processes to me! each in its own page directory ? even if i guess they're sharing massive parts of the directory ...

I suppose if i'm to use that trick, i'd rather tune the switching code so that the very page (or table) that contains thread-local variables will be replaced (but still not touching CR3 to avoid full TLB flush) as needed ...
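Swapping a single page on each thread switch could be sketched like this. The page table is a plain array, the INVLPG instruction is replaced by a counting stub, and the slot number, flag bits and `sim_*` names are all assumptions made for the example.

```c
#include <stdint.h>

#define SIM_PT_ENTRIES 1024u
#define SIM_TLS_SLOT   1023u  /* assumed: last entry of this table maps the TLS page */
#define SIM_PTE_FLAGS  0x3u   /* present | writable */

static unsigned sim_invlpg_count;

/* Stand-in for INVLPG: a real kernel would invalidate the TLB entry for the
   page's linear address; here we only count the calls. */
static void sim_invlpg(uint32_t slot)
{
    (void)slot;
    sim_invlpg_count++;
}

/* On a thread switch inside one address space: point the TLS slot at the
   incoming thread's frame and flush just that one translation, leaving
   CR3 (and so the rest of the TLB) alone. */
void sim_switch_tls_page(uint32_t page_table[SIM_PT_ENTRIES], uint32_t tls_frame)
{
    page_table[SIM_TLS_SLOT] = (tls_frame & ~0xFFFu) | SIM_PTE_FLAGS;
    sim_invlpg(SIM_TLS_SLOT);
}

unsigned sim_invlpg_calls(void)
{
    return sim_invlpg_count;
}
```

The cost per switch is one page table write and one INVLPG, which is why this only pays off while the thread-local area stays small.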

Brendan · Post by **Brendan** » Wed Jan 18, 2006 1:12 pm

Hi,

Pype.Clicker wrote: I suppose if i'm to use that trick, i'd rather tune the switching code so that the very page (or table) that contains thread-local variables will be replaced (but still not touching CR3 to avoid full TLB flush) as needed ...

Does your scheduler schedule processes or threads, and how big will your TLS areas be?

I schedule threads, and for my OS I recommend using the TLS areas for as much as possible - process space should only be used for the executable file, and any data that is shared by all threads (which should be almost nothing if good OOP practices are used).

Cheers,

Brendan

Candy · Post by **Candy** » Wed Jan 18, 2006 1:26 pm

Brendan wrote: I schedule threads, and for my OS I recommend using the TLS areas for as much as possible - process space should only be used for the executable file, and any data that is shared by all threads (which should be almost nothing if good OOP practices are used).

I tend to disagree quite strongly with that. A bunch of programs that run together can be multiple processes. Interaction between threads can involve more than just a few classes; some designs I've seen are based on proper concurrent access to the entire model for all threads. They were good OO designs, but they were not based on separating threads from everything else. Threads were created specifically for the case where more than one context runs in the same environment. In some design patterns they run side by side sharing only a buffer; in others they run in the same code or in interwoven bits of code.

Example: A webserver that spawns a thread for each client. All those clients use the entire webserver content, and there's no reason for each of them to have its own copy of the cache.

In terms of embedded-OO that's true. Each class has messageboxes and each class runs one thread. This causes way too much overhead in my opinion.

Brendan · Post by **Brendan** » Wed Jan 18, 2006 5:37 pm

Hi,

Candy wrote:
Brendan wrote: I schedule threads, and for my OS I recommend using the TLS areas for as much as possible - process space should only be used for the executable file, and any data that is shared by all threads (which should be almost nothing if good OOP practices are used).
I tend to disagree quite strongly with that. Given a bunch of programs that run together, they can be multiple processes. Interaction between threads can be more than just a few classes, some designs I've seen are based on proper concurrent access to the entire model for all threads. They were good OO designs, but not on separating threads off all the rest. Threads were created especially for when there are more than one context running in the same environment. In some design patterns they run side by side sharing only their buffer, in some they run in the same code or in interweaved bits of code.

Let me clarify my statement..

...for my OS and only my OS I recommend but do not require or insist upon using the TLS areas for as much as practically possible - process space should only be used for the executable file, and any data that is shared by all threads (which should be almost nothing if and only if good OOP practices are used, where the term "good OOP practices" does not imply that other OOP practices or non-OOP practices are less "good" in general).

Some notes...

In an OS where multiple threads belonging to the same process are run with pre-emptive scheduling or at the same time (i.e. on separate CPUs), any user-level data structures shared by multiple threads must be protected by re-entrancy locking. Also, in an OS where multiple threads belonging to the same process can be run at the same time (i.e. on separate CPUs), all linear address space regions that can be accessed by multiple threads must have re-entrancy locks used by any code that may modify those linear address space regions. For these cases, lock contention and lock overhead can be reduced by maximizing the use of thread local storage, as anything that can only be accessed by one thread never needs any locks.

For my scheduler, threads are given CPU time in order of priority. Therefore (within a single process), with one CPU several threads at the same priority doing one "thing" each will get the same amount of CPU time as one thread at that priority doing all "things". For N CPUs (within a single process), more than N threads at the same priority doing one "thing" each will get the same amount of CPU time as N threads at that priority doing the same "things". The additional threads increase the number of thread switches without improving the amount of work done. For these cases, scheduling overhead can be reduced by having one thread per CPU at each priority.

For my OS, "good OOP practices" means when the code is being designed you split it into classes, then assign each class a priority and create groups of all classes at the same priority. At run time, each group would be split into one thread per CPU. Of course this is difficult in practice (it's a guideline only).

For example, your webserver could have 4 classes, "main", "connection", "cache" and "log". The main class and the connection class would be at the same priority, the cache class would be at higher priority and the log class would be at a lower priority. On a single CPU computer this works out to 3 threads, one for each priority.

On a computer with 2 CPUs, you only need one thread for the log class as it's not that important and there's only one instance of it. The cache class can be split into 2 threads where each thread caches roughly one half of the data (put the file name through a hash function to determine which file is cached by which thread). For the remaining classes I'd have one thread for the main class (only one instance) and half of the connections and another thread for the other half of the connections (with some sort of load balancing to determine which thread handles each new connection).

For a computer with 4 CPUs, you still only need one thread for the log class, the cache class would be split across 4 threads, and I'd probably go for one thread for the main class (only one instance) and 3 threads for the connections.
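The "put the file name through a hash function" step might look like the following. FNV-1a is used only because it is short and well known; the post does not name a hash, and `cache_thread_for` is a made-up helper.

```c
#include <stdint.h>

/* 32 bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Pick which cache thread owns a file: the same name always maps to the
   same thread, so no two threads ever cache (or lock) the same file.
   Returns 0 when there are no cache threads, to avoid dividing by zero. */
unsigned cache_thread_for(const char *file_name, unsigned cache_threads)
{
    return cache_threads ? fnv1a32(file_name) % cache_threads : 0;
}
```

With 2 CPUs the call would be `cache_thread_for(name, 2)`, with 4 CPUs `cache_thread_for(name, 4)`; changing the thread count reshuffles which thread owns which file.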

This example probably isn't too good as I'm not that familiar with the internals of a web server, and I do realise standard high level languages might have trouble with it, but it should illustrate the point.

For my original comment, I was only trying to get Pype to think about the possible uses for his TLS - for me and my OS, anything small enough to use INVLPG instead of an address space switch would be far too small to be useful.

Cheers,

Brendan

Assembler · Post by **Assembler** » Wed Jan 18, 2006 5:50 pm

Hi,
i don't know how you've implemented your threading system but i recommend a DragonFlyBSD threading model (LWKT) with an LWKT scheduler for each cpu.

Assembler,

kataklinger · Post by **kataklinger** » Wed Jan 18, 2006 5:53 pm

Have you thought about using PAE and the page directory pointer table?
Well, if you want to use EM64T it won't help, but it can help someone who still wants to support 32 bit architecture

Kevin McGuire · Post by **Kevin McGuire** » Wed Jan 18, 2006 10:50 pm

I know this is off topic, but I just had to poke fun at myself for the fact that I did not even know such a thing existed and was perplexed and had the wrong idea until pype posted the link.

__thread int i;

I see how it works, and I can see how to implement it. I am glad pype started this post.
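For anyone else who had not met `__thread` before, here is a small pthreads sketch of what it buys you. Each thread bumps "the same" variable, yet every thread ends up with its own count. The function names are invented for the example, and it assumes GCC or Clang with pthreads available (built with -pthread).

```c
#include <pthread.h>
#include <stddef.h>

static __thread int counter;   /* one zero-initialised copy per thread */

/* Increment this thread's copy *arg times, then report the result
   back through the same int. */
static void *worker(void *arg)
{
    int *slot = arg;
    int n = *slot;
    for (int k = 0; k < n; k++)
        counter++;
    *slot = counter;
    return NULL;
}

/* Run two workers with different iteration counts. On return, *out_a and
   *out_b hold each worker's final counter; the return value is the calling
   thread's own copy, which nobody touched. Returns -1 if a thread could
   not be created. */
int tls_demo(int a, int b, int *out_a, int *out_b)
{
    pthread_t ta, tb;
    *out_a = a;
    *out_b = b;
    if (pthread_create(&ta, NULL, worker, out_a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, out_b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return counter;
}
```

No locks are needed around `counter++` because no two threads ever touch the same copy.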

(This has poke fun written all over it.)

I figure I can create that funny table thing, allocate some space, use GS/LDT on x86, and wham bam dito boom whoosh wallap.. I got TLS. It is not protected, but if the process has wild pointers whipping around in memory like a tornado from hell corrupting things -- hit the road jack.

If I did the X86-64 I would use FS, and just place the TLS right into the same memory space as everything else.

(I know, I only half-way do things right.)
