Is thread local storage a good solution?

OSwhatever · Post by **OSwhatever** » Fri Mar 20, 2020 10:46 am

Is thread local storage (TLS) a complex solution for a nonexistent problem?

This reddit post highlights several problems with it.
https://www.reddit.com/r/rust/comments/ ... orage_how/

Essentially, with TLS, global variables are thread-local, so why not just allocate them on the stack instead? What I've noticed is that TLS doesn't really do well with systems that have short-lived threads and/or thread pooling, which is becoming more and more common. In order to support TLS you need a special area that has to be allocated and then initialized for each thread. For thread pools this means that you either need to reinitialize the TLS for each run or reuse the values from the previous run. Reinitializing TLS takes time, unnecessary time it feels. It is, however, possible to do a lazy initialization during the first access.
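A minimal sketch of the lazy-initialization idea, assuming a C11 compiler with `_Thread_local`. The names `struct scratch`, `scratch_get` and `scratch_reset` are made up for illustration; nothing here comes from a real library.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-thread state; the layout is illustrative only. */
struct scratch {
    bool ready;            /* zero (false) in every freshly created thread */
    char buf[256];
    unsigned long uses;
};

static _Thread_local struct scratch tls_scratch;

/* Return this thread's scratch area, initializing it on first access. */
struct scratch *scratch_get(void)
{
    struct scratch *s = &tls_scratch;
    if (!s->ready) {
        memset(s->buf, 0, sizeof s->buf);
        s->uses = 0;
        s->ready = true;
    }
    s->uses++;
    return s;
}

/* A pool worker could call this between jobs; the next scratch_get()
 * then reinitializes the state instead of seeing the previous job's values. */
void scratch_reset(void)
{
    tls_scratch.ready = false;
}
```

With this shape a pooled thread only pays for reinitialization if the next job actually touches the variable.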

Then, to access TLS variables, you need some special register or a function like __tls_get_addr, which adds to the overhead of accessing variables. Dynamic modules add to the complexity of it all.

My question is: do we need TLS, or was it just some workaround to solve the infamous errno variable? Many modern languages require it, but was it a mistake to rely on TLS instead of redesigning them so that they don't need it? What do you think?

reapersms · Post by **reapersms** » Fri Mar 20, 2020 1:34 pm

They may very well be on the stack, but you still need a mechanism for locating them. Sure, they could be in the stack frame of the thread entry point function, but child frames don't generally know how far down the stack they are, at which point your __tls_get_addr function has to walk up frames until it finds the right one. The additional space isn't a huge issue, as it can be allocated at the same time the rest of the thread management data is, or included as part of the thread stack.

Dynamic linking is indeed an issue, with the usual solution of "be very careful about it", and lazy initialization.

The replacement is rolling your own, via indexing some global container by thread ID or some similar mechanism. At this point you're probably reimplementing the existing system, but without any of the language assistance.
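As a rough sketch of what "rolling your own" tends to look like, here is a fixed-size table keyed by `pthread_self()` and guarded by one mutex. The names `my_tls_set`, `my_tls_get` and the limit `MAX_THREADS` are hypothetical; the point is only that every access now takes a lock and a search that compiler-supported TLS avoids.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* arbitrary limit for the sketch */

struct slot {
    int used;
    pthread_t owner;
    void *value;
};

static struct slot table[MAX_THREADS];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find the calling thread's slot, optionally claiming a free one.
 * Must be called with table_lock held. Returns NULL if none is available. */
static struct slot *find_slot(int create)
{
    pthread_t self = pthread_self();
    struct slot *free_slot = NULL;
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (table[i].used && pthread_equal(table[i].owner, self))
            return &table[i];
        if (!table[i].used && free_slot == NULL)
            free_slot = &table[i];
    }
    if (create && free_slot != NULL) {
        free_slot->used = 1;
        free_slot->owner = self;
        free_slot->value = NULL;
        return free_slot;
    }
    return NULL;
}

/* Store a per-thread pointer; returns 0 on success, -1 if the table is full. */
int my_tls_set(void *value)
{
    pthread_mutex_lock(&table_lock);
    struct slot *s = find_slot(1);
    if (s != NULL)
        s->value = value;
    pthread_mutex_unlock(&table_lock);
    return s != NULL ? 0 : -1;
}

/* Fetch the calling thread's pointer, or NULL if it never stored one. */
void *my_tls_get(void)
{
    pthread_mutex_lock(&table_lock);
    struct slot *s = find_slot(0);
    void *v = s != NULL ? s->value : NULL;
    pthread_mutex_unlock(&table_lock);
    return v;
}
```

The sketch never releases slots when a thread exits, which is one more piece of bookkeeping the existing system already handles.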

Thread pooling shouldn't really care either way about it. If the usage is to grab a thread from the pool, run a workload to completion, and return it to the pool, then the workload should probably be cleaning up after itself. If the usage is more a user-level threading system, where a particular workload may run on a variety of threads over its lifetime, it will need to be careful about not storing workload-specific data in TLS.

Short-lived threads are probably something to avoid anyways, as the OS bookkeeping around thread creation/destruction is likely not the fastest thing.

It was not just created to handle errno.

Korona · Post by **Korona** » Sat Mar 21, 2020 2:19 am

TLS allocation is negligible compared to the other costs involved in thread creation. In the System V ABI, static TLS is part of the TCB that you'll likely need anyway.

TLS is useful for the same reason that global variables are useful (and see below why that is the case!). You probably should not maintain hundreds of TLS variables, but it can often be impractical to add another argument to all functions. Indeed, globals/TLS are most useful to pass data through libraries that do not need to be aware of the data. For example, on Managarm, the current run queue (that manages callbacks from asynchronous I/O) is thread-local. Passing it to each and every function that can potentially do I/O would be impractical.
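To illustrate the "pass data through libraries" point, here is a small sketch in C11. It is not Managarm's actual code; `struct run_queue`, `set_current_queue` and `library_post_callback` are invented names standing in for a per-thread queue and a library routine that never received the queue as an argument.

```c
#include <stddef.h>

/* Hypothetical run queue; a real one would hold callbacks, not a counter. */
struct run_queue {
    int pending;
};

/* Each thread sees its own pointer. */
static _Thread_local struct run_queue *current_queue;

/* Called once by whoever sets up the thread's event loop. */
void set_current_queue(struct run_queue *q)
{
    current_queue = q;
}

/* Deep inside a library: no queue parameter, yet the right queue is found. */
int library_post_callback(void)
{
    if (current_queue == NULL)
        return -1;              /* no queue installed on this thread */
    current_queue->pending++;
    return 0;
}
```

The intermediate layers between the event loop and `library_post_callback` need no extra parameter, which is the convenience being described.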

OSwhatever · Post by **OSwhatever** » Mon Mar 23, 2020 2:47 pm

Korona wrote: TLS allocation is negligible compared to the other costs involved in thread creation. In the System V ABI, static TLS is part of the TCB that you'll likely need anyway.

There is certainly a cost, especially when accessing TLS variables. According to this use case, it can be up to 5% of the CPU time.

https://software.intel.com/en-us/blogs/ ... variables/

Korona · Post by **Korona** » Tue Mar 24, 2020 1:54 am

Yes, accessing TLS has a comparatively high overhead (at least for the first TLS access in each function) since __tls_get_addr cannot be inlined. The OP (also) asked about allocation of TLS (which is mostly free).
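One common way to keep that per-function cost down is to resolve the thread-local address once and hand a plain pointer to helpers. A minimal sketch, with made-up names (`hits`, `bump_row`, `bump_all`); whether it helps in practice depends on the TLS model and on what the compiler already hoists.

```c
#include <stddef.h>

static _Thread_local unsigned long hits[4];

/* Works on an ordinary pointer, so it performs no TLS lookup of its own. */
static unsigned long bump_row(unsigned long *h, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += ++h[i];
    return total;
}

/* Resolve the thread-local address once, then reuse it for every round. */
unsigned long bump_all(unsigned rounds)
{
    unsigned long *h = hits;    /* the only TLS address computation */
    unsigned long total = 0;
    for (unsigned r = 0; r < rounds; r++)
        total += bump_row(h, 4);
    return total;
}
```

The pointer must not be handed to another thread, since it refers to the calling thread's copy only.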

OSwhatever · Post by **OSwhatever** » Sat Apr 18, 2020 4:06 pm

Korona wrote: Yes, accessing TLS has a comparatively high overhead (at least for the first TLS access in each function) since __tls_get_addr cannot be inlined. The OP (also) asked about allocation of TLS (which is mostly free).

The allocation of TLS is "free" when you put it on the stack. You can put both the TLS area and the DTV on the stack. However, there are those who like to have megabytes of TLS variables, and then you cannot have it on the stack as it wouldn't fit in the stack's virtual area (this depends a bit on the stack design you use). In those cases you have to put it on the heap. So now you see there are just a lot of special cases you have to deal with.

It doesn't stop there, because if you want to load more modules during run time, you have to expand the DTV vector and if it is already on the stack you cannot expand it and must have a special solution for that.
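The "special solution" could look something like the sketch below: a much simplified DTV (no generation counter, no locking) that starts in caller-provided storage such as the stack and moves to the heap the first time it has to grow. `struct dtv` and `dtv_grow` are hypothetical names, not any libc's real structures.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified dynamic thread vector: one TLS block pointer per module. */
struct dtv {
    size_t len;      /* number of module slots */
    void **block;    /* block[i] is module i's TLS block, or NULL */
    int on_heap;     /* 0 while 'block' still points at stack storage */
};

/* Grow the vector to new_len slots. Returns 0 on success, -1 on failure. */
int dtv_grow(struct dtv *d, size_t new_len)
{
    if (new_len <= d->len)
        return 0;
    void **nb = calloc(new_len, sizeof *nb);   /* new slots start as NULL */
    if (nb == NULL)
        return -1;
    if (d->len > 0)
        memcpy(nb, d->block, d->len * sizeof *nb);
    if (d->on_heap)
        free(d->block);      /* the stack copy is simply abandoned */
    d->block = nb;
    d->len = new_len;
    d->on_heap = 1;
    return 0;
}
```

The extra `on_heap` flag is exactly the kind of special case being complained about.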

Another annoying thing is that I haven't seen any option to avoid optimization towards initial exec model. Maybe I just want to support general dynamic model, but that's not possible. Especially on x86 this is a bad fit for my design, which is to allocate and initialize TLS as late as possible.
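For context, GCC and Clang do let a program request a model per variable with the `tls_model` attribute (and per translation unit with `-ftls-model=`). GCC's documentation notes that the choice is still subject to optimization, so the sketch below is a request rather than a guarantee, and what the linker does afterwards is not something this example controls. The variable and function names are made up.

```c
/* Ask for the general-dynamic model for this one variable (GCC/Clang extension). */
_Thread_local int gd_counter __attribute__((tls_model("global-dynamic")));

/* Ask for the initial-exec model instead. */
_Thread_local int ie_counter __attribute__((tls_model("initial-exec")));

int bump_gd(void)
{
    return ++gd_counter;
}

int bump_ie(void)
{
    return ++ie_counter;
}
```

Comparing the generated assembly for the two functions is a quick way to see which access sequence the toolchain actually chose.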

In computer science, sometimes when I read about different designs and solutions I think "that's really clever". TLS is by far not one of those.

AndrewAPrice · Post by **AndrewAPrice** » Mon Apr 20, 2020 3:34 pm

I had some ideas of ways you could implement TLS without a syscall:

a) If we know the size of the TLS at load time (because you annotated global variables with some attribute), we could calculate how many memory pages we need for these variables and put them at a certain memory mapped location. Then, when context switching between threads, you'd switch those pages. The downside of this is that it might be heavy to context switch between threads in the same process (but not heavier than context switching between threads across processes.)

b) We could have a memory address where the scheduler puts the currently executing thread ID. The downside of this is that anything wanting to do TLS would probably have a thread ID -> data map that would get locked a lot.

c) Similar to above, except rather than the fixed memory address containing the thread ID, it can contain any int, and the scheduler will save and restore whatever this int was upon context switching. Then programs can use this as a pointer to dynamically allocate thread safe storage. The downside is that you need to remember to allocate/free your TLS object.

d) The TLS lives in the thread's stack, and the scheduler stores the stack's base address at a fixed memory address. The downside is that, as OSwhatever said, you might have many megabytes of TLS, so you'd need to support very large stacks.
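Option (c) can be sketched as a tiny simulation. A plain global stands in for the fixed user-visible word, and `context_switch` does what the scheduler would do on each switch. All names (`user_word`, `struct thread`, `set_thread_word`, `get_thread_word`) are invented for the example; no real kernel interface is implied.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the fixed word; a real kernel would map it at a known address. */
static uintptr_t user_word;

struct thread {
    uintptr_t saved_word;   /* value of user_word while the thread is not running */
};

/* On every switch: save the outgoing thread's word, restore the incoming one. */
void context_switch(struct thread *from, struct thread *to)
{
    from->saved_word = user_word;
    user_word = to->saved_word;
}

/* User-space side: treat the word as a pointer to per-thread storage. */
void set_thread_word(void *p)
{
    user_word = (uintptr_t)p;
}

void *get_thread_word(void)
{
    return (void *)user_word;
}
```

Allocating and freeing whatever the word points at remains the program's job, which is the downside named in (c).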

nullplan · Post by **nullplan** » Mon Apr 20, 2020 9:59 pm

MessiahAndrw wrote: a) If we know the size of the TLS at load time (because you annotated global variables with some attribute), we could calculate how many memory pages we need for these variables and put them at a certain memory mapped location. Then, when context switching between threads, you'd switch those pages. The downside of this is that it might be heavy to context switch between threads in the same process (but not heavier than context switching between threads across processes.)

Alternatively, every thread gets a copy at a different place, and a register points there. For x86, you only need a syscall to set the base address of one of the segment registers. This only recently became unnecessary with the introduction of WRGSBASE. But still, this is a single syscall right after spawning a thread, so the impact should be limited.

This means all threads share the exact same address space and can share all pointers they want between each other, and only use the register file for diversification.

MessiahAndrw wrote:b) We could have a memory address where the scheduler puts the currently executing thread ID. The downside of this is that it's likely anything wanting to do TLS would probably have a Thread ID -> data map that would get locked a lot.

c) Similar to above, except rather than the fixed memory address containing the thread ID, it can contain any int, and the scheduler will save and restore whatever this int was upon context switching. Then programs can use this as a pointer to dynamically allocate thread safe storage. The downside is that you need to remember to allocate/free your TLS object.

This is pretty much what OS-9 does (only it's a pointer, not an int). Now they have to do that since they don't have a register left to serve as thread pointer. Still requires a syscall to set up, tho.

MessiahAndrw wrote:d) The TLS lives in the thread's stack, and the scheduler stores the stack's base address at a fixed memory address. The downside of this is that you might have many megabytes of TLS, as OSwhatever said, you'll need to support super huge stacks.

This way, it becomes impossible to grow the TLS, so if a new library is loaded that requires more TLS, you need to allocate the TLS elsewhere again. And if you need to store a pointer to TLS in memory anyway, you might as well make it independent of the stack.

OSwhatever · Post by **OSwhatever** » Tue Apr 21, 2020 5:47 am

nullplan wrote:Alternatively, every thread gets a copy at a different place, and a register points there. For x86, you only need a syscall to set the base address of one of the segment registers. This only recently became unnecessary with the introduction of WRGSBASE. But still, this is a single syscall right after spawning a thread, so the impact should be limited.

How is it with x86? Is it possible to provide a function call to obtain the thread pointer for the initial-exec TLS area? I'm kind of in luck since I'm working on ARM, where you have the option to use either a hardware CP15 register or an ABI function call to __aeabi_read_tp to get the pointer. In that function you can do all sorts of things like late allocation and initialization, which works for me. However, would that be possible with x86, or do you have to use fs/gs instead of a function?

nullplan wrote: This way, it becomes impossible to grow the TLS, so if a new library is loaded that requires more TLS, you need to allocate the TLS elsewhere again. And if you need to store a pointer to TLS in memory anyway, you might as well make it independent of the stack.

Right now I have opted for a split DTV, one for the initial-exec model and one for the global-dynamic model. The initial-exec TLS is stored on the stack if it is below a certain threshold; otherwise it is on the heap. Initial-exec modules never grow, as they are known at process creation. The global-dynamic TLS is always on the heap with a dynamic DTV. The idea is that programs that load DLLs at run time are more unusual, while most programs load their DLLs at process creation.

nullplan · Post by **nullplan** » Tue Apr 21, 2020 9:11 am

OSwhatever wrote:How is it with x86, do you have the possibility to provide a function call in order to obtain the tp pointer for the init exec TLS area? I'm in kind of a luck since I'm working on ARM, there you have the option to either use a HW cp15 register or an ABI function call to __aeabi_read_tp in order to get the pointer. In that function you can do all sorts of things like late allocation and initialization which works for me. However, would that be possible with x86 or do you have to use fs/gs instead of a function?

I'm not sure I understand. The base of the FS or GS register can be set only with a descriptor (in 32-bit mode), with an MSR, or with a relatively new instruction called WRFSBASE or WRGSBASE. The first of these can obviously only be done by the kernel, since the descriptor tables are supervisor resources (if user space could write them, even just the LDT, it could install a ring 0 code segment and a call gate into that segment, usurping the machine). The second one can also only be done by the kernel since WRMSR and RDMSR are supervisor instructions. The last of these is only available with kernel support, but if that is communicated, it could be done entirely in user space. Since we are talking about a very quick system call that is happening at a time other resource intensive calls are happening, this is usually too little of a benefit and too much of a hassle to bother. If WRFSGSBASE is supported, the kernel can just patch its own code accordingly.

Most libcs use FS/GS for more than just ELF TLS. Both musl and glibc use it to store the current thread descriptor, with the TLS coming in below the thread pointer. The thread pointer points to the start of the thread descriptor, and below the descriptor is the ELF TLS. Since the thread descriptor is used for all sorts of things, late initialization is usually not useful for the thread pointer itself. As for the TLS, glibc performs late initialization in __tls_get_addr(), namely it allocates and copies the TLS image of a DSO only on first use. For initial-exec, this is entirely useless, since no function is called. In case of initial exec, the base pointer is read in i386 and x86_64 with "movl %gs:0, %<target>" and "movq %fs:0, %<target>" respectively, so this stuff has to be initialized by the time control is passed to the application.
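The layout described here (thread pointer at the start of the descriptor, ELF TLS directly below it) can be modelled in portable C. This is a sketch under simplifying assumptions: one module, heap memory instead of a real thread block, and an invented `struct thread_desc` that is far smaller than any real libc's.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical thread descriptor; real libcs keep much more state here. */
struct thread_desc {
    struct thread_desc *self;   /* the word a "mov %fs:0, reg" would load */
    size_t tls_size;            /* rounded size of the TLS block below us */
};

/* Allocate [TLS block][descriptor]; return the thread pointer, or NULL. */
struct thread_desc *thread_block_create(const void *image, size_t image_size)
{
    size_t align = _Alignof(max_align_t);
    size_t tls_size = (image_size + align - 1) / align * align;
    char *mem = calloc(1, tls_size + sizeof(struct thread_desc));
    if (mem == NULL)
        return NULL;
    if (image_size > 0)
        memcpy(mem, image, image_size);     /* copy the initialized TLS image */
    struct thread_desc *tp = (struct thread_desc *)(mem + tls_size);
    tp->self = tp;
    tp->tls_size = tls_size;
    return tp;
}

/* Initial-exec style access: a fixed negative offset from the thread pointer. */
void *tls_addr(struct thread_desc *tp, size_t offset_in_image)
{
    return (char *)tp - tp->tls_size + offset_in_image;
}

void thread_block_destroy(struct thread_desc *tp)
{
    free((char *)tp - tp->tls_size);
}
```

Because the offset is baked into the code as a constant, the block has to exist before the first access, which is why late initialization does not fit the initial-exec model.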

That said, late initialization has the drawback that if it fails, you have no option but to abort the process. If early initialization fails in the loader, you also abort the process, but at least no damage is done since the process could not do its job yet, and if early initialization fails in dlopen() you can just return failure and hope the application has a better idea of what to do now than to abort.

OSwhatever · Post by **OSwhatever** » Tue Apr 21, 2020 1:25 pm

nullplan wrote:I'm not sure I understand. The base of the FS or GS register can be set only with a descriptor (in 32-bit mode), with an MSR, or with a relatively new instruction called WRFSBASE or WRGSBASE. The first of these can obviously only be done by the kernel, since the descriptor tables are supervisor resources (if user space could write them, even just the LDT, it could install a ring 0 code segment and a call gate into that segment, usurping the machine). The second one can also only be done by the kernel since WRMSR and RDMSR are supervisor instructions. The last of these is only available with kernel support, but if that is communicated, it could be done entirely in user space. Since we are talking about a very quick system call that is happening at a time other resource intensive calls are happening, this is usually too little of a benefit and too much of a hassle to bother. If WRFSGSBASE is supported, the kernel can just patch its own code accordingly.

A little bit of a mistake by the x86 designers, in my opinion, unless you can configure user processes to be allowed to use WRFSBASE/WRGSBASE. They should have provided a register that user space processes could just set. My OS uses user space scheduling, and having a register that user space processes can set without the kernel helps. This requires that the kernel saves that register if there is a process change, of course. ARM has two of these: one register for the kernel only and one for user processes.

nullplan wrote:In case of initial exec, the base pointer is read in i386 and x86_64 with "movl %gs:0, %<target>" and "movq %fs:0, %<target>" respectively, so this stuff has to be initialized by the time control is passed to the application.

If the x86 ABI doesn't have something similar to __aeabi_read_tp in order to obtain the thread pointer to get to the initial-exec TLS area, is it possible to load fs/gs in user space with an invalid descriptor? Then when the program wants to access initial-exec TLS using fs/gs you get an exception. The kernel then reports to the user process, which allocates and initializes the TLS area. Would that be possible?

nullplan · Post by **nullplan** » Tue Apr 21, 2020 9:58 pm

OSwhatever wrote: A little bit of a mistake by the x86 designers, in my opinion

x86 was not designed, it grew over time. The 8086 had 16-bit segments to ease transition from the 8080, so every CPU in that line had to keep the 16-bit segments around. Then the 80286 was way over-engineered, and we're stuck with that design to this day. Then the 80386 gained an extra two segment registers, and they became useful only a few years down the line. And on and on.

OSwhatever wrote:unless you can configure user processes to be allowed to write WRFSBASE/WRGSBASE.

These instructions, as I said, can be used from user space. If the CPU supports them, and the kernel enabled them. So whether you can use them is a dynamic property. User space applications would have to keep both the code to use the system call and the new instructions around, and decide between the paths at run time. Oh, and the kernel would need to signal support for the new instructions to user space somehow. All of that to save one very fast system call. Most applications will not bother.
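The "keep both paths and decide at run time" cost can be pictured with a function pointer chosen at startup. This is only a simulation: the two setters are stand-ins that write a fake variable instead of issuing a system call or executing WRFSBASE, and the flag `kernel_allows_fsgsbase` stands for whatever mechanism the kernel would use to signal support. Every name is hypothetical.

```c
#include <stdint.h>

static uintptr_t fake_fs_base;   /* stand-in for the real FS base */
static int syscalls_made;        /* counts how often the slow path ran */

/* Stand-in for the system call path. */
static void set_base_via_syscall(uintptr_t base)
{
    syscalls_made++;
    fake_fs_base = base;
}

/* Stand-in for the WRFSBASE path; no kernel entry is counted. */
static void set_base_via_insn(uintptr_t base)
{
    fake_fs_base = base;
}

typedef void (*set_base_fn)(uintptr_t);

/* Default to the path that always works. */
static set_base_fn set_base = set_base_via_syscall;

/* Run once at startup with whatever the kernel reported. */
void thread_lib_init(int kernel_allows_fsgsbase)
{
    set_base = kernel_allows_fsgsbase ? set_base_via_insn : set_base_via_syscall;
}

void set_thread_base(uintptr_t base)
{
    set_base(base);
}

uintptr_t get_thread_base(void)
{
    return fake_fs_base;
}

int syscall_count(void)
{
    return syscalls_made;
}
```

Both implementations have to ship in every binary, which is the overhead being weighed against one fast system call.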

OSwhatever wrote:They should have provided a register that user space processes could just set.

They did! Sixteen of them, in fact. But no ABI could agree to just reserve one GPR for the thread pointer so now we're stuck with using the FS/GS base address.
OK, in 32-bit mode, you only have eight GPRs, and permanently losing one of them would be very harsh indeed.

OSwhatever wrote:My OS is using user space scheduling and having a register that user space processes can set without the kernel helps.

What's the point of user-space scheduling? You need kernel-space scheduling, anyway, so why not use it for threads as well?

OSwhatever wrote:is it possible to load fs/gs in user space with an invalid descriptor?

You could set them to zero, I suppose. That would invalidate them. Of course, now you are trading a system call now for a trap later, not sure how that is beneficial. And the trap is "General Protection Fault", that exception that is invoked for almost everything, and yet it still does not tell you why it is running. It will be hard to identify this specific fault, is my point.

OSDev.org
