This is not as much a discussion as just a chance to show off a few cool graphs I made with dot(1) when profiling static linking on my OS.
My operating system has a reasonably-complete homemade libc and uses static linking. Each libc function is placed in its own source file. This is nice because it allows static linking to pull exactly what is needed into the program. However, the standard library initializes itself before main and shuts itself down in exit() after main. This means that particular components (stdio, the printf engine, the heap - and more) get pulled unconditionally into every program. Since I ship with a lot of small utilities, this overhead accumulates.
Additionally, the early libc initialization can fail (for instance, if mallocing the stdout FILE itself fails), is inefficient as it doesn't go directly to main, consumes additional memory for the code that initializes data in addition to the data itself, and creates special early cases where standard library features (malloc and such) and language features (thread-local storage and such) are not yet operational. The lesson here is that all of this can go away if the code is restructured so global state is initialized at compile time and language features are initialized by the program loader. This means I can simply delete all the user-space initialization code.
I completed the final stages of this work yesterday, but I've solved related design problems over the past months:
Thread-local storage
I handled thread-local storage a while ago. The key issue is that the master copy of the thread-local storage is stored in an ELF segment that's not loaded. Existing kernels such as Linux pay no attention to it; the standard library literally has to parse the ELF file of the executable to locate it and subsequently load it. It also has to create a copy of it for the first thread, put the pthread structure after it (this is the layout mandated by the ABI), and finally set the thread self-pointer so it all works. There are also alignment considerations and such, and things can get terribly complicated - and this initialization bloat has to go into every single program. There are also special cases: you can't make errno a classic thread-local variable because it needs to be available before thread-local storage comes online. This design is absolutely mad.
My solution is much better: The kernel sets up thread-local storage for the first thread so it's already operational when the program entry runs. The thread-local storage is tied to the pthread structure, so the pthread library puts a special uthread structure in front of the pthread structure. This is a place where the kernel can conveniently put a pointer to the thread-local storage of the current thread, a pointer to the thread-local storage master copy, a pointer to where the initial stack is located, and such. The pthread library contains an ELF note that tells the kernel the size and alignment of the pthread structure. This allows the program loader to set up the pthread structure and self-pointer for the first thread, initializing the parts of struct pthread (uthread) the program loader knows about and zeroing the remainder. This means threading and thread-local storage are fully online when the program entry runs, so I can declare errno as "__thread int errno;" and it too is available when the program entry runs. It also means the stack of the initial thread can be unmapped when it exits, unlike on existing systems.
The only pthread initialization left to do in user-space is making the main thread joinable and taking the join lock.
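To illustrate, here's a simplified sketch of what such a uthread structure might look like. The field names and exact contents here are illustrative, not the real definition:
Code: Select all
#include <stddef.h>

/* Sketch of a uthread-style structure placed in front of struct pthread.
   The kernel fills in the parts it knows about and zeroes the rest. */
struct uthread
{
	struct uthread* uthread_pointer; /* self-pointer, reachable via the TLS register */
	size_t uthread_size;             /* so kernel and libc agree on the layout */
	void* tls_master;                /* master copy of the TLS segment in the ELF file */
	size_t tls_master_size;
	size_t tls_master_align;
	void* tls_copy;                  /* this thread's private copy of the TLS data */
	size_t tls_copy_size;
	void* stack;                     /* initial stack, so it can be unmapped on exit */
	size_t stack_size;
};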
Initialization of malloc
My earlier heap implementation required explicit initialization before main. When I recently wrote a new malloc, I deliberately designed its global state to be initialized entirely at compile time. This includes the global heap lock, pointers to the heap location, and other such meta information. The first time malloc is called, it notices there is no memory available and invokes the heap expansion code, which notices there is no heap location yet and relies on mmap to pick a suitable one. Subsequent expansions try to mmap next to the previous allocation. This scheme means malloc doesn't need to be explicitly initialized at program startup. The startup code still indirectly pulled in malloc for other reasons, and so did the exit code.
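In spirit, the global state and expansion logic look something like this simplified sketch (illustrative names; the real allocator tracks chunks and handles noncontiguous mmap results):
Code: Select all
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* All heap globals are initialized at compile time; no constructor runs. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char* heap_begin = NULL; /* no heap yet */
static size_t heap_size = 0;

/* Grow the heap: the first call lets mmap pick the location, later calls
   hint at the address right after the current heap. */
static bool heap_expand(size_t amount)
{
	void* hint = heap_begin ? heap_begin + heap_size : NULL;
	void* addition = mmap(hint, amount, PROT_READ | PROT_WRITE,
	                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ( addition == MAP_FAILED )
		return false;
	if ( !heap_begin )
		heap_begin = (unsigned char*) addition;
	heap_size += amount; /* simplification: assumes the mapping is contiguous */
	return true;
}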
Initialization of signals
Signals are delivered by my kernel by the affected thread noticing it has a signal pending (after a syscall returns, or when the scheduler switches to the thread while it is in user-space). The thread runs the kernel signal delivery code, which saves all the interrupted state on the user-space stack, creates the parameters for the signal handler, and jumps to it in user-space on the user-space stack. The signal handler executes normally and returns. The trick is then for the thread to jump back to whatever was interrupted. I previously handled this by having a special signal return function in the libc that was registered at libc initialization time; the kernel made the handler's return pointer point there. My new approach has the kernel automatically map a particular page containing short functions into every process. There is a sigreturn function in that page, and the signal delivery code sets the return pointer to that location. This means the standard library no longer has to carry its own copy of that function or register it, removing another step of early initialization during which signals were not operational.
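A rough sketch of the idea - the real stub is a few instructions of assembly with no prologue, and the syscall name and wrapper below are stand-ins for illustration:
Code: Select all
#include <sys/syscall.h>
#include <unistd.h>

/* Lives in the kernel-provided page mapped into every process. The signal
   delivery code sets the handler's return address to this location, so
   when the handler returns, this code runs. */
__attribute__((noreturn))
void sigreturn_stub(void)
{
	/* Ask the kernel to restore the interrupted registers and stack that
	   the signal delivery code saved on the user-space stack. */
	syscall(SYS_sigreturn); /* hypothetical syscall name */
	__builtin_unreachable();
}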
Minimal true(1)
And this is where yesterday's work began. I wanted to know exactly how much libc bloat got sucked into a program that needed none of it:
Code: Select all
int main(void) { return 0; }
As you can see, there's a lot of stuff being pulled in: the stdio file descriptor backend, large parts of the malloc implementation, the printf implementation, the exit code that pulls in the fclose() and closedir() functions and their backends, file locking, file buffer initialization, abort, assert, sysconf, signal set functions, calltracing support, signal masks, and a bunch of system calls. It's also scary how the process can abort() before main is even reached.
The key to reducing the bloat is to cut the dependency edges that don't need to exist and, preferably, to make the libc initialization a no-op. Additionally, exit() should do as little work as possible before it asks the kernel to destroy the process (at which point the kernel frees everything).
fcloseall() and dcloseall()
I had an implementation of the GNU extension fcloseall() and used it in exit(3) to close all files. Likewise, I had made a dcloseall() extension that does the same for open directory streams. However, these functions are silly and almost never safe to use. Perhaps it's useful to close all FILEs except stdin/stdout/stderr, but those get closed too - and that is dangerous since library code might use them. Additionally, all other FILEs and DIRs get closed behind the back of the code using them. But perhaps it's useful for security? Not really: it doesn't prevent file descriptor leakage, as not every file descriptor has an associated FILE. Finally, nobody uses them, at least in my ports collection. So I removed the functions and inlined their logic in exit(3) instead, as their purpose is still relevant on process exit, at least.
Indirectly calling fclose() from exit()
If you look at the original object dependency graph for true(1) above, you see that exit() pulls in fclose(), which in turn pulls in lots of stuff. This goes into every single program, as all programs contain an implicit call to exit if main returns. This is a costly edge if none of that code is needed, but it's also a tricky dependency edge to remove, as fclose must be called on all open files. I solved this by putting a function pointer to fclose inside the FILE object, initialized on creation of the FILE object (if such an object is created, then it must be closed, so it's okay for the FILE creation code to pull in fclose). The exit() function has access to the FILE linked list and can follow it without pulling in any object files; for each object it finds, it can use the indirect fclose pointer. With this change, fclose() only got pulled in from the stdio.o object file that declares stdin, stdout and stderr.
A similar trick is applied with closedir() and DIR objects.
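In code, the trick looks roughly like this simplified sketch (made-up member names; the real FILE has many more members):
Code: Select all
#include <stddef.h>

struct file
{
	struct file* next;                    /* linked list of all open FILEs */
	int (*fclose_indirect)(struct file*); /* assigned by the FILE creation code */
	/* ... buffers, fd backend, locking, ... */
};

static struct file* first_file; /* head of the open FILE list */

/* exit() can close every FILE through the stored pointer, so fclose.o is
   only linked in if something actually creates a FILE. */
static void exit_close_files(void)
{
	for ( struct file* fp = first_file; fp; fp = fp->next )
		fp->fclose_indirect(fp);
}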
Indirectly free()ing stdio buffers
The file closing code pulls in free() because it wants to free the internal buffers in the FILE objects. Newly created FILE objects don't have such buffers (including stdin, stdout and stderr); they are created on the first use of the FILE object unless set with setvbuf() before that. I cut this dependency using another indirect function pointer to free() in the FILE object, assigned when default buffers are allocated for the FILE object.
Indirectly fflush()ing on fclose()
If you examine the same object dependency graph, you also see that fclose() (through fshutdown.o) pulls in fflush(), which in turn pulls in a number of objects as well. I cut this dependency by putting a function pointer to fflush() in the FILE object; fclose() uses it if available and otherwise doesn't flush. This function pointer is assigned on the first use of the FILE object - it's safe not to fflush if the FILE object was never used. This removes a bunch of object files from the true(1) program.
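Sketched with the same caveats about names, both lazily assigned pointers (free and fflush) are only set once the code path that needs them actually runs:
Code: Select all
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct file
{
	struct file* next;
	void (*free_indirect)(void*);         /* set when a buffer is malloced */
	int (*fflush_indirect)(struct file*); /* set on first actual use */
	unsigned char* buffer;
};

/* Stand-in for the real internal flush logic. */
static int fflush_internal(struct file* fp)
{
	(void) fp;
	return 0;
}

/* Called lazily on the first read/write if setvbuf() didn't provide a
   buffer; only now do free() and the flush code get referenced. */
static bool setup_default_buffer(struct file* fp)
{
	if ( !(fp->buffer = (unsigned char*) malloc(BUFSIZ)) )
		return false;
	fp->free_indirect = free;
	fp->fflush_indirect = fflush_internal;
	return true;
}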
Using compile-time memory for stdin/stdout/stderr
I refactored the stdin/stdout/stderr initialization code so these objects are no longer malloced but are global variables instead. This removes the last startup dependency on the heap (the last exit dependency on the heap was removed above), and the heap is no longer pulled into true(1). Additionally, I no longer initialize the stdio objects with code; the global variables are entirely initialized at compile time. This even includes the linked list of all FILE objects.
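Sketched, the compile-time initialization looks something like this (again with illustrative names and far fewer members than the real struct):
Code: Select all
#include <stddef.h>

struct file
{
	struct file* next; /* the real struct has many more members */
	int fd;
};

/* Tentative definitions so the compile-time list can point forward. */
static struct file file_stdout, file_stderr;

static struct file file_stdin  = { .next = &file_stdout, .fd = 0 };
static struct file file_stdout = { .next = &file_stderr, .fd = 1 };
static struct file file_stderr = { .next = NULL,         .fd = 2 };

/* Even the list head is a compile-time constant; no code runs to set it up. */
static struct file* first_file = &file_stdin;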
This is what the object dependency graph now looks like: http://users-cs.au.dk/~sortie/sortix/tr ... educed.png
You'll notice how the heap and many other things are now gone. The base initialization objects remain, the exit code remains, the fclose implementation (and the files it depends on) remains, the file descriptor stdio backend remains, and some pthread locking remains. Pretty small executable, though.
Lazy exit(3)
But we can cut it down even further. The exit code doesn't need to close directory streams; those are just file descriptors and heap allocations, and the kernel destroys those on process exit regardless. That code can be removed entirely, which shaves a few bytes off exit().
Likewise, exit() doesn't really need to fully fclose() all open FILEs. It just needs to fflush the FILEs and call the you-got-closed handlers on the FILE objects (see fopencookie and funopen), if any. I modified the loop that fclose()s all files so it instead calls the indirect fflush pointer in the FILE object (see above) and the user-provided you-got-closed handler that's already referenced in the FILE object. This means exit() doesn't need fclose() at all. More importantly, this means the indirect fclose() pointer (see above) no longer needs to exist, so it's deleted. This cuts the dependency on fclose.o from stdio.o (as it doesn't need that pointer in the stdin/stdout/stderr objects), and fclose.o is no longer pulled into true(1).
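The exit-time loop then becomes, sketched with the same illustrative names:
Code: Select all
struct file
{
	struct file* next;
	int (*fflush_indirect)(struct file*); /* set on first use, may be null */
	void (*close_handler)(void*);         /* fopencookie()/funopen() callback */
	void* user;                           /* cookie passed to the handler */
};

static struct file* first_file;

/* exit() no longer references fclose at all: flush if the FILE was ever
   used, run the user's close handler if any, and let the kernel reclaim
   file descriptors and memory when the process dies. */
static void exit_file_cleanup(void)
{
	for ( struct file* fp = first_file; fp; fp = fp->next )
	{
		if ( fp->fflush_indirect )
			fp->fflush_indirect(fp);
		if ( fp->close_handler )
			fp->close_handler(fp->user);
	}
}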
Final result
This is the very reduced object dependency graph for true(1) with these libc changes: http://users-cs.au.dk/~sortie/sortix/tr ... educed.png
As you can see, only a few things remain: the process initialization code, the process exit code, the object containing the compile-time-initialized stdin/stdout/stderr, and the file descriptor stdio backend. That's pretty damn small. It's hard to trim things further, and doing so probably has no actual value.
Interestingly, all failure cases between _start and main have been removed, as well as all system calls, so it almost directly jumps to main. The only things that remain are the first thread's pthread initialization (making it joinable and taking a lock) and initializing program_invocation_short_name(3). This is nice because things are small, simple and robust now.
The size of true(1) has been significantly reduced. For this benchmark I compiled with -Os and stripped the executable. In the first dependency graph, the size of true(1) was 16 KB (with ELF overhead). Now it's just 2342 bytes (without ELF overhead) and 3.5 KB (with ELF overhead). This sounds like small savings, and it is. Mind that I normally compile all programs on my OS with -O2 -g -fsanitize=undefined, which creates much larger objects and executables, and there are a lot of small programs, so it accumulates. Either way, the robustness of having _start almost immediately jump to main() with no error cases is worth it.
Minimal hello world
For the record, I also profiled the static linking of a simple hello world program:
Code: Select all
#include <stdio.h>
int main(int argc, char* argv[])
{
printf("Hello %i\n", argc);
return 0;
}
The file size is 14 KB (with ELF overhead), so it's actually smaller than the 16 KB (with ELF overhead) true(1) from before yesterday's work.
Not much unreasonable stuff is getting pulled in. The heap can likely be left out if I initialize the stdin and stdout objects with compile-time default buffers rather than allocating them with malloc, plus some refactoring.
Going forward
The indirect pointer scheme is perhaps not ideal: it's a bit slow in principle and completely pointless if the indirected functions are needed regardless. I'll likely guard it with #ifdef __small_static_libc__ later on. What's more important is that I consider my immediate problem solved (that true(1) should be smaller and program initialization should be done mostly at compile time or program-load time). A major built-in limitation of static linking is that it operates only on object files and that it's dumb. I would very much like to employ link-time optimization in my operating system in the future. At present, it doesn't work with my cross-toolchain for mysterious reasons.