This is not as much a discussion as just a chance to show off a few cool graphs I made with dot(1) when profiling static linking on my OS.
My operating system has a reasonably-complete homemade libc and uses static linking. Each libc function is placed in its own source file. This is nice because it allows static linking to pull exactly what is needed into the program. However, the standard library initializes itself before main and shuts itself down in exit() after main. This means that particular components (stdio, the printf engine, the heap - and more) get pulled unconditionally into every program. Since I ship with a lot of small utilities, this overhead accumulates.
Additionally, the early libc initialization can fail (for instance, if mallocing the stdout FILE itself fails), is inefficient as it doesn't go directly to main, consumes additional memory for the code that initializes data in addition to the data itself, and creates special early cases where standard library features (malloc and such) and language features (thread-local storage and such) are not yet operational. The lesson here is that all of this can go away if the code is restructured so global state is initialized at compile time and language features are initialized by the program loader. This means I can simply delete all the user-space initialization code.
I completed the final stages of this work yesterday, but I've solved related design problems over the past months:
Thread-local storage
I handled thread-local storage a while ago. The key issue is that the master copy of the thread-local storage is stored in an ELF segment that's not loaded. Existing kernels such as Linux pay no attention to it; the standard library literally has to parse the ELF file of the executable to locate it and subsequently load it. It also has to create a copy of it for the first thread, put the pthread structure after it (this is the layout mandated by the ABI), and finally set the thread self-pointer so it all works. There are also alignment considerations and such, and things can get terribly complicated - and this initialization bloat has to go into every single program. There are also special cases: you can't make errno a classic thread-local variable because it needs to be available before thread-local storage comes online. This design is absolutely mad.
My solution is much better: The kernel sets up thread-local storage for the first thread so it's already operational when the program entry runs. The thread-local storage is tied to the pthread structure, so the pthread library puts a special uthread structure in front of the pthread structure. This is a place where the kernel can conveniently put a pointer to the thread-local storage of the current thread, a pointer to the thread-local storage master copy, a pointer to where the initial stack is located, and such. The pthread library contains an ELF note that tells the kernel the size and alignment of the pthread structure. This allows the program loader to set up the pthread structure and self-pointer for the first thread, initializing the parts of struct pthread (uthread) the program loader knows about and zeroing the remainder. This means threading and thread-local storage are fully online when the program entry runs, so I can declare errno as "__thread int errno;" and it too is available when the program entry runs. It also means the stack of the initial thread can be unmapped when it exits, unlike on existing systems.
The only pthread initialization left to do in user-space is making the main thread joinable and taking the join lock.
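To illustrate, here's a simplified sketch of what such a uthread structure might look like. The field names and exact contents here are illustrative, not the real definition:
Code: Select all
#include <stddef.h>

/* Sketch of a uthread-style structure placed in front of struct pthread.
   The kernel fills in the parts it knows about and zeroes the rest. */
struct uthread
{
	struct uthread* uthread_pointer; /* self-pointer, reachable via the TLS register */
	size_t uthread_size;             /* so kernel and libc agree on the layout */
	void* tls_master;                /* master copy of the TLS segment in the ELF file */
	size_t tls_master_size;
	size_t tls_master_align;
	void* tls_copy;                  /* this thread's private copy of the TLS data */
	size_t tls_copy_size;
	void* stack;                     /* initial stack, so it can be unmapped on exit */
	size_t stack_size;
};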
Initialization of malloc
My earlier heap implementation required explicit initialization before main. When I recently wrote a new malloc, I deliberately designed its global state to be initialized entirely at compile time. This includes the global heap lock, pointers to the heap location, and other such meta information. The first time malloc is called, it notices there is no memory available and invokes the heap expansion code, which notices there is no heap location yet and relies on mmap to pick a suitable one. Subsequent expansions try to mmap next to the previous allocation. This scheme means malloc doesn't need to be explicitly initialized at program startup. The startup code still indirectly pulled in malloc for other reasons, and so did the exit code.
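In spirit, the global state and expansion logic look something like this simplified sketch (illustrative names; the real allocator tracks chunks and handles noncontiguous mmap results):
Code: Select all
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* All heap globals are initialized at compile time; no constructor runs. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char* heap_begin = NULL; /* no heap yet */
static size_t heap_size = 0;

/* Grow the heap: the first call lets mmap pick the location, later calls
   hint at the address right after the current heap. */
static bool heap_expand(size_t amount)
{
	void* hint = heap_begin ? heap_begin + heap_size : NULL;
	void* addition = mmap(hint, amount, PROT_READ | PROT_WRITE,
	                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ( addition == MAP_FAILED )
		return false;
	if ( !heap_begin )
		heap_begin = (unsigned char*) addition;
	heap_size += amount; /* simplification: assumes the mapping is contiguous */
	return true;
}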
Initialization of signals
Signals are delivered by my kernel by the affected thread noticing it has a signal pending (after a syscall returns, or when the scheduler switches to the thread while it is in user-space). The thread runs the kernel signal delivery code, which saves all the interrupted state on the user-space stack, creates the parameters for the signal handler, and jumps to it in user-space on the user-space stack. The signal handler executes normally and returns. The trick is then for the thread to jump back to whatever was interrupted. I previously handled this by having a special signal return function in the libc that was registered at libc initialization time; the kernel made the handler's return pointer point there. My new approach has the kernel automatically map a particular page containing short functions into every process. There is a sigreturn function in that page, and the signal delivery code sets the return pointer to that location. This means the standard library no longer has to carry its own copy of that function or register it, removing another step of early initialization during which signals were not operational.
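A rough sketch of the idea - the real stub is a few instructions of assembly with no prologue, and the syscall name and wrapper below are stand-ins for illustration:
Code: Select all
#include <sys/syscall.h>
#include <unistd.h>

/* Lives in the kernel-provided page mapped into every process. The signal
   delivery code sets the handler's return address to this location, so
   when the handler returns, this code runs. */
__attribute__((noreturn))
void sigreturn_stub(void)
{
	/* Ask the kernel to restore the interrupted registers and stack that
	   the signal delivery code saved on the user-space stack. */
	syscall(SYS_sigreturn); /* hypothetical syscall name */
	__builtin_unreachable();
}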
Minimal true(1)
And this is where yesterday's work began. I wanted to know exactly how much libc bloat got sucked into a program that needed none of it:
Code: Select all
int main(void) { return 0; }
As you can see, there's a lot of stuff being pulled in: the stdio file descriptor backend, large parts of the malloc implementation, the printf implementation, the exit code that pulls in the fclose() and closedir() functions and their backends, file locking, file buffer initialization, abort, assert, sysconf, signal set functions, calltracing support, signal masks, and a bunch of system calls. It's also scary how the process can abort() before main is even reached.
The key to reducing the bloat is to cut the dependency edges that don't need to exist and, preferably, to make the libc initialization a no-op. Additionally, exit() should do as little work as possible before it asks the kernel to destroy the process (at which point the kernel frees everything).
fcloseall() and dcloseall()
I had an implementation of the GNU extension fcloseall() and used it in exit(3) to close all files. Likewise, I had made a dcloseall() extension that does the same for open directory streams. However, these functions are silly and almost never safe to use. Perhaps it's useful to close all FILEs except stdin/stdout/stderr, but those get closed too - and that is dangerous since library code might use them. Additionally, all other FILEs and DIRs get closed behind the back of the code using them. But perhaps it's useful for security? Not really: it doesn't prevent file descriptor leakage, as not every file descriptor has an associated FILE. Finally, nobody uses them, at least in my ports collection. So I removed the functions and inlined their logic in exit(3) instead, as their purpose is still relevant on process exit, at least.
Indirectly calling fclose() from exit()
If you look at the original object dependency graph for true(1) above, you see that exit() pulls in fclose(), which in turn pulls in lots of stuff. This goes into every single program, as all programs contain an implicit call to exit if main returns. This is a costly edge if none of that code is needed, but it's also a tricky dependency edge to remove, as fclose must be called on all open files. I solved this by putting a function pointer to fclose inside the FILE object, initialized on creation of the FILE object (if such an object is created, then it must be closed, so it's okay for the FILE creation code to pull in fclose). The exit() function has access to the FILE linked list and can follow it without pulling in any object files; for each object it finds, it can use the indirect fclose pointer. With this change, fclose() only got pulled in from the stdio.o object file that declares stdin, stdout and stderr.
A similar trick is applied with closedir() and DIR objects.
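In code, the trick looks roughly like this simplified sketch (made-up member names; the real FILE has many more members):
Code: Select all
#include <stddef.h>

struct file
{
	struct file* next;                    /* linked list of all open FILEs */
	int (*fclose_indirect)(struct file*); /* assigned by the FILE creation code */
	/* ... buffers, fd backend, locking, ... */
};

static struct file* first_file; /* head of the open FILE list */

/* exit() can close every FILE through the stored pointer, so fclose.o is
   only linked in if something actually creates a FILE. */
static void exit_close_files(void)
{
	for ( struct file* fp = first_file; fp; fp = fp->next )
		fp->fclose_indirect(fp);
}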
Indirectly free()ing stdio buffers
The file closing code pulls in free() because it wants to free the internal buffers in the FILE objects. Newly created FILE objects don't have such buffers (including stdin, stdout and stderr); they are created on the first use of the FILE object unless set with setvbuf() before that. I cut this dependency using another indirect function pointer to free() in the FILE object, assigned when default buffers are allocated for the FILE object.
Indirectly fflush()ing on fclose()
If you examine the same object dependency graph, you also see that fclose() (through fshutdown.o) pulls in fflush(), which in turn pulls in a number of objects as well. I cut this dependency by putting a function pointer to fflush() in the FILE object; fclose() uses it if available and otherwise doesn't flush. This function pointer is assigned on the first use of the FILE object - it's safe not to fflush if the FILE object was never used. This removes a bunch of object files from the true(1) program.
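Sketched with the same caveats about names, both lazily assigned pointers (free and fflush) are only set once the code path that needs them actually runs:
Code: Select all
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct file
{
	struct file* next;
	void (*free_indirect)(void*);         /* set when a buffer is malloced */
	int (*fflush_indirect)(struct file*); /* set on first actual use */
	unsigned char* buffer;
};

/* Stand-in for the real internal flush logic. */
static int fflush_internal(struct file* fp)
{
	(void) fp;
	return 0;
}

/* Called lazily on the first read/write if setvbuf() didn't provide a
   buffer; only now do free() and the flush code get referenced. */
static bool setup_default_buffer(struct file* fp)
{
	if ( !(fp->buffer = (unsigned char*) malloc(BUFSIZ)) )
		return false;
	fp->free_indirect = free;
	fp->fflush_indirect = fflush_internal;
	return true;
}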
Using compile-time memory for stdin/stdout/stderr
I refactored the stdin/stdout/stderr initialization code so these objects are no longer malloced but are global variables instead. This removes the last startup dependency on the heap (the last exit dependency on the heap was removed above), and the heap is no longer pulled into true(1). Additionally, I no longer initialize the stdio objects with code; the global variables are entirely initialized at compile time. This even includes the linked list of all FILE objects.
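Sketched, the compile-time initialization looks something like this (again with illustrative names and far fewer members than the real struct):
Code: Select all
#include <stddef.h>

struct file
{
	struct file* next; /* the real struct has many more members */
	int fd;
};

/* Tentative definitions so the compile-time list can point forward. */
static struct file file_stdout, file_stderr;

static struct file file_stdin  = { .next = &file_stdout, .fd = 0 };
static struct file file_stdout = { .next = &file_stderr, .fd = 1 };
static struct file file_stderr = { .next = NULL,         .fd = 2 };

/* Even the list head is a compile-time constant; no code runs to set it up. */
static struct file* first_file = &file_stdin;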
This is what the object dependency graph now looks like: http://users-cs.au.dk/~sortie/sortix/tr ... educed.png
You'll notice how the heap and many other things are now gone. The base initialization objects remain, the exit code remains, the fclose implementation (and the files it depends on) remains, the file descriptor stdio backend remains, and some pthread locking remains. Pretty small executable, though.
Lazy exit(3)
But we can cut it down even further. The exit code doesn't need to close directory streams; those are just file descriptors and heap allocations, and the kernel destroys those on process exit regardless. That code can be removed entirely, which shaves a few bytes off exit().
Likewise, exit() doesn't really need to fully fclose() all open FILEs. It just needs to fflush the FILEs and call the you-got-closed handlers on the FILE objects (see fopencookie and funopen), if any. I modified the loop that fclose()s all files so it instead calls the indirect fflush pointer in the FILE object (see above) and the user-provided you-got-closed handler that's already referenced in the FILE object. This means exit() doesn't need fclose() at all. More importantly, this means the indirect fclose() pointer (see above) no longer needs to exist, so it's deleted. This cuts the dependency on fclose.o from stdio.o (as it doesn't need that pointer in the stdin/stdout/stderr objects), and fclose.o is no longer pulled into true(1).
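The exit-time loop then becomes, sketched with the same illustrative names:
Code: Select all
struct file
{
	struct file* next;
	int (*fflush_indirect)(struct file*); /* set on first use, may be null */
	void (*close_handler)(void*);         /* fopencookie()/funopen() callback */
	void* user;                           /* cookie passed to the handler */
};

static struct file* first_file;

/* exit() no longer references fclose at all: flush if the FILE was ever
   used, run the user's close handler if any, and let the kernel reclaim
   file descriptors and memory when the process dies. */
static void exit_file_cleanup(void)
{
	for ( struct file* fp = first_file; fp; fp = fp->next )
	{
		if ( fp->fflush_indirect )
			fp->fflush_indirect(fp);
		if ( fp->close_handler )
			fp->close_handler(fp->user);
	}
}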
Final result
This is the very reduced object dependency graph for true(1) with these libc changes: http://users-cs.au.dk/~sortie/sortix/tr ... educed.png
As you can see, only a few things remain: the process initialization code, the process exit code, the object containing the compile-time-initialized stdin/stdout/stderr, and the file descriptor stdio backend. That's pretty damn small. It's hard to trim things further, and doing so probably has no actual value.
Interestingly, all failure cases between _start and main have been removed, as well as all system calls, so it almost directly jumps to main. The only things that remain are the first thread's pthread initialization (making it joinable and taking a lock) and initializing program_invocation_short_name(3). This is nice because things are small, simple and robust now.
The size of true(1) has been significantly reduced. For this benchmark I compiled with -Os and stripped the executable. In the first dependency graph, the size of true(1) was 16 KB (with ELF overhead). Now it's just 2342 bytes (without ELF overhead) and 3.5 KB (with ELF overhead). This sounds like small savings, and it is. Mind that I normally compile all programs on my OS with -O2 -g -fsanitize=undefined, which creates much larger objects and executables, and there are a lot of small programs, so it accumulates. Either way, the robustness of having _start almost immediately jump to main() with no error cases is worth it.
Minimal hello world
For the record, I also profiled the static linking of a simple hello world program:
Code: Select all
#include <stdio.h>
int main(int argc, char* argv[])
{
printf("Hello %i\n", argc);
return 0;
}
The file size is 14 KB (with ELF overhead), so it's actually smaller than the 16 KB (with ELF overhead) true(1) from before yesterday's work.
Not much unreasonable stuff is getting pulled in. The heap can likely be left out if I initialize the stdin and stdout objects with compile-time default buffers rather than allocating them with malloc, plus some refactoring.
Going forward
The indirect pointer scheme is perhaps not ideal: it's a bit slow in principle and completely pointless if the indirected functions are needed regardless. I'll likely guard it with #ifdef __small_static_libc__ later on. What's more important is that I consider my immediate problem solved (that true(1) should be smaller and program initialization should be done mostly at compile time or program-load time). A major built-in limitation of static linking is that it operates only on object files and that it's dumb. I would very much like to employ link-time optimization in my operating system in the future. At present, it doesn't work with my cross-toolchain for mysterious reasons.