How to best implement your own C library

max · Post by **max** » Sun Mar 08, 2015 5:56 am

Hey guys,

I'm getting tired of newlib and ffs I just can't find out how to properly make it threadsafe (it all works okay, but the whole thing just feels like an ugly black box), so I thought to myself, I'll just write my own implementation. So there's the C standard that defines what must exist and how it must work. But it seems like such a giant load of definitions and functions you have to write, so I wonder what's the best way to start implementing your own.

My approach would be just trying to compile programs (including my own) and see what stuff is missing; then step by step add it. But this makes it feel somehow - I don't know, incomplete... Also, as it is so much, is there some kind of way to find out if my stuff works as expected in the end? How did you do it?

And another question: there's also "libm". Should I write my own implementation of this too, and how hard is it, or are there good ones that I can just add on top? Does the C standard also define these? And are there any more libraries that I must have (so I can later compile libstdc++ and port programs without problems)?

Greets,
Max

gerryg400 · Post by **gerryg400** » Sun Mar 08, 2015 6:23 am

Just like you I was using newlib and didn't like it. One day I just removed it from my build and compiled everything to see what was missing. I think I started by implementing memcpy and just kept going. It took some time to get everything compiling and a while after that before I realised that testing is vitally important.
In the end I wrote a simple test harness and some tests. I've had lots of problems but I would definitely say it's worth it.
Ask questions here or on IRC and you'll get lots of help.
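
For illustration, a test harness of this kind can be very small. The following is only a minimal sketch of what one might look like, not the harness described in this post; `my_memcpy`, `CHECK`, and `run_all_tests` are made-up names, and the function under test is deliberately the simplest possible byte-by-byte copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical function under test. It is named my_memcpy so that it does
   not clash with the host libc when the tests run on a development machine. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

static int checks_run, checks_failed;

/* Record one check. A failure is reported with its location instead of
   aborting, so a single run lists every broken check. */
#define CHECK(cond) \
    do { \
        checks_run++; \
        if (!(cond)) { \
            checks_failed++; \
            fprintf(stderr, "FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond); \
        } \
    } while (0)

static void test_memcpy(void)
{
    char buf[8] = "xxxxxxx";
    CHECK(my_memcpy(buf, "abc", 3) == buf);  /* returns dst */
    CHECK(memcmp(buf, "abcxxxx", 8) == 0);   /* copies exactly n bytes */
    CHECK(my_memcpy(buf, "zzz", 0) == buf);  /* n == 0 copies nothing */
    CHECK(buf[0] == 'a');
}

/* Runs every test function and returns the number of failed checks. */
int run_all_tests(void)
{
    checks_run = checks_failed = 0;
    test_memcpy();
    assert(checks_run > 0);  /* guard against an accidentally empty suite */
    printf("%d checks, %d failed\n", checks_run, checks_failed);
    return checks_failed;
}
```

A test program's `main` can simply `return run_all_tests() != 0;` so that the build fails when any check does.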

alexfru · Post by **alexfru** » Sun Mar 08, 2015 6:33 am

There's a number of things in the library that maintain certain global state behind the scenes, e.g. errno (and functions referring to it), strtok(), rand(), malloc(), getenv(), gmtime(), [as]ctime(), etc etc. In some places you only need to guarantee exclusive access by means of locks and such. In other places you need to have per-thread state (e.g. errno), for which you need TLS. The C standard (up to 1999) doesn't define what needs to be thread-safe and how, it assumes threads don't exist or you know what you're doing. POSIX specs will likely tell you more here.
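
To make the two cases concrete, here is a rough sketch, assuming a C11 compiler with `_Thread_local` and a working pthreads implementation (both are assumptions; a hobby OS has to provide TLS and locks itself). `my_errno`, `my_rand`, and `my_srand` are hypothetical names chosen so the code can be tried on a host system. The errno-like variable is per-thread state, while the random generator is shared state guarded by a lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Per-thread state: every thread gets its own copy of this variable. */
static _Thread_local int my_errno_value;

int *my_errno_location(void) { return &my_errno_value; }
#define my_errno (*my_errno_location())

/* Shared state: one generator per process, so access is serialized. */
static pthread_mutex_t rand_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long rand_state = 1;

void my_srand(unsigned seed)
{
    pthread_mutex_lock(&rand_lock);
    rand_state = seed;
    pthread_mutex_unlock(&rand_lock);
}

int my_rand(void)
{
    int r;
    pthread_mutex_lock(&rand_lock);
    /* Same recurrence as the example generator in the C standard. */
    rand_state = rand_state * 1103515245UL + 12345UL;
    r = (int)((rand_state / 65536UL) % 32768UL);
    pthread_mutex_unlock(&rand_lock);
    return r;
}

static void *worker(void *arg)
{
    (void)arg;
    my_errno = 42;  /* only this thread's copy changes */
    return NULL;
}

/* Usage: a write to my_errno in another thread leaves ours untouched. */
void demo_per_thread_errno(void)
{
    pthread_t t;
    int rc;
    my_errno = 7;
    rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    pthread_join(t, NULL);
    assert(my_errno == 7);
}
```

Returning a pointer from `my_errno_location()` and hiding it behind a macro is one common way to keep `errno`-style code source compatible while the storage moves into TLS.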

I think trying to compile various programs to find out what's missing in your library is not a very clever idea. For one thing, if you don't know all those functions and variables and macros that are mandatory, you're unlikely to collect or write programs using all of those. IOW, you don't know what you don't know. Just open up the language standard and see the list of all required headers and their contents.

You need to implement ~100 standard library functions for ANSI C if you exclude wide/multibyte characters and floating point. That's not a lot and is doable in a few months.

As for making sure the stuff works, there exist tests. gcc has a battery of its own tests. You could perhaps compile and run those? You could also write a few tests for those areas where you're not sure and run them against the stock compiler and your cross compiler with your library.
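
One way to run a test against both the stock library and your own is a differential test: feed identical inputs to both and compare the results byte for byte. The sketch below is an illustration under assumptions, not a complete suite; `my_strncpy` stands in for the implementation under test, and the host's `strncpy` is treated as the reference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical implementation under test. */
char *my_strncpy(char *dst, const char *src, size_t n)
{
    size_t i = 0;
    for (; i < n && src[i] != '\0'; i++)
        dst[i] = src[i];
    for (; i < n; i++)
        dst[i] = '\0';  /* the standard requires NUL padding up to n */
    return dst;
}

/* Run both implementations on identical, pre-poisoned buffers and compare
   every byte, including the ones that must stay untouched.
   Returns 1 when they agree and 0 otherwise. */
int strncpy_agrees(const char *src, size_t n)
{
    char mine[16], ref[16];
    assert(n <= sizeof mine);
    memset(mine, 0x55, sizeof mine);
    memset(ref, 0x55, sizeof ref);
    my_strncpy(mine, src, n);
    strncpy(ref, src, n);
    return memcmp(mine, ref, sizeof mine) == 0;
}

/* Counts disagreements over a few edge cases: empty input, short input,
   input exactly as long as n, and input longer than n. */
int strncpy_mismatches(void)
{
    static const char *const inputs[] = {
        "", "a", "abc", "exactly8", "longer than n"
    };
    int bad = 0;
    size_t i, n;
    for (i = 0; i < sizeof inputs / sizeof inputs[0]; i++)
        for (n = 0; n <= 10; n++)
            bad += !strncpy_agrees(inputs[i], n);
    return bad;
}
```

Poisoning the buffers first matters: it catches an implementation that writes past `n` or forgets the padding, which a plain `strcmp` of the results would miss.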

While writing my library I found a number of places where the standard seemed a bit unclear as to the expected behavior and I needed extra info to gain clarity. You will run into the same problem (or already have). Resolution can be found by:
- seeing how test programs behave when compiled with stock compilers
- googling bug reports on specific functions and reading other forum discussions about those
- reading the source of existing libraries in the hope of finding useful comments or gathering the idea from the code itself (oh, and if you're attentive, you'll find bugs in others' code :) )

sortie · Post by **sortie** » Sun Mar 08, 2015 6:46 am

Some protips:

- Take standards compliance seriously.
- Prefer doing stuff well rather than fast and hacky.
- stdio is much more complicated than you think:
  - Implement fread as many small fgetc calls and fwrite as many small fputc calls. Do not make optimized calls until the fgetc and fputc semantics are entirely correct, because:
  - ungetc is a conspiracy and the whole stdio implementation has to be in on it.
- Start by implementing your favourite parts and what you need to run a few sample programs of your own.
- Continue by adding what you need when Cross-Porting Software.
- Read POSIX 2008 http://pubs.opengroup.org/onlinepubs/9699919799/. Don't worry too much about strict ISO C namespace compliance and that just yet.
- Determine completeness by seeing whether you provide all the stuff declared in the headers (see POSIX).
- The XSI option group in POSIX (and some others) are usually awful stuff you'd like to be without.
- Strive to not implement older compatibility things and obsoleted interfaces, adopt and embrace newer interfaces.
- Don't have a separate libm and libpthread. That's old-school idiocy and actually creates more problems than it solves and actually doesn't even give the intended flexibility they wanted. Just put everything into the libc library binary.
- Think about threading and thread safety early on, and ensure you use locks where needed, even if the locks are no-op, so threading is easier to add later on.
- Don't forget to check out existing implementations like musl, the BSDs, and other hobbyist libcs. It's good to see how they do stuff, and they're usually cleaner than glibc and newlib.
- Use the libm from musl. I think that's the best around right now. I took the one from netbsd and it fails the musl libc testsuite.
- Just get started and don't hesitate to ask semantic questions, but it's a long road, and it's worth considering adopting another libc (musl or a hobbyist one).
- You'll get a world of pain from gnulib (used in many GNU packages) if you have your own libc. This is their fault, and you get to deal with it.
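
To illustrate the stdio tip above, here is a deliberately tiny sketch of why building fread on top of fgetc pays off: the pushback from ungetc is handled in exactly one place. Everything here is hypothetical (`my_file`, `my_fgetc`, `my_ungetc`, `my_fread`, `MY_EOF`); the "stream" is just a read-only memory buffer, and a real FILE would also need a file descriptor, buffering, error and EOF flags, and a lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MY_EOF (-1)

/* Hypothetical read-only stream over a memory buffer. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;   /* index of the next byte to read from data */
    int pushed;   /* byte pushed back by my_ungetc, or MY_EOF if none */
} my_file;

int my_fgetc(my_file *f)
{
    if (f->pushed != MY_EOF) {  /* a pushed-back byte is returned first */
        int c = f->pushed;
        f->pushed = MY_EOF;
        return c;
    }
    if (f->pos == f->size)
        return MY_EOF;
    return f->data[f->pos++];
}

/* One byte of pushback is all the standard guarantees. */
int my_ungetc(int c, my_file *f)
{
    if (c == MY_EOF || f->pushed != MY_EOF)
        return MY_EOF;
    f->pushed = (unsigned char)c;
    return f->pushed;
}

/* fread expressed as repeated my_fgetc calls, so it honors pushback
   without any special case of its own. */
size_t my_fread(void *ptr, size_t size, size_t nmemb, my_file *f)
{
    unsigned char *out = ptr;
    size_t total, done = 0;
    if (size == 0 || nmemb > SIZE_MAX / size)
        return 0;
    total = size * nmemb;
    while (done < total) {
        int c = my_fgetc(f);
        if (c == MY_EOF)
            break;
        out[done++] = (unsigned char)c;
    }
    return done / size;  /* number of complete elements read */
}

/* Usage: the pushed-back byte shows up at the start of the next my_fread. */
void demo_pushback(void)
{
    my_file f = { (const unsigned char *)"abc", 3, 0, MY_EOF };
    char buf[3];
    int first = my_fgetc(&f);       /* consumes 'a' */
    int back = my_ungetc('X', &f);  /* pushes back a different byte */
    size_t n = my_fread(buf, 1, 3, &f);
    int last = my_fgetc(&f);
    assert(first == 'a' && back == 'X');
    assert(n == 3 && buf[0] == 'X' && buf[1] == 'b' && buf[2] == 'c');
    assert(last == MY_EOF);
}
```

An optimized `my_fread` that copied straight out of `data` would have to remember the pushed-back byte on its own, which is the kind of bug the tip warns about.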

Nable · Post by **Nable** » Sun Mar 08, 2015 7:34 am

sortie wrote:

Don't have a separate libm and libpthread. That's old-school idiocy and actually creates more problems than it solves and actually doesn't even give the intended flexibility they wanted. Just put everything into the libc library binary.

When libm is a separate static library, functions from it can be inlined by LTO (LTCG). It isn't important for a beginner but I cannot say that it's "idiocy" (although I have to agree that the whole LTO thing is the consequence of toolchains that ####^W are, erm, not very well designed).

sortie · Post by **sortie** » Sun Mar 08, 2015 8:04 am

Nable wrote:
sortie wrote:

Don't have a separate libm and libpthread. That's old-school idiocy and actually creates more problems than it solves and actually doesn't even give the intended flexibility they wanted. Just put everything into the libc library binary.
When libm is a separate static library, functions from it can be inlined by LTO (LTCG). It isn't important for a beginner but I cannot say that it's "idiocy" (although I have to agree that the whole LTO thing is the consequence of toolchains that ####^W are, erm, not very well designed).

I don't think I understand your point, and I think you might be somewhat mistaken. LTO (link-time optimization) does work within a single static library. I rather think LTO is a good thing. Splitting stuff into multiple libraries to somehow use LTO is just insane, especially because LTO doesn't need this to work. Or perhaps there are limitations in the current implementation I'm not aware of?

Actually, I'm thinking of entirely different issues that arise from splitting what should be just libc into multiple libraries like libm, libpthread, librt, libdl and so on. The obvious issue is whenever those libraries need each other or want to use each other. It's really easy to get cyclic dependencies. For instance, if libc's flockfile wants libpthread's pthread_mutex_lock, and libpthread's pthread_create wants libc's mmap. This is especially troublesome with static linking where you have to do multiple passes over each .a file (see the --start-group and --end-group linker options if this happens to you). Or worse, if a program wants to use flockfile, but libpthread isn't linked in. The result is often that libc tends to reinvent parts of libpthread just for this purpose, or if libpthread overrides symbols in libc.

It's a whole lot of issues. I actually ended up making my cross-gcc link in libpthread, libm and libc unconditionally in a --start-group and --end-group loop just so that I could get rid of most of the complexity and get actually coherent semantics. But that's actually equivalent to everything just being in libc.a. That's a step I'll take soon, combining it into a single library. There are absolutely no real advantages to the split, and you can get rid of a whole lot of unfortunate complexity by merging them.

The real reason I say this is idiocy is that it's a poor design, and when other systems embraced this design they didn't manage to get the advantages they wanted (like threading libraries being interchangeable). This is early threading history thinking, where people somehow think it's a good idea to layer threads onto single-threaded processes with lots of complexity, rather than just natively doing them in the kernel and core standard library. The reason libm is its own thing is that the floating-point math code is an awful thing to write, and someone had already made a public domain awful implementation of it (bad code quality; modern libms tend to be surprisingly still like that) and everyone forked it into their own projects, so it was natural to keep it as its own library. I don't know why librt is its own library, but it has something to do with the POSIX realtime extensions being optional prior to POSIX 2008 and people still thinking the pattern of splitting libc into more libraries was a remotely good idea.

Nable · Post by **Nable** » Sun Mar 08, 2015 8:18 am

sortie wrote:I don't think I understand your point, and I think you might be somewhat mistaken. LTO (link-time optimization) does work within a single static library. I rather think LTO is a good thing. Splitting stuff into multiple libraries to somehow use LTO is just insane, especially because LTO doesn't need this to work. Or perhaps there are limitations in the current implementation I'm not aware of?

As I've seen, libm is often stored as an AR archive with a lot of .o files inside it. Each .o file has only one math function in it, and when you link with -lm only the necessary functions come into the resulting binary (even if you don't have LTO). If you have LTO, then those functions can also be inlined into the functions of your program (instead of just being linked in as separate pieces of code). The same trick doesn't seem to be possible for the main libc, as its functions have a lot of dependencies on each other, and it seems better to just keep libc as a large _dynamic_ library (and forget about LTO).

sortie · Post by **sortie** » Sun Mar 08, 2015 9:49 am

The same trick is perfectly usable with the main libc, as I do with my personal libc. It's pretty simple, even. I just put every function from a standard header in its own file, inside a directory with the same name as the header. Each function can easily call functions from other headers by just including the standard header and doing so. Regardless of the linking used, it's a really clean way to organize a libc. Static linking works perfectly by only pulling in the dependencies of every libc symbol used. There are still a few issues, like exit(3) being used by every program (and thus fclose(3) gets pulled in, which in turn pulls in free(3), and thus the heap, and a lot of stuff that isn't strictly needed), but I took care of that.

Nable · Post by **Nable** » Sun Mar 08, 2015 3:51 pm

Oh, it's really awesome. You've shown me the way out of Linux's bloat, thank you!

max · Post by **max** » Tue Mar 10, 2015 3:40 am

Hey guys,

thanks for all the great information. I think I'll just dive right in and see what happens. After all, it doesn't seem so much, and it feels like a much cleaner way than using newlib.
Sortie, the thing you stated about fread/fwrite and how to implement them, do you have any source where I can read more about that?

Greets,
Max

sortie · Post by **sortie** » Tue Mar 10, 2015 3:59 am

max wrote:Sortie, the thing you stated about fread/fwrite and how to implement them, do you have any source where I can read more about that?

Check out the applicable standards (ISO C, POSIX). They just define fread and fwrite as a series of fgetc and fputc calls (do note you definitely want to flockfile and funlockfile around those calls, so the fread and fwrite calls are atomic with respect to other threads). It's the definitions of fgetc, fputc, and ungetc that are interesting, especially when you add buffering. Check out my libc, but I won't guarantee that it's entirely semantically correct, just not blatantly wrong as far as I know. I'll be happy to have a look for you. It's not that this is that hard once you know stdio well, but I'm just saying this as a warning not to fall into a common trap when implementing stdio for the first time.
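
As a rough illustration of the locking remark, the sketch below wraps a character-at-a-time read loop in the POSIX `flockfile`/`funlockfile` pair and uses `getc_unlocked` inside, so the stream lock is taken once per call rather than once per byte. It runs against the host's stdio and is not taken from any particular libc; `locked_fread` and `demo_locked_fread` are made-up names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* fread-like function: one lock acquisition around many unlocked
   single-character reads, so no other thread can interleave its reads. */
size_t locked_fread(void *ptr, size_t size, size_t nmemb, FILE *fp)
{
    unsigned char *out = ptr;
    size_t total, done = 0;
    if (size == 0 || nmemb > SIZE_MAX / size)
        return 0;
    total = size * nmemb;
    flockfile(fp);
    while (done < total) {
        int c = getc_unlocked(fp);
        if (c == EOF)
            break;
        out[done++] = (unsigned char)c;
    }
    funlockfile(fp);
    return done / size;  /* number of complete elements read */
}

/* Usage: returns the number of bytes read back from a temporary file,
   or -1 if the temporary file could not be created. */
int demo_locked_fread(void)
{
    char buf[5];
    size_t n;
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    fputs("hello", fp);
    rewind(fp);
    n = locked_fread(buf, 1, sizeof buf, fp);
    fclose(fp);
    assert(n == 5 && buf[0] == 'h' && buf[4] == 'o');
    return (int)n;
}
```

In your own libc the same shape applies, with your internal lock and your unlocked fgetc in place of the host functions.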

Candy · Post by **Candy** » Tue Mar 10, 2015 4:13 am

Do you know how many programs use fungetc ?

alexfru · Post by **alexfru** » Tue Mar 10, 2015 4:19 am

Candy wrote:Do you know how many programs use fungetc ?

What an absurd question! Sorry.

But you can reasonably expect that some programs that parse variable structures in files will use ungetc().

Further, all *scan*() functions must unread characters back into the buffer under certain conditions.
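
A small example of that pattern, written against the host's stdio: a scanner only knows a number has ended once it has read one character too many, and that character has to go back. `read_uint` and `demo_parse_sum` are hypothetical helpers for illustration, roughly a tiny slice of what `fscanf("%u")` has to do.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Reads an unsigned decimal number from fp. Returns 1 and stores the value
   on success, or 0 if the input does not start with a digit. */
int read_uint(FILE *fp, unsigned *out)
{
    int c, ndigits = 0;
    unsigned value = 0;
    assert(fp != NULL && out != NULL);
    while ((c = fgetc(fp)) != EOF && isdigit(c)) {
        value = value * 10 + (unsigned)(c - '0');
        ndigits++;
    }
    if (c != EOF)
        ungetc(c, fp);  /* c belongs to whatever is parsed next */
    if (ndigits == 0)
        return 0;
    *out = value;
    return 1;
}

/* Usage: parses "12+34" and returns the sum, or -1 on any failure. */
int demo_parse_sum(void)
{
    unsigned a = 0, b = 0;
    int op, ok;
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    fputs("12+34", fp);
    rewind(fp);
    ok = read_uint(fp, &a);        /* stops at '+' and pushes it back */
    op = fgetc(fp);                /* so '+' is still available here */
    ok = ok && read_uint(fp, &b);  /* stops at end of file */
    fclose(fp);
    if (!ok || op != '+')
        return -1;
    return (int)(a + b);
}
```

Without the `ungetc` call the `'+'` would be silently swallowed by the first `read_uint`, and the caller could no longer tell which operator separated the two numbers.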

Candy · Post by **Candy** » Tue Mar 10, 2015 6:02 am

Just trying to be pragmatic. If there's a single function causing a lot of complexity, I'll try to chop it off even if it is not completely unused or useless.

Don't think I can skip all *scan* functions though, those are fairly commonly used... sadly. Maybe I can try to make them without fungetc - if the things they scan are LL(1) an fpeekc() would suffice, and be nice to the other bits of implementation.

sortie · Post by **sortie** » Tue Mar 10, 2015 7:59 am

Candy wrote:Do you know how many programs use fungetc ?

I do, but you mean ungetc, there's no fungetc.

A quick grep in my ports tree reveals it's used by tar, grub (in gnulib), bison, quake, nasm, libmpc, wget, libmpfr, libiconv, e2fsprogs, groff, libstdc++, gcc, git, libjpeg, binutils, python, grep, hello, sed, gzip (in gnulib), gettext, bzip2, parted (in gnulib), diffutils (in gnulib), libgnutls (in gnulib), perl, libgmp, texinfo (in gnulib), libfontconfig, patch, and m4.

In other words, it's a standard and widely implemented C89 function and a massive compatibility constraint. This is a very hard battle to fight. You definitely want ungetc if you want to be compatible. I assure you that inventing a non-standard fpeekc function and changing all that software to use it will get tiresome very fast.

It's not that ungetc is that bad to implement, it's just another constraint that affects how your stdio implementation should be designed, and ignoring it in the early phases can lead to surprising bugs, such as ftell and fseek returning the true current location rather than the virtual one.
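
A quick way to see the "virtual" position is to probe an existing libc. For a binary stream, ISO C says the file position indicator is decremented by each successful ungetc, so after reading one byte and pushing it back, ftell should report 0 rather than the underlying offset. The probe below is only a sketch; `ftell_after_ungetc` is a made-up name, and it relies on `tmpfile()` being available on the host.

```c
#include <assert.h>
#include <stdio.h>

/* Returns what ftell reports after one byte has been read and pushed back,
   or -1 if the temporary file could not be set up. */
long ftell_after_ungetc(void)
{
    long before, after;
    int c;
    FILE *fp = tmpfile();  /* a binary stream opened for update */
    if (fp == NULL)
        return -1;
    if (fputs("abc", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    rewind(fp);
    c = fgetc(fp);       /* reads 'a' */
    before = ftell(fp);  /* one byte consumed so far */
    ungetc(c, fp);       /* push the same byte back */
    after = ftell(fp);   /* the position the program is meant to see */
    fclose(fp);
    assert(before == 1);
    return after;
}
```

Running the same probe against your own libc is a cheap regression test for this particular trap.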

I do agree with your approach to not implement parts that cause needless complexity, and I do do a lot of that in my implementation, but each battle has a price, and this one is very high as the function is very portable, it's used by a lot of software, there's no alternative standard interface, it's not obsoleted, it does things that can't be done without it, an fpeekc interface is not as powerful, and ungetc really isn't that terrible. (fpeekc will also add some complexity compared to an implementation without either.)

OSDev.org
