Fixing BSD sockets

nullplan · Post by **nullplan** » Sat Oct 24, 2020 2:46 am

Hi all,

it seems established at this point that BSD sockets aren't well designed. They don't fit too well into UNIX, requiring their own system calls to handle even reading and writing, because read() and write() aren't powerful enough to handle everything recv() and send() can do. ioctl() is a wart on all UNIX FD communication, but sockets make things worse by not only participating in that mess, but also adding setsockopt() and getsockopt() into the mix. And the interface on the C level is an attempt at object-oriented programming in a language not designed for it, and therefore is error-prone to the last. I mean, basically struct sockaddr is meant to be an abstract base class, from which all the actual socket addresses are derived. The abstraction was so leaky, they had to introduce struct sockaddr_storage as another abstract socket address that is large enough to handle all socket address types the system has to offer.
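To make the struct sockaddr complaint concrete, here is a minimal sketch in C of what "deriving" from the base class looks like in practice: the caller holds a struct sockaddr_storage, passes it around as a struct sockaddr pointer, and every consumer has to switch on sa_family and cast by hand. The helper name format_addr is made up for this illustration; the socket types and inet_ntop() are the standard ones.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: format an IPv4/IPv6 socket address as text.
 * Returns 0 on success, -1 for families this sketch does not handle. */
int format_addr(const struct sockaddr *sa, char *out, size_t outlen)
{
    char host[INET6_ADDRSTRLEN];

    switch (sa->sa_family) {    /* sa_family acts as the hand-rolled type tag */
    case AF_INET: {
        /* "Downcast" from the base type to the IPv4 address type. */
        const struct sockaddr_in *in4 = (const struct sockaddr_in *)sa;
        if (inet_ntop(AF_INET, &in4->sin_addr, host, sizeof host) == NULL)
            return -1;
        /* The port is stored in network byte order. */
        snprintf(out, outlen, "%s:%u", host, (unsigned)ntohs(in4->sin_port));
        return 0;
    }
    case AF_INET6: {
        const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *)sa;
        if (inet_ntop(AF_INET6, &in6->sin6_addr, host, sizeof host) == NULL)
            return -1;
        snprintf(out, outlen, "[%s]:%u", host, (unsigned)ntohs(in6->sin6_port));
        return 0;
    }
    default:
        return -1;              /* unknown "subclass" */
    }
}
```

Nothing in the type system stops a caller from passing a buffer that is too small for the family it claims to be, which is the kind of error the paragraph above is complaining about.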

So I was wondering (and the search function didn't turn up anything useful) if anyone had made an attempt at fixing the things. My problem is I want my OS to be POSIX compatible, and for that the POSIX library must at least be implementable on the system calls my OS will provide. Ideally the syscall layer would be something sane, and then the POSIX library would provide compatibility.

My problems with BSD sockets:

1. It is overly general. Is there ever any need to bind() active TCP sockets? And what the hell does listen() do?
2. Common operations take too many steps. Opening an active TCP socket requires a call to socket() and connect(). Opening a passive one requires a call to socket(), bind(), listen(), and accept().
3. Too many too similar functions. Is there a need for read(), recv(), recvfrom(), and recvmsg()?
4. struct sockaddr. In its entirety. From all the pointer conversions down to the fact that someone thought it would be a good idea to bring byte order into the mix.
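On the "too many too similar functions" point, the overlap can be shown directly: a rough sketch, assuming only the standard recvmsg(), of how the narrower receive calls could be thin wrappers over the most general one. The names my_recvfrom, my_recv and my_read_socket are hypothetical, and this ignores details such as ancillary data and the MSG_TRUNC reporting that only recvmsg() exposes.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* recvfrom() expressed through recvmsg(): one buffer, optional source address. */
ssize_t my_recvfrom(int fd, void *buf, size_t len, int flags,
                    struct sockaddr *src, socklen_t *srclen)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {0};

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_name = src;                                /* may be NULL */
    msg.msg_namelen = (src && srclen) ? *srclen : 0;

    ssize_t n = recvmsg(fd, &msg, flags);
    if (n >= 0 && src && srclen)
        *srclen = msg.msg_namelen;                     /* report actual address size */
    return n;
}

/* recv() is recvfrom() without the source address. */
ssize_t my_recv(int fd, void *buf, size_t len, int flags)
{
    return my_recvfrom(fd, buf, len, flags, NULL, NULL);
}

/* read() on a socket behaves like recv() with no flags. */
ssize_t my_read_socket(int fd, void *buf, size_t len)
{
    return my_recv(fd, buf, len, 0);
}
```

So a kernel could in principle expose only the recvmsg()-shaped call and leave the rest to the library; what that does not answer is how read() on non-socket descriptors should fit in.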

At the same time, each of these also has its benefits. Yes the API is overly general (and that does go hand in hand with point two), but that also makes it somewhat flexible, and allows the same calls to be used for different address families. getaddrinfo() was pretty much designed around the thing, and allows applications to create IPv4 and IPv6 sockets without caring what the system is configured with, or what getaddrinfo() returned.
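As a sketch of what that family-agnostic style looks like, here is the usual getaddrinfo() loop for a passive TCP socket. The function name listen_any is invented for this example; everything it calls is standard. Note that the code never mentions AF_INET or AF_INET6, it just tries whatever getaddrinfo() hands back.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a listening TCP socket on the given service
 * (port number or name) using whichever address family works first.
 * Returns the descriptor, or -1 on failure. */
int listen_any(const char *service, int backlog)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;       /* IPv4 or IPv6, we don't care */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;       /* wildcard address, for bind() */

    if (getaddrinfo(NULL, service, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;                  /* family not available, try the next */
        if (bind(fd, ai->ai_addr, ai->ai_addrlen) == 0 &&
            listen(fd, backlog) == 0)
            break;                     /* success */
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}
```

The ai_addr pointer is passed straight through to bind() without the program ever looking inside it, which is exactly the benefit of the opaque struct sockaddr that a set of per-protocol calls would give up.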

Anyway, as far as attempts to fix it go, I have seen dial() on Plan 9, which essentially takes a string description of what the user wants and returns a suitable socket. My problem is that it is possible to implement dial() on BSD sockets, but the other way round would be pretty messy. I do like the cleanliness of the design, though, if I were to add it to the kernel. The catch is that I would have to add a string parser to the kernel (I mean, another one, after the path name reader). Not the end of the world, tho.
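For the "dial() on top of BSD sockets" direction, a rough sketch might look like the following. It only handles the full three-part net!host!service form with tcp or udp as the network, and the names parse_dial and dial_sketch are invented here; the real Plan 9 dial() does considerably more (defaults, announce/listen, other networks).

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Split "net!host!service" into its three parts. Returns 0 on success. */
int parse_dial(const char *s, char *net, size_t netlen,
               char *host, size_t hostlen, char *svc, size_t svclen)
{
    const char *b1 = strchr(s, '!');
    const char *b2 = b1 ? strchr(b1 + 1, '!') : NULL;
    if (b1 == NULL || b2 == NULL)
        return -1;

    size_t n = (size_t)(b1 - s);
    size_t h = (size_t)(b2 - b1 - 1);
    size_t v = strlen(b2 + 1);
    if (n == 0 || h == 0 || v == 0 || n >= netlen || h >= hostlen || v >= svclen)
        return -1;

    memcpy(net, s, n);       net[n] = '\0';
    memcpy(host, b1 + 1, h); host[h] = '\0';
    memcpy(svc, b2 + 1, v);  svc[v] = '\0';
    return 0;
}

/* Hypothetical dial()-like call built on getaddrinfo()/socket()/connect().
 * Returns a connected descriptor, or -1 on failure. */
int dial_sketch(const char *dialstr)
{
    char net[8], host[256], svc[32];
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    if (parse_dial(dialstr, net, sizeof net, host, sizeof host,
                   svc, sizeof svc) != 0)
        return -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    if (strcmp(net, "tcp") == 0)
        hints.ai_socktype = SOCK_STREAM;
    else if (strcmp(net, "udp") == 0)
        hints.ai_socktype = SOCK_DGRAM;
    else
        return -1;                     /* network not handled by this sketch */

    if (getaddrinfo(host, svc, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                     /* connected */
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}
```

A call would look like dial_sketch("tcp!example.com!80"). The parser is the only genuinely new piece, which gives a feel for how much string handling would have to move into the kernel if dial() were the primitive.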

Microsoft added their own spin on it with their *Ex() functions (like AcceptEx()), but those don't fix the API, only extend it. Myself, I thought of making the calls less general (e.g. not having a socket() call, but socket_tcp4(), socket_tcp6(), etc., and replacing connect() with connect_tcp4(), or connect_tcp6()). Since each of those could have their own signatures, it would not be necessary to even have a struct sockaddr, you could just give address and port directly as arguments to connect_tcp4(). A compatibility library with POSIX would be messy, but feasible. But all of this only tackles points 1, 2, and 4, leaving out point 3, and I have no idea what to do there. read() is pretty much required for other things (although I thought of making preadv() the system call, and letting the other ones be compatibility functions. But then, not every FD is seekable, and I don't know if I can put the burden of tracking the file position onto the POSIX library, or if something internal would break), but what to do about the multitude of other functions?
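To give an idea of what the "messy, but feasible" compatibility layer could look like, here is a sketch of a POSIX-style connect() that unpacks the struct sockaddr and forwards to per-protocol calls. Everything named sys_connect_tcp4, sys_connect_tcp6, last_call and compat_connect is hypothetical; the two "system calls" are stubs that only record their arguments so the sketch compiles and can be exercised, and their signatures are just one guess at what such calls might take.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Record of the last stubbed "system call", for illustration only. */
struct last_call {
    int fd;
    int family;
    uint32_t addr4;        /* host byte order */
    uint8_t addr6[16];
    uint16_t port;         /* host byte order */
};
struct last_call last;

/* Stub for a hypothetical connect_tcp4 system call: address and port are
 * plain integers in host byte order, no struct sockaddr involved. */
int sys_connect_tcp4(int fd, uint32_t addr, uint16_t port)
{
    last.fd = fd; last.family = AF_INET; last.addr4 = addr; last.port = port;
    return 0;
}

/* Stub for a hypothetical connect_tcp6 system call. */
int sys_connect_tcp6(int fd, const uint8_t addr[16], uint16_t port)
{
    last.fd = fd; last.family = AF_INET6; last.port = port;
    memcpy(last.addr6, addr, 16);
    return 0;
}

/* POSIX-shaped connect() in the compatibility library: inspect the family,
 * validate the length, undo the byte order, and dispatch. */
int compat_connect(int fd, const struct sockaddr *sa, socklen_t len)
{
    if (sa == NULL || len < (socklen_t)offsetof(struct sockaddr, sa_data)) {
        errno = EINVAL;
        return -1;
    }

    switch (sa->sa_family) {
    case AF_INET: {
        struct sockaddr_in in4;
        if (len < (socklen_t)sizeof in4) { errno = EINVAL; return -1; }
        memcpy(&in4, sa, sizeof in4);
        return sys_connect_tcp4(fd, ntohl(in4.sin_addr.s_addr),
                                ntohs(in4.sin_port));
    }
    case AF_INET6: {
        struct sockaddr_in6 in6;
        if (len < (socklen_t)sizeof in6) { errno = EINVAL; return -1; }
        memcpy(&in6, sa, sizeof in6);
        return sys_connect_tcp6(fd, in6.sin6_addr.s6_addr,
                                ntohs(in6.sin6_port));
    }
    default:
        errno = EAFNOSUPPORT;
        return -1;
    }
}
```

The sketch glosses over the part that makes it messy: the library would also need to remember, per descriptor, which socket_*() call created it, so that a connect() with a mismatched family or a UDP descriptor goes to the right place or fails with the right error.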

Anyway, what are your thoughts on these issues? Do you have any other solutions? Completely ridding the system of BSD sockets is only allowed, though, if BSD sockets can still be implemented on top of your proposal. I simply don't have the time to rewrite all networking code I would like to run on my system in terms of a different API.

rdos · Post by **rdos** » Sat Oct 24, 2020 7:41 am

Well, I find the whole idea that a socket is a file and should be accessed with file IO operations absurd. This is not real object oriented programming, rather a mess.

I did my own socket API to begin with, and especially have the push() function that will send data without waiting for timeouts.
I also have a wait function that will wait for new data and buffer space functions. I then designed a real C++ wrapper class.

OTOH, I also have implemented the POSIX API, but this needed new syscalls since the strangeness with sockets as files couldn't be implemented at user level using my ordinary syscalls.

PeterX · Post by **PeterX** » Sat Oct 24, 2020 10:05 am

nullplan wrote:Anyway, what are your thoughts on these issues? Do you have any other solutions? Completely ridding the system of BSD sockets is only allowed, though, if BSD sockets can still be implemented on top of your proposal. I simply don't have the time to rewrite all networking code I would like to run on my system in terms of a different API.

I think the only option is the one you already mentioned: Build a socket API/ABI without file-I/O on the low level and implement a POSIX library on top of that. Of course you need a kind of UFS for that (however that is done, could be just a common ABI).

I'm not sure what rdos meant, but I guess he means this: (If I understand the sockets ABI/API correctly) you still have to use some non-file-calls even if you use file-I/O for sockets. So the file abstraction is kind of senseless.

Greetings
Peter

eekee · Post by **eekee** » Sat Oct 24, 2020 1:10 pm

Have you seen how Plan 9's APE does it, and if it's good enough? (APE is A Posix Emulator or Environment or something.) It emulates sockets if _BSD_EXTENSION is defined, but I've little idea how well it does. Sockets are listed in the APE paper, but not mentioned in the Common Problems section. They were good enough to support X11, years ago. See /sys/src/ape/lib/bsd/ I guess.

Korona · Post by **Korona** » Sat Oct 31, 2020 12:55 pm

rdos wrote:Well, I find the whole idea that a socket is a file and should be accessed with file IO operations absurd. This is not real object oriented programming, rather a mess.

For all intents and purposes, UNIX files are objects (or capabilities, as other OSes like to call this concept).

Schol-R-LEA · Post by **Schol-R-LEA** » Sat Oct 31, 2020 2:55 pm

Korona wrote:
rdos wrote:Well, I find the whole idea that a socket is a file and should be accessed with file IO operations absurd. This is not real object oriented programming, rather a mess.
For all intents and purposes, UNIX files are objects (or capabilities, as other OSes like to call this concept).

'Capabilities' are quite different from 'objects', and UNIX's bags o' bytes aren't either. But then, you have to consider that object-oriented programming as we know it today wasn't a thing when Thompson and Ritchie decided to generalize file streams as the primary system interface (yes, Nygaard had already developed Simula by then, but it wasn't intended for general programming, and certainly not for systems programming).

The stream-oriented pipe system interface - a generalization of Multics' file streams - was more or less the only general-purpose IPC provided by UNIX for a long time. Hell, early UNIX didn't even have mutexes and locks until they were retrofitted into System 6 (circa 1976), AFAIK, and the lock system that was developed at that point was built on top of the existing FD handles (by way of .lck files). For several years, this wasn't really enforced by the system, either, being more of an honor system which each application had to implement support and testing for.

Whether the file abstraction is a good choice for this is left as an exercise for the readers.

In any case, when sockets were developed, well, for better or worse, they just took the path of least resistance and made them work as close to FDs and streams as they could.

Korona · Post by **Korona** » Sun Nov 01, 2020 4:18 am

No, capabilities and UNIX FDs are exactly the same. They represent the process-local right (or a handle) to access an object.

UNIX files are not bags of bytes. (What are the bytes in an eventfd, signalfd, epoll FD, pidfd, GPU device, userfaultfd, irqfd or their BSD equivalents?)

Also, UNIX' "everything is a file" concept is one of the most commonly misunderstood lines in OS design. In today's language, we would phrase that as: "everything is a capability" or "everything is a file descriptor". UNIX and its modern variants never represented all objects as part of the file system.

rdos · Post by **rdos** » Sun Nov 01, 2020 4:54 am

Korona wrote:No, capabilities and UNIX FDs are exactly the same. They represent the process-local right (or a handle) to access an object.

UNIX files are not bags of bytes. (What are the bytes in an eventfd, signalfd, epoll FD, pidfd, GPU device, userfaultfd, irqfd or their BSD equivalents?)

Also, UNIX' "everything is a file" concept is one of the most commonly misunderstood lines in OS design. In today's language, we would phrase that as: "everything is a capability" or "everything is a file descriptor". UNIX and its modern variants never represented all objects as part of the file system.

The main problem is that TCP/IP connections, and UDP/IP "connections" in particular, are not stream-oriented but rather packet-oriented. A file is not a true stream either since it has a beginning and an end, and this is also why we can set the position of a file and get the size. A stream has no size and no position. Socket data typically have delimiters and are interpreted as packets rather than streams. The main drawback of encapsulating packet-oriented data in a stream is that you no longer know where one packet ends and another starts. This is also why there are implicit timeouts in the socket API, since the original "push" function couldn't be supported within a stream context. Since I view sockets as a packet-oriented protocol, I have no implicit timeouts and users must call "push" to send the current packet. The read function also differs between files and sockets. With files, it's deterministic whether a number of bytes can be read, and the operation typically blocks until the data has been read from the file. For a socket, there is no guarantee that the requested bytes will ever be sent, and so read on a socket is non-blocking and just returns whatever is in the buffer. To wait for socket data there is a specific blocking function that is based on the wait-for-event API.
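The boundary problem described here is what applications on a plain byte stream usually work around themselves, most commonly with a length prefix. A minimal sketch, assuming a blocking stream descriptor and a 4-byte big-endian length header; send_frame and recv_frame are made-up names, not part of any socket API, and this is not the push()-based design described above.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on short reads. Returns 0 or -1
 * (end of stream in the middle of a frame counts as an error). */
static int read_all(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one message: 4-byte big-endian length, then the payload. */
int send_frame(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    if (write_all(fd, &hdr, sizeof hdr) != 0)
        return -1;
    return write_all(fd, payload, len);
}

/* Receive one message into buf. Returns the payload length, or -1 on
 * error or if the message does not fit into cap bytes. */
ssize_t recv_frame(int fd, void *buf, size_t cap)
{
    uint32_t hdr;
    if (read_all(fd, &hdr, sizeof hdr) != 0)
        return -1;
    uint32_t len = ntohl(hdr);
    if (len > cap)
        return -1;
    if (read_all(fd, buf, len) != 0)
        return -1;
    return (ssize_t)len;
}
```

The loops in read_all() and write_all() are the price of the stream abstraction: a single read() may return part of a message or parts of several, so the receiver has to reassemble them.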

OTOH, I do find the handle concept useful, but I have different types of handles. If I open a socket handle I cannot use it with the file-API, and if I open a file handle, I cannot use it with the socket API. I think this is natural.

Actually, UNIX associates devices with filenames in the file system, and also uses IOCTL for various stuff that doesn't fit into the concept. Even Win32 opens non-file objects by passing strange filenames to the open file function, something they do since "everything is a file". So, I'm not convinced that I'm wrong when I say that UNIX and Win32 actually think everything is a file, and when it is not, it can still be handled like a file by opening device files or other strangeness.

It has been a long time since I decided that "everything is not a file" in my OS, and that the IOCTL function should be banned from any sane OS.

Schol-R-LEA · Post by **Schol-R-LEA** » Sun Nov 01, 2020 10:17 am

Korona wrote:No, capabilities and UNIX FDs are exactly the same. They represent the process-local right (or a handle) to access an object.

There's rather more to capabilities than that, at least as I understand them; but then, a number of OSes use the term to refer to things which, technically speaking, aren't capabilities either.

No major OS in current use supports capabilities; a number of minor or experimental ones such as Coyote and Agora do, but no one who understands capabilities - even to the rather limited extent which I do personally - would call anything involving an access control list a 'capability'.

I am willing to chalk this up to a disagreement on definitions, however.

PeterX · Post by **PeterX** » Sun Nov 01, 2020 10:20 am

Schol-R-LEA wrote:
Korona wrote:No, capabilities and UNIX FDs are exactly the same. They represent the process-local right (or a handle) to access an object.
There's rather more to capabilities than that, at least as I understand them; but then, a number of OSes use the term to refer to things which, technically speaking, aren't capabilities either.

No major OS in current use supports capabilities; a number of minor or experimental ones such as Coyote and Agora do, but no one who understands capabilities - even to the rather limited extent which I do personally - would call anything involving an access control list a 'capability'.

I am willing to chalk this up to a disagreement on definitions, however.

Does Genode use capabilities?

EDIT: What's the difference between access control lists and capabilities? Or is that difficult to explain?

Greetings
Peter

Schol-R-LEA · Post by **Schol-R-LEA** » Sun Nov 01, 2020 10:30 am

PeterX wrote:Does Genode use capabilities?

EDIT: What's the difference between access control lists and capabilities? Or is that difficult to explain?

I am unfamiliar with Genode, so I can't say. As for the differences between capabilities and ACLs, this thread tried to cover that previously.

This topic probably should be moved to a different thread, as it doesn't directly relate to Sockets.

PeterX · Post by **PeterX** » Sun Nov 01, 2020 10:37 am

Schol-R-LEA wrote:
PeterX wrote:Does Genode use capabilities?

EDIT: What's the difference between access control lists and capabilities? Or is that difficult to explain?
I am unfamiliar with Genode, so I can't say. As for the differences between capabilities and ACLs, this thread tried to cover that previously.

This topic probably should be moved to a different thread, as it doesn't directly relate to Sockets.

OK I will open up a new thread.

As for sockets, there are synchronous and asynchronous socket operations.
If I understand it correctly, synchronous sockets can't be done well with file I/O operations. Do I understand that correctly or am I wrong here?

Greetings
Peter

eekee · Post by **eekee** » Sun Nov 01, 2020 12:12 pm

Schol-R-LEA wrote:early UNIX didn't even have mutexes and locks until they were retrofitted into System 6 (circa 1976), AFAIK, and the lock system that was developed at that point was built on top of the existing FD handles (by way of .lck files). For several years, this wasn't really enforced by the system, either, being more of an honor system which each application had to implement support and testing for.

Whether the file abstraction is a good choice for this is left as an exercise for the readers.

Lock files were still in common use when I started using Linux in the late 90s. I don't think anyone liked them, but there they were.

rdos wrote:The main problem is that TCP/IP connections, and UDP/IP "connections" in particular, are not stream-oriented but rather packet-oriented. A file is not a true stream either since it has a beginning and an end, and this is also why we can set the position of a file and get the size. A stream has no size and no position. Socket data typically have delimiters and are interpreted as packets rather than streams. The main drawback of encapsulating packet-oriented data in a stream is that you no longer know where one packet ends and another starts. This is also why there are implicit timeouts in the socket API, since the original "push" function couldn't be supported within a stream context. Since I view sockets as a packet-oriented protocol, I have no implicit timeouts and users must call "push" to send the current packet. The read function also differs between files and sockets. With files, it's deterministic whether a number of bytes can be read, and the operation typically blocks until the data has been read from the file. For a socket, there is no guarantee that the requested bytes will ever be sent, and so read on a socket is non-blocking and just returns whatever is in the buffer. To wait for socket data there is a specific blocking function that is based on the wait-for-event API.

OTOH, I do find the handle concept useful, but I have different types of handles. If I open a socket handle I cannot use it with the file-API, and if I open a file handle, I cannot use it with the socket API. I think this is natural.

Actually, UNIX associates devices with filenames in the file system, and also uses IOCTL for various stuff that doesn't fit into the concept. Even Win32 opens non-file objects by passing strange filenames to the open file function, something they do since "everything is a file". So, I'm not convinced that I'm wrong when I say that UNIX and Win32 actually think everything is a file, and when it is not, it can still be handled like a file by opening device files or other strangeness.

It has been a long time since I decided that "everything is not a file" in my OS, and that the IOCTL function should be banned from any sane OS.

Interesting read, thanks. Thompson and Ritchie had ioctl for a long time, but ditched it for Plan 9. At the same time, they made a change to files and pipes, making them preserve message boundaries. If process A sends 4031 bytes to process B and B is waiting to read 4031 bytes or more, it'll get those 4031 bytes in a single read. I think this makes Plan 9 files suitable for packets, although Plan 9 itself will break up messages larger than its 8KB limit. (A lad wanted to up it to 64KB in 9front, but Chief Kernel Guy didn't see the point at the time and no-one else was interested.)

Separating the APIs for files and sockets may be natural, but I've known a lot of people to be very happy with the ability to mount /net from a different machine. It's an instant tunnel with no extra coding required. I've heard bad things about tunnelling TCP over TCP — which mounting remote /net is, by default — but it works fine for a lot of people. Some (not all) of those people use IL instead of TCP. IL lacks congestion control so may not have the same problems.

I do have a little problem with the statement, "With files, it's deterministic if a number of bytes can be read," but I have this problem with Unix and Plan 9 themselves. Introduce networking between filesystem and user program, and the read is no longer quite so deterministic.

So basically, I wish everything was a socket.

Plan 9's files are half-way there, but not close enough.

OSDev.org
