The read and write functions (and their pread and pwrite counterparts, and the readv, writev, preadv, and pwritev functions) have really good semantics as IO primitives; they're just not always what the user expects. I've been meaning to write a proper blog post about these functions and why they are good primitives, but for now I'll just write something here, even though I just woke up.
The read and write functions promise to either fail or transfer at least one byte of IO. The exact behaviour depends on the code implementing the inode, and the kernel code will usually do whatever is easiest. For instance, if a file is being read and 1337 bytes were requested but only 42 bytes are in the cache, then it's perfectly reasonable to just return those 42 bytes. Perhaps the program is able to do something useful with them; if it really needs more bytes, it can simply invoke the read system call again.
Alternatively, try to imagine if the system calls were required to complete all the requested IO. Network programming would become bothersome: normally you allocate a larger buffer and read into it, and the kernel fills in whatever it can. If you forced it to fill the buffer entirely, you could easily deadlock when the remote sends all the data it wishes to send and then waits for a response, but that data wasn't enough to fill the buffer. The same applies to pipes and unix sockets. Arguably it isn't a problem for files, but it's best to keep the same IO semantics for pipes and files. That said, the IO likely always does complete entirely for files, and a lot of programs unfortunately depend on this implementation detail.
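To make the short-read case concrete, here is a minimal sketch of the usual pattern when reading from a socket or pipe; the function name and buffer size are my own choices for illustration:
Code:
#include <stdio.h>
#include <unistd.h>

/* Hypothetical example, not from any particular codebase: read whatever is
   currently available, up to the buffer size, and process it. A short read
   is normal here; it is not an error. */
void handle_input(int fd)
{
	char buf[4096];
	ssize_t amount = read(fd, buf, sizeof(buf));
	if ( amount < 0 )
		perror("read"); /* a genuine error */
	else if ( amount == 0 )
		puts("end of stream"); /* the other end closed the connection */
	else
		printf("got %zd bytes\n", amount); /* use whatever arrived */
}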
It's also worth noting that the return type is ssize_t, but the count parameter is size_t. It is unspecified what happens if the count exceeds SSIZE_MAX, but given the "give me at least one byte or fail" semantics, it's reasonable to simply do if ( (size_t) SSIZE_MAX < count ) { count = SSIZE_MAX; } and truncate the request.
(The same discussion applies to the other read and write functions mentioned above.)
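If you want that truncation in one place, a thin wrapper will do; this is just a sketch of the clamp described above, with a name of my own invention:
Code:
#include <limits.h>
#include <unistd.h>

/* Hypothetical wrapper, not a standard function: clamp oversized requests so
   the ssize_t return value can always represent how many bytes were
   transferred; callers simply call again for the rest. */
ssize_t read_clamped(int fd, void* buf, size_t count)
{
	if ( (size_t) SSIZE_MAX < count )
		count = SSIZE_MAX;
	return read(fd, buf, count);
}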
There is another possible IO primitive that could give the user more control and settle this: "Give me at least x bytes, but at most y bytes". It is a bit more bothersome to implement in the kernel, and perhaps not even worth it, as it is trivial to build such functions upon the read/write functions:
Code:
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read at least `least` bytes and at most `max` bytes into buf. Returns the
   number of bytes read; a return value less than `least` means error or EOF. */
size_t readleast(int fd, void* buf, size_t least, size_t max)
{
	ssize_t amount = read(fd, buf, max);
	if ( amount < 0 ) { return 0; }
	if ( least && !amount ) { return 0; /* unexpected EOF */ }
	if ( (size_t) amount < least )
	{
		void* nextbuf = (uint8_t*) buf + amount;
		size_t nextleast = least - amount;
		size_t nextmax = max - amount;
		amount += readleast(fd, nextbuf, nextleast, nextmax);
	}
	return amount;
}

/* Write at least `least` bytes and at most `max` bytes from buf. Returns the
   number of bytes written; a return value less than `least` means an error. */
size_t writeleast(int fd, const void* buf, size_t least, size_t max)
{
	ssize_t amount = write(fd, buf, max);
	if ( amount < 0 ) { return 0; }
	if ( least && !amount ) { return 0; /* unexpected EOF */ }
	if ( (size_t) amount < least )
	{
		const void* nextbuf = (const uint8_t*) buf + amount;
		size_t nextleast = least - amount;
		size_t nextmax = max - amount;
		amount += writeleast(fd, nextbuf, nextleast, nextmax);
	}
	return amount;
}
The key thing about read/write is that the kernel code can do whatever is easiest and most efficient, and then rely on the program to make another call if that wasn't enough. This potentially even makes the system more responsive. Note how these semantics are great for a kernel, but not really what users expect. This is why layers such as FILE with fread/fwrite have been built on top of the Unix IO primitives. However, a large number of programs use the primitives directly, which means they have to deal with the primitives likely not doing what they want.
I provide the above functions in my libc to ease file descriptor programming. I also provide readall, writeall, preadleast, pwriteleast, preadall, and pwriteall (the all versions are simply calls where least=max, that is, "give me exactly N bytes of input and only less upon error"). You can check for errors in these calls by testing whether they return less than least. An error could potentially have occurred if they return something between least and max, but it's not an error for your program at that point, and you'll get the error for real on the next read call on the file descriptor.
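For the record, the all variants can be expressed directly in terms of the least variants above, since they are just the least=max case; a sketch in my own formulation, so the actual libc code may differ:
Code:
/* Sketch of the all variants as the least=max case: anything less than count
   signals an error or EOF to the caller. */
size_t readall(int fd, void* buf, size_t count)
{
	return readleast(fd, buf, count, count);
}

size_t writeall(int fd, const void* buf, size_t count)
{
	return writeleast(fd, buf, count, count);
}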
I hope this clears things up. I would advise against changing these semantics to cater to a higher level like FILE or C++ streams; rather, implement those layers on top of the primitives as described here. Also note that you are free to make read/write on files always complete the requested amount, but programs written for your OS will then likely become non-portable, because they will come to assume these semantics, which does all other operating systems a disservice. It's better to make people use a higher-level API or some extensions like readleast.