
Re: Modern storage is plenty fast. It is the APIs that are bad

Posted: Wed Dec 23, 2020 4:21 am
by Korona
OSwhatever wrote:While this might make small IO transfers faster, the question is whether it will matter in a real-life program. One scenario is a server that serves thousands of clients; with busy waiting there is a risk of congestion, and things could take longer because HW resources aren't available. For several concurrent IO operations, I still think a client-server filesystem architecture might be better, where IO operations are queued at the block level.
You are right that there needs to be some fallback to yield the thread; you probably only want to busy wait for a few ms, and if the I/O does not complete in time, you want to yield (Linux's io_uring, which the article refers to, supports this use case). You are also right that this only really makes sense if the NVMe driver writes directly to user space, taking the buffer cache out of the equation. That's why this feature is mostly used with direct I/O (O_DIRECT) on Linux (as stated in the post linked in the OP).
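To make that concrete, here is a rough sketch of a polled O_DIRECT read using liburing (the file path, block size and queue depth are placeholders I made up, error handling is mostly omitted, and you need to link with -luring). The yield-after-a-few-ms fallback would sit around the wait call; I left it out to keep the sketch short.

Code:
#define _GNU_SOURCE          /* for O_DIRECT */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;

    /* IORING_SETUP_IOPOLL: completions are reaped by polling the device's
     * completion queue instead of waiting for an interrupt. */
    if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
        return 1;

    /* Placeholder path; the file has to live on an NVMe-backed filesystem
     * and be opened with O_DIRECT for polled I/O to work. */
    int fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* O_DIRECT wants aligned buffers */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    /* With IOPOLL this busy-polls for the completion rather than sleeping. */
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}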

Note that this "polling I/O" is not a substitute for queuing many requests at the block level. You still want asynchronous I/O (i.e. let each CPU thread keep multiple requests in flight, not just one) even if you perform busy polling.
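For illustration, this is roughly what "multiple requests in flight per thread" looks like with the same interface (the queue depth, block size and offsets are arbitrary numbers for the example, and the ring and fd are assumed to be set up as in the sketch above, with at least QD SQ entries):

Code:
#include <liburing.h>

#define QD 32        /* requests kept in flight by this one thread */
#define BS 4096

static void read_batch(struct io_uring *ring, int fd, void *bufs[QD])
{
    /* Queue QD independent reads, then submit them all with one call. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, bufs[i], BS, (unsigned long long)i * BS);
        sqe->user_data = i;               /* lets us match completions later */
    }
    io_uring_submit(ring);

    /* Reap all QD completions; they may arrive in any order. */
    for (int i = 0; i < QD; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        /* cqe->user_data says which request finished, cqe->res its result. */
        io_uring_cqe_seen(ring, cqe);
    }
}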

As for the use cases: I think that's mainly servers that need to load lots of data and serve it over the network. Think about streaming services like YouTube or web CDNs like Cloudflare: you have thousands of concurrent connections. Each connection wants to read a lot of data from disk and you certainly cannot afford one CPU thread per connection. Here the reasoning does not have to be that storage is so fast that you do not need to wait anyway, but rather that you want to minimize the latency until your data hits the NIC. NAS servers would work similarly. Another application would be loading screens in games: you want to load texture and model data as fast as possible and you do not care about exhausting the CPU to do that.

EDIT: here are some slides from Jens Axboe, the author of the io_uring interface, that include benchmarks of polling vs. IRQ-driven I/O: https://www.slideshare.net/ennael/kerne ... gh-iouring, slides 41 and following. Apparently, this interface has already been adopted by Postgres and internally by Facebook (Axboe is a Facebook employee).

Re: Modern storage is plenty fast. It is the APIs that are bad

Posted: Wed Dec 23, 2020 7:44 pm
by Ethin
Korona wrote:EDIT: here are some slides from Jens Axboe, the author of the io_uring interface, that include benchmarks of polling vs. IRQ-driven I/O: https://www.slideshare.net/ennael/kerne ... gh-iouring, slides 41 and following.
Well, this is fascinating. I'll be certain to go check those out. Getting IRQ-driven anything working is difficult when you're writing a function that needs to return yet (somehow) keep managing state, at least it is for me, but that's off-topic. (You'd think Rust would make this easier, but it doesn't -- lifetime issues and all.) Thanks for the slides -- I'll have a look!