
Modern storage is plenty fast. It is the APIs that are bad.

Posted: Fri Dec 11, 2020 3:49 pm
by eekee
The topic title is the title of an article; the forum doesn't allow topic titles to be long enough to make that clear.

Article: Modern storage is plenty fast. It is the APIs that are bad.

I think it's an interesting subject. The article seems to claim that most caching and other optimization algorithms currently in use only slow things down with modern NVMe drives. The arguments I understand seem rational, but I don't have the attention span to go through it all properly.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Fri Dec 11, 2020 5:35 pm
by Korona
Shameless self-promotion: my Managarm talk at CppCon 2020 also touches on this subject: click me (how do I use the YouTube BB tag?).

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Sat Dec 12, 2020 8:04 am
by bzt
Interesting topic.

But the author is mistaken on many points: the title suggests there's something wrong with APIs in general, yet reading the blog post makes it clear the author has problems with a specific Rust implementation (AsyncRead), and he's comparing Rust libraries, not API concepts. Not surprisingly, his tests are inconsistent, and his own implementation comes out of the contest as the best. I would really like to see a fair comparison, also with a plain syscall-only test as a baseline!

He also says it is a misconception that
“I am designing a system that needs to be fast. Therefore it needs to be in memory”.
He's clearly wrong about this. This is NOT a misconception: memory is always going to be faster than any storage (no matter the API or the library) unless the entire architecture is built from memristors.

He's allegedly comparing "Buffered I/O", "Direct I/O" and "Direct I/O with readahead", completely forgetting that any kind of readahead also needs a memory buffer, so it's not a "direct" read at all.

Furthermore, a true "modern storage" system outperforms a single NVMe device pretty easily. A Hitachi VSP 5000 can do more than 20 million IOPS and more than 130 GB/s of throughput by using multiple controllers, a complex disk and RAM cache hierarchy, and parallelized operations. And it can work with the classic POSIX API just fine...

Cheers,
bzt

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Sat Dec 12, 2020 9:48 am
by OSwhatever
bzt wrote:He's allegedly comparing "Buffered I/O", "Direct I/O" and "Direct I/O with readahead", completely forgetting that any kind of readahead also needs a memory buffer, so it's not a "direct" read at all.
Also worth mentioning is that even though modern storage is fast, there is still software overhead in reading from a block device (e.g. setting up the hardware, creating a request with possible IPC, and the associated memory allocations). On my system, even though the disk image is actually stored in RAM, reading cached blocks is significantly faster because there is less software overhead.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Sat Dec 12, 2020 10:57 am
by eekee
It's no wonder I was confused if the author of the article doesn't understand the difference between concept and implementation! :lol: I wondered about "direct I/O with readahead" too; it makes no sense to me. I suppose it might make sense to someone who treats operating systems as magic boxes and accepts whatever interface they provide. On that point, I'd like to see tests against bare hardware rather than through syscalls.

Would anyone mind if I linked this thread to the mailing list where I heard about the article? It's 9fans; many of the people on it modify as well as use Plan 9.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Sat Dec 12, 2020 3:25 pm
by Korona
bzt wrote:Furthermore, a true "modern storage" system outperforms a single NVMe device pretty easily. A Hitachi VSP 5000 can do more than 20 million IOPS and more than 130 GB/s of throughput by using multiple controllers, a complex disk and RAM cache hierarchy, and parallelized operations. And it can work with the classic POSIX API just fine...
They "work" fine with the classic POSIX API but the POSIX API does not allow parallelism without multiple software threads. These NVMe devices can have thousands of request queues, yet a single POSIX thread can only submit to one queue at a time.

The main message of the article is correct even though details certainly depend on the use case.
eekee wrote:It's no wonder I was confused if the author of the article doesn't understand the difference between concept and implementation! :lol:
bzt wrote:He's allegedly comparing "Buffered I/O", "Direct I/O" and "Direct I/O with readahead", completely forgetting that any kind of readahead also needs a memory buffer, so it's not a "direct" read at all.
The author is certainly also aware of the OS-level concepts (and not only of a Rust implementation), as he is a contributor to the Linux kernel (git log turns up file system-related code, the CFQ I/O scheduler, SLUB, ...). He knows what he is talking about. But of course, people in a random forum thread are quick to discredit the post...

Regarding the buffering point, that is specifically addressed in the article:
None of these operations: page fault, interrupts, copies or virtual memory mapping update are cheap. But years ago they were still ~100 times cheaper than the cost of the I/O itself, making this approach acceptable. This is no longer the case as device latency approaches single-digit microseconds. Those operations are now in the same order of magnitude of the I/O operation itself.

[...]

Direct I/O is 20% slower than buffered reads. While reading entirely from memory is still faster — which shouldn’t surprise anybody, that’s a far cry from the disaster one would expect.

[...]

And while Buffered I/O standard APIs performed 20% faster for random reads that fully fit in memory, that comes at the cost of 200x more memory utilization making the trade offs not a clear cut.
It's not that disk is as fast as RAM, it's that the traditional buffer cache is slow.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 6:18 am
by bzt
Korona wrote:These NVMe devices can have thousands of request queues, yet a single POSIX thread can only submit to one queue at a time.
Yes, but there are many concurrent threads and it is the kernel's job to convert those into device requests properly. That is simply not the POSIX API's job.
Korona wrote:
eekee wrote:It's no wonder I was confused if the author of the article doesn't understand the difference between concept and implementation! :lol:
bzt wrote:He's allegedly comparing "Buffered I/O", "Direct I/O" and "Direct I/O with readahead", completely forgetting that any kind of readahead also needs a memory buffer, so it's not a "direct" read at all.
The author is certainly also aware of the OS level concepts (and not only a Rust implementation) as he is a contributor to the Linux kernel (git log turns up file system-related code, the CFQ I/O scheduler, SLUB, ...). He knows what he is talking about.
No, he does not. @eekee is right: the author is confusing API concept with implementation, plus he is completely forgetting that readahead needs a buffer too! Furthermore, kernel-side disk cache and readahead buffers are, and should be, completely transparent to the API.
Korona wrote:It's not that disk is as fast as RAM, it's that the traditional buffer cache is slow.
Which is 100% implementation-specific! A particular buffer cache implementation can be fast or slow, not the concept of a buffer cache.

Cheers,
bzt

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 8:27 am
by Korona
bzt, the author of the blog post in the OP has written parts of the readahead path in Linux, I am sure that he is aware that readahead needs memory.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 10:15 am
by bzt
Korona wrote:bzt, the author of the blog post in the OP has written parts of the readahead path in Linux, I am sure that he is aware that readahead needs memory.
It is very sad that people who don't understand the difference between concept and implementation (and blame the API for their ignorance) are working on the Linux kernel. Just sad. Maybe it's time to switch to one of the BSDs?

Cheers,
bzt

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 10:22 am
by Korona
bzt wrote:
Korona wrote:These NVMe devices can have thousands of request queues, yet a single POSIX thread can only submit to one queue at a time.
Yes, but there are many concurrent threads and it is the kernel's job to convert those into device requests properly. That is simply not the POSIX API's job.
CPUs are most efficiently used if you operate in a one-thread-per-core scenario. This minimizes context switch overhead and cache thrashing. It is the API's job to provide a way to saturate I/O bandwidth without also saturating the CPU. In fact, modern NVMe devices have such high bandwidth in IOPS that it is simply not possible to saturate the I/O bandwidth without an asynchronous API: context switch overhead will outweigh I/O overhead when you are operating with thousands of threads (at least if you also want to look at the data that you're reading/writing). So yes, the fact that POSIX does not consider this problem is a deficit of POSIX, not a feature.

It simply does not make sense to spawn hundreds of threads to saturate the hundreds of request queues that high-end NVMe drives offer. Back-of-the-envelope calculation: high-end NVMe drives reach 1 million IOPS, which gives the CPU about 1 µs to issue each request. A context switch already costs well more than 1 µs once you factor in cache thrashing. Hence, it is not feasible to issue synchronous / non-bulk requests at maximal performance. The only realistic way to drive these SSDs at maximal performance is an asynchronous I/O thread-per-core configuration, possibly operating in polling mode (and not IRQ mode). That's exactly why io_uring exists and why it supports the polling mode. POSIX simply does not offer this mode of operation.
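
To make the polling-mode point concrete, here is a rough single-thread sketch with liburing. This is only an illustration, not code from the article; the file name, queue depth and block size are made up, and real code needs proper error handling. Build with -luring on a 5.1+ kernel; IORING_SETUP_IOPOLL additionally requires the file to be opened with O_DIRECT.

Code:
/* Hypothetical sketch: one thread keeping many reads in flight with
 * io_uring in polled (IOPOLL) mode. Queue depth, block size and file
 * name are made up; real code needs proper error handling. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 64     /* requests this one thread keeps in flight */
#define BLOCK_SIZE  4096   /* must match the device's logical block size */

int main(void)
{
    struct io_uring ring;
    /* IORING_SETUP_IOPOLL: busy-poll for completions instead of taking
     * an interrupt per I/O; it requires an O_DIRECT file descriptor. */
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, IORING_SETUP_IOPOLL) < 0)
        return 1;

    int fd = open("testfile", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* O_DIRECT needs block-aligned buffers. */
    void *bufs[QUEUE_DEPTH];
    for (int i = 0; i < QUEUE_DEPTH; i++)
        if (posix_memalign(&bufs[i], BLOCK_SIZE, BLOCK_SIZE))
            return 1;

    /* Queue QUEUE_DEPTH reads at different offsets, then submit them all
     * with one syscall; no thread blocks per request. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE,
                           (unsigned long long)i * BLOCK_SIZE);
        io_uring_sqe_set_data(sqe, bufs[i]);
    }
    io_uring_submit(&ring);

    /* Reap completions; with IOPOLL this polls the completion queue
     * rather than sleeping on an interrupt. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;
        if (cqe->res < 0)
            fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}

One thread, one submission syscall, 64 requests in flight; that is exactly what a blocking read() per thread cannot give you.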
bzt wrote:It is very sad that people who don't understand the difference between concept and implementation (and blame the API for their ignorance) are working on the Linux kernel. Just sad. Maybe it's time to switch to one of the BSDs?
Or maybe, maybe, maybe, that guy is an expert in his field and your understanding of the blog post is flawed. What's more likely: that somebody with real and verifiable credentials is wrong, or that somebody on the OSDev boards is wrong? "Linux bad" is a great argument, really.

Nothing you claimed this guy is ignoring, about readahead or anything else, has any connection to the blog post. Did you actually read it? It contains a detailed analysis of various use cases and real performance benchmarks.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 2:44 pm
by bzt
Let's just agree that we disagree.
Korona wrote:Or maybe, maybe, maybe, that guy is an expert in his field and your understanding of the blog post is flawed.
An expert would never have written such a blog post. Also, don't forget that the only purpose of that blog is to promote his own Rust library, nothing else.

And just for the record, I'm a certified Hitachi Storage Expert with more than a decade of experience, by the way. I've designed and successfully developed storage-specific systems for the government that nobody else could deliver (I know that for a fact, because all the companies that tried failed, and it was me who finally came up with a working solution). That's how much of an "average forum poster" I am ;-)

Have a nice day,
bzt

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 4:39 pm
by Korona
You might or might not be certified in one or more storage technologies (and what makes you think that turns you into a "non-average" forum poster?). That's something people cannot really verify, since you are posting under a pseudonym here. In any case, your own credentials are not a reason to defame the author of the blog post. Doing so does not make you sound smart; it makes you sound like a jerk.

Now, can you state in a non-inflammatory way what is wrong with the post? So far, it just sounds as if you didn't read the post beyond the term "direct I/O with readahead". And direct I/O with readahead is very well possible - you just perform the readahead in user space (of course it consumes RAM, and everybody is aware of that; that's trivial). It is useful to discuss the deficiencies of the post so that we can all learn something. But it is not useful to paint the author as an idiot.
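
To illustrate, here is a rough sketch of readahead done in user space on top of O_DIRECT, using nothing but plain POSIX calls and one helper thread. It is only meant to show the idea (it is not taken from the article's implementation); the file name and chunk size are made up and error handling is trimmed.

Code:
/* Hypothetical sketch: user-space readahead over O_DIRECT with double
 * buffering. While the main loop works on chunk i, a helper thread is
 * already reading chunk i+1 into the other buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* 1 MiB per read; O_DIRECT wants aligned sizes */

struct prefetch { int fd; void *buf; off_t off; ssize_t got; };

static void *prefetch_thread(void *arg)
{
    struct prefetch *p = arg;
    p->got = pread(p->fd, p->buf, CHUNK, p->off);   /* the readahead itself */
    return NULL;
}

int main(void)
{
    int fd = open("testfile", O_RDONLY | O_DIRECT);   /* made-up file name */
    if (fd < 0)
        return 1;

    /* Two block-aligned buffers, as required by O_DIRECT. */
    void *buf[2];
    posix_memalign(&buf[0], 4096, CHUNK);
    posix_memalign(&buf[1], 4096, CHUNK);

    /* Read chunk 0 synchronously to get started. */
    ssize_t got = pread(fd, buf[0], CHUNK, 0);

    for (off_t off = 0; got > 0; off += CHUNK) {
        /* Start reading the *next* chunk before touching the current one. */
        struct prefetch next = { fd, buf[(off / CHUNK + 1) & 1], off + CHUNK, 0 };
        pthread_t t;
        pthread_create(&t, NULL, prefetch_thread, &next);

        /* ... consume the 'got' bytes in buf[(off / CHUNK) & 1] here ... */

        pthread_join(t, NULL);   /* by now the readahead has (hopefully) finished */
        got = next.got;
    }

    close(fd);
    return 0;
}

Yes, the readahead buffer costs RAM, but it is owned by the application and sized by the application, which is the whole point of doing it in user space.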

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 5:42 pm
by bzt
Korona wrote:Now, can you state in non-inflammatory ways what is wrong about the post?
I already did. For one, no serious dev would ever say that using memory rather than storage is a misconception. Also, a serious dev would not talk about how API concepts are bad when he is only comparing Rust libraries. A serious dev would know the difference between the standard POSIX API and the mmap interface, and he would never confuse those with high-level, language-specific libraries. No Rust library can circumvent the kernel's syscall API, no matter how hard it tries. Etc. etc. etc.

For example, statements like these:
When legacy APIs need to read data that is not cached in memory they generate a page fault.
This is absolutely not true. It might be true for certain Rust implementations that use mmap without MAP_POPULATE under the hood, but it is definitely not true for the legacy open/read/close syscall API in general (a small sketch at the end of this post shows the difference).
Those operations [page fault, interrupts, copies or virtual memory mapping update] are now in the same order of magnitude of the I/O operation itself.
Absolutely wrong: a single memory copy is still orders of magnitude faster than any sector read, NVMe or not. (Put aside that there's controller and sector-access overhead too, and that NVMe/DMA needs an interrupt as well: the result of the I/O operation must be transferred into memory from a peripheral, therefore it can't be faster than a direct memory-to-memory transfer.) This is plain nonsense, not backed up by any measurements by the author.
if modern NVMe support many concurrent operations, there is no reason to believe that reading from many files is more expensive than reading from one.
Files are handled in the VFS layer, which is totally independent of the block I/O layer where concurrent NVMe operations might make a difference. From the block I/O perspective it doesn't matter at all whether the sectors to be read belong to the same file or to different files. If there's an overhead to using many files, then that overhead arises in the VFS layer (and/or in the file system drivers), regardless of the block device's type.
While the device is waiting for the I/O operation to come back, the CPU is not doing anything.
It looks like the author is stuck in the DOS era, before DMA was invented... Seriously, has nobody told this guy that Linux is a multitasking system?

I could go on, but I'm sure the above is more than enough to show that the author is no expert.
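
To make that first point concrete, here is a trivial sketch of the two access paths I mean (made-up file name, error handling trimmed): read() copies the data into a caller-supplied buffer during the syscall, while an mmap() mapping without MAP_POPULATE only takes page faults when the pages are first touched.

Code:
/* Sketch of the two access paths (made-up file name, error handling trimmed). */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Path 1: the legacy read() syscall. The kernel copies the data into
     * our buffer during the syscall; user space takes no page fault for
     * the file data itself. */
    char *buf = malloc(st.st_size);
    ssize_t n = read(fd, buf, st.st_size);

    /* Path 2: mmap() without MAP_POPULATE. Nothing is read here; the
     * first access to each page below faults, and the kernel fills the
     * page from the page cache (or from disk). */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    volatile char c = map[0];           /* this access may page-fault */
    (void)c; (void)n;

    munmap(map, st.st_size);
    free(buf);
    close(fd);
    return 0;
}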

Cheers,
bzt

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 5:47 pm
by Korona
Man, I ask you to discuss a topic in a non-inflammatory way and you start your response with "For one, no serious dev would ever [..]".

It's sad that it's no longer possible to discuss interesting topics without toxicity on this forum.

Re: Modern storage is plenty fast. It is the APIs that are b

Posted: Mon Dec 14, 2020 6:48 pm
by OSwhatever
Korona wrote:The only realistic way to drive these SSDs at maximal performance is an asynchronous I/O thread-per-core configuration, possibly operating in polling mode (and not IRQ mode). That's exactly why io_uring exists and why it supports the polling mode. POSIX simply does not offer this mode of operation.
POSIX offers an asynchronous API, late to the game but still. How it is implemented underneath is really implementation-specific. How does POSIX limit the underlying implementation in the way you suggest?
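
For concreteness, the interface I mean is the aio_* family. A minimal sketch (made-up file name, error handling trimmed, may need -lrt on older glibc):

Code:
/* Minimal POSIX AIO sketch: submit a read, do other work, wait, collect. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_RDONLY);   /* made-up file name */
    if (fd < 0)
        return 1;

    static char buf[4096];

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                      /* submit; returns without blocking */

    /* ... do other work while the read is in flight ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);         /* block until the request completes */
    ssize_t n = aio_return(&cb);        /* same value read() would have given */
    printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}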