LRU on a cache with reader/write lock?
I've run into an interesting performance issue with LRU on an inode cache that uses a reader/writer lock.
Perhaps this is a general programming question, but I'd like to hear your thoughts from an OS dev angle, as I'm planning to use the same basic design in various places in the kernel that could use a cache (for example the generic disk cache), and they will likely run into the same issue.
The cache itself has a reader/writer lock: threads take the reader lock to check for a cache hit, and take the writer lock to add new items to the cache and replace old ones as needed. So if they are lucky and the hit rate is good, there is almost no lock contention at all.
If I add LRU on top of this, it seems to help quite a bit: a single-threaded test that repeatedly creates and deletes folders and files runs 50% faster than when the replacement policy was random replacement.
But since the LRU order is maintained in a linked list, I have to add a mutex to protect it during any LRU update. That puts a serialization point, the LRU mutex, onto the previously contention-free cache-hit path, and indeed the LRU version runs 20% slower than the random-replacement version with 4 threads doing the same thing.
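Roughly, the hit path looks like this (a minimal sketch only, with hypothetical names, using pthreads for illustration):

```c
/* Minimal sketch of the hit path described above; hypothetical names,
 * pthreads used only for illustration. */
#include <pthread.h>

struct entry {
    unsigned long key;
    struct entry *lru_prev, *lru_next;   /* doubly linked LRU list */
    /* ... cached inode data ... */
};

struct cache {
    pthread_rwlock_t lock;       /* guards the lookup structure */
    pthread_mutex_t  lru_mutex;  /* guards only the LRU list */
    struct entry    *lru_head;   /* most recently used end */
};

/* Hypothetical helpers, defined elsewhere. */
struct entry *hash_find(struct cache *c, unsigned long key);
void lru_unlink(struct cache *c, struct entry *e);
void lru_push_head(struct cache *c, struct entry *e);

struct entry *cache_lookup(struct cache *c, unsigned long key)
{
    struct entry *e;

    pthread_rwlock_rdlock(&c->lock);     /* many readers run in parallel */
    e = hash_find(c, key);
    if (e) {
        /* The hit itself is concurrent, but every hitting thread
         * funnels through this one mutex to fix the LRU order:
         * this is the serialization point. */
        pthread_mutex_lock(&c->lru_mutex);
        lru_unlink(c, e);
        lru_push_head(c, e);             /* move to the MRU position */
        pthread_mutex_unlock(&c->lru_mutex);
    }
    pthread_rwlock_unlock(&c->lock);
    return e;
}
```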
It's also difficult to isolate the impact of the serialization point: if I drop the mutex, the threads corrupt the LRU linked list almost instantly, and no performance numbers can be gathered before the whole cache locks up or hits an assert and stops.
So, is there a way to add LRU to a reader/writer-locked cache without hurting its parallelism on the read-locked side, i.e. some magical way to track recency that doesn't need its own mutex? Is there a good trade-off, perhaps a non-strict LRU? Or is LRU replacement a bad idea here altogether?
Btw, one idea I had was to make the LRU list per CPU, so the read-locked path would be fast, but the write-locked path would be even slower.
The thread doing the replacement would have to examine all the lists to decide what to do: maybe put timestamps on the LRU entries and compare them across the lists, or simply kick out the head of the longest/shortest list, or be altruistic and replace from its own list. Since the cache allows callers to pin entries, it gets more complicated, as the "lucky" thread might need a few rounds of comparisons to finally settle on a node that is neither pinned nor recently used.
This can be made to work, but it seems like too complex a solution, and I sense ample opportunity for more synchronization bugs and even performance issues: what if inode X is so popular that it is in every list, which makes it more likely to get picked and replaced? And by slowing down the write-locked region, all readers have to wait longer for their read locks as well.
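For comparison, one non-strict alternative that avoids both the mutex and the per-CPU lists: drop the list on the hit path entirely and keep a per-entry "last used" stamp that hits update with a relaxed atomic store under the read lock; eviction, which already holds the write lock, scans for the oldest unpinned entry. A minimal sketch, with hypothetical names and sizes:

```c
/* Non-strict LRU via per-entry timestamps: no LRU mutex on the hit path.
 * Hypothetical structures; only the recency tracking is shown. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

struct entry {
    unsigned long        key;
    atomic_uint_fast64_t last_used;   /* updated lock-free on every hit */
    atomic_int           pinned;      /* callers may pin entries */
    int                  valid;
};

struct cache {
    pthread_rwlock_t     lock;        /* same reader/writer lock as before */
    atomic_uint_fast64_t clock;       /* global "time", bumped on each hit */
    struct entry        *slots;
    size_t               nslots;
};

/* Called with the read lock held, after a hit: no mutex needed. */
static void touch(struct cache *c, struct entry *e)
{
    uint_fast64_t now = atomic_fetch_add_explicit(&c->clock, 1,
                                                  memory_order_relaxed);
    atomic_store_explicit(&e->last_used, now, memory_order_relaxed);
}

/* Called with the write lock held: pick the oldest unpinned entry. */
static struct entry *pick_victim(struct cache *c)
{
    struct entry *victim = NULL;
    uint_fast64_t oldest = (uint_fast64_t)-1;

    for (size_t i = 0; i < c->nslots; i++) {
        struct entry *e = &c->slots[i];
        if (!e->valid || atomic_load(&e->pinned))
            continue;
        uint_fast64_t t = atomic_load_explicit(&e->last_used,
                                               memory_order_relaxed);
        if (t < oldest) {
            oldest = t;
            victim = e;
        }
    }
    return victim;   /* NULL if everything is pinned */
}
```

The trade-off is an O(n) scan under the write lock at eviction time, and the shared clock counter itself bounces between CPUs on every hit; a per-CPU counter or a coarse tick would avoid that at the cost of an even less strict ordering.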
Re: LRU on a cache with reader/write lock?
Make sure your test is representative of a real work load, before optimizing it. The purpose of the cache is to avoid as much as possible future slow disk access, so a small amount of overhead maintaining that cache can repay itself by many orders of magnitude later on.

xeyes wrote:
> I've run into an interesting performance issue with LRU on an inode cache that uses a reader/writer lock. [...]
If your current locking works, leave it, develop the rest of your system to the point you're running things other than micro-benchmarks, then revisit the cache as and when you identify actual bottlenecks, as opposed to just assumed bottlenecks based on contrived scenarios.
One thing you might want to look at, though, is splitting the cache into two queues. You have one queue for newly read inodes, and another for commonly read inodes.
Cache access will then search the first queue, and if the inode desired is in that queue, it is removed and added to the LRU queue.
The benefit of such a cache design is that scanning accesses to inodes that are accessed only once (such as a directory listing) are less likely to poison the LRU cache, and the added benefit here is that you can potentially spread the lock contention across two data structures instead of just one.
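A rough sketch of the shape of that two-queue arrangement (hypothetical names, with the list helpers and locking omitted; the full 2Q algorithm adds more, e.g. a ghost list of recently evicted keys):

```c
/* Rough sketch of the two-queue idea: a probation FIFO for newly read
 * inodes and a main LRU for inodes that get hit again. Hypothetical
 * intrusive-list helpers; locking omitted for clarity. */
#include <stddef.h>

struct entry {
    unsigned long key;
    struct entry *prev, *next;
    int in_fifo;                 /* 1 = probation FIFO, 0 = main LRU */
};

struct list { struct entry *head, *tail; size_t len; };

/* Hypothetical helpers, defined elsewhere. */
void          list_remove(struct list *l, struct entry *e);
void          list_push_head(struct list *l, struct entry *e);
struct entry *list_pop_tail(struct list *l);
struct entry *hash_lookup(unsigned long key);
void          evict(struct entry *e);

struct twoq {
    struct list fifo;            /* newly read inodes, plain FIFO order */
    struct list lru;             /* re-referenced inodes, true LRU order */
    size_t fifo_max, lru_max;
};

struct entry *twoq_access(struct twoq *q, unsigned long key)
{
    struct entry *e = hash_lookup(key);
    if (!e)
        return NULL;             /* miss: caller reads the inode and
                                    calls twoq_insert() on it */
    if (e->in_fifo) {
        /* Second touch: graduate from the probation FIFO to the LRU. */
        list_remove(&q->fifo, e);
        e->in_fifo = 0;
    } else {
        list_remove(&q->lru, e); /* normal LRU move-to-front */
    }
    list_push_head(&q->lru, e);
    if (q->lru.len > q->lru_max)
        evict(list_pop_tail(&q->lru));
    return e;
}

void twoq_insert(struct twoq *q, struct entry *e)
{
    /* New inodes only ever enter the FIFO, so a one-shot scan (e.g. a
     * directory listing) evicts other one-shot entries, never the hot
     * LRU set. */
    e->in_fifo = 1;
    list_push_head(&q->fifo, e);
    if (q->fifo.len > q->fifo_max)
        evict(list_pop_tail(&q->fifo));
}
```

The split also means the hot LRU list is touched far less often, since most one-shot entries live and die in the FIFO.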
Re: LRU on a cache with reader/write lock?
Thanks for the good point! The test I ran is very different from real workloads, as it fills the partition with empty folders and files and then rm -rf's the whole tree in a loop. Optimizing for this surely sounds like a waste of time.

thewrongchristian wrote:
> Make sure your test is representative of a real work load, before optimizing it. [...]
> If your current locking works, leave it, develop the rest of your system to the point you're running things other than micro-benchmarks, then revisit the cache as and when you identify actual bottlenecks [...]
It makes sense to avoid poisoning the LRU. I actually did waste a bit more time earlier to add a ref counter and an arbitrary threshold for any LRU operation on the read-locked side (i.e. an entry has to experience N hits, before falling off the end of the LRU list, in order to be moved to the front again). That was apparently enough to turn the tide: the LRU version is now a bit faster than random even in multi-threaded runs, and it still keeps about a 50% speed-up in single-threaded runs, though it's hard to say whether this will translate into a speed-up on any real workload.

thewrongchristian wrote:
> One thing you might want to look at, though, is splitting the cache into two queues. You have one queue for newly read inodes, and another for commonly read inodes. [...]
Could you go into a bit more detail on the two queues? Does it mean there are two LRU queues and entries are moved between them based on hit count?
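For reference, one reading of the threshold tweak described above, as a minimal sketch with hypothetical names: the per-entry hit counter is bumped lock-free, and the LRU mutex is only taken on roughly one out of every N hits.

```c
/* Sketch of the "N hits before re-promotion" idea: the per-entry hit
 * counter is bumped with a relaxed atomic under the read lock, and the
 * LRU mutex is taken only when the threshold is reached. Hypothetical
 * names throughout. */
#include <pthread.h>
#include <stdatomic.h>

#define PROMOTE_THRESHOLD 8u     /* arbitrary N, to be tuned */

struct entry {
    atomic_uint hits_since_promote;
    /* ... key, LRU links, cached inode data ... */
};

struct cache {
    pthread_mutex_t lru_mutex;
    /* ... LRU list, lookup structure ... */
};

/* Hypothetical list helpers, defined elsewhere. */
void lru_unlink(struct cache *c, struct entry *e);
void lru_push_head(struct cache *c, struct entry *e);

/* Called under the cache's read lock on every hit. */
static void lru_touch(struct cache *c, struct entry *e)
{
    unsigned n = atomic_fetch_add_explicit(&e->hits_since_promote, 1,
                                           memory_order_relaxed) + 1;
    if (n < PROMOTE_THRESHOLD)
        return;                            /* fast path: no mutex at all */

    atomic_store_explicit(&e->hits_since_promote, 0, memory_order_relaxed);
    pthread_mutex_lock(&c->lru_mutex);     /* slow path: ~1 in N hits */
    lru_unlink(c, e);
    lru_push_head(c, e);                   /* move back to the MRU end */
    pthread_mutex_unlock(&c->lru_mutex);
}
```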
Re: LRU on a cache with reader/write lock?
I missed mentioning the fact that the first queue is a fixed-size FIFO queue. The idea is that this FIFO queue acts as a filter for data that will be considered for inclusion in the LRU queue. This gives what is known as "scan resistance" in scenarios such as your benchmark.

xeyes wrote:
> It makes sense to avoid poisoning the LRU. [...] Could you go into a bit more detail on the two queues? Does it mean there are two LRU queues and entries are moved between them based on hit count?
The original paper describes it better than I could.
Re: LRU on a cache with reader/write lock?
Thanks for the link, very interesting ideas, and especially the comparisons, as I didn't know the other policies mentioned there either.

thewrongchristian wrote:
> I missed mentioning the fact that the first queue is a fixed-size FIFO queue. The idea is that this FIFO queue acts as a filter for data that will be considered for inclusion in the LRU queue. This gives what is known as "scan resistance" in scenarios such as your benchmark. [...] The original paper describes it better than I could.
But as you said this is probably a better topic to revisit after having 'real' workloads.
Re: LRU on a cache with reader/write lock?
Filesystem caches are a bit complicated. The most important aspect for achieving high throughput with basically all disc units is that many consecutive sectors must be read/written at once. For instance, with USB discs, there is a huge performance difference between reading one sector one hundred times and reading a hundred sectors one time. Actually, it doesn't take much longer to read a hundred sectors than to read just one. This is connected to how the schedules are created. However, the same thing is true for IDE & floppies too. This is probably because the disc needs to rotate a full lap to get to the next sector, while when many consecutive sectors are read, everything happens in one rotation.
Using lists or trees makes it hard to assemble long requests that create good performance. I use a two-level array approach (similar to page tables) to make it easier to get long requests.
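A rough sketch of what such a two-level, page-table-like index could look like (hypothetical sizes and names): consecutive sectors land in consecutive slots, so a long request can be assembled with a simple linear scan.

```c
/* Hypothetical two-level cache index in the spirit of page tables: the
 * top level is indexed by the high LBA bits, each leaf covers a
 * contiguous run of sectors. Sizes are made up. */
#include <stdint.h>
#include <stddef.h>

#define LEAF_BITS 10                       /* 1024 sectors per leaf */
#define LEAF_SIZE (1u << LEAF_BITS)
#define TOP_SIZE  4096                     /* top level covers 4M sectors */

struct sector_buf;                         /* cached sector, defined elsewhere */

struct leaf {
    struct sector_buf *slot[LEAF_SIZE];    /* NULL = sector not cached */
};

struct disk_cache {
    struct leaf *top[TOP_SIZE];            /* NULL = nothing cached in range */
};

static struct sector_buf *cache_get(struct disk_cache *c, uint64_t lba)
{
    uint64_t idx = lba >> LEAF_BITS;
    struct leaf *l = (idx < TOP_SIZE) ? c->top[idx] : NULL;
    return l ? l->slot[lba & (LEAF_SIZE - 1)] : NULL;
}

/* Because adjacent sectors sit in adjacent slots, finding the length of a
 * cached run (to build one big request) is a plain linear walk. */
static size_t cached_run_length(struct disk_cache *c, uint64_t lba, size_t max)
{
    size_t n = 0;
    while (n < max && cache_get(c, lba + n) != NULL)
        n++;
    return n;
}
```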
Re: LRU on a cache with reader/write lock?
If your list/tree is already ordered, it should be trivial to find consecutive requests that can be merged.

rdos wrote:
> Using lists or trees makes it hard to assemble long requests that create good performance. I use a two-level array approach (similar to page tables) to make it easier to get long requests.
When walking the list, given any request, you can look up the next request to see if it is contiguous with the current one, and if so, combine them. Repeat until the next request is not contiguous, and dispatch to your device.
If you don't order your lists, then I can see how this can be a problem.
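A minimal sketch of that merge-while-walking pass, assuming a request list already sorted by starting LBA (hypothetical types; dispatch() stands in for handing the merged request to the driver):

```c
/* Walk a list of requests sorted by starting LBA and merge runs that are
 * contiguous on disk into one larger request before dispatching.
 * Hypothetical request type and dispatch function. */
#include <stdint.h>

struct request {
    uint64_t        lba;       /* starting sector */
    uint32_t        count;     /* number of sectors */
    struct request *next;      /* next request, sorted by lba */
};

/* Hypothetical: hand one (possibly merged) request to the device. */
void dispatch(uint64_t lba, uint32_t count);

void flush_sorted_queue(struct request *head)
{
    while (head) {
        uint64_t start = head->lba;
        uint32_t count = head->count;
        struct request *r = head->next;

        /* Keep absorbing requests that begin exactly where the current
         * merged run ends. */
        while (r && r->lba == start + count) {
            count += r->count;
            r = r->next;
        }
        dispatch(start, count);
        head = r;              /* continue with the first non-contiguous one */
    }
}
```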
Re: LRU on a cache with reader/write lock?
Sure, if you can keep them in sector order, then it would be easier to create large requests. However, the number of cached sectors could be thousands or even millions, and sorting them in a list doesn't seem practical (it takes too much time, and depends on list size). A sorted tree might be better, but then the time it takes to create long requests might increase.

thewrongchristian wrote:
> If your list/tree is already ordered, it should be trivial to find consecutive requests that can be merged. [...] If you don't order your lists, then I can see how this can be a problem.
Anyway, a major design goal of disc caching should be to optimize the speed of finding large requests.
Re: LRU on a cache with reader/write lock?
Right, so you're talking about the structures to manage the cache, as opposed to IO scheduling? I think we're talking at cross purposes.

rdos wrote:
> Sure, if you can keep them in sector order, then it would be easier to create large requests. However, the number of cached sectors could be thousands or even millions [...] Anyway, a major design goal of disc caching should be to optimize the speed of finding large requests.
Of course, a hash-based structure for such a cache will indeed not be sorted, and should be O(1) with a good hashing library instead of the O(log N) of a BST-based cache.
In IO scheduling, if you're optimizing requests based on LBA using an elevator scheduler, you'd already be sorting your requests by LBA. I take it you do this combining before scheduling the request to the device?
Re: LRU on a cache with reader/write lock?
I think the disc cache and IO scheduling are related. If you write a large file with async IO, or have a large read-ahead on files, then you could easily create IO requests for thousands of sectors too, especially if your device is rather slow and your file IO is fast and uses async IO.

thewrongchristian wrote:
> Right, so you're talking about the structures to manage the cache, as opposed to IO scheduling? [...] I take it you do this combining before scheduling the request to the device?
Re: LRU on a cache with reader/write lock?
Yeah, it is complicated, but I've gotten better at debugging race conditions as a result.

rdos wrote:
> Filesystem caches are a bit complicated. The most important aspect for achieving high throughput with basically all disc units is that many consecutive sectors must be read/written at once. [...]
I added ATA PIO and it was a night-and-day difference compared to the previous ramdisk: timer interrupt jitter suddenly became sky high, and even the world time clock could get stuck for 20 or 30 seconds (which would probably correspond to a GUI freeze if I had a GUI) before suddenly jumping to catch up.
As a result I had to change various mutex waits from spinning to yielding the CPU and being woken up later by the unlocking thread, and I also had to move disk accesses out of the write-locked region. Now any thread pre-reads from disk before it write-locks the cache, and buffers any eviction victim for write-back after it leaves the write-locked region.
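In outline, that reordering looks something like this (a sketch only, with hypothetical names; error handling and the finer points of the re-lookup race are glossed over):

```c
/* Outline of a miss path that keeps slow disk I/O outside the cache's
 * write lock, as described above. Hypothetical names; error handling
 * omitted. */
#include <sys/types.h>
#include <stddef.h>

struct cache;
struct inode_data { int dirty; /* ... cached inode ... */ };

/* Hypothetical cache and disk primitives, defined elsewhere. */
struct inode_data *cache_lookup_rdlock(struct cache *c, ino_t ino);
int                cache_contains(struct cache *c, ino_t ino);
struct inode_data *cache_insert_pick_victim(struct cache *c, ino_t ino,
                                            struct inode_data *fresh);
void               cache_wrlock(struct cache *c);
void               cache_wrunlock(struct cache *c);
struct inode_data *read_inode_from_disk(ino_t ino);
void               write_inode_to_disk(struct inode_data *d);

struct inode_data *inode_cache_get(struct cache *c, ino_t ino)
{
    struct inode_data *hit = cache_lookup_rdlock(c, ino);
    if (hit)
        return hit;

    /* 1. Slow part first, with no cache lock held at all. */
    struct inode_data *fresh = read_inode_from_disk(ino);

    /* 2. Short write-locked section: re-check (someone may have inserted
     *    ino meanwhile), insert, and pick a victim, but never touch the
     *    disk while holding the lock. */
    struct inode_data *victim = NULL;
    cache_wrlock(c);
    if (!cache_contains(c, ino))
        victim = cache_insert_pick_victim(c, ino, fresh);
    cache_wrunlock(c);

    /* 3. Write the dirty victim back only after the lock is dropped. */
    if (victim && victim->dirty)
        write_inode_to_disk(victim);

    return cache_lookup_rdlock(c, ino);    /* simplified re-lookup */
}
```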
With all these in place, the timer interrupt jitter and clock freezes are resolved, but a new batch of race conditions was introduced and had to be resolved as well.
I also needed to replace the per-inode R/W lock with one that can block and wake up waiters. Otherwise, with more than one thread on the same CPU, they still deadlock (the classic issue of a thread spinning on the CPU without holding the lock, while the lock holder can't unlock because it doesn't get any CPU cycles).
All that said, I do have a "simple" solution to "many consecutive sectors must be read/written at once": the last-level disk cache is 64KB per line (i.e. any disk access from the cache is already 64KB, or 128 sectors), and that can be increased further if needed. It wastes memory, but runs faster.