Brendan wrote: Hi,
Just thought I'd put in my vote..
From my perspective, there are 2 problems. The first problem is that you've got a global structure being pounded by every CPU in the computer at the same time. The second problem is that you've got a global structure that has to contain every section/segment in the entire system.
The first rule of scalability is that global structures are bad.
OK, so I do per-process structs and all. Now say the process is a torrent client that starts multiple threads which all map chunks of files and process I/O.
And oh wait, we are right back where we started. We want a structure that does not need a global lock, because that would block all cores all the time. And the structure has to cope with having millions of entries. A Debian Blu-ray ISO is 25 GB; with torrents working in 16 KiB chunks the worst case would be 819200 segments. And you probably know how random the access pattern of torrents is.
Brendan wrote:
Once you've found a way to break that single global structure into many separate smaller structures; you no longer have every CPU constantly pounding the same structure and no longer have a very high number of sections/segments in the same structure. Both the original problems would be greatly reduced from the start; and even bad algorithms/structures (for "many separate things") will probably outperform the best possible algorithm/structure (for "a single global thing").
And I think this is one of the misunderstandings. I don't have any structure. Not global, not local, not smart, not stupid.
I didn't ask because I have a crappy solution that doesn't perform.
What I have is an idea of what data I have and how it changes over time. What I'm looking for is a structure to manage that data which isn't insanely slow or insanely complex. I'm a long way from multi-core support, but that is a goal. So I mentioned it as bonus points, since I don't want to implement something that I know I will have to completely throw away when I start on multi-core support. It isn't yet a requirement, but it would be nice to think ahead.
Brendan wrote:
For kernel code (where task switches and/or IRQs can be disabled) "lock-less" isn't necessarily better than spinlocks and may be worse.
In general; instead of acquiring a lock and then doing work that you know will be committed; you do work without knowing if it will be committed and then retry when you fail to commit. This means that under contention, lock-less can waste a lot of time doing work that fails to commit. In addition to that lock-less algorithms are complex (error prone, harder to maintain) and are never fair (can lead to starvation, or in other words, worst case "heavily contended time" is infinite).
I'd also say that the best approach might be to implement the easiest method to maintain (without caring about performance at all); just so that you've got a baseline that you can profile with real loads. Then (possibly later on) you can try out different alternatives to see if they actually are better or worse in practice with real loads (rather than merely better/worse in theory).
I totally agree with you that lock-less tends to be really wasteful on collisions. I was giving bonus points for being lock-free (as in: at least one process must be able to continue at all times). On deeper thought that is actually a requirement for me. Deadlocks are a no-go.
Since I asked here, I've gotten some suggestions on IRC, so let me outline the ideas in case someone else has the same questions later and finds this thread:
1) forget about segments and use page granularity
What makes the problem hard is that segments can appear, disappear, grow, shrink, split, merge or just change attributes. There are way too many special cases to consider. The smallest unit that can be affected is a page, so let's stick with that.
Then a simple (uncompressed) n-ary trie can manage all the pages. Set n=512 and each node in the trie is exactly one page, and the trie has height 4 no matter how many pages are in it. It's basically a page table, except the leaves contain the page attributes instead of the addresses of physical pages.
For multithreading, each page in the trie must have a write lock. Luckily there are some spare bits in each node that can be used for that. All operations can be done in a single pass, top to bottom, and most writes will only affect the lower-level nodes.
Drawback: larger segments need to act on a lot of pages. Operations on them will have to hold write locks on higher-level nodes, high enough that only a single node at that level is involved. The lock problem can probably be avoided by not allocating segments across node boundaries.
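Just to make the idea concrete, here is roughly what such a node and a lookup could look like. Only a sketch: the names, the 4 KiB / 512-entry layout and using a spare low bit of each entry as the write lock for the node it points to are my assumptions.
[code]
#include <stdint.h>

/* One trie node = one 4 KiB page: 512 entries of 8 bytes each.
 * Interior entries hold the address of the child node; leaf entries hold
 * the per-page attributes instead of a physical frame address.  One spare
 * low bit of each entry is (ab)used as the write lock of the node below. */
#define TRIE_FANOUT   512
#define TRIE_LEVELS   4          /* 512^4 pages * 4 KiB = 48-bit address space */
#define NODE_LOCK_BIT 0x1ULL     /* assumed spare bit used as the write lock   */

typedef struct trie_node {
    uint64_t entry[TRIE_FANOUT]; /* child pointer or page attributes */
} trie_node;

/* Extract the 9-bit index for a given level, page-table style. */
static unsigned trie_index(uint64_t vaddr, int level)
{
    return (unsigned)((vaddr >> (12 + 9 * level)) & (TRIE_FANOUT - 1));
}

/* Single top-down pass: look up the attributes of the page containing vaddr.
 * Returns 0 if some level is not populated (a hole). */
uint64_t trie_lookup(trie_node *root, uint64_t vaddr)
{
    trie_node *node = root;

    for (int level = TRIE_LEVELS - 1; level > 0; level--) {
        uint64_t e = node->entry[trie_index(vaddr, level)];
        if (!(e & ~NODE_LOCK_BIT))
            return 0;                               /* nothing mapped here */
        node = (trie_node *)(uintptr_t)(e & ~NODE_LOCK_BIT);
    }
    return node->entry[trie_index(vaddr, 0)];       /* leaf: page attributes */
}
[/code]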
2) radix tree
This also simplifies segment splits and merges somewhat. A split creates a new subtree, and a merge joins 2 leaves and their parent into a new segment. Both are localised operations. A problem I see there is that merges need to propagate back up, so access won't be top to bottom alone. To make it lock-free, merging might have to be delayed/offloaded to the next thread. Maybe on the way down the thread could already check whether a merge is possible at all and keep all nodes locked that might be needed for the merge. But that seems to make matters rather complex. It might be simpler to just do merges on the way down on every access. A huge merge would then not propagate up all the way, but would slowly rise each time there is an access in that region.
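A rough sketch of the "merge on the way down" idea (names and layout made up; a binary split of the address range is assumed, and the caller is assumed to already hold the write lock on the node):
[code]
#include <stdint.h>
#include <stdlib.h>

/* A leaf covers one contiguous range with uniform attributes; an interior
 * node just splits its range between two children.  Splitting a segment
 * replaces a leaf with a small subtree; merging collapses two leaves with
 * identical attributes (plus their parent) back into a single leaf. */
typedef struct rnode {
    uint64_t      base, len;    /* address range covered by this node */
    uint64_t      attr;         /* only meaningful for leaves         */
    struct rnode *left, *right; /* both NULL for a leaf               */
} rnode;

static int is_leaf(const rnode *n)
{
    return n->left == NULL;
}

/* Opportunistic merge on the way down: before descending into 'n', check
 * whether both children are leaves with identical attributes and, if so,
 * collapse them into 'n'.  A large merge thus rises one level per access
 * in that region instead of having to propagate back up immediately. */
static void try_merge_children(rnode *n)
{
    if (is_leaf(n) || !is_leaf(n->left) || !is_leaf(n->right))
        return;
    if (n->left->attr != n->right->attr)
        return;

    n->attr = n->left->attr;    /* 'n' becomes the merged leaf */
    free(n->left);
    free(n->right);
    n->left = n->right = NULL;
}
[/code]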
3) B+Tree
In a B+Tree, on each level each node has not only pointers to its children but also a pointer to its next sibling. So when a segment is found it is easy to get the next segment to the right, even if the segment is the last one in its node. Now the trick would be to find the segment that starts at or to the left of what we actually want, then go right until there is a segment that ends too far to the right, and then split/merge/change/insert/remove as needed. Potentially a lot of segments could get removed in one go, which would cause a lot of nodes in the tree to be merged, maybe even reducing the height of the tree by 1. That is a problem for the multithreaded case. But the job could be offloaded to the next thread (the one holding the lock) if the parent can't be locked on the way back up.
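A minimal sketch of the leaf level and the "walk right through the siblings" part (names made up; the tree descent, interior nodes and rebalancing are left out):
[code]
#include <stdint.h>
#include <stddef.h>

/* Leaves hold segments sorted by start address and are chained through a
 * 'next' pointer, so after the descent has found the segment starting at
 * or to the left of the range we can keep walking right across leaf
 * boundaries. */
typedef struct segment {
    uint64_t start, end;        /* [start, end) */
    uint64_t attr;
} segment;

#define LEAF_FANOUT 64

typedef struct leaf {
    int          count;
    segment      seg[LEAF_FANOUT];
    struct leaf *next;          /* right sibling on the same level */
} leaf;

/* Visit every segment overlapping [start, end), beginning at the leaf and
 * slot found by the (omitted) descent, and stopping at the first segment
 * that starts past the end of the range. */
void for_each_overlap(leaf *l, int slot, uint64_t start, uint64_t end,
                      void (*fn)(segment *, void *), void *arg)
{
    while (l != NULL) {
        for (; slot < l->count; slot++) {
            if (l->seg[slot].start >= end)
                return;                 /* first segment past the range: done */
            if (l->seg[slot].end > start)
                fn(&l->seg[slot], arg); /* overlaps the range */
        }
        l = l->next;                    /* continue in the right sibling */
        slot = 0;
    }
}
[/code]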
Space/complexity consideration: a B+ tree that is 4 levels deep might hold as little as 4 million segments and take as much space as simply mapping each page separately. Is all the complexity worth it?
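For reference, the back-of-envelope numbers behind the "4 million" figure, assuming 4 KiB nodes with 16-byte key/pointer pairs, so a fanout of 256 that drops to 128 when nodes are only half full (the exact layout is just an assumption):
[code]
#include <stdio.h>

int main(void)
{
    unsigned long fanout    = 4096 / 16;   /* 256 entries per 4 KiB node     */
    unsigned long half_full = fanout / 2;  /* B+ tree minimum occupancy      */

    /* 4 levels: a root with at least 2 children above 3 half-full levels.   */
    unsigned long min_segs = 2UL * half_full * half_full * half_full;

    printf("a 4-level B+ tree holds at least %lu segments\n", min_segs);
    return 0;                              /* prints 4194304, ~4 million     */
}
[/code]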