
What is the right kind of tree for handling segments?

Posted: Sat Dec 08, 2012 5:28 am
by mrvn
Hi,

I'm looking for the right tree structure to manage my address space. The information I need to track is a set of (start, size, attributes) segments. Segments do not overlap.

Required operations, each with runtime at most O(log n):
  • insertion of a new segment
  • removal of a segment
  • finding the segment containing an address
  • splitting a segment in two (removal of a chunk in the middle)
  • splitting a segment in three (changing the attributes of a chunk in the middle)
  • growing a segment (inserting an adjoining segment with same attributes)
  • merging two segments (inserting or changing a segment adjoining, on either side, a segment with the same attributes)
  • merging three segments (changing attributes of a middle segment so 3 adjoining segments have the same attributes)
Bonus points for multithreading (no global lock). Extra bonus points for a lock-free solution.
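
To make the requirements concrete, here is a rough sketch of the interface I have in mind (all names are illustrative, not an existing API):

Code: Select all

#include <cstdint>

using Attribute = uint64_t;   // placeholder; the real attribute type is open

struct Segment { uint64_t start, size; Attribute attrs; };

// Illustrative operations matching the list above.
struct SegmentTree {
    void     insert(uint64_t start, uint64_t size, Attribute attrs);
    void     remove(uint64_t start);
    Segment* find(uint64_t addr);                        // segment containing addr
    void     punch_hole(uint64_t start, uint64_t size);  // split in two
    void     set_range(uint64_t start, uint64_t size, Attribute attrs);
                                                         // split in three / grow / merge
};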

So what are your suggestions?

Mrvn

Re: What is the right kind of tree for handling segments?

Posted: Sat Dec 08, 2012 5:55 am
by bluemoon
While there are thousands of algorithms for heap / address space management, I don't see "allocate a segment with a required size" in your requirements. Is that intentional? I'm curious about your design; are you working on a SAS (single address space) design?

And my own implementation for address space management uses pre-defined zones:
1. kernel and system
2. device drivers
3. device resources (display LFB, mmio, etc)
4. application

Each zone has its own allocator.

For applications, allocation is as simple as a single number, the _sbrk_ptr, which grows in one direction until it hits the limit.
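
For illustration, such a zone allocator is only a few lines (names made up for the example):

Code: Select all

#include <cstdint>

struct Zone {
    uint64_t sbrk_ptr;   // next free address in the zone
    uint64_t limit;      // exclusive end of the zone
};

// Bump allocation: returns the start of the new block, or 0 when the zone
// is exhausted.
uint64_t zone_alloc(Zone& z, uint64_t size) {
    if (z.limit - z.sbrk_ptr < size)
        return 0;                 // hit the limit
    uint64_t addr = z.sbrk_ptr;
    z.sbrk_ptr += size;
    return addr;
}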

Re: What is the right kind of tree for handling segments?

Posted: Sat Dec 08, 2012 8:09 am
by jnc100
It really depends on how many entries you will have in the structure and which of these operations you will be calling the most frequently. If you only expect to have one entry for the kernel, one for each section within the executable, one for the stack and one for the heap, and the most common operation is checking whether a faulting address lies within an expandable heap or stack, then it may make sense to simply use a list, with the addresses of the heap and stack stored separately for quick access.
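
A minimal sketch of that approach (names are illustrative):

Code: Select all

#include <cstdint>

struct Region { uint64_t start, end; Region* next; };

struct AddressSpace {
    Region* regions;   // short list: kernel, executable sections, ...
    Region* heap;      // cached separately for the hot page-fault check
    Region* stack;
};

// Page-fault path: test the two expandable regions first, then fall back
// to a linear scan of the handful of fixed regions.
Region* find_region(AddressSpace* as, uint64_t addr) {
    if (as->heap && addr >= as->heap->start && addr < as->heap->end)
        return as->heap;
    if (as->stack && addr >= as->stack->start && addr < as->stack->end)
        return as->stack;
    for (Region* r = as->regions; r != nullptr; r = r->next)
        if (addr >= r->start && addr < r->end)
            return r;
    return nullptr;
}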

Regards,
John.

Re: What is the right kind of tree for handling segments?

Posted: Sat Dec 08, 2012 8:48 am
by mrvn
bluemoon wrote: While there are thousands of algorithms for heap / address space management, I don't see "allocate a segment with a required size" in your requirements. Is that intentional? I'm curious about your design; are you working on a SAS (single address space) design?
Allocation creates a new segment and inserts it into the tree to mark the segment's access rights (i.e. which processes have access and what kind of access).

Note: I'm trying to do a single address space with page-sharing message passing and with page protection. So this has to track access rights for all processes; it will have a lot of entries, and modifying/splitting/joining existing entries will probably be the most common operations.

Re: What is the right kind of tree for handling segments?

Posted: Sat Dec 08, 2012 9:00 am
by mrvn
jnc100 wrote:It really depends on how many entries you will have in the structure...
Think millions.
jnc100 wrote:... and which of these operations you will be calling the most frequently.
There is basically only one top-level operation that will occur:

Code: Select all

void change_attributes(uint64_t start, uint64_t end, bool add, Attribute attr);
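
One plausible reduction of that call to the primitive operations from my first post (find/split/merge are assumed helpers, not an existing API):

Code: Select all

#include <cstdint>

using Attribute = uint64_t;                // assumed: attributes form a bitmask
struct Segment { uint64_t start, end; Attribute attrs; };

// Assumed helpers - the primitive tree operations.
void     split_at(uint64_t addr);          // split the segment containing addr
Segment* find_segment(uint64_t addr);      // segment containing addr, or null
Segment* next_segment(Segment* s);         // in-order successor
void     merge_neighbours(uint64_t start, uint64_t end);

void change_attributes(uint64_t start, uint64_t end, bool add, Attribute attr) {
    split_at(start);                       // no-op if a segment starts here
    split_at(end);                         // isolate [start, end) exactly
    for (Segment* s = find_segment(start); s != nullptr && s->start < end;
         s = next_segment(s))
        s->attrs = add ? (s->attrs | attr) : (s->attrs & ~attr);
    merge_neighbours(start, end);          // re-join equal-attribute neighbours
}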

Re: What is the right kind of tree for handling segments?

Posted: Sun Dec 09, 2012 12:25 pm
by Combuster
A quick google revealed an article on "Lock-Free Red-Black Trees Using CAS". I believe figuring out how to fit any tree-like structure to an algorithm falls under basic computer science. Go merge the two.

Re: What is the right kind of tree for handling segments?

Posted: Sun Dec 09, 2012 2:03 pm
by mrvn
Combuster wrote:A quick google revealed an article on "Lock-Free Red-Black Trees Using CAS". I believe figuring out how to fit any tree-like structure to an algorithm falls under basic computer science. Go merge the two.
Well, you went for the bonus points but failed to answer the actual question.

Or is your answer: use a Red-Black Tree? Because that isn't so well suited for merging or splitting segments. Doing so would mean multiple passes through the tree, and then you need a global lock or something to keep them atomic, and you lose your bonus points.

Re: What is the right kind of tree for handling segments?

Posted: Sun Dec 09, 2012 4:10 pm
by Combuster
Ever heard of the Sorcerer's Apprentice? That guy (with large ears, depending on the version) who tried the advanced magic without having mastered the basics and then screwed up big time?

Either you can't make the basic synchronous algorithm, in which case talking about advanced concurrency concepts is pointless. Or you can and you're systematically asking the wrong question.

Re: What is the right kind of tree for handling segments?

Posted: Sun Dec 09, 2012 8:59 pm
by Brendan
Hi,

Just thought I'd put in my vote.. :)

From my perspective, there are 2 problems. The first problem is that you've got a global structure being pounded by every CPU in the computer at the same time. The second problem is that you've got a global structure that has to contain every section/segment in the entire system.

The first rule of scalability is that global structures are bad.

The first step towards fixing both of your problems is to not have a global structure. There are at least two ways this can be done:
  • split the single address space into fixed-size small address spaces for each process. For example; give each process a PDPT (max. 511 processes with up to 512 GiB of space each); or give each process a PD (max. 262143 processes with up to 1 GiB of space each); or support mixed sizes (e.g. 256 larger processes with up to 512 GiB of space each plus 131071 smaller processes with up to 1 GiB of space each). A sketch of this fixed-size carving follows after this list.
  • Have a global manager that only manages unused space; plus one manager per process that only manages "in use" space for that process. This makes space usage more dynamic (e.g. if there's millions of processes that only use 1 MiB of space or 4 processes that use a few TiB then you don't have problems with hard-coded limits); but you'd still have a global structure with many sections/segments (the global "unused space" manager).
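
For the fixed-size variant, the carving itself is a single shift. A sketch, assuming 512 GiB slots so that each process owns exactly one PML4 entry (one PDPT):

Code: Select all

#include <cstdint>

constexpr uint64_t SLOT_BITS = 39;   // 2^39 bytes = 512 GiB per slot

// Base address of a process's slot; slot 0 could stay reserved for the kernel.
uint64_t process_base(uint64_t process_id) {
    return process_id << SLOT_BITS;
}

// Map an address back to the process that owns it.
uint64_t owner_of(uint64_t addr) {
    return addr >> SLOT_BITS;
}
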
Once you've found a way to break that single global structure into many separate smaller structures; you no longer have every CPU constantly pounding the same structure and no longer have a very high number of sections/segments in the same structure. Both the original problems would be greatly reduced from the start; and even bad algorithms/structures (for "many separate things") will probably outperform the best possible algorithm/structure (for "a single global thing").

In addition to the above:
  • There are about 4 types of costs: the "uncontended time" cost, the "heavily contended time" cost, the memory space cost, and the "cache lines touched" cost (cache locality). For example, an algorithm might have a very good "uncontended time" cost but still have a very bad "heavily contended time" cost and a very bad "cache lines touched" cost. This means that to compare algorithms you'd need to compare all 4 costs, not just one of them.
  • Because "big-O notation" ignores the cost of each operation and the expected number of operations, it is completely worthless in practice. For example, for small values of n, O(log n) where operations are expensive can be worse than O(n*n) where operations are fast (see the worked numbers after this list).
  • For kernel code (where task switches and/or IRQs can be disabled) "lock-less" isn't necessarily better than spinlocks and may be worse. In general; instead of acquiring a lock and then doing work that you know will be committed; you do work without knowing if it will be committed and then retry when you fail to commit. This means that under contention, lock-less can waste a lot of time doing work that fails to commit. In addition to that lock-less algorithms are complex (error prone, harder to maintain) and are never fair (can lead to starvation, or in other words, worst case "heavily contended time" is infinite).
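
To put illustrative numbers on the big-O point: suppose each operation of the O(log n) structure costs 1000 cycles (pointer chasing, cache misses), while each operation of the O(n*n) approach costs 10 cycles (a sequential scan of hot cache lines). At n = 10 the "better" structure needs about 1000 * log2(10) ≈ 3300 cycles, while the "worse" one needs 10 * 10 * 10 = 1000 cycles; the asymptotically worse option keeps winning until n grows past about 20.
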
I'd also say that the best approach might be to implement the easiest method to maintain (without caring about performance at all); just so that you've got a baseline that you can profile with real loads. Then (possibly later on) you can try out different alternatives to see if they actually are better or worse in practice with real loads (rather than merely better/worse in theory). ;)


Cheers,

Brendan

Re: What is the right kind of tree for handling segments?

Posted: Tue Dec 11, 2012 2:59 pm
by mrvn
Brendan wrote:Hi,

Just thought I'd put in my vote.. :)

From my perspective, there are 2 problems. The first problem is that you've got a global structure being pounded by every CPU in the computer at the same time. The second problem is that you've got a global structure that has to contain every section/segment in the entire system.

The first rule of scalability is that global structures are bad.
Ok, so I do per-process structs and all. Now say the process is a torrent client that starts multiple threads, which all map chunks of files and process IO.

And oh, wait, we are right back where we started. We want a structure that does not need a global lock, because that would block all cores all the time. And the structure has to cope with having millions of entries. A Debian BD iso is 25G; at 16k torrent chunks that is 1,638,400 chunks, and the worst case (every other chunk mapped) would be 819,200 segments. And you probably know how random the access pattern for torrents is.
Brendan wrote: Once you've found a way to break that single global structure into many separate smaller structures; you no longer have every CPU constantly pounding the same structure and no longer have a very high number of sections/segments in the same structure. Both the original problems would be greatly reduced from the start; and even bad algorithms/structures (for "many separate things") will probably outperform the best possible algorithm/structure (for "a single global thing").
And I think here is one of the misunderstandings. I don't have any structure. Not global, not local, not smart, not stupid.
I didn't ask because I have a crappy solution that doesn't perform.

What I have is an idea of what data I have and how it changes over time. What I'm looking for is a structure to manage the data that isn't insanely slow or insanely complex. I'm a long way from multi-core support, but that is a goal. So I mentioned it as bonus points, since I don't want to implement something that I know I'll have to completely throw away when I start on multi-core support. It isn't yet a requirement, but it would be nice to think ahead.
Brendan wrote: For kernel code (where task switches and/or IRQs can be disabled) "lock-less" isn't necessarily better than spinlocks and may be worse. In general; instead of acquiring a lock and then doing work that you know will be committed; you do work without knowing if it will be committed and then retry when you fail to commit. This means that under contention, lock-less can waste a lot of time doing work that fails to commit. In addition to that lock-less algorithms are complex (error prone, harder to maintain) and are never fair (can lead to starvation, or in other words, worst case "heavily contended time" is infinite).
I'd also say that the best approach might be to implement the easiest method to maintain (without caring about performance at all); just so that you've got a baseline that you can profile with real loads. Then (possibly later on) you can try out different alternatives to see if they actually are better or worse in practice with real loads (rather than merely better/worse in theory). ;)
I totally agree with you that lock-less tends to be really wasteful on collision. I was giving bonus points for being lock-free (as in: at least one process must be able to continue at all times). On deeper thought, that is actually a requirement for me. Deadlocks are a no-go.


Since I asked here, I've gotten some suggestions on IRC, so let me outline the ideas in case someone else later has the same questions and finds this thread:

1) forget about segments and use page granularity

What makes the problem hard is that segments can appear, disappear, grow, shrink, split, merge or just change attributes. There are way too many special cases to consider. The smallest unit that can be affected is a page, so let's stick with that.
Then a simple (uncompressed) n-ary trie can manage all the pages. Set n=512 and each node in the trie is exactly a page, and the trie has height 4 no matter how many pages are in it. It's basically a page table, except the leaves contain the page attributes instead of the addresses of physical pages.
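
A sketch of the structure and the index math (assuming 4 KiB nodes holding 512 eight-byte slots and 48-bit virtual addresses, i.e. the same 9-9-9-9-12 split that x86-64 page tables use):

Code: Select all

#include <cstdint>

// One trie node fills exactly one 4 KiB page: 512 eight-byte slots.
struct TrieNode {
    uint64_t slot[512];   // interior level: child pointer; leaf level: attributes
};

// 9-bit index into trie level 'level' (0 = root) for a 48-bit address.
static inline unsigned trie_index(uint64_t addr, int level) {
    return (addr >> (39 - 9 * level)) & 0x1FF;
}

// Walk the 4 fixed levels; returns the attribute word for the page
// containing 'addr', or 0 if no segment covers that page.
uint64_t lookup_attributes(const TrieNode* root, uint64_t addr) {
    const TrieNode* node = root;
    for (int level = 0; level < 3; ++level) {
        uint64_t next = node->slot[trie_index(addr, level)];
        if (next == 0)
            return 0;             // hole: address not inside any segment
        node = reinterpret_cast<const TrieNode*>(next);
    }
    return node->slot[trie_index(addr, 3)];
}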

For multithreading, each page in the trie must have a write lock. Luckily there are some spare bits in each node that can be used for that. All operations can be done in a single pass, top to bottom, and most writes will only affect the lower-level nodes.

Drawback: larger segments need to act on a lot of pages. They will have to hold write locks on higher-level nodes, high enough that only a single node at that level is involved. The lock problem can probably be avoided by not allocating segments across such node boundaries.

2) radix tree

This also simplifies segment splits and merges somewhat. A split creates a new subtree and a merge joins two leaves and their parent into a new segment. Both are localised operations. A problem I see there is that merges need to propagate back up, so access won't be top to bottom alone. To make it lock-free, merging might have to be delayed/offloaded to the next thread. Maybe on the way down the thread could already check whether a merge is possible at all and keep locked all the nodes that might be needed in a merge, but that seems to make matters rather complex. It might be simpler to just do merges on the way down on every access: a huge merge would not propagate up all the way, but would slowly rise each time there is an access in that region.
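
A sketch of a one-level "merge on the way down" step (the node layout is hypothetical, and a binary radix node keeps the example short):

Code: Select all

#include <cstdint>

struct Node {
    bool     leaf;
    uint64_t attr;       // valid when leaf
    Node*    child[2];   // radix-2 for brevity; a real tree would be wider
};

// Called while descending for some other operation: if both children are
// leaves with identical attributes, fold them into this node. Repeated
// accesses let a big merge rise one level at a time instead of having to
// propagate back up in one pass.
bool try_collapse(Node* n) {
    if (n->leaf || n->child[0] == nullptr || n->child[1] == nullptr)
        return false;
    Node* a = n->child[0];
    Node* b = n->child[1];
    if (a->leaf && b->leaf && a->attr == b->attr) {
        n->leaf = true;
        n->attr = a->attr;
        n->child[0] = n->child[1] = nullptr;   // freeing a and b omitted
        return true;
    }
    return false;
}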

3) B+Tree

In a B+Tree, each node on each level has not only pointers to its children but also a pointer to its next sibling. So when a segment is found, it is easy to get the next segment to the right, even if the segment is the last one in its node. Now the trick would be to find the segment that starts to the left of what we actually want, then go right until there is a segment that ends too far to the right, and then split/merge/change/insert/remove as needed. Potentially a lot of segments could get removed in one go, which would cause a lot of nodes to get merged in the tree, maybe even reducing the height of the tree by 1. That is a problem for the multithreaded case, but the job could be offloaded to the next thread (the one holding the lock) if the parent can't be locked on the way back up.
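
A sketch of the leaf-level range walk (find_leaf() stands in for the standard B+ tree descent, which is omitted):

Code: Select all

#include <cstdint>
#include <vector>

struct Segment { uint64_t start, end; uint64_t attrs; };

struct Leaf {
    std::vector<Segment> segs;   // sorted by start
    Leaf* next;                  // sibling pointer to the leaf on the right
};

Leaf* find_leaf(uint64_t addr);  // assumed: standard B+ tree search

// Collect every segment overlapping [lo, hi) by locating the first
// candidate leaf and then walking right along the sibling chain.
std::vector<Segment*> segments_in_range(uint64_t lo, uint64_t hi) {
    std::vector<Segment*> out;
    for (Leaf* leaf = find_leaf(lo); leaf != nullptr; leaf = leaf->next)
        for (Segment& s : leaf->segs) {
            if (s.end <= lo) continue;        // entirely left of the range
            if (s.start >= hi) return out;    // first segment past the range
            out.push_back(&s);
        }
    return out;
}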

Space/complexity consideration: a 4-level-deep B+ tree might hold as few as 4 million segments and take as much space as simply mapping each page separately. Is all the complexity of it worth it?

Re: What is the right kind of tree for handling segments?

Posted: Tue Dec 11, 2012 3:15 pm
by bluemoon
mrvn wrote: A Debian BD iso is 25G; at 16k torrent chunks the worst case would be 819,200 segments. And you probably know how random the access pattern for torrents is.
A sane torrent client should mmap as big a view as possible, and retry with smaller sizes to find the optimal size for the "file view".
- On a 64-bit system it could mmap(25G) to request one segment and let the OS page in/out on demand.
- On a 32-bit system, the client may end up mapping 65536 16k chunks of the file, which adds up to 1 GB of address space; subsequent calls would result in OUT_OF_MEMORY. (The actual number of chunks may also be subject to restrictions on handles.) It is very unlikely to mmap 819200 segments at the same time.

Re: What is the right kind of tree for handling segments?

Posted: Tue Dec 11, 2012 4:10 pm
by mrvn
bluemoon wrote: - On a 64-bit system it could mmap(25G) to request one segment and let the OS page in/out on demand.
Yes, the client would mmap(25G). But as you say, the OS would page that in/out on demand, and that is exactly where the problem is. The demand is highly random, so lots of little bits get paged in, which then get mapped to actual physical pages instead of being a reference to the file. As such, the mapping of all those little bits would change from not-present to present.