mapped address table, cons and pros
Greetings,
I have been designing a structural table to hold mapped addresses, covering I/O ports and different regions of linear address space. Sensibly enough I've named it the "Mapped Address Table", or MAT for short. I'd like to get your pros and cons on it, because I don't want to waste time and have to change it later, finding that it is too slow or has too much overhead after it is already implemented.
In brief, it is a table of two arrays of 32-bit variables: the first array contains base addresses, the second contains sizes. Each entry identifies a mapped region and is referred to by its index number, taking the pair base[ i ] and size[ i ]. Finally, to indicate where the "base" and "size" arrays begin, there is a record that stores the base address and byte length of the MAT. The base address finds your way to the "base" array, and base address + byte length/2 finds the start of the "size" array. The count of entries within the MAT is simply byte length/8.
[tt] MAT
/\ +-----------------------------+ Base + ByteLength
|| | |
|| | Size Array |
|| | EntryCount = ByteLength/8 |
|| | |
|| | |
|| | |
|| +-----------------------------+ Base + (ByteLength / 2)
|| | |
|| | Base Array |
|| | EntryCount = ByteLength/8 |
|| | |
|| | |
|| | |
|| +-----------------------------+ Base[/tt]
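For concreteness, here is a minimal C sketch of how the two halves could be located from that base/length record. The names (`mat_rec`, `mat_bases`, etc.) are mine, not part of the design, and the record's base is kept as a pointer here rather than the raw 32-bit physical address the design describes:

```c
#include <stdint.h>

typedef struct {
    uint8_t *base;      /* start of the MAT block (a 32-bit physical address in the real design) */
    uint32_t byte_len;  /* total length of the MAT in bytes */
} mat_rec;

/* The "base" array occupies the bottom half of the block... */
static uint32_t *mat_bases(const mat_rec *m) {
    return (uint32_t *)m->base;
}

/* ...and the "size" array the top half, starting at base + byte_len/2. */
static uint32_t *mat_sizes(const mat_rec *m) {
    return (uint32_t *)(m->base + m->byte_len / 2);
}

/* Each entry is one 32-bit base plus one 32-bit size: 8 bytes per entry. */
static uint32_t mat_entry_count(const mat_rec *m) {
    return m->byte_len / 8;
}
```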
What worries me is that when all entries are used up, the table must grow, and when it does, it must be moved into a larger contiguous block of physical memory. The reason the entries are not kept as structures is that I would like to make use of the rep-prefixed opcodes, which I think would make life a lot easier and faster, especially when searching; but of course that's theoretical.
By structure format I mean something like this:
struct MAT_ENTRY {
    DWORD Base;
    DWORD Size;
};
Also, I have documented my theoretical approach to memory management, which is just the groundwork for it; if you would like to read it and comment/criticise, post that here too. The document is an RTF file and can temporarily be downloaded here: http://69.194.132.218/ASRS.rtf
<edit by="Pype"> used 'tt' instead of 'font=Lucida Console' for those who don't have 'Lucida Console' at hand </edit>
Re:mapped address table, cons and pros
On any semi-modern processor I'd use an array of structures instead of separate arrays. Instruction-level looping speed isn't going to be an issue. Keeping related data together (and aligned in some sane way) is a better idea, because it tends to utilize both memory bandwidth and cache space better if you ever do random access. So even if the array never had to grow, it might be a bad idea to split it.
Finally, if you look at the Intel/AMD optimization guides, you are supposed to use prefetch hints, SSE, direct stores (so you don't poison the cache) and selective unrolling when you want to move or process large arrays of data fast. I wouldn't bother in most cases.
What this MAT is going to do is somewhat unclear to me, so I can't really comment on that.
If you are going to need to search it, you can keep it ordered so that you can use binary search. To avoid updating the whole table on every addition, you can keep a separate pile for recent additions, search that linearly, and do newarray = mergesort(array, qsort(pile)) when the pile grows to some predefined limit.
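As a sketch of the search half of that suggestion (the function name is mine): a binary search over the sorted base array, returning the index of the last entry whose base is at or below a given address, which is what a region lookup needs.

```c
#include <stdint.h>
#include <stddef.h>

/* Binary search a sorted array of 32-bit bases. Returns the index of the
 * largest entry whose base is <= addr, or -1 if addr precedes them all. */
static ptrdiff_t find_le(const uint32_t *base, size_t n, uint32_t addr) {
    ptrdiff_t lo = 0, hi = (ptrdiff_t)n - 1, ans = -1;
    while (lo <= hi) {
        ptrdiff_t mid = lo + (hi - lo) / 2;
        if (base[mid] <= addr) {
            ans = mid;          /* candidate; look for a later one */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return ans;
}
```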
Re:mapped address table, cons and pros
To my understanding, having two arrays being processed would utilize the cache lines just as well, as long as they are aligned and handled with the same operand size (correct me if I'm wrong). That won't benefit search operations much anyway, though it does help when I process the entries, such as when merging. But aside from that, the structure is quite simple, holding only two variables; restructuring things so that loops don't have to exist and can be replaced with rep instructions... am I insane? Perhaps.
If you're talking about using SSE to move data, applying prefetches, selective unrolling and direct stores: performance varies from one processor family to another, and sometimes there is no benefit unless you move VERY large arrays, which isn't practical versus an aligned "rep movsd". I have always favored aligned "rep movsd" for its simplicity, being available on all x86 families, and to me it is the fastest practical way to move memory.
The MAT really is just a table so the kernel can track physical memory and I/O port usage. In my OS I only want privilege level 0 tasks in PM32 for the time being, so tracking memory usage for each task in linear space is required, as well as being able to map special areas such as 0-16MB for DMA buffers, and then allocate from the table afterwards.
The idea is to sort of buffer the table so that it doesn't grow on each addition; that's the reason for having a record holding the table's base address and length in bytes. There is merging, but no overhead of sorting; again, rep scasd can be used instead.
Re:mapped address table, cons and pros
My point about caches is that you have 2x4bytes in a struct like this:
struct {
    uint32 a, b;
}
Now, if the struct is aligned to 8 bytes and you read a, the processor fetches a cache line (which is usually at least 8 bytes), which happens to contain both values. So if you then read b, you already have it in cache.
If you instead have them in separate arrays and read a, then only values from the a array are in cache, and if you now read b, the processor has to make another fetch from memory (or the next-level cache or whatever).
Another thing is that if you only need to fetch a single a,b pair (that is not in cache yet), then with a struct you poison one cache line, while with separate arrays you poison two (since your first fetch will normally not be replaced in an n-way associative cache even if it could be, because something else in there is older).
If you are only iterating over the array, it makes no difference, provided you can process the whole of the first array before processing the second array.
Re:mapped address table, cons and pros
Hmm, I went and read about caches in the Intel Optimization Reference Manual. Chapter 2, General Memory Guidelines, in the section on Data Layout Optimizations, notes exactly this about "struct_of_arrays" and quotes the one disadvantage: "Note that struct_of_arrays can have the disadvantage of requiring more independent memory stream references. This can require more prefetches and additional address generation calculations. It can also have an impact on DRAM page access efficiency."
I think that means precisely what you're saying. The way I see it now, it would be efficient when the entire table fits into a single cache line; but when you need to search an entire table that ends up being prefetched across several cache lines anyway, having two arrays would make no difference, or very little, and would not outweigh the convenience of the rep instructions in my opinion; though perhaps having it structured will outweigh the added complexity of the layout. I'm quite confident that these dynamic tables will end up too large to remain in a single cache line, but if one does fit, it would be of very little concern to me to sacrifice the rep instructions.
I figure that the only intensive thing about these tables is the search operations on them, because the entries are neither ordered nor contiguously used; in exchange you do not have to sort and move entries around (which I also think would cause more cache misses). For instance, in order to add to the table I have to search for a free entry, and that can be done with a simple rep scasd stopping when ZF=1. In those terms it utilizes the cache more efficiently, with minimum pollution, by checking whether "size" = 0 rather than filling the cache lines with both size and base variables. Most procedures that process the entries will involve reading all entries in the table, like merging, where you do need to go through everything; so in that sense I do not think it will poison the cache, although several cache lines will be at work. But this is just what I take from chapter 6, which deals with SIMD instructions, and I'm not certain it applies here as well.
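For what it's worth, the rep scasd free-entry scan amounts to this C loop over the size array alone (a sketch; `find_free` is my name for it, and the exact stop condition of the string instruction variant may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* Find the first free entry (size == 0) by scanning only the size array,
 * which is what the rep scasd approach does. Returns the index, or -1. */
static ptrdiff_t find_free(const uint32_t *size, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (size[i] == 0)
            return (ptrdiff_t)i;
    return -1;
}
```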
Re:mapped address table, cons and pros
Just keep the array sorted. Adding a new entry to a sorted table (while keeping it sorted) takes time linear in the size of the table, the same as doing a linear search, and you can then use binary search on the table.
Of course, you can use binary search to find the point at which to start rewriting the array, so as to save about half of the time (on average, of course). This way, adding near the end is almost free.
Also notice that if you delete entries by just marking them deleted and leaving them in place, you can then use those to stop the array rewrite before the end of the table. If you leave the search key (base?) intact, this has almost no effect on your searching. Depending on usage patterns, this can have a huge effect.
This pays off REALLY fast if you have more than a few entries in the table and need to search more often than you add entries.
Of course, if you need to move the array to another (larger) space, then you need to rewrite it anyway.
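A minimal sketch of that insertion scheme (my names, assuming an array-of-structs layout): binary-search the insertion point, then shift the tail up by one slot, which is the linear part of the cost.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { uint32_t base, size; } mat_entry;

/* Insert e into a (sorted by base, *n used entries, room for one more),
 * keeping it sorted: binary-search the slot, then shift the tail. */
static void sorted_insert(mat_entry *a, size_t *n, mat_entry e) {
    size_t lo = 0, hi = *n;
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (a[mid].base < e.base) lo = mid + 1;
        else                      hi = mid;
    }
    memmove(&a[lo + 1], &a[lo], (*n - lo) * sizeof *a);
    a[lo] = e;
    (*n)++;
}
```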
Re:mapped address table, cons and pros
What about a mostly-sorted array?
Say you keep the first N items sorted. When you add an item, you add it at the end. At some point you re-sort the array. That way, adding is amortized O(N log N) and searching, if you reorder often enough, is amortized O(log N). Would that work?
Re:mapped address table, cons and pros
Actually, I wrote a long reply (which I discarded for some reason) about keeping a sorted array and a separate unsorted pile, which is sorted when it overflows. Quicksort is more or less O(N log N) if you take the pivot as median(first, middle, last) to avoid degenerate cases (extra fun: prove that this is sufficient), and it can sort the thing in place. Then merge it into the main array.
Merging two sorted arrays is O(N), except you can binary-search the merge starting point in the main array to make adding near the end less expensive, unless you need to move to a larger space anyway.
Of course, searching is then O(P + log N) where P is the pile size, but if P is relatively small, it should work fine.
Using quicksort is better than just rewriting the pile (to keep it always sorted) if you often need to add several entries at the same time.
DO NOT sort the main array and the pile together when the pile overflows. Separate pile-sorting and merging is O(P log P + N), while sorting the whole thing is O(N log N), and if N is larger than P (which it is supposed to be), the latter makes no sense whatsoever.
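The merge step being described is the standard O(N+P) two-way merge of two sorted runs; a sketch (names are mine):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t base, size; } mat_entry;

/* Merge the sorted main array (n entries) with a sorted pile (p entries)
 * into out, which must have room for n + p entries. O(N + P). */
static void merge_pile(const mat_entry *a, size_t n,
                       const mat_entry *pile, size_t p,
                       mat_entry *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < n && j < p)
        out[k++] = (a[i].base <= pile[j].base) ? a[i++] : pile[j++];
    while (i < n) out[k++] = a[i++];     /* drain whichever run remains */
    while (j < p) out[k++] = pile[j++];
}
```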
Re:mapped address table, cons and pros
Hmm, I don't really know the quicksort algorithm, but I do want an in-place algorithm that does add, merge and sort all in one shot. Sorting does not have to keep the entries packed sequentially; it just has to order the bases from lowest to highest. Having another array to keep an unsorted or different pile is an interesting thought, but it can add complexity with two separate arrays that must stay colinear. Still, I'm having really good thoughts about implementing something along these ideas.
By the way, the pivot idea seems practical when you want to do a lot of adding to the table, and perhaps searching. I think it could be replaced with a mathematical equation taking as variables the lowest base, the highest address of the table, and the count of used elements in the table (maybe I'm just insane), so that you wouldn't need to partition the cells... I don't know, though; it just clicks in my brain that it's somehow possible.
Re:mapped address table, cons and pros
"Pivot" in this case is part of the quicksort algorithm.
I'm going to ignore the stack required for recursion, since getting rid of it makes the whole thing messy, but quicksort basically works like this:
Take some value from the array as the "pivot".
Put any values smaller than the pivot in pile A, and all values larger than the pivot in pile B.
Now any value from A is smaller than any value from B. Right?
Do the same to sort A and B, and so on, until you have at most one value in any given pile, and you are finished.
That's it.
---
Doing the above in place:
1) Scan forward from the beginning, until you find something bigger than the pivot.
2) Scan backward from the end, until you find something smaller than the pivot.
3) Swap the two (both wrong side -> swap -> both right side)
4) Stop when the forward and backward scanners meet.
The point where you stopped is where you divide the array. Any smaller-than-pivot values are now before this division point, and any larger-than-pivot values after it.
If you were lucky (or selected the pivot right), your two halves will be about the same size, which is the best possible case.
---
How to select pivot value?
You could take the first value, but then you will get the worst case if your array is already sorted, because no value will be smaller than the pivot.
Any other single value will give you similar degenerate cases, but if you take the first, middle, and last values and throw away the biggest and smallest, then even if you hit a degenerate case in one recursion step, it won't happen on the next step (at least assuming the normal in-place implementation).
If I'm not mistaken, this is sufficient (can't remember how to prove it, so won't check it right now).
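Putting the in-place partition and the median-of-three pivot together, a quicksort over 32-bit keys might look like this (a sketch; recursion depth is left unmanaged here, which a kernel would want to bound):

```c
#include <stdint.h>
#include <stddef.h>

static void swap32(uint32_t *x, uint32_t *y) { uint32_t t = *x; *x = *y; *y = t; }

/* In-place quicksort over a[lo..hi] with a median-of-three pivot. */
static void qsort32(uint32_t *a, ptrdiff_t lo, ptrdiff_t hi) {
    if (lo >= hi) return;
    ptrdiff_t mid = lo + (hi - lo) / 2;
    /* median-of-three: order a[lo], a[mid], a[hi]; the middle one is the pivot */
    if (a[mid] < a[lo])  swap32(&a[mid], &a[lo]);
    if (a[hi]  < a[lo])  swap32(&a[hi],  &a[lo]);
    if (a[hi]  < a[mid]) swap32(&a[hi],  &a[mid]);
    uint32_t pivot = a[mid];
    ptrdiff_t i = lo, j = hi;
    while (i <= j) {                 /* Hoare-style partition */
        while (a[i] < pivot) i++;    /* forward scan */
        while (a[j] > pivot) j--;    /* backward scan */
        if (i <= j) { swap32(&a[i], &a[j]); i++; j--; }
    }
    qsort32(a, lo, j);               /* recurse on both halves */
    qsort32(a, i, hi);
}
```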
Re:mapped address table, cons and pros
Hmm, I'm thinking of merging during the process, something like this:
Search forward in the Base array for anything less than or equal to A. If found, the element is merged with the specified points A to B, I is set to that element's address, and the element is marked free. However, if the element's base is greater than point A, points A to B are swapped with the element's contents, the element is rechecked, and the scan continues to the last entry. This results in an I pointing to where the merged points A to B should be saved, so the bases stay sorted in place as we merge. It also ensures points A to B can be merged with multiple elements when they specify an area overlapping two or more elements' addresses. If I does not point to anything, the merged points A to B are saved in the last free element. Given that the table is sorted and I holds a value, the scan can stop when an element's base is greater than B.
A and B are the parameters of the region to be mapped into the table.
When points A to B merge with multiple elements, a side effect may be holes of free entries. But that should not affect the algorithm, as long as it checks only valid entries. Hope this makes sense.
EDIT: hmm, to add to the confusion, I should correct something about the algorithm: elements don't get merged unless points A to B fall within the elements' address boundaries; they are swapped when A is less than or equal to an element's base address.
Re:mapped address table, cons and pros
I had to doodle a bit to make sure I'm not insane. So hopefully this clarifies what I was talking about.
a = base address (the address to be mapped)
b = (base address + size), also referred to as the upper bound
i = a table element.
1. if a is within the boundary of an element's base and upper bound, a merge into a, b occurs.
2. when a merge occurs, the element which a, b merged with is marked free.
3. when a merge occurs, i points to the first element which a, b merged with.
4. a, b are swapped with an element's base and bound when a < the element's base.
5. stop when b < base, or when the last used entry has been processed.
6. a, b are saved into i on completion, unless i holds no value, in which case they are saved in ElementArray[nUsedCount].
the doodle:
[tt](a)00F00h,(b)1200h 0000h,1200h > 0000h,3000h > 0000h,3000h <--completed by (b < base)
---------1---------------2---------------3-------->
Table [0000h,1000h] [1100h,3000h] [4000h,5000h]
Table After [ i ] [ free ] [4000h,5000h]
Resulting Table [0000h,3000h] [ free ] [4000h,5000h]
| |
a b
when swap occurs:
(a)0000h,(b)1000h 2000h,3000h > 2000h,3000h > 2000h,3000h <-- completed (b < Base)
---------1---------------2---------------3-------->
Table [2000h,3000h] [ free ] [4000h,5000h]
Table After [ swap ] [ i ] [4000h,5000h]
Resulting Table [0000h,1000h] [2000h,3000h] [4000h,5000h]
| |
a b[/tt]
edit: some corrections to the rules.
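For comparison, here is a simplified C sketch of the same merge idea over a table kept sorted by base. Unlike the rules above, it compacts the table instead of leaving free holes, so it is an illustration of the merge behaviour rather than the exact algorithm (names are mine; entries are [base, bound) pairs as in the doodle):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t base, bound; } region;  /* [base, bound) */

/* Merge the new region [a,b) into table t (n entries, sorted by base,
 * room for n+1): absorb every entry it overlaps or touches, keep the
 * rest, and insert the widened region in sorted position. Returns the
 * new entry count. */
static size_t merge_region(region *t, size_t n, uint32_t a, uint32_t b) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].bound < a || t[i].base > b) {
            t[k++] = t[i];                     /* disjoint: keep as-is */
        } else {                               /* overlaps [a,b): absorb */
            if (t[i].base  < a) a = t[i].base;
            if (t[i].bound > b) b = t[i].bound;
        }
    }
    /* insert the merged region, shifting later entries up one slot */
    size_t pos = k;
    while (pos > 0 && t[pos - 1].base > a) { t[pos] = t[pos - 1]; pos--; }
    t[pos].base = a;
    t[pos].bound = b;
    return k + 1;
}
```

Running this on the first doodle (merging [0F00h,1200h) into the three-entry table) absorbs the first two entries and yields [0000h,3000h) followed by [4000h,5000h).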
Re:mapped address table, cons and pros
PDCLib 0.4 (see my signature) contains a non-recursive Quicksort implementation with insertion sort for small sets, PD'ed courtesy of Raymond Gardner / Paul Edwards. Feel free to use it.
mystran wrote: "Pivot" in this case is a part of the quicksort algorithm. I'm going to ignore stack required for recursion, since getting rid of it makes the whole thing messy, but quicksort basicly works like this...
Every good solution is obvious once you've found it.