OSDev.org

Posted: **Sat Jul 17, 2010 11:13 am**

Hmm... Just went to my DragonFly VM and looked up the ELF header file. It seems that, according to the symbol table entry (Elf32_Sym), the ELF standard does not keep the computed symbol hash present at the site of the symbol table entry. Therefore, the radix implementation wins even bigger.

Once a bucket has been found, EACH SYMBOL in the bucket must be compared until a match is found, or until the bucket is exhausted. Indeed, in the case of a C++ library or program with the silly SVR4 ABI, the problem becomes pathological. I guess this is why the work had been done with building pre-link maps for some piggy KDE stuff I recall reading about.

--Corey

Posted: **Sat Jul 17, 2010 11:14 am**

MD5 produces 128-bits of output...

Posted: **Sat Jul 17, 2010 11:21 am**

coreybrenner wrote:MD5 keys are what, 40 bytes long?

16 bytes, as Google would've told you.

Longer, by far, than the average symbol name (especially in a C program or library.) Then, you still have to take into account the possibility, however unlikely, of a collision in MD5/SHA space. You must compare the symbols themselves. And if they do not match, you still have to continue down the bucket.

Yeah, we call that less than average case.

I'd like to see your performance comparison. Given that I've never had a feeling of "hey, this program is loading so slow because the symbol lookups in ELF are so slow" I seriously doubt how much real-world performance you're going to gain. Especially given the added complexity of a radix tree versus a hash tree.

Posted: **Sat Jul 17, 2010 11:30 am**

The clobber list would be good for a smart loader, which can generate contract thunks, for one thing, and for a hypothetical dynamic language which could use the library's clobber info to create optimized calling code via JIT...

It would cause horrible breakage for a linker to use a library's clobber list when linking the program, but a continual optimization engine could use this information when running in the background, in idle cycles, to enhance the performance of a long-running program.

This approach could make mixed-mode code, especially in light of an approach like Semantic Dictionary Encoding, a more tractable problem. Hmm...

Posted: **Sat Jul 17, 2010 11:42 am**

Candy wrote:
coreybrenner wrote:MD5 keys are what, 40 bytes long?
16 bytes, as Google would've told you.

Was reaching for that memory in the depths of my rear end, and grabbed something else by mistake. So sue me.

Candy wrote:Yeah, we call that less than average case.

Still, a case which must be accounted for, given that it is a possibility. Not so when operating against symbols themselves. The MD5 hash merely adds steps to the process. Expensive ones.

Candy wrote:I'd like to see your performance comparison. Given that I've never had a feeling of "hey, this program is loading so slow because the symbol lookups in ELF are so slow" I seriously doubt how much real-world performance you're going to gain. Especially given the added complexity of a radix tree versus a hash tree.

There certainly must be *some* problem, as there were entire distributions at one point setting up, or considering setting up, prelink-mapped programs. Stuff like KOffice, for instance, were having load-time problems, from what I recall. I may be grabbing for data where the sun don't shine again, and plucking out something else again, but I recall at the time that I thought these problems were silly, and the solutions were even sillier, which is what started me thinking about this in the first place.

At any rate, while it works, I think the ELF hash is suboptimal. Especially paired with the SVR4 ABI (which, incidentally, is not a problem with a radix tree implementation, so a name-mangled "flat" radix table could be kept, which would allow a loader to resolve symbols in a library in the new binary format, allowing older programs to load against new libraries.)

--Corey

Posted: **Sat Jul 17, 2010 1:30 pm**

WRT the complexity of a radix implementation vs. a hash, there isn't a lot of complexity from the perspective of the loader, and once a suitable ADT interface is established, none for the link editor putting together a program or library, or an assembler spewing out an object file.

The radix tree/patricia trie lookup is very simple. At each "node" in the table, you look at a bit index. Test the value of the bit in your symbol. If 0, skip to the node at the offset represented by node slot 0. If 1, skip to the node at the offset represented by node slot 1. Test the index at that node against that of the last node. If the new node's index <= the last node's index, you have reached the end of your search. Compare the length of the symbol at that address vs. your own symbol. If no match, the symbol is not present in the table. If the length matches, compare your symbol against that embedded in the table (better for locality). The result of the match, if you get this far, determines whether or not the symbol is in the table.

Code: Select all

struct rnode {
    uint16_t index;
    uint16_t ssize;
    uint32_t nslot[2];
    uint8_t  sname[];  // bytes dictated by .ssize, padded to 4 bytes.
};

It occurs to me that you can know whether the next node will terminate your search or not, based upon the slot offset in the current node. If .nslot[bitvalue] <= offset of current node, the search will terminate at the next node. That is, if the table was packed by a depth-first traversal of the radix tree used to store the symbols during assembly/link-edit/ranlib. For whatever that's worth. There's probably a jump that can be optimized away or something like that by an assembly language implementation of a radix table lookup.

Posted: **Sat Jul 17, 2010 3:53 pm**

Candy wrote: I'd like to see your performance comparison. Given that I've never had a feeling of "hey, this program is loading so slow because the symbol lookups in ELF are so slow" I seriously doubt how much real-world performance you're going to gain. Especially given the added complexity of a radix tree versus a hash tree.

Obviously, programs load slow because they are big, so they require a lot of DMA and time. When it comes to CPU power, your machine runs 10 times slower than it should because programmers use bad algorithms and waste CPU cycles. You don't notice the CPU power wastage in your load times because it is below your pain threshold. I agree that traversing a trie is going to always be slightly faster than calculating a 32b hash, doing a modulus, accessing a bucket to get a sym index, accessing the sym index to get a string table offset, passing the string table offset and a pointer to the original sym string into a string comparison routine, and then comparing the two strings all the way out to the 0 for a match.

You cannot guarantee a perfect hash for a pure hash search -- because your app is being dynamically linked at runtime against an unknown set of libraries that contain a large and random set of symbols.

Posted: **Sat Jul 17, 2010 10:01 pm**

There are actually perfect hash generators out there which would work for this sort of thing, and I believe you can set it up such that the functions which lookup data in the hash can be generic, based upon some key data stored with the bucket set. Given that, you could successfully make a hash-based symbol table very performant, indeed, if the hashing algorithm was not too compute intensive. Tougher, but feasible, is perfect minimal hashing. For linking a big library or application, that might take some time to create a hash function which will resolve all symbols perfectly. Once done, though, lookup should be fairly quick.

Posted: **Sun Jul 18, 2010 3:49 pm**

My previous analysis of this idea also took the idea of symbol rejection into account.

It is slow as heck to attempt to resolve a symbol in an ELF library, when that symbol does not exist in the library. Especially with the SVR4 C++ ABI. You MUST examine an entire hash bucket to determine that a symbol does not exist in the symbol table. Then, you move on to the next library.

With the radix table, you can know with at most one string comparison whether you've hit or missed. For most C programs, this may be a pretty small issue. For something like KDE/Qt, this could be huge.

--Corey

Posted: **Tue Jul 20, 2010 6:07 pm**

Fwiw, I've seen prelink save roughly 5-10 seconds of an initial 20 second program start time. This was in a large C++ application.

Also worth mentioning: the remaining majority of the time was the application loading various plugins / etc...

Loading C++ code in linux can be terribly slow, there is a reason why prelink exists....

OSDev.org

Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?

Re: Opinions? ELF Binary format -- weaknesses?