
Deflate compression - Huffman coding question

Posted: Mon Oct 17, 2011 5:31 pm
by andymc
Hello, I'm working on a compression library for my OS (Visopsys), starting with the DEFLATE algorithm (gzip, zip). I've got static Huffman codes working (compress and decompress), but I'm hitting a problem with my dynamic Huffman codes. When building the second-level tree (the one that codes the code-length counts for the literal/length and distance codes), my trees are exceeding the 7-bit maximum code length -- I'm getting a fairly degenerate binary tree.

In this case, I'm attempting to compress a piece of binary code (written in assembler) that has a lot of different byte values and not many repeats, so the code counts are pretty evenly distributed. I'm also not doing the run lengths yet; I'm treating everything as literals. I know that I'll have to find a way to reduce my input block size when I hit some limit on the number of code counts, but can anyone suggest a good way to calculate that limit mathematically, without repeatedly/speculatively rebuilding my tree once I reach some number of code counts that could theoretically result in a worst-case, degenerate Huffman tree? (I guess that would only be something like 8 counts really -- not very many.)

In the same vein, does anyone have a summary of a good way to choose the compression method when looking at the input bytes? static/dynamic/none, etc?

I hope that wasn't too convoluted. I'm hoping that will make sense to anyone who has worked with this type of compression before. I'd prefer not to go and study the zlib code if I don't have to :oops:

Andy

Re: Deflate compression - Huffman coding question

Posted: Wed Oct 19, 2011 1:43 pm
by turdus
andymc wrote:but can anyone suggest a good way to calculate that mathematically, without repeatedly/speculatively rebuilding my tree once I reach some number of code counts that could theoretically result in a worst-case, degraded Huffman tree? (I guess that would be only be something like 8 counts really -- not very many)
The book Introduction to Algorithms (MIT Press, ISBN 0262032937) has a good one. (I suppose by dynamic Huffman you mean adaptive.) The proven worst case for this coding is when the counts of your elements follow the Fibonacci numbers. In that case you get a binary tree with only one leaf on the left side at each level, and everything else on the right side, something like:

Code:

  /\
   /\
    /\
     /\
      ...
which also means it's very unlikely you'll end up with the worst case, because the counts would have to grow Fibonacci-style.
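
To turn that worst case into a number you can check without rebuilding the tree: a consequence of the Fibonacci worst case is that a Huffman codeword of length L can only occur when the sum of all the frequency counts is at least the (L+2)-th Fibonacci number (F(1) = F(2) = 1). Here's a quick C sketch of that bound; the function names are mine, and the exact constant depends on how ties are broken, so treat it as an estimate rather than gospel:

Code:

#include <stdio.h>

/* Longest code length a Huffman tree built over counts summing to
 * 'total' could possibly produce, using the Fibonacci bound. */
static int maxPossibleCodeLength(unsigned long total)
{
    unsigned long fibPrev = 1, fibCur = 1;  /* F(1), F(2) */
    unsigned long fibNext = 0;
    int length = 0;

    while ((fibPrev + fibCur) <= total)
    {
        fibNext = fibPrev + fibCur;
        fibPrev = fibCur;
        fibCur = fibNext;
        length += 1;
    }

    return (length);
}

int main(void)
{
    unsigned long total = 0;

    /* Smallest total count that could even theoretically break the
     * 7-bit limit of the second-level (code length) tree. */
    for (total = 2; maxPossibleCodeLength(total) <= 7; total ++);
    printf("a code length of 8 needs a total count of at least %lu\n",
        total);

    /* The code-length counts can sum to roughly 316 (286 literal/length
     * codes + 30 distance codes), so the limit isn't impossible to hit,
     * just very unlikely unless the counts look Fibonacci-ish. */
    printf("total=316 -> worst-case length %d\n",
        maxPossibleCodeLength(316));

    return (0);
}

So you can run the check against a block's counts before committing to a block size: if the total stays below the threshold for length 8, the 7-bit limit can't be broken no matter how the counts are distributed.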
andymc wrote:In the same vein, does anyone have a summary of a good way to choose the compression method when looking at the input bytes? static/dynamic/none, etc?
I doubt it; the only way to tell for sure is to compress the same input with each of your methods and compare the output sizes. Since most compression algorithms use an adaptive form, you cannot predict the behaviour by examining a few bytes at the start. You'd have to look at the full input to know, but that's usually impractical (not enough RAM), which is why adaptive mappings came into play in the first place. (You can make a fairly good prediction of which algorithm should work well on your input by using some statistical logic, but you cannot be sure.)
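
If you want a cheap version of that statistical logic, one option is to build a byte histogram of the block and estimate its zero-order entropy, then use that as a hint. Here's a minimal C sketch; the thresholds and names are made up for illustration, and it ignores LZ77-style matches (where most of DEFLATE's gain comes from), so trying each method and keeping the smaller output is still the only sure test:

Code:

#include <math.h>
#include <stdio.h>

typedef enum { METHOD_NONE, METHOD_STATIC, METHOD_DYNAMIC } method_t;

/* Estimate bits/byte from the byte histogram and guess a block type. */
static method_t guessMethod(const unsigned char *data, unsigned len)
{
    unsigned counts[256] = { 0 };
    double bitsPerByte = 0.0;
    unsigned i;

    for (i = 0; i < len; i ++)
        counts[data[i]] += 1;

    for (i = 0; i < 256; i ++)
    {
        if (counts[i])
        {
            double p = ((double) counts[i] / len);
            bitsPerByte -= (p * log2(p));
        }
    }

    /* Arbitrary example cut-offs -- tune against real data, or just
     * compress with each method and keep the smallest output. */
    if (bitsPerByte > 7.9)
        return (METHOD_NONE);     /* nearly random: store uncompressed */
    else if (bitsPerByte > 7.0)
        return (METHOD_STATIC);   /* mild skew: fixed codes are enough */
    else
        return (METHOD_DYNAMIC);  /* strong skew: custom codes pay off */
}

int main(void)
{
    unsigned char text[] = "aaaaabbbcc aaaaabbbcc aaaaabbbcc";

    printf("suggested method: %d\n",
        (int) guessMethod(text, (sizeof(text) - 1)));
    return (0);
}

In practice you'd tune the cut-offs against real data, or skip the guess entirely and just emit whichever block type came out smallest.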

It's also proven that no single best compression algorithm exists; some work better on one kind of input, others perform better on another. So in some cases you can gain extra compression ratio by combining several algorithms. If you also want to take that into account when examining the input bytes, you'll have to do quite a lot of work before you can even start compressing. I don't think it's worth it, but who knows?