Deflate compression - Huffman coding question
Posted: Mon Oct 17, 2011 5:31 pm
Hello, I'm working on a compression library for my OS (Visopsys), starting with the DEFLATE algorithm (gzip, zip). I've got static Huffman codes working (compress and decompress), but I'm hitting a problem with my dynamic Huffman codes. When building the second-level tree that encodes the literal/length and distance code lengths, my codes are exceeding the 7-bit maximum (the lengths for that tree are transmitted in 3-bit fields, so 7 bits is the ceiling). I'm getting a fairly degenerate binary tree.
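For reference, here's the repair pass I've been sketching -- a minimal sketch only, and the names (limit_lengths, NUM_CLEN, etc.) are mine, not zlib's or Visopsys's. The idea, as I understand it from descriptions of zlib's gen_bitlen(), is to build the tree normally, clamp any code length over the limit, and then restore the Kraft inequality by demoting leaves from shallower levels. (The textbook alternative is the package-merge algorithm, which constructs optimal length-limited codes directly.)

    #include <stdlib.h>

    #define MAX_BITS  7    /* DEFLATE's limit for the code-length tree */
    #define NUM_CLEN  19   /* symbols in the code-length alphabet      */

    static const int *g_len;   /* for the qsort comparator below */

    static int by_length(const void *a, const void *b)
    {
        return (g_len[*(const int *)a] - g_len[*(const int *)b]);
    }

    /* Clamp any Huffman code length above MAX_BITS, then repair the
     * Kraft inequality by demoting leaves.  len[] holds one length per
     * symbol (0 = symbol unused); n <= NUM_CLEN. */
    void limit_lengths(int len[], int n)
    {
        int bl_count[MAX_BITS + 1] = { 0 };
        int idx[NUM_CLEN];
        int overflow = 0;
        int i, k, bits;

        for (i = 0; i < n; i++) {
            if (len[i] > MAX_BITS) {
                len[i] = MAX_BITS;   /* too deep: clamp it for now */
                overflow++;
            }
            bl_count[len[i]]++;
        }

        /* Each clamped code left the tree over-full.  Take one leaf
         * from the deepest level with room, push it down a level, and
         * attach one of the overflowed codes as its new sibling. */
        while (overflow > 0) {
            bits = MAX_BITS - 1;
            while (bl_count[bits] == 0)
                bits--;
            bl_count[bits]--;          /* one leaf moves down...           */
            bl_count[bits + 1] += 2;   /* ...gaining an overflowed sibling */
            bl_count[MAX_BITS]--;
            overflow -= 2;
        }

        /* Hand the corrected lengths back out, keeping the shortest
         * codes on the symbols that had the shortest codes before
         * (i.e. the most frequent ones). */
        for (i = 0; i < n; i++)
            idx[i] = i;
        g_len = len;
        qsort(idx, n, sizeof(int), by_length);

        k = bl_count[0];   /* unused symbols keep length 0 */
        for (bits = 1; bits <= MAX_BITS; bits++)
            for (i = 0; i < bl_count[bits]; i++)
                len[idx[k++]] = bits;
    }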
In this case, I'm attempting to compress a piece of binary code (written in assembler) that has a lot of different byte values and not many repeats, so the code counts are pretty evenly distributed. I'm also not doing the run lengths yet; I'm treating everything as literals. I know I'll have to find a way to reduce my input block size when I hit some limit on the code counts, but can anyone suggest a good way to calculate that limit mathematically, without repeatedly/speculatively rebuilding my tree once I reach some number of counts that could theoretically produce a worst-case, degenerate Huffman tree? (I guess that would only be something like 8 counts, really -- not very many.)
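Working the worst case through by hand: the deepest possible tree comes from Fibonacci-like counts (1, 1, 2, 3, 5, ...), where each new symbol just barely loses every merge and extends the chain one more level. The minimal totals run 2, 3, 5, 8, ... for depths 1, 2, 3, 4, ..., so -- if I've done the arithmetic right -- a codeword longer than 7 bits is only possible once the counts sum to at least 55. A throwaway check (the function name is mine):

    #include <stdio.h>

    /* Smallest total frequency at which a Huffman code can exceed
     * 'limit' bits.  The worst case is Fibonacci-like counts, so the
     * minimal totals follow the Fibonacci recurrence. */
    unsigned long overflow_threshold(int limit)
    {
        unsigned long prev = 2, cur = 3;   /* minimal totals for depths 1 and 2 */
        int d;

        for (d = 3; d <= limit + 1; d++) {
            unsigned long next = prev + cur;
            prev = cur;
            cur = next;
        }
        return cur;   /* minimal total that can force depth limit + 1 */
    }

    int main(void)
    {
        /* For the 7-bit code-length tree this prints 55: any set of
         * counts summing to 54 or less is provably safe. */
        printf("%lu\n", overflow_threshold(7));
        return 0;
    }

The catch is that the weights feeding the 7-bit tree are the counts of each code length across all ~316 literal/length and distance symbols, so their total easily exceeds 55 -- which makes me think the real answer is length-limiting that tree rather than shrinking the input block. By the same sequence, the main 15-bit trees can't overflow until a block holds something like 2,584 symbols.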
In the same vein, does anyone have a summary of a good way to choose the compression method for a block by looking at the input bytes -- static, dynamic, stored, etc.?
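From what I've pieced together, the usual approach isn't to guess from the raw bytes at all: tally the symbol frequencies for the block, price all three encodings exactly in bits, and emit the cheapest (zlib makes the equivalent comparison after building its trees). A sketch, with illustrative names and the stored-block padding assumed worst-case:

    typedef enum { BLOCK_STORED, BLOCK_FIXED, BLOCK_DYNAMIC } block_type;

    /* Pick the cheapest encoding for one block.
     *   raw_bytes     - uncompressed size of the block
     *   fixed_bits    - sum of freq[s] * static_code_length(s)
     *   dyn_tree_bits - cost of transmitting the dynamic code lengths
     *   dyn_data_bits - sum of freq[s] * dynamic_code_length(s)
     * All four fall out of the frequency counts you already keep. */
    block_type choose_block_type(unsigned long raw_bytes,
                                 unsigned long fixed_bits,
                                 unsigned long dyn_tree_bits,
                                 unsigned long dyn_data_bits)
    {
        /* 3-bit block header; a stored block also pads to a byte
         * boundary (up to 7 bits) and spends 4 bytes on LEN/NLEN. */
        unsigned long stored  = 3 + 7 + 32 + 8 * raw_bytes;
        unsigned long fixed   = 3 + fixed_bits;
        unsigned long dynamic = 3 + dyn_tree_bits + dyn_data_bits;

        if (stored <= fixed && stored <= dynamic)
            return BLOCK_STORED;    /* incompressible: don't expand it */
        if (fixed <= dynamic)
            return BLOCK_FIXED;     /* tree overhead outweighs savings */
        return BLOCK_DYNAMIC;
    }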
I hope that wasn't too convoluted; hopefully it makes sense to anyone who has worked with this type of compression before. I'd prefer not to go and study the zlib code if I don't have to.
Andy