Zipping Utility
hello guys,
i've planned to develop a zipping utility using the Huffman coding algorithm. Could you please explain to me how to save the encoded content into a file, in a form that could be decoded easily, for a newbie?
thanks
Re:Zipping Utility
What do you want to know -- how to encode the data, or how to save the data to a file? Saving the data to a file is easy, though it depends on what language you're using (in C, use fwrite; in C++, use ofstream). If you want to know how to encode the data I can explain that, too.
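For instance, with fwrite in C the writing part is only a few lines. A minimal sketch (the save_encoded name and the buffer contents are made up for illustration):

```c
#include <stdio.h>

/* Write a buffer of already-encoded bytes to disk.
   Returns 0 on success, -1 on error. */
int save_encoded(const char *path, const unsigned char *data, size_t len)
{
    FILE *f = fopen(path, "wb");   /* "wb": binary mode matters on Windows */
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, len, f);  /* len items of 1 byte each */
    fclose(f);
    return written == len ? 0 : -1;
}
```

The hard part isn't this call; it's deciding what bytes to hand it, which is what the rest of the thread is about.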
Re:Zipping Utility
thank you for your support. i just wanted to know how to store the details about the encoded data in the compressed file. Suppose the encoded code for the char 'a' is 101b; should i maintain a table of all the encoded characters, so that, while decoding, the program can read the codes from the table? But if i store the codes in a table, the size of the compressed file still increases.
thanks
Re:Zipping Utility
Right, you'll need to store the Huffman tree in the encoded file, otherwise you'll never work out which binary code represents which character.
After that, you need to work out a way of writing arbitrary binary sequences to a file. For instance, your most common character might be stored as a binary 0, your second most common character as 10, your third as 110, and so on. Clearly you can only write full bytes to a file. So you need to come up with a set of routines which combine blocks of bits into bytes that you can actually write.
The idea is to clump individual blocks of bits together into whatever unit is convenient (a byte here, or you might want to pack them into 32-bit DWORDs on a 32-bit CPU), and write that unit to disk whenever you've filled a block of 8, or 32, or whatever. The function below is pretty horrid, but it demonstrates the principles.
A very simple method might be:
Code: Select all
#include <stdio.h>
#include <stdbool.h>

#define BITS_PER_BYTE 8

/* FILE has no current_byte member, so carry the partial byte
   and the bit position in a small struct of our own. */
typedef struct {
    FILE *f;
    unsigned char current_byte;
    int position_within_byte;
} bit_file;

void fwrite_bit(bit_file *bf, bool bit)
{
    if (bf->position_within_byte == 0)
        bf->current_byte = 0;                      /* started new byte */
    if (bit)
        bf->current_byte |= 1 << bf->position_within_byte;
    bf->position_within_byte++;
    if (bf->position_within_byte == BITS_PER_BYTE) {
        fwrite(&bf->current_byte, 1, 1, bf->f);    /* flush the full byte */
        bf->position_within_byte = 0;
    }
}
Another way might be to store the whole encoded file in memory and set and clear bits in there. But imagine doing that when encoding a 2GB file... Basically, this is a hardcore exercise in boolean operators.
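For decoding you'll need the mirror image: a routine that refills a byte from the file and hands back one bit per call. A self-contained sketch along the same lines (the bit_reader struct and fread_bit name are mine, assuming LSB-first packing to match the writer above):

```c
#include <stdio.h>

#define BITS_PER_BYTE 8

/* Mirror image of the bit-writer: refills current_byte from the
   file every 8 calls and hands back one bit at a time. */
typedef struct {
    FILE *f;
    unsigned char current_byte;
    int position_within_byte;   /* 0 means "need a fresh byte" */
} bit_reader;

/* Returns 0 or 1, or -1 at end of file. */
int fread_bit(bit_reader *br)
{
    if (br->position_within_byte == 0) {
        if (fread(&br->current_byte, 1, 1, br->f) != 1)
            return -1;                         /* no more bytes */
    }
    int bit = (br->current_byte >> br->position_within_byte) & 1;
    br->position_within_byte = (br->position_within_byte + 1) % BITS_PER_BYTE;
    return bit;
}
```

The decoder then feeds these bits into the Huffman tree one at a time until it lands on a leaf.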
Re:Zipping Utility
Ok. But how do i ensure that the binary 01 (for example) that i read back is the particular character 'a' (for example)? Do i have to maintain a table like:
a -> 0
b -> 10
c -> ...
thanks.
Re:Zipping Utility
Yes.
You've built up a Huffman tree, right? Then just save that along with the data, in a kind of header. Like you say, a simple mapping from character to bit pattern will work.
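One possible on-disk layout for that header (entirely made up for illustration, not any standard format): a one-byte entry count, then for each character the character itself, its code length in bits, and the code packed into two bytes. The decoder reads the same fields back in the same order.

```c
#include <stdio.h>

/* One code-table entry: character, its Huffman code, and the
   code's length in bits. (Illustrative layout only.) */
struct code_entry {
    unsigned char ch;
    unsigned char nbits;
    unsigned short code;   /* up to 16 bits of code, right-aligned */
};

/* Write the table as a header: a 1-byte count, then 4 bytes per entry. */
int write_header(FILE *f, const struct code_entry *table, unsigned char count)
{
    if (fwrite(&count, 1, 1, f) != 1)
        return -1;
    for (int i = 0; i < count; i++) {
        unsigned char buf[4] = {
            table[i].ch,
            table[i].nbits,
            (unsigned char)(table[i].code & 0xFF),   /* low byte  */
            (unsigned char)(table[i].code >> 8)      /* high byte */
        };
        if (fwrite(buf, 1, 4, f) != 4)
            return -1;
    }
    return 0;
}
```

Writing the bytes out explicitly like this avoids struct-padding and endianness surprises you'd get from fwrite-ing the struct directly.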
Re:Zipping Utility
but, if i map 'a' -> 10, the mapping may take 2 bytes per character in the header. That may be acceptable for a large file, but if the file is small and a character is present only once, the header takes more space than the original file.
for example, if the content of a file is "abcd", then after compression i'd have to maintain a table of 8 (4*2) bytes for the header itself.
thanks
Re:Zipping Utility
stuff the huffman tree, use dynamic huffman instead. or some form of adaptive huffman encoding. :-\
-- Stu --
Re:Zipping Utility
Rahman wrote: for example, if the content of a file is "abcd". after compression, i've to maintain a table of 8 (4*2) bytes for the header itself.
Well, yeah, that's right. So for small files, don't bother doing any compression.
Re:Zipping Utility
do you think winzip is a good zipping utility? Because when i tried to compress a text file having only 4 chars, "abcd", the compressed file showed a size of 112 bytes.
Re:Zipping Utility
Rahman wrote: do you think winzip is a good zipping utility? Because when i tried to compress a text file having only 4 chars, "abcd", the compressed file showed a size of 112 bytes.
A compressed file is in fact not only the compressed data. It's the compressed data (which cannot be smaller than 0 bytes), plus the compression information, the filename, directory information, permission information, and possibly some other things. Files with 4 characters are not compressible, period.
Compression itself is based around some form of repetition. There are direct forms, which use an AAAAAAAA-like repetition (run-length encoding, or RLE), which is useful in, for example, bitmaps. Most drawn bitmaps have long stretches in the same color, so you can just say "200 pixels of white, 3 black, 197 white" instead of listing WWWWWWWWW...WBBBWW...WWW, which obviously saves space.
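That run-length idea takes only a few lines. A sketch (the rle_encode name is mine) that emits (count, byte) pairs:

```c
#include <stddef.h>

/* Encode src[0..len) as (count, byte) pairs into out[].
   Returns the number of bytes written.  out must be able to hold
   2*len bytes in the worst case (input with no runs at all). */
size_t rle_encode(const unsigned char *src, size_t len, unsigned char *out)
{
    size_t n = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char byte = src[i];
        size_t run = 1;
        /* count identical bytes, capped at 255 to fit one count byte */
        while (i + run < len && src[i + run] == byte && run < 255)
            run++;
        out[n++] = (unsigned char)run;
        out[n++] = byte;
        i += run;
    }
    return n;
}
```

Note the worst case: data with no repeats doubles in size, which is why RLE suits bitmaps better than text.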
There are also more complex forms, such as Huffman (which you picked), which counts how often each symbol (or word, to keep it simple) occurs in a file, then gives each a code whose length depends on its usage count (200x "the" and one "administrator" would give the shorter code to "the", obviously, since that saves 200x the bits it would with "administrator"), stores the code table, and then the encoded data. It's not good for small files since, obviously, there's not much to be repeated in a 4-byte file. However, since most files are >2k (don't believe Tanenbaum), it's useful. There is also LZW (Lempel-Ziv-Welch), which uses a similar idea, but with a twist: it does not need to store the tree, since you can rebuild it from the codes you receive. Should you receive, for instance, the sequence
BRBRBR
it would compress to the code for B, the code for R, the first new code, and then the first new code again: 4 codes in total. The first new code is assigned to the first two characters that appear, and so on and so on (I don't really know how it all works, I only know how it starts). However, this algorithm is patented (the Unisys LZW patent) and therefore you may not use it freely. It's used in GIF, which also means you may not make your own GIF encoder without paying them money (see also www.gnu.org about the pictures).
And then you have lossy compression, such as JPEG, which works by dividing the image into squares and storing the information for each square (such as its average color). Then it seeks out the squares where you would be most likely to notice a distortion, which means that the blue ocean ends up completely squary, but the faces of the people on the docks do not. Humans tend to look at some details first, so those details are encoded accurately. You tend never to look closely at the sky, which is why barely any information is wasted on the sky.
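The square-averaging step looks roughly like this (a toy sketch of the idea only; real JPEG runs a DCT over each 8x8 block rather than taking a plain average, and the block_average name is mine):

```c
#define BLOCK 8

/* Average the BLOCK x BLOCK square whose top-left corner is (x, y)
   in a width-pixels-wide grayscale image.  Toy illustration only. */
unsigned char block_average(const unsigned char *img, int width, int x, int y)
{
    unsigned sum = 0;
    for (int dy = 0; dy < BLOCK; dy++)
        for (int dx = 0; dx < BLOCK; dx++)
            sum += img[(y + dy) * width + (x + dx)];
    return (unsigned char)(sum / (BLOCK * BLOCK));
}
```

Storing one byte instead of 64 is where the (lossy) saving comes from; the clever part of JPEG is spending more bytes only on the blocks where you'd notice.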
My point, in the end, is that there are a lot of algorithms. I didn't know for sure whether you picked Huffman as a best option or as a first option (seeing as you don't know its strengths, you probably did not look at many others), and I hope this helps you make a good algorithm choice.