Rahman wrote:
do you think winzip is a good zipping utility ? because, when i tried to compress a text file having only 4 chars "abcd", the compressed file showed a size of 112 bytes.
A compressed file is in fact not only the compressed data. It's the compressed data (which cannot be smaller than 0 bytes) plus the compression information, the filename, directory information, permission information, and possibly some other things. That fixed overhead is why your 4-byte file turned into a 112-byte archive: a file with 4 characters is simply not compressible, period.
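You can see the overhead for yourself. A minimal sketch, assuming Python and its standard zipfile module (the filenames are just for illustration):

import os
import zipfile

# Write a 4-byte text file, like the one in the question.
with open("abcd.txt", "w") as f:
    f.write("abcd")

# Compress it with DEFLATE, the method ZIP archives normally use.
with zipfile.ZipFile("abcd.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("abcd.txt")

print("original:", os.path.getsize("abcd.txt"), "bytes")   # 4 bytes
print("archive: ", os.path.getsize("abcd.zip"), "bytes")   # around a hundred bytes or more

The archive ends up at roughly a hundred bytes or more even though the data inside is only 4 bytes, which is in the same ballpark as the 112 bytes you saw.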
Compression itself is based around some form of repetition. There are direct forms, which exploit AAAAAAAA-like repetition (run-length encoding, or RLE), which is useful in, for example, bitmaps. Most drawn bitmaps have long stretches of the same color, so you can just say "200 pixels of white, 3 black, 197 white" instead of storing WWWWWWWWW...WBBBWW...WWW pixel by pixel, which obviously saves space.
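A minimal run-length encoder in Python, just to show the idea (the function name and the toy inputs are mine, not anything WinZip actually does):

def rle_encode(data):
    # Collapse each run of identical symbols into [count, symbol] pairs.
    runs = []
    for symbol in data:
        if runs and runs[-1][1] == symbol:
            runs[-1][0] += 1
        else:
            runs.append([1, symbol])
    return runs

# A scanline with long stretches of the same color compresses well...
print(rle_encode("W" * 200 + "BBB" + "W" * 197))   # [[200, 'W'], [3, 'B'], [197, 'W']]
# ...but "abcd" has no runs at all, so RLE can't shrink it.
print(rle_encode("abcd"))                          # [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd']]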
There are also more complex forms, such as Huffman coding (which you picked), which counts how often each string in a file occurs (such as words, to keep it simple), then gives each one a code whose length depends on its usage count: 200x "the" and one "administrator" would give the shorter code to "the", obviously, since that saves the extra bits 200 times over instead of once. It then stores the code table followed by the encoded data. It's not good for small files since, obviously, there's not much to be repeated in a 4-byte file. However, since most files are >2k (don't believe Tanenbaum), it's useful.
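A minimal Huffman sketch, assuming Python and its standard heapq module, and working per character rather than per word; it only builds the codes from the symbol counts, it doesn't write a real compressed file:

import heapq
from collections import Counter
from itertools import count

def huffman_codes(data):
    # Count how often each symbol occurs.
    freq = Counter(data)
    tie = count()                      # unique tie-breaker so the dicts are never compared
    # Each heap entry: (total count, tie-breaker, {symbol: code-so-far}).
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        n1, _, lo = heapq.heappop(heap)
        n2, _, hi = heapq.heappop(heap)
        # Merge the two rarest groups; prefix their codes with 0 and 1.
        merged = {s: "0" + c for s, c in lo.items()}
        merged.update({s: "1" + c for s, c in hi.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

# Frequent symbols get short codes, rare ones get long codes.
print(huffman_codes("the " * 200 + "administrator"))

The characters that show up constantly end up with short codes and the rare ones with long codes, which is the whole trick.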
There is also LZW (Lempel-Ziv-Welch), which uses an idea similar to this, but with a twist: it does not need to store the tree, since you can rebuild it from the codes you receive. Should you receive, for instance, a string like

BRBRBR
it would compress to the code for B, the code for R, the first new code, and then the first new code again, being 4 codes in total. The first new code is assigned to the first two characters that appear, and so on and so on (I don't really know how it all works, I only know how it starts).
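A rough sketch of just the encoding side, assuming Python (the table starts out holding every single character and grows as the input is read; the decoder, which I'm leaving out, rebuilds the same table from the codes alone):

def lzw_encode(data):
    # Start with one code per possible single character.
    table = {chr(i): i for i in range(256)}
    next_code = 256                    # the "first new code" from the example above
    w = ""
    out = []
    for c in data:
        if w + c in table:
            w = w + c                  # keep extending the current match
        else:
            out.append(table[w])       # emit the longest known match
            table[w + c] = next_code   # remember the new, longer string
            next_code += 1
            w = c
    if w:
        out.append(table[w])
    return out

# "BRBRBR" compresses to the codes for B, for R, and twice the first new code (256):
print(lzw_encode("BRBRBR"))            # [66, 82, 256, 256]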
However, this algorithm is patented (by Unisys) and therefore you may not use it freely. It's used in GIF, which also means you may not make your own GIF encoder without paying them money (see also www.gnu.org about the pictures).
And then you have lossy compression, such as JPEG, which works by dividing the stuff you see into squares and then storing the information for each square (such as its average color). It then spends its bits on the squares where you would be most likely to notice a distortion, which means the blue ocean comes out completely blocky, but the faces of the people on the docks do not. Humans tend to look at some details first, so those details are encoded well. You tend to never look at a close-up of the sky, which is why barely any information is wasted on the sky.
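Just to make the "squares" idea concrete, here is a toy lossy step in Python: replace each 8x8 block of a grayscale image by its average value. Real JPEG does something much smarter per block (a frequency transform plus quantization), so this is only an illustration of the block idea, not how JPEG actually works:

def block_average(pixels, block=8):
    # pixels: a 2-D list of grayscale values (0-255).
    # Keep one average value per block x block square: a crude, very lossy
    # "compressor" that throws away all detail inside each square.
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(0, h, block):
        row = []
        for x in range(0, w, block):
            square = [pixels[j][i]
                      for j in range(y, min(y + block, h))
                      for i in range(x, min(x + block, w))]
            row.append(sum(square) // len(square))
        out.append(row)
    return out

# A flat blue ocean loses nothing this way; a detailed face would be smeared badly.
flat = [[200] * 16 for _ in range(16)]
print(block_average(flat))             # [[200, 200], [200, 200]]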
My point, in the end, is that there are a lot of algorithms. I didn't know for sure whether you picked Huffman as a best option or as a first option (seeing as you don't know its strengths, you probably didn't look at many others), and I hope this helps you make a good algorithm choice.