The Huffman Compression algorithm is an algorithm used to compress files. It does this by assigning smaller codes to frequently used characters and longer codes for characters that are less frequently used.
Code: A code is a sequence of zeros and ones that can uniquely represent a character.
A file is a collection of characters. In a file certain characters are used more than others. The number of bits required to represent each character depends upon the number of characters that have to be represented. Using one bit we can represent two characters. i.e. 0 represents the first character and 1 represents the second character. Using two bits we can represent 22
i.e 4 characters.
00 - first character
01 - second character
10 - third character
11 - fourth character.
In general if we want to represent n characters we will need 2n bits for representing one character. The ASCII code uses 7 bits to represent a character. Therefore 27= 128 bits can be represented using ASCII code.
The ASCII code and the above codes used above to represent characters are known as fixed length codes. This is because each character has the same bit length. i.e the number of bits required to represent each character is the same. In ASCII code every character requires 7 bits. Using variable length codes for each character we can reduce the size of a file considerably. By assigning smaller codes for more frequently used characters and larger codes for less frequently used characters, a file can be compressed considerably.
EXAMPLE-1
For example consider a file consisting of the following data
AAAAAAAAAABBBBBBBBCCCCCCDDD DDEE
Frequency: The number of times a character occurs in a file is usually called its frequency.
The frequency of
A is 10
B is 8
C is 6
D is 5
E is 2
If each character is represented by using three bits then number of bits require to store this file will be
3*10 + 3*8 + 3*6 + 3*5 +3*2 = 93 Bits.
Now suppose we represent the character
A by the code 11
B by the code 10
C by the code 00
D by the code 011
E by the code 010
then the size of the file becomes 2*10 + 2*8 + 2*6 + 3*5 + 3*2 = 69 bits. A certain amount of compression has been achieved. In general much higher compression ratios can be achieved by using this method.
As you can see frequently used characters are assigned smaller codes while less frequently used characters are assigned larger codes. One of the difficulties of using a variable-length code is knowing when you have reached the end of a character in reading a sequence of zeros and ones. This problem can be solved if we design the code in such a way that no complete code for any character is the beginning of the code for another character. In the above case A is represented by 11. No other code begins with 11. Similarly B is assigned the code 00. No other code begins with 00. Such codes are known as prefix codes.
In general, we can attain significant savings if we use variable-length prefix codes that take advantage of the relative frequencies of the symbols in the messages to be encoded. One particular scheme for doing this is called the Huffman encoding method, after its discoverer, David Huffman.
A Huffman code can be represented as a binary tree whose leaves are the characters that are encoded. At each non-leaf node of the tree there is a set containing all the characters in the leaves that lie below the node. In addition, each leaf is assigned a weight (which is the frequency of the character), and each non-leaf node contains a weight that is the sum of all the weights of the leaves lying below it.
Answered by Anju Singh
at
7:54 PM on October 31, 2008