A Novel Multidictionary Based Text Compression

Digital content is growing rapidly, and so is the demand for communicating it, while storage capacity and bandwidth increase at a slower rate. Powerful and efficient compression methods are therefore required. Repetition of words and phrases makes the reordered text much more compressible than the original text. This study proposes a novel, fast dictionary based text compression technique, MBRH (Multidictionary with Burrows Wheeler Transform, Run length coding and Huffman coding), with the aim of improving performance on documents of various sizes. The MBRH algorithm comprises two stages: the first converts the input text into a multidictionary based encoding, and the second reduces the redundancy remaining in that encoding using BWT, RLE and Huffman coding. On the whole, the system is fast and achieves close to the best result on the test files. For the bib test file, with an input size of 111,261 bytes, MBRH achieves a compression ratio of 0.192 and a bit rate of 1.538 at high speed. The algorithm attains a good compression ratio, a reduced bit rate and increased execution speed.


INTRODUCTION
Data compression is the process of representing information in a compact form. It decreases the number of bits required to represent data. Similarly, data decompression restores compressed data to its original form. A bit is the most fundamental unit of information in computing and communications and takes the value zero or one. The partial redundancy in uncompressed data makes compression possible; that is, the same information can be stored using fewer bits.
Compression algorithms generally require long execution times and large amounts of memory because of the large number of distinct symbols in the original source (Carus and Mesut, 2010). Text compression coding can be categorized into two groups: statistical based coding and dictionary based coding.
Dictionary-based methods are popular in the data compression domain (Begum and Venkataramani, 2012; Mohan and Govindan, 2005; Sun et al., 2003). In contrast to statistical methods, which use a statistical model of the data and encode the symbols using variable-size code words in accordance with their frequencies of occurrence, dictionary-based methods select strings of symbols to set up a dictionary and then encode them into equal-size tokens using the dictionary (Li et al., 2003; Carus and Mesut, 2010; Bhadade and Trivedi, 2011). The dictionary is formed from these strings and may be either static or dynamic (Mohan and Govindan, 2005). A static dictionary is permanent, occasionally allowing the addition of strings but no deletions, whereas a dynamic dictionary holds strings previously found in the input stream, allowing additions and deletions of strings as new input is processed.
Huffman coding, arithmetic coding and PPM are examples of statistical based coding. In this coding scheme the symbols are coded to variable lengths. The most well known dictionary based coding is the LZ algorithm (Sayood, 2012). In this coding scheme, variable-length codes are used for symbols. Highly redundant texts increase the performance of some text compression algorithms. Earlier studies (Carus and Mesut, 2010; Bhadade and Trivedi, 2011; Sun et al., 2003) use enormous dictionaries of words or phrases and their codes. The common approach is to use a coding scheme with high redundancy (Tadrat and Boonjing, 2008). However, their dictionary sizes make them unsuitable for embedding in compression algorithms. This study seeks an optimal dictionary as well as a highly redundant coding scheme for this purpose. Different dictionaries with different coding schemes are experimentally investigated with various compression algorithms suited to redundant texts (Al-Bahadili and Hussain, 2010; Martinez-Prieto et al., 2011; Kulekci, 2012). Text transformation increases the performance of text compression.
In the approach known as multidictionary based text compression, the words in the input text are replaced with highly redundant codes. The input file is first transformed into predefined codes and then compressed using BWT, RLE and Huffman coding. On the receiver side it is decompressed using the same algorithms and the original text is extracted, as shown in Fig. 1. The performance in terms of compression ratio is satisfactory; however, a more efficient algorithm would give still better results.

Dictionary Formation
• The words are extracted from the input test file and a table is formed. If the first letter of a word is in upper case, it is converted to lower case
• The frequency of occurrence of each word is calculated and the words in the table are sorted in descending order of frequency
• Each word is assigned an ASCII code. The characters numbered 33-255 are used as codes, except the small letters (a...z) and capital letters (A...Z), so in total 170 characters serve as codes
• A single ASCII character is assigned as the code for each of the first 170 words in the table: ! @ # $ % ^ & * ( ) _ + ... up to ASCII character 255
• For the next 170 words, the same 170 ASCII characters are used with the prefix character 'a', giving two-character codes: a! a@ a# a$ a% a^ a& a* a( a) a_ a+ ... up to ASCII character 255
The shortest codes are assigned to the most frequently used words and the longest codes to the least frequently used words.
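The code assignment described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names, the word-splitting regular expression, the alphabetical tie-break and the continuation of prefixes beyond 'a' ('b', 'c', ...) are all hypothetical. Codes are taken in code point order, and the range 33-255 minus the 52 letters yields 171 symbols as literally stated, so the sketch uses whatever the range yields rather than hard-coding the 170 quoted in the text.

```python
import re
from collections import Counter

# Assumption: code symbols are code points 33..255 without ASCII letters.
CODE_SYMBOLS = [
    chr(c) for c in range(33, 256)
    if not ("a" <= chr(c) <= "z" or "A" <= chr(c) <= "Z")
]
# Assumption: blocks after the first use prefixes 'a', 'b', ..., 'z'.
PREFIXES = "abcdefghijklmnopqrstuvwxyz"


def code_for_rank(rank: int) -> str:
    """Code for the word at a frequency rank (0 = most frequent)."""
    block, offset = divmod(rank, len(CODE_SYMBOLS))
    if block == 0:
        return CODE_SYMBOLS[offset]           # one-character code
    if block > len(PREFIXES):
        raise ValueError("rank too large for this sketch")
    return PREFIXES[block - 1] + CODE_SYMBOLS[offset]  # two-character code


def build_dictionary(text: str) -> dict[str, str]:
    """Map each word to a code, shortest codes to the most frequent words."""
    words = re.findall(r"[A-Za-z]+", text)    # hypothetical tokenizer
    # Only a leading capital is lowered, as the text describes.
    normalized = [w[0].lower() + w[1:] for w in words]
    counts = Counter(normalized)
    # Descending frequency; ties broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: code_for_rank(i) for i, word in enumerate(ranked)}


if __name__ == "__main__":
    table = build_dictionary("The cat and the dog and the bird")
    for word, code in table.items():
        print(repr(code), word)
```

Because the code symbols never include letters, a letter at the start of a code can only be a prefix, which is what lets a decoder tell one-character and two-character codes apart.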

Multidictionary Generation
The multidictionary method helps in retrieving the words from the dictionary rapidly and easily:
• Words starting with the letter 'a' are placed in one dictionary; similarly, words starting with each of the remaining letters are placed in their respective dictionaries
• Codes are assigned to the words in the various dictionaries
• Words that start with a character other than a letter, such as other ASCII characters, are grouped into a separate dictionary
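The grouping step can be illustrated with a short Python sketch. The function names and the "other" key for the non-letter dictionary are hypothetical, and the sketch only records each word's position inside its sub-dictionary; how the source turns that position into a code is not reproduced here.

```python
from collections import defaultdict


def dictionary_key(word: str) -> str:
    """Pick the sub-dictionary for a non-empty word from its first character."""
    first = word[0].lower()
    return first if "a" <= first <= "z" else "other"


def split_into_dictionaries(words):
    """Group words by first letter; non-letter-initial words share one group."""
    dictionaries = defaultdict(list)
    for word in words:
        dictionaries[dictionary_key(word)].append(word)
    return dict(dictionaries)


def lookup(dictionaries, word):
    """Return (dictionary key, position) for a word, or None if absent."""
    key = dictionary_key(word)
    bucket = dictionaries.get(key, [])
    return (key, bucket.index(word)) if word in bucket else None


if __name__ == "__main__":
    d = split_into_dictionaries(["the", "and", "to", "apple", "3d", "tree"])
    print(d)
    print(lookup(d, "tree"))
```

A lookup only searches the words that share the first character, which is the sense in which splitting the dictionary makes retrieval faster than scanning a single large table.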

Encoding Algorithm

BWT Algorithm
The data encoded by the multidictionary method is given as input to the BWT algorithm:
• A block of data is taken and rearranged by a sorting algorithm. The output of the BWT contains the same number of data elements as the input block, but their order may differ
• The reverse transform restores the original order without loss of data
• The BWT is performed on an entire block of data elements at once
• Most lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time, whereas the BWT operates on large chunks of data held in memory. Some files are too big to process in one pass; such a file must be split up and processed as blocks, which is termed the parallel BWT
• The file, divided into n blocks, is given as input to the BWT
• Finally, the output blocks are combined into a single output block. The parallel BWT increases the speed considerably. The output of this method is given as input to the run length coding (Sayood, 2012)
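A naive Python sketch of the block-wise transform is given here for illustration. The source does not say how the transform is made invertible, so the end-of-block sentinel "\0" is an assumption, as are the function names and the block size. The blocks are processed one after another; since each block is independent, they could equally be handed to separate workers, which is presumably what the parallel variant exploits.

```python
def bwt(block: str, eos: str = "\0") -> str:
    """Forward BWT: sort all rotations and keep the last column."""
    if eos in block:
        raise ValueError("block must not contain the sentinel")
    s = block + eos                      # sentinel marks the original row
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Inverse BWT: rebuild the sorted rotation table column by column."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]                 # drop the sentinel


def bwt_blocks(text: str, block_size: int) -> list[str]:
    """Split the text into fixed-size blocks and transform each one."""
    return [bwt(text[i:i + block_size])
            for i in range(0, len(text), block_size)]


def inverse_bwt_blocks(blocks: list[str]) -> str:
    """Invert every block and concatenate the results."""
    return "".join(inverse_bwt(block) for block in blocks)


if __name__ == "__main__":
    blocks = bwt_blocks("the theory of the thing", 8)
    print(blocks)
    print(inverse_bwt_blocks(blocks))
```

Sorting full rotations is quadratic in memory and is used only to keep the sketch short; practical implementations rely on suffix arrays.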

Run Length Coding
• Run length coding is a widely used data compression algorithm. Its main feature is that it replaces a long sequence of the same symbol with a short sequence. The output of the BWT generally contains runs
• A run is a continuous repetition of the same symbol. RLE converts such a long sequence into a short one by writing the number of repetitions of the symbol before the special character '@', which accompanies the repeated symbol (Salomon, 2007)
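The following Python sketch shows one possible reading of this scheme. The exact token layout is not fully specified in the text, so the layout count, '@', symbol is an assumption, as are the minimum run length of 4 and the function names. To keep decoding unambiguous, the sketch also writes literal digits and literal '@' characters in run form, which is a design choice of this example and not something the source states.

```python
from itertools import groupby

MARKER = "@"
DIGITS = "0123456789"
MIN_RUN = 4  # assumption: shorter runs are cheaper to copy literally


def rle_encode(text: str) -> str:
    """Replace long runs with <count>@<symbol> tokens."""
    out = []
    for symbol, group in groupby(text):
        n = len(list(group))
        # Digits and the marker are always tokenised so that the decoder
        # can treat every digit it sees as the start of a count.
        if n >= MIN_RUN or symbol == MARKER or symbol in DIGITS:
            out.append(f"{n}{MARKER}{symbol}")
        else:
            out.append(symbol * n)
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Expand <count>@<symbol> tokens; assumes well-formed input."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] in DIGITS:
            j = i
            while encoded[j] in DIGITS:
                j += 1
            # encoded[j] is the marker, encoded[j + 1] the repeated symbol
            out.append(encoded[j + 1] * int(encoded[i:j]))
            i = j + 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(rle_encode("aaaaabbbc"))
    print(rle_decode(rle_encode("aaaaabbbc")))
```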

Huffman Coding
• The output of the RLE coding is given as input to the Huffman coding
• The number of occurrences of each symbol is determined and a code is generated using Huffman coding. This leads to further compression of the input file
Huffman coding uses a specific method for choosing the representation of each symbol, resulting in a prefix code. No other mapping of individual source symbols to unique strings of bits produces a smaller average output size when the actual symbol frequencies agree with those used to create the code. Huffman coding is a widely used method for creating prefix codes (Sayood, 2012).
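As an illustration of how such a prefix code is built from symbol counts, here is a compact Python sketch using the standard library heap. The function names are hypothetical, the output is a string of '0' and '1' characters for readability, and packing bits into bytes or storing the code table is left out.

```python
import heapq
from collections import Counter


def huffman_codes(data: str) -> dict[str, str]:
    """Build a prefix code by repeatedly merging the two lightest subtrees."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                   # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {sym: ""})
            for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(data: str, codes: dict[str, str]) -> str:
    """Concatenate the code word of every symbol."""
    return "".join(codes[symbol] for symbol in data)


def huffman_decode(bits: str, codes: dict[str, str]) -> str:
    """Read bits until they match a code word; valid because no code
    word is a prefix of another."""
    inverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    codes = huffman_codes("aaaabbc")
    bits = huffman_encode("aaaabbc", codes)
    print(codes, bits, huffman_decode(bits, codes))
```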

Decoding Algorithm
• The Huffman decoding algorithm decodes the binary code from the encoded output. The Huffman decoded output is given as input to the RLE decoder (Sayood, 2012)
• RLE decoding converts each short sequence back into the long sequence of symbols (Salomon, 2007)
• The result is given as input to the reverse BWT, which rearranges the data into its original order
• A combination of a letter (a...z) or (A...Z) with an ASCII character is considered a code, and the equivalent word for the code is searched in the corresponding dictionary
• Similarly, if two consecutive ASCII characters occur, this is considered a separate code and searched in the respective dictionary. Finally, the words are collected in the output file

RESULTS
We performed experiments with the MBRH algorithm on the standard Calgary Corpus text file collection and compared it with some standard existing compression algorithms using Eq. 1 and 2:

$$\text{Compression ratio} = \frac{\text{Output file size}}{\text{Input file size}} \tag{1}$$
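The two measures can be computed as in the Python sketch below. Only the compression ratio formula is legible in this text; the bits per character formula is assumed here to be the common definition, 8 times the output size in bytes divided by the number of input characters, with one byte per character. That assumption is at least consistent with the figures quoted for the bib file, since 8 × 0.192 = 1.536, close to the reported 1.538 once rounding of the ratio is taken into account. The output size used in the example is a hypothetical value chosen for illustration, not a measurement from the source.

```python
def compression_ratio(input_bytes: int, output_bytes: int) -> float:
    """Eq. 1: output file size divided by input file size."""
    return output_bytes / input_bytes


def bits_per_character(input_bytes: int, output_bytes: int) -> float:
    """Assumed form of Eq. 2: compressed bits per input character,
    taking one input character to be one byte."""
    return 8 * output_bytes / input_bytes


if __name__ == "__main__":
    input_size = 111_261   # size of the bib file, as stated in the text
    output_size = 21_390   # hypothetical output size, for illustration only
    print(round(compression_ratio(input_size, output_size), 3))
    print(round(bits_per_character(input_size, output_size), 3))
```

With both measures defined this way, smaller values mean better compression.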

DISCUSSION
MBRH was implemented in Matlab and run on the test files listed in Table 1. It is compared with various compression algorithms: arithmetic coding, Huffman with BWT, LZSS with BWT, Dictionary Based Encoding (DBE), multidictionary based compression and multidictionary BWT with RLE. The compression ratio and bits per character are calculated using Eq. 1 and 2. The comparison is shown in Tables 2 and 3, and the results are shown graphically in Fig. 2 and 3. They show that MBRH outperforms the other techniques in terms of compression ratio and Bits Per Character (BPC). Table 4 shows the time each algorithm takes to compress the input files. MBRH improves the compression ratio compared with other dictionary based compression methods and achieves a lower transmission time.

CONCLUSION
This study proposes a new method of text transformation using multidictionary based encoding. In a channel, greater compression reduces the transmission time. The input text is replaced by variable length codes, so its size can be reduced using multidictionary based compression. The MBRH compression algorithm attains a good compression ratio and reduces both bits per character and conversion time.

Fig. 1. Multidictionary based text compression incorporating a lossless, reversible transformation

Table 1. List of files used in experiments

Table 2. Comparison of BPC

Table 3. Comparison of compression ratio

Table 4. Comparison of compression time