A 2-D DCT Hardware Codec based on Loeffler Algorithm

2-D IDCT decoding. It is fully functional, capable of being used in any hardware coding algorithm. The Loeffler algorithm is one of the newest and most effective fast DCT algorithms for hardware implementation. 2-D DCT/IDCT was computed by applying 1-D operations on matrix using a row/column approach. For a hardware implementation, additional data buffers and control logic is used to perform 2-D transforms. Since hardware implementation is less accurate than mathematical modelling, more errors occur in the computational results. However, differences are small and cannot be directly observed. Ill


Introduction
Digital picture and video applications are becoming a inseparable part of our everyday lives.Still image and video devices such as cell phones and cameras rely on codecs (coders and decoders) for image coding, processing and compression.These codecs must meet the evergrowing demands for power dissipation and operation speed.Currently the most popular standards of compression are JPEG for image, MPEG-1, 2, 4 and H263 for video [1].All of these standards use the discrete cosine transform (DCT) and the inverse discrete cosine transform (IDCT) as a basis of their operation.DCT and the IDCT are some of the most computationally intensive blocks, therefore creating performance bottlenecks in picture and video data compression algorithms [2].The computational complexity of the DCT in the coder surpasses than of any other unit, consuming 21% of the total computations [3,4].The IDCT also incurs the largest computational cost in the decoder.The complexity of these systems leads to large power dissipation and a fast, lossless yet energy efficient hardware codec is required.Implementing DCT/IDCT, as an ASIC is a design solution, which meets the real time processing requirements [2].
Since it's introduction, numerous fast algorithms have been proposed to calculate the DCT and IDCT.This paper focuses on the fast Loeffler algorithm and it's hardware applications.

Discrete cosine transform
The DCT transforms data from the spatial domain to the spatial frequency domain.It decorrelates the image data, which is typically highly correlated for small areas of the image.Heavy correlation provides much redundant information, whereas just a few pieces of uncorrelated information can represent the same data much more efficiently.Through the DCT transform, the energy of an image sample is packed into just a few spatial frequency coefficients.The DCT typically operates on 8x8 pixel blocks of image data.For an NxN block of samples x(m,n), the two dimensional (2-D) DCT and IDCT are formulated: multiplications for an NxN block, so they are rarely implemented in hardware [5].
Beside that, computation of DCT/IDCT need floating-point multiplications, which is slow both in hardware and in software implementations.To achieve better operating speeds, coefficients can be scaled and approximated by integers so that floating-point multiplications can be replaced by integer multiplications.
DCT and IDCT are highly computational intensive, which creates prerequisites for performance bottlenecks in systems utilizing them.To overcome this problem, a number of algorithms have been proposed for more efficient computations of these transforms.
The measure of algorithm effectiveness is the number of mathematical operations needed to perform it.The number of multiplications is extremely important, since it is a relatively complex and power consuming operation, to be performed in hardware.The number of computations can be reduced from N 4 to 2Nlog 2 N using fast DCT algorithms.Many one-dimensional algorithms were published over the years for N data points where N = 2 m , m ≥ 2 [6][7][8].These algorithms are based on the sparse factorizations of the DCT matrix and many of them are recursive.The theoretical lower bound on the number of multiplications required for 1-D 8-point DCT has proven to be 11.In this sense, the method proposed by Loeffler, with 11 multiplications and 29 additions, is the most efficient solution.That's why it was chosen for this implementation.
Loeffler algorithm is combined out of stages.The stages of the algorithm numbered 1 to 4 are parts that have to be executed in serial mode due to the data dependency.In stage 2, the algorithm splits into two parts.One for the even coefficients, the other for the odd ones.The even part is a 4 point DCT, again separated in even and odd parts in stage 3.The rounding block in Fig. 1 signifies a multiplication by a constant.Each stage must be executed in series and can not be evaluates in parallel because of data dependencies.However, calculations inside one stage can be parallelized.We clarify the building blocks of the algorithm in Fig. 2.
The "butterfly" and the rotation block with their associated equations [9] The rotation block can be calculated using 3 multiplications and 3 additions instead of 4 multiplications and 2 additions.
The two dimensional DCT transform equation ( 1) can be expressed as This property, known as separability, has the advantage that 2D DCT can be computed in two steps by successive 1-D operations on rows and columns.The idea of a so-called row-column approach and is graphically illustrated in Fig. 3.It is the most popular approach for computing the 2D DCT/IDCT.It result in simple regular implementations that are less computationally efficient than direct 2D implementations.

Algorithm modelling
As we can see from the graph in Fig. 1, only 7 constant values need to be stored in the memory for Loeffler algorithm to function.
The algorithm signal-flow graph (Fig. 1) can be directly translated into a hardware implementation.However, it still uses floating-point multiplication which is too complex for hardware.The workaround for this problem is constant scaling.All constant can be multiplied by a factor of 2, so the floating point is eliminated, and the results are divided accordingly.In hardware applications, this kind of multiplication and division is easily done by shifting bits left or right.The choice of fixed-point constants is arbitrary.Using more bits gives better accuracy, because a more exact value of the constant is retained but with increased risk of overflow and surplus hardware register logic for storing and processing data.
To measure the quality of coding, the peak signal-tonoise ratio (PSNR) was calculated.It is most easily defined via the mean squared error (MSE) which for two m×n monochrome images I and K, where one of the images is considered a noisy approximation of the other, is defined as The PSNR is defined as where MAX I is the maximum possible pixel value of the image.When the pixels are represented using 8 bits per sample, this is 255 [10].Typical values for the PSNR in lossy image and video compression are between 30 and 50 dB.
Mathematical modelling showed that the best choice for quality vs. size is a constant length of 10 bits.The codec using 10 bit constants is lossy because of fixed-point arithmetic.We show the different images used to test the algorithm Fig. 4. The PSNR measured for each of them are shown in Table 1.

Picture PSNR
Lena.bmp 33.6577 Cell.bmp 33.2790 Fruit.bmp 33.7290 Using a constant length of less than 10 bits leads to a lossy algorithm, which adds gross errors in the encoding of the image, which, when decoded, show up as visible relics in the picture.Although reducing constant bit size has a downside of reducing image quality, it also greatly reduces the area and power consumption of the finished schematic, there for it might still be used in applications, where image kxC n quality is a secondary requirement, and power dissipation is a key, like portable video systems or cellular phones.

Hardware implementation
Hardware model of 2D-DCT is implemented using VHDL language.The system is divided into the following blocks:  Computation unit;  Data buffer;  Counter;  General control unit.These blocks are shown in Fig. 5.

Fig. 5. Top level schematic
The computation unit performs one dimensional DCT or IDCT operation.This block has a generic interface, suitable for both DCT and IDCT algorithms.Interface has 8 data input buses and 8 output data buses.Beside data buses, the block uses start and done signals, for control and data flow simplification.This design allows reuse of all the system components for other DCT and IDCT computation algorithms.
For encoding, the system computes a DCT for a onedimensional 8-point input sequence using Loeffler algorithm.This algorithm is directly translated from Fig. 1 and adapted to use fixed-point arithmetic with 10 bits constant length.
For decoding data DCT block is replaced with IDCT.Inverse transform may be computed by executing the stages backwards.A codec using any other 8-point 1D-DCT/IDCT computation algorithm can be implemented via reusing the components of this system.Data buffer is the most complex block of the system.It has 4 functions: 1. Input and storing an 8x8 matrix; 2. Output lines of matrix to DCT processor and replace data lines with the result; 3. Output columns of the matrix to the DCT processor and replace data line with the result; 4. Output of the data matrix.When 1D-DCT processor computes a result of any given 8-point input sequence the input data is no longer needed and can be overwritten with the computation result.This reduces memory size, resulting in smaller area and power consumption, however this method increases interconnect network.
General control unit is implemented as a finite state machine.It is used to control data flow between components.

Implementation results
We simulated VHDL register transfer level (RTL) code using Cadence simulation and modelling tools.Synthesis was performed using Synopsys tools.Since file formats for binary files, read by VHDL are not standardized, we used a pre/post-processor to convert binary images files to text format and vice versa.
Images used for testing where the same as in mathematical modelling.Results of synthesized gate level circuit simulation are shown in  We show differences in PSNR for mathematical modelling and gate level circuit modelling of each image in Fig. 6.Implementation is also proven to work on FPGA matrices.A Xilinx Spartan-3AN was used to synthesize and test the RTL code.Utilization summary for the device is shown in Table 3.

Conclusions
In this paper, we reported a hardware implementation scheme and results of a Loeffler algorithm based 2-D discrete cosine transform codec.This codec performs a full 2-D DCT encoding and 2-D IDCT decoding.It is fully functional, capable of being used in any hardware coding algorithm.The Loeffler algorithm is one of the newest and most effective fast DCT algorithms for hardware implementation.2-D DCT/IDCT was computed by applying 1-D operations on matrix using a row/column approach.Mathematical and gate-level modelling of the algorithm has shown that fixed-point arithmetic and constant approximations add errors to the results of the algorithm.
For a hardware implementation, additional data buffers and control logic is used to perform 2-D transforms.Since hardware implementation is less accurate than mathematic modelling, more errors occur in the computational results.However, differences are small and cannot be directly observed.This codec should provide enough quality for low resolution image compression or processing algorithms at high speed.
j) = 1 for j ≠ 0. The straightforward implementations of DCT and IDCT require N4

Table 2 .
Gate level simulation results