Area Optimized Architecture for AES Mix Column Operation

Advanced Encryption Standard (AES) is one of the most popular cryptographic algorithms used for data protection. The cost and power consumption of AES can be reduced considerably by optimizing its architecture. This paper proposes a compact architecture for the AES mix columns operation and its inverse, together with a method of resource sharing between the two. The delay and area of the hardware implementation are compared with previous work in this area. The proposed architecture has been implemented on the most recent Xilinx Spartan FPGA, and the comparison shows that the proposed technique has lower area and delay.


I. INTRODUCTION
The Advanced Encryption Standard is a symmetric-key algorithm used for the encryption of electronic data. In a symmetric-key algorithm the same key is used for both encrypting and decrypting the data. Because of the security it offers against attacks, AES has become the default choice in numerous applications.
The AES algorithm is an iterative algorithm composed of 10, 12 or 14 rounds. It has a fixed block size of 128 bits and a key size of 128, 192 or 256 bits, on which the number of rounds depends. The algorithm consists of four byte-oriented transformations and a key expansion function [1]. In the 10-round case, after the initial secret key addition (roundkey(0)), the first 9 rounds are identical and only the final round differs [10]. Each of the first 9 rounds consists of four transformations: SubBytes, ShiftRows, MixColumns and AddRoundKey. The final round omits the MixColumns transformation. The encryption scheme can be inverted to obtain the decryption structure. The SubBytes transformation is a non-linear byte substitution that operates independently on each byte of the State using a substitution table (S-box). The S-box is constructed by composing two transformations: the multiplicative inverse in the finite field GF(2^8) and an affine transformation [6] [1].
ShiftRows is a cyclic shift of the bytes in each row of the state, with a different offset for each row. AddRoundKey is a self-inverting transformation: in the first iteration of the algorithm it XORs the 128 bits of the plain text with 128 bits of the expanded cipher key, and in the subsequent iterations it XORs the partially processed data with the expanded cipher key. In the mix column operation, each column of the state is multiplied by a known matrix. It is a process that takes in 32 bits of data and outputs 32 bits of data. This work addresses a method to optimize the area consumed by the mix column operation. The hardware for mix column works on 8 bits of data at a time and produces the 128-bit output in 16 clock cycles. This efficient architecture fits well for embedded applications. The design is implemented in Verilog HDL and synthesized for a Xilinx Spartan 3 device using the Xilinx ISE tool.
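The two simpler transformations described above can be illustrated with a small software model. This is only a sketch in Python, not the paper's Verilog design; the function names `shift_rows` and `add_round_key` and the column-major byte layout (byte at row r, column c stored at index r + 4c) are assumptions made for the example.

```python
def shift_rows(state):
    """Cyclically shift row r of the 4x4 state left by r positions.

    `state` is a list of 16 bytes in column-major order
    (assumed layout: index = row + 4 * column).
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def add_round_key(state, round_key):
    """XOR each state byte with the matching round-key byte."""
    return [s ^ k for s, k in zip(state, round_key)]


if __name__ == "__main__":
    state = list(range(16))
    key = [0xA5] * 16  # arbitrary illustrative key bytes
    print(shift_rows(state))
    # Applying add_round_key twice with the same key restores the input.
    print(add_round_key(add_round_key(state, key), key) == state)
```

Because XOR is its own inverse, the same `add_round_key` routine serves both encryption and decryption, which is why the text calls it self-inverting.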

II. PRELIMINARIES
The MixColumn function takes four bytes as input and outputs four bytes. Each input byte affects all four bytes of the output. A fixed matrix is used to transform the state. Each column is treated as a four-term polynomial over GF(2^8) and multiplied modulo x^4 + 1 by a fixed polynomial. The mix column operation works on 32 bits, considering one column of the state at a time. All operations are performed in the Galois field. In the Galois field, addition is performed as an XOR operation. Multiplication by {02} at byte level is a left shift followed by a conditional bitwise XOR with {1B}, applied when the most significant bit of the input byte is 1. Multiplication by any constant can be implemented by repeating this operation and adding (XORing) the intermediate results [1].
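The arithmetic described above can be sketched in a few lines of Python. This is a software reference model under stated assumptions, not the hardware of this paper: the names `xtime`, `gf_mul`, `MIX_ROW` and `mix_column` are chosen for the example, and the coefficients {02}, {03}, {01}, {01} are those of the standard AES MixColumns matrix.

```python
def xtime(b):
    """Multiply byte b by {02} in GF(2^8): left shift, then XOR {1B} if the MSB was 1."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0x00)) & 0xFF


def gf_mul(a, c):
    """Multiply byte a by constant c using only xtime and XOR (shift-and-add)."""
    result = 0
    while c:
        if c & 1:
            result ^= a
        a = xtime(a)
        c >>= 1
    return result


# First row of the standard MixColumns matrix; the other rows are rotations of it.
MIX_ROW = (0x02, 0x03, 0x01, 0x01)


def mix_column(col):
    """Apply MixColumns to one 4-byte column."""
    out = []
    for j in range(4):
        acc = 0
        for i in range(4):
            acc ^= gf_mul(col[i], MIX_ROW[(i - j) % 4])
        out.append(acc)
    return out


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Since only {01}, {02} and {03} appear in the forward matrix, a hardware realisation needs a single `xtime` stage per byte: {03}a is simply `xtime(a) ^ a`.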

A. Mix column
In this module, one byte of a column is processed at a time. The result of mix column is available in four clock cycles, as shown in Fig. 4. In each clock cycle a new byte is fed to the unit, and the four registers store the intermediate results of the MixColumn calculation. Every four cycles, upon completion of a column, the 32-bit result is transferred to the output registers. The architecture takes 16 clock cycles in total to complete the mix column operation on a state [13].
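One possible behavioral model of such a byte-serial unit is sketched below in Python. It is not a transcription of Fig. 4: the rotating register arrangement and the name `serial_mix_column` are assumptions, chosen so that each register position always receives the same fixed multiple of the input byte ({01}, {01}, {03}, {02}) while the partial sums rotate by one position per clock.

```python
def xtime(b):
    """Multiply byte b by {02} in GF(2^8)."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0x00)) & 0xFF


def serial_mix_column(col):
    """Process one column byte by byte, one loop iteration per clock cycle.

    regs models the four 8-bit registers R1..R4, cleared before the column
    starts. After four cycles regs holds the four output bytes in order.
    """
    regs = [0, 0, 0, 0]
    for a in col:
        x2 = xtime(a)
        # Fixed multiples feeding register positions 0..3: {01}a, {01}a, {03}a, {02}a
        products = (a, a, x2 ^ a, x2)
        # Each register takes its neighbour's partial sum plus its own product.
        regs = [regs[(k + 1) % 4] ^ products[k] for k in range(4)]
    return regs


if __name__ == "__main__":
    print([hex(b) for b in serial_mix_column([0xDB, 0x13, 0x53, 0x45])])
```

The point of the rotation is that the multiplier wiring never changes from cycle to cycle, so the datapath needs one `xtime` block and four byte-wide XORs rather than a full 32-bit matrix multiplier.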

B. Inverse Mix column
InvMixColumns() is the inverse of the MixColumns() transformation. InvMixColumns() operates on the State column by column, treating each column as a four-term polynomial. The constant multiples in the polynomial can be generated in the same way as in the mix column operation. Its area can again be reduced considerably by substructure sharing within the units, as shown in Fig. 6. This unit also produces the output in 16 clock cycles [3] [4].
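The idea of sharing sub-expressions among the inverse constants can be sketched as follows. The standard InvMixColumns coefficients are {0E}, {0B}, {0D} and {09}; all four can be built from one chain of three {02} multiplications. The function names `inv_multiples` and `inv_mix_column` are assumptions for this Python example, and the sketch does not claim to reproduce the exact sharing of Fig. 6.

```python
def xtime(b):
    """Multiply byte b by {02} in GF(2^8)."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0x00)) & 0xFF


def inv_multiples(a):
    """Return {09}a, {0B}a, {0D}a and {0E}a from one shared xtime chain."""
    x2 = xtime(a)    # {02}a
    x4 = xtime(x2)   # {04}a
    x8 = xtime(x4)   # {08}a
    return {
        0x09: x8 ^ a,
        0x0B: x8 ^ x2 ^ a,
        0x0D: x8 ^ x4 ^ a,
        0x0E: x8 ^ x4 ^ x2,
    }


# First row of the standard InvMixColumns matrix; other rows are rotations.
INV_ROW = (0x0E, 0x0B, 0x0D, 0x09)


def inv_mix_column(col):
    """Apply InvMixColumns to one 4-byte column."""
    prods = [inv_multiples(a) for a in col]
    out = []
    for j in range(4):
        acc = 0
        for i in range(4):
            acc ^= prods[i][INV_ROW[(i - j) % 4]]
        out.append(acc)
    return out


if __name__ == "__main__":
    print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```

Computing {02}a, {04}a and {08}a once per input byte and reusing them for all four constants is what keeps the inverse unit from needing four independent constant multipliers.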
The area can be reduced further by sharing units between mix column and inverse mix column, using multiplexers that select the appropriate polynomial according to the select signals [5].
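One well-known way to share hardware between the two directions, which may differ from the scheme of [5], uses the identity d(x) = ({04}x^2 + {05}) c(x) mod (x^4 + 1), where c(x) and d(x) are the standard forward and inverse MixColumns polynomials. The inverse is then the forward unit preceded by a small pre-processing stage, and a multiplexer bypasses that stage for encryption. The Python sketch below models this; the names `preprocess` and `shared_mix_column` and the boolean `inverse` select signal are assumptions for illustration.

```python
def xtime(b):
    """Multiply byte b by {02} in GF(2^8)."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0x00)) & 0xFF


def mix_column(col):
    """Forward MixColumns on one column (coefficients {02},{03},{01},{01})."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def preprocess(col):
    """Multiply the column by {04}x^2 + {05}: p_j = {05}a_j ^ {04}a_(j+2)."""
    x4 = [xtime(xtime(a)) for a in col]
    return [x4[j] ^ col[j] ^ x4[(j + 2) % 4] for j in range(4)]


def shared_mix_column(col, inverse):
    """Shared datapath: `inverse` acts as the multiplexer select signal."""
    stage_in = preprocess(col) if inverse else list(col)
    return mix_column(stage_in)


if __name__ == "__main__":
    enc = shared_mix_column([0xDB, 0x13, 0x53, 0x45], inverse=False)
    dec = shared_mix_column(enc, inverse=True)
    print([hex(b) for b in enc], [hex(b) for b in dec])
```

With this arrangement the only hardware specific to decryption is the pre-processing stage and the multiplexer, while the forward mix column logic is reused unchanged.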

C. Control unit
Since the mix column operation takes its data one column at a time, a control unit is required to feed the data to the architecture byte by byte and to produce the output only after 4 clock cycles. The purpose of the control unit is to provide enable signals to the registers and to the mix column unit. The activity is controlled using a 3 bit counter. With the help of the counter, the values in the registers R1, R2, R3 and R4 in Fig. 4 are transferred to the 32-bit output register at the completion of 4 cycles. Upon completion of the 4th cycle, the values in the registers are reset to zero using the enable signal provided to the mix column unit. At the 16th clock cycle the mix column result for a complete state is available. In case of 8 and 32 bit systems these operations can [...]
The simulation result of mix column with the control unit is shown in Fig. 8. The output is verified for all combinations of the input signals.
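The sequencing performed by the control unit can be modelled in software as follows. This Python sketch is an assumption-laden illustration, not the paper's Verilog: the function name `mix_state_serial`, the rotating-register datapath, and the use of a simple modulo-4 byte counter to generate the output-load and register-clear events are all choices made for the example.

```python
def xtime(b):
    """Multiply byte b by {02} in GF(2^8)."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0x00)) & 0xFF


def mix_state_serial(state):
    """Feed a 16-byte state (column-major) one byte per clock cycle.

    Returns a list of (cycle, column) pairs: every 4th cycle the register
    contents are loaded into the 32-bit output register and then cleared.
    """
    regs = [0, 0, 0, 0]   # R1..R4
    count = 0             # byte counter within the current column
    outputs = []
    for cycle, a in enumerate(state, start=1):
        x2 = xtime(a)
        products = (a, a, x2 ^ a, x2)   # {01}a, {01}a, {03}a, {02}a
        regs = [regs[(k + 1) % 4] ^ products[k] for k in range(4)]
        count += 1
        if count == 4:                  # output-register load enable
            outputs.append((cycle, list(regs)))
            regs = [0, 0, 0, 0]         # clear accumulators for the next column
            count = 0
    return outputs


if __name__ == "__main__":
    state = [0xDB, 0x13, 0x53, 0x45, 0xF2, 0x0A, 0x22, 0x5C,
             0x01, 0x01, 0x01, 0x01, 0xC6, 0xC6, 0xC6, 0xC6]
    for cycle, column in mix_state_serial(state):
        print(cycle, [hex(b) for b in column])
```

In this model a column result appears at cycles 4, 8, 12 and 16, matching the 16-cycle latency for a full state described above.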

IV. CONCLUSION
This work addresses the area optimization of the mix column architecture. The results show that the proposed hardware reduces device utilization compared with a direct implementation of the mix column equations, at the cost of a delay of only 16 clock cycles. The whole design is carried out and synthesized with Xilinx tools, and the simulation targets the Xilinx Spartan 3 device.