Area-Efficient Hardware Architectures of MISTY1 Block Cipher



Introduction
With the widespread use of wireless devices and embedded systems, cryptographic hardware circuits are becoming a critical component of modern System-on-Chips (SoCs), laying the foundations for network security. However, providing security features in communication networks comes at a cost in performance, either increasing the area or reducing the throughput. Considerable efforts are underway to optimize the hardware design / implementation of encryption algorithms for their respective applications. The MISTY1 block cipher, characterized by a repeated-loop structure, is highly suitable for area-constrained applications. Taking this into account, a study has been carried out on compact MISTY1 implementations for mobile applications, hand-held devices and RFIDs.
Developed by Mitsubishi Electric Corporation, MISTY1 is an ISO / IEC standardized 64-bit block cipher algorithm designed to process small blocks of data, e.g. an 8-byte PIN [1]. It has a proven probability parametric value of 2^-56 against linear and differential cryptanalysis [2]. Many attacks have been proposed by researchers to break the MISTY1 block cipher. Although these attacks exposed several weaknesses of MISTY1, they could not compromise the security of the full 8-round MISTY1 [3], [4]. Moreover, the time complexity and the large amount of plain-text data required to retrieve the secret key make it practically impossible to undermine the security of the MISTY1 block cipher. Therefore, MISTY1 is considered a secure algorithm and is currently employed for online transactions and ATM networks.
In comparison to high-throughput encryption cores, compact designs make use of logic optimization techniques for transformation functions and s-boxes using combinational logic [5][6][7][8][9][10], [12][13][14][15][16][17][18], [21], [22]. Besides, re-utilization methodologies have also been implemented, exploiting the rolling feature of the architecture. The analysis carried out during this work indicates that the most area-efficient cryptographic hardware circuit reported so far comprises 1947 NAND gates for the AES algorithm [15]. Moreover, for low-area MISTY1, a very compact hardware architecture consisting of 2331 NAND gates has been realized in [5]. Although this silicon area is already small, we found that architectural and logic optimizations can be carried out on the MISTY1 algorithm for even more compact hardware implementations. The major contributions / salient features of our work are as follows:
a. Optimization of S9 and S7 s-boxes.
i. Design and implementation of a combined substitution unit for S9 and S7 for compact MISTY1.
ii. Threshold area implementation of S9 / S7 s-boxes.
The paper is organized as follows. The MISTY1 algorithm is briefly described in Sec. 2, followed by the S9 / S7 s-box optimization in Sec. 3. The optimized MISTY1 transformation functions are explained in Sec. 4. A detailed explanation of the proposed MISTY1 hardware architectures is presented in Sec. 5. Lastly, ASIC results and the conclusion are summarized in Sec. 6 and 7, respectively.

64-bit MISTY1 Block Cipher
The MISTY1 algorithm is depicted in Fig. 1a and its constituent functions FO, FI and FL are shown in Figs. 1b-1d, respectively. The algorithm transforms a 64-bit plain text into a 64-bit cipher text using a 128-bit secret key after n rounds of operation. The specifications described in [1] recommend the value for the number of rounds as n > 8. Moreover, for n-round operation, the algorithm requires the generation of a 128-bit extended key using its FI function. MISTY1 odd rounds consist of 2 × FL functions, an FO function and a 32-bit XOR, whereas the even rounds comprise only an FO function and a 32-bit XOR. Furthermore, the last round is an exception consisting of only 2 × FL functions. The outputs of these FL functions are finally concatenated to form the 64-bit cipher text. The FO and FI functions have a Feistel-like structure, with FI consisting of the substitution functions S9 and S7. The substitution functions S9 and S7 map their respective 9-bit and 7-bit inputs to 9-bit and 7-bit outputs by logic operations or LUTs [1].
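The round composition described above can be sketched as a simple operation schedule. This is only an illustrative model of the ordering stated in the text (odd rounds: 2 × FL, FO, 32-bit XOR; even rounds: FO, 32-bit XOR; final stage: 2 × FL); the function name `misty1_round_schedule` and the operation labels are assumptions introduced here, and no data is actually encrypted.

```python
def misty1_round_schedule(n_rounds=8):
    """Return the ordered list of (stage, operation) pairs for n rounds.

    Hypothetical helper: it only lists which functions are used in
    which round, following the description in the text.
    """
    ops = []
    for rnd in range(1, n_rounds + 1):
        if rnd % 2 == 1:
            # Odd rounds apply one FL to each 32-bit half first.
            ops.append((rnd, "FL"))
            ops.append((rnd, "FL"))
        # Every round applies FO followed by a 32-bit XOR.
        ops.append((rnd, "FO"))
        ops.append((rnd, "XOR32"))
    # The closing stage consists of 2 x FL only.
    ops.append(("final", "FL"))
    ops.append(("final", "FL"))
    return ops


def count_operations(schedule):
    """Count how often each operation label occurs in a schedule."""
    counts = {}
    for _, op in schedule:
        counts[op] = counts.get(op, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_operations(misty1_round_schedule(8)))
```

For 8 rounds the schedule contains 10 FL evaluations against 8 FO evaluations, which gives a sense of how often a single shared FL or FO unit would be re-used in a rolled architecture.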

Scheme 1 - S9 / S7 Combined Substitution Unit (CSU)
In this scheme, a Combined Substitution Unit (CSU) is proposed for the S9 and S7 s-boxes to substitute 9-bit and 7-bit inputs with 9-bit and 7-bit outputs, respectively, on alternate clock cycles. The first step in the CSU design is the algebraic reduction of the S9 / S7 logic expressions. To this end, XOR gates are replaced by NOT gates (for both S9 and S7 logic expressions) and 3-input AND gates are reduced to maximize the use of 2-input AND gates (for S7 logic expressions only). Common Sub-expression Elimination (CSE) is then carried out on the combined S9 / S7 logic expressions, thus eliminating the redundant 2-input AND and AND-XOR sub-expressions. The reduced algebraic expressions of S9 and S7 are given in Tab. 1, whereas the common sub-expressions are shown in Tab. 2.
The AND gates of the reduced S9 and S7 logic expressions are shown by their respective bits for simplicity; however, the implementation is carried out by combining the 9 input bits pairwise to form 36 AND combinations. The path delay of the CSU using a parallel AND-XOR hierarchy is expressed as (1), whereas the area reduction compared to straightforward s-boxes {2 × S9 + S7} is found to be 60.8%, as illustrated in Tab.
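The figure of 36 AND combinations follows from choosing 2 of the 9 input bits. A minimal sketch, assuming the AND terms are exactly the unordered bit pairs; the names `AND_PAIRS` and `and_pair_products` are hypothetical, and the actual reduced expressions of Tab. 1 are not reproduced here.

```python
from itertools import combinations

N_BITS = 9  # width of the S9 input

# All unordered pairs (i, j), i < j, of input bit positions.
AND_PAIRS = list(combinations(range(N_BITS), 2))


def and_pair_products(x):
    """Return a dict mapping each bit pair (i, j) to x_i AND x_j.

    x is a 9-bit integer; bit 0 is the least significant bit.
    """
    bits = [(x >> i) & 1 for i in range(N_BITS)]
    return {(i, j): bits[i] & bits[j] for (i, j) in AND_PAIRS}


if __name__ == "__main__":
    print(len(AND_PAIRS))  # number of 2-input AND combinations
    print(sum(and_pair_products(0b000000111).values()))
```

In a shared unit, each of these products needs to be computed only once and can then feed every output expression that uses it, which is the idea behind the common sub-expression elimination described above.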

Scheme 2 - S9 / S7 Threshold Area Implementation
The S9 / S7 threshold area implementation is depicted in Figs. 2 and 3 and consists of MUXes, AND gates, XOR gates and 1-bit high-enabled registers. The proposed design scheme sets a threshold limit on the area of the S9 and S7 s-boxes while still providing a throughput value > 4 Mbps.
The substituted bits for S9 and S7 are produced after 45 and 58 clock cycles respectively and are based on the maximum possibilities of S9 and S7 logic sub-expressions.For instance, S9 s-box has 36 × combinations for AND

FL Function Implementation
Figures 4 and 5 depict FL functions generating a 32-bit output after 2 and 4 clock cycles, respectively. The input to the proposed FL function is a 32-bit plain text or the output of the {FO-XOR-EKG} function, and the outputs are saved in enabled registers. The design provides a reference for area reduction of the FL function and can be configured for 8 and 16 clock cycles operation. The area reduction for compact MISTY1 architectures is mainly due to the use of 1 × FL function; however, the area reduction with the proposed methodologies FL -1 and FL -2 is found to be 4% and 6.1%, respectively, compared to the straightforward FL function (ref. Fig. 1d). The area for the 8 / 16 clock cycles FL function is also given in Tab. 5; since the difference in NAND gates w.r.t. the proposed FL -2 is insignificant, these variants are not implemented in this paper.
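For reference, the behaviour of the straightforward FL function can be modelled in a few lines. This is a sketch following the FL definition in the MISTY1 specification [1] (an AND stage followed by an OR stage on the two 16-bit halves); the parameter names `kl_a` and `kl_b` for the two 16-bit sub-keys are assumptions made here, and the 2- and 4-clock-cycle hardware variants of Figs. 4 and 5 are not modelled.

```python
MASK16 = 0xFFFF


def fl(x32, kl_a, kl_b):
    """Behavioural model of FL on a 32-bit word.

    kl_a and kl_b are hypothetical names for the two 16-bit sub-keys.
    """
    left = (x32 >> 16) & MASK16
    right = x32 & MASK16
    right ^= left & kl_a   # AND stage updates the right half
    left ^= right | kl_b   # OR stage updates the left half
    return (left << 16) | right


def fl_inv(y32, kl_a, kl_b):
    """Inverse of fl: the same two stages applied in reverse order."""
    left = (y32 >> 16) & MASK16
    right = y32 & MASK16
    left ^= right | kl_b
    right ^= left & kl_a
    return (left << 16) | right


if __name__ == "__main__":
    x = 0x12345678
    y = fl(x, 0x0F0F, 0x3C3C)
    print(hex(y), hex(fl_inv(y, 0x0F0F, 0x3C3C)))
```

Since FL needs only 16 AND, 16 OR and 32 XOR bit operations, serializing it over several clock cycles trades this logic against multiplexers and registers, which is consistent with the modest area savings reported for FL -1 and FL -2.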

Novel Design of {FO-XOR-EKG} Function
The {FO-XOR-EKG} function is the core part of the proposed MISTY1 hardware architecture. The re-utilization methodology has been widely adopted for the optimum operation of the {FO-XOR-EKG} function. The idea behind the design of the proposed {FO-XOR-EKG} function is to perform the transformation operations, including FO / FI and the 32-bit XOR operation (appended to FO in rounds 1-8). Moreover, the design can generate the extended keys for onward use in the MISTY1 8-round operation. Combining the above-mentioned functionalities in a single function reduces the circuit area considerably. The area reduction is complemented by the optimized implementation of the S9 and S7 s-boxes within the {FO-XOR-EKG} function. Thus, two design schemes for the {FO-XOR-EKG} function, implementing the CSU-based s-box and the S9 / S7 threshold area s-boxes, are shown in Figs. 6 and

Extended key generation
The two architectures primarily have the same design basis but differ in terms of clock cycle operations. In order to incorporate all the functionalities, 2 × 9-bit XORs and 2 × 7-bit XORs are appended to the optimized s-boxes, with inputs being fed by multiplexers and registers. In addition, a 16-bit secret key is added to the input multiplexer, and KO_i takes a value of either the 16-bit SK or all 0s. Table 6 describes the algorithm / steps involved in the execution of the above-mentioned functions.
The EK generation and the FO function differ in the selection of the input texts and the KO_i XOR. For EK generation, the input and KO_i are assigned as the 16-bit SK and 16-bit 0s, respectively, whereas the input and KO_i for FO are the FL L / FO O/P and SK, respectively. Moreover, compared to EK generation, the FO function has extended clock cycle operations carried out for 3 × FIs, 3 × XORs and the 32-bit XOR (ref to Fig. 1b [7], [10] is depicted in Tab. 7.
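The extended key generation step can be sketched as follows, assuming the standard MISTY1 key schedule from the specification [1]: the 128-bit secret key is split into eight 16-bit words, and each extended key word is obtained by applying FI to one word with the next word as the FI key. The function names `split_key`, `extended_keys` and `toy_fi` are hypothetical; in particular, `toy_fi` is only a stand-in used to exercise the data flow and is not the real FI function (which needs the S9 and S7 tables).

```python
def split_key(key_bytes):
    """Split a 16-byte secret key into eight 16-bit words (big-endian)."""
    if len(key_bytes) != 16:
        raise ValueError("MISTY1 uses a 128-bit (16-byte) secret key")
    return [(key_bytes[2 * i] << 8) | key_bytes[2 * i + 1] for i in range(8)]


def extended_keys(sk_words, fi):
    """Return eight 16-bit extended key words (128 bits in total).

    fi(data16, key16) is supplied by the caller. Word i is FI applied
    to SK word i, keyed with SK word i+1 (wrapping around after word 7).
    """
    if len(sk_words) != 8:
        raise ValueError("expected eight 16-bit secret key words")
    return [fi(sk_words[i], sk_words[(i + 1) % 8]) & 0xFFFF for i in range(8)]


def toy_fi(data16, key16):
    """Placeholder only: NOT the MISTY1 FI function."""
    return data16 ^ key16


if __name__ == "__main__":
    words = split_key(bytes(range(16)))
    print([hex(w) for w in extended_keys(words, toy_fi)])
```

Because each extended key word needs exactly one FI evaluation, re-using the FI logic inside the round function for this step avoids a separate key generation module, at the cost of extra clock cycles before encryption starts.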

Area-Efficient MISTY1 Hardware Architectures
The proposed hardware architecture of area-efficient MISTY1 8-rounds algorithm is depicted in Fig. 8.
The input to the MISTY1 algorithm is a 64-bit plain text (PT) and a 128-bit secret key (SK), and the output is a 64-bit cipher text. A 128-bit extended key (EK) is generated prior to the MISTY1 8-round operation by the {FO-XOR-EKG} function and is saved in an external 128-bit extended key register for onward round operations. The SK and EK are later used in conjunction for the MISTY1 round transformation operations. The EK generation by the {FO-XOR-EKG} function readily reduces the circuit area, as it avoids the use of an independent key generation module, i.e. a separate FI function. However, the extended key generation by the {FO-XOR-EKG} function reduces the throughput of the MISTY1 8-round operation, since the EKs have to be generated in advance, requiring multiple clock cycles. The speed, i.e. the throughput value (Mbps), of the proposed architectures can be calculated as:
$$\text{Throughput} = \frac{\text{Output (bits)}}{\text{Clock Cycles (sec)}} \qquad (2)$$
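Equation (2) can be evaluated with a short helper. This is a sketch under the assumption that "Clock Cycles (sec)" means the number of clock cycles multiplied by the clock period; the function names and the example values (128 cycles at 10 MHz) are purely illustrative and are not taken from the paper.

```python
def throughput_mbps(output_bits, clock_cycles, clock_hz):
    """Throughput in Mbps: output bits divided by the processing time.

    The processing time in seconds is clock_cycles / clock_hz.
    """
    if clock_cycles <= 0 or clock_hz <= 0:
        raise ValueError("clock_cycles and clock_hz must be positive")
    seconds = clock_cycles / clock_hz
    return output_bits / seconds / 1e6


def block_time_us(output_bits, mbps):
    """Time in microseconds to produce output_bits at a given throughput."""
    if mbps <= 0:
        raise ValueError("throughput must be positive")
    return output_bits / mbps


if __name__ == "__main__":
    # Hypothetical operating point: 64-bit block, 128 cycles, 10 MHz clock.
    print(throughput_mbps(64, 128, 10e6))
```

The same relation makes the trade-off explicit: every clock cycle added by logic re-utilization or by generating the extended keys in advance lowers the throughput proportionally at a fixed clock frequency.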

Hardware Implementation of Proposed MISTY1 Architectures
Hardware implementation of the proposed MISTY1 architectures is performed on an ASIC platform using a 180 nm, 1.8 V standard cell library with Synopsys Design Compiler, and is optimized for area. A comprehensive analysis was carried out to obtain the moderate-speed MISTY1 architecture 1 by integrating FO -1 with FL -1, whereas FO -2 is configured with FL -2, resulting in the threshold-area MISTY1 architecture 2. The proposed architectures outperform the previous implementations in terms of area. Architecture 1 achieves a throughput of 41.6 Mbps with an area of 1853 gates, giving a compact, moderate-speed MISTY1 implementation, while architecture 2 sets a threshold value for both area and speed, with an area of 1546 gates and a throughput of 4.72 Mbps. Both gate counts are lower than the lowest reported areas of 1947 gates for AES and 2331 gates for MISTY1. To the best of our knowledge, this makes MISTY1 architectures 1 and 2 the most area-efficient block cipher implementations to date. The highly optimized MISTY1 architectures set a benchmark for low-area implementation and can be employed for compact ASIC applications.
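The relative savings implied by these gate counts can be worked out directly. The sketch below uses only the four gate counts quoted in the text; the helper name `area_saving_percent` and the dictionary labels are assumptions, and the resulting percentages are derived here rather than reported in the paper.

```python
def area_saving_percent(baseline_gates, proposed_gates):
    """Relative area reduction of a design against a baseline, in percent."""
    if baseline_gates <= 0:
        raise ValueError("baseline gate count must be positive")
    return 100.0 * (baseline_gates - proposed_gates) / baseline_gates


# Gate counts quoted in the text.
BASELINES = {"AES [15]": 1947, "MISTY1 [5]": 2331}
PROPOSED = {"architecture 1": 1853, "architecture 2": 1546}

if __name__ == "__main__":
    for base_name, base in BASELINES.items():
        for arch_name, gates in PROPOSED.items():
            saving = area_saving_percent(base, gates)
            print(f"{arch_name} vs {base_name}: {saving:.1f}% fewer gates")
```

Against the previous compact MISTY1 design, this corresponds to roughly one fifth fewer gates for architecture 1 and roughly one third fewer for architecture 2.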

Conclusion
This paper proposes MISTY1 architectures for area-constrained embedded applications. The design methodologies adopted for the MISTY1 implementations, namely optimization of the transformation functions and s-boxes, significantly reduce the hardware area. Based on the ASIC results and a comparison with existing MISTY1 and other block cipher architectures such as AES and KASUMI, the proposed MISTY1 implementations have the smallest gate count to the best of our knowledge. The optimization techniques employed for MISTY1 can be adapted to the 64-bit KASUMI block cipher and other encryption algorithms for low-area implementation. The proposed architectures are therefore of high significance for cryptographic circuit design / optimization.
i. FL function implementation for 2 and 4 clock cycles execution by logic re-utilization.
ii. Design of a novel function {FO-XOR-EKG} to encompass the round transformation operations and the extended key generation function.
c. Design / configuration of area-efficient MISTY1 by employing the highly optimized transformation functions FL and {FO-XOR-EKG}.
Table 8 summarizes the performance comparison in terms of area and throughput as follows.