
1 Introduction

The field of lightweight cryptography has gone into overdrive, as is evident from the number of cipher proposals that have emerged in the past few years, such as CLEFIA [32], KATAN [13], KLEIN [18], LED [19], PRESENT [11], Piccolo [31], PRINCE [12] and SIMON/SPECK [6], to name a few. However, the Advanced Encryption Standard (AES) [16] still remains the de facto standard when it comes to practical lightweight encryption. The past few years have seen several low-power/area architectures for AES reported in the literature [17, 27, 30]. However, there has been little work on determining the design choices that lead to the most energy-efficient architecture. Many parameters contribute to the efficiency of a given lightweight design, with area, power, throughput and energy being the foremost among them. Power and energy are correlated parameters: energy is the time integral of power, and power is the energy consumed per unit time, i.e. the rate of energy consumption. Energy consumption is thus a measure of the total work done by the voltage source during the execution of an operation. Hence, in many ways, energy rather than power may be the more relevant parameter for measuring the efficiency of a design. Serial architectures of a block cipher, which reduce the width of the datapath and reuse components, have a smaller power footprint than round-based implementations, in which the width of the datapath equals the block length of the cipher. However, serial implementations usually have high latency, that is, they take much longer to compute the result of an encryption operation than their round-based counterparts, and as a result may end up consuming more energy. Therefore, there is no guarantee that low-power architectures necessarily lead to low-energy architectures, and vice versa.

In [5, 21], several lightweight block ciphers were evaluated with respect to various hardware performance metrics, with a particular focus on energy cost. A formal model for energy consumption in any r-round unrolled block cipher architecture was proposed in [3]. However, these papers do not specifically outline design choices that lead to energy-efficient designs.

1.1 Our Contributions

In this paper, we first try to identify design choices that are energy-efficient, together with the tradeoffs they involve. We shed some light on the design considerations that govern low-energy circuits, look at factors such as clock frequency, architecture and loop unrolling, and lay down some general rules of thumb that help in optimizing for energy. Then, we choose components specifically tailored to meet the requirements of low-energy design. In particular, we develop energy-efficient linear layers and non-linear layers.

We use \(4 \times 4\) almost MDS binary matrices, which are more efficient than \(4 \times 4\) MDS matrices in terms of area and signal delay. Note that the branch numbers (the smallest nonzero sum of active inputs and outputs of the matrix) of MDS and almost MDS matrices are 5 and 4, respectively. Because of this smaller branch number, ciphers employing almost MDS matrices are likely to require more rounds to guarantee security against several attacks. To address this issue, we propose optimal cell-permutation layers, which are aimed at improving diffusion speed and increasing the number of active S-boxes in each round with low implementation overhead. Our optimal cell-permutations drastically improve the minimum number of differentially/linearly active S-boxes in each round, and achieve faster diffusion than a ShiftRow-type permutation. We construct a lightweight, small-delay 4-bit S-box by focusing on the dependency of the computation in S-boxes. The signal delay of our S-boxes is 1.5 times and 2 times shorter than those of the PRINCE and PRESENT S-boxes, respectively. Since the S-box layer is one of the most critical and expensive operations of the cipher, our new S-boxes contribute substantially to low energy consumption.

Combining these new constructions, we design a family of low-energy block ciphers, Midori, which is composed of two variants: Midori64 and Midori128. These provide the functionality for both encryption and decryption with minimal area and energy overhead. The two variants support a 128-bit secret key and a 64-bit and 128-bit block, respectively. Security-wise, Midori64 and Midori128 do not claim related-key, known-key or chosen-key security, as it is not relevant in our target applications. Using the STM 90 nm standard cell library, both ciphers consume less than 1.89 \(pJ\) per encrypted bit, which is far better than ciphers like PRINCE and NOEKEON [16]. These ciphers are particularly useful for applications that run on a tight energy budget, e.g. active RFID tags, sensor nodes, medical implants and battery-operated portable devices.

1.2 Organization of the Paper

In Sect. 2, we look at some design considerations that help to minimize energy consumption in block cipher circuits. In Sect. 3, we outline the algorithmic specifications of the Midori128 and Midori64 ciphers. In Sect. 4, we explain our design decisions vis-a-vis the observations of Sect. 2. In Sect. 5, we outline the security analysis of the ciphers. Section 6 contains implementation results of our cipher in hardware using the standard cell library of the STM 90 nm logic process. Section 7 concludes the paper.

2 Design Considerations for Low Energy

For any given block cipher, three factors are likely to play a dominant role in determining the quantity of energy dissipated in the circuit:

  • (a) Frequency of the Clock used to drive the circuit,

  • (b) Architecture of the individual components,

  • (c) Unrolling round functions in the circuit.

We will try to understand the significance of each of these parameters in the context of energy consumption. Let us start with clock frequency. Two components characterize the amount of energy dissipated in a CMOS circuit:

  • Dynamic dissipation due to the charging and discharging of load capacitances and the short-circuit current,

  • Static dissipation due to leakage current and other current drawn continuously from the power supply.

The total energy dissipation for a CMOS gate can be written as

$$\begin{aligned} E_{gate} = E_{load} + E_{sc} + E_{leakage} \end{aligned}$$

The quantity \(E_{load}\) is the energy dissipated for charging and discharging the capacitive load \(C_L\) of a gate when output transitions occur. The energy dissipated per \(0\rightarrow 1\)/\(1\rightarrow 0\) transition is given as

$$\begin{aligned} E= \int _0^t vi ~dt = \int _0^t v C_L \frac{dv}{dt} dt = C_L \int _0^{V_{DD}} vdv = \frac{1}{2}C_L V_{DD}^2. \end{aligned}$$
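As a rough numerical illustration of the expression above, the following sketch evaluates \(\frac{1}{2}C_L V_{DD}^2\) for a single transition. The load capacitance and supply voltage used here are hypothetical placeholder values chosen only for illustration; they are not figures taken from the STM 90 nm library.

```python
def switching_energy(c_load_farads, vdd_volts):
    """Energy in joules dissipated per output transition: 0.5 * C_L * V_DD^2."""
    return 0.5 * c_load_farads * vdd_volts ** 2

# Hypothetical values, for illustration only: a 2 fF load and a 1.2 V supply.
C_L = 2e-15
V_DD = 1.2

energy_fj = switching_energy(C_L, V_DD) * 1e15  # convert joules to femtojoules
print(f"{energy_fj:.2f} fJ per transition")
```

Because the dependence on \(V_{DD}\) is quadratic, halving the supply voltage in this model reduces the per-transition energy by a factor of four.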

The energy due to the short-circuit current, \(E_{sc}\), is dissipated in a CMOS gate when, during a transition, both the n- and the p-transistors are on for a short period of time. The energy due to leakage currents, \(E_{leakage}\), is rather small and is mainly caused by the sub-threshold leakage current, which is the drain-source current in a CMOS gate when the transistor is OFF. This component is becoming increasingly important as technology scales down and sub-threshold leakage becomes more significant. However, as pointed out in [3, 21], the effect of the leakage energy at high clock frequencies is minimal. As such, energy becomes a measure of the total switching activity of a circuit during the operation. For sufficiently high frequencies, the energy required to compute an encryption/decryption operation is essentially independent of the frequency of operation. In our experiments, for circuits implemented using the standard cell library based on the STM 90 nm low leakage process, at frequencies higher than 1 MHz, leakage energy is usually less than 1 % of the total energy dissipated in the circuit.

To understand the significance of the other parameters we performed the following experiments. Consider a case in which two Rijndael S-boxes are placed one after the other in a circuit as shown in Fig. 1. The signals to the input of the first S-box, the second S-box, and the output of the 2nd S-box are named S1xD, S2xD and S3xD respectively. Note that, analyzing this situation is particularly useful for understanding the energy consumption trends of unrolled designs where logic blocks are placed sequentially one after the other.

Fig. 1. S-boxes placed sequentially

Let us assume that the signal S1xD comes from an 8-bit register, so that it “cleanly” switches between successive byte values, i.e. all the bits of S1xD make logic transitions at the same point of time which is usually the rising clock edge for synchronous circuits. The signal S2xD will switch between various values in a given time interval \(0\rightarrow \tau _d\), before settling down to a stable value. The value \(\tau _d\) which is the delay experienced by the signal S1xD usually depends on the cell library and the architecture adopted to implement the S-boxes. Another parameter dependent on the logic process and architecture of the S-box is the switching activity of S2xD which can be informally defined as the number of logic transitions made by this signal in the period \(0\rightarrow \tau _d\).

Fig. 2. The signals S1xD, S2xD, S3xD

The second S-box \(S_2\), sees this signal S2xD, which is switching between various values in the time interval \(0\rightarrow \tau _d\). Therefore, the switching activity of \(S_2\) is actually at least double that of \(S_1\), as it would continue switching for another \(\tau _d\) before producing a stable signal. Figure 2 provides an example in which, the three signals for the pair of Rijndael S-boxes (implemented using the Canright [14] architecture in the standard cell library of the STM 90 nm logic process, at 10 MHz) are shown. The synthesis for each S-box was done separately, so that the synthesis tool would not group together gates from the first and the second S-box in order to save area. Since the energy consumption of a logic block depends on the switching activity of all its nodes, the S-box \(S_2\) should naturally consume more energy than \(S_1\). Again the exact energy consumed by \(S_2\) relative to \(S_1\) depends on factors like

  • (a) the logic process and hence the value of \(\tau _d\),

  • (b) the architecture of the S-box and hence the amount of “extra” switching experienced by \(S_2\) and

  • (c) the algebraic structure of the S-box, i.e. its component Boolean functions.

The extra switching activity would be proportional to the average number of gates that undergo a \(0\rightarrow 1\)/\(1 \rightarrow 0\) transition during the period \(\tau _d \rightarrow 2 \tau _d\) (the average is typically taken over all possible transitions of the signal S1xD). Similarly, if a third S-box \(S_3\) were placed after \(S_2\), it too would experience an increase in switching activity relative to \(S_2\), which would depend on the average number of gates switched in the period \(2 \tau _d \rightarrow 3 \tau _d\). The increase in switching activity of \(S_3\) over \(S_2\) is likely to be roughly the same as that of \(S_2\) over \(S_1\), since the number of gates in \(S_2\) that switch in \(\tau _d \rightarrow 2 \tau _d\) and the number in \(S_3\) that switch in \(2 \tau _d \rightarrow 3 \tau _d \), when averaged over the \(\left( {\begin{array}{c}256\\ 2\end{array}}\right) \) transitions of S1xD, are likely to be the same. So if \(S_1, S_2\) and \(S_3\) drive the same amount of capacitive load, the difference in energy consumed between \(S_2\) and \(S_1\) is likely to be the same as that between \(S_3\) and \(S_2\).

Fig. 3. Energy per cycle \(E_i\) in \(i^{th}\) S-box \(S_i\)

Fig. 4. Energy \(\varOmega _n\) required to compute \(S^{10}(x)\) using n S-boxes

Taking these ideas forward, if we connect a series of n S-boxes sequentially, the energy consumed by each S-box in a given period of time is likely to be more than that of the previous S-box, as the switching activity of the S-boxes is likely to increase from the first to the last. We tested three different architectures for the Rijndael S-box. The first is the Canright [14] architecture, which is acknowledged to be the smallest known implementation in terms of gate area. The second is the Look-up Table (LUT) based architecture as synthesized by the Synopsys Design Compiler. The LUT architecture, while larger than the Canright architecture in terms of area, is much faster in terms of signal delay from the input to the output port. The third is a Decoder-Switch-Encoder (DSE) based architecture [7], which is optimal in terms of power/energy consumption. Over the years there has been much research on low-power Rijndael S-boxes [28, 34], but the DSE-based architecture is widely believed to be the most power/energy-efficient on account of its unique structure. The 8-bit input is first decoded to a set of 256 wires. The S-box functionality is achieved by shuffling these wires, after which the output is produced by encoding the 256 shuffled wires (i.e. the inverse of the decoding process). The entire circuit can be constructed from AND/NAND gates, which have very low switching probability, and since the S-box functionality is provided by wire shuffling, all 8-bit S-boxes can be constructed in this manner. The architecture offers very low switching per change of input bit: a maximum of 25 % of the gates switch when one of the input bits is flipped.

We connected 10 instances of the S-box constructed using the Canright architecture (using the standard cell library of the STM 90 nm logic process) sequentially and used the Synopsys Power Compiler to estimate the energy consumed per clock cycle \(E_i\) in each of the successive S-boxes \(S_i\) at a clock frequency of 10 MHz. We repeated the same experiment for the LUT- and DSE-based S-boxes. The results can be seen in Fig. 3. Successive instances of the LUT-based S-box, which has a delay of around 2.1 \(ns\), consume much less energy than those of the Canright S-box, which has a delay of around 2.9 \(ns\). In both the LUT and Canright architectures, the switching activity in the circuit is roughly proportional to the signal delay between the input and output ports. This is, however, not the case for the DSE S-box, which, although it has a delay of around 2.3 \(ns\), experiences a much lower increase in successive values of \(E_i\) because the total switching activity in the delay period is much lower.

The above analysis is particularly relevant for two reasons. The first pertains to the structure of SPN-based ciphers in particular, in which each round typically consists of a substitution layer, a linear layer and a key addition placed sequentially. A substitution layer with low switching activity and signal delay ensures that the linear layer consumes less energy. Similarly, a linear layer with similar characteristics ensures that any circuit placed after it consumes less energy. The second pertains to round-unrolled circuits. An r-round unrolled circuit for a block cipher is one that computes the results of r successive round functions in a single clock cycle. So if the block cipher specification calls for N executions of the round function, an r-round unrolled circuit will compute the result of the encryption operation in \(\left\lceil {\frac{N}{r}}\right\rceil \) cycles. An r-round unrolled architecture is constructed by placing the circuits for r round functions sequentially, followed by a register. The above analysis suggests that any multiple-round unrolled circuit is unlikely to be efficient in terms of energy consumption. In the above example, using the LUT-based S-box, computing the result of two S-box operations (i.e. S(S(x))) over 2 cycles costs \(2 \times 1.88 = 3.76\) \(pJ\). Computing the same over one cycle by sequential placement of 2 S-boxes costs \(1.88+3.91=5.79\) \(pJ\). Similarly, computing three S-box operations over three cycles takes 5.64 \(pJ\), whereas the same over one cycle takes \(1.88+3.91+6.40=12.19\) \(pJ\). Figure 4 shows the cumulative energy cost \(\varOmega _n\) of computing \(S^{10}(x)\) using a sequence of n S-boxes (i.e. in \(\frac{10}{n}\) cycles), for different values of n. It can be seen that, irrespective of the architecture of the S-box, the energy consumption is optimal for \(n=1\), i.e. computing the operation over 10 cycles using a single S-box, even if this involves updating the register 10 times in the process.
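The comparison between iterated and unrolled evaluation can be reproduced with a few lines of arithmetic. The sketch below uses only the three per-stage energy figures quoted above for the LUT-based S-box; the function names are illustrative, and the energy spent updating the register is deliberately left out, as it is in the figures quoted in the text.

```python
# Energy per cycle (pJ) of the 1st, 2nd and 3rd LUT-based S-box in a chain,
# as quoted in the text. Values for later stages are not given there.
E_LUT = [1.88, 3.91, 6.40]

def energy_iterated(k, e_first=E_LUT[0]):
    """k S-box evaluations over k cycles, reusing a single S-box."""
    return k * e_first

def energy_unrolled(k, per_stage=E_LUT):
    """k S-box evaluations in one cycle, using k S-boxes placed in sequence."""
    if k > len(per_stage):
        raise ValueError("no energy figure available for that many stages")
    return sum(per_stage[:k])

for k in (1, 2, 3):
    print(k, round(energy_iterated(k), 2), round(energy_unrolled(k), 2))
```

For every k greater than 1 covered by the quoted data, the unrolled cost exceeds the iterated cost, and the gap widens with k because each additional stage is more expensive than the one before it.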

2.1 S-Box: 4-Bit Vs 8-Bit

In light of the above analysis, it is clear that a design using a 4-bit S-box is more efficient in terms of energy consumed per cycle than a design using an 8-bit S-box. This is primarily because a 4-bit S-box will typically have a lower signal delay than an 8-bit S-box. However, 8-bit S-boxes offer higher non-linearity and lower values of the DP/LP coefficient, and so in order to sustain similar security margins, a design using a 4-bit S-box will typically need more executions of the round function. To put things in perspective, we performed an energy evaluation of the circuit of an SPN round function (with block size equal to 128 bits), in which we experimented with two different substitution layers, one having sixteen 8-bit S-boxes and the other having thirty-two 4-bit S-boxes. The Rijndael MixColumn was used in both cases, and the STM 90 nm cell library was used to synthesize the circuits. For this purpose, four different 8-bit S-boxes were chosen. Apart from the LUT- and DSE-based Rijndael S-boxes, we chose the S-boxes used in mCrypton [24] and Whirlpool [4]. Unlike the AES S-box, these S-boxes can be functionally defined in terms of smaller 4-bit S-boxes, and so can be implemented efficiently in hardware. Additionally, we chose three 4-bit S-boxes: the generic DSE-based S-box (note that since the S-box functionality is provided by a wire shuffle, all DSE S-boxes have the same energy consumption), and the S-boxes used in PRINCE [12] and PRESENT [11].

Table 1. A comparison of energy per cycle for round functions constructed with (A) 16 8-bit S-boxes, (B) 32 4-bit S-boxes.

Table 1 reports the energy per cycle figures at a frequency of 10 MHz. It can be seen that the DSE architecture is not as effective an energy-saving measure for 4-bit S-boxes. It is also interesting to note that, from the point of view of energy, 4-bit S-boxes outperform their 8-bit counterparts by a ratio of around 2:1. Thus, the use of 4-bit S-boxes seems to be an efficient configuration even if the number of rounds in the encryption algorithm has to be increased in order to maintain security margins.

2.2 Feistel Vs SPN and Complex Vs Simple Round Function

As far as designing lightweight ciphers is concerned, both SPN and Feistel architectures have their respective advantages and disadvantages. Feistel structures (e.g. TWINE [33], Piccolo [31], SIMON [6]) usually apply a round function to only one half of the state, and as such can be implemented in hardware with low average power. Also, implementing the inverse of a Feistel construction is not very difficult, and hence a circuit that provides both encryption and decryption can be designed with minimal overhead. However, Feistel structures introduce non-linearity in only one half of the state in every round, and hence, to maintain security margins, such constructions usually require more executions of the round function than SPN structures. As such, Feistel constructions are not suited for low-latency implementations. Most SPN constructions, on the other hand, apply their transformation function to the entire state and so can be implemented using fewer rounds. In principle, if n rounds of an SPN function and m rounds of a Feistel function (where \(m>n\)) have the same security margin and similar energy expenditure, then using the n-round SPN function makes more sense, since less energy is consumed in updating the state and key registers over n rounds. A similar argument can be used to resolve the choice between (a) simple round functions with more rounds (e.g. PRESENT [11]) and (b) complex round functions with fewer rounds.

2.3 Effect of Key Schedule

Generating separate round keys in each round by means of a key schedule operation can eat into the energy budget, as it incurs the added cost of updating the key register in every round. For example, using the STM 90 nm standard cell library, in AES (with the DSE S-box) the key schedule accounts for 25 % of the total energy consumed. For PRESENT, the key schedule accounts for close to 32 % of the total energy. So for designs meant primarily for low energy consumption, designers should look to avoid the key schedule operation. This would also be efficient in terms of area, as it would not be necessary to include a key register in the design.

2.4 Main Conclusion: Low-Energy Design Choices

We can now state some conclusions that will serve as pointers for a good low-energy block cipher design. From the point of view of energy, we know that a round-based architecture is usually optimal. Thus we concentrate on an efficient round-based construction that provides both encryption and decryption with minimal overhead. A cipher like PRINCE, although it provides both encryption and decryption with a minimal tweak in the circuit, does not have an equally energy-efficient round-based construction [12], as it needs to accommodate 3 different round functions in the same circuit. We have also seen that components with low switching activity and delay tend to perform better energy-wise. So another requirement is choosing components with low area and delay. In this context, it makes sense to choose 4-bit S-boxes over 8-bit S-boxes. We choose an SPN architecture over Feistel to minimize the number of rounds in the design. And since providing both encryption and decryption is an added motivation, we try to include components which, in addition to having low area/delay, are also involutions. Having such components minimizes any additional overhead required for providing both encryption and decryption. We will now present the specifications for the proposed block cipher, and in Sect. 4 we will explain the design decisions in the context of the observations made in this section.

3 Specification

Midori is a family of two block ciphers: Midori64 and Midori128. Both ciphers accept 128-bit keys, and have a different block size n (\(n = 64\) for Midori64 and \(n = 128\) for Midori128). The basic parameters of Midori64 and Midori128 are shown in Table 2.

Table 2. Parameters for Midori64 and Midori128

Midori is a variant of a Substitution Permutation Network (SPN), which consists of the S-layer and the P-layer, and uses the following \(4 \times 4\) array called state as a data expression:

$$\begin{aligned} S = \left[ \begin{array}{cccc} s_0 &{} s_4 &{} s_8 &{} s_{12}\\ s_1 &{} s_5 &{} s_9 &{} s_{13}\\ s_2 &{} s_6 &{} s_{10} &{} s_{14}\\ s_{3} &{} s_{7} &{} s_{11} &{} s_{15} \end{array} \right] \!, \end{aligned}$$

where the sizes of each cell m are 4 and 8 bits for Midori64 and Midori128, respectively, i.e., \(s_i \in \{0,1\}^{m}\), \(m = 4\) for Midori64 and \(m = 8\) for Midori128. A 64-bit or a 128-bit plaintext P is loaded into the state, and the i-th round output state is defined as \(S_{i}\), namely \(S_{0} = P\).

3.1 S-Boxes and Matrices

S-box: Midori utilizes two types of bijective 4-bit S-boxes, \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\), where \(\mathsf{Sb}_{0}\), \(\mathsf{Sb}_{1}: \{0,1\}^{4} \rightarrow \{0,1\}^{4}\) (see Table 3). \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\) are used in Midori64 and Midori128, respectively. Note that \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\) both have the involution property.

Table 3. 4-bit bijective S-boxes \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\) in hexadecimal form

Midori128 utilizes four different 8-bit S-boxes \(\mathsf{SSb}_{0}\), \(\mathsf{SSb}_{1}\), \(\mathsf{SSb}_{2}\) and \(\mathsf{SSb}_{3}\), where \(\mathsf{SSb}_{0}\), \(\mathsf{SSb}_{1}\), \(\mathsf{SSb}_{2}\), \(\mathsf{SSb}_{3}: \{0,1\}^{8} \rightarrow \{0,1\}^{8}\). Mathematically, each \(\mathsf{SSb}_{i}\) consists of input and output bit permutations and two \(\mathsf{Sb}_{1}\)’s as shown in Fig. 5. Each output bit permutation is taken as the inverse of the corresponding input bit permutation to keep the involution property. Let the input bit permutation of each \(\mathsf{SSb}_{i}\) be referred to as \(\mathsf{p}_{i}\). Let \(x_{[i]}\) denote the i-th bit of x, where \(x_{[0]}\) is the most significant bit (MSB). Then, denoting \(\mathsf{p}_i(x)=y^{(i)}\), we have

$$\begin{aligned} y^{(0)}_{[0,1,2,3,4,5,6,7]} =x_{[4,1,6,3,0,5,2,7]},~ y^{(1)}_{[0,1,2,3,4,5,6,7]} =x_{[1,6,7,0,5,2,3,4]} \end{aligned}$$
$$\begin{aligned} y^{(2)}_{[0,1,2,3,4,5,6,7]} =x_{[2,3,4,1,6,7,0,5]},~ y^{(3)}_{[0,1,2,3,4,5,6,7]} =x_{[7,4,1,2,3,0,5,6]} \end{aligned}$$

The output permutation used in each \(\mathsf{SSb}_{i}\) is simply the inverse of the map \(\mathsf{p}_i\).
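The construction of the 8-bit S-boxes can be sketched as follows. The bit permutations are copied from the equations above. The contents of \(\mathsf{Sb}_{1}\) are an assumption, since Table 3 is not reproduced in this text, and the helper names (`permute_bits`, `invert`, `make_ssb`) are purely illustrative. The involution check at the end holds for any involutive 4-bit S-box, so it does not depend on the exact table values.

```python
# Assumed contents of Table 3 for Sb1 (the table is not reproduced in this text).
SB1 = [0x1, 0x0, 0x5, 0x3, 0xE, 0x2, 0xF, 0x7,
       0xD, 0xA, 0x9, 0xB, 0xC, 0x8, 0x4, 0x6]

# Input bit permutations p_0..p_3: output bit j (0 = MSB) takes input bit P[i][j].
P = [
    [4, 1, 6, 3, 0, 5, 2, 7],
    [1, 6, 7, 0, 5, 2, 3, 4],
    [2, 3, 4, 1, 6, 7, 0, 5],
    [7, 4, 1, 2, 3, 0, 5, 6],
]

def permute_bits(x, perm):
    """Return y with y[j] = x[perm[j]], where index 0 is the MSB of a byte."""
    bits = [(x >> (7 - j)) & 1 for j in range(8)]
    y = 0
    for j in range(8):
        y = (y << 1) | bits[perm[j]]
    return y

def invert(perm):
    """Inverse of a bit permutation given in the same index convention."""
    inv = [0] * 8
    for j, src in enumerate(perm):
        inv[src] = j
    return inv

def make_ssb(i, sb=SB1):
    """Build SSb_i as a 256-entry table: p_i, two parallel Sb1, then p_i^{-1}."""
    p, p_inv = P[i], invert(P[i])
    table = []
    for x in range(256):
        y = permute_bits(x, p)
        z = (sb[y >> 4] << 4) | sb[y & 0xF]
        table.append(permute_bits(z, p_inv))
    return table

SSB = [make_ssb(i) for i in range(4)]
# Each SSb_i should be an involution because Sb1 is one and the output
# permutation undoes the input permutation.
print(all(SSB[i][SSB[i][x]] == x for i in range(4) for x in range(256)))
```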

Matrix: Midori utilizes an involutive binary matrix \({\varvec{M}}\) defined as follows:

$$\begin{aligned} {\varvec{M}} = \left( \begin{array}{cccc} 0 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 1 \\ 1 &{} 1 &{} 0 &{} 1 \\ 1 &{} 1 &{} 1 &{} 0 \\ \end{array} \right) . \end{aligned}$$

The matrix \({\varvec{M}}\) updates four m-bit values \((x_{0}, x_{1}, x_{2}, x_{3})\) as follows:

$$\begin{aligned} {}^{t}(x_{0}, x_{1}, x_{2}, x_{3}) \leftarrow {\varvec{M}} \cdot {}^{t}(x_{0}, x_{1}, x_{2}, x_{3}), \end{aligned}$$

where the operations between a matrix and a vector are performed over GF(\(2^{m}\)).
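Since every entry of \({\varvec{M}}\) is 0 or 1, the multiplication reduces to XORs of cells: each output cell is the XOR of the other three input cells. The sketch below (with illustrative function names) implements this, and checks by exhaustive search over 4-bit cells that the matrix is an involution and has branch number 4, as stated in the introduction for almost MDS matrices.

```python
from itertools import product

def mix_column(col):
    """Apply M to a column of four cells: y_j is the XOR of all x_k with k != j."""
    t = col[0] ^ col[1] ^ col[2] ^ col[3]
    return [t ^ x for x in col]

def branch_number(cell_bits=4):
    """Minimum number of nonzero input cells plus nonzero output cells."""
    best = None
    for col in product(range(1 << cell_bits), repeat=4):
        if not any(col):
            continue  # skip the all-zero input
        weight = sum(1 for x in col if x) + sum(1 for y in mix_column(col) if y)
        best = weight if best is None else min(best, weight)
    return best

print(mix_column([0x1, 0x2, 0x4, 0x8]))
print(branch_number())
```

Computing the XOR of all four cells once and then removing each cell from it is one possible way to share work between the four outputs; it is an implementation choice of this sketch, not something prescribed by the specification.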

Fig. 5. \(\mathsf{SSb}_{0}\), \(\mathsf{SSb}_{1}\), \(\mathsf{SSb}_{2}\) and \(\mathsf{SSb}_{3}\)

3.2 Round Function

The round function of Midori consists of an S-layer SubCell \(: \{0,1\}^{n} \rightarrow \{0,1\}^{n}\), a P-layer ShuffleCell and MixColumn \(: \{0,1\}^{n} \rightarrow \{0,1\}^{n}\) and a key-addition layer KeyAdd \(: \{0,1\}^{n} \times \{0,1\}^{n} \rightarrow \{0,1\}^{n}\). Each layer updates an n-bit state S as follows.

  • SubCell (S): \(\mathsf{Sb}_{0}\) and \(\mathsf{SSb}_{i}\) are applied to every 4 and 8-bit cell of the state S of Midori64 and Midori128 in parallel, respectively. Namely, \(s_i\) \(\leftarrow \) \(\mathsf{Sb}_{0}[s_i]\) for Midori64 and \(s_i\) \(\leftarrow \) \(\mathsf{SSb}_{(i \mod 4)}[s_i]\) for Midori128, where \(0 \le i \le 15\).

  • ShuffleCell (S): Each cell of the state is permuted as follows: \((s_{0}, s_{1}, ..., s_{15}) \) \(\leftarrow \) \((s_{0}, s_{10}, s_{5}, s_{15}, s_{14}, s_{4}, s_{11}, s_{1}, s_{9}, s_{3}, s_{12}, s_{6}, s_{7}, s_{13}, s_{2}, s_{8})\).

  • MixColumn (S): \({\varvec{M}}\) is applied to every 4m-bit column of the state S, i.e., \({}^{t}(s_{i}, s_{i+1}, s_{i+2}, s_{i+3}) \leftarrow {\varvec{M}}{}^{t}(s_{i}, s_{i+1}, s_{i+2}, s_{i+3})\) and \(i = 0, 4, 8, 12\).

  • KeyAdd (S, \(RK_{i}\)): The i-th n-bit round key \(RK_i\) is XORed to a state S.

3.3 Data Processing Part

The data processing part of Midori for encryption \(\mathsf{MidoriCore}_{(R)}\) performs as follows:

$$\begin{aligned} \mathsf{MidoriCore}_{(R)}: \left\{ \begin{array}{l} \{0,1\}^{16m} \times \{0,1\}^{16m} \times \{\{0,1\}^{16m}\}^{R-1} \rightarrow \{0,1\}^{16m} \\ (X, WK, RK_{0}, ..., RK_{R-2}) \mapsto Y \end{array} \right. \end{aligned}$$

where \(R = 16\) for Midori64 and \(R = 20\) for Midori128. Similarly, the inverse data processing part \(\mathsf{MidoriCore}^{-1}_{(R)}\) operates as follows:

$$\begin{aligned} \mathsf{MidoriCore}^{-1}_{(R)}: \left\{ \begin{array}{l} \{0,1\}^{16m} \times \{0,1\}^{16m} \times \{\{0,1\}^{16m}\}^{R-1} \rightarrow \{0,1\}^{16m} \\ (Y, WK, RK_{R-2}, ..., RK_{0}) \mapsto X \end{array} \right. \end{aligned}$$

where \(L^{-1}\) (inverse of the linear layer) denotes the composition of the operations InvShuffleCell \(~\circ ~\) MixColumn, and InvShuffleCell permutes each cell of the state as follows.

$$\begin{aligned} (s_{0}, s_{1}, ..., s_{15}) \leftarrow (s_{0}, s_{7}, s_{14}, s_{9}, s_{5}, s_{2}, s_{11}, s_{12}, s_{15}, s_{8}, s_{1}, s_{6}, s_{10}, s_{13}, s_{4}, s_{3}). \end{aligned}$$
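The following is a minimal sketch of the data processing part for Midori64, written under several stated assumptions. It assumes that encryption applies whitening with WK, then R-1 rounds in the layer order of Sect. 3.2 (SubCell, ShuffleCell, MixColumn, KeyAdd), then a final SubCell and whitening with WK. The contents of \(\mathsf{Sb}_{0}\) are also an assumption, since Table 3 is not reproduced in this text. Decryption here simply undoes the encryption steps in reverse order; it is not claimed to match the exact form of the inverse listing, which applies \(L^{-1}\) to the round keys. Round keys are random placeholders rather than the output of the key generation of Sect. 3.4.

```python
import random

# Assumed contents of Table 3 for Sb0 (the table is not reproduced in this text).
SB0 = [0xC, 0xA, 0xD, 0x3, 0xE, 0xB, 0xF, 0x7,
       0x8, 0x9, 0x1, 0x5, 0x0, 0x2, 0x4, 0x6]
SHUFFLE = [0, 10, 5, 15, 14, 4, 11, 1, 9, 3, 12, 6, 7, 13, 2, 8]
INV_SHUFFLE = [0, 7, 14, 9, 5, 2, 11, 12, 15, 8, 1, 6, 10, 13, 4, 3]

def sub_cell(s):
    return [SB0[c] for c in s]

def permute_cells(s, p):
    """New cell i takes the value of old cell p[i]."""
    return [s[p[i]] for i in range(16)]

def mix_column(s):
    out = []
    for c in range(0, 16, 4):
        col = s[c:c + 4]
        t = col[0] ^ col[1] ^ col[2] ^ col[3]
        out.extend(t ^ x for x in col)
    return out

def key_add(s, k):
    return [a ^ b for a, b in zip(s, k)]

def core_encrypt(x, wk, rks):
    s = key_add(x, wk)
    for rk in rks:  # R-1 rounds
        s = key_add(mix_column(permute_cells(sub_cell(s), SHUFFLE)), rk)
    return key_add(sub_cell(s), wk)

def core_decrypt(y, wk, rks):
    s = sub_cell(key_add(y, wk))  # Sb0 is its own inverse
    for rk in reversed(rks):
        s = sub_cell(permute_cells(mix_column(key_add(s, rk)), INV_SHUFFLE))
    return key_add(s, wk)

rng = random.Random(1)
rand_state = lambda: [rng.randrange(16) for _ in range(16)]
x, wk = rand_state(), rand_state()
rks = [rand_state() for _ in range(15)]  # placeholder round keys, R = 16
print(core_decrypt(core_encrypt(x, wk, rks), wk, rks) == x)
```

Because SubCell and MixColumn are involutions, the only component that needs a distinct inverse in this sketch is the cell permutation.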

3.4 Round Key Generation

For Midori64, a 128-bit secret key K is denoted as two 64-bit keys \(K_{0}\) and \(K_{1}\) as \(K = K_{0} || K_{1}\). Then, \(WK = K_{0} \oplus K_{1}\) and \(RK_{i} = K_{(i \mod 2)} \oplus \alpha _{i}\), where \(0 \le i \le 14\). For Midori128, \(WK = K\) and \(RK_{i} = K \oplus \beta _{i}\), where \(0 \le i \le 18\). The constants \(\beta _{i}\) are defined in Table 4. It can be seen that the constants are in the form of \(4 \times 4\) binary matrices. They are added bitwise to the LSB of every round key byte in Midori128 and round key nibble in Midori64 respectively. Note that \(\alpha _i=\beta _i\) for \(0 \le i\le 14\).
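The round key derivation can be sketched as below. Keys are represented as lists of 16 cells, and each round constant as a list of 16 bits, one per cell, XORed into the least significant bit of that cell. The order in which the \(4 \times 4\) constant matrix is flattened onto the cells is an assumption of this sketch, and the constants passed in the example are hypothetical placeholders, not the values of Table 4.

```python
def round_keys_midori64(k0, k1, alphas):
    """k0, k1: 16 nibbles each; alphas: 15 constants, each a list of 16 bits."""
    wk = [a ^ b for a, b in zip(k0, k1)]
    halves = (k0, k1)
    rks = [[cell ^ bit for cell, bit in zip(halves[i % 2], alphas[i])]
           for i in range(15)]
    return wk, rks

def round_keys_midori128(k, betas):
    """k: 16 bytes; betas: 19 constants, each a list of 16 bits."""
    rks = [[cell ^ bit for cell, bit in zip(k, betas[i])] for i in range(19)]
    return list(k), rks

# Hypothetical placeholder constants (NOT the values of Table 4).
placeholder = [[(i >> (j % 4)) & 1 for j in range(16)] for i in range(19)]

k0 = [0x0] * 16
k1 = [0xF] * 16
wk, rks = round_keys_midori64(k0, k1, placeholder[:15])
print(wk)
print(rks[1])
```

Since the constants only touch the least significant bit of each cell, every round key differs from the corresponding key half in at most 16 bit positions.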

Table 4. The Round Constants \(\beta _i\)

3.5 Midori Ciphers

Midori block ciphers are composed of two variants: Midori64 and Midori128, consisting of \(\mathsf{MidoriCore}_{(16)}\) with \(m=4\) and \(\mathsf{MidoriCore}_{(20)}\) with \(m=8\), respectively. \(\mathsf{MidoriCore}_{(16)}\) is depicted in Fig. 6 as an example.

Fig. 6. Overview of Midori64

4 Design Decision

Here, we explain our design decisions vis-a-vis the observations of Sect. 2.

4.1 Linear Layer

The linear layer of each variant consists of a cell-permutation (ShuffleCell) and four \(4 \times 4\) matrix operations (MixColumn). These operations are performed over \(GF(2^{4})\) and \(GF(2^{8})\) for the 64-bit and 128-bit variants, respectively.

MDS Vs Almost MDS. Using the NanGate 45 nm open cell library, Table 5 compares three types of \(4 \times 4\) matrices, involutive MDS (\({\varvec{M}}_{A}\)), non-involutive MDS (\({\varvec{M}}_{B}\)) and involutive almost MDS matrices (\({\varvec{M}}_{C}\)) from implementation aspects. These matrices are considered lightweight in each of the three aforementioned criteria [26, 31].

$$\begin{aligned} {\varvec{M}}_{A} = \left( \begin{array}{cccc} 1 &{} 2 &{} 6 &{} 4 \\ 2 &{} 1 &{} 4 &{} 6 \\ 6 &{} 4 &{} 1 &{} 2 \\ 4 &{} 6 &{} 2 &{} 1 \\ \end{array} \right) \!, {\varvec{M}}_{B} = \left( \begin{array}{cccc} 2 &{} 3 &{} 1 &{} 1 \\ 1 &{} 2 &{} 3 &{} 1 \\ 1 &{} 1 &{} 2 &{} 3 \\ 3 &{} 1 &{} 1 &{} 2 \\ \end{array} \right) \!, {\varvec{M}}_{C} = \left( \begin{array}{cccc} 0 &{} 1 &{} 1 &{} 1 \\ 1 &{} 0 &{} 1 &{} 1 \\ 1 &{} 1 &{} 0 &{} 1 \\ 1 &{} 1 &{} 1 &{} 0 \\ \end{array} \right) \!. \end{aligned}$$

From Table 5, \({\varvec{M}}_{C}\) is obviously preferable over the others in terms of the gate size and the path delay. In fact, circulant-type almost MDS matrices are adopted in PRINCE [12], PRIDE [1], FIDES [8] and CLOC [20]. Moreover, Khoo et al. showed that, for a 64-bit block size employing the AES-like structure, the combination of 4 \(\times \) 4 almost MDS matrices (\({\varvec{M}}_{C}\)) with ShiftRow and 16 4-bit S-boxes is the most efficient in both a round-based and a serialized implementation by proposing a new comparison metric FOAM (figure of adversarial merit), which combines the inherent security provided by cryptographic structures and components along with their implementation properties [22].

While \({\varvec{M}}_{C}\) has efficient implementation properties, its diffusion speed is slower and the minimum number of active S-boxes in each round is smaller than for ciphers employing MDS matrices, due to its lower branch number. These properties are known to be directly related to the immunity against several attacks, including impossible differential, saturation, differential and linear attacks. To improve the security of designs based on the almost MDS matrix with low implementation overhead, we adopt optimal cell-permutation layers, which are aimed at improving diffusion speed and increasing the number of active S-boxes in each round. The diffusion speed is measured by the number of rounds taken to attain full diffusion, which is the property that all output cells are affected by all input cells. Importantly, changing the cell-permutation pattern generally does not incur additional implementation costs in a round-based or an unrolled hardware implementation.

Table 5. Comparison of three matrices
Table 6. Comparison of S-boxes

Approach to Find Optimal Cell-Permutation Layers for Almost MDS. Since it is computationally hard to exhaustively count the minimum number of active S-boxes for all possible permutations (\(=16!\approx 2^{44.25}\)) by Matsui’s search approach [9, 25], we take the following two-step approach to reduce the search space. In the first step, we restrict the cell-permutations to row-based cell-permutations, which permute the four cells in each row, e.g. ShiftRow in AES. The number of possible row-based cell-permutations is estimated as \(2^{18.3}\) \((= (4!)^4)\). This step is based on the fact that the full diffusion property relies only on the row-based part of the cell-permutation. As a result of our searches, we find that a class of row-based cell-permutations achieves full diffusion in 3 rounds; the necessary and sufficient condition is as follows.

Condition 1

(3-round full diffusion). For a 4 \(\times \) 4 cell-array, after applying the cell-permutation once and twice, the four cells of each input column are mapped to cells in mutually different columns.

From our search, 576 row-based cell-permutations satisfy Condition 1. Interestingly, the ShiftRow-type permutation is not included in this class, i.e. it requires 4 rounds for full diffusion.

In the second step, we add a column-based cell-permutation, which permutes the four cells in each column, after applying a permutation from the class satisfying Condition 1. The target cell-permutation thus consists of the combination of a row-based and a column-based permutation. Note that adding a column-based cell-permutation to the row-based permutations satisfying Condition 1 does not affect the full diffusion property. The number of all possible cell-permutations of this class is estimated as \(2^{27.51}\) \(( = 576 \times (4!)^4)\). Consequently, we find a class of cell-permutations achieving the largest number of active S-boxes in each round and the smallest number of rounds to attain full diffusion; these satisfy Condition 1 and the following Condition 2 or 3.

Condition 2

(Number of active S-boxes). For a 4 \(\times \) 4 cell-array, after applying the cell-permutation twice and twice inversely, the cells of each input column are mapped to cells in the same row.

Condition 3

(Number of active S-boxes). For a 4 \(\times \) 4 cell-array, after applying the cell-permutation once and three times inversely, the cells of each input column are mapped to cells in the same row.
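The search-space sizes quoted for the two-step approach follow from elementary counting. A short Python check is given here as a sanity aid only; the variable names are illustrative, and the factor 576 is taken from the text rather than recomputed.

```python
import math

all_perms = math.factorial(16)            # all cell-permutations of 16 cells
row_based = math.factorial(4) ** 4        # step 1: permute the 4 cells of each row
step_two = 576 * math.factorial(4) ** 4   # step 2: 576 row-based x column-based

for label, count in [("16!", all_perms),
                     ("(4!)^4", row_based),
                     ("576 * (4!)^4", step_two)]:
    print(f"{label} = {count} ~ 2^{math.log2(count):.2f}")
```

The printed exponents agree with the figures \(2^{44.25}\), \(2^{18.3}\) and \(2^{27.51}\) used in the text.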

The numbers of cell-permutations satisfying Conditions 2 and 3 are both 576. We define these 1152 cell-permutations as optimal cell-permutations. Table 7 shows the minimum numbers of differentially/linearly active S-boxes for the optimal cell-permutations and for the ShiftRow-type permutation. Our optimal cell-permutations drastically improve the minimum number of differentially/linearly active S-boxes in each round while keeping the 3-round full diffusion property. Thus, our optimal permutations achieve better security against several attacks, such as differential/linear and impossible differential attacks, than the ShiftRow-type permutation in the same number of rounds. Midori128 and Midori64 adopt one of the optimal cell-permutations satisfying both Conditions 1 and 2, as follows.

$$\begin{aligned} (s_0, s_{1}, ..., s_{15}) \leftarrow (s_{0}, s_{10}, s_{5}, s_{15}, s_{14}, s_{4}, s_{11}, s_{1}, s_{9}, s_{3}, s_{12}, s_{6}, s_{7}, s_{13}, s_{2}, s_{8}). \end{aligned}$$

Starting from the state \(S_{0}\), each cell of \(S_{0}\) is mapped to \(S_{1}\), \(S_{2}\), \(S_{1}^{-1}\) and \(S_{2}^{-1}\) after applying the above cell-permutation once, twice, once inversely and twice inversely, respectively, as follows.

$$\begin{aligned} S_0 = \left[ \begin{array}{cccc} s_0 &{} s_4 &{} s_8 &{} s_{12}\\ s_1 &{} s_5 &{} s_9 &{} s_{13}\\ s_2 &{} s_6 &{} s_{10} &{} s_{14}\\ s_{3} &{} s_{7} &{} s_{11} &{} s_{15} \end{array} \right] \!, S_1 = \left[ \begin{array}{cccc} s_{0} &{} s_{14} &{} s_{9} &{} s_{7}\\ s_{10} &{} s_{4} &{} s_{3} &{} s_{13}\\ s_{5} &{} s_{11} &{} s_{12} &{} s_{2}\\ s_{15} &{} s_{1} &{} s_{6} &{} s_{8} \end{array} \right] \!, S_2 = \left[ \begin{array}{cccc} s_0 &{} s_{2} &{} s_{3} &{} s_{1}\\ s_{12} &{} s_{14} &{} s_{15} &{} s_{13}\\ s_{4} &{} s_{6} &{} s_{7} &{} s_{5}\\ s_{8} &{} s_{10} &{} s_{11} &{} s_{9} \end{array} \right] \!, \end{aligned}$$
$$\begin{aligned} S^{-1}_1 = \left[ \begin{array}{cccc} s_0 &{} s_{5} &{} s_{15} &{} s_{10}\\ s_{7} &{} s_{2} &{} s_{8} &{} s_{13}\\ s_{14} &{} s_{11} &{} s_{1} &{} s_{4}\\ s_{9} &{} s_{12} &{} s_{6} &{} s_{3} \end{array} \right] \!, S^{-1}_2 = \left[ \begin{array}{cccc} s_0 &{} s_{2} &{} s_{3} &{} s_{1}\\ s_{12} &{} s_{14} &{} s_{15} &{} s_{13}\\ s_{4} &{} s_{6} &{} s_{7} &{} s_{5}\\ s_{8} &{} s_{10} &{} s_{11} &{} s_{9} \end{array} \right] \!. \end{aligned}$$

From those mappings, it is clear that the relation among \(S_{2}^{-1}\), \(S_{0}\) and \(S_{2}\) satisfies Condition 2. Similarly, all of the pairs \((S_{2}^{-1}, S_{1}^{-1})\), \((S_{1}^{-1}, S_{0})\), \((S_{0}, S_{1})\), \((S_{1}, S_{2})\) satisfy Condition 1.
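These mappings can be reproduced mechanically. The Python sketch below is illustrative and rests on two assumptions: the state is indexed column by column as in \(S_0\), and Conditions 1 and 2 are read as statements about where the four cells of each column of \(S_0\) end up. The helper names are not taken from the specification.

```python
# Cell permutation of Midori as given above: the new s_i takes the old s_{P[i]}.
P = [0, 10, 5, 15, 14, 4, 11, 1, 9, 3, 12, 6, 7, 13, 2, 8]
P_INV = [P.index(i) for i in range(16)]


def permute(perm, state):
    """Apply a cell permutation once."""
    return [state[perm[i]] for i in range(16)]


def iterate(perm, times):
    """Start from S_0 (position i holds label i) and apply perm repeatedly."""
    state = list(range(16))
    for _ in range(times):
        state = permute(perm, state)
    return state


def columns_scatter(state):
    """Condition 1 reading: cells of each S_0 column lie in 4 different columns."""
    for col in range(4):
        landing = {state.index(label) // 4 for label in range(4 * col, 4 * col + 4)}
        if len(landing) != 4:
            return False
    return True


def columns_to_rows(state):
    """Condition 2 reading: cells of each S_0 column lie in one common row."""
    for col in range(4):
        rows = {state.index(label) % 4 for label in range(4 * col, 4 * col + 4)}
        if len(rows) != 1:
            return False
    return True


S1, S2 = iterate(P, 1), iterate(P, 2)
S1_INV, S2_INV = iterate(P_INV, 1), iterate(P_INV, 2)

print("first column of S1:", S1[:4])
print("Condition 1:", columns_scatter(S1) and columns_scatter(S2))
print("Condition 2:", columns_to_rows(S2) and columns_to_rows(S2_INV))
print("S2 equals S2^-1:", S2 == S2_INV)
```

The last line reflects the fact that applying this permutation four times gives the identity, which is why \(S_2\) and \(S_2^{-1}\) coincide in the displayed states.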

Table 7. The minimum number of differentially/linearly active S-boxes (AS) of Midori64 and Midori128

4.2 S-Box Layer

According to the analysis in Sect. 2.1, 4-bit S-boxes are usually more efficient than 8-bit S-boxes in terms of energy consumption per cycle. Also, a small path delay and a small gate area lead to a low-energy implementation. To optimize the S-layer with respect to energy consumption, we aim to develop a small-delay and lightweight 4-bit S-box which fulfills the following requirements: (1) the maximal probability of a differential is \(2^{-2}\), (2) the maximal absolute bias of a linear approximation is \(2^{-2}\) and (3) the S-box is an involution. Requirement (3) enables us to reduce the number of possible S-boxes from \(2^{44.25}\) to \(2^{25.5}\).
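The reduction obtained from requirement (3) can be reproduced by counting the involutions on 16 points. The Python sketch below assumes that involutions with fixed points are included in the count, which is the reading that matches the quoted figure; the function name is illustrative.

```python
import math


def count_involutions(n):
    """Number of permutations of n points that are their own inverse.

    Recurrence: the n-th point is either fixed (a(n-1) choices for the rest)
    or swapped with one of the other n-1 points (a(n-2) choices for the rest).
    """
    a_prev, a_cur = 1, 1  # a(0), a(1)
    for k in range(2, n + 1):
        a_prev, a_cur = a_cur, a_cur + (k - 1) * a_prev
    return a_cur


bijections = math.factorial(16)
involutions = count_involutions(16)
print(f"4-bit bijections : 2^{math.log2(bijections):.2f}")
print(f"4-bit involutions: 2^{math.log2(involutions):.2f}")
```

The second exponent is about 25.46, which rounds to the \(2^{25.5}\) stated above.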

Approach to Find Small-Delay and Lightweight 4-Bit S-Box. Our approach starts with a key observation that the path delay is highly related to the dependency of the computation. We introduce a metric depth to estimate the path delay of S-boxes.

Definition 1

(depth): The depth is defined as the sum of sequential path delays of basic operations AND, OR, NAND, NOR and NOT.

Example. The depth of the computation of \((x \oplus y) \cdot z\) is estimated as the sum of the path delays of XOR and AND, because the “\(\cdot z\)” operation is feasible only after the computation of \((x \oplus y)\).

In our search, we assume that the depths of XOR, AND/OR, NAND/NOR and NOT are weighted as 2, 1.5, 1 and 0.5, respectively, based on the number of transistors that a signal passes through sequentially in the operation. The gate sizes of NOT, NAND/NOR, AND/OR and XOR/XNOR are estimated as 0.5, 1, 1.5 and 2 [GEs], respectively. We search all S-boxes whose depth is \(1, 1.5, 2, \ldots , \) and check whether the S-boxes satisfy our security requirements. As a result, we find \(\mathsf{Sb}_{0}\) (see Table 3), whose depth and gate size are the lowest and the smallest in our search. \(\mathsf{Sb}_{0}\) can be expressed as follows, where the inputs and outputs are defined as \(\{a, b, c, d \}\) and \(\{a', b', c' , d' \}\), and a and \(a'\) are the most significant bits.

$$\begin{aligned} a'= & {} \Bigl ( \overline{c} ~\mathtt{NAND}~ (a ~\mathtt{NAND}~ b)\Bigr ) ~\mathtt{NAND}~ ( a~\mathtt{OR}~ d ) \\ b'= & {} \Bigl ((a ~\mathtt{NOR}~ d) ~\mathtt{NOR}~ (b ~\mathtt{AND}~ c)\Bigr ) ~\mathtt{NAND}~ \Bigl ((a ~\mathtt{AND}~ c) ~\mathtt{NAND}~ d \Bigr )\\ c'= & {} (b ~\mathtt{NAND}~ d) ~\mathtt{NAND}~ \Bigl ((b ~\mathtt{NOR}~ d) ~\mathtt{OR}~ a \Bigr ) \\ d'= & {} \Bigl (a ~\mathtt{NOR}~ (b ~\mathtt{OR}~ c) \Bigr ) ~\mathtt{NOR}~ \Bigl ((a ~\mathtt{NAND}~ b) ~\mathtt{NAND}~ (c ~\mathtt{OR}~ d)\Bigr ) \end{aligned}$$

For instance, let us consider the computation of \(c'\). In this computation, \((b ~\mathtt{NAND}~ d)\) and \(( b~\mathtt{NOR}~ d )\) can be computed first. After that, \((b ~ \mathtt{NOR}~ d) ~\mathtt{OR}~ a \) is computed. Then, the final \(\mathtt{NAND}\) can be executed. Thus, the depth of \(c'\) is estimated as 3.5 ( = 1 + 1.5 + 1). The depths of the remaining outputs \(a'\), \(b'\) and \(d'\) are likewise estimated as 3.0 or 3.5.
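The depth metric and the gate-level description of \(\mathsf{Sb}_{0}\) can be evaluated together. In the Python sketch below, each output bit is written as a nested tuple mirroring the expressions above, and the depth of a gate is its weight plus the largest depth among its operands. The tuple encoding and the function names are assumptions made for illustration; the weights are those stated in the text.

```python
# Depth weights of the basic operations, as stated in the text.
DEPTH = {"NOT": 0.5, "NAND": 1.0, "NOR": 1.0, "AND": 1.5, "OR": 1.5, "XOR": 2.0}

GATES = {
    "NOT": lambda x: 1 - x,
    "NAND": lambda x, y: 1 - (x & y),
    "NOR": lambda x, y: 1 - (x | y),
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "XOR": lambda x, y: x ^ y,
}

# Sb0 as nested tuples (gate, operand, ...); plain strings name input bits.
SB0 = {
    "a'": ("NAND", ("NAND", ("NOT", "c"), ("NAND", "a", "b")), ("OR", "a", "d")),
    "b'": ("NAND", ("NOR", ("NOR", "a", "d"), ("AND", "b", "c")),
                   ("NAND", ("AND", "a", "c"), "d")),
    "c'": ("NAND", ("NAND", "b", "d"), ("OR", ("NOR", "b", "d"), "a")),
    "d'": ("NOR", ("NOR", "a", ("OR", "b", "c")),
                  ("NAND", ("NAND", "a", "b"), ("OR", "c", "d"))),
}


def evaluate(expr, bits):
    """Evaluate an expression tree on a dict of input bits."""
    if isinstance(expr, str):
        return bits[expr]
    gate, *args = expr
    return GATES[gate](*(evaluate(arg, bits) for arg in args))


def depth(expr):
    """Weighted length of the longest gate chain in the expression."""
    if isinstance(expr, str):
        return 0.0
    gate, *args = expr
    return DEPTH[gate] + max(depth(arg) for arg in args)


def sbox_table(circuit):
    """Lookup table of the 4-bit S-box; a and a' are the most significant bits."""
    table = []
    for x in range(16):
        bits = {"a": x >> 3 & 1, "b": x >> 2 & 1, "c": x >> 1 & 1, "d": x & 1}
        y = 0
        for name in ("a'", "b'", "c'", "d'"):
            y = (y << 1) | evaluate(circuit[name], bits)
        table.append(y)
    return table


table = sbox_table(SB0)
depths = {name: depth(expr) for name, expr in SB0.items()}
print("Sb0   :", " ".join(format(v, "x") for v in table))
print("depths:", depths)
print("involution:", all(table[table[x]] == x for x in range(16)))
```

With these expressions the sketch yields the table c a d 3 e b f 7 8 9 1 5 0 2 4 6, a depth of 3.0 for \(a'\) and 3.5 for \(b'\), \(c'\) and \(d'\), and confirms the involution property.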

Adding the full diffusion property as a further requirement, we find \(\mathsf{Sb}_{1}\), which has the lowest depth and the smallest gate area among the 4-bit bijective S-boxes satisfying requirements (1), (2), (3) and the full diffusion property. \(\mathsf{Sb}_1\) is expressed as follows:

$$\begin{aligned} a'= & {} \Bigl ((b ~\mathtt{NAND}~ c) ~\mathtt{NAND}~a \Bigr ) ~\mathtt{NAND}~ \Bigl ((a ~\mathtt{NOR}~ d) ~\mathtt{NAND}~b \Bigr ) \\ b'= & {} \Bigl ((a ~\mathtt{XOR}~ c) ~\mathtt{NOR}~b \Bigr ) ~\mathtt{NOR}~ \Bigl ((b ~\mathtt{NAND}~ c) ~\mathtt{AND}~d \Bigr ) \\ c'= & {} (c ~\mathtt{NAND}~ d) ~\mathtt{NAND}~ \Bigl ((a ~\mathtt{XOR}~ b) ~\mathtt{NAND}~ (b ~\mathtt{OR}~ d) \Bigr ) \\ d'= & {} \Bigl ((a ~\mathtt{NAND}~ b) ~\mathtt{NAND}~c \Bigr ) ~\mathtt{NAND}~ (b ~\mathtt{OR}~ d) \end{aligned}$$

Note that an S-box satisfies the full diffusion property if and only if every input bit in \(\{a, b, c, d \}\) non-linearly affects all output bits \(\{a', b', c' , d' \}\). This full diffusion property enables us to ensure a 3-round diffusion property in Midori128 (explained at the end of this section).
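The gate-level description of \(\mathsf{Sb}_{1}\) can be checked in the same spirit. The Python sketch below evaluates the four expressions directly and tests a weaker form of the full diffusion property, namely that flipping any input bit changes every output bit for at least one input value. It does not test non-linearity of the dependence, and the function names are illustrative.

```python
def nand(x, y):
    return 1 - (x & y)


def nor(x, y):
    return 1 - (x | y)


def sb1(x):
    """Evaluate the gate-level description of Sb1; a and a' are the MSBs."""
    a, b, c, d = x >> 3 & 1, x >> 2 & 1, x >> 1 & 1, x & 1
    a_out = nand(nand(nand(b, c), a), nand(nor(a, d), b))
    b_out = nor(nor(a ^ c, b), nand(b, c) & d)
    c_out = nand(nand(c, d), nand(a ^ b, b | d))
    d_out = nand(nand(nand(a, b), c), b | d)
    return a_out << 3 | b_out << 2 | c_out << 1 | d_out


def affects(sbox, in_bit, out_bit):
    """True if flipping input bit in_bit changes output bit out_bit for some x."""
    return any((sbox(x) ^ sbox(x ^ (1 << in_bit))) >> out_bit & 1
               for x in range(16))


TABLE = [sb1(x) for x in range(16)]
INVOLUTION = all(TABLE[TABLE[x]] == x for x in range(16))
FULL_DEPENDENCE = all(affects(sb1, i, j) for i in range(4) for j in range(4))

print("Sb1:", " ".join(format(v, "x") for v in TABLE))
print("involution:", INVOLUTION)
print("every input bit reaches every output bit:", FULL_DEPENDENCE)
```

For comparison, the expression for \(c'\) of \(\mathsf{Sb}_{0}\) does not involve the input bit c at all, so \(\mathsf{Sb}_{0}\) cannot pass this dependence test.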

Evaluation. Table 6 compares the S-boxes of PRESENT and PRINCE with \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\) using the NanGate 45 nm open cell library. The path delay of \(\mathsf{Sb}_{0}\) is smaller than those of the PRINCE and PRESENT S-boxes by factors of 1.5 and 2, respectively, and its gate size is also smaller than the others. The path delay and gate size of \(\mathsf{Sb}_{1}\) are comparable to those of PRINCE’s S-box. Additionally, \(\mathsf{Sb}_{0}\) and \(\mathsf{Sb}_{1}\) have the involution property.

Table 8. Input-output bit relations of each S-box

8-Bit S-Boxes Based on 4-Bit S-Boxes. Following the observation in Sect. 2.1, we adopt 8-bit S-boxes consisting of two 4-bit S-boxes processed in parallel to minimize the path delay in the round-based implementation. Moreover, in order to avoid the unfavorable independent property exploited in the full-round attack on KLEIN [23], we add properly chosen bit-permutations to the beginning and the end of the 8-bit S-boxes, as shown in Fig. 5. As described in Sect. 3.1, each output bit-permutation is the inverse of the corresponding input bit-permutation to keep the involution property. With a property of our P-layer and those bit-permutations, we claim that no independent property is found after 3 rounds in Midori128. Since \(\mathsf{Sb}_{1}\) has the full diffusion property, any input bit of \(\mathsf{SSb}_{i}\) affects the corresponding 4 output bits, as shown in Table 8. For example, in \(\mathsf{SSb}_{1}\), each input bit in the positions \(\{0, 1, 6, 7\}\) affects all output bits in the positions \(\{0, 1, 6, 7\}\). We choose the bit-permutations for \(\mathsf{SSb}_{0}\), \(\mathsf{SSb}_{1}\), \(\mathsf{SSb}_{2}\) and \(\mathsf{SSb}_{3}\) so that they satisfy the following property.

Property 1

Affected 4-bit positions of outputs of an S-box are included in both of two different input groups of the other three S-boxes.

For example, the group A of \(\mathsf{SSb}_{1}\) is {0, 1, 6, 7}. Then, those bit positions are found in the groups A and B of \(\mathsf{SSb}_{0}\). This implies that the {0, 1, 6, 7}-th input bits of \(\mathsf{SSb}_{0}\) affect all 8 bits output. For the matrix operation \({}^{t}(y_{0}, y_{1}, y_{2}, y_{3}) \leftarrow {\varvec{M}}{}^{t}(x_{0}, x_{1}, x_{2}, x_{3})\), we have the following property.

Property 2

Each input cell affects the three output cells whose cell positions differ from that of the input cell.

For instance, \(x_{0}\) deterministically affects \(y_1\), \(y_{2}\) and \(y_{3}\), and does not affect \(y_{0}\). From Properties 1 and 2, we obtain the following theorem.

Theorem 1

In Midori128, any input bit nonlinearly affects all 128 bits of the state after 3 rounds.

Proof

An input bit affects 4 bits in the corresponding cell after the first S-layer due to the full diffusion property of \(\mathsf{Sb}_{1}\). From Property 2, the affected 4 bits in the cell are diffused to three cells in the same column but at different cell positions after MixColumn. Note that the affected bit positions are the same in the three affected cells. From Property 1, in each of the three affected cells, the affected 4 bits are spread over all 8 bits of the cell after the 2nd S-layer. Therefore, all bits are affected by any input bit after 3 rounds (see Fig. 7).    \(\square \)

Fig. 7. Theorem 1: 3-round full diffusion property

4.3 Key Scheduling Function

To save energy, Midori128 does not employ any key scheduling function. The same 128-bit key is used as the whitening key and to generate the round keys. To make an efficient circuit for decryption, the i-th round key for decryption is defined as \(L^{-1}(K) \oplus L^{-1}(\beta _{18-i})\), where \(L^{-1}\) denotes the inverse of the linear layer. Computing \(L^{-1}(K)\) is a one-time operation on the key at the beginning of the decryption function and so does not consume any significant energy. The round key generation of Midori64 is slightly more complicated, as it involves selecting \(K_0\) and \(K_1\), i.e. the most significant and least significant halves of the 128-bit key, in alternate rounds. This can be achieved by the use of a single multiplexer. For efficient decryption, a one-time computation of \(L^{-1}(K_0)\) and \(L^{-1}(K_1)\) can be done at the beginning of the algorithm, which again does not consume any significant energy.

4.4 Round Constant

Both Midori128 and Midori64 use \(4\times 4\) binary matrices as round constants. The constants are derived from the hexadecimal encoding of the fractional part of \(\pi =3.\) 243f 6a88 85a3 \(~\cdots \). For example, the 1st, 2nd, 3rd and 4th rows of \(\beta _0\), when read as 4-bit binary constants, are the encodings of the hex values 2, 4, 3 and f, respectively. The other \(\beta _i\) are derived similarly. These are added bitwise to the LSB of each round key byte in Midori128 and of each round key nibble in Midori64. The round constants were chosen in this manner with a view to obtaining an energy-efficient decryption circuit. Both \(\beta _i\) and \(L^{-1}(\beta _{i})\) are \(4\times 4\) binary matrices, and so in both Midori128 and Midori64 the round constant addition requires a total of only 16 XOR gates. The constants \(\beta _i\) and \(L^{-1}(\beta _{i})\) can be stored in lookup tables and filtered accordingly in each round.
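The derivation of the constants from the hex digits of \(\pi \) can be sketched in a few lines of Python. The sketch uses only the twelve digits quoted above and assumes that each hex digit fills one row with its most significant bit first; the function name and this bit order are illustrative assumptions.

```python
# Only the hex digits of the fractional part of pi that are quoted in the text.
PI_HEX_DIGITS = "243f6a8885a3"


def beta_matrices(hex_digits):
    """Split the digit string into groups of four; each digit becomes one row."""
    matrices = []
    for start in range(0, len(hex_digits) - 3, 4):
        rows = [[int(bit) for bit in format(int(digit, 16), "04b")]
                for digit in hex_digits[start:start + 4]]
        matrices.append(rows)
    return matrices


BETAS = beta_matrices(PI_HEX_DIGITS)
for index, beta in enumerate(BETAS):
    print(f"beta_{index}:")
    for row in beta:
        print("  ", row)

# Each constant holds 16 single-bit entries, one per state cell, which is
# why the constant addition costs at most 16 XOR gates.
print("bits per constant:", sum(len(row) for row in BETAS[0]))
```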

5 Security Evaluation

5.1 Differential/Linear Cryptanalysis

The minimum number of differentially and linearly active S-boxes in each round is estimated as shown in Table 7. The maximum differential and linear probabilities of \(\mathsf{Sb}_0\), \(\mathsf{SSb}_{0}\), \(\mathsf{SSb}_{1}\), \(\mathsf{SSb}_{2}\) and \(\mathsf{SSb}_{3}\) are all \(2^{-2}\). Midori64 and Midori128 have more than 32 and 64 active S-boxes after 7 and 13 rounds, respectively. Thus, we expect that variants of Midori64 and Midori128 reduced to 7 rounds and 13 rounds, respectively, do not have any differential or linear trails whose probabilities are higher than \(2^{-64}\) and \(2^{-128}\).
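The differential part of this argument can be illustrated for \(\mathsf{Sb}_{0}\). The Python sketch below computes the largest entry of the difference distribution table and turns it into an upper bound on the probability of a trail with a given number of active S-boxes. The lookup table is an assumption in the sense that it was obtained by evaluating the gate-level description of \(\mathsf{Sb}_{0}\) in Sect. 4.2 and should be compared with Table 3; the function names are illustrative. The linear case can be treated analogously with a linear approximation table.

```python
import math

# Sb0 lookup table, obtained by evaluating the gate expressions of Sect. 4.2
# (assumed to agree with Table 3, which is not reproduced here).
SB0 = [0xc, 0xa, 0xd, 0x3, 0xe, 0xb, 0xf, 0x7,
       0x8, 0x9, 0x1, 0x5, 0x0, 0x2, 0x4, 0x6]


def max_ddt_entry(sbox):
    """Largest count of inputs x with S(x) ^ S(x ^ din) == dout, for din != 0."""
    size = len(sbox)
    best = 0
    for din in range(1, size):
        counts = [0] * size
        for x in range(size):
            counts[sbox[x] ^ sbox[x ^ din]] += 1
        best = max(best, max(counts))
    return best


def trail_bound_log2(sbox, active_sboxes):
    """log2 of the bound (max differential probability) ** active_sboxes."""
    max_prob = max_ddt_entry(sbox) / len(sbox)
    return active_sboxes * math.log2(max_prob)


print("max DDT entry:", max_ddt_entry(SB0))
print("log2 bound, 32 active S-boxes:", trail_bound_log2(SB0, 32))
```

A maximum entry of 4 out of 16 corresponds to the probability \(2^{-2}\), so 32 active S-boxes bound a trail by \(2^{-64}\).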

5.2 Boomerang-Type Attack

Boomerang-type attacks first divide the cipher into two sub-ciphers and then find a boomerang quartet with high probability. The probability of constructing a boomerang quartet is denoted as \(\hat{p}^{2}\hat{q}^{2}\), where \(\hat{p} = \sqrt{\sum _{\beta }\Pr ^{2}[\alpha \rightarrow \beta ]}\), \(\alpha \) and \(\beta \) are the input and output differences for the first sub-cipher, and \(\hat{q}\) is defined analogously for the second sub-cipher. \(\hat{p}^{2}\) is bounded by the maximum differential trail probability, i.e., \(\hat{p}^{2} \le \max _{\beta }\Pr [\alpha \rightarrow \beta ]\), and the same holds for \(\hat{q}^{2}\). Let \(p\) and \(q\) be the maximum differential trail probabilities for the first and the second sub-ciphers. Then \(p\) and \(q\) are bounded using the minimum number of active S-boxes in each sub-cipher. From Table 7, any combination of two sub-ciphers that together make up 8 rounds of Midori64 or 14 rounds of Midori128 has at least 32 and 64 active S-boxes in total, respectively. Note that these bounds for boomerang attacks are very conservative, i.e., they require the unrealistic assumptions \(\hat{p}^{2} = p\) and \(\hat{q}^{2} = q\). Indeed, in our active S-box search, we did not find such special events. Thus, we expect that considerably fewer than 8 and 14 rounds are already secure against boomerang-type attacks.

5.3 Impossible Differential Attacks

Midori64 and Midori128 achieve the 3-round full diffusion property. Thus, the differences of all cells in a state become unknown after the SubCell of 4 rounds, i.e., there is no probability-one (truncated) differential characteristic. Following the miss-in-the-middle approach, the maximum number of rounds of impossible differential characteristics is estimated as 7 rounds.

In order to obtain a lower bound on the number of rounds of impossible differentials, we try to find actual impossible differential characteristics. We utilize several deterministic properties of four binary matrices \({\varvec{M}}\). This approach was also adopted in the security evaluation of FIDES [8]. As a result, we find 6-round impossible differentials such that, if only one cell is active at the input, 6 rounds of Midori64 and Midori128 never produce an output with only one active cell. We believe that the full-round versions of Midori64 and Midori128 have a sufficient number of rounds as a security margin.
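The 3-round full diffusion property invoked at the start of this subsection can be illustrated at the cell level. The Python sketch below tracks only which cells may depend on a chosen input cell. It assumes that each round applies the cell permutation followed by MixColumn, treats the S-box layer as preserving cell positions, and ignores possible cancellations, so it is a dependency model rather than a differential analysis. The helper names are illustrative.

```python
# Cell permutation of Midori (new s_i takes the old s_{P[i]}); the state is
# indexed column by column, so position pos lies in column pos // 4.
P = [0, 10, 5, 15, 14, 4, 11, 1, 9, 3, 12, 6, 7, 13, 2, 8]


def shuffle_positions(active):
    """Positions occupied by the active cells after the cell permutation."""
    return {i for i in range(16) if P[i] in active}


def mix_positions(active):
    """With M_C, an active cell affects the other three cells of its column."""
    out = set()
    for pos in active:
        col, row = divmod(pos, 4)
        out.update(4 * col + r for r in range(4) if r != row)
    return out


def rounds_to_full_diffusion(start):
    """Number of rounds until all 16 cells may depend on the cell at start."""
    active = {start}
    rounds = 0
    while len(active) < 16:
        active = mix_positions(shuffle_positions(active))
        rounds += 1
    return rounds


ROUNDS = [rounds_to_full_diffusion(cell) for cell in range(16)]
print("rounds needed per input cell:", ROUNDS)
```

In this model every input cell reaches all 16 cells after exactly 3 rounds, passing through 3 and then 9 affected cells.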

5.4 Meet-in-the-Middle Attacks

The 3-round full diffusion property together with our S-boxes enables us to claim that any inserted key bit of \(\{K_0, K_1\}\) or K non-linearly affects all bits of the state after 3 rounds in the forward and the backward directions in Midori64 and Midori128, respectively. Thus, the number of rounds used for the partial matching (PM) [2] is upper bounded by 5 \(( = (3 - 1) + (3 - 1) + 1)\). The condition for the initial structure (IS) [29], also called independent biclique [10], is that key differential trails in the forward direction and those in the backward direction do not share active non-linear components. For Midori64 and Midori128, since any key differential affects all 16 S-boxes after at least 4 rounds in the forward and the backward directions, there is no such differential which shares active S-boxes in more than 4 rounds. Thus, the number of rounds used for IS is upper bounded by 3. Assuming that the splice-and-cut technique allows an attacker to add 3 more rounds in the worst case, an MitM attack on at most 11 rounds (3 + 3 + 5) may be feasible. However, because of the whitening keys at the beginning and the end and the actual constraints on the key order, we consider that it is difficult to construct 11-round attacks on Midori64 and Midori128.

5.5 Other Attacks

We also consider other types of attacks, including an integral differential, a truncated differential, a slide, a reflection and an algebraic attack. Consequently, we expect that none of them works better than brute-force attacks.

6 Implementation

The main design objectives of Midori were first to achieve efficiency in energy consumption and second to provide both the encryption and decryption functionalities with minimal overhead. In this context, it is essential to have a round based design that is optimal in terms of energy consumption, since unrolled designs are unlikely to be energy-efficient. The S-box and the MixColumn layer were specifically chosen for their energy-efficiency and their involutive property. Both these layers have very small logic depth, which makes the energy consumption per round as small as possible. Structurally, MidoriCore and MidoriCore \(^{-1}\) differ only in the order of application of the ShuffleCell, MixColumn and InvShuffleCell operations. Hence, the circuit for the round based implementation of the cipher that accommodates both encryption and decryption can be realized as shown in Fig. 8.

Fig. 8. The round based encryption/decryption architecture

Since the ShuffleCell operation (Sh) and MixColumn (MC) do not commute, the linear layer, which is the composition MC\(\circ \)Sh (\(=L\) say), must be inverted during decryption by \(L^{-1}=~\)Sh\(^{-1}\circ \)MC. In hardware, this can be achieved in two ways. The first involves filtering the outputs of the L and \(L^{-1}\) operations through a single multiplexer. This requires two instances of the MixColumn logic in the circuit, and since this layer is the most expensive in terms of area and energy consumed, it is not the most efficient way to achieve this functionality. The second method, which is better in terms of both area and energy, is the one shown in Fig. 8. It uses two multiplexers for filtering the outputs of the Sh and Sh\(^{-1}\) operations and a single instance of the MixColumn logic. To perform the decryption operation using this circuit, the round key needs to be changed to \(L^{-1}(K)\), and correspondingly the \(i^{th}\) round constant to \(L^{-1}(\beta _{18-i})\). The key change is a cheap one-time modification of the master key, while the whitening key is kept unchanged. The round constant functionality can be achieved by employing two lookup tables, one each for encryption and decryption, and filtering the appropriate round constant through a multiplexer. The round constants have been chosen so that both \(\beta _i\) and \(L^{-1}(\beta _{i})\) are \(4\times 4\) binary matrices, and so this layer requires a total of only 16 XOR gates. The circuit for the 64-bit variant is the same as in Fig. 8, except that it requires an extra filtering between \(K_0\) and \(K_1\) (the most and least significant halves of the secret key) in alternate rounds.
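The relations used in this paragraph can be checked on a toy model of the linear layer. The Python sketch below assumes a state of sixteen 4-bit cells indexed column by column, the cell permutation of Sect. 4.1 for Sh and the matrix \({\varvec{M}}_{C}\) for MC; the function names are illustrative. It verifies that \(L^{-1}=~\)Sh\(^{-1}\circ \)MC inverts L, that \(L^{-1}\) is linear (which is why the key and the round constant can be transformed separately), and that Sh and MC do not commute.

```python
import random

P = [0, 10, 5, 15, 14, 4, 11, 1, 9, 3, 12, 6, 7, 13, 2, 8]
P_INV = [P.index(i) for i in range(16)]


def shuffle(state, perm=P):
    """Cell permutation: the new cell i takes the old cell perm[i]."""
    return [state[perm[i]] for i in range(16)]


def mix_columns(state):
    """Apply M_C to each column: every cell becomes the XOR of the other three."""
    out = []
    for col in range(4):
        cells = state[4 * col:4 * col + 4]
        total = cells[0] ^ cells[1] ^ cells[2] ^ cells[3]
        out.extend(total ^ cell for cell in cells)
    return out


def L(state):
    """Linear layer L = MC o Sh (Sh is applied first)."""
    return mix_columns(shuffle(state))


def L_inv(state):
    """Inverse linear layer Sh^-1 o MC, using that MC is an involution."""
    return shuffle(mix_columns(state), P_INV)


def xor(x, y):
    return [a ^ b for a, b in zip(x, y)]


rng = random.Random(2015)  # fixed seed, arbitrary test vectors
key = [rng.randrange(16) for _ in range(16)]
beta = [rng.randrange(2) for _ in range(16)]
unit = [1] + [0] * 15

print("L_inv inverts L:", L_inv(L(key)) == key)
print("L_inv is linear:", L_inv(xor(key, beta)) == xor(L_inv(key), L_inv(beta)))
print("Sh and MC commute on a unit state:",
      mix_columns(shuffle(unit)) == shuffle(mix_columns(unit)))
```

The linearity check is what allows the decryption round key to be formed from \(L^{-1}(K)\), computed once, and the stored constants \(L^{-1}(\beta _{i})\).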

6.1 Evaluation

All the designs were initially implemented in VHDL and the functional verification was done using Mentor Graphics ModelSim SE software. The designs were then synthesized using the Synopsys Design Compiler for the Standard Cell library of the STM 90 nm Logic Process: CORE90GPHVT v 2.1.a.

Table 9. A comparison of energy consumption of Midori with selected ciphers for the STM 90 nm Logic Process. (Average Power reported at 10 MHz)

The switching activity file was then generated by performing a timing simulation on the synthesized netlist using the Synopsys VCS Software. The energy was then estimated with the Synopsys Power Compiler by using the switching activity file. An operating frequency of 10 MHz was used in all the simulations, since the effect of the leakage power is minimal at this frequency and so the energy consumed is more or less independent of the clock frequency. The results of the simulation for the 90 nm logic process are presented in Table 9, along with similar evaluations for AES, NOEKEON, SIMON 128/128, PRESENT and PRINCE. It can be seen that Midori128/Midori64 perform better than NOEKEON/PRINCE, which were also designed to make the combined functionalities of encryption and decryption easily available. In Fig. 9 we compare the energy/bit consumption of the ED architectures of all seven ciphers along with the cumulative latency figure (calculated as critical path \(\times \) number of rounds). It can be seen that Midori128 and Midori64 fare optimally with respect to both parameters.
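The quantities compared here are related by simple formulas: the energy per block is the average power multiplied by the execution time, and the cumulative latency is the critical path multiplied by the number of rounds. The Python sketch below spells this out; the numerical inputs are hypothetical placeholders chosen for illustration and are not taken from Table 9.

```python
import math


def energy_per_block(avg_power_w, cycles, freq_hz):
    """Energy in joules: average power times execution time (cycles / frequency)."""
    return avg_power_w * cycles / freq_hz


def energy_per_bit(avg_power_w, cycles, freq_hz, block_bits):
    """Energy in joules spent per processed bit."""
    return energy_per_block(avg_power_w, cycles, freq_hz) / block_bits


def cumulative_latency(critical_path_s, rounds):
    """Critical path multiplied by the number of rounds, as used for Fig. 9."""
    return critical_path_s * rounds


# Hypothetical example: 100 uW average power, 16 cycles per block at 10 MHz,
# a 64-bit block and a 3 ns critical path over 16 rounds.
block_energy = energy_per_block(100e-6, 16, 10e6)
bit_energy = energy_per_bit(100e-6, 16, 10e6, 64)
latency = cumulative_latency(3e-9, 16)

print(f"energy per block  : {block_energy * 1e12:.1f} pJ")
print(f"energy per bit    : {bit_energy * 1e12:.2f} pJ")
print(f"cumulative latency: {latency * 1e9:.1f} ns")
print("consistent:", math.isclose(block_energy, bit_energy * 64))
```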

Fig. 9. Cumulative latency vs Energy/bit figures

7 Conclusion

In this paper we present the block ciphers Midori128 and Midori64, optimized with respect to energy consumption. We first identify design choices that make a given algorithm efficient in terms of energy. Thereafter we propose two design components, i.e. the MixColumn matrix and the S-box, that help us achieve the objectives of low-energy design. These components are additionally involutive, which makes it easier to design a circuit with functionalities for both encryption and decryption. The energy of the proposed design was found to be optimal in comparison with state-of-the-art block ciphers available in the literature.