INT-RUP Secure Lightweight Parallel AE Modes

Owing to the growing demand for lightweight cryptographic solutions, NIST has initiated a standardization process for lightweight cryptographic algorithms. Specific to authenticated encryption (AE), the NIST draft demands that the scheme should have one primary member that has a key length of 128 bits, and it should be secure for at least 2^50 − 1 byte queries and 2^112 computations. Popular (lightweight) modes, such as OCB, OTR, CLOC, SILC, JAMBU, COFB, SAEB, Beetle, SUNDAE, etc., require at least 128-bit primitives to meet the NIST criteria, as all of them are only birthday bound secure. Furthermore, most of them are sequential, and they either use a two-pass mode or do not offer any security when the adversary has access to unverified plaintext (RUP model). In this paper, we propose two new designs for lightweight AE modes, called LOCUS and LOTUS, structurally similar to OCB and OTR, respectively. These modes achieve notably higher AE security bounds with lighter primitives (only a 64-bit tweakable block cipher). In particular, they satisfy the NIST requirements: they are secure as long as the data complexity is less than 2^64 bytes and the time complexity is less than 2^128, even when instantiated with a primitive with 64-bit block and 128-bit key. Both modes are fully parallelizable and provide full integrity security under the RUP model. We use TweGIFT-64[4,16,16,4] (also referred to as TweGIFT-64), a tweakable variant of the GIFT block cipher, to instantiate our AE modes. TweGIFT-64-LOCUS and TweGIFT-64-LOTUS are significantly light in hardware implementation. To justify this, we provide our FPGA-based implementation results, which demonstrate that TweGIFT-64-LOCUS consumes only 257 slices and 690 LUTs, while TweGIFT-64-LOTUS consumes only 255 slices and 664 LUTs.


Introduction
Lightweight cryptography, which targets applications in resource-constrained environments, has seen a sudden surge in interest due to the advent of the Internet of Things (IoT). In particular, lightweight authenticated encryption (AE) schemes are of utmost importance in establishing private and authenticated communication channels in IoT applications. This importance was addressed by the recently concluded CAESAR competition [CAE14] and the ongoing NIST lightweight cryptography project [MBTM17]. In many of these designs, reducing the internal state size is the main priority. In this context, permutation-based schemes [BDPA11, CDNY18] have an advantage over block cipher-based schemes [CIMN17], as they do not need to store the key. However, to achieve comparable security, the permutation size in general has to be almost the same as the block cipher size (key size + block size). In this work, we mainly focus on (tweakable) block cipher-based AE schemes.

The NIST Lightweight Cryptography Standardization Project
The NIST project [MBTM17] has received submissions of lightweight designs for standardization. NIST set the following minimum requirements for the submissions.
• The key size should be at least 128 bits.
• When the key size is 128 (resp. 256) bits, any cryptanalytic attack should need at least 2^112 (resp. 2^224) computations in a single-key classical setting (i.e., the time complexity).
• There should be one primary recommendation for the scheme with key size at least 128 bits, nonce size at least 96 bits, tag length at least 64 bits, and the total number of message bytes under a single key at least 2^50 − 1 (i.e., the data complexity).
In summary, the primary version of the authenticated encryption scheme should have security up to 2^50 bytes of data and 2^112 computations.

State of the Art on AE Modes in light of NIST Requirements
Depending on the performance requirements, AE modes can be categorized into two main structures.
Parallel modes: Parallel AE modes, such as OCB [KR16] and OTR [Min16], are designed to exploit the parallel computation infrastructure available in many high performance computing environments. These designs primarily focus on software efficiency with a reasonably fast implementation in hardware. Both OCB and OTR are efficient, rate 1, and parallelizable. However, they require a relatively large state size of 3n + κ and 4n + κ bits, respectively, where n and κ are the block size and the key size, respectively. Furthermore, they are only birthday bound secure in the block size, which means they need at least a 128-bit block cipher in order to satisfy the NIST criteria. This makes these designs potentially unsuitable for lightweight applications. In addition, they do not provide any security when the adversary gets access to unverified plaintext, i.e., the so-called RUP model [FJMV03, ABL+14], which might be relevant in many practical lightweight applications due to the lack of memory buffer. In [ZWH17], Zhang et [CIMN17] that can be implemented with the same state size as JAMBU but can achieve the optimal rate 1. The recently proposed rate 1/2 AE modes SAEB [NMSS18] and SUNDAE [BBLT18] further reduce the state size to (n + κ) bits. We note that most of these constructions do not have RUP security. However, SUNDAE achieves nonce-misuse security but SAEB does not. All the above mentioned AE modes have birthday bound security in the primitive size. Consequently, in light of the NIST security requirements, all the above mentioned modes require a 128-bit block cipher with a key size of approximately 128 bits. A 128-bit block cipher like AES might not be well-suited for lightweight implementations. In fact, [BMR+13, BBM15] show that lightweight implementations (see [MPL+11]) of AES require many more clock cycles when implemented in a small and serialized core. This is not desirable when throughput or energy consumption is also a concern in addition to the hardware footprint. On the other hand, 64-bit block ciphers such as PRESENT [BKL+07], SKINNY [BJK+16], or GIFT [BPP+17] have ultra-lightweight implementation cost with a comparable throughput. This immediately raises an interesting problem: (a) Can we design an AE mode with a 64-bit block cipher (and 128-bit key) satisfying the NIST project's security requirements?
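The block-size argument above can be checked with back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, that a birthday-bound mode loses security after roughly 2^(n/2) blocks of n bits each and ignores the constants in the actual bounds; the function name and the constant are illustrative and not taken from any of the cited works.

```python
import math

# NIST data requirement for the primary member: about 2^50 bytes under one key.
NIST_DATA_LOG2_BYTES = 50


def birthday_limit_log2_bytes(block_bits: int) -> float:
    """log2 of the byte count reached after about 2^(n/2) blocks of n bits.

    Simplifying assumption: security degrades once 2^(n/2) blocks are processed.
    """
    blocks_log2 = block_bits / 2               # 2^(n/2) blocks
    bytes_per_block_log2 = math.log2(block_bits / 8)
    return blocks_log2 + bytes_per_block_log2


for n in (64, 128):
    limit = birthday_limit_log2_bytes(n)
    verdict = "meets" if limit >= NIST_DATA_LOG2_BYTES else "falls short of"
    print(f"n = {n}: about 2^{limit:.0f} bytes, {verdict} the 2^50-byte target")
```

Under this rough model, a 64-bit block with a birthday bound stops near 2^35 bytes, while a 128-bit block reaches about 2^68 bytes, which matches the observation that birthday-bound modes need a 128-bit primitive.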

Design Goals
With problem (a) in mind, we aim to design an algorithm that satisfies the following criteria:
• Low State Size: The overall state size of the construction should be as low as possible.
• High Security: The security of the mode should be high (preferably full security in block size) so that even a 64-bit block size provides the required security.
• Integrity Security under RUP (or INT-RUP security): The mode should provide integrity even in scenarios when unverified plaintexts are released. INT-RUP security is particularly significant in lightweight applications (smart cards, RFID tags), where the memory buffer is often quite limited. In addition, INT-RUP is also useful in real-time streaming protocols (e.g., SRTP, SRTCP and SSH), where block-wise encryption/decryption is required and ciphertext/plaintext are released on-the-fly (though the verification oracle is also available to the attacker in addition to the unverified decryption oracle) in order to reduce the end-to-end latency.
• Versatility: The mode should also aspire to be flexible in its domain of applications, covering the spectrum of resource-constrained devices.
Both OCB and OTR can achieve low state size and versatility using lighter primitives such as a 64-bit block cipher. However, as mentioned before, they do not achieve the desired security level when implemented with a 64-bit block cipher. This leads to another natural question: (b) Can we uplift the security level of OCB and OTR while keeping the functionalities as intact as possible?
In fact, a positive answer to (b) leads to a positive answer to (a). In general, this should result in a highly secure, efficient, and significantly lighter design.

Our Contributions
The contributions of this paper are threefold:
3. Finally, we implement both LOTUS and LOCUS with TweGIFT-64 as the underlying block cipher (see Sect. 5). We provide hardware implementation details on an FPGA platform. We observe that our implementations achieve highly competitive results. LOCUS achieves a very low hardware area of only 257 slices and 690 LUTs, while LOTUS achieves an even lower hardware area of 255 slices and 664 LUTs. We also provide a benchmark on the FPGA platform against several state-of-the-art schemes, including lightweight designs.
We would like to point out that the proposed modes are well-suited for protocols that require both lightweight and high performance implementations, e.g., lightweight clients interacting with high performance servers (e.g., LwM2M protocols [OS19]). Some of the existing sequential modes, like sponges and SAEB, are better in terms of area-efficiency. However, due to their sequential nature, they cannot utilize the parallel computing capability of high performance devices. In contrast, our proposed modes are inherently parallel and can be implemented in a fully pipelined manner while keeping a comparably area-efficient implementation. Moreover, our modes have the lowest implementation area among all the existing parallel modes with RUP security.

Design Comparison
Table 1 summarizes a comparative study of our modes with popular lightweight AE modes.
The underlying primitive sizes are chosen appropriately for each of them to satisfy the minimum security requirements by NIST. Note that in state size, we only count main registers and provide a theoretical estimation. An actual implementation may add some additional states required for the control unit and others. However, these additional states should be small when compared to the main register size. The security proofs for OCB, OTR, COFB, SAEB, and SUNDAE are given in the standard model. Since they are all birthday bound secure in terms of the block size, they will achieve the same security level even in the ICM.

Security Proof: Ideal Cipher Model vs Standard Model
We use nonce-based re-keying to get beyond the birthday bound security, and it is a standard practice to use the ideal cipher model (ICM) in this scenario, as duly mentioned and used in [BHT18, Men17, BT16], etc. The primary reason for switching from the standard related-key model (SRKM) to the ICM is the lossiness of the generic standard-to-ideal reduction. In the SRKM, as shown in [Men17], one can achieve a related-key SPRP (RKSPRP) advantage of roughly DT/2^κ using a key recovery attack, which is quite loose and will not meet the NIST primary version criteria for AE with κ = 128. We emphasize that in many cases, including ours, this loss is meaningless, as this attack on the internal block cipher will not work on the mode due to the secret masking of the input and output of the block cipher (see Sect. 6.5). Although the ICM might give an optimistic bound, we think that it captures the possible attack strategies better than the SRKM. It is commonly believed that the SRKM might be too pessimistic, as noted in [BHT18, Men17, GPR14, BKR98]. It might be possible that a hybrid notion such as masked RKSPRP [Men17] could avoid such loss. However, such an exposition is out of scope for this work.

Preliminaries
For n ∈ N, [n] denotes the set {1, 2, . . . , n} and (n] := [n] ∪ {0}. For a finite set 𝒳, X ←$ 𝒳 denotes the uniform at random sampling of X from 𝒳. For n ∈ N, we write {0, 1}^+ and {0, 1}^n to denote the set of all non-empty binary strings and the set of all n-bit binary strings, respectively. We write φ to denote the empty string, and {0, 1}^* = {0, 1}^+ ∪ {φ}. For X ∈ {0, 1}^*, |X| denotes the length (number of bits) of X, where |φ| = 0 by convention. For any non-empty binary string X, (X_k, . . . , X_1) ←^n X denotes the n-bit block parsing of X, where We sometimes use the terms (complete) blocks for n-bit strings, and partial blocks for m-bit strings, where m < n. Throughout, we use the function ozs, defined by the mapping as the padding rule to map partial blocks to complete blocks. Note that the mapping is injective over partial blocks. For any X ∈ {0, 1}^+ and 1 ≤ i ≤ |X|, x_i denotes the i-th bit of X. For any binary string X and an integer i ≤ |X|, X i returns the least significant i bits of X, i.e., x_i ⋯ x_1. For any integer i, i_n denotes the n-bit unsigned representation of i.

Finite Field Arithmetic
The set {0, 1}^κ can be viewed as the finite field F_{2^κ} consisting of 2^κ elements. We interchangeably think of an element A ∈ F_{2^κ} in any of the following ways: (i) as a κ-bit string a_{κ−1} ⋯ a_1 a_0 ∈ {0, 1}^κ; (ii) as a polynomial A(x) = a_{κ−1} x^{κ−1} + ⋯ + a_1 x + a_0 over the field F_2; (iii) as a non-negative integer a < 2^κ; (iv) as an abstract element in the field. Addition in F_{2^κ} is just the bitwise XOR of two κ-bit strings, and hence denoted by ⊕. P(x) denotes the primitive polynomial used to represent the field F_{2^κ}, and α denotes the primitive element in this representation. The multiplication of A, B ∈ F_{2^κ} is defined as A · B := A(x) · B(x) (mod P(x)), i.e., polynomial multiplication modulo P(x) over F_2. For κ = 128, we fix the primitive polynomial Then, α, the primitive element, is 2 ∈ F_{2^128}. Throughout we use α = 2 and 1 + α = 3, and "α-multiplication" to denote the operation of field multiplication of some element and α (the element will be clear from the context).
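As a small illustration of the arithmetic described above, the following sketch multiplies field elements represented as integers. To keep the expected values checkable, it uses the well-known AES field F_{2^8} with reduction polynomial x^8 + x^4 + x^3 + x + 1 rather than the κ = 128 field of this paper; the function names and the choice of field are assumptions made only for this example.

```python
def gf_mul(a: int, b: int, poly: int, bits: int) -> int:
    """Multiply a and b as polynomials over F_2, reduced modulo `poly`.

    `poly` includes its leading term, e.g. 0x11B for x^8 + x^4 + x^3 + x + 1.
    """
    result = 0
    for _ in range(bits):
        if b & 1:            # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> bits:        # degree reached `bits`: reduce modulo poly
            a ^= poly
    return result


AES_POLY, AES_BITS = 0x11B, 8   # illustrative small field, not the paper's


def alpha_mul(x: int) -> int:
    """Multiplication by alpha = 2 (the polynomial x)."""
    return gf_mul(x, 2, AES_POLY, AES_BITS)


def alpha_plus_one_mul(x: int) -> int:
    """Multiplication by 1 + alpha = 3, i.e. alpha * x XOR x."""
    return alpha_mul(x) ^ x


print(hex(alpha_mul(0x57)), hex(alpha_mul(0x80)), hex(gf_mul(0x57, 0x83, AES_POLY, AES_BITS)))
```

Multiplication by α is a one-bit shift followed by a conditional XOR with the low part of the reduction polynomial, which is why α-multiplication is cheap as a key update.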

Tweakable Block cipher
For n, τ, κ ∈ N, E-n/τ/κ denotes a tweakable block cipher family E, parametrized by the block length n, tweak length τ, and key length κ. For K ∈ {0, 1}^κ, T ∈ {0, 1}^τ, and M ∈ {0, 1}^n, we use E_K^T(M) := E(K, T, M) to denote the invocation of the encryption function of E on input K, T, and M. The decryption function is analogously defined as E_K^{−T}(M). In the special case where the tweak set is a singleton, the resulting tweakable block cipher E is simply referred to as a block cipher E. We fix positive even integers n and κ to denote the block size and key size, respectively, in bits.

Authenticated Encryption in the Ideal Cipher Model
An authenticated encryption (AE) scheme is an integrated scheme that provides both privacy of a plaintext M ∈ {0, 1}^* and authenticity of M as well as associated data A ∈ {0, 1}^*. Taking a nonce N (which is a value unique for each encryption) together with associated data A and plaintext M, the encryption function of AE, enc_K, produces a tagged ciphertext (C, T) where |C| = |M| and |T| = t. Typically, t is fixed and we assume n = t throughout the paper. The corresponding decryption function, dec_K, takes (N, A, C, T) and returns a decrypted plaintext M when the authentication on (N, A, C, T) is successful; otherwise it returns the atomic error symbol denoted by ⊥.
In this paper we consider a variant of the decryption interface, due to the added capability of our AE schemes. The decryption interface provides two algorithms: a decryption function dec_K that takes (N, A, C) and returns a decrypted plaintext M irrespective of the authentication result (hence we drop the tag value), and a verification function ver_K that takes (N, A, C, T) and returns a decrypted plaintext M only when the authentication succeeds; otherwise it returns ⊥.

Security Definitions
A distinguisher A is an algorithm that tries to distinguish between two oracles O_0 and O_1 via black box interaction with one of them. At the end of the interaction it returns a bit b ∈ {0, 1}. We write A^O = b to denote the output of A at the end of its interaction with O. In the context of this paper, we will be concerned with computationally unbounded and deterministic distinguishers A. The distinguishing advantage of A against O_0 and O_1 is defined as where the probabilities depend on the random coins of O_0 and O_1.
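For orientation, the distinguishing advantage is commonly written in the following form. This is the standard textbook expression, given here as a sketch; the symbol Δ on the left-hand side is a label assumed for this example and may differ from the notation used elsewhere in this paper.

```latex
\[
  \Delta_{\mathcal{A}}(\mathcal{O}_0 ; \mathcal{O}_1)
  := \left| \Pr\left[\mathcal{A}^{\mathcal{O}_0} = 1\right]
          - \Pr\left[\mathcal{A}^{\mathcal{O}_1} = 1\right] \right|
\]
```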

TSPRP Security in Ideal Cipher Model
Let TPerms({0, 1}^τ, {0, 1}^n) be the set of all tweakable permutations with τ-bit tweak and n-bit block. We write Π ←$ TPerms({0, 1}^τ, {0, 1}^n) to denote a tweakable random permutation. A tweakable block cipher E is called a tweakable ideal cipher if E_K ←$ TPerms({0, 1}^τ, {0, 1}^n) for all K ∈ {0, 1}^κ, i.e., E behaves as a tweakable random permutation for all keys. The TSPRP advantage of any distinguisher A against a tweakable block cipher P built upon a tweakable ideal cipher E and instantiated with a key K ←$ {0, 1}^κ is defined as The TSPRP advantage of P, is defined as where the maximum is taken over all distinguishers A bounded by q queries to P and q_p queries to E. The TPRP security game is a weaker variant of TSPRP where the distinguisher is restricted from making any inverse queries to the tweakable block cipher P, i.e.
It is easy to see that Adv^tprp_P(q, q_p) ≤ Adv^tsprp_P(q, q_p).

Privacy Security in Ideal Cipher Model
Given a distinguisher A, we define the privacy advantage of A against an AE scheme Θ in the ideal cipher model as where $_e returns a uniform random string of the same length as the output length of Θ.enc_K. The privacy advantage of Θ is defined as (q_e, q_p, σ_e, q_p) := max where the maximum is taken over all distinguishers making q_e queries to the encryption oracle with an aggregate of σ_e blocks and q_p primitive (ideal cipher) queries.

INT-RUP Security in Ideal Cipher Model
We say that an adversary A forges an AE scheme Θ under RUP in the ideal cipher model if A is able to compute a tuple (N, A, C, T) satisfying Θ.ver_K(N, A, C, T) ≠ ⊥, without querying (N, A, M) to Θ.enc_K and receiving (C, T), i.e., (N, A, C, T) is a non-trivial forgery.
In this case, a forger can make an additional q_d RUP decryption queries of the form (N, A, C) with a total of σ_d blocks to the oracle Θ.dec_K, with no restriction on nonce repetitions, and receive the corresponding M. One can also view the forging game as an equivalent distinguishing game. Under this equivalent setting, the integrity under RUP advantage for any distinguisher A is defined as where ⊥ denotes the degenerate oracle that always returns the ⊥ symbol. The integrity under RUP advantage of Θ is defined as where the maximum is taken over all distinguishers making q_e encryption queries with an aggregate of σ_e blocks, q_d RUP queries with an aggregate of σ_d blocks, q_v verification attempts with an aggregate of σ_v blocks, and q_p ideal cipher queries. Throughout we write a (q_e, q_d, q_v, σ_e, σ_d, σ_v, q_p)-distinguisher to represent a distinguisher that makes q_e encryption queries with an aggregate of σ_e blocks, q_d decryption queries with an aggregate of σ_d blocks, q_v verification queries with an aggregate of σ_v blocks, and q_p primitive queries. Similarly, we can define distinguishers with a smaller or larger tuple of resources.

Coefficient-H Technique
We outline the coefficient-H technique developed by Patarin, which serves as a "systematic" tool to upper bound the distinguishing advantage of any deterministic and computationally unbounded distinguisher A in distinguishing the real oracle O_1 (construction of interest) from the ideal oracle O_0 (idealized version). The collection of all the queries and responses that A made and received to and from the oracle is called the transcript of A, denoted as ω. Sometimes, we allow the oracle to release more internal information to A only after A completes all its queries and responses, but before it outputs its decision bit.
Let Λ_1 and Λ_0 denote the transcript random variables induced by the interaction of A with the real oracle and the ideal oracle, respectively. The probability of realizing a transcript ω in the ideal oracle (i.e., Pr[Λ_0 = ω]) is called the ideal interpolation probability. Similarly, one can define the real interpolation probability. A transcript ω is said to be attainable with respect to A if the ideal interpolation probability is non-zero (i.e., Pr[Λ_0 = ω] > 0). We denote the set of all attainable transcripts by Ω. Following these notations, we state the main result of the coefficient-H technique in Theorem 1. The proof of this theorem can be found in [Vau03].
Theorem 1. Suppose for some Ω_bad ⊆ Ω, which we call the bad set of transcripts, the following conditions hold:
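In its commonly stated form, the coefficient-H theorem has two conditions and one conclusion, sketched below with the notation Λ_0, Λ_1, Ω and Ω_bad introduced above. The symbols ε_bad and ε_ratio and the name Δ for the distinguishing advantage are labels assumed for this sketch.

```latex
\begin{align*}
  &\text{(1)}\quad \Pr[\Lambda_0 \in \Omega_{\mathrm{bad}}] \le \epsilon_{\mathrm{bad}}, \\
  &\text{(2)}\quad \frac{\Pr[\Lambda_1 = \omega]}{\Pr[\Lambda_0 = \omega]} \ge 1 - \epsilon_{\mathrm{ratio}}
      \quad \text{for all } \omega \in \Omega \setminus \Omega_{\mathrm{bad}}, \\
  &\text{then}\quad \Delta_{\mathcal{A}}(\mathcal{O}_1 ; \mathcal{O}_0)
      \le \epsilon_{\mathrm{bad}} + \epsilon_{\mathrm{ratio}}.
\end{align*}
```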

Specification
In this section, we present the specifications of LOTUS and LOCUS, which use a tweakable block cipher with a short 4-bit tweak, TweGIFT-64 [CDJ+19b]. We give a short description of this design in Sect. 4.1.

LOTUS and LOCUS Modes
The encryption algorithm of both LOTUS and LOCUS modes receives an encryption key Both LOTUS and LOCUS operate on n-bit blocks and use a tweakable block cipher as the underlying primitive. Both algorithms share a common initialization and associated data processing phase. During the initialization phase, the κ-bit nonce N is XORed with the κ-bit secret key K to generate a κ-bit nonce-dependent encryption key K_N. Then, an n-bit nonce-dependent masking key ∆_N is generated by doubly encrypting a fixed value (here we use 0^n) with the TBC, using the keys K and K_N successively.

Associated Data Processing in LOTUS and LOCUS
For associated data processing, we parse the data into n-bit blocks and process them in a similar way as the hash layer of PMAC [Rog04]. To process an associated data block, we first update the current key value via α-multiplication. Next, we XOR the block with ∆_N, encrypt the value using E with the fixed tweak value 2 and the updated key K_N, and finally accumulate the encrypted output by XORing it into the previous checksum value. If the final block is partial, we use the tweak value 3 to process the final block. We refer to the output of the associated data processing as the AD checksum. The complete description of the associated data processing is depicted in Fig. 1 and formally specified in Algorithm 1.
Figure 1: Associated Data Processing for both LOCUS and LOTUS. Here E^i_{K_N,2} denotes invocation of E with key α^i K_N and tweak 0010. For the final associated data block, the use of E^a_{K_N,2/3} indicates invocation of E with key α^a K_N and tweak 0010 or 0011 depending on whether the final block is full or partial.
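The initialization and associated data processing described above can be sketched as follows. This is a minimal illustration under several loudly stated assumptions, not the specification in Algorithm 1: `toy_tbc` is a hash-based Feistel stand-in for TweGIFT-64 with no security claim, the reduction polynomial x^128 + x^7 + x^2 + x + 1 used in `alpha_mul` is assumed, the tweak used for the two initialization calls is assumed to be 0, and blocks are passed as already padded 64-bit integers.

```python
import hashlib

MASK128 = (1 << 128) - 1
REDUCTION = 0x87  # ASSUMPTION: x^128 + x^7 + x^2 + x + 1


def alpha_mul(key: int) -> int:
    """Multiply a 128-bit field element by alpha = 2."""
    key <<= 1
    if key >> 128:
        key = (key & MASK128) ^ REDUCTION
    return key


def toy_tbc(key: int, tweak: int, block: int) -> int:
    """Toy 64-bit tweakable permutation (4-round Feistel). NOT TweGIFT-64."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for rnd in range(4):
        data = key.to_bytes(16, "big") + bytes([tweak, rnd]) + right.to_bytes(4, "big")
        f = int.from_bytes(hashlib.sha256(data).digest()[:4], "big")
        left, right = right, left ^ f
    return (left << 32) | right


def init(key: int, nonce: int):
    """K_N = K xor N; Delta_N = E_{K_N}(E_K(0^n)). Tweak 0 is an assumption."""
    k_n = key ^ nonce
    delta_n = toy_tbc(k_n, 0, toy_tbc(key, 0, 0))
    return k_n, delta_n


def proc_ad_sketch(k_n: int, delta_n: int, ad_blocks, final_is_partial: bool):
    """Accumulate the AD checksum; returns (checksum, last updated key)."""
    key, checksum = k_n, 0
    for i, block in enumerate(ad_blocks):
        key = alpha_mul(key)                       # key update before each block
        is_last = i == len(ad_blocks) - 1
        tweak = 3 if (is_last and final_is_partial) else 2
        checksum ^= toy_tbc(key, tweak, block ^ delta_n)
    return checksum, key
```

Because every block uses its own key α^i K_N, the per-block calls are independent of each other and can be evaluated in parallel before the XOR accumulation.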

Description of LOTUS
To process a message in LOTUS, we parse the data into 2n-bit di-blocks and process them in a similar manner as OTR [Min16]. For each message di-block, we apply a simple variant of the two-round Feistel cipher [LR85]. However, instead of one upper layer encryption and one lower layer encryption, here we use two successive encryptions in each layer. The intermediate states between the encryptions in each layer are used to generate the checksum (which we call the intermediate checksum), which helps in obtaining integrity security under the RUP setting. To process a di-block, the key is first updated by an α-multiplication and the same key is used in the four tweakable block cipher calls. However, we use four different tweaks for the four calls (tweaks 4 and 7 in the upper layer, and 5 and 8 in the lower layer) for the purpose of domain separation. Also, we use four different tweaks (12 and 14 in the upper layer, and 13 and 15 in the lower layer) during the final di-block processing. The final di-block processing is slightly different and uses the length of the final di-block. To generate the tag, we apply an XEX-like [Rog04] transformation to the XOR of the intermediate checksum, the AD checksum, and the final message block. The complete specification of LOTUS authenticated encryption is given in Algorithm 1. Figure 2 gives a pictorial description of the encryption process.

Description of LOCUS
To process a message in LOCUS, we parse the data into n-bit blocks and process them in a similar manner as OCB [RBB03]. For each of the message blocks, we first mask the block, then encrypt it with the tweakable block cipher twice, and then mask it again to obtain the corresponding ciphertext block. Similar to LOTUS, the ∆_N masking is the same throughout a query, and the intermediate states (W_i in Fig. 3) between the two block cipher calls are XORed together to generate the intermediate checksum. For the last message block, instead of applying XEX on the message block, we apply it on the final block message length and XOR the output with the final message block. This strategy ensures identical processing for complete or incomplete final blocks. Again, similar to LOTUS, we update the key by α-multiplication before each block processing, and we use tweaks 4 and 12 in the upper and lower block cipher calls for non-final blocks, and tweaks 5 and 13 in the upper and lower block cipher calls for final blocks. The tag is generated identically to that of LOTUS. The complete specification of LOCUS authenticated encryption is given in Algorithm 2. The message processing part of the encryption algorithm is depicted in Fig. 3.
Figure 2: The dotted part in the final di-block is executed only when the message has an even number of blocks. We use the notation E^i_{K_N,j} to denote invocation of E with key α^{a+i} K_N and tweak j, where a denotes the number of blocks of associated data corresponding to the message. Here W⊕ denotes the intermediate checksum value and V⊕ denotes the AD checksum value. len_n is used to denote the n-bit representation of the size of the final di-block in bits.
Figure 3: Processing of an m-block message M and tag generation for LOCUS. len_n is used to denote the n-bit representation of the size of the final block in bits. W⊕ denotes the intermediate checksum value and V⊕ denotes the AD checksum value. E^i_{K_N,j} is defined in a similar manner as in Fig. 2.

Algorithm 2: The encryption algorithm of LOCUS. The subroutines proc_ad and proc_tag are identical to the ones used in LOTUS.
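The handling of non-final message blocks in LOCUS (mask, encrypt twice, mask, and accumulate the intermediate state) can be sketched as below. This is an illustration under stated assumptions, not Algorithm 2: `toy_tbc` is a hash-based Feistel stand-in for TweGIFT-64, the reduction polynomial in `alpha_mul` is assumed to be x^128 + x^7 + x^2 + x + 1, the first call is assumed to use tweak 4 and the second tweak 12, and final-block and tag processing are omitted.

```python
import hashlib

MASK128 = (1 << 128) - 1
REDUCTION = 0x87  # ASSUMPTION: x^128 + x^7 + x^2 + x + 1


def alpha_mul(key: int) -> int:
    """Multiply a 128-bit field element by alpha = 2."""
    key <<= 1
    if key >> 128:
        key = (key & MASK128) ^ REDUCTION
    return key


def toy_tbc(key: int, tweak: int, block: int, inverse: bool = False) -> int:
    """Toy 64-bit tweakable permutation (4-round Feistel). NOT TweGIFT-64."""
    def f(rnd: int, half: int) -> int:
        data = key.to_bytes(16, "big") + bytes([tweak, rnd]) + half.to_bytes(4, "big")
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    left, right = block >> 32, block & 0xFFFFFFFF
    if not inverse:
        for rnd in range(4):
            left, right = right, left ^ f(rnd, right)
    else:
        for rnd in reversed(range(4)):
            left, right = right ^ f(rnd, left), left
    return (left << 32) | right


def encrypt_blocks(key: int, delta_n: int, blocks):
    """Return (ciphertext blocks, intermediate checksum) for non-final blocks."""
    out, w_sum = [], 0
    for m in blocks:
        key = alpha_mul(key)                     # fresh key per block
        w = toy_tbc(key, 4, m ^ delta_n)         # first call on the masked input
        w_sum ^= w                               # hidden intermediate state
        out.append(toy_tbc(key, 12, w) ^ delta_n)
    return out, w_sum


def decrypt_blocks(key: int, delta_n: int, blocks):
    """Inverse of encrypt_blocks; recomputes the same intermediate checksum."""
    out, w_sum = [], 0
    for c in blocks:
        key = alpha_mul(key)
        w = toy_tbc(key, 12, c ^ delta_n, inverse=True)
        w_sum ^= w
        out.append(toy_tbc(key, 4, w, inverse=True) ^ delta_n)
    return out, w_sum
```

The checksum is taken over values that appear neither in the plaintext nor in the ciphertext, so releasing unverified plaintext does not reveal it; the same value is recomputed on the decryption side for verification.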

Design Rationale
In this section, we briefly describe the various design choices and the rationale for our proposals. Our primary goal is to design a lightweight AEAD that is efficient, provides high performance, and performs reasonably well on low-end devices as well. For efficiency, the AEAD should be one-pass. To obtain high performance capability, we aim for parallelizability. In addition, we demand integrity in the RUP model. This is especially useful for memory-constrained lightweight applications.
We start with two well-known modes, namely OCB and OTR. Both OCB and OTR satisfy the first two properties. OCB is online, one-pass, and parallelizable. OTR has all these features plus it offers inverse-freeness, albeit in exchange for a larger state (as it works on di-blocks). However, both of them are insecure under the RUP model. This motivates us to design an AE mode which is structurally as simple as OCB and OTR but achieves RUP security while keeping the primary features, such as efficiency and parallelism.
The new proposals LOTUS and LOCUS replace one block cipher call by two calls. The rationale behind this modification is the observation that the intermediate state between the two block cipher invocations can be used to generate a checksum, which is completely hidden and hence cannot be controlled by the adversary (even if the adversary is allowed to make RUP queries). This hidden checksum ensures integrity security in the RUP model. The additional block cipher call per message block increases the number of block cipher calls from ℓ to 2ℓ + 1 to process an ℓ-block message. However, this is the minimum number of non-linear invocations used by any state-of-the-art INT-RUP secure parallel AEAD mode.
The associated data processing phase is based on a simple variant of the hash layer of PMAC, and the computation is completely parallel. The associated data processing can be done in parallel with the plaintext and/or ciphertext processing in order to maximize the performance in parallel computing environments.
Both OCB and OTR generate the tag using the checksum (simple XORs) of all the plaintext blocks and the output of the processed associated data. However, two separate states are required to hold the message checksum and the AD checksum. We obtain INT-RUP security by using an intermediate checksum (hidden from the adversary) instead of the plaintext checksum. Moreover, we do not store the intermediate checksum and the AD checksum separately. Rather, we XOR the two checksums, which means that in a sequential implementation, the intermediate checksum can be computed on top of the AD checksum. This reduces the overall state size by the size of one block.
A notable change in LOTUS and LOCUS is the use of nonce- and position-dependent keys. OCB and OTR have only birthday bound security in the block size. This is because the security is generally lost once the inputs/outputs of any two distinct block cipher calls match, as the two calls share the same encryption key. In LOTUS and LOCUS, we overcome the birthday bound barrier by changing the key and tweak pair for each block cipher call. So even if there is a collision among inputs/outputs, the security remains intact, as the block cipher keys or tweaks are distinct. In fact, our modes are secure up to a data complexity of 2^n, a time complexity of 2^κ, and a combined data-time complexity of 2^(n+κ). This, in turn, helps us to construct AEAD algorithms with the desired security level using an ultra-lightweight short-tweak tweakable block cipher of size 64 bits.
Remark 1. We remark here that our specification of LOTUS and LOCUS deviates from the original definitions available in [CDJ+19a]. First, we use distinct tweaks in the ciphertext generation calls, i.e., 4 tweaks in LOTUS and 2 tweaks in LOCUS. Second, in the tag generation module, we perform the input masking by ∆_N only when the total input length (sum of associated data and message block length) is even. We have made these small modifications in the specification to simplify and modularize the proofs along the lines of OCB [Rog04]. The security of the original constructions can be shown, via a dedicated proof, to be exactly the same. See Sect. 6 for more details.
Remark 2. OCB-IC by Zhang et al. [ZWH17] achieves INT-RUP security using a similar idea as ours (two calls in ciphertext generation). However, we improve on several fronts. We reduce the state size by avoiding the additional storage for the AD checksum. Further, we improve the security from n/2 bits to n bits. This helps us in using a lighter primitive as compared to OCB-IC.
AddRoundKey: In this step, a 32-bit round key is extracted from the master key state and added to the state (at bit positions 4i and 4i + 1 for i = 0, 1, . . . , 15). After that, the master key state is rotated by some bits. This operation is also identical to that of GIFT.
AddRoundConstant: A single bit "1" and a 6-bit round constant are XORed into the cipher state at bit positions 63, 23, 19, 15, 11, 7 and 3, respectively. The round constants are generated using the same 6-bit affine LFSR as SKINNY [BJK+16] and GIFT-64-128.
AddTweak: For tweak processing, we first expand the 4-bit tweak into a 16-bit codeword using an efficient linear code. Let t_0, t_1, t_2, t_3 be the 4 bits of the tweak and t_s be the sum of the 4 bits. Then we compute t_{i+4} = t_i ⊕ t_s for i = 0, 1, 2, 3 and t_{i+8} = t_i for i = 0, 1, . . . , 7. This expanded codeword t_0, . . . , t_15 is then XORed into the state (at bit position 4i + 3 for i = 0, 1, . . . , 15) at an interval of 4 rounds.
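The tweak expansion in AddTweak is simple enough to check exhaustively. The sketch below follows the two formulas given above; treating t_0 as the least significant bit of the integer tweak is an assumption of this example, and the function names are illustrative.

```python
def expand_tweak(tweak: int) -> list:
    """Expand a 4-bit tweak into the 16-bit codeword t_0, ..., t_15."""
    t = [(tweak >> i) & 1 for i in range(4)]    # t_0..t_3 (t_0 = LSB, assumed)
    t_s = t[0] ^ t[1] ^ t[2] ^ t[3]             # sum (XOR) of the tweak bits
    t += [t[i] ^ t_s for i in range(4)]         # t_{i+4} = t_i xor t_s
    t += t[:8]                                  # t_{i+8} = t_i for i = 0..7
    return t


def min_nonzero_weight() -> int:
    """Smallest Hamming weight over all non-zero tweak differences.

    The expansion is linear, so the difference of two codewords is the
    codeword of the tweak difference.
    """
    return min(sum(expand_tweak(d)) for d in range(1, 16))


print(expand_tweak(0b0001), min_nonzero_weight())
```

Every non-zero 4-bit tweak difference therefore activates at least 8 of the 16 expanded bits, which is the property the differential analysis relies on.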

Intuition
Without exploiting the tweak, TweGIFT-64 is exactly the same as the original GIFT-64-128, which has already received several third-party security analyses. This principle applies both in the single-key and the related-key settings. To exploit features that do not exist in GIFT-64-128 but do in TweGIFT-64, attackers need to exploit the 4-bit tweak input. However, with only 4 additional bits, what attackers can do is very limited. In addition, the 4-bit tweak input is further expanded to 16 bits in an uncontrollable way. Thus, the best attack strategy is to attack the original GIFT-64-128 instead of trying to exploit the 4-bit tweak input.
One may consider using round keys to cancel the impact of the tweak, because both the round keys and the expanded tweak are XORed into the state. In particular, for differential cryptanalysis, one may consider canceling the tweak difference by the round-key difference in the related-key setting. However, AddRoundKey and AddTweak are designed to XOR the key bits and tweak bits into different bit positions. Thus, they cannot cancel each other. In the following, we discuss the case of differential cryptanalysis in more detail.

Security against Differential Cryptanalysis
The exact security bound, e.g. the lower bound of the number of active S-boxes and the upper bound of the differential characteristic probability, can be obtained by using various tools based on MILP and SAT. However, deriving such bounds for the entire construction with a 128-bit key difference is often infeasible.
Here we focus on the feature that the tweak expansion function ensures that the number of active bits in the expanded tweak is at least 8 when the tweak difference is non-zero. This implies that differential trails with non-zero tweak difference will have a large number of active S-boxes around the tweak injection. This motivates us to evaluate the tight bound of the differential characteristic probability for the 2-round transformation followed by the tweak injection and another 2-round transformation, which we call the "4-round core." Let p_core be the maximum differential characteristic probability of the 4-round core. Then, the probability for the entire construction is upper bounded by (p_core)^6 because 28 rounds of TweGIFT-64 contain six 4-round cores (Fig. 4).
p_core in the single-key setting. As evaluated by [CDJ+19b], p_core can be evaluated by using the MILP-based tool. p_core of the 4-round core is 2^{−25.6}, hence the probability for the entire construction is upper bounded by 2^{−25.6×6} = 2^{−153.6}. Because the block size of TweGIFT-64 is 64 bits, it resists differential cryptanalysis well.
p_core in the related-key setting with non-zero tweak difference. We also used the MILP-based tool to derive p_core by allowing the difference in the 128-bit key. Recall that the round key size is 32 bits, hence all subkey bits in the consecutive 4 rounds are independent. Under the condition that at least 1 tweak bit is active, it turned out that activating round keys leads to a higher p_core than the single-key setting. The results show that p_core in the related-key setting is 2^{−16}, hence the maximum differential characteristic probability of 28 rounds is upper bounded by 2^{−16×6} = 2^{−96}. This can also be viewed as follows: the maximum differential characteristic probability reaches 2^{−16×4} = 2^{−64} for 16 rounds, and we have 12 rounds for a margin. One of the best related-key differential trails for the 4-round core is fully specified in Table 2. We emphasize that the maximum differential probability of 4 rounds of the original GIFT-64-128 in the related-key setting is 1, namely 4 rounds can be bypassed without activating any S-box (this occurs when only the 4th round key is active). Attacking TweGIFT-64 by exploiting the tweak is therefore more difficult than attacking GIFT-64-128.
Gap between the lower and upper bounds. One may wonder if the trail in Table 2 is iterative and thus immediately leads to a differential trail matching the upper bound of the characteristic probability, 2^{−96} for 28 rounds. This is because the input and output state differences in Table 2 are both 0 (iterative), the expanded 16-bit tweak is iterative, and the two active 16-bit key registers are always k_0 and k_1. However, we confirmed that TweGIFT-64 does not allow such simple iterative characteristics. Indeed, the bit-rotation of the key registers in each round prevents iterating the same differential propagation multiple times.
To demonstrate this intuition clearly, we checked the behavior of the differential trail in the subsequent 4-round core after the type of trails in Table 2. Namely, we first imposed the limitation that only the third round is active in the first 4-round core, and then checked the behavior of the second 4-round core. The results show that when only the third round in the second 4-round core is active, the highest probability of the two consecutive 4-round cores is 2^{−64} (2^{−32} per 4-round core). This occurs in the following configuration.
• All 4 bits of the tweak are active, which makes ∆T_e equal to 0xffff.
• All 16 bits of k_0 and k_1 are active, i.e. 0xffff, which cancels the difference of the S-box output. (The difference 0xffff is invariant under any rotation operation.)
Hence, to the best of our knowledge, the current lower bound of the differential characteristic probability for 28 rounds with non-zero tweak difference in the related-key setting is 2^{−32×6} = 2^{−192}.
By not trying to cancel the difference in the second 4-round core, the third round of the second 4-round core can be bypassed with probability 2^{−16}. Hence, starting from the first round, this will yield a differential characteristic with probability 2^{−32} up to 9 rounds. However, since the difference soon diffuses in an uncontrollable way, it is inevitable to activate more S-boxes for the entire construction with this approach.
Remarks on three 8-round cores. 28 rounds of TweGIFT-64 can also be viewed as containing three 8-round cores by ignoring the two AddTweak operations between two 8-round cores. We also evaluated the maximum differential characteristic probability of the 8-round core by using the MILP-based tool, which turned out to be 2^{−26.7}. Hence, from this evaluation, the probability for the entire construction can be upper bounded only by 2^{3×(−26.7)} = 2^{−80.1}. This observation demonstrates the difficulty of exploiting our tweak injection in another way. The difficulty of controlling differential trails lies in the heavy weight of the expanded tweak, and thus counting as many tweak injections as possible is the best way to derive good bounds.
We notice that the maximum differential characteristic probability of 8 rounds of the original GIFT-64-128 in the related-key setting is 2^{−8}, which contains only 4 active S-boxes. The tweak expansion of TweGIFT-64 introduces many (at least 8) active S-boxes around the tweak injection, and this prevents the efficient differential trails available in GIFT-64-128.
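The bounds quoted in this subsection all follow the same arithmetic: the base-2 logarithm of a per-core probability is multiplied by the number of cores in 28 rounds. The short sketch below simply recomputes those figures. The helper name iterate_bound and the dictionary labels are illustrative assumptions; the per-core values are the ones stated above.

```python
BLOCK_BITS = 64  # block size of TweGIFT-64


def iterate_bound(log2_p_core: float, cores: int) -> float:
    """log2 of the probability obtained by chaining `cores` cores."""
    return log2_p_core * cores


log2_bounds = {
    "single-key, six 4-round cores": iterate_bound(-25.6, 6),
    "related-key, six 4-round cores": iterate_bound(-16.0, 6),
    "three 8-round cores": iterate_bound(-26.7, 3),
    "explicit trail, six 4-round cores": iterate_bound(-32.0, 6),
}

if __name__ == "__main__":
    for name, log2_p in log2_bounds.items():
        below_block = log2_p < -BLOCK_BITS
        print(f"{name}: 2^{log2_p:.1f} (below 2^-{BLOCK_BITS}: {below_block})")
```

Counting six tweak injections gives a noticeably smaller bound than counting three 8-round cores, which matches the remark that one should count as many tweak injections as possible.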

Related-key security with zero tweak difference.
Because of the difficulty of exploiting the tweak, a zero tweak difference would be the most natural scenario to attack TweGIFT-64. As discussed before, the related-key security of TweGIFT-64 without using a tweak difference can be reduced to the related-key security of GIFT-64-128. Indeed, the keys are computed in a predictable way in the mode and used with a fixed tweak. This implies that the related-key security of TweGIFT-64 matters for the related-key security of the entire construction.
At the time of the publication of GIFT-64-128 [BPP+17], the designers mentioned that "GIFT aims at single-key security, so we do not claim any related-key security (even though no attack is known in this model as of today)." On the other hand, several papers tried to attack GIFT-64-128 in the related-key setting, e.g. a related-key boomerang attack on 23 rounds [LS19] and related-key rectangle attacks on 23 rounds [CWZ19] and on 24 rounds [ZDM+19]. Without some innovation in cryptanalytic technique, 28 rounds of GIFT-64-128 would resist those approaches in the related-key setting.
Regarding the related-key differential cryptanalysis of GIFT-64-128, one may expect that the above-mentioned 8-round characteristic with probability 2^{−8} can be iterated three times to derive (2^{−8})^3 = 2^{−24} as the upper bound of the probability for 24 rounds. However, this is only a loose bound, and such a high-probability characteristic does not in fact exist. A popular approach to finding high-probability differential characteristics is to use automated tools such as SAT or MILP. In fact, GIFT has received a lot of attention in this respect both in the single-key and related-key settings [ZDY18, ZZDX19, LS19, CWZ19, ZDM+19, JZD19]. In particular, Liu et al. evaluated the lower bound of the number of active S-boxes of GIFT-64-128 in the related-key setting up to 19 rounds [LS19, Table 4], which shows that  5]. Considering that the attacker can append several rounds on top of the distinguisher, one may want to increase the number of rounds. The analysis to identify the number of added rounds to provide sufficient security margin is an open problem.

Hardware Implementation
In this section, we give a brief overview of the FPGA implementations of our designs. All the hardware implementations are written in VHDL and are implemented on both Virtex 6 xc6vlx760 and Virtex 7 xc7vx415t, using Xilinx ISE 14.7 and Vivado 2018.3, respectively, as the implementation tools. In all cases the optimization strategy is speed oriented.

Hardware Architecture
We implement combined encryption-decryption circuits for both the ciphers in a round-based architecture with a 64-bit data path. Both architectures are more or less similar, with a few differences in the message processing phase. The main modules and the registers are briefly described below.
Registers. Both architectures mainly contain four registers. A state register of 64 bits is used to store the encryption state. A key register of 128 bits is used to store the master secret key. Note that the key schedule of the underlying block cipher is palindromic, which removes the need for an additional subkey register: a single subkey register suffices. A 64-bit checksum register is used to store and update the checksum value, and one ∆ register is used to store the ∆_N value. The state register and key register are part of the module TweGIFT-64.
TweGIFT-64 Module. This module describes one round function of the TweGIFT-64 block cipher (Fig. 7). For LOCUS, we need both forward and inverse block cipher calls. Each round of a forward call consists of a sequence of S-box, bit-permutation and add round key operations. The inverse block cipher call performs the add round key, inverse bit-permutation and inverse S-box operations (in the reverse direction). Internally a tweak value is added to the state register after each of the five consecutive rounds. For LOTUS, only the forward block cipher call is required. The module takes a 64-bit input from the state register, computes one round of forward (or inverse, in the case of LOCUS) operations, updates the state, and then either sends the output to the accumulator, feeds it again to the TweGIFT-64 round module, or releases it as the final tag (added to ∆_N before the tag release).
Accumulator Module. The accumulator module ACC computes the checksum value of the ECB layer and the last block to compute the tag.
InOdd Module. This module is specific to LOTUS and is used to detect whether the counter value for the current message block has an odd index. This is required as LOTUS processes two blocks at a time, XORing them with the output of TweGIFT-64 to produce the output.
Finite State Machine. The finite state machine (FSM) controls the circuit flow: it generates and updates the internal signals and sends them to the internal modules. The FSM has a simple structure, and the functionalities of the circuit are described by several states. The overall hardware architecture for both designs is given in Fig. 5. Fig. 6 describes the state transitions of the FSM for the AE designs. The states are described below.
• Wait: This state indicates that we should now initialize the cipher functionalities. It prepares the circuit to process the nonce. The control next enters the ∆_N Process state when the data signal start is set. Otherwise, it enters the Change_Key state when the signal reset_key is set.
• ∆_N Process: This state initializes the cipher by computing ∆_N. When the computation is done (indicated by the complete signal), it enters the associated data processing phase AD_Process. Otherwise, it remains in the ∆_N Process state.
• Change_Key: This state indicates that the key is reset. The control transits to the Wait state when the reset is done. Otherwise, the control remains in this state.
• AD_Process: This state indicates processing of the associated data. It internally uses the underlying block cipher, runs it and updates the intermediate checksum. The completion of this phase is indicated by the AD_complete signal. After the completion, the state transits to the Enc/Dec state, which indicates the start of the message (or ciphertext) processing. Otherwise, it remains in the same state.
• Enc/Dec: This state indicates the message (or ciphertext) processing phase. It also invokes the block cipher internally, runs it and updates the checksum. The completion of this phase is indicated by the msg_complete signal. After the completion, the control enters the Tag_Generation state. Otherwise, the control remains in this state. Note that, during decryption for LOCUS, the circuit runs the block cipher decryption module.
• Tag_Generation: This state indicates that the tag needs to be generated now. The completion of this state is indicated by the tag_finish signal, after which the control goes to the Wait state again. Otherwise, the control remains in the same state while the tag_finish signal is not set.
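The state transitions listed above can be summarized as a small transition table. The following Python sketch is a software model for illustration only, not the VHDL design. The signal names start, reset_key, complete, AD_complete, msg_complete and tag_finish are taken from the description; reset_done is a hypothetical name for the condition "the reset is done", and giving start priority over reset_key in the Wait state is an assumption based on the word "Otherwise".

```python
from enum import Enum


class State(Enum):
    WAIT = "Wait"
    DELTA_N_PROCESS = "Delta_N Process"
    CHANGE_KEY = "Change_Key"
    AD_PROCESS = "AD_Process"
    ENC_DEC = "Enc/Dec"
    TAG_GENERATION = "Tag_Generation"


# For each state: ordered (signal, next state) pairs; the first set signal wins.
TRANSITIONS = {
    State.WAIT: [("start", State.DELTA_N_PROCESS), ("reset_key", State.CHANGE_KEY)],
    State.DELTA_N_PROCESS: [("complete", State.AD_PROCESS)],
    State.CHANGE_KEY: [("reset_done", State.WAIT)],  # hypothetical signal name
    State.AD_PROCESS: [("AD_complete", State.ENC_DEC)],
    State.ENC_DEC: [("msg_complete", State.TAG_GENERATION)],
    State.TAG_GENERATION: [("tag_finish", State.WAIT)],
}


def next_state(state: State, signals: set) -> State:
    """Return the successor state; stay in place if no relevant signal is set."""
    for signal, target in TRANSITIONS[state]:
        if signal in signals:
            return target
    return state


if __name__ == "__main__":
    state = State.WAIT
    for active in [{"start"}, set(), {"complete"}, {"AD_complete"},
                   {"msg_complete"}, {"tag_finish"}]:
        state = next_state(state, active)
        print(state.value)
```

Walking through one message in this model visits ∆_N Process, AD_Process, Enc/Dec and Tag_Generation in order and ends back in Wait.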

Implementation of TweGIFT-64
In this section, we briefly describe the hardware implementation details of the TweGIFT-64 module. We have implemented TweGIFT-64 using a basic iterative architecture. We would like to emphasize that our implementation is round-based and uses a 64-bit data path; a smaller implementation can be obtained using smaller data paths (4-bit, 8-bit or 16-bit) or even serialized implementations.
Table 3 provides the implementation details of TweGIFT-64 on Virtex 6. The results show that the difference in the number of LUTs between the encryption-only and the combined encryption-decryption implementations is 119 (caused by the inclusion of the decryption rounds and the multiplexers that select the input to the state register). The difference in the number of slices is about 36, where one slice in Virtex 6 has 4 LUTs and 2 flip-flops (the exact count depends on how a design is optimized and placed by the Xilinx tools). Detailed descriptions can be found in Appendix B.1.

Implementation of LOCUS and LOTUS
The hardware implementations of LOCUS and LOTUS use a round-based iterative TweGIFT-64 core as the main building block. Like the designs in the CAESAR benchmark, our designs are implemented with speed as the optimization target. The inherent parallelism of LOCUS and LOTUS allows implementers to explore various other options, such as pipelined or unrolled architectures. However, we do not use such optimizations here to ensure fair

Security Analysis of LOCUS and LOTUS
Before delving into the security proofs, we give an alternative formulation for LOCUS and LOTUS based on a tweakable block cipher. This formulation extends Rogaway's XEX [Rog04] based abstraction of OCB.
Notice that the modified algorithms are implicitly keyed due to the tweakable random permutation Π.

We remark here that the small modifications in the specification of LOTUS and LOCUS (see Section 3) are introduced precisely to exploit this modularity. As we see later in this section, these changes make the proof modular and much easier to understand. The security of the original construction as given in the NIST submission [CDJ+19a] is exactly the same, though it requires a more dedicated and notationally complex proof.
Theorem 2. For σ_e + σ_d + σ_v ≤ 2^{n−1}, q_p ≤ 2^{κ−1}, and σ = σ_e + σ_d + σ_v, we have
1. For all (q_e, σ_e, q_p)-distinguishers D,
denotes the output of the verification interface for the i-th verification query in the real oracle. Apart from λ̄^i, all other variables are adversarial inputs, and hence must match. Then, we have
We fix a verification query index i and consider the following two cases.
1. N̄^i ≠ N^j for all j ∈ [q]_e. This means that in the real world, the tweakable random permutation Π was never called for tweak input (N̄^i, 6, ·), whence the tag matches with probability at most 2^{−n}.
2. N̄^i = N^j for some j ∈ [q]_e. If T̄^i ≠ T^j, then the forgery succeeds with probability at most 1/(2^n − 1), as this is equivalent to guessing the output of a uniform random permutation when one input-output pair is already known. Suppose T̄^i = T^j. Then, we must have X̄^i_⊕ = X^j_⊕. Also, (Ā^i, C̄^i) ≠ (A^j, C^j), otherwise the queries are duplicates.
We can have two cases, depending upon whether Ā^i = A^j or not. We discuss the case Ā^i = A^j; the other case can be similarly bounded. Since Ā^i = A^j, there must be at least one ciphertext block index, say k, in
Then, the probability that X̄^i_⊕ = X^j_⊕ is bounded by at most 1/(2^n − q_d − 1) < 2/2^n (assuming q_d + 1 < 2^{n−1}) due to the randomness of W̄^i
Suppose the two ciphertexts differ only at the last block. Then it is easy to see that the probability of X̄^i_⊕ = X^j_⊕ is 0. This happens by design. Instead, suppose there exists k < j such that C̄^i_k ≠ C^j_k. Then, the probability of X̄^i_⊕ = X^j_⊕ is bounded by 1/(2^n − q_d − 1) ≤ 2/2^n (assuming q_d + 1 < 2^{n−1}), using a similar line of argument as in the preceding case.
Cases 2a and 2b are mutually exclusive, which, in combination with the Ā^i = A^j case, upper bounds the probability in case 2 by 4/2^n.
Cases 1 and 2 are mutually exclusive, whence we can bound Pr[Forge_i] ≤ 4/2^n. The result follows from Theorem 1.

Privacy and Integrity Security of LOTUS
The main technical result on the security of LOTUS is given in Theorem 3.
Proof. The privacy advantage is bounded by Adv^{tsprp}_{P[E]}(σ_e + q_e, q_p) using exactly the same argument as in the case of Θ-LOC. The integrity-under-RUP advantage is bounded by Adv^{tsprp}_{P[E]}(σ + q, q_p) + 4q_v/2^n using similar arguments as in the case of Θ-LOT. We skip a formal proof for reasons of economy.

Security of P
The main technical result on the security of P, as defined in Section 6.1, is given in Lemma 1.
Lemma 1. For any (q_e, q_d, q_p)-adversary B, we have
where q = q_e + q_d.
Proof. We employ the coefficient-H technique to bound the advantage of B in distinguishing the real oracle (P^±, E^±) from the ideal oracle (Π^±, E^±). Let [q] denote the set of all construction query indices, and let [q]_e and [q]_d denote the subsets of encryption and decryption query indices, respectively, i.e., |[q]_x| = q_x for x ∈ {e, d}.
For the i-th construction query, we define the following notations:
The i-th primitive query variables are defined analogously, but topped with a hat to differentiate them from their construction counterparts. So, the i-th primitive query is of the form (L̂^i, X̂^i, Ŷ^i), where L̂^i, X̂^i, and Ŷ^i denote the key, input and output of the primitive.
We consider an extended version of the oracles, in which they release the internal secrets once the query-response phase is over. The real oracle releases the secret key K and the ∆^i_N values for all i ∈ [q]. This uniquely defines all the intermediate variables arising in the construction queries.
The ideal oracle first samples a dummy key K uniformly at random. Let S = {i ∈ [q] : ∀ j < i, N^i ≠ N^j}. The ideal oracle samples ∆^i_N uniformly at random for all i ∈ S, and sets ∆^j_N = ∆^i_N if N^j = N^i for all j ∈ [q] and i ∈ S. All other internal variables are defined according to their relationship in the real world.
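The way the ideal oracle assigns the ∆_N values can be pictured as lazy sampling keyed by the nonce. The sketch below is a minimal illustration, assuming that S collects the first query index at which each nonce appears; the function name sample_delta_values and the default n = 64 are assumptions made for this example only.

```python
import secrets


def sample_delta_values(nonces: list, n_bits: int = 64) -> list:
    """Assign one uniform n-bit value per distinct nonce, reused on repeats."""
    delta_by_nonce = {}
    deltas = []
    for nonce in nonces:
        if nonce not in delta_by_nonce:
            # First occurrence of this nonce (an index in S): fresh uniform value.
            delta_by_nonce[nonce] = secrets.randbits(n_bits)
        # Repeated nonce: copy the value fixed at its first occurrence.
        deltas.append(delta_by_nonce[nonce])
    return deltas


if __name__ == "__main__":
    print(sample_delta_values([b"N1", b"N2", b"N1"]))
```

Queries that share a nonce receive the same value, while distinct nonces receive independently sampled values.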
Let Ω denote the set of attainable transcripts in the ideal world. For any transcript ω ∈ Ω, we segregate the construction and primitive query tuples into ω_c and ω_p, i.e.
, ω_p = (L̂^i, d̂^i, X̂^i, Ŷ^i)_{i∈[q]_p}.
Bad Transcript Analysis: We say that an attainable transcript is bad if one of the following conditions holds:
, since the transcript is good. Then, we have
On dividing Eq. (16) by (15), and doing some simple algebraic simplifications, we get
The result follows from the coefficient-H technique.

Some Remarks on Generic Cryptanalysis on LOCUS and LOTUS
Here we summarize some generic ways of attacking LOCUS. Similar strategies work for LOTUS as well. First of all, it is clear from the proof of Lemma 1 that privacy directly depends on the security of P[E]. Below, we enumerate some of the important attack strategies against P[E]:
1. Guessing the master key by making primitive queries: One can exhaustively search for the master key, which requires q_p = O(2^κ) and q = O(1). This strategy is handled in event C_0.
2. Guessing the nonce-based internal key and input mask: This requires a correct guess of K ⊕ N and E^0_K(0), which requires q_p · q = O(2^{n+κ}). This strategy is handled in event C_2.
3. Colliding the internal key and input of two distinct P queries: Clearly, this requires q = O(2^{(n+κ)/2}), as both key and input should collide. We remark that a similar event requires just q = O(2^{n/2}) queries in XEX-based constructions. For instance, Ferguson's forgery attack [Fer02] creates a collision on the internal input in O(2^{n/2}) queries. However, the same attack does not succeed against LOCUS, as an internal input collision alone is not enough. This strategy is handled in event C_3.
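To make the three complexities above concrete, the following sketch evaluates their exponents for the parameters used in this paper, n = 64 and κ = 128. It is plain arithmetic on the asymptotic expressions, with constants ignored; the dictionary labels are illustrative.

```python
N_BITS = 64     # block size n
KEY_BITS = 128  # key size kappa

# log2 of the query or time cost of each generic strategy (constants ignored).
log2_cost = {
    "master-key search (q_p)": KEY_BITS,
    "internal key and mask guess (q_p * q)": N_BITS + KEY_BITS,
    "key-and-input collision (q)": (N_BITS + KEY_BITS) / 2,
    "input-only collision in XEX-based modes (q)": N_BITS / 2,
}

if __name__ == "__main__":
    for strategy, exponent in log2_cost.items():
        print(f"{strategy}: about 2^{exponent:g}")
```

With these parameters the collision strategy against LOCUS needs about 2^96 queries, compared with about 2^32 for an input-only collision on a 64-bit XEX-based construction.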
In INT-RUP attacks the adversary can either try to exploit the above-mentioned attacks, or it can try to guess the tag or to collide the internal checksum values. All these cases are handled in the proof of Theorem 2. Note that access to unverified plaintext gives no extra advantage to the adversary, as the plaintext is not used in the checksum computation.
We also remark that the recent attacks on OCB2 by Inoue et al. [IIMP19] are not applicable to LOCUS either. Basically, their attack exploits a flaw in the last block processing of OCB2. Both LOCUS and LOTUS are devoid of such flaws.

Algorithm 5
The verified decryption algorithm of LOCUS. The subroutines proc_ad and proc_tag are given in the encryption algorithm.
original specification of the GIFT-64-128 block cipher. As the key schedule operations contain only bit shifts and circular rotations, it is easy to get the round key K_28 from the original key K_1 and vice versa using the permutations shown in Tables 8 and 9, respectively. Note that, depending on the context, we use "block cipher" to denote "tweakable block cipher". The control flow is generated by a small finite state machine with three states: BC_Reset, BC_Wait, BC_Encrypt. BC_Reset initializes the key register with the key through the key port and then goes to BC_Wait until the start signal is activated. The BC_Encrypt state executes the block cipher rounds and, after executing all 28 rounds, returns to BC_Wait.
We optimize our TweGIFT-64 implementation to encrypt bulk information in the ECB mode. If the signal start is set to 1, an additional clock cycle for the initialization phase can be avoided. In addition, when the encryption of the actual state is performed, the input can be taken directly from the feedback using the multiplexer muxIn. These two optimizations allow us to save 1 clock cycle for each of the processed blocks.
For a single chip implementation, the multiplexer muxSt selects the input to be sent to the state register from a decryption or an encryption round. For the encryption-only module, all the decryption rounds and the multiplexer muxSt are removed from the architecture and the encryption round is then connected directly to the state register.

B.2 Component Wise Area Calculation for lightweight LOCUS and LOTUS
The architectures of both lightweight LOCUS and LOTUS use several modules.
Here, we provide a brief discussion about the distribution of the hardware footprint among the individual modules, such as the main module, the control unit module, the block cipher round module, and the logic components (i.e. additional registers, multiplexers, etc.). The area utilization by the modules has been measured on Virtex 6. The distributions of the areas are presented in terms of the number of LUTs for both LOCUS and LOTUS. We observe that the majority of the hardware area is consumed by the underlying block cipher. The distributions are described in Fig. 8 below.
Algorithm 6 The verified decryption algorithm of LOTUS.
L ← L α
X_1 ← C_j ⊕ ∆_N
W_1 ← E_{L,8}(X_1)
Y_1 ← E_{L,5}(W_1)
In this section, we provide the hardware implementation details of both LOCUS and LOTUS with the underlying block cipher TweGIFT-64. In several applications, cipher implementations with small size are desirable. We primarily target these applications and implement the cipher with a small hardware area. It is easy to see that both our designs share the same structure for the associated data processing, while they differ in the message processing phase. Both LOCUS and LOTUS have a simple structure: they consist of a block cipher and a few basic operations (such as bitwise XORs, multiplexers and one accumulator). We would like to point out that the hardware area is dominated by the TweGIFT-64 module. We describe the hardware architectures and provide our hardware implementation results on both Virtex 6 and 7.
We provide a brief analysis of clock cycles per byte (cpb). This is a theoretical way to estimate the speed of the architecture. We would like to note that both designs share the same values for cpb, and we provide a joint analysis here. We consider a round-based architecture with a 64-bit datapath. To process associated data of a blocks and a message of m blocks, we need 29a + 57m clock cycles. We use one TweGIFT-64 call to process one associated data block and two TweGIFT-64 calls to process one message block. Our block cipher is optimized to process bulk data, and the reset is required only to indicate that the stream processing starts. We observe that the cpb values for different a and m are constant, as there is no initialization overhead and the overhead for the tag generation (a small constant number of clock cycles) is negligible for long messages. Our
For A, B ∈ {0, 1}* and |A| = |B|, we write A ⊕ B to denote the bitwise XOR of A and B.
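The cycle count 29a + 57m stated in the cpb analysis can be turned into a rough cycles-per-byte estimate. The sketch below is an illustration under stated assumptions: each block is 8 bytes (64 bits), cpb is taken as total clock cycles divided by total processed bytes, and the tag generation overhead is ignored. The function names are hypothetical.

```python
BLOCK_BYTES = 8  # 64-bit blocks


def clock_cycles(a: int, m: int) -> int:
    """Clock cycles for a associated-data blocks and m message blocks."""
    return 29 * a + 57 * m


def cycles_per_byte(a: int, m: int) -> float:
    """Estimated cpb, ignoring the constant tag-generation overhead."""
    blocks = a + m
    if blocks <= 0:
        raise ValueError("at least one block is required")
    return clock_cycles(a, m) / (BLOCK_BYTES * blocks)


if __name__ == "__main__":
    for a, m in [(0, 16), (16, 0), (16, 16), (1024, 1024)]:
        print(f"a={a}, m={m}: {cycles_per_byte(a, m):.3f} cpb")
```

In this model the estimate depends only on the ratio of associated data to message blocks, not on the total length, which is consistent with the absence of initialization overhead noted in the analysis.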

Figure 2: Processing of an m-block message M and tag generation (assuming an odd number of input blocks) for LOTUS. The upper left part shows the message processing of an intermediate di-block and the upper right part depicts the message processing of the final di-block. The lower part shows the tag generation process. c denotes the number of di-blocks in the message, i.e. c = m/2 − 1. The dotted part in the final di-block is executed only when the message has an even number of blocks. We use the notation E^i_{K_N,j} to denote an invocation of E with key α^{a+i} K_N and tweak j, where a denotes the number of blocks of associated data corresponding to the message. Here W_⊕ denotes the intermediate checksum value and V_⊕ denotes the AD checksum value. len_n is used to denote the n-bit representation of the size of the final di-block in bits.
[max{|C̄^i|, |C^j|}] such that C̄^i_k ≠ C^j_k. Now, we have two cases based on |C̄^i| and |C^j|. a. |C̄^i| ≠ |C^j|, say |C̄^i| > |C^j|. Then, we choose k = ¯ i . In this case, we condition on the values of W̄^i and W^j as well as V̄^i and V^j, except W̄^i_{¯ i}

Figure 7: Architecture of TweGIFT-64 Implementation
1. LOTUS and LOCUS: We propose two new highly secure and hardware efficient block cipher based authenticated encryption modes, named LOCUS (Lightweight OCb with rUp Security) and LOTUS (Lightweight OTr with rUp Security) (see Sect. 3.1). Our new AE modes have the following features:
(a) High Security. Both LOTUS and LOCUS satisfy DT = O(2^{n+κ}), where D and T denote the query and time complexities, respectively. Here D < 2^n and T < 2^κ are obvious conditions. We provide rigorous proofs (see Sect. 6) to obtain the respective security bounds in the ideal cipher model.
(b) Lightweight. The improved security ensures that both of these modes are secure with a 64-bit block cipher and 128-bit key. This essentially makes the modes lighter. Overall, LOTUS and LOCUS require only 388-bit and 324-bit states, respectively, which are much less than those of the comparably secure OTR and OCB modes, i.e. 640-bit and 512-bit, respectively (using a 128-bit block cipher).
(c) Efficient and Fast. Our modes keep the basic features of OCB and OTR intact. Both are online, single pass and fully parallelizable.
(d) INT-RUP Secure. We show that LOTUS and LOCUS have full 64-bit INT-RUP security in the ideal cipher model, whereas it is well-known that the integrity of both OCB and OTR can be trivially broken in the RUP setting. We remark here that LOTUS and LOCUS do not achieve the privacy notion in the RUP setting, (IND-CPA + PA1) of [ABL+14]. However, they achieve full 64-bit security in the usual notion of privacy (see Sect. 6).
2. We choose the recently proposed short-tweak tweakable block cipher TweGIFT-64[4,16,16,4] [CDJ+19b], which has a 64-bit block, a 128-bit key and only a 4-bit tweak, to instantiate both LOTUS and LOCUS. Sometimes we use TweGIFT-64 as a shorthand for TweGIFT-64[4,16,16,4]. As we use the concept of re-keying, we provide a comprehensive related-key analysis (see Sect. 4.2) of TweGIFT-64.

Table 1: Comparison of various lightweight AE modes with LOTUS and LOCUS. The block sizes are chosen such that they satisfy the criteria of NIST. The key size is 128 bits in all the cases.
LOTUS and LOCUS introduce significant optimizations over the OTR [Min16] and OCB [KR16] designs, respectively. Some of the novelties in LOTUS and LOCUS as compared to OTR and OCB are listed below:
1. LOTUS and LOCUS employ nonce-based key derivation as well as a re-keying technique to ensure higher security with lighter primitives. As mentioned above, this allows for the use of ultralight 64-bit block ciphers.
2. A subtle change in the tag generation process saves an additional n-bit state as compared to OTR and OCB, where n denotes the block size. So even with a similar primitive size, LOTUS and LOCUS would have a drastically lower hardware footprint as compared to OTR and OCB.
At a very high level, LOCUS can be viewed as an amalgamation of OCB-IC [ZWH17] and ΘCB3† [Nai17]. However, LOCUS improves over both of these designs on many fronts. In comparison to ΘCB3†, it improves on two fronts. First, ΘCB3† requires a pseudorandom function call for nonce-dependent key generation, whereas nonce-based key derivation in LOCUS is much simpler. Second, the ΘCB3† security bound contains a DT/2^κ term, whereas the LOCUS security bound is DT/2^{n+κ}. The bound for ΘCB3† is clearly not enough for the NIST project when κ = 128. Although our technique for obtaining INT-RUP security is quite similar to OCB-IC, we emphasize that LOCUS achieves full n-bit INT-RUP security, whereas OCB-IC has just n/2-bit INT-RUP security.

Table 2: Differential Trail with p_core = 2^{−16} in the Related-Key Setting.

Table 6: Comparison on Virtex 6 [ATHb]. Here BC denotes block cipher, SC denotes stream cipher, (T)BC denotes (tweakable) block cipher and BC-RF denotes the block cipher's round function; '-' means that the data is not available.

Table 8: Permutation to get K_1 from K_28

Table 9: Permutation to get original K_28 from K_1