Deep Learning-based Side Channel Attack on HMAC SM3

SM3 is the Chinese hash standard. HMAC SM3 combines a secret key with the input text to produce a message authentication code (MAC) of the input text. If the key is recovered, adversaries can easily forge a valid HMAC. Several methods can be used to recover the secret key, such as traditional side channel analysis and template attack-based side channel analysis. Deep learning has recently been introduced as a new alternative for performing side channel analysis. In this paper, we try to recover the secret key with deep learning-based side channel analysis. We train the network recursively for the different parameters using the same dataset, and attack the target dataset with the trained networks to recover the different parameters. The experiment results show that the secret key can be recovered with deep learning-based side channel analysis. This work demonstrates the interest of this new method and shows that the attack can be performed in practice.


I. Introduction
Side channel analysis is a powerful technique that helps an adversary recover sensitive information without damaging the target. The target device leaks information (e.g. power consumption [1], electromagnetic emanation [2], temperature [3], acoustic emissions [4], etc.) that is related to sensitive data during computation [5]. An adversary can make use of this leakage to recover the sensitive data. In order to decrease the leakage, countermeasures such as masking and hiding are used in cryptographic implementations. However, even with countermeasures in place, adversaries can come up with more powerful methods to recover the sensitive information.
Following the current trend in the side channel analysis research area, recent works have demonstrated that deep learning algorithms were very efficient to conduct security evaluations of embedded systems and had many advantages compared to the other methods.
Nowadays, machine learning has become a popular topic in many areas. It is usually divided into three classes: supervised learning, unsupervised learning and semi-supervised learning. In most situations, supervised learning is used. Many kinds of model are used in machine learning, such as Support Vector Machines (SVM) and Random Forests. Deep learning is a branch of machine learning that extracts features through several non-linear layers. Deep learning became popular after AlexNet [6] was proposed in 2012. Since then, more and more complex network structures have been proposed, such as VGGNet [7], GoogLeNet [8] and ResNet [9]. These networks work well in many areas, e.g. image recognition, face recognition and so on.
In recent years, deep learning techniques have been applied in the side channel analysis research area [10], [11]. Compared with traditional side channel methods, deep learning-based side channel analysis performs better, especially when the implementation is protected by masking or jitter [12]. Even without alignment or pre-processing, a neural network can recover sensitive information, which is much more convenient than the traditional side channel methods. Much research has been done in recent years. In 2013, Martinasek et al. attacked an AES implementation running on a PIC16F84A with a network of only one hidden layer [13]. Another work [14] compares different machine learning methods on the DPA contest V2 dataset. A later study [15] proposed a CNN-based side channel analysis and claims that this method is robust to trace misalignment; different network structures were applied to the dataset, leading to a CNN-best structure for that dataset. Picek et al. compare different methods for the class imbalance situation in DL-SCA [16]. Finally, a recent work [17] proposes a correlation-based loss function.
In this paper, we use deep learning-based side channel analysis to attack an HMAC SM3 implementation. The structure of this paper is as follows. In Section II, we introduce the SM3 algorithm, HMAC SM3, the basic idea of CNNs, as well as the attack path; this section gives readers a basic understanding of the algorithm and the attack method. The attacks on real traces are demonstrated in Section III, where the target, the structure of the network and the attack are illustrated. Finally, conclusions and future work are presented in Section IV.

II. Preliminaries

A. SM3 Algorithm
SM3 is the Chinese hash standard [18]. The structure of the algorithm is shown in Fig. 1. The input data of the function is padded such that it can be split into N blocks of 512 bits. Each block is treated by the same procedure: the former block computes a new IV for the latter block through the compression function f(), and the output of block N is the hash result of the algorithm.

The structure of function f() is shown in Fig. 2. The message expansion T() converts the 512-bit input into 64 pairs of 32-bit words. Each pair (W_j, W_j*) is used during round R_j, and the result of each round is used as the input of the next round. When the 64th round is completed, a final transformation is applied by XORing the input of the first round with the output of the last round to produce the output of f().

In order to explain the detail of each loop, we define the loop by the function:
IV_i = f(IV_{i-1}, Block_i)
The first constant IV_0 consists of eight 32-bit words:
IV_0,0 = 0x7380166F, IV_0,1 = 0x4914B2B9, IV_0,2 = 0x172442D7, IV_0,3 = 0xDA8A0600,
IV_0,4 = 0xA96F30BC, IV_0,5 = 0x163138AA, IV_0,6 = 0xE38DEE4D, IV_0,7 = 0xB0FB0E4E
The detail of each loop is as follows. First, the eight 32-bit local variables named a to h are initialized with the eight words of IV_{i-1}. For each round R_j, j ∈ [0, 63], we compute:
SS1_j = ((a <<< 12) + e + (T_j <<< j)) <<< 7   (1)
SS2_j = SS1_j ⊕ (a <<< 12)   (2)
TT1_j = FF_j(a, b, c) + d + SS2_j + W_j*   (3)
TT2_j = GG_j(e, f, g) + h + SS1_j + W_j   (4)
d = c; c = b <<< 9; b = a; a = TT1_j
h = g; g = f <<< 19; f = e; e = P_0(TT2_j)
where all additions are done modulo 2^32 and x <<< n means left rotation of x by n bits. The constant T_j is:
T_j = 0x79CC4519 for 0 ≤ j ≤ 15, and T_j = 0x7A879D8A for 16 ≤ j ≤ 63.
Function FF_j is:
FF_j(x, y, z) = x ⊕ y ⊕ z for 0 ≤ j ≤ 15, and FF_j(x, y, z) = (x ∧ y) ∨ (x ∧ z) ∨ (y ∧ z) for 16 ≤ j ≤ 63.
Function GG_j is:
GG_j(x, y, z) = x ⊕ y ⊕ z for 0 ≤ j ≤ 15, and GG_j(x, y, z) = (x ∧ y) ∨ (¬x ∧ z) for 16 ≤ j ≤ 63.
Function P_k is:
P_0(x) = x ⊕ (x <<< 9) ⊕ (x <<< 17), P_1(x) = x ⊕ (x <<< 15) ⊕ (x <<< 23).
The input plaintext of each block, Plain Block, is split into 32-bit words Plain Block = {PB_0, PB_1, ..., PB_15}. Then the parameter W_j is computed as:
W_j = PB_j for 0 ≤ j ≤ 15, and W_j = P_1(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15)) ⊕ (W_{j-13} <<< 7) ⊕ W_{j-6} for 16 ≤ j ≤ 67.
And the parameter W_j* is computed as:
W_j* = W_j ⊕ W_{j+4} for 0 ≤ j ≤ 63.
The function f() is finished by a 32-bit word-wise XOR with the initial state:
IV_i = (a, b, c, d, e, f, g, h) ⊕ IV_{i-1}.
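As a concrete reference, the compression loop described above can be sketched in Python as follows. This is a minimal, unoptimized implementation of the published standard; the function and variable names are our own:

```python
# Minimal SM3 sketch following the standard: message expansion,
# 64 compression rounds, and the final XOR with the chaining value.
IV = [0x7380166F, 0x4914B2B9, 0x172442D7, 0xDA8A0600,
      0xA96F30BC, 0x163138AA, 0xE38DEE4D, 0xB0FB0E4E]
MASK = 0xFFFFFFFF

def rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def P0(x): return x ^ rotl(x, 9) ^ rotl(x, 17)
def P1(x): return x ^ rotl(x, 15) ^ rotl(x, 23)

def FF(j, x, y, z):
    return x ^ y ^ z if j < 16 else (x & y) | (x & z) | (y & z)

def GG(j, x, y, z):
    return x ^ y ^ z if j < 16 else (x & y) | (~x & z & MASK)

def compress(V, block):
    # message expansion: W_0..W_67 and W*_0..W*_63
    W = [int.from_bytes(block[4 * i:4 * i + 4], 'big') for i in range(16)]
    for j in range(16, 68):
        W.append(P1(W[j-16] ^ W[j-9] ^ rotl(W[j-3], 15))
                 ^ rotl(W[j-13], 7) ^ W[j-6])
    Ws = [W[j] ^ W[j + 4] for j in range(64)]
    a, b, c, d, e, f, g, h = V
    for j in range(64):
        Tj = 0x79CC4519 if j < 16 else 0x7A879D8A
        SS1 = rotl((rotl(a, 12) + e + rotl(Tj, j)) & MASK, 7)
        SS2 = SS1 ^ rotl(a, 12)
        TT1 = (FF(j, a, b, c) + d + SS2 + Ws[j]) & MASK
        TT2 = (GG(j, e, f, g) + h + SS1 + W[j]) & MASK
        d, c, b, a = c, rotl(b, 9), a, TT1
        h, g, f, e = g, rotl(f, 19), e, P0(TT2)
    return [x ^ y for x, y in zip(V, [a, b, c, d, e, f, g, h])]

def sm3(msg: bytes) -> bytes:
    bitlen = len(msg) * 8
    padded = msg + b'\x80' + b'\x00' * ((55 - len(msg)) % 64) \
             + bitlen.to_bytes(8, 'big')
    V = IV
    for i in range(0, len(padded), 64):
        V = compress(V, padded[i:i + 64])
    return b''.join(x.to_bytes(4, 'big') for x in V)
```

On the standard test vector, sm3(b"abc") produces the published digest 66c7f0f4…8f4ba8e0.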

B. SM3 Based HMAC
HMAC stands for keyed-Hash Message Authentication Code and is a NIST standard which can be found in [19]. The process of HMAC SM3, shown in Fig. 3, is as follows. First, derive the key pair (K_i, K_o) from the key K.
Then calculate the first hash with K_i and the input data T: first hash = H(K_i ∥ T). Finally, calculate the HMAC with K_o and the first hash: HMAC = H(K_o ∥ first hash), where H() is the SM3 hash function.
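The steps above can be sketched generically in Python. Since Python's standard library has no SM3, the sketch below uses SHA-256 as a stand-in hash (it shares SM3's 64-byte block size); the construction itself is the standard HMAC of [19], and the function names are our own:

```python
import hashlib

BLOCK_SIZE = 64  # both SM3 and SHA-256 process 512-bit blocks

def hmac_hash(key: bytes, msg: bytes,
              H=lambda d: hashlib.sha256(d).digest()) -> bytes:
    if len(key) > BLOCK_SIZE:           # overlong keys are hashed first
        key = H(key)
    key = key.ljust(BLOCK_SIZE, b'\x00')
    k_i = bytes(b ^ 0x36 for b in key)  # K_i = K xor ipad
    k_o = bytes(b ^ 0x5C for b in key)  # K_o = K xor opad
    first_hash = H(k_i + msg)           # first hash = H(K_i || T)
    return H(k_o + first_hash)          # HMAC = H(K_o || first hash)
```

Swapping H for an SM3 implementation yields HMAC SM3; nothing else in the construction changes.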

C. Side Channel Analysis
Side Channel Analysis (SCA) was first proposed by Kocher et al. in 1996 [22]. It is a technique to retrieve the secret information of an algorithm by monitoring the physical information of a device (such as power, heat, timing, electromagnetic signals, etc.), as shown in Fig. 4. The reason that SCA can recover secrets is that the physical signals of a cryptographic device are correlated with its internal state. It is much easier to recover information from side channel signals than to directly break the core implementation. There are several kinds of SCA, e.g. simple power analysis, correlation power analysis, template attacks, etc. Simple power analysis [23] is an easy way to recover secret information: by observing the side channel signals, the attacker can find differences and recover the sensitive information according to them. Correlation power analysis (CPA) [24] needs many more traces. When using CPA to recover sensitive information, we guess the secret key to calculate a certain mid-value. Since different traces correspond to different plaintexts, we obtain a set of mid-values for every guess. By computing the correlation between the mid-values and the side channel signals, we can figure out the correct guess. The template attack (TA) is another kind of passive attack. It has two stages: first, template building; second, template matching. Deep learning based SCA is similar to TA; we discuss both in the following section.

D. Deep Learning Based Side Channel Analysis
The template attack [20] is a traditional method of side channel analysis. In the learning phase, we first acquire a reference set of traces from a reference device. For this set, we know the key, the plaintext and all the details of every trace, and we can build templates for each mid-value using it. In the attacking phase, we use the templates built in the learning phase against the target trace set to recover the mid-values, such that the secret key can be recovered as well.
Deep learning-based side channel analysis is similar to the traditional template attack in that it also has two phases: a learning phase and an attacking phase. The whole procedure is shown in Fig. 5. In the learning phase, a trace set is collected from the reference device. To each trace, we add a label that is related to the sensitive data (e.g. the key). The neural network is trained with the traces and labels, and the parameters of the neural network are updated. The goal of the learning phase is to make the output (prediction) of the neural network closer to the true label. After the learning phase, the updated parameters of the network are saved.
The network saved at the end of the learning phase is then applied in the attacking phase. In the attacking phase, we have a trace set with only traces, and we do not know the labels. For each trace, the network gives us a prediction of the label using the parameters saved in the learning phase. We can then recover the sensitive data (e.g. the key) according to the predictions.
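In practice, the per-trace predictions are usually combined by summing log-probabilities over the attack set and ranking the candidate values. A minimal sketch of this aggregation (our own names; it assumes the network outputs one probability per class):

```python
import numpy as np

def rank_guesses(probs: np.ndarray, label_fn, inputs) -> np.ndarray:
    """probs: (n_traces, n_classes) softmax outputs of the trained network.
    label_fn(x, guess): the label trace input x would carry if the sensitive
    value were `guess`. Returns candidate guesses sorted best-first."""
    log_p = np.log(probs + 1e-40)           # avoid log(0)
    scores = np.array([
        sum(log_p[i, label_fn(x, g)] for i, x in enumerate(inputs))
        for g in range(256)
    ])
    return np.argsort(scores)[::-1]
```

The correct candidate accumulates the highest log-likelihood as more attack traces are added, which is exactly the behaviour seen in the ranking curves of Section III.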

E. Neural Network Architecture
In this paper, we primarily focus on the Convolutional Neural Network; below, we introduce its basic idea.
The Convolutional Neural Network (CNN) [21] is a popular network in the deep learning domain. Usually, a CNN consists of three kinds of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional layers work as pattern detectors through convolution operations. Kernels in convolutional layers are updated by backpropagation, so we can obtain different kernels from the same set of inputs. These kernels detect different kinds of edges to track different features. Usually one kernel corresponds to one feature, and a convolutional layer contains several kernels. In the meantime, since the kernels perform convolution by sliding across the whole input, the same pattern in different positions can be detected by the same kernel. Note that the kernels in a convolutional layer are always small compared to the input data, which reduces the computational complexity of the neural network.
Pooling layers usually come after convolutional layers and reduce the size of their inputs. We can choose average pooling, which reduces the size by local averaging, or max pooling, which reduces the size by picking the maximum value in a certain area. By averaging or taking the maximum, the pooling layer extracts features of the input while reducing its size. This operation makes the CNN more robust to shifts and deformations of the input. In addition, it reduces the possibility of overfitting since it reduces the size of the input data.
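A one-dimensional toy example of the two pooling variants (non-overlapping windows, i.e. stride equal to window size, as used later in our network; the function name is our own):

```python
import numpy as np

def pool1d(x: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    # non-overlapping 1-D pooling; trailing samples that do not fill
    # a complete window are dropped
    x = x[: len(x) // size * size].reshape(-1, size)
    return x.max(axis=1) if mode == "max" else x.mean(axis=1)
```

For the input [1, 3, 2, 5], max pooling with window 2 keeps [3, 5], while average pooling yields [2.0, 3.5]; either way the output is half the length of the input.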
The fully connected layer usually comes at the end of the neural network. Each neuron in a fully connected layer is connected to every input.
We can choose the number of convolutional layers, pooling layers and fully connected layers arbitrarily. With more layers, the neural network can learn more complex features. However, with more layers, the network is also more prone to overfitting. Thus, the structure should be chosen carefully according to the input data.
The detailed architectures of MLP_best, MLP_monobit, MLP_multi-label, CNN_best, CNN_monobit and CNN_multi-label are shown in Fig. 6. The mark "FC-200" means a fully connected layer of 200 neurons, "conv11-64, Relu" means 64 convolutional kernels of size 11 with the Relu activation function, and "average pooling, 2 by 2" means an average pooling layer whose pooling window size is 2 and whose stride is 2.
For a deeper insight into the differences between the identity model and our multi-label model, the architecture of the output layers of CNN_best and CNN_multi-label (the same holds for MLP_best and MLP_multi-label) is depicted in Fig. 7. The output layer of CNN_best has 256 output neurons with a softmax activation function, while the output layer of CNN_multi-label has 8 neurons with a sigmoid activation function. Correspondingly, CNN_best uses crossentropy as the loss function, and CNN_multi-label utilizes binary crossentropy since there are 8 binary labels.

F. Attack Path
Since the SM3 algorithm itself has no secret key, we cannot attack SM3 directly; we can only attack HMAC SM3 to recover the key K used in the HMAC process. In order to recover the key K, we should first recover the key pair (K_i, K_o). Thus, we should recover the first hash IV H_i and the second hash IV H_o.
H_o and the first hash result are used to calculate the HMAC. To recover H_o, we should know the first hash result, and to get the first hash result we should first recover the first hash IV H_i. Recovering H_i and H_o uses the same process: if H_i can be recovered, H_o can be recovered in the same way. In this paper, we therefore only consider recovering H_i, and our target is to recover a_0, b_0, c_0, d_0, e_0, f_0, g_0 and h_0 related to H_i. In order to illustrate the attack path more clearly, we denote the key-dependent parts of the TT1_j and TT2_j computations as δ_1,j and δ_2,j respectively:
δ_1,j = FF_j(a, b, c) + d + SS2_j, so that TT1_j = δ_1,j + W_j*
δ_2,j = GG_j(e, f, g) + h + SS1_j, so that TT2_j = δ_2,j + W_j
We can first recover δ_1,0 and δ_2,0 according to Equation (3) and Equation (4) with j equal to 0. With δ_1,0 and δ_2,0 known, TT1_0 and TT2_0 can be easily calculated since W_0* and W_0 are known. Then, we can recover a_0 by targeting TT1_0⊕a_0, and recover b_0 by targeting TT1_0⊕a_0⊕(b_0<<<9). c_0 can be recovered by targeting the computation c_0 + FF_1(TT1_0, a_0, b_0<<<9) + SS2_1 + W_1*. After δ_1,0, a_0, b_0 and c_0 are recovered, we can simply recover d_0 by computing d_0 = δ_1,0 - (a_0⊕b_0⊕c_0) - SS2_0. Similarly, we can recover e_0, f_0, g_0 and h_0 with TT2_0 and W_1. Thus, the IV H_i can be recovered.
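The arithmetic behind this path can be checked with a small sketch. It computes δ_1,0 and SS2_0 for round 0 (where FF_0 is a plain XOR) and then recovers d_0 by subtraction; the function names are ours, and the test values are arbitrary:

```python
MASK = 0xFFFFFFFF
T0 = 0x79CC4519  # round constant T_j for j = 0

def rotl(x: int, n: int) -> int:
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def round0_targets(a, b, c, d, e):
    """delta_1,0 and SS2_0 for the first SM3 round."""
    ss1 = rotl((rotl(a, 12) + e + T0) & MASK, 7)
    ss2 = ss1 ^ rotl(a, 12)
    delta1 = ((a ^ b ^ c) + d + ss2) & MASK   # FF_0(a,b,c) + d + SS2_0
    return delta1, ss2

def recover_d(delta1, ss2, a, b, c):
    # once delta_1,0, a_0, b_0 and c_0 are known, d_0 follows directly
    return (delta1 - (a ^ b ^ c) - ss2) & MASK
```

The same subtraction pattern applied to δ_2,0 yields h_0.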

III. Attacks on Real Traces

A. Experiment Setup and Data Set
The testing target is a software HMAC SM3 implementation running on a 32-bit Infineon TC1782 microprocessor. The experiment setup consists of a high-performance digital storage oscilloscope (DSO), a high-precision XYZ stage and a near-field high-bandwidth EM probe, as shown in Fig. 8.
EM traces are acquired while the HMAC SM3 is running. A single measurement contains 50,000 points, representing the computation of the first hash. We collect two sets of traces: Set A, 200,000 traces with both the input data and the first hash IV H_i varying; and Set B, 50,000 traces with varying input data and a fixed first hash IV H_i. Set A is used in the learning phase, while Set B is used in the attack phase. In the training phase, 180,000 traces of Set A are used as the training set while the remaining 20,000 traces are used as the validation set to choose the best network parameters. Fig. 9 shows the structure of the network. We use only one convolutional layer, with kernel size 3 and 32 convolutional filters. For the pooling layer, we set both the pooling size and the stride to 2. The first fully connected layer has 1024 neurons while the second has 512 neurons. The input layer contains 5000 neurons, and the output layer contains 9 neurons, which stand for Hamming weight 0 to Hamming weight 8.

B. Neural Network Structure
The network structure is Input → CONV(32) → POOL → FC1 → FC2 → Output. The network contains 82,450,569 parameters in total, as shown in Fig. 10.
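The figure of 82,450,569 parameters is consistent with the layer sizes given above, assuming 'same' padding for the convolution (so the 5000-point input keeps its length before pooling). A quick arithmetic check:

```python
# per-layer parameter counts for the network described in the text
conv = 3 * 32 + 32            # 32 kernels of size 3, plus one bias each: 128
flat = (5000 // 2) * 32       # pooled feature map, flattened: 80,000
fc1 = flat * 1024 + 1024      # first fully connected layer: 81,921,024
fc2 = 1024 * 512 + 512        # second fully connected layer: 524,800
out = 512 * 9 + 9             # 9 Hamming-weight classes: 4,617
total = conv + fc1 + fc2 + out
print(total)                  # 82450569
```

Almost all of the parameters sit in the first fully connected layer, which is typical when a long raw trace is fed directly into an FC layer after only light convolution and pooling.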

C. Experimental Result
We first try to recover δ_1,0. Instead of recovering all 32 bits of δ_1,0 at once, we recover δ_1,0 byte by byte. With learning rate 0.0001 and batch size 200, we trained each model for 10 epochs using Set A. The training result is shown in Fig. 11: the blue line corresponds to the training set while the orange line corresponds to the validation set. We find that for every byte, the loss increases and the accuracy decreases on the validation set after several epochs. Thus, we save the network with the best validation performance instead of the network obtained when training is finished. The attack results on Set B are shown in Fig. 12. The line in blue indicates the expected value of the different bytes. We find that all bytes of δ_1,0 can be recovered with only several thousand traces of Set B.
With δ_1,0 recovered, we can calculate TT1_0 for every trace according to the corresponding W_0*. The EM traces leak information related to a_0 when a_0⊕TT1_0 is calculated. The training result is shown in Fig. 13.
We recover a_0 byte by byte as well. The result is shown in Fig. 14. The line in blue indicates the expected value of the different a_0 bytes. We find that we need almost all the traces in Set B to recover the four bytes of a_0. Unlike the result for δ_1,0, the correct candidates for a_0 are not very distinguishable from the other candidates. The reason may be that the operation leaking information about a_0 is an XOR, while the leakage about δ_1,0 comes from an ADD operation. We can repeat the attack on all the other parameters using the same set of data, changing only the target label, to recover b_0, c_0, δ_2,0, e_0, f_0 and g_0. Then, d_0 and h_0 can be calculated directly.

IV. Conclusion and Future Work
In this paper, we demonstrate a deep learning-based side channel attack on the HMAC SM3 algorithm. In order to recover the key used in HMAC SM3, the attacker should recover two IVs: H_i and H_o. We only focus on recovering H_i since the method for recovering the two IVs is the same. We recover δ_1,0, δ_2,0, a_0, b_0, c_0, d_0, e_0, f_0, g_0 and h_0 to recover H_i. The experiment result shows that we can recover the IV with 50,000 traces. In addition, we find that when the attack targets an ADD operation, the result is much better than when it targets an XOR operation, so more traces are needed to recover parameters that leak through XOR operations. Although the correct candidate for an XOR operation is not very distinguishable from the other candidates, it can still be recovered, and the margin may improve if more traces are added to the attacking set.
In this paper, we focus on a software implementation of HMAC SM3 without any countermeasures. In future work, we can try several different HMAC implementations: (a) a hardware implementation without countermeasures; (b) a software implementation with countermeasures; (c) a hardware implementation with countermeasures. By experimenting on these implementations, we can check whether deep learning works well in both unprotected and protected situations. In addition, we can try to figure out how the network structure should differ when attacking a hardware implementation versus a software implementation.