Uncertainty Analysis in Cryptographic Key Recovery for Machine Learning-Based Power Measurements Attacks

The present work concerns side-channel attacks on cryptographic devices protected with the advanced encryption standard (AES). In this regard, the assessment of guessing entropy (GE) and the related uncertainty is proposed for machine-learning-based attacks based on power measurements. For the first time, the GE was assessed on the entire key while uncertainty was introduced in the field of side-channel attacks, thus allowing a more rigorous vulnerability test for a device. Notably, a state-of-the-art attack relying on a multilayer perceptron is exploited for classifying power traces leaked from physically accessible devices. A public database was exploited for the sake of results’ reproducibility. Thanks to cross-validation, the uncertainty associated with retrieving a single key byte can be quantified and then propagated to the entire key by means of the Monte Carlo method. It is thus shown that when exploiting about 4000 attack records (traces), there is a 10% probability to retrieve the secret key as a whole with less than ten attempts. This implies that a full cryptographic key can be discovered on average ten times for every 100 similar devices by a side-channel attack. This poses security threats particularly relevant in an Internet-of-Things scenario and addresses the need for improved vulnerability testing and proper countermeasures.

in cryptographic devices can reveal sensitive information. He pioneered new measurement-based attacks that exploit unintended computation effects, namely, leakages, to retrieve information such as the secret key of a cryptographic algorithm. Such attacks are known in the literature as "side-channel attacks" [3].
In side-channel attacks, the physical accessibility of the device is required. This is a common condition for paradigms such as Internet of Things or edge computing, where a distributed network of smart devices is exploited [4], [5], [6], [7]. Other than execution time and power consumption, the exploited leakages include heat dissipation and electromagnetic emission. Power analysis is currently considered as the most powerful side-channel attack [8], and several approaches have been proposed.
Nonprofiled attacks include simple [9], differential [2], and correlation [10] power analysis. They consist of acquiring a set of power traces and then applying statistical analysis to retrieve the secret key. Therefore, the attacker exploits the correlation between power consumption and internal state of the device during cryptographic operations.
Profiled attacks, instead, rely on identifying a statistical model for the target device on the basis of compatible devices [11]. The attack consists of comparing the power traces acquired from the target device with the statistical model, thus allowing the secret key to be found with a limited number of traces. A divide-and-conquer strategy is usually applied to break down the secret key recovery into separately recovering single bits or bytes (denoted as "subkeys").
Since 2011, the possibility of enhancing profiled side-channel attacks by means of machine learning has been explored [12], [13], [14]. This approach involves the identification of a model using preliminary measurements (training data) and then the classification of new data through the identified model. The classification output is a class or the probability of possible classes associated with these data [14].
Hospodar et al. [12] considered the execution of a cryptographic algorithm compliant with the advanced encryption standard (AES) [15] and proposed a support vector machine to classify power consumption. Their results demonstrated that the classification-based approach could outperform template attacks through a fine-tuning of hyperparameters. Then, in [16], the concept of enhanced brute force was introduced, namely, attempting all possible keys (brute force) while considering the most probable subkeys first. Martinasek and Zeman [17] proposed a neural network for classifying the bytes of AES keys. They focused on retrieving a single byte and suggested that in case of byte misclassification, the next most probable bytes could be considered. Such a method could potentially allow retrieving a key. Nonetheless, an analysis considering the entire key is missing. Finally, in recent works, the focus has been on improving the classification of power traces by comparing different machine learning algorithms [18], [19], by exploiting deep learning methods such as convolutional neural networks [20], [21], or by exploring different leakage analyses to prepare the training dataset [22]. These works relied on a more complex model when attempting to improve an attack.
Different metrics have been proposed to assess the performance of a model for side-channel attacks [23], [24]. Among them, a widely used metric is the guessing entropy (GE), which quantifies the number of guesses needed on average to recover the right (sub-) key in an enhanced brute-force attack [16]. Thus, the greater the GE, the lower the attack effectiveness.
To date, no uncertainty has been associated with the calculated GE. However, evaluating the uncertainty would be desirable, especially in the Internet-of-Things scenario, because, when attacking multiple devices, the key could be retrieved (by chance) before the average number of guesses. Thus, quantifying the uncertainty would allow more rigorous vulnerability tests for a device.
Uncertainty quantification in machine learning has been receiving increasing attention in recent literature [25], [26], [27]. This appears crucial especially when machine learning is applied in healthcare [28], [29]. Widely exploited methods are the "Monte Carlo dropout" [25], or Bayesian neural networks [30]. However, these probabilistic approaches put some constraints on model architectures, e.g., they might require a Bayesian network. On the other hand, misclassification probability could be exploited for any model with probabilistic output to assess the uncertainty of a predictive performance [31].
The present work proposes an assessment of GE and associated uncertainty for machine-learning-based side-channel attacks. Notably, the proposed approach extends uncertainty quantification to profiling attacks by directly exploiting the results of a cross-validation already needed during the model validation. The investigation was carried out as a function of the number of attack traces. Both subkeys and the entire key were considered, and the Monte Carlo method was exploited to propagate the misclassification probability of single bytes to the uncertainty of the entire key.
The approach was applied to a public dataset, and the attack was based on a state-of-the-art multilayer perceptron. In doing so, the present results can be reproduced and extended. In particular, the final proposal allows the uncertainty of the device's vulnerability to be quantified before an attack, which facilitates the vulnerability assessment. In the remainder of this article, Section II discusses the proposed method, while Section III reports the experimental results.

II. METHOD
In this section, we introduce basic ideas and highlight the main contributions of this study. Then, an attack relying on measured power traces is presented. Finally, a method is proposed for assessing the uncertainty associated with the guessing of the secret key. This involves propagating the uncertainty of the GE from single bytes to the key as a whole.

A. Basic Ideas
A machine-learning-based side-channel attack was exploited to retrieve the secret key of the cryptographic device from measured power traces. AES-128 was specifically considered in the present work. Therefore, the secret key $k^\star$ consists of 16 bytes. Retrieving the entire secret key at once would require dealing with $2^{128}$ classes. Meanwhile, when classifying each byte individually, $2^8 = 256$ classes are tackled at a time. These classes are still many for a clear-cut classification, but the key discovery can be enhanced (see Section II-B).
State-of-the-art AES implementations also involve countermeasures. In this article, Boolean masking was considered. This is a popular data obfuscation scheme that conceals a sensitive value v, e.g., the secret key of a cryptographic algorithm, through a value m called the mask, which is combined with v through a bitwise XOR, i.e., $v \oplus m$.
The attack performance is commonly assessed by GE. To obtain this metric, an array with probabilities associated with each possible key value (guessing vector g) is needed as classification output. Then, the guessing vector is sorted in descending order of probability, and the position of the correct key is identified (key rank). Finally, the GE is the mean key rank over multiple experiments.
In the present work, the GE was assessed as a function of the number of attack traces. In principle, the more attack traces are available, the easier the retrieval of the secret key should be. However, no evidence has yet been provided regarding the entire key. Hence, as a first theoretical improvement, both the single key bytes (subkeys) and the entire key were investigated.
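As a minimal sketch of the metric described above, the key rank and the GE can be computed from a set of guessing vectors as follows. The function names and the toy vectors with four candidate values (instead of 256) are illustrative assumptions and are not taken from the original implementation.

```python
def key_rank(guessing_vector, correct_value):
    """1-based position of correct_value after sorting by decreasing probability."""
    order = sorted(range(len(guessing_vector)),
                   key=lambda v: guessing_vector[v], reverse=True)
    return order.index(correct_value) + 1


def guessing_entropy(guessing_vectors, correct_value):
    """Mean key rank over several experiments (one guessing vector each)."""
    ranks = [key_rank(g, correct_value) for g in guessing_vectors]
    return sum(ranks) / len(ranks)


# Toy example: 4 candidate values, the correct one is value 2.
experiments = [
    [0.1, 0.6, 0.2, 0.1],  # value 2 is the 2nd most probable
    [0.1, 0.2, 0.6, 0.1],  # value 2 is the most probable
    [0.4, 0.3, 0.1, 0.2],  # value 2 is the least probable
]
ge = guessing_entropy(experiments, correct_value=2)
print(ge)
```

A GE close to 1 means that the correct value is almost always the first guess, while a GE close to half the number of candidates corresponds to random guessing.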
In addition, when attacking multiple devices, the secret key could be occasionally discovered before or after the average number of guesses. Thus, as a second theoretical contribution, the tradeoff between the GE and the number of attack traces was also studied in relation to the associated uncertainty.
The whole study ultimately investigates whether a cryptographic device can be penetrated by measuring few power traces during the attack and/or trying a reasonably low number of key values. This aims to improve vulnerability tests for characterizing the device's security.

B. Machine-Learning-Based Attack
Machine learning, and more specifically deep learning, is widely used for pattern recognition when recovering the bytes associated with acquired power traces [14], [20]. Recent literature indicates that convolutional neural networks should be preferred in the presence of desynchronization or jitter in power traces [32]. If this is not the case, a simpler network like the multilayer perceptron is similarly effective [14].
In principle, the parameters and hyperparameters of this neural network model would be identified, thanks to a set of power traces and their corresponding byte value. Nonetheless, recent literature suggests that 1) it is not efficient to use entire traces for the training [33] and 2) power traces cannot be analyzed independently of the plaintext [34], namely, the clear message in input to the cryptographic device.
Therefore, the power traces were preprocessed to find the subset of samples with the most significant information for the target byte [points of interest (POIs)] [33], and then an intermediate value for each trace was calculated as
$$\mathrm{Sbox}(p_i \oplus k_i) \oplus r_i \tag{2}$$
where $k_i$ ($i = 0, 1, \ldots, 15$) is the key byte to attack, $p_i$ is the corresponding byte of the plaintext, $r_i$ is the corresponding byte of the mask, $\oplus$ is the XOR operation associated with the AddRoundKey step in the first round of the AES algorithm and then with the masking, and Sbox() represents the byte substitution (SubBytes) step [15].
The POIs were identified by the signal-to-noise ratio (SNR) proposed in [32]. In detail, for each byte of the key (target byte), the SNR was estimated by grouping power traces in accordance with the intermediate values. The traces of the same group were averaged to obtain a single mean trace per intermediate value. Instead, multiple noise traces were derived for each intermediate value as the difference between a trace and the mean trace. Finally, for a specific target byte, the SNR was estimated as the ratio between the variances of the mean and noise traces
$$\mathrm{SNR}_j = \frac{\mathrm{var}[S_j]}{\mathrm{var}[N_j]} \tag{3}$$
where j indicates a sample of the trace, var[] indicates the variance among different intermediate values, $S_j$ are the mean traces, and $N_j$ are the noise traces.
Ultimately, a set of POIs with associated intermediate values was exploited to train the multilayer perceptron (Fig. 1). The network comprises an input layer with a number of neurons s corresponding to the input samples from a power trace, h hidden layers with the same number of neurons per layer, and an output layer with 256 neurons, one for each possible value of a byte. Each output neuron returns the probability that the input power trace is associated with a specific byte value (i.e., from 0 to 255).
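The labeling and SNR steps described above can be illustrated with a short sketch in plain Python. It assumes that the intermediate value is the masked S-box output Sbox(p ⊕ k) ⊕ r, that the noise variance is pooled over all traces, and that the synthetic traces leak the Hamming weight of the label in their first sample. These choices, like all the function names, are assumptions made for illustration and not the processing chain actually used on the dataset.

```python
import random
from statistics import mean, pvariance


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES S-box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):  # XOR with four left rotations of inv
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def intermediate_value(p, k, r):
    """Masked S-box output used here as the label of a trace (assumed form)."""
    return SBOX[p ^ k] ^ r


def snr_per_sample(traces, labels):
    """Variance of the per-label mean traces over variance of the noise traces."""
    groups = {}
    for trace, label in zip(traces, labels):
        groups.setdefault(label, []).append(trace)
    snr = []
    for j in range(len(traces[0])):
        means = {lab: mean(t[j] for t in grp) for lab, grp in groups.items()}
        noise = [t[j] - means[lab] for t, lab in zip(traces, labels)]
        snr.append(pvariance(list(means.values())) / pvariance(noise))
    return snr


# Synthetic demo: sample 0 leaks the Hamming weight of the label, sample 1 is noise.
random.seed(0)
key_byte = 0x2B  # hypothetical key byte
traces, labels = [], []
for _ in range(5000):
    p, r = random.randrange(256), random.randrange(256)
    z = intermediate_value(p, key_byte, r)
    traces.append([bin(z).count("1") + random.gauss(0, 1), random.gauss(0, 1)])
    labels.append(z)
snr = snr_per_sample(traces, labels)
print(snr)
```

In this toy setting, the leaking sample shows a clearly higher SNR than the pure-noise sample, which is the property exploited to locate the POIs.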
Once the POIs were estimated for each byte, the hyperparameters of the neural network had to be identified. These included, first, the number of layers, the number of neurons per layer, and the activation functions of the neurons. In addition, different numbers of epochs, different loss functions, and different batch sizes were investigated during the following training phase.
After identifying the hyperparameters, the model was trained by exploiting the power traces with associated intermediate values and using the categorical cross-entropy cost function for the backpropagation algorithm.
Next, in the attack phase, the model identified for each key byte returns probabilities for all the possible values associated with an input trace. After sorting the guessed byte values by decreasing probability, the number of attempts corresponds to the position of the actual byte value. More than a single trace can also be used to improve the result. In such a case, the literature considers joint probabilities of byte values for different traces. Doing so typically decreases the number of attempts to retrieve the correct byte value.
Log-likelihood is commonly used, i.e.,
$$g^* = \sum_{t=1}^{T} \log(g_t)$$
where $g^*$ is the guessing vector with joint probabilities, $g_t$ is the guessing vector associated with a single trace t, and T is the total number of traces used in the attack.
The entire key can be finally retrieved from the probabilities of single-byte values. The idea is that the first key attempt is composed of the 16 most probable byte values. In the following attempts, the less probable values are considered in descending order of probability until the secret key is discovered [17]. Hence, the number of attempts to discover the entire key is obtained by combining the attempts for single bytes, as discussed in Section II-C. In doing so, the machine learning approach can enhance the key retrieval process with respect to a brute-force attack.
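The accumulation of per-trace guessing vectors can be sketched as follows. The example assumes that each guessing vector is already indexed by the candidate byte values and that the joint score is the element-wise sum of log-probabilities over the traces; the small constant `eps` and the function name are illustrative choices.

```python
import math


def attack_ranks(guessing_vectors, correct_value, eps=1e-40):
    """Rank of the correct value after 1, 2, ..., T attack traces.

    The joint score g_star is the running element-wise sum of log(g_t).
    """
    n_values = len(guessing_vectors[0])
    g_star = [0.0] * n_values
    ranks = []
    for g_t in guessing_vectors:
        for v in range(n_values):
            g_star[v] += math.log(max(g_t[v], eps))  # eps avoids log(0)
        better = sum(1 for s in g_star if s > g_star[correct_value])
        ranks.append(better + 1)
    return ranks, g_star


# Toy example with 4 candidate values; the correct one is value 1.
per_trace_vectors = [
    [0.40, 0.30, 0.20, 0.10],  # a single trace ranks value 1 second
    [0.20, 0.50, 0.20, 0.10],
    [0.10, 0.60, 0.20, 0.10],
]
ranks, g_star = attack_ranks(per_trace_vectors, correct_value=1)
print(ranks)
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow when thousands of traces are combined.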

C. Uncertainty Assessment
Section II-B presented the training of models for guessing the values of single subkeys (bytes). For each model, it is possible to obtain an average GE and the associated uncertainty. This was notably done through cross-validation on training data. Through that, the aim was to approximate the probability distribution of subkey guessing ranks. Hence, the training data were split into n folds, and only n − 1 were used for actual training while the remaining one was used for testing.
This training-test step was repeated n times. For each iteration, the rank of the keys under test was obtained as a function of the number of traces. Then, average µ and standard deviation σ were calculated across the iterations.
The cross-validation allows us to calculate a mean rank (i.e., GE) and an associated uncertainty for each model. Type A evaluation was exploited for the uncertainty, i.e.,
$$u = \frac{\sigma}{\sqrt{n}}$$
This could be repeated for the models associated with different bytes and as a function of the number of attack traces.
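A minimal sketch of these statistics is reported below. It assumes that the ranks of the n folds are arranged as one list per fold, that σ is the sample standard deviation across folds, and that the Type A standard uncertainty of the mean is taken as σ/√n; the rank values are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def rank_statistics(fold_ranks):
    """Mean rank, standard deviation, and Type A uncertainty per trace count.

    fold_ranks[f][t] is the rank obtained in fold f with the t-th number
    of attack traces.
    """
    n = len(fold_ranks)
    stats = []
    for ranks_t in zip(*fold_ranks):  # iterate over numbers of traces
        mu = mean(ranks_t)
        sigma = stdev(ranks_t)
        stats.append((mu, sigma, sigma / sqrt(n)))
    return stats


# Invented ranks for five folds and three numbers of attack traces.
fold_ranks = [
    [120, 30, 1],
    [140, 10, 1],
    [100, 25, 1],
    [130, 20, 1],
    [110, 15, 1],
]
stats = rank_statistics(fold_ranks)
for mu, sigma, u in stats:
    print(mu, sigma, u)
```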
The results obtained for a single byte were then combined to achieve the GE of the entire key, along with its uncertainty. The rationale was to multiply the numbers of guesses needed for discovering the single bytes. However, instead of considering a single GE value for each byte (e.g., the mean), values were selected randomly from the probability distributions associated with each byte. These were assumed Gaussian with mean $\mu_k$ equal to the GE of the key byte k and the associated standard deviation $\sigma_k$.
The Monte Carlo method was thus exploited to obtain the probability distribution of the GE associated with the entire key. It is worth noting that, although Gaussian distributions are assumed for the ranks of single bytes, the final distribution is generally non-Gaussian. Therefore, using the standard deviation as a measure of uncertainty may not be appropriate. For this reason, the resulting distributions were analyzed by looking at histograms of occurrences, whose number corresponds to the iterations of the Monte Carlo analysis. Notably, the number of iterations has to properly represent the distribution while minimizing the computation time. Therefore, the analysis was stopped when the variation in the distribution's median and interquartile range was less than 5% between two consecutive iterations.
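The propagation can be sketched as follows under some explicit assumptions: each draw is clipped to the admissible rank interval [1, 256], the key rank is the product of the 16 byte ranks, and the stopping criterion on median and interquartile range is checked after each batch of draws rather than after each single draw. The batch size, the seed, and the per-byte statistics are hypothetical.

```python
import random
from statistics import median, quantiles


def sample_key_rank(byte_stats, rng):
    """One Monte Carlo draw: product of Gaussian byte ranks clipped to [1, 256]."""
    rank = 1.0
    for mu, sigma in byte_stats:
        rank *= min(max(rng.gauss(mu, sigma), 1.0), 256.0)
    return rank


def interquartile_range(samples):
    q1, _, q3 = quantiles(samples, n=4)
    return q3 - q1


def monte_carlo_key_rank(byte_stats, batch=2000, tol=0.05,
                         max_iter=100_000, seed=0):
    """Draw key ranks until median and IQR change by less than tol."""
    rng = random.Random(seed)
    samples = [sample_key_rank(byte_stats, rng) for _ in range(batch)]
    prev_med, prev_iqr = median(samples), interquartile_range(samples)
    while len(samples) < max_iter:
        samples.extend(sample_key_rank(byte_stats, rng) for _ in range(batch))
        med, iqr = median(samples), interquartile_range(samples)
        converged = (abs(med - prev_med) <= tol * prev_med
                     and abs(iqr - prev_iqr) <= tol * max(prev_iqr, 1e-12))
        if converged:
            break
        prev_med, prev_iqr = med, iqr
    return samples


# Hypothetical statistics: 2 unmasked bytes (always rank 1) and 14 masked bytes.
byte_stats = [(1.0, 0.0)] * 2 + [(1.5, 1.0)] * 14
samples = monte_carlo_key_rank(byte_stats)
print(len(samples), median(samples))
```

Since the product of clipped Gaussian variables is strongly skewed, the median and the interquartile range are more informative summaries of `samples` than the mean and the standard deviation.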
The proposal as a whole is schematically summarized in Fig. 2. It is worth remarking that, for each byte, the SNR analysis allows different POIs to be selected from the same power traces. Then, power traces are split into different folds $F_i$. Different models $M_j$ are trained with different data subsets, and each model produces a different rank $R_k$. Hence, the rank uncertainty on a single byte can be determined with cross-validation, while the uncertainty for the entire key needs the Monte Carlo analysis.

III. RESULTS
This section first introduces the public data to which the proposed method was applied. The exploited traces were already preprocessed to avoid jitter and desynchronization. Then, the section reports the results in terms of GE and uncertainty of the attacks. Both the single-byte case and the entire key are considered, and the discussion is carried out by considering both the profiling and the attack phases. Moreover, some details on model identification are discussed.

A. Data
The public dataset exploited in this work was the ASCAD database [32], [35]. This constitutes a suitable framework for reproducing and improving the already existing approaches and models. The ASCAD database is composed of power traces. These were measured through the electromagnetic radiation emitted by the ATMega8515 microcontroller during the first round of an AES encryption. The acquisition produced a dataset with 60 000 traces of 100 000 samples. Traces and associated metadata were stored through the Hierarchical Data Format version 5 (HDF5). This is a multipurpose hierarchical container format capable of storing large numerical datasets with their metadata [36].
Fig. 3. Data structure of the ASCAD database stored with the HDF5 hierarchical container format.
Fig. 3 shows the hierarchy of the ASCAD database. Notably, this is made of two groups: traces and metadata. The traces group contains the raw traces arranged as arrays of samples. The metadata group contains, for each trace, information regarding the input plaintext, the secret key, the adopted mask, and the output ciphertext. These are arranged as four vectors of 16 bytes. There is a one-to-one correspondence between traces and metadata structures, and the structures are in the same positional order as the traces group.
In accordance with the described method, the raw traces were preprocessed to obtain the actual POIs in input to the model. Meanwhile, the metadata were combined to estimate the intermediate value used as the label for each trace. This was done by considering one byte of the key at a time. Hence, 16 structures similar to the one of Fig. 3 were created. A single structure, associated with a single byte, contained N subsets of s POIs taken from the 100 000 samples of each trace and N intermediate values.
It is worth remarking that the traces and metadata were split into different groups for training, validation, and test. Notably, the first 50 000 traces were used for training and validation, while the remaining 10 000 traces were used as an independent test set. Then, as detailed later, the 50 000 traces were in turn split into five folds of 10 000 traces for cross-validation.
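The split described above can be sketched on trace indices as follows. The sketch assumes contiguous, unshuffled folds, which is not stated in the text, and the function names are illustrative.

```python
def split_indices(n_total=60_000, n_profiling=50_000, n_folds=5):
    """Return the cross-validation folds and the independent test indices."""
    fold_size = n_profiling // n_folds
    folds = [list(range(i * fold_size, (i + 1) * fold_size))
             for i in range(n_folds)]
    test = list(range(n_profiling, n_total))
    return folds, test


def cross_validation_rounds(folds):
    """Yield (training indices, held-out indices) for each round."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]


folds, test = split_indices()
rounds = list(cross_validation_rounds(folds))
for train, held_out in rounds:
    print(len(train), len(held_out))
```

With the default arguments, each of the five rounds trains on 40 000 traces and evaluates on the remaining 10 000, while the last 10 000 traces are never seen during profiling.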

B. Models Identification
The classification models for attacking a single byte of the key were identified by searching for optimal hyperparameters. As a first step, the POIs were identified by means of the SNR of (3). Fig. 4 reports an example of SNR estimated for the third byte of the key when using (2) to estimate the intermediate value. The values are reported in decibel (dB) and some peaks can be observed. These are representative of high variance points for the signal.
The specific POIs were identified by considering a window of samples wide enough to contain the peaks. For instance, in the example of Fig. 4, the selected window was [45 400; 47 600]. Such a window contains all the peaks above the baseline SNR values, which are around −45 dB in the current case. A compatible choice for the POIs can be found in the work introducing the ASCAD database [32]. Nonetheless, instead of selecting a prefixed number of samples, the current work considered time intervals enclosing the peaks with high SNR.
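One possible way to automate such a window selection is sketched below. It assumes that the SNR is converted to decibel as 10·log10 (being a ratio of variances) and that a peak is any sample exceeding the baseline by a margin; the margin, the padding, and the toy SNR profile are hypothetical parameters, not values used in this work.

```python
import math


def snr_to_db(snr):
    """Convert a variance ratio to decibel."""
    return [10.0 * math.log10(max(v, 1e-30)) for v in snr]


def poi_window(snr_db, baseline_db, margin_db=3.0, pad=0):
    """Smallest window enclosing all samples above baseline_db + margin_db."""
    above = [j for j, v in enumerate(snr_db) if v > baseline_db + margin_db]
    if not above:
        return None
    start = max(above[0] - pad, 0)
    stop = min(above[-1] + pad, len(snr_db) - 1)
    return start, stop


# Toy SNR profile: flat baseline at -45 dB with two peaks.
snr_db = [-45.0] * 100
snr_db[40] = -20.0
snr_db[60] = -25.0
window = poi_window(snr_db, baseline_db=-45.0)
padded = poi_window(snr_db, baseline_db=-45.0, pad=5)
print(window, padded)
```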
The number s of neurons at the input layer is consequently fixed by choosing the POIs. Then, different combinations were investigated by means of a random search to choose the number of hidden layers and the number of neurons per layer. The search was done by testing a number of hidden layers from 3 to 6 and a number of neurons per layer spanning from 100 to 300 with step 100. The optimal performance in terms of training and validation accuracy was achieved using four hidden layers with 200 neurons each.
Concerning activation functions for the neurons, this work took into account the tanh, the ReLU, and the Softmax. Empirical evidence showed that the ReLU activation function was the optimal choice for input and hidden layers, while the Softmax activation function was selected for the output layer. This is in accordance with typical choices for activation functions in artificial neural networks.
Different training epochs were also evaluated, namely, 100, 200, 400, and 1000. Among them, the evidence showed that there is no significant increase in training accuracy when using more than 200 epochs. Two loss functions and two optimizers were then evaluated. The categorical cross-entropy was chosen because it reached the highest training accuracy, and the RMSprop optimizer because it yielded the best validation accuracy. Finally, the learning rate was fixed at $10^{-5}$, as suggested by guides on RMSprop.
The attempted model hyperparameters and the respective optimal values for the third byte of the key are summarized in Table I. Note that the third byte was randomly chosen among the 14 bytes of the key to which masking was applied (worst case). For computational reasons, these same hyperparameters were also used for the other bytes of the key. Nonetheless, a further optimization was attempted whenever the resulting performance was unsatisfactory.

C. Profiling on Single Bytes
Once hyperparameters were identified, a model had to be trained for each byte before carrying out an attack (profiling phase). As discussed above, cross-validation was exploited to train and test the model multiple times and estimate its attack performance in terms of GE. Notably, a fivefold cross-validation was adopted. This section reports three representative cases of attacks on different bytes.
The mean rank assessed on the first byte of the key is flat on the minimum value, i.e., mean 1 and standard deviation 0. Therefore, the attack on this byte always reveals its value in a single attempt. This result is explained by the fact that there is no masking countermeasure in the ASCAD database for the first byte (nor for the second one). Indeed, the very same result was also obtained on the second byte of the key. Similar results were observed on unprotected AES [33].
Ultimately, the model identified on a byte with masking effectively attacked bytes without countermeasures. It is worth remarking though that the model had to be trained on the specific byte before performing the attack.
Fig. 5. Mean rank and associated standard deviation resulting from cross-validation considering the third byte of the AES key.
Fig. 5 shows the results of cross-validation applied to the third byte of the AES key. It is worth remembering that the ASCAD database involves a masking countermeasure for this byte. The mean rank (i.e., GE) and associated standard deviation are reported as a function of the number of test traces. This time the results demonstrate that guessing the third byte is almost random when exploiting a few attack traces, while the model can guess the exact byte value in a single attempt when the number of attack traces increases. The result is in accordance with previous works. Indeed, the same trend was observed with optimal convolutional neural networks in [32] and [37]. Moreover, the assessed uncertainty can explain the performance of a multiple-input MLP [38].
An attentive reader should have also noted that the minimum value for a rank is saturated at 1 in the figure because at least one attempt must be made to guess a byte value. Hence, the bottom of Fig. 5 and of following figures was clipped at 1. For a similar reason, the upper part of these figures is saturated at 256, which is the maximum possible number of attempts.
Similar results were obtained for other bytes, i.e., GE was minimized as the number of exploited attack traces increased. The only exception was found for the 13th byte of the AES key. Therefore, as a third representative case, Fig. 6(a) shows the result of cross-validation on byte 13 of the AES key. This byte was masked too. The mean rank and the associated standard deviation demonstrate that the number of guessing attempts remains halfway between the best and worst rank values without any beneficial effects deriving from more attack traces.
As commented in [39], this evidence suggested that the optimal hyperparameters of Table I were a bad choice for such a byte, and therefore an attempt to reoptimize hyperparameters was made. The result after optimization is thus reported in Fig. 6(b). This result appears compatible with the ones exemplified in Fig. 5, and it was achieved by increasing the number of hidden layers from 4 to 6 and the number of neurons per layer from 200 to 300.
Ultimately, the uncertainty assessment of these results allows a better vulnerability analysis with respect to the cited literature, where only the mean rank is reported for each attack.

D. Attack on Single Bytes and Entire Key
After assessing the GE of the models by cross-validation, the final models were trained byte by byte using the whole subset of training traces. They could thus be used to attack the target byte by exploiting the remaining independent traces of the test set. The rank as a function of the number of attack traces was obtained by performing such attacks. Each of these curves had to be compared with the band resulting from cross-validation. Fig. 7(a) and (b) shows an example of an attack performed on byte 3 and byte 13, respectively.
The mean rank and standard deviation obtained with cross-validation are recalled with a shaded band. Instead, the red line represents the rank of an attack performed using the respective byte model. It can be seen that the attack ranks are basically within the band. Therefore, this validates the model performance estimated during profiling.
Given that, the performance of an attack on the entire key could be investigated with the Monte Carlo method. A general-purpose laptop was exploited and, in accordance with the proposed criterion, the analysis stopped at 100 000 iterations (about 20 min). The results are reported in Fig. 8. This shows that with at least 4000 attack traces, the entire key can be discovered in less than $10^6$ attempts. This must be compared with the total number of possible key values, which is in the order of $10^{38}$. Moreover, there is a relevant probability of guessing the entire key in a few attempts. The last aspect is highlighted in Fig. 8 with a color map representing probability density for rank values.
From the results of the Monte Carlo analysis, particular attention was paid to the histograms of occurring ranks in the case of 10, 4000, and 10 000 attack traces. These correspond to the values of the scatter plot of Fig. 8 given the respective number of attack traces. As an interesting figure of merit, the resulting probabilities for guessing the entire AES key in less than ten attempts were 0%, 10%, and 19%, respectively.
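Such a figure of merit can be derived from the Monte Carlo ranks as the fraction of draws below the chosen number of attempts. The sketch below uses invented rank values and hypothetical function names; only the 10% probability passed to the second function is taken from the reported results.

```python
def probability_below(ranks, max_attempts=10):
    """Fraction of Monte Carlo draws with a key rank below max_attempts."""
    return sum(1 for r in ranks if r < max_attempts) / len(ranks)


def expected_compromised(probability, n_devices=100):
    """Expected number of similar devices whose key falls within the budget."""
    return probability * n_devices


# Invented key ranks standing in for the output of a Monte Carlo analysis.
ranks = [1, 4, 9, 10, 250, 1e6, 3e7, 5e9, 2e12, 8e20]
p_toy = probability_below(ranks)
n_expected = expected_compromised(0.10, n_devices=100)
print(p_toy, n_expected)
```

With the 10% probability reported for 4000 attack traces, the expected number of compromised devices out of 100 similar ones is ten.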
Overall, the results demonstrate the vulnerability of cryptographic devices to machine-learning-based attacks and quantify the probability of discovering the entire AES key through a side-channel attack. This appears especially relevant for an Internet-of-Things scenario, and it suggests a limit for the number of similar devices to be installed when aiming to prevent cyberattacks.

IV. CONCLUSION
This article has presented a profiled attack employing machine learning for power analysis attacks on cryptographic devices. In particular, its main contribution consisted of estimating an uncertainty for attack ranks associated with the discovery of a single byte and then the entire AES-128 key. Cross-validation allowed such estimation for single bytes, while attacks on independent power traces validated the rank intervals identified by the mean and standard deviation of ranks in the profiling phase. Then, the ranks associated with discovering the entire AES-128 key were calculated with the Monte Carlo method.
The approach of the present work relies on the peculiarities of a side-channel attack, which makes it possible to extend and specifically adapt the already existing literature approaches.
The results have demonstrated that there is a 10% probability to retrieve the secret key as a whole with less than ten attempts when exploiting about 4000 attack records (traces). Note that the numeric results necessarily depend on the exploited dataset, which is strictly related to the target devices. However, such a result generally implies that security issues hold in an Internet-of-Things scenario, where multiple similar devices are installed and physically accessible, and that the proposed method allows better vulnerability testing to adopt stronger countermeasures to side-channel attacks.
Future works may also investigate different approaches for uncertainty quantification and compare the proposed method with the advantages and disadvantages of the state-of-the-art approaches.