Auto-tune POIs: Estimation of distribution algorithms for efficient side-channel analysis

Due to the constant increase and versatility of IoT devices that must keep sensitive information private, Side-Channel Analysis (SCA) attacks on embedded devices are gaining visibility in the industrial field. The integration and validation of countermeasures against SCA can be an expensive and cumbersome process, especially for less experienced evaluators, and current certification procedures require attacking the devices under test with multiple SCA techniques and attack vectors, often implying a high degree of complexity. The goal of this paper is to ease one of the most crucial and tedious steps of profiling attacks, i.e., the selection of points of interest (POIs), and hence assist the SCA evaluation process. To this end, we introduce the usage of Estimation of Distribution Algorithms (EDAs) in the SCA field in order to automatically tune the point of interest selection. We showcase our approach on several experimental use cases, including attacks on unprotected and protected AES implementations over distinct copies of the same device, thereby addressing the portability issue.


Introduction
Today, with the growing presence of Industry 4.0 and IoT devices in our lives, more and more customers and product developers face the critical importance of the cybersecurity of small embedded devices. Consequently, this has led to a general increase in interest in Side-Channel Analysis (SCA) and countermeasures, and more generally in physical attacks on, among others, IoT devices. The main challenge is that the integration and validation of countermeasures against SCA is a long, expensive, and complex process in practice. Assessing the security of an electronic system against SCA requires skills and expertise from very different fields (electronics and hardware, signal processing, statistics, cryptography, deep learning, etc.), and the experience of a security analysis professional is often a crucial part of the operation. This is partly due to current certification processes like EMVCo [22] or Common Criteria (CC) [18], which require evaluating the robustness of the Device Under Test (DUT) by performing a battery of distinct side-channel attacks (such as differential power analysis (DPA) [34], correlation power analysis (CPA) [10], mutual information analysis (MIA) [24,8], template attacks (TAs) [14,15], and deep learning-based attacks (DL-SCA) [41,53,43]). The motivation is to quantify the security of a system against SCA by taking into account whether the attacks are successful and the number of resources they require. This approach is overseen by organizations like ANSSI [3] and BSI [1], and the amount of time and resources needed to perform this kind of evaluation is constantly growing (with new attacks and techniques being proposed), making it infeasible to assess the security of a system against SCA in a fast, efficient, and low-cost manner.
To mitigate this issue, several leakage assessment techniques (like Test Vector Leakage Assessment, TVLA [28]) have arisen, which aim to determine whether a device leaks information through side channels. The problem is that the security of a system against SCA cannot be assessed simply by applying these techniques: leakage assessment tests only determine whether there is any leakage in the power traces, without specifying whether that leakage is critical or giving any hints on how to exploit it. Therefore, leakage assessment tests do not solve the problem: to properly evaluate the security of an embedded system against SCA, it is mandatory to attack it exhaustively with known SCA techniques (including TAs), with all the complexity that this entails.
The biggest challenge for TAs in particular, but also for profiling attacks in general, is finding the proper time samples containing the leakage information (usually named Points Of Interest, POIs). Thus, to ease the evaluation process in general, and TAs specifically, we propose to perform POI tuning automatically, together with the template building and key recovery steps. This not only allows expert evaluators to save time and parallelize tasks (improving the efficiency of the process) but also helps technicians without deep knowledge of all the underlying methods to implement TAs properly.
Thus, the main contribution of this paper is the introduction of a novel, advanced, and automated search strategy for the Point of Interest (POI) tuning issue. We demonstrate that this approach straightforwardly provides state-of-the-art results by searching for the best groupings of POIs in the space formed by all possible groupings. More specifically, the contribution of this work can be divided into the following parts: 1. We propose a novel, fully automated approach to the POI tuning issue. This approach not only improves the state of the art in terms of performance but also mitigates complexity issues. This is accomplished by applying Estimation of Distribution Algorithms (EDAs) [44,36,50,40,17] which, to the best of our knowledge, have never before been applied in the SCA field. We name this technique Estimation of Distribution Algorithm-Based Profiling Attack (EDA-Based PA). 2. We demonstrate our approach on different datasets and devices, including an unprotected hardware AES implementation on FPGA (the AES HD dataset) and a protected (masked) software AES implementation on a microcontroller (the ASCAD dataset [55]). Thus, we prove that our EDA-Based PA is suitable for different implementations and leakage models. 3. Moreover, in order to demonstrate the applicability of our method in a more realistic context, i.e. considering portability (see Sec. 3.2), we introduce a new dataset: the AES PT dataset (Sec. 6). AES PT is the first open dataset for SCA which includes power analysis traces of four different copies of the same hardware device: a high-performance ARM Cortex-M4 32-bit RISC microcontroller (STM32F411VE [66]). This dataset was created with the idea of enabling "realistic" TAs and therefore includes subsets of traces of each clone device running unprotected and protected AES-128 implementations, with both fixed and random cryptographic keys. 4. Finally, we demonstrate the suitability of our technique on this AES PT dataset, and hence in a portable template attack scenario. We show how, even in this real-world scenario, our EDA-Based PA can break protected implementations on several clone devices with the same power model.
The paper is organized as follows. Sec. 2 briefly reviews the background on profiling attacks. The relevant work on automatic SCA, portable TAs, and TAs on masked implementations is surveyed in Sec. 3. The Estimation of Distribution Algorithms approach to automate POI selection in the SCA scenario is given in Sec. 4. Sec. 6 introduces the AES PT dataset. Sec. 5 and Sec. 7 contain the experimental results supporting our method. Finally, Sec. 8 concludes the paper.

Background on Profiling attacks
Profiling attacks have become an archetype for SCA in recent years. The main idea of these attacks is to generate a model of the power consumption of a device, to be used for the recovery of sensitive information (e.g., a cryptographic key). Therefore, these attacks consist of two phases: a profiling phase, in which the model is built from a relatively large number of power traces, and an attack phase, in which the model is applied and the secret key is recovered with only a few traces. There exist different types of profiling attacks depending on the technique used for generating the model in the profiling phase: the model can be generated using standard classification techniques, as in the first publications on Template Attacks (TAs) [14,56]; Machine Learning (ML) techniques such as Support Vector Machines (SVM) [31,30,37], Random Forests (RF) [38], or the more recently introduced Deep Learning (DL) [41,12,54]; or even regression, often called the stochastic models approach [62]. Nevertheless, "classical" TAs and ML are the two most popular approaches [39], and in this paper we focus on the former because it is a well-known and well-understood technique in the field of SCA. However, the approach remains valid in a broader context, not just for TAs, although we pick TAs to demonstrate results.

Notation
In this section we briefly describe the notation used throughout the document. In general we follow the notation proposed in [42], with some modifications. The calligraphic letter T denotes a set of traces t (also called leakage vectors). In turn, each trace is formed by T time samples t = {t_1, t_2, ..., t_T}. The total number of power traces t in a set of traces T is denoted by |T|. We use v = f(p, k) for the targeted intermediate value, which is related to a public variable (usually the plaintext p) and a cryptographic primitive (secret key k). The calligraphic letter K denotes the set of all possible keys, the secret key used by the cryptographic algorithm (the correct key) is denoted by k*, and the total number of key hypotheses is denoted by |K|. Regarding TAs, we denote each template by h = (m, C). The probability that x = i is denoted by p(x = i). In the case of binary variables, we denote the probability that x = 1 by p(x = 1) or simply p(x). Finally, given a set of power traces, the attack outputs a key guessing vector g = {g_1, g_2, ..., g_|K|} in decreasing order of probability. We define Guessing Entropy (GE) [65] as the average rank of k* in g over multiple experiments.
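As an illustration of the metric defined above, Guessing Entropy can be computed from the ranks of the correct key across repeated attacks. The following is a minimal sketch (ranks are 0-indexed here; the function and variable names are ours, not from the paper):

```python
import numpy as np

def rank_of_key(scores, k_star):
    """Rank (0 = best) of the correct key k* in the guessing vector g,
    where g orders the key hypotheses by decreasing score."""
    g = np.argsort(scores)[::-1]          # key hypotheses, most to least likely
    return int(np.where(g == k_star)[0][0])

def guessing_entropy(ranks):
    """Guessing Entropy: average rank of k* over multiple experiments."""
    return float(np.mean(ranks))
```

A GE close to 0 means the attack consistently ranks the correct key first.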

Template attacks
Template attacks (TA) were proposed by Chari et al. [14] for SCA as the first form of profiling attacks. They are based on building a multivariate model of the probability distribution of the leakage, which is commonly generated assuming that the leakages follow a Gaussian distribution (parametric estimation). In these attacks (and in SCA in general) it is common to work with power traces taken when the modeled device is handling some intermediate value v = f (p, k) related to a public variable (usually the plaintext p) and the secret parameter (key k ).
For that purpose, in the profiling phase the attacker uses a set of n_p profiling traces (T_p) to build a Gaussian multivariate model, fully defined by a mean vector and a covariance matrix (m, C) [14], for each possible intermediate value v, creating the so-called templates. Formally, the PDF describing the multivariate normal distribution of a leakage vector t = {t_1, t_2, ..., t_T} of length T is given by Eq. (1):

p(t | v) = (1 / sqrt((2π)^T |C|)) · exp(−(1/2) (t − m)' C^{−1} (t − m))    (1)

Then, in the attack phase, from a set of n_a attack traces (T_a) and their input data (plaintexts), the attacker tries to guess the secret key. In order to do so, the attacker makes hypotheses about the secret key and calculates the possible intermediate values v_{i,j}. Note that the intermediate values v_{i,j} depend on the input data d_i (1 ≤ i ≤ D) and the key hypothesis k_j (1 ≤ j ≤ K), and hence each key hypothesis k_j suggests a template (m, C) for each input value. Then, a discriminant score D(k_j | t_i) is computed for each key hypothesis k_j, and the key hypotheses are ranked in decreasing order of probability. Given a power trace t_i, a commonly used discriminant derived from Bayes' rule is given by Eq. (2):

D(k_j | t_i) = p(t_i | k_j) · p(k_j)    (2)

This discriminant is obtained by omitting the denominator of Bayes' rule (Eq. (3)), since it is the same for every key hypothesis k_j:

p(k_j | t_i) = p(t_i | k_j) · p(k_j) / p(t_i)    (3)
Moreover, if we assume that p(k_j) is uniform, applying Bayes' rule is equivalent to computing the likelihood as in Eq. (4):

D(k_j | T_a) = ∏_{i=1}^{n_a} p(t_i | k_j)    (4)
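The profiling and attack phases described above can be sketched as follows. This is a minimal NumPy illustration (in log-likelihood form for numerical stability), not the exact implementation used in the paper; `f(p, k)` stands for the intermediate-value function, and the covariance estimate assumes enough profiling traces per class to be non-singular:

```python
import numpy as np

def build_templates(traces, labels):
    """Profiling phase: one Gaussian template h = (m, C) per intermediate value v."""
    templates = {}
    for v in np.unique(labels):
        X = traces[labels == v]
        templates[v] = (X.mean(axis=0), np.cov(X, rowvar=False))
    return templates

def log_gauss(t, m, C):
    """log p(t | v) for the multivariate normal PDF of Eq. (1)."""
    d = t - m
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (len(t) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(C, d))

def score_keys(attack_traces, plaintexts, templates, f, n_keys):
    """Attack phase: log form of Eq. (4), summing log p(t_i | k_j) per hypothesis."""
    scores = np.zeros(n_keys)
    for k in range(n_keys):
        for t, p in zip(attack_traces, plaintexts):
            m, C = templates[f(p, k)]
            scores[k] += log_gauss(t, m, C)
    return scores
```

Ranking `scores` in decreasing order yields the key guessing vector g.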

Point of Interest selection
Although TAs are optimal from an information-theoretic point of view, in their original formulation they pose a number of limitations, among which computational complexity and the need for dimensionality reduction are the most critical ones [15]. Dimensionality reduction is commonly addressed in practice by selecting only a small subset of the typically large number of time samples of the power traces (Point of Interest (POI) selection [57]). In order to do so, the evaluator should select only the points in the power trace that contain the (most relevant) leakage, without losing other important information. This is typically done based on what we call "POI selection graphics" (see Fig. 3 and 10). These graphics are obtained by applying certain functions (correlation, SOSD [25], SOST [25], SNR [42], ML-based [52]) to the power traces. Typically, a certain number of samples at the highest values of those functions are selected for building the templates. There exist other techniques for reducing the number of samples in each power trace (often called compression methods), like Principal Component Analysis (PCA) [5,64] or Fisher's Linear Discriminant Analysis (LDA) [32,23], but here we focus on "classical" sample selection (for POI selection) as the most widely used technique in practice (without loss of generality). In addition, it should be noticed that the POI selection stage can be decisive and have a huge influence on the attack results, especially when we consider a portable scenario [59]. Nevertheless, most related works start the analysis by assuming that the POIs are already pre-selected. Picek et al. identified this problem and compared the performance of several feature selection techniques in a TA scenario [52]. However, our approach is different, since the EDA-Based PA can be used together with this kind of technique or in a standalone way.
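As a concrete example of a "POI selection graphic", the per-sample SNR can be computed and thresholded as below. This is a generic sketch (function names are ours), shown only to make the selection step tangible:

```python
import numpy as np

def snr(traces, labels):
    """Per-sample signal-to-noise ratio: variance of the class means over
    the mean of the class variances (one common POI selection graphic)."""
    classes = np.unique(labels)
    means = np.array([traces[labels == v].mean(axis=0) for v in classes])
    varis = np.array([traces[labels == v].var(axis=0) for v in classes])
    return means.var(axis=0) / varis.mean(axis=0)

def top_pois(traces, labels, n_poi):
    """Select the n_poi time samples with the highest SNR."""
    return np.argsort(snr(traces, labels))[::-1][:n_poi]
```

The indices returned by `top_pois` are the candidate samples used to build the templates.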

Automatic SCA
In the field of SCA, only a few papers discuss its automation. Moreover, as far as we know, no paper studies the automation of SCA in the context we are addressing here: profiled attacks on cryptographic implementations. Specifically, the only papers that discuss automatic SCA focus on using cache-timing attacks to automatically and more efficiently exploit complex Linux/Windows operating systems, which is a very different area of SCA. For instance, the authors of [29] perform a novel automated two-stage attack on the T-table-based AES implementation of OpenSSL using cache-timing template attacks [11]. Also, in [63] Schwarz et al. presented a fully automated approach to find subtle differences in browser engines caused by the environment, and presented two new side-channel attacks on browser engines. They automatically collected all data available to the JavaScript engine to build templates. It should be noticed that, when we speak about "templates" or "template attacks" in the context of cache-timing attacks, we are referring to a different concept from the attacks covered in this paper. Cache template attacks were proposed in [11] and are named templates because they are inspired by Chari's work [14], sharing the same spirit of performing the attacks in two steps (profiling and attack). In short, in the profiling phase, dependencies between the processing of secret information (e.g., specific key inputs or private keys of cryptographic primitives) and specific cache accesses are determined. Then, in the attack phase, secret values are derived based on the observed cache accesses.
Nevertheless, our approach is different, since we address another concept of template attacks in a different field of SCA: we focus on an automatic search for the points in the power trace where relevant leakage occurs and which yield the best attack results. Thus, we claim that this approach can mitigate the need for a human in the loop of such procedures, since EDAs allow us to automate POI tuning together with the template building and key recovery steps.

Portability of Template Attacks
The original idea of profiling attacks implies having two devices: the target device and a clone device which we fully control for building the power consumption model. In practice, however, traces for both the profiling and attack phases are often captured from the same device in most related works on this topic [14,57,5,25,62,39,33,55,52,31,30,37,38], even though this is not a realistic use case. In fact, applying a power consumption model built with one device to a different copy (a concept referred to as portability) is sometimes considered trivial, while it can be a challenging matter, as some previous works have identified [21,58,16,9,59]. The main reason is that there can still be differences between "identical" devices or experimental setups. In practice, small variations in the construction of a device, or aging, can cause behavioral deviations in the power consumption which can eventually lead to a failed attack. The same happens with subtle changes in the experimental setups used to acquire the traces: environmental changes, I/O interference, resonance due to LC and RC oscillators, influence of the past state, variations in the magnetic field penetration, etc.
To the best of our knowledge, there exist only a few papers on profiling attacks that consider portability [35,26,13]. In [21] the portability issue was identified, and waveform realignment and normalization of acquisition campaigns were proposed to improve the performance of portable TAs. In [16,15] the authors performed different attacks over four copies of the same device to assess the impact of using different copies. A more recent work [9] compares the performance of different machine learning techniques in the context of portable profiling attacks. In [59] the authors propose an improved POI selection technique to address the portability of data-loading template attacks. They focused on finding points of common leakage in the power traces, avoiding the ones that perturb the model and make it limited to a particular copy of the device.
In this paper, we address the portability of TAs with EDAs in two use cases (Sec. 7.1 and 7.2) and using the new aforementioned AES PT dataset, on which we are able to break a masked AES implementation in four clone devices with the same probabilistic model.

Template Attacks on Masking
In [46], Oswald et al. discussed different types of TAs on masked AES software implementations on an 8-bit microcontroller. They applied two types of attacks: TAs combined with second-order techniques, and template-based DPA attacks with an extra calculation considering the masks, see Eq. (5):

p(t_i | k_j) = Σ_{m=1}^{M} p(t_i | k_j, m) · p(m)    (5)

where t_i represents a power trace, k_j represents a key hypothesis, and m ranges over the different values the mask can take (M being the number of different values the mask can take).
Regarding the attacks, all of them break the implementation (with a different number of traces), concluding that, in the scenario of TAs, there is no difference in the security of masked and unmasked implementations. The best attack is claimed to be the template-based DPA, which can recover the key from about 15 traces (using 10k traces in the profiling phase).
Later, several works performed "regular" template-based DPA attacks (i.e., without the extra calculation considering the masks) with successful results [33,55]. On the other hand, since in masked implementations POI selection graphics are not conclusive (because the intermediate value is randomized by the mask), the POI selection step is usually performed using DPA attacks [46] or PCA/LDA [33,55]. In this work, we follow a different approach, since POI selection is done intrinsically in our EDA-Based PA, as explained below.
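The mask-marginalization of Eq. (5) can be sketched as follows. This is a simplified one-POI illustration under our own assumptions (univariate templates indexed by the masked intermediate value u = v XOR m, uniform mask prior); the paper's actual templates are multivariate:

```python
import numpy as np

def log_gauss1d(t, m, s2):
    """Log PDF of a univariate Gaussian (a one-POI template, for illustration)."""
    return -0.5 * (np.log(2 * np.pi * s2) + (t - m) ** 2 / s2)

def masked_key_score(trace, v_hyp, templates, masks):
    """Eq. (5): p(t | v) = (1/M) * sum_m p(t | v XOR m), marginalizing the
    unknown mask m. templates[u] = (mean, var) for the masked value u."""
    logps = np.array([log_gauss1d(trace, *templates[v_hyp ^ m]) for m in masks])
    a = logps.max()
    return a + np.log(np.exp(logps - a).mean())   # log-sum-exp; mean = uniform p(m)
```

During profiling the masks are assumed known, so the per-u templates can be built directly.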

The search for points of interest by means of randomized optimization heuristics
As mentioned above, the goal of using randomized optimization heuristics is to automate POI tuning together with the template building and key recovery steps. The idea is to search the space formed by all possible groupings of POIs in the power trace for the best ones, that is, the POIs that turn attacks into reality. However, exhaustive enumeration of every combination of points of the power traces is exponential: if the number of points is T, the number of possible combinations is 2^T, which makes exhaustive enumeration impractical with only dozens of points. For this kind of task, optimization heuristics have gained popularity due to their high efficiency in finding optimal solutions to complex problems where exact methods take too much time and too many CPU resources.
In the field of optimization heuristics, several approaches can be used, including genetic algorithms among others. In this work, a new search strategy is proposed, based on a quality measure combined with recent, efficient evolutionary computation algorithms: estimation of distribution algorithms.

Estimation of distribution algorithms
Estimation of distribution algorithms (EDAs) are a class of evolutionary optimization methods developed over the last decade as a natural alternative to genetic algorithms (GAs) [44,36,50,40,17].
EDAs show several advantages over genetic algorithms, such as the absence of multiple parameters to be tuned (e.g., crossover and mutation probabilities) and the expressiveness and transparency of the probabilistic model that guides the search. This class of evolutionary optimization methods has proven better suited than GAs in some applications, achieving competitive and robust results in the majority of tackled problems.

Introduction
EDAs, like GAs, work with a population of candidate solutions (in our case, candidate groups of POIs). Fig. 1 shows the general schematic of any EDA approach.

[Fig. 1: the EDA cycle — sampling of the population, evaluation, selection of N < R individuals, and induction of the probability model]
Initially, a random sample of candidate groups of POIs is generated. These candidates are evaluated by means of an objective function. According to this evaluation, the best candidates are selected. Then, the selected solutions are used to learn a probabilistic model, and from this new model a new set of groups of POIs is sampled. The process is iterated until the optimal value has been found or another termination criterion is fulfilled.
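The cycle just described can be sketched generically as follows, here with the simplest (univariate Bernoulli) model; the fitness function, population sizes, and clipping bounds are illustrative choices of ours:

```python
import numpy as np

def eda_loop(fitness, T, R=50, N=10, iters=30, seed=0):
    """Generic EDA cycle of Fig. 1: sample R candidates from the model,
    evaluate them, select the N < R best, and re-learn the model."""
    rng = np.random.default_rng(seed)
    p = np.full(T, 0.5)                              # univariate Bernoulli model
    for _ in range(iters):
        pop = (rng.random((R, T)) < p).astype(int)   # sampling
        scores = np.array([fitness(x) for x in pop]) # evaluation
        best = pop[np.argsort(scores)[::-1][:N]]     # selection of N < R individuals
        p = best.mean(axis=0).clip(0.05, 0.95)       # induction of the model
    return p
```

For example, with `fitness=lambda x: x.sum()` (the OneMax toy problem) the marginal probabilities converge towards the upper clipping bound.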

A basic taxonomy of EDAs
The literature shows a variety of models and learning algorithms, and selecting the best model for a given problem is not always straightforward. One criterion is to balance the computational cost of learning against the complexity of the probabilistic model. Both aspects are strongly related to the problem dimensionality (i.e., the number of variables) and to the type of variable (e.g., continuous, discrete, mixed).
Simple models have low requirements in terms of computational resources and learning complexity, whereas complex models can represent richer relationships between variables but require more sophisticated data structures and costly learning processes. Depending on the problem, researchers should weigh search efficiency against the cost associated with the chosen strategy. An additional criterion is prior knowledge, and choosing a probabilistic model able to represent this knowledge.
In order to help the researcher find a suitable EDA type, EDAs can be classified according to the dependencies between variables as follows:
- Univariate EDAs assume that all variables are independent, and therefore the joint probability can be factorized as a product of univariate marginal probabilities. An example of this kind of model is the Univariate Marginal Distribution Algorithm (UMDA) [45,6,27]. These algorithms are the simplest EDAs, with the best CPU performance in terms of time.
- Bivariate EDAs represent low-complexity interdependencies between variables. Examples of this class are mutual information maximization for input clustering (MIMIC) [20], the bivariate marginal distribution algorithm (BMDA) [51], dependency tree-based EDAs [7], and the tree-based estimation of distribution algorithm (Tree-EDA) [61].
- Multivariate EDAs factorize the joint probability distribution considering statistics of order greater than two. The complexity of the model, as well as the effort required to estimate the parameters that best suit the selected points, is significantly greater. Algorithms that belong to this group are EBNA [7] and BOA [49].
For additional and detailed information regarding the different models that constitute the family of EDAs, see [36,48].

EDAs in a SCA scenario
In order to explain how EDAs can be applied in the SCA scenario, we focus on an AES implementation, as in the experimental use cases described in Sec. 5 and Sec. 7. Given a set of power traces (with T samples per trace) labeled with an 8-bit value corresponding to the processed intermediate value (as is typical for SCA on AES implementations), the problem consists of selecting the best points of interest (samples) for building templates in a TA scenario [14,57]. This task can be performed efficiently by using an EDA which explicitly represents each of the elements involved in the problem, in this case the samples of the power traces. Our approach is to consider a vector of binary variables (candidate points of interest) of length T:

x = {x_1, x_2, ..., x_T}, with x_n ∈ {0, 1}    (6)

The cardinality of the search space for this problem is 2^T. Each binary variable {x_1, x_2, ..., x_T} corresponds to one time sample of the power traces {t_1, t_2, ..., t_T}. If the value of a certain binary variable is 1 (x_n = 1), the corresponding sample t_n of the power traces will be selected for building the templates; otherwise it is discarded. This process is depicted in Figure 2.
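The binary encoding above amounts to masking the trace columns, as in this minimal sketch (the function name is ours):

```python
import numpy as np

def select_pois(traces, x):
    """Keep only the time samples flagged by the binary candidate vector x
    (x[n] = 1 means sample t_n is used for the template build)."""
    return traces[:, np.asarray(x, dtype=bool)]
```

The compressed traces returned here are what the TA evaluation of each candidate operates on.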
The selection of points of interest requires handling hundreds or thousands of variables, which may involve complex models whose CPU-time and memory requirements can be huge. Furthermore, the exponential nature of the problem becomes a limitation as the number of samples increases. Choosing complex models in which relationships between samples are explicitly represented has relevant disadvantages:
- The first is associated with the complexity of the model when representing points of interest. If each individual of the population must represent a codified set of points of interest, a more complex probabilistic model is required. As a consequence, the following parts are affected:
  • The size of the population. To correctly estimate the parameters of the model, a minimum number of instances is needed, and this number increases with the complexity of the model. This represents a significant increase in time and computational resources.
  • The sampling of the model. An increase in complexity necessarily involves an increase in time and CPU resources, which can make sampling the model prohibitive.
  • The learning of the probabilistic model. A more complex model requires more computations to correctly estimate its parameters.
  • The computation of the quality measure. With bigger populations, the increase in computation time is very significant and can also be a limiting factor.
- Finally, our proposed algorithm performs a search in a huge space that is exponential in the number of variables considered. Any increase in the number of variables has a great impact on the difficulty of finding promising POI candidates.
Therefore, due to the prohibitive costs cited above, the use of complex models between points of interest is discarded. Based on this premise, EDAs such as UMDA are suitable candidates for the POI selection problem. The chosen model is the univariate marginal distribution algorithm for discrete domains, UMDA_d. This method considers a model with no interrelations between the variables, so the probability distribution to be learnt factorizes as:

p_l(x) = ∏_{i=1}^{T} p_l(x_i)    (7)

This distribution is the product of T independent probability distributions, where each p_l(x_i), associated with one point of interest, is a Bernoulli distribution over two values: 1 if the point is selected and 0 if it is not selected in the considered candidate vector. Based on this model, the parameters are estimated from the marginal frequencies of x_i = 1 in the selected subset of points of interest (Eq. (8)).

Description of the algorithm
Thus, since EDAs build explicit probabilistic models of promising candidates, each binary variable x_n has its own probability of being one, p(x_n = 1) = p(x_n), which is (re)computed after each iteration. Formally speaking, after each iteration we estimate the probability distribution p(x) of promising candidates from the highest-quality solutions. The proposed algorithm works as follows:
1. First, the initial population D_0 of R individuals is generated. This is usually done by assuming a uniform distribution on each variable (but can also be done manually). In other words, based on the probabilities p(x_n) of each binary variable being one, we sample an initial population of R individuals (POI selection candidates). Whenever possible, we propose to sample the initial population based on a POI selection graphic, as detailed below (Sec. 4.4). Then, each individual is evaluated: a TA is performed with the POI selection candidate, quantifying the success of the attack with a proper evaluation function. This evaluation function has to be set by the evaluator and depends on the desired result. We propose two different evaluation functions, based on the Guessing Entropy of the attacks, depending on the use case (Eq. (11) and (14)), as explained in Sec. 5.1, Sec. 5.2, Sec. 7.1 and Sec. 7.2.
2. Second, the population is ranked using the evaluation results in order to select a number N (N < R) of individuals from the previous population D_{l−1} and evolve towards the next one, D_l. We denote by D^N_{l−1} the set of N selected individuals from generation l − 1.
3. Third, the T-dimensional probabilistic model representing the interdependencies between the T variables is induced. This is also known as the learning procedure and is the most critical part of the process. As usual in practice, we start by considering the simplest case: variables are independent (no conditional probabilities). In other words, we (re)compute each probability p(x_n) based only on the number of times that x_n = 1 in D^N_{l−1}. Formally speaking, p_l(x_n) = p(x_n | D^N_{l−1}).
4. Finally, the new population D_l of R new individuals is generated by simulating the probability distribution obtained in the learning step (Step 3). As in the second step, the individuals are evaluated and the best set of individuals is selected. Steps 2, 3, and 4 are repeated until a stopping condition is reached (a certain number of iterations, uniformity in the generated population, etc.).
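The four steps above can be sketched as follows. Here `evaluate_ta(x)` is an assumed stub standing in for "run a TA with the POIs flagged by x and return a fitness (e.g., a Guessing-Entropy-based score, higher is better)"; it is not defined in this sketch, and the population sizes and clipping bounds are illustrative:

```python
import numpy as np

def eda_based_pa(evaluate_ta, T, R=40, N=8, iters=25, p0=None, seed=0):
    """Sketch of the EDA-Based PA loop with a UMDA_d model.
    evaluate_ta(x) -> fitness of the binary POI candidate x (assumed given)."""
    rng = np.random.default_rng(seed)
    p = np.full(T, 0.5) if p0 is None else np.asarray(p0, float)   # Step 1 model
    best_x, best_fit = None, -np.inf
    for _ in range(iters):
        pop = (rng.random((R, T)) < p).astype(int)     # sample R candidates (Step 4)
        fits = np.array([evaluate_ta(x) for x in pop]) # evaluate each candidate
        if fits.max() > best_fit:
            best_fit, best_x = fits.max(), pop[fits.argmax()]
        sel = pop[np.argsort(fits)[::-1][:N]]          # Step 2: keep the N best
        p = sel.mean(axis=0).clip(0.05, 0.95)          # Step 3: UMDA_d learning
    return best_x, best_fit
```

`p0` allows seeding the initial model from a POI selection graphic (Sec. 4.4) instead of the uniform default.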

Improving EDA-Based PA with POI selection techniques
The initial population of the EDA-Based PA can be generated from a uniform distribution (i.e., a Bernoulli distribution with p(x_n) = 0.5) or in a more specific manner. We propose to sample the initial population from POI selection graphics (Sec. 2.3) in order to improve the performance of the EDA. It should be noticed that this approach cannot always be followed. For instance, in a masked implementation, POI selection graphics are not conclusive since the intermediate value is randomized (masked). Nevertheless, where applicable, the probability of each binary variable x_n being one (p(x_n = 1) = p(x_n)) can be obtained from Eq. (9).
Here, p(x_n) is a default probability value (i.e., a Bernoulli distribution with p(x_n) = 0.5) and α is computed from the POI selection graphic, where s_n is the corresponding sample of the graphic and s_Max and s_Min are its maximum and minimum values, respectively. By doing this, we accelerate the search, since the points with higher values in the POI selection graphic appear in most of the individuals (they have a greater probability), while samples which do not represent leakage (points with small correlation, SNR, SOST, etc.) receive the lowest probabilities.
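One plausible realization of this biasing is sketched below. Min-max normalization of the graphic is our illustrative assumption; the exact form of the paper's Eq. (9) is not reproduced here:

```python
import numpy as np

def initial_probabilities(graphic, p_default=0.5):
    """Bias p(x_n) with a POI selection graphic (SNR, SOST, correlation...).

    Min-max scaling of the graphic is an illustrative assumption, not
    necessarily the paper's Eq. (9).
    """
    s = np.asarray(graphic, dtype=float)
    alpha = (s - s.min()) / (s.max() - s.min())   # alpha in [0, 1]
    # Samples with strong leakage keep ~p_default; flat regions get ~0,
    # clipped so no sample is excluded outright.
    return np.clip(p_default * alpha, 0.01, 0.99)
```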

Experimental use cases: Single Device
In this section, we demonstrate the suitability of our method in a TA scenario (using the same device for both the profiling and attack phases) with two use cases: an unprotected AES implementation and a protected AES implementation.
It should be noticed that, as mentioned above, the proposed technique is suited for the context of side-channel evaluation. In this context, the strongest possible capabilities are commonly assumed for the attacker. That implies, among other aspects, that the attacker has physical access to the DUT and the capability to send a large number of chosen messages to the device. We also assume that the attacker knows the input and output data. In addition, it is likely that the attacker has some limited information about the internal workings of the device. These assumptions are made for all the experimental use cases proposed in this work.

Unprotected AES implementation on FPGA (AES HD)
In this use case, we showcase the applicability of EDAs in a simple TA scenario (unprotected cryptographic implementation, not considering portability). Although straightforward, the present use case represents a clear example of how EDAs can be applied in the SCA field. Nevertheless, more complex experiments are considered in the remaining use cases.

Target description: AES HD
The AES HD dataset includes 100 000 traces from an unprotected implementation of AES-128 on FPGA. The AES core is written in VHDL in a round-based architecture, taking 11 clock cycles for each encryption. The core is wrapped by a UART module in order to enable external communication, and the module is designed to allow accelerated measurements that avoid any DC shift caused by environmental perturbations. The design was implemented on the Xilinx Virtex-5 FPGA of a SASEBO GII evaluation board, with a total area of 1850 LUTs and 742 flip-flops. Side-channel traces were obtained by measuring the electromagnetic radiation produced by a decoupling capacitor (on the power line) with a high-sensitivity near-field EM probe and a Teledyne LeCroy WaveRunner 610Zi oscilloscope. A suitable and commonly used leakage model when attacking the last round of an unprotected hardware implementation is the register writing in the last round [2].

Attack details
Table 1 gives the details of the attack. It should be noticed that, as usual, for the profiling phase we use 10 000 random traces of the device processing different intermediate values, i.e. all 256 possible values v can take, since it is an 8-bit value. On the other hand, for the attack phase we use 300 traces of the device processing the same v. These attack traces are obtained from the dataset by selecting at random 300 traces of the device processing a certain v value, itself chosen at random. This is done on purpose: although it makes our use case "simpler" than other related works [33,52] (which perform the attack phase with 25 000 traces with different associated v values), it represents a clear and straightforward example that serves to explain our approach. Moreover, this use case indeed corresponds to a "realistic" TA scenario.
In theory, in a "realistic" TA we can obtain profiling traces for all possible v values because we are using a clone device that we fully control. However, in such a scenario we cannot fully control the attacked device, which has its own secret key that we want to recover. Therefore, we can only obtain a small number of attack traces in which the cryptographic key is fixed, because it is linked to the device. In this particular case the plaintext would also be fixed, and thus the intermediate value is always the same; this could correspond to a use case in which the device sends the same encrypted value several times. In any case, more demanding use cases are considered in the rest of the experiments.

Regarding EDAs, even though they involve fewer parameters than other search techniques, we still have to adjust the number of iterations and the population size. We also have to select a proper score function which enables us to correctly evaluate each individual of the population. In this case, we consider the following formula to evaluate each individual (POI candidate):

f(x) = -CF · (ge/256)                          if ge ≠ 1
f(x) = -CF · (n_POI/n_samples) · (ge/256)      if ge = 1        (11)

where ge is the result of the attack (guessing entropy; rank of the correct candidate). The negative sign is there to minimize the rank of the correct candidate, n_POI is the number of POIs used to build the templates (to guide the EDA into minimizing the number of POIs once the attack is successful), and n_samples is the number of samples per trace. We also use a correction factor (CF) to accelerate the search. We divide ge by 256 (the maximum guessing entropy in this case) in order to bound its value between 0 and 1, and the ratio n_POI/n_samples is bounded in the same way. In addition, we use a POI selection graphic (Fig. 3), obtained from the profiling traces, to guide the search even more (as explained in Sec. 4.4).
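The evaluation function just described can be sketched as follows. The piecewise form is our reconstruction of Eq. (11), and `attack` is a placeholder for a TA that returns the rank of the correct key candidate:

```python
def score(candidate, attack, n_samples, cf=10.0):
    """Evaluate one POI-selection candidate (boolean mask).

    attack(candidate) must return the guessing entropy ge (rank of the
    correct key candidate, 1..256) of a TA built on the selected POIs.
    Higher scores are better: a failed attack is penalized by its rank,
    a successful one (ge == 1) by the number of POIs it needs.
    """
    ge = attack(candidate)
    n_poi = int(candidate.sum())
    if ge != 1:
        return -cf * (ge / 256.0)
    return -cf * (n_poi / n_samples) * (ge / 256.0)
```

Note how any successful attack scores above any failed one, and among successful attacks those with fewer POIs score higher, which is exactly the pressure the EDA needs.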
With this setup, we essentially have three variables (factors) to adjust: the correction factor, the number of iterations, and the population size (Table 2). Although it is not mandatory, we recommend using Experimental Design [67] (also known as Design of Experiments or DoE) in order to find proper values for the EDA's parameters, as in [60,47]. The idea is to identify and quantify the effect each variable (and each interaction of variables) has on the result. In a nutshell, we perform 2^3 = 8 experiments with all possible combinations of variables A, B and C at their low and high settings (Tables 2 and 3). Then we apply the DoE formula and compute the effect of each variable and of their interactions on the output of the attack (Fig. 4).
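In a full-factorial 2^k design, each main effect is simply the mean response at a factor's high setting minus the mean at its low setting. A small sketch of that computation (the responses fed to it are placeholders, not the values of Table 3):

```python
import numpy as np

def main_effects(responses):
    """Main effects of a full-factorial 2^k design.

    responses maps a tuple of k settings in {-1, +1} (low/high per
    factor) to the measured response. The effect of a factor is the
    mean response at its high setting minus the mean at its low one.
    """
    k = len(next(iter(responses)))
    effects = []
    for factor in range(k):
        hi = np.mean([r for s, r in responses.items() if s[factor] == +1])
        lo = np.mean([r for s, r in responses.items() if s[factor] == -1])
        effects.append(hi - lo)
    return effects
```

Interaction effects follow the same pattern, splitting on the product of the involved factors' settings instead of a single factor.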

Main effects and interactions
It should be noticed that the attack is successful in all the experiments (Table 3, column ge: the rank of the correct candidate is one in the whole column). Thus, although selecting proper values for the three variables in this use case is quite straightforward, it is a good "toy example" for showcasing how to apply the design of experiments and how to interpret the effect each variable has on the results. For instance, if we observe Fig. 4, we can see that variables B and C have a positive effect (the experimental results are better with the high setting). On the other hand, variable A has a negative effect (the experimental results are better with the low setting). Note that while this is the general trend, the best single result is Experiment 8 (all variables at their high setting), and hence we select that combination of variables for this use case. For a more detailed explanation of the procedure, we refer to [60,47].

Results
In Table 4, the results of the first iteration are shown. In the left part of the table we can observe two columns, Individual and Evaluation result, while in the right part we can see detailed information on the TA: the number of POIs used and the performance of the attack (guessing entropy; average rank of the correct key candidate). Please note that some of the individuals are not represented, in order to reduce the table's size.
The process is repeated until the last iteration (iteration 10), represented in Table 5. If we compare the results of the first and last iterations, we can observe that whereas in the first iteration the results of the attacks are quite poor, in the last iteration we succeed in the attack with almost all the individuals. A more graphical representation of the obtained results can be seen in Fig. 5. It shows the experimental results (guessing entropy) of using a "regular" POI selection (simply selecting the 20 time samples with the highest SOST values), the results of the best individual (POI selection candidate) of the first iteration, and the results of the best individual of the last iteration. It should be noticed that the "regular" POI selection yields the worst results (the rank of the correct candidate is 41 after 300 traces). On the other hand, both EDA-Based PA POI selections perform better: we obtain ranks of 13 and 1 after 300 traces, respectively. In conclusion, this experimental use case shows how our method optimizes the attacks until an optimal result is achieved (the rank of the correct candidate is one after the attack), improving on the results obtained with a "regular" POI selection with almost no effort.

Protected AES implementation
In this use case, we showcase the applicability of EDAs in a more challenging scenario: a protected AES implementation. In order to demonstrate that our approach can improve on state-of-the-art results, we are again using a freely available dataset, ASCAD [55].
Target: ASCAD dataset
ASCAD was presented in [55] as the first open database for DL-SCA. The target platform in this dataset is an 8-bit AVR microcontroller (ATmega8515) implementing a masked AES-128 cipher [42], and the traces are obtained by measuring the electromagnetic emanation of the device. The dataset provides 60 000 traces, of which 50 000 are used for profiling and 10 000 for the attack. These traces are a window of 700 relevant raw samples per trace, covering the third byte of the first-round masked Sbox operation. For a deeper explanation of the ASCAD dataset, we refer to [55]. As the sensitive intermediate value we use an Sbox output.

Attack details
Table 6 summarizes the specifications of the attack. We use 50 000 random traces for building the templates (profiling subset) and 300 random traces for the attack (attack subset). For the evaluation function, we use Eq. (11), as in the previous use case. In this case, since it is a masked implementation and obtaining POI selection graphics is not straightforward, we sample the initial population from a Bernoulli distribution with p = 0.1.
Again, we need to adjust the correction factor, the number of iterations, and the population size (Table 7). In order to do so, another experimental design has been performed (Fig. 6 and Tables 7 and 8). Here we have selected different low and high settings for variables B and C (higher values, Table 7). This is because this use case is considerably more difficult than the previous one, as we are targeting a protected AES implementation and cannot accelerate the search using a POI selection graphic.

In practice, this would commonly imply using a larger population size and more iterations, and hence we consider bigger values in the experimental design (Table 7). In fact, the outcome of the experimental design confirms that we obtain the best results with the setup used in Experiment 8 (correction factor of 10, 20 iterations, and population size of 50), although the attack succeeds in all 8 experiments of the DoE anyway.
It should be noticed that even bigger values for variables A, B, and C could have been selected. However, since the attack succeeds in all 8 experiments, we claim this is not necessary: the improvement is not significant considering the increase in the required computational effort. Moreover, as our experimental design shows, most of the combinations reach state-of-the-art performance, which is another strength of this methodology. Nevertheless, we select the setup of Experiment 8 for the remaining experiments, as it shows a good balance between performance and resources.

Results
In Tables 9 and 10, the results of the first and last iterations are shown, presented in the same manner as in the previous use case. We can observe that although in the first iteration the results are quite poor, in the last iteration we succeed in the attack with all 50 individuals. In addition, a graphical representation of the results can be seen in Fig. 7.
On the other hand, if we compare these results with related works, it is clear that our EDA-Based PA approach provides state-of-the-art results (see Table 11). There we can observe the number of attack traces required to reach a guessing entropy of 0 (N_tGE) for the work in [55] using TAs, for the work in [68] (to the best of our knowledge, the deep learning model with the best reported performance on this dataset), and for our approach. It should be noticed that a meaningful comparison would also require comparing computational resources. However, as our technique is at a very early stage with a lot of room for optimizing its computation, and [68] does not report the computational resources spent, we compare the results in terms of guessing entropy only.
           Template Attacks [55]   Deep Learning [68]   EDA-Based PA
  N_tGE    ≈ 450                   191                  ≈ 150
Table 11: Comparison of performance on ASCAD

In conclusion, our approach performs better, in terms of guessing entropy, than the deep learning attacks performed by Zaid et al. in [68] and the TAs performed by Prouff et al. in [55].

The AES PT dataset
Our portability experiments rely on four copies of the same device, referred to as D1, D2, D3, and D4 from now on. As mentioned in the introduction, this dataset was created with the idea of enabling "realistic" TAs and therefore includes subsets of traces of each clone device running unprotected and protected AES-128 implementations (see Sec. 6.2), with both fixed and random cryptographic keys.
6.1 Acquisition specifications
Fig. 8 shows a picture of the experimental setup used to acquire the power traces. In more detail, the devices encrypt 16-byte random plaintexts using two software AES implementations: unprotected AES and masked AES (see Sec. 6.2). During that operation, we measure the power consumption of the device with a Langer EM probe attached to a 20 GS/s digital oscilloscope (LeCroy WaveRunner 9104) triggered by the microcontroller, which raises a GPIO signal when the internal computation starts. The high-sensitivity probe is placed over a decoupling capacitor connected to the power line of the device. Each power trace consists of 1225 samples (2300 for the masked implementation) taken at 1 GHz with 8-bit resolution, corresponding to the first Sbox operation. Traces are preprocessed by applying zero mean, standardization, waveform realignment, and a lightweight software low-pass filter. Nevertheless, the traces remain deliberately quite noisy (due to the nature of EM measurements, variations during the acquisition, constructive differences between the devices, etc.) in order to represent realistic experimental use cases.
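The preprocessing chain above can be sketched as follows. The filter order and cutoff frequency are illustrative assumptions (the text does not specify them), and the waveform realignment step is omitted:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(traces, fs=1e9, cutoff=100e6):
    """Per-trace zero mean and standardization, then a lightweight
    low-pass filter. traces: (n_traces, n_samples) array sampled at fs.
    The 3rd-order Butterworth at 100 MHz is an assumption for the
    'lightweight software lowpass filter' mentioned in the text."""
    x = np.asarray(traces, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)        # zero mean
    x = x / x.std(axis=1, keepdims=True)         # standardization
    b, a = butter(3, cutoff / (fs / 2))          # normalized cutoff
    return filtfilt(b, a, x, axis=1)             # zero-phase low-pass
```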

AES Implementation
As mentioned above, the AES PT dataset includes traces of each clone device running both unprotected AES and masked AES. Both algorithms are implemented in C. The unprotected AES implementation is a regular AES-128 (ECB mode) software implementation. The masked implementation is a modification of the former, following the masking method described in [42] (Masked Lookup Table). A brief explanation of this masking method follows. Since this dataset includes traces corresponding to the first Sbox operation only, we focus on that part of the masking method (see [42] for the full explanation, including the masking of the rest of the intermediate values).

Masked Lookup Table
Following the approach in [42], the Sbox operation is masked using two masks: the input mask m and the output mask m′. At the beginning of each AES encryption, a masked Sbox table S_m is computed with the property S_m(x ⊕ m) = S(x) ⊕ m′, and used instead of the original table. Generating the masked table is a simple process, as one only has to run through all inputs x, look up S(x) and store S(x) ⊕ m′ at position x ⊕ m (Alg. 1). On the other hand, it increases the computational effort and the amount of memory used by the microcontroller.
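The table recomputation (Alg. 1) can be sketched in a few lines; `sbox` stands for the standard AES Sbox table:

```python
def masked_sbox_table(sbox, m, m_out):
    """Compute S_m such that S_m[x ^ m] == sbox[x] ^ m_out for all x.

    sbox: 256-entry lookup table; m, m_out: input/output mask bytes.
    """
    table = [0] * 256
    for x in range(256):
        # Store the masked output at the masked input position.
        table[x ^ m] = sbox[x] ^ m_out
    return table
```

Fresh masks require recomputing all 256 entries, which is the extra time and memory cost mentioned above.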

Dataset organization
The dataset is stored in HDF5 format. The entire dataset is contained in a single file, AES PT.h5, which has 4 groups, one per clone device, called D1, D2, D3, and D4. In turn, each group is divided into smaller subgroups, as it includes traces of the device running the unprotected and masked AES implementations with both a fixed key and random keys. An organization chart of how each device group is structured can be seen in Fig. 9. In short, the dataset includes 600 000 traces per device:

- 150 000 unprotected AES power traces (100 000 traces of the device using random keys and 50 000 traces of the device using a fixed key)
- 300 000 masked AES power traces (200 000 traces of the device using random keys and 100 000 traces of the device using a fixed key)

In turn, each set of traces includes its corresponding associated data: plaintext, ciphertext, key and mask (input and output mask, only for the masked AES implementation).

Experimental use cases: Portability
In this section, we repeat the previous use cases but considering portability. This is accomplished by using the aforementioned AES PT dataset.
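Reading such a file with h5py might look as follows. The device group names D1..D4 come from the text; the subgroup and field names below are illustrative assumptions, as the actual layout is the one given in Fig. 9:

```python
import h5py

def load_subset(path, device="D1", implementation="Unprotected",
                key_mode="Fixed"):
    """Load traces and plaintexts for one device/implementation/key-mode
    combination. Subgroup and dataset names are assumptions, not the
    verified AES PT layout."""
    with h5py.File(path, "r") as f:
        grp = f[device][implementation][key_mode]
        return grp["traces"][:], grp["plaintext"][:]
```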

Unprotected AES implementation on microcontrollers (AES PT )
Target description
Our targets are four copies of the same development board mounting an STM32F411VE, as described in the previous section. As an attacker, our goal is to recover the secret key used to encrypt data. A set of n_p = 10 000 profiling traces is taken from the profiling device (D1) using random keys and plaintexts. In the attack phase, a set of n_a = 100 traces of D1, D2, D3, and D4 encrypting random plaintexts with a fixed key (unknown to the attacker) is taken. Then the multivariate model is applied and the secret key is guessed using the maximum likelihood principle. We use reduced templates (variance only) without a compression method (no pooling). As the sensitive intermediate value we use the Hamming weight of an Sbox output.

Attack details
Table 12 summarizes the specifications of the attack. Although we could have performed another experimental design, this time we directly selected a correction factor of 10, a population size of 50 and 30 iterations (based on the results of the previous experiments) in order to save time. As the evaluation function, we consider the following formula to evaluate each individual (POI candidate):

f(x) = -CF · ge_All                          if ge_All ≠ 1
f(x) = -CF · (n_POI/n_samples) · ge_All      if ge_All = 1        (14)

where ge_All is the combined result of attacking devices D1, D2, D3 and D4 with the model built from D1 (ge_All = ge_D1/256 + ge_D2/256 + ge_D3/256 + ge_D4/256). Please note that it is the same evaluation function as in the previous use cases, except for ge_All. In addition, we use a POI selection graphic (Fig. 10), obtained from the profiling traces, to guide the search even more (as explained in Sec. 5.1).
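The reduced-template pipeline described above (per-class mean and diagonal variance over the selected POIs, with key candidates ranked by the maximum likelihood principle) can be sketched generically as follows. This is a standard TA sketch under our own assumptions, not the authors' implementation:

```python
import numpy as np

# Hamming-weight lookup for 8-bit intermediate values (classes 0..8).
HW = np.array([bin(v).count("1") for v in range(256)])

def build_templates(traces, inter_values, n_classes=9):
    """Reduced templates: mean and diagonal variance per class over the
    selected POIs. inter_values[i] is the HW class of profiling trace i."""
    means, variances = [], []
    for c in range(n_classes):
        cls = traces[inter_values == c]
        means.append(cls.mean(axis=0))
        variances.append(cls.var(axis=0) + 1e-12)  # avoid zero variance
    return np.array(means), np.array(variances)

def log_likelihoods(attack_traces, hw_per_key, means, variances):
    """Sum of Gaussian log-likelihoods per key candidate (maximum
    likelihood principle). hw_per_key[k, i] is the predicted HW class of
    attack trace i under key hypothesis k."""
    ll = np.zeros(hw_per_key.shape[0])
    for k in range(hw_per_key.shape[0]):
        mu = means[hw_per_key[k]]
        var = variances[hw_per_key[k]]
        ll[k] = np.sum(-0.5 * (np.log(2 * np.pi * var)
                               + (attack_traces - mu) ** 2 / var))
    return ll
```

The rank of the correct key in the sorted log-likelihoods is the guessing entropy ge fed back into the evaluation function.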

Results
In Tables 13 and 14, the results of the first and last iterations are shown, presented in the same manner as in the previous use cases. In this case, since we are considering portability but the implementation is weak (unprotected), our technique yields good results from the beginning, as some individuals of the first iteration already show relatively good performance. Nevertheless, after 20 iterations we succeed in the attack on all copies with all 50 individuals of the population, a significantly better performance.
A more graphical representation of the obtained results can be seen in Fig. 11. It shows the experimental results (guessing entropy) of (a) using a "regular" POI selection (simply selecting the 20 POIs with the highest SOST values), (b) the best individual (POI selection candidate) of the first iteration, and (c) the best individual of the last iteration. It should be noticed that the "regular" POI selection (a) yields the worst results: although the performance of the attacks on D3 and D4 is not so bad, the attack on D2 is completely inconclusive (the model generated with D1 does not clearly represent the leakage of D2). Conversely, the best candidates of the first and last iterations (Fig. 11 (b) and (c)) perform better. Even though the results in (b) do not show a huge improvement over (a) (actually, we achieve similar performance except for D2), the results of the last iteration are far better: the rank of the correct key candidate is 1 for all copies after 100 traces. In conclusion, these results demonstrate the suitability of this method when a portable template attack scenario is considered.

Target description
In this use case, our targets are the same as in the previous use case, except that we attack a protected AES implementation (the masking countermeasure described in [42]).

Attack details
Table 15 summarizes the specifications of the attack. Again, although we could have performed another experimental design, this time we directly selected a correction factor of 10, a population size of 50 and 30 iterations (based on the results of the previous experiments) in order to save time. As the evaluation function, we use the same formula as in the previous use case (Eq. (14)) to evaluate each individual (POI candidate).

Results
In Tables 16 and 17, the results of the first and last iterations are shown, presented in the same manner as in the previous use case, although some individuals are not represented in order to reduce the tables' size. A graphical representation of the results can be seen in Fig. 12. We can observe that although in the first iteration the results are quite weak, in the last iteration we succeed in the attack on all four copies with a model built from D1, obtaining results similar to the previous use case even though the AES implementation is protected with masking. This demonstrates the suitability of EDA-Based PA in a portable template attack scenario, even in the presence of countermeasures and noisy traces.

Conclusions and future works
On the whole, our experimental use cases demonstrate the suitability of our method for automated TA optimization in the context of AES implementations on small embedded devices, for different implementations and leakage models. Nevertheless, the approach is not algorithm-dependent, so we claim it can be applied to other scenarios in future lines of work. Moreover, although we have used TAs to demonstrate our approach, other kinds of PA are potential candidates to be combined with EDAs. Our EDA-Based PA obtains state-of-the-art results in a relatively straightforward way: it heuristically finds the best-performing POIs automatically and efficiently. Moreover, we have shown that even in the most adverse use case (a masked AES implementation considering portability), our technique provides particularly good results, as we can attack several devices with a leakage model built from only one copy.
In addition, we claim that our approach mitigates the need for a human in the loop: as the attacks are performed automatically, it may help technicians without deep knowledge of all the concepts involved in TAs to perform this part of the evaluation process properly. Besides, this approach could also be interesting to experts in evaluation laboratories, since it allows them to parallelize tasks and reduce the time cost of the evaluation process.
The method currently runs in a matter of hours but, as future work, we have identified several ways of optimizing it (e.g., attack parallelization and computation optimization) which could drastically reduce the computation time.