Perceived Information Revisited: New Metrics to Evaluate Success Rate of Side-Channel Attacks

In this study, we present new analytical metrics for evaluating the performance of side-channel attacks (SCAs) by revisiting the perceived information (PI), which is defined using cross-entropy (CE). PI represents the amount of information utilized by a probability distribution that determines a distinguishing rule in SCA. Our analysis partially solves an important open problem in the performance evaluation of deep-learning based SCAs (DL-SCAs): the relationship between neural network (NN) model evaluation metrics (such as accuracy, loss, and recall) and guessing entropy (GE)/success rate (SR) has been unclear. We first theoretically show that the conventional CE/PI is non-calibrated and insufficient for evaluating the SCA performance, as it contains uncertainty in terms of SR. More precisely, we show that an infinite number of probability distributions with different CE/PI can achieve an identical SR. With the above analysis result, we present a modification of CE/PI, named effective CE/PI (ECE/EPI), to eliminate the above uncertainty. The ECE/EPI can be easily calculated for a given probability distribution and dataset, which would be suitable for DL-SCA. Using the ECE/EPI, we can accurately evaluate the SR through the validation loss in the training phase, and can measure the generalization of the NN model in terms of SR in the attack phase. We then analyze and discuss the proposed metrics regarding their relationship to SR, conditions of successful attacks for a distinguishing rule with a probability distribution, a statistic/asymptotic aspect, and the order of key ranks in SCA. Finally, we validate the proposed metrics through experimental attacks on masked AES implementations using DL-SCA.


Background
Deep-learning based side-channel attack. Deep-learning based side-channel attacks (DL-SCAs) on cryptographic modules have emerged increasingly in recent years [MHM14, CDP17, HHGG20, RWPP21, UXT+22]. DL-SCA is a profiling attack that consists of two phases: profiling and attack. In the profiling phase, an attacker obtains side-channel traces from profiling device(s) with leakage characteristics similar to those of the target device, and then trains a neural network (NN) model representing the leakage characteristics. In the attack phase, the attacker utilizes the trained NN model to estimate the secret key from the target device's side-channel leakage. Compared with conventional profiling attacks, such as template attacks [CRR02], DL-SCA can achieve a higher attack performance (e.g., key recovery capability) even on implementations with SCA countermeasures, such as masking and random delay. It has therefore become necessary to develop an assessment methodology for DL-SCA threats, because an increasing number of cryptographic devices are operated in scenarios where an attacker can perform profiling, such as Internet of Things applications. Indeed, several studies have shown/discussed the possibility and potential of profiling attacks in real scenarios, such as [OP11, DK18, WVdHG+20].

The proposed ECE and EPI are respectively given by an infimum of CE and a supremum of PI over the probability distributions that can achieve an SR. The proposed metrics can be easily calculated for a given probability distribution and dataset as described in Section 4.4, which makes them suitable for DL-SCA. The use of EPI makes it possible to perform a more accurate SR evaluation through the ECE during NN training in DL-SCA, in combination with an inequality developed by de Chérisey et al. [dCGRP19].
Note that an SR upper-bound is closely related to a lower-bound of the number of traces required for attack success; therefore, such bounds are used as a quantitative evaluation metric of SCA [SMY09]. The proposed metrics can also be used to measure the generalization of an NN model in terms of SR during the attack phase. We analyze and discuss the proposed metrics in terms of their relationship to SR, their statistic/asymptotic aspects, and the conditions of successful attacks for a distinguishing rule with a probability distribution. In addition, we provide an analysis of the order of key ranks in SCA to show the suitability of ECE/EPI for SR evaluation. Finally, we validate the proposed metrics through experimental attacks on masked AES implementations using DL-SCA.
We suppose that the proposed approach would be especially helpful for evaluators and (white-box) attackers, as it easily evaluates the attack performance of a model. This indicates that it would be useful, for example, for early stopping to maximize the SR and for comparing two (or more) models to determine which is superior in DL-SCA. The experimental attacks also validate this aspect. For example, in the experimental attack on masked hardware, the SR evaluation using the proposed metrics/method takes at most 0.53 seconds even with 100,000 test traces, whereas a common empirical SR evaluation requires far longer, possibly on the order of minutes with 100,000 traces. Note that this computation time corresponds to one SR evaluation at one epoch; in practice, the computation is performed at every epoch, which indicates that the usage of EPI would yield a significant reduction in computation time. Thus, EPI also contributes to the SR evaluation in practical aspects, in addition to the theoretical contributions of this paper.

Paper organization
The remainder of this paper is organized as follows: Section 2 introduces the mathematical notation and reviews the previous studies on DL-SCA and PI. Section 3 derives the relation between PI and SR from a probability-theoretical perspective. Section 4 proposes, analyzes, and discusses a new information-theoretical metric named ECE/EPI. Section 5 demonstrates the validity of the proposed metric through experimental attacks on masked AES implementations. Finally, Section 6 concludes this study.

Notation
A calligraphic letter (e.g., X) represents a set; an uppercase variable (e.g., X) represents a random variable over the corresponding set (i.e., X over X); and a lowercase variable (e.g., x) is an element of the corresponding set, unless defined otherwise. Pr denotes a probability measure. Throughout this paper, p denotes the true probability density or mass function, and q denotes the probability density or mass function represented by an NN. For example, the true probability mass function of discrete random variables X, Y is p_{X,Y}(x, y) = Pr(X = x, Y = y). We may omit the subscripted random variables if they are obvious; for example, we may simply write p(x, y) instead of p_{X,Y}(x, y). In addition, we may write the conditional probability distribution represented by an NN with parameter θ as q_θ = q_{Z|X}(· | ·; θ). The expectation is denoted by E; for example, E_X f(X) denotes the expectation of f(X), where f : X → R is a function. The conditional probability distribution is denoted by p_{X|Y}(x | y) = p(x | y), and E[f(X, Y) | Y = y] denotes the corresponding conditional expectation. Finally, log and ln denote the binary and natural logarithms, respectively.
Let X denote the random variable of the side-channel trace. Side-channel traces are represented as multidimensional real vectors x ∈ X ⊂ R^{m_t}, where m_t ∈ N is the number of sample points. This study focuses on SCAs on block ciphers, particularly AES. Let n_k denote the bit length of the partial key, and let n_t denote the bit length of the partial plaintext and ciphertext. The secret intermediate value is denoted as z = g(k, t) ∈ Z = {0, 1}^{n_z}, where g is a selection function, n_z denotes the bit length of z, k ∈ K = {0, 1}^{n_k} is a key, and t ∈ T = {0, 1}^{n_t} is public information such as a plaintext or ciphertext. Their random variables are also defined in the aforementioned manner. Here, let K denote the random variable of the correct key, and let k* denote the correct key value. T and K are assumed to have uniform distributions. If we need to specify the key value for Z, we write Z^{(k)} = g(k, T).
In this study, the conditional probability distribution between the secret intermediate variable Z and side-channel leakage X (e.g., p_{Z|X}, q_θ) plays an essential role. For simplicity, we assume that every conditional probability distribution r_{Z|X} satisfies −E log r_{Z|X}(Z | X) < ∞. This condition ensures that the cross entropy of every distribution exists. Let R be the set of all conditional probability distributions such that, for every r_{Z|X} ∈ R, (i) ∀z ∈ Z, x ∈ X : r_{Z|X}(z | x) > 0 holds, and (ii) ∀z_1, z_2 ∈ Z : z_1 ≠ z_2 ⇒ r_{Z|X}(z_1 | X) ≠ r_{Z|X}(z_2 | X) holds almost surely. Because q_θ meets these two conditions in most cases, the conditional probability distribution of the model q_θ is contained in R in practice. The first condition is natural because the NN output cannot take a zero value if the Softmax function is used as the activation function of its last layer. Although there may exist parameters that do not satisfy the second condition, it would be highly unlikely that such parameters are selected during learning because of the randomness of the learning algorithms. Note that the true distribution p_{Z|X} is not necessarily contained in R. For example, if the leakage model is the Hamming weight and there is no noise in the traces, there exist distinct z_1 and z_2 such that Pr(p(z_1 | X) = p(z_2 | X)) > 0 holds (e.g., (z_1, z_2) = (1, 2)). Even in this case, we assume that q_θ ∈ R holds because of the randomness of the learning algorithms.
For the sake of simplicity, we assume that the distribution of Z is independent of the key used. This assumption is closely related to the key-independence condition [IUH21], which states that the key can be fixed during profiling. Many practical selection functions are proven to satisfy this condition. For example, a typical selection function for each of the 16 bytes of a software AES implementation (i.e., Z = Sbox(K ⊕ T) and its Hamming weight) satisfies this condition.
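As a concrete illustration, such a selection function with a Hamming-weight leakage model can be sketched as follows. This is a minimal sketch: the shuffled table below is only a stand-in for the real AES S-box constants (any fixed bijection on {0, ..., 255} illustrates the role of g), and the seed is arbitrary.

```python
import random

# Stand-in for the AES S-box (the real table has fixed constants, omitted here);
# any fixed bijection on {0, ..., 255} illustrates the selection function g.
rng = random.Random(0)
SBOX = list(range(256))
rng.shuffle(SBOX)

def g(k: int, t: int) -> int:
    """Selection function z = Sbox(k XOR t) for one key/plaintext byte."""
    return SBOX[k ^ t]

def hamming_weight(z: int) -> int:
    """Hamming-weight leakage model: number of set bits in z."""
    return bin(z).count("1")

z = g(0x2B, 0x3A)  # intermediate value for key byte 0x2B, plaintext byte 0x3A
print(z, hamming_weight(z))
```

Here Z = Sbox(K ⊕ T) takes values in {0, 1}^8 (n_z = 8), and the Hamming-weight variant maps Z to {0, ..., 8}.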

Overview of DL-SCA
The DL-SCA has two phases: profiling and attack. During the profiling phase, we train a model to approximate the conditional distribution as the device leakage characteristics.
Let S_p = { (X_i, Z_i) | 1 ≤ i ≤ m_tr } be a training dataset used in the profiling phase, where X_i denotes the side-channel trace (i.e., power consumption or electromagnetic radiation) of the i-th observation, Z_i denotes the corresponding intermediate value, and |S_p| = m_tr is the number of traces used in the profiling phase. We assume that X_1, X_2, ..., X_{m_tr} and Z_1, Z_2, ..., Z_{m_tr} are independent and identically distributed (i.i.d.) random variables, respectively. Let θ denote the NN model parameter. The goal of the profiling phase is to estimate the optimal model parameter θ̂ using the training dataset S_p. This optimal parameter is usually given as the solution to the minimization problem of the CE loss function, defined as
CE(q_θ) = −E_{Z,X} log q_θ(Z | X),  (1)
where Z and X are the random variables of a label z and trace x, respectively, and q_θ represents the conditional probability distribution represented by the NN with parameter θ. CE(q_θ) in Equation (1) takes the minimum value if and only if p = q_θ [Bis06, GBC16]. Note that, depending on the hyperparameters and p, it is not generally guaranteed that there exists a model parameter such that p = q_θ. We can obtain a model that approximates the true distribution p if we determine an optimal parameter θ̂ that makes CE(q_θ̂) sufficiently small; however, we cannot calculate Equation (1) because it contains the integral and summation over the unknown probability distribution p. Therefore, in general, we approximate CE(q_θ) using the training data S_p as follows:
L(q_θ) = −(1/m_tr) Σ_{i=1}^{m_tr} log q_θ(Z_i | X_i).  (2)
The approximated CE in Equation (2) is called the negative log-likelihood (NLL). The NLL is expected to converge in probability to CE(q_θ) as m_tr → ∞ for fixed q_θ. During the attack phase, we estimate the secret key k* of the target device using the trained model.
(Footnote 2: In this paper, we assume that g does not consist in a leakage function (e.g., Hamming weight), as we focus on the probability distribution r_{Z|X}.)
(Footnote 3: Formally, this can be stated as follows. Let S_p be a training dataset, and let M : S_p → θ be a learning algorithm, which is a randomized function. Note that a learned parameter θ̂ ← M(S_p) is regarded as a random variable. This paper assumes that q_θ̂ ∈ R holds almost surely (i.e., Pr(q_θ̂ ∈ R) = 1).)
Let S_a = { (X_j, T_j) | 1 ≤ j ≤ m_at } be a dataset used during the attack phase, where |S_a| = m_at is the number of traces, X_j is the side-channel trace of the j-th observation, and T_j is the corresponding plaintext or ciphertext. During the attack phase, we calculate the NLL for each hypothetical key candidate k ∈ K using the intermediate value Z_j^{(k)} = g(k, T_j). The correct key is then estimated as the key candidate with the smallest NLL value. This is equivalent to approximately computing and comparing −E log q_θ(Z^{(k)} | X) for each key candidate k. In the following, for simplicity of notation, we denote the number of traces in the attack phase by m instead of m_at. Likewise, the number of traces for validation/test is also simply denoted by m, as a validation/test corresponds to an attack phase.
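The NLL-based key estimation above can be sketched as follows. This is a minimal sketch with a toy 2-bit key and intermediate space; the selection function `g` and the per-trace model outputs `probs` are illustrative placeholders, not a real trained NN.

```python
import math

def key_nll(k, probs, plaintexts, g):
    """NLL score of key candidate k: -sum_j log q(z_j^(k) | x_j)."""
    return -sum(math.log(probs[j][g(k, plaintexts[j])])
                for j in range(len(plaintexts)))

def rank(k_star, probs, plaintexts, g, key_space):
    """Rank of the correct key among all candidates (1 = estimated first)."""
    scores = {k: key_nll(k, probs, plaintexts, g) for k in key_space}
    return 1 + sum(1 for k in key_space
                   if k != k_star and scores[k] < scores[k_star])

# toy example: 2-bit key, Z = {0,1,2,3}, selection function z = k XOR t
g = lambda k, t: k ^ t
plaintexts = [0, 1, 3]
k_star = 2
# hypothetical per-trace model outputs q(. | x_j), peaked on z_j^(k*) = g(2, t_j)
probs = []
for t in plaintexts:
    p = [0.1, 0.1, 0.1, 0.1]
    p[g(k_star, t)] = 0.7
    probs.append(p)

print(rank(k_star, probs, plaintexts, g, range(4)))  # -> 1
```

Because the model's probability mass follows the intermediate values of the correct key, its NLL is the smallest and the correct key is ranked first.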
Loss functions other than CE have been proposed to improve the learning cost and/or the attack performance of NNs. In [ZZN+20], Zhang et al. presented the cross-entropy ratio (CER) and showed that it is useful for improving the attack performance, especially when the training and test datasets suffer from an imbalanced-data problem, as also analyzed in [ISUH21]. In [ZBD+21], Zaid et al. presented the ranking loss (RkL), whose usage can suppress the approximation error and speed up convergence. As investigated in [KWPP21], such loss functions dedicated to DL-SCA can yield a high attack performance, although the common CE is a good option in most cases. Thus, it is worth investigating loss functions dedicated to DL-SCA.

SCA evaluation metrics
To evaluate the performance of (DL-)SCA, the SR and GE are commonly used as quantitative metrics in the attack phase. The SR and GE with m attack traces are given by SR_m = Pr(rank(k*, m, q_θ) = 1) and GE_m = E rank(k*, m, q_θ), respectively [SMY09]. In the case of DL-SCA, the rank of the correct key is defined as
rank(k*, m, q_θ) = 1 + Σ_{k ∈ K \ {k*}} 1{ Σ_{j=1}^{m} log q_θ(Z_j^{(k)} | X_j) ≥ Σ_{j=1}^{m} log q_θ(Z_j^{(k*)} | X_j) },
where 1 is the indicator function. In [dCGRP19], de Chérisey et al. proved an SR inequality
ξ(SR_m) ≤ m I(Z; X),  (3)
where ξ : [0, 1] → R_+ denotes a function defined as
ξ(s) = H(K) − (1 − s) log(2^{n_k} − 1) − H_2(s),  (4)
where H(K) is the entropy of K (here, H(K) = n_k because K is uniform over {0, 1}^{n_k}) and H_2 is the binary entropy function. Intuitively, ξ(SR) represents the amount of information required for key recovery with a given SR. For example, if an attacker attempts key recovery with SR_m = 1, the attacker requires n_k bits of information, as represented by ξ(1) = n_k. In contrast, if the attacker has no advantage in the key estimation (that is, SR_m = 1/2^{n_k}), the attacker requires zero bits of information about the secret key, as represented by ξ(1/2^{n_k}) = 0. Inequality (3) states that this amount of information is upper-bounded by the mutual information; that is, to achieve a desired SR, the attacker must obtain ξ(SR_m) ≤ m I(Z; X) bits of information through the observation of m side-channel traces. Related to SR evaluation through the validation loss, Zaid et al. showed that RkL yields an SR lower-bound [ZBD+21]. Although its usage has some advantages in DL-SCA (e.g., suppression of the approximation error and faster convergence), RkL-based SR evaluation incurs a computational cost as high as the conventional empirical evaluation. RkL can be evaluated only experimentally/empirically, not analytically, because RkL is derived by approximating an indicator function in the GE by a binary loss function [IUH21]. This indicates that RkL-based SR evaluation essentially includes the conventional empirical SR evaluation. For the assessment of DL-SCA performance, it is therefore worth studying how to evaluate the SR through the validation loss at a lower cost.
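Assuming the standard Fano-type form of ξ from [dCGRP19], i.e., ξ(s) = n_k − (1 − s) log2(2^{n_k} − 1) − H_2(s), the information requirement and the implied trace lower-bound can be computed as follows. The mutual-information value `mi` is a hypothetical input for illustration.

```python
import math

def binary_entropy(s):
    """H_2(s) in bits."""
    if s in (0.0, 1.0):
        return 0.0
    return -s * math.log2(s) - (1 - s) * math.log2(1 - s)

def xi(sr, n_k):
    """Bits of information required to reach success rate sr on an n_k-bit key."""
    return n_k - (1 - sr) * math.log2(2 ** n_k - 1) - binary_entropy(sr)

def min_traces(sr, mi, n_k):
    """Trace lower-bound implied by Inequality (3): xi(SR_m) <= m * I(Z;X)."""
    return math.ceil(xi(sr, n_k) / mi)

print(xi(1.0, 8))                # -> 8.0 (full key information needed for SR = 1)
print(round(xi(1 / 256, 8), 6))  # -> 0.0 (random guessing needs no information)
print(min_traces(0.9, 0.05, 8))  # traces needed for SR = 0.9 if I(Z;X) = 0.05 bits
```

The two boundary cases reproduce ξ(1) = n_k and ξ(1/2^{n_k}) = 0 from the text above.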

Optimal distinguisher
SCA can be formulated using a distinguisher, i.e., a function d : X^m × T^m → K. A distinguisher calculates a score for each key candidate from the side-channel traces and estimates the correct key as the candidate with the highest score. For example, correlation power analysis (CPA) uses Pearson's correlation coefficient as the score, whereas DL-SCA uses the NLL. An optimal distinguisher is a distinguisher that maximizes the SR; it is formally defined as follows: Definition 1 (Optimal distinguisher [HRG14]). For attack traces X^m = (X_1, X_2, ..., X_m) and inputs T^m = (T_1, T_2, ..., T_m), the success rate of a distinguisher d : X^m × T^m → K is SR_m(d) = Pr(d(X^m, T^m) = k*). An optimal distinguisher d_opt is a distinguisher that maximizes SR_m(d). According to [HRG14], an optimal distinguisher d_opt is given by the maximum-likelihood rule d_opt(x^m, t^m) = arg max_k p(x^m | t^m, k). In [IUH21], Ito, Ueno, and Homma proved that d_opt has another equivalent form given by the true conditional probability distribution of the secret intermediate variable Z given a side-channel leakage X (denoted by p_{Z|X}), which suits DL-SCA. This indicates that the CE minimization in DL-SCA makes sense for achieving an optimal attack, as the goal of DL is usually to imitate the true conditional probability distribution through CE loss minimization. However, in [IUH21], Ito, Ueno, and Homma also proved that an infinite number of probability distributions with non-minimum CE provide optimal distinguishers. Their theorem states that the true conditional probability distribution (i.e., a probability distribution with the minimum CE) is sufficient but not necessary to provide an optimal distinguisher. Using the theorem, they stated that a probability distribution with a relatively high CE does not always make the SR low in the attack phase of DL-SCA, which motivated them to propose a loss function (named probability concentration inequality (PCI) loss) that directly aims to maximize the SR. We review their theorem to reveal the relationship between PI and SR in Section 3.

Perceived Information
The concept of PI was initially presented by Renauld et al. [RSVC+11]. PI can be considered as the amount of information utilized by a probability distribution (e.g., an NN output) that provides a distinguishing rule. Let J_r(Z; X) denote the PI of a probability distribution r between the secret intermediate variable Z and side-channel leakage X. J_r(Z; X) is defined as
J_r(Z; X) = H(Z) − CE(r_{Z|X}) = H(Z) + E_{Z,X} log r_{Z|X}(Z | X).  (5)
PI is a lower-bound of the mutual information; that is, J_r(Z; X) ≤ I(Z; X) holds for any distribution r_{Z|X} [MDP20, BHM+19], with equality if and only if r_{Z|X} is equivalent to the true probability distribution p_{Z|X}. J_r(Z; X) is expected to be non-negative, as it is meant to represent an amount of information; however, the original PI can take a negative value, as mentioned and shown in [BHM+19]. Section 3.2 explains one of the reasons why PI can be negative. Let SR_m(r) denote the SR of an attack using m traces and a distinguisher with r_{Z|X}. According to the intuitive meanings of PI and ξ, one would expect the inequality
ξ(SR_m(r)) ≤ m J_r(Z; X)  (6)
to hold, similarly to Inequality (3). However, in practice, counterexamples are found (like the experiment in this paper): a probability distribution with a CE so large that the PI appears too small for attack success with regard to Inequality (6) can sometimes succeed in key recovery. This indicates that the existing PI does not adequately represent the amount of information that can be used in the SR inequality (3) in place of the mutual information. In this study, we specify one of the reasons and present a modification of PI to address this issue.

Review of Ito-Ueno-Homma theorem [IUH21]
Theorem 1 (Ito, Ueno, and Homma [IUH21]). Let r_{Z|X} be a conditional probability distribution of the secret intermediate value Z given side-channel leakage X. r_{Z|X} yields an optimal distinguisher if CE(r_{Z|X}) is minimum (i.e., r = p). However, CE(r_{Z|X}) is not necessarily minimum if r_{Z|X} yields an optimal distinguisher.
Theorem 1 is proven by two propositions: one states that the true conditional probability distribution (i.e., a conditional probability distribution with the minimum CE) is sufficient to yield an optimal distinguisher, and the other states that its converse is false; that is, a conditional probability distribution that yields an optimal distinguisher does not necessarily have the minimum CE. Proposition 1 (CE minimization is sufficient for optimal distinguisher [IUH21]). Let r_{Z|X} be a conditional probability distribution of Z given X, and let d_r be a distinguisher defined as
d_r(x^m, t^m) = arg max_{k ∈ K} Σ_{j=1}^{m} log r_{Z|X}(z_j^{(k)} | x_j).
If CE(r_{Z|X}) is minimum (i.e., r = p), then d_r is an optimal distinguisher. Note that, for the probability distribution of a trained |Z|-classification NN q_θ̂, the distinguisher d_{q_θ̂} is equivalent to the NLL-based key estimation described above, which is the reason why, in DL-SCA, we train an NN to approximate the true probability distribution p_{Z|X} and utilize the NLL for the key estimation in the attack phase. In the following, we always consider the distinguishing rule defined in Proposition 1 for a given probability distribution.
Before introducing Proposition 2, we review Lemma 1 followed by Corollary 1, which are crucial to the proof of Proposition 2.
Lemma 1 (A conversion of probability distribution with the order of key ranks preserved [IUH21]). Let r_{Z|X} be a conditional probability distribution, and let
r'_{Z|X}(z | x) = r_{Z|X}(z | x)^β / Σ_{z' ∈ Z} r_{Z|X}(z' | x)^β,
where β is a positive real number. Then, for all k ∈ K and m ∈ N, rank(k, m, r) = rank(k, m, r') holds.
Corollary 1. For a given probability distribution r Z|X and S a , the success rate SR m and guessing entropy GE m are invariant to the above conversion of probability distribution with any β.
Lemma 1 guarantees that the conversion from r to r' does not change the SCA performance (i.e., SR and GE). Note that the conditional distribution r_{Z|X} must be a positive real-valued function for Lemma 1 to hold. NN models satisfy this condition because they usually use Softmax as the activation function of the last layer. Lemma 1 implies that an infinite number of such conversions exist because β is an arbitrary positive real number. Using Lemma 1, Ito, Ueno, and Homma proved Proposition 2.
Proposition 2 (CE minimization is not necessary for optimal distinguisher [IUH21]). Let d be a distinguisher for the attack phase, defined as
d(x^m, t^m) = arg max_{k ∈ K} Σ_{j=1}^{m} log r_{Z|X}(z_j^{(k)} | x_j),
where r_{Z|X} : Z × X → (0, 1] is a conditional probability distribution. Even when the distinguisher d is optimal, CE(r_{Z|X}) = inf_{r'} CE(r'_{Z|X}) does not necessarily hold.

Relation between CE/PI and SR
We then show the uncertainty of CE/PI in terms of SR evaluation using Lemma 1. In this study, we focus on the conversion of probability distribution used in Lemma 1. We first define the conversion notation.
Definition 2. Let r_{Z|X} be a conditional probability distribution. For any positive real number β, define a conversion of r_{Z|X} as
H_β[r_{Z|X}](z | x) = r_{Z|X}(z | x)^β / Σ_{z' ∈ Z} r_{Z|X}(z' | x)^β.
The application of H_β to a probability distribution is equivalent to using Softmax with temperature as the activation function of the output layer of an NN model. In the DL community, such a Softmax with temperature is used to emphasize the label with the highest probability if β > 1, or to place relatively more importance on labels with small probability if 0 < β < 1. It is known that the accuracy of an NN model is invariant to the temperature [GPSW17], which immediately implies that, for a one-trace attack (i.e., m = 1), the rank order, SR, and GE are also invariant to the temperature. Lemma 1 and Corollary 1 generalize this fact to attacks with more than one trace: the temperature is meaningless for distinguishing rules in terms of attack performance with any (finite) number of traces. Meanwhile, CE and PI do depend on β. To analyze this dependency, we derive the limits of the CE and PI of H_β[r] as β → 0 and β → ∞. For the derivation, we introduce Lemma 2.
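A quick numerical check of this invariance: applying H_β changes the per-trace probabilities (and hence the CE) but leaves the key ordering of the log-likelihood scores unchanged for any β. The per-trace distributions and per-key intermediate values below are hypothetical toy data.

```python
import math

def h_beta(p, beta):
    """H_beta conversion of a distribution p over Z (Softmax with temperature)."""
    w = [pi ** beta for pi in p]
    s = sum(w)
    return [wi / s for wi in w]

# hypothetical per-trace distributions r(. | x_j) over Z = {0,1,2,3}
probs = [[0.5, 0.2, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]]
# hypothetical intermediate values z_j^(k) for three key candidates
z_per_key = {0: [0, 1], 1: [1, 0], 2: [3, 3]}

def score(dists, z_seq):
    return sum(math.log(d[z]) for d, z in zip(dists, z_seq))

base_order = sorted(z_per_key, key=lambda k: -score(probs, z_per_key[k]))
for beta in (0.3, 1.0, 5.0):
    converted = [h_beta(p, beta) for p in probs]
    order = sorted(z_per_key, key=lambda k: -score(converted, z_per_key[k]))
    assert order == base_order  # rank order is invariant to beta (Lemma 1)
print(base_order)  # -> [0, 1, 2]
```

The invariance follows because the converted score equals β times the original score minus a term that does not depend on the key candidate.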
Lemma 2. Let r_{Z|X} ∈ R be a conditional probability distribution, and let β be a positive real number. H_β[r_{Z|X}](Z | X) converges almost surely to 2^{−n_z} (i.e., the uniform distribution over Z) as β → 0, and converges almost surely to 1{Z = arg max_{z'} r_{Z|X}(z' | X)} as β → ∞.
Proof. First, as β → 0, r_{Z|X}(z | x)^β → 1 for every z ∈ Z and x ∈ X, and hence H_β[r_{Z|X}](z | x) → 1/|Z| = 2^{−n_z}. Second, we derive the limit as β → ∞. Let (Ω, F, Pr) be a probability space. From the definition of R, there exists a null set N such that, on Ω \ N, the maximizer arg max_{z'} r_{Z|X}(z' | X) is unique; on this event, H_β[r_{Z|X}](z | X) → 1 if z attains the maximum and → 0 otherwise. Therefore, 1{Z = arg max_{z'} r_{Z|X}(z' | X)} is measurable, and lim_{β→∞} H_β[r_{Z|X}](Z | X) = 1{Z = arg max_{z'} r_{Z|X}(z' | X)} holds almost surely because Pr(Ω \ N) = 1.
We then introduce Proposition 3.
Proposition 3. Let r_{Z|X} ∈ R be a conditional probability distribution. Then
lim_{β→0} CE(H_β[r_{Z|X}]) = n_z  (7)  and  lim_{β→0} J_{H_β[r_{Z|X}]}(Z; X) = 0,  (8)
and CE(H_β[r_{Z|X}]) can be made arbitrarily large (and the PI arbitrarily small) by taking β sufficiently large.
Proof (informal sketch). Intuitively, Limits (7) and (8) hold because H_β[r_{Z|X}] converges almost surely to the uniform distribution over Z as β → 0 and the CE of the uniform distribution equals n_z. The behavior as β → ∞ similarly follows from the convergence of H_β[r_{Z|X}] to the indicator distribution in Lemma 2.
Proposition 3 states that CE/PI depend on β. In particular, we can make the CE arbitrarily large and the PI arbitrarily small by increasing β. Thus, a conditional probability distribution can be converted into other distributions with arbitrarily large CE and small PI, whereas the SR and GE of the distinguishing rule with such probability distributions are invariant to β. In an extreme case, according to Proposition 1, the true probability distribution p_{Z|X}, which has the minimum CE/maximum PI, gives an optimal distinguisher (i.e., achieves the theoretically maximum SR); but H_β[p_{Z|X}] also provides an optimal distinguisher, although J_{H_β[p_{Z|X}]}(Z; X) is smaller than zero for sufficiently large β. This statement also holds for any non-optimal probability distribution that can achieve a meaningful SR. Thus, CE and PI include an uncertainty in terms of SR; that is, for a given SR, the CE and PI of a probability distribution are not unique, and the probability distribution can have an arbitrarily large CE/small PI. Moreover, PI can take any negative value, although it is intuitively expected to represent an amount of information. This reveals that the conventional CE/PI is non-calibrated, not always appropriate for SR evaluation, and insufficient for evaluating the attack performance (with Inequality (6)).

Basic concept
In Section 3, we showed the uncertainty of CE/PI in terms of SR. To avoid such an uncertainty, we present a modification of CE/PI, named effective CE/PI (ECE/EPI). The ECE and EPI are defined as a CE lower-bound and PI upper-bound for a given probability distribution with regard to the conversion H β , respectively.
Definition 3 (Effective cross-entropy (ECE) and effective perceived information (EPI)). Let r_{Z|X} ∈ R be a conditional probability distribution of the secret intermediate variable Z given a side-channel leakage X. The ECE and EPI of r_{Z|X} are defined as
CE*(r_{Z|X}) = inf_{β ∈ (0, ∞)} CE(H_β[r_{Z|X}])  and  J*_r(Z; X) = H(Z) − CE*(r_{Z|X}) = sup_{β ∈ (0, ∞)} J_{H_β[r_{Z|X}]}(Z; X),
respectively. If Z follows the uniform distribution over {0, 1}^{n_z}, then H(Z) = n_z.
Note that the EPI is always non-negative, as proven in Proposition 4. We define the ECE/EPI of r_{Z|X} as the infimum of the CE/supremum of the PI over the probability distributions in { H_β[r_{Z|X}] | β ∈ (0, ∞) }. In other words, given r_{Z|X}, we can generate an infinite number of conditional probability distributions H_β[r_{Z|X}] that have the same SR as r_{Z|X} but different CE/PI; to determine the CE/PI uniquely as the ECE/EPI, we take the infimum of the CE (or supremum of the PI) among them. Thus, the ECE/EPI is likely to be a lower-bound of the CE (or an upper-bound of the PI) of the probability distributions that can achieve a given SR, and probability distributions with the same ECE/EPI yield the same SR. This indicates that the ECE/EPI is more appropriate for SR evaluation with regard to the conversion H_β (see Section 4.3 for a more detailed discussion).
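The empirical ECE can be estimated by a one-dimensional search over β; since the NLL is convex in β (Lemma 3), a simple ternary search suffices. This is a sketch on a hypothetical validation set: the model outputs `dists` and the labels are illustrative, and the NLL is computed in the log domain to stay numerically stable for large β.

```python
import math

def nll_bits(dists, labels, beta):
    """Empirical CE (NLL, in bits) of H_beta[r] on a validation set."""
    total = 0.0
    for j, z in enumerate(labels):
        logs = [beta * math.log2(pz) for pz in dists[j]]
        mx = max(logs)
        # log2 of the normalizer sum_z r(z|x)^beta, computed stably
        lse = mx + math.log2(sum(2.0 ** (l - mx) for l in logs))
        total += lse - beta * math.log2(dists[j][z])
    return total / len(labels)

def effective_ce(dists, labels, lo=1e-3, hi=1e3, iters=100):
    """Ternary search for inf_beta NLL(H_beta[r]) (valid by convexity in beta)."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if nll_bits(dists, labels, m1) < nll_bits(dists, labels, m2):
            hi = m2
        else:
            lo = m1
    return nll_bits(dists, labels, (lo + hi) / 2)

# hypothetical model outputs over Z = {0,1,2,3} (n_z = 2 bits) and true labels
dists = [[0.4, 0.3, 0.2, 0.1]] * 8
labels = [0, 0, 0, 1, 0, 2, 0, 0]
ece = effective_ce(dists, labels)
epi = 2.0 - ece  # EPI = H(Z) - ECE, with H(Z) = n_z = 2 bits
print(ece <= nll_bits(dists, labels, 1.0), epi >= 0.0)  # -> True True
```

The infimum over β never exceeds the plain NLL at β = 1, and the resulting EPI is non-negative, matching the properties claimed for the ECE/EPI.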
Recall that PI is designed to represent the amount of information utilized by a conditional probability distribution between the secret intermediate variable and the side-channel leakage. Because the ECE/EPI is defined as the infimum of the CE/supremum of the PI for a given probability distribution r_{Z|X}, we expect that the ECE/EPI can be used for the tightest and most accurate SR evaluation for r_{Z|X} via the SR inequality (3) by de Chérisey et al. [dCGRP19]. Concretely, we expect the following inequality to hold.
Conjecture 1 (SR-EPI inequality). Let SR_m(r) denote the success rate of the distinguishing rule in Proposition 1 using a conditional probability distribution r_{Z|X} when the number of attack traces is m. Then, we have
ξ(SR_m(r)) ≤ m J*_r(Z; X)  (11)
for the evaluation of an SR upper-bound for a given probability distribution r_{Z|X}. To achieve SR_m(r) = 1, Inequality (11) is also represented as
m ≥ n_k / J*_r(Z; X).  (12)
As in Inequality (12), an SR upper-bound conversely represents a lower-bound of the number of traces required for attack success (i.e., to achieve a given SR). We demonstrate the validity, effectiveness, and tightness of the SR evaluation using EPI/ECE through experimental attacks in Section 5.
As with PI, the EPI is upper-bounded by the mutual information I(Z; X). In addition, the EPI is always non-negative, whereas the conventional PI can take any negative value. Proposition 4 describes the range of the EPI.

Proposition 4 (Range of EPI). Let r_{Z|X} ∈ R be a conditional probability distribution. Then, 0 ≤ J*_r(Z; X) ≤ I(Z; X). The upper equality holds if and only if CE*(r_{Z|X}) attains the minimum CE (i.e., CE*(r_{Z|X}) = CE(p_{Z|X})).
Proposition 4 validates the usage of Inequality (11) for SR evaluation; that is, the SR-EPI inequality (11) does not overestimate the attack performance (i.e., the SR upper-bound and the lower-bound of the number of traces required for attack success) beyond that of an optimal attack with p_{Z|X}, as a larger J*_r(Z; X) implies a higher performance. Moreover, the EPI provides a tighter and more accurate evaluation than the conventional PI (as J_r(Z; X) ≤ J*_r(Z; X) always holds by definition), whereas the PI is likely to underestimate the attack performance, as discussed in Section 3. The major differences of Proposition 4 from the inequality J_r(Z; X) ≤ I(Z; X) in [MDP20, BHM+19] are the equality condition and the guarantee that the EPI is non-negative. The EPI is thus consistent with its intuitive meaning: it is maximized by all probability distributions that provide an optimal distinguisher with regard to H_β, and it is always non-negative.

Relation between attack success and ECE/EPI
To discuss the ECE/EPI in detail, we introduce Lemma 3, which states that L(H_β[r]), an approximation of CE(H_β[r]), is a strictly convex function of β.
Lemma 3. Let r_{Z|X} ∈ R be a conditional probability distribution, let H_β be the conversion of probability distribution defined above, and let β be a positive real number. Then, L(H_β[r_{Z|X}]) is almost surely a strictly convex function of β, and CE(H_β[r_{Z|X}]) is a strictly convex function of β.
Proof. To handle the NLL and CE simultaneously, we introduce the empirical distribution. Let F_{Z,X} be the true cumulative probability function, and let F̂^{(m)}_{Z,X} be the empirical probability distribution of m samples. The NLL and CE can then be written as the expectation of −log H_β[r_{Z|X}](Z | X) under F̂^{(m)}_{Z,X} and F_{Z,X}, respectively. Therefore, it is sufficient to consider the convexity in β of the expectation of
−(β log r_{Z|X}(Z | X) − log Σ_z r_{Z|X}(z | X)^β).  (14)
Recall that the sum of a linear function and a convex/concave function is convex/concave, and that a concave function is the negative of a convex function and vice versa. In Equation (14), the first term β log r_{Z|X}(Z | X) is linear in β. We then consider the convexity of the second term −log Σ_z r_{Z|X}(z | X)^β. This term can be rewritten as
−LSE(β ln r(0 | X), ..., β ln r(|Z| − 1 | X)) / ln 2,  (15)
where LSE denotes the log-sum-exponential function. Equation (15) is concave because the LSE function is well-known to be convex. We then prove that it is strictly concave in β with probability 1. We first investigate the condition under which an LSE function is strictly convex. Let y ∈ R^n be an n-dimensional real vector. For a twice-differentiable function f : R^n → R, let ∇²f denote the Hessian matrix of f. Let v = (Σ_i e^{y_i})^{−1} (e^{y_1}, e^{y_2}, ..., e^{y_n})^T. We then have ∇²LSE(y) = diag(v) − vv^T. Note that 1^T ∇²LSE(y) 1 = 0 because 1^T v = 1, where 1 = (1, 1, ..., 1)^T is the n-dimensional all-ones vector. Since rank(diag(v)) = n and rank(vv^T) = 1, the rank of the Hessian matrix ∇²LSE(y) is n − 1. In other words, u^T ∇²LSE(y) u > 0 for any vector u that is linearly independent of 1. Here, the argument of the LSE in Equation (15) is β (ln r(0 | X), ..., ln r(|Z| − 1 | X)), and the direction vector (ln r(0 | X), ..., ln r(|Z| − 1 | X))^T is almost surely linearly independent of 1 because r ∈ R. Thus, β log r_{Z|X}(Z | X) − log Σ_z r_{Z|X}(z | X)^β is almost surely a strictly concave function of β. Because the expectation of a convex/concave function is convex/concave [BBV04, Section 3.2.1], the expectation of Equation (14) is a strictly convex function of β.
A strictly convex function has at most one stationary point, at which it takes its unique minimum. For a given r, there are two cases: a minimum of L(H_β[r_{Z|X}]) over β > 0 exists, or it does not. If min_β L(H_β[r_{Z|X}]) exists for β > 0, then CE*(r_{Z|X}) < H(Z) and J*_r(Z; X) > 0. This is because, owing to the convexity, there exists some β such that L(H_β[r_{Z|X}]) < lim_{β→0} L(H_β[r_{Z|X}]) = n_z. Therefore, in this case, we can conclude that an attack using such r would succeed for some number of traces m that satisfies the SR-EPI inequalities (11) and (12).
We then consider the other case. If min_β L(H_β[r_{Z|X}]) does not exist, then L(H_β[r_{Z|X}]) is monotonically increasing on β ∈ (0, ∞) owing to its convexity. Therefore, according to Proposition 3, inf_β L(H_β[r_{Z|X}]) = n_z because lim_{β→0} L(H_β[r_{Z|X}]) = n_z for any r, which is equivalent to J*_r(Z; X) = 0. Thus, such a conditional probability distribution exploits as little information about the secret intermediate variable from the side-channel leakage as a uniform distribution over Z; therefore, an attack using this conditional probability distribution would fail. Note that, if J*_r(Z; X) = 0, then ξ(SR_m(r)) = 0 according to the SR-EPI inequality (11), which implies SR_m(r) = 1/2^{n_k} for any m. In contrast, the conventional PI cannot guarantee an attack failure even if J_r(Z; X) = 0, as discussed in Section 3, whereas EPI would (relatively) correctly evaluate the attack performance of a probability distribution.
This property, namely that EPI is always non-negative (as J*_r(Z; X) = H(Z) − CE*(r_{Z|X}) with sup_r CE*(r_{Z|X}) = n_z), is consistent with the intuitive meaning of EPI, whereas the conventional PI defined in Equation (5) can take a negative value. In summary, the existence of min_β L(H_β[r_{Z|X}]) for β > 0 is a sufficient condition for attack success from the viewpoint of EPI, whereas our EPI-based SR evaluation also indicates that an attack using a conditional probability distribution that does not satisfy this condition would fail.
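The convexity result above can be checked numerically. The sketch below is a minimal NumPy illustration with synthetic data; the conversion H_β raises each conditional probability to the power β and renormalizes over z, and the β-parameterized NLL is evaluated on a grid. All function and variable names here are ours for illustration, not from the paper's artifacts.

```python
import numpy as np

def h_beta(r, beta):
    """H_beta[r]: raise each conditional probability to the power beta
    and renormalize over z (done in the log domain for stability)."""
    logp = beta * np.log(r)
    logp -= logp.max(axis=1, keepdims=True)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def nll_bits(r, z):
    """Empirical NLL (cross-entropy approximation) in bits."""
    return -np.log2(r[np.arange(len(z)), z]).mean()

rng = np.random.default_rng(0)
m, n_cls = 500, 8                            # 8 classes, i.e., log2(8) = 3 bits
r = rng.dirichlet(np.ones(n_cls), size=m)    # a random conditional distribution
z = rng.integers(0, n_cls, size=m)           # labels

betas = np.linspace(0.05, 3.0, 60)
losses = np.array([nll_bits(h_beta(r, b), z) for b in betas])
# Discrete convexity check: second differences of a convex function sampled
# on a uniform grid are (numerically) non-negative.
second_diff = losses[2:] - 2.0 * losses[1:-1] + losses[:-2]
```

As β → 0, H_β[r] approaches the uniform distribution over Z, so the loss tends to log2 |Z| (3 bits in this toy setting); strict convexity in β means there is at most one minimizing β.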

Suitability of ECE/EPI for SR evaluation
Using Lemma 1, we can prove that the order of the ranks of the key candidates is invariant under the conversion H_β (as stated in Theorem 2); thus, SR/GE is invariant to β. EPI/ECE is a metric that addresses this uncertainty. If there existed other conversions of the conditional probability distribution that preserve the order of key ranks, EPI/ECE might not be able to accurately evaluate the SR using Inequalities (11) and (12). Fortunately, we can prove that no such conversion exists other than H_β, which establishes that ECE/EPI is appropriate for SR evaluation with regard to probability distribution conversions that preserve the order of key ranks. For the proof, we first define an equivalence relation, a quotient set, and an order to represent key ranks.

Definition 4 (Key rank order). Let a = (a_1, a_2, …, a_m), b = (b_1, b_2, …, b_m) ∈ R^m. We define an equivalence relation ∼ on R^m by a ∼ b if and only if Σ_j a_j = Σ_j b_j, and define an order ⪯ on the quotient set R^m/∼ by [a] ⪯ [b] if and only if Σ_j a_j ≤ Σ_j b_j, which is a strict total order between distinct equivalence classes.
This equivalence relation, quotient set, and order represent key ranks, because key ranks are calculated from NLLs, namely, sums of negative log-probabilities (i.e., real numbers) of the form Σ_j a_j. We then introduce Lemma 4.

Lemma 4. Let F : R^m/∼ → R^m/∼; (a_1, a_2, …, a_m) ↦ (f(a_1), f(a_2), …, f(a_m)) be a function defined using a function f : R → R, and assume that F is well-defined as a function from R^m/∼ to R^m/∼. F is an order automorphism on (R^m/∼, ⪯) if and only if f is given by a linear polynomial f(a) = βa + γ, where β is a positive real number and γ is a real number.

Proof. Let S : R^m/∼ → R; [a] ↦ Σ_j a_j, which is an order isomorphism from (R^m/∼, ⪯) to (R, <). F is an order automorphism if, for any a and b, [a] ⪯ [b] is equivalent to F([a]) ⪯ F([b]). Hence, if S ∘ F : R^m/∼ → R is an order isomorphism from (R^m/∼, ⪯) to (R, <), then F is an order automorphism. Therefore, we show that S ∘ F is an order isomorphism if and only if f(a) = βa + γ (β > 0).
(⇒) An order automorphism on (R, <) is always a strictly monotonically increasing function. According to the assumption that S ∘ F is an order isomorphism, there exists a strictly monotonically increasing function g such that g(Σ_j a_j) = Σ_j f(a_j). Let h(a) = f(a) − f(0) and e(a) = g(a) − mf(0). We then have e(Σ_j a_j) = Σ_j h(a_j). In addition, e = h holds because e(a_1) = e(a_1 + 0 + ··· + 0) = h(a_1) + (m − 1)h(0) = h(a_1), noting h(0) = 0. Here, h (and e) is a conditional solution of Cauchy's functional equation h(a_1 + a_2 + ··· + a_m) = h(a_1) + h(a_2) + ··· + h(a_m), where the condition is that e is a monotone function. Therefore, h(a) is given by h(a) = βa for some real number β, and β > 0 because e is strictly monotonically increasing. By letting f(0) = γ, we conclude f(a) = βa + γ.
Using Lemma 4, we prove Theorem 2.

Theorem 2. Let S_a be a trace dataset for the attack. Let r_{Z|X}, r'_{Z|X} ∈ R be conditional probability distributions. Then, for all k ∈ K and m ∈ N, we have rank(k, m, r) = rank(k, m, r') if and only if there exists a positive real number β such that r'_{Z|X} = H_β[r_{Z|X}].

Proof. The sufficiency is obvious from Lemma 1. We prove the necessity; that is, if the order of the ranks of the key candidates is identical for r and r', then r' = H_β[r] for some β. Because L^{(k)}(r) is given by a sum of m real numbers (i.e., negative log-outputs of the conditional probability distribution), the ranks correspond to the strict total order defined in Definition 4. Since the order of the ranks (namely, of the NLLs of the key candidates) is preserved by assumption, Lemma 4 states that the conversion applied to −log r(z | x) must be a linear polynomial a ↦ βa + γ with β > 0, which corresponds to the conversion r ↦ H_β[r] after normalization.

Theorem 2 states that the proposed metrics are the most appropriate for SR and GE evaluation among all conversions of the probability distribution that preserve the order of key ranks. However, if there were a conversion that preserves SR but does not preserve the order of key ranks, ECE and EPI would not be guaranteed to be unique to an SR and appropriate for SR evaluation. If no such conversion exists, then ECE and EPI are truly unique in terms of SR evaluation. The analysis of the existence of such a conversion is an important direction for future work.
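As a numerical sanity check of the sufficiency direction, one can verify that applying H_β leaves NLL-based key ranks unchanged. The sketch below uses hypothetical toy data (random distributions and random per-key intermediate values); the names are illustrative, not the paper's code.

```python
import numpy as np

def h_beta(r, beta):
    """H_beta[r]: power-beta rescaling with renormalization over z."""
    p = r ** beta
    return p / p.sum(axis=1, keepdims=True)

def key_ranks(r, z_per_key):
    """Rank key candidates by their summed NLL (rank 0 = best candidate)."""
    m = z_per_key.shape[1]
    logp = np.log(r[np.arange(m), z_per_key])  # (n_keys, m) log-probabilities
    nll = -logp.sum(axis=1)
    return np.argsort(np.argsort(nll))         # smaller NLL -> smaller rank

rng = np.random.default_rng(1)
n_keys, m, n_cls = 16, 64, 8
r = rng.dirichlet(np.ones(n_cls), size=m)
# z_per_key[k, j]: hypothesized intermediate value under key candidate k, trace j.
z_per_key = rng.integers(0, n_cls, size=(n_keys, m))

ranks_raw = key_ranks(r, z_per_key)
ranks_cal = key_ranks(h_beta(r, 2.5), z_per_key)
```

Under H_β, each key candidate's NLL becomes β · NLL + C, where C is independent of the key; hence the ordering, and therefore SR/GE, is unchanged for any β > 0.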

Computation of ECE/EPI in practice
In this subsection, we describe how to evaluate the ECE/EPI of a probability distribution using a given dataset. Recall that CE cannot be calculated directly in practice, and NLL is used as its approximation. This indicates that, to calculate inf_β CE(H_β[r_{Z|X}]), we need to approximate it as

inf_β L(H_β[r_{Z|X}]) = inf_β (1/m) Σ_{j=1}^{m} ( −β log r_{Z|X}(z_j | x_j) + log Σ_z r_{Z|X}(z | x_j)^β ).  (16)

Figure 1: Our NN architecture used in experiment.
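The β-optimization above is a one-dimensional minimization of a function that is convex in β, so a simple bracketing search suffices. The sketch below is an alternative to the Newton-Raphson method used in the paper's experiments, under the same convexity assumption; the toy model and all names are illustrative.

```python
import numpy as np

def nll_hbeta_bits(r, z, beta):
    """NLL (in bits) of labels z under H_beta[r]:
    -log2 H_beta[r](z|x) = (-beta*ln r(z|x) + ln sum_z' r(z'|x)^beta) / ln 2."""
    logr = np.log(r)
    lse = np.log(np.exp(beta * logr).sum(axis=1))
    picked = logr[np.arange(len(z)), z]
    return ((-beta * picked + lse) / np.log(2)).mean()

def beta_optimized_nll(r, z, lo=1e-3, hi=20.0, iters=100):
    """Approximate inf_beta NLL(H_beta[r]) by ternary search;
    valid because the loss is convex in beta."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if nll_hbeta_bits(r, z, m1) < nll_hbeta_bits(r, z, m2):
            hi = m2
        else:
            lo = m1
    return nll_hbeta_bits(r, z, 0.5 * (lo + hi))

rng = np.random.default_rng(2)
m, n_cls = 400, 8
z = rng.integers(0, n_cls, size=m)
# An over-confident but informative toy model: mass 0.9 on a noisy copy of z.
noisy = np.where(rng.random(m) < 0.7, z, rng.integers(0, n_cls, size=m))
r = np.full((m, n_cls), 0.1 / (n_cls - 1))
r[np.arange(m), noisy] = 0.9

ece_loss = beta_optimized_nll(r, z)    # approximates the ECE loss
raw_loss = nll_hbeta_bits(r, z, 1.0)   # raw NLL (beta = 1)
```

The EPI estimate then follows as log2 |Z| − ece_loss (here |Z| = 8, i.e., 3 bits); for this over-confident model, the β-optimized loss is strictly smaller than the raw NLL.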
The first experiment demonstrates that the proposed metrics can be used to evaluate the SR and to measure the generalization of the NN model in terms of SCAs. The second experiment demonstrates that EPI is also useful for comparing the performance of several models. Our metrics enable us to select a model with good performance without the intensive calculation of SR.

Experimental setup
We demonstrate the validity of the proposed metrics through experimental attacks on masked AES software and hardware implementations. For the experiment, we employ DL-SCA, which is considered one of the strongest attacks with a distinguishing rule that directly utilizes a conditional probability distribution, the main focus of this study. The experiment also demonstrates that the proposed metrics can be used to measure the generalization of the NN model in terms of SR during the attack phase. As a masked software implementation, we employed the ASCAD dataset, one of the most common datasets for evaluating DL-SCA [BPS + 20]. For the attack on the ASCAD dataset, we employed the publicly available NN model presented by Zaid et al. [ZBHV19], which was developed for DL-SCA. For the training, we used categorical CE as the loss function, Adam as the optimizer, and set the learning rate to 0.001.
As a masked hardware implementation, we used an open-source masked AES hardware design based on the threshold implementation (TI) [git21], which was presented in [UHA17]. We synthesized the masked AES hardware as is (i.e., without breaking the hierarchy), implemented it on a Xilinx Kintex-7 FPGA on a SAKURA-X board, and acquired its side-channel traces through an on-board co-axial connector. We used a Keysight DSOX6004A oscilloscope and set the sampling rate to 455 MSa/s. We used one million traces with random secret keys and plaintexts for the NN training. The target hardware is a byte-serial implementation, which means that we would have to guess two consecutive key bytes to employ an XOR-based selection function in a practical attack. However, for simplicity, we consider one byte known and attack the other byte in this experiment. Hence, the partial key length in the attack is n_k = 8 for both the AES software and hardware implementations in our experiment.
We tried many NN architectures/hyperparameters to apply DL-SCA to the above TI-based AES hardware, and employed the most successful one for the experiment. In fact, we found it difficult to achieve a successful key recovery from the TI-based AES hardware using common NN models in DL-SCA, such as the ASCAD and Zaid et al.'s models [BPS + 20, ZBHV19]. Figure 1 illustrates the NN architecture finally used in our experiment. In the figure, r × c indicates the size of each feature map or kernel, where r is the length of each filter and c is the number of channels. Table 1 summarizes our NN hyperparameters. We used CUDA 11.4, cuDNN 8.2.4, and TensorFlow 2.6.0 for the training. We used NLL as the loss function, and set the learning rate, batch size, and number of epochs to 0.0001, 512, and 1,500, respectively.

Experimental results
Let q_θ denote the probability distribution of the NN output with parameters θ. Figure 2 and Figure 3 report the experimental results, where the horizontal axis is the number of training epochs. In Figure 2(a) and Figure 3(a), the red curve denotes the raw NLL, whereas the blue curve denotes the β-optimized NLL loss (i.e., inf_β L(H_β[q_θ]) in Equation (13) or its summation form (16)), which is an approximation of the ECE loss. Figure 3(a') magnifies the blue curve of Figure 3(a) in its range. Figure 2(b) and Figure 3(b) show the number of traces required to achieve SR_m = 0.9, where the red curve is the empirical result and the blue curve is the estimate obtained using the proposed metrics with the SR-EPI inequality (11). Note here that the β used for the β-optimized NLL in Figure 2 and Figure 3 is estimated using the Newton-Raphson method. As we need to evaluate the SR for many epochs, the use of EPI yields a significant advantage in computational cost over the conventional empirical SR evaluation. We confirm that the red and blue curves in Figure 2(b) and Figure 3(b) are similar in shape, which indicates that the proposed method can appropriately evaluate, in an analytical manner using the SR-EPI inequality (11), the lower bound of the number of traces required for attack success (or, conversely, the SR upper bound) for a given probability distribution q_θ. In particular, we also confirm that the model at the epoch with the minimum β-optimized NLL loss (i.e., 90 to 100 epochs for Figure 2 and around 720 epochs for Figure 3) achieves the highest attack performance (i.e., succeeds with the smallest number of traces). This implies that the β-optimized NLL loss, which approximates the ECE loss, can also be used to measure the generalization of the NN model in terms of SR maximization in this experiment, and to determine the timing of early stopping.
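For reference, the conventional empirical SR evaluation mentioned above amounts to Monte-Carlo key ranking: resample m-trace attack sets and count how often the correct key attains rank 0. The following is a minimal sketch on hypothetical toy data (a synthetic leakage model with mass 0.4 on the true intermediate value); the names are illustrative, not the paper's code.

```python
import numpy as np

def empirical_sr(r, z_per_key, correct_key, m, n_trials=200, seed=0):
    """Empirical SR_m: fraction of random m-trace attack sets in which the
    correct key attains the smallest NLL (ties broken by candidate index)."""
    rng = np.random.default_rng(seed)
    n_keys, n_total = z_per_key.shape
    logp = np.log(r[np.arange(n_total), z_per_key])  # (n_keys, n_total)
    hits = 0
    for _ in range(n_trials):
        idx = rng.choice(n_total, size=m, replace=False)
        nll = -logp[:, idx].sum(axis=1)
        hits += int(nll.argmin() == correct_key)
    return hits / n_trials

rng = np.random.default_rng(3)
n_keys, n_total, n_cls = 16, 2000, 8
true_z = rng.integers(0, n_cls, size=n_total)
# Toy leakage model: mass 0.4 on the true intermediate value, rest uniform.
r = np.full((n_total, n_cls), 0.6 / (n_cls - 1))
r[np.arange(n_total), true_z] = 0.4
correct_key = 5
z_per_key = rng.integers(0, n_cls, size=(n_keys, n_total))
z_per_key[correct_key] = true_z  # the correct key predicts the true values

sr_small = empirical_sr(r, z_per_key, correct_key, m=1)
sr_large = empirical_sr(r, z_per_key, correct_key, m=60)
```

As expected, the empirical SR grows with the number of attack traces m; this resampling loop is exactly the costly per-epoch computation that the EPI-based estimate avoids.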
Furthermore, we confirm that the blue curves in Figure 2(a) and Figure 3(a) (and Figure 3(a')) do not exceed n_k = 8, as proven in Proposition 4. The attack did not succeed at epochs where the β-optimized NLL equaled n_k = 8, which is consistent with the discussion in Section 4. In contrast, in Figure 2(a), the raw NLL is always greater than the β-optimized NLL, which is likely to result in an underestimation of the attack performance. Note that the PI-based SR estimation is not guaranteed to remain a lower bound according to Proposition 3, whereas the EPI-based SR estimation provides a consistent lower bound of the true SR in accordance with Conjecture 1. The situation is more critical for Figure 3. The raw NLL is greater than n_k = 8 for most of Figure 3(a); therefore, the corresponding conventional PI becomes negative, which implies that the SR-PI inequality (6) is either inapplicable or significantly underestimates the attack performance, although the attack was actually successful at most epochs in our experiment. Thus, we confirm the validity, effectiveness, and usefulness of the proposed method.

Figure 4: PI, EPI, and empirical SR evaluation results of four models.

Model comparison
In this subsection, we experimentally calculate the SR, PI, and EPI for several models to confirm that our method can also be used as a performance metric for model selection, through an experimental attack on the ASCAD dataset with and without desynchronization. To enhance the models' performance, we employ "feature standardization" and "horizontal scaling between −1 and 1" for the ASCAD dataset with and without desynchronization, respectively. The model parameters are obtained from the GitHub repository released by Wouters et al. The other experimental conditions are the same as those in Section 5.2.1. Figure 4 reports the experimental results. In the figure, "ASCAD MLP" and "ASCAD CNN" correspond to the models proposed in [BPS + 20], and "Nmax" denotes the amount of desynchronization of the ASCAD dataset. The SR bars denote the number of traces required for a successful attack with 90% probability. The PI and EPI bars denote the estimated minimum number of traces required for attack success with 90% probability. The absence of a bar means that the number of traces required for a successful attack would be larger than 10,000.
First, comparing the results of SR and PI, the number of required traces estimated by PI becomes larger than 10,000 even when the attacks succeed with high probability. This would be because of the redundancy of CE/PI in terms of SR. Meanwhile, the figure shows that the proposed method never overestimates the number of required traces. In addition, the number of traces estimated by EPI is approximately proportional to the empirical result. Thus, we can compare the attack performance of models using the EPI-based method without calculating the SR.

Conclusion
In this study, we revisited the perceived information (PI) and presented new metrics to evaluate the SCA performance of a conditional probability distribution. We first showed that the conventional definitions of PI and cross-entropy (CE) contain an uncertainty in terms of SR evaluation, and are therefore non-calibrated and insufficient as metrics for evaluating the SCA performance (i.e., SR). We then presented new metrics, named effective CE/PI (ECE/EPI), to remove this uncertainty. Using ECE/EPI, we can measure the SR upper bound for a given probability distribution more accurately in an analytical manner using an SR-EPI inequality. ECE/EPI is easily calculated from a given probability distribution for SCA and a dataset, and can therefore be adopted in the context of DL-SCA. We experimentally validated the effectiveness of the proposed method through DL-SCAs on masked AES software and hardware implementations. The experimental results validated our claims and revealed that the proposed metrics can be used to measure the generalization of the NN model in terms of SR maximization. In some ways, the proposed metrics provide a solution to open problems in DL-SCA: the relationship between a DL evaluation metric (i.e., loss) and SCA evaluation metrics (i.e., SR/GE), and the difficulty of measuring generalization and determining the timing of early stopping through the loss value during training. In the future, we will conduct further validation of the proposed metrics using other datasets/implementations. It is also important to prove the nonexistence/existence of a probability distribution conversion that preserves SR but does not preserve the order of key ranks, to reveal whether ECE and EPI are truly unique to an SR and the most appropriate for SR evaluation.
The side-channel trace dataset for our experiment on the masked AES hardware is available at https://github.com/ECSIS-lab/perceived_information_revisited.
To prove Proposition 3, we introduce the following two lemmas.
Lemma 5 (Extension of Lebesgue's dominated convergence theorem). Let Λ be a subset of R ∪ {−∞, ∞}, and let b ∈ Λ̄ be a point of the closure Λ̄ of Λ. Let {X_λ}_{λ∈Λ} be a family of random variables. Suppose that lim_{λ→b} X_λ = X holds almost surely, where X denotes a random variable, and that there exists an integrable random variable Y such that, for all λ ∈ Λ, |X_λ| ≤ Y almost surely. We then have EX_λ → EX as λ → b.
Proof. Let {λ_i}_{i=1}^∞ ⊂ Λ be any sequence converging to b. We have lim_{i→∞} X_{λ_i} = X almost surely because lim_{λ→b} X_λ = X almost surely. Therefore, from Lebesgue's dominated convergence theorem, we have lim_{i→∞} EX_{λ_i} = E lim_{i→∞} X_{λ_i} = EX. Since this holds for any sequence {λ_i}, we have EX_λ → EX as λ → b.
Lemma 6 (Extension of Fatou's lemma). Let Λ be a subset of R ∪ {−∞, ∞}, and let b ∈ Λ̄ be a point of the closure Λ̄ of Λ. Let {X_λ}_{λ∈Λ} be a family of random variables such that X_λ > 0 holds almost surely for all λ ∈ Λ. If lim inf_{λ→b} X_λ is measurable, we have lim inf_{λ→b} EX_λ ≥ E lim inf_{λ→b} X_λ.
Proof. Let {λ_i}_{i=1}^∞ ⊂ Λ be a sequence converging to b such that lim inf_{i→∞} EX_{λ_i} = lim inf_{λ→b} EX_λ. Note that we have lim inf_{λ→b} X_λ ≤ lim inf_{i→∞} X_{λ_i}. Hence, from Fatou's lemma, we have lim inf_{λ→b} EX_λ = lim inf_{i→∞} EX_{λ_i} ≥ E lim inf_{i→∞} X_{λ_i} ≥ E lim inf_{λ→b} X_λ.

We then prove Proposition 3.