k-Nearest Neighbors Algorithm in Profiling Power Analysis Attacks

Power analysis presents the typical example of successful attacks against trusted cryptographic devices such as RFID (Radio-Frequency IDentification) and contact smart cards. In recent years, the cryptographic community has explored new approaches in power analysis based on machine learning models such as Support Vector Machine (SVM), Random Forest (RF) and Multi-Layer Perceptron (MLP). In this paper, we make an extensive comparison of machine learning algorithms in power analysis. For this purpose, we implemented a verification program that always chooses the optimal settings of the individual machine learning models in order to obtain the best classification accuracy. In our research, we used three datasets. The first contains the power traces of an unprotected AES (Advanced Encryption Standard) implementation. The second and third datasets were created independently from publicly available power traces corresponding to a masked AES implementation (DPA Contest v4). The results obtained revealed some interesting facts, namely that the elementary k-NN (k-Nearest Neighbors) algorithm, which has not been commonly used in power analysis yet, shows great application potential in practice.


Introduction
Power analysis (PA) measures and analyzes the power consumption of cryptographic devices depending on their activity. The main goal of PA is to determine sensitive information from the measured power consumption and to use the information obtained to abuse the cryptographic device. There are two basic methods of power analysis: simple and differential. Simple power analysis (SPA) tries to determine the sensitive information (mostly the encryption key stored in the device) more or less directly from a single measured power trace. A typical example of SPA is the attack aimed at an implementation of the asymmetric cryptographic algorithm RSA (Rivest Shamir Adleman), where differences in power consumption reveal the private key [1] (implementation of the square-and-multiply algorithm). The goal of differential power analysis (DPA) attacks is to reveal the secret key of the cryptographic device based on a large number of power traces that have been recorded while the device was encrypting various input data. The basic principle of DPA was introduced on the DES (Data Encryption Standard) algorithm using a statistical method based on the Difference of Means [2]. A general description of PA attacks is presented in [3], [4].
From a different perspective, we can divide power analysis attacks into two main categories, namely profiling and non-profiling attacks. In profiling attacks, an adversary needs physical access to a pair of identical (or similar) devices, which we call the profiling device and the target device. Basically, these attacks consist of two phases (a profiling phase and an attack phase). In the first phase, the adversary analyzes the profiling device in order to approximate its leakage behavior, and in the second phase, the adversary attacks the target device. Typical examples are the template-based attack (TA) [5], [6] and the Stochastic Approach (SA) [7], [8]. The practical aspects of template attacks have been discussed in [9], [10]. The profiling phase of TA was improved in [6], [11], [12]. The most crucial step of these profiling attacks lies in the selection of the interesting points. Various techniques are used to localize interesting points that provide information about the data processed, e.g. Normalized Inter-Class Variance (NICV) [13], Sum Of Squared pairwise Differences (SOSD) [14], Sum Of Squared pairwise T-differences (SOST) [14], Principal Components Analysis (PCA) [11], [15] or Pearson Correlation (CPA) [3]. By contrast, non-profiling attacks are one-phase attacks that are performed directly on the target device (which represents a more realistic scenario in practice). The attacker measures a set of power traces for known plaintexts and compares these power traces with the hypothetical power consumption calculated for every secret key hypothesis using a power consumption model [4], [16], [17]. Only the correct key hypothesis shows a dependency between the calculated hypothetical power consumption and the measured power consumption. DPA based on the correlation coefficient and on the Hamming weight power consumption model represents a typical example of non-profiled attacks aimed at smart card implementations [3], [18].
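The non-profiled, correlation-based attack described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the leakage is simulated, noise-free, and assumed to equal the Hamming weight of plaintext XOR key (no S-box), and all function names are hypothetical.

```python
def hamming_weight(value: int) -> int:
    """Number of set bits in a byte."""
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def simulate_leakage(plaintexts, key):
    """Assumed leakage model: one noise-free sample per trace, HW(plaintext XOR key)."""
    return [hamming_weight(p ^ key) for p in plaintexts]


def cpa_recover_key(plaintexts, measured):
    """Return the key hypothesis whose predicted leakage correlates best with the traces."""
    best_key, best_corr = None, float("-inf")
    for guess in range(256):
        predicted = [hamming_weight(p ^ guess) for p in plaintexts]
        corr = pearson(predicted, measured)
        if corr > best_corr:
            best_key, best_corr = guess, corr
    return best_key, best_corr


plaintexts = list(range(256))          # known plaintext bytes
secret_key = 0x3C                      # hypothetical key byte hidden in the "device"
traces = simulate_leakage(plaintexts, secret_key)
recovered_key, correlation = cpa_recover_key(plaintexts, traces)
print(hex(recovered_key), round(correlation, 3))
```

In this idealized setting only the correct hypothesis reaches a correlation of about 1; real traces contain noise and many samples per trace, so the correlation would be computed per sample point.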
In order to prevent power analysis attacks, one can implement countermeasure techniques. The goal of every countermeasure is to make the power consumption of a cryptographic device independent of the intermediate values that are currently processed. Generally, countermeasure techniques are divided into two basic groups, hiding and masking. In the masking approach, each intermediate value is concealed by a random value called a mask. Various masking methods for the AES algorithm have already been proposed [19][20][21][22]. By contrast, hiding tries to break the link between the power consumption and the data values processed [3]. Clearly, the implementation of countermeasures brings overhead in terms of memory and time; therefore researchers have started to look for lightweight alternatives. One of these lightweight countermeasures is Rotating Sbox Masking (RSM), which is a type of Low-Entropy Masking Scheme [23][24][25]. The main idea is based on the usage of precomputed table look-ups [26], while the overhead is reduced by carefully choosing a limited mask set [27]. This essentially allows S-boxes to be reused and reduces the computation of mask compensation, because only 16 possible masks are applied. The set of chosen masks can be a public parameter; however, this set should be shifted by a random offset before each encryption. We refer interested readers to [23], [25], [28], where more details on RSM and its security analysis are provided. RSM has been studied by researchers worldwide within the framework of DPA Contest v4 [29]. DPA Contest is an international framework, organized by the French university Telecom ParisTech, that allows researchers to compare their power analysis attacks under the same conditions.
To complete the introduction to power analysis, we note that an adversary can bypass masking techniques by using several intermediate values to calculate the hypothesis. These types of attacks are called higher-order DPA attacks [30], [31]. Higher-order DPA attacks exploit the joint leakage of several intermediate values that occur inside the cryptographic device. In this paper, we do not take this possibility into account.
Machine learning (ML) as a scientific discipline explores the construction of algorithms that can improve their performance based on previous experience or training [32]. Most machine learning problems deal with the classification of various input data. In general, machine learning approaches can be classified as supervised [33] and unsupervised learning [34]. Intuitively, in supervised learning, the machine is presented with a set of labeled training data, and the goal is to determine the general function that associates the data with the labels. In unsupervised learning, the machine is presented with a set of unlabeled data and tries to determine the hidden structure of the data. From the description above, one can clearly see an analogy between machine learning approaches and power analysis attacks. More specifically, profiling attacks are a supervised learning problem, where ML techniques are used to create a model of the target device. Generally, the model created in power analysis is based on the multivariate normal distribution. This fact is based on the following simple analysis of a power trace: one can analyze the measured power consumption by looking at a single point of a power trace (i.e. at the power consumption of a cryptographic device at a fixed moment in time). For this point, we can determine the probability distribution, which depends on the processed data. Generally, it is difficult to make a statement about the data dependency of the power consumption of cryptographic devices. However, for most cryptographic devices, it is valid to approximate the distribution of the data dependency of the power consumption by a normal distribution. Moreover, the power consumption of cryptographic devices is mostly proportional to the Hamming weight¹ or the Hamming distance² of the data processed. In these cases, the distribution is composed of nine normal distributions (one for each possible Hamming weight of a byte, 0 to 8) with different means and approximately the same standard deviation. In order to consider the correlation between more points in the power trace, it is necessary to model a measured power trace as a multivariate normal distribution, which constitutes a generalization of the normal distribution to higher dimensions. We refer the reader to the book [3], where the authors carried out a complete analysis of the statistical characteristics of power traces.
On the other hand, non-profiling attacks can be seen as unsupervised learning. Instead of statistical methods, one can apply ML in order to find the desired structures in the data. In this paper, we do not take this application of ML into account.

Related Work
In the field of power analysis, the possibility of using neural networks was first published in [35]. Naturally, this work was followed by other authors, e.g. [36], who dealt with the classification of individual power prints. These works are mostly oriented towards reverse engineering based on power print classification. Yang et al. [37] proposed an MLP in order to create a power consumption model of a cryptographic device in CPA. Lerman et al. [38], [39] compared a template attack with a binary machine learning approach based on non-parametric methods.
Hospodar et al. [40], [41] analyzed the SVM on a software implementation of a block cipher. Heuser et al. [42] created the general description of the SVM attack and compared this approach with the template attack. In 2013, Bartkewitz [43] applied a multi-class machine learning model which improves the attack success rate with respect to the binary approach. Moreover, they used a (linear) SVM as a preprocessing tool for feature selection, similarly to Brank [44]. Recently, Lerman et al. [45] proposed a machine learning approach that takes into account the temporal dependencies between power values. This method improves the success rate of an attack at a low signal-to-noise ratio with respect to classification methods. Another SVM-based attack was presented in [46], where the authors used an SVM to recover the secret key (bit by bit) by exploiting the leakage in the key permutation round.
¹ Typical for smart cards.
² Typical for microcontrollers.
Lerman et al. [47] presented a machine learning attack against a masking countermeasure, using the dataset of the DPA Contest v4. The method of power analysis based on a multi-layer perceptron was first presented in [48]. In this work, the authors used a neural network directly for the classification of the AES secret key. In [49], this MLP approach was optimized by preprocessing the measured power traces. Lerman et al. [50] introduced a semi-supervised template attack that combines supervised and unsupervised learning. The method was confirmed by experiments on an 8-bit microcontroller and by a comparison to a template attack. The authors of [51] proposed an unsupervised learning approach for PA aimed at the DES algorithm. Heyszl et al. [52] introduced the unsupervised cluster classification algorithm k-means to attack the cryptographic exponentiation of a public-key cryptographic system and recover secret exponents without any prior profiling. Note that the k-means algorithm should not be confused with the k-nearest neighbors algorithm. Zhang et al. [53] proposed a DPA attack based on the correlation coefficient using a genetic algorithm. Perin et al. [54] presented an attack based on a clustering algorithm that attacks the randomized exponentiation of the RSA algorithm. In [55], the k-NN algorithm was briefly mentioned as a possible mutual information estimator.
At the end of 2014, Dirmanto presented a small but concise overview of machine learning approaches in power analysis [56]. This survey paper summarizes the main theoretical aspects. During the CHES 2015 conference, Whitnall presented an unsupervised clustering algorithm (K-means clustering) for recovering a nominal power model [57]. The model was used for a key recovery attack with minimal requirements in the profiling phase; moreover, the approach was effective and robust across an extensive set of distortions.

Contribution
In previous works, individual machine learning (ML) approaches are compared mostly with the template attack or the stochastic attack (SA) [40], [41]. The ML approaches have not yet been compared with each other. The work [47] can be mentioned as an exception, where SVM and RF are compared with the TA and the SA. In this article, we try to make an extensive comparison of machine learning algorithms in PA. We focus only on the usage of the individual ML algorithms in profiling attacks, where ML techniques are used to create a model of the target device. We do not consider other possible applications such as structure searching, preprocessing or feature selection.
For our research, we implemented a verification program that always chooses the optimal settings of the individual ML models in order to obtain the best classification accuracy. Our research was based on three datasets. The first dataset contains the power traces of an unprotected AES implementation, where we classify one byte of the secret key. The second and third datasets were independently prepared from public datasets of power traces corresponding to the masked AES implementation (DPA Contest v4 [29]), where we classify the secret offset. We decided to use the first-order success rate as the comparison metric, because two of our datasets were focused on mask classification and guessing entropy is not suitable in this case³. Furthermore, we compared every ML approach with the template-based attack. In this research, we particularly wanted to answer the following questions:
• Which ML algorithm is the most suitable for profiling PA attacks?
• Are there any generally appropriate settings of the ML algorithms that can be used by a potential attacker for PA attacks?
• How big is the influence of the number of power traces and interesting points on the classification results of individual ML algorithms?
Nowadays, the method using the SVM is considered to be the most effective of the machine learning algorithms in power analysis. In many concrete attacks in which an adversary has only a limited number of power traces available, the SVM performs better than the classical template attack or the stochastic attack. Based on the results obtained, we propose a power analysis method based on the k-NN algorithm as the most effective method. Even though there is no "intelligence" in it, the algorithm shows great application potential in PA, because its usage provides some advantages for the attacker in comparison with the other machine learning approaches and the classic power analysis attacks. Moreover, we describe the general scheme of this method in profiling power analysis attacks.

k-Nearest Neighbors
In the previous section, we have already provided relevant references that deal with the well-known ML approaches in power analysis attacks (such as SVM, MLP and RF). Therefore, the following text focuses only on the description of the proposed approach based on the k-NN algorithm. We provide the general scheme of this method in profiling power analysis attacks.
Preliminaries: in the context of power analysis, a learning set Y (sometimes called a training set) with n instances and a test set X with m instances represent the measured power traces. Each instance $y_i$, $i = 1, \ldots, n$, of the learning set and each instance $x_j$, $j = 1, \ldots, m$, of the test set contains one assignment (a class label that determines which class the instance belongs to) and several attributes $y_i = (y_1, \ldots, y_N)$, $x_j = (x_1, \ldots, x_N)$ (features or observed variables). These attributes represent the interesting points of the power traces in time (samples). The learning set is used in the profiling phase of the profiled attack and the test set is used during the attack phase. In a profiling power analysis attack, the label represents the desired byte value of the secret key, i.e. 256 possible variants altogether (0 to 255). From the perspective of ML, we can see this problem as multi-class classification, where ML classifies the instance into one of 256 possible classes. The second method that is often used is to transform this problem into multi-label classification. Multi-label classification represents the problem of finding a model that maps inputs $x_j$ to binary output vectors $y_i$. There are two main methods for tackling the multi-label classification problem: problem transformation methods and algorithm adaptation methods [46], [60], [61]. Problem transformation methods (such as binary relevance or label powerset) transform the multi-label problem into a set of binary classification problems (two classes, 0 or 1) that can be handled by single-class classifiers. Algorithm adaptation methods adapt the algorithms to perform multi-label classification directly (e.g. multi-label neural networks).
In machine learning, the k-Nearest Neighbors algorithm is a non-parametric method used for classification and belongs to the simplest machine learning algorithms [62]. The training phase of the algorithm consists only of storing the learning set into the memory. In the classification phase, k is a user-defined constant (typically small), and a point of the test set is classified by assigning the label which is most frequent among the k training samples nearest to that classified point. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

[Figure: a typical example of k-NN classification. The classified point is assigned to the class of the black circles because 3 black circles and 2 stars are in the selected area (dashed-line outer circle).]
A commonly used distance metric for continuous variables is the Euclidean distance, defined as:

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$

where i runs over the N attributes in the learning set. The overlap metric or Hamming distance are other possible metrics for discrete variables. The best choice of k, which strongly depends on the learning set, is very important. Generally, a large k reduces the effect of noise on the classification but makes the boundaries between the classes less distinct. A suitable k can be selected by various heuristic techniques, for example hyperparameter optimization [63]. The following text describes the power analysis method based on k-NN.
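As a brief illustration of the distance metrics mentioned above, the following sketch implements the Euclidean distance and the overlap (Hamming) distance. The Manhattan and Chebyshev distances are included only because they are among the metrics tested later in this paper. The function names are illustrative and do not correspond to any particular library.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute attribute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def overlap(a, b):
    """Overlap (Hamming) distance for discrete attributes: number of differing positions."""
    return sum(1 for x, y in zip(a, b) if x != y)


p = [0.0, 3.0]
q = [4.0, 0.0]
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
print(overlap([1, 0, 1], [1, 1, 0]))
```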

Profiling Phase
In the attack based on k-NN, we assume that we can characterize the profiling device by labeling the measured data. One can implement a certain part of the cryptographic algorithm, execute the sequence of instructions on a profiling device with different data $d_i$ and different key values $k_j$, and record the power consumption. After measuring n power traces, we create the matrix $Y_n$ that contains the power traces corresponding to the pairs $(d_i, k_j)$. According to the key value, we add the label to the matrix $Y_n$. In the case of byte classification, the label can be expressed by eight columns, where every row represents a class using the binary expression 00000000 to 11111111 (every possible byte value from 0 to 255). The matrix $Y_n$ represents the learning set, which is stored into the memory.
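The construction of the labeled learning set can be sketched as follows. This is a minimal illustration, assuming that each trace has already been reduced to its interesting points and that the label bits are stored most significant bit first; the bit order, the sample values and the function names are assumptions made for the example.

```python
def byte_to_bits(value: int):
    """Binary expression of a byte, most significant bit first (assumed ordering)."""
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]


def build_learning_set(interesting_points, key_bytes):
    """Append the eight label columns to each row of interesting points."""
    if len(interesting_points) != len(key_bytes):
        raise ValueError("one key byte is required per trace")
    return [
        list(points) + byte_to_bits(key)
        for points, key in zip(interesting_points, key_bytes)
    ]


# Hypothetical measurements: three traces reduced to five interesting points each.
points = [
    [0.12, 0.40, 0.33, 0.91, 0.05],
    [0.10, 0.42, 0.35, 0.88, 0.07],
    [0.55, 0.13, 0.60, 0.20, 0.71],
]
keys = [0x00, 0x05, 0xFF]
learning_set = build_learning_set(points, keys)
for row in learning_set:
    print(row)
```

With five interesting points and eight label columns, each row has 13 entries.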

Attack Phase
During the attack phase, the adversary uses the stored learning set together with the power trace measured on the target device (denoted as $t = [x_1, \ldots, x_N]$) to determine the secret key value. Let us assume that we chose the following parameters for our k-NN algorithm: k = 5 and the Euclidean distance. The classification takes three steps:
• In the first step, the algorithm calculates the Euclidean distances of all stored training vectors $y_n$ to the vector t:

$$d(y_n, t) = \sqrt{\sum_{i=1}^{N} (y_{n,i} - x_i)^2}$$

• In the second step, the 5 closest training points are found according to the distances calculated.
• In the last step, the class is selected based on the majority vote.
The result of this classification is the most probable class based on training set Y. Since each training instance y n is associated with a secret key value, the adversary obtains the information about the secret key stored in the target device.
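The three steps of the attack phase can be sketched in a few lines. This is an illustrative sketch rather than the program used in our experiments: the two-dimensional training data are hypothetical, and ties in the vote are resolved in favor of the label encountered first among the sorted neighbors.

```python
import math
from collections import Counter


def knn_classify(learning_set, labels, trace, k=5):
    """Classify one trace by majority vote among its k nearest training traces."""
    # Step 1: Euclidean distance from every stored training vector to the trace.
    distances = [
        (math.dist(train, trace), label)
        for train, label in zip(learning_set, labels)
    ]
    # Step 2: keep the k closest training points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical learning set with two interesting points per trace and two classes.
training = [
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
]
classes = [0, 0, 0, 0, 1, 1, 1]

predicted = knn_classify(training, classes, [1.05, 1.0], k=5)
print(predicted)
```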

Discussion of Simple Application of k-NN in Power Analysis
In the following text, we provide a simple example of the attack based on k-NN using real power consumption data (we used only two dimensions, i.e. two interesting points were selected) in order to demonstrate the suitability and simplicity of the proposed approach.
Let us assume that the adversary wants to determine the Hamming weight (HW) of the processed data or of the secret key value. During the profiling phase, the adversary measures 2 560 power traces, 10 for every byte value (this means 10 power traces corresponding to HW = 0, 80 power traces corresponding to HW = 1 and so on). In the power traces, the adversary chooses two interesting points that leak information about the HW. The profiling phase is finished by storing the points into the memory of a computer. Figure 2 shows a scatter plot of the two chosen interesting points. The division of the two-dimensional space into nine groups according to the HW is clearly visible. We can approximate this data dependency of the power consumption by a bivariate normal distribution (we refer interested readers to the statistical analysis in the book [3]). In this case, the distribution is composed of nine normal distributions.
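The number of traces per Hamming weight class follows from the number of byte values that share each weight. The short sketch below reproduces the counts quoted above; the constant name is illustrative.

```python
from collections import Counter
from math import comb

TRACES_PER_BYTE_VALUE = 10

# Number of byte values with a given Hamming weight is the binomial coefficient C(8, hw).
traces_per_hw = {hw: comb(8, hw) * TRACES_PER_BYTE_VALUE for hw in range(9)}
total_traces = sum(traces_per_hw.values())

# Cross-check by counting the set bits of all 256 byte values directly.
values_per_hw = Counter(bin(value).count("1") for value in range(256))

print(traces_per_hw)
print(total_traces)
```

The classes are therefore strongly unbalanced: HW = 4 is represented by 700 traces, whereas HW = 0 and HW = 8 are represented by only 10 traces each.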
In the attack phase, the adversary measures a power trace from the target device and feeds the same interesting points to the k-NN algorithm. Figure 3 shows the classification process for an unknown point, marked with a black star, and for the parameter k = 5. All five nearest neighbors belong to the distribution marked in blue, which corresponds to HW = 0. The adversary thus obtains the desired information that the HW of the processed data is 0. Similarly, the adversary can continue with additional power traces in order to reveal the desired sensitive information. Clearly, if we focus on two points of the power consumption at fixed times, as in our example, and repeatedly measure the power consumption for constant data, then the measured points will appear more or less in the same area (group). A similar situation occurs for different data processed by the cryptographic device; therefore the points form clusters. It is widely known that the k-NN algorithm is one of many algorithms that are robust, simple and suitable for classification problems. Using this simple example, we wanted to demonstrate the classification problem in power analysis attacks and a simple application of ML algorithms.

Settings of Experiments
The following text summarizes the most important facts about the experimental setup and the verification program (implemented attacks). We created three datasets in order to test the chosen machine learning approaches. Based on the state of the art, we chose SVM, MLP, k-NN, DT (Decision Trees), RF and LDA (Linear Discriminant Analysis). The first dataset (DS1) is focused only on the classification of the first byte of the secret key. The dataset was prepared from power traces of an unprotected AES-128 implementation in our testbed. The cryptographic module was represented by a PIC 8-bit microcontroller, and for the power consumption measurement we used the CT-6 current probe and the Tektronix DPO-4032 digital oscilloscope. We used standard operating conditions with a 5 V power supply. The stored power traces had 100 000 samples and covered the AddRoundKey and SubBytes operations in the initialization phase of the algorithm. Our implementation was written in assembly language, and the executed instructions of the examined operations were exactly the same for every key byte (identical power prints). Therefore, it was possible to use the place where the first byte is processed to create a model with which the whole secret key could be determined byte by byte. We verified this assumption experimentally; it is naturally conditioned by the excellent synchronization of the measured power traces. Finally, we chose 5 interesting points based on the standard CPA method (note that we used the well-known CPA method to localize interesting points throughout our research). Each of our chosen interesting points leaked information about the Hamming weight of the processed data. We chose points that were at least one clock cycle apart from each other. This restriction, which keeps the interesting points (IP) from being too close, avoids numerical problems when inverting the covariance matrix during the template-based attack.
An example of the selected power traces is depicted in Fig. 4. The overall characteristics of the selected IP are depicted in Fig. 5 using a box plot. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually. It can be observed that the first dataset contains almost no noise. Consequently, our first dataset represents a 2 560 × 13 matrix where the last 8 values are labels. Each label is expressed by eight columns, where every row represents a class using the binary expression from 00000000 to 11111111.
The second dataset (DS2) is focused on mask classification and consists of 1 000 power traces. DS2 was prepared from electromagnetic traces that are freely available on the website of DPA Contest v4 [29]. The masked block cipher AES-256 in encryption mode, without any mode of operation, is implemented on the target cryptographic device, an Atmel ATMega-163-based smart card. The implemented masking scheme is a variant of Rotating Sbox Masking [23], [64]. According to the authors, this masking scheme keeps performance and complexity close to the unprotected scheme and is resistant to several side-channel attacks. The sixteen masks are public information and are incorporated in the computation of the algorithm. The offset value, which is drawn randomly at the beginning of the computation, is a secret value. The mask values are rotated according to the offset value [23], [64]. Each stored trace has 435 002 samples, is associated with the same secret key and corresponds to the first round and the beginning of the second round of the AES algorithm. For DS2, we chose only the points that are the most correlated with the secret offset value.
We performed a classical CPA on the plaintext blinding operation, which depends on the offset value, in order to locate the interesting points. We chose the 3 most highly correlated points for every mask value, so 48 interesting points were selected altogether. In other words, DS2 represents a 1 000 × 52 matrix where the last four values are labels. In our case, the label value corresponds to the offset value 0 to 15 (sixteen possible variants). An example of the selected power traces is depicted in Fig. 6. The overall characteristics of the selected interesting points are depicted in Fig. 7 using a box plot.
The third dataset (DS3) was created by Liran Lerman during the preparation of the attack in DPA Contest v4 [47]. This dataset is focused on mask classification, and we used the first 1 000 of the 1 500 traces available. The author chose 50 interesting points according to the Pearson correlation computed between each instance of the 1 500 traces and the offset value. In other words, our DS3 represents a 1 000 × 54 matrix where the last four values are labels. Again, the label value corresponds to the offset value 0 to 15 (sixteen possible variants). We refer the reader to [47] for more information about the original dataset. An example of the selected power traces is depicted in Fig. 8. The overall characteristics of the selected interesting points are depicted in Fig. 9.
It is well known that noise always poses a problem during power consumption measurement. To reduce electronic noise, every stored power trace in DS1 was calculated as the average of ten power traces measured using the digital oscilloscope. We refer the reader to the website [29] for the level of noise in DS2 and DS3.

Implemented Program
Figure 10 shows the main principle of the implemented verification algorithm⁴. The main part of the implemented program is the block denoted as Optimize Parameters. This block finds the optimal values of the selected parameters for the tested machine learning algorithms. In other words, it executes each model for all combinations of the user-selected parameter values and then delivers the optimal parameter values as a result. The specific parameters selected are described in more detail in the next section. The second important block of the program is the Cross-validation. Cross-validation (CV) is a standard statistical method for estimating the generalization error of a predictive model. In l-fold cross-validation, a training set is divided into l equal-sized subsets. Then a model is trained using the other (l − 1) subsets and its performance is evaluated on the current subset. This procedure is repeated for each subset. In other words, each subset is used for testing exactly once. The result of the cross-validation is the average of the performances obtained from the l rounds.
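The interplay of the two blocks can be sketched as an exhaustive grid search wrapped around an l-fold cross-validation. The sketch below is not the RapidMiner process used in this work: it uses a plain k-NN classifier, a small hypothetical parameter grid, synthetic data, and an interleaved assignment of instances to folds, all of which are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def knn_predict(train_x, train_y, sample, k, metric):
    """Majority vote among the k training points closest to `sample`."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: metric(pair[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, folds, k, metric):
    """Average accuracy over `folds` rounds; each subset is used for testing exactly once."""
    scores = []
    for fold in range(folds):
        test_idx = set(range(fold, len(xs), folds))  # interleaved, equal-sized subsets
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        hits = sum(
            knn_predict(train_x, train_y, xs[i], k, metric) == ys[i] for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def optimize_parameters(xs, ys, folds, grid):
    """Evaluate every parameter combination and keep the first one with the best accuracy."""
    best_accuracy, best_params = -1.0, None
    for k, (name, metric) in product(grid["k"], grid["metric"].items()):
        accuracy = cross_val_accuracy(xs, ys, folds, k, metric)
        if accuracy > best_accuracy:
            best_accuracy, best_params = accuracy, {"k": k, "metric": name}
    return best_accuracy, best_params


# Hypothetical, well-separated two-class data (20 instances, 2 attributes).
xs = [[0.1 * i, 0.05 * i] for i in range(10)] + [[5 + 0.1 * i, 5 - 0.05 * i] for i in range(10)]
ys = [0] * 10 + [1] * 10
grid = {"k": [1, 3, 5], "metric": {"euclidean": math.dist, "manhattan": manhattan}}

best_accuracy, best_params = optimize_parameters(xs, ys, folds=5, grid=grid)
print(best_accuracy, best_params)
```

Because the cost grows with the product of the grid sizes, the number of tested values per parameter has to be kept small in practice.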
In our verification program, we used the typical 10-fold cross-validation. We repeated the CV four or eight times in the Loop, because we created four or eight models for the classification of the individual bits, depending on the dataset. In other words, in our program we chose multi-label classification, where a model maps inputs $x_j$ to binary output vectors $y_i$ using single-class classifiers. The verification program returned two output values: the best parameters for the learning models and the accuracy obtained using these parameters. The accuracy was described using the typical confusion matrix.
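The bit-wise (binary relevance) scheme can be sketched as follows for a four-bit offset label: one binary decision is made per label column and the predicted bits are recombined into the offset value. The 1-nearest-neighbor rule used here is only a stand-in for whichever binary classifier is being tested, and the single-attribute training data are hypothetical.

```python
import math


def to_bits(value, width):
    """Binary label vector of `value`, most significant bit first (assumed ordering)."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def from_bits(bits):
    """Recombine a list of bits (most significant first) into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def nearest_neighbor_bit(train_x, train_bits, sample):
    """Stand-in binary classifier: return the bit of the closest training trace."""
    index = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], sample))
    return train_bits[index]


def classify_offset(train_x, train_offsets, sample, width=4):
    """Binary relevance: one independent binary decision per label bit, then recombine."""
    label_matrix = [to_bits(offset, width) for offset in train_offsets]
    predicted_bits = []
    for column in range(width):
        column_bits = [row[column] for row in label_matrix]
        predicted_bits.append(nearest_neighbor_bit(train_x, column_bits, sample))
    return from_bits(predicted_bits)


# Hypothetical learning set: one attribute per trace, one trace per offset value.
train_x = [[float(offset)] for offset in range(16)]
train_offsets = list(range(16))

predicted_offset = classify_offset(train_x, train_offsets, [9.2])
print(predicted_offset)
```

A classification counts as successful only if all bit models are correct at the same time, which is why the per-bit accuracies and the overall success rate can differ.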
The original implemented program contained a block called Forward selection that selected individual attributes of the datasets. In each round, this block can add an attribute, and the performance is estimated using the inner operators, e.g. a cross-validation. This configuration allows us to get the best result of the machine learning algorithms depending on the individual parameter settings and the attributes selected. In this way, we tested the influence of the number and the combination of selected attributes on the classification results.
⁴ We use RapidMiner for the implementation [65].

Selected Parameters
Obviously, the time required to run the implemented program strongly depends on the number of selected parameters of the individual ML approaches. In theory, one can test an infinite number of parameters, but this makes no sense in practice. For example, it is unnecessary to test the k-NN algorithm for k > 11, because the results are worse and the advantages of the algorithm are reduced (based on many years of practical experience with ML approaches). For these reasons, we chose only a limited number of parameters that are relevant and important for testing the individual ML algorithms. The selection of parameters was based on our experience with and knowledge of ML.
In order to test the SVM approach, we chose the complexity parameter from 0 to 50, epsilon from 0.1 to 1 and the kernel type radial, linear or polynomial; altogether this gave 3 333 combinations. To test decision trees (DT), we selected three parameters: the depth from 1 to 100, the confidence from 1.0e−7 to 0.5 and the criterion set to gain ratio, information gain, Gini index and accuracy (484 combinations). The MLP approach was tested with the following parameters: one hidden layer, the type of the activation function, the number of training cycles from 1 to 1 000, the learning rate and the neural network momentum, both from 0 to 1, and normalization true or false (5 324 combinations). During the testing of the k-NN algorithm, we selected different types of metrics (Euclidean, Canberra, Manhattan and Chebychev distance; Correlation, Cosine, Dice, MaxProduct, Overlap and Jaccard similarity), the parameter k = 1, 3, ..., 11 and weighted vote (true or false). Altogether, only 132 combinations were tested. Testing of the Random Forest (RF) model involved the following parameters: the depth from 1 to 100, the confidence from 1.0e−7 to 0.5, the criterion set to gain ratio, information gain, Gini index and accuracy, and the number of trees from 1 to 500 (5 324 combinations). As the last tested machine learning algorithm, we included linear discriminant analysis (LDA) in its default setting.
To complete our comparison, we implemented the classical template attack. We were also interested in the efficient template attack based on the pooled covariance matrix [6]; therefore we calculated the pooled covariance matrix as the average of all covariance matrices and evaluated the probability density function (equation (4)) with this matrix. The template attacks were implemented according to equation (4):

$$p(t; (m, C)) = \frac{1}{\sqrt{(2\pi)^{N_I} \det(C)}} \exp\left(-\frac{1}{2} (t - m)^{T} C^{-1} (t - m)\right) \qquad (4)$$

where (m, C) represents the templates prepared in the profiling phase, based on a multivariate normal distribution that is fully defined by a mean vector m and a covariance matrix C. The power trace measured on the target device is denoted as t, and N_I is the number of interesting points. In the following text, the classical template attack and the template attack based on the pooled covariance matrix are denoted as T_cls and T_pool, respectively. In the first experiment, we did not include a reduced template attack, because an adversary who does not consider the covariance matrix loses information about the relationship between the interesting points. All template attack implementations were realized in the Matlab environment.
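The difference between the classical and the pooled template attack can be sketched for the special case of two interesting points, where the covariance matrix can be inverted by hand. This is an illustrative sketch, not the Matlab implementation used in the experiments: the function names are hypothetical, equation (4) is assumed to be the usual multivariate normal density, and the log-density is used to avoid very small numbers.

```python
import math

def mean_vec(traces):
    """Mean of a list of 2-point traces."""
    n = len(traces)
    return [sum(t[i] for t in traces) / n for i in range(2)]

def cov2(traces, m):
    """Unbiased 2x2 sample covariance matrix around the mean m."""
    n = len(traces)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for t in traces:
        d = [t[0] - m[0], t[1] - m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / (n - 1)
    return c

def log_pdf(t, m, c):
    """Log of the bivariate normal density (N_I = 2)."""
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    d = [t[0] - m[0], t[1] - m[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * quad - 0.5 * math.log((2 * math.pi) ** 2 * det)

def build_templates(profiling):
    """profiling maps each class to its list of traces; returns means, covs, pooled."""
    means = {k: mean_vec(v) for k, v in profiling.items()}
    covs = {k: cov2(v, means[k]) for k, v in profiling.items()}
    # Pooled covariance: plain average of the per-class covariance matrices.
    pooled = [[sum(c[i][j] for c in covs.values()) / len(covs) for j in range(2)]
              for i in range(2)]
    return means, covs, pooled

def classify(t, means, covs):
    """Return the class whose template gives the highest log-density for trace t."""
    return max(means, key=lambda k: log_pdf(t, means[k], covs[k]))
```

Passing the per-class matrices gives the classical variant, `classify(t, means, covs)`; passing the same pooled matrix for every class, `classify(t, means, {k: pooled for k in means})`, gives the pooled variant.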

Results Evaluation
The implemented program provided two outputs: the accuracy and the best parameters selected for each ML algorithm. The accuracy was calculated from a standard confusion matrix. Interested readers can consult [66] for additional explanations of performance measures for classification, e.g. confusion matrix, precision and recall. Examples of confusion matrices of the DS1 classification for the SVM-rbf and k-NN algorithms are shown in Tabs. 1 and 2. In these tables, accuracy is the arithmetic mean of the accuracies obtained from the 8 × 10 cross-validation for the individual models. The σ value represents their standard deviation. The value denoted as mikro is the accuracy computed from the confusion matrix itself; in other words, it is the success rate calculated over all of the 20 480 experiments carried out on DS1. It is not possible to present every confusion matrix obtained in this paper, therefore we present the results based on the mikro value of the success rate. In our first experiment, we verified that the influence of the Forward selection block on the resulting success rate is very low. If the interesting points are selected correctly, the algorithm always chooses the maximum number of attributes. This conclusion is not surprising, because the selection of interesting points from power traces is crucially important in profiled attacks, and we used the well-known and verified CPA method to localize interesting points during dataset preparation. Based on the results obtained, we skipped this block and always chose the maximum number of attributes in the following experiments.
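The relation between the three reported values (mean accuracy, σ and mikro) can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def accuracy(cm):
    """Accuracy of a square confusion matrix: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def summarize(fold_matrices):
    """Return (mean fold accuracy, sigma, mikro) for a list of per-fold matrices."""
    per_fold = [accuracy(cm) for cm in fold_matrices]
    size = len(fold_matrices[0])
    # mikro: pool all predictions by summing the fold matrices element-wise.
    pooled = [[sum(cm[r][c] for cm in fold_matrices) for c in range(size)]
              for r in range(size)]
    return mean(per_fold), pstdev(per_fold), accuracy(pooled)
```

When all folds have the same size, the mean accuracy and the mikro value coincide; they differ only when the folds contain different numbers of instances.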
In the second experiment, we searched for the best success rate, and the corresponding parameters, of each selected machine learning algorithm on our three datasets. In this way, we obtained the best possible success rate for each machine learning algorithm and could compare the algorithms according to the highest value. Table 3 summarizes the success rates obtained, in percent. The penultimate row gives the average success rate calculated from the three values obtained, and the last row gives the difference between this average and the maximal average obtained.
The results confirm that the differences between the optimized ML algorithms are negligible. Note that the SVM with the rbf kernel had the best success rate of all SVM kernels for all datasets. The k-NN algorithm classified DS1 (the dataset with the lowest noise) with the highest success rate of all ML algorithms tested. SVM-rbf was the best ML classifier for DS2 and DS3. Overall, the template attack based on the pooled covariance matrix (see footnote 5) had the best average success rate over all tested datasets, but among the ML algorithms, SVM-rbf and k-NN were the best, with almost the same success rate. The difference was only 0.45 %, which we consider negligible given the number of experiments (28 480 individual bit classifications in total). MLP was the third best algorithm, with only a 0.57 % difference from SVM-rbf. We can conclude that SVM-rbf, MLP and k-NN are the most suitable candidates for profiling power analysis attacks. The following text summarizes the parameters selected for the individual machine learning algorithms during the second experiment.
Footnote 5: The results of the template-based attacks are informative only, because the template attacks were aimed at whole-byte classification.
We can recognize some similarities among the parameters selected. Good choices for an attacker are MLP with one hidden layer and normalization set to true, SVM-rbf with C = 45 and epsilon = 0.28, or, as we suggest, k-NN with k = 5 and Euclidean distance.
In practice, every ML algorithm can be optimized (through its individual parameter settings) to give almost identical classification results. The biggest difference between the tested algorithms lies in the time needed to find the best setting for a particular ML algorithm. For example, finding the best parameters of the SVM algorithm with the poly kernel takes approximately eight days with the selected parameters on DS1. The difference is enormous in comparison with the 6 minutes and 35 seconds needed for k-NN optimization on the same DS1. To demonstrate this, Tab. 4 summarizes the time consumption of one executed 10-fold cross-validation for the implemented ML algorithms.
One 10-fold cross-validation took less than 1 s for k-NN and 320 s for SVM-rbf. Naturally, the attacker has to run as many cross-validations as there are tested parameter combinations, so the time needed is directly proportional to the number of tested combinations and to the learning time. In our case, we tested only 132 parameter combinations for k-NN, but 1 111 combinations for SVM with the poly kernel. The k-NN algorithm is very simple, so it is not necessary to test many parameters, and the algorithm does not include a learning phase. These are the main reasons why, in terms of time consumption, the k-NN algorithm provides the best performance in our implementation (see footnote 6).
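The proportionality between search time and the number of tested combinations can be made concrete with a rough estimate. The sketch below simply multiplies the number of combinations by the time of one cross-validation; it assumes that the 1 111 combinations per kernel also apply to SVM-rbf and it ignores any overhead, so it is not expected to reproduce the measured totals exactly.

```python
def search_time(n_combinations, seconds_per_cv):
    """Total search time if every combination costs one cross-validation run."""
    return n_combinations * seconds_per_cv

def as_days_hms(seconds):
    """Split a number of seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs

# SVM-rbf: 1 111 combinations at 320 s each.
svm_rbf = search_time(1111, 320)
# k-NN: 132 combinations at (less than) 1 s each, so this is an upper bound.
knn_upper = search_time(132, 1)
```

Under these assumptions the SVM-rbf search takes 355 520 s, i.e. a little over four days, while the k-NN search stays in the range of minutes.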
Our fourth experiment examines the classification success rate as a function of the number of power traces. For this purpose, we prepared new datasets from the original three DSs that differed in the number of power traces. From DS1, we created 10 datasets, each successively containing one more power trace per key value; in other words, the datasets contain from 256 to 2 560 power traces in steps of 256. From DS2 and DS3, we again created 10 datasets, containing from 100 to 1 000 power traces in steps of 100. The prepared data were classified successively using the implemented program. The obtained results are depicted in the accompanying figures. The graphs confirm the theoretical assumption that the success rate increases with the number of power traces in the profiling phase. The success rate rises steeply up to a maximal value and then stays almost constant. Interestingly, the number of power traces required to achieve the maximum was comparable for all ML algorithms (especially for the best three: SVM-rbf, MLP and k-NN). Generally, about 500 power traces of DS1, 300 power traces of DS2 and 200 power traces of DS3 were necessary to achieve the maximal value. Comparing Fig. 12 and Fig. 13 shows the influence of the choice of interesting points, because these DSs were prepared from identical power traces and both target the secret offset classification (in other words, the datasets differ only in the method of selecting interesting points). The shift of the maximum success rate by around 10 % is obvious. During the DS3 classification, the classical template attack yielded very low calculated probabilities, therefore its first-order success rate was worse than that of the other approaches.

Footnote 6: Note that our test was performed on datasets containing 1 000 and 2 560 power traces.
In our fifth experiment, we investigate the success rate of mask revelation depending on the number of interesting points and the number of power traces. Moreover, we investigate the influence of multiclass classification. In contrast to our previous experiments, we modified the program so that ML classified each instance into 256 classes (whole-byte classification). In other words, ML and TA classified the secret offset directly (not successively bit by bit). For this purpose, we prepared datasets based on DS2 that differed in the number of power traces and interesting points. We prepared learning sets containing 100, 250, 500 and 1 000 power traces, respectively, and a test set containing 1 500 instances in order to test the profiling attacks. Figures 14, 15, 16 and 17 report the success rate of predicting the right offset value as a function of the number of interesting points selected for the best profiled attacks: SVM-rbf, NN, k-NN, T_cls and T_pool. The experiments described in [48], [49] implemented the MLP approach in Matlab using Netlab. To extend our research to different implementations, we also included the MLP approach implemented in Netlab [67] (denoted MLP_Matlab), as well as the reduced template attack, denoted T_red. The reduced template attack is calculated according to equation (4), but with the covariance matrix equal to the identity matrix (reduced templates contain only the mean vector). One can make the following observations. First, as expected, the higher the number of traces in the learning set, the higher the accuracy. For example, the maximal success rate achieved was 70 % and 99 % for learning sets containing 100 and 1 000 power traces, respectively. Second, the number of selected points in each trace influences the success rate: the higher the number of interesting points, the higher the success rate of every attack implementation. The main finding is that the rise in the success rate of the ML-based attacks occurs much earlier than for any TA. For 20 interesting points and 1 000 power traces, we can observe a success rate of 72 % for the MLP and of 7 % for the TAs.
It is remarkable that if the learning set is small (in our experiment, fewer than 1 000 power traces), the classic template attack is practically inapplicable; its success rate is around 7 %. This is caused by numerical problems connected with the covariance matrix, which occur during the inversion required in equation (4). In our case, the calculated values were very small, which led to poor classification results.
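One well-known way in which covariance inversion breaks down with few traces is that a sample covariance matrix estimated from no more traces than interesting points is singular. The sketch below illustrates this with two interesting points; it is a generic illustration with hypothetical function names and toy values, and it is not claimed to be the exact numerical problem encountered in the experiments.

```python
def sample_cov(traces):
    """Unbiased sample covariance matrix of a list of equal-length traces."""
    n, d = len(traces), len(traces[0])
    m = [sum(t[i] for t in traces) / n for i in range(d)]
    return [[sum((t[i] - m[i]) * (t[j] - m[j]) for t in traces) / (n - 1)
             for j in range(d)] for i in range(d)]

def det2(c):
    """Determinant of a 2x2 matrix."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]

# Two traces with two interesting points: the covariance matrix has rank 1.
few = [(1.0, 2.0), (3.0, 5.0)]
# One more trace (not on the same line) makes the matrix invertible.
more = few + [(2.0, 2.0)]

det_few = det2(sample_cov(few))
det_more = det2(sample_cov(more))
```

Here `det_few` is zero, so the matrix cannot be inverted, while `det_more` is positive. With 256 classes and at most 1 000 profiling traces, each class has only a handful of traces, which is consistent with the difficulties reported for the classic template attack.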
The obtained results confirmed that, in general, the ML approach is a much more effective profiling power analysis attack when only a small number of power traces and interesting points is available. It is rather surprising that the MLP_Matlab approach performs better than the second MLP implementation; this is caused by its more precise settings. For larger learning sets, the template attack based on the pooled covariance matrix and the ML approaches (NN and SVM) perform practically the same: the obtained success rates were 99.9 % and 99.6 % for T_pool and MLP, respectively. Furthermore, the results confirmed that k-NN behaves similarly to the template attacks: its success rate lies between T_cls and T_pool. The approach is much more efficient than the classic template attack for smaller datasets; on the other hand, T_pool is better, because it takes into account the relation between the selected interesting points. In practice, the k-NN approach corresponds to the reduced template attack, which does not take the covariance matrix into account. Naturally, we performed the same experiment for DS3 and evaluated the results. Not surprisingly, they were practically the same, therefore we did not include them in the article.
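The correspondence between k-NN and the reduced template attack can be motivated by a short derivation. It assumes that equation (4) is the usual multivariate normal density and writes m_k for the mean vector of class k (a symbol introduced here for illustration). Setting the covariance matrix to the identity gives:

```latex
\begin{aligned}
\log p\bigl(t; (m_k, I)\bigr)
  &= -\tfrac{1}{2}\,(t - m_k)^{T} (t - m_k) - \tfrac{N_I}{2}\,\log(2\pi) \\
  &= -\tfrac{1}{2}\,\lVert t - m_k \rVert^{2} - \tfrac{N_I}{2}\,\log(2\pi), \\[4pt]
\arg\max_{k}\; p\bigl(t; (m_k, I)\bigr)
  &= \arg\min_{k}\; \lVert t - m_k \rVert .
\end{aligned}
```

The reduced template attack therefore assigns a trace to the class with the nearest mean in Euclidean distance. The k-NN algorithm with Euclidean distance applies the same distance to the stored traces themselves rather than to their class means, so the correspondence is close but not exact.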
In our last experiment, we performed ROC (Receiver Operating Characteristic) analysis for the chosen profiled attacks based on machine learning algorithms. This method is commonly used in medical decision making and in ML to illustrate the performance of a binary classifier as its discrimination threshold is varied [68]. ROC graphs are therefore useful for organizing and selecting classifiers and for visualizing their performance.
The accuracy results for the individual models are calculated from the confusion matrices (see Section 4 and the example in Table 2 for k-NN). The numbers along the main diagonal represent the correct decisions, and the off-diagonal numbers represent the errors, i.e. the confusion between the classes. Each column of the table corresponds to the actual values of the class (in our case bit value 1 or 0) and each row corresponds to the predicted values.
For the following analysis, we denote:
• True positive (TP): the model predicted bit value 1 and the actual bit value was 1.
• False positive (FP): the model predicted 1 and the actual value was 0.
• True negative (TN): the model predicted 0 and the actual value was 0.
• False negative (FN): the model predicted 0 and the actual bit value was 1.
The sensitivity of a classifier is estimated as:

$$\text{sensitivity} = \frac{TP}{TP + FN}$$

The specificity of a classifier is estimated as:

$$\text{specificity} = \frac{TN}{TN + FP}$$

Generally, ROC graphs are two-dimensional graphs in which sensitivity is plotted on the Y axis and 1 − specificity is plotted on the X axis; they therefore depict the relative trade-off between benefits (TP) and costs (FP). It is well known that the best possible prediction model would yield the point (0, 1) in the upper left corner of the ROC space. This point is also called a "perfect classification", because it represents 100 % sensitivity (no FN) and 100 % specificity (no FP). The perfect classifier produces a curve that runs vertically from the origin (0, 0) up to the point (0, 1) and from there horizontally to the right. A completely random guess (binary classification with a success rate of 50 %) would give a line along the diagonal from the origin (0, 0) to the top right corner (1, 1).
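The construction of a ROC curve from classifier scores can be sketched as follows. The function names are hypothetical, the scores are assumed to be confidences for bit value 1, and the area under the curve is added only as a convenient summary; it is not a measure reported in the text.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs for a sweep of the threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step predicts more and more instances as 1.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(s >= thr and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every 1-bit above every 0-bit produces a curve through (0, 1) with an area of 1, whereas a random guess stays near the diagonal with an area of about 0.5.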
Based on the previous results, we included MLP, SVM-rbf, k-NN, DT and RF in the ROC analysis. We implemented the ML models using the optimal parameters found in the second experiment (see the part of Section 4 on the second experiment). We calculated the ROC curves on the whole prepared datasets with 10-fold cross-validation. The results of the ROC analysis for each bit classification of DS1 are depicted in Fig. 20 in the Appendix. The semitransparent areas indicate the standard deviation over the different cycles of the 10-fold cross-validation, and the solid line indicates the average result. It can be observed that some bits of the secret key can be distinguished perfectly: the first, the third and the eighth bit. In these cases, each model provides almost perfect classification. The remaining bits were also classified with high performance, but the differences between the ML models were more significant. It is remarkable that the ROC curve of k-NN was the closest to the perfect classifier for each remaining bit, and the second best model was SVM-rbf. The model based on k-NN provided perfect classification even where the other models did not (classification of bits 4, 5, 6 and 7). On the other hand, it is interesting that according to the ROC comparison MLP was the worst for DS1. This observation corresponds with the results of the second experiment (see Tab. 3). Based on our experiments and experience with MLP in profiled power analysis attacks, we conclude that this is caused by the small number of interesting points (see footnote 7).
The results of the ROC analysis for each bit classification of DS2 are depicted in Fig. 18 and show that each of the 4 bits of the secret offset was classified with great performance. The ROC plots of k-NN, SVM-rbf and MLP show that these models are very close to the perfect classification; the difference between them is negligible for DS2.
The ROC curves for each bit classification of DS3 are depicted in Fig. 19. As in the previous case, k-NN, SVM-rbf and MLP provided the best plots in the ROC space. Moreover, comparing Fig. 18 and Fig. 19 demonstrates the influence of the most crucial step of profiling PA, the selection of the interesting points, because datasets DS2 and DS3 were prepared from identical raw power traces. In Fig. 19, the ROC curves are more distant from the point (0, 1); the selection was therefore performed less precisely and the resulting models provided lower performance. We conclude that if the selection of interesting points is performed properly, every distinguisher will provide more or less similar results regardless of the underlying technique (either classic templates or machine learning after optimization).

Conclusion
In this paper, we provided an extensive comparison of machine learning algorithms widely used in power analysis, such as SVM, decision trees and MLP, together with the new approach based on k-NN. We implemented a verification program that chose the optimal parameter settings of the individual machine learning algorithms in order to obtain the best classification accuracy. Based on the obtained results, we consider SVM-rbf, MLP and k-NN the most suitable candidates for profiling power analysis attacks (in terms of classification accuracy). Generally, every ML algorithm can be optimized through its parameter settings to give almost identical classification results. On the other hand, optimizing an individual ML algorithm can be time consuming, and the difference can be enormous depending on the selected parameters and algorithm: for example, 6 minutes and 35 seconds were needed for k-NN optimization and 8 days for SVM-poly optimization.
Moreover, we investigated the success rate of mask revelation depending on the number of interesting points and the number of power traces. As expected, the higher the number of traces and points in the learning set, the higher the accuracy of every power analysis attack implemented. The main finding was that the sharp rise in the success rate of the ML attacks (MLP and SVM) occurs much earlier than for any TA. We can conclude that the ML approach is a much more effective profiling power analysis attack when only a small number of power traces and interesting points is available. In other words, if the adversary has only a limited number of measured power traces, it is better to use a profiling power analysis attack based on the MLP than an attack based on templates.
From all the experiments performed, we see very good potential in the k-NN algorithm. The proposed approach, based on the simple k-NN algorithm, can provide important advantages to the attacker compared with other profiling attacks. We summarize these observations in the following points:
• the basic principle of the method is very simple; the profiling phase consists only of storing the measured data in memory (the attacker has to do this in any case),
• it is not necessary to prepare (calculate) templates, so the attacker can save time and memory,
• the k-NN approach is implemented by default in many program environments, therefore implementing the attack poses no problem,
• the success rate is comparable with the template attack; the k-NN approach corresponds to the reduced TA, which does not take the covariance matrix into account,
• the attacker can use more interesting points than with TA, where the number is limited by the memory requirements of the covariance matrix,
• k-NN does not include a learning phase, unlike other ML algorithms, therefore the attacker can work more efficiently (fast response to any change in the measured power traces, the number of interesting points, their size etc.).
We hope that this method will find continued use in profiled power analysis attacks, because its basic principle makes it well suited to classifying data with a multivariate normal distribution (the interesting points of power traces). We demonstrated this assumption with a simple example and with the experiments performed.

Fig. 2. Scatter plot of two interesting points that leak HW.

Fig. 14. Success rate of the secret offset revelation based on 100 power traces of DS2.

Fig. 15. Success rate of the secret offset revelation based on 250 power traces of DS2.

Fig. 16. Success rate of the secret offset revelation based on 500 power traces of DS2.

Fig. 17. Success rate of the secret offset revelation based on 1 000 power traces of DS2.