Task-Aware Meta Learning-based Siamese Neural Network for Classifying Obfuscated Malware

Malware authors apply different control flow obfuscation techniques to create new malware variants that avoid detection. Existing Siamese neural network (SNN)-based malware detection methods fail to correctly classify different malware families when such obfuscated malware samples are present in the training dataset, resulting in high false-positive rates. To address this issue, we propose a novel task-aware few-shot-learning-based Siamese neural network that is resilient against the presence of malware variants affected by such control flow obfuscation techniques. Using the average entropy features of each malware family as inputs, in addition to the image features, our model generates the parameters for the feature layers, to more accurately adjust the feature embedding for different malware families, each of which has obfuscated malware variants. In addition, our proposed method can classify malware classes even if only one or a few training samples are available. Our model utilizes few-shot learning with the extracted features of a pre-trained network (e.g., VGG-16) to avoid the bias typically associated with a model trained on a limited number of training samples. Our proposed approach is highly effective in recognizing unique malware signatures, and thus correctly classifies malware samples that belong to the same malware family even in the presence of obfuscated malware variants. Our experimental results, validated by N-way on N-shot learning, show that our model achieves a classification accuracy exceeding 91%, outperforming other similar methods.


Introduction
Malware producers are ever more motivated to create new variants of malware in order to profit from unauthorized information theft. According to the malware detection agency AV-Test (https://www.av-test.org/en/statistics/malware/ (accessed on 1 July 2021)), 100 million new variants of malware were generated from January to October 2020, which translates to more than 300,000 new malware samples daily.
In particular, we have witnessed the fast growth of mobile-based malware. An NTSC report published in 2020 (https://www.ntsc.org/assets/pdfs/cyber-security-report-2020.pdf (accessed on 15 June 2021)) found that 27% of organizations globally have been impacted by malware attacks sent via Android mobile devices. In recent times, we have seen malware producers employ techniques such as obfuscation Chua and Balachandran [2018], Bacci et al. [2018] and repackaging Song et al. [2017], Lee et al. [2019], mostly through changes to static features Zhu et al. [2018], Sun et al. [2017], Hu et al. [2014], to avoid detection. In response to this growth in mobile-based malware attacks, numerous Artificial Intelligence (AI)-based defense techniques have been proposed Vasan et al. [2020], Luo and Lo [2017], Makandar and Patrot [2018], Hsiao et al. [2019], Singh et al. [2019].
We argue that there are two main issues to be addressed in the existing state-of-the-art of AI-based mobile malware attack defense.
The first issue is that most of the existing research tends to focus on learning common semantic information about the generic features of malware families, and on building feature embeddings Raff et al. [2018], Gibert et al. [2019], Shen et al. [2018]. In this context, a 'feature embedding' captures the features contained in a malware binary sample. These features provide important clues as to whether a malware image (generated from the hexdump utility, which displays the contents of binary files in hexadecimal) is malicious or not; if malicious, the malware family it belongs to is assessed, to build the right set of response strategies. These existing works often treat the fraction of the code changed by obfuscation and repackaging as a type of noise Gibert et al. [2018], and thus tend to ignore the effect of the modification: this is largely because the code changed by obfuscation and repackaging techniques displays a similar appearance when malware visualization techniques are applied Akarsh et al. [2019], Ni et al. [2018], Naeem et al. [2020]. Using common semantic information as data inputs to a deep neural network cannot capture the unique characteristics of each malware family's signature: such approaches therefore cannot accurately classify the many variants arising from the same malware family Kalash et al. [2018], Milosevic et al. [2017], Vasan et al. [2020], Yuan et al. [2020], especially when an obfuscation technique is applied.
The second issue with the existing approaches is their demand for large data input, with which to find more relevant correlations across the features: such approaches are unable to detect and classify malware families trained with a limited number of samples (e.g., newly emerging variants of malware).
To address these two important issues, we propose a novel task-aware few-shot-learning based Siamese neural network, capable of detecting obfuscated malware variants belonging to the same malware family, even if only a small number of training samples are available.
The contributions of our proposed model are as follows:
• Our task-aware meta-learner network combines entropy attributes with image-based features for malware detection and classification. By utilizing the VGG-16 network as part of the meta-learning process, the weight generator assigns the weights of the combined features, which avoids the potential issue of introducing bias when the training sample size is limited;
• For the hybrid loss to compute the intra-class variance, the center loss is added alongside the contrastive loss, to enable positive pairs and negative pairs to form more distinct clusters across the pairs of images processed by the two CNNs;
• The results of our extensive experiments show that our proposed model is highly effective in recognizing the presence of a unique malware signature, despite the presence of obfuscation techniques, and that its accuracy exceeds that of similar methods.
We organized the rest of the paper as follows. We examine the related work in Section 2. We describe how control-flow-obfuscated malware variants are created, and we address why the generic SNN approach cannot detect such obfuscated malware variants, in Section 3. We provide the details of our proposed model, along with the details of the main components, their roles and responsibilities, and an overview of the algorithm involved, in Section 4. In Section 5, we describe the details of the dataset, feature extraction, and the experimental results with analysis. Finally, we provide the conclusion to our work, including the limitations of our proposal and future work directions, in Section 6.

Related Work
In this section, we review two lines of research relevant to our study: entropy-based feature selection, and control flow obfuscation techniques used to evade malware detection.

Entropy Feature in Feature Selection
Entropy value serves as a metric for feature selection, quantifying the information each feature contributes to the target variable or class. Selecting features with the highest information gain can enhance model performance and reduce overfitting. Huang et al. Huang et al. [2016] utilized the entropy function defined on sparse signal x to recover such signals: minimizing it with an appropriate p-value yielded sparse solutions and improved signal recovery rates. Additionally, Finlayson et al. Finlayson et al. [2009] demonstrated that quadratic entropy values, being smoother and typically having a single minimum, offer the most efficient approach. This method aids full-color shadow removal, by comparing edges in the original and invariant images, followed by re-integration. Moreover, Allahverdyan et al. Kolouri et al. [2018], Allahverdyan et al. [2018] suggest that entropy minimization can lead to basic forms of intelligent behavior. Specifically, Kolouri et al. Kolouri et al. [2018] employed entropy minimization to bolster classifier confidence, while entropy regularization ensured that predictions remained close to unseen attributes in zero-shot tasks.

Control Flow Obfuscation
Malware obfuscation is a technique applied by malware authors to create new malware variants that avoid detection without creating an entirely new malware signature. Among the many different obfuscation techniques, we focus on control flow obfuscation, which creates a new malware variant by reordering the control flow of functional logic from the original malware program. This type of obfuscation makes the compiled malicious code appear different from the existing malware signature, so that it can easily avoid detection. Our model can detect three different types of control flow obfuscation.

Function Logic Shuffling
This technique alters the control flow path of a malware program by shuffling the order of function calls without affecting the semantics (i.e., purpose) of the original malware program. While the functionality of the original malware and the obfuscated version remains the same, the change in the compiled code alters the appearance of the malware image, and detection accuracy decreases. An example is shown in Figure 1, where the order of function logic MyClass_2 and MyClass_3 is changed.

Junk Code Insertion
In this technique, the malware author inserts code that never gets executed (junk code), placed after unconditional jumps, after calls that never return, or behind conditional jumps whose conditions will never be met. The main goal of this technique is to change the control flow path, in order to avoid detection or to waste the time of a reverse engineer analyzing useless code. An example of junk code insertion is shown in Figure 2, where a junk code block, MyClass_J, is added between two normal function calls.
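As a small illustration (not the paper's actual obfuscator), the following Python sketch shows how junk code insertion preserves semantics while changing the control flow; `step_one` and `step_two` are hypothetical stand-ins for the original malware's function calls.

```python
def step_one(x):
    return x * 2

def step_two(x):
    return x + 3

def original_routine(x):
    """Original control flow: two function calls in sequence."""
    return step_one(x) + step_two(x)

def obfuscated_routine(x):
    """Semantically identical, but with junk code that never executes,
    which changes the compiled control flow graph."""
    if False:                                  # conditional jump never taken
        junk = [i * i for i in range(1024)]    # dead code, never built
        return sum(junk)
    result = step_one(x)
    if x == x + 0:                             # always-true guard
        result += step_two(x)
    else:
        result -= 999                          # unreachable branch
    return result
```

Both routines return the same value for every input, yet a byte-level or control-flow-level signature of the two compiled forms would differ.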

Function Splitting
With this technique, the malware author applies the function splitting method, where a function's code is fragmented into several pieces, each of which is inserted into the control flow. This technique splits the function into n code fragments, or merges pieces of unrelated code, to produce changes in the compiled code, which also changes the appearance of the malware image. An example of function splitting is shown in Figure 3, where two splits from MyClass_2 are generated and randomly placed among other function calls. Though the functionality of the original malware and the obfuscated versions (e.g., malware variants) stays the same, malware code to which control flow obfuscation has been applied can easily avoid detection by many anti-virus programs. To address this issue, we propose a model resilient against the presence of many variants of malware created by applying control flow obfuscation techniques. Our proposed method utilizes the information gain calculated through the entropy features associated with each malware variant. In our proposal, the entropy features measure the amount of uncertainty of a given probability distribution of a malware program, which is not affected by the order of the functional logic of the malware program.

Generic Approach and Issues
Table 1: Notations.

  L_{b,c}  The binary cross-entropy loss with center loss.
  d_w      The distance feature of a pair of images.
  y_d      The label of a pair of images.
  F_w      The convolutional filter with parameters w.
  β        The hyper-parameter.
In the last few years, few-shot learning techniques such as the Siamese neural network (SNN), which make better predictions from only a few training samples, have emerged. An SNN contains two identical subnetworks (usually convolutional neural networks), hence the name 'Siamese'. The two CNNs share the same configuration and the same weights W ∈ R^d, where W denotes the model's parameters and R^d the embedding space in which the distance between the two inputs, one from each subnetwork, is calculated; the distance value indicates how close the two samples are to each other in Euclidean space. Updates to the hyperparameters are mirrored across both CNNs, and the network finds the similarity of the inputs by comparing their feature vectors.
Each parallel CNN is designed to produce an embedding (i.e., a reduced dimensional representation) of the input. These embeddings can then be used to optimize a loss function during the training phase and to generate a similarity score during the testing phase.
The architecture of a Siamese neural network contrasts significantly with that of traditional neural networks, which rely on extensive datasets to learn to predict multiple classes, and which must be fully retrained and updated with each addition or removal of a class. An SNN instead learns through a similarity function, testing whether two images are identical. This architecture empowers it to classify new data classes without the need for additional network training. Furthermore, SNNs demonstrate greater robustness against class imbalance, as a small number of images per class is enough to enable future recognition. The corresponding notations are listed in Table 1. Figure 4 illustrates the working of a generic SNN, whose goal is to determine whether two image samples belong to the same class: this is achieved through the use of two parallel CNNs (CNN1 and CNN2 in Figure 4) trained on the two image samples (Image 1 and Image 2 in Figure 4). Each image is fed through one branch of the SNN, which generates a d-dimensional feature embedding (h_1 and h_2 in Figure 4) for the image: it is these feature embeddings that are used to optimize a loss function, rather than the images themselves. A supervised cross-entropy function is used in the SNN for the binary classification, to determine whether two images are similar or dissimilar, by computing [h_2 - h_1] and processing it with a sigmoid function. Mathematically, the similarity between a pair of images (x_1, x_2) is computed in an SNN as the Euclidean distance (ED):

d_w(x_1, x_2) = ||F_w(x_1) - F_w(x_2)||_2,

where F_w indicates the feature representation of the inputted feature matrix. Generally, the model g_w : R^n → R^d is parameterized by the weights w, and its loss function is the binary cross-entropy over the pair distances:

L_b = -(1/N) Σ_{i=1}^{N} [ y_d^i log σ(d_w^i) + (1 - y_d^i) log(1 - σ(d_w^i)) ],

where y_d^i ∈ {0, 1} denotes the ground truth label of the image pair (x_i, x_j), d_w^i represents the Euclidean distance (ED) of the i-th pair, and σ is the sigmoid function.
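To make the generic SNN pipeline concrete, here is a minimal numpy sketch of the pair distance and a cross-entropy loss over it. The linear-map-plus-ReLU `embed` is a stand-in for the convolutional extractor F_w, and squashing the distance with exp(-d) into a (0, 1] similarity score is an illustrative choice, not the paper's exact formulation.

```python
import numpy as np

def embed(x, W):
    """Stand-in for the feature extractor F_w: linear map + ReLU."""
    return np.maximum(W @ x, 0.0)

def distance(x1, x2, W):
    """Euclidean distance d_w between the embeddings of an image pair."""
    return np.linalg.norm(embed(x1, W) - embed(x2, W))

def pair_loss(x1, x2, y, W, eps=1e-7):
    """Cross-entropy on a similarity score derived from the distance.

    s = exp(-d) equals 1 for identical embeddings and tends to 0 as they
    drift apart; y = 1 labels a similar (same-family) pair, y = 0 a
    dissimilar one.
    """
    s = np.clip(np.exp(-distance(x1, x2, W)), eps, 1 - eps)
    return float(-(y * np.log(s) + (1 - y) * np.log(1 - s)))
```

A similar pair that is far apart in the embedding space is penalized heavily, which is exactly what goes wrong for obfuscated variants of the same family, as discussed next.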
Note that the most similar images are supposed to be the closest in the feature embedding space: though this approach would work well for finding similarities/dissimilarities across distinct image objects, it would not work well for obfuscated malware samples.
Recall that an obfuscated malware sample, for example x_1, changes some part of the original malware code x_2: when these two are converted into feature representations, F_w(x_1) and F_w(x_2), the feature values will look very different, which is how obfuscated malware avoids detection by anti-virus software. Inadvertently, the different feature values also make the distances across obfuscated malware images very large (i.e., d_w(x_1, x_2) is large). Eventually, when a similarity score is computed and compared using the loss (i.e., L_b) based on the distance calculation (i.e., d_w(x_1, x_2)), the samples appear to belong to different malware families, though in fact they all belong to the same malware family.

Task-Aware Meta Learning-Based Siamese Neural Network
We now introduce our task-aware meta-learning-based SNN model, which provides a novel feature embedding better suited to control-flow-obfuscated malware classification. We start with an overview of our proposed model and the details of the CNN architecture used by our model, followed by how task-specific weights are calculated using factorization. Finally, we discuss the details of the loss functions our model uses to address the challenges of weight generation with a limited number of training samples.

Our Model
As shown in Figure 5, our model utilizes a pre-trained network and two identical CNN networks. We use a pre-trained network (VGG-16) to compute more accurate weights for the entropy features. Similarly, each CNN takes image features to calculate weights for the image features, and generates a feature embedding using the task-specific weights and shared weights of both the entropy and image features. The feature embeddings produced by the two CNNs are used by the SNN to calculate the similarity score across intra-class variants, using a new hybrid loss function. Within each CNN, there are two sub-networks: a task-aware meta-learner network and an image network, as shown in Figure 6. The task-aware meta-learner network starts by taking a task specification (e.g., an entropy feature vector) and generating the weights. At the same time, the image network (e.g., a typical CNN branch of an SNN) takes the image feature and convolves it until a fully connected layer is produced. The last fully connected layers use the weights generated by the task-aware meta-learner network, along with the shared weights, to produce a task-aware feature embedding space. The embedding loss for inter-class variance is calculated for back-propagation, until the CNN is fully trained for all input images.
Mathematically, this can be written as the following equation:

y = F(x | θ), θ = {g_1(e_t), . . . , g_k(e_t)} ∪ θ_s,

where the SNN F takes malware images x as inputs, and produces a task-aware feature embedding that is used by the SNN to predict the similarity y ∈ {1, 0} between an image pair inputted to the two CNNs. Each CNN is parameterized by the weights θ, which are composed of the parameters generated from T and the shared parameters θ_s in the SNN F, shared across all malware families. The task-aware meta-learner network T creates a set of weight generators g_i, i = 1 . . . k, to generate parameters for the k feature layers in F, conditioned on e_t. The overall approach of our proposed model is summarized in Algorithm 1.

Figure 6: CNN architecture of our proposed model.
Algorithm 1: Pseudo-code of our proposed algorithm.

Losses: binary cross-entropy with center loss L_{b,c}; additional supervision loss L_e
Input: entropy graph feature F_ent, texture feature F_t, support set s, query set q, pair label y_d^i, sample label y_t^i, hyper-parameter β, initialized centers c_i
Output: predicted similarity

Training stage:
(1) Initialize the parameters of our proposed model, and the task-specific weights W_i for the weight generators g_i in the task-aware meta-learner network, using the weight factorization of Equation (5);
(2) Input the malware texture features F_t and the 4096 entropy features F_ent extracted by the pre-trained network (e.g., VGG-16); note that these are integrated with the support set s of our proposed model;
(3) Feed the weighted features from Equation (5) into the embedding loss, according to the one-shot label generated from the target label;
(4) Calculate the Euclidean distance (ED) of the features in the two branches of the SNN, through the hybrid loss function;
(5) Back-propagate and update the parameters with the Adam optimizer.

Testing stage:
while the iteration count is not reached do
    Extract the features of the samples in the query set q, and feed them through the trained network to compute the one-shot accuracy.

Task-Aware Meta-Learner
Our task-aware meta-learner provides two important functionalities: the first is generating optimized task-specific weights, using the entropy values extracted from a pre-trained deep learning model (e.g., VGG-16); the second is working with the image network to compute the new weights, based on the shared weights and the task-specific weights, so that the embedding loss is accurately calculated, in order to capture the relative distance across inter-class variance (e.g., the features of the image). These functions are necessary because the number of available samples of some malware (e.g., zero-day attack samples) is usually much smaller than the number of images required for training an SNN model.
Using the entropy values, our meta-learner recognizes a specific malware signature present in the entropy, so that later it uses this knowledge to find whether some malware samples are derived from the same malware family or not (e.g., obfuscated malware). The entropy values are extracted from the VGG-16 when entropy graphs are inputted.
We use entropy graphs to recognize a unique malware signature belonging to each malware family. To illustrate the use of an entropy graph as a task specification, four samples of malware images are shown in Figure 7. Figure 7a,b show two obfuscated malware samples from the same Wroba.wm family. Similarly, Figure 7c,d, named Agent.ad, are from the same Agent family. One can see that the entropy graphs within the same family share a similar pattern, while there are visible differences between the entropy graphs of two different malware families. Our task-aware meta-learner utilizes the entropy values extracted from the entropy graph to train our proposed model to recognize whether an image pair is similar or dissimilar (i.e., whether the images belong to the same malware family or not); see Algorithm 2.
To obtain an entropy graph, a malware binary file is read as a stream of bytes and separated into a few segments. The frequency of each unique byte value is counted, and the entropy is computed using Shannon's formula:

H(s) = - Σ_{j=1}^{M} p_j log_2 p_j,

where p_j is the probability of an occurrence of the j-th byte value, among the M distinct byte values in the segment. The entropy attains the minimum value of 0 when all the byte values in a binary file are the same, and the maximum value of 8 when all the byte values are different. The entropy values are then represented as a stream of values that can be reshaped as an entropy graph. The entropy graph is then converted into a feature vector inputted through the convolutional extractor of a pre-trained network (e.g., VGG-16 Simonyan and Zisserman [2014]). The steps involved in the entropy graph are summarized in Algorithm 2:

Algorithm 2: Pseudo-code of entropy graph.
Input: f: malware binary file; l: segment length; n: the number of files
Output: entropy graph matrix m
1 while not reached n do
2     1. read l bytes from f, and define them as segment s;
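The per-segment Shannon entropy computation described above can be sketched in Python as follows; the segment length is a free parameter, and the result is the stream of entropy values that gets reshaped into the entropy graph.

```python
import math

def segment_entropies(data: bytes, seg_len: int):
    """Shannon entropy (bits per byte, range 0..8) of each fixed-length
    segment of a binary, per H(s) = -sum_j p_j * log2(p_j)."""
    entropies = []
    for off in range(0, len(data), seg_len):
        seg = data[off:off + seg_len]
        counts = {}
        for b in seg:                       # frequency of each byte value
            counts[b] = counts.get(b, 0) + 1
        h = 0.0
        for c in counts.values():
            p = c / len(seg)                # probability of this byte value
            h -= p * math.log2(p)
        entropies.append(h)
    return entropies
```

A segment of identical bytes yields entropy 0, and a segment containing all 256 byte values exactly once yields the maximum of 8, matching the bounds stated above.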

Weight Generator via Factorization
In the generic SNN approach, the feature extractor only uses the image feature: this approach, however, is no longer effective in the detection of obfuscated malware, where multiple obfuscated malware samples contain almost identical image features.
The problem is further complicated when only a few samples (i.e., fewer than five malware samples) exist, leaving insufficient malware feature information for classification, as very few variations can be collected from such a small number of samples. To address this issue, we present a novel weight generation scheme based on the work presented by Gidaris and Komodakis [2018]. In our proposed model, the weight generator G(., .|φ) receives as input the entropy vectors W_ave of a class, in addition to the image vectors Z = {z_i}_{i=1}^N of the N training samples of the corresponding class: this results in a weight vector w = G(Z, W_ave).
In our model, the weight generator scheme is incorporated in the fully connected (FC) layer, to address the non-linear relationship between the entropy feature and the malware image. By integrating the weight generator into the FC layer, the weights of the features extracted before the FC layer can be better integrated into calculating new, more optimized weights for the whole model. We generate weights by creating a weight combination. The weight combination produces composite features that encode the non-linear connection in the feature space: this is done by multiplying the entropy features and the image features together, so that the composite features learn a feature embedding resistant to different obfuscated malware variations. Note that the dimension of the weight generator g_i on the FC layer must match the dimension of the weights of the i-th feature layer in F, so that the weights W_i ∈ R^{m×n} can be decomposed using the following equation:

W_i = W_i^{sr} ⊙ W_i^{ts},    (5)

where W_i^{sr} ∈ R^{m×n} is the component derived from the average entropy feature vector for each malware family {t_1, . . . , t_N}, and W_i^{ts} ∈ R^n is the task-specific image feature vector for each malware sample, broadcast across the rows of W_i^{sr}. With such factorization, the weight generators only need to generate the malware-specific parameters for each malware family in the lower dimension, and learn one set of parameters in the high dimension shared across all malware families.
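One plausible reading of this factorization can be sketched in numpy: a shared high-dimensional matrix is modulated by a low-dimensional task-specific vector produced by a single-FC-layer generator. All dimensions and the random weights below are hypothetical, chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, d_ent = 8, 4, 16                 # layer dims and entropy-feature dim
W_shared = rng.normal(size=(m, n))     # high-dim weights shared by all families
G = rng.normal(size=(n, d_ent))        # weight generator g_i: one FC layer

def task_weights(entropy_vec):
    """Compose W_i from shared weights and a task-specific vector:
    w_task = g_i(e_t) in R^n, then W_i = W_shared * w_task (row broadcast)."""
    w_task = G @ entropy_vec           # low-dim, family-specific parameters
    return W_shared * w_task           # task-conditioned layer weights (m, n)
```

The generator thus outputs only n numbers per family, while the m×n shared matrix is learned once across all malware families.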

Loss Function
Our proposed model uses two different types of loss functions. The embedding loss is used by our task-aware meta-learner network to compute a loss across the inter-class variance (e.g., the features in the feature embedding space of a CNN branch), while the hybrid loss is used by the differencing layer of the SNN to compute the similarities across intra-class variants between an image pair.

Embedding Loss for Meta-Learner
The feature representation of the entropy graph of a malware class can easily be influenced by binary loss: the binary loss can only give the probability of the distribution of distances between positive and negative image pairs, and cannot estimate the probabilities of distances between positive and negative pairs across different malware variations; thus it cannot correctly classify similar pairs of images across obfuscated malware samples (i.e., it cannot learn a discriminative feature during the training procedure). To address this issue, we added a secondary cross-entropy loss, not only to learn the discriminative feature but also to counter the overfitting caused by contrastive loss. This embedding loss is defined using the following equation:

L_e = -(1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} y_t log ŷ_t(x_i),

where x_i represents the i-th sample in the dataset of size N, and ŷ_t(x_i) is the predicted probability for task t. The one-shot encoding applied to the input based on the labels is indicated by y_t ∈ {0, 1}^T, while T indicates the number of tasks during training (e.g., either in the whole dataset or in the minibatch).

Hybrid Loss for Our SNN
To calculate the similarity score for our proposed model, we propose a hybrid loss function comprising a center loss and a contrastive loss. The center loss proposed by Wen et al. [2016] is a supplementary loss function to the softmax loss for classification tasks, which learns a center for each class and shortens the distances between training samples of similar features by moving them as close to that center as possible. The center loss can be calculated as follows:

L_c = (1/2) Σ_{i=1}^{N} ||x_i - c_{y_i}||_2^2,

where c_{y_i} is the learned center of class y_i. This approach, however, does not address the issue of pushing apart training samples with dissimilar features. To address this issue, we propose the hybrid loss function integrated with the pairwise center, to better project the latent task embedding e_t = T(t) into a joint embedding space that contains both the negative and the positive center points.
We have adopted a metric learning approach, in which the learned feature is close to the joint feature embedding for the positive inputs of a given image pair, and far away from it for the negative inputs:

L_{b,c} = L_b + β L_c,

where β is the hyperparameter balancing the two terms; in our study, we set it to 0.8.
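The combined objective can be sketched in numpy as follows. The center term follows Wen et al.'s formulation; pairing it with a margin-based contrastive term and β = 0.8 matches the description above, while the margin value and array shapes are illustrative assumptions.

```python
import numpy as np

def center_loss(embs, labels, centers):
    """Center loss (Wen et al., 2016): pull each embedding toward the
    center c_{y_i} of its own class."""
    diffs = embs - centers[labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))

def contrastive_loss(h1, h2, y, margin=1.0):
    """y = 1 for a positive (same-family) pair, 0 for a negative pair;
    negative pairs are pushed apart up to the margin."""
    d = np.linalg.norm(h1 - h2)
    return float(y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2)

def hybrid_loss(h1, h2, y, labels, centers, beta=0.8):
    """Contrastive term plus a beta-weighted center term (beta = 0.8,
    the value used in the paper)."""
    embs = np.stack([h1, h2])
    return contrastive_loss(h1, h2, y) + beta * center_loss(embs, labels, centers)
```

A positive pair sitting exactly on its class center incurs zero loss, while an identical-looking negative pair is penalized by the full margin term.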

Experiments
In this section, we describe the details of the datasets we used for the experiments, the model configuration, and the results of our experiments. The results were obtained by running the experiments on a desktop with 32 GB of RAM, an Nvidia GeForce RTX 2070 (8 GB), and an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz.

Andro-Dumpsys Dataset
We used the Andro-Dumpsys dataset obtained from Jang et al. [2016], which has been widely used for malware detection. The original dataset consists of 906 malicious binary files from 13 malware families. As illustrated in Table 2, the number of malware variants and the total number of samples from different malware families varied. Almost half of the malware families had no more than 25 malware samples, while some had only 1 sample, as they were most likely malware detected recently (e.g., Blocal and Newbak). In addition to the original dataset, we also generated three additional synthetic malware variants, each created by applying one of the control flow obfuscation techniques described earlier: function logic shuffling, junk code insertion, and function splitting, respectively. Two samples of each additional malware variant, a total of 6 additional samples per malware family, were added to the original dataset.

Table 2: Malware variants and samples per family; the numbers in parentheses include the synthetic obfuscated variants and samples.

No.  Family      Variants   Samples
 2   Blocal      1 (4)      1 (7)
 3   Climap      1 (4)      5 (11)
 4   Fakeguard   1 (4)     10 (16)
 5   Fech        1 (4)      3 (9)
 6   Gepew       4 (7)    112 (118)
 7   Gidix       6 (9)    108 (114)
 8   Helir       1 (4)     15 (21)
 9   Newbak      1 (4)      1 (7)
10   Recal       2 (5)     25 (31)
11   SmForw     23 (26)   166 (172)
12   Tebak      10 (13)    93 (99)
13   Wroba      23 (26)   108 (114)

Figure 8 illustrates a snippet of how obfuscated malware is created by applying junk code insertion. In this example, we created a dummy array that acts as junk code, which we added between two function calls from the original malware code. We applied a similar approach to the other two types of control flow obfuscation. We also increased the image sample size, so as to have at least 30 samples for every malware family, using data augmentation techniques (e.g., applying random transformations such as image rotations, re-scaling, and flipping the images horizontally). The details of the augmentation parameters are shown in Table 3.
In particular, ZCA whitening is an image preprocessing method that transforms the data such that its covariance matrix is the identity matrix, yielding decorrelated features; the fill_mode argument with "wrap" simply replaces the empty area following the filling scheme; and horizontal_flip randomly flips inputs horizontally.

Image Feature
We used the same technique proposed by Zhu et al. [2020] to produce image features. We first read the malware binary as a vector of 8-bit unsigned integers, which was then converted into a 2D vector. We used a fixed width, while the height was determined by the size of the original file. Finally, the 2D vector was converted into a gray image using the color range [0, 255]. Note that the gray images at this stage had different dimensions, owing to the varying heights and widths, and size biases could occur in the fully connected layer: to address this issue, we used the bilinear interpolation method, as suggested by Malvar et al. [2004], to produce image features of a uniform size of 105×105.
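The byte-to-image conversion and resizing described above can be sketched in numpy; the fixed width of 64 is an illustrative choice, and the plain bilinear resize below is a minimal stand-in for the interpolation method cited in the text.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 64):
    """Read a binary as 8-bit unsigned ints and reshape it into a
    fixed-width 2D gray image (height follows from the file size)."""
    arr = np.frombuffer(data, dtype=np.uint8)
    height = len(arr) // width
    return arr[:height * width].reshape(height, width)

def bilinear_resize(img, out_h=105, out_w=105):
    """Bilinear interpolation to a uniform 105x105 feature image."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)      # source row coordinates
    xs = np.linspace(0, in_w - 1, out_w)      # source column coordinates
    y0 = np.clip(ys.astype(int), 0, in_h - 2)
    x0 = np.clip(xs.astype(int), 0, in_w - 2)
    wy = (ys - y0)[:, None]                   # fractional row weights
    wx = (xs - x0)[None, :]                   # fractional column weights
    img = img.astype(float)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x0 + 1] * wx
    bot = img[y0 + 1][:, x0] * (1 - wx) + img[y0 + 1][:, x0 + 1] * wx
    return top * (1 - wy) + bot * wy
```

This gives every sample an identically sized input, avoiding the size bias in the fully connected layer mentioned above.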

Entropy Feature
To obtain the entropy feature of each malware family, we took each byte of the malware binary file as a random variable, and counted the frequency of each value (00h–FFh). More concretely, the bytes read from the binary file were divided into several segments. For each segment, we first calculated the frequency of each byte value p_j (1 ≤ j ≤ m), and then calculated the entropy y_i of the segment. The entropy values were then represented as a stream of values that could be reshaped as an entropy graph with the size of 254 × 254 × 1. These entropy graphs were then converted into a 4096-dimensional feature vector through the convolutional extractor of the VGG-16 architecture Simonyan and Zisserman [2014].

Model Configurations
The task-aware meta-learner network T(t) was a two-layer FC network with a hidden unit size of 512, except for the top layer, which had 4096 input units. The weight generator g_i was a single FC layer whose output dimension matched the output dimension of the corresponding feature layer in F. We added a ReLU function to the output of g_i in cases where processed malware images were used as inputs, and the convolutional part was configured as a 4-layer convolutional network (excluding pooling layers), following the same structure as Koch et al. [2015]. In addition, a ReLU function and batch normalization were added after each convolution layer and FC layer. The total number of parameters in our proposed model is 40 million. An overview of our network configurations is given in Figure 9.

Results
We set the batch size to 32, and used Adam as the optimizer, with an initial learning rate of $10^{-4}$ for the image network and the weight generators, and $10^{-5}$ for the task embedding network. The network was trained for 50 epochs, which took approximately 2 h. As Figure 10 illustrates, both the training and the validation loss stabilized after 50 epochs, confirming that training had converged by this stage.
Figure 10: Loss during training over epochs.
The testing process conducted $M$ N-way N-shot learning tasks, of which $Q$ correct predictions contributed to the accuracy, calculated by the following formula: Accuracy $= (100 \times Q/M)\%$.
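The accuracy formula can be expressed directly as:

```python
def matching_accuracy(q_correct: int, m_total: int) -> float:
    """Accuracy = (100 * Q / M) %, from M test tasks with Q correct predictions."""
    return 100.0 * q_correct / m_total
```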

N-Way Matching Accuracy
The evaluation of N-way learning at each test stage was carried out for the one-shot and five-shot settings. For N-way one-shot learning, we chose an anchor image $\hat{x}$ from one test class, and then randomly selected $N$ images to form the support set $X = \{x_i\}_{i=1}^{N}$, where $x_1$ belonged to the same class as the anchor image $\hat{x}$, and the other images in the support set were from different classes. The similarity score between $\hat{x}$ and each support image was calculated using our model, yielding the set of scores $S = \{s_i\}_{i=1}^{N}$. Specifically, if the similarity score $s_1$ computed for the feature vector of $x_1$ was the maximum of $S$, then the task was labeled as a correct prediction; otherwise, it was regarded as an incorrect prediction. For N-way five-shot learning, we randomly selected $N$ unseen classes and six instances per class, of which five instances of each class were randomly selected as the support set, and the remaining instance of each class formed the query set; the prediction procedure was the same as in the one-shot test.
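The one-shot prediction rule can be sketched as follows; here `embed` and cosine similarity are hypothetical stand-ins for the trained Siamese branch and its learned similarity score:

```python
import numpy as np

def one_shot_episode(embed, anchor, support_same, support_other) -> int:
    """Score one N-way one-shot task.

    `embed` stands in for a trained Siamese branch, and cosine similarity is an
    illustrative stand-in for the model's learned similarity score. The support
    set is ordered so that x_1 (index 0) shares the anchor's class.
    """
    support = [support_same] + list(support_other)
    a = embed(anchor)
    scores = []
    for x in support:
        v = embed(x)
        scores.append(float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v))))
    # The prediction is correct iff s_1 is the maximum of S = {s_1, ..., s_N}.
    return int(np.argmax(scores) == 0)
```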
The matching accuracy of N-way one-shot and five-shot learning is illustrated in Figure 11. We randomly used 50 pairs of images, 25 positive image pairs and 25 negative image pairs, to test the effectiveness of our proposed model. As shown in the N-way one-shot result in Figure 11a, 19 out of 25 positive image pairs were matched correctly, while there were 6 false negatives (i.e., 6 positive pairs not matched correctly). Similarly, 23 of 25 negative pairs were matched correctly, while there were 2 false positives (i.e., 2 negative pairs matched incorrectly). For the N-way five-shot results (shown in Figure 11b), the matching accuracy was higher: 22 out of 25 pairs were matched correctly for both positive and negative pairs, with 3 incorrectly matched results in each case.
Figure 12 shows the projection of the embedding space, using the two-dimensional principal component analysis (PCA) technique, where each orange point depicts the distance of a positive pair, and each blue point depicts the distance of a negative pair. Figure 12a shows the embedding space before training, while Figure 12b projects the embedding space after training. After training, we could clearly see two distinct clusters: one around the distances calculated for the positive pairs, and the other for the negative pairs. This confirmed that our proposed model had learned to distinguish the similarities among positive pairs and negative pairs and to separate them far apart.
Table 4 shows the result of benchmarking our proposed model against the current state-of-the-art, namely the Matching network and the Prototypical network, as well as the original Siamese network. Our model surpassed the performance of the Matching network and the Prototypical network by 2.4% and 1.8%, respectively, on 5-way one-shot learning. The difference between our one-shot and five-shot results was 3.1%, 1.9%, and 1.4% on the 5-way, 10-way, and 15-way, respectively. Our proposed five-shot result outperformed all three existing models by 1.4%, 5.9%, and 5.4% on the 5-way, 10-way, and 15-way, respectively.

Distance Measure Effectiveness
We also examined the effectiveness of our proposed model in distance measurement, using the AUC (area under the curve) of the ROC (receiver operating characteristic) curve. The ROC curve is composed of two performance measures: the true-positive rate (TPR) and the false-positive rate (FPR). The equations for these two measures are as follows:
$$FPR(P^*) := \int_{P^*}^{1} f_0(p)\,dp, \qquad TPR(P^*) := \int_{P^*}^{1} f_1(p)\,dp,$$
where $f_0(p)$ denotes the probability density function of the predictions $p(x)$ produced by our proposed model for the negative pairs, which are labeled as 0, and $f_1(p)$ is the corresponding density for the positive pairs, which are labeled as 1. For a given discrimination threshold $P^*$, the true-positive rate and false-positive rate are the integrals of the tails of these two distributions. Based on the two parameters TPR and FPR, the AUC measures the entire two-dimensional area underneath the ROC curve (i.e., by integral calculus) from (0,0) to (1,1):
$$AUC := \int_{0}^{1} TPR\, d(FPR).$$
For example, a model whose predictions are 100% wrong has an AUC of 0.0, while a model whose predictions are 100% correct has an AUC of 1.0. Using this concept, we evaluated the predictions produced by the saved weights of the one-shot and five-shot learned models on the test set, including 5-way, 10-way, and 15-way. These AUC-ROC curves are shown in Figures 13 and 14, which illustrate that a β value of 0.8 provided the best performance when tested on one-shot N-way learning. This result also confirms that the hybrid multi-loss function can reduce the distance between samples of the same class, and enlarge the distance between different classes. Additionally, it does not change the attribute of the feature in the feature space, so the optimization of this layer never negatively affects the deeper network layers.
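A minimal numerical sketch of this AUC computation, assuming the model's similarity predictions for negative (label 0) and positive (label 1) pairs are given as two score lists:

```python
import numpy as np

def auc_roc(scores_neg, scores_pos) -> float:
    """AUC from the tail integrals FPR(P*) and TPR(P*), swept over thresholds.

    `scores_neg` / `scores_pos` are the model's predictions p(x) for pairs
    labeled 0 (negative) and 1 (positive), respectively.
    """
    neg = np.asarray(scores_neg, dtype=float)
    pos = np.asarray(scores_pos, dtype=float)
    # Sweep the discrimination threshold P* from +inf down to -inf.
    thresholds = np.concatenate(
        ([np.inf], np.sort(np.concatenate((neg, pos)))[::-1], [-np.inf]))
    tpr = [(pos >= t).mean() for t in thresholds]  # tail mass of f_1 above P*
    fpr = [(neg >= t).mean() for t in thresholds]  # tail mass of f_0 above P*
    # Trapezoidal integration of TPR d(FPR), from (0,0) to (1,1).
    auc = 0.0
    for i in range(1, len(thresholds)):
        auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return auc
```

A perfectly separating model yields an AUC of 1.0, and indistinguishable score distributions yield 0.5, matching the definition above.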
This hybrid loss function can also be used to compute classification accuracy with a learned threshold on distances. As shown, our results form a set of points in the TPR-FPR plane. The results achieved AUC-ROC values of 0.92, 0.91, and 0.80 under 5-way, 10-way, and 15-way one-shot learning, respectively.
We further conducted the AUC-ROC evaluation on five-shot learning. Our proposed model again obtained better performance than the generic SNN, with 95.6%, 90.7%, and 86.8% at 5-way, 10-way, and 15-way, respectively. As expected, the accuracy of both one-shot and five-shot dropped as the number of ways increased, owing to higher intra-class variance.
Note that our model consistently performed better, as the AUC-ROC areas (i.e., the areas up to the blue line) of our proposed model were larger than those of the generic SNN in these graphs.

Conclusions
We propose a novel task-aware meta-learning-based Siamese neural network to accurately classify different malware families even in the presence of obfuscated malware variants. Each branch of the CNNs used by our model has an additional network called the "task-aware meta-learner network" that can generate task-specific weights, using the entropy graphs obtained from malware binary code. By combining the weight-specific parameters with the shared parameters, each CNN in our proposed model produces fully connected feature layers, so that the feature embedding in each CNN is accurately adjusted for different malware families, despite there being obfuscated malware variants in each malware family.
In addition, our proposed model can provide accurate similarity scores, even if it is trained with a limited number of samples. Our model also uses a pre-trained VGG-16 network in a meta-learning fashion, to compute accurate weight factors for entropy features. This meta-learning approach essentially solves the issues that are associated with creating potential bias due to not having enough training samples.
Our model also offers two different types of innovative loss functions that can more accurately compute the similarity scores within a CNN and the feature embeddings used by two CNNs.
Our experimental results show that our proposed model is highly effective in recognizing the presence of unique malware signatures, and is thus able to correctly classify obfuscated malware variants that belong to the same malware family.
We are planning to apply our model to different types of malware samples (e.g., DDoS attacks Wei et al. [2021] and ransomware families Zhu et al. [2021a,b], McIntosh et al. [2018, 2019]) and other data samples (e.g., finding similar abnormalities in medical X-ray images Feng et al. [2022]), to test its generalizability.