1 Introduction

Malicious code is software designed to carry out malicious activities or launch attacks. According to the "Internet Security Situation Analysis Report for the First Half of 2021" published by the China National Internet Emergency Center (CNCERT/CC) [1], approximately 23.07 million malicious software samples were captured in the first half of 2021, with a daily distribution frequency of 5.82 million instances, involving about 208,000 malicious code families and infecting approximately 4.46 million computer terminals. Concurrently, the "China Internet Security Report 2022" released by Rising [2] indicated that 73.55 million malicious code samples were caught by the Rising "Cloud Security" system in 2022, with 124 million virus infections across devices belonging to individuals, companies, and government agencies. The rapid propagation and variation of malicious code have severely impacted users' everyday lives, jeopardized national cybersecurity, and hindered the development of a shared digital future. Hence, accurately and efficiently detecting and categorizing malicious software and its variants has become a focal point in this field.

Traditional malicious software identification methodologies hinge on signature-based matching. This requires researchers to manually extract signatures from malicious software using expert knowledge and then compare them with known signatures stored in a database. However, as obfuscation and packing techniques have evolved, numerous variants of malicious software have been generated, rendering traditional detection methods inefficient and ineffective at recognizing them. To tackle the challenges faced by static analysis-based malicious code detection methodologies, visualization-based detection and classification techniques for malicious code have emerged [3, 4]. These methods map malicious code to images, exploiting the observation that samples within the same malicious code family share distinct texture features while samples from different families differ. They extract texture features from the malicious code image and use these features to detect and categorize malicious code samples. Visualization-based malicious code analysis methodologies rely neither on expert knowledge nor on static decompilation procedures and have proven capable of detecting malicious code variants effectively.

Since the method was proposed, many experts and scholars have studied it, and some progress has been made, mainly by applying machine learning and deep learning techniques to improve detection accuracy and efficiency. Nataraj et al. [5] fused image and signal features to describe malicious code and used a KNN (K-Nearest Neighbor) classifier to identify it. Kancherla et al. [6], to enhance feature diversity, fused Gabor, Wavelet, and intensity features into a combined feature set and trained an SVM (Support Vector Machine) classifier to achieve malicious code classification. Yashu Liu et al. [7] constructed anti-obfuscation features by fusing GIST features and LBP (Local Binary Pattern) features of malicious images to address the degradation of classification performance on similar malicious images. The above studies applied machine learning to visualization-based malicious code detection; although they made some progress, they usually require manual feature extraction from the data, and their detection efficiency is low.

With the advent of deep learning, Naeem et al. [8] proposed a malware variant classification method. They first converted malware files into grayscale images and then used global malicious and local collective mechanisms to identify malware variants. Their paper describes the proposed method in detail; its main shortcoming is that comprehensive empirical results are not provided to demonstrate variant classification performance. Mathew et al. [9] devised a method that converts malware files into color images and introduces a local pyramid pooling layer to handle various input image sizes. However, their paper likewise did not report variant classification performance, especially experimental results for various levels of variability.

In recent years, with the advancement of computer vision and deep learning technologies, deep learning algorithms, represented by Convolutional Neural Networks (CNNs), have made breakthrough progress in image classification and feature recognition, becoming the mainstream architecture for visual models. They have gradually been applied to malicious code detection and classification. The classic Convolutional Neural Network (ConvNet) [10], composed of Conv, ReLU, and pooling layers, has achieved significant success in image recognition. The advent of Inception [11], ResNet [12], and DenseNet [13] shifted a vast amount of research interest toward intricately designed architectures, escalating model complexity. Recent architectures are based on automatic [14] or manual [15] architecture search or on searched compound scaling strategies [16]. Although many complex CNNs have improved in accuracy, their disadvantages are significant: (1) Complex multi-branch designs, for instance, residual additions in ResNet and branch connections in Inception, make the model difficult to implement and customize, leading to slow inference and low memory utilization. (2) Certain components, such as depthwise convolution in Xception and MobileNets [17, 18] and channel shuffling in ShuffleNets [19], increase memory access overhead and lack support on various devices.

With the success of data-driven models in image classification, detection and segmentation tasks [20, 21], a range of hybrid visual transformer models have emerged [22,23,24,25]. Different from convolution layers, the self-attention mechanism of Vision Transformers offers a global context by modeling long-distance dependencies. However, achieving this global view often incurs high computational costs [26] and increases memory access overhead [27], thus resulting in significant latency overhead. To alleviate this challenge, some studies [26, 28, 29] focus on mitigating the computational burden associated with self-attention layers. Design approaches include replacing Patchify Stem with convolution layers [30], introducing early convolution stages [31], or employing window attention [32] to implement implicit hybrid models. The latest research has established explicit hybrid structures that better facilitate information exchange among tokens (or patches) [33,34,35]. In most hybrid structures, token mixing primarily depends on self-attention.

Inspired by recent work [27] that uses reparametrized skip connections to reduce memory access costs, this study introduces an architectural component called the RepMixer into the model. This operator is a fully reparametrizable token mixer that combines the advantages of convolutional architectures and Transformers, supplanting the self-attention layer to reduce computational latency. Skip connections are eliminated through structural reparametrization. In addition, RepMixer employs depthwise convolution to carry out spatial information mixing similar to that of ConvMixer. To enhance performance, various studies [17, 18, 36] have incorporated depthwise or group convolution followed by 1 × 1 pointwise convolution to factorize k × k convolutions. Although this technique effectively improves a model's overall efficiency, the reduced parameter count may lower model capacity. To further improve latency, the number of floating-point operations (FLOPs), and parameter count, more recent research [27, 37, 38] employs linear train-time over-parametrization to increase such a model's capacity. This paper replaces all dense k × k convolutions with their factorized version, that is, depthwise convolution followed by pointwise convolution, and uses the linear train-time over-parametrization proposed in [27] to enhance the capacity of these layers, as sketched below. These additional branches are introduced only during training and are reparametrized away during inference.
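A minimal PyTorch sketch of this factorization (the function name and arguments are illustrative, not part of the model's released code):

```python
import torch.nn as nn

def factorized_conv(in_ch: int, out_ch: int, k: int) -> nn.Sequential:
    """Replace a dense k x k convolution with a depthwise k x k convolution
    (spatial mixing) followed by a 1 x 1 pointwise convolution (channel mixing)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1),                               # pointwise
    )
```

The parameter count drops from in_ch × out_ch × k² to in_ch × k² + in_ch × out_ch, which is the capacity gap the train-time over-parametrization branches later compensate for.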

Additionally, in the design of MDC-RepNet (Structural Reparameterization and Multi-scale Deep Convolutional Classifier Network), we adopt large kernel convolution to replace the early-stage self-attention mechanism. Although self-attention-based Vision Transformers achieve high accuracy, they are inefficient in terms of latency [26]. We therefore introduce large kernel convolutions in the Feed-Forward Network (FFN) layer and the patch embedding layer. Compared with other Vision Transformer architectures, MDC-RepNet improves performance with only a negligible impact on overall latency.

Aiming at the insufficient accuracy and low efficiency of current deep learning-based malicious code classification methods, this paper proposes a malicious code detection method combining CNNs and Transformers. Compared with other deep learning-based malicious code detection methods, the proposed MDC-RepNet has the following advantages.

1. To address the loss of image texture information, the data preprocessing stage uses a pixel-filling-based image size normalization algorithm and data augmentation techniques to mitigate, respectively, the texture information loss caused by rescaling malicious code images and the dataset category imbalance problem, enhancing the expression of key features and alleviating model overfitting.

2. To address slow detection speed, a deep neural network is adopted as the framework, and a fusion module is introduced so that its structure can be reparameterized, effectively reducing memory access cost by eliminating skip connections in the network.

3. To address poor classification accuracy, linear train-time over-parameterization and large kernel convolution techniques are used to improve network accuracy.

4. Experiments demonstrate that the proposed method yields a stable improvement in both accuracy and efficiency, outperforming the latest malicious code detection techniques.

In conclusion, MDC-RepNet builds on the Vision Transformer architecture, leveraging structural reparametrization to achieve lower memory access costs and higher efficiency, realizing a superior accuracy-latency trade-off.

2 Malicious Code Classification Method Based on MDC-RepNet

Our proposed malicious code detection scheme consists of two core components: data preprocessing and the construction of MDC-RepNet. The data preprocessing stage includes malicious code visualization, image size normalization, and data augmentation. The MDC-RepNet construction stage introduces the RepMixer token-mixing operator and employs a structural reparametrization strategy to eliminate skip connections within the network, reducing memory access costs. Simultaneously, it utilizes train-time over-parametrization and large kernel convolution techniques to enhance the model's accuracy. The complete architecture is shown in Fig. 1.

Fig. 1 Schematic diagram of the model structure

2.1 Data Preprocessing

2.1.1 Malware Visualization

Malicious code visualization is the conversion of malicious code executables into grayscale images. It requires no feature engineering or domain expert knowledge and is a simple, easy-to-use method for malicious code analysis. Visualization-based malicious code analysis presents the static structural information of the malicious code as images, enabling fast processing of huge numbers of samples. Its ability to capture small changes between malicious code variants while preserving the global structure aids the analysis of malicious code.

The process of malicious code visualization transforms a malicious code binary file into a grayscale image, as illustrated in Fig. 2. First, the binary file is read in groups of 8-bit unsigned integers, and each group is converted into a decimal integer. Next, the row width is determined according to the PE file size (the correspondence between file size and row width is shown in Table 1), and the sequence is reshaped into a two-dimensional array. Finally, each element of the two-dimensional array is treated as a grayscale value, mapping the array onto a grayscale image; sample conversions for several malicious families are depicted in Fig. 3. A minimal sketch of this pipeline is given below.
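The following Python sketch illustrates the conversion. The width schedule stands in for Table 1 and is an assumption (we use the schedule of Nataraj et al. [5]); names are illustrative:

```python
import numpy as np
from PIL import Image

# (max file size in KB, image width); assumed values following Nataraj et al. [5]
WIDTHS = [(10, 32), (30, 64), (60, 128), (100, 256),
          (200, 384), (500, 512), (1000, 768)]

def malware_to_grayscale(path: str) -> Image.Image:
    data = np.fromfile(path, dtype=np.uint8)        # read 8-bit unsigned integers
    kb = len(data) / 1024
    width = next((w for cap, w in WIDTHS if kb < cap), 1024)
    height = len(data) // width                     # drop the trailing partial row
    array = data[: height * width].reshape(height, width)
    return Image.fromarray(array, mode="L")         # each byte = one gray pixel
```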

Fig. 2 Illustration of malware image visualization

Table 1 Image width for various file sizes
Fig. 3 Samples of different malware family grayscale images

The correspondence between the different sections of a malicious code binary file and regions of its grayscale image is shown in Fig. 4. In Fig. 4, the .text section contains not only the malware's executable code but also black blocks filled with zeros. The .data section contains initialized and uninitialized variables. The final .rsrc section contains the various compiled resources of the malicious code, including the program's icons.

Fig. 4 Malicious code PE file sections and their corresponding visualized image fragment information

2.1.2 Malware Image Size Normalization

In convolutional neural networks, the size of the weight matrix in the fully connected layer is fixed, so the feature size fed into the fully connected layer must be consistent. If input image sizes vary, the feature sizes after convolution and pooling also differ, making the fully connected layer unusable. Thus, all images input to the convolutional neural network must have the same size. However, the images produced by visualizing malicious code differ in size, so the visualized malicious images must be size-normalized.

We adopt the bilinear interpolation algorithm for image size normalization, so as to preserve the original texture features of the malicious images as much as possible. The algorithm first selects the four pixel points directly adjacent to the interpolation point in the malicious image, then performs linear interpolation twice in the \(x\) direction, and finally performs linear interpolation once in the \(y\) direction to obtain the pixel value at the interpolation point:

$$f(x,y_{1} ) = \frac{{x_{2} - x}}{{x_{2} - x_{1} }}f(x_{1} ,y_{1} ) + \frac{{x - x_{1} }}{{x_{2} - x_{1} }}f(x_{2} ,y_{1} )$$
(1)
$$f(x,y_{2} ) = \frac{{x_{2} - x}}{{x_{2} - x_{1} }}f(x_{1} ,y_{2} ) + \frac{{x - x_{1} }}{{x_{2} - x_{1} }}f(x_{2} ,y_{2} )$$
(2)
$$f(x,y) = \frac{{y_{2} - y}}{{y_{2} - y_{1} }}f(x,y_{1} ) + \frac{{y - y_{1} }}{{y_{2} - y_{1} }}f(x,y_{2} )$$
(3)

where \(f(x,y)\) is the pixel value of the interpolation point in the malware image, and \((x_{i} ,y_{j} )\) \((i,j = 1,2)\) are the four pixels adjacent to the interpolation point. Figure 5 shows the malicious image of a sample from the Allaple.A family after normalization; by observation, the basic texture features of the malicious image are well preserved after bilinear interpolation.
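In practice this normalization can be performed with any image library's bilinear resize; a minimal PIL sketch, with the 256 × 256 target chosen per the experiments in Sect. 3.3.1:

```python
from PIL import Image

def normalize_size(path: str, size: int = 256) -> Image.Image:
    """Bilinear size normalization; PIL's BILINEAR filter implements
    the two-step interpolation of Eqs. (1)-(3)."""
    img = Image.open(path).convert("L")   # grayscale malicious code image
    return img.resize((size, size), resample=Image.BILINEAR)
```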

Fig. 5 Bilinear interpolation used to rescale the malicious code image

2.1.3 Data Augmentation Techniques

In deep learning models, classification performance is closely related to dataset quality; an adequate and balanced dataset not only improves the model's classification accuracy but also helps avoid overfitting. When the dataset is small or the number of samples per category is unbalanced, data augmentation can be used to increase the number of samples in minority categories, suppressing the impact of class imbalance and improving model robustness. Common image data augmentation generates new data by transforming the original images, e.g., scaling, flipping, and shifting. To address the unbalanced number of samples across categories in the malicious code dataset, this paper uses Python image data augmentation functions to expand the dataset; the parameter settings used in the experiments are given in Table 2, and an illustrative pipeline is sketched after the table.

Table 2 The parameter settings of data augmentation
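A minimal torchvision sketch of such a pipeline; the specific parameter values below are assumptions for illustration, not the exact settings of Table 2:

```python
from torchvision import transforms

# Augmentation for minority families: flip, shift, and scale the images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),           # flipping
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),     # shifting
                            scale=(0.9, 1.1)),        # scaling
    transforms.ToTensor(),
])
```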

2.2 Feature Extraction and Classification

The design of MDC-RepNet is inspired by the combination of CNNs and Transformers: CNNs excel at extracting local features from images, while Transformers capture global sequence information. MDC-RepNet combines the advantages of both to achieve stronger feature extraction and representation capabilities. The overall architecture of MDC-RepNet is shown in Fig. 6.

Fig. 6 Overall architecture of the MDC-RepNet

The starting point of MDC-RepNet is the stem, which uses convolutional structures for feature extraction. During inference, the stem consists of a 3 × 3 convolution, a 3 × 3 depthwise convolution, and a 1 × 1 convolution to extract multi-scale features from the original image. To enable structural reparameterization, additional 1 × 1 convolution or identity branches are introduced during training, providing greater flexibility and helping to optimize the model's representational ability.

MDC-RepNet is divided into four stages, each of which halves the resolution of the feature map and doubles the number of channels. The first three stages share the same internal structure, using the RepMixer shown in Fig. 6d for token mixing, as sketched below. This structure achieves feature reuse across stages and dimensions and improves the model's representational power by reparameterizing the skip connections.
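A minimal PyTorch sketch of the RepMixer idea under this description (the layer layout and the 3 × 3 kernel are assumptions):

```python
import torch.nn as nn

class RepMixer(nn.Module):
    """Train-time token mixer: y = x + dwconv(bn(x)). After training, the BN
    folds into the depthwise convolution (Sect. 2.2.1) and the identity branch
    is absorbed by adding 1 to the center tap of each depthwise kernel,
    leaving a single depthwise convolution at inference."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)
        self.mixer = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)

    def forward(self, x):
        return x + self.mixer(self.norm(x))
```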

The internal structure of the fourth stage is shown in Fig. 6a, using attention as a token mixer. This design sacrifices inference speed to ensure higher accuracy. The attention mechanism allows the model to focus on key information within the global scope, further improving the quality of feature representation.

A ConvFFN architecture is used in each stage of MDC-RepNet, differing from a traditional FFN: ConvFFN combines a depthwise separable convolution (7 × 7) with a feed-forward network to achieve more efficient feature extraction and representation. The depthwise separable convolution lets the model learn more complex spatial features while reducing computation, which helps improve inference speed and accuracy; a sketch follows.
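A minimal sketch of a ConvFFN block consistent with this description (the channel expansion ratio of 4 is an assumption):

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """7 x 7 depthwise convolution for local spatial mixing, followed by a
    1 x 1 convolutional feed-forward expansion and projection."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.bn = nn.BatchNorm2d(dim)   # fusable with self.dw at inference
        self.fc1 = nn.Conv2d(dim, dim * expansion, 1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(dim * expansion, dim, 1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(self.bn(self.dw(x)))))
```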

To achieve structural re-parameterization, MDC-RepNet introduces a novel fusion module. This module aims to fuse different levels of features from CNN and Transformer, leveraging the advantages of both. During training, the fusion module allows the model to adaptively adjust the feature fusion method according to task requirements, optimizing the model's performance. This design provides greater flexibility for the model, enabling it to better adapt to various visual tasks.

In summary, MDC-RepNet achieves powerful feature extraction and representation by combining a CNN and a Transformer: the CNN extracts local features from an image, while the Transformer captures global sequence information using a self-attention mechanism. This integration enables MDC-RepNet to process both the spatial and sequential information of an image, yielding excellent performance across a variety of visual tasks. In addition, the novel fusion module and the ConvFFN architecture further improve its feature extraction and representation capabilities.

2.2.1 Structural Reparameterization

Multi-branch network structures, with receptive fields of varying scales, increase the network's width and parameter count compared to plain single-path structures, which is conducive to enhancing network performance. However, as the network becomes more branched, memory consumption during training and inference speed are significantly affected. Therefore, Ding et al. [39] proposed structural reparametrization, which equivalently converts complex multi-branch structures into a single-branch structure. In this way, a multi-branch structure can be used during training to enhance performance, and after training the network can be converted into a single-branch structure for inference. The converted single-branch structure maintains the original network's accuracy while improving running speed and reducing memory consumption and parameter count.

Figure 7 illustrates the transformation of the multi-branch structure used during training into a single-branch structure for inference. Figure 7a presents a multi-branch structure known as a basic block. Apart from the bottommost branch, each branch of the basic block comprises a convolution layer and a Batch Normalization (BN) layer, while the bottommost branch consists solely of a BN layer. Fusing all branches of the basic block into a single branch entails multiple fusion steps, including fusing BN layers with convolution layers and fusing convolution layers of different sizes.

Fig. 7 Processing flow for transforming a multi-branch structure into a single-branch structure

Fusion of the BN layer with the convolution layer. The convolution and BN operations are represented as follows:

$$y_{{{\text{conv}}}} = \omega \cdot x + b$$
(4)
$$BN_{\gamma ,\beta } (y_{{{\text{conv}}}} ) = \gamma \frac{{y_{{{\text{conv}}}} - \mu_{B} }}{{\sqrt {\sigma_{B}^{2} + \varepsilon } }} + \beta$$
(5)

In the equations, x represents the input. Hence, after passing through the convolution layer and the BN layer, the output for input x can be expressed as:

$${\text{BN}}_{\gamma ,\beta } (x) = \frac{\gamma \omega }{{\sqrt {\sigma_{B}^{2} + \varepsilon } }}x + \frac{\gamma }{{\sqrt {\sigma_{B}^{2} + \varepsilon } }}(b - \mu_{B} ) + \beta$$
(6)

In the equations, \(\omega\) and \(b\) are the weights and bias before merging, \(\gamma\) and \(\beta\) are the scaling and translation parameters obtained after training, \(\mu_{B}\) and \(\sigma_{B}^{2}\) are the mean and variance over the training data, and \(\varepsilon\) is a small constant to avoid division by zero. Define:

$$\hat{\omega } = \frac{\gamma \omega }{{\sqrt {\sigma_{B}^{2} + \varepsilon } }}$$
(7)
$$\hat{b} = \frac{\gamma }{{\sqrt {\sigma_{B}^{2} + \varepsilon } }}(b - \mu_{B} ) + \beta .$$
(8)

From Eqs. (7) and (8):

$${\text{BN}}_{\gamma ,\beta } (x) = \hat{\omega } \cdot x + \hat{b}$$
(9)

where \(\hat{\omega }\) and \(\hat{b}\) are the weights and bias of the fused convolutional kernel, respectively. Through the above transformations, the convolutional layer can be fused with the BN layer to reduce computation during inference. The remaining fusions proceed as follows:

(1) Fusion of 1 × 1 convolution with 3 × 3 convolution. The 1 × 1 kernel is zero-padded to 3 × 3; by the additivity of convolution, the padded kernel is added to the original 3 × 3 kernel to obtain the fused 3 × 3 convolution.

(2) Fusion of 3 × 3 convolution with 7 × 7 convolution. Similarly, the 3 × 3 kernel is zero-padded to 7 × 7 and added to the original 7 × 7 kernel to obtain the fused 7 × 7 convolution.

(3) Fusion of the residual branch with the BN layer. A residual (identity) branch can be viewed as a convolution of the input with a special kernel: a 1 × 1 convolution whose weight is one for the channel corresponding to itself and zero for all other channels, so that the output equals the input. This 1 × 1 convolution is fused with the BN layer and then zero-padded to a 5 × 5 convolution.

After fusing the convolutional layers with the BN layers as described above, Fig. 7a is transformed into Fig. 7b, and the additivity of convolution transforms Fig. 7b into Fig. 7c. At this point, a multi-branch basic block used during training has been equivalently transformed into a single convolution for inference.
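The fusion of Eqs. (7) and (8) can be written directly against PyTorch module parameters; a minimal sketch (the function name is illustrative):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BN layer into the preceding convolution per Eqs. (7)-(8)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sqrt(var+eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(scale * (b - bn.running_mean) + bn.bias)
    return fused

# Kernels of different sizes are then merged by zero-padding and adding, e.g.
# a fused 1 x 1 kernel w1 is padded before being added to a 3 x 3 kernel w3:
#   w3 += torch.nn.functional.pad(w1, [1, 1, 1, 1])
```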

2.2.2 Linear Train-Time Overparameterization

To further reduce parameter count, FLOPs, and latency, all of the model's dense k × k convolutions are replaced with their factorized versions, i.e., k × k depthwise convolutions followed by 1 × 1 pointwise convolutions. However, the lower number of parameters resulting from the factorization reduces the model's capacity. To restore the capacity of the factorized layers, linear train-time over-parameterization is used: the model is trained with more parameters than are required at inference. Adding parameters during training helps fit the training data better and thus improves performance, but the extra branches increase computational overhead and therefore training time.

Because the over-parameterization is linear, the additional branches can be folded back into a single operator after training, so model complexity at inference is unchanged and training remains feasible in a reasonable amount of time. MDC-RepNet replaces only dense k × k convolutions with their factorized forms and over-parameterizes them as described above. These layers are located in the convolutional stem, patch embedding, and fully connected layers. Since the computational cost of these layers is lower than that of the rest of the network, over-parameterizing them does not significantly increase training time.
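A minimal sketch of such a train-time over-parameterized depthwise layer (the branch count is an assumption); after training, each branch's conv and BN are fused per Sect. 2.2.1 and, since all branches are linear, their kernels are summed into a single convolution:

```python
import torch.nn as nn

class OverparamDWConv(nn.Module):
    """N parallel depthwise conv-BN branches are summed during training and
    collapse linearly into one depthwise convolution for inference."""
    def __init__(self, dim: int, k: int = 3, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=False),
                nn.BatchNorm2d(dim),
            )
            for _ in range(num_branches)
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)
```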

2.2.3 Large Kernel Convolution

Compared with the receptive field of self-attention in Vision Transformer architectures, the receptive field of RepMixer is local. However, self-attention-based Vision Transformers incur a higher computational overhead. Introducing depthwise large kernel convolution in the FFN and patch embedding layers is a computationally efficient way to enlarge the receptive field in the early stages without using self-attention. MDC-RepNet therefore introduces large kernel convolution in the patch embedding layer and the FFN. The experimental model parameter settings are shown in Table 3.

Table 3 MDC-RepNet parameter settings

The architecture of the FFN and patch embedding layer is shown in Fig. 6c. The FFN block has a structure similar to the ConvNet block with some key differences: the MDC-RepNet architecture uses Batch Normalization instead of Layer Normalization. BN layers can be fused with the preceding layer during inference and, unlike Layer Norm, require no additional reshaping operations to obtain a suitable tensor layout.

ConvFFN blocks are usually more robust than vanilla FFN blocks [40]; as the receptive field grows, a sizeable convolutional kernel helps improve model robustness. Therefore, incorporating large convolutional kernels is an effective way to improve model performance and robustness.

3 Experiments and Analyses

3.1 Data Set and Experimental Environment

This experiment uses the Malimg dataset published by the Vision Research Laboratory of the University of California for model training and classification. The dataset contains 9435 malicious code samples from 25 malicious families. Model training was performed on Ubuntu 22.04.2 with CUDA 11.8, using the PyTorch 1.11.0 deep learning framework on an Nvidia GeForce RTX 3080Ti GPU.

The Malimg dataset is divided into a training set and a validation set in a ratio of 9:1: the training set is used for model training, and the validation set for observing and evaluating the model's performance. The name of each family and the distribution of samples per family are shown in Table 4.

Table 4 Details of malimg dataset

3.2 Evaluation Indicators

3.2.1 Model Training Performance Evaluation Metrics

Four metrics commonly used in the field of malicious code classification, Accuracy, Precision, Recall, and F1-score, are selected to evaluate classification; they have been widely used in related research [41,42,43]. The formulas are as follows:

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(10)
$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(11)
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(12)
$${\text{F1-score}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(13)

where TP is the true positive count (malicious code correctly categorized as malicious), FN is the false negative count (malicious code incorrectly categorized as regular code), FP is the false positive count (regular code incorrectly categorized as malicious code), and TN is the true negative count (regular code correctly categorized as regular code).

The model's performance is presented visually using a confusion matrix; the values of TP, FP, TN, and FN for the multiclass problem are shown in Fig. 8, where Fi (i = 0, 1, 2, …, n) denotes the malicious code family category. A sketch of computing the metrics from this matrix follows.
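Given the multiclass confusion matrix of Fig. 8, the per-family quantities can be read off its rows and columns; a minimal NumPy sketch:

```python
import numpy as np

def per_family_metrics(cm: np.ndarray, eps: float = 1e-12):
    """Derive Eqs. (10)-(13) per family from a confusion matrix whose rows are
    true families and columns are predicted families; eps guards empty classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                 # predicted as family i, but not i
    fn = cm.sum(axis=1) - tp                 # family i, predicted as another
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    accuracy = tp.sum() / cm.sum()           # trace / total: overall Eq. (10)
    return accuracy, precision, recall, f1
```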

Fig. 8 Confusion matrix for multiclass classification problems

3.2.2 Indicators for Evaluating the Speed of Model Inference

MDC-RepNet uses structural reparameterization to optimize running speed in the inference phase. Using the held-out test set (10%, about 944 images) as experimental data, the time taken by MDC-RepNet and by classical network models to predict an unknown sample, i.e., the per-image time overhead of categorizing malicious code, is measured as the metric of inference speed, following the protocol sketched below.
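A minimal PyTorch sketch of this timing protocol (the function name, warm-up count, and batch size of 1 are assumptions):

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, loader, device: str = "cuda", warmup: int = 10) -> float:
    """Average per-image prediction time over the held-out test images."""
    model.eval().to(device)
    times = []
    for i, (img, _) in enumerate(loader):
        img = img.to(device)
        torch.cuda.synchronize()             # exclude queued asynchronous work
        start = time.perf_counter()
        model(img)
        torch.cuda.synchronize()
        if i >= warmup:                      # discard warm-up iterations
            times.append(time.perf_counter() - start)
    return 1000 * sum(times) / len(times)
```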

3.3 Analysis of Experimental Results

3.3.1 Hyperparameter Comparison Experiment

Limited by the fully connected layers in CNNs, the size of the malicious code image input to the model must be fixed. In addition, the input image size affects not only the model size but also its effectiveness. To determine the most suitable input size, this paper applies bilinear interpolation to normalize the malicious code images to 32 × 32, 64 × 64, 128 × 128, 256 × 256, and 512 × 512 and tests the model on the Malimg dataset. As Table 5 shows, increasing the image size from 32 × 32 to 256 × 256 raises accuracy from 86.65 to 99.57%; however, increasing it further from 256 × 256 to 512 × 512 decreases accuracy from 99.57 to 98.92%, indicating overfitting. Moreover, as the image size increases, the number of parameters grows, because larger images involve more convolution operations and parameters, consuming more computing resources and lengthening training time. After weighing classification accuracy against parameter count, a 256 × 256 malicious code image is chosen as the model input.

Table 5 Experimental results on the effect of input image size on the model

In deep learning, optimizers are algorithms that update model parameters to find their optimal values. To select an optimizer, we conduct comparative experiments with Adagrad, Adamax, Adam, NAdam, and AdamW, all of which perform well in classification tasks. Table 6 lists the performance of these optimizers on the relevant metrics. The results show that AdamW outperforms the others in accuracy, precision, recall, and F1-score; therefore, AdamW is selected as the optimizer of MDC-RepNet for the malicious code family classification task.

Table 6 Comparative experimental results of different optimizers

3.3.2 Experiments to Verify the Validity of Structural Reparameterization

A structural reparameterization technique is used to decouple the training and inference phases of MDC-RepNet: the multi-branch model is first trained, then equivalently transformed into a single-path model, and finally deployed; in the prediction phase, the transformed model is run. To verify the effectiveness of structural reparameterization, experiments quantitatively analyze the training and inference phases, evaluating the accuracy, FLOPs, inference speed, parameter count, and model size of MDC-RepNet before and after the transformation. In training, the batch size is set to 32, the number of epochs to 30, and the initial learning rate to 0.0001 with the AdamW optimizer; the weight decay is 0.05, the peak learning rate is \(10^{-3}\), cosine scheduling is used to decay the learning rate, and the loss function is cross-entropy. The results are shown in Table 7.

Table 7 Validation experiments on the validity of structural reparameterization

MDC-RepNet has identical recognition accuracy before and after conversion; the FLOPs, parameter count, and model size are reduced by about 10%, while the inference time is reduced by about 15-20%. The experimental results show that the conversion reduces inference time, and the gain in recognition efficiency grows as model depth and width increase.

To verify the effectiveness of structural reparameterization more intuitively, the confusion matrices of MDC-RepNet before and after conversion are plotted in Fig. 9. The two confusion matrices are identical, so the recognition accuracy before and after conversion is also identical, and the network achieves high recognition accuracy after training for 30 epochs. The results show that structural reparameterization neither reduces recognition accuracy nor otherwise affects the performance of the pre-conversion model, while decoupling the training and inference phases reduces inference time and shortens the time needed for malicious code image classification and recognition.

Fig. 9 Confusion matrix before and after structural reparameterization

3.3.3 MDC-RepNet Ablation Synthesis Experiment

MDC-RepNet introduces large kernel convolution at two locations, the patch embedding layer and the FFN, to enlarge the early-stage receptive field without self-attention. To verify the role of large kernel convolution in the whole network, an ablation experiment replaces the self-attention module with large kernel convolution layer by layer; the results are shown in Table 8.

Table 8 Large kernel convolutional ablation experiments

RM indicates that the stage uses RepMixer-FFN blocks, and SA indicates that it uses Self-Attention-FFN blocks. The standard setup uses 3 × 3 factorized convolutions in the patch embedding and backbone layers and 1 × 1 convolutions in the FFN. In variants V4 and V5, large kernel (7 × 7) convolutions are used in the patch embedding and FFN layers.

Comparison shows that, relative to V3, V5 has an 11.2% increase in model size, a 0.4% gain in accuracy, and a 2.3× faster inference speed. V2 is 20% larger than V4 with similar accuracy, while its inference speed is 7.1% higher than V4's. Overall, large kernel convolution provides a 0.9% accuracy gain for MDC-RepNet. The experimental results show that depthwise large kernel convolution has effects similar to the self-attention mechanism while slightly increasing inference speed.

To further verify the contribution of the multi-branch structure and the large kernel convolution to the network's representational ability, several components of MDC-RepNet are removed by network ablation and the role of each component is analyzed. The two shortcut branches and the large kernel convolutional block of MDC-RepNet are each combined with the main branch, and the recognition accuracy of the resulting models is tested. With all other parameters unchanged, the results of the ablation experiments are shown in Table 9.

Table 9 MDC-RepNet ablation synthesis experiment

As can be seen from Table 9, without any of the components, accuracy decreases by 9.16% compared to the full MDC-RepNet, but inference speed increases dramatically. With only one component, accuracy also decreases to varying degrees, and inference speed increases accordingly. The results show that the multi-branch structure increases the network's representational ability and improves recognition accuracy. After introducing large kernel convolution, MDC-RepNet's accuracy increased by 3.1% over the original network at the same inference speed, indicating that large kernel convolution adds no computational cost at the inference stage.

3.3.4 Comparison with Other Deep Learning Models in Image Classification Tasks

The study aims to improve the model's inference speed and reduce per-image recognition time as much as possible while maintaining high recognition accuracy. In this section, the model is trained and compared on the Malimg dataset against classical networks (AlexNet, the VGG series, and the ResNet series) and lightweight networks (DenseNet, MobileNetV2, and ShuffleNetV2), all deep neural network models with excellent performance in image classification tasks. The comparison results are shown in Table 10.

Table 10 Comparison of experimental results of different network models

As can be seen from Table 10, MDC-RepNet achieves fewer FLOPs, faster inference, and higher accuracy than the VGG16 network; its parameter count is about 5% of the latter's and its model size about 22%. Through structural reparameterization, MDC-RepNet adopts the single-path architecture of a VGG16-style network in the inference stage, making inference about twice as fast as the latter. During training, MDC-RepNet uses a multi-branch backbone and discards the fully connected layers of the VGG16-style network; its FLOPs are about 10% of the latter's, dramatically reducing model complexity.

Compared with the multi-branch ResNet series, MDC-RepNet shows clear advantages in FLOPs, inference speed, parameter count, and model size, for example over ResNet50. MDC-RepNet borrows the residual module from the ResNet series, which makes deeper training possible while, for a similar recognition effect, reducing model complexity with fewer parameters and faster inference.

MobileNet and ShuffleNet are architectures designed for devices with limited computing resources; they significantly reduce computational overhead while retaining a certain level of accuracy, markedly increasing inference speed and greatly reducing FLOPs, parameter count, and model size. MDC-RepNet, by contrast, is a high-efficiency model designed for dedicated hardware or GPUs, pursuing faster inference and lower memory usage while paying less attention to parameter count and FLOPs. MDC-RepNet improves recognition accuracy by 1-2% over MobileNetV2 and ShuffleNetV2, though its inference speed is lower and its parameter count and FLOPs higher. Notably, compared with MobileNetV2, MDC-RepNet's FLOPs are five times higher, yet its inference speed is 71% of the latter's; FLOPs therefore do not entirely determine inference speed, indicating that MDC-RepNet's computational density is higher.

To observe the combined classification accuracy and detection time more intuitively, the accuracy and detection time of each model are plotted in Fig. 10. The figure shows that MDC-RepNet has the highest accuracy and takes only 219 ms to predict an unknown sample, achieving a good balance between accuracy and latency. In summary, MDC-RepNet has high accuracy, a low parameter count, and high computational density. This is because MDC-RepNet efficiently captures both local and global information: the structural reparameterization module obtains lower memory access costs and higher efficiency by eliminating skip connections in the network, while techniques such as large kernel convolution improve accuracy. This ultimately reduces the number of parameters and floating-point operations, improving the model's speed.

To further analyze the classification performance of the models, Fig. 11 plots the per-family classification details of each model. The results show that MDC-RepNet, through train-time over-parametrization and large kernel convolution techniques, improves to varying degrees the insufficient classification accuracy of the above classical deep neural networks on some easily confused malicious families, thereby improving overall classification accuracy and accurately classifying the 25 families of the Malimg dataset. The experiments prove that the proposed model's malicious code classification performance outperforms the other models.

Fig. 10 Comparison of classification accuracy and detection time of each model

Fig. 11 Classification details of each model in malicious families

3.3.5 Comparison with Other Malicious Code Classification Techniques

To further validate the model's malicious code detection capability, this section compares MDC-RepNet with existing visualization-based malicious code detection methods on the Malimg dataset. Table 11 summarizes the performance metrics of each method. MDC-RepNet achieves an accuracy of 99.57% with a prediction time of 67 ms, much lower than that of existing state-of-the-art research, and it achieves better results in accuracy, precision, and other evaluation metrics than all the compared techniques.

Table 11 Performance metrics of each malicious code classification method

The reason lies in the properties of convolution. This paper applies pointwise convolution after depthwise convolution, uses large kernel convolution instead of self-attention to improve early-stage model performance, and reparameterizes the skip connections for inference, eliminating the time overhead of the extra branches in the inference stage and improving classification accuracy without adding inference-time computational burden. The parameters learned in the training phase can thus be used equivalently at inference, yielding lower memory access costs and higher efficiency and achieving an accuracy-latency balance.

4 Conclusion

This paper proposes a malicious code detection method that combines CNNs and Transformers. The method takes a deep neural network as the framework, adopts a modular design, and introduces a new token-mixing operator whose structure can be reparameterized, reducing memory access cost by eliminating skip connections in the network. Meanwhile, train-time over-parameterization and large kernel convolution techniques are adopted to improve accuracy. In the data preprocessing stage, a pixel-filling-based image size normalization algorithm and data augmentation techniques are used to mitigate, respectively, the loss of image texture information during malicious code image rescaling and the dataset category imbalance problem, enhancing the expression of critical features and alleviating model overfitting. Finally, the deep neural network model is trained to classify malicious code and its variants. Experiments prove that the method achieves a stable improvement in accuracy and efficiency, outperforming current malicious code detection technology.

In future research, we will further investigate, design, and implement improvements to enhance the performance and applicability of MDC-RepNet. We will try to address its current limitations regarding family confusion and continue to explore new methods and techniques, such as color deconvolution [59], dynamic multi-scale topology [60], and linguistic sequence-based evaluation methods [61, 62], combining the latest deep learning research results [22, 24] to advance research in malware classification.