1 Introduction

Since 2020, COVID-19 has spread throughout the world, seriously affecting public health and the normal development of society, and causing unprecedented disaster and panic. In cyberspace, computer viruses and other malicious software likewise try time and again to break through computer defenses, causing damage to the computer world.

Information security departments spend a lot of money every year to maintain network security, and researchers have been applying increasingly advanced technology to information security problems, achieving many results that effectively curb the harm caused by malicious network attacks. However, owing to the nature of information technology, malware developers actively search for system vulnerabilities to find a breakthrough, whereas security departments often study measures to identify and defend against malware only after a system has been compromised and users have suffered losses. Current security measures are therefore often in a state of “hindsight” and passive defense. How to predict the characteristic information of malware, detect malware in advance, and nip its spread in the bud has become an important topic of information security research.

Most defense systems still use static analysis, which is mainly based on signature matching, as their primary measure [1]. Some defense systems have added dynamic analysis, including monitoring of sensitive behaviors, access to critical privileges, network analysis, and key process monitoring, as an auxiliary method [2, 3]. However, these methods mainly focus on specific malware or malware classes, so they are limited when defending against new types or variants of malware. They are also vulnerable to anti-detection techniques, which allow disguised malware to deceive the detectors and cause damage. These limitations indicate that a new detection method is needed.

In recent years, deep learning technology has shown an unprecedented degree of intelligence. Through the analysis of a large number of historical data, it can find the rules and predict the unknown samples. It shows strong adaptability, prediction ability and intelligent level, and has achieved good results in computer vision, natural language processing, speech recognition and other fields. At the same time, it provides an effective means for the detection and prevention of malware, which can not only accurately detect malware, but also prospectively predict the maliciousness of unknown software.

To address the problems of traditional static and dynamic detection methods, this paper proposes a novel malware detection approach based on application programming interface (API) call sequences and a deep learning algorithm. First, the API call relations are extracted, and a directed cyclic graph is constructed based on a Markov chain. Then, a graph convolutional network (GCN) is used to detect malware, and a performance analysis and comparison are carried out. The main contributions of our work are as follows:

  • We present a new method to extract features with a three-dimensional structure from samples. First, we build a Markov-chain-based weight model of malware from the API call information of a large number of known malware samples. We then map each sample to be detected onto the weight model to extract its features. The model combines the features of the test sample with the general features of malware, so that the newly generated features both preserve the characteristics of the sample itself and gain generality, which helps to resist disguised malware and malware variants.

  • A malicious code detection method based on GCN is proposed. Taking the characteristic graph of malicious code as the input, the graph convolution neural network is trained and tested, and the malware detection model based on GCN is established. The model takes advantage of the graph as input of GCN to improve the adaptability of the model in detecting malicious code.

The remainder of the paper is organized as follows. Related work is reviewed in Sect. 2. The system framework, workflow and detection method are presented in Sect. 3. The dataset and experimental environment are introduced in Sect. 4. The details of the experiments and the results are given in Sect. 5, and the whole work is discussed and concluded in Sect. 6.

2 Related work

2.1 Development trend of malware

Malicious threats have accompanied the development of computers since their birth. As early as 1949, John von Neumann’s paper “theory and organization of complex Automata” [4] discussed how computer programs could achieve self-replication, which can be regarded as the germination of malicious code. In 1970, Bob Thomas, a developer at BBN Technologies, created a program called Creeper, which could replicate itself and spread through the ARPANET. In 1983, Fred Cohen wrote the first program recognized as a computer virus, which could replicate itself and spread [5]. Since then, computer attack and defense have gradually developed into a huge industry of ever-increasing scale. Malicious code refers to software with malicious intent, such as computer viruses, worms, spyware, browser hijackers, adware and tracking software, which can control and damage the user’s computer, data and network, and harm the user’s interests. Nowadays, with the continuous improvement of malicious code detection technology and the development of artificial intelligence, malicious code is becoming more and more technically advanced. Malware authors try every means to evade detection software by using evasion strategies. These strategies are smarter and stealthier than many conventional anti-malware systems can handle [6, 7]. Generally speaking, the development trend of malicious code is as follows: (1) the forms of attack are diverse, and the complexity of threat capability is increasing. (2) Malicious code attacks tend to be intelligent. (3) Malicious code has been fully industrialized. (4) Malicious code attacks are organized. (5) The ability of anti-detection has improved significantly.

2.2 Traditional detection methods for malware

In 1987, Fred Cohen put forward the concept of the computer virus and the basic theory of computer virus detection and defense. He established the basis of program behavior detection and proposed a series of defense schemes, opening the road of malware defense research. In recent decades, researchers have explored a series of methods and techniques to detect malware. Malware detection mainly consists of detecting the characteristic code of malware, and mainly includes anomaly-based detection and signature-based detection.

Anomaly-based detection judges whether a program is malicious by detecting the difference between its behavior and that of normal programs. Generally, the behavior trajectory of malware differs from that of normal software. Once the behavior of normal programs is fully understood, a set of standards and specifications is formed. If the behavior trajectory of the program under test is abnormal and violates these specifications, the program can be classified as malware. There are three approaches to anomaly-based detection: static detection, dynamic detection and hybrid detection.

Sundarkumar et al. proposed a model based on API call sequence types, which uses text mining and topic modeling to detect malware [8]. Based on their analysis, they suggest using a decision tree to design a malware detection expert system. Wu et al. used data-flow-related application program interfaces (APIs) as classification features and adopted an improved k-nearest neighbor classification model to detect Android malicious code. Through machine learning, the list of data-flow-related APIs is further optimized, and the efficiency of sensitive data transmission analysis is significantly improved [9].

2.3 AI-based malware detection technology

Traditional methods play a very important role in malicious code detection and have made some achievements [10]. However, malicious code writers often use various means to evade traditional detection methods, or develop new types or variants of malicious code, and in these cases the accuracy of traditional detection models is greatly reduced. With the continuous development of machine learning, malware detection based on machine learning models has also developed and achieved some successful results.

Schultz et al. introduced machine learning based on static features to detect unknown malware, using portable executable (PE) information, byte n-grams and strings for feature extraction [11]. Elovici et al. used PE features and the Fisher score (FS) method for feature selection, and used artificial neural networks (ANN), Bayesian networks (BN), decision trees (DT) and other methods to detect malicious software, with an accuracy of 95.8%. Moskovitch et al. used a filter method for feature selection [12]. They used gain ratio (GR) and Fisher score for feature selection, and used ANN, DT, naive Bayes (NB) and support vector machine (SVM) classifiers to detect malware, with an accuracy of 94.9%. They also proposed a method that uses n-gram opcodes as features, uses document frequency (DF), GR and FS for feature selection, and uses ANN, DT, NB and other classification algorithms, maintaining a lower false alarm rate in cases where ANN, DT and boosted DT (BDT) performed poorly [13].

Santos et al. proposed supervised learning to detect unknown malware [14]. They used the information gain method for feature selection and compared different classifiers, such as DT, k-nearest neighbor (KNN), BN and SVM, among which SVM showed good accuracy. Firdausi et al. [15] designed a malware detection technique using five classifiers: KNN, NB, J48 DT, SVM and MLP. The experimental results show that J48 DT achieves the best overall performance, with a recall of 95.9%, a false alarm rate of 2.4%, a precision of 97.3% and an accuracy of 96.8%. In short, their proof of concept shows that automatic behavior-based malware analysis combined with machine learning can detect malware very effectively.

Rieck et al. proposed a framework for automatically analyzing malware behavior using machine learning [16]. The framework can automatically identify new malware categories with similar behaviors (clustering) and assign unknown malware to these discovered categories (classification). Based on clustering and classification, an incremental method for behavior analysis was proposed, which can process the behavior of thousands of malware binaries every day.

To make it easier for researchers to study malicious code detection with machine learning models, Anderson et al. [17] provided EMBER, a malicious code benchmark dataset for machine learning, and demonstrated a use case with a baseline gradient-boosted decision tree model trained using LightGBM with default settings. The authors also proposed a general framework based on reinforcement learning (RL) for attacking static PE anti-malware engines. By training against the anti-malware engine, the framework learns which sequences of operations are likely to let a given malware sample evade the detector, thus generating malware samples that can avoid detection, which provides a reference for the design of more advanced detection software. Sharma et al. proposed a method based on the occurrence of opcodes to improve the accuracy of detecting unknown advanced malware. This method uses the Fisher score for feature selection and five machine learning classifiers to detect unknown malware. Among them, random forest, LMT, J48 DT and NB reached \(100\%\) accuracy [18].

Naval et al. [19] proposed a new model based on an improved API sequence. The method was tested through a series of experiments, and the results were compared with existing malicious code detectors, which demonstrated the effectiveness of the method.

Tang et al. [20] proposed a new malicious code detection method based on the API call sequence. The API sequence is first extracted through dynamic analysis and then converted into a feature image that represents the behavior of the malicious code. Finally, a convolutional neural network (CNN) classifies the malicious code into nine malicious code families. The results showed that the TPR exceeded \(99\%\).

Li et al. proposed SigPID, a malicious code detection system based on permission usage analysis. It uses machine learning-based classification methods to classify different families of malware and benign applications. Experimental results show that it can achieve an accuracy of more than \(96\%\).

Raff et al. [21] used neural networks to detect malicious code at the level of the entire executable file. This solution avoids many of the problems of the more common byte n-gram method, and it achieves consistent generalization on both test sets.

Alzaylaee et al. [22] proposed a malicious code detection system based on deep learning, DL-Droid, to dynamically detect malicious Android applications, using dynamic features to achieve a detection rate of \(97.8\%\).

As a new type of neural network architecture, the graph neural network (GNN) has been widely used in many industries in recent years [23], but there is still little research on its use for malware detection. Because a GNN takes graphs as input, which matches the API call sequence graph studied in this paper, we use a special type of GNN, the graph convolutional network (GCN).

3 Malware detection algorithm based on graph convolutional network

In this section, we introduce the data preprocessing method and the GCN-based detector proposed in this paper. We first introduce the framework of the system and then its workflow.

3.1 System framework and workflow

The system mainly includes four steps: extraction of the API call sequence, generation of the directed cyclic graph, the Markov process, and classification. The malware detection system framework is shown in Fig. 1.

Fig. 1 GCN-based malware detection system framework

The system workflow is shown in Fig. 2. The system process is shown on the left of the figure, the key work is highlighted on the right, and the corresponding steps are linked by blue curves. The sample is first executed in the sandbox, and the APIs called by the sample are extracted. Each API is then used as a vertex of the graph, and the number of times one API calls another is used as the weight of the corresponding directed edge, thereby establishing a directed cyclic graph. Finally, the feature map is extracted based on the Markov chain, and the GCN classifier is used for classification.

Fig. 2 GCN-based malware detection system workflow

3.2 Algorithm for generating directed cyclic graph of API based on Markov chain

A directed acyclic graph (DAG) is a directed graph with a finite set of vertices S and a set of directed edges E, in which each edge e points from a vertex \(s_i\) to another vertex \(s_j\ (s_i, s_j \in S)\) and no vertex can be reached from itself by following a sequence of directed edges [24]. A DAG can represent a set of state transitions that satisfies the Markov property, which states that the probability of transitioning from one state to another depends only on the current state.

The DAG is widely used in malicious code classification based on API call sequences, but it is not suitable for API sequences that contain cyclic calls, for which a directed cyclic graph (DCG) is needed.
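To illustrate why cyclic calls rule out a DAG, the following minimal sketch checks whether the call graph induced by an API trace contains a cycle. The helper `has_cycle` and the API names in the usage lines are illustrative assumptions, and treating consecutive calls in a trace as edges is one possible reading of "API call relation".

```python
from collections import defaultdict


def has_cycle(call_sequence):
    """Return True if the graph induced by consecutive calls contains a cycle."""
    # Assumption: an edge links each API to the API that follows it in the trace.
    successors = defaultdict(set)
    for src, dst in zip(call_sequence, call_sequence[1:]):
        successors[src].add(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in successors[node]:
            if color[nxt] == GRAY:  # back edge, which also covers self-loops
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in set(call_sequence))


# Illustrative traces: the first repeats NtReadFile, which creates a self-loop.
looping = has_cycle(["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose"])
straight = has_cycle(["NtOpenFile", "NtReadFile", "NtClose"])
```

A trace that reads a file in a loop already produces a self-loop, so a graph built from it cannot be acyclic.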

A DCG is similar to a DAG but allows cycles, including loops from a vertex to itself. In our work, the DCG is generated from API call sequences. The graph consists of vertices s that represent APIs and edges e that represent invocations, and the edges are weighted according to the invocation chains of the samples. The structure of the DCG is shown in Fig. 3, where the vertices (APIs) and the relations among them (invocations) are presented concisely. For an edge \(e_{i,j}\) in the graph G, the weight is the number of invocations from API \(s_i\) to \(s_j\); it is labeled \(n_{ij}\) in the figure.

Fig. 3 Structure of DCG

Table 1 Adjacency matrix of the DCG in Fig. 3

For the convenience of model processing, a DCG is represented as an adjacency matrix, which is given in Table 1. In the matrix, each row corresponds to a start vertex, namely the API that makes the call, and each column corresponds to an end vertex, i.e., the API that is called. Each cell stores the weight of the corresponding edge, which is the number of times the call occurs [25].
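As a minimal sketch of this representation, the following code numbers the APIs of a trace and counts how often each API is immediately followed by another. The function name `build_adjacency` and the example trace are hypothetical, and ordering the APIs by first appearance is an assumption made only for illustration.

```python
def build_adjacency(call_sequence):
    """Count transitions between consecutive API calls.

    Returns (apis, matrix), where matrix[i][j] is the number of times
    apis[i] is immediately followed by apis[j] in the trace.
    """
    # Number the APIs in order of first appearance (an illustrative choice).
    index = {}
    for api in call_sequence:
        index.setdefault(api, len(index))
    apis = list(index)

    matrix = [[0] * len(apis) for _ in apis]
    for src, dst in zip(call_sequence, call_sequence[1:]):
        matrix[index[src]][index[dst]] += 1  # row = caller, column = callee
    return apis, matrix


# Hypothetical trace used only to exercise the function.
trace = ["NtOpenFile", "NtReadFile", "NtReadFile", "NtClose", "NtOpenFile"]
apis, M = build_adjacency(trace)
```

For this trace the matrix has one row and one column per distinct API, and the repeated `NtReadFile` call appears as a nonzero diagonal entry.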

When the number of API calls is large, feature extraction becomes difficult. Fortunately, API calls often satisfy the Markov property. To extract the features of API calls effectively, we use a Markov chain to extract features and simplify the model, which facilitates classification.

A Markov chain is a memoryless stochastic process that satisfies the Markov property. Because a Markov chain can simplify the features of sequences, it is widely used in the classification and processing of sequential data, especially in dynamic detection [26, 27].

To calculate the weights, a malware dataset is first used to generate a Markov chain. The dataset must be rich enough to represent the general characteristics of malware. The resulting chain is called the “weighting graph” and is denoted \(M_w\). Suppose the number of APIs is m; the weight \(w_{i,j}\) of edge \(e_{i,j}\) in the weighting graph can then be computed as follows:

$$\begin{aligned} w_{i,j} = \frac{n_{i,j}}{\sum _{k=1}^{m}n_{i,k}} \end{aligned}$$
(1)

where \(n_{i,j}\) is the number of calls from API \(s_i\) to \(s_j\), and \(\sum _{k=1}^{m}n_{i,k}\) is the total number of calls made from \(s_i\). In a Markov chain, each weight is the probability of the corresponding transition. Hence, the weights of all edges leaving an API \(s_i\) must sum to 1:

$$\begin{aligned} \sum _{k=1}^{m}w_{i,k}=1 \end{aligned}$$
(2)
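Equations (1) and (2) amount to normalizing each row of the count matrix. The sketch below shows this under one assumption that the text does not state: an API with no outgoing calls, for which the denominator of Eq. (1) is zero, is given an all-zero row. The function name and the counts are illustrative.

```python
def transition_weights(counts):
    """Row-normalise a count matrix n into Markov weights w (Eq. 1)."""
    weights = []
    for row in counts:
        total = sum(row)
        # Assumption: a row without outgoing calls stays all zero,
        # because Eq. 1 is undefined when the denominator is zero.
        weights.append([n / total if total else 0.0 for n in row])
    return weights


# Illustrative counts for three APIs; the third never calls anything.
counts = [[0, 3, 1], [2, 2, 0], [0, 0, 0]]
w = transition_weights(counts)
```

Every row that has at least one outgoing call sums to 1, as required by Eq. (2).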

Fig. 4 Merge process of DCG and weighting graph

In the weighting phase, the final weight must reflect both the sensitivity of the invocations and the characteristics of the sample itself. The final graph is therefore produced by merging the adjacency matrix M, which is generated from the sample, with the weighting graph \(M_w\). The merge process is shown in Fig. 4. The merged graph retains only the edges (invocations) that exist in both M and \(M_w\), and the final weight \(W_{i,j}\) of the invocation from \(s_i\) to \(s_j\) is calculated as follows:

$$\begin{aligned} W_{i,j}=w_{i,j} \cdot n_{i,j} \end{aligned}$$
(3)

where \(n_{i,j}\) is the value of cell (i, j) in the adjacency matrix M, namely the number of times the invocation \(e_{i,j}\) occurred in the sample. The final output for detection is a matrix representing the merged graph, which is shown in Table 2.
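The merge in Eq. (3) can be sketched as follows. Storing both graphs as dictionaries keyed by (caller, callee) pairs is an assumption made for brevity, and `merge_graphs` and the example values are hypothetical.

```python
def merge_graphs(sample_counts, weighting):
    """Combine sample call counts n with weighting-graph weights w (Eq. 3).

    Both arguments map (caller, callee) pairs to a number. Only edges that
    occur in both graphs are kept.
    """
    merged = {}
    for edge, n in sample_counts.items():
        w = weighting.get(edge, 0.0)
        if n > 0 and w > 0:
            merged[edge] = w * n  # W_ij = w_ij * n_ij
    return merged


# Illustrative inputs: one shared edge, one edge unique to each graph.
sample_counts = {("NtOpenFile", "NtReadFile"): 4, ("NtReadFile", "NtClose"): 1}
weighting = {("NtOpenFile", "NtReadFile"): 0.25, ("NtOpenFile", "NtClose"): 0.75}
merged = merge_graphs(sample_counts, weighting)
```

Edges that appear in only one of the two graphs are dropped, so the merged graph keeps only behavior that the sample shares with the weighting graph.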

Table 2 Adjacency matrix of the merged graph

3.3 GCN-based malware classification algorithm

Because the structure of each generated graph is different and the dimension of its adjacency matrix also differs, a traditional neural network would require the dimensions to be unified, which is tedious. Fortunately, a graph convolutional network can handle graphs of arbitrary structure, so in this paper we use a graph convolutional network to classify the generated graphs.

The classifier module in Fig. 1 shows a typical graph convolutional network architecture. It has C input channels and F output feature maps, and can contain multiple hidden layers [28]. A graph is loaded into the GCN, and after passing through several GCN layers the feature of each node changes from S to Z, after which the classification is completed.

In our experiment, we design a two-layer semi-supervised classification GCN based on the weighted feature matrix R generated. In the forward phase, the calculation is as follows:

$$\begin{aligned} \hat{R}=\tilde{D}^{-\frac{1}{2}}\tilde{R}\tilde{D}^{-\frac{1}{2}} \end{aligned}$$
(4)

Here, \(\tilde{R}=R+I\), I is the identity matrix, and \(\tilde{D}\) is the degree matrix of \(\tilde{R}\). With X denoting the input node feature matrix, the forward model is as follows:

$$\begin{aligned} Z=f(X,R)=\mathrm{softmax}(\hat{R}\mathrm{Relu}(\hat{R}XW^{(0)})W^{(1)}) \end{aligned}$$
(5)

\(W^{(0)}\) is the input-to-hidden weight matrix, \(W^{(1)}\) is the hidden-to-output weight matrix, and the Relu function is defined as \(\mathrm{Relu}(x)= \mathrm{max}(0, x)\). Since there are only two classes, our softmax function is defined as follows:

$$\begin{aligned} \mathrm{softmax}(p_i)=\frac{e^{p_i}}{e^{p_1}+e^{p_2}} \end{aligned}$$
(6)
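The forward pass in Eqs. (4) to (6) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the degree of a node is taken as the row sum of \(\tilde{R}\), the feature matrix X is an identity matrix, the weight matrices hold arbitrary untrained values, and all function names are hypothetical. The output Z has one row per node; how these rows are combined into a label for the whole sample is not specified in the text and is not shown.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(r):
    """Eq. 4: add self-loops, then scale symmetrically by the degree matrix."""
    size = len(r)
    r_tilde = [[r[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
               for i in range(size)]
    # Assumption: the degree of node i is the i-th row sum of R + I.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in r_tilde]
    return [[inv_sqrt_deg[i] * r_tilde[i][j] * inv_sqrt_deg[j] for j in range(size)]
            for i in range(size)]


def softmax_rows(m):
    """Apply softmax to every row (Eq. 6 when each row has two entries)."""
    out = []
    for row in m:
        peak = max(row)  # subtracting the maximum avoids overflow in exp()
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(x, r, w0, w1):
    """Eq. 5: Z = softmax(R_hat ReLU(R_hat X W0) W1)."""
    r_hat = normalize_adjacency(r)
    hidden = [[max(0.0, v) for v in row] for row in matmul(matmul(r_hat, x), w0)]
    return softmax_rows(matmul(matmul(r_hat, hidden), w1))


# Toy two-node graph with arbitrary, untrained weights.
R = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, -1.0], [0.5, 0.5]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
Z = gcn_forward(X, R, W0, W1)
```

Because the merged graphs have non-negative weights, every row sum of \(\tilde{R}\) is at least 1, so the inverse square root of the degree is always defined in this sketch.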

Here, the two output probabilities sum to 1. In the backpropagation stage, we use the cross-entropy loss function:

$$\begin{aligned} \mathcal {L}=-\sum _{i=1}^N \left[ y_1^{(i)}\log (p_1^{(i)})+y_2^{(i)}\log (p_2^{(i)})\right] ,\quad (i=1,2,\ldots ,N) \end{aligned}$$
(7)

Here, N is the number of samples, y is the label, p is the predicted probability, \(y_2^{(i)}=1-y_1^{(i)}\), and \(p_2^{(i)}=1-p_1^{(i)}\).
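As a small worked illustration of Eq. (7), the sketch below evaluates the two-class cross-entropy for a handful of predictions. The function name, the clipping constant `eps` and the example values are assumptions added for illustration; clipping is a common numerical safeguard and is not mentioned in the text.

```python
import math


def cross_entropy(labels, probs, eps=1e-12):
    """Eq. 7 for two classes.

    labels[i] is y_1 for sample i (1 for the first class, 0 otherwise);
    probs[i] is the predicted probability p_1 for sample i.
    """
    loss = 0.0
    for y1, p1 in zip(labels, probs):
        p1 = min(max(p1, eps), 1.0 - eps)  # keep log() finite
        # y_2 = 1 - y_1 and p_2 = 1 - p_1, as defined for Eq. 7.
        loss -= y1 * math.log(p1) + (1 - y1) * math.log(1.0 - p1)
    return loss


# Hypothetical labels and predictions, not values from the experiments.
loss = cross_entropy([1, 0, 1], [0.9, 0.2, 0.5])
```

The loss shrinks toward zero as the predicted probability of the true class approaches 1 for every sample.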

4 Dataset and experiment design

In this work, we collected a dataset consisting of 13,624 samples, of which 6686 are malware and 6938 are benign. The detailed statistics are given in Table 3. The malicious samples came from VirusTotal and VirusShare, and the benign samples came from system programs as well as the Internet. The benign dataset was split evenly into five parts to match the number of malware samples for each year. In our experiments, \(80\%\) of each dataset was used for training and \(20\%\) for evaluation. For the weighting model, an API log dataset was adopted for training. This dataset included 62,307 malware samples that were obtained from VirusShare and selected randomly, so that the generality of the weighting model could be guaranteed.

Table 3 Datasets for evaluation

The experiments were performed on a workstation running Ubuntu 18.04. To monitor and extract the call sequences of each sample, a Cuckoo sandbox was deployed on the workstation as the running environment of the samples. In the weighting and graph generation phase, the extracted call sequences were first numbered. Then, the sequences were transformed into the DCG. Each index of the DCG corresponded to an API, and the value in each cell was the number of times that API was invoked by the previous API. After the DCG was generated, it was merged with the weighting graph, so that a weighted DCG with a unique value for each edge was created.

In the weighting phase, firstly, the weighting graph was trained from the dataset. The initial weighting graph had 1609 rows and columns after the training. Then, the weighting graph was used to generate merged graphs for detection.

5 Experiments and results analysis

5.1 Detection and evaluation methods

Before analyzing the results, several common detection and evaluation methods are introduced.

Accuracy is a standard metric that measures the overall correctness of prediction. Precision is the proportion of samples predicted as positive that are really positive. Recall is the proportion of positive samples that are predicted as positive. F1-Score is the harmonic mean of precision and recall and serves as a comprehensive measure of the classification model. These indexes are computed as follows:

$$\begin{aligned}&\mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \end{aligned}$$
(8)
$$\begin{aligned}&\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{aligned}$$
(9)
$$\begin{aligned}&\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$
(10)
$$\begin{aligned}&\mathrm{F1}{\text{-}}{\rm {Score}} = \frac{2\times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \end{aligned}$$
(11)

where TP is True Positive, which refers to the number of positive samples that are predicted as positive. FN is False Negative, namely the number of positive samples that are predicted as negative. FP is False Positive, which means the number of negative samples that are predicted as positive. TN is True Negative, which is the number of negative samples that are predicted as negative. In our work, malicious samples were labeled as positive, and benign samples were labeled as negative.
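Equations (8) to (11) can be computed directly from the four confusion-matrix counts, as in the minimal sketch below. The function name and the counts are hypothetical and are not results from the experiments; returning 0.0 when a denominator is zero is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. 8-11 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: malware is the positive class, benign the negative class.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With malicious samples as the positive class, recall corresponds to the true positive rate (TPR) reported in the comparison with other methods.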

5.2 Classification results and analysis

After we extracted the API call sequence diagram, we analyzed the sample API call sequence and corresponding weights.

Figure 5 shows scatter plots of four randomly selected groups of samples, where the X-axis is the serial number of the API that invokes others, the Y-axis is the serial number of the API that is invoked, and the Z-axis is the weight of the API invocation. The figure shows that the points of benign samples are more dispersed than those of malicious samples. The weights of API invocations in the benign samples are also more varied than in the malicious samples.

Fig. 5 Scatters of final merged graph

Table 4 shows the evaluation results of the models on datasets from different years. The performance of the models differed distinctly. Most of the machine learning models performed well on most datasets. The overall performance of GCN was relatively good: most of its indicators were ahead of those of the other models, and the CNN model came relatively close, which is related to their classification principles. In addition, prediction accuracy declined to a certain extent for more recent years; we believe this is because the anti-detection ability of the samples has improved. It is particularly important to note that on the 2020 data the detection indicators of GCN are higher than those of the other models, which indicates that the GCN model classifies samples with strong anti-detection capabilities more effectively.

Table 4 Evaluation of the models with different datasets

The results show that, with most of the machine learning and deep learning models, the method performs well in detecting general malware, which supports its effectiveness.

We compared our model with other methods to test its performance. Table 5 presents the comparison between our model and existing malicious code detection methods based on deep learning or machine learning. As shown in Table 5, the TPR of all methods exceeds 99%. In addition, the FPR of our method is lower than that of the other methods, so the proposed method is superior in terms of FPR and accuracy. We attribute the better performance of our model to the fact that, as the malicious code dataset grows, the extracted malicious features become more accurate and the model more robust.

Table 5 Comparison of the detection effect of GCN and existing methods

6 Conclusion

Malware has a long history and seriously threatens the security of computer systems. With the rapid development of anti-detection technology, the capability of traditional detection methods based on static and dynamic analysis is limited. Because neural networks have strong prediction performance, the application of AI technology to malware detection has become a research hotspot. However, the differences among malware make feature extraction difficult, which hinders the application of traditional neural networks. To solve this problem, we exploit the flexibility of GCN input to design a GCN-based malware detector that adapts to the differences among malware. Specifically, we extract the API call sequence from the malicious code and generate a directed cyclic graph, use a Markov chain to extract the features of the graph, and then use GCN for classification. We also evaluated the method against other machine learning algorithms. The results show that the method performs better in most detection tasks, with a highest accuracy of 98.32%. The research suggests that the technology has potential for adaptive detection, although this has not yet been realized. In future work, we will focus on adaptive detection models based on GCN, so that the malware detection system has a stronger adaptive ability and the personnel cost of malware detection is reduced.