Learning and Fusing Multi-View Code Representations for Function Vulnerability Detection

The explosive growth of vulnerabilities poses a significant threat to the security of software systems. While various deep-learning-based vulnerability detection methods have emerged, they primarily rely on semantic features extracted from a single code representation structure, which limits their ability to detect vulnerabilities hidden deep within the code. To address this limitation, we propose S²FVD, short for Sequence and Structure Fusion-based Vulnerability Detector, which fuses vulnerability-indicative features learned from multiple views of the code for more accurate vulnerability detection. Specifically, S²FVD employs either well-matched or carefully extended neural network models to extract vulnerability-indicative semantic features from the token sequence, attributed control flow graph (ACFG), and abstract syntax tree (AST) representations of a function, respectively. These features capture different perspectives of the code and are then fused to enable S²FVD to accurately detect vulnerabilities that are well-hidden within a function. The experiments conducted on two large vulnerability datasets demonstrated the superior performance of S²FVD against state-of-the-art approaches, with its accuracy and F1 scores reaching 98.07% and 98.14%, respectively, in detecting the presence of vulnerabilities, and 97.93% and 97.94%, respectively, in pinpointing specific vulnerability types. Furthermore, on the real-world dataset D2A, S²FVD achieved average performance gains of 6.86% and 14.84% in terms of accuracy and F1 metrics, respectively, over the state-of-the-art baselines. The ablation study also confirms the benefit of fusing the semantics implied in multiple distinct code views to further enhance vulnerability detection performance.


Introduction
In recent times, the incidence of network attacks has witnessed a significant upsurge. These attacks are primarily driven by the ubiquitous presence of software vulnerabilities. To date, over 200,000 such vulnerabilities have been recorded on the Common Vulnerabilities and Exposures (CVE) website [1]. Given the pervasive exploitation of vulnerabilities and the significant security threats they pose, it is critical for developers to proactively detect vulnerabilities in their code, whether written by themselves or reused from open-source software.
However, identifying multi-faceted vulnerabilities requires security-related domain knowledge that goes beyond the expertise of most developers. This presents significant challenges for vulnerability detection. In light of the ever-expanding scale and complexity of modern software systems, it has become increasingly impractical, even for security professionals, to manually detect potential vulnerabilities within millions of lines of code, given the tremendous efforts and time required.
Inspired by its impressive performance in diverse domains such as NLP [2] and program analysis [3][4][5][6], deep learning has also been harnessed to develop a range of approaches [7][8][9] for the detection of vulnerabilities. These approaches utilize labeled training samples and extract semantic-aware features from them to construct classifiers that map the target code snippets onto a class space that indicates the absence or presence of vulnerabilities, or specific vulnerability types. Typically, prevailing deep-learning-based vulnerability detection methods rely on a single code representation structure to identify vulnerabilities, which, however, may fail to comprehensively capture vulnerability-indicative patterns and detect those vulnerabilities that are well-hidden within the code. This is because these vulnerability-indicative patterns may require different perspectives on the code, which are reflected by different code representation structures.
To address the aforementioned limitation, we propose S²FVD, a novel approach that leverages fused semantic vectors learned from three essential code representations: the token sequence, the attributed control flow graph (ACFG), and the abstract syntax tree (AST). These code representations provide distinct perspectives on the code, thereby allowing the model to more comprehensively capture vulnerability-indicative features from the code. The main contributions of this paper are summarized as follows.
• A novel DL-based vulnerability detection method called S²FVD is presented. To accommodate the distinct representations of the code, an adaptive learning model has been devised to capture the multi-faceted aspects of function semantics and fuse them together to ensure the extraction of comprehensive semantic features. This strategy effectively prevents the loss of critical features that are indicative of vulnerability patterns.
• An extended tree-structured neural network called ERvNN has been designed, which can effectively encode the semantics implied in the abstract syntax tree. With a GRU-style aggregation optimization on the tree nodes, it supports the straightforward and efficient encoding of multi-way tree structures, which would otherwise first have to be converted into binary tree form.
• Extensive experiments were conducted to evaluate the performance of S²FVD. The results demonstrated that S²FVD outperformed existing state-of-the-art DL-based methods in terms of accuracy, F1 score, precision, and recall when detecting the presence of vulnerabilities and pinpointing the specific vulnerability types. Moreover, ablation studies confirmed the effectiveness of the devised ERvNN for encoding the AST and of the representation fusion strategy for enhancing the performance of S²FVD.
• A new dataset has been constructed to facilitate vulnerability detection research. The dataset consists of 25,333 C functions, each of which is labeled with either a specific CWE ID indicating a vulnerability or a non-vulnerable ground truth. The source implementation of S²FVD has also been made publicly available at https://github.com/lv-jiajun/S2FVD (accessed on 22 May 2023) to facilitate future benchmarking and comparisons.
The rest of this paper is structured as follows. Section 2 presents a review of closely related works. Section 3 delves into the essential designs of S²FVD by discussing the specific encoding of each distinct raw code view and the fusion strategies. Section 4 details the experimental evaluation, including the experimental setup, the evaluation results, and the observations regarding S²FVD and the comparison methods. Section 5 discusses threats to validity, limitations, and directions for future work. Finally, Section 6 concludes this work.

Related Work
This section discusses the closely related vulnerability detection methods, which broadly fall into three categories: code-similarity-based methods [10,11], static-rule-based methods [12], and learning-based methods [13,14]. In introducing these methods, we focus on the deep-learning-based ones. As this is not a survey paper, other types of vulnerability detection methods, such as those that focus on binary code [15,16], perform dynamic analysis [17], or rely on formal semantic analysis [18,19] (e.g., model checking and symbolic execution), are not covered.

Code-Similarity-Based Methods
Code-similarity-based vulnerability detection relies on the core idea that source code exhibiting high similarity is likely to share vulnerabilities [20,21]. However, while this approach can effectively identify vulnerabilities introduced through code cloning, it suffers from high rates of false negatives when it is used to detect other types of vulnerabilities not resulting from code cloning [22,23].

Rule-Based Methods
The static-rule-based methods involve scanning the target source code using a multitude of meticulously defined vulnerability rules or patterns. Prominent examples of typical static analyzers in this category include Infer [24], CodeChecker [25], and Checkmarx [26]. One of the main issues is that the vulnerability rules defined by human experts are often subjective, making it challenging to consider all possible scenarios that distinguish between vulnerabilities and non-vulnerabilities [27,28]. As a result, this approach may lead to a high rate of false positives and false negatives.

Learning-Based Methods
These methods can be broadly categorized as traditional machine-learning-based or deep-learning-based, depending on whether expert-defined features are required.

Conventional Machine Learning-Based Methods
Early works [29,30] typically utilized traditional machine learning algorithms for training detection models. These models rely on representative features that are engineered by experts such as code complexity metrics, code churns, imports and calls, and developer activities [31]. Nevertheless, these engineered features are often inadequate in indicating the presence of vulnerabilities. Additionally, most existing methods are restricted to in-project vulnerability detection, rather than providing general-purpose solutions.

Deep-Learning-Based Methods
Deep-learning-based methods, on the other hand, leverage the powerful feature learning capabilities of deep neural networks to automatically extract vulnerability patterns or features without requiring manual definition from experts [32]. The majority of deep-learning-based detection research concentrates on sequence-based code representation learning. For example, Russell et al. [7] developed a lexical analyzer to transform C/C++ functions into corresponding token sequences. These sequences were subsequently input into CNN and RNN models for training and then applied to detect code vulnerabilities. Li et al. created VulDeePecker [33], which is a vulnerability detection system based on deep learning. This system generates code gadgets (i.e., sets of control or data-dependent statements) that are lexically analyzed to establish token sequences, which are then fed into neural networks for vulnerability detection purposes. Later, Li et al. proposed SySeVR [8], which is a system framework for detecting vulnerabilities in C/C++ source code. This framework is primarily focused on obtaining code sequences that capture both syntactic and semantic information to achieve vulnerability detection.
Since sequence-based code representation overlooks the syntactic structure and control flow information inherent in source code, some research on code vulnerability detection has resorted to trees or graphs as code representations, as well as employing corresponding neural network models to learn semantic information within the code. For instance, Dam et al. [34] parsed a source code file into an abstract syntax tree and employed the Tree-LSTM model to detect vulnerabilities within files. Zhou et al. [9] introduced Devign, which is a graph neural network (GNN) model that bases its composite code representation on the abstract syntax tree. Devign encodes various data and control dependencies to create a joint graph, which is subsequently input into GNNs to detect source code vulnerability. Li et al. [35] deployed a program dependence graph as the code representation and used a FA-GCN (graph convolution network with feature attention) to classify the graph, thereby achieving the successful detection of code vulnerabilities.
However, the single-representation-based method has difficulty in capturing the complete semantic information in the code, thus leading to higher rates of both false positives and false negatives. To address this issue, multiple distinct code representations are extracted, while adaptive deep neural network models are selected or devised to encode the different aspects of the function semantics. By retrieving the deeply implied semantic features and fusing them organically, more comprehensive vulnerability indicative features are obtained that lead to enhanced code vulnerability detection performance.

The Approach
The structural overview of the proposed approach is illustrated in Figure 1, wherein a function, as opposed to an entire program, serves as the fundamental analysis unit to ensure a moderate detection granularity. S²FVD first extracts and normalizes the token sequence, the ACFG, and the AST of the function as three raw code views by parsing its lexical attributes and syntax. Next, word embedding is performed to derive initial vector representations for the tokens in the token sequence, the nodes in the ACFG, and the nodes in the AST. Subsequently, the vulnerability-indicative semantic features implied in each code view are captured using carefully selected or improved neural networks. Specifically, the token sequence is encoded using DPCNN, the ACFG is encoded using GAT, and the AST is encoded using an improved RvNN that is extended to support multi-way tree structures. The semantic features learned by these models are then fused, with several alternative fusion strategies being considered. Finally, the fused vectors are classified in the classification layer to identify the presence of vulnerabilities or the specific vulnerability type.

Semantic Encoding of Token Sequence
This section details the process of extracting semantic features from the token sequence. It covers the extraction and normalization of the token sequence from a function, as well as the specific neural network model employed to capture its semantics.

Token Sequence Preparation
The token sequence is processed in a manner similar to that performed in natural language processing. Such processing takes into account the natural order of the source code and reflects the programming logic embodied within the code to a significant degree. Figure 2 presents the process of transforming a function into a token sequence, which entails the following operations. (1) Comment removal: Comments, being unrelated to code vulnerability, are removed from the function. (2) Code normalization: this involves screening self-defined variable and function names and replacing them with uniform names to remove semantically irrelevant information. Variables and function names within a function are mapped to corresponding symbol names in the order of their occurrence. For example, "VAR1" and "VAR2" represent different variables within the same function, and "FUN1" and "FUN2" represent different function names within the same program. Literals in strings are also removed, leaving only quotation marks. (3) Finally, lexing is performed to convert the pre-processed function into an ordered sequence of tokens, such as identifiers, keywords, operators, and symbols.
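The three preparation steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the regular expressions, the keyword subset, the `LIBRARY_FUNCS` whitelist, and the rule "an identifier followed by `(` is a function name" are all assumptions made for the example, and the comment stripping is naive (it does not handle comment markers inside string literals).

```python
import re

# Subset of C keywords; a complete normalizer would list all of them.
C_KEYWORDS = {"char", "const", "else", "for", "if", "int", "return",
              "sizeof", "static", "struct", "unsigned", "void", "while"}
# Hypothetical whitelist of library calls that keep their real names.
LIBRARY_FUNCS = {"strcpy", "memcpy", "malloc", "free", "printf"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d+(?:\.\d+)?|==|!=|<=|>=|->|\+\+|--|&&|\|\||[^\s\w]")

def strip_comments(code):
    """Step 1: remove block and line comments (naive regex version)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    return re.sub(r"//[^\n]*", " ", code)

def normalize(code):
    """Steps 2 and 3: rename user identifiers, empty string literals, lex."""
    code = strip_comments(code)
    code = re.sub(r'"(?:\\.|[^"\\])*"', '""', code)  # keep only the quotes
    tokens = TOKEN_RE.findall(code)
    var_map, fun_map, out = {}, {}, []
    for i, tok in enumerate(tokens):
        if (not IDENT_RE.fullmatch(tok) or tok in C_KEYWORDS
                or tok in LIBRARY_FUNCS):
            out.append(tok)
        elif i + 1 < len(tokens) and tokens[i + 1] == "(":
            # Assumed heuristic: identifier followed by "(" is a function.
            out.append(fun_map.setdefault(tok, f"FUN{len(fun_map) + 1}"))
        else:
            out.append(var_map.setdefault(tok, f"VAR{len(var_map) + 1}"))
    return out

SAMPLE = '''
void copy(char *src) { /* unsafe copy */
    char buf[10];
    strcpy(buf, src); // may overflow
    log_msg("done");
}
'''

print(" ".join(normalize(SAMPLE)))
```

Names are numbered in order of first occurrence, so the same variable always maps to the same symbol within one function, which is the property the normalization step relies on.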

Sequence Encoding Network
The DPCNN [36] is well-known for its capability of capturing the long-range associations in sequences by augmenting the network depth. Given the fact that the token sequence of a function can be lengthy, the DPCNN model was adopted for extracting vulnerability-indicative features from the token sequence.
To detect whether a function is vulnerable using a learning model, its token sequence must first be converted into numerical vectors that the subsequent classifiers can process. To this end, we employed the word2vec algorithm [37], a popular choice for producing high-quality token embeddings, to convert the tokens into vectors, which are then fed into the subsequent models for learning [38]. Figure 3 depicts the feature extraction of the token sequence through the DPCNN. On the basis of the token embeddings obtained, each token sequence is initially converted into a feature matrix A:
$$A = [e_1, e_2, \ldots, e_l]^{\top} \in \mathbb{R}^{l \times d}$$
where $e_i \in \mathbb{R}^{d}$ is the embedding of the ith token in the input sequence, l is the length of the token sequence, and d is the dimension of the token embedding. In the implementation, the embedding dimension d and the sequence length l are set to 100 and 500, respectively. Subsequently, the region embedding operation is executed, where the token embedding matrix undergoes convolution processing using m filters of size n × d. This generates a region embedding capable of spanning multiple tokens. In the network implementation, n was set to 3 and m was set to 250. Two convolutional layers are then grouped into a convolutional block that conducts equal-length convolution operations, with the number of convolution filters and the size of the convolution kernels being fixed at 250 and 3, respectively. Following each convolutional block, a max pooling operation is performed with a stride of two to compress the internal representation of each function by half, thereby reducing the computational time for the subsequent convolution computations.
In addition, when initializing the DPCNN model, the initial weight values of each layer are typically small, which can impede the propagation of gradients. To address this issue, shortcut connections [39] were utilized, where the output obtained after the region embedding is added directly to the output obtained after the two equal-length convolution operations. The aggregated result is then passed as input to the subsequent layer of the network. Such connections help mitigate the impacts of small initial weights on each layer and prevent gradient vanishing. Formally, the shortcut optimization is defined as:
$$f(x) = x + y$$
where x denotes the output of the region embedding, y denotes the output derived from the two equal-length convolutions, and f(x) is the input for the subsequent network layer. Finally, the equal-length convolution and pooling operations are performed repeatedly to yield a single vector, which is then fed into a fully connected layer to obtain, after a linear transformation, the feature vector $V_{token}$ of the corresponding token sequence.
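To make the shortcut and the repeated halving concrete, the following pure-Python sketch shows the element-wise shortcut addition, a stride-2 max pooling over sequence positions, and the number of pooling steps needed to reduce a sequence to a single position. It is illustrative only: the convolutions themselves are omitted, the pooling window of 3 is an assumption borrowed from the usual DPCNN setup (the text above only fixes the stride), and the function names are hypothetical.

```python
def shortcut(x, y):
    """Add the region-embedding output x to the convolution output y."""
    return [[a + b for a, b in zip(x_row, y_row)] for x_row, y_row in zip(x, y)]

def max_pool_stride2(seq, size=3):
    """Max pooling over positions with stride 2.

    `seq` is a list of feature vectors (one per position). The window
    size of 3 is an assumption; the last window is simply truncated.
    """
    pooled = []
    for start in range(0, len(seq), 2):
        window = seq[start:start + size]
        pooled.append([max(column) for column in zip(*window)])
    return pooled

def pooling_steps(length):
    """Count stride-2 poolings until one position remains."""
    steps = 0
    while length > 1:
        length = (length + 1) // 2  # each pooling halves the length (rounded up)
        steps += 1
    return steps

x = [[1, 2], [3, 4]]
y = [[10, 20], [30, 40]]
print(shortcut(x, y))
print(max_pool_stride2([[1, 5], [4, 2], [3, 3], [0, 9]]))
print(pooling_steps(500))
```

Under these assumptions, a 500-token input is halved nine times (500, 250, 125, 63, 32, 16, 8, 4, 2, 1) before a single vector remains, which is why the cost of the deeper convolution blocks stays small.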

Semantic Encoding of ACFG
This section outlines the process of extracting semantic features from the ACFG. It includes the ACFG extraction process, as well as the graph neural network model used to encode its semantics.

ACFG Preparation
The control flow graph (CFG) is a commonly used code representation structure in the field of program analysis that implies semantic information regarding control dependencies between code elements. Additionally, the program statements within the control flow nodes are abstracted to assign attribute information to the nodes, thus enabling the construction of an attributed control flow graph (ACFG). This alternative code view of functions is able to capture not only the dependencies of control flow nodes, but also the attribute information associated with program statements.
The ACFG is a directed graph that can be represented as G = (V, E, A), where V and E denote sets of vertices and edges, respectively, and A represents the set of attributes associated with each vertex in the graph. In the context of code vulnerability detection, each vertex corresponds to a node in the control flow graph, and each edge represents the control flow of the code. The attributes assigned to each node in the ACFG consist of both the node type and the specific program statement it represents. For instance, the node shown in Figure 4 with the "<operator>.equals" type signifies a logical operation, while the string "data==null" corresponds to the detailed program statement within the node. As with the function token sequence, to facilitate the handling of node attributes in the subsequent encoder network, a lexical analysis is conducted to convert node attributes into token sequences. For instance, the node attributes "strcpy, strcpy (dataBuffer, data)" can be represented as a sequence of eight tokens, i.e., "strcpy", ",", "strcpy", "(", "dataBuffer", ",", "data", and ")". To extract the attributed control flow graph from the source code of C functions, the widely used open-source tool Joern [40] is utilized.
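As a rough illustration of the G = (V, E, A) structure and of the attribute lexing, the sketch below stores each node's type and statement as a token list. The `ACFG` class, its method names, and the tokenizing regular expression are assumptions made for this example; they do not reflect Joern's output format or the authors' code.

```python
import re
from dataclasses import dataclass, field

# Assumed lexer: identifiers, integers, a few two-character operators,
# then any single punctuation character.
ATTR_TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|->|[^\s\w]")

@dataclass
class ACFG:
    nodes: list = field(default_factory=list)  # V: node identifiers
    edges: list = field(default_factory=list)  # E: (source, target) pairs
    attrs: dict = field(default_factory=dict)  # A: node id -> attribute tokens

    def add_node(self, node_id, node_type, statement):
        """Register a node and lex "<type>, <statement>" into tokens."""
        self.nodes.append(node_id)
        self.attrs[node_id] = ATTR_TOKEN_RE.findall(f"{node_type}, {statement}")

    def add_edge(self, source, target):
        """Record a control-flow edge from `source` to `target`."""
        self.edges.append((source, target))

graph = ACFG()
graph.add_node(0, "<operator>.equals", "data==null")
graph.add_node(1, "strcpy", "strcpy (dataBuffer, data)")
graph.add_edge(0, 1)

print(graph.attrs[1])
```

For the second node this yields the eight tokens listed in the text, and each token list can then be embedded and passed to the node-level encoder.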

Graph Encoding Network
To deal with the ACFG, either a graph attention network (GAT) [41] or a graph convolutional network (GCN) [42] can be leveraged for its representation learning. The essence of both GAT and GCN is to generate more expressive representations for nodes by aggregating features from their own nodes alongside their neighboring nodes. Different from GCN, the GAT model assigns different weights to different nodes in the same neighborhood, which is believed to promote more effective integration of the inter-node feature correlations and ultimately enhances overall feature extraction performance. Therefore, GAT was selected as the base neural network structure for extracting semantic features from the ACFG.
In order to effectively extract the semantic information contained in the individual nodes, the TextCNN model, which is particularly well-suited for capturing the distinct features of tokens by applying different convolution kernels, is used to extract the initial features of the nodes in the ACFG. Specifically, the token sequence that corresponds to the attributes of each node is transformed into a feature matrix $N \in \mathbb{R}^{w \times k}$, where w is the token embedding dimension, and k represents the length of the token sequence. Thereafter, feature extraction is carried out via convolution kernels of sizes two, three, and four, respectively. Since the feature maps obtained from convolution kernels of different sizes may have different dimensions, a pooling function is employed to standardize their dimensions. Finally, the resulting representations are concatenated and transformed into output features through a fully connected layer. These features serve as the initial features for the nodes in the graph, which are then processed by the GAT model.
Figure 5 illustrates the ACFG feature extraction structure built on the GAT. The vector $VEC_i$ produced by TextCNN serves as the initial hidden state of node $B_i$. These initial hidden states are organized into an n × m feature matrix X, while the connections between nodes constitute an n × n adjacency matrix A. Here, n denotes the number of nodes, and m denotes the dimension of each node's initial hidden state. X and A are fed as inputs into the GAT model for attention-enhanced hidden feature aggregation, which can be formally expressed as follows:
$$X^{(l+1)} = \mathrm{GAT}\left(X^{(l)}, A\right)$$
$$x_i^{(l+1)} = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}(i)} a_{ij}^{k} W^{k} x_j^{(l)}\Big)$$
where $X^{(l)}$ is the hidden state of the nodes at layer l (with $X^{(0)} = X$), $x_i^{(l+1)}$ is the hidden state of node i at layer l + 1, $x_j^{(l)}$ is the hidden state of a neighboring node j of node i at layer l, $\mathcal{N}(i)$ is the neighborhood of node i, K is the number of attention groups (heads), $\Vert$ denotes concatenation, $\sigma$ is a nonlinear activation function, $W^{k}$ denotes the corresponding linear transformation matrix for input features, and $a_{ij}^{k}$ denotes the attention weights of the kth group of attention mechanisms.
In the specific implementation, S²FVD features a two-layer GAT structure. The first layer consists of multi-head attention, and the second layer consists of single-head attention. The utilization of multiple attention layers facilitates effective learning of the deep semantic features. As such, three independent groups of attention mechanisms are employed in the first layer, and their outputs are subsequently concatenated to obtain $x_i^{(1)}$. For the readout operation, the graph representation $V_{acfg}$ is computed by taking the mean of the node representations as follows:
$$V_{acfg} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
where N denotes the number of nodes in the ACFG, and $x_i$ is the feature vector of node i.
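The attention-weighted aggregation, the concatenation of several attention groups, and the mean readout can be sketched in pure Python as follows. This is a simplified illustration rather than the authors' network: the scoring function follows the commonly used GAT form (a LeakyReLU over a learned vector applied to the concatenated pair of transformed states), self-loops are added, and ELU is used as the activation; all three choices are assumptions, as are the function names.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def elu(v):
    return v if v > 0 else math.exp(v) - 1.0

def gat_head(X, adj, W, a, slope=0.2):
    """One attention group: softmax-weighted sum over each neighborhood."""
    H = [matvec(W, x) for x in X]
    out = []
    for i in range(len(H)):
        # Assumption: every node also attends to itself (self-loop).
        nbrs = [j for j in range(len(H)) if adj[i][j] or j == i]
        scores = []
        for j in nbrs:
            s = sum(c * v for c, v in zip(a, H[i] + H[j]))
            scores.append(s if s > 0 else slope * s)  # LeakyReLU
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        agg = [sum(al * H[j][d] for al, j in zip(alphas, nbrs))
               for d in range(len(H[i]))]
        out.append([elu(v) for v in agg])
    return out

def multi_head(X, adj, heads):
    """Concatenate the per-node outputs of several attention groups."""
    outs = [gat_head(X, adj, W, a) for W, a in heads]
    return [sum((o[i] for o in outs), []) for i in range(len(X))]

def mean_readout(X):
    """Graph vector: mean of the node feature vectors."""
    return [sum(column) / len(X) for column in zip(*X)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_a = [0.0, 0.0, 0.0, 0.0]  # zero scores give uniform attention

layer1 = multi_head(X, adj, [(identity, zero_a)] * 3)  # three groups
print(len(layer1[0]))
print(mean_readout(gat_head(X, adj, identity, zero_a)))
```

With three attention groups the per-node dimension triples after the first layer, which is why a single-head second layer is a natural way to bring the representation back to a fixed size before the readout.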

Semantic Encoding of AST
This section presents the details for extracting semantic features from the AST. This encompasses the AST preparation phase and the proposed extended tree-structured neural network, which supports the direct encoding of multi-way tree structures through a GRU-style aggregation optimization for the tree nodes.

AST Preparation
The Abstract Syntax Tree [43] is another widely used code representation in program analysis, where the primary code elements (e.g., variable types, symbols, and operators) constitute its leaves, and the defined set of code structures (e.g., expressions and loops) constitutes its non-leaf nodes. The AST depicts both the lexical information and the syntactic structures of the source code [44].
Other code representations such as a program dependency graph (PDG) are artificially constructed and tend to emphasize specific facets of the code (e.g., dependencies between statements), which may suffer from semantic distortion or loss when representing incomplete or non-compilable code fragments. By contrast, the AST stands out as a lossless code representation that preserves the naturalness of the code, thereby yielding more comprehensive and precise semantics than these other representations. Furthermore, a previous investigation [45] has suggested that the AST is a superior code representation for detecting vulnerabilities. Specifically, to acquire the ASTs for functions, the well-established C parsing library Pycparser [46] is utilized. Figure 6 illustrates an instance of the parsed AST.

Tree Encoding Network
Recursive neural networks (RvNNs) are commonly adopted to extract features from tree-structured data. Their core idea is to recursively generate feature vectors for each node in the tree by aggregating the features of its child nodes. However, a standard tree-structured RvNN [47] only deals with binary trees and cannot directly handle the typical multi-way tree structure of ASTs. Thus, this work extends the aggregation operation to support multiple child nodes as inputs, which we refer to as the Extended Recursive Neural Network (ERvNN). The ERvNN serves as the neural network model for encoding the semantics of ASTs.
Let us consider an abstract syntax tree T = {V, E}, where V and E denote its node and edge sets, respectively. For a given node $v_i \in V$, let $S_i$ denote its immediate child nodes. Then, the hidden state of the node $v_i$ is computed through a GRU-style neural unit by integrating the semantic information of both the child nodes and the node $v_i$ itself. In this computation, $h_S$ signifies the semantics aggregated from the children by max pooling the hidden states of all the nodes in $S_i$; $\sigma$ represents the sigmoid activation function; $\odot$ denotes the element-wise product operation; the W, U, and b terms are the weights and biases that are learned during the model training process; and $e_i$ is the embedding vector that corresponds to the token in node $v_i$, which is obtained by looking up the token embeddings that have been pre-trained with the word2vec algorithm. After iteratively calculating the hidden state of each node in the AST using ERvNN in a bottom-up way, the hidden state of its root node is taken as the AST's final semantic vector representation. In this regard, given an AST, its semantic encoding can be denoted as $V_{ast} = h_{v_0}$, where $h_{v_0}$ is the hidden state of the AST's root node. Figure 7 illustrates the layer-by-layer semantic aggregation process for encoding the entire tree.
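The bottom-up encoding can be sketched as below. The gate layout is an assumption: it follows the standard GRU update (reset gate, update gate, candidate state), with the max-pooled child state taking the place of the previous hidden state. The exact ERvNN formulation may differ, and the class name, the tuple-based tree format, and the random stand-in embeddings are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

class GRUTreeCell:
    """GRU-style unit over (node embedding, max-pooled child states)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def matrix():
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(dim)]

        self.dim = dim
        self.W = {g: matrix() for g in "rzh"}  # applied to the embedding e_i
        self.U = {g: matrix() for g in "rzh"}  # applied to the child state
        self.b = {g: [0.0] * dim for g in "rzh"}

    def _affine(self, gate, e, h):
        return [sum(w * x for w, x in zip(self.W[gate][k], e))
                + sum(u * x for u, x in zip(self.U[gate][k], h))
                + self.b[gate][k]
                for k in range(self.dim)]

    def node_state(self, e, child_states):
        # h_S: element-wise max over any number of children (zeros for leaves).
        if child_states:
            h_s = [max(column) for column in zip(*child_states)]
        else:
            h_s = [0.0] * self.dim
        r = [sigmoid(v) for v in self._affine("r", e, h_s)]
        z = [sigmoid(v) for v in self._affine("z", e, h_s)]
        gated = [ri * hi for ri, hi in zip(r, h_s)]
        cand = [math.tanh(v) for v in self._affine("h", e, gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_s, cand)]

def encode(tree, embed, cell):
    """Bottom-up encoding; `tree` is (token, [child trees])."""
    token, children = tree
    return cell.node_state(embed(token),
                           [encode(child, embed, cell) for child in children])

DIM = 4
AST = ("FuncDef", [("Decl", []),
                   ("If", [("BinaryOp", []), ("Return", []), ("Return", [])])])
_rng = random.Random(1)
# Stand-in for pre-trained word2vec embeddings.
EMBED = {tok: [_rng.uniform(-1, 1) for _ in range(DIM)]
         for tok in ("FuncDef", "Decl", "If", "BinaryOp", "Return")}

v_ast = encode(AST, EMBED.__getitem__, GRUTreeCell(DIM))
print(len(v_ast))
```

Because the children are reduced by an element-wise max before entering the unit, a node with three children (the `If` node here) is handled exactly like a node with two, so no binarization of the tree is needed.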

Multi-View Fusion
To capture more comprehensive semantic information from the program code, the semantic vector $V_{acfg}$ obtained using the GAT on the attributed control flow graph, the semantic vector $V_{token}$ obtained using the DPCNN on the function token sequence, and the semantic vector $V_{ast}$ obtained using the proposed ERvNN on the abstract syntax tree are further fused.
There are various strategies for fusing the vectors, and, in this work, five widely used approaches were considered, including point-by-point addition, concatenation, average pooling, max pooling, and non-linear fusion with a multi-layer perceptron (MLP). These strategies were empirically evaluated to determine the best one for our task. Formally, the fused representations can be computed as:
$$V_{add} = V_{token} + V_{acfg} + V_{ast}$$
$$V_{avg} = \tfrac{1}{3}\left(V_{token} + V_{acfg} + V_{ast}\right)$$
$$V_{max} = \max\left(V_{token}, V_{acfg}, V_{ast}\right)$$
$$V_{con} = V_{token} \,\Vert\, V_{acfg} \,\Vert\, V_{ast}$$
$$V_{mlp} = \mathrm{MLP}\left(V_{token} \,\Vert\, V_{acfg} \,\Vert\, V_{ast}\right)$$
where $V_{add}$, $V_{avg}$, $V_{max}$, $V_{con}$, and $V_{mlp}$ denote the fused vectors obtained with the point-by-point addition strategy, the average pooling, the max pooling, the concatenation strategy, and the MLP, respectively; the maximum is taken element-wise, and $\Vert$ denotes concatenation. It is worth emphasizing that the addition, average pooling, and max pooling operations require the participating vectors to have the same dimensionality. In the specific implementation, the dimensions of the extracted feature vectors $V_{token}$, $V_{acfg}$, and $V_{ast}$ were all equal to 192; vectors of differing lengths would otherwise have to be padded to the same length.
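The five strategies differ only in how the three view vectors are combined, which the following sketch makes explicit. It is a minimal illustration: the `fuse` function and its strategy names are hypothetical, and feeding the concatenated vector to the MLP is an assumption about how the non-linear fusion receives its input.

```python
def fuse(vectors, strategy, mlp=None):
    """Fuse equally important view vectors with one of five strategies."""
    if strategy == "con":
        return [x for vec in vectors for x in vec]
    if strategy == "mlp":
        # Assumption: the MLP (any callable) consumes the concatenation.
        return mlp([x for vec in vectors for x in vec])
    if len({len(vec) for vec in vectors}) != 1:
        raise ValueError("add/avg/max need vectors of equal dimensionality")
    columns = list(zip(*vectors))
    if strategy == "add":
        return [sum(col) for col in columns]
    if strategy == "avg":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown strategy: {strategy}")

v_token, v_acfg, v_ast = [1.0, 4.0], [2.0, 0.0], [3.0, 2.0]
views = [v_token, v_acfg, v_ast]

for name in ("add", "avg", "max", "con"):
    print(name, fuse(views, name))
```

Addition, averaging, and max pooling keep the fused vector at the dimension of a single view, whereas concatenation triples it, so the input size of the classification layer depends on the strategy chosen.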

Experiments and Evaluations
To evaluate the effectiveness of S²FVD, the following research questions were explored:
• RQ1: Impacts of the Fusion Strategies. Which fusion strategy, as discussed in Section 3.4, most effectively blends the semantic features collected from the distinct code perspectives for S²FVD to deliver its best vulnerability detection performance?
• RQ2: Performance Comparison with Baseline Methods. How does the performance of S²FVD compare to the baseline methods in detecting the presence of vulnerabilities, as well as pinpointing the specific vulnerability types?

Experimental Setup
The datasets used for the evaluation, the experiment settings regarding the model training and testing, the baseline methods against which S²FVD was compared, as well as the evaluation metrics, are described in this section.

Datasets
To evaluate the proposed method, we constructed a dataset consisting of C functions on the basis of the Software Assurance Reference Dataset (SARD) [48], which is a vulnerability database that is widely used as a source for producing experimental samples. The programs in SARD consist of a blend of academic, production, and synthetic code, with each program categorized as "bad", "good", or "mixed". Typically, each "bad" program contains one vulnerable function, while each "good" program comprises fixed or patched non-vulnerable functions. A "mixed" program contains both a vulnerable function and its patched versions within a single program.
The C source file is typically composed of a header file, macro definition statements, and multiple functions. To generate the function samples, the ANTLR tool [49] was used to parse the raw C source files. Initially, the source file is read, and macro expansion is performed during preprocessing to replace macro names with strings, as macros may contain vulnerability-related information. Subsequently, the source file is transformed into the ANTLR file stream format, which serves as input for the subsequent lexical analysis phase. Given that C programs were being dealt with, CPP14Lexer was used for lexing, thereby producing a sequence of matching tokens. The token sequence was then passed to the parser for syntactic analysis, which converts the program into a syntax tree to facilitate the extraction of the hierarchical structure of the program. During the syntax tree traversal, each node was examined, and an instance with the type of "FunctionDefinitionContext" was marked as the root node of a function subtree. By traversing the subtree in a depth-first manner, the specific source code regarding the function within the source file could be extracted.
Finally, a total of 13,541 non-vulnerable functions and 11,792 vulnerable functions were gathered, with the vulnerable functions spanning 26 distinct types of vulnerabilities. Table 1 presents in detail the number of functions belonging to each vulnerability type, along with their corresponding labels, for the set of 11,792 vulnerable functions. Since the programs in SARD are mostly synthetic, we also evaluated S²FVD and the comparison works against a real-world vulnerability dataset called D2A [50]. This dataset was curated by the IBM research team from multiple popular open-source software projects, including FFmpeg, httpd, Libav, LibTIFF, Nginx, and OpenSSL.

Experiment Settings
To conduct the experiments, both datasets were partitioned into training, validation, and testing sets using an 8:1:1 proportion. The models were trained using the Adam optimizer with a batch size of 16 and an initial learning rate of 1 × 10⁻³, which was decayed by a factor of 0.8 after every 10 epochs. In each epoch, the training set was shuffled, and the accuracy on the validation set was computed. An early stopping mechanism was used to halt the training when the validation accuracy did not improve for 5 epochs. The model that achieved the best validation accuracy was then selected as the final detection model, which was used to evaluate the performance on the testing set. All the experiments were conducted on two Linux servers, each equipped with two 2.1 GHz Intel Xeon Silver-4310 CPUs, 128 GB RAM, and two NVIDIA RTX3090 GPUs.
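The training protocol can be summarized with three small helpers: the 8:1:1 split, the step-wise learning rate schedule, and the early stopping rule. The sketch is framework-independent and makes two assumptions: "reduced by 0.8" is read as multiplying the learning rate by 0.8 every 10 epochs, and training stops once 5 consecutive epochs bring no improvement in validation accuracy. All function names are hypothetical.

```python
import random

def split_8_1_1(samples, seed=0):
    """Shuffle and split into training, validation, and testing sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def learning_rate(epoch, base_lr=1e-3, gamma=0.8, step=10):
    """Step decay: multiply by `gamma` after every `step` epochs (assumed)."""
    return base_lr * gamma ** (epoch // step)

def best_epoch_with_early_stopping(val_accuracies, patience=5):
    """Return (epoch, accuracy) of the model kept under early stopping."""
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_acc

train, val, test = split_8_1_1(range(25333))
print(len(train), len(val), len(test))
print(learning_rate(0), learning_rate(10), learning_rate(25))
```

A consequence of the stopping rule is that a late improvement is never observed once the patience budget is exhausted, so the patience value trades training time against the chance of missing a better checkpoint.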

Baseline Methods
Three state-of-the-art deep-learning-based vulnerability detection methods, namely VulDeePecker, SySeVR, and Reveal, were used as the comparison baselines. A brief overview of them is presented below:
• VulDeePecker proposes to extract code gadgets, which are comprised of code statements that exhibit control dependency relationships with respect to certain code elements of interest (such as library/API calls and array usage), to represent programs. Recurrent neural networks are then trained on these gadgets to detect vulnerabilities.
• SySeVR further enriches the concept of code gadgets. It proposes SeVCs (semantic-based vulnerability candidates) to represent the code by taking into account the data dependencies among the code statements in addition to the control dependencies.
• Reveal is an approach that operates on the graph-based representation of code known as the code property graph (CPG). It uses a GGNN (gated graph neural network) to extract features that are indicative of vulnerabilities present in the code.

Evaluation Metrics
As with most existing learning-based methods for vulnerability detection, the widely used metrics of accuracy, precision, recall, and F1-score were adopted to evaluate the performance of S 2 FVD and the comparison methods. It should be noted that "Accuracy" denotes the overall accuracy, while "Precision", "Recall", and "F1-score" correspond to the weighted averages of precision, recall, and F1-score, respectively, in the experiments involving the detection of vulnerability types.
To be specific, let $k$ denote the number of classes, let $\{c_1, c_2, \cdots, c_k\}$ denote the number of function samples in each class, and let $\{\hat{c}_1, \hat{c}_2, \cdots, \hat{c}_k\}$ denote the number of functions in each class that were correctly classified by the classifier. The accuracy can be defined as follows:

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{k} \hat{c}_i}{\sum_{i=1}^{k} c_i}$$

Let $\{p_1, p_2, \cdots, p_k\}$, $\{r_1, r_2, \cdots, r_k\}$, and $\{f_1, f_2, \cdots, f_k\}$ be the precision, recall, and F1-score values computed with respect to the $k$ classes, respectively. The weighted-average precision, recall, and F1-score can be defined as:

$$\mathrm{Precision} = \frac{\sum_{i=1}^{k} c_i\, p_i}{\sum_{i=1}^{k} c_i}, \qquad \mathrm{Recall} = \frac{\sum_{i=1}^{k} c_i\, r_i}{\sum_{i=1}^{k} c_i}, \qquad \mathrm{F1} = \frac{\sum_{i=1}^{k} c_i\, f_i}{\sum_{i=1}^{k} c_i}$$
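As a concrete illustration of these definitions, the following sketch computes the overall accuracy and the support-weighted precision, recall, and F1-score from lists of true and predicted labels. It is a minimal reference computation, not the evaluation code used in the experiments; the function name `weighted_metrics` is hypothetical, and a class that is never predicted is assigned a precision of 0 by assumption.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Overall accuracy plus support-weighted precision, recall and F1."""
    total = len(y_true)
    support = Counter(y_true)      # number of samples per true class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / total
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Assumption: precision is 0 for a class that is never predicted.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / total     # class weight proportional to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for three classes; class 2 is never predicted.
    print(weighted_metrics([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 0]))
```

Note that, with support-proportional weights, the weighted-average recall always coincides with the overall accuracy, since each class contributes its number of correctly classified samples divided by the total sample count.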

Experimental Results
The subsequent sections cover the experimental findings pertaining to the research questions as discussed at the beginning of Section 4.

RQ1: Impacts of the Fusion Strategies
To identify the most effective fusion strategy that best enhanced the vulnerability detection capability of S 2 FVD, its performances under different fusion strategies were evaluated in this experiment.
As the performance metrics summarized in Table 2 show, S 2 FVD mlp, which uses an MLP for feature fusion, outperformed the alternative models that adopted the other fusion strategies, in both the vulnerability presence detection task and the vulnerability type detection task. This can be attributed to two reasons. Firstly, the concatenation of the feature vectors preserved more semantic information than straightforwardly averaging or adding them in a point-by-point manner, as the extracted code representations provided complementary descriptions of the functions from both the sequence and structure perspectives. Secondly, the non-linear fusion capability provided by the MLP enabled S 2 FVD to pay more attention to the vulnerability-indicative features in the concatenated vectors, which made it more advantageous for vulnerability detection.
The fusion process utilizing the MLP can alter the dimensionality of the semantic encoding vectors. Intuitively, an output vector that is too small could lead to the loss of subtle code semantics. On the other hand, retaining a dimensionality close to that of the input vector does not necessarily improve the fusion effect, but it does increase the computational cost. Therefore, to gain insight into the effect of this hyper-parameter on the fusion process, the detection performance of S 2 FVD mlp was evaluated while varying the dimensionality of the fused semantic vectors. As illustrated in Figure 8, S 2 FVD mlp exhibited optimal performance at a dimensionality of 192. Thus, for the sake of simplicity, in the following experiments, unless explicitly stated otherwise, S 2 FVD always refers to S 2 FVD mlp, with the fused vector's dimensionality set to 192.
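The fusion strategies compared in this experiment can be sketched as follows. This is an illustrative, dependency-free sketch rather than the actual S 2 FVD code: the function names are hypothetical, the MLP is reduced to a single untrained layer with a ReLU activation, and the 128-dimensional view vectors in the demo are an assumption (only the fused dimensionality of 192 comes from the text).

```python
import random


def fuse_average(vectors):
    """Point-by-point mean of equally sized view vectors."""
    return [sum(vals) / len(vals) for vals in zip(*vectors)]


def fuse_add(vectors):
    """Point-by-point sum of equally sized view vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def fuse_concat(vectors):
    """Concatenation keeps every component of every view."""
    return [x for vec in vectors for x in vec]


def make_mlp_layer(in_dim, out_dim, seed=0):
    """Randomly initialised (untrained) weights for one dense layer."""
    rng = random.Random(seed)
    scale = (1.0 / in_dim) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def fuse_mlp(vectors, layer):
    """Concatenate the views, then apply a dense layer followed by ReLU."""
    weights, biases = layer
    x = fuse_concat(vectors)
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical 128-dimensional encodings of the three code views.
    seq_vec, acfg_vec, ast_vec = ([rng.random() for _ in range(128)]
                                  for _ in range(3))
    layer = make_mlp_layer(in_dim=3 * 128, out_dim=192)
    print(len(fuse_mlp([seq_vec, acfg_vec, ast_vec], layer)))
```

Averaging and adding require the view vectors to share one dimensionality and collapse them into a single vector of that size, whereas concatenation followed by a learned projection lets the output dimensionality be chosen freely, which is the hyper-parameter varied in Figure 8.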

RQ2: Performance Comparison with Baseline Methods
In this experiment, the efficacy of S 2 FVD in both detecting the presence of vulnerabilities and pinpointing the specific vulnerability types was assessed and compared with the SOTA baseline approaches discussed in Section 4.1.3. The evaluation results are presented in Table 3. As the metric values show, S 2 FVD demonstrated superior performance on both vulnerability detection tasks and across all the metrics. This indicates that S 2 FVD captures the significant vulnerability features encoded in the fused semantic vectors more comprehensively and precisely. Specifically, regarding vulnerability presence detection, S 2 FVD achieved a leading accuracy of 98.07% and an F1-score of 98.14% for the SARD dataset, as well as an accuracy of 63.02% and an F1-score of 68.99% for the D2A dataset. For the vulnerability type detection task, where the methods are required to identify the specific vulnerability type present in the vulnerable code, S 2 FVD again demonstrated the best performance among the comparison methods, with an accuracy of 97.93% and an F1-score of 97.94%. In addition, the performance of the DL-based approaches on the D2A dataset was much lower than on the synthetic dataset, suggesting that detecting vulnerabilities in real-world programs is still challenging due to the more intricate and varied code contexts in which vulnerabilities reside. Nevertheless, S 2 FVD exhibited promising performance gains over the other DL-based methods, with an average improvement in detection accuracy and F1-score of 6.86% and 14.84%, respectively. This emphasizes the potential of S 2 FVD in detecting vulnerabilities, even in much more complex code contexts.
To further verify whether the performance difference between S 2 FVD and the baseline methods was statistically significant, the Wilcoxon rank sum test and the t-test were performed between S 2 FVD and each of the baseline methods using 5 × 2 cross-validation. The p-values for accuracy and the comprehensive metric F1-score are presented in Table 4. It can be observed that none of the p-values exceeded 0.05 for either test, thereby indicating that there was a statistically significant difference between S 2 FVD and the baseline methods.

RQ3: Substitution of the Neural Network Models
In this section, we conducted substitution experiments by replacing the constituent neural network structures used in S 2 FVD to extract semantic features from the different code views with other typical neural networks. Specifically, we selected three other sequence-oriented models, namely TextCNN, TextRNN, and Transformer [51], to encode the token sequences, in addition to the DPCNN originally adopted in S 2 FVD. These models are well-known for their superior feature-capturing capability in handling sequences. For extracting features from the ACFG, a graph convolutional network (GCN) was used as a substitute for the GAT for performance comparison. For extracting features from the AST, TBCNN [52] was selected as a substitute for the original ERvNN for comparison.
The results obtained on the vulnerability datasets, as shown in Table 5, indicate that the combination of the encoding models utilized in S 2 FVD resulted in the best performing vulnerability detection model, and substituting different parts of it with the listed alternatives led to varying degrees of performance degradation. Additionally, the metric values of S 2 FVD TBCNN , where TBCNN was substituted for ERvNN to encode the AST, suggest that our designed ERvNN can capture the semantics implied in the AST more effectively.

RQ4: Ablation Study
To ascertain whether the fusion of multiple semantic vectors encoded from the distinct code views contributes to the enhanced performance of S 2 FVD in detecting vulnerabilities compared to utilizing only a subset of them (i.e., utilizing a single vector or the vector fused from any two code views), an ablation study was conducted in this experiment.
The experimental results in Table 6 show that the overall performance of S 2 FVD surpassed that of the alternative models that only fuse semantic vectors extracted from two types of code views. Moreover, these alternative models outperformed the ones that solely utilize the semantic vector obtained from a single code view. The progressive improvement in performance as the number of distinct code views increased indicates the effectiveness of the fusion strategy in combining the semantic features extracted from diverse aspects of the code. These findings also suggest that, when the model is trained with fewer representations, it may struggle to fully comprehend the semantic information implied in the code, thereby resulting in inferior detection outcomes. It can also be inferred that there may be some degree of semantic overlap between the features extracted from the token sequence, the attributed control flow graph, and the abstract syntax tree. However, by additionally identifying and fusing the non-overlapping parts of these features, the vulnerability detection capability of the model was substantially enhanced.
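For reference, the ablation settings discussed above correspond to the non-empty subsets of the three code views: three single-view models, three two-view models, and the full S 2 FVD. A minimal sketch that enumerates them is given below; the view identifiers and the function name are hypothetical labels chosen for illustration only.

```python
from itertools import combinations

# Hypothetical identifiers for the three code views used by the model.
VIEWS = ("token_sequence", "acfg", "ast")


def ablation_configs(views=VIEWS):
    """All non-empty view subsets, ordered from single views to all views."""
    return [subset
            for size in range(1, len(views) + 1)
            for subset in combinations(views, size)]


if __name__ == "__main__":
    for config in ablation_configs():
        print(len(config), "view(s):", " + ".join(config))
```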

Threats to Validity
As highlighted in previous studies [53,54], the prevalence of mislabelling in vulnerability datasets has yet to be resolved. However, in the datasets employed in our experiments, the incidence of mislabelling was relatively low, as the samples had been annotated either by security experts or through a meticulously designed differential analysis technique [50]. Therefore, as a deep-learning-based approach, S 2 FVD should exhibit resistance to occasional label noise during model training, with any effect on testing performance stemming from the low-ratio noise being negligible.
Deep learning and machine learning models have been demonstrated to be vulnerable to adversarial attacks across multiple domains [55][56][57]. It is common for programmers to use semantic-preserving code obfuscations or transformations to safeguard their code, which can potentially undermine the detection capability of DL-based methods, including the proposed S 2 FVD. To address this issue, one potential approach is to apply adversarial training [58] by augmenting the training data with obfuscated or transformed adversarial samples. Investigating how S 2 FVD is affected by code obfuscations and other potential adversarial attacks, as well as the possible strategies to mitigate these effects, is left as future work.
It should be noted that S 2 FVD has not yet undergone a systematic hyper-parameter tuning process. Instead, either the default or commonly used empirical values for the hyper-parameters were utilized. Despite this, the evaluation results indicate that S 2 FVD, trained with the current hyper-parameter settings, exhibited strong vulnerability detection capability. While a systematic or exhaustive grid-search-based hyper-parameter tuning could potentially further improve S 2 FVD's detection performance, such a process would require significantly more computing resources and time. As a result, we leave it for future work as well.

Limitations
As a learning-based approach, S 2 FVD faces the challenge of only issuing black-box detection results. This means that, unlike the rule-based methods that furnish supplementary information hinting at possible bug-triggering paths of the detected vulnerabilities [59], it provides only a vulnerable/non-vulnerable prediction or a specific vulnerability type without explanations. As part of future work, explainable AI techniques will be incorporated to highlight the statements or paths that contribute significantly to the prediction outcomes.
In contrast to the conventional rule-based detection methods, as well as the other deep-learning-based vulnerability detection methods that operate solely on a single code view, S 2 FVD's approach of extracting and combining semantic features from multiple views inevitably incurs a heavier runtime overhead. Although we acknowledge the significance of both detection efficacy and efficiency, we believe that the former carries greater importance in the realm of vulnerability detection. Moreover, as the computing power of both CPUs and GPUs continues to advance, achieving a moderately fast detection speed will become more feasible.
Similar to most existing works, the datasets we used are labeled at the function level. However, vulnerabilities can cross function boundaries. Therefore, designating a function as vulnerable solely because the vulnerability is revealed within it might not always be correct. Unfortunately, establishing datasets of precisely labeled bug-triggering code contexts remains a challenging task that necessitates continuous and arduous effort from domain experts. In addition, the assessment of S 2 FVD was limited to code written in C and C++, which are among the languages that have been hit hard by vulnerabilities. Hence, its capability in detecting vulnerabilities that transcend function boundaries, as well as its applicability to other programming languages, deserves further investigation.

Conclusions
To address the problem that existing vulnerability detection methods operating on a single code view are limited in detecting deeply hidden vulnerabilities, this work presents S 2 FVD, which learns vulnerability-indicative features from different code perspectives and fuses them to enhance the detection capability. In particular, to effectively encode the semantics implied within the AST, an extended tree-structured neural network called ERvNN was devised. It supports the direct encoding of multi-way tree structures by implementing a GRU-style aggregation optimization for the nodes within the tree. In extensive experiments conducted on two large datasets consisting of both synthetic and real-world samples, S 2 FVD exhibited a superior vulnerability detection capability compared with SOTA approaches. Notably, a performance improvement of 6.86% and 14.84% regarding the accuracy and F1 metrics, respectively, was achieved for the real-world dataset D2A, thus indicating S 2 FVD's potential in detecting vulnerabilities in more complex code contexts. Additionally, ablation studies confirmed the effectiveness of the ERvNN in encoding semantics from the AST and the superiority of the adopted multi-representation fusion strategy in boosting vulnerability detection capability.

Data Availability Statement:
The data presented in this study are available at https://github.com/lv-jiajun/S2FVD.

Conflicts of Interest:
The authors declare no conflict of interest.