Network traffic analysis through node behaviour classification: a graph-based approach with temporal dissection and data-level preprocessing

Network traffic analysis is an important cybersecurity task, which helps to classify anomalous, potentially dangerous connections. In many cases, it is critical not only to detect individual malicious connections, but also to detect which node in a network has generated malicious traffic so that appropriate actions can be taken to reduce the threat and increase the system’s cybersecurity. Instead of analysing connections only, node behavioural analysis can be performed by exploiting the graph information encoded in a connection network. Network traffic, however, is temporal data and extracting graph information without a fixed time scope may only unveil macro-dynamics that are less related to cybersecurity threats. To address these issues, a threefold approach is proposed here. Firstly, temporal dissection for extracting graph-based information is applied. Secondly, as the resulting graphs are typically affected by class imbalance (i.e. malicious nodes are under-represented), two novel graph data-level preprocessing techniques, R-hybrid and SM-hybrid, are introduced, which focus on exploiting the most relevant graph substructures. Finally, a Neural Network (NN) and two Graph Convolutional Network (GCN) approaches are compared when performing node behaviour classification. Furthermore, we compare the node classification performance of these supervised models with traditional unsupervised anomaly detection techniques. Results show that temporal dissection parameters affected classification performance, while the data-level preprocessing strategies reduced class imbalance and led to improved supervised node behaviour classification, outperforming anomaly detection models. In particular, the NN outperformed the GCN approaches for two attack families and was less affected by class imbalance, yet one GCN performed best overall. The presented study successfully applies a temporal graph-based approach for malicious actor detection in network traffic data.
https://doi.org/10.1016/j.cose.2022.102632 0167-4048/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)


Introduction
Today, cyber attacks are becoming more sophisticated and more destructive than ever. Attackers increasingly try to find vulnerabilities and exploit breaches in order to gain unauthorized access to, damage or steal information, assets, networks or any kind of sensitive data. If a cyber attack compromises at least one of the three security pillars of a target (confidentiality, integrity or availability), it can generate a considerable loss of value for the owner, in economic, ethical, digital, psychological and societal terms (Agrafiotis et al., 2018; Formosa et al., 2021).
Currently, increasing efforts are being made to prevent such cyber threats and reduce their impact (Khraisat et al., 2019; Van Schaik et al., 2020). However, as it is difficult to determine rules for manual or semi-automatic threat detection, there is a need for new solutions based on Artificial Intelligence (Xin et al., 2018). Derived techniques such as Deep Learning have shown promising results in terms of quantifying cyber risks and optimizing cybersecurity operations (Sarker et al., 2020).
Machine learning (ML) techniques are often used to perform anomaly detection (AD) ( Omar et al., 2013 ), an operation that can be used for both threat detection and threat prevention. In particular, anomaly detection tasks considering time series analyses have been extensively explored in recent years ( Li et al., 2021 ). However, in a complex system of actors or entities, interaction patterns can evolve as entities may disappear, emerge or simply change their dynamics. Considering network traffic information in a graph-based representation can highlight these changes and favors the definition, classification and visualization of such information ( Akoglu et al., 2015;Tan et al., 2020 ) as graphs integrate both structured and unstructured data in a representation of entities and their relationships ( Zhou et al., 2009 ). However, approaches that apply Deep Learning paradigms directly to graphs have so far mainly been applied in other domains ( Kipf and Welling, 2017;Scarselli et al., 2008 ).
The majority of graph-based studies in the cybersecurity domain are focused on identifying anomalous or malicious traffic by directly classifying node communications using statistical (Djidjev et al., 2011), data mining (Iliofotou et al., 2009b) or machine learning (Yao et al., 2018; Zheng et al., 2019) techniques. However, analyzing every single connection in real networks may be resource-consuming due to the large number of flows to be processed. Furthermore, malformed communications carried out by normal entities can result in falsely detected malicious flows (false positives), and attackers could generate normal activities in a complex attack routine to obfuscate their presence in the network.
In these cases, it is interesting to classify a single connection and at the same time detect who has generated it, in order to make appropriate decisions that may increase network cybersecurity (e.g. excluding or isolating entities). For these reasons, we introduce here a novel methodology that shifts the focus from analyzing the edges of a graph (node communications) to its nodes (entity behaviours) for malicious/attack detection. More concretely, this novel methodology intends to translate an attack classification task into a (node) behavioural classification task, thus converting time series network traffic into graph-based structures. Our approach is presented in three distinct phases: graph creation, data-level preprocessing and, finally, classification. In the first phase, the main idea is to split the entire network information into temporal intervals and extract graph-based structures from each of them, recalling the concept of a temporal Traffic Dispersion Graph (TDG) (Iliofotou et al., 2009a). In this way, it is possible to visualize entity interactions and their dynamics over time. Furthermore, this temporal dissection avoids small, scarce interactions between entities being obfuscated by more frequent ones. In fact, considering the whole graph at once can generate skewed scenarios that may encourage machine learning models to classify macro-dynamics over more critical micro-dynamics. Further, long observation times would be required to detect attack activities, making a macro-dynamics approach less usable.
After extracting the graphs, as in almost any classification problem related to cybersecurity, it is important to deal with class imbalance (Japkowicz and Stephen, 2002), since normal network activities typically outnumber anomalous activities or attacks. This phenomenon can strongly degrade classification performance, especially when supervised machine learning techniques are used (Fernández et al., 2018). Since common data-level methods for class imbalance cannot directly be applied to graph data, new approaches are required. For this reason, in the second phase, we propose two novel approaches for dealing with class imbalance in graph structures, which we call R-hybrid and SM-hybrid.
The key idea is to apply operations of data-level preprocessing to graph structures, while avoiding the distortion of their topology.
In phase three, behavioural classification tasks using the newly balanced graph-based structures are performed. The main goal of this phase is to compare the performance of models that are able to exploit both behavioural features and graph relations (such as Graph Convolutional Networks, GCNs), with the ones using behavioural features only (such as Neural Networks, NNs). In this way, it is possible to analyze how much the information related to graph connections affects the classification task.
Furthermore, in order to explore the benefits and limitations of our approach, we compare classification results achieved by training graph-based supervised models on the balanced dataset with several common state-of-the-art unsupervised anomaly detection techniques. These unsupervised models are trained with the imbalanced dataset as their main goal is to detect anomalies in the data distribution (outliers) ( Goldstein and Uchida, 2016 ), ( Leung and Leckie, 2005 ). A relevant cybersecurity dataset ( Moustafa and Slay, 2015 ) containing network traffic data is used to validate the presented approach.
To the best of our knowledge, this is the first work that proposes an entire pipeline for working with network traffic data extracting temporal graph-based (node) behaviours, handling class imbalance in graph structures and finally performing classification to detect malicious entities/nodes. The rest of the paper is organized as follows. Section 2 introduces several key concepts concerning graph theory and then presents related work in terms of graph construction techniques, class imbalance and Deep Learning on graphs. In Section 3 , motivations and contributions are described as well as our novel approach and its phases. Section 4 provides an overview of the used datasets and their limitations, the metrics used to evaluate the experiments, the extracted graphs and presents the experiments. Section 5 describes and discusses the obtained classification results. Finally, conclusions and guidelines for future work are given in Section 6 .

Preliminaries
In this section, theoretic concepts applied in this study are introduced. In particular in Section 2.1 , key concepts related to graph theory are presented, whereas in Section 2.2 related works are described.

Graph Theory
The aim of this study is to classify node behaviours extracted from network traffic and analyze them in a graph domain. For this reason, key concepts related to graph theory are reported, following their definitions in Bollobás (2013) .
A graph $G$ is an ordered pair $G = (V, E)$, where $V$ represents the vertex or node set and $E$ is a set of unordered pairs of elements of $V$ called the set of edges. The number of nodes and edges of $G$ will be $|V|$ and $|E|$, respectively.
A walk consists of an alternating sequence of consecutive incident vertices and edges that begins and ends with a vertex. A path is a walk without repeated vertices.
A graph $G = (V, E)$ is connected if, for every pair of vertices $u, v \in V$, there exist edges $e_1, ..., e_n \in E$ such that there is a path from $u$ to $v$. If a graph is not connected, every connected maximal subgraph of it is called a component of the graph.
Definition 2.5. Let $G = (V, E)$ be a graph and let $e \in E$ be an edge of the graph. Let $G' = (V, E - \{e\})$ be the subgraph of $G$ with $e$ cut. Let $c(G)$ denote the number of components of the graph $G$. Then, $e$ is said to be a bridge of $G$ iff $c(G') = c(G) + 1$.
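The bridge condition of Definition 2.5 can be checked directly by counting components before and after cutting an edge. The following is a minimal sketch in plain Python, not the authors' implementation; the function names and the toy graph are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def count_components(nodes, edges):
    """Return c(G): the number of connected components of an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        # Every unseen start node opens a new component, explored depth-first.
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def is_bridge(nodes, edges, edge):
    """Check whether cutting `edge` raises the number of components by one."""
    remaining = [e for e in edges if set(e) != set(edge)]
    return count_components(nodes, remaining) == count_components(nodes, edges) + 1


# Hypothetical toy graph: a triangle (a, b, c) attached to d through one edge.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(is_bridge(nodes, edges, ("c", "d")))
print(is_bridge(nodes, edges, ("a", "b")))
```

Recounting components for every candidate edge is quadratic in the worst case; it is used here only because it mirrors the definition literally.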
In this work, the focus is on graphs and their components. Therefore, with a slight abuse of notation, the term subgraph is used to refer to each component of a graph for the rest of this paper.

Related Work
In this section, related work is introduced, organized as follows: in Section 2.2.1, techniques for extracting graph information from time series are analyzed, while in Section 2.2.2, techniques for addressing class imbalance are presented. In Section 2.2.3, related work based on Graph Convolutional Networks is described, and finally in Section 2.2.4, models used for anomaly detection are introduced.

Graph Construction Techniques from Time Series
Network traffic datasets are composed of information gathered from a network that is usually represented as time series. Each row in these datasets contains information and features related to packets, connections or flows between a source and a destination (depending on the granularity of the dataset). For simplicity, we use the word flow to indicate a row of any network traffic dataset.
The first step is to convert this time series information into temporal graph structures. However, domain transformation can always cause distortion or loss of information. These problems can be even more severe when converting unstructured data or a time series into a network data representation. In Silva and Zhao (2016) , several techniques for transforming unstructured data, vector-based data, or even time series data into networks are described.
In Wehmuth et al. (2015), a model for representing finite discrete Time-Varying Graphs (TVGs, typically used to model complex dynamic network systems) is presented, whereas in Crovella and Kolaczyk (2003) a graph wavelet approach is applied on elements connected via an arbitrary graph topology (spatial domain). In Djidjev et al. (2011) and Jin et al. (2009), temporal graph representations for analyzing large networks are introduced, using telescoping subgraphs (TSGs) in the first case and Traffic Activity Graphs (TAGs) in the second case.
An alternative, known as the Traffic Dispersion Graph (TDG), was presented by Iliofotou et al. (2007). These graphs are graphical representations of various interactions of a group of nodes ("who talks to whom"). The authors exploit network-wide interactions between hosts for extracting graph structures from network traffic datasets, considering each node as a distinct IP address and edges as their communication flows. Although the definition of the TDG's nodes is a simple process, the principal task is the definition of the edges. This can be done based on the available information, for example the first sent packet, the amount of exchanged information, the protocol used and so on. These edges can be either directed or undirected, according to the final goal. This mapping represents a viable solution when inputs are network traffic data. In particular, it allows monitoring, analyzing and visualizing relations among defined nodes using social interaction paradigms (Iliofotou et al., 2007). In Iliofotou et al. (2009a), the concept of TDGs is extended by introducing a temporal factor. The authors propose to split the initial data into subintervals and extract a graph from each subinterval.
In this study, temporal TDGs are used as a starting point for translating the attack classification problem into a node behaviour classification task.

Class Imbalance Problem
One of the most common challenges when facing classification tasks is addressing an uneven distribution among classes, which can produce skewed behavioural models ( Fernández et al., 2018 ). In this regard, data-level techniques are one of the most straightforward and effective ways to address class imbalance ( Leevy et al., 2018 ). These techniques can be divided into under-sampling, oversampling or hybrid methods, depending on how they modify the data distribution (removing or adding samples from the dataset).
The simplest, yet effective alternative for under-sampling is known as Random Under-Sampling (RUS), which removes random samples from the majority class until the dataset is balanced. Other methods in this category are Tomek links (Tomek et al., 1976), the Condensed nearest neighbor rule (Hart, 1968) and Near miss under-sampling (Yen and Lee, 2006). Among the over-sampling approaches, the most common are Random Over-Sampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002). ROS takes samples from less represented classes and randomly replicates them until reaching the population size of the majority one, whereas SMOTE is based on exploiting k-nearest neighbors to create new elements. Other over-sampling methods are Adaptive Synthetic (ADASYN) (He et al., 2008) and Borderline over-sampling (Nguyen et al., 2011).
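The three feature-space strategies named above can be sketched in a few lines. This is a simplified illustration in plain Python rather than a reference implementation: the function names are hypothetical, samples are assumed to be numeric tuples, the minority class is assumed to be the smaller one with at least two samples, and `smote_sample` creates a single synthetic point by linear interpolation.

```python
import random


def random_under_sample(majority, minority, rng):
    """RUS: keep a random subset of the majority class of minority size."""
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, rng):
    """ROS: replicate random minority samples until the majority size is reached."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a near neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining minority samples by squared Euclidean distance to base.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


# Hypothetical toy data: 10 majority samples and 3 minority samples.
rng = random.Random(0)
majority = [(float(i), float(i)) for i in range(10)]
minority = [(20.0, 20.0), (21.0, 22.0), (23.0, 21.0)]
rus_majority, rus_minority = random_under_sample(majority, minority, rng)
ros_majority, ros_minority = random_over_sample(majority, minority, rng)
synthetic = smote_sample(minority, 2, rng)
```

All three functions operate on feature vectors only and ignore any edges between samples, which is the limitation discussed next for graph-structured data.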
When dealing with graph structured data, the application of these techniques is not straightforward and can be misleading. In fact, these techniques are focused on the feature space and, if applied without modifications, the graph structure might be modified, changing its intrinsic information. For this reason, several techniques have been proposed for working with graph structures in order to reduce their size and preserve properties, as shown in Hu and Lau (2013). Commonly studied sampling techniques can be grouped in three classes (Wu et al., 2016): Node-based samplings, Edge-based samplings and Transversal-based samplings. The Random Node technique (Leskovec and Faloutsos, 2006) and Random Degree Node (Stumpf et al., 2005) are typical examples of the first strategy; the Random Edge technique, DropEdge (Rong et al., 2019) and GAUG (Zhao et al., 2020c) belong to Edge-based samplings. Finally, Snowball Sampling (Stivala et al., 2016) and Forest Fire Sampling (FFS) (Leskovec and Faloutsos, 2006) are Transversal-based sampling strategies.
All introduced approaches, although with different impact, generate changes in the graph topology, thereby altering its intrinsic information. For this reason, we present two novel approaches for handling graph data-level preprocessing based on the idea of preserving subgraph properties and topology ( Section 3.3 ).

Graph Convolutional Network
In recent years, Deep Learning has been successfully applied in many fields related to image, video, recorded speech data etc. (Garcia-Garcia et al., 2018). However, when real world applications are characterized by complex relationships and interdependence between samples, new learning paradigms are required to exploit this kind of structure. Graph Neural Networks (GNNs) were introduced in Scarselli et al. (2008) to handle supervised machine learning tasks using graph structured data (cyclic, directed, undirected, or a mixture of these) in which each node is defined by its features and its relation to other nodes in the graph. The idea of GNNs is to generate outputs by analyzing the embedded state of a node, which also contains information about the neighborhood of the node itself (Zhou et al., 2018). Further, Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017) exploit the local and global structural patterns of a graph. The aim of GCNs is to consider the relation among nodes through a convolution operation similar to the one used in Convolutional Neural Networks (CNNs). While CNNs and GCNs have very similar underlying key concepts, CNNs are specifically built to operate on Euclidean data, while GCNs are suited for non-Euclidean data such as graphs that contain nodes, connections, relations and unordered information. According to the chosen filter and its application, graph convolution operations can be separated into spatial-based and spectral-based convolution (Zhang et al., 2019). In the first category, the operation is applied directly on a graph node and its neighborhood, i.e. through some aggregations of graph signals within the node neighborhood. In the second case, the filters are defined following graph signal processing concepts, i.e. using the graph Laplacian matrix to define the graph Fourier transform and the corresponding graph filtering operators (Defferrard et al., 2016; Kipf and Welling, 2017).
While it is straightforward to compute the convolution in the spatial domain (for example by applying a weighted average function over a node and its neighborhood (Hamilton et al., 2017; Monti et al., 2017)), it is more complex in the spectral domain. In particular, in this second case, the convolution operation is defined through Equation 1,

$$ g_\theta \star x = U g_\theta(\Lambda) U^\top x \qquad (1) $$

where $x \in \mathbb{R}^N$ represents a scalar vector (a scalar for every node in the graph), $U$ is the matrix of eigenvectors of $L$, the Laplacian of the graph, and $g_\theta(\Lambda)$ is a function of the eigenvalues $\Lambda$ of $L$ that represents the filter in the Fourier domain. Nevertheless, evaluating this equation can be computationally expensive, or even infeasible, particularly for large graphs.
For this reason, a solution for Equation 1 is obtained by parameterizing the term $g_\theta(\Lambda)$ as a polynomial function that can be computed recursively. More specifically, Chebyshev polynomials of degree $K$ can be used, as shown in Defferrard et al. (2016) (Equation 2). In particular, $g_\theta(\Lambda)$ can be approximated by a truncated expansion of the Chebyshev polynomials $T_k$:

$$ g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \qquad (2) $$

In Defferrard et al. (2016), the authors suggest rescaling the filters by $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I_N$, where $\lambda_{max}$ is the largest eigenvalue of the Laplacian matrix $L$ and $I_N$ is the identity matrix. $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients. Kipf and Welling (2017) demonstrate that a good approximation can be reached by truncating the Chebyshev polynomial to get a linear polynomial ($K = 1$) and by performing a renormalization trick in order to avoid numerical instabilities and vanishing gradients. In this scenario, Equation 1 is reduced to Equation 3 (Kipf and Welling, 2017),

$$ H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right) \qquad (3) $$

where $\sigma$ represents an activation function, e.g. $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, $H^{(l)}$ is the matrix of activations in the $l$-th layer with $H^{(0)} = X \in \mathbb{R}^{N \times C}$, i.e. a matrix of $N$ nodes each one associated with a $C$-dimensional feature vector, and $W^{(l)}$ denotes a layer-specific trainable weight matrix. The term $\tilde{A} = A + I_N$ is the adjacency matrix with the identity matrix added (part of the renormalization trick), and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Basically, the term $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is used to normalize the graph structure and to convert it to a regular neural network function. In Kipf and Welling (2017), an in-depth discussion of this approximation is presented.
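The layer-wise propagation rule of Kipf and Welling (2017), i.e. an activation applied to the symmetrically normalized adjacency matrix (with self-loops) times the activations times the weights, can be illustrated on a toy graph. The sketch below uses plain Python lists to stay dependency-free; the helper names, the three-node path graph and the identity weight matrix are assumptions made for illustration, and a real model would rely on a tensor library with trainable weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights):
    """One propagation step: ReLU(D~^-1/2 A~ D~^-1/2 H W)."""
    n = len(adjacency)
    # Renormalization trick: add self-loops, A~ = A + I_N.
    a_tilde = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Degree of each node in A~: D~_ii = sum_j A~_ij (at least 1 due to self-loops).
    degree = [sum(row) for row in a_tilde]
    # Symmetric normalization D~^-1/2 A~ D~^-1/2.
    a_hat = [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    pre_activation = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in pre_activation]


# Hypothetical path graph 0 - 1 - 2 with two features per node.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible
hidden = gcn_layer(adjacency, features, weights)
```

With identity weights, each output row is simply a degree-weighted mix of the node's own features and those of its neighbours, which makes the effect of the normalization easy to inspect by hand.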
Thanks to their ability to learn graph representations, Graph Convolutional Networks are used in a wide range of applications, especially for detecting similarity among networks and for discovering patterns among the nodes' relations. Examples include applications in chemistry, biology and bioinformatics for classifying drugs, chemical reactions (Coley et al., 2019), molecules (You et al., 2018) and material properties (Xie and Grossman, 2018), or in social science for implementing recommendation systems (Wu et al., 2018). Regarding the cybersecurity domain, Zhao et al. (2020a,b) first used a GCN to transform a botnet detection problem into a semi-supervised classification problem and proposed a new framework for cyber threat discovery based on multi-granular attention and Indicator of Compromise (IOC) extraction. In Gao et al. (2021), heterogeneous graphs and GCNs are combined for classifying Android malware, while, for the same task, in Pei et al. (2020), a GCN and recurrent networks are used for identifying and learning semantic and sequential patterns. Oba and Taniguchi (2020) presented a solution based on a GCN able to analyze a multigraph based on triplets of client IP, server IP and TCP/UDP ports. Crovella and Kolaczyk (2003) use graphs for representing and analyzing incoming and outgoing flows of an access link and identify denial of service (DoS) attacks. In Nagaraja et al. (2010) and in Wang et al. (2020), graph-based information is combined with flow-based data for detecting botnets.
In Sun et al. (2020b) and Sun et al. (2020a), new malicious domain detection systems based on GCNs were introduced, exploiting a Heterogeneous Information Network for analyzing relations among domains, clients and IP addresses and finally extracting meta-path information for detecting malicious activities. Jiang et al. (2019) presented a GCN-based anomaly detection system for threat and fraud detection. Results showed that GCNs outperform four state-of-the-art techniques. In Zheng et al. (2019), an anomalous edge detection framework based on GCNs with an attention model (Gated Recurrent Unit) was presented.
Following these promising trends, we explore here benefits and limitations of graph convolutional approaches for node behaviour classification and compare them with unsupervised anomaly detection techniques, as well as with a more traditional neural network approach, where only feature information is used.

Anomaly detection (AD)
Anomaly detection (AD) is a task in which models learn the distribution of given data and try to detect points that are different from the norm, thereby classifying them as anomalies (Chandola et al., 2009). Even though this paper is focused on presenting a novel graph-based methodology to address graph-related class imbalance to improve node behaviour classification using supervised approaches, it is interesting to compare our results with traditional anomaly detection approaches that can be directly applied to the imbalanced graph-based dataset. For this reason, as part of our study, we implemented five different anomaly detection models, which, although all based on unsupervised machine learning, can be separated into semi-supervised AD and unsupervised AD based on their training setup (Aggarwal, 2017; Goldstein and Uchida, 2016) (Section 4.5). This selection of algorithms is not exhaustive but we think that it represents a good benchmark set for evaluating the benefits and limitations of our approach. In particular, Isolation Forest (IForest) (Liu et al., 2012), Local Outlier Factor (LOF) (Breunig et al., 2000), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Ester et al., 1996), k-Nearest Neighbors (k-NN) (Liao and Vemuri, 2002) and Autoencoder (AE) (Zhou and Paffenroth, 2017) were implemented here.
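To make the idea of scoring points by their distance from the norm concrete, the sketch below implements one common distance-based variant of k-NN anomaly detection, in which each point is scored by its mean distance to its k nearest neighbours. It is an illustrative assumption, not the configuration used in this study; the function name and the toy points are hypothetical, and more than k points are required.

```python
import math


def knn_anomaly_scores(points, k):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, point in enumerate(points):
        distances = sorted(
            math.dist(point, other) for j, other in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical behaviour vectors: four similar nodes and one distant node.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_anomaly_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
```

In practice, a threshold on the score (or a fixed contamination rate) decides which points are reported as anomalies; that choice is left open here.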

Methodology
In this section, a novel methodology is presented, which converts network traffic classification into node behaviour classification. In Section 3.1, the problems to be addressed and the main contributions of this work are presented. In Section 3.2 and Section 3.3, the operations for extracting the Traffic Dispersion Graphs (TDGs) via temporal dissection and the novel graph data-level preprocessing techniques are described. Finally, in Section 3.4, the applied classification operations are detailed.

Motivation and Contributions
In graph analysis based on real use cases, it may be useful not only to classify a single node's communication, but also to detect the entity that has generated the communication in order to isolate or exclude it from the network. However, when graph structures are extracted from network traffic data, it should be noted that it is not possible to use the whole dataset at once for creating a unique static and monolithic graph. In fact, as common network traffic datasets usually involve captures of many hours or even days, the amount of data in the created graph can require too much computational effort, making the application of these graph approaches difficult. For example, if a graph generated from a 10-hour capture is used for training a model, the final application (the test) would require a graph of comparable size (roughly another 10 hours of traffic). This can compromise the usability of the trained model, as attack detection timing is fundamental in order to mitigate a threat promptly.
For this reason, in this work, we propose to extract and analyze graph-based information from time series data (TDGs) by defining entities/nodes and their links/edges. Furthermore, we add an operation of temporal dissection, i.e. a fragmentation of the initial dataset into fixed time intervals (or temporal snapshots ) to highlight network micro-dynamics and hence increase the usability of the solution. In particular, a study analyzing the effect of three different temporal snapshot sizes on the behaviour definition is presented. Further, an enrichment operation is performed in which graph features are extracted from each TDG and added to the entities, enhancing their behaviour description.
As these temporal TDGs show a strong graph class imbalance (due to the dominance of traffic information related to normal activities rather than attack connections), we analyze and compare two different approaches. On the one hand, we test unsupervised models for anomaly detection, which directly assess the imbalanced dataset. On the other hand, we implement two supervised machine learning approaches for node behaviour classification; one that involves behavioural descriptions only (Neural Networks or NN) and two others that additionally use graph neighborhood information (Graph Convolutional Networks or GCN). Specifically, two distinct approximations for the GCN convolution filter are tested. Furthermore, in order to improve the supervised machine learning results, two novel techniques for addressing the class imbalance problem in graph-structured data are introduced. These two graph data-level preprocessing techniques called R-hybrid and SM-hybrid , exploit the fragmented TDGs for reducing their impact on subgraph topology. Figure 1 shows how the methodology for the supervised ML is separated into three distinct phases: Graph creation; Data-level Preprocessing; and finally, Classification . Note that the unsupervised approach does not need the Data-level Preprocessing phase.
In summary, the main contributions of this work are:
• A temporal analysis using different time intervals for transforming network traffic data into graph-based structures (TDGs) is presented;
• Connection information is converted into node behaviours for characterizing graph entities;
• Two novel graph data-level preprocessing techniques are introduced to tackle class imbalance in graph data;
• Three Deep Learning approaches are compared in terms of node behaviour classification performance (one based on behavioural features only and the other two using graph relational information as well);
• The results obtained using our approach based on supervised machine learning and on a balanced dataset (classification) are compared to the ones obtained using five state-of-the-art unsupervised techniques applied to the imbalanced dataset (anomaly detection).

Phase 1: Graph creation
The first phase can be split into two parts; the first one focussing on the temporal dissection and TDG creation, and the second one implementing an enrichment process.

Temporal Dissection and Traffic Dispersion Graphs (TDGs)
As the node behaviour classification task is here addressed as a supervised machine learning task, an initial labelled dataset is required, which is relevant for the definition of normal and attack behaviour. Figure 2 a shows how the time series datasets contain information about traffic flows between a source and a destination, each one identified by several features such as IP, port, protocol, bytes and so on. We propose to use a temporal dissection operation, which allows us to split the network traffic dataset into fixed-time intervals, generating so-called temporal snapshots. From each of these, Traffic Dispersion Graphs (TDGs) are extracted.
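The temporal dissection step can be sketched as a simple bucketing of flows by timestamp. The snippet below is a minimal illustration under stated assumptions: each flow is a dictionary with a hypothetical `time` field in seconds, snapshots are aligned to the earliest flow, and the interval length is a free parameter.

```python
from collections import defaultdict


def temporal_dissection(flows, interval):
    """Group flows into fixed-time snapshots keyed by snapshot index."""
    snapshots = defaultdict(list)
    if not flows:
        return snapshots
    start = min(flow["time"] for flow in flows)
    for flow in flows:
        # Snapshot 0 covers [start, start + interval), snapshot 1 the next one, etc.
        index = int((flow["time"] - start) // interval)
        snapshots[index].append(flow)
    return snapshots


# Hypothetical flows; only the timestamp matters for the dissection itself.
flows = [
    {"time": 0.0, "src": ("10.0.0.1", 80), "dst": ("10.0.0.2", 5000)},
    {"time": 30.0, "src": ("10.0.0.1", 80), "dst": ("10.0.0.3", 5001)},
    {"time": 65.0, "src": ("10.0.0.2", 5000), "dst": ("10.0.0.3", 5001)},
    {"time": 130.0, "src": ("10.0.0.3", 5001), "dst": ("10.0.0.1", 80)},
]
snapshots = temporal_dissection(flows, interval=60.0)
```

One TDG would then be extracted from the flows of each snapshot independently, so the choice of `interval` directly controls how many edges each graph contains.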
Each TDG is characterized by nodes (also referred to as entities), defined as a combination of IP and port number, and edges that indicate if traffic is exchanged between nodes. Following these definitions, each row of the network traffic dataset can be seen as an undirected edge. In fact, each row contains information about data exchanged between the source and the destination, as well as information about the response sent by the destination to the source (bidirectional flow). In this way, each row can be reduced to a 4-tuple of source entity, destination entity, edge features and edge label (Figure 2b). This operation allows us to generate a first version of the TDG from each temporal snapshot, as shown in Figure 2c. However, traffic information is still stored on the edges of the graph, which is why an operation for translating it into node behaviour is required (Figure 2d). This operation is performed by combining all the edge feature vectors in which a certain node is involved. Let each edge j related to node i be described by a feature vector e_j = { f_j1, f_j2, ..., f_jN }, where N is the number of features and f_jh represents the h-th feature of the j-th edge; then it is possible to compute node i's behaviour B_i by combining its edge features with an averaging operation.
Furthermore, it is possible to enhance the node behaviour description by enriching it with m additional features that can be computed considering extra information such as the number of joined edges, the maximum number of edges with the same entity, and so on. This operation increases the behavioural vector dimensionality from N to N + m elements.
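The translation from edge features to node behaviours can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the averaging is assumed to be an element-wise arithmetic mean, both endpoints of an undirected edge receive its features, and a single enrichment feature (the number of joined edges, so m = 1) is appended. The function name and the sample entities are hypothetical.

```python
from collections import defaultdict


def node_behaviours(edges):
    """Average the feature vectors of all edges in which each node is involved.

    `edges` holds 4-tuples (source entity, destination entity, features, label).
    """
    collected = defaultdict(list)
    for source, destination, features, _label in edges:
        collected[source].append(features)
        collected[destination].append(features)
    behaviours = {}
    for node, vectors in collected.items():
        # Element-wise mean over the N features of the node's edges.
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        # Enrichment (m = 1 here): append the number of edges joined by the node.
        behaviours[node] = mean + [float(len(vectors))]
    return behaviours


# Hypothetical snapshot with N = 2 edge features (e.g. bytes and packets).
edges = [
    (("10.0.0.1", 80), ("10.0.0.2", 5000), [100.0, 2.0], 0),
    (("10.0.0.1", 80), ("10.0.0.3", 5001), [300.0, 4.0], 1),
]
behaviours = node_behaviours(edges)
```

Each resulting vector has N + m = 3 elements, matching the dimensionality increase described above.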
Once TDGs are generated and edge information is converted into node behaviours, it is important to assign a label to each node in order to distinguish normal and attacker nodes. When executing common attacks such as Reconnaissance, Denial-of-Service, Port Scan, Exploits, Fuzzers and many others in real use cases, attackers are active entities that start the communication with their targets. Hence, in order to label normal and attacker behaviour, only labels related to edges in which a node appears as source entity are considered and combined. If among these labels the majority are normal edges, the node is labelled as normal. Otherwise, it is labelled as attacker.
From here onward, in this paper, the term behavioural features will refer to the features that define each node behaviour, and 0-class and 1-class will be used to indicate normal (0) and attacker (1) behaviour, respectively.
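The labelling rule can be sketched as a majority vote over the edges in which a node acts as source. Two details are assumptions of this example, since the text does not state them: a tie is treated as "no normal majority" (attacker), and a node that never appears as a source is labelled normal.

```python
from collections import defaultdict


def label_nodes(edges):
    """Label each node as normal (0) or attacker (1).

    edges: list of 4-tuples (source, destination, features, edge label),
           with edge label 0 for normal and 1 for attack.
    """
    source_labels = defaultdict(list)
    nodes = set()
    for src, dst, _feats, label in edges:
        nodes.update((src, dst))
        source_labels[src].append(label)
    labels = {}
    for node in nodes:
        votes = source_labels.get(node, [])
        normal = votes.count(0)
        attack = len(votes) - normal
        # Assumption: nodes with no outgoing edges are treated as normal.
        # A strict majority of normal edges is required for the normal label.
        labels[node] = 0 if (not votes or normal > attack) else 1
    return labels
```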

Graph Features Enrichment
Our approach proposes to extract new information from TDG structures in order to enhance/enrich node descriptions. In this manner, each node behaviour is ultimately identified by both behavioural and graph features.
Creating the temporal TDG based on the network traffic dataset and defining nodes as a combination of IP and port number generates highly fragmented graphs characterized by many subgraph structures (Figure 3), which is in accordance with the literature (Iliofotou et al., 2009a). In fact, it is normal that in a fixed, short time interval the activities related to a certain node are limited, which promotes the creation of simple and disconnected graph structures. Yet, in rare cases, complex structures are generated in which the subgraph is composed of two dense parts that are only connected by one node. In order to denote these complex structures, the concepts of r-nodebridge, set of r-nodebridges and e-bridge are defined.
[...] the only two adjacent edges between those nodes, and suppose that both $e_1$ and $e_2$ are bridges of $G$. [...] components obtained by cutting $e_1$ or $e_2$, respectively, and let $r = \max$ [...]. Then, $v$ is called an r-nodebridge and $W_s$ is the set of r-nodebridges for $r \geq s$. The removed edge will be called e-bridge. This definition allows for detecting all elements of $W_s$. For each one, its e-bridge is removed in order to split the two dense parts and create two distinct subgraphs, thereby reducing the overall complexity.
In order to improve the definition of node behaviour, several graph metrics are directly extracted from the temporal snapshot and added to each node behaviour. In particular, 12 common graph metrics are chosen, as indicated in Table 1 . Three of them are first computed by considering the whole graph in a temporal snapshot  and then by considering the concrete subgraph to which the node belongs, obtaining a total of 15 graph features.

Phase 2: Data-level preprocessing
While the process described in Section 3.2 transforms traffic data into graph behavioural information, it does not address the class imbalance problem. The distribution of normal and attack behavioural nodes generated in Phase 1 keeps suffering from the same imbalance as the initial dataset.
For this reason, and to improve the supervised ML results, Phase 2 balances the number of attacker and normal nodes by introducing two novel techniques that create new synthetic samples for each temporal TDG. On the one hand, these operations try to remove repeated subgraphs in which normal nodes (majority class) are involved. On the other hand, they try to replicate the most "interesting" subgraphs, in which the local-majority population is made up of attacker nodes (minority class). In this way, these methods address the graph imbalance problem without modifying the structural topology of each TDG's subgraphs (Figure 4). More specifically, the first approach combines Random Under-Sampling (RUS) and Random Over-Sampling (ROS) to obtain a comparable distribution among the classes for each TDG (named R-hybrid), whereas the second approach exploits RUS, SMOTE and ROS (named SM-hybrid).
As mentioned, the first step is the same for both approaches: detecting subgraph structures in each temporal snapshot that are characterized by nodes belonging to the most represented class only (normal behaviour). Once these structures are detected, they are randomly removed for each snapshot until the population of that class is halved (Figure 4a). In this manner, direct relational information is not modified and neither are the structures of the untouched subgraphs.
For each TDG, subgraph structures in which at least 60% of the nodes belong to the less represented class (attacker behaviour) are then selected and randomly replicated, as shown in Figure 4b. This threshold is chosen to ensure the convergence of the approach: in each iteration, more elements of the less represented class are added than elements of the majority class, which allows a balanced population to be reached. The replication process is stopped when the number of attacker nodes reaches the initial population of the normal nodes, i.e. the population before starting the replication operation. Although both approaches replicate the subgraph structures in the same way, they differ in how they replicate single node behaviours in each subgraph, as shown in Figure 5.
1. R-hybrid: after applying RUS, this technique replicates not only the most relevant subgraphs, but also their node behaviours using a ROS strategy, as shown in Figure 5a.
2. SM-hybrid: after applying RUS, this technique replicates the most relevant subgraphs; however, normal behaviours are replicated using the ROS technique, whereas attacker node behaviours are generated using the SMOTE technique on their behavioural features.
In fact, graph features are not considered in the SMOTE process since they directly depend on the chosen subgraph and can be calculated afterwards, as explained in Section 3.2.2. In order to apply the SMOTE technique, the minority class should be characterized by a minimum number of neighbors (N). When this condition is not satisfied, the ROS technique is used instead.
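The two balancing steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each subgraph is reduced to a non-empty list of node labels, `r_hybrid` covers the RUS and ROS steps, and `synthetic_attacker` shows one SMOTE-style interpolation on behavioural features, with "minimum number of neighbors" read as the number of other attacker samples available. All function and parameter names are hypothetical.

```python
import random


def count_label(subgraphs, label):
    """Number of nodes with the given label over all subgraphs."""
    return sum(sg.count(label) for sg in subgraphs)


def r_hybrid(subgraphs, attacker_share=0.6, seed=0):
    """Balance one snapshot; labels are 0 (normal) and 1 (attacker)."""
    rng = random.Random(seed)
    # Step 1 (RUS): drop all-normal subgraphs at random until the normal
    # population is at most half of its starting value.
    normal = count_label(subgraphs, 0)
    target = normal / 2
    removable = [i for i, sg in enumerate(subgraphs) if 1 not in sg]
    rng.shuffle(removable)
    dropped = set()
    for i in removable:
        if normal <= target:
            break
        dropped.add(i)
        normal -= len(subgraphs[i])
    kept = [list(sg) for i, sg in enumerate(subgraphs) if i not in dropped]
    # Step 2 (ROS): replicate attacker-dominated subgraphs until attackers
    # reach the normal population measured before replication starts.
    candidates = [sg for sg in kept if sg.count(1) / len(sg) >= attacker_share]
    normal_before = count_label(kept, 0)
    attackers = count_label(kept, 1)
    while candidates and attackers < normal_before:
        copy = list(rng.choice(candidates))
        kept.append(copy)
        attackers += copy.count(1)
    return kept


def synthetic_attacker(behaviours, min_neighbours=6, rng=random):
    """One new attacker behaviour vector (behavioural features only)."""
    base = rng.choice(behaviours)
    others = [b for b in behaviours if b is not base]
    if len(others) < min_neighbours:
        return list(base)  # ROS fallback: plain replication

    def sq_dist(b):
        return sum((p - q) ** 2 for p, q in zip(base, b))

    neighbour = rng.choice(sorted(others, key=sq_dist)[:min_neighbours])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (q - p) for p, q in zip(base, neighbour)]
```

Because every replicated subgraph contains at least 60% attacker nodes, each iteration adds more attacker than normal nodes, which is the convergence argument given above.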

Phase 3: Classification
Finally, in Phase 3, the node behaviour classification is performed by training Deep Learning models. Considering the presented data structures, we opted for comparing two distinct learning approaches: Neural Networks (NN), which use behavioural and graph features to classify normal and attacker nodes, and Graph Convolutional Networks (GCNs), which combine neighborhood information extracted from the graph via a convolution operation. Here, we investigate two GCN implementations where Chebychev polynomials up to degree K are used for modelling the (spectral) convolutional filters, as described in Section 2.2.3. In particular, the first implementation is based on a maximum Chebychev degree K equal to 3 (ChebyNet or 3-GCN), and the second implementation is based on a Chebychev simplification (linear polynomial) presented in Kipf and Welling (2017), called first-order GCN or 1-GCN.
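To make the role of K concrete, the following sketch applies a Chebyshev polynomial filter to a single node signal using the standard recurrence $T_0(x) = 1$, $T_1(x) = x$, $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$. It assumes a rescaled Laplacian given as a plain list of rows and one coefficient per degree (K + 1 coefficients for maximum degree K); it is a didactic sketch rather than the layers used in the experiments.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(l_scaled, x, theta):
    """Compute sum_k theta[k] * T_k(l_scaled) @ x.

    l_scaled: rescaled graph Laplacian (list of rows), assumed to have
              eigenvalues in [-1, 1].
    x: one signal value per node.
    theta: filter coefficients for degrees 0..K (len(theta) = K + 1).
    """
    t_prev = list(x)                      # T_0 x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(l_scaled, x)          # T_1 x = L x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # T_k x = 2 L T_{k-1} x - T_{k-2} x
        t_next = [2 * a - b for a, b in zip(matvec(l_scaled, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out
```

With four coefficients the filter mixes information from up to three hops (the 3-GCN case), whereas two coefficients give the linear case on which the first-order model is based.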

Experimental Framework
A network traffic dataset is used to validate the methodology introduced in this work. In Section 4.1 and Section 4.2, the dataset as well as its limitations are described. In Section 4.3, an overview of the temporal dissection and the graph results is presented, whereas Section 4.4 and Section 4.5 present the metrics used to evaluate the Deep Learning models and their settings. Finally, the experiments are detailed in Section 4.6.

Dataset Overview and Feature Selection
To test our approach for node behaviour classification, a widely used cybersecurity dataset (Monshizadeh et al., 2019; Zhang et al., 2018) called UNSW-NB15 is used. This dataset was created with the aim of improving existing benchmark datasets, which may not be able to provide a comprehensive representation of network traffic and attack scenarios (Moustafa and Slay, 2015). The UNSW-NB15 dataset contains real normal and synthetic abnormal network traffic generated in the University of New South Wales (UNSW) cybersecurity lab. UNSW-NB15 is characterized by nine major families of attacks (Fuzzers, Analysis, Backdoors, Denial of Service, Exploits, Generic, Reconnaissance, Shellcode and Worms) and normal traffic generated over two distinct capture days, the first with a simulation period of 16 hours and the second with a simulation period of 15 hours. Although it is not always possible to clearly distinguish attack families due to their multi-purpose applications and correlations, in this study the following three categories are considered in order to facilitate the description of the nine families: attacks for gathering information, attacks for making a resource unavailable and attacks for taking control of a target (hijacking):

Gather information
1. Analysis: composed of different attacks like port scan, spam and HTML file penetrations.
2. Generic: techniques that try to decrypt information using methods that work against block ciphers (with a given block and key size).
3. Reconnaissance (or Gather): attacks used for discovering more information about a target, usually to start an investigation before deploying the real attack.

Target unavailable
4. Fuzzers: attacks that randomly send data as input to a target in order to exploit vulnerabilities and bugs for generating failure, crashes or unwanted behaviour.
5. Denial-of-Service (DoS): method that tries to make a target unavailable by flooding it with traffic, or by sending information that triggers a crash.

Hijacking
6. Backdoor: technique that uses methods for gaining authorized and/or unauthorized access bypassing system security mechanisms.
7. Exploits: attacks that take advantage of known weaknesses of the target (bugs or vulnerabilities) to take over control and/or steal information.
8. Shellcode: method that uses a small piece of code (usually command shell) to exploit software vulnerabilities and take control of the compromised machine.
9. Worms: malware that has the ability to self-replicate and spread automatically across multiple devices exploiting target vulnerabilities.
Tools like Argus and Bro-IDS are used to generate a first preprocessed dataset, in which network packets are aggregated in cumulative records, each one defined by a total of 49 features, including two different class labels generated by the detection tools. One general class label indicates whether a connection represents a normal activity (0) or an attack (1), whereas the second label specifies the attack category according to the nine available categories. This labelled dataset is available in CSV format and contains clean data without missing values or duplicated records (Moustafa, 2017). Using this labelled version of the UNSW-NB15 dataset reduces the captured duration to 12 h 35 m 10 s and 11 h 57 m 41 s for the first and second capture day, respectively. This preprocessed dataset is characterized by 2,540,044 labelled flows, of which 2,218,761 are labelled as normal and 321,283 are labelled as attack flows. As described in other studies (Moustafa and Slay, 2017; Zhang et al., 2018), not all of the 49 available features are relevant for the classification. Hence, only the following 21 features were used and combined here for defining and extracting node behaviours: source/destination IP, source/destination port number, transaction protocol, state protocol, record total duration, source/destination to destination/source bytes, source/destination to destination/source time to live, source/destination packets dropped, source/destination bits per second, services (dns, ftp, ssh, ...), source/destination to destination/source packet count, record start timestamp, record label and category label.

Dataset limitations and clean up
As presented in Section 4.1 , nine attack families are given in the UNSW-NB15 dataset. However, the dataset suffers from two major problems: class imbalance and class overlap ( Zoghi, 2020 ). In this study, class imbalance is addressed by applying novel data-level preprocessing algorithms in the graph domain, whereas class overlap is mitigated by reducing noisy information and by focusing the analysis on the most populated protocols and services only.
Class overlap arises when the space generated by the features of one class overlaps the space generated by the features of another class. As shown in Zoghi (2020), several classes of the UNSW-NB15 dataset suffer from class overlap. To mitigate the problem and reduce the randomness of the data without losing important information, the analysis was focused on particular services and protocols only, following previous suggestions (Zhang et al., 2018). More specifically, only services identified as undefined, dns, http, smtp and ftp were chosen. This filtered dataset was characterized by more than 100 protocols. As not all of them have the same relevance, we decided to keep only the 16 most represented protocols. The final filtered dataset loses less than 10% of the originally available information, as shown in Table 2, where the size of the dataset for each category is shown before and after filtering. Regarding the population of the 1-class (attack, malicious class), the Worms family was excluded from the analysis due to its low representation (0.06%), reducing the considered attack families to eight.
Even though the main goal of this study is to compute a binary classification to distinguish between normal and attacker behaviour, the eight remaining attack families are considered individually during the data-level preprocessing operations. In this manner, all considered attack subcategories are replicated often enough, mitigating the creation of small disjuncts in the classification problem. A small disjunct is a disjunct that covers only a few training examples and that generates high error rates in testing ( Weiss, 2010 ).
To the best of our knowledge, this is the first time that this dataset has been used for defining a structured graph to highlight and classify node behaviours.

Temporal graph extraction
As explained in Section 3.1, we need to extract temporal TDGs (graphs) from the network traffic dataset (time series) prior to running experiments. The first operation is to fragment the dataset into fixed time intervals (temporal dissection). Here, three temporal snapshot sizes are chosen: 300 s, 600 s and 900 s, and for each configuration, TDGs are built. In Figure 6, the number of snapshots obtained in each configuration depending on the capture day is reported, as well as the number of snapshots that are fully characterized by normal nodes (without attacks) and the number of snapshots with an empty population (without any kind of nodes). Temporal dissection generates different results depending on the available data and the duration of the capture day. In particular, more than 80% of the created snapshots in the first capture day do not show attacker behaviour. For this reason, only non-empty snapshots obtained from the second capture day are considered, which are 141, 71 and 48 for 300 s, 600 s and 900 s, respectively. Table 3 shows that the overall number of 1-class nodes is still much lower than the number of nodes belonging to the 0-class when considering all nodes in all extracted TDGs on the second capture day. This once again highlights the strong class imbalance that affects this dataset (imbalance rate ∼22). In fact, the 1-class population represents less than 5% of the whole graph-based information regardless of the chosen temporal snapshot size. Furthermore, a strong reduction of the Generic family is shown after the graph creation phase even though it was the most represented family in terms of single connections (Table 2). This is because only a few distinct nodes belong to this family (Table 3), and these generate large volumes of attack traffic (flows).
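The reported snapshot counts and imbalance figures can be cross-checked with a short calculation. The sketch assumes that windows start at the first record of the second capture day, that the labelled duration of that day is 11 h 57 m 41 s (as stated in the dataset overview), and that the imbalance rate is the ratio of majority to minority nodes; the helper names are illustrative.

```python
import math


def max_snapshots(duration_s, window_s):
    """Upper bound on the number of fixed-size snapshots in a capture."""
    return math.ceil(duration_s / window_s)


def minority_share(imbalance_rate):
    """Minority share when majority / minority equals imbalance_rate."""
    return 1 / (1 + imbalance_rate)


# Labelled duration of the second capture day: 11 h 57 m 41 s.
DAY2_S = 11 * 3600 + 57 * 60 + 41

for window in (300, 600, 900):
    print(window, max_snapshots(DAY2_S, window))
print(round(100 * minority_share(22), 2))
```

Under these assumptions the bounds are 144, 72 and 48 snapshots, which is compatible with the 141, 71 and 48 non-empty snapshots reported above, and an imbalance rate of about 22 corresponds to a 1-class share of roughly 4.35%, in line with the stated "less than 5%".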
Once these temporal TDGs are extracted, the elements of the set W s of each TDG are detected and all e -bridges associated with them are removed in order to simplify rare enclosed complex graph structures. In particular, to detect the elements of the W s set, we chose s equal to 6. After this reduction, graph features are computed and used for enriching the node behaviours description, generating the inputs for the experiments.

Evaluation Metrics
Classification metrics used to evaluate the graph Deep Learning model implemented in our experiments are obtained via the confusion matrix related to binary classifications, defining the values of True Positives, True Negatives, False Positives and False Negatives.
• True Positives (tp) represent cases in which both the predicted and the real label indicate anomalies/attacks (class 1).
• True Negatives (tn) represent cases in which both the predicted and the real label indicate no anomalies/normal (class 0).
• False Positives (fp) represent cases in which the model predicts anomalies (class 1), but the real label indicates no anomalies (class 0).
• False Negatives (fn) represent cases in which the model predicts no anomalies (class 0), but the real label indicates anomalies (class 1).
Starting from the confusion matrix, five important metrics are calculated for evaluating the learning classifier: Accuracy, Precision, Sensitivity, False Positive Rate (FPR) and F1-score. Accuracy represents the overall effectiveness of a classifier; Precision is a measure of a classifier's exactness; Sensitivity is a measure of a classifier's completeness; FPR indicates the ratio of negative elements predicted as positive ones; F1-score, the harmonic mean of Precision and Sensitivity, relates the actual positive labels to those given by the classifier. These metrics, as well as their formulas and value ranges, are reported in Table 4. Furthermore, we include the Area Under the Receiver Operating Characteristic Curve (AUC-ROC or AUROC) (Fawcett, 2006) for evaluating and comparing different classification performances. The AUC-ROC represents the classifier's ability to distinguish between classes and takes values in a range between 0 and 1.
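The five confusion-matrix metrics can be computed as in the sketch below, which uses their standard textbook definitions; it is assumed, not verified here, that these coincide with the formulas listed in Table 4. Undefined ratios are returned as NaN.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

For a classifier that assigns every sample to class 0 (tp = fp = 0), Precision is undefined, which illustrates why a Precision-based score may be impossible to compute for a degenerate model.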
To perform the experiments, for the supervised models, the temporal TDGs are randomly separated into train, validation and test datasets, keeping a fixed proportion of temporal snapshots of 60%, 20% and 20%, respectively. For the unsupervised LOF, AE and IForest models, the initial dataset is split into train and test datasets only, with a proportion of 80% and 20%, respectively, as they do not require a validation operation. In particular, all attack samples are excluded from the train dataset so that these models learn normal behaviour only, an anomaly detection setup called semi-supervised AD (Aggarwal, 2017; Goldstein and Uchida, 2016). For the cluster-based unsupervised models DBSCAN and k-NN, the whole dataset is used at once in order to create clusters and detect outliers within them, an anomaly detection setup called unsupervised AD (Aggarwal, 2017; Goldstein and Uchida, 2016). For supervised models and semi-supervised AD, each experiment is repeated 5 times with a different composition of the partitions but keeping the described proportions. Hence, for such models, the reported metrics represent an average over 5 repetitions. Standard deviations are reported as well. The process of calculating such metrics over multiple repetitions, however, may introduce unintentional and undesired bias (Forman and Scholz, 2010). There are several methodologies to overcome these issues. As presented in Forman and Scholz (2010), for AUC-ROC, averaging the single values computed in each iteration represents the best solution, whereas for F1-score, it is recommended to apply Equation (5),

$$\text{F1-score}_{tp,fp} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (5)$$

where TP, FP and FN represent the total number of true positives, false positives and false negatives over all repetitions, respectively. For the unsupervised AD setup, results are obtained with a single execution as the entire dataset is used at once.
In this paper, we exploit the labelled data to compute classical supervised metrics such as Precision, Recall (Sensitivity), F1-score and AUC-ROC for unsupervised models too, as in previous studies (Kwon et al., 2019; Meira et al., 2020; Perez et al., 2019). In this way, it is easier to compare supervised and unsupervised results.
The terms F1-score_avg and AUC-ROC are used to denote the metrics computed by averaging the values obtained in each repetition, whereas F1-score_tp,fp is used to indicate the metric obtained using Equation (5). F1-score_tp,fp does not have a standard deviation as it is calculated by combining data from all repetitions.
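The difference between averaging per-repetition F1-scores and pooling the counts over all repetitions can be sketched as follows. The pooled variant uses the usual count-based form 2TP / (2TP + FP + FN); the function names and the toy counts are assumptions made for this example.

```python
def f1_from_counts(tp, fp, fn):
    """F1-score from raw counts; NaN when no positives are involved."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")


def f1_avg(repetitions):
    """Mean of the F1-scores computed separately in each repetition."""
    scores = [f1_from_counts(tp, fp, fn) for tp, fp, fn in repetitions]
    return sum(scores) / len(scores)


def f1_pooled(repetitions):
    """F1-score computed once from counts summed over all repetitions."""
    tp = sum(r[0] for r in repetitions)
    fp = sum(r[1] for r in repetitions)
    fn = sum(r[2] for r in repetitions)
    return f1_from_counts(tp, fp, fn)


# Two toy repetitions as (tp, fp, fn): one perfect, one with no hits.
reps = [(10, 0, 0), (0, 5, 5)]
print(f1_avg(reps), f1_pooled(reps))
```

In this toy case the averaged score is 0.5 while the pooled score is about 0.67, which shows that the two aggregation strategies can diverge when repetitions differ strongly.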

Machine learning model architectures and parameters
As explained in Section 3.3, the minimum number of neighbors (N) for allowing SMOTE operations in the SM-hybrid approach is set to 6. In Table 5, configuration parameters for the NN, 1-GCN and 3-GCN learning models are reported. All models are implemented using a similar structure, although they perform the classification task in different ways. Specifically, they are composed of 1 hidden layer of 300 neurons with a Rectified Linear Unit (ReLU) as activation function and a dropout value of 0.50. The Adam optimization algorithm with a learning rate of 0.0005 is used. For the NN, 512 is chosen as batch size. Generally, it is difficult to determine the training duration without generating under-fitting or over-fitting during neural network training. Hence, it is important to fix a parameter called early stopping, which represents the number of consecutive epochs over which the model performance is evaluated (Prechelt, 1998). If performance degrades over these epochs, the training process is stopped. In our experiments, the maximum number of epochs is set to 100 and the early stopping parameter is set to 10.
For LOF and k-NN, the minimum number of neighbors was set to 7. Further, for the k-NN, the maximum distance was chosen as 80% of the data distribution, i.e. all the samples that had a distance higher than 80% of the data distribution were considered anomalies. The IForest was implemented with the number of estimators set to 100 and with an anomaly threshold of 0.8. This threshold is used to evaluate the predictions, i.e. if the predicted score of an element exceeds the threshold, the point represents an anomaly, otherwise normal behaviour. For the DBSCAN approach, an ε (maximum distance between two samples) of 0.2 was chosen. The AE was composed of one hidden layer to reduce the input dimensionality to 45, using a tanh function as activation and a sigmoid for reconstructing the output. The AE was trained with a batch size of 512 for 50 epochs.
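The interplay between the maximum number of epochs (100) and the early stopping parameter (10) can be sketched as below. The monitored quantity is assumed to be a validation loss, which the text does not specify, and the sequence of losses stands in for a real training loop.

```python
def epochs_until_stop(validation_losses, max_epochs=100, patience=10):
    """Return how many epochs run before training stops.

    validation_losses: iterable yielding one loss per epoch (a stand-in
                       for real training; the monitored metric is assumed).
    patience: consecutive epochs without improvement that trigger a stop.
    """
    best = float("inf")
    stale = 0
    epochs = 0
    for loss in validation_losses:
        if epochs == max_epochs:
            break
        epochs += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs
```

If the loss improves for two epochs and then stagnates, training stops after 12 epochs; if it keeps improving, the cap of 100 epochs applies.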

Experiments
The goal of this study is to classify normal and attacker behaviours in a network traffic dataset using a temporal graph-based representation. For this reason, a novel methodology is introduced that applies temporal graph extraction, tackles class imbalance and finally performs classification via Deep Learning-based approaches. Applying this methodology, however, raises the following main questions:
1. Do the two novel data-level preprocessing approaches correctly address graph class imbalance? How do unsupervised techniques perform with the original, unbalanced dataset?
2. How does temporal snapshot size (temporal dissection) affect node behaviour classification?
3. Which is the best learning model to be used?
To address these issues, three main experiments are carried out.
Experiment 1: The aim of this experiment is to compare the performance of 3 graph-based supervised models trained with the imbalanced graph dataset (baseline) versus the same models trained using a balanced dataset that was obtained by applying R-hybrid and SM-hybrid. Furthermore, 5 unsupervised models for anomaly detection are implemented and compared. All the models are trained and tested with temporal TDGs extracted with only one fixed time interval (600 s). Both supervised and unsupervised models are first trained and tested with the imbalanced dataset. Then, as shown in Figure 7a, for the supervised machine learning models, the two data-level preprocessing approaches are directly applied to the training dataset in order to create a more balanced training population (Phase 2). As described in Section 4.2, the data-level preprocessing operations are performed considering the population of each attack family separately to avoid the creation of small disjuncts. For each temporal snapshot, data-level preprocessing is executed until the sum of attack populations of all families is equal to the initial normal population.
Experiment 2: The goal of this experiment is to evaluate how temporal snapshot sizes affect the definition of node behaviours and their classification. For this reason, three temporal snapshot sizes (300 s, 600 s and 900 s) are used to split the dataset and extract the temporal TDGs. Then, for each temporal snapshot size, a balanced version of these TDGs is used for training the learning models. The balanced TDGs are obtained by applying the data-level preprocessing operation that shows the best results in Experiment 1. Finally, an analysis of how the temporal snapshot size affects the duration of a single training epoch for each model is presented.
Experiment 3: The aim of this experiment is to evaluate which of the three learning approaches (NN, 1-GCN and 3-GCN) performs best for node behaviour classification. We compare the best results each model has obtained in the previous experiments and also compare their different configurations. Although models are trained to perform a binary classification only (attack/normal behaviour), a study regarding the most detected attack families is carried out in order to highlight benefits and limitations of the presented methodology. For this, the population of a test dataset is evaluated to count how many elements of each attack family are actually classified as attacker nodes.
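The per-family count described for Experiment 3 can be sketched as a simple detection rate per attack family. The input layout (one family name and one binary prediction per attacker node of the test set) and the function name are assumptions of this example.

```python
from collections import defaultdict


def detection_rate_by_family(families, predictions):
    """Share of attacker nodes of each family predicted as attacker.

    families: attack family of each attacker node in the test set.
    predictions: model output for the same nodes (1 attacker, 0 normal).
    """
    total = defaultdict(int)
    detected = defaultdict(int)
    for family, prediction in zip(families, predictions):
        total[family] += 1
        detected[family] += int(prediction == 1)
    return {family: detected[family] / total[family] for family in total}
```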

Experimental Study
In this section, our methodology is validated with a network traffic dataset. In particular, in Section 5.1, Section 5.2 and Section 5.3 the results of the three experiments are detailed, and finally, in Section 5.4, strengths and limitations are discussed.

Experiment 1: On the goodness of graph data-level preprocessing
Table 6 shows the average training population in 5 repetitions of the first experiment, when a temporal snapshot size of 600 s was used. Both data-level preprocessing techniques, R-hybrid and SM-hybrid, made it possible to address the graph imbalance in the training dataset, converting an initial distribution of 95.64% and 4.36% for the 0-class and 1-class, respectively, to a more balanced distribution of ∼55% and ∼45%. Figure 8 reveals the structure of the training dataset and, in particular, the distribution of the attack families that composed the 1-class set, averaged over 5 repetitions. During data-level preprocessing, all attack families were considered separately, reshaping their distribution, i.e. the number of samples belonging to each attack family was homogenized. This effect is evident in Figure 8, in which the Generic, Reconnaissance, Fuzzers and DoS families tended towards a similar representation between 16% and 20% and the Exploits family tended to a percentage between 23% and 25%. However, the three other families (Analysis, Backdoor and Shellcode) did not seem to be affected by data-level preprocessing, and their overall representation decreased (less than 0.20%). This phenomenon may be explained by their graph structures, which may not have satisfied the minimum requirements for being included in the data-level preprocessing, i.e. attacker nodes may not have been involved in subgraphs with at least 60% of attacker nodes.
In Table 7, the classification results obtained after training the learning models with the imbalanced dataset are presented. In particular, these results allow us to define the baseline performance when several supervised as well as unsupervised models are applied. While class imbalance of the training dataset significantly affected the two graph-based models (1-GCN and 3-GCN), it was less relevant for NN and DBSCAN performance, which both reached acceptable values in terms of F1-scores (more than 0.60) and AUC-ROC (more than 0.70). More specifically, among the unsupervised methods, the models based on the unsupervised AD setup performed better than the ones based on semi-supervised AD. However, all of them (even though reaching the best precision scores, i.e. no false positives) showed a high number of false negatives and thus low values of sensitivity (less than 0.50). Among the supervised models, 1-GCN showed very limited performance on the imbalanced data since it learnt to classify all samples as belonging to the 0-class. This effect generated an AUC-ROC value of 0.50 and an F1-score_tp,fp of 0, and made it impossible to calculate the F1-score_avg. On the other hand, the 3-GCN learnt a few details related to attacker behaviour (AUC-ROC = 0.52), which nevertheless resulted in rather low F1-scores (F1-score_tp,fp = F1-score_avg = 0.06). The best results for imbalanced training data were obtained by the NN model, with both F1-scores equal to 0.66 and AUC-ROC equal to 0.81.
Table 8 shows the results based on a dataset balanced by using the newly introduced data-level preprocessing techniques. These results demonstrate that a more balanced dataset generated clear improvements when supervised machine learning was used in the classification task. Specifically, the 3-GCN presented the best results in terms of F1-scores and AUC-ROC, reaching 0.73 and 0.95, respectively, regardless of the data-level preprocessing used. The 1-GCN improved as well when using the balanced dataset and reached its best values with the SM-hybrid technique (F1-scores = 0.65 and AUC-ROC = 0.89). The NN model, however, was only slightly affected, as its F1-score_avg and AUC-ROC improved by only 0.01 and 0.12 points, respectively.

Table 7: Comparison between models trained with imbalanced data, using 600 s as temporal snapshot size (baseline results).
Table 8: Comparison between models trained with balanced (R-hybrid, SM-hybrid) data, using 600 s as temporal snapshot size.
This first experiment indicated that both data-level preprocessing techniques generated classification improvements independently of the supervised machine learning model used. Furthermore, the results obtained in this way were the best results, also when compared with the ones obtained from the considered traditional, unsupervised models. More concretely, the best results were generated with the SM-hybrid approach. For this reason, only this data-level technique was considered for the second experiment.

Experiment 2: Comparing different snapshot sizes
Table 9 highlights the effects of the temporal snapshot size on node behaviour classification. The reported metrics were obtained by training and testing the three learning models using distinct temporal snapshot sizes. SM-hybrid is used for balancing the dataset in all cases. All models generated similar values in terms of AUC-ROC regardless of temporal snapshot size, but they diverged with respect to F1-scores. More specifically, all models showed their best results using a temporal snapshot size of 600 s, with an F1-score_avg of 0.67, 0.65 and 0.73 for the NN, 1-GCN and 3-GCN, respectively. Increasing the temporal snapshot size to 900 s generated the worst values in terms of F1-scores: the three models scored 0.06, 0.04 and 0.05 points less compared to their best results, respectively, but kept the AUC-ROC values almost unchanged. On the other hand, decreasing the temporal snapshot size to 300 s generated slightly lower values in terms of AUC-ROC, where NN, 1-GCN and 3-GCN lost 0.01, 0.02 and 0.01 points, respectively.
Regarding the computational costs of the different approaches when dealing with different temporal snapshot sizes, Figure 9 shows the average duration of an epoch during the training process. Epoch duration during NN and 1-GCN training was similar and did not change visibly when changing the temporal snapshot size. However, for 3-GCN training, the duration of an epoch increased almost exponentially with increasing temporal snapshot size, reaching a value of more than two minutes for a single epoch when using a size of 900 s.

Table 9: Classification metrics obtained in the second experiment by varying the temporal snapshot size and by using SM-hybrid data-level preprocessing.
Figure 10: Attack families as detected by the three learning models in their best configuration.

Experiment 3: Comparing different learning approaches
In Table 10, the best configuration and the best results of each learning approach are reported. All tested models (NN, 1-GCN and 3-GCN) showed their best results using the same configuration, i.e. a temporal snapshot size of 600 s and training data balanced with the SM-hybrid technique. According to this table, the best results over all metrics were obtained using the 3-GCN, which highlights the importance of considering graph relations during node behaviour classification. However, a linear approximation as applied in the 1-GCN did not show benefits, since it was outperformed by the NN model on Recall, F1-score and AUC-ROC (Table 10). Figure 10 details the percentages of detected attack families for each of the three learning models when using the best configuration. The 3-GCN detected 7 out of 8 attack families with accuracy higher than 90%, and the Reconnaissance family with even ∼99%. However, the 3-GCN had problems detecting samples belonging to the Analysis family, as demonstrated by low accuracy values (∼50%) and high variability. The 1-GCN was again outperformed by the NN, which for two families (Backdoors and Shellcode) also outperformed the 3-GCN classifier. However, the NN showed lower accuracy than the 3-GCN in detecting the Generic, Fuzzers, DoS and Exploits families (between 85% and 90%), and a clear deficit regarding detection of the Analysis family (∼27%).

Discussion
In this paper, we aimed to classify node behaviours in network traffic data by transforming the given time series data into graphs. As the obtained graph-based dataset is characterized by strong class imbalance, we tested two different approaches, one based on supervised ML and the other based on unsupervised ML (anomaly detection, AD). Both approaches initially showed low recall scores and thus lower overall performance (F1-scores and AUC-ROC). For this reason, we investigated the effects of class imbalance in the input dataset and presented two new methods to tackle this issue. Further, we analysed how the temporal snapshot size used for extracting the graphs affects the classification results.
Regarding class imbalance, it was found that the NN implementation was generally not strongly affected and aligned with the unsupervised AD models, while both graph-based learning models required the application of data-level preprocessing techniques in order to improve classification performance. As the application of data-level techniques in the graph domain is not straightforward, we propose here two approaches: R-hybrid and SM-hybrid. Our experiments show that SM-hybrid, which involves RUS, SMOTE and ROS techniques, generally achieved better results. The results hence demonstrate that class imbalance is indeed an issue when using graph-based approaches and that these data-level techniques need to be carefully set up.
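As a reminder of the three building blocks named above, the following minimal Python sketch applies random undersampling (RUS), a SMOTE-style interpolation and random oversampling (ROS) to plain feature vectors. It is not the SM-hybrid procedure itself, which operates on graph substructures; the function names, the order of the steps, the sample counts and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def random_undersample(X, n_keep, rng):
    """RUS: keep a random subset of n_keep rows (without replacement)."""
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx]

def random_oversample(X, n_target, rng):
    """ROS: duplicate random rows until n_target rows are available."""
    extra = rng.choice(len(X), size=n_target - len(X), replace=True)
    return np.vstack([X, X[extra]])

def smote_like(X, n_new, k, rng):
    """SMOTE-style step: interpolate between a sample and one of its
    k nearest neighbours from the same (minority) class."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # exclude self-matches
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the segment
    return X[base] + gap * (X[chosen] - X[base])

# Hypothetical imbalanced data: 200 benign vs. 10 malicious feature vectors.
rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(200, 4))
minority = rng.normal(3.0, 1.0, size=(10, 4))

majority_kept = random_undersample(majority, n_keep=50, rng=rng)
synthetic = smote_like(minority, n_new=20, k=3, rng=rng)
minority_balanced = random_oversample(
    np.vstack([minority, synthetic]), n_target=50, rng=rng
)
```

In a graph setting, each of these operations additionally has to decide what happens to the edges of removed, duplicated or synthetic nodes, which is why the data-level techniques need careful setup.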
Regarding temporal snapshot size, in our case, interestingly all models performed best with a size of 600 s and worsened when increasing the size further. This indicates that it is important to carefully choose temporal snapshot sizes when trying to detect attack node behaviour in network data. Furthermore, it should be noted that using graph-based approaches with higher order filter approximation (such as the 3-GCN) with larger temporal snapshot sizes may yield significantly longer training epochs as our results have shown for the example dataset.
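To make the role of the temporal snapshot size concrete, the following minimal Python sketch buckets flow records into fixed-size windows and derives one edge list per window. The record layout `(timestamp, source, destination)`, the function names `dissect_flows` and `node_degrees`, the example addresses and the degree-based node feature are assumptions made for illustration; they are not the feature set or implementation used in this work.

```python
from collections import defaultdict

def dissect_flows(flows, snapshot_size=600):
    """Group (timestamp_s, src, dst) flow records into fixed-size snapshots.

    Returns a dict mapping snapshot index -> list of (src, dst) edges.
    """
    snapshots = defaultdict(list)
    for timestamp, src, dst in flows:
        snapshots[int(timestamp // snapshot_size)].append((src, dst))
    return dict(snapshots)

def node_degrees(edges):
    """Count how many flows each node takes part in within one snapshot."""
    degrees = defaultdict(int)
    for src, dst in edges:
        degrees[src] += 1
        degrees[dst] += 1
    return dict(degrees)

# Hypothetical flows: (seconds since capture start, source, destination).
flows = [
    (0, "10.0.0.1", "10.0.0.2"),
    (100, "10.0.0.1", "10.0.0.3"),
    (650, "10.0.0.2", "10.0.0.3"),
    (1250, "10.0.0.1", "10.0.0.2"),
]

by_600 = dissect_flows(flows, snapshot_size=600)  # three snapshots
by_900 = dissect_flows(flows, snapshot_size=900)  # two, denser snapshots
first_window_behaviour = node_degrees(by_600[0])
```

A larger window merges more flows into each graph, so every snapshot summarises coarser behaviour and contains more edges for a graph-based model to process.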
In all experiments, the best results with high Accuracy, Recall and AUC-ROC values were obtained with a 3-GCN model, suggesting that exploiting graph relations for classifying node behaviours is a promising approach. However, the experiments also showed that these graph learning models need to be applied and configured properly, as a linear approximation (1-GCN) did not seem to be sufficient to take full advantage of the added graph-based information. Interestingly, this model was outperformed (by a few score points) by an NN model that did include graph-based features, but did not consider relations generated by nodes. Furthermore, the NN model proved to be robust against class imbalance and was not affected in terms of training epoch duration by the chosen temporal snapshot size.
Interesting performance differences among the three considered models were revealed when looking at the attack families that could be detected. Here, the NN seemed to be the most suitable method for detecting Backdoors and Shellcode attacks, while the 3-GCN approach performed very well for all other attack families apart from Analysis. In fact, all compared models had problems detecting samples belonging to the Analysis family. This attack family and the Backdoor and Shellcode families are the least represented ones in the training dataset, as they do not participate in the data-level preprocessing. Although our approach translates the attack classification problem into a behavioural classification task, the classification results of these three attack families can be traced back to the class overlap problem that characterizes the UNSW-NB15 dataset. According to the KMeans Intercluster Distance Map of the original dataset (Zoghi, 2020), the Backdoor and Shellcode families overlap with the DoS and Fuzzers families, respectively. However, in the case of binary classification, this issue positively affects the results, as the more represented classes also contain information about the less represented classes. The Analysis family, on the other hand, suffers from class overlap with the Worm family, which had been removed due to its low number of samples. Hence, in this case, no other attack family carries information about the Analysis samples, which most likely causes these low scores.
While our approach generated overall improvements, especially when graph-based learning was used, the F1-scores could potentially be further improved. More precisely, as demonstrated by the third experiment, Precision values are limited, generating a loss of classification quality (although FPRs are less than 0.04). This effect could be related to the use of the UNSW-NB15 dataset and its limitations, as introduced in Section 4.2.

Methodology Limitations
Graphs created from network traffic data mainly depend on the shape of the initial dataset, which should contain information about sources and destinations as well as a temporal axis. One advantage of this approach is that the definition of a "node" could be easily adapted according to the problem at hand, promoting its application to other domains. Moreover, the usage of node behaviour helps to maintain a stable supervision of entities, linking behaviors between snapshots. Further, it allows evaluating features that directly depend on the nodes, such as event logs or suspicious actions in the operating system of the device. However, although the presented methodology can be applied to several IT applications, forensic analysis etc., there are specific scenarios in which its application may be limited. These scenarios include, for example, analysing communication data in critical infrastructures (energy, industry, etc.) where timing is fundamental in order to apply countermeasures and reduce the impact of a cyber-attack very quickly. In these cases, operators typically require reaction times shorter than the temporal snapshots used here for extracting node behaviours. Hence, in these scenarios, traditional intrusion detection systems (IDSs) based on time-series data may be more effective (see next section), which, however, do not provide information regarding node behaviours.
There are also other limitations in our approach that should be discussed. On the one hand, the introduced data-level preprocessing operations are designed for balancing graph datasets in which the information is fragmented, and hence we focus on increasing the substructures in which the minority class is present. This is the case for temporal TDGs and network traffic data, but adaptations will likely be required when the problem presents a single graph in which all nodes are connected to one another. On the other hand, in terms of the GCN implementation, it is clear from the results that a higher order convolutional filter improves the ability of the classifier, but at the same time it substantially increases the training time.
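The trade-off between filter order and cost can be illustrated with a minimal Python sketch of a polynomial graph filter. It assumes that 1-GCN and 3-GCN denote first- and third-order Chebyshev-style approximations of the convolutional filter, which is our reading of the terms "linear approximation" and "higher order" and is not spelled out in this section. The function names, the fixed coefficients `thetas` (which would be learned in a real model), the approximation of the largest eigenvalue by 2 and the toy path graph are likewise assumptions.

```python
import numpy as np

def scaled_laplacian(adjacency):
    """Return L~ = L - I = -D^{-1/2} A D^{-1/2}, assuming lambda_max ~= 2."""
    degrees = adjacency.sum(axis=1)
    inv_sqrt = np.zeros_like(degrees, dtype=float)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = degrees[nonzero] ** -0.5
    return -(inv_sqrt[:, None] * adjacency * inv_sqrt[None, :])

def chebyshev_filter(adjacency, features, thetas):
    """Compute sum_k thetas[k] * T_k(L~) @ features for k = 0..K."""
    lap = scaled_laplacian(adjacency)
    t_prev = features.astype(float)           # T_0(L~) X = X
    out = thetas[0] * t_prev
    if len(thetas) == 1:
        return out
    t_curr = lap @ features                   # T_1(L~) X = L~ X
    out = out + thetas[1] * t_curr
    for theta in thetas[2:]:
        # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
        t_next = 2.0 * (lap @ t_curr) - t_prev
        out = out + theta * t_next
        t_prev, t_curr = t_curr, t_next
    return out

def reached(output, tol=1e-9):
    """Indices of nodes whose filtered feature is non-zero."""
    return [int(i) for i in np.flatnonzero(np.abs(output[:, 0]) > tol)]

# Toy path graph 0-1-2-3-4 with a unit feature on node 0 only.
adjacency = np.zeros((5, 5))
for i in range(4):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
impulse = np.zeros((5, 1))
impulse[0, 0] = 1.0

reach_order_1 = reached(chebyshev_filter(adjacency, impulse, [1.0, 1.0]))
reach_order_3 = reached(chebyshev_filter(adjacency, impulse, [1.0] * 4))
```

In this toy case a first-order filter only mixes information from direct neighbours, whereas the third-order filter reaches nodes up to three hops away. Each additional order adds one more product with the graph operator per layer, which is consistent with the longer training time observed for the higher order filter.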

UNSW-NB15 Comparative Study
This work introduces a novel methodology to analyze a network traffic dataset using a graph-based approach for the classification of node behaviour. The idea is not to directly classify network traffic itself (punctual or specific information), but to evaluate who has generated such flows by classifying node behaviour in a temporal snapshot (summary information). With this graph-based representation, it is therefore not possible to directly apply temporal deep learning technologies like 1D-CNN, Long Short-Term Memories (LSTMs) or Recurrent Neural Networks (RNNs) as the information that should be classified is not represented by a point in a time series but by a complex graph composed of distinct nodes that change their individual behaviour (state) over time.
When comparing our work with others in the literature, we did not find any study directly related to our work. To the best of our knowledge, our approach has not been explored before. For this reason, Table 11 reports previous works that use the UNSW-NB15 dataset for implementing different intrusion detection systems (IDSs). In these studies the network traffic dataset is analyzed as a time series with the aim of instantly detecting malicious flows (typical IDS task). Hence, a direct one-to-one comparison with our graph-based approach analysing node behaviours is difficult and potentially unfair. In particular, previous implementations are mainly based on ML and Deep Learning (DL) models such as Support Vector Machines (SVM), Random Forest (RF), Naive Bayes (NB), Multilayer perceptron (MLP), CNNs, 1D-CNN and LSTMs.
As shown in Table 11, the accuracy obtained with the proposed node behaviour approach is quite similar to the one obtained by directly classifying network connections; in their best implementations they reach ∼96% and ∼99%, respectively. However, analyzing the results in terms of Precision, Recall and F1-score, it is clear that node behaviour classification is a more complex and challenging task than classification based on network connections. In particular, in its best configuration, the introduced node behaviour approach reaches values of 0.60, 0.93 and 0.73 for Precision, Recall and F1, respectively, while network connection classifiers show values higher than 0.90 in all three metrics. Yet, it is to be noted that the network traffic approach generates very variable results depending on the approximation and the algorithm used. In terms of training time, some ML approaches were the fastest with a duration of only a couple of seconds (NB), while more complex approaches like CNN required more than 1 hour. The 3-GCN implemented in this work required more than 2 hours for training the model, while the 1-GCN was faster than a simple CNN (Jiang et al., 2020) and was in line with the training time of a CNN combined with a BiLSTM (Jiang et al., 2020). Note that the reported values (strongly) depend on the computational resources used, which vary across the different works.
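The relation between the three reported metrics can be checked with a short Python sketch that uses the standard definition of F1 as the harmonic mean of Precision and Recall. The function name `f1_from_precision_recall` is our own; the input values are the Precision and Recall reported above for the best configuration.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and Recall of the best node behaviour configuration.
f1_best = f1_from_precision_recall(0.60, 0.93)
print(round(f1_best, 2))  # prints 0.73
```

Because the harmonic mean is pulled towards the smaller of the two values, the limited Precision bounds the F1-score even though Recall is high.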
Finally, we would like to point out again that a one-to-one comparison between our graph-based approach and previously published models is challenging as these models predominantly analyse time-series data of communication flows between actors in the network to detect suspicious individual flows. Additionally, although all the analyzed approaches start from the same initial dataset, the extraction of TDGs for the node behaviour classification alters the magnitude of the imbalance problem, increasing the difficulty to compare the results. Furthermore, the models based on time series do not provide broader information regarding the actors, i.e., regarding the nodes themselves and their behaviour over time. Unlike our graph-based approach they cannot classify an entire node's behaviour as a potentially malicious actor in the network. Their results are thus instantaneous and therefore quicker but without providing the bigger picture concerning all nodes' behaviours. With our GCN-based models, one could identify malicious actors within a network and then make decisions regarding the isolation or exclusion of this actor in order to restore the cybersecurity of the entire network. In fact, future elaborate network monitoring systems may provide a combination of both detection capabilities: time-series-based IDS for quick, short-term malicious flow detection and longer term graph-based node analysis to pinpoint potentially malicious actors in the network based on their behaviour.

Table 11
Comparison of the GCNs implemented in this study with existing IDS models that analyse the UNSW-NB15 dataset (note, however, that results between node behavior classification and network connection classification are not directly comparable).

Conclusions and Future work
In this study, we present a novel methodology that converts an attack classification problem into a node behaviour classification problem, thereby highlighting the importance of understanding and correctly manipulating the input data and properly configuring the classification model. Our approach allows extracting temporal graph-based information (temporal TDGs) from network traffic data while focusing on micro-dynamics and the evolution of node behaviours. Two novel techniques for addressing class imbalance in the graph domain were proposed (R-hybrid and SM-hybrid) and proved to be suitable to reduce class imbalance while minimizing changes in the graph topology. When temporally dissecting the given network traffic data to unveil network micro-dynamics, we investigated the effect of the temporal snapshot size on the classification of node behaviour. Finally, three different Deep Learning approaches (Neural Network, 1-GCN and 3-GCN) were implemented and compared with more traditional unsupervised anomaly detection methods that do not require a balanced dataset. Overall, the methods presented in this paper showed promising results that can be summarized as follows:
• It is possible to convert time series information into graph-based structures by properly defining the concept of node/entity and link/edge. Then, edge information can be combined to characterize each node behaviour;
• The unsupervised models for anomaly detection trained as semi-supervised AD together with the 1-GCN and 3-GCN were most affected by class imbalance;
• The NN together with the DBSCAN (unsupervised AD) showed good results for the imbalanced data;
• Novel proposed data-level preprocessing techniques (R-hybrid and SM-hybrid) successfully solved the class imbalance problem in graph data and yielded good classification results. SM-hybrid generated the best results;
• Temporal snapshot size is relevant when analysing network traffic data using a graph-based approach. All learning models generated the best classification results using a temporal snapshot size of 600 s for the given dataset;
• Increasing the temporal snapshot size heavily affected single training epoch duration for the 3-GCN model (exponential upward trend), while NN and 1-GCN training epoch duration was less affected;
• The 3-GCN model including graph-based features showed best overall classification performance. However, the graph-based model applying linear approximation (1-GCN) was outperformed by the NN approach, which did not use any graph relations, regardless of the temporal snapshot size;
• Among the 8 considered attack families, the 3-GCN was able to detect 7 of them with an Accuracy above 90%. However, the attack families Backdoors and Shellcode could best be detected via NN. None of the learning models were able to detect Analysis attacks well.
The presented graph data-level preprocessing approaches were developed in order not to modify the graph structures and topology. Further, they needed to be applicable in situations in which graph information is fragmented and the minority class creates highly-represented substructures. As future work, it may be interesting to modify these techniques so that they can be applied in cases where entities are fairly mixed with other classes (as is the case for the Analysis, Backdoor and Shellcode families in our experiment). Furthermore, the precision of the classification could be improved by applying clustering operations after TDG creation in order to aggregate similar behaviours, reducing noisy data and enhancing small interactions between the nodes. The effectiveness of distinct clustering algorithms should be compared in order to understand how they affect the initial population as well as the final classification. Finally, an approach based on Generative Adversarial Networks (GANs) for creating new synthetic node behaviours and their connections within the graph could be a solution for improving the variety of attack class samples.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.