1 Introduction

An essential component of an organization’s information technology (IT) infrastructure is the Security Operations Center (SOC), a centralized unit that employs people, processes, and technology to continuously monitor and manage the organization’s security posture. One of its key functions is to identify, investigate, and resolve cyber threats and attacks. This is achieved through active monitoring, which involves a continuous scan of the network for abnormalities or suspicious activities. The SOC collects and analyzes data from various sources, such as security devices, logs, and other data sources, to detect and respond to potential security incidents. By providing continuous monitoring and response capabilities, the SOC helps organizations maintain their security posture and protect against cyber threats.

In network monitoring, situational awareness refers to best-practice engineering approaches used to continuously monitor IT processes and operations to improve service provision in the form of automatic action recommendation and decision support [1]. Dealing with large volumes of data online and in real time is a challenge. Much of these data are unlabeled [2] or not significant for detection purposes. Another challenge is the large quantity of alerts whose evaluation unnecessarily consumes the time of security analysts and thus makes it harder to notice real alerts. Various tools such as firewalls, virtual private networks, intrusion detection systems (IDSs), and intrusion prevention systems (IPSs) are used to ensure network security. Early detection, alerting, and rapid response are crucial components of a proactive security approach, helping prevent potential problems from impacting the network. Other challenges [3] in detecting network anomalies are a dynamic and fast-changing environment, the diversity of resources and tools, and the security nature of the domain, which prevents the sharing of logs among SOCs.

Artificial intelligence (AI), specifically its subfields of neural networks (NNs) and deep learning (DL) used in an unsupervised way, aids in this process. This involves processing telemetry data from a monitored network, representing them as time sequences, and training an intelligent model to predict the most probable future states (called the baseline) based on previous observations [4, 5]. In typical network operation, alarms are triggered when a monitored value exceeds or falls below predefined thresholds. However, setting these thresholds accurately can be challenging over time [6, 7]. If the thresholds are too narrow, a large number of alarms can be triggered, exceeding the analysis capacity of network managers. If the thresholds are too wide, real anomalies can be overlooked, leaving the network vulnerable to potential threats. Dynamically defined thresholds are therefore becoming increasingly important. Such a nuanced soft baseline enables a complex system overview and helps to indicate possible anomalies, allowing security systems to continuously refine and adapt to changing situations over time and online. However, real data from real production are hard to obtain due to the security nature of the domain. There are a number of public, synthesized, and artificially generated datasets, but real scenarios are rarely seen.

The motivation of this work is to deploy an AI module for proactive network monitoring in real time to improve security protection. In this field, the implementation aspect of AI deployment has not yet been thoroughly investigated. By cooperating with IDS/IPS, the intelligent AI module is designed to monitor the infrastructure network in production and identify potential threats. To achieve this, the latest trend in multihorizon and multivariate time-series prediction is used, along with DL techniques in an unsupervised way, to simultaneously model multiple monitoring channels. This approach is expected to provide a more comprehensive and accurate view of the network’s state, enabling proactive measures to be taken to improve network security.

The work presented in this paper makes several key contributions to the field of AI for IT operations (AIOps) research, specifically in the context of online data stream monitoring, including:

  • The development of a DL-based multihorizon forecast model that simultaneously models multiple monitoring channels in an unsupervised manner, enabling a comprehensive view of network activities.

  • The deployment of an AIOps module containing the DL-based trained model for proactive network security monitoring that cooperates in real-time with IDS in production and allows a dynamic nuanced baseline to improve the effectiveness of threat detection.

  • The automation of the entire software stack using declarative composition, which helps to simplify AIOps deployment and management over time.

These contributions demonstrate the potential of the AI and advanced modeling techniques presented in this work to improve network security and the efficiency of IT operations.

The paper is further structured as follows: Section 2 surveys the related work. Section 3 provides the main idea about the cooperation architecture in Sect. 3.1, the learning development phase in Sect. 3.2, the online model deployment in Sect. 3.3, and the online anomaly detection in Sect. 3.4. Section 4 goes through the deployment of the architecture in Sect. 4.1, online data processing in Sect. 4.2, the quality forecast in Sect. 4.3, and anomaly detection results in Sect. 4.4. Section 5 concludes the main points of the work and its future extension.

2 Related work

2.1 Network behavior analysis

Network anomaly detection, real-time monitoring, behavior baselines, analysis dashboards, network security measures, and access to threat intelligence are essential components of a network security strategy. By implementing these measures, organizations ensure that their networks are secure and protected against cyber threats. There are several types of IDS, including network IDS, host IDS, wireless IDS, and their combinations. An IDS only alerts against network threats and vulnerabilities, and further intervention is needed to act on these alerts. Beyond an IDS, an IPS both alerts and blocks network threats and vulnerabilities based on a signature approach. Automated threat response is based on user-predefined thresholds.

Most current IDS/IPS are capable of responding in real time or near real time to network activities. This is achieved primarily by reactive solutions, typically a set of rules. These rules are used to identify known patterns of malicious behavior, such as known attack signatures, and to trigger an alert or take an action to block the activity in real time [8]. However, the effectiveness of these reactive solutions is limited by the system’s ability to recognize previously unknown or novel attack patterns. To address this limitation, machine learning (ML) algorithms are emerging that learn from historical data and proactively detect and prevent potential threats [9].

In this context, network behavior analysis (NBA) [10] is a technique used in cybersecurity to monitor and analyze network traffic for suspicious behavior (Table 1). Unlike misuse-based detection methods that rely on known attack patterns, NBA is designed to detect anomalous behavior that may indicate a new or unknown threat. NBA involves analyzing network traffic in real time to identify patterns of behavior that deviate from normal usage. This approach can help detect a wide range of attacks, including malware infections, data exfiltration, and insider threats. Examples of cybersecurity platforms and tools [11, 12] that can incorporate NBA techniques are IDS/IPS software such as ZEEK, Snort, Nessus, Suricata, Zabbix, OSSEC, FlowMon, and Rapid7.

Table 1 Cybersecurity tools and platforms that can incorporate NBA techniques

2.2 Predictive analysis in cybersecurity

A data-driven approach and intelligent data modeling extract information from large volumes of data for informed decision making, using advanced analytics to identify patterns, predict trends, and respond to threats and opportunities [13]. There are two main types of network analysis:

  • Misuse-based detection, also called the knowledge-based or signature-based approach, aims to detect known attacks by using attack signatures. Its strength is high accuracy, and it can identify the type of attack. However, it is a reactive solution and requires a data background of attacks. There is a lack of real labeled datasets of known patterns due to the diversity of IDS/IPS working in real production.

  • Anomaly-based detection considers an anomaly as a deviation from a known behavior (in other words: profile, baseline, threshold) representing the normal or expected behaviors derived from monitoring regular activities over a period of time [14]. It is a proactive solution, predicting in advance and in time, works for any type of attack, and does not require a data background of attacks. However, the approach may provide low accuracy in dynamic environments where all events change dynamically.

In the context of misuse-based detection, the works [15,16,17] present a range of techniques for data preprocessing, data filtering, feature learning, and representation, spanning multiple topics including semi-supervised anomaly detection and insider threat detection in network traffic [18]. The authors compare and select from a variety of ML methods such as logistic regression (LR), Naïve Bayes (NB), k-nearest neighbor (kNN), support vector machine (SVM), decision tree (DT), random forest (RF), extreme learning machines (ELMs), and multilayer perceptron (MLP). These works utilize various techniques to calculate anomaly scores, such as the local outlier factor, one-class SVM, isolation forest, histogram-based outlier score, or distance distribution-based anomaly detection. To calculate distances between feature spaces, dimensionality reduction techniques such as principal component analysis (PCA) or random projection are often used.

Anomaly-based detection is also a widely used technique to identify situations in data that deviate from expected behavior. It finds applications in various domains, including fraud detection in credit card transactions, insurance, and traffic flows, intrusion detection in the cybersecurity industry, and fault detection in industrial analytics [19]. In time series, the objective is to recognize temporal patterns in past data and to use these results for forecasting. The Kalman filter was one of the early models capable of addressing regression concerns and minimizing variance to achieve optimal results. Later, the auto-regressive integrated moving average (ARIMA) model became a well-known and standard framework for predicting short-term data flow. Numerous modifications to the ARIMA model have been implemented, with improved performance as a result [20]. Unsupervised real-time anomaly detection for streaming data was investigated in [21], which evaluates the performance of hierarchical temporal memory networks on the Numenta Anomaly Benchmark (NAB) corpus, which contains a single metric with timestamps.

In the works [22, 23], the authors characterize dynamic monitoring and propose incremental DL to handle concept drift in simulated evolving data stream scenarios. Comparisons between traditional approaches such as ARIMA and DL showed that DL outperformed the traditional approaches. DL methods reported in the literature include NNs, deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) with incremental learning [24], gated recurrent units (GRUs), autoencoders (AEDs), and variational autoencoders (VAEDs), as well as their combinations. Further examples are restricted Boltzmann machines (RBMs), deep Boltzmann machines (DBMs), deep belief networks (DBNs) [25, 26], and, more recently, transformers [27, 28] for problems with multiple timestamped metrics treated as sequences.

Table 2 Deep learning models for multihorizon multichannel modeling

These works provide valuable information on the strengths and weaknesses of different approaches, helping researchers and practitioners make informed decisions about the methods to apply to specific use cases and identify inherent obstacles to applying ML/DL in practical domains [30]. The authors report a significant interest in validating the effectiveness of their network intrusion detection approaches [31] by simulating a wide range of scenarios that mirror real-world conditions as closely as possible. Using various techniques such as time series, graph networks, DL, and their combinations [32], the authors aim to develop robust and reliable systems to detect anomalies in network traffic flows.

DL has become increasingly popular in anomaly detection with multihorizon time-series forecasting (Table 2), showing significant performance improvements compared to traditional time-series methods [33, 34]. Multihorizon forecasting is a technique in time-series data analysis [35]. Unlike one-step-ahead forecasting, which predicts the value of a variable only one time period ahead, multihorizon forecasting allows the estimation of future trends over multiple time periods [36]. This capability is especially valuable for situational awareness and decision support, as it enables optimization of actions in multiple steps in advance.

This surge in the use of DL has led to various experiments aimed at developing intelligent modules that address intrusion detection in a complex way. In this setting, patterns that deviate from a baseline estimated only from normal traffic are indicated as anomalous. There is also a discussion of the technical challenges of applying deep ensemble models [37, 38], covering issues that arise during the data management, model learning, model verification, and model deployment stages, as well as considerations that affect the entire deployment pipeline in production [39].

2.3 AI for IT operations

AIOps is the term used to describe the integration of AI and ML algorithms into IT operations to improve the efficiency and reliability of IT services [40]. It represents a significant step forward in the field of IT operations, enabling organizations to take advantage of the power of AI to improve their IT operations and provide better services.

In this context and for cybersecurity intrusion detection research, several artificially generated, public, and synthesized datasets [17, 25, 41] are widely recognized as valuable resources. The most well-known are presented in Table 3. These datasets provide a valuable resource for researchers and practitioners to test and evaluate new approaches to cybersecurity challenges, allowing the development of more effective and robust cybersecurity solutions.

Table 3 Cybersecurity datasets: public, generated, and synthesized

Furthermore, an annotation approach is presented to generate artificial alerts in [49], which includes a pioneering automatic scheduling scheme. This method enables efficient and effective monitoring of data streams, providing timely alerts for potential anomalies or other important events. Similarly, in the work [50], an automated adaptation strategy for stream learning is presented, which has been tested on 36 publicly available datasets and five ensemble ML methods such as the simple adaptive batch learning ensemble (SABLE), dynamic weighted majority (DWM), paired learner (PL), leveraged bagging (LB), and the best last method (BLAST) for stream data analysis. These experiments demonstrate the potential of ML/DL to enable real-time analysis of data streams, helping to detect important patterns and anomalies.

As mentioned in Sects. 2.1 and 2.2, in a wide range of studies, including those mentioned and many others, researchers have conducted experiments that demonstrate promising results in the development and testing of various methods for data analysis and monitoring [51]. These studies have used a variety of public datasets containing various monitoring information, allowing rigorous testing and comparison of different approaches. The positive results of these experiments suggest that these methods have the potential to be useful in a wide range of applications, from anomaly detection to predictive modeling and beyond. Although these works have made important contributions to cybersecurity research, their focus has often been on predictive analysis (that is, data analytics with ML methods but without real deployment), using publicly available, synthesized, or artificially generated datasets (Table 4).

Table 4 Predictive analysis in cybersecurity

Although such datasets can provide valuable information on the performance of different approaches, it is important to recognize that real-world cybersecurity challenges often involve unique and complex data that are not fully represented in them. To develop robust and effective cybersecurity solutions, it is crucial to validate them on real data to ensure their effectiveness in practical settings.

Network intrusion detection based on anomalies is an area of great interest in cybersecurity research, but its real-world application poses a significant challenge. Although many studies have focused on developing intelligent approaches to detect anomalies in network traffic, deploying these methods in real production with real stream data and flows is difficult and rarely achieved due to the security nature of the domain, as presented in Table 4.

3 Method description

To bridge the gap between research and real production, this work addresses the AIOps integration process and the challenges involved in deploying intelligent approaches for anomaly detection over stream data while the underlying technologies are already in operation [53]. The starting point of our work is the integration of an AI/DL model that cooperates with the ZEEK IDS, which supervises the infrastructure network, on an online real-time data stream for anomaly detection.

3.1 Cooperation architecture

Network anomaly detection is critical to ensuring the security of computer networks. Real-time monitoring is essential to detect unusual activity or behavior that may indicate a security breach. To establish a baseline for network behavior, it is crucial to understand what normal activity looks like. This is where the behavior baseline comes in, providing a clear picture of expected network activity.

To make sense of the large amounts of data generated by real-time monitoring and behavior baseline, analysis dashboards are essential. These dashboards provide network administrators with insight into network performance, traffic patterns, and potential security threats.

The cooperation architecture (Fig. 1) in this work is designed with the main components and technologies presented in Table 5. The combination of ZEEK, Apache Kafka, Elasticsearch, Kibana, and Docker with Docker Compose creates a system architecture that is capable of integrating an intelligent module to improve the detection capacity of ZEEK. It can also provide a dashboard for visualization analysis with a dynamic behavior baseline produced by the module.

Fig. 1: Cooperation architecture of proactive network monitoring deployment

The detailed description of each software component in our cooperation architecture with its purpose and key features is as follows:

Table 5 Main software components and technologies of the cooperation architecture

ZEEK is an open-source IDS that continuously monitors network traffic to detect and identify potential intrusions in real-time. It does so by parsing network traffic and extracting application-level information to analyze and match input traffic patterns with stored signatures. ZEEK is known for its speed and efficiency, and it is capable of performing detection without dropping any network packets. Additionally, it includes an extensive set of scripts designed to minimize false alarms and support the detection of signature-based and event-oriented attacks.

Apache Kafka is an open-source distributed event streaming platform that enables high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. It operates as a cluster of one or more servers communicating through a high-performance transmission control protocol (TCP) network protocol and can span multiple data centers or cloud regions for enhanced scalability and fault tolerance.

It provides a variety of application programming interfaces (APIs) in Java and Scala, including higher-level libraries for Go, Python, and C/C++, and a Representational State Transfer (REST) API. The primary concepts of Kafka include topics, partitions, producers, consumers, brokers, and ZooKeeper. The main Kafka concepts (Fig. 2) used in our work are listed below, followed by a minimal producer/consumer sketch:

  • Topics represent a stream of records or events that are stored in a distributed manner across partitions. Topics are partitioned, i.e., their data are spread over a number of buckets located on different Kafka brokers.

  • Producers publish data to topics; they are the client applications that publish (write) events to Kafka.

  • Consumers subscribe to topics to consume the data; they are the clients that read and process these events.

  • Brokers serve as intermediaries between producers and consumers, managing the storage and replication of data across partitions.

  • ZooKeeper is a centralized service for maintaining configuration information and providing distributed synchronization. Kafka uses it to monitor the status of cluster nodes, as well as topics and partitions. ZooKeeper stores the list of existing topics, the number of partitions for each topic, the location of replicas, and access control lists (ACLs).
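
For illustration, a minimal Python sketch of the producer and consumer roles is given below; it assumes the third-party kafka-python client, a local broker address, and example topic/key names, none of which correspond to the exact production configuration.

```python
# Minimal illustration of the Kafka producer/consumer roles described above.
# Assumptions: the kafka-python package, a broker at localhost:9092, and a
# topic/key naming scheme ("zeek1" / "conn") chosen only for this example.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Publish one parsed ZEEK conn.log record under topic "zeek1", key "conn".
producer.send("zeek1", key="conn", value={"ts": 1690000000.0, "orig_bytes": 512})
producer.flush()

consumer = KafkaConsumer(
    "zeek1",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    key_deserializer=lambda k: k.decode("utf-8"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:        # blocks, reading the stream of events
    record = message.value      # parsed ZEEK log record
    print(message.key, record)
    break                       # stop after one message in this sketch
```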

Fig. 2: Kafka concept in the ZEEK log stream context

Elasticsearch is a distributed RESTful search and analysis engine, designed to handle a wide range of use cases. As the core component of the Elastic Stack, it provides lightning-fast search capabilities, fine-tuned relevancy, and robust analytics that can easily scale to handle even the largest datasets. Elasticsearch is the most popular enterprise search engine based on Lucene, offering a distributed multi-tenant capable full-text search engine with a user-friendly hypertext transfer protocol (HTTP) interface and schema-free JavaScript Object Notation (JSON) documents.

Kibana is an open-source front-end application that works seamlessly with the Elastic Stack. Kibana provides search and data visualization capabilities, allowing users to easily explore and analyze data indexed in Elasticsearch. It was originally part of the Elasticsearch-Logstash-Kibana (ELK) stack. Kibana also serves as a user interface to manage, monitor, and secure an Elastic Stack cluster. Its aim is to configure and monitor Elasticsearch indices, visualize data using pre-built dashboards and graphs, and manage security settings to ensure data privacy and access control.

Docker packages software into standardized units called containers. It is a set of platform-as-a-service products that use operating system (OS) level virtualization to deliver software in packages (containerization). Developers and operations teams use Docker to create and automate the deployment of applications in lightweight containers, for example, on virtual machines, to ensure that applications work efficiently in multiple environments. Using OS-level virtualization, Docker provides a consistent and predictable environment in which applications can run, regardless of the underlying infrastructure [55].

Docker Compose is used to define and run multi-container Docker applications, creating and starting all services from a configuration file written in YAML Ain’t Markup Language (YAML) with a single command. Docker Compose runs multiple containers as a single service. Each container runs in isolation but can interact with the others when required.

The description and the online state of our monitoring stream data are in Sect. 4.2. The snapshot with details of our architecture deployed as the software stack running in its production testing environment is in Sect. 4.1.

In the following sections, the two phases of our proposed intelligent module designed for the cooperation architecture are presented as follows:

  • The development of our multihorizon multichannel model is described in Sect. 3.2.

  • The online deployment of the model is presented in Sect. 3.3.

The online anomaly detection principle is presented in Sect. 3.4.

3.2 Multihorizon multichannel modeling as dynamic baselines

Complex network monitoring is considered as a data function y of the time point t, denoted \(y_t\). It has the form of a multivariate vector containing the variables of m monitoring channels \(( y^1_t, y^2_t, \dots , y^m_t )\) for proactive modeling of network activity:

$$\begin{aligned} y_t = \left( y^1_t, y^2_t, \dots , y^m_t \right) \end{aligned}$$
(1)

at time t. The mathematical basis for proactively forecasting the value of y at the next q time points \((t+1),~\dots ,(t+q)\) is the values of y at the previous p time points \((t-p+1),~\dots ,t\), with additive error terms.

In this work, we build on the time-series forecasting background [56, 57] and extend the forecast from one horizon (one time point), as in [58], to a forecast starting k horizons ahead with a direct forecast of q horizons (Fig. 3), as follows (Eq. 2):

$$\begin{aligned} \begin{pmatrix} y^1_{t-p+1} & \cdots & y^1_{t} \\ y^2_{t-p+1} & \cdots & y^2_{t} \\ \vdots \\ y^m_{t-p+1} & \cdots & y^m_{t} \\ \end{pmatrix} \rightarrow \begin{pmatrix} y^1_{t+k} & \cdots & y^1_{t+k+q-1} \\ y^2_{t+k} & \cdots & y^2_{t+k+q-1} \\ \vdots \\ y^m_{t+k} & \cdots & y^m_{t+k+q-1} \\ \end{pmatrix} \end{aligned}$$
(2)

as vector form (Eq. 3):

$$\begin{aligned} y_{t-p+1},~\dots ,~y_{t} \rightarrow y_{t+k},~\dots ,~y_{t+k+q-1} \end{aligned}$$
(3)

where:

\(1 \le k < p\);

\(1 \le q < p\);

k indicates the time points ahead between the current time t and the start time point \((t+k)\) of the forecast;

q indicates the multihorizon forecast from time point \((t+k)\) to time point \((t+k+q-1)\).

The modeled values of \(y_t\), called the predictions \(\widehat{y_t}\), are used as the known behavior (in other words, a dynamic baseline or dynamic threshold) representing the normal behavior of regular network activities over a period of time t for anomaly detection purposes (more details in Sect. 3.4).
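
For illustration, the mapping of Eqs. (2) and (3) from an input window of p time points to a direct forecast of q horizons offset by k can be sketched as follows; the array shapes, the helper name make_windows, and the parameter values are assumptions made only for this example.

```python
# Sketch of building (input window, multihorizon target) pairs per Eqs. (2)-(3).
# Assumptions: data is a NumPy array of shape (T, m) with T time steps and
# m monitoring channels; the values of p, k, q are chosen only for illustration.
import numpy as np

def make_windows(data: np.ndarray, p: int, k: int, q: int):
    """Return X with shape (N, p, m) and Y with shape (N, q, m)."""
    X, Y = [], []
    # last usable start index: t such that t + k + q - 1 <= T - 1
    for start in range(data.shape[0] - p - k - q + 2):
        t = start + p - 1                      # current time index
        X.append(data[start:start + p])        # y_{t-p+1}, ..., y_t
        Y.append(data[t + k:t + k + q])        # y_{t+k}, ..., y_{t+k+q-1}
    return np.stack(X), np.stack(Y)

series = np.random.rand(1000, 5)               # 1000 steps, 5 channels (dummy data)
X, Y = make_windows(series, p=36, k=1, q=10)
print(X.shape, Y.shape)                        # (955, 36, 5), (955, 10, 5)
```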

Fig. 3: Multihorizon multichannel modeling

The direct forecasting strategy involves training models on time-series values that have been shifted by the desired number of time periods into the future. It can be improved by using a multichannel forecasting strategy, also known as a multiple-output strategy. With this approach, a single model is developed that can predict the entire forecast sequence in one go. This makes the AIOps deployment easier and more flexible compared to making separate predictions for each time period using multiple models.

Data preprocessing is a crucial step before modeling and prediction. An important aspect of preprocessing is feature engineering, which involves transforming raw logs into time-series data that can be used for modeling. To ensure the quality of the transformed data, the network protocols are checked for white noise, randomness, and unit roots [59] using the augmented Dickey–Fuller (ADF) test [60]. A unit root is a stochastic trend in a time series, sometimes called a random walk with drift. If a time series has a unit root, it shows a systematic pattern that is unpredictable. The original Dickey–Fuller test tests the model (Eq. 4):

$$\begin{aligned} \Delta y_t = y_t - y_{t-1} = \alpha + \beta t + \gamma y_{t-1} + e_t \end{aligned}$$
(4)

ADF is the augmented version of the Dickey–Fuller test. It allows for higher-order autoregressive processes by including \(\Delta y_{t-p}\) in the model (Eq. 5).

$$\begin{aligned} \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta _1 \Delta y_{t-1} + \delta _2 \Delta y_{t-2} + \cdots \end{aligned}$$
(5)

For both tests, the null hypothesis (\(H_0\)) is that a unit root is present in the data. The alternative hypothesis (\(H_1\)) is that no unit root is present. Mathematically, \(H_0\) corresponds to \(\gamma = 0\), in which case y is a random walk (non-stationary) process. If the ADF test returns a \(p\_value\) below the 0.05 threshold, then y is considered stationary. The goal of these tests as part of data preprocessing is to identify and filter out protocols that contain excessive noise or randomness, as they are not suitable for modeling. Unlike most other forecasting algorithms, DL architectures are capable of learning non-linearities and long-term dependencies in the sequence, so stationarity is less of a concern for DL models. By performing this data filtering, the quality of the data is improved, leading to more accurate predictions [61].
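
A minimal sketch of this filtering step is shown below, assuming the statsmodels implementation of the ADF test and a pandas DataFrame with one column per aggregated channel; the column names and dummy data are illustrative.

```python
# Sketch: keep only channels whose ADF test rejects the unit-root hypothesis
# (p < 0.05), as described above. Column names and data are examples only.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def filter_stationary(df: pd.DataFrame, alpha: float = 0.05) -> list:
    """Return the columns whose ADF test rejects the unit-root hypothesis H0."""
    kept = []
    for column in df.columns:
        p_value = adfuller(df[column].dropna())[1]  # index 1 is the p value
        if p_value < alpha:                         # reject H0 -> stationary
            kept.append(column)
    return kept

# Dummy data: a stationary noise channel and a random-walk (unit-root) channel.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "conn_in": rng.normal(size=500),
    "drifting": np.cumsum(rng.normal(size=500)),
})
print(filter_stationary(df))   # typically ["conn_in"]
```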

To model the complex behavior of network traffic, we employ state-of-the-art DL architectures (Table 2) that can handle online stream data. These DL models were investigated, customized to our domain data, and compared (Sect. 4.3.2) to select the best one, i.e., the GRU model, for online deployment (Sect. 3.3).

The forecast quality is evaluated using a separate error matrix (Eq. 6) for multihorizon modeling (k horizons ahead before the direct forecast of q horizons) of multiple channels (m monitoring channels). We denote

$$\begin{aligned} E = \begin{pmatrix} e^1_{t+k} & e^1_{t+k+1} & \cdots & e^1_{t+k+q-1} \\ e^2_{t+k} & e^2_{t+k+1} & \cdots & e^2_{t+k+q-1} \\ \vdots \\ e^m_{t+k} & e^m_{t+k+1} & \cdots & e^m_{t+k+q-1} \\ \end{pmatrix} = \begin{bmatrix} e^1 \\ e^2 \\ \vdots \\ e^m \end{bmatrix} \end{aligned}$$
(6)

The average error value \(\text {avg}(e^j)\) (Eq. 8) of the error series \(e^j\) of each communication protocol j (Eq. 7) is calculated for the m monitoring channels with a direct forecast of q horizons.

$$\begin{aligned} e^j & = \begin{bmatrix} e^j_{t+k} & e^j_{t+k+1} & \cdots & e^j_{t+k+q-1} \end{bmatrix} \end{aligned}$$
(7)
$$\begin{aligned} \text {avg}(e^j) & = \frac{1}{q} \sum _{i=t+k}^{t+k+q-1}e^j_{i} \end{aligned}$$
(8)

Errors are reported mainly using the symmetric mean absolute percentage error (SMAPE) (Eq. 9). SMAPE avoids division by zero and is less sensitive to near-zero values than the mean absolute percentage error (MAPE) (Eq. 10).

$$\begin{aligned} \text {SMAPE} & = 100\% \frac{1}{n} \sum _{i=1}^n \frac{|y_i - \widehat{y_i} | }{ \frac{1}{2}(|y_i| + |\widehat{y_i}|)} \end{aligned}$$
(9)
$$\begin{aligned} \text {MAPE} & = 100\% \frac{1}{n} \sum _{i=1}^n \left|\frac{(y_i - \widehat{y_i})}{ y_i } \right|\end{aligned}$$
(10)

The root mean square error (RMSE) (Eq. 11) and the coefficient of determination (\(R^2\)) (Eq. 12) are used as supporting metrics.

$$\begin{aligned} \text {RMSE} & = \sqrt{\frac{1}{n} \sum _{i=1}^n (y_i - \widehat{y_i})^2 } \end{aligned}$$
(11)
$$\begin{aligned} R^2 & = 1 - \frac{ \sum _{i=1}^n \Bigl ( y_i - \widehat{y_i} \Bigr )^2}{\sum _{i=1}^n \Bigl ( y_i - \frac{1}{n} \sum _{j=1}^n y_j \Bigr )^2 } \end{aligned}$$
(12)

All the aforementioned metrics measure the inconsistency between the ground truth y and the prediction \(\widehat{y}\) over n observations. The mean squared error (MSE) (Eq. 13) and mean absolute error (MAE) (Eq. 14) are used as loss functions in model training.

$$\begin{aligned} \text {MSE} & = \frac{1}{n} \sum _{i=1}^n (y_i - \widehat{y_i})^2 \end{aligned}$$
(13)
$$\begin{aligned} \text {MAE} & = \frac{1}{n} \sum _{i=1}^n |y_i - \widehat{y_i}| \end{aligned}$$
(14)
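
The evaluation metrics of Eqs. (9)–(14) can be computed directly, for example with NumPy as sketched below; the sample vectors are illustrative only.

```python
# NumPy implementations of the evaluation metrics in Eqs. (9)-(14).
import numpy as np

def smape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 * np.mean(np.abs(y - y_hat) / (0.5 * (np.abs(y) + np.abs(y_hat))))

def mape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def r2(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def mse(y, y_hat):
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

y_true = np.array([10.0, 12.0, 9.0, 11.0])   # dummy ground truth
y_pred = np.array([11.0, 11.5, 9.5, 10.0])   # dummy prediction
print(round(smape(y_true, y_pred), 2), round(rmse(y_true, y_pred), 2))
```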

The hyperparameter settings for our DL architectures are presented in Table 7 (Sect. 4.3.1). All features are transformed by normalization scaling [62]. The dropout layer works well with our data. The main advantage of using unsupervised (self-supervised) methods is that we do not need to label the data. The main disadvantage is that detected patterns may not be anomalous, but rather intrinsic to the dataset.

3.3 Model online deployment

Teacher forcing is a method for quickly and efficiently training and retraining neural network models using ground-truth values. It is a critical training method used in DL for natural language modeling, such as machine translation, text summarization, image captioning, and many other applications. In some contexts, teacher forcing is also referred to interchangeably as incremental online learning. After the model is first trained, it is used to start the process and produces the predicted output at a predefined interval. The recursive output-as-input approach can be used when training the model for the first time, but it can result in problems such as model instability or poor skill in a dynamically changing environment. Teacher forcing is used to maintain model skill and stability regularly over time [63].

Algorithm 1: Online model execution over streamed data with teacher-forcing periodic model updates

DL comes with more complex architectures, which promise more accurate learning, but at the cost of model complexity and computing resources. Our choice of the GRU architecture is based on a comparative analysis of DL architectures [58] and on the analysis performed in this work for implementing online learning with real-time stream data (Sect. 4.3). The GRU behaves well in the development phase, and teacher forcing is supported by production-grade TensorFlow for deployment.
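
As a rough illustration, a GRU-based multihorizon multichannel forecaster can be expressed in TensorFlow/Keras as sketched below; the layer sizes, dropout rate, optimizer, and window parameters are assumptions for this example and do not reproduce the hyperparameters of Table 7.

```python
# Sketch of a GRU forecaster mapping a (p, m) input window to a (q, m) forecast.
# The layer sizes, dropout rate, and optimizer settings are assumptions for
# illustration only; the deployed model's hyperparameters are given in Table 7.
import tensorflow as tf

p, q, m = 36, 10, 5   # input steps, forecast horizons, monitoring channels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(p, m)),
    tf.keras.layers.GRU(64),              # summarizes the input window
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(q * m),         # direct forecast of all q horizons
    tf.keras.layers.Reshape((q, m)),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# X: (N, p, m) input windows, Y: (N, q, m) targets, e.g., produced by a
# windowing routine such as the earlier make_windows sketch:
# model.fit(X, Y, epochs=10, batch_size=32, validation_split=0.1)
```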

Due to the dynamic nature of the monitoring data, model adaptation and its updating are needed during the deployment, especially when the model is run for a longer period of time. This process is described in (Algorithm 1). The prediction model is deployed within a Kafka consumer that consumes two streams online:

  • \(topic\_in\): used for the prediction, and

  • \(topic\_benign\): used to update the model.

Both streams are processed using a sliding window approach, but each with different length and step settings. The resolution of time (that is, the time step) is set to 10 min according to the domain experience and the delay effect described in Sect. 4.2.2 with the statistical results presented in Fig. 6.

Regarding the prediction task (Algorithm 1, lines 9–15), the sliding window defines the prediction context. The length of the sliding window is \(context\_len\) and corresponds to the number of recent time steps that the model sees in its input. The step \(slide\_step\) equals the model forecast horizon; in our case, it is one time step corresponding to 10 min (Table 7).

Predictions are made continuously at each sliding window step (Algorithm 1, line 13) by feeding the model with a sliding window content; that is, a multivariate time series of features \(features\_in\). The forecasting results in the form of a feature vector \(features\_out\) enriched with model metadata (for example, model name) are sent to the output stream \(topic\_pred\) with the help of a Kafka producer (Algorithm 1, line 14), where they can be further processed by other consumers, such as a consumer feeding Elasticsearch (Fig. 5).

Updates to the online model are made periodically if sufficient training data are received from the benign data stream (Algorithm 1, line 20). In our case, for the 24-h update period and the training data for 24 h, the sliding window length (\(update\_context\_len\)) and its step (\(update\_period\)) are set to 144 (24 h / 10 min = 144).

The complexity of ExecOnlineModel (Algorithm 1) is O(n), where n is the number of messages read by Kafka’s consumer (Algorithm 1, line 7). The algorithm performs inference using the pre-trained model in production, which is not computationally or communication intensive. The model works in place with a regular amount of online stream data from the monitoring log flow (Algorithm 1, lines 2 and 3) in the context of one current sliding window (Algorithm 1, lines 7 to 14).
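
A condensed Python sketch of the execution loop described by Algorithm 1 is given below; the Kafka client usage, the JSON message layout, and the helper make_update_batch are assumptions for illustration rather than the production implementation, while the topic names and the 144-step update window follow the description above.

```python
# Condensed sketch of Algorithm 1: online inference over streamed features with
# periodic teacher-forcing model updates. Kafka usage, the JSON message layout,
# and the default context_len are assumptions made only for this illustration.
import json
from collections import deque
import numpy as np
from kafka import KafkaConsumer, KafkaProducer


def exec_online_model(model, make_update_batch,
                      context_len=36, update_context_len=144,
                      brokers="localhost:9092"):
    """model: trained Keras forecaster; make_update_batch: callable turning a
    (update_context_len, m) array into (X, Y) teacher-forcing training windows."""
    consumer = KafkaConsumer("topic_in", "topic_benign",
                             bootstrap_servers=brokers,
                             value_deserializer=lambda v: json.loads(v.decode()))
    producer = KafkaProducer(bootstrap_servers=brokers,
                             value_serializer=lambda v: json.dumps(v).encode())
    window = deque(maxlen=context_len)          # sliding prediction context
    benign = deque(maxlen=update_context_len)   # benign data for model updates

    for message in consumer:                    # endless stream-processing loop
        features = np.asarray(message.value["features"], dtype=float)
        if message.topic == "topic_benign":
            benign.append(features)
            if len(benign) == update_context_len:        # 24 h of benign data
                X, Y = make_update_batch(np.stack(benign))
                model.fit(X, Y, epochs=1, verbose=0)     # teacher-forcing update
                benign.clear()
            continue
        window.append(features)                          # topic_in message
        if len(window) == context_len:                   # full context available
            forecast = model.predict(np.stack(window)[None, ...], verbose=0)[0]
            producer.send("topic_pred",
                          value={"model": "gru", "forecast": forecast.tolist()})
```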

3.4 Online anomaly detection

Anomalies in network monitoring are identified as significant deviations from the values of the normal activity level [14]. The most important point is the precision of the predicted values \(\widehat{y}\) (Sect. 3.2) compared to the real values y, so that alert warnings are kept at an acceptable level for network administration. The overall anomaly detection score for the m monitoring channels at time t is calculated as:

$$\begin{aligned} score_t = cosine\left( \widehat{y}_{t,upper}, y_t \right) \end{aligned}$$
(15)

where:

\(\widehat{y}^i _{t,upper} = (1 + \alpha ^i)~\widehat{y}^i_t\);

model is the model architecture in production;

\(y_t\) is a vector of real values of m channels at time t;

\(\widehat{y}_t\) is predicted values of \(y_t\) calculated at time \((t-k)\);

\(\widehat{y}_{t,upper}\) is an upper boundary for \(y_t\);

\(\alpha ^i\) is a threshold coefficient for the \(i\)-th channel, expressed in percent, where \(\alpha ^{i} \ge \frac{1}{2}\,\text {SMAPE}^{i}\) for the model in production.

A warning of an abnormal state at time t is raised by the anomaly detection rule applied to multichannel monitoring (Eq. 16), which is activated when

$$\begin{aligned} score_t > \vartheta \ge 0 \end{aligned}$$
(16)

where \(\vartheta\) is a reserve threshold constant.
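
For illustration, the detection rule of Eqs. (15) and (16) can be sketched as follows; here the cosine score is interpreted as the cosine distance (one minus the cosine similarity) between the upper-bounded forecast and the observed vector, and this interpretation, the value of \(\vartheta\), and the channel vectors are assumptions made only for the sketch, while \(\alpha = 0.5\) follows the default used in Sect. 4.4.

```python
# Sketch of the anomaly detection rule in Eqs. (15)-(16). The "cosine" score is
# interpreted here as the cosine distance (1 - cosine similarity) between the
# upper-bounded forecast and the observed vector; this interpretation, the
# theta value, and the channel vectors below are assumptions for illustration.
import numpy as np
from scipy.spatial.distance import cosine


def anomaly_score(y_hat, y, alpha):
    """y_hat, y, alpha: arrays of length m (one entry per monitoring channel)."""
    y_hat_upper = (1.0 + np.asarray(alpha)) * np.asarray(y_hat)   # upper baseline
    return cosine(y_hat_upper, np.asarray(y))                     # Eq. (15)


def is_anomalous(y_hat, y, alpha, theta=0.01):
    return anomaly_score(y_hat, y, alpha) > theta                 # Eq. (16)


y_hat = np.array([120.0, 40.0, 15.0])      # forecast baseline per channel
alpha = np.array([0.5, 0.5, 0.5])          # per-channel threshold coefficients
print(is_anomalous(y_hat, y=np.array([125.0, 42.0, 16.0]), alpha=alpha))  # False
print(is_anomalous(y_hat, y=np.array([900.0, 45.0, 16.0]), alpha=alpha))  # True
```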

Algorithm 2: Online anomaly detection

Conventional threshold-based methods provide a horizontal baseline. DL data modeling provides a nuanced soft baseline in the form of the prediction \(\widehat{y}\) during the monitoring time of the network (Fig. 3). Algorithm 2 works well under IDS production conditions. The core idea is to obtain the predicted values \(\widehat{y}_t\) and use them as a flexible baseline (or threshold). The result is indicated as the blue line in Fig. 10 (Sect. 4.4), where the green line indicates the real values y.

4 Experiments and evaluation

The cooperation architecture (Fig. 1) based on the Kafka concept for ZEEK log stream processing (Fig. 2) is realized as a production testing environment as follows:

  • Deployment of the cooperation architecture (Sect. 3.1) as a running software stack (Sect. 4.1) with online model deployment for multihorizon multichannel forecasting.

  • Online stream data in cooperation with ZEEK supervising network activities (Sect. 4.2);

The details of the experiments carried out in the production testing environment are presented as follows:

  • Stream data processing (Sect. 4.2.1);

  • Online delayed log effect (Sect. 4.2.2);

  • Forecasting quality (Sect. 4.3);

  • Anomaly detection results (Sect. 4.4).

4.1 Software stack deployment

The deployment [64] of the complete software stack for data streams in our work follows automation and orchestration with declarative Ansible semantics. A snapshot of the deployment is shown in Fig. 4, with the component serving instances listed in Table 6.

Table 6 Component serving instances

The Ansible YAML file uses a declarative composition to easily compose, deploy, and scale the software stack (Fig. 4) for our deployment into the testing environment. YAML is a human-readable data serialization language that serves the same purpose as Extensible Markup Language (XML) but with Python-style indentation for nesting. Compared to XML, YAML has a more compact format.

Fig. 4: Snapshot of the complex software deployment stack running in its production testing environment

4.2 Online stream data

The next part of our work is cooperation with one instance of ZEEK working in production. The instance is installed on the network routing machine. ZEEK produces two kinds of logs:

  • Compressed log files: ZEEK collects metadata on ongoing activity within the monitored network. This activity is recorded as transaction logs, which are collated into compressed textual log files grouped by date, protocol, and capture time within specific directories under $PREFIX/logs. These compressed logs are kept for long-term storage [65] and can be easily searched and analyzed to identify patterns and trends in network traffic [58].

  • Online stream log files: ZEEK analyzes traffic according to the predefined policy setting and writes the results of real-time network activity into a dedicated directory $PREFIX/logs/current on the ZEEK monitoring server. These incremental logs are stored as uncompressed log files [66], which serve as the source of streaming data for subsequent analysis and modeling in this work.

By default, ZEEK regularly takes all logs from $PREFIX/logs/current, compresses them by date and time, and archives them to $PREFIX/logs. The frequency is set to every hour by default and can be changed in the configuration file ($PREFIX denotes the ZEEK installation directory).

Online logs are in a human-readable (ASCII) format, and the data are organized into tab-delimited columns. Logs that deal with the analysis of a network protocol often start with:

  • a timestamp,

  • a unique connection identifier (UID),

  • a connection 4-tuple (originator host/port and responder host/port),

  • the protocol-dependent activity that is occurring.

Apart from the conventional network protocol-specific log files, ZEEK also generates other important log files based on network traffic statistics, interesting activity captured in the traffic, and detection-focused information, for example, conn.log, notice.log, known_services.log, and weird.log.

4.2.1 Data stream processing

Before any further analysis, an initial analysis of the transaction logs was performed to filter out white noise protocols with stochastic characteristics (Sect. 3.2). This filtering process helps to remove irrelevant or noisy data from the dataset, improving the accuracy and efficiency of subsequent analysis steps. For network protocols that pass data validation tests and are used for modeling, it is important to recognize that time-series data have characteristics of ordered time-dependency sequences, where each observation has a temporal dependency on previous observations. Preserving this temporal dependency is crucial for accurate modeling.

For incremental filtered log files conn.log, dns.log, http.log, sip.log, ssh.log, ssl.log of network protocols, dedicated Kafka producers are launched to monitor them for new records in real time. New records are treated as live streams (Fig. 5) and parsed as soon as they appear in the log files.

Fig. 5: High-level view of the stream data flow

The parsed logs are immediately sent to the Kafka cluster. They are organized by topic and key (Fig. 2), where the topic refers to message origin (a particular ZEEK instance) and the key refers to originating log files (e.g., conn, http).

In our work, aggregators are used to consume ZEEK messages from Kafka and compute statistics over predefined sliding windows (10 min, 30 min, or 1 h) to preprocess the streaming data. The aggregated statistics are pushed back to Kafka as new messages under dedicated topics (e.g., ’zeek1-agg1-10 m’) and keys (e.g., conn, http). Aggregations are computed according to a specified set of declarative rules. There are two types of aggregation rules:

  • The agg rule computes the number of total and unique connections (for instance, for the conn log file), the sum of received and sent bytes, and the average duration of connections within the sliding window.

  • The groupby rule groups data within the sliding window by a specified column before computing the aggregation statistics.

The groupby aggregation rule is useful if we want to compute statistics for particular column values. An example of such statistics is counting the total number of true and false values of the column AA within the sliding window, as follows.

  • The rule leads to the creation of two columns in the resulting record; i.e., dns_AA_T_count for the true value count and dns_AA_F_count for the false value count.

  • Column names are automatically generated according to the original log file and the expansion of the aggregation rules.

Before applying the aggregation rules, the data are grouped within the sliding window by direction of communication into three groups:

  • in,

  • out,

  • internal.

The direction is resolved by IP address, where applicable, and the result is taken from the perspective of the monitored network. It should be noted that, before applying any aggregation rules, there is an option to prepare the consumed data using declarative data cleaning rules.
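
To make the two rule types concrete, the following pandas sketch computes agg-type statistics for one 10-min window and the groupby-type counts of the dns AA flag mentioned above; the input layout, column names, and dummy values are assumptions modeled on the description, not the production implementation.

```python
# Sketch of the two aggregation rule types over one 10-min sliding window.
# Column names and the input layout are illustrative assumptions.
import pandas as pd

# Parsed conn.log records that fell into the current window (dummy values).
conn = pd.DataFrame({
    "uid": ["C1", "C2", "C2", "C3"],
    "orig_bytes": [100, 250, 40, 0],
    "resp_bytes": [900, 120, 10, 0],
    "duration": [1.2, 0.4, 0.1, 3.0],
})
# agg rule: totals, unique connections, byte sums, average duration.
agg_record = {
    "conn_total_count": len(conn),
    "conn_unique_count": conn["uid"].nunique(),
    "conn_sent_bytes_sum": int(conn["orig_bytes"].sum()),
    "conn_recv_bytes_sum": int(conn["resp_bytes"].sum()),
    "conn_duration_avg": float(conn["duration"].mean()),
}

# groupby rule: count true/false values of the dns AA flag in the window.
dns = pd.DataFrame({"AA": [True, False, True, True]})
counts = dns["AA"].value_counts()
agg_record["dns_AA_T_count"] = int(counts.get(True, 0))
agg_record["dns_AA_F_count"] = int(counts.get(False, 0))
print(agg_record)   # this record would be pushed back to Kafka as a new message
```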

4.2.2 Online delayed logging effect

The analysis of more than 2.1 billion records collected over a 32-month period revealed that the logs in the incremental files are sometimes delayed. Delays present a problem for online aggregation, because delayed logs are not included in their aggregator sliding window. Figure 6 illustrates the analysis of the conn.log data, with the graph showing how long an aggregator should wait for delayed logs to receive the desired percentage of incoming logs for the relevant sliding window; e.g., a 300-s waiting interval is needed to obtain \(97\%\) of the logs for the actual sliding window.

Fig. 6: Statistics of delayed logs in conn.log for a monitoring period of 32 months

The conn.log file is one of the most important ZEEK logs. It contains logs of TCP/UDP/ICMP connections that have been completed [65]. Log records are written to the incremental log file at the time of completing a connection; i.e., when the connection is closed gracefully or abruptly. There is also a third option when a connection is considered completed, despite the fact that it might still be alive. It is the case when it exceeds the inactivity timer set by ZEEK. This timer is set to 1 h by default, so longer connections are incorrectly broken into multiple sessions.

However, the timer is not the main cause of delays, as up to \(97\%\) of logs have delays below this timer. User datagram protocol (UDP) and internet control message protocol (ICMP) connections are interpreted by ZEEK using flow semantics; that is, the sequence of packets from a source host/port to a destination host/port is grouped. Each log contains ts and duration fields, where ts is the time of the first packet and duration is the elapsed time between the first and last packets in a session. For 3-way and 4-way connection teardowns, the final acknowledgment (ACK) packet is not included in the duration computation, which could be a reason for the delays.

The delays in the analysis were calculated as follows: the logs are iterated chronologically from the beginning, as they were recorded, while computing the end timestamp ts_end of each transaction log as \(\texttt {ts} + \texttt {duration}\) (Eq. 17).

$$\begin{aligned} ts\_end = ts + duration \end{aligned}$$
(17)

Logs with missing duration were skipped. The difference between consecutive logs is computed as in Eq. 18. A negative difference diff is considered a delay.

$$\begin{aligned} \textit{diff} = ts\_end_t - ts\_end_{t-1} \end{aligned}$$
(18)
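
The delay computation of Eqs. (17) and (18) can be sketched as follows, assuming the conn.log records are available as a pandas DataFrame with ts and duration fields; the dummy values are illustrative.

```python
# Sketch of the delayed-log analysis in Eqs. (17)-(18): iterate logs in recorded
# order, compute ts_end = ts + duration, and treat negative differences between
# consecutive logs as delays. The input DataFrame layout is an assumption.
import pandas as pd

conn = pd.DataFrame({                       # logs in the order they were written
    "ts":       [100.0, 130.0, 118.0, 125.0, 140.0],
    "duration": [5.0,   2.0,   2.0,   None,  0.5],
})
conn = conn.dropna(subset=["duration"])     # skip logs with missing duration
conn["ts_end"] = conn["ts"] + conn["duration"]          # Eq. (17)
conn["diff"] = conn["ts_end"].diff()                    # Eq. (18)
delays = -conn.loc[conn["diff"] < 0, "diff"]            # negative diff = delay
print(delays.describe())   # quantiles indicate the waiting time for 97% coverage
```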

Based on our experience and experiments with the domain data, an optimal sliding window of 10 min with a 10-min step is set for data aggregation. The waiting time for delayed logs is set to 5 min, which according to the analysis presented above covers up to \(97\%\) of the actual window logs. With these settings, the data aggregator is deployed as a Kafka consumer/producer, which reads new Kafka logs and performs the data aggregation.

4.3 Forecasting quality

In addition to DL data modeling (Sect. 3.2), cross-validation based on a rolling window is used; that is, the data are divided into fixed-size sliding windows. This approach allows us to train the model on one subset of the data and then test it on a different subset. By repeating this process, we assess the model performance and make sure that it generalizes well to new data without overfitting and makes accurate predictions on incoming online stream data.
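
A minimal sketch of such rolling-window validation using scikit-learn's TimeSeriesSplit is given below; the number of splits and the dummy data are assumptions for illustration.

```python
# Sketch of rolling-window cross-validation for the time-ordered windows
# described above. TimeSeriesSplit preserves temporal order, so each test
# fold lies strictly after its training fold; split counts are assumptions.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(1000, 36, 5)    # dummy input windows (N, p, m)
Y = np.random.rand(1000, 10, 5)    # dummy multihorizon targets (N, q, m)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, Y_train = X[train_idx], Y[train_idx]   # earlier windows only
    X_test, Y_test = X[test_idx], Y[test_idx]       # later, unseen windows
    # model.fit(X_train, Y_train); score = model.evaluate(X_test, Y_test)
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```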

4.3.1 Hyperparameter setting

The complete setting of the hyperparameters is presented in Table 7. Hyperparameter tuning is performed on the basis of the Bayesian optimization sequential design strategy [67] in combination with preliminary exploratory online data analysis (Sect. 4.2.2) and our previous experience [68] with data and DL modeling in the domain.

Table 7 Hyperparameter setting

4.3.2 Model performance

The deployed model works well, simultaneously monitoring the \(conn\_in\), \(conn\_out\), \(dns\_out\), \(http\_in\), and \(ssl\_in\) network protocols. SMAPE and cosine are used as the main metrics for measuring model performance; the rest are auxiliary supporting metrics. In practice, a larger number of metrics is used during model development and deployment. It is appropriate to mention that the range of SMAPE is \([0\%, 200\%]\) (Eq. 9).

  • The forecasting quality for multiple (1–10) horizons in one go with SMAPE (Tables 8, 9). The best model performance values (lower values indicate better quality) for the combinations of protocol and horizon are highlighted in bold across these two tables for the DL models MLP, CNN, GRU, s2s-GRU, CNN-GRU, and transformer.

  • The forecasting quality for multiple (1–10) horizons in one go with cosine (Tables 10, 11). The best model performance values (higher values indicate better quality) for the combinations of protocol and horizon are highlighted in bold across these two tables for the DL models MLP, CNN, GRU, s2s-GRU, CNN-GRU, and transformer.

  • The forecasting stability for one-step-ahead of GRU with SMAPE (Table 12, Fig. 8) and cosine (Table 13, Fig. 9).

Table 8 MLP, CNN, GRU: model performance by SMAPE for multihorizons (1–10)
Table 9 s2s-GRU, CNN-GRU, Transformer: model performance by SMAPE for multihorizons (1–10)
Table 10 MLP, CNN, GRU: model performance by cosine for multihorizons (1–10)
Table 11 s2s-GRU, CNN-GRU, Transformer: model performance (cosine) for multihorizons (1–10)

Experiments were carried out to measure the quality of direct multihorizon forecasts for multiple channels. The selected results are presented as follows: Tables 8, 9, 10 and 11 present the model quality for multiple horizons (1–10) forecast simultaneously in one go with the SMAPE and cosine metrics. The model works simultaneously for all selected network protocols. The main key findings based on the evaluations of the experiments are as follows:

  • The forecasting quality is not equal for all protocols due to their dynamic nature in the monitored network environment. Specifically, \(dns\_out\) and \(http\_in\) have sharper and spikier profiles (Fig. 10), and their fluctuations are more strongly reflected in the metrics, especially in SMAPE.

  • The cosine metric corresponds more or less to SMAPE for all monitored protocols. The average error values for the multihorizon forecast in one go are higher than for the one-horizon forecast.

  • The result of the multichannel forecast for one horizon in one go is slightly better than the multihorizon multichannel forecast in one go (Tables 12, 13).

  • The quality of the model decreases gracefully with time, which is expected and is the reason why teacher forcing is used for online deployment.

Fig. 7: GRU versus transformer performance for multihorizon forecast in one go. Lower values indicate better model quality by SMAPE in (a, b); higher values indicate better quality by cosine in (c, d)

Tables 12 and 13 present the prediction stability of the n-step-ahead forecast with the GRU model. The values are within stable ranges for the various metrics (as visualized in Figs. 8, 9); therefore, we do not emphasize ranges or highlight the best values in bold as in the other tables.

Table 12 GRU forecasting stability in n-step ahead forecast (SMAPE)
Fig. 8: GRU forecasting stability with SMAPE for multiple-step-ahead forecasting

Table 13 GRU forecasting stability in n-step ahead forecast (cosine)
Fig. 9: GRU forecasting stability with cosine for multiple-step-ahead forecasting

Regarding the selection of the best trained model to deploy, the following key findings are concluded:

  • GRU, CNN-GRU, and transformer are the top three best models;

  • MLP model can be considered as the baseline for performance comparison, and there is no big difference between MLP and CNN models;

  • s2s-GRU acts as an encoder–decoder with GRU blocks. The difference between the performance of GRU and that of s2s-GRU is small, slightly in favor of GRU;

  • CNN-GRU can be seen as an improvement that combines the advantages of both CNN and RNN. It is better than CNN and nearly as good as GRU;

  • The transformer gives interesting and impressive performance. Its values in the four tables are averages over 10 training runs. In some runs, the model is superior and outperforms the other models; in other runs, the training is not stable, which can be caused by the dropout rate and normalization layers. As a consequence, teacher forcing for the transformer is more complicated than for the other models, in particular GRU, requiring, for example, learning-rate warm-up and linear learning-rate decay to ensure model stability.

In summary, GRU belongs among the best models trained with our data. We chose GRU over the transformer for deployment after considering and comparing the possible deployment obstacles of both models. GRU (Fig. 7a, c) has simpler requirements with comparable or better performance compared to the transformer (Fig. 7b, d).

The effect of the direct multihorizon forecast in network monitoring: multihorizon forecasts (Tables 8, 9, 10, 11) have slightly higher errors compared to the forecast for one horizon (Tables 12, 13). This effect is expected due to the stochastic character of network monitoring and the more complex computational models of multihorizon forecasting on multiple channels (Figs. 8, 9). It also reflects the averaging of errors over multiple horizons (Sect. 3.2).

4.4 Anomaly detection results

Figure 10 illustrates the proactive prediction of the normal state as the blue soft line, which is produced 10 min in advance of the real monitoring values of the http and ssl protocols (green lines). The http protocol has a clear pattern of connections, which is modeled well by the DL approach. Such patterns are difficult to model with traditional approaches (Sect. 2).

Fig. 10: Kibana dashboard for online monitoring (HTTP and SSL protocols) backed by Kafka and Elasticsearch in production with the GRU model deployed on stream data. The blue line is the forecast used as the baseline over time. The green line shows the real monitoring values, with anomalies indicated as significant excesses relative to the baseline (color figure online)

The abnormal state in proactive network monitoring is also illustrated in Fig. 10, where the spikes of the green lines (real monitoring values) significantly exceed the blue soft line of the proactively predicted states. The difference between the two lines (blue and green) is calculated in Sect. 3.4 as the angle between two vectors (Eq. 15) with the threshold rule (Eq. 16).

It is worth noting that monitoring systems usually do not show two lines (blue, green) as in the Kibana dashboard in Fig. 10, but only a single line of real monitoring values.

The next detected anomaly is shown in Fig. 11, where the alerts are displayed as red alert areas. The proactive effect is 10 min in advance (with \(k=1\), that is, one-horizon-ahead forecasting, Fig. 11), in favor of the network administrators, who gain time to react to the abnormal situation.

Fig. 11: Alert warning in the Kibana production dashboard with the GRU model deployed: the forecast baseline (gray) and the actual state (blue), i.e., the real number of unique incoming connections within a sliding 10-min window. Further inspection of the alert-flagged windows (red) revealed excessive port scanning activity during the Russian–Ukrainian conflict from the outside to the monitored network (color figure online)

However, the concrete type of anomaly, excessive port scanning, is not identified by our anomaly-based detection; it was reported as feedback from the network administrator after log examination. Our anomaly-based detection only alarms on the anomalous state (red alert area) in network security according to the high deviation of the actual connection number (gray area) from the baseline (blue area), using the anomaly detection rule (Sect. 3.4, Eq. 16).

The proactive time effect is configurable (with the steps-ahead value \(k > 1\)), which can give network administrators more time to react. The number of warning alerts is controlled by the value \(\alpha\) described with Eq. 15. The higher the value of \(\alpha\) (in the range of 0 to 1), the lower the number of anomaly warning alerts. Our default value of \(\alpha\) is set to 0.5 for the testing production environment (Fig. 4).

The deployed dashboard (Fig. 10) allows us to add a dynamic adaptive threshold baseline to improve situational awareness. The development of the entire software stack using declarative composition simplifies AIOps deployment in practice.

Currently, control over the benign data stream is up to the SOC operators, who filter malicious data out of this stream based on the monitoring results. In the event of an anomaly detected by the rule (Eq. 16), the update of the model could also be performed automatically, without manual intervention, in the automatic teacher forcing process (update and exchange of the model). The initial model, which is used as the starting model for online prediction, is always trained on benign historical data. When the model is already running online, its quality can be validated retrospectively. If the prediction deviation stays below a threshold, or the anomaly level of the data does not exceed the predefined threshold (the parameter \(\alpha\) in Sect. 3.4), the data can be considered benign and used to update the model. If the deviation is above the threshold, an anomaly alert is raised and the monitoring data are excluded from the teacher forcing training process (Fig. 5).

We prefer to deploy a single model over multiple or ensemble models. Even the deployment of a single DL model is not an easy task; it requires building a complex pipeline for every model retraining. Combining more models in production further increases the deployment and production costs of online, real-time model and data curation.

5 Conclusion

This paper has presented the development of a DL-based multihorizon forecast model that simultaneously models multiple monitoring channels in an unsupervised manner, allowing a comprehensive view of network activities. With the trained DL model, the deployment of an AIOps module for proactive network security monitoring is realized in cooperation with an IDS working in real-time production.

The deployment of an intelligent module into a production environment is often accompanied by various obstacles, which are presented in this work. Moving forward, there are several directions for future work. A promising approach involves integrating innovative concepts for hybrid detection that combine anomaly-based detection with misuse-based detection for attack classification. It is also imperative to validate anomaly detection scores using well-defined labeled reference datasets. Data curation and model maintenance in production can be time intensive and require thorough verification, especially in the context of evolving data distribution and scale. Another direction is distributed and federated learning for interoperability to reduce the sensitive data sharing silos in cybersecurity.