GenAD: General unsupervised anomaly detection using multivariate time series for large-scale wireless base stations

The reliability of wireless base stations is essential to guarantee user experience in wireless networks; anomaly detection on multivariate time series is therefore indispensable for network operators to monitor the behaviours of large-scale wireless base stations. In this paper, a general unsupervised anomaly detection model using multivariate time series, called GenAD, is proposed for large-scale wireless base stations. Firstly, multi-correlation attention and time-series attention are employed to learn representations of the complex correlations and various temporal patterns of multivariate series. Secondly, a general model is pre-trained on large-scale wireless base stations with self-supervision, which can be easily transferred to a specific station with a small amount of training data. Experiments show that GenAD boosts the F1-score by up to 9% on real-world datasets.

✉ Email: zhulinyj@chinamobile.com
Introduction: With the large-scale commercialization of 5G, there are millions of wireless base stations (WBS) deployed in mobile networks, serving the connected users. Once an anomaly occurs within a WBS, the connected users suffer from issues such as dropped calls and slow access to the Internet. Thus, monitoring the behaviours of large-scale WBSes is of vital importance to guaranteeing user experience. The behaviours of a WBS can be characterized by multiple performance metrics, e.g. wireless connection rate. If one of these metrics becomes anomalous, the WBS is likely to suffer from performance degradation. These metrics are continuously collected at a predefined time interval and form multivariate time series (MTS). Due to anomaly diversity and the lack of training labels, the detection of WBSes' anomalous behaviours is formulated as unsupervised anomaly detection using MTS.
Recently, several deep learning-based methods [1][2][3] have been proposed for unsupervised anomaly detection using MTS. But there are two challenges in applying these methods in wireless networks: 1) The complex correlations and dynamic patterns of MTS. There exist complex inter-dependencies (correlations among time series) and various intra-dependencies (temporal patterns within one time series) in the MTS of each WBS. Additionally, these inter-dependencies and intra-dependencies of each WBS are dynamic, e.g. some WBSes behave in one pattern on weekdays, while the pattern can change during holidays. The existing anomaly detection methods [4][5][6] either do not learn the inter-dependencies or intra-dependencies well, or lose the dynamic characteristics. 2) The heterogeneity of large-scale WBSes. Owing to various surrounding environments and different manufacturers, millions of WBSes behave in various patterns. It is infeasible to train one anomaly detection model for all WBSes.
To address the abovementioned challenges, we propose a general unsupervised anomaly detection model using multivariate time series, called GenAD. Firstly, GenAD proposes multi-correlation attention and time-series attention to represent the complex correlations and temporal patterns of MTS simultaneously. Specifically, the attention mechanism is introduced to capture the dynamics among time series and within one time series, and multi-head attention and hidden layers are utilized to capture the non-linear, coupling, and higher-order correlations, as well as the trend, delay, and periodicity of N-dimensional series. Secondly, GenAD pre-trains a general model on large-scale WBSes with self-supervision. To adapt to each station-specific pattern, we fine-tune the model with only a small quantity of data. Lastly, we propose a two-level dynamic threshold setting strategy to achieve effective anomaly detection.
Dataset description and problem setup: In this work, we create the WBS dataset, which has been collected from the real-world mobile networks of China Mobile (one of the major mobile operators in China). The wireless base stations gather 18 types of key performance metrics with a 15-min sampling interval over a period of 10 days in 2021. For each WBS, these collected data constitute an 18-dimensional MTS, and each MTS has 960 data points. More details of the WBS dataset can be found in Table 1. We first randomly select 3000 WBSes (unlabelled data) with an 18-dimensional MTS in each station. Beyond these 3000 WBSes, we randomly pick three areas, marked as WBS-Area1, WBS-Area2, and WBS-Area3, and then choose 10 WBSes (labelled data) in each area at random. The MTS of each station is divided into two parts of equal length: the training set has 480 points (insufficient to train the existing deep models) and the testing set has 480 points. Given the multivariate time series X of a WBS with N dimensions and T time points, our goal is to determine whether there exist anomalous segments of X at future time points by learning the complex correlations and various temporal patterns of X.
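The dataset layout and split described above can be sketched as follows. This is a minimal illustration with synthetic data; the array shapes follow the paper (18 metrics, 15-min sampling, 10 days), while the random values are placeholders for real KPI measurements.

```python
import numpy as np

# Sketch of the WBS dataset layout described above (synthetic data):
# 18 metrics sampled every 15 min over 10 days -> 960 points per series.
N_METRICS = 18
POINTS_PER_DAY = 24 * 60 // 15   # 96 samples/day at a 15-min interval
T = 10 * POINTS_PER_DAY          # 960 points in total

rng = np.random.default_rng(0)
X = rng.normal(size=(N_METRICS, T))  # one WBS as an 18 x 960 MTS

# Split each station's MTS into equal halves: 480 points for
# training/fine-tuning and 480 points for testing.
X_train, X_test = X[:, : T // 2], X[:, T // 2:]
assert X_train.shape == (18, 480) and X_test.shape == (18, 480)
```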
Overall architecture: Figure 1 depicts the overall architecture of the proposed general unsupervised anomaly detection model using multivariate time series for large-scale wireless base stations. The wireless base stations collect the performance metrics for anomaly detection. The proposed GenAD model implements anomaly detection by assessing the reconstruction error between the original series and the reconstructed series. It consists of three parts: multiple hidden layers with time-series attention and multi-correlation attention in each layer, a linear layer, and a loss function. Consequently, the anomaly detection results can be utilized to reconfigure the network and ensure user experience in mobile networks.
Random masking for input series: Due to the limited storage resources of WBSes, only a small quantity of data can be leveraged to train the anomaly detection model for each WBS. To reconstruct time series well on the limited training data, GenAD randomly selects series for masking and reconstructs them from the unmasked series. Because GenAD is unaware of which series have been picked for masking or have been employed for reconstruction during model training, this forces the model to automatically learn the correlations and temporal patterns of MTS and minimize the reconstruction loss. Specifically, we choose 20% of the N dimensions to be masked with a fixed random time series in T e . As demonstrated in Figure 1, the entire time series is divided into five segments. T a , T b , T c , and T d have the same length, which is equivalent to that of T e . The remaining 80% original series and 20% masked series in T e , along with all series in T a to T d , are input to GenAD.
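A minimal sketch of the masking step, assuming (hypothetically) a segment length of 96 points for T_e; the paper does not state the exact segment length, only that all five segments are equally long. 20% of the 18 dimensions (i.e. 3 series) are overwritten with a fixed random filler series:

```python
import numpy as np

rng = np.random.default_rng(42)
N, seg_len = 18, 96             # seg_len for T_e is an assumption
X_te = rng.normal(size=(N, seg_len))

# Pick 20% of the N dimensions and overwrite them with a fixed
# random series, as in GenAD's self-supervised masking.
n_masked = int(0.2 * N)         # 3 of 18 dimensions
masked_dims = rng.choice(N, size=n_masked, replace=False)
mask_series = rng.normal(size=(n_masked, seg_len))  # fixed random filler

X_masked = X_te.copy()
X_masked[masked_dims] = mask_series
```

During training the model sees `X_masked` (plus the unmasked segments T_a..T_d) and is asked to reconstruct the original values of the masked rows.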

Multi-correlation attention and time-series attention:
Existing deep learning-based methods employ convolutional neural networks and autoencoders to represent the correlations of MTS; after offline model training, these correlations do not change during online inference. However, the MTS of WBSes are not identically distributed. For instance, there may be a strong correlation between x i and x j of X in (0, T 1 ), but once the distribution of x i and x j changes, the correlation may become weak or disappear in (T 1 , T ). Hence, GenAD designs multi-correlation attention to capture dynamic correlations, along with non-linear, coupling, and high-order correlations.
The input of multi-correlation attention is all the MTS in T e (including the masked and unmasked series), as depicted in Figure 1. The original and masked series in T e are denoted as X Te , and the series reconstructed from their surroundings as X̂ Te (only the masked series are reconstructed, not the whole series). Given one of the masked series x i Te in X Te as input, its reconstruction x̂ i Te is obtained by attending over all N series, where Q i Te is the query transformation of x i Te and K j Te is the key transformation of the jth series. Since the MTS in T e still change online, the products of Q i Te and K j Te also keep changing, such that the dynamic correlations between x i Te and all series are captured, while the non-linear and coupling correlations are captured by the ReLU activation function and the dot-product operation. Furthermore, we introduce multi-head attention to strengthen the ability to represent correlations, and stack multiple layers of multi-correlation attention to capture higher-order correlations.
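The core of the mechanism can be sketched as below. This is an illustrative single-head version under assumptions not stated in the paper (row normalization of the scores, random projection matrices, hidden dimension d=32); the paper only specifies the ReLU-of-dot-product form and learned transformations:

```python
import numpy as np

def multi_correlation_attention(X, Wq, Wk, Wv):
    """Sketch: reconstruct each series in T_e from all N series.

    X: (N, d) -- each row is one series' representation in T_e.
    Wq/Wk/Wv: (d, d) learned projections (randomly initialised here).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # ReLU of the dot-product captures non-linear, coupled correlations;
    # scores are recomputed at inference, so correlations stay dynamic.
    scores = np.maximum(Q @ K.T, 0.0)                       # (N, N)
    weights = scores / (scores.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ V                                      # (N, d)

rng = np.random.default_rng(0)
N, d = 18, 32
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
X_hat = multi_correlation_attention(X, Wq, Wk, Wv)
assert X_hat.shape == (N, d)
```

A multi-head variant would repeat this with several independent projection triples and concatenate the outputs; stacking such layers yields the higher-order correlations mentioned above.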
Existing deep learning-based methods employ gated recurrent units and long short-term memory to represent the temporal patterns of MTS, but they struggle to capture the various temporal patterns (including dynamics, trend, delay, and periodicity) of N-dimensional series simultaneously, especially when N is large. To this end, GenAD employs time-series attention to represent the various temporal patterns of MTS.
Given x k in X as input, x k in time (T 1 , T ) is masked with a fixed random series, denoted as x k Te . The input of time-series attention is x k in (0, T ), including x k Ta , x k Tb , x k Tc , x k Td , and x k Te , as shown in Figure 1. The reconstructed series of x k Te is x̂ k Te , obtained by attending over the segments of x k in (0, T ). Ultimately, we adopt the fusion of time-series attention and multi-correlation attention to capture the temporal patterns and correlations of MTS simultaneously. Each layer captures both the temporal patterns of a series and the correlations among series. The fusion is achieved by learning general attention representations that can establish both types of attention at once.
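One possible reading of the fusion step is sketched below: time-series attention attends over the five segments of each series, multi-correlation attention attends across the N series in T_e, and the two views of T_e are combined. The averaging fusion, the shared attention function, and the hidden dimension are assumptions for illustration; the paper describes the fusion only as a learned joint representation.

```python
import numpy as np

def attention(Q, K, V):
    """ReLU dot-product attention with row normalization (assumed form)."""
    scores = np.maximum(Q @ K.T, 0.0)
    w = scores / (scores.sum(axis=1, keepdims=True) + 1e-8)
    return w @ V

rng = np.random.default_rng(1)
N, n_seg, d = 18, 5, 32             # 5 segments: T_a .. T_e
H = rng.normal(size=(N, n_seg, d))  # hidden states per series and segment

# Time-series attention: each series attends across its own segments,
# so T_e can be reconstructed from T_a..T_d (temporal patterns).
time_out = np.stack([attention(H[i], H[i], H[i]) for i in range(N)])

# Multi-correlation attention: in segment T_e, attend across the N series.
corr_out = attention(H[:, -1], H[:, -1], H[:, -1])   # (N, d)

# Fusion: one simple choice is to average the two views of T_e.
fused = 0.5 * (time_out[:, -1] + corr_out)
assert fused.shape == (N, d)
```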
Pretraining strategy for large-scale WBSes: To tackle the challenge regarding the heterogeneity of large-scale WBSes, we first pre-train GenAD with the unlabelled MTS data from 3000 real-world WBSes to learn general representations. Then, to adapt to the specific patterns of each WBS, we fine-tune GenAD with only a small quantity of data from each WBS. Concretely, only 480 data points of each WBS are required to fine-tune the general model; the remaining 480 data points are utilized to test the model performance. Note that the pretraining strategy enables the model to obtain universal feature representations and thus reduces the underfitting problem on a small dataset.
Anomaly detection based on two-level dynamic threshold: As GenAD detects anomalies by evaluating the reconstruction error between the original series and the reconstructed series, we set a two-level dynamic threshold, including the metric-level threshold λ m and the entity-level threshold λ e . For the metric level, if the reconstruction error of one time series at time t is greater than λ m , that time series is declared anomalous. For an N-dimensional entity at time t , if there are M anomalous time series with M > λ e , the entity is declared anomalous.
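The two-level decision rule above can be sketched directly; the threshold values used here are hypothetical placeholders, since λ m and λ e are set dynamically as described next:

```python
import numpy as np

def detect_entity(errors, lam_m, lam_e):
    """Two-level decision at one time step.

    errors: (N,) reconstruction errors of the N metrics at time t.
    lam_m:  metric-level threshold on each reconstruction error.
    lam_e:  entity-level threshold on the count of anomalous metrics.
    """
    anomalous_metrics = errors > lam_m   # per-metric decision
    M = int(anomalous_metrics.sum())
    return M > lam_e                     # entity-level decision

errors = np.array([0.1, 0.9, 1.2, 0.05, 0.8])
assert detect_entity(errors, lam_m=0.5, lam_e=2) is True   # 3 metrics exceed 0.5
assert detect_entity(errors, lam_m=1.0, lam_e=2) is False  # only 1 metric exceeds 1.0
```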
GenAD determines the dynamic anomaly threshold from the anomaly rate. Let the anomaly rate be A R , the anomaly threshold be λ, and the reconstruction error of a single dimension or of all dimensions in T e be E Te . Then

A R = P(E Te > λ).    (4)

Given that the probability density function of E Te is f(e) and the probability distribution function is F(e) = ∫_{−∞}^{e} f(t) dt, Equation (4) can be written as

λ = F^{−1}(1 − A R ).    (5)

To determine the dynamic anomaly threshold λ, we need to obtain the probability distribution function F(e) of E Te . The intuitive idea for getting F(e) is to analyse the statistical features of E Te , e.g. whether E Te obeys a Gaussian distribution or t-distribution. But this is not available in large-scale service or equipment scenarios, as the statistical features of each service vary and there are few data per service to analyse. Besides, OmniAnomaly [6] applies extreme value theory (EVT) [7] to estimate the parameters of the distribution of E Te ; nonetheless, its high complexity makes obtaining the anomaly threshold λ inefficient. Unlike the abovementioned methods, we estimate the probability density function f(e) of E Te directly from the observed samples of E Te : within each sampling interval Δe, f(e) is approximated by the fraction of samples falling into that interval, where Num E is the number of samples of E Te , E Te i is the ith sample of E Te , and Δe is the sampling interval of E Te for f(e). Then we integrate f(e) to get the probability distribution function F(e) = ∫_{min E Te}^{e} f(t) dt, and obtain the dynamic threshold λ by Equation (5). Note that, in order to reduce the error of parameter estimation through sampled data, we set A R = η + A R in Equation (5), where η ∈ [0, 0.01] is fixed to maximize the F1-score during validation.
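Since integrating the empirical density and inverting F is equivalent to taking an empirical quantile, the threshold computation can be sketched compactly. The exponential error distribution and the specific η value are assumptions for illustration only:

```python
import numpy as np

def dynamic_threshold(errors, anomaly_rate, eta=0.005):
    """Estimate lambda such that P(E > lambda) ~= anomaly_rate + eta.

    Uses the empirical distribution of observed reconstruction errors
    instead of a parametric (Gaussian / EVT) fit; eta is the small
    margin in [0, 0.01] that absorbs sampling error.
    """
    rate = anomaly_rate + eta
    # lambda is the (1 - rate) empirical quantile of the errors,
    # i.e. F(lambda) = 1 - rate for the empirical CDF F (Equation (5)).
    return float(np.quantile(errors, 1.0 - rate))

rng = np.random.default_rng(0)
errors = rng.exponential(scale=1.0, size=10_000)  # synthetic errors
lam = dynamic_threshold(errors, anomaly_rate=0.01)
# Roughly 1.5% of errors exceed the threshold (rate + eta).
frac_above = float((errors > lam).mean())
assert 0.01 < frac_above < 0.02
```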
Experimental results and analysis: To evaluate the model performance, we compare the proposed GenAD model with the following baseline models: LSTM-NDT [8], OmniAnomaly [6], and MSCRED [5]. Table 2 reports the precision, recall, and F1-score on the WBS dataset, where GenAD(G) represents GenAD with the pretraining strategy and the best scores are highlighted in bold. From Table 2, GenAD(G) outperforms all baseline methods and boosts the F1-score by up to 9% on the total dataset. It is worth mentioning that the recall of MSCRED, OmniAnomaly, and GenAD(G) can reach 1.0, since we adopt the point-adjust approach [6] to calculate the metric. But the recall of the prediction-based model (LSTM-NDT) is lower than 1.0. This is because the prediction-based method is more sensitive to noise, and some uncontrollable factors (e.g. network environment changes) make some series less predictable. Additionally, compared with MSCRED and OmniAnomaly, GenAD(G) performs better by virtue of the pre-training and fine-tuning framework. Furthermore, to investigate the influence of the pretraining strategy, we train GenAD without pre-training (GenAD(WP)) on the WBS datasets. From Figure 2, GenAD(G) achieves a higher F1-score on all three areas, verifying the effectiveness of the pretraining strategy on heterogeneous WBSes. We also evaluate the performance of GenAD without time-series attention (GenAD(WT)) on WBS-Area1, i.e. on WBS-Area1-total and three WBSes selected from the WBS-Area1 dataset. Figure 3 shows that the total performance (WBS-Area1-total) of GenAD(WT) decreases by 4% compared with GenAD(G). This is because GenAD(WT) cannot capture the various temporal patterns of MTS in WBSes well. Besides, as for the effect of multi-correlation attention, it should be pointed out that GenAD without multi-correlation attention would detect each time series in isolation and lose the correlations among multivariate series.
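The point-adjust evaluation mentioned above can be sketched as follows: whenever any point inside a true anomaly segment is flagged, the whole segment is counted as detected. This is a common reading of the approach from OmniAnomaly [6], shown here on a toy label/prediction pair:

```python
import numpy as np

def point_adjust(pred, label):
    """Point-adjust: if any point in a ground-truth anomaly segment is
    predicted anomalous, mark the entire segment as detected."""
    pred = pred.copy()
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:   # find segment end
                j += 1
            if pred[i:j].any():              # segment hit at least once
                pred[i:j] = 1
            i = j
        else:
            i += 1
    return pred

label = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred  = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0])
adj = point_adjust(pred, label)
# The first segment (indices 1-3) is hit once, so all three points count;
# the second segment (indices 6-7) is missed entirely.
assert adj.tolist() == [0, 1, 1, 1, 0, 0, 0, 0, 0]
```

This explains why reconstruction-based detectors that hit each segment at least once achieve a recall of 1.0 under this metric.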

Conclusion:
In this paper, we propose a general unsupervised anomaly detection model using multivariate time series for large-scale WBSes, called GenAD. We adopt multi-correlation attention to represent the complex correlations among the MTS and employ time-series attention to represent the various temporal patterns of each time series. Moreover, we employ the pretraining strategy to adapt to large-scale and heterogeneous WBSes. Extensive experiments reveal that GenAD improves the F1-score by up to 9% on real-world datasets in mobile networks. By leveraging the Jiutian artificial intelligence platform, we have applied GenAD to monitoring large-scale WBS behaviours in the realistic mobile networks of China Mobile, which has boosted operation efficiency by 30-40% compared with the previously deployed anomaly detection techniques.