On the Latent Space of mmWave MIMO Channels for NLOS Identification in 5G-Advanced Systems

In mission-critical verticals such as automated driving, 5G-advanced networks must provide centimeter-level dynamic positioning along with ultra-reliable low-latency communication services. Massive Multiple-Input Multiple-Output (mMIMO) and millimeter waves (mmWave) are the key enablers, allowing high accuracy angle and delay estimation. Still, extracting such information from highly-dimensional Channel Impulse Responses (CIRs) results in a complex task, due to channel sparsity and intermittent blockage. In this paper we focus on non-line-of-sight (NLOS) identification from CIR data, proposing a Deep Autoencoding Kernel Density Model (DAKDM) to characterize the statistics of the channel latent features. We formulate the problem as a semi-supervised anomaly detection task in which only LOS samples, i.e., normal data, are adopted for training. DAKDM is a single-stage training model that takes as input the full CIR thanks to an AutoEncoder (AE) structure. The proposed method is able to learn the latent distribution by means of a Kernel Density Estimator (KDE) in combination with a deep learning likelihood network. We validate the proposed solution in a 5G Urban micro (UMi) vehicular scenario. Results show that the proposed model can significantly outperform conventional algorithms and obtain similar performances to variational Bayes algorithms at one tenth of the inference time.

Solutions for blockage detection should exploit the whole power-delay-angle profile of the channel impulse response (CIR) as this embeds a wide range of geographical data and propagation characteristics [53]. In 5G industrial use-cases, e.g., automated driving, historical CIR data are largely available in roadside units that receive continuous information from geolocalized vehicles [54], [55], [56], [57]. could easily exploit these data for automatic NLOS detection. However, since such signals are highly dependent on the environment, ML approaches for detecting NLOS using CIRs frequently fail to generalize to varied contexts [58]. Moreover, massive MIMO and very high frequencies of advanced-5G networks will produce high dimensional channel responses which may be complex to handle. An example of channel power-delay-angle-profile is shown in Fig. 1, for a 5G urban scenario with carrier frequency 30 GHz, bandwidth 400 MHz and uniform planar antenna array receiver of 64 elements. The sparse power delay-angle profile of the channel is a signature of the user location and should be exploited to infer the visibility conditions of the base station.
In this paper, we propose an innovative strategy to characterize the sparsity of the mmWave MIMO channel and approximate whatever high-dimensional distribution in a fast and compact way. To demonstrate the efficacy of the method, we address the problem of NLOS identification, exclusively employing LOS CIRs for training. This is done because LOS CIRs are easier to extract in training procedures and present more peculiar distributions, i.e., usually the direct path is the dominant factor in a Rician fading channel. In addition, this facilitates the deployment and results in higher generalization compared to other systems that require both classes for training (i.e., LOS and NLOS). Therefore, we treat the problem as an anomaly detection case in which LOS samples are considered as normal samples, while NLOS samples are anomalous.

II. RELATED WORKS
In this section, we first review the literature starting from early works on ultra wide-band (UWB) systems (Sec. II-A) and then we extend the analysis to ML-based algorithms (Sec. II-B). Next, we review the state of the art on anomaly detection focusing on neural network (NN) approaches (Sec. II-C) and we discuss the original contributions provided in the paper (Sec. II-D).

A. NLOS Identification
Existing techniques for NLOS identification/detection problem can be mainly divided into three major categories: based on range estimates, based on position estimates and based on channel statistics. The first group of methods, i.e., based on range estimates, measures the running variance of the ranges and applies a threshold using pre-computed variance statistics [59]. The techniques based on position estimates are mainly map-based, i.e., they observe the user equipment (UE) position in relation to the geometry of the environment [60], [61]. These first two categories are either too oversimplified or require perfect knowledge of the UE's position and of the map geometry.
The third class relies on channel statistics, such as amplitude, mean and root-mean-square delay. In case these statistics are known at-priori, a joint-likelihood ratio test can be adopted for hypothesis selection [62], [63] or as soft information in weighted least squares (WLS) algorithms. The limitations of this last class of existing techniques include experiencing delays while gathering channel statistics to create a database and determining the complex combined probability distributions of necessary features for statistical methods. ML-based approaches overcome these drawbacks by avoiding statistical modeling of the input features.

B. ML for NLOS Identification
ML approaches to NLOS identification can be divided into: supervised, unsupervised and semi-supervised learning. First works (i.e., supervised learning) use hand-crafted channel state information (CSI) features such as energy, maximum rise time, kurtosis, root-mean-square-delay spread, maximum amplitude, time of flight (ToF), Ricean-K-factor and mean excess delay [48], [64], [65], [66]. These deterministic features have a solid theoretical basis and capture the differences between LOS and NLOS conditions in terms of power and delay attributes, as well as the strength of the dominant signal component relative to the multipath components. The most popular adopted ML models are support vector machines (SVMs), relevance vector machines (RVMs), random forests (RFs) and Gaussian processes (GP). These methods can also be used to directly mitigate the range bias by applying a regression problem to the ranging-error estimates [49], [67].
Despite achieving good results, these methods highly depend on the pre-selected features which limit their potential. On the other hand, deep learning (DL) approaches can directly learn the most suitable combination of features (typically nonlinear) using as input the full CIR and producing as output the desired classification. First works in this direction can be found in [68], [69], [70] using convolutional neural networks (CNN) to perform feature extractions in grid-like data where local patterns and structures are critical. Some recent studies [71], [72] directly exploit the automatic feature extraction of the CNN in order to locate a target by performing a fingerprint training. A main limit is the need of extensive measurement campaigns and time-consuming labeling of data. Moreover, supervised learning approaches require updating the training database when conditions are changed and need representative samples of all the possible NLOS anomalies.
A solution could be permitting not to have labels at all and manage the problem as an unsupervised one. Authors in [73] fit a Gaussian mixture model (GMM) with two components (one for LOS and one for NLOS) using some key handcrafted features of the channel and output the classification according to the magnitude of the membership weights. While unsupervised techniques are very promising, unfortunately they do not achieve very high performances, due to lack of knowledge or lack of structured data.
The third class of semi-supervised approaches has the advantage of not needing examples of all the possible anomalies as supervised learning. Moreover, powerful DL semisupervised methods focus on learning one single distribution which, in many cases, is easier than a separating boundary between two distributions [74]. Works that adopt this strategy can be found in [58], [75] which adopt the Pearson correlation coefficient and one-class SVM to perform NLOS classification, respectively. A very recent work [76] employs variational autoencoder (VAE) to perform feature extraction and imposes a Gaussian distribution to the latent features in order to ease learning of distribution of normal samples. The score adopted to define the probability of NLOS is then used to estimate the bias and variance of time-based measurements. Although the idea of using an autoencoder (AE) to have a compact representation of the channel can give very good results, the usage of sampling-based methods to perform the prediction has the main drawback of not being suitable for real-time applications.

C. Neural Networks for Anomaly Detection
Anomaly detection is frequently employed in problems where we have a large amount of data from normal circumstances but little data from abnormal ones. Here, on the other hand, we consider the setting of semi-supervised learning in which normal training data only are provided. In this case, the problem turns out to be locating those samples that do not conform to the normal ones or a model explaining normal ones. Thus, the objective is to learn in a finer way as possible the distribution related to the normal samples.
To this aim, many works focus on end-to-end models to directly produce the classification using one-class neural networks (OC-NN) [77]. On the other hand, generative models are increasing in popularity with generative adversarial network (GAN) and VAE [78]. However, GANs are problematic to control in the training phase [79] and VAEs have the downside of requiring sampling, which is unfeasible under certain use cases, and furthermore experiments have shown that they tend to perform worse than AE [80], [81].
Reconstruction methods, as AE, are the most widely used DL techniques for anomaly detection in images [82]. Usually, they are used in combination with density-based methods, as kernel density estimation (KDE) [83], for score estimation by first performing dimensionality reduction, and then applying density estimation to the latent low-dimensional space. However, these two-steps methods restrict the modification to the dimensionality reduction since fine-tuning is difficult in well-trained AE. To solve this problem, authors in [84] propose a deep autoencoding Gaussian mixture model (DAGMM) to mutually learn the latent feature representations and their density under the GMM framework by mixture membership estimation. Even though their approach is direct and does not require two step-training, GMM may not be able to fully represent the latent distribution of normal samples and are subject to singularity problems. On the other hand, KDE are perfect to represent complex distributions, but they are very slow in evaluation and require storing the whole dataset for inference.

D. Contributions
In this paper, we address the problem of NLOS identification in 5G-advanced cellular systems using an innovative approach that allows to overcome the aforementioned limitations. The main contributions are the following: • We propose a feature extraction method that exploits the angle-delay channel power matrix (ADCPM) as input data and allows to characterize the distributions of the latent features of the sparse space-time channel in massive MIMO cellular systems using orthogonal frequency division multiplexing (OFDM). • We design NLOS identification as a semi-supervised anomaly-detection problem by exploiting a deep autoencoding kernel density model (DAKDM). The DL model allows to identify the few key parameters that describe the sparse space-time channel response and to learn the distributions of such latent features from training data. The proposed approach is able to jointly learn the sparse channel representation and approximate the KDE likelihood in a single training stage without storing the dataset. • We simulate a realistic 5G-advanced MIMO-OFDM vehicular scenario, according to the standard specifications [85], using a Matlab ray-tracing software [86]. The scenario is composed of multiple vehicular UEs created with simulation of urban mobility (SUMO) software [87]. The paper is organized as follows: Sec. III introduces the channel model for a multi-user MIMO-OFDM system and its extracted fingerprinting. Sec. IV provides the context of anomaly detection applied to the NLOS identification and defines the proposed DAKDM solution. Sec. V is devoted to the description of the simulated 5G scenario and to the comparison with state-of-the-art anomaly detection DL methods. Finally, Sec. VI draws the conclusion.

E. Notation
Random variables are displayed in sans serif, upright fonts; their realizations in serif, italic fonts. Vectors and matrices are denoted by bold lowercase and uppercase letters, respectively. For example, a random variable and its realization are denoted by x and x; a random vector and its realization are denoted by x and x; a random matrix and its realization are denoted by X and X, respectively. Random sets and their realizations are denoted by up-right sans serif and calligraphic font, respectively. For example, a random set and its realization are denoted by X and X , respectively. The function p x (x), and simply p(x) when there is no ambiguity, denotes the probability density function (PDF) of x. j = √ −1 denotes the imaginary unit. The notation X T , X * and X H indicate the matrix transposition, conjugation and conjugate transposition. The Kronecker and the Hadamard product between two matrices are denoted with the symbols ⊗ and ⊙, respectively. With the notation x ∼ CN (µ, σ 2 ) we indicate a complex Gaussian random variable x with mean µ and standard deviation σ. We use E{·} and V{·} to denote the expectation and the variance of random variable, respectively. R and C stand for the set of real and complex numbers, respectively. Re(x) and Im(x) are the real and complex part of the complex number x, respectively. ⌊x⌋ indicates the largest integer not greater than x, while δ(·) and δ[·] are the Dirac delta and Kronecker functions, respectively.

A. Channel Model
We consider a multi-user mmWave MIMO-OFDM system in which K UEs transmit in uplink direction over a bandwidth B at carrier wavelength λ c . The base station (BS)'s cell panel is equipped with an uniform planar array (UPA) with N × M isotropic antennas. The antenna spacings are d h and d v , over the horizontal and vertical dimension, respectively. We assume that the UE transmits using only one logical port and a number of physical antennas unknown at the BS. Between the UE k = 1, . . . , K and the BS, we consider a multipath channel with N k paths with ToF τ k,p for path p = 1, . . . , N k . The DoAs from the k-th UE and of the p-th path are divided into azimuth 0 ≤ φ k,p < π and elevation 0 ≤ θ k,p < π. A picture of the panel array can be found in Fig. 2. We restrict the azimuth up to π since we consider an UPA antenna. For trisectorial BSs the angular coverage is reduced to 2π/3.
The OFDM modulation is performed over N c sub-carriers, sampling interval T s and symbol duration T c = N c T s . Considering a baseband representation of the signal, we define the frequency at the ℓ-th sub-carrier as f ℓ = ℓ Tc , ℓ = 0, . . . , N c −1. The cyclic-prefix duration is T g = N g T s and it is assumed to be larger than the maximum channel delay for all UEs τ MAX = max k,p τ k,p . Consequently, we define with r k,p = ⌊ τ k,p Ts ⌋ the temporally resolvable propagation delay of the p-th path with respect to the k-th UE. Thus, the baseband CIR of user k is modelled as [88]: where the p-th path is characterized by a complex path gain α k,p = a k,p e −j2π d k,p λc β k,p with β k,p = e j2πν k,p t due to the Doppler frequency shift, a traveled distance d k,p = cτ k,p , a pulse waveform approximated with a Dirac delta function δ(τ − τ k,p ) and an array response vector e(θ k, where it is s 0 = 1 for LOS (i.e., with a deterministic direct path contribution and Rician fading) and s 0 = 0 for NLOS (i.e., Rayleigh fading). Additionally, we consider the Dopplerrelated rotation to be almost constant over time interval τ MAX and that the complex amplitudes α k,p associated to different paths as uncorrelated, according to the wide-sense stationary uncorrelated scattering model. At the BS, the array response vector can be decomposed into [89]: where the M × 1 response vector to the elevation angle is: and the N × 1 response vector to the azimuth angle is: Adopting an OFDM modulation with sampling at t = nT s , the channel frequency response (CFR) at the ℓ-th sub-carrier can be written as the discrete Fourier transform (DFT) of the CIR of the different paths [90], [91]: where the approximation holds for ToFs multiple of the sampling interval T s or equivalently for N g → ∞. Finally, the space-frequency channel response matrix (SFCRM) H k ∈ C M N ×Nc of the k-th UE is obtained as: which will be used in the next section to extract the channel fingerprints.

B. Channel Fingerprints
To detect the propagation conditions that generated the response (5), classifying them in LOS or NLOS, we propose to analyze the CFR in the angle-delay domain, which eases the recognition of the clustered multipath components associated to the direct (LOS) or secondary (NLOS) macro-paths. We thereby convert the SFCRM (6) into the domain of the angle of arrival (AoA) and the ToF, by introducing the angledelay channel response matrix (ADCRM). We define with Nc . ADCRM is computed as [93]: where (V H M ⊗ V H N ) and F * project the SFCRM into the angle and delay domain, respectively.
For NLOS identification, we propose to use the ADCRM to compute the average power of the channel components that are collected into the ADCPM defined as: where We recall here that the ADCPM holds some important asymptotic properties, as for N , M and N g → ∞, it tends to be a sparse matrix with elements [P k ] i,r matching the i-th AoA and the r-th ToF [93]: where m k,p N + n k,p denotes the index of the i-th AoA and r k,p the index of the r-th ToF. Note that the angle and delay indexes m k,p N + n k,p and r k,p , are two distinct and discrete quantities which relates to the physical AoA and ToF in the following way. The ToF can be approximated as τ k,p = r k,p T s , while the azimuth φ k,p and elevation θ k,p can be written as , respectively. Consequently, working in the transformed angledelay domain allows the DL model to learn the locationdependent features and, therefore, the statistics of LOS data to be used for blockage prediction.
Regarding the complexity overhead due to the ADCPM computation, we observe that P k is obtained from the channel matrix H k which is always estimated for communications purposes. Therefore the only overhead is the computation of (7), which can be efficiently performed using the 2D-Inverse Fast FT (IFFT), with an overall complexity of O(M N N g · log(M N N g )).

IV. BLOCKAGE DETECTION METHODOLOGY
In this section, we first introduce the problem formulation of the semi-supervised setting which serves for the proposed DL model's foundation. Then, we describe the network input, i.e., the ADCPM fingerprint, followed by the definition of the DAKDM. Finally, we define the loss function used to train the model.

A. Problem Formulation
We consider a semi-supervised setting in which we are given a training dataset S train comprising only normal data, i.e., X i sampled from p X , and a smaller testing data S test comprising normal (label y i = 0) and anomalous data (label y i = 1). Here, we refer to LOS samples as normal, while we consider NLOS samples as anomalous. Nevertheless, the choice of normal/anomalous condition is arbitrary and could be customized to the specific scenario, as the proposed model would still be valid in both cases, i.e., LOS or NLOS samples as normal data. Usually, the high-dimensional distribution of normal samples p X is complex and unknown. Thus, the objective is to first elaborate S train such that we can learn its manifold distribution and, subsequently, during inference time, identify the anomalous samples in S test as outliers. The mapping of the high-dimensional data is carried out using a DL model f (·) that learns the normal data distribution while also attempting to reduce an anomaly score A(X i ) given as output. The higher the anomaly score of A(X i ) for a test sample X i , the higher probability that X i is anomalous. For evaluation, a threshold (η) criterion is applied, i.e., A(X i ) > η denotes an anomaly, based on a predefined false positive rate (FPR).

B. DL Input
We employ (9) as input to the neural network for NLOS identification, as this matrix represents the clustered multipath structure of the channel and embeds the information on LOS/NLOS propagation conditions that we are interested to extract. Moreover, the sparsity of the matrix helps the CNN in features extraction since the first layers of CNN are usually sparse and they gather the more discriminant features [94]. In Fig. 3 we can see the ADCPM P k composed by M N angle directions and N g delay samples. The sparsity of the matrix is well-visible even without a huge number of antennas or sample's resolution. From now on, for simplicity of notation, we drop the index k related to user k and we denote the i-th input sample as X i = P i .

C. Deep Autoencoding Kernel Density Model
The proposed DAKDM system for NLOS identification involves three main elements: an AE, a KDE model and a likelihood network. The model can be seen in Fig. 4. The AE comprises an encoder E(·), that elaborates the i-th input X i ∈ R M N ×Ng into a latent representation z i ∈ R m , and a decoder D(·), that carries out an inverse transformation to return to the original high-dimensional distribution, obtaininĝ X i . The latent distribution p z may have any form, i.e., it is not constrained to belong to any specific PDF family. This makes the proposed approach general enough to be applied to any channel environment.
The distribution p z is automatically learned by the KDE block of the system (see Fig. 4). The KDE is a non-parametric method to estimate any distribution directly from a set of samples drawn from it. Given a set of samples {z j } Ns j=1 from p z , we define the KDE K(·) applied to sample z i as [95]: where k h : R m → R is a kernel function with bandwidth h which regulates the balance between the estimator's variance and bias. The kernel employed in this paper is the widely known Gaussian kernel: The output of a KDE, trained only with normal latent samples {z j } Ns j=1 , can be seen as the likelihood of the test sample to belong to the normal distribution. Thus, the derived anomaly score can be obtained as A K (z i ) = −log(K(z i |{z j } Ns j=1 )). However, the downsides of KDE lie in the fact that it requires Encode incoming signal X i : z i = E(X i ).

5:
Compute KDE prediction: A K (z i ) = −log(K(z j i |{z j−1 l } Ns l=1 )). Compute loss function L j tot . 8: Perform backward-pass. 9: end procedure storing all the training dataset to estimate the density function at inference time.
The idea to solve this issue is to first reduce the number of samples N s used to estimate the distribution, and then approximate the output of the KDE with a NN that is much faster in the prediction. We call the NN to estimate the output of the KDE as likelihood network and denote it with L(·). The logical steps for the training procedure with a batch of N s samples are described in Algorithm 1. First, we encode the input with the encoder. Then, we extract the anomaly score as A(X i ) ≜ A L (z i ) = −log(L(z i )) and we compute the KDE prediction A K (z i ). Finally, we compute the loss function which is described in Sec. IV-D and perform the backward pass. The key aspect here is that the KDE output is computed using the previous mini-batch, i.e., K(z j i |{z j−1 l } Ns l=1 ). This permits to avoid storing all the training dataset to estimate the density function. The underlying assumption is that the batch size N s is able to give a good representation of the likelihood through the KDE. Formally: where KL(·∥·) is the Kullback-Leibler divergence. On the contrary, at inference time, we just check if the anomaly score A L (z i ) > η. This implies that, during deployment, we can completely discard both the decoder D(·) and the KDE K(·), just relying on the faster prediction of the encoder E(·) and likelihood network L(·).

D. Loss Function
The objective of the loss function is to first induce the DAKDM to learn the latent representation of normal data and then to approximate A K (z i ) with A L (z i ). To this aim, we consider the training dataset S train = {B j } N b j=1 , where N b is the number of batches in the training dataset and B j = {X j i } Ns i=1 is the j-th mini-batch with N s samples. We define the total loss related to mini-batch j as follows: Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
where L rec (X j i ,X j i ) = ∥X j i −X j i ∥ 2 is the loss function that describes the reconstruction error given by the AE, w KL and w lik are the weighting parameters that control how much the single losses affect the objective function as a whole and KL point is the pointwise KL-divergence defined as: With the second right-hand side of (13), we exploit the power of the likelihood network to learn the KDE output trained with the previous mini-batch. The choice of the loss function is motivated by the fact that, if assumption (12) holds, then we can write the contribution of z i to the anomaly score with the following [96]: where L(p(z i )|z i ) is the likelihood network that provides the probability of z i given z i . For the proof of (15), please refer to Appendix A. We directly insert the first right-hand side of (15) in the loss function to induce the AE to decrease the anomaly score, thus increasing the likelihood. On the other hand, we do not have a KDE that provides the probability of its predictions, therefore we consider the p(K(z i |{z j } Ns j=1 )|z i ) as a single deterministic value that we approximate through the likelihood network.

V. SIMULATION EXPERIMENTS A. 5G NR Network Simulation
To evaluate the proposed DAKDM method for NLOS identification, we simulate realistic CSI data based on the 5G NR clustered delay line (CDL) channel model [97] which is characterized by a maximum bandwidth of 2 GHz over the whole frequency range of 0.5 GHz to 100 GHz. We simulate the wave propagation using a ray-tracing method [98], [99], [100] provided by Antenna Toolbox Matlab package where the propagation pathways from the UE to the BSs are computed based on the surface geometry from a map file. Ray-tracing uses the shooting and bouncing ray (SBR) method [101], accounting for up to 10 path reflections. The method does not take into account buildings' windows and possible foliage, which would require a high-definition 3D mapping of the environment or a complex simulation with artificially created maps. The channel model is then produced by coupling all the paths taking into account the small-scale fading due to the UE's movement, angle spread and cluster properties. This permits to achieve spatial consistency, meaning that two adjacent positions present similar channel characteristics due to comparable scattered environments.

B. Urban Mobility Scenario
For the experiments, we simulate a 3GPP urban micro (UMi) scenario in an area of 1000 m × 1000 m, near the Leonardo Campus of Politecnico di Milano, with specific parameters described in [85]. As shown in Fig. 5, the scenario comprises 19 urban sites, placed in an hexagonal layout with Inter-Site Distance (ISD) of 200 m, each equipped with 3 cells. The BS antennas are characterized by an UPA configuration with M = N = 8 elements and a downtilt of 15 • . The transmission power is 44 dBm and each antenna element was defined using the specifications in [102], providing a front-toback ratio of about 30 dB and a maximum gain of 8 dBi.
The vehicular UEs move in the area traveling along different trajectories generated with SUMO software which replicates actual traffic patterns on a particular route network. We generate up to 50 trajectories sampled every second, for a total simulation time of 170 s. Each UE is equipped with an omnidirectional antenna and transmits the 5G standard compliant sounding reference signals (SRSs) to all the BSs in the area using a carrier frequency f c = 30 GHz and a transmission bandwidth B = 400 MHz. The BSs, which can be in either LOS or NLOS condition due to occlusions and reflections, demodulate the OFDM signal and estimate the channel using a least squares (LS) estimator. Subsequently, they obtain the channel fingerprint using the estimated channel response to compute the angle-delay channel structure (7) and then the associated power structure (8).
For the experiments, we do not consider the multi-user interference (MUI), but it is worth mentioning few considerations for possible real implementations of the method. In practice, the BSs can adopt various techniques to manage the inaccuracy of channel estimation due to factors such as the MUI. One common technique is to use channel estimation algorithms that are robust to MUI, such as linear minimum mean-square error (LMMSE) [103] which obtains sub-optimal performance (sub-optimal as it does not use the knowledge of the full CSI) with moderate computational complexity. Additionally, other techniques may rely on non-linear pre-coding schemes which have been found to provide near-optimal performance [104], [105]. In the standard of 5G-NR, codebookbased MIMO precoding techniques have been proposed and they are described in the 3GPP technical specifications (TS) 38-214 [106] and 38-211 [107]. With latest releases, i.e., Rel 16 and 17, MU-MIMO codebook (type II) has been improved with the reduction of the feedback overhead. By implementing these techniques, the MUI is highly reduced and the residual interference resembles to background noise. Moreover, in case the model has been trained in a channel in presence of non-null interference, we would have an even-broader LOS distribution characterization, which would be beneficial in case of singleuser transmission.

C. CSI Dataset
In the offline phase, each BS is assured to gather LOS only realizations of the channel, composing a training dataset for the blockage detection. In the online phase, on the other hand, we create the test dataset adopting unobserved positions of the UEs and collecting a balanced number of samples between LOS and NLOS conditions. We saved more than 7.5 · 10 4 and 8.6 · 10 4 samples in the training and testing set, respectively. Before the training, all the samples are standardized (i.e., transformed such that the mean intensity is 0 with standard deviation of 1) and shuffled at each epoch. MATLAB 2022a is used to create the channel fingerprints of the data points, while the DL model for training and testing is implemented using Pytorch [108] (v1.12 with Python 3.7.11). We run our simulations on a workstation with an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40 GHz, 96 GB of RAM and a Quadro RTX 6000 24 GB GPU. The testing times, described in Sec. V-F1, only apply to the run time on Pytorch 1.12. Unless otherwise specified, we train the model for a number of epochs E = 30 with a batch size N s = 64. w KL and w lik are both empirically set to 0.1. We adopted the Adam [109] with an initial learning rate lr = 10 −4 , and momentum β 1 = 0.9, β 2 = 0.999.

D. DL Model Characteristics
For the AE part, we adopt the Segnet architecture [110] with one single channel encoder and decoder. The upsampling layers employ the encoder pool indices to create a sparse feature mapping which is ideal for reproducing the sparse ADCPM input. The AE is the most complex part of the model, however, at testing time, we use only the encoder part, thus halving the inference time if compared with VAE models or in general solutions that adopt the reconstruction error as a monitoring feature.
On the other hand, we develop the likelihood network using a simpler multi-layer perceptron (MLP) which is able to learn whatever non-linear function. The network can be found in Table I. To cope with the overfitting we adopt the dropout technique [111] after each activation function. Furthermore, we insert a single batch normalization layer [112] right before the ReLU function. This is done to avoid that the output of the network will converge to a unique value after a long training.

E. Baselines
To evaluate the performances of the proposed model, we compare it against a number of DL approaches proposed in the literature to solve anomaly detection problems:  [84]. Single-stage training model composed by an AE and a GMM used for learning the latent feature distribution. The membership weights, which represent the probability that a given data point belongs to each component, are usually computed with the expectation (E)-step of the expectation-maximization (EM) algorithm used for the GMM fitting. However, in this case, the membership weights are produced by an estimation DL network. • AE-KDE [83]. Double-stage training model in which first the AE is trained and then a KDE is used to learn the distribution of latent features from all the training dataset. The bandwidth and the kernel are the same of DAKDM. • VAE [76]. Auto-encoding variational Bayes applied to NLOS identification. Here, the sampling mechanism is mandatory since we need to sample new latent variables from the learned probability distribution, i.e., in this case a Gaussian distribution. The anomaly score A(X i ) is computed as A(X i ) = 1 Nm Nm j=1 L rec (X i ,X j i ), where N m is the number of samples. As suggested by the authors, we draw 10 samples from the latent space representation to derive the anomaly score.
• GANomaly [113]. Deep-generative model composed by an AE, a second encoder and a discriminator. The model minimizes simultaneously the reconstruction error, the encoder loss given by the second encoder and the adversarial loss yielded by the discriminator. For a fair comparison, we give the same input to each model and we adopt the unchanged architecture for the encoders and decoders with respect to DAKDM. Therefore, we adopt the same number of latent features for all architectures.
In addition to DL model baselines, we compare our method with classical ML and statistical algorithms. In particular, we implement: • JLRT [63]. Joint-likelihood ratio test considering the statistics of the kurtosis, mean excess delay and rootmean-square delay spread. The PDFs of the statistics are approximated as log-normal distributions and they are considered independent of each other. • RF [66]. Random forest model with 100 trees and, as input features, the Rician K-factor, root-mean-square delay spread, mean excess delay and dominant channel tap. • CORR [75]. Pearson correlation coefficient computed using a reference set of LOS ADCPM sliced in the direction of arrival with higher received power. We gathered 100 LOS reference signals and we considered only the samples comprising 10 points before and 100 points after the first peak. The likelihood of a test input is obtained by averaging the correlation coefficient with the reference LOS signals. • OC-SVM [58]. One-class support vector machine which computes the smallest hyper-sphere containing normal, i.e., LOS, samples. We use the score function as a likelihood measure. As regards the feature selection, we adopt both static channel characteristics as the maximum received power, kurtosis, skewness, rising time, root-mean-square delay spread, Rician K-factor, angular spread of arrival and both time-varying features [114] like the angular variant of arrival. Note that, while CORR and OC-SVM are semi-supervised learning algorithms, JLRT and RF are supervised learning methods since they require statistics/samples of both classes. The models and algorithms are run independently by each BS, after the UE uplink transmission. The training, if required, is performed before the validation procedure at each BS using the locally collected input samples.

1) Inference Timings:
In this assessment, we want to measure the time required by each DL model to predict the output of a sample. This is of particular relevance in realtime applications where the inference time must be as low as possible. An example is the vehicular applications where the end-to-end latency must be contained within 100 ms or less [115]. In Fig. 6, we show the boxplot of the inference time for each sample over the whole testing dataset. We notice that the proposed DAKDM is able to predict the anomaly score in half of the time with respect to DAGMM as it does not require the decoder part prediction. Moreover, GANomaly and AE-KDE models require up to 4 ms for a single prediction. This is because GANomaly holds a more complex model, while AE-KDE has to pass through the whole training dataset for a single prediction. Finally, VAE takes about ten times more than DAKDM due to the sampling strategy.
2) Batch Size: This assessment has the goal of verifying how the batch size N s affects the performances of the proposed DAKDM. Theoretically, N s should be large enough to generate a good representation of the latent features' distribution. To verify this behaviour, in Fig. 7 we analyze the anomaly score A L (z i ) of normal (Fig. 7a) and anomalous (Fig. 7b) samples in the testing dataset after 30 epochs for N s = {8, 16, 32, 64, 128}. To avoid singular issues due to possible zeros values given as input to the logarithm, we shift the likelihood distribution as A L (z i ) = −log(L(z i )+1). The first thing to notice is that the anomalous score of normal data is lower than the abnormal data and this is because the likelihood network outputs higher values for samples with normal distributions. Second, we observe that decreasing N s for normal data, will produce lower mean and variances distributions, thus enhancing the NLOS identification capabilities. This is due to the fact that with a large batch size, the model struggles to learn the pointwise KL-divergence since in the loss function we have the contributions of many points. On the contrary, with lower batch sizes, the likelihood network learns exactly which value assign to each latent representation. Reducing N s has the benefit of being suitable for simpler devices with low computational capabilities, in exchange for higher training times. As a trade-off between performances and training times, we choose N s = 16.
3) Hyper-Parameters Tuning: This experiment aims of tuning the main hyper-parameters related to the density models, i.e., the bandwidth h of the KDE for DAKDM and AE-KDE and the number of GMM components, denoted with g, for DAGMM. In Fig. 8 for DAKDM and the number of GMM components g ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} (Fig. 8b) for DAGMM. Starting from Fig. 8a, we notice that the higher AUC is reached by h = 0.2 and that for not optimal values, the AUC can differ significantly. This is somehow due to the range values of z i and to the number of points that we have. Since, in practice, we have few anomalous samples, for tuning the bandwidth we can simply rely on maximizing the likelihood of normal samples varying the bandwidth. Comparing the results with Fig. 8b, we see that DAGMM is not able to achieve high peaks of performances in average, i.e., above 90% of AUC. This means that the latent distribution cannot be well approximated with less than 10 Gaussian distributions. Clearly, increasing g will improve the performance but at the cost of a higher complexity of the DAGMM estimation network. 4) Performance Comparison: In this last assessment, we compare the performances of the proposed DAKDM with the models described in Sec. V-E. In Table II, we report the average AUC after 10 runs and the F-score, Accuracy, Precision and Recall using a threshold on the anomaly score related to 20% of FPR. We notice that the performances of the AE-KDE (highlighted in green) are superior with respect to all the others. The reason behind this is that the AE-KDE represents the perfect unfeasible upper-bound, i.e., it requires storing the whole training dataset for inference and thus it can perfectly reconstruct the latent features distribution. On the other hand, the lower-bound is represented by the statistical JLRT method (highlighted in red), which obtains an AUC of 63%. This method assumes the independence of few hand-crafted features, which may not hold in any situation. The second non ML-based method CORR reaches an AUC of 64%, meaning that the LOS reference signals are not a good representation of the LOS distribution. The OC-SVM, a classical ML method, achieves a slightly higher AUC due to its capabilities of projecting the original features in a higher hyper-space (kernel trick). However, its main limitations lay in the features-choice which, for sparse and high-dimensional spaces, constitutes a non-trivial task. Moreover, we can notice that the precision (94%) is much higher with respect to the recall (60%). This means that OC-SVM tends to classify all test samples as LOS, learning a rough, i.e., too general, LOS distribution. Finally, among classical ML methods, the RF achieves the highest performances by reaching an AUC of 79%. However, we remark that this method requires the knowledge of NLOS samples, thus it holds an advantage with respect to semi-supervised learning methods.
Focusing now on the DL models, numerical results show that they highly outperform the classical ML and statistical algorithms. Indeed, while deterministic feature extraction might be more suitable for low-dimensional or simple channels, using the raw ADCPM as input to the CNN structure allows the DL models to utilize the full potential of automatic feature extraction, which contributes to the superior performance of the DL methods in comparison to classical ML and statistical algorithms. However, this does not exclude the possibility of incorporating deterministic features in future work to further improve the performance of the proposed model. Among the DL methods, DAKDM (highlighted in bold) and VAE hold the highest AUCs if compared with DAGMM and GANomaly. In particular, DAKDM and VAE outperform DAGMM and GANomaly of 7% and 16%, respectively. The reasons behind this are that GANomaly is a very complex network that requires a non-negligible effort in hyperparameter tuning and optimization, with additional issues in training stability [79]. On the other hand, DAGMM is not able to accurately learn the LOS latent feature representation due to its limited number of Gaussian components. Both DAKDM and VAE achieves the highest performances, i.e., 95% and 96% of AUC, but with two different methods. While VAE imposes a simple latent distribution, DAKDM automatically learns the low-dimensional LOS distribution thanks to the KDE in training phase. However, the main advantage of DAKDM is that it does not require sampling procedures

VI. CONCLUSION
This paper addressed the problem of high-dimensional channel distribution characterization for next generation cellular networks. In order to demonstrate the method, we tackle the problem of NLOS identification in a mm-wave MIMO system with sparse space-time channel responses. We model the problem within the semi-supervised anomaly detection framework where LOS samples correspond to normal data with peculiar characteristics and distributions. We propose a deep autoencoding kernel density model (DAKDM) where the manifold distribution of normal data is elaborated with an AE that takes as input the sparse ADCPM which univocally map the position-dependent features of the channel. The AE is jointly learned together with a likelihood network which is trained to learn the output of a KDE that directly estimates the distribution of the latent features.
The DAKDM has the main advantage of learning whatever latent distribution without storing the whole training dataset.
We validated the model in a 5G standard compliant UMi scenario simulated with Matlab ray-tracing package, permitting spatial channel consistency between adjacent positions. The UEs are vehicles which move in the area according to dynamics simulated by the SUMO software. Compared with DL state-of-the-art models, results showed that the proposed DAKDM is much more efficient, both in terms of inference time and computational requirements. In particular, DAKDM holds a prediction time per sample which is up to one fourth and one tenth of GANomaly and VAE, respectively. This makes it appropriate for edge devices with strong latency requirements for mission-critical applications. From a performance point-of-view, DAKDM is able to achieve similar performances of the top-performer VAE, outperforming GMM-based method such as DAGMM of about 7%.
In the next years, ML and in particular DL methods are expected to play a crucial role in next generation cellular networks. Communication systems, but also localization techniques, are required to increase performance capabilities and types of services to accomplish increasingly high standards. Thus, DL-based methods as the proposed DAKDM become essential to push further the performances. A natural extension of our work would be to integrate NLOS mitigation into the system in order to compensate the induced error given by lack of visibility or directly integrate DL techniques into positioning algorithms suited for high complexity environments. A further direction of research could be the extension to a cooperative inference framework where BSs exchange mutual-soft information for accuracy enhancement. Moreover, more realistic environments with simulated foliage and dynamic obstructions should be explored. Challenges are represented by NLOS situations, changes in the environment and lack of possible representative samples for each feasible location. Fig. 9. Variational inference problem as described in [116]. The generative model is indicated by solid lines while variational approximation is reported with dashed lines. APPENDIX A PROOF OF (15) In this Appendix we provide a proof of the anomaly score contribution given by z i . From the variational inference approach [96], we can note that the likelihood network performs the same objective of latent variable inference. To see this parallelism, we recall the variational inference context where we are given an observation h (i.e., latent variable) from the prior distribution p θ h with parameters θ. Subsequently, a datapoint x is generated from p θ x|h , which is considered intractable. The objective is to estimate the exact posterior p θ h|x , also intractable, with a simpler variational posterior q ϕ h|x with parameters ϕ. For a graphical representation of the problem, please refer to Fig. 9 [116].
We can view the datapoint x as a compact representation of the channel z i and the latent variable h as the distribution p(z i ). Let us denote the probability as s i = p z (z i ), which can be approximated with the KDEŝ i = K z i |{z j } Ns j=1 , the likelihood network L ϕ (·) (with parameters ϕ) will be acting as the variational distribution q ϕ (·). Following this parallelism, the contribution of z i to the anomaly score, i.e., negative log-likelihood, can be written as follows: where (17) is the consequence of Jensen's inequality. It follows that −log p z (z i ) ≤ −E L ϕ log p z,si z i , s −log L ϕ s|z i = −log p z (z i ) + KL L ϕ s|z i p si|z s|z i ⋍ −log K z i |{z j } Ns where the approximation in (18) is due to (12), concluding the proof.