Robust Computationally-Efficient Wireless Emitter Classification Using Autoencoders and Convolutional Neural Networks

This paper proposes a novel Deep Learning (DL)-based approach for classifying the radio-access technology (RAT) of wireless emitters, improving computational efficiency and accuracy under harsh channel conditions with respect to existing approaches. Intelligent spectrum monitoring is a crucial enabler for emerging wireless access environments that support sharing of (and dynamic access to) spectral resources between multiple RATs and user classes. Emitter classification enables monitoring the varying patterns of spectral occupancy across RATs, which is instrumental in optimizing spectral utilization, managing interference, and supporting efficient enforcement of access regulations. Existing emitter classification approaches successfully leverage convolutional neural networks (CNNs) to recognize RAT visual features in spectrograms and other time-frequency representations; however, the corresponding classification accuracy degrades severely under harsh propagation conditions, and the computational cost of CNNs may limit their adoption in resource-constrained network edge scenarios. In this work, we propose a novel emitter classification solution consisting of a Denoising Autoencoder (DAE), which feeds a CNN classifier with lower-dimensionality, denoised representations of channel-corrupted spectrograms. We demonstrate, using a standard-compliant simulation of various RATs including LTE and four recent Wi-Fi standards, that in harsh channel conditions including non-line-of-sight, large-scale fading, and mobility-induced Doppler shifts, our proposed solution outperforms a wide range of standalone CNNs and other machine learning models while requiring significantly fewer computational resources. The maximum achieved accuracy of the emitter classifier is 100%, and the average accuracy is 91% across all propagation conditions.


Introduction
The rapidly growing mobile traffic and user base are fueling demand for spectrum resources that are increasingly challenging to provision due to scarcity and access restrictions. One mitigation strategy is to improve spectrum utilization efficiency through shared unlicensed operation of radio-access technologies (RATs). Recent examples include the 5 GHz band, which is shared by Wi-Fi (IEEE 802.11a) and an unlicensed version of the Long-Term Evolution cellular standard (unlicensed LTE) [1,2], as well as the unlocking of the 6 GHz band for the Wi-Fi 6 (IEEE 802.11ax) and 5G New Radio Unlicensed (NR-U) standards, which will operate alongside incumbent primary users [3].
Realizing the benefits of the shared unlicensed operation to meet the increasingly stringent quality of service (QoS) application targets is contingent upon meeting challenging requirements, which include ensuring fair and harmonious coexistence between users, secure access in line with regulations, and maximal performance through optimized resource allocation and interference management. The challenge in meeting these requirements emanates from the complexity of the emerging access environments, which are influenced by a confluence of physical (channel effects), human (user access and mobility patterns), and technological (design and modus operandi of RATs) factors.
A potential solution is advancing RAT intelligence to equip radio assets with adaptive learning and decision-making capabilities that will enable a greater level of autonomous operation compared to centralized schemes [4]. Intelligent spectrum monitoring is arguably a core component of adaptive radio learning that enables RATs to collect measurements and make sophisticated real-time inferences about the spectral state that drive reasoning and intelligent decision-making. This has motivated several works that build on two enabling technologies: The first is the software-defined radio (SDR), allowing programmable radio frequency (RF) operation across a wide range of frequency bands and with diverse cost and form-factor options [5]. The second consists of Artificial Intelligence and Machine Learning (AIML) techniques, whose performance and sophistication have increased, especially in the area of Deep Learning (DL) [6]. DL allows learning hierarchical representations of discriminant features in a generalized and efficient manner compared with the intensive and rigid feature engineering by human experts [7]. The introduction of Machine Learning techniques in communication is detailed in [8]. Many researchers have investigated using DL for wireless communication [9][10][11][12][13], some of them proposing DL models for different applications than signal processing [14][15][16][17][18]. Detailed surveys about DL for mobile and wireless networking are provided in [19,20].
Applying DL to wireless communication is currently an active research area (as highlighted in [19,21,22] and the references therein); this is motivated by similarities with other domains of successful DL application, such as speech and object recognition [23], and by the ability to build size-appropriate training data sets as wireless networks inherently generate large data volumes that can be efficiently collected and ingested [24].
Several works are concerned with AIML-based emitter RAT classification (also referred to as protocol, wireless standard, or wireless technology classification), which acts as a RAT-agnostic, data-driven detector that can be used for accurate detection and prediction of the access patterns of RATs operating in unlicensed shared bands. As the medium access schemes for unlicensed, shared RATs are mainly envisioned as variants of Listen-Before-Talk (LBT) schemes [40], where spectrum access is controlled by schemes based on sensing spectrum occupancy, RAT classification is potentially a primary driver for optimizing spectrum utilization and minimizing interference through intelligent and situation-aware dynamic spectrum access. It can also be utilized as a tool to support access policy enforcement by automating the detection of violations, which is more coverage-efficient than manual, human-resource-based in-field analysis [41,42].
Emitter classification works include proposals based on feature engineering [43][44][45], as well as DL-based proposals using time-frequency (TF) representations, including spectrograms calculated with the Short-Time Fourier Transform (STFT) and other custom TF representations [11,41,46–50]. Compared to strictly time-based features, TF representations can lead to better performance in emitter classification [51] and allow visualizing rich multi-emitter scenarios as patterns recognizable by human domain experts [46]. Formulating emitter classification as an object recognition problem allows leveraging state-of-the-art DL algorithms, mainly based on training supervised convolutional neural network (CNN) classifiers, which have achieved top performance in other application domains [52].
However, there are limitations associated with CNN-based emitter classification. First, the visual patterns in TF representations are susceptible to corruption induced by the communication channel, which can severely degrade classification performance. Recent works showed that the visual features in spectrograms (the archetypal TF representation) can be indiscernible in low-SNR conditions and significantly altered by frequency-selective fading [41,47]. The degradation can be significantly more pronounced in the harsh channel conditions encountered in typical indoor environments due to severe multipath and non-line-of-sight (NLOS) conditions and mobility-induced Doppler effects. Second, the performance gains of CNNs may come at a high computational cost. While abundant computing resources may be available during the training phase of CNNs, the resulting inference engines may be deployed in settings such as the network edge [53] that are constrained in terms of computational resources and energy consumption and favor tight coupling with the RF (sensing) circuits [54][55][56]. Unless addressed, the high computation and energy cost of CNNs might be a significant limiting factor towards broader adoption.

Contributions of Paper
The aforementioned challenges highlight the need to improve the robustness and computational efficiency of TF-based emitter classification. This work proposes a novel emitter classification approach that uses a hybrid DNN consisting of a convolutional denoising autoencoder (CDAE) followed by a CNN classifier. By construction, in this approach, the representation learning phase devoted to denoising (performed by the CDAE) and the classification phase (performed by the CNN) are carried out separately.
The theoretical motivation behind this approach relies on the following considerations: (1) the decoupling allows more efficient training, and (2) a standalone representation learning phase focused on obtaining a reconstructed TF image before classification can more effectively support the CNN. After the representation learning stage, the classifier can operate on clear-cut visual features of the TF representation. In a sense, the use of the CDAE explicitly incorporates into the process priors about the protocol spectrogram's original visual features that were degraded during signal propagation.
In practice, we demonstrate by simulation that, compared to state-of-the-art standalone CNNs, our hybrid DNN approach achieves high accuracy under harsh propagation conditions while requiring significantly fewer computational resources. The main contributions of the work are summarized as follows:
• A CDAE is trained to reconstruct the original visual patterns of pre-channel emissions from spectrograms corrupted by harsh channel conditions. We conducted a comprehensive simulation study of reconstruction performance for several standards, including LTE and several versions of Wi-Fi, under harsh propagation conditions that include low SNR, multipath, NLOS, and Doppler effects.
• A CNN performs emitter classification using the denoised representation generated by the CDAE, leading to high classification accuracy under harsh channel conditions. Moreover, the resulting CNN requires significantly fewer computational resources, as it operates on a compressed representation with lower dimensionality than spectrogram-fed CNN classifiers. We demonstrate, using simulations, that our proposed hybrid CDAE-CNN approach outperforms a wide range of DNN- and ML-based schemes in classification accuracy while requiring significantly fewer computing resources.
The rest of the paper is structured as follows. Section 1.1 reviews related work on DL for wireless communication. Section 2 describes our system model and problem statement, together with the simulation setup and data generation for unlicensed LTE and the Wi-Fi standards. Section 3 details our DL-based approach to classifying wireless signals operating in the same unlicensed band. In Section 4, we report results under various noise scenarios, while the comparison between our approach and other ML and DL algorithms is presented in Section 5. Conclusions are drawn in Section 6.

System Model
Consider the setting illustrated in Figure 1, which involves N wireless devices, each operating a Radio Access Technology (RAT) drawn from a set of M distinct RAT types, and a single receiver tasked with monitoring device spectrum activity. Denote the devices by d_nm, where n = 1, ..., N is an index reflecting a unique usage context that includes user activity patterns and application data profiles, and m = 1, ..., M is the RAT-type label. When in transmission mode, d_nm emits an upconverted version of a base-band time-varying signal s_nm(t) by Radio Frequency (RF) with a given carrier frequency. The signal s_nm(t) contains payload data formatted into RAT-defined packets, and its bandwidth and center frequency determine the spectrum band of operation, which can be fixed or dynamically determined, as is the case with unlicensed shared operation. A typical implementation of the receiver uses software-defined radios (SDRs), which are available off-the-shelf or custom-made [5] for applications including radio-environment maps (REMs) that characterize spatio-temporal spectrum access patterns and provide a useful tool for enforcing access policies by identifying violations. Assuming that there is no interference and that the receiver is tuned to the same spectrum band as the emitter, the downconverted received signal is denoted as r(s_nm) and can be modeled as

r(s_nm) = H(s_nm) + G(s_nm),    (1)

where G(s_nm) denotes the additive white Gaussian noise (AWGN) and H(s_nm) the other physical and user-induced degradations present in the wireless communication link between d_nm and the receiver: these may include multipath, shadowing and non-line-of-sight (NLOS) propagation, mobility-induced Doppler effects, and specific Signal-to-Noise Ratio (SNR) conditions. Upon reception, r(s_nm) is converted to a spectrogram by applying the Short-Time Fourier Transform (STFT) [57]. We selected the spectrogram mainly because it is one of the most general and commonly used time-frequency representations.
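As a minimal, hedged illustration of this received-signal model, the following Python sketch (NumPy only) adds complex AWGN at a target SNR to a toy baseband signal; the flat-fading gain of 0.8 and all names are illustrative assumptions, not the channel models used in the paper.

```python
import numpy as np

def awgn(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add complex AWGN to `signal` so that the resulting SNR is `snr_db`."""
    rng = rng or np.random.default_rng(0)
    sig_power = np.mean(np.abs(signal) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(signal.shape)
                                        + 1j * rng.standard_normal(signal.shape))
    return signal + noise

# Toy example: a complex exponential through a flat channel H (placeholder
# attenuation standing in for multipath/NLOS/Doppler) plus the AWGN term G.
t = np.arange(1024) / 1e6                 # 1 MHz sample rate
s = np.exp(2j * np.pi * 1e4 * t)          # transmitted baseband signal s_nm(t)
r = awgn(0.8 * s, snr_db=10)              # received signal r(s_nm)
```

The per-sample noise variance is split evenly between the real and imaginary parts, so the total noise power matches the requested SNR.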
The model and the methodology we employ can be extended to other representations, e.g., functions of the spectrogram, such as the power spectrum transform and other custom time-frequency representations. Let us denote by R_C(s_nm) ∈ C^(N_f × N_t) the matrix of complex values representing the STFT of the received signal r(s_nm), with matrix elements R_C(s_nm)(τ, ω), where τ and ω denote values of the discretized time and frequency, respectively, while N_f and N_t denote the number of points sampled along the frequency and time dimensions, respectively. If we denote by i ∈ [1, N_f] and j ∈ [1, N_t], respectively, the indices of the discretized frequency and time values, the elements (pixels) of the N_f × N_t matrix are denoted by [R_C(s_nm)]_ij.
The real-valued spectrogram R(s_nm) ∈ R^(N_f × N_t) of the received signal is obtained by calculating pixel-by-pixel the norm of the corresponding complex number in the complex STFT matrix R_C(s_nm), i.e.,

[R(s_nm)]_ij ≡ |[R_C(s_nm)]_ij|.

Summarizing the relationships between the defined quantities, we have the chain s_nm(t) → r(s_nm) → R_C(s_nm) → R(s_nm). The same chain of transformations applied to r(s_nm(t)) can be applied to the original transmitted signal s_nm(t); we denote the resulting real-valued spectrogram of the transmitted signal by T(s_nm). The comparison between the transmitted-signal spectrogram T(s_nm) and the received-signal spectrogram R(s_nm) is at the heart of the training of the Denoising Autoencoders described in the next section.
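The spectrogram computation described here (a complex STFT of the received signal followed by a pixel-wise magnitude) can be sketched with `scipy.signal.stft`; the tone frequency, FFT size, and function names below are illustrative.

```python
import numpy as np
from scipy.signal import stft

def spectrogram_magnitude(x: np.ndarray, fs: float, n_freq: int = 64):
    """Return the complex STFT matrix and its pixel-wise magnitude
    (the real-valued spectrogram)."""
    # nperseg sets N_f; the number of hops along the signal gives N_t.
    _, _, Z = stft(x, fs=fs, nperseg=n_freq, return_onesided=False)
    return Z, np.abs(Z)   # [R_C]_ij and [R]_ij = |[R_C]_ij|

fs = 1e6
t = np.arange(4096) / fs
x = np.exp(2j * np.pi * 1.25e5 * t)       # toy single-tone "emission"
Z, R = spectrogram_magnitude(x, fs)       # R has shape (N_f, N_t)
```

`return_onesided=False` keeps the full two-sided frequency axis, which is needed for complex baseband input.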
Notice that sometimes, to reduce the dimensionality of the problem, the values of each spectrogram are encoded to binary values by comparison with the average pixel value. For instance, one can define

[B(s_nm)]_ij = 1 if [R(s_nm)]_ij > mean(R(s_nm)), and 0 otherwise.    (2)

In this work, we assume that each spectrogram captures a single emission of each of the active RAT types. This assumption is motivated by the common practice of programming SDRs to scan multiple frequency bands [25,28]. Our model can be expanded to include wideband spectrograms that span multiple emissions by using spectrum localization [46] and other time-frequency representations based on sweeping the carrier frequency [58].
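The binarization step (the Equation (2) thresholding referenced later for dataset preparation) reduces to a one-line NumPy comparison against the spectrogram's mean pixel value; the toy matrix is illustrative.

```python
import numpy as np

def binarize(R: np.ndarray) -> np.ndarray:
    """Binarize a real-valued spectrogram by thresholding at its mean pixel value."""
    return (R > R.mean()).astype(np.uint8)

R = np.array([[0.1, 0.9],
              [0.2, 0.8]])
B = binarize(R)   # mean is 0.5, so pixels above it map to 1
```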
Having calculated the spectrogram R(s_nm), and possibly its binarized version B(s_nm), the receiver passes it to the emitter classification algorithm, which outputs an estimate of the operating RAT type. Figure 2 shows samples of the received spectrograms for some combinations of channel scenarios and SNR levels. The transmitted spectrograms for each IEEE 802.11 protocol and LTE are presented in the first row of Figure 2. The LTE spectrogram is easily corrupted under Scenario 2 at SNR 20 dB, while channel Scenario 1 at SNR 0 dB damages the preamble's visual pattern in the captured spectrograms. A detailed description of the data generation and the implemented channel scenarios is given later in Section 4. In real environments, it is hard to know the channel condition and to model it with a precise equation, especially in a harsh dynamic environment. To perform the identification task robustly and accurately, the problem of channel and noise effects in the received spectrograms must be addressed. In the following section, our proposed methodology for achieving a robust identification task is detailed.

Proposed System and Methodology
The proposed approach for protocol identification consists of two main phases: (1) Signal denoising: reconstructing the unlicensed signals by removing the noise and signal degradation effects using a DL model. (2) Protocol identification: identifying the unlicensed-band protocol of the corrupted signals based on the weights learned by the denoising DL model.
To study the proposed system's performance, first, a simulation environment is developed for the LTE and WLAN 802.11 standards. Second, data are generated and collected under different SNR values and channel propagation models. Third, data preprocessing is performed to prepare the inputs for the DL model by mapping the signals to the corresponding spectrograms. Fourth, signal denoising is performed on each dataset to reconstruct the corrupted received-signal spectrogram for each protocol using Convolutional Denoising Autoencoders (CDAEs). Finally, signal classification is performed to identify the received protocol using a CNN classifier that takes the CDAE representation as input. Figure 3 illustrates the proposed system.

Signal Denoising
The data denoising phase of our approach is based on particular artificial neural network layered architectures known as Denoising Autoencoders (DAE) [59], which are typically used to reconstruct data from a corrupted input.
In standard Autoencoders (AEs), the input is received in the form of examples, and the AE is trained to reconstruct them; in a DAE, the input comes in the form of noisy examples, and the objective is to reconstruct their original non-noisy form (both the non-noisy and noisy examples are provided to the DAE during training). In wireless radio communication, such examples can be obtained by measuring the target output T(s_nm) and the input R(s_nm) in a physical environment, by generating them in a radio simulation environment, or by artificially corrupting non-noisy data.
Autoencoders (AEs) are DL models used for self-supervised learning of an encoding of the input data. The input data instance can be an image or any other signal. The input is fed to the AE in the form of a 1D or multidimensional array (e.g., a 2D image), typically flattened into 1D form. Hereafter, we denote this array by x = (x_1, ..., x_k, ..., x_K), where K is the total number of elements in the flattened array. AEs learn to encode an input data set into a compact representation that preserves the statistical correlation properties of the original input distribution that are most relevant for reconstruction. The AE architecture consists of one or more encoding layers and of as many decoding layers. A special role is assigned to the central (hidden) layer, called the code layer. The main role of the encoding layers is to map the input vector x into the hidden representation y = f_θ(x); when there is a single encoding layer, f is defined by the following matrix equation:

y = f_θ(x) = σ(W x + b),

where θ is a shorthand for (W, b) and represents the set of parameters, W is the weight matrix and b the offset vector, while σ is a nonlinear function acting element-wise, such as a Rectified Linear Unit (ReLU) or a sigmoid function. With more encoding layers, the computation is iterated from one layer to the next. The mapped representation y is decoded to obtain the reconstruction z, which has the same size as the input x. With a single decoding layer, the reconstruction z = g_θ′(y) is defined as

z = g_θ′(y) = σ(W′ y + b′),

where θ′ is a shorthand for (W′, b′). With more decoding layers, the computation is iterated from one layer to the next. During the learning process, the reconstruction z is compared to the input through a loss function, and the parameters W, W′, b, and b′ are optimized by minimizing that loss.
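A minimal NumPy sketch of a single encode/decode pass (ReLU standing in for σ; all dimensions and random initializations are illustrative):

```python
import numpy as np

def relu(a):
    """Element-wise nonlinearity sigma."""
    return np.maximum(0.0, a)

def encode(x, W, b):
    """y = f_theta(x) = sigma(W x + b)."""
    return relu(W @ x + b)

def decode(y, Wp, bp):
    """z = g_theta'(y) = sigma(W' y + b')."""
    return relu(Wp @ y + bp)

rng = np.random.default_rng(0)
K, H = 16, 4                                   # input size and code-layer size
x = rng.random(K)                              # flattened input array
W, b = 0.1 * rng.standard_normal((H, K)), np.zeros(H)
Wp, bp = 0.1 * rng.standard_normal((K, H)), np.zeros(K)
z = decode(encode(x, W, b), Wp, bp)            # reconstruction, same size as x
```

The code layer y has lower dimensionality (H < K), which is exactly the compressed representation the classifier later consumes.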
There are different loss functions that can be selected to train the AE parameters. For binary data, the loss function is usually the binary cross entropy (BCE), which is typically used for binary classification:

L_BCE(x, z) = -(1/K) Σ_{k=1}^{K} [ x_k log z_k + (1 - x_k) log(1 - z_k) ],

where K is the total number of input array elements. The Mean Square Error (MSE) is also a common loss function and is always non-negative:

L_MSE(x, z) = (1/K) Σ_{k=1}^{K} (x_k - z_k)^2.

We used the MSE as the loss function for the reconstruction error in our DAE, because it is known to work well with DAEs in image reconstruction. In our model, the MSE loss function is minimized by Adaptive Gradient Descent (AGD).
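Both loss functions transcribe directly into NumPy; the clipping constant `eps` is an implementation detail added here to avoid log(0) and is not part of the definitions above.

```python
import numpy as np

def bce(x, z, eps=1e-12):
    """Binary cross entropy, averaged over the K array elements."""
    z = np.clip(z, eps, 1.0 - eps)
    return -np.mean(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

def mse(x, z):
    """Mean square error; always non-negative."""
    return np.mean((x - z) ** 2)

x = np.array([0.0, 1.0, 1.0, 0.0])
perfect = mse(x, x)                    # 0.0 for a perfect reconstruction
uncertain = bce(x, np.full(4, 0.5))    # log 2 for a maximally uncertain output
```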
DAEs use a noisy input x̃ ≠ x and are trained to produce encodings that restore the properties of the original non-noisy input. In our case, the noisy input is x̃ = R(s_nm), the real-valued spectrogram of the received signal, while the target output is x = T(s_nm), the real-valued spectrogram of the transmitted signal. During the training of the DAE, a reconstructed spectrogram z obtained from the noisy signal spectrogram x̃ is compared to the non-noisy target x. The non-noisy and noisy versions of each input are obtained either by measuring the T(s_nm) and R(s_nm) signals in a physical environment, from a radio simulation environment, or by artificially corrupting non-noisy data.
In our proposed DAE model, convolutional layers are used, whose parameters W and b are shared among all locations in each image batch to exploit spatial locality. In this way, we combine the advantages of DAEs with the low complexity of the CNN paradigm. The resulting architecture is a Convolutional Denoising Autoencoder (CDAE). In general, CDAEs are very effective in signal processing [60] and perform better than classical DAEs in image processing [61]. Figure 4 shows a schematic view of the proposed CDAE operation for signal denoising: the input is noisy data (spectrograms of the received signal under various channel propagation scenarios); the encoder and decoder consist of multiple convolutional layers; the output is compared to the clean signal's spectrogram; and by optimizing the loss function, the network finds a denoised representation [62].
The effectiveness of using DAEs for reconstructing spectrograms of corrupted signals in unlicensed bands was reported in our recent works [62,63]; however, they were only applied to denoising two Wi-Fi protocols in light-noise scenarios. In this paper, we expanded our analysis substantially, using CDAEs to denoise multi-protocol signals, including unlicensed LTE, with a view to performing protocol identification. Furthermore, we assessed the effectiveness of the CDAE for various propagation models and SNR values, with severe degradation of signal reception in harsh environments. We tuned the CDAE architecture so that it generalizes with consistent performance across all signals operating in the unlicensed spectrum and all noise levels and harsh environments. The architecture of our CDAE is explained in detail in Section 4.4.

Protocol Identification
We aim to apply ML to identifying unlicensed radio technologies, specifically for spectrum sharing. Some signal processing functions can be learned within the physical layer, as discussed in [13], where ML is used for the modulation classification of single-carrier modulation schemes by applying a CNN to radio-frequency time-series data. Other ML methods applied to classifying radio signals include SVMs [44], small feed-forward neural networks, and random forests [64].
We are now ready to discuss the protocol identification stage, deployed after the CDAE signal denoising stage. The architecture of our classifier is a Convolutional Neural Network fed with the features learnt by the CDAE in the signal denoising process. The basic block of any CNN is the convolution, a simple application of a filter to an input that results in an activation. CNN filters are locally connected to capture correlations between different data regions in the image and output a feature map. The convolutional structure significantly reduces the number of model parameters and provides robust recognition under affine transformations [6]. Powerful CNN models developed for imaging applications include ResNet, Inception-V4 [65], and GoogLeNet [66]. These models mostly differ from one another in terms of CNN depth. (Different inception and residual techniques were also proposed to overcome the overfitting and vanishing-gradient problems that are typical of "deep" CNNs; a detailed discussion of these techniques is outside the scope of this paper.)
CNNs were also adapted for video action recognition using 3D CNNs [67]. While our intuition supported the notion that CNNs could perform well on the snapshots/images of wireless spectra, we did not jump to conclusions. Instead, we implemented several ML models and compared their protocol identification performance across different SNR values and propagation models.
Our resulting pipeline, which composes a DAE and a CNN classifier, includes a series of convolutional layers, a max-pooling layer, a fully connected (dense) layer, and a softmax activation layer to perform the classification. The input to our CNN-based classifier is the compressed representation produced by the DAE, and the output is the class type of the signal. Our overall architecture for signal classification is illustrated in Figure 5. A further detailed explanation is given in Section 4.5.

Dataset Generation for LTE and IEEE 802.11 Family
In our experimentation, we focused on Wi-Fi signals operating in the 5 GHz industrial, scientific, and medical (ISM) band. This unlicensed band was selected because LTE can also operate in the 5 GHz band, depending on operator preference [68]. To the best of the authors' knowledge, there is no available dataset for wireless local area network (WLAN) 802.11 protocols or LTE, especially under multiple noise scenarios and propagation models; therefore, we resorted to data generated through simulation. We set up the simulation of the following five protocols operating in ISM bands under different channel conditions: LTE, IEEE 802.11ac, IEEE 802.11ax, IEEE 802.11n, and IEEE 802.11a. The protocols and conditions are further detailed in Section 4.2. IEEE 802.11ax is the high-throughput, high-efficiency WLAN amendment that will replace both IEEE 802.11n and IEEE 802.11ac [69,70]. The end-to-end radio signal environment is built using the MATLAB WLAN System Toolbox for the various WLAN 802.11 standards (IEEE 802.11a, IEEE 802.11n, IEEE 802.11ac (Wi-Fi 5), and IEEE 802.11ax (Wi-Fi 6)). MATLAB also includes the LTE System Toolbox, which is used to build, design, and generate LTE waveforms and model end-to-end LTE communication links. The channel propagation modeling functions apply fading and noise to the transmitted LTE and WLAN signals. In the following subsections, the implementation for the WLAN IEEE 802.11 protocols and unlicensed LTE is explained. The simulation setup details for the LTE and IEEE 802.11 standards, the characteristics, the channel conditions, and the received signal for each protocol or RAT are all detailed in Appendix B.

Dataset Preparation
The simulation for each IEEE 802.11 protocol and the LTE standard was run independently to generate signals and save the radio spectrogram images for T(s_nm) and R(s_nm) under various channel scenarios and SNR values. The radio spectrogram images correspond to as many preambles for each IEEE 802.11 protocol and to LTE subframes.
Each spectrogram represents the Short-Time Fourier Transform (STFT) [57] of the raw time series s_nm(t) corresponding to the signal of a preamble. Spectrograms were selected as signal representations because of their ability to capture the behavior of multiple received signals. Each spectrogram image represents the STFT of the transmitted or received signal as a function of the (discretized) time τ and frequency ω. Each spectrogram image consists of 64 × 3782 pixels, where 3782 time intervals are captured and, for each interval, 64 frequencies are computed. To reduce the high dimensionality of the spectrograms, they were binarized according to Equation (2).

Datasets
We generated 25 datasets, representing as many combinations of channel scenarios (5 scenarios) and SNR values (5 levels: 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB). Each dataset is evaluated under a specific channel scenario and SNR level. In total, we generated 500,000 spectrogram images. Each of our 25 datasets consists of 20,000 spectrogram images, i.e., 10,000 pairs of transmitted and received spectrograms, with 2000 pairs for each of the five protocols we considered: IEEE 802.11a, IEEE 802.11ac, IEEE 802.11ax, IEEE 802.11n, and LTE. A summary of the implemented noise model scenarios for LTE signals is given in Table 1. The characteristics, channel conditions, and received signal for each protocol/RAT are detailed in Appendix A. Figure 2 shows samples of the captured spectrograms for some combinations of channel scenarios and SNR levels.

Results of Autoencoder-Based Denoising
The dataset for each channel scenario was examined independently to evaluate the performance of our Convolutional Denoising Autoencoder (CDAE) in reconstructing the clean spectrograms. Overall, 10,000 transmitted spectrograms make up the clean dataset, while 10,000 received spectrograms make up the noisy dataset. Each spectrogram image is resized to 128 × 128. The dataset was split 80% for training, with the remainder used for testing the model.

The Proposed Autoencoder Architecture
The structure of our CDAE consists of 24 layers: an input layer, eleven 2D layers for the encoder part, and twelve 2D layers for the decoder part. Each 2D encoder block consists of a convolutional layer, a rectified linear unit (ReLU) activation function layer, a dropout layer, and a 2D max-pooling layer. The convolutional layer C is used to learn the weights and biases of the spectrogram parameters. The ReLU layer computes the max(0, x) element-wise activation function, thresholding at zero. The dropout layer is utilized for better generalization and to avoid overfitting. Max-pooling is performed to downsample the output of the convolutional layer C, lowering the spatial dimensions.
The input to the encoder is a quantized spectrogram image of size (128, 128). The first hidden layer C1 is a convolutional layer consisting of 16 feature maps, each connected to a (3, 3) kernel. The kernel is a small matrix used for feature detection. In the decoder, a ReLU activation function layer is used for the convolutional layers, upsampling is performed, and a dropout layer is included to avoid overfitting. Overall, the model has 74,304 parameters.
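As an illustrative sketch only, a CDAE of this general shape (conv/ReLU/dropout/max-pool encoder blocks mirrored by conv/upsampling decoder blocks) could be assembled in Keras as below. This is a much shallower model than the 24-layer CDAE described above: the layer counts, filter counts, and dropout rate are assumptions and do not reproduce the 74,304-parameter architecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_cdae(input_shape=(128, 128, 1)):
    """Shallow illustrative CDAE: the encoder halves the spatial dimensions
    twice; the decoder upsamples back to the input size."""
    inp = keras.Input(shape=input_shape)
    # Encoder blocks: conv -> ReLU -> dropout -> max-pool.
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(inp)
    x = layers.Dropout(0.2)(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)   # code layer: 32x32x8
    # Decoder blocks: conv -> ReLU -> upsampling.
    x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    out = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")  # MSE reconstruction loss
    return model

model = build_cdae()
```

Training would then call `model.fit(noisy_spectrograms, clean_spectrograms, ...)`, pairing each received spectrogram with its transmitted counterpart.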

Performance Metrics
When reconstructing received noisy spectrograms, the objective is, of course, to minimize the reconstruction error. The performance metric of our DAE is the accuracy, defined by 1 − Err, with

Err = (1/|S|) Σ_{x ∈ S} L(x, z),

where the index runs over the set of training samples S and L(x, z) is the MSE loss function.
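This metric transcribes directly into NumPy; the toy (target, reconstruction) pairs below are illustrative.

```python
import numpy as np

def mse(x, z):
    """MSE loss L(x, z) between a target and its reconstruction."""
    return np.mean((x - z) ** 2)

def reconstruction_accuracy(pairs):
    """1 - Err, where Err averages the MSE loss over the sample set S."""
    err = np.mean([mse(x, z) for x, z in pairs])
    return 1.0 - err

# Toy sample set: two (target, reconstruction) pairs of binarized spectrograms.
pairs = [(np.array([1.0, 0.0]), np.array([0.9, 0.1])),
         (np.array([0.0, 1.0]), np.array([0.0, 1.0]))]
acc = reconstruction_accuracy(pairs)   # Err = (0.01 + 0.0) / 2 = 0.005
```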

Results and Discussion for Signal Denoising
Besides accuracy, we assessed the reconstruction quality of our CDAE with different types of metrics for all channel scenarios and SNR levels.
Our CDAE achieves an average accuracy above 77% within 500 epochs for all datasets under light to strong noise conditions, with SNR values varying from 0 to 20 dB across all channel scenarios. The lowest achieved accuracy is 76.79%, for spectrograms at SNR 0 dB under Scenarios 4 and 5. The highest denoising accuracy of the CDAE is 79.96%, for the dataset at SNR 20 dB under Scenario 4. The radio signals at SNR 10 dB under Scenario 2 reach 79.96% testing accuracy within only 295 epochs. We also observed that the dataset at SNR 0 dB under the channel propagation model of Scenario 2 requires only 171 epochs to reach 78% denoising accuracy.
With 77% average denoising accuracy, our CDAE successfully recovers the essential features for the entire IEEE 802.11 family (IEEE 802.11a, IEEE 802.11n, IEEE 802.11ac, and IEEE 802.11ax) and unlicensed LTE signals. This encoding accuracy provides a robust foundation for protocol identification, as we discuss in more detail in the following subsection. The performance of our CDAE is uniform across all channel scenarios and SNR levels; this supports the notion of using a library of CDAEs to reconstruct the original T(x)_ij regardless of the noise level, the channel condition, or the type of radio signal. Figure 6 shows reconstructed spectrograms from different protocols in scenario 1 with SNR 0 dB.
We remark that the reconstructed spectrograms in Figure 6 are almost identical to those obtained across all channel scenarios for all 25 datasets. The reconstructed spectrogram images clearly preserve the preamble of the WLAN IEEE 802.11 protocols and the subframes of the LTE signal. We attribute this to the use of convolutional layers, which preserve the spatial locality of the input images. Our results emphasize the robustness of the CDAE architecture in reconstructing corrupted radio spectrograms, regardless of the noise level or the channel propagation effects in the field.

Results and Discussion for Protocol Classification
Our experiment aims to measure the effectiveness of using CDAE-compressed features as input to a classifier that identifies the unlicensed RATs, studying the performance of different artificial neural network classifier architectures in identifying the protocols under different conditions. The different classifier configurations are summarized in Table 2. They are distinguished by the number of layers, the use of convolutional or fully connected layers (CNNs as opposed to ordinary Multi-Layer Perceptrons (MLPs)), and, most importantly, by whether they receive as input the original noisy images or their compressed representation coming from our previously trained CDAE (this option is described in Table 2 as CDAE weights).
Notice that the structure of the classifier input changes depending on whether the original noisy image or the CDAE representation is used: in the former case, the input size is (128, 128, 1), i.e., a 128 × 128 Boolean image of depth 1; in the latter case, it is (4, 4, 32), i.e., a set of 32 Boolean images, each of size 4 × 4 (from the CDAE structure, one can already see that we imposed that the network learns 32 filters). Notice also that the former representation of the input has 2^14 Boolean degrees of freedom, while our CDAE filter-based representation uses only 2^9 Boolean degrees of freedom. The solution we put forward in this paper is reported in the last row of the table: it consists of a 24-layer CNN classifier receiving as input our CDAE representation. This pipeline outperforms the other classifiers in identifying the protocols (IEEE 802.11a, IEEE 802.11n, IEEE 802.11ax, IEEE 802.11ac, and unlicensed LTE), even under severe signal degradation (SNR of 0 dB) and the different channel conditions (severe fading, multipath, NLOS, Doppler shift, and all the previously described channel conditions). Moreover, most alternative models display performance comparable to that of a random classifier (which, with 5 classes, would achieve 20% accuracy). A more detailed description follows.
The 4-layer MLP with and without CDAE, the 6-layer CNN classifier with and without CDAE, the 8-layer CNN classifier with and without CDAE, and the 13-layer CNN classifier without CDAE are all stuck at 20% accuracy across the whole considered range of channel scenarios and SNR levels; overall, they perform as random classifiers. The 24-layer classifier fed with the CDAE representation, on the contrary, features an accuracy ranging from 60% to 100% for multi-protocol classification, depending on the channel scenario and SNR level. The fact that CNN3 + AE performs better than CNN2 + AE suggests that adding one CNN layer helps extract more features that are useful for identification. Finding the dimensionality of the CNN that corresponds to the coarseness of data representation at which the "right" features emerge is a classic problem, part of the CNN model's hyperparameter tuning. We want to highlight that we explored both lower and higher numbers of layers. As often happens, there is an optimal point in the complexity of a network that represents a good trade-off between bias and variance: CNN3 + AE represents such a trade-off. On the contrary, we found that the CNN2 + AE performance does not generalize well across all datasets, while CNN4 + AE already shows clear signs of overfitting; we expect that further increasing the number of layers, and thus the number of parameters, would produce even more apparent overfitting. Therefore, we did not experiment with CNN5 + AE, CNN6 + AE, and so on.
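To make the pipeline concrete, a classifier head consuming the (4, 4, 32) CDAE code could be sketched in Keras as follows; the three convolutional stages echo the "CNN3" naming, but the filter widths are assumptions and the full 24-layer configuration of Table 2 is not reproduced:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn3_classifier(num_classes=5):
    # Input is the compressed CDAE code, not the raw 128 x 128 spectrogram.
    inp = keras.Input(shape=(4, 4, 32))
    x = inp
    for filters in (64, 64, 64):       # three conv stages ("CNN3"), widths assumed
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    x = layers.Flatten()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)  # 5 protocol classes

    model = keras.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because the input has only 2^9 degrees of freedom instead of 2^14, this head is far cheaper than a classifier operating on the raw spectrogram.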
Multiple performance metrics based on the confusion matrix, such as the F-score (also called the F1-score), recall, and precision, have also been used to evaluate the proposed model's performance. A sample of the confusion matrices for the proposed classifier model across different datasets is presented in Figure 7.
Our proposed DL classifier model achieves an average accuracy of 95.44% for identifying a wireless technology in the SNR range of 10 dB to 20 dB across all channel propagation effects. The average accuracy of wireless identification in intense noise environments, considering the propagation effect scenarios, is 82.55% for the SNR range between 0 dB and 5 dB. The maximum achieved accuracy of our emitter classifier based on CDAE weights is 100%, and the average accuracy is 91% across all datasets.

Performance Comparison
In this section, we provide a full comparison of our pipeline's performance with that of traditional ML and DL-based models.

Comparison to Traditional ML Algorithms
In this section, we compare the performance of our (CNN3 + AE) pipeline to well-known ML algorithms such as Support Vector Machine (SVM) [71], Random Forest (RF), and K-nearest neighbors (KNN). The performance of a shallow learner is also explored using a one-dimensional Convolutional Neural Network (1D CNN).
SVM is a supervised ML algorithm proposed in 1995 [72] in which an optimal hyperplane separates the classes in the data space. SVM has proved to be a successful method for two-class classification tasks when well-representative features are extracted from the data. RF was first introduced by Tin Kam Ho [73]. It is an ensemble learning method for classification in which a multitude of decision trees are trained, and the predicted class is determined by majority voting. The KNN algorithm is a nonparametric, lazy learning algorithm that assigns each input the class of the majority of its nearest neighbors, with proximity measured by feature similarity [74].
We tested these algorithms under all channel scenarios and SNR values to compare them to our proposed classifier. The 1D CNN was chosen as a shallow benchmark learner because it enables frame-level investigation, and its use has been explored for audio recognition and Natural Language Processing (NLP). The 1D CNN has been used with raw waveforms, usually combined with a Recurrent Neural Network (RNN), in audio applications [75]. The convolution layer's kernel size in our benchmark 1D CNN is set to 3, and 24 filters were used with ReLU activation. The soft-max activation function was used to classify the protocols.
The input to the SVM, KNN, RF, and 1D CNN consisted of the noisy spectrogram images of size 128 × 128, i.e., each image had 16,384 features; for the 1D CNN, this was shaped as a (16384, 1) array. Each dataset was split into 80% for training and 20% for testing. In each training session, the data were shuffled, and 33% of them were used for cross-validation to tune the model's hyperparameters whenever applicable.
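The baseline protocol can be reproduced in outline with scikit-learn; the random arrays below merely stand in for the flattened spectrogram features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 64))           # stand-in for flattened 128 x 128 spectrograms
y = rng.integers(0, 5, size=100)    # five protocol classes

# 80/20 train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

accuracies = {}
for clf in (SVC(), KNeighborsClassifier(), RandomForestClassifier(random_state=0)):
    clf.fit(X_tr, y_tr)
    accuracies[type(clf).__name__] = clf.score(X_te, y_te)
```

On real spectrogram data the same loop yields the per-classifier accuracies compared against the CDAE-based pipeline.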
Each classifier's performance was evaluated using accuracy, recall, precision, and F-score [76]. Figure 8 shows the empirical cumulative distribution function (CDF) of the overall accuracy for the shallow classifiers on our datasets across all channel scenarios. The overall accuracy is the ratio of correctly predicted observations to the total observations. The accuracy of the SVM is the lowest, while KNN and RF achieve 20%. The 1D CNN gives an accuracy of 20% on most datasets and 60% on a few, depending on the channel scenario and SNR level. The recall (i.e., sensitivity) is shown in Figure 8b. The precision across all datasets under various noise conditions and SNR values is shown in Figure 8c. The overall CDF of the F-score across all 25 datasets is illustrated in Figure 8d.
The performance of our (CNN3 + AE) is much higher than that of the other compared ML algorithms in terms of accuracy, recall, precision, and F1-score. The highest accuracy achieved by our model is 100%, reached on several datasets, such as those with SNR 5 dB under scenarios 3 and 4, as shown in Figure 9. The minimum accuracy is 55%, for the dataset with the AWGN channel and an SNR level of 0 dB (at 0 dB, our model recovers features that were reported lost in [47]). Our proposed DL classifier model achieves an average accuracy of 95.44% for identifying a wireless technology in the SNR range of 10 dB to 20 dB across all channel and propagation effects. The average accuracy of wireless identification in strong noise environments, considering the propagation effect scenarios, is 82.55% for the SNR range between 0 dB and 5 dB.
The overall average accuracy is 91%, calculated across all SNR values and channel scenarios for the five unlicensed radio signals (i.e., across all datasets). Figure 9 depicts the accuracy results.
Based on our results, we conclude that traditional supervised ML algorithms such as SVM, KNN, and RF do not perform well in identifying the protocols of unlicensed radio signals such as IEEE 802.11a, IEEE 802.11n, IEEE 802.11ac, IEEE 802.11ax, and unlicensed LTE. Similar results are reported for SVM, KNN, and RF in [47], but that work did not study the propagation models in harsh environments.

Comparison to Benchmark DL Models
Deep CNNs are known to achieve good performance in image classification tasks. In this section, we report the performance of some well-known deep CNN architectures, namely, the VGG [77], Inception [66], and ResNet [78] algorithms. The dataset for all benchmark DL architectures was split into 80% for the training set and 20% for the test set. During training, 33% of the data was used for cross-validation to tune the DL classifier models' hyperparameters. A max-pooling layer was used at the model's output, and a soft-max layer was used to identify the class of the protocols.
In our experiments, we explored different DL architectures and evaluated their performance on unlicensed radio datasets at SNR 20 dB (light noise) with AWGN, to check their ability to identify the protocols under light noise conditions. Table 3 details the configurations and results for the different DL models, which were trained on the SNR 20 dB with AWGN (scenario 5) dataset. The VGG architecture was introduced in [77]. VGG uses a large number of small convolution filters, usually of size 3 × 3 and 1 × 1 with a stride of one; the number of filters depends on the depth of the VGG model. VGG has been used for modulation classification in [35] and shows good performance when combined with a 1D CNN. Table 3 summarizes the architectures of the two VGG classifiers that were trained and assessed. The input shape for the VGG3 and VGG16 classifiers was (128, 128, 1), i.e., a 128 × 128 Boolean image of depth 1. VGG16 was built using the Keras library, according to the architecture explained in [77]. Table 3 details the average precision, recall, F1-score, and accuracy for all the classifiers based on the VGG algorithm. Despite its depth, VGG16 cannot classify the protocols even in a very light noise scenario (SNR 20 dB with AWGN). The accuracy of VGG3 is 20%, equivalent to a random classifier for the five classes.
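A VGG-style stack of small 3 × 3 convolutions can be sketched in Keras as follows; the stage widths are illustrative assumptions, not the exact VGG3 configuration of Table 3:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_vgg_style(input_shape=(128, 128, 1), num_classes=5):
    # Three VGG-style stages: stacked 3x3 convolutions followed by pooling.
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for filters in (32, 64, 128):            # stage widths assumed
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```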

Performance of Inception and GoogLeNet
The concept of Inception for very deep CNNs was introduced in [66] with the GoogLeNet model. The GoogLeNet model is based on a block of parallel convolutional layers with differently sized filters (1 × 1, 3 × 3, and 5 × 5) and a max-pooling layer (3 × 3). The outputs of the parallel branches are concatenated, and the inception blocks are chained to compose the network.
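A naive inception block along these lines can be sketched in Keras (the branch widths are assumptions for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

def naive_inception_block(x, f1, f3, f5):
    # Parallel 1x1, 3x3, and 5x5 convolutions plus a stride-1 3x3 max-pool,
    # concatenated along the channel axis.
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(x)
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])

inp = keras.Input(shape=(128, 128, 1))
out = naive_inception_block(inp, 16, 16, 16)  # channels: 16 + 16 + 16 + 1 = 49
model = keras.Model(inp, out)
```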
In our experiments, we explored several architectures based on the inception model, and the effect of the pooling layer on the model's output was studied as well. The input shape for the Naive Inception model and the Inception2 model is (128, 128), while the spectrogram image is resized to (299, 299) to fit the very deep convolutional Inception-V3 architecture developed in [66]. The performance of Inception-V3 with global average pooling (GAP) in the output layer is better than that of the other inception models, as listed in Table 3. We can summarize the result by concluding that, despite their depth, very deep inception networks offer low performance relative to their implementation cost.

Performance of Residual Networks (ResNet)
Residual Networks (ResNets) are very deep convolutional network models proposed in 2016 [78]. ResNet is derived from the VGG deep convolutional networks by adding residual blocks. Each residual block consists of two convolutional layers, with a ReLU activation function for each. The output of each block is combined with its input through a shortcut connection. ResNet-V2 is a modified residual network using Residual Inception Blocks, which provides good detection performance but is costlier than ResNet or Inception-V3 [65].
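A basic residual block of this kind can be sketched as follows (an identity-shortcut sketch; the filter count is an assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    # Two 3x3 convolutions; the block input is added back via a shortcut.
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    out = layers.Add()([x, y])           # identity shortcut connection
    return layers.Activation("relu")(out)

# The identity shortcut requires matching channel counts (16 here).
inp = keras.Input(shape=(32, 32, 16))
model = keras.Model(inp, residual_block(inp, 16))
```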
For our experiment, the effect of the output pooling layer was studied for ResNet-V2. The total parameters for each ResNet-V2 variant are stated in Table 3. The input images were resized to (299, 299) to fit the developed ResNet-V2 architecture. The ResNet-V2-GAP classifier begins to improve the classification accuracy for the different protocols in the light noise scenario (SNR 20 dB with AWGN) compared with ResNet-V2-MAX.
The Inception-V3-GAP classifier can start identifying the protocols in the light noise AWGN scenario, as indicated in Table 3. We also observed that Inception-V3 with the GAP layer performs better than ResNet-V2-GAP in classifying the spectrograms. However, the Inception-V3-GAP classification accuracy is low compared with our emitter classifier based on CDAE weights (CNN3 + AE). Inception-V3-GAP achieves 59.8% at SNR 20 dB in the AWGN channel scenario, which is considered a light noise condition in our experiment, while our CNN3 + AE model achieves 100% accuracy on the same dataset.
In summary, we observed that using DL models such as inception networks, VGG blocks, or residual networks does not improve the classification accuracy for protocols operating in unlicensed bands, even under light noise conditions such as SNR 20 dB with a channel affected by AWGN only (scenario 5). ResNet-V2 and Inception-V3 with the GAP layer achieve much lower accuracy, even in light noise, than our model based on DAE weights (CNN3 + AE), which achieves 100% accuracy. Our (CNN3 + AE) also has the lowest number of parameters among the compared DL models, as highlighted in Table 3.
Complexity-wise, the number of Floating-Point Operations (FLOs) expresses how computationally expensive a CNN model is. The FLOs of our proposed emitter model and the benchmark DL models were computed using the TensorFlow built-in profiler, and Table 4 details the results. The number of FLOs for our proposed model (CNN3 + AE) is 68.4 million, which is much less than that of the benchmark DL models: Inception-V3 and ResNet-V2 require 11.4 billion and 26.4 billion FLOs, respectively. The complexity is therefore very promising for deploying our proposed DL model in real-time applications.
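As a back-of-the-envelope illustration of where such FLO counts come from, the multiply-accumulate cost of a single 'same'-padded convolutional layer can be estimated as follows (a rough rule of thumb, not the TensorFlow profiler's exact accounting):

```python
def conv2d_flops(h, w, c_in, k, c_out):
    # Each of the h * w * c_out output values needs k * k * c_in multiplies
    # and as many additions, hence the factor of 2.
    return 2 * h * w * c_out * k * k * c_in

# Example: a (128, 128, 1) spectrogram through 16 filters of size 3x3.
flops = conv2d_flops(128, 128, 1, 3, 16)
```

Summing such terms over every layer explains why deep stacks like Inception-V3 and ResNet-V2 reach billions of FLOs while a compact pipeline stays in the tens of millions.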
We named our (CNN3 + AE) pipeline the "ConvAE" DL model for protocol identification in the unlicensed spectrum. ConvAE shows very high accuracy in identifying a range of radio-access technologies in the unlicensed bands under harsh conditions, outperforming other well-known DL models in terms of both accuracy and number of FLOs.

Conclusions
In this paper, we examined the use of DL to solve the coexistence problem between various communication technologies, achieving dynamic spectrum sharing and avoiding performance degradation. We studied our proposed DL method under various propagation channel models and very low SNR values. Specifically, we investigated using Convolutional Denoising Autoencoders (CDAEs) for reconstructing corrupted LTE and Wi-Fi spectrograms with the same carrier frequency under various channel scenarios and SNR values. We tested DL models to perform protocol identification for various IEEE 802.11 WLAN protocols and unlicensed LTE using CDAE weights. Our results show the benefit of performing CDAE before classifying the spectrograms under light to strong noise and different channel propagation conditions. Our proposed methodology for CDAE can reconstruct 77% of the corrupted signals sharing the same spectrum, while showing stable performance under severe noise conditions and propagation models. The achieved accuracy is sufficient to restore and preserve the preamble of the corrupted Wi-Fi 802.11 signals or the sub-frames of the transmitted unlicensed LTE signal. Furthermore, our methodology for protocol identification based on CDAE reduces the training parameters, learning time, and the number of FLOs compared to other DL models. Finally, our methodology significantly improves the average accuracy for protocol classification to 91% in identifying radio access technologies in the unlicensed bands compared to other well-known DL models such as VGG16, ResNet-v2, and Inception-V3.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Appendix A. Simulation Setup for the IEEE 802.11 WLAN Channel
The channel medium between the transmitter and receiver in the IEEE 802.11 protocols is affected by multiple channel propagation models and SNR levels. A set of channel models was designed and thoroughly studied to provide sufficient channel models for IEEE 802.11 PHY-layer simulation and performance testing for IEEE 802.11n WLAN [79], 802.11ac [80], 802.11ax [81], and 802.11a [82]. Each channel model is designed for a certain environment. Each channel has a certain number of clusters, where each cluster includes a set of taps. The number of clusters, the number of taps, the root-mean-square (RMS) delay (σ_RMS), the maximum delay (σ_Max), and the standard deviation σ of shadow fading are all characterized in each model for Line-of-Sight (LOS) and NLOS conditions. Moreover, there are two key signal propagation models for WLAN channel modeling: (1) large-scale propagation (or large-scale path loss) and (2) small-scale fading.

For large-scale path loss, the considered path-loss model L(d) for the indoor WLAN channel is based on the following equation:

L(d) = L_FS(d) for d ≤ d_BP,
L(d) = L_FS(d_BP) + 35 log10(d / d_BP) for d > d_BP,

where d is the distance between the transmitter and the receiver and L_FS is the free-space loss (log-distance model), giving a slope of 2 up to the break-point distance d_BP and 3.5 after d_BP.
The d_BP for our channel model is 5 m. Generally, the breakpoint distance is assumed to be the boundary between LOS and NLOS conditions [79]. The free-space loss is given by

L_FS(d) = 20 log10(4πd / λ) − 10 log10(G_t G_r),

where G_t and G_r are the transmitter and receiver gains, which we assume equal to 1, and λ is the wavelength of the transmitted carrier frequency f_c at the speed ν of the radio frequency (RF) signal, λ = ν / f_c. For our implementation, the RF signal speed ν (approximately the speed of light) is ν = c = 3 × 10^8 m/s.
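Under these assumptions (unity antenna gains, break-point at 5 m), the path-loss model can be evaluated numerically as follows:

```python
import math

def free_space_loss_db(d, wavelength):
    # Friis free-space loss in dB with unity antenna gains.
    return 20 * math.log10(4 * math.pi * d / wavelength)

def indoor_path_loss_db(d, wavelength, d_bp=5.0):
    # Slope 2 (free space) up to the break-point distance, slope 3.5 beyond:
    # 35 dB per decade of distance past d_bp.
    if d <= d_bp:
        return free_space_loss_db(d, wavelength)
    return free_space_loss_db(d_bp, wavelength) + 35 * math.log10(d / d_bp)

lam = 3e8 / 5.25e9  # wavelength at f_c = 5.25 GHz, ~0.0571 m
loss_at_10m = indoor_path_loss_db(10.0, lam)
```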
In our simulations, we consider the channel path-loss model in an indoor, open environment. Thus, the Model B delay profile is selected as the channel model for IEEE 802.11ac, 802.11ax, and 802.11n. Model B represents typical sizable open-space and office environments with NLOS conditions and a 100 ns RMS delay spread for IEEE 802.11. The parameters used to model path loss for IEEE 802.11n PHY-layer simulation are detailed in [79], for the IEEE 802.11ac PHY layer in [80], and for the IEEE 802.11ax PHY layer in [83]. The channel for IEEE 802.11a is simulated as a Rayleigh channel model, with a sample rate f_s = 20 MHz and a carrier frequency f_c of 5.25 GHz.
Shadow fading (also known as log-normal shadowing) is also considered in our analysis. It is modeled with a zero-mean Gaussian random variable with standard deviation σ and added to the path-loss model:

PL(d) = L(d) + X_σ,

where X_σ is the random variable and the value of σ differs before and after the break-point distance d_BP, as described in [79]. For our analysis and propagation models, small-scale fading [84] is also considered. Small-scale fading distorts the transmitted WLAN signal; it is induced by physical aspects such as mobile speed, multipath propagation, the motion of surrounding objects, and the transmission bandwidth of the signal [84]. The motion of people around the environment causes a Doppler spread f_d, defined as

f_d = ν_o / λ,

where ν_o is the environmental speed. For our experiment, where f_c is 5.25 GHz, the wavelength is λ = (3 × 10^8 m/s) / (5.25 × 10^9 Hz) ≈ 0.0571 m. f_d is calculated and accounted for in the simulated environment for the received signal of each WLAN IEEE 802.11 protocol. The typical walking speed in indoor environments is approximately 1.2 km/h (0.333 m/s) [79]. The worst-case frequency shift, known as the Doppler spread, is around 5.8 Hz for WLAN packets.
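The Doppler-spread figures above can be checked with a few lines of arithmetic:

```python
f_c = 5.25e9          # carrier frequency (Hz)
c = 3e8               # RF propagation speed (m/s), ~ speed of light
lam = c / f_c         # wavelength, ~0.0571 m

v_walk = 1.2 / 3.6    # typical indoor walking speed: 1.2 km/h in m/s
f_d = v_walk / lam    # Doppler spread f_d = v_o / lambda, ~5.8 Hz
```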

Appendix B. Simulation Setup for LTE
Unlicensed LTE is based on the 3GPP Release 12 LTE technology and operates in the unlicensed spectrum. It is used as a secondary cell within the LTE carrier-aggregation framework, anchored by a licensed primary cell. In our simulation, the LTE downlink system is considered.

Appendix B.1. Physical Layer for LTE
The downlink physical layer of LTE is based on orthogonal frequency-division multiple access (OFDMA).
For our simulation, LTE is built with a single transmitter and a single receiver. The LTE frame consists of 10 individually generated subframes, and the frame is OFDM-modulated. The LTE downlink frame structure uses Frequency Division Duplexing (FDD), and each user is assigned a different time/frequency Resource Block (RB). The simulation parameters of the evolved NodeB (eNB) LTE subframe are presented in Table A3.

Appendix B.2. LTE Channel Conditions
In LTE, RB is assigned to user equipment (UE) from eNB in the downlink channel. The LTE signal suffers from signal degradation due to a dynamic change of the distance between eNB and UE, radio power level, noise level, and multipath fading effects.
The downlink unlicensed-LTE signal is affected by the following: (1) LTE moving channel of propagation conditions which implements scenario 1 for an extended typical urban environment using an (ETU200) Rayleigh fading model with 200 Hz Doppler shift and changing delays, (2) LTE moving channel with scenario 2 which represents a single non-fading model as specified in [85,86], (3) LTE fading channel for a multipath fading MIMO channel propagation conditions using a Generalized Method of Exact Doppler Spread (GMEDS) for a Rayleigh fading model type [87], and (4) AWGN.
In our simulation, the LTE radio signals are exposed to a range of light to severe noise conditions, with SNR from 0 dB to 20 dB, in addition to the three main propagation channel models and AWGN listed above. The LTE moving propagation scenarios are implemented as specified in TS 36.104, Annex B.4 [85], and the parameters of the LTE fading channel model are specified in [87]. A summary of the implemented channel conditions for unlicensed LTE is presented in Table A4.