3D convolutional neural networks based automatic modulation classification in the presence of channel noise

Automatic modulation classification (AMC) is an essential task in many intelligent communication systems, such as fibre-optic systems, next-generation 5G or 6G systems, cognitive radio and multimedia internet-of-things networks. Deep learning (DL) is a representation learning method that takes raw data and finds representations for tasks such as classification and detection. DL techniques like Convolutional Neural Networks (CNNs) have strong potential to process and analyse large volumes of data. In this work, we consider the multiclass (eight-class) classification of modulated signals, namely Binary Phase Shift Keying, Quadrature Phase Shift Keying, and 16- and 64-Quadrature Amplitude Modulation, corrupted by Additive White Gaussian Noise and by Rician and Rayleigh fading channels. We use 3D-CNN architectures in both the frequency and spatial domains and deploy three data augmentation approaches (random zoom in/out, random shift and random weak Gaussian blurring) together with a cross-validation (CV) based statistical approach to hyperparameter selection. Simulation results show that 10-fold CV without augmentation in the spatial domain performs best, while 10-fold CV without augmentation in the frequency domain performs worst, and we found learning in the spatial domain

Deep learning (DL) has reshaped the landscape of modern technology, and neural networks in particular have been instrumental in changing communication technologies worldwide. Convolutional Neural Networks (CNNs) are a powerful family of neural networks for learning from data, with wide applications in image recognition, object detection [17,18] and semantic segmentation. Desirable properties of the features learned by CNNs include spatial invariance, translation invariance and locality, while typical components of these networks are convolutional layers, padding and stride operations, maximum and average pooling layers, fully connected (FC) layers, dropout layers and batch normalization layers. For object recognition, deformations such as pose changes, affine transformations such as scaling, translation, rotation and shear, and optical flow are widely used as augmentation methods to synthetically increase the size of datasets [19]. Using colour information instead of a grayscale image may also improve prediction performance [20]. Spatial transformation methods such as per-pixel flow, mean blur and differentiable bilinear interpolation can also be used to deform the input images, benefiting many visual recognition tasks [21]. The performance of a CNN in a classification task depends on several aspects, including whether the requirements of the Nyquist sampling principle are fulfilled. Small shifts in the input can drastically change the output of a CNN. Classic anti-aliasing may improve the shift equivariance of deep networks, leading to better generalization. Corruptions such as salt and pepper noise, masking noise and additive isotropic Gaussian noise make it difficult to learn a useful representation. In general, CNN performance is robust to corruptions such as rotation, scaling, blurring and noise variants.
Literal shift equivariance cannot hold under subsampling and is recovered only when features can be extracted densely. Shift equivariance is lost in modern deep networks because commonly used downsampling layers, such as strided convolution, max pooling and average pooling, ignore the Nyquist sampling theorem [22].
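The loss of shift equivariance under strided downsampling can be illustrated with a minimal sketch. The helper names `max_pool_1d` and `shift_right` and the toy signal are hypothetical and chosen only for illustration; they are not taken from [22].

```python
def max_pool_1d(signal, window=2, stride=2):
    """Max pooling over a 1D list with the given window and stride."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, stride)]


def shift_right(signal, k=1, fill=0):
    """Shift a 1D list right by k samples, padding the front with `fill`."""
    return [fill] * k + signal[:len(signal) - k]


# Toy input (an assumption for illustration only).
x = [0, 0, 1, 0, 0, 1, 0, 0]

pooled = max_pool_1d(x)                           # [0, 1, 1, 0]
pooled_of_shifted = max_pool_1d(shift_right(x))   # [0, 1, 0, 1]

# A one-sample input shift does not map to any shift of the pooled output,
# whereas a shift by the stride (two samples) does.
print(pooled, pooled_of_shifted)
print(max_pool_1d(shift_right(x, 2)) == shift_right(pooled, 1))  # True
```

In this toy case, only shifts that are multiples of the stride commute with the pooling operation, which is the behaviour the sampling argument above predicts.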
Modulation is a technique by which a message-carrying signal is superimposed on a carrier signal for transmission. Commonly used modulation schemes are Binary Phase Shift Keying (BPSK), Quadrature Phase Shift Keying (QPSK) and Quadrature Amplitude Modulation (QAM). The term PSK is widely used in a radio communication system. This method is largely compatible with data transmissions. It allows information to be transmitted through a radio communications signal in a more efficient manner as compared to other modulation techniques. QAM combines amplitude and phase information at different levels and has wide applications in internet services and digital cable television.
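As a rough illustration of these schemes, the sketch below generates ideal constellation points for BPSK, QPSK, 16-QAM and 64-QAM. The function names, the unit-power normalization and the QPSK phase offset of π/4 are assumptions made for this example and are not taken from the paper's signal generator.

```python
import cmath
import math


def psk_constellation(order, phase_offset=0.0):
    """Points of an M-PSK constellation on the unit circle."""
    return [cmath.exp(1j * (2 * math.pi * k / order + phase_offset))
            for k in range(order)]


def square_qam_constellation(order):
    """Points of a square M-QAM constellation, scaled to unit average power."""
    side = math.isqrt(order)
    if side * side != order:
        raise ValueError("order must be a perfect square, e.g. 16 or 64")
    levels = [2 * i - (side - 1) for i in range(side)]   # e.g. -3, -1, 1, 3
    points = [complex(i, q) for i in levels for q in levels]
    mean_power = sum(abs(p) ** 2 for p in points) / len(points)
    return [p / math.sqrt(mean_power) for p in points]


bpsk = psk_constellation(2)                  # phases 0 and pi
qpsk = psk_constellation(4, math.pi / 4)     # phases pi/4, 3pi/4, 5pi/4, 7pi/4
qam16 = square_qam_constellation(16)
qam64 = square_qam_constellation(64)
```

PSK points all lie on one circle and differ only in phase, while the QAM points occupy several amplitude levels, which is the amplitude-and-phase combination described above.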
Blurring is often used as a first step before operations such as thresholding, edge detection or contour finding. Applying a low-pass blurring filter smooths edges and removes noise from an image. A Gaussian blur is a low-pass filter, so blurring is tolerant of changes in the high-frequency range. Downsampling an image allows sharper blurred areas to exist. A complicated but largely image-independent relationship exists between corresponding blur levels in images at different resolutions, which can be clarified by a blur magnitude model studied as a function of spatial frequency. The human visual system responds unevenly to different frequency components: it is sensitive to many frequency components but treats them unequally. CNN models are more sensitive to low-frequency components than to high-frequency components and in this respect resemble the human visual system. Owing to this spectral bias, CNN models cannot keep unimportant frequency channels during inference without a loss of accuracy. The discrete cosine transform (DCT) represents a finite data sequence as a sum of cosine functions oscillating at different frequencies.
DCT is used in a wide range of applications such as digital audio, speech coding, digital image, digital video and digital radio. DCT is a Fourier related transform but uses only real numbers [23].
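A minimal sketch of a one-dimensional DCT may make this concrete. It assumes the unnormalized DCT-II variant, which is the most common form; the paper does not state which DCT variant or normalization it uses, and the names `dct2` and `idct2` are hypothetical.

```python
import math


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_samples = len(x)
    return [sum(x[n] * math.cos(math.pi / n_samples * (n + 0.5) * k)
                for n in range(n_samples))
            for k in range(n_samples)]


def idct2(coeffs):
    """Inverse of dct2 (a scaled DCT-III)."""
    n_samples = len(coeffs)
    return [(coeffs[0] / 2
             + sum(coeffs[k] * math.cos(math.pi / n_samples * (n + 0.5) * k)
                   for k in range(1, n_samples))) * 2 / n_samples
            for n in range(n_samples)]


# A constant sequence has all of its energy in the zero-frequency coefficient.
print(dct2([1.0, 1.0, 1.0, 1.0]))
```

Both the input and the output are real-valued, in line with the remark that the DCT uses only real numbers.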
The Additive White Gaussian Noise (AWGN) is invariant to rotation about the origin of the signal space. It is circularly symmetric in every direction of the signal space, models noise inherent to the information system with a constant power spectral density across the frequency band, and follows a normal distribution with a time-domain mean of zero. Rayleigh fading is a reasonable model when many objects in the environment scatter the signal before it reaches the receiver. If there is sufficient scattering, the central limit theorem implies that the channel impulse response is well modelled as a Gaussian process irrespective of the distribution of the individual components. If there is no dominant scattering component, the process has zero mean and its phase is uniformly distributed between 0 and 2π radians. Rician fading characterizes environments where the transmitter and the receiver have a line of sight (LOS) or a clear specular path. The Rician K-factor is the ratio of the LOS (specular) power to the scattered power, and the squared envelope closely follows a non-central chi-square distribution with two degrees of freedom.
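The three channel models can be sketched for a flat-fading, single-tap case as follows. This is an illustrative simulation under stated assumptions (complex baseband symbols, one gain per draw, hypothetical function names) and is not the generator used to build the paper's dataset.

```python
import math
import random


def awgn(symbols, noise_variance, rng):
    """Add circularly symmetric complex Gaussian noise of total variance `noise_variance`."""
    sigma = math.sqrt(noise_variance / 2)      # standard deviation per real dimension
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in symbols]


def rayleigh_gain(rng, omega=1.0):
    """Zero-mean complex gain h; |h| is Rayleigh distributed with E[|h|^2] = omega."""
    sigma = math.sqrt(omega / 2)
    return complex(rng.gauss(0, sigma), rng.gauss(0, sigma))


def rician_gain(rng, k_factor, omega=1.0):
    """LOS term plus scattered term, with K = LOS power / scattered power."""
    los = math.sqrt(k_factor * omega / (k_factor + 1))
    scattered = rayleigh_gain(rng, omega / (k_factor + 1))
    return los + scattered


rng = random.Random(0)                          # fixed seed, for reproducibility only
received = awgn([rician_gain(rng, k_factor=4.0) * s for s in (1 + 0j, -1 + 0j)],
                noise_variance=0.1, rng=rng)
```

With no LOS term the gain has zero mean and a uniformly distributed phase, which is the Rayleigh case; increasing `k_factor` moves the gain towards a fixed, non-fading value.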
The many AMC approaches suggested in the literature can be divided into two distinct categories: feature-based (FB) [24]-[27] and likelihood-based (LB) methods [28]-[32]. LB methods treat AMC as a hypothesis testing problem and are optimal in the Bayesian sense, minimizing the probability of misclassification; however, under restricted computing resources the time per decision is not feasible. FB classifiers, on the other hand, are computationally efficient and can achieve nearly optimal performance when properly designed. FB algorithms, including cyclic statistics, wavelet transforms and cumulant-based methods, extract features in order to identify modulation schemes, and are favoured in practice as suboptimal classifiers. In FB approaches, feature extraction is performed during the pre-processing stage and is followed by the classification stage. Traditional FB methods rely largely on expert knowledge, which enables them to do well in some contexts, but they suffer from higher computational complexity and limited generalization.
Several studies in the literature have been proposed that are aimed at designing specialized features for the recognition of breast cancer [39]- [41], future robust networks for 6G [42], classification of digital modulated images [43], in-band spectral variation and deviation from unit circle while utilizing Nesterov accelerated adaptive moment estimation technique and the classifier based on the artificial neural network, which carries out AMC across a wide range of signal to noise ratio (SNR), multi-gene genetic programming (MGP) based on features that transforms cumulative sample estimates into highly discriminatory, iterative features before maximal MGP features are achieved and to determine the final classification performance of the MGP features, while taking advantage of the structural risk minimization principle [4]. Other works include the integrating of a new Nelder-Mead channel estimator into the radio frequency distinct features fingerprinting technique, as well as utilizing a multipath system with degraded SNR [8], a blind modulation classification algorithm using discrete Fourier transform to check the existence of a synchronization defect, that is a timing-phase offset and frequency without previous knowledge on the signal and channel parameters for the QPSK, BPSK, Minimal Shift Keying (MSK), 16-QAM and Offset-Quadrature Phase Shift Keying (OQPSK) schemes [10], utilizing higher order cumulants and signal spectral features to train K-Nearest Neighbour (KNN) classifiers and Support Vector Machine (SVM) [15], and a block coordinate descent dictionary learning algorithm for multiclass classification between QPSK, 8-PSK, 8-QAM, 16-QAM, Quadrature Amplitude Shift Keying (QASK), and 8-ASK modulation schemes [16].
In addition to these traditional methods, DL has emerged as an approach to AMC tasks. Methods in the literature include auto-encoding neural networks that extract and classify features for millimetre waves over fibre-optic communication systems [7], a DL-based auto-encoder that extracts spectrum-representative features to accurately classify waveforms as idle, jammer or Wireless Fidelity [5], a deep neural network made of FC layers for a multiple input and multiple output (MIMO) orthogonal frequency-division multiplexing (OFDM) system for QPSK, BPSK, 64-QAM and 16-QAM modulation schemes [10], and an improved CNN-based AMC network to classify among 8-PSK, Double Side Band Amplitude Modulation (AM-DSB), BPSK, Wideband Frequency Modulation (WBFM), Gaussian FSK, 16-QAM, 64-QAM, Continuous Phase Frequency Shift Keying (CPFSK) and 4-Pulse Amplitude Modulation (4-PAM) schemes in beyond fifth generation communication systems [14]. In the OFDM system, a CNN-based AMC approach is used to consider phase offset for the classification of 8-PSK, 16-QAM, QPSK and BPSK modulation schemes [12]. Other works include an ensemble deep neural network employing Euclidean distance based rectified linear unit (ReLU) activation functions for the classification of 16-QAM, 64-QAM, QPSK and BPSK modulation schemes [13], a DL-based radio frequency signal classifier for the classification of BPSK, QPSK, Continuous Phase Modulation (CPM), Gaussian FSK, 16-QAM and Gaussian MSK modulation schemes [14], a combination of two CNNs for the classification of cognitive radio (CR) based signals representing BPSK, QPSK, 8-PSK, Gaussian FSK, CPFSK, 4-PAM, 16-QAM and 64-QAM modulation schemes [6], as well as a feed-forward deep neural network based multiclass classifier, made of FC layers, for adaptive spatial modulation MIMO systems [33].
Unlike other works, in this paper we consider the multiclass (eight-class) classification of modulated signals, namely BPSK, QPSK, 16-QAM and 64-QAM signals affected by AWGN, Rician and Rayleigh fading channels, employing a 3D-CNN architecture in both the frequency and spatial domains with three data augmentation techniques: random zoom in/out, random weak Gaussian blurring and random shift. We employed augmentation only in the spatial domain during training of the DL architecture. The remainder of the paper is organized as follows. The mathematical background is given in Section 2, the datasets are explained in Section 3, the 3D-CNN architectures are explained in Section 4, Section 5 describes the experiments, and results and discussion are given in Section 6. Finally, conclusions are given in Section 7.

MATHEMATICAL BACKGROUND
In this section, we give the mathematical background for the channels used in the simulations of this study, namely the AWGN, Rayleigh and Rician fading channels. We then describe the convolution operation, followed by the modulation schemes used in the experiments (BPSK, QPSK, 16-QAM and 64-QAM) and the frequency-domain (DCT) operation.

The AWGN channel is defined at discrete time index $i$ by a series of outputs $Y_i$, where $Y_i$ is the sum of the input $IN_i$ and the noise $NO_i$. The noise $NO_i$ is independent and identically distributed, drawn from a zero-mean normal distribution with variance $N$, and is assumed to be uncorrelated with $IN_i$. Mathematically, $Y_i = IN_i + NO_i$ with $NO_i \sim \mathcal{N}(0, N)$.

In Rayleigh fading, the channel impulse response is well modelled as a Gaussian process regardless of how each component is distributed. If there is no dominant scattering component, the process has zero mean and its phase is uniformly distributed between 0 and $2\pi$ radians. The envelope of the channel response is therefore Rayleigh distributed. Mathematically, $p_R(r) = \frac{2r}{\Omega} e^{-r^2/\Omega}$ for $r \ge 0$, where $\Omega = E(R^2)$.

Two parameters characterize a Rician fading channel. The first, $K$, is the ratio of the power in the direct path to the power in the other scattered paths. The second, $W$, is the total power from both paths and acts as a scaling factor for the distribution.

3D convolution is accomplished by convolving a 3D kernel with the cube formed by stacking several contiguous frames together. By this design, the feature maps in the convolutional layer are connected to several contiguous frames in the previous layer. The value of the $j$th feature map in the $i$th layer at point $(x, y, z)$ is defined in terms of $R_i$, the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$, the $(p, q, r)$th kernel value connected to the $m$th feature map of the previous layer. Since several kernels are convolved with the input layer, the output contains a stack of activation maps.

BPSK, functionally equivalent to 2-QAM modulation, uses two phases, 0° and 180°, separated by 180°. QPSK, functionally equivalent to 4-QAM modulation, uses four points on the constellation diagram equally spaced around a circle, at phases $\pi/4$, $3\pi/4$, $5\pi/4$ and $7\pi/4$. In a QAM signal, one carrier lags the other by 90°. The modulating function of one carrier is the in-phase component $I(t)$ and that of the other is the quadrature component $Q(t)$; together they carry the amplitude and phase information. The DCT expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies.
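For reference, commonly used textbook forms of the expressions discussed in this section are sketched below. These are standard forms rather than the authors' own equations: the activation $f$, the bias $b_{ij}$, the kernel sizes $P_i$ and $Q_i$, the carrier frequency $f_c$, the energies $E_b$ and $E_s$, the durations $T_b$ and $T_s$, and the choice of the DCT-II variant are all assumptions, so the notation may differ from the original paper.

```latex
\begin{align}
% Rician envelope density with K-factor K and total power W
p_R(r) &= \frac{2(K+1)\,r}{W}
          \exp\!\left(-K - \frac{(K+1)\,r^{2}}{W}\right)
          I_0\!\left(2\sqrt{\frac{K(K+1)}{W}}\; r\right), \quad r \ge 0 \\
% 3D convolution: value at (x, y, z) of the j-th feature map in the i-th layer
v_{ij}^{xyz} &= f\!\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1}
          w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \\
% BPSK: two phases separated by 180 degrees
s_n(t) &= \sqrt{\frac{2E_b}{T_b}}\,\cos\!\bigl(2\pi f_c t + \pi(1-n)\bigr), \quad n \in \{0, 1\} \\
% QPSK: phases pi/4, 3pi/4, 5pi/4 and 7pi/4
s_n(t) &= \sqrt{\frac{2E_s}{T_s}}\,\cos\!\left(2\pi f_c t + (2n-1)\frac{\pi}{4}\right), \quad n \in \{1, 2, 3, 4\} \\
% QAM: in-phase and quadrature components on carriers 90 degrees apart
s(t) &= I(t)\cos(2\pi f_c t) - Q(t)\sin(2\pi f_c t) \\
% DCT-II of a finite sequence x_0, ..., x_{N-1}
X_k &= \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \tfrac{1}{2}\right)k\right], \quad k = 0, \dots, N-1
\end{align}
```

Here $I_0$ denotes the modified Bessel function of the first kind of order zero.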

DATASET DESCRIPTION
We

DESCRIPTION OF THE 3D-CNN ARCHITECTURES
We used two 3D-CNN architectures for the experiments, which are shown in Figures 2 and 3. The only difference between these architectures is the number of filters: we used more filters for the experiments with more data in the training set, such as those involving augmentation methods. As shown in Figures 2 and 3, the input layer has a size of 297 × 167 × 10 and applies zero-centre normalization: the mean is subtracted from each dimension (channel) so that the data cloud is centred at the origin, and each dimension is then divided by its standard deviation so that all dimensions have approximately the same scale. After that, feature maps are created by a 3D convolutional layer that moves a filter of size 3 × 3 × 3 over the input, with the number of feature maps set to 12 or 11 depending on the number of samples in the training set. We set the bias and weight L2 regularization factors to 0.00005, since smaller but non-zero weights yield a simpler model that can still learn complex data patterns and helps avoid overfitting by mitigating noise in the samples. After that, a batch normalization layer [34] dynamically normalizes the inputs on a per mini-batch basis, which has been shown to improve the training time while avoiding overfitting. After that, an exponential linear unit (ELU) layer [35] is added to speed up learning by pushing mean activations closer to zero. Mathematically, it can be described as $f(x) = x$ for $x > 0$ and $f(x) = \alpha(e^{x} - 1)$ for $x \le 0$, where $\alpha > 0$ is a scale parameter.

FIGURE 1 A sample digital modulated image of all classes used for the multiclass classification task

The pooling process progressively decreases the spatial resolution of hidden representations, aggregating information so that each hidden node in the higher layers is sensitive to a larger receptive field in the input. The pooling operator consists of a fixed-shape window that slides over all regions of the input according to its stride, computing a single output at each location it traverses. Max pooling computes the maximum value in the pooling window. The FC layer (also known as the dense layer or the inner product layer) is fully connected to all activation maps of the previous layer: the input is multiplied by a weight matrix and a bias offset is added. The FC layer is somewhat similar to the convolutional layer; the difference is that the convolutional layer is connected only to a local region of the input, whereas the FC layer is connected to all inputs, and one can easily be converted into the other. The dropout operation [36] helps mitigate overfitting. It works by injecting noise into each layer of the network during training, zeroing out some fraction of the nodes in the individual layers. Finally, the softmax function allows the outputs of the 3D-CNN architectures to be interpreted as probabilities, with the model parameters optimized to produce probabilities that maximize the likelihood of the observed data. The classification layer places the outputs into one of the eight classes.
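The element-wise and dense operations described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, `alpha=1.0` is an assumed ELU scale, and the "inverted" dropout scaling is one common convention that the paper does not specify.

```python
import math
import random


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def dense(inputs, weights, bias):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each node with probability `rate` and rescale the rest."""
    if not training or rate == 0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def softmax(logits):
    """Numerically stable softmax turning scores into class probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy forward pass through a 2-input, 8-class head (weights are placeholders).
rng = random.Random(0)
hidden = [elu(v) for v in (-0.5, 1.2)]
scores = dense(dropout(hidden, 0.1, rng), [[0.1 * k, -0.1 * k] for k in range(8)], [0.0] * 8)
probabilities = softmax(scores)
```

The resulting `probabilities` list has one entry per class and sums to one, so the predicted class is simply the index of its largest entry.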
In our proposed architecture, we added three FC (dense) layers, the final one having 8 neurons to place the input into one of the 8 categories. The first two dense layers have 500 and 300 neurons, respectively, to capture the feature activations passed on from the convolutional layers. We then added a dropout layer with a probability of 0.1 right before the softmax layer to mitigate the disharmony between batch normalization and dropout caused by the variance shift phenomenon [38]. We chose this architecture to keep the number of parameters to a minimum without sacrificing performance. Figure 4 displays another view of the proposed 3D-CNN architecture.
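To make the 3D convolutional layer of this section concrete, the following sketch computes the output size of a 3 × 3 × 3 filter on a 297 × 167 × 10 input and applies a naive single-channel 3D convolution to a toy volume. The stride and padding values are assumptions, since the section does not state them, and the function names are hypothetical.

```python
def conv3d_output_shape(input_shape, kernel_shape, stride=1, padding=0):
    """Output size per axis for a 3D convolution with uniform stride and padding."""
    return tuple((n + 2 * padding - k) // stride + 1
                 for n, k in zip(input_shape, kernel_shape))


def conv3d_valid(volume, kernel, bias=0.0):
    """Naive single-channel 3D cross-correlation with stride 1 and no padding."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for x in range(depth - kd + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for z in range(width - kw + 1):
                acc = bias
                for p in range(kd):
                    for q in range(kh):
                        for r in range(kw):
                            acc += kernel[p][q][r] * volume[x + p][y + q][z + r]
                row.append(acc)
            plane.append(row)
        output.append(plane)
    return output


# Assumed settings: stride 1, with either no padding or "same"-style padding of 1.
print(conv3d_output_shape((297, 167, 10), (3, 3, 3)))             # (295, 165, 8)
print(conv3d_output_shape((297, 167, 10), (3, 3, 3), padding=1))  # (297, 167, 10)
```

Each of the 12 (or 11) feature maps would correspond to one such kernel applied to the same input volume, producing the stack of activation maps that the later layers consume.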

EXPERIMENTS
We conducted a number of experiments using 10-fold and 5-fold cross-validation (CV) approaches to choose the optimum set of hyperparameters. The experiments were carried out in the spatial domain with and without augmentation techniques, and without augmentation in the frequency (DCT) domain. We performed the following experiments:
1. Multiclass (8-class) classification of modulated signals augmented by the combined random shift, random weak Gaussian blurring and random zoom in/out augmentation techniques in the spatial domain using the 5-fold CV approach.
2. Multiclass (8-class) classification of modulated signals augmented by the combined random weak Gaussian blurring, random shift and random zoom in/out augmentation techniques in the spatial domain using the 10-fold CV approach.
3. Multiclass (8-class) classification of modulated signals augmented only by the random weak Gaussian blurring augmentation scheme in the spatial domain using the 5-fold CV approach.
4. Multiclass (8-class) classification of modulated signals augmented only by the random weak Gaussian blurring augmentation scheme in the spatial domain using the 10-fold CV approach.
5. Multiclass (8-class) classification of modulated signals augmented only by the random shift augmentation scheme in the spatial domain using the 5-fold CV approach.
6. Multiclass (8-class) classification of modulated signals augmented only by the random shift augmentation scheme in the spatial domain using the 10-fold CV approach.
7. Multiclass (8-class) classification of modulated signals augmented only by the random zoom in/out augmentation scheme in the spatial domain using the 5-fold CV approach.
8. Multiclass (8-class) classification of modulated signals augmented only by the random zoom in/out augmentation scheme in the spatial domain using the 10-fold CV approach.
9. Multiclass (8-class) classification of modulated signals without augmentation in the spatial domain using the 5-fold CV approach.
10. Multiclass (8-class) classification of modulated signals without augmentation in the spatial domain using the 10-fold CV approach.
11. Multiclass (8-class) classification of modulated signals without augmentation in the frequency domain using the 5-fold CV approach.
12. Multiclass (8-class) classification of modulated signals without augmentation in the frequency domain using the 10-fold CV approach.
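The three augmentation schemes named in these experiments can be sketched on a single 2D slice as follows. All parameter ranges (`max_shift`, `max_sigma`, the zoom limits), the zero fill value and the nearest-neighbour resampling are assumptions made for illustration; the paper does not report the settings it used.

```python
import math
import random


def random_shift(image, max_shift, rng, fill=0.0):
    """Translate a 2D image by a random integer offset, padding with `fill`."""
    rows, cols = len(image), len(image[0])
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return [[image[r - dy][c - dx] if 0 <= r - dy < rows and 0 <= c - dx < cols else fill
             for c in range(cols)] for r in range(rows)]


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian weights on the integer grid [-radius, radius]."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def weak_gaussian_blur(image, rng, max_sigma=1.0, radius=2):
    """Separable Gaussian blur with a small random sigma; borders are clamped."""
    kernel = gaussian_kernel_1d(rng.uniform(0.1, max_sigma), radius)
    rows, cols = len(image), len(image[0])

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    horizontal = [[sum(kernel[k + radius] * image[r][clamp(c + k, cols)]
                       for k in range(-radius, radius + 1))
                   for c in range(cols)] for r in range(rows)]
    return [[sum(kernel[k + radius] * horizontal[clamp(r + k, rows)][c]
                 for k in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def random_zoom(image, rng, low=0.9, high=1.1, fill=0.0):
    """Nearest-neighbour zoom in or out about the image centre by a random factor."""
    rows, cols = len(image), len(image[0])
    factor = rng.uniform(low, high)
    centre_r, centre_c = (rows - 1) / 2, (cols - 1) / 2
    zoomed = []
    for r in range(rows):
        row = []
        for c in range(cols):
            src_r = round(centre_r + (r - centre_r) / factor)
            src_c = round(centre_c + (c - centre_c) / factor)
            inside = 0 <= src_r < rows and 0 <= src_c < cols
            row.append(image[src_r][src_c] if inside else fill)
        zoomed.append(row)
    return zoomed
```

In the combined experiments, each of the three functions would be applied to the training samples to produce additional copies, while the validation fold is left untouched.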
For all experiments that use the 10-fold CV approach, as well as those that use 5-fold CV with augmentation, we chose a mini-batch size of 2, an initial learning rate of 0.001 and a maximum of 30 epochs. The training set was shuffled after every epoch to mitigate overfitting, and a piecewise learning rate schedule lowered the learning rate every 5 epochs by multiplying it by a factor of 0.1. Adam [37] was used as the optimizer and categorical cross-entropy as the loss function. The number of feature maps in the convolutional layer of the 3D-CNN architecture was set to 12. Training was conducted on a single NVIDIA Titan RTX graphics processing unit.
For all experiments that use the 5-fold CV approach without augmentation in the frequency and spatial domains, we chose a mini-batch size of 2, an initial learning rate of 0.001 and a maximum of 30 epochs. The training set was shuffled after every epoch to mitigate overfitting, and a piecewise learning rate schedule reduced the learning rate every 5 epochs by multiplying it by a factor of 0.1. Adam was used as the optimizer and categorical cross-entropy as the loss function. The number of feature maps in the convolutional layer of the 3D-CNN architecture was set to 11.
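The piecewise learning rate schedule described above can be written as a short function. The function name and the zero-based epoch index are assumptions of this sketch; the numerical settings are those stated in the text.

```python
def piecewise_learning_rate(epoch, initial_rate=0.001, drop_factor=0.1, drop_period=5):
    """Learning rate at a zero-based epoch, multiplied by `drop_factor` every `drop_period` epochs."""
    return initial_rate * drop_factor ** (epoch // drop_period)


# Learning rate for each of the 30 epochs.
schedule = [piecewise_learning_rate(epoch) for epoch in range(30)]
for epoch in (0, 5, 10, 29):
    print(epoch, schedule[epoch])
```

Under these settings the rate falls by five orders of magnitude over the 30 epochs, so most of the learning takes place in the first ten to fifteen epochs.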
For the experiments involving 5-fold CV approach without augmentation in the frequency and spatial domains, the validation set has 96 samples per class while the training set has 384 samples per class. For the experiments involving 5-fold CV approach with single augmentation technique in the spatial domain, the validation set has 96 samples per class while the training set has 768 samples per class. For the experiments involving 5-fold CV approach with combined augmentation techniques in the spatial domain, the validation set has 96 samples per class while the training set has 1536 samples per class.
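The 5-fold set sizes quoted above are consistent with 480 samples per class, a figure inferred here from 96 + 384 rather than stated in this section. The sketch below reproduces them under two further assumptions: each augmentation technique adds one extra copy of every training sample, and the validation fold is never augmented. The function name is hypothetical.

```python
def fold_sizes(samples_per_class, folds, augmentation_copies=0):
    """Per-class (training, validation) sizes for one fold of k-fold CV."""
    validation = samples_per_class // folds
    training = (samples_per_class - validation) * (1 + augmentation_copies)
    return training, validation


print(fold_sizes(480, 5))                          # no augmentation
print(fold_sizes(480, 5, augmentation_copies=1))   # single augmentation technique
print(fold_sizes(480, 5, augmentation_copies=3))   # three combined techniques
```

With these assumptions the function returns 384, 768 and 1536 training samples per class, each paired with 96 validation samples, matching the figures in the text.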
For the experiments involving 10-fold CV approach without augmentation in the frequency and spatial domains, the validation set has 48

RESULTS AND DISCUSSION
Tables 1-6 present the results of the experiments performed for the multiclass classification of signals in the presence of fading. We used relative classifier information (RCI), confusion entropy (CEN), index of balanced accuracy (IBA), geometric mean (GM) and Matthews correlation coefficient (MCC) as our performance metrics. The RCI metric is an entropy-based measure that quantifies how much the uncertainty of the decision problem is reduced by the classifier, relative to classifying by simply using the prior probabilities of each class. It corrects for differences in the prior probabilities of the categories, as well as for the number of categories. Values of this measure lie in the interval between 0 and 1, where values close to 1 represent better classification.
CEN is an information-theoretic measure based on the idea of entropy for measuring classifier performance. It evaluates the confusion level of the class distribution of misclassified samples. CEN measures the entropy generated by misclassified cases, considering not only how the cases of each class have been misclassified into other classes, but also how the cases of the other classes have been misclassified as belonging to this class, as well as the entropy inside well-classified cases. Small values of CEN represent less information loss and better classification.
GM focuses only on the recall of each class, which is aggregated multiplicatively. It is defined as the square root of the product of sensitivity and specificity. Higher values of this measure indicate better classification performance.
MCC is a coefficient of correlation between the classifications that are observed and predicted. Its values lie in the interval between -1 and +1, where +1 indicates perfect classification.
As given in Tables 1-6, we considered average values of classwise statistics for CEN, IBA, GM and MCC metrics for the eight classes, the values of the RCI metric as well as individual and overall ranking of the methods. The average values are calculated by taking the sum of values in the eight classes and dividing that sum by 8.
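The class-wise averaging described above can be sketched for GM and MCC as follows. The sketch assumes a confusion matrix with actual classes in rows and predicted classes in columns, and one-vs-rest counts for each class; the function names are hypothetical and the paper's own tooling may compute these statistics differently.

```python
import math


def one_vs_rest_counts(confusion, cls):
    """(TP, FP, FN, TN) for class `cls`, treating all other classes as negative."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(row[cls] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def class_gm(tp, fp, fn, tn):
    """Square root of the product of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def class_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a single one-vs-rest split."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def class_average(confusion, metric):
    """Sum of the class-wise metric values divided by the number of classes."""
    n_classes = len(confusion)
    return sum(metric(*one_vs_rest_counts(confusion, c)) for c in range(n_classes)) / n_classes
```

For the eight-class problem, `class_average` divides the sum of the eight class-wise values by 8, as described in the text.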
As a visual aid, Figures 5-10 display the values of each of these performance metrics as well as the overall ranking of the methods based on these metrics.
FIGURE 6
Average GM values of the methods in the study

The procedure for forming the ranking system is as follows. To form the RCI-based ranking, we sorted all the methods by their RCI values, with the best performing method having the maximum value. To form the CEN-based ranking, we sorted all the methods by their average CEN values, with the best performing method having the minimum average CEN value. To form the IBA-, GM- and MCC-based rankings, we sorted all the methods by the average values of these metrics, with the best performing method having the maximum average IBA, GM or MCC value. Finally, we formed the overall ranking of the methods from the individual RCI, CEN, IBA, GM and MCC rankings. In this case, the overall ranking and all the individual metric-based rankings are exactly the same, which shows a strong correlation between the individual metrics, as given in Table 6. As a visual aid, Figure 10 displays the ranking of all the methods considered in this study as given in Table 6.
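The ranking procedure can be sketched as follows. The method names and metric values are made-up placeholders, not results from Tables 1-6, and combining the individual rankings by summing rank positions is an assumption, since the text does not state how the overall ranking is derived from the individual ones.

```python
def rank_methods(scores, higher_is_better=True):
    """Map each method to its rank (1 = best) for a single metric."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: position for position, method in enumerate(ordered, start=1)}


def overall_ranking(per_metric_ranks):
    """Combine individual rankings by summing rank positions (an assumed rule)."""
    methods = per_metric_ranks[0].keys()
    totals = {m: sum(ranks[m] for ranks in per_metric_ranks) for m in methods}
    return rank_methods(totals, higher_is_better=False)


# Hypothetical values for three placeholder methods.
mcc = {"method_a": 0.90, "method_b": 0.70, "method_c": 0.80}
cen = {"method_a": 0.10, "method_b": 0.30, "method_c": 0.20}

mcc_ranks = rank_methods(mcc)                          # higher MCC is better
cen_ranks = rank_methods(cen, higher_is_better=False)  # lower CEN is better
overall = overall_ranking([mcc_ranks, cen_ranks])
print(overall)
```

When every individual ranking agrees, as reported for this study, any reasonable combination rule gives the same overall ranking, so the choice of rule does not affect the conclusion.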
As can be seen in Table 6, the best performing method is 10-fold CV without augmentation in the spatial domain, while the worst performing method is 10-fold CV without augmentation in the frequency domain. We can also observe that spatial domain methods have an edge over frequency domain methods. One reason for the better performance of spatial domain methods could be that they work on data with larger variation in the intensity values of the inputs, which allows them to better capture the intrinsic information of a volume. Furthermore, combining the augmentation methods in the training set degraded performance in comparison with the methods that used a single augmentation method. In addition, random weak Gaussian blurring is the best performing augmentation method in comparison with random zoom in/out and random shift augmentation. Although the best performing method uses more data in the training set, methods that use more data during training are not necessarily the best. The performance of methods that used random shift augmentation during training was found to be inferior to the other methods. One reason for this is that small translations or rescalings of the input image can drastically change the prediction of a CNN model, as CNNs are not invariant to such transformations because they ignore the classical sampling theorem. The better performance of random weak Gaussian blurring augmentation could be explained by the blurring operation smoothing the output of the non-linearity, which helps prevent high-frequency activations and also helps isolate the aliasing phenomenon. AMC is an important task with a wide range of civil and military applications and a number of uses in different scenarios. Our work lies at the intersection of DL and AMC.
We deployed state-of-the-art 3D-CNN architectures for the multiclass classification of modulation schemes in the presence of noise, in both the spatial and frequency domains, with data augmentation procedures.
The proposed architecture is designed specifically for the multiclass classification of modulation schemes, with the aim of attaining maximum performance. The number of neurons in the FC layers, the number of feature maps in the convolutional layers and the other hyperparameters were chosen to avoid overfitting, long training times and other problems. The results indicate that 3D-CNN architectures are a useful tool for this task.
Comparison of our work with the other studies reported in the literature is given in Table 7.

CONCLUSION
In this work, we compared and contrasted the performance of several DL architectures for the multiclass (8-classes) classification of modulated signals in the presence of noise in both spatial and frequency (DCT) domains. The best performing model has been found to be 10-fold CV without using augmentation in the spatial domain while the worst performing model has been found to be 10-fold CV without augmentation in the frequency domain. Furthermore, we note that spatial domain methods performed better than their frequency domain counterparts. This study can be extended further by considering other modulation schemes such as frequency modulation methods as well as noise models such as Nakagami model and other DL architectures such as graph convolutional networks.