Abstract

Recognition of radar signal modulation modes plays a critical role in electronic warfare, and deep-learning-based algorithms have significantly improved the recognition accuracy of radar signals. However, convolutional neural networks have become increasingly large and sophisticated with the progress of deep learning, making them unsuitable for platforms with limited computing resources. ResXNet, a novel multiscale lightweight attention model, is proposed in this paper. The proposed ResXNet model has a larger receptive field and a novel grouped residual structure that improves its feature representation capacity. In addition, the convolution block attention module (CBAM) is utilized to effectively aggregate channel and spatial information, enabling the convolutional neural network to extract features more effectively. The input time-frequency image size of the proposed model is enlarged, which effectively reduces the information loss of the input data. The average recognition accuracy of the proposed model reaches 91.1% at -8 dB. Furthermore, the proposed model performs better in terms of unsupervised object localization with the class activation map (CAM). The classification information and localization information of the radar signal can be fused for subsequent analysis.

1. Introduction

Radar is widely deployed on the modern battlefield and, with the continuous improvement of radar technology, has progressively become a dominant key technology in modern warfare [1–4]. Recognizing the modulation type of enemy radar signals rapidly and accurately therefore yields effective battlefield information and situational awareness and provides solid support for subsequent decision-making, making it vital in the field of electronic warfare.

Traditional radar signal recognition methods rely on handcrafted feature extraction [5–11]. However, these methods lack flexibility and are computationally inefficient. With the continuous development of radar technology, radar signal parameters have become more complex and radar signals more concealed. In addition, the widespread application of radar equipment and the rapid increase in the number of radiation sources have made the electromagnetic environment on the battlefield more complicated. Therefore, traditional radar signal recognition methods cannot effectively recognize radar signal modulations in complex electromagnetic environments.

Deep learning has progressed rapidly in the last few years and has been extensively used in a wide range of traditional applications. Radar signal recognition methods based on convolutional neural networks have surpassed traditional recognition methods based on handcrafted feature extraction [2, 7, 12–14]. Convolutional neural networks automatically extract the deep features of objects through supervised learning and have strong generalization performance [15]. In addition, the convolutional neural network has a hierarchical structure whose depth and parameters can be adjusted flexibly, which significantly reduces labor costs and makes it more convenient to use. However, to achieve higher recognition accuracy, convolutional neural networks have become larger and their structures more complicated. In practical applications, it is also important to strike a balance between recognition accuracy and computational efficiency. Therefore, this paper proposes a new multiscale lightweight structure, ResXNet, which is both lightweight and computationally efficient.

In addition, the convolution block attention module (CBAM) is utilized in the model proposed in this paper. CBAM is a lightweight general attention module that can be seamlessly integrated into any convolutional neural network structure and trained end-to-end together with the base network [16–18]. A convolutional layer can only capture local feature information and ignores the contextual relationships of features outside its receptive field. CBAM significantly improves the feature representation capability of the model by enhancing or suppressing specific features in the channel and spatial dimensions, and its computation and memory overhead is negligible.

At the same time, this paper also investigates the application of object localization based on class activation mapping (CAM) in radar signal recognition. Class activation mapping is a weakly supervised localization algorithm that locates the object position in a single forward pass, which improves the interpretability and transparency of the model and helps researchers build trust in the deep learning models [19, 20].

Therefore, this paper proposes a multiscale ResXNet model based on grouped residual modules and further improves the recognition accuracy of the model through the CBAM attention module. The ResXNet lightweight attention network model proposed in this paper is based on grouped convolution and constructs a hierarchical connection similar to residuals within a single convolution block. ResXNet can expand the size of the receptive field and improve the multiscale feature representation ability. The grouped residual convolutional layers effectively reduce the number of parameters while also improving the generalization performance. In addition, CAM is used to obtain the localization information of the radar signal in the time-frequency image, and the classification information and localization information of the radar signal can be fused for subsequent analysis.

2. Related Work

2.1. Radar Signal Classification

Traditional radar signal recognition methods usually rely on handcrafted feature extraction, such as cumulants, distribution distance, spectral correlation analysis, wavelet transform, and time-frequency distribution features [21]. Machine learning algorithms, such as clustering algorithms [22], support vector machines [9, 23], decision trees [7], artificial neural networks [5], and graph models [24], are then used to classify radar signals according to the extracted features. However, traditional radar signal recognition methods are inefficient since they rely heavily on manual feature extraction and selection. They are also easily affected by noise, and their recognition performance decreases substantially at low SNR.

Convolutional neural networks have found their way into the field of radar signal classification with the development of deep learning. Radar signal recognition based on convolutional neural networks first converts one-dimensional time-domain radar signals into two-dimensional time-frequency images through time-frequency analysis and then automatically extracts features from the time-frequency images of different radar signals by training convolutional neural networks. Radar signal recognition based on convolutional neural networks significantly improves recognition accuracy at low SNRs.

Kong et al. [3] proposed a convolutional neural network (CNN) for radar waveform recognition and a sample averaging technique to reduce the computational cost of time-frequency analysis. However, the input size of this method is small, and there is a significant loss of information. Hoang et al. [13] introduced a radar waveform recognition technique based on a single shot multibox detector and a supplementary classifier, which achieved excellent classification performance. However, this method requires extensive manual annotation, and its computational efficiency is low. Wang et al. [12] proposed a transferred deep learning waveform recognition method based on a two-channel architecture, which can significantly reduce the training time and the size of the training dataset; multiscale convolution and time-related features are used to improve the recognition performance. On the other hand, the transfer learning method requires a large convolutional neural network as a pretraining model, which has a high computational cost and is unsuitable for embedded platforms or platforms with limited computing resources.

2.2. Convolutional Neural Network

LeNet pioneered deep convolutional neural networks in the 1990s. In 2012, AlexNet [25] achieved breakthrough success in image classification and recognition. The VGG [26] model modularizes the convolutional neural network structure, increases the network depth, and uses small convolution kernels. Experiments show that expanding the receptive field by increasing the depth of the convolutional neural network can effectively improve performance [27]. The GoogLeNet model utilizes parallel filters with varied convolution kernel sizes to increase the feature representation ability and recognition performance [15]. ResNet [28] presents a 152-layer deep convolutional neural network that incorporates identity connections into the network topology, alleviating the vanishing gradient problem.

With the continuous development of deep learning, convolutional neural networks have become deeper, their computation more sophisticated, and their hardware requirements higher in pursuit of accuracy. As a result, the construction of small and efficient convolutional neural networks has gained more attention. DenseNet [29] connects the output of each layer to every subsequent layer: each convolutional layer takes the features of all preceding layers as input, and its output serves as input to all subsequent layers. DenseNet enables the network to extract features at a larger scale and alleviates the vanishing gradient problem. MobileNet [30] employs depthwise separable convolutions to build lightweight convolutional neural networks. The advantages of MobileNet include a small model size, lower latency, lower computational complexity, and higher inference efficiency. It can easily meet the requirements of platforms with limited computing resources and embedded applications.

Grouped convolution was first introduced in AlexNet [25] to distribute the convolutional neural network model over multiple GPUs. ResNeXt [31] found that grouped convolution can reduce the number of parameters while simultaneously improving accuracy. Channel-wise convolution is a special case of grouped convolution in which the number of groups is equal to the number of channels; channel-wise convolutions are a component of depthwise separable convolution [30].

3. Proposed Method

An overview of the proposed algorithm is depicted in Figure 1. The proposed algorithm first converts the radar signal into a time-frequency image through time-frequency analysis. The ResXNet model proposed in this paper performs radar signal recognition, and the CAM is utilized for signal localization in time-frequency images. The ResXNet model is composed of grouped residual modules and the CBAM attention mechanism.

3.1. Radar Signal Processing

In this paper, the radar signal corrupted by additive white Gaussian noise can be expressed as

$$x(n) = A e^{j\varphi(n)} + w(n), \quad n = 1, 2, \ldots, N,$$

where $x(n)$ signifies the complex radar signal samples and $w(n)$ stands for additive white Gaussian noise (AWGN) with zero mean and variance $\sigma^2$. $A$ represents the nonzero constant amplitude, and $\varphi(n)$ denotes the instantaneous phase of the radar signal. The inherent difference between radar signals of different modulation types is the frequency variation over time. The one-dimensional radar signal is transformed into two-dimensional time-frequency images (TFIs) through time-frequency analysis. The pattern in the time-frequency image corresponds to the frequency variation with time.

The instantaneous phase $\varphi(n)$ consists of the instantaneous frequency $f(n)$ and the phase function $\theta(n)$, which determine the modulation type of the radar signal. The instantaneous phase is defined as

$$\varphi(n) = 2\pi f(n)\, n T_s + \theta(n),$$

where $T_s$ denotes the sampling interval.

Eight LPI radar waveforms considered in this paper are grouped into two categories, FM (frequency modulation) and PM (phase modulation). In the FM, the instantaneous frequency varies while the phase is constant; in the PM, the phase varies while the instantaneous frequency is constant [3], as defined in Table 1.
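As a concrete illustration of this signal model, the following Python sketch generates a noisy LFM (FM-class) pulse; the sampling rate, carrier frequency, bandwidth, and SNR values are arbitrary placeholders rather than the parameter ranges of Table 4.

```python
import numpy as np

def lfm_with_awgn(n_samples=1024, fs=100e6, f0=10e6, bw=20e6, snr_db=-8, amplitude=1.0):
    """Generate one complex LFM pulse x(n) = A*exp(j*phi(n)) + w(n).
    All parameter values here are illustrative placeholders, not the ranges of Table 4."""
    t = np.arange(n_samples) / fs
    pulse_width = n_samples / fs
    chirp_rate = bw / pulse_width                                # k = B / T
    phase = 2 * np.pi * (f0 * t + 0.5 * chirp_rate * t ** 2)     # phi(t) of an LFM waveform
    signal = amplitude * np.exp(1j * phase)
    # AWGN whose variance is set by the requested SNR
    signal_power = np.mean(np.abs(signal) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (np.random.randn(n_samples) + 1j * np.random.randn(n_samples))
    return signal + noise
```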

The Choi-Williams distribution (CWD), based on the time-frequency distribution of Cohen's class, has the advantages of high resolution and cross-term suppression. The resolution of the time-frequency analysis can be modified by adjusting the parameter of its exponential kernel function. The time-frequency image of a radar signal based on the Choi-Williams distribution is defined as

$$CWD_x(t,\omega) = \iiint x\!\left(u+\frac{\tau}{2}\right) x^{*}\!\left(u-\frac{\tau}{2}\right) \Phi(\xi,\tau)\, e^{\,j\xi u - j\xi t - j\omega\tau}\, du\, d\tau\, d\xi, \qquad \Phi(\xi,\tau) = e^{-\xi^{2}\tau^{2}/\sigma},$$

where $\omega$ and $t$ denote the frequency and time axes, respectively, and $\Phi(\xi,\tau)$ is the exponential kernel function of the Choi-Williams distribution with scale factor $\sigma$. The kernel function is regarded as a low-pass filter that can suppress cross-terms effectively.

Figure 2 depicts the time-frequency images of the different radar signals considered in this paper, obtained by the Choi-Williams distribution. The time-frequency images visualize the frequency variation over time and thus allow the radar signals to be recognized effectively. Before feature extraction, the time-frequency images are normalized to reduce the influence of the bandwidth of distinct radar signals. The time-frequency images are transformed into gray images as follows:

$$G(i,j) = \frac{C(i,j) - \min(C)}{\max(C) - \min(C)},$$

where $C$ indicates the time-frequency image obtained by the Choi-Williams distribution, $G$ is the gray image, and $(i, j)$ denotes a pixel position in the time-frequency image. The gray time-frequency images contain the significant components and information of the radar signals.
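A minimal sketch of this normalization step, assuming a simple min-max mapping of the CWD magnitude to a gray image in [0, 1] (the exact scaling used by the authors is not specified):

```python
import numpy as np

def tfi_to_gray(cwd_tfi):
    """Min-max normalize a Choi-Williams time-frequency image to a gray image in [0, 1].
    Assumes `cwd_tfi` is a 2D array of CWD values (magnitude is taken first)."""
    c = np.abs(cwd_tfi).astype(np.float64)
    c_min, c_max = c.min(), c.max()
    return (c - c_min) / (c_max - c_min + 1e-12)  # epsilon avoids division by zero
```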

3.2. ResXNet Module

The traditional convolutional neural network expands the receptive field of the model by simply stacking convolutional layers. This strategy, however, increases the size of the model and the number of parameters, making training increasingly complex. The development of advanced model structures reveals a tendency toward improving the receptive field size and multiscale learning capability of models while keeping them lightweight.

A multiscale model based on group convolution is proposed in this paper. On the premise of keeping the model lightweight, the receptive field of each convolutional layer is increased to improve the capabilities of feature extraction and multiscale representation. The proposed model structure is a modular design, which can flexibly adjust the size and parameters of the model.

Grouped convolution significantly reduces the size of the model and the number of parameters. Grouped convolution splits the input feature maps along the channel dimension into several feature map subsets, and each branch can only use its own subset rather than the entire input feature map [27]. The ResXNet proposed in this paper instead applies a channel-wise convolution in each branch, and each branch takes the complete input feature map as input. The channel-wise convolution compresses the number of channels of the input feature map, and then a convolution with the same number of channels is used for feature extraction. The output feature maps of the branches are then concatenated, and feature maps of different scales are fused using a channel-wise convolution.

As shown in Figure 3, the same feature map is fed to each branch; a convolution in each branch compresses the number of channels of the feature map, and each branch then contains a convolutional layer for feature extraction. $K_i$ and $C_i$ represent the feature-extraction convolution and the channel-wise convolution of the $i$-th branch, respectively. The input of the feature-extraction convolution of the $i$-th branch is the summation of $C_i(X)$ and $y_{i-1}$; thus, the output feature map of each branch can be expressed as

$$y_i = \begin{cases} K_i\!\left(C_i(X)\right), & i = 1, \\ K_i\!\left(C_i(X) + y_{i-1}\right), & 1 < i \le g, \end{cases}$$

where $X$ is the input feature map and $g$ is the number of branches. The convolution operation of each branch extracts features from the input features and the output of the preceding branch. Therefore, the multibranch structure and the connections between branches are beneficial for extracting both global and local features.

In ResXNet, $g$ refers to the number of branches of each convolution module; a larger $g$ allows features with a larger receptive field to be learned, while the additional computation and memory overhead is negligible. ResXNet further improves the multiscale capability of convolutional neural networks and can be integrated with existing state-of-the-art methods to improve recognition accuracy and generalization performance. The proposed ResXNet module can be regarded as an improvement of Res2Net [27]: replacing the grouped convolution in the Res2Net module with several channel-wise convolutions makes the parameters and structural design of the model more flexible.
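The following TensorFlow/Keras sketch illustrates one possible reading of the grouped residual module: every branch sees the full input, compresses it with a pointwise convolution, adds the previous branch's output, and applies its own feature-extraction convolution before all branch outputs are concatenated and fused. Kernel sizes, widths, and the fusion layer are assumptions, not the authors' published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def resxnet_module(x, groups=8, branch_channels=32):
    """Sketch of the grouped residual module: each branch compresses the full input
    with a 1x1 convolution, adds the previous branch's output (y_{i-1}), and applies
    a 3x3 convolution for feature extraction; branch outputs are concatenated and
    fused. Kernel sizes and widths are assumptions, not the authors' exact values."""
    branch_outputs = []
    prev = None
    for _ in range(groups):
        b = layers.Conv2D(branch_channels, 1, padding="same", use_bias=False)(x)
        if prev is not None:
            b = layers.Add()([b, prev])                 # hierarchical residual connection
        b = layers.Conv2D(branch_channels, 3, padding="same", use_bias=False)(b)
        b = layers.BatchNormalization()(b)
        b = layers.ReLU()(b)
        branch_outputs.append(b)
        prev = b
    y = layers.Concatenate()(branch_outputs)            # multiscale feature fusion
    y = layers.Conv2D(groups * branch_channels, 1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```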

3.3. Convolution Block Attention Module

The convolution block attention module (CBAM) is an attention module for convolutional neural networks that is simple and efficient [18]. The diagram of CBAM is depicted in Figure 4. CBAM sequentially calculates the channel and spatial attention of the feature map and then multiplies the two attention maps with the input feature map to refine the adaptive feature. CBAM is a lightweight general module that can be seamlessly inserted into any convolutional neural network architecture, with negligible computing and memory overhead, and can be trained end-to-end together with the base convolutional neural network.

The attention module allows the model to concentrate on informative features and suppress irrelevant ones. CBAM applies the channel attention module and the spatial attention module sequentially to enable the model to reinforce effective features. Given the feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, CBAM calculates a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. The overall attention process is expressed as

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$

where $\otimes$ is element-wise multiplication and $F''$ denotes the final output feature map of the CBAM attention module. In the multiplication process, the attention values are broadcast accordingly: channel attention values are broadcast along the spatial dimension and vice versa [18]. The specific details of the channel attention and spatial attention modules are introduced as follows.

3.3.1. Channel Attention Module

The channel attention module (CAM) exploits the interchannel relationship of the feature maps to generate the channel attention map. Figure 5 depicts the computation process of the channel attention map. Average pooling and maximum pooling are used to aggregate spatial information, then the fully connected layer is used to compress the channel dimensions of the feature map, and the multilayer perceptron is used to generate the final attention map.

Specifically, global average pooling and global max pooling are first used to aggregate the spatial context information of the feature map, generating two different spatial descriptors: the global average pooling feature $F_{avg}^{c}$ and the global max pooling feature $F_{max}^{c}$. The two descriptors are then forwarded into a shared network to generate the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The shared network is a multilayer perceptron (MLP) with one hidden layer. To reduce the computational overhead, the hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where $r$ is the reduction ratio. The shared network is applied to each descriptor, and the two outputs are merged by element-wise summation to obtain the final channel attention vector. Therefore, the channel attention can be written as

$$M_c(F) = \sigma\!\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right) = \sigma\!\left(W_1\!\left(W_0\!\left(F_{avg}^{c}\right)\right) + W_1\!\left(W_0\!\left(F_{max}^{c}\right)\right)\right),$$

where $\sigma$ represents the sigmoid activation function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ denote the weights of the MLP, and a ReLU activation function follows $W_0$.
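A minimal Keras sketch of this channel attention computation, following the standard CBAM formulation [18]; the reduction ratio r = 8 is illustrative, not the authors' stated value.

```python
from tensorflow.keras import layers

def channel_attention(feature_map, reduction=8):
    """Channel attention per the CBAM formulation [18]: a shared two-layer MLP is
    applied to the average- and max-pooled descriptors, summed, and passed through
    a sigmoid. The reduction ratio r = 8 is illustrative."""
    channels = feature_map.shape[-1]
    shared_hidden = layers.Dense(channels // reduction, activation="relu")  # W0 + ReLU
    shared_out = layers.Dense(channels)                                     # W1
    avg_desc = layers.GlobalAveragePooling2D()(feature_map)                 # F_avg^c
    max_desc = layers.GlobalMaxPooling2D()(feature_map)                     # F_max^c
    att = layers.Add()([shared_out(shared_hidden(avg_desc)),
                        shared_out(shared_hidden(max_desc))])
    att = layers.Activation("sigmoid")(att)                                 # M_c(F)
    att = layers.Reshape((1, 1, channels))(att)
    return layers.Multiply()([feature_map, att])                            # F' = M_c(F) (x) F
```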

3.3.2. Spatial Attention Module

The spatial attention module (SAM) utilizes the interspatial relationship of feature maps to generate spatial attention maps. As illustrated in Figure 6, the average pooling and maximum pooling are applied to the feature map along the channel dimension, and then, the two feature maps are concatenated to create the spatial feature descriptor. The spatial attention map highlights informative regions [18]. Finally, the spatial attention map is generated by convolution operation on the spatial feature descriptor to enhance or suppress the feature region.

Specifically, spatial global average pooling and max pooling are applied along the channel axis to aggregate the spatial information of the feature map, generating two 2-dimensional spatial feature maps $F_{avg}^{s} \in \mathbb{R}^{1 \times H \times W}$ and $F_{max}^{s} \in \mathbb{R}^{1 \times H \times W}$. Then, the 2D spatial attention map is calculated by a convolution operation. The spatial attention is computed as

$$M_s(F) = \sigma\!\left(f\!\left(\left[F_{avg}^{s};\, F_{max}^{s}\right]\right)\right),$$

where $\sigma$ represents the sigmoid activation function and $f(\cdot)$ denotes the convolution operation.
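A matching Keras sketch of the spatial attention computation; the 7×7 kernel size is the CBAM default from [18] and is an assumption here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(feature_map, kernel_size=7):
    """Spatial attention per the CBAM formulation [18]: channel-wise average and max
    maps are concatenated and convolved. The 7x7 kernel is the CBAM default and is
    an assumption here."""
    avg_map = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(feature_map)  # F_avg^s
    max_map = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(feature_map)   # F_max^s
    concat = layers.Concatenate(axis=-1)([avg_map, max_map])
    att = layers.Conv2D(1, kernel_size, padding="same",
                        activation="sigmoid", use_bias=False)(concat)                          # M_s
    return layers.Multiply()([feature_map, att])                                               # F'' = M_s(F') (x) F'
```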

The convolution block attention module (CBAM) divides attention features into the channel and spatial attention modules and achieves a significant performance improvement while keeping a small overhead. CBAM can be seamlessly integrated into any convolutional neural network architecture and trained end-to-end with the CNN model. The CBAM can prompt the network to learn and aggregate the feature information in the target area, effectively strengthen or suppress the features of a specific space or a specific channel, and guide the convolutional neural network to make good use of the feature maps [18].
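Combining the two sketches above gives a CBAM block that can be inserted after each ResXNet module, mirroring $F' = M_c(F) \otimes F$ followed by $F'' = M_s(F') \otimes F'$; it relies on the hypothetical helper functions defined earlier.

```python
def cbam_block(feature_map, reduction=8, kernel_size=7):
    """Apply channel then spatial attention (relies on the two sketches above),
    mirroring F' = M_c(F) (x) F followed by F'' = M_s(F') (x) F'."""
    refined = channel_attention(feature_map, reduction=reduction)
    return spatial_attention(refined, kernel_size=kernel_size)
```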

3.4. Class Activation Map

The class activation map (CAM) is portable and applied to a variety of computer vision tasks for weakly supervised object localization. The CAM is trained end-to-end based on image-level annotation and localizes objects simply in a single forward pass. CAM avoids the flattening of the feature map by replacing the fully connected layer with the global average pooling, which completely preserves the spatial information of the objects in the embedded features. More importantly, CAM can make the existing state-of-the-art deep models interpretable and transparent and help researchers understand the logic of predictions hidden inside the deep learning models.

For each image, $f_k(x, y)$ denotes the activation of the $k$-th channel of the last convolutional layer at spatial coordinate $(x, y)$. Suppose that there are $K$ feature maps in the last convolutional layer, so that $k \in \{1, 2, \ldots, K\}$. Each node of the GAP (global average pooling) layer is the spatial average of the corresponding activation and can be computed by

$$F_k = \frac{1}{Z}\sum_{x}\sum_{y} f_k(x, y),$$

where $Z$ is the number of spatial locations. The class activation map is then defined as

$$M_c(x, y) = \sum_{k} w_k^{c}\, f_k(x, y),$$

where $M_c$ is the class activation map for class $c$ and $w_k^{c}$ represents the weight corresponding to class $c$ for unit $k$. Hence, $M_c(x, y)$ signifies the importance of the activation at spatial grid $(x, y)$ for class $c$.

Each class activation map consists of the weighted linear sum of these visual patterns at different spatial locations, which contain a series of part saliency maps. Upsampling is applied to resize the class activation map to the size of the input time-frequency image to localize the image saliency regions most relevant to a particular class.
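The sketch below shows how such a class activation map could be computed for a GAP-plus-dense Keras classifier; the layer lookup by name and the use of the final dense layer's kernel as $w_k^{c}$ are assumptions about the model layout rather than the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, class_index, last_conv_name):
    """Compute a CAM for a Keras model that ends with GAP followed by a dense
    softmax classifier. The layer name and the use of the final dense layer's
    kernel as w_k^c are assumptions about the model layout."""
    conv_model = tf.keras.Model(model.inputs, model.get_layer(last_conv_name).output)
    feature_maps = conv_model.predict(image[np.newaxis, ...])[0]          # f_k(x, y), shape (H, W, K)
    class_weights = model.layers[-1].get_weights()[0][:, class_index]     # w_k^c, shape (K,)
    cam = feature_maps @ class_weights                                    # sum_k w_k^c * f_k(x, y)
    cam = np.maximum(cam, 0)
    cam /= cam.max() + 1e-12                                              # normalize to [0, 1]
    return cam  # upsample to the input image size for visualization
```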

3.5. ResXNet Architecture

The group dimension indicates the number of groups within a convolutional layer. This dimension converts a single-branch convolutional layer into a multibranch one, improving the capacity for multiscale representation [31]. The CBAM block adaptively recalibrates channel and spatial feature responses by explicitly modeling interdependencies along the channel and spatial dimensions [18].

Table 2 shows the specification of ResXNet, including its depth and width. ResXNet makes extensive use of convolution and uses average pooling layers to reduce the feature size. The proposed models are constructed in 5 stages, the first of which is a conventional convolution block. The following 4 stages contain [1, 2, 4] ResXNet modules, respectively, and the number of channels in these stages is [32, 64, 128, 256], respectively. Global average pooling followed by a fully connected layer serves as the head for the classification task. A dropout layer is inserted before the fully connected layer to prevent the model from overfitting. A CBAM attention mechanism is applied to the output of each ResXNet module to improve the capability of global feature extraction.

Each convolutional module is composed of a convolutional layer without bias, a batch normalization layer, and a ReLU activation function. The input of the proposed model is a gray time-frequency image. The last convolutional layer connects to the fully connected layer through global average pooling (GAP) instead of flattening the features, so that changes in the input size do not affect the number of parameters. Table 3 shows the model information for different numbers of groups, including model size and number of parameters. When $g = 1$, the ResXNet model degenerates to a conventional convolutional neural network.
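An illustrative assembly of the overall network, reusing the resxnet_module and cbam_block sketches above; for brevity it places a single ResXNet module per stage, and the stem, pooling, and dropout rate are simplifying assumptions rather than the exact configuration of Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_resxnet(input_shape, num_classes=8, groups=8):
    """Illustrative assembly of the 5-stage network, reusing the resxnet_module and
    cbam_block sketches above. One module per stage, the stem, pooling, and dropout
    rate are simplifying assumptions rather than the exact configuration of Table 2."""
    inputs = layers.Input(shape=input_shape)
    # Stage 1: conventional convolution block (stem)
    x = layers.Conv2D(32, 3, padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Stages 2-5: ResXNet module + CBAM, with average pooling for downsampling
    for width in (32, 64, 128, 256):
        x = resxnet_module(x, groups=groups, branch_channels=width // groups)
        x = cbam_block(x)
        x = layers.AveragePooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```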

4. Simulation and Analysis

The proposed ResXNet model was evaluated and compared with Res2Net and GroupNet on a simulation dataset to verify its effectiveness and robustness. In this section, we evaluate the performance of the proposed ResXNet model in recognizing radar signals with different modulations and the ability of the class activation map (CAM) to localize radar signals in time-frequency images. We implement the proposed models using the TensorFlow 2.0 framework. Cross-entropy is used as the loss function. We trained the networks using SGD with a minibatch size of 16 on an NVIDIA RTX 3090 GPU. The learning rate was initially set to 0.01 and multiplied by 0.1 every 10 epochs. All models, including the proposed and compared models, are trained for 30 epochs with the same training strategy.
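A hedged sketch of this training configuration in TensorFlow/Keras; train_ds and val_ds are placeholder tf.data pipelines (assumed to be already batched to 16), the input shape is illustrative, and build_resxnet refers to the architecture sketch above.

```python
import tensorflow as tf

# Placeholder pipelines: train_ds / val_ds are assumed tf.data datasets already batched to 16.
model = build_resxnet(input_shape=(256, 256, 1), num_classes=8)  # input size is illustrative

def step_decay(epoch, lr):
    # Multiply the learning rate by 0.1 every 10 epochs, starting from 0.01.
    return lr * 0.1 if epoch > 0 and epoch % 10 == 0 else lr

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy",   # cross-entropy loss, one-hot labels assumed
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay)])
```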

The dataset consists of 8 types of radar signals: Barker, Costas, Frank, LFM, P1, P2, P3, and P4. In the training dataset, the SNR of each signal category ranges from -18 dB to 10 dB in steps of 2 dB, and each SNR contains 500 signals per category, for a total of 60,000 signal samples. In the test dataset, each SNR contains 100 samples per category, for a total of 12,000 signal samples. The parameter variations of each kind of radar signal are listed in Table 4.

4.1. Recognition Accuracy with Different Group Dimensions

The first experiment compares the effect of the group dimension on model accuracy. As shown in Table 5, the accuracy of ResXNet improves as the number of groups increases. The number of parameters of the ResXNet model decreases as the number of groups increases, while the training time gradually increases. Because the convolutional output of the proposed model is connected to the classification layer through a global average pooling (GAP) layer, changing the input size of the model does not change the size or the number of parameters of the model. The proposed model benefits from the combined effect of the hierarchical residual structure, which generates rich multiscale features and effectively improves the recognition accuracy. Increasing the input data size of the model significantly reduces the information loss of the input data, which helps the convolutional neural network extract informative features from the time-frequency images.

4.2. Recognition Accuracy of the Different Models

Table 6 lists the overall recognition accuracy of the proposed and compared models for radar signal modulations with a group dimension of 8. Compared with Res2Net and GroupNet, ResXNet achieves higher recognition accuracy. The recognition accuracy of the proposed ResXNet model reaches 88.9%, which outperforms Res2Net by 1.4%.

Figure 7 depicts the recognition accuracy of the proposed ResXNet model for different radar signals at various SNRs. The accuracy for the Barker code, Frank code, and LFM remains 100% even at -18 dB. The average recognition accuracies of all radar modulations under different SNRs for the Res2Net, GroupNet, and proposed ResXNet models are presented in Figure 8. The average accuracy over all radar modulations remains above 50% at -18 dB, and the average recognition accuracy of the model is above 90% at -8 dB. At low SNR, the radar signals are disrupted by Gaussian white noise, resulting in misclassifications. As the SNR increases, the extracted features become more distinguishable from each other.

Figure 9 illustrates the confusion matrices of the different models to further analyze their recognition capability for radar modulation types. The Frank code and P3 code, as well as the P1 code and P4 code, are similar and easily confused. The recognition accuracy of the other four radar signals is 100%.

4.3. Comparison of Object Visualization

The visualization of model features can accurately explain the features learned by the model and provide reasonable explanations for the prediction results. In this paper, the predictions are visualized by the class activation mapping (CAM) [19], which is commonly used for localizing the discriminative regions in image classification. CAM generates the object saliency map by calculating the weighted sum of the feature maps of the last convolutional layer. By simply upsampling the class activation map to the size of the input time-frequency image, we can identify the image regions most relevant to the particular category [19]. CAM can reveal the decision-making process of the convolutional neural network model. CAM makes the model based on the convolutional neural network more transparent by generating visual explanations. And CAM can localize objects without additional bounding box annotations.

Figure 10 depicts the CAM visualization results of the ResXNet model and the grouped convolution model for different radar signals, with the discriminative CAM regions highlighted. Figure 10 shows that the ResXNet model outperforms GroupNet in terms of visual object localization. Compared with GroupNet, the CAM results of ResXNet have more concentrated activation maps. The ResXNet localizes the radar signal more precisely and tends to encompass the entire object in the time-frequency image due to its better multiscale capability. The ResXNet model proposed in this paper can help researchers develop confidence in artificial intelligence systems by improving the interpretability and trustworthiness of the model. This ability to precisely localize discriminative regions makes ResXNet potentially valuable for object region mining in weakly supervised localization tasks of radar signals.

4.4. Comparison of Object Visualization with Different Input Sizes

In addition, this paper also explores the impact of the input size of the proposed model on CAM visualization. Figure 11 shows the CAM visualization results with various input sizes. As demonstrated in Figure 11, the model generates more precise, higher-resolution CAMs as the input image size increases, improving the visual localization capacity of the model. Especially for the Costas signals, ResXNet can accurately localize each signal component.

Therefore, the multiscale feature extraction capability of ResXNet helps to better localize the most discriminative regions in radar time-frequency images compared to GroupNet. For the same ResXNet model, increasing the input image size of the model can obtain more precise and complete discriminative regions. As a result, the proposed ResXNet model contributes to improving the target localization capability of the model and the interpretability of the prediction.

5. Conclusions

In this paper, a radar signal recognition approach based on the ResXNet lightweight model is proposed. The proposed model substantially improves the multiscale representation ability and recognition accuracy by grouping and cascading convolutional layers. The CBAM attention mechanism is employed to effectively aggregate the channel features and spatial features, so that the convolutional neural network model can make better use of the given feature maps.

Experiments demonstrate that the proposed model has strong multiscale representation capability and achieves higher recognition performance at low SNR. Moreover, the model is small, has few parameters, and is suitable for embedded platform applications. ResXNet adds the group dimension through channel-wise convolution, allowing the size and number of parameters of the model to be adjusted and improving its multiscale representation capability. The proposed ResXNet model also provides superior object localization capacity, and the radar signal can be localized more precisely through CAM.

For future research, more lightweight models for radar signal recognition, as well as the use of CAM in radar signal recognition and localization, will be explored.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (Grant Nos. 61971155 and 61801143), Natural Science Foundation of Heilongjiang Province of China (Grant No. JJ2019LH1760), Fundamental Research Funds for the Central Universities (Grant No. 3072020CF0814), and Aeronautical Science Foundation of China (Grant No. 2019010P6001).