Fine-Grained Recognition of Mixed Signals with Geometry Coordinate Attention

With the advancement of technology, signal modulation types are becoming increasingly diverse and complex. The time-frequency overlap of signals during transmission poses significant challenges for the classification and recognition of mixed signals, and existing methods suffer from poor recognition capability and low generality. This paper presents a recognition model for the fine-grained analysis of mixed-signal characteristics, proposing a Geometry Coordinate Attention mechanism and introducing a low-rank bilinear pooling module to extract signal features for classification more effectively. The model employs a residual neural network as its backbone architecture and uses the Geometry Coordinate Attention mechanism, grounded in information geometry theory, to perform time-frequency weighted analysis on features at multiple scales within the architecture, producing time-frequency weighted features of the signal. These weighted features are further analyzed through a low-rank bilinear pooling module and combined with the backbone features to achieve fine-grained feature fusion, yielding a fused feature vector for mixed-signal classification. Experiments were conducted on a simulated dataset comprising 39,600 mixed-signal time-frequency plots, and the model was benchmarked against a residual neural network baseline. The experimental outcomes demonstrated an improvement of 9% in the exact match ratio and 5% in the Hamming score. These results indicate that the proposed model significantly enhances the recognition capability and generalizability of mixed-signal classification.


Introduction
Automatic modulation recognition (AMR) technology is used to identify the modulation schemes of received signals and finds extensive applications in both military and civilian domains, being closely related to non-cooperative communication, electronic warfare, and security detection [1][2][3]. With the rapid development of wireless communication technologies, the modulation types of signals are becoming increasingly diverse and complex. Additionally, the radio frequency environment is becoming more challenging, leading to frequent occurrences of signal time-frequency overlap [4]. Currently, research on the automatic recognition of single modulation types has become relatively mature. For instance, methods employing decision theory, likelihood approaches, classification models based on domain knowledge and manually designed features [5,6], and neural network models based on deep learning [7] have been developed. However, research on the recognition of signals involving two or more modulation types is less explored. There are two main approaches for modulation type recognition in the context of multiple signal mixtures: one involves preprocessing by separating the signals first and then recognizing the modulation type of each separated signal; the other performs recognition on the mixed signal directly.

To limit the influence of the number of signal types on the network's output parameters, a multi-label classification approach is adopted for signal categorization. This method significantly reduces output dimensions relative to traditional multi-class classification techniques. In the experiments, a dataset was constructed with signals from five different modulation types, including mixed signals, under various signal-to-noise ratios to validate the effectiveness of the proposed module in enhancing classification capabilities. Additionally, experiments with a robustness dataset demonstrated the model's stability and general applicability across diverse mixed-signal conditions. These research findings provide new ideas and methods for the development of automatic modulation recognition technology, with the potential to enhance the performance and reliability of wireless communication systems in practical applications. Future research will continue to focus on optimizing model structures, improving recognition accuracy, and exploring more efficient signal feature extraction and classification methods to address increasingly complex and dynamic communication environments.
The main innovations of this paper are as follows: (1) The proposed Geometry Coordinate Attention Fusion Module is applied in neural networks. This module defines pooling operations based on information geometry theory to design an effective spatial-dimension feature enhancement mechanism. This method provides corresponding time-frequency weights for features, resulting in more expressive feature representations. (2) The low-rank bilinear pooling module is introduced to achieve cross-layer interaction between the fused features and the backbone features. This module obtains fine-grained feature representations of signals by mapping and aggregating between different features.
The remainder of this article is organized as follows. In Section 2, we introduce the signal reception model, the derivation of the information geometry model, and related work, including the backbone neural network we used. In Section 3, we present our proposed model framework and its underlying principles. In Section 4, we evaluate various performance metrics of the model through experiments and conduct a comparative analysis. In Section 5, we summarize the strengths and weaknesses of the proposed model and provide an outlook on future development directions.

Related Works

Signal Data
In a complex electromagnetic environment, there typically exist one or more sources that transmit and receive signals. When there are M signal transmitters and the signals are captured by a single-channel sensor, the received signal can be expressed as

x(t) = A s(t) + n(t) = Σ_{i=1}^{M} a_i s_i(t) + n(t)  (1)

where x(t) represents the signal received by the single-channel sensor, A = [a_1, …, a_M] denotes the signal-receiving mixing coefficient matrix, s_i(t) is the i-th independent signal source, M is the total number of independent signal sources that the sensor can receive, and n(t) represents additive noise.
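For illustration, the following sketch simulates this reception model numerically. The source waveforms, mixing coefficients, and noise level are hypothetical stand-ins rather than the paper's dataset parameters.

```python
import numpy as np

def mix_signals(sources, a, noise_std=0.1, rng=None):
    """Single-channel reception model x(t) = A s(t) + n(t) for M sources.

    sources: array of shape (M, T), one row per independent source s_i(t)
    a:       length-M mixing coefficients (the sensor's row of the matrix A)
    """
    rng = rng or np.random.default_rng()
    x = a @ sources                                       # weighted sum of the M sources
    return x + noise_std * rng.standard_normal(x.shape)   # additive noise n(t)

# Toy example: two sources, a sine and a BPSK-like square wave
t = np.arange(0, 0.01, 1e-4)
s = np.stack([np.sin(2 * np.pi * 1e3 * t),
              np.sign(np.sin(2 * np.pi * 250 * t))])
x = mix_signals(s, a=np.array([1.0, 0.8]))
```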
In complex electromagnetic environments, the received signals often exhibit poor sparsity and significant overlap, whether in the time or frequency domain, and distinguishing between various signal types in these domains remains a considerable challenge. To address this, our study converts the signals into the time-frequency (TF) domain, where they are represented as two-dimensional images evolved from their original one-dimensional forms. This conversion notably enhances signal sparsity and magnifies the differences among distinct signals. We employ the Short-Time Fourier Transform (STFT) as the core method and investigate the energy plots it produces. The mathematical formula for the STFT is

STFT_x(t, f) = ∫ x(τ) h(τ − t) e^{−j2πfτ} dτ  (2)

where x(t) is the received signal and h(t) is the window function of the STFT; the window function used in this article is the Hamming window. The energy plots in the time-frequency domain are generated from the time-frequency representations. They have the advantage of being less affected by cross-term interference, and they exhibit better separability in the signal time-frequency domain.
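As a minimal sketch of this step, the snippet below computes a Hamming-windowed STFT and the corresponding energy plot with SciPy. The test waveform and window length are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import stft

fs = 1e7                                   # sampling frequency used in the paper
t = np.arange(0, 1e-3, 1 / fs)
x = np.sin(2 * np.pi * 1e6 * t)            # placeholder received signal x(t)

# STFT with a Hamming window h(t); nperseg is an assumed window length
f, tt, Zxx = stft(x, fs=fs, window='hamming', nperseg=256)
energy = np.abs(Zxx) ** 2                  # time-frequency energy plot |X(t, f)|^2
```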
The mathematical expression of the energy diagram after the signal is transformed is

E(t, f) = |X(t, f)|²,  X(t, f) = Σ_{i=1}^{M} a_i S_i(t, f) + N(t, f)  (3)

where X(t, f) is the received signal x(t) after time-frequency transformation, S_i(t, f) is the transmitted signal s_i(t) after time-frequency transformation, and N(t, f) is the noise in the time-frequency domain.

Statistical Manifolds and Their Geometric Structures
In the pooling layers of neural network models, traditional average pooling and max pooling operations extract features by focusing solely on the average or maximum values within specific regions. A more sophisticated approach involves performing separate average pooling and max pooling operations on the signals while sharing weights [19], but this method still results in information loss. This paper proposes a novel approach based on information geometry theory to preserve more feature information during pooling operations. The goal is to enhance the separability between sample features and improve the model's recognition accuracy.

Fisher Information
Based on information geometry theory, under different principles of invariance, a statistical manifold can be endowed with different Riemannian metrics, representing distinct geometric structures. The Fisher information matrix, owing to its statistical and geometric properties, serves as a cornerstone in the theory and applications of information geometry and is often used to construct Riemannian metrics on statistical manifolds.
Without loss of generality, consider the density function of an exponential family, p_θ(x), θ ∈ Θ, where Θ is the parameter space of the probability density functions of the exponential distribution family.
In fact, there are many ways to define a Riemannian metric on the parameter space Θ. For different problem scenarios, it is necessary to select a concise and effective metric among the many Riemannian metrics. Without loss of generality, this paper selects the Fisher information metric as the analysis object.
Assume the following:

g_ij(θ) = E_θ[ (∂ log p_θ(x)/∂θ^i)(∂ log p_θ(x)/∂θ^j) ]  (9)

Assume u, v ∈ T_θ, where T_θ = T_θ(Θ) is the tangent space at the point θ ∈ Θ. Therefore, the Fisher information metric can be expressed as

g(u, v) = g_ij u^i v^j  (10)

Gaussian Statistical Manifold
The Gaussian distribution has extremely broad practical applications. In radar engineering, the probability distributions of many random variables can be approximated by the normal distribution. It holds a significant position in information geometry and is one of the core topics in statistical geometric analysis.
Consider the following density function of the Gaussian distribution family:

p(x; µ, σ) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))  (11)

where µ and σ are the mean and standard deviation of the Gaussian distribution, respectively. Taking the logarithm of (11), we can obtain

log p(x; θ) = θ_1 x + θ_2 x² − ψ(θ)  (12)

where (θ_1, θ_2) = (µ/σ², −1/(2σ²)); let

ψ(θ) = −θ_1²/(4θ_2) + (1/2) ln(−π/θ_2)  (13)

Find the partial derivatives of both ends of Equation (13) with respect to (θ_1, θ_2), and we obtain

∂ψ/∂θ_1 = E[x] = µ  (14)

∂ψ/∂θ_2 = E[x²] = µ² + σ²  (15)

Therefore,

G(θ) = ∇²_θ ψ(θ)  (16)

And the Fisher information is

G(µ, σ) = (1/σ²) [1 0; 0 2]  (17)

or

ds² = (dµ² + 2dσ²)/σ²  (18)

Based on Equation (18), this paper modifies the pooling operation within the Coordinate Attention (CA) mechanism and proposes a new attention mechanism called Geometry Coordinate Attention (GCA). The GCA aims to enhance information retention in sample features.
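The closed form in Equations (17) and (18) can be checked numerically: the Fisher matrix is the expected outer product of the score, which a Monte Carlo average approximates. The snippet below is a sanity check only, not part of the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score components: d/d(mu) log p and d/d(sigma) log p for N(mu, sigma^2)
s_mu = (x - mu) / sigma**2
s_sigma = ((x - mu)**2 - sigma**2) / sigma**3

# Monte Carlo estimate of G = E[score score^T]
G = np.array([[np.mean(s_mu**2),        np.mean(s_mu * s_sigma)],
              [np.mean(s_mu * s_sigma), np.mean(s_sigma**2)]])
print(G)   # ~ (1/sigma^2) * diag(1, 2) = [[0.25, 0], [0, 0.5]]
```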

ResNet
In deep learning algorithms, particularly deep convolutional neural networks (CNNs), increasing the depth of the network can enhance its learning capacity to a certain extent. However, as the network depth increases further, the vanishing or exploding gradient problem occurs during backpropagation: gradients become extremely small or large when propagated back to shallow layers. This makes it challenging to update parameters effectively, resulting in model degradation and reduced classification performance. To address this issue, He et al. introduced residual networks (ResNet), based on residual learning, in 2016 [20]. The main structure in ResNet is the residual building unit (RBU), consisting of non-linear layers followed by an identity shortcut connection. The identity shortcut connections help alleviate the difficulty of parameter optimization caused by increasing model depth. Parameter updates not only propagate layer by layer but can also be passed directly through the identity shortcut connections to shallower layers, effectively reducing the training difficulty of the network.
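A minimal PyTorch sketch of such a residual building unit is shown below; the two-convolution layout with a parameter-free identity shortcut is the standard basic-block pattern, with channel counts left as placeholders.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual building unit: conv layers plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut lets gradients bypass the conv path
```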

Geometry Coordinate Attention and Low-Rank Bilinear Pooling Network
The structure of the Geometry Coordinate Attention and low-rank bilinear pooling network constructed in this article is shown in Figure 1. This paper uses ResNet-50 as the backbone network, adds Geometry Coordinate Attention to further improve the feature learning ability of the convolutional network, and then introduces a low-rank bilinear pooling module to achieve fine-grained analysis of signal sample features. Finally, multi-label classification is used in place of multi-class classification to output the sample prediction values. In this section, we first introduce the principle of the Geometry Coordinate Attention module and then introduce the multi-modal fine-grained feature fusion method of the low-rank bilinear pooling module.

Geometry Coordinate Attention Module
To retain multi-level feature information effectively, this paper designs a weighted fusion module for multi-scale features. Low-, medium-, and high-level features are extracted from the backbone network so that features of different scales are optimized and fused in the time-frequency domain. The design structure of this module is shown in Figure 2.

First, the low-, medium-, and high-level features x_1, x_2, and x_3 of the received signal are extracted from different layers of the backbone network. The features are fed individually into GCA, where H, W, and C represent the height, width, and channel dimensions of the feature, respectively. Since the input to the neural network is the time-frequency feature plot of the signal, the width and height dimensions of the feature can be regarded as its time and frequency dimensions. Then, in the GCA, the time-frequency dimension weight information of the multi-scale modal features is extracted and applied as weights, and finally the multi-scale modal features are fused.
In GCA, the pooling operation P(x) is defined according to Equation (18) in terms of the statistics vector v = [µ, σ], where µ and σ are the mean and standard deviation of the input features along a given dimension. The pooling operation defined by Equation (19) jointly considers the statistical and geometric characteristics of the features and can extract more feature information than the traditional average pooling and max pooling operations.
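Since the exact form of Equation (19) is not recoverable from this text, the sketch below is only a plausible instantiation: it pools an axis down to the statistics v = [µ, σ] and combines them with a norm suggested by the Fisher metric of Equation (18). The combination rule is an assumption, not the paper's exact definition.

```python
import torch

def geometry_pool(x, dim):
    """Statistics-based pooling P(x) along one axis of a (B, C, H, W) tensor.

    Reduces the axis to a function of v = [mu, sigma]. The combination
    sqrt(mu^2 + 2*sigma^2)/sigma is a hypothetical choice inspired by
    ds^2 = (dmu^2 + 2 dsigma^2)/sigma^2, not the paper's exact Eq. (19).
    """
    mu = x.mean(dim=dim, keepdim=True)
    sigma = x.std(dim=dim, keepdim=True) + 1e-6   # guard against zero variance
    return torch.sqrt(mu**2 + 2 * sigma**2) / sigma
```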
The input features are encoded along the horizontal and vertical axes by two pooling operations to generate a pair of direction-aware feature sets. This transformation helps locate features spatially and correlates features in time and frequency. The mathematical expression of vertical pooling coding (H Pool) is

y^H_c(h) = P_{0≤w<W}(x_c(h, w)),  y^H ∈ R^{H×1×C}  (20)

where x_c is the received-signal feature in channel c of the neural network, h is the minimum granularity of the height dimension, and c is the minimum granularity of the channel dimension; y^H is the height-direction feature of the sample in the channel dimension and is also a frequency characteristic. The mathematical expression of horizontal pooling coding (W Pool) is

y^W_c(w) = P_{0≤h<H}(x_c(h, w)),  y^W ∈ R^{1×W×C}  (21)

where w is the minimum granularity of the width dimension and y^W is the width-direction feature of the sample in the channel dimension and also the time feature.
Time and frequency features are concatenated along the channel dimension, and a convolution with kernel size (1, 1) is used to obtain the joint time-frequency intermediate feature. Batch normalization and a non-linear activation function are then applied to reduce differences among the joint features:

z = δ(BN(F_1([y^H; y^W]))),  z ∈ R^{(H+W)×1×(C/r)}  (22)

where F_1 is the (1, 1) convolution, BN is batch normalization, δ is the non-linear activation function, and r is used to control the GCA parameter amount.
Then, the joint intermediate feature is split and reshaped along the channel dimension into two features, z^H_r ∈ R^{H×1×(C/r)} and z^W_r ∈ R^{1×W×(C/r)}. Over-sampling operations using convolutions with kernel size (1, 1) are applied to each feature to restore the parameter dimensions to those of the input features. Finally, a non-linear activation function is applied to each feature to obtain the weights along the time and frequency dimensions:

g^H = sigmoid(F_H(z^H_r)),  g^W = sigmoid(F_W(z^W_r))  (23)

where sigmoid(·) is the sigmoid function operation, which performs normalization and activation, and F_H and F_W are the (1, 1) convolutions. Finally, the input features of different scales are multiplied by their corresponding weight coefficients along these dimensions to obtain enhanced weighted features along the time and frequency dimensions. Additionally, the scaled weighted features are normalized to achieve feature fusion with unified scale parameters.
x_i^{H+W} = x_i · g_i^H · g_i^W  (24)

where x_i^{H+W} is the weighted feature obtained by multiplying the corresponding input feature by its weight coefficients.
z = x_1^{H+W} ⊕ x_2^{H+W} ⊕ x_3^{H+W}  (25)

where the ⊕ operation is element-wise addition, applied after the weighted features are rescaled to a common size. The variable z ∈ R^{7×7×2048} is the final output, a multi-scale weighted fusion feature.
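Putting the pieces together, the sketch below follows the Coordinate Attention layout with the average pooling swapped for the statistics-based geometry_pool defined earlier. The reduction ratio and layer sizes are illustrative, and the multi-scale fusion across x_1, x_2, x_3 is omitted for brevity.

```python
import torch
import torch.nn as nn

class GeometryCoordinateAttention(nn.Module):
    """CA-style attention with geometry_pool along the time/frequency axes."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)            # C/r intermediate width
        self.conv1 = nn.Conv2d(channels, mid, 1)       # (1,1) conv F_1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)      # restore C for g^H
        self.conv_w = nn.Conv2d(mid, channels, 1)      # restore C for g^W

    def forward(self, x):                              # x: (B, C, H, W)
        _, _, h, w = x.shape
        y_h = geometry_pool(x, dim=3)                  # (B, C, H, 1): H Pool
        y_w = geometry_pool(x, dim=2)                  # (B, C, 1, W): W Pool
        y = torch.cat([y_h, y_w.transpose(2, 3)], dim=2)   # joint feature
        y = self.act(self.bn(self.conv1(y)))
        z_h, z_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(z_h))                   # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(z_w.transpose(2, 3)))   # (B, C, 1, W)
        return x * g_h * g_w               # time-frequency weighted feature
```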

Low-Rank Bilinear Pooling Module
This paper leverages the outer product of neural network features extracted from two modalities to perform weighted feature fusion, obtaining bilinear characteristics of the signal that enhance the network's ability to analyze fine-grained features and consequently improve its capability for signal modulation recognition. To address the parameter explosion caused by bilinear pooling operations, the paper employs low-rank bilinear pooling to optimize the operation and reduce the parameter count: the Hadamard product is used in place of the outer product so that different features interact and their correspondences are captured. This approach significantly reduces the parameter count at the expense of some computational complexity. In the final fully connected layer, a staged reduction in parameters is employed alongside dropout to mitigate the risk of overfitting.
The bilinear pooling operation is shown in Figure 3, and its mathematical expression is

f = x^T W z + β  (27)

where x and z are two features that are mapped onto each other: x is the backbone feature obtained from the input sample through the backbone architecture, while z is the fused feature output by the Geometry Coordinate Attention module from the high-, middle-, and low-level features of the sample. C_x and C_z are the channel dimensions of the two features, W ∈ R^{C_x×C_z} is the weight matrix realizing the mapping between the channel dimensions of the two features, β is the bias, and f represents the output feature.

Perform matrix decomposition on the weight matrix as follows:

W = U V^T  (28)

where U ∈ R^{C_x×m} and V ∈ R^{C_z×m} are low-rank decomposition matrices of W and m is the joint embedding dimension. Equation (28) can be further expressed in Hadamard product form:

F = (U^T x) ∘ (V^T z)  (29)

where ∘ denotes the Hadamard product and F is the output feature after the transformation.
Using two low-rank matrices, U and V, to approximate W avoids the direct computation of the outer product of the two features required by the original bilinear pooling method. This decreases the parameter count from C_x × C_z to m × (C_x + C_z). The parameter count is thus controlled jointly by the joint embedding dimension m and the channel numbers of the two input features. Therefore, while keeping the features unchanged, it is possible to merge channels to reconstruct the feature parameter scales, reducing the number of channels and trading additional computation for a lower parameter count to ensure efficient network operation.
In this module, the mutual mapping between two sets of input features is used to extract joint representations of features across channels, enabling multi-modal bilinear pooling that improves neural network performance in recognizing the signal modulations within the same sample. Additionally, the use of low-rank matrix approximation for the outer-product computation addresses the excessive parameter count of bilinear pooling, mitigating potential inefficiencies and overfitting in neural network operation.
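The sketch below shows the low-rank factorization as a PyTorch module under assumed dimensions: the outer product is never formed, and the joint embedding dimension m controls the parameter count as discussed above.

```python
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """f = P((U^T x) * (V^T z)): Hadamard product of two low-rank projections."""
    def __init__(self, c_x, c_z, m, out_dim, p_drop=0.5):
        super().__init__()
        self.U = nn.Linear(c_x, m, bias=False)   # C_x -> m
        self.V = nn.Linear(c_z, m, bias=False)   # C_z -> m
        self.P = nn.Linear(m, out_dim)           # joint embedding -> output
        self.drop = nn.Dropout(p_drop)           # dropout against overfitting

    def forward(self, x, z):
        # m*(C_x + C_z) projection parameters instead of C_x*C_z for full W
        return self.P(self.drop(self.U(x) * self.V(z)))
```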

Loss Function and Evaluation Metrics
This paper's model uses a multi-label classification approach for signal classification, which consists of multiple binary classifiers combined together. Sample labels are encoded using one-hot encoding. The normalization function connected to the model's output layer is the sigmoid function instead of the Softmax function. The model employs the binary cross-entropy loss (BCELoss) form of the cross-entropy loss function for multi-label binary classification. Compared to traditional multi-class algorithms, this method effectively reduces the label dimensions and makes better use of the feature space of the output predictions. The BCELoss not only considers the situation when the prediction value is 1 but also incorporates the impact of the prediction value being 0 on the loss. The expression for the BCELoss is as follows:

L = −(1/p) Σ_{j=1}^{p} Σ_{i=1}^{q} [ y_true^{i,j} log(y_pred^{i,j}) + (1 − y_true^{i,j}) log(1 − y_pred^{i,j}) ]  (31)

where y_true^{i,j} is the true value of the i-th label in the j-th sample and y_pred^{i,j} is the predicted value of the i-th label in the j-th sample; p is the total number of samples, which is equivalent to the batch size during batch training, and q is the number of sample labels, which is equivalent to the prediction output dimension.
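In PyTorch terms, such a head is a per-label sigmoid followed by binary cross-entropy; a minimal sketch with placeholder shapes follows. BCEWithLogitsLoss fuses the sigmoid and the loss for numerical stability.

```python
import torch
import torch.nn as nn

num_labels = 5                          # five modulation types, one binary output each
logits = torch.randn(32, num_labels)    # placeholder network outputs for a batch
y_true = torch.randint(0, 2, (32, num_labels)).float()   # multi-hot labels

criterion = nn.BCEWithLogitsLoss()      # sigmoid + BCE, averaged over p and q
loss = criterion(logits, y_true)
```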
The evaluation metrics of the experiment follow ref. [21], which uses the exact match ratio (EMR), Hamming score (HS), and F-Measure (F1). These three metrics provide a specific evaluation of the model's ability to identify signal modulation types. The exact match ratio is the strictest measure of recognition effectiveness, representing the model's ability to be completely accurate, but partial accuracy is also part of model evaluation; therefore, the Hamming score and F-Measure are included as evaluation metrics. The Hamming score reflects the accuracy over the whole sample, representing the model's perception of all labels. The F-Measure is a comprehensive measure of sample recognition accuracy that balances precision and recall, representing the balance of the model.
The calculation formula for the exact match ratio (EMR) is as follows:

EMR = (1/p) Σ_{j=1}^{p} I(y_pred^j = y_true^j)  (32)

where y_pred^j is the set of predicted values for the j-th sample, y_true^j is the set of true values for the j-th sample, and I(·) is a function that returns 1 when the predicted values exactly equal the true values and 0 otherwise. The Hamming score represents the average accuracy across all samples; for each individual sample, it is the proportion of signal categories for which both the predicted and true values are true among all signal categories predicted as true or true in reality.
The Hamming score calculation formula is as follows:

HS = (1/p) Σ_{j=1}^{p} |y_true^j ∩ y_pred^j| / |y_true^j ∪ y_pred^j|  (33)

The F-Measure calculation formula is

F1 = (1/p) Σ_{j=1}^{p} (2 Σ_{i=1}^{q} y_pred^{i,j} y_true^{i,j}) / (Σ_{i=1}^{q} y_pred^{i,j} + Σ_{i=1}^{q} y_true^{i,j})  (34)
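A direct NumPy implementation of the three metrics, assuming 0/1 prediction and label matrices of shape (p, q), might look as follows.

```python
import numpy as np

def multilabel_metrics(y_pred, y_true):
    """EMR, Hamming score, and F-Measure as defined above."""
    emr = np.mean(np.all(y_pred == y_true, axis=1))         # strict all-label match
    inter = np.sum((y_pred == 1) & (y_true == 1), axis=1)   # per-sample intersection
    union = np.sum((y_pred == 1) | (y_true == 1), axis=1)   # per-sample union
    hs = np.mean(inter / np.maximum(union, 1))
    f1 = np.mean(2 * inter / np.maximum(y_pred.sum(1) + y_true.sum(1), 1))
    return emr, hs, f1
```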

Dataset
In this experiment, a single-antenna reception scenario was simulated under complex environments by adding colored noise to the sample signals to better reflect real-world conditions. The dataset consists of single-signal and mixed-signal samples with signal-to-noise ratios (SNRs) spanning from −12 dB to 8 dB in 2 dB increments. The signal sampling frequency is 1 × 10^7 Hz, with a frequency range from 0.5 × 10^7 Hz to 3 × 10^7 Hz. The resulting sample data format is a 224 × 224 RGB matrix. The mixing of signals is based on the fifteen combinations listed in Table 1, with 240 samples generated per combination at each SNR level, totaling 39,600 signal time-frequency energy plots. The mixing levels among different signal types are evenly distributed. The dataset is divided into training and test sets at a ratio of 5:1.
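A sketch of the SNR control is given below; for brevity it adds white Gaussian noise scaled to a target SNR, whereas the paper uses colored noise, so treat this as a simplified stand-in.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng=None):
    """Add Gaussian noise scaled so the sample has the requested SNR in dB."""
    rng = rng or np.random.default_rng()
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return x + np.sqrt(p_noise) * rng.standard_normal(x.shape)

snr_grid = np.arange(-12, 10, 2)   # the paper's grid: -12 dB to 8 dB in 2 dB steps
```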

Experiment Environment
The experimental environment is configured as follows: the computer system was Ubuntu 16.04, with a TITAN graphics card and 24 GB of memory. The programming language was Python 3.8, the deep learning framework was PyTorch 1.7, and the optimizer was the Adaptive Moment Estimation (Adam) optimizer. Training was performed from scratch, with an initial learning rate of 0.001 and adaptive decay. Training ran for 200 epochs, and the batch size was set to 32 based on the network parameters and GPU memory size.
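The optimizer setup translates to a few lines of PyTorch. The scheduler below is one common reading of "adaptive decay", and the model object is a hypothetical stand-in.

```python
import torch

model = torch.nn.Linear(10, 5)     # hypothetical stand-in for the GCABNet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial LR 0.001
# One plausible "adaptive decay": shrink the LR when the validation loss stalls
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
```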

Ablation Study
To verify the reliability of the proposed modules, four ablation experiments were conducted using the same dataset, experimental setup, and evaluation metrics. The neural network model containing both the Geometry Coordinate Attention Fusion Module and the low-rank bilinear pooling module was named GCABNet; the model containing only the Geometry Coordinate Attention Fusion Module was named GCANet; the model containing only the low-rank bilinear pooling module was named BNet; and the backbone network model was named InitNet. In the BNet model, both input variables of the bilinear pooling module use the feature x. Additionally, a model named GCABVGG, which integrates the proposed modules into the VGG16 network structure, was compared with the plain VGG16 network (VGG).
The results of the ablation experiments under dynamic SNR conditions, depicted in Figure 4a-c, indicate that the GCA Module and the low-rank bilinear pooling module proposed in this paper led to an approximately 9% increase in EMR and a 5% increase in HS under low SNR conditions. The model's recognition capability improved under high SNR conditions, with an EMR above 98% for signal modulation type recognition at SNRs above 2 dB, approaching 100% at SNRs exceeding 4 dB. Specifically, the low-rank bilinear pooling module achieved an approximately 6% increase in EMR under low SNR conditions, demonstrating that bilinear pooling can yield more effective discriminative features at a finer granularity, significantly impacting the model's recognition capability. Furthermore, the Geometry Coordinate Attention accurately obtained time-frequency dimension weights from multiple scales of sample features, effectively retaining sample feature information and helping the model capture underlying discriminative features. Therefore, the proposed modules demonstrate their feasibility for improving the recognition of signal modulation types.
Moreover, as shown in the experimental results in Figure 4d-f, integrating the proposed modules into the VGG network as the backbone structure resulted in a 7% improvement in EMR and approximately 9% improvement in HS under low SNR conditions, with improved balance in the F-Measure, further demonstrating the reliability of the proposed modules in this paper.

Attention Comparison Experiment
To verify that the Geometry Coordinate Attention proposed in this paper's GCA Fusion Module obtains more effective time-frequency weight coefficients, this section conducts six comparative experiments by replacing the GCA part of the GCA Fusion Module with CA, SAM, BAM, CAM, CBAM, and SENet, while keeping all other experimental conditions identical. The performance of the different attention mechanisms on the dataset is compared in terms of recognition capability and classification effectiveness.
Specifically, CA (Coordinate Attention) encodes the input features along separate spatial directions and outputs weight coefficients after normalization. SAM (Spatial Attention Module) analyzes both the height and width dimensions, enhancing the model's ability to process spatial information. CAM (Channel Attention Module) extracts channel-dimension information from features, enhancing the model's ability to learn dependencies between different channels. BAM (Bottleneck Attention Module) is a parallel combination of the CAM and SAM attention mechanisms. CBAM (Convolutional Block Attention Module) is a serial combination of the CAM and SAM attention mechanisms. SENet (Squeeze-and-Excitation Network) enhances the response of important feature channels while suppressing the response of unimportant ones.
The experimental results in Table 2 demonstrate that the GCA Module exhibits superior performance compared to classical attention mechanisms. The GCA outperforms the other attention mechanisms in terms of both EMR and HS, with up to a 3% advantage in EMR and a 1% advantage in HS. Furthermore, the F-Measure results indicate that the stability of the proposed mechanism is also advantageous compared to the other mechanisms. These results show that the GCA maximizes the extraction of signal feature weight coefficients based on information geometry theory, enhancing the learning ability of the neural network and thereby improving the model's recognition capability. Compared to the other mechanisms, the GCA considers local granularity information in the time-frequency domain while integrating channel weights, retaining more feature information and enhancing the separability of signal features, which proves the feasibility and importance of capturing time-frequency weight information. The comparison between the proposed GCA and CA leads to the conclusion that attention mechanisms based on information geometry theory can retain sample feature information more effectively, thus improving model recognition capability.

Robustness Experiment
To verify the robustness of the proposed signal recognition model, a supplementary dataset was created for this section. It includes additional mixed-signal combinations not present in the original dataset from Section 4.1. Under the same experimental conditions, this supplementary dataset was used as the test set for further experiments to validate the reliability of the signal recognition model. The signal combinations in the supplementary dataset are shown in Table 3. The robustness dataset comprises mixed signals with signal-to-noise ratios (SNRs) ranging from −12 dB to 8 dB in steps of 2 dB, with 40 samples generated per combination at each SNR level. In the robustness experiment, a confusion matrix was used to represent the experimental results. In this matrix, besides the signal mixture types serving as labels, an "X" label was introduced to indicate that the model recognized a signal combination not present in the corresponding dataset, and a "NULL" label denotes cases where the model did not recognize any signals. Figure 5 shows the confusion matrix obtained by randomly selecting 80 signal samples for each mixed type from the original dataset across the various SNRs. It reveals that the EMR reaches 84.6%, exceeding 92% for single signals and nearing 80% for multiple signals; the HS is 91.1%. It is also observed that as the number of coexisting mixed signals within a sample type increases, the model's ability to recognize signal types weakens. This is because more signals under the same noise conditions cause energy dispersion in the time-frequency domain, blurring the features, reducing separability, and increasing recognition difficulty.
In Figure 6, the EMR for signals from the supplementary dataset is only 47.2%, while the HS reaches 84.6%. The supplementary dataset contains a significantly higher number of mixed-signal samples compared to the original dataset, resulting in a noticeable decrease in the EMR due to the increased complexity of the mixed samples. However, the HS approaches that of the original dataset. Considering the EMR, HS, and analysis of the confusion matrix, the model's ability to recognize signal types in the supplementary dataset is comparable to that in the original dataset. However, there is a discernible limitation in distinguishing between different types of signal combinations, particularly affecting recognition performance when QAM and BPSK signal types coexist. Therefore, under the same experimental conditions, introducing supplementary dataset samples at a 40:1 ratio into the training set enhances the model's learning of different combination types and leads to a new supplementary confusion matrix.
In Figure 7, the EMR of the samples reaches 73%, with an HS of 90.3%. The majority of samples that did not achieve an absolute match were due to misidentifications of a single signal type, indicating a certain level of accuracy. Therefore, the model in this study demonstrates a certain capability and robustness in recognizing signal combinations outside of the training samples. After training with small samples of unknown combinations, the recognition capability improves significantly, with an approximately 17% increase in EMR and a 5.7% increase in HS, approaching the model's accuracy on the original dataset.

Figure 3. Low-rank bilinear pooling method.

Table 1. Signal sample combination in the initial dataset.

Table 2. Performance comparison of different attention mechanisms in the model.

Table 3. Signal sample combination in the supplementary dataset.