Unlocking Signal Processing With Image Detection: A Frequency Hopping Detection Scheme for Complex EMI Environments Using STFT and CenterNet

Accurate detection and parameter estimation of frequency hopping (FH) signals remain challenging in FH signal-based transmission systems. This study proposes a scheme combining time-frequency analysis (TFA) and deep learning (DL)-based image processing algorithms to alleviate the degradation of detection accuracy and estimation performance caused by complex electromagnetic interference (EMI). A short-time Fourier transform (STFT) was used to obtain the signal spectrogram, which reflects the signal energy in a concentration-dependent manner. Then, a CenterNet-based deep network was employed to identify each FH hop’s shape and position, reducing the computational burden via a lightweight neural network while maintaining high recognition accuracy. Inverse mapping from the coordinates to the spectrogram was used to perform parameter estimation in the time-frequency (TF) domain. The estimation error was reduced by precisely locating the centroid of the signal energy using CenterNet. The simulation results demonstrate that the proposed scheme can accurately estimate the FH signal at a low signal-to-noise ratio (SNR) with complex EMI. Furthermore, appropriately determining the optimal parameters of CenterNet to ensure the estimator performance provides a novel approach for integrating DL into signal detection and estimation in complex EMI environments.


I. INTRODUCTION
Frequency hopping (FH) signals have been increasingly utilized in civilian and military communications. Since their invention, their robust anti-interference capability, low probability of interception, and excellent anti-fading performance have diversified the use of electromagnetic waves [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. For instance, Code Division Multiple Access-based 3G cellular communications significantly enhance the channel capacity; however, the multipath effect may distort the frequency identification of FH carriers, leading to an elevated bit error rate. While military FH communications require less channel capacity and spectrum utilization, the adversary's intense electromagnetic interference (EMI) usually focuses on covering the hopping bands, resulting in false detection or even communication disruption at the receiver. Thus, the precise detection and estimation of the received FH signals are essential for effective information transmission. To address this issue, an adaptive kernel was utilized in [13] to maximize the mutual information between the input signal and the measurement, thus improving the detection performance after compressive sampling of the full FH spectrum. In [14], a spectrum estimation algorithm for FH signals based on compressive sensing was introduced, which reconstructed FH signals with missing observations by exploiting the inherent structure of the signal. A compression-aware measurement was presented in [15] to decode FH signals without prior knowledge of the pseudo-random code, and an adaptive kernel was designed for compression and decoding. After the immediate exclusion of gross errors, [16] proposed a low-complexity multiple-target detection method for FH signals based on a small set of random samples. In [17], FH signals were assumed to be blindly separated by exploiting the polarized frequency correlation, and the estimation accuracy was improved by descrambling reconstruction.
(The associate editor coordinating the review of this manuscript and approving it for publication was Wen Chen.)
The solutions above consider only the case of FH signals under the assumption that no other EMI exists. However, in realistic environments, there is interference from nature, from the numerous electronic devices around the link, and from deliberately released disturbances. Furthermore, the FH signal is time-varying, making it difficult to describe its changing patterns visually and to distinguish FH signals from interference using traditional time-domain or frequency-domain analysis, especially in complex EMI environments. Fortunately, approaches based on time-frequency analysis (TFA) enable the visualization of signal characteristics in the time-frequency (TF) domain, providing an effective solution for this problem [18], [19], [20], [21], [22], [23].
Among the various TFA methods, the widely used ones include the short-time Fourier transform (STFT) [24], [25], [26], [27], [28], [29], the Wigner-Ville distribution (WVD) [30], [31], [32], [33], and its modified form, known as the smoothed pseudo Wigner-Ville distribution (SPWVD) [3], [34], [35], [36], [37]. Although the WVD has the highest resolution, its sizeable cross-term interference requires numerous computationally intensive matrix operations. The SPWVD can handle cross-term interference but has limited noise tolerance. In contrast, the STFT is computationally simple and, because it does not correlate the signal with itself, is free of cross-term interference. Additionally, because the STFT is a windowed transform, sharp, high-resolution edges can be obtained in the spectrogram with a smaller window, making it easier to analyze the signal features using image processing algorithms. However, in complex EMI environments, interfering signals often contaminate the TF representation (TFR) of the STFT, causing a noticeable drop in detection accuracy. Consequently, accurately identifying FH signals in the generated images has become a pressing concern.
With the advent of deep learning (DL) and rapid development of hardware capabilities, image-processing algorithms based on DL have emerged. Various networks focusing on target detection have been proposed in the literature [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], using convolutional kernels to extract image features and improve recognition accuracy by increasing the network depth and innovating structural design. However, these algorithms require abundant network parameters, resulting in a high computational burden. Moreover, these algorithms focus on the entire image and pay insufficient attention to the energy concentration of FH signals in spectrograms. In contrast, CenterNet [48], [49], [50], [51] is a lightweight network with a target centroid as the primary measurement criterion for recognition, ensuring a high recognition accuracy and speed.
Based on the above analysis, this study proposes an FH signal estimation scheme using STFT and CenterNet, which can accurately estimate FH signals after detecting them in complex EMI. The scheme first transforms the signal into a spectrogram using the STFT to show energy aggregation, and then uses CenterNet to separate the FH signals from other interferences by labeling the aggregation area of each hop in the spectrogram. Finally, mapping the pixel coordinates to the TF grid provides an accurate parameter estimation of the FH hops. The contributions of this study are as follows: 1) To make the distinction more intuitive, the scheme converts the statistical features of both signals and interferences into image features; 2) By introducing DL-based image detection methods into signal processing, the fast iterative properties of DL algorithms can rapidly improve the accuracy and anti-interference capability of signal estimation; 3) The scheme enriches the toolbox of signal detection and parameter estimation by taking advantage of specific parameters from image detection, such as intersection over union (IoU) analysis; 4) CenterNet boosts the computing speed by leveraging parallel computing for machine learning.
The remainder of this paper is organized as follows: Section II establishes an FH signal model with complex EMI. Section III describes TFA for the FH signal, target detection, and parameter estimation. Section IV presents the simulation setup and results evaluation, and Section V provides concluding remarks.

II. SIGNALS AND ENVIRONMENT MODELING
This section analyses the EMI environment and creates a signal model in the TF domain.

A. COMPLEX EMI ENVIRONMENT
Interfering signals and noise always coexist with FH signals in an EMI environment. These interferences typically consist of three categories: fixed-frequency, linear-frequency modulation (LFM), and burst signals. Fixed-frequency signals such as voice broadcasts and TV signals typically have a permanent frequency and usually last for seconds. LFM signals, also known as chirp signals, are signals whose frequency varies linearly with time and whose appearance is typically periodic, such as radar signals. In contrast, burst signals are random and irregular with fixed frequencies and short durations. These interferences and background noise constitute the main obstacles to identifying FH signals [52].

B. MODELING
Assume that, during the observation time $T_0$, the receiver observes a mixed signal $r(t)$ consisting of $M$ FH signals $c_m(t)$, $H$ chirp signals $f_h(t)$, $L$ fixed-frequency signals $g_l(t)$, $Q$ burst signals $j_q(t)$, and noise $v(t)$, denoted as

$$r(t) = \sum_{m=1}^{M} c_m(t) + \sum_{h=1}^{H} f_h(t) + \sum_{l=1}^{L} g_l(t) + \sum_{q=1}^{Q} j_q(t) + v(t), \tag{1}$$

where $v(t)$ represents additive white Gaussian noise (AWGN) with zero mean and variance $\sigma^2$.
Given that the $m$th FH signal has a hopping period of $T_m$, there are $B$ hops in the observation time $T_0$. The carrier frequency of the $b$th hop is denoted as $f_b^m$. Because the start time of the first hop cannot be determined and the first hop may be incomplete, its duration is generally taken as $\alpha T_m$ with $0 < \alpha < 1$. Then, $c_m(t)$ can be expressed as the sequence of these $B$ hops, each transmitted at its carrier frequency $f_b^m$. From (1), it can be seen that there are no correlations between the FH signals and the other signals; thus, all signals other than the FH signals can be unified as the additive interference $n(t)$, represented as

$$n(t) = \sum_{h=1}^{H} f_h(t) + \sum_{l=1}^{L} g_l(t) + \sum_{q=1}^{Q} j_q(t) + v(t). \tag{3}$$

From (1) and (3), we have

$$r(t) = \sum_{m=1}^{M} c_m(t) + n(t). \tag{4}$$
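As an illustration of the signal model above, the following minimal Python sketch synthesizes a real-valued mixture of one FH signal, one chirp, one fixed-frequency tone, one burst, and AWGN. All numerical values (sampling rate, carrier list, interference frequencies) and all function names are illustrative assumptions rather than the settings of this study, and the SNR is assumed to be the FH signal power divided by the noise variance.

```python
import math
import random


def hop_index(t, hop_period, alpha):
    """Index of the hop active at time t; the first hop lasts alpha * hop_period."""
    first = alpha * hop_period
    if t < first:
        return 0
    return 1 + int((t - first) // hop_period)


def fh_sample(t, hop_freqs, hop_period, alpha):
    """One sample of a unit-amplitude FH tone (the carrier list is cycled if too short)."""
    f = hop_freqs[hop_index(t, hop_period, alpha) % len(hop_freqs)]
    return math.cos(2 * math.pi * f * t)


def chirp_sample(t, f0, f1, period):
    """Periodic LFM: the frequency sweeps linearly from f0 to f1 every `period` seconds."""
    tau = t % period
    slope = (f1 - f0) / period
    return math.cos(2 * math.pi * (f0 * tau + 0.5 * slope * tau * tau))


def burst_sample(t, f, start, duration):
    """Fixed-frequency tone that is present only in [start, start + duration)."""
    return math.cos(2 * math.pi * f * t) if start <= t < start + duration else 0.0


def mixed_signal(fs=40000.0, t_obs=0.08, snr_db=0.0, seed=0):
    """Sampled r(t) = FH + chirp + fixed-frequency + burst + AWGN."""
    rng = random.Random(seed)
    n = int(round(fs * t_obs))
    hop_freqs = [3000.0, 9000.0, 5000.0, 12000.0, 7000.0, 14000.0]  # illustrative
    fh, interference = [], []
    for i in range(n):
        t = i / fs
        fh.append(fh_sample(t, hop_freqs, hop_period=0.016, alpha=0.5))
        interference.append(
            chirp_sample(t, 15000.0, 18000.0, 0.02)
            + math.cos(2 * math.pi * 1000.0 * t)
            + burst_sample(t, 10000.0, 0.03, 0.004)
        )
    # Assumption: SNR = FH signal power / noise variance.
    fh_power = sum(c * c for c in fh) / n
    sigma = math.sqrt(fh_power / 10 ** (snr_db / 10))
    return [c + u + rng.gauss(0.0, sigma) for c, u in zip(fh, interference)]
```

Calling `mixed_signal(snr_db=-5.0)` returns one noisy realization; changing `seed` yields independent noise draws for Monte Carlo runs.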

III. METHOD
This section provides an overview of FH signal detection, covering spectrogram generation, deep network image detection, and parameter estimation using the detection outputs.

A. SCHEME OVERVIEW
The structure of the proposed scheme is illustrated in Fig. 1. The time-domain input signal is converted into a spectrogram via the STFT. The spectrogram is then fed into the CenterNet as an image for detection, yielding the center point position and bounding box (Bbox) of the FH signal. Finally, the frequencies and hopping period of the FH signal are estimated through inverse mapping based on the coordinates from the image.

B. STFT-BASED SPECTROGRAM GENERATION
The STFT is a Fourier transform with added windowing, where a time-domain signal is processed segment by segment with a sliding window and then converted into the TF domain as stacked spectra. The STFT of the signal $r(t)$ is defined as

$$\mathrm{STFT}_r(t, f) = \int_{-\infty}^{\infty} r(\tau)\, h^{*}(\tau - t)\, e^{-j 2 \pi f \tau}\, d\tau, \tag{5}$$

where $h(t)$ denotes a window function. Performing the STFT on both sides of (4) simultaneously yields

$$\mathrm{STFT}_r(t, f) = \sum_{m=1}^{M} \mathrm{STFT}_{c_m}(t, f) + \mathrm{STFT}_n(t, f). \tag{6}$$

To facilitate 2-D image processing, the TF values are arranged as a 2-D array over the TF grid. This representation makes it easier to detect the FH signal in a spectrogram, which is the process of detecting the FH components $\mathrm{STFT}_{c_m}$ within $\mathrm{STFT}_r$.

After converting the time-domain signal into a spectrogram using the STFT, the different signals can be easily distinguished from the various interferences. As shown in Fig. 2, the FH signal appears as a short horizontal line of consistent length in the spectrogram, whereas the fixed-frequency signal appears as a long horizontal line. The chirp signal appears as diagonal segments with the same slope, frequency band, and period, whereas the burst signal appears as a short horizontal line of random length. The random noise covers the entire spectrogram as irregular dots, and the lower the signal-to-noise ratio (SNR), the higher the energy and brightness of the dots in the spectrogram. According to the above analysis, detecting FH signals in a spectrogram is transformed into an image detection problem. The objective is to distinguish regular short horizontal lines from the other shapes in the image.
Current transmitters usually select a relatively clean band to avoid overlapping interference. Therefore, we assume that the FH signal does not overlap with the interfering signals in the spectrogram, which allows for the accurate detection and parameter estimation of the FH signal using the proposed scheme.
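The spectrogram generation described in this subsection can be sketched in a few lines of dependency-free Python. This is only a minimal illustration using a direct DFT per frame: the 128-point Hamming window follows the experiment settings of this study, whereas the hop size of 64 samples, the test tone, and the function names are assumptions made for the example.

```python
import cmath
import math


def hamming(n):
    """Symmetric n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def stft_magnitude(x, win_len=128, hop=64):
    """One-sided |STFT| of a real sequence x, returned as a frames x bins list."""
    window = hamming(win_len)
    n_bins = win_len // 2 + 1
    # Precompute the DFT kernel for the non-negative frequency bins.
    kernel = [
        [cmath.exp(-2j * math.pi * k * n / win_len) for n in range(win_len)]
        for k in range(n_bins)
    ]
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        segment = [x[start + n] * window[n] for n in range(win_len)]
        frames.append([abs(sum(s * w for s, w in zip(segment, row))) for row in kernel])
    return frames


# Illustrative check: a 1600 Hz tone sampled at 12.8 kHz falls exactly on
# bin 1600 / (12800 / 128) = 16 of every frame.
fs = 12800.0
tone = [math.cos(2 * math.pi * 1600.0 * n / fs) for n in range(512)]
spec = stft_magnitude(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time slice of the spectrogram; stacking the rows and mapping the magnitudes to pixel intensities gives the image that is passed to the detector. A practical implementation would use an FFT instead of the direct DFT shown here.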

C. NETWORK ARCHITECTURE
The architecture of the CenterNet used in this study is illustrated in Fig. 3, which comprises two main components: a convolution-based backbone network and an upsampling network. The backbone network extracts features from the image, and the upsampling network merges these features  into feature maps. The feature maps are then fed into three different convolutional heads to produce the detection results of the FH signal. These three heads are responsible for the following: 1) Extracting the heat map of the features for classification; 2) Determining the height and width of the target Bbox; 3) Recording the offset between the center point and the target's actual position.
One advantage of CenterNet over traditional CNN-based networks (such as R-CNN and Faster R-CNN) is its insensitivity to the initial size and stretching of the image. Instead of traversing the image using numerous alternate anchors, CenterNet focuses solely on the center point position and anchor size, making the network design straightforward. In this study, the input image was preprocessed from 900 × 1200 to 512 × 512 pixels in the RGB format, which contains three channels.
The backbone network is typically selected from ResNet, DLA, and Hourglass [48]. In this study, we selected ResNet-50 as the backbone network, which improves the processing speed while ensuring learning accuracy and avoids the vanishing and exploding gradients caused by network deepening [53]. After feature extraction using the backbone network, the dimensions of the feature maps were 16 × 16 with 2048 channels. Subsequently, deconvolution was used to upsample the feature maps, yielding feature maps of 128 × 128 pixels with 64 channels from which the heat maps are obtained [42], [54].
The obtained feature maps are duplicated into three parallel heads to output parameters. First, the convolution depth is equal to the number of desired classifications. As the FH signal is the unique class in this study, the convolution depth of the head was 1. Second, because the Bbox is determined by the width and height, the convolution depth was 2. Finally, to compensate for the offset during feature extraction, a finetuning of the centroid position is performed with a convolution depth of 2. The specific architecture of the detection network is presented in Table 1.
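The tensor dimensions quoted above can be cross-checked with a short shape calculation. The sketch below is only bookkeeping, not a network implementation; it assumes a backbone stride of 32 and three deconvolution stages that each double the spatial size, which is one configuration consistent with the 16 × 16 and 128 × 128 feature maps reported here. The function and key names are hypothetical.

```python
def centernet_shapes(input_size=512, backbone_stride=32, backbone_channels=2048,
                     num_deconv=3, upsampled_channels=64, num_classes=1):
    """Return (channels, height, width) of the main tensors in the pipeline."""
    side = input_size // backbone_stride          # 512 / 32 = 16
    shapes = {"backbone": (backbone_channels, side, side)}
    for _ in range(num_deconv):                   # assumed: each stage doubles H and W
        side *= 2
    shapes["upsampled"] = (upsampled_channels, side, side)
    shapes["heatmap_head"] = (num_classes, side, side)   # one channel per class
    shapes["size_head"] = (2, side, side)                # Bbox width and height
    shapes["offset_head"] = (2, side, side)              # center offset in x and y
    shapes["output_stride"] = input_size // side
    return shapes


shapes = centernet_shapes()
```

With the default arguments, the resulting output stride is 512 / 128 = 4, which matches the stride S used for the keypoint heat map in the next subsection.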

D. TRAINING AND PREDICTION OF THE SPECTROGRAMS

1) POSITION PREDICTION OF THE FH SIGNAL
The keypoint prediction of the FH signal follows [55]. Given a particular FH signal $c$ in the input image $I \in \mathbb{R}^{W \times H \times 3}$ with width $W$ and height $H$, the ground-truth keypoint heat map is $Y \in [0, 1]^{\frac{W}{S} \times \frac{H}{S} \times C}$. The keypoint position $p \in \mathbb{R}^2$ is scaled by the output stride to $\tilde{p} = \lfloor p / S \rfloor$ and is dispersed on the heat map by the Gaussian kernel

$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2 \sigma_p^2}\right). \tag{8}$$

The output scaling stride $S$ was set to 4 according to the network structure, and the number of classes $C$ was set to 1 in this study. The variance $\sigma_p^2$ of the heat map is calculated from the radius $r$, which is determined using the IoU settings.
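Let the corresponding keypoint heat map $\hat{Y} \in [0, 1]^{\frac{W}{S} \times \frac{H}{S} \times C}$ be given by the network prediction, whose position is $\hat{p} \in \mathbb{R}^2$. We use $\hat{Y}_{xyc} = 1$ to denote that a keypoint is detected, whereas $\hat{Y}_{xyc} = 0$ denotes the background. Moreover, we use the focal loss for logistic regression, and the loss function can be expressed as

$$L_k = -\frac{1}{M} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}), & Y_{xyc} = 1, \\ (1 - Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1 - \hat{Y}_{xyc}), & \text{otherwise}, \end{cases} \tag{9}$$

where $\alpha$ and $\beta$ are the hyperparameters of the focal loss. In this study, we take $\alpha = 2$ and $\beta = 4$. $M$ is the number of keypoints in an image and is used to normalize the focal loss.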
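In addition, for the head used for offset prediction, let $\hat{O} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 2}$ be the offset prediction shared by all classes, with offset coordinates $(\delta\hat{x}, \delta\hat{y})$. Using the L1 loss for training, we have

$$L_{off} = \frac{1}{M} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left( \frac{p}{S} - \tilde{p} \right) \right|. \tag{10}$$

Let $(x_1^{(m)}, y_1^{(m)}, x_2^{(m)}, y_2^{(m)})$ denote the Bbox of the $m$th target. The center point $p_m$ and the size of the Bbox $s_m = (w^{(m)}, h^{(m)})$ are

$$p_m = \left( \frac{x_1^{(m)} + x_2^{(m)}}{2}, \frac{y_1^{(m)} + y_2^{(m)}}{2} \right) \tag{11}$$

and

$$s_m = \left( x_2^{(m)} - x_1^{(m)},\; y_2^{(m)} - y_1^{(m)} \right), \tag{12}$$

respectively. Let the size prediction be $\hat{S} \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times 2}$, and use the L1 loss again for training. Then we have

$$L_{size} = \frac{1}{M} \sum_{m=1}^{M} \left| \hat{S}_{p_m} - s_m \right|. \tag{13}$$

Overall, the loss for the entire training can be expressed as

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}, \tag{14}$$

where the coordinates use the pixel positions in the original image. The weight assigned to the size term of the loss was $\lambda_{size} = 0.1$. Finally, the Bbox predicted by the network can be expressed as

$$\left( \hat{x} + \delta\hat{x} - \frac{\hat{w}}{2},\; \hat{y} + \delta\hat{y} - \frac{\hat{h}}{2},\; \hat{x} + \delta\hat{x} + \frac{\hat{w}}{2},\; \hat{y} + \delta\hat{y} + \frac{\hat{h}}{2} \right), \tag{15}$$

where $(\hat{x}, \hat{y})$ is the predicted center position $\hat{p}$, and $\hat{w}$ and $\hat{h}$ denote the width and height of the Bbox, respectively.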
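To make the training targets and the decoding step more concrete, the following minimal Python sketch renders a Gaussian keypoint heat map, evaluates a penalty-reduced focal loss with α = 2 and β = 4, and decodes a Bbox from a center, an offset, and a size. It assumes the formulation commonly used in the CenterNet literature and operates on plain nested lists for a single class; the function names are hypothetical, and it is not the training code of this study.

```python
import math


def gaussian_heatmap(width, height, center, sigma):
    """Ground-truth heat map with a Gaussian bump at `center` = (cx, cy)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss, normalized by the number of keypoints."""
    loss, num_keypoints = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1 - eps)      # keep log() finite
            if y == 1.0:                       # keypoint location
                loss -= (1 - p) ** alpha * math.log(p)
                num_keypoints += 1
            else:                              # background, down-weighted near keypoints
                loss -= (1 - y) ** beta * p ** alpha * math.log(1 - p)
    return loss / max(num_keypoints, 1)


def decode_bbox(x, y, dx, dy, w, h):
    """Bbox (x1, y1, x2, y2) from center (x, y), offset (dx, dy), and size (w, h)."""
    cx, cy = x + dx, y + dy
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


target = gaussian_heatmap(8, 8, center=(3, 4), sigma=1.0)
sharp = [[1.0 if v == 1.0 else 0.0 for v in row] for row in target]   # ideal prediction
flat = [[0.5] * 8 for _ in range(8)]                                  # uninformative prediction
loss_sharp = focal_loss(sharp, target)
loss_flat = focal_loss(flat, target)
```

In this toy case, the ideal prediction gives a loss close to zero, whereas the uninformative prediction is penalized at both the keypoint and the background pixels, which is the behavior the focal loss is meant to provide.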

E. PARAMETER ESTIMATION OF THE FH SIGNAL
The FH signal $c_m$ is obtained from the network prediction, and its image representation is displayed over the frequency range between the lower limit $F_{min}$ and the upper limit $F_{max}$. An inverse mapping from the image to the signal converts the predicted pixel coordinates into time and frequency values, from which the hopping period $T_m$ and the carrier frequency $f_b^m$ of $c_m$ are obtained. When there is only one FH signal in the environment, the hop duration of that FH signal within a single observation window can also be estimated.
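One possible form of the inverse mapping is sketched below in Python. It rests on assumptions that are not taken from this study: both axes are linear, the spectrogram fills the whole image, time runs from 0 to T_0 from left to right, and frequency runs from F_max at the top row to F_min at the bottom row. The function names and the numerical values in the example are purely illustrative.

```python
def pixel_to_tf(x, y, img_w, img_h, t_obs, f_min, f_max):
    """Map a pixel (x, y) to (time, frequency) under the linear-axis assumption."""
    t = x / img_w * t_obs
    f = f_max - y / img_h * (f_max - f_min)   # image rows grow downward
    return t, f


def estimate_hop(bbox, img_w, img_h, t_obs, f_min, f_max):
    """Estimate start time, duration, and carrier frequency of one detected hop."""
    x1, y1, x2, y2 = bbox
    t_start, _ = pixel_to_tf(x1, 0.0, img_w, img_h, t_obs, f_min, f_max)
    t_end, _ = pixel_to_tf(x2, 0.0, img_w, img_h, t_obs, f_min, f_max)
    _, carrier = pixel_to_tf(0.0, (y1 + y2) / 2, img_w, img_h, t_obs, f_min, f_max)
    return {"start": t_start, "duration": t_end - t_start, "carrier": carrier}


# Illustrative values: a 512 x 512 image covering 0.08 s and 0 to 20 kHz.
hop = estimate_hop((128, 240, 256, 272), 512, 512, 0.08, 0.0, 20000.0)
```

Under these assumptions, the width of a Bbox that covers a complete hop gives the hopping period, and the vertical center of the Bbox gives the carrier frequency. If the image contains axis margins or uses the opposite frequency orientation, the mapping has to be adjusted accordingly.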

IV. NUMERICAL EXPERIMENTS AND ANALYSIS
In this section, numerical results are used to demonstrate the effectiveness of the proposed scheme.

A. EXPERIMENT SETTINGS
In this study, the carrier signal operates within the very-low frequency (VLF) band of the International Radio Consultative Committee (CCIR) and the long-wave (LW) band of the International Special Committee on Radio Interference CISPR-25. This band is well suited for air-sea integrated communications and navigation because its electromagnetic waves propagate over long distances, allowing for communication over several kilometers. Furthermore, the band is highly stable, providing robustness against day-night variation, weather, climate, cosmic rays, and other interferences. Additionally, the band penetrates water well, with less attenuation at depths of up to 100 m. However, the VLF band still faces complex EMI issues, particularly in FH communication. Chirp signals, which are commonly used in maritime radar systems, may interfere with VLF band communication, as may the fixed-frequency radiation emitted by electrical devices such as motors and engine blades on aircraft, ships, and offshore platforms. Furthermore, the burst emissions of maritime lightning and ionospheric fluctuations can also affect VLF band communication. Therefore, identifying FH signals in such complex EMI is of great significance.
This study focused on baseband signal processing to simplify the analysis. The SNR of the AWGN ranged from -10 dB to 10 dB, with an observation time of 0.08 s and a 128-point Hamming window for the STFT. The parameters of the FH, fixed-frequency, chirp, and burst signals present in the complex EMI are listed in Table 2. For statistical convenience and accuracy, Monte Carlo samples used to assess the robustness of an algorithm are usually generated at discrete SNRs with a step of 1 dB, and a sufficient number of samples is required at each SNR to characterize the prediction results reliably. Therefore, this study generated 500 Monte Carlo samples for each SNR value and each hopping period, with the hopping period of the FH signal ranging from 0.01 s to 0.02 s in steps of 0.002 s, resulting in a dataset of 63,000 spectrograms.
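The dataset size can be reproduced with a short calculation. The sketch below assumes that the 500 Monte Carlo samples are drawn for every combination of SNR value and hopping period, which is the reading consistent with the stated total; the variable names are illustrative.

```python
# SNR grid: -10 dB to 10 dB in 1 dB steps (21 values).
snr_values_db = list(range(-10, 11))

# Hopping periods: 0.010 s to 0.020 s in 0.002 s steps (6 values).
hop_periods_s = [round(0.010 + 0.002 * k, 3) for k in range(6)]

samples_per_setting = 500
total_spectrograms = samples_per_setting * len(snr_values_db) * len(hop_periods_s)
print(total_spectrograms)  # 500 * 21 * 6 = 63000
```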
For the CenterNet parameters, the dataset was split into training, validation, and test sets at 50%, 25%, and 25%, respectively. The model was pre-trained on MS COCO [56] for weight initialization, and the Adam optimizer with a momentum of 0.94 and a weight decay of 0.0001 was used for training. We used a staged learning-rate strategy to avoid poor convergence. The learning rate was initially set to 1 × 10^-4 with a batch size of 64 and was then decayed to 1 × 10^-5 after 50 epochs with a batch size of 16; training ended after 100 epochs.
Fig. 4 compares the spectrograms and contour plots of the STFT with those of the WVD and SPWVD for the same mixed signal at an SNR of 0 dB. Comparing Fig. 4(a) and 4(b) with Fig. 4(c) and 4(d) reveals that the WVD strongly concentrated the signal representation, as evidenced by the slim and bright outlines, and that the noise was effectively suppressed. However, owing to the presence of cross terms, the WVDs of the signal and the interference were not independent of each other, causing a severe TF overlap. Furthermore, because the WVD is the Fourier transform of the autocorrelation function, the representation is no longer linear in the signal, resulting in a more confusing TFR and making it challenging to achieve signal detection using image processing algorithms. Comparing Fig. 4(c) and 4(d) with Fig. 4(e) and 4(f), we can see that the SPWVD reduced the TF aggregation and overlap. Nevertheless, because the cross terms remain, the TF overlap generated substantial interference and few correct outlines, in contrast to the STFT. These results confirm the clear advantage of the STFT in this scheme.

2) RESULTS OBSERVATION OF THE PROPOSED SCHEME
To illustrate the procedure of the proposed scheme more intuitively, we showcase the detection outputs of the FH signal in a mixed signal with AWGN at an SNR of 5 dB, as shown in Fig. 5. The mixed signal in the time domain is shown in Fig. 5(a). The 3-D mesh in the TF domain after applying STFT highlights the concentration of the signal energy, as shown in Fig. 5(b). After 2-D processing to obtain Fig. 5(c), the FH signal can be visually identified from other interferences with the energy distribution. Finally, the Bbox coordinates of the FH hops are revealed in the image through CenterNet, as shown in Fig. 5(d). After the screening, an inverse mapping is performed to derive the frequency and hopping period estimations of the FH signal.

3) NETWORK TRAINING PERFORMANCE
In Fig. 6, we present the total loss and validation loss of our DL architecture as functions of the training epoch. Both losses decreased smoothly as the number of training epochs increased, indicating that the architecture parameters were well selected without overfitting. Moreover, the losses decreased steadily between epochs 30 and 50 and dropped sharply again when the learning rate was adjusted to 1 × 10^-5 at the 50th epoch, demonstrating an appropriate setting. Finally, the losses flattened again after 90 epochs at the order of 10^-1, indicating quick convergence and appropriate fitting of the architecture.

4) PERFORMANCE COMPARISON OF THE DL ARCHITECTURES
In this part, we conducted two experiments to compare the performance of CenterNet with YOLOv7 [47] and YOLOX [46]. The mean average precision (mAP), $AP_{50}^{test}$, and $AP_{75}^{test}$ obtained from training and testing these three DL architectures on the dataset generated in Section IV-A are summarized in Table 3. As can be seen, the average precision (AP) of all architectures is significantly higher than on MS COCO [56], with APs exceeding 90%. This result originates from continuing to train the pre-trained model for the dedicated application in this study, which improves the generalization capability of the network parameters for FH signal detection. In addition, CenterNet obtained 99.25% mAP, 99.91% $AP_{50}^{test}$, and 99.75% $AP_{75}^{test}$, outperforming YOLOv7 by 6.31%, 0.01%, and 0.03% and YOLOX by 8.39%, 0.03%, and 1.21%, respectively. This is because YOLOv7 and YOLOX are both anchor-based architectures that require anchors of different aspect ratios to traverse the image. In contrast, CenterNet is an anchor-free architecture that uses a heat map of the features to represent the probability of the center point location, thus obtaining a better representation of the features of the object. On the other hand, YOLOv7 and YOLOX primarily concentrate on optimization for small-target detection, and their improvement in regular-target detection is comparatively limited. For the application in this study, the FH signal will not appear as a small target in the spectrogram as long as the TF scale is set appropriately; thus, the performance is generally satisfactory.
Furthermore, we compared the mAPs of the three DL architectures under different SNRs, as shown in Fig. 7. As can be seen, the mAP of all architectures gradually increased as the SNR increased. The mAP of CenterNet exceeded 90% when the SNR reached -1 dB, whereas the mAPs of both YOLOv7 and YOLOX exceeded 88% when the SNR reached 3 dB. Subsequently, the mAPs of all the architectures converged. It was also evident that the mAPs of YOLOv7 and YOLOX were close to each other under all SNRs, whereas the mAP of CenterNet was consistently higher by at least 2%. Thus, we can conclude that CenterNet outperforms YOLOv7 and YOLOX in terms of prediction performance under all SNRs.

5) PERFORMANCE IMPACT OF THE IOU
The key to detecting a signal using image detection is correctly identifying the target representation in the image. Therefore, the next step is to verify the detection accuracy of the proposed method. Here, we chose the IoU as the criterion and used the miss probability as an index of recognition accuracy. The miss probability $P_{miss}$ is defined through the ratio of the number of detected FH hops $N_{detect}$ to the total number of FH hops $N_{total}$, i.e., $P_{miss} = 1 - N_{detect} / N_{total}$.
The SNR-miss probability graph for different IoU thresholds is shown in Fig. 8. We observed that if the IoU threshold was too large, the miss probability also became significant. As the IoU threshold decreased, the miss probability decreased as well, and it almost stopped changing once the threshold was below 0.6. These results can be explained by the fact that, although the Bbox of a detected FH hop is close to the ground truth, there may still be some error in its exact position. Thus, setting the IoU threshold between 0.3 and 0.6 ensures the reliability of the detection results while avoiding the discarding of approximately correct predictions, which an overly high threshold would cause at the expense of detection accuracy.
In addition, taking IoU = 0.5 as an example, the miss probability decreased as the SNR increased. When the SNR exceeded -7 dB, the miss probability dropped to 0, demonstrating that 100% of the FH hops were detected. Even at an SNR of -10 dB, the miss probability remained within 2%, indicating that the proposed scheme can accurately detect FH signals in complex EMI environments with a low SNR.
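The IoU criterion and the miss probability discussed above can be sketched as follows. This minimal Python example counts a ground-truth hop as detected if any predicted Bbox reaches the IoU threshold, which is a simplifying assumption (a stricter evaluation would match predictions to hops one-to-one); the function names and the boxes in the example are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def miss_probability(true_boxes, pred_boxes, iou_threshold=0.5):
    """P_miss = 1 - N_detect / N_total for one image or a pooled set of hops."""
    detected = sum(
        1 for g in true_boxes
        if any(iou(g, p) >= iou_threshold for p in pred_boxes)
    )
    return 1.0 - detected / len(true_boxes)


# Two true hops, one slightly oversized prediction covering the first hop only.
truth = [(0, 0, 10, 2), (20, 5, 30, 7)]
preds = [(0, 0, 10, 2.2)]
p_miss_loose = miss_probability(truth, preds, iou_threshold=0.5)
p_miss_strict = miss_probability(truth, preds, iou_threshold=0.95)
```

In this example, the prediction overlaps the first hop with an IoU of about 0.91, so it counts as a detection at a threshold of 0.5 but is discarded at 0.95, illustrating how a high threshold raises the miss probability.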
Furthermore, we evaluated the estimator using the root mean squared error (RMSE) metric and present the results in Fig. 9. Fig. 9(a) and 9(b) illustrate that the RMSEs of the frequency and hopping period estimates reached the orders of 10^-3 and 10^-4, respectively. The RMSE of the frequency tended to stabilize after a slight decrease as the SNR increased, converging after the SNR reached -1 dB. On the other hand, the RMSE of the hopping period decreased as the SNR increased and converged after the SNR reached -9 dB, indicating that the proposed scheme achieves desirable estimation accuracy. Furthermore, increasing the IoU threshold decreases the RMSE because a higher threshold implies a stricter screening of the prediction results, which retains only Bboxes closer to the ground truth. However, combined with the analysis of Fig. 8, the reduction in estimation error comes at the cost of recognition accuracy, verifying that an IoU threshold between 0.3 and 0.6 is a suitable performance compromise.
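For reference, the RMSE over a set of Monte Carlo estimates can be computed as in the minimal sketch below. Whether the errors in Fig. 9 are normalized before averaging is not specified here, so the example uses raw errors; the function name and the numbers are illustrative.

```python
import math


def rmse(estimates, truths):
    """Root mean squared error between paired estimates and true values."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Hopping-period estimates (in seconds) against a true period of 0.016 s.
period_rmse = rmse([0.0161, 0.0159, 0.0160], [0.016, 0.016, 0.016])
```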

6) PERFORMANCE IMPACT OF THE HOPPING PERIOD
In this part, we compare the miss probability and the RMSE of the proposed scheme at different hopping periods. The results for hopping periods of 0.01 s, 0.016 s, and 0.02 s at an IoU of 0.5 are shown in Fig. 10. As can be seen from Fig. 10, the three curves overlap one another without any regular trend, indicating that changing the hopping period does not affect the experimental results. The statistics for the other hopping periods lead to the same conclusion; therefore, Fig. 10 presents only these three hopping periods to maintain conciseness and to facilitate comparison.

V. CONCLUSION
This paper presents a novel scheme for estimating FH signals in complex EMI environments by combining STFT analysis and CenterNet image detection. The optimal parameters of CenterNet were obtained to ensure the detection accuracy and estimation precision. Simulations showed promising results in detecting and estimating the FH signal in complex EMI at a low SNR, providing a reference for designing and optimizing other DL-based image detection algorithms for detecting and estimating FH signals. Future work will focus on exploring the effects of STFT window size, discovering other TFA algorithms, fingerprinting multiple FH transmitters, detecting interfering signals, analyzing the minimum size of the dataset for model training, and designing adaptive spectrum switching or FH code schemes to mitigate the effects of active or passive interference. The performance of the proposed scheme under more complex background noise requires further exploration. Moreover, the current performance evaluation focuses mainly on the accuracy of the offline scheme, and we will develop an online version in the future to comprehensively evaluate its real-time performance.