Moving Point Target Detection Based on Temporal Transient Disturbance Learning in Low SNR

Abstract: Moving target detection in optical remote sensing is important for satellite surveillance and space target monitoring. Here, we propose a new moving point target detection framework for low signal-to-noise ratio (SNR) conditions that uses an end-to-end network (1D-ResNet) to learn the distribution features of transient disturbances in the temporal profile (TP) formed by a target passing through a pixel. First, we converted the detection of a point target in the image into the detection of a transient disturbance in the TP and established mathematical models of the different TP types. Then, according to the established TP models, we generated a simulated TP dataset to train the 1D-ResNet. In 1D-ResNet, the CBR-1D (Conv1D, BatchNormalization, ReLU) structure was designed to extract the features of the transient disturbance. As the transient disturbance is very weak, we used several skip connections to prevent the loss of features in the deep layers. After the backbone, two LBR (Linear, BatchNormalization, ReLU) modules were used for further feature extraction, to classify the TP, and to identify the location of the transient disturbance. A multitask weighted loss function was proposed to ensure training convergence. Extensive experiments showed that this method effectively detects moving point targets with a low SNR and achieves the highest detection rate and the lowest false alarm rate compared to other benchmark methods. Our method also has the best detection efficiency.


Introduction
The detection of moving targets has important applications in security monitoring, military reconnaissance, and satellite detection [1][2][3]. In some scenarios, such as early warning against space debris [4] and small faint bodies in near-Earth space or against naval ships and fighters, optical remote sensing detection has the characteristics of long distance and large field of view [5]. In this condition, the fast-moving target is more like a point in the image. The point target does not have shape, size, texture, or other spatial information and may even be submerged in background and clutter, resulting in a very low space-time signal-to-noise ratio (SNR) of the target and making it difficult to detect. Therefore, the problem of moving target detection in optical remote sensing images at a long distance and under a large field of view can be transformed into the problem of moving point target detection under a low SNR, which is important for effective detection.
There are currently three detection methods based on the temporal and spatial features of moving point targets: spatial-based detection, temporal-based detection, and spatiotemporal-based detection.

Spatial-Based Detection Methods
Spatial-based detection mainly realizes detection by enhancing small targets and suppressing the background or by converting the detection problem into an optimization problem of separating sparse and low-rank matrices. For example, the top-hat algorithm first applies a morphological opening to the image to estimate the background and then subtracts the background from the original image to obtain small targets [6]. The max-mean filter and max-median filter suppress clutter by filtering in four directions and then subtracting the background to obtain candidate targets [7]. Local contrast measure (LCM) and its improved algorithms, such as MPCM, HWLCM, MLCM-LEF, and WVCLCM, use local contrast information to enhance the point target and suppress the background [8][9][10][11][12]. In contrast to the abovementioned methods, IPI-based methods use the non-local self-correlation property of the background to transform the small target detection problem into an optimization problem of recovering low-rank and sparse matrices and use principal component pursuit to solve it [13][14][15][16]. Xia et al. considered both the global sparsity and local contrast of small targets and proposed a modified graph Laplacian model (MGLM) with local contrast and consistency constraints [17]. Because a point target with a low SNR lacks effective spatial information, the above methods cannot separate the target from the background.
In recent years, with the development of deep learning, point target detection algorithms based on convolutional neural networks have emerged in large numbers, including ALCNet, GLFM, ISTDU, ISTNet, MLCL, and APANet [18][19][20][21][22][23]. The principles of these CNN-based methods are largely similar to those of traditional methods: multilayer neural networks are used to enhance the point targets, suppress the background, and box the target position. Although CNN-based methods have improved the feature extraction ability, they still cannot achieve excellent detection for low-SNR point targets lacking spatial information. In addition, the track of the target cannot be obtained by detecting a single image; in early warning systems, it is still necessary to detect image sequences. Because CNN-based detection methods take a long time to process image sequences, they are inefficient.

Temporal-Based Detection Methods
Temporal-based detection refers to the detection of image sequences using the target's movement information in the temporal domain, such as optical flow [24], temporal difference [25], dynamic background modeling (DBM) [26,27], and track before detect (TBD) [28]. Optical flow uses the correlation between adjacent frames in the image sequence and the changes of pixels over time to find the corresponding relationship between moving targets in the frames in order to calculate the motion information of moving targets. This method assumes that the brightness of the target is constant, that the motion between adjacent frames is derivable, and that the motion of adjacent pixels is similar. There are numerous constraints and few scenes that satisfy these assumptions. In addition, the optical flow method is time-consuming and struggles to meet real-time requirements. The temporal difference method makes use of the gradual change of the background in the image sequence to directly identify differences in the adjacent frames. If there are moving targets in the sequence, this will lead to a large difference in the intensity of the adjacent frames. However, the temporal difference is sensitive to background noise and has a poor detection effect for point targets with a low SNR. The DBM models the background in the image sequence and determines whether a pixel belongs to the foreground or background according to the established model to segment the moving target. The detection performance of this method depends on the modeling accuracy. It is difficult to distinguish a moving point target from the background under a low SNR, and the target is easily misjudged as background; thus, this method's robustness is poor. TBD is a commonly used algorithm for detecting the traces of small moving targets. This algorithm accumulates multiple frames, searches for every possible trace of targets, and finally decides on the searched trace.
Therefore, it does not need to detect every single image, but it directly outputs the target's motion trace. However, this method requires excessive time to search. Moreover, if the target is weak, the target cannot be found effectively.
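The temporal difference method described above can be sketched in a few lines; the threshold and toy sequence here are illustrative, not settings from the paper.

```python
import numpy as np

def temporal_difference(frames, threshold):
    """Flag pixels whose intensity changes sharply between adjacent frames.

    frames: array of shape (K, H, W); returns a boolean mask of shape (K-1, H, W).
    """
    diff = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return diff > threshold

# Toy sequence: a bright point moves one pixel per frame.
frames = np.zeros((3, 4, 4))
frames[0, 1, 1] = frames[1, 2, 2] = frames[2, 3, 3] = 10.0
mask = temporal_difference(frames, threshold=5.0)
# Each difference image flags both the old and the new target position,
# which is why this method is sensitive to noise for weak targets.
```

This also illustrates the method's stated weakness: any pixel-level noise above the threshold is flagged just like the target.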

Spatiotemporal-Based Detection Methods
Researchers have proposed spatiotemporal-based detection methods that combine spatial and temporal information. For example, Zhang et al. proposed a three-dimensional filtering detection method, which takes a segment of an image sequence as the input and uses multiple matching filters to suppress the background in order to ultimately obtain point targets [29,30]. Deng et al. proposed a filtering method based on spatiotemporal local contrast, which calculates spatial and temporal local contrast, respectively, and then performs filtering mapping on spatiotemporal local contrast to obtain detection results [31]. Lin et al. used the Pearson correlation coefficient to suppress the background in the time-domain window and then used the target detection algorithm based on the regional gray level to suppress the residual background and finally obtained a target motion track [32]. Zhu et al. filtered the frame first, then detected the frame's gradient to obtain the candidate targets, and finally supplemented local contrast information in temporal terms for spatiotemporal joint judgment. Yan et al. used the top-hat algorithm to separate small targets from the background, a grid-based density peak search algorithm and gray area growth algorithm to identify false alarm points, and an improved KCF algorithm to achieve target tracking for continuous frames [33]. These algorithms use the spatiotemporal information of point targets to improve the detection effect, but their assumptions on small targets are too strong and require considerable prior information.

TP-Based Detection Methods
These temporal-based or spatiotemporal-based methods only use a few frames and do not fully use the temporal information of the target, and so they do not exhibit good detection performance. Under the observation condition of staring imaging, the intensity change of a single pixel in the image sequence over time can be regarded as a profile. If a target passes a pixel, it will produce a transient disturbance in the temporal profile (TP) of that pixel. If the transient disturbance can be detected, the target will be detected. Thus, the point target detection in an image can be converted into the detection of transient disturbances in the TP. Methods based on TP have been proposed. Liu et al. estimated the background signal from the original TP and then subtracted it to obtain the target signal [34,35]. Subsequently, Liu et al. performed the nonlinear adaptive filtering of TP to extract the target signal [36]. Recently, Liu et al. used FFT and KL to calculate the similarity between the TP and waveform to detect the target signal [37]. Niu et al. proposed detection methods based on statistical distribution distance involving high-frame-rate detection [38][39][40]. These methods are effective for TPs with a high SNR, but for TPs with a low SNR, the target signal cannot be separated from the background signal, and the time when the target appears in the TP cannot be identified.
The transient disturbance formed by the target in a pixel's TP can be regarded as a pattern that can be recognized by a CNN-1D. Therefore, to overcome the problems of the previous methods and achieve effective moving point target detection under a low SNR, we propose a detection framework based on transient disturbance distribution feature learning. The framework takes the image sequence as the input and directly outputs the track of the point target.
The main contributions of our work are as follows: We formulated the TP formed by pixels and generated a simulation dataset according to the TP formula. By combining the simulation data and real-world data, a training and verification dataset suitable for the study of moving point target detection with a low SNR was generated. Compared to other spatial-based and temporal-based methods, the proposed method exhibits the best performance in terms of detection rate, false alarm rate, and computing efficiency. The biggest advantage of our method is its excellent detection performance under extremely low SNRs.
The remainder of this paper is organized as follows. Section 2 analyzes the components of the TP and establishes mathematical models for each part. The mathematical expressions for the target TP, background TP, and clutter TP are presented in Section 2. Section 3 details the moving point target detection framework, including the network architectures, model training, and the entire detection process. Section 4 presents the experimental scheme and results. We designed experiments based on four aspects and compared our method with other benchmark methods on test sequences. Section 5 discusses our method in detail and compares it with other methods, followed by network ablation experiments and visualization studies. Section 6 presents the conclusions of this study.

The Components of the Temporal Profile
Under the condition of staring imaging, each pixel in the image forms a TP, which tracks the change in the pixel's intensity value over time. Each TP is different; what matters most is the transient disturbance formed by the target passing through the pixel. Therefore, all TPs can be divided into two categories: background TP and target TP [34]. The TP of any pixel under ideal clutter-free conditions can be described as follows:

TP_{i,j}(k) = t_{i,j}(k), k_1 ≤ k ≤ k_2;  TP_{i,j}(k) = b_{i,j}(k), otherwise

where t_{i,j} and b_{i,j} represent the distribution of the target TP and background TP, respectively; i and j represent the row and column index of the pixel in the image, respectively; k represents time; and k_1 and k_2 are the times when the target enters and leaves the pixel, respectively. The TP formation process is illustrated in Figure 1.
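As a minimal numerical sketch of this piecewise model (the function names and constant values here are hypothetical):

```python
import numpy as np

def pixel_tp(b, t, k1, k2, K):
    """Ideal clutter-free TP of one pixel: background distribution b(k)
    everywhere, replaced by the target distribution t(k) while the
    target covers the pixel (k1 <= k <= k2)."""
    k = np.arange(K)
    tp = b(k).astype(np.float64)
    inside = (k >= k1) & (k <= k2)
    tp[inside] = t(k[inside])
    return tp

# Toy profiles: constant background at 20, constant target level at 25.
tp = pixel_tp(b=lambda k: np.full_like(k, 20.0, dtype=float),
              t=lambda k: np.full_like(k, 25.0, dtype=float),
              k1=100, k2=120, K=512)
```

Outside [k1, k2] the profile is pure background; inside, it follows the target distribution, which is what the network later learns to recognize.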

The Background Temporal Profile
Under ideal clutter-free conditions, because the view of the detector is fixed, the background pixel intensity is constant for a short time, and the background TP can be considered a short-time stationary signal. However, in real image processing, the imaging results are affected by noise from different sources, including shot noise, thermal noise, photon noise, etc. In [35], additive white Gaussian noise (AWGN) was used to model these different noises. Thus, the actual background TP can be expressed as follows:

TP_{i,j}(k) = b_{i,j}(k) + n_{i,j}(k)

where n_{i,j}(k) represents the AWGN (AN).

The Target Temporal Profile
The TP of a target passing through a pixel can be regarded as a transient disturbance, and the following formula is used to describe the target TP:

t_{i,j}(k) = b_{i,j}(k) + s(k)

where s(k) represents the transient disturbance caused by the appearance of the target. The ideal imaging model of the optical system is pinhole imaging, and the light diffracts when mapping the object through the pinhole, forming a series of light-dark alternating diffraction rings. Therefore, a point in the real world becomes a circle with a certain radius after imaging. This phenomenon is described by a point spread function, which Pentland models with a two-dimensional Gaussian distribution [41], defined as follows:

s(x, y) = A · exp(−((x − x_0)^2 + (y − y_0)^2) / (2a^2))

where A represents the intensity of the target in the imaging, a represents the optical parameters of the detector, and (x_0, y_0) represents the center position of the target. When the point target passes through a pixel, the intensity of the pixel first increases and then decreases, and a bell-shaped transient disturbance appears on the TP of the pixel. The bell-shaped TP can be described by the following formula:

s(k) = A · exp(−(v(k − k_0))^2 / (2a^2))

where v is the moving speed of the target and k_0 = k_1 + (k_2 − k_1)/2 represents the time when the target center passes through the pixel. The formation process and specific shape of the target TP are shown in Figure 2. Because the size of the target is smaller than the imaging spatial resolution, the target cannot completely cover the background, and the intensity of the target in imaging is affected by the background. Therefore, the TP can be expressed as the background distribution plus the target distribution, as shown below:

TP_{i,j}(k) = b_{i,j}(k) + s(k) + n_{i,j}(k)

where n_{i,j}(k) is the distribution of AN.
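Assuming the bell-shaped TP has the Gaussian form A·exp(−(v(k − k_0))²/(2a²)) described in the text, a minimal sketch is as follows; all parameter values are illustrative.

```python
import numpy as np

def bell_signal(k, A, a, v, k0):
    """Bell-shaped transient disturbance: the 2-D Gaussian PSF swept
    across the pixel at speed v, centred at time k0 (one plausible
    reading of the formula in the text)."""
    return A * np.exp(-(v * (k - k0)) ** 2 / (2 * a ** 2))

k = np.arange(512, dtype=float)
s = bell_signal(k, A=8.0, a=1.0, v=0.1, k0=256.0)
# The disturbance peaks at k0 and decays symmetrically on both sides,
# matching the "first increases and then decreases" behaviour.
```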

The Framework of Temporal Transient Disturbance Learning
We used CNN-1D to detect transient disturbances formed by the target. Because the transient disturbance is extremely weak, the feature extraction of the transient disturbance is difficult and the extracted features are easily lost in the network. The skip connection can directly transfer the shallow feature to deeper layers so that the network can fully learn the distribution feature of the transient disturbance and achieve high-accuracy detection. The detection framework of our method is shown in Figure 3, which includes two modules: training and detection. In the training part, we first generated the simulated TP by adding a bell-shaped signal to the background signal. Noise was then added to the TP to simulate a real situation. Next, a training dataset containing 160,000 TPs was generated under the experimental parameters. Subsequently, the proposed networks, 1D-ResNet-8 and 1D-ResNet-16, were trained under the same hyperparameter settings. In the detection part, the trained model was used to detect the transient disturbance in the TPs formed by pixels in order to detect the moving track of the point target.

Architectures of 1D-ResNet
There are two tasks for detecting transient disturbances in a TP. One involves classifying the target TP containing a bell-shaped signal and the background TP. The other involves obtaining information on transient disturbances, such as the time of occurrence and the duration of the bell-shaped signal. Therefore, for these two detection tasks, inspired by classical ResNet and Darknet, we used one-dimensional ResNet as the backbone feature extraction network and CBR-1D as the basic feature extraction unit [42,43] to propose the 1D-ResNet. To verify the impact of the network layers on the detection performance, 1D-ResNet-8 and 1D-ResNet-16 were designed. The architectures of these 1D-ResNets are shown in Figure 4.

Both networks are composed of an input, backbone, neck, and output. A TP with a size of 512 × 1 is the input of the network. The backbone extracts the features of the TP. The neck connects the backbone and the output and provides higher-dimensional features for the output. Finally, three outputs are obtained: if a bell-shaped signal exists, the outputs are the class of the TP, the center position, and the size of the bell-shaped signal; otherwise, all three outputs are zero.
In the training network stage, the TP class is easy to identify, as it is the first output. Meanwhile, identifying the center position and size is difficult. Therefore, two LBR blocks are set behind the convolution layer as the neck to further extract the features. Each LBR block includes a linear layer, batch normalization (BN), and ReLU.
Several skip connections were used to transmit the feature from the shallow layer to the deeper layer in order to avoid the loss of the transient disturbance feature. The CBR-1D includes a one-dimensional convolution layer (Conv1D), BN, and ReLU. Conv1D was used to extract local features in the TP and then normalize the extracted features. Finally, ReLU was used to activate the features. This can inhibit the change in the data distribution, accelerate the convergence speed, and avoid the problems of gradient disappearance and gradient explosion.
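The CBR-1D unit and a skip connection can be sketched in PyTorch as follows; the channel counts and block layout are illustrative, not the exact 1D-ResNet-8/16 configuration.

```python
import torch
import torch.nn as nn

class CBR1D(nn.Module):
    """Basic feature-extraction unit: Conv1D -> BatchNorm -> ReLU."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(c_in, c_out, k, padding=k // 2),
            nn.BatchNorm1d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ResBlock1D(nn.Module):
    """Two CBR-1D units with a skip connection, so shallow transient-
    disturbance features are carried unchanged into deeper layers."""
    def __init__(self, c):
        super().__init__()
        self.cbr1 = CBR1D(c, c)
        self.cbr2 = CBR1D(c, c)

    def forward(self, x):
        return x + self.cbr2(self.cbr1(x))

x = torch.randn(4, 8, 512)   # (batch, channels, TP length)
y = ResBlock1D(8)(x)         # shape preserved: (4, 8, 512)
```

The additive skip term `x + ...` is what prevents the weak disturbance features from vanishing in the deeper layers.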

Generate the Dataset
Point targets with an SNR below 3 dB have no obvious spatial features; therefore, the SNR research range was established as −3 dB to 3 dB. Because the actual TP under a specific SNR is difficult to obtain and label, the features of the target and ground TP were combined to generate a dataset through simulation. To enable the network to fully learn the features of TPs within the research SNR range, TPs were generated between −4 dB and 4 dB.
During TP simulation, a bell-shaped target signal is generated according to Formula (5) and the location where the target signal appears is set randomly. To verify the effect of the target signal size on the detection performance, the target signal size range was set to 10~110 and signals of different sizes were generated in equal proportions. A constant was randomly set as the background signal. The two signals were superimposed to obtain the simulated TP. Finally, AWGN was added to the TP simulation. To ensure that the model exhibits good performance on TPs with different SNRs and different target signal sizes, we set the number of TPs to be equal for each SNR and size range.
After generating the dataset, we divided it into training and validation sets at a ratio of 8:2. The composition of the TPs in the dataset is shown in Figure 5: the left panel shows the distribution of TPs under different SNRs, and the right panel shows the distribution of TPs of different sizes under the same SNR.
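A minimal sketch of this simulation procedure; the SNR convention (signal amplitude over noise standard deviation, in dB) and the mapping from signal size to Gaussian width are assumptions of this sketch, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_tp(length=512, snr_db=0.0, size=60):
    """One simulated target TP: constant background + bell-shaped signal
    at a random location + AWGN."""
    k = np.arange(length, dtype=float)
    background = rng.uniform(10.0, 100.0)        # random constant background
    sigma_n = 1.0
    A = sigma_n * 10 ** (snr_db / 20.0)          # amplitude from assumed SNR definition
    k0 = rng.uniform(size, length - size)        # random signal location
    s = A * np.exp(-((k - k0) ** 2) / (2 * (size / 6.0) ** 2))
    tp = background + s + rng.normal(0.0, sigma_n, length)
    return tp, k0, size

# Equal numbers of TPs per SNR across the -4 dB..4 dB research range.
dataset = [simulate_tp(snr_db=snr)
           for snr in np.linspace(-4, 4, 9) for _ in range(4)]
```

Labels (class, center position k0, size) come directly from the simulation parameters, which is why the simulated dataset is easy to annotate while real low-SNR TPs are not.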


Loss Function
As the network trained in this study is a multitask learning network, the loss function is composed of three parts: classification loss, center position loss, and size loss. The classification loss uses the binary cross-entropy loss, and the center position loss and size loss use the mean square error loss. Because these three parts differ significantly in order of magnitude, their weights must be set manually to prevent loss imbalance, and the final weighted loss function is shown in Formula (7):

Loss = w_C · Loss_C + w_P · Loss_P + w_S · Loss_S (7)

where Loss_C represents the classification loss, Loss_P represents the center position loss, Loss_S represents the size loss, and w_C, w_P, and w_S are the manually set weights. The formulas for these three parts are as follows:

Loss_C = −(1/N) Σ_{i=1}^{N} [C_i log(Ĉ_i) + (1 − C_i) log(1 − Ĉ_i)]
Loss_P = (1/N) Σ_{i=1}^{N} (P_i − P̂_i)^2
Loss_S = (1/N) Σ_{i=1}^{N} (W_i − Ŵ_i)^2

where C_i represents the category label, Ĉ_i represents the predicted category, P_i represents the center position label, P̂_i represents the predicted center position, W_i represents the size label, Ŵ_i represents the predicted size, and N represents the number of TPs in a batch.
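A NumPy sketch of the weighted multitask loss; the weight values here are illustrative placeholders, since the paper's weights are set by hand.

```python
import numpy as np

def weighted_multitask_loss(C, C_hat, P, P_hat, W, W_hat,
                            w_c=1.0, w_p=1.0, w_s=1.0):
    """Weighted sum of BCE classification loss and MSE position/size
    losses. w_c, w_p, w_s balance the differing loss magnitudes."""
    eps = 1e-7
    C_hat = np.clip(C_hat, eps, 1 - eps)      # keep log() finite
    loss_c = -np.mean(C * np.log(C_hat) + (1 - C) * np.log(1 - C_hat))
    loss_p = np.mean((P - P_hat) ** 2)
    loss_s = np.mean((W - W_hat) ** 2)
    return w_c * loss_c + w_p * loss_p + w_s * loss_s

# One target TP (class 1) and one background TP (class 0).
loss = weighted_multitask_loss(
    C=np.array([1.0, 0.0]), C_hat=np.array([0.9, 0.1]),
    P=np.array([0.5, 0.0]), P_hat=np.array([0.4, 0.0]),
    W=np.array([0.2, 0.0]), W_hat=np.array([0.25, 0.0]))
```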

Training the Networks
PyTorch was used to build the network architectures and the training environment. The training equipment was a workstation with an NVIDIA GeForce GTX 1080 Ti GPU and 32 GB of memory.
During training, the random seed was set to 3407, the Adam optimizer was used, the parameter penalty coefficient was set to 1 × 10 −5 , the learning rate was initially set to 1 × 10 −4 , and the batch size was set to 2000. During training, rough training was first conducted for 10 epochs, then the learning rate was reduced 10-fold and fine-tuning was performed. If the loss of the validation set did not decrease within 10 epochs, the training ended. The loss optimization of network training is shown in Figure 6.
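The early-stopping rule described above ("training ends if the validation loss does not decrease within 10 epochs") can be sketched as follows; the toy loss curve is illustrative.

```python
def train_schedule(val_losses, patience=10):
    """Return the epoch at which training stops: the first epoch at which
    the validation loss has not improved for `patience` epochs."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch          # patience exhausted: stop here
    return len(val_losses) - 1

# Losses plateau after epoch 4, so training stops 10 epochs later.
stop = train_schedule([1.0, 0.5, 0.3, 0.2, 0.1] + [0.1] * 20)
```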

Figure 6 shows that the two networks converged after five epochs of training, and the training effect of 1D-ResNet-16 was slightly better than that of 1D-ResNet-8.

The Moving Point Target Trajectory Detection Process
The detection process of our proposed framework is as follows:
1. Input an image sequence and obtain the TP of each pixel.
2. Pre-process and standardize the TPs, and divide them into TP segments according to the network input size.
3. Load the trained model, input the TPs into 1D-ResNet in batches, and obtain the outputs.
4. Determine whether each TP contains a transient disturbance caused by the target according to the specified threshold value. If it does, its pixel is considered foreground; otherwise, it is considered background.
5. Unify all foreground pixels and output the motion track of the target.
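Steps 1-2 of this process can be sketched as follows; the standardization (zero mean, unit variance per profile) is an assumption of this sketch.

```python
import numpy as np

def sequence_to_tps(frames):
    """Turn a (K, H, W) image sequence into one standardized TP per
    pixel, shape (H*W, K), ready for batched network input."""
    K, H, W = frames.shape
    tps = frames.reshape(K, H * W).T.astype(np.float64)   # one row per pixel
    mean = tps.mean(axis=1, keepdims=True)
    std = tps.std(axis=1, keepdims=True) + 1e-8
    return (tps - mean) / std

# A 512-frame staring sequence over a 16x16 window -> 256 profiles.
frames = np.random.default_rng(1).normal(50.0, 2.0, (512, 16, 16))
tps = sequence_to_tps(frames)
```

Each row can then be split into segments of the network input length (512 in the paper) and fed to 1D-ResNet in batches; rows flagged as containing a disturbance mark foreground pixels on the target's track.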


Experiments and Analysis
To evaluate the feasibility and performance of the proposed method, extensive experiments were conducted, including a TP simulation experiment, image-sequence simulation experiment, real-world experiment, and comparison experiment.

• The TP simulation experiment directly detects the simulated TP and evaluates the classification and positioning performance of the method under ideal conditions using the receiver operating characteristic (ROC) and the accuracy of the intersection over union (IOU).
• To further fit the real scene and test the performance of the detection framework, we established image-sequence simulation experiments. A simulated moving point target was added to a real background image sequence, and the simulated sequences were used as the input data of the detection framework.
• We shot the movement process of a point target outdoors and conducted a real-world experiment based on these data.
• To verify the performance of the proposed method, we compared it with other benchmark methods.

Details of Image Sequences in Experiments
In the experiments, we used seven image sequences, three of which were simulated; the other four were real-world data taken outdoors. The details of the image sequences used in the experiments are listed in Table 1. In the image-sequence simulation experiment, we used asphalt roads, pure sky, and a complex scene to simulate space-based and ground-based detection. The backgrounds of sequences 1 and 2 are simple, while the background of sequence 3 is more complex, containing sky, mountains, buildings, etc. In sequences 1-3, we added a moving point target 1-3 pixels in size to these background image sequences. This point target moves from the upper-left corner to the lower-right corner of the image sequence.
To verify the performance of the proposed method in a real image sequence, we used a high-speed camera to capture outdoor image sequences. We tracked the movement of a glass ball from a height of approximately 50 m at 20,000 fps. The diameter of the glass ball was 1.5 cm, and the SNR was approximately 1-5 dB. To facilitate the experimental analysis, we obtained 8192 frames from the original sequence and established a window of 100 × 100 pixels for the target to pass through.
After obtaining the original sequence 4, to verify the impact of the target's stay time on the detection effect on a single pixel, we down-sampled sequence 4 to obtain sequences 5-7.

TP Simulation Experiment
The experiments in this section were conducted in two ways to verify the detection effect of our method on TPs with different SNRs and target signals of different sizes. ROC and IOU accuracy rates were used to evaluate the classification and positioning capabilities of the method, respectively.
The ROC curve is a graphical representation of the performance of a binary classification model as the discrimination threshold is varied. The x-axis represents the false positive rate (FPR), which is the ratio of false positives (incorrectly classified negative samples) to the total number of negative samples. The y-axis represents the true positive rate (TPR), which is the ratio of true positives (correctly classified positive samples) to the total number of positive samples. In this paper, positive samples refer to TPs containing the target signal, while negative samples refer to TPs without the target signal. Each point on the ROC curve reflects the sensitivity of the classifier to different discrimination thresholds. The larger the area under the curve (AUC) covered under the ROC curve, the better the detection performance of the method.
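The AUC described above can also be computed directly from the pairwise ranking of classifier scores. A minimal sketch (the function name and toy scores are illustrative, not from the paper):

```python
import numpy as np

def roc_auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    TP (contains the target signal) scores higher than a randomly
    chosen negative TP (noise only); ties count as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# A classifier that fully separates the classes reaches AUC = 1.0;
# chance-level scoring gives AUC around 0.5.
print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```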
The center position and size of the transient disturbance form the bounding box. If the IOU is greater than 0.5, the positioning is considered correct. The calculation method of the IOU of the predicted and true bounding boxes is shown in Equation (11). The higher the accuracy of the IOU, the better the positioning performance of the method.
where E_T and E_P represent the right boundaries of the true and predicted bounding boxes, respectively, and S_T and S_P represent the left boundaries of the true and predicted bounding boxes, respectively.
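Using the boundary notation of Equation (11), the 1D IoU and the 0.5 correctness test can be sketched as follows (the clamp for non-overlapping intervals is an assumption added for robustness):

```python
def iou_1d(s_t, e_t, s_p, e_p):
    """1D IoU between the true interval [S_T, E_T] and the predicted
    interval [S_P, E_P], in the spirit of Equation (11). The
    max(0, ...) clamp handles non-overlapping intervals."""
    inter = max(0.0, min(e_t, e_p) - max(s_t, s_p))
    union = max(e_t, e_p) - min(s_t, s_p)
    return inter / union if union > 0 else 0.0

# True disturbance spans frames 100-180, prediction spans 120-200:
# intersection = 60 frames, union = 100 frames, IoU = 0.6 > 0.5,
# so this localization would be counted as correct.
print(iou_1d(100, 180, 120, 200))  # 0.6
```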

The Detection Performance under Different SNR
To verify the influence of SNR on detection performance, simulated TPs under different SNRs were generated. Under each SNR, the target signal size was set between 10 and 110 in equal proportion. The ROC curves drawn using the two networks under different SNRs are shown in Figure 7, and the AUC and IOU accuracy are shown in Table 2. As shown in Figure 7 and Table 2, the classification performance of both models reached a good level, and all ROC curves covered over 90% of the area. As the SNR decreases, the classification performance worsens, and the IOU accuracy also decreases. This is because transient disturbances under a low SNR are very weak and can easily be submerged in the background. During the detection process, the target signal is prone to clutter interference, resulting in classification and positioning errors.
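A simulated TP of this kind can be sketched as Gaussian background noise plus a rectangular transient disturbance; the SNR definition used below (20·log10 of disturbance amplitude over noise standard deviation) is an assumption and may differ from the paper's exact formulation:

```python
import numpy as np

def simulate_tp(length=1000, size=60, snr_db=0.0, noise_std=1.0, seed=0):
    """Simulated temporal profile: Gaussian background noise plus a
    rectangular transient disturbance of `size` frames whose amplitude
    is derived from the requested SNR (assumed: snr_db = 20*log10(A/sigma))."""
    rng = np.random.default_rng(seed)
    tp = rng.normal(0.0, noise_std, length)
    amplitude = noise_std * 10 ** (snr_db / 20.0)
    start = int(rng.integers(0, length - size))
    tp[start:start + size] += amplitude
    return tp, start

tp, start = simulate_tp(snr_db=3.0)
mask = np.zeros(tp.size, dtype=bool)
mask[start:start + 60] = True
# The disturbed segment sits clearly above the clean background mean.
```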

The Detection Performance under Different Target Signal Sizes
To verify the influence of target signal size on detection performance, simulated TPs with different sizes were generated, in which the SNR was set at an equal ratio from -3 dB to 3 dB under each size. The ROC curves drawn by the two networks under different target signal sizes are shown in Figure 8, and the AUC and IOU accuracy are shown in Table 3. As shown in Figure 8 and Table 3, with an increase in the target signal size, the classification and positioning capabilities of the two models improve significantly. For classification tasks, when the target signal size was less than 20, the classification performance was very poor, whereas when the size increased to 40, the AUC of both models reached over 99%.
For positioning tasks, the IOU accuracy exhibited a more obvious trend with an increase in the target signal size. When the size was increased to 70, the accuracy increased to over 90%.
From the experimental results, we can see that the size of the target signal is a crucial factor for our method. The longer the moving target stays on a single pixel, the richer the motion features and the better the performance of the proposed method. Therefore, the detection performance can be improved by increasing the frame rate of the detector.

TP Simulation Experiment Analysis
The SNR and size of the target signal are important factors that affect detection performance. The higher the SNR and the larger the proportion of the target signal, the better the model detection performance. The proportion of the target signal has a greater impact on the detection effect than the SNR. The SNR cannot be significantly improved; however, the proportion of the target signal can be further improved by increasing the frame rate of the detection equipment.
Of the two networks, although 1D-ResNet-16 has eight additional CBR-1D layers and 228,160 more parameters than 1D-ResNet-8, its detection performance improves only slightly: the classification gap is small, and its IOU accuracy is less than two percentage points higher than that of 1D-ResNet-8. This shows that for weak transient disturbances, deeper networks do not yield a proportionate performance gain; instead, they consume more computation, which conflicts with the requirement for real-time detection.

Image-Sequence Simulation Experiment
We used our method to detect sequences 1-3. The detection results for the two networks are presented in Figure 9 and Table 4.   From Figure 9 and Table 4, we can observe that both networks show good detection performance for all three sequences. Although there were some false alarm points, the moving track (main diagonal) of the target was clear. Additionally, these false alarm points can be removed through post-processing.
Compared to 1D-ResNet-8, the detection rate of 1D-ResNet-16 is higher, but the time consumption of 1D-ResNet-8 is lower. In an actual detection task, we should use 1D-ResNet-16 if the detection rate is more important. However, if the detection speed is more important, 1D-ResNet-8 should be used.

Real-World Experiment
We used our method to detect real-world sequences. The detection results are shown in Figure 10 and Table 5. The results show that both networks have relatively good detection performance on the sequences and both completely detect the moving track of the glass ball. As the down-sampling factor increases, the target's stay time in a single pixel becomes shorter and the detection performance worsens. In this experiment, 1D-ResNet-16 had no significant advantage over 1D-ResNet-8 in terms of detection performance. Therefore, 1D-ResNet-8 can meet the detection requirements when the SNR is high.
Sequences 1-4 were used for comparison. The results are presented in Figure 11 and Tables 6 and 7. Table 8 shows the computational efficiency of all methods. These methods were implemented on a computer with an AMD Ryzen 7 1700 CPU and an NVIDIA GeForce GTX 1080 Ti GPU. From Table 8, it can be seen that our method has the fastest detection speed. The detection speed of 1D-ResNet-8 is faster than that of 1D-ResNet-16, as 1D-ResNet-8 has fewer parameters. In the future, we will improve the network by proposing lightweight variants to further increase the detection speed. Figure 11 shows that the temporal-based methods can detect low-SNR point targets in an image sequence, whereas the spatial-based methods cannot detect the target track.
Among the temporal-based methods, our method has the best performance, followed by ICLSP. Although ICLSP exhibits similar performance to our method on simulation sequences, its detection effect is far inferior to ours on real-world low-SNR sequences. The Kernel method can detect a real sequence with a high SNR well, but it produces many false alarm points on the simulation sequence with a low SNR. This shows that our method not only has excellent detection ability for moving point targets with a low SNR but also has good robustness for real point targets.

Discussion
In this section, we discuss our method in detail and compare it with other methods to illustrate its advantages and disadvantages. After that, we discuss the results of our ablation experiments to verify the effects of various parts of 1D-ResNet. Finally, we discuss the results of our visualization research on the network to verify whether it learned the features of transient disturbances.

Analysis of All Methods
In this section, we analyze the characteristics, advantages, and disadvantages of all methods, as shown in Table 9.  The two networks we propose have the best detection performance and fastest detection speed for low-SNR moving point targets. Of the two networks, 1D-ResNet-16 has the best detection performance, while 1D-ResNet-8 has the fastest detection speed.
Other TP-based detection methods (ICLSP, NAF, TRLCM, and Kernel) can also detect the motion trajectory of targets, but their detection rate and false alarm rate are not as good as those of our methods, and these methods require more time for detection.
Other spatial-based methods (MaxMean, LCM, and IPI) are completely unable to detect point targets under a low SNR.

Ablation Experiments
In this section, we conducted ablation experiments to verify the superiority of the 1D-ResNet and CBR-1D. Due to the similarity of the two network structures (1D-ResNet-16 only has eight more layers than 1D-ResNet-8), this section is based on 1D-ResNet-8 only.

Network Structure Study
We first removed the skip connections from the network, then replaced the basic structural unit CBR-1D with Conv-1D (Conv1D and ReLU) and CBL (Conv1D, BN, and LeakyReLU), and finally removed the LBR module from the network. The ablation experiments we designed are shown in Table 10.
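The CBR-1D unit and its skip-connected variant examined in this ablation can be sketched in PyTorch as follows; channel widths and kernel size are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CBR1D(nn.Module):
    """Basic structural unit: Conv1D + BatchNorm + ReLU (CBR-1D)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ResidualCBR(nn.Module):
    """Two CBR-1D units with a skip connection, so weak transient
    features extracted in shallow layers are carried into deeper ones."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(CBR1D(ch, ch), CBR1D(ch, ch))

    def forward(self, x):
        return x + self.body(x)

# Input shape: (batch, channels, TP length).
x = torch.randn(4, 16, 1000)
y = ResidualCBR(16)(x)
```

Removing the `x +` term in `ResidualCBR.forward` yields the skip-free CNN-1D variant, and swapping `nn.ReLU` for `nn.LeakyReLU` or dropping `nn.BatchNorm1d` yields the CBL and Conv-1D variants of the ablation.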

We did not weight the loss function when comparing the performance of these networks. The loss optimization of all networks is shown in Figure 12. From Figure 12, we can see that the performance of the CNN-1D network is the worst, but after adding skip connections, the performance of ResNet-1D is significantly improved. This indicates that skip connections are very helpful for optimizing the network loss. After adding the LBR module, the loss of ResNet-1D LBR decreased further. The addition of BN to the basic structural unit accelerates the convergence speed of the network. However, we can also see that replacing the activation function (ReLU or LeakyReLU) does not affect network optimization.
Next, we used these networks to test the TPs and verify their detection performance. The experimental data were TPs with SNR = 0 dB and a target signal size of 80. The experimental results are shown in Figure 13 and Table 11. From the experimental results, it can be seen that all networks have good classification ability, but our network has the highest AUC. The positioning ability of the networks without skip connections and LBR modules is poor. After adding skip connections, the positioning ability improves, but only slightly. The addition of the LBR module greatly improves the positioning performance of the network. This indicates that skip connections transfer the transient disturbance features extracted in shallow layers to deeper layers, preventing feature loss, while the LBR module extracts higher-dimensional features, which helps to locate transient disturbances more accurately. ResNet-1D LBR with no BN in its basic structural unit has the best positioning performance, but it is only 0.17% higher than that of our network. Adding BN does not affect the performance of the network in theory, but it accelerates the convergence speed of the network.

Network Visualization
In this section, we conduct a visualization study of the network to verify whether it has learned the distribution features of transient disturbances. Grad-CAM [45] (Gradient-weighted Class Activation Mapping) was used to visualize the network. The intensity of the target signal was set to 3, the size was set to 60, and the SNR was controlled at 3 dB. The chosen visualization layers were CBR5, CBR9, CBR13, and CBR16. The visualization results are shown in Figure 14, where the blue line is the original TP and the orange line is the heatmap calculated using Grad-CAM. The larger the value of the heatmap, the greater the network's attention to that location.

Figure 14 shows that the heatmap has the highest value at the target signal, proving that the network has fully learned the distribution features of the transient disturbance; however, the heatmaps of the shallow layers also contain considerable clutter. As the network depth increases, the clutter gradually decreases and the network learns more features of the transient disturbance. The Grad-CAM visualization therefore shows that the network proposed in this study is interpretable.
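The Grad-CAM computation for a 1D network can be sketched as follows, using a tiny stand-in CNN rather than the paper's 1D-ResNet: the channel weights are the length-averaged gradients of the class score with respect to a chosen layer's activations, and the heatmap is the ReLU of the weighted activation sum.

```python
import torch
import torch.nn as nn

class TinyNet1D(nn.Module):
    """Minimal stand-in 1D CNN (not the paper's 1D-ResNet) used only
    to illustrate Grad-CAM on a temporal profile."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, 7, padding=3), nn.ReLU(),
            nn.Conv1d(8, 8, 7, padding=3), nn.ReLU(),
        )
        self.head = nn.Linear(8, 2)  # with / without target signal

    def forward(self, x):
        a = self.features(x)               # activations, shape (B, C, L)
        logits = self.head(a.mean(dim=2))  # global average pooling + linear
        return logits, a

def grad_cam_1d(model, tp, cls=1):
    """Grad-CAM: weight each channel by the mean gradient of the class
    score w.r.t. its activations, sum, apply ReLU, normalize to [0, 1]."""
    model.eval()
    logits, act = model(tp.view(1, 1, -1))
    act.retain_grad()                      # keep gradients of a non-leaf tensor
    logits[0, cls].backward()
    w = act.grad.mean(dim=2, keepdim=True)           # (1, C, 1) channel weights
    cam = torch.relu((w * act).sum(dim=1)).squeeze(0)  # (L,) heatmap
    return cam / (cam.max() + 1e-8)

cam = grad_cam_1d(TinyNet1D(), torch.randn(1000))
```

With a trained 1D-ResNet, the same procedure applied at layers such as CBR5 through CBR16 would produce the per-layer heatmaps shown in Figure 14.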

Conclusions
To resolve the problem of moving point target detection at a low SNR, we converted the problem of point target detection into that of transient disturbance detection in the TP formed by each pixel. For the transient disturbance detection problem, we proposed a detection framework to learn the distribution features of the transient disturbances. In this framework, we first formulated different types of TP and generated a training dataset. Then, two networks, 1D-ResNet-8 and 1D-ResNet-16, were designed, which can adapt to detection-speed-priority and detection-rate-priority scenarios, respectively. Of the two networks, 1D-ResNet-16 has better detection performance than 1D-ResNet-8, but it requires more time. For detection tasks with high real-time requirements, 1D-ResNet-8 is the better choice. Extensive experiments showed that our TP model is valid and that our method is effective. Compared to other benchmark methods, the proposed method has clear advantages in improving the detection rate and reducing the false alarm rate at a low SNR. Our method also has the fastest detection speed. In addition, we conducted ablation experiments to verify the superiority of our network and the CBR-1D structure, and the results showed that all the modules of our proposed network are necessary. Network visualization research proved that our network learned the features of transient disturbances well.
Moreover, we studied the factors that affect detection performance and found that the size of the target signal had a greater impact on the detection results than the SNR of the TP. The detection performance of our method can be improved by increasing the sampling frame rate of the camera.
The method proposed in this study has the potential to be deployed in space-based or ground-based intelligent detection equipment. In the future, we will continue to study the problem of moving point target detection to propose a more efficient and stable detection method in order to make further contributions to this research field.