Simultaneous removal of noise and correction of motion warping in neuron calcium imaging using a pipeline structure of self-supervised deep learning models

Calcium imaging is susceptible to motion distortion and background noise, particularly when monitoring active animals under low-dose laser irradiation, which inevitably hinders the analysis of neural functions. Current research efforts tend to focus on either denoising or dewarping and do not provide effective methods for videos corrupted by both noise and motion artifacts simultaneously. We found that when the self-supervised denoising model DeepCAD [Nat. Methods 18, 1359 (2021); doi:10.1038/s41592-021-01225-0] is applied to calcium imaging contaminated by noise and motion warping, it can remove the motion artifacts effectively, but at the cost of regenerated noise. To address this issue, we develop a two-level deep-learning (DL) pipeline that dewarps and denoises calcium imaging videos sequentially. The pipeline consists of two 3D self-supervised DL models, neither of which requires warp-free or high signal-to-noise ratio (SNR) observations for network optimization. Specifically, a high-frequency enhancement block is introduced in the denoising network to restore more structural information during denoising, and a hierarchical perception module and a multi-scale attention module are designed in the dewarping network to tackle distortions of various sizes. Experiments conducted on seven videos from two-photon and confocal imaging systems demonstrate that our two-level DL pipeline can restore high-clarity neuron images corrupted by both motion warping and background noise. Compared to the original DeepCAD, our denoising model achieves an improvement of approximately 30% in image resolution and up to 28% in SNR; compared to traditional dewarping and denoising methods, our pipeline recovers more neurons, enhances signal fidelity, and improves data correlation among frames by 35% and 60%, respectively.
This work may provide an attractive method for long-term neural activity monitoring in awake animals and also facilitate functional analysis of neural circuits.


Introduction
Calcium imaging visualizes the specific activity of neurons through active sensors, which can aid in studying the neuronal behavior of animals and promote the use of animal models in neuroscience research. Calcium imaging also provides the possibility of evaluating the long-term effects of relevant pharmacological interventions [1]. As research on calcium imaging continues to mature, two-photon microscopy [2] and confocal microscopy have been continuously developed and applied in this field. Both techniques allow high-resolution three-dimensional structural and behavioral imaging of neurons in the brain. A two-photon system can effectively collect high-resolution three-dimensional volume information on neuronal activity at depth with a low chance of photobleaching [3,4]. For example, Cheng et al. studied the application of a two-photon system in brain imaging and showed that it could perform subtle imaging of neuronal calcium transients in a small field of view at multiple depths [5]. Although a confocal imaging system cannot image as deeply as two-photon microscopy, it can be easily implemented in wide-field applications and usually obtains high-contrast, background-free images of brain tissue. For example, Pacheco et al. applied a confocal microscopy system to whole-brain imaging of a mouse to solve the problem of brain-region segmentation [6,7].
In practical applications, calcium imaging is often affected by motion distortions. Although anesthetizing animals is an effective way to prevent awake-state biological activities from affecting the imaging process [8], it may have adverse effects on the nervous system [9,10] and thus on neurological research. In addition, even if the imaging probe and animal remain stationary, animal activities such as breathing are unavoidable, and body movements are dynamic, resulting in motion distortions in the collected neuronal videos. It is preferable to perform calcium imaging while the animal is active [11]; however, motion artifacts are then more severe. For example, long-term monitoring of a certain area is required to study the neural mechanisms of perception, learning, and behavior [12]. However, in continuous time-frame imaging, a low laser power must be used to prevent phototoxicity and photobleaching [2], which inevitably leads to severe noise in the imaging video. Therefore, in two-photon and confocal imaging systems, there is a tradeoff between imaging time and laser power [10,13]. Consequently, in the continuous imaging of active animals, both motion artifacts and noise are produced, which hampers the subsequent analysis of brain neural activities.
Several methods for warp removal in calcium imaging have been reported. Initially, a registration algorithm to eliminate fast-motion artifacts was proposed and used for two-photon calcium imaging in conscious animals [14]; however, this algorithm relies on gathering dual-channel fluorescence data with a poor signal-to-noise ratio. Subsequently, an online algorithm for piecewise rigid motion correction (NoRMCorre) was developed [15], which can quickly remove the slight shaking caused by interference in calcium fluorescence. Because it is rapid, effective, and convenient to use, NoRMCorre has been integrated into CaImAn [16], a renowned comprehensive processing toolbox for calcium imaging. However, this approach is limited in handling nonrigid image deformation. Recently, the PatchWarp method for video registration of two-photon brain imaging was developed [17], which achieves more resilient non-rigid registration than NoRMCorre and inherits its high computing speed. However, it is not sufficient for handling slowly changing image distortions, and the accuracy of distortion correction for nonrigid deformations of neuronal structures is also affected. Deep learning has been used to study motion artifact removal in medical imaging, such as magnetic resonance imaging (MRI) [18] and computed tomography (CT) [19]. To the best of our knowledge, however, no previous study has used deep learning to eliminate motion artifacts in calcium imaging.
Numerous studies have reported on the denoising of calcium imaging; they can be classified into traditional mathematical approaches and deep-learning methods. The constrained nonnegative matrix factorization (CNMF) method [20] and its extended version for microendoscopic data (CNMF-E) [21] are typical mathematical methods for calcium imaging denoising. These methods have significant advantages in processing speed and have been integrated into CaImAn [16]. However, their successful performance requires that the noise in the video not cover the original characteristics of the signals, that is, a relatively high imaging contrast-to-noise ratio. Recently, two deep-learning models were developed for calcium imaging denoising. One is a supervised convolutional neural network (CNN) based on the Neuro Imaging Denoising via Deep Learning (NIDDL) pipeline [13], which achieves excellent processing results with a limited amount of data. However, a supervised network requires ground truth (GT) for stable training, and in practice it is often difficult to obtain paired videos of the same neuronal calcium information because of the transient nature of neuronal calcium signals [22]. The other is DeepCAD, a 3D self-supervised denoising network in which adjacent time frames are employed as training data pairs. DeepCAD can recover high-quality images acquired at low laser energies without the need for high signal-to-noise paired data [23]. In summary, deep-learning techniques can effectively remove background noise when applied to data that are compatible with their methods.
These methods solve the problems of denoising or dewarping separately. However, if the video data of calcium imaging are contaminated by both noise and motion distortion, the existing methods are not suitable. Specifically, if a video is denoised first and then dewarped, some of the original features of the neurons are destroyed, and vice versa; detailed information is provided in Visualization 1. We attribute this to the low inter-frame correlation of the deformed videos. Therefore, exploring new methods that implement both warp removal and denoising while maintaining clear image structure information is valuable. To achieve this goal, we developed a self-supervised deep-learning pipeline structure with a serial, two-level deep-learning (DL) network; each network model was designed with a 3D U-Net as the backbone because of its superior performance in 3D data denoising and segmentation [12]. The first-level network corrects the motion distortion in neuronal imaging videos of awake animals, whereas the second-level network removes noise from the dewarped videos. The major contributions of our study are summarized as follows: (1) We developed a novel two-level DL pipeline to achieve high-quality recovery of neuronal calcium images that are simultaneously degraded by motion warping and background noise.
(2) To the best of our knowledge, this is the first study to explore a self-supervised DL model for removing motion distortion. This design is based on our finding that the self-supervised DeepCAD network can correct motion warping. In this network, a hierarchical perception module and a multiscale attention module are proposed to capture information on motion behaviors of various sizes.
(3) To remove noise from the dewarped videos, an improved version of DeepCAD was developed, in which a high-frequency enhancement block is proposed. This block preserves the high-frequency components of the signals and recovers more detailed information during noise removal. The model can be optimized effectively with less training data and a smaller patch size than DeepCAD [23], leading to lower computational costs.
We performed experiments using synthetic and in vivo data from two-photon and confocal systems to verify the proposed method. Compared with previous studies, the experimental results demonstrate that the proposed DL pipeline can recover more structural information through the denoising and dewarping processes than traditional methods, and the sharpness of the images is effectively improved, with higher resolution.

Network model
A flowchart of the pipeline structure is shown in Fig. 1. The structure contains two self-supervised DL models with a 3D U-Net as the backbone: a dewarping network and a denoising network. The entire process is described as follows.
Step 1: Generate training data pairs for the dewarping network. As shown in Fig. 1(A), the 3D raw video data are divided into numerous overlapping substacks. For each substack, the odd and even time frames are extracted separately to form the source and target data.
Step 2: Optimize the dewarping network in the training process and output the dewarped substacks in the prediction process.
Step 3: Merge the output substacks into a complete, dewarped, and noisy 3D image stack. As shown in Fig. 1(B), we subtract the overlaps of the processed substacks and stitch them to obtain a completely processed 3D stack.
Step 4: Repeat Step 1 for the dewarped 3D stack and feed the resulting data pairs into the denoising model for network training. Denoised substacks are obtained using this optimized model. Finally, Step 3 is repeated to construct the ultimate denoised and dewarped results.
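The substack and data-pair handling of Steps 1-3 can be sketched in a few lines of Python; the fixed depth/overlap values and the helper names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def make_pairs(stack, depth=16, overlap=4):
    """Step 1: cut a (T, H, W) video into overlapping temporal substacks and
    form self-supervised (source, target) pairs from odd/even time frames."""
    pairs, step = [], depth - overlap
    for t0 in range(0, stack.shape[0] - depth + 1, step):
        sub = stack[t0:t0 + depth]
        pairs.append((sub[0::2], sub[1::2]))  # odd frames -> source, even -> target
    return pairs

def merge_substacks(subs, overlap=4):
    """Step 3: stitch processed substacks back together, trimming the
    overlapping frames of every substack after the first."""
    trimmed = [subs[0]] + [s[overlap:] for s in subs[1:]]
    return np.concatenate(trimmed, axis=0)
```

A round trip through `merge_substacks` on untouched substacks reproduces the original stack, which is the property the stitching in Fig. 1(B) relies on.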
Note that distinct 3D-stack sizes are used in our two DL networks (a smaller patch size is adopted in the denoising model); therefore, Step 3 is necessary. Our pipeline structure is a cascade of the dewarping network and the denoising network. The two models are trained separately and can be used alone or in cascade. The advantage of this pipeline is that the two models can be flexibly applied to separate denoising or dewarping tasks, or in cascade mode to data contaminated by both noise and distortion.

Dewarping network
The structure of the dewarping network is shown in Fig. 2(A). In the in vivo calcium imaging process, motion artifacts range from mild to significant. To capture distortion features of different sizes, a hierarchical perception module, Res2-Conv3D, was designed and inserted into each level of the network, as shown in Fig. 2(B). In Res2-Conv3D, the feature maps are split into four groups: one group is kept invariant, and each of the other three groups, from bottom to top, is processed with a 3 × 3 × 3 convolution and then added to the adjacent group. Based on this module, the proposed network can obtain hierarchical image structure information. The mathematical model is as follows:

Y_1 = X_1,  Y_2 = F_2(X_2),  Y_i = F_i(X_i + Y_{i-1}),  i = 3, 4,  (1)

where the inputs and outputs of the four channels are denoted as X_i and Y_i (i = 1, 2, 3, 4), respectively, and F_i() represents the 3D convolution operation in channel i. Subsequently, to make the network focus on the extracted features, a multiscale attention module is inserted behind Res2-Conv3D at the shallow layers of the model. In this module, a multiscale atrous convolution block is first employed to further expand the receptive field of the network and capture deeper spatial voxel context information. Subsequently, a 3D pyramid pooling block with different strides is used to recognize global features over multiple perceptive fields [19]. Each block employs LeakyReLU as its activation unit and group normalization as its normalization layer. Finally, a residual connection allows the contextual features to be well integrated. As a result, this module can extract complex inter-frame information at multiple scales, thereby significantly reducing the impact of motion warp in calcium imaging while preserving image structure information.
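A minimal sketch of this hierarchical split, assuming the Res2Net-style update described above (first group passed through, each later group added to the previous output before convolution); the `convs` callables are stand-ins for the real 3 × 3 × 3 convolutions:

```python
import numpy as np

def res2_conv3d(x, convs):
    """Hierarchical perception sketch: split the channel axis into four
    groups; the first passes through unchanged, and each later group is
    added to the previous group's output before its (stand-in) convolution."""
    x1, x2, x3, x4 = np.split(x, 4, axis=-1)  # four channel groups
    y1 = x1                     # group 1: identity
    y2 = convs[0](x2)           # group 2: convolution only
    y3 = convs[1](x3 + y2)      # groups 3-4: add previous output, then convolve
    y4 = convs[2](x4 + y3)
    return np.concatenate([y1, y2, y3, y4], axis=-1)
```

With identity stand-ins for the convolutions, the four output groups of an all-ones input are 1, 1, 2, and 3, which makes the progressively enlarged receptive field of the later groups easy to see.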

Denoising network
Although DeepCAD exhibits good performance in suppressing the Poisson-Gaussian noise of neuronal calcium imaging [24,25], its heavy reliance on the correlation between video frames leads to a loss of high-frequency information, such as the edges of neurons. In particular, in our two-level DL model, the noisy images obtained after the dewarping process of the first-level network suffer a further reduction in clarity. Therefore, using DeepCAD as the backbone, we developed an advanced denoising model, as shown in Fig. 3(A). The novelty of the denoising network lies in the proposed high-frequency enhancement block, as shown in Fig. 3(B) and Eq. (2). This block is inserted into the decoding path because most of the noise has been removed at that stage, which avoids the influence of noise on the high-frequency operator. In this block, the 3D substack is first separated into single frames, and each frame undergoes two parallel operations: gamma transformation and Laplacian high-frequency filtering. Here, we set the exponent γ = 0.8 < 1 in the gamma transform to improve the contrast of weak neural signals [26,27], facilitating the recognition of inconspicuous neuronal components. Laplacian filters are used to detect high-frequency contexts such as edges in smoothed images [28-30]. The high-frequency components of an image represent the parts where adjacent pixel intensities change markedly; the Laplacian filter extracts them from the intensity difference between the central pixel and its neighbors. In calcium imaging, the interface between neurons and the background, as well as the subtle changes in calcium transients within neurons, exhibit large signal gradients and can therefore be detected by the Laplacian operator. The high-frequency enhancement block extracts the high-frequency information of the feature maps through the Laplacian operator and adds it back to the original feature maps to sharpen the neuron information. This computational process is expressed by Eq. (2). To demonstrate the effect of the Laplacian operator more clearly, we provide processing results on sample data in Fig. S1.

I_gamma = (I_in)^γ,  I_hf = L(I_in),  I_out = I_gamma + w · I_hf,  (2)

where I_in denotes a separate feature map from the 3D stack, I_gamma denotes the map after gamma transformation, I_hf is the high-frequency map after the Laplacian operator L(), and I_out represents the final feature map after high-frequency enhancement. w is a weighting coefficient that adjusts the proportion of I_hf in I_out; in our study, we set w to 0.07. Moreover, in DeepCAD, the 3D stack size was set to 64 × 64 × 320, which requires a relatively large imaging field of view (FOV) with long time-frame sequences [13] and thus has high requirements for computational resources. To make the model applicable to videos imaged over a small FOV on a low-cost platform, we reduced the 3D stack size to 64 × 64 × 16 for the dewarping network and 24 × 24 × 16 for the denoising network. The modified patch size renders the network more convenient for practical applications.
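A minimal per-frame sketch of the enhancement block, assuming frames normalized to [0, 1] and a 4-neighbour Laplacian kernel; the two-path combination follows the gamma-plus-Laplacian description above, and the function names are illustrative:

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian with replicated borders."""
    p = np.pad(img, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def hf_enhance(frame, gamma=0.8, w=0.07):
    """Gamma transform lifts weak neural signals; the Laplacian
    high-frequency map is added back with weight w."""
    i_gamma = frame ** gamma   # frame assumed normalized to [0, 1]
    i_hf = laplacian(frame)
    return i_gamma + w * i_hf
```

On flat regions the Laplacian term vanishes and only the gamma contrast lift remains; at neuron edges the high-frequency term sharpens the transition.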

Experiment data
Seven video datasets of neuronal calcium imaging were used, and their detailed information is listed in Table 1. The first four videos were obtained from a two-photon imaging system and can be publicly accessed in [23,31]; the detailed system information can be found in [23]. Video 1 was generated by imaging an artificial synthesis object, whereas the other three videos were acquired by actual mouse brain imaging. The last three videos were obtained from a C57BL/6 mouse (male, 9 weeks old, Charles River Laboratories, MA, USA) on a confocal fluorescence imaging system built by our research group [32]. The neuron images in the first five datasets contain mainly noise, and their distortion artifacts are small and negligible; these datasets were used to verify the denoising ability of the proposed DL model. In Videos 6 and 7, the images are corrupted by both noise and significant distortion artifacts, which allows the denoising and dewarping performance of our two-level pipeline model to be demonstrated. Here, the ground truth for synthetic Video 1 was simulated using the Neural Anatomy and Optical Microscopy (NAOMi) tool [33], and the ground truth for in vivo imaging on the two-photon system was created by increasing the laser energy to improve the signal-to-noise ratio (SNR). All animal experiments were conducted in conformity with the protocol approved by the Animal Research Committee of the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.

Quantitative metrics
We employed three metrics, namely the signal-to-noise ratio (SNR), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM) [34], to assess the overall performance of the DL network. Their definitions are expressed in Eqs. (3)-(5):

SNR = 20 log10( max(I_signal) / σ_background ),  (3)

PSNR = 10 log10( max(I_ref)^2 / ( (1/N) Σ (I_img − I_ref)^2 ) ),  (4)

SSIM = [ (2 µ_img µ_ref + c_1) / (µ_img^2 + µ_ref^2 + c_1) ] · [ (2 σ_img σ_ref + c_2) / (σ_img^2 + σ_ref^2 + c_2) ] · [ (σ_img,ref + c_3) / (σ_img σ_ref + c_3) ].  (5)

In Eq. (3), max(I_signal) is the maximum value of the signals in the image, and σ_background is the standard deviation of the background. The PSNR is calculated for the current image I_img against its reference image I_ref, where N is the total number of pixels. The SSIM focuses on three key features, luminance, contrast, and structure, which are computed by the left, middle, and right terms in Eq. (5). c_1, c_2, and c_3 are the weight coefficients. σ_img and σ_ref represent the standard deviations of the intensities of the current image and its reference, respectively, and σ_img,ref represents the covariance between them. µ_img and µ_ref represent the arithmetic means of the intensities of the current image and its reference. Note: the reference image is the ground truth for Video 1, Video 2, and Video 3. For the videos without ground truth (Video 4 to Video 7), the reference is the adjacent frame of the current frame in the computation of the PSNR and SSIM.
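The first two metrics can be sketched as below; the background mask and the peak value are caller-supplied assumptions, and SSIM is omitted since library implementations (e.g., scikit-image) are typically used:

```python
import numpy as np

def snr_db(img, background_mask):
    """Eq. (3): peak signal over background standard deviation, in dB.
    `background_mask` is a boolean array selecting background pixels."""
    return 20.0 * np.log10(img.max() / img[background_mask].std())

def psnr_db(img, ref, peak=1.0):
    """Eq. (4): PSNR of the current image against its reference,
    assuming intensities normalized so that `peak` is the signal maximum."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

For the videos without ground truth, `ref` would simply be the adjacent frame, as stated in the note above.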

Experimental results
All training and prediction processes of the DL models were executed on a desktop computer equipped with 24 GB of RAM, an AMD Ryzen 7 5700G processor, and an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. TensorFlow was used to implement all DL models. We constructed 10,000 patch samples to train the dewarping network, which achieved the effect of eliminating warps (with regenerated noise) after 30 epochs (300,000 iterations). In the denoising process, 80,000 small-patch samples were used to train the denoising network; a fine visual effect was achieved after 35,000 iterations. Adam was used as the network optimizer with a learning rate of 0.001. In our experiments, the arithmetic average of the L1 and L2 norm terms served as the loss function of both networks, defined in Eq. (6):

Loss(x, y) = (1/2) ( ||f(x) − y||_1 + ||f(x) − y||_2 ),  (6)

where x and y represent a source-target data pair in the self-supervised network and f(x) is the predicted result. The dewarping and denoising networks use the same loss function of Eq. (6), and the hyperparameters of the two networks are independent and do not interfere with each other.
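A sketch of this loss, assuming the two norm terms are averaged per voxel (mean absolute error and root-mean-square error), a common implementation choice the text does not spell out:

```python
import numpy as np

def pipeline_loss(pred, target):
    """Eq. (6): arithmetic average of an L1 term and an L2 term between
    the prediction f(x) and the self-supervised target y.  Both terms are
    normalized per voxel here (MAE and RMSE), an assumed convention."""
    d = np.asarray(pred, float) - np.asarray(target, float)
    l1 = np.mean(np.abs(d))
    l2 = np.sqrt(np.mean(d ** 2))
    return 0.5 * (l1 + l2)
```

In a TensorFlow training loop this function would simply be rewritten with `tf` ops so gradients can flow through it.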

Denoising results
To verify the proposed denoising network, we first conducted experiments on artificially generated data from Video 1 in Table 1. The data contained five noisy videos with varying SNRs. The original images and the denoised results from the proposed model and DeepCAD are shown in Fig. 4. Compared to DeepCAD, the proposed network achieved a superior visual effect in the denoised images, including better edge clarity and improved weak-signal contrast, as shown in the enlarged sub-images indicated by the rectangular boxes in Fig. 4. In this analysis, the advantages of the proposed model lie in the use of high-frequency filters and gamma transformations.
The proposed denoising model exhibited good performance on all five videos, showing greater superiority as the image quality decreased, which demonstrates its excellent generalization to data with different SNR levels. Three quantitative indices, PSNR, SNR, and SSIM, were computed to evaluate the denoised images from the two methods and are listed in Table 2. Observing the data in this table, the proposed method achieves the best quantitative values for most of the indices, consistent with the visual effects shown in Fig. 4. Compared with DeepCAD, the PSNR is improved by 4% to 13% and the SNR by 6% to 28%.
We then conducted experiments on four in vivo datasets (Video 2 to Video 5 in Table 1) to further verify the proposed denoising model. The original images and the corresponding denoised results obtained using the proposed model and DeepCAD are shown in Fig. 5. This figure demonstrates that our model can restore axonal signals with higher resolution and a more complete neuronal structure than DeepCAD. Detailed information can be found in the enlarged sub-images of the three selected sub-regions (numbered 1, 2, and 3) indicated by the rectangular boxes. To illustrate the effectiveness of the proposed method in recovering more information and improving image clarity, normalized intensity plots of 1D cross-lines selected in the sub-images are shown below them. Compared with DeepCAD, the intensity curves of the 1D cross-lines obtained using the proposed method display clearer signal peaks. Because of the small size of the confocal data, we retrained DeepCAD on these data with the same patch size as that used in the proposed denoising model.
Next, to quantitatively evaluate the denoising performance of the proposed method and DeepCAD, three quantitative indices, SNR, PSNR, and SSIM, were calculated and are listed in Table 3. Because ground truth was lacking for Video 4 and Video 5, their PSNR could not be computed. The SSIM in the table was calculated using two adjacent frames of the denoised video as the reference image and signal image, respectively. Considering that the SSIM reflects the structural similarity between two images, two adjacent frames of a video have highly similar neuron signals, and the difference between them is the randomly generated noise distribution. Therefore, an SSIM calculation based on adjacent frames can reflect the denoising effect on the video. Specifically, the first frame was paired with the second frame, the second frame with the third, and so on until all data frames had been traversed; the average of these values is the reported value of the parameter. From this table, both DeepCAD and the proposed model exhibit apparent improvements in the quantitative parameters compared with the raw images. The proposed DL model achieves the best indicator values for most of the video data, demonstrating the superiority of our method in improving the signal-to-noise ratio while recovering accurate structural information from neuron images. To evaluate the difference in image resolution between DeepCAD and the proposed method, we selected a small region in the restored images from the different methods and computed the resolution of typical signals, as shown in Fig. 6. Specifically, for each subimage we drew a line perpendicular to the signal and plotted the intensity along these cross lines in the subfigures (Fig. 6(B)). Based on these plots, we computed the resolution using the full width at half maximum (FWHM) of the signals. The results are shown in Fig. 6(C). Compared to DeepCAD, the proposed model shows an apparent resolution improvement; thus, our model can achieve a more precise definition of neuron trajectories.
To verify the performance of our denoising network on other noise types, we constructed two datasets containing salt-and-pepper noise and two datasets containing transverse stripe noise, based on the clean images of Video 1. The corresponding denoising results are shown in Fig. S2.

Dewarping & denoising results
In this section, we describe the experiments conducted on two confocal videos (Video 6 and Video 7 in Table 1), contaminated by both warping distortion and noise, to verify the proposed two-level pipeline network. We compared the dewarping and denoising effects of the proposed DL model with the previous PatchWarp and Block-Matching and 4D Filtering (BM4D) methods, respectively; the results are shown in Fig. 7. In this figure, the maximum intensity projection (MIP) maps along the time axis of each video are listed to show the processed results from the different methods. Figures 7(a1-e1) and 7(a2-e2) show the original MIP images, the results processed by PatchWarp and BM4D, and the dewarped and denoised results of the proposed method on the two videos. Observing the dewarped images in the first row, we found that PatchWarp can register the videos without noise removal, but the non-common field of view is discarded. To further compare the dewarping and denoising performances of PatchWarp + BM4D and the proposed two-level network, overlapping images were generated from three adjacent frames: the selected current frame, its previous frame, and the next frame. They are represented by green, red, and blue, respectively, and merged into overlapping maps, as shown in the second row of Figs. 7(f1-j1) and 7(f2-j2). The overlaid color in these maps visually illustrates the consistency among three consecutive frames, thus revealing the dewarping effects. Specifically, a smaller difference between adjacent frames yields an overlapping image whose color is closer to gray; that is, the degree of grayness in the map is inversely proportional to the magnitude of the difference between consecutive frames. Observing the images in the second row, the maps corresponding to the proposed method exhibit a higher similarity to a gray image than those from PatchWarp + BM4D, revealing the better dewarping and denoising performance of the proposed method. Simultaneously, we computed the disparity between the current frame and its previous and subsequent frames. These disparities are depicted in red and cyan in Figs. 7(k1-o1) and 7(k2-o2). The degree of darkness in these difference images is directly proportional to the similarity of the pixels between adjacent frames; that is, the magnitude of the discrepancy decreases as the pixel values become closer. Therefore, the images in the third row demonstrate the superiority of our two-level network over the traditional dewarping and denoising methods. The above overlap images and difference maps of adjacent frames effectively demonstrate the impact of motion artifacts on the image data. In the visual assessment of calcium imaging, distortion artifacts can be considered small or negligible when the size of the distortion is less than 0.5 times the size of a neuron, because a smaller undistorted fluorescent area within the neurons can then be selected for calcium transient analysis.
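The overlap maps described above can be reproduced with a short sketch; the channel assignment (previous frame in red, current in green, next in blue) follows the text, while the `grayness` helper is an illustrative way to quantify how close an overlay is to gray:

```python
import numpy as np

def overlap_map(prev_f, cur_f, next_f):
    """Merge three consecutive frames into one RGB overlap image:
    previous -> red, current -> green, next -> blue.  Where the three
    frames agree the pixel is gray; colour fringes expose residual motion."""
    return np.stack([prev_f, cur_f, next_f], axis=-1)

def grayness(rgb):
    """Mean channel spread; 0 means a perfectly gray (motion-free) overlay."""
    return float(np.mean(rgb.max(axis=-1) - rgb.min(axis=-1)))
```

A well-dewarped video should therefore yield overlays with a `grayness` close to zero.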
To capture the distinctions among frames throughout the entire video stack, we computed Tukey box-and-whisker plots for both videos.The value of the image difference is defined as the average of the pixel sum of the absolute differences between two adjacent frames, as shown in Eq. ( 7), and are computed by pairing consecutive frames in a video processed using a certain method and forming the corresponding statistical information shown in Figs.7(A where D i,i+1 (x, y) indicates the image difference between the i th and (i + 1) th frames.I i (x, y) and I i+1 (x, y) indicate the normalized image intensity of i th and (i + 1) th frames, respectively.In the box plots, the median is represented by a red line, whereas the bounds at 25th percentile and 75th percentile of the data are depicted by the two sides of the rectangle, denoted as Q1 and Q3.The size of the rectangle is known as the Inter Quartile Range (IQR).The upper and lower whiskers indicated the borders of maximum and minimum limit, they are the maximum data smaller than Q3 + 1.5 × IQR and the minimum data greater than Q1 − 1.5 × IQR, respectively.Outliers not included in above range are denoted by black dots.Compared to PatchWarp, the proposed dewarping method exhibited narrower box plots, demonstrating high similarity among frames, thus explaining the better dewarping effects.The large image difference resulted from the regenerated noise in the dewarped video.Specifically, after some initial iterations (approximately 12,000), the dewarping network achieves the effect of removing noise, but at this point distortion artifacts are still present.After 30 epochs, the effect of artifact removal is achieved, as shown in Fig. 
But at this point, from the perspective of removing noise in the video, the network model has moved far away from its optimal position and therefore regenerates background noise with a higher intensity than that of the original video. As a result, the differences between adjacent frames become large, because they mainly originate from the regenerated noise. Regarding denoising, the proposed method yielded a lower image difference than BM4D, as shown in the box plots, demonstrating its superiority in noise removal. In the box diagram corresponding to the dewarped and denoised data, some discernible noise remained; this was caused by momentary movement of the receptive field during acquisition of the original videos. These explanations rest on the principle that the image-difference values are predominantly attributable to noise in the image, whereas the box height mainly reflects the degree of motion warping.

Owing to the improved data quality, the calcium traces extracted from the denoised and dewarped data exhibited higher fidelity. To investigate the temporal enhancement achieved by the proposed method, we extracted the calcium traces of all neurons in Video 6 and Video 7 from both the raw noisy data and the enhanced counterparts produced by the different methods, and computed the Pearson correlation coefficient (CORR, Eq. (8)) and the SSIM for adjacent frames of the videos, as listed in Table 4. As the table shows, the frame-to-frame relevance of the videos is significantly enhanced after the dewarping and denoising processes. Although PatchWarp can register heavily noisy images, it discards image features that fall outside the common field of view; consequently, PatchWarp scores well on the correlation parameter but poorly on SSIM. Because the proposed dewarping model regenerates Gaussian noise that differs from the source data, its correlation indicator is low. However, once the subsequent denoising step is performed, all the quantitative metrics achieve the best results. These findings suggest that the spatiotemporal enhancement provided by our method can improve the accuracy of neuronal localization and trace extraction, and facilitate the analysis of neural circuits.
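As a sketch of how an adjacent-frame quality index of this kind can be computed, the snippet below evaluates a simplified, single-window SSIM (no sliding Gaussian window, unlike standard SSIM implementations) over every pair of neighboring frames in a (T, H, W) stack. The function names, data layout, and synthetic test stacks are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def ssim_global(a, b, L=1.0):
    """Simplified single-window SSIM (no sliding Gaussian window); a rough
    stand-in for the SSIM index, not the paper's implementation."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (a.var() + b.var() + C2)
    return float(num / den)

def adjacent_frame_ssim(video):
    """Mean SSIM over all neighboring frame pairs along the time axis."""
    return float(np.mean([ssim_global(a, b)
                          for a, b in zip(video[:-1], video[1:])]))

rng = np.random.default_rng(0)
# A motion- and noise-free stack: the same frame repeated five times
static = np.tile(rng.random((32, 32)), (5, 1, 1))
# The same stack with independent Gaussian noise added to each frame
noisy = static + rng.normal(0.0, 0.5, static.shape)
```

A clean, motion-free stack scores 1 on this index, and independent per-frame noise pulls it down, which is the intuition behind using adjacent-frame SSIM as a quality indicator.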
To evaluate the robustness of the proposed pipeline structure in eliminating motion artifacts of varying strengths, as well as mixed artifacts, we acquired new calcium imaging videos and conducted the corresponding experiments; detailed information on data acquisition and the experimental results can be found in Fig. S3 and Fig. S4. In addition, the superiority of the proposed DL pipeline structure was verified through extended experiments of dewarping + BM4D/DeepCAD on Video 6 and Video 7; the related results are shown in Fig. S5 and Table S3.
To visually demonstrate the dewarping and denoising effects of the proposed method, we generated animated visualizations, which are provided as supplementary files. In Visualization 2 and Visualization 3, two warped and noisy videos (Video 6 and Video 7) are used to display the dewarping and denoising performance of the proposed DL pipeline and of the traditional PatchWarp + BM4D.
Figure 8 depicts the evolution of the loss function and the validation index (SSIM) over the entire training procedure for both the dewarping and denoising networks, illustrating the network fitting process. The validation index was obtained by computing the SSIM of each pair of neighboring images along the time axis of the output patches after each network iteration. The training process of the dewarping network is illustrated in Fig. 8(A); the blue line indicates the position of the optimal model after 30 epochs. The convergence of the dewarping model was also examined, and its loss curve during training on Video 6 is shown in Figs. 8(A) and 8(B). These figures also show the SSIM curve with respect to iterations, which captures the noise-changing process found in our study: the noises were first suppressed and then regenerated. Notably, the SSIM indicator reflecting the noise level reaches its best value at approximately 12,000 iterations (blue dashed line in Fig. 8(B)) and then decreases again. We also trained the dewarping network for approximately 30 epochs and observed that, although its loss continued to decrease, the validation index (SSIM) remained rather stable and even exhibited an upward trajectory. Training for many additional epochs drastically prolongs the training period without significantly enhancing the inferred outcomes; therefore, we selected the model at 30 epochs as the most effective dewarping model. The training process of the denoising network, conducted on Video 5, is illustrated in Fig. 8(C).
Optimal outcomes were attained after 35,000 network iterations. As the number of iterations increased further, the validation index (SSIM) decreased, which suggests that training the denoising network beyond the optimum leads to overfitting.
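The model-selection rule described above, keeping the checkpoint where the validation SSIM peaks rather than where the training loss is lowest, can be sketched as follows; the SSIM curve and helper name below are synthetic placeholders, not the paper's code or measured values.

```python
# Sketch of the checkpoint-selection rule: retain the model at the
# iteration where the validation SSIM peaks, not where the loss is lowest.

def select_best_iteration(ssim_per_iter):
    """Return (index, value) of the highest validation SSIM."""
    best = max(range(len(ssim_per_iter)), key=lambda i: ssim_per_iter[i])
    return best, ssim_per_iter[best]

# A synthetic curve that rises, peaks, then decays as noise is regenerated
val_ssim = [0.20, 0.55, 0.80, 0.90, 0.85, 0.70]
best_iter, best_val = select_best_iteration(val_ssim)
# best_iter == 3: training past this point only risks overfitting
```

In practice this simply means saving a checkpoint at each validation step and restoring the one with the highest SSIM once the curve has clearly turned downward.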
To compare the computational efficiency of the proposed network with that of DeepCAD, several key indicators, including the memory requirements during the training and testing phases and the time costs of network training and prediction, were computed and are listed in Table S2.

Ablation experiment on dewarping network
In this section, we describe ablation experiments conducted on the proposed dewarping network to verify the effectiveness of the newly developed submodules. Based on Video 6, dewarping experiments using the baseline (3D-Unet), baseline + Res2-Conv3D, baseline + multiscale attention, and the proposed method were carried out, and the results for four frames are shown in Fig. 9. All results shown here were denoised by our denoising network using the same optimized model. Compared with the baseline, adding either the Res2-Conv3D module or the multiscale attention module improved the dewarping results. To evaluate the different DL models further, quantitative comparisons were conducted on two indicators, CORR and SSIM, computed over pairs of adjacent frames across the entire video. The computed parameters are listed in Table 5. The data analysis revealed a noticeable improvement in CORR and SSIM when the multiscale attention and Res2-Conv3D modules were incorporated, confirming the effectiveness of the proposed two-level network in handling imaging videos affected by both motion warping and noise.

Fig. 1. Pipeline structure of two-level DL network. A. Flowchart of generating source and target data pairs; B. Illustration of stitching sub-stacks.

Fig. 2. Diagram of dewarping model. (A) Structure of dewarping network; (B) Illustration of Res2-Conv3D: digits at the top of each layer indicate the convolution size, and those below indicate the number of channels; (C) Illustration of multi-scale attention module.

Fig. 4. Experiments on noise removal based on synthetic data.

Fig. 5. Comparative experiments between DeepCAD and the proposed denoising model on four in vivo videos.

Fig. 6. Resolution comparisons between DeepCAD and the proposed denoising model. Scale bar, 10 µm. (A) Sub-images and selected cross lines; (B) Plots of selected cross lines; (C) Bar graph of imaging resolution for the two methods.

Fig. 8. Variation curves of the loss function and validation index (SSIM) during the training process of the two-level network. (A) Variation of the loss and validation index of the dewarping network during the entire training process; (B) Loss and validation index of the dewarping network in the first epoch of training. Owing to undersampling of the full training curve, the initial rising portion of the SSIM is not accurately represented in (A); (C) Variation of the loss and validation index of the denoising network during the entire training process.

Table 4 . Quantitative comparisons of dewarping + denoising performance among different methods.
Images A and B have pixel sizes of m × n. I_A,mn and I_B,mn represent the pixel intensities of images A and B, respectively, at the coordinates (m, n). Ī_A and Ī_B denote the average pixel intensities of images A and B, respectively.
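Following the symbol definitions above, a minimal NumPy implementation of the Pearson correlation (CORR, Eq. (8)) between two images might look like the sketch below; it is written directly from the footnote and is not the authors' code. The test images are illustrative random arrays.

```python
import numpy as np

def corr(A: np.ndarray, B: np.ndarray) -> float:
    """Pearson correlation of two m-by-n images: the sum of
    (I_A,mn - mean(I_A)) * (I_B,mn - mean(I_B)) over all pixels, divided by
    the square root of the product of the two centered sums of squares."""
    dA = A.astype(float) - A.mean()   # I_A,mn minus the mean intensity of A
    dB = B.astype(float) - B.mean()   # I_B,mn minus the mean intensity of B
    return float((dA * dB).sum() / np.sqrt((dA ** 2).sum() * (dB ** 2).sum()))

rng = np.random.default_rng(1)
A = rng.random((16, 16))  # stand-ins for two frames of a calcium video
B = rng.random((16, 16))
```

Identical images give a CORR of 1 and an image compared with its negation gives -1, matching the behavior of `np.corrcoef` on the flattened pixel vectors.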