Elimination of stripe artifacts in light sheet fluorescence microscopy using an attention-based residual neural network

Abstract: Stripe artifacts can deteriorate the quality of light sheet fluorescence microscopy (LSFM) images. Owing to inhomogeneous, high-absorption, or scattering objects located in the excitation light path, stripe artifacts are generated in LSFM images in various directions and types, such as horizontal, anisotropic, or multidirectional anisotropic. These artifacts severely degrade the quality of LSFM images. To address this issue, we proposed a new deep-learning-based approach for the elimination of stripe artifacts. This method utilizes an encoder–decoder structure of UNet integrated with residual blocks and attention modules between successive convolutional layers. Our attention module was implemented in the residual blocks to learn useful features and suppress the residual features. The proposed network was trained and validated by generating three different degradation datasets with different types of stripe artifacts in LSFM images. Our method can effectively remove different stripes in generated and actual LSFM images distorted by stripe artifacts. Besides, quantitative analysis and extensive comparison results demonstrated that our method performs the best compared with classical image-based processing algorithms and other powerful deep-learning-based destriping methods for all three generated datasets. Thus, our method has tremendous application prospects in LSFM, and its use can be easily extended to images reconstructed by other modalities affected by the presence of stripe artifacts.


Introduction
Light sheet fluorescence microscopy (LSFM) is an optical sectioned fluorescence microscopic technique that can be used for fast and high-resolution imaging of biomedical samples with low photobleaching responses [1,2]. Currently, LSFM is extensively used in neurology [3], vascular analysis [4,5], and whole-body imaging [6]. However, stripe artifacts caused by high-absorption or scattering structures, such as impurities or bubbles, in the excitation light path or inside the sample result in image degradation in the application of LSFM [7].
Several techniques have been developed to eliminate stripe artifacts in LSFM images, which can be classified into hardware modification and image-based processing types [8]. In the case of hardware modifications in LSFM systems, Dong et al. [9] improved the image quality and reduced stripes in unidirectional LSFM by developing a vertically scanned LSFM method. In the bidirectional LSFM system, two parallel light sheets illuminate the sample from both sides to eliminate the stripes [10]. Similarly, multidirectional LSFM averages images from different illumination directions and eliminates the stripes [11]. Ren et al. [12] proposed a novel approach called coded light-sheet array microscopy (CLAM), which allows complete parallelized 3D imaging without mechanical scanning, minimizing the illumination artifacts originating in highly scattering tissue. Besides, Ricci et al. [13] adopted acousto-optic deflectors (AODs) in LSFM to reduce the stripe artifacts in the images. Although modified LSFM systems can generate homogeneous fluorescence excitation, additional hardware and alignment requirements considerably increase the complexity of the LSFM system. More importantly, the modified LSFM system may still suffer from stripe artifacts in low-brightness or dense samples [6].
In addition to the aforementioned hardware modification techniques, several image-based processing algorithms have also been suggested to eliminate stripe artifacts [14,15]. Fehrenbach et al. [16] treated the elimination of stripes as a restoration problem and proposed the variational stationary noise remover (VSNR) method that removes stationary noise (such as that manifested in the form of stripes). In addition, Münch et al. [17] combined fast Fourier transforms and wavelet to filter out stripe noise. Liang et al. [18] designed a multidirectional stripe remover (MDSR) method, which applied a fast Fourier transform in the nonsubsampled contourlet transform domain to shrink the stripe components. Pollatou et al. [19] proposed a novel destriping method of stitched biological images based on the location of the stripe artifacts, background modeling, and illumination correction. Although these image-based algorithms can effectively remove stationary stripe artifacts, it is still challenging to remove artifacts that appear in random directions, as shown in Fig. 1.
Deep learning (DL) methods have been reported to significantly outperform conventional image processing algorithms in light microscopy, including microscopy image restoration [20], enhancement [21], super-resolution reconstruction [22,23], deconvolution [24], and microscopy image segmentation [25,26]. Wang et al. [20] proposed a UNet structure model to correct aberrations induced by the refractive index of fluorescence microscopy images. Instead of using a deeper convolutional neural network (CNN), Cai et al. [27] achieved state-of-the-art deblurring performance by utilizing deep residual networks (DRNs) to increase the network depth in dynamic motion deblurring for natural images. Abdallah et al. [28] proposed Res-CR-Net, which is a DRN, for the semantic segmentation of microscopic images. Although these pilot studies have shown promising results for image restoration and enhancement, deep-learning-based destriping of LSFM images has yet to be explored.
In this study, we propose an attention-based residual network (Att-ResNet) comprising residual blocks with a self-attention mechanism to efficiently eliminate stripe artifacts in LSFM images. We adopted the UNet architecture as the backbone of our network [29]. Residual blocks were implemented between successive convolutional layers of the UNet to address the blurring problem. To improve the performance of our network in the destriping task, a self-attention mechanism was introduced in the residual blocks. The self-attention mechanism comprises spatial and channel attention modules to learn sample features against stripe artifacts. Specifically, the spatial attention module aims to extract useful sample features from LSFM images. The channel attention module exploits the features from different channels and suppresses the channels that originate from stripe artifacts according to the channel weights. Herein, spatial and channel attention modules were incorporated into the residual block to increase the efficacy of the proposed network.
To validate the performance of our model, we developed a degradation model to generate various stripe artifacts, including regular horizontal, anisotropic Gaussian nonhorizontal, and multidirectional anisotropic Gaussian nonhorizontal stripes for simulating stripe artifacts in LSFM images. Both qualitative and quantitative comparisons with other state-of-the-art deep-learning-based and image-based algorithmic destriping techniques were conducted using the generated datasets. The results showed that the proposed method can effectively reduce various stripe artifacts. To the best of our knowledge, the proposed method is the first to address various stripe artifact issues in LSFM images with the use of a unified DL framework [8]. Moreover, our model was validated using external LSFM images of the mouse brain vessels [30], mouse colon [31], and mouse carotid plaques [32]. In addition, compared with the classic algorithm-based destriping methods, our method improves the processing speed by 60 times (compared with MDSR) and 180 times (compared with VSNR). It takes approximately 1 s to process each image and thus enables fast destriping of LSFM images.

Methods
The main framework of our model was adopted from the UNet structure, as shown in Fig. 2(a). The encoder-decoder structure with skip connections allows the recovered feature map to integrate low-level features and aggregate features from different scales, so as to better distinguish the sample information from the stripe artifacts in the image and obtain the ideal output. The skip connection integrates the features of the encoder and decoder by concatenating their feature channels. It also retains more dimensional information, so that the network can better learn from both the abstract global features and the detailed local features. At the same time, upsampling in the decoder causes a loss of edge information in the feature map, and the edge features can be retrieved by concatenating the features from the encoder. The model includes four downsampling and upsampling operations. Each layer of the model consists of two 3 × 3 convolution operations with an attention module in the middle. The feature dimensions of each layer, which were 32, 64, 128, and 256, respectively, were empirically determined by weighing the network performance against the network complexity based on the existing related work [20,29,33]. Finally, a 1 × 1 convolution layer transforms the resultant feature map from the last block to the output image.

Residual block
The motivation for adopting DRN is that the input (artifact-distorted LSFM images) and target output (ideal LSFM images) are expected to have similar values and structures. In addition, compared with plain CNN, residual networks contain identity mappings that prevent the gradient exploding or vanishing problem, thus facilitating the training of a deeper network. Therefore, to efficiently transform artifact-distorted images to artifact-reduced images, we used residual blocks to learn the difference between the input and target output. The residual block was composed of two convolution layers, a skip connection, and a rectified linear unit, as shown in Fig. 2(b).
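As a rough illustration of the residual idea described above, the following NumPy sketch implements one plausible block with two 3 × 3 convolutions, an identity skip connection, and a rectified linear unit. The exact ordering of operations in Fig. 2(b), the ReLU placed between the two convolutions, and the helper names `conv3x3` and `residual_block` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution (cross-correlation, as in DL frameworks).

    x: (H, W, C_in) feature map, w: (3, 3, C_in, C_out) kernel -> (H, W, C_out).
    """
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad the two spatial axes
    out = np.zeros((h, wd, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input times the kernel slice at (dy, dx)
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Learn a correction to x and add it back through the skip connection."""
    residual = conv3x3(relu(conv3x3(x, w1)), w2)
    return relu(x + residual)
```

With all-zero kernels the learned correction vanishes and the block passes a non-negative input through unchanged, which is the sense in which the network only has to learn the difference between the artifact-distorted input and the target.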

Convolutional block attention module
To improve the model performance in the destriping task, we introduced an attention module that contained the self-attention modules, as shown in Fig. 2(c). This attention module can increase the modeling capacity of the network by assigning different weights to the features. Specifically, based on the importance of different feature maps, the value is selectively attenuated or amplified.
We denote the attention module by $F$; it reweights its input as $\bar{I} = F(I) \otimes I$, where $I \in \mathbb{R}^{H \times W \times C}$ and $\bar{I} \in \mathbb{R}^{H \times W \times C}$ are the input and output of the attention module, $H$, $W$, and $C$ refer to the height, width, and channel depth of the feature map, respectively, and $\otimes$ is an element-wise multiplication. Existing attention modules can be divided into channel-wise and spatial-wise attention. Channel-wise attention obtains a vector $v \in \mathbb{R}^{1 \times 1 \times C}$, which indicates the importance of each channel. Meanwhile, spatial attention calculates the importance of each pixel, $r \in \mathbb{R}^{H \times W \times 1}$, to determine the most important regions. Herein, we applied a convolutional block attention module (CBAM) [34] in our model (see Fig. 2(c)) and compared it with other attention modules, including global average pooling (GAP), squeeze-and-excitation module (SEM) [35], channel attention module (CAM), and spatial attention module (SAM). GAP is the simplest version of channel-wise attention and calculates the average of each channel of the feature map. SEM [35] is also a channel-wise approach that uses GAP as the squeeze operation, followed by two fully connected layers to model the interrelationship between channels. CAM and SAM are the channel and spatial submodules of the CBAM [34], respectively.
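To make the channel-wise case concrete, the sketch below shows a squeeze-and-excitation style module in NumPy: GAP squeezes each channel to one number, two fully connected layers produce a weight vector of shape (C,), and the input is rescaled channel by channel through broadcasting. The reduction to `C // r` hidden units, the ReLU between the two layers, and the names `squeeze_excite`, `w1`, and `w2` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Channel-wise attention on a feature map.

    x: (H, W, C); w1: (C, C // r); w2: (C // r, C).
    """
    v = x.mean(axis=(0, 1))                    # squeeze: GAP -> (C,)
    v = sigmoid(np.maximum(v @ w1, 0.0) @ w2)  # excite: two FC layers, weights in (0, 1)
    return x * v                               # rescale each channel (broadcast over H, W)

rng = np.random.default_rng(0)
x = rng.random((16, 16, 8))
w1 = rng.normal(size=(8, 2))
w2 = rng.normal(size=(2, 8))
y = squeeze_excite(x, w1, w2)   # same shape as x, each channel attenuated by its weight
```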
CBAM is composed of CAM and SAM. CAM utilizes GAP and global max pooling (GMP): GAP computes the average value of each channel of the given input, while GMP returns the maximum value of each channel of the input feature map, where $I_c \in \mathbb{R}^{H \times W}$, $c \in \{1, \ldots, C\}$, denotes the $c$-th feature map of the input. The outputs of GAP and GMP are forwarded to a shared multi-layer perceptron, and the final output is obtained by combining the two results with an element-wise summation $\oplus$ followed by a sigmoid activation $\delta$.
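Written out from the definitions in this paragraph, the channel attention can be sketched as follows. The index ranges and the operator names are notational assumptions, and the form follows the general CBAM formulation of [34] rather than a formula confirmed by this text.

```latex
\begin{aligned}
\mathrm{GAP}(I_c) &= \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} I_c(i, j), \\
\mathrm{GMP}(I_c) &= \max_{i, j} \, I_c(i, j), \\
\mathrm{CAM}(I)   &= \delta\big( \mathrm{MLP}(\mathrm{GAP}(I)) \oplus \mathrm{MLP}(\mathrm{GMP}(I)) \big)
\end{aligned}
```

Here $\mathrm{GAP}(I)$ and $\mathrm{GMP}(I)$ collect the per-channel values into vectors of length $C$, so $\mathrm{CAM}(I)$ is a vector of $C$ channel weights in $(0, 1)$.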
SAM encodes the importance of pixel-wise dependencies. Two pooling operations are employed to aggregate channel information, yielding $s_{max} \in \mathbb{R}^{H \times W}$ and $s_{avg} \in \mathbb{R}^{H \times W}$, which denote the spatial max pooling and spatial average pooling maps, respectively. The outputs of the pooling operations are then concatenated (here $\oplus$ denotes the concatenating operation) and passed through a convolutional layer and a sigmoid activation to compute the SAM, where Conv refers to a 3 × 3 convolution layer.
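The two submodules can be combined in a short NumPy sketch. It assumes the sequential channel-then-spatial ordering of the original CBAM formulation [34], a ReLU inside the shared perceptron, and a spatial map of the form SAM(I) = δ(Conv(s_max ⊕ s_avg)) with a 3 × 3 kernel as stated in the text; all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CAM: shared MLP on GAP and GMP, summed, then sigmoid -> (C,) weights."""
    gap = x.mean(axis=(0, 1))
    gmp = x.max(axis=(0, 1))
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2  # shared two-layer perceptron
    return sigmoid(mlp(gap) + mlp(gmp))

def spatial_attention(x, kernel):
    """SAM: channel-wise max and mean maps, concatenated, 3x3 conv, sigmoid."""
    s = np.stack([x.max(axis=-1), x.mean(axis=-1)], axis=-1)  # (H, W, 2)
    h, w, _ = s.shape
    sp = np.pad(s, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += sp[dy:dy + h, dx:dx + w, :] @ kernel[dy, dx]  # kernel: (3, 3, 2)
    return sigmoid(out)[..., None]                               # (H, W, 1)

def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention (assumed order)."""
    x = x * channel_attention(x, w1, w2)
    return x * spatial_attention(x, kernel)
```

Both attention maps lie in (0, 1), so each feature value is attenuated according to the importance of its channel and of its pixel location.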

Loss function
We aim to eliminate stripe artifacts while preserving useful sample information from LSFM images. To achieve this, we adopted the mean absolute error (MAE) and content loss [36] to design our loss function. MAE calculates the average absolute difference between the pixels of the output and the ground truth, where $I$ is the output of our model, $\hat{I}$ is the ground truth label, and $H$, $W$, and $C$ refer to the height, width, and channel depth of the image, respectively. The difference at the pixel level is advantageous for the restoration of image intensity. However, MAE often ignores details, such as patterns, edges, and small variations. Meanwhile, the content loss compares the feature maps of the output and ground truth with the use of a pretrained feature extractor, where $\phi$ denotes the feature map of the image. In our model, we used conv4-3 from the VGG-19 network [37] pretrained with ImageNet [38] as the feature extractor. This feature-level loss helped preserve content information and image details. The final loss function combines the MAE and the content loss through a constant weight λ, which was set to 0.05 in our model.
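Under the definitions above, the loss terms can be sketched as follows. The MAE follows directly from its description. The squared-error form and the normalization of the content term, the feature-map dimensions $H'$, $W'$, $C'$, and the choice to apply λ to the content term are assumptions, since the text does not state them.

```latex
\begin{aligned}
L_{\mathrm{MAE}} &= \frac{1}{H W C} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{C}
  \left| I_{i,j,k} - \hat{I}_{i,j,k} \right|, \\
L_{\mathrm{content}} &= \frac{1}{H' W' C'} \left\| \phi(I) - \phi(\hat{I}) \right\|_2^2, \\
L &= L_{\mathrm{MAE}} + \lambda \, L_{\mathrm{content}}, \qquad \lambda = 0.05
\end{aligned}
```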

Evaluation metrics
The MAE, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) were employed to evaluate the output image quality of the different methods. The MAE calculates the average per-pixel differences between the ground truth label and output. Thus, the smaller the MAE, the smaller the error between the ground truth label and the output. PSNR and SSIM are commonly used for image restoration and pay more attention to the fidelity of the image. PSNR was computed as

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\frac{1}{HWC} \sum_{i,j,k} \left( I_{i,j,k} - \hat{I}_{i,j,k} \right)^2},$$

where MAX is the maximum intensity of the image, $I$ is the output of our model, $\hat{I}$ is the ground truth label, and $H$, $W$, and $C$ refer to the height, width, and channel depth of the image, respectively. SSIM was computed as

$$\mathrm{SSIM} = \frac{\left( 2 \mu_I \mu_{\hat{I}} + C_1 \right) \left( 2 \sigma_{I\hat{I}} + C_2 \right)}{\left( \mu_I^2 + \mu_{\hat{I}}^2 + C_1 \right) \left( \sigma_I^2 + \sigma_{\hat{I}}^2 + C_2 \right)},$$

where $\mu_I$ and $\mu_{\hat{I}}$ refer to the average values of $I$ and $\hat{I}$, $\sigma_I$ and $\sigma_{\hat{I}}$ refer to the standard deviations of $I$ and $\hat{I}$, $C_1$ and $C_2$ are two constants, and $\sigma_{I\hat{I}}$ is the covariance of $I$ and $\hat{I}$. The higher the PSNR, the better the image. The SSIM ranges from 0 to 1; the larger the value, the more similar the output is to the original image.
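The three metrics can be sketched in NumPy as below. This is a simplified illustration: the SSIM here uses global image statistics, whereas common library implementations average the index over local windows, and the constants `k1 = 0.01` and `k2 = 0.03` (with C1 = (k1·MAX)² and C2 = (k2·MAX)²) are conventional defaults assumed here rather than values stated in the text.

```python
import numpy as np

def mae(output, target):
    """Mean absolute per-pixel error."""
    return float(np.mean(np.abs(output - target)))

def psnr(output, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (undefined for identical images)."""
    mse = np.mean((output - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(output, target, max_val=1.0, k1=0.01, k2=0.03):
    """SSIM computed once from whole-image means, variances, and covariance."""
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = output.mean(), target.mean()
    var_x, var_y = output.var(), target.var()
    cov = np.mean((output - mu_x) * (target - mu_y))
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

rng = np.random.default_rng(0)
target = rng.random((32, 32))
output = target + 0.1          # uniform intensity offset of 0.1
print(mae(output, target), psnr(output, target), ssim_global(output, target))
```

For a uniform offset of 0.1 on a unit intensity scale the mean squared error is 0.01, so the PSNR is 20 dB, which gives a feel for the scale of the values reported in the tables.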

Implementation
Our model was developed based on the TensorFlow and Keras packages. Att-ResNet was trained for 300 epochs using the Adam optimizer with a batch size of two. The initial learning rate was set to 0.001. The angle parameters of the two classic algorithm-based methods used for comparison were set according to the specific dataset, and the other parameters were set according to the optimal parameters in [16,18]. The training parameters of the other two DL-based methods were consistent with those of our network. In the training phase, only the DL methods were trained on the three training datasets (σ = 0.5), while during the testing phase, DL-based and algorithmic methods were tested on the testing datasets constructed by different degradation models (σ = 0.3, 0.5, 0.7). Our Att-ResNet can automatically remove artifacts, and it took approximately 1 s to process each image.

Data acquisition
All animal studies and procedures were performed according to a protocol approved by the Chinese People's Liberation Army General Hospital Animal Care and Use Committee in accordance with the National Institutes of Health Guideline on the Care and Use of Laboratory Animals. The process of immunolabeling and tissue clearing of mouse brains was the same as in [30]. The immunolabeled and cleared mouse brains were imaged on a commercial light sheet fluorescence microscope (Ultramicroscope II, LaVision Biotec, Bielefeld, Germany) with a 5.0× magnification, a 2× objective lens (Mv PLAPO2VC, Olympus), and a working dipping cap distance of 6 mm, which led to an image size of 2560 × 2160 with a pixel size of 0.65 µm. For the detection of brain vessels, the filters were set as follows: excitation 500/20 nm; emission 535/30 nm. The step size was set to 2 µm for z-stack scanning, with a total scan range of the brain sample of up to 1 mm. The measurements were performed with exposure times of 385 ms per slice, resulting in a total acquisition time of ∼2 min per brain resection sample. The raw LSFM images (saved in TIFF format) were processed using the ImageJ package FIJI (version 1.51, fiji.sc/Fiji, NIH, Bethesda, MD, USA). We collected coronal images of the resected mouse brain samples, including 2601 images of the cortex, 2061 images of the hippocampus, 2221 images of the cerebellum, and 1578 images of the olfactory bulb.

Degradation model
To verify the performance of our network, we developed a degradation model to generate various stripe artifacts to construct artifact-distorted image datasets [16,18]. We carefully designed the stripe artifacts of the datasets to simulate various types of stripe artifacts, including regular horizontal stripes, anisotropic Gaussian horizontal stripes, and anisotropic Gaussian nonhorizontal stripes. The first dataset was generated with regular horizontal stripes, anisotropic Gaussian horizontal stripes, and anisotropic Gaussian nonhorizontal stripes with angles of 45° or 135°. For the second and third datasets, we simulated the stripe artifacts in the horizontal direction, with the direction randomly deflected based on the configuration of the commercial LSFM system. Thus, we generated the second dataset, including anisotropic Gaussian stripe artifacts with one random angle in [-11°, 11°]. The last dataset was generated using anisotropic Gaussian stripe artifacts with two angles randomly distributed in [-11°, 11°]. The bound of the stripe angle in the second and third datasets was chosen as per the configuration of the commercial LSFM system. All the images with stripe artifacts were randomly generated to form a degradation pool.
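The exact stripe generator is not specified in the text, so the following NumPy sketch is only a hypothetical stand-in for it. It builds a stripe pattern from elongated (anisotropic) Gaussian blobs rotated by a chosen angle and adds the pattern to a clean image scaled by the intensity coefficient σ. The additive mixing, the blob widths, the number of stripes, and the names `gaussian_stripes` and `degrade` are all assumptions.

```python
import numpy as np

def gaussian_stripes(shape, angle_deg, n_stripes=20, sigma_long=200.0,
                     sigma_short=2.0, rng=None):
    """Sum of elongated Gaussian blobs oriented at angle_deg, scaled to peak 1."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    theta = np.deg2rad(angle_deg)
    stripes = np.zeros(shape)
    for _ in range(n_stripes):
        cy, cx = rng.uniform(0, h), rng.uniform(0, w)   # random stripe center
        # Coordinates along and across the stripe direction
        along = (xx - cx) * np.cos(theta) + (yy - cy) * np.sin(theta)
        across = -(xx - cx) * np.sin(theta) + (yy - cy) * np.cos(theta)
        amp = rng.uniform(0.5, 1.0)
        stripes += amp * np.exp(-0.5 * ((along / sigma_long) ** 2
                                        + (across / sigma_short) ** 2))
    return stripes / stripes.max()

def degrade(image, stripes, sigma):
    """Add the stripe pattern with intensity coefficient sigma (assumed additive)."""
    return np.clip(image + sigma * stripes, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((128, 128))                 # placeholder for an enhanced LSFM tile
angle = rng.uniform(-11.0, 11.0)               # one random angle, as in the second dataset
pattern = gaussian_stripes(clean.shape, angle, rng=rng)
distorted = degrade(clean, pattern, sigma=0.5)
```

A two-angle pattern in the spirit of the third dataset could be obtained by summing two calls with independently drawn angles before mixing.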

Dataset construction
The workflow of dataset construction is shown in Fig. 3. To facilitate the subsequent operations, the raw image was first cropped to a size of 2048 × 2048, followed by image enhancement using a histogram equalization algorithm. Subsequently, artifact-distorted images were generated by randomly adding one of the artifacts from the degradation pool to the enhanced image, where σ is a coefficient representing the intensity of the added artifact image. The artifact-distorted 2048 × 2048 image was divided into 16 images with a size of 512 × 512 pixels as the inputs. The corresponding ground truth labels were collected to initiate the training process.
Fig. 3. Workflow of dataset construction. We applied a series of preprocessing steps to the acquired LSFM image and added a degradation model to obtain the artifact-distorted images and the corresponding ground truth label, which were used to train our network.
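The division of a 2048 × 2048 image into 16 non-overlapping 512 × 512 inputs can be sketched with a NumPy reshape. The row-major tile ordering and the function name `split_into_tiles` are assumptions; the text does not say how the tiles were ordered or whether they overlapped.

```python
import numpy as np

def split_into_tiles(image, tile=512):
    """Split a 2D image into non-overlapping tile x tile patches, row-major order."""
    h, w = image.shape
    if h % tile or w % tile:
        raise ValueError("image size must be a multiple of the tile size")
    return (image.reshape(h // tile, tile, w // tile, tile)
                 .swapaxes(1, 2)
                 .reshape(-1, tile, tile))

image = np.zeros((2048, 2048), dtype=np.float32)   # placeholder for a cropped slice
tiles = split_into_tiles(image)                    # shape (16, 512, 512)
```

Applying the same function to the clean enhanced image yields the matching ground truth tile for every distorted input.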
For the training dataset, we extracted 15 slices, each cropped to 2048 × 2048, from each of the collected image sets of the cortex, hippocampus, cerebellum, and olfactory bulb to ensure the diversity of the dataset. After image division, we obtained 960 images for each degradation model (σ = 0.5), including 864 and 96 images as the training and validation datasets, respectively. For the testing dataset, we extracted 60 slices from the collected images of the cortex, hippocampus, and cerebellum, and three degraded image groups with different σ (0.3, 0.5, 0.7), comprising 25920 subdivided images in total, were generated as testing datasets.
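The dataset sizes quoted above can be checked with a few lines of arithmetic. The training count follows directly from the text. For the testing count, the reading that 60 slices were taken from each of the three regions and that every slice was degraded by each of the three degradation models at each of the three σ values is an assumption, chosen because it reproduces the reported total.

```python
TILES_PER_SLICE = (2048 // 512) ** 2            # 16 tiles of 512 x 512 per slice

# Training and validation: 15 slices from each of 4 brain regions
train_val_per_model = 15 * 4 * TILES_PER_SLICE  # 960 images per degradation model
train, val = 864, 96
train_fraction = train / train_val_per_model    # 0.9, i.e. a 90/10 split

# Testing (assumed reading): 60 slices x 3 regions x 3 models x 3 sigma values
test_total = 60 * 3 * TILES_PER_SLICE * 3 * 3   # 25920 subdivided images

print(train_val_per_model, train_fraction, test_total)
```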

Ablation study
To study the effectiveness of our network, an ablation study was conducted to investigate the effects of 1) the position of the attention module, 2) different attention modules, and 3) content loss. All ablation experiments were performed on the second testing dataset (σ = 0.5) with anisotropic Gaussian stripe artifacts with a random angle in [-11°, 11°].

Effect of the position of the attention module
We studied the model performance when the attention modules were added to different layers, where @ ith refers to the attention modules being located at the ith layer in a UNet architecture. ResNet refers to the standard UNet network with the addition of residual blocks between successive convolutional layers. The attention modules used in this experiment were CBAMs. The experimental results are summarized in Table 1. In this experiment, we found that applying an attention module to any single layer could improve the model performance; this indicates the effectiveness of the attention module. In addition, the improvement was similar regardless of which single layer received the attention module. By contrast, applying the attention module to every layer (@ All) achieved the best performance.

Effects of different attention modules
We investigated the influence of the choice of attention module on a network based on ResNet. In this experiment, five networks were trained with different attention modules (GAP, SEM, CAM, SAM, and CBAM), as mentioned in Section 2.2. The experimental results are listed in Table 2. We observed that the networks with GAP, SEM, and CAM achieved significant gains in stripe artifact reduction compared with artifact-distorted LSFM images. The performance of SAM is better than that of GAP, SEM, and CAM, indicating that the allocation of the spatial weight is more effective. CBAM, which combines SAM and CAM, yielded the best performance among the attention modules. It is believed that CBAM can identify the important channels of the feature maps and learn the information of different regions in the image to distinguish the sample information from the stripe artifacts.

Effects of content loss
In this comparative experiment, we studied the influences of the content loss on the model performance following the application of the MAE as the loss function. The comparative results are presented in Table 3 and Fig. 4. The network with content loss achieved better performance than that without content loss in terms of MAE, PSNR, and SSIM. This is partly because the use of content loss can preserve the image details and content information. The benefit of the content loss can be clearly observed in Fig. 4. It is suggested that the use of content loss can reduce stripe artifacts in the input images while maintaining useful content information.

Comparison with existing destriping methods
We compared our proposed network with other existing classic algorithm-based and deep-learning-based destriping methods on different testing datasets of artifact-distorted LSFM images.
The following methods were chosen:
• VSNR: We chose the variational algorithm developed by Fehrenbach et al. [16], as it can be applied to the removal of stationary noise such as stripes from microscopy images.
• MDSR: We selected the algorithm reported by Liang et al. [18] because they designed a multidirectional stripe removal method following the application of the fast Fourier transform in the nonsubsampled contourlet transform domain to shrink the stripe components.
• UNet-based networks (UNet): The model proposed by Wang et al. [20] was chosen, as the UNet models are extensively used in designing networks for fluorescence microscopy image restoration.
• Self-attention mechanism-based networks (AttNet): The self-attention mechanism-based network (Ko et al. [39]) was selected because it achieved a noticeable performance in various artifact reduction tasks compared with other models.
The experimental results are summarized in Table 4 and Fig. 5. The performance of our network was better compared with those of the existing methods in both metrics and visual inspection.

Testing dataset at a known angle
We first tested the effect of each method in handling stripe artifacts of a known angle. In this experiment, our Att-ResNet efficiently eliminates different stripe artifacts and achieves gains in the range of 0.33-4.23 in PSNR and 0.01-0.09 in SSIM (σ = 0.5). For the two classic algorithm-based methods, we set the angle parameters to 0°, 45°, and 135° to deal with the different testing images, and they are effective for destriping tasks. The PSNR of MDSR is 0.81 (σ = 0.3) and 0.29 (σ = 0.7) higher than that of our Att-ResNet. However, some visible artifacts are not eliminated from the images (see the red arrows in Fig. 5), and processing each image requires approximately 180 s for VSNR and 60 s for MDSR. In addition, different parameters need to be adjusted when the two methods deal with different artifacts, thereby increasing the complexity of processing. By contrast, our method not only effectively removes different artifacts but also exhibits fast performance, as only 1 s is required to process one image. The performance of UNet and AttNet is comparable with that of our method in terms of PSNR and SSIM. As shown in Fig. 6, the stripe artifacts cause the values of the original image to fluctuate violently, which seriously destroys the original structure. Our Att-ResNet can remove the artifacts and recover the distorted image to have a distribution similar to that of the ground truth label. MDSR and VSNR can restore images with remnant artifacts, especially those distributed in the background. The recovery ability of AttNet and UNet was slightly inferior to that of our method.

Testing dataset at one random angle
In this experiment, we set the angle parameters to 0° for MDSR and VSNR to deal with the horizontal stripe artifacts with deflection that ranged from -11° to 11°, subject to the configuration of the commercial LSFM system. As shown in Fig. 5, the two classic algorithm-based methods cannot eliminate the stripe artifacts in the images because the mismatch between the angle parameters and the actual stripe angles resulted in a significant decline in their performance. Our Att-ResNet achieves gains in the range of 5.07-10.02 in PSNR and 0.17-0.36 in SSIM compared with the VSNR and MDSR methods. Our method also achieves the best performance, with gains in the ranges of 1.42-3.65 and 0.02-0.10 in terms of the PSNR and SSIM, respectively, compared with the UNet and AttNet methods. Besides, as Table 4 shows, the performance of UNet and AttNet decreases with increasing σ (by 1.99-2.47 in PSNR and 0.09 in SSIM), while the performance of our Att-ResNet remains stable, which indicates the generalization performance of our method.
We studied the influence of angle parameters on the two classic algorithm-based methods, VSNR and MDSR, by adding a disturbance δ (δ ∈ [0°, 10°]) to the angle parameters and applying the two methods to the testing dataset at a known angle (σ = 0.5). The evaluation metrics of the results are shown in Fig. 7.
As shown in Fig. 7, our Att-ResNet is not affected by the angle parameters (PSNR = 30.55 and SSIM = 0.90). When the angle parameters are accurate (δ = 0), both VSNR and MDSR obtain satisfactory results (see Table 4). However, the performance of VSNR and MDSR decreases greatly when the disturbance δ is added to the angle parameter. When δ is in the range of [0°, 1°], their performance gradually decreases, and when it is in the range of [1°, 10°], the two methods fail to remove the stripe artifacts. Fig. 7 shows that both classic algorithm-based methods are highly sensitive to the angle parameter, with VSNR being the more sensitive of the two. When the disturbance δ exceeds 1°, these two methods cannot effectively remove the stripe artifacts in the image, indicating their high dependence on the accuracy of the angle parameters.

Testing dataset at two random angles
Finally, we applied the destriping methods to the testing dataset at two random angles, as shown in the last row of Fig. 5. The angle parameters of MDSR and VSNR were set as for the previous testing dataset at one random angle. The two classic algorithm-based methods are still less effective in eliminating stripe artifacts. By contrast, the DL-based methods are successful in distinguishing stripe artifacts from samples and in eliminating artifacts at different angles in the images. However, some point noise appeared in the results of AttNet, which affected the overall quality of the image. In these testing experiments, stripe artifacts severely affected the image quality (PSNR and SSIM of the input were the lowest among the three testing datasets). Nonetheless, our model still effectively eliminated the artifacts, restored the image, and resulted in the best performance in both metrics and visual inspection.
We analyzed the line profile of each method on the testing dataset at two random angles (σ = 0.5). As shown in Fig. 8, the two classic algorithm-based methods are inefficient in the destriping task, whereas the DL-based methods yielded line profiles closer to the ground truth profiles. The line profile of our method is very close to the ground truth, especially in the background, demonstrating that our method can restore the images in the presence of complex stripe artifacts.
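A line-profile comparison of this kind can be sketched as follows. The choice of a vertical line (one image column, which crosses near-horizontal stripes), the min-max normalization, and the mean absolute deviation used as a summary number are assumptions for illustration; the text does not describe how the profiles in Fig. 8 were extracted.

```python
import numpy as np

def line_profile(image, col):
    """Min-max normalized intensity profile along one image column."""
    p = image[:, col].astype(float)
    span = p.max() - p.min()
    return (p - p.min()) / span if span > 0 else np.zeros_like(p)

def profile_error(profile, reference):
    """Mean absolute deviation between a profile and the reference profile."""
    return float(np.mean(np.abs(profile - reference)))

# Toy data: a smooth vertical ramp as "ground truth", plus dark horizontal stripes
gt = np.tile(np.linspace(0.0, 1.0, 64)[:, None], (1, 64))
striped = gt.copy()
striped[10::16, :] *= 0.2

reference = line_profile(gt, col=32)
print(profile_error(line_profile(striped, col=32), reference))
```

A destriped result whose profile lies closer to the reference yields a smaller deviation, which mirrors the visual comparison of the curves.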

External testing on LSFM images with intrinsic stripe artifacts
In this section, we evaluate our Att-ResNet on LSFM images with intrinsic stripe artifacts in various biomedical applications. We validated our method using LSFM images with intrinsic stripe artifacts as the input, as shown in the first column of Fig. 9. As the stripe artifacts in the images are horizontal, we set the angle parameters of MDSR and VSNR to 0° and used the models trained on the dataset at a known angle for validation. As shown in Fig. 9, all methods can effectively eliminate artifacts. For MDSR and VSNR, because the angle parameters are accurate, the artifacts can be effectively removed. However, some visible artifacts are not eliminated from the images (see the red arrows in the second and third columns of Fig. 9). UNet leaves some residual artifacts and may also blur the details of the images (see the blue arrow in Fig. 9). AttNet can effectively remove stripe artifacts, but it introduces some additional ghosting artifacts into the image, which degrades the image quality (see the green arrow in Fig. 9). Our method can eliminate the stripe artifacts while preserving the image information.
For LSFM images with intrinsic stripe artifacts, due to the lack of ground truth for reference, we evaluated the performance of different methods by analyzing the line profile of each method of the representative LSFM image. In Fig. 10, the stripe artifacts cause the values of the original image to fluctuate violently, especially in the background (see the black curve). After destriping, the number of the spikes on the line profile will decrease, and the normalized image values of the background part will decrease and tend to 0. From Fig. 9(b)-9(f), all methods can

Discussion
The destriping of LSFM images is a challenging task that many recently developed methods have sought to address. Both hardware improvements and classic algorithms have been proposed to eliminate stripe artifacts. However, hardware improvements still suffer from stripe artifacts in low-brightness or dense samples and may affect the imaging speed and field of view. For the two classic algorithm-based methods, VSNR and MDSR, the performance is satisfactory when dealing with artifacts at a known angle (see Table 4). However, the two methods are sensitive to the angle of the artifacts. In practice, it is difficult to obtain all the angles of the artifacts from the LSFM images because the directions of the stripe artifacts are random and varied; thus, calculating all angles is time-consuming and complex. Moreover, the stripe artifacts are integrated into the image. Therefore, artifacts cannot be effectively separated from the image information and measured accurately. When calculating the angle, a deviation of one pixel may cause an error of more than 1°. As Fig. 7 shows, these two methods cannot effectively remove the stripe artifacts under this error. Although a more accurate result can be obtained by calculating multiple approximate angles, this can hardly be realized in practical applications considering the time cost of the two methods (>60 s per image), which require tremendous computational effort.
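The remark that a one-pixel deviation can cause an angle error of more than 1° can be illustrated with a short calculation. It assumes the stripe angle is estimated from two points separated by a given baseline along the stripe, so that a one-pixel offset perpendicular to the stripe changes the angle by atan(1 / baseline); the function name and the example baselines are hypothetical.

```python
import math

def angle_error_deg(baseline_px, offset_px=1.0):
    """Angle change caused by a perpendicular offset over a given baseline."""
    return math.degrees(math.atan2(offset_px, baseline_px))

# Baseline below which a one-pixel offset exceeds 1 degree
critical_baseline = 1.0 / math.tan(math.radians(1.0))   # about 57.3 px

for baseline in (20, 57, 58, 200):
    print(baseline, round(angle_error_deg(baseline), 2))
```

Under this assumption, any stripe segment measured over fewer than about 57 pixels is already subject to an error above the 1° level at which both classic methods were found to fail.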
In this study, we proposed a DL-based method to effectively eliminate stripe artifacts from LSFM images. Unlike the classic destriping methods, our method relies on the powerful prediction capability of deep convolutional networks to handle LSFM images distorted by different artifacts. In Att-ResNet, we applied residual blocks with an attention mechanism to learn the features of the sample and of the different stripe artifacts in the image. We used residual blocks to increase the modeling power [40,41]. A residual block learns only the difference between the input and the target output and thus transforms the artifact-distorted image into an artifact-reduced image efficiently. In addition, with CBAM, the network learned the importance of different regions and channels of the image, and the values of the feature maps were selectively attenuated or amplified according to their importance for more effective training. Moreover, CBAM increases the modeling capacity of the network to effectively eliminate artifacts and restore the image. We also used content loss to improve the performance of our model. Content loss is widely used in medical image processing, including artifact elimination in CT images [39,42], super-resolution reconstruction of MRI images [43], and generation of synthetic digital mammography images [44]. In these works, a VGG network pre-trained on ImageNet is commonly used as the feature extractor to calculate the content loss. We carried out experiments with content loss calculated by feature extractors pre-trained on ImageNet and on LSFM images; the results are given in Table S1 and Table S2 in Supplement 1. These tables show that content loss improves the performance of our model and that the extractor pre-trained on ImageNet gives better results, because the scale and diversity of ImageNet help the model extract image features more effectively. Thus, in our model, we used conv4-3 from the VGG-19 network [37] pre-trained on ImageNet [38] as the feature extractor. This feature-level loss helped preserve content information and image details.
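The idea of a feature-level loss can be sketched in a few lines. The example below is not the implementation used in this work: the pre-trained VGG-19 extractor is replaced by a hypothetical stand-in, `gradient_features`, which returns rectified finite differences, and all function names and toy images are assumptions made for illustration. It only shows that a content loss is a mean squared error computed between feature maps rather than between pixels.

```python
from typing import Callable, List

Image = List[List[float]]


def gradient_features(img: Image) -> List[float]:
    """Hypothetical stand-in for a pre-trained extractor such as VGG-19
    conv4-3: rectified horizontal and vertical finite differences."""
    h, w = len(img), len(img[0])
    feats: List[float] = []
    for r in range(h):
        for c in range(w):
            dx = img[r][c + 1] - img[r][c] if c + 1 < w else 0.0
            dy = img[r + 1][c] - img[r][c] if r + 1 < h else 0.0
            feats.extend((max(dx, 0.0), max(dy, 0.0)))
    return feats


def pixel_loss(pred: Image, target: Image) -> float:
    """Mean squared error computed directly on pixel values."""
    diffs = [(p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return sum(diffs) / len(diffs)


def content_loss(pred: Image, target: Image,
                 extractor: Callable[[Image], List[float]] = gradient_features) -> float:
    """Mean squared error computed between extracted feature vectors."""
    fp, ft = extractor(pred), extractor(target)
    return sum((a - b) ** 2 for a, b in zip(fp, ft)) / len(fp)


# Toy 4x4 images: a sharp edge, a brightness-shifted copy, and a blurred copy.
target = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
shifted = [[v + 0.1 for v in row] for row in target]
blurred = [[0.0, 0.25, 0.75, 1.0] for _ in range(4)]

for name, img in (("shifted", shifted), ("blurred", blurred)):
    print(name, "pixel:", round(pixel_loss(img, target), 4),
          "content:", round(content_loss(img, target), 4))
```

With this stand-in extractor, a uniform brightness shift leaves the content loss essentially at zero while a blurred edge is penalized, which mirrors the role of the feature-level term in preserving structure and detail. A learned extractor responds to far richer features than the gradients used here.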
Our method operates on the acquired images and requires no modification of the LSFM hardware. In addition, Att-ResNet provides simple, fast image restoration and is suitable for images of different sizes. To verify the destriping ability of Att-ResNet, we designed three degradation models that add stripe artifacts to actual images, yielding pairs of artifact-distorted and ground truth LSFM images. Att-ResNet successfully removed the stripe artifacts introduced by the degradation models. Furthermore, Att-ResNet was validated on the elimination of intrinsic stripe artifacts from actual LSFM images. Its performance was also analyzed qualitatively and quantitatively and verified to be better than that of the existing methods. The proposed method can be extended to eliminate different types of stripe artifacts in LSFM images of ex vivo samples. Moreover, the processing speed can be further improved by optimizing our model; thus, our method has the potential to process in vivo imaging of neural activities and heart beating and can be applied to LSFM images of dynamic living samples. It can also be applied to the destriping of other microscope images, such as scanning electron microscopy images, selective plane illumination microscope images, and FIB-nanotomography images.
Despite these advantages, our network remains highly dependent on data. The original image is regarded as the ground truth label in the constructed datasets, and artifacts and noise inevitably exist in the original image. The network can eliminate the artifacts and noise introduced by the degradation model. However, because artifacts are present in the ground truth, some faint artifacts in the original image cannot be eliminated completely. For images with background noise, the noise is treated as valid information while artifacts are removed, and it remains as speckle noise. At the same time, some details in the image may be removed because they are mistaken for noise. This problem can be avoided by applying a degradation model to LSFM images of well-cleared samples to construct image pairs or by generating simulated LSFM images, so that the ground truth labels are free of artifacts. Moreover, a more complicated degradation model can be created to simulate complex noise, artifacts, and blur patterns in LSFM images. Additionally, a feature-extraction network can be designed and pre-trained on available datasets to learn the sample information in LSFM images of well-cleared samples and to distinguish it from unwanted information such as artifacts. The loss of the model can then be upgraded by using this network as a feature extractor to improve the performance of the main model, so that noise is eliminated while image details are preserved as much as possible in practical applications.
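Constructing image pairs from clean images can be sketched as follows. This is a deliberately simple, hypothetical degradation (multiplicative dimming of randomly chosen rows) and not one of the three degradation models used in this work. The function names, the attenuation range, and the toy image are all assumptions.

```python
import random
from typing import List, Tuple

Image = List[List[float]]


def add_horizontal_stripes(clean: Image, n_stripes: int = 3,
                           max_attenuation: float = 0.6, seed: int = 0) -> Image:
    """Return a copy of `clean` in which randomly chosen rows are dimmed,
    imitating shadow stripes cast along the illumination direction."""
    rng = random.Random(seed)
    height = len(clean)
    rows = rng.sample(range(height), k=min(n_stripes, height))
    degraded = [row[:] for row in clean]  # copy so the label stays untouched
    for r in rows:
        gain = 1.0 - rng.uniform(0.1, max_attenuation)  # assumed range
        degraded[r] = [v * gain for v in degraded[r]]
    return degraded


def make_pair(clean: Image, **kwargs) -> Tuple[Image, Image]:
    """Build one (artifact-distorted input, ground truth label) pair."""
    return add_horizontal_stripes(clean, **kwargs), clean


# Toy usage: a uniform 6x8 "clean" image stands in for a well-cleared sample.
clean_image = [[1.0] * 8 for _ in range(6)]
striped, label = make_pair(clean_image, n_stripes=2, seed=1)
print("dimmed rows:", [i for i, row in enumerate(striped) if row[0] < 1.0])
```

Because the label is the unmodified clean image, any artifact already present in it would be learned as valid content, which is the limitation described above. Starting from well-cleared samples or simulated images keeps the label free of such residue.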

Conclusion
We proposed a DL method that incorporates a self-attention mechanism to eliminate stripe artifacts in LSFM images. Different degradation models were designed to simulate stripe artifacts in real LSFM images. Our method was validated by comparing it with two classic methods and two DL methods on the simulated data. The results showed that our method can effectively eliminate stripe artifacts in different datasets, overcoming the angle sensitivity of classic methods to different stripe artifacts while processing images quickly. Moreover, our method was validated on the destriping of LSFM images with intrinsic stripe artifacts. Future studies will include the reduction of noise in different microscopic images and improvement of the temporal and spatial resolutions of the images.