Self-attention module in a multi-scale improved U-net (SAM-MIU-net) motivating high-performance polarization scattering imaging

Abstract: Polarization imaging has outstanding advantages in the field of scattering imaging, yet it still encounters great challenges in heavily scattering media, even with the help of deep learning. In this paper, we propose a self-attention module (SAM) in a multi-scale improved U-net (SAM-MIU-net) for polarization scattering imaging, which can effectively extract a new combination of multidimensional information from targets. The proposed SAM-MIU-net focuses on the stable features carried by the polarization characteristics of the target, thereby enhancing the expression of the available features and making it easier to extract the polarization features that help recover target details. The effectiveness of the SAM has been verified in a series of experiments. Based on the proposed SAM-MIU-net, we have investigated the generalization ability with respect to the targets' structures and materials and to the imaging distance between the targets and the ground glass. Experimental results demonstrate that the proposed SAM-MIU-net can achieve high-precision reconstruction of target information under incoherent light conditions for polarization scattering imaging.


Introduction
High-quality imaging through scattering media is of great significance for atmospheric remote sensing [1][2][3], underwater imaging [4][5][6], biological tissue imaging [7][8], and other applications [9][10][11]. So far, various methods have been proposed to improve imaging quality [12][13][14][15][16][17][18][19], among which polarization-based methods are some of the most effective and have achieved notable results. J. S. Tyo et al. proposed the polarization difference (PD) method for improving imaging quality through scattering media [20]. Y. Y. Schechner incorporated polarization effects into the atmospheric defogging model to improve the defogging effect [21]. Liu et al. proposed an active polarization imaging method based on wavelength selection [22], which takes advantage of the wavelength dependence of scattered light in turbid underwater environments. Liang et al. proposed that the estimated angle of polarization (AoP) can be used in dense fog environments [23], which significantly improves the clarity of blurred images. Li et al. presented a method based on Stokes vector images to recover objects in turbid water [24]. Guo et al. obtained the Mueller matrix (MM) of a scattering medium based on the Monte Carlo (MC) algorithm [25] and proposed a polarization inversion method to recover targets in layered dispersion systems [26][27].
In recent years, deep learning-based methods have developed rapidly and are considered able to surpass traditional methods and improve the performance of polarization scattering imaging. Li et al. established a dataset consisting of Stokes vector images and proposed a polarized image denoising network (PDRDN) based on the residual dense network [28]. Hu et al. presented a deep learning-based method for polarized underwater image defogging [29]; on this basis, they mathematically converted the two polarization-related parameters into a single parameter, enabling the network to learn the polarization modulation parameters and obtain a clear de-scattered image [30]. Li et al. considered the changes in polarization information when light interacts with targets and propagates in a turbid system; they combined polarization theory with deep learning and designed an end-to-end network for target reconstruction in a scattering environment [31].
Polarization is a valuable characteristic of light, but it cannot be observed directly. Optical elements are needed to observe it, and it is detected only indirectly through intensity measurements. Moreover, the energy loss caused by the optics leads to a serious decrease in the signal-to-noise ratio of the image. Therefore, when the polarization images captured by the detector are fed directly into the model, the network may not extract useful information sufficiently and accurately, and it cannot recover the target efficiently. Here, we therefore try to provide enough information for the reconstruction network by inputting multidimensional information about the target. On this basis, we introduce a self-attention module (SAM) into the multi-scale improved U-net (MIU-net) to form the SAM-MIU-net. The network focuses on the target information carried by the polarization characteristics and enhances the expression of features by assigning weights to the feature matrix itself, which reduces redundant output and improves the robustness of the network. The experimental results show that the proposed SAM-MIU-net significantly improves the reconstruction results, and the tests on complex structures, different materials, and different imaging distances also show that it is effective and generalizes well.

Polarization information
The Stokes vector is a common representation of polarization information. It can describe both polarized and unpolarized light with four components, S = (I, Q, U, V)^T, all of which are scalars that express light intensity information without phase [32]. The Stokes vector can be calculated as:

$$
S = \begin{pmatrix} I \\ Q \\ U \\ V \end{pmatrix} = \begin{pmatrix} I_{0^\circ} + I_{90^\circ} \\ I_{0^\circ} - I_{90^\circ} \\ I_{45^\circ} - I_{135^\circ} \\ I_{R} - I_{L} \end{pmatrix}, \tag{1}
$$

where I is the total light intensity, which provides global information about the targets; Q is the difference between the horizontal (0°) and vertical (90°) components, that is, the difference between the components in two orthogonal directions, so Q images suppress backscatter to a certain extent; U is the difference between the 45° and 135° components; and V is the difference between the right-handed and left-handed circular components. In addition, further polarization information such as the degree of polarization (DoP) and the AoP can be obtained from the Stokes vector. The degree of linear polarization (DoLP) represents the ratio of the linearly polarized component to the total light intensity:

$$
\mathrm{DoLP} = \frac{\sqrt{Q^{2} + U^{2}}}{I}. \tag{2}
$$

DoLP images provide more detailed information about the target. Previous research [33] has fused intensity images with DoLP images, which supply complementary information about the target, to improve image resolution. Here, we consider that a single polarization image cannot provide sufficient target characteristics for the network, so we regard the light intensity, Q, and DoLP images as three-dimensional data and input them into the network. By taking advantage of the powerful data mining and learning capabilities of neural networks, the three-dimensional information can be fully extracted and fused for target reconstruction.
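A minimal NumPy sketch of how the I, Q, U, and DoLP images described above could be computed from four intensity images taken behind linear polarizers at 0°, 45°, 90°, and 135°. The function names, the small constant `eps` that guards against division by zero, and the channel order of the stacked input are illustrative assumptions, not the authors' code. V is omitted because it cannot be obtained from linear-polarizer measurements alone.

```python
import numpy as np

def stokes_from_intensities(i0, i45, i90, i135, eps=1e-8):
    """Return (I, Q, U, DoLP) from four linear-polarizer intensity images."""
    i0, i45, i90, i135 = (np.asarray(a, dtype=np.float64)
                          for a in (i0, i45, i90, i135))
    total = i0 + i90            # I: total intensity
    q = i0 - i90                # Q: horizontal minus vertical
    u = i45 - i135              # U: 45 degrees minus 135 degrees
    # DoLP: linearly polarized part divided by the total intensity.
    dolp = np.sqrt(q ** 2 + u ** 2) / np.maximum(total, eps)
    return total, q, u, dolp

def stack_network_input(total, q, dolp):
    """Stack I, Q, and DoLP as a (3, H, W) array (assumed channel order)."""
    return np.stack([total, q, dolp], axis=0)
```

For fully horizontally polarized light (all energy in the 0° image), this gives Q = I and DoLP = 1; for unpolarized light, Q = U = 0 and DoLP = 0.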

Measurement system
We constructed the experimental setup shown in Fig. 1 to obtain the polarization dataset. A linear polarizer is placed in front of the LED light source, so that the captured images contain more pronounced polarization information, which aids subsequent image recovery [34]. The light with S = (1, 1, 0, 0)^T is transmitted through the ground glass, illuminates the target, and is reflected from it. The reflected light, which carries the target information, is then transmitted through the ground glass and captured by a commercial division-of-focal-plane (DoFP) polarization camera (LUCID, PHX055S-PC) with 2048 × 2448 pixels. The pixel array of the DoFP camera is covered with a micro-polarizer array with four different polarization orientations of 0°, 45°, 90°, and 135°. The images needed for the polarization dataset can then be calculated easily from these four orientation images. In our experiments, the target is placed behind the 4 mm ground glass, and the distance between the target and the ground glass is defined as d.
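As an illustration of how the four orientation images might be separated from a raw DoFP frame, the sketch below slices a mosaic into four half-resolution sub-images. The 2×2 layout used here (90° and 45° in the first row, 135° and 0° in the second) and all names are assumptions for illustration only; the actual layout should be taken from the camera documentation.

```python
import numpy as np

# Assumed super-pixel layout: (row offset, column offset) -> polarizer angle.
ASSUMED_LAYOUT = {(0, 0): 90, (0, 1): 45, (1, 0): 135, (1, 1): 0}

def split_dofp_mosaic(raw, layout=ASSUMED_LAYOUT):
    """Split a 2-D DoFP mosaic into a dict {angle: half-resolution image}."""
    raw = np.asarray(raw)
    if raw.ndim != 2 or raw.shape[0] % 2 or raw.shape[1] % 2:
        raise ValueError("expected a 2-D frame with even height and width")
    # Every second pixel, starting at each offset, shares one orientation.
    return {angle: raw[r::2, c::2] for (r, c), angle in layout.items()}
```

The resulting sub-images can then be combined into I, Q, and DoLP. Simple slicing halves the resolution; interpolation-based demosaicing is a common alternative.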

Network design
Inspired by the network architecture proposed by Zhen et al. [35], we have improved the network structure used in the previous work [31] by applying down-sampling of different sizes in the middle part of the U-Net to extract target features at different levels. The newly proposed network consists of two stages: the first is a multi-scale feature extraction network, and the second is a fusion and reconstruction network that combines the features of different levels. We use the SAM to optimize the feature exchange between the two stages, so that the polarization features most conducive to target reconstruction are passed on to the next stage, which improves the reconstruction quality and generalization of the network.
As shown in Fig. 2, the backbone network is modified from the improved U-Net of the previous work [31]. The first part is a feature extraction part composed of three sets of dense blocks; at the end of each block, 2×2 max-pooling reduces the feature size to half of that of the previous level. If the target's multi-dimensional information is input into the network directly, the individual kinds of information cannot play their roles effectively when the features are modulated inside the network, and information redundancy arises, which harms the efficiency and robustness of the network. Therefore, when the feature size has been reduced to 64×64, we divide the network into four branches and use 1×1, 2×2, 4×4, and 8×8 max-pooling, respectively, to extract the existing features at different levels and degrees. In this way, the different roles of the multi-dimensional information in the reconstruction of the target can be refined, and the features are passed to the next stage in separate channels, thereby reducing the superposition of information and improving the efficiency of the network.
Most networks that require multi-branch fusion concatenate the features directly and then pass them to the next stage. However, we want not only fusion but also the filtering of redundant information and the enhancement of the polarization characteristics that facilitate target reconstruction. Wang et al. [36] proposed that the SAM can aggregate global information from the feature map. Therefore, we introduce the SAM to reduce the redundant output of the first-stage network, aggregate effective polarization features by establishing interactions between the feature information of different channels, and enhance the features that are input to the next stage for feature fusion.
The network structure of the SAM is shown in Fig. 3. The inputs are first fed into three different convolutional layers (i.e., Conv1, Conv2, and Conv3) with a kernel size of 1×1, which do not change the spatial size of the features, to generate the query (Q), key (K), and value (V) matrices. Then, according to Eqs. (3) and (4), each Q vector interacts with the K vectors through dot products to produce scalar weights (i.e., the attention map Att) for the corresponding V vectors. Subsequently, the attention map Att is applied to the V vectors to generate the output Y. Next, the computational complexity is reduced by convolution. The final output of the SAM is given by Eq. (5), in which a residual connection with the initial input ensures that the feature information is not lost.
Finally, the features enhanced and focused by the SAM are input into the dense block and then recomposed into the 256×256 target image by up-sampling and convolution layers. Furthermore, to better reconstruct the target information, two up-sampling schemes, i.e., transposed convolution and bilinear interpolation, are considered in the decoder and the multi-scale module. Fig. 4 compares the performance obtained with transposed convolution and with bilinear interpolation in the decoder and the multi-scale module. It can be seen that transposed convolution is more conducive to the final reconstruction of the targets. Hence, in the subsequent experiments, we use transposed convolution as the up-sampling operation. During training, the mean absolute error (MAE) serves as the loss function to drive the interaction of polarization features within the network.
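The SAM steps described above (three 1×1 convolutions producing Q, K, and V, dot-product attention, an output convolution, and a residual connection) can be sketched in NumPy as follows. This is only a sketch under stated assumptions: the softmax normalization, the attention over spatial positions, the reduced channel width of Q, K, and V, and all function and parameter names are illustrative choices and are not taken from the paper's equations.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution without bias: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_module(x, w_q, w_k, w_v, w_out):
    """Return (output, attention map) for a feature map x of shape (C, H, W)."""
    c, h, w = x.shape
    n = h * w                                  # number of spatial positions
    q = conv1x1(x, w_q).reshape(-1, n)         # queries, shape (C', N)
    k = conv1x1(x, w_k).reshape(-1, n)         # keys,    shape (C', N)
    v = conv1x1(x, w_v).reshape(-1, n)         # values,  shape (C', N)
    # Dot products between every query and every key; each row sums to 1
    # after the (assumed) softmax normalization.
    att = softmax(q.T @ k, axis=-1)            # (N, N)
    # Weighted sum of value vectors: y[:, i] = sum_j att[i, j] * v[:, j].
    y = (v @ att.T).reshape(-1, h, w)          # (C', H, W)
    # Output convolution restores C channels; the residual keeps the input.
    return conv1x1(y, w_out) + x, att
```

With `w_out` set to zeros the module returns its input unchanged, which shows how the residual connection keeps the original feature information.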
We trained the model on a graphics processing unit (NVIDIA RTX 3080) using the PyTorch framework with Python 3.6 for 150 epochs. The optimizer is Adam (adaptive moment estimation) with a learning rate of 0.001.
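For reference, the MAE loss used during training is the standard mean absolute error. Written for a ground-truth image O and a network output K of size m×n (symbols chosen here to match the MSE definition in the Imaging quality section, as an assumption about notation), it reads:

```latex
\mathcal{L}_{\mathrm{MAE}} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| O(i,j) - K(i,j) \right|
```

Compared with a squared-error loss, the absolute error penalizes large deviations less strongly, so it is less dominated by a few outlier pixels.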

Imaging quality
Mean squared error (MSE) and the structural similarity index (SSIM) [37] are two common metrics used to evaluate imaging quality. The MSE between the original image O and the predicted image K, both of size m×n, can be expressed as:

$$
\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ O(i,j) - K(i,j) \right]^{2}. \tag{6}
$$

A smaller value means a better recovery result.
In addition to evaluating the noise level, the quality of the reconstructed image can be evaluated through structural similarity, which is based on the degradation of structural information. The SSIM compares the luminance, contrast, and structure of two images X and Y by:

$$
\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^{2} + \mu_Y^{2} + C_1)(\sigma_X^{2} + \sigma_Y^{2} + C_2)}, \tag{7}
$$

where μ_X is the mean of X, μ_Y is the mean of Y, σ_X² is the variance of X, σ_Y² is the variance of Y, σ_XY is the covariance of X and Y, and C_1 and C_2 are small positive constants used to avoid a zero denominator. The SSIM value typically ranges from 0 to 1. The higher the SSIM, the more similar the two images are, which means a better network reconstruction.
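The two metrics can be sketched in NumPy as follows. Note the simplification: the SSIM of [37] is usually computed over local windows and then averaged, whereas this sketch evaluates the formula once over the whole image. The function names, the `data_range` parameter, and the constants C_1 = (0.01 L)² and C_2 = (0.03 L)² (the commonly used defaults, with L the dynamic range) are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def mse(o, k):
    """Mean squared error between two images of equal shape."""
    o = np.asarray(o, dtype=np.float64)
    k = np.asarray(k, dtype=np.float64)
    return float(np.mean((o - k) ** 2))

def global_ssim(x, y, data_range=1.0):
    """SSIM evaluated with global (whole-image) statistics."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

For two identical images, `mse` returns 0 and `global_ssim` returns 1.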

Dataset preparation
The scattering images captured by the camera, as shown in Fig. 5, are hazy under incoherent light conditions. When the distance between the ground glass and the target changes, the sharpness of the scattering images also changes: the greater the distance, the blurrier the image.
The training set is composed of three kinds of scattering images, S0, S1 (i.e., I and Q), and DoP, acquired at a distance of d = 4 cm between the ground glass and the target, as shown in Fig. 5; the inputs for the different test experiments below are composed of the same three-dimensional data. Our dataset includes 200 groups of polarized images, each containing four images corresponding to the different polarization orientations (0°, 45°, 90°, and 135°). On this basis, the images needed for training can be calculated by Eqs. (1) and (2). The targets are all digits in different fonts. In addition, the 200 scattering images are expanded to 2000 images to form the training set. All of our data are grayscale images, and the final outputs are also grayscale images. After the data are collected and classified, the proposed method can be applied.

Enhanced performances from the SAM
Polarization characteristics can distinguish ballistic photons from scattered photons to some extent, so a network trained with polarization information is more stable. In this section, to demonstrate that the SAM helps the network focus on and enhance the stable target characteristics carried by polarization, we conduct comparative experiments with and without the SAM in the network. We refer to the MIU-net without the SAM as WSAM-MIU-net. We use the same training set to train the SAM-MIU-net and the WSAM-MIU-net, respectively, and obtain the corresponding optimal models for the comparative tests shown in Fig. 6. It can be clearly seen that the results of the SAM-MIU-net all have higher contrast and clarity, and the background is free of interfering noise. In contrast, the WSAM-MIU-net cannot completely rebuild more complex targets, such as the second target ("5"), and all of its test results contain background noise. The results show that the SAM can guide the network to focus on and enhance the target characteristics carried by the polarization information, filter out redundant information, and improve the reconstruction performance. We also calculated the average SSIM and MSE of the reconstructed results, which are shown in Table 1. Overall, the reconstruction performance of the model with the SAM is greatly improved: the SSIM increases by 7.4% and the MSE decreases by 11%.
In addition, we output the features in the middle of the network to explore the physical process of the proposed network. First, we output the features of the four branches, as shown in Fig. 7(a). It can be seen that different branches extract different aspects of the features and refine the contribution of the multi-dimensional information to the target reconstruction, which avoids the redundancy caused by information superposition that would otherwise degrade the performance of the model. After that, we export the feature maps from the networks with and without the SAM, respectively, as shown in Fig. 7(b). By visualizing the features, we can observe the aggregation and enhancement effect of the SAM on the polarization features. After these operations, the features are passed to the subsequent modules for information fusion, which improves the reconstruction performance under incoherent conditions.
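The four pooling branches whose outputs are visualized in Fig. 7(a) can be sketched as follows, using the 1×1, 2×2, 4×4, and 8×8 max-pooling sizes and the 64×64 feature size given in the Network design section. The non-overlapping windows (stride equal to the window size) and the function names are assumptions for illustration.

```python
import numpy as np

def max_pool(x, k):
    """Non-overlapping k x k max-pooling of a (C, H, W) feature map."""
    c, h, w = x.shape
    if h % k or w % k:
        raise ValueError("height and width must be divisible by k")
    # Group pixels into k x k tiles, then take the maximum of each tile.
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def multi_scale_branches(x, sizes=(1, 2, 4, 8)):
    """Return one pooled copy of x per branch, from finest to coarsest."""
    return [max_pool(x, k) for k in sizes]
```

For a 64×64 input, the branches have spatial sizes 64×64, 32×32, 16×16, and 8×8, so each branch summarizes the same features at a different level of detail.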

Untrained different-structure targets
In this section, we test the trained SAM-MIU-net with more complex targets, while other conditions are unchanged, to further verify its generalization. Alphabetical targets and graphic targets, which are not in the training set, are input into the trained network, and the reconstruction results are shown in Fig. 8. Alphabetical and graphic targets, which belong to structural types different from those in the training set, can also be accurately reconstructed by our method. Despite the limited amount of training data, the target structure is reconstructed without excessive noise in the background. The structure of the graphic targets has the weakest correlation with the training dataset, but the results are still reconstructed with little distortion. As Fig. 8 shows, our proposed method can reconstruct untrained targets with high contrast and without excessive background noise, which indicates that the SAM-MIU-net has excellent generalization ability with respect to the targets' structures. In addition to the visual effects, the effectiveness and generalization of the SAM-MIU-net can also be seen from the average SSIM and MSE of the results in Table 2. Even the less correlated graphic targets achieve good SSIM and MSE values.
Table 3 [3,38] shows the corresponding elements in the Mueller matrix (MM) of the materials used in our experiment. When the targets are set as "Ink-Wood", the differences between the corresponding MM elements of paper and wood are small, so the model can reconstruct the ink target relatively completely. When the target materials are set as "Paper-Steel", with other conditions unchanged, Table 3 shows that the corresponding MM elements of ink and steel are similar, so the scattering images are similar to those of "Ink-Paper", and the targets can be reconstructed similarly to the results of the Ink-Paper model. Lastly, when "Paper-Wood" targets are used, only the target profile can be distinguished because the corresponding MM elements of paper and wood are similar.
From the results, it can be seen that when a material has not been seen by the network during training, the performance of the target reconstruction decreases, and the reconstruction performance is related to the difference in polarization characteristics between the training material and the test material. More importantly, the SAM-MIU-net reconstructs the target with more completeness and higher contrast. The SAM also enhances features that are similar to the polarization characteristics of the training targets, which greatly benefits the scalable imaging of the model. We have also calculated the average SSIM and MSE of the reconstructed different-material targets, shown in Table 4. In summary, our proposed method has a certain generalization ability because the SAM-MIU-net enhances the expression of polarization features in the process of reconstructing the targets.
The MM values of different materials are different, so their polarization properties are also different. In addition, once the targets and the scattering medium in a system are fixed, their MMs do not change. As a result, a network trained with polarization information is more robust, and the proposed SAM-MIU-net can reconstruct targets at different distances (when the targets move within a certain range). In this section, we demonstrate the stability of the proposed method by reconstructing the scattering images obtained at different distances d, as shown in Fig. 10. When d = 3.5 cm, there is enough target information for target reconstruction, and the stable polarization information improves the quality of the results. We also show the test results from the WSAM-MIU-net; the reconstruction results from the SAM-MIU-net are more stable than those from the WSAM-MIU-net when the distance is longer than 4 cm. In particular, when d = 5 cm, the SAM-MIU-net can still distinguish the targets, but the WSAM-MIU-net cannot. The comparison shows that the aggregation effect of the SAM on the polarization characteristics enhances the expression of the stable target features carried by the polarization and improves the flexibility of the network. Ultimately, the SAM-MIU-net is capable of imaging at distances up to 25% beyond that of the training set, achieving efficient elastic imaging. We also calculated the average SSIM and MSE of the reconstruction results at different d for both cases. From Table 5, the values for the model with the SAM are generally better and more stable across different d.
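The 25% figure follows from the training distance of d = 4 cm and the largest test distance, d = 5 cm, at which the targets remain distinguishable (the subscript names are introduced here only for illustration):

```latex
\frac{d_{\mathrm{test}} - d_{\mathrm{train}}}{d_{\mathrm{train}}}
  = \frac{5\,\mathrm{cm} - 4\,\mathrm{cm}}{4\,\mathrm{cm}} = 0.25 = 25\%
```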

Performance comparison with other existing methods
In this section, we compare our proposed SAM-MIU-net with several existing methods, including Dark Channel Prior (DCP) [39], PDN [29] (Hu proposed the method based on RDB to dehaze using 0°, 45°, and 90°), PU-Net [40] (Zhang proposed the method based on U-Net to dehaze using 45°, 90°, 135°, and S0), and MU-DLN [31] (Li proposed the method based on a modified U-Net using the Q-component). The corresponding results are shown in Fig. 11. To make a fair performance comparison, all the methods except DCP first used the same training set to learn the model of polarization scattering imaging and then used the same testing set to verify their performance.
From the results, the DCP method achieves only a slight dehazing effect and does not fully reveal the targets, and the PDN method cannot clearly recover the target, so it may not be suitable for more blurred images in complex environments. Although the PU-Net can recover part of the target structure, there is more noise around it. The MU-DLN method takes only the polarization information of S1, a single component of the Stokes vector, as input, and its recovery result is incomplete and low in contrast. On the basis of MU-DLN, we retrain the network with the training set used in this article (i.e., I, Q, and DoP) to obtain 3D-MU-DLN. Its recovery results are better than those of MU-DLN, but the target recovery is still incomplete in some cases. In contrast, our proposed SAM-MIU-net is able to completely reconstruct the target structure, enhance the contrast, and suppress background noise in complex environments. Table 6 lists the SSIM and MSE of the different methods; our proposed SAM-MIU-net obtains better reconstruction performance than the other methods. In addition, we calculate the number of parameters and the floating-point operations (FLOPs) of the different methods to assess the complexity of the networks. It can be seen that our method achieves high quality with fewer parameters and computations.

Conclusion
In this manuscript, given the weak polarization characteristics and the limitations of the detection system, we use multidimensional polarization information to characterize the target and use multi-scale extraction to refine the contribution of the multi-dimensional information to target reconstruction, which also avoids the information redundancy caused by information superposition. The SAM is introduced to aggregate global information and enhance the polarization characteristics that are input to the subsequent reconstruction modules. Experiments have verified that the SAM-MIU-net greatly improves generalization and stability. Through the intermediate feature outputs, we can also visualize the influence of the proposed module on the final reconstruction result. It is worth noting that applying multidimensional information to target features is of great significance for target reconstruction in complex scenarios, such as optical remote sensing. In future work, to further improve the performance of polarization scattering imaging, we will focus on the following points: (i) extracting the available information from multi-material targets using supervised learning algorithms to reconstruct more scenes; (ii) achieving high-quality reconstruction with a small number of training samples, since the acquisition of polarization datasets is relatively complex.
Funding. National Natural Science Foundation of China (61775050).
Disclosures. The authors declare no conflicts of interest.

Fig. 2. The overall structure of the used U-net.

Fig. 6. The reconstruction results of different models. (a) Untrained digit images; (b) scattering images; (c) the reconstruction results without SAM; (d) the reconstruction results with SAM. Fig. 6(b) shows the scattering images of the light intensity, but the test sets also consist of scattering images of S0, S1, and DoP (for simplicity, only the scattering images of the light intensity are shown in the following results).

Fig. 7. Visualization of features in the middle of the networks. (a) The output of the subscale branches; (b) the output after the subscale branches for the networks with SAM and without SAM.

Untrained different-material targets
Polarization properties are very sensitive to different materials, so we explored the concrete effects of different materials on the trained model. We change the target-background materials to Ink-Wood, Paper-Steel, and Paper-Wood in turn and place them in the same experimental environment to obtain scattering images of the different combinations. The test results are obtained by inputting their scattering images into the SAM-MIU-net trained on the Ink-Paper target-background, as depicted in Fig. 9.

Fig. 9. The reconstruction results of different target-backgrounds: Ink-Wood, Paper-Steel, and Paper-Wood. (a) Original images; (b) scattering images; (c) their reconstruction results from the SAM-MIU-net trained on the Ink-Paper target-background.

Table 5. The average SSIM and MSE of the different networks at different distances of d

Fig. 11. The results obtained with different methods.

Table 2. The average SSIM and MSE of the different targets