Underwater Target Detection Utilizing Polarization Image Fusion Algorithm Based on Unsupervised Learning and Attention Mechanism

Since light propagating in water is subject to absorption and scattering, underwater images captured with conventional intensity cameras alone suffer from low brightness, blur, and loss of detail. In this paper, a deep fusion network is applied to underwater polarization images; that is, underwater polarization images are fused with intensity images using a deep learning method. To construct a training dataset, we establish an experimental setup to obtain underwater polarization images and apply appropriate transformations to expand the dataset. Next, an end-to-end learning framework based on unsupervised learning and guided by an attention mechanism is constructed for fusing polarization and light intensity images, and the loss function and weight parameters are elaborated. The produced dataset is used to train the network under different loss weight parameters, and the fused images are evaluated with different image evaluation metrics. The results show that the fused underwater images are more detailed: compared with light intensity images, the information entropy and standard deviation increase by 24.48% and 139%, respectively. The results are also better than those of other fusion-based methods. In addition, an improved U-net network structure is used to extract features for image segmentation, and the results show that target segmentation based on the proposed method is feasible in turbid water. The proposed method requires no manual adjustment of weight parameters, runs quickly, and exhibits strong robustness and adaptability, which is important for vision research fields such as ocean detection and underwater target recognition.


Introduction
The ocean covers more than 70% of the earth's surface, and the marine ecosystem is one of the most productive and dynamic ecosystems on earth. Many scholars have conducted research in marine resource exploration, biological investigation, underwater vehicle navigation, and other fields [1][2][3][4]. Underwater optical images are currently one of the important media for exploring the ocean. However, due to the large number of floating particles in water, actual underwater images are seriously degraded, with problems such as high background noise, low contrast, and loss of detail [1]. Research on underwater image enhancement technology is therefore of great significance and value for ocean exploration.
Researchers have shown that underwater polarization imaging technology can reduce the influence of backscattered light on underwater imaging to a certain extent by using the polarization characteristics of scattered light [5,6]. Degree of linear polarization (DoLP) images are used to characterize polarization characteristics and provide detailed

Underwater Imaging Model
The Jaffe-McGlamery model [19,20] is one of the most commonly used underwater imaging models, as shown in Figure 1, and many underwater image restoration algorithms are based on it. The model states that the final image I(x, y) received by the detector is a linear combination of three components: the target reflected light S(x, y), the backscattered light B(x, y) scattered by the water body before the light source reaches the target, and the forward scattered light F(x, y) produced when part of the target reflected light is scattered on its way to the detector:

I(x, y) = S(x, y) + B(x, y) + F(x, y). (1)

The initial irradiance of the target is assumed to be J(x, y); part of this energy is lost to scattering and absorption as the light propagates from the target to the detector. The reflected light S(x, y) of the target can therefore be expressed as

S(x, y) = J(x, y) · t(x, y), (2)

where t(x, y) is the transmittance of the medium, determined by the attenuation coefficient β(x, y) and the propagation distance ρ(x, y):

t(x, y) = e^(−β(x, y) ρ(x, y)).

In a single uniform medium, the attenuation coefficient is invariant in space, so β(x, y) = β0. The propagation distance refers to the underwater part of the optical path between the object and the camera. The backscattered light B(x, y) is the background light scattered by water particles toward the detector and can be expressed as

B(x, y) = B∞ (1 − t(x, y)),

where B∞ represents the underwater ambient light intensity at infinite distance. Since the effect of forward scattering on imaging quality is minimal, it is usually ignored, so Equation (1) simplifies to

I(x, y) = J(x, y) t(x, y) + B∞ (1 − t(x, y)).

Thus, the initial irradiance J(x, y) of the object can be expressed as

J(x, y) = (I(x, y) − B∞ (1 − t(x, y))) / t(x, y).
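Under these assumptions, recovering J(x, y) from a measured pixel is a direct inversion of the simplified model. A minimal per-pixel sketch in Python (the scene parameters below are illustrative, not values from the experiments):

```python
import math

def transmittance(beta0: float, rho: float) -> float:
    """Beer-Lambert transmittance of a uniform medium: t = exp(-beta0 * rho)."""
    return math.exp(-beta0 * rho)

def restore_irradiance(I: float, b_inf: float, beta0: float, rho: float) -> float:
    """Invert I = J*t + B_inf*(1 - t) to recover the target irradiance J."""
    t = transmittance(beta0, rho)
    return (I - b_inf * (1.0 - t)) / t

# Forward-simulate a pixel with assumed parameters, then invert it.
J_true, b_inf, beta0, rho = 0.8, 0.3, 0.2, 2.0
t = transmittance(beta0, rho)
I = J_true * t + b_inf * (1.0 - t)
J_rec = restore_irradiance(I, b_inf, beta0, rho)
```

Because the model is linear in J, the inversion is exact when B∞, β0, and ρ are known; in practice these quantities must be estimated from the scene.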

Polarization Imaging Model
The Stokes vector method is one of the most commonly used polarization characterization methods in the field of polarization detection, and it can fully characterize the polarization state of a light wave. The Stokes vector S = [I, Q, U, V]^T is a 4 × 1 column vector composed of four parameters. I represents the total light intensity received by the detector; Q represents the intensity difference between the 0° and 90° polarization components I0° and I90°; U represents the intensity difference between the 45° and 135° polarization components I45° and I135°; and V represents the intensity difference between the left- and right-handed circularly polarized components Il and Ir:

S = [I, Q, U, V]^T = [I0° + I90°, I0° − I90°, I45° − I135°, Ir − Il]^T.

The emergent light S′ = [I′, Q′, U′, V′]^T behind an ideal linear polarizer at angle θ can be obtained from its Mueller matrix:

S′ = (1/2) [ 1         cos 2θ           sin 2θ           0
             cos 2θ    cos² 2θ          cos 2θ sin 2θ    0
             sin 2θ    cos 2θ sin 2θ    sin² 2θ          0
             0         0                0                0 ] S,

where θ is the included angle between the main optical axis and the reference line. According to this relation, the intensity of the outgoing light at angle θ is

I(θ) = (I + Q cos 2θ + U sin 2θ) / 2.

The polarization camera can obtain light intensity images in the polarization directions 0°, 45°, 90°, and 135° in a single shot because each pixel of the CMOS sensor is covered by one of four polarizers at different angles (0°, 45°, 90°, and 135°), as shown in Figure 2. Every four pixels forms a computing unit, so the intensities at 0°, 45°, 90°, and 135° and the Stokes vector of the light are obtained simultaneously. From the Stokes vector, the DoLP and the angle of polarization (AoP) of the incident light can be further calculated:

DoLP = sqrt(Q² + U²) / I,
AoP = (1/2) arctan(U / Q).

DoLP represents the proportion of linearly polarized components in the total light intensity; the AoP refers to the dominant polarization direction of the incident light.
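The per-unit Stokes computation described above can be sketched directly (the sample intensities are illustrative):

```python
import math

def stokes_linear(i0, i45, i90, i135):
    """Linear Stokes parameters from the four focal-plane polarizer channels."""
    I = 0.5 * (i0 + i45 + i90 + i135)  # averaged estimate of total intensity
    Q = i0 - i90
    U = i45 - i135
    return I, Q, U

def intensity_at(theta, I, Q, U):
    """Outgoing intensity behind an ideal linear polarizer at angle theta (rad)."""
    return 0.5 * (I + Q * math.cos(2 * theta) + U * math.sin(2 * theta))

def dolp(I, Q, U):
    """Degree of linear polarization."""
    return math.sqrt(Q * Q + U * U) / I

def aop(Q, U):
    """Angle of polarization in radians."""
    return 0.5 * math.atan2(U, Q)

# Fully linearly polarized light along 0°: i0 = 1, i90 = 0, i45 = i135 = 0.5.
I, Q, U = stokes_linear(1.0, 0.5, 0.0, 0.5)  # I = 1, Q = 1, U = 0
```

For this example DoLP evaluates to 1 (fully polarized) and AoP to 0, consistent with light polarized along the 0° axis.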


Network Architecture
The network structure adopted in this paper is shown in Figure 3 and mainly consists of three modules: feature extraction, feature fusion, and image reconstruction. First, in the feature extraction module, the light intensity image and the polarization image are input through two channels. The first layer is a convolution layer with a 3 × 3 kernel and the ReLU (rectified linear unit) activation function, used to extract low-level features. The second layer is a DenseBlock module containing 3 convolution layers that extract high-level features; each of these convolution layers also uses a 3 × 3 kernel with a stride of 1, preceded by a BN (batch normalization) layer and a ReLU activation function. This ordering speeds up network training. The two input channels for the light intensity image and the polarization image share the same weights, which reduces the computational complexity of the network. These layers are followed by the attention unit (see Section 3.2), which takes the feature map of the previous layer as input; it captures global relationships in the data and guides the network to learn the distribution of the feature map. Second, in the feature fusion module, the feature maps output by the feature extraction module are concatenated: each of the two feature maps has 128 channels, so the fused feature map has 256 channels. Finally, the output of the feature fusion module is fed to the image reconstruction module, which consists of 5 transposed convolution layers, each with a 3 × 3 kernel; the fusion result is reconstructed from the fused features through these 5 layers. A more detailed network architecture is given in Table 1.

Attention Mechanism
The attention unit combines channel attention and spatial attention. Channel attention enables the network to learn the importance of features in the channel domain and assign different weights to the feature map, achieving a selective combination of the light intensity image and the polarization image in the channel domain. Spatial attention focuses on learning the effective information distribution of each layer of the feature map to improve the transmission of salient features. The attention unit includes a global mean pooling layer, a convolution layer, an activation layer, and a splicing layer; its detailed structure is shown in Figure 4. Given X ∈ R^(H×W×C) and X′ ∈ R^(H×W×C) as the input and output of the attention unit, its calculation is

X′ = X ⊗ σ(Fc(X) ⊕ Fs(X)),

where σ is the sigmoid activation function, Fc is the channel attention branch, Fs is the spatial attention branch, ⊕ is the broadcast addition operation, and ⊗ is the element-by-element multiplication operation.

When the input feature map X ∈ R^(H×W×C) passes through the channel attention branch, the channel feature Xc ∈ R^(1×1×C) is first obtained through the global average pooling layer; a feature of size 1 × 1 × C/r is then obtained by the point-by-point convolution PWConv1, a BN layer, and the ReLU activation function; and the channel attention feature map of size 1 × 1 × C is obtained by the point-by-point convolution PWConv2 and a BN layer. Fc is expressed as

Fc(X) = BN(PWConv2(δ(BN(PWConv1(GAP(X)))))), (13)

where δ is the ReLU activation function and GAP is global average pooling. Similarly, in the spatial attention branch, a 3 × 3 convolution Conv1, a BN layer, and the ReLU activation function are first used to obtain a feature map of size H × W × C/r; a 1 × 1 convolution PWConv2 and a BN layer then produce the spatial attention feature map of size H × W × C. Fs can be expressed as

Fs(X) = BN(PWConv2(δ(BN(Conv1(X))))).
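The gating performed by the attention unit can be illustrated on a toy feature map. In this sketch the learned branches are collapsed to their simplest stand-ins: Fc is reduced to plain global average pooling and Fs to the identity map, whereas in the paper both are conv/BN/ReLU stacks with learned weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gap(channel):
    """Global average pooling over one H x W channel."""
    return sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))

def attention_unit(X):
    """X' = X (x) sigmoid(Fc(X) (+) Fs(X)) on a C x H x W nested list.

    Stand-ins: Fc(X) is the per-channel GAP value broadcast over H x W,
    Fs(X) is the feature map itself. The real branches are learned.
    """
    out = []
    for channel in X:
        fc = gap(channel)                   # channel descriptor (broadcast add)
        out.append([[v * sigmoid(fc + v)    # element-wise sigmoid gating
                     for v in row] for row in channel])
    return out

X = [[[1.0, 2.0], [3.0, 4.0]]]  # C = 1, H = W = 2
Y = attention_unit(X)
```

Since the sigmoid output lies in (0, 1), each activation is attenuated in proportion to how salient the branches judge it to be, which is the selective-weighting behavior described above.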

Loss Function
The loss function in this paper adopts a globally weighted SSIM (structural similarity) loss, namely a multiscale and weighted SSIM (MSW-SSIM) [17]:

Loss_MSW-SSIM = Σ_(ω ∈ {3,5,7,9,11}) [ γ_ω · Loss_SSIM(I_S0, I_f; ω) + (1 − γ_ω) · Loss_SSIM(I_DoLP, I_f; ω) ],

where Loss_SSIM(x, y; ω) = 1 − SSIM(x, y; ω) is a loss based on the SSIM, which represents the structural similarity of images x and y within the window ω:

SSIM(x, y; ω) = ((2 µ_ωx µ_ωy + C1)(2 σ_ωxωy + C2)) / ((µ²_ωx + µ²_ωy + C1)(σ²_ωx + σ²_ωy + C2)).

Here ωx is the region of image x within the window ω, and µ_ωx is the mean of ωx. The variables σ²_ωx and σ_ωxωy are the variance of ωx and the covariance of ωx and ωy, respectively; ωy, µ_ωy, and σ²_ωy have the corresponding meanings for image y. C1 and C2 are constants that avoid instability when µ²_ωx + µ²_ωy and σ²_ωx + σ²_ωy are very close to zero, respectively.
The multiwindow SSIM is used in the loss function to handle image detail at different scales. The window sizes used are 3, 5, 7, 9, and 11; different windows extract features at different scales. In addition, Loss_SSIM(I_S0, I_f; ω) and Loss_SSIM(I_DoLP, I_f; ω) are combined with a weight coefficient based on σ²_ωS0 and σ²_ωDoLP, defined in Equation (17). When the variance of the intensity image S0 within a window ω is greater than that of the corresponding DoLP image, the local region of S0 contains more image detail; that is, the weight coefficient γ_ω corresponding to S0 should be larger.
The weight coefficient is defined as

γ_ω = g(σ²_ωS0) / (g(σ²_ωS0) + g(σ²_ωDoLP)), (17)

where σ²_ωS0 is the variance of the intensity image S0 within the window ω, σ²_ωDoLP is the variance of the DoLP image within the window ω, and g(x) = max(x, 0.0001) is a correction function that increases the robustness of the solution.
In addition, MSW-SSIM retains high-frequency information but is insensitive to uniform deviations, which can easily lead to changes in brightness. Therefore, combining MSW-SSIM with the L1 norm loss function ensures the brightness of the fusion results. The L1 norm loss function can be expressed as

Loss_L1 = (1 / (MN)) Σ_(i=1..M) Σ_(j=1..N) | I_f(i, j) − I_avg(i, j) |,

where M and N are the height and width of the image, respectively, I_avg is the average of I_S0 and I_DoLP, and I_f is the fused image. The final loss function can then be expressed as

Loss = Loss_MSW-SSIM + α · Loss_L1,

where α is a balance parameter.
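The windowed SSIM loss and the variance-based weight γ_ω can be sketched in plain Python; windows are flattened to lists, and the constants C1 and C2 below are illustrative values rather than the ones used in the paper:

```python
import math

C1, C2 = 1e-4, 9e-4  # illustrative stabilizing constants

def window_stats(wx, wy):
    """Means, variances, and covariance of two flattened windows."""
    n = len(wx)
    mx, my = sum(wx) / n, sum(wy) / n
    vx = sum((a - mx) ** 2 for a in wx) / n
    vy = sum((b - my) ** 2 for b in wy) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(wx, wy)) / n
    return mx, my, vx, vy, cov

def ssim_window(wx, wy):
    """SSIM of two flattened image windows."""
    mx, my, vx, vy, cov = window_stats(wx, wy)
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx * mx + my * my + C1) * (vx + vy + C2))

def loss_ssim(wx, wy):
    return 1.0 - ssim_window(wx, wy)

def gamma(var_s0, var_dolp):
    """Variance-based weight: favors the source with more local detail."""
    g = lambda v: max(v, 1e-4)  # correction g(x) = max(x, 0.0001)
    return g(var_s0) / (g(var_s0) + g(var_dolp))

w = [0.2, 0.4, 0.6, 0.8]
# Identical windows give SSIM = 1, i.e. zero loss; gamma > 0.5 when the
# S0 window has more variance than the DoLP window.
```

The full MSW-SSIM loss sums these per-window losses over window sizes 3, 5, 7, 9, and 11, weighting each pair of terms by γ_ω and 1 − γ_ω.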

Experiment
In order to obtain the dataset, an underwater imaging experiment was conducted. The experimental device and layout are shown in Figure 1 and mainly include a polarization camera, a glass water tank, a polarized light source, and the target object. The polarization camera (PHX0550S-P) has a resolution of 2448 × 2048 and a bit depth of 12. It adopts focal-plane polarization imaging and can capture four linear polarization intensity images with polarization angles of 0°, 45°, 90°, and 135° in a single shot. The focal length of the lens is 10.5 mm. The polarized light source consists of an LED light source and a linear polarizer. A water tank (500 mm × 250 mm × 250 mm) was used as the container, and its inside was covered with black flannel to avoid interference from ambient light. The target was placed in the water-filled tank, and milk was added to the water to simulate underwater conditions with suspended particles. Light intensity and polarization images were then acquired. Finally, a dataset containing 150 sets of images was constructed, each set composed of corresponding light intensity and polarization images of size 1224 × 1024. A total of 100 groups were used as the training set, and the remaining 50 groups were used as the validation and test sets. In addition, the dataset images were flipped and cropped to 80 × 80 patches as the input for network training. Training was carried out on a server with an NVIDIA GeForce RTX 2080 Ti graphics card, an i9-9600X processor, and 128 GB of memory. After weight initialization, optimization was performed with the Adam optimizer and a minibatch size of 128. The learning rate was initially set to 0.0001 and decayed exponentially at a rate of 0.99, with the maximum number of epochs set to 200.
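The flip-and-trim augmentation can be sketched as a random 80 × 80 crop plus a random horizontal flip; the helper below is a hypothetical illustration operating on a nested-list image, not the authors' preprocessing code:

```python
import random

def augment(img, size=80, seed=None):
    """Random size x size crop plus a random horizontal flip.

    `img` is an H x W nested list of pixel values; the crop offsets and
    the flip decision are drawn at random, mirroring the flip-and-trim
    augmentation used to expand the training set.
    """
    rng = random.Random(seed)
    H, W = len(img), len(img[0])
    top = rng.randrange(H - size + 1)
    left = rng.randrange(W - size + 1)
    patch = [row[left:left + size] for row in img[top:top + size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch

# A synthetic 100 x 100 image stands in for a 1224 x 1024 dataset image.
img = [[float(r * 100 + c) for c in range(100)] for r in range(100)]
patch = augment(img, size=80, seed=0)
```

For paired training data, the same crop offsets and flip decision would of course be applied to the intensity image and its corresponding polarization image.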

Image Enhancement
Based on the above method, the network was trained and the performance of underwater image fusion was tested. Information entropy (IE), standard deviation (SD), mutual information (MI), and SSIM were adopted to measure the quality of the fused image objectively. IE represents the average amount of information in an image, as shown in Equation (20); the more information an image contains, the greater its IE. Image fusion increases the information in the image, and IE reflects the degree of this change:

IE = −Σ_i p(a_i) log2 p(a_i), (20)

where p(a_i) represents the probability of the gray value a_i appearing; IE is thus the statistical average of the information content −log2 p(a_i).
SD refers to the dispersion of the image pixel gray values relative to the mean µ. A larger SD indicates that the gray levels in the image are more dispersed and the image quality is better:

SD = sqrt( (1 / (MN)) Σ_(i=1..M) Σ_(j=1..N) (I(i, j) − µ)² ).

MI measures the degree of similarity between two images, that is, the amount of information from the source images retained in the fused image. The larger the MI, the more source image information is retained and the better the quality. It is computed from the entropies as

MI(A, B) = H(A) + H(B) − H(A, B),

where H(A) and H(B) are the entropies of images A and B and H(A, B) is their joint entropy.
SSIM is a widely used image quality evaluation index based on the assumption that human vision extracts structured information when viewing an image. The SSIM value ranges from −1 to 1; the closer it is to 1, the higher the similarity and the better the fusion quality. For images A and F it takes the form

SSIM(A, F) = ((2 µ_A µ_F + C1)(2 σ_AF + C2)) / ((µ²_A + µ²_F + C1)(σ²_A + σ²_F + C2)).

The network training test results are shown in Figure 5. The quality of the light intensity image S0 is poor and the scene details are severely degraded; after fusing in the polarization image, however, the target becomes clearer and the texture outline of the key can be clearly identified. In terms of the evaluation indexes, the IE and SD after fusion increase by 24.48% and 139%, respectively, indicating that the proposed method improves the quality of underwater images. In addition, the fused image obtained by this method is compared with several other image fusion methods: the curvelet transform (CVT) [10], gradient transfer fusion (GTF) [21], multiresolution singular value decomposition (MSVD) [22], the ratio of low-pass pyramid (RP) [7], and the discrete wavelet transform (DWT) [8]. As can be seen from Figure 5, the RP result has poor visual quality: artifacts appear on the edge of the key and in the shaded part, and there is more noise. The CVT, DWT, and MSVD results exhibit a certain graininess and handle shadows poorly. The GTF result has high contrast, but the texture details of the key are not clear enough. In contrast, the method presented in this paper produces a relatively realistic visual effect without obvious artifacts or distortions and handles shadows well.
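The histogram-based metrics IE, SD, and MI can be sketched in plain Python for a flattened 8-bit grayscale image (a minimal illustration, not the evaluation code used in the paper):

```python
import math

def entropy(pixels, levels=256):
    """Information entropy (IE): -sum p(a_i) * log2 p(a_i)."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    n = len(pixels)
    return -sum((h / n) * math.log2(h / n) for h in hist if h)

def std_dev(pixels):
    """Standard deviation of pixel gray values about the mean."""
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((v - mu) ** 2 for v in pixels) / len(pixels))

def mutual_information(a, b, levels=256):
    """MI(A, B) = H(A) + H(B) - H(A, B) via the joint gray-level histogram."""
    n = len(a)
    joint = {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    h_ab = -sum((c / n) * math.log2(c / n) for c in joint.values())
    return entropy(a, levels) + entropy(b, levels) - h_ab

# Two gray levels with equal probability -> IE = 1 bit; MI of an image
# with itself equals its own entropy.
a = [0, 0, 255, 255]
```

For fusion evaluation, MI is typically reported as the sum of MI(source1, fused) and MI(source2, fused), so larger values mean more source information survives in the fused image.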
To further improve the enhancement effect, we also tested different network configurations, but the image quality did not improve much, so those configurations are not reported. We therefore show only the current configuration of the network, which already meets the practical requirements.
In order to objectively evaluate the performance of the method, the four image evaluation indexes introduced above were applied to the images in the test set, and the results were averaged, as shown in Table 2. The method performs better on IE, SD, MI, and SSIM, which further proves its effectiveness. Finally, the running time was evaluated on a server configured with an NVIDIA GeForce RTX 2080 Ti, a 3.1 GHz Intel Core i9-9600X, and 128 GB of RAM; the results, averaged over multiple groups of images, are shown in Table 3. All methods were implemented in Python. The processing speed of the proposed method is faster than that of the other methods.

Target Segmentation
Most existing underwater image segmentation algorithms are based on only one of light intensity, spectral, or polarization information, which is a significant limitation; it is necessary to make comprehensive use of light intensity and polarization information. In this study, we use an improved U-net network to extract features from the fused image for image segmentation [23,24]. The framework and configuration of the image segmentation follow those of the image enhancement experiments in the previous section. We simulated different underwater turbidities by adding different volumes of milk (0, 1, 2, and 3 mL) to the water tank. Figure 6a,b show the original images and segmentation results, respectively, at the different water turbidities. As turbidity increases, the intensity of backscattered light increases, the clarity of the original image decreases, and noise increases. With 3 mL of milk, the image quality degrades noticeably, but the general outline can still be detected in the segmentation results.

We used pixel accuracy (PA) and mean intersection over union (MIoU) to measure the accuracy of the image segmentation. PA is the ratio of correctly predicted pixels to all pixels in the image. MIoU is the overlap of the predicted and labeled regions divided by their union, averaged over classes. The evaluation indexes of the target segmentation results at different water turbidities are shown in Figure 7. After adding polarization information, both PA and MIoU improve.
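PA and MIoU can be computed directly from flattened prediction and label masks; the sketch below assumes binary masks:

```python
def pixel_accuracy(pred, label):
    """Fraction of pixels whose predicted class matches the label."""
    correct = sum(p == l for p, l in zip(pred, label))
    return correct / len(pred)

def mean_iou(pred, label, classes=(0, 1)):
    """Mean intersection-over-union across classes (binary by default)."""
    ious = []
    for c in classes:
        inter = sum(p == c and l == c for p, l in zip(pred, label))
        union = sum(p == c or l == c for p, l in zip(pred, label))
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred  = [1, 1, 0, 0]
label = [1, 0, 0, 0]
# PA = 3/4; IoU(class 1) = 1/2, IoU(class 0) = 2/3, so MIoU = 7/12
```

Unlike PA, MIoU is not dominated by the (usually large) background class, which is why both are reported for the turbid-water segmentation results.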
The results show that the target segmentation based on the proposed method is feasible under turbid water.


Conclusions
Aiming at the problem of poor-quality underwater optical imaging, this paper proposes a method that applies a deep fusion network to underwater polarization images. By analyzing the underwater active polarization imaging model, we set up an experimental device to obtain underwater polarization images and construct a training dataset. We establish an end-to-end network model based on unsupervised learning guided by an attention mechanism and design its loss function. The experimental results show that the method improves the visual quality of the images and is superior to other methods. Its processing speed is also faster than that of the other methods, which shows its potential to meet the requirements of real-time underwater video processing. We further improve the U-net network structure to extract features for image segmentation; the results show that target segmentation based on the proposed method is feasible in turbid water. Future research includes building a more comprehensive dataset and improving the loss function and network modules to further raise the quality of the fused images and meet the requirements of practical applications. The processing efficiency of the algorithm will also be improved to reduce the running time and realize real-time detection of underwater targets.