An Underwater Image Denoising Method Based on High-Frequency Abrupt Signal Separation and Hybrid Attention Mechanism

During underwater image processing, image quality is affected by the absorption and scattering of light in water, thus causing problems such as blurring and noise. As a result, poor image quality is unavoidable. To achieve overall satisfying research results, underwater image denoising is vital. This paper presents an underwater image denoising method, named HHDNet, designed to address noise issues arising from environmental interference and technical limitations during underwater robot photography. The method leverages a dual-branch network architecture to handle both high and low frequencies, incorporating a hybrid attention module specifically designed for the removal of high-frequency abrupt noise in underwater images. Input images are decomposed into high-frequency and low-frequency components using a Gaussian kernel. For the high-frequency part, a Global Context Extractor (GCE) module with a hybrid attention mechanism focuses on removing high-frequency abrupt signals by capturing local details and global dependencies simultaneously. For the low-frequency part, efficient residual convolutional units are used in consideration of less noise information. Experimental results demonstrate that HHDNet effectively achieves underwater image denoising tasks, surpassing other existing methods not only in denoising effectiveness but also in maintaining computational efficiency, and thus HHDNet provides more flexibility in underwater image noise removal.


Introduction
Underwater vision is a vital technology for exploring the marine environment noninvasively, and it provides abundant and varied information for ocean study. High-quality underwater images are essential for robots to complete underwater tasks such as exploration, archaeology, rescue, and imaging. However, underwater images are often distorted by water and suspended particles, which inevitably introduce noise and reduce image usability. Well-denoised, high-quality underwater images help scientific observation and underwater robot operations work efficiently and accurately [1-15]. Furthermore, denoising technology can support marine engineering by providing more precise and reliable data.
Sensors 2024, 24, 4578
When light travels through water, its absorption and scattering are influenced not only by water molecules but also by suspended particles such as sand grains, plankton, and dissolved organic matter. Consequently, the main challenges in underwater image denoising include the low contrast, color distortion, and noise interference commonly observed in such images. To address these image quality issues, researchers have proposed several methods for underwater image restoration and enhancement over the past few decades. These methods [1-5] have significantly improved visibility and color correction in underwater images. Based on marine measurement data, Akkaynak et al. derived a physically valid scattering space and constructed a revised underwater image formation model [1] to simulate the degradation process of underwater images. Similarly, to address the underwater image restoration problem, Desai et al. designed a revised model and trained it with generative adversarial networks [2] to restore the real quality of underwater images. However, both methods still used RGB image inputs and did not separate and process the noise component independently. Taking a different approach, Peng et al. tackled the separation of color and texture in underwater images by proposing a U-shaped Transformer network [3] and introducing the LAB and LCH color spaces to optimize that separation, and they achieved significant results. Wang et al. observed inconsistencies in attenuation across different color channels and spatial regions in underwater images, which led to the development of a dual-information modulation network [4] to enhance the accuracy and robustness of underwater image restoration. However, relying solely on color space for texture and color separation is insufficient for underwater images. The reason is that image texture often contains both noisy and non-noisy components, and underwater image noise typically manifests as abrupt signal changes, which belong to the high-frequency part of the image. Failing to separate these high-frequency signals further during texture extraction can lead to sub-optimal results. Therefore, in addition to color space separation, frequency domain decomposition must also be considered when processing underwater images. In this field, Li X et al. proposed the ACCE-D framework [5], in which a Difference of Gaussian (DoG) filter and a bilateral filter were used to decompose the high-frequency and low-frequency components, respectively. Soft thresholding was then applied to suppress noise in the high-frequency components. Nevertheless, ACCE-D did not employ a learning-based denoising algorithm after separation, which leaves room for improvement.
The current underwater image denoising algorithms can be classified into two main categories: model-based methods [6-15] and learning-based methods [16-30]. Model-based methods remove noise by modelling the noise distribution in the target image. Here, manually designed filters play a central role, such as bilateral filters [6], Gaussian filters, and median filters. The model-based approach defines noise as abrupt signals with significant image gradients. By smoothing these abrupt signals, it can selectively remove noise from the target image. Additionally, wavelet transform thresholding-based denoising [7] is a commonly used technique in traditional image processing. It decomposes the signal into different scales, determines thresholds based on the energy of each scale, and sets low-energy wavelet coefficients to zero to achieve denoising. The non-local means (NLM) method [8] compares each pixel in the image with similar regions in other parts of the image. Unlike traditional local denoising methods, NLM uses information from a wider area of the image, thereby better preserving details and structure. The block-matching and 3D filtering (BM3D) method [9] removes image noise by enhancing sparsity. Markov random field models [10] treat each pixel in the image as a random variable, model the interactions between pixels with an energy function, and achieve denoising by finding the configuration that minimizes that energy function. In short, model-based methods separate noise from the image and then suppress the noise components. However, these methods carry the risk of losing image details. In addition, their performance may be unsatisfactory in complex scenarios, where they can struggle to remove various types of noise effectively.
Since the introduction of CNN algorithms like AlexNet [16] and ResNet [17], CNNs have been applied to image denoising tasks continually [18-26]. DnCNN [18], proposed by Zhang K et al., was the first to apply CNNs to image denoising tasks; it formulated denoising with the model that a noisy image equals a clean image plus noise, and used this to simulate the noise removal process. RIDNet [19], proposed by Anwar S et al., used residual structures to ease the flow of low-frequency information and feature attention to explore channel correlations. ECNDNet [20], proposed by Tian C et al., used dilated convolutions to enlarge the receptive field in the denoising process. ADNet [21], proposed by Liu Z et al., utilized sparse modules, feature enhancement modules, attention modules, and reconstruction modules to build a network structure for image denoising. MSANet [22], proposed by Gou Y et al., considered both intra-scale characteristics and cross-scale feature complementarity. SADNet [23], proposed by M Chang et al., introduced encoder and decoder blocks with context to capture multi-scale information and remove noise from coarse to fine. However, current CNNs cannot perceive long-distance interactions between pixels and also lack flexibility in learning and adjusting noise models, which makes them less adaptable to different types and intensities of noise.
In recent years, some researchers have attempted to use the Transformer architecture [27-30] for image denoising, as Transformers can capture long-distance interactions between pixels. Restormer [27] focuses on multi-scale local-global representation learning on high-resolution images. It introduces modules such as Multi-Dconv Head Transposed Attention and the Gated-Dconv Feed-Forward Network to aggregate locally and non-locally related pixels and control feature transformation. KBNet [28] combines the strengths of CNNs and Transformers and introduces the Kernel Basis Attention module, which adaptively aggregates spatial neighborhood information by using learnable kernels for different local patterns. It also designs a separate lightweight convolution branch to predict the linear combination coefficients of the kernels, further improving the efficiency and performance of Transformer denoising. Combining lightweight convolutional networks with Transformers can therefore improve the convergence speed of Transformers, making it easier to apply them to low-level tasks such as image denoising.
In addition to improvements in network structures, researchers have also approached denoising from the perspective of frequency domain separation [31-36]. From the frequency domain viewpoint, noise is primarily concentrated in the high-frequency signal region [31], which is characterized by sharp changes and is difficult to restore. The approach therefore uses a high-low frequency separation algorithm to divide the input image into high-frequency and low-frequency components. Denoising methods based on frequency domain separation include Fourier decomposition [32], wavelet decomposition [33], Laplacian high-low frequency decomposition [34], discrete cosine decomposition [35], and Gaussian blur decomposition [36]. CFPNet [35], proposed by Zhang K et al., employed discrete cosine decomposition to separate the image into high and low frequencies and then processed these components individually using convolutional neural networks, thereby enhancing the ability to handle high-frequency signals.
Wang L et al. used wavelet decomposition [33] to separate high and low frequencies and processed these components separately. However, methods like wavelet decomposition and discrete cosine decomposition are time-consuming and produce a large number of decomposed components. When convolutional networks learn from these extensive components, the computational load increases significantly. To reduce the time consumed by high-low frequency separation, Kang J et al. proposed the FSformer [36] image denoising network, which used a Gaussian blur kernel-based separation method. This method divided the input image into high- and low-frequency components and reduced processing time effectively compared with wavelet decomposition. FSformer employed Transformer-based low-frequency (LFB) and high-frequency (HFB) modules to process the respective components separately and then merged them to obtain the denoised image. While the aforementioned methods successfully separated high and low frequencies and addressed the issue of slow decomposition, they did not differentiate the treatment of high- and low-frequency signals in their network structures, even though noise is primarily concentrated in the high-frequency signal region.
In recent research, lightweight diffusion models have been applied to underwater image denoising tasks [37,38]. DM-Water [37], proposed by YI Tang et al., uses diffusion models for image enhancement in underwater scenes. It generates the enhanced image by taking the underwater image and Gaussian noise as input. Additionally, to improve the efficiency of the reverse process in diffusion models, the authors employed a lightweight Transformer-based denoising network to speed up both training and inference. WF-Diff [38], proposed by Chen Zhao et al., combined wavelet spatial frequency information of underwater images with diffusion models and achieved state-of-the-art performance on several public datasets. However, diffusion models are generative by nature, which makes it difficult for the generated images to retain the original information of the underwater images. Moreover, diffusion models require significant computational resources.
To address the noise problem in underwater images, this paper proposes an algorithm called HHDNet, which is specialized to remove noise caused by environmental disturbances and technical constraints during underwater robot photography, thereby improving the overall quality and clarity of images. Since noise in underwater images mainly concentrates in high-frequency abrupt signals, the HHDNet algorithm adopts a global residual learning approach. It decomposes RGB images into high-frequency and low-frequency components by using high-low frequency separation and utilizes a dual-branch network architecture to process the high- and low-frequency parts independently. It also strengthens the perception and elimination of high-frequency abrupt noise during training. The contributions of this paper are as follows: (

Underwater Image Denoising Network
Figure 1 shows the structure of HHDNet, which adopts a dual-branch network architecture. Each branch is constructed by stacking multiple cascaded feature extraction modules, enabling the deep extraction of various image features layer by layer and enhancing the network's feature extraction capability. When processing images, the network first decomposes the degraded input image into high-frequency and low-frequency layers by using high-low frequency decomposition and then feeds them into the two branches for processing. The high-frequency branch uses eight GCE (Global Context Extractor) modules for high-frequency residual learning to remove high-frequency noise while preserving details. The low-frequency branch performs low-frequency residual learning through four residual convolution modules to restore the image's basic structure. After learning, the residual features output by the high-frequency and low-frequency branches are added to the original layers, thus obtaining the denoised high-frequency and low-frequency information for precise reconstruction. Finally, the denoised layers are concatenated, and a global residual is obtained by convolution fusion with a 3 × 3 filter; this residual is added to the original noisy image to obtain the clean image.
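The data flow described above can be sketched in a few lines of NumPy. This is only an illustration of how the residuals are wired, not the authors' implementation: `hhdnet_forward` and every callable passed to it (`toy_decompose`, `zero_residual`, `zero_fuse`) are hypothetical stand-ins for the learned modules, and the toy decomposition uses a global mean instead of a Gaussian blur.

```python
import numpy as np


def hhdnet_forward(noisy, decompose, hf_branch, lf_branch, fuse):
    """Wiring only: each callable is a hypothetical stand-in for a learned module."""
    hf, lf = decompose(noisy)                # high- and low-frequency layers
    hf_clean = hf + hf_branch(hf)            # branch residual added to its own layer
    lf_clean = lf + lf_branch(lf)
    stacked = np.concatenate([hf_clean, lf_clean], axis=-1)  # channel concatenation
    return noisy + fuse(stacked)             # global residual added to the noisy input


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_decompose(img):
    lf = np.full_like(img, img.mean())       # crude "blur": the global mean
    return np.abs(img - lf), lf


def zero_residual(layer):
    return np.zeros_like(layer)              # a branch that predicts no correction


def zero_fuse(stacked):
    return np.zeros(stacked.shape[:-1] + (3,), dtype=stacked.dtype)


image = np.random.default_rng(0).random((8, 8, 3))
restored = hhdnet_forward(image, toy_decompose, zero_residual, zero_residual, zero_fuse)
```

With branches and fusion that predict a zero residual, the output equals the input, which is the property that makes global residual learning easy to initialise.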

High-Low Frequency Separation
HHDNet uses Gaussian blur for high-low frequency decomposition to separate high-frequency and low-frequency information. Gaussian blur is an image processing technique that reduces image noise and detail, resulting in a smoother image. By adjusting the parameters of the Gaussian blur, the degree of blur applied to different frequency components in the image can be controlled. After blurring, the processed layer is compared with the original layer to extract high-frequency and low-frequency information, thus achieving high-low frequency separation. Assuming the input image is $I$, the Gaussian function is $G$, the mean of the Gaussian is $\mu$, and the variance is $\theta$, the high-low frequency decomposition of the image can be represented as:
$$LF = G(I; \mu, \theta) \qquad (1)$$
$$HF = \left| I - LF \right| \qquad (2)$$
As shown in Equations (1) and (2), the input RGB image undergoes Gaussian blur processing, which yields the low-frequency information (LF). The high-frequency information (HF) is obtained by taking the absolute difference between the RGB image and the low-frequency information. High-frequency information typically corresponds to abrupt signals with significant gradients in the image, while low-frequency information represents the overall structure and colors of the image.
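As a rough illustration of this decomposition, the following NumPy sketch blurs an image with a separable Gaussian kernel and takes the absolute difference. The function names, the zero-mean kernel, the default `sigma=2.0`, the 3-sigma kernel radius, and the edge padding are all assumptions made for the example; the text does not state the values used in HHDNet.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius=None):
    """Normalised, zero-mean 1-D Gaussian kernel; radius defaults to about 3 * sigma."""
    if radius is None:
        radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_blur(image, sigma):
    """Separable Gaussian blur over the two spatial axes of an (H, W, C) array."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(image.astype(np.float64), ((r, r), (r, r), (0, 0)), mode="edge")
    # Filter along the height axis, then the width axis; "valid" removes the padding.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, rows)


def split_frequencies(image, sigma=2.0):
    """Return (HF, LF): LF is the blurred image, HF is |image - LF|."""
    lf = gaussian_blur(image, sigma)
    hf = np.abs(image - lf)
    return hf, lf


flat = np.full((10, 12, 3), 0.5)
hf_flat, lf_flat = split_frequencies(flat)
```

A perfectly flat image has no abrupt signal, so its high-frequency layer is zero and its low-frequency layer is the image itself.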

Global Context Extractor
In the high-frequency branch, eight cascaded Global Context Extractor (GCE) modules are utilized. The GCE module integrates a convolution group (ConvGroup) and a cross-attention group, thereby enhancing the effectiveness of high-frequency image denoising. The role of the ConvGroup is to extract local features from the image and to use its inductive bias to quickly identify and focus on areas with significant gradient changes during the early stages of training. The cross-attention group, in turn, has a more comprehensive long-distance perception and dependency capability, and thus extracts global contextual information effectively. The GCE, constructed by combining the convolution group and the cross-attention group, can selectively receive high-frequency images during training and process the abrupt signals within them.
The GCE module is shown in Figure 2. The feature map first undergoes preliminary processing through a ConvGroup, which includes convolution layers, batch normalization (BN), and depthwise separable convolution (DWConv). The ConvGroup is defined as follows:
$$\mathrm{ConvGroup}(Z) = \mathrm{DWConv}_{3\times3}\big(\mathrm{BN}(\mathrm{Conv}_{1\times1}(Z))\big) + Z \qquad (3)$$
As shown in Equation (3), assuming the input feature is $Z$, it first undergoes feature extraction using a 1 × 1 convolution. Then, batch normalization (BN) is applied to normalize the feature map, enhancing the stability and convergence speed of the model. Next, a 3 × 3 depthwise separable convolution (DWConv) further refines the features, reducing model complexity while maintaining high performance. The processed features are then added to the original input feature map to enable residual learning and alleviate the vanishing gradient problem during the training of deep neural networks.
After preliminary feature extraction in the convolution group, its output is passed into the cross-attention group. The cross-attention group consists of a layer normalization (LN) layer and a cross-attention module. LN is a normalization technique that normalizes the features across channels, providing stability during training. The cross-attention module facilitates information exchange between different parts of the input, allowing the model to focus on relevant areas for better performance in image denoising tasks.
The cross-attention module is shown in Figure 3. The input feature map is first split along the channel dimension to obtain two feature subsets, F1 and F2, each with half the number of channels of the original input. Different global pooling methods are applied to F1 and F2 for feature aggregation. F1 is processed through global average pooling to obtain mean information from all positions in the feature map, while F2 undergoes global max pooling. After pooling, F1 and F2 are compressed into feature vectors of size 1 × 1 × C/2. To further refine the feature representation, a strategy of dimensionality reduction followed by dimensionality expansion is utilized:
$$Att_k = \mathrm{Conv}_{1\times1}^{\,d \to C/2}\big(\mathrm{Conv}_{1\times1}^{\,C/2 \to d}(\mathrm{Pool}_k(F_k))\big), \quad d = \frac{C}{2r}, \quad k \in \{1, 2\} \qquad (4)$$
As shown in Equation (4), $d$ represents the number of channels after compressing either F1 or F2, and $\mathrm{Pool}_1$ and $\mathrm{Pool}_2$ denote global average pooling and global max pooling, respectively. The feature undergoes a 1 × 1 convolution that reduces the number of channels in the feature vector to C/2r, where r is the dimension reduction factor. Subsequently, another 1 × 1 convolution layer increases the number of channels back to C/2. After the dimensionality reduction and expansion operations, an attention score vector with the same number of channels as the corresponding feature subset is obtained. Assuming the input is $X$ and the attention scores are $Att_1$ and $Att_2$, the cross-attention module is defined as:
$$\mathrm{CrossAtt}(X) = \mathrm{Concat}(Att_1 \odot F_1,\; Att_2 \odot F_2) + X \qquad (5)$$
As shown in Equation (5), the weighted feature representations are obtained by element-wise multiplication of Att1 and Att2 with the original F1 and F2, respectively. After the weighted F1 and F2 are concatenated, they are added to the input feature from before the channel split, and the sum serves as the output of the cross-attention module.
During training, F1 and F2 perceive each other and integrate information between the two branches through cross-attention. The cross-attention module optimizes the attention computation based on the module's final output, which keeps the attention calculation consistent while fully capturing and utilizing the complex features of the input data. It also explores the dependency relationships in the noisy regions from multiple perspectives.
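The cross-attention steps described in this section (channel split, average and max pooling, channel reduction and expansion, reweighting, concatenation, and the residual addition) can be sketched with NumPy as follows. This is a simplified reading of the text rather than the authors' code: the function name and weight arguments are hypothetical, biases are omitted, and no activation is applied to the scores because the text does not specify one.

```python
import numpy as np


def cross_attention(x, w_down1, w_up1, w_down2, w_up2):
    """x: (H, W, C) feature map; w_down*: (C/2, d); w_up*: (d, C/2), with d = C / (2r)."""
    c = x.shape[-1]
    f1, f2 = x[..., : c // 2], x[..., c // 2:]   # split along the channel axis
    p1 = f1.mean(axis=(0, 1))                    # global average pooling -> (C/2,)
    p2 = f2.max(axis=(0, 1))                     # global max pooling -> (C/2,)
    # On a 1 x 1 x C/2 vector, a 1 x 1 convolution reduces to a matrix product.
    att1 = (p1 @ w_down1) @ w_up1                # C/2 -> d -> C/2
    att2 = (p2 @ w_down2) @ w_up2
    weighted = np.concatenate([f1 * att1, f2 * att2], axis=-1)
    return weighted + x                          # residual connection to the input


rng = np.random.default_rng(0)
C, r = 8, 2
d = C // (2 * r)                                 # reduced channel count
features = rng.random((4, 5, C))
weights = [rng.standard_normal(s) for s in [(C // 2, d), (d, C // 2)] * 2]
out = cross_attention(features, *weights)
```

Because the weighted features are added back to the input, the module reduces to the identity when the attention scores are zero, so it can only refine the incoming features rather than replace them.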

Residual Block
The low-frequency component contains information such as color, saturation, and brightness, which is not included in the high-frequency component. This information collectively constitutes the basic color and overall perception of the image, and therefore plays an important role in underwater image denoising tasks. However, there is less noise information in the low-frequency part, so there is no need to use computationally intensive and structurally complex modules. This paper uses low-complexity residual blocks [17] to construct the network structure for processing the low-frequency component, which can remove noise while preserving the original features of the low-frequency part.
The structure of the residual learning module is shown in Figure 4. It consists of two convolutional blocks that learn residual components through convolutional operations; the learned residual is then added to the original components. Each convolutional block contains a 3 × 3 convolution, Instance Normalization [39] (IN), and a Parametric Rectified Linear Unit (PReLU). IN is a normalization method that normalizes each channel of each input sample individually. PReLU is an activation function that improves upon the traditional ReLU function by introducing a learnable parameter that adaptively adjusts the shape of the activation function in the negative region. The residual learning module can retain input information while learning and extracting more useful low-frequency feature representations.
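To make the two less familiar components concrete, here is a minimal NumPy sketch of instance normalization, PReLU, and the way they are combined in a residual block. The convolutions are passed in as callables (`conv1`, `conv2`) because a full 3 × 3 convolution is beyond the scope of the example; those names, the omission of IN's optional affine parameters, and the slope value 0.25 are assumptions for illustration.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """x: (N, C, H, W). Normalise each channel of each sample over its H x W values."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def prelu(x, slope):
    """Identity for positive inputs; negative inputs are scaled by a learnable slope."""
    return np.where(x > 0, x, slope * x)


def conv_block(x, conv, slope):
    """Convolution -> IN -> PReLU; `conv` is a hypothetical stand-in for a 3 x 3 conv."""
    return prelu(instance_norm(conv(x)), slope)


def residual_block(x, conv1, conv2, slope1=0.25, slope2=0.25):
    """Two convolutional blocks whose output is added back to the input."""
    return x + conv_block(conv_block(x, conv1, slope1), conv2, slope2)


def identity(t):
    return t                                     # placeholder convolution


batch = np.random.default_rng(0).standard_normal((2, 3, 16, 16))
normed = instance_norm(batch)
activated = prelu(np.array([-2.0, 3.0]), 0.25)
block_out = residual_block(batch, identity, identity)
```

Unlike batch normalization, the statistics here are computed per sample and per channel, so the result for one image does not depend on the other images in the batch.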

Loss Function and Optimizer
Underwater image denoising based on deep learning uses a loss function to quantify the difference between actual values and predicted values. A smaller loss indicates better algorithm performance. In the training of HHDNet, image noise is defined as high-frequency abrupt signals. To handle these signals, this paper chooses the MAE loss function, also known as the L1 loss function, for supervision. It is given in Equation (6), where $N$ represents the total number of training samples, $x_i$ represents the image after denoising by the network, and $y_i$ represents the true noise-free image:
$$L_1 = \frac{1}{N} \sum_{i=1}^{N} \left\lVert x_i - y_i \right\rVert_1 \qquad (6)$$
Throughout the model training process, the optimizer plays a crucial role in updating parameters and guiding the model toward its optimal state. The Adam optimizer combines the advantages of AdaGrad and RMSProp. It calculates the update step size from estimates of the first and second moments of the gradients. Its simple implementation and low memory consumption make Adam particularly suitable for models with large-scale data and parameters. Therefore, this paper chooses Adam to optimize the model during training.
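The loss and the optimizer update can be illustrated with a short NumPy sketch. It is a generic illustration, not the training code of HHDNet: `mae_loss` averages over every pixel (which differs from a per-image sum only by a constant factor), and `adam_step` follows the standard Adam update with its commonly used default hyperparameters, which the text does not confirm beyond the learning rate.

```python
import numpy as np


def mae_loss(pred, target):
    """Mean absolute error (L1 loss), averaged over all elements."""
    return np.mean(np.abs(pred - target))


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


loss = mae_loss(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, 5.0]))

# A single step on one parameter whose gradient is -1.
p, m1, v1 = adam_step(np.array(0.0), np.array(-1.0), 0.0, 0.0, t=1)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.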

Experimental Setup
The underwater data used in this experiment are drawn from the data source for the Underwater Robot Picking Competition (URPC) organized by the National Natural Science Foundation of China. The dataset used in this paper is URPC2019, consisting of images captured by underwater robots using cameras. The dataset contains 5543 images with a resolution of 640 × 480. It is divided into training and testing sets in a 7:3 ratio: the training set includes 3880 ground truth images, while the testing set includes 1663 ground truth images. To train our HHDNet, Gaussian noise is added to the dataset at noise levels of 15, 25, and 50.

The proposed HHDNet and the comparison models are run on a single NVIDIA GeForce RTX 3090 graphics card. HHDNet is trained on 64 × 64 input and output patches cropped from the original training set. Training sessions are conducted separately for RGB color images with a fixed batch size of 16 and a learning rate of $1 \times 10^{-3}$. Data augmentation techniques are applied to enhance dataset diversity, including random vertical and horizontal flips, along with 90-degree rotations. Network parameter optimization during training is accomplished using the Adam optimizer.
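The data preparation described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' pipeline: the helper names are hypothetical, the noise level is interpreted as the standard deviation on the 0 to 255 intensity scale, noise is added after cropping and augmentation so that the noisy and clean patches stay aligned, and a single-channel synthetic image stands in for an RGB URPC2019 frame.

```python
import random

def split_counts(total, train_ratio=0.7):
    """Return (train, test) sizes for a ratio split, rounding the train size down."""
    train = int(total * train_ratio)
    return train, total - train

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with std `sigma` and clip to [0, 255]."""
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

def random_crop(image, size, rng):
    """Cut a size x size patch at a random position."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in image[top:top + size]]

def augment(patch, rng):
    """Random horizontal and vertical flips, then 0 to 3 rotations by 90 degrees."""
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]              # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]                               # vertical flip
    for _ in range(rng.randint(0, 3)):
        patch = [list(col) for col in zip(*patch[::-1])]  # rotate 90 degrees clockwise
    return patch

def make_training_pair(clean_image, sigma, rng, patch_size=64):
    """Return (noisy_patch, clean_patch) sharing the same crop and augmentation."""
    clean = augment(random_crop(clean_image, patch_size, rng), rng)
    return add_gaussian_noise(clean, sigma, rng), clean

# Usage with a small synthetic image (a real frame would be 640 x 480).
rng = random.Random(0)
image = [[float((x + y) % 256) for x in range(128)] for y in range(96)]
train_size, test_size = split_counts(5543)
noisy, clean = make_training_pair(image, sigma=25, rng=rng)
```

Rounding the 70% share of 5543 images down gives the 3880 and 1663 image counts reported for the training and testing sets.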

Evaluation Metrics
In this paper, we use UCIQE, UIQM, PSNR, and SSIM to evaluate the performance of HHDNet. UCIQE and UIQM are primarily used for evaluating underwater image restoration tasks, while PSNR and SSIM are commonly used as metrics for image denoising tasks.
(1) The Underwater Color Image Quality Evaluation Index [40] (UCIQE) is a metric used for comprehensively evaluating the quality of color images. It evaluates color images from three aspects: the mean value of saturation, the standard deviation of hue, and the mean value of contrast. The larger the UCIQE value, the better the overall color quality of the image. The index is defined as a weighted sum of these components, where $c_1$, $c_2$, and $c_3$ are weights assigned to the components based on their importance in the overall image quality evaluation, usually set as $c_1 = 0.4680$, $c_2 = 0.2745$, and $c_3 = 0.2576$; $\sigma_c$ is the standard deviation of contrast, $\mu_s$ is the mean value of saturation, and $\sigma_h$ is the standard deviation of hue.
(2) The Underwater Image Quality Measure index [41] (UIQM) is used to assess the quality of underwater images, focusing on three aspects: colorfulness, sharpness, and contrast. Colorfulness measures the naturalness and vividness of colors, contrast reflects the ability to distinguish objects and details in the image, and sharpness relates to the clarity of details and structures. By combining these factors, the UIQM index provides an evaluation of the overall quality of underwater images, where a higher value indicates better image quality. The formula for UIQM is typically given as:

$$\mathrm{UIQM} = c_1 \times \mathrm{UICM} + c_2 \times \mathrm{UISM} + c_3 \times \mathrm{UIConM} \qquad (8)$$

where $c_1$, $c_2$, and $c_3$ are here the weighting coefficients of the three UIQM components. The Underwater Image Colorfulness Measure (UICM) evaluates color richness and naturalness, the Underwater Image Sharpness Measure (UISM) assesses image sharpness and clarity, and the Underwater Image Contrast Measure (UIConM) measures image contrast and the distinction of objects. The UIQM index provides a quantitative measure of underwater image quality, which is crucial for assessing the effectiveness of image enhancement techniques on underwater images.
(3) The Peak Signal-to-Noise Ratio (PSNR) is used as an evaluation metric to measure the enhancement effect of HHDNet. Given the height and width of the input image as $H$ and $W$, respectively, the enhanced image is denoted as $I_c$, and the original noisy image is denoted as $I_n$. The mean squared error (MSE) between the enhanced image and the original image is defined as:

$$\mathrm{MSE} = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( I_c(i, j) - I_n(i, j) \right)^2 \qquad (9)$$

The PSNR between the enhanced image and the original image is defined as:

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) \qquad (10)$$

$MAX_I$ represents the maximum pixel value of the image. If each pixel is represented by a $B$-bit binary number, then $MAX_I = 2^B - 1$. In this paper, each pixel is represented by an 8-bit binary number, so $MAX_I$ is 255.
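(4) In addition, we also use the Structural Similarity Index [42] (SSIM) to measure the similarity in brightness, contrast, and structure between samples $x$ and $y$:

$$l(x, y) = \frac{2 \mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1} \qquad (11)$$

$$c(x, y) = \frac{2 \sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \qquad (12)$$

$$s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3} \qquad (13)$$

where $\mu_x$ and $\mu_y$ are the means of $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$; $\sigma_{xy}$ is the covariance between $x$ and $y$; and $c_1 = (k_1 MAX_I)^2$ and $c_2 = (k_2 MAX_I)^2$ are two constants. The constants avoid division by zero, and we set $c_3 = c_2 / 2$. $MAX_I$ represents the maximum value of pixels in a $B$-bit image, which is 255 in this paper. By default, $k_1 = 0.01$ and $k_2 = 0.03$, and then we have:

$$\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma}$$

When $\alpha = \beta = \gamma = 1$, we have:

$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + c_1)(2 \sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$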
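The PSNR and SSIM definitions above can be checked numerically with a short sketch. The code below is a simplified illustration in plain Python, with hypothetical function names. It assumes single-channel images stored as nested lists, and it evaluates SSIM once over the whole image, whereas practical implementations usually average SSIM over local sliding windows.

```python
import math

def mse(a, b):
    """Mean squared error between two images of identical size."""
    h, w = len(a), len(a[0])
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(h) for j in range(w)) / (h * w)

def psnr(a, b, bits=8):
    """PSNR in dB, with MAX_I = 2**bits - 1 (255 for 8-bit images)."""
    max_i = 2 ** bits - 1
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_i ** 2 / err)

def global_ssim(x, y, bits=8, k1=0.01, k2=0.03):
    """SSIM with alpha = beta = gamma = 1 and c3 = c2 / 2, over the whole image."""
    max_i = 2 ** bits - 1
    c1, c2 = (k1 * max_i) ** 2, (k2 * max_i) ** 2
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((p - mu_x) ** 2 for p in xs) / n
    var_y = sum((q - mu_y) ** 2 for q in ys) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(xs, ys)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Usage: compare a tiny reference image with a uniformly brightened copy.
reference = [[10.0, 20.0], [30.0, 40.0]]
shifted = [[p + 10.0 for p in row] for row in reference]
psnr_value = psnr(shifted, reference)
ssim_same = global_ssim(reference, reference)
ssim_shifted = global_ssim(shifted, reference)
```

In this example the brightness shift leaves the variances and the covariance unchanged, so only the luminance term pulls the SSIM below 1.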

Experimental Results
HHDNet employs a strategy of high-low frequency separation, utilizing a Gaussian blur-based approach for separation. Compared to other separation methods, Gaussian blur kernel high-low frequency separation is a real-time processing method. Table 1 provides an inference time comparison of Fourier decomposition, wavelet decomposition, Laplacian decomposition, discrete cosine decomposition, and Gaussian blur decomposition.

HHDNet utilizes a high-low frequency decomposition strategy and also employs the GCE Block to process the high-frequency component. To validate the effectiveness of each improvement, this paper conducts ablation experiments. First, the high-low frequency decomposition strategy is removed to verify its contribution to improving accuracy. Second, the ResBlock and the GCE module are compared in terms of accuracy improvement. Additionally, we report the inference time for each ablation experiment. Ultimately, when the low-frequency branch utilizes ResBlock and the high-frequency branch employs GCEBlock, the model achieves a good balance between accuracy and inference time. The results are shown in Tables 2 and 3.

In HHDNet, high-low frequency decomposition is performed using Gaussian blur kernels. To determine the optimal Gaussian kernel size, we compare the impact of different Gaussian kernels on the UCIQE, UIQM, PSNR, and SSIM metrics, as shown in Tables 4 and 5. At the same noise level, when Ksize increases from 3 × 3 to 5 × 5, the UCIQE, UIQM, PSNR, and SSIM values all improve. However, as Ksize continues to increase to 7 × 7 and beyond, the improvement in the metrics becomes very limited, and there is even a slight decrease in some cases. Therefore, this paper ultimately uses a Ksize of 5 × 5 as the parameter for the Gaussian kernel in the high-low frequency decomposition.
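A Gaussian blur-based high-low frequency decomposition with a 5 × 5 kernel can be sketched as follows. This is a minimal illustration, not the HHDNet implementation: it assumes that the low-frequency component is the blurred image and the high-frequency component is the residual (image minus blur), that the Gaussian standard deviation is 1.0 (a value the paper does not state), and that borders are handled by replicating edge pixels. All function names are hypothetical.

```python
import math

def gaussian_kernel_1d(ksize, sigma):
    """Normalised 1-D Gaussian weights; a ksize x ksize kernel is their outer product."""
    half = ksize // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels at the borders."""
    half = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(width - 1, max(0, x + d - half))] for d, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, ksize=5, sigma=1.0):
    """Separable Gaussian blur: filter along rows, then along columns."""
    kernel = gaussian_kernel_1d(ksize, sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))

def decompose(image, ksize=5, sigma=1.0):
    """Return (low, high), where low is the blurred image and high = image - low."""
    low = gaussian_blur(image, ksize, sigma)
    high = [[p - q for p, q in zip(row, low_row)] for row, low_row in zip(image, low)]
    return low, high

# Usage: a flat image with one abrupt spike, a stand-in for impulse-like noise.
image = [[100.0] * 8 for _ in range(8)]
image[4][4] = 255.0
low, high = decompose(image, ksize=5, sigma=1.0)
reconstructed = [[l + h for l, h in zip(lr, hr)] for lr, hr in zip(low, high)]
```

In this sketch the spike ends up mostly in the high-frequency component, while flat regions far from it pass unchanged into the low-frequency component, and adding the two components recovers the input.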
We conduct comparative experiments using ten methods: NLM, BM3D, DnCNN-B, RIDNet, ECNDNet-L, ADNet-L, MSANet, SADNet, DM-Water, and WFI2-Diff. These ten methods are tested alongside our proposed HHDNet algorithm on the URPC2019 dataset. Our algorithm outperforms the other methods in terms of UCIQE, UIQM, PSNR, and SSIM on the URPC2019 test set, as shown in Tables 6 and 7. At a relatively low noise level with Sigma = 15, the proposed HHDNet algorithm achieves a UCIQE value of 0.631 and a UIQM value of 5.128. As the noise level increases to Sigma = 25, the UCIQE value of the HHDNet algorithm decreases to 0.598, with a UIQM value of 4.728. As the noise level increases to Sigma = 50, the UCIQE value decreases to 0.557, with a UIQM value of 4.379.
At a relatively low noise level with Sigma = 15, the proposed HHDNet algorithm achieves a PSNR value of 31.554 and an SSIM value of 0.9421, showing significant advantages over the other compared algorithms and indicating its effectiveness in restoring image quality and preserving structural information at this noise level. As the noise level increases to Sigma = 25, the PSNR value of the HHDNet algorithm decreases to 29.051, with an SSIM value of 0.9024, still surpassing the other compared algorithms and demonstrating its stability and ability to preserve image structure across different noise levels. In the extreme case of high noise with Sigma = 50, although all algorithms experience a significant drop in SSIM values, HHDNet still achieves a PSNR value of 26.005 and an SSIM value of 0.8248, showing its capability to recover images and preserve structure even under extremely high noise levels.
The total number of model parameters (Parameters), model computational complexity (FLOPs), and inference time reflect, to some extent, the model's complexity. If the total

Figure 8. When Sigma = 50, visualize the error map (difference and the distribution of error between the prediction and the ground truth). Then, select the image of Figure 7 and denoise it using different algorithms.
These images clearly indicate that the denoising results produced by our algorithm are significantly clearer and preserve image details effectively.Additionally, both UCIQE and UIQM metrics are higher.
From the visualization results, the HHDNet algorithm demonstrates good structural preservation performance under both low and high noise levels, especially in low to moderate noise levels, where it performs exceptionally well. Compared with other algorithms, HHDNet still achieves relatively high UCIQE and UIQM values.

Conclusions
This paper proposes an underwater image denoising algorithm named HHDNet. The algorithm adopts a dual-branch network architecture for high- and low-frequency components and integrates a hybrid-attention GCE module to enhance and accurately identify high-frequency noise spike signals, thus effectively removing noise generated during underwater robot photography due to complex environments and technical limitations. It not only surpasses existing methods in denoising performance on the URPC2019 dataset but also demonstrates significant advantages in computational efficiency, performing underwater image denoising more precisely and efficiently. The proposed method significantly improves the visual quality of denoised underwater images and could contribute to vision-based underwater tasks such as subsequent underwater detection and segmentation.
As underwater scientific research and industrial applications develop further, the demand for high-quality underwater images is becoming increasingly urgent. HHDNet could bring significant improvement in underwater image denoising technology and provide strong support for further development. In particular, HHDNet demonstrates notable advantages in computational efficiency, thereby enhancing resource usage efficiency. Its denoising performance and lower resource consumption make HHDNet well suited as a preprocessing step for tasks such as target detection and image segmentation. We believe that HHDNet will bring practical value to the development of underwater image processing.
Nonetheless, HHDNet still has room for improvement. The formation of underwater image noise is complex and diverse, with significant variances in the distribution of different noise types. In practical applications, denoising models need to map a noise domain containing multiple types of noise to the high-quality image domain, which is essentially a many-to-many task. HHDNet is currently confined to supervised training with a pixel-wise distance loss only, which may lead the training process to converge to an average solution. To further enhance denoising effectiveness, we could explore advanced techniques such as adversarial networks or diffusion models to address this issue and promote the continuous advancement of underwater image processing technology.

Figure 2. Structure of Global Context Extractor.

Figure 5. When Sigma = 15, select an image from the URPC2019 test set and denoise it by using different algorithms.

Figure 6. When Sigma = 15, visualize the error map (difference and the distribution of error between the prediction and the ground truth). Then, select the image of Figure 5 and denoise it using different algorithms.
Sensors 2024, 24, x FOR PEER REVIEW 17 of 21

Figure 7. When Sigma = 50, select an image from the URPC2019 test set and denoise it by using different algorithms.

Table 1. Inference time comparison of Fourier decomposition, wavelet decomposition, Laplacian decomposition, discrete cosine decomposition, and Gaussian blur decomposition. Bold represents the shortest time.

Table 2. HHDNet's ablation experiments in terms of average UCIQE and UIQM at noise levels 15, 25, and 50. × indicates that High-Low Frequency Decomposition is not used, √ indicates that High-Low Frequency Decomposition is used, and bold represents the configuration used by HHDNet.

Table 3. HHDNet's ablation experiments in terms of average PSNR and SSIM at noise levels 15, 25, and 50. × indicates that High-Low Frequency Decomposition is not used, √ indicates that High-Low Frequency Decomposition is used, and bold represents the configuration used by HHDNet.

Table 4. HHDNet's high-low frequency decomposition with different Gaussian kernel parameters in terms of average UCIQE and UIQM at noise levels 15, 25, and 50. Bold represents the configuration used by HHDNet.

Table 5. HHDNet's high-low frequency decomposition with different Gaussian kernel parameters in terms of average PSNR and SSIM at noise levels 15, 25, and 50. Bold represents the configuration used by HHDNet.

Table 6. The average UCIQE and UIQM values of different algorithms at noise levels 15, 25, and 50 on the URPC2019 test set. Bolded "HHDNet" represents the algorithm proposed in this paper, and bolded metrics represent the best results in the comparative experiments.

Table 7. The average PSNR and SSIM values of different algorithms at noise levels 15, 25, and 50 on the URPC2019 test set. Bolded "HHDNet" represents the algorithm proposed in this paper, and bolded metrics represent the best results in the comparative experiments.