Open Access
Offset flow-guide transformer network for semisupervised real-world video denoising
27 January 2024
Lihui Sun, Heng Chen, Jialin Li, Chuyao Wang
Abstract

Video denoising is a fundamental task in low-level computer vision. Most existing denoising algorithms learn from synthetic data. However, there is a significant difference between the noise distributions of synthetic and natural data, which leads to poor generalization of such models in real scenes. Hence, a video denoising method based on an offset optical flow-guided transformer is proposed. The proposed method adopts a semisupervised framework to improve generalization, designs an offset optical flow to guide the transformer in capturing critical information, and performs global self-similarity modeling using neighboring spatiotemporal features to improve denoising performance. In addition, contrastive learning is introduced in the supervised branch to prevent fitting to wrong labels, imaging prior information is used to mine sequence features in the unsupervised branch, and a two-branch memory loss is introduced to reduce the discrepancy between the two training branches. Experimental results on synthetic and real videos demonstrate that our method achieves clear quantitative and qualitative improvements over state-of-the-art methods with fewer parameters.

1.

Introduction

With the wide application of digital devices in scenarios such as handheld photography, target tracking, and autonomous driving, the low-level vision task of video denoising has become increasingly critical, requiring denoising algorithms not only to improve the visual quality of sequences but also to generalize well in complex natural environments.

Traditional multiframe denoising methods extend image priors to the temporal dimension,1 for example, extending self-similarity across sequences2 for patch search and compensation, or using variational image priors to smooth the noise. However, a generic prior applies only to isolated situations and cannot cover all scenarios. With the development of deep learning, algorithms based on convolutional neural networks (CNNs) have been proposed, whose powerful representation ability can exploit temporal redundancy and improve denoising performance. For example, prior features have been progressively incorporated into the convolutional kernel3 to reduce local redundancy; strategies such as optical flow4,5 and deformable convolution6 are used to achieve frame alignment; and sequence modeling is implemented using U-shaped structures7 and recurrent neural network (RNN) architectures1,8,9 to adequately propagate sequence features. However, CNN models essentially learn within a fixed receptive field, and limitations remain in their ability to capture long-range spatiotemporal dependencies and nonlocal self-similarity.

Recently, vision transformer (ViT) approaches have bridged this gap. The transformer captures correlations between pixels through a global attention mechanism, akin to the nonlocal self-similarity property of images,1 and enables the modeling of long-range spatial dependencies; however, existing methods still face some problems. First, the transformer's processing of multiple input sequences leads to substantial computational overhead. Although approaches such as the global sliding window,10 recurrent frame-by-frame parallel processing, and the incorporation of the wavelet transform11,12 reduce computational redundancy, they remain deficient in boundary processing and are difficult to train. Second, to fully utilize temporal redundancy, the introduction of optical flow13 or implicit alignment strategies enhances sequence consistency; however, it also increases computation time, and the literature14 has confirmed that existing alignment methods can degrade the performance of the original ViT.

Most methods use synthetic noise sequences for training and verification. Owing to the mismatch between the probability model of synthetic noise and the degradation mechanism of real image sequences, overfitting easily occurs, and the denoising effect in real scenes is poor. Currently, some scholars are committed to semisupervised or unsupervised research. MF2F15 uses fine-tuning to minimize losses and reduce input/output differences, and UDVD16 extends image blind-spot techniques to video sequences; these are two typical unsupervised algorithms. However, their denoising results tend to lose details, and their generalization performance is insufficient, so the denoising effect is not ideal. In addition, noise-reduction methods exist that operate on the original RAW sequence.17,18 Although their denoising performance is good, they have yet to be widely used in practical applications. Therefore, it must be considered how to make full use of the limited number of real noise sequences to alleviate the domain-bias problem.

To solve the above problems, this paper proposes a flow-guide double transformer (FGDFormer) video denoising method. The method exploits the long-range temporal modeling ability of the ViT to build an attention block guided by an offset optical flow: matching key features are found in adjacent frames under the guidance of the flow and used in self-attention calculations with query elements in the reference frame. In addition, FGDFormer is trained with two branches. The supervised branch is trained on synthetic data and introduces a contrastive regularization constraint to improve the visual quality of denoising, whereas the unsupervised branch is trained on natural noise sequences, using image prior features and a double memory loss as corrective constraints. In summary, the main contributions of the proposed method are as follows.

  • (1) The use of an offset optical flow as a guide for the transformer to calculate self-attention is proposed. First, flow guidance can reduce redundant self-attention calculations and provide image-prior features. Second, the offset optical flow avoids the inaccuracy of the original optical flow.

  • (2) The design uses a contrastive regularization term to constrain supervised training. In the feature space, the denoising sequence is closer to the clean sequence and retains the sequence details. To the best of our knowledge, this is the first study to explore contrastive learning in the field of video denoising.

  • (3) In unsupervised branches, an image prior is introduced to preserve the sequence structure and details, and dual-branch memory loss is proposed to reduce the difficulty of semisupervised learning and the difference in the denoising effect between dual branches.

2.

Method

2.1.

Architecture

In this study, a semisupervised approach was used to train the FGDFormer to mitigate the differences between natural and synthesized data; the overall architecture is shown in Fig. 1. Specifically, the synthesized video dataset $\{(I_i, L_i)\}_{i=1}^{N_s}$ and the real dataset $\{I_i\}_{i=1}^{N_u}$ were used for learning, where $N_s$ and $N_u$ denote the total numbers of synthesized and natural sequences, respectively. A network $\mathcal{F}(\cdot)$ was trained to learn a clean sequence $Y$ from an input noisy sequence $X$. Therefore, the overall learning strategy can be formulated as

Eq. (1)

$Y=\mathcal{F}(X)$,
where $\mathcal{F}(\cdot)$ consists of two parts, the supervised branch $\mathcal{F}_s$ and the unsupervised branch $\mathcal{F}_u$, and the two branches share the same weights. During training, the supervised branch constrains the model using reconstruction and contrastive losses and introduces contrastive learning to improve the model's learning ability. The unsupervised branch improves the network's fitting ability based on prior physical features and helps the model learn the distribution of natural noise. The method adopts a typical encoder-decoder architecture with skip connections. Global features in the spatiotemporal dimension are extracted by a series of offset flow-guide attention blocks in the encoding stage. Finally, features at different levels are aggregated in the decoding stage to generate more realistic and natural video denoising results.
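For concreteness, the sketch below illustrates one way such a shared-weight, two-branch training step could be organized; it is a minimal illustration under our own assumptions, not the authors' implementation. Here `supervised_loss` and `unsupervised_loss` are hypothetical stand-ins for the loss terms defined later in Eqs. (9) and (12).

```python
def train_step(model, optimizer, synth_batch, real_batch,
               supervised_loss, unsupervised_loss):
    """One semisupervised step: both branches use the same model weights.

    synth_batch: (noisy_synthetic, clean_label) pair from the synthetic set.
    real_batch:  real-world noisy clip without a clean label.
    The combination follows Eq. (8): L_total = L_sup + L_unsup.
    """
    noisy_syn, clean = synth_batch

    # Supervised branch: reconstruction + contrastive regularization (Eq. 9).
    denoised_syn = model(noisy_syn)
    l_sup = supervised_loss(denoised_syn, clean, noisy_syn)

    # Unsupervised branch: prior-based losses (Eq. 12) on the real clip,
    # computed with the *same* shared-weight model.
    denoised_real = model(real_batch)
    l_unsup = unsupervised_loss(denoised_real, real_batch)

    loss = l_sup + l_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```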

Fig. 1

General framework of semisupervised FGDFormer.


2.2.

Offset Flow-Guided Attention Block

As previously analyzed, the computational complexity and redundant learning of ViT are challenging issues, particularly when handling high-resolution videos. Based on the existing literature,19 mainstream methods currently employ multiframe fusion recovery. However, these approaches do not address the impact of redundant features in nearby frames, which can generate pseudonoise, leading to reduced generalization performance. Hence, we propose the offset flow-guided attention block (OFGAB), as shown in Fig. 2(a), where a novel offset flow-guided multihead attention (OFG-MSA) is designed. In addition, unlike the traditional ViT, a dual-gate control network (DGCN) is used instead of a feed-forward network.

Fig. 2

(a) Offset optical flow guide attention block and (b) offset flow-guided multihead attention.


2.2.1.

DGCN

Specifically, the intermediate features are first extracted by two 3×3 depthwise convolutions in parallel. One branch is then activated by a GELU to obtain the gating information, which represents the information flow of the entire hierarchy and is used to enrich the contextual features. Finally, the original detailed components are preserved by a residual connection. Dual gating facilitates a spatial contextual response to the preceding OFG-MSA, as demonstrated in subsequent experiments. The propagation process for a given input feature I is expressed as

Eq. (2)

$\bar{I}=\mathrm{DGCN}\big(\mathrm{OFG\text{-}MSA}(\mathrm{LN}(I))+I\big)$, $\quad\mathrm{DGCN}(\tilde{I})=\phi\big(W_{d1}(\mathrm{LN}(\tilde{I}))\big)\odot W_{d2}(\mathrm{LN}(\tilde{I}))$,
where $\phi(\cdot)$ denotes the GELU nonlinear activation, $W_{d1}$ and $W_{d2}$ denote different 3×3 depthwise convolutions, LN denotes the normalization layer, and $\odot$ denotes the element-wise product.
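A minimal PyTorch sketch of this dual-gate design is given below for illustration only; it assumes a channel-wise normalization (GroupNorm with a single group as a LayerNorm stand-in) and is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DGCN(nn.Module):
    """Sketch of the dual-gate control network in Eq. (2).

    Two parallel 3x3 depthwise convolutions act on the normalized input;
    one branch is passed through GELU as a gate and multiplied element-wise
    with the other, and a residual connection keeps the original details.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)  # channel-wise LayerNorm stand-in
        self.dw1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw2 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)
        gate = self.act(self.dw1(y))   # gated information flow
        feat = self.dw2(y)             # contextual features
        return x + gate * feat         # residual keeps original details


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)
    print(DGCN(64)(feats).shape)       # torch.Size([1, 64, 32, 32])
```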

2.2.2.

OFG-MSA

Building on the excellent performance of the optical flow-guided transformer in low-level vision tasks, this study drew inspiration from a concept put forth in the literature.20 Our proposed multihead attention is executed with optical flow support. Because optical flow estimation is susceptible to noise, the derived motion characteristics may not be accurate and, in certain cases, may even impede the performance of the ViT, as demonstrated in the literature. In this study, we utilized the offset optical flow to define the search range for the query. This assists in identifying key elements that exhibit high similarity to the query elements; consequently, it reduces the extraction of key elements by erroneous optical flow and, in turn, enhances the self-alignment performance of the transformer. Specifically, for the input noise sequence $F_t\in\mathbb{R}^{3\times H\times W}$, the neighboring and reference frames are prealigned according to the computed optical flow. The prealigned and original features are passed through a convolution that learns residual offsets from the deviation between the two. The residuals are then fused with the original optical flow to obtain the actual offset optical flow $\mathrm{off}_t$, which is used as a guide for motion information. The formation of $\mathrm{off}_t$ can be formulated as

Eq. (3)

$\mathrm{off}_t=\mathrm{Conv}\big(\mathcal{W}(F_t,\mathrm{flow}_t),F_t\big)+\mathrm{flow}_t$,
where $\mathrm{flow}_t$ denotes the optical flow between the reference frame and its neighboring frames, and $\mathcal{W}$ denotes the warp operation. For the query elements, to fully utilize temporal redundancy, the features are divided into nonoverlapping windows of size $P\times P$. The query and key-value elements are extracted within the window range, and the set of query elements is formulated as follows:

Eq. (4)

$\Gamma_{i,j}^{t}=\big\{q_{m,n}^{t}\;\big|\;|m-i|\le P/2,\;|n-j|\le P/2\big\}$,
where $q_{m,n}^{t}$ denotes the element at position $(m,n)$ within the window centered at $(i,j)$, whose distance from the center is no more than $P/2$. For the full set of query elements, the corresponding highly similar key-value elements are searched in the neighboring frames under the guidance of the offset optical flow $\mathrm{off}_t$. The set of key-value elements (key, value) is formulated as

Eq. (5)

$\Omega_{i,j}^{t}=\big\{k_{m+\Delta x_f,\,n+\Delta y_f}^{f}\;\big|\;|f-t|\le r,\;q_{m,n}^{t}\in\Gamma_{i,j}^{t}\big\}$, $\quad(\Delta x_f,\Delta y_f)=\big[\mathrm{off}(F_{\mathrm{ref}},F_{\mathrm{sup}})\big]\big|_{(i,j)}$,
where $\Omega_{i,j}^{t}$ denotes the set of key-value elements obtained under the guidance of $\mathrm{off}_t$ with $q_{m,n}^{t}$ as the query feature, $t$ denotes the reference frame index, $f$ denotes the neighboring frame index, and $r$ denotes the number of neighboring frames. $(\Delta x_f,\Delta y_f)$ denotes the displacement that moves the window to position $(m+\Delta x_f,\,n+\Delta y_f)$ according to the motion information from the offset optical flow, $F_{\mathrm{ref}}$ and $F_{\mathrm{sup}}$ denote the reference and support frames, respectively, and $[\cdot]$ denotes the operation of extracting the offset optical flow at a given location. The OFG-MSA is represented by

Eq. (6)

$\mathrm{OFG\text{-}MSA}(\Gamma_{i,j}^{t},\Omega_{i,j}^{t})=\sum_{n=1}^{N}W_{n}\sum_{k\in\Omega_{i,j}^{t},\,q\in\Gamma_{i,j}^{t}}\mathrm{softmax}\!\left(\frac{(q_{i,j}^{t}\varphi_{q})(k\varphi_{k})^{T}}{\sqrt{d}}\right)v\varphi_{v}$,
where $N$ denotes the number of attention heads; $\varphi_{q},\varphi_{k},\varphi_{v}\in\mathbb{R}^{d\times C}$ are learnable projection parameters; and $d=C/N$.
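The sketch below illustrates, under our own assumptions about tensor shapes and flow channel order, how the offset flow of Eq. (3) and the flow-guided key-window lookup of Eq. (5) could be realized; the attention of Eq. (6) would then run between the query window and the gathered keys. The helper names are hypothetical and this is not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def flow_warp(feat, flow):
    """Backward-warp features (B, C, H, W) with a flow field (B, 2, H, W) in (dx, dy) order."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                 # absolute sampling positions
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                     # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                              # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


class OffsetFlow(nn.Module):
    """Eq. (3) sketch: off_t = Conv(warp(F_neighbor, flow_t), F_ref) + flow_t."""

    def __init__(self, channels):
        super().__init__()
        # learns a 2-channel residual offset from the prealigned/reference mismatch
        self.residual = nn.Conv2d(2 * channels, 2, 3, padding=1)

    def forward(self, f_neighbor, f_ref, flow):
        pre_aligned = flow_warp(f_neighbor, flow)
        return self.residual(torch.cat([pre_aligned, f_ref], dim=1)) + flow


def gather_key_window(f_neighbor, offset_flow, i, j, window=8):
    """Eq. (5) sketch: take the key/value window in a neighbor frame whose
    center (i, j) is displaced by the offset flow sampled at (i, j)."""
    dx, dy = offset_flow[0, :, i, j].round().long().tolist()
    h, w = f_neighbor.shape[-2:]
    ci = min(max(i + dy, window // 2), h - window // 2)               # clamp to image bounds
    cj = min(max(j + dx, window // 2), w - window // 2)
    return f_neighbor[..., ci - window // 2:ci + window // 2,
                      cj - window // 2:cj + window // 2]
```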

2.3.

Slip Compensation Strategy

The transformer has the advantage of global spatial modeling. However, considering the computational cost, the radius of the key-value lookup in this paper is limited to r, which restricts long-range modeling to some extent. Meanwhile, if the aligned features are only passed from module to module, inaccurate optical flow will continuously degrade the subsequent key-value matching. Therefore, a slip compensation strategy (SCS) is proposed to further enhance long-range temporal modeling. Specifically, the output features of the preceding OFGAB are concatenated with the features of the reference frame, and the original reference-frame features are fused with the intermediate output through a fusion-extraction operation, so that when the subsequent block searches for key-value elements in the support frames, highly similar key-value regions can still be found through the original optical flow. As shown in Fig. 3, f denotes the features of each input frame, the superscript t denotes the t'th block in the sequence, and the subscript denotes the index of the sequence frame. The sliding compensation strategy can be expressed as

Eq. (7)

$\bar{f}_{l}^{\,t+1}=\mathrm{conv}\big(f_{l}^{\,t},f_{l}^{\,t}\big)$, $\quad f_{l+i}^{\,t+i}=\mathrm{OFGAB}\big(f_{l+i}^{\,t+i},\,\bar{f}_{l+i-1}^{\,t+i},\,f_{l+i}^{\,t+i},\,f_{l+i+1}^{\,t+i}\big)$.

Fig. 3

Slip compensation strategy.


The SCS is proposed to propagate the sequence features of the video without interruption, and the output denoising results of the last block are preserved. In addition, the fusion of the output features with the original features facilitates accurate guidance of the subsequent motion information and helps preserve more texture structures.
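As an illustration only, the following sketch shows one plausible form of the fusion step; the convolutional fusion and the propagation loop in the comments are assumptions about how Eq. (7) could be realized, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SlipCompensation(nn.Module):
    """Sketch of the SCS (Eq. 7): before the next attention block, the previous
    block's output for the reference frame is fused with that frame's original
    features, so later key-value searches can still rely on the original flow."""

    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, prev_block_out, original_feat):
        return self.fuse(torch.cat([prev_block_out, original_feat], dim=1))


# Illustrative propagation through a stack of hypothetical OFGAB blocks:
#   carried = frame_feats[ref]
#   for block in ofgab_blocks:
#       carried = block(scs(carried, frame_feats[ref]),   # compensated reference
#                       frame_feats[ref - 1], frame_feats[ref + 1])
```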

2.4.

Design of Loss Function

The supervised branch learns the mapping between synthetic noisy sequences and clean sequences, whereas the unsupervised branch mainly learns the probability distribution of natural noise. Therefore, different loss functions were designed for the two branches according to the following equation:

Eq. (8)

$L_{\mathrm{total}}=L_{\mathrm{sup}}+L_{\mathrm{unsup}}$.

2.4.1.

Supervised branch loss functions

The supervised branch is trained using the synthetic dataset, and the supervised branch loss $L_{\mathrm{sup}}$ is defined in Eq. (9), where $\beta$ is used to balance the reconstruction and contrastive losses:

Eq. (9)

$L_{\mathrm{sup}}=L_{re}+\beta L_{cr}$.

Reconstruction loss Lre

L1 and L2 losses are two standard choices. However, the L2 loss penalizes small errors only weakly and thus tends to ignore the detailed content of the image, which has also been confirmed in the literature,21 where better practical results were obtained with the L1 loss. Therefore, this study used the L1 loss as the reconstruction loss.

Contrast regularization loss Lcr

Relying solely on the reconstruction loss increases the likelihood of fitting inaccurate labels. Current approaches refine the output using various regularization terms, including perceptual and prior-based losses. Therefore, this method incorporates the contrastive loss $L_{cr}$ as a constraint, which pushes the anchor samples toward the positive samples and away from the negative samples in a learned representation space. Compared with the perceptual loss, the contrastive loss not only considers the difference between the real and output sequences but also constrains the solution space by treating noisy sequences as negative samples in the feature space.22 In this study, the denoised output sequence was used as the anchor, and the positive and negative samples consisted of clean and noisy sequences, respectively. To strengthen the fitting ability of the model, the negative samples also contained types of noise distinct from the input noise. To extract the latent feature space, a pretrained VGG-1923 was utilized as the fixed feature extractor (FE). $L_{cr}$ can be formulated as follows:

Eq. (10)

$L_{cr}=\sum_{j=1}^{K}\sum_{i=1}^{T}w_{j}\,\frac{\big\|\varphi_{j}(Y_{i})-\varphi_{j}(X_{i})\big\|_{1}}{\big\|\varphi_{j}(Y_{i})-\varphi_{j}(\phi_{m}(X_{i}))\big\|_{1}}$,
where $X_i$ and $Y_i$ denote the input frame and the output frame at the $i$'th position, respectively, $\phi_{m}(\cdot)$ denotes adding noise to a sequence to form a negative sample, $Y_i=\mathcal{F}_s(\phi_t(X_i))$ denotes that the output denoised sequence is used as the anchor, $\varphi_{j}(\cdot)$ denotes the $j$'th layer of the FE, $w_j$ denotes the weight coefficient of each layer, $K$ denotes the number of layers used for the latent feature space, and $T$ denotes the number of input sequence frames. In this study, we used the L1 loss to measure the feature-space distance between the anchor and the positive and negative samples. Therefore, Eq. (9) can be rewritten as follows:

Eq. (11)

$\min\;\big\|Y-\mathcal{F}_{s}(X)\big\|_{1}+\beta\cdot\rho\big(\mathrm{FE}(Y),\mathrm{FE}(X),\mathrm{FE}(\phi_{m}(X))\big)$.
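A possible realization of this contrastive regularization is sketched below using a frozen torchvision VGG-19 as the feature extractor; the chosen layer indices and weights are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class ContrastiveRegLoss(nn.Module):
    """Sketch of Eq. (10): pull the denoised output (anchor) toward the clean
    frame (positive) and push it away from re-noised frames (negatives) in a
    frozen VGG-19 feature space. Layer ids/weights are illustrative choices."""

    def __init__(self, layer_ids=(3, 8, 17, 26), weights=(1/32, 1/16, 1/8, 1.0)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.weights = weights
        self.l1 = nn.L1Loss()

    def _features(self, x):
        feats = []
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, anchor, positive, negative, eps=1e-7):
        fa, fp, fn = map(self._features, (anchor, positive, negative))
        loss = 0.0
        for w, a, p, n in zip(self.weights, fa, fp, fn):
            # ratio of L1 distances: anchor-positive over anchor-negative
            loss = loss + w * self.l1(a, p) / (self.l1(a, n.detach()) + eps)
        return loss
```

In practice the batch dimension would carry the T frames of a clip, so the sum over frames in Eq. (10) is implicit in the batched loss.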

2.4.2.

Unsupervised branch loss functions

The unsupervised branch is trained on the real dataset, and its loss $L_{\mathrm{unsup}}$ is defined as

Eq. (12)

$L_{\mathrm{unsup}}=\lambda_{1}L_{tv}+\lambda_{2}L_{cp}+\lambda_{3}L_{dm}$.

Total variation loss Ltv

Because no clean sequences are available as labels, image-based prior features are required as supervision constraints. Total variation (TV), as an image prior, can model the distribution of image gradients.24 Therefore, this study introduced the TV loss as an unsupervised branch constraint as follows:

Eq. (13)

$L_{tv}=\frac{1}{T}\sum_{i=1}^{T}\left(\left\|\nabla_{h}Y_{i}\right\|+\left\|\nabla_{v}Y_{i}\right\|\right)$,
where $T$ denotes the number of frames of the natural noise sequence, and $\nabla_{h}$ and $\nabla_{v}$ are the gradient operators in the horizontal and vertical directions, respectively. The variational loss preserves the edge features of a sequence. However, its supervision is backpropagated through gradient errors, which can make training unstable.

Content preservation loss Lcp

To improve the robustness of the network, the input sequence is used as supervision, and a content preservation loss is designed as a regularization term that minimizes the L1 difference between the output and the input. This helps the model generate denoising results that remain close to the original input in overall structure and color, and it also alleviates the difficulty of training the model:

Eq. (14)

$L_{cp}=\frac{1}{T}\sum_{i=1}^{T}\mathbb{E}\left[\left\|Y_{i}-X_{i}\right\|_{1}\right]$.

Double memory loss Ldm

Semisupervision can enhance the generalization performance of the method. However, during training, the two conflicting learning objectives tend to lose vital features. Thus the unsupervised branch introduces a double-branch memory loss, which helps retain the knowledge acquired in the supervised branch. Specifically, a copy of the model trained in the supervised branch is kept as $\tilde{\mathcal{F}}_{s}$. When a real sequence is used for training, the input noisy sequence is processed by the unsupervised branch to obtain the result $Y_{i}^{u}$ and simultaneously by the copy to obtain the result $Y_{i}^{s}$. The error between the two is then minimized to avoid the difficulty of semisupervised training. Considering that the commonly used L1 and L2 losses are pixel-level comparisons, and to fully utilize the self-similarity property of sequence images, this study adopted the structural similarity index measure (SSIM) as the loss function, and $L_{dm}$ can be formulated as

Eq. (15)

$L_{dm}=\frac{1}{T}\sum_{i=1}^{T}\mathrm{SSIM}\left(Y_{i}^{s},Y_{i}^{u}\right)$.
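The three unsupervised terms can be sketched as follows. The SSIM call assumes a third-party implementation (here `pytorch_msssim`), the double memory term is written as 1 − SSIM so that it decreases as the two outputs agree, which is one way to read Eq. (15), and the λ weights follow the settings reported in Sec. 3.2; none of this is the authors' released code.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation


def tv_loss(y):
    """Eq. (13): anisotropic total variation over a (B, C, H, W) batch of frames."""
    dh = (y[..., 1:, :] - y[..., :-1, :]).abs().mean()
    dv = (y[..., :, 1:] - y[..., :, :-1]).abs().mean()
    return dh + dv


def content_preservation_loss(denoised, noisy_input):
    """Eq. (14): keep overall structure and color close to the noisy input."""
    return F.l1_loss(denoised, noisy_input)


def double_memory_loss(out_unsup, out_sup_copy):
    """Eq. (15), read as 1 - SSIM: consistency between the unsupervised branch
    output and the output of a frozen copy of the supervised model."""
    return 1.0 - ssim(out_unsup, out_sup_copy.detach(), data_range=1.0)


def unsupervised_loss(denoised, noisy_input, frozen_supervised_model,
                      lambdas=(1.0, 0.3, 0.5)):
    """Eq. (12): L_unsup = l1*L_tv + l2*L_cp + l3*L_dm (weights from Sec. 3.2)."""
    with torch.no_grad():
        out_sup_copy = frozen_supervised_model(noisy_input)
    l1, l2, l3 = lambdas
    return (l1 * tv_loss(denoised)
            + l2 * content_preservation_loss(denoised, noisy_input)
            + l3 * double_memory_loss(denoised, out_sup_copy))
```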

2.5.

Noise Degradation

To better handle noise in real environments, the synthetic noise process must be designed carefully. Because the real noise distribution is unknown and complex and no fixed degradation process applies, we adjusted the noise synthesis strategy, inspired by the literature.25 Specifically, for a given clean sequence, the following probabilistic noise models were randomly added: Gaussian noise, Poisson noise, sensor noise, and JPEG compression noise. Gaussian noise was applied with probability 1, and each other type was applied with a random probability of no more than 0.5. If the probability of camera sensor noise was nonzero, it was applied first; if the probability of JPEG compression noise was nonzero, it was applied last. In addition, blurring and resizing were introduced in the late stage of noise addition; these two strategies better simulate real noise sequences (Algorithm 1).

Algorithm 1

Synthetic noise sequence data.

Input: clean sequence X
Output: sequence pair {X, Y}
 1: Set the probabilities for generating Gaussian, Poisson, sensor, and JPEG compression noise: pro = {G: 1, P: p, C: c, J: j} (p, c, j ≤ 0.5)
 2: Generate random parameters to control Gaussian, sensor, and JPEG compression noise: par = {G: g, P: p, C: c, J: j}
 3: Y = X
 4: for each key, value in pro.items() do
 5:   if pro[key] > 0 then
 6:     Y = NoiseModel(Y, par[key])
 7:   else
 8:     Y remains unchanged
 9:   end if
 10: end for
 11: return {X, Y}
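A rough Python rendering of Algorithm 1 is given below. The specific noise models, parameter ranges, and the use of OpenCV for JPEG compression are illustrative assumptions, and the paper's additional blur and resizing steps are omitted.

```python
import random
import numpy as np
import cv2  # used here only for JPEG encode/decode


def synthesize_noisy_sequence(clean_frames, max_prob=0.5):
    """Sketch of Algorithm 1: Gaussian noise is always applied; Poisson,
    sensor-style, and JPEG-compression noise are each applied with a random
    probability of at most `max_prob`. Parameter ranges are illustrative."""
    apply_sensor  = random.random() < random.uniform(0, max_prob)
    apply_poisson = random.random() < random.uniform(0, max_prob)
    apply_jpeg    = random.random() < random.uniform(0, max_prob)

    sigma = random.uniform(5, 50) / 255.0            # shared Gaussian level
    noisy = []
    for frame in clean_frames:                        # frame: float32 in [0, 1], HxWx3
        y = frame.copy()
        if apply_sensor:                              # simple shot + read noise stand-in, applied first
            y = np.random.poisson(y * 255.0 * 4) / (255.0 * 4) \
                + np.random.normal(0, 2 / 255.0, y.shape)
        if apply_poisson:
            y = np.random.poisson(np.clip(y, 0, 1) * 255.0) / 255.0
        y = y + np.random.normal(0, sigma, y.shape)   # Gaussian noise, probability 1
        if apply_jpeg:                                # JPEG compression applied last, as in Alg. 1
            q = random.randint(50, 95)
            ok, buf = cv2.imencode(".jpg", (np.clip(y, 0, 1) * 255).astype(np.uint8),
                                   [cv2.IMWRITE_JPEG_QUALITY, q])
            y = cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float32) / 255.0
        noisy.append(np.clip(y, 0, 1).astype(np.float32))
    return noisy
```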

3.

Experimental

3.1.

Datasets

In this study, DAVIS26 and Set817 were selected as the synthetic datasets, and CRVD17 and IOCV27 were selected as the real noise datasets. DAVIS contains 90 sequences at 480P resolution and 30 test sequences at 854×480; Set8 is generally used for testing and contains 8 video sequences at 940×540 resolution. CRVD is in the RAW format and contains five different ISO levels for 16 indoor and outdoor scenes, with real static sequences as references for the indoor scenes and no clean sequences for the outdoor scenes. For a fair comparison, we converted it to the sRGB format using a pretrained image signal processing pipeline. IOCV contains dynamic videos captured by a fixed handheld device in various states, and the corresponding "clean" label sequences were generated by averaging multiple frames.

In this study, DAVIS was split into training and validation sets at a 9:1 ratio, and different types of noise were added during training. Six CRVD indoor scenes were selected as the training set, and the remaining scene videos were used as the test set.

3.2.

Training and Evaluation Setup

The proposed framework is a U-shaped structure containing three feature scales, with the intermediate feature dimension set to 64 and residual skip connections for information complementation. The feature extraction and aggregation phases consist of 3 and 6 residual blocks, respectively, which ensure speed while fusing spatiotemporal information at different levels. The optical flow is estimated using a pretrained SpyNet.28 Extensive experiments showed that the model performs best when the weights β, λ1, λ2, and λ3 in the joint loss function are set to 0.4, 1, 0.3, and 0.5, respectively.

The experiments were implemented in PyTorch 1.8 on an NVIDIA GeForce RTX 2080 SUPER GPU. The network parameters were trained with the Adam29 optimizer, with β1 and β2 set to 0.9 and 0.99, respectively. The batch size was 4, the initial learning rate was 1×10⁻⁴, and a cosine annealing strategy decayed the learning rate to 10⁻⁷, which proved effective for stabilizing training in various experiments. The random cropping size during training was 128, and the total number of training iterations was 200k.
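The optimizer and schedule described above can be reproduced roughly as follows. The model here is only a stand-in, and the settings taken from this section (Adam with β1 = 0.9, β2 = 0.99, initial learning rate 1×10⁻⁴, cosine annealing to 10⁻⁷, 200k iterations, batch size 4, 128×128 crops) are the only parts grounded in the paper.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for FGDFormer

# Adam with beta1=0.9, beta2=0.99; initial LR 1e-4 decayed by cosine
# annealing down to 1e-7 over 200k iterations (Sec. 3.2).
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
scheduler = CosineAnnealingLR(optimizer, T_max=200_000, eta_min=1e-7)

# Per iteration: sample a batch of 4 random 128x128 crops, compute the
# total loss of Eq. (8), then
#   loss.backward(); optimizer.step(); scheduler.step()
```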

3.3.

Synthetic Denoising

Comparative experiments were conducted using various types of noise. To assess image quality, we utilized the metrics commonly employed in video restoration, namely the peak signal-to-noise ratio (PSNR)30 and SSIM.31 The larger the metric value, the better the image quality. We compared our method with state-of-the-art (SOTA) denoising methods, including DVDNet,4 FastDVDnet,7 PaCNet,5 VRT,13 RVRT,32 TempFormer,11 and ASWin.12 Because the type and amount of training data differ among the compared algorithms, we retrained them under the settings described above.

3.3.1.

Quantitative comparison

Table 1 lists the objective metrics and running times under synthetic Gaussian white noise at different noise levels, with the resolution of the test sequences uniformly set to 480P. As shown in the table, the proposed algorithm achieves good objective metrics compared with the other algorithms: relative to the CNN-based methods, it shows a noticeable improvement at all noise levels, with an average PSNR gain of 2.4 dB, and it is close to the current SOTA VRT and RVRT algorithms while offering the fastest processing speed. We attribute this to the performance burden of VRT's complex network architecture and RVRT's parallel strategy, whereas our method fully exploits the advantages of the transformer and achieves a trade-off between denoising performance and speed with the help of the offset optical flow guidance and the SCS. In addition, according to the Set8 results in Table 1, the generalization performance of the proposed method was best under different noise levels, indicating that the proposed method is suitable for video denoising tasks in natural scenes.

Table 1

Quantitative (PSNR/SSIM) comparison on DAVIS test and Set8 dataset for synthetic Gaussian noise.

Method | DVDNet | FastDVDNet | PaCNet5 | VRT13 | RVRT32 | TempFormer11 | ASWin12 | Ours
Runtime (s) | 2.90 | 0.24 | 79.27 | 5.60 | 0.70 | 2.10 | — | 0.39
DAVIS, σ=10 | 37.94/0.937 | 38.71/0.953 | 39.97/0.971 | 40.82/0.977 | 40.57/0.971 | 39.97/— | 40.15/— | 40.15/0.969
DAVIS, σ=20 | 35.2/0.913 | 35.57/0.917 | 36.82/0.947 | 38.15/0.963 | 38.05/0.962 | 37.10/— | 37.12/— | 37.96/0.963
DAVIS, σ=40 | 32.41/0.871 | 32.51/0.889 | 33.34/0.897 | 35.32/0.935 | 35.47/0.935 | 34.16/— | 34.13/— | 35.38/0.938
DAVIS, σ=50 | 31.45/0.821 | 31.48/0.837 | 31.86/0.874 | 34.36/0.921 | 34.57/0.925 | 33.20/— | 33.17/— | 34.41/0.920
Set8, σ=10 | 36.20/0.951 | 36.25/0.950 | 37.06/0.960 | 37.88/0.963 | 37.53/0.969 | 36.97/— | 36.99/— | 37.56/0.968
Set8, σ=20 | 33.45/0.913 | 33.23/0.911 | 33.94/0.925 | 35.02/0.937 | 34.83/0.941 | 34.55/— | 34.06/— | 34.75/0.940
Set8, σ=40 | 30.43/0.841 | 30.46/0.845 | 30.70/0.862 | 32.15/0.889 | 32.21/0.888 | 31.86/— | 31.22/— | 32.21/0.889
Set8, σ=50 | 28.87/0.811 | 29.15/0.815 | 29.66/0.835 | 31.22/0.869 | 31.33/0.871 | 30.96/— | 30.31/— | 31.25/0.871
Note: bold values indicate the best scores.

In Table 2, we show the PSNR and SSIM of the different methods on the DAVIS test set under different noise types. Poisson noise was generated according to the pixel intensity range of the image, the JPEG compression ratio was set to 50, and the mixed degradation strategy of Sec. 2.5 was used. Compared with the other methods, our method exhibited better performance; specifically, it outperformed the previous SOTA RVRT by an average PSNR of 0.45 dB. The other methods are more strongly affected by these noise types because they neglect generalization. These results demonstrate the superiority of the proposed architecture.

Table 2

Quantitative (PSNR/SSIM) comparison on DAVIS test set for other noise types.

Degradation type | DVDNet | FastDVDNet | PaCNet5 | VRT13 | RVRT32 | TempFormer11 | ASWin12 | Ours
Poisson | 38.10/0.931 | 37.95/0.943 | 38.85/0.954 | 39.21/0.962 | 39.45/0.970 | — | — | 39.55/0.968
JPEG comp. | 36.75/0.901 | 36.85/0.915 | 37.05/0.945 | 37.15/0.950 | 37.55/0.953 | — | — | 38.16/0.965
Mixed | 35.10/0.875 | 35.66/0.911 | 35.96/0.935 | 36.82/0.942 | 36.78/0.942 | — | — | 37.43/0.943
Note: bold values indicate the best scores.

3.3.2.

Qualitative comparison

Figure 4 shows the denoising results when the Gaussian noise level is 20. As the figure shows, some noise remains in the DVDNet result, and FastDVDNet fails to recover detailed information such as the character's facial features; both methods are limited by implicit alignment and cannot capture the fast movement of the tennis racket. PaCNet performs relatively well because it searches for self-similar patches; however, it ignores the edge structure, so detailed information is not reflected. VRT over-smooths the result, and the character's skin color is influenced by the bright background. Compared with RVRT, the proposed method recovers the character's facial details and the yellow line in the background texture more clearly. Figure 5 shows the denoising results when the synthetic noise level is 40. Under high-intensity noise, the proposed method recovers the details of the trees on the mountain in the background, and the overall color of the characters is closer to that of the original image, whereas the other ViT-based methods introduce a certain amount of color distortion in the character's appearance, and some of their results are extremely smooth.

Fig. 4

Subjective visualization on the DAVIS test set (σ=20): (a) DVDNet, (b) FastDVDNet, (c) PaCNet, (d) noisy, (e) RVRT, (f) VRT, (g) FGDFormer, and (h) original.


Fig. 5

Subjective visualization on the Set8 dataset (σ=40): (a) DVDNet, (b) FastDVDNet, (c) PaCNet, (d) noisy, (e) VRT, (f) RVRT, (g) FGDFormer, and (h) original.


3.4.

Real-World Denoising

The video denoising performance was verified in real environments using the remaining CRVD indoor and outdoor scenes as tests. In addition to the PSNR metric, a no-reference image quality assessment was introduced to assess the denoising effect accurately. We used the NIQE33 (natural image quality evaluator) metric, which does not require a reference image but fits a multivariate Gaussian model based on a series of prior statistics to measure the distributional characteristics of a single test image; the smaller the NIQE value, the higher the image quality. For a fair comparison, we compared our method with SOTA methods supporting blind video noise reduction, including ViDeNN,34 EDVR,6 RViDeNet,17 UDVD,16 and FloRNN.1

3.4.1.

Quantitative comparison

Tables 3 and 4 present quantitative comparisons of the different methods on the CRVD and IOCV datasets using the PSNR and NIQE metrics, respectively. As shown in Table 3, the proposed algorithm obtained the best performance metrics for most exposure settings, with an average PSNR gain of 1 dB over RViDeNet, which provides the original data, and an average gain of 0.3 dB over the current SOTA FloRNN. The proposed method also provides better NIQE scores under different exposure settings for the no-reference quality assessment. A comparison of the denoising performance on the untrained IOCV dataset is shown in Table 4, where the proposed method mostly achieved the best quantitative results and improved NIQE by 0.2 compared with FloRNN, indicating that the proposed method generates high-quality denoising results that better match the visual perception of the human eye.

Table 3

Quantitative (PSNR/NIQE) evaluation metrics reasoned on the CRVD dataset.

ISO | ViDeNN | EDVR | RViDeNet | UDVD | FloRNN1 | Ours
1600 | 35.44/5.67 | 42.10/5.21 | 43.13/5.01 | 43.45/4.97 | 44.07/4.92 | 44.10/4.87
3200 | 34.37/5.89 | 41.03/5.23 | 41.99/5.34 | 42.32/5.12 | 42.98/5.03 | 43.15/4.82
12,800 | 29.79/5.78 | 37.47/5.12 | 38.44/5.08 | 38.85/5.14 | 39.64/4.92 | 39.56/4.68
25,600 | 25.95/6.02 | 35.26/5.69 | 36.21/5.69 | 36.51/5.87 | 37.34/5.65 | 36.39/5.21
Note: bold values indicate the best scores.

Table 4

Quantitative (PSNR/NIQE) evaluation metrics reasoned on the IOCV dataset.

Type | ViDeNN | EDVR | RViDeNet | UDVD | FloRNN1 | Ours
HUAWEI BC | 39.30/4.79 | 41.23/4.64 | 40.89/4.78 | 41.15/4.35 | 42.28/4.38 | 43.10/4.21
HUAWEI PC | 38.23/7.45 | 39.35/7.43 | 38.87/7.44 | 38.76/7.56 | 39.57/7.45 | 40.34/7.14
OPPO BC | 33.35/8.32 | 33.31/8.22 | 33.05/8.25 | 33.56/8.21 | 33.75/8.19 | 33.56/8.26
OPPO FC | 39.66/6.21 | 39.89/6.15 | 40.03/6.15 | 39.56/5.89 | 40.31/5.75 | 40.41/5.66
Note: bold values indicate the best scores.

3.4.2.

Qualitative comparison

Figure 6 shows a scene captured at ISO 25,600 in CRVD, from which it can be seen that ViDeNN leaves some residual noise because its architecture lacks the modeling capacity for complex backgrounds. EDVR and UDVD both have some denoising effect but lack the processing of complex edges, which blurs the boundary between the ball and the background. RViDeNet and FloRNN are closer to the original image; however, some blurring remains in the background region in the lower left corner. The method proposed in this paper utilizes the guidance of the offset optical flow, which captures highly similar clean regions as a complement during processing; hence the overall denoising result is visually more consistent with human perception.

Fig. 6

Subjective visualization on the CRVD test set (ISO = 25,600): (a) ViDeNN, (b) EDVR, (c) UDVD, (d) noisy, (e) RViDeNet, (f) FloRNN, (g) FGDFormer, and (h) original.


Figure 7 shows comparative denoising results for a video sequence from the IOCV dataset, which provides a stronger test of actual generalization performance because the IOCV data are not involved in training. The camera moves quickly in this sequence, and the objects exhibit some blurring and artifacts. Nevertheless, the proposed method still achieves good visual results despite the complex font details; according to Fig. 7 and Table 4, the method recovers more texture detail and can handle more complex and unknown scenes.

Fig. 7

Subjective visualization on the IOCV dataset: (a) noise, (b) ViDeNN, (c) EDVR, (d) RViDeNet, (e) UDVD, and (f) FGDFormer.


3.5.

Ablation Study

The following ablation experiments were performed to verify the contribution of each module and loss function to the robustness and generalization performance of the method. The training settings were the same as above, and the test data consisted of 10 randomly selected sequences from the DAVIS test set and CRVD outdoor scenes with an ISO level of 6400. Gaussian white noise of level 30 was used as the synthetic noise, and PSNR was used as the evaluation metric for the ablation experiments.

3.5.1.

Effectiveness of each module

The effectiveness of each module was investigated (Table 5). Specifically, the initial model with standard attention blocks is referred to as "base," and the proposed components were then added in turn. Removing any of these modules decreases performance, demonstrating the importance of the proposed attention block (OFGAB), the DGCN, and the SCS.

Table 5

Ablation validation of the effectiveness of each component.

Model | DAVIS | CRVD
Base | 35.98 | 39.42
OFGAB | 36.17 | 39.54
OFGAB + DGCN | 36.34 | 39.99
OFGAB + DGCN + SCS | 36.52 | 40.35
Note: bold values indicate the best scores.

3.5.2.

Comparison of loss functions

Table 6 compares the denoising and generalization abilities obtained with different loss functions. As the table shows, the joint use of multiple loss functions leads to better denoising performance while improving generalization. Comparing the use of perceptual loss and contrastive learning in the supervised branch (S2 and S3), contrastive learning still favors the overall denoising performance by pushing the output away from negative samples. Introducing semisupervised learning helps the network produce higher-quality denoised sequences than supervised learning alone. In the semisupervised setting, the double memory loss effectively mitigates the gap between the two branches during training and improves the overall generalization performance.

Table 6

Ablation validation with different loss functions.

Loss | S1 | S2 | S3 | S4 | S5 | S6
Reconstruction loss | | | | | |
Contrast loss | | | | | |
TV loss | | | | | |
Content preserve loss | | | | | |
Double memory loss | | | | | |
Perceptual loss | | | | | |
DAVIS | 36.47 | 36.87 | 36.76 | 36.37 | 36.45 | 36.52
CRVD | 38.98 | 39.17 | 39.16 | 39.37 | 39.89 | 40.35
Note: bold values indicate the best scores.

3.5.3.

Comparison of sample sizes for contrastive learning

The numbers of positive and negative samples in contrastive learning also affect the overall performance of the algorithm.22 Therefore, suppose the number of samples is s, where the positive samples contain the clean sequence by default and the negative samples contain the input sequence by default. The remaining s−1 positive samples come from other frames of the same video, and the remaining s−1 negative samples are obtained by adding different types of noise to those frames. Given the limitations of the deployment environment, the maximum value of s was set to 3. As shown in Table 7, adding negative samples, which push the model output away from noise features, improves the overall performance; therefore, the number of negative samples is set to 3 in this paper. Although increasing the number of negative samples adds some computation, the contrastive loss is not applied at inference time, so only the training time increases slightly.

Table 7

Ablation validation for positive and negative sample sizes.

Ratio of positive to negative samples | DAVIS
1:1 | 36.29
1:3 | 36.52
3:1 | 35.94
Note: bold values indicate the best scores.


4.

Conclusion

This paper proposed a semisupervised denoising method based on an offset optical flow-guided transformer (FGDFormer) for real-world video noise reduction. The core idea is to use the motion characteristics of the offset optical flow to sparsify the transformer's modeling, quickly acquire highly similar regions for spatiotemporal learning, and use an SCS to achieve long-range modeling. The offset optical flow reduces the overall computation, and the sliding compensation strategy promotes the temporal consistency of the denoised sequences. Furthermore, this study adopted a semisupervised learning approach with different losses for supervision: contrastive learning was introduced in the supervised branch to comprehensively improve the denoising performance of the model, a combination of image priors was used in the unsupervised branch to preserve the detailed features of the original sequences, and the difficulty of two-branch training was alleviated by the two-branch memory loss. Comprehensive experiments and qualitative comparisons demonstrated that the proposed method achieves the best video denoising effect at a low cost. In future work, more effective and lightweight video denoising algorithms will be explored for real-world applications.

Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code and Data Availability

The datasets generated and/or analyzed during the current study are available in the Captured Raw Video Denoising Dataset (CRVD) ( https://mega.nz/file/Hx8TgLQY#0MoZSqdrQ_HgIc4OP6_jmwAwupNctPc7ZilXLV_FAQ0).

Acknowledgments

This work was supported by the Key Research and Development Program of Hebei (Grant No. 20350801D).

References

1. 

S. Avidan et al., “Unidirectional video denoising by mimicking backward recurrent modules with look-ahead forward ones,” Lect. Notes Comput. Sci., 13678 592 –609 https://doi.org/10.1007/978-3-031-19797-0_34 LNCSD9 0302-9743 (2022). Google Scholar

2. 

M. Maggioni et al., “Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms,” IEEE Trans. Image Process., 21 (9), 3952 –3966 https://doi.org/10.1109/TIP.2012.2199324 IIPRE4 1057-7149 (2012). Google Scholar

3. 

A. Davy et al., “A non-local CNN for video denoising,” in IEEE Int. Conf. Image Process. (ICIP), 2409 –2413 (2019). https://doi.org/10.1109/ICIP.2019.8803314 Google Scholar

4. 

M. Tassano, J. Delon and T. Veit, “DVDNET: a fast network for deep video denoising,” in IEEE Int. Conf. on Image Process. (ICIP), 1805 –1809 (2019). https://doi.org/10.1109/ICIP.2019.8803136 Google Scholar

5. 

G. Vaksman, M. Elad and P. Milanfar, “Patch craft: video denoising by deep modeling and patch matching,” in IEEE/CVF Int. Conf. on Comput. Vis. (ICCV), 2137 –2146 (2021). https://doi.org/10.1109/ICCV48922.2021.00216 Google Scholar

6. 

X. Wang et al., “EDVR: video restoration with enhanced deformable convolutional networks,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. Workshops (CVPRW), 1954 –1963 (2019). https://doi.org/10.1109/CVPRW.2019.00247 Google Scholar

7. 

M. Tassano, J. Delon and T. Veit, “FastDVDnet: towards real-time deep video denoising without flow estimation,” in IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit. (CVPR), 1351 –1360 (2020). https://doi.org/10.1109/CVPR42600.2020.00143 Google Scholar

8. 

C. Qi et al., “Real-time streaming video denoising with bidirectional buffers,” in Proc. 30th ACM Int. Conf. on Multimedia, 2758 –2766 (2022). https://doi.org/10.1145/3503161.3547934 Google Scholar

9. 

K. C. K. Chan et al., “BasicVSR: the search for essential components in video super-resolution and beyond,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. (CVPR), 4945 –4954 (2021). https://doi.org/10.1109/CVPR46437.2021.00491 Google Scholar

10. 

Z. Liu et al., “Swin transformer: hierarchical vision transformer using shifted windows,” in IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 9992 –10002 (2021). https://doi.org/10.1109/ICCV48922.2021.00986 Google Scholar

11. 

M. Song, Y. Zhang and T. O. Aydın, “TempFormer: temporally consistent transformer for video denoising,” Lect. Notes Comput. Sci., 13679 481 –496 https://doi.org/10.1007/978-3-031-19800-7_28 LNCSD9 0302-9743 (2022). Google Scholar

12. 

L. Lindner et al., “Lightweight video denoising using aggregated shifted window attention,” in IEEE/CVF Winter Conf. on Appl. of Comput. Vis. (WACV), 351 –360 (2023). https://doi.org/10.1109/WACV56688.2023.00043 Google Scholar

13. 

J. Liang et al., “VRT: a video restoration transformer,” (2022). Google Scholar

14. 

S. Shi et al., “Rethinking alignment in video super-resolution transformers,” in Adv. in Neural Inf. Process. Syst., 36081 –36093 (2022). Google Scholar

15. 

V. Dewil et al., “Self-supervised training for blind multi-frame video denoising,” in IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2723 –2733 (2021). https://doi.org/10.1109/WACV48630.2021.00277 Google Scholar

16. 

D. Y. Sheth et al., “Unsupervised deep video denoising,” in IEEE/CVF Int. Conf. on Comput. Vis. (ICCV), 1739 –1748 (2021). https://doi.org/10.1109/ICCV48922.2021.00178 Google Scholar

17. 

H. Yue et al., “Supervised raw video denoising with a benchmark dataset on dynamic scenes,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. (CVPR), 2298 –2307 (2020). https://doi.org/10.1109/CVPR42600.2020.00237 Google Scholar

18. 

M. Maggioni et al., “EMVD: efficient multi-stage video denoising with recurrent spatio-temporal fusion,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. (CVPR), 3465 –3474 (2021). https://doi.org/10.1109/CVPR46437.2021.00347 Google Scholar

19. 

H. Zhang, H. Xie and H. Yao, “Spatio-temporal deformable attention network for video deblurring,” Lect. Notes Comput. Sci., 13676 581 –596 https://doi.org/10.1007/978-3-031-19787-1_33 LNCSD9 0302-9743 (2022). Google Scholar

20. 

J. Lin et al., “Flow-guided sparse transformer for video deblurring,” in Int. Conf. Mach. Learn., ICML 2022, (2022). Google Scholar

21. 

S. Huang et al., “Contrastive semi-supervised learning for underwater image restoration via reliable bank,” in Proc. IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. (CVPR), 18145 –18155 (2023). https://doi.org/10.1109/CVPR52729.2023.01740 Google Scholar

22. 

H. Wu et al., “Contrastive learning for compact single image dehazing,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. (CVPR), 10546 –10555 (2021). https://doi.org/10.1109/CVPR46437.2021.01041 Google Scholar

23. 

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” (2014). Google Scholar

24. 

D. Wang, J. Pan and J. Tang, “Two-scale real image blind denoising with self-supervised constraints,” J. Software, 34 (6), 2942 –2958 https://doi.org/10.13328/j.cnki.jos.006512 (2023). Google Scholar

25. 

K. Zhang et al., “Practical blind denoising via swin-conv-UNet and data synthesis,” (2022). Google Scholar

26. 

F. Perazzi et al., “A benchmark dataset and evaluation methodology for video object segmentation,” in IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), 724 –732 (2016). https://doi.org/10.1109/CVPR.2016.85 Google Scholar

27. 

Z. Kong and X. Yang, “Color image and multispectral image denoising using block diagonal representation,” IEEE Trans. Image Process., 28 (9), 4247 –4259 https://doi.org/10.1109/TIP.2019.2907478 IIPRE4 1057-7149 (2019). Google Scholar

28. 

A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), 2720 –2729 (2017). https://doi.org/10.1109/CVPR.2017.291 Google Scholar

29. 

D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in 3rd Int. Conf. Learn. Represent., ICLR 2015, 131 –142 (2015). Google Scholar

30. 

S. Winkler and P. Mohandas, “The evolution of video quality measurement: from PSNR to hybrid metrics,” IEEE Trans. Broadcast., 54 (3), 660 –668 https://doi.org/10.1109/TBC.2008.2000733 (2008). Google Scholar

31. 

Z. Wang et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., 13 (4), 600 –612 https://doi.org/10.1109/TIP.2003.819861 IIPRE4 1057-7149 (2004). Google Scholar

32. 

J. Liang et al., “Recurrent video restoration transformer with guided deformable attention,” in Adv. in Neural Inf. Process. Syst., 378 –393 (2022). Google Scholar

33. 

A. Mittal, R. Soundararajan and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Process. Lett., 20 (3), 209 –212 https://doi.org/10.1109/LSP.2012.2227726 IESPEJ 1070-9908 (2013). Google Scholar

34. 

M. Claus and J. van Gemert, “ViDeNN: deep blind video denoising,” in IEEE/CVF Conf. Comput. Vis. and Pattern Recognit. Workshops (CVPRW), 1843 –1852 (2019). https://doi.org/10.1109/CVPRW.2019.00235 Google Scholar

Biography

Lihui Sun is a professor at Hebei University of Economics and Business. He is the author of more than 40 journal papers and has written multiple invention patents. His current research interests include infrared image processing, image restoration, and big data research.

Biographies of the other authors are not available.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Lihui Sun, Heng Chen, Jialin Li, and Chuyao Wang "Offset flow-guide transformer network for semisupervised real-world video denoising," Journal of Electronic Imaging 33(1), 013029 (27 January 2024). https://doi.org/10.1117/1.JEI.33.1.013029
Received: 7 August 2023; Accepted: 14 December 2023; Published: 27 January 2024
KEYWORDS: Denoising, Video, Education and training, Transformers, Optical flow, Visualization, Image quality

