Wavelet-Based Network For High Dynamic Range Imaging

High dynamic range (HDR) imaging from multiple low dynamic range (LDR) images suffers from ghosting artifacts caused by scene and object motion. Existing methods, such as optical flow based and end-to-end deep learning based solutions, are error-prone either in detail restoration or in ghosting artifact removal. Comprehensive empirical evidence shows that ghosting artifacts caused by large foreground motion are mainly low-frequency signals, while details are mainly high-frequency signals. In this work, we propose a novel frequency-guided end-to-end deep neural network (FHDRNet) to conduct HDR fusion in the frequency domain, where the Discrete Wavelet Transform (DWT) is used to decompose the inputs into different frequency bands. The low-frequency signals are used to avoid ghosting artifacts, while the high-frequency signals are used to preserve details. Using a U-Net as the backbone, we propose two novel modules: a merging module and a frequency-guided upsampling module. The merging module applies an attention mechanism to the low-frequency components to deal with the ghosting caused by large foreground motion. The frequency-guided upsampling module reconstructs details from multiple detail-rich frequency-specific components. In addition, a new RAW dataset is created for training and evaluating multi-frame HDR imaging algorithms in the RAW domain. Extensive experiments are conducted on public datasets and our RAW dataset, showing that the proposed FHDRNet achieves state-of-the-art performance.


I. INTRODUCTION
HIGH dynamic range (HDR) imaging using multiple low dynamic range (LDR) images as inputs is a technique used in computational photography to generate high-quality HDR images. This technique achieves a large range of luminosity by utilizing the information from multiple LDR images. A digital camera usually captures an LDR image with only a limited range of luminosity at a time, where over-exposed and/or under-exposed regions may appear, degrading the image quality. Cameras embedded in wearable devices usually have small optical sensors and small apertures, which limit the number of electrons reaching each pixel, making it difficult for them to capture an HDR image in a single shot. A practical solution for wearable devices is to capture several LDR images with different exposure times and fuse them into a single HDR image. To generate an HDR image, the method should be able to restore the missing information (over-exposed and under-exposed regions) from the multiple LDR images and, more importantly, be ghost-free. Existing methods [2]-[7] suffer from different kinds of artifacts, including ghosting, missing details, color degradation, etc. The traditional method by Debevec and Malik [8] can generate a high-quality HDR image by merging several static LDR images with different exposures, but it can introduce ghosting artifacts when there is motion. Other early works [9]-[13] try to deal with motion by detecting and rejecting moving pixels [9]-[11], or by aligning and merging the LDR images [12], [13]. They can address a small range of motion but cannot handle moving content effectively.
Recently, deep learning-based methods [2], [3], [6] have been proposed and have made great improvements over traditional methods, benefiting from CNNs' strong representation ability and large amounts of training data. These methods either use optical flow to align the inputs, followed by a merging module [2], or formulate the HDR imaging task as an image-to-image translation problem [3], [6]. Although these methods have made great progress in this area, they still suffer from the ghosting problem (see Figure 1). We notice that none of the existing methods tries to exploit the fact that the ghosting artifacts caused by large foreground motion are mainly of low frequency, while the details are of high frequency. We argue that it is beneficial to separate these low-frequency and high-frequency signals and deal with them separately. Frequency operations have also been used in a few existing HDR imaging methods [14], [15]: [14] decomposes HDR frames into different frequency bands, where the most suitable band is selected adaptively to prevent ghosts, while [15] uses a pairwise frequency-domain temporal filtering operation for robust and fast alignment.
In this paper, we choose the Discrete Wavelet Transform (DWT) to decompose signals into different frequency bands. Compared with alternatives such as the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT), DWT captures both frequency and spatial information of the images (or feature maps), which helps to preserve detailed texture. In order to verify our hypothesis, we select the output of AHDR [6] in Figure 1, where a distinct ghosting artifact appears because of object motion, as an example to visualize the decomposed signals in each frequency sub-band. The resulting frequency sub-bands are given in the first row of Figure 2. It clearly shows that the ghosting artifacts are mainly in the low-frequency sub-band (LL), while the high-frequency sub-bands (LH, HL, HH) contain textures in different directions. In order to extend this verification to the feature space, we also investigate the feature map from the second-to-last layer of AHDR. The second row of Figure 2 presents a similar trend in the feature space: the ghosting artifacts caused by large foreground motion are mainly in the low-frequency sub-band. Thus, it is well worth exploring frequency-specific processing in both the RGB and deep feature domains for the HDR imaging task.
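To make the decomposition concrete, here is a minimal single-level 2D Haar DWT and its inverse in NumPy. Note that sign and ordering conventions for the detail sub-bands vary across libraries; this is one common convention, shown as an illustrative sketch rather than the implementation used in the paper.

```python
import numpy as np

def haar_dwt2d(x):
    """Single-level 2D Haar DWT of an array with even height/width.
    Returns the four sub-bands (LL, LH, HL, HH) at half resolution."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0   # low-frequency: local average (structure)
    lh = (a + b - c - d) / 2.0   # row difference (responds to horizontal edges)
    hl = (a - b + c - d) / 2.0   # column difference (responds to vertical edges)
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2d(ll, lh, hl, hh):
    """Inverse of haar_dwt2d: perfectly reconstructs the input."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

On a perfectly flat region the three detail sub-bands are zero, which matches the observation that smooth (ghosting-prone) content concentrates in LL.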
In this work, we propose a frequency-guided network (FHDRNet) to explicitly deal with signals of different frequency sub-bands for HDR imaging (see Figure 3). FHDRNet also performs well in the RAW domain. For RAW domain evaluation, we propose a new RAW dataset (more details are included in Appendix A). Processing HDR fusion in the RAW domain has the following advantages, especially for wearable devices: 1) From the Image Signal Processing (ISP) pipeline's perspective, it brings the HDR fusion module to an early stage (e.g., earlier than demosaicing) of the whole ISP pipeline. This saves computation in other modules (e.g., demosaicing) that would otherwise have to run three times, once for each LDR RAW image; 2) RAW data usually have a higher bit width (e.g., 16-bit) and contain more metadata. HDR fusion in the RAW domain therefore recovers more original useful information than fusion in the RGB domain (8-bit).
The paper's contributions are: 1) The proposed method, FHDRNet, working in the wavelet domain, is the first to explicitly deal with frequency-specific problems in the HDR imaging task, e.g., ghosting caused by large foreground motion; the attention mechanism is applied to the low-frequency sub-band during fusion to remove such artifacts, while the high-frequency sub-bands are used to preserve details (e.g., texture) in the generated HDR image; 2) A novel frequency-guided upsampling module is proposed to fuse multiple components of different frequency sub-bands from different images into a single set of low and high-frequency sub-bands, which are used to upscale the output via the Inverse Discrete Wavelet Transform (IDWT); 3) A new dataset is built for training and evaluating HDR algorithms in the RAW domain, which includes 85 training and 15 testing samples; this is the first RAW dataset for HDR imaging; 4) Our method achieves state-of-the-art performance on several public datasets and on the new RAW dataset.
II. RELATED WORK

HDR Imaging. When the scene and camera are completely static, the traditional method [8] can generate a high-quality HDR image by merging multiple differently exposed LDR images. But it generates ghosting artifacts when there is motion among the LDR images. Early works [9]-[11] that try to detect and reject the moving pixels fail to handle moving content effectively. To make use of the moving content, Bogoni et al. [12] and Gallo et al. [13] first align the input images and then merge the aligned images into one HDR image. These methods [12], [13] simply merge the aligned LDR images and are unable to avoid alignment artifacts.
Recently, many deep learning-based methods [2]-[4] have been developed. Kalantari et al. [2] propose a deep learning-based model that first aligns the LDR images using optical flow, and then uses a convolutional neural network to generate an HDR image. However, it is difficult to correct the misalignment errors of optical flow, e.g., in moving areas, particularly when there is also occlusion. Wu et al. [3] treat HDR imaging as an image translation problem and use a U-Net to cope with large foreground motion. Though it reduces ghosting artifacts, it also blurs image details and hallucinates fine details in the over/under-exposed regions. Yan et al. [7] adopt three sub-networks with different scales to reconstruct the HDR image gradually. NHDRRnet [4] uses a U-Net to extract features in a low dimension; the features are then sent into a global non-local network that fuses the features from the inputs according to their correspondence. This method can efficiently remove ghosting artifacts from the final output. AHDR [6] employs an attention mechanism to address misalignment and achieves state-of-the-art performance. SCHDR [1] uses a lightweight optical flow network, PWC-Net [19], followed by refinement to align the LDR images first, and then conducts feature aggregation and feature merging to generate an HDR image. Prabhakar et al. [5] merge the input LDR images at a lower resolution to save computational cost. These methods fail to explicitly remove the ghosting artifacts and to fully exploit the useful information in the inputs.

Fig. 3. Overall architecture of the proposed FHDRNet. We consider three pairs of LDR and HDR images as inputs, and the final reconstructed HDR output is viewed after tone mapping. The network contains three parts: an encoder, a merger and a decoder. In the encoder, the input feature maps are decomposed into different frequency sub-bands, via DWT, for further fusion and reconstruction. In the merger, the low-frequency sub-bands from the last layer of the encoder are used to generate a single fused feature map. In the decoder, the pre-saved frequency sub-bands are used along with the fused feature map to reconstruct an upper-scale feature map through a frequency-guided upsampling module (FGU). Finally, a global residual connection is used to enhance the feature representation ability of the network. (Best viewed in color.)

Learning in the Wavelet Domain. Learning in the wavelet domain has the advantage of explicitly dealing with signals in different frequency sub-bands, and it has been applied to high-level and low-level vision problems, such as classification [16], [20]-[22], face aging [23], style transfer [24], image denoising [17], [25], [26], image demoireing [27], image/video compression [28], [29], network compression [30], and super-resolution [18], [31]. One classical image denoising approach is wavelet shrinkage [32], where the noisy image is decomposed into low and high-frequency components and thresholding is applied to the high-frequency coefficients to remove high-frequency noise. For image super-resolution [33], classical approaches estimate or interpolate the coefficients of wavelet sub-bands to refine image details. Recently, DWT has also been applied in deep learning-based image denoising. The winner of the NTIRE 2020 Denoising Challenge [17] proposes a multi-level wavelet ResNet for image denoising, where DWT and IDWT are used for downsampling and upsampling. Guo et al. [34] propose a deep wavelet super-resolution model to recover the residuals of the wavelet coefficients of the low-resolution image. Bae et al. [35] present a wavelet residual network for image denoising and image super-resolution. Both [34] and [35] only use a one-level wavelet transformation. Liu et al. [27] develop WDNet for image demoireing, working directly in the wavelet domain. Liu et al. [18] propose a multi-level wavelet-CNN that shows good performance on several image restoration tasks.
As far as we know, DWT has not previously been applied to HDR imaging. We observe that in HDR imaging, ghosting artifacts caused by large foreground motion are of low frequency, while the details are of high frequency. We demonstrate that it is beneficial to treat them separately, and we explore the advantages of doing so in the wavelet domain.

III. METHODOLOGY
Given a set of LDR images {L_1, L_2, ..., L_n} with different exposure times, HDR imaging aims to reconstruct an HDR image H that is aligned with the reference frame L_ref (e.g., the medium-exposure LDR image). In this paper, we follow [2], [3], [6] and use three pairs of LDR and HDR images as input. The corresponding HDR images are obtained from the LDR inputs using a gamma correction function:

H_i = L_i^γ / t_i,

where γ is set to 2.2 as the default gamma parameter, and t_i is the exposure time of L_i. The final input of the network is the channel-wise concatenation of each LDR image and its corresponding HDR image, forming three 6-channel inputs:

I_i = concat(L_i, H_i), i = 1, 2, 3.
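Following the formulation in [2], [3], [6], the LDR-to-HDR mapping (gamma-correct, then normalize by exposure time) and the 6-channel concatenation can be sketched as below; the helper names are hypothetical, and images are assumed to be floats in [0, 1]:

```python
import numpy as np

def ldr_to_hdr(ldr, exposure_time, gamma=2.2):
    """Map an LDR image to the linear HDR domain: H_i = L_i**gamma / t_i."""
    return np.power(ldr, gamma) / exposure_time

def build_network_input(ldrs, exposure_times):
    """Concatenate each LDR image with its exposure-normalized HDR
    version along the channel axis, giving one 6-channel array per input."""
    return [np.concatenate([l, ldr_to_hdr(l, t)], axis=-1)
            for l, t in zip(ldrs, exposure_times)]
```

Dividing by the exposure time places the three differently exposed frames in a common linear radiance domain, which is what makes them directly comparable during fusion.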

A. Overview of Our Network Structure
The proposed network has a U-Net-like structure, as shown in Figure 3, containing an encoder, a merger and a decoder with skip connections. In the encoder, the inputs {I_1, I_2, I_3} are sent into three independent sub-networks. In each sub-network, DWT is used to decompose the feature maps into different frequency sub-bands {LL_i, LH_i, HL_i, HH_i} (i = 1, 2, 3), among which only the low-frequency sub-band LL_i is used for the next-stage (scale) processing. All frequency sub-bands are also sent to the corresponding frequency-guided upsampling modules through skip connections. The merger fuses the three inputs (in the low-frequency sub-band) into a ghost-free one, which is then sent to the decoder. The network includes two key modules: the merging module (Section III-C) and the frequency-guided upsampling module (Section III-D). The merging module takes only the low-frequency components of the previous stage as input and generates a merged result, focusing on structural information. In the decoder, the frequency-guided upsampling module processes features in the low-frequency and high-frequency sub-bands independently and then reconstructs the feature maps at a finer scale using IDWT. A global residual connection is also used to enhance the feature representation ability of the network. The output of the network passes through a µ-law tone mapping function to generate the final tone-mapped HDR image:

T(H) = log(1 + µH) / log(1 + µ),

where H is the generated HDR output and µ is set to 5000 by default to adjust the compression level.
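The µ-law tone mapper is a simple differentiable compressor that expands dark regions and compresses highlights; a minimal sketch:

```python
import numpy as np

MU = 5000.0  # default compression level used in the paper

def mu_law_tonemap(h, mu=MU):
    """mu-law tone mapping T(H) = log(1 + mu*H) / log(1 + mu),
    applied to linear HDR values in [0, 1]."""
    return np.log1p(mu * h) / np.log1p(mu)
```

The mapping fixes T(0) = 0 and T(1) = 1 and is strictly increasing, so losses computed on T(H) emphasize errors in dark regions, where HDR artifacts are most visible after display tone mapping.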

B. Encoder using DWT
The original inputs {I_1, I_2, I_3} are first sent into three independent sub-networks to extract features individually. The features after the first convolution layer (conv1) are transformed into different frequency sub-bands through DWT, including one low-frequency component LL_i^(level-1) and three high-frequency components, where i denotes the i-th input. According to Liu et al. [27], the low-frequency sub-band contains more structure information and the high-frequency sub-bands contain more detailed information. In order to effectively leverage the decomposed data, only the low-frequency component LL_i^(level-1) is used for further decomposition. The high-frequency components can provide details in the corresponding frequency-guided upsampling module, so we keep them and pass them to the decoder through skip connections.
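As a minimal sketch of the encoder's "decompose, keep LL, save the rest" loop (assuming a Haar basis and ignoring the convolution layers between levels, which the real network interleaves):

```python
import numpy as np

def _haar(x):
    """Single-level 2D Haar DWT returning (LL, LH, HL, HH)."""
    a, b, c, d = x[0::2, 0::2], x[0::2, 1::2], x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2.0, (a + b - c - d) / 2.0,
            (a - b + c - d) / 2.0, (a - b - c + d) / 2.0)

def wavelet_encode(feat, levels=3):
    """Decompose a feature map `levels` times. Only LL is passed to the
    next level; the high-frequency sub-bands of every level are saved
    for the decoder's skip connections."""
    saved_high = []
    ll = feat
    for _ in range(levels):
        ll, lh, hl, hh = _haar(ll)
        saved_high.append((lh, hl, hh))
    return ll, saved_high
```

Each level halves the spatial resolution, so after three levels an 8×8 map reduces to 1×1 while the detail sub-bands of all three scales remain available for reconstruction.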

C. Merging Module
The merging module aims at reducing the low-frequency artifacts (e.g., ghosting) by fusing only the low-frequency components (see Figure 4). Inspired by AHDR [6], an attention mechanism is applied to deal with misalignment and saturated regions: the feature maps of each support frame (i = 1, 3) are concatenated with those of the reference frame and passed through an attention module to generate the corresponding weighted masks M_1 and M_3. The attention module includes two convolution layers (3 × 3 kernel size), with stride and zero padding equal to 1. A sigmoid function is used to normalize the values of the masks to [0, 1]. Next, the feature maps of the support frames are weighted with the masks using element-wise multiplication to get the filtered feature maps

F'_i = F_i ⊙ M_i, i = 1, 3,

where ⊙ denotes element-wise multiplication. These filtered feature maps and the reference frame's feature maps are concatenated and go through a convolution layer. DWT is applied again to decompose the resulting feature map into frequency components at a lower scale for efficient fusion, where the low-frequency component LL^(level-3) goes through 9 residual blocks to conduct feature fusion. Finally, the pre-saved high-frequency components LH^(level-3), HL^(level-3), HH^(level-3) and the merged feature are used as the input of IDWT to recover the fused feature maps F^(level-2) for the frequency-guided upsampling module.
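A toy, non-learned stand-in for the mask-then-multiply flow can illustrate the mechanism. The real module uses two learned 3×3 convolutions; here a single hypothetical linear map `w`, `b` scores each spatial position of the concatenated (reference, support) features before the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_mask(ref_feat, sup_feat, w, b):
    """Score each spatial position of the concatenated (reference,
    support) features, then squash to [0, 1] with a sigmoid.
    ref_feat/sup_feat: (H, W, C); w: (2C, C); b: (C,)."""
    x = np.concatenate([ref_feat, sup_feat], axis=-1)  # (H, W, 2C)
    return sigmoid(x @ w + b)                          # (H, W, C)

def apply_mask(sup_feat, mask):
    """Element-wise weighting F'_i = F_i * M_i suppresses misaligned
    or saturated regions of the support frame before fusion."""
    return sup_feat * mask
```

A mask value near 0 effectively removes a support-frame position from the fusion, which is how large-motion regions are prevented from ghosting the merged low-frequency features.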

D. Frequency-Guided Upsampling Module
Different from previous works [18], [34], which use IDWT to reconstruct feature maps from filtered frequency sub-bands that all go through the same process, our method sends the decomposed components through different processes, with the aim of further fusing the low-frequency components. As shown in Figure 5, three sets (one per input) of decomposed components are used for restoration. First, the high-frequency components are regrouped into three groups according to their frequency sub-bands: LH_s, HL_s, HH_s, where LH_s = {LH_1, LH_2, LH_3}, etc. Then, each group is fused by two convolution layers to generate a single set of high-frequency components. The low-frequency components are fused in a similar way to the merging module: after going through the attention modules, {LL_1, LL_2, LL_3}, along with the fused feature maps F (from the previous stage), are concatenated and go through a convolution layer for fusion. Finally, IDWT is applied to the fused low and high-frequency components to reconstruct the feature maps. An extra convolution layer is used to squeeze the output to the desired number of channels.
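A sketch of the upsampling step, using simple per-group averaging in place of the learned two-convolution fusion (this averaging variant is examined in the ablation study), followed by an inverse Haar DWT:

```python
import numpy as np

def _ihaar(ll, lh, hl, hh):
    """Inverse single-level 2D Haar DWT (doubles spatial resolution)."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def frequency_guided_upsample(fused_ll, high_bands):
    """Fuse per-input high-frequency sub-bands (here: averaging, a
    stand-in for the learned fusion) and reconstruct a feature map
    at twice the resolution via inverse Haar DWT.
    high_bands: list of (LH_i, HL_i, HH_i) tuples, one per input."""
    lh = np.mean([b[0] for b in high_bands], axis=0)
    hl = np.mean([b[1] for b in high_bands], axis=0)
    hh = np.mean([b[2] for b in high_bands], axis=0)
    return _ihaar(fused_ll, lh, hl, hh)
```

The key point is that the low-frequency input is the already-merged, ghost-suppressed map, while the high-frequency inputs come straight from the encoder's skip connections, so detail is injected without re-introducing low-frequency ghosting.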

E. Training Loss
Two types of loss are used to train our network: a reconstruction loss and a Sobel loss. The reconstruction loss is an ℓ1 loss, the sum of the pixel-wise errors between the generated HDR image and the ground truth. We adopt the ℓ1 loss because it has proved effective for image restoration tasks [6]. For the HDR imaging problem, it has been shown that the ℓ1 loss on the tone-mapped images is better than the ℓ1 loss in the linear space. The µ-law tone mapping function T(·) is applied to the output to generate the tone-mapped HDR image. The basic reconstruction loss is defined as

L_r = ||T(Ĥ) − T(H)||_1,

where Ĥ is the predicted HDR linear RGB image and H is the ground truth. In order to keep the structure information in the generated HDR image, we also use the Sobel loss:

L_s = ||∇_x T(Ĥ) − ∇_x T(H)||_1 + ||∇_y T(Ĥ) − ∇_y T(H)||_1,

where ∇_x and ∇_y are the Sobel edge operators in the x and y directions, respectively. Our final loss is defined as

L = L_r + λ L_s,

where λ is a balancing parameter.

Datasets. The Samsung dataset [36] is a synthetic one, containing 100 samples. The dataset is created in a similar way to the Kalantari dataset, except that all the data are synthesized through a game engine. We choose the first 85 samples for training and the last 15 for testing. Our RAW dataset is also created in a similar way, with higher resolution (5120×3456); it includes 85 training samples and 15 testing samples.
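A NumPy sketch of the training loss, assuming (as the section describes) that both terms are computed on the tone-mapped pair; `conv2d_valid` is a naive helper written for clarity, not speed:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, k):
    """Plain 'valid' 2D correlation for the 3x3 Sobel kernels."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def tonemap(h, mu=5000.0):
    """mu-law tone mapping T(H) = log(1 + mu*H) / log(1 + mu)."""
    return np.log1p(mu * h) / np.log1p(mu)

def hdr_loss(pred, gt, lam=0.25):
    """L = ||T(pred) - T(gt)||_1 + lam * Sobel loss on the tone-mapped pair."""
    tp, tg = tonemap(pred), tonemap(gt)
    rec = np.mean(np.abs(tp - tg))
    sob = (np.mean(np.abs(conv2d_valid(tp, SOBEL_X) - conv2d_valid(tg, SOBEL_X)))
           + np.mean(np.abs(conv2d_valid(tp, SOBEL_Y) - conv2d_valid(tg, SOBEL_Y))))
    return rec + lam * sob
```

Means rather than sums are used here so the value is resolution-independent; either reduction works as long as λ (0.25 in the paper) is tuned consistently.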

IV. EXPERIMENTS AND RESULTS
In the experiments, the training and evaluation are divided into three parts: 1) For real images, the model is trained on the Kalantari [2] training samples and evaluated on the Kalantari [2] and Prabhakar [1] testing samples; 2) For synthetic images, the training and evaluation are on the Samsung dataset [36]; 3) For RAW images, the training and evaluation are on our RAW dataset. For the training samples with ground truth, the images are randomly cropped into 256×256 patches during training, and data augmentation (e.g., flipping and rotation) is then applied for effective training. During evaluation, the entire test images are fed into the network to predict the HDR images.
Implementation Details. During training, Adam [41] is used as the optimizer. The initial learning rate is 2 × 10^-4. After 20,000 epochs, it is reduced to 2 × 10^-5, and after another 20,000 epochs, it is further reduced to 2 × 10^-6. We train the network for 60,000 epochs in total. The batch size is 16. The Haar wavelet is used for frequency decomposition. The balancing parameter λ is set to 0.25.

A. Comparison with State-of-the-Arts
Quantitative Results. Table I shows the quantitative comparison between the state-of-the-art models and ours on three datasets: Kalantari [2], Prabhakar [1], and Samsung [36]. Our model outperforms most of them, especially on the Kalantari and Samsung datasets, where it achieves the best results on five out of the six evaluation metrics. For the Prabhakar dataset, our model has the best result on SSIM-µ and the second-best results on PSNR-µ, PSNR-L and PSNR-M. Different from the other optical-flow-free methods, which use a U-Net structure, AHDR [6] adopts a network that operates at a single, fixed scale. Therefore, AHDR can preserve more information during encoding and merging. With the assistance of the attention mechanism, which can detect misalignment and saturated regions, AHDR achieves the second-best scores in most cases. Compared with AHDR, our model consistently outperforms it across all three datasets.
Qualitative Results. In Figure 6, we show the qualitative comparisons. Sen [37] and Hu [38] generate strong ghosting artifacts in the images with large foreground motion. These traditional methods perform worse than the deep learning-based methods. The optical flow based methods, Kalantari17 [2] and SCHDR [1], align the input frames using optical flow before the merging operation and benefit greatly from the explicit alignment. But inaccurate optical flow estimation leads to ghosting artifacts, especially in areas of large motion (see Figure 6 (a, c)). Wu18 [3] produces gridding artifacts (see Figure 6 (c)) because of its use of deconvolution for upsampling. AHDR [6] also produces ghosts in Figure 6 (c). From these results, our method shows better details than the other baselines, because details are preserved in the high-frequency sub-bands. By merging features using the low-frequency components with the attention mechanism, the ghosting artifacts are also alleviated compared with other methods.

B. Ablation Studies
In this section, we conduct ablation studies on the Kalantari dataset to investigate the contribution of each module in our model. As shown in Table II, our ablation studies focus on the following aspects: 1) processing the low-frequency and high-frequency sub-bands separately, 2) the importance of the attention mechanism, 3) different types of wavelet, 4) different methods of fusing the high-frequency components in the upsampling module, 5) the Sobel loss function, and 6) the importance of using only the low-frequency component for further processing (the next scale) and merging.
Frequency-Specific Processing. We design a "U-Net + DWT" model that is basically a U-Net, except that it processes the low and high-frequency sub-bands separately. This naive replacement leads to an improvement of 0.78dB in terms of PSNR-µ over the U-Net baseline. This model can outperform several state-of-the-art deep learning models [3], [4], [7] that adopt a U-Net as the backbone, because the high-frequency components preserve more details. As shown in Figure 7, the results of "U-Net + DWT" are smoother and show better details than those of U-Net.
Attention Mechanism. Inspired by AHDR, attention modules are used in both the merging module and the upsampling module of the proposed model. To verify their contribution, we remove the attention modules (indicated as "w/o Attention" in Table II). Compared with our final model, removing the attention modules leads to a 0.49dB decrease in terms of PSNR-µ. Different from AHDR, which applies attention to all feature maps at the original scale, we only apply attention to the low-frequency components of the feature maps at smaller scales (1/8, 1/4, and 1/2). With this design, we specifically align the low-frequency sub-band to remove ghosting artifacts and also save computation. As shown in Figure 7, the results of "w/o Attention" have ghosting artifacts.
Types of Wavelet. In addition to the default Haar wavelet, other types of wavelet are also evaluated: the Symlet wavelet (indicated as "sym2") and Daubechies wavelets with approximation orders 2 and 3 (indicated as "db2" and "db3"). Our model with the Haar wavelet outperforms the models with the other wavelets. However, using the other types of wavelet still yields comparable results, which shows the robustness of our method to the choice of wavelet.
Fusion Methods of High-Frequency Components. Another approach to fusing the high-frequency components in the upsampling module is also investigated. It first groups the components from the different inputs by frequency sub-band, and then averages the values of these components pixel by pixel to get the fused high-frequency components. This average fusion method has worse performance (a 0.5dB decrease in PSNR-µ) than the CNN-based fusion.
Sobel Loss Function. The Sobel loss contributes 0.21dB improvement for the score of PSNR-µ. It can guide the model to recover better edge information.
Using only Low-Frequency after Decomposition. To verify our design of using only the low-frequency component after decomposition, we also try using all frequency sub-bands for the next stage's processing (indicated as "Ours−"), which leads to a 0.78dB decrease in PSNR-µ. As shown in Figure 7, the results of "Ours−" contain ghosting artifacts.

C. Trade-off Between Quality and Efficiency
High dynamic range (HDR) imaging algorithms are widely used in real-world devices (e.g., smartphones). Therefore, computational efficiency is also an important factor in evaluating the performance of the algorithms. In this experiment, we report the running time and the memory cost, together with the corresponding PSNR-µ scores, of the baselines and our method in Table III. The proposed method needs around 0.59 seconds to generate an HDR image at 1500×1000 resolution on an RTX-2080Ti GPU, whereas AHDR needs 0.78 seconds and has a lower PSNR-µ score. Besides, AHDR takes up the most memory among the 6 competitors, because it merges the LDR images at the original scale. Furthermore, our method, with 29% less memory than AHDR, still achieves better performance. Thus, our method strikes a good balance between quality and efficiency.

D. Evaluation on the RAW Dataset
We evaluate FHDRNet and compare it with the state-of-the-art methods on our new RAW dataset. As shown in Table IV, FHDRNet achieves the best performance in terms of PSNR-µ and SSIM-µ, indicating that our model can also be used in the RAW domain. In the qualitative comparison, our method preserves more texture details than the baselines (see Figure 8), because of the efficient utilization of the high-frequency sub-bands. For example, in Figure 8 (a), our method restores better details in the bottle and the newspaper. Thus, the proposed method keeps its advantages in the RAW domain as well.

V. CONCLUSION
In this paper, we have proposed a frequency-guided network (FHDRNet) for high dynamic range (HDR) imaging. In the proposed method, the input LDR images are transformed into the wavelet domain using Discrete Wavelet Transform (DWT). The low-frequency sub-bands are mainly used to avoid ghosting artifacts caused by large motion, while the high-frequency sub-bands are used for preserving details. The attention mechanism is adopted to merge low-frequency information to deal with misalignment. The extensive experiments have shown that our method can remove ghosts and preserve more details. It also achieves state-of-the-art results on several public datasets and our RAW dataset with lower computational costs, compared with previous approaches. We believe it has great potential for more extensive applications of HDR imaging.

APPENDIX A RAW HDR DATASET
Recently, HDR imaging algorithms embedded in wearable devices usually process images in the RAW domain, for the following reasons (repeated from the main text): 1) From the Image Signal Processing (ISP) pipeline's perspective, a complete ISP pipeline mainly includes Sensor Correction (e.g., Black Level Compensation (BLC), Lens Shading Correction (LSC), and Defect Pixel Correction (DPC)), Denoising, Demosaicing, Auto White Balance (AWB), Colour Correction Matrix (CCM), Gamma Correction, etc. Generating the HDR image in the RAW domain brings the HDR fusion module to an early stage (e.g., earlier than demosaicing) of the whole ISP pipeline. This saves computation in other modules (e.g., demosaicing): we only need to run denoising and demosaicing once instead of three times, as in the RGB domain. 2) RAW images usually have a higher bit width (e.g., 16-bit) and contain more metadata. HDR fusion in the RAW domain therefore recovers more original useful information than fusion in the RGB domain (8-bit).
In order to satisfy the requirements of developing HDR imaging algorithms for wearable devices (e.g., smartphones), we create a new dataset for training and evaluating HDR imaging algorithms in the RAW domain. The data capturing and ground truth merging follow the method in Kalantari [2], and the capture device is a SONY ILCE-7RM2. We capture two sets of images for the same scene: a static set and a dynamic set. Each set contains three high-resolution (5120×3456) images captured with different exposure biases in RAW format. In the static set, the object is kept static during capturing; these images are mainly used to generate the ground truth HDR images. In the dynamic set, the object makes different movements; these images are used as the inputs to the network. Our dataset covers both classic HDR imaging scenes and scenes for objective evaluation. In order to objectively evaluate the generated HDR images, some professional targets are introduced into our dataset, such as a film calibration plate (for assessing details) and a SpyderCheckr (for assessing colour). Examples of our dataset are shown in Figure 9.
Furthermore, we also provide the corresponding RGB images and metadata (e.g., ISO, F-number, exposure time, exposure bias, and white balance coefficients) for each set of samples; these extra data can be used for future work (e.g., training deep learning-based end-to-end ISP pipelines). The provided metadata can also be used to calculate the precise exposure ratio (ER) between images, instead of using the exposure bias to get an approximate value.
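For example, under the standard photographic relation that exposure is proportional to shutter time × ISO / f-number², the ER between two captures could be computed from the metadata as follows (a simplification that ignores sensor-specific response and lens transmission):

```python
def exposure_ratio(t1, iso1, f1, t2, iso2, f2):
    """Approximate exposure ratio of capture 2 relative to capture 1,
    using exposure ∝ shutter_time * ISO / f_number**2."""
    e1 = t1 * iso1 / (f1 ** 2)
    e2 = t2 * iso2 / (f2 ** 2)
    return e2 / e1
```

When ISO and aperture are fixed across the bracket, this reduces to the shutter-time ratio, but the metadata-based form stays correct when the camera varies ISO or aperture between shots.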
Finally, we capture 253 sets of samples in total and keep the 100 sets in which there is no scene or object motion in the static set (e.g., pixel shift smaller than 5 pixels), in order to produce better ground truth HDR images. As described in the main text, 85 sets of samples are used for training and 15 sets for evaluation. The code and the dataset will be made available after publication. Additional qualitative results are provided in Figure 10; these test sets have no ground truth. From the results, our FHDRNet keeps more details and alleviates ghosting artifacts compared with other methods, especially in Figure 10 (d).