Robust photon-efficient imaging using a pixel-wise residual shrinkage network

Single-photon light detection and ranging (LiDAR) has been widely applied to 3D imaging in challenging scenarios. However, limited signal photon counts and high noises in the collected data have posed great challenges for predicting the depth image precisely. In this paper, we propose a pixel-wise residual shrinkage network for photon-efficient imaging from high-noise data, which adaptively generates the optimal thresholds for each pixel and denoises the intermediate features by soft thresholding. Besides, redefining the optimization target as pixel-wise classification provides a sharp advantage in producing confident and accurate depth estimation when compared with existing research. Comprehensive experiments conducted on both simulated and real-world datasets demonstrate that the proposed model outperforms the state-of-the-arts and maintains robust imaging performance under different signal-to-noise ratios including the extreme case of 1:100.


Introduction
Active optical imaging technology has greatly expanded the boundaries of human vision and achieved rapid progress in many emerging applications, such as long-range imaging [1,2], underwater imaging [3,4], non-line-of-sight imaging [5][6][7] and microscope imaging [8,9].In recent years, light detection and ranging (LiDAR) [10] has secured a dominant position in the field of active optical imaging.Conventional LiDAR system requires more than 10 3 signal photons per pixel for accurately imaging a 3D scene [11].However, such large number of signal photons can not be collected in photon-starved scenarios where optical flux and integration time are limited.Since single-photon avalanche diodes (SPADs) [12] provide extraordinary optical sensitivity and picosecond () time resolution, SPAD-based LiDAR, also named single-photon LiDAR, has been designed to fully exploit few or even one signal photon per pixel to precisely capture the 3D structure information in photon-starved scenarios, such as extremely distant and poorly reflective surfaces [1,13,14].
A typical coaxial single-photon LiDAR system as shown in Fig. 1 consists of a pulsed laser, a Micro-Electro-Mechanical system (MEMS) mirror, a SPAD sensor and the synchronization module.In practice, the recorded data may contain a large number of false counts due to the strong ambient lights and dark counts of SPAD.Thus, a key challenge is to robustly recover the precise depth images from such high-noise counting data.Many efforts have been devoted to developing statistical-learning-based denoising algorithms.For example, Shin et al. [11,15] proposed a constrained maximum likelihood (ML) estimation algorithm with a total variance regularization.Following this line, Rapp et al. [16] designed a cluster searching strategy and exploited the estimated reflectivity to improve the performance of the constrained ML estimation.However, these statistical approaches often suffer from performance degradation as the background noise increases.
In recent years, deep learning approaches [17][18][19][20] have demonstrated superior performance for photon-efficient imaging in high-noise scenarios.Lindell et al. [17] proposed a network based on U-Net [21] for imaging at low signal-to-background ratios (SBRs) with sensor fusion strategy.A similar deep sensor fusion strategy that combines RGB images and corrupted SPAD data to estimate the depth of a scene is also proposed in [22].However, these approaches require additional intensity images which are difficult to obtain in realistic scenarios.Peng et al. [18] proposed a non-local neural network to extract the long-range correlations among the photon counting data across the whole temporal dimension.However, the large computational and memory consumption of the non-local block may reduce the efficiency of this model.Based on U-Net++ [23], Zang et al. [19] built a network with rich skip connections between multi-scale features to capture the spatial patterns, while such complex connections are more compatible with clean data.Most recently, Zhao et al. [24] also adopted the U-Net++ architecture and employed an Attention-Directed Attention Gate module to enhance the edges' reconstruction.Generally speaking, existing deep learning models focus more on capturing spatial or temporal correlations from the data while lack a careful consideration for denoising.As a result, the robustness against noise cannot be guaranteed, i.e., these algorithms may fail to behave stably at different SBRs.For this reason, the goal of this paper is to design a denoising network to achieve robust imaging performance at different SBRs.
Soft thresholding has been widely used as an efficient denoising approach in signal processing [25,26] and recently has drawn attention from deep learning communities [27,28].The core idea behind soft thresholding is to use thresholds to distinguish real signals from noises, while the input signals below the thresholds are zeroed and the others are shrinked.Since determining the task-specific thresholds requires significant expertise, Isogawa et al. [27] proposed a neural network to learn the global thresholds for all inputs, and the learned values are fixed after training.Zhao et al. [28] designed a residual shrinkage network to produce dynamic thresholds for different inputs.In particular, the thresholds are designed channel-wise and shared among all pixels, and then soft thresholding shrinkage is leveraged to denoise the data channel by channel.However, since the data collection with respect to each pixel can be treated as an independent process, the optimal thresholds for each pixel must be different from each other.Therefore, the global or channel-wise threshold may not be the best solution for denoising the photon counting data.
In this paper, we propose a pixel-wise residual shrinkage block (PRS block) to improve the feature denoising ability from high-noise SPAD data, with the ultimate goal of generating precise depth images at different SBRs.PRS block calculates the dynamic threshold for each pixel of the input and implements soft thresholding as a nonlinear transformation layer to increase the SBR.The PRS-Net is built by stacking multiple PRS blocks.In addition, an improved temporal window strategy based on a specific 3D convolution is proposed to cluster the signal photons before the PRS blocks.Another advantage brought by the pixel-wise approach is that we can redefine the loss function with cross entropy to improve its optimization performance.More specifically, the previous methods tried to minimize the reconstruction error between the distribution of predictions and a predetermined waveform based on Kullback-Leibler (KL) divergence, where the predetermined waveform is generated from the real depth of each pixel.However, the real depth must be located within one time bin, and thus using a specific pulse waveform that spans a number of time bins to replace the real label is inherently inaccurate.Moreover, since the cross entropy loss can produce a more centralized prediction with the accurate label, the approximation to total variation loss also becomes more accurate than the previous methods.Experiments on simulated dataset demonstrate that PRS-Net achieves better performance than the state-of-the-art approaches, especially in extremely low SBRs.Moreover, on the real-world dataset captured in [17], PRS-Net outperforms the state-of-the-arts by a large margin.

Problem definition
We assume that the pulse repetition period is long enough to avoid the distance aliasing problem [11].Denote the total number of time bins as , and the photon counting histogram ℎ   ∈ R  is recorded for each pixel.The photon counting cube  ∈ R  × × is obtained by combining all the counting histograms, where  ×  is the resolution of the LiDAR system.Under the single-surface assumption [15],  is used to recover a depth image Ẑ ∈ R  × by where  (•) represents the reconstruction algorithm.The key challenge of this task is to recover a faithful depth image from the counting cube with limited signal photon counts.Ideally (with no background count), it may be possible to search for the time bin with most photon counts for the estimation of the depth.However in practice, the recorded histogram may contain a significant number of background counts, which would make real signals indistinguishable from noise.In this case, the aim is to denoise these fake counts and generate Ẑ as close as possible to the ground-truth depth image .

Observation model
We briefly review the observation model in [11] for generating the simulated dataset for training.Denote the impulse response function of the pulsed light as (), and   denotes the average noise intensity including the background light and dark count of SPAD.For the (, )-th pixel, the photon detections produced by SPAD can be modelled by an inhomogeneous Poisson process [29] …

Raw Cube Depth Image Encoder Decoder
Fig. 2. The overall structure of the PRS-Net.
with the time-varying rate function as where c is the speed of light and    is the light energy attenuation ratio due to atmospheric attenuation and diffuse reflection. is the quantum efficiency of the detector.Based on the rate function, the expectation value of the detected photon counts at the (, )-th pixel during an integration period Δ is calculated by where   denotes the beginning time of the integration period.Since the detection of each photon is independent of each other, the photon counting histogram of the (, )-th pixel based on  illuminations can be sampled by following Poisson process as Based on the observation model, simulated photon counting cubes can be sampled using the depth images and reflectivity maps.

Approach
In this section, we describe the structure of the PRS-Network and its optimization target which redefines the imaging task as a pixel-wise classification problem.

Network structure
Figure 2 shows the overall structure of the PRS-Net, which consists of a temporal windowing module, an encoding module, a series of PRS blocks and a decoding module.

Temporal windowing module
In contrast to the background noise counts which are uniformly distributed along the time dimension, signal counts tend to cluster around the time bin that corresponds to the true depth.
We propose a specific 3D convolutional layer with a kernel of the dimension   × 1 × 1 to cluster the photon counts.The parameters of this kernel are all initialized as 1 and fixed during the training.For the (, )-th pixel, the temporal windowing operation is formulated as where [•] denotes the floor function.As shown in Fig. 3, each time bin (even without any photon count) can be taken as a candidate for generating the photon cluster.The temporal windowing strategy is not used to directly judge the accurate position of the time bin that corresponds to the 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 1 2 2 2 3 3 2 1 0 1 1 1 1 1 (a) 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 1 1 1 2 3 3 3 2 1 1 2 2 2  true depth.The goal of temporal windowing is to highlight the signals, such that it becomes much easier for the subsequent modules to find a good filtering threshold than the original input.

Photon Counting Histogram
In other words, candidates for the target time bin are greatly reduced by the temporal windowing strategy.In practice, since most of the signal counts are concentrated in the full width at half maximum (FWHM) of the pulse, we set   as close to the FWHM as possible.The temporal windowing strategy in Rapp et al. [16] calculates a minimum cluster size (a threshold) with a predetermined and complicated formula to filter out the clusters whose counts are below the threshold.In contrast, our temporal windowing is used to enhance the real signal, which facilitates the estimation of soft thresholds to filter out the noise features by the subsequent modules.The thresholds here are generated dynamically and adaptively by the subsequent modules, which may achieve better filtering performance than using thresholds that are determined with a given formula.

Encoding module
The encoding module is designed to generate the latent representations from the photon counting cube.We use the dense dilated fusion strategy [30] to construct this module.Both normal 3D convolution and dilated 3D convolution are adopted here to extend the receptive field.To reduce the computational and memory consumption, we compress the temporal dimension by four convolutional layers with 2 × 1 × 1 stride.Finally, the time dimension of the input is compressed by 4 times and the number of channels is scaled by 32 times.

PRS block
In order to improve the feature denoising ability on the high-noise data, the PRS-block uses pixel-wise soft thresholding to boost the SBR on the latent representations.Improved upon the stable threshold generation strategy from [28], where the same thresholds are shared among all pixels, our strategy is to predict the pixel-wise optimal thresholds for different inputs.As shown in Fig. 4, the input is denoted by  ∈ R ×× × × , where  is the batch size,  is the number of channels, and  ×  is the input resolution.First, two 3 × 3 × 3 convolutional layers followed by a reshape layer are used to obtain the residual input, which is denoted by   ∈ R × × × , for the pixel-wise soft thresholding operation.Then,   is fed into the dynamic thresholds generation block, which is composed of two parallel branches.In the first branch, a normalization layer is used at the beginning to improve the training efficiency [31].
Then three 2D convolutional layers with 1 × 1 kernel size are proposed to calculate the pixel-wise scaling parameter  ∈ R × × .At the end of this branch, a sigmoid function is used to confine the output within [0, 1].In the second branch, we average the absolute values of   along the is the soft thresholding operation.denotes the element-wise summation.For the Convnd( 1 ,  2 ,  3 ,  4 ,  5 ) modules,  1 is the number of input channels,  2 is the number of output channels,  3 is the kernel size,  4 is the stride and  5 is the padding.time dimension.Then the pixel-wise thresholds  ∈ R × × are calculated by where , ,  and  are the corresponding indices in the reshaped tensors.After that, the soft thresholding operation  (•) for each element in   with its corresponding threshold in  is defined by Finally, the denoised representation is reshaped as   ∈ R ×× × × and added with the input as where   is the denoised output with increased SBR.

Decoding module
The decoding module up-samples the denoised features along the temporal dimension.Four deconvolutional layers with a kernel size of 6 × 3 × 3 and a stride of 2 × 1 × 1 are adopted here.
After that, a 1 × 1 × 1 convolutional layer followed by a softmax layer are used to produce the predicted distribution p  of each pixel along the time dimension.The final prediction for the depth ẑ  is generated from p  using a differentiable SoftArgmax [17] operation.

Loss function
Previous works [17][18][19] tried to reconstruct the pulse waveform distribution from sparse and noisy SPAD data at each pixel, as they took the KL divergence to measure the difference between the output and a target normalized pulse waveform that spans a number of time bins.In fact, the real depth must correspond to a single location within one time bin.Thus, we aim to classify each pixel into its depth category, and therefore the target label is much more centralized than the previous works.Consequently, the produced predictions would be more centralized along the time dimension as well.
To be more precise, the optimization target is redefined, and a new loss function based on cross-entropy is proposed.Denote where    is the real depth for the (, )-th pixel and Δ is the size of time bin.By one-hot encoding, we obtain  ∈ R  × × as the classification label from {   }.Then the cross entropy loss between the output and real label is calculated by The total loss is given by in which the total variation loss of the predicted depth image is formulated as for spatial smoothing, where ẑ is the predicted depth generated from p.The SoftArgmax function in the following form is employed to approximate the Argmax function.Mathematically, a more centralized distribution enforced by the cross entropy loss will lead to a more accurate approximation to Argmax function, which would further improve the performance of the model.It is clear that a wide p (trained by KL) may get an inaccurate result.Therefore, our model classifies each pixel into its depth category to make p more centralized than the one trained by fitting the distribution.In other words, although the real signal in a real-world histogram is a distribution, finding the peak of this distribution is the goal of this reconstruction problem.To verify this assumption, we compared the models trained with KL and CE in our ablation study in Section 4.7.The model trained by CE achieved better performance than the one trained by KL.

Experiment
In this section, we first introduce the implementation details and evaluation metrics.After that, the comprehensive evaluation of the proposed model in comparison with the state-of-the-art approaches on both simulated and real-world data is presented.Results of the ablation study are given at the the end of this section.

Implementation details
Data simulation.The simulated data of photon counting cubes are sampled using the observation model, in which the ground-truth depth images and reflectivity images are adopted from the NYU v2 dataset [32].NYU v2 consists of a large number of video sequences captured with Microsoft Kinect sensor from various indoor scenarios.In line with the previous works, the total number of time bins is set as 1024 and the size of time bins is 80.The FWHM of the pulsed light is 400.All the images are reshaped into a resolution of 64 × 64.As a result, the size of the simulated counting cube is 64 × 64 × 1024.13000 and 3000 samples are generated for training and validating, respectively.
Test Setting.All the approaches, including the deep learning and statistical learning models are firstly evaluated on the Middlebury stereo dataset [34], which consists of eight scenes including Art, Books, Bowling, Dolls, Laundry, Moebius, Plastic and Reindeer.We resize the images to 576 × 640 (for Bowling & Plastic scenes) and 576 × 704 (for other scenes) pixels resolution.Then, experiments are conducted on real-world data with 256 × 256 × 1536 pixels resolution measured by [17], which is resized into 256 × 256 × 1024.For deep learning models, the evaluations are conducted in a patch-by-patch manner, where each input is cropped into 128 × 128 × 1024 patches with a stride of 64.After that, the recovered patches are assembled together to produce a full resolution output.

Evaluation metrics
To compare the performances of different approaches, two commonly-used metrics [17][18][19] are adopted, namely, the root mean-square error (RMSE) which is formulated as and the accuracy at a given threshold  which is formulated as

Approaches for comparison
Approaches for comparison are: • LM filetr [35]: Conventional maximum likelihood estimation.
• Shin et al. [11]: The maximum likelihood estimation with a total variance regularization term.
• Rapp et al. [16]: The maximum likelihood estimation combined with a total variance regularization term, a cluster searching algorithm and a superpixel formation strategy.• U-Net [17]: The deep learning model based on U-Net [21].To make a fair comparison, we adopt the version without the intensity fusion strategy.
• ADAG-U-Net++ [24]: The deep learning model based on U-Net++ with an Attention-Directed Attention Gate (ADAG) module to enhance the edges' reconstruction.

Evaluation on simulated data
The test results on the Middlebury dataset are generalized in Table 1.Under two typical SBR levels (i.e., 2:10 and 2:50) and two extremely low SBR levels (i.e., 2:100 and 1:100), the PRS-Net achieves the best RMSE values among all the models of all cases.Considering the accuracy metrics, except for the case with SBR of 2:10 and  of 1.03 where Non-local net slightly outperforms PRS-Net, PRS-Net beats the other models by a large margin.Specifically, PRS-Net have achieved an accuracy of more than 95% with  = 1.01 when SBRs are 2:10, 2:50 and 2:100.Even in the extreme case that SBR is 1:100, PRS-Net still achieves an accuracy of 87.51%, which is significantly better than other models.The comparison results in terms of running time and maximum GPU memory are also displayed in Table 1.The statistical learning algorithms are run on 48G RAM and Intel i5-9400 CPU, which uses 6 parallel kernels.It should be noted that during the test, statistical approaches use the original high-resolution data instead of patches as the input since CPU memory is large enough.It should be noted that since the 128 × 128 × 1024 resolution always used up the memory of our GPU, we tested ADAG-U-Net++ with a smaller resolution of 64 × 64 × 1024.It can be seen that PRS-Net is the fastest among all the deep learning models.Besides, PRS-Net requires the least amount of memory among all the models, including both the statistical and deep learning ones.More importantly, as compared with other competitors, PRS-Net is robust against noise.We depict the curves of RMSE values and accuracies with  = 1.01 by varying SBR in Fig. 5 (a) and (b), respectively.The performances of other models degrade significantly as noise level increases, while the PRS-Net maintains stable performance, despite a slight drop in the extreme case of SBR= 0.01.In this case, as compared with the best competitor, the RMSE value of PRS-Net is reduced by 40% and the accuracy is improved by 8.5%.
Examples of recovered depth images from Art scene and visualizations of the corresponding error maps with 2:50 and 1:100 SBRs are shown in Fig. 6 and Fig. 7, respectively.It is clear that PRS-Net achieves the best imaging quality under both SBRs.According to the error maps in which black pixel indicates a less than 1% difference between the real depth and prediction, the number of abnormal pixels in the depth images of PRS-Net is the least among all competitors, especially in the extreme case of SBR=0.01.In particular, the accuracy of PRS-Net with  = 1.01 has reached 95.82% for SBR of 2:50, while the accuracy of the second-best model is 90.95%.Although the accuracy of PRS-Net has fallen below 90% for SBR of 1:100, it is still the only approach that maintains an accuracy beyond 85%.
In order to measure and eliminate the randomness induced by Poisson sampling process, we choose the Art and Bowling scenes from Middlebury stereo dataset as the representatives of complex and simple scenes, and simulate 10 sets of random data with different SBRs to calculate the average performance.More specifically, 10 trials are repeated to calculate the average RMSE values with 10 sets of randomly generated data based on Poisson sampling.As shown in Table 2, PRS-Net achieves superior performance which is consistent with single test.Moreover, PRS-Net basically has the smallest standard deviations when compared with other models, which suggests a remarkable robustness of performance for PRS-Net.

Real-world evaluation
We conduct experiments on real-world data captured by indoor single-photon imaging prototype [17], where the photon counting histogram for each pixel consists of 1536 time bins with a bin size of 26.In the real-world data, photon counts are separately distributed in several consecutive bins.To ease processing, we sum the photon counts of every neighboring pair of bins along the time dimension to get 768 bins.By this way, the number of time bins is compressed to 1024, with the remaining bins padded by zeros.Thus the bin size has doubled to 52.Figure 8 shows the performance comparisons in five real-world scenarios, including Lamp, Hallway, Rolling ball, Kitchen and Stuff.In Lamp scene, PRS-Net produces the cleanest image, with the most precise details around the bulb and iron arm.In Hallway scene, PRS-Net is the only approach that has successfully restored the complete human body.In Ball scene, PRS-Net produces the least amount of noise near the left palm, and the clearest thumb on the right.In Kitchen scene, PRS-Net recovers the clearest outline of the human head, such that even the nose can be distinguished.In addition, the black box on the wall has been successfully recovered as well, which is the closest to the intensity image among all the models.In Stuff scene, PRS-Net has produced the fewest abnormalities.

Distribution of predictions with pixel-wise approach
The improvement brought by PRS-Net can be understood by calculating the averaged variance of the probability distribution of predictions along the time dimension by where Variance defined in Eq. ( 17) measures the dispersion of the probability distribution over all the pixels, which indicates how confident the model is about its prediction.In PRS-Net, the cross-entropy loss for pixel-wise classification makes the probability distribution of predictions more centralized at one time bin, which leads to a more precise approximation to Argmax function with the SoftArgmax operation in Eq. ( 14).The comparison results on the averaged variances using Eq. ( 17) in the Art scene are summarized in Table 3.It can be seen that the averaged variances of predictions of PRS-Net is much lower than other deep learning models for all values of SBRs, which means that the predictions of PRS-Net are more centralized.
In particular, we randomly select a pixel from the Art scene and visualize the output distributions of predictions produced by different models in Fig. 9.It is clear that the width of the probability distribution produced by PRS-Net is the smallest, and the peak of distribution is the highest among all competitors.

Ablation study
To verify the effectiveness of soft-thresholding operation in PRS-block, pixel-wise classification loss and temporal window strategy, we conduct ablation tests by removing the relevant components from the architecture as depicted in Fig. 2. By comparing the results of the second and last row in Table 4, it can be seen that the network with and without the soft-thresholding branch in PRS-block achieve comparable performances if SBR is 2:10 or 2:50.However, under the high-noise setting, i.e., SBR of 2:100 or 1:100, the complete model with soft-thresholding outperforms the model without soft-thresholding branch by a large margin.By comparing the results of the first and last row, we see that the pixel-wise classification loss greatly improves the   performance of the model over all SBRs, which demonstrates the advantage of using pixel-wise approach.Besides, the results of the third and last row show that the temporal window strategy may slightly decrease the RMSE.We also show the ablation test results on both the simulated data and the real-world data in Fig. 10.The results are consistent with the statistics in Table 4.The differences between CE and KL losses are shown by comparing the columns A+B+D and A+B+C.The differences between with and without the soft-thresholding branch are shown by comparing the columns A+B+D and A+D.It is clear that the network performance degrades significantly if either the CE loss has been replaced or the soft-thresholding branch has been removed.Besides, the temporal windowing strategy has shown its advantage in restoring more details when comparing the columns A+B+D and B+D.Specifically, we see the number of abnormal pixels are reduced ((b) and (c)), the needle-like object is revealed ((g) and (h)), and a complete restore of the head of a man has been achieved ((l) and (m)) with the temporal windowing strategy.A clearer outline of the human head, for which even the nose can be distinguished, is restored by the temporal windowing strategy when comparing (q) and (r).

Conclusion
In this work, we proposed a pixel-wise residual shrinkage network for robust photon-efficient imaging especially in high-noise conditions.Specifically, a noval PRS-block with pixel-wise soft thresholding has been proposed to extract a clean signal from noisy SPAD measurements.By adopting cross-entropy loss for pixel-wise classification, the optimization target becomes more precise, which enforces more confident and centralized depth predictions and thus brings significant performance improvements over the previous works.
The combination of PRS-block, pixel-wise classification loss and temporal window strategy offers higher imaging quality than state-of-the-art approaches.Extensive experiments on simulated and real-world datasets demonstrate that PRS-Net consistently outperforms the stateof-the-art models by a large margin, for SBRs ranging from 2:10 to 1:100.In particular, PRS-Net shows a superior denoising ability irrespective of SBRs.Besides, benefitting from the simplified architecture of network, PRS-Net requires the shortest processing time and memory in comparison with other deep learning methods.These results suggest an efficient and robust approach for imaging in extreme scenarios such as ultra-long and underwater imaging.Since the experiment of PRS-Net for high-resolution images still takes tens of seconds, an interesting future direction is to further improve the imaging speed for real-time imaging.

Fig 3 .
Fig 3. (a-b) Two examples for the temporal windowing strategy.The raw counting data and the generated clusters are listed as the top and bottom histograms, respectively.The time bin marked in yellow indicates the real depth.The black dashed arrow lines indicate the moving trajectory of the window.Signal and noise photons are marked in green and red, respectively.

Fig 4 .
Fig 4. The pipeline of PRS block. is the reshape operation.denotes the absolute value function.denotesthe average operation along the time dimension.is the element-wise multiplication.isthe soft thresholding operation.denotes the element-wise summation.For the Convnd( 1 ,  2 ,  3 ,  4 ,  5 ) modules,  1 is the number of input channels,  2 is the number of output channels,  3 is the kernel size,  4 is the stride and  5 is the padding.

Fig 6 .Fig 7 .
Fig 6.Comparison results of different approaches in the Art scene with SBR of 2:50.(a) The pixels of RGB images are colored using the predicted depth values with the colorbars.The corresponding monochrome images are the visualizations of the error maps.The white pixels are ones whose predictions fail to satisfy Eq. (16) for  = 1.01.Alternatively, the differences between the predicted and real depths of the black pixels are below 1%.Note that the depth images created by LM filter and Shin et al. are colored with different scales due to high variance of the predictions.(b) The RMSE and Accuracy (for  = 1.01) results calculated on the Art Scene with SBR of 2:50.The best results are highlighted in bold.

Fig 8 .
Fig 8.The reconstructed depth images on five real-world scenarios.The first row presents the intensity images captured by a regular camera.

Fig 10 .
Fig 10.The definitions of A, B, C and D are the same as in Table.4.The top two rows are the reconstructed depth images on simulated data with SBR=0.01 and the bottom two rows are on real-world data.(a∼e) are cropped from Art. (f∼j) are cropped from Laundry.(k∼o) are cropped from Hallway.(p∼t) are cropped from Kitchen.Note that we use depth images as the ground-truths in (a) and (f) while use intensity images in (k) and (p).

4 .
Fig 10.The definitions of A, B, C and D are the same as in Table.4.The top two rows are the reconstructed depth images on simulated data with SBR=0.01 and the bottom two rows are on real-world data.(a∼e) are cropped from Art. (f∼j) are cropped from Laundry.(k∼o) are cropped from Hallway.(p∼t) are cropped from Kitchen.Note that we use depth images as the ground-truths in (a) and (f) while use intensity images in (k) and (p).

Table 1 .
Performance comparisons on the simulated dataset.The results of two metrics are reported as the average over the eight scenes from Middlebury dataset.The best results for each column are highlighted in bold.The statistic learning models are run on CPU (48GB) while deep learning models are run on GPU (8GB).The computational costs during the test are summarized as well.

Table 2 .
The mean and standard deviation of RMSE results in Art and Bowling scenes by repeating ten trails, for different SBRs.For each column, the best results are highlighted in bold.

Table 3 .
Averaged variances of the output probability distributions for different deep learning models in Art scene.The average is taken over all pixels.The best result for each column is highlighted in bold.

Table 4 .
Ablation study results using RMSE metric on the simulated dataset.PRS-Net is tested in different combination of components, where A stands for the temporal window strategy, B stands for the soft-thresholding operation in PRS-block, C and D stand for training with pulse waveform reconstruction loss and pixel-wise classification loss, respectively.indicates that this component is adopted in the network.The best results are highlighted in bold and comparable results are underlined.