Non-fusion time-resolved depth image reconstruction using a highly efficient neural network architecture

Abstract: Single-photon avalanche diodes (SPADs) are powerful sensors for 3D light detection and ranging (LiDAR) in low-light scenarios due to their single-photon sensitivity. However, accurately retrieving ranging information from noisy time-of-arrival (ToA) point clouds remains a challenge. This paper proposes a photon-efficient, non-fusion neural network architecture that directly reconstructs high-fidelity depth images from ToA data without relying on other guiding images. In addition, the architecture was compressed via a low-bit quantization scheme, making it suitable for implementation on embedded hardware platforms. The proposed quantized network achieves superior reconstruction accuracy with fewer parameters than previously reported networks.


Introduction
Depth imaging has become an essential tool in various applications, such as autonomous vehicles [1], vision-guided robotic systems [2], and augmented reality [3]. Many strategies have been proposed to obtain depth information from captured intensity images. For example, Zhan et al. used machine learning and monocular depth-perception principles [4] to retrieve depth information from RGB images; however, the reconstruction fidelity deteriorates because of scale ambiguity. The stereovision technique, which mimics the human visual system [5], is also popular; it uses the triangulation principle to recover spatial information. Conventional stereo cameras are not photon-sensitive, so only a small fraction of the reflected photons is detected. Moreover, the sensing accuracy of the stereovision approach deteriorates in dark conditions or over long distances, limitations that LiDAR can overcome. LiDAR has therefore become popular in ranging applications. Unlike Radar systems [6], which use radio waves to measure the time-of-flight (ToF) between transmitted and reflected signals, LiDAR systems adopt pulsed light with a much shorter wavelength to detect an object's range. Therefore, LiDAR can obtain more accurate spatial information than Radar when observing objects at long distances [7].
Single-photon avalanche diodes (SPADs) [8] are effective LiDAR sensors due to their single-photon sensitivity and excellent temporal resolution. Richardson et al. developed low-noise SPAD sensors [9] and applied them to time-resolved imaging [10,11]. Recent advances in silicon manufacturing have introduced more compelling high fill-factor devices [12]. Figure 1 shows a conventional SPAD-based LiDAR system, including a pulsed laser, a SPAD sensor, and a time-correlated single-photon counting (TCSPC) module. The TCSPC module includes a picosecond time-to-digital converter (TDC) to measure and time-stamp reflected photons from the target. The system also contains a histogramming module that establishes a time-of-arrival (ToA) profile for estimating the average depth (or distance). For SPAD-based 3D ranging, convex-optimization [13] and Bayesian-inference [14] approaches have been widely adopted to tackle the recovery problem in low photon-flux conditions. Shin et al. [15] modelled photon registration behavior with the rate function [16] of a Poisson process and used constrained maximum likelihood estimation (MLE) to reconstruct depth images. Later, to improve the reconstruction accuracy, Rapp and Goyal [17] proposed a pixel-wise spatial unmixing strategy to separate the signal cluster from the background noise. Tachella et al. used Bayesian theory to identify surfaces from pixel-wise histograms [18], employing an advanced Markov chain Monte Carlo (MCMC) sampling method to address the maximum a posteriori (MAP) problem. Although their algorithm yields outstanding 3D reconstructions at a fast processing speed, several hyperparameters must be adjusted to maintain accuracy and computing efficiency. Quentin et al. [19] used the expectation-maximization (EM) method to estimate multi-spectral and depth profiles; it shows excellent performance in estimating mixed probabilistic models with latent variables.

Fig. 1. Time intervals between emitted and detected photons are measured and digitized by the TCSPC system. Detected photons are accumulated to generate a 3D tensor. Each data cube (blue) is a histogram, and neural network post-processing is adopted to retrieve the time-bin index representing the average distance.
Deep neural networks (DNNs) feature hierarchical structures with a powerful ability to learn from complex data. Using DNNs with sensor-fusion strategies has become a new trend for extracting depth maps from ToA data. Lindell et al. [20] first introduced a U-net neural network [21] that merges 2D intensity images and 3D SPAD tensor data to reconstruct depth images. To make the hardware platform more compact, a dedicated neural network was proposed [22] for a SPAD array that can simultaneously generate intensity images and ToA data. Another sensor-fusion architecture [23] uses monocular depth-perception principles to retrieve a coarse-grained depth map in advance; the corresponding ToA data is then fused with the up-projected monocular depth map. This fusion strategy achieved more accurate results than the SPAD-intensity version. A non-local neural network model [24] was introduced to explore long-range correlations along the spatial and temporal dimensions. However, the networks mentioned above have large model sizes and redundant parameters. Consequently, they exhibit long training times and slow inference speeds, making them unsuitable for real-time scenarios. It is also impractical to implement them on embedded hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).
In this work, we propose a photon-efficient 3D convolutional neural network (CNN) architecture that does not use sensor-fusion strategies. Unlike U-net-based models that only fuse data of the same dimension [20,22,23], our architecture fuses multi-dimensional spatial and temporal features to extract more information. Spatial correlations in depth images can be effectively explored from SPAD data alone using short- and long-range connections and multiple up-sampling processes. We also compressed the architecture using a low-bit parametric quantization strategy; it achieves a good compression ratio over existing sensor-fusion neural network models while maintaining high reconstruction quality. The results show that our architecture enhances accuracy without auxiliary RGB or monocular depth images. Compared with previously reported algorithms, it yields superior reconstruction results in low photon-flux conditions. We evaluated the compressed model over various signal-to-background ratio (SBR) levels using extensive evaluation metrics [25] on synthetic and captured data [20] with high spatial resolution. The results show that this work is suitable for single-photon LiDAR applications in low-light scenarios.

Problem definition
For raster-scanning and wide-field ranging systems, a laser source generates periodic pulses s(t) to illuminate the target scene. The reflected photon flux is detected by individual pixels with quantum efficiency η ∈ [0, 1). We assume that only a few photons are detected (less than 1 photon per pixel), so pile-up effects and carrier crosstalk [26] are negligible. The spatial resolution of the SPAD array is $(m_1, m_2) \in \{1, 2, \ldots, M\}^2$. According to previously published reports [16,18-20], the number of recorded photons in time interval $n \in \{1, 2, \ldots, N\}$ of each pixel can be formulated as

$$\lambda_{m_1,m_2}[n] = \int_{(n-1)\Delta t}^{n\Delta t} \left[ \eta\, g\!\left(t - \frac{2\, d_{m_1,m_2}}{c}\right) + \eta\, b_\lambda \right] dt, \tag{1}$$

where g denotes the instrument response function (IRF), Δt is the time bin-width of the TDC, $d \in \mathbb{R}_+^{M \times M}$ is the scene's depth profile, c is the speed of light, and $b_\lambda$ is the ambient light at wavelength λ. The photon arrival behavior can be formulated as an inhomogeneous Poisson process [27]. The dark count [26], triggered by thermally generated carriers in the SPAD sensor, is also included as a time-varying factor in the Poisson process's rate function. Consequently, the histogram in one pixel with I illumination periods can be formulated by a Poisson process $\mathcal{P}(\cdot)$ with a time-varying arrival function:

$$h_{m_1,m_2}[n] \sim \mathcal{P}\!\left( I \left( \gamma\, \lambda_{m_1,m_2}[n] + b_d \right) \right), \tag{2}$$

where the constant γ represents the attenuation factor caused by photon scattering on the surface and $b_d$ is the dark count rate. Suppose f(·) is the neural network's feedforward function, and the input is a noise-corrupted tensor H composed of 1D pixel-wise histograms with a 2D spatial resolution of $M^2$. The pixel-wise denoised ToA data can then be modelled as

$$\hat{H} = f(H; \theta), \tag{3}$$

where θ is the parameter set to be learned, which we estimate via MLE as

$$\hat{\theta} = \arg\max_{\theta}\; \sum_{m_1, m_2, n} \log p\big(h_{m_1,m_2}[n] \mid f(H; \theta)\big). \tag{4}$$

Equation (4) is solved by the neural network's learning phase, detailed in the next section.
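To make the observation model concrete, the following minimal sketch simulates a single pixel's ToA histogram according to Eqs. (1)-(2). The Gaussian IRF and all numeric parameter values (bin width, depth, rates, period count) are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np

# Minimal single-pixel forward-model sketch for Eqs. (1)-(2).
# The Gaussian IRF and all numeric values are illustrative assumptions.
N = 1024                # number of time bins
dt = 26e-12             # TDC bin width in seconds (cf. the LinoSPAD TDCs)
c = 3e8                 # speed of light (m/s)
depth = 1.5             # true depth d of this pixel (m)
eta, gamma = 0.3, 0.5   # quantum efficiency, surface attenuation factor
b_lambda, b_dark = 1e-2, 1e-3   # ambient-light and dark-count rates per bin
I = 1000                # number of illumination periods

t = (np.arange(N) + 0.5) * dt
tof = 2.0 * depth / c                       # round-trip time of flight
sigma = 100e-12                             # assumed IRF width (s)
g = np.exp(-0.5 * ((t - tof) / sigma) ** 2)
g /= g.sum()                                # normalized IRF over the bins

lam = eta * g + eta * b_lambda              # per-bin arrival rate, Eq. (1)
rate = I * (gamma * lam + b_dark)           # Poisson rate over I periods, Eq. (2)
histogram = np.random.poisson(rate)         # noisy ToA histogram h[n]
print("argmax bin:", histogram.argmax(), "true bin:", int(tof / dt))
```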

Non-fusion ToA denoising model
This section introduces the proposed network architecture, loss function, and low-bit quantization scheme. To conduct a fair comparison with peers' work, we focus on depth images retrieved in low photon-flux conditions with a low SBR.

Neural network architecture
As shown in Fig. 2, U-net++ [28] was used as the network's backbone, modified to a 3D version to denoise the ToA tensor. The network consists of two parts: a main feature-extraction module and a refinement module. In Fig. 2, $x^{i,j}$ represents the nodes, where $i \in \{0, 1, 2, \ldots, l\}$ is the row number along the down-sampling path (blue arrows), $j \in \{0, 1, 2, \ldots, l\}$ indexes the nodes along the skip connections in each row, and l is the level (depth) of the down-sampling. A refinement module and a differentiable argmax function follow the last up-sampled node. The procedure for calculating a feature map can be formulated as

$$x^{i,j} = \begin{cases} \mathrm{Conv}\big(\mathcal{M}(x^{i-1,j})\big), & j = 0, \\ \mathrm{Conv}\big(\big[\,[x^{i,k}]_{k=0}^{j-1},\ \mathcal{U}(x^{i+1,j-1})\,\big]\big), & j > 0, \end{cases} \tag{5}$$

where Conv(·) denotes the convolution operation, $\mathcal{M}(\cdot)$ the max-pooling that performs down-sampling, $\mathcal{U}(\cdot)$ the transposed convolution that performs up-sampling, [·] the concatenation operation, and S(·) the soft argmax function, applied after the refinement module, that finds the bin index.

Fig. 2. The proposed model's architecture. The input is a ToA tensor corrupted by background noise. By adopting dense connections in each row, short- and long-range information can be fully explored, so a more robust depth map can be obtained.
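As a concrete illustration, the fragment below sketches the node rule of Eq. (5) for 3D data in PyTorch. The channel widths, kernel sizes, and input shape are arbitrary assumptions, and the refinement head is omitted.

```python
import torch
import torch.nn as nn

# Sketch of the U-net++-style node rule in Eq. (5) for 3D ToA data.
# Channel widths, kernel sizes, and the input shape are example choices.
class DenseNode(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, *inputs):
        # Concatenate all incoming features along the channel axis: [.]
        return self.conv(torch.cat(inputs, dim=1))

down = nn.MaxPool3d(2)                                  # M(.), down-sampling
up = nn.ConvTranspose3d(8, 8, kernel_size=2, stride=2)  # U(.), up-sampling

x00 = DenseNode(1, 8)(torch.randn(1, 1, 32, 32, 64))    # x^{0,0}
x10 = DenseNode(8, 8)(down(x00))                        # x^{1,0} = Conv(M(x^{0,0}))
x01 = DenseNode(16, 8)(x00, up(x10))                    # x^{0,1} = Conv([x^{0,0}, U(x^{1,0})])
```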
By adding dense connections in each horizontal row, our architecture can compensate for information lost during down-sampling. Moreover, long-range and short-range concatenations (black dotted lines in Fig. 2) between horizontal convolutional layers fuse different non-local spatial and temporal information from the ToA measurements. Within each pixel, a soft-argmax function [20] is applied to find the index corresponding to the ToF (or the distance) (see Fig. 2). The ToA tensor's noise can thus be suppressed during the learning phase, and the network generates a squeezed 2D depth map from the denoised ToA tensor. Owing to these features, the architecture does not need any guiding images (monocular or intensity), thereby requiring less processing time and fewer parameters than fusion approaches. Max-pooling is applied as the down-sampling operation along the down-sampling path to reduce network parameters and computing time.
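A minimal sketch of such a differentiable (soft) argmax over the temporal axis, in the spirit of [20], is shown below; the sharpness parameter `beta` is an assumed setting rather than the paper's exact formulation.

```python
import torch

def soft_argmax(histogram, beta=100.0):
    """Differentiable argmax over the time-bin axis (sketch in the spirit of [20]).

    histogram: tensor of shape (..., N). beta sharpens the softmax so that the
    expected bin index approaches the hard argmax; its value is an assumption.
    """
    n = histogram.shape[-1]
    weights = torch.softmax(beta * histogram, dim=-1)
    bins = torch.arange(n, dtype=histogram.dtype, device=histogram.device)
    return (weights * bins).sum(dim=-1)  # expected time-bin index per pixel

# The depth map follows from the recovered index: d = idx * dt * c / 2.
```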
To make our network portable to embedded hardware, a model-compression strategy was applied to simplify it. The bottleneck of embedding neural networks in reconfigurable hardware lies in the limited on-chip memory for storing pre-trained parameters and the large memory bandwidth required for data transfer. We therefore generalize a 2D low-bit parametric quantization scheme [29] to 3D data to compress the model. Multiplications of floating-point numbers can then be converted to bitwise operations on fixed-point (binary-coded) numbers. Briefly, suppose x and y are two fixed-point integers coded with M-bit and K-bit binary digits, respectively:

$$x = \sum_{m=0}^{M-1} c_m(x)\, 2^m, \qquad y = \sum_{k=0}^{K-1} c_k(y)\, 2^k,$$

where $c_m(x), c_k(y) \in \{0, 1\}$ are the corresponding binary digits. The dot product of x and y can therefore be computed as

$$x \cdot y = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k}\, \big(c_m(x)\ \mathrm{and}\ c_k(y)\big), \tag{6}$$

where all operations can be performed via economical bitwise operations in FPGAs. More details are provided in [29]. In our neural network, the weights (W) and the activation functions' outputs (A) are quantized with different bit-widths, denoted WXAY, where X and Y are the respective bit-widths. The quantization process in each convolutional layer is depicted in Fig. 3. The weights and activations can be quantized with an arbitrary low bit-width during the forward propagation. The weights are first normalized via a tanh(·) function, and a quantization function $Q_k(\cdot)$ converts the normalized floating-point weights to a fixed-point format, as defined in Eq. (7):

$$w^q = 2\, Q_k\!\left( \frac{\tanh(w)}{2 \max\big(|\tanh(W)|\big)} + \frac{1}{2} \right) - 1, \tag{7}$$
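The sketch below illustrates Eq. (6) in Python, extended from scalars to short vectors by counting the set bits after each AND, as popcount logic would on an FPGA. The 2-bit widths and the test values are arbitrary assumptions.

```python
import numpy as np

# Bitwise fixed-point dot product of Eq. (6): only AND and popcount are used.
def bitwise_dot(x, y, M=2, K=2):
    acc = 0
    for m in range(M):
        cm = (x >> m) & 1              # m-th binary digit of each element of x
        for k in range(K):
            ck = (y >> k) & 1          # k-th binary digit of each element of y
            acc += (1 << (m + k)) * np.count_nonzero(cm & ck)
    return acc

x = np.array([1, 2, 3], dtype=np.uint8)  # 2-bit (M = 2) values
y = np.array([3, 0, 1], dtype=np.uint8)  # 2-bit (K = 2) values
assert bitwise_dot(x, y) == int(np.dot(x, y))  # 1*3 + 2*0 + 3*1 = 6
```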
where $Q_k(r)$ is defined as

$$Q_k(r) = \frac{1}{2^k - 1}\,\mathrm{round}\big( (2^k - 1)\, r \big). \tag{8}$$

The combination of Eqs. (7) and (8) constitutes a 'straight-through estimator' [29], which is also used in the back-propagation. Floating-point parameters are thus quantized to the k-bit fixed-point format in the forward propagation. The quantization procedure was emulated in PyTorch as a prototype for future hardware implementations. The performance of the proposed quantization strategy is evaluated in Fig. 4 in the next section.
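A compact PyTorch sketch of Eqs. (7)-(8) with the straight-through estimator is given below; passing gradients through the rounding unchanged is the standard trick of [29], while the tensor shape and bit-width are example choices.

```python
import torch

# Sketch of Eqs. (7)-(8) with a straight-through estimator (STE).
def quantize_k(r, k):
    scale = 2 ** k - 1
    q = torch.round(scale * r) / scale   # Eq. (8): k-bit fixed-point grid
    return r + (q - r).detach()          # STE: forward uses q, backward is identity

def quantize_weights(w, k):
    t = torch.tanh(w)
    r = t / (2 * t.abs().max()) + 0.5    # normalize weights into [0, 1]
    return 2 * quantize_k(r, k) - 1      # Eq. (7): quantized weights in [-1, 1]

w = torch.randn(8, requires_grad=True)
wq = quantize_weights(w, k=2)            # 'W2': 2-bit weights
wq.sum().backward()                      # gradients flow through the STE
```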

Training detail
Solving the MLE problem in Eq. (4) is mathematically equivalent to minimizing the Kullback-Leibler (KL) divergence [30]. The KL divergence is therefore adopted to minimize the distance between the ground-truth (GT) histograms and the network's output histograms over the spatial dimension:

$$\mathcal{L}_{\mathrm{KL}}(\hat{H}, H) = \sum_{m_1,m_2} \sum_{n} h_{m_1,m_2}[n]\, \log \frac{h_{m_1,m_2}[n]}{\hat{h}_{m_1,m_2}[n]}, \tag{9}$$

where H is the histogram tensor composed of the pixel-wise GT histograms (denoted h) and $\hat{H}$ is the network output containing the predicted pixel-wise histograms (denoted $\hat{h}$). To enhance the efficiency of 2D spatial denoising, total variation (TV) regularization [31] is added to minimize the spatial variation. The whole loss function becomes

$$\mathcal{L} = \mathcal{L}_{\mathrm{KL}} + \beta\, \mathcal{L}_{\mathrm{TV}}, \tag{10}$$

where β is a hyperparameter tuned before training. We use stochastic gradient descent (SGD) to learn the parameters during the back-propagation, expressed as

$$\theta_{t+1} = \theta_t - \alpha\, \nabla_{\theta}\, \mathcal{L}(\theta_t), \tag{11}$$

where α is the learning rate.

As for the training data, the NYUv2 [32] and Middlebury [33] datasets were utilized as the training and testing datasets, respectively. The training dataset contains nine types of indoor scenes with 13k synthetic ToA tensors, and the validation dataset contains 1.3k. The network's input is a data cube of size 512×512×1024, and SGD with the ADAM algorithm [34] in the PyTorch library was employed to execute the back-propagation. The training and validation loss curves shown in Fig. 4 were generated and fetched from the TensorBoard toolkit; for better visualization, we applied a smoothing factor of 0.8 to alleviate fluctuations. For each convolutional module except the last, batch normalization and ReLU operations are added after the convolution to alleviate the vanishing gradient. We used mini-batches of size 5 to optimize the gradient descent and save memory. The hyperparameter β of the TV loss is $10^{-5}$. Four epochs (each containing 3200 iterations) were configured to train the network, guaranteeing a converging loss. The training dataset is randomly shuffled before each training epoch to help the model generalize. Our architecture's training time is 17 hours, 7 hours shorter than that of existing sensor-fusion networks.

For the inference phase, we aim to retrieve depth images with a high spatial resolution of 688×552. Due to limited GPU memory, the whole tensor (of size 688×552×1024) was divided into smaller tensors of size 64×64×1024 for the network's input. Since the validation loss curve fluctuates significantly, it is difficult to use early stopping to cease the training and prevent over-fitting. Our approach is to smooth the loss curve and obtain an approximate range that contains the smallest loss; we then apply the synthetic testing dataset to the models saved in this range and pick the one generating the minimum loss (or RMSE). The selected model is employed to infer depth images from captured SPAD data. Finally, a reconstructed result with high spatial resolution is generated by stitching the individual low-resolution patches consecutively. We used SPAD and intensity data captured by the LinoSPAD system [35] as real-world test datasets; the tensor's size is 256×256×1536, and the average bin-width of the embedded TDCs is 26 ps. Moreover, for a fair comparison with the monocular-SPAD architecture [23], DenseDepth [36] was used to reproduce the monocular depth images fused with ToA data in the SPADnet [23] model.
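A minimal PyTorch sketch of the combined loss in Eqs. (9)-(10) is given below; the anisotropic l1 form of the TV term and the tensor layout are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of Eqs. (9)-(10): pixel-wise KL divergence between histograms plus
# TV regularization on the predicted depth map. The anisotropic l1 TV form
# and the (B, N, H, W) layout are assumptions.
def toa_loss(h_pred, h_gt, depth_pred, beta=1e-5):
    # h_pred, h_gt: (B, N, H, W), normalized over the N time bins per pixel
    # depth_pred:   (B, H, W), e.g. the soft-argmax output scaled to depth
    kl = F.kl_div(h_pred.clamp_min(1e-12).log(), h_gt, reduction='batchmean')
    tv = (depth_pred[:, 1:, :] - depth_pred[:, :-1, :]).abs().mean() + \
         (depth_pred[:, :, 1:] - depth_pred[:, :, :-1]).abs().mean()
    return kl + beta * tv  # Eq. (10)
```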

Loss evaluation
The training and validation losses of the floating-point and quantized models are shown in Fig. 4. We implemented three quantization cases and compared them with the floating-point version. Since training and inference are sensitive to the activations' bit-width, we tried different bit-widths for quantizing the activations' outputs and selected the best case. As depicted in Fig. 4, only case W2A2 shows degraded performance. Considering both accuracy and the consumption of computing resources, W2A4 works best. Table 1 compares the model sizes of existing sensor-fusion models and the proposed architecture, where 'Lindell w/ intensity' and 'Lindell w/o intensity' denote training with and without intensity images. The compression rate is determined by comparing the model size (obtained from PyTorch) with those of existing models. Although our network was trained on less powerful GPUs (NVIDIA RTX 5000, whereas NVIDIA Titan V and 1080Ti GPUs were used in [20], [23], and [24]), it still achieves the shortest training time. Moreover, after quantization, the network attains a remarkable compression ratio compared with the original floating-point model and existing network architectures. It is suitable for hardware-embedded solutions, as it requires much less on-chip memory to pre-load the model. The performance of the compressed model is detailed in the following subsections.

Synthetic data
We first evaluated the reconstruction quality on simulated data using five different metrics: the accuracy for a given threshold thr, RMSE, RMSE (log), the absolute relative difference (Abs rel), and the squared relative difference (Sq rel), defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{M^2} \sum_{m_1,m_2} \big(\hat{d}_{m_1,m_2} - d_{m_1,m_2}\big)^2}, \qquad \mathrm{RMSE\,(log)} = \sqrt{\frac{1}{M^2} \sum_{m_1,m_2} \big(\log_{10}\hat{d}_{m_1,m_2} - \log_{10} d_{m_1,m_2}\big)^2},$$

$$\mathrm{Abs\ rel} = \frac{1}{M^2} \sum_{m_1,m_2} \frac{\big|\hat{d}_{m_1,m_2} - d_{m_1,m_2}\big|}{d_{m_1,m_2}}, \qquad \mathrm{Sq\ rel} = \frac{1}{M^2} \sum_{m_1,m_2} \frac{\big(\hat{d}_{m_1,m_2} - d_{m_1,m_2}\big)^2}{d_{m_1,m_2}},$$

and the accuracy with threshold thr is the fraction of pixels satisfying $\max\big(\hat{d}_{m_1,m_2}/d_{m_1,m_2},\; d_{m_1,m_2}/\hat{d}_{m_1,m_2}\big) < thr$, where d and $\hat{d}$ denote the ground-truth and predicted depth images.

Seven indoor target scenes from the Middlebury dataset were assessed. Table 2 shows the values averaged across the seven scenes at three SBRs. Compared with existing algorithms, our compressed architecture obtains comparable or improved accuracy relative to Sun et al. [23], who used an RGB-SPAD fusion strategy, and Peng et al. [24], whose architecture uses long- and short-range residual connections. Despite the similar performance, note that Sun et al. adopted a log-scale binning method that merges fewer time-bins for the front indexes and more time-bins for the ending indexes; the accuracy therefore deteriorates when sensing long-distance objects. We use a simulated indoor scene (a lecture theatre) to demonstrate this point (Fig. 5): the binning method cannot reconstruct the relatively distant objects highlighted in red boxes. Table 3 shows that our results are better than those of [23]. As for Peng et al.'s architecture, long-range correlations of feature maps were effectively explored, and dilated convolutions were employed to enlarge the receptive field; both techniques enhance accuracy with fewer parameters. However, Peng et al.'s method involves many vectorized operations during the training and inference phases, resulting in a longer training time (36 hours) than ours (17 hours on a less powerful GPU). Moreover, owing to the complicated residual connections of [24], it is difficult to reconfigure the structure for different tasks with a proper trade-off. In contrast, our architecture has good scalability: the number of down- and up-sampling stages can be configured without redesigning the connections. Lastly, our network maintains high reconstruction accuracy despite a much smaller model size.
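For reference, a NumPy sketch of the five metrics is given below; the default threshold thr = 1.25, a common choice in the depth-estimation literature, is an assumption.

```python
import numpy as np

# Sketch of the five evaluation metrics for positive GT and predicted depth
# maps d and d_hat of equal shape. thr = 1.25 is an assumed default.
def depth_metrics(d, d_hat, thr=1.25):
    rmse = np.sqrt(np.mean((d_hat - d) ** 2))
    rmse_log = np.sqrt(np.mean((np.log10(d_hat) - np.log10(d)) ** 2))
    abs_rel = np.mean(np.abs(d_hat - d) / d)
    sq_rel = np.mean((d_hat - d) ** 2 / d)
    acc = np.mean(np.maximum(d_hat / d, d / d_hat) < thr)
    return rmse, rmse_log, abs_rel, sq_rel, acc
```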
To indicate statistical significance, we further calculated p-values between our compressed model and the models in [23] and [24] on the test datasets in terms of Abs rel. We obtained a list of Abs rel values (21 in total, representing 7 scenes at 3 SBR levels) from each pair of GT and reconstructed depth images for each algorithm. The p-value of Abs rel is 2.94 × 10⁻⁵ between the model in [23] and ours (and 0.0162 between the model in [24] and ours). Both are smaller than 0.05, meaning that the difference in Abs rel between our model and those of [23] and [24] is statistically significant.
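The text does not state which significance test produced these p-values; a paired t-test over the 21 per-scene Abs rel values is one plausible choice, sketched below with random placeholder arrays rather than the measured values.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical significance-test sketch: a paired t-test over 21 Abs rel
# values (7 scenes x 3 SBR levels). The arrays are random placeholders,
# not the paper's measured results.
rng = np.random.default_rng(0)
abs_rel_ours = rng.uniform(0.02, 0.05, size=21)
abs_rel_ref = abs_rel_ours + rng.uniform(0.0, 0.02, size=21)
t_stat, p_value = ttest_rel(abs_rel_ours, abs_rel_ref)
print(f"p = {p_value:.3g}")
```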
Since the compressed version (W2A4) achieves almost the same performance as the floating-point version, we employ it for the further visual comparisons hereafter. As shown in Fig. 6, we chose the scene named Art as an example. Rapp and Goyal's algorithm achieves better recovery than the LM filter and Shin et al.'s approach because it uses pixel-wise signal refinement and spatial regularization to smooth objects' boundaries. However, their method is not robust in distinguishing object boundaries, whereas the neural networks in the last five columns (see Fig. 6) can identify the boundaries by learning from numerous training scenes. The proposed compressed model achieves the smallest RMSE among the compared methods.

Captured data
We also tested the proposed compressed model (W2A4) and existing networks on five real-world image datasets. These scenes were captured in low-light conditions (less than 1 photon per pixel) with low SBRs. In the first row of Fig. 7(c), the depth results reconstructed by the log-matched (LM) filter [37] and Shin et al.'s method [15] show the shadow region that did not receive active illumination. In contrast, the other four algorithms can in-paint the shadow and generate relatively detailed depth maps. Our model retrieves a more precise structure for the lamp scene. Although Rapp and Goyal adopted the super-pixel method to obtain robust depth estimations, some essential signals are lost, such as the circular part of the lamp.
Similarly, the second and third rows present a rolling ball and an elephant toy, highlighted in red boxes. The person's thumb and the elephant's tusk, with distinct boundaries, can be identified in our model's depth images. However, these two small objects are over-smoothed by the monocular-SPAD fusion owing to its monocular perception. In the fourth row, the depth images of a ball bouncing down stairs are compared. The intense sunlight illumination at the top

Discussion
We evaluated the proposed model with three depth levels, meaning the input tensor is down-sampled three times. The fewer levels we use, the shorter the processing time and the fewer the parameters. Figure 8 shows the network's depth versus processing speed and RMSE. L1 is a naive U-net architecture with one down-sampling module and no long-range connections. As our architecture becomes deeper, with both long- and short-range connections, the reconstruction error decreases significantly. The RMSEs from L4 down to L2 remain acceptable as the model's depth is reduced, since they are still better than those of existing studies; however, the processing time increases as more layers are added. Notably, due to the structural regularity of our model, the numbers of down- and up-sampling stages are easy to configure without modifying the connections in the architecture. The log-scale binning method introduced by Sun et al. can significantly reduce GPU memory consumption and processing time; the binning, however, leads to critical performance degradation. The configurability of our network can also be regarded as model pruning, since fewer nodes are involved in computation when the number of down-sampling stages is reduced. Therefore, our network is better suited to achieving a reasonable trade-off between processing time and accuracy. The processing time is measured by inferring one depth map from a ToA tensor of size 552×688×1024 under SBR = 0.04; the RMSE is the average over the same seven Middlebury scenes used in Table 2.

It should be noted that there is no acceleration for our quantized architecture on GPUs, because the GPU cannot exploit low-bit fixed-point operations owing to its unified intrinsic hardware and instruction set architecture. FPGAs, however, can manipulate quantized numbers and accelerate the analysis; the hardware implementation is regarded as future work. Additionally, our architecture contains dense connections and multiple 3D convolution modules that consume most of the processing time. ToA tensors are sparse, with many zero elements involved in the computation; thus, more aggressive compression approaches such as model pruning and sparse convolution strategies could further accelerate the forward propagation.
The proposed network was further investigated under different SBR conditions in Fig. 9. The algorithms using neural networks are more sensitive to background noise, especially when the SBR decreases from 0.02 to 0.01. We could enlarge the training dataset to cover a broader range of SBRs to alleviate this problem, although doing so costs more training time. The other three methods (the LM filter, Shin et al., and Rapp and Goyal) are more robust across all SBRs. The LM filter and Shin et al.'s approach yield the largest errors across all SBR levels. Rapp and Goyal's method and Lindell et al.'s fusion model achieve relatively lower errors, and the former is more robust across all SBR levels. Sun et al.'s network and ours obtain the lowest errors and are robust at relatively low SBRs, except at the 0.01 level. It is therefore feasible for the proposed neural network to achieve low prediction errors at standard SBR levels and remain robust without complementary training images from another camera. Optimization-based algorithms, however, can be more stable than neural-network-based methods over various SBR levels.
Despite its better reconstruction results, our method still has limitations. For scenes corrupted by enormous background noise, the time-bin index retrieved by the soft-argmax might not correspond to the correct ToF profile. In contrast, Rapp and Goyal [17] separated the signal from the noise: a pixel-wise noise-censoring algorithm and a windowing approach were introduced to remove noise and obtain the ToF information accurately. The reconstructed bouncing ball in Fig. 7 shows that Rapp and Goyal's algorithm can produce relatively high fidelity in a noisy scenario.

Conclusion
We developed a non-fusion network that does not need an additional camera for depth estimation, making it much more practical for real-world applications. The proposed model was quantized to achieve a smaller model size without degrading performance. Due to the model's flexible structure, the trade-off between accuracy and processing speed can be balanced by adjusting the network's depth. The network achieves excellent reconstruction performance in low-light and poor-SBR conditions, and it can be extended to biological applications where samples are generally dim.