Robust super-resolution depth imaging via a multi-feature fusion deep network

Three-dimensional imaging plays an important role in imaging applications where it is necessary to record depth. The number of applications that use depth imaging is increasing rapidly, and examples include self-driving autonomous vehicles and auto-focus assist on smartphone cameras. Light detection and ranging (LIDAR) via single-photon avalanche diode (SPAD) arrays is an emerging technology that enables the acquisition of depth images at high frame rates. However, the spatial resolution of this technology is typically low in comparison to the intensity images recorded by conventional cameras. To increase the native resolution of depth images from a SPAD camera, we develop a deep network built specifically to take advantage of the multiple features that can be extracted from a camera's histogram data. The network is designed for a SPAD camera operating in a dual mode, capturing alternate low-resolution depth and high-resolution intensity images at high frame rates; thus the system does not require any additional sensor to provide intensity images. The network then uses the intensity images and multiple features extracted from down-sampled histograms to guide the up-sampling of the depth. Our network provides significant image resolution enhancement and image denoising across a wide range of signal-to-noise ratios and photon levels. We apply the network to a range of 3D data, demonstrating denoising and a four-fold resolution enhancement of depth.


Introduction.−
The importance of three-dimensional imaging in industrial and consumer applications is increasing. Light detection and ranging (lidar), where a pulse of light is used to illuminate a target and a detector provides time-of-flight information, is one of the leading technologies for depth imaging. For example, lidar is one of the key systems for future connected and autonomous vehicles, and it is used in the latest smartphones and tablets to aid auto-focus and enhance virtual reality.
Single-photon avalanche diode (SPAD) arrays are an emerging technology for depth estimation via lidar. These devices are sensitive to single photons and can provide histograms of the arrival times of single photons with respect to a trigger. When used in combination with a pulsed laser that illuminates a target object, SPAD arrays provide accurate and fast data that can be converted to depth information.
In the context of lidar, several different SPAD array sensors have been developed; see refs [1,2] for recent examples. They have been used to measure depth in a range of scenarios, including under water [3,4], at long ranges [5-7], at high speeds [8,9], and providing high-resolution depth information [2,10]. Recently, Morimoto et al. reported a mega-pixel SPAD array [11]. SPAD array sensors have also been used for light-in-flight imaging [12] and for looking at objects hidden around corners [13]. They have also been used extensively within the field of biophotonics; see ref [14] for a review.

FIG. 1: Overview of the proposed network. HistNet is designed for a SPAD camera operating in a dual mode. The camera provides a histogram of photon counts and an intensity image. The network takes these as input and provides a HR depth map as output, with a resolution four times higher in both spatial dimensions than the initial raw histogram.
Although SPAD arrays are becoming well established in lidar systems, there are several key challenges to overcome to fully exploit their potential. The single-photon sensitivity that the SPAD array provides promises depth imaging at long ranges and in degraded visual environments, but improving the performance in these scenarios can dramatically increase the range of use of the detectors. In addition, the native resolution is typically very low in comparison to conventional image sensors. Ultimately, it is desirable to operate the SPAD arrays at high frame rates, cover a large field-of-view at large distances, produce images at high resolutions, and perform well in a wide range of environmental conditions. Each of these objectives brings separate challenges that need to be addressed.
Due to the nature of these challenges, computational post-processing techniques are known to be a very powerful way to improve the overall image quality of single-photon depth imaging, both in terms of signal-to-noise ratio and resolution. These methods [15-20] take advantage of prior information in one form or another and attempt to improve the quality of the depth images in the low-photon regime. Each method has advantages and disadvantages, often trading reconstruction quality against the time taken to reconstruct.
Several statistical approaches have been implemented to improve the depth maps of single-photon depth data: Tachella et al. reconstructed their images making use of priors on the depth and intensity estimates, achieving fast reconstruction and very good performance in the single-photon regime [15]; Rapp and Goyal tackled the problem by creating super-pixels that borrow information from relevant neighbours to separate the contributions of signal and background [16]; and Halimi et al. implemented an alternating direction method of multipliers (ADMM) to minimize a cost function with priors on correlations between pixels, using both depth and intensity estimates [17]. We refer the reader to [21,22] for more details on state-of-the-art robust reconstruction algorithms.
Machine learning approaches have also shown good performance when enhancing the quality of high-resolution depth data. For example, Guo et al developed a deep neural network to reconstruct a high-resolution depth map from its low resolution version [18]. In addition, Lindell et al [20] and Sun et al [19] developed deep networks that process the whole 3D volume of raw photon counts and output a 2D depth map, achieving high performance in the low photon regime at the cost of a long processing time. All references [18][19][20] make use of an intensity image to guide the reconstruction of the depth. Peng et al. [23] implemented a network capable of reconstructing depth from SPADs at very long ranges by exploiting the non local correlations in time and space via the use of non local blocks and down-sampling operators.
This paper proposes and implements a machine learning network to simultaneously perform up-sampling and denoising of depth information from a SPAD array sensor. Our approach is to pre-select the essential information of the SPAD data to input to the network, rather than inputting the whole 3D histogram of counts, to achieve good and fast reconstruction. The proposed algorithm is designed specifically for the measurements provided by the latest SPAD technology [8] (i.e., the Quantic 4x4 sensor). This system alternates between two modes at over 1000 frames per second: an intensity mode providing a high-resolution intensity image of 256x128 pixels, and a depth mode providing a low-resolution 64x32x16 histogram of counts containing depth information. After processing, the final resolution of our up-sampled depth images is increased by a factor of four to 256x128 pixels. Figure 1 shows an overview of the results of the proposed network when applied to captured data. This paper is organized as follows: in Section I, we provide a brief overview of the SPAD array sensor and the model of photon detection, and we present the processing applied to the SPAD data to extract useful information prior to reconstruction via the network. Section II introduces the proposed HistNet in detail. Section III reports the results on both simulated and real data along with a comparison to other algorithms, and demonstrates its robustness to different noise scenarios.

A. Data Acquisition
Single-photon avalanche diode arrays can capture depth and intensity information of a scene. To achieve this, a short laser pulse is used to illuminate a target, and the detector records the arrival times of photons reflected back by the scene with respect to a laser trigger. This data, known as time-tagged data, can be used to generate a temporal histogram of counts as a function of time of flight, where the peak in the histogram can be used to calculate the distance to the target.
In this work, we develop a network suitable for a SPAD array sensor, the Quantic 4x4 sensor, that generates the histograms of counts on-chip and operates in a hybrid acquisition mode [8,24]. This hybrid mode alternates between two measurement modes at a temporal rate exceeding 1000 frames per second: a high-resolution (HR) intensity measurement with a spatial resolution of 256x128, and a low-resolution (LR) histogram of photon counts containing depth information at a resolution of 64x32x16 (16 being the number of time bins of each of the 64x32 histograms). It is the purpose of the network to increase the resolution of the initial depth data (64x32) to the same resolution as the intensity data (256x128), while simultaneously denoising the data.

FIG. 2: Representation of the processing of the different arguments of HistNet. The SPAD array provides alternating LR histograms of size 64x32x16 and HR intensity images of size 256x128. The 256x128 first and second depth maps are obtained by computing the centre of mass around the strongest and second strongest peaks in the raw histogram and then up-sampling by four in both spatial dimensions using nearest-neighbour interpolation. Depth feature D1 of size 128x64 is obtained by down-sampling the first depth map by two in both dimensions. D2 of size 64x32 is obtained by applying the centre of mass around the strongest peak in the 64x32x16 raw histogram. D3 and D4 are obtained by down-sampling the raw 64x32 histogram by two and four respectively, and computing centres of mass around the strongest peak of the down-sampled versions.

B. Pre-processing of data for the network
The SPAD camera provides LR histogram data and HR intensity images in alternate frames. We will see in the following sections how we select features from the SPAD array data that maximise the quality of information provided to the network while minimising the total quantity of data necessary for accurate super-resolution. There are several different features that we extract from the data provided by the SPAD array: the first depth map, the second depth map, the high-resolution intensity image, and the multi-scale depth features extracted from down-sampled versions of the original histogram. The processing time to calculate each of these features is minimal, adding very little computational overhead to our overall procedure. Figure 2 shows SPAD array data and the different processing steps to compute the arguments of HistNet.

First depth map
The first depth map is calculated directly from the 64x32x16 3D histogram data. The photon counts can be assumed to be drawn from a Poisson distribution P(.), as commonly assumed in [16,17]. Assuming a known background level and a Gaussian system impulse response, the maximum likelihood estimator of the depth is the centre of mass of the received signal photon times of flight (assuming depths are far from the edges of the observation window). This estimator is approximated for each depth pixel (i, j) as

\hat{d}_{i,j} = \frac{\sum_{t=d_{max}-1}^{d_{max}+1} t\,(h_{i,j,t} - b_{i,j})}{\sum_{t=d_{max}-1}^{d_{max}+1} (h_{i,j,t} - b_{i,j})},    (1)

with h_{i,j,t} the photon counts acquired in pixel (i, j) for time bin t ∈ [1, T], b_{i,j} the background level of pixel (i, j), estimated as the median of the histogram of that pixel, and d_{max} the location of the signal peak, estimated as the location of the maximum of the histogram of counts in photon-dense regimes, or using a matched filter in sparse-photon regimes. We integrate over three time bins, between d_{max} − 1 and d_{max} + 1, as this width corresponds approximately to the impulse response of our system. Before this data is input to the network, we divide the depth by the total number of time bins to normalize it between 0 and 1. It is then up-scaled to the desired resolution (four times larger in both spatial dimensions) using nearest-neighbour interpolation, which preserves the separate surfaces in the scene. This is preferable to interpolation strategies that connect separate surfaces with new depths that did not exist in the original data.
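As an illustration, the centre-of-mass estimate and the nearest-neighbour up-scaling can be sketched as follows. This is a minimal numpy sketch under our own naming (`first_depth_map`, `upsample_nearest` are illustrative, not the authors' code):

```python
import numpy as np

def first_depth_map(hist, half_width=1):
    """Approximate ML depth per pixel: centre of mass over three bins
    around the strongest histogram peak. hist has shape [H, W, T]."""
    H, W, T = hist.shape
    b = np.median(hist, axis=2, keepdims=True)   # per-pixel background level
    d_max = np.argmax(hist, axis=2)              # strongest-peak bin
    depth = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            lo = max(d_max[i, j] - half_width, 0)
            hi = min(d_max[i, j] + half_width, T - 1)
            t = np.arange(lo, hi + 1)
            s = np.maximum(hist[i, j, lo:hi + 1] - b[i, j, 0], 0)  # signal counts
            depth[i, j] = (t * s).sum() / s.sum() if s.sum() > 0 else d_max[i, j]
    return depth / T                             # normalise to [0, 1]

def upsample_nearest(img, factor=4):
    """Nearest-neighbour up-sampling, which preserves surface boundaries."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)
```

Nearest-neighbour repetition (rather than bicubic or bilinear weighting) is what avoids inventing intermediate depths between separate surfaces.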

Accounting for multi-scale depth information
Multiple resolution scales have been shown to help depth estimation, in particular in high-noise scenarios [8,16,18]. This information is provided to the network using four depth features D1, D2, D3 and D4 at different resolution scales. The dimensions of each feature for our real data (64x32x16 histogram) are 128x64, 64x32, 32x16 and 16x8, respectively. D1 is obtained by down-sampling the previously obtained 256x128 first depth map by two in both dimensions using nearest-neighbour interpolation. D2 is obtained by computing the centre of mass on the 64x32x16 LR histogram. D3 and D4 are obtained by down-sampling this histogram by factors of two and four, summing the neighbouring pixels in the spatial dimensions. Because this down-sampling is performed at the level of the histogram, the resultant D3 and D4 have a higher signal-to-noise ratio than the first depth map, albeit at a lower resolution. This helps the network identify features in images with high levels of noise. All the features are normalized by dividing by the total number of time bins of the system (i.e., the width of the range observation window).
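The histogram-level down-sampling that produces D3 and D4 can be sketched as follows (numpy, illustrative naming; `peak_depth` is a simplified stand-in for the centre-of-mass step):

```python
import numpy as np

def downsample_hist(hist, factor):
    """Down-sample a [H, W, T] histogram spatially by summing neighbouring
    pixels; counts accumulate, so each super-pixel has a higher SNR."""
    H, W, T = hist.shape
    return hist.reshape(H // factor, factor, W // factor, factor, T).sum(axis=(1, 3))

def peak_depth(hist):
    """Illustrative depth proxy: strongest-peak bin, normalised by T."""
    return hist.argmax(axis=2) / hist.shape[2]

# For a 64x32x16 histogram, the feature sizes follow the paper:
# D2 = peak_depth(hist)                      -> 64x32
# D3 = peak_depth(downsample_hist(hist, 2))  -> 32x16
# D4 = peak_depth(downsample_hist(hist, 4))  -> 16x8
```

The key design point is that summation happens before depth estimation, so no photon counts are discarded at the coarser scales.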

HR Intensity
The intensity map of a scene has been used to guide the reconstruction of depth in statistical methods [8,15-17] and in machine learning methods [18-20]. In our case, we obtain the intensity image directly from the Quantic 4x4 SPAD detector. This intensity image has a spatial resolution of 256x128, which is four times larger in both dimensions than the 64x32 spatial resolution of the histogram. The intensity and histogram data are acquired in alternate frames, so it is possible for objects to move from frame to frame. However, our system has a high temporal frame rate, so we assume perfect alignment between the histogram and intensity data.

Second depth map
When the Quantic 4x4 sensor operates in the depth mode, each super-pixel in the histogram gathers the photon counts of 4x4 pixels. Therefore, a histogram may present multiple peaks when observing multiple surfaces located at different depths, with each peak corresponding to one of the depths involved. While the first depth map is calculated by identifying the strongest peak, we compute a second depth map based on the second strongest peak. More precisely, the depth position in the second depth map is calculated by applying the centre of mass of Equation 1 around the index of the second strongest peak.
We set the following criterion on the minimum number of photon counts a relevant peak should contain. For each pixel (i, j), we consider a peak at bin t to be relevant if

h_{i,j,t} - b_{i,j} > level × \sqrt{b_{i,j}},

with h_{i,j,t} the number of photon counts of pixel (i, j) at time bin t, b_{i,j} the background level at pixel (i, j), estimated by taking the median value of the histogram, \sqrt{b_{i,j}} the standard deviation of the Poisson-distributed background counts, and level a variable adjusted empirically so that the values in the second depth map do not come from the noise but mostly represent real signal. For our captured data, we set level to 12. If the criterion is not met, the second depth is set to zero. We note that in high-noise scenarios, discriminating peaks corresponding to real depths from peaks corresponding to background photons is difficult. In extremely noisy scenarios, no second depth can be extracted and the second depth map is set to zero. As for the first depth map, the second depth map is up-scaled by four in both dimensions with nearest-neighbour interpolation. In scenarios where a second depth map is extracted, the resulting improvements in network performance are highlighted in Section III A 6.
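A sketch of the second-peak selection with the significance test described above (numpy, illustrative names; the first peak is masked over three bins before searching for the second):

```python
import numpy as np

def second_depth(hist, level=12.0):
    """Second-strongest-peak depth map with a significance test: a peak is
    kept only if its counts exceed the background by `level` times the
    Poisson standard deviation sqrt(b). Otherwise the pixel is set to 0."""
    H, W, T = hist.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            h = hist[i, j].astype(float)
            b = np.median(h)                         # background estimate
            first = int(h.argmax())
            h2 = h.copy()
            h2[max(first - 1, 0):first + 2] = 0      # mask the first peak
            second = int(h2.argmax())
            if h2[second] - b > level * np.sqrt(max(b, 1e-12)):
                out[i, j] = second / T               # normalised second depth
    return out
```

With `level = 12` (the value used for the captured data), an isolated fluctuation of the background is very unlikely to pass the test, so surviving second depths mostly represent real surfaces.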

II. PROPOSED METHOD
The proposed HistNet network increases the spatial resolution of the input depth map by four in both dimensions (e.g., from 64x32 to 256x128 for our real data). It is also robust to extremely noisy conditions, as highlighted in the following sections. The network structure is independent of the input dimensions and can therefore reconstruct data of any size. However, for clarity, we use the dimensions of our real data to present the network structure.

A. Network architecture
In the context of guided depth super-resolution, Guo et al. [18] developed the U-Net-based [25] DepthSR-Net algorithm, which offers state-of-the-art performance. The U-Net architecture is particularly useful as it requires very little training data and is based on CNNs, which have shown good results for image denoising and super-resolution tasks [19,20].
In this paper, we develop a new network that accounts for the multi-scale information available from the SPAD array histogram data, making it more robust to sparse-photon regimes and/or high-background scenarios, in addition to exploiting the fine details of the intensity guide. The network makes use of the different features extracted from the raw histogram (the first and second depth maps and the multi-scale depths) and the intensity image. Our network performs simultaneous up-sampling and denoising of depth data for a wide range of scenarios. The number and size of the filters in each layer are reported in Figure 3, and Figure 4 shows a schematic representation of the network.

Residual U-Net architecture
The goal of the network is to take the data from the low-resolution histogram and intensity image and produce a residual map R that can be added to an up-scaled version of the low-resolution depth map [18,26]. The sum of the residual map and the up-scaled low-resolution depth map is the final high-resolution depth map. The goal of the training is to find the parameters of the filters, i.e. the weights and biases, that minimise the l1-norm between this predicted depth map and the ground truth.
The network consists of an encoder of five layers, denoted L0 to L4, connected to a five-layer decoder (L5 to L9) with skip connections. The network includes a branch that processes the multi-scale depth features (see II A 2) and a branch that processes the intensity image (see II A 3). The main input of the encoder is the concatenation of the first and second depth maps along a third dimension. In the case of the real data, this input is therefore of size 128x64x2. Note that each filter of the convolutional and deconvolutional layer has a height and width of 3.
In the encoder, the main input is passed to layer L0, which consists of two convolution operations of 64 filters each. Layers L1 to L4 consist of three steps: first, 2x2 max-pooling that down-samples the data by two in both spatial dimensions; second, integration of the information of the multi-scale depth features by concatenation with the layer of the depth guidance that has the same shape (see II A 2); and finally, two convolution operations with a number of filters of 128, 256, 512 and 1024 for L1, L2, L3 and L4, respectively.
In the decoder, layers L5 to L8 consist of four steps: first, deconvolution operations filter and up-sample the data by two in both spatial dimensions; second, skip connections between encoder and decoder are computed by concatenating the decoder layer with the encoder layer of the same shape; third, guidance with the intensity is provided by concatenation with the corresponding layer of the intensity guidance branch (see II A 3); and finally, two convolution operations are performed, with 512, 256, 128 and 64 filters for L5 to L8, respectively. L9 is a convolutional layer with one filter that provides the output. The predicted HR depth map is obtained by adding the output of the network R, known as the residual map, to the LR depth input, i.e. the first depth map.
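The layer sizes quoted in this section can be traced with a small bookkeeping sketch (plain Python; this records shapes only and is not a runnable network — the filter counts follow the text, the function name is ours):

```python
def histnet_shapes(h=128, w=64):
    """Trace the spatial size and channel count of each HistNet layer for
    the real-data case: the encoder input is the stacked first and second
    depth maps, i.e. (h, w, 2) with h=128, w=64."""
    enc_filters = [64, 128, 256, 512, 1024]   # L0..L4
    dec_filters = [512, 256, 128, 64]         # L5..L8
    shapes = {"L0": (h, w, enc_filters[0])}
    for k, f in enumerate(enc_filters[1:], start=1):
        h, w = h // 2, w // 2                 # 2x2 max-pooling
        # the multi-scale depth feature of matching (h, w) concatenates here
        shapes[f"L{k}"] = (h, w, f)
    for k, f in enumerate(dec_filters, start=5):
        h, w = h * 2, w * 2                   # deconvolution up-samples by 2
        # skip connection + intensity-guidance feature concatenate here
        shapes[f"L{k}"] = (h, w, f)
    shapes["L9"] = (h, w, 1)                  # residual map R
    return shapes
```

Running this confirms where the guidance branches plug in: the encoder layers after pooling have sizes 64x32, 32x16, 16x8 and 8x4, matching the multi-scale depth features, and the decoder layers match the intensity-guidance outputs.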

Guidance with the noise-tolerant multi-scale depth features
The multi-scale depth features are of size 64x32, 32x16, 16x8 and 8x4. Each feature passes through a convolutional layer of 64, 128, 256 and 512 filters of size 3x3 respectively, creating cubes of size 64x32x64, 32x16x128, 16x8x256 and 8x4x512. These cubes are integrated in the encoder by concatenation along the filter dimension with the layer of corresponding size (see II A 1).

Guidance with the intensity map
We use intensity guidance in the same manner as Guo et al. [18]. The guidance branch consists of convolutional operations followed by 2x2 max-pooling. The number of filters of the convolution operation of each layer is 64, 128, 256 and 512. For our real data, the outputs of the convolutional operations of each layer in the guidance branch are of size 128x64x64, 64x32x128, 32x16x256 and 16x8x512. These outputs are integrated along the decoder part of the network by concatenation along the filter dimension with the layer of corresponding size (see II A 1).

FIG. 3: The "output shapes" of the layers specified in the fourth column correspond to the case of processing our real data (i.e. histograms of spatial resolution 64x32 and 128x64 intensity images). The filters of each layer are described with four parameters: N stands for the number of filters, w the width, h the height and c the number of inner channels. cv stands for convolutional layer, mp for max-pooling layer, cat() for concatenation with the layers specified in the brackets, and dcv for deconvolutional layer.

B. Loss
We minimize the l1 loss to help reconstruct separate surfaces in depth; the l1-norm is known to promote sparsity [27,28]. During training, batch-mode learning with a batch size of M = 64 was used, and the loss was defined as

L(θ) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left| R_{m,n}(θ) + d_{m,n} - d^{ref}_{m,n} \right|,

with M the number of images within one batch, N the number of pixels of each image, θ the trainable parameters of the network, R the residual map predicted by HistNet, d the first depth map, and d_ref the ground-truth depth.
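The loss can be sketched in numpy as follows (illustrative only; the actual training operates on TensorFlow tensors):

```python
import numpy as np

def l1_loss(residual, d_first, d_ref):
    """Batch l1 loss: the predicted depth is the residual map plus the
    (up-scaled) first depth map. All arrays have shape [M, H, W]."""
    M = residual.shape[0]
    N = residual[0].size
    return np.abs(residual + d_first - d_ref).sum() / (M * N)
```

A residual that exactly bridges the gap between the first depth map and the ground truth gives zero loss, which is what the residual-learning formulation targets.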

C. Simulated datasets for training, validation and testing
We simulate realistic SPAD array measurements (LR histogram and HR intensity) from 23 scenes of the MPI Sintel Depth Dataset [29,30] for the training and validation datasets, and from six scenes of the Middlebury dataset [31,32] for the test dataset. From the HR depth and intensity images provided by both datasets, we create histograms of counts of 16 time bins as follows. The photon counts of the histogram are assumed to be drawn from a Poisson distribution, and the impulse response of our SPAD camera is approximated by a Gaussian function G(m, σ) with mean m and standard deviation σ = 0.5714 histogram bins [8,24]. For each pixel (i, j), the photon counts h_{i,j,t} of the acquired histogram at time bin t can be expressed as a function of the intensity r_{i,j} and the depth d_{i,j} as

h_{i,j,t} ∼ P\left( r_{i,j}\, G(d_{i,j}, σ)(t) + b_{i,j} \right),

with b_{i,j} the background level, which is assumed constant for all time bins of a given pixel. The LR histograms are simulated by down-sampling the HR histogram, integrating over four-by-four pixels in the spatial dimensions. We consider two metrics to assess the level of noise: the signal-to-background ratio (SBR) and the number of photon counts per pixel (ppp). We define the pixel SBR as

SBR_{i,j} = \frac{ppp_{i,j}}{T\, b_{i,j}}.
The image SBR is then the average value over all pixels, i.e. SBR = (1/N) Σ_{i,j} SBR_{i,j}. In a similar fashion, we define ppp_{i,j}, the number of photon counts reflected from the target in pixel (i, j), as

ppp_{i,j} = \sum_{t=1}^{T} r_{i,j}\, G(d_{i,j}, σ)(t).

The average ppp for the image is then the average value over all pixels, i.e. ppp = (1/N) Σ_{i,j} ppp_{i,j}. It should be noted that the 436x1024 images of the MPI Sintel Depth Dataset are pre-processed to remove all NaN and Inf values using median filtering. We use 21 images for training and reserve two for validation. We increase the number of training images by a factor of eight by using all possible combinations of 90° image rotations and flips. Furthermore, the images are split into overlapping patches of size 96x96 with a stride of 48.
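The measurement simulation, the noise metrics and the patch extraction described above can be sketched as follows (numpy; the background level b and all function names are our own illustrative choices, not the authors' code):

```python
import numpy as np

def simulate_histogram(depth, intensity, T=16, sigma=0.5714, b=0.5, seed=0):
    """Simulate a [H, W, T] SPAD histogram: Poisson counts of a Gaussian
    pulse (std sigma bins) centred on the depth (in bin units), on top of
    a constant per-bin background b. b = 0.5 is an illustrative value."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)[None, None, :]
    g = np.exp(-0.5 * ((t - depth[..., None]) / sigma) ** 2)
    g /= g.sum(axis=2, keepdims=True)          # normalised pulse shape
    lam = intensity[..., None] * g + b         # Poisson rate per bin
    return rng.poisson(lam)

def image_sbr_ppp(intensity, b, T=16):
    """Image-level SBR and ppp from the per-pixel definitions above."""
    return (intensity / (T * b)).mean(), intensity.mean()

def patches(img, size=96, stride=48):
    """Overlapping training patches of size 96x96 with a stride of 48."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, stride)
            for j in range(0, W - size + 1, stride)]
```

With the Gaussian pulse normalised to unit area, the per-pixel signal count is simply the intensity value, matching the ppp definition above.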

D. Implementation details
We implemented HistNet within the TensorFlow framework and use the ProximalAdagradOptimizer [33], as this enables the minimization of the l1 loss function. The learning rate was set to 0.1. The training was performed on a NVIDIA RTX 6000 GPU. We trained for 2000 epochs, which took about 10 hours.

Noise scenarios
Two different noise scenarios were considered: a scenario mimicking the lighting conditions of [8], with ppp = 1200 counts and SBR = 2, denoted the "realistic scenario"; and a scenario corresponding to a much lower photon count and lower signal-to-noise ratio, with ppp = 4 counts and SBR = 0.02, denoted the "extreme scenario". We trained a separate network for each of the two noise scenarios.

Evaluation metrics
We quantify the performance of the network on simulated data by comparing the reconstruction with the ground-truth depth using the root mean squared error,

RMSE = \sqrt{\frac{1}{N} \sum_{i,j} \left( \hat{d}_{i,j} - d^{ref}_{i,j} \right)^2 }.
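For reference, the RMSE and the absolute depth error (ADE, reported alongside it in the tables) can be written as:

```python
import numpy as np

def rmse(d_hat, d_ref):
    """Root mean squared error between reconstruction and ground truth."""
    return np.sqrt(np.mean((d_hat - d_ref) ** 2))

def ade(d_hat, d_ref):
    """Absolute depth error: mean absolute error over all pixels."""
    return np.mean(np.abs(d_hat - d_ref))
```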

Comparison algorithms
We compare the results of HistNet with the following methods:
• Nearest-neighbour interpolation: Depth is estimated with the centre of mass on the 276x344x16 histogram of counts and up-sampled with nearest-neighbour interpolation to a 1104x1376 image. Note that this is the method used to produce the first depth argument of HistNet. We choose nearest-neighbour interpolation to avoid joining spatially separated surfaces.
• Guided Image Filtering of He et al. 2013 [34]: We further process the depth estimated by nearest-neighbour interpolation by applying the guided filtering algorithm with the HR intensity image as a guide.
• DepthSR-Net of Guo et al. 2019 [18]: We retrained this network using the same training datasets as for our network. This network outputs a 4x up-sampled depth map from a LR depth map, using a HR intensity map to guide the reconstruction. In [18], the LR depth map is first up-sampled to the desired size with bicubic interpolation. However, we want to reconstruct surfaces that are well separated from one another; therefore, nearest-neighbour interpolation is used instead to up-sample the input LR depth map.
• Algorithm of Gyongy et al. 2020 [8]: This algorithm is designed to process the Quantic 4x4 SPAD array data. It consists of various steps of guided filtering and up-sampling with low computational cost. One part of the algorithm is designed to compensate the inherent misalignment between the depth and the intensity information that the SPAD provides. Since our synthetic data consists of perfectly aligned intensity and depth, we do not use this part of the algorithm.

Results for the realistic scenario (ppp = 1200 counts and SBR = 2)

Figure 5 shows the reconstructions for the different methods. HistNet produces sharp and clean boundaries, whereas Guided Filtering and DepthSR-Net blur details around the edges and nearest-neighbour interpolation leads to pixelated images. We report the root mean squared error and the absolute error in Table I for the two Middlebury scenes reconstructed with the different methods. This table indicates that HistNet outperforms the other methods on both metrics. The processing times of the different methods are also reported in Table I. Guided Filtering and nearest-neighbour interpolation have a very low computational cost and process the image the fastest, in a few milliseconds for a 1104x1376 input. The algorithm of [8] reconstructs the image in about 4 seconds. The reconstructions of HistNet and DepthSR-Net were performed on a NVIDIA RTX 6000 GPU; each 1104x1376 Middlebury scene took about 7 seconds to reconstruct with either network.

Results for the extreme scenario (ppp = 4 counts and SBR = 0.02)

Figure 6 shows the results for measurements simulated with an average of 4 signal photons per pixel and a signal-to-background ratio of 0.02. Visually, our method performs best in this high-noise scenario. A quantitative comparison can be found in Table I: for both scenes, HistNet performs best in terms of the RMSE and ADE.

Influence of the second depth map on the performance
We investigate the benefit of incorporating the second depth map among the inputs of HistNet. For this purpose, we developed a version that does not use the second depth map. In the proposed HistNet, the input of the encoder is the concatenation of the first and second depth maps, and the first layer of the encoder consists of a 3D convolution filter of width 2. For the version without the second depth, the encoder takes as input the first depth map only, and its first layer is a 2D convolutional layer. We trained this version of HistNet with data with ppp of 1200 counts and SBR of 2. Quantitative results are displayed in Table II for different simulated measurements from the Middlebury dataset. The version of HistNet using the second depth map information has the best performance in terms of RMSE and ADE for all images.

Robustness to noise
We study how well a network trained on data with specific SBR and ppp levels can reconstruct data with different noise levels. HistNet trained according to the realistic and extreme scenarios was tested on data with ppp levels ranging from 1×10^-1 to 7×10^5 and SBR ranging from 1×10^-5 to 70. Figure 7 shows the RMSE between the ground truth and the HistNet reconstruction as a function of the SBR and ppp of the testing data. HistNet shows good performance across a variety of SBR and ppp levels. HistNet trained with the extreme noise scenario is able to reconstruct data with higher noise than when trained with the realistic scenario. However, the best performance is always achieved when the ppp and SBR of the testing data approximately match those of the training dataset.

B. Results on Real Dataset
We test the performance of HistNet on real measurements captured by the Quantic 4x4 camera [8]. The spatial resolution of the histogram data is 64x32 and the number of time bins is 16. The resolution of the intensity image is 256x128. The data is first interpolated to the size of the intensity image with nearest-neighbour interpolation and is calibrated using a compensation frame. We estimated a number of photon counts per pixel of 1200 and a signal-to-background ratio of 2 in this data; therefore, we use HistNet trained with this scenario to reconstruct the depth maps. Figure 8 displays the reconstruction via our proposed HistNet and the other reconstruction algorithms. HistNet leads to more accurate images with sharper edges.

We now compare HistNet to the CNN network FusionDenoise published in [20]. This network performs depth estimation and is designed for single-photon detector arrays. More precisely, the FusionDenoise network reconstructs a denoised depth map from histograms of photon counts with the guidance of an intensity image acquired by another sensor.
Both HistNet and FusionDenoise aim to reconstruct a depth image for single-photon detector arrays, and both use a histogram of photon counts and an intensity image. However, the methods differ in several ways. First, FusionDenoise denoises the depth data while using the intensity image as a guide, whereas the proposed algorithm uses a HR intensity image to perform both denoising and depth super-resolution. Second, FusionDenoise acts directly on the entire 3D cube of data, which is time- and memory-consuming, whereas HistNet acts on the pre-processed depth image with the help of multi-scale depth features computed from the histogram.
In order to perform a fair comparison, we use the simulated measurements that [20] provides: a simulated histogram of photon counts of size 544x688x1024 created from the Art scene of the Middlebury dataset [31,32]. Since HistNet performs depth up-sampling by four, we down-sample this histogram by four to the size 136x172x1024, which forms the new simulated measurement for HistNet. From this histogram, normalized depth maps and normalized multi-scale depth features are computed as inputs to HistNet.
The work of [20] presents the challenging scenario of very sparse histograms, in which the signal from objects is contained within a range of only 60 time bins, while the remaining 924 bins contain only background photons. Computing the depth maps using the centre of mass of the entire histogram creates an input with very low contrast that HistNet is unable to reconstruct properly.
To overcome this issue, we restrict the range of time bins to include only the return signal from objects. This range is determined by inspecting the down-sampled depth maps and evaluating the appropriate number of time bins to use. We trained HistNet with the noise scenario appropriate to the simulated measurements; we estimate the data to have ppp = 2 and SBR = 40. To summarize, the simulated histograms provided by Lindell et al. [20] are down-sampled using nearest-neighbour interpolation in both spatial dimensions and are cropped in time to bound the outliers. The size of the histogram used to test HistNet is 136x172x57.
The results are displayed in Figure 9 along with the associated RMSE values. HistNet's performance on the 4x down-sampled histogram is better in terms of RMSE. However, HistNet was tested on histograms that were cropped to bound the high values of the outliers, whereas the algorithm of [20] processed the entire histogram data. In terms of processing time, the algorithm of [20] took about 6-7 minutes to process the 608x736x1024 histogram, while HistNet required about 7 seconds to reconstruct the up-sampled depth map, both using the same GPU.

FIG. 7: RMSE between network predictions and ground truth for different SBR and ppp. The noise level used in the training dataset is represented with a red marker. In (a), the network was trained on data with an average signal count (ppp) of 4 per pixel and a signal-to-background ratio (SBR) of 0.02. In (b), the network was trained on data with an average signal count of 1200 and an SBR of 2. Both networks were tested on simulated data with ppp levels ranging from 1×10^-1 to 7×10^5 and SBR ranging from 1×10^-5 to 70. The network performs best when the training and testing noise match. It is also robust to data with lower noise; however, the performance drops when the testing data presents higher noise than the training data.

Conclusions.−
In this work we presented a depth super-resolution and denoising algorithm called HistNet. This deep network is specific to the measurements provided by SPAD array sensors and provides a four-fold resolution improvement of the depth maps extracted from the raw histograms. Our method exploits the SPAD array data in a simple and efficient manner: multi-scale features, intensity and depth images, and multiple depths are provided to the network so that it can extract the most important information from the raw SPAD array data.
Our method performs well with respect to other state-of-the-art depth up-sampling algorithms in terms of reconstruction quality, especially at low signal-to-noise ratios and low photon levels. Moreover, the method is robust to a wide range of noise scenarios, so the noise statistics of the training dataset do not need to closely match those of the input data. The processing time of the network applied to the Quantic 4x4 SPAD data is around 2 seconds on a NVIDIA RTX 6000 GPU.
Future work will make use of the very high frame rate of the SPAD array sensor to achieve even better depth resolution. We also propose to tackle the misalignment between the histogram and the intensity image, which is inherent to the operating mode of our SPAD detector, as it acquires them alternately.