Real-Time Reconstruction of 3D Videos From Single-Photon LiDaR Data in the Presence of Obscurants

Single-photon methods are emerging as a key approach to 3D imaging. This paper introduces a two-step, statistically based approach for real-time image reconstruction that is applicable to transmission media with extreme light-scattering conditions. The first step is an optional target detection method that selects informative pixels, i.e., those containing photons reflected from the target, hence allowing data compression. The second is a reconstruction algorithm that exploits the data statistics and multiscale information to deliver clean depth and reflectivity images together with associated uncertainty maps. Both methods involve independent operations that are implemented in parallel on graphics processing units (GPUs), which enables real-time processing of moving scenes at more than 50 depth frames per second for an image of $128 \times 128$ pixels. Comparisons with state-of-the-art algorithms on simulated and real underwater data demonstrate the benefit of the proposed framework for target detection, and for fast and robust depth estimation at multiple frames per second.

I. INTRODUCTION

Single-photon LiDaR systems can provide reflectivity as well as depth information, allowing full 3D scene reconstruction.
Several imaging approaches have been proposed in recent years, showing different levels of data pre-processing, and we distinguish a few categories: first, approaches providing successive raw binary frames where a pixel value contains either a zero or the detected ToF value [4], [5]; second, those using a compact representation such as a timing histogram of photon counts [6], [7], [8] or other nonlinear transformations [9], [10]; and finally, commercial systems that usually pre-process the raw data to only provide successive estimated depth frames (e.g., Velodyne systems). This highlights the need for a unifying processing framework that deals with all of these data representations to deliver fast and robust depth estimates, as adopted in this work.

Although it has several advantages, single-photon LiDaR imaging faces some practical challenges. First, it may require long acquisition times to capture sufficient photon events for accurate parameter estimation, which has promoted the design of algorithms to process photon-sparse data [11], [12], [13], [14], [15] or optimized approaches for data acquisition [16], [17], [18], [19], [20]. In this paper, we propose a multiscale strategy for the robust processing of sparse data, and a target detection step that focuses on informative pixels to reduce the data volume ahead of further processing. Second, LiDaR data can be corrupted by background photon counts when imaging in scattering environments (e.g., turbid underwater conditions, or through atmospheric obscurants such as fog or smoke). For example, when using a histogram data representation, imaging through obscurants might lead to a non-uniform background level, such that the target signal is much lower than the background level in some parts of the timing distribution, thus jeopardizing the estimation of the target's reflectivity and depth [5], [21], [22]. To counter this issue, we generalize the target detection method to filter out a non-uniform background. Third, the deployment of LiDaR in real-world applications requires reliable and fast data processing that delivers multiple depth frames per second (fps) together with their uncertainty, as both serve higher-level processing such as scene understanding and target recognition, and assist with decision making. We address this challenge by adopting a statistical framework that allows uncertainty quantification of the estimates, while ensuring that iterative operations are independent, hence allowing a parallel implementation on a GPU for fast real-time processing.
Algorithmic solutions have been proposed to deal with the sparsity and high noise levels of LiDaR data. These solutions include model-based approaches relying on the data statistics and/or known data properties such as sparsity and spatial correlations [6], [23], [24], or learning-based approaches that learn important features from data with ground truth during a training phase, then use them to process new data during the testing phase [25], [26], [27], [28], [29]. These methods address the first two challenges; however, few consider real-time data processing, among which we distinguish the target detection [30], [31] and depth reconstruction [32] methods. Several methods focus on extremely noisy cases, as observed when imaging through obscurants, which lead to a non-uniform background shape [5], [21], [22]. Others provide reliable estimates by also estimating depth uncertainty as a confidence metric on the computed estimates [22], [29], [33].

This paper proposes a fast, statistically based method for the robust processing of LiDaR data in photon-sparse regimes or in the presence of obscurants. The method consists of two steps: an optional target detection algorithm with background estimation as a pre-processing step, and a denoising algorithm. The target detection method is used to localize informative pixels having target returns and to reject those that only contain background counts, hence reducing the data volume. This is achieved by generalizing the approach in [31] to account for a non-uniform background. In contrast to [22], [29], the proposed denoising algorithm only assumes the availability of initial noisy depth and reflectivity maps, which makes it compatible with different system representations, as it allows processing of raw ToF binary frames or of pre-estimated maps from commercial systems (see Fig. 1). The algorithm exploits the data statistics and multiscale information to deliver robust parameter estimates in addition to depth uncertainty maps. Both steps rely on iterative, independent operations suitable for parallel computing tools. A graphics processing unit (GPU) implementation was designed, enabling a significant improvement in processing speed, i.e., less than ∼20 milliseconds to deliver 128 × 128 pixel depth images together with uncertainty maps. Experiments on simulated and real underwater data validate the proposed methods, which provide similar or better results than existing statistical and deep learning state-of-the-art methods, but much faster.
The paper is organized as follows. Section II presents the observation models and the challenges associated with these representations. Section III introduces the optional pre-processing step that performs target detection and the initial estimation of the depth and reflectivity maps. Section IV introduces the approximate multiscale model used for the robust reconstruction of 3D scenes. Section V discusses the GPU implementation used to obtain an accelerated algorithm ensuring real-time data processing. Results on simulated and real data are reported and analysed in Sections VI and VII. Finally, conclusions and future work are presented in Section VIII.

II. PROBLEM FORMULATION
This section describes the observation model for single-photon data, and the approximations, such as the multiscale formulation, used to improve computational efficiency and robustness. The challenges associated with these formulations are then described to motivate the proposed approach.

A. Observation Models
The TCSPC system usually provides time-tagged photon events denoted by $z_{n,m}$, $m \in \{1, \dots, \bar{y}_n\}$, where $n \in \{1, \dots, N\}$ denotes the pixel location and $\bar{y}_n$ the number of detected photons for that pixel. These photons are usually accumulated into a timing histogram of counts with respect to their ToF, denoted by $y_{n,t}$, for the ToF bins $t \in \{1, \dots, T\}$. Due to its counting nature, this data is often modelled using a Poisson distribution $\mathcal{P}(\cdot)$ given by [6], [12], [15]

$$y_{n,t} \sim \mathcal{P}(s_{n,t}), \quad (1)$$

where the expression of $s_{n,t}$ in the presence of a target at depth $d_n$ with reflectivity $r_n$ is given by

$$s_{n,t} = r_n f_s(t - d_n) + b_{n,t}, \quad (2)$$

where $f_s$ represents the system impulse response (SIR), and $b_{n,t}$ represents the background rate (i.e., photons that do not originate from reflections at the target plane, such as the detector's dark counts or reflections from the scattering environment). Note that $b_{n,t}$ depends on $t$, as often observed when imaging through obscurants [21]. Assuming the independence of the observed counts of different pixels leads to the joint likelihood distribution

$$p(\boldsymbol{Y} \mid \boldsymbol{d}, \boldsymbol{r}, \boldsymbol{B}) = \prod_{n=1}^{N} \prod_{t=1}^{T} \mathcal{P}(y_{n,t}; s_{n,t}), \quad (3)$$

where $\boldsymbol{d}, \boldsymbol{r}, \boldsymbol{B}$ gather the depth, reflectivity and background values of all pixels, respectively.

An alternative formulation models the detected list of photons $z_{n,m}$ for the $n$th pixel and for $m \in \{1, \dots, \bar{y}_n\}$, using a mixture of densities [6]

$$f(z_{n,m} \mid w_n, d_n) = w_n f_s(z_{n,m} - d_n) + (1 - w_n) f_b(z_{n,m}), \quad (4)$$

where

$$w_n = \frac{r_n}{r_n + \sum_t b_{n,t}} \quad (5)$$

is the probability of a detected photon belonging to the target rather than the background, and $f_b$ indicates the temporal shape of the background, or equivalently the photon-event ToF distribution in the absence of target reflections (i.e., $r_n = w_n = 0$). The joint likelihood distribution obtained by assuming the independence of the observed ToFs is given by

$$p(\boldsymbol{Z} \mid \boldsymbol{w}, \boldsymbol{d}) = \prod_{n=1}^{N} \prod_{m=1}^{\bar{y}_n} f(z_{n,m} \mid w_n, d_n), \quad (6)$$

where $\boldsymbol{z}_n = (z_{n,1}, \dots, z_{n,\bar{y}_n})$ gathers all detections of the $n$th pixel. With both models, the goal is to reduce the data volume by detecting the pixels containing target reflections ($0 < w_n \leq 1$ or $r_n > 0$), and to estimate the target's depth and reflectivity parameters $(r_n, d_n)$. Several methods have been designed to estimate the parameters $\boldsymbol{d}, \boldsymbol{r}$ using the Poisson model in (3), often leading to a large computational cost. Recent approaches tend to approximate this likelihood using simpler models, and to recover the lost information by regularizing the estimated parameters [22], [34], leading to faster inference. This strategy is adopted here, as detailed in the rest of the paper.
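To make the observation model concrete, the following minimal C++ sketch (not part of the paper's implementation; all names are illustrative) draws one pixel's synthetic histogram from the Poisson model (1)-(2), assuming a Gaussian SIR and an exponentially decaying background.

```cpp
// Minimal sketch: simulate one pixel's timing histogram following
// y_{n,t} ~ Poisson(r_n * f_s(t - d_n) + b_{n,t})  (models (1)-(2)).
// A Gaussian SIR and an exponentially decaying background are assumed here.
#include <cmath>
#include <random>
#include <vector>

std::vector<int> simulateHistogram(int T, double depthBin, double reflectivity,
                                   double sigma, double bgLevel, double bgDecay,
                                   std::mt19937 &rng) {
    const double kPi = 3.14159265358979;
    const double norm = 1.0 / (sigma * std::sqrt(2.0 * kPi));
    std::vector<int> hist(T);
    for (int t = 0; t < T; ++t) {
        const double fs = norm * std::exp(-0.5 * std::pow((t - depthBin) / sigma, 2));
        const double bnt = bgLevel * std::exp(-t / bgDecay);  // non-uniform background
        std::poisson_distribution<int> pois(reflectivity * fs + bnt);
        hist[t] = pois(rng);  // Poisson draw for bin t
    }
    return hist;
}
```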

B. Practical Challenges: Robustness and Real-Time Processing
To improve the parameter estimates, the above model can be further extended to incorporate multi-scale information, as used in recent algorithms [6], [15], [22], [23], [35]. This is performed by spatially low-pass filtering the histogram of counts to obtain new histograms, from which the depth and reflectivity images present a lower level of noise at the cost of a reduced spatial resolution (see Fig. 1).
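For illustration, the scale-$\ell$ histogram of a pixel can be obtained by summing the histograms over a $q_\ell \times q_\ell$ spatial window, as in the following sketch (a naive spatial-domain version with illustrative names; the cost of filtering full cubes this way motivates the discussion below).

```cpp
// Sketch: scale-l histogram y^(l)_n obtained by summing histograms over a
// q_l x q_l uniform spatial window (e.g., q_2 = 3, q_3 = 5).
#include <vector>

std::vector<int> multiscaleHistogram(
    const std::vector<std::vector<int>> &hist,  // W*H histograms of length T
    int W, int H, int T, int px, int py, int q) {
    std::vector<int> out(T, 0);
    const int r = q / 2;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            const int x = px + dx, y = py + dy;
            if (x < 0 || x >= W || y < 0 || y >= H) continue;  // clip at borders
            const std::vector<int> &h = hist[y * W + x];
            for (int t = 0; t < T; ++t) out[t] += h[t];        // accumulate counts
        }
    return out;
}
```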
However, this multiscale strategy presents two practical limitations. First, a computational challenge, especially when aiming for real-time processing of large histogram cubes at millisecond timescales [22], [29]. Indeed, filtering in the Fourier domain while assuming circular borders leads to a high computational cost of $O(N \log(N) T)$ per cube (see Fig. 1). Second, the unavailability of the histogram data (which prevents the multiscale computations) when dealing with commercial systems (such as Velodyne or Kinect) that only output depth and reflectivity estimates. This motivates the design of a robust method operating only on depth and reflectivity images, to reduce the computational cost of the multiscale model and to ensure generalisability to commercial systems, as described in the following section.

III. PRE-PROCESSING: TARGET DETECTION WITH NON-UNIFORM BACKGROUND
This section assumes the availability of raw data (i.e., time-tagged photons), and introduces an efficient pre-processing step to extract depth and reflectivity estimates of the observed target. Note that this step can be skipped if depth and reflectivity images are the only available measurements, as with commercial systems. The proposed approach generalizes the Bayesian method in [31] to account for a non-uniform background with temporal shape $f_b(t)$, and is denoted the generalized event-based target detection (GETD) algorithm. Given the time-tagged measurements of the $n$th pixel, the method aims to estimate a binary target detection label $u_n$, the depth $d_n$ and the signal-to-background ratio (SBR) parameter $w_n$. To solve this inverse problem, we adopt a Bayesian approach which combines the likelihood in (6) with prior distributions on the parameters of interest, and extracts their estimates from the resulting posterior distribution. Akin to [31], we adopt the following prior distributions

$$f(w_n \mid u_n) = (1 - u_n)\,\delta(w_n) + u_n\, \mathrm{Beta}(w_n; \alpha, \beta), \quad f(u_n) = \pi^{u_n} (1 - \pi)^{1 - u_n}, \quad f(d_n) = \mathcal{U}_{[1, T]}(d_n), \quad (7)$$

where the prior of $w_n$ is a mixture of a Dirac delta distribution $\delta(\cdot)$ and the Beta distribution $\mathrm{Beta}(\cdot, \cdot)$ with known hyperparameters $(\alpha, \beta)$, $u_n$ is assigned a Bernoulli prior with probability of target presence $\pi = 0.5$ in what follows, and $d_n$ is uniform on the interval $[1, T]$. Combining (6) with (7) leads to the posterior distribution

$$f(w_n, d_n, u_n \mid \boldsymbol{z}_n) \propto f(\boldsymbol{z}_n \mid w_n, d_n)\, f(w_n \mid u_n)\, f(u_n)\, f(d_n), \quad (8)$$

where $\propto$ means "proportional to". The variable $u_n$ indicates the presence or absence of a target in the $n$th pixel. An estimate of this parameter is obtained by evaluating the marginal conditional distribution $f(u_n \mid \boldsymbol{z}_n)$, where a target is declared present if

$$p(u_n = 1 \mid \boldsymbol{z}_n) > p(u_n = 0 \mid \boldsymbol{z}_n). \quad (9)$$

For pixels with targets, approximate depth and reflectivity parameters can be efficiently estimated (see Appendix A), as required by the denoising algorithm. The outputs of this section are the three $N \times 1$ vectors representing the detection map $\boldsymbol{u}$, the depth parameter $\boldsymbol{d}$ and the reflectivity parameter $\boldsymbol{r}$. The reader is invited to see [31] and Appendix A for more details regarding the mathematical computations.

IV. APPROXIMATED MULTISCALE MODEL FOR DENOISING
In many practical situations, we only have access to depth and reflectivity maps as measured data instead of the raw time-tagged photons. This is the case for commercial systems, or when pre-processing the raw data as described in the previous section. This section introduces an approximate model for 3D scene reconstruction when only the depth and reflectivity images are available as input, instead of the rich histogram of counts. The approach adopts justified assumptions to enable the computation of the multiscale depth and reflectivity maps without requiring the histogram data.

A. Inferring Multiscale Depths From Estimated Maps
The multiscale model is often considered to improve the robustness of the parameter estimates to noise; however, it requires access to the histogram of counts to build the low-pass filtered versions represented in Fig. 1. The goal of this section is to infer the multiscale images $\hat{\boldsymbol{d}}^{(\ell)}, \hat{\boldsymbol{r}}^{(\ell)}$ from first-scale estimates $\hat{\boldsymbol{d}}^{(1)}, \hat{\boldsymbol{r}}^{(1)}$ (e.g., from the previous pre-processing step). To achieve this goal, we start by introducing an approximate likelihood model that enables the computation of the multiscale information while only using $\hat{\boldsymbol{d}}^{(1)}, \hat{\boldsymbol{r}}^{(1)}$. Akin to [22], assume the absence of background and a Gaussian SIR in (2), hence simplifying the first-scale likelihood to the following ($\ell = 1$ is omitted for brevity)

$$p(\boldsymbol{y}_n \mid d_n, r_n) \approx Q(\boldsymbol{y}_n)\, \mathcal{N}\!\big(\hat{d}_n;\, d_n, \tilde{\sigma}^2\big)\, \mathcal{G}\!\big(r_n;\, \bar{y}_n + 1, 1\big), \quad (10)$$

where $\hat{d}_n = \arg\max_d \sum_t y_{n,t} \log f_s(t - d)$ denotes the maximum likelihood depth estimate, $\tilde{\sigma}^2 := \sigma^2 / \bar{y}_n$, $Q$ is a parameter-free function only dependent on $\boldsymbol{y}_n$, and $\mathcal{G}(x; \cdot, \cdot)$ denotes the gamma distribution with shape and scale parameters. Therefore, in the case of a Gaussian SIR, this equation shows that the first-scale estimates of depth and reflectivity can be expressed as $\hat{d}_n$ and $\hat{r}_n = \bar{y}_n$. The multiscale histograms are defined as $\boldsymbol{y}^{(\ell)}_n = \sum_{n' \in \nu_n} \boldsymbol{y}_{n'}$, $\forall \ell \in \{1, \dots, L\}$, where $\ell = 1$ is the original cube, $\nu_n$ represents the $k_n$ neighbours of the $n$th pixel, and, for example, $\ell = 2$ corresponds to filtering with a $q_2 = 3 \times 3$ uniform window, $\ell = 3$ with a $q_3 = 5 \times 5$ window, etc. Applying (10) to these histograms gives

$$p\big(\boldsymbol{y}^{(\ell)}_n \mid d_n, r_n\big) \approx Q\big(\boldsymbol{y}^{(\ell)}_n\big)\, \mathcal{N}\!\big(\hat{d}^{(\ell)}_n;\, d_n, \sigma^2 / \bar{y}^{(\ell)}_n\big)\, \mathcal{G}\!\big(r_n;\, \bar{y}^{(\ell)}_n + 1, 1\big). \quad (11)$$

Comparing (11) and (10) shows that $\hat{d}^{(\ell)}_n$ can be computed from the first-scale estimates; this estimate represents the mode (or equivalently the mean) of the Gaussian distribution in (11). To improve robustness to outliers, we consider

$$\hat{d}^{(\ell)}_n = \arg\max_{d} \sum_{n' \in \nu_n} \hat{r}_{n'} \log f_s\big(\hat{d}_{n'} - d\big). \quad (12)$$

Interestingly, the resulting mode estimator in (12) can be linked to kernel-based approaches [36], where the log-SIR represents the kernel function, centered using the first-scale depth values and weighted by their corresponding reflectivities. For the reflectivity, the multiscale estimate $\hat{r}^{(\ell)}_n = \frac{1}{k_n} \sum_{n' \in \nu_n} \hat{r}_{n'}$ represents the average photon count received per pixel. This is justified in the absence of background, since all photons are reflected by a target. However, this reflectivity estimate can be improved by only summing photons around the estimated depth $\hat{d}^{(\ell)}_n$ to reject outliers. This shows that the resulting multiscale depth and reflectivity estimates can be extracted from the original first-scale estimates, which enables a significant reduction in computational cost.
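A minimal sketch of the mode estimator in (12) is given below, assuming a Gaussian SIR so that the log-SIR kernel is quadratic; the grid search and all names are illustrative only.

```cpp
// Sketch of the multiscale mode estimator (12): the scale-l depth of pixel n
// maximizes a sum of log-SIR kernels centered at the first-scale depths of
// its neighbours and weighted by their reflectivities. A Gaussian SIR is
// assumed, so log f_s(x) = -x^2 / (2 sigma^2) up to a constant.
#include <cstddef>
#include <vector>

double multiscaleDepth(const std::vector<double> &neighDepths,  // first-scale depths in nu_n
                       const std::vector<double> &neighRefl,    // first-scale reflectivities
                       double sigma, double dMin, double dMax, double step) {
    double bestD = dMin, bestScore = -1e300;
    for (double d = dMin; d <= dMax; d += step) {   // coarse grid search over depth
        double score = 0.0;
        for (std::size_t i = 0; i < neighDepths.size(); ++i) {
            const double x = neighDepths[i] - d;
            score += neighRefl[i] * (-0.5 * x * x / (sigma * sigma));  // weighted log-SIR
        }
        if (score > bestScore) { bestScore = score; bestD = d; }
    }
    return bestD;  // mode of the weighted kernel sum
}
```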

B. Denoising Algorithm
The multiscale depth and reflectivity images need to be combined efficiently to build single, cleaned depth and reflectivity images. We consider a denoising approach that associates a single depth value with each pixel location (i.e., X, Y location). For each pixel location, moving from lower to higher scales, the good pixels are first detected (those having more than $\sqrt{q_L}$ neighbours with close depth values). The depth value is then obtained as the average of the surrounding valid points. Formally, the depth is obtained by

$$\hat{d}_n = \frac{1}{\#\mathcal{V}^{(\bar{\ell}_n)}_n} \sum_{n' \in \mathcal{V}^{(\bar{\ell}_n)}_n} \hat{d}^{(\bar{\ell}_n)}_{n'}, \quad (13)$$

with

$$\bar{\ell}_n = \min\Big\{ \ell \ \text{subject to} \ \#\mathcal{V}^{(\ell)}_n > \sqrt{q_L} \Big\}, \quad \mathcal{V}^{(\ell)}_n = \big\{ n' \in \nu_n : |\hat{d}^{(\ell)}_{n'} - \hat{d}^{(\ell)}_n| \leq \epsilon_d \big\}, \quad (14)$$

where $\#$ denotes the cardinality of a set, and the coefficient $\epsilon_d$ is easily fixed based on physical considerations related to the impulse response width and the time-bin resolution. Note that the denoised values of the first scales are promoted to enhance spatial resolution, as higher scales might lead to blurred depth maps. Note also that the multiscale depth values can be seen as a point cloud, in which case the above approach rejects the outlier points while locally smoothing the resulting cleaned surface.

Combining the multiscale reflectivity maps $\hat{\boldsymbol{r}}^{(\ell)}, \forall \ell$, into a single map can be performed using several restoration algorithms. The choice of method relates to the imaging conditions: algorithms based on Poisson statistics can be used in the sparse-photon regime [24], [37], [38], while other state-of-the-art denoising algorithms [39], [40] can be considered in dense-photon regimes. In this paper, we aim for an efficient algorithm allowing real-time processing. Akin to the bilateral filter, the approach exploits the cleaned depth map and averages the reflectivities of the good points as follows

$$\hat{r}_n = \frac{1}{\#\mathcal{V}^{(\bar{\ell}_n)}_n} \sum_{n' \in \mathcal{V}^{(\bar{\ell}_n)}_n} \hat{r}^{(\bar{\ell}_n)}_{n'}. \quad (15)$$

Measuring the uncertainty of the estimated depth values is also important to help with decision making. Using the multiscale depth maps from Section IV-A, and akin to [22], the depth variance $\hat{\sigma}^2_n$ of the $n$th pixel can be approximated as follows

$$\hat{\sigma}^2_n = \frac{\beta_d + \frac{1}{2} \sum_{\ell=1}^{L} \sum_{n' \in \nu_n} w^{(\ell)}_{n,n'} \big(\tilde{d}^{(\ell)}_{n'} - \hat{d}_n\big)^2}{\alpha_d + \frac{1}{2} \sum_{\ell=1}^{L} \sum_{n' \in \nu_n} w^{(\ell)}_{n,n'}}, \quad (16)$$

where $\alpha_d, \beta_d$ are small constants and $w^{(\ell)}_{n,n'}$ are positive normalized weights given by

$$w^{(\ell)}_{n,n'} \propto \exp\!\Big( -\big(\tilde{d}^{(\ell)}_{n'} - \hat{d}_n\big)^2 \big/ \big(2\epsilon_d^2\big) \Big), \quad (17)$$

where $\epsilon_d$ is a positive constant and $\tilde{\boldsymbol{d}}$ contains the multiscale depths after zeroing the outliers, i.e., only keeping good depth values (those with more than $\sqrt{q_L}$ neighbouring points with close depth values).
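The scale-selection rule and the averaging in (13)-(15) can be sketched per pixel as follows (illustrative C++ with hypothetical names; the actual implementation maps these steps onto GPU kernels as described in Section V).

```cpp
// Sketch of the per-pixel denoiser: starting from the lowest scale, find the
// first scale whose neighbourhood contains more than sqrt(q_L) depths within
// eps_d of the centre depth, then average the depths/reflectivities of those
// good points (see (13)-(15)).
#include <cmath>
#include <cstddef>
#include <vector>

struct Denoised { double depth, refl; };

Denoised denoisePixel(const std::vector<std::vector<double>> &msDepth,  // [scale][neighbour]
                      const std::vector<std::vector<double>> &msRefl,
                      double epsD, double qL) {
    const double minGood = std::sqrt(qL);
    for (std::size_t l = 0; l < msDepth.size(); ++l) {   // lower scales first
        const double centre = msDepth[l][0];             // index 0: the pixel itself (assumed layout)
        double sumD = 0.0, sumR = 0.0;
        int good = 0;
        for (std::size_t k = 0; k < msDepth[l].size(); ++k) {
            if (std::fabs(msDepth[l][k] - centre) <= epsD) {  // close depth test
                sumD += msDepth[l][k];
                sumR += msRefl[l][k];
                ++good;
            }
        }
        if (good > minGood)                              // enough close neighbours
            return {sumD / good, sumR / good};
    }
    return {msDepth.back()[0], msRefl.back()[0]};        // fallback: coarsest scale
}
```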

V. PARALLEL IMPLEMENTATION USING GRAPHICAL PROCESSING UNITS
The proposed algorithm was first implemented and validated in MATLAB. Using this as a reference, a multi-threaded implementation was created in C++, where most of the functionality has been transformed to run on a CTA (Cooperative Thread Array). We chose the CUDA framework, which can compile code to run on an NVIDIA GPU. This resulted in a speed-up of three orders of magnitude compared to the MATLAB implementation. The key steps of the implementation are described in the following sections.

A. Deciding on Modularity vs Performance
We started the transformation of the algorithm by identifying its individual modules. There is a trade-off between performance and modularity (reusability) that requires consideration. Modules provide complete functions that can have multiple inputs and outputs. A module function finishes when all of its outputs are ready, but within the function, the different outputs are not necessarily ready at the same time. So if another module could start working with one of the outputs, there would be no need to wait for all outputs to be completed. Therefore, there is a synchronization boundary at the module functions, where performance is lost through waiting. On the other hand, using modules makes the code reusable and easier to read. Keeping this in mind, our implementation is structured in modules, whose roles are summarized in Fig. 2.

B. Data-Flow Modelling
To identify the possibilities for parallelization, we model the execution pipeline using a Data-flow Diagram (DFD) that depicts individual functional units and the flow of data between them. These functional units, also called kernels, are determined so as to be able to run in a CTA, and therefore need to satisfy the following criteria:
- perform a simple operation to produce one output, loading only as many inputs as needed for that;
- the input data must be contiguous in device memory;
- the boundaries of individual data elements must be clearly addressable (same size, or padded to be so);
- the data needs to be properly aligned in memory according to device requirements;
- if not all of these are satisfied, the data needs to be transformed to the proper format first.

Based on the DFD, the possible parallel execution paths can be identified for maximum performance. When a DFD function has multiple inputs from different execution paths, these need to be synchronized to make sure the data is only used when ready.

C. Parallel Execution Framework
The individual kernels of the application were implemented in C++ using the CUDA framework. While multiple general-purpose parallel processing frameworks are available, native CUDA implementations provide the best computational performance [41]. Features of modern C++, such as compile-time polymorphism, were also utilized wherever possible for maximum performance.
Real-time performance was achieved by utilizing multiple levels of parallelism of the framework:
- Grid-level: launching kernels asynchronously for independent tasks using CUDA streams, for example, the parallel processing of the multiscale convolution, where each scale is processed independently;
- Block-level: aligning input data so that different blocks of the CTA can work on different parts of the data; an example is running a convolution filter for each pixel in a different block;
- Thread-level: performing reduction- and scan-type operations through cooperation and synchronization; to scale to the limited size of the block, we can utilize sequential iterations, for example, performing a vector reduction to determine the weights of the convolution filter in each block.

With grid-level parallelism we can perform completely independent tasks on independent, arbitrarily sized data. With block-level parallelism we perform the same task on different parts of contiguous data of the same size. Thread-level parallelism is similar to block-level, with the exception that we can use cooperation between different threads. We used the primitives of the CUB library for thread-level cooperative operations. Note that the processing speed can be further improved by utilizing device-level parallelism, where the processing tasks are distributed over multiple GPU devices. However, this division needs to be carefully designed to ensure efficiency, and will be the subject of future work.
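As an illustration of grid-level parallelism, the following sketch (illustrative names, not the paper's code) launches one kernel per scale on its own CUDA stream so that the multiscale filters can overlap on the device.

```cpp
// Sketch: grid-level parallelism with CUDA streams. Each scale of the
// multiscale filter is launched on its own stream so the kernels can
// overlap on the device. filterScale is a placeholder kernel.
#include <cuda_runtime.h>

__global__ void filterScale(const float *in, float *out, int n, int q) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // placeholder for the scale-q filtering work
}

void runScalesConcurrently(const float *dIn, float **dOut, int n,
                           const int *windows, int numScales) {
    cudaStream_t streams[8];  // assumes numScales <= 8
    for (int l = 0; l < numScales; ++l) cudaStreamCreate(&streams[l]);

    const int block = 256, grid = (n + block - 1) / block;
    for (int l = 0; l < numScales; ++l)  // asynchronous, independent launches
        filterScale<<<grid, block, 0, streams[l]>>>(dIn, dOut[l], n, windows[l]);

    for (int l = 0; l < numScales; ++l) { // wait for all scales to finish
        cudaStreamSynchronize(streams[l]);
        cudaStreamDestroy(streams[l]);
    }
}
```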

D. Implementation
Based on the assembled DFD and the requirements laid out in Section V-B, functional units were implemented as C++ classes that use several CUDA kernels. The structure of these kernels follows a standard pattern of operations depicted in Fig. 3. First, each thread loads an element from memory into its local registers. Each thread can perform independent operations on the data. When threads need to cooperate, they copy their register contents to shared memory. Data is transformed cooperatively in shared memory with proper synchronization. Finally, the end results are copied back to device memory.
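A minimal CUDA kernel following this register/shared-memory/device-memory pattern might look as follows (a sketch, not the paper's kernel); the cooperative step shown here is a simple block-wide sum, and a block size of 256 threads is assumed.

```cpp
// Sketch of the standard kernel pattern of Fig. 3: load to registers,
// cooperate through shared memory with synchronization, write back to
// device memory. Assumes blockDim.x == 256.
__global__ void blockSumPattern(const float *in, float *blockSums, int n) {
    __shared__ float tile[256];                    // shared memory for cooperation
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = (i < n) ? in[i] : 0.0f;              // 1) load element into register
    v = v * v;                                     // 2) independent per-thread work

    tile[threadIdx.x] = v;                         // 3) copy register to shared memory
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // 4) cooperative transform (tree sum)
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)                          // 5) write result back to device memory
        blockSums[blockIdx.x] = tile[0];
}
```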
The target detection algorithm uses permutation matrices to check the likelihood that a subset of the received photons originates from targets. The size of these matrices depends on the number of received photons in each pixel. In order to process this data on the CTA, the data elements need to have the same dimensions. Different photon-count groups are therefore considered, and the pixel input data is separated and padded accordingly. These groups are then processed independently, and the results are reassembled at the end.
In the denoising algorithm, the filtering kernel is realized by first using 2D CUDA textures to gather the neighbouring depth and reflectivity values of each pixel. This matrix is computed once, for the largest configured filter size. The different filters are then run independently. Each row of the matrix is processed by a separate CUDA thread block. The selection of the final values is performed once all the filters have completed. Visualization is done using OpenCV for the 2D images and OpenGL for the 3D point cloud.
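The neighbourhood gathering via 2D textures can be sketched as follows (illustrative only; the texture object setup with cudaCreateTextureObject and clamped addressing is omitted for brevity, and all names are assumptions).

```cpp
// Sketch: gather each pixel's q x q neighbourhood through a 2D CUDA texture,
// with clamped addressing handling the image borders. 'tex' is assumed to be
// a cudaTextureObject_t bound to a float depth map stored in a cudaArray.
#include <cuda_runtime.h>

__global__ void gatherNeighbours(cudaTextureObject_t tex, float *out,
                                 int width, int height, int radius) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int q = 2 * radius + 1;  // filter window size
    float *row = out + (size_t)(y * width + x) * q * q;
    int k = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
            // clamped texture fetch: out-of-range coordinates repeat the border
            row[k++] = tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);
}
```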

VI. RESULTS ON SIMULATED DATA
This section evaluates the performance of the proposed algorithms on simulated data with known ground truth. We first introduce the evaluation criteria and the comparison algorithms. We then analyse the performance of the target detection algorithm and the robustness of the proposed denoising strategy, and finally discuss the computational cost. Except for RT3D, all simulations were run on an HP ZBook Studio G8 laptop with an Intel i7-11850H CPU, 32 GB RAM and an NVIDIA GeForce RTX 3070 GPU with 8 GB memory. The available RT3D was precompiled for Windows, so we used a PC with an Intel i5-6600 CPU, 24 GB RAM and a GeForce GTX 1060 6 GB GPU under Windows 10.

A. Evaluation Criteria
The results obtained are evaluated qualitatively by showing the estimated maps, and quantitatively by considering several metrics. The probabilities of true detection (PTD) and false alarm (PFA) are used to evaluate the target detection pre-processing algorithm. For simulated data with known ground truth, the reconstructed depth and reflectivity images are evaluated using the depth mean absolute error (DAE) and the intensity mean absolute error (IAE) measures

$$\mathrm{DAE} = \frac{1}{N} \sum_{n=1}^{N} \big| \hat{d}_n - d_n \big|, \qquad \mathrm{IAE} = \frac{1}{N} \sum_{n=1}^{N} \big| \hat{r}_n - r_n \big|,$$

where $\hat{d}_n, \hat{r}_n$ denote the estimates and $d_n, r_n$ the ground-truth values.

B. Comparison Algorithms
The proposed method is compared to several state-of-the-art algorithms based on statistical or learning-based strategies, and using CPU or GPU implementations, as follows:
- Lindell [25]: a deep learning algorithm for denoising raw single-photon data; it is used to evaluate the robustness of the reconstruction in extreme conditions;
- RT3D [32]: a GPU-based real-time algorithm assuming the presence of multiple surfaces per pixel; it is used to evaluate the robustness of the reconstruction in extreme conditions and the processing speed;
- Halimi [22]: a statistical algorithm for denoising raw single-photon data using multiscale information; it is used to evaluate the robustness of the reconstruction in extreme conditions;
- the classical algorithm (denoted Class.): estimates $\hat{\boldsymbol{d}}^{(\ell=1)}$ as described in Section IV-A in (10) from the observed histograms or photon ToFs, and estimates $\hat{\boldsymbol{r}}^{(\ell=1)}$ by summing the counts around the detected peak.

C. Performance of the Target Detection Pre-Processing
This section evaluates the generalized event-based target detection (GETD) algorithm proposed in Section III. We generate synthetic data following the model (6) with T = 2500 bins and σ = 40, while varying the SBR and the average number of photons in the ranges SBR ∈ [0.01, 10] and ȳ ∈ [1, 1000]. We consider two cases based on exponential and gamma-shaped backgrounds, with parameters a = 0.4 T and b = 0.3 T, respectively, and compare the proposed algorithm with the histogram-based detection (HTD) method in [30] and the ETD method [31] assuming a uniform background (both ETD and GETD use the approximation level M = 10). Figs. 4 and 5 show the probabilities of true detection and false alarm of the three algorithms. For both background shapes, GETD provided better probabilities than ETD. HTD provided similar or higher detection probabilities than GETD, but at the expense of a very large false alarm probability, which indicates that the HTD algorithm is misled by background counts that are often mistaken for target detections. These figures indicate the importance of exploiting the background shape, especially when dealing with noisy situations with a high number of photons per pixel (PPP) and a low SBR.

D. Robustness of the Denoising Algorithm to Photon Sparsity or High Background
The proposed denoising algorithm is compared to different state-of-the-art algorithms, namely RT3D [32], Lindell [25] and Halimi [22], as well as the classical matched-filter algorithm. The comparison used the Middlebury dataset to generate histograms of photon counts corrupted by uniform and gamma-shaped background (with $f_b = \mathcal{G}(2.2, 110)$). We focused on the 283 × 183 pixel Art scene, and generated the data using a Gaussian SIR with σ = 6 and T = 1024 bins for multiple cases of PPP and SBR, which vary logarithmically in the intervals PPP ∈ [0.1, 100] and SBR ∈ [0.1, 100]. The Art scene contains target returns in all pixels, hence the target detection was only used to estimate the initial depth and reflectivity images. The proposed denoising algorithm is used with three scales, 1 × 1, 3 × 3 and 7 × 7, and the threshold $\epsilon_d$ = 2.75 cm (i.e., 4.6 time bins).

Fig. 6 shows the DAE obtained with the studied algorithms for different SBR and PPP levels and the two background shapes. In Fig. 6 (top), we tested all algorithms under their assumptions, including a uniform background noise. In Fig. 6 (bottom), we tested the generalizability of the algorithms to a non-uniform background scenario. We highlight that Lindell's algorithm was not retrained, and that RT3D is not shown since the DAE is not a suitable metric for its point-cloud output. As expected, the proposed approximate algorithm provides less accurate depth estimates than Halimi's (which involves more computations), but outperforms the classical and Lindell's algorithms, especially in the case of a gamma-shaped background, since these algorithms do not perform background shape estimation. Similarly, Fig. 7 shows the obtained IAE, where Lindell's algorithm is not included because it does not output intensity estimates. The proposed algorithm performs better than the classical algorithm, especially in the case of a gamma-shaped background.

To compare with RT3D, we selected another metric, examining the number of good detected points in each pixel. A point is considered a good detection if a reference point exists within a threshold of τ = 10 bins. If there are multiple points per pixel (as with RT3D), we always consider the one closest to the reference. Fig. 8 shows the results of this metric for the two background cases. The proposed algorithm shows better detection results in most cases, including low PPP or SBR scenarios (yellow is better). Fig. 9 shows an example of the point-cloud reconstructions obtained by the analyzed algorithms for two sparse-photon cases with gamma-shaped background. In the first scenario (top row), with PPP = 0.1 and SBR = 10, the proposed algorithm outperforms RT3D, and the results are comparable to those of Halimi's and Lindell's algorithms, with a small number of outliers. In the second scenario (bottom row), with PPP = 1 and SBR = 10, Halimi's algorithm provides the best results, followed by the other algorithms, where Lindell's misses a few regions, RT3D joins independent surfaces, and the proposed algorithm detects a few outliers.

E. Evaluation of the Computational Time
This section evaluates the execution times of the pre-processing and denoising algorithms when implemented on the GPU as described in Section V. The analysis is performed using randomly generated data, while varying the numbers of photons and pixels. Fig. 10 shows the variation of the computational time of the target detection algorithm with respect to (a) the number of photons per pixel ȳ, and (b) the number of pixels. The processing time increases only slightly for ȳ < 8 photons; above this, we observe an approximately log-linear dependence of the target detection's execution time on the number of photons, as already indicated in [31]. The reason is that the 2D kernels do not have a significant processing demand when ȳ is low, so the GPU can schedule more parallel tasks. In terms of the number of pixels, we observe a log-linear dependence.
The computational time of the denoising algorithm is independent of the number of photons or time bins, and is therefore only studied when varying the number of pixels and the filter sizes. Fig. 11 shows a small increase in the processing time with increasing filter sizes for small resolutions (N < 128 × 128), until the maximum processing throughput of the GPU is reached. Larger images, however, lead to an increased processing time, as the pixels are processed in separate CUDA blocks that need to be scheduled sequentially, preventing fully parallel processing.

VII. RESULTS ON REAL DATA
This section evaluates the proposed algorithms on real underwater data with moving targets. We first describe the experimental setup and the targets considered. Second, we study the proposed target detection algorithm while varying the number of detected photons, i.e., the data acquisition time. We finally evaluate the proposed denoising algorithm on real underwater data obtained in clear conditions and in the presence of a scattering agent (i.e., different levels of obscurant).

Fig. 12. Schematic of the experimental setup and targets. The bi-static LiDaR system and an RGB passive camera were positioned to have a front view. Another side-view RGB passive camera was used to observe the targets in the presence of obscurants. Both RGB cameras were used while illuminating the scene with the 532 nm laser source. Targets included a small model submarine and artificial fish.

A. Experimental Setup and Targets
This section considers underwater imaging of moving objects. The experiments were conducted in laboratory conditions at Heriot-Watt University, using the TCSPC system with the CMOS Si-SPAD detector array described in [4]. This bi-static system allows fast imaging at a rate of 500 binary frames per second (with 1 ms acquisition time per frame), with a good spatial resolution of 128 × 192 cross-range pixels.
A 20 MHz synchronization signal was used with 33 ps timing bin widths, equivalent to a return depth of 3.7 mm per bin in water. In these measurements we used a total of 1516 time bins. Fig. 12 shows a schematic of the experimental setup, which included the LiDaR system and two RGB cameras to visualise the scene. Both the LiDaR system and one of the RGB cameras face the water tank to provide a front view. In addition, we added a side-view RGB camera in order to image the targets in the presence of obscurants. Fig. 12 also shows pictures of the moving targets considered, namely a miniature model of a yellow submarine and models of small artificial fish. The targets were tied with a metallic wire to allow an operator to control their position from above the water tank. Several experiments were performed with these objects while adding different levels of scattering agent to the water. In each case, the LiDaR system was used to observe 10000 ToF binary frames at a rate of 500 frames per second (fps), which allowed the fast movements of the objects to be captured.

B. Photon-Sparse Target Detection for Underwater Imaging
This section evaluates the target detection algorithm for underwater imaging of moving targets. The algorithm's performance is studied when varying the number of integrated binary frames, i.e., the number of detected photons. We selected a clear-water scenario with the moving submarine for this evaluation. To obtain pixels without target returns, we pre-processed the raw data by replacing the peak of the reflective surface at the end of the tank with random background counts. This means that some pixels contain returns from the targets, while others only contain background photons scattered by the environment. Note that the ToF binary frames can be integrated and jointly processed to simulate different acquisition times, for example, jointly processing 10 ToF binary frames to obtain a rate of 50 depth fps. To study different photon levels, we built the data by jointly processing 128, 64, 32, 16 and 8 ToF binary frames, starting from the same frame in each case.

Fig. 13 shows examples of the TD results on clear-water data after combining different numbers of binary frames. The first row (a) shows the difference of log-probabilities $\Delta P = \log\left[p(u = 1)/p(u = 0)\right]$. The target detection maps are obtained by thresholding these probability maps with a small threshold τ, as shown in the second row (b). To obtain cleaner TD maps, a simple solution is to filter the probability maps with a median filter (e.g., with a 3 × 3 kernel), giving the results in the bottom row (c). We can clearly recognise the objects' shapes even when jointly processing as few as 8 ToF binary frames, which corresponds to an output of 62 detection fps, suitable for real-world target recognition applications. Since this is real data, we consider the filtered (c) results obtained on 128 ToF binary frames with a threshold τ = 5 as a ground truth, and use it to evaluate the target detection algorithm by calculating the probability of true detection (PTD) and the probability of false alarm (PFA) for the different acquisition times. Fig. 14 presents the obtained target detection maps with and without filtering, for different values of the threshold τ. Using a small τ, the PTD is better, but the PFA is higher as well. Filtering the target detection probability maps improves the PTD and lowers the PFA. This highlights the robustness of the proposed algorithm in the sparse-photon case, which is consistent with the results on simulated data presented above.
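The thresholding and median filtering used to clean the detection maps can be sketched as follows (illustrative C++; a 3 × 3 median kernel is assumed, matching the text, and all names are hypothetical).

```cpp
// Sketch: clean a target-detection probability map by thresholding the
// log-ratio dP = log(p(u=1)/p(u=0)) and applying a 3x3 median filter.
#include <algorithm>
#include <vector>

std::vector<unsigned char> cleanDetectionMap(const std::vector<float> &dP,
                                             int W, int H, float tau) {
    std::vector<unsigned char> det(W * H), out(W * H);
    for (int i = 0; i < W * H; ++i) det[i] = dP[i] > tau ? 1 : 0;  // threshold

    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            unsigned char win[9];
            int k = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = std::clamp(x + dx, 0, W - 1);  // clamp at borders
                    int yy = std::clamp(y + dy, 0, H - 1);
                    win[k++] = det[yy * W + xx];
                }
            std::nth_element(win, win + 4, win + 9);        // median of 9
            out[y * W + x] = win[4];
        }
    return out;
}
```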

C. Fast and Robust 3D Underwater Imaging
In this section, we are interested in fast imaging in noisy scenarios such as turbid water, which presents significant scattering levels and a non-uniform background. We studied different levels of scattering by mixing the water with varying concentrations of Maalox, an antacid medicine that strongly increases scattering without inducing significant optical absorption (also used in [42]). To highlight the benefit of the proposed denoising algorithm, we compared it with the classical algorithm, Halimi's [22], the robust Lindell [25] and the fast RT3D [32] algorithms. To ensure a fair comparison, a black plane was placed at the end of the water tank so that all pixels have a target return; the TD algorithm was thus not used in this section.

Fig. 15 shows the 3D reconstructions of the different algorithms on frames from four experiments with increasing attenuation length (AL), from AL = 0.25 to AL = 4.6, where the attenuation is measured one way from transceiver to target, not round-trip. In each case, 64 binary frames were used, each with 1 ms acquisition time. In the clear-water case, both Halimi's and the proposed algorithm use filter sizes of 1 × 1, 3 × 3 and 7 × 7, while in the other cases they use 5 × 5, 7 × 7 and 11 × 11. For RT3D, we adjusted the intensity threshold in each case to obtain the best visual results; all other parameters were set to their defaults. The targets used were the model submarine and small artificial fish, and the RGB images were captured by the front- and side-view RGB passive cameras shown in Fig. 12. The proposed algorithm provides reconstructions of similar quality to Halimi's algorithm [22], even in turbid water, while preserving edges between distinct surfaces. RT3D is conservative, preferring to reject surfaces buried in noise, and thus tends to remove some of the back surface. Lindell's algorithm is successful in detecting the targets, but over-smooths the edges. Additionally, the proposed algorithm returns a depth uncertainty comparable to that of the statistical algorithm [22], as can be seen in Fig. 16; higher uncertainty is present in the background and around the edges of objects.
TABLE I
COMPUTATIONAL TIME OF THE DIFFERENT TARGET DETECTION AND DENOISING ALGORITHMS

The computational cost of the studied algorithms is reported in Table I. The proposed algorithm reconstructs the scene in less than 8 ms per frame, which enables higher-level real-time processing such as object detection or classification for autonomous navigation. More results, obtained by processing multiple frames as 3D videos, are provided at the external link given in Section IX.

VIII. CONCLUSION
This paper has proposed an accelerated algorithm for the robust processing of 3D single-photon LiDaR data. The algorithm includes a pre-processing target detection step that reduces the data volume by only considering pixels containing targets. To improve robustness to false detections, the proposed algorithm adopts an approximated multiscale model, which avoids the computationally expensive low-pass filtering of the histograms. This formulation allows the estimation of parameter values together with their uncertainties, as required by critical applications. The independent updates of the resulting algorithm were implemented in parallel on graphics processing units, which led to a fast and robust algorithm achieving real-time processing performance at more than 50 fps for 128 × 128 pixels. The algorithm was validated in several imaging experiments in extreme conditions, showing clean target reconstructions and better performance than the investigated state-of-the-art algorithms. Future work will consider integrating some of the multiscale computations within the system hardware to accelerate the imaging process, as well as generalizations to multispectral LiDaR data and to multiple peaks per pixel, as observed when imaging through semi-transparent surfaces or for tree-canopy imaging using airborne LiDaR.

IX. CODE AVAILABILITY
Compiled executables for processing the LiDaR data used in the paper are available at the following URL: https://github.com/sandorplosz/ieee-tci-real-time-denoiser-2022-demos.
APPENDIX A

Akin to [31], one can show that

$$p(u_n = 1 \mid \boldsymbol{z}_n) = \frac{\pi \sum_{m=0}^{\bar{y}_n} \mathrm{Beta}(\bar{y}_n + \alpha - m, \beta + m)\, \bar{a}_{nm}}{T\, \mathrm{Beta}(\alpha, \beta)}, \quad (18)$$

where $a_{nm}(d_n)$ is defined in [31], and $\bar{a}_{nm}$ is the result of marginalizing the Gaussians $f_s(\cdot)$ in $a_{nm}(d_n)$ with respect to $d_n$, under the assumption that the location $d_n$ is far from the observation window borders, so that $\int_1^T \mathcal{N}(d_n; \mu, \sigma^2)\, \mathrm{d}d_n \approx 1$. The sum defining $a_{nm}$ (over subsets of photons) is approximated by limiting the number of terms summed to $K = \binom{M}{M/2} = \frac{M!}{(M/2)!\,(M/2)!}$, where $M$ is a user-fixed parameter (set to $M = 10$ in this paper) and $!$ denotes the factorial operator.

Note that the target detection algorithm assumes a known background distribution $f_b(t)$. This can be measured during calibration by imaging a scene without a target, or obtained from previous frames in the case of multi-temporal imaging. Otherwise, it can easily be approximated by first estimating depth using a matched filter, and then rejecting outlier pixels (those without neighbouring points within a depth threshold); the normalized, smoothed histogram of these outliers provides an approximate estimate of $f_b(t)$. In the presence of a target, the depth estimate can be obtained using a weighted matched filter, and the reflectivity as the number of photon counts around this depth estimate.
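The background-shape estimation described above can be sketched as follows (illustrative C++ with hypothetical names; the outlier test and the moving-average smoothing window are assumptions).

```cpp
// Sketch: approximate the background shape f_b(t) from pixels flagged as
// outliers (no neighbouring points within a depth threshold), by pooling
// their ToF histograms, smoothing, and normalizing to a density.
#include <vector>

std::vector<double> estimateBackgroundShape(
    const std::vector<std::vector<int>> &histograms,  // per-pixel histograms of length T
    const std::vector<bool> &isOutlier, int T, int smoothRadius) {
    std::vector<double> pooled(T, 0.0);
    for (std::size_t n = 0; n < histograms.size(); ++n)
        if (isOutlier[n])
            for (int t = 0; t < T; ++t) pooled[t] += histograms[n][t];

    std::vector<double> fb(T, 0.0);  // moving-average smoothing
    double total = 0.0;
    for (int t = 0; t < T; ++t) {
        double s = 0.0;
        int c = 0;
        for (int dt = -smoothRadius; dt <= smoothRadius; ++dt) {
            int tt = t + dt;
            if (tt >= 0 && tt < T) { s += pooled[tt]; ++c; }
        }
        fb[t] = s / c;
        total += fb[t];
    }
    if (total > 0.0)                 // normalize so the shape integrates to one
        for (int t = 0; t < T; ++t) fb[t] /= total;
    return fb;
}
```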