Dynamic non-line-of-sight imaging system based on the optimization of point spread functions

Abstract: Non-line-of-sight (NLOS) imaging reveals hidden objects reflected from diffusing surfaces or behind scattering media. NLOS reconstruction is usually achieved by computational deconvolution of time-resolved transient data from a scanning single-photon avalanche diode (SPAD) detection system. However, such a system requires a lengthy acquisition, making it impossible to capture dynamic NLOS scenes. We propose a novel SPAD array together with an optimization-based computational method to achieve NLOS reconstruction at 20 frames per second (fps). The imaging system's high efficiency drastically reduces the acquisition time for each frame, and the forward-projection optimization method robustly reconstructs NLOS scenes from the low-SNR data collected by the SPAD array. Experiments were conducted over a wide range of dynamic scenes and compared with confocal and phasor-field methods. Under the same exposure time, the proposed algorithm shows superior performance among state-of-the-art methods. To better analyze and validate our system, we also used simulated scenes.


Introduction
Imaging objects hidden from a camera's direct view is an emerging research topic with a broad range of applications, including robotic vision, remote sensing, medical diagnosis and autonomous vehicles. In general, a non-line-of-sight (NLOS) imaging system (comprising a short-pulse laser and a transient imaging device) sequentially performs the following steps: 1) the pulsed laser shines on the relay wall, 2) the scattered light bounces off the hidden object and is diffusely reflected back to the relay wall, and 3) a transient imaging device records the scattering response from the relay wall. NLOS imaging can reveal hidden objects around corners or obstructions due to the excellent temporal and spatial resolvability of time-resolved sensors, such as streak cameras, gated sensors, or SPADs.

Optimization of spatial point spread functions
As shown in Fig. 1(a), the NLOS imaging system consists of a 32×32 SPAD array, a 50 MHz, 70 fs femtosecond laser, and a relay wall. The laser light first shines on the wall, travels to the hidden scene, and propagates back to the relay wall. The SPAD array camera captures transient waves containing the scattered geometric information of the hidden objects. The laser's illumination position and the SPAD array's field of view (FoV) are fixed throughout the acquisition process. The transient image can be derived as a linear combination of each hidden point's response Ψ in Fig. 1(a). The linear function is shown in Fig. 1(b), where τ is the NLOS system's response captured by the SPAD camera, ρ is the shape of the NLOS object, Ψ is the NLOS PSF matrix, and each column of Ψ represents the PSF of a fixed point on the target. Figure 1(c) shows how each object voxel contributes to a transient image; the entire transient image is the linear sum of all these contributions. If there is no occlusion, we can turn the forward model into a linear system and generate a corresponding matrix representation. Figure 1(d) is a simple example depicting circular objects at different positions producing transient images with different curvatures and positions; a response captured by a 32×32 SPAD array for a hidden sample object is also provided at the bottom. The model traces the time-resolved response of a femtosecond laser interacting with the scattering media and the hidden objects, and the transient information recorded by the sensor can be represented by [2,3]

τ(x', y', t) = ∫_Ω ρ(x, y, z)/(r_l² r²) · δ(r_l + r − tc) dx dy dz,    (1)

where ρ is the sought-after albedo of the hidden scene in the 3D half-space Ω in front of the scattering board. The Dirac delta function δ describes the instant at which the transient scattered light hits the relay wall, r_l is the distance from the light source (x_l, y_l, 0) to the imaging point (x, y, z), and r is the distance from the detection point (x', y', 0) to the imaging point (x, y, z).
The transient image is recorded when the light source illuminates the position (x_l, y_l) on the wall. Every measurement sample τ(x', y', t) captures the photon flux at position (x', y') and time t relative to an incident pulse scattered from the laser position at t = 0. With a Gaussian-shaped laser pulse g(t) and the Poisson noise of the time-resolved camera, the system response can be modeled as

τ(x', y', t) = P( g(t) * ∫_Ω ρ(x, y, z)/(r_l² r²) · δ(r_l + r − tc) dx dy dz ),    (2)

where P denotes the Poisson noise of the camera acquisition process. In the SPAD camera, photon counters detect single-photon events within a certain time bin with some probability; in the extremely low-light condition of the NLOS experiment, the sampling process of the SPAD becomes a Bernoulli experiment repeated many times [2]. Since the laser position is fixed in this system, we can reformulate Eq. (2) as the correlation between the intensity of a voxel in the hidden scene ρ and an encoding function ψ:

τ(x', y', t) = P( ∫_Ω ρ(x, y, z) ψ(x', y', t; x, y, z) dx dy dz ),    (3)
ψ(x', y', t; x, y, z) = g(t) * δ(r_l + r − tc)/(r_l² r²).    (4)

Equation (3) is similar to Eq. (2), and ψ is a known encoding function that corresponds to every target point ρ. However, because ψ varies spatially, we cannot directly convert the problem into the deconvolution form of a confocal setup. Instead, we seek the trial solution that generates the closest system response:

ρ* = arg min_ρ ||τ − ∫_Ω ρ ψ dΩ||²,    (5)

where ρ* is the trial solution. To avoid the problem being under-determined, the reconstruction resolution (ψ's column number) should be lower than the resolution of the raw data (ψ's row number).
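The encoding function ψ of Eq. (4) can be precomputed for every voxel of the reconstruction zone. The sketch below builds one discretized PSF, approximating the convolved laser pulse by a Gaussian of assumed width `sigma_bins`; the geometry parameters and the 55 ps bin width follow the system description, but the function name and arguments are illustrative, not the authors' code.

```python
import numpy as np

C = 3e8          # speed of light, m/s
BIN = 55e-12     # TCSPC bin width (55 ps, matching the PF32's time resolution)

def psf(voxel, laser_pos, pixels, n_bins=256, sigma_bins=1.5):
    """Discretized encoding function psi for one hidden-scene voxel, Eq. (4).

    voxel:     (x, y, z) position of the voxel in the hidden scene.
    laser_pos: (x_l, y_l) illumination point on the wall at z = 0.
    pixels:    list of (x', y') SPAD imaging positions on the wall at z = 0.
    A Gaussian of width sigma_bins stands in for the laser pulse shape g(t).
    """
    v = np.asarray(voxel, dtype=float)
    r_l = np.linalg.norm(v - np.array([laser_pos[0], laser_pos[1], 0.0]))
    t = np.arange(n_bins)
    out = np.zeros((len(pixels), n_bins))
    for k, (xp, yp) in enumerate(pixels):
        r = np.linalg.norm(v - np.array([xp, yp, 0.0]))
        t0 = (r_l + r) / C / BIN           # arrival-time bin of the echo
        out[k] = np.exp(-0.5 * ((t - t0) / sigma_bins) ** 2) / (r_l**2 * r**2)
    return out
```

Each row of the returned array is one pixel's transient histogram; stacking the vectorized PSFs of all voxels column-wise yields the matrix used in the discrete model below.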

Discrete algorithm implementation
We discretize the forward model of Eqs. (3)-(5) to implement the optimization algorithm. The reconstruction zone ρ(x, y, z) is discretized into ρ_xyz ∈ R^(M×N×K), where ψ_xyz is the corresponding independent PSF for the position (x, y, z) in ρ_xyz. Because each correlated response is linearly independent, we can vectorize ρ_xyz and ψ_xyz into ρ_i and ψ_i. Equations (3) and (4) then simplify to

τ = P( Σ_i ρ_i ψ_i ),    (6)

where τ is the vectorized signal from the SPAD array camera, and the forward model becomes a linear system; (x', y') represents the SPAD imaging position on the scattering wall, and t is the TCSPC time stamp. The Poisson noise P is still included in the discrete forward model. By concatenating the vectorized ψ_i into a matrix A, Eq. (6) can be written as

τ = P(Aρ).    (7)

In our experiment, each PSF has a dimension of R^(32×32×256). If the number of PSFs is lower than 32×32×256, the system is over-determined. We used a 16×16×64 NLOS reconstruction domain.
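The discrete forward model above can be sketched in a few lines. Here random columns stand in for the geometry-derived PSFs, and the array sizes are shrunk from the paper's 32×32×256 measurements and 16×16×64 voxels so the example runs instantly; the photon scale factor is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (shrunk from 32*32*256 measurements and 16*16*64 voxels).
n_meas = 8 * 8 * 32        # vectorized SPAD measurement axis (x', y', t)
n_vox = 4 * 4 * 8          # vectorized reconstruction domain

# Column i of A is the vectorized PSF psi_i of voxel i. Random stand-ins
# replace the PSFs computed from the geometry via Eq. (4).
A = rng.random((n_meas, n_vox)) * 0.1

rho = rng.random(n_vox)            # hidden-scene albedo (unknown in practice)
tau_clean = A @ rho                # noiseless linear forward model, tau = A rho
scale = 200.0                      # assumed photons per unit intensity
tau = rng.poisson(tau_clean * scale) / scale   # Poisson detection noise P

# The system is over-determined because n_meas > n_vox, as required.
assert A.shape[0] > A.shape[1]
```

The simulated measurement `tau` plays the role of the vectorized SPAD histogram stack that the optimization below inverts.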
We used standard gradient descent with learning rates of 1×10−4 and 1×10−6 for 2000 iterations to optimize Eq. (9) for the static scene. In a dynamic scene, because the difference between adjacent frames is minimal, we can initialize the current frame from the previous frame's result to reconstruct images faster. We used 2000 iterations with a learning rate of 1×10−4 for the first frame and 100 iterations with a learning rate of 1×10−6 for each consecutive frame. We ran the optimization on one RTX 2080 Ti graphics card; the runtime for the first frame is around 20 seconds for 2000 iterations, and each consecutive frame takes about 0.9 seconds for 100 iterations.
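The optimization loop can be sketched as plain gradient descent on the least-squares objective, with warm-starting from the previous frame's solution implementing the dynamic-scene speedup described above. The non-negativity clip and the spectral-norm step-size estimate in the usage line are our assumptions, not details given in the text.

```python
import numpy as np

def reconstruct(A, tau, lr, n_iter, rho0=None):
    """Gradient descent on the least-squares objective ||A rho - tau||^2.

    Warm-starting `rho0` with the previous frame's solution is what lets
    a dynamic frame converge in ~100 iterations instead of 2000.
    """
    rho = np.zeros(A.shape[1]) if rho0 is None else rho0.copy()
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ rho - tau)   # gradient of the squared residual
        rho -= lr * grad
        np.clip(rho, 0.0, None, out=rho)     # albedo is physically non-negative
    return rho
```

With an arbitrary toy matrix, a stable step size can be estimated as `lr = 0.9 / np.linalg.norm(A, 2) ** 2`; the fixed 1×10−4 and 1×10−6 rates in the paper presumably reflect the scaling of the measured PSFs.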

Comparisons based on synthetic data
We compared the proposed method with the confocal and phasor-field reconstruction methods through simulations of two scenes, "H" and "X", as shown in Fig. 2(a) and Fig. 2(b), respectively. Both scenes were reconstructed with a 0.7 m scanning aperture and a laser that is Gaussian in both the temporal and spatial dimensions. The spatial resolution of the time-resolved data is 32×32, and its temporal resolution is 32 ps.
To compare all methods under the same external conditions, the forward process of the time-resolved data was adjusted to the confocal setup. Although this setup differs from our experimental system, the only difference in the ideal synthetic data is the curvature of each response, which should not affect the reconstruction results; our method's NLOS reconstruction still relies on the spatially varying response. Both the phasor field and our method yield similar results whether or not the setup is confocal. However, the confocal method cannot reconstruct spatially varying responses: directly applying the confocal algorithm to non-confocal data produces significant artifacts. We adjusted the scene parameters of all methods' reconstruction processes according to the simulated experiments, ensuring that all data were reconstructed under the same conditions. Even though our method is designed for general NLOS problems, it achieves better simulation results than the confocal method.
We extensively conducted reconstruction experiments under different noise levels and compared the results in peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and total variation (TV) with the confocal and phasor-field methods. As Fig. 2 shows, our method has the highest PSNR and SSIM while maintaining the lowest TV, demonstrating excellent noise rejection. As the noise level increases, PSNR and SSIM decrease as expected, and TV increases due to rising random reconstruction noise. All reconstruction methods show a similar trend, and our method offers the best results.

Fig. 2. a. Reconstruction results of scene "H"; column (r2) shows the raw data at various SNR values, down to (s3) 13.03 dB. The ground truth (gt) is compared with (a1) Confocal [2], (a2) Phasor Field [3], (a3) FK-Migration [4]. b. Reconstruction results of scene "X"; column (r2) shows the raw data at various SNR values: (s1) 30.01 dB, (s2) 23.01 dB, (s3) 13.01 dB. The ground truth (gt) is compared with (b1) Confocal, (b2) Phasor Field, (b3) FK-Migration. c, d. PSNR, SSIM and TV plots at different noise levels for scenes "H" and "X".
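The PSNR and TV metrics used in the comparison are straightforward to compute; a minimal sketch of both follows (SSIM is omitted because its windowed formulation is longer). The convention of taking the peak value from the reference image is our assumption.

```python
import numpy as np

def psnr(x, ref):
    """Peak signal-to-noise ratio in dB, with the peak taken from the reference."""
    mse = np.mean((np.asarray(x, dtype=float) - ref) ** 2)
    return 10.0 * np.log10(ref.max() ** 2 / mse)

def total_variation(img):
    """Anisotropic TV: the sum of absolute differences between neighbors.
    Lower TV indicates fewer speckle-like reconstruction artifacts."""
    return (np.abs(np.diff(img, axis=0)).sum()
            + np.abs(np.diff(img, axis=1)).sum())
```

Higher PSNR and lower TV together indicate a reconstruction that is both faithful to the ground truth and free of random noise, which is the trend reported in Fig. 2(c, d).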

Pixel synchronization
As the pixels' TCSPCs in the SPAD array operate independently, we need to synchronize the time stamp of each detector. Before the experiment, the SPAD array images the relay wall orthogonal to the sensor without any object in the reconstruction zone. The laser repetition rate is 50 MHz, and all SPAD pixels are illuminated simultaneously. After more than 5 seconds of TCSPC integration, the dark current and noise are negligible compared to the strong reflected signal from the scattering board. We chose the time stamp corresponding to the maximum count as the start timestamp for every TCSPC pixel.
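The calibration step above amounts to a per-pixel argmax over the calibration histograms, followed by shifting each histogram so that its peak defines the time origin. A minimal sketch, assuming a circular shift is acceptable for the alignment:

```python
import numpy as np

def calibrate_offsets(histograms):
    """Per-pixel start-bin calibration from a capture of the bare relay wall.

    `histograms` is an (H, W, n_bins) array of TCSPC counts. After a long
    integration, the peak bin of each histogram marks the direct reflection,
    so its index is taken as that pixel's time origin.
    """
    return np.argmax(histograms, axis=-1)          # (H, W) start-bin indices

def align(histograms, offsets):
    """Circularly shift each pixel's histogram so its peak moves to bin 0."""
    aligned = np.empty_like(histograms)
    for i in range(histograms.shape[0]):
        for j in range(histograms.shape[1]):
            aligned[i, j] = np.roll(histograms[i, j], -offsets[i, j])
    return aligned
```

After alignment, all pixels share a common time axis, which is what the pattern-based reconstruction in Eq. (7) requires.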

Noise2noise (N2N) validations
To quantitatively benchmark the performance of the N2N denoising method [47], we trained the network model on a synthetic dataset of 8000 time-resolved images, each corrupted with 3 dB Gaussian noise. The model parameters were obtained by training for 30 epochs with a batch size of 4. The model details are discussed further in Appendix 2.
In Fig. 3(a), we visually compare reconstruction results at noise levels of 2.78 dB and 11.8 dB. The random noise is drastically reduced by the N2N method, and the N2N reconstruction results are visually more satisfactory. We tested the method over noise levels from 2.78 to 11.8 dB; as Fig. 3(b) shows, the N2N method generally brings around a 2 dB gain in PSNR and a 0.1 gain in SSIM.
We conducted the following tasks to evaluate the proposed method. (1) Data acquisition & hardware configuration: our system consists of a femtosecond amplified diode laser (500 mW at 780 nm with a 70 fs pulse width at a 50 MHz repetition rate) and a 32×32 SPAD-TCSPC array (PF32, Photon Force, Ltd.) with a 55 ps time resolution and a 100 ns dead time. (2) NLOS setup: the system is located 2 m from the relay wall, with the scenes hidden from direct view. The FoV on the relay wall is 82 cm×82 cm. The scattering wall is made of standard white Styrofoam. To avoid pile-up effects, we set the laser position outside the FoV on the relay wall.
The experiments imaged the hidden objects "H", "S", and "U" with a 0.1 s acquisition time. Each sample was placed at a random position and angle about 0.6 m away from the scattering media. Under a short SPAD-array exposure, as in Fig. 5, the proposed method fails to reconstruct the hidden object when the noise exceeds a certain level. We therefore developed a novel 3D noise2noise method (N2N, see Appendices) to improve the SNR of the transient images. Figure 5 compares the raw and denoised data: N2N effectively suppresses the random noise of the time-resolved response while preserving texture fidelity.
The proposed method was extensively tested with N2N noise reduction, as shown in Fig. 3(c). In the extremely low-exposure case, the proposed method fits the PSF dictionary to noise instead of the time-resolved information, so the reconstructed images cannot show the actual structures and contain significant artifacts that obscure observation. When reconstructing the denoised raw data, the original structure can be restored more completely; the significantly reduced background noise and artifacts yield clearer images.

Dynamic experiments
We also imaged dynamic hidden scenes with the same experimental setup as the N2N validation. In the first scene, a rotating statue, the laser illuminated the scattering wall at a fixed position while the SPAD camera recorded the reflected photons with a 100 ms exposure time. The other dynamic scene is a rotating "U", with experimental parameters identical to the first scene. We recorded a frame every 200 ms, and the results are shown in Fig. 4(a).
To assess the highest temporal resolution of the system, we tested varying exposure times of 1000 ms, 500 ms, 100 ms and 50 ms, as shown in Fig. 4(b). The intensity of the reconstructed data decreases accordingly, and the results fail to show details when the exposure time is shorter than 50 ms. The experimental results thus confirm the reliability of the proposed system and reconstruction method: with the proposed NLOS imaging system, we can observe dynamic scenes at about 20 frames per second. Moreover, we performed a contrast analysis of the proposed system (see Appendices). By varying the virtual aperture size and the scene's depth, we can observe factors other than the exposure time that affect the reconstruction quality. The contrast analysis shows that the spatial resolution of the reconstructed images is positively related to the virtual aperture size on the scattering wall and negatively related to the scene's depth; objects far from the scattering media are hard to reconstruct due to the weak signal and indiscernible response. The detailed analysis is given in the Appendices.

Fig. 3. a. Four results are shown for visual comparison: (a1) PSNR is 32.6 dB at a noise level of 2.78 dB, (a2) PSNR is 31.7 dB at a noise level of 2.78 dB, (a3) PSNR is 22.3 dB at a noise level of 11.8 dB, (a4) PSNR is 25.5 dB at a noise level of 11.8 dB. b. Quantitative comparison of reconstruction PSNR and SSIM at different noise levels; when the noise level is below 12 dB, the N2N method improves the reconstruction by around 2 dB in PSNR and 0.1 in SSIM. c. Visual comparison of the N2N method in experiments; the denoising method visually improves the detailed structure of the reconstruction results.

Fig. 4.
Dynamic results and the highest achievable frame rate of the imaging system. a. Dynamic result of a rotating statue and a rotating "U". b. Test results of "U" with exposures of 1000ms, 500ms, 100ms and 50ms. The NLOS reconstruction results can preserve the structure if the exposure time is higher than 50ms.

Discussion
Nowadays, NLOS reconstruction can deliver excellent results through improved iterative back-projection solutions using new rendering models and analysis methods. Revolutionary reconstruction systems and algorithms continue to improve NLOS imaging quality. For example, colorful NLOS imaging has been implemented with a single photomultiplier tube [48,49]. Many methods have been proposed to increase the spatial resolution of NLOS reconstruction, such as real-time transient imaging for amplitude-modulated continuous-wave LIDAR applications [50], analysis of missing features based on time-resolved NLOS measurements [51], convolutional approximations that incorporate priors into filtered back projection (FBP) [52], occlusion-aided NLOS imaging using SPADs [21,22], Bayesian statistical reconstruction to account for random errors [53], temporal focusing methods [31], and compressed methods with different acquisition schemes [54]. The reconstruction time for these methods ranges from minutes to hours, and to our knowledge none of them has been applied successfully to a SPAD array camera without laser scanning. The NLOS resolution is also limited by the system's temporal resolution (contributed by both the SPAD and the laser): for a time-resolved imaging system, the overall time resolution, including the laser pulse width and the TDC temporal resolution, is around 100 ps, leading to a theoretically achievable grid resolution of about 3 cm in the hidden scene. The proposed NLOS system with a SPAD array camera can globally record the hidden object's reflected waves. It can therefore achieve NLOS reconstruction with a low exposure time and without redundant spatial sampling, showing great potential for real-time NLOS imaging due to its scanning-free mechanism. The novel reconstruction method, based on the forward model of NLOS, converts the reconstruction problem into an optimization problem: we divide the reconstruction zone into grids where each voxel corresponds to an independent time-resolved response.
We can reconstruct the hidden object from the spatially varying response through linear regression. Because the model can separate structural data from random noise, the reconstruction method shows anti-noise advantages over back-projection methods. The reconstruction is further enhanced by the neural-network denoising tool described in Appendix 2: by exploiting the fixed noise model of the SPAD camera, training a dedicated denoising neural network can drastically improve the SNR. We can extract high-SNR signals under the assumption that the time-resolved data of two consecutive frames are similar.
There are two approaches to improving the reconstruction performance. First, transient images are captured by a 32×32 SPAD camera, which caps the reconstructed resolution; this sparse sampling severely restricts the spatial resolution. In the future, we will design a SPAD camera with 240×160 pixels to improve the spatial resolution, and we also plan to develop a novel reconstruction method based on compressed sensing to obtain high-quality NLOS images with 32×32 or fewer pixels. Second, the FoV of our camera system is only 82 cm×82 cm, limited by the off-the-shelf lens. With a custom-designed lens, we will enlarge the FoV on the scattering wall to improve the reconstruction quality. At present, the reconstruction method is a linear optimization based on gradient descent over the spatial albedo, and the calibration error of the wall reflectivity and imperfect laser deviation are not considered. In the future, we will develop a reconstruction method that jointly optimizes the system forward model and the spatial albedo of the hidden objects. To further improve compatibility, we can also implement an NLOS imaging system with a low, nanosecond-level temporal resolution, paving the way for a practical and low-cost NLOS imaging system. Moreover, a remote NLOS imaging system is also meaningful for some practical applications.
In future work, we will adopt a SPAD array camera with low readout noise and develop a novel anti-noise method to improve the SNR of transient images and thereby increase the speed of NLOS imaging.

Conclusions
To implement dynamic NLOS imaging, we designed a novel NLOS imaging system based on a SPAD array camera. The new system can globally acquire high-quality transient images of hidden objects. A new reconstruction method was also proposed using an optimization-based approach: a linear regression algorithm was developed by converting the descattering problem into a spatially varied correlation, achieving good results. Simulations and analyses were performed to compare PSNR, SSIM and TV performance with the confocal and phasor-field methods. The results show that our method performs well in PSNR, SSIM and TV and can reconstruct hidden objects under a low SNR. Moreover, we conducted experiments imaging static and dynamic scenes, with images reconstructed by the proposed method. The results show that our system can implement dynamic NLOS imaging at 20 fps, and hidden objects can be reconstructed under a short exposure time. Future developments will enable NLOS imaging with even higher spatial and temporal resolutions, aided by rapid advances in semiconductor manufacturing and imaging methods.

Synchronization
SPAD arrays usually adopt independent per-pixel TCSPC, so direct imaging causes strong tearing between pixels. This drawback generally does not affect back-projection-based methods, where each histogram is projected independently, but the torn response is detrimental to pattern-based reconstruction. We therefore need a calibration process to align the random shifts of the photon counting.

3D noise2noise method
One of the main drawbacks of single-photon avalanche diode (SPAD) array imaging is the strong noise introduced by the histogram-based imaging method. The response is based on the probability of an avalanche at a particular time stamp, leading to strong mixed noise with high sparsity when the acquisition time is low. This noise arises from many sources, such as dark current, crosstalk, laser jitter, weak response signal, dead time, broken pixels and TDC counting errors, which hinders building a noise model for estimating the expectation value. Since inaccurate noise estimation causes aliasing errors during denoising, computing the true value under such a complicated noise model is very difficult. Acquiring the true value through long exposure is also difficult in such a system: the long-exposure image still contains obvious sparse noise even when the memory for TCSPC counting is depleted. Thus, we must recover the noisy data without a ground-truth reference.
Fortunately, the noise model is mostly time-invariant: if we sample noisy time-resolved images at different times, all the data follow the same noise model. Considering the complexity of the noise model and the lack of supervision, we adopt the noise2noise (N2N) unsupervised deep learning method for robust noise reduction. The network exploits the fact that the expectation over multiple noisy samples of the same scene is assumed to be the ground truth. By utilizing the high frame rate of the SPAD array and assuming the content of adjacent frames is similar enough, we relax the same-scene constraint to similar scenes between consecutive SPAD frames. By averaging multiple video-sequence pairs through N2N, we acquire smooth time-resolved data for NLOS reconstruction.
The N2N network removes noise from a fixed noise model by utilizing "the average" of two consecutive frames. When only two frames are used for training, the network finds the mapping from the input to the output. In paired data from a continuous video, the noise is usually sparse and varies rapidly, while the scattered time-resolved signal varies slowly. Thus, to denoise the video series, we find the expectation model that fits every pair of consecutive frames, formulated as

arg min_f E_v[ L(f(v_i), v_{i+1}) ],

where f is the neural network, v_i and v_{i+1} are two consecutive time-resolved data frames imaged by the SPAD array, L is the loss function, and E_v is the expectation of the loss over the whole video sequence. In the training process, two random consecutive frames of a scattered video sequence form one training pair. With one frame as the input and the other as the label, the trained U-Network (Unet) fits the common part between the data and the label while diminishing the noise. The two frames are assumed to be similar enough because the time-resolved response of a non-line-of-sight (NLOS) scene contains mostly low-frequency components with only a small shift due to motion. Because the noise is mostly high-frequency, the averaging acts far more strongly on the noise component than on data with minor motion. However, because the input and label images differ slightly, directly using the mean squared error causes motion blur in the scattered image, which might affect the reconstruction results. We therefore designed the objective function to be

L(f(v_i), v_{i+1}) = ||f(v_i) − v_{i+1}||² + λ ||f(v_i) − G(v_i)||²,

where f is the mapping of the neural network, v_i and v_{i+1} are the input and label images, G is the Wiener filter operator, and λ balances the two terms. The first term enforces the averaging between the two frames while diminishing the noise; the second term constrains the output to be similar to the conventionally noise-filtered input.
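The two-term objective can be sketched numerically. The weight `lam` is an assumed balancing parameter, and a simple box filter stands in for the Wiener filter operator G; neither the function names nor these substitutions come from the paper.

```python
import numpy as np

def box_filter(img, k=3):
    """Moving-average stand-in for the conventional (Wiener) filter G."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += padded[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / (k * k)

def n2n_loss(pred, next_frame, inp, lam=0.5):
    """Two-term N2N objective: consistency with the consecutive frame plus
    similarity to a conventionally filtered input (lam is assumed)."""
    data_term = np.mean((pred - next_frame) ** 2)
    filter_term = np.mean((pred - box_filter(inp)) ** 2)
    return data_term + lam * filter_term
```

In training, `pred` would be the network output `f(v_i)`, `inp` the frame `v_i`, and `next_frame` the label `v_{i+1}`; a clean prediction scores a lower loss than the raw noisy input, which is what drives the denoising.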
The objective function constrains the output to be similar to inputs while removing a complex system noise.
As Fig. 5(a) shows, the raw result contains noise mainly originating from the stochastic imaging process; Fig. 5(b) shows the denoised result from N2N. In the raw data, the noise can hardly be discerned from the signal because of the low SNR and high sparsity, yet the data does contain the useful time-resolved response. After noise filtering by the N2N network, the results are visually improved in texture sharpness and smoothness. Even in the low-SNR zone, where the noise becomes very sparse, the N2N network is still able to recover the time-resolved response and reject strong, sparse noise.

Experiment setup
We set up the experimental system shown in Fig. 6. The experiment mainly uses the Photon Force "PF32" SPAD array and a CFL-05RFF0 femtosecond laser with a 50 MHz repetition rate, 70 fs pulse width, 780 nm wavelength and 500 mW power. The laser and detector are synchronized and delayed through an MPD Picosecond Delayer provided by Photon Force. The camera is focused on the scattering relay wall with a 4 mm CHIOPT-FA0401C lens. We used two mirrors to control the incident angle of the laser. The side lobe of the output pulse is shaped by the pulse shaper. The scattering wall is 1 m × 1 m, the utilized virtual aperture is 0.82 m × 0.82 m, and the reconstruction zone is 1.2 m×1.2 m×1.2 m.
As Fig. 6(a) shows, our NLOS system has only four simple components: a SPAD array camera, a femtosecond laser, a trigger relay and a pulse shaper. The pulse shaper adjusts the size of the femtosecond spot, avoiding crosstalk in the FoV of the SPAD camera caused by irregular laser speckles. In Fig. 6(b), the scattering wall in our system is made of standard white Styrofoam.

Contrast analysis
We further analyzed the NLOS reconstruction ability by examining the contrast under extreme conditions. By varying the field of view of the camera (aperture size) and the scene distance from the scattering wall, the contrast is computed and compared to find the optimal working condition of the NLOS system. To test the reconstruction aptitude of our system, we set up a simple scene with 5 closely spaced points on a flat surface parallel to the scattering relay wall. The contrast is defined as

C = (I_max − I_min) / (I_max + I_min),

where I_max and I_min are the maximum and minimum reconstructed values. In the experimental setup, the spacing between scene points is fixed to 4 cm and 6 cm in the vertical and horizontal directions. To compare the system under different conditions, we conducted 3 experiments varying the scene depth (0.3 m and 0.6 m) and the aperture size (0.7 m and 0.85 m). All results are shown in Fig. 7.

Fig. 7. Contrast analysis. a. The ground truth is 5 points on a vertical surface in 3D space parallel to the relay wall. The lateral spacing of the points is fixed to 4 cm and 6 cm, and the depth of the surface varies between 30 cm and 60 cm. The virtual aperture on the relay wall varies between 70 cm and 85 cm. b. The ground-truth contrast response on the reconstruction surface; the contrast is 1. c. The reconstructed contrast response with depth 0.3 m and aperture 0.7 m; the contrast is 0.382. d. The reconstructed contrast response with depth 0.6 m and aperture 0.7 m; the contrast is not applicable since the reconstructed result is structurally incorrect. e. The reconstructed contrast response with depth 0.3 m and aperture 0.85 m; the contrast is 0.82.
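The contrast metric used above is the standard Michelson contrast; a minimal sketch:

```python
import numpy as np

def michelson_contrast(profile):
    """Contrast C = (I_max - I_min) / (I_max + I_min) of a reconstructed
    intensity profile across the scene points. C = 1 means the dips between
    points reach zero; C -> 0 means the points are no longer resolvable."""
    i_max = float(np.max(profile))
    i_min = float(np.min(profile))
    return (i_max - i_min) / (i_max + i_min)
```

Applied to a line profile through the five reconstructed points, this gives the 0.382 and 0.82 values quoted for the two resolvable configurations.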
As shown in Fig. 7, Fig. 7(a) is the ground truth in the reconstruction zone and Fig. 7(b) is its contrast analysis, where the contrast is 1. From Fig. 7 we find that the reconstruction results depend strongly on the aperture and the depth. With the aperture fixed at 0.7 m, increasing the scene depth decreases the resolution, since the received energy decreases and the curvature of the time-resolved response gradually flattens toward a plane. As Fig. 7(c) and Fig. 7(d) show, the system loses its reconstruction ability when the scene distance increases from 0.3 m to 0.6 m: the central point becomes indistinguishable, making the contrast calculation inapplicable. The contrast also becomes higher when the aperture size is enlarged, which increases the curvature of the time-resolved response. As Fig. 7(c) and Fig. 7(e) show, the contrast increases from 0.38 to 0.82 when the aperture side length increases from 0.7 m to 0.85 m; the more curved response leads to a more discernible reconstruction.
Disclosures. The authors declare no conflicts of interest.
Data Availability. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.