Spatial images from temporal data

Traditional paradigms for imaging rely on the use of spatial structure either in the detector (pixels arrays) or in the illumination (patterned light). Removal of spatial structure in the detector or illumination, i.e. imaging with just a single-point sensor, would require solving a very strongly ill-posed inverse retrieval problem that to date has not been solved. Here we demonstrate a data-driven approach in which full 3D information is obtained with just a single-point, single-photon avalanche diode that records the arrival time of photons reflected from a scene that is illuminated with short pulses of light. Imaging with single-point time-of-flight (temporal) data opens new routes in terms of speed, size, and functionality. As an example, we show how the training based on an optical time-of-flight camera enables a compact radio-frequency impulse RADAR transceiver to provide 3D images.

Introduction. The most common approach to image formation is obvious and intuitive: a light source illuminates the scene and the back-reflected light imaged with lenses onto a detector array (a camera). A second paradigm, single-pixel imaging, relies instead on the use of a single pixel for light detection, while the structure is placed in the illumination by either scanning the scene point by point or through more advanced approaches in which selected patterns are projected or collected from the scene [1][2][3][4][5][6]. Three-dimensional (3D) imaging can be obtained with these approaches where the additional depth information can be collected based on stereovision, holographic or time-of-flight techniques [7][8][9][10]. In timeof-flight approaches the depth information is estimated by measuring the time needed by light to travel from the scene to the sensor [11]. Many recent imaging approaches also rely on computational techniques for enhancing imaging capabilities ranging from sparse-photon imaging [12][13][14] to non-lineof-sight imaging [15][16][17][18][19][20]. Among the various possible computational imaging algorithms [21], machine learning [22] and in particular deep learning [23], is employed to enhance imaging capabilities by providing a statistical or data-driven model for enhanced image retrieval [24]. State-of-the-art deep learning techniques have been applied in computational imaging problems such as 3D microscopy [25,26], super-resolution imaging [27,28], phase recovery [29,30], lensless imaging [31,32] and imaging through complex media [33][34][35][36][37][38]. Regardless of the wide variety of different methods, these imaging approaches use either a detector array or scanning/structured illumination to retrieve the transverse spatial information from the scene. The requirement for either structured illumination or arrays of pixels in the detection is clear if we consider the inverse problem that needs to be solved if data is collected only from a single point, non-scanning detector, and with no structure in the illumination: there are infinite possible scenes that could give the same measurement on the single point sensor thus rendering the inverse problem very strongly illposed.
Here, we introduce a spatial imaging modality based on single-point, single-photon time-resolving detectors. In our approach, the scene is flood illuminated with a pulsed laser and the return light is focused and collected with a single-point single-photon avalanche diode (SPAD) detector, which records only the arrival time of the return photons from the whole scene in the form of a temporal histogram. During the measurement no spatial structure is imprinted at any stage either on the detector or the illumination source. Then, a data-driven artificial neural network (ANN) reconstructs the 3D scene from a single arrival-time histogram. We demonstrate the 3D imaging of different objects, including humans, with a resolution sufficient to capture scene details and up to a depth of 4 m. This approach only requires arrival-time measurements and therefore lends itself to cross-modality training whereby training based on ground truth from an optical system can be applied to data from a completely different platform. As an example, we show that a single radio-frequency (RF) impulse RADAR transducer together with our ML algorithm is enough to retrieve 3D images. Single-point 3D imaging approach. Previous work has shown that, in addition to the object-sensor distance, the 3D profile of objects manifests through a particular temporal footprint that makes them classifiable even in cluttered environments [39]. The goal is to extend this concept to full 2D or 3D imaging using only photon arrival-time from the scene. It is simple to construct a forward model for this, where all points in the scene that are at some distance, r i = (x i , y i , z i ), 3D imaging with single-point time-resolving sensors. (a) A pulsed laser beam flash-illuminates the scene and the reflected light is collected with a single-point sensor (in our case, a single-photon avalanche diode, SPAD) that provides a temporal histogram via time-correlated-single-photon counting (TCSPC) (b). The temporal histograms, together with the corresponding 3D images obtained from a time-offlight (ToF) camera, are used to train the image retrieval algorithm, an ANN. (c) After the algorithm is trained, it is fed with single-point temporal histograms to retrieve 3D images in a single recording.
from the detector provide a related photon arrival time, . By recording the number of photons arriving at different times t, we can build up a temporal histogram that contains information about the scene in 3D. The inverse problem we wish to solve requires using a measured temporal trace to reconstruct the scene. However, all points placed within a spherical dome represented by the equation (ct i ) 2 = x 2 + y 2 + z 2 will have the exact same arrival time t i at the detector and, as a consequence, they will contribute with equal probability to the same time bin on the histogram. For this reason, if no prior information on the scene is available, it is not possible to obtain a unique solution for the inverse problem. In other words, a single temporal histogram is not enough, in principle, to retrieve meaningful shape or depth information that resembles the actual scene. This problem arises due to a lack of additional information or constraints on the scene (that is usually provided by using multiple light sources or detectors/pixels). Qualitatively speaking, in our approach, this additional information is provided through the initial provision of datasets that are used to form a constrained model of the scenes to be imaged.
In more detail, the 3D imaging approach is depicted in Fig. 1 and consists of three main elements: i) a pulsed light source, ii) a single-point, single-photon timeresolving sensor, and iii) an image retrieval algorithm. The scene is flood-illuminated with the pulsed source and the resulting back-scattered photons are collected by the sensor. We use a single-point SPAD detector operated together with time-correlated single-photon counting (TC-SPC) electronics in order to retrieve the arrival time of the scattered photons from the scene and form a temporal histogram [ Fig. 1(b)]. Objects placed at different posi-tions within the scene and objects with different shapes provide different distributions of arrival times at the sensor [39]. The histogram, h, measured by the single-point sensor can be mathematically described as h = F(S), where S = S(r) represents the distribution of objects within the scene. The problem can then be stated as the search for the function F −1 that maps the temporal histograms onto the scene. We adopt a supervised training approach (see Supplementary Material for details) for the ANN by collecting, a series of temporal histograms corresponding to different scenes, together with the corresponding ground-truth 3D images collected with a commercial time-of-flight camera. The ANN is then trained to find an approximate solution for the mapping F −1 and is finally used to reconstruct 3D images only from timeresolved measurements of new scenes that have not been seen during the training process. Results. To evaluate the validity of our approach, we first analyzed its performance with numerical simulations. We consider human-like objects with different poses, moving within a scene of 20 m 3 , which is represented as a color-encoded depth image, as shown in Fig. 2(b). We assume that we uniformly flash-illuminate the scene with a pulsed light source (with duration that is much shorter than all other timescales in the problem) and then calculate the arrival time of photons from every point of the scene. Simulating different scenes allows us to obtain multiple 3D images-temporal histograms pairs that are used to train an ANN that we use as the image retrieval algorithm (details about the structure of the algorithm and training parameters can be found in Supplementary Material). Typical scenes consist of a static background with moving human figures in different poses, as shown in the Supplementary Material. After training the ANN, single temporal histograms are tested to reconstruct the related 3D scenes. In order to evaluate the potential performance in idealised conditions, for these simulations we assumed that the time bin width ∆t = 2.3 ps is also the actual temporal resolution (impulse response function, IRF) of the full system. The minimum resolvable transverse feature size or lateral object separation δ that can be distinguished with our technique depends on both the IRF, ∆t, and the distance from the sensor, d: where c is the speed of light. In the depth direction, the spatial resolving power is determined only by the time-offlight resolution (as in standard LIDAR), i.e. δ z = c∆t. At a distance of 4 m from the detector, for ∆t = 2.3 ps we can expect a transverse image resolution of 7 cm, which will degrade to 77 cm for ∆t = 250 ps. The impact of the latter realistic time-response will be shown in the following experimental results. Figure 2(a) shows one example of a temporal histogram constructed from the scene in Fig. 2(b). Figure 2(c) shows the scene reconstructed using the numerically trained ANN and highlights the relatively precise rendition of both depth and transverse details in the scene. After numerically demonstrating the concept of 3D imaging with single-point time-resolving detectors, we test its applicability in an experimental environment. We flood-illuminate the scene with a pulsed laser source at (550 ± 25) nm, with pulse width of τ = 75 ps. Our scenes are formed by a variety of fixed background objects, up to a depth of 4 m (the maximum distacne allowed by our ToF camera), while different additional objects (people, large-scale objects) are moving dynamically around the scene. The 3D scene is recorded by means of a time-offlight (ToF) camera, although we note that any other 3D imaging system, such as LiDAR, stereoimaging, or holography devices could be used for collecting the ground truth data for the training process. The ToF camera is synchronized with a SPAD detector equipped with TC-SPC electronics that provides temporal histograms from the back-reflected light that is collected from the whole scene with an angular aperture of ∼ 30 • and an IRF ∆t = 250 ps (measured with a small 2 cm mirror in the scene). In Fig. 3 the first column shows examples of recorded temporal histograms, the second column shows the images reconstructed from these temporal histograms, and the last column shows for comparison, the ground truth images obtained with the ToF camera. Full movies with continuous movement of the people and objects within the scenes, acquired at 10 fps, are shown in Supplementary Video 1. As can be seen, even with the relatively long IRF of our system, it is possible to retrieve the full 3D distribution of the moving people and objects within the scene from the single-point, temporal histograms. Compared to the numerical results, the larger IRF leads to the loss of some details in the shapes, such as arms or legs that are not fully recovered. We can see these limitations for example in the reconstruction of the letter 'T', row (d) of Fig. 3 (with dimensions 39 × 51 cm 2 ) and especially in the Supplementary Video 1, where the algorithm is able to detect the object but struggles to obtain the correct shape. This lateral resolution power [Eq. (1)] would be improved for example to 25 cm with ∆t = 25 ps and increases with a square-root law with distance implying a relatively slow deterioration versus distance (for example, resolution would be 50 cm at 20 m distance). Specific shape information is retrieved from all features in the scene, both dynamic (e.g. moving people) and static (e.g. objects in the background). This can be seen in Fig. 3(a) and Fig. 3(b), where both temporal histograms have peaks that are placed at similar positions, yet the reconstructed scenes are different. On the one hand, in Fig. 3(a) the ANN recognizes the box at the right of the image (corresponding to the peak 2 in the histogram) as a static background object that was present in all of the training data, while the person (peak 1) is identified as a dynamic object with a certain shape given by the peak structure. In contrast, the ANN recognizes both temporal peaks 1 and 2 in Fig. 3(b) as people moving dynamically through the scene (as these were not constant/static in the training data). This example highlights the role of the background, which turns out to be key also in removing ambiguities that would arise in the presence of a single isolated object moving in front of a uniform background (for instance, a single isolated histogram peak could be interpreted as a person placed either to the left or to the right of the scene -see Supplementary Material for details). We further analyze the impact of the IRF on the image quality reconstruction by repeating the reconstruction with temporal histograms that are convolved with Gaussian point-spread functions with different temporal widths. The results (Supplementary Material), show that although longer IRFs degrade image reconstruction, the shape and 3D location of people in the scene are still easily recognisable even when using an IRF of 1 ns (corresponding nominally to a lateral spatial resolution of ≈  collected from the scene. The experiments were carried out with a static background and moving objects. This is indeed a typical imaging situation e.g. surveillance applications in a static or slowly changing environment. Similarly, background objects will appear static if they change at a slower rate (and/or are at a larger distance) with the respect to the dynamic elements of the scene or slower than the acquisition rate of the sensor. This approach to 3D imaging could provide a series of advantages such as very fast frame rates (kHz or even MHz) as a result of the streamlined data and fast operation speeds allowed by single pixel detectors. Moreover, the same concept can be transferred to any device that is capable of probing a scene with short pulses and precisely measuring the return "echo", for example RADAR and acoustic distance sensors. To test our approach under the best ideal conditions, we performed numerical simulations where we generate synthetic scenes containing flat human-like silhouettes indifferent poses, as shown in Fig. 4. We perform data augmentation of this data-set, originally consisting of 10 different humans and poses, by mirroring each image in the horizontal direction, and by also translating the humans around the scene in all directions (x, y, z). This allows to augment the data set from the 10 original images to 4000 images used to train our neural network. In order to simulate the expected scaling factor of objects when increasing its distance from the observation point, we apply a size scaling factor to the image, depending on its coordinates (x i , y i , z i ): where d = (x 2 i + y 2 i + z 2 i ). The field of view considered in our simulations forms we have assumed a viewing angle of 52 • from the virtual 3D camera. Each human silhouettes is translated to 200 different positions across the field of view of the scene: 10 different depths z and 20 in x. This, together with the horizontal mirroring of the individual, give us a total data set of 400 scenes per figure (4000 in total).
We then assume that we illuminate the scene with a pulsed laser and estimate the arrival time of photons coming from each pixel of each image of the data set. For a given voxel in the image corresponding to a spatial position r i = (x i , y i , z i ), its time of flight is t i = (2c) −1 x 2 i + y 2 i + z 2 i . From this information, we transform our data set into 3D images, where the value of every pixel encodes the time of flight, i.e. we generate a color-encoded depth 3D image, as shown in Fig. 5. As we assume uniform reflectivity for all objects, the number of photons n i back-reflected from every point on the scene is inversely proportional to the voxel distance to the origin, i.e. n i ∝ P 0 / r i 4 , where P 0 is the laser power. This information allows us to create a temporal histogram from the scene indicating how many photons arrive at a certain time to the virtual detector, see Fig. 5. We thus generate our 3D image-temporal histogram pairs that are required for training of the artificial neural network (ANN).
In a general scenario, it is very unlikely to find such clean backgrounds as the ones observed in Fig. 5, where only the human individuals appear in scene. Therefore, to make our simulations more realistic we also added some objects at the background to mimic, for instance, what one would expect when walking into a room. the temporal histograms corresponding to their respective ground-truth scenes, i.e. Figs. 6(d) and (f). The predicted 3D scenes from our image retrieval algorithm are shown in Figures 6(c) and (d). Note that in the absence of background objects (left-hand-side block), the algorithm struggles to reconstruct the 3D scene properly as the number and different shape of objects compatible with a particular temporal histogram is high. In particular, there is an evident left-right symmetry due to the fact that any object and its mirror image are both compatible with the same temporal trace. This symmetry is removed when including background objects (right-handside block), as the human silhouette will both add varying signals to the temporal for different positions, while subtracting signal from the corresponding background peaks observed when the silhouette is not present. Given that the subtracted signal data depends on the exact shape and location of the silhouette, the background plays a crucial role in solving the otherwise ill-posed problem and provides indirect information of the silhouette's absolute position that is learned by the ANN during the training process. For the simulations shown in the main document, we used the background corresponding to Figs. 6(b), (e), and (f).

B. Image retrieval algorithm
Given a histogram h = F(S) recorded by the singlepoint time-resolving detector from the scene S = S(r), the task of our algorithm is to find the the function F −1 that maps h onto S. A way to overcome this highly unconstrained inverse problem is by using prior information on the objects in the 3D scenes. To this end, we make use of a supervised training approach where we train an artificial neural network with pairs of temporal histograms and 3D images. Both the temporal histogram and the 3D images are treated as vectors with corresponding dimensions 1 × 8000 and 64 × 64 = 1 × 4096 for the simulated data and 1 × 1800 and 64 × 64 = 1 × 4096 for the experimental data (the details of the experiment are given in the next section).
We implemented a multilayer perceptron (MLP) [40]. Perceptrons are mathematical models that produce a single output from a linear combination of multiple weighted real-valued inputs after passing through a nonlinear activation function. Mathematically, this operation can be expressed as: where y is the output vector, x is the input vector, w is the weight matrix, and b is the bias used to shift the threshold of the activation function φ (a tanh in our case). Our MLP, summarized in Fig. 7, is composed of five layers: an input layer (temporal traces, having 8000 or 1800 nodes), 3 hidden fully-connected layer (with corresponding number of nodes: 1024, 512, 256), and an output layer (3D images, 4096 nodes). By training the MLP on  Examples of 3D images (depth is encoded in color) (left column) and corresponding temporal histograms (right column) from some of the data used to train the algorithm.
sets of temporal histograms -3D images pairs we learn an approximate function for the the mapping F −1 which allow us to reconstruct 3D images from a single temporal histogram.
Examples of the type of data used to train the algorithm are given in Figs. 2 and 4 in the main document. For the numerical/experimental results, 1800/9000 and 200/1000 pairs are used for training and testing, respectively, whilst a 7% of the training data was used for validating. During experiments, the total 10000 ToF-image -temporal-histogram pairs were obtained within a single data acquisition sequence. However, we note that the data used for testing was never used during training and that this data also belongs to different periods of the data acquisition process during experiments. We also note that the only processing carried out on our data was a normalisation to the range [0, 1] before passing through the algorithm. There is certainly room for future development in terms of data processing, e.g. time-gating or time-correlation between sequences, to improve the imaging capabilities of the algorithm. During learning, the cost function minimised for each image pixel, i, is the mean square error, M SE [41]: where s i and y i are the measured and predicted values of the depth of the pixel i in the scene S, respectively. The MLP is implemented in TensorFlow [42] using Keras Sketch of the artificial neural network used: a multi-layer perceptron consisting of one input layer (with 8000 nodes), one output layer (with 4096 nodes), and three hidden fully-connected layers (FCL, with 1024, 512, and 256 nodes respectively). After each layer hidden layer we apply a tanh activation function (not shown in the figure). The loss used is the mean square error.
[41] and training is performed by the Adam optimizer on a desktop computer equipped with a Intel Core i7 Eight Core Processor i7-7820X at 3.6 GHz and a NVIDIA GeForce RTX 2080 Ti with 11 Gb of memory. The training time depends on the number of images used, the batch size (number of images taken for each iteration of the training algorithm, 64 in our case), and the number of epochs (200), and requires 26 min in total. After training the algorithm, this can recover a 3D image from a single temporal histogram in 30 µs on a standard laptop, therefore implying a limit (with this hardware layout) to the maximum frame rate for the imaging approach shown here at 33 kHz. A laser beam at (550 ± 50) nm delivering pulses with τ = 75 ps, at a power of 250 mW is expanded with a 10x microscope objective to flood-illuminate the scene, providing an opening angle of 30 • . A pair of lenses used as 4x beam expander are used to fill the back aperture of the microscope objective, while two IR mirrors are used to filter out the laser pump. The back-scattered light from the scene is collected with a lens with focal 60 mm and a 40x objective that concentrates light onto the single-pixel SPAD, that retrieves temporal histograms via TCSPC. In parallel, a ToF camera grabs 3D images of the scene that are used as ground truth during training.

C. Experiment details
In a first step we collect pairs of co-registered temporal histograms and ToF camera measurements. We used a supercontinuum laser source (NKT SuperK EXTREME) delivering pulses with 75 ps pulse duration and average power of 250 mW after a 50 nm band-pass filter centered at 550 nm. A 10x microscope objective (Olympus plan achromat) with NA = 0.25 and WD = 10.6 mm is used to flood-illuminate the scene with an opening angle of ≈ 30 • . This opening angle ensures an illumination circle with diameter of 2.15 m at 4 m distance. The laser beam is expanded providing a 4x magnification that fills the back aperture of the objective and the IR radiation from the laser pump was filtered with two IR mirrors. Light scattered by objects placed inside the illumination cone is collected by a second microscope objective (40 ×, Nikon plan fluor, NA = 0.75, WD = 0.66 mm) and focused on a 50 × 50 µm 2 area SPAD sensor that retrieves the temporal trace of the scattered light by means of TCSPC electronics [43]. During the training process, we use a ToF camera (flexx PMD Technologies) with a depth range of 0.1 − 4 m and a maximum resolution of 224 × 171 pixels that is triggered via software together with the timerecording electronics. The ToF images are cropped and down-sampled to a resolution of 64 × 64 for training the ANN. After the ML algorithm is trained, it is tested with new data taken from new scenes and hence new temporal histograms from the scene.
In our proof-of-principle experiments we trained the system with two different individuals (either separately or both present at the same time) moving around the scene, and also with non-human shapes (such as the letter 'T' and the square shown in the main document) that we moved through the scene, also changing their orientations. To provide the algorithm with more variability, we also changed the position of some items in the background of the scene during the data acquisition process.
The temporal impulse response function (IRF) of our system depends on the time resolving electronics and the pulse length. We directly measured the IRF by placing a mirror with diameter of 25.4 mm reflecting light back to the SPAD sensor. The IRF of the system is taken as the full width at half maximum of the time histogram, which was measured to be 250 ps. In all our experiments we interfaced the SPAD sensor and TCSPC electronics, RADAR chip, and ToF camera via MATLAB. This allowed us to record temporal histograms at a rate of 10 Hz and 1 Hz for the SPAD and RADAR chip, respectively.
For the cross-modality RADAR imaging, we used an ultra wide band impulse RADAR chip (Novelda XeThru X4). The chip has a single transmitter and receiver channel, operates at (7.29 ± 1.4) GHz, has a pulse duration of 670 ps, and a sampling rate of 23 × 10 9 samples/s.

D. Impact of the impulse response function
To investigate the impact of having a broad IRF we performed simulations by convolving the numerically generated temporal histograms, h, with different IRF, G, thus generating new histogramsĥ = h G. The TIRF can be characterized by a Gaussian function G = exp −t 2 /∆t 2 , where ∆t gives characteristic system impulse response. The new histogramsĥ are then paired with their corresponding ToF images and used to re-train the ANN. In Fig. 9 we summarize our results. Fig. 9(a) is the original histogram h obtained from the scene shown in Fig. 9(b). Fig. 9(c)-(f) are the different IRF with [2.3, 25, 250, 1000] ps of pulse width, Fig. 9(g)-(j) are the corresponding histogramsĥ, while Fig. 9(k)-(n) are the 3D image reconstructions obtained from each histogram h. As we found in our experimental results, shorter IRFs improve the reconstruction of limbs and fine details of the objects.

E. Minimum resolvable transverse feature
We consider the situation depicted in Fig. 10. The scene is flash illuminated with a pulsed laser and the back-scattered photons are collected by a single-point time-resolving sensor. We are interested to address the problem of resolving spatially two points A and B in space whose difference in time of flight is given by the IRF of the system, i.e ∆t. These two points are separated by a transverse distance δ and we consider that one of them is co-axial with the laser and sensor. Given the geometry shown Fig. 10, A and B form a triangle rectangle with the laser/sensor. As a consequence, by simple geometry, one obtains that: Eq. (5) states that the minimum lateral resolvable distance δ depends not only on the IRF but also on the distance to the detector, following a square root law.
F. Cross-modality optical 3D imaging using a RADAR chip Any sensing system that can map the position of objects within a scene into a temporal histogram can be used to obtain a 3D image from the scene. This opens new routes in imaging, for instance by using acoustic or radio waves. To test the feasibility of this concept, we performed new experiments with an impulse RADAR transceiver that emitted and collected pulsed electromagnetic radiation at 7.29 GHz (Novelda XeThru X4). The RADAR transceiver had a bandwidth of 1.4 GHz and a pulse duration of 670 ps, which corresponds to a depthspatial resolution of cτ = 20 cm and, according to the model presented above, to a transverse spatial resolution of 90 cm at 2 m distance. Following the previously discussed procedure for the optical laser pulses, we gathered new data combining the impulse RADAR with the (optical) ToF camera and re-trained the ANN. In this case, only 3000 ToF images -temporal histogram pairs were used for training, which, together with the low temporal resolution of the RADAR chip, leads to a poorer image reconstruction compared to the one obtained with laser pulses (see Fig. 11. However, we are still able to retrieve the general properties of the individual, e.g. their size, and position on the scene, see also Supplementary Video 2.

G. Pseudo-codes
In this section we provide the pseudo-codes required to simulate the ToF data and to train the ANN. scale I i by a factor 1: S = 2/z j 8: create a new image: 1:Î i = I i (x j , y j , z j ) scaled and 2: displaced 9: for all pixels p k = (px k , py k ) 1: withinÎ i do 10: calculate the distance d k 12: from sensor placed at 12: d k = (x k − x0) 2 + (y k − y0) 2 + (z k − z0) 2 12: calculate time of flight 12: from distance d k : 12: t k = (2c) −1 d k 13: estimate number of 12: photons n k from pixel p k 12: according to d k : 12: n k = P0/d 4 k 14: save d k , n k into a variable 12: fi, k = (d k , n k ) 15: obtain histogram hi from fi 16: create ToF image by replacing 12: value of p k with t k 17: down-sample the ToF image 12: to the desired resolution for 12: training the ANN later on 2. Artificial neural network