Single-Pixel Image Reconstruction from Experimental Data Using Neural Networks

: Single-pixel cameras that measure image coeﬃcients have various promising applications, in particular for hyper-spectral imaging. Here, we investigate deep neural networks that when fed with experimental data, can output high-quality images in real time. Assuming that the measurements are corrupted by mixed Poisson-Gaussian noise, we propose to map the raw data from the measurement domain to the image domain based on a Tikhonov regularization. This step can be implemented as the ﬁrst layer of a deep neural network, followed by any architecture of layers that acts in the image domain. We also describe a framework for training the network in the presence of noise. In particular, our approach includes an estimation of the image intensity and experimental parameters, together with a normalization scheme that allows varying noise levels to be handled during training and testing. Finally, we present results from simulations and experimental acquisitions with varying noise levels. Our approach yields images with improved peak signal-to-noise ratios, even for noise levels that were foreseen during the training of the networks, which makes the approach particularly suitable to deal with experimental data. Furthermore, while this approach focuses on single-pixel imaging, it can be adapted for other computational optics problems.


Introduction
Single-pixel imaging is an extreme configuration of computational optics, where a single point detector is used to recover an image [1]. Since the seminal work by Duarte and coworkers [2], single-pixel imaging has been successfully applied to fluorescence microscopy [3], hyperspectral imaging [4,5], diffuse optical tomography [6], image-guided surgery [7], short-wave infrared imaging [8], and imaging through scattering media [9]. Single-pixel measurements can be modeled as dot products between an image and some two-dimensional functions that are implemented through a spatial light modulator [1]. To limit acquisition times, it is highly desirable to reduce the number of light patterns, which leads to an undetermined inverse problem to recover the image.
In the field of single-pixel imaging, deep learning has been used to unmix the fluorescence intensity and lifetime from time-resolved measurements [10,11]. Higham and coworkers [12] proposed a convolutional auto-encoder for single-pixel image reconstruction imaging that outperformed compressed sensing approaches. This network directly maps the measurement vector to the desired image, using a fully connected layer followed by convolutional layers. Several Deep Learning architectures have been proposed since then, to solve single-pixel reconstruction problems. For instance, Li et al. in [9] used a similar network to that introduced by [12]. For measurements with very low signal-to-noise ratios, in [13] the authors proposed a U-net for highly compressed single-pixel Fourier imaging, while in [14], a recurrent neural network was used. Further, Li and coworkers [15] investigated a conditional generative adversarial network to reconstruct highly compressed data from measurements with random binary patterns.
In the present work, we propose a new deep learning methodology for image reconstruction from single-pixel measurements corrupted by a mixed Poisson-Gaussian noise model. Traditionally, inverse problems with data corrupted by Poisson noise are tackled using variance stabilizing transforms, such as the Anscombe transform [16], followed by a Wiener filter [17]. However, the resulting image can be blurred, and in particular for undersampled data. More recent alternatives have been used to solve this issue by exploiting statistical or handmade image priors [18][19][20].
Solving the resulting problems is usually prohibitive for real-time applications, as a single reconstructed image can take several seconds. In this regard, deep networks are ideal candidates due to their rapid evaluation. As such, deep learning has been used for single-pixel imaging in the presence of Poisson noise [21], although no special modifications were made to the neural network to allow for the Poisson noise. Many studies have explored many different neural network architectures; however, they usually consider convolutional layers that act in the image domain, while the raw data are in the measurement domain. Moreover, the interpretation of the output of a neural network is still an open question, specifically in the presence of noise where robustness issues can occur, as indicated by [22]. In particular, in the domain of single-pixel imaging, a neural network that is not finely tuned to the system can be outperformed by linear reconstructors [23].

Contribution
First, we explore the best way to map the raw data to the image domain, before applying a cascade of convolutional layers. This mapping is traditionally learned or computed through the Moore-Penrose pseudo-inverse. We describe a linear mapping that can be interpreted in terms of traditional image reconstruction. This linear mapping is implemented as the first layer of a (nonlinear) deep neural network. We introduced this idea in [24] for noiseless data, and here we extend it to noisy data. In particular, we propose to estimate the image intensity, which is unknown in practice, and to use this estimation to approximate the covariance matrix of the measurements.
Next, we describe a machinery that allows a network to be trained in the presence of Poisson noise. We provide a normalization scheme that allows raw data with different orders of magnitudes to be considered, which is mandatory for realistic scenarios where the image intensity is not known.
Finally, we validate our approach by reconstruction of an experimental dataset that we acquired with varying integration times and light fluxes. Upon acceptance, the datasets will be made available alongside implementations of our reconstruction methods in the Python toolbox (SPyRiT [25]).

Single-pixel imaging
Let ∈ [0, 1] be the image to acquire. The main idea of compressive optics is to measure = 1 using hardware, and to recover using software. The system matrix 1 ∈ R × , with < , collects the patterns that are sequentially uploaded on a spatial light modulator, to obtain the measurement vector. The patterns are traditionally chosen from within a basis ∈ R × ; i.e., 1 = where = [ , 0] describes the subsampling strategy. Classical choices for the basis include Fourier, discrete cosines, wavelets, and Hadamard basis, as discussed in [26]. In the present study, we consider the Hadamard basis. The subsampling strategies can be divided into adaptive and nonadaptive. Adaptive strategies progressively determine during the acquisition [27], while nonadaptive chose beforehand. Here, for simplicity, we consider the nonadaptive strategy introduced in [12, 28].

Acquisition model
As Hadamard patterns contain negative values, we adopt a differential strategy [29]. We denote bym + the measurements of the positive parts of the patterns, and bym − the measurements of the negative parts. We model noise as a mixture of Poisson and Gaussian distributions [30,31]. The Poisson noise is signal dependant and originates from the discrete nature of the electronic charge, while the signal-independent Gaussian noise accounts for readout and circuit fluctuations. For both the positive and negative measurements, we havê m +,− ∼ P ( where P and N are the Poisson and Gaussian distributions, is a constant that represents the overall system gain (in counts/electron), is the intensity (in photons) of the image (which is proportional to the integration time), dark is the dark current (in counts), and dark is the dark noise (in counts). We further hypothesize that dark and dark are independent of the image intensity .
The normalized measurements m are finally defined as

Deep learning for image reconstruction.
Deep-learning-based methods reconstruct an image˜ using nonlinear mapping˜ = G ( ) where the weights of the network are optimized with respect to a loss functions; e.g., to minimize the quadratic error over an image database where ( ) 0≤ ≤ −1 are the image samples of the image database, and ( ) 0≤ ≤ −1 are the corresponding normalized noisy measurements given by Equation (2). Traditionally, neural networks are made of cascading layers, and can be written as where G ℓ , 1 ≤ ℓ ≤ is the ℓ-th (nonlinear) layer of the network, and • is the function composition. The first layer is generally a fully connected layer that maps the normalized measurements ∈ R to a raw solution in * ∈ R . In Section 3.2, we explore different strategies for the design of this layer.

Experimental set-up
Our measurements are obtained from the single-pixel camera experimental set-up depicted in Fig. 1, as first described by [32]. The telecentric lens (Edmund Optics 62901) is positioned such that its image side projects the image of the scene onto the digital micro-mirror device (DMD; vialux V-7001), which is positioned at the object side of the lens. The object is transparent and is illuminated by a LED lamp (Thorlabs LIUCWHA/M00441662). The DMD can implement different light patterns (denoted as 1 in Section 2) by reflection of the incident light onto a relay lens, which projects the light into an optical fiber (Thorlabs FT1500UMT 0.39NA). This optical fiber is connected to a compact spectrometer (BWTek exemplar BRC115P-V-ST1). For every object, we sequentially upload onto the DMD all of the = 4096 Hadamard patterns of dimension = 64 × 64 pixels. We can down-sample the full measurement vector a posteriori to achieve any sampling ratio. To consider different noise levels, we acquire the same object with varying integration times and neutral optical densities.

Mapping of the raw data to the image domain
We propose to use a Tikhonov-regularized interpretable solution [33] to map the raw data into the image domain. In particular, we choosẽ where the regularized data * is such that * = [ * 1 , * 2 ] , where * 1 ∈ R and * 2 ∈ R ( − ) are regularized versions of the acquired and missing coefficients, respectively.
Let be the variance of our noisy measurements, and are respectively the mean and variance of our prior model in the Hadamard domain. The computation of and is described in Section 3.3, and that of in Section 3.4. We derived the analytical Tikhonov regularized solution given by * are the blocks of the covariance in the measurement domain, 1 ∈ R and 2 ∈ R ( − ) are the blocks of (see Section 3.3). Interestingly, Equation (6a) can be interpreted as the denoising of the raw data in the measurement domain, while Equation (6b) is the estimation of the missing coefficients from the denoised acquired coefficients.
To circumvent the difficulty of inverting the signal-dependant matrix in Equation (6a), we choose to neglect the nondiagonal terms of 1 ;Denoting 2 1 = diag ( 1 ), we get * where division and multiplication apply element-wise.  (6). These two layers can be interpreted as a layer that denoises Equation (6a) of the raw measurements, followed by a layer that estimates the missing coefficients from the denoised measurements of Equation (6b). The raw image˜ is corrected by a cascade of Image-domain convolution layers (CL) and non-linear layers, such as the ReLU layers.

Prior mean and covariance
We compute the mean and covariance as introduced in Equation (6) from the the samples of an image database. We consider the classical estimators of the mean and covariance Note that both quantities are computed once and for all. We load them into the graphical processor unit when the network is initialized, with no need to re-compute them later (e.g., during training or evaluation).

Noise covariance estimation
To use the mapping proposed in Section 3.2, we need an estimation of the covariance matrix of our measurements. Assuming that the raw measurements are independent, from Equation (2) we can determine the values of the covariance as where Diag ( ) refers to the diagonal matrix where the diagonal coefficients are the elements of , and refers to the identity matrix. As 2 depends on the unknown image as well as on the intensity , we exploit the raw data that also depends on and . Recalling that the variance of a Poisson variable is the same as its expected value, we approximate the expected value by the noisy sample; i.e., The experimental parameters dark , and dark can be estimated as described in [31]. We describe how to estimate in Section 3.5.

Estimation of the image intensity
For image denoising, estimation of can be achieved through fitting methods (e.g., see [30]). These methods usually exploit homogeneous regions of the image, and are therefore not suited to raw measurements. Instead, we propose a simple empirical method, which consists of estimating from a rough estimate of the nonnormalized image . Considering the pseudo inverse, which is a linear operator that scales with , we consider In practice, this simple estimator can provide the order of magnitude of with good approximation, but it obviously leads to some inaccuracies. This is illustrated in Fig. 3, where we show the distribution of the error of our estimation over the STL-10 dataset [34] (i.e., 113,000 images obtained by merging the unlabeled, training and testing sets) for different values of (512, 1024) and (5, 50, 2500 photons). Overall, the relative error is usually below 50%.

Training neural networks using simulated data
We train our network using the STL10 database [34], with = 105, 000 images that correspond to the 'unlabeled' and 'train' subsets. We consider 8,000 images for the testing (i.e., the 'test' subset). The original 96×96 images are resized to 64×64 using bicubic transform, and are normalized between −1 and 1.
We implement the full networks using Pytorch [35] (version 1.5.1; cuda V10.2.89). We train the network by solving Equation (3) using the ADAM optimizer [36], with an initial learning rate of 10 −3 , which is halved every 10 epochs, for a maximum of 100 epochs. In our experiments, the validation loss traditionally stops decreasing after 70 epochs. The training phase takes 3 h and 45 min on a NVIDIA GP107GLM [Quadro P1000 Mobile]. As seen in Section 3.5, it is not realistic to consider the intensity as a known constant. Therefore, we make vary during the training, with mean and standard deviation . Therefore, the network has to learn how to be robust to vary about . To fit with the observations made in Fig. 3, we choose = 0.5 to account for the worst-case scenarios. As shown in Section 5, varying alpha during the training phase, i.e., considering different noise levels, is crucial to get a robust network.
However, such an approach does not guarantee that a network trained around a given intensity can perform well for another intensity. A neural network trained for a certain probability distribution cannot generalize 'for free' to another probability distribution. We will further develop this point in Section 5.
Our experiments were done with the same Image domain layer structure as in [12]. This means that our network has three convolutional layers, with each layer separated by a ReLU layer. The first has a kernel size of 9 and a depth of 64, the second has a kernel size of 1 and a depth of 32, and the final one has a kernel size of 5 and a depth of 1. This structure can, however, be changed.

Experimental data
As shown in the first column of Fig. 4, we acquire three different objects: the LED lamp directly (first row); a cat from the STL-10 test set printed on a transparent sheet (middle row); and the Siemens Star resolution target (bottom row) [37]. The ground-truth image is computed (see Section 4.3) from a fully sampled measurement vector that is acquired with the highest signal-to-noise ratio (i.e., high flux illumination, long acquisition time). Specifically, we consider no neutral density, and the integration time to 1 ms per pattern for the lamp, to 4 ms per pattern for the STL-10 cat, and to 8 ms per pattern for the Siemens target, For each object, we also acquire a high-noise dataset by placing neutral optical density behind the lamp, to reduce the light flux. We consider optical densities of 1.3 for the LED lamp, 0.6 for the STL-10 cat, and 0.3 for the Siemens target. We set the integration time to 4 ms for both the LED lamp and the STL-10 cat, and to 8 ms for the Siemens target, we retain only = 512 measurements for both the LED lamp and the STL-10 cat, which accelerates the acquisition by a factor of 8. For the Siemens target, which has a richer spatial frequency content, we keep more measurements ( = 2048; acceleration factor of 2).

Evaluation metrics
Given the ground-truth image , we compute the peak signal-to-noise ratio (PSNR) of a reconstructed image * as PSNR( * , ) = 10 log 10 2 2 * − 2 (13) For the experimental data, we have no direct access to the ground truth image. This image is computed as = gt , where gt is a fully sampled measurement with high signal-to-noise ratio. Before computing the PSNR according to Equation (13), we normalize the ground-truth image in the range [−1, 1].

Simulated results
Training for different levels of noise In Table 1, we evaluate the effects of training a network under varying noise levels. We also simulate the acquisition of the test images with the same source intensity ( = 50 photons) and for varying intensities (mean values = 50 photons and standard deviation = 25 photons). Training a network under different noise levels is detrimental when the image intensity is known; on average, it results in a PSNR drop of 23.19 − 22.13 = 1.06 dB.
On the contrary, noise-varying training improves the reconstruction when the image intensity is unknown; on average, we obtain an enhancement of 21.07 − 19.33 = 1.74 dB (second row of Table 1).
This observation is crucial, as the exact image intensity is not available beforehand in real-life experiments. Although we can estimate the image intensity from the raw data (e.g., by the method introduced in Section 3.5), this inevitably leads to inaccuracies. In the following, we only consider realistic scenarios where the image intensity is not known, which requires training with varying noise levels. In Table 2, we report the reconstruction PSNRs obtained using the proposed network, and for a network with the same architecture where the mapping is learned ('Free Layer') for different levels of noise dictated by different image intensities ( = ∞ corresponds to the noiseless version of the proposed network). The standard deviation of the image intensity is set to 50%, in agreement with the findings of Section 3.5.
As expected, the network trained with no noise (i.e., = ∞) performs very poorly in the presence of noise. Therefore, we focus our analysis on the cases where our network is trained with noise. We divide our analysis into three cases, which depend on the relative noise levels used during training and testing.
Training noise and testing noise have the same levels In Table 2, the PSNRs of these experiments are shown in blue (i.e., diagonals). In all of our simulations, the proposed method outperforms the 'Free layer' network proposed in [12].
Training noise is higher than testing noise In Table 2, the PSNRs of these experiments are shown in red (below lower diagonal). We can observe that networks trained on high levels of noise (e.g., = 2 or 5 photons) with our proposed method perform poorly when they are tested on data with low levels of noise (e.g., = 50 or 2500 photons). For instance, a neural network trained with = 2 photons yields an average PSNR of 17.40 dB when it is tested for = 2500 photons. This is a PSNR drop of 4.77 dB compared to the optimal training conditions ( train = test = 2500 photons). We can also observe that in these cases the 'Free Layer' [12] outperforms the proposed method. We can see that in the most extreme cases where the training was with very high levels of noise, and the testing was with very low levels of noise, the 'Free Layer' network can increase the PSNR by 20.09-17.40 = 2.69 dB on average. Table 2, the PSNRs of these experiments are shown in green (above diagonals). We can see that the proposed network behaves similarly to the optimal training conditions. The worst drop in PSNR is relatively low, as 0.71 dB. This shows that the proposed network has high reconstruction quality provided that it is trained with relatively low levels of noise (e.g., = 50 or 2500 photons). Contrary to the previous scenario, the proposed method outperforms the 'Free Layer' network in all of these experiments. Here, the presence of the denoising layer has a great advantage; the proposed method generalizes better to high noise experiments.  Table 2. Reconstruction of peak signal-to-noise ratios (PSNRs) for different training strategies. 'Proposed' refers to the neural network method with the proposed mapping in Section 3. 'Free Layer' refers to the neural network method where the mapping of the raw measurements is learnt jointly with the postprocessing layers, as in [12]. Note that the network trained with no noise ( = ∞) corresponds to the case where we choose = 0. From top to bottom, image acquisition is simulated assuming increasing light intensity (i.e., decreasing noise levels). From left to right, images are reconstructed by networks that are trained using decreasing noise levels. Blue font, the testing and training noise levels are the same; green font, the testing noise is lower than the training noise; and red font, the testing noise is higher than the training noise. To facilitate the comparison between the two networks, we underline similar PSNRs (i.e., difference <0.1 dB) and use bold font to indicate the best performing networks (i.e., difference >0.1 dB). Fig. 4 illustrates the performance of our networks on experimental data acquired with the experimental set-up of a single-pixel camera presented in Section 3.1. We compare the performance of our proposed method with the result of the total variation regularized solution [38], the Tikhonov-regularized solution of Equation (6), the proposed method trained without noise ('Noiseless Net'), and with noise ('Proposed'), as well as the Tikhonov-regularized solution of Equation (6) to which we applied a state-of-the-art denoising method BM3D [39] ('Tikhonov+bm3d'), and a neural network reconstructor with the same architecture as the proposed method but where the mapping is learned jointly with the postprocessing layer ('Free Layer'), as in [12].

Experimental Data
First, we observe that the network trained with no noise performs very poorly on noisy data, and it is outperformed by the pseudo-inverse in the case of the LED lamp and the STL-10 cat. Visually, the reconstructed image retains many of the artifacts introduced by Poisson noise, which emphasizes the need for accounting for noise when training a neural network. Furthermore, we observe that learning the fully connected layer leads to many artifacts and a loss of detail (see the star sector target with the 'Free Layer'). This can be attributed to the learning of the mapping for specific noise values, instead of adapting to the noise, as in our proposed method. We also observe that the Tikhonov-regularized reconstruction is itself quite powerful. Even if it keeps some of the reconstruction artifacts, it compares very well to the results obtained using total variation regularization. In terms of PSNRs, Tikhonov-regularized reconstruction performs very similarly to the 'Free Layer' neural network, which is strong motivation to use it for mapping to the image domain.
Finally, we observe that the proposed network and 'Tikhonov+bm3d' offer the smoothest reconstructed images that are almost free of the artifacts introduced by down-sampling or Poisson noise. In terms of PSNRs, we observe that our method outperforms the others on experimental data, while being similar to 'Tikhonov+bm3d'. However, the proposed method is 1,000 times faster than the latter (0.2 seconds for BM3D versus 0.2 milliseconds for the proposed network). Moreover, 'Tikhonov+bm3d' is a two-step method where denoising is agnostic to reconstruction, while our method solves both problems at once, adapting denoising to the noise level. Although the PSNR gain is relatively low compared to the Tikhonov-regularized reconstruction, the proposed network leads to images that are well enhanced visually.

Discussion
The proposed mapping generalises much better to higher levels of noise than the learned mapping proposed in [12]. This is an advantage when dealing with experimental data in real time, as it can be time consuming to dynamically load different neural networks for different lighting conditions. In experimental scenarios, we do not know , and therefore our proposed mapping with a training value of = 2500 ± 1250 is the most suited for processing experimental data. Indeed, for all of the testing conditions, that neural network offers a very similar quality of reconstruction to the optimally trained neural network for that noise value.
Our method allows visually convincing reconstructions and good performance with respect to PSNR against several levels of noise, when trained for low levels of noise (high values of ). It therefore shows good proprieties for dealing with experimental data. Unlike previous single-pixel deep-learning studies, such as [12][13][14]38], we have showcased the robustness of our method to different lighting conditions, and given an interpretable meaning to the first layer of our neural network.
The noise level of the image under acquisition is a key feature in real-life experiments. This parameter can be estimated first, and then the network that fits the actual noise level can be evaluated. However, as the noise level is unknown, this requires the loading of several networks, which might be a severe limitation for real-time applications, and in particular if the models are too large to be all stored on the graphics processor unit. Our proposed method is robust to different levels of noise as long as it was trained with low levels of noise (high values of ). Therefore, the same network can be used almost optimally in a wide variety of situations. The Tikhonov-regularized mapping that we propose helps to denoise the raw data, to provide an approximate reconstruction to the convolutional layers that can focus on learning spatial features.
One limitation of our deep network compared to more classic approaches such as [18,20,40] is that there is no theoretical guarantee that it will work for any image; in particular, if the image under acquisition significantly differs from those of our training set. This is a common concern for deep-learning approaches that are, however, seen to work well in practice. While this tends to  Siemens Star resolution target with = 1024). We display the images reconstructed from a fully sampled dataset (ground-truth; GT) acquired with high image intensity (first column, = 148 photons, = 195 photons and = 295 photons for the LED lamp, STL-10 cat and Siemens Star resolution target, respectively) and lower image intensity ('Noisy GT' second column, = 9 photons, = 10 photons and = 25 photons for the LED lamp, STL-10 cat and Siemens Star resolution target, respectively). The following columns show reconstructions using the total variation regularized solution [38], the Tikhonov-regularized solution of Equation (7), the noiseless proposed network (Noiseless Net) Section 3.2 (network trained with no noise and where we assume = 0), a Deep neural network with the same architecture as our proposed method, where the mapping is learned as in [12] (Free Layer), the Tikhonov-regularized method combined with BM3D [39] denoising and the proposed network Section 3.2 (trained with = 50 photons and = 0.5 ). All of the PSNRs are computed as described in Section 4.3, with the first column as the ground-truth. be confirmed by our experimental results where our approach worked on the Siemens Star sector and on a LED light (both of which are very different from the images of the stl-10 database), there are no theoretical guarantees that this will always be the case.
Another limiting aspect of our study might be the choice of our architecture, which is shallower than popular architectures, such as the U-Net used in [41], and more recent variants. However, we are keen to keep the number of network parameters as low as possible, to keep both the training and evaluation times as short as possible. Another limitation of this study concerns the analysis of the PSNRs of the images reconstructed from the experimental data, where the ground truth is not known. We limit this common issue by acquiring fully sampled low-noise images. Finally, we only test our algorithms on × pixel images, with = 64. By considering a database with high-resolution images (e.g., ImageNet), our network can be generalized to handle the case where > 64, either directly or by implementing a patch-based strategy to limit the memory requirements.
Compared to most studies dedicated to deep learning for inverse problems, we mainly focus on the (fully connected) layers of the network, which acts in the measurement domain. The post-processing layers that act in the image domain can be replaced by any variants (e.g., U-Net, resNet, others). Moreover, our approach is compatible with architectures inspired by conventional variational methods (e.g., [42]). This will be the object of future work. Although this work focuses on single-pixel imaging, it can be used for any linear problem where the measured data is scarcely sampled in an orthogonal basis, just by setting to the corresponding basis. Finally, this network can easily be adapted to other noise models, as we only need an estimation of the mean and covariance of the noisy measurements.

Conclusion
We present a deep learning method to recover an image in single-pixel imaging by introducing mapping of the raw data to the image domain. This mapping is implemented as the first layer of a neural network, which provides an approximate reconstruction to a traditional cascade of convolutional layers. We also describe a framework for training a network using simulated data only. We describe how to exploit all of the experimental parameters to deal with experimental data, and also how to normalize the measurement, which is fundamental when a nonlinear reconstructor (e.g., neural network) is considered.
Although our network is trained using simulations only, it performs very well on experimental data, even when the noise level is unknown. The estimation of the noise covariance coupled with an appropriate training process appears to be efficient in a wide variety of scenarios. This is particularly interesting in real-time configurations where it is not possible to evaluate many different reconstructors. In future work, we will explore more complex interpretable network architectures.