Imaging through glass diffusers using densely connected convolutional networks

Computational imaging through scatter generally is accomplished by first characterizing the scattering medium so that its forward operator is obtained; and then imposing additional priors in the form of regularizers on the reconstruction functional so as to improve the condition of the originally ill-posed inverse problem. In the functional, the forward operator and regularizer must be entered explicitly or parametrically (e.g. scattering matrices and dictionaries, respectively.) However, the process of determining these representations is often incomplete, prone to errors, or infeasible. Recently, deep learning architectures have been proposed to instead learn both the forward operator and regularizer through examples. Here, we propose for the first time, to our knowledge, a convolutional neural network architecture called"IDiffNet"for the problem of imaging through diffuse media and demonstrate that IDiffNet has superior generalization capability through extensive tests with well-calibrated diffusers. We found that the Negative Pearson Correlation Coefficient loss function for training is more appropriate for spatially sparse objects and strong scattering conditions. Our results show that the convolutional architecture is robust to the choice of prior, as demonstrated by the use of multiple training and testing object databases, and capable of achieving higher space-bandwidth product reconstructions than previously reported.


INTRODUCTION
Imaging through random media [1, 2] remains one of the most useful as well as challenging topics in computational optics. This is because scattering impedes information extraction from the wavefront in two distinct, albeit related ways. First, light scattered at angles outside the system's Numerical Aperture is lost; second, the relative phases among spatial frequencies that pass are scrambled-convolved with the diffuser's own response. In most cases, the random medium is not known or it is unaffordable to characterize it completely. Even if the random medium and, hence, the convolution kernel are known entirely, deconvolution is highly ill-posed and prone to noise-induced artifacts.
Therefore, the strategy to recover the information, to the degree possible, must be two-pronged: first, to characterize the medium as well as possible so that at least errors in the deconvolution due to incomplete knowledge of the medium's response may be mitigated; and, second, to exploit additional a priori knowledge about the class of objects being imaged so that the inverse problem's solution space is reduced and spurious solutions are excluded. These two strategies are summarized by the well-known Tikhonov-Wiener optimization functional for solving inverse problems aŝ where H is the forward operator in the optical system g = H f , f is the unknown object, g is the raw intensity image (or images if some form of scanning is involved), Φ(.) expresses prior knowledge by penalizing unacceptable objects so the optimization is prohibited from landing onto them, α is the regularization parameter controlling the relative contribution of the two terms in the functional, andf is the estimate of the object. The forward operator H includes the effects of the scatterer, as well as of the optical system utilized in any particular situation. A number of ingenious strategies have been devised to design forward operators that improve the imaging problem's condition, most famously by using nonlinear optics [3,4] or stimulated emission [5]. Restricting oneself to linear optics, structured illumination [6][7][8] is an effective strategy which modulates object information onto better-behaved spatial frequencies.
Several approaches characterize the random medium efficiently. One method is to measure the transmission matrix of the medium by interferometry or wavefront sensing [9][10][11]. Alternatively, one may utilize the angular memory effect in speckle correlation [12][13][14][15]. The angular memory principle states that rotating the incident beam over small angles does not change the resulting speckle pattern but only translates it over a small distance [16,17]. In this case, computing the autocorrelation of the output intensity and deconvolving it by the autocorrelation function of the speckles, which is a sharply peaked function [18], results in the autocorrelation of the input field. Then, the object is recovered from its own autocorrelation using the Gerchberg-Saxton-Fienup (GSF) algorithm [19,20] with additional prior constraints.
Turning to the problem of determining Φ, during the past two decades thanks to efforts by Grenander [21], Candés [22], and Brady [23], the use of sparsity priors was popularized and proved to be effective in a number of contexts including random media. For example, Liu et al successfully recovered the 3D positions of multiple LEDs embedded in turbid scattering media by taking phase-space measurements and imposing the L1 sparsity prior [24].
Instead of establishing H and Φ independently and explicitly from measurements and prior knowledge, an alternative approach is to learn both operators simultaneously through examples of objects imaged through the random medium. To our knowledge, the first instance when this strategy was put forth was by Horisaki [25]. In that paper, a Support Vector Regression (SVR) learning architecture was used to learn the scatterer and the prior of faces being imaged through. The approach was effective in that the SVR learned correctly to reconstruct face objects; it also elucidated the generalization limitations of SVRs, which are shallow two-layer architectures, as for example when presented with non-face objects the SVR would still respond with one of its learned faces as a reconstruction. A deeper fullyconnected architecture in the same learning scheme has been proposed recently [26]. The Horisaki paper was the first, to our knowledge, to use machine learning in the computational imaging context; it certainly influenced our own work on lensless imaging [27] and other related works [28][29][30][31][32].
In this paper, we propose for the first time, to our knowledge, two innovations in the use of machine learning for imaging through scatter: the first is the use of the convolutional network architecture [33] and the second the use of Negative Pearson Correlation Coefficient (NPCC) as loss function. The convolutional architecture, which we refer to as IDiffNet (for "Imaging through Diffusers Network"), affords us two main advantages: first, we are able to reconstruct space-bandwidth product (SBP) of 128 × 128, higher than previously reported; second, IDiffNet is able to better generalize when reconstructing test objects from classes outside its own training set (e.g. reconstructing faces when trained on natural images.) We conducted training and testing with well-calibrated diffusers of known grit size and well-calibrated intensity objects produced by a spatial light modulator. We also examined a large set of databases, including classes of objects with naturally embedded sparsity (e.g. handwritten characters or digits). These experiments enabled us to precisely quantify when IDiffNet requires strong sparsity constraints to become effective, as function of diffuser severity (the smaller the grit size, the more ill-posed the inverse problem becomes.) The adoption of NPCC instead of the more commonly used Mean Absolute Error (MAE) as loss function for training ID-iffNet was an additional enabling factor in obtaining high-SBP image reconstructions through strong scatter. We compared the performance of these two loss functions under different imaging conditions and with different training databases determining the object priors that the networks learn and showed that NPCC is preferable for cases of relatively sparse objects (e.g. characters) and strong scatter. Lastly, we probed the interior of our trained IDiffNets through the well-established test of Maximally-Activated Patterns (MAPs) [34] and compared with standard denoising networks to eliminate the possibility that IDiffNet might be acting trivially instead of having learnt anything about the diffuser and the objects' priors.
The structure of the paper is as follows: In Section 2, we describe the architecture of our computational imaging system, including the hardware and the IDiffNet machine learning architecture. In Section 3, the reconstruction results are analyzed, including the effects of scattering strength, object complexity (i.e., the object priors that the neural networks must learn) and choice of loss function for training. The comparison with a denoising neural network is described in Section 4 and concluding thoughts are in Section 5.

COMPUTATIONAL IMAGING SYSTEM ARCHITEC-TURE
The optical configuration that we consider in this paper is shown in Fig. 1. Light from a He-Ne laser source (Thorlabs, HNL210L, 632.8nm) is transmitted through a spatial filter, which consists of a microscope objective (Newport, M-60X, 0.85NA) and a pinhole aperture (D = 5µm). After being collimated by the lens ( f = 150mm), the light is reflected by a mirror and then passes through a linear polarizer, followed by a beam splitter. A spatial light modulator (Holoeye, LC-R 720, reflective) is placed normally incident to the transmitted light and acts as a pixel-wise intensity object. The SLM pixel size is 20 × 20µm 2 and number of pixels is 1280 × 768, out of which the central 512 × 512 portion only is used in the experiments. The SLM-modulated light is then reflected by the beam splitter and passes through a linear polarization analyzer before being scattered by a glass diffuser. A telescopic imaging system is built after the glass diffuser to image the SLM onto a CMOS camera (Basler, A504k), which has a pixel size of 12 × 12µm 2 . In order to match the pixel size of the CMOS with that of the SLM, we built the telescope using two lenses L 1 and L 2 of focal lengths: f 1 = 250mm and f 2 = 150mm. As a result, the telescope magnifies the object by a factor of 0.6, which is consistent with the ratio between the pixel sizes of the CMOS and SLM. The total number of pixels on the CMOS is 1280 × 1024, but we only crop the central 512 × 512 square for processing; thus, the number of pixels measured by the CMOS camera, as well as their size, match 1:1 the object pixels at the SLM. Images recorded by the CMOS camera are then processed on an Intel i7 CPU. The neural network computations are performed on a GTX1080 graphics card (NVIDIA). The modulation performance of the SLM depends on the orientations of the polarizer and analyzer. Here, we implement the cross polarization arrangement to achieve a high intensity modulation contrast. Specifically, we set the incident beam to be linearly polarized along the horizontal direction and also set the linear polarization analyzer to be oriented along the vertical direction. We experimentally calibrate the correspondence between the 8-bit grayscale input images projected onto the SLM and intensity modulation values of SLM (see Supplement). We find that in this arrangement, the intensity modulation of the SLM follows a monotonic relationship with respect to assigned pixel value and a maximum intensity modulation ratio of ∼ 17 can be achieved. At the same time, the SLM also introduces phase modulation which is correlated with the intensity modulation due to the optical anisotropy of the liquid crystal molecules. The phase depth is ∼ 0.6π. Fortunately, the influence of this phase modulation is negligible in the formation of the speckle images that we captured in this system; detailed demonstration can be found in the Supplement. Therefore, we are justified in treating this SLM as a pure-intensity object. As shown in Fig. 1(b), the glass diffuser is inserted at a distance z d in front of the lens L1. Here, we approximate the glass diffuser as a thin mask whose amplitude transmittance is t(x 1 , y 1 ). In this case, a forward model can be derived to relate the optical field at the detector plane g out (x , y ) to the optical field at the object plane g(x, y) (constant terms have been neglected) [35]: (2) where λ is the light wavelength, R the radius of the lens L 2 and J 1 (·) denotes the first-order Bessel function of the first kind. T is the Fourier spectrum of the diffuser: Here, * denotes the convolution product and the last term in the convolution accounts for the influence of the finite aperture size of the lenses.
We model the diffuser transmittance t(x 1 , y 1 ) as a pure-phase random mask, i.e. t(x 1 , y 1 ) = exp i2π∆n λ D(x 1 , y 1 ) , where D(x 1 , y 1 ) is the random height of the diffuser surface and ∆n is the difference between the refractive indices of the diffuser and the surrounding air (∆n ≈ 0.52 for glass diffusers). The random surface height D(x 1 , y 1 ) can be modeled as [36]: Here, W(x, y) is a set of random height values chosen according to the normal distribution at each discrete sample location (x, y), i.e. W ∼ N(µ, σ 0 ); and K(σ) is a zero-mean Gaussian smoothing kernel having full-width half-maximum (FWHM) value of σ. The values of µ, σ 0 and σ are determined by the grit size of the chosen glass diffuser [37]. In this paper, we use two glass diffusers of different grit size: 600-grit (Thorlabs, DG10-600-MD) and 220-grit (Edmund, 83-419). Using these values in (2) and (3), we simulate the point spread function (PSF) of our imaging system as shown in Fig. 2. We can see that the PSF for the 600grit diffuser has a sharp peak at the center, while the PSF for the 220-grit diffuser spreads more widely. This indicates that the 220-grit diffuser scatters the light much more strongly than the 600-grit diffuser.
We may also represent equation (2) in terms of a forward operator H g : g out (x , y ) = H g g(x, y). When the object is pureintensity, i.e. g(x, y) = I(x, y), the relationship between the raw intensity captured at the detector plane I out (x , y ) and the object intensity I(x, y) can also be represented in terms of another forward operator H: I out (x , y ) = H I(x, y) = [SH g Sr]I(x, y). Here, S denotes the modulus square operator and Sr denotes the square root operator. Then, in order to reconstruct the intensity distribution of the object, we have to formulate an inverse operator H inv such that whereÎ(x, y) is an acceptable estimate of the intensity object.
Due to the randomness of H, it is difficult to obtain its explicit form and do the inversion accordingly; prior works referenced in Section 1 employed measurements of the scattering matrix to obtain H approximately. Here, we instead use IDiffNet, a deep neural network (DNN) trained to the underlying inverse mapping given a set of training data. IDiffNet uses the densely connected convolutional architecture (DenseNets) [38], where each layer connects to every other layer within the same block in a feed-forward fashion. Compared with conventional convolutional networks with same number of layers, DenseNets have more direct connections between the layers, which strengthens feature propagation and encourages feature reuse.
A diagram of IDiffNet is shown in Fig. 3. The input to IDiffNet is the speckle pattern captured by the CMOS. It first passes through a dilated convolutional layer with filter size 5 × 5 and dilation rate 2 and is then successively decimated by 6 dense and downsampling transition blocks. After transmitting through another dense block, it successively passes through 6 dense and upsampling transition blocks and an additional upsampling transition layer. Finally, the signals pass through a standard convolutional layer with filter size 1 × 1 and the estimate of the object is produced. Due to the scattering by the glass diffusers, the intensity at one pixel of the image plane is influenced by several nearby pixels at the object plane. Therefore, we use dilated convolutions in all our dense blocks so as to increase the receptive field of the convolution filters. In addition, we also use skip connections to pass high frequency information learnt in the initial layers down the network towards the output reconstruction. Additional details about the architecture and training of IDiffNet are provided in the Supplement, Section 3.

RESULTS AND NETWORK ANALYSIS
Our experiment consists two phases: training and testing. During the training process, we randomly choose image samples from a training database. The space bandwidth product of the original images are all 128 × 128 and we magnify each image by a factor of 4 before uploading to the SLM. The corresponding speckle patterns are captured by the CMOS. As mentioned in Section 2, we only crop the central 512 × 512 square of the CMOS. We further downsample the captured speckle patterns by a factor of 4 and subtract from them a reference speckle pattern, which is obtained by uploading to the SLM a uniform image with all pixels equal to zero. The purpose of this subtraction operation is to eliminate the background noise on the CMOS and also to better extract differences between speckle patterns resulting from different objects.
After the subtraction operation, we feed the resulting speckle patterns into IDiffNet for training. In this way, the input and output signal dimensions are both 128 × 128. We collected data from six separate experiment runs: each time we used training inputs from one of the three different databases: Faces-LFW [39], ImageNet [40] or MNIST [41] and inserted one of the two glass diffusers that we have into the imaging system. Each of our training dataset consists of 10,000 object-speckle pattern pairs. These data were used to train six separate IDiffNets for evaluation. In the testing process, we sample disjoint examples from the same database (Faces-LFW, ImageNet or MNIST) and other databases such as Characters, CIFAR [42] and Faces-ATT [43]. We upload these test examples to the SLM and capture their corresponding speckle patterns using the same glass diffuser as the training phase. We then input these speckle patterns to our trained IDiffNet and compare the output to the ground truth.
In training the IDiffNets, we use two different loss functions and compare their performances. The first loss function that we consider is the mean absolute error (MAE), is defined as: Here, w, h are the width and height of the output, Y is the output of the last layer, and G is the ground truth. The qualitative and quantitative reconstruction results when using MAE as the loss function are shown in Fig. 4 and 5, respectively. From Fig. 4, we find that, generally speaking, IDiffNet's reconstruction performance for the 600-grit diffuser is better than that for the 220-grit diffuser. High quality reconstructions are achieved for the 600-grit diffuser when IDiffNets are trained on Faces-LFW (column iv) and ImageNet (column v). For the 220-grit diffuser, the best reconstruction is obtained when ID-iffNet is trained on the ImageNet database (column ix). The recovered images are close to the low-pass filtered version of the original image, where we can visualize the general shape (salient features) but the high frequency features are missing. This result is expected since the scattering caused by the 220-grit diffuser is much stronger than that of the 600-grit diffuser, as we had already deduced from Fig. 2. As a result, we can still visualize some features of the object in the raw intensity image captured in the 600-grit diffuser case. By contrast, what we capture in the 220-grit diffuser case looks indistinguishable from pure speckle, without any object details visible. This means we should expect it to be much more difficult for IDiffNet to do the inversion.
Noticeable from Fig. 4 is that when IDiffNet is trained on MNIST for the 220-grit diffuser (column x), all the reconstructions seem to be uniform. This is due to the fact that the objects that this IDiffNet was trained on were sparse; and, hence, it also tends to make sparse estimates. Unfortunately, in this case the sparse local minima where IDiffNet is trapped are featureless.  (iv) IDiffNet reconstruction from raw images when trained using Faces-LFW dataset [39]. (v) IDiffNet reconstruction when trained used Im-ageNet dataset [40]. (vi) IDiffNet reconstruction when trained used MNIST dataset [41]. Columns (vii-x) follow the same sequence as (iii-vi) but in these sets the diffuser used is 220-grit. Rows (a-f) correspond to the dataset from which the test image is drawn: (a) Faces-LFW, (b) ImageNet, (c) Characters, (d) MNIST, (e) Faces-ATT [43], (f) CIFAR [42], respectively.
Tackling this problem motivated us to examine the Negative Pearson Correlation Coefficient (NPCC) as alternative loss function. The NPCC is defined as [44]: Here,G andỸ are the mean value of the ground truth G and the IDiffNet output Y, respectively.
The qualitative and quantitative reconstruction results using NPCC as the loss function are shown in Fig. 6 and 7, respectively. The reconstructed images are normalized since the NPCC value will be the same if we multiply the reconstruction by any positive constants. Similar to the case where MAE is used as the loss function, the reconstruction is better in the 600-grit diffuser case than the 220-grit diffuser case. However, when IDiffNet is trained on MNIST for the 220-grit diffuser (column x), high quality reconstruction is achieved for the test images comes from Characters and MNIST database (row c and d). This is in contrast to the MAE-trained case, thus indicating that NPCC is a more appropriate loss function to use in this case. It helps ID-iffNet to learn the sparsity in the ground truth and in turn use the sparsity as a strong prior for estimating the inverse. In addition, when trained on ImageNet for the 220-grit diffuser (column ix), IDiffNet is still able to reconstruct the general shape (salient features) of the object. But the NPCC-trained reconstructions are visually slightly worse compared with the MAE-trained cases. Fig. 6. Qualitative analysis of IDiffNets trained using NPCC as the loss function. (i) Ground truth pixel value inputs to the SLM. (ii) Corresponding intensity images calibrated by SLM response curve. (iii) Raw intensity images captured by CMOS detector for 600-grit glass diffuser. (iv) IDiffNet reconstruction from raw images when trained using Faces-LFW dataset [39]. (v) IDiffNet reconstruction when trained used Im-ageNet dataset [40]. (vi) IDiffNet reconstruction when trained used MNIST dataset [41]. Columns (vii-x) follow the same sequence as (iii-vi) but in these sets the diffuser used is 220-grit. Rows (a-f) correspond to the dataset from which the test image is drawn: (a) Faces-LFW, (b) ImageNet, (c) Characters, (d) MNIST, (e) Faces-ATT [43], (f) CIFAR [42], respectively.
In both MAE and NPCC training cases, IDiffNet performance also depends on the dataset that it is trained on. From Fig. 4 and 6, we observe that IDiffNet generalizes best when being trained on ImageNet and has the most severe overfitting problem when being trained on MNIST. Specifically, when IDiffNet is trained on MNIST, even for the 600-grit diffuser (column vi), it works well if the test image comes from the same database or a database that shares the same sparse characteristics as MNIST (e.g. characters). It gives much worse reconstruction when the test image comes from a much different database. When IDiffNet is trained on Faces-LFW, it generalizes well for the 600-grit diffuser, but for the 220-grit diffuser it exhibits overfitting: it tends to reconstruct a face at the central region, as Horisaki's case. When IDiffNet is trained on ImageNet, it generalizes well even for the 220-grit diffuser. As we can see in column ix, for all the test images, IDiffNet is able to at least reconstruct the general shapes (salient features) of the objects. This indicates that IDiffNet has learned at the very least a generalizable mapping of low-level textures between the captured speckle patterns and the input images. Similar observation may also be made from Fig. 5 and 7. From subplots (a) and (b) in both figures, we notice that the IDiffNets trained on MNIST have much higher MAEs/lower PCCs when tested on other databases. As shown in subplot (d), the IDiffNets trained on Faces-LFW have a large discrepancy between training and test error, while for IDiffNets trained on ImageNet, the training and testing curves converge to almost the same level. An explanation for this phenomenon is that all the images in MNIST or Faces-LFW databases share the same characteristics (eg. sparse, circular shape), imposing a strong prior on IDiffNet. On the other hand, the ImageNet database consists a mixture of generic images that not have too much in common. As a result, IDiffNet trained on ImageNet generalizes better. It is worth noting that overfitting in our case evidences itself as facelooking "ghosts" occurring when IDiffNet trained on Faces-LFW tries to reconstruct other kinds of images, for example (see Fig.  6, column viii). This is again similar to Horisaki's observations [25]. From comparing the four possible combinations of weak vs strong scattering and constrained dataset (e.g. MNIST) vs generic dataset (e.g. ImageNet), we conclude the following: when scattering is weak, it is to our benefit to train the IDiffNets on a generic dataset because the resulting neural networks generalize better and can cope with the scattering also for general test images. On the other hand, when scattering is strong, it is beneficial to use a relatively constrained dataset with strong sparsity present in the typical objects: the resulting neural networks are then more prone to overfitting, but now this works to our benefit for overcoming strong scattering (at the cost, of course, of working only for test objects coming from the more restricted database.) The choice of optimization functional makes this tradeoff even starker: MAE apparently does not succeed in learning the strong sparsity even for MNIST datasets, whereas the NPCC does much better, even being capable of reconstructing test objects under the most severe scattering conditions (220grit diffuser, column x in Fig. 6) as long as the objects are drawn from the sparse dataset. These observations are summarized in Table. 1.

COMPARISON WITH DENOISING NEURAL NET-WORKS
To get a sense of what IDiffNets learn, we first compare their reconstruction results with that of a denoising neural network. Specifically, we use ImageNet as our training database. To each image in the training dataset, we simulate a noisy image using Poisson noise and make the peak signal-to-noise ratio (PSNR) of the resulting noisy image visually comparable to that of the corresponding speckle image captured using the 600-grit diffuser. We use Poisson noise other than different kinds of noise such as Gaussian because Poisson noise is signal-dependent, similar to the diffuser case. We then train a denoising neural network using those noisy images. For the denoising neural network, we implement the residual network architecture [45]. Finally, we feed the test speckle images captured using the 600-grit diffuser into this denoising neural network and compare the outputs with those reconstructed by IDiffNet (using MAE as the loss function).
The comparison results are shown in Fig. 8. From column iv, we find that the denoising neural network works well when the test images are indeed noisy according to the Poisson model. However, as shown in column v, if we input the diffuse image into the denoising network, then it can only output a highly blurred image, much worse than IDiffNet given the same diffuse input, as can be seen in column vi. This result demonstrates that IDiffNet is not doing denoising, although the speckle image obtained using the 600-grit diffuser visually looks similar to a noisy image. We posit the reason for this as follows: Poisson noise operates pixel-wise. Consequently, denoising for Poisson noise is effectively another pixel-wise operation that does not depend much on spatial neighborhood, except to the degree that applying priors originating from the structure of the object helps to denoise severely affected signals. A denoising neural network, then, learns spatial structure only as a prior on the class of objects it is trained on. However, this is not the case when imaging through a diffuser: then every pixel value in the speckle image is influenced by a set of nearby pixels in the original image. This may also be seen from the PSF plots shown in Fig. 2. The denoising neural network fails because it has not learnt the spatial correlations between the nearby pixels and the correct kernel of our imaging system, as our IDiffNet has.
We also examined the maximally-activated patterns (MAPs) of the IDiffNets and the denoising neural network; i.e. what types of inputs would maximize network filter response (gradient descent on the input with average filter response as loss function) [34]. Fig. 9 shows the MAP analysis of two convolutional layers at different depths for all the three neural networks. For both the shallow and deep layers, we find the MAPs of our IDiffNets are qualitatively different from those of the denoising network. This further corroborates that IDiffNet is not merely doing denoising. In addition, the MAPs of the 600-grit IDiffNet show finer textures compared with that of the 220-grit IDiffNet, indicating that the IDiffNet learns better in the 600-grit diffuser case.  Fig .4, duplicated here for the readers' convenience]. Rows (a-c) correspond to the dataset from which the test image is drawn: (a) Characters, (b) CIFAR [42], (c) Faces-LFW [39], respectively.

CONCLUSIONS
We have demonstrated that IDiffNets, built according to the densely connected convolutional neural network architecture, can be used as an end to end approach for imaging through scattering media. The reconstruction performance depends on the scattering strength of the diffusers, the type of the training dataset (in particular, its sparsity), as well as the loss function used for optimization. The IDiffNets seem to learn automatically both the properties of the scattering media, as well as the priors restricting the objects where the network is supposed to perform well, depending on what the network was trained on.