Deep-learning projector for optical diffraction tomography

Abstract: Optical diffraction tomography is an effective tool to estimate the refractive indices of unknown objects. It proceeds by solving an ill-posed inverse problem in which the wave equation governs the scattering events. The solution has traditionally been derived by the minimization of an objective function in which the data-fidelity term encourages measurement consistency while the regularization term enforces prior constraints. In this work, we propose to train a convolutional neural network (CNN) as the projector in a projected-gradient-descent method. We iteratively produce high-quality estimates and ensure measurement consistency, thus keeping the best of CNN-based and regularization-based worlds. Our experiments on two-dimensional simulated and real data show an improvement over other conventional or deep-learning-based methods. Furthermore, our trained CNN projector is general enough to accommodate various forward models for the handling of multiple-scattering events.


Learning-based reconstruction methods
In the past decade, several works that adopted deep neural networks (DNN) [24,25] have reached state-of-the-art performance in many challenging linear inverse problems such as image reconstruction [26], super-resolution [27], x-ray computed tomography [28,29], and compressive sensing [30]. Instead of retrieving the image from the measurements directly, the strategy is to train a convolutional neural network (CNN) [31] to map the initial reconstructions, which are usually obtained by classical methods, into the final desired results [32].
Another popular variant is to modify traditional iterative algorithms in such a way that the network plays the role of a regularizer. In [33], a model-based iterative algorithm for image reconstruction was proposed in which a deep CNN-based regularization prior was combined with numerical optimization modules such as conjugate-gradient algorithms to enforce data consistency and constrain the plausibility of the solutions. In [34], the authors proposed an iterative reconstruction method for nonlinear tomographic problems. Specifically, they devised a gradient-descent-like scheme where the gradient is learned via a CNN and showed that it works with any nonlinear operator. In [35], a CNN is trained as a projector on the training data and is then plugged into the alternating-direction method of multipliers (ADMM) to solve linear inverse problems. The framework there is similar to plug-and-play priors for model-based reconstruction [36]. In [37], where the idea is similar to that in [35], a CNN is trained in multiple stages to approximate a projector for the consistent reconstruction of computed tomography (CT) images. The projector is then plugged into a relaxed projected-gradient-descent (PGD) method which is guaranteed to converge.
In diffraction tomography, learning-based RI tomography approaches are also being rapidly developed. In [14], the authors described a framework for the imaging of three-dimensional (3D) phase objects in a tomographic configuration implemented by training a DNN to reproduce the complex amplitudes of the measurements. The network was designed such that the voxel values of the RI of the 3D object are the inputs that are adapted during the training process. In [38], the authors improved upon [14] by using an ℓ1 loss function and anisotropic TV. In [39], the scattering decoder framework (ScaDec) combines a back-propagation scheme with a U-Net [40]. The authors could recover strongly scattering objects in a purely data-driven fashion. Yet other CNN architectures were proposed in [41,42]. These works mainly concentrated on cases characterized by the absence of the missing-cone problem. Nevertheless, they suggest that CNNs could be an appealing solution to the specific challenges of ODT.

Contribution
Inspired by [37], we develop a plug-and-play scheme with a CNN trained as a projector for ODT. Our approach contains a feedback mechanism that ensures consistency with the measurements while taking advantage of the expressive power of CNNs. In contrast to [37], we adapt the framework to a more challenging modality with an underlying nonlinear physical model and a missing-cone problem. We extensively validate the proposed method on simulated and real data in different scenarios. We compare the proposed method against the conventional direct-inversion method (a.k.a. the filtered back-propagation (FBP) method), iterative inversion methods combined with TV regularization (e.g., RytovTV), ScaDec, and the direct CNN. Our method outperforms traditional optimization-based methods on both simulated and real datasets with diverse configurations. Furthermore, our framework allows the use of more accurate models, for instance, BPM.

Organization
In Section 2, we present the physical model in ODT [10] and its classical approximations. In Section 3, we formulate the problem and explore an iterative scheme based on PGD. In Section 4, we describe the proposed deep-learning scheme; it involves a novel strategy to train the CNN as a projector onto a set of desirable solutions. In Section 5, we validate our method on two-dimensional (2D) simulated and experimental data against conventional and other CNN-based approaches. Finally, we summarize our study in Section 6.

Optical diffraction tomography
Let n : Ω → R denote the distribution of RI of the object immersed in a medium of RI n_b, where Ω ⊂ R^d includes the support of the sample and d is the dimension of the problem (e.g., d = 2, 3). In ODT, an incident plane wave u_in : Ω → C of wavelength λ illuminates the sample. The interaction between n and u_in produces a scattered field u_sc : Ω → C. The complex total field u = u_in + u_sc is measured at the location Γ of the detector plane (see Fig. 1). In scalar-diffraction theory, the total field u is governed by the Helmholtz equation

(∇^2 + k_0^2 n^2(x)) u(x) = 0,   (1)

with the free-space wavenumber k_0 = 2π/λ. Under suitable conditions, the integral form of (1) is the Lippmann-Schwinger equation

u(x) = u_in(x) + ∫_Ω g(x − x′) f(x′) u(x′) dx′,   (2)

where f(x) = k_0^2 (n^2(x) − n_b^2) denotes the scattering potential and k = k_0 n_b is the wavenumber in the medium. The convolution kernel g : R^d → C is the Green's function of (∇^2 + k_0^2 n_b^2 I), given for d = 2 by

g(x) = (j/4) H_0^(1)(k ||x||),   (3)

where H_0^(1) denotes the zeroth-order Hankel function of the first kind. Under specific hypotheses, we can linearize this problem by replacing the total field u in the integral with the incident field u_in. This yields the first-order Born approximation, written as

u_sc(x) ≈ ∫_Ω g(x − x′) f(x′) u_in(x′) dx′.   (4)

The fact that the Born model only allows one to observe optically thin samples is a serious drawback. The Rytov model [11] yields a similar equation, with the left-hand side of (4) being replaced by u_Ryt = u_in log(u/u_in). In practice, it is necessary to unwrap the imaginary part of log(u/u_in). In this work, we use the Rytov model as our forward operator, unless specified otherwise.
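To make the convolution kernel concrete, here is a small numerical sketch (assuming SciPy is available) of the 2D Green's function g(x) = (j/4) H_0^(1)(k ||x||); the grid and the wavelength are arbitrary illustration values, not parameters from the paper.

```python
import numpy as np
from scipy.special import hankel1

def green_2d(x1, x2, k):
    """2D Green's function of (laplacian + k^2): g(x) = (j/4) H0^(1)(k ||x||)."""
    r = np.hypot(x1, x2)
    return 0.25j * hankel1(0, k * r)

# Evaluate along a line, away from the singularity at r = 0; the wavelength
# (0.45 length units) is an arbitrary illustration value.
k = 2.0 * np.pi / 0.45
xs = np.linspace(0.5, 5.0, 8)
g = green_2d(xs, 0.0, k)
```

As expected for an outgoing cylindrical wave, the magnitude of g decays with the distance to the source.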

Formulation of the forward model
We discretize Ω into N pixels and represent the sampled scattering potential and refractive index by the vectors f ∈ R^N and n ∈ R^N, respectively. The sample is illuminated by a series of P incident plane waves {u_in^p(x)}_{p=1..P} = {e^{j<k_p, x>}}_{p=1..P} with the wavevector k_p = k_b s_p, where k_b = k_0 n_b is the wavenumber in the medium and s_p ∈ R^d is the directional unit vector of the pth illumination. We acquire M complex measurements per illumination. According to the Rytov model [11], our discrete forward model reads

y_Ryt^p = H_Ryt^p f = S_p F f,   (5)

where y_Ryt^p ∈ C^M are the measurements modified according to the Rytov model, S_p ∈ R^{M×N} denotes a sampling operation that depends on the pth incident field, and F ∈ C^{N×N} applies the d-dimensional Fourier transform.
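As an illustration only, a toy operator of the form y_p = S_p F f can be sketched as follows; the frequency-sampling pattern `sample_idx` is a hypothetical stand-in for the view-dependent Rytov sampling S_p, not the actual pattern dictated by the Fourier diffraction theorem.

```python
import numpy as np

def forward_rytov_like(f_img, sample_idx):
    """Toy sketch of y_p = S_p F f: apply the 2D discrete Fourier transform
    to the scattering potential, then sample a subset of frequency-domain
    entries (the pattern `sample_idx` is hypothetical)."""
    return np.fft.fft2(f_img).ravel()[sample_idx]

rng = np.random.default_rng(0)
f_img = rng.standard_normal((32, 32))                  # discretized f on a grid
sample_idx = rng.choice(32 * 32, size=64, replace=False)
y = forward_rytov_like(f_img, sample_idx)              # M = 64 complex samples
```

The operator is linear in f, which is what makes the least-squares formulation of the next section tractable.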
We also introduce the RI variation δn ∈ R^N, whose mth component is given by

δn_m = n_m − n_b.   (6)

In this work, we use RI variations to describe the setup of the diverse datasets.

Problem formulation
We reconstruct our object by solving the constrained least-squares problem

f* = arg min_{f ∈ S} Σ_{p=1}^{P} ||H_Ryt^p f − y_Ryt^p||_2^2,   (7)

where y_Ryt^p denotes the measurements modified according to the Rytov model, H_Ryt^p : R^N → C^M is the Rytov model defined in (5), and S is a set of acceptable solutions that honor our prior information.

Iterative reconstruction scheme
The PGD [43] is a well-studied iterative method to solve (7). It alternates between taking a step in the direction of the negative gradient of the cost function and applying a projection operator P_S that maps the current iterate onto S. It can be written as

f^{k+1} = P_S(f^k − γ H^T(H f^k − y)),   (8)

where γ denotes the step size subject to the condition γ < 2/||H^T H||_2, and P_S : R^N → S is a projection operator.
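The iteration can be sketched in a few lines for a generic linear operator H; the matrix and data below are synthetic, and the non-positivity projector mirrors the constraint placed later on the scattering potential.

```python
import numpy as np

def pgd(H, y, proj, gamma, n_iter):
    """Projected gradient descent for min_f ||H f - y||_2^2 over a set S,
    where `proj` is the projector P_S (here a generic callable)."""
    f = np.zeros(H.shape[1])
    for _ in range(n_iter):
        grad = H.T @ (H @ f - y)      # gradient of the data-fidelity term
        f = proj(f - gamma * grad)    # gradient step followed by projection
    return f

# Synthetic example with a non-positivity constraint.
rng = np.random.default_rng(1)
H = rng.standard_normal((40, 20))
f_true = -np.abs(rng.standard_normal(20))       # non-positive ground truth
y = H @ f_true
gamma = 1.0 / np.linalg.norm(H.T @ H, 2)        # satisfies gamma < 2/||H^T H||_2
f_hat = pgd(H, y, proj=lambda v: np.minimum(v, 0.0), gamma=gamma, n_iter=1000)
```

With a valid projector onto a convex set, as here, the iterates converge to the constrained least-squares solution.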
However, PGD is not guaranteed to converge when the operator P_S is not a valid projector, the condition of validity being (P_S ∘ P_S)(x) = P_S(x) for all x ∈ R^N. We therefore consider the relaxed PGD (RPGD) of [37], which is guaranteed to converge (see Theorem 3 of [37]). The iterative reconstruction scheme is sketched in Algorithm 1. In Line 6, in particular, we need a projector onto S. Inspired by recent works on learning approaches, our goal is to replace P_S with a CNN that is trained to act as a projector (see Fig. 2).
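A simplified sketch of the relaxed scheme follows. The rule used here for shrinking the relaxation parameter α is a stand-in for the exact update of Algorithm 1 (see [37] for the provably convergent version): α shrinks whenever the update length fails to decrease geometrically, which is what tames a `project` that is only an approximate projector (e.g., a CNN).

```python
import numpy as np

def rpgd(H, y, project, gamma, alpha=1.0, c=0.99, n_iter=500, tol=1e-6):
    """Relaxed PGD sketch: f_{k+1} = alpha * z_k + (1 - alpha) * f_k,
    with a simplified shrinking rule for alpha (hypothetical stand-in
    for the rule of Algorithm 1)."""
    f = np.zeros(H.shape[1])
    prev_step = np.inf
    for _ in range(n_iter):
        g = f - gamma * (H.T @ (H @ f - y))    # gradient step on data fidelity
        z = project(g)                          # (learned) projection
        step = np.linalg.norm(z - f)
        if step > c * prev_step:                # simplified shrinking rule
            alpha *= c
        prev_step = step
        f_next = alpha * z + (1.0 - alpha) * f  # relaxed update
        if np.linalg.norm(f_next - f) <= tol * max(np.linalg.norm(f), 1.0):
            return f_next                       # relative-update stopping rule
        f = f_next
    return f

# Same synthetic setting as for plain PGD, with an exact projector.
rng = np.random.default_rng(1)
H = rng.standard_normal((40, 20))
f_true = -np.abs(rng.standard_normal(20))
y = H @ f_true
gamma = 1.0 / np.linalg.norm(H.T @ H, 2)
f_hat = rpgd(H, y, project=lambda v: np.minimum(v, 0.0), gamma=gamma, n_iter=1000)
```

When `project` is an exact projector, the relaxation is essentially inactive and the scheme reduces to PGD.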

Fig. 2. One iteration of RPGD for ODT. The forward model simulates the physical process from the current estimate f_k. The vector g is obtained by a gradient step on the data-fidelity term. Then, g is fed into the (trained) CNN projector to obtain z_k. The next estimate f_{k+1} is the sum of z_k and f_k weighted by α and (1 − α), respectively. Note that the initial guess f_0, which is obtained with a direct inversion method (Rytov model), is directly fed into the (trained) CNN projector.

Deep-learning-based projector for ODT
In this study, we replace Line 6 in Algorithm 1 by

z_k = P_C(CNN(g_k)),   (9)

where the CNN is trained to act as a projector and P_C denotes the projection onto the convex set C ⊂ R^N that enforces physical constraints on the RI (e.g., a non-positivity constraint). The training dataset consists of 2D objects placed randomly. We simulated the complex measurements by using an accurate forward model that relies on the Lippmann-Schwinger model [20]. The details of the simulation are described in Section 5.1.1.
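The modified projection step is simply the composition P_C ∘ CNN; a minimal sketch follows, in which the `smooth` callable is a hypothetical stand-in for the trained network and the clipping implements the non-positivity constraint.

```python
import numpy as np

def learned_projector(g, cnn, f_max=0.0):
    """Sketch of the modified projection step: apply the trained CNN, then
    the projection P_C onto the convex set C (here, a non-positivity
    constraint on the scattering potential, enforced by clipping)."""
    z = cnn(g)                    # learned mapping toward the solution set
    return np.minimum(z, f_max)   # P_C: enforce f <= f_max componentwise

# Hypothetical stand-in for the trained CNN: a crude local average.
smooth = lambda v: 0.5 * (v + np.roll(v, 1))
z = learned_projector(np.array([0.3, -1.0, 0.2, -0.4]), smooth)
```

Whatever the network outputs, the final clipping guarantees that the estimate stays physically admissible.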

Network architecture
We design a CNN based on the U-Net architecture [28,40]. The setup includes an external skip connection between the input and the output so that the network acts as a residual net. Furthermore, we double the number of channels in the encoder (left path) each time the depth of the network increases (starting at 32, Fig. 3). We also replace the (3 × 3) up-conv layer with a (2 × 2) transposed convolutional layer, and add a leaky rectified linear unit (LeakyReLU) [44] after the last convolutional layer (see Fig. 3).
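A toy two-level variant of such an architecture might look as follows in PyTorch; the channel counts, depth, and exact layer ordering are illustrative assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Toy 2-level U-Net sketch: channels double with depth (ch -> 2*ch),
    (2 x 2) transposed-conv upsampling, a LeakyReLU after the last
    convolution, and an external residual skip from input to output."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * ch, ch, kernel_size=2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(ch, 1, 3, padding=1)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # U-Net skip
        return x + self.act(self.out(d1))  # external (residual) skip

out = MiniUNet(ch=8)(torch.zeros(1, 1, 16, 16))
```

The external skip means the network only has to learn the residual between its input and the desired projection, which is easier than learning the projection from scratch.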

Training strategy
In our experiments, we adopt a training strategy inspired by [37]. During training, the input of the CNN falls into three classes:

f_q,1 = B(H_Lip f_q),   (10)
f_q,2 = CNN_θt(f_q,1),   (11)
f_q,3 = f_q,   (12)

where H_Lip denotes the Lippmann-Schwinger model, B is shorthand for FBP [11], and θ_t denotes the learned parameters of the CNN after the tth training epoch. The Q ground-truth images are denoted by {f_q}_{q ∈ [1···Q]}. This strategy enhances the dataset, as (10) and (11) introduce a variety of perturbations. On the contrary, there is no perturbation in (12), so that it encourages the network to mimic a projector. The training process aims at minimizing the loss function

L(θ) = Σ_{q=1}^{Q} Σ_m ||f_q − CNN_θ(f_q,m)||_2^2,   (13)

whose minimization proceeds using the Adam algorithm [45].
In practice, we train the network in three stages. In the first stage, we train it only with the input data set {f q,1 } obtained by (10). Then, in the second stage, we concatenate the data set {f q,2 } generated by (11) to {f q,1 } and train the CNN with a total loss that integrates the two parts. Finally, in the third stage, the whole data set {f q,1 ,f q,2 ,f q,3 } is fed into the optimization.
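The staged construction of the training inputs can be sketched as below; `fbp_of_forward` and `cnn` are hypothetical callables standing in for B ∘ H_Lip and for the partially trained network CNN_θt, respectively.

```python
import numpy as np

def make_inputs(f_gt, fbp_of_forward, cnn, stage):
    """Sketch of the three input classes of (10)-(12):
      stage 1: f_q1 = B(H_Lip f_q)  -- FBP of the simulated measurements
      stage 2: add f_q2 = CNN(f_q1) -- output of the partially trained net
      stage 3: add f_q3 = f_q       -- the unperturbed ground truth
    `fbp_of_forward` and `cnn` are hypothetical stand-ins."""
    f_q1 = fbp_of_forward(f_gt)
    inputs = [f_q1]
    if stage >= 2:
        inputs.append(cnn(f_q1))
    if stage >= 3:
        inputs.append(f_gt)
    return inputs

# Toy demonstration with trivial stand-ins for the operators.
inputs_stage3 = make_inputs(np.ones(4), lambda f: f + 0.1,
                            lambda f: 0.9 * f, stage=3)
```

Training the network on all three classes with the same ground-truth target is what pushes it to behave like a projector: perturbed inputs are mapped back to f_q, while f_q itself is left (nearly) unchanged.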

Numerical experiments
In this section, we present experiments that validate our proposed method on 2D simulated and experimental data. The training process is performed on a desktop workstation (Titan X GPU, Ubuntu operating system) and implemented in PyTorch [46], while the evaluation is performed in MATLAB R2019a on another desktop computer (Intel Xeon E5-1650 CPU, 3.5 GHz, 32 GB of RAM) using the GlobalBioIm library [47].
Parameter setting For each of the three stages, the number of epochs was set to T_1 = 20, T_2 = 30, and T_3 = 30. The initial learning rate was 10^−4; it was reduced to one tenth of its current value at the 40th and 70th epochs. We set the batch size to 18 for the three stages and drew an equal number of samples from each dataset {f_q,m}, m = 1, 2, 3. The parameters of the iterative RPGD were set as follows: the relaxation parameter α was initialized to 1, and all members of the sequence {c_k}_{k=1..N} were set to the constant value c, which was tuned manually for each scenario. The algorithm was stopped either when the relative update fell below E = 10^−4 or after N = 10 iterations. We set C = {f ∈ R^N : f ≤ 0} (i.e., a non-positivity constraint), unless specified otherwise.
Method of comparison We quantitatively evaluated the quality of the reconstructed RI n̂ with respect to the ground truth n. To that end, we adopted the relative error (ERROR) and the structural similarity (SSIM) [48], defined as

ERROR(n, n̂) = ||n − n̂||_2 / ||n||_2,
SSIM(n, n̂) = (2 μ_n μ_n̂ + c_1)(2 σ_nn̂ + c_2) / ((μ_n^2 + μ_n̂^2 + c_1)(σ_n^2 + σ_n̂^2 + c_2)),

where μ_n, μ_n̂, σ_n, σ_n̂, and σ_nn̂ are the local means, standard deviations, and cross-covariance of the images n and n̂. The parameters c_1 = 10^−4 and c_2 = 9 × 10^−4 are the default regularization constants; they avoid instabilities over image regions where the local mean or standard deviation vanishes. We compare the results of our method with those of FBP, of a Rytov-based method with TV regularization (RytovTV), and of ScaDec [39]. We also compare them with those of the direct CNN, which shares the architecture of our network but is trained only with the dataset in (10) (see Table 1). To solve the optimization-based method combined with TV regularization, we adopted the forward-backward splitting approach [49]. All parameters were tuned to achieve the best relative error.
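The two metrics can be sketched in a few lines; note that the SSIM below is computed globally over the whole image for simplicity, whereas the SSIM reported in the tables is averaged over local windows.

```python
import numpy as np

def relative_error(n_true, n_hat):
    """ERROR(n, n_hat) = ||n - n_hat||_2 / ||n||_2."""
    return np.linalg.norm(n_true - n_hat) / np.linalg.norm(n_true)

def global_ssim(n_true, n_hat, c1=1e-4, c2=9e-4):
    """Single-window SSIM sketch with the document's default constants."""
    mu1, mu2 = n_true.mean(), n_hat.mean()
    s1, s2 = n_true.std(), n_hat.std()
    s12 = ((n_true - mu1) * (n_hat - mu2)).mean()  # cross-covariance
    return ((2 * mu1 * mu2 + c1) * (2 * s12 + c2)) / \
           ((mu1**2 + mu2**2 + c1) * (s1**2 + s2**2 + c2))
```

By construction, a perfect reconstruction gives ERROR = 0 and SSIM = 1.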

Training data composed of one disk
Simulation setup We generated 1,116 images composed of one disk arbitrarily positioned in the middle third of the horizontal direction and anywhere in the vertical direction. Each disk has a radius ranging from 4λ to 7.5λ. The data were split into 1,080 images for training, 18 images for validation, and 18 images for testing. The physical size of the image was set to (38λ × 38λ) and discretized on a (256 × 256) grid. We assumed that the background medium was oil (RI n_b = 1.525). The value of the RI variation δn ranged arbitrarily from (−0.135) to (−0.055), component-wise. The plane waves had incident angles uniformly distributed between (−45°) and 45°, thus simulating a missing-cone problem. The wavelength of the illumination was set to λ = 450 nm. We acquired P = 40 views on a detector line of length 97λ. In addition, to make the measurements more realistic, we added white Gaussian noise w to the measurements y such that the input SNR was 20 dB, where SNR(y + w, y) = 20 log_10(||y||_2 / ||w||_2). All simulations were implemented using the GlobalBioIm library [47].
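The noise scaling implied by this definition of the SNR can be sketched as follows (shown for real-valued signals; the actual measurements are complex):

```python
import numpy as np

def add_noise(y, snr_db, rng=None):
    """Add white Gaussian noise w scaled so that
    SNR(y + w, y) = 20 log10(||y||_2 / ||w||_2) equals snr_db."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.standard_normal(y.shape)
    w *= np.linalg.norm(y) / (np.linalg.norm(w) * 10 ** (snr_db / 20))
    return y + w

y = np.cos(np.linspace(0.0, 10.0, 200))   # arbitrary clean signal
y_noisy = add_noise(y, 20.0)              # target SNR of 20 dB
```

Rescaling the noise vector after drawing it guarantees that the target SNR is met exactly, regardless of the realization.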
Reconstruction of one circular disk We evaluated the stability of the methods over a range of noise levels (10, 15, and 20 dB) in the context of limited-angle measurements. We summarize in Table 2 and Table 3 the average ERROR and SSIM values over the whole testing dataset. Notably, RytovTV dramatically underestimates the RI distribution and fails to reconstruct the shape of the objects. Our method obtains the highest-quality reconstructions in terms of the actual RI value and the shape of the objects. In most cases, it faithfully recovers the images with lower ERROR and higher SSIM values than the other methods. All the deep-learning-based methods recovered the samples successfully, while the TV-regularized method failed. Only in the 10 dB case did ScaDec perform slightly better in terms of relative error. This can be explained by the failure of the phase unwrapping on which the Rytov model relies, unlike ScaDec. The proposed method takes 10 seconds for 10 iterations, whereas RytovTV takes 50 seconds for 80 iterations. Three samples of the reconstructed images are shown in Fig. 4.

Reconstruction of objects whose shapes differ from the training data We further assessed how the trained CNN performs when the testing set does not match the training set. We consider three examples with different shapes (see Fig. 5): a square, a cell-like sample, and two non-overlapping disks with independent radii and RI variations. In this case, the direct CNN and ScaDec both fail to recover the objects. By contrast, our method leads to reconstructions with more accurate shapes and RI values because it incorporates a feedback mechanism that ensures consistency with the measurements. Furthermore, it performs slightly better than RytovTV.
These illustrative examples suggest that our proposed method can perform well despite being trained on single disks. We emphasize that it may not perform well for all types of data, but it shows that the reconstructions may still be satisfactory for some testing data that does not match the training set.
In addition, we generated 18 other images containing two disks with an arbitrary RI variation ranging from (−0.067) to (−0.033) (see Fig. 6). The rotation angle for these data was chosen among 0°, 22.5°, 45°, and 90°. In the 0° view, the two circles were placed vertically, while in the 90° view they were aligned horizontally. In this case, ScaDec performs less well, with the possible exception of the 10 dB case (see Tables 4 and 5). Similarly, RytovTV leads to reconstructions with inaccurate shapes and visible artifacts in the background. On the contrary, both the direct CNN and our method produce remarkably well-recovered images. Moreover, our iterative scheme even offers a slight increase in the quality of the reconstruction. These results suggest that the proposed scheme is robust to the mismatch experienced when training on a single disk and testing on two.
a The CNN is trained on data composed of one disk.

Training data composed of complex objects
Simulation setup We generated 1,116 images composed of 2D cell-like samples with two embedded smaller ellipses (see Fig. 7). The radius of the larger part ranged from 10λ to 12λ, while that of the smaller ellipses ranged from 1.5λ to 2.5λ. The ellipses are arbitrarily located, and the boundary of each large cell-like sample is randomly generated. Each object has its own RI variation ranging from (−0.1) to 0.2. In this case, we set C = {f ∈ R^N}, and the SNR of the measurements is 20 dB.

Fig. 7. Sixteen examples of the images randomly generated to train the CNN projector to reconstruct complex objects.

Reconstruction of complex objects
We give in Fig. 8 three examples of reconstructions. The average ERROR and SSIM over the whole testing dataset are also given in Table 6. Notably, all the CNN-based approaches performed better than RytovTV. Moreover, our method (RytovCNN) still improves the quality of the reconstruction compared to ScaDec or the direct CNN. Note that the columns BPMTV and BPMCNN will be discussed in Section 5.3.

Validation on experimental data
Experimental setup The experimental data were 2D cross sections of a 3D specimen with no variation along the y axis. The objects are two non-overlapping fibers placed in the medium vertically, diagonally, or horizontally. The image size was (1024 × 1024). The reconstructed image was cropped from the center to (512 × 512), which corresponds to (97λ × 97λ). The fibers had an expected RI variation of (−0.055) and a diameter of 9 µm. In total, P = 160 views were collected with the illumination of a laser diode (512 measurements per view). The other physical parameters replicated those of the simulation.

Reconstructions We trained the CNN on single disks. Since the experimental data are larger than the simulated data, we cropped the image before feeding it into the CNN and zero-padded the output correspondingly. The performance of the different reconstruction methods is shown in Fig. 9. ScaDec produces blocky artifacts in all configurations. In addition, the direct CNN introduces deformations and underestimates the RI; it fails spectacularly in the 90° case. RytovTV produces piecewise-constant structures, which suits the dataset well. Yet, artifacts remain and induce elongated shapes, which points to a weakness of this method with respect to the missing-cone problem. On the contrary, the proposed method generally obtains satisfying results in terms of RI and shape. Since the end-to-end approaches (ScaDec and the direct CNN) do not use the information from the measurements iteratively, they are more vulnerable to a potential mismatch between the synthetic training data and the real data. By contrast, our method and RytovTV are more robust because of the iterative feedback of information from the measurements. Similar findings have been observed in [37]. Note that the most difficult case is 90°, because the missing-cone problem is acute in this configuration. Moreover, the considered forward model is least valid there since multiple-scattering events occur in this setting.

Nonlinear forward model with deep-learning projector
We combine the nonlinear forward model BPM [14,15] with the CNN that was trained, based on the Rytov model, in the previous sections. The idea is to assess the effect of our projector. Here, we present reconstructions of three synthetic samples and one real sample. The reconstructions of synthetic data were obtained with the CNN trained on the complex objects of Section 5.1.2 with the Rytov model. Similarly, the reconstruction of real data was obtained with the CNN trained on single disks.
As shown in Fig. 10, RytovTV fails to restore the objects, while the TV-regularized BPM (BPMTV) is able to recover them well. All the CNN-based methods perform better at recovering the shapes and RI values. The CNN projector combined with the BPM model (BPMCNN) outperforms all the other methods over the whole testing set of complex objects, as shown in Table 6.
Additionally, as shown in Fig. 11, the BPM-based scheme reconstructs the samples with an impressive improvement over RytovCNN. These comparisons show that our framework is capable of reconstructing the samples even though the CNN projector was trained with the strategy described in Section 4.2 (i.e., without knowledge of the BPM model).

Conclusion
To reconstruct data governed by the Helmholtz equation, we have proposed a general iterative framework that takes advantage of two worlds. On the one hand, a deep-learning-based projector encodes priors learnt beforehand. On the other hand, the inversion method is based on a physical model and ensures consistency with the measurements. We have validated our approach in a challenging setting that combines the difficulties of the missing-cone problem with a nonlinear physical model. While our framework is trained on a simulated dataset that contains one object only, it remains applicable to experimental data featuring multiple objects in a variety of optical configurations, which is remarkable. In addition, we have also validated the proposed approach on more complex objects. Our numerical experiments have shown that the proposed method outperforms the conventional reconstruction of refractive indices in an optical-diffraction-tomography context, as well as recent deep-learning-based methods. Due to our combined approach (projector and physical model), we could mitigate the missing-cone problem. Furthermore, we have assessed the ability of our projector to work with a more advanced nonlinear forward model. The reconstructions were significantly improved over those of the linear model.