Experimental robustness of Fourier Ptychography phase retrieval algorithms

Fourier ptychography is a new computational microscopy technique that provides gigapixel-scale intensity and phase images with both wide field-of-view and high resolution. By capturing a stack of low-resolution images under different illumination angles, a nonlinear inverse algorithm can be used to computationally reconstruct the high-resolution complex field. Here, we compare and classify multiple proposed inverse algorithms in terms of experimental robustness. We find that the main sources of error are noise, aberrations and mis-calibration (i.e. model mis-match). Using simulations and experiments, we demonstrate that the choice of cost function plays a critical role, with amplitude-based cost functions performing better than intensity-based ones. The reason for this is that Fourier ptychography datasets consist of images from both brightfield and darkfield illumination, representing a large range of measured intensities. Both noise (e.g. Poisson noise) and model mis-match errors are shown to scale with intensity. Hence, algorithms that use an appropriate cost function will be more tolerant to both noise and model mis-match. Given these insights, we propose a global Newton's method algorithm which is robust and computationally efficient. Finally, we discuss the impact of procedures for algorithmic correction of aberrations and mis-calibration.


Introduction
Fourier ptychographic microscopy (FPM) [1] circumvents optical space-bandwidth (SBP) limitations to achieve gigapixel-scale quantitative phase images, having both wide field-of-view (FOV) and high resolution. The method combines ideas from synthetic aperture and translational-diversity phase retrieval [2,3], conveniently realized by replacing the light source of a microscope with an LED array, then capturing multiple images under different illumination angles. When LEDs illuminate the sample from angles smaller than that allowed by the objective's numerical aperture (N A obj ), brightfield images result. Conversely, when the illumination NA is larger than the objective NA, darkfield images result. Although darkfield images alone do not have higher resolution than the objective allows, they do contain information about subdiffraction-limit sized features, which occupy a shifted area of the sample's Fourier space (assuming a thin sample). By collecting many images that cover a wide region of Fourier space and stitching them together coherently, one can achieve spatial resolution beyond the objective's diffraction limit, corresponding to the sum of illumination and objective NAs (N A eff = N A illu + N A obj ). FPM's scan-free high SBP imaging capability has great potential for revolutionizing biomedical imaging, with applications in digital pathology [1,[4][5][6] and in vivo live cell imaging [7]. The original FPM method only applies to 2D thin objects, however, new models and reconstruction algorithms also enable 3D reconstruction of thick samples [8].
Multiple algorithms have been proposed for solving the nonlinear inverse FPM problem, which amounts to phase retrieval. Amongst these, there are the usual tradeoffs between accuracy, noise performance and computational complexity. In practice, however, we have found that a critical metric is an algorithm's performance under model mis-match -when the experimental data is imperfect (e.g. due to misalignment). This is typical in computational imaging algorithms, which are often fragile and not robust enough to provide consistent high-quality results. Unfortunately, model mis-match is difficult to quantify, since it is systematic, yet unpredictable. Here, we aim to compare algorithms directly, in order to identify error mechanisms and determine the most accurate and robust algorithm for our experiments. Figure 1: Fourier ptychographic reconstruction (amplitude only) of a test object with the algorithms discussed here, all using the same experimental dataset. Algorithms derived from the same cost function (amplitude-based, intensity-based, and Poissonlikelihood) give similar performance, and first-order methods (Gerchberg-Saxton) suffer artifacts.
The original FPM algorithm used a Gerchberg-Saxton approach [9], which is a type of alternating projections [10][11][12][13], first developed for traditional ptychography [2,3,[14][15][16][17] and later for FPM [1,18,19]. Shifted support constraints (finite pupil size) are enforced in the Fourier domain as the corresponding amplitude constraints (measured images) are applied in the image domain, while letting the phase evolve as each image is stepped through sequentially. The Gerchberg-Saxton method, which is a type of gradient descent, represents a natural way to solve phase retrieval problems by trying to directly minimize some cost function that describes the differences between actual and predicted measurements. Unfortunately, these formulations are often non-convex in nature and do not come with global convergence guarantees.
Recently, a class of gradient descent like updates, dubbed Wirtinger Flows [20], have been shown to have global convergence guarantees. This method has been successfully applied to FPM [21], though the actual implementation deviates from theory somewhat. In the Wirtinger Flow framework, the optimization procedure is similar to gradient descent, except that the step size and initial guess are carefully chosen for provable convergence.
Gradient descent and Wirtinger Flow are first-order methods, in the sense that they only use the first-order derivative of the cost function when updating the complexfield. It is also possible, and generally advantageous, to use higher-order derivatives in the updates. For example, second-order methods (e.g. Newton's method) use both the first and second derivative of the cost function, and have been shown to provide faster convergence rates [22]. In our studies, we also observe improved performance when using second-order methods. For example, in the top row of Fig. 1, the Gerchberg-Saxton algorithm is a first-order method, whereas the other three methods are second-order (or approximate second-order) methods. All results achieve a similar resolution, but the first-order (Gerchberg-Saxton) result is corrupted by lowfrequency artifacts. While computing second-order derivatives increases complexity, we find that it usually reduces the number of iterations needed, enabling fast overall run times.
The final class of algorithms that have been proposed are based on convex relaxations [23][24][25][26][27]. This class of phase retrieval algorithms, called PhaseLift, re-frames the problem in higher dimensions such that it becomes convex, then aims to minimize the cost function between actual and predicted intensity via semidefinite programming. These algorithms come with the significant advantage of rigorous mathematical guarantees [28] and were successfully applied to FPM data [27]. The actual implementations of these algorithms, however, deviate from the provable case due to computational limitations.
Algorithms can be further classified as sequential or global, depending on whether the update is done for each image, one at a time (sequentially), or all at once with the full set of images (globally) for each iteration. Global methods are expected to perform better, at a cost of additional computational requirements. In our studies, results show little difference between the sequential and global implementation of any particular algorithm (see Fig. 1), suggesting that sequential procedures may be sufficient, allowing reduced computational requirements.
One seemingly unimportant classification of algorithms is whether their cost function minimizes differences in intensity or amplitude. Throughout this paper, we refer to algorithms that minimize intensity differences as intensity-based algorithms, and algorithms that minimize amplitude differences as amplitude-based algorithms. Since intensity is amplitude squared, both drive the optimization in the correct direction; hence, one might expect that the choice between the two is of little consequence. Surprisingly, however, we find that the cost function is the key predictor of experimental performance for our experimental dataset. Intensity-based algorithms suffer from strong artifacts (see Fig. 1), which we show to be due to noise and model mismatch errors. Hence, amplitude-based algorithms perform better on imperfect data, so are more robust. Our goal is to explain why this happens. Figure 2: (a) Experimental setup for Fourier ptychography with an LED array microscope. (b) The sample's Fourier space is synthetically enlarged by capturing multiple images from different illumination angles. Each circle represents the spatial frequency coverage of the image captured by single-LED illumination. Brightfield images have orders of magnitude higher intensity than darkfield (see histograms), resulting in different noise levels.
We will show that in order for a phase retrieval scheme to be robust to experimental imperfections, the choice of cost function is of crucial importance. One source of error in our experimental data is measurement noise, including Gaussian noise or Poisson shot noise. Another main source of error is model mis-match, caused by experimental imperfections such as aberrations and LED misalignment. A particular problem of FPM datasets is that they contain both brightfield and darkfield images, which have drastically different intensity levels (see Fig. 2). Brightfield images can have several orders of magnitude higher intensity than darkfield images; thus, the amount of Poisson noise will also be significantly higher. If this difference in the noise levels is not properly accounted for, the brightfield noise may drown out the darkfield signal. We will further show that aberrations and LED mis-calibration -the two main model mis-match errors in our experiments -result also in intensity-dependent errors. Thus, by carefully designing the the cost function, we can develop algorithms that are significantly more robust to both noise and model mis-match.
We develop a maximum likelihood theory which provides a flexible framework for formulating the FPM optimization problem with various noise models. In particular, we will focus on Gaussian and Poisson noise models. We find that amplitude-based algorithms effectively use a Poisson noise model, while intensity-based algorithms use a Gaussian noise model. To illustrate, we simulate four FPM datasets, three of which are contaminated with measurement errors (see Fig. 3): Poisson noise, aberrations, and LED misalignment. We compare the performance of various algorithms on these datasets to demonstrate that the imperfections in our experimental data are more consistent with a Poisson noise model. This explains our observations that amplitudebased algorithms are more experimentally robust than intensity-based algorithms.

Forward problem for Fourier ptychography
Consider a thin sample with transmission function o(r), where r = (x, y) represents the 2D spatial coordinates in the sample plane. Assuming that the LED array is sufficiently far from the sample, each LED will illuminate the sample by a plane wave from a different angle, defined by exp(i2πu · r), where u = (u ,x , u ,y ) is the spatial frequency corresponding to the -th LED, = 1, . . . , N img . After passing through the sample, the exit wave is the product of the sample and illumination complex fields, o(r) exp(i2πu · r). The tilted plane wave illumination means that the Fourier transform of this exit wave is just a shifted version of the Fourier spectrum of the object, O(u − u ), where O(u) = F{o(r)} and F is the 2D Fourier transform. This exit wave then passes through the objective lens, where it is low-pass filtered by the pupil function, P (u), which is usually a circle with its size defined by N A obj . Finally, with F −1 being the 2D inverse Fourier transform, we can write the intensity at the image plane as [19] (1)

Possible noise and simulated dataset
Ideally, all algorithms based on the forward model above should give good reconstructions. However, noise and model mis-match errors cause deviations from our forward model. Thus, a noise model that accurately describes the error will be important for noise tolerance. Heuristically, we have identified three experimental non-idealities that cause error: Poisson noise, aberrations and LED mis-alignment. We aim to separate and analyze the artifacts caused by each through controlled simulations that incur only one type of error. The simulated data (Fig. 3) uses the same parameters as our experimental setup, where a 32 × 32 green LED array (central wavelength λ = 514 nm) is placed 77 mm above the sample. LEDs are nominally 4 mm apart from each other and only the central 293 LEDs are used, giving a maximum N A illu = 0.45. Samples are imaged with a 4× objective lens having N A obj = 0.1. Using our forward model, we simulate four datasets: 1. Ideal data: no noise is added. The object and pupil follow exactly the FPM forward model that is assumed in the algorithm.
2. Poisson noise data: the ideal data is corrupted by Poisson-distributed noise at each pixel. To emphasize the effect and to emulate experiments with lowerperformance sensors, we simulate 20× more noise than is present in our experiments (details in Section 2.3).
3. Aberrated data: simulated images are corrupted by imaging system aberrations, which are described by the aberrated complex pupil function shown in Fig. 3. The pupil function used in these simulations was obtained from experimental measurements.
4. LED mis-aligned data: the illumination angle of each LED is perturbed slightly (following a normal distribution with standard deviation σ θ = 0.2 • ). The black × and blue • in Fig. 3 show the original and perturbed LED positions, respectively.
To deal with these experimental errors, in the next section we will discuss different noise models for formulating the FPM optimization problem.

Optimization problem based on different noise models
Most algorithms solve the FPM problem by minimizing the difference between the measured and estimated amplitude (or intensity), without assuming a noise model. Hence, the FPM problem can be formulated as the following optimization Since the cost function here, f A (O(u)), aims to minimize the difference between the estimated amplitude and the measured amplitude, this is the amplitude-based cost function. By optimizing this cost function, the projection-based algorithms for Fourier ptychography can be obtained [1,18,19], which treat each measurement as an amplitude-based sub-optimization problem. The formulation is used in the traditional Gerchberg-Saxton phase retrieval approach.
If we have information about the statistics of the noise, we can use it in our optimization formulation via the maximum likelihood estimation framework [29]. If we assume that our measured images suffer only from white Gaussian noise, then the probability of capturing the measured intensity I (r) at each pixel, given the estimate of O(u), can be expressed as whereÎ (r) = |F −1 {P (u)O(u − u )}| 2 and σ w is the standard deviation of the Gaussian noise.Î (r) and I (r) denote the estimated and measured intensity, respectively. The likelihood function is the overall probability due to all the pixels in all the images and can be calculated as r p[I (r)|O(u)], assuming measurements from all pixels are independent. In maximum likelihood estimation, the goal is to maximize the likelihood function. However, it is easier to solve this problem by turning the likelihood function into a negative log-likelihood function which can be minimized. The negative log-likelihood function associated with this probability distribution can be calculated as The next step is to minimize this negative log-likelihood function by estimating O(u) so that the overall probability is maximized. For white Gaussian noise, it is assumed that σ 2 w are the same across all pixels for all images (i.e. all measurements have the same amount of noise), though this will not be the case for FPM datasets. By making a Gaussian noise assumption, the first term in (4) is a constant and can be ignored. The optimization problem then reduces to We call this cost function, f I (O(u)), the intensity-based cost function because it aims to minimize the difference between the estimated intensity and the measured intensity. It also implies that noise from each pixel is treated the same and independent of the measured intensity. It will be shown later that the previous implementations of PhaseLift [27] and Wirtinger flow algorithms [21] for FPM aimed to optimize this intensity-based cost function. However, both can be implemented instead with a Poisson likelihood cost function.
If we assume instead that our measured images suffer from Poisson shot noise, then the probability of the measured intensity, I (r), given the estimate of O(u) can be expressed as Note that the Poisson distribution is used to describe the statistics of the incoming photons at each pixel, which is a discrete probability distribution. Here, we assume that the intensity is proportional to the photon count, so we can treat the distribution of the intensity as a Poisson distribution. When the expected value of the Poisson distribution is large, then this Poisson distribution will become more like a Gaussian distribution having a standard deviation proportional to the square root of the intensity, σ ,r ≈ I (r), from the central limit theorem. This means that a large measured intensity at a particular pixel will imply large noise at that pixel. In the simulation, we impose Poisson noise on the measured intensity by distributing each pixel value with a Gaussian distribution and setting the standard deviation to 20 I (r). The negative log-likelihood of the Poisson noise model can then be calculated; the optimization problem is formed by minimizing the negative log-likelihood function with estimation of O(u), This cost function comes from the likelihood function of the Poisson distribution, so we call it the Poisson-likelihood-based cost function. It implies that the pixels with larger measured intensity are weighted smaller because they suffer from more noise.
Since the brightfield images have more large-value pixels, they are assumed to be more noisy and thus are weighted smaller in the cost function. It is shown in the Appendix that the gradient of this cost function (37) is very similar to that of the amplitude-based cost function (34), which suggests that the amplitude-based cost function deals well with Poisson-like noise or model mis-match.

Vectorization Notation
For multivariate optimization problems such as (2) and (5) , it is convenient to reformulate the problem using linear algebra. First, the functions need to be vectorized. Each of the captured images, I (r), having m × m pixels, are raster-scanned into vectors, I , with size m 2 × 1. Since the estimated object transmission function will have higher space-bandwidth product than the raw images, the estimated object should have n × n pixels, where n > m. For convenience, we actually solve for the Fourier space of the object, O(u), which is vectorized into a vector O with size n 2 × 1. Before multiplying the pupil function, the Fourier space of the object is downsampled by a m 2 × n 2 matrix Q . The matrix Q transforms a n 2 × 1 vector into a m 2 × 1 vector by selecting values out of the original vector, so the entries of this matrix are either 1 or 0 and each row contains at most one nonzero element. The pupil function P (u) is vectorized into a vector P with size m 2 × 1. The 2D Fourier transform and inverse transform operator are m 2 × m 2 matrices defined as F and F −1 . | · |, | · | 2 , √ ·, and ·/· are element-wise operators, and the diag(·) operator puts the entries of a vector into the diagonal of a matrix.
The second step is to rewrite the optimization in vector form using the new parameters. First, the forward model (1) can be vectorized aŝ The amplitude-based cost function (2) can be vectorized as where the hyperscript † denotes a Hermitian conjugate. Likewise, the intensity-based cost function (5) can be vectorized as The Poisson likelihood cost function is more complicated to be expressed in vector form. First, we rewrite |g | 2 as where A = diag(ḡ )F −1 diag(P)Q is a m 2 × n 2 matrix with m 2 × 1 row vectors, a † ,j , j = 1, . . . , m 2 , andḡ denotes the complex conjugate of vector g . Then the likelihood function can be rewritten as To minimize (9), (10) or (12) using an iterative optimization algorithm, the gradients (and possibly Hessians) of the cost functions need to be calculated, both of which are shown in the Appendix. Since (9), (10) and (12) are all real-valued functions of a complex vector O, that means that O andŌ should be treated independently in the derivative calculation, which is based on the CR-calculus discussed in [30] and the similar formulation for traditional ptychography discussed in [17].

Derivation of algorithms
The basic formulation of the optimization problem in FPM has been described in the last section and the derivative calculation has been done in the Appendix, so we now turn our attention to describing how each algorithm solves the optimization problem based on different cost functions. The key step will be in how each algorithm updates the estimate of the object at each iteration. We compare the existing algorithms for FPM and also implement a new second-order global Newton's method under different cost functions, for comparison. The initialization for all algorithms is the same -the amplitude of the image from the on-axis LED illumination.

First-order methods
3.1.1 Sequential gradient descent (Gerchberg-Saxton) [1,18] For the implementation in [1,18], the algorithm aims to optimize the amplitude-based cost function (9). It is the simplest to implement and, in this case, equivalent to the Gerchberg-Saxton approach of simply replacing known information in real and Fourier space. Since the sequential strategy treats a single image as an optimization problem, the cost function for each problem is just one component of Eq. (9) and is defined as where denotes the index of each measurement. The derivative of this cost function is thus a component of Eq. (34) and can be expressed as The update equation for this sequential amplitude-based algorithm is then a gradient descent with the descent direction given by Eq. (14) and step size 1/|P| 2 max : where i indicates the iteration number, which goes to i + 1 after running through all the measurements from = 1 to = N img . This algorithm adopts the alternating projection phase retrieval approach. The first projection in the real domain is the amplitude replacement operation diag √ I |g | g , and the second projection is to project the previous estimated Fourier region diag(P)Q O onto the updated Fourier region Fdiag √ I |g | g . It is worth noting that the algorithm in [1] directly replaces Fdiag √ I |g | g in the Fourier domain at each sub-iteration. A similar algorithm in [18], introduced for simultaneous aberration recovery, has the same form as Eq. (15) that implements gradient descent in the Fourier domain. However, when there is no pupil estimation, then P becomes a pure support function with one inside the support and zero outside. In this situation, these two algorithms become exactly the same, and thus we refer to both as sequential gradient descent or Gerchberg-Saxton algorithm. [20,21] The Wirtinger flow optimization framework was originally proposed to iteratively solve the coded-mask phase retrieval problem using nonlinear optimization [20]. It is a gradient descent method implemented with a special initialization and special step sizes. For the FPM implementation described in [21], the intensity-based cost function is used. Thus, the update equation for the object transmission function O can be expressed as

Wirtinger flow algorithm
where the step size is calculated by where is the gradient of the intensity-based cost function calculated in (36), and i 0 and θ max are user-chosen parameters to calculate the step size.
In the previously proposed FPM implementation of Wirtinger flow [21], the algorithm deviates somewhat from the original theory proposed in [20]. First, there is an additional term in the cost function to deal with additive noise. Second, the initialization used in [21] is not the proposed one in [20], but rather a low-resolution captured image. So the algorithm in [21] is essentially a gradient descent method with the special step size based on the intensity-based cost function and is not guaranteed to converge to the global minimum.
The Wirtinger flow algorithm can be implemented with different cost functions simply by replacing the original intensity-based gradient with the other gradients derived in the Appendix. For comparison, we have implemented the Wirtinger flow algorithm using all three of the cost functions described here: amplitude-based, intensity-based and Poisson-likelihood-based. The results are compared in Fig. 1 with experimental data and Section 4 with simulated data.

Second-order methods
Beyond first-order, a second-order optimization method can improve the convergence speed and stability of the algorithm, especially for nonlinear and non-convex problems. Second-order methods (e.g. Newton's method) use both the first and second derivatives (Hessian) of the cost function to create a better update at each iteration. As a result, they generally require fewer iterations and move more directly towards the solution. The difficulty of second-order implementations is in computing the Hessian matrix, whose size scales quadratically with the size of the image. As a result, approximations to the Hessian are often used (known as quasi-Newton methods) to trade performance for computational efficiency.

Sequential Gauss-Newton method [19]
First, we look at a Gauss-Newton method based on the amplitude-based cost function, which approximates the Hessian matrix as a multiplication of its Jacobian matrix: where c = (O T ,Ō T ) T (See Appendix). Since the inversion of this Hessian matrix requires very high computational cost, we approximate the Hessian by dropping all the off-diagonal terms of the Hessian matrix. Further, the inversion of the Hessian matrix may be an ill-posed problem, so a constant regularizer is adopted. In the end, the approximated Hessian inversion becomes where ∆ is a constant vector with all the entries equal to a constant regularizer δ over all pixels. By applying Newton's update, Eq. (45), with this approximated Hessian inversion, the new estimate of O can be expressed as where the diag |P| |P|max part is the step size for this descent direction. Note that when P is a constant having either 0 or 1 values, this method is reduced to the sequential gradient descent method with a tunable regularizer δ. In practice, however, we also simultaneously update P (see Section 4.3), so the second-order optimization procedure becomes more crucial.

New algorithm implementing a global Newton's method
Since we expect second-order methods to perform better than first-order, and also global methods to be more stable than sequential, we propose a new global secondorder (Newton's) method, and show the results compared against other methods. For completeness, we implement all three of amplitude, intensity, and Poisson-likelihoodbased cost functions, showing that the amplitude and Poisson-likelihood-based cost functions indeed perform better. The difficult step in deriving a Newton's method for this problem is in calculating the gradients and Hessians of the cost functions directly, without approximations. In the Appendix, we show our derivation, and in this section we use the results with a typical Newton's update equation: The inverse of the Hessian matrix, (H cc ) −1 , is solved efficiently by a conjugate gradient matrix inversion iterative solver as described in [31]. α (i) is determined by the backtracking line search algorithm at each iteration, as described in [22].

PhaseLift algorithm [23-27]
The PhaseLift formulation for phase retrieval is conceptually quite different than the previous methods described here. The idea is to lift the non-convex problem into a higher-dimensional space in which it is convex, thereby guaranteeing convergence to the global solution. To do this, the cost function of O is reformulated into that of a rank-1 matrix X = OO † and the goal is to estimate X instead of O. The process of reformulation can be expressed as [27] where D is an N img m 2 × N img m 2 operator combining the inverse Fourier transform, pupil cropping, and the downsampling operation with row vectors denoted by d † j . Hence, the estimated intensity |g| 2 as a function of X can be expressed where A is a linear operator transforming X into |g| 2 . In Section 2.
Since X is a rank-1 matrix, we then minimize the rank of X subject to I = A(X). However, the rank minimization problem is NP-hard. Therefore, a convex relaxation [23][24][25] is used instead to transform the problem into a trace minimization problem. Under this relaxation, the optimization problem becomes where α is a regularization variable that depends on the noise level. The problem with this new approach is that by increasing the dimensionality of the problem, the size of the matrix X has become n 2 × n 2 , which is too large to store and calculate eigenvalue decomposition on a normal computer. To avoid these computational problems, we do not directly solve (25), but rather apply a factorization to X = RR † , where R is an n 2 × k matrix. X is a rank-1 matrix so k is set to be 1 (R becomes O). This new problem is then solved effectively using the augmented Lagrangian multiplier, by modifying the original cost function [26,27] min R f AL,I (R) = min R σ 2 (I−A(RR † )) † (I−A(RR † ))+y T (I−A(RR † ))+Tr(RR † ), (26) where y, N img m 2 ×1 vector, is the Lagrangian multiplier, and σ ≥ 0 is the augmented Lagrangian multiplier. Both are parameters that can be tuned to give a better reconstruction. By taking the derivative of this cost function with respect to R and updating R in each iteration, the optimization problem can then be solved [26]. Unfortunately, after these modifications, the problem becomes non-convex because of the minimization with respect to R instead of X, and thus is no longer provable. In order to provide a more familiar form for comparing the PhaseLift algorithm to the others discussed in this paper, we define y = [y T 1 , . . . , y T N img ] T , where y i is m 2 × 1 vector, so that the minimization problem in Eq. (26) becomes Now, we see that the PhaseLift implementation is essentially an intensity-based cost function with an additional constraint that may deal better with noise. The corresponding derivative of the cost function is calculated as in the previous section: When σ is large compared to the component of y and O, the factorized PhaseLift formulation with rank-1 X is equivalent to the intensity-based optimization problem discussed in the previous section. To solve this optimization problem, a quasi-Newton algorithm called L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) method [22], which is a second-order method using an approximated Hessian inversion from previous gradients, is adopted. We note that although the PhaseLift algorithm can also be implemented with the Poisson-likelihood-based cost function, the algorithm in the rank-1 case is equivalent to our global Newton's method discussed in Section 3.2.2 for the same reason as in the above analysis.

Performance analysis of various algorithms
In this section, we compare the algorithms described in Section 3 using experimental data, as well as simulated data that mimics the experimental errors described in Section 2.2. We find that second-order optimization generally performs better than first-order, while global methods do not give significant improvement over sequential. Further, we explain why the cost function is a key consideration in choosing an algorithm by explaining the cause of the high-frequency artifacts that result from intensity-based algorithms. Interestingly, the two model mis-match errors (aberrations and LED mis-alignment) behave similarly to Poisson noise, in that they also give intensity-dependent errors. Hence, the amplitude and Poisson likelihood algorithms are more robust not only to Poisson noise, but also to model mis-match errors.  Next, we use each of the algorithms described in Section 3 to reconstruct amplitude and phase from the datasets simulated in Section 2.2, in order to quantify performance under various experimental error types by comparing against the ground truth input. Figures 4 and 5 show the reconstructed amplitude and phase, respectively. On the top left corner of each image we give the relative error of the reconstruction, defined as

Reconstruction of the simulated and experimental data
where O recover and O true are the reconstructed and true images, respectively, in vector form. In order to ensure that all algorithms converge to their stable solutions, we use 200 iterations for each algorithm, except for Wirtinger flow, which requires 500 iterations. The tuning parameters for each algorithm are summarized in Table 1.
We have attempted to optimize each parameter as fairly as possible; for example, we use a large σ in the PhaseLift algorithm to achieve a better reconstruction. Small σ trades resolution for flatter background artifacts.
PhaseLift Intensity Newton i 0 = 10 θ max = 1 σ = 10 10 N/A In analyzing results from the simulated datasets, we find that algorithms with the same cost function give similar reconstruction artifacts. For example, the intensitybased algorithms suffer from high-frequency artifacts and phase wrapping when the data is not perfect. Almost all algorithms give a satisfactory reconstruction when using the error-free ideal dataset, except for intensity-based Wirtinger flow, which suffers some phase-amplitude leakage and phase blurring (see Figs. [4][5]. When the dataset contains noise or model mis-match, we observe a distinct trend that amplitudebased and Poisson-likelihood-based algorithms give a better result, compared with intensity-based algorithms. The exception to this trend is the Gerchberg-Saxton algorithm, which is somewhat unstable and gets stuck in local minima, so is not robust to any type of error. The goal of our simulations was to determine the main error sources that cause artifacts in the experimental reconstructions of Fig. 1. Since the experiments contain combined errors from multiple sources, it is difficult to attribute artifacts to any particular type of error. We find, however, that all three of our main error sources cause similar artifacts, hence our experimental results may be corrupted by any of Poisson noise, aberration, or LED misalignment. For example, notice that our simulated error-corrupted data all results in high-frequency artifacts when using intensity-based algorithms, similar to the experimental results. The Gerchberg-Saxton result also displays low-frequency errors in simulation, as in experiment. The fact that both noise and model mis-match create similar artifacts is unexpected, since they are very different error mechanisms. We explain below why all three are intensity-dependent errors, which is the reason why the cost function choice is so important for robustness. The consequence is that algorithms which use a more accurate noise model (amplitude and Poisson likelihood-based) will not only be more robust to noise, but also to model mis-match errors. To examine the convergence of each algorithm, Figure 6 plots the error for each iteration when using the aberrated dataset and LED misaligned dataset with different algorithms. The intensity-based algorithms (red curves) clearly do not converge to the correct solution and can incur large errors when the data is not perfect. Compared to PhaseLift and the intensity-based Newton's method, the Wirtinger-flow algorithm seems to have lower error; however, this is only due to its slow divergence. If run for many iterations, it will eventually settle on a similarly error-corrupted result as the other two intensity-based algorithms (not shown). We also observe that amplitudebased (blue curves) and Poisson-likelihood-based (black curve) algorithms converge to points with lower errors in a similar fashion. This behavior is well explained by the similarity of the algorithms in their use of gradients and Hessians (as shown in the Appendix). Again, the exception to the trend is the first-order Gerchberg-Saxton algorithm, which recovers the object fairly well with aberrated data, but goes unstable in the case of LED misalignment. Note that, when there is no pupil estimation step, the only difference between the Gerchberg-Saxton and the sequential Gauss-Newton algorithm is the step size. Since the latter algorithm gives a good reconstruction, while the former diverges, we conclude that the Gerchberg-Saxton step size is too large for a stable update in this particular case. The convergence speed of each algorithm can be determined from Figure 6 using two metrics: number of iterations required and total runtime. We choose the convergence curves from the cases of ideal data and LED misaligned data and compare their iteration numbers and runtimes in Table 2. All the algorithms were implemented in MATLAB on an Intel i7 2.8 GHz CPU computer with 16G DDR3 RAM under OS X operating system. We define convergence as the point when the relative phase error reaches its stable point. The comparison does not consider the divergent cases. In the ideal data case, we can see that the sequential methods outperform all the other algorithms in terms of runtime. The Gerchberg-Saxton algorithm is the fastest in terms of both iteration number and runtime for this perfect dataset. The global Newton's method using intensity-based and amplitude-based cost functions also converge very fast in terms of iteration number. The Wirtinger flow algorithm takes much longer to reach convergence both in runtime and iteration number. For the case of the LED misaligned data, only five algorithms converge. In terms of iteration number, the amplitude-based Newton's method converges much faster than the other four, as expected. However, the sequential Gauss-Newton algorithm converges much faster in terms of the runtime. Though the global Newton's method is theoretically better than the others, it takes significant time to calculate the full Hessian matrix. Thus, the sequential Gauss-Newton method is our preferred algorithm in practice, because it provides excellent robustness while also enabling fast runtimes and reasonable computational complexity.
The main conclusions to be drawn from this section are that the FPM optimization algorithms which are formulated from amplitude-based and Poisson-likelihood-based cost functions are more tolerant to imperfect datasets with both Poisson noise and physical deviations like model mis-match, which were represented by aberrations and LED misalignment here. In the next section, we will explain more about the causes for this trend. The reason why amplitude-based and Poisson-likelihood-based algorithms have superior tolerance to experimental errors is due to their Poisson noise model. Each of these algorithms makes an implicit or explicit assumption that the magnitude of the errors in the data scale with the measured intensity. This is obviously a good model for Poisson noise errors, which are defined as noise which scales with intensity. It is not as obvious that the model mis-match errors (aberrations and LED misalignment) scale with intensity as well. To demonstrate this, Fig. 7 shows the histogram of the difference between the deviated dataset and the ideal dataset, for the cases of both brightfield and darkfield images. The histograms show a similar trend -all of the brightfield errors are much larger than the darkfield errors, with a similar statistical variation. Thus, the errors from Poisson noise, aberrations and LED misalignment all scale with the measured intensity. In our experimental data, there are always aberrations in the objective lens, LED misalignment, and Poisson shot noise. Since the noise model for the amplitude-based and Poisson-likelihood-based algorithms match the actual noise properties, these algorithms perform better than the intensity-based algorithms. And since the images captured by FPM have drastically different intensity values, this effect dominates the reconstruction artifacts. Note that these large variations in intensity values are specific to FPM and likely do not play a major role in other phase imaging schemes (e.g. phase from defocus or traditional ptychography), where images do not have such a wide range of intensity values. In our experiments, the Poisson noise is fairly low (due to use of a high-performance sCMOS sensor), but the model mismatch in the experimental data can cause effects similar to strong Poisson noise. Figure 8: The intensity-based cost function gives higher weighting to images in the low spatial frequency region of the Fourier domain, resulting in high-frequency artifacts. Here, we show the gradient of the amplitude-based, Poisson-likelihood-based and intensity-based cost functions at the tenth iteration, using experimental data.

Noise Model Analysis
For further understanding, we look closer at the relationship between the noise model and the cost function. Our optimization algorithms are derived from three cost functions. Each of the cost functions makes a noise model assumption. The intensity-based cost function assumes that noise in the data follows a white Gaussian noise model, which means that the standard deviation of the noise is assumed to be the same across the brightfield and darkfield images. Recall that the standard deviation of a Gaussian noise probability model is related to the weight in the cost function for each pixel, as shown in Eq. 4. The larger the standard deviation (amount of noise) at any pixel in Fourier space, the smaller the weighting, since noisy pixels should be trusted less. In the Gaussian noise model, the weights in the cost function for large-valued pixels and small-value pixels are the same. However, the deviation for brightfield images is much larger than that for darkfield images, as shown in Fig. 7. Hence, the brightfield images will contribute more to the total cost function value if the weights are all the same, due to their high intensity. The result is that the intensity-based (Gaussian noise model) algorithms focus mostly on the brightfield images, which correspond to low spatial frequency information, and the darkfield images do not contribute much. The result is a failure in the high-frequency reconstruction, as we saw in Figs. 1, 4, 5, and loss of effective resolution since the darkfield images contain all the sub-diffraction-limit information. To illustrate the dramatic difference in weights, Fig. 8 shows the gradient of the different cost functions. Obviously, the intensity cost function gives much higher weighting to low spatial frequencies, which causes the high-frequency artifacts.
Since the amplitude-based cost function shares a similar gradient and Hessian with the Poisson likelihood function, as shown in the Appendix and Fig. 8, it is not surprising that they both produce a similar quality reconstruction. Both of these cost functions assume the noise in the data follow a Poisson distribution, with the standard deviation scaling with the measured intensity. This assumption matches the actual error better than the white Gaussian assumption. The actual noise or deviations in the experiments for brightfield images have larger standard deviation, while that for darkfield images have smaller standard deviation. Under the Poisson noise model, the weight in the cost function is smaller for the noisy brightfield images and larger for the darkfield images. At the end, algorithms based on the Poisson noise model put more emphasis on the darkfield images and thus get a better reconstruction compared to the intensity-based algorithms. Figure 8 shows that the gradients for the amplitude-based and Poisson-likelihood-based cost function are similar and are more uniform throughout the whole Fourier space.

Pupil estimation
There are already more sophisticated FPM extensions to correct for some model mis-match errors [18,19], similar to the probe correction algorithms in traditional ptychography [16]. Both of the methods previously developed for Fourier ptychography are derived from the amplitude-based formulation. By taking the derivative of the cost function with respect to P, the decent direction to estimate the pupil function can be calculated as By applying the pupil estimation step after each object estimation using this gradient or approximated Hessian, the sequential gradient descent [18] and the sequential Gauss-Newton method [19] including pupil estimation can be derived. Here we only consider the amplitude-based cost function, for simplicity. Figure 9: Object and pupil reconstruction results using different algorithms, with and without pupil estimation. The second-order method (sequential Gauss-Newton) with pupil estimation gives the best result, as expected. In this case, we find that the second-order method without pupil estimation is already better than first-order method (sequential gradient descent) with pupil estimation.
We wish to investigate the improvements obtained by adding a pupil estimation step to both first and second-order optimization algorithms. Figure 9 shows the reconstruction result from the sequential gradient descent (first-order) and sequential Gauss-Newton (second-order) algorithms, using the aberrated dataset from the previous simulations. The numbers at the top left corner are the relative error compared to the ground truth simulated image. As can be seen, adding the pupil estimation step gives a better complex-field reconstruction, and the second-order (Gauss-Newton) method with pupil estimation provides the best result.
Surprisingly, however, the second-order reconstruction without pupil estimation is better than the first-order reconstruction with pupil estimation, for this case. This highlights the robustness to aberrations that a second-order optimization scheme enables. The second-order nature of the algorithm makes it faster in convergence, and also more stable. In terms of runtime, the pupil estimation step takes about the same time as the object reconstruction part, so the algorithm is two times slower when the pupil function step is incorporated.

Misalignment correction
Another possible correction scheme for model mis-match is that for LED misalignment. Since each LED position corresponds to a certain shift of the pupil function in the Fourier domain, this is similar to the shift of the probe function in traditional ptychography. There, iterative algorithms have been proposed to correct for the positioning error of the probe function [14,[32][33][34]. In [14,34], a gradient of the cost function with respect to the shift of the probe function has been calculated and the conjugate gradient method has been applied to correct for the positioning error. In [32], a simulated annealing method is adopted to estimate the shift of the probe function. The simulated annealing method is also adopted to correct for the misalignment of the spatial light modulator in a overlapped Fourier coding system [35], analogous to FPM. In our experiments, we observe that the simulated annealing method can locate the LED positions more accurately than other methods. Thus, we only compare with the simulated annealing method.
Simulated annealing is a method of searching unknown variables over a finite space to minimize or maximize the function of merit -the cost function in our case. Instead of exhaustively testing all the possible states, simulated annealing iteratively approaches the optimal state. At the first iteration, the algorithm randomly searches several states in the space and selects the one with the smallest cost function value. The algorithm then starts at this state for the next iteration, slowly reducing the search range in the following iterations until convergence. In our sequential algorithm, the whole optimization problem is divided into many sub-optimization problems for different collected images. At each sub-optimization problem, a gradient descent or Gauss-Newton method is applied to update that cor-responding region in Fourier domain. To add a LED mis-alignment correction step, the simulated annealing algorithm can be incorporated into each sub-iteration to find an optimal shift of the pupil function. In each sub-iteration, the down-sampling matrix, Q , which contains the information of the pupil shift, is tested according to the annealing process for several possible states corresponding to different shifts of the pupil. The state with the smallest cost function value is selected to update the old down-sampling matrix. Then, the new down-sampling matrix is used to update the corresponding region in the Fourier domain.
The simulated annealing method estimates the LED positions with good accuracy. Figure 10 shows the reconstruction result from the simulated LED misaligned dataset, both with and without the LED correction step. The result using the LED correction clearly shows better quality and smaller error, as seen in Fig. 10(a). Since the LED correction scheme also estimates the actual LED positions, which we intentionally perturbed in order to impose a known error, we can also compare the actual and recovered LED positions, shown in Fig. 10(b).
To complete the picture, we now show experimental reconstructions with and without the two correction schemes: pupil correction and LED mis-alignment corrections (see Fig. 11). Since we do not know ground truth for our experiments, we can only make qualitative observations. An incremental improvement is observed when adding the pupil estimation and then the LED correction steps -the background variation becomes flatter. Figure 11(b) shows the corrected LED positions compared to the original ones, in angular coordinates. Corrected positions of LEDs in different regions share similar offset because the fabrication process of the LED array can cause unexpected position misalignment for each LED. Notice that the LEDs at the edges (corresponding to higher angles of illumination) incur more variation, since these are more sensitive to calibration. Also, many of the large deviations occur at the edges that are not along the horizontal and vertical axes. In these areas, the LED position recovery is poor because the object has very little information there (the resolution test target contains only square features) and so the data contains little information about these areas. However, any errors in LED positions in this area will also not significantly affect the reconstruction if they do not contribute much energy to the object spectrum. If the goal was not to correct the image results, but rather to find the LED positions accurately, then one should choose an object that contains uniformly distributed spatial frequencies (e.g. a random diffuser or speckle field). Although the simulated annealing further improves our reconstruction, we note that it is more than ten times slower to process the data because of the local search performed at each sub-iteration.

Conclusion
We formulated the Fourier ptychographic phase retrieval problem using maximum likelihood optimization theory. Under this framework, we reviewed the existing FPM algorithms and classified them based on their cost functions: amplitude-based algorithms (akin to a Poisson noise model) and intensity-based algorithms (akin to a white Gaussian noise model). We also derived a new algorithm based on the Poisson likelihood function, which is better suited for dealing with measurement imperfections. We compared the tolerance of these algorithms under errors due to experimental noise and model mis-match (aberrations and LED mis-alignment) using both simulated data and experimental data. Because the noise and model mis-match error for brightfield and darkfield images depend on the measured intensity, the amplitudebased and Poisson-likelihood-based algorithms from the Poisson noise model are more robust than the intensity-based algorithms. This can be explained by the standard deviation of the noise model determining the weight of each image in the optimization. Hence, intensity-based algorithms over-weight the brightfield images, resulting in poor high-frequency reconstruction.
We used existing pupil estimation algorithms and proposed a simulated-annealingbased LED correction algorithm to algorithmically fix the experimental deviations. We compared the performance of the pupil estimation algorithms and found that second-order methods give the best results. We also showed the capability of the simulated annealing method to correct for misaligned LEDs and find their actual positions.
Based on our studies, we conclude that the global Newton's method gives the best reconstruction, but may have high computational cost. Considering both robustness and computational efficiency, we find that sequential Gauss-Newton method provides the best trade-offs for large-scale applications. Its experimental robustness is verified in our recent time-series in vitro experiments [7]. Our open source code for this algorithm can be downloaded at [36].
The gradient for f I can be calculated in the similar way, and the chain rule part of f I can be calculated as With (35), it is clear to express the gradient of f I as The calculation of gradient of L Poisson (O) with respect to O is different from the other two. With the expression (12), the gradient of Poisson likelihood function can be calculated as This is equivalent to the gradient of the intensity-based cost function with added weight 1/|g | 2 to the component from each image. In addition, this gradient is very similar to that from the amplitude-based cost function.
Since we have gradients for all cost functions, the updating equation for the gradient descent method can then be expressed as where i denotes the iteration number, α is the step size chosen by the line search algorithm, and f (O) can be either intensity-based or amplitude-based cost function.
, they all contain the term Q † diag(P) following by a residual term. The residual term basically finds the difference between the estimation and the measurement. This difference carries the information to update the previous estimation. Since each measurement carries the information for a specific region in the Fourier space, the Q † diag(P) term brings this updating information back to the right place corresponding to some spatial frequency.
For ∇ O f A (O), the first term in the residual shows the replacement of the amplitude in the real domain, which is the projection from the estimation to the modulus space. Thus, the gradient descent method using the amplitude-based cost function is similar to the projection-based phase retrieval solver. nents of the Hessian for the amplitude-based cost function are where 1 is the m 2 × m 2 identity matrix. Likewise, the Hessian of the intensity-based cost function is Finally, the Hessian of the Poisson likelihood cost function is In general, Newton's method, which is the second-order method using the inversion of Hessian matrix, is preferred in solving nonlinear least square problems because of its fast convergence and stability compared to the first-order methods such as gradient descent. The updating equation for Newton's method can be expressed as .