A matrix-free Levenberg-Marquardt algorithm for efficient ptychographic phase retrieval

The phase retrieval problem, where one aims to recover a complex-valued image from far-field intensity measurements, is a classic problem encountered in a range of imaging applications. Modern phase retrieval approaches usually rely on gradient descent methods in a nonlinear minimization framework. Calculating closed-form gradients for use in these methods is tedious work, and formulating second-order derivatives is even more laborious. Additionally, second-order techniques often require the storage and inversion of large matrices of partial derivatives, with memory requirements that can be prohibitive for data-rich imaging modalities. We use a reverse-mode automatic differentiation (AD) framework to implement an efficient matrix-free version of the Levenberg-Marquardt (LM) algorithm, a longstanding method that finds popular use in nonlinear least-squares minimization problems but which has seen little use in phase retrieval. Furthermore, we extend the basic LM algorithm so that it can be applied to general constrained optimization problems beyond just the least-squares applications. Since we use AD, we only need to specify the physics-based forward model for a specific imaging application; the derivative terms are calculated automatically through matrix-vector products, without explicitly forming any large Jacobian or Gauss-Newton matrices. We demonstrate that this algorithm can be used to solve both the unconstrained ptychographic object retrieval problem and the constrained "blind" ptychographic object and probe retrieval problems, under both the Gaussian and Poisson noise models, and that this method outperforms best-in-class first-order ptychographic reconstruction methods: it provides excellent convergence guarantees with (in many cases) a superlinear rate of convergence, all with a computational cost comparable to, or lower than, the tested first-order algorithms.


Introduction
Ptychography is a coherent diffraction imaging (CDI) method based on the collection of diffraction intensities obtained by using a finite coherent beam to illuminate overlapping regions on an object [1,2,3]. Following the development of a practical object reconstruction algorithm in 2004 [4,5], ptychography has found widespread use in imaging with X-rays in 2D [6] and 3D [7], with visible light [8], and with electrons [9], and variants have appeared using Bragg diffraction [10] and overlapping illumination angles instead of positions [11]. It is able to deliver images with a spatial resolution limited only by the scattering collected from the object (rather than the resolution of any optical elements used), and with both absorption and phase contrast (and phase contrast in Bragg-geometry CDI can in turn measure lattice strain in crystalline materials [12]).
The ptychographic reconstruction step, wherein the object is reconstructed from the set of diffraction patterns, is a computational inversion step, typically iterative in approach, that aims to retrieve the phases of the diffracted intensities so that they are altogether consistent with the estimated sample. Standard phase retrieval problems are generally difficult computational inverse problems [13], but because the ptychographic scan is performed with some overlap between adjacent probes, redundant information is extracted from each spatially localized area, which therefore provides robust and successful inversion (phase retrieval) for the whole sample. Nevertheless, the ptychographic reconstruction step remains a challenging large-scale numerical problem. The pioneering iterative projection methods (due to Gerchberg and Saxton [14], Fienup [15,16], and others [17,18,19]) developed in the context of standard CDI were not amenable to the joint phasing of the set of intensity patterns from a ptychography experiment. Dedicated strategies developed early on (the ptychographic iterative engine or PIE [4], and the difference map or DM approach [20]) were pivotal for the early expansion of the method. More recently, however, nonlinear optimization approaches to ptychographic reconstruction [21] have become increasingly popular: they are more robust, accurate, and highly versatile, as they are built explicitly on physical modeling assumptions associated with the experimental setup.
In ptychography, two specific situations arise that correspond to classes of inversion problems of increasing difficulty. The first of these is the situation where an accurate estimate of the illuminating probe is available (e.g., via a calibration performed prior to the experiment), so that we can assume that the probe function is known. Retrieving the object from the diffraction patterns is then equivalent to an "unconstrained" minimization problem. For this case, researchers have proposed a variety of iterative solution methods, with most of them consisting of first-order gradient-based iterations. First-order strategies [4,19,21,22,23,24,25,26] provide updates that are relatively easy to compute, but they are limited in their convergence speed [27,28]. While recent second-order iterative methods [27,28,29] address this convergence speed issue, their use in ptychography is strongly impaired by the high computational cost per iteration required to evaluate the second-order derivative matrix at each iterative update step [30]. Furthermore, knowledge of the true nature of the probe as it illuminates the sample is difficult, and in some cases impossible, to ascertain with sufficient accuracy for the unconstrained ptychography problem. This leads to the common second situation where we need to retrieve (or at least refine) both the object and the probe structure from the dataset: this is the "blind" ptychography problem. Since the set of ambiguous solutions is much larger in this second situation, the problem is more difficult to solve efficiently. Early algorithms developed for this purpose [31,20] were therefore subject to stagnation [32,33]. More recent algorithmic developments partially solve this issue by applying additional constraints on the probe and the sample via proximal operators [32,33]. As we will see, while these proximal operators are very versatile tools that have been used extensively to build constrained minimization algorithms [34], this flexibility may be outweighed by moderate convergence speed, so that other strategies may be preferred [35, Chapter 3].
The formulation of a gradient-based minimization strategy to address ptychography (and general optimization problems) typically requires the manual derivation of closed-form gradient expressions. This is already a tedious and inflexible procedure just for a first-order minimization strategy, and it becomes even more difficult if we want to formulate a higher-order minimization method. An increasingly popular alternative to such by-hand derivations is to use the powerful "automatic differentiation" or "algorithmic differentiation" (AD) technique for the derivative calculations [36]. In the AD framework, once we specify the physics-based experimental forward model, we can calculate the derivatives (first or higher order) with respect to any desired model component automatically, without any additional mathematical manipulation. Moreover, if we modify the forward model, the changes are also transferred to the derivatives automatically. This provides two key advantages for ptychography. First, the derivatives (and associated matrix-vector products) related to both the object and probe variables are similarly easy to access. Second, we can account for a modification of the model components (such as the optical device, the propagation method, or the noise regime) with minimal effort. Recent works have demonstrated that the AD framework can be used to conveniently and flexibly solve general phase retrieval problems (e.g., ptychography, tomography, and more) through popular first-order minimization strategies [37,38,39,40,41].
We propose here a generic, AD-based Levenberg-Marquardt (LM) minimization strategy to deal with the optimization constraints met in ptychography. The proposed strategy solves both the standard and blind ptychography problems efficiently. The LM algorithm used here is essentially a regularized second-order iterative approach that offers fast convergence [42,30,43]. In contrast to existing works [44,45], our LM implementation computes its iterative updates in a computationally efficient "matrix-free" [46] fashion using the AD framework. The key to this matrix-free method is to only ever use the second-order derivative matrix to calculate matrix-vector products. Just as the Fast Fourier Transform method calculates the Discrete Fourier Transform (DFT) efficiently without forming the DFT matrix itself, our algorithms calculate the necessary matrix-vector products efficiently without ever forming the full second-order derivative matrix. We accomplish this by using a "Hessian-free" AD approach [47,48,49].
Our overall contributions in this paper are as follows: 1. We modify the classical LM method so that it is based on the "Generalized Gauss-Newton" (GGN) extension [48] to the Gauss-Newton matrix and so that it incorporates the "projected gradient" [50,51] extension to handle convex constraints. In contrast to the classical LM method, which can only be applied to solve unconstrained nonlinear least-squares (NLSQ) minimization problems, this extended approach can be applied towards general constrained minimization problems.
2. Our LM implementation is entirely matrix-free. Since this approach is AD-based and does not require closed-form derivative expressions, it can be used in a drop-in fashion within nonlinear minimization strategies for other inverse problems.
3. We derive analytical expressions for the diagonal elements of the GGN matrix for the ptychography application. These expressions can be used for "preconditioned" LM iterative updates, and even for preconditioned iterative updates within first-order gradient-based optimization methods (such as the nonlinear conjugate gradient method). These expressions are easy to adapt for other phase retrieval problems.
4. We demonstrate empirically that the LM method successfully solves the ptychographic phase retrieval problem for both the Gaussian (NLSQ) and Poisson (non-NLSQ) noise models, with a computational cost comparable to, or lower than, state-of-the-art first-order methods, and in many cases (for the Gaussian noise model) even provides a superlinear rate of convergence.
In this paper, we first provide (Section 2) a brief description of the two canonical problems we aim to solve, namely the far-field ptychographic object reconstruction problem and the far-field blind ptychography problem. Next, we contextualize the matrix-free LM strategy (Section 3), then detail the implementation of the proposed algorithm (Section 4). Finally, we use a variety of numerical experiments (Section 5) to demonstrate that the LM algorithm makes for a robust and efficient optimization strategy to solve the ptychographic reconstruction problem.

Some canonical reconstruction problems in ptychography
We first provide a short description of the experimental acquisition model considered for the ptychographical reconstruction problems.
In the far-field 2D ptychography experiment, which is the most common variant of ptychography in the literature, we illuminate an unknown 2D object with a coherent probe beam localized to a small area on the object, and record the intensity in the far field using a pixel array detector. Using a raster scan of K spatially overlapping illumination spots, we generate a sequence of K diffraction patterns at the detector plane. In the following sections, we model the object as a 2D grid of N_x × N_y = N pixels represented by the vector O ∈ C^N, and we model the localized probe as a grid of M_x × M_y = M pixels (with M < N) represented by the vector P ∈ C^M. At each illumination position r_k (with k = 1, 2, . . ., K), the binary shift operator S_k (an M × N matrix) extracts the illuminated M object pixels to generate the transmitted wave function

ψ_k = diag(P) • S_k • O,    (2.1)

where diag(P) is an M × M diagonal matrix containing the elements of P in its main diagonal. The expected wavefield intensities at the detector plane are given by

h_k = |F • ψ_k|² + b_k,    (2.2)

where F is the 2D discrete Fourier transform operator, |•| is the element-wise modulus of a vector, and b_k is the (incoherent) experimental background that we shall assume is known and has strictly positive components¹. Since statistical fluctuations associated with the use of a finite number of illuminating photons are inherent to the measurement process, the recorded data y_k necessarily differ from the expected values of Eq. (2.2). In the non-linear minimization approach, which we apply in this work, we account for the noise by defining an "error metric" (or fitting function). The generic form of the error metric is

f = Σ_k Σ_m L(y_k,m, h_k,m),    (2.3)

with y_k,m ∈ y_k, h_k,m ∈ h_k, and where L : R × R → R is derived from the specific model chosen for the noise-driven fluctuations in the measurements. We usually require that L is a strictly convex functional, thereby defining a proper "metric"² between the actual (noisy) measurement y_k,m and the expected quantity h_k,m (holding on average only). The additive structure of Eq. (2.3) allows us to account for all the measurements and produces a single real value which acts as a figure of merit. Concerning the noise model, the Poisson distribution is the natural choice for a photon counting process, so we adopt it here. Following the maximum likelihood principle introduced by Fisher [53], we obtain the error metric [54,22,25,33]

f_p = Σ_k Σ_m [h_k,m − y_k,m ln h_k,m].    (2.4)

In the low photon-count regime, the above error metric is often the metric of choice because it reduces the estimation biases (systematic errors) in CDI reconstructions [22]. As the expected count h_k,m gets higher, we can use the frequently-used least-square functional as a consistent approximation of Eq. (2.4) to give

f_g = (1/2) Σ_k || y_k^{1/2} − h_k^{1/2} ||²,    (2.5)

where || • || is the usual Euclidean norm and (•)^{1/2} is the element-wise square root. This latter metric is derived from an additive perturbation model over the recorded magnitudes y_k,m with the assumption that the perturbation follows a Gaussian distribution with a constant variance. For low to moderate photon-count regimes, we expect a greater bias in the reconstructions obtained using Eq. (2.5) than in those using Eq. (2.4). However, we shall see in Section 5 that the standard least-square metric of Eq. (2.5) leads to faster convergence in general, and essentially identical results in the high-count regime.
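As a concrete illustration of the forward model and the two error metrics, the following NumPy sketch simulates one expected diffraction pattern and evaluates both metrics. All names, array sizes, and the patch-slicing realization of the shift operator are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def forward_intensity(O, P, row, col, b):
    """Expected intensities h_k = |F(diag(P) S_k O)|^2 + b_k  (Eqs. 2.1-2.2).

    O        : (Ny, Nx) complex object
    P        : (My, Mx) complex probe
    row, col : top-left corner of the k-th illuminated patch (shift operator S_k)
    b        : (My, Mx) known, strictly positive background
    """
    My, Mx = P.shape
    patch = O[row:row + My, col:col + Mx]     # S_k O: extract illuminated pixels
    psi = P * patch                           # diag(P) S_k O  (Eq. 2.1)
    return np.abs(np.fft.fft2(psi))**2 + b    # far-field intensity (Eq. 2.2)

def f_gauss(h, y):
    """Gaussian (amplitude) metric, Eq. 2.5: 0.5 * || y^(1/2) - h^(1/2) ||^2."""
    return 0.5 * np.sum((np.sqrt(y) - np.sqrt(h))**2)

def f_poisson(h, y):
    """Poisson metric, Eq. 2.4 (up to a data-dependent constant): sum(h - y ln h)."""
    return np.sum(h - y * np.log(h))

rng = np.random.default_rng(0)
O = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
P = np.exp(1j * rng.uniform(0, 2 * np.pi, (8, 8)))   # unit-modulus toy probe
b = np.full((8, 8), 1e-3)
h = forward_intensity(O, P, 2, 3, b)
y = rng.poisson(h).astype(float)                     # simulated noisy counts
```

Here the binary shift operator S_k is realized implicitly by array slicing, which is how it is typically applied in practice rather than as an explicit M × N matrix.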
In the non-linear minimization paradigm, our solution of the ptychographical reconstruction problem is implicitly defined via the minimization of either Eq. (2.4) or Eq. (2.5) with respect to the unknown quantities (and possibly under additional constraints). More specifically, if the structure of the probe beam is known, the object can be retrieved by numerically solving the unconstrained minimization problem

min_{O ∈ C^N} f_•(O),    (2.6)

where f_• stands for f_p or f_g, and where the dependence on the quantity of interest O was given by Eqs. (2.1) and (2.2). For the sake of clarity, we call Eq. (2.6) the "standard ptychographic reconstruction" problem (SPR). The problem above, equipped with the functional f_g, is an example of the "coded diffraction pattern" problem that has been extensively analyzed in recent phase retrieval literature [23,24,27].
If the structure of the probing field is not completely known, we need to retrieve the object as well as the probe from the diffraction dataset. This "blind ptychographic reconstruction" problem (BPR) is structurally distinct from the SPR problem (or the general phase retrieval problem): it allows for a much larger set of ambiguous solutions, and is thus much more difficult than the SPR problem [32,55]. Therefore, the recent literature solves the BPR problem by taking into account not only the diffraction dataset but also a priori information about the object and the probe, such as spatial or spectral support, non-negativity, or magnitude constraints, typically via proximal operators [32,33,55]. In this approach, the BPR problem is then a constrained minimization problem of the form

min_{O ∈ 𝒪, P ∈ 𝒫} f_•(O, P),    (2.7)

where 𝒪 ⊂ C^N and 𝒫 ⊂ C^M are closed convex sets associated with the object and probe constraints, respectively. In keeping with the existing blind ptychography literature, we only report numerical results for our constrained formulation of the BPR problem. If these constraints are not applied, the possible solution space becomes very large (due to scaling ambiguities) and the BPR problem can be harder to solve. We can use either the Poisson error metric of Eq. (2.4) or the Gaussian error metric of Eq. (2.5) to solve either the SPR case (Eq. (2.6)) or the BPR case (Eq. (2.7)). These problem instances thus define the four canonical problems that we aim to solve via a fast and computationally efficient algorithm.
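For many common constraints, the projections onto the closed convex sets for the object and probe can be computed in closed form. The sketch below shows two such projection operators; the specific constraint choices (a magnitude bound and a support mask) are illustrative assumptions of ours, not the constraints used in the paper's experiments:

```python
import numpy as np

def project_magnitude(z, mmax):
    """Projection onto the convex set {z : |z_n| <= mmax}:
    shrink over-long amplitudes to mmax while preserving the phases."""
    mag = np.abs(z)
    scale = np.where(mag > mmax, mmax / np.maximum(mag, 1e-30), 1.0)
    return z * scale

def project_support(z, support):
    """Projection onto {z : z_n = 0 outside the support mask} (0/1 mask)."""
    return z * support

rng = np.random.default_rng(1)
O = 3.0 * (rng.standard_normal(100) + 1j * rng.standard_normal(100))
O_proj = project_magnitude(O, 1.0)   # all amplitudes now at most 1
```

Both operators are idempotent, as any projection onto a convex set must be; this is the property the projected-gradient extension discussed later relies on.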

The principle of a matrix-free Levenberg-Marquardt strategy
As in any phase retrieval problem, both the SPR and BPR problems are NP-hard and, in general, cannot be solved exactly in polynomial time [56]. We thus resort to "gradient-based" minimization strategies with good local convergence properties, i.e., algorithms that ensure that any stationary point is a local minimizer of the considered problem. Since the considered error metrics³ f are defined over a set of complex-valued parameters z ∈ C^n, we rely on the Wirtinger (or CR calculus) extension to the notion of the derivative, wherein we regard f as a function of two variables [57,58,59]. We accomplish this by writing f as a function of the real part R[z] and the imaginary part I[z] of z, where R[•] and I[•] denote the element-wise operations⁴. To clarify that the optimization is accomplished entirely via real-valued coordinates, we define the new vector z̃ ≡ (R[z]^T, I[z]^T)^T ∈ R^{2n} (hereafter, we simply write z for this real-valued vector). The gradient ∇f(z) and any subsequent higher-order derivatives are then all real-valued. With this definition in hand, we can now derive any gradient-based nonlinear minimization method from a second-order expansion of f around an arbitrary point z [30]:

f(z + ∆z) ≈ f(z) + ∇f(z)^T • ∆z + (1/2) ∆z^T • M(z) • ∆z,    (3.1)

where M(z) is a matrix of size 2n × 2n that describes the local "curvature" of the objective function.
While the canonical quadratic approximation of f(z) uses M(z) = ∇²f(z), with ∇²f(z) the "Hessian" matrix, we can also use other choices of the matrix M(z) to get alternative quadratic approximations of f(z). As long as our choice of the curvature matrix M(z) is positive semidefinite, the quadratic approximation obtained is easy to minimize, and the minimizing step thus obtained can be used to define a descent step for f(z). In fact, different choices for M give different descent steps, and, as such, comprise different optimization methods, as we discuss in the following sections.
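The real-valued reparameterization used above can be sketched as follows. The finite-difference gradient is only a stand-in for the AD machinery used in the paper, and all names are our own:

```python
import numpy as np

def to_real(z):
    """Stack real and imaginary parts: z in C^n  ->  z_tilde in R^(2n)."""
    return np.concatenate([z.real, z.imag])

def to_complex(zt):
    """Inverse map: z_tilde in R^(2n)  ->  z in C^n."""
    n = zt.size // 2
    return zt[:n] + 1j * zt[n:]

def grad_real(f, zt, eps=1e-6):
    """Real-valued gradient of f over the stacked coordinates, by central
    differences (a stand-in for reverse-mode AD)."""
    g = np.zeros_like(zt)
    for i in range(zt.size):
        e = np.zeros_like(zt)
        e[i] = eps
        g[i] = (f(zt + e) - f(zt - e)) / (2 * eps)
    return g

# Example: f(z) = ||z||^2 has the real-valued gradient 2 * z_tilde.
z = np.array([1.0 + 2.0j, -0.5 + 0.25j])
f = lambda zt: np.sum(np.abs(to_complex(zt))**2)
g = grad_real(f, to_real(z))
```

In this representation the gradient and all higher-order derivatives are ordinary real-valued objects, so standard real optimization theory applies directly.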

From first-order to second-order methods
We can make a simple choice for the quadratic approximation in Eq. (3.1) by discarding any anisotropy and coupling in the local curvature of f and setting M = (1/α) I, where α > 0 controls the curvature magnitude. This choice leads to the update step

∆z = −α ∇f(z),    (3.2)

which is none other than the "steepest descent" update step. In the phase retrieval context, a number of classic algorithms (such as ER [16]), as well as some recent algorithms (such as Wirtinger flow [23], reshaped Wirtinger flow [24], and others), can be interpreted as variations of this steepest descent method, just with different initializations and error metrics. There also exist stochastic minibatch (or sub-sampled) variations of these steepest descent methods that use a subset of the full dataset to calculate each update (such as PIE/ePIE [60] and minibatch reshaped Wirtinger flow [24]). These methods are easy to implement but can require a large number of iterations (the number of which depends strongly on the step size α) to converge to a solution. Even if we choose the optimal step size at every update, these algorithms exhibit, at best, a linear rate of convergence, unlike second-order methods, which can provide superlinear or quadratic rates of convergence [30,61].
In the optimization literature, the simple steepest descent scheme of Eq. (3.2) has, to a large extent, been superseded by more sophisticated "accelerated" first-order optimization techniques. Many of these methods have also been applied to the phase retrieval problem: these include nonlinear conjugate gradient methods [16,19,54,62], heavy-ball momentum and Nesterov's accelerated gradient (NAG) methods [26,63,64], and the Adam method [65,39]. While these algorithms are easy to implement and have low per-iteration computational cost, they are essentially attempting to adapt to the geometry of f(z) by utilizing only the gradient information (from current and prior iterations) [66]. Consequently, to the degree that such approximations do not accurately capture the local curvature, such algorithms will display less improvement per iteration than a pure second-order optimization method.
To develop a second-order optimization method, we rely on the canonical quadratic approximation of Eq. (3.1), with M(z) = ∇²f(z) as the Hessian matrix, to fully capture the local curvature information at z. Assuming that this Hessian is full rank, the so-called Newton's step minimizing the resulting quadratic approximation is ∆z = −[∇²f(z)]⁻¹ • ∇f(z). Provided z is close enough to a local isolated minimizer, Newton's approach utilizes the full local curvature captured in ∇²f(z) to attain a fast (typically quadratic) rate of convergence [30, Chapter 3]. This very good local convergence property is generally offset by prominent, long-known robustness issues [30, Chapter 6]. In particular, when ∇²f(z) is not positive semi-definite (PSD), which is often the case for non-convex objective functions, the step may not be a descent step at all. To address this issue, we can resort to a PSD approximation of the Hessian, which is the idea behind the "Generalized Gauss-Newton" method.

From Gauss-Newton to constrained Levenberg-Marquardt
The classical Gauss-Newton matrix arises in the context of nonlinear least-squares minimization problems, and can be derived when the error metric takes the form L(h) = (1/2) ||h − y||², so that the objective function f reads

f(z) = (1/2) ||h(z) − y||².    (3.3)

Let us introduce the "Jacobian matrix" J(z) ∈ R^{m×2n} defined element-wise as [J(z)]_ij = ∂h_i/∂z_j.
The first- and second-order derivatives of Eq. (3.3) can be shown [30,43] to be

∇f(z) = J(z)^T • ∇_h L,   ∇²f(z) = J(z)^T • J(z) + Σ_j (h_j − y_j) ∇²h_j(z),    (3.4)

where ∇_h L = h − y is the gradient of L with respect to h, and ∇²h_j is the Hessian of the j-th component of h with respect to z. When the "residuals" (h_j − y_j) are small, or when the model h is almost linear locally so that the ∇²h_j terms are small, the PSD Gauss-Newton (GN) matrix G(z) := J(z)^T • J(z) is a close approximation of the local Hessian. In a GN optimization algorithm, the quadratic surrogate Eq. (3.1) is built with M(z) ≡ G(z), so that the update ∆z is the solution of the linear system

G(z) • ∆z = −∇f(z).    (3.5)

In the early 2000's, Schraudolph [48] generalized the GN method to arbitrary error metrics L that are convex in h. For any such L, the Generalized Gauss-Newton (GGN) matrix is defined as

G(z) := J(z)^T • ∇²_h L • J(z),    (3.6)

where ∇²_h L is the Hessian of L(h) with respect to h. The Hessian of f(z) then reads

∇²f(z) = G(z) + Σ_j [∇_h L]_j ∇²h_j(z),    (3.7)

where G(z) is now given by Eq. (3.6). Similar to the classical GN derivation, the GGN strategy drops the second term in the above Hessian to build the update. The GGN step ∆z minimizing the surrogate Eq. (3.1) solves the linear equation G(z) • ∆z = −∇f(z). Since L(h) is convex, the GGN matrix G(z) is PSD and the solution −G(z)⁻¹ • ∇f(z) provides a well-behaved, locally decreasing update direction. This basic GGN method provides a number of advantages over first-order methods as well as Newton's method in many situations. However, if the residuals of Eq. (3.3) are large, or if the Jacobian matrix is ill-conditioned [30,43], the method may be unstable. To address this deficiency, we explore a popular variation of the basic GN algorithm: the Levenberg-Marquardt method.
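The key structural point is that the GGN product G(z) • v = J^T (∇²_h L (J v)) can be evaluated from Jacobian-vector and vector-Jacobian products alone, without ever forming G or J. A minimal sketch on a toy model h(z) = (Az)², with hand-coded jvp/vjp routines standing in for AD (the model and names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
z = rng.standard_normal(3)

# Toy model h(z) = (A z)^2 elementwise; its Jacobian is J = 2 diag(A z) A.
def jvp(v):
    """Jacobian-vector product J v, computed without forming J."""
    return 2 * (A @ z) * (A @ v)

def vjp(u):
    """Vector-Jacobian product J^T u, computed without forming J."""
    return A.T @ (2 * (A @ z) * u)

def ggn_vec(v, hess_L_diag):
    """Matrix-free GGN product  G v = J^T (grad^2_h L) (J v)  (cf. Eq. 3.6).
    hess_L_diag is the diagonal of the (diagonal) Hessian of L w.r.t. h."""
    return vjp(hess_L_diag * jvp(v))

# For the Gaussian metric L(h) = 0.5*||h - y||^2, grad^2_h L is the identity.
v = rng.standard_normal(3)
Gv = ggn_vec(v, np.ones(5))

J = 2 * np.diag(A @ z) @ A   # explicit Jacobian, built here for checking only
```

The memory cost of `ggn_vec` is that of a few vectors, independent of whether G would be dense, which is the property the matrix-free LM solver exploits.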
In the classical Levenberg-Marquardt (LM) algorithm, we obtain the iterative updates by solving the linear equation

(G(z) + λI) • ∆z = −∇f(z),    (3.8)

which is an interpolation between the steepest descent update (Eq. (3.2)) and the GN update (Eq. (3.5)), with λ > 0 the interpolation parameter [42,67,68,30]. The value of λ, which we adjust at every step, indicates the extent to which we trust the minimizer of the GN quadratic approximation to minimize the true objective function f(z). When λ is very small, the GN term dominates, and the LM step is approximately along the GN update direction; when λ is very large, the LM step is approximately along the steepest descent direction (with the step size 1/λ). This adjustment allows the algorithm to identify minimizing steps even if the residuals are large or if the Jacobian matrix is ill-conditioned.
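This interpolation behavior is easy to verify numerically. In the sketch below (with a PSD stand-in matrix of our own construction), a very small λ reproduces the GN step while a very large λ reproduces a short steepest-descent step:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
G = B.T @ B + 0.1 * np.eye(4)   # PSD stand-in for the (generalized) GN matrix
g = rng.standard_normal(4)      # stand-in gradient

def lm_step(G, g, lam):
    """Levenberg-Marquardt step: solve (G + lam*I) dz = -g."""
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), -g)

dz_gn = np.linalg.solve(G, -g)    # pure Gauss-Newton step (lam -> 0 limit)
dz_small = lm_step(G, g, 1e-8)    # approximately the GN direction
dz_large = lm_step(G, g, 1e8)     # approximately -g / lam (steepest descent)
```

In a full LM iteration, λ is raised when a step fails to decrease f and lowered when it succeeds, so the solver slides continuously between these two regimes.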
The LM algorithm has been established as a workhorse for nonlinear least-squares minimization applications (with the classical GN matrix). Recent works have extended the LM method by including a "projected gradient" [61] approach within the LM framework [50,51] to ensure convergent descent for minimization problems with convex constraints. Furthermore, the LM algorithm has also been successfully applied to minimization problems with more general error metrics (with the GGN matrix) [69,70].
For this work, we implement a generalized LM algorithm that includes the GGN adaptation along with the projected gradient extension; this enables its use in general minimization problems with convex constraints.We present the algorithmic details in Section 4.

A truncated, matrix-free Levenberg-Marquardt method
At first glance, the LM method inherits computational bottlenecks that have long been associated with second-order optimization strategies: the computational and memory costs required to calculate the curvature matrix grow quadratically with the problem dimension. Indeed, at each LM iteration we need to calculate the full 2n × 2n element GGN matrix of Eq. (3.8), which can be prohibitively expensive even for moderate problem dimensions. In this work, we circumvent this computational difficulty by using two key ingredients that work in conjunction.
A first ingredient toward a computationally efficient LM approach is to resort to a "truncated" version of the method, which means that we give up on the idea of calculating the LM step ∆z by exactly solving the linear system in Eq. (3.8) [71,72]. Instead, we compute an inexact (but sufficiently accurate in practice) solution by using an iterative solver. To achieve this, we use the conjugate gradient (CG) method, which has been successfully used since the 1980's in very similar situations (see [73] and references therein). The resulting LM method now contains a minor (nested) loop solving for ∆z in each LM update. This nested loop is stopped when the following stopping rule is met:

||(G(z) + λI) • ∆z + ∇f(z)|| ≤ η ||∇f(z)||,

where η ∈ (0, 1). This termination condition ensures that the LM procedure remains globally convergent with, under optimal conditions, a superlinear local convergence rate [74].
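A minimal truncated CG solver with this relative-residual stopping rule might look as follows; `matvec` stands for the matrix-free product with the damped curvature matrix, and the explicit test matrix is for illustration only:

```python
import numpy as np

def truncated_cg(matvec, grad, eta=0.1, max_iter=100):
    """Inexact CG solve of  M dz = -grad, stopped once
    ||M dz + grad|| <= eta * ||grad||  (the truncation rule above)."""
    dz = np.zeros_like(grad)
    r = -grad - matvec(dz)              # residual of the system M dz = -grad
    p = r.copy()
    tol = eta * np.linalg.norm(grad)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:    # relative-residual truncation
            break
        Mp = matvec(p)
        alpha = (r @ r) / (p @ Mp)
        dz += alpha * p
        r_new = r - alpha * Mp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return dz

rng = np.random.default_rng(4)
B = rng.standard_normal((6, 6))
M = B.T @ B + np.eye(6)                 # PSD stand-in for G(z) + lam*D
grad = rng.standard_normal(6)
dz = truncated_cg(lambda v: M @ v, grad, eta=0.1)
```

Only matrix-vector products with M appear in the loop, so the same code runs unchanged whether `matvec` multiplies an explicit matrix or calls an AD-based matrix-free routine.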
By itself, however, the nested solver does not prevent the memory requirements from being prohibitive. For instance, each CG iteration requires a matrix-vector multiplication that involves the full GGN matrix. For "real-world" ptychographic problems, the storage of such a matrix is not a realistic option. We solve this second computational bottleneck by using the AD framework, as it allows the method to be matrix-free in the sense that none of the GGN matrices involved in the LM iterations are ever stored or even built. Instead, for some v ∈ R^{2n}, we directly access the "matrix-vector products" G(z) • v required in the linear (CG) solver via the reverse-mode AD method. Calculating these matrix-vector products through such a matrix-free method requires a computational cost that is larger than that for the gradient calculation by only a small multiplicative factor [75]. Appendix A details the mechanism at work with reverse-mode AD to compute matrix-free, generic vector multiplication with Jacobian and Hessian operators.
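As a self-contained stand-in for the AD machinery, the sketch below computes a matrix-free Hessian-vector product by central differences of the gradient; a reverse-mode AD framework (e.g., JAX or PyTorch) returns the same product exactly, at a small constant multiple of one gradient evaluation. The toy quadratic and all names are our own assumptions:

```python
import numpy as np

def hess_vec_fd(grad_f, z, v, eps=1e-6):
    """Matrix-free Hessian-vector product H(z) v via central differences of
    the gradient. Only gradient evaluations are needed; the Hessian matrix
    itself is never formed or stored."""
    return (grad_f(z + eps * v) - grad_f(z - eps * v)) / (2 * eps)

# Toy quadratic f(z) = 0.5 * z^T Q z, whose Hessian is exactly Q.
rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
Q = B.T @ B
grad_f = lambda z: Q @ z
z = rng.standard_normal(5)
v = rng.standard_normal(5)
Hv = hess_vec_fd(grad_f, z, v)
```

The finite-difference version is shown only because it is dependency-free; in practice the AD route is preferred since it is exact and avoids the step-size tuning inherent to differencing.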
A few additional considerations are required to deal with poorly scaled problems, i.e., problems where the changes to z in a certain direction produce much larger variations in the value of f than do changes in other directions of z [30, Chapter 2]. In such a situation, the GGN matrix in Eq. (3.8) is ill-conditioned and the CG solver may not find a viable update direction even after a large number of iterations. The resulting LM updates may converge slowly, or even fail to converge. One way to address this scenario is by replacing the identity matrix in Eq. (3.8) by the diagonal matrix D = Dg(G(z)), where Dg(•) is a diagonal matrix built from the main diagonal of the square matrix given as an argument. This modification makes the algorithm invariant under diagonal scaling of the variables [68]. In addition, we may also use the preconditioned CG (PCG) method to substantially accelerate the convergence of the linear solver computing the inexact update ∆z. In this work, we follow an existing example [49] and test the simple diagonal (Jacobi) preconditioner Dg(G(z) + λD) (calculated analytically in Appendix B), and find that it provides efficient convergence.

Matrix-free LM algorithm for the canonical problems
We now provide a detailed description of the matrix-free truncated LM algorithms used in this work. For notational simplicity, as in the previous section, the vector z is used hereafter as a generic shorthand for the set of CR-valued parameters that are optimized, i.e., we have z = Õ for the SPR problem or z = (Õ, P̃)^T for the BPR problem.

From intensity to magnitude-based LM updates
The LM method has long been established as a versatile, fast, and provably convergent solver for NLSQ minimization [30,43]. In addition, as explained in Section 3.2, the method can be extended beyond the usual quadratic error metrics via the GGN formulation. As a direct consequence, both the error metrics f_g and f_p defined in Section 2 can be minimized within this framework. We also gather from Section 3.2 that, for a given error metric f_•, the LM update is not unique: any functional decomposition of the error metric that preserves a positive definite central part in the GGN matrix defines a legitimate, yet specific, LM strategy minimizing f_• [48]. For instance, we define Eq. (2.3) via the intermediate (physically relevant) intensity variables h_k,m, hence providing "intensity-based" formulations of the error metrics Eqs. (2.4) and (2.5). Since L is strictly convex, the functional decomposition of Eq. (2.3) suggests a straightforward, intensity-based LM update. However, as is often the case in mathematical programming, numerical implementations derived from equivalent mathematical formulations can differ substantially in performance in solving the very same problem.
We see this difference in action when we introduce the magnitude of the expected diffracted wave-field⁵, defined as

ζ_j = [h_j]^{1/2},

where j ∈ {1, . . ., K × M} is a single index spanning both the probe position index and the pixel index. We can now introduce equivalent magnitude-based formulations of the intensity-based error metrics in Eqs. (2.4) and (2.5):

f_p(z) = Σ_j [ζ_j² − y_j ln ζ_j²],   f_g(z) = (1/2) Σ_j (y_j^{1/2} − ζ_j)².

We note that f^L_•(z) and f_•(z) are just different functional decompositions of the exact same algebraic expression. As such, they also have identical gradient values at all points (∇f^L_• = ∇f_•). However, we can now define two separate GGN matrices

G^L(z) = J_h(z)^T • ∇²_h L • J_h(z),   G^ζ(z) = J_ζ(z)^T • ∇²_ζ L̃ • J_ζ(z),

where L̃ denotes the error metric expressed as a function of the magnitudes ζ rather than the intensities h, and J_h and J_ζ are the Jacobian matrices of h and ζ with respect to z, respectively.

[Figure 1: Convergence for the standard ptychographic object reconstruction (SPR) problem with a known probe. The numerical experiment uses the parameters described in Section 5 for the n_high setting. For the LM algorithm, the magnitude-based metric f_g enables much faster convergence than the intensity-based f^L_g, while the convergence rate of the NCG method is the same for both metrics.]
When we closely examine the expressions for G^L(z) and G^ζ(z), we find that G^ζ(z) more accurately approximates the true Hessian ∇²f(z) than G^L(z) does. In practice, this means that the magnitude-based LM strategy shows a much faster rate of convergence than the intensity-based strategy, as we can see in the numerical results shown in Fig. 1; we analyze this result in Appendix C. For convenience, in the following sections, we strictly use only the magnitude-based LM strategy and use the simplifying notations J(z) ≡ J_ζ(z) and G(z) ≡ G^ζ(z). As a result, the linear system to be solved is

(G(z) + λI) • ∆z = −∇f(z).    (4.6)

Thanks to the AD implementation, all of the various linear systems derived from our magnitude-based formulation of Eq. (4.6) can be solved iteratively in a matrix-free fashion, and none of the derivatives involved have to be obtained analytically beforehand. We can also use the diagonal matrix D = Dg(G(z)) in place of the identity matrix in the left-hand side of the LM system of Eq. (4.6) to deal specifically with poorly-scaled problems, as shown in Sec. 3.3.
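The two functional decompositions can be compared numerically. The sketch below (toy data of our own) evaluates the Gaussian metric through both the intensity variables h and the magnitude variables ζ = h^{1/2}: the values coincide, but the curvature of L with respect to h is state-dependent while the curvature of the magnitude-based residual is constant, which is what drives the different GGN approximations:

```python
import numpy as np

rng = np.random.default_rng(6)
h = rng.uniform(0.5, 2.0, 8)     # expected intensities (toy values)
y = rng.uniform(0.5, 2.0, 8)     # measured intensities (toy values)

# Intensity-based decomposition: L acting on h.
f_intensity = 0.5 * np.sum((np.sqrt(y) - np.sqrt(h))**2)

# Magnitude-based decomposition: the same metric written in zeta = h^(1/2).
zeta = np.sqrt(h)
f_magnitude = 0.5 * np.sum((np.sqrt(y) - zeta)**2)

# Curvature of L w.r.t. h is diagonal and state-dependent:
#   d^2L/dh^2 = 0.25 * sqrt(y) * h^(-3/2),
# whereas the magnitude-based residual has constant (identity) curvature.
d2L_dh2 = 0.25 * np.sqrt(y) * h**(-1.5)
```

Both decompositions therefore yield the same objective and gradient; only the central (curvature) block of the GGN matrix differs between them.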

Implementation of the truncated, matrix-free Levenberg-Marquardt approach
We provide below, in Algorithm 1, the main solver we use to address both the SPR (Eq. (2.6)) and BPR (Eq. (2.7)) canonical problems introduced in Sec. 2.
Algorithm 1 (truncated LM algorithm): common structure for both the SPR and the BPR problems. For the BPR problem, the complementary unit given in Algorithm 2 ensures the constrained update. Key steps of the listing include calculating the PCG preconditioner P_t, calculating the actual and predicted reductions, and setting the gain ratio ρ_t = Δf_t^a / Δf_t^p.

This LM algorithm implements the magnitude-based formulation given in the preceding section by combining the basic truncation algorithm from [74] with the diagonal scaling from [68], the GGN extension from [70], and the projected gradient plug-in from [51]. As explained in Sec. 3.3, each LM update rests on the inexact, matrix-free solution of the linear system of Eq. (4.6) using PCG with the diagonal preconditioner P = Dg(G(z) + λD). We use a warm-start procedure to initialize the PCG inner loop, and control the accuracy of the inexact solution via the termination condition of Eq. (4.8), with η calculated according to [51]; the parameter β > 0 determines the accuracy of the PCG update and thereby also the balance between the number of inner CG iterations and outer LM iterations. Individual (inner) PCG iterations are much cheaper than the outer LM iterations, so it makes sense to prioritize the accurate solution of Eq. (4.8). After some testing, we set β = 0.1 to provide a good balance between the inner and outer iterations for the Gaussian error metric. However, since the GGN matrix for the Poisson error metric is less accurate than that for the other error metrics, and since the LM linear system for this metric may be difficult to solve accurately, emphasizing highly accurate PCG updates in this case is likely to impose a large computational cost while producing diminishing returns with respect to the objective function. After some testing, we determined that a value of β = 0.9 is appropriate for the Poisson error metric, enabling more frequent outer updates of the overall GGN matrix. We note that the choice of β only affects the computational cost of the LM algorithm; the algorithm remains convergent as long as η ∈ (0, 1). Finally, we note that there is no obvious indication in Algorithm 1 that it is an AD-based matrix-free implementation. Instead, this is implied in Line 11, where the gradient and any matrix-vector products involving the Jacobian in Eq. (4.6) are computed "on the fly" with AD.
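To make the structure of Line 11 concrete, the sketch below implements one truncated LM update in plain numpy, assuming the caller supplies callables `f`, `grad`, and `ggn_mv` (a matrix-free GGN-vector product). The function names, the gain-ratio acceptance threshold, the λ update factors, and the forcing-sequence choice for η are illustrative stand-ins, not the paper's exact settings (which follow [51,74]).

```python
import numpy as np

def cg(matvec, b, tol, max_iter=50):
    # Matrix-free conjugate gradient for matvec(x) = b, truncated once the
    # residual norm falls below tol (the inexactness condition of Eq. 4.8).
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def lm_step(f, grad, ggn_mv, z, lam, beta=0.1):
    # One truncated LM update: solve (G + lam*I) dz = -g inexactly with CG,
    # then accept/reject based on the gain ratio rho = actual/predicted reduction.
    g = grad(z)
    gnorm = np.linalg.norm(g)
    eta = min(0.5, beta * np.sqrt(gnorm))  # forcing sequence (illustrative choice)
    dz = cg(lambda v: ggn_mv(z, v) + lam * v, -g, tol=eta * gnorm)
    pred = -(g @ dz) - 0.5 * dz @ (ggn_mv(z, dz) + lam * dz)  # model reduction
    actual = f(z) - f(z + dz)
    rho = actual / pred if pred > 0 else -1.0
    if rho > 1e-4:                      # accept the step, relax the damping
        return z + dz, max(lam / 3.0, 1e-12)
    return z, lam * 2.0                 # reject the step, increase the damping
```

On a linear least-squares test problem, where the GGN matrix is the exact Hessian, repeated calls to `lm_step` drive the iterate to the least-squares solution.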

An additional "plug-in" to deal with convex constraints
In both of the canonical cases considered here, the SPR and the BPR problem (Section 2), Algorithm 1 is common to our LM approach. However, an additional consideration is needed in the specific case of the BPR problem (Eq. (2.7)), because the sample O and probe P updates need to be constrained to their respective convex sets O and P. To achieve that aim, let us define the projection Π_Z of the current estimate onto the convex set of constraints Z := O ∩ P, as in Eq. (4.9). Following [50] and [51], we can obtain a constrained and convergent LM update under these conditions with the following strategy: (i) the unconstrained LM update from Algorithm 1 is projected via Eq. (4.9); (ii) if this projected LM step does not decrease the error metric, and if the search direction (at optimization iteration t) s_t = Π_Z(z_t + Δz_t) − z_t is a descent direction for f (Step 5 in Algorithm 2), we perform a line search along s_t [50, Section 4]; (iii) if s_t is not a descent direction, then we perform a standard projected gradient step. The "line search" step, with the search direction Δz_ls,t, refers to a backtracking line search that calculates the step size α_t so that α_t satisfies the Armijo criterion of Eq. (4.10), where 0 < σ ≪ 1. The "projected gradient" step refers to the same line search with the search direction Δz_ls,t = −∇f(z). This strategy is detailed in Algorithm 2.
This simple projection strategy retains the pivotal assets of the LM approach: (i) the iteration in Algorithm 2 is globally convergent, and (ii) the algorithm can still preserve a superlinear convergence speed as long as the solution is not at the boundary of the constraint set [50,51].
Algorithm 2 (constrained update plug-in): the projected LM step is tried first. If it does not decrease the error metric but s_t is a descent direction, a step size α_t satisfying the Armijo criterion in Eq. (4.10) is calculated for the search direction Δz_ls,t = s_t, and the update z_{t+1} = z_t + α_t s_t is applied. Otherwise, the step size α_t is calculated for the projected-gradient direction Δz_ls,t = −∇f(z), and the update z_{t+1} = z_t + α_t Δz_ls,t is applied.

In practice, since the probe and object constraint sets (P and O respectively) are separable, we can apply the projector in Eq. (4.9) simply by performing the independent projections onto P and O.
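The projection-plus-line-search strategy of Algorithm 2 can be sketched as follows. The elementwise magnitude constraint used here stands in for the sets of Eq. (5.1), and the Armijo constants and helper names are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def project_magnitude(z, rmax=1.0):
    # Projection onto the convex set {z : |z_i| <= rmax} (an object-style
    # magnitude constraint); separable, so it acts elementwise.
    mag = np.maximum(np.abs(z), 1e-30)  # guard against division by zero
    return z * np.minimum(1.0, rmax / mag)

def armijo_backtrack(f, z, g, d, sigma=1e-4, shrink=0.5, max_iter=30):
    # Backtracking line search: find alpha with
    # f(z + alpha d) <= f(z) + sigma * alpha * Re<g, d>, where Re<g, d> < 0.
    fz = f(z)
    slope = np.real(np.vdot(g, d))
    alpha = 1.0
    for _ in range(max_iter):
        if f(z + alpha * d) <= fz + sigma * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha

def constrained_update(f, grad, z, dz, project):
    # Algorithm-2-style plug-in: try the projected LM step first, fall back
    # to a line search along s_t, then to a projected gradient step.
    z_trial = project(z + dz)
    if f(z_trial) < f(z):
        return z_trial
    s = z_trial - z
    g = grad(z)
    if np.real(np.vdot(g, s)) < 0:      # s_t is a descent direction
        return z + armijo_backtrack(f, z, g, s) * s
    d = -g                              # projected-gradient fallback
    return project(z + armijo_backtrack(f, z, g, d) * d)
```

Because the constraint sets are separable, a BPR-style projector would simply call `project_magnitude` independently on the object and probe blocks of the variable.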

Update schemes for the BPR problem
For the BPR problem, the scheme outlined in Algorithms 1 and 2 describes a joint optimization scheme that simultaneously optimizes both the sample O and the probe P. In contrast, historical [31,20] and some popular modern [76] BPR approaches update O and P in an alternating fashion. In Algorithm 3 we therefore also provide an alternating-update LM scheme for the BPR problem. Even though such an alternating scheme is often effective for first-order updates, it discards the sample-probe coupling information in the second-order curvature matrix, which can be detrimental to the speed of second-order minimization algorithms. However, the sample and probe variables can have totally different scalings, so the joint curvature matrix can be badly conditioned even when the separate sample and probe curvature matrices are individually well-conditioned. Consequently, the PCG solution of the LM sub-problem of Eq. (4.6) for the joint optimization scheme can converge very slowly, depending on the matrix conditioning. We can address this with appropriate preconditioning, as discussed below.

Scaling and preconditioning with AD
Practically speaking, AD frameworks dramatically simplify the implementation of any gradient-based iterative solver [39]. For instance, if we use neither the scaling [i.e., D = I in Eq. (4.6)] nor the CG preconditioning within the LM algorithm, then the AD-based implementation of Algorithm 1 requires no analytical calculation whatsoever. Unfortunately, this is no longer the case if we use the scaling D = Dg(G(z)) and the CG preconditioner P = Dg(G(z) + λD): the required quantity Dg(G(z)) is not a "natural output" of the AD framework, so we derive it analytically (see Appendix B).
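When the analytical expression for Dg(G(z)) is unavailable, a stochastic matrix-free estimate of the kind referenced later in the Discussion [80,81,49] can stand in for it. The Hutchinson-style sketch below is our illustration of that option, not part of the paper's implementation; it needs only the GGN matrix-vector product already provided by AD.

```python
import numpy as np

def estimate_diagonal(matvec, n, n_samples=200, rng=None):
    # Hutchinson-style diagonal estimator: diag(G) = E[v * (G v)] for
    # Rademacher vectors v, computed from matrix-vector products alone.
    rng = np.random.default_rng(rng)
    d = np.zeros(n)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=n)
        d += v * matvec(v)
    return d / n_samples
```

For an exactly diagonal operator the estimator is exact after a single sample; for general GGN matrices the variance decreases with the number of samples.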

Numerical Experiments
To test our proposed algorithms, we simulate a far-field transmission ptychography experiment and perform SPR and BPR with a variety of state-of-the-art reconstruction algorithms. Our simulations use a 160 × 160 pixel test object, shown in Fig. 2(a,d), placed at the center of a 224 × 224 pixel bounding box with the "bright-field" boundary condition applied [55] (i.e., with a phase-less support set to 1.0). We scan the object using a 64 × 64 pixel probe generated by defocusing an Airy wavefront; the probe is shown in Fig. 2(b,e). A ptychographic raster grid is then obtained by translating the probe horizontally and vertically in steps of 5 pixels, thus obtaining a dataset with a total of 1024 noise-free intensity (diffraction) patterns.
We also consider three different levels of Poisson counting noise depending on the integrated intensity of the probe: a "low" signal-to-noise ratio (SNR) setting with 10³ probe photons (with a fluence of n_low = 37.9 photons per object pixel), a "moderate" SNR case with 10⁴ probe photons (n_mod = 379 photons/pixel), and a "high" SNR case with 10⁶ probe photons (n_high = 3.79 × 10⁴ photons/pixel). For all these simulation settings, we set a constant background level of b_k = 10⁻⁸ photons per object pixel. With these datasets in hand, we attempt to solve the canonical problems defined in Sec. 2. For each problem instance and SNR scenario, we generate five different uniformly random complex arrays with magnitude ≤ 1 as the object initial guess. With each such object initialization, we run 1000 iterations of reconstruction. For the (unconstrained) SPR problem, we use the true probe, and the algorithm only updates the object guess. For the (constrained) BPR problem, which retrieves the probe as well, we use the constraint sets of Eq. (5.1). The initial guess for the probe wavefront is a single-mode phase-less circular aperture with its width equal to the central lobe of the Airy beam (before defocus), and with its integrated intensity set to that of the true probe wavefront, as shown in Fig. 2(c,f). At the end of every iteration, we assess the progress made by the optimization process: we first use the subpixel registration algorithm [77] to calculate the normalized object reconstruction error from the ground truth, and then estimate the mean object reconstruction error O_t (where t indexes the iteration number) by averaging the errors from the five independent object initializations. For the BPR problem, we also record the mean probe reconstruction error P_t.
We estimate the evolution of the computational cost of the optimization process by tracking the number of floating point operations (flops) required per iteration of the algorithm.For example, for the LM algorithm, we estimate the computational cost for a given iteration by tracking: a) the number of λ updates required in this iteration, b) the number of CG iterations required for each such λ update, and c) the number of projected gradient line search iterations required in this iteration.We separately estimate the number of flops required for each CG iteration, for any extra computation required (outside of the inner CG loop) at each λ update, and for each iteration of the projected gradient linesearch.By combining these appropriately, we get the overall computational cost for the iteration.
Finally, to define a convergence indicator for the optimization procedure, we use a sliding window over the time series for O_t to calculate the root-mean-square value

R_j = sqrt( (1/W) Σ_{t ∈ window j} (O_t − Ō_j)² ),

where W = 100 is the window size, j indexes the current window, and Ō_j is the mean of the values in the current window. The optimization procedure can then be said to have "converged" at iteration j if R_j ≤ c, where 0 < c ≪ 1 is a constant. For the subsequent analyses, we define the point of convergence using c = 10⁻³, c = 2 × 10⁻³, and c = 3 × 10⁻³ for the n_high, n_mod, and n_low cases respectively.
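The sliding-window convergence test can be written as the short routine below; comparing the window RMS directly against c is our reading of the criterion, since the original displayed inequality is not reproduced here.

```python
import numpy as np

def converged_at(errors, window=100, c=1e-3):
    # Sliding-window convergence test: at each window j, compute the RMS
    # deviation of the error series about its window mean and compare it
    # against the threshold c. Returns the first iteration index at which
    # convergence is declared, or None if it never is.
    e = np.asarray(errors, dtype=float)
    for j in range(window, len(e) + 1):
        w = e[j - window:j]
        rms = np.sqrt(np.mean((w - w.mean()) ** 2))
        if rms <= c:
            return j
    return None
```

A flat error series converges as soon as the first window fills, while a still-decreasing series does not, which is the intended behavior of the indicator.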

Performance of the matrix-free LM solver for the SPR problem
In this subsection, we present the performance analyses of the LM solvers applied to the SPR problem with the Gaussian and Poisson error metrics. Here, we compare the basic truncated LM algorithm (denoted LM) and the LM algorithm implemented with diagonal preconditioning and scaling (denoted PLM) with the following state-of-the-art first-order algorithms (described in Appendix F):

• NCG/PNCG: the nonlinear conjugate gradient (NCG) algorithm and the preconditioned NCG algorithm, for both the Gaussian and Poisson error metrics.
• NAG: Nesterov's accelerated gradient method for the Gaussian error metric.

The Gaussian error metric
In Fig. 3, we demonstrate the performance of the LM method for the SPR problem with the Gaussian error metric. Here, subplots (a, d, g) clearly show that, for all noise levels, the LM algorithms converge to the solution within a few iterations, displaying a convergence rate significantly faster than any of the first-order algorithms tested. In fact, these subplots indicate that the LM iterates have the expected superlinear rate of convergence (i.e., the error decreases at a faster-than-linear rate in the semilog plot), but this is difficult to verify numerically due to the limited dynamic range of the reconstruction problems. Subplots (b, e, h) show that, in terms of the real computational cost (in flops), the NAG algorithm reaches the solution much faster than the other methods for the n_low setting. However, as we increase the SNR, this advantage diminishes, such that at the n_high setting the computational costs of the LM and PLM methods are comparable to those of the NAG and PNCG methods.
Subplots (c, f, i) present the recovered object magnitudes for the PLM algorithm at the various noise levels. Another point of interest is that the PLM algorithm does not necessarily provide an improvement in computational cost over the basic LM method. This result is in keeping with the observation [27] that the SPR problem is generally well-conditioned as long as there is sufficient overlap between adjacent probe positions. In this case, the analytical calculation for the preconditioner is not strictly necessary, and the basic LM algorithm typically suffices. We present the full quantitative results in Appendix G.1.

The Poisson error metric
In Fig. 4 we present the reconstruction results for the SPR problem with the Poisson error metric. In subplots (b, e, h), the lines for the LM (red dots) and PLM (green dashes) algorithms clearly show that the individual LM updates incur a very large computational cost, particularly at the initial stages of the optimization procedure, and that this effect becomes more significant as we increase the dynamic range of the problem. This result stands in stark contrast to the results for the Gaussian error metric (Fig. 3). We examine this discrepancy in Appendix D, where we show that, at points far from the minimum, the GGN approximation does not effectively capture the curvature of the Poisson error metric. This result suggests that the PNCG algorithm could be a much more effective choice if we want to minimize the Poisson model for the unconstrained SPR problem.
For practical use cases (even for the SPR case) we often have experimental constraints (such as the object constraint in Eq. (5.1)) that we want to apply during the optimization procedure, and the basic PNCG algorithm is then no longer a robust option. In such cases, we can perform fast LM optimization by using a surrogate formulation of the Poisson error metric. We obtain this surrogate formulation by redefining the expected magnitude [from Eqs. (2.2) and (4.1)] in terms of a stabilizing parameter ς_t, which we monotonically decrease from a large value to 0 as the optimization iteration t proceeds, so that we revert to the true Poisson metric after a predefined number of optimization iterations (see details in Appendix E). We introduce the notation LM-S and PLM-S to denote the LM algorithms optimizing this surrogate model. The results in Fig. 4 (and Appendix G.1) show that this simple reformulation significantly reduces the computational cost required per LM iteration at the initial stages of the optimization (when we are far from the minimum) for both the moderate and high SNR settings; the LM optimization cost is then only slightly higher than that of the PNCG method. Conversely, our use of the surrogate Poisson formulation and a loose acceptance criterion for the CG substep [with β = 0.9 in Eq. (4.8)] means that the LM iterates are no longer superlinearly convergent; they nonetheless converge faster than the PNCG algorithm for the n_low and n_mod cases. We observe in subplots (c,f) that the objects recovered for the n_low and n_mod cases with the PLM-S method (with Ō = 0.24 and Ō = 0.11 respectively) look sharper than those obtained with the Gaussian error metric, but the range of magnitudes obtained is much larger for the Poisson error metric. For the n_high case, optimizing either of the error metrics results in similarly accurate reconstructions (see Appendix G.1). We note that the Gaussian metric can be interpreted as a consistent approximation of the Poisson metric in the high SNR regime. In this regime, the Gaussian error metric can be optimized with a lower computational effort, and it is not clear that we actually want to optimize the Poisson metric at all.
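A minimal sketch of the continuation schedule for ς_t is given below. The linear decay and the s0/t_max values are assumed choices made for illustration; the paper's actual schedule and the exact surrogate expression are given in Appendix E.

```python
def stabilizer_schedule(t, t_max=100, s0=1.0):
    # Monotonically decrease the stabilizer from s0 at t = 0 to exactly 0
    # at t = t_max; beyond t_max the true Poisson metric is recovered.
    # Linear decay is an assumed choice, not the paper's exact schedule.
    return max(s0 * (1.0 - t / t_max), 0.0)
```

In use, the value ς_t = `stabilizer_schedule(t)` would be fed into the surrogate error metric at each outer LM iteration, so that the early iterations see a well-behaved stabilized objective and the late iterations see the true Poisson metric.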

Performance of the matrix-free LM solver for the BPR problem
In this subsection, we present the performance analyses of the LM solvers applied to the BPR problem with the Gaussian and Poisson error metrics. In the numerical comparisons, we test the following LM variations:

• LM-A: the basic alternating LM algorithm, as described in Algorithm 3.
• PLM-A: the LM-A algorithm modified to use diagonal scaling and preconditioning.
• PLM-J: the joint optimization scheme that simultaneously updates O and P for the BPR problem.
Here, we do not include results for the basic joint optimization scheme (without preconditioning and scaling), since these reconstructions do not converge within 1000 iterations. We compare these to the following state-of-the-art first-order BPR methods (described in Appendix F):

• PHeBIE: the proximal heterogeneous block implicit-explicit (PHeBIE) method [32].

• ePIE: the extended ptychographic iterative engine method.

• ADMM: the alternating direction method of multipliers approach.
Unlike the PHeBIE and ADMM methods, which are provably convergent for the BPR problem, the ePIE method does not provide convergent updates [22,32].Nevertheless, the ePIE method is still widely used in the community and we therefore include the results here.

The Gaussian error metric
In Fig. 5, we present the performance results for the BPR problem with the Gaussian error metric. The joint (PLM-J) optimization procedure converges to the minimum in significantly fewer iterations, and with lower computational effort, than any of the first-order algorithms. Again, Fig. 5(g) seems to indicate that the PLM-J algorithm is superlinearly convergent. Meanwhile, the convergence rate of the alternating (LM-A and PLM-A) optimization procedures decreases as we increase the dynamic range of the problem, which shows that the sample-probe coupling information in the second-order curvature matrix becomes more important for problems with higher SNR. Another point of interest is that the LM-A algorithm does not benefit appreciably from the application of preconditioning, which indicates that the object and probe sub-problems are individually well-conditioned; the joint optimization approach, however, is not achievable without a preconditioner. Among the first-order methods tested, the ePIE algorithm, which does not use any constraints, does not improve the object reconstruction at all in the n_low case. More generally, the ePIE and PHeBIE algorithms show slow convergence in all the settings tested. Meanwhile, the ADMM convergence depends greatly on the choice of the penalty parameter (see Appendix F.5). In our experience, tuning this parameter is a computationally expensive procedure that becomes progressively more difficult as we lower the SNR. This effect is evident in subplots (b, e, h): the ADMM reconstructions are relatively (compared to the LM methods) more expensive and of lower quality as we move from bottom to top (see also Appendix G.2). It is likely that we could further tune the penalty parameter to improve the convergence rate and solution quality, but this would require even more computational effort and is beyond the scope of this paper. We present the full results for the converged algorithms in Appendix G.2.

The Poisson error metric
Figure 6 shows the reconstructions for the various SNR cases with the Poisson error metric. These results again show that the surrogate formulation of the Poisson error metric reduces the computational cost of the optimization procedure for the n_mod and n_high settings. These results differ from those in Section 5.2.1 in two ways: (i) the LM iterates (even with the surrogate formulation) have a slower rate of convergence, and (ii) optimization with the PLM-J algorithm requires a computational effort similar to that needed for the PLM-A algorithm. We can attribute both of these observations to the fact that we use a loose acceptance criterion for the CG substep when optimizing the Poisson error metric. However, it is possible that the PLM-J algorithm would provide an improvement over the PLM-A algorithm for a BPR problem with a much larger dynamic range. Finally, the ADMM method again displays the behavior described in Section 5.2.1. Similar to the SPR case, the PLM-J-S reconstructions for the n_low and n_mod cases (with Ō = 0.25 and Ō = 0.11 respectively) are sharper than the reconstructions for the Gaussian error metric. For the n_high setting, the caveat described in the SPR case (Section 5.1.2) again applies (see Appendix G.2).

Discussion
The comparisons of the various approaches discussed above were all drawn from a relatively simple simulated case. As we summarize below, this provided us the opportunity to delve deeply into the regimes of relative success and failure of a wide range of phase retrieval approaches, including the LM-based implementations that we developed. It is also important to note that, when considering the use of LM algorithms for ptychography experiments as well as for broader general use in phase retrieval, certain additional aspects should be considered; we touch on these below.
Through our experiments, we find that the LM algorithms shine in applications with the Gaussian error metric, for both the SPR and BPR problems. For the SPR problem, at all the noise levels examined, the preconditioned LM algorithm converges superlinearly, in significantly fewer iterations than any of the first-order algorithms. For the BPR problem, the joint optimization (PLM-J) method shows similarly excellent performance. Remarkably, in all these cases, the LM algorithms have a true computational cost that is comparable to, or even lower than, that of the best-performing first-order algorithm, with the additional advantage of minimal hyperparameter tuning.
The Poisson error metric is more difficult to optimize (in a computationally efficient manner) with the LM algorithm, and requires the use of a surrogate, stabilized formulation of the Poisson error metric as well as a loose acceptance criterion for the CG substep. These modifications reduce the convergence rate so that it is no longer superlinear, but they increase the computational efficiency of the LM method. For the unconstrained SPR problem, this modified LM algorithm converges at a rate similar to the first-order PNCG method, but has a slightly higher computational cost. For applications with constraints, however, the (modified) LM method remains the most performant choice. Notably, for high-SNR experiments, the Gaussian error metric is generally expected to serve as a robust proxy for the Poisson error metric and to produce similar reconstruction results. Under these conditions, the LM methods perform as well as the PNCG method even for unconstrained problems.
By incorporating the projected gradient method, the LM algorithm provides convergence guarantees in both the unconstrained and constrained optimization settings, with the benefit of minimal parameter tuning. Looking at comparable approaches, even though the PHeBIE and ADMM methods both guarantee convergence for BPR problems with constraints, and can easily be modified to also apply to constrained SPR problems, they can be difficult to accelerate. The PHeBIE algorithm is a steepest-descent procedure that depends on the use of small, stable step sizes for its convergence; accelerating this method requires a sophisticated domain-decomposition approach [32]. The ADMM algorithm, on the other hand, permits bigger step sizes [33], but only with a careful choice of the ADMM penalty parameter, the value of which often changes from problem to problem (see Fig. 7). In contrast, the projected LM procedure generally works with minimal modifications in all these different settings.
The fact that the second-order LM algorithm has a computational cost comparable to first-order methods is itself quite remarkable, and further improvements and broader applicability can easily be imagined. The primary reason for the computational parity of these methods is that our LM implementation uses the reverse-mode AD framework to avoid the construction of the full GGN matrix (with N² elements); the only drawback of this approach is a higher memory cost, about 4× that of a first-order method. Additionally, our work showcases the GGN-based extension of the classical LM algorithm, which can potentially be applied beyond ptychography to problems with general convex error metrics. Among the implications of this is that the LM procedure can be used for phase retrieval problems with other noise statistics (e.g., a mixed Gaussian-Poisson noise setting [79]), other error metrics (e.g., for robust regression models), or even for entirely different classes of optimization problems.
To determine how easily the LM method generalizes, we need one additional consideration: whether a preconditioning strategy is necessary for the LM optimization. For the SPR problem, the LM procedure has similar performance with and without preconditioning as long as there is sufficient overlap between adjacent probe positions. Likewise, the alternating minimization strategy for the BPR problem does not depend on the use of a preconditioner. However, the BPR joint optimization strategy, which is faster than the alternating strategy in higher-SNR regimes, does require a preconditioner. On the one hand, typical phase retrieval problems (in other experimental modalities) only attempt to retrieve the illuminated object, and are similar to the SPR problem in this regard. This suggests that, for general well-conditioned phase retrieval problems, a preconditioner is not strictly necessary. In this scenario, the LM algorithm is entirely AD-based, with no analytical calculations required, and can thus be used in a drop-in fashion to solve these problems. On the other hand, if our phase retrieval problem is ill-conditioned, or if we desire a BPR-like joint optimization strategy, then we have to rely on a preconditioner. In this case, one option is to use the diagonal elements of the GGN matrix, Dg(G(z)), for the preconditioning; we provide the analytical expressions for these for the BPR problem, and they are straightforward to modify for other phase retrieval applications. When this analytical calculation is not feasible, we can instead use stochastic matrix-free techniques to compute an unbiased estimate of Dg(G(z)) in a computationally efficient manner through matrix-vector products alone [80,81,49] (for example, the recently published AdaHessian algorithm uses such a scheme for mini-batch second-order GGN optimization [82]). We expect that these modifications can allow for use of the LM method for general SPR-like and BPR-like phase
retrieval problems.

A Matrix-free calculation of the GGN-vector product

Now, for an arbitrary v ∈ R^2n, we can calculate the JᵀVP of the transformation ξ, which yields the desired JVP at z_0. To summarize: we can calculate the JVP for the Jacobian J_Γ(z_0) and an arbitrary vector v ∈ R^2n by first forming the transformation ξ(ω) = J_Γ(z_0)ᵀ · ω for any ω ∈ R^m (say, ω_j = 1 for all j), then calculating the JᵀVP of this transformation ξ with respect to ω, applied to v.
Finally, we need a mechanism to calculate an HVP of the form ∇²ℓ(ζ) · x for some x ∈ R^m. Noting that the Hessian is defined elementwise as [∇²ℓ]_{k,j} = ∂²ℓ/∂ζ_k∂ζ_j, the full HVP is just the gradient ∇_ζ(∇ℓ(ζ)ᵀ x), which is again a matrix-free procedure. This completes the desired mechanism for a matrix-free calculation of the GGN-vector product.
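In an AD framework the HVP above is essentially one line. The JAX sketch below is our illustration: it computes ∇²ℓ(ζ) · x as the gradient of the directional derivative, treating x as a constant.

```python
import jax
import jax.numpy as jnp

def hvp(loss, zeta, x):
    # Hessian-vector product as the gradient of a directional derivative:
    # H(zeta) @ x = grad_zeta <grad loss(zeta), x>, fully matrix-free.
    return jax.grad(lambda z: jnp.vdot(jax.grad(loss)(z), x))(zeta)
```

For a separable test metric such as ℓ(ζ) = Σ ζ⁴, the Hessian is diag(12 ζ²), so the HVP can be checked against the closed form.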
In practice, we formulate the LM linear system through the following sequence of calculations:

1. We first calculate the JᵀVP J(z)ᵀ · ∇ℓ(ζ) through the procedure in Eq. (A.5). This is just the gradient ∇f(z), and can be accessed directly in any reverse-mode AD toolset.

2. We then calculate the JVP J(z) · Δz. While we can use any ω ∈ R^m for this calculation, reusing the output from step (1) reduces the computational cost and memory overheads.
3. We then calculate the HVP ∇²ℓ(ζ) · (J(z) · Δz) either (i) through the HVP procedure described in Eq. (A.10) or (ii), if the Hessian ∇²ℓ(ζ) is a diagonal matrix (and is therefore easy to formulate explicitly), directly using the closed-form expression for ∇²ℓ(ζ). In case (ii), which holds for both the Gaussian and Poisson error metrics, we only need to calculate the m diagonal elements of ∇²ℓ(ζ) and then perform an elementwise product with the vector J(z) · Δz to obtain the desired HVP.

4. Finally, we calculate the JᵀVP J(z)ᵀ · (∇²ℓ(ζ) · (J(z) · Δz)) through the procedure in Eq. (A.5). This is again easy to access from the AD toolset.
From step (1) and step (4) (together with some elementary operations), we have the right-hand side and left-hand side of Eq. (A.2) respectively, which completes the full LM linear system.
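Putting the four steps together, a JAX sketch of the matrix-free GGN-vector product might look as follows. The function names are ours, and since JAX provides a native forward-mode `jvp`, the double-JᵀVP trick of Eq. (A.5) is not needed for step (2) here.

```python
import jax
import jax.numpy as jnp

def lm_system(zeta_fn, loss, z, dz):
    # Assemble both sides of the LM linear system matrix-free:
    # returns (G(z) @ dz, grad f(z)) with G(z) = J(z)^T H(zeta) J(z).
    zeta, vjp = jax.vjp(zeta_fn, z)              # forward pass; vjp is the J^T VP closure
    grad_f = vjp(jax.grad(loss)(zeta))[0]        # step 1: J^T @ grad l(zeta) = grad f(z)
    _, j_dz = jax.jvp(zeta_fn, (z,), (dz,))      # step 2: J @ dz (forward-mode JVP)
    h_jdz = jax.grad(                            # step 3: HVP of l applied to J @ dz
        lambda t: jnp.vdot(jax.grad(loss)(t), j_dz))(zeta)
    return vjp(h_jdz)[0], grad_f                 # step 4: J^T @ (H @ (J @ dz))
```

With a linear forward model ζ(z) = A z and the metric ℓ(ζ) = ½‖ζ‖², the GGN matrix is exactly AᵀA, which gives a direct check of the pipeline.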

B Calculating the diagonal elements of the Generalized Gauss-Newton matrices
In this appendix, we derive general analytical expressions for the diagonal elements of the GGN matrices (defined in Eq. (3.6)) for the ptychographic reconstruction problems. For ease of analysis, this appendix defines the error metric f as a function of the variable z ∈ C^n and its complex conjugate z*. This (z, z*) coordinate representation is connected to the (R[z], I[z]) representation used in Section 3 through the Wirtinger gradient definition

∂f(z)/∂z* = (1/2) ( ∂f(z)/∂R[z] + i ∂f(z)/∂I[z] ),

with i = √−1, where ∂f(z)/∂z* is the componentwise partial derivative of f with respect to the complex conjugate variable z*, and ∂f(z)/∂R[z] and ∂f(z)/∂I[z] are the componentwise partial derivatives with respect to the real and imaginary parts of z. A change of representation is easily accomplished via the linear transform [59]

(z, z*)ᵀ = [[I_{n×n}, i I_{n×n}]; [I_{n×n}, −i I_{n×n}]] · (R[z], I[z])ᵀ,

where I_{n×n} is the n × n identity matrix and (·)ᵀ is the transpose operator.
For the SPR problem, our coordinates of interest are just (z, z*) = (O, O*). For the magnitude-based formulation of the error metrics [Eq. (2.3), Eq. (4.2)], we can use the chain rule of multivariable calculus to derive the gradient ∇_O f of Eq. (B.3), where ∂ζ/∂O* is the (KM × N) elementwise-defined Jacobian matrix. We can follow the procedure in [27] to calculate the set of M rows of the Jacobian matrix corresponding to the k-th probe position, so that the sub-matrix of the Jacobian corresponding to the k-th position takes the form given in Eq. (B.4). With this Jacobian in hand, the calculation of ∇_O f [from Eq. (B.3)] for a chosen error metric is straightforward, and we can now use ∇_O f to formulate any choice of first-order algorithm.
To formulate a second-order optimization algorithm, however, we additionally need the "compound" Jacobian matrix J (see Eq. 116 in [58]), which contains the derivatives with respect to both the O and O* coordinates: the matrices ∂ζ/∂O and ∂ζ/∂O* are stacked columnwise to form J. Assuming that the error metric of interest, ℓ(ζ), has a positive semi-definite Hessian matrix defined elementwise as [∇²ℓ]_{k,j} = ∂²ℓ/∂ζ_k∂ζ_j, we can write the compound GGN matrix as in Eq. (B.6), where (·)† is the conjugate-transpose operation. From here on, we additionally assume that ∇²ℓ is a diagonal matrix; this holds for both the Gaussian and Poisson error metrics. Putting Eq. (B.5) and Eq. (B.6) together, we get the relation of Eq. (B.7). We substitute for ∂ζ/∂O* from Eq. (B.4) to get the lower-right diagonal block of this latter matrix, Eq. (B.8), where ∇²ℓ_k is the M × M diagonal block extracted from ∇²ℓ that corresponds to the k-th probe position. Since S_k is a binary matrix, we can use S_k* = S_k to get the simplified expression of Eq. (B.9), where we use the simplifying approximation h_k ≈ ψ̄_k and where e_n denotes a vector with a 1 at index n and 0's elsewhere. From Eqs. (B.8) and (B.9) we obtain Eq. (B.10). Noting that S_k is a binary matrix that extracts the M object pixels that interact with the probe at the k-th probe position, we can see that S_k · e_n = e_m', where e_m' is a vector that contains all 0's except possibly at index m'; the index m' corresponds to the probe pixel that interacts with the object pixel n at the k-th probe position. Since there exist N − M object pixels that do not interact with any probe pixel at a given probe position, e_m' contains a 1 at index m' if such a probe pixel exists, and contains 0's everywhere otherwise. As a result, S_k · e_n is a vector that contains at most a single 1 among its elements. Then, for the diagonal matrix diag(P*), the matrix-vector product diag(P*) · S_k · e_n is also a vector that contains at most a single non-zero element [P*]_m' at position m'. If we then define the set Ω_n ⊂ {1, 2, …, K} as the set of
probe positions that interact with the sample pixel at index n, we get the relation (from (B.10)) where [C k ] m ,m is the element extracted from the position m in the main diagonal of C k .We now note that the circulant matrix C is constant along its main diagonal and has the elements . We finally have the relation As an alternative, we can also interact directly with the shift matrices S k instead of iterating over individual diagonal elements d n .To accomplish this, we can note that where I M ×M is the identity matrix.This gives us the relation If we follow the same procedure for the upper-left diagonal block in (B.7), we again get the same expression as in (B.12).In practice, we can calculate these expressions in a straightforward manner by using array manipulation tricks instead of using the matrix multiplication with the shift S k .This holds even after we change the coordinate basis to (R[], I[]) through the relationship (Equation 89in [58]) From our calculations thus far, we have obtained the desired analytical expressions we can use to calculate the diagonal elements of general GGN matrices for the SPR problem.For the BPR problem, we can follow a similar procedure to calculate the corresponding expression for the probe variable.If we desire a joint object-probe optimization, the main diagonal of the extended GGN matrix takes the form where G P is the "probe counterpart" of the sample GGN matrix defined by (B.6).We thus have the desired analytical expressions for the diagonal elements of the GGN matrices for both the SPR and BPR problems.
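The array manipulation tricks mentioned above can be sketched in a few lines. The snippet below is an illustrative NumPy sketch (not the paper's implementation; the raster-scan geometry, the function name, and the position format are assumptions) that accumulates the GGN main diagonal for the Gaussian metric, for which the central Hessian is the identity, by summing |P|² over the probe positions in each overlap set Ω_n. Array slicing plays the role of multiplication by the shift matrices S_k.

```python
import numpy as np

def ggn_diagonal(obj_shape, probe, positions):
    """Accumulate the GGN main diagonal for the Gaussian metric (sketch).

    With the central Hessian equal to the identity, the diagonal element
    for object pixel n reduces to the sum of |P|^2 over every scan
    position that illuminates that pixel (the sets Omega_n in the text).
    `positions` holds the assumed top-left corner (row, col) of the probe
    window at each scan position k; slicing replaces the shift matrix S_k.
    """
    d = np.zeros(obj_shape)
    p2 = np.abs(probe) ** 2
    m = probe.shape[0]
    for (r, c) in positions:
        d[r:r + m, c:c + m] += p2
    return d

# toy example: 8x8 object, 4x4 probe, 2x2 raster grid with 50% overlap
probe = np.ones((4, 4), dtype=complex)
positions = [(0, 0), (0, 2), (2, 0), (2, 2)]
d = ggn_diagonal((8, 8), probe, positions)
```

Pixels in the central overlap region accumulate contributions from all four probe positions, while pixels never illuminated keep a zero diagonal entry (and would be excluded or regularized before inversion).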

C Comparing magnitude-based and intensity-based GGN approximations
In this appendix, we show that the LM algorithm using the magnitude-based GGN matrix (G_ℓ(z) in Eq. (4.5)) makes for a faster optimization strategy than the scheme using the intensity-based GGN matrix (G_L(z)). To accomplish this, we first examine the Gaussian error metric case, then develop a guideline for how to formulate the GGN matrix for general error metrics. Using z = O for the SPR problem, from Eqs. (2.5), (B.3) and (B.4) we get an expression for the gradient of the error metric with respect to O. We are most concerned with the optimization iterations where z is in the neighborhood of the solution, since, for non-convex problems, this is the region where second-order methods provide the most acceleration. In this region, the residual between the predicted and measured magnitudes is very small in size and does not contribute meaningfully to the gradient calculation. For the data points of interest we can therefore use a simplified gradient expression. We can easily verify that a similar line of reasoning also holds for the Poisson error metric. We expect that this also holds in the general case.
We can now rewrite the expressions for f_L and f_ℓ in Eq. (4.2). Martens and Sutskever [83] have previously shown that if we compare the GGN matrix G_L(O) = J_I^T · ∇²_I L · J_I with the true Hessian matrix ∇²_O f, the second derivative terms associated with L are faithfully captured in G_L(O). However, the second derivative terms associated with the calculations in (I) are not accurately captured in the GGN matrix. Effectively, the more calculation we associate with the Jacobian terms in the GGN matrix, the more curvature information we lose. Since (I) contains an extra (·)² operation in comparison with (II), we lose more curvature information when we calculate G_L(O) than we do when we calculate G_ℓ(O). In other words, G_L(O) is a less accurate approximation of ∇²_O f than is G_ℓ(O). This leads us to a general guideline: we should construct the functional decomposition of f so that the function associated with the central Hessian part of the GGN matrix performs as much as possible of the computation in f (see also [49, Chapter 3.4]).
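The decomposition guideline can be made concrete with a toy matrix-free GGN matrix-vector product. In the sketch below (illustrative only: a real linear map A stands in for the AD-generated Jacobian products, and all variable names are invented), the intensity-based decomposition picks up an extra diag(2ζ) factor inside its Jacobian, which is exactly the extra (·)² computation moved out of the central Hessian. Neither GGN matrix is ever formed explicitly; only Jacobian-vector and vector-Jacobian products are used, mirroring the matrix-free structure of the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # toy linear model standing in for the AD Jacobian
z = rng.standard_normal(4)
zeta = A @ z                      # magnitude-like intermediate quantity

def ggn_matvec_magnitude(v):
    # Magnitude-based split: central Hessian of the quadratic misfit is the
    # identity, so G_l v = J^T (J v) with J = A.
    return A.T @ (A @ v)

def ggn_matvec_intensity(v):
    # Intensity-based split: I = zeta**2, so the Jacobian of I is
    # diag(2*zeta) @ A and G_L v = J_I^T (J_I v).  The squaring now lives
    # in the Jacobian rather than in the central Hessian.
    Jv = 2.0 * zeta * (A @ v)          # Jacobian-vector product (jvp)
    return A.T @ (2.0 * zeta * Jv)     # vector-Jacobian product (vjp)

v = rng.standard_normal(4)
g1 = ggn_matvec_magnitude(v)
g2 = ggn_matvec_intensity(v)
```

In a full AD framework the two products would be obtained from jvp/vjp calls against the forward model instead of the hand-coded lines above.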

D Efficacy of GGN optimizations for the Gaussian and Poisson error metrics
In this appendix, we examine the individual GGN approximations for the Gaussian and Poisson metrics to understand the efficacy of GGN-based optimization procedures for these metrics. We first use Eq. (3.7) to calculate the Hessian matrices, where G_g(ζ) and G_p(ζ) denote the GGN matrices associated with the error metrics ℓ_g and ℓ_p, respectively, and S_g and S_p denote the difference between the true Hessian and the corresponding GGN approximation. When z is far away from the minimum, we may get the extreme case where, for some pixels, ζ_j is very small while y_j > 0. Due to contributions from these pixels, S_p may be much "larger" than S_g. This indicates that G_p(ζ) is a less accurate approximation of the corresponding true Hessian. In fact, we can see from Eq. (D.2) that this effect is present in general (Hessian-based) second-order optimization strategies, which makes the Gaussian error metric the preferred optimization target for general applications.
Putting these effects together, we can see that f_p may be difficult to optimize through a GGN-based routine.
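A small numeric illustration of this conditioning argument (toy values, not taken from the paper's experiments): the central Hessian of the Gaussian metric is the identity, while the Poisson metric's central Hessian has diagonal elements 1 + y_j/ζ_j², which blow up wherever ζ_j is very small but y_j > 0.

```python
import numpy as np

# Central Hessian terms for the two metrics (see Appendix D): the Gaussian
# metric gives the identity; the Poisson metric gives a diagonal matrix
# with elements 1 + y_j / zeta_j**2.
y = np.array([5.0, 3.0, 4.0, 2.0])
zeta_far = np.array([1e-3, 2.0, 1e-3, 1.5])  # far from the minimum: some zeta_j << 1

h_gauss = np.ones_like(y)
h_poisson = 1.0 + y / zeta_far**2

# Condition number of each diagonal central Hessian
cond_gauss = h_gauss.max() / h_gauss.min()
cond_poisson = h_poisson.max() / h_poisson.min()
```

Even this four-pixel toy produces a condition number above 10⁵ for the Poisson central Hessian, illustrating why G_p(ζ) inherits poor conditioning far from the minimum.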

E Defining a surrogate Poisson error metric
We can address the difficulty in optimizing the Poisson error metric with a GGN-based routine (see Appendix D) by introducing a surrogate formulation that is asymptotically identical to the Poisson error metric we aim to minimize. We can build this surrogate formulation by adding a spatially uniform value to the incoherent background term in Eq. (2.2) and driving this value towards zero as we progress through the optimization iterations. For this procedure, we redefine the expected intensity h_k (Eq. (2.2)) and magnitude ζ_k (Eq. (4.1)), where t indexes the optimization iteration and 0 ≤ ς_t < 1. This redefinition leaves unchanged all the relations that define the preconditioner and the LM update. We nevertheless need to define a constant ς_0 and an integer T ≥ 1 before we start the optimization procedure, then monotonically decrease the value of ς_t after every iteration until t = T, where we then set ς_T = 0. Therefore, at t = T, the surrogate formulation is exactly equal to the Poisson error metric ℓ_p. Now, through a relatively large choice of ς_0 (e.g., ς_0 = 1) we can significantly constrain the size of the elements in both the matrices S_p and G_p(ζ) (which we define as in Appendix D). We can expect this effect to become less significant as we get closer to the minimum, at which point we can transition to using the true Poisson error metric.
For the numerical results presented in Section 5, we set T = 100 and ς_0 = 1, and use an evenly spaced logarithmic grid to decrease the value of ς_t from ς_0 to 0.
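One way to realize this schedule is sketched below. This is an assumed construction: the text specifies only the logarithmic spacing and the final hard zero, so the floor value of 10⁻⁶ before the last entry is an invented detail.

```python
import numpy as np

def sigma_schedule(sigma0=1.0, T=100):
    """Monotonically decreasing surrogate-parameter schedule (sketch).

    Evenly spaced on a logarithmic grid from sigma0 down to an assumed
    floor of 1e-6, with the final entry set exactly to 0 so that, at
    t = T, the surrogate coincides with the true Poisson metric.
    """
    s = np.logspace(np.log10(sigma0), -6, T)
    s[-1] = 0.0
    return s

sched = sigma_schedule()
```

During optimization, iteration t would simply read off `sched[t]` as its ς_t value.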

F First-order optimization algorithms
In this appendix, we present the implementation details for the tested first-order algorithms. We use the reverse-mode AD procedure to calculate the gradients required within all the tested algorithms.

F.1 ePIE
The classic extended ptychographical iterative engine (ePIE) algorithm [31,26] optimizes the Gaussian error metric for the BPR problem by stochastically iterating through the individual diffraction patterns. In the ePIE method, the current probe and object estimates are updated concurrently as in Eq. (F.2).
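A single ePIE-style update can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the gradient-descent form of the update, the Gaussian magnitude misfit, and the step-size estimates 1/max|P|² and 1/max|O_k|² follow the description above, while the constant scaling of the inverse FFT relative to the adjoint is assumed to be absorbed into the step sizes.

```python
import numpy as np

def epie_step(obj, probe, y_k, pos, alpha=None, gamma=None):
    """One ePIE update using the k-th diffraction pattern (sketch).

    `pos` is the assumed top-left corner of the probe window on the
    object grid, and `y_k` holds the measured far-field magnitudes.
    """
    r, c = pos
    m = probe.shape[0]
    o_k = obj[r:r + m, c:c + m]                 # extract illuminated patch (role of S_k)
    psi = np.fft.fft2(probe * o_k)              # far-field wave
    # Gradient of the Gaussian (magnitude) misfit through the modulus,
    # guarded against division by zero:
    resid = psi - y_k * psi / np.maximum(np.abs(psi), 1e-12)
    g = np.fft.ifft2(resid)                     # back to the sample plane
    # Inverse-Lipschitz step-size estimates (any FFT scaling absorbed here)
    alpha = alpha or 1.0 / np.max(np.abs(probe)) ** 2
    gamma = gamma or 1.0 / np.max(np.abs(o_k)) ** 2
    obj = obj.copy()
    obj[r:r + m, c:c + m] = o_k - alpha * np.conj(probe) * g
    probe = probe - gamma * np.conj(o_k) * g
    return obj, probe
```

When the data are exactly consistent with the current estimates, the residual vanishes and the update leaves both variables unchanged, as expected at a fixed point.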

F.2 Nonlinear conjugate gradient
We use the popular Polak-Ribière nonlinear conjugate gradient (NCG) method [30, Chapter 5], with the update step sizes calculated using an adaptive backtracking line search procedure, to solve the unconstrained SPR problem for both the Gaussian and Poisson error metrics [54].
For ill-conditioned problems, we can accelerate the basic NCG algorithm by choosing a matrix that approximates (∇²f)⁻¹ as a preconditioner [84]. Since the GGN matrix can be considered a proxy for the Hessian, the matrix Dg(G)⁻¹ seems a sensible choice as a preconditioner for the NCG algorithm. We therefore modify the standard NCG method for each of the Gaussian and Poisson error metrics to use this preconditioner and denote this algorithm as the preconditioned nonlinear conjugate gradient (PNCG) method. In Sections 5 and 6, we only report the results for the PNCG algorithm, but we report the full results for both the PNCG and NCG algorithms in Appendix G. In practice, we find that the use of this preconditioner provides only slight acceleration over the basic NCG algorithm.
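The structure of this scheme can be sketched as below. This is a generic preconditioned Polak-Ribière loop on a toy quadratic, not the paper's implementation: the Armijo backtracking constants and the PR+ restart (clipping β at zero) are assumed details, and the elementwise division by `diag_precond` plays the role of applying Dg(G)⁻¹.

```python
import numpy as np

def pncg(grad, f, z0, diag_precond, iters=100, tol=1e-10):
    """Preconditioned Polak-Ribiere NCG (sketch)."""
    z = z0.copy()
    g = grad(z)
    y = g / diag_precond          # preconditioned gradient, i.e. Dg(G)^{-1} g
    p = -y
    for _ in range(iters):
        # simple Armijo backtracking line search
        t, fz = 1.0, f(z)
        while f(z + t * p) > fz + 1e-4 * t * np.dot(g, p):
            t *= 0.5
        z = z + t * p
        g_new = grad(z)
        if np.linalg.norm(g_new) < tol:
            break
        y_new = g_new / diag_precond
        # preconditioned Polak-Ribiere coefficient, clipped at 0 (PR+)
        beta = max(0.0, np.dot(g_new - g, y_new) / np.dot(g, y))
        p = -y_new + beta * p
        g, y = g_new, y_new
    return z

# toy ill-conditioned diagonal quadratic: f(z) = 0.5 * sum(a * z**2)
a = np.array([1.0, 100.0, 1e4])
z = pncg(lambda z: a * z, lambda z: 0.5 * np.sum(a * z * z),
         np.array([1.0, 1.0, 1.0]), diag_precond=a)
```

On this toy problem the preconditioner equals the exact Hessian diagonal, so the first preconditioned step already lands at the minimizer; in the ptychographic setting Dg(G) is only an approximation and the gain is correspondingly smaller, consistent with the modest acceleration reported above.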

F.3 Nesterov's accelerated gradient
Nesterov's accelerated gradient (NAG) method, also referred to as Nesterov's momentum method, is an accelerated adaptation of the standard gradient descent method. In the unconstrained SPR setting for the Gaussian error metric, the NAG method uses the following scheme. Here, the "velocity" term v ∈ R^{2N} stores a weighted history of the past gradient directions and uses this to adapt the current update direction. For compatibility with existing phase retrieval literature, we set the step size parameter to α, the inverse of the Lipschitz constant of the gradient ∇_O f_g(O_j) [32]. Finally, we use the standard fixed schedule γ_j = (j + 2)/(j + 5) for the momentum parameter γ_j [85].
If we set γ_j = 0, the resulting algorithm is exactly the "reshaped Wirtinger flow" algorithm [24], only with a different step size parameter. The NAG algorithm is also closely related to the "accelerated Wirtinger flow" algorithm proposed in [64], which uses a scaled gradient in addition to Nesterov's momentum.
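The scheme above can be sketched compactly. The following is a minimal NumPy sketch on a toy quadratic, assuming the look-ahead form of NAG with the stated momentum schedule γ_j = (j + 2)/(j + 5) and step size α = 1/L; the function names and toy problem are invented for illustration.

```python
import numpy as np

def nag(grad, z0, lipschitz, iters=200):
    """Nesterov's accelerated gradient with the fixed schedule
    gamma_j = (j + 2) / (j + 5) and step size alpha = 1 / L (sketch)."""
    z = z0.copy()
    v = np.zeros_like(z0)         # "velocity": weighted history of gradients
    alpha = 1.0 / lipschitz
    for j in range(iters):
        gamma = (j + 2) / (j + 5)
        # gradient evaluated at the momentum look-ahead point
        v = gamma * v - alpha * grad(z + gamma * v)
        z = z + v
    return z

# toy quadratic f(z) = 0.5 * ||A z||^2 with Lipschitz constant ||A^T A|| = 9
A = np.diag([1.0, 3.0])
z = nag(lambda z: A.T @ (A @ z), np.array([1.0, 1.0]), lipschitz=9.0)
```

Setting `gamma = 0` inside the loop recovers plain gradient descent with step 1/L, matching the reshaped-Wirtinger-flow connection noted above.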

F.4 Proximal heterogeneous block implicit-explicit method
The proximal heterogeneous block implicit-explicit (PHeBIE) scheme described in [32] is provably convergent for the constrained BPR problem with the magnitude-based error metric. The PHeBIE scheme treats the transmitted waves Ψ = {ψ_1, . . ., ψ_K} as an auxiliary variable that is kept fixed during the probe and object updates. The updates are calculated with projected gradient steps, where Π_O and Π_P represent the projections onto the convex sets O and P respectively. The object and probe step sizes are again derived from the partial Lipschitz constants for the respective gradients.
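The alternating block structure can be summarized in a short skeleton. This is only a structural sketch under assumed interfaces: every callable below is a placeholder for the problem-specific piece (gradients, projections onto the constraint sets, and the refresh of the fixed auxiliary waves), and the step sizes 1/L_O and 1/L_P stand in for the partial-Lipschitz-derived steps.

```python
import numpy as np

def phebie_outer(obj, probe, psi, grad_O, grad_P, update_psi,
                 proj_O, proj_P, L_O, L_P, iters=10):
    """Skeleton of PHeBIE-style outer iterations (sketch).

    Alternates projected-gradient updates of the object and probe with
    the transmitted waves `psi` held fixed, then refreshes `psi`.
    """
    for _ in range(iters):
        obj = proj_O(obj - grad_O(obj, probe, psi) / L_O)      # object block
        probe = proj_P(probe - grad_P(obj, probe, psi) / L_P)  # probe block
        psi = update_psi(obj, probe)                           # auxiliary refresh
    return obj, probe, psi
```

Instantiated with simple quadratic blocks and box projections, the skeleton converges to the projected block-wise minimizers, which is the behavior the convergence proof in [32] formalizes for the full ptychographic objective.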

F.5 Alternating Directions Minimization
The Alternating Directions Minimization (ADMM) method [86,87] has recently been adapted as a provably convergent scheme for the BPR problem for both the magnitude-based and Poisson error metrics [33]. To formulate the ADMM scheme, we first define the operator A(O, P), which generates the transmitted waves at the far-field detector plane. We can then define an auxiliary variable Ψ = A(O, P) that is kept fixed during the probe and object updates. The "augmented Lagrangian" for the ADMM scheme is then defined as in [33]. The object and probe updates use the exact solutions to their respective minimization problems.
For the auxiliary variable update, we use a single iteration of the projected gradient algorithm to solve Equation 3.10 in [33].
The performance of the ADMM method depends strongly on the choice of the penalty parameter [35]. However, we are not aware of any practical guideline on how to choose the optimal value of this parameter for the BPR optimization problem. In our experiments, we find that the large penalty values obtained by following the proof strategy in [33] lead to impractically slow optimization; the numerical experiments in [33] instead use much smaller, manually tuned values. Thus, to find the optimal value of the penalty parameter, we run separate reconstructions with the penalty set to 10^x for x ∈ {−2, −1.5, −1, −0.5, 0, 0.5, 1} for each simulation setting and error metric. For the convergence analysis in this work, we choose the value that enables a monotonic descent of the objective function f. Additionally, when multiple choices yield similar final values of f (after 1000 iterations), we choose the value that leads to the lowest object reconstruction error. While this latter choice is post-hoc in nature and is impractical for actual ptychography experiments, it suffices for our analysis. We present, in Fig. 7, the convergence history of the reconstructions as a function of the penalty parameter for the various numerical experiments we report in this paper.
Finally, we note that, first, our tuning procedure is computationally demanding and results in penalty values that are much smaller than those used in the proof strategy in [33], and that therefore do not guarantee the convergence of the object and probe updates. Second, it is likely that a finer tuning of the penalty parameter could result in a more performant ADMM procedure, but this would require significant additional computational effort and is thus beyond the scope of this paper.
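The selection procedure above can be sketched as a small grid sweep. This is a simplified sketch of the logic, not the paper's tuning code: `run_admm` would be the full reconstruction (omitted here), the monotonicity check collapses the "similar final f" tie-breaking into a single rule, and all names are invented.

```python
import numpy as np

# Penalty grid used for the ADMM sweep: one full reconstruction per value.
grid = [10.0 ** x for x in (-2, -1.5, -1, -0.5, 0, 0.5, 1)]

def pick_penalty(results):
    """Choose a penalty from sweep results (sketch).

    `results` maps each penalty value to (f_history, final_object_error).
    Prefer penalties whose objective history descends monotonically; among
    those, pick the lowest final object error (the post-hoc tie-breaker).
    """
    monotone = {r: v for r, v in results.items()
                if np.all(np.diff(v[0]) <= 0)}
    pool = monotone or results
    return min(pool, key=lambda r: pool[r][1])
```

In practice each entry of `results` would come from a 1000-iteration ADMM run at that penalty value, which is what makes the sweep computationally demanding.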

G Computational costs for the compared algorithms
In this appendix, we report the final reconstruction results and computational costs for all the numerical experiments presented in this work.

G.1 Reconstruction results for the SPR experiments
In Appendix G.1, we report the final reconstruction results obtained by optimizing the Gaussian error metric for the SPR problem for all three tested SNR scenarios. In the table, "It." denotes the number of outer iterations required for convergence, and "I_O" denotes the number of inner iterations required. Specifically, the inner iterations reported for the LM algorithms use the format "•/•/•" and contain the total number of CG iterations, projected gradient calls (outer), and projected gradient line search iterations (inner), respectively. For the NCG algorithms, "I_O" is just the total number of line search iterations required.
For the results reported, the points of convergence were calculated using Eq. (5.3) with c = 3 × 10⁻³, c = 2 × 10⁻³, and c = 10⁻³ respectively for the n_low, n_mod, and n_high cases. The table in Appendix G.1 also reports the final reconstruction results obtained by optimizing the Poisson error metric for the SPR problem. In this case, the "PLM-S" algorithm reported is the LM algorithm optimizing the surrogate formulation of the Poisson error metric (Appendix E).

G.2 Reconstruction results for the BPR experiments
The tables in Appendix G.2 contain the reconstruction results obtained by optimizing the Gaussian and Poisson error metrics for the BPR problem. In this case, for the LM-A and NCG algorithms, "I_O" and "I_P" denote the inner iterations required for the object and probe updates respectively. For the PLM-J algorithm, the object and probe updates are jointly calculated, but, for convenience, we still use the "I_O" column to report the number of inner iterations required. Similarly, for convenience, in the ADMM case, "I_O" denotes the number of line search iterations within the projected gradient algorithm. Finally, P denotes the normalized error for the reconstructed probe variable at the point of convergence. As with Appendix G.1, the points of convergence were calculated using Eq. (5.3) with c = 3 × 10⁻³, c = 2 × 10⁻³, and c = 10⁻³ respectively for the n_low, n_mod, and n_high cases.

Fig. 1. Comparison of the object reconstruction error for optimization with the intensity-based (L_g) and magnitude-based (ℓ_g) error metric formulations (Eq. (4.2)) for the standard ptychographic object reconstruction (SPR) problem with a known probe. The numerical experiment uses the parameters described in Section 5 for the n_high setting. For the LM algorithm, ℓ_g enables much faster convergence than L_g, while the convergence rate of the NCG method is the same for both metrics.

Fig. 2. Simulated (true) probe (a) magnitude and (e) phase. The insets in (a) and (e) show the initialization for the probe magnitude and phase respectively. Simulated (true) object (b) magnitude and (f) phase. Reconstructed phase for the n_low SPR setting with the (c) Gaussian and (g) Poisson error metric, using the PLM and PLM-S algorithms respectively. Reconstructed phase for the n_low BPR case with the (d) Gaussian and (h) Poisson error metric, using the PLM-J and PLM-J-S algorithms respectively.
The high SNR setting (where the normalized object error is 0.025) is difficult to distinguish from the true object, but the moderate and low SNR settings (where the error is 0.15 and 0.25 respectively) show a clear deterioration in the recovered object.

Fig. 3. Standard ptychographic reconstructions (SPR) of the object with a known probe using the Gaussian error metric. The top, middle, and bottom rows display the results for the n_low, n_mod, and n_high cases respectively. Subplots (a,d,g) display the history of the normalized object reconstruction error over the first 300 optimization iterations, subplots (b,e,h) display the history of the error up to convergence as a function of the computational cost in flops, and subplots (c,f,i) display the reconstructed object magnitude obtained from the PLM algorithm.

Fig. 4. Standard ptychographic reconstructions (SPR) of the object with a known probe using the Poisson error metric. The notations LM-S and PLM-S denote the respective LM methods optimizing the surrogate formulation of the Poisson error metric. As with Fig. 3, subplots (a, d, g) and (b, e, h) show the history of the object error with the iterations and flops (to convergence) respectively. Subplots (c, f, i) display the reconstructed object magnitude obtained from the PLM-S algorithm.

Fig. 5. Blind ptychographic reconstructions (BPR) of the object and the probe with the Gaussian error metric. As with Fig. 3, subplots (a, d, g) and (b, e, h) show the history of the object error with the iterations and flops (to convergence) respectively. Subplots (c, f, i) display the reconstructed object magnitude and probe magnitude (in the inset) obtained from the PLM-J algorithm.

Fig. 6. Blind ptychographic reconstructions (BPR) of the object and probe with the Poisson error metric, with the notations LM-A-S, PLM-A-S, and PLM-J-S denoting the respective LM methods for the surrogate formulation of the Poisson metric. As with Fig. 3, subplots (a, d, g) and (b, e, h) show the history of the object error with the iterations and flops (to convergence) respectively. Subplots (c, f, i) display the reconstructed object magnitude and probe magnitude (inset) obtained from the PLM-J-S algorithm.

where C_k is a circulant matrix. The diagonal elements of D are then given by d_n = e_n^T · D · e_n for n = 1 . . . N, (B.9) and, for the elements h_{k,m} ∈ h_k, ψ̃_{k,m} ∈ ψ̃_k, y_{k,m} ∈ y_k and b_{k,m} ∈ b_k, we get one of two cases: (i) ψ̃_{k,m} ≫ b_{k,m}, or (ii) all of ψ̃_{k,m}, b_{k,m}, and y_{k,m} have comparably small values, and therefore |ψ̃_{k,m}(1 − y_{k,m}/ψ̃_{k,m})| is also very small.

O_{j+1} = O_j − α_j ∂_O f_g(O_j, P_j, y_k), P_{j+1} = P_j − γ_j ∂_P f_g(O_j, P_j, y_k). (F.2)
Here, ∂_O f_g(O_j, P_j, y_k) and ∂_P f_g(O_j, P_j, y_k) are the derivatives of f_g with respect to O and P computed using only the information in the k-th diffraction pattern (chosen randomly). The step sizes α_j = 1/max|P_j|² and γ_j = 1/max|S_k · O_j|² are the inverses of the Lipschitz constants of the partial gradients ∂_O f_g(O_j, P_j, y_k) and ∂_P f_g(O_j, P_j, y_k) respectively.

Fig. 7. Effect on the normalized object reconstruction error as a function of the penalty parameter for the ADMM reconstructions. The left, middle, and right columns display the plots generated for the n_low, n_mod, and n_high settings respectively for the Gaussian error metric (top) and the Poisson error metric (bottom). For the convergence analysis in Section 5 and Appendix G, we use the penalty values (a) 1.0 and 10^{−0.5}, (b) 10^{−0.5}, (c) 0.1, (d) 1.0, (e) 1.0, and (f) 1.0.
The key point is that the matrices G_p(ζ) and G_g(ζ) differ in the central Hessian term: ∇²ℓ_g(ζ) is the identity matrix, whereas ∇²ℓ_p(ζ) is the diagonal matrix with the elements 1 + y_j/ζ_j². If ζ_j ≪ 1 and y_j > 0, then the diagonal elements of ∇²ℓ_p(ζ) can take very large values, which would degrade the conditioning of ∇²ℓ_p(ζ) and, consequently, of G_p(ζ).

Table 3:
Reconstruction results and computational costs at the point of convergence for the BPR problem with the Gaussian error metric.

Table 4:
Reconstruction results and computational costs at the point of convergence for the BPR problem with the Poisson error metric.