Iterative least-squares solver for generalized maximum-likelihood ptychography.

Although ptychography does not require precise knowledge of the illumination wavefront, common implementations rely upon assumptions such as accurate knowledge of the scan positions and constant illumination. The limited validity of these assumptions results in a deterioration of the reconstruction quality. We present a generalized ptychography method that optimizes the reconstruction along multiple directions. In our manuscript, we demonstrate that the additional flexibility not only helps to amend imprecisions of the ptychography model in a self-consistent way but also leads to faster convergence without a significant increase of the computational cost per iteration.


Introduction
During the last decade, coherent diffractive imaging (CDI) has become an important imaging technique with remarkable success in a wide range of imaging tasks [1,2]. The shared principle of CDI methods relies on the propagation of coherent light scattered by a sample onto a detector, where the intensity distribution is measured but the phase of the scattered radiation is lost. The lost phase can be retrieved if additional constraints in real space are available.
A particularly promising CDI method is ptychography [3]. Its fundamental idea is to illuminate partly overlapping regions of the specimen with a structured illumination probe and to collect a diffraction pattern at each scan position. Adequate overlap between adjacent illuminated areas sufficiently constrains the optimization task and makes recovery of the lost phase possible. Additionally, ptychography has the ability to separate the generally unknown complex-valued illumination probe from the complex-valued transmission function of the specimen.
On the other hand, a common drawback of the CDI methods is their high computational cost. In particular, ptychographic X-ray computed tomography (PXCT) datasets may consist of more than 10^5 scan positions. Along with the higher scanning speed of modern PXCT endstations [4], fast scanning techniques [5-7] and the need for real-time reconstructions, i.e. immediately after or even during the data acquisition, the requirements on processing speed are growing significantly. In order to alleviate these issues, ptychography algorithms with fast convergence, low per-iteration cost and low memory requirements are needed.
Although closed-form algorithms for special cases of ptychography exist [8,9], ptychographic datasets are most commonly inverted by means of iterative algorithms [11] that give more flexibility in the sampling requirements [12]. Moreover, since real measurements are always deteriorated by random noise and systematic errors, a statistically optimal reconstruction should preferably be sought [13,14].
In this manuscript, a generalized approximation of the maximum-likelihood (ML) method [13] that we call least-squares maximum-likelihood (LSQ-ML) method is presented. We demonstrate that the convergence speed of the ML method can be improved while the per-iteration calculation cost is reduced and the robustness against noise and systematic errors is preserved. Additionally, the LSQ-ML method can be generalized to account for common systematic errors in the measurements such as sample position errors or small changes in the illumination probe.

Optimization problem
The ptychography reconstruction can be considered as a constrained optimization problem whose solution is defined by a common complex-valued illumination probe P_r and a common complex-valued transmission function O_r called the "object", such that the distance between the expected photon distribution I^e_{i,q} and the measured photon distribution I^m_{i,q} in the reciprocal coordinates q is minimized for each scan position i. The exit wave ψ_{i,r} downstream of the sample in the i-th scan position can be expressed as

    ψ_{i,r} = P_r O_{r+r_i}   ∀ r_i ∈ N_0,  (1)

where O_{r+r_i} denotes a patch of the object O_r shifted to position r_i relative to the probe and N_0 is the set of all scan positions r_i. In order to calculate the expected photon distribution I^e_{i,q}, the exit-wave model ψ_{i,r} is propagated by an operator P to the detector plane, i.e. Ψ_{i,q} = P{ψ_{i,r}}. The propagator P can in general be any unitary operator. This requirement is fulfilled by a wide variety of ptychographic methods, such as conventional ptychography [11], near-field ptychography [15] or Fourier ptychography [16]. The expected number of detected events at the detector plane I^e_{i,q} can be expressed as

    I^e_{i,q} = A{|Ψ_{i,q}|^2} + b_{i,q},  (2)

where A_{q',q} is a diagonally dominant positive-definite matrix or operator and b_{i,q} is a positive offset vector. An example of the operator A that can be used to include illumination coherence properties and the non-ideal response of camera pixels into the model I^e_{i,q} is a convolution with a positive kernel k [17,18],

    A{|Ψ_{i,q}|^2} = |Ψ_{i,q}|^2 ∗ k_q.  (3)

The vector b_{i,q} is equal to the number of expected background events [14,17].

We split the optimization problem into two subtasks:

1. Search for an exit wave that minimizes the distance between the expected intensity distribution I^e_{i,q} and the measured data I^m_{i,q},

    Ψ^opt_{i,q} = argmin_{Ψ_{i,q}} d( I^e_{i,q}, I^m_{i,q} ),  (4)

where Ψ^opt_i is the optimal wavefront in reciprocal space and the distance metric d(·,·) is assumed to be differentiable. Different metrics that are suitable for common ptychography setups are discussed in the next Section 2.1.
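As a concrete illustration, the forward model described above (probe times shifted object patch, unitary far-field propagation, optional blurring kernel and background) can be sketched in a few lines of NumPy. The function name and the choice of a normalized 2D FFT as the unitary propagator are our own; they are a sketch of the model, not code from this work:

```python
import numpy as np

def forward_model(P, O, r_i, kernel=None, b=None):
    """Sketch of the ptychographic forward model (hypothetical helper).

    P      : complex illumination probe, shape (n, n)
    O      : complex object transmission function
    r_i    : (row, col) integer scan position of the probe within the object
    kernel : optional positive kernel k_q modeling e.g. partial coherence
    b      : optional expected background counts b_{i,q}
    Returns the exit wave psi and the expected intensity I_e at the detector.
    """
    n = P.shape[0]
    # psi_{i,r} = P_r O_{r+r_i}: probe times the shifted object patch
    patch = O[r_i[0]:r_i[0] + n, r_i[1]:r_i[1] + n]
    psi = P * patch
    # unitary far-field propagator: orthonormalized 2D FFT
    Psi = np.fft.fft2(psi, norm="ortho")
    I_e = np.abs(Psi) ** 2
    if kernel is not None:
        # A{|Psi|^2}: circular convolution with the positive kernel (via FFT)
        I_e = np.real(np.fft.ifft2(np.fft.fft2(I_e) * np.fft.fft2(kernel)))
    if b is not None:
        I_e = I_e + b
    return psi, I_e
```

With norm="ortho" the FFT is unitary, so the total intensity of the exit wave is conserved on propagation, which is exactly the property required of the propagator P above.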
Note that each scan position is optimized independently; information between adjacent positions is shared in the following step, during the update of the common probe and object.
2. Decomposition of the optimal exit wave ψ^opt_{i,r} := P^{-1}{Ψ^opt_{i,q}} into the common illumination probe P_r and the common object O_r. The updated probe P^upd_{i,r} and the updated object patch O^upd_{i,r+r_i} can be expressed in the following form:

    P^upd_{i,r} = P_r + Σ_j α^(j)_{P,i} ΔP^(j)_{i,r},
    O^upd_{i,r+r_i} = O_{r+r_i} + Σ_j α^(j)_{O,i} ΔO^(j)_{i,r+r_i},  (5)

where ΔP^(j)_{i,r} and ΔO^(j)_{i,r+r_i} denote the j-th optimization direction in the i-th scan position and α^(j)_{P,i}, α^(j)_{O,i} are the optimal steps along these directions. These optimization directions can generally correspond to, e.g., updates of the common probe and object but also to position refinement, a variable wavefront, or changes in sample properties during the scan. The way to find the optimal directions is discussed in Section 2.3. The update steps α^(j)_{P,i}, α^(j)_{O,i}, which can generally differ for each scan position, are selected in order to minimize the Euclidean distance

    min_{α^(j)_{P,i}, α^(j)_{O,i}} Σ_r | ψ^opt_{i,r} − ψ^upd_{i,r} |^2  (6)

between the optimal wavefront ψ^opt_{i,r} and the updated wavefront decomposition

    ψ^upd_{i,r} = P^upd_{i,r} O^upd_{i,r+r_i}.  (7)

As we will show in Section 2.2, the estimation of the update steps can be approximated as a linear least-squares (LSQ) problem.
The unitary property of the propagator P guarantees that the new estimate of the exit wave ψ^upd_{i,r}, which minimizes Eq. (6), will also minimize the Euclidean distance between the new reciprocal estimate Ψ^upd_{i,q} := P{ψ^upd_{i,r}} and the optimal wavefront Ψ^opt_{i,q}. Since Ψ^upd_{i,q} is sought to be close to Ψ^opt_{i,q}, which minimizes Eq. (4), the new estimate Ψ^upd_{i,q} will be close to the minimum as well, assuming that the reciprocal distance metric is a continuous function and the difference between Ψ^upd_{i,q} and Ψ^opt_{i,q} is small. This approximation allows separation of the reciprocal-space optimization from the real-space optimization task, and thus an identical real-space optimization approach can be adopted for different statistical metrics applied to the measured datasets. Additionally, this approximation allows expressing the decomposition task as an LSQ problem, which results in analytical formulas for the update steps α^(j)_{P,i} and α^(j)_{O,i}. The analytical expression of the step sizes provides a significant computational advantage by avoiding likelihood maximization via a line-search method, which would require repeated propagation of the updated wavefront estimate to the detector plane in order to evaluate the likelihood exactly.

Optimization in reciprocal space
First, it is important to select an appropriate noise model for the given application [13,14,19]. For simplicity, we will assume that the noise is uncorrelated between the pixels and systematic errors such as partially coherent illumination, missing pixels or incoherent background signal can be sufficiently well described by the intensity model in Eq. (2).
If noise in the measurements is dominated by multiple independent processes, or if the number of measured photons in each pixel is large, it can often be well approximated by the Gaussian probability distribution function (PDF) with standard deviation σ_{i,q}. Having measured the intensity distribution I^m_{i,q}, the probability of the distribution model I^e_{i,q} in the i-th scan position is

    p(I^m_i | I^e_i) = Π_q 1/√(2π σ_{i,q}^2) exp( −(I^e_{i,q} − I^m_{i,q})^2 / (2 σ_{i,q}^2) ).  (8)

On the other hand, if only Poisson noise is present, the Poisson PDF can be directly optimized. Given the measured photon count I^m_{i,q}, the probability of the expected distribution of photons I^e_{i,q} is

    p(I^m_i | I^e_i) = Π_q (I^e_{i,q})^{I^m_{i,q}} exp(−I^e_{i,q}) / I^m_{i,q}!.  (9)

Commonly, the measured signal is dominated by Poisson noise but may also contain other additive noise, independent of the measured signal, that can be described by a Gaussian PDF with a standard deviation σ_q. Using the variance stabilization property of the square-root transform for Poisson noise [14], the probability distribution of a transformed Gaussian PDF, which we will call the amplitude likelihood, can be expressed as

    p(I^m_i | I^e_i) = Π_q 1/√(2π (σ^a_{i,q})^2) exp( −(√I^e_{i,q} − √I^m_{i,q})^2 / (2 (σ^a_{i,q})^2) ),  (10)

where σ^a_{i,q} denotes the transformed distribution of the overall noise. If only Poisson noise is present, then σ^a = 0.5 is a constant independent of the measured data. However, if other noise sources with standard deviation σ^e_q are present, the dependence of σ^a_{i,q} on the measured data I^m_{i,q} can be approximated as σ^a_{i,q} ≈ 0.5 √(1 + (σ^e_q)^2 / I^m_{i,q}), assuming that I^m_{i,q} ≫ (σ^e_q)^2. Since this allows approximating the Gaussian likelihood by the amplitude one, we will limit our further analysis to the Poisson and the amplitude likelihoods.
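The square-root variance stabilization invoked above is easy to verify numerically: for Poisson-distributed counts with a reasonably large mean, the standard deviation of √I stays close to the constant 0.5 regardless of the mean. A small self-contained check (our own demonstration, not code from this work):

```python
import numpy as np

def sqrt_transform_std(mean, n=200_000, seed=42):
    """Empirical standard deviation of sqrt(I) for Poisson counts with the
    given mean; approaches the constant 0.5 used for sigma^a in the text."""
    rng = np.random.default_rng(seed)
    return np.sqrt(rng.poisson(mean, size=n)).std()

for mean in (10.0, 100.0, 1000.0):
    # All three values land near 0.5, illustrating the stabilization.
    print(f"mean={mean:7.1f}  std(sqrt(I)) ~ {sqrt_transform_std(mean):.3f}")
```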
The optimal reciprocal-space solution Ψ^opt_{i,q} should maximize the probability p(I^m_i | Ψ_i). An equivalent approach to the probability maximization is to minimize the negative log-likelihood L(Ψ_i) = −ln p(I^m_i | Ψ_i). Since we search only for the position of the minimum of L, constant offsets do not affect our results and will therefore be omitted. The negative log-likelihoods of the aforementioned metrics can be expressed as

    L_P(Ψ_i) = Σ_q ( I^e_{i,q} − I^m_{i,q} ln I^e_{i,q} ),   Poisson log-likelihood  (11a)
    L_A(Ψ_i) = Σ_q ( √I^e_{i,q} − √I^m_{i,q} )^2 / ( 2 (σ^a_{i,q})^2 ).   Amplitude log-likelihood  (11b)

Gradients of the likelihood functions with respect to the propagated exit wave Ψ_{i,q} can be expressed analytically:

    ∇_{Ψ_{i,q}} L_P = Ψ_{i,q} A^T{ W_i ( 1 − I^m_i / I^e_i ) }_q,  (12)
    ∇_{Ψ_{i,q}} L_A = Ψ_{i,q} A^T{ W_i ( 1 − √( I^m_i / I^e_i ) ) }_q,  (13)

where W_{i,q} denotes a weighting factor defined so that W_{i,q} = 1 if only Poisson noise is present and W_{i,q} = 0 if the pixel at coordinate q contains invalid data, for example due to bad or saturated pixels. Once the locally optimal update direction is known, the updated propagated exit wave Ψ^opt_{i,q} can be found by a gradient-descent step,

    Ψ^opt_{i,q} = Ψ_{i,q} − α_{i,q} ∇_{Ψ_{i,q}} L(Ψ_i).  (14)

The step length α_i is sought to minimize the negative log-likelihood L(Ψ_i). Since both the amplitude and the Poisson log-likelihoods, Eq. (11), are convex along the ∇_{Ψ_{i,q}} L direction, the optimal step α_i can be calculated as an extreme of the selected log-likelihood,

    α_i = argmin_α L( Ψ_{i,q} − α ∇_{Ψ_{i,q}} L ).  (15)

In the case of the amplitude likelihood, the update step can be expressed as a vector,

    α^a_{i,q} = 1 / Σ_{q'} A_{q',q},  (16)

where Σ_{q'} A_{q',q} denotes the sum over the q-th column of the matrix A. If only Poisson noise is present and the expected intensity I^e_{i,q} is assumed to be equal to |Ψ_{i,q}|^2, i.e. A is a diagonal unitary matrix and the background offset b_{i,q} is zero, the step size given by Eq. (16) is constant and Eq. (14) becomes equivalent to the classical modulus constraint in CDI methods [20]. If additional sources of noise are present, the gradient-descent update in Eq. (14) becomes equivalent to a relaxed modulus constraint for the pixels where the weighting factor W_{i,q} < 1.
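For the simple diagonal case (A equal to identity, b = 0, all pixels valid), the two likelihood gradients can be sketched as follows. The helper names and the small epsilon guarding against division by zero are our own additions; with a unit step, subtracting the amplitude gradient reproduces the classical modulus constraint mentioned in the text:

```python
import numpy as np

def grad_amplitude(Psi, I_m, W=None):
    """Amplitude log-likelihood gradient for the simple model I_e = |Psi|^2
    (A = identity, b = 0); W masks invalid pixels (sketch, names ours)."""
    I_e = np.abs(Psi) ** 2 + 1e-12          # epsilon avoids division by zero
    W = np.ones_like(I_e) if W is None else W
    return Psi * W * (1.0 - np.sqrt(I_m / I_e))

def grad_poisson(Psi, I_m, W=None):
    """Poisson log-likelihood gradient for the same simple model."""
    I_e = np.abs(Psi) ** 2 + 1e-12
    W = np.ones_like(I_e) if W is None else W
    return Psi * W * (1.0 - I_m / I_e)

# With unit step, gradient descent on the amplitude metric gives
# Psi - grad = Psi * sqrt(I_m) / |Psi|, i.e. the classical modulus
# constraint; the Poisson gradient vanishes where I_m matches |Psi|^2.
```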
Similarly, the optimal step α^p_i for the Poisson likelihood can be estimated from the extremum condition in Eq. (15). For the simple model I^e_{i,q} = |Ψ_{i,q}|^2, this leads to the recursive expression

    α^p_i = ( Σ_q ξ_{i,q} |Ψ_{i,q}|^2 − Σ_q ξ_{i,q} I^m_{i,q} / (1 − α^p_i ξ_{i,q}) ) / Σ_q ξ_{i,q}^2 |Ψ_{i,q}|^2,  (17)

where ξ_{i,q} := 1 − I^m_{i,q}/I^e_{i,q}. This equation can be solved iteratively; in practice, the step α^p_i from the previous iteration usually serves as a sufficient initial guess. Finally, propagation of the optimal solution Ψ^opt_{i,q} from Eq. (14) back to real space provides the updated exit-wave estimate

    ψ^opt_{i,r} = P^{-1}{Ψ^opt_{i,q}}.  (18)

Optimization in real space
Once the updated estimate of the exit wave ψ^opt_{i,r} is known, a decomposition of the exit-wave update χ_{i,r} into the probe- and object-related update directions ΔP^(j)_{i,r} and ΔO^(j)_{i,r} is sought. The exit-wave update χ_{i,r} is defined as the difference between the updated and the original exit wave for the i-th scan position,

    χ_{i,r} := ψ^opt_{i,r} − ψ_{i,r}.  (19)

The optimal decomposition should minimize the real-space cost function

    L_r(ψ^upd_i) = Σ_r | ψ^upd_{i,r} − ψ^opt_{i,r} |^2,  (20)

where the updated decomposition ψ^upd_{i,r} was defined in Eqs. (5) and (7). We will approximate the real-space cost function L_r by a linear combination of the probe and object updates,

    χ_{i,r} ≈ Σ_j [ α^(j)_{P,i} ΔP^(j)_{i,r} O_{r+r_i} + α^(j)_{O,i} ΔO^(j)_{i,r+r_i} P_r ].  (21)

If the update directions ΔP^(j)_{i,r} and ΔO^(j)_{i,r+r_i} are known, the optimal step sizes α^(j)_{P,i} and α^(j)_{O,i} along these directions can be solved analytically by the linear least-squares (LSQ) method. Assuming that only a decomposition into one probe and one object update direction for each scan position is sought, LSQ leads to a system of two equations per scan position, which can be expressed in the following form:

    ( Σ_r |ΔP_{i,r} O_{r+r_i}|^2 + γ ) α_{P,i} + Σ_r R{ (ΔP_{i,r} O_{r+r_i})^* ΔO_{i,r+r_i} P_r } α_{O,i} = Σ_r R{ (ΔP_{i,r} O_{r+r_i})^* χ_{i,r} },
    Σ_r R{ (ΔO_{i,r+r_i} P_r)^* ΔP_{i,r} O_{r+r_i} } α_{P,i} + ( Σ_r |ΔO_{i,r+r_i} P_r|^2 + γ ) α_{O,i} = Σ_r R{ (ΔO_{i,r+r_i} P_r)^* χ_{i,r} },  (22)

where R denotes the real part of a complex number, * denotes the complex conjugate, and γ is a small regularization constant that prevents numerical instability when the vectors ΔP_{i,r} O_{r+r_i} or ΔO_{i,r+r_i} P_r are close to zero. The solution of this system provides the optimal step lengths for the i-th scan position. Solving the system in Eq. (22) is computationally cheap; however, the cost grows quadratically with a larger number of update directions, mainly due to the increased number of non-diagonal terms. The computational cost can be significantly reduced if the update directions can be considered mutually orthogonal. In that case, the non-diagonal terms of the matrix in Eq. (22) become zero and the optimal step lengths can be estimated as

    α^(j)_{P,i} = Σ_r R{ (ΔP^(j)_{i,r} O_{r+r_i})^* χ_{i,r} } / ( Σ_r |ΔP^(j)_{i,r} O_{r+r_i}|^2 + γ ),
    α^(j)_{O,i} = Σ_r R{ (ΔO^(j)_{i,r+r_i} P_r)^* χ_{i,r} } / ( Σ_r |ΔO^(j)_{i,r+r_i} P_r|^2 + γ ).  (23)
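The two-unknown least-squares problem of Eq. (22) reduces to a 2×2 real linear solve per scan position. A minimal sketch (helper names are ours; γ plays the role of the small regularization constant in the text):

```python
import numpy as np

def lsq_step_sizes(chi, dP_O, dO_P, gamma=1e-6):
    """Solve the 2x2 system for the real step sizes (alpha_P, alpha_O)
    minimizing ||chi - alpha_P*dP_O - alpha_O*dO_P||^2, where dP_O and
    dO_P are the flattened complex vectors dP*O_patch and dO*P.
    gamma is a small Tikhonov term guarding against tiny update vectors."""
    # np.vdot conjugates its first argument; the real parts form the
    # normal equations of the real-coefficient least-squares problem.
    AA = np.array([
        [np.vdot(dP_O, dP_O).real + gamma, np.vdot(dP_O, dO_P).real],
        [np.vdot(dO_P, dP_O).real,         np.vdot(dO_P, dO_P).real + gamma],
    ])
    Ab = np.array([np.vdot(dP_O, chi).real, np.vdot(dO_P, chi).real])
    return np.linalg.solve(AA, Ab)
```

When the exit-wave update really is a real-coefficient combination of the two direction vectors, the solve recovers those coefficients exactly (up to the γ bias).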

Optimization directions
So far the update directions were assumed to be known and only the optimal steps along the known directions were estimated. In this section, we will present several examples of selecting the update directions including common ones, such as probe and object updates, but also directions that can serve to refine the scan positions or account for unstable illumination wavefront.

Probe and object update directions
The fundamental idea of ptychography is that both object and probe are reconstructed self-consistently from the data. The advantage of our LSQ approach is the possibility to estimate the optimal directions for the object and probe updates without considering the precise form of the statistical metric between the intensity distribution model and the measured data. The simplest option for the update directions are the steepest-descent updates ΔP_{i,r} and ΔO_{i,r+r_i}, given as the negative gradients of the real-space cost function L_r defined in Eq. (20):

    ΔP_{i,r} = χ_{i,r} O^*_{r+r_i},  (24a)
    ΔO_{i,r+r_i} = χ_{i,r} P^*_r.  (24b)

Until now in this manuscript, each scan position was treated independently; however, the assumption of a common illumination probe and object is an essential part of the ptychography algorithm. Using the averaged update directions ΔP̂_r, ΔÔ_r provides a better estimate of the optimal update directions, because the features that are well correlated in the overlapping regions get amplified while the uncorrelated ones diminish. The common update directions can be estimated as

    ΔP̂_r = Σ_{i∈N} χ_{i,r} O^*_{r+r_i} / ( Σ_{i∈N_0} |O_{r+r_i}|^2 + δ_P ),
    ΔÔ_r = Σ_{i∈N} χ_{i,r−r_i} P^*_{r−r_i} / ( Σ_{i∈N_0} |P_{r−r_i}|^2 + δ_O ).  (25)

The numerator in the updates ΔP̂_r, ΔÔ_r can generally be summed over a partial set of scan positions N ⊂ N_0, and the probe and object updates can be evaluated and applied sequentially, as will be discussed in Section 2.5. If N = N_0, then the shared update directions in Eq. (25) are equivalent to the ones proposed for the original ML method [13], with additional weighting factors in the denominators. The weightings can be seen as preconditioners of the gradient-descent task [19,21] with small regularization constants δ_P, δ_O > 0. These regularized preconditioners result in a behavior similar to the damped least-squares method, which penalizes large values in the update, particularly in the weakly illuminated regions.
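A sketch of the averaged, preconditioned update directions of Eq. (25): the numerators accumulate χ_i O* and χ_i P* over the chosen subset of positions, while the denominators act as preconditioners regularized by δ_P, δ_O. All names are ours, and the loop is kept naive for clarity rather than speed:

```python
import numpy as np

def averaged_update_directions(chi, P, O, positions, n,
                               delta_P=0.1, delta_O=0.1):
    """Shared probe/object update directions (sketch; delta_P and delta_O
    stand in for the small regularization constants from the text).

    chi       : list of exit-wave updates chi_i, each (n, n) complex
    P         : current probe (n, n); O : current object (larger array)
    positions : list of (row, col) integer scan positions
    """
    num_P = np.zeros_like(P)
    den_P = np.zeros(P.shape)
    num_O = np.zeros_like(O)
    den_O = np.zeros(O.shape)
    for chi_i, (y, x) in zip(chi, positions):
        patch = O[y:y + n, x:x + n]
        num_P += chi_i * np.conj(patch)                 # sum_i chi_i O*
        den_P += np.abs(patch) ** 2                     # illumination weight
        num_O[y:y + n, x:x + n] += chi_i * np.conj(P)   # sum_i chi_i P*
        den_O[y:y + n, x:x + n] += np.abs(P) ** 2
    dP = num_P / (den_P + delta_P)
    dO = num_O / (den_O + delta_O)
    return dP, dO
```

Regions never touched by the probe keep a zero object update, while the δ terms damp the update where the accumulated illumination is weak.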
Since the object and probe updates given by Eq. (25) cannot be considered mutually orthogonal, the approximation in Eq. (21) describes the real-space cost function L_r(ψ_i) well only if the following condition is fulfilled:

    | α_{P,i} α_{O,i} ΔP_{i,r} ΔO_{i,r+r_i} | ≪ | χ_{i,r} |,  (26)

i.e. if the bilinear term neglected in the product of Eqs. (5) and (7) remains small. Another consequence of the non-orthogonality is that the non-diagonal terms of the matrix in Eq. (22) cannot generally be neglected in the calculation of the optimal update step sizes. The final updated probe P̂_r and object Ô_r can be calculated as weighted means of the scaled updates,

    P̂_r = P_r + ΔP̂_r ( Σ_{i∈N} α_{P,i} |O_{r+r_i}|^2 / Σ_{i∈N} |O_{r+r_i}|^2 ),
    Ô_r = O_r + ΔÔ_r ( Σ_{i∈N} α_{O,i} |P_{r−r_i}|^2 / Σ_{i∈N} |P_{r−r_i}|^2 ).  (27)

The different step size for each scan position gives the algorithm an option to adjust the convergence as a function of the local properties of the object or the illumination. So far, only the steepest-descent update was considered. However, the same approach can be used for the nonlinear conjugate gradient method [22], which avoids the well-known zig-zag behavior of steepest descent [13,21,23]. In that case, the gradient update in Eq. (25) has to be calculated over the full set N_0, and the optimal step is estimated along the conjugated directions. In this manuscript, the nonlinear conjugate gradient method with the Polak-Ribière update formula [24] is used for the reconstruction of examples with the parallel update, i.e. N = N_0, presented in Section 3.

Sample position refinement direction
Sample position refinement can be expressed as an additional update direction as well. As was already discussed in [25], if the reciprocal-space update is carried out with a positioning error, it results in a small shift between the original wavefront ψ_i and the updated wavefront ψ^opt_i. Since the shift is significantly smaller than 1 pixel, a linear approximation of the shift by a gradient of the object or probe is sufficient, i.e.

    χ_{i,r} ≈ α^(x)_i ΔP^(x)_{i,r} O_{r+r_i} + α^(y)_i ΔP^(y)_{i,r} O_{r+r_i},  (28)

where α^(x)_i and α^(y)_i are the horizontal and vertical update steps for the i-th scan position, respectively. For example, the probe update direction corresponding to the horizontal shift can be expressed as

    ΔP^(x)_r = F^{-1}{ 2πi x/N_x F{P_r} },  (29)

where x = [1, . . . , N_x] is a linear ramp vector, i is the imaginary unit and F{} denotes the 2D discrete Fourier transform. Due to the mutual orthogonality of the update directions, the update step α^(x)_i can be well approximated by the diagonal approximation in Eq. (23). Note that the position refinement direction vector can be formulated identically for the object function as well. The sum of the relative probe shifts α^(x)_i, which can in total be much larger than 1 px, is then used to improve the sample positions in the following iterations.
This method shows good sub-pixel precision that is needed to reveal errors during scanning or long term drifts as discussed in Section 3.2.
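A minimal numerical sketch of the shift-refinement direction: the horizontal derivative of the probe is built from a Fourier-domain linear ramp (using NumPy's fftfreq ordering rather than an explicit 1..N_x ramp), and the step follows from the diagonal LSQ projection of Eq. (23). Helper names are ours; the sign of the recovered step depends on the shift convention:

```python
import numpy as np

def shift_direction_x(P):
    """Update direction for a horizontal sub-pixel probe shift, built from
    a linear phase ramp in Fourier space (sketch; fftfreq ordering)."""
    fx = np.fft.fftfreq(P.shape[1])              # cycles per pixel
    ramp = 2j * np.pi * fx[np.newaxis, :]        # d/dx in Fourier space
    return np.fft.ifft2(ramp * np.fft.fft2(P))

def estimate_shift_step(chi_i, dPx_O):
    """Diagonal LSQ approximation for the shift step (names ours)."""
    return np.vdot(dPx_O, chi_i).real / (np.vdot(dPx_O, dPx_O).real + 1e-12)
```

For a smooth probe shifted by a fraction of a pixel, projecting the resulting exit-wave update onto this direction recovers the shift magnitude with sub-pixel accuracy, which is the mechanism behind the refinement described above.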

Illumination wavefront refinement direction
Small changes in the illumination wavefront can be described by directional vectors as well. There are two approaches: either we assume that the wavefront changes can be described by a small number of Zernike polynomials, or the optimal direction is reconstructed self-consistently from the wavefront update vectors χ_{i,r}.
In the first case, directional vectors describing small variations of the illumination wavefront can be expressed as ΔP_{Z_j,r} := i Z_{j,r} P_r for j > 0, where Z_{j,r} denotes the j-th Zernike polynomial. For example, if we consider only a linear phase ramp, i.e. the first-order Zernike polynomials, the corresponding step α^{Z_j}_i provides information about a sub-pixel shift of the i-th diffraction pattern in the far field. Since ΔP_{Z_j,r} can be considered close to orthogonal to the other update directions, the approximation in Eq. (23) can be used to estimate the optimal steps. This update direction is useful if the illumination wavefront is changing during the scan, but a similar effect can be observed in Fourier ptychography for local or field-dependent aberrations as well [26]. Similarly, intensity fluctuations can be described as ΔP_{int,r} := P_r, resulting in a correction α^{int}_i for the i-th scan position. The second option is to recover the directional update directly from the data. This idea is similar to orthogonal probe relaxation ptychography (OPRP) [27], where the variable illumination is described by a sum of a small number of orthogonal modes. It is common that a single additional orthogonal mode is sufficient to describe the wavefront variations in synchrotron-based imaging. In that case, the most relevant directional update ΔP_{eig,r} can be estimated by a modified power method [28] and will be called here the eigenprobe. The power method is applied to the matrix of residual probe updates R, whose i-th column is defined as

    R_{:,i} = ΔP_i − ΔP̄,  (30)

where ΔP_i denotes the gradient of the real-space cost function L_r(ψ_i) defined in Eq. (24a) and ΔP̄ is the average of the gradient over the scan positions i.
If the previous estimate of the eigenprobe ΔP_{eig} is known, the next estimate can be refined as

    ΔP_{eig} ← (1 − β) ΔP_{eig} + β R α̂_{eig} / ||R α̂_{eig}||,  (31)

where β is a relaxation constant that equals 1 for the parallel update and 0 < β ≤ 1 for sequential updates, and α̂_{eig} is a vector whose elements are the sums of the previous optimal updates α^{eig}_i, ∀ r_i ∈ N, estimated from Eq. (23). The eigenprobe ΔP_{eig} should be initialized as a random matrix, as proposed in the power method.
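One plausible reading of the relaxed power-method refinement, with the eigenprobe kept at unit norm (the exact normalization is our assumption, and all names are ours):

```python
import numpy as np

def refine_eigenprobe(dP_eig, residuals, alphas, beta=1.0):
    """One relaxed power-method step for the eigenprobe (sketch).

    residuals : list of residual probe updates R_i = dP_i - mean(dP)
    alphas    : previous optimal steps alpha_eig_i, one per position
    beta      : relaxation constant (1 for the parallel update)
    """
    # Weighted combination of residuals, i.e. R @ alpha_eig ...
    update = sum(a * R for a, R in zip(alphas, residuals))
    # ... rescaled by the squared step norm (our normalization choice).
    norm = np.sum(np.asarray(alphas) ** 2) + 1e-12
    new = (1.0 - beta) * dP_eig + beta * update / norm
    return new / (np.linalg.norm(new) + 1e-12)   # keep unit norm
```

If every residual is a scalar multiple of a single mode, one step with β = 1 recovers that mode exactly, which is the fixed point the power iteration converges to in general.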

Implementation of additional update directions
The presented additional update directions effectively result in a different illumination probe for each scan position. The modified illumination probe P_{i,r} for the i-th position can be expressed from the common probe P_r as

    P_{i,r} = (1 + α̂^{int}_i) exp( i Σ_j α̂^{Z_j}_i Z_{j,r} ) S_{α̂^x_i, α̂^y_i}{ P_r + α̂^{eig}_i P_{eig,r} },  (32)

where α̂^•_i denotes the sum over the previous steps α^•_i, • denotes any of the superscripts, i.e. int, Z_j, x, y or eig, and S_{α̂^x_i, α̂^y_i}{} denotes a sub-pixel shift operator. The first term is responsible for the intensity correction, the second term for the wavefront correction, and the last term corresponds to the sub-pixel probe shifts of the common probe P_r and its variable part α̂^{eig}_i P_{eig,r}. If all α̂^•_i are zero, then P_{i,r} is equal to the common probe P_r. If the correction terms α̂^•_i are unconstrained, additional ambiguities may be introduced into the ptychography reconstruction. In order to avoid this, the average correction over all the scan positions has to be constrained to zero,

    Σ_{i∈N_0} α̂^(j)_i = 0   for j = 1, . . . , N_dir,  (33)

where N_dir is the number of additional directions, i.e. all directions other than the updates of the common probe and of the common object. This will force, for example, an aberration that is common to all the scan points into the common probe, and only the changes of the illumination function that cannot be described by a common probe are picked up by the α^(j)_i correction terms.
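The per-position probe assembly can be sketched as follows: an intensity scaling, a Zernike phase correction, and a Fourier-based sub-pixel shift applied to the common probe plus its eigenprobe component. The helper names and argument conventions are hypothetical, chosen only to mirror the correction terms described above:

```python
import numpy as np

def subpixel_shift(P, dy, dx):
    """Shift a 2D array by (dy, dx) pixels via the Fourier shift theorem."""
    fy = np.fft.fftfreq(P.shape[0])[:, None]
    fx = np.fft.fftfreq(P.shape[1])[None, :]
    phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.fft.ifft2(np.fft.fft2(P) * phase)

def position_probe(P, P_eig=None, a_int=0.0, a_Z=(), Z=(), a_eig=0.0,
                   a_x=0.0, a_y=0.0):
    """Assemble the per-position probe from accumulated correction terms
    (hypothetical helper; arguments mirror the alpha-hat terms in the text)."""
    base = P if P_eig is None else P + a_eig * P_eig        # variable part
    shifted = subpixel_shift(base, a_y, a_x)                # sub-pixel shift
    aberration = sum(a * Zj for a, Zj in zip(a_Z, Z)) if Z else 0.0
    return (1.0 + a_int) * np.exp(1j * aberration) * shifted
```

With all corrections at zero, the per-position probe reduces to the common probe, matching the statement above.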

Parallel vs. sequential gradient descent
The optimization task can be either solved in parallel for all scan positions or the updates can be applied sequentially, i.e. one diffraction pattern at a time. The sequential approach generally leads to faster convergence [19,29,30]. Particularly in the case of a poor initial guess, sequential solvers such as the extended ptychography iterative engine (ePIE) [29] are often able to avoid being trapped in local minima and thus have a better chance to find the global optimum. On the other hand, parallel solvers provide better robustness against random and systematic errors [13,14,19]. Moreover, parallel solvers can efficiently distribute the computational task into many independent threads or clusters and thus the parallel approach may help to significantly reduce the per-iteration time.
In our implementation, we use a block update approach, where the update is applied in parallel over partitions N_k of the full set N_0. The block update approach results in lower variation in the update estimates compared to the sequential updates and thus provides more stable convergence. At the same time, serial updating of smaller subsets can provide faster convergence than one global update. Selection of the partitions N_k affects only the numerator in Eq. (25); the rest of the LSQ approach stays unchanged. The denominator in Eq. (25) could be summed over the partitions N_k as well; however, this would lead to unnecessary noise amplification in the regions that are weakly illuminated in the subset N_k but are well covered by the full set N_0.
There are two distinct approaches to the subset selection: compact partitions with maximal mutual overlap of the illumination, or sparse partitions, for which the relative overlap between the selected positions is minimized, as shown in Fig. 1. In the rest of the text, we will denote the compact-set approach LSQ-ML-c and the sparse-set approach LSQ-ML-s. LSQ-ML-c behaves more like the ML algorithm, with better robustness against errors in the data, while LSQ-ML-s behaves more like the ePIE method, providing faster initial convergence but lower robustness against noise. Moreover, the ratio between the block sizes and the number of scan positions provides the flexibility to trade faster convergence against better robustness and lower per-iteration time, because calculations of larger blocks can be parallelized more efficiently.
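A simplified sketch of the two partitioning strategies. A real implementation would partition by spatial clustering of the scan trajectory; here a cheap lexicographic ordering stands in for it, and both the function name and the ordering heuristic are ours:

```python
import numpy as np

def make_partitions(positions, n_blocks, mode="sparse"):
    """Split scan-position indices into blocks (simplified sketch).

    'compact' groups spatial neighbors together (maximal overlap inside a
    block, LSQ-ML-c flavor); 'sparse' strides through a spatial ordering so
    each block spreads over the whole field of view with minimal internal
    overlap (LSQ-ML-s flavor).
    """
    positions = np.asarray(positions, dtype=float)
    # Cheap spatial sort: lexicographic by (row, col).
    order = np.lexsort((positions[:, 1], positions[:, 0]))
    if mode == "compact":
        # Consecutive chunks of the ordering -> compact blocks.
        return np.array_split(order, n_blocks)
    # Every n_blocks-th element of the ordering -> spatially spread blocks.
    return [order[k::n_blocks] for k in range(n_blocks)]
```

Either way, the blocks form a disjoint cover of all scan positions; only their spatial layout, and hence the overlap within each block, differs.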
Finally, the memory requirements of our method are proportional only to the block size and the size of the dataset. Therefore, smaller blocks can be beneficial for processing large datasets using graphics processing units (GPU) for which the available memory is limited and transfer into the GPU memory may cause significant overhead.

Numerical simulations
The advantages of the presented method are first demonstrated on realistic numerical simulations, because these allow us to avoid the effects of sample drifts, incoherent scattering, and illumination wavefront instability. Such errors are related to a specific experiment rather than to properties of the reconstruction method. The simulated data were thus limited only by Poisson noise.
The object and probe models are based on a reconstruction of the Eiger detector readout chip (see Fig. 2) presented in [31]. For our simulation, we have chosen a step size of 2.4 µm in combination with an illuminating probe diameter of approximately 10 µm. The reconstruction has a pixel size of 38.4 nm. The illumination was fully coherent at a photon energy of 6.2 keV. The incident flux in the simulation was normalized to an average dose of 1100 photons per object pixel, i.e. 0.75 photons/nm², and corresponding Poisson-distributed noise was added. The scan consisted of 392 Nyquist-sampled diffraction patterns with 512×512 pixels each. The scanning pattern followed a Fermat spiral [32] in order to avoid the raster grid pathology [33].
The initial guesses of the object and probe were estimated as a Fourier-interpolated reconstruction of a dataset cropped to the central 128×128 pixels using the difference map (DM) algorithm [34]. This cropping reduces the computational cost and memory requirements 16-fold compared to the full-resolution reconstruction and makes the time to solve the low-resolution estimate rather negligible compared to the final optimization.
The evolution of the reconstruction quality was measured by the half-pitch resolution estimated from the intersection of the Fourier ring correlation (FRC) curve with the 1-bit criterion threshold [35]. FRC was computed between the original model and the sub-pixel aligned reconstruction [36] with the relative phase ramp removed prior to each FRC evaluation.

Convergence analysis
Convergence of the least-squares maximum-likelihood (LSQ-ML) method with the amplitude cost function was compared to common ptychography methods, in particular DM, ePIE and ML, in Fig. 2. The ePIE method shows the largest progress in the first iteration but then makes only slow progress towards higher spatial resolution. The reconstruction quality of the DM method is known to be more sensitive to noise [13]; therefore, DM only slightly improved the low-resolution initial guess. The ML method reached significantly better results than the ePIE and DM methods, even though its initial convergence was slower. Compared to these methods, the LSQ-ML methods provide the fastest convergence and the highest resolution within 1000 iterations. Two examples of LSQ-ML are shown: the red curve shows the sequential option using sparse partition selection, LSQ-ML-s, with a block size B = 50, i.e. 13% of the full dataset. Note that we count one iteration once all the data have been used. The blue curve shows a reconstruction with the LSQ-ML-c technique using a single compact partition. In particular, the LSQ-ML-s method provides very fast convergence, reaching after 20 iterations a resolution similar to that of the ML code after 200 iterations. On the other hand, it still suffers from issues similar to the ePIE method; namely, the solver is not able to reach the optimal solution and the resolution stagnates at around 42 nm. The LSQ-ML-c method, which in this case used a single global update, provides higher reconstruction quality at the price of slower initial convergence.
A more detailed evaluation of the reconstruction quality is shown in Fig. 3, where the spectral signal-to-noise ratio (SSNR) after 20 iterations and after 1000 iterations is shown in Figs. 3(a) and 3(b), respectively. Assuming zero-mean noise that is uncorrelated between the model object and the reconstruction, the SSNR can be estimated directly from their Fourier ring correlation.
The ePIE and DM methods show only minor progress after 30 iterations, except for a small improvement in the lowest spatial frequencies. Although both LSQ-ML methods reach almost their final resolution already after 30 iterations, the average SSNR is improved more than twice by using 1000 iterations. Finally, ML has slower convergence, but its final SSNR curve is very close to the LSQ-ML curves.

Position refinement
Position refinement is demonstrated on the simulated dataset described in the previous section by adding random position errors with a standard deviation of 1 pixel, i.e. 38.4 nm in the experimental geometry. If the position errors significantly exceed 1 pixel, a pyramidal approach, as used in image processing [38], can be applied by reconstructing first at a lower resolution, where the computational cost per iteration is significantly lower, and then progressively improving the resolution and, along with it, the position refinement precision. The mean squared error (MSE) of the refined positions is shown by a blue line in Fig. 4(a). The refined positions converge to the true positions at a close-to-exponential rate. Despite the fact that we used a rather photon-limited dataset, an accuracy as good as 10^-2 pixel, i.e. 0.4 nm, was achieved.
Refinement of global parameter errors, such as pixel scaling, rotation or skewing, which introduce correlated position errors, is shown in Figs. 4(b)-4(c). We have added a 0.5° error in the rotation and skewness of the translation axes and a 0.1% error in the pixel scale. The rotation error can arise from a relative angle between the scanning axes and the detector pixels, while an error in the pixel scale could arise from imprecise knowledge of the incoming photon energy or of the sample-to-detector distance. These errors usually have a rather negligible effect on the visual reconstruction quality; however, they result in a quality loss when multiple 2D frames are tomographically assembled in 3D [4]. The convergence of the position refinement is shown via the mean squared error (MSE) with a red line in Fig. 4(a). The convergence is clearly slower compared to the random errors because, in this case, the relative errors between neighboring positions are very small and information propagation between distant regions is slower.

Wavefront variation refinement
The wavefront changes were simulated by variations in intensity and wavefront tilt. The intensity was modulated using a sinusoidal function with an amplitude of 5% and a period of 50 scan positions. The variable aberration was simulated by first-order Zernike polynomials, resulting in a sinusoidal shift of the diffraction pattern in Fourier space with an amplitude of 1 pixel and a period of 100 positions. A similar level of instability can be observed in high-resolution laboratory-based EUV microscopy [27,39] or at free-electron lasers [40]. The wavefront refinement was performed as described in Section 2.3.3. Figure 5(a) shows the convergence of the wavefront parameters to their true values. Similarly to the position refinement, the short-distance corrections, i.e. between neighboring frames, are corrected in a few iterations. The distant regions of the reconstruction may still contain some small errors even after 1000 iterations, as shown in Fig. 5(c), compared to the initial far-field wavefront displacement shown in Fig. 5(b).

Experimental verification
In order to verify the experimental performance, two ptychography scans of the same region of the Eiger detector readout chip presented in [31] were reconstructed, and the repeatability was evaluated by the FRC. The scan step was 4.2 µm, in combination with an illumination diameter of 10 µm. The average imaging dose was 1900 photons per object pixel, or 1.3 photons/nm². Each of the two processed ptychography datasets contains 423 positions. As already noted in the original publication [31], the translation axes were skewed by 1.38°. In addition to this skewness, we observed that further position refinement was beneficial. In order to reduce the effect of systematic errors on the performance of the compared methods, the scan positions were corrected by our position refinement method, and each algorithm was then initialized with the refined positions. The initial guesses of the object and probe were obtained from a low-resolution reconstruction as described in the previous section.
The evolution of the reconstruction quality for the different methods is shown in Fig. 6. Since the ground truth is not available, the reconstruction quality was estimated as the intersection of the FRC curve between two reconstructions from independent datasets with the 1-bit threshold [35].
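This quality metric can be sketched as follows (an illustrative implementation under common conventions, not our exact code; the function name is hypothetical). The FRC correlates the two reconstructions ring by ring in Fourier space, and the resolution estimate is the first frequency at which the curve falls below the 1-bit threshold of van Heel and Schatz:

```python
import numpy as np

def frc_resolution(img1, img2):
    """Return (frc, threshold, cutoff), with the cutoff spatial frequency
    expressed as a fraction of the Nyquist frequency."""
    F1 = np.fft.fftshift(np.fft.fft2(img1))
    F2 = np.fft.fftshift(np.fft.fft2(img2))
    n = img1.shape[0]
    yy, xx = np.indices(img1.shape)
    r = np.hypot(xx - n // 2, yy - n // 2).astype(int)   # ring index per pixel
    nbins = n // 2
    num = np.bincount(r.ravel(), (F1 * F2.conj()).real.ravel())[:nbins]
    d1 = np.bincount(r.ravel(), np.abs(F1).ravel() ** 2)[:nbins]
    d2 = np.bincount(r.ravel(), np.abs(F2).ravel() ** 2)[:nbins]
    frc = num / np.sqrt(d1 * d2)
    # 1-bit information threshold; n_r is the number of pixels per ring.
    n_r = np.bincount(r.ravel())[:nbins].astype(float)
    thr = (0.5 + 2.4142 / np.sqrt(n_r)) / (1.5 + 1.4142 / np.sqrt(n_r))
    # Intersection: first non-DC ring where the FRC drops below the threshold.
    below = np.nonzero(frc[1:] < thr[1:])[0]
    cutoff = (below[0] + 1) if below.size else nbins
    return frc, thr, cutoff / nbins
```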
The quality evolution of two configurations of the LSQ-ML solver is shown in Fig. 6(a). The red curve corresponds to the LSQ-ML-s method with partition sizes of B = 100 positions, chosen to minimize their mutual overlap. The blue curve corresponds to LSQ-ML-c with the same partition size but with compact subset selection. The LSQ-ML method provided the fastest convergence in the first configuration and better quality than the other methods in the second configuration. In particular, the low spatial frequencies had better reproducibility between the reconstructions, as shown in Fig. 6(b). Despite this, the differences are less pronounced than in the simulation shown in Fig. 2, and all the methods converged before reaching 100 iterations. This cannot be explained by read-out noise in the measured data, since an Eiger photon-counting detector [41] was used. A possible explanation is that the quality of the data was limited by an incoherent background caused by scattering of the X-rays by helium gas between the sample and the detector. The average background signal was estimated to be 0.06 photons per pixel in each measured diffraction pattern.
This background reduces the signal-to-noise ratio in the low-intensity regions of the diffraction patterns, and even though a background offset was included in our model in Eq. (2), the additional noise prevents extracting all information from the Poisson statistics. This may be one of the reasons why superior results could be achieved in the simulation despite a 40% lower photon dose.
Additionally, the final qualities of all the methods are more similar to one another due to the smaller overlap between the scan positions compared to the simulation. This results in a higher number of photons per diffraction pattern if the total imaging dose, i.e. the number of photons per object pixel, is kept constant. Therefore, even statistically less aware algorithms, such as ePIE, perform comparably well to the ML method. Finally, the evolution of the reconstruction quality with and without position refinement is shown in Fig. 7(a). The position refinement resulted in an additional 0.10° tilt, -0.12° rotation and 0.02% pixel-scale correction. The residual errors that could not be described by these geometry factors are shown in Fig. 7(b). The residuals clearly show long-term drifts along both axes and significant fluctuations in the vertical direction that are well correlated between the two scans.
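The separation of the refined positions into global geometry factors and residual drifts can be sketched as follows (an illustrative least-squares decomposition, not our exact implementation; the helper name is hypothetical). A single 2x2 matrix is fitted to map nominal onto refined positions and is then decomposed into rotation, axis skew and isotropic scale; whatever the linear model cannot describe remains as the residual drifts shown in Fig. 7(b):

```python
import numpy as np

def fit_geometry(nominal, refined):
    """Least-squares fit of refined ~ nominal @ A.T for a 2x2 matrix A,
    decomposed as A = scale * rotation(theta) * skew(alpha)."""
    A = np.linalg.lstsq(nominal, refined, rcond=None)[0].T
    # With A = s*R(theta)*S(alpha), S = [[1, tan(alpha)], [0, 1]]:
    theta = np.arctan2(A[1, 0], A[0, 0])                  # rotation angle
    alpha = np.arctan(A[0, 1] / A[0, 0] + np.tan(theta))  # axis skew
    scale = np.sqrt(np.linalg.det(A))                     # isotropic scale
    residuals = refined - nominal @ A.T                   # unexplained drifts
    return theta, alpha, scale, residuals
```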

Conclusion
We have presented a novel reconstruction scheme for finding the maximum-likelihood solution in ptychography. Our method is based on an optimal least-squares decomposition of the exit-wave update into multiple directions. This allows position refinement, illumination wavefront correction and conjugate gradient updates to be included without the need for computationally expensive line-search methods. In addition to the low computational cost, we have demonstrated improved reconstruction quality and convergence speed on simulated and measured X-ray ptychography data. Moreover, the method is equally applicable to other ptychography configurations, such as Fourier ptychography or near-field ptychography.
The proposed illumination wavefront correction methods can be seen as a computationally faster and more memory-efficient approximation of the orthogonal probe relaxation (OPRP) method [27]. Besides its faster convergence, the presented method is more robust to noise due to the lower number of additional degrees of freedom. However, if the illumination variations cannot be adequately described by a low number of Zernike modes or a single orthogonal mode, the full OPRP method should be used.
An implementation of the presented method is available at www.psi.ch/sls/csaxs/software.