Noise-robust latent vector reconstruction in ptychography using deep generative models

Computational imaging is increasingly vital for a broad spectrum of applications, ranging from biological to material sciences. This includes applications where the object is known and sufficiently sparse, allowing it to be described with a reduced number of parameters. When no explicit parameterization is available, a deep generative model can be trained to represent an object in a low-dimensional latent space. In this paper, we harness this dimensionality reduction capability of autoencoders to search for the object solution within the latent space rather than the object space. We demonstrate a novel approach to ptychographic image reconstruction by integrating a deep generative model obtained from a pre-trained autoencoder within an Automatic Differentiation Ptychography (ADP) framework. This approach enables the retrieval of objects from highly ill-posed diffraction patterns, offering an effective method for noise-robust latent vector reconstruction in ptychography. Moreover, the mapping into a low-dimensional latent space allows us to visualize the optimization landscape, which provides insight into the convexity and convergence behavior of the inverse problem. With this work, we aim to facilitate new applications for sparse computational imaging such as when low radiation doses or rapid reconstructions are essential.


I Introduction
Obtaining clear and accurate images under noisy conditions is paramount in many scientific and industrial applications. Whether in medical diagnostics, materials analysis, or semiconductor inspection, noise-robust imaging techniques can mean the difference between precise understanding and potential misinterpretation. Typical noise removal techniques, like median filtering [1], anisotropic diffusion [2,3] and BM3D [4], have been shown to be effective and applicable across various domains, but are usually limited to image postprocessing and restoration. In the field of computational imaging, where an image is algorithmically recovered from noisy intensity measurements, an interesting option becomes available: noise-robustness can be intrinsically included in the process of solving the inverse problem. For example, techniques such as accurate modeling of the underlying noise statistics [5,6], sparse modeling [7][8][9], deep denoiser priors [10][11][12], and regularization by denoising [13][14][15][16][17] have been explored in this context.
Ptychography, a computational imaging method, has seen exponential growth in the number of related publications in recent years [18]. The origin of ptychography dates back to Hoppe's 1969 work, where they introduced a method for phase retrieval from electron diffraction interference [19]. This foundational concept was refined and first named "ptychography" in a subsequent paper the following year [20]. The process involves illuminating a thin object with a localized and coherent beam. The illumination field is diffracted by the object and propagates in free space to form a diffraction pattern on a camera sensor [21]. By laterally translating the object or illumination field to overlapping regions, the object's phase and amplitude can be retrieved using iterative algorithms [22]. Variations of this method include Fourier ptychography (FP), which uses a microscope objective to collect diffracted light and computationally synthesizes an image in the spatial domain [23,24]. Interestingly, reciprocity relations allow for the conversion of acquired data between both modalities, opening doors for the integration and mutual enhancement of various reconstruction algorithms [25]. In the past decade, numerous algorithmic extensions have been developed to leverage redundancy in ptychographic measurements, enhancing image quality by recovering parameters such as the illumination field [26][27][28], scanning position errors [29,30], object-camera distance [31], and multiple incoherent modes [32,33].
The inherent data redundancy in ptychography also provides a unique opportunity for data-driven reconstruction techniques such as Automatic Differentiation Ptychography (ADP), which aims to optimize a loss function using gradient descent and differentiable modeling [34][35][36][37]. This loss function is derived from the intensity prediction of a physics-based forward model, actual data, and regularization terms. Utilizing automatic differentiation (AD), this approach enables the simultaneous and joint reconstruction of multiple relevant parameters [38]. ADP enhances the reconstruction process by offering portability, flexibility, and adaptability to changes in the forward model. This includes fusing multiple camera sensors [39], adjusting the loss function for mixed noise statistics [40], or tailoring a maximum-information illumination scheme [41]. An intriguing extension of data-driven image retrieval involves integrating deep neural networks (DNNs) with the reconstruction algorithm. One approach is combining physics knowledge and machine learning to achieve optimal experimental designs [42][43][44]. Another approach is end-to-end mapping, where DNNs learn a direct mapping between the object and diffraction image domains [45][46][47][48][49][50][51][52]. This method avoids the computationally slow and costly iterative reconstruction process, even enabling real-time inference in some cases [53][54][55]. However, it requires a large amount of training data. While the training data can be generated numerically in simulation, Sinha et al. [45] acquire them in situ by projecting thousands of phase objects onto a spatial light modulator (SLM), whereas in Cherukara et al. [50], the training data is obtained from iterative phase retrieval of experimental data. Although these methods allow the model to learn physical system inaccuracies, they are time-consuming and not portable to different optical systems. Given that wave propagation is well-described by the Helmholtz equation, requiring a DNN to learn this may be unnecessary. This insight leads to physics-informed and deep learning (DL) assisted computational imaging. For instance, Goy et al. [10] found that physics-informed phase retrieval outperformed end-to-end methods for low photon counts. Similarly, Metzler et al. [15] utilize a convolutional neural network (CNN) known as DnCNN [56] as a denoising regularizer, improving the reconstruction quality by exploiting the natural absence of additive Gaussian noise in images, while Chang et al. [?] advance this approach by incorporating a complex-domain neural network that leverages the latent correlations between amplitude and phase for improved coherent imaging reconstructions. A recent development in this field is the realization that the structure of deep generative networks can capture image statistics even before learning [57], improving ptychography reconstruction quality under the concept of deep image priors without requiring a preceding training procedure [12,58,59].
In this paper, we introduce a method that combines a fully physics-based ptychography reconstruction framework with a pre-trained deep generative model. When prior knowledge indicates that a sample is sparse in an unknown basis, we show that the learned low-dimensional representation in latent space enables accurate image reconstruction even under extremely challenging noise conditions. The integration of the deep generative model serves two distinct functions. First, in a pre-training step, an under-complete autoencoder [60] learns an implicit parameterization of images belonging to a specific class (e.g., MNIST [61]). Then, this learned model is utilized to successfully reconstruct images from ill-posed diffraction data. We empirically demonstrate noise-robust latent vector reconstruction using experimental data from a photolithographically manufactured sample. The compact space spanned by the latent vectors allows us to visualize and study the optimization landscape in ptychography through principal component analysis, a novel approach to the best of our knowledge. Lastly, we quantify the reconstruction quality as a function of the total number of photons in the illumination field through numerical simulations. We find that latent vector reconstruction begins to produce faithful reconstructions with an average of 0.001 photons per camera pixel, even in the presence of readout noise. However, as the total number of photons approaches the degrees of freedom in conventional reconstruction, the performance of our approach diminishes in comparison due to its inability to produce comparably high-definition images. We provide the raw ptychography data and an open-source implementation of ptychographic latent vector reconstruction underlying this paper in [62].

A Optical Setup
We employ a ptychography setup in transmission geometry as illustrated in Fig. 1A. A continuous-wave laser (Cobolt Jive 100™) with a wavelength of λ = 561 nm is coupled into a polarization-maintaining single-mode fiber. Then, a fiber-coupled collimator (60FC-L-0-M75-26, Schäfter+Kirchhoff) expands the beam to 25 mm in diameter, illuminating a 500-µm pinhole. This pinhole is imaged onto the object using a 2-lens system with a magnification of M = 3, resulting in a circular illumination field with a uniform phase at the object plane. The object, a binary hand-drawn digit, is laterally scanned through the beam using stepper motor actuators (ZFS25B, Thorlabs). Diffraction patterns are recorded 6.5 cm downstream of the object using a CMOS camera sensor (acA2440-35um, Basler) with a pixel size of 3.45 µm and 1024 × 1024 total pixels. For calibration of the illumination field, object-camera distance, and actuator position inaccuracies, we employ a Fermat spiral scanning pattern [63] with 96 positions and an overlap of 80 %. At 16 uniformly distributed positions during this calibration scan, we capture an additional evaluation series of diffraction patterns at a lower signal-to-noise ratio (SNR), varying the camera exposure time between 300 ms and 0.03 ms, and with an approximate illumination overlap of 67 %. By acquiring the calibration and evaluation data within a single scan trajectory, we ensure that the scanning position correction remains valid for the evaluation reconstructions. In practice, the illumination field can be calibrated using any object that ensures high-SNR diffraction data and reliable reconstruction quality. The binary photomask sample, the reconstructed illumination field, and the scanning pattern are shown in Supplement 1.

B Reconstruction procedure
In this work, we employ an ADP framework integrated with a pre-trained deep generative model, as illustrated in Fig. 1B. A comprehensive description of the physics-informed ADP framework can be found in [38]. We use TensorFlow [64] to model the optical system, leveraging its differentiable programming capabilities to seamlessly incorporate the deep generative model into our forward model. Specifically, the decoder of a pre-trained autoencoder serves as this deep generative model. This enables us to represent the object as a compact latent vector rather than a conventional pixel-based image.
Fig. 1: (A) A 500-µm pinhole is illuminated with coherent light at λ = 561 nm and relayed onto the object using a 2-lens system. The object is moved laterally through the beam using a computer-controlled XY stage, and a CMOS camera sensor records the diffraction intensities 6.5 cm downstream from the object. (B) Diagram of the Automatic Differentiation Ptychography (ADP) framework, which models the physical system beginning from the object illumination. In the conventional mode, the object is represented by complex-valued pixels. With a pre-trained autoencoder for a specific class of objects, the decoder can be integrated into the ADP framework as a deep generative model, allowing the object to be represented as a latent vector and significantly reducing the number of free parameters.
For each scanning position, the object is illuminated by a coherent light field. The exit field in the object plane is computed using the projection approximation [65] and propagated to the detection plane via a band-limited angular spectrum method [66]. This yields a set of predicted diffraction patterns $I_k$ for any given object patch and illumination field. The optimization goal is to minimize a loss function L(θ) designed to maximize the likelihood that the parameter set θ accurately represents the observed diffraction patterns $X_k$. The loss function is the mixed Poisson-Gaussian loss defined in [40]; in it, $N$ is the total number of camera pixels, and $\sigma_k^2$ is the camera sensor readout noise determined from 300 dark measurements. In the high-SNR calibration reconstruction, the parameter θ encompasses the illumination field, object, object-camera distance, and the scanning positions. Conversely, when we assess our method's noise-robustness using latent vector reconstruction, we rely on the calibrated data for all parameters except the object. In this scenario, θ consists solely of the latent vector representing the object. While the optical setup and pre-calibrated illumination field are sampled at a 1024 × 1024 resolution, the decoder maps only to a 32 × 32 image output. Hence, we employ the Mitchell-Netravali cubic filter to resize the decoder output accordingly. This filter is chosen empirically for its high-quality output with smooth gradients and minimal aliasing artifacts [67].
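To make the forward model described above more concrete, the following NumPy sketch chains the projection approximation, a plain angular spectrum propagator, and a loss evaluation. It is an illustration under stated assumptions, not the implementation of this work: the function names are hypothetical, the band-limiting of [66] is omitted (only evanescent components are suppressed), and `mixed_noise_loss` uses an assumed variance-stabilized form because the exact definition is given in [40] rather than here.

```python
import numpy as np

def angular_spectrum(field, wavelength, pixel_size, distance):
    """Propagate a square complex field over `distance` (all lengths in meters)."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pixel_size)            # spatial frequencies in 1/m
    fxx, fyy = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2     # squared axial frequency
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Propagating components get a pure phase; evanescent ones are dropped.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def predict_intensity(probe, object_patch, wavelength, pixel_size, distance):
    """Projection approximation followed by free-space propagation."""
    exit_field = probe * object_patch
    detected = angular_spectrum(exit_field, wavelength, pixel_size, distance)
    return np.abs(detected) ** 2

def mixed_noise_loss(predicted, measured, sigma2):
    """ASSUMED form of a Poisson-Gaussian loss; see [40] for the real definition."""
    a = np.sqrt(np.maximum(measured + sigma2, 0.0))  # clip negative noisy counts
    b = np.sqrt(predicted + sigma2)
    return np.mean((a - b) ** 2)

# Toy usage with the wavelength, pixel size, and distance quoted in the text.
n = 64
probe = np.ones((n, n), dtype=complex)
obj = np.full((n, n), 0.5 + 0j)                      # uniform amplitude object
intensity = predict_intensity(probe, obj, 561e-9, 3.45e-6, 0.065)
loss = mixed_noise_loss(intensity, intensity, sigma2=0.09)
```

In an ADP setting the same chain would be written with differentiable operations (for example in TensorFlow) so that the gradient of the loss with respect to θ is available automatically.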
All reconstructions are executed on a commercial Nvidia RTX A6000 GPU using the Adam optimizer [68] with a randomized order of diffraction patterns. The process typically completes within 100 epochs, taking approximately 20 minutes for our datasets. The learning rate α serves as a hyperparameter that controls the step size for the gradient descent within the loss landscape. We find that learning rates in the range of α = 0.1 to 1.0 with an exponentially decaying schedule $\alpha_n = \alpha \times 0.97^n$ for the n-th epoch yield optimal convergence. While regularization terms are included for conventional reconstructions as described in [40], reconstructing the latent vectors does not necessitate any additional regularization for optimal convergence.
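As a small illustration of the decaying schedule quoted above, the sketch below evaluates the per-epoch learning rate. The function name and the default arguments are hypothetical; only the decay factor 0.97 and the 100-epoch horizon come from the text.

```python
def learning_rate(alpha0, epoch, decay=0.97):
    """Exponentially decayed learning rate: alpha_n = alpha0 * decay**n."""
    return alpha0 * decay ** epoch

# Per-epoch learning rates for a run of 100 epochs starting at alpha = 0.1.
schedule = [learning_rate(0.1, n) for n in range(100)]
```

With this factor, the step size shrinks to roughly 5 % of its initial value after 100 epochs, so early epochs explore the landscape and late epochs refine the solution.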

C Deep Generative Model
In computational imaging, incorporating machine learning models, particularly deep generative models, offers a compelling route for enhancing image reconstruction capabilities. Deep generative models, such as autoencoders, are neural networks trained to learn a compressed, yet informative, representation of unlabeled data. This learned representation, often referred to as the latent space, captures the essential features of the data while discarding noise and redundancies. This section delves into the architecture, training, and characterization of the autoencoder model used in this study, elucidating how it integrates with the ptychographic reconstruction framework to image objects of a class that is known a priori.
The autoencoder architecture, adapted and implemented in TensorFlow from [69], is depicted in Fig. 2. An autoencoder network typically aims to map input data into a lower-dimensional latent space and reconstruct it back to the original form [60]. Mathematically, an encoder function $f$ maps an input $x$ to a latent vector $h = f(x)$. A decoder function $g$ then maps $h$ back to the reconstructed input $\hat{x} = g(h)$. The autoencoder is trained using the Adam optimizer and MNIST, a dataset containing 60,000 training and 10,000 validation images of handwritten digits, to minimize the binary cross-entropy loss function $L(x, \hat{x} = g(f(x)))$. The training is a one-time process and requires only 50 epochs, which take about 20 minutes on an Nvidia RTX A6000 GPU. With respect to the image size of our object, the design of our under-complete autoencoder requires a latent space of significantly smaller dimensionality. Otherwise, the autoencoder would trivially learn the identity function, failing to capture the most salient features of the training data. This raises the question of selecting the optimal latent space dimension. Rather than relying on a heuristic trial-and-error approach, we employ the Implicit Rank-Minimizing Autoencoder (IRMAE) model [69]. The IRMAE includes eight additional linear layers $W_1, W_2, \ldots, W_8$ at the end of the encoder network, which are randomly initialized. Deep linear networks have been shown to induce implicit regularization, leading to low-rank solutions [70]. Hence, we can choose the latent dimension of $h$ to be reasonably large (128 in our case) and let the training process automatically find the lowest rank.
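The role of the extra linear layers can be illustrated with a short NumPy sketch: because the layers contain no nonlinearity, they collapse into a single matrix once training is finished, so they add no cost at inference time. The matrices below are random stand-ins rather than trained weights, and all names are hypothetical; the low-rank effect itself only emerges through gradient-based training and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 128      # latent dimension quoted in the text
num_linear = 8        # number of additional linear layers

# Random stand-ins for the linear layers W_1 ... W_8 (not trained weights).
layers = [rng.normal(scale=latent_dim ** -0.5, size=(latent_dim, latent_dim))
          for _ in range(num_linear)]

def apply_sequentially(h, layers):
    """Pass a latent vector through each linear layer in turn."""
    for W_i in layers:
        h = W_i @ h
    return h

# Collapse the stack into one matrix: W = W_8 @ ... @ W_1.
W = np.eye(latent_dim)
for W_i in layers:
    W = W_i @ W

h = rng.normal(size=latent_dim)
max_diff = np.max(np.abs(apply_sequentially(h, layers) - W @ h))
```

The value of `max_diff` stays at the level of floating-point round-off, confirming that the collapsed matrix and the layer stack are interchangeable.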
In Fig. 3, we evaluate different autoencoder architectures across various scenarios. Fig. 3A illustrates the singular value decomposition on the covariance matrix of MNIST validation examples to assess the effective rank needed for feature representation. Our findings indicate that the rank-minimized autoencoder has an effective rank of 22. This is corroborated by the same rank of the matrix absorbed into the encoder, calculated as $W = \prod_{i=1}^{8} W_i$. In contrast, we find that the singular values for a standard autoencoder, where $W$ is an identity matrix, can only be neglected above a rank of 86. Hence, the rank-minimized autoencoder spans a more compact latent space. Additionally, we examine the impact of simplifying the training set by using only the 982 handwritten '4's from the MNIST dataset, which results in a lower rank of 13. This adjustment in training samples can be viewed as imposing a stronger prior belief about the object in the context of latent vector reconstruction in computational imaging.
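One way to estimate such an effective rank numerically is to count the singular values of the latent covariance matrix that exceed a small fraction of the largest one. The sketch below does this on synthetic latent vectors with a known rank; the relative threshold is an assumption made for illustration, since the criterion used to neglect singular values is not specified in the text.

```python
import numpy as np

def effective_rank(latents, rel_threshold=1e-3):
    """Count covariance singular values above rel_threshold times the largest."""
    cov = np.cov(latents, rowvar=False)          # (dim, dim) covariance matrix
    s = np.linalg.svd(cov, compute_uv=False)     # singular values, descending
    return int(np.sum(s > rel_threshold * s[0]))

# Synthetic latent vectors confined to a 22-dimensional subspace of R^128.
rng = np.random.default_rng(0)
true_rank, latent_dim, n_samples = 22, 128, 2000
basis = rng.normal(size=(true_rank, latent_dim))
latents = rng.normal(size=(n_samples, true_rank)) @ basis

rank = effective_rank(latents)
```

For real encoder outputs the singular value spectrum decays gradually rather than dropping to zero, so the reported rank depends on where the threshold is placed.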
In Fig. 3B, we further explore the properties of the latent space. Specifically, we illustrate latent vector arithmetic through linear interpolation between two latent vectors representing a '9' and a '0'. Using the rank-minimized autoencoder, we observe that the latent space interpolation captures smooth and meaningful transitions between different images, indicating a well-structured feature representation. However, this is less evident in the standard autoencoder, which results in a more ambiguous interpolation between images. In the scenario where the model is trained only on a subset of the MNIST dataset, images that are not part of the training set are not as accurately represented. This is by design, as the deep generative model can be intentionally optimized to reconstruct objects within its trained class, in this case, the digit '4'. This targeted approach offers an advantage for applications where more concrete prior knowledge about the object class is available.
Lastly, we sample images generated from multivariate Gaussian noise input to the decoder in Fig. 3C. Remarkably, images generated using rank-minimized decoders predominantly resemble handwritten digits rather than the random patterns that are observed with the standard decoder. This is particularly advantageous for our ptychography application, as it implies that the reconstruction process is intrinsically guided towards generating images that align with prior knowledge, here the class of handwritten digits. We seek out this property to enhance the robustness of our method, especially when dealing with ill-posed data, by reducing the likelihood of spurious reconstructions. An extended evaluation of the autoencoder's performance across various inputs is available in Supplement 1.

A Experimental Amplitude Transmission Reconstructions
We present the main experimental result of this paper in Fig. 4. We illuminate an amplitude-only sample shaped like the digit '4' and adjust the camera's exposure time over four orders of magnitude. As a result, we acquire sets of diffraction patterns ranging from high SNR to extremely low SNR. These datasets are then used for ptychographic reconstructions. In conventional reconstruction, using a 468 × 468 pixel basis, the object's amplitude transmission function is pristine at the highest SNR but deteriorates to noise at a 30 µs exposure. In contrast, switching to latent space reconstruction with a pre-trained deep generative model significantly improves low-SNR performance. The reduced rank of the latent space, previously shown to be 22, cuts the number of free parameters by approximately 10,000× compared to the conventional reconstruction method. This allows for successful object determination with remarkably fewer photons. Moreover, the model trained specifically on the digit '4' shows more stable convergence and slightly better reconstructions overall.
In the high-SNR scenario, the conventional reconstruction outperforms our latent vector approach in terms of image sharpness. When a sufficient number of detected photons is available, reducing the number of parameters that represent the object offers no advantage. Indeed, this becomes a drawback when the lower-resolution output of our deep generative model is resized to match the higher resolution used in our ADP framework, resulting in limited image sharpness. This can be interpreted as a trade-off for the immense reduction of free parameters.
During the optimization process, we observe that the latent vector reconstruction based on training with the full MNIST dataset can occasionally get stuck in local minima, even when using diffraction data with high SNR. Due to the randomization of the order of the diffraction patterns in the stochastic gradient descent, the first update step of the latent vector can point in a different direction within the loss landscape from one run to the next. The mapping from the latent space to the object space is non-injective, meaning that multiple different latent vectors can map onto the same output. Therefore, the optimization procedure is sensitive to the initial state and the first gradient step in particular. In practice, we find that this sensitivity can be mitigated by initializing the latent vector as the vector average $\bar{h}$ obtained using the pre-trained encoder function $f(x)$ and all training examples $x_\mathrm{train}$, expressed as
$$\bar{h} = \frac{1}{N_x} \sum_{i=1}^{N_x} f(x_{\mathrm{train},i}),$$
where $N_x$ is the total number of training samples, and $f(x_{\mathrm{train},i})$ is the latent vector corresponding to the $i$-th training sample. This initialization approach is practical as the training samples are readily available from the pre-training phase. Moreover, for the case of the deep generative model trained on the filtered MNIST dataset only including '4's, this vector average initialization is not required. In this case, the convergence is consistently stable and fast, and the latent vector can be initialized as random or uniform.
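The vector-average initialization described above amounts to encoding every training sample and averaging the resulting latent vectors. A minimal sketch follows, assuming a hypothetical `encoder` callable that maps a batch of images to a batch of latent vectors; the toy linear encoder and the random images stand in for the trained network and the MNIST data.

```python
import numpy as np

def mean_latent_vector(encoder, x_train, batch_size=256):
    """Average the encoder output over all training samples, batch by batch."""
    total = None
    for start in range(0, len(x_train), batch_size):
        h = np.asarray(encoder(x_train[start:start + batch_size]))
        batch_sum = h.sum(axis=0)
        total = batch_sum if total is None else total + batch_sum
    return total / len(x_train)

# Toy stand-in: a fixed linear projection from 32 x 32 images to 128 latents.
rng = np.random.default_rng(0)
projection = rng.normal(size=(32 * 32, 128))

def toy_encoder(x):
    return x.reshape(len(x), -1) @ projection

x_train = rng.random((1000, 32, 32))
h_init = mean_latent_vector(toy_encoder, x_train)
```

Processing in batches keeps memory use bounded when the training set is large; the result is identical to averaging all latent vectors at once.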

B Numerical Reconstructions and Quantitative Comparisons
To quantitatively assess the noise robustness of our method and its ability to generalize to another object, we simulate ptychographic reconstructions using a known amplitude-only hand-drawn '3' as ground truth. We generate diffraction patterns using the probe and scanning pattern shown in Supplement 1, with an object-camera distance of 8 cm, while all other simulation parameters are consistent with the experimental setup. We vary the total photon count in the illumination field from 10 to $10^6$ photons, assuming a uniform camera readout noise of $\sigma_k = 0.3$ photons, $\forall k$.
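A noisy measurement of this kind can be sketched by scaling a noise-free diffraction pattern to a photon budget, drawing Poisson shot noise, and adding Gaussian readout noise. The sketch below is a simplification with hypothetical names: it normalizes the detected pattern to the photon budget, whereas the text counts photons in the illumination field, so absorption by the object is ignored here.

```python
import numpy as np

def simulate_measurement(intensity, total_photons, sigma=0.3, rng=None):
    """Add Poisson shot noise and Gaussian readout noise to a clean pattern."""
    rng = np.random.default_rng() if rng is None else rng
    # Expected photons per pixel (assumption: budget applies to detected light).
    expected = intensity / intensity.sum() * total_photons
    shot = rng.poisson(expected).astype(float)
    readout = rng.normal(0.0, sigma, size=intensity.shape)
    return shot + readout

# Toy usage: a flat pattern on a 1024 x 1024 sensor with 1000 photons in total,
# which corresponds to about 0.001 photons per pixel on average.
rng = np.random.default_rng(0)
clean = np.ones((1024, 1024))
noisy = simulate_measurement(clean, total_photons=1e3, rng=rng)
```

At such low counts most pixels contain only readout noise, which is why a loss function that models both noise sources matters in this regime.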
To evaluate reconstruction quality in Fig. 5, we use the Peak Signal-to-Noise Ratio (PSNR) between the reconstructed image $x$ and the ground truth $x_\mathrm{gt}$. As the maximum pixel value for our decoder output is equal to one, we can write
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{1}{\mathrm{MSE}(x, x_\mathrm{gt})}\right),$$
where MSE denotes the mean squared error between the two images. In the double-logarithmic plot, a linear relationship between illumination intensity and conventional reconstruction quality suggests a power-law behavior. The curve plateaus at $10^6$ photons, indicating the inverse problem becomes well-posed. This is further supported by the amplitude reconstruction visually nearing its optimum in sharpness and contrast at the highest simulated SNR.
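For reference, the PSNR with a unit peak value can be computed in a few lines of NumPy. The function name and the `max_value` argument are illustrative choices, not part of the original implementation.

```python
import numpy as np

def psnr(x, x_gt, max_value=1.0):
    """Peak signal-to-noise ratio in dB between an image and its ground truth."""
    x = np.asarray(x, dtype=float)
    x_gt = np.asarray(x_gt, dtype=float)
    mse = np.mean((x - x_gt) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy usage: a constant error of 0.1 gives an MSE of 0.01.
reconstruction = np.zeros((32, 32))
ground_truth = np.full((32, 32), 0.1)
score = psnr(reconstruction, ground_truth)
```

Note that the value diverges for a perfect reconstruction (zero MSE), so comparisons are only meaningful for imperfect estimates.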
For the latent vector reconstruction, successful convergence occurs at around $10^3$ photons for both the full and filtered MNIST-trained models. This corresponds to an average photon count of 0.001 per camera pixel. Below this threshold, the PSNR is inflated due to the deep generative model's inability to generate noisy outputs matching the readout noise, leading to spurious correlations with the ground truth. The model trained exclusively on '3's demonstrates stable convergence at marginally lower SNRs, specifically around a few hundred photons in the illumination field.
In summary, our method excels at low SNRs where the conventional method struggles to produce meaningful reconstructions, although it is inherently limited at high SNRs due to the reduced degrees of freedom in the deep generative model. Hence, when sufficient information is available in the diffraction data, the conventional reconstruction method should be preferred.

C Loss Landscapes
Given the compact nature of the latent space, we have the unique opportunity to approximate and visualize the loss landscape that is traversed during optimization (see Fig. 6). Utilizing the two leading orthogonal principal components of the latent space, we construct a three-dimensional representation of the loss landscape. We perform a principal component analysis on the covariance matrix of all latent vectors from the validation MNIST dataset to identify the two most informative directions associated with the two leading singular values, denoted as $v_1$ and $v_2$. Given an optimal latent vector $h_\mathrm{opt}$ obtained from experimental diffraction data, we explore the loss landscape by varying this optimal point along the directions of $v_1$ and $v_2$:
$$h(\alpha, \beta) = h_\mathrm{opt} + \alpha v_1 + \beta v_2.$$
Here, α and β range from -10 to 10 based on the extent of these variations in image space as illustrated in Supplement 1. We then compute the loss value for each $h(\alpha, \beta)$ to visualize the landscape.
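The landscape scan described above can be sketched as a grid evaluation of the loss on the plane spanned by the two principal directions. In the sketch below, `loss_fn` is a hypothetical callable that maps a latent vector to a scalar loss (in practice, the decoder, the forward model, and the data loss combined); a toy quadratic loss and unit vectors stand in for the real quantities.

```python
import numpy as np

def loss_landscape(loss_fn, h_opt, v1, v2, extent=10.0, steps=41):
    """Evaluate loss_fn on the grid h_opt + a * v1 + b * v2."""
    coeffs = np.linspace(-extent, extent, steps)
    surface = np.empty((steps, steps))
    for i, a in enumerate(coeffs):
        for j, b in enumerate(coeffs):
            surface[i, j] = loss_fn(h_opt + a * v1 + b * v2)
    return coeffs, surface

# Toy usage: a quadratic bowl centered at h_opt in a 128-dimensional space.
rng = np.random.default_rng(0)
h_opt = rng.normal(size=128)
v1, v2 = np.eye(128)[0], np.eye(128)[1]      # stand-ins for principal directions

def toy_loss(h):
    return float(np.sum((h - h_opt) ** 2))

coeffs, surface = loss_landscape(toy_loss, h_opt, v1, v2)
```

The resulting `surface` array can be passed to any surface or contour plotting routine. Because only two directions are scanned, the picture is a slice of the full landscape rather than a complete description of it.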
In the high-SNR case with training on the full MNIST dataset (Fig. 6A), the loss landscape exhibits a distinct but non-convex and asymmetric minimum. This topography accounts for the sensitivity to the initial latent vector state. The non-convexity is exacerbated in low-SNR scenarios (Fig. 6B), confirming that convergence is more challenging when the signal is weak. Interestingly, the loss landscape exhibits a smoother and more convex topography when the model is trained exclusively on images resembling the digit '4' (see Fig. 6C). This finding aligns with our previous observation that such specialized training renders the optimization process less prone to getting trapped in local minima and more forgiving of suboptimal latent vector initialization. Finally, we explore the impact of using an alternative loss function based solely on Poisson statistics in Fig. 6D. This Poisson-only loss function reveals a loss landscape with large plateaus and a higher degree of non-convexity for the high-SNR, full MNIST-trained scenario. These characteristics align with our observation that the mixed Poisson-Gaussian loss function generally offers better convergence behavior in comparison.

IV Discussion and Conclusion
We present a novel approach to ptychographic image reconstruction by integrating a deep generative model into a physics-informed Automatic Differentiation Ptychography (ADP) framework. By incorporating prior knowledge about the object class of a specimen, our method significantly reduces the number of free parameters in the optimization problem, enabling robust reconstructions in low signal-to-noise ratio (SNR) scenarios. Since the pre-training of the generative model and the ADP reconstruction are separated, this approach is modular and portable to related AD-based imaging methods. For instance, the deep generative model can seamlessly be exchanged or reused in an adapted physics-informed imaging modality without retraining. As the field of generative artificial intelligence is currently attracting immense research interest, our work presents a straightforward way to incorporate future latent generative models into computational imaging.
One key observation of this paper is the inherent trade-off between noise robustness and the maximum achievable fidelity of the reconstructed image. This limitation is primarily due to the output resolution constraints of the pre-trained decoder. This opens up intriguing avenues for future research, including the exploration of alternative deep generative models for ptychographic reconstruction. While Generative Adversarial Networks (GANs) and latent diffusion models have shown promise in related imaging contexts such as compressed super-resolution imaging through multimode fibers [71] or high-resolution image reconstruction from human brain activity [72], they introduce their own set of challenges. These include increased computational complexity, less interpretable latent spaces [73], and the added model complexity required for capturing high-resolution or complex greyscale features, which could compromise our method's ability to reconstruct from severely ill-posed data. Indeed, architectures like the recently introduced GigaGAN [74] exhibit excellent generative abilities and a controllable latent space, but require up to 4700 days of training on a high-end A100 GPU, while the autoencoder design in this work trains within minutes on any commercial laptop.
Another promising direction for future research in latent vector reconstruction is the extension of the model to output complex-valued images. This would lift the current limitation of representing only amplitude objects, thereby broadening the applicability of our approach. Cherukara et al. have already made strides in this direction, demonstrating ptychographic imaging with a Y-shaped latent model that performs an end-to-end mapping from diffraction patterns to both phase and amplitude images [50]. This is particularly interesting for noise-robust latent vector reconstruction, as the shared latent representation could be leveraged to account for the high correlation typically observed between phase and amplitude in complex objects, akin to the implementation recently shown in [?]. Consequently, even though the output dimensionality would effectively double to accommodate both phase and amplitude, the rank of the latent space may not necessarily need to double, thanks to this inherent correlation.
Our utilization of a compact latent space for object representation provides a unique opportunity to approximate and visualize the optimization loss landscape during ptychographic object retrieval. This offers valuable insights into the convergence behavior and the sensitivity to the initial state of the reconstruction process. However, the mapping from the latent space to the object space is non-injective, and our landscape visualization is therefore localized and truncated, because it omits principal components associated with smaller but still relevant eigenvalues.
Our method's robustness in low-SNR conditions offers valuable applications in both biological and industrial settings where prior knowledge for pre-training is often available. It is particularly well-suited for medical imaging of delicate specimens, where minimizing radiation dose is a priority. The reduced computational complexity, thanks to fewer free parameters and the calibrated illumination field, also makes it ideal for real-time and specialized imaging scenarios like industrial quality control, where quick yet quality-assured reconstructions are desired. Indeed, we occasionally observe high-SNR latent vector reconstructions to converge within a single epoch, as compared to the multiple epochs typically required for conventional reconstructions. Furthermore, our method's noise resilience makes it a promising tool for imaging in photon-starved regimes, such as extreme ultraviolet (EUV) or X-ray applications, where imaging with minimal photon counts is often required.
In conclusion, our work represents a step forward in the field of computational imaging by demonstrating the power of integrating machine learning techniques with physics-based models for robust image reconstruction. As computational imaging continues to evolve, the integration of deep learning models with traditional imaging techniques promises to unlock new capabilities and applications across a wide range of scientific and industrial domains.

Noise-robust latent vector reconstruction in ptychography using deep generative models: Supplement 1
Jacob Seifert, Yifeng Shao, and Allard P. Mosk
This document provides supplementary material to "Noise-robust latent vector reconstruction in ptychography using deep generative models".

I Autoencoder performance across varied inputs
As elaborated in Section 2C of the main document, we explore various autoencoder architectures. In this section, we extend that discussion by examining how these architectures map different types of inputs to outputs. Fig. S1 provides a comprehensive view of the autoencoder's performance on a range of inputs, including both hand-drawn digits and objects not present in the training set.
The standard autoencoder, with its larger rank of 86, offers the most faithful input-to-output mapping. Notably, it generalizes well to unseen objects such as letters and even a hand-drawn smiley. However, in the context of this work, we are primarily interested in deep generative models that can efficiently map to hand-drawn digits, as we have prior knowledge that the objects of interest belong to this specific class. In this regard, the implicit rank-minimized autoencoder proves to be more suitable as it spans a more compact latent space, which is beneficial for the purpose of reconstructing images from ill-posed data.
Additionally, we investigate the impact of training the autoencoder on a filtered dataset, focusing on specific digits like '3' and '4'. We find that such specialized training only maintains the mapping for those specific digits and also tends to "morph" other inputs into resembling these digits. This demonstrates the flexibility in incorporating varying degrees of prior knowledge about the object under study. In our application, this could range from knowing that the object is any hand-drawn digit to knowing that it is a specific digit.

II Principal component analysis of the latent space
To explore the latent space of the trained autoencoder, we aim to identify the two leading principal components. These components will serve as directions along which the latent vectors can be varied to generate visualizations of the loss landscape. First, we encode the set of 10,000 validation MNIST images $x_\mathrm{test}$ using the trained encoder function, obtaining their latent representations $h_\mathrm{test} = f(x_\mathrm{test})$. Next, we calculate the covariance matrix $C$ of the latent vectors. We then perform singular value decomposition (SVD) on the covariance matrix $C$:

$$U, \Sigma, V = \mathrm{SVD}(C). \tag{S2}$$

Here, $V$ contains the eigenvectors of $C$, sorted by their corresponding eigenvalues in descending order. The principal components corresponding to the two largest eigenvalues are then the first and second rows of $V$, denoted as $v_1$ and $v_2$, respectively.
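The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function name `leading_principal_components`, the synthetic stand-in for the encoded latent vectors, and the 1/(N-1) normalization of the covariance matrix are all assumptions made here.

```python
import numpy as np


def leading_principal_components(h, n_components=2):
    """Return the mean latent vector, leading eigenvalues and principal directions.

    h: array of shape (N, d), one latent vector per row.
    """
    h = np.asarray(h, dtype=float)
    h_mean = h.mean(axis=0)
    centered = h - h_mean
    # Sample covariance (d x d); the 1/(N-1) normalization is an assumption.
    C = centered.T @ centered / (h.shape[0] - 1)
    # np.linalg.svd returns (U, S, Vh) with singular values in descending order.
    # C is symmetric positive semi-definite, so the rows of Vh are its eigenvectors.
    _, S, Vh = np.linalg.svd(C)
    return h_mean, S[:n_components], Vh[:n_components]


# Synthetic stand-in for the encoded validation set: 10,000 vectors in 8
# dimensions whose variance is concentrated along the first two axes.
rng = np.random.default_rng(0)
scales = np.array([3.0, 1.0] + [0.1] * 6)
h_test = rng.normal(size=(10_000, 8)) * scales

h_mean, eigvals, (v1, v2) = leading_principal_components(h_test)
```

With real data, `h_test` would hold the encoder outputs for the validation images. Note that `np.linalg.svd` returns the transposed right factor, which is why the principal directions appear as rows.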
To generate new images along these two orthogonal leading directions, we interpolate a new latent vector h as described in Eq. 4 in the main document, where α and β are scalar values that define the extent to which the latent vector is varied along each principal component. By following this procedure, we can explore the latent space along its most informative directions, thereby generating meaningful variations of the original images as illustrated for the average latent vector in Fig. S2A and a latent vector representing a '4' in Fig. S2B.
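As a rough sketch of how such a two-parameter scan could be organized, the snippet below evaluates an arbitrary function on a grid of latent vectors. Eq. 4 is not reproduced in this supplement, so the linear form h = h0 + α v1 + β v2 is an assumption, and `latent_grid` and `toy_loss` are hypothetical names; in practice the function would be the decoder (to generate images) or the reconstruction loss (to map the landscape).

```python
def latent_grid(h0, v1, v2, alphas, betas, fn):
    """Evaluate fn(h) on h = h0 + alpha * v1 + beta * v2 for every (alpha, beta).

    Returns a nested list indexed as grid[beta_index][alpha_index].
    """
    grid = []
    for beta in betas:
        row = []
        for alpha in alphas:
            h = [c + alpha * a + beta * b for c, a, b in zip(h0, v1, v2)]
            row.append(fn(h))
        grid.append(row)
    return grid


# Toy example in a 3-dimensional latent space.
h0 = [0.0, 0.0, 0.0]
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
target = [1.0, -1.0, 0.0]


def toy_loss(h):
    # Placeholder for "decode h, propagate, compare with measured intensities".
    return sum((a - b) ** 2 for a, b in zip(h, target))


steps = [-2.0, -1.0, 0.0, 1.0, 2.0]
landscape = latent_grid(h0, v1, v2, steps, steps, toy_loss)
```

Here the toy loss vanishes at α = 1, β = -1, which corresponds to `landscape[1][3]`.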

III Sample calibration and scanning patterns
In this section, we present key experimental parameters employed in both experimental and numerical studies. Fig. S3A showcases a photograph of the photomask that serves as the binary amplitude object. This photomask features rows of binary hand-drawn objects akin to those previously depicted in Fig. S1. Each row is successively scaled down by a factor of three with respect to the previous row. Specifically, we focus on the '4' in the third row, which has an object size of approximately 1 mm². Fig. S3B and C display the calibrated illumination field and the scanning pattern, respectively. The scanning pattern is a high-overlap trajectory comprising 96 points arranged in a Fermat spiral shape, used solely for calibrating the illumination field, scanning points, and object-camera distance. To assess the noise-robustness of our latent vector reconstruction approach, we utilize only the 16 orange, uniformly distributed scanning points within this pattern.
Finally, Fig. S4 illustrates the normalized illumination field and scanning pattern utilized in our numerical simulations. In this case, we adopt Poisson disk sampling to generate the scanning trajectory. Much like the Fermat spiral, this choice is well-suited for ptychography as it yields an arbitrary, nonregular grid while maintaining uniform distances between neighboring points [75]. In these simulations, the magnitudes of the illumination fields are rescaled to achieve a total photon count ranging from 10 to 10⁶ photons.
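For illustration, a Fermat spiral with 96 positions can be generated as follows. This is a generic sketch: the `spacing` parameter, its unit, and the use of the golden angle between successive points are assumptions, and the selection of the 16-point subset is not modeled.

```python
import math


def fermat_spiral(n_points, spacing):
    """Return n_points (x, y) positions on a Fermat spiral.

    Point n sits at radius spacing * sqrt(n) and angle n * golden_angle.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))  # about 137.5 degrees
    points = []
    for n in range(n_points):
        r = spacing * math.sqrt(n)
        theta = n * golden_angle
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


# 96 scan positions, in arbitrary length units.
scan = fermat_spiral(96, spacing=1.0)
```

Because the radius grows with the square root of the index, the points cover the scan area with approximately constant density.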
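The two steps of this paragraph can be sketched in plain Python. Both functions are illustrative assumptions rather than the authors' implementation: `poisson_disk_points` uses naive dart throwing (not a grid-accelerated algorithm), and `rescale_to_photon_count` assumes that the total photon count equals the summed squared magnitude of the field.

```python
import random


def poisson_disk_points(width, height, min_dist, n_points, max_tries=10_000, seed=0):
    """Naive dart throwing: accept a random candidate only if it keeps
    at least min_dist from every point accepted so far."""
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n_points and tries < max_tries:
        tries += 1
        x, y = rng.uniform(0, width), rng.uniform(0, height)
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist ** 2 for px, py in points):
            points.append((x, y))
    return points


def rescale_to_photon_count(amplitudes, total_photons):
    """Scale field magnitudes so the summed intensity equals total_photons."""
    current = sum(a ** 2 for a in amplitudes)
    factor = (total_photons / current) ** 0.5
    return [a * factor for a in amplitudes]


# 16 scan positions in a 10 x 10 area (arbitrary units) and a toy 4-pixel field.
scan = poisson_disk_points(width=10.0, height=10.0, min_dist=1.5, n_points=16)
probe = rescale_to_photon_count([0.2, 1.0, 0.5, 0.1], total_photons=1e6)
```

The minimum-distance constraint is what produces the irregular yet evenly spread pattern described above.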

Figure 1. (A) Schematic of the optical setup used for ptychography. A 500-µm pinhole is illuminated with coherent light at λ = 561 nm and relayed onto the object using a 2-lens system. The object is moved laterally through the beam using a computer-controlled XY stage, and a CMOS camera sensor records the diffraction intensities 6.5 cm downstream from the object. (B) Diagram of the Automatic Differentiation Ptychography (ADP) framework, which models the physical system beginning from the object illumination. In the conventional mode, the object is represented by complex-valued pixels. With a pre-trained autoencoder for a specific class of objects, the decoder can be integrated into the ADP framework as a deep generative model, allowing the object to be represented as a latent vector and significantly reducing the number of free parameters.

Figure 2. Illustration of the autoencoder architecture, detailing output dimensions, activation functions, and layer types. (A) Schematic of the encoder model, designed to map input data into a lower-dimensional latent space. Convolutional layers employ 4 × 4 kernels and a stride of 2, halving the dimensions at each Conv2D layer, and are followed by rectified linear unit (ReLU) activation functions. Linear layers upstream of the latent vector facilitate rank-minimization. (B) Schematic of the decoder model, tasked with reconstructing the original input from the latent representation. Post-training on MNIST, the decoder serves as a deep generative model. The final sigmoid activation function constrains the output to the range [0, 1], making it well-suited for amplitude transmission functions in ptychography.
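The halving of the spatial dimensions follows from the standard convolution output-size formula. The short check below assumes a padding of 1 pixel and shows two layers purely for illustration; neither the padding nor the number of Conv2D layers is stated in the caption.

```python
def conv_output_size(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a 2D convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28  # MNIST images are 28 x 28 pixels
sizes = [size]
for _ in range(2):  # two Conv2D layers, chosen here only for illustration
    size = conv_output_size(size)
    sizes.append(size)
```

Under these assumptions the side length shrinks from 28 to 14 to 7 pixels.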

Figure 3. (A) Singular values of the covariance matrix for latent vectors obtained from MNIST validation examples, revealing the effective rank of the feature representation. Different autoencoder designs and training sets are indicated by color (blue/green: implicit rank-minimized autoencoder; orange: standard autoencoder; blue/orange: full MNIST dataset; green: MNIST dataset filtered to include only '4's). (B) Linear interpolation in the latent space between two objects, illustrating the well-structured feature representation. (C) Images generated from multivariate Gaussian noise input to the decoder, highlighting the decoder's capability to produce meaningful handwritten digits. The color coding for panels (B) and (C) follows the legend provided in panel (A).

Figure 4. Comparison of ptychographic amplitude image reconstruction results under varying signal-to-noise ratios (SNR). The top row displays stacks of diffraction patterns used for reconstruction, with exposure times decreasing from left to right, leading to a corresponding decrease in SNR. The second row presents results from conventional reconstruction. Subsequent rows feature latent vector reconstructions using a pre-trained deep generative model, first trained on the full MNIST dataset and second on a filtered MNIST dataset containing only images resembling the digit '4'.

Figure 5. Comparison of reconstruction quality for conventional and latent vector ptychographic methods across varying signal-to-noise ratios (SNRs), obtained from numerical simulations. The Peak Signal-to-Noise Ratio (PSNR) serves as the quality metric and is plotted against the total photon count in the illumination field. Selected object amplitude reconstructions are displayed for various photon counts to highlight the tradeoffs between the methods. A latent vector reconstruction using a deep generative model trained on filtered data is also included for comparison.
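For reference, the PSNR between a ground-truth amplitude and a reconstruction can be computed as in the sketch below. It uses the common definition 10 log10(MAX² / MSE); the peak value of 1.0 is an assumption based on amplitudes lying in [0, 1], and the exact convention used for the figure is not stated here.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given as flat sequences of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example with a mean squared error of 0.005.
ground_truth = [0.0, 1.0, 1.0, 0.0]
reconstruction = [0.1, 0.9, 1.0, 0.0]
quality_db = psnr(ground_truth, reconstruction)
```

For this toy example the result is about 23 dB; identical images yield an infinite PSNR.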

Figure 6. Visualization of the optimization loss landscapes for different scenarios. α and β are coefficients for the two leading principal components of the latent space used for ptychographic reconstruction. (A) The landscape for high signal-to-noise ratio (SNR) and training on the full MNIST dataset. (B) The landscape when reconstructing from low-SNR (high-noise) diffraction data. (C) The landscape after training the deep generative model on a filtered MNIST dataset containing only the digit '4'. (D) The landscape when optimizing using a Poisson-only loss function at high SNR, for comparison with the mixed Poisson-Gaussian loss from all other panels.

Funding: Netherlands Organization for Scientific Research NWO (Perspective P16-08).

Disclosures: The authors declare no conflicts of interest.

Figure S1. Performance comparison of various autoencoder architectures using ten hand-drawn images as inputs. The figure illustrates the capability of each model to faithfully reconstruct or generalize to these images, highlighting the trade-offs between standard and rank-minimized autoencoders, and the utilization of filtered training data.

Figure S2. Exploration of the latent space along the most informative principal components. (A) Generated images obtained by varying the average latent vector along the two leading principal components. (B) Generated images obtained by varying a latent vector representing the digit '4' along the same principal components.

Figure S3. (A) Photograph of the photomask used as the binary amplitude object. The object consists of rows of hand-drawn binary shapes, each row scaled down by a factor of three relative to the previous row. The '4' in the third row, which is used in this work, has an object size of approximately 1 mm². (B) Calibrated illumination field, obtained through a high-overlap ptychographic calibration reconstruction. The total photon count is approximately 950 × 10⁶ photons. (C) Scanning pattern used for calibration and noise-robustness evaluation reconstructions.

Figure S4. (A) Illumination field used in the numerical simulations. (B) Scanning pattern generated through Poisson disk sampling.

1 Nanophotonics, Debye Institute for Nanomaterials Science and Centre for Extreme Matter and Emergent Phenomena, Utrecht University, P.O. Box 80000, 3508 TA Utrecht, The Netherlands
2 Imaging Physics Department, Applied Science Faculty, Delft University of Technology, The Netherlands