VIPR: Vectorial Implementation of Phase Retrieval for fast and accurate microscopic pixel-wise pupil estimation

In microscopy, proper modeling of the image formation has a substantial effect on the precision and accuracy in localization experiments and facilitates the correction of aberrations in adaptive optics experiments. The observed images are subject to polarization effects, refractive index variations and system specific constraints. Previously reported techniques have addressed these challenges by using complicated calibration samples, computationally heavy numerical algorithms, and various mathematical simplifications. In this work, we present a phase retrieval approach based on an analytical derivation of the vectorial diffraction model. Our method produces an accurate estimate of the system phase information (without any prior knowledge) in under a minute.


Introduction
Attaining phase information in diffraction experiments is of great practical interest; however, intensity measurements, i.e. images, do not explicitly capture complex values. This hidden information can be estimated by Phase Retrieval (PR) [1] -namely, recovery of a complex signal from intensity measurements of its Fourier transform, which has therefore been employed in a wide variety of applications, e.g. X-Ray crystallography [2], astronomy [3], optical ptychography [4,5], adaptive optics [6,7], and the subject of this work: localization microscopy [8,9]. In a typical 2D localization microscopy experiment, the localization step is performed by fitting images of well-separated emitters with a known Point-Spread Function (PSF), e.g. an Airy disc.
The reliance on precise knowledge of the PSF is highlighted in single-molecule localization microscopy (SMLM), where the Signal-to-Noise Ratio (SNR) is limited. Acquiring additional information can be achieved by PSF engineering, where the PSF shape is intentionally perturbed to encode the desired information. This approach has been used for 3D measurements, e.g. astigmatism [10], Tetrapod [11] and Double Helix (DH) [12] PSFs; encoding emitter wavelength [13][14][15] and molecular orientation [16]. A closely related application is adaptive optics, which employs additional optical elements to counteract aberrations present in the imaging system that distort the PSF. In all mentioned cases, PR can be used to correct the theoretically simulated PSFs or to design the phase mask which creates them [17].
In PR, the complex signal is numerically estimated from the magnitude of its Fourier transform. Any employed algorithm requires the nonobvious selection of a penalty function, and must overcome degeneracies in the solution, unstable derivatives, non-convex optimization, and more [18]. So far, PR methods have made use of the Iterative Gerchberg-Saxton(GS) Algorithm [19] and its variations [20], estimation over the Zernike basis of aberrations [21][22][23], and pixel-wise numerical gradient calculations [24]. The first consideration of these algorithms is the tradeoff between computational efficiency and model accuracy; namely, a limited number of parameters must be chosen to describe the phase mask and various model approximations are employed. To reduce computational burden, the mentioned discrete approaches employ coarse graining in the physical space, while the methods that use a polynomial basis, e.g. Zernike polynomials, reduce the number of parameters by modeling only the most common optical aberrations.
A second consideration is the selection of an optical model. Thus far, most methods employ the scalar diffraction model [21,25]. The scalar model is more computationally efficient to simulate than the vectorial model; however, this approximation deteriorates for emitters near the coverslip, precisely where most calibration measurements are performed. While in some cases, PR can be performed within a biological sample (away from the coverslip), which would make the scalar model appropriate [26], it is often impractical.
In this work, we present a PR technique based on an analytical derivative of the vectorial diffraction model that converges to a robust and accurate solution in a pixel-wise manner. Our implementation is very fast thanks to the GPU-optimized implementation using Fast Fourier Transforms (FFT). We provide a flexible software, suitable for both freely rotating emitters and fixed dipoles [27]. Furthermore, our method requires no priors on the pupil plane phase pattern (making it especially suitable for adaptive optics) and, intriguingly, also corrects for experimentally-observed effects, e.g. axial-position dependent lateral shift (also known as wobble), removing the need to calibrate and correct data in post-processing [28]. Our approach is designed to retrieve the phase information from a simple calibration measurement: images of a single bright emitter over a range of focal positions. The retrieved phase can then be applied to volumetric samples, e.g. 3D imaging of mitochondria and microtubule blinking data in live cell imaging.

Optical models
An illustration of the optical setup is shown in Fig. 1(a). Briefly, the imaging system consists of a Nikon Eclipse-Ti inverted fluorescence microscope with a Plan Apo 100X/1.45 NA Nikon objective. The emission path was extended with two achromatic doublet lenses (focal length 15 cm) forming a 4-f system. A Liquid Crystal Spatial Light Modulator (LC-SLM) is placed in the conjugate back focal plane (Meadowlark 1920X1152 Liquid Crystal on Silicon) with an EMCCD camera (iXon 897 Life, Andor) at the image plane. The calibration sample consists of a water-covered glass coverslip (Fisher Scientific) with 40 nm fluorescent beads adhered to the surface (FluoSpheres (580/605), ThermoFisher), which is illuminated with 50 , 560 nm light (iChrome MLE, Toptica).
Modeling the 3D PSF of the microscope can be done in various methods. A frequently used model is known as the scalar approximation. This model presupposes that the emission pattern of each point source is spherical. This is a good approximation when the objective NA is low, the emitter is rotationally mobile [29], and particularly when imaging emitters far from optical interfaces (beyond near-field effects). Another crucial difference between the vectorial and scalar models is that volumetric imaging far from the coverslip (beyond ~1 wavelength), has an effective NA that is limited by the sample's refractive index rather than the objective NA. The main advantage of the scalar model is the reduced computational complexity in numerical simulations. Our method implements the more complex vectorial model (in addition to the scalar model) to account for the mentioned effects.
In this section, we discuss freely-rotating emitters for both the scalar and vectorial models. Here, the intensity pattern ( , ; , )  I x y q of a quasi-monochromatic source at the image plane is given by [30][31][32] : where encompasses the optical parameters required to model the imaging system, including the phase mask, refractive indices, objective parameters, and more. for the full vectorial description and 1, 2, 3  w or 4, 5, 6  w for the vectorial model with a linear polarizer, relevant for polarization dependent LC-SLMs. B denotes the optical blurring (estimated by a Gaussian kernel with a spatial standard deviation of  B ). In some models [30,33], an additional filter is added which describes the 3D distribution of emitters for where 2 D  denotes the 2D Fourier transform operator and , bfp w E represents the electrical field components at the back focal plane given by: , bfp w g and are the Green's tensor components and back focal plane support (aperture), respectively (both are defined in Appendix A). is the phase term defined by: where   ( , ) P x y is the phase mask that we retrieve and ( ) l q are the elements of  q . We assume that the phase mask is a single matrix for all the electric field components, i.e. there are no birefringent effects in the back focal plane and we neglect any axial aberrations (which cannot be described by a single phase mask). k is the free-space wave number, and   ( , )  l x y are known phase terms that are associated with the  q coordinates (Appendix A). In the next sections, we assume that the optical system is fixed and that we measure a pixelated version of the intensity ( , ; , ) 

Cost functions
The next step in optimization is introducing the observed z-stack. To do so, a cost function which penalizes the difference between the simulation and measurements is required. Such cost functions can serve in both the PR step and subsequent localization steps. All of the considered cost functions are of the form Q, e.g. in the typical case of PR from a z-stack of a bead on the coverslip surface, these correspond to different values of . Common examples of cost functions include the L1 [34] and L2 [24] norms. Other cost functions employ statistical estimation, e.g. are derived assuming Poisson distribution and thus define a negative log-likelihood cost [11,25,35,36]: We assume that the intensity distribution ( | , )   i I n P q is governed by Poisson statistics with a large mean value, thus a Gaussian approximation is valid (by the law of large numbers) in good calibration samples. In many cases, an additive background term is introduced to the model PSF to better match experimental data [25]. This background noise ( ) bg n can be approximated to be white noise, independent from the emitter, with variance of I n P q bg n I n P q I n P q . (6) This distribution defines a negative log likelihood cost function: Z n I n P q C I n P q Z n log I n P q NQ I n P q . (7) As can be seen in Eq. (7), the resulting cost function is a weighted L2 norm added to a logarithmic term. Unlike , comes from a continuous distribution, which is better for derivative calculations. A comparison between the two latter costs is given in Appendix B in terms of their applicability and performance in localization.

Analytical gradient for Phase Retrieval
Phase retrieval can be viewed as a nonconvex optimization problem [37]. In general, such optimization problems pose a challenge as they can have spurious local minimizers, saddle points, and significant sensitivity to good initialization; However, it has been shown that PR can achieve convergence using Stochastic Gradient Descent (SGD) [18].
To facilitate a fast gradient computation, we analytically derive the gradient of the cost function with respect to the pixelated phase mask ( ), 1, 2..., .  P m m M In this section we describe the general derivation for dipoles with full rotational mobility (the fixed dipole case is derived in Appendix E). For simplicity, we derive the gradient for a single position ). The first derivative in the chain rule (for the form of cost functions defined in section 2.2) gives: n P q f I n Z n C g n P q B P m N P m I n (11) By zero padding Fourier space such that , the matrix multiplication in Eq. (11) gives a Fourier transform over n (using Eq. (10)): , , Using Eq. (12), we calculate a single iteration using a few Fourier transforms: the full vectorial model ( 1, 2,..., 6  w ) requires 12 FFT calculations of size per z position: six to generate the image and six to generate the gradient; critically, a pixel-wise numerical derivative would require 6 1 FFT calculations of size per z position (six FFT calculation for a single direction derivative at any pixel and another six to generate the initial image).

Phase mask optimization workflow
Phase retrieval can be used either for recovering the pupil plane from measured PSFs, or for designing a new phase mask which will produce desired PSFs. Both of these applications amount to very similar procedures. In this section we describe the optimization flow which generates an estimated phase mask from the measured 3D PSF using the previously derived gradient. For mask design, the difference is that no denoising or experimental blur correction are required. The first step in the optimization procedure requires the input of a z-stack ( ) i Z n of length Q with associated coordinates  i q . Note that the PR result is the estimation of the phase in the back focal plane, assuming known coordinates .  i q Any mismatch between experiment and assumed coordinates will affect the result. For example, an inaccurate axial position input is compensated by an added defocus correction in the phase mask. Optimally, as many images would be used as possible within the axial range of interest, considering practical limitations, e.g. stage resolution and photobleaching. Next, the background is estimated from the corners of the images to subtract the mean and estimate the noise standard deviation (   i ).
We initialize the phase mask with an analytical gradient similar to [18]. This improves the convergence rate, because at this point, the simulated images (standard PSF) and measured images are not necessarily correlated. iterations. The blur kernel estimation is done using the MATLAB routine fmincon. The second option is performing 2 K iterations with a variant of directed sampling [40,41] where the side information is the correlation of the current model with the measured image stack. This can improve results when the used phase mask converges poorly at some focal positions; in this work, we found this step necessary only for the misaligned Tetrapod (Fig. 4. and Appendix D). Future work can include optimization of this final step based on recent advances in adaptive sampling [42], which could further improve convergence.
Importantly, this method is very fast, thanks to the FFT implementation (Eq. (12)): Table  1. Shows the average time required to calculate a single gradient as a function of number of pixels (binning) in the phase mask. The algorithm is orders of magnitudes faster than numerical pixel-wise gradient calculation [24], and much faster than fitting Zernike polynomials, which can take ~30 minutes when implemented using the MATLAB function fminunc [34]. The results were timed on a laptop with INTEL i7-8750H CPU and NVIDIA GeForce RTX 2070 GPU. Typical pupil plane recovery (vectorial model with a linear polarizer) from a fluorescent bead z-stack requires 300 iterations over 44000 pixels, using a batch size of 3 z positions, and takes 31.5 seconds on a GPU (Table 1). An equivalent numeric pixel-wise gradient based calculation takes ~9 days (~25,000 times slower).

Comparison with existing methods
For benchmarking, we compare our vectorial PR (VIPR) to two alternatives: (1) analytically optimizing (with our method) the scalar diffraction model and (2) adding Zernike polynomials to the initially designed phase mask using the vectorial model [21,34]. Note that using Zernike polynomials requires an additional step to address observed wobble, by fitting the z-stack with the initial guessed mask to center it; this only works when the initial guess is close to the actual mask. Using the described methods, we retrieved phase masks from a z-stacks using a 4 Tetrapod phase mask [11,43], with one image per NFP (estimated image sum (Signal) of 3.5 • 10 and background~(870,8100)  ). To quantify the performance, we analyzed z-stacks with 10 images per focal position. Robustness of our approach to image noise was tested by computationally degrading the images with a Gaussian-distributed noise (std equal to a multiplier scalar times the estimated std of the noise (Fig. 2(b-d)).
While our method deteriorates in precision with added noise, the other methods, interestingly, initially improve after the addition of some noise; this is likely due to our method capturing the subtle details of measured intensity patterns and thus avoids bad localizations at specific axial positions (Appendix C). To test the axial localization bias, i.e. the z accuracy, and the wobble behavior we evaluated the performance of the methods on a short z-stack (bounded by the repeatability of the stage in the two measurements, PR accuracy and noise in both zstacks). We found that our method incorporates the wobble better and has minimal axial bias (see Appendix C).  (24) of (e) a full axial range acquisition using each of the retrieved phase masks. The orange outline shows a single PSF and MLE fits using each of the phase masks.

Generality to phase patterns
The benefit of pixel-wise methods is their broad applicability to model any aberration, e.g. nonsmooth ones, which require many Zernike modes to describe. To test the generality of our method, we imaged z-stacks with 4 different phase mask using the LC-SLM. The first mask consists of a randomly generated combination of low amplitude Zernike polynomials to test the applicability of the method for adaptive optics (Fig. 3(a)). The second case shows the machine learned phase mask which was recently demonstrated to optimally measure dense fields of emitters in 3D (Fig. 3 (b)) [44]. The third test (Appendix D) deals with severe misalignments in the Tetrapod phase mask, which has a large area of unmodulated pixels. Here, for comparison with typically-used algorithms, we also show the results of using centered Zernike polynomials for fitting such an extreme case. The results for the Double-Helix phase mask, which comes from optimization of Gauss-Laguerre basis functions and is commonly used in PSF engineering [45] are also shown in Appendix D.

Experimental demonstrations of STORM data
Direct stochastic optical reconstruction microscopy (dSTORM) is a technique in which a superresolution image is reconstructed from localization of sparse fluorescent molecules that are switched on and off (blinking) using high power laser [46,47]. Both experiments (section 4 and Appendix F) were performed using a dielectric 4 Tetrapod phase mask, which exhibits better photon efficiency than the LC-SLM polarization dependent implementation. PR was performed in a similar way to the LC-SLM data (by removing the polarizer). Appendix F shows a reconstruction of fluorescently labeled mitochondria in COS7 cells, using the deep-learning network which trains on simulations only, adapted from [44]. This demonstrates the viability of our in-vitro method in predicting in-vivo PSFs.
To show the robustness of our method to suboptimal situations, we used a misaligned 4 dielectric Tetrapod phase mask to image microtubule blinking data. We show in Fig. 4. the applicability of our method in recovering unusual aberrations, which are otherwise challenging to model. The use of image-based interpolation methods can also address these issues; However, such methods need a good volumetric calibration [48] and are challenging to deploy on a GPU as it requires much more parameters to model the PSF [22]. The PR predictions (using the vectorial model) were used to localize 10K frame of cell data (Fig. 4(a)). Using the localization and an axial range of 25 around the desired positions, we created mean experimental PSFs (Fig. 4(c)) to show their matching with our model.

Conclusion
In this work, we have developed and demonstrated an accurate and fast optimization method for PR and PSF design. Our method estimates the optical transfer function in a microscope experiment and provides with an accurate imaging model based on Fourier optics. The optimization uses only basic knowledge of the optical system and a target image set (simulated or measured), with no priors on the phase mask. The derivation is general and can be adopted for many microscopy applications, sample variation (fixed or freely rotating dipoles) and optical processing methodologies (e.g. a polarized LC-SLM, a dielectric phase mask, deformable mirror can be directly modeled with their real pixel size and optical constraints). The performance of VIPR is mainly limited by SNR, the used approximations (e.g. numerical pixelation, aplanatic performance, axially invariant NA, etc.), number of measurements, and accuracy in the knowledge of the optical parameters. The pixel-wise derivation can be adopted to many applications by simply changing the cost function and optimization constraints. Furthermore, this approach can be applied to adaptive optics due to the versatility and speed of our implementation.

Appendix A
In this section we describe the analytical (far-field) electrical field derivation at the back focal plane which is required in Eq.  in a sample with refractive index m n , placed above an oil immersion objective with refractive index imm n .The emission from the source can be described by the far-field Green tensor 0 ( , ) ff G r r [31,32,49], which links between the 3 polarizations of the emitter to the electrical field generated at the far field position ( , , ) r x y z : where ( , )   denote the angular coordinates The propagation path is defined by the refractive indices and the propagation angles in the sample medium  m and immersion oil  imm (connected by Snell's law). The coefficients ( ) , 1,2,3 The support of the Fourier plane,   ( , ) S x y (Eq. (3)), in such systems is usually limited by the objective's aperture given by Abbe's sine rule. However, a subtle yet important phenomena can be observed here, usually referred to as Super-Critical Angle Fluorescence (SAF). When the objective's NA is higher than the refractive index of the sample media, a range of  imm propagation angles is available between the critical angle and the NA where cos( )  m becomes imaginary. In PR techniques such as this, the molecules are bound to the coverslip and thus the SAF light is measurable. The last component needed to describe the electrical field at the back focal plane, is the ray rotation matrix which is induced by the objective, assuming a perfect objective lens (full transmission): Finally, the electrical field in the back focal plane as we define in Eq. (3) are given by: Eq. (18) describes the case of fixed dipoles. Simulating the freely rotating case is simply achieved by super-position (Eq. (2)).
In the case of the scalar approximation, there is only a single component of the electrical field and all the polarization effects are ignored. To relate between the 2 models, the scalar model can be described by .

Appendix B
To compare between the 2 MLE cost functions (Poisson and Gaussian) we used the retrieved Tetrapod phase mask ( Fig. 2(a)) to localize the image stack. (Fig. 5(c)). Evaluation of the robustness to noise was achieved by adding artificial noise (similar to section 2.2). We found that the Poisson MLE deteriorates faster than the Gaussian MLE under the addition of artificial noise ( Fig. 5(a)) and even under photo-bleaching, which can be seen in Fig. 5(b) where the Poisson MLE has a trend in the precision over the course of a single measurement.  Figure  6. describes the results of localizing a short z-stack (a second z-stack taken on the calibration bead) with the 3 retrieved phase mask from the described methods in section 2.4. A short zstack will endure less optical drift and can be used to estimate the accuracy of the PR method. The results are bound by the precision of the stage and the noise in the two experiments (the zstack which was used for PR and the reconstruction one). Fig. 6(a) shows the difference in the axial position estimation (with MLE fitting) and the stage read where the sole difference is the phase mask. Fig. 6(b-c) shows that the wobble was incorporated into the phase mask using our method (with the vectorial and scalar models) compared to the Zernike PR (green) which shows wobble. Figure 7. describes the localization results of Fig 2. (b) (without additional noise). From these results, we observe that even a small mismatch in the PSF can lead to large deviations in localization precision at certain positions. These are the positions in which the SAF light produces a sharp point (as it is unmodulated by the phase mask). VIPR incorporates SAF into the derivation and thus suffers less precision penalty in this region.

Appendix D
In this appendix we demonstrate further generality results, in continuation to section 3.2. Fig.  9 shows the retrieval results on a 4 Tetrapod phase mask which is severely misaligned as described in section 3.2. Figure 9 shows the retrieval results on a DH phase mask, similar to Fig 2. (a). We note that while the DH is not based on Zernike polynomials like the Tetrapod mask, it can still be corrected using Zernike polynomials if we have the prior knowledge of the theoretical phase mask.   g n P q I n P q g n P q P m P m g n P q g n P q B P m .
All the subsequent derivatives and the Fourier transform representations are the same as in the freely rotating case. In Fig. 10, we show the theoretical capabilities of this method. We simulated a z-stack of a fixed dipole (1, 1,1)

  
 with a linear polarizer and a Tetrapod phase mask. The proposed gradient in this section was used to recover the phase mask.

Appendix F
In this appendix, we describe the sample (section 4) preparation and provide also with a demonstration of VIPR when used on an aligned 4 Tetrapod phase mask. Sample preparation included cleaning 22X22 mm, 170 thick coverslips in an ultrasonic bath with 5% Decon90 at 60 o C for 30 min. Next, the coverslips were washed with water then incubated in ethanol absolute for 30 min and finally sterilized with 70% filtered ethanol for 30 min. COS7 cells were grown for 24 hours on the coverslips in 6-well plate in phenol red free Dulbecco's Modified Eagle Medium (DMEM) With 1g/l D-Glucose (Low Glucose) supplemented with 10% Fetal bovine serum, 100 U/ml penicillin 100 ug/ml streptomycin and 2 mM glutamine, at 37 o C, and 5% CO2. The cells were fixed with 4% paraformaldehyde and 0.2% glutaraldehyde in PBS, pH 6.2, for 45 min, washed and incubated in 0.3M glycine/PBS solution for 10 minutes. The coverslips were transferred into a clean 6-well plate and incubated in a blocking solution (10% goat serum, 3% BSA, 2.2% glycine, and 0.1% Triton-X in PBS, filtered with 0.45 PVDF filter unit, Millex) for 2 hours at 4 o C. The cells were then immunostained overnight at 4 o C with either anti TOMM20-AF647 (Fig. 11.) antibody (Abcam , ab209606) or anti-Tubulin-AF647 (section 4) antibody (Abcam, ab190573) diluted 1:230 in the blocking buffer, and washed X5 with PBS.
For STORM, a PDMS chamber was attached to the glass coverslip containing fixed immunostained COS7 cells and placed in the microscope setup described in Fig. 1(a). The Tetrapod phase mask, the second row shows the reconstructed PSFs associated with the reconstructed phase mask on the left. added blinking buffer was freshly made from 100 mM β-mercaptoethylamine hydrochloride, 20% sodium lactate, and 3% OxyFluor (Sigma, SAE0059), modified from [51]. A glass coverslip was placed on top to prevent evaporation. The sample was illuminated by a highintensity 638nm 2000mW red dot laser module which passed through a 25 µm pinhole (Thorlabs). Emission light was filtered through a 500 nm Long pass dichroic and a 650 nm long pass (Chroma). 10K images were recorded on a sCMOS camera (Prime95B, Photometrics) instead of the EMCCD to reduce pixel size and enable a large field of view.