Information Transfer as a Framework for Optimized Phase Imaging

In order to efficiently image a non-absorbing sample (a phase object), dedicated phase contrast optics are required. Typically, these optics are designed with the assumption that the sample is weakly scattering, implying a linear relation between a sample's phase and its transmission function. In the strongly scattering, non-linear case, the standard optics are ineffective and the transfer functions used to characterize them are uninformative. We use the Fisher Information (FI) to assess the efficiency of various phase imaging schemes and to calculate an Information Transfer Function (ITF). We show that a generalized version of Zernike phase contrast is efficient given sufficient foreknowledge of the sample. We show that with no foreknowledge, a random sensing measurement yields a significant fraction of the available information. Finally, we introduce a generalized approach to common path interferometry which can be optimized to prioritize sensitivity to particular sample features. Each of these measurements can be performed using Fourier lenses and phase masks.


Introduction
In a phase contrast microscope, transparent objects are imaged using optics which convert phase variations to amplitude variations. This modality is important for visible light (notably, biological samples), x-rays [1][2][3], and electrons [4]. While some phase contrast is intrinsic in systems with a limited numerical aperture (NA) [5] and more can be generated by adding defocus [6], much of the information about the sample phase shift can only be accessed with dedicated optics. Zernike developed the first method for optically-generated phase contrast using a phase-shifting filter in the backfocal plane of the objective lens (or some conjugate plane) [7,8]. Zernike phase contrast (ZPC) is particularly effective for imaging weak phase objects (WPOs), which have transmission functions close to unity. A probe passed through a WPO will retain a strong undiffracted component which can be used as an interferometric reference. Many phase contrast applications involve phase objects for which the WPO approximation (WPOA) is dubious [9,10]. For these applications ZPC is only partially effective. When the undiffracted component of the beam is entirely depleted, for example due to a strongly scattering sample matrix, then ZPC produces no contrast at all. Some common phase contrast methods are compatible with strongly scattering samples, for example the class of schemes sensitive to phase gradients which include Differential Interference Contrast [11], Hoffmann Modulation Contrast [12], and Spiral Phase Contrast [13]. However these techniques are insensitive to low spatial frequency features, making them sub-optimal for some measurements. When it is possible to establish a reference channel which circumvents the sample then Quantitative Phase Contrast [14,15] and various versions of holography [16][17][18] are possible. These generate some contrast for phase objects of any strength, and their limitations are not as obvious.
Comparing the effectiveness of these methods is especially difficult outside of the WPOA. In the strong scattering regime the imaging process remains linear with respect to the sample transmission function but becomes non-linear with respect to the sample phase. As a result of this non-linearity, the performance of the imaging system will depend on the joint properties of the optics and the particular sample. One example of an attempt to move beyond the WPOA is Generalized Phase Contrast (GPC) [19]. GPC, like ZPC, uses the undiffracted probe component as a reference wave. The relative phase and extinction applied to the reference wave can be optimized based on foreknowledge about the sample to maximize the visibility (contrast) or peak irradiance or to establish an unambiguous phase-to-intensity mapping. While GPC avoids invoking the WPOA, it still relies on a strong undiffracted component in the exit wavefunction. To form an even more general theory of phase imaging we must consider a wider class of measurements.
To this end, we recast the imaging process as a many-parameter estimation problem and employ the Fisher Information (FI) to optimize it. While the FI is a prominent tool for experimental design, especially in the field of optics, it is usually applied to optimize measurements of one or a few image parameters. To apply it in a more general imaging scenario with $n$ unknown parameters we must compute the $n^2$ elements of the FI matrix (FIM) describing all of the parameters and their correlations. For even modestly sized images ($n \gtrsim 100$) the FIM is expensive to calculate, let alone optimize over all possible measurements. As the optimum may depend on the particular sample, it will be critical to develop efficient heuristics rather than to attempt an explicit optimization for each measurement. Implementing the measurements will require programmable optics. Such technology is available for optical microscopy (e.g. spatial light modulators) and is newly emerging for electron microscopy [20].
In the field of quantum metrology, the FI maximized over all measurements permitted by quantum mechanics is called the Quantum Fisher Information (QFI, [21][22][23]). The measurement which achieves this maximum generally depends on the values of the parameters being measured. This is no obstacle to many QFI applications, where the goal is to make increasingly precise measurements of an already well-characterized parameter. In the limit of high measurement resources, which we will call the asymptotic regime, we can efficiently measure an unknown parameter by allocating a negligible fraction of the resources for pre-estimation. In situations where measurement resources are limited, which we will call the Bayesian regime, the optimal measurement may depend strongly on the foreknowledge of the parameters [24]. A measurement sequence in the Bayesian regime will ideally be adaptive so that each measurement is refined using information gathered from previous measurements. Rather than considering the properties of such a measurement sequence, we will focus on optimizing an individual measurement.
In order to constrain the scope of this project, we make several simplifying assumptions. We assume the sample is a pure phase object with a negligible depth of field, that the measurement is performed with a deterministic source of unentangled scalar particles (i.e. polarization/spin degrees of freedom are not considered), and that the dominant source of noise is projection noise.
In the next section we motivate the transition from contrast to information and introduce the relevant FI formalism. In section 3 we apply this formalism in the asymptotic regime to the idealized scenario of multi-phase estimation (MPE) where arbitrary, lossless transformations can be applied to the exit wavefunction. This perspective helps to clarify the value of a reference channel when projection noise alone limits the measurement efficiency. In section 4 we explore MPE in the Bayesian regime, where foreknowledge of the sample becomes a second limitation. We will develop a suite of methods for phase measurements of strongly scattering samples which are effective for various optimization priorities and levels of foreknowledge. Finally, in section 5 we restrict the optimization to a set of measurements which can be implemented with only a few optical components and take into account the limited NA of the objective lens.

From Contrast to Information
The properties of a linear optical system can be described by the point spread function or by its Fourier transform, the optical transfer function. The complex modulus of the optical transfer function is called the modulation transfer function or the Contrast Transfer Function (CTF, especially in electron microscopy [4]). The CTF characterizes the frequency-dependent efficiency with which the system transports information from the sample to the detector. For absorbing imaging targets, spatial resolution (limited by lens aberrations and the NA) is often the primary concern. For phase-shifting imaging targets, the CTF expresses another important limitation: how efficiently the optics convert phase variations to intensity variations.
The CTF is insufficient for describing the properties of the transfer optics when there is not a one-to-one correspondence between spatial frequencies in the sample phase and spatial frequencies in the detected intensity. For example, a strong sinusoidal phase grating, unlike an amplitude grating, diffracts to many orders. We might account for this by replacing the conventional CTF with a scattering matrix (or vector-valued function) which, for each spatial frequency $\vec{q}_k$ in the sample, gives the resulting intensity contrast at each spatial frequency $\vec{q}_j$ at the detector. However, it is not obvious how to condense this data into a figure of merit for optimizing the optics. Instead we will optimize with a cost function which can be constrained with the FI. The FI formalism also provides a quantum limit for the minimal cost which can be used as an optimization benchmark.
We will assume the sample transmission function $\Phi = e^{i\phi}$ can be discretized into $n$ regions with unknown phase shifts $\{\phi_k\}_{k=0}^{n-1}$. Let $\Theta = [\theta_0, \theta_1, \ldots, \theta_{n-1}]$ be a vector of linear parameters describing $\phi$ in the orthonormal basis $\{v^{(a)}\}_{a=0}^{n-1}$. A linear parameterization of $\phi$ is a nonlinear parameterization of $\Phi$:
$$\Phi_k(\Theta) = \exp\Big(i \sum_{a=0}^{n-1} \theta_a v^{(a)}_k\Big).$$
We will often use a 'phase grating basis' where $v^{(a)}$ is a phase grating with spatial frequency $\vec{q}_a$. The phase grating basis is described explicitly in the appendix (section A). In order to estimate the values of $\Theta$ based on a measurement outcome $j$ (i.e. detection at pixel $j$), we use an estimating function (estimator) $\hat{\Theta}(j) \equiv \hat{\Theta}_j$. The optimization of the measurement is defined by minimizing the expected cost
$$\langle C \rangle (\Theta) = \sum_j I_j(\Theta)\, C(\Theta, \hat{\Theta}_j),$$
where $C(\Theta, \hat{\Theta}_j)$ is the cost function and $I_j(\Theta)$ is the probability of result $j$ (i.e. the intensity at detector pixel $j$). A standard choice is the quadratic form $C(\Theta, \hat{\Theta}_j) = (\hat{\Theta}_j - \Theta)^T W (\hat{\Theta}_j - \Theta)$ where $W$ is a positive semi-definite (often diagonal) weighting matrix which defines the relative priority of reducing the variance of each of the parameters. Using this cost function, the expected cost is
$$\langle C \rangle = \mathrm{Tr}\big(W \Sigma_{\hat{\Theta}}\big),$$
where $\Sigma_{\hat{\Theta}}$ is the covariance matrix for $\hat{\Theta}$. The quadratic cost function is a fairly universal choice when the measurement variance is expected to be small. For larger variances, this cost function does not reflect the periodicity of $\Phi(\Theta)$. We could construct a periodic cost function as in [24]; however, the phase shift is connected to some non-periodic physical property of the imaging target (e.g. the integral of the index of refraction along the optical axis), and it is ultimately this underlying property we want to measure. Minimizing the expected cost involves choosing optimal transfer optics and simultaneously an optimal estimator. However, with the choice of the quadratic cost function, the Cramér-Rao Bound (CRB) provides a simple way to calculate the lowest achievable variance of any (unbiased) estimator [25].
The CRB will describe the performance of the optics without specifying the optimal estimator:
$$\Sigma_{\hat{\Theta}} \geq \frac{1}{N}\, \mathcal{I}(\Theta)^{-1}, \qquad \langle C \rangle \geq \frac{1}{N}\, \mathrm{Tr}\big(W\, \mathcal{I}(\Theta)^{-1}\big),$$
where the first inequality is the usual comparison between positive semidefinite matrices, $N$ is the number of independent measurements, and $\mathcal{I}(\Theta)$ is the Fisher Information Matrix (FIM) for $\Theta$,
$$\mathcal{I}_{a,b}(\Theta) = \sum_j \frac{\partial_a I_j(\Theta)\, \partial_b I_j(\Theta)}{I_j(\Theta)},$$
where $\partial_a$ is the derivative with respect to $\theta_a$. The diagonal elements $\mathcal{I}_{a,a} \equiv \mathcal{I}_a$ bound the variance for each individual parameter $\theta_a$. In the WPOA, using the phase grating basis, it is possible to show that the FIM is diagonal and its elements are the square of the CTF values. The details of this correspondence are discussed in the appendix (section B). Unlike the CTF, the FIM is meaningful even outside the WPOA. In that sense the diagonal of the FIM can naively be thought of as a generalization of the CTF. A more informative transfer function, which takes into account the off-diagonal elements of the FIM, is discussed at the end of this section. The CRB allows us to bypass the consideration of $\hat{\Theta}$ while optimizing the measurement $T$ implemented by the transfer optics. We can make the dependence of the FI on the measurement $T$ explicit by writing $\mathcal{I}(\Theta, T)$. Then the QFI is
$$\mathcal{J}_a(\Theta) = \max_T\, \mathcal{I}_a(\Theta, T).$$
The CRB applied to the QFI is called the quantum CRB (QCRB) [22]. Since the QFI is independent of $T$, it can be considered a measure of the 'information' about $\theta_a$ available in the exit wavefunction. To be precise, the QFI describes the variance-reducing power of a measurement and has units of $1/\theta_a^2$ rather than entropy (bits), which is generally considered a more elementary measure of information. Nevertheless, the QFI is seen as a fundamental quantity in quantum metrology.
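As a concrete check of the FIM definition and the CRB above, the following minimal sketch considers a single phase parameter read out by a two-outcome interferometric measurement. The outcome probabilities (1 ± sin θ)/2, the estimator, the function names, and all numerical values are illustrative assumptions and are not taken from the text. The Fisher information evaluates to 1 for any θ, so the CRB predicts a variance of about 1/N for N detected particles, which a Monte Carlo simulation can be compared against.

```python
import numpy as np

def fisher_information(probs, dprobs):
    """Scalar Fisher information: sum_j (dI_j/dtheta)^2 / I_j."""
    probs = np.asarray(probs, dtype=float)
    dprobs = np.asarray(dprobs, dtype=float)
    return float(np.sum(dprobs ** 2 / probs))

def simulate_estimator_variance(theta, n_particles, n_trials, seed=0):
    """Empirical variance of theta_hat = arcsin(2k/N - 1) over repeated experiments.

    k is the number of particles detected in the '+' output port.
    The estimator is only asymptotically unbiased (assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    p_plus = 0.5 * (1.0 + np.sin(theta))
    counts = rng.binomial(n_particles, p_plus, size=n_trials)
    theta_hat = np.arcsin(np.clip(2.0 * counts / n_particles - 1.0, -1.0, 1.0))
    return float(np.var(theta_hat))

theta = 0.3          # true phase (arbitrary illustrative value)
n_particles = 2000   # N independent single-particle measurements

fi = fisher_information(
    [0.5 * (1 + np.sin(theta)), 0.5 * (1 - np.sin(theta))],
    [0.5 * np.cos(theta), -0.5 * np.cos(theta)],
)
crb = 1.0 / (n_particles * fi)
empirical = simulate_estimator_variance(theta, n_particles, n_trials=5000)
print("Fisher information:", fi)
print("CRB variance:", crb, "empirical variance:", empirical)
```

The empirical variance should sit close to the bound here because the number of particles is large; with few particles the estimator becomes biased and the comparison is no longer meaningful.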
Multi-parameter measurements are limited by a quantum FIM (QFIM) which is larger than the FIM (in the positive semidefinite sense) for any particular measurement. Whereas the quantum information limit is always attainable in the single parameter case, the matrix bound for multiple parameters may not be attainable when the parameters are associated with incompatible observables [26]. For example, the limited NA of an objective lens restricts the transverse momentum of the exit wavefunction, thereby performing a counterfactual measurement incompatible with spatial phase measurements of the sample. Even when the QFIM is unattainable, the cost function can be used to identify an optimal measurement. However, the optimal measurement will generally depend on the particular value of $\Theta$. Therein lies the paradox described in the introduction: to construct the optimal measurement, one must first know the result of the measurement. In the Bayesian regime, we can only design measurements to maximize the expected FI.
We can express our foreknowledge of $\Theta$ using a probability distribution $\lambda(\Theta)$. For example, the WPO condition could be incrementally relaxed by setting $\lambda(\Theta) = \prod_a \mathcal{N}(\theta_a; \sigma^2)$, where each $\theta_a$ is drawn from an independent normal distribution with zero mean and variance $\sigma^2 \lesssim 1$. A plausible source of foreknowledge is a known diffraction pattern. In this case, we can apply the principle of indifference and assume a uniform distribution over all phase objects with the same diffraction pattern. To sample $\lambda$, we can apply the Gerchberg-Saxton algorithm [27] using the condition that the probe intensity must be uniform at the sample. Given $\lambda$, the goal is to minimize the weighted average of the expected variance. The expected measurement cost is
$$\langle C \rangle_\lambda = \int \lambda(\Theta)\, \langle C \rangle(\Theta)\, d\Theta, \qquad (7)$$
where $\langle C \rangle(\Theta)$ is the expected cost at a fixed $\Theta$ and $\langle \cdot \rangle_\lambda$ is the expectation with respect to $\lambda$. This cost is related to the FI by the van Trees bound [28] (also known as the Bayesian CRB [29])
$$\langle C \rangle_\lambda \geq \mathrm{Tr}\big(W V^{-1}\big), \qquad V = \langle \mathcal{I}(\Theta, T) \rangle_\lambda + \mathcal{I}(\lambda),$$
where $\mathcal{I}(\Theta, T)$ is the FIM for the measurement $T$ and $\mathcal{I}(\lambda)$ is the information contributed by the prior,
$$\mathcal{I}_{a,b}(\lambda) = \big\langle \partial_a \ln \lambda\; \partial_b \ln \lambda \big\rangle_\lambda.$$
To promote this to a quantum bound, we should maximize $\langle \mathcal{I}(\Theta, T) \rangle_\lambda$ over all possible measurements. To form a tight upper bound, the maximization should be done after taking the expectation value, which produces the Quantum van Trees Information [30]. In general, this bound can only be computed by finding the specific measurement which achieves it. Instead we will use the generalized QFI (GQFI) $Z$ [30,31], which is obtained by simply replacing $\langle \mathcal{I}(\Theta, T) \rangle_\lambda$ with the QFIM $\mathcal{J}$:
$$Z = \mathcal{J} + \mathcal{I}(\lambda).$$
While this bound is typically unattainable, it is generally easier to calculate and thus more suitable as an optimization benchmark. Caution is warranted in interpreting these bounds, as it may not be simple or even possible to devise an efficient estimator (one which saturates the bound) with limited information. We may regard the lower bound on $\langle C \rangle_\lambda$ as the value we would assign a measurement in retrospect, after collecting enough information to accurately estimate $\Theta$. When using a uniform weighting $W = \mathbb{1}$, the cost is minimized by prioritizing sensitivity to parameters with large prior variance. This is sometimes undesirable.
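The sampling of λ from a known diffraction pattern, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the function names, the 16 x 16 grid, the iteration count, and the synthetic test pattern are all hypothetical, and a Gerchberg-Saxton iteration started from random phases is only an approximate way to draw from the set of phase objects consistent with the pattern (it is not guaranteed to sample that set uniformly, or to reach an exact solution).

```python
import numpy as np

def fourier_mismatch(phi, target_amp):
    """Squared error between the far-field amplitudes of exp(i*phi) and the target."""
    psi = np.exp(1j * phi) / np.sqrt(phi.size)
    far_amp = np.abs(np.fft.fft2(psi, norm="ortho"))
    return float(np.sum((far_amp - target_amp) ** 2))

def sample_phase_object(diffraction_intensity, n_iter=200, seed=None):
    """Draw one phase map approximately consistent with a known diffraction pattern.

    Alternates between two constraints: the known far-field amplitudes and a
    uniform probe intensity at the sample (pure phase object).
    """
    rng = np.random.default_rng(seed)
    target_amp = np.sqrt(diffraction_intensity)
    n = target_amp.size
    phi = rng.uniform(-np.pi, np.pi, size=target_amp.shape)  # random starting guess
    for _ in range(n_iter):
        psi = np.exp(1j * phi) / np.sqrt(n)              # uniform intensity at the sample
        far = np.fft.fft2(psi, norm="ortho")
        far = target_amp * np.exp(1j * np.angle(far))    # impose known diffraction amplitudes
        phi = np.angle(np.fft.ifft2(far, norm="ortho"))  # keep only the sample-plane phase
    return phi

# Synthetic example: the "known" pattern is generated from a hidden phase object.
rng = np.random.default_rng(0)
true_phi = rng.normal(0.0, 1.0, size=(16, 16))
psi_true = np.exp(1j * true_phi) / np.sqrt(true_phi.size)
pattern = np.abs(np.fft.fft2(psi_true, norm="ortho")) ** 2

draw_a = sample_phase_object(pattern, seed=1)
draw_b = sample_phase_object(pattern, seed=2)
print("residuals:", fourier_mismatch(draw_a, np.sqrt(pattern)),
      fourier_mismatch(draw_b, np.sqrt(pattern)))
```

Different seeds give different phase maps with similar diffraction patterns, which is the property needed to estimate expectation values with respect to λ by averaging over draws.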
Suppose the sample consists of a WPO $\Phi(\Theta_f)$ embedded in a strongly scattering matrix $\Phi(\Theta_b)$. The combined transmission function is the element-wise product
$$\Phi(\Theta) = \Phi(\Theta_f)\, \Phi(\Theta_b) = \Phi(\Theta_f + \Theta_b).$$
We will call $\Theta_f$ the foreground and $\Theta_b$ the background and assume the corresponding prior distributions, $\lambda_f(\Theta_f)$ and $\lambda_b(\Theta_b)$. Since $\lambda_b$ contains larger variances, a measurement optimized using the cost function defined in Eq. 7 will be tailored for measuring the background. We could attempt to find a non-uniform weighting to increase the cost of foreground error, but it is not obvious how to choose the weights. In the appendix (section C) we derive a van Trees-like bound on the cost function $C_f$ for variance reduction in the foreground (Eq. 11). In the limit where $\eta \to 0$ (complete ignorance of the background), $\Delta\mathcal{I} \to \langle \mathcal{I}(\Theta, T) \rangle_\lambda$ so the measurement cost is constant (no information can be gained about the foreground). In the limit where $\eta \to \infty$ (complete knowledge of the background), $\Delta\mathcal{I} \to 0$ and $\lambda \to \lambda_f$, so the bound on $C_f$ becomes identical to the standard van Trees bound. This cost function tends to prioritize sensitivity to parameters which have a small prior background variance. A lower bound on this cost function for all possible measurements is obtained by replacing $\langle \mathcal{I}(\Theta, T) \rangle_\lambda$ with $\mathcal{J}$. While the cost functions are useful for optimization, they provide little insight into the properties of a particular measurement. For this purpose, it will be useful to define an information transfer function (ITF) which describes the information gained about each parameter. The diagonal of the FIM is not sufficient for this purpose, as it does not account for correlations between parameters, both from the prior distribution and the measurement. A full account of the information gain (in bits) is given by the relative entropy (also known as the KL-divergence) between the prior and posterior probability distributions for $\theta_a$ [32]. There are two problems with using the relative entropy to define the ITF.
First, it would require specifying a rule for updating $\lambda$ for each possible measurement result, in other words specifying an estimator $\hat{\Theta}$. Second, there is no clear way to normalize an ITF based on the relative entropy. If instead the ITF is defined in terms of the FI, then the first problem is solved by the van Trees bound and the second is solved by the limit placed on $V$ by $Z$. Therefore we define the ITF, $H(a; \lambda, T)$, as the decrease in variance achieved for each parameter (determined using the van Trees bound) relative to the decrease in variance allowed by the GQFI, both measured from the prior variance $\sigma_a^2(\lambda) = \Sigma_{a,a}(\lambda)$, where $\Sigma(\lambda)$ is the covariance matrix for $\lambda$. The maximum value of the ITF is 1, and when $T$ is expected to produce no new information about parameter $\theta_a$, then $H(a; \lambda, T) = 0$. When $\langle \mathcal{I} \rangle_\lambda$, $\mathcal{J}$, and $\mathcal{I}(\lambda)$ are all diagonal, then the ITF is simply the ratio of the FI and the QFI. When $\Theta$ is expressed in the phase grating basis we will write $H(\vec{q}_a; \lambda, T)$. As shown in the appendix (section B), the ITF is equal to the square of the CTF in the WPOA. Note that unlike a standard transfer function, $H(\vec{q}_a; \lambda, T)$ depends on spatial frequencies in the sample phase $\phi$ rather than spatial frequencies in the transmission function $\Phi$. Also, unlike a linear optical transfer function which describes the properties of the optics alone, the ITF depends on the joint properties of the optics (via $T$) and the particular sample (via $\lambda$).
We will use a similar formulation to evaluate optimization outcomes: the quantity $\Delta C$ is the decrease in cost (here, the cost function in Eq. 7) indicated by the van Trees bound relative to the maximum decrease allowed by the GQFI.

Multi-Phase Estimation with Full Foreknowledge
Before applying the above to phase imaging with optics with limited NA, we will consider the more idealized scenario of multiple phase estimation (MPE) to clarify some of the fundamental limitations of phase imaging with various amounts of foreknowledge. MPE is a well-studied problem in the field of quantum metrology [33,34]. Instead of free space modes, the probe states of MPE occupy $n$ discrete channels upon which we can apply arbitrary (lossless) transformations. In order to keep close analogy with phase imaging, we will imagine the channels are arranged in a grid so we may parameterize $\phi$ in terms of its 2D spatial frequency components. Of course, the actual spatial arrangement of channels in MPE is irrelevant. A reference channel with known phase is generally available and the goal is to find the optimal probe state. We will assume the probe is a pure, single particle state with amplitude $\alpha_j$ in channel $j$ and amplitude $\beta$ in the reference channel ($\sum_j \alpha_j^2 + \beta^2 = 1$). The QFIM for a pure state $\psi$ can be written explicitly [21]:
$$\mathcal{J}_{a,b} = 4\, \mathrm{Re}\big[ \langle \partial_a \psi | \partial_b \psi \rangle - \langle \partial_a \psi | \psi \rangle \langle \psi | \partial_b \psi \rangle \big]. \qquad (14)$$
For the case where $n = 1$, it is simple to verify that $\mathcal{J} = 1$ and a measurement which achieves this limit can be performed with a Mach-Zehnder interferometer (MZI) with $\alpha^2 = \beta^2 = 1/2$. A natural guess for an efficient measurement for $n > 1$ is to divide the probe evenly among $n$ parallel MZIs (so half of the total probe intensity still passes through the reference arm). The FIM for this measurement is $\mathcal{I} = \mathbb{1}/n$ so the total information is $\mathrm{Tr}(\mathcal{I}) = 1$ and the bound on the total variance from the CRB is $\langle C \rangle = \mathrm{Tr}(\Sigma_{\hat{\Theta}}) \geq \mathrm{Tr}(\mathcal{I}^{-1}) = n^2$. However, the quantum limit is superior: $\mathrm{Tr}(\mathcal{J}) = 4n/(1 + \sqrt{n})^2$ and $\langle C \rangle \geq n(1 + \sqrt{n})^2/4$.
This implies there is some advantage to simultaneous parameter estimation (if we allow ψ to be a multi-particle entangled state, then this relative advantage is even more pronounced [33]).
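The arithmetic behind this comparison can be reproduced in a few lines. The sketch below only evaluates the two bounds quoted above (the function names are hypothetical): the CRB for n parallel MZIs, computed from a FIM equal to the identity divided by n, and the quantum limit n(1 + √n)²/4.

```python
import numpy as np

def parallel_mzi_cost_bound(n):
    """CRB on the total variance for n parallel balanced MZIs (FIM = identity / n)."""
    fim = np.eye(n) / n
    return float(np.trace(np.linalg.inv(fim)))

def quantum_cost_bound(n):
    """Quantum limit on the total variance quoted in the text: n (1 + sqrt(n))^2 / 4."""
    return n * (1.0 + np.sqrt(n)) ** 2 / 4.0

for n in (1, 16, 256):
    separate = parallel_mzi_cost_bound(n)
    simultaneous = quantum_cost_bound(n)
    print(n, separate, simultaneous, separate / simultaneous)
```

The ratio of the two bounds is 4n/(1 + √n)², which equals 1 for a single channel and approaches 4 for large n, so for a single-particle probe the benefit of simultaneous estimation is at most a factor of four in total variance.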
In order to explain the advantage of simultaneous estimation, it is helpful to use a parameterization which diagonalizes $\mathcal{J}$. If we use a uniform probe $\alpha_j = \alpha$, the QFIM has two distinct parameter eigenspaces. One corresponds to the average phase shift $\theta_0 = \frac{1}{n} \sum_j \phi_j$ and has eigenvalue $\mathcal{J}_0 = 4\alpha^2\beta^2 \leq 1/n$ which is maximized using probe amplitudes $\beta^2 = n\alpha^2 = 1/2$. The other eigenspace has rank $n-1$ and contains information about all parameters independent of $\theta_0$. Its eigenvalue is $\mathcal{J}_\perp = 4\alpha^2 \leq 4/n$, which achieves its largest value when $\beta = 0$. Thus, all but one of the degrees of freedom can be measured optimally without a reference channel (see [35] for an analysis of quantum multi-phase estimation without a reference channel), and the total variance is minimized by setting $\beta^2 \sim 0$ for large $n$ (explicitly, $\beta^2 = 1/(1 + \sqrt{n})$). Simultaneous estimation schemes have an advantage, then, because they are able to invest more in the $\mathcal{J}_\perp$ eigenspace, where fewer measurement resources are required to achieve the same variance reduction. In some microscopy applications the relevant measurement resource is the total dose, $d = n\alpha^2$. In this case, we should maximize $\mathcal{J}/d$. The matrix eigenvalues become $\mathcal{J}_0 = 4\beta^2/n$ and $\mathcal{J}_\perp = 4/n$. This consideration does not change the conclusion that the reference channel is not helpful in the $\mathcal{J}_\perp$ eigenspace. Indeed, there are often practical advantages in dispensing with the reference channel. For example, when an imaging system has multiple optical axes, their relative phase stability becomes an added engineering challenge [36]. In many imaging applications, the value of the average phase shift $\theta_0$ is irrelevant (e.g. the thickness of the sample matrix) and may even be considered a nuisance parameter. We will proceed under the assumption that $\theta_0$ is an extraneous parameter and specialize to measurements which are in-line (lacking a reference channel) and therefore only sensitive to the $\mathcal{J}_\perp$ eigenspace (we will suppress the $\perp$ subscript in the future).
We will also continue to assume that the probe amplitude is uniform across the channels. With these assumptions, $\mathcal{J} = (4/n)\mathbb{1}$. Since $\mathcal{J}$ is a scalar matrix, it is invariant to reparameterization: a measurement which achieves $\mathcal{J}$ is optimal for estimating any individual parameter or set of parameters.
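The eigenvalue structure described above can be checked numerically. The sketch below builds the QFIM for a uniform probe with a reference channel from the standard pure-state expression 4 Re[⟨∂ₐψ|∂_bψ⟩ − ⟨∂ₐψ|ψ⟩⟨ψ|∂_bψ⟩], then scans the reference intensity β² to find the value that minimizes the total variance bound Tr(J⁻¹). The function names, the choice n = 16, and the scan grid are illustrative assumptions. A direct minimization of 1/(4α²β²) + (n − 1)/(4α²) subject to nα² + β² = 1 gives β² = 1/(1 + √n), which reproduces the bound n(1 + √n)²/4 quoted earlier.

```python
import numpy as np

def qfim_uniform_probe(n, beta_sq):
    """QFIM for the n channel phases with a uniform probe and reference intensity beta_sq.

    With psi_j = alpha * exp(i phi_j): <d_a psi|d_b psi> = alpha^2 delta_ab and
    <d_a psi|psi><psi|d_b psi> = alpha^4 for every pair (a, b).
    """
    alpha_sq = (1.0 - beta_sq) / n
    return 4.0 * (alpha_sq * np.eye(n) - alpha_sq ** 2 * np.ones((n, n)))

def total_variance_bound(n, beta_sq):
    """Quantum CRB on the total variance, Tr(J^-1), for uniform weighting."""
    return float(np.trace(np.linalg.inv(qfim_uniform_probe(n, beta_sq))))

n = 16

# Eigenvalues for a balanced reference: one value 4 alpha^2 beta^2, and n-1 values 4 alpha^2.
eigenvalues = np.linalg.eigvalsh(qfim_uniform_probe(n, 0.5))
print("J_0 =", eigenvalues[0], " J_perp =", eigenvalues[-1])

# Scan the reference intensity to minimize the total variance bound.
beta_grid = np.linspace(0.01, 0.99, 9801)
costs = np.array([total_variance_bound(n, b) for b in beta_grid])
best_beta_sq = float(beta_grid[np.argmin(costs)])
print("optimal beta^2 (scan):", best_beta_sq, " analytic:", 1.0 / (1.0 + np.sqrt(n)))
print("minimum total variance bound:", costs.min())
```

The scan makes the trade-off visible: intensity placed in the reference channel improves only the single eigenvalue associated with the average phase, at the expense of the other n − 1 eigenvalues.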

Multi-Phase Estimation with Limited Foreknowledge
We will now discuss several types of in-line measurements which are useful with various levels of foreknowledge. We will assume the measurements are projective so that $I_j = |\sum_k T_{j,k} \Phi_k|^2$ for some unitary matrix $T$ (this precludes measurement schemes which use multiple detectors to make non-commuting measurements). We can factorize $T$ as
$$T = U^{*} M U,$$
where $U$ (a unitary matrix) is the measurement eigenbasis and $M$ is a diagonal matrix of unit-norm eigenvalues $M_{q,q} = e^{i\mu_q}$. A sufficient condition for $T$ to achieve the QFIM limit is that $U$ concentrates all of the intensity in the exit wavefunction into a single eigenvector. This is possible in the asymptotic regime (where $\Phi(\Theta)$ is known) by setting $U = F\Phi^{-1}$ where $F$ is the Fourier transform matrix and $\Phi^{-1}$ is the inverse of the sample transmission function. Then it is simple to show $\mathcal{I}(\Theta, T) = \mathcal{J}$ if $\mu_{q=0} = \pi/2$ and $\mu_{q>0} = 0$. For WPOs ($\Phi^{-1} \sim 1$), this measurement is equivalent to ZPC. The reparameterization-invariance of $\mathcal{J}$ implies that ZPC performs optimally for measuring any feature of a WPO. It is interesting to note that while a general projective measurement of a state with $n$ degrees of freedom is described by a unitary transform with $n^2$ real parameters, an optimal measurement can be performed using only ZPC optics (with no degrees of freedom) and a phase mask with $n$ degrees of freedom. This makes it practical to design efficient phase imaging optics for any sample using relatively few optical elements.
Figure 1: Exit wavefunction $\psi$ is formed by passing a uniform, single particle probe through an unknown phase object $\Phi$. A unitary operator $T = U^{*} M U$ is applied to $\psi$ before measurement at detector D. If it is possible to choose a measurement eigenbasis $U$ which condenses $\psi$ into a small subspace, then an efficient measurement can be performed using Generalized Common Path Interferometry (GCPI). A set $Q$ of eigenvectors is designated as the in-line reference. The measurement eigenvalues are then set to $M_q = \exp(i\mu)$ for $q \in Q$ and $M_q = 1$ otherwise. The membership of $Q$ and the value of $\mu$ are optimized based on the expected intensity $\Lambda_q$ carried by eigenvector $q$. As a rule of thumb, $Q$ includes the eigenvectors carrying the highest intensities. When $\Lambda_q$ is uniform, $Q$ includes half of the eigenvectors at random. A, B, and C are schematics of example intensity patterns $\Lambda_q$ and the corresponding optimal $M$. The white region represents the set $Q$.
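The claim that the matched measurement with U = FΦ⁻¹ saturates the in-line quantum limit can be verified numerically for a strongly scattering object. The following sketch is an illustration under stated assumptions: it uses a 1D arrangement of n = 32 channels (the spatial arrangement is irrelevant for MPE), a uniform single-particle probe, hypothetical function names, and a randomly chosen phase object. It computes the FIM for the channel phases from the intensity derivatives and compares it with (4/n)(1 − P₀), where P₀ projects onto the average phase, to which an in-line measurement is insensitive.

```python
import numpy as np

def fisher_matrix(T, phi):
    """FIM for the channel phases, given I_j = |sum_k T_jk psi_k|^2 and a uniform probe."""
    n = phi.size
    psi = np.exp(1j * phi) / np.sqrt(n)
    out = T @ psi
    intensity = np.abs(out) ** 2
    # d(out_j)/d(phi_a) = i * T_ja * psi_a, with T held fixed
    d_out = 1j * T * psi[np.newaxis, :]
    d_int = 2.0 * np.real(np.conj(out)[:, np.newaxis] * d_out)
    return d_int.T @ (d_int / intensity[:, np.newaxis])

def matched_zernike_transfer(phi):
    """T = U* M U with U = F Phi^-1, mu_0 = pi/2, and mu_q = 0 otherwise."""
    n = phi.size
    F = np.fft.fft(np.eye(n), norm="ortho")   # unitary DFT matrix
    U = F @ np.diag(np.exp(-1j * phi))        # requires full knowledge of the sample
    mu = np.zeros(n)
    mu[0] = np.pi / 2
    return U.conj().T @ np.diag(np.exp(1j * mu)) @ U

rng = np.random.default_rng(1)
n = 32
phi = rng.uniform(-np.pi, np.pi, n)           # strongly scattering phase object

fim = fisher_matrix(matched_zernike_transfer(phi), phi)
qfim_perp = (4.0 / n) * (np.eye(n) - np.ones((n, n)) / n)
print("max deviation from the in-line QFIM:", np.max(np.abs(fim - qfim_perp)))
```

Note that the transfer matrix is built from the true phases, which is only possible in the asymptotic regime; the derivative is then taken with the optics held fixed.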
In the Bayesian regime we do not have precise knowledge of $\Phi$ and therefore cannot choose a measurement eigenbasis which concentrates $\psi$ into a single eigenvector. As shown in the appendix (section D) this precludes finding a projective measurement which achieves the quantum information limit. To find an efficient measurement we must optimize based on the prior distribution $\lambda$. As a starting point, we can set $U = F e^{-i\langle\phi\rangle_\lambda}$, $\mu_{q=0} = \pi/2$ and $\mu_{q\neq 0} = 0$. This measurement is effective when $\lambda$ contains mainly translation-specific prior information, or, equivalently, when the covariance matrix for $\lambda$ is nearly diagonal for the parameterization $\theta_a = \delta_{a,k}\phi_k$. An implementation of this measurement is described in [37]. With little or no translation-specific information, the measurement is ineffective. This can be partially ameliorated by optimizing the phase shift $\mu_{q=0} \equiv \mu$, an approach we will call Generalized ZPC (GZPC). The maximum efficiency of GZPC will depend on how much intensity $\Lambda_q = \langle |(U\psi)_q|^2 \rangle_\lambda$ is focused into measurement eigenvector $q = 0$. In the appendix (section E) we calculate $\langle \mathcal{I} \rangle_\lambda$ for GZPC in the particular case that the parameters are independently and normally distributed with variance $\sigma^2$. For $\Lambda_0 = e^{-\sigma^2} \gtrsim 0.8$, the result is $\langle \mathcal{I} \rangle_\lambda \sim \Lambda_0 \mathcal{J}$. This linear approximation underestimates $\langle \mathcal{I} \rangle_\lambda$ in the region $0.8 > \Lambda_0 > 0.5$, where it plateaus to a value of $\langle \mathcal{I} \rangle_\lambda \sim \frac{3}{4}\mathcal{J}$. For $\Lambda_0 < 0.5$, $\langle \mathcal{I} \rangle_\lambda$ drops precipitously. The effectiveness of GZPC can be extended to lower values of $\Lambda_0$ by increasing the value of $\mu$, in which case $\langle \mathcal{I} \rangle_\lambda \sim \frac{3}{4}\mathcal{J}$ is maintained until $\Lambda_0 < \frac{1}{4}$. If no foreknowledge of $\Phi$ is available, we can assign a uniform distribution to each phase $\phi_k$. In this case $\Lambda_0 \sim 1/n$ and GZPC (for any $\mu$) is uninformative. However, if we randomly set each $\mu_q$ to either 0 or $\pi$, the expected Fisher Information is $\langle \mathcal{I} \rangle_\lambda \sim \frac{1}{2}\mathcal{J}$.
This measurement, which we will call random sensing, is similar to the technique described by Oe and Namura [38] which uses a diffuser to generate in-line phase contrast. Oe and Namura rely on the WPOA to reconstruct the phase object. For strongly scattering samples, we must resort to a general phase retrieval algorithm such as Gerchberg-Saxton or Fienup [39]. Despite the factor of 2 discrepancy between the FI for random sensing and the QFI, this measurement out-performs the parallel Mach-Zehnder interferometer scheme (provided that the reconstruction algorithm produces the full variance reduction allowed by the CRB).
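The factor of roughly one half for random sensing can be illustrated with a small simulation. The sketch below is a hedged numerical illustration, not the calculation used in the paper: it assumes a 1D arrangement of n = 64 channels, U = F, phases drawn uniformly from [0, 2π), and masks with each μ_q chosen as 0 or π with equal probability. It reports the trace of the FIM relative to the trace of the in-line QFIM, 4(n − 1)/n, averaged over a few random samples and masks. All names and values are illustrative.

```python
import numpy as np

def random_sensing_efficiency(n, n_trials=20, seed=0):
    """Average Tr(FIM) / Tr(QFIM) for random 0/pi Fourier-plane masks and random phases."""
    rng = np.random.default_rng(seed)
    F = np.fft.fft(np.eye(n), norm="ortho")          # unitary DFT matrix (U = F)
    trace_qfim = 4.0 * (n - 1) / n                   # in-line QFIM: 4/n on n-1 parameters
    ratios = []
    for _ in range(n_trials):
        phi = rng.uniform(0.0, 2.0 * np.pi, n)       # no foreknowledge of the sample
        mu = rng.choice([0.0, np.pi], size=n)        # random binary phase mask
        T = F.conj().T @ np.diag(np.exp(1j * mu)) @ F
        psi = np.exp(1j * phi) / np.sqrt(n)
        out = T @ psi
        d_out = 1j * T * psi[np.newaxis, :]          # d(out_j)/d(phi_a)
        d_int = 2.0 * np.real(np.conj(out)[:, np.newaxis] * d_out)
        trace_fim = np.sum(d_int ** 2 / (np.abs(out) ** 2)[:, np.newaxis])
        ratios.append(trace_fim / trace_qfim)
    return float(np.mean(ratios))

efficiency = random_sensing_efficiency(64)
print("random sensing efficiency:", efficiency)
```

The simulation only evaluates the information content of the intensity pattern; recovering the phase from it still requires a reconstruction algorithm of the kind discussed above.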
In some circumstances, it is possible to find a measurement more efficient than both GZPC and random sensing. For example, if there exists a set $Q$ of measurement eigenvectors with $|Q| \ll n$ such that $R = \sum_{q \in Q} \Lambda_q \sim 1$, then it is simple to show that setting $\mu_{q \in Q} = \pi/2$ and $\mu_{q \notin Q} = 0$ defines a measurement which gives $\langle \mathcal{I} \rangle_\lambda \sim \mathcal{J}$. Such a set exists, for example, if $\Phi$ is a crystal with a known, sharp diffraction pattern (but perhaps unknown translation). In general, if it is possible to choose $U$ which concentrates $\psi$ into a small subspace, then this strategy produces an efficient measurement: if $|Q| \ll n$ and $R \sim 1$, $\mathrm{Tr}(\langle \mathcal{I} \rangle_\lambda) \sim \mathrm{Tr}(\mathcal{J})\, R\, (1 - |Q|/n)$. However, rather than trying to explicitly optimize $U$, we will focus on applications where the nature of the foreknowledge about the sample leads to a natural choice of $U$. For example, $U = F$ is the natural choice when $\lambda$ is induced by an expected diffraction envelope. More generally, we will assume $\Lambda$ is given for a particular $U$ and proceed to optimize $M$.
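The crystal example above can be illustrated with a sinusoidal phase grating of known period but unknown translation. In the sketch below (a 1D illustration with hypothetical names and parameter values), the grating has a peak phase shift of 2 rad, so it is far from a weak phase object, yet nearly all of its diffracted intensity falls into a handful of orders. Choosing Q as the central 11 orders and applying a π/2 phase shift to them gives an efficiency close to R(1 − |Q|/n).

```python
import numpy as np

def reference_set_efficiency(phi, q_set, mu=np.pi / 2):
    """Return (R, Tr(FIM)/Tr(QFIM)) for U = F and a phase shift mu applied to modes in q_set."""
    n = phi.size
    q_idx = np.array(sorted(q_set))
    F = np.fft.fft(np.eye(n), norm="ortho")               # unitary DFT matrix
    mask_phase = np.zeros(n)
    mask_phase[q_idx] = mu
    T = F.conj().T @ np.diag(np.exp(1j * mask_phase)) @ F
    psi = np.exp(1j * phi) / np.sqrt(n)
    captured = float(np.sum(np.abs(F @ psi)[q_idx] ** 2))  # R: intensity inside Q
    out = T @ psi
    d_out = 1j * T * psi[np.newaxis, :]                    # d(out_j)/d(phi_a)
    d_int = 2.0 * np.real(np.conj(out)[:, np.newaxis] * d_out)
    trace_fim = np.sum(d_int ** 2 / (np.abs(out) ** 2)[:, np.newaxis])
    return captured, float(trace_fim / (4.0 * (n - 1) / n))

n, order_spacing, amplitude = 128, 8, 2.0
translation_phase = 0.37                                   # unknown to the experimenter
x = np.arange(n)
phi = amplitude * np.cos(2.0 * np.pi * order_spacing * x / n + translation_phase)

# Diffraction orders sit at multiples of order_spacing regardless of the translation.
q_set = {(k * order_spacing) % n for k in range(-5, 6)}
R, efficiency = reference_set_efficiency(phi, q_set)
print("R =", R, " efficiency =", efficiency, " R(1 - |Q|/n) =", R * (1 - len(q_set) / n))
```

Because the positions of the diffraction orders do not depend on the translation, the same mask works for any shift of the grating, which is the sense in which foreknowledge of the diffraction pattern alone is sufficient here.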
Having chosen the measurement eigenbasis, we can define a family of measurements called Generalized Common Path Interferometry (GCPI) which is parameterized by the set $Q$ and the phase shift $\mu$ applied to measurement eigenvectors $q \in Q$. A procedure for optimizing over $\mu$ and $Q$ is described in the appendix (section F). The optimization is especially likely to identify a measurement more efficient than GZPC or random sensing when specializing to foreground variance reduction (using the cost function in Eq. 11) or to high-spatial frequency measurements. For example, in dose-limited electron microscopy, the high spatial frequency features degrade quickly and the achievable resolution scales with the fourth power of the dose [40]. This motivates a parameter weighting $W_{a,a} = |\vec{q}_a|^4$.
In Fig. 2, we compare the efficiency of GZPC, random sensing, GCPI, and dark field microscopy. On the left, $\Delta C$ is calculated for samples which are known to have a Gaussian intensity distribution $\Lambda$ in some basis $U$ for various peak intensities $\Lambda_0$. When the weighting on the parameters is uniform and the optimization is done using the full cost (Eq. 7), GCPI offers no advantage over the best choice among the other methods. However, when the weighting is $W_{a,a} = |\vec{q}_a|^4$, GCPI achieves a lower measurement cost by sacrificing sensitivity at low spatial frequencies in exchange for increased sensitivity at high spatial frequencies. GCPI also exploits this trade-off to out-perform other methods when specializing to foreground variance reduction using the cost function in Eq. 11. On the right, $\Lambda$ is a 2D Lorentz distribution $\Lambda(\vec{q}) = \frac{1}{2\pi} \frac{w}{(|\vec{q}|^2 + w^2)^{3/2}}$ with $w = (2\pi\Lambda_0)^{-1/2}$. In accordance with the rule of thumb described above, GCPI is especially effective for the Gaussian prior, where $\Lambda$ is more concentrated.
Figure 2: In-line multi-phase estimation schemes for $n = 16^2$ channels and prior distributions induced by Gaussian (left) and Lorentzian (right) diffraction envelopes with unscattered intensity $\Lambda_0$. The quantity $\Delta C$ is the average reduction in the weighted variance relative to the limit imposed by the GQFI for various phase contrast schemes: generalized Zernike phase contrast (GZPC) with $\mu = \pi/2$ (solid line) and $\mu = \pi$ (dashed line), dark field (DF), random sensing (RS), and generalized common path interferometry (GCPI). GCPI is optimized for the total variance (solid line) or just the foreground variance (dashed and dotted lines) and for a uniform weighting (dashed line) or for $W_{a,a} = |\vec{q}_a|^4$ (solid and dotted lines).

Phase Imaging with a Limited Numerical Aperture
Given some foreknowledge of the diffraction pattern of a sample, the measurements described in the previous sections can be performed using a spatial phase modulator to implement M and two Fourier lenses to implement U and U*. Given some translation-specific foreknowledge, it may also be beneficial to add a second spatial phase modulator to a conjugate image plane before the first lens to implement $U = F e^{-i\langle\phi\rangle_\lambda}$. In this section we will account for loss due to the limited numerical aperture of real lenses. This adds some intrinsic phase contrast which becomes significant for strongly scattering samples, but also reduces the total amount of information which can reach the detector. The achievable efficiency will depend on how well the probe, which is focused in the condenser aperture to provide plane wave illumination, can be refocused in a conjugate plane after passing through the sample.

Let $A(|q|)$ be a hard aperture function with $A(|q| < q_{max}) = 1$ and $A(|q| > q_{max}) = 0$. For a weak phase object (or an amplitude object), A blocks all information about spatial frequencies with a magnitude larger than $q_{max}$. However, the intensity pattern at the detector depends on all spatial frequencies present in a strong phase object, regardless of $q_{max}$. For example, the diffraction pattern of the superposition of two phase gratings at spatial frequencies $q_a$ and $q_b$ contains the beat frequencies $q_a \pm q_b$. Even if both $|q_a| > q_{max}$ and $|q_b| > q_{max}$, it is possible that $|q_a - q_b| < q_{max}$. This principle makes it possible to achieve superresolution using structured illumination [41]. Since the illumination is also limited by the NA, structured illumination can only improve resolution over the standard limit by a factor of two. But with a sufficiently informative prior distribution λ providing known structure in the sample itself, diffraction no longer imposes a fundamental resolution limit [41,42]. There remains, however, an information limit.
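The beating effect can be checked numerically. The following is a minimal one-dimensional sketch, not the simulation used for the figures: the grid size, the grating frequencies `q_a` and `q_b`, the cutoff `q_max`, and all helper names are illustrative assumptions. It passes the superposition of two phase gratings, both outside the aperture, through a hard Fourier-plane aperture and compares the intensity modulation at the detector for a weak and a strong phase amplitude.

```python
import numpy as np

def detected_intensity(phase, q_max):
    """Intensity after a hard Fourier-plane aperture |q| <= q_max (1D sketch)."""
    n = phase.size
    psi = np.exp(1j * phase)                        # pure phase object, plane-wave probe
    q = np.fft.fftfreq(n, d=1.0 / n)                # integer spatial frequencies
    aperture = (np.abs(q) <= q_max).astype(float)   # hard aperture A(|q|)
    return np.abs(np.fft.ifft(aperture * np.fft.fft(psi))) ** 2

n = 256
x = np.arange(n) / n
q_a, q_b, q_max = 40, 45, 10   # assumed values: both gratings lie outside the aperture

def two_gratings(amplitude):
    """Superposition of two phase gratings with equal amplitude (radians)."""
    return amplitude * (np.cos(2 * np.pi * q_a * x) + np.cos(2 * np.pi * q_b * x))

def modulation(amplitude):
    """Peak-to-peak intensity variation at the detector."""
    intensity = detected_intensity(two_gratings(amplitude), q_max)
    return intensity.max() - intensity.min()

weak = modulation(0.01)    # weak phase object: almost nothing survives the aperture
strong = modulation(1.0)   # strong phase object: the beat at q_b - q_a passes through

# Fourier amplitude of the detected intensity at the beat frequency q_b - q_a
spectrum = np.abs(np.fft.fft(detected_intensity(two_gratings(1.0), q_max))) / n
beat = spectrum[q_b - q_a]
```

Under these assumptions the modulation for the weak gratings is second order in the phase amplitude, while the strong gratings produce a pronounced intensity component at $q_b - q_a$, which lies inside the aperture.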
The measurements which can be applied to the exit wavefunction ψ using diffraction-limited optics are non-projective and cannot be described using a unitary transfer function of rank n. However, we will assume that the measurement applied to the wavefunction exiting the Fourier aperture, $\Psi_q = A_q F(\psi)_q$, is unrestricted. Then the diffraction-limited QFIM, $\tilde{J}$, can be calculated by applying Eq. 14 to Ψ. We will neglect the small amount of additional information available when using a deterministic source (i.e. the FI associated with the total intensity missing at the detector). Unlike the QFIM for MPE, $\tilde{J}$ depends on Θ and is not diagonal, making the calculation of the GQFI less trivial:

The off-diagonal elements are generally small and a good approximation is

While more strongly scattering samples send a larger portion of the probe intensity outside the NA, they are also more sensitive to spatial frequencies higher than $q_{max}$ through the beating effect described above. These effects act in equal measure and the fractional QFI lost to the aperture, $\mathrm{Tr}(J - \tilde{J})/\mathrm{Tr}(J) \sim \sum_q A_q / n$, is roughly constant regardless of λ. Using $\tilde{Z}(\lambda)$, we can write an envelope function for the ITF which represents the maximum diffraction-limited variance reduction

Fig. 3 shows the ITF for various phase contrast schemes. Since natural images often have spectra with power $\sim q^{-2}$ [43][44][45], we assume the diffraction pattern has a 2D Lorentz distribution with unscattered intensity $\Lambda_0 = 0.6$ (left), $\Lambda_0 = 0.2$ (middle), and $\Lambda_0 = 0.1$ (right). The black curve is the envelope function defined in Eq. 17. Comparing the three plots, we see that as more intensity scatters outside the NA, the decrease in E below $q_{max}$ is accompanied by an approximately equal increase above $q_{max}$. The cyan curve is the ITF for the intrinsic (bright field) contrast due to scattering outside the NA. The blue curve is the ITF for GZPC using the optimal phase (µ = π/2 for $\Lambda_0 = 0.6$, µ = π for the other two).
The red curve is the ITF for random sensing. The remaining curves are ITFs for GCPI using µ = π/2 with varying |Q|. As |Q| increases, information about high spatial frequency parameters is gained at the cost of information about low spatial frequency parameters. While the decrease in low spatial frequency information is strictly a disadvantage from the perspective of any (positively weighted) cost function, it may be a positive feature in some circumstances. For example, filtering out low spatial frequencies may simplify data interpretation (finding an efficient estimator).

In Fig. 4 we optimize a GCPI filter for measuring a WPO in a strongly scattering background using the cost function in Eq. 11. We again assume a Lorentzian diffraction pattern and set Λ(q = 0) = 0.2. The foreground WPO is a 20 µm diameter pinwheel. The phase of the combined foreground and background is shown in (A). The detected intensity distributions using ZPC with µ = π (B), random sensing (C), and GCPI (E) are shown with identical color scales. The optimized Fourier filter for GCPI is shown in (D). A phase shift of ∼0.52π is applied in the central (white) region relative to the outer (grey) region. The black region is absorptive and establishes an NA of 0.8 using 500 nm light. Besides providing good contrast for high spatial frequency features, foreground-optimized GCPI filters out much of the background. A similar filtering effect can be achieved simply by blocking the prominent spatial frequencies in the background. In (F) the GCPI filter is modified so that the central (white) region is completely absorbing. This high-pass filter produces significantly less contrast: the color scale in (F) is emphasized by a factor of 50 compared to the color scale in (E).
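The comparison between a phase-shifting and an absorbing central zone can be explored with a small simulation. The sketch below is illustrative only and does not reproduce Fig. 4: the grid size, the zone radii `q_inner` and `q_max`, the synthetic background, the bar-shaped foreground, and all function names are assumptions; only the 0.52π phase shift and the π/10 foreground phase are taken from the text. It builds a two-zone Fourier filter and measures how strongly a weak foreground changes the detected intensity.

```python
import numpy as np

def fourier_filter_image(phase, filt):
    """Detected intensity for a pure phase object behind a Fourier-plane filter."""
    psi = np.exp(1j * phase)
    return np.abs(np.fft.ifft2(filt * np.fft.fft2(psi))) ** 2

def radial_frequency(n):
    """Magnitude of the integer spatial frequency on an n x n FFT grid."""
    q = np.fft.fftfreq(n, d=1.0 / n)
    return np.hypot(*np.meshgrid(q, q, indexing="ij"))

def make_filter(n, q_inner, q_max, mu=None):
    """Two-zone filter: |q| <= q_inner is phase shifted by mu (blocked if mu is None),
    q_inner < |q| <= q_max is transmitted unchanged, |q| > q_max is blocked."""
    qr = radial_frequency(n)
    filt = np.zeros((n, n), dtype=complex)
    filt[qr <= q_max] = 1.0
    filt[qr <= q_inner] = 0.0 if mu is None else np.exp(1j * mu)
    return filt

rng = np.random.default_rng(0)
n = 64
# Hypothetical strong background: low-pass filtered noise scaled to 1.5 rad rms
shaped = np.fft.fft2(rng.standard_normal((n, n))) / (1.0 + radial_frequency(n) ** 2) ** 0.75
noise = np.fft.ifft2(shaped).real
background = 1.5 * noise / noise.std()
# Hypothetical weak foreground: a narrow bar with phase thickness pi/10
foreground = np.zeros((n, n))
foreground[24:40, 30:34] = np.pi / 10

phase_filter = make_filter(n, q_inner=4, q_max=24, mu=0.52 * np.pi)
block_filter = make_filter(n, q_inner=4, q_max=24, mu=None)

def foreground_signal(filt):
    """RMS change in detected intensity caused by adding the foreground."""
    with_fg = fourier_filter_image(background + foreground, filt)
    without_fg = fourier_filter_image(background, filt)
    return np.sqrt(np.mean((with_fg - without_fg) ** 2))

signal_phase = foreground_signal(phase_filter)
signal_block = foreground_signal(block_filter)
```

Comparing `signal_phase` with `signal_block` for different choices of `q_inner` and `mu` gives a rough feel for the trade-off described above. The phase-shifting filter keeps all of the intensity inside the NA, whereas the absorbing version discards the central zone.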
Figure 3: Information Transfer Function (ITF) for various phase imaging schemes for a prior distribution λ induced by a Lorentzian diffraction envelope with unscattered intensity $\Lambda_0 = 0.6$ (left), $\Lambda_0 = 0.2$ (middle), and $\Lambda_0 = 0.1$ (right). The horizontal axis is the magnitude of the spatial frequency in the sample phase. The black vertical line at $|q_a| = q_{max}$ marks the largest spatial frequency in the exit wavefunction allowed through the Fourier plane aperture. The black envelope labeled E is the information limit set by the aperture. The other curves are the ITF for Generalized Zernike Phase Contrast (GZPC, using µ = π/2 for $\Lambda_0 = 0.6$ and µ = π for $\Lambda_0 = 0.2, 0.1$), random sensing (RS), bright field (BF), and Generalized Common Path Interferometry (GCPI) with phase shift µ = π/2 applied to successively larger sets of Fourier coordinates $Q_1$, $Q_2$, and $Q_3$.

Figure 4: Phase contrast imaging simulation of a weak phase object (a 20 µm diameter pinwheel with phase thickness π/10) embedded in a strong phase background with a Lorentzian power spectrum. The background scatters 80% of the λ = 500 nm plane-wave illumination. A: The sample phase shift in radians. B: The detected intensity pattern using ZPC with a π phase shift. C: The detected intensity pattern using random sensing. D: The Fourier-plane filter for GCPI optimized for foreground detection. The black region lies outside the NA. The grey and white regions are completely transmissive and have a relative phase shift of 0.52π rad. A cross section of the expected diffraction intensity Λ is shown with the same vertical scale. E: The detected intensity pattern using the GCPI filter shown in D. F: The detected intensity pattern using the filter shown in D but with the transmissivity of the central region set to 0.

Conclusion
Outside of the WPOA, each spatial frequency in the sample phase affects many spatial frequencies in the intensity at the detector. This non-linearity makes it difficult to design efficient phase imaging transfer optics. We have approached this problem using FI as a rigorous optimization framework and developed an information transfer function to study the properties of various measurements of pure phase objects. As a rule of thumb, the amount of information that can be extracted from a single measurement depends on how well the exit wavefunction can be concentrated into a small subspace using foreknowledge of the sample. GZPC is a family of measurements described by a single parameter which can be optimized for efficient phase imaging if at least 20% of the probe intensity can be refocused. When GZPC is ineffective, a random sensing measurement can be employed without any optimization at the cost of complicating the measurement interpretation. A third option, GCPI, performs at least as well as GZPC and random sensing and is especially effective when specialized to measuring high spatial frequencies or imaging WPOs in a strongly scattering background. It would be straightforward to extend these methods and the ITF to phase objects with finite depth of field and finite absorption, and also to include lens aberrations and limited coherence. The ITF could also be used to characterize an aggregate measurement including multiple modalities (e.g. phase contrast and fluorescence) by summing their individual FIMs.

Acknowledgements
This work was supported by the Gordon and Betty Moore Foundation and the Department of Energy grant DE-SC0019174-00.

A The Phase Grating Basis
We want a spatial frequency parameterization of φ. Since φ is real, we should not use a discrete Fourier basis. Instead, we could use a discrete cosine or sine transform, but these contain 'half frequency' elements. It will be more convenient to parameterize φ in a basis where every element has a simple Fourier representation. Inspired by the real discrete Fourier transform [46]

where A contains a if $q_a = -q_a$ and otherwise contains one of either a or b such that $q_b = -q_a$.

B The Information Transfer Function
The definition of contrast most often used to calculate the CTF is the Michelson contrast, $(I_{max} - I_{min})/(I_{max} + I_{min})$. Suppose the sample is a phase grating with spatial frequency q and amplitude $\delta\theta_q \ll 1$. The CTF can be written

As an example, suppose we use a plane wave probe and the transfer function T applies a phase shift µ(q) to each Fourier component q. Then $I = |T(\psi)|^2 = |F\{F\{\psi\}e^{i\mu}\}|^2$, where F represents the action of a Fourier lens. Using the WPOA, we find the CTF is

In some formulations, the CTF is a signed quantity with negative contrast indicating dark fringes. The CTF is also often normalized so that $|C(q)| \le 1$.

Consider a phase object built from a superposition of phase gratings with amplitudes $\Theta = [\theta_1, \theta_2, ..., \theta_n]$. If we perturb one of these amplitudes by a small amount δθ, how much contrast will that perturbation generate? The answer using the Michelson contrast depends on the intensities measured by only two detector pixels (at the locations of $I_{max}$ and $I_{min}$). Clearly this summary statistic is too coarse-grained to capture the full effect of the perturbation. As an alternative, we can define the CTF using the root-mean-square Weber contrast

where $C_W$ is the Weber contrast, $C_W = (I - I_b)/I_b$, with $I_b = I_{\delta\theta_q = 0}$. These definitions of the CTF are entirely equivalent in the WPOA. Now consider the square of the Weber CTF for small perturbations $\delta\theta_q \to 0$:

where E is the expectation value. The final expression is the definition of the FI for parameter $\theta_q$. We can also show this equivalence by calculating the diagonal elements of the FIM for a WPO using the phase grating basis and unitary transfer function T:

which (apart from the normalization factor 1/n) is the square of the CTF. However, we cannot interpret the diagonal of the FIM as a transfer function outside of the WPOA, as the off-diagonal elements may be important. Nevertheless, it will be useful to formulate a transfer function based on the FIM to help visualize the properties of a particular measurement.
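The link between contrast and Fisher information can be illustrated numerically. The sketch below is a one-dimensional example under stated assumptions, not the calculation from the paper: it uses a unit-amplitude cosine grating (the phase grating basis of Appendix A may be normalized differently, so only the ratio FI/QFI is meaningful), Zernike phase contrast with a phase shift `mu` on the q = 0 component, and a finite-difference derivative. All function names are hypothetical.

```python
import numpy as np

def zernike_probabilities(phase, mu):
    """Detection probabilities for Zernike phase contrast (phase shift mu at q = 0)."""
    n = phase.size
    psi = np.exp(1j * phase) / np.sqrt(n)   # normalized exit wavefunction
    Psi = np.fft.fft(psi, norm="ortho")     # unitary Fourier lens
    Psi[0] *= np.exp(1j * mu)               # phase plate acting on the unscattered beam
    return np.abs(np.fft.ifft(Psi, norm="ortho")) ** 2

def fisher_information(theta, grating, mu, eps=1e-6):
    """Classical FI per probe particle for the grating amplitude theta,
    using a central finite difference for the derivative of the probabilities."""
    p_plus = zernike_probabilities((theta + eps) * grating, mu)
    p_minus = zernike_probabilities((theta - eps) * grating, mu)
    p = zernike_probabilities(theta * grating, mu)
    dp = (p_plus - p_minus) / (2 * eps)
    return np.sum(dp**2 / np.maximum(p, 1e-300))

n = 128
x = np.arange(n) / n
grating = np.cos(2 * np.pi * 7 * x)   # assumed grating frequency of 7 cycles per field

# QFI of a pure state for a phase parameter: 4 * variance of the phase generator
qfi = 4 * (np.mean(grating**2) - np.mean(grating) ** 2)

fi_weak = fisher_information(0.0, grating, mu=np.pi / 2)    # weak phase object limit
fi_strong = fisher_information(2.0, grating, mu=np.pi / 2)  # outside the WPOA
```

In the weak limit the classical FI of this measurement matches the QFI, consistent with Zernike phase contrast being efficient for WPOs. For the strong grating the FI evaluated at `theta = 2.0` cannot exceed the QFI and is in general lower.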
We define the information transfer function (ITF) for a measurement T as the ratio of the maximum variance reduction for parameter $\theta_a$ achievable by T (as determined by the van Trees bound) to the maximum variance reduction for parameter $\theta_a$ allowed for any measurement (as determined by the GQFI). In general, we write the function as $\mathrm{ITF}(|q_a|; \Theta, T)$, assuming Θ is expressed in the phase grating basis and that T and λ respect radial symmetry around q = 0. Explicitly, the ITF is

For large n, $\langle I_a(\Theta, T)\rangle_\lambda \le J_a = 4/n \ll 1$, and if $\sigma_a^2(\lambda) \ll n/4$, we can expand $V^{-1}$ and $Z^{-1}$ in powers of $\sigma_a^2(\lambda)\langle I\rangle_\lambda$ and $\sigma_a^2(\lambda) J$, respectively:

and the ITF becomes

This approximation is accurate, for example, in the WPOA, in which case the ITF is the square of the CTF as shown above.
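The verbal definition of the ITF can be made concrete for a single parameter. The following sketch assumes that the Fisher information matrices and the prior information matrix are all diagonal, so that the van Trees bound reduces to a scalar expression; the function name, the `n_probe` argument (number of probe particles), and the numerical values are illustrative assumptions rather than quantities from the paper.

```python
def itf_diagonal(fi, qfi, prior_var, n_probe=1.0):
    """ITF for one parameter, assuming diagonal information matrices.

    fi        : expected classical Fisher information of the measurement
    qfi       : quantum Fisher information for the same parameter
    prior_var : prior variance of the parameter
    n_probe   : number of probe particles (assumed scaling of the information)
    """
    # Scalar van Trees bound on the variance after the measurement
    v = 1.0 / (n_probe * fi + 1.0 / prior_var)
    # Same bound with the measurement FI replaced by the QFI
    z = 1.0 / (n_probe * qfi + 1.0 / prior_var)
    # Ratio of achievable to allowed variance reduction
    return (prior_var - v) / (prior_var - z)

n = 256
qfi = 4.0 / n        # per-parameter QFI quoted in the text for large n
fi = 0.3 * qfi       # hypothetical measurement capturing 30% of the QFI

itf_tight_prior = itf_diagonal(fi, qfi, prior_var=1e-3)
itf_loose_prior = itf_diagonal(fi, qfi, prior_var=1e6)
```

In this scalar picture the ITF approaches the ratio `fi / qfi` when the prior variance is small compared with n/4, which mirrors the expansion described above. For a very broad prior the ratio moves toward one, because both bounds then remove most of the prior variance.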

C Fisher Information for WPO in a Strong Background
The standard formulation of the cost function measures the expected average variance. An optimized measurement will prioritize sensitivity to the parameters with the largest prior variances. Suppose the sample consists of a WPO (the foreground) embedded in a strongly scattering, unknown background. Let parameter vector $\Theta_f = [\theta_{f;0}, \theta_{f;1}, ...]$ with prior distribution $\lambda_f(\Theta_f)$ describe the foreground and $\Theta_b = [\theta_{b;0}, \theta_{b;1}, ...]$ with prior distribution $\lambda_b(\Theta_b)$ describe the background, so the total transmission function is

We cannot separately measure $\theta_{f;a}$ and $\theta_{b;a}$, but we can adjust the cost function to specifically reward reduction of the foreground variance. Let $\lambda_{tot}$ be the prior distribution for $\Theta_{tot} = [\Theta_f, \Theta_b]$. The covariance matrix for the estimator of the combined parameter vector, $\langle\Sigma_{\Theta_{tot}}\rangle_{\lambda_{tot}}$, is constrained by the van Trees bound

The cost function is equivalent to the standard cost function when

but can be specialized to foreground variance reduction using

We can write $\langle I(\Theta_{tot})\rangle_{\lambda_{tot}}$ as a 2 × 2 block diagonal matrix, where each block is $\langle I(\Theta)\rangle_{\lambda_{tot}}$. We will also write $I(\lambda_{tot})$ in block form

$$I(\lambda_{tot}) = \begin{pmatrix} I(\lambda_{tot})_{11} & I(\lambda_{tot})_{12} \\ I(\lambda_{tot})_{12} & I(\lambda_{tot})_{22} \end{pmatrix} \qquad (39)$$

so that the right hand side in Eq. 35 is

$N\langle I(\Theta)\rangle_{\lambda_{tot}}$) then

As an example, suppose $\lambda_f$ is independently and identically distributed for each of the parameters so that the prior information matrix is $I(\lambda_f) = \frac{1}{\sigma^2}\mathbb{I}$. Also suppose $W = \mathbb{I}$ and $\lambda_b(\Theta_b) = \prod_{a=0}^{n-1}\lambda_a(\theta_a)$, where each $\lambda_a$ is normal with zero mean and variance $\sigma_a^2$ N $\sigma^2$.
This cost function has a strong preference for measuring parameters with small σ a , where the foreground is more 'visible' despite the background. For comparison, if we use the weighting in Eq. 37 we get the standard cost which gives priority to increasing I a (Θ) λtot for parameters with large σ a .

D Projective Measurements in the Bayesian Regime
In multi-phase estimation with n = 2 phases, the phase difference φ 2 − φ 1 can be optimally measured without a reference channel or any prior knowledge of the phases using a 50-50 beam splitter. For n > 2 pixels, the phase differences between neighboring channels can be measured using a series of beam splitters and 2n−1 detectors. This measurement is impractical for phase imaging, where n is large and space is limited. Instead, we will consider only projective measurements which can be represented by a unitary matrix T of rank n. Here we give an informal argument that projective measurements generally cannot achieve the QFIM in the Bayesian regime. Using Eq. 5 we can write the FI for parameter θ a , where γ j = arg{T (ψ) j } and γ − γ j ) = 1 for all j where |T (∂ a ψ) j | 2 > 0. This is possible only if Λ = |U ψ| 2 has no overlap with Λ (a) = |U ∂ a ψ| 2 , and the QFIM can only be achieved if this condition is met for all values of a. Λ can be thought of as the reference component and Λ (a) as a signal component. Achieving the QFIM requires the reference to be completely isolated from the signal. It takes n − 1 channels to carry information about n − 1 independent parameters. This limits Λ to a single channel. In the Bayesian regime, we will not have sufficient prior information to find a measurement basis where Γ occupies a single channel.

F Details of Numerical Calculations
In order to optimize GCPI, the value of µ and the membership of Q must be jointly optimized based on the prior distribution λ. When λ is induced by an expected intensity pattern Λ, we sample from λ using the Gerchberg-Saxton algorithm with a uniform random initial phase distribution. In order to determine I(λ), we estimate the covariance matrix for λ, $\Sigma_\lambda$, then set $I(\lambda) = \Sigma_\lambda^{-1}$. The number of possible sets Q is combinatorially large. In order to reduce the complexity of optimizing Q, we estimate the value $V_q$ of including q ∈ Q, then set $Q = \{q \mid V_q \ge V^*\}$ and optimize the threshold value $V^*$. The estimated value will depend on the cost function. To minimize the weighted average of the expected variance using the cost function from Eq. 7, we use

where $\Delta^{(a)}_{max}$ is the maximum variance reduction allowed by the GQFI for $\theta_a$. The denominator is the weighted average of the signal components expected to be carried by eigenvector q, and estimates the opportunity cost of losing sensitivity to q. When optimizing for foreground variance reduction using the cost function from Eq. 11, we use

which has higher value when $\Delta^{(a)}_{max}$, the potential reduction in the background variance, is smaller. In many cases, especially when $\Lambda_q$ decreases monotonically with q, the same value ranking is obtained simply using $V_q = \Lambda_q$. The optimization proceeds by alternating between minimizing the cost with respect to µ using Matlab's fminbnd (with π/2 < µ < π) and minimizing with respect to $V^*$.

Figure: Left: Expected Fisher Information $\langle I\rangle_\lambda$ for Zernike phase contrast with a prior λ which is an independent Gaussian distribution with variance $\sigma^2$ for each phase. The maximum FI and the ideal Zernike phase shift, µ, depend on the unscattered intensity $\Lambda_0 = e^{-\sigma^2}$. The dashed black line is $4\Lambda_0$, which is a good approximation for $\langle I\rangle_\lambda$ when $\Lambda_0 > 0.8$. Right: Phasor diagrams which show the action of the transfer optics on the exit wavefunction (represented by the black unit circle). The cyan circle represents the possible values of the wavefunction at the detector, and the red portion represents the probability distribution of the wavefunction. The black and cyan vectors have length $\Lambda_0$ and relative angle µ. For $0.25 < \Lambda_0 < 0.5$, the optimal µ causes the cyan circle to pass through the origin. For $\Lambda_0 > 0.5$ and $\Lambda_0 < 0.25$, the optimal values for µ are π/2 and π, respectively.
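The sampling step described in this section, drawing phase objects whose diffraction pattern follows an expected intensity Λ, can be sketched with a basic Gerchberg-Saxton loop. This is a minimal illustration and not the code used for the paper: the grid size, iteration count, frequency units, and function names are assumptions, and the Lorentz envelope follows the 2D Lorentz form quoted in the main text. Because a pure phase object cannot match an arbitrary Λ exactly, the achieved diffraction pattern only approximates the target.

```python
import numpy as np

def lorentz_envelope(n, Lambda0):
    """2D Lorentz diffraction envelope on an n x n FFT grid (integer frequency units assumed)."""
    q = np.fft.fftfreq(n, d=1.0 / n)
    q2 = np.add.outer(q**2, q**2)
    w = (2 * np.pi * Lambda0) ** -0.5
    return w / (2 * np.pi * (q2 + w**2) ** 1.5)

def gerchberg_saxton_sample(Lambda, n_iter=200, rng=None):
    """Draw one sample phase map whose diffraction intensity approximates Lambda."""
    rng = np.random.default_rng() if rng is None else rng
    # Target Fourier-plane amplitude, renormalized to unit total intensity
    target_amp = np.sqrt(Lambda / Lambda.sum())
    n_pix = Lambda.size
    # Uniform random initial phase distribution
    phase = rng.uniform(-np.pi, np.pi, size=Lambda.shape)
    for _ in range(n_iter):
        # Real-space constraint: pure phase object with unit total intensity
        psi = np.exp(1j * phase) / np.sqrt(n_pix)
        Psi = np.fft.fft2(psi, norm="ortho")
        # Fourier-space constraint: keep the phase, impose the target amplitude
        Psi = target_amp * np.exp(1j * np.angle(Psi))
        phase = np.angle(np.fft.ifft2(Psi, norm="ortho"))
    return phase

Lambda = lorentz_envelope(16, Lambda0=0.2)
sample_a = gerchberg_saxton_sample(Lambda, rng=np.random.default_rng(1))
sample_b = gerchberg_saxton_sample(Lambda, rng=np.random.default_rng(1))
```

Repeating the call with different random generators yields an ensemble of phase maps from which a covariance matrix could be estimated, for example after expressing each sample in the chosen parameter basis.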