Conditional Injective Flows for Bayesian Imaging

Most deep learning models for computational imaging regress a single reconstructed image. In practice, however, ill-posedness, nonlinearity, model mismatch, and noise often conspire to make such point estimates misleading or insufficient. The Bayesian approach models images and (noisy) measurements as jointly distributed random vectors and aims to approximate the posterior distribution of the unknowns. Recent variational inference methods based on conditional normalizing flows are a promising alternative to traditional MCMC methods, but they come with drawbacks: excessive memory and compute demands for moderate- to high-resolution images and underwhelming performance on hard nonlinear problems. In this work, we propose C-Trumpets, conditional injective flows specifically designed for imaging problems, which greatly diminish these challenges. Injectivity reduces the memory footprint and training time, while the low-dimensional latent space, together with architectural innovations such as fixed-volume-change layers and skip-connection revnet layers, lets C-Trumpets outperform regular conditional flow models on a variety of imaging and image restoration tasks, including limited-view CT and nonlinear inverse scattering, at a lower compute and memory budget. C-Trumpets enable fast approximation of point estimates such as the MMSE or MAP, as well as physically meaningful uncertainty quantification.


I. INTRODUCTION
In Bayesian modeling of computational imaging problems, we assume that the (unknown) object of interest x and the observed measurements y are realizations of random vectors X ∈ X and Y ∈ Y with a joint distribution p_{X,Y}. This general model includes the common setting of a deterministic forward operator and additive Gaussian noise, where Y | X ∼ N(A(X), σ²I), as well as other relevant models such as Y = Poisson(A(X), λ) or even random or uncertain forward operators A.
Common machine-learning approaches to ill-posed inverse problems yield point estimates, that is, they output a single reconstruction. For example, training a deep neural network f_θ with the mean-squared error (MSE) loss E‖X − f_θ(Y)‖² approximates the minimum-mean-squared-error (MMSE) estimator¹ E[X|Y] (the posterior mean) [1].

¹ Often called the regression function in machine learning and statistics.

Fig. 1. The column x shows the ground truth, the column y the pseudoinverse of the forward operator applied to the measurements. Conditional injective flows provide MMSE, surrogate MAP (Section III-C) and physically meaningful uncertainty quantification (UQ). Given no sensors in the top regions of the image, the trained C-Trumpet assigns higher uncertainty to the top half of the domain. While both standard point estimates look close to "9", many posterior samples look like "4"s, which indicates the significance of posterior sampling for ill-posed inverse problems.
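As a quick numerical illustration of this fact, a minimal sketch (the tiny discrete joint distribution below is hypothetical toy data, not from the paper): among all estimators f(y), the posterior mean minimizes the expected squared error.

```python
# Toy check that the MSE-optimal estimator is the posterior mean E[X|Y].
# The joint distribution over (x, y) pairs is illustrative toy data.

joint = {  # p(x, y) on a tiny discrete grid
    (0.0, 0): 0.2, (1.0, 0): 0.2,   # given y=0, X is 0 or 1 -> E[X|Y=0] = 0.5
    (2.0, 1): 0.3, (4.0, 1): 0.3,   # given y=1, X is 2 or 4 -> E[X|Y=1] = 3.0
}

def posterior_mean(y):
    num = sum(p * x for (x, yy), p in joint.items() if yy == y)
    den = sum(p for (x, yy), p in joint.items() if yy == y)
    return num / den

def expected_mse(f):
    return sum(p * (x - f(y)) ** 2 for (x, y), p in joint.items())

mmse = expected_mse(posterior_mean)
# Any perturbed estimator does no better than the posterior mean:
for eps in (-0.5, -0.1, 0.1, 0.5):
    assert expected_mse(lambda y: posterior_mean(y) + eps) >= mmse
```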
In many situations, however, a single point estimate can be misleading or incomplete. For example, in radio interferometric imaging which aims to reconstruct astronomical images from radio telescope measurements, there can be multiple solutions that fit the observed measurements; a now-famous example is the imaging of a black hole [2]. This can happen for a variety of reasons, all stemming from the ill-posedness of the imaging problem. The posterior may be multimodal, in which case the MMSE estimator blends the modes together and maximum a posteriori estimate (MAP) (or posterior mode) arg max x p X|Y (x|y), returns only one of the many modes. Even when the posterior is unimodal, providing a point estimate when measurements have low signal-to-noise ratio does not convey the amount of uncertainty in the estimate, thus requiring cautious interpretation. As a remedy, uncertainty quantification (UQ) on top of a reconstructed image can greatly help medical professionals make more informed decisions or order additional measurements in uncertain regions [3].
In Figure 1 we use a toy problem to illustrate how access to the posterior can help in the presence of multiple modes. Both standard point estimates look close to a "9", but many posterior samples look like "4"s. While innocuous on MNIST [4], such failure modes come with risks in medical imaging.
Another approach is then to characterize the posterior p_{X|Y} upon observing y. Given the joint distribution p_{X,Y}, computing the posterior p_{X|Y} generally involves intractable high-dimensional integrations. A standard way to circumvent this is by sampling methods such as Markov chain Monte Carlo (MCMC) [5], [6]. These methods work well in low-dimensional problems but become computationally intensive when used in high-dimensional imaging tasks due to the need to compute the forward operator many times [7]. An alternative is to perform variational inference. This requires defining a tractable, parameterized family of distributions Q and choosing a q ∈ Q that is "close" to p_{X|Y}.
In this paper, we define a new class of approximate posteriors Q using injective deep neural networks, suitable for imaging problems. We build on the recently-proposed injective flows [8], [9] and conditional coupling layers [10]. Injective flows map a low-dimensional latent space to a high-dimensional image space using a sequence of injective functions whose inverses (on the range) can be computed efficiently and exactly. They combine the favorable aspects of GANs [11] and invertible normalizing flows [12]- [14] in a way especially well-suited for imaging problems. GANs provide good samples and can be trained with a reasonable compute budget but lack an exact, fast inverse and access to (approximate) likelihoods. Normalizing flows have fast inverses and enable fast and exact likelihood computation, but they lack a low-dimensional latent space thereby requiring large memory and compute budget when training at higher image resolutions [15]. Injective flows have a low-dimensional latent space with a fast and exact inverse on the range while maintaining a low compute and memory budget.
Conditional normalizing flows [10] inherit the favorable properties of their non-conditional counterparts-easy access to the likelihoods and inverses of generated samples. They were applied to inverse problems [16], [17] where they enable posterior estimation, efficient sampling and even uncertainty quantification. In this work, we propose conditional injective flows termed C-Trumpets. While injective flows model image datasets supported on low-dimensional manifolds, the range of a C-Trumpet is a (potentially) different low dimensional manifold of posterior samples for each measurement. As we show in Section IV, this makes them an effective model for data distributions supported on fiber bundles with the base space corresponding to the space of measurements Y, and in particular effective models of posterior distributions in imaging inverse problems. Moreover, thanks to a low-dimensional latent space, they can be trained for high-dimensional data (256×256) using a single GPU while training conditional bijective flows at this resolution requires significantly more memory.
While generating approximate posterior samples is important, many applications still call for Bayesian point estimates such as the MAP or the MMSE estimator, and it is convenient if those can be computed with the same model. While it is clear (at least conceptually) how to do it for the MMSE estimator (generate many samples and average them), the MAP estimator is more elusive. There are not many deep learning approaches to imaging which compute (or approximate) the MAP estimator, since the corresponding training loss (purely formally) is the highly irregular δ(x − x') as opposed to the "nice" ‖x − x'‖² [18]. (A notable exception is amortized MAP for image super-resolution [19], although it is limited to noiseless linear low-rank projections.) We could in principle obtain a MAP estimate from C-Trumpets via iterative maximization; however, that is slow and not guaranteed to converge. We thus propose a modification of coupling layers which results in a fixed volume change with respect to the input. We use this newly designed layer to efficiently approximate MAP estimators without the need to run an iterative process or evaluate the forward operator.
Our main contributions can be summarized as follows:
• We define a class of deep learning models for amortized variational inference called C-Trumpets, with a smaller memory and compute footprint compared to bijective flows; C-Trumpets can be trained on 256 × 256 images on a single V100 GPU in a day.
• The new flows include bespoke architectural innovations: fixed-volume-change layers provide efficient MAP estimates without iterative optimization, while skip connections improve the quality of the generated samples and the uncertainty quantification.
• We show that C-Trumpets outperform conditional bijective flows in solving computational imaging problems including nonlinear electromagnetic scattering, limited-view CT and seismic travel-time tomography; C-Trumpets produce much better posterior samples and uncertainty estimates that are consistent with the physics of the forward operators in various inverse problems.
• While standard injective flows parameterize manifolds and are thus a natural (regularizing) choice when we believe the manifold assumption holds, we show that conditional injective flows can parameterize fiber bundles [20].

II. VARIATIONAL BAYESIAN INFERENCE
In this section we work formally and assume that all probability measures have a density; this allows us to simply present the main ideas. A training strategy suitable for distributions supported on low-dimensional manifolds is detailed in Section III.
We consider random vectors X ∈ X (representing the unknown object) and Y ∈ Y (representing the observed noisy measurements). The posterior distribution p_{X|Y} can be expressed via Bayes' rule as

p_{X|Y}(x | y) = p_{X,Y}(x, y) / ∫ p_{X,Y}(x', y) dx'.

In high-dimensional imaging applications, calculating the normalizing integral ∫ p_{X,Y}(x, y) dx is intractable. Moreover, the prior distribution p_X is unknown and needs to be estimated [21], [22]. MCMC-based sampling methods do not require access to ∫ p_{X,Y}(x, y) dx, but they are slow if one desires likely posterior samples [7], [23]. An alternative to MCMC is to perform variational inference [24], [25]: we define a parameterized class of distributions

Q = { q_θ ∈ P(X) : θ ∈ Θ },

where P(X) is the space of probability distributions over X, and we search for θ such that q_θ is close to p_{X|Y}.
The remaining ingredients are a measure of "closeness" and a fitting algorithm. In standard variational inference it is common to use the Kullback-Leibler (KL) divergence as a measure of fit,

D_KL(q ‖ p) = E_{x∼q} [ log q(x) − log p(x) ].

Given a measurement y, we choose q_θ by solving either

θ*(y) = arg min_θ D_KL( q_θ ‖ p_{X|Y=y} )    or    θ*(y) = arg min_θ D_KL( p_{X|Y=y} ‖ q_θ ),

respectively called reverse and forward KL minimization [6], [26]. The two estimates are in general different due to the asymmetry of the KL divergence.
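The practical consequence of this asymmetry can be checked numerically. A minimal sketch (all densities and parameters below are illustrative, not from the paper): for a bimodal target p, the forward KL prefers a mode-covering Gaussian while the reverse KL prefers a mode-seeking one.

```python
import math

def gauss(x, mu, s):
    return math.exp(-(x - mu) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def p(x):  # bimodal target: mixture of two well-separated Gaussians
    return 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)

def kl(f, g, lo=-10, hi=10, n=4000):  # numerical KL(f || g) on a grid
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        fx = f(x)
        if fx > 1e-300:
            total += fx * math.log(fx / g(x)) * h
    return total

q_wide = lambda x: gauss(x, 0, 3.0)      # covers both modes
q_narrow = lambda x: gauss(x, 3, 0.5)    # sits on one mode only

# Forward KL(p||q) favors the mode-covering candidate ...
assert kl(p, q_wide) < kl(p, q_narrow)
# ... while reverse KL(q||p) favors the mode-seeking candidate.
assert kl(q_narrow, p) < kl(q_wide, p)
```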
Minimizing the reverse KL requires computing the expectation E_{X∼q_θ} log p_{Y|X}(y|X) p_X(X). While p_{Y|X} is usually assumed known in imaging problems with a known forward operator (in (2) it corresponds to additive Gaussian noise), the prior distribution p_X is unknown. Recently, normalizing flows were used to estimate p_X to then allow for downstream minimization of the reverse KL divergence [2], [9], [27]. On the other hand, minimizing the forward KL does not require access to the prior distribution, the noise model or the forward operator [28], [29].
Computing θ*(y) in both the forward and reverse KL formulations involves solving a separate optimization problem for every new measurement y. The reason is that q_θ ∈ Q is a function of x alone, not of x and y, and we hope that for each y,

q_{θ*(y)} ≈ p_{X|Y=y}.

This separate optimization for every y may be inefficient if implemented via standard iterative solvers.
We could, however, work directly with a family of conditional distributions q θ (x|y) which depend on both x and y, where for each y, q θ (·|y) ∈ P(X ) is a probability distribution over X .
We can now compute a conditional variational approximator q_θ(x|y) by minimizing the average KL divergence over all measurements y; this procedure is known as amortized inference. It leads to the following optimization problem:

min_θ E_{Y∼p_Y} [ D_KL( p_{X|Y} ‖ q_θ(· | Y) ) ] = min_θ E_{(X,Y)∼p_{X,Y}} [ −log q_θ(X | Y) ] + const.

The key observation is that the population expectation over p_{X,Y} can now be approximated by an empirical expectation over the training data,

min_θ −(1/N) Σ_{i=1}^{N} log q_θ(x^(i) | y^(i)).

The question that remains is: how do we parameterize q_θ(x|y) so that we can 1) learn θ* efficiently from data, 2) easily obtain conditional samples from q_{θ*}(x|y) ≈ p_{X|Y}(x|y) for a given y, and 3) efficiently compute standard point estimators such as the MAP estimator? We answer this question in the remainder of the paper by describing conditional injective flows called C-Trumpets.
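A minimal sketch of amortized forward-KL fitting (the conditional family q_θ(x|y) = N(θy, 1) and the synthetic linear data are hypothetical stand-ins, far simpler than a conditional flow): minimizing the empirical conditional negative log-likelihood over training pairs recovers the data-generating parameter.

```python
import math, random

# Amortized forward-KL fit: minimize the empirical conditional NLL over
# training pairs (x, y). Here q_theta(x|y) = N(theta*y, 1), a toy family;
# the data follow x = 2*y + noise, so the fitted theta should approach 2.
random.seed(0)
pairs = [(2.0 * y + random.gauss(0, 0.1), y)
         for y in [random.uniform(-1, 1) for _ in range(500)]]

def nll(theta):  # average conditional NLL up to an additive constant
    return sum(0.5 * (x - theta * y) ** 2 for x, y in pairs) / len(pairs)

# Closed-form minimizer of the quadratic NLL (ordinary least squares):
theta_hat = sum(x * y for x, y in pairs) / sum(y * y for x, y in pairs)

assert abs(theta_hat - 2.0) < 0.1
assert nll(theta_hat) <= min(nll(theta_hat + d) for d in (-0.2, 0.2))
```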
In a nutshell, we will define a family of neural networks f_θ(Z; y) where y is the conditioning input. The first argument, Z, will be a standard Gaussian random vector over a low-dimensional latent space, and the parameter θ adjusted so that for each y, the random vector f_θ(Z; y) is close in distribution to X | Y = y. In other words, denoting the standard Gaussian distribution by p_Z, we will obtain q_θ(· | y) as a pushforward of p_Z via f_θ(·; y),²

q_θ(· | y) = f_θ(· ; y)_# p_Z.

The approximate posterior samples can then be obtained in a standard way as f_θ(Z; y).

III. C-TRUMPETS: CONDITIONAL INJECTIVE FLOWS
C-Trumpets (Figure 2) are conditional injective normalizing flows that map a low-dimensional latent space to a high-dimensional data space using an injective transformation for each conditioning sample. Injectivity guarantees that for each conditioning sample the range of the network is a manifold. Moreover, the efficiently-invertible layers facilitate training and inference. As shown in Figure 2, the proposed model has two subnetworks: an injective generator g_γ that maps a low-dimensional space Z = R^d to the data space X = R^D, d ≪ D, and a bijective mapping h_η : Z → Z that maintains dimensionality. The end-to-end mapping is then given as f_θ(z; y) = g_γ(h_η(z; y); y), with learnable parameters θ = (γ, η).
C-Trumpets are inspired by non-conditional injective flows [9] called Trumpets. They comprise a sequence of injective [9] and bijective [14] revnet blocks. Each block consists of three components: activation normalization, an (injective or bijective) 1 × 1 convolution, and an affine coupling layer. It is worth noting that all these components are non-conditional; we will later use conditional affine coupling layers to make the generation process conditional.
1) Activation normalization: a learnable per-channel scale and bias.
2) 1 × 1 convolution with a kernel w, which is simply a matrix multiplication along the channel dimension:
a) Bijective version: x ↦ wx, inverted by x = w^{-1}y;
b) Injective version: x ↦ wx, inverted on the range by x = w†y, where w ∈ R^{c_in×c_out} and w† is the pseudo-inverse of w (a non-square matrix in the injective, dimension-expanding case).
We use an LU decomposition for matrix inversion, which significantly reduces the training time of the injective part of C-Trumpets (see Section S-I for more details).
3) Affine coupling layer, with the forward map splitting the input x into halves (x_1, x_2) along the channel dimension and computing

y_1 = x_1,    y_2 = x_2 ⊙ exp(s(x_1)) + b(x_1).

The mappings s and b are respectively the scale and the shift networks.
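The injective 1 × 1 convolution can be sketched in a few lines of pure Python (the 3 × 2 kernel values are illustrative): with a full-column-rank kernel, the left pseudo-inverse w† = (wᵀw)⁻¹wᵀ recovers the input exactly on the range.

```python
# Hand-rolled left pseudo-inverse for a channel-expanding 1x1 convolution.
# w maps c_in = 2 channels to c_out = 3; w_dagger = (w^T w)^(-1) w^T recovers
# the input exactly on the range. Kernel values are illustrative.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):  # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

w = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]]        # 3x2, full column rank
w_dagger = matmul(inv2(matmul(transpose(w), w)), transpose(w))  # 2x3

x = [[0.7], [-1.3]]                # a 2-channel "pixel"
y = matmul(w, x)                   # forward: expand to 3 channels
x_rec = matmul(w_dagger, y)        # inverse on the range

assert all(abs(x_rec[i][0] - x[i][0]) < 1e-9 for i in range(2))
```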
In order to make the generative process conditional, we adapt the conditional affine coupling layers proposed in [10]. Conditional affine coupling layers keep the advantages of a regular flow model (fast inverses and tractable Jacobian computations) while benefiting from the conditioning framework.
Since the scale s( · ) and shift b( · ) networks are never inverted, we can concatenate the features of the conditioning sample y to their input without losing invertibility and tractable log-det Jacobian computation. Accordingly, s( · ) and b( · ) are replaced by s( ·, c_ϕ(y)) and b( ·, c_ϕ(y)), where c_ϕ( · ) represents the conditioning network that extracts appropriate features from y. We deploy conditional affine coupling layers in both injective and bijective subnetworks of C-Trumpets.
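A minimal sketch of a conditional affine coupling layer (the scale and shift "networks" below are hypothetical closed-form stand-ins): because s and b only ever see the untouched half and the conditioning features, the layer inverts exactly without inverting s or b.

```python
import math

# Toy conditional affine coupling layer: split x into (x1, x2); x2 is
# scaled/shifted by "networks" that see x1 AND a conditioning feature c.
# Both s_net and b_net are illustrative stand-ins for learned networks.

def s_net(x1, c):
    return [math.tanh(a + c) for a in x1]

def b_net(x1, c):
    return [0.5 * a - c for a in x1]

def forward(x, c):
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    s, b = s_net(x1, c), b_net(x1, c)
    y2 = [xi * math.exp(si) + bi for xi, si, bi in zip(x2, s, b)]
    logdet = sum(s)  # log|det J| of the coupling transform
    return x1 + y2, logdet

def inverse(y, c):
    h = len(y) // 2
    y1, y2 = y[:h], y[h:]
    s, b = s_net(y1, c), b_net(y1, c)   # recomputed from the untouched half
    x2 = [(yi - bi) * math.exp(-si) for yi, si, bi in zip(y2, s, b)]
    return y1 + x2

x, c = [0.3, -1.1, 2.0, 0.4], 0.8
y, _ = forward(x, c)
x_rec = inverse(y, c)
assert all(abs(u - v) < 1e-9 for u, v in zip(x, x_rec))
```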

A. Conditioning network
The role of the conditioning network c ϕ ( · ) is to extract features from conditioning samples y to be used by the affine coupling layers. The architecture of the conditioning network depends on the nature of the conditioning samples. When y is structured as an image, we use convolutional layers; when it is an unstructured 1D vector, we use fully connected layers, as for example in class-based image generation (see Section S-III-A in the supplemental materials) where the conditioning data are one-hot class encodings. The output dimension of the conditioning network is set to match the input dimension of the scale and shift modules of the coupling layers. The weights of the conditioning networks are trained jointly with the remaining parameters in C-Trumpets via back-propagation using paired training data. Further details about conditioning networks in all numerical experiments are given in Sections S-II-B and S-III-A in the supplemental materials.

B. Skip connections
The fact that we use expanding revnet layers allows us to augment the architecture with elements that are not compatible with the standard injective layers. The conditioning samples often have a pixel-wise dependency on target signals. For example, in image inpainting with a fixed mask location, the out-of-mask pixels should simply be forwarded to the output. In order not to re-learn these features, we introduce skip connections after every revnet block between the measurements y and the different resolution levels of the injective subnetwork of C-Trumpets,

x_out = rev(x_in ; y) + S ⊙ y,

where rev( · ) is the conditional revnet block, y is resized to match the resolution of the block output, and S is a learnable matrix with entries between 0 and 1; S adjusts the amount of direct mixing of the measurements into the generated posterior sample. We empirically find that skip connections help the injective part of C-Trumpets converge faster and yield better reconstructions in several imaging problems.

C. Fast MAP estimation via fixed volume-change layers
MAP estimation traditionally requires an iterative solution of a maximization problem. Deep neural networks for image reconstruction are traditionally trained with an ℓ_p loss (p = 2 giving the ubiquitous MSE loss and ultimately an approximation of the MMSE estimator E[X|Y]). There are, however, few attempts at using deep neural networks to approximate the MAP estimate in imaging, possibly because the associated "loss" would be the tempered distribution δ(x − x') [18]. While the MAP and the MMSE estimates coincide when X and Y are jointly Gaussian, they are in general different. For posteriors supported on a low-dimensional manifold, the MMSE estimate will generally not lie on the manifold. (We show a qualitative manifestation of this effect in Section V-C.) A notable deep-learning approach to amortized MAP inference for image super-resolution has been proposed by Sønderby et al. [19]. However, their method is only applicable to super-resolution in the ideal noise-free scenario. In this section, we propose a new variant of affine coupling layers, which enables us to obtain the MAP estimate instantaneously with a single forward pass of the trained network. This method is exact when used with bijective flows and approximate for injective flows, where it computes the MAP estimate of the pre-image samples z in Figure 2.
Consider a trained conditional normalizing flow model x = f(z; y). The MAP estimate can be obtained by solving

x_MAP = arg max_x log p_{X|Y}(x | y) = arg max_x [ log p_Z(z) − log |det J_f(z; y)| ],    (10)

where z = f^{-1}(x; y). In principle, we can run an iterative maximization process to compute x_MAP. In general, this may be slow and it is not guaranteed to converge, even with multiple random restarts.
Let us analyze (10) more closely. While the first term, log p_Z(z), has the highest value at z = µ_z (the mean of the Gaussian), the second term has three components: activation normalization, 1 × 1 convolution, and the conditional affine coupling layer. The first two components are linear layers with data-independent Jacobians; their log-dets are thus constant with respect to x and can be omitted from (10). The log-det of the Jacobian of a conditional affine coupling layer is

log |det J| = Σ_i s_i( · ),

where s_i( · ) is the i-th element of the output of the scale network. This term is in general data-dependent. In order to make it data-independent, we propose to use the following activation for the scale network,

s( · ) = softmax(m( · )),

where m( · ) is an arbitrary neural network and softmax(x) is defined as

softmax(x)_i = exp(x_i) / Σ_j exp(x_j).

Then we have

Σ_i s_i( · ) = 1,

a constant independent of the input. The newly proposed layer has a data-independent log-det Jacobian, without losing expressivity, as verified empirically in Sections V-A and V-B. Now all terms in (10) except log p_Z(z) are independent of x, so that

x_MAP = f(µ_z ; y),

which is to say that the MAP estimate is obtained simply by feeding the mean of the Gaussian base distribution into the (bijective) flow.
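The fixed-volume property is easy to verify numerically: with the softmax activation, the coupling log-det Σ_i s_i equals 1 for every input (m_net below is an arbitrary stand-in for the scale network m).

```python
import math

def softmax(v):
    mx = max(v)                       # shift for numerical stability
    e = [math.exp(a - mx) for a in v]
    z = sum(e)
    return [a / z for a in e]

def m_net(x1):  # arbitrary stand-in for the scale network m(.)
    return [math.sin(3 * a) + a * a for a in x1]

# The coupling log-det sum_i s_i is constant (= 1) regardless of the input:
for x1 in ([0.0, 1.0, -2.0], [5.0, 5.0, 5.0], [-9.0, 0.3, 7.7]):
    s = softmax(m_net(x1))
    assert abs(sum(s) - 1.0) < 1e-12
```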
While the proposed technique is exact for bijective conditional flows, the log-det term for conditional injective flows is given as

(1/2) log det ( J_{g_γ}(z; y)^T J_{g_γ}(z; y) ),    (15)

and equation (15) cannot be decomposed into a sum of the log-dets of its constituent components. We can nevertheless use this technique to obtain a MAP estimate in the intermediate z-space as in Figure 2. Therefore, we obtain a surrogate of the end-to-end MAP estimate, call it the surrogate MAP, and denote it g†-MAP.
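A one-dimensional sketch of why feeding µ_z through the flow yields the MAP for the bijective case (the affine flow f(z) = az + b is a hypothetical stand-in with a fixed volume change |a|): the pushforward density peaks exactly at f(0).

```python
import math

# With a fixed-volume-change flow, the MAP estimate is just f(mu_z):
# for f(z) = a*z + b (|a| fixed), p_X(x) = p_Z((x - b)/a) / |a| peaks
# at x = b = f(0). Constants a, b are illustrative.
a, b = 1.7, -0.4

def f(z):
    return a * z + b

def p_z(z):  # standard Gaussian base density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def p_x(x):  # pushforward density via the change of variables
    return p_z((x - b) / a) / abs(a)

grid = [i / 1000 for i in range(-5000, 5001)]
x_map = max(grid, key=p_x)           # brute-force argmax over a fine grid
assert abs(x_map - f(0.0)) < 1e-3    # coincides with f(mu_z), mu_z = 0
```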

D. Training strategy
Thanks to simple, tractable inverses and log-dets of layer Jacobians, the parameters of normalizing flows can be fitted via maximum likelihood (ML). However, as C-Trumpets have a low-dimensional latent space, the likelihood is only defined on the range of the injective generator, and the training data generally does not live in this range. We thus split the training of C-Trumpets, f_θ(z; y) = g_γ(h_η(z; y); y), into two phases, following the method of Brehmer and Cranmer [8] for non-conditional injective flows: (1) the MSE training phase, where we only fit the trainable parameters γ of the injective network with the goal of adjusting the range of f_θ to contain the training data, and (2) the ML training phase, where we optimize the trainable parameters η of the low-dimensional bijective part, maximizing the likelihood of the pre-image (through g_γ) of the training data.³ In the first phase, we optimize γ by minimizing

Σ_{i=1}^{N} ‖ x^(i) − P_{g_γ}(x^(i); y^(i)) ‖²,

where {(x^(i), y^(i))}_{i=1}^{N} are the training samples and g† is the layer-wise inverse of the injective subnetwork; therefore, the projection operator onto the range of g_γ( · ; y) is given as P_{g_γ}(x; y) := g_γ(g†_γ(x; y); y). After training the injective network for a fixed number of epochs, the range of the network approximately contains the training data.
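The projection P_{g_γ} used in this phase can be sketched with a linear toy decoder standing in for g_γ (all matrix values hypothetical): composing the decoder with its left pseudo-inverse gives an idempotent projection onto the range.

```python
# Projection onto the range of a linear injective "decoder" g(z) = W z,
# a stand-in for g_gamma; the left pseudo-inverse W_dag plays the role of
# the layer-wise inverse g_dagger. P(x) = g(g_dagger(x)) is idempotent.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):  # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

W = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]                 # 3x2, injective
W_dag = matmul(inv2(matmul(transpose(W), W)), transpose(W))

def project(x):  # x is a 3x1 column vector in "data space"
    return matmul(W, matmul(W_dag, x))

x = [[1.0], [2.0], [3.0]]
p1 = project(x)
p2 = project(p1)
# Projecting twice changes nothing: p1 already lies on the range of W.
assert all(abs(p1[i][0] - p2[i][0]) < 1e-9 for i in range(3))
```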
We now switch to the second phase: maximizing the likelihood of the projected training samples in the intermediate space (z in Figure 2), by minimizing the following KL divergence over η (cf. Section II):

min_η −Σ_{i=1}^{N} log p_{h_η}( z^(i) | y^(i) ),    with    z^(i) = g†_γ(x^(i); y^(i)),

where p_{h_η}( · | y) is the pushforward of p_Z through h_η( · ; y) and p_Z(z) is a standard Gaussian distribution in R^d. In summary, this training strategy first ensures that the range of the injective generator "interpolates" the training data and then maximizes the intermediate-space likelihoods as proxies to the image-space likelihoods. After training, sampling an approximate posterior sample for a given y is performed by sampling a z from a normal distribution and using the forward pass: x_gen = g_γ(h_η(z; y); y).

IV. THE C-TRUMPETS SIGNAL MODEL
In this section we briefly discuss the geometric and topological aspects of C-Trumpets. Readers who care mostly about the practical aspects and numerical results may safely skip ahead to Section V.
The primary motivation for introducing (non-conditional) injective flows is to model data supported on low-dimensional manifolds [8], [9], [30], [31]. It was empirically shown that for common structured datasets these models are indeed simpler, faster to train and generate higher quality samples than globally invertible normalizing flows which maintain the same dimension across all layers.
On the other hand, we introduce C-Trumpets from the point of view of uncertainty quantification and posterior sampling rather than geometrical and topological considerations about the class of signals it models. It is nevertheless important to make this class explicit, since this understanding will guide design choices and circumscribe the range of problems in which C-Trumpets are the right tool of choice.
For every (fixed) conditioning sample y, a C-Trumpet becomes a conditional flow modeling the corresponding conditional distribution. Thus for every conditioning sample y the range is a low-dimensional manifold with topology induced by the topology of the latent space (the support of p_Z). The range of a C-Trumpet as a function of z and y then corresponds to the union of the family of these manifolds indexed by y,

β(Y) := ∪_{y ∈ Y} f_θ(Z × {y}) = ∪_{y ∈ Y} { f_θ(z; y) : z ∈ Z }.

Let us now assume a slightly stronger condition than injectivity of f_θ in the first argument: that the Jacobian (with respect to both z and y) of f_θ has full rank dim(Z) + dim(Y). Note that this does not guarantee global injectivity of the map (z, y) ↦ f_θ(z; y), but it does imply that there is a (finite) collection of open sets {U_i} (thought of as neighborhoods in Y) which cover Y, ∪_i U_i = Y, such that f_θ is injective on Z × U_i for all i, and in fact a homeomorphism between Z × U_i and β(U_i). The signal model we just described is called a fiber bundle [20], [32].
Formally, a fiber bundle is a tuple (E, B, π, Z), where E is the total space, B the base space, Z the fiber, and π : E → B a continuous surjection such that every point of B has an open neighborhood U and a map φ : π^{-1}(U) → U × Z for which the composition of φ with the projection onto U agrees with π, and φ is a homeomorphism. The definition can be illustrated by a commutative diagram [20]. We have the following correspondence between the standard fiber bundle notation and the notation specific to C-Trumpets used in this paper:

C-Trumpets              Fiber bundles
range β(Y)              total space E
measurement space Y     base space B
latent space Z          fiber Z
f_θ(z; y) ↦ y           projection π

Fiber bundles model spaces that are locally product spaces but may have non-trivial topology globally. An example of a space that is globally a product space is the cylinder E = S¹ × [0, 1], with the base space being the circle B = S¹ and all fibers being translates of the line segment Z = [0, 1]. The projection π associates to each point x ∈ E its position along the base circle; π^{-1} takes a position along the base circle and returns the corresponding line segment. The same B and Z can generate a rather different space, the Möbius band, which is globally not a product space.⁴ Thus (under appropriate conditions) the range of a C-Trumpet is locally homeomorphic to a product space (it is a product space up to a stretch), even though globally it need not be, depending on the topologies of Z and Y. The requirement that φ be a homeomorphism on U_i implies that (locally) the fibers do not intersect. This can of course only hold if dim(Y) + dim(Z) ≤ dim(X) and the appropriate conditions on the Jacobian are satisfied, but in any case it gives a useful intuition for the kind of problems and datasets that admit modeling by C-Trumpets.
We show that C-Trumpets can indeed model simple low-dimensional fiber bundles: an embedded solid⁵ torus in Figure 3a and a solid Möbius band with elliptic cross-sections (fibers) in Figure 3b. As shown in Figure 3, the conditioning samples for both the torus and the elliptic Möbius band are taken to live on the base circle (y ∈ [0, 2π)). For each angle y, C-Trumpets generate samples from a distribution on the disk for the torus and an elliptical disk for the elliptic Möbius band. Accordingly, we train C-Trumpets with latent dimension two. Figure 4 demonstrates the samples generated by C-Trumpets; we sample the resulting models in increments of 6° to make the fiber bundle structure clear.

⁴ To explain this mathematically we would need to introduce the transition maps and the fundamental group, which is beyond the scope of our sketch.

⁵ By a "solid" torus we refer to a bundle whose fibers are disks. Modeling a standard torus is challenging with coupling layers, which can only be defined starting in dimension 2. Even in this case their expressivity is limited, so this low-dimensional example is only meant as an illustration.
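The solid-torus fiber bundle from this experiment can be sketched directly (the radii below are illustrative): mapping fiber coordinates in the unit disk together with a base angle y into R³ always lands inside the solid torus.

```python
import math, random

# Solid torus as a toy fiber bundle: base angle y on the circle, fiber a
# disk of radius r; "gen" plays the role of the generator mapping fiber
# coordinates (u, v) and base angle y to a point in R^3. Radii illustrative.
random.seed(1)
R, r = 2.0, 0.5   # base circle radius and fiber (tube) radius

def gen(u, v, y):  # (u, v) in the unit disk, y in [0, 2*pi)
    return ((R + r * u) * math.cos(y), (R + r * u) * math.sin(y), r * v)

for _ in range(1000):
    ang, rad = random.uniform(0, 2 * math.pi), math.sqrt(random.random())
    u, v = rad * math.cos(ang), rad * math.sin(ang)   # uniform on unit disk
    y = random.uniform(0, 2 * math.pi)
    px, py, pz = gen(u, v, y)
    # Distance-to-core-circle test for membership in the solid torus:
    d = (math.hypot(px, py) - R) ** 2 + pz ** 2
    assert d <= r * r + 1e-9
```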

V. EXPERIMENTAL RESULTS
We start by showcasing how C-Trumpets provide MMSE, MAP and uncertainty estimates in a variety of imaging inverse problems. The MMSE estimate, E[X|Y = y], is approximated by averaging a fixed number of posterior samples from a C-Trumpet fitted to the training data. The MAP estimate can be efficiently approximated using the fixed-volume-change layers of Section III-C, without iterative maximization over x. Finally, we compute a simple pixel-wise uncertainty estimate as

UQ(y) = (1/K) Σ_{k=1}^{K} | x_gen^(k) − x̂_MMSE |²,

where | · |² is applied to each pixel and x_gen^(k) are posterior samples. In all experiments, we use K = 25 posterior samples to approximate the MMSE and UQ estimates (we find that the quality of the MMSE and UQ estimates saturates for a larger number of posterior samples).
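These estimates amount to a per-pixel mean and mean squared deviation over K posterior samples; a minimal sketch on synthetic samples (the per-pixel means and spreads below are made up):

```python
import random

# Pixel-wise uncertainty from K posterior samples: MMSE is the per-pixel
# mean, UQ the per-pixel mean squared deviation from it. Sample data is
# synthetic: three "certain" pixels (small spread), three "uncertain" ones.
random.seed(0)
K, npix = 25, 6
mus = [1, 1, 1, 0, 0, 0]
sigmas = [0.1, 0.1, 0.1, 1.0, 1.0, 1.0]
samples = [[random.gauss(mu, sigma) for mu, sigma in zip(mus, sigmas)]
           for _ in range(K)]

mmse = [sum(s[j] for s in samples) / K for j in range(npix)]
uq = [sum((s[j] - mmse[j]) ** 2 for s in samples) / K for j in range(npix)]

# Pixels drawn with larger spread should receive larger uncertainty:
assert sum(uq[3:]) > sum(uq[:3])
```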
We compare C-Trumpets with two different types of conditional bijective flows, C-INN [10] and C-Glow [35]. C-INN used conditional coupling layers for the first time, while C-Glow has a rather different architecture, using two parallel bijective flows to simultaneously model the target and conditioning samples. In order to emphasize the importance of an expansive injective model over a bijective one, we build a comparison baseline model, C-Rev, which consists of the bijective portion of C-Trumpets with latent-space dimension equaling that of the image data, i.e., h_η(z, y). Finally, in Section V-C, we visually compare the MMSE and MAP estimates for several ill-posed inverse problems.

Fig. 5. (caption fragment) In (a) and (b), the line segments indicate that "sensors" measure entire projection images, as opposed to complex scalars in nonlinear scattering (d).
A. Computational imaging

a) Limited-view CT: CT is a major medical imaging modality. The corresponding inverse problem is to recover an image from its integrals along straight lines, arranged in a so-called sinogram. In limited-view CT (cf. cryo-electron tomography [36] and dental CT [37], [38]), a contiguous cone of angles is missing from the acquisition; Figures 5a and 5b illustrate vertical and horizontal missing cones of 60°. As there are no measurements in a vertical (horizontal) missing cone, we should expect higher uncertainty in horizontal (vertical) components. We use the filtered back projection (FBP) reconstruction as our measurements y, and train on 40000 256 × 256 samples from the LoDoPaB-CT [39] dataset. The measurement SNR is set to 40 dB. In Figure 6, we show posterior samples, g†-MAP, MMSE, and uncertainty quantification (UQ) estimates from C-Trumpets. The real-space UQ estimate shows higher uncertainty in the vertical (horizontal) components where we have a horizontal (vertical) missing cone. This is consistent with our expectations from the physics of CT.
We further show the UQ estimate in the Fourier domain, computed by averaging over the individual DFT bins. Quite pleasingly, this estimate aligns perfectly with the theoretical prediction from the Fourier slice theorem: higher uncertainty (bright regions) indeed lies inside the missing cone, even though C-Trumpets are not specifically designed for the forward operator of the CT problem. Moreover, as expected, there is higher uncertainty in higher frequency components (see Figure S3 in the supplemental materials for more samples). We emphasize that C-Trumpets are the only conditional generative architecture that gives such physically meaningful posterior samples.
It is also worth mentioning that C-Trumpets model this 256 × 256 resolution dataset with only 10M trainable parameters in less than 24 hours of training time on a single NVIDIA V100 GPU. Due to memory constraints, we simply could not train the baseline bijective models C-INN, C-Glow, and C-Rev on 256 × 256 images. The large latent space dimension of these models leads to very large memory footprints.
We can still compare our uncertainty estimates with these models at a lower resolution of 64 × 64. Figure 7 shows such an experiment at an SNR of 25 dB. In the first panel (second to fifth row), we show MMSE estimates of the different flow models and three random posterior samples. We further provide UQ estimates in real and Fourier space. As we can see, not only do C-Trumpets outperform the conditional bijective flows in terms of reconstruction quality (MMSE estimate; see also Table I), but they also give a more meaningful uncertainty estimate even in high noise. Bijective models generate significant uncertainty outside the missing cone, indicating that they do not learn a reconstruction map consistent with the physics of CT.

b) Electromagnetic inverse scattering: We consider the non-linear electromagnetic inverse scattering problem as described in [33]. We consider the reconstruction of a finite number of parameters from the scattered fields. Although inverse scattering becomes well-posed and Lipschitz-stable with continuous measurements [40],⁶ it is an ill-posed inverse problem with a finite number of measurements, which means that noise in the measurements may translate to exponential errors in the reconstruction [33]. Moreover, the problem becomes increasingly non-linear as the relative permittivity ε_r of the objects being imaged increases. In the first experiment, we use 36 incident plane waves and 36 receivers, distributed uniformly around an object with maximum permittivity ε_r = 6 and dimensions 20 cm × 20 cm. We work at a frequency of 3 GHz and simulate the measurements by solving the Helmholtz equation as explained in Section S-II-C in the supplemental materials. We add noise to the measurements for a target measurement SNR of 30 dB. We build a dataset of 64 × 64 images of overlapping ellipses with 60000 samples. Figure 8 shows the performance of C-Trumpets where the scattered fields are used as conditioning samples.
This experiment clearly shows that C-Trumpets can generate meaningful posterior samples, even for a highly nonlinear (ε_r = 6) and ill-posed problem. Additional experiments with different contrasts ε_r and conditioning schemes (scattered fields vs. back projections) are illustrated in Figures S4, S5, S6 and S7 in the supplemental materials.
In order to assess the performance of different conditional flows in uncertainty quantification, we design another experiment with 180 incident plane waves and 180 receivers placed only on the top side of the object (Figure 5d). In this setup, we expect higher uncertainty in the lower part of the object. Figure 9 illustrates the performance of C-Glow, C-Rev and C-Trumpets for ε_r = 6. We use a simple back projection (BP) [42] as the conditioning measurements y. In the first panel (second to fifth row), we show MMSE estimates of the different flow models: C-Trumpets perform better than the bijective flows (see also Table I). As expected, however, the reconstructions are not as good as in the previous experiment since we only have measurements on one side of the domain. C-Trumpets again provide meaningful uncertainty estimates, with more uncertainty in the lower part of the object (red regions in the UQ panel correspond to higher uncertainty).
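The MMSE estimates and pixelwise UQ maps used throughout these experiments can in principle be reproduced with simple Monte-Carlo statistics over posterior samples. The sketch below assumes a hypothetical `sample_fn(y)` that draws one posterior sample from a trained conditional flow; the name and signature are placeholders, not part of any C-Trumpets API.

```python
import numpy as np

def posterior_statistics(sample_fn, y, n_samples=25):
    """Monte-Carlo posterior statistics from a trained conditional generator.

    sample_fn(y) is assumed to draw one posterior sample x ~ p(x | y)
    as an (H, W) array; this is an illustrative placeholder interface.
    """
    samples = np.stack([sample_fn(y) for _ in range(n_samples)])
    mmse = samples.mean(axis=0)  # Monte-Carlo approximation of E[X | Y = y]
    uq = samples.std(axis=0)     # pixelwise spread; high values flag uncertain regions
    return mmse, uq
```

With a well-trained model, regions unconstrained by the measurements (e.g., the side of the domain without sensors) show a larger pixelwise standard deviation, which is what the UQ panels visualize.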

c) Seismic travel-time tomography:
We work with the linearized seismic tomography operator described in [34]. Here we are given travel times of a seismic wave between each sensor pair in a network of NS sensors placed on the ground. The travel times are assumed to depend linearly on the "slowness" map, which is taken to be an MNIST image [4]. We use NS = 10 sensors on the boundary of the lower part of the domain, which yields 33 measurements as shown in Figure 5c. We use the pseudo-inverse of the measurements y as the conditioning samples and work at an SNR of 40 dB. Figure 10 compares the performance of C-Trumpets and C-Rev. In the first panel, the second and third rows show MMSE estimates of C-Rev and C-Trumpets. C-Trumpets outperform C-Rev in both the MMSE estimate and posterior sampling (see also Table I). Furthermore, given that there are no sensors in the top regions of the image (or slowness map; see Figure 5c), we would expect higher uncertainty there. The UQ column shows that C-Trumpets assign higher uncertainty to the top half of the domain compared to C-Rev. Additional results are shown in Figure S8 in the supplemental materials.

B. Image restoration
In this section, we compare C-Trumpets with the baseline conditional flows on various image restoration tasks. As in the computational imaging problems, each task requires training a different model, but once trained, the model can be used instantaneously for arbitrary measurements y. We consider four standard restoration tasks, including (i) image denoising, where we train a model to generate plausible clean images given a noisy image at an SNR of −1 dB. We used the 8-bit RGB CelebA [43] dataset with 80000 64 × 64 training samples. Figure S2 in the supplemental materials compares the performance of the flow models on the different image restoration tasks. Table I further reports the SNR and SSIM of the MMSE estimate computed by sampling 25 images from the posterior. C-Trumpets consistently outperform the other conditional bijective flows. The difference in performance between C-Rev and C-Trumpets suggests that the low-dimensional latent space of C-Trumpets acts as an effective regularizer of the restoration mapping f_θ. C-Trumpets also provide meaningful uncertainty quantification. For example, although the forward operator is random in the random-mask problem, C-Trumpets still capture a meaningful uncertainty estimate by assigning higher uncertainty inside the masked region (see Figure S2d).
In order to assess the memory requirements of the different models, we compare the number of trainable parameters used for training on 64 × 64 RGB images: C-INN: 13M, C-Glow: 14M, C-Rev: 22M and C-Trumpets: 4M. We did not have sufficient resources to train the bijective models on 256 × 256 images, but it is clear that the differences in memory footprint would only be exacerbated at that resolution.

The MMSE estimate is obtained by averaging over 25 posterior samples. Although the MMSE estimate is the optimal reconstruction in terms of the ℓ2-error, we see from Figure 11 that it is often blurry, especially when the true posterior is multi-modal; g†-MAP estimates are sharper. Moreover, as the MMSE estimate is obtained by averaging over posterior samples, it does not in general lie on the manifold, while the g†-MAP estimate is always on the manifold.

VI. RELATED WORK
There is by now a very large body of work on solving inverse imaging problems using deep neural networks. On the supervised regression end of the spectrum, arguably the most important architecture is the U-Net [44]. It has been applied with great success to a variety of imaging problems, including CT [1], magnetic resonance imaging (MRI) [45], electromagnetic inverse scattering [46] and optoacoustic tomography [47]. Its success may be attributed to its particular multiscale structure [48], [49], which matches both the physical description of the imaging problems and the representation of the involved image classes. Many alternatives have been proposed for specific problems where the assumptions that make the U-Net a natural choice do not hold, for example wave-based problems [50], [51].
On the other hand, trained generative models, which can be trained in an unsupervised manner, have been shown to be effective priors [52], [53] for ill-posed inverse problems. Normalizing flows in particular were used to approximate MAP estimates via iterative optimization [53], [54]. However, normalizing flows are bijective, which requires a large memory and compute budget even at moderate resolution. Moreover, they lack a low-dimensional latent space, which has been shown to effectively regularize the inversion. Brehmer and Cranmer [8] proposed the first injective model for densities supported on low-dimensional manifolds. Kothari et al. [9] proposed injective flows with significantly optimized compute and memory requirements, which were shown to outperform earlier variants of injective and bijective flows in computational imaging problems.
A variety of methods have been proposed for Bayesian imaging, which aims to approximate the posterior distribution and/or the various point estimators derived from it. The authors of [55] consider a convex log-likelihood approximation of the posterior distribution around the MAP estimate, which is suitable for modeling unimodal posteriors. Normalizing flows have been proposed as variational approximators to the posterior distribution for a given measurement [2]. The authors of [9], [27] propose to train a flow model to approximate the posterior corresponding to a prior which is also modeled by a flow model. More recently, pre-trained GANs were used in conjunction with MCMC to generate posterior samples in nonlinear inverse problems [22]. The authors of [56], [57] use a StyleGAN generator [58], [59] to regularize ill-posed inverse problems and generate posterior samples. A similar approach is taken in [21], [60] with score-matching generative models [61], [62]. All these methods train a new generative model or run an iterative process for every measurement, which makes them slow when many reconstructions are needed. Further, training for each conditioning sample requires many calls to the forward operator. To tackle these issues, one may consider amortized inference, conditioning the generative model on the measurements.
Conditional versions of generative adversarial networks [11], [63] and variational autoencoders [64], [65] rely on injecting conditioning data into the different layers of the generator model. However, the lack of access to the posterior distribution makes them difficult to use in inference tasks. These generative models also suffer from mode collapse and training instabilities. Conditional normalizing flows were introduced in [10] to estimate the posterior by modifying the scale and shift terms of the coupling layers. The authors of [16] additionally make the mean and covariance of the base Gaussian distribution depend on the measurements. More recently, [66] proposed to append the measurements to all layers of the Glow network [14] in order to enable greater information flow from the conditioning data to the generated samples. A different approach to conditioning flow models, developed in [35], [67], uses two parallel flow models to simultaneously model target and conditioning samples. All these conditional normalizing flows are bijective mappings from latent space to the target domain for each measurement. It is worth mentioning that all these models can be used in the bijective part of C-Trumpets to exploit their advantages in posterior modeling. On the side of theory, Puthawala et al. [31] establish universality of density and manifold approximation of injective flows such as those in [8], [9].

VII. LIMITATIONS AND CONCLUSIONS
We proposed C-Trumpets, a conditional injective flow model that enables amortized inference with approximate posteriors that live on low-dimensional manifolds. Our proposed model is considerably cheaper to train in terms of memory and compute costs compared to the regular conditional flows. The experiments we performed indicate that C-Trumpets generate better posterior samples and more accurate uncertainty estimates over a variety of ill-posed inverse problems. The proposed fixed-volume-change coupling layers enable us to approximate the sharp MAP estimates instantaneously after training. High computational demands of training bijective flows at high resolution have thus far impeded their wider adoption in computational imaging workflows. The comparably lightweight memory footprint of C-Trumpets together with physically-consistent UQ makes them an attractive architecture for imaging problems where characterizing uncertainty is paramount.
Limitations: C-Trumpets have several limitations that warrant discussion. The latent space dimension in C-Trumpets is chosen arbitrarily and it may be quite different from the true dimension of the posterior support. Recent work [68] proposes an injective flow architecture that estimates the dimension of the data manifold. Similar ideas may extend to C-Trumpets but for the moment we rely on rules of thumb rather than principled choices. Another limitation is that it is not straightforward to estimate the likelihoods of samples generated by C-Trumpets (cf. Section III-C), the reason being that the Jacobian determinant of compositions of maps between spaces of different dimension cannot be written as a product of Jacobian determinants of the constituent maps. Likelihood estimates can still be obtained by sampling but that is considerably slower than what is possible with bijective flows. Recently, Ross et al. [69] proposed an injective generator which provides access to the exact likelihood of the generated samples, but the constraints they impose on the architecture in order to enable this feat seem to severely limit expressivity. The design of an injective model that is at once expressive, lightweight, and gives fast exact likelihoods remains an open problem. Finally, further theoretical studies are needed to characterize the types of posterior distributions that can be modeled by C-Trumpets, especially with fixed-volume-change layers. The important open question is that of universality of C-Trumpets as models of conditional distributions.
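To make the obstruction to likelihood evaluation concrete, recall the standard change-of-variables formula for injective maps (as used, e.g., in [8], [9]): for an injective f : R^d → R^n with left inverse f†, the density on the range of f involves a Gram determinant rather than a plain Jacobian determinant. A sketch of the relevant identities:

```latex
% Density of x = f(z) on the range of an injective map f : R^d -> R^n,
% with left inverse f^\dagger and Jacobian J_f evaluated at z = f^\dagger(x):
p_X(x) = p_Z\!\bigl(f^{\dagger}(x)\bigr)\,
         \det\!\bigl(J_f^{\top} J_f\bigr)^{-1/2}.

% For a composition f = f_2 \circ f_1 between spaces of different
% dimensions, the chain rule gives J_f = J_{f_2} J_{f_1}, but in general
\det\!\bigl(J_f^{\top} J_f\bigr) \neq
\det\!\bigl(J_{f_1}^{\top} J_{f_1}\bigr)\,
\det\!\bigl(J_{f_2}^{\top} J_{f_2}\bigr).
```

Because the Gram determinant does not factor across layers, the log-determinant cannot be accumulated layer by layer as in bijective flows, and likelihoods must instead be estimated by sampling.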

Supplemental Materials: Conditional Injective
Flows for Bayesian Imaging

S-I. FAST INVERSES
Inverting the revnet coupling layers requires the matrix inverse of the kernel of the 1 × 1 convolution layer. As this may be slow, Kingma and Dhariwal [14] proposed to parameterize the kernel via the LU decomposition

W = PL(U + diag(s)),

where P is a permutation matrix, L is a lower-triangular matrix with ones on the diagonal, U is an upper-triangular matrix with zeros on the diagonal, and s is a vector. Computing the log det of the layer Jacobian then simplifies to

log |det(W)| = Σ_i log |s_i|.

The LU decomposition thus reduces the complexity of computing the log det of the Jacobian of the 1 × 1 convolution layer from O(c^3) to O(c) (plus a one-time factorization cost). This trick, however, was only used in inference (when sampling from the model), but not for training the bijective flows, as matrix inversion is not critical in that case. Unlike bijective flows, C-Trumpets do require matrix inversion during the MSE training phase (cf. (16)). In order to reduce the computational cost of inversion, we leverage the LU decomposition as follows:

W^{-1} = (U + diag(s))^{-1} L^{-1} P^{-1}.

Since L and U + diag(s) are triangular matrices and P is a fixed permutation, the inverse reduces to two triangular solves instead of a full O(c^3) inversion. This significantly reduces the training time of the injective part of C-Trumpets, especially in high-dimensional problems.
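A minimal NumPy/SciPy sketch of this trick on a random kernel: SciPy's `lu` factors W = PLU with unit-diagonal L, so the diagonal of U plays the role of the vector s.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
c = 8                         # number of channels of the 1x1 convolution
W = rng.normal(size=(c, c))   # invertible with probability 1

# One-time factorization: W = P @ L @ U, with unit-diagonal L.
P, L, U = lu(W)

# log|det W| reduces to a sum over the diagonal of U: O(c) instead of O(c^3).
logdet = np.sum(np.log(np.abs(np.diag(U))))
assert np.isclose(logdet, np.linalg.slogdet(W)[1])

# Solving W @ x = b needs only a permutation and two triangular solves.
b = rng.normal(size=c)
z = solve_triangular(L, P.T @ b, lower=True)   # L z = P^T b
x = solve_triangular(U, z, lower=False)        # U x = z
assert np.allclose(W @ x, b)
```

The factorization is done once per layer; every subsequent inversion during training reuses P, L and U.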

S-II. ADDITIONAL EXPERIMENTAL DETAILS
A. Fiber bundles
a) Datasets: We analyzed the performance of C-Trumpets on two fiber bundle datasets:
• Torus:
x = (R + r cos(s)) cos(t)
y = (R + r cos(s)) sin(t)
z = r sin(s) (S4)
where t ∈ [0, 2π) and s ∈ [0, 2π). We set R = 1 and r = 0.25 and generate 18000 training samples.
• Elliptic Möbius:
x = cos(t)[R − b sin(t/2) sin(s) + a cos(t/2) cos(s)]
y = sin(t)[R − b sin(t/2) sin(s) + a cos(t/2) cos(s)]
z = b cos(t/2) sin(s) + a sin(t/2) cos(s) (S5)
where t ∈ [0, 2π) and s ∈ [0, 2π). We set R = 1, a = 0.4 and b = 0.1 and generate 18000 training samples.
b) Network architecture and training details: The injective part of C-Trumpets maps R^2 → R^3 and consists of an expansion layer (a matrix of size R^{3×2}) followed by 24 revnet blocks without activation normalization layers. The bijective part, which maps R^2 → R^2, consists of 32 revnet blocks. We use the coupling layers proposed in [14] with 3 fully-connected layers for scale and bias. We train for 100 epochs using the Adam optimizer [70] with a learning rate of 10^−3 for both the injective and the bijective parts of C-Trumpets. We note that this minimal input dimension of the coupling layers yields poor expressivity, but we only use this example for illustration.
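For illustration, the elliptic Möbius dataset of Eq. (S5) can be generated as follows. Sampling t and s uniformly is our assumption; the paper does not state the sampling distribution.

```python
import numpy as np

def elliptic_mobius(n, R=1.0, a=0.4, b=0.1, seed=0):
    """Sample n points from the elliptic Mobius surface of Eq. (S5)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n)
    s = rng.uniform(0.0, 2.0 * np.pi, n)
    # Common radial factor shared by the x and y coordinates.
    rad = R - b * np.sin(t / 2) * np.sin(s) + a * np.cos(t / 2) * np.cos(s)
    x = np.cos(t) * rad
    y = np.sin(t) * rad
    z = b * np.cos(t / 2) * np.sin(s) + a * np.sin(t / 2) * np.cos(s)
    return np.stack([x, y, z], axis=1)

data = elliptic_mobius(18000)   # matches the 18000 training samples in the text
```

With R = 1, a = 0.4 and b = 0.1, all points stay within a thin band around the unit circle in the xy-plane, which is the 2-D manifold embedded in R^3 that the injective part of the model learns.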

B. Limited-view CT
We consider the 2D parallel-beam CT problem with a missing cone of sensors [71]. a) Network architecture: We describe the architecture of C-Trumpets for 256 × 256 images. The injective part consists of 6 injective revnet blocks, each increasing the dimension by a factor of 2. Between the injective layers, we intersperse 36 bijective revnet blocks. We choose a latent space of size 2048. The bijective part consists of 48 bijective revnet blocks. We use 3 convolutional layers for the scale and bias networks of fixed-volume-change coupling layers. The conditioning networks have 3 conv layers along with a 'squeeze' layer that helps match the dimension of the conditioning sample to the input of the scale network.
The architecture of C-Trumpets for 64 × 64 images is similar to the 256 × 256 variant, except that we use 18 (resp. 24) revnet blocks in the injective (resp. bijective) subnetworks and choose a 64-dimensional latent space. The C-Rev model used for comparison has 24 revnet blocks, all at the highest resolution (i.e., 64 × 64).
We initialize the weights of the revnet blocks as in [14]. All elements of the skip connection matrix S are initialized to 0.5. All models are trained for 300 epochs (150 per phase) using the Adam optimizer [70] with a learning rate of 10 −4 .

C. Electromagnetic inverse scattering
We consider the 2D transverse magnetic inverse scattering problem. We illuminate the domain of investigation D_inv (a D × D square) by N_i plane waves; N_r antennas measure the scattered field for each incident wave.
The forward scattering problem can be described by two coupled equations [33]. The first equation, called the state equation, relates the total electric field E^t in the unknown domain to its contrast current density J as

E^t(r) = E^i(r) + k_0^2 ∫_{D_inv} g(r, r′) J(r′) dr′,

where g(r, r′) = (1/4i) H_0(k_0 |r − r′|) is the 2-D free-space Green's function, H_0 is the Hankel function of the second kind, and J(r′) can be calculated using the equivalence theorem as J(r′) = χ(r′) E^t(r′), with the contrast χ(r′) = ε_r(r′) − 1.
The second equation, called the data equation, maps the contrast current source to the scattered electric field E^s at the antenna locations: for r ∈ S,

E^s(r) = k_0^2 ∫_{D_inv} g(r, r′) J(r′) dr′.
The task of inverse scattering is to reconstruct the contrast χ from the scattered field E^s.
a) Network architecture: We use the same architecture and training parameters as in the 64 × 64 2D CT problem.
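The Green's function above can be evaluated directly with SciPy's Hankel functions. The sketch below assumes the 3 GHz operating frequency used in the experiments; the function name is ours.

```python
import numpy as np
from scipy.special import hankel2

c0 = 3e8                  # speed of light in vacuum, m/s
f = 3e9                   # operating frequency from the experiments, 3 GHz
k0 = 2 * np.pi * f / c0   # free-space wavenumber, rad/m

def greens_2d(r, r_src):
    """2-D free-space Green's function g(r, r') = (1/4i) H0^(2)(k0 |r - r'|)."""
    d = np.linalg.norm(np.asarray(r, float) - np.asarray(r_src, float), axis=-1)
    return hankel2(0, k0 * d) / 4j
```

Discretizing the state and data equations then amounts to evaluating this kernel between all pairs of grid cells (state equation) and between grid cells and antenna positions (data equation).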

D. Linearized travel-time tomography
Linearized travel-time tomography is inspired by seismic travel-time tomography, which aims to reconstruct the wavespeed variation inside a planet. Sensors are placed on the ground and we measure arrival times of surface waves. We assume that the receiving and transmitting sensors are co-located. We further assume that the pixel intensities represent the "slowness" (inverse wave speed). We can therefore write the travel time between a transmitting sensor s_i and a receiving sensor s_j as

t(s_i, s_j) = ∫_{L(s_i, s_j)} f(r) dr,

where f represents the image and L(s_i, s_j) is the straight line connecting the two sensors. Note that t(s_i, s_j) = t(s_j, s_i).
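A toy discretization of this forward map can be assembled by sampling each straight ray at a fixed number of points. The midpoint quadrature and nearest-pixel rounding below are illustrative choices, not the discretization of [34].

```python
import numpy as np

def travel_time_matrix(sensors, img_shape, n_pts=200):
    """Linearized travel-time operator: row (i, j) integrates the slowness
    image along the straight segment from sensor s_i to sensor s_j.

    `sensors` holds (row, col) pixel coordinates. Each unordered pair
    appears once, so t(s_i, s_j) = t(s_j, s_i) is built in.
    """
    H, W = img_shape
    pairs = [(i, j) for i in range(len(sensors)) for j in range(i + 1, len(sensors))]
    rows = []
    for i, j in pairs:
        p, q = np.asarray(sensors[i], float), np.asarray(sensors[j], float)
        ts = (np.arange(n_pts) + 0.5) / n_pts            # midpoint quadrature nodes
        pts = p[None, :] + ts[:, None] * (q - p)[None, :]
        idx = np.clip(np.round(pts).astype(int), [0, 0], [H - 1, W - 1])
        seg = np.linalg.norm(q - p) / n_pts              # length weight per sample
        row = np.zeros(H * W)
        np.add.at(row, idx[:, 0] * W + idx[:, 1], seg)   # accumulate repeated pixels
        rows.append(row)
    return np.array(rows), pairs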
a) Network architecture: The injective subnetwork consists of 4 injective revnet blocks, each increasing the dimension by a factor of 2. We use 12 bijective revnet blocks interspersed between the injective layers. The bijective part takes 64-dimensional latent vectors and consists of 12 bijective revnet blocks. The C-Rev model used for comparison has 16 revnet blocks, all at the 32 × 32 resolution. The training hyperparameters are the same as in the CT problem.
E. Image restoration tasks
a) Network architecture: We use the same architecture and training hyperparameters as for the CT problem, except that we work with 64 × 64 × 3 resolution images. The latent dimension of C-Trumpets is chosen to be 192 (64 × 3).

A. Class-based image generation
We perform class-based image generation on MNIST digits [4] and a subset of 10 people from the VoxCeleb [72] face dataset with 5000 64 × 64 training samples. We use one-hot class labels as conditioning vectors to generate samples from a given class, and use two fully-connected layers followed by a reshaping module for the conditioning network of C-Trumpets. Figure S1 shows the class-conditional samples generated by C-Trumpets; this experiment indicates that our proposed model can generate good-quality class-conditioned samples. Figure S2 compares the performance of conditional normalizing flows on image restoration tasks. Figures S3 to S12 demonstrate further results on different ill-posed inverse problems, where C-Trumpets have a significant edge over the baselines.