Deep Microlocal Reconstruction for Limited-Angle Tomography

We present a deep learning-based algorithm to jointly solve a reconstruction problem and a wavefront set extraction problem in tomographic imaging. The algorithm is based on a recently developed digital wavefront set extractor as well as the well-known microlocal canonical relation for the Radon transform. We use the wavefront set information about x-ray data to improve the reconstruction by requiring that the underlying neural networks simultaneously extract the correct ground truth wavefront set and ground truth image. As a necessary theoretical step, we identify the digital microlocal canonical relations for deep convolutional residual neural networks. We find strong numerical evidence for the effectiveness of this approach.


Introduction
Tomographic imaging aims at uncovering the interior 2D/3D structure of an object from a sinogram, which is data obtained by repeatedly exposing the object to a particle or wave from different directions. This is a key example of an inverse problem where one computationally attempts recover an unknown signal from data given as indirect observations. A reconstruction method refers to an algorithm performing the recovery.
Inverse problems, like those that arise in tomographic imaging, are often ill-posed which means that there can be multiple solutions consistent with the data or solution procedures that maximise consistency against measured data are sensitive to variations in data. Such high sensitivity is referred to as instability, and it appears when the forward operator, which models how a signal gives rise to corresponding noise-free data, is not continuously invertible. Stability properties are further degraded for sparse-view data, which is when data is under-sampled, and for limited-angle data, which refers to unevenly sampled data. Regularisation refers to mathematical theory and methods for stabilising the solution procedure of an ill-posed inverse problem. Many regularisers enforce stability by requiring consistency against a prior model. This prior should ideally encode known properties of the unknown signal one seeks to recover, and choosing an appropriate prior is an essential part of regularisation.
Most reconstruction methods in tomography assume that measurements are collected from views that are evenly distributed around the object. Limited-angle tomography refers to a case when this is not fulfilled. Such problems arise naturally in many applications, like digital breast tomosynthesis [60,6], dental tomography [43,44], electron tomography [68,63], transmission x-ray microscopy [42], nondestructive testing [66,71], geophysical prospecting [84], etc. This missing data significantly amplifies the instability in the corresponding reconstruction problem [21,58]. Hence, traditional reconstruction methods, like filtered back-projection

Main contributions
In this paper, we develop theory and algorithms for reconstruction in severely ill-posed inverse problems that arise in tomographic imaging with limited data. In particular, we develop a data-driven reconstruction method for limited-angle tomography that is microlocally consistent, which means 'filling in' missing data in a way that is consistent with how singularities in data are related to those in the signal.
Our approach relies on deep neural networks (DNNs) that integrate a handcrafted forward operator and a theoretical characterisation of how singularities in data are related to those in the signal. In this context, we model singularities through the concept of the wavefront set, that will be introduced in detail in Section 3 below. For the Radon transform, which is the forward operator underlying the tomography problem, it is known exactly how an application of it affects the wavefront set of underlying data. Indeed, the wavefront set of an image and its Radon transform are linked through so-called canonical relations. To use this a priori information on the wavefront set in the reconstruction problem, we set up the following method: Our algorithm consists of two pipelines that act in parallel. First, a reconstruction line that maps the data to the (unknown) signal. This is a specific deep neural network using the Learned Primal-Dual architecture, [2]. Second, a micro-local line that identifies the wavefront set of the signal by using the following three steps: (a) A neural network that extracts the wavefront set of the data. This neural network has been established earlier in [4] under the name DeNSE and is based on the interaction of the shearlet transform with the wavefront set of a function. (b) An analytical computation relating the wavefront set in the data domain to the image domain in a way that mirrors the action of the Learned Primal-Dual architecture of the reconstruction line. The two lines are coupled via this step. (c) A neural network with the U-Net architecture [72] that inpaints the wavefront set in the image domain that was inferred from the incomplete data.
The neural networks in the microlocal line and the reconstruction line are now trained in parallel on an artificial training set consisting of 2D phantoms (signals) with corresponding noisy and incomplete sinogram data. The phantoms are made up of random shapes that are demarcated by piecewise smooth curves given by splines of degree at most four. Several examples from the data set are shown in Figure 2.
The combined procedure, which we coin the joint reconstruction algorithm, now attempts to satisfy two objectives on that data set: (a) The reconstruction returned from the reconstruction line should agree as closely as possible with the ground truth. (b) The wavefront set returned by the microlocal line should resemble the ground truth wavefront set of the data accurately.
We refer to Section 5 for the complete description of the joint reconstruction algorithm and in particular Figure 4 for an illustration of the underlying DNN architecture.
Note that, due to the simplicity of the training data set, we can analytically compute the true wavefront set for those signals, which in turn is needed for the aforementioned joint training of the DNNs for wavefront set inpainting and reconstruction. In contrast, our test data consists of images of brains and associated noisy incomplete sinograms. Therefore, the training data is substantially different from the test data. As such, the empirical numerical study also shows the transfer learning properties of our approach.
We wish to emphasise that this approach applies in principle to any inverse problem where the forward operator is a Fourier integral operator, as is the case for most inverse problems arising in imaging applications. However, for simplicity, we chose to work out the theoretical results with associated algorithms and numerical examples only for the specific case of planar limited-angle tomography.
A central part of the proposed algorithm consists of the theoretical analysis of how the Learned Primal-Dual architecture affects the extracted wavefront set. At this point, the reader may justifiably wonder if it would not instead be possible to estimate the wavefront set of the output of this architecture directly via the wavefront set extractor DeNSE. This is theoretically possible but turned out to be practically infeasible. While the wavefront sets of the data set can be precomputed, this cannot be done for the outputs of the Learned Primal-Dual network as the Learned Primal-Dual network varies during the training. This means that during training a full application of DeNSE would have to be performed in every training step. On a modern machine, an application of DeNSE to a single image in the data set takes approximately 20 seconds, which shows that this approach would slow down the training process dramatically.

Phantom (ground truth signal)
Learned Primal-Dual network (PSNR 24.90) Joint approach (PSNR 30.20) Visible wavefront set of the ground truth extracted by DeNSE Wavefront set inpainted as pre-processing using a trained U-Net Reconstructed wavefront set from joint approach Figure 1: Limited-angle parallel beam tomography with a realistic brain phantom.
As a small appetiser, we conclude this overview of the main contributions by illustrating the superior performance of our approach. Figure 1 compares reconstruction performed by a DNN, here the learned primal-dual network, against our joint approach that combines this Learned Primal-Dual network with a DNN for inpainting the wavefront set. In this figure the tomographic data is given in the form of highly noisy samples of the Radon transform with 40 • missing angular wedge. Reconstructions shown the in top row are from two supervised deep learning approaches, namely, Learned Primal-Dual (middle) and our joint approach (right). The latter is essentially the Learned Primal-Dual network combined with a learned wavefront set inpainting that complements missing data in a microlocally consistent manner.
Both the Learned Primal-Dual network and the joint DNN for reconstruction and wavefront set inpainting were trained against the same training data with the same total number of training steps. The joint approach clearly shows the benefit of complementing missing data as it is able to recover essential features that are lost using the Learned Primal-Dual network. Moreover, the joint approach also seems to recover reasonably well singularities (edges) in parts of the image that are not in the visible wavefront set. This is remarkable as these singularities leave, according to microlocal theory, no measurable footprint in data. Furthermore, applying wavefront set inpainting to data separately as a pre-processing step prior to reconstruction yields a Figure 2: Some examples of the piecewise smooth data set comprised of functions which are piecewise polynomial and where the interfaces are given by splines of degree at most four. In the second row, we depict the associated analytically found wavefront sets. The third and fourth rows show, respectively, the corresponding sinograms and their wavefront set, both in the limited-angle setting with 40 • wedge. significantly worse estimator for the wavefront set.
This shows the benefit of performing these steps simultaneously as in the joint approach. Figures 8 and 7 present a more extensive comparison that also includes sparse-view tomography and other model based and data-driven methods.
1.2 Outline of the paper Sections 1, 3 and 2 along with Appendix A primarly serve as background material, whereas the main scientific contribution is contained in Sections 4 and 5. Finally, Section 6 provides numerical examples that showcase the performance of the suggested approach and compares it to other methods for image reconstruction in limited-angle tomography. We present a more detailed outline below: Section 1 provides background motivation from applications along with an overview of the main scientific contributions (Subsection 1.1) and a survey of current state-of-the-art for limited-angle tomography (Subsection 1.3). This is followed by Section 2 that mathematically formalises an inverse problem, both as an operator equation as in (2.1) as well as a statistical inference problem as in (2.2). This section also introduces the notions of ill-posedness and regularisation (Subsection 2.1). Emphasis is next on the Learned Primal-Dual network introduced in Subsection 2.2. This is a DNN with an architecture that incorporates the forward operator which is to be inverted.
Background on the mathematics of tomographic imaging is provided in Section 3. There, we define the Radon transform arising in planar tomography (Definition 3.1) and recall some of its properties (Subsection 3.1). Focus here is on extending the Radon transform and its corresponding back-projection (Definition 3.5) to tempered distributions. This uses basic notions from distribution theory provided in Appendix A (some of the material in this appendix is also used later in Section 4). Subsection 3.1 also introduces the restricted Radon transform arising in limited-angle tomography along with its corresponding restricted back-projection defined in (3.5). Finally, Subsection 3.3 formally defines the wavefront set (Definition 3.6) and then states the canonical relation for the Radon transform given in (3.7). In (3.8) we also provide a characterisation from microlocal analysis of the visible (and invisible) parts of the wavefront set for the Radon transform.
The main scientific contributions are contained in Sections 4 and 5. More precisely, Section 4 provides a theoretical analysis of the propagation of the wavefront set through the distinct layers in a continuum version of the Learned Primal-Dual architecture with the Radon transform as forward operator. Section 5 then presents, in the digital setting, how the digital wavefront set is propagated through the layers of the Learned Primal-Dual architecture. This propagated wavefront set, along with the characterisation of the visible wavefront set, is used for setting up the DNN for wavefront set inpainting, which recovers the invisible part of the wavefront set of a signal from the visible part. This DNN for wavefront set inpainting is then combined with the Learned Primal-Dual network for reconstruction, and both DNNs are trained jointly following the task-adapted reconstruction paradigm outlined in [1], see Section 5.6 for further details.
Finally, Section 6 provides numerical evidence and benchmarks related to image reconstruction in limited-angle tomography.

Related work
The severe ill-posedness associated with limited data in inverse problems has attracted much interest within the research community. This survey will mainly focus on the development of theory and methods for inversion of the Radon transform from limited-angle data, which is an archetype of a severely ill-posed inverse problem.

Analytic approaches
Lambda tomography is one of the first examples of specifically designing a reconstruction method for limited data. It is an analytic approach that was initially developed for 'inverting' the Radon transform from region of interest data [85,28,27,46]. The idea is to replace the standard filter in filtered back-projection [10,78,36] (that has infinite support) with a filter that takes a second derivative in the detector variable. This numerical derivative filter has small support, just near the line being evaluated, hence the reconstruction method is local and thus it applies to region of interest data. Lambda tomography was later applied to limited-angle data [47], where it recovers the visible wavefront set [65,68,31,67]. The recovery is only mildly ill-posed [68,50], which is in stark contrast to the severe ill-posedness that is associated with attempts at reconstructing a function from limited-angle Radon data.
Lambda tomography is computationally efficient and offers an improvement over the traditional filtered back-projection method. Its robustness and usefulness was successfully demonstrated when Lambda tomography was applied for solving the limited-angle problem with highly noisy data in electron tomography [69]. The analytic nature of the method also means the method is feasible for time-critical and/or large-scale problems. The drawbacks of Lambda tomography are similar to those of filtered back-projection, namely that its filter needs to be specifically designed for the acquisition geometry and that the prior implicitly contained for regularising the problem has limited power.

Variational models for reconstruction
Much effort has been spent on adapting variational models of the form in (2.3) to limited-angle tomography. Methods cited here aim to design regularisers that are specifically adapted for limited-angle tomography, e.g., by accounting for the anisotropic resolution due to limited-angle data.
Many variational models for limited-angle tomography build on modifying the total variation (TV) regulariser, like various anisotropic versions that were introduced in [25,8], see also [15,Section 2.4]. These can account for the scanning configuration bearing in mind the missing angular region in limited-angle data, which was also done in [19,95]. This idea was further elaborated in [91], where an iteratively re-weighted anisotropic TV regularisation method was introduced to approximate the sparsity of 0 semi-norm. A further development came in [93] which set-up an alternating (directional) edge-preserving diffusion based on the one-dimensional 0 semi-norm and a (directional) smoothing model based on a one-dimensional Dirichlet energy. The aforementioned directions are dictated by the missing angular region, and the regularisation strategy takes into consideration the fact that the 0 norm is better at preserving edges, while the 1 norm is better at suppressing noise. A further refinement to better balance the need for edge-preservation against noise-suppression was presented in [24]. Based on the theory of visible and invisible wavefront sets developed in [65], the authors of [24] propose a reconstruction method that encodes the visible singularities as priors to recover the invisible ones. The model utilises generalised p-shrinkage operators introduced in [17] as regularisers to perform edge-preserving smoothing while using visible edges as anchors to recover piecewise constant or piecewise smooth reconstructions, while noises and artefacts are suppressed or removed. Another similar approach minimises an L 1 /L 2 term on the gradient with additional box constraint that is reasonable for imaging applications [89].
Other attempts at sparsity promoting variational models for limited-angle tomography rely on dictionary learning [80] or use a regulariser that promotes sparse solutions with respect to wavelets [90,94] or curvelets/shearlets [30,71]. One can further constrain a sparse solution against a given prior image as shown in [18,90,94,34].
Variational models tend to preserve edges better and reduce streak artefacts that are common in limitedangle tomography. However, the improvements are most notable for the recovery of simplistic images, like piecewise constant or piecewise smooth images. Furthermore, this improvement is only realised when (multiple) hyper-parameters are properly tuned. Finally, variational models are computationally demanding regarding run time and memory footprint, which in turn limits their usefulness in time-critical and/or large-scale imaging studies. In summary, the above cited variational models offer surprisingly little, if any, benefit in many cases where images have more complex features that are essential for their interpretation.

Deep learning based methods for reconstruction
There are several attempts at performing tomographic reconstruction by deep neural networks that are extensively surveyed in [5]. Most can be viewed as using techniques from deep learning to approximate different estimators for the statistical formulation of the reconstruction problem in (2.2).
A popular approach for using deep learning in tomographic image reconstruction is to use a trained DNN as a post-processing step for improving an initial imperfect reconstruction. Such an approach was used for limited-angle tomography in [42], which trains a U-Net against synthetic ellipsoid data and multi-category data to reduce artefacts from images obtained by filtered back-projection.
A more domain-adapted approach that outperforms post-processing is to consider DNN architectures for reconstruction that are obtained from unrolling a suitable iterative scheme. One example is the Learned Primal-Dual network in [2] outlined in Subsection 2.2 that incorporates a handcrafted forward operator and the adjoint of its derivative along with the acquisition geometry. This DNN was recently further adapted to limited-angle tomography in breast tomosynthesis [81] by incorporating additional prior information about the geometry in the form of the thickness measurement of the breast under compression.
A different deep learning based approach for limited-angle tomography is presented in [13]. Here, one learns the invisible part of the image using the visible part of its shearlet coefficients. This amounts to learning an anisotropic regulariser in a variational model.

Sinogram inpainting
An entirely different approach to limited-angle tomography is to fill in the missing angular data by some extrapolation scheme (sinogram inpainting or sinogram-recovery). This pre-processing step needs to be done in a stable manner and [22,23] uses a regularised iterative scheme for this purpose. Another approach uses projection onto convex sets to ensure the extrapolated sinogram is indeed valid by making sure it lies in the range of the Radon transform (Helgason-Ludwig consistency condition) [51].
As to be expected, there have also been attempts at using deep learning for sinogram inpainting. Here, much work has been inspired by the success that generative adversarial networks (GANs) have had in restoring missing parts of an image (image inpainting). In particular, [62] develops a deep learning-based image inpainting where one jointly trains the DNN that inpaints the edges and the image, using the philosophy 'lines first, colour next'. The DNN for wavefront set inpainting (Subsection 5.6) that we use in this paper is strongly inspired by [62]. In our case, the missing parts are in the sinogram domain. Consequently, we need to use the reconstruction method to map the singularities to the image domain.
Most approaches for deep learning based sinogram inpainting are based on setting up and training a GAN to generate the missing sinogram data in order to suppress the streak artefacts from the truncated sinogram in limited-angle data. An example of such an approach is [56] that uses a U-Net generator and patch-design discriminator in the GAN to make the network suitable for standard medical tomography images. The GAN is trained against paired limited-angle and full sinogram data using a joint projection domain and image domain loss function where the weighted image domain loss can be added by back-projection. In this regime, we also refer to [92,64] for similar approach based on GANs.

Joint sinogram inpainting and reconstruction
The final series of methods jointly perform the two tasks of sinogram inpainting and reconstruction. A variational model for doing this is presented in [83] where the resulting non-convex and non-smooth minimisation is solved using an alternating (block) descent approach.
A deep learning based approach is developed in [96]. This approach combines a sinogram inpainting network and an image processing network. A key step is to use layers that correspond to Radon transform and its inverse inserted into existing convolutional network architectures. These allow one to go between sinogram and image domains and one can use the image processing network to reduce the artefacts caused by inconsistencies in the inpainted sinogram generated by the sinogram inpainting network. The three parts form an end-to-end network from sinogram domain to image domain, with benefits of taking both image error and sinogram error into account in the sync process in supervised training, i.e., with limited-angle/full-view sinogram pairs. To tackle this training data bottleneck, we develop an unsupervised train method with only limited-angle projection on our proposed network. Inspired by the observation: reconstruction and projection can form a closed loop, we can derive a fake projection from the reconstructed image, and the disparity between the fake projection and real projection provides feedback signals to train the proposed network unsupervised.
Finally, we also mention [97] that develops an approach with a transformer-based DNN architecture instead of convolutional DNNs. Streak artefacts in limited-angle tomography are non-local. Hence removing these with convolutional DNNs is challenging. One approach to encode such long-range dependencies is to use unrolling based DNN architectures, like the Learned Primal-Dual network, that couple many convolutional neural networks with the forward operator and the derivative of its adjoint. Another approach is to consider transformer-based architectures that are better suited than convolutional DNNs due to their non-local nature. As shown in [97], such transformer-based DNNs have excellent performance for reconstruction in limited angle tomography. A downside of transformer DNNs is that they are very demanding to train and require massive amounts of training data, so this approach will scale poorly. Some of the issues related to memory footprint could perhaps be addressed by using more domain adapted transformer architectures, like Fourier Image Transformer [14]. Furthermore, there are still no guarantees that [97] extrapolates the missing data in a microlocally consistent manner.

Inverse problems, ill-posedness and regularisation
To mathematically formalise the notion of an abstract inverse problem, we introduce the separable Banach spaces X (reconstruction space) and Y (data space) whose elements represent possible signals and data, respectively. In many applications, the reconstruction and data spaces are also Hilbert spaces. Next, the mapping A : X → Y (forward operator ) represents a model for how a signal generates noise free data.
The classical functional analytic/frequentist formalisation views an inverse problem as an operator equation. More precisely, an inverse problem is the task of recovering an unknown signal f true ∈ X from data g ∈ Y that is a single sample of the Y -valued random variable g defined as g : = A(f true ) + e where e is a Y -valued random variable representing observation noise. (2.1) The statistical formulation further generalises the above by first assuming one can equip X and Y with Borel σ-algebrae. One furthermore assumes that the signal and corresponding data are generated by some (X × Y )-valued random variable (f, g) [79]. The inverse problem is now to recover a suitable estimator of the conditional random variable (f | g = g) from data g ∈ Y that is a single sample of the Y -valued conditional random variable (g | f = f true ), where f true ∈ X is the true unknown signal and g = A(f) + e where e is a Y -valued random variable representing observation noise.

Ill-posedness and regularisation
A reconstruction method is formally a mapping R : Y → X that approximates the inverse of the forward operator. An inverse problem is said to be (intrinsically) unstable, i.e., ill-posed, whenever the forward operator A is not continuously invertible with respect to the topologies of X and Y . As already indicated, such inverse problems cannot be reliably solved by merely enforcing consistency against data. Regularisation theory addresses this by balancing the need for data consistency against consistency with respect to a prior model. Defining an appropriate prior model and how to balance this against data consistency are key topics in regularisation theory.
Most regularisation methods are based on the functional analytic/frequentist formalisation in (2.1) of an inverse problem. Variational models offer a powerful framework for regularisation. The reconstruction method " R : Y → X is here defined as solving a variational problem: In the above, L : Y × Y → R is the data discrepancy functional that quantifies the data consistency. It is usually taken as a suitable affine transformation of the data log-likelihood [9]. For an ill-posed problem, merely minimising f → L A(f ), g , which translates to seeking the solution that is maximally consistent with data, is an unstable procedure. The regularisation functional S θ : X → R stabilises the reconstruction by penalising those candidate solutions that are not consistent with respect to some prior model. The latter is typically given in terms of a-priori information about the (unknown) ground truth signal f true ∈ X such as sparsity or some type of regularity, see [75,15,7] for an extensive survey of various options. The parameter θ (regularisation parameter ) governs the balance between data consistency and having a solution consistent against the prior. It is low-dimensional, in many cases a scalar, and its choice depends on the magnitude of the noise in data. The statistical formalisation of the inverse problem in (2.2) contains many of the variational models. More precisely, if the regularisation functional is proportional to the negative log of a density on X, then one can often interpret (2.3) as computing a maximum a posteriori estimator. An advantage with the statistical formalisation is that it opens up for other reconstruction methods (estimators) such as the posterior mean estimator that is known to be stable in most cases [55]. This estimator is given as E[f | g = g], a formulation that requires access to the (posterior) distribution of (f | g = g). Using Bayes' theorem, one can in principle express this posterior distribution in terms of the data likelihood (g | f = f ), which is given by the physics of data acquisition, and the true (prior) distribution of f, which among others generates the true unknown signal f true ∈ X. An alternative formulation is to phrase the posterior mean as a Bayes estimator with respect to the squared L 2 -risk. Stated more precisely, given a parametrised family {R θ } θ∈Θ of admissible estimators R θ : Y → X, we consider the reconstruction operator (2.4) The above approximates the posterior mean, i.e., Rθ(g) = E[f | g = g] whenever {R θ } θ∈Θ has sufficient expressive power. Note also that the expectation in (2.4) is w.r.t. the joint law for (f, g) in X × Y . Many supervised learning approaches for reconstruction are based on replacing the expectation in (2.4) with its empirical counterpart given by supervised training data, see [5,Section 5] for an extensive survey. In particular, it is common to use a deep neural network (DNN) with an appropriate architecture to parametrise the family of estimators {R θ } θ∈Θ . One can use unrolling to define such an architecture as outlined in Subsection 2.2. Finally, besides unrolling, there exist also DNN architectures that are specifically tailored for approximating Fourier integral operators. Examples are [26,29] which develop a DNN architecture based on operator splitting techniques derived from multiscale numerical analysis. Another example is [49], which builds an interpretable DNN inspired by Fourier integral operators that approximate the wave physics. Its main focus is on using a loss based on optimal transport to learning the geometry of wave propagation captured by Fourier integral operators, which is implicit in the training data.

The Learned Primal-Dual network
A particular challenge that arises in deep learning based approaches for solving inverse problems in imaging is to handle the large scale nature of the problem as both images and data typically result in high dimensional arrays once digitised. Many of these applications also lack sufficient amount of training data, hence using a generic DNN architecture will result in a learned reconstruction operator that does not generalise well. For such applications, one needs to use a DNN architecture that is domain-adapted.
If the trained DNN Rθ : Y → X should correspond to a (learned) reconstruction operator for an inverse problem of the form in (2.2), then a natural domain adaptation is to account for the requirement that Rθ ≈ A −1 where A : X → Y (forward operator) is handcrafted (not learned). Such domain adaptation can be achieved by choosing a DNN architecture that is given by unrolling a suitable iterative scheme, see [5,Section 4]. The highly successful Learned Primal-Dual network introduced in [2] provides state-of-the-art results for tomographic imaging. A similar unrolling technique is used in [12] to construct a novel convolutional DNN architecture, called ΨDONet, for learning pseudodifferential operators. This is applied to reconstruction in limited-angle tomography.
It is based on unrolling the non-linear primal-dual hybrid gradient method [16]. Stated in a general setting, the Learned Primal-Dual network is a DNN with an architecture specifically adapted for solving an inverse problem of the form in (2.2). The Learned Primal-Dual network is here a reconstruction operator are neural networks with suitable architectures.

9
The Learned Primal-Dual network introduced in [2] for tomographic image reconstruction operates directly on arrays, which here represent discretised functions. As outlined in Subsection 3.2, discretised images in tomography are represented by arrays in R n1×n2 , whereas arrays in R m1×m2 are sinograms, which are discretisations of a real-valued function on Ξ ⊂ R × (0, π). The Learned Primal-Dual network is then a mapping R θ : R m1×m2 → R n1×n2 given as in (2.5), where the operators in (2.6) are convolutional residual neural networks (ResNets) [38]. The precise architecture used in [2] for these ResNets is given next, see also Figure 3 for an illustration of these ResNets alongside the particular architecture for the corresponding Learned Primal-Dual network. We will in Subsection 4.2 extend these ResNets, and consequently the Learned Primal-Dual network, to operators that act on functions.
The residual convolutional network (ResNet) operator is a mapping ResNet : In the above, ReLU(x) : = max{x, 0} is applied component-wise.

Tomographic imaging and the Radon transform
Planar tomographic imaging aims to recover a 2D image (signal) of the interior of an object from corresponding tomographic data. Thus, elements in X are real-valued functions on a fixed domain Ω ⊂ R 2 that represent possible 2D images, and X itself is some suitable function space, e.g., X ⊂ L 2 (R 2 ). Next, many tomographic imaging modalities rely on probing the object with x-rays. In the planar setting, these rays are all contained in a cross sectional hyperplane through the object. After adopting a simplified physics model for the interaction between x-rays and the object, measured data (after appropriate pre-processing) can be viewed as noisy digitised samples of the Radon transform of the aforementioned signal. (3.1) In the above, ω(θ) : =(cos θ, sin θ) is the unitary vector with orientation described by the angle θ with respect to the x 1 -axis and ω(θ) ⊥ : =(− sin θ, cos θ). naturally corresponds to sums of continuum images. However, it is not immediately clear how to extend a discrete convolution to a continuum convolution. For example, replacing discrete convolutions by convolutions defined over R 2 begs the question how the continuum convolution kernel should be chosen. We will see in the subsection below, that there exists a natural replacement of the convolution operator, which however, necessitates that we work with distributions instead of L 2 functions. Because of this, also the application of the ReLU will need to be generalised to distributions. We will discuss the continuum replacements of convolution and ReLU in the next two subsections. Thereafter, we present the continuum learned primal-dual algorithm.

From discrete convolutions to di↵erential operators
To find an appropriate continuum counterpart to the discrete convolution step in (4.1), we interpret the discrete convolution as a discretisation of a di↵erential operator. This approach is inspired by [68].
Consider a continuum image Furthermore, let K ✓ be a 3 ⇥ 3 convolutional operator parametrised by the filter: 13 Note that (s, θ) represents the line t → sω(θ) + tω(θ) ⊥ that is orthogonal to ω(θ) ∈ S 1 with (signed) distance s ∈ R to the origin, so A(f ) is a function on lines in R 2 . Since data in tomographic imaging can be seen as noisy samples of A(f ), the data space Y is an appropriate space of real-valued functions on lines in R 2 representing non-digitised data. The angle θ ∈ (0, π) governs the direction of the x-ray that probes the object. It typically varies during the tomographic data acquisition and limited-angle tomography is the case when θ is restricted to some interval I ⊂ (0, π). Definition 3.3 (Tomographic reconstruction). Tomographic (image) reconstruction is the inverse problem of recovering a ground truth image f true ∈ L 2 (R 2 ) from noisy measurements g ∈ L 2 (Ξ) for some open set Here, g is a single sample of the L 2 (Ξ)-valued random variable g in (2.1) (or (2.2) for the statistical formulation) with A denoting the Radon transform (Definition 3.1).

Basic properties of the Radon transform
From a functional analytic viewpoint, the Radon transform is a linear operator that maps functions on R 2 to functions on the open set R × (0, π) (representing lines in R 2 ). Continuity of the Radon transform depends on the domain of the functions chosen. As an example, the Radon transform is a bounded map on L 1 (R 2 ), implying that it is a continuous linear operator on L 1 (R 2 ) [61, Corollary 3.25], it is however unbounded on L 2 (R 2 ). The Radon transform is also invertible but its inverse is not continuous on L 1 R × (0, π) [40], yielding that the tomographic image reconstruction problem in Definition 3.3 is an ill-posed inverse problem.
We will consider the Radon transform on Schwartz functions S(R 2 ), i.e., rapidly decreasing and smooth functions with the typical locally convex topology. It is well-known that the Fourier transform maps S(R 2 ) onto itself. An analogous result holds for the Radon transform, namely that A : S(R 2 ) → S R × (0, π) is a linear one-to-one mapping [39,Theorem 2.4].

Remark 3.4. S(Ω)
for some open set Ω ⊂ R 2 is defined as the set of functions on Ω such that their extension by 0 to all of R 2 is a Schwartz function. In particular, this defines the space of Schwartz functions on any open set Ξ ⊂ R × (0, π).
Next, we introduce the back-projection as the dual to the Radon transform in a sense analogous to the way the adjoint of a linear transformation on Euclidean space is dual to the original transformation. (3. 2) The back-projection maps a function g on lines in R 2 to a function on points in x ∈ R 2 by simply averaging g over all lines that pass through x. A simple calculation shows that the back-projection is the dual to the Radon transform [61,Theorem 2.75]: The inner product on the right-hand side refers to the natural inner product on L 2 (R 2 ), whereas the inner product on the left-hand side is the natural inner product on L 2 R × (0, π) . This product is well defined in [76]. The duality in (3.3) can be used to extend the Radon transform to various classes of distributions, like compactly supported distributions [39] and tempered distributions [32,59].
In particular, one can define the Radon transform on a tempered distribution f ∈ S (R 2 ) as The extension of A * to S R × (0, π) is defined analogously and (following [61, Section 2.9.3.2 and 4.3.1]) one can in addition show that, In limited-angle tomography, data is given on lines contained in some open subset Ξ ⊂ R × [0, π). The partial Radon transform is then defined as where P S (Ξ) is the restriction of S (R 2 ) to S (Ξ). Note that the restriction of a tempered distribution S (R 2 ) to S(Ξ), where Ξ ⊂ R 2 is open, is well defined and it corresponds to a tempered distribution in S (Ξ). The corresponding (restricted) back-projection is simply defined as in (3.2) but by setting g to 0 on lines not contained in Ξ. In particular, if Ξ :

Discretisation
Similarly, a sinogram g ∈ Y is digitised by an array g ∈ R m1×m2 by evaluating g on m 1 × m 2 sample points in Ξ ⊂ R × (0, π) that are given by the data acquisition protocol used in the actual scanner. The m 1 sample points in R would simply correspond to detector elements in the scanner, whereas the m 2 sample points in (0, π) correspond to directions used in the actual scanning.

The (discretised) Radon transform and back-projection are then mappings
We wish to remark that there exist sophisticated numerical schemes for evaluating these mappings in an accurate, yet computationally feasible manner, see, e.g., [48,57,87,45,86].
It is now possible to formulate the tomographic reconstruction problem entirely in the discrete setting. This is essentially a digitised version of the tomographic reconstruction problem in Definition 3.3, i.e., to recover the digital image f true ∈ R n1×n2 from noisy measurements g ∈ R m1×m2 where g = A(f true ) + e for some (unknown) observation error e ∈ R m1×m2 .
One could then proceed to develop regularised reconstruction methods for the above finite dimensional inverse problem. This 'discretise first then reconstruct' approach was common in the 1970's with the advent of iterative schemes with early stopping, like algebraic reconstruction techniques (ART) [35], simultaneous ART (SART) [3] or simultaneous iterative reconstruction technique (SIRT) [33]. The success of FBP and subsequent variational models did however point to the advantage of developing reconstruction methods for the continuum formulation, which then are discretised. This is also the approach we take in our work, since the microlocal analysis that we will rely on is naturally formulated in the continuum setting.

Microlocal analysis in tomography
A central part of microlocal analysis is the study of how singularities are transformed by specific operators. This is not possible given only the singular support that describes the location of the singularities. Instead, an appropriate notion of singularity needs to include information about directions in the Fourier domain that causes the singularity. This leads to the wavefront set that we define below.
and a smooth cut-off function ψ ∈ S(R n ) with supp ψ ⊂ U x and ψ(x) = 1 such that the following holds for some C N > 0: The N -wavefront set WF N (u) is the complement of the set of all N -regular directed points. Finally, the wavefront set WF(u) of u is defined as From the above, we see that the wavefront set WF(u) is the set of all position-orientation pairs in Ω × S 1 along which u is non-smooth.

The canonical relation for the Radon transform and its back-projection
For many operators arising in applications, it is possible to describe how they transform the wavefront set. This description is called (microlocal) canonical relation. It plays an important role in inverse problems, since it allows one to relate singularities in data to those in the unknown signal.
The canonical relation for the Radon transform in Definition 3.1 at tempered distribution f ∈ S (R 2 ) is a precise relationship between WF(A(f )) and WF(f ). If P denotes taking the power set, then this can be expressed as a map (3.7) In limited-angle tomography we only have access to the Radon transform on an open subset Ξ ⊂ R × (0, π). The canonical relation then holds for the so-called visible wavefront set of a function/distribution f that is given by WF vis (f ) : = WF(f ) ∩ K(Ξ). (3.8) Following [50], we next provide a more precise characterisation of the canonical relation in terms of a mapping between wavefront sets in image and sinogram, respectively.
Theorem 3.8. The canonical relation for the Radon transform A : can be represented by the mapping as
Next, A is of order m = −1/2 with an amplitude function p((s, θ), x, ξ) = 1/(2π) that is homogeneous of degree zero, implying that A is elliptic. Using the derivatives (3.13), Σ φ is given by Therefore, the canonical relation can be represented by the coordinate mapping (3.14) Now, let x; ω(θ) ∈ WF(f ) be an oriented singular point of f . By (3.14) we obtain that Finally, [68, Theorem 6.3] gives This concludes the proof.
We next focus on the propagation of singularities performed by the adjoint Fréchet derivative of the Radon transform, which is the back-projections operator A * in Definition 3.5. Using [50,Theorem 13] we know that A * is also a Fourier integral operator and In the next proposition, we use (3.16) to introduce the corresponding mapping associated with the canonical relation for A * in a similar fashion to Theorem 3.8.

Computational microlocal analysis
Our aim is to develope a computational counterpart to microlocal analysis that is based on defining a notion of a 'digital wavefront set' and also to provide computational means for extracting such an object from an array that represents a discretised function. This turns out to be theoretical and computationally challenging. The definition of a digital wavefront set we will exploit in our work is based on 'discretising' Definition 3.6. More precisely, the digitial wavefront set of an array is defined as follows.
The visible digital wavefront set is defined in a similar manner.
The definition also applies to arrays f ∈ R n1×n2 and g ∈ R m1×m2 representing images and sinograms in tomographic imaging (Subsection 3.2). Note finally that Definition 3.10 suggests a natural way to represent a digital wavefront set as a multi-channel 'image'. Simply assign M binary channels to each sample point in x j ∈ Ω and set the k:th channel at that point to 1 if (x j , ω k ) ∈ DWF(u), otherwise set the value of that channel to 0.
Having defined a notion of a digital wavefront set of an array that represents a discretised function, a natural task that follows is to computationally extract such an object. This is however not possible to do in a mathematically consistent way as there is no analytic connection between the digitisation of a function and its digital wavefront set [4,Theorem 3.3]. The approach taken in [4] is therefore to view the task of extracting the digital wavefront as a statistical estimation problem. More precisely, extracting the digital wavefront set is phrased as computing the probability distribution of possible digital wavefront sets. A single digital wavefront set can then be computed by choosing at each point of the digitised function the wavefront set orientation with the highest probability. Much of [4] is devoted to developing a deep learning based approach for this task. A key part is the development of DeNSE, which is a DNN with an architecture specifically suited for computing probability distributions of digital wavefront sets of functions represented by their discrete shearlet transforms. DeNSE was in [4] successfully applied to extract digital wavefront sets of functions discretised by shearlets in both image and sinogram space in tomographic imaging.

Microlocal analysis of the Learned Primal-Dual network
This section aims to derive the canonical relation for the non-linear operator given by the Learned Primal-Dual network defined in (2.5) with A as the Radon transform (Definition 3.1). The relation allows to describe how the Learned Primal-Dual network transforms the digital wavefront set of an input array that is a discretisation of a function/distribution representing data, which is a key step in our approach to tomographic image reconstruction, outlined in Subsection 1.1.

Overview of approach
The starting point is the discretised tomographic inverse problem given in Subsection 3.2, where data and images are arrays in R m1×m2 and R m1×m2 , respectively. Assume next that the digital wavefront set is known for an input array g ∈ R m1×m2 , representing measured data. Our aim is to compute the visible digital wavefront set of the array R θ (g) ∈ R n1×n2 representing the reconstructed image, i.e., to compute DWF vis (R θ (g)) where R θ : R m1×m2 → R n1×n2 is the non-linear Learned Primal-Dual network in (2.5) for tomographic reconstruction that was first introduced in [2]. This is a DNN that is made up of stacked convolutional residual neural networks of the form (2.6) that in our setting are given as in Definition 2.1 with A = A : R n1×n2 → R m1×m2 denoting the discretised Radon transform in (3.6).
One approach to derive the mapping between g and DWF vis (R θ (g)) is to work entirely within the discrete setting. Such an approach would require us to describe how the discrete Radon transform along with its adjoint transforms the digital wavefront set of an array. An alternative approach is to utilise the rich and well-developed microlocal theory for the (continuum) Radon transform outlined in Subsection 3.3. In particular, this theory describes how the Radon transform, and hence also its adjoint (back-projection), modifies visible wavefront sets in limited-angle tomography. However, such an attempt at leveraging on the continuum theory requires us to formulate a continuum version of the Learned Primal-Dual network.
In Subsection 4.2, we derive the non-linear operator, which is a continuum version of the Learned Primal-Dual network. This results from replacing every step used in the construction of R θ : R m1×m2 → R n1×n2 with a natural corresponding continuum version, resulting in an operator R θ : Y → X, where Y and X are not necessarily finite dimensional vector spaces. The key part is to assemble appropriate continuum versions of the convolutional residual neural networks in (2.6). While a continuous convolution seems to be the most natural correspondence to a discrete convolution, this begs the question, which continuous filter to choose. Alternatively, following [74] one can view each discrete convolution as a discretisation of a corresponding fourth-order differential operator with coefficients that relate precisely to the discrete filter. Naturally, the coordinate-wise application of an activation function to a discrete input corresponds to the composition with that activation function in the continuous realm. Since the differential operators do not necessarily yield L 2 functions but only distributions, this concept needs to be extended to tempered distributions (see Definition 4.1). Finally, the residual block consists only of summation, which naturally translates to summation of continuous inputs. Subsection 4.3 shows how each of the above operations affect the wavefront set of a function or tempered distribution. A key result is Theorem 4.11 that analyses the action of ReLU. When combined, these provide a precise theoretical description of the way a continuum version of the Learned Primal-Dual network transforms the wavefront set. Its canonical relation can then be used to derive the associated digital canonical relation for the Learned Primal-Dual network in [2].

Continuum Learned Primal-Dual network
Towards the end of Subsection 2.2, we specified the architecture for the Learned Primal-Dual network that was introduced in [2] for tomographic reconstruction. This is a mapping given by a DNN that acts on arrays in finite dimensional vector spaces. In inverse problems these spaces typically represent discretised functions as outlined in Subsection 3.4. The aim here is to formulate a natural continuum version of R θ : Y → X that acts on functions or distributions that are not necessarily discretised.
As outlined in the overview in Subsection 4.1, a key step lies in appropriately extending the residual convolutional networks in (2.5). To this end, we replace every discrete operation of the Learned Primal-Dual by a continuum analogue, i.e., an operator that takes as an input a distribution. In principle, only four types of operations happen in the definition of the Learned Primal-Dual. These are the convolutions, the application of a ReLU, the application of a discretised Radon transform or the adjoint of the Fréchet derivative of the Radon transform, and taking sums of functions, either between channels in the convolution or due to residual connections. The discretised Radon transform as well as the back-projection, which is the adjoint of its Fréchet derivative, already stem from continuum counterparts which are the natural replacement. Furthermore, a sum of discrete images naturally corresponds to sums of continuum images. However, it is not immediately clear how to extend a discrete convolution to a continuum convolution. For example, replacing discrete convolutions by convolutions defined over R 2 begs the question how the continuum convolution kernel should be chosen. We will see in the subsection below, that there exists a natural replacement of the convolution operator, which however, necessitates that we work with distributions instead of L 2 functions. Because of this, also the application of the ReLU will need to be generalised to distributions. In the following two subsection we discuss the continuum counterparts of convolution and ReLU. Thereafter, we present the continuum Learned Primal-Dual network.

From discrete convolutions to differential operators
To find an appropriate continuum counterpart to the discrete convolution step in (2.7), we interpret the discrete convolution as a discretisation of a differential operator. This approach is inspired by [74].
Consider a continuum image f ∈ L 2 (R 2 ) with discretisation f ∈ R n1×n2 . Concretely, let Furthermore, let K θ be a 3 × 3 (discretised) convolutional operator parametrised by the filter: Note that (∆ ij ) 3 i,j=1 ⊂ R 3×3 such that the following forms a basis for R 3×3 : Hence, we can, for a given h > 0, express θ as Note that the 3 × 3 matrices ∆ ij can be seen as the finite difference discretisations of the partial derivatives of f if h corresponds to the distance between the sampling density underlying the discretisation f . For a smooth function f , we therefore observe that if the discretisation h goes to zero, then For an open set Ω ⊂ R 2 , this yields the following operator K θ defined on S(Ω): By duality, we can extend K θ to S (Ω). Note also that K θ is a linear second-order differential operator. In particular, it is a pseudodifferential operator with its symbol given by p θ (ξ) = β 11 + β 12 ξ 2 + β 21 ξ 1 + β 22 ξ 1 ξ 2 + β 13 ξ 2 2 + β 31 ξ 2 1 + β 23 ξ 2 2 ξ 1 + β 32 ξ 2 1 ξ 2 + β 33 ξ 2 1 ξ 2 2 for ξ ∈ Ω. (4.5) The interpretation above of a discrete convolutional operator that takes non-smooth images as inputs necessitates to have a continuum definition that acts on distributions. Consequently, also all further operations will need to be applicable to distributions.

Pointwise application of ReLU to distributions
The rectified linear unit (ReLU) is an activation function used in many neural network architectures. It is defined as the positive part of its argument, i.e., ReLU : R → R is given as ReLU(x) : = max{x, 0}. Our aim is to extend the ReLU to an operator that acts on tempered distributions on Ω ⊂ R n , denoted as ReLU.
We start by rewriting the ReLU function in terms of the Heaviside function H : R → R: The above can be used to extend ReLU in a straightforward manner to f ∈ C ∞ (Ω), by simply defining We only know that ReLU : S(Ω) → L ∞ (Ω) and in fact ReLU(f ) may not be smooth for f ∈ S(Ω). Hence, ReLU does not map S(Ω) to S(Ω), i.e., we cannot use duality to define ReLU on distributions. Using the characterisation in (4.7) to extend ReLU to distributions involves extending the Heaviside function to tempered distributions and also ensuring the subsequent multiplication is well-defined. We start by defining the Heaviside function of a distribution f ∈ S (Ω) as the characteristic function of its positive support, i.e., we define the Heaviside operator H : S (Ω) → L ∞ (Ω) as Having extended the Heaviside function to distributions as in (4.8), our main concern is to ensure that the multiplication between the distribution f ∈ S (Ω) and H(f ) ∈ L ∞ (Ω) is well-defined. By Definition A.3 and Theorem A.4, this is indeed the case whenever (x, −λ) ∈ WF(f ) for all (x, λ) ∈ WF H(f ) , i.e., we can define ReLU(f ) by (4.7) for any f ∈ S (Ω) that satisfies this criteria. However, the multiplication is not necessarily well-defined whenever there exists (x, λ) ∈ WF H(f ) such that (x, −λ) ∈ WF(f ). Thus, all we know is that the multiplication of H(f ) and f is always defined if f ∈ L 2 loc (Ω) (Remark A.5). One idea is therefore to locally dampen f close to points where we cannot define the multiplication of f with H(f ). This leads to the following definition of ReLU on distributions. where f s : =(1 − θ κ )f . Here θ κ : = 1 X * φ κ with In the above, supp L 2 (f ) ⊂ Ω denotes the L 2 -support of f (Definition A.6).
We next show that Definition 4.1 can be used to extend the ReLU function to distributions.
Proof. We need to show that ReLU κ,φκ (f ) ∈ S (Ω), whenever f ∈ S (Ω). To see this, note first that 1 − θ κ is smooth and vanishes on a neighbourhood of every Hence, the product (1 − θ κ )f is well-defined and by Theorem A.4, there does not exist an Theorem A.4 and Remark A.5 now imply that ReLU κ,φκ (f ) ∈ S (Ω) whenever f ∈ S (Ω), which concludes the proof. 3. f = P (1 B + h) for some domain B ⊂ Ω and P is an elliptic linear differential operator. Assume We conclude by pointing out that ReLU κ,φκ : S (Ω) → S (Ω) in Definition 4.1 fulfils almost all of the criteria stipulated earlier for an extension of ReLU to distributions. It satisfies the first and third criterium. Moreover, ReLU κ,φκ (f )(φ) = f (φ) holds for all φ ∈ S(Ω) with a support that has a distance of more than 2κ from WF H(f ) ⊂ ∂ supp + (f ).

The continuum Learned Primal-Dual network
To define the continuum Learned Primal-Dual network, we start by introducing a continuum ResNet. nj−1,nj l=1,k=1 ⊂ (R 3×3 ) nj−1×nj−1 be a set of coefficients. Let κ > 0 and let φ κ ∈ S(Ω) be a function that integrates to 1, is positive and is supported on a compact subset of B κ (0).
2. Note that besides the previously defined operators K θ l,k j and ReLU κ,φκ , only addition is applied in the continuum ResNet. Since the set of distributions is a linear space, we conclude that ResNet κ,φκ is a well-defined operator from S (Ω) to S (Ω).
Based on the definition above for the continuum ResNet, we can now define the continuum Learned Primal-Dual network as in (2.5) with operators in (2.6) given as continuum ResNets. Hence, continuum Learned Primal-Dual network is a mapping given by the N -step iterative scheme in Algorithm 1 in which Λ i and Γ i are continuum two-dimensional convolutional ResNets as in Definition 4.4.
Output: Primal solution f N ∈ S (R 2 ) and dual solution h N ∈ S (Ξ).

Canonical relation for the continuum Learned Primal-Dual network
It is clear that we can describe the canonical relation for the continuum Learned Primal-Dual network, if we can identify the one of the continuum ResNets Λ i and Γ i for i = 1, . . . , I. In addition, we also need to combine these canonical relations with the canonical relations for the Radon transform, and the relations for the Fréchet derivative of the adjoint of the Radon transform.

Differential operator
The canonical relation for a differential operator is typically very straight-forward to compute, if this operator is an (elliptic) pseudodifferential operator. In this respect, we recall the following result. If P is elliptic, then we have equality instead of inclusion, i.e., sing supp(Pf ) = sing supp(f ) and WF(Pf ) = WF(f ) for all f ∈ S (R 2 ).
Since K θ defined in (4.4) is a pseudodifferential operator, we obtain that WF(K θ f ) ⊂ WF(f ). (4.11) This means that K θ does not introduce new singularities to f , and might even delete some of them; in the case that the coefficients β ij are such that 0 < |p θ (ξ)| for all ξ = 0, the operator K θ is elliptic and preserves the singularities, i.e., WF(K θ f ) = WF(f ). Here p θ is the symbol defined in (4.5).

ReLU application
Since for h ∈ S (R 2 ) the distribution ReLU κ,φκ (h) is defined in most parts of the domain as H(h)h, we can study its wavefront set using Theorem A.4. We now first study the wavefront set of H(h). Afterward, we estimate the wavefront set of ReLU κ,φκ (h) in Subsection 4.3.4.

The wavefront set of H(f )
For a function g ∈ L 2 (Ω), the wavefront set of H(g) is determined through the following factors: A point x ∈ Ω, such that on a neighbourhood thereof g is almost always positive will be mapped to something constant by the Heaviside function. Since constant functions are smooth, this operation erases the wavefront set associated with a neighbourhood of x. The same argument can be made on neighbourhoods where g is almost everywhere negative. Points x ∈ Ω in which g vanishes have the potential to create new discontinuities, since the Heaviside function has a jump in 0. If g is smooth in x and also has non-vanishing gradient, then the implicit function theorem tells us the form of the discontinuity of H(g). We will see below in Proposition 4.8 that the same argument can be made for tempered distributions. A crucial ingredient for that result will be the following estimation of the wavefront set of indicator functions.
We can now state the following result describing the wavefront set of H(g).
Proof. We start with the "only if" part. The statement is clear if WF H(g) = ∅. Otherwise, let (x, λ) ∈ WF H(g) . Assume first that x ∈ ∂(supp + (g)) c . Then either x ∈ supp −,0 (g) or x ∈ supp + (f ) • . Since both supp −,0 (g) and supp + (g) • are open sets, we have that there exists an open neighbourhood U of x such that U ⊂ supp −,0 (g) or U ⊂ supp + (g) • . As a result H(g) is constant on U . Therefore, (x, λ) cannot be in WF H(g) , which produces a contradiction.
Hence, we can assume that (x, λ) ∈ WF H(g) and x ∈ ∂(supp + (g)). In addition, we assume that x ∈ C g ∪ S g . Then, x ∈ sing supp(g). Therefore, there exists a neighbourhood U of x, on which g is smooth and ∇g does not vanish.
We wish to show now that on U the set {g = 0} is a smooth curve with normal ∇ x g at x. For this, we invoke a smooth version of the implicit function theorem [54, Theorem 2.1]. In this form, the theorem considers a smooth functiong : Ω → R such that Assuming that ∂g ∂x2 = 0, then there exists a smooth κ defined on a neighbourhood of x * 1 such that locally, i.e., for x 1 in an open neighbourhood of x * 1 , g(x 1 , κ(x 1 )) = 0 and κ (x 1 ) = ∂g Moreover, in an open neighbourhood of x * 1 , x * 2 every (x 1 , x 2 ) such thatg(x 1 , x 2 ) = 0 is of the form (x 1 , κ(x 1 )). Applying the implicit function theorem to g if ∂g ∂x2 = 0 yields that η x = ∇g(x)/ ∇g is a normal at the zero level set of g at x. By swapping variables, the same argument can be made if ∂g ∂x1 = 0. We obtain that locally H(g) = 1 B with ∂B being a smooth curve that has normal η x at x. By Proposition 4.7, this implies that (x, λ) ∈ WF(1 B ) if λ = ±η x for an α = 0. This concludes the proof of the "only if" part.
For the "if" part, we notice again that if x ∈ R g , then x ∈ ∂(supp + (g)). Thus, x ∈ S g which implies that x ∈ sing supp(g). Therefore, and since x ∈ C g , the implicit function theorem is applicable. The same argument as before yields (4.12).
Remark 4.9. We may ask ourselves whether or not the statement in Proposition 4.8 is tight. To improve our intuition in this regard, we provide an example for each of the cases of Proposition 4.8 that may lead to a wavefront set.
1. Creation of wavefront set according to (4.12): Let g(x) = 1 − x 2 . The squared Euclidean norm is a smooth function. We easily observe that is the unit circle. Moreover, H(g) = 1 B1 is the indicator of the unit ball. It is not hard to see that the wavefront set of this function is 2.
x ∈ C g and x ∈ sing supp H(g) : Let g 1 be a positive C ∞ function, supported on a set D 1 that contains (0, 0). Let g 2 be another such function, however, with is not an open neighbourhood of (0, 0), which is possible, then H(g 1 + g 2 ) is discontinuous at (0, 0) implying that (0, 0) is a singular point of H(g 1 + g 2 ).
3. x ∈ C g and x ∈ sing supp H(g) : Let g be a smooth compactly supported positive function. Then every x ∈ ∂ supp(g) satisfies that x ∈ C g and x ∈ sing supp H(g) = sing supp(g).
4. x ∈ S g and x ∈ sing supp H(g) : Let g = 1 R + ×R , then x ∈ S g and g = H(g).

5.
x ∈ S g and x ∈ sing supp H(g) : is not smooth since it has a jump in its third derivative at the x 1 = 0 axis. At the same time {x 1 = 0} = ∂(supp + (g)). Finally, we observe that H(g)(x) = φ(x) and hence the wavefront set of H(g) is empty.
Notice that Proposition 4.8 stays short of a precise characterisation of the wavefront set of H(f ). It implies that all singularities must be in one of the sets R g , C g , or S g . However, there is a closed-form of the orientations of the singularities only if x ∈ R g .
In particular, for g ∈ S (R 2 ) and Let Ω ⊂ R 2 , f ∈ S (Ω), and let us assume that WF(f ) is known. In addition, using the results of Subsection 4.3.3 we have also access to WF(H(f )). We denote as in Definition 4.1: and (4.14) Note that per Definition 4.1 we have that θ κ = 0 on (X 3κ h ) c and hence h s = h on (X 3κ h ) c . Now we have collected all necessary ingredients to be able to compute the wavefront set of ReLU κ,φκ (f ) = H(f )f , which is the goal of the following result.
where R f is defined as in Proposition 4.8. In addition, CS f is given by where C f and S f are defined as in Proposition 4.8. Then WF(ReLU κ,φκ (f )) is given by:

19)
In particular, we have that Proof. Since R 2 can be decomposed as we have that WF(ReLU κ,φκ (f )) can be decomposed as Notice in addition that A f,κ , B f,κ and D f,κ are disjoint. Now, since supp + (f ) o is open, we find for every where A f is as in the statement of the proposition. Moreover, for every On the other hand, since ess supp(H(f )) = supp + (f ), by Theorem 4.10, we can conclude that Then, we have Let us now study the set D f,κ : = WF(ReLU κ,φκ (f )) ∩ (∂(supp + (f )) × S 1 ). Following the notation of Proposition 4.8, we can decompose the set ∂(supp + (f )) as Using this decomposition, we can write D f,κ as Next, we would like to show that Let us start with (4.26). Consider first (x, λ) ∈ R f , x ∈ X 3κ h . Then, x ∈ R f and thus x ∈ sing supp f . In particular, x ∈ sing supp f s . Moreover, since ∇ x f = 0, we conclude that x ∈ ess supp(f ) and therefore x ∈ ess supp(f s ). Using Theorem 4.10, we conclude that (x, λ) ∈ WF(ReLU κ,φκ (f )) = WF(f s H(f )) if and only if (x, λ) ∈ WF(H(f )). Since x ∈ C f ∪ S f , we conclude from Proposition 4.8 that (x, λ) satisfies (4.12). To show the converse embedding, assume that (x, λ) is such that x ∈ R f and (x, λ) satisfies (4.12). By Proposition 4.8, we have that (x, λ) ∈ WF(H(f )). Furthermore, x ∈ sing supp f and ∇ x f = 0, which implies that x ∈ ess suppf . We conclude by Theorem 4.10 that (x, λ) ∈ WF(f s H(f )) = WF(ReLU κ,φκ (f )). This shows (4.26).
To show (4.27) it suffices to observe that x ∈ R f implies that x ∈ sing supp f and hence x ∈ sing supp(1 − θ κ )f . Therefore, we conclude that , where the last equality follows from Proposition 4.8. This yields (4.27).
The full result now follows by considering the decomposition (4.21). The part associated with A f,κ is estimated via (4.22) and (4.23). The part associated with B f,κ vanishes due to (4.24). Finally, the part associated with D κ,φκ is estimated via the decomposition (4.25), where the R f,κ part is estimated via (4.26) and (4.27) and CS f = CS f,κ holds per definition. Remark 4.12. Theorem 4.11 only yields an estimate for the wavefront set associated with the set (X κ h ) c . In the sequel, since we are using the continuum relations to infer certain properties of digital relations, we will assume that κ is chosen very small and X κ h is not seen by the discretisation. In other words, in practice, we compute the wavefront set only via (4.18).
Even on (X κ h ) c , Theorem 4.11 does not entirely save us from computing WF(ReLU κ,φκ (f )), but restricts the necessity for such a computation to the cases where x ∈ S f ∪ C f . The set CS f can also be further split up: For x ∈ R 2 \ X κ h , we denote by WF(f ) x the x-slice of the wavefront set of f defined as Λ ⊂ R 2 such that (x, λ) ∈ {x} × Λ for all (x, λ) ∈ WF(f ).

Proposition 4.13.
Let Ω ∈ R 2 be open, f ∈ S (Ω) be a distribution and CS f be as in Theorem 4.11. Then (4.28) Proof. The result follows immediately from Theorem 4.10.
Theorem 4.11 yields two ways to estimate the wavefront set of ReLU κ,φκ (f ) on (X κ h ) c . First, it can be precisely computed by (4.18). This, however, may require us to compute CS f via Proposition 4.13. This computation could be performed according to (4.28), by using a method such as DeNSE to find WF(ReLU κ,φκ (f )) if required.
Second, the wavefront set on (X κ h ) c can be estimated using (4.20). Since we expect that it is not problematic to overestimate the wavefront set slightly, we opt for the second option and cast this wavefront set extraction algorithm as Algorithm 2.
Input: Distribution f ∈ S (Ω), WF(f ), x ∈ Ω. Because of this, we would like to identify WF(f + g) for two distributions of which we know the wavefront set. The following two results yield a complete description thereof. In particular, if (x, λ) ∈ WF(f ) and (x; λ) is a regular directed point of g, then (x, λ) ∈ WF(f + g).

Corollary 4.15.
Let Ω ⊂ R 2 be open and let f, g ∈ S (Ω), then WF(f + g) is given by In particular, we have that Proof. The result follows immediately from Theorem 4.14.
In a similar fashion as we did for the wavefront set of ReLU(f ), using Corollary 4.15 we can find two ways to extract the wavefront set of f + g via Corollary 4.15. We cast the one that yields a superset of the wavefront set of f + g via (4.30) as Algorithm 3.
Algorithm 3: Wavefront set classifier of f + g.

Microlocal analysis of continuous Learned Primal-Dual networks
Let us first notice that the continuum residual neural network operator as stated in Definition 4.4 has four basic components, the differential layers, summation over channels, application of the ReLU, and the residual connection. We have seen in Subsection 4.3.5 that the effect on the wavefront set through summation over channels and the residual connection can be described as the output of Algorithm 3. In addition, the effect of the differential layers is described by (4.11) and the wavefront set after an application of the ReLU can be found through an application of Algorithm 2. Overall, there is an algorithm that produces for every continuum convolutional ResNet and every input function of which the wavefront set is known, an estimate of the wavefront set of the output. The microlocal behaviour of the full continuum Learned Primal-Dual network (Algorithm 1) can now be found by iteratively applying the canonical relation for the continuum residual neural network operator as well as the canonical relation for the Radon transform (3.11) and the adjoint of its Fréchet derivative (back-projection) in (3.17).

The joint reconstruction and wavefront set extraction algorithm
In this section, we present our numerical algorithm for the digital reconstruction problem associated with the inverse tomography problem. We first introduce the algorithm in the framework of statistical learning theory as an empirical risk minimisation problem over a special set of deep neural networks with a specific loss term. Then, we evaluate our approach on a test set and compare its results with various benchmarks.

Outline of our algorithm
The setting for our data driven approach is to have a supervised tomographic training dataset of the form Here, f i ∈ R n is an array that is a discretisation of a function f ∈ X = L 2 (R 2 ) that represents the true image (signal). Likewise, g i ∈ R m is an array corresponding to a discretisation of continuum data, which here is a function g i ∈ Y = L 2 (Ξ) representing a noisy limited-angle sinogram with Ξ denoting the corresponding manifold of lines. In addition, we also know the complete wavefront set for data g i and signal f i . Subsection 5.2 explores this statistical framework approach in more detail.
From the above, in absence of observation noise we have g i = A(f i ) where A : X → Y is the Radon transform (forward operator) restricted to Ξ (limited-angle data). The elements (f i , g i ) ∈ X × Y are assumed to be independent samples generated by some (X × Y )-valued random variable (f, g). The reconstruction operator R θ : R m → R n is now a DNN given by the Learned Primal-Dual network [2] and its parameter θ =θ is set by training this DNN to achieve two tasks simultaneously: 1. Applying Rθ : R m → R n (trained reconstruction operator) to an input sinogram g ∈ R m that is a discretisation of g ∈ Y drawn from g, should produce f ∈ R n that is a discretisation of a function f ∈ X such that A(f ) = g.

2.
The trained reconstruction operator Rθ : R m → R n should output a discretisation of the wavefront set of f , which we denote by DWF(f ), i.e., DWF Rθ(g) = DWF(f ) While we are, in the end, only interested in an accurate reconstruction of f (see Subsection 5.3), we believe that requiring a neural network to perform both tasks at the same time constitutes a strong prior. Indeed, we will see below that this joint approach yields vastly superior reconstruction results to a neural network that only reconstructs f . As a key ingredient towards the solution of the aforementioned task, we can rely on a digital wavefront set extraction operator DeNSE [4] which allows us to extract the digital wavefront set of a digital image, this task is briefly introduced in Subsection 5.4. Furthermore, it is reasonable to construct an appropriate loss function L that measures the discrepancy between Φ(g) and f and between DWF(Φ(g)) and DWF(f ). However, it turns out that for the training of the neural network it is computationally prohibitive to compute DWF(Φ(g)) in every training iteration with the shearlet-based wavefront set extractor DeNSE [4]. Because of this, we introduced a heuristic based on the mapping properties of a continuum operator, the continuum Learned Primal-Dual network, to estimate DWF(Φ(g)) in Subsection 4.3.
Through the canonical relation for the continuous Learned Primal-Dual operator, we obtain an estimate of a discretisation of the visible wavefront set WF vis (Φ(g)) based on the weights of the neural network Φ. To relate this estimate to DWF(f ), we apply wavefront set inpainting with a neural network Ψ of U-Net architecture [72], this is explained in detail in Subsection 5.5. Overall, this leads to an operator Q that takes as an input g and the neural networks Φ and Ψ and outputs an estimate of f and DWF(f ). We therefore use the loss function where C ∈ (0, 1] and 1 , 2 are appropriate distance measures on discretised images and discretised wavefront sets that will be discussed in Subsection 3.4 below. This strategy that jointly trains a reconstruction and a task (wavefront set inpainting) falls in the framework of task-adapted reconstruction was introduced in [1]. Based on the loss function of (5.1), we now train the neural networks Φ and Ψ to minimise the objective To find a minimiser, we use stochastic gradient descent-based optimisation over the neural network's weights as is standard in deep learning. Finally, Subsection 5.6 explores in detail the joint reconstruction in the digital setting.

Statistical learning framework
We would like to frame the problem of reconstruction and joint wavefront set extraction as a statistical learning problem [77,20]. For this, we first introduce some relevant notions: Definition 5.1 (Setting of statistical learning theory). Let X, Y be two sets. We call X the sample space, and Y the label space. Further, let D be a distribution on X × Y , and let : Y × Y → R be a measurable function. We refer to as loss function. Then the risk R of a measurable function h : X → Y is defined as For m ∈ N, a set of samples (x i , y i ) m i=1 ∼ D m and for a hypothesis class H, we define the empirical risk minimiser h * as Under certain assumptions on the complexity of the hypothesis set H, it can be shown that for sufficiently large m, the function h * will also achieve a small risk [77,20]. This means that the empirical risk minimiser h * resolves the unknown relation between input and output described by D. Because of this, we treat the empirical risk minimisation problem in this work as the central problem of interest.
Next we would like to phrase the problems of digital tomographic reconstruction, of digital visible wavefront set extraction, of full digital wavefront set extraction via inpainting, and of the joint reconstruction and digital wavefront set extraction and inpainting in the framework of statistical learning theory. 29

Digital tomographic reconstruction
A joint probability distribution on function spaces as described through the statistical formulation of the tomographic reconstruction problem in Definition 3.3 implies, via the discretisation procedure of Subsection 3.4, a joint probability distribution D rec on discretised image-sinogram data. In the discrete problem, the sample space is R Is =: X for I s ⊂ [m 1 ] × [m 2 ], where [M ] : ={1, . . . , M }, with the label space being defined by R n1×n2 =: Y . As loss function we choose rec to be the squared Euclidean distance between two elements in Y , i.e., Moreover, we choose as a hypothesis set H rec : =(h ϑ ) ϑ∈Θres the set of functions which perform the mapping f 0 → f I according to Algorithm 1 parametrised via the weights of the discrete two-dimensional convolutional ResNets introduced in Definition 2.1.

Digital visible wavefront set extraction
Similarly to the previous Subsection 5. 3, the discretisation procedure of Subsection 3.4 implies a joint distribution between the digital wavefront set of a measured sinogram and the digital visible wavefront set of the image, precisely defined below: As a hypothesis set we choose H vis : =(ĥ ϑ ) ϑ∈Θres as the set of maps resulting from the digital canonical relation described in Subsection 4.3.6. These maps are parametrised via the weights of the discrete two-dimensional convolutional ResNets introduced in Definition 2.1. The important point to notice here is that H vis and H rec are parametrised by the same parameter set.

Digital wavefront set extraction and inpainting
To reconstruct the full digital wavefront set of an image from its sinogram data, we combine the visible wavefront set extraction of the previous subsection with a wavefront set inpainting step which will be performed using a U-Net. In this case, the sample set is

Joint reconstruction and digital wavefront set extraction and inpainting
We will now describe a joint approach in which the reconstruction problem as well as the wavefront set reconstruction and inpainting problem are solved simultaneously. This task certainly requires that the hypotheses classes underlying the risk minimisation problem are, at least partially, parametrised by the same parameter set. This joint approach leads to the sample space being chosen as X joint : = R Is × R Is × {0, 1} I d for I s ⊂ [m 1 ] × [m 2 ] and I d ⊂ [180], and the label space being defined as Y inp : = R n1×n2 × (R n1×n2 × {0, 1} 180 ). The loss function needs then to be chosen joint as a weighted average of rec and inp in the following sense: For a λ ∈ (0, 1] we set, for g 1 , The task-adapted reconstruction architecture involving the joint loss function (5.4) is depicted in Figure 4. Finally, the hypothesis set for the joint approach is given by H joint : =((h ϑ , h ϑ,θ )) ϑ∈Θres,θ∈Θ U .

Numerical results
We now provide numerical results for the joint reconstruction and wavefront set extraction algorithm of Section 5. We first present the set-up, followed by a description of our results, and finally an interpretation of the outcome.

Set-up
The set-up follows our viewpoint taken in Subsection 5.6, namely regarding the joint reconstruction problem from an empirical risk minimisation standpoint. Therefore, as a training set we use an artificial data set comprised of piecewise smooth functions, where the singularity curves are given by random splines of degree at most four and the smooth regions are polynomials of degree at most two. We call this data set the random cartoon-functions data set. Some examples of functions in this data set were shown in Figure 2.
Notice that for the images in this data set the digital wavefront set is known. We then study two types of partial data, namely limited-angle data, where we assume that a wedge of 40 • is missing (a total angle interval of 80 • ), and sparse-view, where we only measure 40 angles. This is followed by evaluating the trained task-adapted architecture (see Figure 4) on the OASIS dataset [53] formed by real brain scans (see https://www.oasis-brains.org/). The results are then compared with classical approaches for the limited-angle and sparse-view tomography, including filtered back-projection [50], Tikhonov [82], and total variation [88]. In addition, we have also performed varitional regularisation using as regulariser the L2 and L1-norm of the shearlet coefficients of the reconstruction [37,52]. We named these methods Shearlet-L2 sparse and Shearlet-L1 sparse. The deep-learning based benchmarks, namely the Learned Primal-Dual [2] and the Phantom-Net architectures [13], were similarly trained using the random cartoon functions data set. The code to reproduce our experiments can be found in http://shearlab.math.lmu.de/applications.

Results
We now present some exemplary results for our algorithm in the case of limited-angle tomography (80 • wedge) and sparse-view (40 measured angles). The visual results are depicted in Figures 5 and 6. The comparison with the benchmark approaches from Subsection 6.1 is shown in Table 1.
In addition, we show in Figures 7 and 8 reconstruction results of the limited-angle and sparse-view tomography for the benchmarks already mentioned in Subsection 6.1. In all cases, the joint approach outperforms all classical approaches. Table 1 presents the performance measure of the benchmarks in terms of the average self similarity (SSIM) and peak signal-to-noise ratio (PSNR).

Interpretation
In Table 1, we see that the joint approach significantly outperforms all competing approaches. The same observation can be made by observing Figures 8 and 7 that demonstrate a strong improvement in the reconstruction accuracy in comparison with the other methods and in particular to the Learned Primal-Dual architecture. In this context, it is noteworthy that the performance gap in this application is much more pronounced than in the sparse-view application. Since the Learned Primal-Dual corresponds to one half of the pipeline of the joint approach, we observe that the additional wavefront set information is especially helpful in the limited-angle case. Since the gaps in the visible wavefront set are significantly smaller in the sparse-view set-up, we expect that, in this set-up, considerably more information on the wavefront set is already included in the observed data, whereas in the limited-angle set-up more prior knowledge needs to be invoked. It appears as if this necessary prior knowledge was very successfully identified in the training phase of the joint approach.
In Figures 5 and 6 we performed a type of ablation study by looking at the prowess of the inpainting algorithm alone. It transpires from those figures that the inpainting is less successful if it is not coupled with the reconstruction pipeline. This clearly shows that the strength of the joint approach does not solely lie in wavefront set identification and inpainting but in the interplay of inpainting and reconstruction jointly.
We wish to remark that our algorithm is not trained on real-world images but instead on artificial phantoms from the random cartoon-functions data set. This shows that the joint prior that incorporates physical information is not based on memorisation. From the standpoint of applications, this could be a very desirable feature, as it prevents overfitting on specific biases in real-world data sets.
In Figure 6, we see that the learned primal dual, as well as the joint approach, appear to produce reconstructions that exhibit almost no texture-like areas but instead are piecewise smooth. This effect could potentially be an effect of the underlying data set. In this context, it may be worthwhile to expand the training data set by artificial images with more texture like features.

A Notions and results from distribution theory
Below, we collect some essential notions and results from distribution theory and the analysis of the wavefront set. These were used extensively in Subsection 4.3.4.
Definition A.1 (Essential support [73]). Let Ω ⊂ R n be a domain. The essential support of f ∈ S (Ω) is the defined as the set esssup(f ) : = R n \ U ∈U U, where U : = U ⊂ Ω : U is open and f U = 0 with f U denoting the restriction of f to U ⊂ Ω.
Note that if f ∈ C(Ω), then esssup(f ) = supp(f ), i.e., the essential support coincides with the usual support. We also introduce the notion of positive and negative supports of a distribution. From the definition, we see that supp + (f ) = supp |f | + f = {x ∈ Ω : f (x) > 0 whenever f ∈ C ∞ (Ω).
We now turn our attention to defining the product of two distributions. Let Ω ⊂ R n be a fixed domain and consider the distributions u, v ∈ S (Ω). We say that w ∈ S (Ω) is the product of u and v if an only if, for each x ∈ R n , there exists a test function ψ ∈ S(Ω), with ψ = 1 on a neighbourhood of x, so that ' ψ 2 w(ξ) = " ψu * ψv (ξ) = converges absolutely for each ξ ∈ R n . In addition, under this conditions, we can define w by (A.1).
This yields absolute convergence in (A.1).
Definition A.6. The L 2 -support of f ∈ S (Ω) is defined as the largest open set on Ω where f is given by an L 2 -function: Thus, if x ∈ supp L 2 (h) then there is an open set x ∈ U ⊂ Ω and f U ∈ L 2 (U ) such that f (φ) = U f U (x)φ(x)dx for all φ ∈ S(U ).