Task adapted reconstruction for inverse problems

The paper considers the problem of performing a post-processing task defined on a model parameter that is only observed indirectly through noisy data in an ill-posed inverse problem. A key aspect is to formalize the steps of reconstruction and post-processing as appropriate estimators (non-randomized decision rules) in statistical estimation problems. The implementation makes use of (deep) neural networks to provide a differentiable parametrization of the family of estimators for both steps. These networks are combined and jointly trained against suitable supervised training data in order to minimize a joint differentiable loss function, resulting in an end-to-end task adapted reconstruction method. The suggested framework is generic, yet adaptable, with a plug-and-play structure for adjusting both the inverse problem and the post-processing task at hand. More precisely, the data model (forward operator and statistical model of the noise) associated with the inverse problem is exchangeable, e.g., by using neural network architecture given by a learned iterative method. Furthermore, any post-processing that can be encoded as a trainable neural network can be used. The approach is demonstrated on joint tomographic image reconstruction, classification and joint tomographic image reconstruction segmentation.

1. Introduction.The overall goal in inverse problems is to determine model parameters such that model predictions match measured data to sufficient accuracy.Such problems arises in various scientific disciplines.One example is biomedical imaging where the image is the "model parameter" that needs to be determined from data acquired using an imaging device like a tomographic scanner or a microscope.The prime example of this is tomographic imaging in medicine which has revolutionized health care over the past 30 years, allowing doctors to find disease earlier and improve patient outcomes [19,82].Likewise, scientific computing is nowadays considered to be the "third pillar of science" standing right next to theoretical analysis and experiments for scientific discovery, much thanks to possibilities for simulating and optimizing complex physical and engineering systems.A key element in realizing this role is the ability to solve the inverse problem of calibrating parameters in a mathematical model of the system so that simulations match benchmark data [8].
The inverse problem of reconstructing the model parameter from data is often only one out of many steps in a procedure where the recovered model parameter is used in decision The second row represents the data acquisition where raw data is acquired and pre-processed, resulting in cleaned data.In the third row, the cleaned data is used as input to a reconstruction step that recovers the model parameter, which is then post-processed to extract features that are used as input for model building.The final outcome is a task adapted model that can be used for decision making.
The dotted rectangular part outlines the steps that are unified by task adapted reconstruction.
making.The reconstructed model parameter is typically summarized, either by an expert or automatically, and resulting task dependent descriptors are then used as basis for decision making, see Figure 1.
Clearly, there are several disadvantages with performing the various parts of the above pipeline independently from each other.Each single step is prone to introduce approximations that are not accounted for by subsequent steps, the reconstruction may not consider the end task, and the feature extraction may not consider measured data.In fact, the task is almost always only accounted for at the very final step.It is therefore natural to ask whether one may adapt the reconstruction method for the specific task at hand.Task adapted reconstruction refers to methods that integrate the reconstruction procedure with (parts of) the decision making procedure associated with the task.This is sometimes also referred to as "end-toend" reconstruction.
2. Overview.We start with a brief survey of existing approaches to task adapted reconstruction in the context of tomographic image reconstruction (Section 4), which also points out the drawbacks that come with these approaches.The section that follows (section 5) introduces the statistical view on inverse problems.More specifically, we consider Bayesian inversion (section 5.1) in which a reconstruction method is a statistical estimator (section 5.2).After pointing out some key challenges associated with Bayesian inversion (section 5.3), we introduce the learned iterative methods (section 5.4) that are later used in our applications of task adapted reconstruction (section 7.3).We then switch gears and consider tasks on the model parameter space that can be formulated as a statistical estimation problem (section 6).The first step is to provide an abstract framework (section 6.1) with a plug-and-play structure for adapting to a specifik task.To further illustrate the wide applicability of this framework, section 6.2 describes a number of tasks that are worked out in detail followed by further examples in section 6.3.
Section 7 introduces task adapted reconstruction in an abstract setting (section 7.1).It assumes that both the reconstruction (section 5.2) and task (section 6.1) are given by appropriate decision rules.An important part is the computational implementation (section 7.2) that is based on neural networks.This is followed by two applications that are worked out in detail in section 7. 3.In section 8 we provide some theoretical considerations regarding regularizing properties and the potential advantage that comes with using a joint approach.The final section (section 9) contains a discussion and outlook on future research in this area.
3. Specific contributions.The paper offers a generic, yet highly adaptable, framework for task adapted reconstruction that is based on considering both the reconstruction and the task as statistical estimation problems.The implementation uses neural networks for both these steps, which is essential for both performance in terms of quality and computational feasibility as shown in the example for joint tomographic reconstruction and segmentation.Both networks are trained jointly using a joint loss function (20) that "interpolates" between sequential and end-to-end approaches.Here, sequential refers to a setting where the neural network for reconstruction is trained separately and its output is used in the training of the neural network for the task.End-to-end is when a neural network for the task is trained directly against data without explicitly introducing a reconstruction step.
To the best of our knowledge, this is the first paper that offers an approach to task adapted reconstruction that unifies reconstruction with such a diverse set of tasks in a computationally feasible manner under the guiding principles of statistical decision theory, learning, and efficient inference algorithms.This allows for re-using algorithmic components thereby opening up new ways of thinking about machine learning and inverse problems that may ultimately lead to deeper understanding of the possibilities for integrating elements of decision making into the reconstruction.Furthermore, introducing the joint loss in (20) and investigating its properties (section 8) are novel contributions.Our work also leaves many open questions and future research directions for the inverse problems and machine learning communities as outlined in section 9.

Survey of task adapted tomographic reconstruction.
There is an ongoing effort within the inverse problems community to include signal processing steps associated with performing a task jointly with the reconstruction step.In tomographic imaging, most of the tasks considered this far correspond to feature extraction, e.g., segmentation or extraction of features expressed through sparse representations as in compressed sensing.In such case, task adapted reconstruction reduces to joint reconstruction and feature extraction.
Current approaches to task adapted reconstruction are primarily based one the classical approach to inverse problems.In this setting, the problem is to recover the true unknown feature d * ∈ D from data y ∈ Y by solving an operator equation: (1) y = A(x * ) + e and d * = T(x * ).
The forward operator A : X → Y models how a model parameter (image in tomographic imaging) gives rise to data and the task operator T : X → D represents the feature extraction.
In the above, both these are assumed to be known.Likewise, e is generated by a Y -valued random variable e ∼ P noise with known distribution, so task adapted reconstruction reduces to finding the dotted operator in (2). (2) Note that the task operator T : X → D is often highly non-injective, so it makes no sense to consider A •T −1 : D → Y as "new forward operator".Approaches based on "solving" (1) heavily depend on the nature of the T and below is a brief list of prior work in the context of tomographic imaging.Edge recovery: Lambda-tomography [53] is a non-iterative method that recovers edges directly from noisy tomographic data using the canonical relation from microlocal analysis.Another non-iterative approach combines the method of approximate inverse with an explicit task operator, e.g., a Canny edge detector [62].Finally, it is also possible to use a variational approach with suitable regularizer.Examples relevant for edge recovery are variants of total variation (TV) [11,7] or sparsity promoting 1 -type of regularizers with an underlying dictionary that is specifically designed to sparsely represent edges, like curvelets, shearlets, beamlets, and bandlets [27,83,55].Segmentation: Methods for joint reconstruction and segmentation is an active area of research.Most approaches are based on a variational scheme with suitably chosen regularizers and control variables.One example is usage of Mumford-Shah penalty [78,44], another is based on level set approaches [99].A further refinement is to consider semantic segmentation.Here, a variational scheme that amounts to computing a maximum a posteriori (MAP) estimator with a Gauss-Markov-Potts type of prior shows promising results on small-scale examples [70,80].Image registration: To register a template against an indirectly observed target (indirect image registration) is a key step for reconstruction in spatiotemporal imaging.The (temporal) deformation can be modeled using optical flow [10] or diffeomorphic deformations [14,32].Yet another approach is to consider optimal transport [50].Approaches to task adapted reconstruction that are based on solving (1) suffer from two issues that seriously limit their usefulness in practical imaging applications.The first is the requirement for an explicit handcrafted task operator T : X → D.More advanced tasks, like many mentioned in sections 6.2 and 6.3, are difficult to encode in this way and available examples mostly consider task operators that extract "simple" features that must be further processed before they can be used for decision making.This is also the reason for why the term "feature reconstruction" [62] is used as a proxy for task adapted reconstruction.The second issue relates to computational feasibility.Evaluating the task operator, like in segmentation and image registration, is computationally demanding and requires setting values for extra (nuisance) parameters.Furthermore, most state-of-the-art approaches for solving (1) are based on variational methods, which quickly become computationally unfeasible for large scale imaging problems.
Many complex tasks have been successfully addressed using techniques from machine learning, so it makes sense to investigate whether such techniques can be integrated with reconstruction for task adapted reconstruction.One example is given in [96] for abnormality (tumor) detection in low-dose computed tomography (CT) imaging.The idea here is to jointly train a learned iterative scheme for reconstruction [2] with a 3D convolutional neural network for detecting the abnormality in the reconstructed images.Another examples introduces a unified deep neural network architecture (SegNetMRI) for combined Fourier inversion (MRI image reconstruction) and segmentation [89].Here, one has two neural networks with the same encoder-decoder structure, one for MRI reconstruction consisting of multiple cascaded blocks, each containing an encoder-decoder unit and a data fidelity unit, and the other for segmentation.These are pre-trained and coupled by ensuring they share reconstruction encoders.These two examples are special cases of the generic approach develop in section 7.

Statistical inverse problems.
Let X and Y denote separable Banach spaces where (X, S X ) and (Y, S Y ) are measurable spaces.Next, let P X and P Y denote spaces of probability measures on X and Y , respectively.Following [25], a (statistical) inverse problem amounts to reconstructing (estimating) x * ∈ X from measured data y ∈ Y that is generated by a Y -valued random variable y where (3) y ∼ M(x * ) with known M : X → P Y (data model).
Elements in X (model parameter space) represent possible model parameters and elements in Y (data space) represent possible data.In tomographic imaging, elements in X are often functions defined on a fixed domain in R n representing images and elements in Y are realvalued functions defined on a fixed manifold M, which is given by the acquisition geometry associated with the measurements.Furthermore, just as in the classical setting, most statistical inverse problems do not have a unique solution in the sense that the model parameter is not identifiable [25, section 2.3].
A common data model is when data is contaminated with additive noise: (4) y = A(x * ) + e with e ∼ P noise for some known P noise ∈ P Y .
Here, A : X → Y (forward operator) models how data is generated in absence of noise and e ∼ P noise models noise.If e is independent from x * , then (4) amounts to the data model Another data model is when M(x) is a Poisson random measure on Y with mean A(x).This is a suitable data model for imaging modalities that rely on counting statistics in a lowdose setting, such as line of response PET [47] [72, section 3.2] and variants of fluorescence microscopy [41,21], see also [43,87] for a more abstract treatment.

Bayesian inversion.
Only seeking an estimate of x * ∈ X is limiting since it does not account for the uncertainty.A more comprehensive analysis is based on introducing a X-valued random variable x ∼ π * whose true (unknown) probability distribution π * ∈ P X generates x * .One can then rephrase the inverse problem stated earlier as the task of recovering the probability measure π * ∈ P X given data y ∈ Y generated by y, which is related to x * through the data model as in (3).An important special case is when π * is parametrized by x * ∈ X in a known way, so the inverse problem reduces to the task of recovering x * ∈ X.
In a Bayesian setting, one considers the posterior distribution of x given y = y up to a constant of proportionality.More precisely, consider a setting where the joint law (x, y) ∼ µ can be written in terms of conditional probabilities: Here, π 0 serves as a (possibly improper) prior and the last equality in (5) follows from the definition of the data model as the conditional distribution of y given x = x * .In particular, the joint law µ in ( 5) is proportional to the posterior, so the decomposition above exists as soon as Bayes' theorem holds.This is the case in a rather general setting [18,Theorem 14], but a decomposition is also possible is some cases where the prior is not proper 1 .
A key point in the Bayesian setting is to explore the posterior distribution of x given y = y assuming that both x → π 0 ∈ P X (prior) and x → M(x) (data model) are known, but x * ∈ X is unknown.The data model often has an associate density L (data likelihood) that is known to sufficient degree of accuracy, in which case dM(x)(y) = L(y | x) dy.

Reconstruction as an optimal decision rule.
A reconstruction method is formally a measurable X-valued mapping on Y , which in the statistical setting corresponds to an estimator.More precisely, (Y, S Y ), {M(x)} x∈X defines a statistical model parametrized by the model parameter space X and a reconstruction method corresponds to a point estimator.The latter is a non-randomized decision rule for a statistical estimation problem where the model parameter space X parametrizes the underlying statistical model and at the same time constitutes the decision space.The reader may here consult [60, section 3.1] for formal definitions of decision theoretic notions used here.
There are many possible reconstruction methods (estimators) so one needs a framework where these can be compared against each other.Statistical decision theory offers such a framework by associating a notion of risk to a decision rule.This quantifies the downside that comes with using a particular reconstruction method.The first step is to define the loss function on the decision space, which in our specific setting becomes a measurable mapping (see [60,Definition 3.2] for the definition in a general setting): (6) X : X × X → R.
A common choice in imaging inverse problems is the L 2 -loss, which is the squared L 2 -distance.
There are however alternatives that are not based on point-wise differences but on differences between high-level image features, e.g., the Wasserstein distance [3] and perceptual losses [46].
Having selected a loss function as in (6) and a prior π 0 in (5) on the model parameter space, the π 0 -average risk (Bayes risk or expected loss) for reconstruction is given as A natural criteria to select a reconstruction method (estimator) is to minimize Bayes risk, i.e., to select an estimator (non-randomized decision rule) that minimizes A † → R π 0 (A † ) in (7).Note here that in the finite dimensional setting, minimizing Bayes risk is the same as computing the conditional mean (posterior mean) if and only if the loss function in (6) is the Bregman distance of a strictly convex non-negative differentiable functional [6].This holds in particular when the loss function is given by the squared L 2 -norm.Next, another common choice is the MAP estimator that maximizes the posterior, so it corresponds to the most likely reconstruction given the data.On the other hand, a maximum likelihood estimator maximizes the negative log-likelihood of data, i.e., it corresponds to the model parameter that generates the most likely data.This is an unsuitable estimator in ill-posed inverse problems since it frequently leads to overfitting.
To summarize, we will henceforth consider a reconstruction method that minimizes Bayes risk and, as already mentioned, this equals the conditional mean when using a L 2 -loss.

Challenges with Bayesian inversion.
In the Bayesian setting (section 5.1), both the true model parameter and data are assumed to be generated by random variables, and the goal is to recover the conditional probability of the model parameter given data (posterior) [48,25,88,18,13].In contrast, classical (deterministic) approaches view an inverse problem as an operator equation of the type (1) [23,49,85,52] where data may be generated by a random variable, but there are no statistical assumptions on model parameters.
The Bayesian viewpoint offers a more complete analysis than the classical approach that is based on solving (1) in the sense that the posterior describes all possible solutions.In particular, different reconstructions can be obtained by using different estimators and there is a natural framework for uncertainty quantification, e.g., by computing Bayesian credible sets.Furthermore, small changes in the data lead to small changes in the posterior distribution in a fairly general setting [18,Theorem 16] (continuity of the posterior distribution in the Hellinger metric), so working with probability measures on the model parameter space (posterior) and adopting a suitable prior stabilizes an ill-posed inverse problem.
The posterior is, on the other hand, often quite complicated with no closed form expression.Much of the contemporary research therefore focuses on realizing the above advantages with Bayesian inversion without having access to the full posterior.Key topics are designing a "good" prior π 0 ∈ P X and to have computationally feasible means for exploring the posterior.

Designing good priors.
The difficulty in selecting an appropriate prior lies in capturing the relevant a priori information.Many of the results from the statistical community focus on characterizing priors that lead to Bayesian inference methods with desirable asymptotic properties, like consistency and good contraction rates.
Bayesian non-parametric theory provides a large class of handcrafted priors, see, e.g., [30, chapter 2], [18, section 2], and [48,13].These however only capture a fraction of the a priori information that is available.To illustrate this claim, a natural a priori information in medical imaging is that the object being imaged is a human being.It is very difficult, if not impossible, to explicitly construct a prior that encodes this information.
An alternative approach is to consider a prior that is learned from examples in X through some predictive generative model.A simplistic way is to select a Gaussian density that matches the first two sample moments [12].More elaborate approaches can be based on generative adversarial networks that are trained on unsupervised data, e.g., a generative adversarial network can be used to learn a Gibbs type of prior in a MAP estimator [63].

Computational feasibility.
Exploring the posterior requires sampling from a high dimensional probability distribution.It is not possible to directly simulate from the posterior distribution in the infinite dimensional setting unless the model parameter is decomposed into more elementary finite-dimensional components.This quickly becomes computationally challenging in large scale problems, like in imaging where the posterior is a probability distribution over the set of images.
Computational methods used for Bayesian inversion often combine analytic approximations of the posterior with various Markov chain Monte Carlo (MCMC) techniques, see [18, section 5] for a nice survey.There is an extensive theory that guarantees that these techniques are statistically consistent, but it comes with two critical drawbacks that has prevented widespread usage of MCMC techniques in imaging.First, many approaches require access to the prior in closed form, and as already argued for (section 5.3.1),such handcrafted priors are woefully inadequate in representing natural images.Second, these methods are still not sufficiently scalable for exploring the posterior in an efficient manner in large scale inverse problems, such as those that arise in 2D/3D tomographic imaging [8, chapter 1].Alternatively, one can approximate the posterior with more tractable distributions (deterministic inference), which includes variational Bayes [28] and expectation propagation [69].Variational Bayes methods have in particular emerged as a popular alternative to the classical MCMC methods, see [9] for some guidance (on p. 860) on when to use MCMC or variational Bayes.
To summarise, one can sometimes with reasonable efficiency compute point estimators that do not involve any integration over the model parameter space, like a MAP estimator.Estimators requiring such integration, like the estimator that minimize Bayes risk, are however computationally unfeasible.This also includes computational steps relevant for uncertainty quantification.

Learned iterative methods.
As outlined in section 5.3, there are two challenges associated with using Bayesian inversion: selecting a "good" prior (section 5.3.1) and providing a computationally feasible approach for computing suitable estimators, like the one that minimizes Bayes risk (section 5.3.2).
As we outline here, learned iterative methods address both these challenges.It makes use of techniques from machine learning, and deep neural networks in particular, which have demonstrated a remarkable capacity in capturing intricate relations from example data [57].
A key element is usage of highly parametrized generic models that can be adapted to specific decision rules, such as reconstruction by ( 7), by training against example data.Learned iterative methods use a deep neural network to define an estimator (reconstruction method) that minimizes Bayes risk while accounting for the knowledge about how data is generated.
To give a more precise description, consider the joint law µ = π 0 ⊗ M(x) in ( 7) used for defining Bayes risk.In most practical applications, this joint law is unknown.Often one may however have access to the corresponding empirical measure given by supervised training data (x 1 , y 1 ), . . ., (x m , y m ) ∈ X × Y generated by (x, y) ∼ µ.This avoids introducing a handcrafted prior π 0 ∈ P X .Furthermore, searching over all non-randomized decision rules is computationally unfeasible.Instead, we restrict our attention to those given by a (deep) neural network architecture, which are known to have large capacity (can approximate any Borel measurable mapping arbitrarily well [75]) and there are computationally feasible implementations.To summarize, we have a family of reconstruction methods A † θ : Y → X parametrized by a finite dimensional parameter set Θ and the optimal one is given by solving the training problem The above approach for defining a reconstruction operator A † θ * : Y → X is fully data driven in the sense that neither a prior on model parameter space nor a data model are handcrafted beforehand.Instead, all information is derived from the training data, which in particular does not utilize knowledge about how data is generated.This becomes a serious issue when the number of independent samples in training data are low compared to number of unknowns, which is commonly the case in imaging.Next, in many inverse problem the data model x → M(x) that describes how data is generated is known.Thus, it is unnecessarily pessimistic to disregard this information as in a fully data driven approach to reconstruction.
Learned iterative schemes [1,2] define a non-linear reconstruction operator parametrized by a deep convolutional neural network architecture that accounts for the data model, or more precisely the data likelihood.The idea is to unroll a fixed point iterative scheme relevant for solving the inverse problem and replace the explicit iterative updating rule with a learned one given by a deep convolutional residual network.The approach can be formulated as a general scheme for solving (possibly non-linear) inverse problems [1,2,35], see also [65,66,20] for a formulation that learns proximal updates in linear inverse problems.This results in a computationally feasible approach with surprisingly low requirements on training data and good generalization properties that outperforms state-of-the-art image reconstruction in CT [1,2,35], MRI [64,65,37,66], photoacoustic tomography [38], and superresolution [64,65,20].
6. Tasks on model parameters.We consider tasks formulated as an operator that acts on model parameter space X and that takes values in a set D (decision space).We will start with the abstract formalization of such tasks using the language of statistical decision theory.Similar to how reconstruction was treated (section 5.2), the task is represented by a non-randomized decision rule and we will select the one that minimizes Bayes risk.Next, we also indicate how such decision rules can be computed efficiently using (deep) neural networks and supervised learning that minimizes the empirical risk.The remainder of the section is devoted to providing examples that concretizes the abstract framework and illustrates its general applicability.
6.1.Abstract setting.Let (X, S X ), {P z } z∈ be a statistical model where the model parameter space (X, S X ) is a measurable space and {P z } z∈ ⊂ P X is some family of probability measures on X parametrized by elements in .Next, there is a measurable space (D, S D ) (decision space) and a fixed (task adapted) loss function ([60, Definition 3.2]) ( 9) The statistical model along with the decision space and loss function defines a statistical estimation problem.Many tasks can now be seen as an appropriate non-randomized decision rule T : X → D (task operator).
Before proceeding, it is worth reflecting over the roles of the above sets.In our set-up, the decision making associated with the task is based on elements in the decision space D whereas actual observables are elements in X, so the task is represented by a measurable mapping T : X → D (task operator).Often it is more natural to formalize the task as a mapping τ : → D where elements in the set are related to those in X.A difficulty is that elements in are not observable and the mapping relating its elements to those in X is unknown.Hence, the challenge is to infer an appropriate mapping T given τ by resorting to some suitable "optimality" principle.The examples in section 6.2 will further clarify the various roles of these sets in decision making.
Just as in section 5.2, we consider a decision rule that minimizes Bayes risk.More precisely, assume is itself a measurable space and consider a fixed probability measure η 0 ∈ P (task prior).The task operator is the non-randomized decision rule T : X → D that minimizes the associated Bayes risk: where (z, x) ∼ η 0 ⊗ P z .
A difficulty is to provide a 'reasonable' task prior η 0 ∈ P .Another is that P z ∈ P X is not known.Hence, one needs to consider the joint law η := η 0 ⊗ P z in (10) as an unknown.Note that this differs from reconstruction, where the joint law is either known (as in section 5.1), or the prior is unknown but the data likelihood is known (as in section 5.4).Since the joint law is unknown, we replace it by the empirical measure given by (supervised) training data (z 1 , x 1 ), . . ., (z m , x m ) ∈ × X, i.e., one has i.i.d.samples generated by a ( × X)valued random variable (z, x) ∼ η.Furthermore, due to issues associated with computational feasibility (section 5.3.2),we consider a parametrized family of decision rules T ϑ : X → D given by a (deep) neural network architecture.Then, the task operator is the decision rule T ϑ * : X → D parametrized by a finite dimensional parameter in Ξ and the optimal one ϑ * ∈ Ξ is given by empirical risk minimization: We conclude with examples showing how a wide range of image processing tasks can be phrased as decision rules in a statistical decision problem.

6.2.
Examples.The abstract framework in section 6.1 for formalizing a task on model parameter space is very generic and covers a wide range of possible tasks.In the following, we list concrete examples from imaging in order to show how this framework can be adapted to specific cases.To ensure a computational feasible implementation, our focus is on tasks that have been successfully addressed using techniques from deep learning.Deep learning has proven to be an efficient computational framework for many tasks, much thanks to its ability to progressively learn discriminative hierarchal features of the input data by means of training a suitable deep neural network.Hence, this limitation is not as restrictive as it may seem at a first glance, which will also become evident by the examples listed here and in section 6.3.
Unless otherwise stated, tasks are formulated for grey-scale images defined on a fixed domain Ω ⊂ R n , i.e., X := L 2 (Ω, R).We will also assume that X is a measurable space for some σ-algebra S X .Finally, M denotes the space of measurable mappings, e.g., M (X, D) is D-valued measurable mappings defined on X. Bayes risk in (10) associated with a decision rule T : X → D for given task prior η 0 ∈ P becomes The corresponding empirical risk minimization in ( 11) is There are several papers dealing with how to construct a suitable (deep) neural network architecture for the set of decision rules D = {T ϑ } ϑ∈R N and solving (12) will then correspond to training a classifier, see [58] for an early approach based on a convolutional neural network, AlexNet [54] and ResNet [39] represent examples of further development along this line.

Semantic segmentation.
The task here is to classify each point in an image into one of k possible labels, so the special case k = 2 corresponds to (binary) segmentation.Stated more formally, semantic segmentation applies a mapping that associates each point in an image in X to a probability distribution over all k labels.This task becomes a non-randomized decision rule in a statistical estimation problem where := M (Ω, Z k ) and the decision space D := M (Ω, P Z k ) is the set of measurable mappings from Ω to the class of probability measures on Z k .The task adapted loss function is given by (9) with The decision distance D : D × D → R simply integrates the point-wise cross entropy of the (point-wise) independent probability measures d(t) and d (t).The cross entropy is a wellknown notion from information theory for quantifying the dissimilarity between probability distributions [16] and it is often used as a learning objective in generative models involving probability distributions.Bayes risk in (10) associated with a decision rule T : X → D for a given task prior η 0 ∈ P can then be written as The corresponding empirical risk minimization in ( 11) is Note that T(x)(t) is a probability distribution over Z k and z(t) ∈ Z k when z ∈ , so in particular T(x)(t) z(t) ∈ [0, 1] for any t ∈ Ω.The set of decision rules D = {T ϑ } ϑ∈R N can be parametrized by (deep) neural networks, in which case solving (13) corresponds to training a segmentation operator.Deep neural net architectures suitable for semantic segmentation are presented in [61,74,84], see also the surveys in [91,34].In particular, the SegNet architecture has been successful for semantic segmentation of 2D images [5].For (binary) segmentation one may use the U-net [81,15].

Anomaly detection.
The task here is to detect the difference (anomaly) between two grey-scale images, so X = L 2 (Ω, R) × L 2 (Ω, R) for a fixed domain Ω ⊂ R n .This becomes a non-randomized decision rule in a statistical estimation problem where := X and the anomaly is represented by grey-scale images, so the decision space is D := L 2 (Ω, R).The task adapted loss function is given by (9) with Bayes risk in (10) associated with a decision rule T : X → D for given task prior η 0 ∈ P becomes and note that z = (z 1 , z 2 ) ∈ and x = (x 1 , x 2 ) ∈ X.The corresponding empirical risk minimization in ( 11) is (14) ϑ * ∈ arg min where (x i 1 , x i 2 ) ∈ X is the supervised training data.One may here use a (deep) convolutional neural net to parametrize the decision rules in D = {T ϑ } ϑ∈R N .6.2.4.Caption generation.Caption generation refers to the task of associating an image to an appropriate sentence, or paragraph, that describes its content.More precisely, define W to be the set of words in a chosen language, enlarged by a "stop word" w stop that marks the end of the caption.Next, let C denote the set of captions, which are finite sequences made up of elements from W where each sequence contains w stop exactly once as its last element.Then, caption generation is the task of mapping an image to a sequence in C.
This task becomes a non-randomized decision rule in a statistical estimation problem where := C is the set of captions, so P z is the probability of a caption z.The decision space is D := P C , i.e., probability distributions over set of captions C, and the task adapted loss function is given by ( 9) with To express Bayes risk that is to be minimized when computing the optimal decision rule, consider an element z ∈ = C (caption) and let z i ∈ W denote its i:th word.Next, let W n ⊂ W denote the set of sequences of n words that do not contain the stop word w stop (unfinished sentences), i.e., W n := (w i ) n i=1 ⊂ W | w i = w stop for i = 1, . . ., n .
Since an element in the decision space d ∈ D := P is a probability measure on sequences of words (captions), it will in particular yield the following probability measure on W n : We now consider the measure for n + 1 sequences that are conditioned on its first n terms, i.e., let (w i ) n i=1 ∈ W n be fixed with d n (w i ) n i=1 > 0.Then, d induces to a probability measure π d ∈ P W on the set of words W via With this notation, one can identify an element d ∈ P with its corresponding representation in × n P W ( • | W n ) by the identity .
Here P W ( • | W n ) is the set of probability measures on W conditioned on W n .Bayes risk in (10) associated with a decision rule T : X → D (potential caption generation operator) for given task prior η 0 ∈ P can now be expressed through conditional densities on W: To derive the corresponding empirical risk minimization problem, we consider a fixed class of decision rules that are given via a parametrization of their representation in term of marginal densities.More precisely, given a parameter set ϑ ∈ R m , the decision rule T ϑ is given by where x ∈ X and z ∈ , and Ψ ϑ is parametrized by a recurrent neural network, for example using the long short-term memory (LSTM) architecture [42].The corresponding empirical risk minimization in ( 11) is now given by (15) ϑ * ∈ arg min Solving (15) corresponds to training a image caption generator [92], see also [51].

Further imaging tasks.
A common trait with the examples worked out in section 6.2 is that each of them can be recast as finding an optimal decision rule in a statistical estimation problem .Furthermore, deep neural networks offer a computationally feasible implementation of estimating this decision rule from supervised training data.
There is a wide range of other tasks beyond those mentioned in section 6.2 that can be represented as a non-randomised decision rule, which in turn is efficiently parametrized by a suitable deep neural network architecture.The (incomplete) list below is based on [4,77] and aims to show the diversity of tasks from computer vision that can be approached successfully using a suitable deep neural network architecture.Inpainting: This is essentially interpolation/extrapolation to recover lost or deteriorated parts of images and videos and approaches based on trainable neural networks [97].Depixelization/super-resolution: The task is here to upsample, i.e., to synthesize realistic details into images while enhancing their resolution [17] or to fill out information "between" pixels by increasing the resolution of the final picture, also known as the single image super-resolution problem [79].
Demosaicing: The task here is to reconstructing a full color image from the incomplete color samples output from an image sensor [90].Almost all digital cameras, ranging from smartphone cameras to the top-of-the-line digital SLR cameras, use a demosaicing algorithm to convert the captured sensor information into a color image.Colourising: The task is to apply color to grey scale photos and videos [45].Image translation: The task is to translate between two classes of images of the same object, e.g., CT and MRI images [95].Object recognition: This visual classification task involves localization, detection and classification and this can be seen as an example of constellation models, which are a general class of model that describe objects in terms of a set of parts and their spatial relations [77, section 20.5].An integrated framework based on deep convolutional networks for detection and localizatio was already introduced in [86], see also [40] for recognition and [26] for scene understanding.The most well-known use case is recognition of multiple faces in an image where statistical shape models play a central role [100,22,94] An analogous task relevant for clinical image guided diagnostics is detecting melanoma from images of skin lesions [24,36].Non-rigid image registration: The task here is to deform a template image in a "natural" way so that it matches a target image.This is a key step in spatiotemporal imaging and deep neural networks have been utilized for this purpose [29,98].Parametric regression: The task here is to statistically determine relationships among variables (parameters) in a statistical model.This is important in clinical diagnostics where one seeks to determine risk factors and biomarkers of diagnostic and prognostic value associated with clinical progression and severity of specific diseases.An example of image guided regression is estimating cardiovascular risk factors, such as age, from retinal fundus photographs [76].Another is predicting patient overall survival as in the 2018 Multimodal Brain Tumor Segmentation Challenge based on BRATS imaging data [68], and predicting scores for Alzheimer's disease from imaging data [67,59].The abstract framework in section 7.1 allows one to perform any of the above tasks (and those in section 6.2) jointly with a reconstruction step for solving an inverse problem.Examples of the latter are deconvolution, Fourier inversion (MRI imaging), or more elaborate schemes for various types of tomographic imaging.A key success factor is access to suitable training data, another is usage of a differentiable loss function along with a trainable differentiable parametrization (deep neural network architecture) of the task operator.

Task adapted reconstruction.
As stated in the introduction (section 1), solving an inverse problem (reconstruction) is one step in a pipeline (Figure 1) that often involves further coupled tasks necessary for decision making.Task adapted reconstruction refers to methods that integrate the reconstruction with (parts of) the decision making procedure, thereby adapting the reconstruction method for the specific task at hand.7.1.Abstract setting.We will present a generic framework for task adapted reconstruction that is computationally feasible and adaptable to specific inverse problems and tasks.A key element is to formalize both the reconstruction and task as non-randomized decision rules within a statistical estimation problem.
More precisely, our starting point is the inverse problem in section 5 where the data model in ( 3) is known.Following section 5.2, the reconstruction step can be seen as a decision rule in a statistical estimation problem defined by the statistical model (Y, S Y ), {M(x)} x∈X , decision space (X, S X ), and loss X : X × X → R. Given a prior π 0 ∈ P X allows us to define a reconstruction method as a non-randomized decision rule that minimizes the π 0 -average risk (Bayes risk), i.e., as a mapping that solves ( 16) where (x, y) ∼ π 0 ⊗ M(x).
Likewise, following section 6.1 a task becomes a decision rule in a statistical estimation problem defined by the statistical model (X, S X ), {P z } z∈ , decision space (D, S D ), and loss given by ( 9) with known τ : → D (feature extraction map) and D : D × D → R (decision distance).Given a task prior η 0 ∈ P , the non-randomized decision rule representing the task (task operator) can be seen as a minimizer to the η 0 -average risk (Bayes risk): (17) T ∈ arg min where z, x ∼ η 0 ⊗ P z .
We now have the following three approaches to task adapted reconstruction.Sequential approach: A sequential approach starts with determining the reconstruction operator, and then uses it to define the the task operator.It is based on assuming that statistical assumptions for (z, x) ∼ η 0 ⊗ P z and (x, y) ∼ π 0 ⊗ M(x) are consistent, e.g., by assuming that P z is the push forward of M(x) through the reconstruction operator2 .The task adapted reconstruction is then given as ( 18) where A † ∈ M (Y, X) solves ( 16) and T ∈ M (X, D) solves (17).End-to-end approach: The fully end-to-end approach ignores the distinction between reconstruction and the task.Assuming (z, y) ∼ ν for some measure ν ∈ P ×Y , the task adapted reconstruction is here given as B : Y → D that solves where (z, x) ∼ ν.
Joint approach: The joint approach is a middle-way between the sequential and end-to-end approaches.It is based on assuming that there is a joint law (z, x, y) ∼ σ, which by the chain rule in probability can be written in terms of conditional probabilities: Assume that x is a sufficient statistic for y, i.e., dπ(y | z, x) = dπ(y | x).Combined with (z, x) ∼ η 0 ⊗ P z and (x, y) ∼ π 0 ⊗ M(x), we obtain dσ(z, x, y) = dM(x)(y) dP z (x) dη 0 (z).
Next, we introduce a joint loss that interpolates between the sequential case and the end-to-end case.Specifically, we let joint : (X × D) × (X × D) → R be given as Task adapted reconstruction is now given by (18) where the operators jointly solve Note first that in the limit C → 0, the joint approach becomes the sequential one.Next, it may seem sufficient to only consider the loss D in ( 21), i.e., to set C = 1 in (20), which recovers the end-to-end approach.There is however a problem with non-uniqueness in this case since 21) for any invertible B : X → X.
This non-uniqueness does not arise when C < 1, so incorporating a loss term associated with the reconstruction may act as a regularizer.This also indicates that the limit C → 1 does not necessarily coincide with the case C = 1.
7.2.Computational implementation.In section 5.3.1 we mentioned the difficulty to select an appropriate prior π 0 ∈ P X for Bayesian inversion whereas the measure M(x) ∈ P Y is often known by the data model.Furthermore, both measures η 0 ∈ P and P z ∈ P X must be considered as unknown for most tasks.Hence any realistic scenario would contain σ as an unknown.An option is to replace these measures by their empirical counterparts given by suitable supervised training data.
Another concern is computational feasibility.The optimizations in ( 16), ( 17), (19), and ( 21) are taken over all measurable mappings between relevant spaces, which is computationally unfeasible.This can be addressed by considering parametrized sets of measurable mappings as done in sections 5.4 and 6.1.More precisely, we employ a learned iterative scheme to parametrize a family of reconstruction methods A † θ : Y → X since this parametrization includes knowledge about the data model (section 5.4).Likewise, decision rules associated with the task are given by an appropriate parametrized family of mappings T ϑ : X → D. Finally, the approach for the end-to-end setting is to directly parametrize B ϑ : Y → D.
Using such parametrizations allows one to reformulate ( 16) and ( 17) as ( 25), ( 19) as (25), and ( 21) as (27).A key aspect for the implementation is to use stochastic gradient descent (SGD) based methods for finding appropriate parameters by approximately solving the empirical versions of ( 19), (25), and ( 27).This requires that the above parametrizations are differentiable, which in particular requires using differentiable loss-functions.
Depending on the type of supervised training data, we can now pursue either of the three approaches listed in section 7.1.Sequential approach: Here we have access to two sets of supervised training data that are coupled: The coupling is that x i 's in the second data set (bottom one) are reconstructions obtained from y i 's in the first data set (top one).This ensures consistency with the statistical assumptions mentioned before for the sequential approach.The reconstruction is then given by the mapping (24) T ϑ * • A † θ * : Y → D where θ * ∈ Θ solves (8) and ϑ * ∈ Ξ solves (11), i.e., Note here that the only requirement for the sequential approach is that the assumptions (z, x) ∼ η 0 ⊗ P z and (x, y) ∼ π 0 ⊗ M(x) are jointly consistent.The learned task operator given by solving for ϑ * in ( 25) is only well defined for input taken from the support of it's training data, so it may fail when applied to data it has never seen.This is especially the case if new data has a different statistical assumption.Hence, it is important to ensure the range of the reconstruction operator is contained in the support of the elements x ∈ X used to train the task.In most practical implementations, this is ensured by simply letting x i 's in (z i , x i ) in ( 23) be the output of the learned reconstruction operator A † θ * : X → Y .End-to-end approach: Supervised training data is here of the form (26) (z i , y i ) ∈ × X generated by (z, x) ∼ η 0 ⊗ P z for i = 1, . . ., m.
The reconstruction is given by B ϑ : Y → D where ϑ * ∈ Ξ solves Joint approach: In this approach we assume access to supervised training data that jointly involves the reconstruction and task: Given such data, the corresponding reconstruction method can be defined as in (24) where (θ * , ϑ * ) ∈ Θ × Ξ solves the following joint empirical loss minimization: where joint : (X × D) × (X × D) → R is the joint loss in (20).Note that one may in addition have access to separate sets of training data of the form ( 23) and (28).In such case, it is possible to first pre-train by solving for (8) and (11) separately, and use the resulting outcomes to initialize an algorithm for solving (29).

Applications.
In the following we demonstrate performance of the task adapted reconstruction scheme for (24) that is based on solving (29).All cases involve tomographic reconstruction from 2D parallel beam data and as tasks, we consider MNIST classification and segmentation.

Joint tomographic reconstruction and classification.
Task: Recover probabilities that a 2D grey scale MNIST image is a 0,1, . . .,9 from noisy parallel beam tomographic data (see section 6.2.1).Data: Elements in Y are real-valued functions representing samples of a Poisson random variable with mean equal to the exponential of the parallel beam ray transform and an intensity corresponding to 60 photons/line.The ray transform is digitized by sampling the angular variable at 5 uniformly sampled points in [0, π] with 25 lines/angle.Model parameter space: Elements in X are functions representing images supported on a fixed rectangular region Ω ⊂ R 2 , so X := L 2 (Ω, R).These are discretized by sampling on a uniform 28 × 28 grid.The loss X : X × X → R is the squared L 2 -distance on X. Decision space: := {0, 1, . . ., 9} is the set of labels and D is probability distributions over with a loss function D : D × D → R given by the cross entropy: In addition to cross entropy, we employ classification accuracy to measure performance.Given a probability distribution d ∈ D over , the single label prediction is defined to be the element in that is assigned the highest probability, i.e. arg max z∈ d(z).The percentage of images in the evaluation data set for which the predicted label coincides with the real one is reported as classification accuracy.Reconstruction and task operators: Reconstruction A † θ : Y → X is given by the learned gradient descent in [1] and task T ϑ : X → D is a MNIST classifier given by a standard convolutional neural net classifier with three convolutional layers, each followed by 2 × 2 max pooling for segmentation.The activation functions used were ReLUs, layers had 32, 64 and 128 channels, respectively.The final layer is dense and transforms the output of the last convolutional layer to a logit layer of size 10, with the last activation function being a softmax.Joint training: Joint supervised data is given as 512 000 triplets (x i , y i , z i ) where z i ∈ is the label corresponding to the MNIST labels.We also performed pre-training for both the reconstruction and task operator (classifier).The reconstruction operator was pre-trained using 256 000 pairs (x i , y i ) with 8 000 steps with a batch size of 64 and the task operator (classifier) was pre-trained until 97.7% accuracy.Note here that we use about 60 000 entries from the MNIST database, so the above supervised data is not statistically independent.Example outcomes, which are summarized in Table 1 and Figure 2, show that a joint approach outperforms a sequential one when considering the classification accuracy.Besides an improved classification accuracy we also see a significant improvement regarding interpretability.The reconstructed image part can in the joint setting actually be used as a benchmark to assess the reconstructed classification probabilities.On the other hand, the sequential ap-proach results in classification probabilities that deviates from this intuitive observation.We also see that in both cases, the classification probabilities are unnaturally concentrated on a single label, but this is a know phenomena also for regular for MNIST classification [33].

Approach
Accuracy  1: In both the pre-training and sequential approaches, the reconstruction and task operators are trained separately.In the sequential approach, the task operator is then further trained on the output of the trained reconstruction operator.In the end-to-end approach, which corresponds to C = 0 in (20), the reconstruction operator is pre-trained with L 2 -loss.Finally, the joint approach uses the full loss (20).We see that the classification accuracy (explained in "Decision space" in section 7.3.1)improves when using a joint approach.In fact, using a "suitable" C (Figure 2(a)) yields an accuracy of 97.00% that is quite close to the upper limit of 97.76%, which is the accuracy of the classifier when trained on true images.Data on overall performance on all of the MNIST dataset (accuracy) is given in Table 1.

Class
7.3.2.Joint tomographic reconstruction and segmentation.Task: Recover the probability map for segmentation of a grey scale image (see section 6.2.2 with k = 2) from noisy parallel beam tomographic data.In this specific example, we consider segmenting the grey matter of CT images of the brain, which is relevant in imaging of neurodegerantive diseases like Alzheimers' disease.Data: Elements in Y are real-valued functions on lines representing parallel beam tomographic data, which are digitized by sampling the the angular variable at 30 uniformly sampled points in [0, π] with 183 lines/angle.We furthermore add 0.1% additive Gaussian noise to data.Model parameter space: Elements in X are functions representing images supported on a fixed rectangular region Ω ⊂ R 2 , so X := L 2 (Ω, R).These are discretized by uniform sampling on a 128 × 128 grid.The loss X : X × X → R is the squared L 2 -distance.Decision space: Elements in D are point-wise probability distributions over binary images on Ω, which can be represented by grey-scale images with values in [0, 1] that gives the probability that a point is part of the segmented object.Hence, D = M Ω, [0, 1] with the loss function D : D × D → R as the cross entropy: Reconstruction and task operators: Reconstruction A † θ : Y → X is given by the Learned Primal-Dual scheme in [2] and task T ϑ : X → D is given by an "off the shelf" U-net convolutional neural net for segmentation [81].
Joint training: Joint supervised data is given as 100 triplets (x i , y i , d i ) where d i is the segmentation (binary image).We extend joint training data by data argumentation (±5 pixel translation and ±10 • rotation).There was no pre-training.Example outcomes are summarized in Figures 3 and 4. Note that C → 0 corresponds to the sequential approach, so the image for C = 0.01 can be seen as the outcome from a sequential approach.Clearly, a joint approach with C ≈ 0.5 or 0.9 outperforms a sequential one.
Next, as C decreases the reconstruction becomes more adapted to the task of segmentation.In the limit C → 0 the task part is viable but the reconstruction image is useless, which is to be expected.In the other direction, as C increases the reconstructed image becomes less adapted to the task and the latter becomes increasingly challenging due to the low contrast between white and grey matter.
Finally, using C > 0 not only reduces the non-uniqueness as explained in (22), it further regularizes in the sense that information from the reconstruction guides the segmentation, which otherwise would amount to learning the segmented image directly from the noisy sinogams.Intuitively there seems to be an "information exchange" between the task of reconstruction and that of segmentation, which when properly balanced by choosing C acts as a regularizer for the segmentation, e.g., the white/grey matter contrast in the reconstruction is overemphasized for small C.This improves the interpretability since it shows how the reconstructed image "helps" in interpreting why a certain segmentation is taken.(20).Clearly, there is no joint minimizer but 0.5 C 0.9 is a good compromise.The segmentation is a normalized grey-scale image denoting the probability that a point belongs to the segmented structure.The choice C = 0.9 seems to be a good compromise for a good reconstruction and segmentation (see Figure 3).Note also that C → 1 gives the sequential approach, so C = 0.999 may serve as a proxy for it.Reconstructions take a few milliseconds to perform on a desktop gaming PC.
8. Theoretical considerations.The joint task adapted reconstruction defined in ( 21) is given by combining two optimal decision rules into a single decision rule, one for reconstruction that acts on data and the other encoding a task that acts on model parameters.It is therefore natural to investigate whether the theoretical machinery developed for Bayesian inversion can be used to analyze regularizing properties of this joint approach.An example would be to investigate the conditions under which the joint approach is a regularization in the formal sense, which means proving existence, stability, and posterior consistency that is preferably complemented by providing contraction rates, see [30, chapters 6-9] for the precise definitions.
Much of the theory on Bayesian inversion that deals with such matters is well understood for linear problems in the finite dimensional setting [48], but things quickly become complicated for infinite dimensional non-parametric problems.There has been nice progress recently on consistency, posterior contraction rates, and characterization of the microscopic fluctuations of the posterior that is relevant for Bayesian inversion, see [88,18,73] for nice surveys and [71] for a in-depth treatment of reconstruction relevant for tomographic imaging.On the other hand, the theory and associated results require too many restrictive assumptions that renders them inapplicable for analyzing the task adapted approach in (21).To conclude, theory of Bayesian inversion is in its current state not useful for characterizing conditions for when the joint task adapted reconstruction in ( 21) is a regularization.
Another line of investigation considers the potential advantage that comes with using a joint approach over a sequential one.Since the reconstruction and task operators are trained separately in a sequential approach, some information is inevitably lost when applying a regularized reconstruction operator.In contrast, both reconstruction and task operators are trained simultaneously in a joint approach so there is a better chance of preserving the information.Hence, we expect a joint approach to perform better, which is also supported by the observation in (22) and the empirical evidence in section 7.3.Now, albeit convincing, the above heuristic argument is flawed!In fact, as stated by Proposition 1, it is surprisingly difficult to theoretically prove that a joint approach outperforms a sequential one in a non-parametric setting where one has access to all of data.The reason is that many standard operators that map data space to model parameter space are formally information conserving in such a setting.The adjoint of the forward operator, its Moore-Penrose pseudo-inverse, and even some regularized reconstruction operators such as the usual Tikhonov regularization are information conserving under standard Gaussian noise.For 2D parallel beam tomography, yet another example is the filtered backprojection (FBP) reconstruction operator (with a filter that is strictly non-zero in frequency space).Proposition 1.Let x be a X-valued random variable, and y be a Y -valued random variable, both defined on the same probability space.Let Π : Y → Y be a measurable operator with closed range.Let B be an arbitrary measurable map defined on Y that is injective when restricted to ran(Π).Then, the following holds: where f spans all random variables over X.
Before getting to the proof, let us comment on the implication of the statement above.The operator Π typically represents an orthogonal projection onto the closure of the range of A.
The result above states that the probability of x conditioned on y is the same as the one conditioned on B(Πy) if and only if, given the knowledge of Πy, the "noise" in the null space of Π, namely y − Πy, is independent of x.
Proof.The proof is essentially a rewriting of the definitions.Introduce the notations y 1 := Πy and y 2 := y − Πy, so that y = y 1 + y 2 .Then E[f (x)|y] = E[f (x)|y 1 , y 2 ], as y and (y 1 , y 2 ) generate the same σ-algebras.The injectivity of B on ran(Π) implies that the σalgebra generated by B •Π • y and Π • y are the same, so E holds for all f is exactly the statement of conditional independence in the claim.
Corollary 2. Consider the setting in section 7.1 for task adapted reconstruction and assume in particular that y and z are conditionally independent given x.Finally, let B satisfy the assumptions in Proposition 1; we also assume that the equality in Proposition 1 holds, that is π Proof.The conditional independence assumption can be written as π Notice now that B(y) and z are also conditionally independent given x, so we similarly obtain Our assumption is that π(x | y) = π x | B(y) , which combined with (34) yields (31).This concludes the proof.
By Corollary 2 we see directly that the conditional distribution of z given data y ∈ Y is, as -valued random variables, equal to the conditional distribution of z given an initial reconstruction B(y) ∈ X.In particular, a task adapted reconstruction method (either sequential or joint) is equivalent to first performing reconstruction by applying the fixed operator B : Y → X, which is not trained, followed by C : X → D that is given as Note here that C, which is trained, is a measurable map defining a non-randomized decision rule that in principle serves as a "task" operator.
To summarize, we cannot resort to "information bottleneck" type of arguments as an explanation for why the joint approach should outperform only training a task operator in this general setting.On the other hand, the above argument hints that an explanation must involve either the choice of architecture or the training protocol.Both of these are examples of classical and widely studied problems in deep learning concerning why deep learning "works" and these remain largely unsolved.Another argument in favor of a joint approach is that it is highly non-trivial to select an appropriate architecture for parametrizing C, whereas T and A † are easier to parametrize by means of neural networks.Another possible reason is that the operations, like evaluating B or its inverse B −1 , may not be stable.Finally, as we have seen from the examples, using knowledge about the reconstruction may in fact act as a regularizer, either by improving the trainability or the generalization properties.9. Discussion and outlook.A key aspect for the implementation of the joint task adapted reconstruction method in (21) is that both decision rules are given by trainable neural networks, which after joint training forms a single intertwined neural network.In such case, the problem reduces to solving (29).
The neural network for the reconstruction should here preferably incorporate knowledge about how data is generated.Learned iterative methods, like the Learned Primal-Dual method, are therefore well suited for this task since they are given by a (deep) neural network that embeds the forward operator and a statistical model for the nose in measured data [1,2].
Next, as shown in sections 6.2 and 6.3, a wide range of tasks can be interpreted as applying an optimal decision rule on the model parameters.The abstract framework for task adapted reconstruction (section 7.1) works with any task that can be represented by a neural network as long as the parametrization and the loss functions are differentiable, like those listed in sections 6.2 and 6.3.Hence, our approach opens up for truly task adapted reconstruction that goes well beyond performing reconstruction jointly with simple feature extraction.In particular, more advanced tasks, such as image caption generation or image-processing steps in radiomics [31,56], can be performed jointly with reconstruction.This potential is also mentioned in the editorial for the special issue on machine learning for image reconstruction in IEEE Transactions on Medical Imaging [93] where the editors for the special issue introduce the term rawdiomics (on p. 1294) for task adapted reconstruction applied to radiomics.
An important advantage that comes with a joint approach is increased robustness.Advanced tasks, like radiomics, commonly rely on deep neural networks that are trained on images in a supervised setting.Images are however inferred in a pre-processing step from measured data, so contrast and texture may depend on the instrumentation used for acquiring the data and the reconstruction method used for computing the images.Hence, a neural network that has been trained against images acquired from a particular equipment, or obtained using a particular reconstruction method, may generalize poorly when either of these factors change.This is especially the case for tasks involving elements of visual classification, such as semantic segmentation, that can be sensitive to variations in texture and contrast.In contrast, task adapted reconstruction acts on measured data instead of images (model parameters).Using a reconstruction step that incorporates a physics guided model for how measured data is generated results in a joint approach that is much more robust against variations in how data is acquired and processed.As an example, jointly training a learned iterative method with neural network(s) involved in radiomics will result in a joint scheme that is expected to be much more robust against variations in scanner and acquisition protocol.This is essential if radiomics is to be part of clinical-decision support systems for improving diagnostic, prognostic, and predictive accuracy.
Another important advantage with the proposed task adapted reconstruction method relates to computationally feasibility.The trained neural network for task adapted reconstruction scales to large scale problems.Such scalability remains a serious issue with the variational approaches mentioned in section 4. As an example, state-of-the-art methods for joint reconstruction and segmentation are based on a variational approach using a the Mumford-Shah functional, which quickly become computationally unfeasible.In contrast, the 2D examples in Figure 4 for joint reconstruction and segmentation are obtained using the approach in section 7.3.2and these take a few milliseconds on a desktop gaming PC.
Finally, examples involving tomographic image reconstruction (section 7.3) support the claim that a joint approach outperforms a sequential one.Understand this theoretically (section 8) is however an open problem.In particular, there is currently no theory motivating using a joint loss of the type in (20), even though empirical evidence suggests such a choice outperforms the en-to-end and sequential approaches.

Figure 1 :
Figure1: Typical workflow involving an inverse problem.The second row represents the data acquisition where raw data is acquired and pre-processed, resulting in cleaned data.In the third row, the cleaned data is used as input to a reconstruction step that recovers the model parameter, which is then post-processed to extract features that are used as input for model building.The final outcome is a task adapted model that can be used for decision making.The dotted rectangular part outlines the steps that are unified by task adapted reconstruction.

6. 2 . 1 .
Classification.The task is to classify an image into one of k distinct labels, or more precisely, associate an image to a probability distribution over all k labels.This task is represented by a non-randomized decision rule in a statistical estimation problem where := Z k and the decision space D := P is probability distributions over the k labels.The task adapted loss function is given by (9) with D (d, d ) := − z∈ d(z) log d (z) for d, d ∈ D and τ (z) := δ z for z ∈ .
Plot of loss functions after joint training for different C in (20).Clearly, there is no joint minimizer but 0.5 C 0.9 is a good compromise.(b) True image & class: 8.

Figure 2 :
Figure 2: Joint tomographic reconstruction and classification of MNIST images.Training data is to the left and reconstructed image with classification probabilities are on the right.Data on overall performance on all of the MNIST dataset (accuracy) is given in Table1.

Figure 3 :
Figure 3: Log-log plot of loss functions for joint reconstruction and segmentation after joint training for different C in(20).Clearly, there is no joint minimizer but 0.5 C 0.9 is a good compromise.

Figure 4 :
Figure 4: Joint tomographic reconstruction and segmentation for different values of C in (20).The segmentation is a normalized grey-scale image denoting the probability that a point belongs to the segmented structure.The choice C = 0.9 seems to be a good compromise for a good reconstruction and segmentation (see Figure3).Note also that C → 1 gives the sequential approach, so C = 0.999 may serve as a proxy for it.Reconstructions take a few milliseconds to perform on a desktop gaming PC.