Directional sinogram inpainting for limited angle tomography

Robert Tovey; Martin Benning; Christoph Brune; Marinus J Lagerwerf; Sean M Collins; Rowan K Leary; Paul A Midgley; Carola-Bibiane Schönlieb

doi:10.1088/1361-6420/aaf2fe

1. Introduction

1.1. Problem formulation

Many applications in materials science and medical imaging rely on the x-ray transform as a mathematical model for performing 3D volume reconstructions of a sample from 2D data. We shall refer to any modality using the x-ray transform forward model as x-ray tomography. This encompasses a huge range of applications including positron emission tomography (PET) in front-line medical imaging [1], transmission electron microscopy (TEM) in materials or biological research [2–4], and x-ray computed tomography (CT) which enjoys success across many fields [5, 6].

The limited angle problem is common in x-ray tomography, for instance in TEM [7] and mammography [8], and is caused by a particular limited data scenario. Algebraically, we search for approximate solutions to the inverse problem

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle {{\rm Given~data~}} b{{\rm ~find~optimal~pair~}} (u,v) {{\rm ~such~that~}}Sv = b, \C Ru = v\nonumber \end{align*}$

where $\newcommand{\C}[1]{\mathcal{#1}} \C R$ is the x-ray transform to be defined in (2.1), $S$ represents the limited angle sub-sampling pattern described in figure 1. Typically, limited angle problems can occur due to having a large sample or because equipment does not allow the sample to be fully rotated. Mathematically, microlocal analysis can be used to categorise the limited angle problem and characterise artefacts that occur. Viewed through the Fourier slice theorem, it becomes clear that the Fourier coefficients of $u$ are partitioned into those 'visible' in $b$ and those contained in a 'missing wedge' [9]. These coefficients are referred to respectively as the visible and invisible singularities of $u$ . The limited angle problem then is both a denoising and inpainting inverse problem, on the visible and invisible singularities respectively. The artefacts caused by the missing wedge can be explicitly characterised [10, 11] and examples of such streak artefacts and blurred boundaries can be seen in figures 2 and 3.

**Figure 1.** Diagrammatic representation of the acquisition of 2D x-ray transform data, the sinogram, in both full range and limited angle acquisition. Note that measurement at $180^\circ$ is exactly a reflection of that at $0^\circ$ . This symmetry allows us to consider a $180^\circ$ range of the sinogram as a full sample. In the limited angle setting we can only rotate the sample a small amount clockwise and anti-clockwise which results in missing data in the middle of the sinogram.
Download figure:
Standard image High-resolution image

**Figure 1.** Diagrammatic representation of the acquisition of 2D x-ray transform data, the sinogram, in both full range and limited angle acquisition. Note that measurement at $180^\circ$ is exactly a reflection of that at $0^\circ$ . This symmetry allows us to consider a $180^\circ$ range of the sinogram as a full sample. In the limited angle setting we can only rotate the sample a small amount clockwise and anti-clockwise which results in missing data in the middle of the sinogram.
Download figure:
Standard image High-resolution image

**Figure 2.** Demonstration of TV reconstruction in comparison to FBP and SIRT. The top row shows reconstructions from noise-less limited angle data and the bottom shows reconstructions from noisy limited view data (far left images). Comparing the columns, we immediately see that FBP and SIRT are much more prone to angular artefacts than TV. In both cases we notice that the TV reconstructions better show the broad structure of the phantom.
Download figure:
Standard image High-resolution image

**Figure 3.** Examples when TV reconstructions cannot recover the global structures of samples. When there is a large missing wedge ( $\newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \sfrac23$ of data unseen) and noise on the projections then reconstructions exhibit characteristic blurring in the vertical direction. This can also be seen in the extrapolated region of the sinograms as a loss of structure.
Download figure:
Standard image High-resolution image

$ \newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \sfrac23$ — **Figure 3.** Examples when TV reconstructions cannot recover the global structures of samples. When there is a large missing wedge ( $\newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \sfrac23$ of data unseen) and noise on the projections then reconstructions exhibit characteristic blurring in the vertical direction. This can also be seen in the extrapolated region of the sinograms as a loss of structure.
Download figure:
Standard image High-resolution image

Whilst the techniques developed here can apply to any limited angle tomography problem, we focus on the application of TEM for specific examples.

1.2. Context and proposed model

Traditional methods for x-ray tomography reconstruction find approximate solutions to $\newcommand{\C}[1]{\mathcal{#1}} S\C Ru = b$ , constraining $\newcommand{\C}[1]{\mathcal{#1}} \C Ru=v$ and only using prior knowledge of $u$ or the sinogram, $v$ . There are three main methods which fit into this category:

Filtered back projection (FBP) is a linear inversion method with smoothing on the sinogram, $v$ , to account for noise [12–14].
The simultaneous iterative reconstruction technique (SIRT) can be thought of as a preconditioned gradient descent on the function $\newcommand{\C}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{S\C Ru-b}_2^2$ [3, 15–17]. Regularisation is then typically implemented by an early-stopping technique.
Variational methods where prior knowledge is encoded in regularisation functionals have now been applied in this field for nearly a decade. In particular, the current state of the art in electron tomography is total variation (TV) regularisation [2, 18, 19] where $u$ is encouraged to have a piecewise constant intensity. This will be introduced formally in section 2.

FBP and SIRT are commonly used for their speed although variational methods like TV have quickly gained popularity as they enable prior physical knowledge to be explicitly incorporated in reconstructions. This added prior knowledge tends to stabilise the reconstruction process and figure 2 gives examples where TV can vastly outperform FBP and SIRT when either the noise level is large or the angular range small. However, figure 3 further shows the limitations of TV when high noise and limited angles are combined. The only difference between the Shepp–Logan phantom data shown in figures 2 and 3 is that the former is clean data, in the image of the forward operator, whilst the latter has Gaussian white noise added. We see that as soon as there is a combined denoising/inpainting reconstruction problem, the TV prior on $u$ becomes insufficient to recover the structure of the sample.

Recently, these traditional methods have received a revival through machine learning methods, see for instance [20, 21]. In both of these examples the main artefact reduction is a learned denoising step which only enforces prior knowledge on $u$ .

The most common method that has been used to reconstruct pairs $(u,v)$ is to solve each inverse problem sequentially. Typically, we can express the pipeline for such methods as:

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle v&= ~{\rm optimal~ inpainted~ sinogram~ given}~ b \nonumber \\ u&=~ {\rm optimal~ reconstruction~ given}~ v. \nonumber \end{align*}$

This has seen much success in heavy metal artefact reduction [22, 23] where a regularisation functional for the inpainting problem may be constructed from dictionary learning [24], fractional order TV [23], and Euler's Elastica [25]. Euler's Elastica has also been used in the limited angle problem [26] along with more customised interpolation methods [27]. These approaches allow us to use prior knowledge on the sinogram to calculate $v$ and then spatial prior knowledge to calculate $u$ from $v$ ; at no point is the choice of $v$ influenced by our spatial prior. A full joint approach allows us to go beyond this and use all of our prior knowledge to inform the choice of both $u$ and $v$ . If our prior knowledge is consistent with the true data then this extra utilisation of our prior must have the potential to improve the recovery of both $u$ and $v$ . To build a model for this framework we shall encode our In this paper, therefore, we propose a full joint approach which allows us to use all of our prior knowledge at once. To realise this idea we encode prior knowledge and consistency terms into a single energy functional such that an optimal pair of reconstructions will minimise this joint functional, which we shall write as:

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \label{basic model} E(u,v) = \alpha_1d_1(\C Ru,v) + \alpha_2d_2(S\C Ru,b) + \alpha_3d_3(Sv,b) + \alpha_4r_1(u) +\alpha_5r_2(v) \nonumber \end{align} \tag{ 1.1 }$

where $\alpha_i\geqslant0$ are weighting parameters, $d_i$ are appropriate distance functionals and $r_i$ are regularisation functionals which encode our prior knowledge. Note that choice of $d_2$ and $d_3$ are dictated by the data noise model. In what follows, $r_1$ is chosen to be the total variation.

Our choice for $r_2$ , the sinogram regularisation, is based on theoretical and heuristic observations. Thirion [28] has shown that discontinuities in $u$ correspond to sharp edges in the sinogram. In figure 3 we also see that blurred reconstructions correspond to loss of structure in the sinogram. Therefore, $r_2$ will be chosen to detect sharp features in the given data and preserve these through the inpainting region. The exact form of $r_2$ will be formalised in section 3.

A typical advantage of joint models is that they generalise previous ones. For instance, if we let $\alpha_2,\alpha_4\to\infty$ then we recover the TV reconstruction model. Alternatively, if we let $\alpha_3,\alpha_5\to\infty$ then we recover a method which performs the inpainting and then the reconstruction sequentially, as in [23, 25, 26]. Recent work [29] has shown that such a joint approach can be advantageous in similar situations but closest to our approach is that of [30] where $r_1$ and $r_2$ were chosen to encode wavelet sparsity in both $u$ and $v$ . We shall demonstrate, as seen in figure 4, that the flexibility of our joint model, (1.1), can allow for a better adaptive fitting to the data.

**Figure 4.** Demonstration of the improvement which can be achieved by using a model as in (1.1). Left hand images show state of the art reconstructions using total variational regularisation ( $\alpha_1=\alpha_3=\alpha_5=0$ ). This reconstruction clearly shows the characteristic blurring artefacts at the top and bottom. In our proposed joint reconstruction (right hand) we can minimise these effects.
Download figure:
Standard image High-resolution image

**Figure 4.** Demonstration of the improvement which can be achieved by using a model as in (1.1). Left hand images show state of the art reconstructions using total variational regularisation ( $\alpha_1=\alpha_3=\alpha_5=0$ ). This reconstruction clearly shows the characteristic blurring artefacts at the top and bottom. In our proposed joint reconstruction (right hand) we can minimise these effects.
Download figure:
Standard image High-resolution image

1.3. Overview and contributions

The main contribution of this work is to provide a framework for building models of the form described in (1.1) and provide new proofs for a numerical scheme for minimising these functionals. This numerical scheme is valid for a very large class of non-smooth and non-convex functionals $r_i$ and thus could be used in many other applications.

Section 2 first outlines the necessary concepts and notation needed to formalise the statement of our specific joint model in section 3. It will become apparent that the main numerical requirements of this work will require minimising a functional which is neither convex or smooth. Section 4 will start by reviewing recent work from Ochs et al [31] and we then provide alternative concise and self-contained proofs. Our main contribution here is to extend the existing results to an alternating (block) descent scenario. Finally, we present numerical results including two synthetic phantoms and experimental electron microscopy data where the limited angle situation occurs naturally.

2. Preliminaries

2.1. The x-ray transform

The principal forward model for x-ray tomography is provided by the x-ray transform which can be defined for any bounded domain, $\newcommand{\F}[1]{\mathbb{#1}} \Omega\subset\F R^n$ , by

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\C}[1]{\mathcal{#1}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle \label{radon} \C R\colon L^1(\Omega,\F R)\to L^1(\C S^{n-1}\times \F R^n,\F R) ~{\rm such~ that }~ \C Ru(\theta,y) = \int_{x = y + t\theta, t\in \F R} u(x){\rm d}t \nonumber \end{align} \tag{ 2.1 }$

where $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\C}[1]{\mathcal{#1}} \newcommand{\st}{{\rm \;s.t.\;}} \C S^{n-1}=\{s\in\F R^n\st |s| = 1\}$ . In this work, for simplicity, we will only be using $n=2$ although the case $n=3$ is completely analogous.

2.2. Total variation regularisation

Total variation (TV) regularisation is extremely common across many fields of image processing [32–34]. The definition of the (isotropic) TV semi-norm on domains $\newcommand{\F}[1]{\mathbb{#1}} \Omega\subset\F R^n$ is stated as

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\re}{{\rm Re}} \newcommand{\TV}{{\rm TV}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle \label{TV} &g\in L^1(\Omega,\F R) \Longrightarrow \TV(g) = \sup\left\{\int \IP{g(x)}{\op{div}(\varphi)(x)}{\rm d}x\right.\nonumber \\ &\qquad \qquad \qquad \qquad \left.\vphantom{\int}\st \varphi\in C_c^\infty(\Omega,\F R^n), |\varphi(x)|_2\leqslant 1{{\rm ~for~all~}} x\right\}. \nonumber \end{align} \tag{ 2.2 }$

The intuition behind this is that when $\newcommand{\F}[1]{\mathbb{#1}} g\in W^{1,1}(\Omega,\F R)$ then we have exactly

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\TV}{{\rm TV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \TV(g) = \norm{\nabla g}_{2,1} = \int |\nabla g(x)|_2{\rm d}x.\nonumber \end{align*}$

From this point forward we shall write the $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{\cdot}_{2,1}$ representation where $|\nabla g|$ is to be understood as a measure if necessary. We shall denote the space of bounded variation as

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\TV}{{\rm TV}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle \op{BV}(\Omega, \F R) = \{g\in L^1\st \TV(g) < \infty\} .\nonumber \end{align*}$

One property that we shall use about the space of bounded variation is $\newcommand{\op}[1]{{\rm #1}} \op{BV}\subset L^2\subset L^1$ which holds whenever $\Omega$ is compact by the Poincaré inequality: $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\op}[1]{{\rm #1}} \newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \norm{g-\sfrac{\int g}{\int 1}}_2 \lesssim \op{TV}(g)$ .

The most common way to enforce this prior in reconstruction or inpainting problems is generalised Tikhonov regularisation, which gives us the basic reconstruction method [2, 18].

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \newcommand{\TV}{{\rm TV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle u = \mathop{\argmin}\limits_{u\geqslant 0} \frac12\norm{S\C Ru-b}_2^2 + \lambda\TV(u) ~{\rm for~ some }~ \lambda\geqslant0. \nonumber \end{align} \tag{ 2.3 }$

The parameter $\lambda$ is a regularisation parameter, which allows to enforce more or less regularity, depending on the quality of the data $b$ .

2.3. Directional total variation regularisation

For our sinogram regularisation functional we shall use a directionally weighted TV penalty, motivated by the TV diffusion model developed by Joachim Weickert [35] for various imaging techniques including denoising, inpainting and compression. Such an approach has already shown great ability for enhancing edges in noisy or blurred images, and preserves line structures across inpainting regions [36–38]. The heuristic for our regularisation on the sinogram was described in figure 3 and we shall encode it in an anisotropic TV penalty which shall be written as

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\DTV}{{\rm DTV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle \DTV(v) = \int |A(x)\nabla v(x)| {\rm d}x = \norm{A\nabla v}_{2,1} ~{\rm for ~some~ anisotropic }~ A\colon\F R^2\to \F R^{2\times 2}.\nonumber \end{align*}$

The power of such a weighted extension of TV is that once a line is detected, either known beforehand or detected adaptively, we can embed this in $A$ and enhance or sharpen that line structure in the final result. The general form that we choose for $A$ is

$\begin{align} \newcommand{\e}{{\rm e}} \renewcommand{\vec }{\mathbf} \newcommand{\re}{{\rm Re}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle \label{matrix form} A(x) = &\;c_1(x)\vec {e}_1(x)\vec {e}_1(x)^T + c_2(x)\vec {e}_2(x)\vec {e}_2(x)^T \nonumber \\&{\rm such~ that }~ \vec {e}_i\colon \F R^2\to \F R^2, |\vec {e}_i(x)| = 1, \IP{\vec {e}_1(x)}{\vec {e}_2(x)} = 0 \nonumber \end{align} \tag{ 2.4 }$

i.e.

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\vec }{\mathbf} \newcommand{\re}{{\rm Re}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\DTV}{{\rm DTV}} \displaystyle \DTV(v) = \int \sqrt{c_1^2|\IP{\vec {e}_1}{\nabla v}|^2 + c_2^2|\IP{\vec {e}_2}{\nabla v}|^2} {\rm d}x.\nonumber \end{align*}$

Examples of this are presented in figure 5. Note that the choice $c_1=c_2$ recovers the traditional TV regulariser but for $|c_1|\ll c_2$ we allow for much larger (sparse) gradients in the direction of $\newcommand{\re}{{\rm Re}} \renewcommand{\vec }{\mathbf} \vec {e}_1$ . This allows for large jumps in the direction of $\newcommand{\re}{{\rm Re}} \renewcommand{\vec }{\mathbf} \vec {e}_1$ whilst maintaining flatness in the direction of $\newcommand{\re}{{\rm Re}} \renewcommand{\vec }{\mathbf} \vec {e}_2$ . In order to generate these parameters we follow the construction of Weickert [35]. Given a noisy image, $d$ , we can construct the structure tensor:

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\vec }{\mathbf} \newcommand{\re}{{\rm Re}} \displaystyle (\nabla d_\rho \nabla d_\rho^T)_\sigma(x) = \lambda_1(x)\vec {e}_1(x)\vec {e}_1(x)^T + \lambda_2(x)\vec {e}_2(x)\vec {e}_2(x)^T {{\rm ~such~that~}} \lambda_1(x)\geqslant \lambda_2(x)\geqslant0\nonumber \end{align*}$

where

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \displaystyle d_\rho(x) = [d\star \exp(-\sfrac{|\cdot|^2}{2\rho^2})] (x)= \int \exp(-\sfrac{|y-x|^2}{2\rho^2})d(y) {{\rm ~etc.}}\nonumber \end{align*}$

denotes convolution with the heat kernel. This eigenvalue decomposition is typically very informative in constructing $A$ . If we define

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \Delta(x) = \lambda_1(x)-\lambda_2(x) ~{\rm is~ coherence}~ \qquad \Sigma(x) = \lambda_1(x)+\lambda_2(x) ~{\rm is ~energy}\nonumber \end{align*}$

then the eigenvectors give the alignment of edges and $\Delta, \Sigma$ characterise the local behaviour, as in figure 6. In particular, we simplify the model to

$\begin{align} \newcommand{\e}{{\rm e}} \renewcommand{\vec }{\mathbf} \newcommand{\re}{{\rm Re}} \displaystyle \label{direction tensor} A_d(x) := c_1(x|\Delta,\Sigma)\vec {e}_1(x)\vec {e}_1(x)^T + c_2(x|\Delta,\Sigma)\vec {e}_2(x)\vec {e}_2(x)^T \nonumber \end{align} \tag{ 2.5 }$

where the only parameters left to choose are $c_i$ . Typical examples of include

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle c_1 = \frac{1}{\sqrt{1+\Sigma^2}},\qquad c_2 = 1\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\re}{{\rm Re}} \newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \displaystyle c_1 = \epsilon,\qquad c_2 = \epsilon+\exp\left(-\sfrac{1}{\Delta^2}\right) {{\rm ~for~some~}} \epsilon>0.\nonumber \end{align*}$

**Figure 5.** Examples comparing TV with directional TV for inpainting and denoising. Both examples have the same edge structure and so parameters in (2.4) are the same in both. DTV uses $c_2=1$ and $c_1$ as the indicator (0 or 1) shown in the bottom left plot, TV is the case $c_1=c_2=1$ . Left block: top left image is inpainting input where the dark square shows the inpainting region. The structure of $c_1$ allows DTV (bottom right) to connect lines over arbitrary distances, whereas TV inpainting (top right) will fail to connect the lines if the horizontal distance is greater than the vertical separation of the edges. Right block: top left image is denoising input. DTV has two advantages. Firstly, the structure of $c_1$ recovers a much straighter line than that in the TV reconstruction. Secondly, TV penalises jumps equally in each direction and so the contrast is reduced, DTV is able to penalise noise oscillations independently from edge discontinuities which allows us to maintain much higher contrast.
Download figure:
Standard image High-resolution image

**Figure 5.** Examples comparing TV with directional TV for inpainting and denoising. Both examples have the same edge structure and so parameters in (2.4) are the same in both. DTV uses $c_2=1$ and $c_1$ as the indicator (0 or 1) shown in the bottom left plot, TV is the case $c_1=c_2=1$ . Left block: top left image is inpainting input where the dark square shows the inpainting region. The structure of $c_1$ allows DTV (bottom right) to connect lines over arbitrary distances, whereas TV inpainting (top right) will fail to connect the lines if the horizontal distance is greater than the vertical separation of the edges. Right block: top left image is denoising input. DTV has two advantages. Firstly, the structure of $c_1$ recovers a much straighter line than that in the TV reconstruction. Secondly, TV penalises jumps equally in each direction and so the contrast is reduced, DTV is able to penalise noise oscillations independently from edge discontinuities which allows us to maintain much higher contrast.
Download figure:
Standard image High-resolution image

**Figure 6.** Surface representing a characteristic image, $d$ , to demonstrate the behaviour of $\Sigma$ and $\Delta$ . Away from edges (A) we have $\Sigma\approx\Delta\approx 0$ . On simple edges (B) we have $\Sigma\approx \Delta \gg 0$ and, finally, at corners (C) we have $\Sigma\gg \Delta$ .
Download figure:
Standard image High-resolution image

**Figure 6.** Surface representing a characteristic image, $d$ , to demonstrate the behaviour of $\Sigma$ and $\Delta$ . Away from edges (A) we have $\Sigma\approx\Delta\approx 0$ . On simple edges (B) we have $\Sigma\approx \Delta \gg 0$ and, finally, at corners (C) we have $\Sigma\gg \Delta$ .
Download figure:
Standard image High-resolution image

The key idea here is that $c_1\ll c_2$ near edges and $c_1=c_2$ on flat regions. In practice $d$ will also be an optimisation parameter and so we shall require a regularity result on our choice of $d\mapsto A_d$ , now characterised uniquely by our choice of $c_i$ .

Theorem 2.1. If

(i)
$c_i$ are $2k$ times continuously differentiable in $\Delta$ and $\Sigma$ , $k\geqslant1$
(ii)
$c_1(x|0,\Sigma) = c_2(x|0,\Sigma)$ for all $x$ and $\Sigma\geqslant0$
(iii)
$\partial_\Delta^{2j-1} c_1(x|0,\Sigma) = \partial_\Delta^{2j-1} c_2(x|0,\Sigma) = 0$ for all $x$ and $\Sigma\geqslant0, j=1\ldots,k$

Then $A_d$ is $C^{2k-1}$ with respect to $d$ for all $\rho>0,\sigma\geqslant0$ .

Remark 2.2.

Property (ii) is necessary for $A_d$ to be well defined and continuous for all fixed $d$
If we can write $c_i = c_i(\Delta^2,\Sigma)$ then property (iii) holds trivially

The proof of this theorem is contained in appendix A.

3. The joint model

Now that all of the notation and concepts have been defined we can formalise the statement of our particular joint model of the form in (1.1):

The forward operator, $\newcommand{\C}[1]{\mathcal{#1}} \C R$ , is the x-ray transform from (2.1)
The desired reconstructed sample, $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\op}[1]{{\rm #1}} u\in \op{BV}(\Omega,\F R)$ on some domain $\Omega$
The noisy sub-sampled data, $\newcommand{\F}[1]{\mathbb{#1}} b\in L^1(\Omega',\F R)$ on some $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\C}[1]{\mathcal{#1}} \Omega'\Subset \C S^1\times\F R_{\geqslant0}$ . We extend such that $b|_{\Omega'^c}=0$ for notational convenience.
The full reconstructed sinogram, $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\C}[1]{\mathcal{#1}} v\in L^1(\C S^1\times\F R_{\geqslant0},\F R)$

We also define $S = S_{\Omega'}$ to be the indicator function on $\Omega'$ . The joint model that we thus propose is

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \newcommand{\DTV}{{\rm DTV}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \label{model} (u, v) = \mathop{\argmin}\limits_{u\geqslant 0} E(u,v) = & \;{\mathop{\argmin}\limits_{u\geqslant0}}\frac{1}{2}\norm{\C Ru-v}^2_{\alpha_1} + \frac{\alpha_2}{2}\norm{S\C Ru-b}_2^2 \nonumber \\ &\qquad \quad + \frac{\alpha_3}{2}\norm{Sv-b}_2^2 + \beta_1\op{TV}(u) + \beta_2\DTV_{\C Ru}(v) \nonumber \end{align} \tag{ 3.1 }$

where

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\DTV}{{\rm DTV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \DTV_{\C Ru}(v) = \norm{A_{\C Ru}\nabla v}_{2,1}\nonumber \end{align*}$

and $\alpha_i,\beta_i>0$ are weighting parameters, $A_{Ru}$ as defined in (2.5). $\alpha_1$ is embedded in the norm as it is a spatially varying weight, taking different values inside and outside of $\Omega'$ . We note that the norms involving $b$ are determined by the noise model of the acquisition process, in this case Gaussian noise. The final metric pairing $\newcommand{\C}[1]{\mathcal{#1}} \C Ru$ and $v$ was free to be chosen to promote any structural similarity between the two quantities. We have chosen the squared $L^2$ norm for simplicity although if some structure is known to be important then there is a wide choice of specialised functions from which to choose (e.g. [1]).

The choice of regularisation functionals reflects prior assumptions on the expected type of sample; all of the examples shown later will follow these assumptions. The isotropic TV penalty is chosen as $u$ is expected to be piece-wise constant. This will reduce oscillations from $u$ and favour 'stair-case' like changes of intensity over smooth ones. The assumptions of our regularisation on $v$ must also be derived from expected properties of $u$ . What is known from [28], and can be seen in figure 3, is that discontinuities of $u$ along curves in the spatial domain, say $\gamma$ , generate a so called 'dual curve' in the sinogram. $\newcommand{\C}[1]{\mathcal{#1}} \C Ru$ will also have an edge, although possibly not a discontinuity, along this dual curve. Thus, perpendicular to the dual curve $v$ should have sharp features and parallel to the dual curve intensity should vary slowly. The assumption of our regularisation is that if a dual curve is present in the visible component of the data then it should correspond to some $\gamma$ in the spatial domain. The extrapolation of this dual curve must behave like the boundary of a level-set of $u$ , i.e. preserve the sharp edge and slow varying intensities in $v$ . The main influence of this regularisation is in the inpainting region and so any artefacts it introduces should also only effect edges corresponding to these invisible singularities, including streaking artefacts. Another bias that is present is an assumption that dual curves are themselves smooth. In the inpainting region, this will encourage dual curves with low curvature thus invisible singularities are likely to follow near-circular arcs in the spatial domain. Final parameter choices, such as $\alpha_i, \beta_i$ and $c_i$ , are not necessary at this point and will be chosen in section 5.1.

The immediate question to ask is whether this model is well posed. For a non-convex function we typically cannot expect numerically to find global minimisers but the following result shows we can expect some convergence to local minima. We present the following result which justifies looking for minima of this functional.

Theorem 3.1. If

$c_i$ are bounded away from 0
$\rho>0$
$A_d$ is differentiable in $d$

then sub-level sets of $E$ are weakly compact in $\newcommand{\F}[1]{\mathbb{#1}} L^2(\Omega,\F R)\times L^2(\F R^2,\F R)$ and $E$ is weakly lower semi-continuous. i.e. for all $\newcommand{\F}[1]{\mathbb{#1}} (u_n,v_n)\in L^2(\Omega,\F R)\times L^2(\F R^2,\F R)$ :

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle E(u_n,v_n) ~{\rm uniformly~ bounded~ implies~ a ~subsequence ~converges~ weakly} \nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \liminf E(u_n,v_n) \geqslant E(u,v) ~{\rm whenever }~ u_n\rightharpoonup u, v_n\rightharpoonup v.\nonumber \end{align*}$

The proof of this theorem is contained in appendix B. This theorem justifies numerical minimisation of $E$ because it tells us that all descending sequences ( $E(u_n,v_n)\leqslant E(u_{n-1},v_{n-1})$ ) have a convergent subsequence and any limit point must minimise $E$ over the original sequence.

4. Numerical solution

To address the issue of convergence we shall first generalise our functional and prove the result in the general setting. Functionals will be of the form $\newcommand{\F}[1]{\mathbb{#1}} E\colon X\times Y\to \F R$ where $X$ and $Y$ are Banach spaces and $E$ accepts the decomposition

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle E(x,y) = f(x,y) + g(J(x,y))\nonumber \end{align*}$

such that:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle {\rm Sub\mbox{-}level~ sets~ of }~ E ~{\rm are~ weakly~ compact}~\label{myFunc-start} \nonumber \end{align} \tag{ 4.1 }$

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle f\colon X\times Y\to \F R ~{\rm is~ jointly ~convex~ in }~ (x,y)~ {\rm and~ bounded ~below} \nonumber \end{align} \tag{ 4.2 }$

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle &g\colon Z\to \F R ~{\rm is~ a~ semi\mbox{-}norm~ on~ Banach~ space }~ Z, {\rm i.e.~ for~ all }~ t\in\F R, z,z_1,z_2\in Z \nonumber \\&\qquad g(z)\geqslant0, g(tz) = |t|g(z)~{\rm and}~ g(z_1+z_2)\leqslant g(z_1)+g(z_2) \nonumber \end{align} \tag{ 4.3 }$

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle &J\colon X\times Y\to Z ~{\rm is }~ C^1 ~{\rm and ~for~ all }~ {K\Subset X\times Y}, \;\exists L_x,L_y<\infty ~{{\rm such~ that }~ \forall (x,y)\in K}\nonumber \\ &\qquad g(J(x+{\rm d}x,y)-J(x,y) -\nabla_xJ(x,y){\rm d}x) \leqslant L_x\norm{{\rm d}x}_X^2 \nonumber \end{align} \tag{ 4.4 }$

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \qquad g(J(x,y+{\rm d}y)-J(x,y) -\nabla_yJ(x,y){\rm d}y) \leqslant L_y\norm{{\rm d}y}_Y^2.\label{myFunc-end} \nonumber \end{align} \tag{ 4.5 }$

Note if $g$ is a norm then $L_x$ can be chosen to be the standard Lipschitz factor of $\nabla_xJ$ . If $J$ is twice Frèchet-differentiable then these constants must be finite. In our case:

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\TV}{{\rm TV}} \newcommand{\pslitln}[4]{\left\{\begin{array}{cc} #1 & #2 \nonumber \\ #3 & #4\end{array}\right.} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle f(x,y) &= \frac12\norm{Rx-y}_{\alpha_1}^2+\frac{\alpha_2}{2}\norm{S\C Rx-b}^2 \nonumber \\&\qquad +\frac{\alpha_3}{2}\norm{Sy-b}^2 + \beta_1\TV(x) + \pslitln{0}{x\geqslant 0}{\infty}{{{\rm else}}} \nonumber \\g(z) &= \beta_2\norm{z}_{2,1} \nonumber \\J(x,y) &= A_{Rx}\nabla y \Longrightarrow \tau_x \sim \beta_2\norm{\nabla^2A_{\cdot}}\norm{R}\TV(y), \tau_y = 0. \nonumber \end{align*}$

Finiteness of $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{\nabla^2 A}$ and weak compactness of sub-level sets are given by theorems 2.1 and 3.1 respectively. The alternating descent algorithm is stated in algorithm 1. Classical alternating proximal descent would take $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\argmin}{{\rm argmin}} x_{n+1} = \argmin~ E(x,y_n)+\tau_x\norm{x-x_n}_2^2$ but because of the complexity of $\newcommand{\C}[1]{\mathcal{#1}} A_{\C R u}$ each sub-problem would have the same complexity as the full problem, making it computationally infeasible. By linearising $A_d$ we overcome this problem as both sub-problems are convex, for which there are many standard solvers such as [39, 40]. This second approach is similar to that of the ProxDescent algorithm [31, 41]. We extend this algorithm to cover alternating descent and achieve equivalent convergence guarantees. Using algorithm 1, our statement of convergence is the following theorem.

Input: Any $x_0\in X$ $x_0\in X$ , $\tau_x, \tau_y\geqslant 0$ $\tau_x, \tau_y\geqslant 0$ .

$n\leftarrow 0$ $n\leftarrow 0$

while not converged do

$n\leftarrow n+1$ $n\leftarrow n+1$

$\begin{align} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\argmin}{{\rm argmin}} x_{n} = \mathop{\argmin}\limits_{x\in X} f(x,y_{n-1}) &+ \tau_x\norm{x-x_{n-1}}_{X}^2 \\ & +g(J(x_{n-1},y_{n-1})+\nabla_xJ(x_{n-1},y_{n-1})(x-x_{n-1})) \end{align} \tag{ 4.6 }$

$\begin{align} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\argmin}{{\rm argmin}} y_{n} = \mathop{\argmin}\limits_{y\in Y} f(x_n,y) &+ \tau_y\norm{y-y_{n-1}}_{Y}^2 \\ &+g(J(x_{n},y_{n-1})+\nabla_yJ(x_{n},y_{n-1})(y-y_{n-1}))\end{align} \tag{ 4.7 }$

end while

Ouput: $(x_n,y_n)$

$(x_n,y_n)$

Theorem 4.1. Convergence of alternating minimisation: if $E$ satisfies (4.1)–(4.5) and $(x_n,y_n)$ are a sequence generated by algorithm 1 then

$E(x_{n+1},y_{n+1}) \leqslant E(x_n,y_n)$ for each $n$ .
A subsequence of $(x_n,y_n)$ must converge weakly in $X\times Y$ .
If $\newcommand{\st}{{\rm \;s.t.\;}} \{(x_n,y_n)\st n=1,\ldots\}$ is contained in a finite dimensional space then every limit point of $(x_n,y_n)$ must be a critical point (as will be defined in definition 4.4) of $E$ in both the direction of $x$ and $y$ .

This result is the parallel of lemma 10, theorem 18 and corollary 21 in [31] without the alternating or block descent setting. There is some overlap in the analysis for the two settings although we present an independent proof which is more direct and we feel gives more intuition for our more restricted class of functionals. The rest of this section is now dedicated to the proof of this theorem.

For notational convenience we shall compress notation such that:

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle f_{n,m} = f(x_n,y_m), \quad J_{n,m} = J(x_n,y_m), \quad E_{n,m}=E(x_n,y_m) {{\rm ~etc.}}\nonumber \end{align*}$

4.1. Sketch proof

The proof of theorem 4.1 will be a consequence of two lemmas.

In lemma 4.3 we show for $\tau_x,\tau_y$ (algorithm 1) sufficiently large, the sequence $E_{n,n}$ is monotonically decreasing and sequences $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{x_n-x_{n-1}}_X, \norm{y_n-y_{n-1}}_Y$ converge to 0. This follows by a relatively standard sufficient decrease argument as seen in [31, 42, 43].
In lemma 4.6 we first define a critical point for functions which are non-convex and non-differentiable. If a subsequence of our iterates converges in norm then the limit must be a critical point in each of the two axes. Note that it is very difficult to expect more than this in such a general setting, for instance example 4.2 shows a uniformly convex energy which shows this to be sharp. The common technique for overcoming this is assuming differentiability in the terms including both $x$ and $y$ [42, 44, 45]. These previous results and algorithms are not available to us as we allow non-convex terms which are also non-differentiable.

Example 4.2. Define $E(x,y) = \max(x,y)+x^2+y^2$ for $\newcommand{\F}[1]{\mathbb{#1}} x,y\in\F R$ . This is clearly jointly convex in $(x,y)$ and thus a simple case of functions considered in theorem 4.1. Suppose $(x_0,y_0) = (0,0)$ then

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \displaystyle x_1 &= \argmin~ E(x,y_0) + \tau_x(x-x_0)^2 = 0 \nonumber \\y_1 &= \argmin ~E(x_1,y) + \tau_y(y-y_0)^2 = 0. \nonumber \end{align*}$

Hence $(0,0)$ is a limit point of the algorithm but it is not a critical point, $E$ is uniformly convex and so it has only one critical point, $\newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} (-\sfrac12,-\sfrac12)$ .

4.2. Sufficient decrease lemma

In the following we prove the monotone decrease property of our energy functional between iterations.

Lemma 4.3. If for each $n$

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \tau_x \geqslant L_x+\tau_X, \qquad \tau_y \geqslant L_y+\tau_Y\nonumber \end{align*}$

for some $\tau_X,\tau_Y\geqslant0$ then

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \sum^\infty\tau_X\norm{x_n-x_{n-1}}_{X}^2 + \tau_Y\norm{y_n-y_{n-1}}_{Y}^2 \leqslant E(x_0,y_0)\nonumber \end{align*}$

and

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle E(x_{n+1},y_{n+1})\leqslant E(x_n,y_n) ~{\rm for~ all }~ n.\nonumber \end{align*}$

Proof. Note by equations (4.6) and (4.7) (definition of our sequence) we have

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle f_{n+1,n} + g(J_{n,n}+\nabla_xJ_{n,n}(x_{n+1}-x_n)) + \tau_x\norm{x_{n+1}-x_n}_{X}^2 \leqslant E_{n,n} \label{gc1} \nonumber \end{align} \tag{ 4.8 }$

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle f_{n+1,n+1} + g(J_{n+1,n}+\nabla_yJ_{n+1,n}(y_{n+1}-y_{n})) + \tau_y\norm{y_{n+1}-y_{n}}_{Y}^2 \leqslant E_{n+1,n} .\label{gc3} \nonumber \end{align} \tag{ 4.9 }$

Further, by application of the triangle inequality for $g$ and the mean value theorem we have

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle g(J_{n+1,n}) - &g(J_{n,n}+\nabla_xJ_{n,n}(x_{n+1}-x_{n}))+ \tau_X\norm{x_{n+1}-x_{n}}_{X}^2\nonumber \\&\qquad \leqslant g(J_{n+1,n} -J_{n,n}-\nabla_xJ_{n,n}(x_{n+1}-x_{n}))+ \tau_X\norm{x_{n+1}-x_{n}}_{X}^2\nonumber \\&\qquad = g([\nabla_x J(\xi)-\nabla_xJ_{n,n}](x_{n+1}-x_{n}))+ \tau_X\norm{x_{n+1}-x_{n}}_{X}^2\nonumber \\&\qquad \leqslant \op{Lip}_{X,g}(\nabla_x J(\cdot,y_n))\norm{x_{n+1}-x_{n}}_{X}^2+ \tau_X\norm{x_{n+1}-x_{n}}_{X}^2 \nonumber \\&\qquad \leqslant \tau_x\norm{x_{n+1}-x_{n}}_{X}^2 .\label{gc2} \nonumber \end{align} \tag{ 4.10 }$

By equivalent argument,

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle g(J_{n+1,n+1}) - g(J_{n,n+1}+\nabla_yJ_{n+1,n}(y_{n+1}-y_{n}))+ \tau_Y\norm{y_{n+1}-y_{n}}_{Y}^2 \leqslant \tau_y\norm{y_{n+1}-y_n}_{Y}^2. \label{gc4} \nonumber \end{align} \tag{ 4.11 }$

Summing equations (4.8)–(4.11) gives

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle E_{n+1,n+1} + \tau_X\norm{x_{n+1}-x_n}_{X}^2+\tau_Y\norm{y_{n+1}-y_n}_{Y}^2\leqslant E_{n,n}.\nonumber \end{align*}$

This immediately gives the monotone decrease property of $E_{n,n}$ . If we also sum this over $n$ then we achieve the final statement of the theorem:

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \sum_{n=1}^\infty\tau_X\norm{x_{n+1}-x_n}_{X}^2+\tau_Y\norm{y_{n+1}-y_n}_{Y}^2\leqslant E_{0,0} - \lim E_{n,n}\leqslant E_{0,0}\nonumber \end{align*}$

□

4.3. Convergence to critical points

First we follow the work by Drusvyatskiy et al [46] we shall define criticality in terms of the slope of a function.

Definition 4.4. We shall say that $x_*$ is a critical point of $\newcommand{\F}[1]{\mathbb{#1}} F\colon X\to \F R$ if

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle |\partial F(x_*)|= 0\nonumber \end{align*}$

where we define the slope of $F$ at $x_*$ to be

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |\partial F(x_*)| = \limsup_{{\rm d}x\to 0} \frac{\max(0,F(x_*)-F(x_*+{\rm d}x))}{\norm{{\rm d}x}}.\nonumber \end{align*}$

The first point to note is that this definition generalises the concept of a first order critical point for both smooth functions and convex functions (in terms of the convex sub-differential). In particular

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\re}{{\rm Re}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle F\in C^1 \Longrightarrow &|\partial F(x_*)| = \max\left(0,\sup_{\norm{{\rm d}x}= 1}-\IP{\nabla F(x_*)}{{\rm d}x}\right) = \norm{\nabla F(x_*)} \nonumber \\{{\rm Hence~}} &|\partial F(x_*)| = 0 \iff \norm{\nabla F(x_*)} = 0\iff \nabla F(x_*) = 0 \nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle F{{\rm ~convex,~hence~}} & x^*\in \argmin ~F \iff F(x_*) \leqslant F(x_*+{\rm d}x) {{\rm ~for~all~}} {\rm d}x \nonumber \\{{\rm Hence~}} &\hspace{-1pt}|\partial F(x_*)| = 0 \iff \forall {\rm d}x, 0\geqslant \frac{F(x_*)-F(x_*+{\rm d}x)}{\norm{{\rm d}x}} \iff x_*\in\argmin~ F. \nonumber \end{align*}$

For a differentiable function we cannot tell whether a critical point is a local minimum, maximum or saddle point. In general, this is also true for definition 4.4, however, at points of non-differentiability there is a bias towards local minima. This can be seen in the following example.

Example 4.5. Consider $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} F(x) = -\norm{x}$

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |\partial F(0)| = \limsup_{{\rm d}x\to 0}\;\max\left(0,\frac{0+\norm{0+{\rm d}x}}{\norm{{\rm d}x}}\right) = 1\nonumber \end{align*}$

Hence, 0 is not a critical point of $F$ . This bias is due to the $\limsup$ in the definition which detects the strictly negative directional derivatives. This does not affect smooth functions as directional derivatives must vanish continuously to 0 about a critical point.

Some more examples are shown in figure 7. Now we shall show that our iterative sequence converges to a critical point in this sense.

**Figure 7.** Examples of 1D functions where 0 is/is not a critical point by definition 4.4. If a function is piece-wise linear then 0 is a critical point iff each directional derivative is non-negative. If the function can be approximated on an interval of $x>0$ to first order terms by $\newcommand{\re}{{\rm Re}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\e}{{\rm e}} F(x) = cx^{1+\epsilon}$ then criticality can be characterised sharply. If $c\geqslant0$ then 0 is always a critical point. If $c<0$ then 0 is a critical point iff $\newcommand{\re}{{\rm Re}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\e}{{\rm e}} \epsilon>0$ , however, 0 is never a local minimum. (a) Examples of critical points and examples of non-critical points.
Download figure:
Standard image High-resolution image

Lemma 4.6. If $(x_n,y_n)$ are as given by algorithm 1 and $X, Y$ are finite dimensional spaces then every limit point of $(x_n,y_n)$ , e.g. $(x_*,y_*)$ , is a critical point of $E$ in each coordinate direction. i.e.

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle |\partial_xE(x_*,y_*)| = |\partial_yE(x_*,y_*)| = 0.\nonumber \end{align*}$

Remark 4.7.

Finite dimensionality of $X$ and $Y$ accounts for what is referred to as 'assumption 3' in [31] and is some minimal condition which ensures that the limit is also a stationary point of our iteration (equations (4.6) and (4.7)).
The condition for finite dimensionality, as we shall see, does not directly relate to the non-convexity. The difficulty of showing convergence to critical points in infinite dimensions is common across both convex [39] and non-convex [31, 42] optimisation.

Proof. First we recall that, in finite dimensional spaces, convex functions are continuous on the relative interior of their domain [47]. Also note that by our choice of $\tau_x$ in lemma 4.6, for all $x,x',y$ we have

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |g(J(x,y)+\nabla_x J(x,y)(x'-x)) &- g(J(x',y))| \nonumber \\&\leqslant |g([J(x,y)-J(x',y)]-\nabla_x J(x,y)(x'-x))| \nonumber \\&= |g(\nabla_x J(\xi,y)(x'-x)-\nabla_x J(x,y)(x'-x))| \nonumber \\&\leqslant \tau_x\norm{\xi-x}_X\norm{x'-x}_X \nonumber \\&\leqslant \tau_x\norm{x'-x}_X^2 \nonumber \end{align*}$

where existence of such $\xi$ is given by the mean value theorem. Hence, for all $x$ we have

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle E(x_{n+1},y_n) & = f_{n+1,n} + g(J_{n+1,n}) \nonumber \\&\leqslant f_{n+1,n} + g(J_{n,n}+\nabla_xJ_{n,n}(x_{n+1}-x_n)) + \tau_x\norm{x_{n+1}-x_n}_X^2 \nonumber \\&\leqslant f(x,y_n)+g(J_{n,n}+\nabla_xJ_{n,n}(x-x_n))+\tau_x\norm{x-x_n}_X^2 \nonumber \\&\leqslant f(x,y_n)+g(J(x,y_n))+2\tau_x\norm{x-x_n}_X^2 \nonumber \\&= E(x,y_n) + 2\tau_x\norm{x-x_n}_X^2 \nonumber \end{align*}$

where the first and third inequality are due to the condition shown above and the second is due to the definition of $x_{n+1}$ in (4.6). Finally, by continuity of $f$ , $J$ and $g$ we can take limits on both sides of this inequality:

$\begin{align} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \label{conv to stationary} \Longrightarrow E(x_*,y_*) \leqslant E(x,y_*) +2\tau_x\norm{x-x_*}_X^2 {{\rm ~for~all~}} x. \nonumber \end{align} \tag{ 4.12 }$

This completes the proof for $x_*$ as

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |\partial_xE(x_*,y_*)| = \limsup_{x\to x_*} \max\left(0,\frac{E(x_*,y_*)-E(x,y_*)}{\norm{x_*-x}_X}\right) \leqslant \limsup 2\tau_x\norm{x-x_*}_X = 0.\nonumber \end{align*}$

The proof for $y_*$ follows by symmetry. □

Remark 4.8.

The important line in this proof, and where we need finite dimensionality, is being able to pass to the limit for (4.12). In the general case we can only expect to have $(x_n,y_n)\rightharpoonup (x_*,y_*)$ , typically guaranteed by choice of regularisation functionals as in our theorem 3.1. In this reduced setting the left hand limit of (4.12) still remains valid,
$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle E(x_*,y_*) \leqslant \liminf E(x_{n+1},y_n) {{\rm ~by~weak~lower~semi\mbox{-}continuity.}}\nonumber \end{align*}$
However, on the right hand side we require:
$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \lim E(x,y_n) + 2\tau_x\norm{x-x_n}_X^2 \leqslant E(x,y_*) + 2\tau_x\norm{x-x_*}_X^2\nonumber \end{align*}$
In particular, we already require $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{x-x_n}_X$ to be weakly upper semi-continuous. Topologically, this is the statement that weak and norm convergence are equivalent which will not be the case in most practical (infinite dimensional) examples.
The properties we derive for $(x_*,y_*)$ are actually slightly stronger than that of definition 4.4 which only depends on an infinitesimal ball about $(x_*,y_*)$ . However, (4.12) gives us a quantification for the more global optimality of this point. This is seen in figure 8.

**Figure 8.** Theorem 3.1 tells us that $(x_*,y_*)$ is a local critical point but does not qualify the globality of the limit point. Equation (4.12) further allows us to quantify the idea that if a lower energy critical point exists then it must lie far from $(x_*,y_*)$ . In particular, it must lie outside of the shaded cone given by the supporting quadratic.
Download figure:
Standard image High-resolution image

**Figure 8.** Theorem 3.1 tells us that $(x_*,y_*)$ is a local critical point but does not qualify the globality of the limit point. Equation (4.12) further allows us to quantify the idea that if a lower energy critical point exists then it must lie far from $(x_*,y_*)$ . In particular, it must lie outside of the shaded cone given by the supporting quadratic.
Download figure:
Standard image High-resolution image

4.4. Proof of theorem 4.1

Proof. Fix arbitrary $(x_0,y_0)\in X\times Y$ and $\tau_X,\tau_Y\geqslant0$ . Let $(x_n,y_n)$ be defined as in algorithm 1. By lemma 4.3 we know that $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\st}{{\rm \;s.t.\;}} \{(x_n,y_n)\st n\in\F N\}$ is contained in a sub-level set of $E$ which in turn must be weakly compact by (4.1). The assumption of lemma 4.6 is that we are in a finite dimensional space and so weak compactness is equivalent to norm compactness, i.e. some subsequence of $(x_n,y_n)$ converges in norm. Also by lemma 4.6 we know that the limit point of this sequence must be an appropriate critical point. □

5. Results

For numerical results we present two synthetic examples and one experimental dataset from Transmission Electron Tomography. The two synthetic examples are discretised at a resolution of $200\times200$ then simulated using the x-ray transform with a parallel beam geometry sampled at $1^\circ$ intervals over a range of $60^\circ$ resulting in a full sinogram of size $287\times180$ and sub-sampled data at $287\times60$ . Gaussian white noise (standard deviation 5% maximum signal) is then added to give the data. The experimental dataset was acquired with an annular dark field (parallel beam) Scanning TEM modality from which we have 46 projections spaced uniformly in $3^\circ$ intervals over a range of $135^\circ$ . Because of the geometry of the acquisition, we can treat the original 3D dataset as a stack of 2D and thus extract one of these slices as our example. This 2D dataset is then sub-sampled to 29 projections over $87^\circ$ , reducing the size from $173\times45$ to $173\times29$ . This results in a reconstruction with $u$ of size $120\times120$ and $v$ of size $173\times180$ . A more detailed description of the acquisition and sample properties of the experimental dataset can be found in [48]. The code, and data, for all examples is available⁷ under the Creative Commons Attribution (CC BY) license.

5.1. Numerical details

All numerics are implemented in MATLAB 2016b. The sub-problem for $u$ is solved with a PDHG algorithm [39] while the sub-problem for $v$ is solved using the MOSEK solver via CVX [40, 49], the step size $\tau$ is adaptively calculated. The initial point of our algorithm is always chosen to be a good TV reconstruction, i.e.

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \newcommand{\TV}{{\rm TV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle u_0 = \mathop{\argmin}\limits_{u\geqslant 0}\frac12\norm{S\C Ru-b}_2^2 + \lambda\TV(u), \qquad v_0 = \C R u_0.\nonumber \end{align*}$

For clarity, we shall restate our full model with all of the parameters it includes. We seek to minimise the functional (3.1):

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle E(u,v) = \frac{1}{2}\norm{\C Ru-v}^2_{\alpha_1} + \frac{\alpha_2}{2}\norm{S\C Ru-b}_2^2 + \frac{\alpha_3}{2}\norm{Sv-b}_2^2 + \beta_1\op{TV}(u) + \beta_2\norm{A_{Ru}\nabla v}_{2,1} \nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\vec }{\mathbf} \newcommand{\re}{{\rm Re}} \displaystyle A_d(x) &= c_1(x|\lambda_1-\lambda_2, \lambda_1+\lambda_2)\vec {e}_1(x)\vec {e}_1(x)^T + c_2(x|\lambda_1-\lambda_2, \lambda_1+\lambda_2)\vec {e}_2(x)\vec {e}_2(x)^T \nonumber \\{{\rm where~}} &(\nabla d_\rho\nabla d_\rho^T)_\sigma = \lambda_1 \vec {e}_1\vec {e}_1^T + \lambda_2 \vec {e}_2\vec {e}_2^T {{\rm ~is~a~pointwise~eigenvalue~decomposition}} \nonumber \\& c_1(x|\Delta,\Sigma) = 10^{-6}+\frac{\tanh(\Sigma(x))}{1+\beta_3\Delta(x)^2},\quad c_2(x|\Delta,\Sigma) = 10^{-6}+\tanh(\Sigma(x)). \nonumber \end{align*}$

We chose these particular $c_i$ according to two simple heuristics. If $\Sigma$ is large (steep gradients) then it is likely a region with edges and so the regularisation should be largest but still bounded above. If $\Delta=0^+$ then there is a small or blurred 'edge' present and so we want to encourage it to become a sharp jump, i.e. $\partial_\Delta c_1<0$ . Theorem 2.1 tells us that choosing $c_i$ as functions of $\Delta^2$ will guarantee accordance with our later convergence results; this leads to our natural choice above. The number of iterations for algorithm 1 was chosen to be 200 and 100 for the synthetic and experimental datasets, respectively. To simplify the process of choosing values for the remaining hyper-parameters we made several observations:

(i)
The choice of $\alpha_i$ and $\beta_i$ appeared to be quite insensitive about the optimum. It is clear within 2–3 iterations whether values are of the correct order of magnitude. After this, values were only tuned coarsely. For example, $\alpha_3$ and $\beta_i$ are optimal within a factor of $\newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} 10^{\pm\sfrac12}$ .
(ii)
We can chose $\alpha_2=1$ without any loss of generality. In which case, in general, $\beta_1$ should the same order of magnitude as when performing the TV reconstruction to get $u_0, v_0$ .
(iii)
$\alpha_2$ pairs $u$ to the given data and $\alpha_1$ pairs $u$ to the inpainted data, $v$ . As such, $\alpha_1$ is spatially varying but should be something like a distance to the non-inpainting region. We chose the binary metric so that $u$ is paired to $v$ uniformly on the inpainting region and not at all outside.
(iv)
$\newcommand{\DTV}{{\rm DTV}} \DTV$ specific parameters ( $\beta_2, \beta_3, \rho, \sigma$ ) can be chosen outside of the main reconstruction. These were chosen by solving the toy problem:
$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\argmin}{{\rm argmin}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \argmin ~\frac12\norm{v-v_0}_2^2 + \beta_2\norm{A_{v_0}\nabla v}_{2,1}\nonumber \end{align*}$
which is a lot faster to solve. $\rho>0$ is required for the analysis and so this was fixed at 1. $\sigma$ is a length-scale parameter which indicates the separation between distinct edges. $\beta_3$ relies on the normalisation of the data. As can be seen in table 1, for the two synthetic examples, with same discretisation and scaling, these values are also consistent. The only value which changes is $\beta_2$ , as expected, which weights how valid the $\newcommand{\DTV}{{\rm DTV}} \DTV$ prior is for each dataset.

Table 1. Parameter choices for numerical experiments. Each algorithm was run for 300 iterations.

	$\alpha_1$	$\alpha_2$	$\alpha_3$	$\beta_1$	$\beta_2$	$\beta_3$	$\rho$	$\sigma$
Concentric rings phantom	$\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\1}[0]{\F{1}} \frac1{2^2}\1_{\Omega'^c}$	1	$1\times10^{-1}$	$3\times10^{-5}$	$3\times10^{3}$	$10^{10}$	$1$	$8$
Shepp–Logan phantom	$\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\1}[0]{\F{1}} \frac1{4^2}\1_{\Omega'^c}$	1	$3\times10^{-1}$	$3\times10^{-5}$	$3\times10^{2}$	$10^{10}$	$1$	$8$
Experimental dataset (both sampling ratios)	$\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\1}[0]{\F{1}} \frac1{2^2}\1_{\Omega'^c}$	1	$3\times10^{2}$	$1\times10^{-5}$	$3\times10^{1}$	$10^6$	$1$	$0$

It is unclear whether a gridsearch may provide better results although, due to the number of parameters involved, this would definitely take a lot longer and mask some interpretability of the parameters. A further comparison of different choices of the main parameters can be found in the supplementary material.

5.2. Canonical synthetic dataset

This example shows two concentric rings. This is the canonical example for our model because the exact sinogram is perfectly radially symmetric which should trivialise the directional inpainting procedure, even with noise present. As can clearly be seen in figure 9, the TV reconstruction is poor in the missing wedge direction which can be seen as a blurring out of the sinogram. By enforcing better structure in the sinogram, our proposed joint model is capable of extrapolating these local structures from the given data domain to recover the global structure and gives an accurate reconstruction.

**Figure 9.** Canonical synthetic example. Top row shows the reconstructions, $u$ , while the bottom row shows the reconstructed sinogram, $v$ .
Download figure:
Standard image High-resolution image

**Figure 9.** Canonical synthetic example. Top row shows the reconstructions, $u$ , while the bottom row shows the reconstructed sinogram, $v$ .
Download figure:
Standard image High-resolution image

5.3. Non-trivial synthetic dataset

This example shows the modified Shepp–Logan phantom which is built up as a sum of ellipses. This example has a much more complex geometry although the sinogram still has a clear geometry. In figure 10 we see that the largest scale feature, the shape of the largest ellipse, is recovered in our proposed reconstruction with minimal loss of contrast in the interior. One artefact we have not been able to remove is the two rays extending from the top of the reconstructed sample. Looking more closely we found that it was due to a small misalignment of the edge at the bottom of the sinogram as it crosses between the data to the inpainting region. Numerically, this happens because of the convolutions which take place inside the directional TV regularisation functional. Having a non-zero blurring is essential for regularity of the regularisation (theorem 2.1) but the effect of this is that it does not heavily penalise misalignment on such a small scale. This means that at the interface between the fixed data-term there is a slight kink, the line is continuous but not $C^1$ . The effect of this on the reconstruction is the two lines which extend from the sample at this point. Looking at quantitative measures, the PSNR value rises from 17.33 to 17.36 whereas the SSIM decreases from 0.76 to 0.62, from TV to the proposed reconstruction, respectively. These measures are inconclusive and the authors feel that they fail to balance the improvement to global geometry verses more local artefacts in the reconstructions.

**Figure 10.** Non-trivial synthetic example of the modified Shepp–Logan phantom. Top row shows the reconstructions, $u$ , while the bottom row shows the reconstructed sinogram, $v$ . We regain the large-scale geometry of the shape without losing much of the interior features.
Download figure:
Standard image High-resolution image

**Figure 10.** Non-trivial synthetic example of the modified Shepp–Logan phantom. Top row shows the reconstructions, $u$ , while the bottom row shows the reconstructed sinogram, $v$ . We regain the large-scale geometry of the shape without losing much of the interior features.
Download figure:
Standard image High-resolution image

5.4. Experimental dataset

The sample is a silver bipyramidal crystal placed on a planar surface, and the challenges of this dataset are shown in figure 11. We immediately see that the wide angle projections have large artefacts which produces a very low signal to noise ratio. Another issue present is that there is mass seen in some of the projections which cannot be represented within the reconstruction volume. Both of these issues violate the simple x-ray model that is used. Exact modelling would require estimation of parameters which are not available a priori and so the preferred acquisition is one which automatically minimises these modelling errors. Another artefact is that over time each surface becomes coated with carbon. This is a necessary consequence of the sample preparation and this coating is known to occur during the microscopy. The result of modelling errors and time dependent noise is to prefer an acquisition with limited angular range and earliest acquired projections. Because of this, in numerical experiments we compare both TV and our proposed reconstruction using only $\newcommand{\sfrac}[2]{^{#1}\!\!/\!_{#2}} \sfrac34$ of the available data, 29 projections over an $87^\circ$ interval, with a bias towards earlier projections. The artefacts due to the out-of-view mass are unavoidable but we can perform some further pre-processing to minimise the effect. In particular, if we shrink the field of view of the detector then the 'heaviest' part of the data will be the particle of interest and the model violations will be relatively small, increasing the signal to noise ratio. This can be seen as the sharp horizontal cut-off in the pre-processed sinograms seen on the right of figure 11. The effect of this on the reconstruction is going to be that there is a thin ring of mass placed at the edge of the (shrunken) detector view which will be clearly identifiable in the reconstruction. As a ground truth approximation we shall use a TV reconstruction on the full data for the location of the particle alongside prior knowledge of this sample for more precise geometrical features. We also note that the particle should be very homogeneous so this is another example where we expect the TV reconstruction to be very good.

**Figure 11.** Raw data for transmission electron microscopy example. Projections at large angles, e.g. $-68^\circ$ , show the presence of the sample holder which violates the x-ray modelling assumption that outside of the region of interest is vacuum. If the violation is too extreme then this can cause strong artefacts in reconstructions and so the common action is to discard such data. The plane surface also violates this model but is relatively weak at low angles and so will cause weaker artefacts. A source of noise in this acquisition is that over time the surface becomes coated with carbon. This is first visible as a thin film at $-2^\circ$ and progressively gets thicker through the remaining projections. At $34^\circ$ we see a bump of carbon appear on the top right edge. After pre-processing, as shown top right artificially sub-sample to compare TV with our proposed reconstruction method.
Download figure:
Standard image High-resolution image

The sample is a single crystal of silver and so we know it must have very uniform intensity and we are interested in locating the sharp facets which bound the crystal [48]. In figure 12 we immediately see that the combination of homogeneity and sharp edges is better reconstructed in our proposed reconstruction. Because we expect the reconstruction to be constant on the background and the particle, thresholding the reconstruction allows us to easily locate the boundaries and estimate interior angles of the particle. Figure 13 shows such images where the threshold is chosen to be the approximate midpoint of these two levels. We see that the proposed reconstruction consistently has less jagged edges and the left hand corner is better curved, as is consistent with our knowledge of the sample. Looking back at the full colour images we see that this is a result of lack of sharp decay at the boundary and homogeneity inside the sample. Looking for location error we see the biggest error in both TV and joint reconstruction is on the bottom-left edge where both reconstructions pull the line inwards. However, looking particularly at points $(40,80)$ and $(20,60)$ , we see that this was less severe in the proposed method. The other missing wedge artefact is in the top right corner which has been extended in both reconstructions although it is thinner in the proposed reconstruction. This indicates that it was better able to continue the straight edges either side of the corner and the blurring in the missing wedge direction is more localised than in the TV reconstruction. Overall, we see see that the proposed reconstruction method is much more robust to an decrease in angular sampling range.

**Figure 12.** Reconstructions from a slice of the experimental data. We have chosen the slice half-way down through the projections shown in figure 11 to coincide with one of the rounded corners. The arc artefact was an anticipated consequence due to the out-of-view mass, the pre-processing has simply reduced the intensity. Proposed reconstructions consistently show better homogeneity inside the particle and sharper boundaries. The missing angles direction is the bottom-left to top-right diagonal where we see most error in each reconstruction, in particular, the blurring of the top right corner of the particle is a limited angle artefact.
Download figure:
Standard image High-resolution image

**Figure 13.** Comparison between each reconstruction after thresholding. The geometrical properties of interest are that each edge should be linear, the left hand corner is rounded and the remaining corners are not. The particle of interest is homogeneous so thresholding the images should emphasise this in a way which is very unsympathetic to blurred edges. Again, the top right corner of each particle in the sub-sampled reconstructions coincides with the exacerbated missing wedge direction and so we expect each reconstruction to make some error here.
Download figure:
Standard image High-resolution image

6. Conclusions and outlook

In this paper we have presented a novel method for tomographic reconstructions in a limited angle scenario along with a numerical algorithm with convergence guarantees. We have also tested our approach on synthetic and experimental data and shown consistent improvement over alternative reconstruction methods. Even when the x-ray transform model is noticeably violated, as with our experimental data, we still better recover boundaries of the reconstructed sample.

There are three main directions which could be explored in future. Firstly, we think there is great potential to apply our framework to other applications, such as in tomographic imaging with occlusions and heavy metal artefacts where the inpainting region is much smaller [22, 23]. Secondly, we would like to find an alternative numerical algorithm with either faster practical convergence or one which is more capable of avoiding local minima in this non-convex landscape. Finally, we would like to explore the potential for an alternative regularisation functional on the sinogram which is better able to treat visible and invisible singularities, denoising and inpainting problems, independently. At the moment, the TV prior alone can reconstruct visible singularities well however, introducing a sinogram regulariser currently improves on the invisible region at the expensive of damaging the visible. Overall, we feel that this presents the natural progression for the current work although it remains unclear how to regularise these invisible singularities.

Acknowledgments

RT would like to thank Ozan Öktem for valuable discussions on the application of micro-local analysis in this setting. RT acknowledges support from the EPSRC grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. MB acknowledges support from the Leverhulme Trust Early Career Fellowship Learning from mistakes: 'A supervised feedback-loop for imaging applications', and the Isaac Newton Trust. CBS acknowledges support from the Leverhulme Trust project on 'Breaking the non-convexity barrier', EPSRC grant Nr. EP/M00483X/1, the EPSRC Centre Nr. EP/N014588/1, the RISE projects CHiPS and NoMADS, and the Alan Turing Institute. CBS, MB and RT also acknowledge the Cantab Capital Institute for the Mathematics of Information. CB acknowledges support from the EU RISE project NoMADS, the PIHC innovation fund of the Technical Medical Centre of UT and the Dutch 4TU programme Precision Medicine. MJL acknowledges financial support from the Netherlands Organization for Scientific Research (NWO), project 639.073.506. SMC acknowledges support from the Henslow Research Fellowship at Girton College, Cambridge. RKL acknowledges support from a Clare College Junior Research Fellowship. PAM acknowledges EPSRC grant EP/R008779/1.

Appendix A.: Theorem 2.1

Theorem. If

(i)
$c_i$ are $2k$ times continuously differentiable in $\Delta$ and $\Sigma$ , $k\geqslant1$
(ii)
$c_1(x|0,\Sigma) = c_2(x|0,\Sigma)$ for all $x$ and $\Sigma\geqslant0$
(iii)
$\partial_\Delta^{2j-1} c_1(x|0,\Sigma) = \partial_\Delta^{2j-1} c_2(x|0,\Sigma) = 0$ for all $x$ and $\Sigma\geqslant0, j=1\ldots,k$

Then $A_d$ is $C^{2k-1}$ with respect to $d$ for all $\rho>0,\sigma\geqslant0$ .

In this proof we will drop the $x$ argument from $c_i$ for conciseness of notation. Define

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle M_d = (\nabla d_\rho\nabla d_\rho^T)_\sigma.\nonumber \end{align*}$

Note that convolutions are bounded linear maps and $\nabla d_\rho\in L^2$ by Young's inequality hence $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\op}[1]{{\rm #1}} M\colon L^1(\F R^2,\F R)\to L^\infty(\F R^2,\op{Sym}_2)$ is well defined and differentiable w.r.t. $d$ . Hence, it suffices to show that $A$ is differentiable w.r.t. $M$ where

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle M_d = \lambda_1e_1e_1^T+\lambda_2e_2e_2^T, \lambda_1\geqslant \lambda_2\Longrightarrow A = c_1(\Delta,\Sigma)e_1e_1^T+c_2(\Delta,\Sigma)e_2e_2^T \nonumber \end{align*}$

where $\Delta=\lambda_1-\lambda_2, \Sigma=\lambda_1+\lambda_2$ . Note that this is not a trivial statement, we can envisage very simple cases in which the (ordered) eigenvalue decomposition is not even continuous. For instance

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\sfrac}[2]{^{#1}/_{#2}} \newcommand{\pslitln}[4]{\left\{\begin{array}{cc} #1 & #2 \nonumber \\ #3 & #4\end{array}\right.} \newcommand{\MTwo}[4]{\left(\begin{array}{cc} #1 & #2 \nonumber \\ #3 & #4 \end{array}\right)} \displaystyle M(t) = \MTwo{1-t}{0}{0}{t} \Longrightarrow \Sigma(t) = 1, \Delta(t) = |1-2t|, e_1 = \pslitln{(1,0)^T}{t<\sfrac12}{(0,1)^T}{t>\sfrac12}.\nonumber \end{align*}$

Hence we can see that despite having $M\in C^\infty$ we cannot even guarantee that the decomposition is continuous and so cannot employ any chain rule to say that $A$ is smooth.

The structure of this proof breaks into four parts:

(i)
If $c_1(0,\Sigma) = c_2(0,\Sigma)$ then $A$ is well defined
(ii)
If $c_i\in C^2$ for some open neighbourhood of point $x$ such that $\lambda_1(x)>\lambda_2(x)$ then $A$ is differentiable w.r.t. $M$ on an open neighbourhood of $x$
(iii)
If $\partial_\Delta c_1(0,\Sigma) = \partial_\Delta c_2(0,\Sigma) = 0$ at a point, $x$ , where $\lambda_1(x)=\lambda_2(x)$ then $A$ is differentiable on an open neighbourhood of $x$
(iv)
If $\partial_\Delta^{2j-1} c_1(0,\Sigma) = \partial_\Delta^{2j-1} c_2(0,\Sigma) = 0$ at a point $x$ where $\lambda_1(x)=\lambda_2(x)$ and for all $j=1\ldots,k$ then $A$ is $C^{2k-1}$ on an open neighbourhood of $x$

Proof. Proof of part i: note that when $\lambda_1=\lambda_2$ the choice of $e_i$ is non-unique subject to $\newcommand{\op}[1]{{\rm #1}} e_1e_1^T + e_2e_2^T = \op{id}$ and so we get

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\op}[1]{{\rm #1}} \displaystyle A = c_1(0,\Sigma)\op{id} + (c_2(0,\Sigma)-c_1(0,\Sigma))e_2e_2^T.\nonumber \end{align*}$

Hence $A$ is well defined if and only if $c_1(0,\Sigma) = c_2(0,\Sigma)$ for all $\Sigma\geqslant0$ .

As we are decomposing $2\times 2$ matrices it will simplify the proof to write explicit forms for the values under consideration.

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\MTwo}[4]{\left(\begin{array}{ll} #1 & #2 \nonumber \\ #3 & #4 \end{array}\right)} \displaystyle M = \MTwo{M_{11}}{M_{12}}{M_{12}}{M_{22}}\Longrightarrow \lambda_i = \frac{M_{11}+M_{22} \pm \sqrt{(M_{11}-M_{22})^2+4M_{12}^2}}{2}\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \Sigma = M_{11} + M_{22}, \quad \Delta = \sqrt{(M_{11}-M_{22})^2+4M_{12}^2}\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \Delta\neq 0 \Longrightarrow e_1 = \frac{(2M_{12},\Delta-M_{11}+M_{22})^T}{\sqrt{(\Delta-M_{11}+M_{22})^2+4M_{12}^2}}=\frac{(\Delta+M_{11}-M_{22},2M_{12})^T}{\sqrt{(\Delta+M_{11}-M_{22})^2+4M_{12}^2}},\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\MTwo}[4]{\left(\begin{array}{cc} #1 & #2 \nonumber \\ #3 & #4 \end{array}\right)} \displaystyle \quad e_2 = \MTwo{0}{-1}{1}{0}e_1.\nonumber \end{align*}$

We give two equations for $e_1$ to account for the case when we get $\frac{(0,0)^T}{0}$ .

Proof of part ii: note that $\Sigma$ is always smooth and $\Delta$ is smooth on the set $\{\Delta>0\}$

Case $M_{12}(x)\neq 0$ : now both equations of $e_1$ are valid (and equal) and the denominators non-zero on a neighbourhood. Hence, we can apply the standard chain rule and product rule to conclude.

Case $M_{12}(x)=0$ : in this case $M(x)$ is diagonal but as $\Delta = |M_{11}-M_{22}|>0$ we know that one of our formulae for $e_1$ must be valid with non-vanishing denominator in a neighbourhood and so we can conclude as in the first case.

Proof of part iii: given the Neumann condition on $c_i$ we shall express $c_i$ locally by Taylor's theorem.

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle c_i(\Delta,\Sigma) = c_i(0,\Sigma) + O(\Delta^2) = c_1(0,\Sigma)+O(\Delta^2).\nonumber \end{align*}$

Now we consider a perturbation:

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\re}{{\rm Re}} \newcommand{\MTwo}[4]{\left(\begin{array}{cc} #1 & #2 \nonumber \\ #3 & #4 \end{array}\right)} \displaystyle M = \MTwo{m}{0}{0}{m}, \quad \epsilon = \MTwo{\epsilon_{11}}{\epsilon_{12}}{\epsilon_{12}}{\epsilon_{22}}\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\re}{{\rm Re}} \newcommand{\op}[1]{{\rm #1}} \displaystyle \Longrightarrow A(M+\epsilon)-A(M) = (c_1(0,2m+\epsilon_{11}+\epsilon_{22})-c_1(0,2m))\op{id} +O(\Delta^2)\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\re}{{\rm Re}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \Delta^2 = (\epsilon_{11}-\epsilon_{22})^2+ 4\epsilon_{12}^2 = O(\norm{\epsilon}^2)\Longrightarrow O(\Delta^2)\leqslant O(\norm{\epsilon}^2)\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\re}{{\rm Re}} \newcommand{\op}[1]{{\rm #1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle \Longrightarrow \frac{A(M+\epsilon)-A(M)}{\norm{\epsilon}} = \frac{\partial_\Sigma c_1(0,2m)\op{tr}(\epsilon)}{\norm{\epsilon}} + O(\norm{\epsilon}).\nonumber \end{align*}$

In particular, $A$ is differentiable w.r.t. $M$ at $x$ .

Proof of part iv: the proof of this follows exactly as the previous part,

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle c_i(\Delta,\Sigma) = \sum_0^{k-1} \frac{\Delta^{2j}}{j!}\partial_\Delta^{2j}c_i(0,\Sigma) + O(\Delta^{2k})\nonumber \end{align*}$

where the remainder term is sufficiently smooth. Hence $c_i$ is at least $2k-1$ times differentiable w.r.t. $M$ . □

Appendix B.: Theorem 3.1

Theorem. If

$c_i$ are bounded away from 0
$\rho>0$
$A_d$ is differentiable in $d$

then sub-level sets of $E$ are weakly compact in $\newcommand{\F}[1]{\mathbb{#1}} L^2(\Omega,\F R)\times L^2(\F R^2,\F R)$ and $E$ is weakly lower semi-continuous. i.e. for all $\newcommand{\F}[1]{\mathbb{#1}} (u_n,v_n)\in L^2(\Omega,\F R)\times L^2(\F R^2,\F R)$ :

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle E(u_n,v_n) ~{\rm uniformly~ bounded~ implies~ a~ subsequence~ converges ~weakly} \nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \liminf E(u_n,v_n) \geqslant E(u,v)~ {\rm whenever }~ u_n\rightharpoonup u, v_n\rightharpoonup v\nonumber \end{align*}$

Proof. If $c_i$ are bounded away from 0 then in particular we have $\newcommand{\C}[1]{\mathcal{#1}} A_{\C Ru_n}\gtrsim1$ so $\newcommand{\C}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\TV}{{\rm TV}} \newcommand{\DTV}{{\rm DTV}} \DTV_{u_n}(v_n) = \norm{A_{\C Ru_n}\nabla v_n} \gtrsim \norm{\nabla v_n} = \TV(v_n)$ . Hence,

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\TV}{{\rm TV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle E(u_n,v_n) ~{\rm uniformly~ bounded }~ & \Longrightarrow \nonumber \\ &\norm{S_{\Omega'^c}(\C Ru_n-v_n)}_2^2 + \norm{S_{\Omega'}\C Ru_n-b}_2^2 + \norm{S_{\Omega'}v_n-b}_2^2 \nonumber \\&\qquad + \TV(u_n) + \TV(v_n) ~{\rm uniformly ~bounded}~ \nonumber \\&\Longrightarrow \norm{A(u,v)^T - b}_2^2 + \TV\left((u,v)\right) ~{\rm uniformly~ bounded}~ \nonumber \end{align*}$

for some linear $A$ and constant $b$ . Thus we are in a very classical setting where weak compactness can be shown in both the $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{(u,v)}_2$ norm and $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\TV}{{\rm TV}} \norm{(u,v)}_1+\TV((u,v))$ [50].

We now proceed to the second claim, that $E$ is weakly lower semi-continuous. Note that all of the convex terms in our energy are lower semi-continuous by classical arguments so it remains to show that the non-convex term is too. i.e.

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle (u_n,v_n)\rightharpoonup (u,v)\stackrel{?}{\Longrightarrow} \liminf\norm{A_{\C Ru_n}\nabla v_n}_{2,1} \geqslant \norm{A_{\C Ru}\nabla v}_{2,1}.\nonumber \end{align*}$

We shall first present a lemma from distribution theory.

Lemma B.1. If $\Omega$ is compact, $\newcommand{\F}[1]{\mathbb{#1}} \varphi\in C^\infty(\Omega,\F R)$ and $\newcommand{\st}{{\rm \;s.t.\;}} w_n\stackrel{L^{\,p}}{\rightharpoonup} w$ then

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\F}[1]{\mathbb{#1}} \displaystyle w_n\star \varphi \to w\star \varphi ~{\rm convergence ~in }~ C^k(\Omega,\F R) ~{\rm for ~all }~ k<\infty.\nonumber \end{align*}$

Proof. Recall that $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} w_n\rightharpoonup w\Longrightarrow \norm{w_n}_p\leqslant W$ for some $W<\infty$ . By Hölder's inequality:

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |w_n\star\varphi (x) - w\star\varphi (y)| \leqslant \int |w_n(z)||\varphi(x-z)-\varphi(y-z)| \lesssim _{p,\Omega} |x-y|W\norm{\varphi'}_\infty\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \displaystyle |w_n\star\varphi(x)|\lesssim _{p,\Omega} W\norm{\varphi}_\infty \nonumber \end{align*}$

i.e. $\newcommand{\F}[1]{\mathbb{#1}} \newcommand{\st}{{\rm \;s.t.\;}} \{w_n\st n\in \F N\}$ is uniformly bounded and uniformly (Lipschitz) continuous.

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\re}{{\rm Re}} \displaystyle w_n\rightharpoonup w \Longrightarrow w_n\star \varphi(x)-w\star\varphi(x) = \IP{w_n-w}{\varphi(x-\cdot)} \to 0 {{\rm ~for~every~}} x.\nonumber \end{align*}$

Hence, we also know $\newcommand{\st}{{\rm \;s.t.\;}} w_n\star \varphi$ converges point-wise to a unique continuous function. Suppose $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\re}{{\rm Re}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\st}{{\rm \;s.t.\;}} \newcommand{\e}{{\rm e}} \norm{w_{n_k}\star \varphi-w\star \varphi}_\infty \geqslant \epsilon>0$ for some $\newcommand{\re}{{\rm Re}} \renewcommand{\epsilon}{\varepsilon} \newcommand{\e}{{\rm e}} \epsilon$ and sub-sequence $n_k\to\infty$ . By the Arzela–Ascoli theorem some further subsequence must converge uniformly to the point-wise limit, $\newcommand{\st}{{\rm \;s.t.\;}} w\star \varphi$ , which gives us the required contradiction. Hence, $\newcommand{\st}{{\rm \;s.t.\;}} w_n\star \varphi\to w\star \varphi$ in $L^\infty = C^0$ . The general theorem follows by induction. □

This lemma gives us two direct results.

$\begin{align*} \newcommand{\e}{{\rm e}} \displaystyle \rho>0\Longrightarrow (Ru_n)_\rho \to (Ru)_\rho{{\rm ~in~}}L^\infty\nonumber \end{align*}$

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\union}{\cup} \displaystyle \{(Ru_n)_\rho\}\union\{(Ru)_\rho\} {{\rm ~compact,~}} A_d\in C^1(d) \Longrightarrow A_{Ru_n} \to A_{Ru} {{\rm ~in~}} \norm{\cdot}_{2,\infty}.\nonumber \end{align*}$

Hence, whenever $w\in W^{1,1}$ we have

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\TV}{{\rm TV}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \norm{A_{\C Ru_n}\nabla w} &\geqslant \norm{A_{\C Ru}\nabla w} - \norm{(A_{\C Ru_n}-A_{\C Ru})\nabla w} \nonumber \\&\geqslant \norm{A_{\C Ru}\nabla w}-\norm{A_{\C Ru_n}-A_{\C Ru}}_{2,\infty}\TV(w). \nonumber \end{align*}$

By density of $W^{1,1}$ in the space of bounded variation, we know the same holds for $w=v_n$ . We also know $\newcommand{\TV}{{\rm TV}} \TV(v_n)$ is uniformly bounded thus

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \liminf \norm{A_{\C Ru_n}\nabla v_n} = \liminf \norm{A_{\C Ru}\nabla v_n}.\nonumber \end{align*}$

Hence, for all $\newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \norm{\varphi}_{2,\infty}\leqslant 1$ we have

$\begin{align*} \newcommand{\e}{{\rm e}} \newcommand{\IP}[2]{\left\langle #1,#2 \right\rangle} \newcommand{\re}{{\rm Re}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\C}[1]{\mathcal{#1}} \displaystyle \IP{v}{\nabla \cdot(A_{\C Ru}\varphi)} = \liminf_n \IP{v_n}{\nabla \cdot(A_{\C Ru}\varphi)} \leqslant \liminf \norm{A_{\C Ru}\nabla v_n} \leqslant \liminf \norm{A_{\C Ru_n}\nabla v_n}\nonumber \end{align*}$

as required. □

Directional sinogram inpainting for limited angle tomography

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

1.1. Problem formulation

1.2. Context and proposed model

1.3. Overview and contributions

2. Preliminaries

2.1. The x-ray transform

2.2. Total variation regularisation

2.3. Directional total variation regularisation

3. The joint model

4. Numerical solution

4.1. Sketch proof

4.2. Sufficient decrease lemma

4.3. Convergence to critical points

4.4. Proof of theorem 4.1

5. Results

5.1. Numerical details

5.2. Canonical synthetic dataset

5.3. Non-trivial synthetic dataset

5.4. Experimental dataset

6. Conclusions and outlook

Acknowledgments

Appendix A.: Theorem 2.1

Appendix B.: Theorem 3.1

Footnotes

Directional sinogram inpainting for limited angle tomography

Article metrics

Submit

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

1.1. Problem formulation

1.2. Context and proposed model

1.3. Overview and contributions

2. Preliminaries

2.1. The x-ray transform

2.2. Total variation regularisation

2.3. Directional total variation regularisation

3. The joint model

4. Numerical solution

4.1. Sketch proof

4.2. Sufficient decrease lemma

4.3. Convergence to critical points

4.4. Proof of theorem 4.1

5. Results

5.1. Numerical details

5.2. Canonical synthetic dataset

5.3. Non-trivial synthetic dataset

5.4. Experimental dataset

6. Conclusions and outlook

Acknowledgments

Appendix A.: Theorem 2.1

Appendix B.: Theorem 3.1

Footnotes