Faster PET reconstruction with non-smooth priors by randomization and preconditioning

Uncompressed clinical data from modern positron emission tomography (PET) scanners are very large, exceeding 350 million data points (projection bins). The last decades have seen tremendous advancements in mathematical imaging tools, many of which lead to non-smooth (i.e. non-differentiable) optimization problems that are much harder to solve than smooth ones. Most of these tools have not been translated to clinical PET data, as the state-of-the-art algorithms for non-smooth problems do not scale well to large data. In this work, inspired by big data machine learning applications, we use advanced randomized optimization algorithms to solve the PET reconstruction problem for a very large class of non-smooth priors, which includes for example total variation, total generalized variation, directional total variation and various physical constraints. The proposed algorithm uses random subsets of the data and only updates the variables associated with these. While this idea often leads to divergent algorithms, we show that the proposed algorithm does indeed converge for any proper subset selection. Numerically, we show on real PET data (FDG and florbetapir) from a Siemens Biograph mMR that about ten projections and backprojections are sufficient to solve the MAP optimization problem related to many popular non-smooth priors, thus showing that the proposed algorithm is fast enough to bring these models into routine clinical practice.

One of the most popular algorithms to solve the resulting non-smooth convex optimization problem is the primal-dual hybrid gradient (PDHG) algorithm (Esser et al 2010). PDHG has been used in numerous imaging studies on multiple imaging modalities, including PET, see e.g. Dupe et al (2011), Wolf et al (2013), Rigie and La Riviere (2015), Foygel Barber et al (2016), Knoll et al (2016), Zhang et al (2016), Rigie et al (2017), Schramm et al (2017) and Rasch et al (2018). While this algorithm is flexible enough to solve a variety of non-smooth optimization problems, in every iteration both the projection and the backprojection have to be applied for all projection bins. Moreover, in every iteration computations on vectors that have the size of the data have to be performed. For modern scanners like the Siemens Biograph mMR with span-1 data format, these vectors contain more than 350 million elements, which limits the applicability of this algorithm (and thus of many non-smooth priors) on state-of-the-art scanners.

Subset acceleration with randomization
We propose an algorithm, coined Stochastic PDHG or SPDHG for short, which in every iteration performs computations only for a random subset of the data. We show on clinical data from a Siemens Biograph mMR that with this algorithm, for the first time, non-smooth priors become feasible for routine clinical imaging. Numerically, we show that SPDHG is competitive with OSEM on unregularized reconstruction problems but, thanks to its mathematically guaranteed convergence, stable with respect to the choice of the subsets. In fact, SPDHG converges to the deterministic solution for any proper subset selection, see theorem 1.
In addition to the general randomized solution strategy, we propose two further algorithmic advancements: preconditioning and non-uniform sampling.

Preconditioning
We propose and evaluate the use of data-dependent preconditioners in SPDHG for PET image reconstruction. While the convergence theory for a large class of preconditioners has been available since 2011, our proposed preconditioners are the first to be computationally efficient and effective for PET image reconstruction with non-smooth priors. The speed enhancement of preconditioning for PDHG was recognized before (Sidky et al 2012); however, we present a novel formulation of these preconditioners that is computationally efficient, see theorem 2.

Non-uniform sampling
We propose a novel non-uniform sampling strategy, which is necessary to accommodate the differences between the data fidelity and the prior. Both randomization and preconditioning can be used independently or combined, as proposed in this work.

Uncompressed data
In this work we use uncompressed (span-1) data from the Siemens Biograph mMR. While it is not clear if and how much this improves the reconstructed PET images (Belzunce and Reader 2017), the proposed algorithm is fast enough to study the benefits of uncompressed data in combination with a variety of regularization models.
A few initial findings on randomized reconstruction without preconditioning were published in a conference paper (Ehrhardt et al 2017).

PET reconstruction via optimization
Given the measured data vector b ∈ ℕ^M and the projection model P, the PET reconstruction problem can be formulated as the solution of the optimization problem

u* ∈ arg min_u { D(Pu) + αR(u) },    (1)

where the data fidelity D(Pu) measures the match of the estimated image u with the data and the prior αR(u) penalizes features that are not desirable in the solution. In other words, the prior can be used to avoid solutions which would fit the noisy data too closely. The data fidelity D is (up to constants independent of u) the negative log-likelihood of the multivariate Poisson distribution with expected value being the sum of the projected image y = Pu and the estimated background activity r. The latter is needed in order to model non-linear effects such as scatter and randoms. The data fidelity D measures the distance of the estimated data Pu + r to the measured data b in the sense that D(Pu) ⩾ 0 and D(Pu) = 0 if and only if Pu + r = b. The operator P performs the projection and includes geometric factors, attenuation and normalization.
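For illustration, the Poisson data fidelity can be sketched as a Kullback-Leibler-type divergence between the expected data Pu + r and the measured counts b. The following minimal Python sketch (the function name and dense inputs are ours, not the paper's implementation) makes the conventions 0 log 0 := 0 and −log 0 := ∞ explicit:

```python
import numpy as np

def kl_fidelity(y, b, r):
    """Sketch of the data fidelity D evaluated at y = Pu (not the paper's code).

    Returns a value >= 0 that is 0 exactly when the expected data y + r
    matches the measured counts b.
    """
    e = np.asarray(y, float) + np.asarray(r, float)   # expected data Pu + r
    b = np.asarray(b, float)
    phi = np.full_like(e, np.inf)                     # -log 0 := inf outside e > 0
    ok = e > 0
    logterm = np.zeros_like(e)                        # 0 log 0 := 0 where b == 0
    pos = ok & (b > 0)
    logterm[pos] = b[pos] * np.log(b[pos] / e[pos])
    phi[ok] = e[ok] - b[ok] + logterm[ok]
    return phi.sum()
```

By convexity of t − b log t, each summand is non-negative and vanishes only at e = b.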
While the main motivation is the efficient solution of non-smooth optimization problems, we first compare the method to ordered subsets expectation maximization (OSEM) (Hudson and Larkin 1994) for unregularized reconstruction. The 'ordered subsets' idea has subsequently been used for many algorithms related to non-smooth optimization, see e.g. McGaffin and Fessler (2015). We would like to show in the next example, first, that the 'ordered subsets' idea is in general non-convergent and thus may be unstable and, second, that the proposed algorithm is as fast as OSEM, despite its proven convergence.

Motivating example: OSEM
If there is no prior, i.e. αR = 0, the most common algorithm to solve the optimization problem (1) is the maximum likelihood expectation maximization algorithm (MLEM) (Shepp and Vardi 1982), defined by

u^(k+1) = ( u^k / (P^T 1) ) P^T ( b / (P u^k + r) ),    (2)

where all operations have to be understood element-wise. The computational bottleneck in the MLEM algorithm is the evaluation of the operator P and its transpose P^T in each iteration.
To overcome this hurdle, it has been proposed to change the update and evaluate the operator and its adjoint only on one out of m subsets of the data in each iteration. At every iteration k we choose i = mod(k, m) and change update formula (2) to

u^(k+1) = ( u^k / (P_i^T 1) ) P_i^T ( b_i / (P_i u^k + r_i) ).    (3)

This algorithm became known as OSEM. Here P_i is the restriction of P onto the ith subset, i.e. P = (P_1^T, ..., P_m^T)^T. While this change of the update equation reduces the computational burden to 1/m, it is in general not guaranteed to converge to a solution of (1), as illustrated in figures 1(a) and (b). A convergent version of OSEM, called complete-data OSEM (COSEM), has been developed (Hsiao et al 2002). While it comes with mathematical convergence guarantees, it is much slower than OSEM (see figure 1(c)) and therefore never became popular for the reconstruction of clinical PET data.
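The two updates can be sketched with a small dense matrix standing in for the projector P (a toy sketch under our own naming, not the paper's code):

```python
import numpy as np

def mlem_step(u, P, b, r):
    """One MLEM iteration (2); all divisions and products are element-wise."""
    sens = P.T @ np.ones(P.shape[0])            # sensitivity image P^T 1
    return u / sens * (P.T @ (b / (P @ u + r)))

def osem_step(u, P_i, b_i, r_i):
    """One OSEM sub-iteration (3): the same update restricted to subset i."""
    sens_i = P_i.T @ np.ones(P_i.shape[0])
    return u / sens_i * (P_i.T @ (b_i / (P_i @ u + r_i)))
```

Note that an image u with P u + r = b is a fixed point of both updates, consistent with D(Pu) = 0 at a perfect data match.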
MLEM has been extended to include smooth (Green 1990) and certain non-smooth (Sawatzky et al 2013) prior information; however, conceptually both algorithms intrinsically struggle with the ordered subset acceleration. Other algorithms have also been 'accelerated' based on the ordered subset idea, e.g. McGaffin and Fessler (2015) and Ross Schmidtlein et al (2017), but are similarly intrinsically unstable due to their non-convergence. See Cheng et al (2013) for a numerical comparison and Ahn et al (2015) and Teoh et al (2015) for a validation on clinical PET data. For differentiable priors, a surrogate-based technique allows for stable subset acceleration (De Pierro and Yamagishi 2001, Ahn and Fessler 2003, Ahn et al 2006). In this work we propose the subset-accelerated algorithm SPDHG that is provably convergent and thus stable and robust, see figures 1(a) and (b). SPDHG is flexible enough to be applicable to a large variety of convex and non-smooth priors and is as efficient as OSEM if no explicit prior is being used, see figures 1(c) and (d).

Non-smooth PET reconstruction with subsets
As outlined above, PET reconstruction can be formulated in terms of the optimization problem (1). Computationally, it is convenient to rewrite (and solve) the optimization problem (1) in terms of subsets. We denote by M the number of projection bins and choose subsets S_1, ..., S_m of projection bins that cover the data, i.e. ∪_{i=1}^m S_i = [M], where we used the notation [M] := {1, ..., M}. It is not necessary to assume that S_i ∩ S_j = ∅ for i ≠ j, but for notational simplicity we will restrict ourselves to the case of disjoint subsets. We define the subset data fidelities

D_i(y) = Σ_{j ∈ S_i} ϕ(y_j),    (4)

with the distance function for every data point given by

ϕ(y) = y + r − b + b log( b / (y + r) )  if y + r > 0,  and  ϕ(y) = ∞ otherwise,    (5)

where we omitted the index j at ϕ, y, r and b for readability. Algorithms from convex optimization require the problem to be defined over an entire vector space, which we satisfy by extending ϕ to ∞ for non-positive estimated data y + r. The data and the background are photon counts and therefore have a natural non-negativity constraint. To allow for the concise notation in (5), we define 0 log 0 := 0 and −log 0 := ∞. We model the non-negativity constraint for the image u with the indicator function ı_+, which is defined as

ı_+(u) = 0  if u ⩾ 0 element-wise,  and  ı_+(u) = ∞ otherwise.    (6)

Thus, this results in the unconstrained optimization problem

Problem 1 (PET reconstruction with subsets).

min_u { Σ_{i=1}^m D_i(P_i u) + αR(u) + ı_+(u) }.    (7)
We would like to stress that solving problem (7) is equivalent to solving the original problem (1) for any choice of subsets. In fact, the subset selection becomes a reconstruction parameter that may be varied to speed up the reconstruction procedure.
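The subset decomposition can be mimicked in code by splitting the rows of the system matrix; a small sketch (our own helper, assuming disjoint index sets):

```python
import numpy as np

def split_rows(P, subsets):
    """P_i is the restriction of P onto the i-th subset of projection bins."""
    return [P[np.asarray(S), :] for S in subsets]
```

For disjoint subsets that cover [M], stacking the blocks recovers a row permutation of P, so Σ_i D_i(P_i u) agrees with D(Pu) regardless of the chosen partition.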
Often, our prior assumptions involve linear operators, too. One of the most prominent examples of this is the total variation (Rudin et al 1992)

R(u) = TV(u) = ‖∇u‖_{2,1},    (9)

where we take the 2-norm locally, i.e. at every voxel i we take the 2-norm of the spatial gradient, and the 1-norm globally, i.e. we sum over all voxels. A forward difference discretization of the gradient operator ∇ is used as in Chambolle and Pock (2011). Similarly, we use the directional total variation R(u) = dTV(u) = ‖D∇u‖_{2,1} to incorporate a-priori knowledge about the solution given by an anatomical prior image, see Ehrhardt and Betcke (2016), Ehrhardt et al (2016), Schramm et al (2017) and Bungert et al (2018) for details.
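As a concrete illustration, the discrete isotropic TV of a 2D image with forward differences can be sketched as follows (zero boundary handling; the function name is ours, and this is a sketch rather than the paper's implementation):

```python
import numpy as np

def tv(u):
    """Isotropic TV ||grad u||_{2,1}: 2-norm per pixel, 1-norm over pixels."""
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:-1, :] = u[1:, :] - u[:-1, :]     # forward difference along axis 0
    dy[:, :-1] = u[:, 1:] - u[:, :-1]     # forward difference along axis 1
    return np.sqrt(dx**2 + dy**2).sum()
```

Replacing the local 2-norm by a 1-norm would give the anisotropic variant used later in the paper.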
Solving problem (7) is challenging, even when the involved variables are small and matrix-vector products are easy to compute. The difficulty stems from its non-smoothness. The data term D_i is not finite everywhere and, while it is differentiable on its effective domain dom(D_i) := {y | D_i(y) < ∞}, the gradient is not globally Lipschitz continuous. In addition, further non-smoothness comes from the constraint ı_+, and the prior R may be non-smooth as well. All of this being said, in PET reconstruction the variable sizes are actually very large and matrix-vector products are expensive to compute.
To apply optimization algorithms to solve (7), we reformulate it as a generic optimization problem of the form

Problem 2 (Generic optimization problem).

min_x Σ_{i=1}^n f_i(A_i x) + g(x).    (8)

For instance, for unregularized reconstructions, i.e. αR = 0, we may make the association n = m, f_i = D_i, A_i = P_i for i ∈ [m] and g = ı_+, and reconstructions regularized by the total variation, i.e. R(u) = ‖∇u‖_{2,1}, can be achieved with n = m + 1, f_i = D_i, A_i = P_i for i ∈ [m], f_n = α‖·‖_{2,1}, A_n = ∇ and g = ı_+.

Optimization with saddle-point problems
Instead of solving problem (8) directly, it is more efficient to reformulate the minimization problem as a saddle point problem making use of the convex conjugate of a functional, see e.g. Bauschke and Combettes (2011).
For convex, proper and lower semi-continuous (lsc) functionals f we have that f** = f, see e.g. Bauschke and Combettes (2011), and thus f(x) = sup_y { ⟨x, y⟩ − f*(y) }. Applying this to each f_i in (8) yields the saddle point problem

min_x sup_y Σ_{i=1}^n { ⟨A_i x, y_i⟩ − f_i*(y_i) } + g(x).    (10)

We will refer to the variable x as the primal variable and to y as the dual variable.

Example 1. The convex conjugate of the PET distance function (4) is given element-wise by

ϕ*(y) = −ry − b log(1 − y)  if y ⩽ 1 (using the conventions above),  and  ϕ*(y) = ∞ otherwise,    (11)

where we omitted the index j at ϕ, y, b and r for readability.
The derivation of the formulas in this and the following example are omitted for brevity.
As some (or all) of the f_i and g in (8) are non-smooth, we make use of their proximal operators. Our definition varies slightly from the usual one as we allow the step size parameter to be matrix-valued. For a symmetric and positive definite matrix S, we define the weighted norm ‖x‖_S via ‖x‖_S^2 := ‖S^{−1/2} x‖^2 = ⟨S^{−1} x, x⟩.

Definition 2 (Proximal operator). Let S be a symmetric and positive definite matrix. Then we define the proximal operator of f with metric (or step size) S as

prox_S^f (x) := arg min_z { (1/2) ‖z − x‖_S^2 + f(z) }.

From here on, S and T will always be diagonal (and thus symmetric) and positive definite matrices.
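For instance, for g = ı_+ the proximal problem decouples per voxel under any diagonal metric, so the proximal operator is plain element-wise clipping, independent of the step size. A minimal sketch (naming is ours):

```python
import numpy as np

def prox_nonneg(x):
    """Prox of the indicator of {u >= 0}: element-wise projection onto the
    non-negative orthant; for diagonal metrics T the result does not depend on T."""
    return np.maximum(x, 0.0)
```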

Example 3. Let S_i be diagonal with positive entries σ. The proximal operator of the convex conjugate of the PET distance (11) can be computed element-wise. For each element it is given by

prox_σ^{ϕ*}(y) = (1/2) ( w + 1 − sqrt( (w − 1)^2 + 4σb ) ),

where we again omitted the indices j for readability and denoted w = y + σr.
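This closed form can be checked numerically: the returned point z satisfies the first-order optimality condition (z − y)/σ − r + b/(1 − z) = 0 of the proximal subproblem and lies strictly below 1 whenever b > 0. A sketch (names are ours):

```python
import numpy as np

def prox_phi_conj(y, sigma, b, r):
    """Element-wise prox of the conjugate PET distance with step size sigma:
    the smaller root of (z - w)(1 - z) + sigma*b = 0 with w = y + sigma*r."""
    w = y + sigma * r
    return 0.5 * (w + 1.0 - np.sqrt((w - 1.0) ** 2 + 4.0 * sigma * b))
```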

Algorithm
The saddle point problem (10) (and therefore the PET reconstruction problem (8)) can be solved with PDHG (Chambolle and Pock 2011), see algorithm 1. It consists of very simple operations involving only basic linear algebra, matrix-vector multiplications and the evaluation of proximal operators. As seen in line 4 of the pseudo-code, PDHG updates all dual variables simultaneously. Therefore, in lines 4 and 5, the projection and backprojection that correspond to the whole data set have to be evaluated. The idea of SPDHG, algorithm 2, is to select only one dual variable randomly in each iteration (line 4) and to perform the update accordingly (lines 5 and 6). An important detail is the extrapolation in line 8 with the inverse of the probability p_i that block i is selected in each iteration. This guarantees convergence, as proven in theorem 1 below.
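The structure of SPDHG can be sketched as follows, with small dense blocks A_i in place of the projectors and user-supplied proximal operators (all names and the scalar step sizes are our own simplifications, not the paper's implementation):

```python
import numpy as np

def spdhg(A_blocks, prox_g, prox_fconj, x, y_blocks, S, T, p, n_iter, rng):
    """Sketch of SPDHG: per iteration only one random block i is projected and
    backprojected; z = sum_i A_i^T y_i is maintained incrementally."""
    z = sum(A.T @ y for A, y in zip(A_blocks, y_blocks))
    zbar = z.copy()
    for _ in range(n_iter):
        x = prox_g(x - T * zbar)                        # primal update
        i = rng.choice(len(A_blocks), p=p)              # random block selection
        y_new = prox_fconj[i](y_blocks[i] + S[i] * (A_blocks[i] @ x), S[i])
        dz = A_blocks[i].T @ (y_new - y_blocks[i])      # backproject block i only
        y_blocks[i] = y_new
        z = z + dz
        zbar = z + dz / p[i]                            # extrapolation with 1/p_i
    return x
```

As a toy usage, f_i(v) = ½‖v − b_i‖² gives prox_{σ f_i*}(v) = (v − σ b_i)/(1 + σ), and with g = 0 the iterates approach the least-squares solution.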

Convergence
SPDHG is guaranteed to converge for any f_i and g which are convex, proper and lsc. We now state a very general convergence result which can be derived from Chambolle et al (2018, theorem 4.3). The actual proof is omitted here for brevity. For more details on convergence and convergence rates we refer the reader to Chambolle et al (2018).

Theorem 1 (Convergence). Assume that the sampling is proper, i.e. p_i > 0 for all i, and that for all i the step size condition

‖ S_i^{1/2} A_i T^{1/2} ‖^2 < p_i

holds. Then for any initialization, the iterates (x, y) of SPDHG (algorithm 2) converge to a saddle point of (10) almost surely in a Bregman distance.

Remark 1 (Computational efficiency). Each iteration of algorithm 2 is computationally efficient as only projections and backprojections corresponding to the randomly selected subset i of the data are required. However, the algorithm maintains the whole backprojected dual variable z = P^T y = Σ_{i=1}^m P_i^T y_i and in each iteration updates the primal variable with it.
Remark 2 (Memory requirements). The memory requirement of algorithm 2 is higher compared to OSEM or gradient descent but still reasonably low. It requires memory equivalent to two images (z, z̄) and up to twice the binned sinogram data (y, y⁺) in addition to the necessary memory consumption (output image, sinogram data, background and normalization).

Remark 3 (Sampling). SPDHG allows any kind of random selection as long as the draws are independent and each block i is selected with positive probability p_i > 0. We will investigate two choices of sampling in the numerical section of this paper. A more thorough numerical and theoretical investigation will be the subject of future work.

Step sizes and preconditioning
We will now discuss two different choices of step sizes under which SPDHG is guaranteed to converge. The proof of the following theorem uses arguments from Chambolle et al (2018) and Pock and Chambolle (2011) and is omitted here for brevity.
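Since the explicit formulas (13) and (14) are not reproduced in this excerpt, the following sketch only illustrates the flavor of diagonal (preconditioned) step sizes in the spirit of Pock and Chambolle (2011): dual steps from row sums of |A_i|, a primal step from probability-weighted column sums. This is our own illustrative choice, not necessarily the paper's theorem 2; it satisfies the standard step size condition ‖S_i^{1/2} A_i T^{1/2}‖² ⩽ p_i for diagonal metrics.

```python
import numpy as np

def diagonal_step_sizes(A_blocks, p, gamma=1.0):
    """Diagonal step size vectors: S_i from row sums of |A_i|, T from
    probability-weighted column sums (Pock-Chambolle-style preconditioning)."""
    S = [gamma / np.maximum(np.abs(A).sum(axis=1), 1e-12) for A in A_blocks]
    col = sum(np.abs(A).sum(axis=0) / pi for A, pi in zip(A_blocks, p))
    T = (1.0 / gamma) / np.maximum(col, 1e-12)
    return S, T
```

For PET, the row and column sums correspond to a forward projection of a constant image and a backprojection of constant data, respectively, which are cheap to precompute.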

Numerical results
We use in all numerical experiments the parameter γ = 1; fine-tuning of this parameter is left for future work. Moreover, all peak signal-to-noise ratio (PSNR) and relative objective comparisons are performed by first computing an approximate minimizer x* with the deterministic PDHG using 5000 iterations. The PSNR is defined as PSNR(x_k, x*) = 20 log10( ‖x*‖_∞ / ‖x_k − x*‖_2 ) and the relative objective value is defined as (Ψ(x_k) − Ψ(x*)) / (Ψ(x_0) − Ψ(x*)). We frequently use the word 'epoch' to denote the number of iterations of a randomized algorithm which are in expectation computationally equivalent to one iteration of the deterministic algorithm that uses all data in each iteration. For example, if a randomized algorithm only uses 1/10 of the data in each iteration, then after 10 iterations one can expect that the algorithm has used all data; thus, in this case 1 epoch equals 10 iterations. In all figures, the dashed lines correspond to deterministic and the solid lines to randomized algorithms. The Python code is available online at https://github.com/mehrhardt/spdhg_pet and the florbetapir data set can be downloaded via https://zenodo.org/record/1472951#.XZtnR-dKhTY.
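The two evaluation metrics can be written down compactly (log taken to base 10, as is standard for PSNR in dB; function names are ours):

```python
import numpy as np

def psnr(x, x_star):
    """PSNR(x_k, x*) = 20 log10( ||x*||_inf / ||x_k - x*||_2 )."""
    return 20.0 * np.log10(np.max(np.abs(x_star)) / np.linalg.norm(x - x_star))

def rel_objective(obj_k, obj_0, obj_star):
    """(Psi(x_k) - Psi(x*)) / (Psi(x_0) - Psi(x*)): 1 at the start, 0 at x*."""
    return (obj_k - obj_star) / (obj_0 - obj_star)
```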

Data
We validate the numerical performance of the proposed algorithm on two clinical PET data sets, which we refer to as FDG and florbetapir. The two PET brain data sets each use a distinct radiotracer: [18F]FDG for epilepsy and [18F]florbetapir for the neuroscience sub-study Insight'46 of the Medical Research Council National Survey of Health and Development (Lane et al 2017). The epileptic patient was injected with 250 MBq of FDG one hour before the 15 min PET acquisition. The neuroscience volunteer was injected with 370 MBq of florbetapir and scanned dynamically for one hour, starting at the injection time. The last ten minutes were used as a measurement of amyloid deposition, which for this participant was negative.

Results for total variation
In this section we analyze the impact of various choices within SPDHG on its performance, from randomness over sampling to preconditioning. The test case is the total variation prior as defined in (9). Figure 3 shows the effect of randomness, where we compare the deterministic PDHG to SPDHG with uniform sampling and scalar step sizes (13) for two different numbers of subsets. The horizontal axis reflects the number of projections used by each algorithm, measured in epochs. We can easily see that both random variants are faster than the deterministic PDHG. Moreover, the randomized SPDHG becomes faster by choosing a larger number of subsets.

Sampling
The effect of different choices of sampling is shown in figure 4. We compare two different samplings: uniform sampling and balanced sampling. The uniform sampling chooses all indices i ∈ [n] with equal probability p_i = 1/n. In contrast, for balanced sampling we choose with uniform probability either data or prior. If we choose data, then we again select a subset randomly with uniform probability. Thus, the probability for each subset of the data to be selected is p_i = 1/(2m), i ∈ [m], and for the prior to be selected p_n = 1/2.
We make two observations. First, balanced sampling is always faster than uniform sampling. This shows the importance of updating the dual variable associated with the prior. Second, for either sampling, choosing a larger number of subsets again improves the performance. In addition to increasing the number of subsets, the sampling is also very important for the speed of the algorithm: 21 subsets with balanced sampling is faster than 100 subsets with uniform sampling.
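The two sampling distributions compared above can be written down directly (a sketch with our own convention that the last of the n = m + 1 blocks is the prior):

```python
import numpy as np

def uniform_probabilities(m):
    """Uniform sampling over all n = m + 1 blocks (m data subsets + prior)."""
    return np.full(m + 1, 1.0 / (m + 1))

def balanced_probabilities(m):
    """Balanced sampling: prior with prob 1/2, each data subset with 1/(2m)."""
    p = np.full(m + 1, 1.0 / (2 * m))
    p[-1] = 0.5
    return p
```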

Preconditioning
As shown in theorem 2, the step size parameters T and S_i can be chosen either as scalars (13) or as vectors (14); the latter can be seen as a form of preconditioning. Results are shown in figure 5, where we see that preconditioning may accelerate the convergence of either the deterministic PDHG or the randomized SPDHG. Moreover, combining randomization and preconditioning yields an even faster algorithm.

Figure 6. Qualitative results show that, in contrast to the deterministic PDHG, the proposed SPDHG+ (252 subsets) approximates the optimal solution well after only 10 epochs. The 'optimal' solution was computed with 5000 iterations of PDHG. [Panels: optimal TV-regularized solution; PDHG (10 epochs); SPDHG+ (10 epochs, proposed).]

Performance of proposed algorithm
Based on the previous three examples, we propose to combine randomization, balanced sampling and preconditioning, which we refer to as SPDHG+. Figure 6 shows the visual performance of PDHG and SPDHG+. In contrast to the deterministic PDHG, the proposed SPDHG+ yields a good approximation of the optimal solution after only 10 epochs.

Anisotropic total variation
Anisotropic total variation decouples the penalization of the derivatives. The mathematical model is similar to the isotropic TV model (9), the only difference being the norm in which the total variation is measured: f_n = α‖·‖_{1,1}. It can be seen in figure 7 for florbetapir that with randomization and preconditioning only a few epochs are needed to obtain a good approximation of the optimal solution.

Directional total variation
Anatomical information from a co-registered MRI is available on combined PET-MR scanners. The structural information of the anatomy can be utilized by the directional total variation prior, see Ehrhardt and Betcke (2016), Ehrhardt et al (2016), Schramm et al (2017) and Bungert et al (2018) for details. The mathematical model is similar to the total variation model (9), except for an additional matrix D; thus, the only difference is A_n = D∇. A numerical example is shown in figure 8 for the florbetapir data set.

Total generalized variation
More sophisticated regularization can be achieved by the total generalized variation (TGV) (Bredies et al 2010, Bredies and Holler 2015)

TGV_{α0,α1}(u) = inf_w { α0 ‖∇u − w‖_{2,1} + α1 ‖Ew‖_{2,1} },

which can balance first and second order regularization and achieves edge-preserved reconstruction while avoiding the stair-casing artifact. Here E is a symmetrized gradient operator, see Bredies et al (2010) and Bredies and Holler (2015) for more details. We can solve the TGV regularized PET reconstruction problem by solving problem (8) with the assignment x = (u, w) and f_{n−1} = α0 ‖·‖_{2,1}, f_n = α1 ‖·‖_{2,1}.

Figure 9. TGV regularized reconstruction for the FDG data. Only a few epochs are needed to approximate the optimal solution with randomization and preconditioning (the proposed SPDHG+). This is visible both for the actual images u and for the reconstructed vector field w. The 'optimal' solution was computed with 5000 iterations of PDHG. [Panels: optimal solution (u, w); PDHG, 10 epochs (u, w); SPDHG+, 10 epochs (u, w).]
The numerical results shown in figure 9 are in line with the previous findings, indicating that randomization and preconditioning can significantly speed up the reconstruction. In particular, we notice a significant increase in performance when increasing the number of subsets from 21 to 252.

Comparison of mathematical models
We conclude this section with a comparison of various methods on both data sets in figures 10 and 11. While we leave the detailed visual comparisons to the reader, we would like to note that all these images use the same number of projections and thus have essentially the same computational cost.

Discussion
The extensive numerical experiments all consistently confirm that randomization and preconditioning both speed up the reconstruction. These trends held irrespective of the data set and the chosen prior. The convergence speed in our work was measured against a solution of the underlying mathematical optimization model, approximated with far more iterations than would be feasible in routine clinical practice. This strategy was chosen intentionally as we did not want to target a specific clinical use case. After these successful initial trials, we will in the future collaborate with medical researchers and clinicians to focus on specific use cases, where each use case defines its own metric for the images we wish to reconstruct.
The focus of this contribution was on non-smooth priors like total variation and its descendants such as total generalized variation and directional total variation. However, as long as the proximal operators are simple to evaluate, the proposed randomized and preconditioned algorithm can be applied to any other model, too. It would be of interest to compare this algorithm to convergent subset-accelerated algorithms for smooth priors like BSREM (De Pierro and Yamagishi 2001, Ahn and Fessler 2003), TRIOT (Ahn et al 2006) and OS-SPS (Ahn and Fessler 2003).
We highlighted the improvements from choosing different distributions for subset selection by comparing 'uniform' and 'balanced' sampling. Further improvements are expected by optimizing the probability selection of this algorithm. This can either be an optimal distribution that is constant along the iterations or one that evolves over the course of the iterations. We will investigate this direction further in the future.

Figure 10. Comparison of several reconstruction approaches for the FDG data. All approaches have about the same computational cost (10 epochs). [Panels: ML; TV; aTV; TGV; dTV (using MRI); MRI structure for dTV.]
With the exception of figures 7 and 8, where 21, 100 and 252 subsets were similarly fast, more subsets always resulted in a faster algorithm. There are as yet no theoretical insights into how the speed depends on the subset selection and whether more subsets always result in a faster algorithm; however, our numerical evidence suggests that increasing the number of subsets never decreases the speed of the algorithm. This being said, due to the per-iteration computational costs, from a practical point of view there will be an optimal number of subsets that might depend on the prior and even on the data (e.g. number of counts) to be reconstructed. We would like to point out that figures 7 and 8 have in common that both used the same tracer, florbetapir. In future work we will study the tracer-dependence of the convergence speed in more detail.
Moreover, the algorithm does not exploit any special structure of our optimization problem like smoothness or strong convexity. It is likely that exploiting these properties will lead to additional speed-up. However, as these properties for the PET data term depend on the acquired data, it is unlikely that a straightforward approach will be sufficient and a tailored solution will be necessary.

Conclusion
We introduced a convergent subset accelerated algorithm for the reconstruction of PET images with nonsmooth priors. The algorithm was enhanced by data-dependent preconditioning. Our numerical results showed that using both randomized subset selection and preconditioning can dramatically speed up the convergence of an iterative reconstruction algorithm. It was observed that a computational effort similar to the current clinical standard OSEM was sufficient for many non-smooth priors, showing that these are now, for the first time, feasible to be used in daily clinical routine.
While these observations were consistent among two data sets with different tracers, more studies are needed to confirm the benefits of this reconstruction strategy. Overall, this algorithmic advancement has the potential to change the PET reconstruction landscape as advanced mathematical models can now be combined with efficient and convergent subset acceleration.

Figure 11. Comparison of several reconstruction approaches for the florbetapir data. All approaches have about the same computational cost (10 epochs). [Panels: ML; TV; aTV; TGV; dTV (using MRI); MRI structure for dTV.]