Stochastic EM methods with Variance Reduction for Penalised PET Reconstructions

Expectation-maximization (EM) is a popular and well-established method for image reconstruction in positron emission tomography (PET) but it often suffers from slow convergence. Ordered subset EM (OSEM) is an effective reconstruction algorithm that provides significant acceleration during initial iterations, but it has been observed to enter a limit cycle. In this work, we investigate two classes of algorithms for accelerating OSEM based on variance reduction for penalised PET reconstructions. The first is a stochastic variance reduced EM algorithm, termed as SVREM, an extension of the classical EM to the stochastic context, by combining classical OSEM with insights from variance reduction techniques for gradient descent. The second views OSEM as a preconditioned stochastic gradient ascent, and applies variance reduction techniques, i.e., SAGA and SVRG, to estimate the update direction. We present several numerical experiments to illustrate the efficiency and accuracy of the approaches. The numerical results show that these approaches significantly outperform existing OSEM type methods for penalised PET reconstructions, and hold great potential.


Introduction
Positron emission tomography (PET) is a nuclear imaging technique that allows for the measurement of biochemical changes in the body by observing the spatial distribution of a radioactive tracer. Positron emitting radionuclides are attached to a biochemical compound to create a radioactive tracer, e.g. fluorodeoxyglucose, that is used in natural metabolic processes by an organ or tissue of interest. The radionuclides decay and the emitted positron travels a short distance before encountering and annihilating with an electron. This annihilation interaction results in a pair of 511 keV photons that travel anti-parallel. The emitted photons may be measured by a pair of detector elements along a ring of crystalline detectors surrounding the subject. If two photons are detected within a short coincidence timing window, a PET scanner will record a coincidence event along the line-of-response between the two measuring detectors. The goal of PET image reconstruction problem is then to reconstruct an estimate of the emission distribution from the measured coincidence data. This inverse problem is ill-posed in the sense of Hadamard, i.e., the solution to the problem is not stable with respect to the perturbation in the data.
Iterative methods have been widely used in PET reconstruction [35], amongst which the expectation maximisation (EM) algorithm and its various variants, e.g., MLEM [16], OSEM [24], RAMLA [4], BSREM [13,1] and OS-SPS [18], are predominant. Shepp and Vardi [38,39] reformulated the PET reconstruction problem into a maximum likelihood (ML) estimation of the tracer distribution, and developed an iterative scheme via EM algorithm (MLEM), which enjoys several desirable features, e.g., a closed-form for iterate updates and nonnegativity preservation. The EM algorithm consists of two steps: (i) the E-step computes the complete data sufficient statistic; and (ii) the M-step updates the estimate by maximising the complete data log-likelihood. The algorithm converges monotonically (in decreasing the objective) but slowly. Moreover, full batch updates (i.e. using all measured data to compute the sufficient statistic) can be costly for large data sizes. To mitigate the ill-posedness of the PET reconstruction problem, a suitable penalty is employed, leading to a maximum a posteriori (MAP) problem [25,23]. This requires adapting the standard EM algorithm, since a closed-form solution at the M-step is often no longer available [35, p. R561]. There are several approaches to address the challenge, such as the one-step-late algorithm [21] (applicable to differentiable penalty terms but generally not convergent to the MAP solution), or more principled methods via modified EM algorithms [12] or separable parabolic surrogates for the penalty [18].
One established procedure to mitigate the aforementioned computational challenge with full batch data is the ordered subset EM (OSEM) [24], which first divides the measured data into disjoint subsets and then applies the EM algorithm to one subset at each iteration, either in a cyclic or a stochastic manner [24]. This greatly reduces the cost per update, and leads to significant acceleration during initial iterations. However, the standard OSEM algorithm has been observed to often not converge but instead enters a limit cycle [5,13]. This has motivated intensive development of modified OSEM algorithms that retain both the speed-up in early iterations but also exhibit convergence to the MAP solution (e.g., by suitably adjusting the step-size schedule) [4,13,1].
EM and OSEM can also be written in a gradient ascent-like form, where the search direction is a preconditioned gradient of the objective [28]. This viewpoint enables designing a range of new methods that allow a general class of differentiable penalties. This idea has recently been experimentally studied for PET with iteration-dependent and constant preconditioners in [42] and [43], respectively, both for standard stochastic ascent approaches and for variance reduction methods. For nonsmooth convex penalties, e.g., total (generalised) variation [3], gradient approaches are no longer directly applicable, and one may resort to a saddle point reformulation, and then update primal and dual variables accordingly [8]. Chambolle et al. [7,17] developed a stochastic variant of such an algorithm using only one random component of the dual variable at each update, and provide a convergence guarantee. Alternatively, one may employ proximal methods to handle non-smooth penalties, and many variance reduction algorithms have been extended to the proximal setting [45].
In this work, we contribute to stochastic variance reduction algorithms for the MAP problem in PET, for a popular class of penalty terms, by drawing on recent advances in stochastic optimisation and machine learning. First, we develop a novel algorithm, termed as Stochastic Variance Reduced EM (SVREM), for the MAP reconstruction. It is motivated by the online-EM [6] (see also [31]) and its variance reduction variants [10,27], originally developed for an un-penalised problem. We present an extension to the MAP problem by combining variance reduced EM for computing a variance reduced running average of the sufficient statistics with the surrogate approach for the penalty [9]. The resulting SVREM algorithm maintains the EM nature for PET reconstruction, e.g., nonnegativity preservation, and admits an explicit maximiser at each M-step. The overall algorithm is mathematically principled and numerically easy to implement. Second, we revisit variance reduction algorithms for stochastic gradient ascent, and their use in iterative PET reconstruction, which were recently experimentally studied [42,43]. These algorithms do not belong to the EM family, but rather to the class of diagonally preconditioned gradient ascent algorithms. Due to the inclusion of the penalty, the non-negativity of the iterates is no longer ensured, which requires a projection step (i.e., the proximal map of the characteristic function on the set {f ≥ 0}). In Theorem 4.1, we show the almost sure convergence to a maximiser for a modified likelihood with a constant preconditioner. Third and last, we conduct extensive numerical experiments, which show that these algorithms enjoy steady convergence, significantly outperform classical OSEM type methods, and are very promising for PET reconstruction.
The rest of the paper is organised as follows. In Section 2, we describe the mathematical formulation of the maximum likelihood problem in PET, and expectation maximisation and its stochastic variants for ML estimation. Then in Section 3, we discuss the EM algorithm for MAP reconstruction using parabolic surrogates. In Section 4, we discuss a second class of numerical algorithms, i.e., variance reduction algorithms based on gradient ascent for the MAP problem. Last, in Section 5 we present numerical results that examine and illustrate features of these algorithms. Throughout, the notation c = denotes expressions that are equal up to an additive constant that is independent of the function's argument. The notation and denote entrywise division and multiplication of matrices or vectors. For any fixed M ∈ N, we denote [M ] to be the set [M ] = {1, . . . , M }. We use lowercase letters, e.g., g and f for column vectors, and upper case letters for matrices and operators. For a matrix A we let a m denote its m th row. The notation P ≥0 (x) denotes the coordinate-wise projection of x onto the non-negative plane, i.e., P ≥0 (x) = (max(0, x n )) N n=1 . The notation 1 is slightly abused for a constant vector of suitable size with all entries equal to one.

Expectation maximisation and its stochastic variants
In this section, we describe the ML PET problem, and the EM algorithm and its stochastic variants for finding ML solutions.

ML PET problem
First we recall the standard mathematical formulation of the PET reconstruction problem. Let M denote the number of detector bins, and g m the number of emissions detected in the m th bin, so that the measured data are g = (g 1 , . . . , g M ) ∈ R M . It is customary to approximate the measurement means of the unknown tracer distribution in the form of a linear problem E[g] = Af + w, where A ∈ R M ×N is the system matrix with non-negative elements, f ∈ R N is vector of voxel values, and w ≥ 0 represents the mean number of background events such as scatters, background radiation, and random coincidences, which will mostly not be explicitly written in the equations below. Emission measurements in the m th bin are modelled by the following Poisson model: Recall that a random variable g follows the Poisson distribution Poisson(λ) with a parameter λ > 0 if Assuming that detector bins record independent measurements, and conditioning on the tracer distribution f , it follows that the probability distribution function p(g|f ) of the emission measurements g is given by (2.1) The ML estimator f ml of f is computed by maximising the likelihood p(g|f ) in (2.1), or equivalently its logarithm. Omitting terms independent of f , this yields the following objective 2) The ML estimator f ml is then defined as The functional L(f ) is concave on the space of all admissible tracer distributions (f ≥ 0), but a direct solution via Karush-Kuhn-Tucker conditions is intractable, and instead iterative approaches are commonly used, which we discuss in more detail in Section 2.2 below. Next we introduce the concept of ordered subsets. Consider a partition S = {S 1 , . . . , S Ns } of the set [M ], i.e. a collection of (sub)sets such that ∅ = S t ⊂ [M ]; S t 1 ∩ S t 2 = ∅ for t 1 = t 2 ; and ∪ Ns t=1 S t = [M ]. For a vector v and a matrix A, we denote by v t and A t the subvector of length |S t | and an |S t | × N submatrix whose entries, respectively row indices, belong to S t . Accordingly, given the partition S, we can subdivide the log likelihood L(f ) into For many algorithms the partition S needs to be carefully constructed, in order to optimise the quality of the reconstructions [24]. Moreover, subsets should be balanced so that emission probabilities m∈St a mn are nearly independent of the subset index t [5], which is what we also adhere to. It is thus traditionally recommended that iterating over subsets should follow an order such that projections corresponding to next subset are as "perpendicular" as possible to previously used ones [22].

ML expectation maximisation
EM is the most well-known example of an iterative, functional substitution scheme for PET reconstruction. It solves the ML problem (2.3) by replacing the objective (2.2) through a complete data framework. We follow the complete data framework due to Shepp and Vardi [38]. Let G ∈ R M ×N and G t ∈ R |St|×N denote the full and subset complete data matrices, respectively, with entries g mn that denote the number of emissions detected in bin m that originated from voxel site n. Since Consider now the conditional expectation (2.5) M step. Maximise the expectation by When N s > 1, the above algorithm is referred to as OSEM. The standard EM algorithm uses N s = 1 subset and obeys the update rule From preceding equations it follows that EM and OSEM preserve nonnegativity of the updates.
The above framework falls under the umbrella of function substitution-type methods [30], which at each step replace the original objective function with a surrogate. Recall that a function Φ is said to be a surrogate of a concave objective Φ if it satisfies the following properties: These defining properties ensure that maximising the surrogate Φ monotonically increases the value of the objective Φ, thereby guaranteeing the convergence of the objective value. It can be shown that [29,39,11,30].

Stochastic Expectation Maximisation
The EM algorithm represents a powerful and versatile approach for inference and estimation involving distributions whose complete data likelihood belongs to the exponential family, i.e., , .
Then we can rewrite MLEM as where is the full sufficient statistic. Physically, s(G) ∈ R N represents the unknown emission quantities: the n th entry of the full sufficient statistics s(G) is M m=1 g mn , the total number of emissions from the n th voxel. Then we can interpret the E step in (2.5) as computing either the full expected statistic or its subset variant. A direct way to randomise OSEM is through a random sampling of the subsets. This can be achieved by resampling the subset at each iteration, or choosing only the subset index at random (for a fixed partition S). We employ the latter strategy since in practice it shows superior performance [43].
There have been several recent studies [46,6,10,27] that randomise the classical EM algorithm differently and show excellent performance on a range of problems, e.g. Gaussian mixtures, natural language processing and hidden Markov models. One notable class of these algorithms start from the expression (2.7) and instead of computing the full expected statistics s(f (k) ) or the corresponding subset statistics τ t k (f (k) ) defined below, at each iteration use an exponentially running approximation s (k) . In the M-step we then compute To compute the estimate s (k) , which approximates s(f (k) ), the common practice is to view the full statistics as an average of N s subset statistics with the subset statistics τ t given by Both s(G) and τ t are both of the size of f . Then at each iteration, we (randomly) select an index n to update the estimate s (k) (to s(f (k) ). This idea was initially proposed to derive an online EM algorithm [31,6] for handling streaming data, in order to approximate the conditional statistics by exponentially moving averages as the data streams in. The resulting updates are akin to subset gradient updates. Specifically, for an initial guess s (0) and a subobjective index t k , it can be written as where {α k } is a decaying stepsize schedule, and the index t k is drawn uniformly at random. This algorithm is termed as Stochastic Expectation Maximisation (SEM) below. Similar to classical stochastic gradient descent algorithms [2], the convergence of the resulting sequence of iterates f (k) sem is highly dependent of the variance of the estimated statistics τ t k (f (k) ) (compared with s(f (k) ) of the standard MLEM algorithm). Thus, the convergence guarantee requires a decaying stepsize schedule.
To reduce the variance of the gradient estimate (for stochastic optimisation) and to allow a constant stepsize, in recent years, several variance reduction techniques have been developed in the machine learning community, e.g., SAG [36], SAGA [15], SVRG [26], and SARAH [32]; see [20] for an up-to-date overview. These techniques reduce the variance of the gradient by including in the search direction an average of the full gradient, which is updated either according to a predefined update schedule, or per-iteration. For Online-EM, they can be used to reduce the variations of the sufficient statistics estimate. A variant of online EM, inspired by SVRG, was developed in [10], which uses an anchor point f anc and a full but infrequently updated estimate of s(f ) at f anc : If k mod ηN s = 0, set f anc = f (k) svrem and update s anc = s(f anc ), The full expectation s anc and the anchor estimate f anc , are updated once every η ∈ N epochs, where one epoch refers to every N s iterations. It is worth noting that for both online-EM and its variance reduced variants, the resulting updates still follow the EM paradigm, where the M-step uses the computed running estimate of the sufficient statistic.
Remark 2.1. Naturally all other variance reduction techniques can be applied to improve SEM. For example an algorithm based on SAGA reads [27] We shall not examine these variants, and focus only on SVREM.
The methodology described above naturally extends to the ordered subset framework for the PET problem (2.2). Indeed, with the background events included, we can obtain For any estimatorŝ (k) of the full statistic s(f (k) ), for the PET ML problem (2.3), the maximisation step in equation (2.8) still admits a unique solution and can be computed as (2.6).
Now we briefly comment on the convergence of SVREM (2.10). The convergence result in [10] requires the subset statistics τ t (f ) to be Lipschitz continuous, which holds only for nonzero backgrounds w i . This condition arises also for standard MLEM [1]. There are two common remedies, a practical and a theoretical one. The former is to set the pixel value to 0 whenever the denominator in (2.4) or (2.6) is equal to zero. The latter is to modify the likelihood term, using a quadratic approximation near the origin [1]. Specifically, let ϕ m ( ) = g m log( ) − , and we definê with the constant ε > 0. If ε is chosen to be sufficiently small, then the solution set does not change [1].

Stochastic variance reduced EM (SVREM) for penalised PET reconstruction
To address the inherent ill-posed nature of the ML problem (2.3), one popular approach is variational regularisation, which introduces a convex penalty R(f ) [25]. This can often be interpreted as a maximum a posteriori (MAP) estimation and the corresponding estimator f map is given by A common type of penalties used in PET reconstruction take the form where w nj ≥ 0 are weights, N n is the n th voxels neighbourhood, and ρ is a potential function.
The penalty R(f ) is used to promote the desired image structure, which are often meant to be locally smooth but still preserve edge phenomena. Thus ρ should smooth within a given tissue or organ, while retaining sharp boundaries between different tissues. Most smooth potentials are thus monotonic, non-decreasing functions of the intensity difference |f n − f j | that are roughly quadratic near the origin and linear away from the origin, and satisfy the following assumption.
f is assumed to be non-increasing for f ≥ 0, and such that lim f 0 γ ρ (f ) is finite and non-zero.
A list of commonly used penalty terms satisfying these properties is given in Table 1. Note that this does not cover the relative difference penalty [33]. For the Huber, log cosh and hyperbola penalties, the parameter δ > 0 controls the transition between the quadratic (smooth) and linear (edge-preserving) regimes of the given penalty term.

approximates TV
Note that the general principle for solving the corresponding MAP problem (3.1) by EM does not change. That is, for MLEM, OSEM, SVREM, and SEM, instead of (2.8), we compute If the penalty R(f ) is separable (i.e., no coupling between the entries, which for example is the case for the quadratic prior), the objective function in (3.2) is separable and the M-step has a closed-form solution. However, this is not the case for most penalties of interest in PET reconstruction and computing the maximiser requires solving a coupled system of equations. Thus, (3.2) is often maximised iteratively [35].
To explicitly solve the M-step, we employ a separable surrogate of the penalty R(f ). Surrogates have been widely used in penalised PET reconstruction [12,13,19,18]. The idea is to construct a surrogate for the potential ρ, either to facilitate the computation of the prior or to improve conditioning (and convergence). We employ the parabolic surrogate defined in [9]. Namely, consider the surrogate for the potential ρ given by and define the surrogate penalty by Then for n = j the n th and the j th entry are decoupled since the partial derivative The n th partial derivative for the surrogate objective where s (k) is an estimator of the expected statistic, and d nj : Thus, we arrive at a quadratic equation, with a unique nonnegative solution which can be easily evaluated at each iteration. Provided that b > 0 it is given as where we note that the discriminant is non-negative due to Assumption 3.1, and when b = 0, i.e. when there is no prior, equation (3.3) is linear, and the solution is given as f = −c/a.

Stochastic EM algorithm based on gradient ascent
In this section, we describe a second class of algorithms for problem (3.1). It is inspired by the following additive formulation of the EM update (2.6) [28]: This can be interpreted as a preconditioned form of gradient ascent. Then it is natural to replace the gradient of the likelihood L(f ) with that of Φ(f ). This strategy is directly amenable to stochastic iterative methods, which are very appealing due to their low cost per iteration, flexibility with the penalty (i.e., there is no need for surrogates but only the gradient of the penalty) and the convergence acceleration during early iterations. Specifically, stochastic gradient ascent (SGA) like methods can be written as where the random (index) variable ξ k may depend on f (k) , and h k (f (k) , ξ k ) is the search direction (i.e., a preconditioned version of the gradient of L(f (k) ). Then OSEM can be written in the additive formulation (4.1) with the search direction h k (f (k) , t k ) given by This can interpreted as a diagonally preconditioned gradient ascent with respect to the subobjective L t k . These discussions naturally motivate the following algorithmic developments for the PET MAP problem (3.1). For a given subset index t, we denote Then the additive formulation can be extended to MAP estimation, which leads to SGA updates where the index t k is selected uniformly at random, α k = 1, and the preconditioner d t (f ) is given by An update of this type is a standard extension of OSEM algorithms to the MAP problem. However, it does not always perform the maximisation of the given objective at each step. Note that by including a penalty, the non-negativity of the updates is generally not preserved, and an additional projection step by P ≥0 is applied at each iteration. According to the theory for stochastic gradient ascent, the iteration (4.2) generally does not converge to the MAP solution, unless a decaying step-size schedule is employed [13]. This is attributed to the stochasticity of the gradient estimate, and the variance of the estimated ascent direction can significantly slow down the convergence when the iterates approach the maximiser. One powerful idea to reduce the variance of the gradient estimate in SGA is variance reduction. For the constrained MAP problem, we use proximal versions of SAGA and SVRG [45] to enforce the iterate feasiblity by projection P ≥0 . Both SAG and SAGA keep a running table of computed gradients of the subobjectives, and then efficiently estimate the full gradient. By rescaling the full gradient, SAGA employs unbiased estimates, whereas the SAG estimate is biased (and thus often harder to analyze). More precisely, SAGA estimates of the full gradient are given by SVRG is an unbiased variance reduction method that has inspired SVREM, and mirrors the convergence rate performance of SAG and SAGA, but does not require maintaining a running list of gradients. Setting f anc = f (0) , andq = 1 Ns ∇L(f anc ) − β Ns ∇R(f anc ) the algorithm follows If k mod ηN s = 0 set the anchor estimate f anc = f (k) svrg and updateq = 1 N s ∇Φ(f anc ).

(4.4)
A value of the full gradient update frequency η between 2 and 5 is recommended [26]. Several remarks are in order. First, note that the methods described in Sections 2 and 3 aim at explicitly computing the maximiser at each step. However, update equations (4.3), and (4.4) are rather derived by analogy with the additive formulation (4.1), and thus mathematically less principled. Second, provided that the subobjectives Φ t are L-Lipschitz, SAG and SAGA converge (sub-linearly in expectation) to the minimiser for the fixed stepsize α = (16L) −1 [36]. This result does not apply to the PET problem (3.1) since the subobjectives L t are not Lipschitz in a neighbourhood of 0, and since the algorithms use iteration-dependent preconditioners. Third, as observed in [1], preconditioned gradient ascent based algorithms (which are in PET literature sometimes called diagonally-scaled incremental gradient methods) do not always converge to the MAP solution when using iteration-dependent preconditioners. Indeed, assuming f (k) → f , that ∇Φ(f ) is continuous, and that preconditioner functions d t (f ) ≥ 0 are continuous such that d t (f ) = 0 for f = 0, the issue seems to persist for the proposed stochastic algorithms. The latter assumption is implicitly satisfied by the EM preconditioner. Assuming lim k→∞ α k = 0 and ∞ k=1 α k = ∞, then the convergence of the SGA algorithm implies Thus, unless all the preconditioners d t are the same, this identity generally is different from the true optimality condition ∇Φ(f ) = 0 (for unconstrained optimisation). Note that an analogous analysis holds for stochastic estimators in expectation, provided the corresponding search direction h k (f (k) , ξ k ) is unbiased and consistent. A simple remedy is to freeze the preconditioner, d t (f (k) ) = d, where d is constant and given as d = d(f (0) ). In practice, gradient-based variance reduction methods, with either iteration dependent or constant preconditioners, perform well for the PET problem [42,43].
The convergence of non-preconditioned SAGA and SVRG has mostly been studied for strongly convex objectives. Namely, SAGA enjoys linear convergence for strongly convex problems [15], and O(1/k) convergence of the average iterate for convex problems [14, Theorem 4.8], whereas SVRG converges at a linear rate for strongly convex problems (evaluated at the anchor point) [26]. These assumptions are not satisfied by the PET problem, neither for standard nor modified likelihood (2.11), which is only strictly convex. Nonetheless, combining the arguments in [34] and [1] allows establishing almost sure convergence of both SAGA and SVRG for a modified likelihood term and a constant preconditioner. In practice, this holds since the background events are nonzero, for which there is no need to modify the likelihood. (2.11). Moreover, let all entries of d ∈ R N be positive, denote by L = max t∈Ns L t the largest of the Lipschitz constant of sub-objective gradients Φ t (f ) and by d max = d ∞ the largest entry of d, and assume argmax f ≥0 Φ(f ) = ∅.
Then taking α = 1 3Ld 1/2 max Note that Theorem 4.1 provides conditions for convergence but does not quantify the speed of convergence. This is typical for non-strongly concave problems for most variance reduction techniques. We can draw an analogy with the results for strongly-concave problems which suggest that the speed of convergence depends on the ratio between the Lipschitz constant and the stepsize. In case of preconditioned SVRG and SAGA the effective stepsize is determined not only by α but also by the preconditioner d, as indicated by the bounds on the stepsize in Theorem 4.1.

Numerical experiments and discussions
This section presents numerical results for the two classes of stochastic methods for the MAP problem (3.1) with the log cosh penalty, which is often employed for PET reconstruction. We present two examples: a brain phantom, and a torso XCAT. In these examples we examine the performance of SVREM with SVRG and SAGA. SVRG and SAGA are employed with a constant preconditioner d = d(f (0) ), which provides consistently better performance than the iteration dependent counterpart d(f (k) ). The results for SAG and SARAH are nearly identical with those for SAGA and SVRG, respectively, and thus are not included.

Brain phantom
In this experiment, we take a single slice (of size 114 × 114) from a brain phantom, available at https://github.com/casperdcl/brainweb. The forward map is taken to be the Radon transform using 180 projection angles with a 1 angle separation. The sinogram data was binned into subsets using geometric projections of the scanner. We consider N s = 15, 30, and 45 subsets and accordingly, each subset consists of 12, 6, and 4 views, respectively. In the reconstruction, we use the log cosh penalty with δ = 0.01 and a regularisation parameter β = 60, which are determined in a trial-and-error manner, and conduct the experiments in MATLAB R2019b. The accuracy of a reconstruction f is measured by the relative error ∆(f ) = f − f / f , where f is the reference solution, computed using LBFGS-B (available at https://www.mathworks.com/matlabcentral/fileexchange/ 35104-lbfgsb-l-bfgs-b-mex-wrapper, retrieved on April 11, 2021). All algorithms are initialised with one epoch of OSEM. Following [42], for SAGA and SVRG, we employ a constant stepsize α = 2, and update SVRG with the full gradient every η = 2 epochs, whereas for SVREM, we choose α = 0.7 and η = 1. We consider three OSEM type methods without variance reduction, i.e., BSREM, SGA, and SEM, which are representative within PET reconstruction, as baselines. More precisely, SGA is the PET reconstruction method, described in (4.2), which for the full objective Φ reads . BSREM [1] is an iterative scheme based on OSEM that uses a decaying stepsize schedule, i.e., f bsrem ). Note that standard BSREM uses a subset-independent preconditioner. We use its subsetdependent variant, replacing A 1 with A t k 1 since it exhibits a faster, yet steady, convergence in the studied setting. In the experiments, both SEM (cf. (2.9)) and BSREM use the stepsize schedule α k = (0.001k + 1) −1 , which is sufficient to ensure their convergence. Since SVRG and SVREM require the full gradient once every η epochs, to make a fair comparison of overall computational cost, we count epochs in terms of the number of subsets that are used at each iteration. Thus, every ηN s updates of SVRG and SVREM are counted as η + 1 epochs, and η epochs of SAGA.
In Fig. 1, we show the comparative results for all methods on N s = 30 subsets of the data. It is clearly observed that SVREM exhibits the fastest convergence among all the methods, which is also corroborated by the pixel-wise errors in Fig. 2: the SVREM reconstruction agrees nearly perfectly with the reference solution, whereas the pixelwise errors of SAGA and SVRG reconstructions still clearly exhibit structures, especially edges. Moreover, SAGA is slightly slower than SVRG (in terms of the running time), and both algorithms would benefit from a larger stepsize, though setting it too large impacts the overall convergence. We will explore the stepsize issue for SAGA and SVRG in Section 5.2. Just as expected, all variance reduction methods outperform OSEM, BSREM, and SEM, and are orders of magnitude faster, especially for high-accuracy solutions, showing clearly the beneficial effect of variance reduction for accelerating OSEM type algorithms. Although not presented, one can observe similar behavior for other subset numbers. Fig. 3 studies the convergence behavior of SVREM with respect to three important algorithmic parameters, i.e., update frequency η (of full gradient updates), the stepsize α, and number of subsets N s . Generally, all these parameters greatly impact the performance of SVREM, and they should be tuned simultaneously to achieve optimal convergence behavior. It is observed that increasing the frequency η provides some acceleration in initial epochs but a too large η can impair the asymptotic convergence, even with a tuned stepsize α. The stepsize α has a similar influence as the frequency η: a larger stepsize α gives faster initial acceleration, but too large a value may prevent the algorithm from converging to the MAP maximiser. Lastly, a larger number N s of subsets (with suitably tuned stepsizes) tends to provide faster initial convergence. This observation shows clearly the importance of proper partition of the subsets.

Torso phantom
To evaluate the algorithms in a more realistic PET setting, in this experiment we use a PET scan of a torso, obtained as an XCAT simulated phantom [37] with 2 rings and 280 projection angles. The sinogram data thus consists of 280 views, and was binned into 20, 40, and 70 subsets using geometric projections of the scanner, so that each subset consists of 14, 7, and 4 views, respectively. In the experiment, SVRG and SVREM update the full gradient and full expected statistic, once every η = 5 and η = 3 epochs, respectively, and all algorithms are initialised with one epoch of OSEM. The reconstruction was carried out using the wellestablished Software for Tomographic Image Reconstruction [40], via a python environment, available at https://github.com/UCL/STIR. We use the log cosh penalty with δ = 1 and fix β = 0.0001. Since a reference solution is unavailable, we evaluate the accuracy by the objective value.
The results for 200 epochs of SAGA, SVRG and SVREM are shown in Fig. 4. Unlike the brain phantom, the behaviour with respect to η is more stable, and choosing a larger η provides a better per-epoch comparison with SAGA. One interesting observation is about the feasible stepsize regime for SVREM. The work [10] suggests α ∈ (0, 1), with an upper bound depending on the Lipschitz constant of subset statistics (and also the number of subsets). This behaviour was also observed in Fig. 3(a). In contrast, for the torso phantom, the admissible stepsize seem to be (0, 2], which corresponds to over-relaxation, a well-known practice in iterative linear solvers [44] (see [41] for an application in EM algorithm). In Fig. 4, we use α = 2.0 for 20 and 40 subsets, and α = 1.4 for 70 subsets. In this challenging setting, SAGA, SVRG and SVREM show comparable overall performance, cf. Fig. 4. Nonetheless, SVREM consistently outperforms SVRG, especially when the number N s of subsets is small, and SAGA, whereas SAGA occasionally exhibits an undesirable stochastic behaviour during initial iterations (most notably in the case of 70 subsets), before stabilising after roughly 70 epochs. The latter can be remedied by choosing a smaller stepsize α, which of course can adversely affect the overall convergence behavior. These results suggest that SVREM provides a greater benefit for a smaller number N s of subsets, and as the number N s of subsets increases the algorithms become comparable.
In Fig. 5 we study the convergence behaviour over a longer epoch horizon, using 40 subsets and 1000 epochs, and examine how do SAGA, SVRG, and SVREM depend on the stepsize α. The numerical results show that SVREM with α = 1.0 (and also for smaller values) can eventually outperform both SVRG and SAGA, and catches up to SVREM with a larger stepsize, confirming the asymptotic convergence. Meanwhile the results for SVRG, and particularly SAGA, suggest that a smaller stepsize α is needed to ensure their convergence, since otherwise the iterations enter a limit cycle. Thus, reducing the stepsize has a conflicting effect: it slows down the speed of convergence during initial iterations, but it provides better asymptotic behaviour. For a more effective profile of the overall convergence, a dynamically variable stepsize schedule emerges as a natural choice. To gain further insights, in Fig. 6 we show pixel-wise differences for SVREM with 40 subsets. The results show that the background and the smooth parts of the image are mostly resolved after 50 epochs, whereas the edges are more challenging to resolve and are still improving after 50 epochs.

Conclusions
In this work we revisited two classes of stochastic EM algorithms based on variance reductions from the machine learning community for the penalised PET reconstruction, one from the perspective of EM algorithms and the other from the perspective of preconditioned stochastic gradient ascent. Both classes of algorithms are straightforward to implement, and  are applicable to a wide range of penalty terms. The numerical results indicate that these algorithms can effectively and efficiently accelerate the convergence of classical OSEM type algorithms, and hold significant potentials for the MAP problem in PET reconstructions. This is particularly true for SVREM, which enjoys steady convergence towards the maximising solution, sometimes even with over-relaxation. These promising empirical results naturally motivate further experimental evaluations on patient data and a theoretical analysis, to investigate the interplay between the number of subsets and the stepsize schedule (and also the "full-gradient" update frequency for SVRG), the role of over-relaxation in SVREM, and to precisely characterise their influence on the convergence speed.

B The differences between SVREM and SVRG for ML problems
It is tempting to think that SVRG and SVREM are equivalent in a certain context. We compare the two algorithms for the ML (i.e., unpenalised) problem, which represents the simplest case. Assume that both algorithms employ the same anchored estimate f anc , the full gradient, and the subset index t. For SVREM, we first update the estimator s (k) of the full statistic by s (k+1) = (1 − α) s k + α τ t k ( f (k) svrem ) − τ t k ( f anc ) + t(f anc ) , and then compute the maximising solution f where the search direction h (k) is given by Meanwhile, for SVRG updates with variable preconditioner, we have Thus, the two algorithms substantially differ even if SVRG uses a iteration dependent preconditioner d(f (k) svrg ) = f (k) svrg A 1 (or indeed even a subset dependent preconditioner). There are several notable differences. First, SVREM uses different preconditioners for each term, that is, the gradient terms containing f anc are preconditioned with d(f anc ), and d(f (k) ), for SVREM and SVRG, respectively. Second, the search direction for SVREM involves a term akin to momentum.