Equivariant neural networks for inverse problems

Abstract In recent years the use of convolutional layers to encode an inductive bias (translational equivariance) in neural networks has proven to be a very fruitful idea. The successes of this approach have motivated a line of research into incorporating other symmetries into deep learning methods, in the form of group equivariant convolutional neural networks. Much of this work has been focused on roto-translational symmetry of R d, but other examples are the scaling symmetry of R d and rotational symmetry of the sphere. In this work, we demonstrate that group equivariant convolutional operations can naturally be incorporated into learned reconstruction methods for inverse problems that are motivated by the variational regularisation approach. Indeed, if the regularisation functional is invariant under a group symmetry, the corresponding proximal operator will satisfy an equivariance property with respect to the same group symmetry. As a result of this observation, we design learned iterative methods in which the proximal operators are modelled as group equivariant convolutional neural networks. We use roto-translationally equivariant operations in the proposed methodology and apply it to the problems of low-dose computerised tomography reconstruction and subsampled magnetic resonance imaging reconstruction. The proposed methodology is demonstrated to improve the reconstruction quality of a learned reconstruction method with a little extra computational cost at training time but without any extra cost at test time.


Introduction
Deep learning has recently had a large impact on a wide variety of fields; research laboratories have published state-of-the-art results applying deep learning to sundry tasks such as playing Go [1], predicting protein structures [2] and generating natural language [3]. In particular, deep learning methods have also been developed to solve inverse problems, with some examples being [4,5,6]. In this work we investigate the use of equivariant neural networks for solving inverse imaging problems, i.e. inverse problems where the solution is an image. Convolutional neural networks (CNNs) [7] are a standard tool in deep learning methods for images. By learning convolutional filters, CNNs naturally encode translational symmetries of images: if τ h is a translation by h ∈ R d , and k, f are functions on R d , we formally have the following relation This allows learned feature detectors to detect features regardless of their position (though not their orientation or scale) in an image. In many cases it may be desirable for these learned feature detectors to also work when images are transformed under other group transformations, i.e. one may ask that a property such as Equation (1) holds for a more general group transformation than the group of translations {τ h |h ∈ R d }.
If natural symmetries of the problem are not built into the machine learning method and are not present in the training data, in the worst case, it can result in catastrophic failure as illustrated in Figure 1. To some extent, this problem is circumvented by augmenting the training data through suitable transformations, but it has been shown in classification and segmentation tasks that it is still beneficial to incorporate known symmetries directly into the architecture used, especially if the amount of training data is small [8,9,10]. Furthermore, training on augmented data is not enough to guarantee that the final model satisfies the desired symmetries. There has recently been a considerable amount of work in this direction, in the form of group equivariant CNNs. Most of the focus has been on roto-translational symmetries of images [11,12,8,10], though there is also some work on incorporating scaling symmetries [13,14] and even on equivariance to arbitrary Lie group symmetries [15].
As mentioned before, we will concern ourselves with solving inverse imaging problems: given measurements y that are related to an underlying ground truth image u through a model y = N(A(u)), with A the so-called forward operator and N a noise-generating process, the goal is to estimate the image u from the measurements y as well as possible. Typical examples of inverse imaging problems include the problem of recovering an image from its line integrals as in computerised tomography (CT) [16], or recovering an image from subsampled Fourier measurements as in magnetic resonance imaging (MRI) [17,18]. The solution of an inverse problem is often complicated by the presence of ill-posedness: a problem is said to be well-posed in the sense of Hadamard [19] if it satisfies a set of three conditions (existence of a solution, its uniqueness, and its continuous dependence on the measurements), and ill-posed if any of these conditions fail.
It is a natural idea to try to apply equivariant neural networks to solve inverse imaging problems: there is useful knowledge about the relationship between a ground truth image and its measurements in the form of A and the symmetries in both the measurement and image domain (the range and domain of A respectively). Furthermore, training data tends to be considerably less abundant in medical and scientific imaging than in the computer vision and image analysis tasks that are typical of the deep learning revolution, such as ImageNet classification [20]. This suggests that the lower sample complexity of equivariant neural networks (as compared to ordinary CNNs) may be harnessed in this setting with scarce data to learn better reconstruction methods. Finally, end users of the methods, e.g. medical practitioners, are often skeptical of 'black-box' methods and guarantees on the behaviour of the method, such as equivariance of the method to certain natural image transformations, may alleviate some of the concerns that they have.
We investigate the use of equivariant neural networks within the framework of learned iterative reconstruction methods [5,21], which constitute some of the most prototypical deep learning solutions to inverse problems. The designs of these methods are motivated by classical variational regularisation approaches [22], which propose to overcome the ill-posedness of an inverse problem by estimating its solution aŝ with d a measure of discrepancy motivated by our knowledge of the noise-generating process N and J is a regularisation functional incorporating prior knowledge of the true solution. Learned iterative reconstruction methods, also known as unrolled iterative methods, are designed by starting from a problem such as Problem (3), choosing an iterative optimisation method to solve it, truncating that method to a finite number of iterations, and finally replacing parts of it (e.g. the proximal operators) by neural networks. We will show that these neural networks can naturally be chosen to be equivariant neural networks, and that doing so gives improved performance over choosing them to be ordinary CNNs. More precisely, our contributions in this work are as follows: Our contributions. We show that invariance of a functional to a group symmetry implies that its proximal operator satisfies an equivariance property with respect to that group. This insight can be combined with the unrolled iterative method approach: it makes sense for a regularisation functional to be invariant to roto-translations if there is no prior knowledge on the orientation and position of structures in the images, in which case the corresponding proximal operators are roto-translationally equivariant. Motivated by these observations, we build learned iterative methods using rototranslationally equivariant building blocks. We show in a supervised learning setting that these methods outperform comparable methods that only use ordinary convolutions as building blocks, when applied to a low-dose CT reconstruction problem and a subsampled MRI reconstruction problem. This outperformance is manifested in two main ways: the equivariant method is better able to take advantage of small training sets than the ordinary one, and its performance is more robust to transformations that leave images in orientations not seen during training.

Notation and background on groups and representations
In this section, we give an overview of the main concepts regarding groups and representations that are required to follow the main text. By a group G, we mean a set equipped with an associative binary operation · : G × G → G (usually the dot is omitted in writing), furthermore containing a neutral element e, such that e · g = g · e = g for all g ∈ G and a unique inverse g −1 for each group element g, such that g · g −1 = g −1 · g = e. Given groups G and H, we say that a map φ : G → H is a group homomorphism if it respects the group structures: φ(g 1 g 2 ) = φ(g 1 )φ(g 2 ) for any g 1 , g 2 ∈ G.
Groups can be naturally used to describe symmetries of mathematical objects through the concept of group actions. Given a group G and set X, we say that G acts on X if there is a function T : G × X → X (the application of which we stylise as T g [x] for g ∈ G, x ∈ X) that obeys the group structure in the sense that and T e = id. That is, the group action can be thought of as a group homomorphism from G to the permutation group of X. If there is no ambiguity, the group action may just be written as T g [x] = g · x = gx. An important type of group actions is given by the group representations. If V is a vector space, we will denote by GL(V ) its general linear group, the group of invertible linear maps V → V , with the group operation given by composition. A representation ρ : G → GL(V ) of a group G which acts on V is a group homomorphism, and so corresponds to a linear group action T of G on for x ∈ V and g ∈ G. Given a vector space V , any group G has a representation on V given by ρ(g) = I, which is the so-called trivial representation. If V is additionally a Hilbert space, we will call ρ a unitary representation if ρ(g) is a unitary operator for each g ∈ G, i.e. ρ(g)x = x for all x ∈ V . Given a finite group G = {g 1 , . . . , g n }, we can define the so-called regular representation ρ of G on R n by where {e 1 , . . . , e n } is a basis of R n and k is such that g i g j = g k . With this representation, each ρ(g) is a permutation matrix, so ρ is a unitary representation if the basis {e 1 , . . . , e n } is orthonormal. In this work, the groups that we will consider take the form of a group of isometries on R d . These groups are represented by a semi-direct product G = R d H, where H is a subgroup of the orthogonal group O(d) of rotations and reflections: , which represents the set of pure rotations in O(d). Each element of the semi-direct product G can be identified with a unique pair (t, R) of t ∈ R d , the translation component, and R ∈ H, the rotation (and potentially reflection). The semi-direct product can naturally be encoded as a matrix using homogeneous coordinates so that the group product is given by a matrix product. G naturally acts on a point In the experiments that we consider later in this work, we will consider the case d = 2. In this case SO(2) has a simple description: We will identify the groups Z m of integers modulo m with the subgroup of SO(2) given by

Learnable equivariant maps
The concept of equivariance is well-suited to describing the group symmetries that a function might obey: Definition 1. Given a general group G, a function Φ : X → Y and group actions T X , T Y of G on X and Y, Φ will be called equivariant if it satisfies for all f ∈ X and g ∈ G.
Following the definition of equivariance, we see that equivariant functions have the convenient property that composing them results in an equivariant function, as long as the group actions on the inputs and outputs match in the appropriate way: Lemma 1. Suppose that G is a group that acts on sets X , Y and Z through T X , T Y and T Z . If Φ : X → Y and Ψ : Y → Z are equivariant, then so is Ψ • Φ : X → Z.
Based on this property it is clear that the standard approach to building neural networks (compose linear and nonlinear functions with learnable components in an alternating manner) can be used to build equivariant neural networks as long as linear and nonlinear functions with the desired equivariance can be constructed.
, and in a similar way on Y by T Y . Ordinary CNNs [7], with convolutional linear layers and pointwise nonlinear functions, are equivariant in this setting.
In this work, we will consider the group G = R d H for some subgroup H of O(d) (see Section 2 for some background), acting on vector-valued functions. To be more specific, we will let We define the group actions T X and T Y to be the induced representations, ρ X and ρ Y , of π X and π Y on X and Y respectively. In the setting that we are considering, these representations take a particularly simple form. As mentioned in Section 2, since we assume that G takes the semi-direct product form R d H, each group element g ∈ G can be uniquely thought of as a pair g = (t, R) for some t ∈ R d and R ∈ H. With this in mind, the representations ρ X and ρ Y can be written as follows for any f ∈ Z, x ∈ R d and t ∈ R d , R ∈ H: These representations have a natural interpretation: to apply a group element (t, R) to a vector-valued function, we must move the vectors, as in part (b) of Equation (6), and transform each vector accordingly, as in part (a) of Equation (6).

Equivariant linear operators.
It is well-established that equivariant linear operators are strongly connected to the concept of convolutions. Indeed, in a relatively general setting it has been shown that an integral operator is equivariant if and only if it is given by a convolution with an appropriately constrained kernel [23]. In the setting that we are considering, the more specific result in Proposition 1 can be derived, as done in [24,10] for the case d = 2 and [25] for the case d = 3.
Proposition 1. Suppose that Φ : X → Y is an operator given by integration against a continuous kernel K : Then the operator Φ is equivariant if and only if it is in fact given by a convolution satisfying an additional constraint: there is a continuous k : where k satisfies the additional condition The derivation of this result proceeds by writing out the definitions of equivariance and using the invariances of the Lebesgue measure. The equivariance of Φ implies that we must the following chain of equalities for any Here the tags above the equality signs correspond to the following justifications: (a) Since π Y is a group representation, π Y (R) is a linear map and commutes with the integral, (b) Φ is assumed to be equivariant, (c) We make the substitution y ← gy and note that the Lebesgue measure is invariant to G. Taking the left hand side and right hand side together, we find that and since this must hold for any f ∈ X = L 2 (R d , R d X ), we conclude by testing on sequences converging to Dirac delta functions that Specialising by setting R equal to the identity element, we see that or upon substituting x ← x + t, K(x, y) = K(x + t, y + t). Choosing t to be the translation that takes y to 0, we find that . Now specialising Equation (7) by letting R ∈ H and x ∈ R d be arbitrary and t, y = 0, we obtain the condition π Y (R)k(R −1 x) = k(x)π X (R), or upon substituting x ← Rx and rearranging, Conversely, the above reasoning can be reversed to show that the condition in Equa- The condition in Equation (8) is a linear constraint that is fully specified before training. Hence, if a basis is computed for the convolution kernels satisfying Equation (8), a general equivariant linear operator can be learned by learning its parameters in that basis. Since the choices of H that we consider are all compact groups, any representation of H can be decomposed as a direct sum of irreducible representations of H (Theorem 5.2 in [26]). As a result of this, we can give the following procedure to compute a basis for the convolution kernels satisfying the equivariance condition in Equation (8) as soon as π X and π Y are specified: constructs a block diagonal matrix with the diagonal elements given by the arguments supplied to diag).
• For each i, j with 1 i k X , 1 j k Y find a basis for the convolution kernels k i,j satisfying the equivariance condition with the irreducible representations π j Y and π i X . • Given expansions of the k i,j , compute the overall equivariant convolution kernel k by . This procedure has been described in more detail in [10] and implemented in the corresponding software package for the groups G = R 2 H, where H can be any subgroup of O(2).
Since the equivariant convolutions described above are implemented using ordinary convolutions, little extra computational effort required to use them compared to ordinary convolutions: during training, there is just an additional step of computing the basis expansion defining the equivariant convolution kernels (and backpropagating through it). When it is time to test the network, this step can be avoided by computing the basis expansion once and only saving the resulting convolution kernels, so that it is completely equivalent in terms of computational effort to using an ordinary CNN.

Equivariant nonlinearities.
Although pointwise nonlinearities are translationally equivariant, some more care is needed when designing nonlinearities that satisfy the equivariance condition in Equation (5) with our choices of groups. Examining the form of the induced representations in our setting, as given in Equation (6), it is evident that for a pointwise nonlinearity φ : This can be ensured if π X is the regular representation of H, since in that case each π X (h) is a permutation matrix, giving the following guideline: Another way to ensure that φ commutes with π X is by choosing the trivial representation. Although the trivial representation may not be very interesting by itself, this gives rise to another form of nonlinearity called the norm nonlinearity. If π X is a unitary representation, taking the pointwise norm satisfies an equivariance condition: The right-hand side transforms according to the trivial representation, so by the above comments we deduce that the nonlinearity f → φ( f ) satisfies an equivariance condition of the same form. To obtain the norm nonlinearity, which maps features of a given type to features of the same type, we then form the map Φ : where we used that φ( f (g −1 x) ) is a scalar. This shows that the norm nonlinearity Φ is indeed equivariant: Lemma 3. Suppose that π X is a unitary representation of H, and that φ : R → R is a given function. Then the norm nonlinearity Φ :

Reconstruction methods motivated by variational regularisation
We consider the inverse problem of estimating an image u from noisy measurements y. We will assume that knowledge of the measurement process is available in the form of the forward operator A, which maps an image to ideal, noiseless measurements, and generally there were will be a reasonable idea of the process by which they are corrupted to give rise to the noisy measurements y. A tried and tested approach to solving inverse problems is the variational regularisation approach [22,27]. In this approach, images are recovered from measurements by minimising a trade-off between the data fit and a penalty function encoding prior knowledge: with E a data discrepancy functional penalising mismatch of the estimated image and the measurements and J the penalty function. Usually E will take the form E(u) = d(A(u), y), where d is a measure of divergence chosen based on our knowledge of the noise process.
4.1. Equivariance in splitting methods. Generally, Problem (9) may be difficult to solve, and a lot of research has been done on methods to solve problems such as these. Iterative methods to solve it are often structured as splitting methods: the objective function is split into terms, and easier subproblems associated with each of these terms are solved in an alternating fashion to yield a solution to Problem (9) in the limit. A prototypical example of this is the proximal gradient method (also known as forward-backward splitting) [28,29], which has become a standard tool for solving linear inverse problems, particularly in the form of the FISTA algorithm [30]. In its basic form, the proximal gradient method performs the procedure described in Algorithm 1.

Algorithm 1 Proximal gradient method
Recall here that the proximal operator [31,32,33] prox J is defined as follows: Definition 2. Suppose that X is a Hilbert space and that J : X → R∪{+∞} is a lower semi-continuous convex proper functional. The proximal operator prox J : X → X is then defined as prox J (u) = argmin Although this definition of proximal operators assumes that the functional J is convex, this assumption is more stringent than is necessary to ensure that an operator defined by Equation (10) is well-defined and single-valued. One can point for example to the classes of µ-semi-convex functionals (i.e. the set of J, such that u → J(u)+ µ 2 u 2 is convex) on X for 0 < µ < 1, which include nonconvex functionals. In what follows, we will allow for such more general functionals by just asking that the proximal operator is well-defined and single-valued.
It is often reasonable to ask that the proximal operators prox τ J satisfy an equivariance property; if the corresponding regularisation functional J is invariant to a group symmetry, the proximal operator will be equivariant: Suppose that X is a Hilbert space and ρ is a unitary representation of a group G on X . If a functional J : X → R∪{+∞} is invariant, i.e. J(ρ(g)f ) = J(f ), and has a well-defined single-valued proximal operator prox J : X → X , then prox J is equivariant, in the sense that for all f ∈ X and g ∈ G.
Proof. We have the following chain of equalities: The three marked steps are justified as follows: (a) J is assumed to be invariant w.r.t. ρ, (b) The representation ρ is assumed to be unitary, (c) ρ(g) is invertible, and under the substitution h ← ρ(g)h, the minimiser transforms accordingly.
Example 2. As a prominent example of a regularisation functional satisfying the conditions of Proposition 2, consider the total variation functional [34] on L 2 (R d ) with the group G = SE(d) and the scalar field representation ρ(r) Since the Lebesgue measure is invariant to G and the set of vector fields {φ ∈ As a result of this, Proposition 2 tells us that prox τ TV is equivariant w.r.t. ρ for any τ 0. Note that TV is not unique in satisfying these conditions; by a similar argument it can be shown, for example, that the higher order total generalised variation functionals [35] share the same invariance property (and hence also that their proximal operators are equivariant).
Remark 1. The above example, and all other examples that we consider in this work, are concerned with the case where the image to be recovered is a scalar field. Note, however, that Proposition 2 is not limited to this type of field and that there are applications where it is natural to use more complicated representations ρ. A notable example is diffusion tensor MRI [36] in which case the image to be estimated is a diffusion tensor field and ρ should be chosen as the appropriate tensor representation. 4.1.1. Equivariance of the reconstruction operator. It is worth thinking about whether it is sensible to ask that the overall reconstruction method is equivariant, and how this should be interpreted. Thinking of the reconstruction operator as a map from measurements y to imagesû, it is hard to make sense of the statement that it is equivariant, since the measurement space generally does not share the symmetries of the image space (in the case where measurements may be incomplete). If we think instead of the reconstruction method as mapping a true image u to an estimated imageû through (noiseless) measurements y = A(u), we might ask that a symmetry transformation of u should correspond to the same symmetry transformation ofû. In the case of reconstruction by a variational regularisation method as in Problem (9), this is too much to ask for even if the regularisation functional is invariant, since information in the (incomplete) measurements can appear or disappear under symmetry transformations of the true image. An example of this phenomenon when solving an inpainting problem is shown in Figure 2.  Figure 2. An example demonstrating the non-equivariance of a general variational regularisation approach to image reconstruction, even when the corresponding regularisation functional J (as in Problem (9)) is invariant. Here, A represents the application of an inpainting mask, R is an operator rotating the image by 20 • and Φ is the solution map to Problem (9) with E(u) = Au − y 2 2 and J(u) = τ TV(u).

4.2.
Learned proximal gradient descent. A natural way to use knowledge of the forward model in a neural network approach to image reconstruction is in the form of unrolled iterative methods [5,21]. Starting from an iterative method to solve Problem (9), the method is truncated to a fixed number of iterations and some of the steps in the truncated algorithm are replaced by learnable parts. As noted in the previous section, the proximal gradient method in Algorithm 1 can be applied to a variational regularisation problem such as Problem (9). Motivated by this and the unrolled iterative method approach, we can study learned proximal gradient descent as in Algorithm 2 (where the variable s can be used as a memory state as is common in accelerated versions of the proximal gradient method [30]): Here prox i are neural networks, the architectures of which are chosen to model proximal operators. In this work, we choose prox i to be defined as where each of the K project,i , K intermediate,i and K lift,i are learnable affine operators and φ is an appropriate nonlinear function. We can appeal to Proposition 2 and model prox i as translationally equivariant (we will call the corresponding reconstruction method the ordinary method in what follows) or as roto-translationally equivariant (we will call the corresponding reconstruction method the equivariant method in what follows).  Since the implementations of the equivariant convolutions are ultimately based on ordinary convolutions, a natural comparison can be made between the equivariant and ordinary method by matching the widths of the underlying ordinary convolutions. When the methods are compared in this way, they should take comparable computational effort to use and the ordinary method is a superset of the equivariant method in the sense that the parameters of the ordinary method can be chosen to reproduce the action of the equivariant method.
Remark 2. Both in the case of Algorithm 1 and Algorithm 2, we require access to the gradient ∇E, where E is a data discrepancy functional. In our case, E always takes the form E(u) = d(A(u), y) where A is the forward operator and d is a measure of divergence. As a result of this E can be differentiated by the chain rule as long as we have access to the gradient of d and can compute vector-Jacobian products of A. If the forward operator A is linear, its vector-Jacobian products are just given by the action of the adjoint of A.

Experiments
In this section, we demonstrate that roto-translationally equivariant operations can be incorporated into a learned iterative reconstruction method such as Algorithm 2 to obtain higher quality reconstructions than those obtained using comparable reconstruction methods that only use translationally equivariant operations. We consider two different inverse problems: a subsampled MRI problem and a low-dose CT problem. The code that was used to produce the experimental results shown is freely available at https://github.com/fsherry/equivariant_image_recon.

Datasets.
5.1.1. LIDC-IDRI dataset. We use a selection of chest CT images of size 512 × 512 from the LIDC-IDRI dataset [37,38] for our CT experiments. As in Section 5.1.2, we screen the images to remove as many low-quality images as possible, The set is split into 5000 images that can be used for training, 200 images that can be used for validation and 1000 images that can be used for testing. For the experiments using this dataset, we use the ASTRA toolbox [39,40,41] to simulate a parallel beam ray transform R with 50 uniformly spaced views at angles between 0 and π. We simulate the measurements y as post-log data in a low-dose setting: Here N in = 10000 is the average number of photons per detector pixel (without attenuation), µ is a base attenuation coefficient connecting the volume geometry and attenuation strength, and η is a small constant to ensure that the argument of the logarithm is strictly positive, chosen as η = 10 −8 in our experiments. In these experiments, we will define the data discrepancy functional E as

FastMRI.
We use a selection of axial T1-weighted brain images of size 320 × 320 from the FastMRI dataset [42,43] for our MRI experiments. We use a combination of L 1 norm and the TV functional as a simple way to screen out low-quality images. The details of this procedure can be found in the code repository associated with this work. The set is split into 5000 images that can be used for training, 200 images that can be used for validation and 1000 images that can be used for testing. For the experiments using this dataset, we simulate the measurements using a discrete Fourier transform F and a variable density Cartesian line sampling pattern S (simulated using the software package associated with the work in [44] and shown in Figure 5): where ε is complex-valued white Gaussian noise. In this setting, a complex-valued image is modeled as a real image with two channels, one for the real part and the other for the imaginary part. The corresponding data discrepancy functional (E in Equation (9)) will be defined as Figure 5. The sampling mask S used in the MRI experiments, sampling 20.3% of k-space, and two samples of the images that were used to train the reconstruction operators in the MRI experiments, and the zero-filling reconstructions from the corresponding simulated k-space measurements.

Learning framework.
Although it is also possible to learn the parameters of the reconstruction methods in Algorithm 2 in an unsupervised learning setting, all experiments that we consider in this work can be classified as supervised learning experiments: given a finite training set {(u i , y i )} N i=1 of ground truth images u i and corresponding noisy measurements y i , we choose the parameters of Φ in Algorithm 2 by solving the empirical risk minimisation problem

Architectures and initialisations of the reconstruction networks.
To ensure fair comparisons between the various methods that we compare, we fix as many as possible of the aspects of the methods that are orthogonal to the point investigated in the experiments. To this end, every learned proximal gradient method has a depth of it = 8 iterations. Both for the CT and MRI experiment, the images being recovered are two-dimensional, so we use equivariant convolutions with respect to groups of the form R 2 Z m . Since the equivariant convolutions are implemented using ordinary convolutions, it is natural and straightforward to compare methods with the same width. The width of each network is the same (feature vectors that transform according to the regular representation take up |H| "ordinary" channels, and we fix the size of the product |H| · n channels = 96 where n channels is the number of such feature vectors in the intermediate part of prox i in Equation (11)). All convolution filters used are of size 3 × 3. We choose the initial reconstruction u 0 = 0 and use a memory variable s of five scalar channels wide in the learned proximal gradient method (Algorithm 2). Furthermore we ensure that the initialisation of both types of methods are comparable. Referring back to Equation (11), we choose to initialise K intermediate,i equal to zero and let K project,i and K lift,i be randomly initialised using the He initialisation method [45], as implemented in PyTorch [46] for ordinary convolutions and generalised to equivariant convolutions in [24] and implemented in the software package https://github.com/QUVA-Lab/e2cnn [10].

5.2.3.
Hyperparameters of the equivariant methods. In addition to the usual parameters of a convolutional neural network, the learned equivariant reconstruction methods have additional parameters related to the choice of the symmetry group its representations to use. In this work, we have chosen to work with groups of the form R 2 Z m , so a choice needs to be made which m ∈ N to consider.
In Figure 6, we see the result of training and validating learned equivariant reconstruction methods on the CT reconstruction problem, with various orders m of the group H = Z m . Each of the learned methods is trained on the same training set consisting of 100 images. The violin plots used give kernel density estimates of the distributions of the performance measures; for each one, we have omitted the top and bottom 5% of values so as not to be misled by outliers. Evidently, in this case, the groups of on-grid rotations significantly outperform the other choices, with m = 4 giving the best performance. Based on this result, all further experiments with the equivariant methods will use the group H = Z 4 .

5.2.4.
Training details. For both the equivariant and ordinary reconstruction methods, we train the methods using the Adam optimisation algorithm [47] with learning rate 10 −4 , β 1 = 0.9, β 2 = 0.999 and ε = 10 −8 . We use minibatches of size 1 and perform a total of 10 5 iterations of the Adam algorithm to train each method. Since we have chosen to use the finite group approach, with intermediate fields transforming according to their regular representation, we can use a pointwise nonlinearity for both the equivariant and ordinary reconstruction methods. In all experiments, we use the leaky ReLU function as the nonlinearity (φ in Equation (11)), applied pointwise: Each training run is performed on a computer with an Intel Xeon Gold 6140 CPU and a NVIDIA Tesla P100 GPU. Training the equivariant methods requires slightly more computational effort than the ordinary methods: to begin with, given the specification of the architecture, bases need to be computed for the equivariant convolution kernels (this takes negligible effort compared to the effort expended in training). Besides this, each training iteration requires the computation of the convolutional filter from its parameters and the basis functions and the backpropagation through this basis expansion. To give an example of the extra computational effort required, we have timed 100 training iterations for comparable equivariant and ordinary methods for the MRI reconstruction problem: this took 35.5 seconds for the ordinary method and 41.9 seconds for the equivariant method, an increase of 18%. Note that at test time, however, the ordinary and equivariant methods can be computed with the same effort. 5.3. CT experiment: varying the size of the training set. In this experiment, we study the effect of varying the size of the training set on the performance of the equivariant and ordinary methods. We consider a range of training set sizes, as shown in Figure 5.3, and test the learned reconstruction methods on images that were not seen during training time, both in the same orientation and randomly rotated images. The violin plots displayed have the same interpretation as those shown in Figure 6 and described in Section 5.2.3. From this comparison, we see that the equivariant method is able to better take advantage of smaller training sets than the ordinary method. Furthermore, we see that the equivariant method performs roughly equally well regardless of the orientation of the images, whereas the performance of the ordinary method drops when testing on rotated images. Figure 8 shows some examples of test reconstructions made with the methods learned on a training set of size N = 100. In these reconstructions, it can be seen that the equivariant method does better at removing streaking artefacts than the ordinary method. 5 Generalisation performance of the learned methods Figure 7. A comparison of the performance of equivariant and ordinary learned proximal gradient methods trained on training sets of various sizes for the CT reconstruction problem. The methods are tested on images that have not been seen during training time, both in the same orientations as were observed during training ("Upright test images") and rotated at random angles ("Rotated test images").
A notable difference with the CT reconstruction problem is that, as a result of the Cartesian line sampling pattern, the forward operator is now less compatible with the rotational symmetry. Regardless of this, we have seen in Section 4 that it is still sensible in this context to use equivariant neural networks in a method motivated by a splitting optimisation method. The performance differential between the equivariant and ordinary methods is more subtle than in the CT reconstruction problems. In Figure 9, we see that the equivariant method can again take better advantage of smaller training sets and is more robust to images dissimilar to those seen in training. Figure 10 shows examples of reconstructions made with the methods learned on a training set of size N = 50.

Conclusions and Discussion
In this work, we have shown that equivariant neural networks can be incorporated into learnable reconstruction methods for inverse problems, and that doing this in a principled way results in higher quality reconstructions with little extra effort compared to ordinary convolutional neural networks. Using roto-translationally equivariant neural networks as opposed to ordinary convolutional neural networks results in better performance when trained on smaller training sets and more robustness to rotations.  Figure 8. A random selection of test images corresponding to the plots shown in Figure 5.3, with a training set of size N = 100. On each reconstruction, the top number is its SSIM and the bottom number is its PSNR w.r.t. the ground truth. The images are clipped between upper and lower attenuation coefficient limits of -1024 HU and 1023 HU.
In Section 5.2.3, we saw that that the learned methods perform best when the group H is chosen to be a group of on-grid rotations. In theory, one would expect better performance with a larger number of rotations, but in practice there is the issue of how the equivariant kernels are discretised. Indeed, when solving the constraint for equivariance in Equation (8), the allowed kernels turn out to be circular harmonics multiplied by an arbitrary radial profile, and in practice we discretise these functions on 3 × 3 filters. An opportunity for future work on the use of equivariant neural networks can be found in how the combination of group and discretisation should be optimised.
All of the experiments shown in this work have dealt with two-dimensional images, but the methods described here can be applied equally well to three-dimensional images, as long as the two-dimensional equivariant convolutions are replaced by their threedimensional counterparts. The representation theory of SO(3) is more complicated than that of SO(2), but it is similarly possible to design roto-translationally equivariant convolutions in three-dimensional [25]. One potential application is mentioned in Remark 1: in diffusion tensor MRI, the domain is three-dimensional, with the additional challenge that the image that is to be recovered is a tensor field rather than a scalar field.
In the experiments that we demonstrated in this work, we focused on a single type of learned reconstruction operator, the learned proximal gradient method. In fact, the Generalisation performance of the learned methods Figure 9. A comparison of the performance of equivariant and ordinary learned proximal gradient methods trained on training sets of various sizes for the MRI reconstruction problem. The methods are tested on images that have not been seen during training time and that have been rotated at random angles. framework that we describe is not limited to this form of reconstruction algorithm. As an example of another type of learned reconstruction operator, consider the learned primal-dual method of [48]. A small corollary to Proposition 2 is that, when J is invariant and the Fenchel conjugate J * is well-defined, prox J * will be equivariant in the same way that prox J is. As a result, assuming reasonable invariance properties of a data discrepancy term, a learned primal-dual method can be considered where both the primal and dual proximal operators are modeled as appropriate equivariant neural networks.  Figure 10. A random selection of test images corresponding to the plots shown in Figure 9, with a training set of size N = 50. On each reconstruction, the top number is its SSIM and the bottom number is its PSNR w.r.t. the ground truth.
CE and CBS acknowledge support from the Wellcome Innovator Award RG98755. CBS acknowledges support from the Leverhulme Trust project on 'Breaking the non-convexity barrier', the Philip Leverhulme Prize, the EPSRC grants EP/S026045/1 and EP/T003553/1, the EPSRC Centre Nr. EP/N014588/1, European Union Horizon