Nucleon matrix elements from lattice QCD with all-mode-averaging and a domain-decomposed solver: an exploratory study

We study the performance of all-mode-averaging (AMA) when used in conjunction with a locally deflated SAP-preconditioned solver, determining how to optimize the local block sizes and number of deflation fields in order to minimize the computational cost for a given level of overall statistical accuracy. We find that AMA enables a reduction of the statistical error on nucleon charges by a factor of around two at the same cost when compared to the standard method. As a demonstration, we compute the axial, scalar and tensor charges of the nucleon in $N_f=2$ lattice QCD with non-perturbatively O(a)-improved Wilson quarks, using O(10,000) measurements to pursue the signal out to source-sink separations of $t_s\sim 1.5$ fm. Our results suggest that the axial charge is suffering from a significant amount (5-10%) of excited-state contamination at source-sink separations of up to $t_s\sim 1.2$ fm, whereas the excited-state contamination in the scalar and tensor charges seems to be small.


Introduction
Recent developments in the field of numerical algorithms and computer hardware have made it possible to perform simulations of lattice QCD with dynamical light quarks at physical pion masses, enabling reliable first-principles determinations of hadronic and nuclear properties.
Indeed, with current technology the lattice computation of the light hadron spectrum, including not only the ground states, but also resonances, has become a routine task, which can be accomplished with great precision [1][2][3][4][5][6]. On the other hand, attempts at achieving a similar precision for the prediction of nucleon structure observables from lattice QCD [7][8][9][10][11][12][13][14][15][16] are confronted with a dilemma arising from the simultaneous problems of excited-state contamination at short, and deteriorating signal-to-noise ratio at large Euclidean times: while excited states are exponentially suppressed at large time separations, the signal-to-noise ratio likewise decays exponentially with time, and increases only with the square root of statistically independent measurements. As a result, controlling both statistical and systematic errors for nucleonic observables becomes difficult, in particular for structure observables, where both the time separation between the nucleon source and sink, and those between the source or sink and the operator insertion of interest, need to be made large to suppress excited-state contaminations.
It is therefore perhaps not surprising that the results for nucleon structure observables which have been obtained by different lattice collaborations currently show large discrepancies between the different groups (for details, cf. e.g. [17][18][19][20] and references therein). The problem is particularly acute for the axial charge of the nucleon, which is both of fundamental importance for testing the limits of the Standard Model and well known experimentally, but for which lattice results differ amongst themselves by several standard deviations and show a discrepancy of about 10% from the experimental value when extrapolated to the physical point.
To address this problem, lattice studies of nucleon structure with high statistical accuracy at large Euclidean time separations are needed. Keeping the computational cost manageable requires efficient techniques of variance reduction. The present study aims to reduce the statistical noise by using the recently proposed technique of all-mode-averaging (AMA) [21][22][23]. AMA is able to achieve a significant reduction in statistical error at moderate cost by combining multiple cheap low-precision calculations of the quark propagator with an appropriate bias correction. This makes AMA particularly attractive for approaching the physical light quark mass, since the cost of computing quark propagators grows in inverse proportion to the quark mass.
In this paper, we study the efficiency of AMA when combined with the highly efficient locally deflated SAP-preconditioned GCR solver. In an extension of our previous studies [13,24,25], we apply AMA to the calculation of the axial charge of the nucleon from three-point functions with large source-sink separations of around 1.5 fm and above on large lattices satisfying m π L > 4 with N f = 2 flavours of dynamical quarks. In addition, we also determine the scalar and tensor charges of the nucleon on the same configurations.
We find that for the axial charge as extracted from ratios of correlation functions, large source-sink separations are required to reliably suppress excited-state contaminations, which become visible only at high enough statistics, and that values close to the experimental one are obtained from the largest source-sink separations studied. The summation method [9,13,25] is able to extract the asymptotic behaviour already from moderate source-sink separations, but still profits greatly from having precise measurements at large separations.
This paper is organized as follows: In section 2, we explain the numerical methods, including how to properly define AMA when using the Schwarz alternating procedure (SAP) and local deflation with the GCR solver. We also define the ratios of three- and two-point functions that we use to extract the axial, scalar and tensor charges. In section 3, we study the performance of AMA and consider how to tune the solver parameters. In sections 4 and 5, we present a first analysis of the nucleon two-point and three-point functions, respectively, using AMA with parameters tuned as presented in section 3. In the last section, we summarize our results and discuss directions for further improvement and future study.
2 Numerical method

All-mode-averaging
The all-mode-averaging (AMA) estimator [21][22][23] for an observable O can be defined as

O^AMA = O^(rest) + (1/N_G) Σ_{g∈G} O^(appx)_g ,    O^(rest) = O − O^(appx) ,

where G is a set of N_G symmetry transformations of the lattice (in our case translations of the source position), O^(appx)_g denotes a low-precision ("sloppy") evaluation of O at the source position translated by g, and the bias correction O^(rest) is computed from an exact and a sloppy evaluation at the same source position. To improve the accuracy with which the bias correction O^(rest) is estimated, it may also be averaged over an orbit O_f under a subset of size N_org ≪ N_G of G. In this way, it becomes possible to reuse existing exact evaluations of O using different source positions to enhance the statistical accuracy without having to recalculate the exact evaluations.
The idea behind AMA is that, as long as O (appx) is an appropriate observable in the sense of having a strong correlation with the original observable O, the statistical accuracy of O AMA evaluated on N conf gauge configurations should be similar to that of O on N meas = N G × N conf configurations, while the cost of evaluating O (appx) is much lower than that of evaluating O. We should therefore expect to be able to achieve a much-increased statistical accuracy for the same effort, or conversely to have to pay only a reduced price for achieving a desired statistical error.
More specifically, the ratio between the standard deviations of O^AMA and O is given by [23]

σ_AMA / σ ≈ √( 2∆r/N_org + 1/N_G + R ) ,    (4)

where ∆O = O − ⟨O⟩, and the standard deviation is given by σ = √⟨(∆O)²⟩. The deviation from the ideal error-scaling behaviour σ_AMA/σ ∼ 1/√N_G is parameterised by two quantities: ∆r represents the degree of disagreement between O and O^(appx) by tracking the amount by which the statistical fluctuations of O^(appx) fail to track those of O, while R represents the amount by which using N_G approximate measurements O^(appx)_g falls short of providing N_G statistically independent measurements by tracking the degree to which the individual approximations O^(appx)_g are correlated amongst themselves. Note that we ignore the correlation between the exact measurements O_f because these are generally sufficiently few in number (N_org ≪ N_G) that it is always possible to choose the spatial separations between the sources large enough to render these correlations negligible.
To reduce the error on O^AMA as far as possible, it is therefore desirable to achieve both ∆r ≈ 0, indicating close tracking of the exact by the approximate measurements, and R ≈ 0, indicating nearly-independent measurements from different source locations. Achieving the latter primarily relies on a suitable choice of translations g by large enough distances, whereas the former has to be achieved by ensuring that the parameters of the truncated solver used are suitably tuned.
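As a numerical illustration of these ideas, the following toy sketch (hypothetical numbers, numpy only; not our production measurement code) generates correlated "exact" and "sloppy" measurements and checks that the AMA average is unbiased while its statistical error shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
n_conf, n_g = 2000, 16

# Toy model: per-configuration mean and per-source-position fluctuations.
c = rng.normal(1.0, 0.3, size=n_conf)            # gauge-field fluctuation
s = rng.normal(0.0, 1.0, size=(n_conf, n_g))     # source-position noise
eps = rng.normal(0.0, 0.02, size=(n_conf, n_g))  # sloppy-solver error

O_exact = c + s[:, 0]                 # exact solve at the origin only
O_appx = c[:, None] + s + eps         # sloppy solves at all N_G sources

# AMA estimator: bias correction at the origin plus average of sloppy solves
O_rest = O_exact - O_appx[:, 0]
O_ama = O_rest + O_appx.mean(axis=1)

def err(x):   # naive standard error over configurations
    return x.std(ddof=1) / np.sqrt(len(x))

print(O_exact.mean(), err(O_exact))   # standard estimate
print(O_ama.mean(), err(O_ama))       # AMA: same central value, smaller error
```

The sloppy solves at different sources supply nearly independent samples of the per-configuration fluctuation, while the bias correction removes the (small) sloppy-solver error, so the AMA mean agrees with the standard one within a visibly reduced uncertainty.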

AMA with a locally deflated SAP+GCR solver
So far, AMA has been mostly used with relatively inefficient solvers such as CG. It is therefore worthwhile to study whether the significant benefits reported in that context [26,27] carry over to the case of a more efficient solver, such as Lüscher's locally deflated SAP-preconditioned GCR solver [28,29] used in the DD-HMC [30][31][32], MP-HMC [33], and openQCD [34,35] codes.
Here we recall the basic features of the Schwarz Alternating Procedure (SAP), as discussed in refs. [28,29]. When applied to the Dirac equation, the SAP is a "divide and conquer" strategy which starts by decomposing the lattice into two non-overlapping domains Ω and Ω* consisting of blocks arranged in checkerboard fashion. The SAP then visits the blocks in turn, updating the field on each block to the solution of the Dirac equation with Dirichlet boundary conditions given by the field on the neighbouring blocks. Due to the checkerboard structure of the block decomposition, this can be parallelised by first visiting all black blocks and then all white blocks, with all blocks of one colour updated simultaneously.
Denoting the points of Ω and Ω* that have neighbours in Ω* and Ω, respectively, by ∂Ω* and ∂Ω, the Dirac operator can be decomposed into a sum

D = D_Ω + D_Ω* + D_∂Ω + D_∂Ω* ,

where D_Ω acts only on the field at points x ∈ Ω with all terms involving fields in Ω* set to zero, D_∂Ω contains the terms through which points in Ω receive contributions from Ω*, and so forth. A complete cycle of the SAP can then be written as the update ψ → ψ + K(η − Dψ), where the smoothing operator K consists of the exact block solves on first the black and then the white blocks. After n_cy SAP cycles starting from ψ = 0, this corresponds to approximating the inverse of the Dirac operator by the polynomial

M_SAP = K Σ_{ν=0}^{n_cy−1} (1 − DK)^ν ,

and this is used as a preconditioner by solving the right-preconditioned equation D M_SAP φ = η using the generalised conjugate residual (GCR) algorithm and setting ψ = M_SAP φ at the end [28]. The tunable parameters of the SAP preconditioner are therefore the block size and the number n_cy of SAP cycles.
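The SAP cycle described above can be sketched in a few lines for a toy one-dimensional analogue, where a mass term plus nearest-neighbour hopping stands in for the Wilson-Dirac operator; the lattice size, block size, and cycle count are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
L, block = 32, 4                       # toy 1-D "lattice" and block size

# Toy Dirac-like operator: mass term plus nearest-neighbour hopping
D = 2.5 * np.eye(L) - np.eye(L, k=1) - np.eye(L, k=-1)
D[0, -1] = D[-1, 0] = -1.0             # periodic boundary conditions

blocks = [np.arange(b, b + block) for b in range(0, L, block)]
black = blocks[0::2]                   # checkerboard colouring of blocks
white = blocks[1::2]

def sap_cycle(psi, eta):
    """One SAP cycle: exact solves on black blocks, then white blocks,
    with Dirichlet data taken from the current neighbouring fields."""
    for colour in (black, white):
        for idx in colour:
            r = eta - D @ psi                        # current residual
            psi[idx] += np.linalg.solve(D[np.ix_(idx, idx)], r[idx])
    return psi

eta = rng.normal(size=L)
psi = np.zeros(L)
res = [np.linalg.norm(eta - D @ psi)]
for _ in range(5):                     # n_cy = 5, as used in the text
    psi = sap_cycle(psi, eta)
    res.append(np.linalg.norm(eta - D @ psi))

print(res)    # residual norm shrinks cycle by cycle
```

Each block update only needs the field on the block and its neighbours, which is what makes the scheme cheap and parallelisable; the residual after n_cy cycles reflects the quality of the polynomial approximation M_SAP to D^{-1}.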
To further accelerate the GCR solver, deflation may be used as a means of improving the condition number of the Dirac operator by separating the high and low eigenmodes for separate treatment. If the deflation fields {φ_k}_{k≤N} span a subspace (the deflation subspace) containing good approximations to the low eigenmodes of the Dirac operator, an oblique projector to the orthogonal complement of the deflation subspace is given by

P_L ψ = ψ − Σ_{k,l=1}^{N} Dφ_k (A^{−1})_{kl} (φ_l, ψ) ,

where

A_{kl} = (φ_k, D φ_l)

is called the little Dirac operator. The Dirac equation Dψ = η can then be split into a low-mode and a high-mode part by left-projecting with 1 − P_L and P_L, respectively. The low-mode part can be solved in terms of the little Dirac operator, so that the solution is given by

ψ = Σ_{k,l=1}^{N} φ_k (A^{−1})_{kl} (φ_l, η) + χ ,

where the high-mode part χ = P_R ψ satisfies Dχ = P_L η, with P_L D = D P_R [29].
Combining the block-decomposition approach with deflation leads to the construction of a deflation subspace of dimension N = N_s N_b, where N_b is the number of blocks, and each deflation field has support only on a single block. In practice, such deflation fields are obtained by restricting a set of N_s global deflation fields to each block and orthonormalising the resulting fields using the Gram-Schmidt procedure. To ensure that the deflation space approximates the low eigenspaces of the Dirac operator efficiently, the global fields should ideally be good approximations to the low modes. Such approximations can be obtained using a few rounds of the inverse iteration φ_l → D^{-1} φ_l starting from random fields. Since only approximate low modes are needed, the exact inverse of the Dirac operator is not required, and the SAP approximation M_SAP can be used instead [29]. Note that by using the block decomposition, we are effectively able to obtain N deflation vectors for the price of computing only N_s ≪ N approximate eigenmodes. The tunable parameters of the deflation procedure are then the block size and the number N_s of global deflation fields.
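The deflation setup can likewise be illustrated on a small dense toy matrix: the projector identities and the split of the solution into a little-operator part and a high-mode part can be verified numerically. Here the deflation fields are built by exact inverse iteration, standing in for the M_SAP-based approximation used in practice, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 24, 4                               # system size, deflation fields

# Toy non-normal "Dirac" matrix with well-separated small eigenvalues
D = rng.normal(size=(n, n)) / np.sqrt(n) + 1.5 * np.eye(n)

# Deflation fields: inverse iteration on random starts (cf. the text)
phi = rng.normal(size=(n, N))
for _ in range(3):
    phi = np.linalg.solve(D, phi)          # exact inverse in this toy case
phi, _ = np.linalg.qr(phi)                 # orthonormalise (Gram-Schmidt)

Alittle = phi.T @ D @ phi                  # little Dirac operator A_kl
Ainv = np.linalg.inv(Alittle)

P_L = np.eye(n) - D @ phi @ Ainv @ phi.T
P_R = np.eye(n) - phi @ Ainv @ phi.T @ D

# Oblique projector identity and the intertwining relation P_L D = D P_R
assert np.allclose(P_L @ P_L, P_L)
assert np.allclose(P_L @ D, D @ P_R)

# Deflated solve: low-mode part from the little operator plus high modes
eta = rng.normal(size=n)
chi = np.linalg.solve(D, P_L @ eta)        # in practice: a Krylov solver
psi = phi @ Ainv @ (phi.T @ eta) + chi
assert np.allclose(D @ psi, eta)
print("deflated solve reproduces D psi = eta")
```

The point of the construction is that the Krylov solver only ever sees the right-hand side P_L η, from which the troublesome low modes have been projected out, while the low-mode part of the solution costs only a small N×N solve with the little operator.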
For a more efficient implementation, mixed-precision calculations can be used; since the SAP preconditioner need not be very precise, single-precision arithmetic suffices in this case. In the GCR algorithm, likewise, some operations can be carried out in single precision [28].
In line with the setup used in the generation of the Monte Carlo ensembles, we keep n_cy = 5 fixed in our setup. The remaining algorithmic parameters that control the quality of the sloppy solves, as measured by ∆r in Eq. (4), are then the iteration number N_iter of the GCR algorithm, the block size (which for simplicity we take to be the same for the SAP preconditioner and the deflation procedure), and the number N_s of global deflation fields. Tuning these parameters requires one to make a trade-off between the quality of the AMA approximation and the gain in performance from using the sloppy solver. In the case of the iteration number this is obvious, while in the case of the deflation parameters the trade-off comes from the increase of the size (and hence condition number) of the little Dirac operator with N_s and the number of blocks. We will study the dependence of the overall runtime on these parameters in section 3 below.
A notable feature of the block-decomposition technique is that for translations within the same block the translation invariance of the approximate solution O (appx) may be broken due to the limited precision of the sloppy solver in conjunction with the Dirichlet boundary conditions imposed in each SAP cycle. To preserve the translational invariance of O (appx) g under all transformations g ∈ G, we use only shifts that map the domains Ω and Ω * onto themselves. As M SAP is invariant under such translations, these shifts are not affected by broken translation invariance.

Computation of nucleon charges
In this paper we concentrate on the application of AMA to computing the axial (g_A), scalar (g_S) and tensor (g_T) charges of the nucleon in lattice QCD. These observables can be extracted from suitably renormalised ratios of two- and three-point functions involving the operators of the axial current A_µ = ψ̄γ_µγ_5ψ, the scalar density S = ψ̄ψ, and the tensor current T_µν = ψ̄σ_µνψ, respectively.
For the nucleon, we use the interpolating field

N_α(x) = ε_abc ( u_a^T(x) C γ_5 d_b(x) ) u_{c,α}(x) ,

where C is the charge conjugation matrix, α is the Dirac spinor index, and a, b, c are the color indices of the quark fields. (In the following we will omit the spin indices from our notation.) In order to increase the overlap between the nucleon ground state and the state created by applying the interpolating operator to the vacuum, we apply Gaussian smearing [9] (with APE-smeared [36] gauge links in the Laplacian) at both source and sink. The smearing parameters used are the same as in [25,37].
Using the spin-projection matrices P^+ = (1/2)(1 + γ_0) and P^+_53 = P^+ γ_5 γ_3, we evaluate the charges of the nucleon by computing ratios of three- and two-point functions of the form

R_O^bare(t, t_s) = C_3^O(t, t_s) / C_2(t_s) ,

where t_s is the source-sink separation and t the operator insertion time, which yield the (bare) charges g_A^bare, g_S^bare, and g_T^bare, respectively, at asymptotically large time separations,

lim_{t_s, t_s−t → ∞} R_O^bare(t, t_s) = g_O^bare .

At finite time separations the ratio R_O^bare differs from its asymptotic value g_O^bare by time-dependent contributions from excited states with the same quantum numbers as the ground state. We will discuss in section 5 how to best suppress these contributions, and how the use of AMA can help to obtain a good signal even at relatively large time separations.
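A toy spectral model (with invented masses and couplings, not fitted values) makes the approach of such a ratio to its asymptotic value explicit; the midpoint value R(t_s/2, t_s) creeps towards the target charge as t_s grows:

```python
import numpy as np

# Hypothetical spectral parameters in lattice units (a = 1)
m, dm = 0.5, 0.4           # ground-state mass and gap to the excited state
g, c, b = 1.2, -0.3, 0.25  # target charge and excited-state couplings

def C2(t):                 # two-point function with one excited state
    return np.exp(-m * t) * (1.0 + b * np.exp(-dm * t))

def C3(t, ts):             # three-point function, operator inserted at t
    excited = c * (np.exp(-dm * t) + np.exp(-dm * (ts - t)))
    return np.exp(-m * ts) * (g + excited)

def R(t, ts):              # the ratio R_O(t, t_s) = C3 / C2
    return C3(t, ts) / C2(ts)

for ts in (8, 16, 24):
    print(ts, R(ts // 2, ts))   # midpoint value approaches g = 1.2
```

In the midpoint value the leading contamination scales as exp(−∆ t_s/2), which is the quantitative reason why large source-sink separations are needed before the ratio plateaus at the true charge.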
To further improve the statistical quality of the signal for the ratio R bare O , we average over the forward-and backward-propagating nucleon, constructing the ratio separately for each direction in order to take optimal advantage of correlations between the two-and three-point functions.
Obtaining the three-point function for both directions requires the computation of sequential propagators with sink positions at both t s and T − t s , whereas the two-point function for both directions can be obtained from a single inversion by using opposite parity projections for the forward and backward directions.
We note that, since the charges are defined at zero momentum transfer and we use the spatial components of the axial vector and tensor currents, no additional operators are needed to realise O(a) improvement. To obtain the renormalised charges g_O = Z_O g_O^bare, we therefore only need to include the appropriate renormalisation constants, which should ideally be chosen such that O(a) improvement is realised. We use the non-perturbative determination of Z_A computed in the Schrödinger functional [38] together with the perturbative mass correction b_A from [39], while for Z_S and Z_T we take the non-perturbative evaluations in the MS scheme at a renormalisation scale of µ = 2 GeV obtained using the RI-MOM scheme [15]. Since the values of the bare coupling used in [15] are slightly different from the ones used by us, we determine the values of Z_S and Z_T to use by linear interpolation and extrapolation in 1/β, ignoring the unknown mass correction terms. In summary, our renormalised ratios are related to the bare ones by

R_O = Z_O R_O^bare .

Table 2 shows the values of the renormalisation constants used in this paper.

Tests and performance of AMA
In order to achieve the greatest possible gain in statistical accuracy for fixed computational effort, we need to appropriately tune the parameters of the sloppy solver.

Covariance test
Before discussing how to tune the parameters of AMA for use with the deflated SAP-preconditioned GCR solver, we first need to check that the covariant symmetry of the approximation is preserved in the presence of the domain decomposition underlying the SAP preconditioner. The amount by which the assumed covariance is violated can be parameterized by [23]

δ_c = ( O^(appx)_g[U^ḡ] − O^(appx)[U] ) / O^(appx)[U] ,

i.e. the relative difference between evaluating the approximation at the original source position on the original gauge field and evaluating the approximation based on the transformation g on the appropriately transformed gauge field; if covariance were exact, we would find δ_c = 0. For a numerical test, we use the F7 ensemble from Table 1, choosing a domain size of 6^4 for the SAP preconditioner. In line with the arguments laid out in Section 2.2, the shift vectors need to be such that each of the SAP subdomains is mapped onto itself; we therefore choose the shifted source location to be g = (6, 6, 0, 0), with the corresponding shift of the gauge field given by ḡ = (−6, −6, 0, 0). Since the point of this comparison is to check that the effects of the block decomposition in the SAP are under control, we did not use deflation for this test. For the approximation, we therefore had to choose a fixed iteration count of N_iter = 30 for the GCR algorithm, which in our example corresponds to a residual norm of order 10^{-2}. The resulting covariance violation |δ_c| for the nucleon two-point function as a function of the source-sink separation is shown in Figure 1. At large time separations t ≳ 30a, the accumulation of roundoff errors in the approximation leads to a non-negligible covariance violation |δ_c| ∼ 10^{-5} to 10^{-3} (which however is still less than the statistical errors in this time region). On the other hand, the covariance assumption is well justified in the typical signal region t ≲ 25a, where |δ_c| ≲ 10^{-6}.
We may therefore conclude that with this set of parameters, the systematic error arising from violating the covariance assumption is negligible for practical purposes.

Correlation between original and approximation
Given that the statistical error of the AMA result depends crucially on the discrepancy ∆r between the fluctuations of the original observable and its approximate evaluation, our tuning of the solver parameters will have to be guided by considering the parameter dependence of ∆r.
The left panel of Figure 2 shows 2∆r for the nucleon two-point function as a function of source-sink separation using both N_s = 40 and N_s = 60 deflation vectors. For N_s = 40, an increase in ∆r can be seen at larger time separations 15 ≤ t/a ≤ 25, where it becomes large enough to no longer be negligible compared to 1/N_G for N_G = 128. For N_s = 60, on the other hand, we find ∆r ≲ 10^{-3} all the way out to t/a = 25 (corresponding to a separation of ∼ 1.5 fm). The optimal choice of the parameters for the approximation will therefore in general depend on the maximal time separation one is interested in. The right panel of Figure 2 shows ∆r for the three-point function appearing in the numerator of the ratio R_A as a function of the operator insertion time t for a range of different source-sink separations t_s. For N_s = 40, ∆r is seen to increase with t_s, becoming comparable to 1/N_G for N_G = 128 at t_s/a = 20. Since ∆r can be reduced by increasing N_org as per Eq. (4), we use N_org > 1 for t_s/a = 20. For N_s = 60, we see that ∆r is sufficiently reduced to be negligible even at t_s/a = 24, albeit at the expense of a 1.6-fold increase in computing time (cf. Figure 3), indicating that the trade-off between computational cost and achievable statistical accuracy is a crucial consideration in tuning the parameters of the approximation used in AMA.

Performance of AMA for different approximation parameters
In Figure 3, we show a comparison of the overall performance of two different approximations (using N_s = 40 and N_s = 60 deflation vectors, respectively) relative to the exact evaluation (with N_s = 40 deflation vectors), using the G8 ensemble as a test case. For the approximation with N_s = 40, the time required for inverting the Dirac operator is reduced by a factor of 5, whereas for N_s = 60, the reduction is only by a factor of 3, due to the larger size of the little Dirac operator. Taking into account the fact that generating a larger number of deflation fields is also more costly, and including the fixed costs of source and sink smearing and contractions, the total time for an approximate evaluation using N_s = 40 is about 30% of the exact calculation, whereas for N_s = 60 it is about 50%.

Error scaling and computational cost
Finally, we consider how the overall statistical error scales as a function of the total CPU time used when varying the total number of measurements via N_G, N_conf, or both. Figure 4 shows the relative error of the ratio R_A(t, t_s), as measured on the F7 ensemble using t_s/a = 13 and t/a = 7, plotted against the total CPU time used for AMA with N_G = 8, 16, 32, and 64, with different numbers of configurations used. The relative error and CPU time required when not using AMA and performing a single exact evaluation on each configuration instead is also shown for comparison.
Since increasing N G leads to an increase in the correlation R between the different approximations (because the source positions g will have to be taken closer together when using more approximate solves), the error scaling between different values of N G is not perfect. In our test case, using N G = 8 requires about half the computational cost at the same relative error as the larger values of N G considered, whereas the larger N G values exhibited roughly identical error scaling.
Comparing AMA to the conventional method, a reduction in error by a factor of about 1.5−3 can be achieved at constant computational cost by using AMA. This is one of the main findings of this paper concerning the performance and effectiveness of AMA in conjunction with highly efficient solvers.

Tuning of AMA parameters
For the remaining calculations, we have tuned the parameters of the deflated SAP-preconditioned GCR solver so as to achieve the maximum reduction in computational cost while keeping ∆r sufficiently suppressed to enable good error scaling. Table 3 shows the resulting tuned parameter values.
We use fixed numbers N iter of GCR iterations for the point-to-all propagator and the sequential propagator, as shown in Table 3; using a sloppier propagator between the sink and operator insertion point does not lead to an increase of ∆r, and thus enables us to reduce the computational cost for the three-point function. We note that the iteration count of N iter = 3 corresponds to a similar residual as N iter = 30 in the undeflated test of section 3.1.

AMA study of excited-state contamination in nucleon two-point functions
The nucleon two-point correlation function may be approximated as

C_2(t) ≈ Z_N e^{−m_N t} + Z'_N e^{−m'_N t} ,

with masses m_N and m'_N and overlap factors Z_N and Z'_N for the ground state and first excited state, respectively. In order to study the nucleon mass, we utilised single- and double-exponential fits to the correlation function, i.e. fits retaining the first, or both, of the terms above. For the fitting, we used χ² minimisation and found that the nucleon mass could be determined reliably using a single-exponential fit performed in the interval 1.0 fm < t < 1.5 fm (confirmed by the double-exponential fits), whereas in order to incorporate the excited states fitted in the double-exponential fits, an earlier fitting interval was required, starting at t ≈ 0.5 fm. This is highlighted by the plots in Figure 5, where the effective mass is used to monitor the region of ground-state dominance and of excited-state contamination; it also indicates the effectiveness of the exponential fits.
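The distinction between the two fit regions can be illustrated with a synthetic two-state correlator (invented masses and overlaps, lattice units), monitoring the effective mass m_eff(t) = log[C(t)/C(t+a)]:

```python
import numpy as np

# Toy two-state correlator with hypothetical parameters
mN, mNp = 0.44, 0.90           # ground and excited "nucleon" masses
ZN, ZNp = 1.0, 0.6             # overlap factors
t = np.arange(1, 30)
C = ZN * np.exp(-mN * t) + ZNp * np.exp(-mNp * t)

m_eff = np.log(C[:-1] / C[1:])   # effective mass from neighbouring times

print(m_eff[2], m_eff[20])       # early: contaminated; late: plateau at mN
```

At early times the effective mass lies above m_N by an amount controlled by Z'_N/Z_N and the gap m'_N − m_N, which is the information a double-exponential fit starting near t ≈ 0.5 fm would pick up; at late times it plateaus at m_N, the region used for the single-exponential fits.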
The results for the single- and double-exponential fits to the correlation function for each of the ensembles are given in Table 4, four of which are displayed in Figure 6. For the subsequent analysis, we took the single-exponential results for the extracted ground-state nucleon mass. Whilst we see a small discrepancy between this value and the one obtained from double-exponential fits, we find that, overall, the double-exponential fits largely confirm that the single-exponential fits lie in the ground-state region. The observed discrepancy within the double-exponential fits (dependent on the fitting interval) is, in part, due to contamination by higher excited states, which are not accounted for by the double-exponential fitting ansatz.
For reference, we also show our previous results from [40] in Table 4 and note that for three ensembles (labelled E5, N6 and G8) we observe a discrepancy between the new single-exponential results and our previous ones. This is due to the increased statistics on these ensembles, which allows us to better identify the contamination from excited states and hence provide a revised determination of the nucleon mass using later fitting intervals.

Figure 6: Results for single- (red points) and double-exponential fits to the nucleon two-point function for the β = 5.3 gauge ensembles. For the double-exponential fits, the blue and green points indicate the ground and excited state masses respectively, for a number of different fitting intervals (indicated in Table 4).

AMA study of excited-state contamination in nucleon three-point functions
The ratios R_A, R_S and R_T suffer from excited-state contamination at finite t and t_s, behaving as

R_O(t, t_s) ≈ g_O^bare + c_O ( e^{−∆t} + e^{−∆(t_s−t)} ) + ... ,

with the target observable g_O, the mass difference ∆ between the ground state and the first excited state, and an unknown coefficient c_O. For the subsequent analysis, we used both the plateau method, which assumes ground-state dominance around the middle of each t_s dataset, and the summation method [9,13,25].

Plateau method
In Figure 7 we show g_A obtained from plateau fits to the middle 4 points (3 for N6) of the ratio R_A(t, t_s) for each t_s; the ratio R_A(t, t_s) itself typically resembles the example shown in Figure 8 (right panel) for the F7 ensemble. The plateau results show a clear tendency to approach the experimental value from below as t_s is increased, indicating a significant contribution from the excited-state terms in the ratio, which is especially pronounced for N6 (small lattice spacing) and G8 (small pion mass). The excited-state contamination is significant at t_s = 1 fm, which supports our previous findings [25] that the plateau method still suffers from substantial excited-state contamination at t_s = 1.3 fm and that 1.5 fm or more may be required. The plateau fit results at each t_s for ensemble F7 are given in Table 5, where the extracted g_A results can clearly be seen to approach the experimental value as the source-sink separation is increased. The results for the largest t_s on each ensemble are summarised in Table 6.

Summation method
Performing the summation

S_O(t_s) = Σ_{t/a=1}^{t_s/a−1} R_O(t, t_s)

parametrically reduces the excited-state contamination: asymptotically, S_O(t_s) is linear in t_s with slope g_O^bare, up to corrections of order e^{−∆t_s}. By determining S_O(t_s) for a number of t_s, the target observable may therefore be obtained from the slope of a linear fit. The t_s dependence of the summed data points, including the linear fits, for the F7 ensemble is shown in Figure 8 and the results are summarised in Table 5. To check the dependence of the fit on the smaller t_s points, the linear fits were performed for two intervals: one used all the t_s points, the other only the points with t_s > 0.9 fm. We observe no statistically significant discrepancy between the two fits and therefore quote the statistically more accurate result that incorporates all t_s points.

Figure 7: g_A as a function of t_s from plateau fits. Green symbols denote our previous results [25] on the same configurations but without AMA. These have been rescaled to account for the updated renormalisation factors in Table 2.

Figure 9 shows the renormalised scalar charge g_S and tensor charge g_T, extracted from the ratios of two- and three-point functions defined in section 2 and obtained from plateau fits (to the same intervals as for g_A). In contrast to g_A, the dependence on t_s is very mild, with no evidence of excited-state contamination, even at a fine lattice spacing (N6) or a light pion mass (G8). This echoes the behaviour seen by other groups, such as [18,15]. In Figure 10 we show a comparison of lattice results for these quantities, where we use our plateau fit results for t_s ∼ 1.1 fm, which we believe to be reasonable due to the absence of excited-state behaviour in these quantities, as seen in Figure 9. A similar comparison for the case of g_A is shown in Figure 11, but we caution that the results obtained from the plateau method shown here should not be taken at face value due to the strong excited-state contamination observed in the case of g_A, even when t_s is as large as 1.3 fm.
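Using a toy spectral model of the ratio (hypothetical masses and couplings, as in the sketch of section 2), one can check that the slope of a straight-line fit to the summed ratio recovers the target charge up to exponentially small corrections:

```python
import numpy as np

# Hypothetical spectral parameters in lattice units (a = 1)
m, dm = 0.5, 0.4
g, c, b = 1.2, -0.3, 0.25

def R(t, ts):                 # ratio with one excited state in C2 and C3
    C2 = 1.0 + b * np.exp(-dm * ts)    # e^{-m ts} cancels in the ratio
    return (g + c * (np.exp(-dm * t) + np.exp(-dm * (ts - t)))) / C2

def S(ts):                    # summed ratio, excluding source and sink
    t = np.arange(1, ts)
    return R(t, ts).sum()

ts_vals = np.array([12, 16, 20, 24])
S_vals = np.array([S(ts) for ts in ts_vals])

slope, intercept = np.polyfit(ts_vals, S_vals, 1)
print(slope)   # close to g = 1.2, up to O(exp(-dm * t_s)) corrections
```

The excited-state terms sum to a t_s-independent constant (absorbed into the intercept) plus pieces falling like e^{−∆t_s}, which is why the summation method converges in the full gap ∆ rather than ∆/2 as for the midpoint of the plateau.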

Summary and discussion
We have investigated the performance of all-mode-averaging (AMA) [21][22][23] when used in conjunction with the locally deflated SAP-preconditioned GCR solver [28,29] employed in the DD-HMC [30,31], MP-HMC [33] and openQCD [34,35] packages. While the block decomposition that forms the basis of the SAP preconditioner breaks the translation invariance of the Dirac operator and thus limits our choices of source position for the sloppy solver in the AMA method [51], we find that AMA provides an increase in efficiency (as measured by the computer time expended to achieve a given relative statistical accuracy) by a factor of around two compared to the standard method of averaging over multiple source positions. Using AMA with appropriately tuned parameters, we have investigated the excited-state contamination of the axial, scalar and tensor charges of the nucleon by measuring two- and three-point functions at large source-sink separations t_s ≈ 1.5 fm with large statistics. AMA enables us to achieve good statistical precision even at large t_s with moderate computational effort. The results for the axial charge contained in this paper represent an extension of our earlier study [25] with statistics increased by a factor of 5-20. From a comparison of different analysis methods (such as plateau fits and the summation method), we are able to conclude that the ratio from which g_A is extracted still suffers from significant excited-state contamination even at source-sink separations of around t_s ≈ 1.3 fm, while the ratios for g_S and g_T are only weakly affected.
In a recent study by the Regensburg group [15] using similar configurations with N_f = 2 O(a)-improved Wilson fermions, a value for g_A around 10% below the experimental one was obtained even at m_π ≈ 150 MeV, using source-sink separations of t_s ≈ 1 fm. Our estimate of g_A obtained from t_s ≈ 1.5 fm in the mass range m_π ≈ 200-300 MeV is consistent with experiment and shows no significant pion mass dependence; we also do not find any evidence for a sizeable finite-size correction [14] on our lattices satisfying m_π L ≥ 4. It appears likely, therefore, that for g_A source-sink separations t_s larger than 1.5 fm are required in order to avoid any systematic uncertainty from excited-state contamination. This must be controlled before proceeding to an estimate of other systematic uncertainties, for instance finite-size, pion-mass, and cut-off effects. Since we did not apply AMA to increase statistics across the entire range of lattice spacings and pion masses available, we refrain from performing a joint chiral and continuum extrapolation of g_A, g_S, and g_T to quote a result at the physical point.
The present study has demonstrated the feasibility of obtaining statistically accurate results for nucleon structure observables from lattice QCD using large source-sink separations with AMA, and we intend to perform further studies of nucleon structure (including in particular the vector form factors, where the suppression of excited-state effects appears to be of crucial importance [40] in order to address the proton radius puzzle from first principles) using this method. Related efforts on the N_f = 2+1 CLS configurations [52] are under way.

Figure 10: Left panel: results for g_S from ETMC [41,42], LHPC [43], PNDME [18,44], and RQCD [15] (open symbols) compared to this work (filled symbols) as a function of m_π²; right panel: results for g_T from ETMC [45,42], LHPC [43], PNDME [18,44], RQCD [15], and RBC/UKQCD [46] (open symbols) compared to this work (filled symbols) as a function of m_π². All results are quoted at a renormalisation scale of 2 GeV.

Acknowledgments: TR is grateful for support through the DFG programme SFB/TRR 55. ES is grateful to the RIKEN Advanced Center for Computing and Communication (ACCC).