Machine learning a fixed point action for SU(3) gauge theory with a gauge equivariant convolutional neural network



I. INTRODUCTION
Lattice regularization is the tool of choice to study nonperturbative properties of quantum field theories starting from first principles [1]. Modern lattice QCD simulations have attained a high level of precision, and for some important Standard Model quantities, e.g., the QCD coupling at the electroweak scale $\alpha_S(\mu = m_Z)$, they provide the most accurate current determination [2]. Increased precision has amplified systematic issues relevant to any lattice calculation, such as the extrapolation to the continuum limit. Numerical simulations rapidly become more costly as the lattice spacing is reduced, not only due to the increased resolution at fixed physical volume, but also due to the increased autocorrelation times (critical slowing down) in generating statistically independent samples in Monte Carlo Markov chains and the related problem of suppressed tunneling between sectors of different topological charge (topological freezing) [3]. For a robust continuum prediction, a range of lattice spacings is necessary, requiring a delicate balance between the control of discretization artifacts on coarse lattices on the one hand and the increased cost of simulating on finer lattices on the other.
Several different approaches are currently being followed to deal with the problems of critical slowing down and topological freezing. Simulations employing open boundary conditions in time [4] or huge master fields [5,6] both circumvent topological freezing, but they do not address critical slowing down. Approaches using trivializing or normalizing flows [7] attempt to solve both problems by finding invertible maps from a simple probability distribution for the lattice configurations, which allows efficient sampling, to the target one. Recently, the use of machine-learning tools for parametrizing normalizing flows has renewed interest in this approach [8-12]; however, these attempts are so far restricted to simple field theories, low dimensions or, in four-dimensional SU(3) gauge theories, to very small and coarse systems [13].
Here we propose to follow a complementary approach in order to solve both critical slowing down and topological freezing by using a lattice action with highly suppressed lattice artifacts. Such an action in principle allows simulations on very coarse lattices where both problems are absent, while at the same time lattice artifacts can be kept so small that a solid continuum limit can be taken. The advantage of this approach is the immediate applicability to gauge field theories in four dimensions without encountering scalability issues, once a highly improved action is found.
There is a long history of designing improved lattice actions to reduce discretization effects, bringing simulations at coarser lattice spacing into the scaling regime. One such program, Symanzik improvement [14-17], removes lattice artifacts in some physical quantities order by order in the lattice spacing $a$. In a lattice gauge theory, this can be achieved, for example, by building a lattice action combining plaquettes and closed six-link loops. By construction, such an approach involves a perturbative expansion at weak coupling. A radically different approach makes use of renormalization group (RG) properties to design lattice actions where artifacts are removed completely to all orders. The construction of such quantum perfect actions is an extremely ambitious goal and is in general impossible to achieve. In asymptotically free theories, such as QCD, a constructive method can be designed based on the fixed point (FP) of RG transformations which yields lattice actions without lattice artifacts at the classical level, i.e., for on-shell quantities [18]. These so-called classically perfect actions, or FP actions in short, are in general expected to show suppressed lattice artifacts at sufficiently small gauge coupling $g$ even at the quantum level. The FP action approach was used to study the O(3) nonlinear $\sigma$-model, SU(3) pure gauge theory, and full QCD, with promising indications of much-reduced cutoff dependence in Monte Carlo simulations [19-29]. However, the increased numerical cost of simulating FP gauge actions made it difficult at that time to draw firm conclusions on the level of improvement. Given the intervening dramatic increase in computing capability, this is no longer an obstacle and pushing the FP approach to higher accuracy has in principle become feasible.
The difficulty of implementing the FP program in practice stems from the fact that many of the FP properties are defined only implicitly without knowing the explicit form of the FP action. Moreover, the FP action in principle requires infinitely many loop operators in order to describe the infinitely many gauge link couplings generated through the RG transformations (RGTs). This is not a problem per se, because reasonable choices of the RGT lead to FP actions which are local, i.e., for which the couplings decay exponentially with separation, and the RGT can in fact be designed to optimize this decay. For the SU(3) gauge theory this has been achieved in Ref. [24]. One is then still left with the challenging task of finding a compact and accurate parametrization of the FP lattice action. This is an essential first step before any Monte Carlo study can be done. Recent advances in machine learning (ML), in particular the construction of lattice gauge equivariant convolutional neural networks (L-CNNs) in Ref. [30], now provide a completely new way to tackle this problem. Rather than committing to a particular ansatz for the lattice action, e.g., in terms of some of the smaller closed loops like the plaquette and rectangle, one can have a much more general and expressive neural network architecture, where an optimal set of parameters can be found using ML techniques once a sufficiently rich training dataset is provided. An essential element is that gauge symmetry must be exactly preserved in the network architecture. In Ref. [30] this has been achieved by starting with the original gauge links and local untraced plaquettes, and creating extended closed loops of gauge links through successive layers using parallel transport and bilinear products of local gauge equivariant operators. In this way, a rapidly increasing number of possible loops is generated with each additional layer. This was shown to be far superior to convolutional neural networks (CNNs) where gauge symmetry was not built into the architecture, leading to poor predictions. The complete generality of the L-CNN approach makes it an ideal method to parametrize FP actions.
For any improved lattice action, the true test of how much lattice artifacts are reduced in the full quantum theory is only possible through Monte Carlo simulations. In this paper we focus on describing in detail the already challenging first step, to parametrize an FP action, in particular for the four-dimensional SU(3) gauge theory, using L-CNNs and ML techniques. This allows us to compare with previous studies of the FP parametrization and also serves as a proof of concept that ML can be used accurately for this task. The end result is that the very expressive nature of L-CNNs enables us to find a much more accurate parametrization of the SU(3) FP gauge action than previously possible. This conceptual success constitutes the first necessary step toward future Monte Carlo studies and ultimately toward the construction of an (approximate) quantum perfect action with strongly suppressed lattice artifacts.
The paper is organized as follows. In Sec. II we first recapitulate how the FP action emerges in the limit of iterating RGTs of asymptotically free theories, and how the FP action and its classical properties are implicitly defined through a classical saddle point equation. We then describe the setup and training of the L-CNN architectures in question, starting with a description of the construction of the learning data sets in Sec. III, the explicit description of the L-CNN architecture and ML model in Sec. IV, and finally comparing the accuracy of the L-CNN parametrizations to previous ones in Sec. V. We end with a view to the next steps and possible further uses of L-CNNs and ML for Monte Carlo simulations or, more ambitiously, for constructing the full renormalization group trajectory in Sec. VI. A preliminary version of this work was presented in [31].

II. THEORETICAL SETUP
The role of the Wilsonian renormalization group transformation (RGT) is to reduce the number of degrees of freedom of a particular physical system by integrating out fluctuations at high-energy scales while leaving the underlying physics at low-energy scales entirely intact [32,33]. For a field theory regularized on a lattice, the lattice spacing is increased with each RGT step. Starting from a very fine lattice close to the continuum, for which any discretized action has negligible lattice artifacts, one can follow a chain of RGT steps leading to a very complicated lattice action on a coarse lattice describing the same low-energy physics. For SU($N_c$) lattice gauge theory, where the underlying variables on the fine lattice $\Lambda = \{n \in \mathbb{N}^4\}$ are the gauge links $U_{n,\mu}$ with some lattice action $A[U]$ and gauge coupling $\beta = 2N_c/g^2$, the RGT can be defined as

$$\exp\left(-\beta' A'[V]\right) = \int \mathcal{D}U \, \exp\left\{-\beta \left(A[U] + T[U,V]\right)\right\}, \tag{1}$$

where the blocking kernel $T[U,V]$ is given by

$$T[U,V] = -\frac{\kappa}{N_c} \sum_{n_B,\mu} \left\{ \mathrm{Re\,Tr}\left(V_{n_B,\mu} Q^\dagger_{n_B,\mu}\right) - \mathcal{N}^\beta_\mu \right\} \tag{2}$$

and defines the coupling between the fine links $U_{n,\mu}$ and the coarse links $V_{n_B,\mu}$ on the blocked coarse lattice $\Lambda_B = \{n_B \in \mathbb{N}^4\}$. The free parameter $\kappa$ can be optimized, as discussed later. The $Q_{n_B,\mu}$ variables are blocked links constructed from the underlying fine links $U_{n,\mu}$ (cf. Appendix D for the explicit gauge-link blocking used in this work). The normalization term $\mathcal{N}^\beta_\mu$ guarantees that the partition function is invariant under the RGT, i.e., integrating Eq. (1) over the coarse gauge links with $\mathcal{D}V$ yields $Z(\beta') = Z(\beta)$. The form of the effective coarse action $A'[V]$ and the couplings $\{g', c'_0, c'_1, \ldots\}$ are determined by the choice of the kernel $T[U,V]$. Under infinitesimal RGTs, the couplings map out a flow in the space of all possible gauge-invariant operators, as illustrated in Figure 1 by the light red trajectories.
For asymptotically free gauge theories, the only relevant coupling is the gauge coupling $g$, and the continuum is approached in the weak coupling limit $\beta \to \infty$. On the critical surface, where $\xi/a = \infty$, the irrelevant couplings $c_0, c_1, \ldots$ flow into a fixed point as shown in Figure 1. The FP couplings $\{c^{\rm FP}_0, c^{\rm FP}_1, \ldots\}$ are determined once the form of the RG blocking is prescribed. Slightly off the critical surface, the couplings first flow toward and then away from the fixed point, approaching the renormalized trajectory (RT) which describes the flow starting from the FP in the relevant direction of the gauge coupling. Along the RT, the lattice theory is quantum perfect, with no lattice artifacts at all, because it is connected back to the continuum theory on the critical surface. The FP couplings define the so-called FP action $A^{\rm FP}$. When it is used at finite values of $\beta$, it tracks the RT closely at sufficiently weak coupling, cf. Figure 1. The FP action can be shown to be classically perfect [18,20], i.e., it has no lattice artifacts of $O(a^{2n})$ to all orders on field configurations fulfilling the equations of motion. Artifacts of $O(g^2 a \log a, g^2 a^2)$ are, however, present but suppressed for small $g$.
As pointed out by Hasenfratz and Niedermayer in Ref. [18], the FP action $A^{\rm FP}$ is implicitly given by the $\beta \to \infty$ limit of Eq. (1), namely by the saddle point equation

$$A^{\rm FP}[V] = \min_{\{U\}} \left\{ A^{\rm FP}[U] + T[U,V] \right\}. \tag{3}$$

For a fixed coarse configuration $V$, the minimization is over all possible fine configurations $U$, and the normalization term in the limit $\beta \to \infty$ becomes

$$\mathcal{N}^\infty_\mu = \max_{W \in \mathrm{SU}(N_c)} \mathrm{Re\,Tr}\left(W Q^\dagger_\mu\right). \tag{4}$$

It is easy to see that the FP action has no lattice artifacts for field configurations fulfilling the equations of motion. It becomes apparent when considering the variation of the FP action using the chain rule,

$$\frac{\delta A^{\rm FP}[V]}{\delta V} = \left.\frac{\delta \left(A^{\rm FP}[U] + T[U,V]\right)}{\delta U}\right|_{U=U_{\min}} \frac{\delta U_{\min}}{\delta V} + \left.\frac{\delta T[U,V]}{\delta V}\right|_{U=U_{\min}}, \tag{5}$$

where $U_{\min}$ is the configuration minimizing the right-hand side of Eq. (3). For a classical coarse configuration $V$ one has

$$\frac{\delta A^{\rm FP}[V]}{\delta V} = 0 = \left.\frac{\delta T[U,V]}{\delta V}\right|_{U=U_{\min}},$$

since $U_{\min}$ minimizes the sum $A^{\rm FP}[U] + T[U,V]$. Hence $T[U,V]$ takes its minimum value, namely zero. This in turn forces

$$\left.\frac{\delta A^{\rm FP}[U]}{\delta U}\right|_{U=U_{\min}} = 0,$$

meaning the minimizing configuration $U_{\min}$ is also classical and the FP action value is unchanged in the minimizing step. This can be iterated until one reaches an arbitrarily fine classical solution with the correct continuum action value. In particular, the FP action allows for exact instanton solutions at finite lattice spacing [23], and the exact FP equation therefore preserves topology on the lattice. Note, however, that this is not necessarily true for the RGT step. Starting from a fine configuration $U$ which is a classical solution, the resulting blocked configuration $V$ might not automatically be one as well.

FIG. 2. The leading couplings $\rho_{\mu\nu}(r)$ of the perturbative FP action, from [24]. The blocking kernel $T[U,V]$ is designed to maximize the exponential decay of the couplings, with $\exp(-3.4\,r/a)$ shown as a visual guide.
In fact, this can directly be seen by blocking analytical instanton solutions with a small radius in lattice units such that the instanton properties are lost on the coarse configuration. This process of instantons falling through the lattice is discussed further in Sec. III. A crucial question concerns the locality of the FP action or, more generally, the action $A'$ in Eq. (1). In order to guarantee universality, the couplings must fall off exponentially in the separation between fields. One can design RGTs which force an exponential decay and find the one maximizing it, so that beyond some separation the couplings are small enough to be negligible and in practical applications can be omitted. Some guidance for a good choice of blocking kernel can be provided perturbatively [19,20]. At weak coupling (and with some gauge fixing), the FP action can be expanded in terms of the gauge potential $A^a_\mu(x)$, only keeping terms up to quadratic order in the potential. The resulting action can be expressed in terms of couplings $\rho_{\mu\nu}(r)$ for fields $A_\mu(x)$ and $A_\nu(y)$ at separation $r = |x - y|$. An optimal choice of the blocking and the RGT with respect to locality was found in [24]. Figure 2 shows the corresponding largest perturbative couplings, which fall off exponentially in magnitude with $\sim \exp(-3.4\,r/a)$. It is this RGT which we employ in our work, and we give its details in Appendix D.
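To make the locality statement concrete, the following minimal sketch evaluates the guide curve $\exp(-3.4\,r/a)$ quoted for Fig. 2 at a few separations. Only the decay rate 3.4 is taken from the text; the function name and the chosen separations are illustrative assumptions, and the actual couplings $\rho_{\mu\nu}(r)$ of Ref. [24] are not reproduced here.

```python
import math

DECAY_RATE = 3.4  # slope of the visual guide exp(-3.4 r/a) quoted for Fig. 2

def relative_coupling(r_over_a: float) -> float:
    """Size of a coupling at separation r relative to r = 0, following the guide curve only."""
    return math.exp(-DECAY_RATE * r_over_a)

# Hypothetical sample separations in units of the lattice spacing.
suppression = {r: relative_coupling(r) for r in (1, 2, 3)}

for r, s in suppression.items():
    print(f"r/a = {r}: relative size ~ {s:.1e}")
```

Under this guide curve the couplings drop by roughly a factor of 30 per lattice spacing, which is why truncating the action beyond a few lattice spacings can be a controlled approximation.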
The FP action $A^{\rm FP}$ and its properties are defined only implicitly through the FP equation, Eq. (3), where $A^{\rm FP}$ appears on both sides. The FP equation is therefore iterative: on the right-hand side of Eq. (3), the value of $A^{\rm FP}[U]$ can be determined through a second minimization over even finer gauge configurations $U'$, and so on, until we reach a configuration so smooth that any (reasonable) lattice discretization of the continuum Yang-Mills gauge action can be used to calculate the inception value of the action. In practice, instead of iterating the FP equation, one can shortcut the procedure and, for sufficiently smooth fine configurations, make use of existing approximate parametrizations of $A^{\rm FP}[U]$. Previous parametrizations include linear combinations of plaquette, rectangular, and parallelogram loops with various powers of their traces [24], or combinations of thin-link and smeared-link plaquette traces $u_{x,\mu\nu}$ and $w_{x,\mu\nu}$ with various powers of the form $\sum_{k,l} p_{kl}\, u^k_{x,\mu\nu} w^l_{x,\mu\nu}$ with optimized coefficients $p_{kl}$ [25], cf. Appendix C for further details. While this parametrization ansatz is already very general and flexible, in practice one is restricted to a rather small set of $O(20\text{-}30)$ parameters.
In this paper, we take a different approach using L-CNNs and ML in connection with the FP data from Eq. (3) in order to explore a much larger space of possible actions, with the goal of finding a more accurate approximation than previously feasible. That is the parametrization challenge which we address in this paper.

III. FIXED POINT DATA
To parametrize the FP action accurately requires a large set of data. In this section we describe how this data is obtained on the basis of Eq. (3). In this work, most of the FP data stems from Monte Carlo ensembles generated using the Wilson gauge action at various couplings $\beta_{\rm wil}$. As such, $\beta_{\rm wil}$ simply serves as a proxy for the size and characteristics of the gauge field fluctuations. For each coarse configuration $V$ one needs to find the minimizing fine configuration $U_{\min}$ on the right-hand side of Eq. (3), which then yields the value $A^{\rm FP}[V] = A^{\rm FP}[U_{\min}] + T[U_{\min}, V]$. As described in the previous section, for practical reasons one employs a parametrization of $A^{\rm FP}[U]$ for the minimization procedure, and the question arises how this approximation affects the true value $A^{\rm FP}[V]$. Since the RG blocking increases the lattice spacing by a factor of 2 in each RGT step, the action density on the fine configuration $U$ is at least a factor of $2^4 = 16$ smaller than on the coarse configuration $V$, and in practice is even smaller, because of the sizable positive contribution from the blocking kernel $T[U,V]$. In Figure 3 we show the two contributions $T[U,V]$ and $A^{\rm FP}[U]$ to $A^{\rm FP}[V]$ averaged over $4^4$ lattice ensembles generated at the indicated coupling $\beta_{\rm wil}$, and we find that for $\beta_{\rm wil} \gtrsim 5.5$ the action density for $A^{\rm FP}[U]$ is about a factor $\gtrsim 30$ smaller than the one for $A^{\rm FP}[V]$.
(Note that in the figure the action density for $A^{\rm FP}[U]$ is normalized to the coarse lattice volume.) Hence, for the very smooth fine configurations, any reasonably good approximation to $A^{\rm FP}[U]$ can be used on the right-hand side, and in practice we employ the existing APE444 parametrization, cf. Appendix C 2 for details. This action is constructed in such a way that the couplings of the FP action in the quadratic approximation are reproduced [24] while explicitly maintaining the Symanzik "on-shell" conditions to $O(a^2)$ [25], and it therefore is a very good approximation on sufficiently smooth configurations. From Figure 3 we can estimate the error on $A^{\rm FP}[V]$ induced by using the APE444 parametrization on the right-hand side. Considering the worst case $\beta_{\rm wil} = 5.0$, for the minimizing configurations we find action densities $\lesssim 0.5$ corresponding to $\beta_{\rm wil} \gg 20.0$. From the top plot in Figure 8 we find that for the APE444 parametrization the relative action error is $\lesssim 0.3\%$, inducing an error of $\lesssim 0.17\%$ on $A^{\rm FP}[V]$ for configurations at $\beta_{\rm wil} = 5.0$ and far less than $0.1\%$ already at $\beta_{\rm wil} = 6.0$. The accuracy of $A^{\rm APE444}[U]$ can of course also be checked by further minimization over $U'$.
Another potential error on the FP data $A^{\rm FP}[V]$ may originate from inaccurate minimization of the right-hand side of Eq. (3). The minimization on each configuration starts from an initial random fine configuration $U$ and then sequentially updates each link $U_{n,\mu}$ with an adaptive rotation in color space. Each iteration corresponds to a pass through the entire volume. We show two typical examples of this minimizing procedure in Fig. 4 for two coarse configurations $V$ on an $8^4$ volume generated with $\beta_{\rm wil} = 6.0$ (top plot) and $\beta_{\rm wil} = 5.4$ (bottom plot). As shown in the figure, on smoother configurations at $\beta_{\rm wil} = 6.0$ the minimization converges quickly, while on rougher configurations at $\beta_{\rm wil} = 5.4$, as expected, it takes somewhat longer to reach a similar level of convergence. In any case, we see from the illustrations that even in those cases, the error on the value of the FP action $A^{\rm FP}[V]$ is negligible. Note that the minimization procedure is the most expensive step in generating the FP data. This is because the update of a single link $U_{n,\mu}$ contributes both to $A^{\rm APE444}[U]$ and to several blocked links $Q_{n_B,\nu}[U]$ in a complicated way, which requires the expensive recalculation of many intermediate quantities and the resulting contributions, cf. Appendices C 2 and D.
The action value $A^{\rm FP}[V]$ is only one datum of information for each coarse configuration $V$. However, the FP equation, Eq. (3), contains much more information which can be extracted from the derivatives with respect to the gauge links [25]. Since the first term on the right-hand side of Eq. (5) vanishes for the minimizing configuration $U_{\min}$, the derivative can be determined solely from the blocking kernel evaluated on the minimizing configuration. To be explicit, one has

$$\frac{\delta A^{\rm FP}[V]}{\delta V^a_{x,\mu}} = \frac{\delta T[U_{\min}, V]}{\delta V^a_{x,\mu}}, \tag{9}$$

with $t^a$ the generators of SU(3) and the blocked links $Q_{x,\mu}$ built from the minimizing configuration $U_{\min}$. The derivative with respect to $V^a_{x,\mu}$ denotes the group derivative of the link $V_{x,\mu}$ in the direction of the generator $t^a$. Each coarse gauge configuration $V$ on an $L^4$ lattice therefore generates $4 \times L^4 \times (N_c^2 - 1)$ data of derivatives, one for each link and color index. This is a large amount of information which is very valuable in the parametrization process as it tightly constrains the form of the FP action. For later convenience, we combine the derivatives for all color indices into a single matrix-valued quantity $D^{\rm FP}_{x,\mu}$, Eq. (11), which makes them independent of the choice of basis for the generators.
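The statement that the derivative of the FP action follows from the blocking kernel alone is an instance of the envelope theorem. The sketch below illustrates it in a scalar toy model that is not the SU(3) setup: the "fine action" $u^2$, the quadratic "kernel" $\kappa (v-u)^2$, the value of $\kappa$, and all function names are assumptions chosen purely for illustration.

```python
KAPPA = 2.0  # hypothetical kernel strength for the toy model

def fine_action(u: float) -> float:
    """Toy stand-in for A[U]."""
    return u ** 2

def kernel(u: float, v: float) -> float:
    """Toy stand-in for T[U, V]; vanishes when fine and coarse variables agree."""
    return KAPPA * (v - u) ** 2

def minimize_over_u(v: float, steps: int = 200, lr: float = 0.1) -> float:
    """Plain gradient descent in u at fixed v (the analogue of finding U_min)."""
    u = 0.0
    for _ in range(steps):
        grad = 2.0 * u - 2.0 * KAPPA * (v - u)
        u -= lr * grad
    return u

def coarse_action(v: float) -> float:
    """Analogue of Eq. (3): minimum over u of fine_action(u) + kernel(u, v)."""
    u_min = minimize_over_u(v)
    return fine_action(u_min) + kernel(u_min, v)

def derivative_from_kernel(v: float) -> float:
    """Partial derivative of the kernel with respect to v, evaluated at u = u_min."""
    return 2.0 * KAPPA * (v - minimize_over_u(v))

def derivative_numerical(v: float, h: float = 1e-5) -> float:
    """Central finite difference of the minimized coarse action."""
    return (coarse_action(v + h) - coarse_action(v - h)) / (2.0 * h)

v0 = 1.3
print(derivative_from_kernel(v0), derivative_numerical(v0))
```

Both numbers agree up to the finite-difference error, because the term proportional to the variation of the minimizer drops out at the minimum, just as the first term of Eq. (5) does.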

Gauge invariance of the FP action means the derivatives $D^{\rm FP}_{x,\mu}$ are not independent, which yields a very useful consistency check. Under an infinitesimal gauge transformation with parameters $\alpha_x = \alpha^a_x t^a$, the links change by a term involving the gauge covariant forward finite difference of $\alpha_x$, and the action must remain unchanged. After summation by parts, this is equivalent to a condition, Eq. (13), involving the gauge covariant backward finite difference, defined for a matrix-valued field $G_x$, applied to the derivatives. Since Eq. (13) has to be satisfied for all possible $\alpha_x$, this becomes a local condition, Eq. (14), which must hold at each $x$ for exactly gauge invariant actions. We note that Eq. (14) is a consequence of Noether's second theorem applied to the FP action.

FIG. 5. Minimization of instanton configurations. In the upper panel, an instanton of size $\rho/a = 3.0$ on a $16^4$ lattice persists after being blocked to the $8^4$ lattice. The lower panel shows an instanton with $\rho/a = 1.5$ which is too small to survive the RG blocking, i.e., the instanton falls through the lattice.
In our approach, we compute the FP derivatives using Eq. (9), relying on the fact that $U$ is a (local) minimum of the right-hand side of the FP equation. Since the numerical minimization procedure to determine the fine configuration $U$ can only yield approximate minima, we may check, for each coarse configuration $V$, how closely the numerically obtained FP derivatives satisfy this requirement. This allows us to directly assess the quality of the minimizing configuration and the FP action data. In practice, we find that the consistency check is satisfied up to the accuracy achieved in the minimization.
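The structure of this consistency check can be illustrated in a much simpler Abelian setting. The sketch below uses a two-dimensional U(1) plaquette action, which is an assumption made for illustration and not the SU(3) FP action: there, gauge invariance under $\theta_{x,\mu} \to \theta_{x,\mu} + \alpha_x - \alpha_{x+\hat\mu}$ implies that the backward lattice divergence of the link derivatives vanishes at every site. All names in the code are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 6  # toy 2D periodic lattice
theta = rng.uniform(-np.pi, np.pi, size=(2, L, L))  # link angles theta[mu, x0, x1]

def fwd(f, mu):
    """f(x + mu_hat) with periodic boundaries."""
    return np.roll(f, -1, axis=mu)

def bwd(f, mu):
    """f(x - mu_hat) with periodic boundaries."""
    return np.roll(f, 1, axis=mu)

def plaquette_angle(th):
    return th[0] + fwd(th[1], 0) - fwd(th[0], 1) - th[1]

def action(th):
    """U(1) plaquette action, sum over sites of 1 - cos(plaquette angle)."""
    return np.sum(1.0 - np.cos(plaquette_angle(th)))

def link_derivatives(th):
    """Analytic derivatives of the action with respect to every link angle."""
    s = np.sin(plaquette_angle(th))
    return np.stack([s - bwd(s, 1), -s + bwd(s, 0)])

def backward_divergence(d):
    """Sum over mu of d_mu(x) - d_mu(x - mu_hat); zero for gauge invariant actions."""
    return sum(d[mu] - bwd(d[mu], mu) for mu in range(2))

exact = link_derivatives(theta)
violation_exact = np.max(np.abs(backward_divergence(exact)))

# Mimic imperfect derivative data, e.g. from an incomplete minimization.
noisy = exact + 1e-3 * rng.normal(size=exact.shape)
violation_noisy = np.max(np.abs(backward_divergence(noisy)))

print(violation_exact, violation_noisy)
```

Exact derivatives satisfy the condition to machine precision, while perturbed derivatives violate it at the level of the perturbation, which is the sense in which the check measures the quality of the derivative data.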
In addition to Monte Carlo ensembles generated with the Wilson gauge action, we can also examine the FP action for instanton lattice configurations. Taking as input a fine instanton configuration with some chosen value of instanton radius $\rho$, we produce a coarse configuration $V$ using the RG blocking. If the topological properties are intact on the coarse side, the FP action should be unchanged by minimization, reproducing an instanton solution on the fine side. Tests of this are shown in Fig. 5. Starting from a fine instanton solution on a $16^4$ volume, the coarse $8^4$ configuration $V$ is produced via RG blocking and then fed into the minimization procedure. The upper plot is for an instanton originally of radius $\rho/a = 3.0$; under minimization, the action is essentially unchanged, with a very small contribution from the blocking kernel $T[U,V]$, meaning the blocked configuration is also an instanton solution. The inset shows the rapid convergence of $A[U] + T[U,V]$ in the minimization. The lower plot is for an instanton originally of radius $\rho/a = 1.5$; once RG-blocked, the instanton is lost, as $T[U,V]$ becomes much larger during minimization and the minimized total action $A[U] + T[U,V]$ is below the continuum value $4\pi^2$, i.e., the topological features are lost because the instanton can no longer be resolved at the level of the coarse lattice spacing $a' = 2a$. Note that with the RGT-III blocking employed in this work, instantons fall through the lattice for radii $\rho/a' \lesssim 0.85$. In order to capture this specific classical property of the FP action, we generate a set of coarse configurations through blocking fine instanton configurations with $\rho/a$ ranging from 1.1 to 3.0. The corresponding FP action values and derivatives provided by the minimization form part of the FP training data set. Further details of instanton solutions on the lattice are given in Appendix E.
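The quoted numbers can be related by a short unit conversion. The sketch below converts fine-lattice radii $\rho/a$ to coarse-lattice units using $a' = 2a$ and compares them with the approximate fall-through threshold $\rho/a' \lesssim 0.85$ from the text; the function names are illustrative, and the threshold is treated as a sharp cut only for the purpose of this example.

```python
BLOCKING_FACTOR = 2            # coarse spacing a' = 2a
FALL_THROUGH_THRESHOLD = 0.85  # approximate rho/a' below which instantons are lost (RGT-III)

def coarse_radius(rho_over_a: float) -> float:
    """Instanton radius in units of the coarse lattice spacing a'."""
    return rho_over_a / BLOCKING_FACTOR

def survives_blocking(rho_over_a: float) -> bool:
    """Whether the instanton is expected to persist on the coarse lattice."""
    return coarse_radius(rho_over_a) > FALL_THROUGH_THRESHOLD

for rho in (1.1, 1.5, 3.0):
    print(f"rho/a = {rho}: rho/a' = {coarse_radius(rho)}, survives: {survives_blocking(rho)}")
```

With these inputs, the $\rho/a = 3.0$ instanton of the upper panel of Fig. 5 lies above the threshold and the $\rho/a = 1.5$ instanton of the lower panel lies below it, so the training range from 1.1 to 3.0 covers both sides of the transition.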

IV. MACHINE LEARNING MODEL
Machine learning is being applied across a vast array of fields [34-39]. Focused more specifically on lattice field theory, it has been used in a range of topics, including the identification of phase transitions and their underlying critical exponents [40], the generation of decorrelated Markov chains through normalizing [9] or trivializing [12] flow transformations, inverting renormalization group transformations in scalar field theory [41], the finite-temperature deconfinement phase transition in SU(2) and SU(3) pure gauge theory [42,43], preconditioning of lattice Dirac operators [44,45], and the connection between machine learning diffusion models and stochastic quantization of field theories through Langevin dynamics [46]. A recent review of some of this work is given in [47]. In our context, we need a tool to parametrize a lattice action in a highly general form, maintaining exact gauge invariance. The necessary architecture has already been developed in [30] with the lattice gauge equivariant convolutional neural network (L-CNN).

A. Gauge equivariant network layers
The input to the L-CNN is a set of gauge links $U_{x,\mu}$, which under a gauge transformation change as $U'_{x,\mu} = \Omega_x U_{x,\mu} \Omega^\dagger_{x+\hat\mu}$, with $\Omega_x \in \mathrm{SU}(3)$. From the gauge links, we build untraced plaquette variables

$$U_{x,\mu\nu} = U_{x,\mu}\, U_{x+\hat\mu,\nu}\, U^\dagger_{x+\hat\nu,\mu}\, U^\dagger_{x,\nu},$$

which gauge transform locally as $U'_{x,\mu\nu} = \Omega_x U_{x,\mu\nu} \Omega^\dagger_x$. We refer to generic variables with local gauge transformations as $W_{x,a}$ with channel index $1 \le a \le N_{\rm ch}$. Gauge equivariant convolutions of these variables (the "channels") are built through parallel transport via gauge links: each output channel is a weighted sum of input channels $W_{x+k\cdot\hat\mu,b}$ that are parallel transported back to the site $x$, with $\omega_{a,b,\mu,k}$ the convolution weights, channel indices $1 \le a \le N_{\rm ch,out}$ and $1 \le b \le N_{\rm ch,in}$, and with $K$ the kernel size. The parallel transporters $U_{x,k\cdot\hat\mu}$, which start at $x$ and end at $x + k\cdot\hat\mu$, are the products of consecutive gauge links along the path. Products of locally transforming variables are constructed in a bilinear layer with parameters $\alpha_{a,b,c}$, which multiplies pairs of input channels $b$ and $c$ and sums them into output channels $1 \le a \le N_{\rm out}$, a crucial point being that gauge covariance is maintained exactly as the product is of variables at the same lattice site. For the L-CNN models used in this work, we use a combination of the convolutional and bilinear layer (a bilinear convolution), Eq. (18), in which $\omega_{a,b,c,k,\mu}$ are real-valued weights. We also note that each bilinear convolutional layer considers both orientations of a particular input variable (e.g., both $W_{x,i}$ and $W^\dagger_{x,i}$), which effectively doubles the number of input channels, and a residual term. The number of trainable parameters associated with Eq. (18) is fixed by the numbers of input and output channels, the kernel size, and the number of lattice directions. As depicted in Fig. 6, the full architecture can have alternating convolutional and bilinear layers (or combinations thereof), building up more and more loops of increasing length. In principle, any arbitrary loop can be generated once sufficiently many layers are combined. The model also has the possibility to add activation and exponentiation of the variables $W_{x,a}$, which are not used in this particular work. As a final layer, a trace over the variables produces a gauge invariant scalar. In Ref. [30], L-CNN models were used to accurately predict traces of planar Wilson loops of size up to $4 \times 4$ in SU(2) gauge theory and were far superior to CNN models which were not constructed with exact gauge invariance.

FIG. 6. An example of a lattice gauge equivariant convolutional neural network (L-CNN), taken from [30]. Given a lattice gauge configuration as input, a sequence of layers builds untraced loops of gauge links of increasing size, with the total number of loops increasing rapidly with the depth of the network. The loops are traced in the final layer to produce gauge invariant output. Exact gauge covariance is maintained throughout.
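The gauge covariance of such a layer can be checked numerically. The following is a minimal sketch under simplifying assumptions that are not taken from the paper: a two-dimensional periodic lattice, random unitary (rather than special unitary) link matrices, a single channel, kernel size one, and a layer of the assumed form $W'_x = \sum_\mu \omega_\mu\, W_x\, U_{x,\mu} W_{x+\hat\mu} U^\dagger_{x,\mu}$. All function names are hypothetical and do not refer to the code of Ref. [30].

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 4, 3  # toy setup: 2D periodic L x L lattice, N x N unitary matrices

def dag(m):
    """Hermitian conjugate acting on the last two (matrix) axes."""
    return np.conj(np.swapaxes(m, -1, -2))

def random_unitary(*shape):
    """Unitary matrices taken as eigenvector bases of random Hermitian matrices."""
    z = rng.normal(size=shape + (N, N)) + 1j * rng.normal(size=shape + (N, N))
    _, vecs = np.linalg.eigh(z + dag(z))
    return vecs

def shift(field, mu):
    """Field evaluated at x + mu_hat with periodic boundaries."""
    return np.roll(field, -1, axis=mu)

def plaquette(U):
    """Untraced plaquette U_{x,01} built from links U[mu] of shape (L, L, N, N)."""
    return U[0] @ shift(U[1], 0) @ dag(shift(U[0], 1)) @ dag(U[1])

def bilinear_conv(U, W, weights):
    """Single-channel toy layer: sum_mu w_mu * W_x U_{x,mu} W_{x+mu} U_{x,mu}^dagger."""
    out = np.zeros_like(W)
    for mu in range(2):
        transported = U[mu] @ shift(W, mu) @ dag(U[mu])  # parallel transport to x
        out += weights[mu] * (W @ transported)
    return out

def gauge_transform(U, Omega):
    """U'_{x,mu} = Omega_x U_{x,mu} Omega_{x+mu}^dagger."""
    return np.stack([Omega @ U[mu] @ dag(shift(Omega, mu)) for mu in range(2)])

U = np.stack([random_unitary(L, L) for _ in range(2)])
Omega = random_unitary(L, L)
weights = rng.normal(size=2)

out = bilinear_conv(U, plaquette(U), weights)
U_g = gauge_transform(U, Omega)
out_g = bilinear_conv(U_g, plaquette(U_g), weights)

# The layer output should transform locally, and its trace should be invariant.
covariance_error = np.max(np.abs(out_g - Omega @ out @ dag(Omega)))
trace_error = np.max(np.abs(np.trace(out_g, axis1=-2, axis2=-1)
                            - np.trace(out, axis1=-2, axis2=-1)))
print(covariance_error, trace_error)
```

Both errors are at the level of floating-point rounding, independent of the random weights, which reflects that covariance is a property of the layer structure and not of the trained parameters.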

B. Parametrizing actions using L-CNNs
L-CNNs built from multiple bilinear convolutions with a final trace layer at the end of the network can be used to express a large class of gauge invariant scalar functions on the lattice. However, there are a few additional requirements to parametrize gauge invariant actions. The first requirement is a normalization condition: the output of a parametrization $A^{\text{L-CNN}}[V]$ must approach the Yang-Mills action $S_{\rm YM}[A_\mu(x)]/\beta$ for sufficiently smooth gauge configurations $V_{x,\mu} \approx \exp(iaA_\mu(x))$. Secondly, one may require that the naive continuum limit $a \to 0$ is reached in a particular way such that lattice artifacts of certain observables are suppressed to some desired order, along the lines of Symanzik improvement. A third condition is that the parametrized action should be positive for all gauge configurations. Finally, we require the action to be local, which means that the parametrization should be expressible as a sum over lattice sites of finite-length Wilson loops and their products.
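All four requirements can be explicitly realized by choosing a particular ansatz for the parametrization model: a sum over lattice sites of a local prefactor action multiplied by a power series, with coefficients $b^{(n)}$, in the difference $N_x[V] - N_x[1]$, where $N_x[V]$ is the local output of an L-CNN and $N_x[1]$ is the network evaluated on a link configuration of unit matrices. The $b^{(n)}$ are manually chosen coefficients with the constraint $b^{(0)} = 1$.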
As we will show below, the prefactor part is used to control the naive continuum behavior of the action, whereas the L-CNN provides corrections for coarse configurations.
We consider prefactor actions of the form given in Eq. (20), a polynomial of order $M$ in the traces of a set of Wilson loops $U_{x,C}$ (specifically plaquettes, rectangles, chairs, and parallelograms) starting at the lattice site $x$, with real-valued coefficients $c^{(m)}_C$. By construction, the prefactor action is ultralocal, with zero coupling beyond some separation. Additionally, there are constraints on the coefficients $c^{(m)}_C$ which ensure positivity. Particular choices for the coefficients guarantee that the prefactor action approaches the Yang-Mills action smoothly (normalization) and is optionally improved to some particular order (Symanzik improvement). Suitable choices for the prefactor action are the Wilson action (which only consists of the linear plaquette term) or the Symanzik improved action (linear plaquette and rectangle contributions). The specific form of Eq. (20) also allows for the fixed point action parametrizations considered in [24], specifically the type IIIa, IIIb and IIIc parametrizations, which include all terms except chairs up to order $M = 4$. If the set of Wilson loops includes plaquettes, rectangles, parallelograms, and chairs, the normalization condition is

$$c^{(1)}_{\rm pl} + 8c^{(1)}_{\rm rt} + 8c^{(1)}_{\rm pg} + 16c^{(1)}_{\rm ch} = 1,$$

while further Symanzik conditions can be imposed on the coefficients [16]. The most frequently used Symanzik improved gauge action sets $c^{(1)}_{\rm ch} = 0$, combining only plaquettes and rectangles with $c^{(1)}_{\rm rt} = -1/12$ and $c^{(1)}_{\rm pl} = 5/3$. Note that the parametrized FP action of [24] set $c^{(1)}_{\rm ch} = 0$, but included parallelogram loops as well.

While the prefactor part of $A^{\text{L-CNN}}[V]$ is designed to provide a good approximation to $A^{\rm FP}[V]$ for smooth gauge fields, we use the term $N_x[V]$ to deal with coarse configurations. We represent $N_x[V]$ as the real trace of a stack of $N_{\rm layer} \ge 1$ bilinear convolutional layers. The output of the L-CNN is thus a linear combination of Wilson loops of various sizes and therefore local. We regularize the output of the model such that the difference $N_x[V] - N_x[1]$ vanishes in the vacuum for $V_{x,\mu} = 1$. Furthermore, since the L-CNN can be written as a linear combination of Wilson loops, a naive continuum expansion using $V_{x,\mu} = \exp(iaA_\mu(x))$ shows that the difference $N_x[V] - N_x[1]$ vanishes as $a \to 0$. The leading order term of the parametrized action is thus given by the prefactor action alone. Our chosen ansatz therefore guarantees the correct continuum behavior of the action.
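As a quick arithmetic check of the normalization condition $c^{(1)}_{\rm pl} + 8c^{(1)}_{\rm rt} + 8c^{(1)}_{\rm pg} + 16c^{(1)}_{\rm ch} = 1$, the sketch below evaluates it with exact fractions for the Symanzik improved coefficients quoted in the text. The function name is illustrative, the parallelogram and chair coefficients are set to zero for this example, and the Wilson value $c^{(1)}_{\rm pl} = 1$ is an assumption implied by the normalization rather than a number quoted in the text.

```python
from fractions import Fraction

def normalization(c_pl, c_rt, c_pg=Fraction(0), c_ch=Fraction(0)):
    """Left-hand side of the normalization condition for the linear coefficients."""
    return c_pl + 8 * c_rt + 8 * c_pg + 16 * c_ch

# Symanzik improved action: plaquette and rectangle only, values quoted in the text.
symanzik = normalization(Fraction(5, 3), Fraction(-1, 12))

# Wilson action: linear plaquette term only (coefficient assumed to be 1).
wilson = normalization(Fraction(1), Fraction(0))

print(symanzik, wilson)
```

Both choices satisfy the condition exactly: for the Symanzik case $5/3 - 8/12 = 1$.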
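The positivity requirement is realized if the prefactor action is positive everywhere and if the coefficients $b^{(n)}$ are chosen appropriately. For example, we may use $b^{(n)} = 1/n!$, which allows us to write the parametrized action as

$$A^{\text{L-CNN}}[V] = \sum_x A^{\rm pre}_x[V] \, \exp\left(N_x[V] - N_x[1]\right), \qquad (25)$$

which is positive for all gauge configurations. Another simple choice is to truncate at order $n = 1$:

$$A^{\text{L-CNN}}[V] = \sum_x A^{\rm pre}_x[V] \left(1 + N_x[V] - N_x[1]\right). \qquad (26)$$

We note, however, that this ansatz is not manifestly positive.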
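The difference between the two choices can be illustrated at a single lattice site. The sketch below assumes that the local prefactor density is multiplied either by the exponential of the regularized network output $N_x[V] - N_x[1]$ or by the series truncated at $n = 1$; the numerical values are hypothetical and serve only to show where the truncated form changes sign.

```python
import math


def combine_exponential(a_pre, n_reg):
    """Exponential combination: a_pre * exp(n_reg), positive whenever a_pre > 0."""
    return a_pre * math.exp(n_reg)


def combine_linear(a_pre, n_reg):
    """Series truncated at n = 1: a_pre * (1 + n_reg), negative if n_reg < -1."""
    return a_pre * (1.0 + n_reg)


a_pre = 0.8  # hypothetical positive prefactor density at one site
for n_reg in (0.0, 0.3, -0.5, -1.5):  # hypothetical values of N_x[V] - N_x[1]
    print(n_reg, combine_exponential(a_pre, n_reg), combine_linear(a_pre, n_reg))
```

Both forms reduce to the prefactor density when the regularized output vanishes, but only the exponential form stays positive when the network output drops below $-1$.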

C. Training
In the present context, we train the L-CNN using ensembles of gauge configurations $\{V_i\}$, for which the values of the fixed point action and associated derivatives have been obtained through minimization as in Eqs. (3) and (9).
The output of the L-CNN is the predicted action value $A^{\text{L-CNN}}[V_i]$. The predicted derivatives $D^{\text{L-CNN}}_{x,\mu}$ (analogous to Eq. (11)) are calculated exactly through backpropagation: instead of varying the output of the neural network with respect to the parameters of the model to minimize a loss function, we calculate the derivative of the network output with respect to the input variables, the gauge links. With this information, the loss function $L$ for the L-CNN is a weighted combination of a term $L_1$ built from the action values and a term $L_2$ built from the derivatives, where $N_{\rm cfg}$ is the number of configurations in the data set. The weights $w_{1,2}$ for the loss function are hyperparameters of the model. Typically, we use $w_1 = 0.1$ and $w_2 = 1$. The model is trained by minimizing $L$ using the AdamW optimizer. Note that $L_2$ contains the group derivatives $D^{\text{L-CNN}}_{x,\mu}$ of the model, which we compute by relating them to matrix-valued Wirtinger derivatives (see Appendix A for details). Unless stated otherwise, we use single precision for floating-point arithmetic during training and testing.
The data used to train and evaluate the network are SU(3) gauge ensembles on volumes $4^4$, $6^4$, and $8^4$, generated with the Wilson gauge action at bare gauge couplings $\beta_{\rm wil}$ ranging from 5.0 to 100.0, with denser spacing in $\beta_{\rm wil}$ at the strong-coupling end. Each member of these ensembles represents a possible coarse configuration $V$ in Eq. (3). The minimization procedure begins with a random fine starting configuration $U$ and a parametrization of $A^{\rm FP}[U]$ appropriate for smooth gauge links. Here, we use the APE444 parametrization (see Appendix C 2 for details). Minimizing $A^{\rm FP}[U] + T[U, V]$ by adaptively updating the links $U$ produces sets of fine configurations with matching volumes $8^4$, $12^4$ and $16^4$. Each ensemble consists of 200 saved configurations equally spaced along Markov chains of length $10^6$, and the ensembles are split into 80% training, 10% validation, and 10% test data.
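The bookkeeping of the data set can be made explicit with a short sketch. It assumes that the 80/10/10 split is applied to each ensemble of 200 configurations separately and that the fine lattice has twice the extent of the coarse one in every direction, as suggested by the quoted volumes; `split_counts` and `BLOCKING_FACTOR` are illustrative names.

```python
def split_counts(n_total, fractions=(0.8, 0.1, 0.1)):
    """Return (train, validation, test) sizes for the given fractions."""
    n_train = round(n_total * fractions[0])
    n_val = round(n_total * fractions[1])
    n_test = n_total - n_train - n_val  # remainder, so the sizes always add up
    return n_train, n_val, n_test


BLOCKING_FACTOR = 2  # assumed ratio of fine to coarse lattice extent
coarse_extents = (4, 6, 8)
fine_extents = tuple(BLOCKING_FACTOR * n for n in coarse_extents)

print(split_counts(200))  # configurations per ensemble in each subset
print(fine_extents)       # extents of the matching fine lattices
```

Under these assumptions each ensemble contributes 160 training, 20 validation and 20 test configurations.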

A. Architecture search
The flexibility of the L-CNN architecture allows for a large variation of the network hyperparameters, namely the number of bilinear convolution layers, the number of channels, and the kernel size for convolutions. To gain some insight as to the optimal choices for these hyperparameters, we train a large set of models on the same data set, gauge ensembles with lattice volume $4^4$ generated with the Wilson gauge action and bare coupling $\beta_{\rm wil}$ from 5.0 to 10.0, for which minimization was first done to find the corresponding values of the FP action and derivatives. In the L-CNN models, we use the local Wilson action density as the prefactor $A^{\rm pre}_x[V]$. Details about the various architectures are shown in Table I, where we list the number of bilinear convolutional layers and their associated kernel sizes and output channels. We also provide the number of trainable parameters. As detailed at the end of Section IV A, the number of parameters for each bilinear convolution grows quickly with the number of channels, the kernel size, and the number of dimensions. For each of the thirteen architectures listed in the table, we use both Eqs. (25) and (26) and train each variant five times using random initial weights. This amounts to a total of $13 \times 2 \times 5 = 130$ unique models. We show a summary of the hyperparameter scan in Fig. 7, with 130 L-CNN models used to estimate the distributions, examining the accuracy in predicting the FP action value and the FP derivatives.

To compare a variety of models, we study their performance in terms of the model depth, the model width, and the size of the receptive field. The depth of the model is determined by the number of layers, while the width is related to the number of channels in each layer. As a simple measure of the model width, we take the sum of the number of channels in each layer. The size of the receptive field, which limits the locality of the action, is approximated by the sum of the kernel sizes of all layers. A general trend is clear: increasing the depth, width, or receptive field reduces both the action and derivative errors, as one might expect. There is a firm indication that L-CNN models with three layers, a cumulative kernel size of five, and a cumulative number of channels of approximately 60 are highly accurate, predicting the FP action with an error well below 1%. Although not explicitly shown in Fig. 7, we remark that we find little difference in the choice of function that is used to combine the prefactor action with the regularized L-CNN: both the exponential and linear functions in Eqs. (25) and (26) show virtually the same performance across all tested architectures. Since the exponential function is manifestly positive and thus more likely to produce strictly positive parametrizations, we deem it the more suitable choice for further studies. We note that results for models with up to three layers were obtained using 400 training epochs, whereas models with four layers required 1000 epochs for convergence. During the training phase of models with four layers, we encountered a single outlier, which did not converge.
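The three size measures used in the scan are simple sums and can be sketched as follows. The helper `architecture_summary` is an illustrative name; the example uses the kernel sizes {2, 2, 1} and output channels {12, 24, 24} of the architecture adopted later in the text.

```python
def architecture_summary(kernel_sizes, channels):
    """Depth, width and approximate receptive field of a stack of layers.

    Width is the sum of the channel counts and the receptive field is
    approximated by the sum of the kernel sizes, as defined in the text.
    """
    if len(kernel_sizes) != len(channels):
        raise ValueError("need one kernel size and one channel count per layer")
    return {
        "depth": len(kernel_sizes),
        "width": sum(channels),
        "receptive_field": sum(kernel_sizes),
    }


summary = architecture_summary(kernel_sizes=(2, 2, 1), channels=(12, 24, 24))
print(summary)

# Size of the scan: 13 architectures, 2 combination functions, 5 random seeds.
n_models = 13 * 2 * 5
print(n_models)
```

This architecture has depth 3, width 60 and a cumulative kernel size of 5, which matches the region of the scan identified as highly accurate.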
The broad scan allows us to narrow the search for the optimal L-CNN, for which training can be extended to a larger number of epochs to ensure convergence. We can also draw on previous studies of the FP action for SU(3) gauge theory, where the accuracy of those parametrizations provides a baseline. The older study [24] used the ansatz of Eq. (20), with plaquette, rectangle and parallelogram loops and powers up to $M = 4$, with the coefficients $c^{(m)}_C$ determined through $\chi^2$ minimization. Borrowing their nomenclature, we refer to this parametrization as IIIc-4 in the figures. The later study [25] used a very different ansatz, as in Eq. (8), with powers of plaquettes of original and smeared gauge links, with the smearing sensitive to the local fluctuations of the gauge links. This yielded two parametrizations, one designed to be accurate on smooth gauge configurations close to the continuum (denoted APE444 in our figures) and a second to be used on rough configurations with a lattice spacing as large as 0.35 fm (referred to as APE431); cf. Appendix C 2 for details.
Motivated by the hyperparameter scan, we decide on training architectures with three bilinear convolutional layers, using kernel sizes {2, 2, 1} and output channels {12, 24, 24}, respectively. To improve the behavior in the continuum limit $\beta_{\rm wil} \to \infty$, we opt for a prefactor action of type IIIc-4 and extend the range of training data to $5.0 \leq \beta_{\rm wil} \leq 20.0$ on $4^4$ lattices. Furthermore, we allow the parameters of the prefactor in Eq. (20) to be adjusted during training while ensuring that the normalization and Symanzik conditions remain satisfied. Instead of using random initialization, we set the coefficients $c^{(m)}_C$ to the values originally found in [24], cf. Appendix C 1 for details. Thus, both the prefactor coefficients and the weights of the L-CNN are optimized during training. We train these models using multiple random initializations for 800 epochs. Later on, we will employ finetuning to further improve our models, as detailed in Sec. V E. Figures 8 to 12, which we discuss in detail in the following section, are produced with our best model found through this training procedure, including finetuning on instantons.

B. Detailed results
In Fig. 8 we compare the older FP parametrizations with the best L-CNN model on gauge ensembles with $\beta_{\rm wil}$ ranging from 5.0 to 20.0. We see that the L-CNN clearly outperforms the previous parametrizations across this range, with its predicted action value and derivatives much closer to the ground truth FP values. (The absolute value of the relative action error is plotted, as used in the loss function.) Even on much smoother gauge ensembles at $\beta_{\rm wil} = 20.0$, the range with small fluctuations for which APE444 was designed, the L-CNN model is superior in predicting the action and derivatives. Overall, the L-CNN performs well across the entire range from coarse to fine lattice spacing.

FIG. 7. Results of training an ensemble of 130 models, ranging from small to large architectures, on lattice volumes of size $4^4$ with $\beta_{\rm wil}$ ranging from 5.0 to 10.0. Test data consisted of the same lattice size and $\beta_{\rm wil}$ range. All models use the Wilson action as a prefactor action. We show box plots of the relative errors and derivative errors averaged over all test data. The thick central lines show the median error. The box extends from the 25% to the 75% quantile and the whiskers denote the 0% (minimum) and 95% quantiles (to remove outliers). The left panels show the dependence of the errors on the model depth, i.e., the number of bilinear convolutional layers. The middle panels show the dependence on the model width given by the sum of channels across all layers. The right panels show the dependence on the size of the receptive field of the models, which we approximate by the sum of kernel sizes in each layer. We observe that larger models (more layers, more channels, larger receptive field) typically lead to better approximations of the data.
To highlight the superiority of the trained network, we show in Fig. 9 the difference between predicted and actual FP action values for APE431 (designed for coarse lattices) and the L-CNN in the range $5.0 \leq \beta_{\rm wil} \leq 7.0$. The difference changes sign for APE431 as we scan across the bare coupling, whereas the L-CNN model gives a visibly much more accurate prediction. The effect of the model can be drawn out, as shown in Fig. 10, through the ratio of the L-CNN output $A^{\text{L-CNN}}[V]$ to the prefactor $A^{\rm pre}[V]$, which varies by up to $\sim 30\%$ on the coarsest gauge ensembles and approaches 1 in the continuum limit $\beta_{\rm wil} \to \infty$.
Because the FP derivatives represent a volume-sized amount of information for each gauge configuration, the distributions of the error $D^{\rm FP}_{x,\mu,a}[V] - D^{\rm model}_{x,\mu,a}[V]$ are an additional probe of the accuracy of each model used for parametrization. As shown in Fig. 11, the distributions narrow with reduced error going to finer lattices, with all parametrizations becoming more accurate. The L-CNN model has the sharpest distributions of all across all bare couplings, even at $\beta_{\rm wil} = 20.0$, the range for which the APE444 parametrization was designed. The superiority of the L-CNN model at $\beta_{\rm wil} = 6.0$ is particularly interesting, as this corresponds to a lattice spacing $a \sim 0.1$ fm, in the range of the coarsest lattice spacings used in current large-scale simulations.
According to arguments of universality, locality of the discretized theory guarantees the correct continuum limit. While the exact FP action has infinitely many couplings, it is still a local action because the couplings decrease exponentially with the separation $r = |x - y|$ of the gauge links at positions $x$ and $y$, as shown for the perturbative couplings $\rho_{\mu\nu}(r)$ in Fig. 2. To test if the optimal L-CNN model shares this feature beyond the perturbative regime, we look at a quantity analogous to the perturbative coupling, namely the variation of the action $A^{\text{L-CNN}}[V]$ with respect to gauge links at locations $x$ and $y$ and in directions $\mu$ and $\nu$. As described in more detail in Appendix B, a gauge invariant observable $\rho_{\mu\nu}(r)$ can be built from the square of this second-order derivative. The behavior of this coupling for the L-CNN model is shown in Fig. 12, measured on $6^4$ volumes at $\beta_{\rm wil} = 6.0$ and normalized by $\rho_{00}(0)$. The couplings do indeed decrease rapidly with separation, with a relative change of $10^{-5}$ by separation $r/a = 4$. From this, we deduce that the L-CNN does not significantly couple fields at large separation and that the finite extent of the model does not lead to poor accuracy. We note that the numerical evaluation of the locality measure requires double-precision arithmetic to resolve the small couplings at large distances.
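As a rough worked example, the quoted suppression can be converted into an effective decay length. This is a sketch only, assuming a single exponential falloff $\rho(r)/\rho(0) = \exp(-r/\xi)$ over the whole range, which is a simplification; `decay_length` is an illustrative helper.

```python
import math


def decay_length(relative_coupling, separation):
    """Decay length xi such that exp(-separation / xi) equals relative_coupling."""
    if not 0.0 < relative_coupling < 1.0:
        raise ValueError("relative coupling must lie strictly between 0 and 1")
    return -separation / math.log(relative_coupling)


# Relative coupling of 1e-5 at r/a = 4, as quoted in the text.
xi = decay_length(1e-5, 4.0)
print(f"effective decay length: {xi:.3f} lattice units")
```

Under this assumption the couplings fall off with a decay length of roughly a third of a lattice spacing.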

C. Restricted training ranges
We also investigate how the selection of training data affects the ability of the L-CNN to make accurate predictions. To do so, we split the training data into low values $\beta_{\rm wil} \in [5, 7]$ and high values $\beta_{\rm wil} \in [7, 20]$, and train multiple models with random initializations on the low, high and original $\beta_{\rm wil}$ ranges. The results are shown in Fig. 13 and Table II. We find that each model generally performs well on the data it has been trained on. Surprisingly, the model trained on the full range performs best on coarse configurations. This might be due to the fact that this full model has been trained with the most data. On the other hand, the high model works best on high $\beta$ values. Comparing the low and high models, these results might suggest a lack of generalization of our models to data outside the original training range. Models only trained on very coarse configurations tend to make less accurate predictions for smooth configurations and vice versa. In a sense, this suggests that despite overall good performance, the L-CNN does not truly learn the FP action that underlies the training data. However, it is unlikely that this would hinder the practical use of an FP parametrization based on L-CNNs or that this is a problem affecting only L-CNN models. Similarly, parametrizations based on simple loops as in Eq. (20) and even more sophisticated approaches using asymmetrically smeared links such as APE431 and APE444 require data from a large range of $\beta_{\rm wil}$ in order to determine suitable coefficients with good accuracy for both coarse and smooth configurations. From a practical viewpoint, especially concerning the use of FP parametrizations in a Monte Carlo simulation, it might not even be necessary to have a model that generalizes to all values of $\beta_{\rm wil}$. If one intends to perform a simulation at a particular $\beta$, it is sufficient to use a parametrization that works well at a specific level of coarseness. Much coarser and much smoother configurations are both unlikely to occur during the simulation, and thus less than optimal performance outside a particular $\beta$-range does not pose a problem in practice. We also stress that the L-CNN models approach the continuum limit by construction, i.e., for sufficiently smooth fields our models reproduce the Yang-Mills action.

FIG. 10. Ratio of our best L-CNN model $A^{\text{L-CNN}}[U]$ and its associated prefactor action $A^{\rm pre}[U]$ (in this case, a learned IIIc action) as a function of $\beta_{\rm wil}$ on a $4^4$ lattice. By construction, the L-CNN model approaches the prefactor in the limit of smooth configurations $\beta_{\rm wil} \to \infty$.

D. Finetuning for different lattice sizes
Up until now, we have only considered models trained and tested on $4^4$ lattices. We apply transfer learning to our best type IIIc L-CNN model obtained in the last section (before finetuning on instantons) by additional training with data from larger lattices. Specifically, we finetune on $6^4$ and $8^4$ in the range $\beta_{\rm wil} \in [5, 20]$ for 400 epochs, starting from our previous best model. For better comparison, we also finetune our previous best model on $4^4$ with the same number of epochs. Through experimentation, we found that it is beneficial to change the loss function during this finetuning procedure. In contrast to Eq. (27), which optimizes absolute errors, we use a modified loss function with weights $w'_1$ and $w'_2$, with typical choices $w'_1 = w'_2 = 1$. The performance of these three finetuned models can then be compared across different lattice sizes. The results are summarized in Table III, where we list the relative error measured by the action values and the gauge invariant derivative error for each lattice size. Remarkably, we find that the performance improves only slightly with additional transfer learning and is consistent for all considered lattice sizes. This suggests that training on small lattice sizes is sufficient to obtain FP parametrizations with high accuracy, which generalize beyond the original training data in terms of lattice size. This is highly advantageous because training on small lattices is much more efficient: a typical model trained for 100 epochs requires approximately 4 hours on $4^4$, 7 hours on $6^4$, and 22 hours on $8^4$ on an NVIDIA RTX 3090 GPU.

FIG. 13. Effect of data selection on trained models. We show the average relative error on $4^4$ lattices of three different models for each MC ensemble from $\beta_{\rm wil} = 5$ to $\beta_{\rm wil} = 20$. The models have been trained on different data: low ($\beta_{\rm wil} \in [5, 7]$, light blue region), high ($\beta_{\rm wil} \in [7, 20]$, light orange region), and the full range ($\beta_{\rm wil} \in [5, 20]$). The averages across all $\beta_{\rm wil}$ are reported in Table II.
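The quoted training times can be set against the growth of the lattice volume with a few lines of arithmetic. The sketch below uses only the reported hours per 100 epochs; the helper names are illustrative.

```python
# Reported wall-clock hours per 100 epochs, keyed by lattice extent.
hours_per_100_epochs = {4: 4.0, 6: 7.0, 8: 22.0}


def lattice_sites(extent, dim=4):
    """Number of sites of a hypercubic lattice with the given extent."""
    return extent ** dim


def relative_to_smallest(extent, reference=4):
    """Return (volume ratio, time ratio) relative to the reference lattice."""
    volume_ratio = lattice_sites(extent) / lattice_sites(reference)
    time_ratio = hours_per_100_epochs[extent] / hours_per_100_epochs[reference]
    return volume_ratio, time_ratio


for extent in (6, 8):
    print(extent, relative_to_smallest(extent))
```

Going from $4^4$ to $8^4$ multiplies the number of sites by 16 and the reported training time by 5.5, so the smaller lattice remains considerably cheaper per epoch.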

E. Finetuning with instantons
We have seen that the performance of a trained model strongly depends on the properties of the training configurations. The largest effect stems from the coarseness of equilibrated configurations, controlled by $\beta_{\rm wil}$, as demonstrated in Sec. V C. We may extend our training procedure to also include nonequilibrium configurations, for example instanton solutions. One might expect that an L-CNN trained to sub-percent accuracy within $\beta_{\rm wil} \in [5, 20]$ would also produce similarly accurate predictions for instantons, but we find that this is not necessarily the case. If instantons are absent during training, then predictions for their FP action values appear to be mostly determined by the prefactor action $A^{\rm pre}[V]$. In the case of the IIIc-4 prefactor action, we find a relative error of $\sim 10\%$ for instanton radii between $\rho/a = 0.5$ and $\rho/a = 1.5$.
Predictions for instantons can be drastically improved by including them as training configurations in the finetuning procedure. Starting from our best $4^4$ model found in Sec. V D, we extend the training data set from equilibrated configurations within $\beta_{\rm wil} \in [5, 20]$ to include 20 different instantons and perform transfer learning with a reduced learning rate and increased batch size for 1000 additional epochs with $w'_1 = 1$ and $w'_2 = 0.1$. This parameter choice puts more weight on accurate action values at the cost of slightly less accurate predictions for derivatives. To avoid data imbalance, the instantons are included multiple times such that we obtain effectively 200 training instantons.
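The repetition of the instanton configurations amounts to simple oversampling. The following sketch assumes that each of the 20 instantons is repeated equally often to reach 200 entries; the labels and the helper `oversample` are hypothetical stand-ins for the actual configurations and data pipeline.

```python
from collections import Counter


def oversample(items, target_size):
    """Repeat the items cyclically until the list has target_size entries."""
    repeats, remainder = divmod(target_size, len(items))
    return items * repeats + items[:remainder]


# Hypothetical labels standing in for the 20 instanton configurations.
instantons = [f"instanton_{k:02d}" for k in range(20)]
train_instantons = oversample(instantons, 200)

counts = Counter(train_instantons)
print(len(train_instantons), set(counts.values()))
```

With 20 distinct instantons and a target of 200, every configuration appears exactly ten times.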
We test our finetuned model on instantons of various sizes. The results are shown in Fig. 14, where we plot the predicted action as a function of the instanton radius. We see that our model predicts the numerical FP data much better than the IIIc-4 action and even the APE431 action. The predictions closely follow the FP data, except for the kink around $\rho/a = 0.85$. Moreover, we find that our finetuning procedure does not lead to a loss of performance on equilibrated configurations. Our finetuned model has a relative error of 0.12% and a gauge-invariant derivative error $L_2 = 8.731 \times 10^{-2}$ within $\beta_{\rm wil} \in [5, 20]$. We note that this finetuned model is the one presented in Sec. V B.

F. Approximate lattice symmetries
Finally, we may check the trained model for discrete lattice symmetries. The L-CNN used in this work is, by construction, equivariant with respect to lattice translations. As a result, if $U$ and $U'^{(\rm shift)}$ are two gauge configurations which are the same up to a shift on the lattice, then the predictions will agree exactly:

$$A^{\text{L-CNN}}[U'^{(\rm shift)}] = A^{\text{L-CNN}}[U].$$

On the other hand, other lattice symmetries such as rotations and reflections are not implemented exactly. A rotated gauge configuration $U'^{(\rm rot)}$ is generally assigned a different action value,

$$A^{\text{L-CNN}}[U'^{(\rm rot)}] \neq A^{\text{L-CNN}}[U].$$

In principle, the L-CNN architecture can be extended to include such discrete lattice symmetries exactly [48], but only at considerable computational cost. Thus, with the goal in mind to use the trained model in a future Monte Carlo study, we only consider the more efficient translationally equivariant L-CNNs and test symmetry properties after training.
For rotational invariance, we consider all $90^\circ$ rotations about a single origin on the lattice. Taking into account both clockwise and counter-clockwise rotations, these amount to $D(D - 1)$ transformations in $D$ lattice dimensions. The choice of origin is arbitrary due to translational equivariance. Given a particular gauge configuration $U^{(0)}$ from the test set, we generate the set of rotated configurations $U^{(j)}$ with $j \in \{1, 2, \ldots, D(D - 1)\}$. For each of these configurations, we compute the predicted action $A^{(j)} = A^{\text{L-CNN}}[U^{(j)}]$. We then define the relative error due to broken rotational invariance as the standard deviation of the set $\{A^{(j)}\}$ normalized to the mean value. A similar measure can be defined for reflections along lattice axes.
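The counting of rotations and the symmetry-breaking measure can be sketched in a few lines. The example assumes the population standard deviation (the text does not specify which estimator is used), and the action values are hypothetical numbers, not results from the trained model.

```python
from itertools import combinations
from statistics import mean, pstdev


def rotation_labels(dim=4):
    """All 90-degree rotations: one per coordinate plane and rotation sense."""
    return [(mu, nu, sense)
            for mu, nu in combinations(range(dim), 2)
            for sense in (+1, -1)]


def relative_spread(actions):
    """Standard deviation of the predicted actions normalized to their mean."""
    return pstdev(actions) / mean(actions)


# Hypothetical predicted actions A^(j) for the 12 rotated copies of one configuration.
actions = [100.02, 99.97, 100.05, 99.99, 100.01, 99.96,
           100.03, 100.00, 99.98, 100.04, 99.95, 100.00]

print(len(rotation_labels(4)))  # D(D - 1) rotations for D = 4
print(relative_spread(actions))
```

For $D = 4$ there are six coordinate planes and two rotation senses each, giving the $D(D - 1) = 12$ transformations over which the spread is evaluated.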
We present our results in Fig. 15, where we evaluate the measures for broken symmetry on equilibrated configurations on a $4^4$ lattice from $\beta_{\rm wil} = 5.0$ to 8.0. We observe that the variance between predictions due to symmetry transformations (either rotations or reflections) is much smaller than the prediction error for coarse configurations ($\beta_{\rm wil} < 6$). For smoother configurations, the errors become comparable. Overall, we conclude that sufficiently well-trained models exhibit approximate rotation and reflection symmetry. These symmetries are a priori not present in the L-CNN architecture and have been learned during training.

VI. CONCLUSIONS AND OUTLOOK
In the challenge of pushing lattice QCD calculations to ever higher accuracy, one faces the imminent problems of critical slowing down and topological freezing. In this paper, we propose to overcome these obstacles by using highly improved gauge actions, such that simulations at coarse lattice spacing are possible, where critical slowing down and topological freezing are avoided, while keeping lattice artifacts under sufficient control to take a reliable continuum limit. In order to do so, we follow the FP action approach [18] based on the properties of RG transformations. FP actions are lattice discretizations which have no lattice artifacts at the classical level. They are also expected to have suppressed lattice artifacts at the quantum level. Parametrizing the complicated FP actions has been a major challenge in the past [25], and in this paper we address it by employing a gauge equivariant convolutional neural network (L-CNN) [30] and ML techniques. Studying the quality and improvement achieved with the new parametrization is the first conceptual step in the construction of highly improved RG actions.
The main observation in this paper is that trained L-CNN models are able to achieve much higher accuracy than previous parametrizations of the FP action over a larger range of lattice spacings and corresponding gauge field fluctuations. This is not surprising, given the flexibility and large number of parameters of the neural network models. It is particularly encouraging that the L-CNN accuracy varies little in the range of coarsest lattice spacings, as shown in Fig. 9. The baseline FP parametrization APE431 was previously used in Monte Carlo studies of the deconfinement phase transition, the static quark-antiquark potential, and the glueball mass spectrum [25], with the promising result that the parametrized FP action had very small lattice artifacts in these physical observables up to lattice spacings as coarse as $a \sim 0.33$ fm, at the level of accuracy feasible at that time. In the same spirit, the ultimate test of the L-CNN parametrization of the FP action will of course be its performance in actual Monte Carlo simulations, and conducting state-of-the-art scaling tests on coarse lattices is therefore the crucial next step in our attempt to construct highly improved RG actions.
In the course of this project, we gained valuable experience in determining derivatives of the FP action with respect to gauge links through backpropagation. This method opens up interesting possibilities for simulation methods based on derivatives such as HMC algorithms [49] or Langevin dynamics [50][51][52]. Both simulation strategies use the variation of the action with respect to the gauge fields. Since the derivatives can be calculated efficiently within the L-CNN architecture and constitute the major part of the data used for the learning of the L-CNN, this aids the feasibility of large-scale simulations.
A more ambitious and difficult goal, the holy grail of RG improvement, is to find the exact RG trajectory as in Fig. 1, for which cutoff artifacts are eliminated completely. Such a quantum perfect action (or RG action) would in principle allow one to extract continuum physics from simulations at one (coarse) lattice spacing. A procedure to determine the RG action using field derivatives is outlined in [18]. Constructing and parametrizing the RG action is another challenge which can be embraced using the L-CNN and ML techniques. The FP action constructed in this paper is a crucial and necessary first step in that direction.
Finally, in the context of FP actions the inclusion of FP fermions is natural and leads to the realization of chiral symmetry and an exact index theorem at finite lattice spacing [53][54][55][56][57][58]. The parametrization of the FP fermion action is another delicate problem [59] which could be tackled by L-CNNs and ML.

Asymmetrically smeared link parametrization
The parametrizations in Ref. [25] use powers of traces of plaquette loops built from single gauge links as well as from asymmetrically smeared gauge links. The ansatz is very general and flexible, as it allows more complex loops with corresponding couplings to be introduced without much difficulty. In the following, we recall the explicit construction of these earlier parametrizations, denoted in this paper by APE431 and APE444, and provide their precise definitions.

(C10)
From these asymmetrically smeared links, we now construct a smeared plaquette variable $w_{x,\mu\nu}$, Eq. (C11).

The labeling of the parametrizations is as follows. Denoting the maximal total power of the smeared and unsmeared plaquettes by $\max(k + l) = K$, the number of nonvanishing functions $c_i(q_\mu)$ in Eq. (C8) by $L$, and the order of the expansions of $\eta(q_\mu)$ and $c_i(q_\mu)$ in Eqs. (C9) and (C10) by $M$, the FP parametrization is denoted by APE$KLM$.
The parameters of the parametrized FP actions employed in our work are tabulated in Tables V and VI for the APE431 and APE444 actions, respectively. In Table VII we additionally provide the parameters of the APE121 action, which was used in earlier works as the starting point for the RG iterations. It is based on the couplings of the FP action in the quadratic approximation [19,24], which are fitted by the leading nonlinear parameters $\eta^{(0)}$, $c_2$ and $p_{10}$, $p_{01}$, while explicitly fulfilling the Symanzik "on-shell" conditions to $O(a^2)$. We note that the APE444 action maintains these coefficients, while for the APE431 action the $O(a^2)$ Symanzik condition was relaxed.
which when expanded can be seen to generate a large number of paths connecting the two points.Note that the blocked links should not be confused with the smeared links of the APE parametrization (see Appendix C 2).
The four parameters $s_i$, $i = 0, \ldots, 3$, are subject to the constraint

$$s_0 + 6 s_1 + 12 s_2 + 8 s_3 = 1, \qquad (D5)$$

which ensures that for a trivial field configuration $Q_{n_B,\mu}$ is equal to the unit matrix. (Note, however, that the smeared links $W_\mu$ are in general no longer in the group SU($N_c$).) Together with the quantity $\kappa$ in the blocking kernel $T[U, V]$, they are free parameters which can be varied for optimization. In [24] it was shown that the values $\kappa = 8.8$, $s_1 = 0.07$, $s_2 = 0.016$, $s_3 = 0.008$, with $s_0$ set via the constraint, are an optimal choice to maximize the exponential decrease of the couplings with distance in the perturbative FP action, as shown in Fig. 2.
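The value of $s_0$ implied by the constraint $s_0 + 6 s_1 + 12 s_2 + 8 s_3 = 1$ follows directly from the quoted parameters. A minimal worked example, with `s0_from_constraint` as an illustrative helper name:

```python
def s0_from_constraint(s1, s2, s3):
    """Solve s0 + 6 s1 + 12 s2 + 8 s3 = 1 for s0."""
    return 1.0 - 6.0 * s1 - 12.0 * s2 - 8.0 * s3


# Blocking parameters quoted from [24].
s1, s2, s3 = 0.07, 0.016, 0.008
s0 = s0_from_constraint(s1, s2, s3)

print(round(s0, 6))
print(round(s0 + 6 * s1 + 12 * s2 + 8 * s3, 12))  # check of the constraint
```

The quoted values give $s_0 = 1 - 0.42 - 0.192 - 0.064 = 0.324$.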

Appendix E: Analytic instanton solutions
We give here details of the analytic instanton solutions used to generate part of the FP training data. We start with an analytic instanton solution with radius $\rho$ centered at $x = 0$ [60],

$$U_{x,4} = \cos(f_4(x)) - i\,\frac{x_i \sigma_i}{\sqrt{x^2 - x_4^2}}\,\sin(f_4(x)), \qquad (E1)$$

$$U_{x,i} = \cos(f_i(x)) + i\,\frac{x_4 \sigma_i - \epsilon_{ijk} x_j \sigma_k}{\sqrt{x^2 - x_i^2}}\,\sin(f_i(x)), \qquad i = 1, 2, 3,$$

the center of which should be shifted to a location $x_c$ that is not a lattice site. Here, $\sigma_i$ refers to the $i$-th Pauli matrix. To be consistent with periodic boundary conditions, a dislocation is inserted through a singular gauge transformation. These fine solutions $U$ are RG blocked to produce coarse configurations $V$, for which the FP action is given by the minimization of Eq. (3).
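The structure of these links can be checked numerically: with the normalizations $\sqrt{x^2 - x_4^2}$ and $\sqrt{x^2 - x_i^2}$, each link has the form $\cos f + i \sin f \, (n \cdot \sigma)$ with a unit vector $n$ and is therefore an SU(2) matrix. The sketch below assumes this form and treats the profile functions $f_\mu(x)$, which are not specified in this section, as an arbitrary angle; all function names are illustrative.

```python
import math


def levi_civita(i, j, k):
    """Totally antisymmetric symbol for indices in {0, 1, 2}."""
    return (i - j) * (j - k) * (k - i) // 2


def su2_from_axis_angle(n, f):
    """Return cos(f) * 1 + i sin(f) * (n . sigma) as a nested 2x2 list."""
    n1, n2, n3 = n
    c, s = math.cos(f), math.sin(f)
    return [[c + 1j * s * n3, 1j * s * (n1 - 1j * n2)],
            [1j * s * (n1 + 1j * n2), c - 1j * s * n3]]


def link_time(x, f):
    """Temporal link: axis -x_i / sqrt(x^2 - x_4^2), with x = (x1, x2, x3, x4)."""
    spatial = x[:3]
    norm = math.sqrt(sum(c * c for c in spatial))
    return su2_from_axis_angle([-c / norm for c in spatial], f)


def link_space(x, i, f):
    """Spatial link i: axis (x_4 delta_ik - eps_ijk x_j) / sqrt(x^2 - x_i^2)."""
    spatial, x4 = x[:3], x[3]
    axis = [x4 * (1 if k == i else 0)
            - sum(levi_civita(i, j, k) * spatial[j] for j in range(3))
            for k in range(3)]
    norm = math.sqrt(sum(c * c for c in x) - spatial[i] ** 2)
    return su2_from_axis_angle([a / norm for a in axis], f)


def deviation_from_su2(u):
    """Largest deviation of U U^dagger from the unit matrix and of det U from 1."""
    (a, b), (c, d) = u
    diag1 = a * a.conjugate() + b * b.conjugate()
    offdiag = a * c.conjugate() + b * d.conjugate()
    diag2 = c * c.conjugate() + d * d.conjugate()
    det = a * d - b * c
    return max(abs(diag1 - 1), abs(offdiag), abs(diag2 - 1), abs(det - 1))


x = (0.5, -1.5, 2.5, 0.5)  # hypothetical position relative to the instanton center
f = 0.7                    # hypothetical value of the profile function
links = [link_space(x, i, f) for i in range(3)] + [link_time(x, f)]
print(max(deviation_from_su2(u) for u in links))
```

The printed deviation is at the level of floating-point rounding, which confirms that the square-root normalization makes the rotation axis a unit vector.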

FIG. 1. A sketch of the renormalization group flow and the renormalized trajectory (RT) in the infinite-dimensional coupling space, with the gauge coupling as the only relevant direction. The fixed point is on the critical surface $\beta \to \infty$ where $\xi/a = \infty$ for any physical scale $\xi$, with the values of the critical couplings $c^{\rm FP}_n$ determined by the specific form of the RG blocking. The FP action uses the same coupling values at finite $\beta$, tracking the RT closely at sufficiently weak coupling.

FIG. 3. Fixed point action density as a function of $\beta_{\rm wil}$ on a $4^4$ lattice. We show the ensemble-averaged FP action $A^{\rm FP}[V]$, the blocking kernel $T[U, V]$, and the parametrized FP action $A^{\rm FP}[U]$ used on the right-hand side of Eq. (3), normalized to the coarse lattice volume. The mean values are obtained by averaging over the ensemble at a given $\beta_{\rm wil}$. The shaded regions indicate the standard deviation.

FIG. 8. Comparison of different parametrizations of the FP action evaluated on MC ensembles from $\beta_{\rm wil} = 5.0$ to $\beta_{\rm wil} = 20.0$ on $4^4$ lattice volumes. The errors of a particular parametrization are defined as the deviations from the numerical fixed point data. The top panel shows the relative error $L_1$ computed from action values. The bottom panel shows the gauge invariant derivative error $L_2$. The error bars are given by the standard deviation within each ensemble.
FIG. 9. Comparison of different parametrizations of the FP action evaluated on MC ensembles from $\beta_{\rm wil} = 5.0$ to $\beta_{\rm wil} = 7.0$ on a $4^4$ lattice. We plot the relative linear deviations from the numerical fixed point action data for our best L-CNN model and the APE431 parametrization.

FIG. 11. Histograms of the local deviations $\Delta_{x,\mu,a} = D^{\rm FP}_{x,\mu,a} - D^{\rm model}_{x,\mu,a}$ of the model derivative $D^{\rm model}_{x,\mu,a}$ and the derivative of the fixed point action $D^{\rm FP}_{x,\mu,a}$. Here, the model can refer to an L-CNN or a different parametrization of the FP action. We show the same parametrizations as in Fig. 8 for specific values of $\beta_{\rm wil}$. The horizontal axes have been rescaled by the standard deviation of $D^{\rm FP}_{x,\mu,a}$.
FIG. 12. Locality measure $\langle \rho_{\mu\nu}(r) \rangle$ as a function of distance $|r|$ of our best L-CNN model. The expectation value has been evaluated on five configurations at $\beta_{\rm wil} = 6.0$ on a $6^4$ lattice. Blue crosses show parallel couplings $\rho_{\mu\mu}$, whereas orange circles correspond to orthogonal couplings $\rho_{\mu\nu}$ with $\mu \neq \nu$. An exponential fit is shown as a red dashed line. Couplings beyond $r_{\rm max} \approx 4.3\,a$ are zero due to the finite receptive field of the L-CNN.

FIG. 14. Evaluating different parametrizations of the FP action on instanton configurations with radii $\rho/a$ on an $8^4$ lattice. The black points show the numerical fixed point data. The L-CNN model is a type IIIc-inspired action with Symanzik-constrained trainable parameters. As detailed in the main text, it has been finetuned on instanton configurations.

FIG. 15. Relative error due to breaking of rotational and reflection symmetry as a function of $\beta_{\rm wil}$. We also show the relative prediction error for comparison (black dots).
x,µ along the closed path C. In this paper we use the IIIc-4 action, which includes the plaquette (pl), the rectangle (rt) and the parallelogram (pg) loops with $M = 4$. The parameters $c^{(m)}_C$ of this parametrization are given in Table IV. Note that the coefficients $c^{(1)}_C$ fulfill the tree-level Symanzik condition for spectral quantities.

$$w_{x,\mu\nu} = {\rm Re\,Tr}\left(1 - W^{\rm pl}_{x,\mu\nu}\right), \qquad (C11)$$

and the ordinary Wilson plaquette variable

$$u_{x,\mu\nu} = {\rm Re\,Tr}\left(1 - U^{\rm pl}_{x,\mu\nu}\right), \qquad U^{\rm pl}_{x,\mu\nu} = U_{x,\mu} U_{x+\hat\mu,\nu} U^\dagger_{x+\hat\nu,\mu} U^\dagger_{x,\nu}. \qquad (C14)$$

The parametrized action then has the form

$$A[U] = \frac{1}{N_c} \sum_x \sum_{\mu<\nu} f(u_{x,\mu\nu}, w_{x,\mu\nu}), \qquad (C15)$$

where $f$ is a function of both plaquette variables,

$$f(u, w) = \sum_{kl} p_{kl} u^k w^l = p_{10} u + p_{01} w + p_{20} u^2 + p_{11} u w + p_{02} w^2 + \ldots, \qquad (C16)$$

with the coefficients $p_{kl}$ being free parameters.

TABLE I. Architecture details of the hyperparameter scan. All architectures use the Wilson action density as a prefactor action and use clover leaf plaquettes (24 input channels). After the last convolution, we take the real part of the trace and use a final linear layer to map the remaining channels to a single real number.
FIG. 7. Results of training an ensemble of 130 models, ranging from small to large architectures, on lattice volume of size $4^4$

TABLE II. Effect of training data selection on model performance. The left column denotes the range of $\beta_{\rm wil}$ used for evaluation, whereas the first row shows the range for training. We report the relative error of the predicted action with respect to numerical FP data, averaged over all configurations within the respective $\beta_{\rm wil}$ range. The smallest error in each column is highlighted in bold. It is apparent that the model performance strongly depends on the training range and that there is a trade-off between accuracy on particular ensembles and generality across many ensembles.

TABLE III. Effect of transfer learning with different lattice sizes. Starting from our previous best model, we use transfer learning to obtain models that have been finetuned to $4^4$, $6^4$, and $8^4$ data. The left column denotes three different models and we report the relative error and derivative error on various lattice sizes for $\beta_{\rm wil} \in [5, 20]$. The smallest errors in each column are highlighted in bold. The lattice size appears to have a negligible effect on model performance.

... $\times 10^{-2}$   $8.19 \times 10^{-2}$   $8.22 \times 10^{-2}$
$6^4$   $7.39 \times 10^{-2}$   $7.93 \times 10^{-2}$   $7.96 \times 10^{-2}$
$8^4$   $7.36 \times 10^{-2}$   $7.91 \times 10^{-2}$   $7.93 \times 10^{-2}$

... necessarily the case. If instantons are absent during training, then predictions for their FP action values appear to be mostly determined by the prefactor action $A_{\rm pre}[V]$.

TABLE IV. Parameters of the approximate FP action denoted by IIIc-4.

TABLE VI. Parameters of the approximate FP action denoted by APE444.

TABLE VII. Parameters of the approximate FP action denoted by APE121.