A Deep Learning algorithm to accelerate Algebraic Multigrid methods in Finite Element solvers of 3D elliptic PDEs

Algebraic multigrid (AMG) methods are among the most efficient solvers for linear systems of equations and they are widely used for the solution of problems stemming from the discretization of Partial Differential Equations (PDEs). The most severe limitation of AMG methods is their dependence on parameters that need to be fine-tuned. In particular, the strong threshold parameter is the most relevant, since it is at the basis of the construction of the successively coarser grids needed by AMG methods. We introduce a novel Deep Learning algorithm that minimizes the computational cost of the AMG method when used as a finite element solver. We show that our algorithm requires minimal changes to any existing code. The proposed Artificial Neural Network (ANN) tunes the value of the strong threshold parameter by interpreting the sparse matrix of the linear system as a black-and-white image and exploiting a pooling operator to transform it into a small multi-channel image. We experimentally show that the pooling successfully reduces the computational cost of processing a large sparse matrix and preserves the features needed for the regression task at hand. We train the proposed algorithm on a large dataset containing problems with a highly heterogeneous diffusion coefficient, defined in different three-dimensional geometries and discretized with unstructured grids, and linear elasticity problems with a highly heterogeneous Young's modulus. When tested on problems with coefficients or geometries not present in the training dataset, our approach reduces the computational time by up to 30%.


Introduction
Multigrid methods [14,60] are among the state-of-the-art solvers for large linear systems that come from the discretization of Partial Differential Equations (PDEs). They are applicable to a wide range of discretizations, such as polyhedral discontinuous Galerkin [5,9] and virtual element methods [7]. However, their drawback is that they rely on a sequence of coarser grids to solve the problem. The algebraic multigrid (AMG) methods are a highly scalable [11] generalization that is capable of building this hierarchy of grids algebraically. The main challenge behind the algebraic construction of a coarser mesh is the selection of the prolongation operator, which in turn relies on seeing the sparse matrix of the linear system as a weighted graph and defining a strong threshold parameter to partition the aforementioned graph.
AMG was first introduced in the 80s [16,17,18,51] and gained momentum very quickly in the subsequent years [20,24,54,57]. In recent years, modifications have been proposed to AMG to improve its efficiency, like smoothed aggregation [55,56], and to extend its range of application to different discretization techniques, like discontinuous Galerkin [8] and Ritz-type finite element methods [19,21,36], and to different physics equations, like Maxwell's equations [13,38], magnetohydrodynamics [1], the Navier-Stokes equations [47,59], linear elasticity [29] and multiphase poromechanics [61]. There is also a wide literature that tackles AMG from a theoretical point of view: the foundations were laid down in [15,25,62,63] and the most recent survey is found in [64]. In particular, we consider the finite element (FE) [46] discretization of diffusion and linear elasticity problems with highly heterogeneous coefficients in 3D. These problems represent a challenging benchmark, since they feature a large number of unknowns and a very ill-conditioned system matrix. Consequently, the efficient application of the AMG method is key to obtaining fast and robust convergence.
In this work, we propose to use Deep Learning (DL) [42] techniques to find the optimal value of the strong threshold parameter, depending on the problem we want to solve. In particular, Artificial Neural Networks (ANNs) have gained widespread popularity for a variety of applications. They are versatile tools that are progressively being used in scientific computing [40,44,58], particularly in the numerical approximation of PDEs and model order reduction [27,48]. ANNs can also be utilized within a data-driven framework to provide alternative closure models, which involve learning input-output relationships in complex physical processes [49,50]. Since we treat the sparse matrix of the linear system like an image, part of the ANN that we employ consists of a Convolutional Neural Network (CNN). CNNs are among the most widely applied neural architectures for analyzing visual imagery and have achieved breakthrough results [30,34,35,39,41]. Today, they are only surpassed by vision transformers [22], which, however, have a much more complex architecture and are much more difficult to optimize. Moreover, CNNs have already been successfully applied in the field of scientific computing [12,23].
The combination of multigrid methods and machine learning has already been explored in the literature. For instance, [28,37] represent the first attempts at optimizing the multigrid parameters, in [2,3,6] DL is used to perform grid refinement and agglomeration, in [43,45] Graph Neural Networks (GNNs) are used to learn the prolongation operator, in [53] reinforcement learning is used to perform graph coarsening, and in [32,33] DL is used to enhance domain decomposition. In this paper, we propose a DL-based algorithm that is able to tune on-the-fly the strong threshold parameter so as to minimize the computational time needed to solve the linear system at hand. The main difference with respect to previous works is that our algorithm is completely non-intrusive: following the approach described in [4], it does not require any change to existing code (neither to the FE nor to the AMG solver). This guarantees a wider range of applicability and means that we can rely on all the classical theoretical results regarding convergence. Indeed, we show that our algorithm is able to fine-tune the AMG parameters just by analyzing the sparse matrix of the linear system we want to solve. In particular, we train an ANN to predict the expected computational cost of solving a linear system, given as input a certain value of the strong threshold parameter and a small multi-channel image representing the sparse matrix of the linear system. In order to build this representation, we propose to employ the pooling operator used in CNNs. We show that the pooling operator allows for a cheap evaluation that does not lose relevant information for the task at hand. We found that the proposed ANN-enhanced AMG method significantly reduces the computational cost (elapsed time) needed to solve the linear system, compared to employing the pre-defined choice of the parameters based on trial-and-error, experience, and literature.
This work is organized as follows. Section 2 introduces the mathematical framework of AMG methods; in particular, we define and show the importance of the strong threshold parameter. In Section 3 we present our algorithm. We describe the architecture of the ANN and show how to apply the pooling operator to a large sparse matrix. Section 4 is devoted to the numerical validation of our method. We first perform a preliminary sensitivity analysis of the hyperparameters of the ANN. We then apply our algorithm to a family of elliptic problems with a highly heterogeneous diffusion coefficient on structured and unstructured meshes and to a family of linear elasticity problems with a highly heterogeneous Young's modulus. Finally, in Section 5 we draw our conclusions and outline future developments.

Algorithm 1 One iteration of the V-cycle of the AMG method

AMG methods
To start, we will explain the fundamental concepts and methods that make up AMG. For further information, refer to [51,65]. Our goal is to solve the linear system Au = f, where A ∈ R^{n_1×n_1} is a symmetric positive definite (SPD) matrix with entries a_ij and u, f ∈ R^{n_1} are vectors with entries (u)_i. We define a smoothing scheme as a linear iterative method of the form

u^{m+1} = u^m + B(f − Au^m),    (1)

where B ∈ R^{n_1×n_1} and the initial guess u^0 are given. We denote ν applications of (1) as smooth^ν(A, B, u^0, f). The idea of multigrid methods is that, after a certain number of iterations ν of Eq. (1), the error is more efficiently reduced by projecting the residual equation Ae = r = f − Au^ν onto a coarser space, and interpolating the solution back to the original space to apply the correction u^ν + e. Suppose we have a sequence of interpolation operators I^{k−1}_k ∈ R^{n_{k−1}×n_k}, restriction operators I^k_{k−1} ∈ R^{n_k×n_{k−1}} and grid operators A^{(k)} ∈ R^{n_k×n_k} for k = 2, ..., M with A^{(1)} = A, smoothers B^{(k)} for k = 1, ..., M with n_1 > n_2 > ... > n_M, and a number of pre-smoothing iterations ν_1 and post-smoothing iterations ν_2; then the V-cycle multigrid iteration is defined as in Algorithm 1 (notice the usage of the superscript (k) to indicate the different levels). The AMG method aims at finding the operators needed for this task by relying only upon the original matrix A. Since A is SPD, we assume that

I^k_{k−1} = (I^{k−1}_k)^T and A^{(k)} = I^k_{k−1} A^{(k−1)} I^{k−1}_k, for k = 2, ..., M.

Hence, all the operators are defined once we have a recipe to build the interpolation operator.
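To make Algorithm 1 concrete, the following is a minimal sketch of one V-cycle iteration in Python/NumPy. The data layout (lists of dense operators per level, interpolation matrices P[k] from level k+1 to level k) is our own illustrative assumption and not the paper's implementation:

```python
import numpy as np

def smooth(A, B, u, f, nu):
    # nu applications of the smoothing scheme u <- u + B (f - A u), cf. Eq. (1)
    for _ in range(nu):
        u = u + B @ (f - A @ u)
    return u

def v_cycle(k, A, B, P, u, f, nu1, nu2, M):
    # A[k], B[k]: grid operator and smoother at level k; P[k]: interpolation from level k+1 to k
    if k == M - 1:                       # coarsest level: solve directly
        return np.linalg.solve(A[k], f)
    u = smooth(A[k], B[k], u, f, nu1)    # pre-smoothing
    r = f - A[k] @ u                     # residual
    rc = P[k].T @ r                      # restriction (transpose of the interpolation)
    ec = v_cycle(k + 1, A, B, P, np.zeros_like(rc), rc, nu1, nu2, M)
    u = u + P[k] @ ec                    # coarse-grid correction
    u = smooth(A[k], B[k], u, f, nu2)    # post-smoothing
    return u
```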

Interpolation operator and coarse-grid selection
We consider the interpolation operator between level k and level k + 1. We assume that we can split the variables into two sets: the ones that need interpolation at the fine level k and the ones that are kept at the coarse level k + 1, namely:

(I^k_{k+1} x)_i = x_i if i ∈ C^k,   (I^k_{k+1} x)_i = Σ_{j ∈ C^k_i} ω^k_{ij} x_j if i ∈ F^k,

where x ∈ R^{n_{k+1}} is a generic vector, C^k/F^k is a coarse/fine partition of the set N^k = {1, ..., n_k}, C^k_i = {j ∈ C^k : a_ij ≠ 0} is a subset of C^k that depends on i, and ω^k_{ij} is a set of weights. Since we are working between two levels, from now on we will omit the superscript k. Before being able to define the value of ω_ij, we need to define when the unknown (u)_j is important in determining the value of (u)_i.

Definition 2.1. Given a threshold parameter 0 < θ ≤ 1, the set of variables on which the variable i strongly depends is S_i = {j ≠ i : −a_ij ≥ θ max_{l≠i}(−a_il)}.
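As an illustration of Definition 2.1, a possible way to compute the strong dependence sets S_i from a SciPy sparse matrix is sketched below. This is our own simplified rendition for exposition, not the implementation used by an AMG library such as BoomerAMG:

```python
import numpy as np
from scipy.sparse import csr_matrix

def strong_dependence_sets(A: csr_matrix, theta: float):
    """Return S[i] = {j != i : -a_ij >= theta * max_{l != i}(-a_il)} for each row i."""
    A = A.tocsr()
    n = A.shape[0]
    S = []
    for i in range(n):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        vals = A.data[A.indptr[i]:A.indptr[i + 1]]
        off = cols != i                    # exclude the diagonal entry
        neg = -vals[off]                   # -a_il for the off-diagonal entries
        if neg.size == 0 or neg.max() <= 0:
            S.append(set())                # no negative off-diagonal connections
            continue
        thresh = theta * neg.max()
        S.append({int(j) for j, v in zip(cols[off], neg) if v >= thresh})
    return S
```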
We also define the set of variables j that are influenced by the variable i, namely S_i^T = {j ∈ N : i ∈ S_j}. The underlying assumption of AMG (see [51]) is that the algebraically smooth error varies slowly along strong connections; this yields the formula (4) for the weights ω_ij. The last ingredient that we need to build the interpolation operator is the C/F splitting. Among the several techniques that can be used, we outline the CLJP (Cleary-Luby-Jones-Plassman) algorithm. First, we construct the graph of variable dependencies G = (V, E), with vertices V = {1, ..., n} and edges E = {(i, j) ∈ V × V : j ∈ S_i}. For each vertex, we define the measure η(i) = |S_i| + η̃, where η̃ is a random number in (0, 1) used to break ties. Then, we update η and G in the following way until all vertices are designated as either C or F. Whenever an update to η(j) is such that η(j) < 1, j is flagged as F.
1. Build the independent set D = {i ∈ V : η(i) > η(j) ∀ j ∈ S_i ∪ S_i^T} for the graph G.
2. For each i ∈ D:
   2.1. For each j ∈ S_i, decrement η(j) and remove the edge (i, j) from E.
   2.2. For each j ∈ S_i^T, remove (j, i) from E.
      2.2.1. For each k ∈ S_j ∩ S_i, decrement η(j) and remove (k, j) from E.

Every point in D is flagged as C.
This algorithm is applied at each level until the size of the grid operator n_k is smaller than a certain given value n_min, which we fix equal to two. Thus, given the strong threshold θ, we are able to construct the sequence of operators needed for the V-cycle. Algorithm 2 showcases the full AMG method.

ANN-enhanced AMG
We now introduce the ideas behind the proposed AMG-ANN algorithm. We have shown that the strong threshold parameter θ heavily affects the construction of the interpolation operator that is at the basis of the AMG method. In the literature, this value is usually fixed to 0.25 and 0.5 when applying AMG to the solution of linear systems that stem from the discretization of 2D and 3D PDEs, respectively. Our algorithm aims at making an accurate choice of θ so as to minimize the computational cost of the AMG method. To this end, we devise an algorithm that exploits DL techniques to predict the value of θ that minimizes the elapsed CPU time needed to solve the given linear system Au = f. Namely, we propose to use an ANN F to predict the computational time t, that is, the number of seconds needed to solve a linear system with the AMG method (including the setup phase), and the confidence of that prediction σ², that is, the square of the committed error, given as input a small multi-channel image V extracted from the matrix A, the FE degree p, the logarithm of the size of A and the value of the strong threshold parameter θ. Then, online, we perform a 1D optimization to find the optimal θ* that minimizes the predicted cost t̃ (we use the tilde to denote quantities approximated by the ANN). The architecture of the ANN is shown in Figure 1.
We start by briefly reviewing how we collect data and explaining how the extraction of V from A is performed. We then describe our approach in detail in Algorithm 4 and introduce a way to measure its performance. To make the article self-contained, we discuss the basics of DL in Appendix A.

Gathering and smoothing data
Since the measurements of t are affected by error, we repeat the measurements r times, with r between two and 100, where r is chosen inversely proportional to the mean elapsed time of the first two measurements. Indeed, the uncertainty in the measurements is caused by the tasks scheduled by the operating system concurrently with the AMG solver, which perturb the CPU load; hence, if t is larger, the uncertainty is smaller, since these perturbations average out. This hypothesis is supported by our measurements: indeed, we have observed that the sample variance of our measurements is inversely proportional to t (for fixed r). The elapsed CPU time t is then defined as the mean of the measurements.
Moreover, we regularize the data using the Savitzky-Golay filter [52]. Namely, we fit a polynomial to successive sub-sets of adjacent data points (called the window) by means of least squares. We use a window of size 21 and polynomial degree 7 for uniformly sampled values of θ. The two parameters were selected through a process of manual tuning, whereby various combinations were tested and assessed for their performance on a subset of problems within the dataset. The selection was guided by visual inspection of the resulting fits, with consideration given to balancing model complexity and accuracy. To further validate the result of the filtering, we checked that the filter maintains the positive sign of the data and we manually reviewed the cases where the difference between the minimum of the filtered and unfiltered (raw) data is larger than a certain threshold. In these cases, we checked whether it would be appropriate to increase the degree of the polynomial or change the size of the window; indeed, the smoothing sometimes changes the position of the minima in sharp valleys. For consistency, all simulations were carried out on the same Intel Xeon Gold 6238R node of the HPC cluster at MOX.
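For reference, this smoothing step can be reproduced with SciPy as in the following sketch, assuming the measured times are sampled at uniformly spaced values of θ (the variable names and the placeholder data are illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter

# theta: uniformly spaced strong-threshold values, t_raw: mean measured CPU times
theta = np.linspace(0.05, 0.95, 200)
t_raw = 1.0 + 0.1 * np.random.rand(theta.size)   # placeholder for the actual measurements

# window of 21 points and polynomial degree 7, as described above
t_smooth = savgol_filter(t_raw, window_length=21, polyorder=7)

# sanity checks mentioned in the text: positivity and stability of the minimum
positive = (t_smooth > 0).all()                                          # flag loss of positivity
min_shift = abs(theta[np.argmin(t_raw)] - theta[np.argmin(t_smooth)])    # shift of the minimum
```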

Applying the pooling operation to large sparse matrices
Relying on the properties of the pooling operation that made it so widely used (see Appendix A), we aim to apply the pooling to the sparse matrix A. Indeed, A can be seen as a large black and white image and we want to apply the pooling operation (before the training) to reduce the computational cost associated with handling such a large amount of data, and to prune unimportant features. We call V = pooling(A, m) ∈ R^{m×m×4} the result of the pooling of A of size m. We report the details in Algorithm 3. Here, we are assuming that the sparse matrix A is stored in coordinate list (COO) format; however, the algorithm can be easily generalized to other sparse storage formats.

Algorithm 3 Pooling algorithm V = pooling(A, m)
1: access A in COO form and extract its size n_1 and the arrays val, row, col
2: initialize V to an m × m × 4 dense tensor with all zero entries v_ijl = 0
3: q ← n_1 / m, p ← n_1 mod m, t ← (q + 1) p
4: for k = 0 to val.size() − 1 do

The only difference with the pooling operation usually applied in CNNs is that, instead of extracting the maximum, we extract the following features (always in a rectangular neighborhood): the maximum of the positive part, the maximum of the negative part, the sum, and the number of non-zero (nnz) entries. The insight behind extracting these features is the following: the positive part, the negative part and the sum are relevant in the definition of the weights of the interpolation operator Eq. (4), while the nnz count is an indicator of the sparsity of the neighborhood. The algorithm has low complexity, namely O(nnz), which in the case of FE means O(p · n_1). Moreover, we have experimentally verified that its elapsed CPU time is negligible with respect to the time needed to solve the linear system. This is a necessary condition for this algorithm to be worthwhile. We remark that Algorithm 3 could easily be generalized to work in parallel.
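The following is a minimal sketch of how the four pooled channels can be computed from a SciPy COO matrix. The bucketing of rows and columns (index i mapped to block min(i·m // n_1, m − 1)) is our own simplifying assumption and may differ from the exact partition used in Algorithm 3:

```python
import numpy as np
from scipy.sparse import coo_matrix

def pooling(A: coo_matrix, m: int) -> np.ndarray:
    """Pool the n1 x n1 sparse matrix A into an m x m x 4 tensor.

    Channels: 0) max of the positive part, 1) max of the negative part (as magnitude),
              2) sum of the entries, 3) number of non-zeros in the block."""
    A = A.tocoo()
    n1 = A.shape[0]
    V = np.zeros((m, m, 4))
    # map each matrix index to a pooling block (simple proportional assignment)
    bi = np.minimum(A.row * m // n1, m - 1)
    bj = np.minimum(A.col * m // n1, m - 1)
    for i, j, v in zip(bi, bj, A.data):
        V[i, j, 0] = max(V[i, j, 0], max(v, 0.0))    # positive part
        V[i, j, 1] = max(V[i, j, 1], max(-v, 0.0))   # negative part
        V[i, j, 2] += v                              # sum
        V[i, j, 3] += 1                              # nnz count
    return V
```

The cost is a single pass over the non-zeros, consistent with the O(nnz) complexity claimed above.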

The AMG-ANN algorithm
We now have all the ingredients to describe the proposed algorithm in detail. We employ an ANN F to predict, as targets, the computational cost t and the square of the error committed by the ANN itself, σ². The insight behind σ is that, since the elapsed CPU time t is polluted by measurement noise, in this way we are able to know when the ANN is confident in its own prediction. We employ as inputs of the ANN the pooling of the matrix, V = pooling(A, m), where the size m is a hyperparameter that we tune (see Section 4.1.1), the FE degree p, the logarithm of the size of A, log_2(n_1), and the strong threshold θ. Namely, F(V, p, log_2(n_1), θ; γ) = (t̃, σ̃), and we optimize the loss over the parameters γ (the superscript i indicates the i-th sample of the dataset of size T). Let us remark that p is not needed as input of the neural network. Indeed, the information about p is embedded into the matrix A: empirical results show that if we add p as input of the network, we need a lower number of epochs to reach the same loss, but it is still possible to effectively train the network without p. Hence, our algorithm is not limited to problems stemming from FE discretizations. Given the matrix A associated to the linear system we want to solve, our algorithm prescribes the default literature value of the strong threshold θ every time the weighted average variance σ̄ of the predictions σ̃(θ) is greater than a certain threshold σ̂. This weighted average ensures that the variance of the prediction is more relevant if closer to the (expected) minima. The value σ̂ is calibrated offline on the validation dataset, once, after the training of the ANN. For each matrix A^i in the dataset we compute the relative error indicator σ̄^i. Then we propose to choose σ̂ as the ordinate of the elbow point of the sorted error indicators σ̄^i. On the other hand, if σ̄ < σ̂, the algorithm prescribes as θ the value θ* found by solving the optimization problem

θ* = argmin_{θ ∈ (0,1]} (F(V, p, log_2(n_1), θ; γ))_1,

where (·)_1 denotes the first output of F, that is, the predicted cost t̃.
However, we empirically found that by solving a discretization of the above problem with two hundred linearly spaced points, we obtain an estimate of θ* that, for our purposes, is indistinguishable from the solution of the continuous problem obtained with gradient descent. There are two reasons to use this approach instead of predicting θ* directly: we can quantify the expected improvement with respect to using the standard literature value of θ, and we do not have to solve a new optimization problem just to add one sample to the dataset. The architecture of F consists of two components. The first one leverages a CNN to extract the relevant features from the matrix V. These features are flattened in an intermediate dense layer, where they are concatenated with the other features p, log_2(n_1) and θ and fed into a dense feed-forward network, which predicts the computational cost t̃. On the output layer, we clip the prediction of the normalized computational cost between zero and one and use a softplus activation function for the variance estimate.
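A minimal sketch of the online selection of θ*, assuming a trained model F that maps (V, p, log_2(n_1), θ) to the pair (predicted cost, predicted variance), is shown below. The function and variable names are ours, and the specific weighting of the variances near the predicted minimum is our own guess, since the exact weighting is not spelled out above:

```python
import numpy as np

def select_theta(F, V, p, n1, sigma_hat, theta_default=0.5, n_points=200):
    """Pick theta* on a grid of 200 points, falling back to the literature value
    when the weighted-average predicted variance exceeds the threshold sigma_hat."""
    thetas = np.linspace(1e-3, 1.0, n_points)
    preds = np.array([F(V, p, np.log2(n1), th) for th in thetas])   # shape (n_points, 2)
    t_pred, var_pred = preds[:, 0], preds[:, 1]
    # weight the variances so that values near the predicted minimum matter more
    weights = np.exp(-(t_pred - t_pred.min()))
    sigma_bar = np.average(var_pred, weights=weights)
    if sigma_bar > sigma_hat:
        return theta_default            # prediction not trusted: use the default value
    return thetas[np.argmin(t_pred)]    # otherwise, the theta minimizing the predicted cost
```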

Evaluating the performance of the algorithm
A small loss is not a good indicator of the performance of our algorithm. Indeed, the choice of θ is subordinate to the map A → θ* defined by Eq. (5). With this in mind, we introduce the following quantities of interest. Let A be fixed, let t_θ denote the elapsed CPU time of the AMG method with strong threshold θ, and let:
• t_ANN be the computational time of the AMG-ANN algorithm;
• P = 1 − t_ANN / t_0.25 be the performance index of the AMG-ANN algorithm;
• P_MAX = 1 − min_θ t_θ / t_0.25 be the best performance of the AMG-ANN algorithm (according to the dataset).
Moreover, we can compound the quantities over different A and define P_B as the percentage of cases where P ≥ 0, P_m as the average of P and P_M as the median of P. The ratio P/P_MAX gives a measure of how well the ANN has learned the data.
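For concreteness, these indices can be computed as in the following sketch, given for each test problem the measured times over the sampled θ values (the variable names are illustrative):

```python
import numpy as np

def performance_indices(t_ann, t_default, t_grid):
    """t_ann: times of AMG-ANN; t_default: times with the default theta;
    t_grid: measured times over the sampled theta values (one row per problem)."""
    P = 1.0 - np.asarray(t_ann) / np.asarray(t_default)            # per-problem performance
    P_max = 1.0 - np.min(t_grid, axis=1) / np.asarray(t_default)   # best achievable performance
    P_B = np.mean(P >= 0.0)    # fraction of cases where AMG-ANN is not worse than the default
    P_m = np.mean(P)           # average time reduction
    P_M = np.median(P)         # median time reduction
    return P, P_max, P_B, P_m, P_M
```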

Normalization
Normalization of the data is a necessary step to ensure fast convergence of neural networks. To this end, we employ a logarithmic normalization for each channel of the input V, since it has been shown in [4] to outperform a classical linear normalization in this kind of application. One of the main properties of this normalization is that it preserves the sparsity pattern of A. In Appendix B we show some examples of the combination of pooling and normalization for some matrices A and compare them with the corresponding sparsity pattern. Concerning the output, we normalize the data linearly between zero and one, independently for each subset of outputs in our dataset defined by fixing a matrix A. Algorithm 4 shows the complete ANN-enhanced AMG.

Numerical Results
One of the advantages of our algorithm is that it integrates seamlessly into pre-existing code. We carry out several numerical experiments using deal.II [10] for the construction of the linear system and we rely on BoomerAMG of the library HYPRE [26] as the AMG solver. No changes are made to BoomerAMG: we use it as a black-box solver and only change the value of the strong threshold parameter θ; all the other parameters (choice of pre- and post-smoother B, number of pre- and post-smoothing cycles ν, etc.) are left at their default values.
Concerning the training of the ANN, we employ a 20-20-60 split of the dataset into training, validation and test sets, respectively. The splitting is done among problems: this means that a certain matrix A appears in just one of the three datasets. This entails that when we evaluate the algorithm on the test (or validation) dataset, it is making predictions on problems it has never seen before. Notice also that the test dataset is much larger than the other two: this makes sense only when the dataset is large enough so that the neural network can generalize well. Training has been performed using the Adam optimizer with default learning rate 10^{-3} and minibatch size equal to 32. Moreover, we found that applying a suitable learning rate schedule is key to accelerating the optimization. Namely, we employ a learning rate schedule that halves the learning rate with a patience of 15 epochs.
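As a sketch of this training setup, and under the assumption of a Keras/TensorFlow implementation (the framework is not stated above), the optimizer and learning rate schedule could be configured as follows; the resulting objects would then be passed to model.compile(...) and model.fit(..., batch_size=32):

```python
import tensorflow as tf

# Adam with the default learning rate used above
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# halve the learning rate after 15 epochs without improvement of the validation loss
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.5,
                                                 patience=15)
```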
The convolutional part of the network is composed of either one or two plain convolutional blocks, each one ending with a max-pooling layer. For each test case, we tune the number of convolutional layers, the size of the filters, the number of filters, the number of dense layers and the width of the dense layers.
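A minimal Keras sketch of such an architecture is given below. The exact widths, filter counts and kernel sizes are tuned per test case and are not reproduced here, so the specific values are our own placeholder assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

m = 75  # pooling size (see Section 4.1.1)

V_in = layers.Input(shape=(m, m, 4), name="pooled_matrix")
scalars_in = layers.Input(shape=(3,), name="p_logn_theta")  # p, log2(n1), theta

# convolutional blocks, each ending with max pooling
x = layers.Conv2D(16, 3, activation="relu", kernel_initializer="he_normal")(V_in)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu", kernel_initializer="he_normal")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)

# concatenate the image features with p, log2(n1) and theta, then the dense part
x = layers.Concatenate()([x, scalars_in])
x = layers.Dense(64, activation="relu", kernel_initializer="he_normal")(x)
x = layers.Dense(64, activation="relu", kernel_initializer="he_normal")(x)

t_hat = layers.Dense(1)(x)
t_hat = layers.Lambda(lambda z: tf.clip_by_value(z, 0.0, 1.0), name="cost")(t_hat)  # clipped cost
sigma2_hat = layers.Dense(1, activation="softplus", name="variance")(x)             # variance estimate

model = tf.keras.Model([V_in, scalars_in], [t_hat, sigma2_hat])
```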

Test Case 1: Highly heterogeneous diffusion problem, unstructured grids
Let us consider the following parametric elliptic problem: −div(µ∇u) = f in Ω, u = 0 on ∂Ω, where µ ∈ L^∞(Ω) is a highly heterogeneous, piecewise constant, positive coefficient and Ω varies among four geometries: a simplex, a plate with a hole, a ball and a cylinder. We consider in total eight different discretizations of the four domains and their nested refinements. We show the coarsest meshes in the first column of Table 1. The diffusion coefficient µ is a highly heterogeneous piecewise constant that is conforming with the mesh. Namely, µ has a different value, equal to 10^{ε_i}, on the i-th cell of one of the coarsenings of the mesh, where ε_i is randomly chosen in [0, ε_MAX] and ε_MAX = 1, 3, 10. Moreover, we again employ four different DoFs numberings to enhance the dataset. In this way, we build a dataset counting 5873 different configurations and thus containing 217301 samples.

Study of hyperparameters and pooling features
In this section we analyze the influence of some hyperparameters on the final loss when we employ a subset of the dataset. This is a preliminary analysis that we carried out for each test case in order to choose the best value of m (the size of the pooling V), which of the features of V are relevant (that is, which of the four layers f of v_ijf should be used as input of the CNN), and the activation function. Namely, we performed a grid search over different network architectures. Figure 2 shows the results. The ReLU activation function consistently reaches the smallest loss, hence we employ ReLU activation functions in conjunction with He initialization [31]. The ANN that uses all four features of the pooling has the smallest loss, thus the whole tensor V is relevant for this task. The pooling size m = 75 gives the best trade-off between accuracy and computational cost.

Table 1: Geometries used in the numerical experiments (the first column of the original table, showing a picture of the coarsest mesh of each geometry, is omitted here).
Simplex: Tetrahedron inscribed in the unit ball centered in (0, 0, 0).
Ball: Unit ball centered in zero.
Balanced ball: A variant of the unit ball centered in zero with a better balance between the size of the cells around the outer curved boundaries and the cells in the interior.
Cylinder: Cylinder of unit radius and height two centered in zero. We consider three different discretizations with one, two or eight slices along its height.
Holes: 3 × 2 replica of the "Plate with hole" geometry along the x and y directions. Only in the test dataset.
Torus: Torus of circle radius 2 and inner radius 1/2. The axis of the torus is the y-axis. Only in the test dataset.

Calibration of the σ̂ threshold
We discuss how the choice of σ̂ is made and how it impacts the performance of our algorithm. In Figure 4 we plot the accuracy (P_B) and the average time reduction (P_m) obtained when choosing as σ̂ the n-th largest σ̄ of the validation dataset. Notice that there is an accuracy-performance trade-off: indeed, as σ̂ tends to zero, P_B tends to one, but at the same time P_m tends to zero. Moreover, the fact that P_m is not strictly decreasing means that the error committed by the algorithm is not just due to noisy measurements. However, by using as σ̂ the elbow of the ordered σ̄ we have a gain of 6.1% in terms of accuracy, while the mean performance decreases by only 0.5%.

Evaluation of the algorithm
Starting from the architecture we found in Section 4.1.1, we fine-tuned the number of filters, the kernel size of the CNN and the width and depth of the dense part. The ANN has a loss of 2.89 · 10^{-4}. The performance of our algorithm is evaluated by the indexes described in Section 3.3.1 on the test dataset (see Table 2) and by visualizing the scaling of the AMG with respect to the number of DoFs (see Figure 9). Moreover, we remark that the scaling shown in the bottom row of Figure 9 should, ideally, be constant. Indeed, this is true when we apply the AMG to problems where max_Ω µ / min_Ω µ ∼ 1; however, in the cases that we consider this is not possible to achieve. In any case, the AMG-ANN is able to achieve a much better scaling with respect to the non-tuned AMG version. In particular, the algorithm has an accuracy of 92.51% and produces an average reduction of the computational cost of almost 16%. Since in this case the median of P_MAX is rather small (82%), there is still margin for improvement in the reduction of the computational cost by tuning the architecture of the ANN. Figure 3 shows all the predictions made by the trained ANN; notice that the largest points are the furthest from the bisector, thus the ANN successfully predicts where it is inaccurate. In Figure 5 we highlight some relevant predictions. Notice that the map θ → t exhibits many different complex patterns and that θ* varies depending on the problem considered. Hence, it is not possible to find a fixed value θ* that works well for all the problems. In Figure 6, we showcase relevant cases where the algorithm is not accurate; in particular, we categorize the errors of the algorithm into three classes (from left to right):
• Noisy measurements: the error is due to the measurement of t. As mentioned before, we reduce this error by repeating the measurements and regularizing the data. Moreover, σ̃ is often a good indicator of a large error in these cases.
• Generalization error: this takes place when the network is not able to generalize beyond the training set. Notice that in this case the prediction is completely wrong.
• ANN error: in this category we classify the errors due to the approximation capacity of the ANN, the optimization error and the loss of information occurring during the pooling of A. These types of errors can be mitigated by hyperparameter tuning. However, there are some limitations: for instance, even though we would like m to be large, in order to lose as little information as possible, the cost of the evaluation of F depends quadratically on it, so it must stay small. Moreover, the error estimate σ̃ is affected by these kinds of errors itself, and it may not be of use in these cases.

Evaluation on unseen domains
We tested the ANN-AMG algorithm by applying it to problems with a diffusion coefficient µ that was not present in the training dataset, but on a known geometry. To prove the generalization capability of the algorithm, we now test it using domains and diffusion coefficients it has never seen before. The two domains we employ are represented in the last two rows of Table 1: a replication of the plate with a hole along the x and y coordinates, and a torus. We build the dataset exactly as done before; in total, it counts 445 different problems. The performance on the "Holes" geometry is similar to that on a known geometry: P_B = 86% and P_M = 18%. Thus, we have shown that the algorithm is able to generalize also on domains with a different topology with respect to the ones in the training dataset. This result is linked to the fact that a (small) subset of the considered geometry (the plate with the hole) is present in the training dataset. Indeed, the results on the Torus are much worse: P_B = 64% and P_M = 6%. The algorithm is still making predictions that are significantly better than any prediction made "at random"; however, this challenging mesh shows the limitations of our approach. We show the details about the performance in Table 2.

Test Case 2: Highly heterogeneous diffusion problem, structured grids
Let us consider the same elliptic problem (7) as before. However, here Ω = (−1, 1)³ and µ ∈ L^∞(Ω) is a highly heterogeneous, piecewise constant, positive coefficient defined differently from before. Namely, µ is defined by means of Algorithm 5, where mode = 1, 2, 3 defines whether the pattern is made of slices, lines, or is checkerboard-like, size = 1, 2, ... defines how many times the pattern repeats, and the vector ε ∈ R^{size^mode} defines the values of µ. Figure 7 shows a representation of µ for size = 5 and mode = 1, 2, 3.

Algorithm 5 Diffusion µ at a given point x ∈ Ω = (−1, 1)³, having fixed mode, size and ε ∈ R^{size^mode}: µ = µ(x; mode, size, ε)
1: j ← 1
2: for i = 1 to mode do
3:   j ← j + ⌊((x)_i + 1) size / 2⌋ size^{i−1}
4: end for
5: µ ← 10^{(ε)_j}

The problem is discretized by means of continuous FE of order p on nested Cartesian meshes. The dataset is built by varying p = 1, 2, 3, mode = 1, 2, 3, size = 2, 3, ..., 10, and the mesh size h. In order to test the stability of the algorithm, we also use four different DoFs (degrees of freedom) numberings. Namely, when applying the renumbering, the underlying sparsity graph of the matrix A, and thus the optimal value of θ, stays the same, but the pooling V changes, so the ANN should learn to be invariant with respect to these changes. The dataset we built counts 5471 different problems and thus contains 202427 samples.
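A small Python sketch of this construction, following our reconstruction of Algorithm 5 above (the zero-based indexing convention and boundary guard are our own assumptions), could read:

```python
import numpy as np

def diffusion_mu(x, mode, size, eps):
    """Evaluate the piecewise-constant diffusion coefficient at a point x in (-1, 1)^3.

    mode = 1, 2, 3 selects slices / lines / checkerboard-like patterns,
    size is the number of repetitions per direction,
    eps is a vector of size**mode exponents."""
    j = 0  # zero-based index into eps
    for i in range(mode):
        cell = int(np.floor((x[i] + 1.0) * size / 2.0))
        cell = min(cell, size - 1)        # guard the boundary x_i = 1
        j += cell * size**i
    return 10.0 ** eps[j]

# example: checkerboard-like pattern with 5 repetitions per direction
rng = np.random.default_rng(0)
eps = rng.uniform(0.0, 3.0, size=5**3)
print(diffusion_mu(np.array([0.3, -0.7, 0.9]), mode=3, size=5, eps=eps))
```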
We started with an architecture whose hyperparameters were chosen as described in Section 4.1.1, and then tuned the number of filters and the size of the kernel of the CNN, as well as the width and depth of the dense part of the ANN. We obtain an ANN with a loss of 1.55 · 10^{-4}. Details about its performance and its scaling are shown in Figure 9 and Table 2. In particular, our algorithm has a 93% accuracy with an average reduction of the CPU time of almost 30%, and in half of the cases one can expect a reduction greater than 32%.

What does the ANN learn?
To better understand how the model problem we considered works and what the ANN learns, we study a specific subset of the model problems defined in the previous section. Namely, we consider a diffusion coefficient µ which is constant and equal to one everywhere, apart from one cell of the mesh, where it is equal to µ_MAX = 10^ε with ε = 2, 4, 8. We stress that this family of problems was not present in the dataset we employed to train our ANN. Moreover, we define d to be the distance of the cell with the largest diffusion coefficient µ_MAX from the center of the domain. Figure 8 (top-left) shows a graphical representation of the problem. First, we analyze the conditioning of the matrix A. As expected, it scales with respect to the mesh size h as h^{-2} and it is linearly proportional to the ratio between the maximum and minimum value of the diffusion coefficient µ in the domain (i.e. it is proportional to µ_MAX). The position of the cell does not affect the condition number; there is just a small reduction of it when the cell touches the boundaries, which is simply due to the Dirichlet boundary conditions, see Figure 8 (bottom). We then test our algorithm. Figure 8 (top-right) shows that the optimal value of θ is almost always in the same range: between 0.5 and 0.9. Moreover, we see that the ANN is able to capture the overall relation between the normalized computational cost t and the strong threshold. Hence, in all these cases, the value θ* predicted by the AMG-ANN algorithm is near the true optimum. Finally, to gain some insight into the ANN, we computed the feature maps extracted using the convolutional filters learned by the ANN. In Appendix C we show some of them: we can see that from one layer to the next the CNN enhances the features that it deems relevant.

Test Case 3: Linear Elasticity
We consider the following linear elasticity problem, where Ω = (−1, 1)³, ∇_S u = ½(∇u + ∇u^T) is the symmetric gradient and C is a rank-4 tensor that encodes the stress-strain relationship. Under the assumption that the material under consideration is isotropic, by introducing the two Lamé coefficients λ ∈ L^∞(Ω) and µ ∈ L^∞(Ω), the coefficient tensor reduces to C∇_S u = 2µ∇_S u + λ tr(∇_S u) I, where tr is the trace operator and I is the identity matrix. We can rewrite the Lamé parameters in terms of the Young's modulus E > 0 and the Poisson ratio ν ∈ (0, 1/2); we have G = E/(1+ν) and β = ν/(1−2ν). The problem has been discretized by means of FE of order p = 1, 2, 3 on a structured Cartesian grid. We fix the Poisson ratio ν = 0.29 and choose a highly heterogeneous Young's modulus E. Namely, E has the same pattern as the diffusion coefficient µ described in Section 4.2 and is such that size is even. We also employ four different DoFs numberings to enhance the dataset. In this way, we build a dataset counting 5873 different problems and thus containing 217301 samples.
As we did for the previous test cases, we start from an architecture found as described in Section 4.1.1 and then fine-tune the number of filters, the kernel size of the CNN and the width and depth of the dense part. The ANN has a loss of 5.64 · 10^{-4}. Evaluated on the test dataset, the algorithm has an accuracy of 90.71% and an average reduction of the computational cost of almost 23%. Details about its performance and its scaling are shown in Figure 9 and Table 2. There is still margin for improvement concerning the reduction of the cost, since P_MAX is relatively small; however, it is possible to see a remarkable reduction of the scaling of the computational cost in the bottom row of Figure 9.

Conclusions
We proposed a DL algorithm to optimize the choice of the strong threshold parameter that is at the basis of the construction of the levels needed by the AMG method. We have shown that our algorithm can be applied to a wide class of challenging problems coming from the FE discretization of PDEs. Namely, we have provided numerical results for three different test cases concerning a Poisson problem with a highly heterogeneous diffusion coefficient on (i) structured and (ii) unstructured grids and (iii) a linear elasticity problem with a highly heterogeneous Young's modulus. Our algorithm performs better than using the standard literature value in more than 90% of the cases and significantly reduces the computational cost (up to 30% on average).
The algorithm hinges upon the pooling operator to compress a large sparse matrix into a small tensor to feed into an ANN. Indeed, we have shown experimentally that the pooling operator reduces the computational cost and preserves the relevant information needed to perform the complex regression task at hand. Moreover, our algorithm can be introduced with minimal changes into any existing code that uses an AMG solver.
In future works, we plan to generalize our algorithm by using one single ANN to make predictions on linear systems stemming from different underlying PDEs, and to predict the optimal value of other parameters of the AMG method, such as the type of smoother and the number of smoothing steps. We also plan to try more advanced computer vision models, such as transformers [22]. Finally, we aim to further investigate the properties of the pooling operator applied to sparse matrices and to gain a better understanding of what the CNN learns.
Feed-forward neural networks. Let y^(0) = x and y^(L) = ỹ; a dense feed-forward neural network (FFNN) of depth L is the composition of L functions, called layers, defined as y^(l) = h^(l)(W^(l) y^(l−1) + b^(l)), l = 1, ..., L, where W^(l) ∈ R^{N_l×N_{l−1}} (weights) and b^(l) ∈ R^{N_l} (biases) are the parameters γ, and h^(l) is a scalar non-linear activation function applied component-wise.
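As a minimal illustration of this definition, a forward pass of a dense FFNN can be written in a few lines of NumPy (the layer widths and the ReLU activation in the example are chosen arbitrarily):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffnn_forward(x, weights, biases, activations):
    """Compute y^(L) from y^(0) = x through layers y^(l) = h^(l)(W^(l) y^(l-1) + b^(l))."""
    y = x
    for W, b, h in zip(weights, biases, activations):
        y = h(W @ y + b)
    return y

# a tiny 2-layer example: 4 -> 8 -> 1, with ReLU then identity activation
rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((1, 8))]
biases = [np.zeros(8), np.zeros(1)]
y = ffnn_forward(rng.standard_normal(4), weights, biases, [relu, lambda z: z])
```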
Convolutional neural networks. Convolutional neural networks (CNNs) are neural networks that use convolution instead of matrix multiplication in at least one of their layers. They have great success in computer vision tasks thanks to their ability to exploit the structured data format of images. Indeed, convolution layers enjoy three properties: shared parameters, that is, each parameter is tied to multiple components of the input; sparse interactions, that is, each component of y^(l) depends only on a subset of the components of y^(l−1) (this also entails a lower number of parameters and thus a greater efficiency); equivariance to translation, that is, the application of a translation and of a convolution can be interchanged to obtain the same result. The last ingredient of a CNN layer is the pooling operator. After several parallel convolutions and a non-linear activation, we replace the output of the layer at a certain location with a summary statistic of the nearby outputs. A popular option is max pooling [39], which reports the maximum value over a rectangular neighborhood. The aim of the pooling operation is to reduce the computational cost, increase statistical efficiency and add invariance to small translations.

Figure 10: Visual representation of the normalized pooling results V̄ that are fed into the ANN for some exemplary problems taken from the dataset described in Section 4.1.

Figure 11: Example of feature maps of the CNN layers for the problems described in Section 4.2.1.