Non-overlapping block smoothers for the Stokes equations

Overlapping block smoothers efficiently damp the error contributions from highly oscillatory components within multigrid methods for the Stokes equations but they are computationally expensive. This paper is concentrated on the development and analysis of new block smoothers for the Stokes equations that are discretized on staggered grids. These smoothers are non-overlapping and therefore desirable due to reduced computational costs. Traditional geometric multigrid methods are based on simple pointwise smoothers. However, the efficiency of multigrid methods for solving more difficult problems such as the Stokes equations lead to computationally more expensive smoothers, e.g., overlapping block smoothers. Non-overlapping smoothers are less expensive, but have been considered less efficient in the literature. In this paper, we develop new non-overlapping smoothers, the so-called triad-wise smoothers, and show their efficiency within multigrid methods to solve the Stokes equations. In addition, we compare overlapping and non-overlapping smoothers by measuring their computational costs and analyzing their behavior by the use of local Fourier analysis.


INTRODUCTION
The Stokes equations are a saddle-point system modeling a slow viscous flow. There is a general interest to find efficient solutions for solving these kind of systems of PDEs, because they describe the physics of many phenomena of scientific and engineering interest 1 . E.g., the constant movement of the Earth crust caused by the mantle convection is one application that can be described and simulated by means of the Stokes equations 2 . Large saddle-point linear systems represent a significant challenge for linear solvers owing to their indefiniteness and often poor spectral properties. Their importance motivates the development of effective solvers.
Multigrid methods have been applied successfully to large linear systems [3][4][5] . The methods find an approximate solution through two complementary processes: smoothing and coarse-grid correction. Smoothing filters parts of the error from the approximate solution in a computationally inexpensive way, while coarse-grid correction constructs a lower dimensional problem to remove the error remaining after smoothing. Coarse grid correction consists of the components restriction, interpolation and solution on the coarsest grid. This paper develops efficient multigrid methods to solve the Stokes equations discretized by the MAC scheme. The multigrid components of restriction, interpolation and solution on the coarsest grid can be immediately generalized to large linear systems of saddle point type.
However, the design of efficient smoothers for solving systems of PDEs requires special attention. We distinguish between decoupled smoothing and collective smoothing. A zero diagonal block that appears in the discrete system hampers a basic numerical treatment of the problem, which would be to relax the discrete equations directly in a decoupled way. A variety of multigrid methods for the Stokes equations based on different smoothing approaches have been developed, used and analyzed [6][7][8][9] . We focus on collective relaxation, so-called block smoothers. A comparison of a well-known overlapping block smoother, the Vanka smoother, with a non-overlapping smoother, the triad-wise smoother, is given. While the Vanka smoother is well established to solve the Stokes equations it is computationally expensive 10,11 . Non-overlapping block smoothers are less expensive but were considered less efficient in terms of their smoothing and convergence behavior 10 . In this paper, we develop a new nonoverlapping block smoother based on the results of the comparison with both smoothers. For periodic boundary conditions we show the diminished efficiency of the triad-wise smoother relative to the Vanka smoother. Solving the Stokes equations with Dirichlet boundary conditions leads to an enormous decrease of the efficiency for the triad-wise smoother. We highlight the issues of the non-overlapping smoother by means of analysis and numerical examples and demonstrate the causes of the high computational costs of the overlapping smoother. Based on these results a new non-overlapping smoother is constructed for Dirichlet boundary conditions that overcomes the demonstrated issues. To the best of our knowledge, the proposed modified smoother represents the first-ever non-overlapping block smoother that reaches efficiency close to the overlapping block smoother.
Local Fourier analysis (LFA) is utilized as the main analysis tool for quantitative convergence estimates and to optimize geometric multigrid components such as smoothers or intergrid transfer operators. The idea was introduced by Brandt 12 and later extended and refined in 13 . Practical guidelines on the use of LFA can be found in 14 . LFA is commonly applied to relaxation schemes using (symmetric) Gauss-Seidel approaches 15 , and has been extented to more sophisticated relaxation schemes [16][17][18] . In this study, LFA is used to analyze the two-grid method based on the overlapping and non-overlapping smoothers. In addition to the LFA results, we provide numerical results that guide the development of a new algorithm leading to more efficient multigrid methods.
The outline of the paper is as follows. In Section 2, we introduce the marker-and-cell (MAC) finite-difference discretization of the two-dimensional Stokes equations and the basic idea of the two block relaxation methods used in this paper. In Section 3, we present some basic definitions and notations to introduce LFA for the Stokes equations. In addition, analysis results are given at the end of the section. In Section 4, numerical examples extend the analysis and highlight the differences and issues arising due to different boundary conditions. In Section 5, we develop a new non-overlapping block smoother, an update of the so-called triad-wise smoothers leading to an enormous increase of efficiency. We give some details about parallelization aspects in Section 6. Conclusions are drawn in Section 7.

Stokes equations
We introduce the Stokes equations in two dimensions here. Let Ω be a domain in ℝ 2 with boundary Ω, and be a given vector function that describes an external force. Solving the two-dimensional Stokes equations means finding a fluid velocity = ( , ) and pressure such that the following system is fulfilled, where x ∶= x . The discretization of the Stokes equations naturally leads to linear systems of the form so-called saddle point systems. Here 1 denotes the negative discrete Laplace operator, 1 = −Δ ℎ , and 1 and 2 are the discrete divergence operators, 1 = ( x 1 ) ℎ∕2 and 2 = ( x 2 ) ℎ∕2 . In this paper, we use a staggered distribution of unknowns, the so-called Marker-and-Cell (MAC) scheme. The idea of the MAC scheme is to place the unknowns ( , , ) in different locations. More specifically, the discrete pressure unknowns are located in the center of each cell and the discrete values of the velocity unknowns are located on the midpoint of vertical edges (x 1 -component ) and on the midpoint of horizontal edges (x 2 -component ), as seen in Figure 1. We consider uniform meshes with: ℎ = ℎ = ℎ in this paper.
The discrete analogue of the x 1 -coordinate momentum equation (1) is defined at vertical edges, the discrete x 2 -coordinate momentum equation (2) at horizontal edges, and the continuity equation (3) at cell centers using central difference schemes. We introduce the following indexing system with regard to the underlying grid: is the column index and is the row index, ranging from 1 ∶ or 1 ∶ + 1, where is the number of cells in one direction. The discrete pressure unknowns are , , where = 1, … , , = 1, … , , and the velocity unknowns are , with = 1, … , + 1, = 1, … , and , with = 1, … , , = 1, … , + 1. Using this indexing system, the MAC scheme can be written as Our interest in this paper is to develop efficient multigrid methods to solve the Stokes equations discretized by the MAC scheme. We discuss how the multigrid components of smoothing, restriction, interpolation and solution on the coarsest grid are generalized to saddle point systems. A key point of an efficient multigrid method is the right choice of the relaxation procedure. The design of efficient smoothers for solving systems of PDEs often requires special attention. The relaxation method should smooth the error for all unknowns in the equations, which is not an easy task. Therefore, we focus on constructing appropriate relaxation schemes to smooth the error of the unknown functions. The other multigrid components can immediately be extended to systems of PDEs. A zero diagonal block that appears in the discrete system hampers a basic numerical treatment of the problem, which would be to relax the discrete equations directly in a decoupled way. More specifically, decoupled iterative solution methods use blocks on the main diagonal in an equation-wise fashion. Unfortunately, this equation-wise relaxation is not possible due to the zero diagonal block. Moreover, a first obvious choice for these so-called strong off-diagonal operators in the differential system is collective smoothing. However, for staggered grids the unknowns are not defined at the same locations and therefore we define standard collective relaxation as described in the following section.

Vanka smoother
The basic idea of the Vanka smoother is to solve the discrete Stokes equations locally "cell by cell". This means all five unknowns of one cell are updated collectively, involving the respective four momentum equations at the cell boundaries and one continuity equation in the center of the cell, c.f. Figure 2 and Equations (5)- (7). This results in an overlapping updating process of 5 × 5 systems of equations. This method was first introduced in 10 .
Using this scheme, each velocity component is updated twice and the pressure once per relaxation. The process of relaxation appears in a Gauss-Seidel manner. That means, after smoothing all unknowns of one cell collectively, we use the updated values for the relaxation of the next cell. This relaxation method is commonly referred to as overlapping Schwarz smoother or symmetric coupled Gauss-Seidel (SCGS). Especially in the context of the Stokes equations, we use the term Vanka smoother. It turns out to have robust smoothing properties 3 . Due to the overlap the iteration matrix ℎ of the relaxation scheme cannot be easily represented by a banded matrix or equivalently as a stencil. This is possible if a suitable coloring scheme is chosen. More details can be found in Section 6.

Triad-wise smoother
For the Stokes system, the Vanka smoother is a very efficient smoother 10 . However, due to the overlap, computational costs are high and parallel implementation aspects are not satisfying due to the required coloring scheme, c.f. Section 6. We exploit the potential of the Vanka relaxation method and focus on the development of a block smoother that is less expensive and still efficient.
The basic idea of the triad-wise smoother is to update three unknowns collectively, involving two momentum equations at cell boundaries and the continuity equation in the center of the cell. This results in a non-overlapping updating process of 3 × 3 systems of equations, as seen in Figure 3. Using this scheme, one velocity component in each direction and one pressure component are updated simultaneously. The first version of the triad-wise smoother was introduced in 10 and is referred to as unsymmetric coupled Gauss-Seidel (UCGS) 10 . The potential of a different updating process as for example a Jacobi-manner or a different ordering instead of the Gauss-Seidel updating process will be examined in this paper. We stick to the term triad-wise smoother whenever we don't specify the manner of the updating process. The interation matrix of the relaxation scheme, ℎ can be represented by for a Gauss-Seidel-type updating process and for Jacobi-type updating steps. In the light of higher computational cost, the triad-wise smoother seems to be a good alternative to the Vanka smoother, since it has no overlap and therefore reduces computational costs. In addition, this relaxation method has better parallelization properties. These characteristics are examined in detail in the next section. However, the unsymmetric coupled (triad-wise) smoother was observed to have poor smoothing and often led to divergence for the Stokes equations 10 .
In the following we analyze and describe the conditions that lead to poor smoothing of the triad-wise smoother. In addition, we investigate different versions of the triad-wise smoother and modify the relaxation method to generate an efficient smoother with good smoothing properties.

LOCAL FOURIER ANALYSIS FOR STOKES EQUATIONS
We employ local Fourier analysis (LFA) to analyze the convergence properties of multigrid methods for the Stokes equations and thereby compare the triad-wise and Vanka smoother.

Definitions and notations
In order to describe LFA for staggered grids, we first introduce some terminology. More details can be found, for example, in 3 . We consider two-dimensional infinite uniform grids ℎ = 1 Here, the grids that correspond to the velocity unknowns and , are denoted by 1 ℎ and 2 ℎ . Moreover, the grid that corresponds to pressure unknown is denoted by 3 ℎ . The coarse grid, , is constructed by doubling the mesh size, i.e. we use standard coarsening. The functions ( , ) are called Fourier modes or Fourier functions with frequency and form the basis of the Fourier space  ℎ . Under general assumptions, each multigrid operator can be replaced by an operator with constant coefficients. Therefore, Fourier modes serve as eigenfunctions of multigrid operators and can be used to analyze the effect of each multigrid component on different modes. For staggered grids, the basis of the Fourier space  ℎ is given by the We use the stencil notation and consider the operator ℎ of the Stokes system,
Note that for all Fourier modes ( , ), . We assume that the systemmatrix ℎ is a block matrix, where each block is an infinite-grid Toeplitz matrix. Then, each block in ℎ may be diagonalized with LFA. Each entry in the symbol̃ ℎ is computed as the (scalar) symbol of the corresponding block of ℎ . Since ℎ is a 3 × 3 block operator, its symbol is a 3 × 3 matrix, High and low frequencies for standard coarsening are given by Remark 1. The influence of boundary conditions is not taken into account when applying LFA, since the modes ℎ ( , ) are defined on the infinite grid ℎ . Experience with LFA show that it often serves as an accurate prediction tool for problems with periodic boundary conditions, but degradation in performance may be seen with Dirichlet boundary conditions 15 .
We start with a two-grid analysis and compare results for both smoothers.

Two-grid LFA
We analyze the two-grid method by starting with the analysis aspects of the smoothing procedures. This is followed by an introduction of coarse-grid correction symbols. Due to the overlapping cells of the Vanka smoother, certain degrees of freedom are updated multiple times over the course of a single sweep of relaxation. Therefore, classical LFA techniques fail. LFA is based on the assumption that the iteration matrix ℎ can be written as a Toeplitz operator. It is not apparent that the same is true for overlapping smoothers. An extension of the classical LFA analysis for the Vanka smoothers was first introduced by Sivaloganathan in 19 . MacLachlan and Oosterlee generalized the LFA techniques to analyze overlapping smoothers for other PDEs and discretizations 20 . Their paper provides theoretical results in order to use the Fourier ansatz. A necessary assumption is that the error-propagation operator for coupled (overlapping) relaxation is block-diagonalized by the Fourier matrix, regardless of the distribution of degrees of freedom within the block. The key step in proving this is to show the following inductive step: if the errors before relaxation on degrees of freedom within a block , satisfy a generalized LFA ansatz, then the errors after relaxation on , satisfy the same ansatz advanced by one cell. MacLachlan and Oosterlee show that the error-propagation matrix for any coupled relaxation is an infinite-grid block-multilevel-Toeplitz matrix 20 . This means that we can attempt to analyze these techniques using classical multigrid smoothing and two-grid Fourier analysis tools to measure the effectiveness of the resulting multigrid cycles. Practical guidelines to obtain the smoothing symbol ℎ ( ) of the Vanka smoother to analyze this relaxation method are provided in 11 . For more details and a generalization to other overlapping smoothers, we refer to 20 . Due to the non-overlapping blocks, the LFA analysis for the triad-wise smoother is based on a simple expansion of the assumptions of LFA for scalar PDEs as described in Section 3.1. The error-propagation operator for triad-type smoothers, represented by an operator ℎ is The corresponding smoothing symbol ℎ ( , ) is given bỹ with the identity matrix and̃ ℎ ( ) as given in (11). The symbol of the block iteration matrix̃ ℎ can be given bỹ for a weighted Gauss-Seidel updating process and bỹ for a weighted Jacobi updating process.
For systems of PDEs that are discretized on staggered grids, the analysis of the coarse grid correction is challenging due to the different location of unknowns. In that case, LFA of transfer operators cannot be done as in the scalar case. MacLachlan and Oosterlee 20 discussed Fourier representations of grid-transfer operators for general staggered meshes in the context of systems of PDEs. Here, we have three types of grid points on the fine and coarse grid. The restriction operator can be decomposed based on the partitioning of dofs associated with the location of the unknowns on the grid. Osterlee and MacLachlan introduced, explained and proved the need to modify LFA due to this staggering. We emphasize this statement in Remark 2. More details can be found in 17,20,21 .
Remark 2. When calculating the symbol of a restriction operator that mixes different types of dofs, we must split it into the different types of dofs that it restricts from and to. If the restriction operator is defined on a staggered grid, we have = 1 ∪ 2 ∪ 3 (c.f. (10)) and the symbols depend on the location of the unknowns.
In the next paragraph, we show how to determine the symbol of a restriction operator defined on a staggered mesh. A similar structure for interpolation is established subsequently. Let ℎ ( , ) = ∕ . We have the following equality Note that this equality is different to the classical theory and was developed by MacLachlan and Oosterlee in 20 . The following relations are based on this update of the classical theory. Due to the definition of Note that the following relations hold, Then, the Fourier representation of ℎ is given by the (3 × 12) matrix ℎ ( ) = ̃ ℎ ( (0,0) )̃ ℎ ( (1,1) )̃ ℎ ( (1,0) )̃ ℎ ( (0,1) ) , wherẽ ℎ ( ) is given in Definition 2. Note that the matrices̃ ℎ ( ) are diagonal matrices since the restriction operators of different type of unknowns are not coupled, where * denotes the resulting coarse-grid point.
Based on Definition 3, we are able to compute the entries of̂ ℎ ( ). We show the computation of the symbol for the restriction operator ℎ | | | , i.e., the first row of the matrix̂ ℎ ( ).
ℎ ( (0,0) ) | | | =  20 ) gives the symbols of interpolation operatorŝ ℎ ( ). We analyze the two-grid method based on the smoothing operators as introduced in the previous section. Based on the updated restriction and prolongation operators, we perform the analysis of the two-grid method with triad-wise and Vanka smoothing procedures. The asymptotic two-grid convergence factor is given by the spectral radius

Analysis results
In this section, we present LFA two-grid results with a focus on bilinear interpolation and a Galerkin coarse grid operator. A comparison in 11 shows that the combination of these operators promises best results for the Stokes equations. We compare the results obtained by Vanka and triad-wise smoothers. We use a 33 × 33 staggered grid and two pre-and postsmoothing steps ( =2) during the course of this section. Figure 4 presents the LFA convergence factors for the Vanka and triad-wise smoother as a function of the weighting parameter . This shows the sensitivity of performance to parameter choice. On the basis of these parameter studies, c.f. Figure 4, we figured out that the weighting factor = 0.7 gives the most promising smoothing behavior for the Vanka and triad-Gauss-Seidel(GS) smoother. While for the triad-Jacobi updating process, we stick to the weighting factor = 0.45 in the following. As visualized in Figures 5, 6  The performance predicted by LFA shows the potential of the triad-wise smoother.

NUMERICAL EXAMPLES
In this section, we first present convergence results of a two-grid and a V-cycle multigrid method for the Stokes equations. We start with periodic boundary conditions and observe convergence behavior as predicted by LFA in the previous section. A twogrid method for a homogeneous problem with periodic boundary conditions based on a 33 × 33 staggered grid is employed. We apply 20 two-grid cycles with two pre-and postsmoothing steps ( ∶= 1 = 2 = 2) and start with a random initial guess. We measure the convergence via the 2-norm of the error, i.e. we compute || || 2 || −1 || 2 after = 20 two-grid cycles. This gives a convergence factor of = 0.08 for the Vanka smoother with = 0.7, = 0.24 for the weighted ( = 0.7) triad-GS smoother    The spectral radius of the two-grid symbol for the triad-wise relaxation based on a weighted Jacobi updating process with = 0.45 and different frequencies . Convergence behavior of the multigrid method for the Stokes system with periodic boundary conditions. Figure 8 is based on a 33 × 33 staggered fine grid and a 3 × 3 coarse grid on the domain Ω = (0, 1) 2 discretized using the MAC scheme. We use two pre-and postsmoothing steps ( = 2) and a zero initial guess. The right-hand side is set to where the values x correspond to the grid points of the discretized domain Ω ℎ = { ∶= + , ∈ ℤ 2 }, with 1 = (0, ℎ∕2) and 2 = (ℎ∕2, 0). Figure 8 presents the convergence factors corresponding to the different unknowns , , individually. The convergence factors are obtained by computing the 2-norm of the residual for different numbers of multigrid V-cycles (iterations) for the Stokes equations, i.e. we compute || || 2 after each V-cycle. We notice good convergence factors for the Vanka and the triad-GS relaxation methods. However, the Vanka smoother converges even faster than the triad-GS smoother. The triad-Jacobi procedure does not lead to a good convergence rate. Next, we show convergence results for the Stokes system with Dirichlet boundary conditions. For the triad-wise smoother we focus on a Gauss-Seidel updating process, since we have seen that the triad-Jacobi smoother does not lead to good convergence results. For Dirichlet boundary conditions triad-Jacobi downgrades to a divergent method. Again, practical two-grid convergence factors for a homogeneous problem discretized on a 33 × 33 grid are given. We apply 20 two-grid cycles, set = 2 and start with a random initial guess. That gives a convergence of = 0.36 for the triad-wise smoother and = 0.10 for the Vanka smoother. Again, the Vanka smoother leads to good convergence results. In contrast, the triad smoother does not show satisfactory convergence for Dirichlet boundary conditions. For this reason, we investigate different possibilities to adapt the smoother. We start with additional smoothing steps and vary the order of updating different blocks. We use a lexicographical updating process, i.e. forward direction, and a backward direction as well as red-black(RB)-coloring. The results can be found in Table 1 Table 1 shows that a different ordering results in small effects only. An increase in the number of smoothing steps leads to better convergence results. However, we obtain a convergence factor of = 0.29 if we apply six pre-and postsmoothing steps. This is as expensive as using the Vanka smoother but less efficient in terms of the convergence rate. The poor convergence behavior of the triad-wise smoother results from the boundary treatment, as it does not treat the unknowns at the boundary (or near the boundary) in an appropriate way. Since the unknowns at the boundary are predefined, they are not treated by the relaxation. The idea of the triad-wise smoother is to relax three unknowns at the same time, which is not possible with Dirichlet boundary conditions regardless of the choice of the three unknowns. An illustration with an exemplary choice of triad blocks is given in Figure 9. We notice the blocks include three unknowns whenever possible. Close to the boundary we have blocks with two unknowns only. At one corner we have one unknown only. Figure 10   This observation leads to the following three adaptations of the boundary treatment that are able to improve the smoother but still don't lead to satisfying results. First, after each triad-wise-smoothing cycle, we add an additional smoothing at the critical corner by using an overlapping block as visualized in Figure 11a. This gives a convergence factor of = 0.41. By adding overlapping blocks not only at the corner but also at the boundaries, see Figure 11b, we get a convergence factor of = 0.37. By combining the triad updating process at the interior of the grid with the Vanka smoother at the boundary cells, see Figure  11c, we improve the convergence factor to = 0.22.
Triad updating blocks with additional overlapping blocks.
We choose zero boundary conditions for the -component and ± cos( ) and ± cos( ) for the -component of the velocity unknown. Figure 12 shows good convergence for the Vanka smoother (similar to that in the periodic case). In contrast, non of the triad-wise smoothing methods show satisfactory convergence for Dirichlet boundary conditions.

ALGORITHM DEVELOPMENT
The results of the local Fourier analysis, c.f. Section 3.3, and the numerical results for periodic boundary conditions, c.f. Section 4, in combination with the better parallelelizability and the reduced computational work demonstrate the potential of the triadwise relaxation method. However, the numerical results for the Stokes system with Dirichlet boundary conditions indicate the issue of the triad-wise method. In what follows, we show how to fix this issue. For this purpose more computational work is needed. The idea is to repeat the relaxation process four times while changing the unknowns contained in one block after each iteration. In addition, we vary the order in which the blocks are updated. This idea is illustated in Figure 13.
New algorithm for the triad relaxation method. The order of the procedure is performed as visualized in the figures from left to right.
The order of the iteration is visualized from left to right, i.e., Figure 13a illustrates the first iteration and Figure 13d the last one. This new algorithm improves the convergence results. By changing the order of relaxing different blocks, there is not a significant impact on the convergence factor. However, small variations within a range of 0.03 are noticeable and the relaxation process in lexicographical order, i.e., row-wise starting with the bottom left block, promises best results. The two-grid convergence factor of the updated version of the triad-wise smoother for a homogeneous problem with Dirichlet boundary conditions is = 0.04. Again, we applied 20 two-grid cycles with two pre-and postsmoothing steps starting with a random initial guess. Results of a multigrid method are visualized in Figure 14.
The modified procedure causes four times more computational work in comparison with the results of the original version of the triad-wise smoother, c.f. Section 4. This leads to a higher number of arithmetic operations for the triad-wise smoother compared to the Vanka smoother. We have a total number of 139 ⋅ ( − 1) 2 for the Vanka smoother versus 176 ⋅ ( − 1) 2 operations for the updated triad smoother. However, the new triad procedure convergences even better than the Vanka smoother. We have = 0.04 for the triad-wise smoother compared to = 0.10 for the Vanka smoother. Each of the individual substeps of the modified method can be parallelized in the same way as a whole step of the non-modified method. Nevertheless, the four steps are processed consecutively resulting in more synchronization steps.

PARALLEL IMPLEMENTATION ASPECTS
So far, we have seen a comparison between an overlapping and a non-overlapping smoother, analyzed their convergence behavior and potential for optimization. Applications for the Stokes equations result in systems of equations that are much larger than the test problems within this paper, i.e. with several million unknowns. Consequently, a serial implementation is usually not sufficient. As a result, we compare the parallelization potential of the overlapping smoother with the non-overlapping smoother in the following. The obvious difference between the two smoothers is the number of unknowns and equations that are included in one block, while the number of blocks that have to be updated per relaxation is the same for both smoothers. The arithmetic operations that we need for one relaxation step are 139 ⋅ 2 for the Vanka smoother compared to 44 ⋅ 2 for the original version of the triad-wise smoother, where describes the number of blocks. More details regarding the computation of the number of arithmetic operations can be found in 11 . Parallel implementation aspects that we take into account are the required coloring and the communication effort that is needed for computations in parallel.
The required colorings to perform the smoothing steps in parallel are visualized in Figure 15. The Vanka smoother needs five different colors, while the triad-wise smoother requires a red-black coloring (i.e. two colors), only. An important parallelization aspect is the "dependencies", i.e., all neighboring unknowns that are needed to update an unknown. To perform a method in parallel without read and write conflicts it is necessary to avoid simultaneous updates of dependent unknowns. The coloring scheme is based on depenencies. The unknowns of one triad block depend on the unknowns of adjacent cells. Consequently, a red-black coloring is sufficient, c.f. Figure 15b. That means all unknowns of the red triad blocks can be updated simultaneously while all unknowns of the black triad blocks can be updated simultaneously. Whereas, the coloring scheme of the Vanka smoother is illustrated in Figure 15a. Due to the additional unknowns that are included in one Vanka block compared to the triad block, the unknowns do not solely depend on the adjacent blocks. Due to the MAC scheme as given in Equations (5) -(7) a five coloring is sufficient to perform the Vanka smoother in parallel. The repetition of the coloring schemes are visualized in Figure 15b for the red-black pattern of the triad-wise smoother and in Figure 16 for the five color Vanka pattern. The communication effort differs due to the different numbers of unknowns that are updated within each block smoothing step. Moreover, the dependence on updated unknowns of neighboring cells leads to further differences. For the Vanka smoother 60⋅ 2 communication steps are necessary, while for the triad-wise smoother 12⋅ 2 are needed for one smoothing iteration. One smoothing step of the Vanka smoother requires as many communication steps as five smoothing steps of the triad-wise smoother. These results show the low computational cost in combination with good parallelization properties of the triad-wise smoother. However, a parallel implementation may lead to different convergence factors due to the coloring scheme, i.e. a modified order of updating the unknowns.
We implement a two-grid method in serial with the colored versions of both smoothers to get an idea how the coloring influences the convergence factors. We start with a comparison of numerial results based on the overlapping and non-overlapping smoother for the Stokes equations with periodic boundary conditions. The results are based on the same example and settings as described in Section 4. That leads to convergence factors of = 0.29 for the triad-wise smoother with red-black coloring. Compared to a convergence factor of = 0.06 for the 5-color Vanka smoother. We have similar results compared to the serial implementation without a coloring. In consideration of the low computational effort for the triad-wise smoother the convergence factor is reasonable. Multigrid results show the same effect, see Figure 17. This is not true for Dirichlet boundary conditions. As we already notice in Section 4, a different ordering of updating the unknowns may change the convergence results especially with Dirichlet boundary conditions. The two-grid method for the colored versions of both smoothers with Dirichlet boundary conditions leads to the following results. The triad-wise smoother with a Gauss-Seidel update of unknowns and a red-black coloring leads to = 0.58 as seen in Table 1. However, if we implement the modified version of the triad-wise smoother as described in Section 5 with a red-black coloring, we get a convergence factor of = 0.21. The novel triad-wise smoother consists of four consecutive steps, in which one single step is similar to the original triad-wise smoother. The consecutive execution of the four steps results in more synchronization steps. In comparison, the 5color Vanka smoother leads to convergence results of = 0.99. This extremely bad convergence rate for the Vanka smoother can be immensely improved by a 9-coloring (c.f. Figure 15 and 16) because of symmetry that is lost in the 5-color version. We obtain a convergence rate of = 0.10 for the two-grid method with the 9-color Vanka smoother. In comparison with the triad-wise smoother, the Vanka smoother with 9 colors clearly leads to better convergence results. As discussed before the computational costs of the Vanka smoother with 5 colors are approximately the same as the costs of the modified triad-wise smoother. The communication costs of the triad-wise smoother increase from 12 ⋅ 2 communication steps to 48 ⋅ 2 due to the repetition of the original triad-wise smoother for four times, which is still below the costs for the Vanka smoother. In addition, the 9-coloring leads to more consecutive processing and therefore decreases the parallelization of the Vanka smoother even more. Using the

FIGURE 17
Convergence behavior of the multigrid method for the Stokes system with periodic boundary conditions. triad-wise smoother instead of the Vanka smoother compromises the lower convergence rate with slightly lower communication costs and additionally good parallelization properties.

CONCLUSIONS
Focusing on the results of the serial implementation we draw the following conclusions. For the Stokes equations with periodic boundary conditions the two-grid and multigrid convergence of the well-known Vanka smoother is better than the convergence of the triad-wise smoother. However, the triad-wise smoother is less expensive than the Vanka smoother.
For the Stokes equations with Dirichlet boundary conditions the good convergence of the Vanka smoother remains unchanged, while the convergence of the triad-wise smoother is not satisfying. However, the modified triad-wise smoother leads to convergence results that are even better than the convergence results of the Vanka smoother. In comparison to the original triad-wise smoother we need to accept less parallelization potential for the modified triad-wise smoother. These results give a good understanding of the challenges arising from different boundary conditions.
For the parallel implementation, we observe similar characteristics. A clear advantage of the triad-wise smoother over the Vanka smoother is given in respect to its low computational costs and its good parallelization properties with a reasonable higher convergence rate. This is especially true for the Stokes equations with periodic boundary conditions. Dirichlet boundary conditions lead to similar difficulties for the parallel implementation compared to the serial algorithm for the triad smoother. In addition, the Vanka smoother needs additional coloring for the two-grid method to obtain a good convergence rate. This leads to a further decrease of parallelization properties of the Vanka smoother and highlights the advantages of our novel triad-wise smoother.