Predicting Basin Stability of Power Grids using Graph Neural Networks

The prediction of dynamical stability of power grids becomes more important and challenging with increasing shares of renewable energy sources due to their decentralized structure, reduced inertia and volatility. We investigate the feasibility of applying graph neural networks (GNN) to predict dynamic stability of synchronisation in complex power grids using the single-node basin stability (SNBS) as a measure. To do so, we generate two synthetic datasets for grids with 20 and 100 nodes respectively and estimate SNBS using Monte-Carlo sampling. These datasets are used to train and evaluate the performance of eight different GNN-models. All models use the full graph without simplifications as input and predict SNBS in a nodal-regression-setup. We show that SNBS can be predicted in general and that the performance varies significantly between GNN-models. Furthermore, we observe interesting transfer capabilities of our approach: GNN-models trained on smaller grids can be applied directly to larger grids without retraining.


Introduction
The energy transition is one of the key aspects to meet the goals of the Paris Agreement [1] and its latest successor: Conference of the Parties 26 in Glasgow in 2021. Due to decentralization, reduced inertia as well as volatility in production, integrating renewable energies remains challenging. To safely operate future power grids, the impact of unavoidable fluctuations on the synchronous operating regime has to be limited. Hence, dynamic effects have to be taken into account. Analyzing the dynamic stability of synchronisation in power grids is a complex multi-dimensional problem and many known methods rely on heavy simulations.
The model underlying the recent work on the stability of synchronization and complex dynamics of power grids, e.g. [2], is the Kuramoto model [3] with inertia. In complex system science it also serves as a paradigmatic model for the study of complex using GNNs. Our paper is based on a master's thesis [26]. Apart from this thesis, the authors are not aware of any literature using the same methods and ideas, but we introduce related work that builds on similar approaches.
Similar approaches There are recent publications on using Graph Neural Networks in the context of power grids, but they do not consider the prediction of statistical dynamical properties such as SNBS. Instead, many approaches deal with the computation of power flows [27,28,29,30,31,32]. GNNs have also been used for control theory [33] and physical neural solvers have been introduced to connect GNNs with differential equations [34]. Furthermore, cascading failures were investigated in [35].
Aside from GNNs, two other publications are worth mentioning. Che et al. [36] recently showed how active learning and relevance vector machines can be used to reduce the computational effort of computing SNBS by learning the boundary of stable dynamics. Furthermore, Yang et al. [37] predict the ability of power grids to synchronize after applying perturbations, but they approach the concept of dynamic stability differently. Firstly, they predict the result of single perturbations and not the statistics. Secondly, their approach is not based on providing the full graph. Instead, they rely on common knowledge about the relation between network science and dynamic stability, e.g. by using the degree and betweenness [38] as input.

Our main contributions
(i) For the first time, SNBS is predicted based on the full graph instead of hand-crafted features. The focus lies on evaluating different learning methodologies based on GNNs for the sake of future research. The accuracy still needs to be improved for real world applications.
(ii) In order to train ML-models, we generate new datasets. They are based on well-known models of synthetic power grids and on Monte-Carlo simulations to analyze dynamic stability. The datasets are rich enough to challenge ML-methods, while still being conceptual enough to connect to the existing network literature. Compared to real-world power grids, synthetic power grids have a number of advantages: for example, they do not have any artifacts, and large datasets, which are beneficial for statistical analyses, can be obtained more easily.
(iii) We also investigate transductive transfer learning capabilities by training models on small power grids and evaluating the same models on larger networks without fine-tuning.
This paper is structured as follows. Firstly, the generation of the datasets is explained. Afterwards, the background knowledge for the used ML-methods is introduced, before we present the methodology of applying our ML-models to our generated datasets. Finally, the results are given and discussed, before we close with a short outlook.

Generation of the datasets
To analyze the capability of predicting SNBS using ML, two synthetic datasets are generated. We generate new synthetic datasets, because we are especially interested in a method that can deal with different topologies. We start by motivating the selection of our datasets. Afterwards we briefly discuss relevant concepts from network science, before explaining the generation of synthetic power grids. We close by providing details about the dynamical simulations.

Objectives for datasets
High-quality datasets facilitate the application of ML-methods. Therefore, we carefully consider the following criteria for generating the datasets which mimic basic features of power grids. The datasets shall be: (i) homogeneous enough in both structure and dynamics to connect to network theory, (ii) complex enough to be challenging for ML-methods, (iii) computationally feasible using highly accurate Monte-Carlo simulations.
Firstly, homogeneity is important, because previous studies, e.g. by Nitzbon et al. [15], have shown that there are clear relations between dynamical stability and topological properties for somewhat homogeneous grids. As these patterns are known to exist in such homogeneous graph datasets, they are ideal to test ML systems, which can be expected to learn them.
Secondly, enough complexity is required to justify machine-learning models. This complexity is inherently given in the problem setup, as SNBS is a highly non-linear measure. Furthermore, we consider different network topologies.
Thirdly, we need to find a compromise between computational effort and relevant properties of the datasets, such as grid size, number of grids, low statistical errors which are determined by the number of Monte-Carlo samples and low numerical errors, which depend on the dynamical solver settings. Low statistical errors are crucial to distinguish small performance differences between ML-models later on.
Prior to generating the datasets, the influence of many parameters is investigated. We briefly motivate and explain the most important parameters for the generation of the datasets. As previously mentioned, Nitzbon et al. [15] observed interesting relations in their dataset, so we often select properties based on their investigations. Before looking at power grids in more detail, some background knowledge on graphs is needed, because power grid modeling relies on graphs.

Network Science: graphs
We briefly introduce theoretical background on graphs, which is also helpful to understand GNNs later on. Graphs consist of nodes (vertices) and lines (edges) connecting two nodes. The size of a graph is given by its number of nodes $N$. To encode the topology of a graph one can use the adjacency matrix $A$, which is defined by:
\[ A_{ij} = \begin{cases} 1 & \text{if there is a line between nodes } i \text{ and } j, \\ 0 & \text{otherwise.} \end{cases} \]
By using the degree, which is defined as the number of neighbors of a node, we can formulate the diagonal degree matrix $D$. Using $A$ and $D$, we can compute the graph Laplacian $L = D - A$, which is a singular matrix that is a discrete analogue of the Laplace operator.
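The definitions of the adjacency matrix, the degree matrix and the graph Laplacian can be made concrete with a small sketch. It uses plain Python lists to stay dependency-free; the function names and the 4-node example grid are illustrative assumptions and are not taken from the published code.

```python
def adjacency_matrix(num_nodes, lines):
    """Build the symmetric adjacency matrix A from a list of undirected lines."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in lines:
        A[i][j] = 1
        A[j][i] = 1
    return A


def degree_matrix(A):
    """Diagonal matrix D holding the number of neighbors of each node."""
    n = len(A)
    return [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]


def graph_laplacian(A):
    """Graph Laplacian L = D - A."""
    D = degree_matrix(A)
    n = len(A)
    return [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]


# Hypothetical 4-node grid: a triangle (0, 1, 2) with node 3 attached to node 2.
lines = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = adjacency_matrix(4, lines)
L = graph_laplacian(A)
```

Every row of `L` sums to zero, so the constant vector is mapped to zero, which is one way to see that the Laplacian is singular.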

Power grids
The topology of the power grids is based on the tool Synthetic Networks [39]. This package uses a parametric growth process to generate networks. The resulting networks have properties that are consistent with observations of real-world power grid networks. We use the same parametrization as Nitzbon et al. [15]: $n_0 = 1$, $p = 1/5$, $q = 3/10$, $r = 1/3$, $s = 1/10$, where $n_0$ is the initial number of nodes, $p$ and $q$ are probabilities related to constructing new lines, $s$ is the probability of splitting an existing line and $r$ is a parameter controlling the generation of redundant paths. Furthermore, half of the nodes are producers, whereas the other half are consumers. All nodes are modeled by the swing equation [41], also referred to as a second-order Kuramoto model [3,42]. The Kuramoto model was independently introduced in the context of power grids in [43] and has a long history of study there. We use the following notation:
\[ \ddot{\phi}_i = P_i - \alpha \dot{\phi}_i - \sum_{j} K_{ij} \sin(\phi_i - \phi_j), \tag{2} \]
where $\phi_i$, $\dot{\phi}_i$, $\ddot{\phi}_i$ denote the voltage angle at node $i$ and its time derivatives. We use the following parametrization: $P_i \in \{-1, 1\}$ is the injected power, whereby the condition $\sum_i P_i = 0$ guarantees power balance; $\alpha = 0.1$ is the damping coefficient; $K$ is the coupling matrix based on the adjacency matrix which encodes the graph, and we use uniform coupling $K_{ij} = 9 A_{ij}$. The values for the injected power and the damping coefficient are the same as in [15]. However, we use a larger coupling (9.0 instead of 6.0) to increase the overall stability of synchronisation in the power grids and to obtain a clear bi-modal shape of the SNBS-distribution, which gives a better balance for training ML-methods. We are interested in deviations from the nominal frequency (e.g. 50 Hz in Europe), and thus work in frequency deviations throughout the paper. The desired state is thus $\dot{\phi}_i = 0$ at all nodes.
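As a minimal sketch of the node dynamics, the right-hand side of the swing equation can be written as a first-order system in the angles and the frequency deviations. The damping and coupling values are those stated in the text; the function name, the state layout and the two-node example grid are assumptions made for illustration only.

```python
import math

ALPHA = 0.1     # damping coefficient from the text
COUPLING = 9.0  # uniform coupling, K_ij = 9 * A_ij


def swing_rhs(phi, omega, P, A, alpha=ALPHA, coupling=COUPLING):
    """Return (dphi/dt, domega/dt) for all nodes, where omega = dphi/dt."""
    n = len(phi)
    dphi = list(omega)
    domega = []
    for i in range(n):
        # Power flowing from node i to its neighbors over the coupled lines.
        flow = sum(coupling * A[i][j] * math.sin(phi[i] - phi[j]) for j in range(n))
        domega.append(P[i] - alpha * omega[i] - flow)
    return dphi, domega


# Hypothetical two-node grid: one producer (P = +1) and one consumer (P = -1).
A = [[0, 1], [1, 0]]
P = [1.0, -1.0]

# In the synchronous state the line carries exactly the injected power,
# 9 * sin(phi_0 - phi_1) = 1, and all frequency deviations vanish.
phi_sync = [math.asin(1.0 / COUPLING), 0.0]
omega_sync = [0.0, 0.0]
dphi, domega = swing_rhs(phi_sync, omega_sync, P, A)
```

Both returned lists are zero up to rounding, which confirms that this state is a fixed point of the model. Any standard ODE solver can be applied to `swing_rhs` to obtain trajectories after a perturbation.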

Dataset properties
We study the resilience of power grids operating in their synchronous state to (large) perturbations at individual nodes. The single-node basin stability of a node is quantified as the probability that the system returns to its synchronized state after such a network-local perturbation. Since the perturbations are drawn independently at random, SNBS is the outcome of a Bernoulli experiment [6].
To estimate SNBS for every node in a graph, $M = 10{,}000$ samples of perturbations per node are constructed by sampling a phase and frequency deviation from a uniform distribution with $(\phi, \dot{\phi}) \in [-\pi, \pi] \times [-15, 15]$ and adding them to the synchronized state. Each such single-node perturbation serves as an initial condition of a dynamic simulation of our power grid model, cf. Equation (2). The simulation time is represented by $t$ in seconds. At $t = 500$ the integration is terminated and the outcome of the Bernoulli trial is derived from the final state. A simulation outcome is referred to as stable if $|\dot{\phi}_i| < 0.1$ at all nodes. Otherwise it is referred to as unstable. Two exemplary trajectories are shown in fig. 1.
The classification threshold of 0.1 is chosen to account for minor deviations due to numerical noise and slow convergence rates within a finite time horizon. The authors are not aware of any other attractors of the Kuramoto system within that threshold. Hence, it may be assumed that every trajectory labeled as stable in that way will indeed converge to the synchronous state for $t \to \infty$. On the other hand, trajectories that are classified as unstable may converge to many different kinds of attractors [44,45]. However, we occasionally observed so-called long transient states at specific nodes, which do eventually converge to the synchronous state but fail to do so before $t = 500$. While of theoretical interest, we do not expect their asymptotic behaviour to play any role in real-world applications and thus we are satisfied with classifying them as unstable.
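The sampling procedure can be sketched as follows. The perturbation box, the number of samples and the stability threshold are taken from the text, and the threshold is applied to the magnitude of the frequency deviation, which is an interpretation. The callable `simulate` is a hypothetical placeholder for the dynamic simulation, and `toy_simulate` is a made-up stand-in that is not a grid model.

```python
import math
import random


def is_stable(final_frequencies, threshold=0.1):
    """Stability criterion: |frequency deviation| below the threshold at all nodes."""
    return all(abs(w) < threshold for w in final_frequencies)


def estimate_snbs(simulate, num_samples=10_000, seed=0):
    """Fraction of single-node perturbations after which the grid resynchronizes.

    `simulate(dphi, domega)` is a hypothetical callable that perturbs one node,
    integrates the grid model up to t = 500 and returns the final frequency
    deviations of all nodes.
    """
    rng = random.Random(seed)
    stable = 0
    for _ in range(num_samples):
        dphi = rng.uniform(-math.pi, math.pi)   # phase perturbation
        domega = rng.uniform(-15.0, 15.0)       # frequency perturbation
        if is_stable(simulate(dphi, domega)):
            stable += 1
    return stable / num_samples


def toy_simulate(dphi, domega):
    """Stand-in for the real simulation, NOT a grid model: pretends that the
    grid only recovers from frequency kicks with |domega| < 12."""
    return [0.0, 0.0] if abs(domega) < 12.0 else [0.0, 5.0]


snbs = estimate_snbs(toy_simulate)
```

For the toy stand-in the estimate lands near 24/30 = 0.8, because the frequency perturbation is uniform on [-15, 15]. In the actual workflow `simulate` would wrap a numerical integration of Equation (2).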
A 95% confidence interval for the SNBS values may be estimated via the normal distribution approximation of the Bernoulli experiment as [46]:
\[ p \pm 1.96 \sqrt{\frac{p(1-p)}{M}}, \qquad \text{with} \qquad 1.96 \sqrt{\frac{p(1-p)}{M}} \le 0.0098, \]
where $p$ denotes the estimated SNBS and the inequality is obtained by setting $p = 0.5$ and $M = 10{,}000$. The distributions of SNBS for both datasets are given in fig. 2. We refer to the dataset consisting of grids of 20 nodes per grid as dataset20, and to the dataset consisting of grids with 100 nodes as dataset100. In both cases, there is a bi-modal distribution of SNBS over the whole dataset, which makes it easier for ML-models to learn the distinction between those modes. The peak at 1.0 indicates a large number of nodes where no perturbation has an adverse effect on the synchronisation. The second peak can be interpreted as follows: many nodes are somewhat resistant to perturbations, and the grid stays synchronised in about 80% of the cases when perturbations are applied at these nodes. For dataset20 the mean value of SNBS is 0.84 and for dataset100 it is 0.87. In both datasets, the number of unstable outcomes is low, which is a property we expect to hold for real power grids as well. Computing the dynamic stability on one CPU takes about 45 hours per grid for grids with 100 nodes and about three hours for grids with 20 nodes.
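The size of the statistical error follows directly from the normal approximation. The sketch below evaluates the half-width of the 95% interval, using the standard factor 1.96; the function name is an assumption for illustration.

```python
import math


def snbs_confidence_halfwidth(p, num_samples, z=1.96):
    """Half-width of the normal-approximation confidence interval (z = 1.96 for 95%)."""
    return z * math.sqrt(p * (1.0 - p) / num_samples)


# Worst case p = 0.5 with M = 10,000 samples per node.
worst_case = snbs_confidence_halfwidth(0.5, 10_000)
# Half-width at the mean SNBS of dataset100 reported in the text.
typical = snbs_confidence_halfwidth(0.87, 10_000)
```

The worst case evaluates to 1.96 * 0.5 / 100 = 0.0098, and any other value of p gives a narrower interval because p(1 - p) is maximal at p = 0.5.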

Graph Neural Networks
This section briefly introduces Graph Neural Networks (GNNs). We begin with a general framework for GNNs and subsequently summarize the recent development of GNNs. Graph Neural Networks are a class of Artificial Neural Networks (ANN) designed to learn relationships in graph-structured data. Like other ANNs, they have internal weights, which can be fitted in order to adapt their behavior to the given task. In the case of supervised learning these weights are adjusted such that the error between the estimated output and the labeled output for given input data is minimized. As inputs GNNs use the graph structure and potentially node features. Their output can either be global graph attributes, attributes of sub-graphs, or local node properties. Different types of GNNs have been introduced, some of which are detailed below. In [47], the authors introduce a design space for GNNs as a common framework to facilitate understanding and comparison of the different methods. In their design space, GNNs consist of pre-processing, message-passing and post-processing layers. GNN architectures vary in layer number and connectivity, as well as in the intra-layer design of the message-passing layers. The authors of [47] view message-passing layers as combinations of (i) message computation and (ii) aggregation. First, a message function computes a message for each node from its current state. Second, the messages are aggregated over the neighborhood to a new node state. Both message computation and aggregation can be realized in different ways.
Common ML-methods such as batch normalization [48] or dropout [49] can be added to stabilize training. The application of non-linear activation functions enables GNNs to learn non-linear relations in the graph data. In this work we focus on convolutional GNNs and in particular on those employing spatial-based graph convolutions, because they can be applied to varying topologies, as we have in our datasets. Graph convolutions are based on the concept of the Graph Fourier transform, a generalization of the classical Fourier transform (FT), which enables the remarkable success of Convolutional Neural Networks (CNN) in image recognition. Unlike the classical FT, which uses exponential shifts, the Graph FT corresponds to an expansion of the function on the graph in terms of the eigenvectors of the graph Laplacian $L$. Such an expansion may in turn be multiplied with a function of the graph's eigenvalues, a so-called spectral filter. While it is possible to learn spectral filters from training data, they lack many of the nice properties of the convolution kernels used in CNNs: they are not localized in node space, computing the eigenbasis is expensive and trained models cannot be evaluated on different graphs, since each graph has a unique spectrum.
An important insight of [50] was that graph spectral filters can be approximated by polynomials of the graph's adjacency matrix $A$, thus achieving a localization of the filter in the ($k$-th order) neighborhoods of the nodes. Subsequently, in their seminal paper Kipf and Welling [51] realized that it suffices to consider only the linear term of the polynomial expansion, corresponding to a simple multiplication of the node features with the (renormalized) adjacency matrix. They arrived at a computationally efficient and powerful layer architecture that relies only on local information and generalizes well to different graphs. Several GNN models that we investigate in this paper were derived from their so-called Graph Convolutional Layer (GCN):
\[ H = \sigma(\hat{A} X \Theta), \]
where $H$ denotes the output of a layer, $\sigma$ is the activation function, $X$ are the input features, $\Theta$ is a matrix containing the learnable weights and $\hat{A}$ is the renormalized adjacency matrix, given by $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. Further, $\tilde{A} = A + I$, where $I$ is the identity matrix, denotes an adjacency matrix with added self-loops, and the diagonal degree matrix $\tilde{D}$ is determined by $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. In the design space of [47], $X\Theta$ represents the message computation, while $\hat{A}$ realizes the aggregation. By consecutively applying multiple GCN-layers, not only direct neighbors are taken into account, but also neighbors at further distance.
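A single graph convolution of this kind can be sketched without any ML library. The code below separates message computation (features times weights) from aggregation (multiplication with the renormalized adjacency matrix). The ReLU activation, the weight values and the 3-node example are illustrative assumptions; the models in this paper rely on the layers provided by PyTorch Geometric rather than on this sketch.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def renormalized_adjacency(A):
    """Add self-loops to A and normalize symmetrically with the new degrees."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(M):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(A, X, Theta, activation=relu):
    """One graph convolution: activation(A_hat X Theta)."""
    messages = matmul(X, Theta)                              # message computation
    aggregated = matmul(renormalized_adjacency(A), messages)  # neighborhood aggregation
    return activation(aggregated)


# Hypothetical 3-node line graph 0 - 1 - 2, injected power as the only node feature.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0], [-1.0], [1.0]]
Theta = [[0.5, -0.5]]  # illustrative weights mapping 1 feature to 2 channels
H = gcn_layer(A, X, Theta)
```

Each row of `H` mixes the feature of a node with those of its direct neighbors only. Applying `gcn_layer` twice lets information travel two hops, which is the stacking effect described above.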
Instead of stacking multiple GCN-layers, Wu et al. [52] removed the activation functions, combined all weight matrices into one and computed $\tilde{A}^i$ to obtain:
\[ H = \tilde{A}^{i} X \Theta. \]
This layer is based on their assumption that the nonlinearity between GCN layers is not crucial and may be omitted in order to reduce computational effort. We refer to this layer as Simple-Graph-Convolution (SG). Du et al. [53] used multiple exponents $i$ of $\tilde{A}$ within one layer according to the following scheme:
\[ H = \sum_{i=0}^{k} \tilde{A}^{i} X \Theta_i. \]
This layer type is called Topology Adaptive Graph Convolution (TAG), which refers to its ability to consider different topologies. However, this is the case for all methods that are introduced in this paper. This architecture provides an extension to GCNs by incorporating information about higher-order neighborhoods within one layer. Auto-Regressive Moving Average (ARMA) neural network layers by Bianchi et al. [54] are far-reaching generalizations of GCN layers. They are derived from a rational expansion of the spectral filter instead of a polynomial expansion. A complete ARMA layer itself consists of multiple Graph Convolutional Skip (GCS) layers:
\[ X^{(j+1)} = \sigma\left(\hat{L} X^{(j)} W^{(j)} + X V^{(j)}\right), \]
where $j$ is an index and $W$ and $V$ are matrices of trainable parameters. There are two important distinctions from the GCN layers. First, the aggregation in the first term uses the normalized Laplacian $\hat{L} = I - D^{-1/2} A D^{-1/2}$ instead of $\tilde{A}$. Second, the connectivity of the message-passing layers is augmented with a skip connection, implemented in the second term. It recursively re-inserts the initial node features $X$ from the first layer and thus enables stacking a large number of GCS layers, while preventing the loss of the initial information due to Laplacian smoothing. In order to reduce the computational effort and to reduce overfitting, the weights among different GCS layers are shared: $W^{(j)} = W$ and $V^{(j)} = V$, except for the first layer, where $W^{(1)} \neq W$.
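How powers of the adjacency matrix extend the reach of the SG and TAG layers can be illustrated on a small path graph. The sketch below is an assumption-laden simplification: it uses the unnormalized adjacency matrix with self-loops, weight matrices equal to one, and no activation, so that the hop structure is easy to read off. The function names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_powers(A, max_power):
    """Return [A^0, A^1, ..., A^max_power]."""
    n = len(A)
    powers = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for _ in range(max_power):
        powers.append(matmul(powers[-1], A))
    return powers


def sg_layer(A, X, theta, power):
    """SG-style layer: one propagation with A^power and a single weight matrix."""
    return matmul(matmul(matrix_powers(A, power)[-1], X), theta)


def tag_layer(A, X, thetas):
    """TAG-style layer: sum over i of A^i X Theta_i, one weight matrix per hop."""
    powers = matrix_powers(A, len(thetas) - 1)
    H = [[0.0] * len(thetas[0][0]) for _ in X]
    for A_i, theta_i in zip(powers, thetas):
        term = matmul(matmul(A_i, X), theta_i)
        H = [[h + t for h, t in zip(h_row, t_row)] for h_row, t_row in zip(H, term)]
    return H


# Path graph 0 - 1 - 2 - 3 with self-loops; only node 0 carries a nonzero feature.
A_tilde = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]]
X = [[1.0], [0.0], [0.0], [0.0]]
one = [[1.0]]

two_hop = sg_layer(A_tilde, X, one, power=2)
tag_out = tag_layer(A_tilde, X, [one, one, one])
```

With two hops, the signal from node 0 reaches node 2 but not node 3, which is three lines away. The TAG-style sum keeps the contributions of the zeroth, first and second power separate until they are weighted, whereas the SG-style layer only sees the highest power.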
To increase their expressive power, multiple ARMA layers may be combined in a parallel stack:
\[ \bar{X} = \frac{1}{K} \sum_{k=1}^{K} X_k^{(J)}, \]
where $X_k^{(J)}$ is the output of the last GCS layer in the $k$-th ARMA layer. We can also interpret $J$ as the number of possible hops, and by increasing $J$ larger regions are taken into account. ARMA filters, with their recursive and distributed formulation, are efficient to train and capable of learning complex information. All of the layers described above are used in the models introduced in the next section.
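The interplay of GCS layers, weight sharing and the parallel stack can be sketched as follows. This is a simplified reading of the description above and not the PyTorch Geometric implementation used for the experiments: the ReLU activation, the averaging over stacks, all weight values and the two-node example are assumptions for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def relu(M):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in M]


def gcs_stack(L_hat, X, W_first, W, V, num_layers):
    """Apply num_layers GCS layers. W and V are shared across layers,
    only the first layer uses its own weight matrix W_first."""
    state = X
    for j in range(num_layers):
        weights = W_first if j == 0 else W
        propagated = matmul(matmul(L_hat, state), weights)  # aggregation term
        skip = matmul(X, V)                                 # re-inserted initial features
        state = relu(add(propagated, skip))
    return state


def arma_layer(L_hat, X, stacks, num_layers):
    """Average the outputs of K parallel GCS stacks."""
    outputs = [gcs_stack(L_hat, X, W_first, W, V, num_layers)
               for (W_first, W, V) in stacks]
    K = len(outputs)
    return [[sum(out[i][c] for out in outputs) / K for c in range(len(outputs[0][i]))]
            for i in range(len(X))]


# Hypothetical two-node grid: A = [[0, 1], [1, 0]], so I - D^(-1/2) A D^(-1/2) is:
L_hat = [[1.0, -1.0], [-1.0, 1.0]]
X = [[1.0], [-1.0]]                      # injected power as the only feature
params = ([[0.5]], [[0.5]], [[1.0]])     # illustrative W_first, W, V
out = arma_layer(L_hat, X, [params, params], num_layers=2)
```

Because the skip term `matmul(X, V)` is added in every layer, the initial features stay present however many GCS layers are stacked.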

Prediction of SNBS using Graph Neural Networks
To predict SNBS of all nodes, we use a node-regression setup, by providing the adjacency matrix of the graph and the injected power per node P i as inputs. The process is shown in fig. 3. In order to test the performance of our models on unseen data, we split the datasets into training and testing sets. The shift between them is marginal as can be seen in Table 1.

Setup of our GNN-models
Based on the introduced GNN layers, eight GNN-models are analyzed to evaluate the performance of different architectures. GNNs are capable of reading in the full graph without any simplifications. We also tried to use CNNs, which are well known from image analysis. In case of CNNs, the graph information is provided by using a modified version of the adjacency matrix as input, but the setup had several limitations in comparison to the GNNs. The application of CNNs is shown in Appendix C. In Table 2 the GNN-models are briefly introduced. All models use one type of graph convolutional layer, but may use several of them, and all have one linear and one sigmoid layer at the end. Additionally, dropout is used in several cases, cf. Appendix B. We did not do a systematic investigation of hyperparameters such as number of layers and their properties, but focused on identifying relevant factors to enable training.
Table 2, footnote: * There is a batch normalization between first and second layer.

Training setup
For all models the same parameters are used and the training consists of 500 epochs. To enable reproducibility, the seeds are set before training and can be found in the published source code. The training is based on the library PyTorch [55], and for the graph handling and graph convolutional layers the additional library PyTorch Geometric [56] is used. The models are trained on CPUs and, depending on the model, training takes between 20 and 50 minutes on either Haswell or Broadwell architecture without parallelization. The detailed training parameters, e.g. batch sizes, and additional information on the computational effort are given in Appendix B. As loss function we use the mean squared error.

Results
To evaluate the performance of different models, we use the $R^2$ score, also known as the coefficient of determination, and a self-defined discretized accuracy. The score $R^2$ is computed by:
\[ R^2 = 1 - \frac{\mathrm{mse}(y, t)}{\mathrm{mse}(t_{\mathrm{mean}}, t)}, \]
where mse denotes the mean squared error, $y$ the output of the model, $t$ the target value and $t_{\mathrm{mean}}$ the mean of all considered targets of the test dataset. The standard measure of performance is $R^2$, which captures the mean squared error relative to a null model that predicts the mean of the test dataset for all points. A constant model that always predicts $t_{\mathrm{mean}}$, disregarding the input features, would get a score of $R^2 = 0.0$. The $R^2$-score is used to measure the portion of explained variance in a dataset. To further simplify interpretation, we rephrase the evaluation as a classification problem.
The outputs are categorized as true or false by using a threshold and we compute the accuracy as:
\[ \text{discretized accuracy} = \frac{\text{correct predictions}}{\text{number of samples}}. \]
We refer to this self-defined accuracy as discretized accuracy. Predictions are considered to be correct if the predicted output $y$ is within a certain threshold of the target value $t$: $|y - t| < \text{threshold}$. We set this threshold to 0.1, because this is small enough to differentiate between the modes in the distributions (see fig. 2). Furthermore, a total deviation of 0.1 between the prediction and the true output should be sufficient for most applications. The discretized accuracy depends on the distribution of SNBS, so it cannot be used for comparison across different datasets, but has to be compared to the null model of the corresponding dataset.
Since there is no previous work that can be easily compared to our methods, we introduce a simple baseline model. This baseline model always predicts the average value of the testing set. By design, this results in $R^2 = 0$ and achieves a discretized accuracy of 67.1% on dataset20 and of 40.9% on dataset100. The differences in discretized accuracy are rooted in the different distributions of the two datasets (cf. fig. 2).
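Both evaluation measures and the baseline model are short enough to be written out. The sketch below follows the definitions given above; the function names are assumptions, and the target and prediction values are made up for illustration and are not results from the datasets.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)


def r2_score(predictions, targets):
    """R^2: one minus the model mse relative to the mse of the mean predictor."""
    t_mean = sum(targets) / len(targets)
    return 1.0 - mse(predictions, targets) / mse([t_mean] * len(targets), targets)


def discretized_accuracy(predictions, targets, threshold=0.1):
    """Share of predictions that deviate less than `threshold` from the target."""
    correct = sum(abs(y - t) < threshold for y, t in zip(predictions, targets))
    return correct / len(targets)


# Made-up SNBS targets and model outputs, for illustration only.
targets = [1.0, 1.0, 0.8, 0.75, 0.4]
predictions = [0.95, 0.85, 0.82, 0.8, 0.7]

# Baseline model: always predict the mean of the targets.
baseline = [sum(targets) / len(targets)] * len(targets)

r2_model = r2_score(predictions, targets)
r2_baseline = r2_score(baseline, targets)
acc_model = discretized_accuracy(predictions, targets)
acc_baseline = discretized_accuracy(baseline, targets)
```

The baseline reaches an $R^2$ of zero by construction, while its discretized accuracy depends on how many targets happen to lie within 0.1 of the mean. This is why the accuracy has to be compared with the null model of the same dataset.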
We use an averaged performance to reduce the impact of initialization effects. Out of five different initializations per training setup, only the best three are considered to compute an averaged performance. The average $R^2$-performance is given in Table 3 and the discretized accuracy in Table 4. The best values are in bold. The training progress of the best model is shown in fig. 4. The fluctuations, especially visible at the bottom right in fig. 4, are typical for ML applications when using stochastic gradient descent (SGD) and constant learning rates during training.
Furthermore, we investigate whether the features learned by GNNs generalize to grids of different sizes. As datasets of large grids are costly to create, successful pretraining on smaller grids with subsequent application on larger grids would be a valuable strategy. To evaluate the transfer learning capabilities, we train GNN-models on the small dataset of grids with 20 nodes and evaluate them without fine-tuning on the dataset with large grids of 100 nodes. As performance of the transductive transfer learning, we report the $R^2$ and accuracy on the large target dataset using the term tr20ev100 (trained on dataset20, evaluated on dataset100).
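The averaging over the best three of five initializations amounts to a small selection step, sketched here under the assumption that "best" means the highest score of the respective metric. The scores in the example are made up and are not values from Table 3 or Table 4.

```python
def best_three_average(scores, keep=3):
    """Average the `keep` highest scores among the different initializations."""
    best = sorted(scores, reverse=True)[:keep]
    return sum(best) / len(best)


# Made-up R^2 values of five initializations, for illustration only.
r2_per_seed = [0.41, 0.55, 0.52, 0.30, 0.58]
averaged = best_three_average(r2_per_seed)
```

Discarding the two weakest runs makes the reported number less sensitive to occasional poor initializations, at the price of being more optimistic than a plain mean over all five runs.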
The results show that the prediction of SNBS using GNNs is feasible and that the choice of model has a large impact. We did not perform a detailed hyperparameter study of different GNN-models, so conclusions about their performance are tentative for now. For dataset20 and dataset100, the models are both trained on their training and evaluated on their test sections. To evaluate the transfer learning capabilities, we use the term tr20ev100 meaning that the model is trained on the dataset20, but evaluated on the dataset100.
Next, we briefly summarize our observations. The results indicate that increasing the complexity of the model can be beneficial, as the model ArmaNet2 with the largest number of parameters (1050) performs best. However, increasing the complexity is not always helpful. GCNNet3 for example performs worse than GCN2, even though it has more learnable parameters (149 instead of 107). The importance of the type of convolution is underlined by considering TAGNet1 and ArmaNet1, because TAGNet1 outperforms ArmaNet1 with only slightly more parameters. Figure 5 shows the relation between complexity and performance based on dataset100. The complexity is represented firstly by the number of learnable parameters on a logarithmic scale and secondly by the maximum number of possible hops. By hops we mean the order of neighbors that are taken into account. For example, one hop means that only direct neighbors are considered, whereas two hops means that nodes are also considered which are not directly connected, but are reachable via one direct neighbor.
Without conducting ablation studies, we can only guess reasons for the superiority of ArmaNet2. We suspect two main reasons: firstly, the largest number of parameters could be decisive; secondly, the most complex architecture, including skip layers to consider neighbors of higher order, could have a positive impact. The four GCS-layers of ArmaNet2 can consider a relatively large region. TAGNet1 also performs well, and this model can evaluate neighbors of 6th order by having two layers and three hops per layer. The benefit of ArmaNet2 can be emphasized by investigating tr20ev100, because ArmaNet2 outperforms all other models on dataset100, even if it is trained purely on dataset20. Hence, the model ArmaNet2 results in the most robust setup.
To further evaluate the performance of the investigated models, we analyze the distribution of the output of selected models in fig. 6. To this end, we only consider the output based on the best seed per model using $R^2$ as a criterion. The output of all models is restricted to somewhat large values, and neither low nor very high values of SNBS can be predicted. The small number of nodes with low SNBS in the dataset might explain the absence of low output values. In case of large output values, it is remarkable and a bit surprising that none of the models predicts the abundance of completely stable outcomes. This behaviour limits the applicability to real-world problems. The limitation of all models also becomes clear when comparing the results to the distributions introduced in fig. 2. Since the shifts within the datasets are small, we can compare the output distributions to the distributions of the entire datasets, even though fig. 6 only considers the test section.
The distributions of the output (fig. 6) also indicate performance differences between the models. We clearly see that GCN1, which has a relatively low performance, has a very limited range of output values, and all values are around the mean of the dataset. ArmaNet1 already has a wider range, whereas ArmaNet2 has the largest range. Besides the range, the shape of the distribution and modalities of the predictions are also telling, e.g. we find an indication of a bimodal distribution in case of TAGNet1. All in all, the superiority of ArmaNet2 and TAGNet1 becomes visible. However, even for those models the output is still limited to values that are larger than 0.6 and there is only a small number of predictions of high stability (SNBS ≈ 1).
To visually analyze the models, we plot the predicted output vs. SNBS in heat maps in fig. 7. Perfect predictions would lie on the diagonal only, corresponding to $R^2 = 1$. In contrast to the $R^2$ values shown in Table 3, the heat maps reveal some reasons for the performance differences. We see that ArmaNet2 and TAGNet1 can distinguish between nodes with SNBS ≈ 1 and nodes with lower SNBS. Other models, such as GCN1, have large regions off the diagonal, resulting in a lower performance.
Figure 6: Histograms showing the density of predicted outputs for different models on dataset100, using the best seed per model.

Conclusion and Outlook
The key result of this paper is a novel approach of estimating SNBS via GNNs. We have demonstrated its potential and have paved the way for further investigations. We show the necessity to use well-adapted architectures for this problem, since generic CNNs are not able to achieve comparable results even with more parameters (cf. Appendix C).
The strongest limitation of the presented results is probably the set of assumptions made for generating the datasets. The datasets match several properties of real power grids, but they also simplify some aspects, e.g. they lack heterogeneity of nodes (power input) and lines (coupling constant). However, the accuracy can still be increased before moving to more realistic setups, because the performance is still too low for real applications. We provide several ideas for improvements in the next paragraphs.
Since we see substantially improved performance for models with a larger number of parameters, testing more complex models seems very promising. More complex models might identify other relevant structures of networks to predict SNBS more accurately, and there is no indication that the performance is already saturating. As a first step, one could conduct a hyperparameter study to improve the investigated models.
In further steps, one could introduce new models to increase the performance. Firstly, new layers could be designed that specifically aim to predict SNBS and deal with power grids. Secondly, hybrid approaches might be used that incorporate knowledge about known structures, e.g. network motifs that can hardly be recognized by GNNs. Generally it is clear from our results that more complex architectures are promising for this task, even if it remains unclear exactly what direction the complexity increase should point towards.
Figure 7: Heat maps comparing the predicted output vs. SNBS for different models on dataset100, using the best seed for each model. The diagonal represents a potential perfect model ($R^2 = 1$).
Another key to improvement is the datasets. The datasets used are relatively small, so increasing their size might be an important step for training more complex models. To solve the issue of the limited range of outputs and the observation that the model outputs are around the mean of the datasets, balancing or weighting of samples might help.
Remarkably, we successfully showed that GNNs can generalize across different sizes of power grids. Another avenue for future research is to train models on grids of different sizes from the start. It is plausible that the overall performance can be increased when training the models on multiple datasets. The capability of training models on smaller grids and applying them to larger grids can become crucial for real-world applications to reduce the computational effort of generating datasets and also of training the models.