Multi-task graph neural networks for simultaneous prediction of global and atomic properties in ferromagnetic systems

We introduce a multi-tasking graph convolutional neural network, HydraGNN, to simultaneously predict both global and atomic physical properties, and demonstrate it on ferromagnetic materials. We train HydraGNN on an open-source ab initio density functional theory (DFT) dataset for iron-platinum (FePt) with a fixed body centered tetragonal (BCT) lattice structure and fixed volume to simultaneously predict the mixing enthalpy (a global feature of the system), the atomic charge transfer, and the atomic magnetic moment across configurations that span the entire compositional range. By taking advantage of underlying physical correlations between material properties, multi-task learning (MTL) with HydraGNN provides effective training even with modest amounts of data. Moreover, this is achieved with just one architecture instead of three, as required by single-task learning (STL). The first convolutional layers of the HydraGNN architecture are shared by all learning tasks and extract features common to all material properties. The following layers discriminate the features of the different properties, the results of which are fed to the separate heads of the final layer to produce predictions. Numerical results show that HydraGNN effectively captures the relation between the configurational entropy and the material properties over the entire compositional range. Overall, the accuracy of simultaneous MTL predictions is comparable to the accuracy of the STL predictions. In addition, the computational cost of training HydraGNN for MTL is much lower than that of the original DFT calculations and also lower than that of training separate STL models for each property.


Introduction
Material discovery and the design of new materials rely heavily on predicting material properties directly from their atomic structure. There are many physics-based computational approaches to model and predict the behavior of materials at the atomic scale from first principles, such as density functional theory (DFT) [1,2], quantum Monte Carlo (QMC) [3,4] and ab initio molecular dynamics (MD) [5,6]. While these methods have been instrumental in predictive materials science, they are extremely computationally expensive. The advent of data-driven modeling techniques has provided new methodologies to produce inexpensive and accurate predictions of material properties, which helps enable rapid screening of large material search spaces to select potential material candidates with desirable properties [7,8,9,10]. Among all data-driven models, deep learning (DL) models have the highest potential to accurately represent complex relations between input features and target quantities, but require large volumes of data to attain high accuracy. Data collected in materials science is most often small in volume due to expensive experiments and time-consuming numerical simulations.


HydraGNN architecture
HydraGNN directly inputs the atomic structure, converts it into a graph, where atoms are interpreted as nodes and interatomic bonds are interpreted as edges, and outputs total (graph-level) and atomic (node-level) physical properties. The architecture of HydraGNN is characterized by two sets of layers: the first set of layers learns features that are common to all the material properties, and the last set of layers is separated into multiple heads to learn features that are specific to each material property, shown schematically in Figure 2. The shared graph convolutional layers are used to extract common relevant features from pairwise neighbor interactions and, through multiple layers, also represent many-body interactions.
The following separate layers are fully connected and learn mappings between the extracted features and the physical properties. We have implemented HydraGNN using PyTorch [29,30], both as a robust NN library and as a performance portability layer for running on multiple hardware architectures. This enables HydraGNN to run on CPUs and GPUs, from laptops to supercomputers, including ORNL's Summit. The PyTorch Geometric [31,32] library built on PyTorch is particularly important for our work and enables many GCNN models to be used interchangeably. HydraGNN is openly available on GitHub [33].

Graph convolutional layers
A graph G is usually represented in mathematical terms as G = (V, E), where V represents the set of nodes and E represents the set of edges between these nodes [34]. An edge (u, v) ∈ E connects nodes u and v, where u, v ∈ V and E ⊆ V × V. The topology of a graph can be described through the adjacency matrix A, an N × N square matrix, where N is the number of nodes in the graph, whose entries are associated with the edges of the graph according to the following rule:

A_uv = 1 if (u, v) ∈ E, and A_uv = 0 otherwise.

The degree of a node u ∈ V is defined as

deg(u) = Σ_{v ∈ V} A_uv

and represents the number of edges connected to the node. Every node u is represented by an a-dimensional feature vector x ∈ R^a containing the embedded nodal properties, and also by a label vector y ∈ R^b in tasks related to node-level predictions. In order to take advantage of the topology of the graph, many DL models include both the number of neighbors per node and the length of each edge between nodes. GCNNs embed the interactions between nodes without increasing the size of the input by treating the local interaction zone as a hyperparameter that cuts off the interaction of a node with all the other nodes outside a prescribed local neighborhood. This is identical to the approximation made by many atomic simulation methods, including the LSMS-3 code used to generate the DFT training data, which ignore interactions outside a given cutoff range. GCNNs [35,36] are DL models based on a message-passing framework, a procedure that combines the knowledge from neighboring nodes, which in our applications maps directly to the interactions of an atom with its neighbors.
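To make these definitions concrete, the following plain-Python sketch builds the adjacency matrix A from an edge list and computes node degrees. The helper names `adjacency_matrix` and `degree` are ours for illustration, not part of HydraGNN:

```python
def adjacency_matrix(num_nodes, edges):
    """Build the N x N adjacency matrix A from an undirected edge list.

    A[u][v] = 1 if (u, v) is an edge, 0 otherwise.
    """
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected graph: the adjacency matrix is symmetric
    return A

def degree(A, u):
    """Degree of node u: the number of edges connected to it (row sum of A)."""
    return sum(A[u])

# A small 4-node graph: the path 0-1-2-3 plus the extra edge 0-2.
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3), (0, 2)])
```

In a GCNN the cutoff radius plays the role of the edge list here: only atom pairs within the cutoff produce nonzero entries of A.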
The typical GCNN architecture is characterized by three different types of hidden layers: graph convolutional layers, graph pooling layers, and fully connected layers. The convolutional layers represent the central part of the architecture, and their function is to transfer feature information between adjacent nodes (in this case atoms) iteratively. In the kth convolutional layer (k = 0, 1, . . . , K), message passing is performed sequentially with the following operations:

1. Aggregate information from neighbors: the node u collects the hidden embedded features of its neighbors N(u), as well as the information on the edges (if available), via an aggregation function:

h^{k+1}_{N(u)} = AGGREGATE({m^k_v : v ∈ N(u)}),

where m^k_v = MESSAGE(h^k_v, h^k_{e_uv}) is the message obtained from the neighboring node v and the edge e_uv that connects it to u. The vector h^k_v ∈ R^{p_k} is the embedded hidden feature vector of node v in the kth convolutional layer. When k = 0, the hidden feature vector is the input feature vector, h^0_v = x_v.
2. Update hidden state information: with h^{k+1}_{N(u)} collected, the nodal feature of node u is updated as:

h^{k+1}_u = UPDATE(h^k_u, h^{k+1}_{N(u)}),

where UPDATE is a differentiable function which combines the aggregated messages h^{k+1}_{N(u)} from the neighbors of node u with its nodal features h^k_u from the previous layer k.

Through consecutive steps of message passing, the graph nodes gather information from nodes that are further and further away. The type of information passed through a graph structure can be either related to the topology of the graph or to features assigned to the nodes. An example of topological information is the node degree, whereas an example of a nodal feature in the context of this work is the proton number of the atom. A variety of GCNNs, e.g., principal neighborhood aggregation (PNA) [37], crystal GCNN (CGCNN) [20] and GraphSAGE [38], have been developed, differing in the definitions of the functions AGGREGATE, MESSAGE, and UPDATE for message passing. One simple example of the function combination is:

h^{k+1}_u = σ(W_self h^k_u + W_neighborhood Σ_{v ∈ N(u)} h^k_v),

where W_self, W_neighborhood ∈ R^{p_{k+1} × p_k} are the weights of the (k + 1)th layer of the GCNN and σ is an activation function (e.g., ReLU) that introduces nonlinearity into the model. PNA is used in this work and is one of the convolutional layers available in HydraGNN through PyTorch Geometric; PNA combines multiple aggregating techniques to reduce the risk of classifying two different graphs as identical. Batch normalizations are performed between consecutive convolutional layers, along with a ReLU activation function. Graph pooling layers are connected to the end of the convolution-batch normalization stack to gather feature information from the entire graph. Fully connected (FC) layers are positioned at the end of the architecture to take the results of pooling, i.e., the extracted features, and provide the output prediction.
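The two message-passing operations above can be illustrated with a deliberately minimal sketch: scalar node features, sum aggregation, and fixed scalar weights standing in for the learned matrices W_self and W_neighborhood. The function and parameter names are ours; actual layers such as PNA are considerably richer:

```python
def message_passing_step(A, h, w_self=0.5, w_neigh=0.5):
    """One simplified message-passing step with scalar node features.

    AGGREGATE: sum of neighbor features (each message is just the raw
    neighbor feature here, with no edge information).
    UPDATE: a weighted combination of a node's own feature and the
    aggregate, followed by a ReLU nonlinearity, mirroring
    h_u^{k+1} = sigma(w_self * h_u^k + w_neigh * sum over N(u) of h_v^k).
    """
    n = len(h)
    h_new = []
    for u in range(n):
        agg = sum(h[v] for v in range(n) if A[u][v])      # aggregate over N(u)
        h_new.append(max(0.0, w_self * h[u] + w_neigh * agg))  # update + ReLU
    return h_new
```

Applying the step repeatedly propagates information from progressively more distant nodes, as described above.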
Larger sizes of the local neighborhood lead to a higher computational cost to train the HydraGNN model, as the number of regression coefficients to train at each hidden convolutional layer increases proportionally to the number of neighbors. Further details on the behavior of HydraGNN with different sizes of the local neighborhood have been previously reported [39].

Multiple heads with fully connected layers for multi-task learning
MTL utilizes a single NN to simultaneously predict multiple quantities [16] when those quantities are mutually correlated and can act as inductive biases for each other. The improvement from an MTL model depends on how strongly the predicted quantities are mutually correlated in a particular application. This type of field-specific inductive bias has been defined as knowledge-based: the training for one quantity can benefit from the information contained in the training signal of the other quantities. Ultimately, MTL allows a direct and automated incorporation of physics knowledge into the model by extracting correlations between multiple quantities, with manual intervention by a domain expert needed only to determine which quantities to use.
In HydraGNN, each predicted quantity is associated with a separate loss function, and the global objective function minimized during NN training is a linear combination of these individual loss functions. Formally, let T be the total number of physical quantities, or tasks, we want to predict. A single task identified by index i focuses on reconstructing a function f_i : R^a → R^{b_i} defined as

y_i = f_i(x), (7)

where x ∈ R^a and y_i ∈ R^{b_i}. MTL makes use of the correlation between the quantities y_i: the functions f_i in (7) are replaced by a single function f̂ : R^a → R^{Σ_i b_i} that can model all the relations between inputs and outputs as follows:

(y_1, . . . , y_T) = f̂_{W_MTL}(x), (8)

where W_MTL represents the weights to be learned. The global loss function L_MTL : R^{N_MTL} → R^+ to be minimized in MTL is a linear combination of the loss functions for the single tasks:

L_MTL = Σ_{i=1}^{T} α_i L_i(y_predict,i, y_i), (9)

where y_predict,i = f̂_{W_MTL,i}(x) is the vector of predictions for the ith quantity of interest and the α_i (for i = 1, . . . , T) are the mixing weights for the loss functions associated with each quantity. The values of the α_i in Equation (9) are hyperparameters of the surrogate model and thus can be tuned. In this work we assigned an equal weight to each property being predicted because we are equally interested in all of them; however, this definition of the loss function enables one to modify the values of the α_i to purposely favor the training towards one property of interest. As mentioned above, the multiple quantities in MTL can be interpreted as mutual inductive biases, because the error of a single quantity acts as a regularizer with respect to the loss functions of the other quantities. For a fair comparison, and to determine the benefit of using other tasks as a mutual regularizer, we do not use additional regularizers for the STL training in this work. Figure 2 shows the topology of a HydraGNN model for multi-task learning to model mixing enthalpy, atomic charge transfer, and atomic magnetic moment.
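The global MTL objective is a weighted sum of per-task losses; this minimal plain-Python sketch (our own naming, with mean squared error standing in for the per-task loss functions L_i) mirrors the linear combination in Equation (9):

```python
def mtl_loss(y_pred, y_true, alphas):
    """Global MTL objective: weighted sum of per-task losses.

    y_pred, y_true: lists of per-task prediction/target vectors (task i may
    have a different dimension b_i than task j).
    alphas: mixing weights alpha_i for each task; equal weights correspond
    to the choice made in this work.
    """
    def mse(p, t):
        # Mean squared error, standing in for a generic per-task loss L_i.
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)

    return sum(a * mse(p, t) for a, p, t in zip(alphas, y_pred, y_true))

# Two tasks with equal weights: task errors simply add up.
loss = mtl_loss([[1.0, 2.0], [0.0]], [[1.0, 0.0], [1.0]], [1.0, 1.0])
```

Raising one alpha relative to the others biases training toward that property, as noted above.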
The architecture of HydraGNN for MTL is organized so that the first hidden layers are shared among all tasks, while several task-specific output layers are kept. This approach is known in the literature as hard parameter sharing.

Solid solution binary alloy dataset
In this work we focus on a solid solution binary alloy, where two constituent elements are randomly placed on an underlying crystal lattice. We use a dataset for FePt alloys available through the OLCF Constellation [27] which includes the total enthalpy, atomic charge transfer, and atomic magnetic moment. Each atomic sample has a body centered tetragonal (BCT) structure with a 2 × 2 × 4 supercell. The dataset was computed with LSMS-3 [28], a locally self-consistent multiple scattering (LSMS) DFT application [40,41]. The dataset was created with fixed volume in order to isolate the effects of graph interactions and graph positions for models such as GCNN. This produces non-equilibrium alloy samples, with non-zero pressure and positive mixing enthalpy, shown as a function of composition in Figure 3. The input to HydraGNN for each sample includes the three components of the atom position and the proton number. The predicted values include the mixing enthalpy, a single scalar for each sample (graph), as well as the charge transfer and the magnitude of the magnetic moment, each a scalar per atom (node). Although the magnetic moment is a vector quantity, we treat it as a scalar because all the atomic magnetic moments in the dataset are collinear (all magnetic moments point in the same direction).
The dataset consists of 32,000 configurations out of the 2^32 available, sampled every 3 atomic percent. For this work, if the number of unique configurations for a specific composition is less than 1,000, all those configurations are included in the dataset; for all other compositions, configurations are randomly selected up to 1,000. This results in a final dataset of 28,033 configurations. In order to ensure each composition is adequately represented in all portions of the dataset, the splitting between the training, validation, and test sets is done separately for each composition.
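A per-composition (stratified) split like the one described above could be sketched as follows. The function name, the 70/15/15 fractions (matching the split used for training), and the fixed seed are our illustrative choices:

```python
import random

def stratified_split(configs_by_comp, train_frac=0.7, seed=0):
    """Split configurations into train/val/test separately per composition.

    configs_by_comp maps a composition (e.g. the Fe fraction) to its list of
    configurations; validation and test each receive half of the remainder,
    so every composition is represented in all three subsets.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for comp, configs in configs_by_comp.items():
        configs = configs[:]          # copy so the caller's list is untouched
        rng.shuffle(configs)
        n_train = round(train_frac * len(configs))
        n_val = (len(configs) - n_train) // 2
        train += configs[:n_train]
        val += configs[n_train:n_train + n_val]
        test += configs[n_train + n_val:]
    return train, val, test
```

Splitting inside each composition bucket, rather than over the pooled dataset, is what guarantees the compositional balance described above.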
At the ground state, the total enthalpy H of an alloy is

H = Σ_{i=1}^{E} c_i H_i + ΔH_mix,

where E is the total number of elements in the system, c_i is the molar fraction of each element i, H_i is the molar enthalpy of each element i, and ΔH_mix is the mixing enthalpy. We predict the mixing enthalpy for each sample by subtracting the internal enthalpy from the DFT-computed total enthalpy, as a value more relevant to materials science (more directly related to the configuration). The chemical disorder makes the task of describing the material properties combinatorially complex; this represents the main difference from open-source databases that have very broad elemental and structural coverage, but only include ordered compounds [42,43,44]. The range of values of the mixing enthalpy expressed in Rydberg is (0.0, 65.92), the range of the atomic charge transfer in units of the electron charge is (−5.31, −0.85), and the range of the atomic magnetic moment in Bohr magnetons is (−0.05, 3.81). Since different physical quantities have different units and different orders of magnitude, the inputs and outputs for each quantity are normalized between 0 and 1 across all data.
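The normalization step can be sketched as a simple min-max rescaling over all data for a given quantity; the helper name is ours:

```python
def minmax_normalize(values):
    """Rescale a list of values linearly onto the interval [0, 1].

    The minimum maps to 0 and the maximum to 1; assumes the values are
    not all identical (otherwise the range would be zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Example with charge-transfer-like values: the midpoint of the range maps to 0.5.
out = minmax_normalize([-5.31, -0.85, -3.08])
```

The same (lo, hi) pair must be reused at inference time to map predictions back to physical units.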

Numerical results
We present numerical results that predict the mixing enthalpy, atomic charge transfer, and atomic magnetic moment for the binary FePt alloy. Specifically, we compare the predictive performance of multiple separate, single-headed HydraGNN models for STL with multi-headed HydraGNN models for MTL. The output of DFT calculations is considered as the exact reference for the DL model to reconstruct.

Training setup
The architecture of the HydraGNN models has 6 PNA [37] convolutional layers with 20 neurons per layer. A radius cutoff of 7 angstroms is used to build the local neighborhoods used by the graph convolutional mask. Every learning task is mapped onto a separate head, where each head is made up of two fully connected layers, with 50 neurons in the first layer and 25 neurons in the second. The DL models were trained using the Adam method [45] with a learning rate equal to 0.001, a batch size of 64, and a maximum number of epochs set to 200. Early stopping is performed to interrupt the training when the validation loss does not decrease for several consecutive epochs, as this indicates that further epochs are very unlikely to reduce the value of the loss function. The training set for each NN represents 70% of the total dataset; the validation and test sets each represent half of the remaining data. As discussed in Section 3, compositionally stratified splitting was performed to ensure that all the compositions were equally represented across the training, validation, and test datasets. The training of each DL model was performed on Summit with one model per NVIDIA V100 GPU across two nodes, resulting in ensembles of 12 models per MTL/STL setup discussed in the next section.
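The early-stopping criterion can be sketched as follows. The text above states only that training halts when the validation loss has not decreased for several consecutive epochs, so the `patience` value and the function name here are our illustrative assumptions:

```python
def should_stop(val_losses, patience=10):
    """Early-stopping check on a history of per-epoch validation losses.

    Returns True when none of the last `patience` epochs improved on the
    best validation loss seen before that window.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge stagnation yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

In a training loop this check would run once per epoch, after the validation pass, breaking out of the loop when it returns True.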

Model accuracy and reliability
We compute and analyze not only the accuracy (RMSE) of each model, but also the standard deviation of the RMSE for an ensemble of 12 models. This simple metric enables some quantification of the uncertainty associated with each MTL or STL prediction and an understanding of the reliability of the GCNN models. Note that the RMSE for MTL may be higher than for STL for two reasons. First, the number of graph convolutional layers and the number of nodes per layer are the same for MTL and STL, but MTL forces these parameters to be shared among the multiple predicted quantities (STL and MTL differ only in the split hidden layers for the heads). Second, MTL introduces an inductive bias in the predictions of a quantity under the influence of the other quantities simultaneously predicted.
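The ensemble statistics used here amount to the mean and standard deviation of the RMSE over the 12 independently trained models; the helper name and the use of the population standard deviation are our illustrative choices:

```python
import math

def ensemble_stats(rmses):
    """Mean and (population) standard deviation of RMSE over an ensemble.

    Each entry of `rmses` is the test-set RMSE of one independently
    initialized and trained model.
    """
    n = len(rmses)
    mean = sum(rmses) / n
    var = sum((r - mean) ** 2 for r in rmses) / n
    return mean, math.sqrt(var)
```

A lower mean indicates a more accurate model family, and a lower standard deviation a more reliable one, in the sense used in the comparison below.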
We first show Figure 4 parity plots for the predictions generated by HydraGNN against the DFT data when MTL is used to simultaneously predict mixing enthalpy, charge transfer, and magnetic moment. HydraGNN with MTL predicts all three properties well for most of the samples over the entire dataset, as shown by the alignment of data near the diagonal. The colormap highlights that more sample points for all three properties are tightly clustered near zero, with larger variations as the property magnitudes increase. As expected, this non-uniform concentration of values for the target properties across the data affects the predictive performance of the HydraGNN models, with more accurate predictions in regions with higher concentration of data.
In Table 1, we compare the predictive accuracy of HydraGNN used to simultaneously predict the mixing enthalpy, atomic charge transfer, and atomic magnetic moment with MTL, MTL with each pair of properties, and STL for each individual property. The table shows the average root mean squared error (RMSE) and the standard deviation from 12 independent runs with random initialization on the test set. Figure 5 shows the probability distribution functions (PDF) of the results from Table 1 for each combination of properties. The lines mark the average values and the filled areas indicate one standard deviation. Models with a lower average RMSE are interpreted as more accurate, and those with a lower standard deviation as more reliable, since the model predictions are similar across different trainings due to more relevant extracted features that better characterize the dataset. The results show that adding the magnetic moment as a physical constraint improves both the predictive accuracy and the reliability of the MTL models over STL predicting the mixing enthalpy only. Adding the charge transfer, or replacing the magnetic moment with it, somewhat reduces the accuracy or reliability, but much less than using STL. These results suggest that the correlation between mixing enthalpy and magnetic moment is stronger than the correlation between mixing enthalpy and charge transfer. This stronger correlation in ferromagnetic materials is supported by DFT calculations and experiments [46,47,48], as well as by previous DL studies using multilayer perceptrons for MTL on the same dataset [49].
The error distributions for MTL are slightly skewed and not centered at zero because the physical properties operate as mutual inductive biases, but this does not affect the accuracy of MTL. While MTL stabilizes the training and prediction of the global (graph-level) mixing enthalpy, the same is not true of the atomic-level properties. The STL models for charge transfer and magnetic moment are slightly more accurate (as discussed at the beginning of the section) and, in the case of the magnetic moment, also slightly more reliable. It is possible that the additional data contained within the per-node properties, more than an order of magnitude above the per-graph property, is sufficient to predict on this dataset with STL alone.
We finally note that our results for the mixing enthalpy are less accurate than the best reported GCNN results for DFT data (∼0.21 eV/atom in this work versus 0.022-0.067 eV/atom [23,22]). This is partially due to the model used, as PNA does not explicitly include bond angle or crystallographic information, and partially due to the highly non-equilibrium nature of this dataset (although it should be taken into account that the literature results include a much broader range of chemistries). However, the focus of this work is on the MTL capabilities of HydraGNN and comparisons with STL.

Model training time
In Figure 6, we report the average wall-clock time and standard deviation to train all the MTL and STL models for the same 12 averaged runs from Section 4.2 for each combination of properties. In general, STL takes approximately 10% less time to train than MTL for two properties, which in turn takes approximately 10% less time than MTL for all three material properties. This is due to the fact that each head introduces additional parameters into the neural network, increasing the total computational cost. However, training MTL to simultaneously predict all three material properties is still more than 2.2x faster than training three separate HydraGNN models using STL.

Conclusion and future work
We have presented a new DL surrogate model for atomic calculations, HydraGNN, to predict both global and atomic properties, available on GitHub [33]. The multi-tasking GCNN model was jointly trained on mixing enthalpy, charge transfer, and magnetic moment to take advantage of physical correlations in a highly non-equilibrium DFT dataset. Each predicted quantity acts as a physical constraint on the other quantities, allowing the shared convolutional layers to extract features that describe local interatomic interactions and use them across different material properties. Although the numerical experiments presented here show that MTL can use strong correlations between physical properties to reduce the uncertainty associated with the DL predictions, imposing too many constraints can be counterproductive. Joint training to simultaneously predict too many quantities can lead to a decrease of the predictive power, particularly as the quantities become less correlated. Therefore, the inclusion of different physical properties for use as physical constraints must be determined judiciously.
The use of HydraGNN reduces the time needed to predict the material properties by a factor of hundreds compared to the original DFT calculations. Moreover, the computational time to train a multi-headed HydraGNN model is comparable to that of a single-headed GCNN model, resulting in further computational savings (near the order of the number of material properties).

Figure 6: Training time for HydraGNN models in wall-clock seconds to simultaneously predict all three material properties (red), two properties (green), and one property (blue). The bars indicate average training time and the vertical black segments indicate the range of one standard deviation from 12 independent runs with random initialization. Each letter corresponds to one predicted quantity: mixing enthalpy (H), charge transfer (C), and magnetic moment (M).
We envisage two primary future applications of HydraGNN. First, the surrogate enables extremely fast estimation of atomic properties once the model has been trained, replacing DFT as long as some reduction in accuracy is acceptable. To further improve the usability for this case, we plan to integrate uncertainty quantification methods with HydraGNN to identify situations where the model requires additional training data. Second, rather than completely replace DFT we intend to integrate physics-based and data-driven models together in scenarios where the trained HydraGNN surrogate model can act as an initial guess for a further refined DFT calculation. In computational workflows such as quantum Monte Carlo, this could substantially reduce the total simulation time without any change in the final accuracy.