Inferring material properties of the lower mantle minerals using Mixture Density Networks

Interpretation of information available from seismic data in terms of temperature and composition requires an understanding of the physical properties of minerals, in particular, the elastic properties of candidate Earth minerals at the relevant (here, lower mantle) pressure and temperature. A common practice for the bulk elastic properties is to measure volume at a range of pressures and temperatures using experiments or computational methods. These datasets are then typically fit to a pre-determined functional form, or equation of state to allow computation of elastic properties at any other pressure or temperature. However, errors, both random and systematic, limitations in the number of data and choice of pressure marker and scale, as well as different functional forms of equations of state, all contribute to the uncertainties in mineral seismic properties. In an attempt to present a more comprehensive view of these uncertainties, we use neural-network based techniques to infer the relationship among: pressure, temperature, volume, bulk modulus, and thermal expansivity of MgO. We illustrate our approach on experimental data, but an extension to ab initio data is straightforward. The type of neural network used is called a Mixture Density Network (MDN) which is a combination of a conventional feed-forward neural network and a mixture model that consists of Gaussian functions. MDNs are capable of approximating arbitrary probability density functions, which allows us to compute the uncertainties in the predicted equations of state. Since the networks interpolate locally between input samples, pressure-volume- temperature relations are implicitly learned from data without imposing any explicit thermodynamic assumptions or ad-hoc relationships. We use the partial derivatives of the mapping between inputs (pressure and temperature) and output (volume) to compute the isothermal bulk modulus and thermal expansivity. Flexibility of the MDNs allows us to investigate the uncertainty due to certain data in one region of pressure-temperature space without influencing the posterior probability density everywhere. In general, we find that the elastic properties of MgO are well-constrained by experimental data. However, our study highlights regions in which sparse or inconsistent data lead to poorly constrained elastic properties, namely: at low pressure and high temperature ( < 25 GPa and > 1500 K), and temperatures above 2700 K. While the former conditions are likely not important for the Earth ’ s lower mantle, they are relevant in other planetary bodies such as the Moon and Mars. Comparison with conventional equation of state forms shows that assuming a certain functional form of the pressure-volume-temperature relationship leads to potential bias in uncertainty quantification, because the uncertainties are then specific to the underlying form. In combination with data sets of other lower mantle minerals, this technique should improve uncertainty quantification in interpretations of seismic data.


Introduction
Information such as variation of wave speeds (e.g. Dziewonski and Anderson, 1981;Kennett et al., 1995), obtained by studying seismic data is crucial for understanding the internal structure of the Earth. Various studies have reported the presence of seismically distinct structures at multiple scales in the Earth's mantle (e.g. Garnero and Helmberger, 1998;Ritsema et al., 1999;Romanowicz, 2008;Hernlund and Houser, 2008;Deschamps et al., 2012;Garnero et al., 2016). In order to relate those observed seismic structures to appropriate temperature and composition, constraints from mineral physics on the sensitivity of seismic wave speeds to these parameters are required (e.g. Jackson, 1998;Trampert et al., 2001). The sensitivities have been used to infer the probable existence of chemical heterogeneities within the mantle (e. g. Trampert et al., 2004;Dobrosavljevic et al., 2019;Jackson and Thomas, 2021). Other studies have tried to constrain the (average) mantle geotherm and composition by combining seismic data and mineral seismic properties (e.g. Cammarano et al., 2003Cammarano et al., , 2005aDeschamps and Trampert, 2004;Stixrude and Lithgow-Bertelloni, 2005;Matas et al., 2007;Cobden et al., 2008Cobden et al., , 2009Simmons et al., 2010;Khan et al., 2009Khan et al., , 2011Khan et al., , 2013. Mantle convection simulations (e.g. Nakagawa et al., 2009Nakagawa et al., , 2010Nakagawa et al., , 2012Schuberth et al., 2009Schuberth et al., , 2012 have also incorporated mineral properties to illustrate the importance of joint geodynamical-mineralogical approaches to explain the seismic anomalies in the mantle. Mineral properties can be derived from experimental or theoretical methods. In particular, information on the density (or volume V), incompressibility and rigidity are required to obtain the seismic wave speeds in a material. Since it is not practical or feasible yet to perform experiments at each pressure (P) and temperature (T) that may exist within the Earth, the convention is to use equations of state (EOSs) to define the relationship among the thermodynamic variables P, V and T (e.g. Duffy and Wang, 1998), and hence be able to estimate mineral properties at the conditions not accessed by experiments.
However, a number of uncertainties are associated with this procedure. Experimental measurements contain random and systematic errors. The choice of pressure scale as well as different functional forms of the EOS (e.g. Vinet EOS, third/fourth order finite strain equations, also called Birch-Murnaghan EOSs, as well as the choice of Grüneisen models) all contribute to the uncertainties in mineral seismic properties. As a result, it becomes challenging to determine realistic uncertainties for the interpretations which relate seismic observations to temperature and composition.
In this study, we present an Artificial Neural Network (ANN) based approach to infer the pressure-volume-temperature (P-V-T) relationship of MgO, with a view to extend the application to other major lower mantle minerals. We collate experimental P-V-T data for MgO together with reported uncertainties, regardless of pressure scale or functional form used. By applying ANN techniques, P-V-T relationships are implicitly learned from data without any prior assumption on the functional form (or thermodynamic model) of the relationship. Specifically, we use Mixture Density Networks to infer material properties and assess their uncertainties. We compute the partial derivatives of inferred volume with respect to pressure and temperature to extract the bulk modulus and thermal expansivity, respectively. In order to test the feasibility of this approach, we train the networks only on experimental data, although a combination of theoretical and experimental data is also possible and straightforward.

Equations of state: uncertainties
Experimental approaches (e.g. Vassiliou and Ahrens, 1981;Yoneda, 1990;Utsumi et al., 1998;Duffy and Ahrens, 1995;Fei, 1999;Dewaele et al., 2000;Speziale et al., 2001;Li et al., 2006;Dorogokupets and Dewaele, 2007;Hirose et al., 2008;Murakami et al., 2009;Kono et al., 2010;Dorfman et al., 2012;Ye et al., 2017) have been used to establish the P-V-T relationship of MgO. Experiments using a diamond anvil cell (DAC), a multi-anvil press (MAP) and shock compression have provided a huge number of data covering a wide range of pressure and temperature. Laboratory measurements of volume are done at a discrete set of pressure and temperature points. To cover the entire pressure and temperature range of lower mantle requires pressure extrapolation and/or interpolation of the measurements using a thermal equation of state. The most common procedure (e.g. Matas et al., 2007;Cobden et al., 2009) is to use an isothermal equation of state with a Mie-Grüneisen model for thermal pressure. In this approach, the total pressure is considered to be the sum of a static pressure and a quasiharmonic thermal pressure. The static pressure term describes the pressure-volume relationship at a reference temperature (usually 300 K). Different functional forms, such as third/fourth order finite strain and Vinet, have been widely used to model isothermal compression curves often leading to different estimates of fitting parameters or ambient mineral properties such as volume (V 0 ), bulk modulus (K 0T ) and pressure derivative of bulk modulus (K ′ 0T ) at 0 GPa pressure (e.g. Speziale et al., 2001;Dorogokupets and Dewaele, 2007;Tange et al., 2009). To compute temperature effects (more precisely, thermal pressure) this framework uses a Grüneisen parameter whose volume dependence is uncertain (Ye et al., 2017). Although anharmonic effects are very small compared to the harmonic contribution to thermal pressure, some authors (e.g. Dorogokupets and Dewaele, 2007) use models to account for this term as well.
Additionally, the exact determination of pressure using a reliable pressure scale in static high pressure and temperature experiments is still a challenging task. The ruby pressure scale of Forman et al. (1972) used in DAC experiments has been largely calibrated (Liu and Bi, 2016) using both static and dynamic compression data, but still suffers from large experimental uncertainties. Dynamic shock compression experiments provide an absolute pressure scale. But the correction for thermal effects can be very uncertain (e.g. Dorfman et al., 2012;Duffy and Wang, 1998), especially at high shock temperatures because the corresponding thermal contribution also increases. Other widely used pressure scales are gold, platinum and MgO scales. A recent study by Ye et al. (2017) shows the inter-comparison of those scales up to 140 GPa and 2500 K. They report ±1 to 4 GPa (sometimes systematic) differences in pressure among those pressure scales. Although their study optimized different Au, Pt and MgO pressure scales to make them agree within ±1 GPa, it concludes that the most preferred form of EOS (and the pressure standard itself) remains uncertain.
Measurement errors, lack of an absolute pressure scale, and a variety of functional forms of EOSs all contribute to the uncertainties in mineral seismic properties. Assuming one particular EOS or pressure scale has the potential to produce biased uncertainty estimates that are specific to the underlying functional form. In this study we train neural networks to learn the implicit relation between pressure and temperature (as inputs) and volume, bulk modulus and thermal expansivity (as outputs). The results are entirely data-driven without a priori selection of experiments or a functional form to explain the data. In this way, we can infer the relative contributions of data sparsity versus prior conditioning to the uncertainties. We can also map the level of certainty of the elastic parameters in pressure-temperature space, which can be propagated into seismic interpretation.

Background
Conventional neural networks (Hornik et al., 1989) are general function approximators, which can be used to infer an (arbitrary nonlinear) relationship (Cybenko, 1989) between inputs and targets/outputs. However, the conditional average (i.e. the mean value of output conditioned on input data) given by such networks only provides limited information about that relationship (Bishop, 1994). Since experimental P-V-T data contain measurement errors, and inferring P-V-T relationship using those data is an inverse problem which can have multiple solutions, naturally we seek to treat the problem in a probabilistic framework. Hence, instead of having only the average volume output, we want to find the posterior probability density function (pdf) for volume. The pdf for volume at a given pressure and temperature can be denoted as σ(V|P, T). (1) We can represent a general pdf by combining a conventional feed-forward neural network with a Gaussian Mixture Model (GMM), which is then called a Mixture Density Network (MDN) (Bishop, 1994 andBishop, 1995). The architecture of the MDN used in this study is shown in Fig. 1, and consists of a two layer feed-forward neural network and a GMM. The GMM contains a mixture of a finite number of Gaussian kernels which are then weighted to give the posterior pdf. The mean, standard deviation and weight of each Gaussian kernel are parameterized by weights and biases of the feed-forward neural network, also known as network parameters (α).
Application of MDNs in Earth Sciences ranges from inversion of surface wave data for global crustal thickness (Meier et al., 2007a,b), temperature and water content variations within the transition zone (Meier et al., 2009), inference of Earth's radial seismic structure (de Wit et al., 2013), inversion of free oscillations (de Wit et al., 2014), constraints on lower mantle anisotropy (de Wit and Trampert, 2015), nonlinear petrophysical inversion (Shahraeeni and Curtis, 2011), source inversion of strong-motion data (Käufl et al., 2016b), inferring parameters governing mantle convection (Atkins et al., 2016) to travel-time tomography (Earp and Curtis, 2020). In our case, based on some experimental P-V-T data, we seek to approximate the true posterior pdf (Eq. (1)) by a parameterized posterior p(V|P, T; α) ≈ σ(V|P, T). (2) In other words, for a given pressure and temperature, the posterior probability density for volume is given by the pdf in expression (2) which is parameterized by the weights and biases (α) of the feed-forward neural network. These parameters are learned during the network training process (see Section 3.2). The posterior pdf (Eq. (2)) can be expressed as a linear combination of a fixed number of Gaussian kernels (also see Fig. 1) as p(V|P, T; α) = ∑ M n=1 π n (P, T; α)φ n (V|P, T; α) (3) where M denotes the number of kernels used, and π n are mixing coefficients which satisfy ∑ M n=1 π n (P, T; α) = 1.
(4) If the number of Gaussian kernels is M, then the total number of outputs from the feed-forward network is K = 3M because each kernel is parameterized by its weight (π n ), mean (μ n ) and standard deviation (σ n ). Eq. (4) ensures that the posterior integrates to 1 making it a valid probability density. φ n in Eq. (3) are Gaussian kernels of the form where μ n and σ n are the mean and standard deviation of Gaussian kernels in the GMM. These parameters of the GMM are related to the outputs (y k ) of the feed-forward network (see details in Appendix A).

MDN initialization and training
In order to find the appropriate weights and biases of the feedforward neural network, we train the MDN using a sub-set of the experimental P-V-T data. In fact, the total experimental P-V-T data, shown in Fig. 2 (Fei, 1999;Jacobsen et al., 2008;Fei et al., 2004a;Fei Fig. 1. Architecture of the Mixture Density Network (MDN). A two layer feed-forward neural network (left) is combined with a GMM (centre) to get the posterior pdf (right). P & T denote the network inputs, h j are the hidden nodes, and y k are the outputs of feed-forward network. Indices J and K represent the number of hidden and output nodes, respectively. Except for the input nodes, each circle represents a computational node. Hidden layer nodes take a weighted sum (with weights α ij , where i ∕ = 0) of input data (P & T) plus a bias term (α 0j ) as inputs and apply a sigmoid activation function. The output layer nodes take a weighted sum (weighted by α jk , where j ∕ = 0) of the outputs from the hidden layer plus a bias (α 0k ) and apply a linear activation function to give the outputs y k . These outputs are related to the mean, standard deviation and weight of each Gaussian in the GMM (see Appendix A for details). Each Gaussian in the GMM is then weighted to give the final posterior pdf. et al., 2004b;Dewaele et al., 2000;Speziale et al., 2001;Utsumi et al., 1998;Fiquet et al., 1999;Ye et al., 2017;Kono et al., 2010;Dorfman et al., 2012;Zhang, 2000;Fiquet et al., 1996;Dubrovinsky and Saxena, 1997;Hirose et al., 2008;Litasov et al., 2005;Murakami et al., 2012;Li et al., 2006;Fan et al., 2019), is divided into three sets: training (70%), monitoring (20%) and test (10%) sets. During training, the MDN takes pressure and temperature from the training data and outputs a pdf for volume according to Eq. (3). However, we need to decide on the initial values of the network parameters of the feed-forward neural network to compute the first output. We randomly draw the input layer and hidden layer weights (Bishop, 1995) according to Gaussian distributions (see Appendix B for details). Once the MDN is initialized and training has started, the difference between the output and the target can be computed according to an error function defined in Appendix B. This function is also called the loss function which is minimized iteratively using the ADAM optimization method (see detailed algorithm in Kingma and Ba, 2014). We use TensorFlow (1.13.1) (Abadi et al., 2015) to construct, train and evaluate the MDN.
Overfitting is a general property of the maximum likelihood technique (Bishop, 1995). We use a separate monitoring data set to monitor the error decay during training. We evaluate the monitoring set error at the end of each iteration; if the monitoring error starts to increase (i.e. the network starts to over-fit the training data) then we stop the training procedure and save the last best trained model. This technique is also called the early-stopping technique.
It is known that the inverse problem can have multiple solutions (i.e. a range of network parameters can possibly provide equally likely solutions). We train a number of independent MDNs, and combine them by a weighted sum (e.g. Käufl et al., 2016a). The weight of each network is based on how well it performs on the test data which is not used during training. The performance is measured by the same error function that we use to calculate training and monitoring errors (for details see Appendix B). In this way, the explicit dependence of the posterior on the network parameters can be avoided. The choice of the number of MDNs depends on the problem at hand. A rough estimate for a relatively simple problem (e.g. a few inputs and a target/output) may lie in the range 10-20 (Käufl et al., 2016a). However, in order to compute the uncertainties in bulk modulus and thermal expansivity (details in Section 5) we train a large number of MDNs (10 3 ). The number of hidden nodes to use in each MDN are randomly selected from a pre-defined range which is 16-32. We conducted a separate test (not shown here) to find the range that provides the lowest errors for the test set. Similarly, we propagate the uncertainties in experimental data through the MDNs by randomly perturbing the thermodynamic variables within the reported uncertainty range.

Network performance
We use the test data set to examine how well the trained MDNs perform when a new datum is presented. Since the test data are not used in network training, we can use them to predict the output and subsequently compare with target data. In Fig. 3 (top panel) the predicted volume is compared with the target data. The MDNs predict pdfs for volume, and for this comparison we compute the conditional mean volume (conditioned on inputs P & T), instead of using the full posterior pdfs on volume, as This special case of MDN corresponds to the standard neural network output (Bishop, 1994), i.e. only the feed-forward network with one volume output. Eq. (6) shows the mean volume output for one MDN, and we calculate the weighted sum (weights are chosen according to the test set error as mentioned previously) of mean volumes from all MDNs. One alternative to the conditional posterior mean could be the posterior mode. However, the posterior mode may be biased towards certain pressure scales which contain relatively more data in the training set compared to other scales.
In the region of high temperatures and low pressures (Fig. 3, top panel) the trained MDNs show lower resolving capacity, providing more uncertain volume predictions. We found that this discrepancy in network predictions comes from the inclusion of specific training data points (high temperature data of Fiquet et al., 1996) in those ranges. We note that Fiquet et al. (1996) did not include a thermal pressure term in their experiments and so it is likely that the total pressure is underestimated. Moreover, the reported temperatures are likely overestimated by about 20 to 50%. We trained another network excluding these data in our training set and access the prediction performance ( Fig. 3, bottom panel). In doing so, MgO volumes are resolved within the prior range of experimental data, also in the region of low pressure and high temperature. This shows the networks' ability to capture the underlying data consistency.
Low pressure data (approximately less than 30 GPa) are relatively dense up to about 1400 K compared to higher temperatures. Similarly, most of the high pressure data, i.e. extending to the lower mantle environment, come either from approximately between 1500 K to 2700 K or from ambient temperature measurements. Besides that, the experimental data does not cover simultaneous high temperature and high pressure regions, for example temperatures greater than ~2700 K at pressures expected near the bottom of lower mantle. Hence, we expect wider posterior probability density functions for volume in regions of sparse experimental data coverage. So far we have only shown the mean of the posterior pdf for volume. To illustrate more clearly the effect of the high temperature data of Fiquet et al. (1996) on the posterior pdf at low pressure, high temperature, we take two data points from the test set (denoted by '+' in Fig. 3, top panel). Both points are drawn at low pressures, but one is at high temperature and located away from the solid line and another at low temperature is close to it. In Fig. 4 posterior pdfs at those points are shown. They show a more uncertain prediction for the high temperature, low pressure input. Once we remove Fiquet et al. (1996) data from  (Fei, 1999;Jacobsen et al., 2008;Fei et al., 2004a;Fei et al., 2004b;Dewaele et al., 2000;Speziale et al., 2001;Utsumi et al., 1998;Fiquet et al., 1999;Ye et al., 2017;Kono et al., 2010;Dorfman et al., 2012;Zhang, 2000;Fiquet et al., 1996;Dubrovinsky and Saxena, 1997;Hirose et al., 2008;Litasov et al., 2005;Murakami et al., 2012;Li et al., 2006;Fan et al., 2019) to train the MDNs. Data with uncertainties from X-ray diffraction experiments (in static high P-V-T, Brillouin spectroscopy and ultrasonic interferometry) are collected for the analysis. Note: uncertainties in collected experimental data are not plotted because the scale would be inappropriate to visualize them. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Fig. 3. Performance of MDNs. Target volumes from the test data set are compared with mean volumes (Eq. (6)) predicted by the MDNs. Top panel shows mean volumes predicted by the MDNs trained with all experimental data while bottom shows results with high temperature data of Fiquet et al. (1996) and Murakami et al. (2012) excluded (also see Section 4.2). The pressure (left) and temperature (right) range of the test data set is shown by colourbars on both panels. We note that the solid red line in the figure refers to a perfectly resolved network prediction. Points located near this line are well resolved and those located away represent more uncertain volume predictions. The MDNs best predict the volumes in low temperature regions and at simultaneous high temperature and pressure. However, including high temperature data of Fiquet et al. (1996) into training provides more uncertain volume predictions in the low pressure, high temperature region. For two data points marked with "+" in both left and right plots in the top panel, we plot posterior pdfs for volume in Fig. 4. One datum is located in the low pressure, high temperature region where the effect of high temperature data from Fiquet et al. (1996) is significant and another away from it. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) The posterior pdf is broad and multi-modal with target volume located away from the posterior modes. The smaller peak is the due to experimental P-V-T data of Fiquet et al. (1996). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) training (see Section 4.2), the network predicts narrow posterior pdfs showing less uncertainty (cf. including those in training) in volume. Although excluding Fiquet et al. (1996) provides less uncertain volume predictions, due to limited availability of experimental data at high temperature and low pressure (approximately >1500 K and <25 GPa) the predicted posterior pdfs are still slightly wider than at similar temperatures and high pressures (also see Section 4.2 and Appendix C.1).

P-V relationship at 300 K
The predicted pdfs for volume along a 300 K isotherm are presented in Fig. 5. A subset of the training data (i.e. only around 300 K temperature) is also shown along with the MDN predictions. The uncertainty in volume increases with pressure as shown by the increasing width of pdfs. This is expected as the training data (around 300 K) are more consistent with each other at lower pressures.
In Fig. 6 we compare pdfs for the volume of MgO along a 300 K isotherm with EOSs of Tange et al. (2009), Speziale et al. (2001, Lithgow-Bertelloni (2005, 2011) and Dorogokupets and Dewaele (2007) (denoted as T09, S01, SLB0511 and DD07, respectively). In this study, we use MINUTI (Sturhahn, 2020) to compute volume, bulk modulus and thermal expansivity as a function of pressure (and temperature) from these EOSs. For ambient temperature comparisons, static equations (i.e. third-order finite strain or Vinet) together with respective fitting parameters (V 0 , K 0T and K ′ 0T ) as reported in the literature are used. We show the pdfs for volume (Fig. 6, left panel) at every 5 GPa. The EOSs diverge as the pressure increases. At 135 GPa, the difference in volume between the equations of state of Lithgow-Bertelloni (2005, 2011) and Tange et al. (2009) is ~0.68 Å 3 , whereas one standard deviation predicted by the neural networks is ±0.54 Å 3 . Moreover, the slope of each individual EOS differs. This can best be visualized by computing ∂P ∂V for all EOSs (see Fig. 6, right panel). Although Speziale et al. (2001) and Lithgow-Bertelloni (2005, 2011) are based on third order Birch-Murnaghan EOSs, their fitting parameters are different. Comparisons between different EOSs and their fitting parameters are given by other studies (e.g. Dorogokupets and Dewaele, 2007;Tange et al., 2009;Ye et al., 2017, etc.). The mean slope predicted by the neural network shows a slightly stiffer EOS compared to the "standard" EOSs from the literature. This may be due to the fact that our training data include experiments which make use of different pressure standards (e.g. Ruby, NaCl, Pt, Au) than the EOSs considered for comparison (which are based on MgO).
Nevertheless, such a difference in slope together with the volume difference will inevitably lead to a significant divergence in the inferred compressibility and thermal expansivity (see Section 5).

High temperature P-V-T relationships
We use the trained MDNs to predict volumes of MgO at different temperatures. As an example, we plot the predicted pdfs for volume along a 2500 K isotherm in Fig. 7, left panel (other isotherms are provided in Appendix C.1). Similar to the ambient temperature (Section 4.1), the 2500 K isotherm shows a well-constrained volume prediction at lower mantle pressures. However, the high temperature pdfs show more uncertain volume predictions at low pressures (except at 0 GPa). For example, at 5 GPa the pdf is relatively wide and bimodal compared to that at high pressures (e.g. 100 GPa) which is unimodal. As discussed earlier in Section 3.3, high temperature experimental data of Fiquet et al. (1996) do not include a thermal pressure term, and it is likely the total pressure is underestimated. This can be visualised in Fig. 7, left panel, where training data points located approximately between 5-15 GPa have a smaller volume compared to data around 20 GPa and ~2500 K. We train another network without the high temperature data of Fiquet et al. (1996) and plot the results on the right panel of Fig. 7. The posterior pdf for volume at 5 GPa now shows a unimodal peak and the width is decreased by approximately a factor of 2 (cf. left panel at 5 GPa). Although removing Fiquet et al. (1996) reduces the uncertainties in volume, the posterior pdf is still wider than at high pressures for the same temperature. This region of low pressure, high temperature is known to be dominated by anharmonic effects. Although these effects are implicitly represented in our volume pdfs, there are limited experimental data in this region (temperature >1500 K and pressure <25 GPa) to further constrain them.
We compare the MDN predicted pdfs along a 2500 K isotherm ( Fig. 7) with some conventional EOSs (Tange et al., 2009;Speziale et al., 2001;Lithgow-Bertelloni, 2005, 2011;Dorogokupets and Dewaele, 2007). The variation in volume between these EOSs at high pressures is similar to that observed at 300 K. It has been noted in earlier studies (e.g. Ye et al., 2017) that the discrepancies in high temperature EOSs are partly due to persistence of the disagreement between them at 300 K (reference isotherm). Furthermore, at low pressure (<25 GPa) Speziale et al. (2001) diverges from other EOSs. This deviation is likely due to different values of fitting parameters together with distinct Grüneisen models to compute the thermal behavior. For example, Speziale et al. (2001) do not consider anharmonic effects, and their ambient Grüneisen parameters are also different than other studies (see e.g. Ye et al., 2017;Dorogokupets and Dewaele, 2007). Besides that, as with the case of the 300 K isotherm, all explicit EOSs lie within the uncertainty range predicted by our MDNs, which is expected because some training data come from the MgO pressure scales described by these EOSs.
At 2700 K, the MDN predicted pdfs (Fig. 8) show bimodal volumes in the pressure range of approximately 45-90 GPa. Once we plot the associated training data on top, it becomes clear that the smaller peaks in the pdfs are the representation of experimental data points of Murakami et al. (2012). Surprisingly, for the same reported volume and temperature they report pressures which are different from each other by about 36 GPa. However, their reported densities appear to be physically reasonable. Nevertheless, we train another network to discriminate how much uncertainty is coming from those specific data points. In doing so, the posterior becomes unimodal. At 60 GPa, including Murakami et al. (2012) data leads to a factor of approximately 3.5 wider pdfs for volume (Fig. 8, right panel) compared to results without those data. However, the effect of those data points seems to be local in P-V-T space and their influence decreases for example, at higher pressures. This is because MDNs interpolate locally in between samples, and data in one region of P-T space does not influence uncertainties everywhere.

Bulk modulus and thermal expansivity
Since the training data do not contain explicit values for the volume derivatives with respect to the inputs (P and T), getting constraints on bulk modulus (-V ∂P ∂V ) and thermal expansivity ( 1 V ∂V ∂T ) is less straightforward than constraining the volumes. Hence, we follow a slightly different approach compared to volume. We calculate the mean volume using Eq. (6) for any given P and T from each earlier obtained MDN. Then we perturb pressure (P + δP) while keeping the temperature fixed and compute the mean volume (<V(P+ δP, T) >) for that pressure from the same MDN. This way, we can compute the mean isothermal bulk modulus (K) as shown in Eq. (7). Similarly, we evaluate mean volumes for two slightly different temperatures but at a fixed pressure, and use that to compute the thermal expansivity, α (Eq. (8) Fiquet et al. (1996) and Murakami et al. (2012) data are excluded. For comparison, volumes along the high temperature isotherm for some previously published EOSs (Tange et al., 2009;Speziale et al., 2001;Lithgow-Bertelloni, 2005, 2011;Dorogokupets and Dewaele, 2007) are computed using MINUTI (Sturhahn, 2020). On both panels we plot a sub-set of the total training data, namely those data at temperatures between 2100 and 2600 K. Excluding Fiquet et al. (1996) data from neural network training significantly reduces the width of the pdfs at high temperature and low pressure. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Hence, in this approach, we take the derivatives of the P-V (or T-V) curve defined by the mean of the posterior pdfs from each neural network rather than fitting P-V-T data to a predefined EOS to get fitting parameters (such as K 0T and K ′ 0T ). Since we have trained a large number of MDNs (10 3 ) to predict the posterior pdf for volume, we get the same number of mean isothermal bulk modulus and thermal expansivity values. This way, each neural network approximates a slightly different mapping and its derivatives, and the distribution on the mean bulk modulus and thermal expansivity can approximate the uncertainties on them. Moreover, we use the same networks to compute the pdfs for volume and the mean volumes; the volume that goes into the calculation of bulk modulus and thermal expansivity is therefore consistent.
As an example, Fig. 9 shows bulk modulus as a function of pressure along two selected isotherms (refer to Appendix C.2 for other isotherms). The bulk modulus predicted by neural networks shows a higher value at high pressure along the 300 K isotherm compared to conventional EOSs. As mentioned earlier, this is likely due to the fact that the training data come from experiments which make use of different EOSs and pressure standards than those (MgO based) EOSs considered for comparison. Moreover, the fitting parameters (V 0 , K 0T and K ′ 0T ) are different for different EOSs. Hence, although these EOSs predict volume within the uncertainty range predicted by MDNs (Fig. 6, left panel), their derivatives (Fig. 6, right panel) differ significantly from each other and also from the MDN prediction, leading to different values of bulk modulus.
One high temperature (2000 K) comparison between the neural network predicted mineral properties and other studies is shown in Fig. 9c-f. In general, bulk modulus values predicted by the neural networks agree well with explicit EOSs, although Tange et al. (2009) shows slightly higher values at moderate pressures (e.g. 60 GPa). The mean bulk modulus predicted by the neural networks shows a large uncertainty at low pressures (below ~25 GPa) when high temperature data by Fiquet et al. (1996) are included. In Fig. 9d, we show the bulk modulus predicted by the neural network trained without Fiquet et al. (1996) (and Murakami et al., 2012. Here, the uncertainties at low pressure are significantly decreased. Similarly, neural networks trained without those two data sets predict physically reasonable thermal expansivities (Fig. 9f) compared to those trained with all data sets (Fig. 9e). At high temperatures, we still see a sharp bend around 20 GPa (also see Appendix C) which we suggest may be related to anharmonic effects. As the experimental data is relatively sparse in this region, one would need additional measurements (or theoretical studies) to confirm this. Furthermore, the thermal expansivity of Speziale et al. (2001) deviates from other EOSs. As mentioned in earlier studies (e.g. Dorogokupets and Dewaele, 2007), this may be improved by including anharmonic terms in the EOS. In equation of state formalisms, one can add an anharmonic term to the total free energy. This additional term has a T 2 dependence, rather than simply a linear temperature term. The effect of adding this term is most significant at low pressures, and can potentially capture more accurately the volume dependence at high temperatures compared with the standard thermal models without anharmonicity (for temperatures less than or equal to 2700 K in this meta dataset).
Besides low pressure, including Murakami et al. (2012) data during network training provides mean bulk modulus uncertainties that are more than 4 times larger (Fig. 10a) than excluding them together with Fiquet et al. (1996), and this discrepancy reduces at higher pressures (Fig. 10b). Moreover, as expected, neither Fiquet et al. (1996) nor Murakami et al. (2012) data influence bulk modulus at low temperatures, as shown in Figs. 10c and 9a, b.

Discussion
Fitting parameters (such as K 0T and K ′ 0T ) are inherent to explicit global EOSs, and a correlation between them tells us how one parameter changes with another providing optimal global fit. We do not estimate the uncertainties on fit parameters of EOSs which are specific to the underlying global functional form. Instead, we directly provide the uncertainties on volumes which are local in P-T space. The MDN is a kernel based method where we fit (a mixture of Gaussian) kernels to the experimental data and get an arbitrary probability density function on volume at any given P and T. The neural networks are flexible and interpolate locally; the uncertainties in one region of P-T space do not impact the posterior pdf everywhere. For example, Fig. 7 shows no change in high pressure pdfs while removing Fiquet et al. (1996) data in the region of low pressures. Our approach is also very powerful at identifying data inconsistencies when using different data sources.
The posterior pdfs given by the MDNs represent uncertainties in volume due to experimental errors, data gaps and data inconsistencies from different studies. Together with the uncertainties in mean isothermal bulk modulus and thermal expansivity, these results can be used by, for example, seismologists working on thermochemical interpretation of seismic data. Although uncertainties in volume, bulk modulus and thermal expansivity vary locally depending on sparsity and Fig. 8. Left: pdfs for volume of MgO along a 2700 K isotherm predicted by the MDNs trained with all data. We also plot a sub-set of the training data, namely those whose temperatures lie between 2600 and 2800 K. Note: the large uncertainty in volume in the low pressure region (approximately below 25 GPa) is due to inclusion of data from Fiquet et al. (1996) as discussed in the text. Right: Comparison of posterior pdfs for volume predicted by MDNs trained with and without Murakami et al. (2012) (M12) and Fiquet et al. (1996) (F96) data at 2700 K and 60 GPa. The small peak at around 66 Å 3 is due to Murakami et al. (2012) data. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) consistency of the experimental data, using these outputs from MDNs, one can directly compute bulk wave speed (φ 2 = K S /ρ) and density (ρ) at any given pressure and temperature. However, in order to compute bulk wave speeds at temperatures applicable to the lower mantle, we need the adiabatic bulk modulus (K S = K T (1 + αγT)), where γ is Grüneisen parameter and α is the thermal expansivity. Nevertheless, assuming that the difference between isothermal (K T ) and adiabatic (K S ) bulk moduli, at 300 K is roughly within ±1.0% (Marquardt et al., 2018), the bulk wave speed of MgO is 11.14 ± 0.07 km/s at 135 GPa. At the same condition, the relative uncertainty (one standard deviation around mean) in density predicted by the MDNs is about ±1.0%. This is larger than or comparable to the relative density variations in lower mantle estimated by previous studies (e.g. Ishii and Tromp, 1999;Trampert et al., 2004;Koelemeijer et al., 2017). Although the Grüneisen parameter varies as a function of volume that ultimately depends on pressure (and temperature), we assume it to be approximately 1.1 ± 0.3 (e.g. Stixrude and Lithgow-Bertelloni, 2011;Ye et al., 2017) at 2700 K and 135 GPa to give an estimate of uncertainties in bulk wave speed. In doing so, the relative uncertainty in bulk wave speed is about ±1.77% which is larger than the reported bulk sound speed variation in the lower mantle (e.g. Trampert et al., 2004).
Estimation of mineral properties beyond the range of experimental data requires extrapolation. The standard EOSs can easily be used for extrapolation provided that the assumptions of the functional form hold Fig. 9. Comparison of the mean bulk modulus (a, b, c and d) and thermal expansivity (e and f) predicted by the neural networks with previously published equations of state for MgO (Tange et al., 2009;Speziale et al., 2001;Stixrude and Lithgow-Bertelloni, 2011;Dorogokupets and Dewaele, 2007) as a function of pressure. The output from the neural networks is shown with greyscalethe darker the region of the plot, the greater the number of MDNs which predict the bulk modulus (or thermal expansivity) has that value. Frequency counts for output from the MDNs are at intervals of 1 GPa for pressure and bulk modulus, and 10 − 7 K − 1 for thermal expansivity. For (a), (c) and (e) neural networks are trained with all collected data, whereas for (b), (d) and (f) data from Fiquet et al. (1996) and Murakami et al. (2012) have been excluded. Due to the inclusion of Fiquet et al. (1996) data we obtain large uncertainties in bulk modulus and thermal expansivity in low pressure, high temperature regions. Note: the overlapping of different EOSs makes the background histogram difficult to visualise. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) in the region of no data. In general, it has been observed that MDNs provide a wider estimate of uncertainties in the region of little to no training data (Käufl et al., 2016a). Here too, as shown by the wider pdfs in Fig. 11, the uncertainty in predicted mineral properties increases when the network has to extrapolate from distant training data. We note that EOSs of Lithgow-Bertelloni (2005, 2011) and Tange et al. (2009) closely follow the pdf predicted by the network indicating that it learns a functional form present in the data, but errs on the cautious side by returning larger uncertainties. From a Bayesian perspective, we would advise against extrapolation as this covers a region outside the prior. Fig. 11, however, demonstrates some capability of neural networks to extrapolate beyond the ranges of the data, although we would need to establish how far this is related to the precise network architecture.
The shear modulus is required to calculate compressional and shear wave speeds. There is no thermodynamic expression for the shear modulus, but functional forms are often assumed, for example third order finite-strain and shear counterpart of the Keane EOSs (Keane, 1954) by Kennett (2017), to compute the shear modulus which are based on the bulk modulus calculation. One can also use the linear relationship among shear modulus, adiabatic bulk modulus and pressure given by Stacey (1995). However, the uncertainties in shear modulus would then be dependent on those in bulk modulus, and the assumption that shear properties can be constrained from the bulk properties. An alternative is to use data from experiments such as Brillouin Spectroscopy that provide shear wave speed information. Together with unit-cell volume, as measured by X-ray diffraction on the same sample (e.g. Murakami et al., 2012;Kurnosov et al., 2017) and known sample composition, the density and thus shear moduli can be determined. However, these data sets do not cover simultaneous high pressure and temperature regions that are expected in the Earth's lowermost mantle. For example, the highest temperature and pressure data for MgO reported in Murakami et al. (2012) are six measurements at 2700 K and between 32.5-68.4 GPa. Nevertheless, a combination of wave speed data from ultrasonic techniques and Brillouin Spectroscopy together with high P-V-T data from X-ray diffraction techniques has the potential to exhaustively sample the lower mantle geotherm in the near future (Marquardt and Thomson, 2020).
We note that, in principle, a combination of experimental data and theoretical calculations (e.g. Karki et al., 1999;Oganov and Dorogokupets, 2003;Wu et al., 2008) is possible. This may provide additional constraints on the predicted mineral properties covering a wider range of pressure and temperature. Since our approach implicitly identifies the consistency between different data sources, a proper rationale can be developed to mix data and uncertainties from theory with experiments. Furthermore, the MDN based approach can easily be extended to the upper mantle and the core. Since MDNs are flexible, they can be employed to model multi-mode targets/outputs. This would be helpful to model for example volume anomalies induced by the iron spin transition (e.g. Marquardt et al., 2009;Speziale et al., 2007;Lin et al., 2006;Crowhurst et al., 2008;Solomatova et al., 2016). A natural progression of this work is to extend it for solid solution. It is straightforward to include composition, e.g. the Mg/Fe ratio, by including it as an extra dimension in the input data (i.e. P, T and mol% Fe in

Conclusions
This study demonstrates the feasibility of a neural network based approach to infer the material properties of lower mantle minerals. In our approach, we learn the underlying P-V-T relationship providing a reasonable approximation of the P-V-T data of MgO. This allows us to compute the uncertainties in density, thermal expansivity and bulk modulus without prescribing an explicit EOS. Once the networks are trained, it is a simple function that can be evaluated at any given pressure and temperature to get volume, mean bulk modulus and thermal expansivity with uncertainties. In order to train the networks, we collect data from high P-V-T experiments without prior selection of data (e.g. based on pressure scale or functional form used). Hence, our uncertainties are not biased towards a subjective selection of experimental data. Furthermore, our approach identifies inconsistencies between data from different sources. The assumption that an EOS follows a particular form provides a priori information by fixing their form (or thermodynamic model) and/or pressure scale. It remains to be determined which EOS form best describes the thermodynamic behaviour of MgO at wide range of pressures and temperatures. In this study, we compare a few "standard" EOSs with the material properties inferred from neural networks and show that choosing one particular explicit form provides a biased estimate of uncertainties.
Based on the prediction performance of the MDNs and comparison with conventional EOSs (such as Figs. 3, 7, 9, and Appendix C), we can be most confident about physical interpretation of seismic data in the lower mantle within the prior range of experimental data (Fig. 2). In the regions where there exists little evidence about how the P-V-T relationship behaves, such as at low pressure, high temperature (<25 GPa, >1500 K), and temperatures approximately >2700 K at pressures expected towards the core-mantle boundary, neural networks show increasingly uncertain predictions. Although for the Earth's lower mantle, low pressure and high temperature environments may not be relevant, they are expected in other planetary bodies such as the Moon and Mars (e.g. Khan et al., 2014Khan et al., , 2018. With currently available data, it likely provides meaningful uncertainties that could be used by seismologists within certain ranges of pressure and temperature, while highlighting the P, T regions in which more experimental (or theoretical) data is needed before we can draw robust conclusions on temperature and composition.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Fig. 11. Probability density function for volume of MgO along a 2700 K isotherm (a) and 100 GPa isobar (b). Training data belonging to temperature between 2400 and 3000 K (a), and pressure range from 96 to 103 GPa (b) are also shown. Magenta (SLB0511) and red (T09) curves are Lithgow-Bertelloni (2005, 2011) and Tange et al. (2009) EOS, respectively. They follow the volume trend predicted by the network. In the region outside the prior data, the trained MDNs provide wider pdfs as they are forced to extrapolate the volume. To illustrate this more clearly, volume pdfs at a fixed temperature (and pressure) and three different pressure (and temperature) are also shown in c (and d). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Appendix A. Generalised theory of the MDN
Let, x = {x 1 , x 2 , …, x I } be the input data to the feed-forward part of the MDN. Please note, to generalise this section, we write inputs as x and targets as m k instead of P & T and V, respectively. The feed-forward network outputs y k are computed as a weighted sum of the outputs from the hidden nodes plus a bias where the function f 2 is an identity function such that f 2 (p) = p, α jk is the hidden layer weight matrix and α 0k represents a bias term of each output node. Now, the hidden node outputs h j are computed as where the function f 1 is a logistic sigmoid function f 1 (p) = 1 1+exp(− p) , α ij is the input layer weight matrix, α 0j are the biases of hidden nodes and x i are input data. y k are related to the parameters, namely weights (π n ), means (μ n ) and standard deviations (σ n ) of Gaussians in the Gaussian Mixture Model (GMM) by the following relationship (for details see e.g. Bishop, 1994;de Wit et al., 2013) π n (x; α) =

exp
( y

Appendix B. MDN initialization and training details
The total data (x) is divided into three setstraining (70%), monitoring (20%) and test (10%) sets such that x train ⊂x, x monitor ⊂x and x test ⊂x (B.1) with x train ∩ x monitor = ∅, x train ∩ x test = ∅ and x monitor ∩ x test = ∅.
Using the training data (x train ) we train the MDN. However, before we train the MDN we need to decide on initial values of the network parameters. We randomly draw the input layer and hidden layer weights (Bishop, 1995)  respectively. Where I and J are number of input and hidden nodes, respectively. Similarly, the output layer biases are initialized by a K-means clustering algorithm (i.e. fitting a GMM to the training data set). Once the initialization is done and the training begins, the difference between the output and the target can be computed according to the error function which is summed over all training data providing the average error. This function is also called the loss function which is minimized iteratively using the ADAM optimization method (see detailed algorithm in Kingma and Ba (2014)). The explicit dependence of output posterior on the network parameters (see Käufl et al., 2016a and references therein) can be avoided by using multiple MDNs and combining them by weighted sum. The weight of each MDN is determined by the test set error as where index i denotes the ith MDN (C MDNs in total) and N is the size of the test data set, and the MDNs are combined according to p(m k |x; α) = ∑ C i=1 w i ∑ j w j p i (m k |x; α i ).  Murakami et al. (2012) and Fiquet et al. (1996). Comparison with previously published EOSs (Tange et al., 2009;Speziale et al., 2001;Lithgow-Bertelloni, 2005, 2011;Dorogokupets and Dewaele, 2007) along 1500 K (top), 2000 K (middle) and 2700 K (bottom) isotherms also shown. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)