An assessment of long duration geodynamo simulations using new paleomagnetic modeling

: 12 Long-term temporal variations of the magnetic field (timescales > 10 Myr), characterized 13 from paleomagnetic data, have been hypothesized to reflect the evolution of Earth’s deep interior 14 and couplings between the core and mantle. By tying observed changes in the paleomagnetic 15 record to mechanisms predicted from numerical geodynamo simulations, we have a unique tool 16 for assessing changes in the deep interior back in time. However, numerical simulations are not 17 run in an Earth-like parameter regime and assessing how well they reproduce the geomagnetic 18 field is difficult. Criteria have been proposed to determine the level of spatial and temporal 19 agreement between simulations and observations spanning historical and Holocene timescales, 20 but no such criteria exist for longer timescales. 21 Here we present a new set of five criteria (Quality of Paleomagnetic Modeling criteria, 22 Q PM ) that assess the degree of semblance between a simulated dynamo and the temporal and 23


Introduction
The geomagnetic field has been a fundamental feature of the Earth for the past 3.5 billion years (Biggin et al., 2011(Biggin et al., , 2015;;Tarduno et al., 2010) and may have been active since the Hadean (Tarduno et al., 2015).Originating in Earth's core and extending into space, the magnetic field shields the atmosphere from erosion by the solar wind, allowing for the preservation of liquid water on the surface and ultimately habitability (Tarduno et al., 2014).Thus, the geomagnetic field is a link between surface, interior, and exterior processes back through geologic time.The magnetic field observed at Earth's surface contains contributions primarily from internal sources.Temporal variations (secular variation) and spatial variations in the internally generated field can be characterized through direct observations using surface, satellite, and aeromagnetic measurements for the last few hundred years (historical record) and indirectly via paleomagnetic/archeomagnetic measurements going further back in time.These variations provide insight into the magnetohydrodynamic processes that occur in the outer core, and into how these processes may be affected by boundary conditions imparted at the mantle and inner core interfaces (Aubert et al., 2010;Olson et al., 2013).The similarity between the timescales observed for long-term variations in the magnetic field, e.g. the timescale for reversal frequency variability, to those of convective overturn in the mantle (200 Myr), has led to the hypothesis that long-term magnetic field variations are a result of external forcing mechanisms and reflect the evolution of Earth's deep interior (Biggin et al., 2012;Jones, 1977;McFadden and Merrill, 1984).If observed variations in the long-term magnetic field can be tied to mechanisms predicted from numerical geodynamo simulations, it would then be possible to evaluate changes in the deep interior going back in geologic time, adding a crucial dimension to our understanding of Earth's evolution.
In the last three decades, significant advances have been made in the field of numerical geodynamo modeling.These simulations have succeeded in capturing the main features of the Earth's magnetic field, such as a dipole dominated field and polarity reversals (e.g.Christensen and Wicht, 2015;Glatzmaier and Coe, 2015;Glatzmaier and Roberts, 1995), in addition to aspects of historical secular variation (Bloxham, 2000;McMillan et al., 2001) such as westward drift (Christensen and Olson, 2003) and weak activity in the Pacific hemisphere (e.g.Aubert et al., 2013;Davies et al., 2008;Gubbins et al., 2007;Mound et al., 2015).Furthermore, simulations have been used to make predictions about magnetic field behavior including estimates of internal field strength and core flow speed (Christensen et al., 2009;Christensen and Aubert, 2006), relations between field strength and reversal frequency (Olson, 2007), the role of core-mantle boundary heat flow in affecting field behavior and flow dynamics (Amit et al., 2015;Amit and Olson, 2015;Olson and Christensen, 2002;Olson et al., 2010) and time-average field morphology (Amit et al., 2015;Amit and Choblet, 2009;Davies et al., 2008;Gubbins et al., 2007).However, due to computational limitations, numerical dynamo simulations cannot yet run with the small diffusion coefficients that characterize the core fluid.In terms of non-dimensional numbers, the Ekman number  (the ratio of viscous to Coriolis forces) and the magnetic Prandtl number  (the ratio of viscous to magnetic diffusion) are many orders of magnitude larger than those estimated for Earth.Lowering  to geophysical values of 10 −15 is the main challenge.
Recent simulations that utilized millions of CPU hours have reached  = 10 −7 (Schaeffer et al., 2017) or 10 −8 by parameterizing the smallest scales of the turbulence (Aubert et al., 2017), but were only run for short time periods and do not reverse.A key issue is then to determine to what degree a given simulation can be said to exhibit 'Earth-like' properties.
To assess whether a numerical dynamo simulation produces an Earth-like magnetic field, past studies have utilized observed behavior of the recent geomagnetic field derived from global time-dependent field models spanning historical and Holocene timescales, to develop criteria that can be used to assess the similarity between numerical simulations and Earth (Amit et al., 2015;Christensen et al., 2010;Davies and Constable, 2014;Mound et al., 2015).These global field models are constructed from satellite, observatory and survey magnetic field observations over the historical period 1590-1990 AD (gufm1; AD (gufm1;Jackson et al., 2000) and from archeomagnetic and paleomagnetic data, collected from archeological artifacts, sediments, and volcanic rocks for the past 10 to 100 kyr (e.g.Korte and Constable, 2011;Panovska et al., 2018).Existing criteria utilize large-scale properties of the field morphology (Christensen et al., 2010;Mound et al., 2015;Amit et al., 2015) or the frequency content of the dipole moment time-series (Davies and Constable, 2014) derived from these global field models as the basis for assessing whether a numerical simulation reproduces Earth's magnetic field behavior.In practice, these criteria have been used to assess the compliance of both short (< 10 5 yr) and long duration (~10 5 -10 7 yr) dynamo simulations with Earth-like behavior (e.g.Driscoll and Wilson, 2018), despite being based on features of the recent geomagnetic field.While it has been suggested that modern secular variation, as captured by global field models, is representative of expected variations over the entire history of the geodynamo, this is fundamentally uncertain (Johnson and McFadden, 2015).Furthermore, current criteria based on time-dependent field models do not include aspects of Earth's long-term magnetic field behavior not observed in the Holocene, such as polarity reversals.To properly assess whether simulations behave like Earth on longer time scales (>10 5 -10 7 yrs), we need to define a new set of criteria which can be used to assess how well numerical simulations reproduce paleomagnetic field behavior.
Here we present a new set of criteria to compare long-term behavior of numerical dynamo simulations with paleomagnetic observations: the Quality of Paleomagnetic Modeling (QPM) criteria.Criteria are assessed using a two-fold approach: 1) the calculation of a nonparametric misfit score (∆   ) between simulated and Earth data, inspired by the approach used in Christensen et al. (2010), and 2) the assignment of a binary score (   ), inspired by the paleomagnetic Q (Van der Voo, 1990) and QPI (Biggin and Paterson, 2014) approaches commonly used in the assessment of paleodirectional and paleointensity studies, respectively.
Total misfit values, ∆  , and total   scores are evaluated over all criteria, where for each criterion that is met the total   score increases by 1, to a maximum score of five.The utility of a two-fold method is as follows.First, this approach helps bring all simulations, regardless of the parameter space in which they were run, to the same baseline, easing comparison between them.
Second, the ∆  helps to quantify overall how close a simulation is to reproducing Earth's paleomagnetic behavior, while the   score highlights which specific aspects of the paleomagnetic field a simulation is reproducing well.Finally, the QPM approach does not prescribe a strict threshold below which a simulation is deemed incompatible with paleomagnetic observations, which allows users to assess which of the paleo-field properties are most important to reproduce for their study.This permits users to get the most out of their simulations, which for timescales on the order of 1 Myr may have taken tens of thousands of CPU hours to run.
The chosen five criteria represent a range of commonly reported paleomagnetic observables that reflect temporal and spatial variations in the long-term magnetic field.Global time-dependent field models are not available for the timescales of interest here and our new criteria reflect the available data in the paleomagnetic record.For the purpose of this study, their Earth-like values are derived for the past 10 Myr as reported in the recent compilation of paleomagnetic directional data, PSV10 (Cromwell et al., 2018), and the paleointensity (PINT) database (Biggin et al., 2009(Biggin et al., , 2015) ) (Table 1).These criteria are assessed at Earth's surface, requiring conversion of Gauss coefficients from geodynamo simulations into pseudopaleomagnetic data.The five criteria address different aspects of the time-average and timevarying field and are as follows: inclination anomaly, virtual geomagnetic pole (VGP) dispersion at the equator, latitudinal variation of VGP dispersion, normalized width of virtual dipole moment (VDM) distribution, and dipole field reversals.We have assessed the compliance of our criteria with a large number of published (Davies et al., 2008;Davies and Constable, 2014;Davies and Gubbins, 2011;Gubbins et al., 2007), and new long-duration geodynamo simulations.These simulations span a wide parameter space that was chosen to best capture a broad range of simulation behavior and serves to demonstrate the QPM approach.Because we cannot predict a priori which simulations will reproduce Earth's paleomagnetic field, we have chosen to explore this parameter space systematically.
In the following sections we will first review the observable properties of the paleomagnetic field that will be used as the foundation of the QPM criteria and introduce the five QPM criteria.We then outline how compliance with these criteria is met.Next, we use these criteria to assess how well our suite of 46 long-duration geodynamo simulations reproduce the Earth's paleomagnetic field.We close with a discussion of the implications of our results, and how we foresee the utilization of these criteria in the future.

Paleomagnetic Modeling Criteria for Geodynamo Simulations (QPM)
In order to be effective, the criteria for assessing a numerical simulation should be objective and quantifiable (Christensen et al., 2010), and address a well-established property of the paleomagnetic field.We base our criteria solely on paleomagnetic observables and not on global time-dependent models (e.g.Panovska et al., 2018), which do not cover the time-frame of interest (>100 kyr), or statistical field models (e.g.Tauxe and Kent, 2004), which fail to reproduce paleomagnetic observations of paleosecular variation (PSV) and time-average field (TAF) behavior, or TAF models (e.g.Cromwell et al., 2018), which do not represent PSV.
Observations made directly from paleomagnetic datasets provide the most reliable representation of long-term magnetic field behavior and we therefore use them as the foundation of our criteria.
Another constraint on viable criteria arises from limitations inherent in current dynamo simulations.Simulations spanning paleomagnetic timescales must run for long periods, which increases the computational cost and further limits the parameter range that can be accessed.We therefore focus on large-scale features of the field as has been done in previous studies (Christensen et al., 2010;Davies and Constable, 2014;Mound et al., 2015;Wicht and Meduri, 2016).

Paleomagnetic Basis for QPM Criteria
To assess the behavior of the magnetic field on long time scales we are interested in both the geometry of the TAF in addition to temporal variations about the long term average (PSV).
An overview of standard paleomagnetic observables is presented in the Supplementary Materials.Full vector records of the paleomagnetic field are sparse and unevenly reported.
Therefore, in defining the QPM criteria, magnetic directions and intensity are treated separately.

Criterion Based on Time-Average Field Behavior
A fundamental assumption in paleomagnetic studies is that over a sufficiently long time period the field can be best approximated by a geocentric axial dipole (GAD), where inclination (  ) is predicted to vary with latitude (λ) via the axial dipole equation However, paleomagnetic records show small, yet persistent, deviations from GAD (Cox, 1975;Cromwell et al., 2018;Johnson et al., 2008).The two parameters used to represent this offset in paleomagnetic studies are inclination anomaly and declination anomaly, defined as Here,  and  are the calculated Fisher mean (Fisher, 1953) inclination and declination values from measured samples.Note, the declination predicted from a GAD field is zero.Due to large gaps in spatial coverage in long-term paleomagnetic records, investigations are restricted to latitudinal structure only.In observational datasets, an accurate measure of ∆ is much harder to capture from paleomagnetic data than ∆, due to error in or absence of sample orientation, unrecognized tectonic rotation, and the expected long-term behavior of the longitudinal variation of the non-GAD field as captured by declination.Therefore, for our criteria we utilize ∆ (IncAnom), as one of our measures of TAF behavior.
The IncAnom criterion utilizes the maximum absolute median ∆ (calculated from 10° latitude bins) and its 95% confidence intervals as a measure of TAF behavior.We do not require that simulations match the observed latitudinal geometry of Δ, which we believe is justified since the latitudinal variation of Δ is not well-constrained in the long-term magnetic field (Cromwell et al., 2018).
A measure of the mean field intensity or mean VDM (see equation 8 below) is most often used as a metric of TAF intensity.Dynamo simulations solve dimensionless equations and so scaling the results into a dimensional field strength is non-unique.Davies and Constable (2018) found that estimates of the local field intensity varied by a factor of 2-3 between two different magnetic field scalings within a given geodynamo simulation.Additionally, the ratio between the Elsasser and Lehnert number scalings commonly used in the literature is and Christensen, 2006), which at  = 10 −4 and  = 1 could produce a factor of 100 or more difference in the field strength.Due to these complexities, we do not include a direct measure of TAF behavior in regard to intensity in our QPM criteria.

Criteria Based on Paleosecular Variation Behavior
VGP angular dispersion () is a commonly used metric to quantify paleosecular variation in the long-term paleomagnetic field.Using VGP dispersion allows for the estimation of paleosecular variation when detailed age control and time series data are unavailable.To mitigate the latitudinal dependence of magnetic field direction, a standard approach in paleomagnetism is to calculate the geocentric dipole that would give rise to the observed site directions, where the VGP is the position that the dipole pierces Earth's surface (cf.Butler, 1992).Here we use the paleomagnetic definition of sites, which are assumed to capture individual snapshots of the magnetic field, i.e., a single cooling unit.The dispersion about a mean pole , from a set of  VGPs contained in a locality or latitude band, can then be determined as an estimate of paleosecular variation, where (5) Here, ∆  is the angular distance of the ith VGP from the geographic pole or mean VGP.Note, for paleomagnetic data,  would further be corrected to remove within site dispersion due to random errors in measuring and sampling (e.g., Cromwell et al., 2018).
It has been observed that  varies as a function of latitude, for which various explanations have been hypothesized (cf.Merrill et al., 1996).The phenomenological Model G of McFadden et al. (1988) is often used to approximate the latitudinal variation of VGP dispersion where  is described as a function of (paleo)latitude and two parameters,  and , Here,  and  are argued to represent variations in the equatorially symmetric and equatorially anti-symmetric spherical harmonic decomposition of the field, respectively.The  and  parameters are calculated by a least squares fit between the measured VGP dispersion curve and that determined by Model G.
For our criteria we have chosen to apply the quadratic fit, as defined by Model G, as a metric of PSV behavior, with  and  parameters defining separate criteria (VGPa and VGPb).
We treat the compliance with the minimum (equatorial) dispersion, , and the latitude dependence, , separately in our framework, since these characterize different aspects of field variability.
The input simulated data for VGP dispersion are  values calculated after using a Vandamme cutoff for both Earth data and simulated outputs (S VD ;Vandamme, 1994).The Vandamme cutoff helps to exclude anomalous VGP data, with the intention of preventing bias in the dispersion estimate from magnetic excursions or reversals.The Vandamme cutoff is not constant, but instead is allowed to vary as follows where  is calculated from the simulated data.Sites with VGP latitudes less than   are excluded,  is recomputed, and the procedure is repeated until all remaining VGPs are within the cutoff angle.The final  value is then noted as   .
Like magnetic directions, the intensity of the magnetic field is also latitudinally dependent.To remove this dependence, a VDM is calculated, which is the strength of the geocentric dipole that produces the observed field intensity , at a given paleolatitude where   is radius of Earth's surface,   is the magnetic colatitude calculated using the mean inclination and the axial dipole equation (1), and  0 is the permeability of free space.
To provide an estimate of temporal variation in magnetic intensity, in this study we chose to measure the variability of a distribution of VDMs (VDMVar), through where  ̂is the interquartile range of a distribution of VDM values, and   is the corresponding median.The VDMVar criterion is passed if the % calculated from simulated data falls within the range estimated for Earth.

Criteria Based on Other Paleomagnetic Observables
The final criterion assesses dipole field reversals (Rev).The Rev criterion is met if a simulation reverses in an Earth-like manner.While reversals are a fundamental feature of Earth's magnetic field, an agreed formal description remains elusive.Here, we define a set of standards that we think faithfully represent the fundamental characteristics of geomagnetic field reversals.
To pass this criterion a simulation must: a) exhibit at least one reversal in the dipole field after the initial transient period, b) result in a new stable direction, and c) the proportion of time spent in a transitional state is within the range calculated for Earth.For our simulations, we estimated the first two standards by first calculating τn, the relative proportion of time spent with a normal polarity (i.e., the time spent with true dipole latitudes >45° divided by the total simulation time), τr, the relative proportion of time with a reverse polarity (i.e. the time spent with true dipole latitudes <45° divided by the total simulation time), and τt, the relative proportion of time spent in transitional periods (i.e. the time spent with true dipole latitudes between 45° and -45° divided by the total simulation time).A simulation passes the first two requirements if both τn and τr are greater than τt.Finally, if the calculated τt for a simulation falls within the range estimated for Earth, the simulation passes the Rev criterion.

Acceptance Thresholds Based on Earth Values for the Past 10 Myr
Establishing acceptance thresholds for QPM criteria that are representative of Earth's long-term magnetic field behavior is non-trivial.Ideally, the values for the established criteria should be representative of the paleomagnetic field for all of Earth's history.However, it has been hypothesized that PSV and the TAF structure are dependent on conditions at the coremantle boundary (CMB), and therefore are expected to be variable throughout geologic time (Jones et al., 1977).For the purpose of this study, we consequently chose to focus on the PSV and TAF structure of the paleomagnetic field for the last 10 Myr.This time period was chosen because paleomagnetic data for the last 10 Myr provide sufficient temporal and spatial coverage to enable global analysis, and are additionally young enough to not be strongly affected by plate motion and changing CMB conditions.For the assessment of TAF and PSV behavior for the past 10 Myr we utilized two datasets, PSV10 (Cromwell et al., 2018) and the PINT database (Biggin et al., 2009).For an assessment of reversal behavior for Earth for the past 10 Myr we utilized the 2012 Geomagnetic Polarity timescale (Ogg, 2012).Acceptance thresholds based on Earth values for our chosen criteria as measured are reported in Table 1.A description of the datasets and how specific criteria were estimated is presented in the Supplementary materials.

Rating Compliance with the Paleomagnetic Field
To rate the compliance of the numerical simulation output with long-term magnetic field behavior we first define a misfit parameter for each criterion, ∆   , where i denotes the five criteria VGPa, VGPb, Rev, VDMVar, and IncAnom.We chose to use this method because it is non-parametric, as the distribution of paleomagnetic data is not well-constrained.Here, ∆   is calculated by This parameter is the ratio of the absolute distance between the median Earth value (12) respectively.If ∆  ≤ 5 and   = 5, then a simulation meets all set criteria.

Geodynamo Simulations
The geodynamo simulations parametrization and solution methods used in this study have been extensively documented elsewhere (Davies and Constable, 2014;Davies and Gubbins, 2011;Willis et al., 2007) and so only a brief description is given here.An incompressible Boussinesq fluid is confined within a spherical shell of width  =   −   , where   and   are the inner and outer boundary radii respectively, rotating about the vertical direction at an angular frequency Ω.The system is thermally driven and the Boussinesq approximation is employed so that density variations are accounted for only in the buoyancy force.The fluid has a constant kinematic viscosity , thermal diffusivity , thermal expansivity , and magnetic diffusivity  = ( 0 ) −1 , where  is the electrically conductivity.The shell aspect ratio is fixed to   /  = 0.35 in this study and Prandtl number ( =   ) is set to 1.The following parameters control the system; Here,  is gravity,  is the modified Rayleigh number, and / is the amplitude of the prescribed temperature gradient at the outer boundary.The solution consists of the magnetic field , fluid velocity , and temperature T throughout the spherical shell and at each time point.
All simulations employ no-slip boundary conditions, that is  =  at   and   .For the magnetic field, the top and bottom boundaries are insulating.Therefore, above the core region the magnetic field is represented by a potential field that matches to the dynamo solution at   .
Fixed heat flux is prescribed at   in all simulations (denoted FF), while FF or fixed temperature (FT) conditions are applied at   .Some simulations additionally employ lateral variations in heat flow at   .Here the pattern is either derived from the seismic shear-wave velocity model of Masters et al. (1996) or a recumbent  2 0 heat flux pattern is used as an approximation to the observed shear-wave structures (Dziewonski et al., 2010).The amplitude of the heat flow anomalies is defined by the parameter  = (  −   )/  , where   ,   and   are the maximum, minimum and average heat flow on the outer boundary.We consider values of  = 0.3 − 1.5 (Table 2) and note that the largest values do not conflict with the Boussinesq approximation (see Mound and Davies, 2017).
In our suite of simulations, 10 have been reported in previous studies (Davies et al., 2008;Davies and Constable, 2014;Davies and Gubbins, 2011;Gubbins et al., 2007) (Table 2).Three of these simulations were integrated further here [Model 2 (B2), Model 3 (B4), Model 8 (B3)] in addition to 36 new simulations.The parameter regime explored in these simulations is as follows:  = 10 −3 − 1.2 × 10 −4 , Rayleigh numbers ranging from 20-450 corresponding to roughly 1-100 times the critical value for onset of non-magnetic convection, and magnetic Prandtl numbers ranging between 2 and 20 (Table 2).All simulations were run for ~3-30 outer core magnetic diffusion times, or the equivalent of a minimum of about 300 kyr -3 Myr using the electrical conductivity value of 3 x 10 5 S/m from Stacey and Loper (2007).

QPM Criteria Calculation Protocol
For the assessment of QPM criteria, Gauss coefficients up to spherical harmonic degree   = 10 were calculated at Earth's surface for each simulation.From the truncated data, we generated simulated values of declination (), inclination (), and intensity () using a spherical harmonic expansion, where  is the magnetic scalar potential and  = −∇, defined according to Here, , , and  are spherical coordinates (radius, colatitude, and longitude),    are the Schmidt-normalized associated Legendre functions of degree  and order , and    and ℎ   are the Gauss coefficients.
Representative examples of simulations that pass or fail each criterion are presented in Fig. 2 (Rev), Fig. 3 (IncAnom), Fig. 4 (VGPa and VGPb), and Fig. 5 (VDMVar).All assessed simulation results are presented in Supplemental Figures S1-3.Values for all calculated QPM parameters are given in Supplemental Table S2 and   results are in Table 3.
Total misfit values, ∆  , for all 46 simulations range from 5.6 to 22.2, with a median value of 10.5.VGPa had the highest median misfit value of 3.4 (Fig. 1c).In a majority (74%) of simulations, misfit values for VGPa were higher than for any other criterion (Fig. 1c).The distribution of ∆  reveals no correlation between total   score and ∆  (Fig. 1d).This lack of correlation clearly highlights that none of our simulations are simultaneously reproducing all aspects of Earth's long-term field behaviour; if a simulation is reproducing some aspects of the paleomagnetic field behavior (highlighted by   scores of 2 or 3), often it is very far from reproducing a different aspect (evidenced by high ∆  values).In the majority of cases with high   scores and high ∆  (74%), the parameter with the highest misfit (≫ 1) is VGPa.
The distributions of simulated values for each QPM criterion generally display two peaks that fall to either side of Earth values for the last 10 Myr (Fig. 6).For most criteria, the simulations fail to pass because simulated values were higher than Earth (except for VGPb, where the latitude dependence of VGP dispersion is equally under or over represented relative to Earth).Furthermore, for each criterion, simulations showing reversals had higher simulated values than those that did not reverse, with the highest values obtained for simulations with τt > 0.15.Reversing simulations show high VGP dispersion and ∆, but Earth-like % values.In general, non-reversing simulations have lower VGP dispersion and high ∆, and often insufficient variation in field strength to pass the VDMVar criterion (Fig. 6).No reversing simulations passed the VGPa criterion, with calculated values higher than those observed for Earth (Fig. 6).In general, positive correlations are observed between calculated values for % -∆, %-,  -, τt -, and ∆ - (Supp Fig. S4), forming a quasi-linear trend that contains Earth.
No universal trends between   or ∆  and input parameters for the simulations assessed in this study were identified.In general, the application of inhomogeneous boundary conditions pushed simulations further from Earth, as reflected in increased ∆  values as  increases (Table 3).However, this trend only applies when the application of an inhomogeneous boundary condition resulted in a reversing simulation with τt > 0.15.In the case where a simulation with an inhomogeneous boundary condition remained non-reversing, there are small changes in the calculated parameters (Table S2), which results in a lower misfit score with increasing .Future work will need to be conducted to further determine the effects of heterogeneous boundary conditions on long-term field behavior.In general, there is a positive trend between the magnetic Reynolds number ( =   ⁄ , where  is the time-averaged RMS flow amplitude) and all calculated parameters utilized for QPM assessment (Supp.Fig. S5).
Plotting our simulation results as a function of magnetic Ekman number (  = /) and  shows that many of our simulations fall within the wedge-shaped region of Christensen et al. (2010) for simulations with FF boundary conditions (Fig. 7).However, conformance with Earth's long-term field behavior for simulations that fall within the wedge is not assured as   scores within the wedge range from 0 to 3 and ∆  values range from ~6 to 22. Furthermore, many of our simulations that performed relatively well, with ∆  less than 10, fall outside of the wedge.

Limitations of QPM Approach
One limitation of the presented QPM criteria is that only data from the past 10 Myr are used to calculate values for Earth's TAF and PSV behavior and are not necessarily representative of all periods of Earth history.As stated previously, we utilize paleomagnetic records for the past 10 Myr because this time period represents the most comprehensive record of TAF and PSV behavior.However, the QPM framework can be used for any interval of Earth history where a sufficient quantity of robust paleomagnetic data are available, but the relative importance of each criterion and associated acceptance regions will need to be updated to reflect paleomagnetic behavior for that time period.We also acknowledge, as discussed in section 2, that alternative paleomagnetic observables exist which are not used here.Notwithstanding, the parameters chosen for QPM criteria are based on well-established and commonly employed measures in paleomagnetic studies.We are confident that they appropriately describe the paleomagnetic field and are suitable to assess the degree to which geodynamo simulations are accurately replicating Earth's long-term magnetic field behavior.
A caveat to the QPM framework, and to any other study that uses the observed field to assess dynamo simulations, is that reproducing these paleomagnetic observables does not inherently demonstrate that a simulation is Earth-like.Magnetohydrodynamic theory suggests that the magnetic, Coriolis and buoyancy (Archimedian) forces are dominant in the momentum equation, termed MAC balance (e.g.Aubert et al., 2017;Starchenko and Jones, 2002).However, it is currently unclear whether the core is in a global MAC balance (Aurnou and King, 2017) and the issue cannot be resolved by current observations.It appears that MAC balance emerges in simulations as  and  are reduced towards geophysically relevant values (Aubert et al., 2017;Schaeffer et al., 2017), though some simulations at relatively high (∼ 10 −4 ) may display MAC balance at leading order with non-negligible secondary contributions from viscous and inertial effects (Aubert et al., 2017;Dormy, 2016).As stated previously, low  and  values have not been achieved in simulations that span long timescales.In view of these limitations, here we chose to focus on criteria that can be derived from paleomagnetic observations and do not consider those based on the internal dynamics of the simulations.

Implications of Simulation Assessment
An unexpected outcome from our assessment of 46 simulations using the QPM criteria is that none are simultaneously reproducing all aspects of Earth's paleomagnetic field and that there are no obvious combinations of control parameters which will yield a simulation that reproduces Earth's long-term field behavior.This result contrasts with the findings in Christensen et al. (2010), who showed that geodynamo simulations within a certain   / space can reproduce properties of the historical field.A potential explanation for the discrepancy between our results and those of Christensen et al. (2010) is that the uncertainty estimated for Earth parameters in the two studies were constructed following different approaches.Because the different time-dependent models of magnetic observations utilize direct and indirect observations, and span different time intervals, Christensen et al. (2010) assigned generalized 1-σ error bounds ranging from a factor of 1.75 -2.5 times the magnitude of the observation to their Earth parameters.For our criteria, we instead utilized 95% confidence bounds calculated directly from paleomagnetic data.Our most restrictive criterion is VGPa, but it is arguably one of the best constrained Earth parameters for the last 10 Myr.The determination of a is dependent upon   values for localities near the equator.In the PSV10 dataset, there are eight localities with latitudes between 10° and -10° ranging across all longitudes, with a minimum number of sites at each locality of at least 33.The maximum   values estimated from these localities is ~15° (including 95% confidence intervals) and the minimum is ~ 6°, which is the absolute range that a can fall within.Even if we use these estimates for our range for Earth a values, our simulations are still well outside this range with a minimum a value of ~27° for simulations that reverse.Furthermore, a recent compilation of directional data for the Cretaceous and Middle Jurassic suggests that a values were between ~8° and 13° for these time periods, respectively, similar to our estimates for the past 10 Myr (Doubrovine et al., 2019).If we use the same approach as Christensen et al. (2010) for estimating uncertainty bounds, it would extend the values for a from 0° to 36°, at 1σ; such a range is inconsistent with estimates determined by paleomagnetic data.
An additional potential cause of the discrepancy between our findings and those of Christensen et al. (2010) could simply be that the field morphology observed for the historical field is not the one expected for long time scales and that secular variation of the recent field does not accurately reflect the behavior of the long-term paleomagnetic field.This is quite plausible given that spontaneous variations in field behaviour appear, from e.g. the PADM2M dipole model (Ziegler et al., 2011), to be active on timescales far longer than those captured in time-dependent field models.
The relatively low total   scores achieved by our simulations appears to be related to a tendency for many simulations (particularly those which reverse) to produce strong and/or strongly variable non- 1 0 components.Generally, simulations that reverse have higher ∆, , , and % values (falling significantly outside the range of Earth), suggesting that these high/more variable non- 1 0 components are more prevalent in reversing and multipolar simulations, as known from previous dynamo studies (Christensen and Aubert, 2006;Kutzner and Christensen, 2002) (Fig. 6).This trend may not hold true for all reversing simulations, as only two simulations passed Rev in this study, and more simulations should be assessed in the future to test this trend.
To find a simulation that better captures Earth's paleomagnetic field, the non- 1 0 components must be reduced while the  1 0 term remains capable of spontaneously changing its sign.
Simulations that reverse and maintain a larger degree of dipole dominance have been produced in previous studies (e.g., Driscoll and Olson, 2009;Lhuillier et al., 2013;Wicht et al., 2009;Wicht and Meduri, 2016) and in future work it would be valuable to assess how these simulations perform using the QPM criteria.In our study, the only simulations with ∆  values approaching the Earth-like regime are those that do not reverse, suggesting that we currently cannot exclude non-reversing simulations in our quest for an Earth-like simulation.
We can conclude that simulations which fall within the 'wedge' of Christensen et al. (2010) are not guaranteed to reproduce Earth's paleomagnetic field behavior, and that compliance with the long-term magnetic field should be assessed separately, similar to findings of Davies and Constable (2014) for the Holocene.This is especially pertinent for studies that use the output from numerical geodynamo simulations to formulate corrections to paleomagnetic data [e.g., Driscoll and Wilson (2018), Lhuillier and Gilder (2013)], as these corrections may include non-Earth-like TAF and PSV behavior.
In this study we did not find long-duration simulations which simultaneously reproduce all aspects of Earth's paleomagnetic field behavior.However, our exploration of the possible parameter space is not exhaustive, and the fact that our simulations bracket Earth values suggest that a simulation reproducing Earth's paleo-field behavior should exist within a computationally accessible parameter regime.Overall, more long-duration simulations need to be assessed using the QPM criteria in the future.

Conclusions
We developed a framework for assessing the compliance between numerical geodynamo simulations and long-term magnetic field behavior (QPM criteria).Using QPM criteria, the compliance of 46 simulations with magnetic field behavior for the past 10 Myr was considered.
We found that our simulations achieved a maximum   score of 3 out of 5, with most simulations scoring much lower, and with median ∆  misfit values of ~10, where less than 5 indicates compliance with Earth behavior.Low   scores appear to be partly due to enhanced non- 1 0 components relative to those observed for the last 10 Myr on Earth.There appears to be no specific combination of   / parameters in which simulations reliably replicate Earth's long-term field behavior.Furthermore, we find that compliance with the criteria set by Christensen et al. (2010) does not guarantee that a simulation reproduces Earth-like TAF and PSV behavior.
The QPM framework can provide a path towards developing simulations which can reproduce Earth's long-term magnetic field behavior in the future.This framework can be modified to represent periods of different geodynamo behavior in Earth's past, e.g. the Cretaceous or Middle Jurassic, allowing for a more robust characterization of the evolution of the deep interior through Earth's history, provided a sufficient quantity of robust paleomagnetic data are available.

Figure 2 .
Figure 2. Representative reversal behavior for three end-member behaviors observed from the

Figure 6 .
Figure 6.Histograms of calculated values from each simulation, shown for each criterion,

Figure 7 .
Figure 7. Evaluated dynamo simulations plotted as a function of magnetic Ekman number ( η )

Table 1 .
Summary of Earth time average field and paleosecular variation values.Med indicates median values, high indicates the upper 95% confidence bound, and low indicates lower 95% confidence bound.Values for , , and  were calculated from data presented in Cromwell et