Thermosphere Model Assessment for Geomagnetic Storms from 2001-2023

We present an updated study of thermosphere model assessment under geomagnetic storm conditions, defined as intervals when the geomagnetic index ap reaches 80 or larger. Comparisons between five empirical models (NRLMSISE-00, JB2008 and three versions of DTM2020) and the CHAMP, GRACE, GOCE, Swarm-A and GRACE-FO neutral density data sets for 152 storms are presented. The storms are categorized according to ap as single-peak or multiple-peak. After applying a model debiasing procedure using the density data just before the onset of a storm, the models are on average only slightly biased, often by only a few percent. This is an unexpected and reassuring result for these relatively simple models, which were fitted to different observations. The standard deviations of these averages are however up to 12% (1-sigma), which puts the small biases into perspective. The smallest biases are achieved at the lowest altitude, when comparing with GOCE data, and the highest for GRACE. The best results, i.e. the smallest bias and standard deviation on average over all single-peak storms, over the entire 4-Phase storm period, are obtained with the DTM2020_Intermediate and DTM2020_Research models, while the oldest model, NRLMSISE-00, is the least precise. However, NRLMSISE-00 is least biased when comparing to multiple-peak storms. As could be expected, multiple-peak storms are reproduced with less precision than single-peak storms. The assessment clearly reveals that model precision decreases with altitude, but that bias is independent of altitude, at least in the range covered by the data, 250-550 km.


Introduction
Semi-empirical (SE) thermosphere models are used operationally in the determination and prediction of the ephemerides of active satellites and orbital debris. Conjunction analysis is becoming a major issue with the fast-growing number of objects in space. Notably for geomagnetic storms, good knowledge of thermosphere model performance, or its uncertainty, is therefore necessary. The accuracy of the determination and prediction of ephemerides of objects in Low Earth Orbit (LEO; altitudes lower than 1000 km) depends largely on the quality of the atmospheric drag modeling. The drag force varies mainly due to fluctuations in total neutral density, and to a much lesser degree due to satellite characteristics (Doornbos, 2011; Mehta et al., 2017) and atmospheric composition and temperature. The variability of the thermosphere is driven by the changing solar Extreme Ultra-Violet (EUV) emissions and the changing Joule heating and particle precipitation due to the interaction of the magnetosphere with the solar wind (commonly referred to as 'solar activity' and 'geomagnetic activity', respectively), whereas a small part is due to upward propagating perturbations that originate in Earth's lower atmosphere (Yue et al., 2022). The accuracy of the forecast of the drivers is crucial to the accuracy of the density forecast. We do not address errors in the driver forecasts in this paper, but rather the model performance with observed drivers (or perfect forecasts of the proxies), i.e.
we quantify the precision of the combination of the model and the proxies it employs. In order to assess and track the progress over time of current and future thermosphere models, appropriate metrics and high-quality neutral density data sets, preferably with high spatial resolution and covering long intervals of time (i.e. years, and ideally a complete solar cycle), are required. In a previous paper (Bruinsma et al., 2021), specific storm-time metrics were introduced and described, and then applied to a small number (13) of isolated storms. The neutral density data used in that study to verify model precision in space and time have been augmented with data of the missions Swarm-A (van den Ijssel et al., 2020) and GRACE-FO (Siemes et al., 2023), whereas a more precise GRACE dataset (Siemes et al., 2023) replaces the older one. In this study, a total of 152 storms from 2001-2023, defined as events with a Kp of 6, i.e., ap=80 (units: 2 nT), and larger, is used to comprehensively assess the CIRA (COSPAR International Reference Atmospheres) models NRLMSISE-00 (hereafter MSIS00; Picone et al., 2002), JB2008 (Bowman et al., 2008), and DTM2008 was not assessed; DTM2020 (Bruinsma and Boniface, 2021). A modification of the storm-time metrics was necessary in order to also accommodate double- or multiple-peaked storms, e.g., the 2003 Halloween storm. The density data and the adopted metrics allow objective benchmarking of the thermosphere models, and quantification and detailed description of improved (or degraded) performance. A major goal of this study is to provide CCMC (Community Coordinated Modeling Center: https://ccmc.gsfc.nasa.gov) with the list of storms and the density data, which will then be used to assess all thermosphere models that run there. Thermosphere model score cards can then be published, applying consistent and always identical metrics, which can help users select the best model for their objective. The following section briefly presents the models, the density data, and
storm-time metrics, with a small modification for double-peaked storms. The assessment results of the three models are presented in Section 3. After a short summary, the conclusions are given in Section 4.

Model descriptions
2.1 Semi-empirical thermosphere models

Semi-empirical models are used in orbit computation and mission design and planning because they are easy to use and computationally fast. They provide density and temperature estimates for a single location and time (e.g., for satellite positions along an orbit). These climatological (or 'specification') models have a low spatial resolution of the order of thousands of kilometers, and a temporal resolution of 1-3 hours that is imposed by the cadence of the geomagnetic indices. Some kind of storm pre-conditioning is taken into account through a combination of delayed and averaged geomagnetic indices, notably in MSIS00, and the resulting density estimate is based on the statistical fits to those drivers. SE models are constructed by optimally estimating the model coefficients to their respective underlying databases (some combination of total density, temperature and composition measurements). As a consequence, the SE thermosphere models adopt the scales of the satellite models used in the respective databases (e.g., DTM2020 based on TU Delft satellite models, JB2008 based on HASDM satellite models), and these are rarely consistent between modelers, as was shown in Bruinsma et al. (2022). The DTM2020-Operational model (DTM2020_Ops) was developed recently, using F10.7 and Kp as drivers. DTM2020_Int is an unpublished intermediate model, using F30 but still Kp as drivers, that was created in the development of the model DTM2020-Research (DTM2020_Res), which is too complicated to run operationally. The backbone of the DTM2020 models is the complete CHAllenging Mini satellite Payload (CHAMP; Doornbos, 2011), Gravity Recovery and Climate Experiment (GRACE; Bruinsma, 2015) and Gravity field and steady-state Ocean Circulation Explorer (GOCE; Bruinsma et al., 2014) high-resolution accelerometer-inferred density datasets, as well as Swarm-A densities (van den Ijssel et al.
2020). The JB2008 model was developed by fitting to 10 years of US Air Force daily-mean densities, 5 years of output of the U.S. Space Force High Accuracy Satellite Drag Model (HASDM; Storz et al., 2005), 5 years of CHAMP and 4 years of GRACE data. The files with the solar drivers, which are computed and sometimes modified by the modelers, are regularly updated. The driver files (SOLFSMY and DTCFILE) that were online in January 2024 were used for the assessment given in this paper. MSIS00 is used in this assessment because the difference with MSIS 2.0 (Emmert et al., 2020) is very small for geomagnetic storms and takes the form of a bias (the geomagnetic model algorithm was not updated), and because it is widely used. MSIS00 was constructed with mass spectrometer, incoherent scatter radar, historical accelerometer total densities, UV occultation, and various rocket measurements, along with global daily-mean orbit-derived thermospheric density and tabular lower-atmospheric temperature. The time resolution and drivers of the models are listed in Table 1.

Selected Density Data
The density variability in the thermosphere is very large, typically hundreds of percent, on long time scales (the solar cycle) but also on short time scales of hours to days in the event of geomagnetic storms. The variations in density depend on position, season, and the level of solar and geomagnetic activity. The high-resolution accelerometer-inferred densities and the GPS-inferred densities are in-situ, along the satellite orbits, but a single satellite still provides a rather poor spatial distribution: the latitudinal extent depends on the orbital inclination, and the local time coverage is essentially limited to the time of its ascending and descending pass and the precession of the orbital plane (negligible over the duration of a geomagnetic storm). While the measurement cadence typically is 10 s, a satellite passes the same latitude and local time (but at a different longitude) once per orbit of roughly 100 minutes. Table 2 lists the essential information of the five selected datasets: CHAMP (Doornbos, 2011), GRACE and GRACE-FO (GFO; Siemes et al., 2023), GOCE (Bruinsma et al., 2014) and Swarm-A (van den Ijssel et al., 2020). These precise density datasets are selected because they are appropriate for storm-time evaluation, notably providing precise measurements from pole to pole. The Swarm-A densities are derived by means of precision orbit determination and have an along-track resolution of about 1000 km at best, when solar and geomagnetic activity are high. The accelerometer-inferred densities have a constant resolution of about 600 km after low-pass filtering, without data rejection. The model assessment is performed by comparing to densities of 152 storms with a Kp of 6, i.e., ap=80, and higher. Annexe 1 lists the dates, minimum Dst and maximum ap of the storms, and which satellites monitored the storms.

Updated Storm-Time Metrics for Model-Data Comparison
Storms in this analysis are defined as Kp=6 or larger (ap=80 or larger). In Bruinsma et al. (2021), the storms were divided into four phases of fixed length, two before and two after the minimum Dst associated with the storm defined through Kp: pre-storm (1), onset (2), recovery (3), and post-storm (4). An update of the above phases is necessary in this study in order to accommodate storms with more than one peak in geomagnetic activity of Kp=6 or larger in phases 3-4 (at least 9 hours apart), of which the Halloween 2003 storm is a much-studied case. The intervals of such multiple-peaked storms can be unambiguously defined using ap/Kp, which was preferred over a more complicated definition based on both Dst and ap. For consistency, we modified the definition of t0 of the storm phases for all storms, basing it on ap reaching 80; Dst is no longer used. Figure 1 (top) illustrates the four phases of a single-peak (SP) storm. Phase 1, the pre-storm interval, is used to de-bias the models with respect to the observations via a scale factor, which is then applied to the model densities in all four phases. This de-biasing procedure is used to minimize the effect of non-storm-related model differences and errors on the assessment. Density data for the 108 single-peak (SP) storms are selected from 30-hr before to 48-hr after the time ap reaches 80, which now defines t0 and the end of Phase 2 (storm onset). Phase 3 is the main and recovery phase, and Phase 4 the post-storm phase. Figure 1 (bottom) illustrates the phases for double- or multiple-peaked (MP) storms, in this example for 10-16 July 2004. The t0 for the 44 MP storms is defined as for SP storms, as the time when ap reaches 80. Phase 3 for MP storms, which in Figure 1 (bottom) is extended because ap reaches 80 a second time at t=1.4, lasts for a variable duration based on when ap falls below 80 again (at t almost 3.0 in this example) plus 36-hr. Phase 4 then extends 12-hr after the end of Phase 3.
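As an illustration only, the phase definitions above can be sketched in code. This is a minimal sketch, not the authors' software: the function name storm_phases, the grouping of threshold crossings at least 9 hours apart into separate peaks, and the use of the last ap>=80 sample to mark the end of elevated activity are our assumptions.

```python
import numpy as np

def storm_phases(t_hr, ap, threshold=80, sep_hr=9):
    """Return the storm class ('SP' or 'MP') and the four phase windows
    (in hours) for an ap time series sampled at the 3-hourly index cadence.

    t_hr : sample times in hours; ap : ap index values (units of 2 nT).
    """
    above = np.where(np.asarray(ap) >= threshold)[0]
    if above.size == 0:
        raise ValueError("ap never reaches the storm threshold")
    t_above = np.asarray(t_hr, dtype=float)[above]
    t0 = t_above[0]                        # first time ap reaches 80 -> t0
    # treat threshold crossings separated by >= sep_hr as distinct peaks
    n_peaks = 1 + int(np.sum(np.diff(t_above) >= sep_hr))
    if n_peaks == 1:                       # single-peak (SP) storm
        end3 = t0 + 36.0
    else:                                  # multiple-peak (MP) storm:
        end3 = t_above[-1] + 36.0          # last time ap >= 80, plus 36 hr
    phases = {1: (t0 - 30.0, t0 - 18.0),   # pre-storm (debiasing interval)
              2: (t0 - 18.0, t0),          # onset
              3: (t0, end3),               # main and recovery
              4: (end3, end3 + 12.0)}      # post-storm
    return ("SP" if n_peaks == 1 else "MP"), phases
```

Note that for an SP storm Phase 4 reduces to t0+36-hr to t0+48-hr, so the same expression covers both storm classes.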
The phases and their durations for SP and MP storms are computed as listed below with respect to t0 (MP values in parentheses):

Phase 1: t0 minus 30-hr to t0 minus 18-hr (MP: t0 minus 30-hr to t0 minus 18-hr)
Phase 2: t0 minus 18-hr to t0 (MP: t0 minus 18-hr to t0)
Phase 3: t0 to t0 plus 36-hr (MP: t0 to t0 plus a variable duration plus 36-hr)
Phase 4: t0 plus 36-hr to t0 plus 48-hr (MP: end of Phase 3 to end of Phase 3 plus 12-hr)

Because of the very large range in density values due to differences in altitude and solar activity (i.e. phase of the solar cycle), it is difficult to analyze and interpret a model's performance using absolute values. Therefore, relative precision is preferred, in the form of ratios that are easy to comprehend. The models and data are compared by computing the observed-to-computed density ratios (O/C). Because of their distribution, the mean µ and standard deviation s of the observed-to-computed density ratios are computed in log space following Sutton (2018):

µ = exp[ (1/N) Σ_i ln(O_i/C_i) ]                                   (1)
s = exp( sqrt[ (1/N) Σ_i (ln(O_i/C_i) − ln µ)² ] ) − 1

where N is the number of observations. A ratio of one indicates that on average the observations are reproduced perfectly, i.e. an unbiased model. Deviation from unity signifies underestimation (larger than one) or overestimation (smaller than one) by the model. The standard deviation (StD) of the observed-to-computed density ratios, s, expressed as a percentage by multiplying by 100, represents the combination of the ability of the model to reproduce observed density variations, the geophysical noise (e.g. waves, the short-duration effect of large flares), and the instrumental noise in the observations. The Pearson correlation coefficients R of observed and modeled densities are also computed; they are independent of model bias, and R² represents the fraction of observed variance reproduced by the model.

The number of SP and MP storms identified in this study, taking the available satellite density data into account, is 108 and 44, respectively. Storms for which data gaps are present in any of the four phases were rejected. Storms are often observed by more than one spacecraft, and the total number of SP and MP storm comparisons is 193 (CHAMP 48, GRACE 72, GOCE 16, and GFO 15) and 79 (CHAMP 31, GRACE 35, GOCE 2, Swarm-A 9, and GFO 2), respectively.

The comparisons of each model with satellite density data are computed in three steps:
1. Calculate the model debiasing scale factor with density data in Phase 1, per storm and per satellite (e.g., 48 scale factors for CHAMP and SP storms);
2. Apply the debiasing and calculate µi,s, si,s, and Ri,s, where i=1 to the number of storms per satellite s (e.g., i is 48 for CHAMP and SP storms), for the four phases separately and for all four together. Observations that are ten times smaller or bigger than model densities are rejected;
3. Calculate the average of the mean O/C and the standard deviation of the µi,s per satellite, again with eq. 1 (so calculated with the 48 µi,s in case of CHAMP and SP storms), referred to hereafter as <O/C> and StD <O/C>, respectively. Calculate the arithmetic average of the StD of O/C (i.e., of the si,s), and the arithmetic average of Ri,s as computed in step 2.
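The debiasing and per-storm statistics of steps 1-3 can be sketched as follows. This is a minimal illustration, not the authors' processing code: it assumes the standard log-space mean and standard deviation of Sutton (2018), and the function names and the exact form of the factor-of-ten outlier rejection are our assumptions.

```python
import numpy as np

def debias_factor(obs_phase1, mod_phase1):
    """Step 1: scale factor from pre-storm (Phase 1) data, i.e. the
    log-space mean observed-to-computed ratio, applied multiplicatively
    to the model densities in all four phases."""
    return np.exp(np.mean(np.log(obs_phase1 / mod_phase1)))

def storm_metrics(obs, mod, scale):
    """Step 2: log-space mean and StD of O/C (Sutton, 2018 style) and the
    Pearson correlation of observed and (debiased) modeled densities."""
    mod = scale * mod                       # apply the debiasing
    ratio = obs / mod
    keep = (ratio > 0.1) & (ratio < 10.0)   # reject >10x discrepancies
    logr = np.log(ratio[keep])
    mu = np.exp(logr.mean())                # mu = 1 means unbiased
    std_pct = 100.0 * (np.exp(logr.std()) - 1.0)   # StD as a percentage
    r = np.corrcoef(obs[keep], mod[keep])[0, 1]    # independent of bias
    return mu, std_pct, r
```

Step 3 then applies eq. 1 once more, to the per-storm mu values of each satellite, to obtain <O/C> and StD <O/C>.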

Storm-time assessment results
Scale offsets between models are mainly due to the ingested density data. Therefore, assessment of the performance of models over geomagnetic storms requires debiasing. The <O/C> and StD <O/C> are displayed in Figure 2 for all SP storms and per satellite for the five models. On average, the models perform rather well during storms, and the <O/C> per satellite density dataset are within 5% of unity for all models excepting MSIS00. Overall, the models only slightly underestimate during storms, by 1-5%. So at first sight model performance is fairly equal. However, the StD <O/C> indicated by the bars in Figure 2 show that the spread around the biases is not equal. The <O/C> and StD <O/C> are displayed in Figure 4 for all MP storms and per satellite for the five models. On average, the models perform well again during MP storms, and the <O/C> per satellite are within 6% of unity for all models.
Contrary to what was observed for SP storms, the models overestimate MP storms by a few percent (NB: only the CHAMP and GRACE results are based on a large number of storms). The model performance is significantly reduced for MP storms, which is reflected by the larger StD <O/C> compared with SP storms (Figure 2): for GRACE, for example, it increases from 7.5% to 9.9% for DTM2020_Res, and from 10.1% to 12.1% for JB2008. The arithmetic average StD of O/C for all MP storms shown in Figure 5 is also significantly larger than that obtained for SP storms shown in Figure 3, which confirms that the MP storms are reproduced with lower fidelity than SP storms.
The complete statistics per phase are given in tables, which enables identifying the strong and weak points of the models more specifically; this is probably not very pertinent for operational issues, but it is interesting for modelers. The smallest StD of O/C is seen in Phase 4 for all models, i.e. in the post-storm phase, and the highest mostly in Phase 2 for the DTM models and JB2008, while it is seen in Phase 3 for MSIS00. For all models, the highest correlation is always seen for Phase 4, and the lowest for Phase 2. The StD and correlation also reveal a small dependency on altitude, i.e. in the order GOCE-CHAMP-GRACE. While slightly higher, the Swarm-A StD and correlation results are better again than for GRACE, but this is due to the different nature of the densities, i.e. low-resolution GPS-inferred densities versus high-resolution accelerometer-inferred densities. The GFO StD and correlation results are also better than for GRACE, and this is most likely due to the much smaller number of storms for GFO. Table 6 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all multiple-peak storms. The Phase 3 results listed here for the multiple-peak storms are based on longer intervals (see Figure 1). The bold-faced numbers are displayed in Figure 4. Note that the GOCE and GFO results are based on only 2 multiple-peak storms, and are therefore not statistically significant. For CHAMP and GRACE, the smallest biases are now obtained with MSIS00 (NB: they are largest for the SP storms), and for Swarm-A with JB2008 and the DTM models. All models reveal a clear offset between CHAMP and GRACE compared with Swarm-A. The StD <O/C> for MP storms is noticeably higher than for SP storms.

Conclusions and outlook
The density data and indices for the 152 selected storms, as well as updated metrics for thermosphere model assessment, have been described in this paper. The mean and standard deviation of the density ratios and the correlations were computed using CHAMP, GRACE, GOCE, Swarm-A and GFO density data and the CIRA models MSIS00, JB2008 and DTM2020_Ops/Int/Res. Overall, taking all SP and MP storms into account, the DTM and JB2008 models only slightly underestimate during storms, by 2-4% maximum, which is an unexpectedly good performance for semi-empirical models. MSIS00 biases are somewhat higher and range from 2-8%. It is remarkable, and reassuring, that the MSIS, JB2008 and DTM models predict storm perturbations with comparably small biases, while they were fitted to different observations (very different for MSIS). That performance is reached on condition of debiasing the models before the storm onset, an option that is not available to all thermosphere model users, nor is it always possible. However, excepting the GOCE results, the StD <O/C> of all models ranges from 6-12%, i.e. the scatter is considerable and improvements are undeniably necessary. The best results, i.e. the smallest bias and standard deviation on average over all SP storms, over the entire 4-Phase storm period, are obtained with the DTM2020_Int and DTM2020_Res models, while the oldest model, MSIS00 (which unlike DTM and JB2008 has not benefitted from an overhaul of its geomagnetic model algorithm), is the least precise. However, MSIS00 is the least biased model for MP storms. The assessment clearly revealed that model precision (StD of O/C) decreases with altitude, but that bias (<O/C>) is independent of altitude, at least in the range of 250-550 km.
The updated metrics and the full list of SP and MP storms (Annexe 1) will eventually be implemented in CAMEL (Comprehensive Assessment of Models and Events based on Library tools) at NASA's Community Coordinated Modeling Center (CCMC). Then, assessment of not only the semi-empirical models tested in this study but also of many first-principles models, such as TIEGCM, WACCMX, GITM and WAM, will be possible. This study showed that the average model bias can hardly be improved upon, but perhaps the physical models will succeed in driving down the standard deviations (StD <O/C>). Presently, an incomplete test version is already available, with fewer storms and not all model output (https://webserver1.ccmc.nasa.gov/camel/NeutralDensity).
The final CAMEL version will also generate score cards per model, which will be based on selected reference storms and density data, so that model users will have access to objective and standardized model assessments.

*The precision is strongly dependent on the level of drag, i.e., solar cycle and altitude. It also applies to a resolution of a few thousand km vs. 600 km for the other satellites.

Figure captions


Figure 1. The four phases of the assessment interval for SP (top) and MP (bottom) storms, with t0 centered on the time of the first peak in ap with a minimum of 80.

Figure 2. <O/C> and StD <O/C> of phases 1-4 of the 108 SP storms, per satellite for the five models.

Figure 3. The arithmetic average StD of O/C of the 108 SP storms, per satellite for the five models.

Figure 4. <O/C> and StD <O/C> of phases 1-4 of the 44 MP storms, per satellite for the five models.

Figure 5. The arithmetic average StD of O/C of the 44 MP storms, per satellite for the five models.
Table 3 presents the average of the mean O/C (<O/C>) and the StD of all mean O/C (StD <O/C>) of Phase 1, i.e. the mean of the scale factors applied to the models and their StD. The scale factors for the DTM models are close to unity because they were fitted to most of these datasets. MSIS00 and JB2008 were not fitted to the TU Delft densities, and consequently their scale factors are considerable, which demonstrates that debiasing is essential for a fair comparison.

For users of thermosphere models, the main result of the assessment is the model performance over entire storms, i.e. from Phase 1 to Phase 4. The average of the 193 mean O/C (<O/C>) and the StD of all mean O/C (StD <O/C>) is displayed in Figure 2. Table 4 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all single-peak storms. The results, per model, are listed for all four phases (bold), Phase 2, Phase 3 and Phase 4 (from top to bottom). Results for Phase 1 are not pertinent due to the debiasing, and are not listed. The bold-faced numbers are displayed in Figure 2. The performance of DTM2020_Ops stands out for GFO, for which it is significantly more precise than DTM2020_Int and DTM2020_Res. The <O/C> per phase of DTM2020_Ops and DTM2020_Int do not present a specific tendency for Phases 2-4. MSIS00, on the other hand, clearly underestimates Phase 3 (recovery), and then overestimates in Phase 4 (post-storm). Except for GOCE, JB2008 underestimates Phase 2, the onset. DTM2020_Res presents an overestimation in Phase 4. The StD <O/C> is predominantly smallest during Phase 2 (onset) for all models. It is mostly largest in Phase 3 for MSIS00, and in Phase 4 for JB2008 and the DTM models.

Table 5 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all single-peak storms. Results are again listed for all four phases (bold), Phase 2, Phase 3 and Phase 4 (from top to bottom). The bold-faced average StD are displayed in Figure 3.
DTM2020_Res now clearly has the smallest StD and the highest correlations, also for GFO (not used in DTM), even if the DTM2020_Ops results are very close. The smallest StD of O/C is most often seen in Phase 4.

Table 7 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all multiple-peak storms. The bold-faced average StD are displayed in Figure 5. DTM2020_Res again clearly has the smallest StD and highest correlations. The smallest StD of O/C is, as for SP storms, most often found for Phase 2, while a clear tendency cannot be gleaned from Phases 3 and 4. For all models, the highest correlation is, as for SP storms, always seen in Phase 4; it is lowest in Phase 2 for CHAMP and GRACE, and mostly lowest in Phase 3 for Swarm-A.

Table 8 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all 152 storms (single and multiple-peak). In orbit prediction, the type of storm is not known beforehand, and these results are therefore the most comprehensive. Since Phases 2 and 3 do not represent exactly the same storm phases for SP and MP storms, only the assessments over the complete storms, i.e., all four phases, are listed. The results lie, of course, more or less in between the SP and MP results (closer to SP, because there are more SP than MP storms). DTM2020_Res and DTM2020_Int have the smallest biases, <O/C> of 2% maximum, and the smallest variations StD <O/C>. JB2008 has only slightly higher biases, 4% maximum, but its StD are up to twice as large. MSIS00 has the highest biases, but these are still only 2-8%.

Table 9 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all storms (single and multiple-peak). DTM2020_Res always has the smallest StD of O/C and the highest correlations, i.e., the storms are reproduced with the highest fidelity of the five models assessed in this study.

Table 1. Thermosphere and upper atmosphere models used in the assessment. The temporal resolution is listed in the last column.

Table 2. Datasets selected for the model assessment under storm conditions (i is the inclination, and LST is the local solar time coverage and the approximate period, in days, to cover 24 hr).

Table 3. The <O/C> and StD <O/C> for Phase 1, used for debiasing, per model for the five density datasets and all 152 storms (single and multiple-peak storms).

Table 4. The <O/C> and StD <O/C> per model for the five density datasets and the 108 single-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3 and Phase 4. Green indicates best performance.

Table 5. The average StD of the O/C and correlation coefficient per model for the five density datasets and the 108 single-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3 and Phase 4. Green indicates best performance.

Table 6. The <O/C> and StD <O/C> per model for the five density datasets and the 44 multiple-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3 and Phase 4. Green indicates best performance.

Table 7. The average StD of the O/C and correlation coefficient per model for the five density datasets and the 44 multiple-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3 and Phase 4. Green indicates best performance.

Table 8. The <O/C> and StD <O/C> per model for the five density datasets and all 152 storms (single and multiple-peak storms). Per model are given the results of all four phases. Green indicates best performance.

Table 9. The average StD of the O/C and correlation coefficient per model for the five density datasets and all 152 storms (single and multiple-peak storms). Per model are given the results of all four phases. Green indicates best performance.

Annexe 1. The 152 selected storm intervals from 2001 to 2023, storm class (cat: Single Peak or Multiple Peak), minimum Dst, maximum ap, and the satellite dataset (CH, GR, GO, SW, GFO = CHAMP, GRACE, GOCE, Swarm-A and GRACE-FO).