Validation of Ionospheric Specifications During Geomagnetic Storms: TEC and foF2 During the 2013 March Storm Event‐II

Assessing space weather modeling capability is a key element in improving existing models and developing new ones. In order to track improvement of the models and investigate impacts of forcing, from the lower atmosphere below and from the magnetosphere above, on the performance of ionosphere‐thermosphere models, we expand our previous assessment for 2013 March storm event (Shim et al., 2018, https://doi.org/10.1029/2018SW002034). In this study, we evaluate new simulations from upgraded models (the Coupled Thermosphere Ionosphere Plasmasphere Electrodynamics (CTIPe) model version 4.1 and the Global Ionosphere Thermosphere Model (GITM) version 21.11) and from the NCAR Whole Atmosphere Community Climate Model with thermosphere and ionosphere extension (WACCM‐X) version 2.2 including eight simulations in the previous study. A simulation from the NCAR Thermosphere‐Ionosphere‐Electrodynamics General Circulation Model version 2 (TIE‐GCM 2.0) is also included for comparison with WACCM‐X. TEC and foF2 changes from quiet‐time background are considered to evaluate the model performance on the storm impacts. For evaluation, we employ four skill scores: Correlation coefficient (CC), root‐mean square error (RMSE), ratio of the modeled to observed maximum percentage changes (Yield), and timing error (TE). It is found that the models tend to underestimate the storm‐time enhancements of foF2 (F2‐layer critical frequency) and TEC (Total Electron Content) and to predict foF2 and/or TEC better in North America but worse in the Southern Hemisphere. The ensemble simulation for TEC is comparable to results from a data assimilation model (Utah State University‐Global Assimilation of Ionospheric Measurements (USU‐GAIM)) with differences in skill score less than 3% and 6% for CC and RMSE, respectively.


Introduction
Variabilities of the Earth's ionosphere-thermosphere (IT) system, caused by charged particles and electromagnetic radiation emitted from the sun, can adversely affect our daily lives, which are highly dependent on space-based technological infrastructures such as Low-Earth Orbit (LEO) satellites and the Global Navigation Satellite System (GNSS). To mitigate harmful effects of space weather events, modeling plays a critical role in our quest to understand the connection between solar eruptive phenomena and their impacts in interplanetary space and near-Earth space environment. In particular, the Earth's upper atmosphere including the IT system is the space environment closest to human society. Thus, during the past few decades, first-principles physics-based (PB) IT models have been developed for specifications and forecasts of the near-Earth space environment. In addition, there have been recent developments of whole atmosphere models with a thermospheric and ionospheric extension to fully understand variabilities of the IT system by considering coupling between the IT system and the lower atmosphere (e.g., Akmaev, 2011;T. Fuller-Rowell et al., 2010;Jin et al., 2011;Liu et al., 2018).
For more accurate space weather forecasting, assessing space weather modeling capability is a key element to improve existing models and to develop new models. Over the last decade, in an effort to address the needs and challenges of the assessment of our current knowledge about space weather effects on the IT system and the current state of IT modeling capabilities, the NASA GSFC Community Coordinated Modeling Center (CCMC) has been supporting community-wide model validation projects, including Coupling, Energetics and Dynamics of Atmospheric Regions (CEDAR) (Shim et al., 2011(Shim et al., , 2012(Shim et al., , 2014 and Geospace Environment Modeling (GEM)-CEDAR modeling challenges (Rastäetter et al., 2016;Shim, Rastäetter, et al., 2017).
Furthermore, in 2018, the CCMC established an international effort, the "International Forum for Space Weather Modeling Capabilities Assessment," to evaluate and assess the predictive capabilities of space weather models (https://ccmc.gsfc.nasa.gov/iswat/IFSWCA/). As a result of this international effort, four ionosphere/thermosphere working groups were established with an overarching goal to devise a standardized quantitative validation procedure for IT models (Scherliess et al., 2019).
The working group, focusing on neutral density and orbit determination in LEO, reported their initial results for specific metrics for thermosphere model assessment over the selected three full years and two geomagnetic storms in 2005 (Bruinsma et al., 2018). They reported that the tested models in general performed reasonably well, although seasonal errors were sometimes observed and impulsive geomagnetic events remain a challenge. Kalafatoglu Eyiguler et al. (2019) compared the neutral density estimates from two empirical and three PB models with those obtained from the CHAMP satellite. They suggested that several metrics that provide different aspects of the errors should be considered together for a proper performance evaluation.
Another working group, the "Ionosphere Plasmasphere Density Working Team," performed the assessment of present modeling capabilities in predicting the ionospheric climatology of foF 2 and hmF2 for the entire year of 2012 . Tsagouri et al. (2018) identified a strong seasonal and local time dependence of the model performances, especially for PB models, which could provide useful insight for future model improvements. Tsagouri et al. (2018) cautioned that the quality of the ground truth data may play a key role in testing the model performance. Shim et al. (2018) assessed how well the ionospheric models predict storm time foF 2 and TEC by considering quantities, such as TEC and foF2 changes and percentage changes compared to quiet time background, at 12 selected midlatitude locations in the American and European-African longitude sectors. They found that the performance of the model varies with location, even within a localized region like Europe, as well as with the metrics considered.
In this paper, we expand our previous assessment of modeled foF2 and TEC during 2013 March storm event (17 March 2013)  to track improvement of the models and to investigate impacts of forcings from the lower atmosphere below and from the magnetosphere above on the performance of IT models. For this study, we evaluate the updated version of the coupled IT models available at the CCMC (Webb et al., 2009) since our previous study : Coupled Thermosphere Ionosphere Plasmasphere Electrodynamics (CTIPe) version 4.1 and Global Ionosphere Thermosphere Model (GITM) version 21.11. However, the other types of models such as empirical models, stand-alone ionospheric models, and data assimilation models are not included. In addition, for the first time, simulations from the NCAR Whole Atmosphere Community Climate Model with thermosphere and ionosphere extension (WACCM-X) 2.2 are included in our assessment. We also include a simulation from the NCAR TIE-GCM 2.0 to compare with results from WACCM-X 2.2. For TEC prediction, we compare a weighted mean of the ensemble of all 13 simulations (ensemble average), including eight simulations from our previous study with individual simulations to assess ensemble forecast capability. In Section 2, we briefly describe observations, models, and metrics used for this study. Section 3 presents the results of model-data comparisons and performance of the models are presented. Section 4 shows comparisons of ensemble of TEC predictions with the individual simulations based on the skill scores used in this study. In Section 5, we summarize and discuss our results. Finally, we conclude in Section 6.

Observations and Metrics
We use the foF2 and TEC measurements at 12 ionosonde stations selected in middle latitudes: eight northern hemisphere (NH) stations in the US (Millstone Hill, Idaho National Laboratory, Boulder, and Eglin AFB) and Europe (Chilton, Pruhonice, Ebre, and Athens) and 4 southern hemisphere (SH) stations in South America (SAM) (Port Stanley) and South Africa (SAF) (Louisvale, Hermanus, and Grahamstown) ( Figure 1 and Table 1 in Shim et al. (2018) for details). The foF2 and GNSS vertical TEC (vTEC) data are provided by the Global Ionosphere Radio Observatory (GIRO) (http://giro.uml.edu/) (Reinisch & Galkin, 2011) and by the MIT Haystack Observatory (http://cedar.openmadrigal.org/) (Rideout & Coster, 2006), respectively. Table 1 shows the quantities and skill scores calculated for the model-data comparison. To remove potential systematic uncertainties in the models and observations and baseline differences among the models and between models and observations, we use the shifted values and changes from their own quiet-time background values (e.g., shifted TEC (TEC*) = TEC (UT) on a particular DOY − minimum of 30-day median). Furthermore, using these quantities likely reduce the impacts of differing upper boundaries for TEC calculations, since the plasmaspheric TEC variations with geomagnetic activity are negligible in middle latitudes (Shim, Jee, & Scherliess, 2017).
To measure how well the observed and modeled values are linearly correlated (in phase) with each other and how different the values are on average over the time interval considered, Correlation coefficient (CC) and root-mean square error (RMSE) are calculated, respectively, for the error values below 95th percentile. We also calculate Yield and timing error (TE) to measure the models' capability to capture peak disturbances during the storm. For more detailed information on the quantities and skill scores used for the study, refer to Section 2 in Shim et al. (2018).

Models and Simulations
The simulations used in this study are obtained from the updated and newly incorporated coupled ionosphere-thermosphere models available at the CCMC (Webb et al., 2009) since our previous study (Shim  (Hurrell et al., 2013) into the thermosphere up to 500-700 km altitude. WACCM-X is uniquely capable of being run in a configuration where the atmosphere is coupled to active or prescribed ocean, sea ice, and land components, enabling studies of thermospheric and ionospheric weather and climate. WACCM-X version 2 is based upon WACCM version 6 (Gettelman et al., 2019) with a top boundary of ∼130 km, which is built upon the Community Atmosphere Model (CAM) version 6 having a top boundary of ∼40 km. WACCM-X 2.2 includes WACCM6 physics for middle atmosphere and lower thermosphere as well as CAM6 physics for the troposphere and the lower stratosphere, and it fully incorporates the electrodynamical processes related to low-to mid-latitude wind dynamo that is implemented in the NCAR TIE-GCM. For this study, two specified-dynamics (SD) WACCM-X 2.2 simulations with different high-latitude electrostatic potential models (Heelis et al., 1982;Weimer, 2005) are used. The SD simulations are carried out by constraining the model's lower atmospheric neutral dynamics using meteorological reanalysis data. The constraining process is achieved by nudging the model toward MERRA-2 (Modern Era Retrospective Analysis for Research and Applications, Version 2) data (Gelaro et al., 2017) below around the altitude of 50 km in a way presented by Brakebusch et al. (2013). SD-WACCM-X is nudged at every 5 min time step with horizontal winds, temperatures, and surface pressure from MERRA-2 data to prevent divergence from real dynamical conditions. Additionally, SD-WACCM-X is forced with surface wind stress and sensible as well as latent surface heat flux. As suggested by Brakebusch et al. (2013), the nudging coefficient is 0.01 s −1 below the altitude of 50 km, and linearly decreases and becomes zero above the altitude of 60 km.
The resulting WACCM-X simulations are compared with the simulations of TIE-GCM. The comparisons between WACCM-X and TIE-GCM simulations will show differences and similarities in modeling capabilities between whole atmosphere modeling and ionosphere-thermosphere modeling with a specified low-boundary forcing (e.g., Global Scale Wave Model (GSWM) (Hagan et al., 1999) used for this study). Table 2 shows the version of the models, input data used for the simulations, and models used for lower boundary forcing and high latitude electrodynamics. We utilized unique model setting identifiers to distinguish the current simulations from those used in our previous studies (Shim, Rastäetter, et al., 2017;Shim et al., 2011Shim et al., , 2012Shim et al., , 2014Shim et al., , 2018. Additional information for the models and model setting identifiers is available in Shim et al. (2011) (Refer to all references therein) and at https://ccmc.gsfc.nasa.gov/support/GEM_metrics_08/tags_list.php.
To investigate improvement in foF2 and TEC predictions of the updated versions of CTIPE (12_CTIPE) and GITM (7_GITM), the simulations of the old versions of the models (11_CTIPE and 6_GITM) from our previous study are included. The comparison will be focused on the comparison between the simulations obtained from the same model. As for TIE-GCM, 12_TIE-GCM (run at 2.5° resolution) is presented for this study, but the comparison between 11_TIE_GCM and 12_TIE-GCM was not included in this study because the only difference between the two is horizontal resolution (5°lat. × 5°long. vs. 2.5°lat. × 2.5°long.).
We should take note of the difference between the simulations obtained from the same model that influence foF2 and TEC responses to geomagnetic storms. For two CTIPe runs, different lower atmospheric tides were specified: 11_CTIPE was driven by the imposed migrating semidiurnal (2, 2), (2, 3), (2, 4), (2, 5), and diurnal (1, 1) tidal modes, while 12_CTIPE was run with monthly mean spectrum of tides obtained from WAM (Whole Atmosphere Model) (Akmaev, 2011;T. Fuller-Rowell et al., 2010). For two GITM simulations, 7_GITM used the T. J. Fuller-Rowell and Evans (1987) model, while 6_GITM used the Ovation model (Newell & Gjerloev, 2011;Newell et al., 2009) for specifying the patterns of auroral precipitation average energy and total energy flux. For energy deposition from energetic particle precipitation (EPP) into the atmosphere, results of Fang et al. (2010) and Sharber et al. (1996) were used for 7_GITM and 6_GITM, respectively. For two WACCM-X simulations, Heelis et al. (1982) and Weimer (2005) electric potential models were used for 3_WACCM-X and 4_WACCM-X, respectively. 12_TIE-GCM was driven by Weimer-2005 electric potential model and GSWM.

Performance of the Models in Predictions of foF2 and vTEC on 17 March 2013
Most simulations newly added for this study show similar behavior to those used in Shim et al. (2018), in predicting foF2 and TEC during the storm. For example, the simulations are not able to reproduce (a) the difference between eastern and western parts of the North American sector (e.g., TEC increases at Millstone Hill but decreases at Idaho and Boulder around 20 UT), and (b) different responses between foF2 (negligible changes) and TEC (noticeable increase) found in European (Chilton) and South-African (Grahamstown) stations (See Figure 4 of Shim et al. (2018) for reference). However, compared to other simulations, 4_WACCM-X driven by Weimer-2005 high latitude electric potential model captures relatively well the two differences in TEC and foF2 described above ( Figure S1).
Scatter plots of the observed (x axis) and modeled (y axis) shifted foF2 and TEC, and percentage change of foF2 and TEC during the storm (17 March 2013) are shown in Figure 1 for CTIPe, in Figure 2 for GITM, and in Figure 3 for TIE-GCM and WACCM-X. Figures 1-3 display the values of all 12 locations grouped into 4 sectors: North America (NA, green), Europe (EU, blue), SAF (red), and SAM (black). The modeled foF2 was calculated from the maximum electron density of the F2 layer, NmF2, by using the relation, NmF2 = 1.24 × 10 10 × (foF2) 2 , where NmF2 is in electrons/m 3 and foF2 is in MHz. First, the qualitative comparison between the simulations from the same model can be summarized as follows. 11_CTIPE/12_CTIPE tends to underestimate/overestimate foF2 for both quiet and disturbed conditions, but 12_CTIPE predicts much better both foF2 and TEC during the storm than 11_CTIPE (Figure 1). 6_GITM and 7_GITM underestimate foF2 and TEC for all cases and show relatively small response to the storm compared to the other simulations ( Figure 2). 12_TIE-GCM and WACCM-Xs produce similar foF2 and TEC changes during the storm. All three simulations give substantial underestimation of TEC in SAF. 12_TIE-GCM and 3_WACCM-X produce larger overestimation of foF2 and TEC in the NA Heelis high latitude electric potential (Heelis et al., 1982), Roble and Ridley (1987) auroral precipitation ∼600 km, 1.9° lat. × 2.5° long.
4_WACCM-X Weimer-2005 high latitude electric potential, Roble and Ridley (1987) auroral precipitation a The model results are submitted by the Community Coordinated Modeling Center (CCMC) using the models hosted at the CCMC.
sector than 4_WACCM-X. 4_WACCM-X shows substantial improvement in the TEC overestimation in NA. 3_WACCM-X, of which the high latitude electric potential is specified by Heelis et al. (1982), tends to overestimate foF2 and TEC compared with 4_WACCM-X ( Figure 3). 3_WACCM-X and 4_WACCM-X produce better quiet time foF2 and TEC than 12_TIE-GCM does and capture wave-like small increases in foF2 and TEC at Idaho National Lab around 10-11UT (2-3 LT) ( Figure S1).  As shown for 6_GITM and 11_CTIPE in Shim et al. (2018), the modeled foF2 values from 7_GITM and 12_ CTIPE better agree with the observed ones when they are shifted by subtracting the minimum of the 30-day median (see Figure S2 in Supporting Information in Shim et al. (2018)). Most foF2 and TEC data points from 7_GITM and 12_CTIPE before shifting are below and above the line with slope 1 (black solid line), respectively. This indicates that 7_GITM underestimates foF2 and TEC like 6_GITM, while 12_CTIPE overestimates them. The models that tend to underestimate foF2, such as 6_GITM, 7_GITM, and 11_CTIPE, seem to be unable to produce foF2* larger than about 7 MHz, and underestimate TEC* being less than about 20 TECU during the storm as reported in Shim et al. (2018). 12_TIE-GCM and WACCM-Xs show similar distribution of the data points after shifting foF2 and TEC with a tendency to underestimate foF2 and TEC in the SAF region. This shifting procedure by the minimum of the 30-day median (i.e., quiet-time minimum) for each model simulation and observation should effectively remove any differences among the models and observations that may be associated with potential biases of the models and observations. Note that this comparative study focuses on the storm-time variations of the models from their quiet-time values.  From now on, foF2 and TEC will represent shifted foF2 (foF2*) and shifted TEC (TEC*), respectively.

Correlation Coefficient (CC)
We first calculate CC between the modeled and observed foF2 and TEC for DOY 076 (17 March 2013) for quantitative assessment of the model performance of TEC and foF2 predictions. In Figure 4,  Close inspection of Figures 1 and 4 indicates that a linearity between CTIPEs and observations is improved in the newer version of CTIPE (12_CTIPE), but 12_CTIPE gives more scattered distribution around a linear relation (Figure 1), which seems to lead to the lower CC in 12_CTIPE than in 11_CTIPE. 7_GITM exhibits a slight improvement in a linearity between the model and observations (Figure 2), but this improvement is not clearly seen in the correlation analysis ( Figure 4). For 12_TIE-GCM and WACCM-Xs, both a linearity between the models and observations ( Figure 3) and CCs (Figure 4) demonstrate that the model performances are overall improved in WACCM-Xs compared with TIE-GCM. In terms of the model-observation linearity, 4_WACCM-X is somewhat better than 3_WACCM-X (Figure 3), but their CCs seems comparable to each other ( Figure 4). Figure 5 shows RMSE of foF2 and dfoF2 in the left panel, and TEC and dTEC in the right panel. For foF2 (blue) and dfoF2 (green) predictions, based on the average RMSE values, the RMSEs from the updated version (12_CTIPE and 7_GITM) are about 1.5 MHz for foF2 and about 1 MHz for dfof2, and they are slightly lower than RMSEs in their old versions. 12_CTIPE shows improvement in foF2 in SH and dfoF2 in NA and EU compared to 11_CTIPE. 7_GITM performs better in foF2 and dfoF2 in EU and SH than 6_GITM. 4_WACCM-X has smaller RMSE (∼1 MHz) than 3_WACCM-X and 12_TIE-GCM (∼1.3 MHz for dfoF2 and ∼2 MHz for foF2).

Root Mean Square Error (RMSE)
12_CTIPE is better in TEC prediction than 11_CTIPE, while the opposite holds true for dTEC prediction. The two GITMs' average RMSE values for TEC and dTEC predictions are similar to each other, about 9 TECU for TEC and 5 TECU for dTEC. Like foF2 and dfoF2 prediction, 4_WACCM-X has smaller RMSE (∼5 TECU for TEC and 4 TECU for dTEC) than 12_TIE-GCM and 3_WACCM-X (∼6 TECU).
As seen in Shim et al. (2018), RMSE is highly variable with location. Most simulations appear to predict foF2 and/or TEC better in NA and worse in SH (except for 12_TIE-GCM for foF2 and 12_CTIPE for TEC). This hemispheric asymmetry in the performance of the models may readily be expected from the fact that the ionospheric density structures in SH are typically more complex and therefore relatively less understood compared with the density structures in NH, mainly due to more complex structure of the geomagnetic field, for example, larger declination and larger offset between geographic and magnetic poles in SH (e.g., Jee et al., 2009;Laundal et al., 2017;Kim et al., 2023) and resulting hemispheric asymmetry in thermospheric O/N 2 ratio (Qian et al., 2022). Shim et al. (2018) also suggested that this hemispheric asymmetry is possibly partly attributed to the fact that the models do not include the energy input from the inner magnetosphere that affects the ionosphere (e.g., foF2 and TEC enhancements) in the South Atlantic Anomaly (SAA) region (Dmitriev et al., 2017;Zhao et al., 2016) where the 4 stations in SH are situated nearby. Both 11_CTIPE and GITMs tend to perform better in NA for dTEC, while WACCM-Xs show the opposite tendency for dfoF2 and dTEC. 7_GITM and 4_WACCM-X show the least RMSE dependence on location for dfoF2 and for dTEC, respectively, among seven simulations.

Yield and Timing Error (TE)
To measure how well the models capture the degree of TEC and foF2 disturbances during the main phase, Yield   (foF2)

Ensemble of TEC Obtained From 13 Simulations
The linearity check, RMSE, and CC between model results and observations for shifted foF2 and TEC and their relative changes indicate that the newer versions of the models (i.e., 12_CTIPE, 7_GITM, and 4_WACCM-X)  produce the better results. From the viewpoints of correct prediction of storm phases (Table 3), Yields, and TEs (Figure 7), however, there is no one best simulation for all locations, and the performance of the models varies with location as well as the Yields and TE.
The differences in performance among the simulations could be caused by inherent differences among the models, for example, different methods to solve for chemistry and advection, and different ways to treat eddy diffusion and vertical transport (T. J. Fuller-Rowell et al., 1996;Liu et al., 2018;Perlongo et al., 2018;Ridley et al., 2006;Solomon et al., 2012), or by a combination of different input data and different models used for lower boundary forcing and high-latitude electrodynamics . Even different data assimilation models for the same weather condition can yield different results, due to numerous reasons (e.g., the use of different background weather models, spatial/temporal resolutions, assimilation methods, and data error analyses), even if the same data are assimilated (Schunk et al., 2021). The common way to handle these differences is to use model ensembles and the use of ensembles enables estimations of the certainty of results. Thus, we used a weighted mean of the ensemble of all 13 simulations including eight simulations from our previous study  for TEC, dTEC and dTEC[%] to compare the ensemble average with the individual simulations. To get the weighted mean , we used the RMSE of shifted TEC ( = 1∕RMSE) . Figure 8 is the same as Figure 1 but for the ensemble of the simulations (ENSEMBLE will be used as model setting ID) and a simulation (1_USU-GAIM) from a data assimilation model (DA), Utah State University-Global Assimilation of Ionospheric Measurements (USU-GAIM). For TEC less than about 20 TECU, ENSEMBLE shows better agreement with GPS TEC than the individual simulations, including 1_USU-GAIM. However, as we can expect, ENSEMBLE underestimates TEC larger than about 30 TECU due to the tendency to underestimate TEC of many simulations as pointed out in Section 3 and Shim et al. (2018). For dTEC [%], ENSEMBLE appears to be correlated better with GPS dTEC[%] than the other simulations, although there are some underestimations in SAF, as well as in SAM with opposite prediction of the storm phase.  Table 2, such as 4_IRI, 1_IFM, 1_SAMI3, are presented in Table 2 in Shim et al. (2018). The simulations in Figure 9a were arranged by the average of the three averaged CC values for TEC, dTEC and dTEC[%] from the smallest to the largest (closer to 1). In Figure 9b, the simulations were arranged by the average of the two averaged RMSEs for TEC and dTEC from the largest to the smallest. Based on the averaged CC and RMSE, ENSEMBLEs (ENSEMBLE and ENSEMBLE_wo_ DA) of the simulations perform very similarly and outperform all 12 simulations but a data assimilation model, 1_USU-GAIM, which assimilated GNSS TEC data and shows the best performance for TEC prediction in most cases with the least location dependence of RMSE in our former study . However, ENSEM-BLEs and 1_USU-GAIM do not show big difference in their performance. The differences in RMSE of TEC and dTEC between ENSEMBLE and 1_USU-GAIM are less than 0.5 and 0.1 TECU, respectively. For dTEC[%], ENSEMBLE performs slightly better than 1_USU-GAIM with about 1.5% lower RMSE. The fact that ENSEM-BLEs are comparable to the data assimilation model 1_USU-GAIM indicates that the multi-model ensemble can be useful in forecasting the IT system, although this result is obtained from a single geomagnetic storm event.  coupled models in terms of Yield and TE, although the difference is small. ENSEMBLE underestimates Yield, while most of the simulations overestimate it, except 4_IRI and 11_CTIPE. 7 simulations from PB coupled IT models and 1_USU-GAIM produce Yield closer to 1 than ENSEMBLE does.
Timing Error of dTEC [%] from ENSEMBLE is about 1 hr, which is slightly larger than TE from 4 simulations from CTIPE and WACCM-X, but the difference from the smallest TE is less than 0.5 hr. Regarding the averaged skill scores for all 12 locations, the five newly added simulations in this study produce comparable TEC and TEC changes to the simulations from PB IT models used in our previous study. The simulations of newer versions of the models (12_CTIPE, 7_GITM and 4_WACCM-X) are found to give overall improved forecast results. Based on the average RMSE, the ensemble of simulations of the models' newer versions is comparable to 1_USU-GAIM and performs better than the ensemble of the simulations of older versions of the models (11_CTIPE, 6_GITM and 12_TIE-GCM) ( Table 4).

Summary and Discussion
We expanded on our previous systematic assessment of modeled foF2 and TEC during the 2013 March storm event (17 March 2013) to track the improvement of the models and investigate impacts of forcings from the lower atmosphere and the magnetosphere, on the performance of ionosphere-thermosphere coupled models.
We evaluated simulations from upgraded models (CTIPe4.1 and GITM21.11) since our previous assessment and a whole atmosphere model (WACCM-X2.2). To compare with results from WACCM-X2.2, we also included a simulation of TIE-GCM2.0, of which the electrodynamic processes are implemented in WACCM-X 2.2. Furthermore, to evaluate TEC prediction of the simulations, we used a weighted mean of the ensemble of all 13 simulations including eight simulations from our previous study to compare the ensemble average with the individual simulations.
For evaluation of the simulations, we used the exact same procedure with the same data set, same physical quantities, and same skill scores as our previous study . The skill scores were calculated for the three sectors, EU (Europe), NA, and SH to investigate the longitudinal and hemispheric dependence of the performance of the models.
From the five simulations used in the study, we also found the general behaviors of most simulations identified in Shim et al. (2018): (a) tendency to underestimate storm-time enhancements of foF2 and TEC and not to reproduce large enhancements of dTEC[%] (e.g., about 200% TEC increase at Port Stanley in the SAA region), (b) being unable to capture opposite responses to the storm in the eastern and western parts of NA, especially the negative phase (except for GITM), which is what in part causes lower CC in NA, (c) tendency to predict foF2 and/or TEC better in NA and worse in SH with respect to RMSE. However, it was found that 12_TIE-GCM and WACCM-Xs better produce the large TEC percentage changes at Port Stanley in SAM. Based on the averaged skill scores for all 12 locations, the five simulations used in this study show skill scores better or comparable to those of the simulations from PB IT models used in our previous study.
Compared to 11_CTIPE (obtained from CTIPe3.2), 12_CTIPE (from CTIPe4.1) driven by tides from WAM tends to overestimate foF2 and TEC for both quiet and disturbed conditions and predicts better TEC peaks during the storm. For more cases, 12_CTIPE performs largely better than 11_CTIPE based on the average scores. 12_CTIPE predicts the storm phase better for dTEC [%] Comparing the two WACCM-Xs and 12_TIE-GCM, the two WACCM-Xs, 3_WACCM-X with Heelis high latitude electric potential model and 4_WACCM-X with Weimer-2005, predict quiet time foF2 and TEC better than 12_TIE-GCM. During the storm, 12_TIE-GCM and 4_WACCM-X produce similar foF2 and TEC in the NA sector, while 3_WACCM-X tends to overestimate these variables, producing larger changes in foF2 and TEC. In most cases, the WACCM-Xs and 12_TIE_GCM perform similarly in terms of average values of skill scores, but 3_WACCM-X and/or 4_WACCM-X perform better than 12_TIE-GCM except for Yield of percentage changes. 4_WACCM-X slightly outperforms 3_WACCM-X for all cases but not for TE for percentage changes.
Our findings suggest that the newer versions of the models (12_CTIPE, 7_GITM, and 4_WACCM-X) with Weimer-2005 electric potential model give overall improved forecast, and the performance of the models depends on forcing from the magnetosphere and also forcing from the lower atmosphere even during storms. Differences in upward-propagating tides generate differences in foF2/TEC responses to the storm by E-region wind dynamo and tidal mixing effects (Yamazaki and Richmond, 2013). The tidal differences between the two CTIPe simulations produce differences in O/N 2 column density ratio (not shown here), and better prediction of TEC peaks of 12_CTIPE with the tendency of overestimation during the storm is possibly caused by larger O/N 2 ratio. The differences in the performance between the two GITM simulations and between the two WACCM-X simulations may partially be caused by different O/N 2 ratios affected by different auroral particle heating and Joule heating that cause expansion of the upper atmosphere and the resulting thermospheric composition changes (Richmond, 2021 and references therein). Furthermore, the disturbed neutral composition in the high-latitude region is transferred to the lower latitude region by the disturbed vertical wind and equatorward thermospheric circulation. The investigation of the actual causes of the differences in the simulations will require systematic modeling studies, which are beyond the scope of this paper.
For TEC, dTEC and dTEC[%], our results indicate that the ensemble of all 13 simulations (ENSEMBLE), including eight simulations from our previous study  is comparable to the data assimilation model (1_USU-GAIM) with differences in skill score less than 3% and 6% for CC and RMSE, respectively. However, ENSEMBLE underestimates Yield (0.73) while 7 simulations from PB coupled IT models and 1_USU-GAIM produce Yield closer to 1. Timing Error of dTEC[%] from ENSEMBLE is about 1 hr, but the difference from the smallest TE of the simulations is less than 0.5 hr. In addition, based on RMSE, the ensemble of the newer versions of the models (12_CTIPE, 7_GITM and 4_WACCM-X) is comparable to 1_USU-GAIM.
To advance our understanding of the ionosphere-thermosphere system requires significant efforts to improve the capability of numerical models along with expanding the scope of observations (Heelis & Maute, 2020).
There have been recent new developments of theoretical models, including AMGeO (Assimilative Mapping of Geospace Observations) for High-Latitude Ionospheric Electrodynamics (Matsuo, 2020) and MAGE geospace model that couples the Grid Agnostic MHD for Extended Research Applications (GAMERA) global MHD model of the magnetosphere (Sorathia et al., 2020;Zhang et al., 2019), the Rice Convection Model (RCM) model of the ring current (Toffoletto et al., 2003), TIE-GCM of the upper atmosphere and the RE-developed Magnetosphere-Ionosphere Coupler/Solver (REMIX) (Merkin & Lyon, 2010). These models will be available soon to the public through CCMC, and then the modeling capability will help us better understand the processes responsible for the observed characteristics and features during disturbed conditions. In addition, CCMC will also provide users with the capability to run PB IT models with various combination of models for lower atmospheric forcing and for magnetosphere forcing, which enable us to research further the impacts of the forcings on the IT system.
The findings of this study will provide a baseline for future validation studies using new models and improved models, along with earlier results (Shim, Rastäetter, et al., 2017;Shim et al., 2011Shim et al., , 2012Shim et al., , 2014Shim et al., , 2018 obtained through CEDAR ETI, GEM-CEDAR Modeling Challenges, and the international effort, "International Forum for Space Weather Modeling Capabilities Assessment." We will extend our study to include more geomagnetic storm events and also geomagnetically quiet times to investigate differences and similarities in the performance of the models. In addition, we will also include foF2 and TEC predictions for the high-and low-latitude regions.

Conclusion
As an expansion of the model assessment study for 2013 March storm event , new simulations from the upgraded models including CTIPe model version 4.1, GITM version 21.11, WACCM-X version 2.2, and TIE-GCM 2.0 were evaluated to track the status of model improvement and to investigate the impacts of lower atmospheric and magnetospheric forcings on the performance of the ionosphere-thermosphere models. Here are the main results of the study.
• Model simulations tend to underestimate the storm-time enhancements of foF2 and TEC and to predict them better in the NH (specifically in the NA) but worse in the SH. It seems to be associated with more complex structure of the geomagnetic field in the SH such as larger declination and offset between geographic and magnetic poles. Furthermore, the models do not include the energy input from the inner magnetosphere that affects the ionosphere (e.g., foF2 and TEC enhancements) in the South Atlantic Anomaly (SAA) region. • The performance of the models is strongly dependent on forcings from the magnetosphere and also from the lower atmosphere even during storms. The newer versions of the models (12_CTIPE, 7_GITM, and 4_WACCM-X) with Weimer-2005 electric potential model provide overall improved forecast.
• Ensemble of all simulations for TEC is comparable to the data assimilation model (USU-GAIM) that showed best performance for TEC prediction in most cases, by assimilating GNSS TEC data, in our former study . • The performance of the models substantially varies with the quantity and location considered, and the type of metrics used. • New developments of theoretical models have recently been performed to improve the capability of numerical models along with expanding the scope of observations, including AMGeO for high-latitude ionospheric electrodynamics and MAGE geospace model, which will be available soon to the public through CCMC. • Results of this study will provide a baseline for future validation studies using new/improved models.