Application of the Bagged Trees Technique on Retrieving the Nighttime Ionospheric Peak Density From OI‐135.6 nm Airglow

The NASA Global-scale Observations of the Limb and Disk (GOLD) mission has scanned the far-ultraviolet airglow at ∼134–162 nm over the American Hemisphere since October 2018. The FORMOSAT-7/COSMIC-2 (F7/C2) satellite mission has provided thousands of daily radio occultation soundings in the low- and mid-latitude regions since July 2019. The nighttime OI-135.6 nm emission arises mainly from radiative recombination, and its radiance is used to derive the peak electron density. Comparison with corresponding F7/C2 observations demonstrates good correlation at low latitudes, whereas the GOLD peak density is overestimated near mid-latitudes in winter because of photoelectrons emanating from the magnetically conjugate Hemisphere. The machine learning technique Bagged Trees is used to develop an intensity-to-peak-density model trained on GOLD and F7/C2 observations. The validation demonstrates that the Bagged Trees peak density is less influenced by conjugate photoelectrons and indicates the power of machine learning techniques for geophysical data processing.


Introduction
The Global-scale Observations of the Limb and Disk (GOLD) mission, hosted onboard the SES-14 communication satellite, was launched on 25 January 2018 into a geostationary orbit over 47.5°W. Its ultraviolet spectroscopy instrument conducts day and night airglow observations of the ionosphere over the Americas and the Atlantic region, emphasizing the equatorial ionization anomaly (EIA) and equatorial plasma bubble imaging (Eastes et al., 2017, 2019, 2020) as well as bubble zonal drift velocity (Karan et al., 2020). Cai et al. (2020) reveal that the GOLD 135.6 nm radiance is similar to total electron content (TEC) in morphology. The major sources of OI-135.6 nm nightglow emission are radiative recombination and ion-ion mutual neutralization, so the measured intensity can be related to the square of the electron density (Hanson, 1969; Meléndez-Alvira et al., 1999; Tinsley & Bittencourt, 1975). Based on this relationship, the GOLD 135.6 nm intensity is used to derive the peak electron density (NmF2).

The FORMOSAT-7/COSMIC-2 (Constellation Observing System for Meteorology, Ionosphere, and Climate-2) (F7/C2) mission, comprising six low-Earth-orbit (LEO) satellites that provide radio occultation (RO) observations, was launched in June 2019. Because the F7/C2 satellites orbit at an altitude of 550 km with an inclination of 24°, the mission provides daily atmospheric and ionospheric RO observations in the low- and mid-latitude regions. In addition to Global Positioning System signals, the satellites receive GLONASS (GLObal NAvigation Satellite System) signals, so F7/C2 provides about twice as many RO soundings as FORMOSAT-3/COSMIC (F3/C).

Figure 1 illustrates the concurrent observations of GOLD OI-135.6 nm intensity and F7/C2 RO NmF2 at 22 UT on 15 June 2020. The GOLD OI-135.6 nm intensity provides peak electron density information over a wide region, but only through a retrieval algorithm that rests on simplifying assumptions. The F7/C2 RO soundings, on the other hand, provide accurate measurements of the ionospheric peak density, but their coverage within the selected period is very sparse. The motivation of this study is to compare the GOLD NmF2 with the F7/C2 NmF2 and to use a machine learning technique to develop a model that transforms the nighttime airglow intensity into peak electron density based on the GOLD and F7/C2 observations.

Data and Methodology
This study uses the GOLD NmF2 and the corresponding OI-135.6 nm intensity. The GOLD NmF2, from the level 2 product NMAX (GOLD Mission, 2019), is derived from the OI-135.6 nm intensity scans (available as level 1c NI1 night disk scan measurements) over both Hemispheres during 17:00 to 21:00 local time each night. The OI-135.6 nm nightglow emission has two major sources: radiative recombination and ion-ion mutual neutralization. The NmF2 is retrieved from the OI-135.6 nm intensity using a simplified algorithm that ignores ion-ion mutual neutralization and multiple scattering effects, assumes that the electron and O+ densities are identical, and assumes that a Chapman layer vertical profile describes their distributions. Under these assumptions, one can obtain a formula that directly relates the NmF2 to the measured 135.6 nm intensity (I135.6) through three quantities: α135.6, the radiative recombination rate; H, the scale height of the Chapman function; and e, the base of the natural logarithm. The 133-137 nm bandpass is used to derive the GOLD OI-135.6 nm intensity. The value of α135.6, 7.3 × 10⁻¹³ cm³ s⁻¹, is adopted from Meléndez-Alvira et al. (1999). (For more information, see the GOLD Science Data Products Guide-Rev 4.4, https://gold.cs.ucf.edu/wp-content/documentation/GOLD_Public_Science_Data_Products_Guide_Rev4.4.pdf).
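The form of such a relation can be recovered from the stated assumptions. The derivation below is only a sketch, not necessarily the exact expression used in the GOLD NMAX product: it assumes nadir viewing of an optically thin emission, a Chapman α-layer, an intensity expressed in Rayleighs (hence the factor of 10⁶), and H in centimeters. The peak height z_m and the reduced height x are symbols introduced here for illustration.

```latex
\begin{align}
N_e(z) &= N_mF_2 \,\exp\!\left\{\tfrac{1}{2}\left[\,1 - x - e^{-x}\right]\right\},
  \qquad x = \frac{z - z_m}{H}, \\
I_{135.6} &= 10^{-6}\,\alpha_{135.6}\int_{-\infty}^{\infty} N_e^{2}(z)\,dz
  = 10^{-6}\,\alpha_{135.6}\,(N_mF_2)^{2}\,e\,H
    \int_{-\infty}^{\infty} e^{-x}\,e^{-e^{-x}}\,dx
  = 10^{-6}\,\alpha_{135.6}\,e\,H\,(N_mF_2)^{2}, \\
N_mF_2 &= \sqrt{\frac{10^{6}\,I_{135.6}}{\alpha_{135.6}\,e\,H}} .
\end{align}
```

The remaining integral equals one (substitute u = e^{-x}), which is why e and H appear only as a product. Because NmF2 scales with the square root of the intensity, a fixed additive contamination of the intensity produces the largest relative error in NmF2 where the airglow is faint.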
In the RO sounding technique, the LEO satellite (e.g., F3/C, F7/C2) receives radio signals from GNSS (Global Navigation Satellite System) satellites, making limb soundings of the ionosphere and thus allowing estimation of the line-of-sight TEC. The RO electron density profile is inverted from the calibrated TEC profile by using the Abel inversion. Note that the calibrated TEC is the total number of electrons along the line of sight between the LEO and GNSS satellites below the LEO satellite orbit; it can be calculated from dual-frequency GNSS radio signals. However, Liu et al. (2010) demonstrate that the RO electron density has a significant error below 200 km altitude due to the assumption of spherical symmetry in the Abel inversion. Fortunately, the influence on the NmF2 is [...]

The decision tree (Breiman et al., 1984) is a tree-like model of decisions that is frequently used in machine learning for classification and regression. The hierarchical tree structure is composed of many connected nodes; each node can connect to one or more child nodes but has only one parent node. The root node is the first node in the tree structure: it sits at the top and has no parent. The leaf nodes are the bottom nodes of the tree and have no child nodes. The benefit of the decision tree is that it considers all possible results of a decision and traces each path to a conclusion. Decision trees support nonlinearity and require less effort for data preparation during pre-processing. However, an individual decision tree tends to overfit. The Bagged (Bootstrap-aggregated) Tree combines many decision trees to reduce the effects of overfitting and improve generalization (Loh, 2002; Loh & Shih, 1997; Meinshausen, 2006). Averaging over many trees reduces the variance of the prediction, which stabilizes the model and mitigates overfitting. Figure 2 displays the flow chart of the bagged tree used in the study. The Bootstrap method is used to sample data sets 1-30 and train the corresponding decision trees 1-30. The Bag therefore contains 30 trained decision trees, and the model result is the average of the outputs from all the trees. In the trained model, every node (except the leaf [...]
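The bootstrap-and-average procedure can be illustrated with a minimal pure-Python sketch. It is not the MATLAB model used in the study: the single input feature, the synthetic square-root data, and the helper names (`build_tree`, `fit_bagged_trees`, `predict_bagged`) and tree settings (`max_depth`, `min_leaf`) are all assumptions chosen for illustration. Only the count of 30 trees and the averaging of their outputs follow the text.

```python
import random


def avg(values):
    return sum(values) / len(values)


def build_tree(rows, depth=0, max_depth=4, min_leaf=2):
    """Grow a regression tree on (x, y) pairs.

    A leaf is a float (mean target); an internal node is a tuple
    (threshold, left_subtree, right_subtree).
    """
    rows = sorted(rows)
    ys = [y for _, y in rows]
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return avg(ys)
    best = None  # (sum of squared errors, split index)
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i][0] == rows[i - 1][0]:
            continue  # cannot split between identical x values
        left, right = ys[:i], ys[i:]
        lm, rm = avg(left), avg(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return avg(ys)
    i = best[1]
    threshold = (rows[i - 1][0] + rows[i][0]) / 2
    return (threshold,
            build_tree(rows[:i], depth + 1, max_depth, min_leaf),
            build_tree(rows[i:], depth + 1, max_depth, min_leaf))


def predict_tree(tree, x):
    # Walk from the root to a leaf, following the split thresholds.
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def fit_bagged_trees(rows, n_trees=30, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) samples with replacement.
        sample = [rng.choice(rows) for _ in rows]
        trees.append(build_tree(sample))
    return trees


def predict_bagged(trees, x):
    # The bagged prediction is the average of all tree outputs.
    return avg([predict_tree(t, x) for t in trees])


# Synthetic toy data (an assumption): y = sqrt(x) plus Gaussian noise.
data_rng = random.Random(1)
rows = []
for _ in range(100):
    x = data_rng.uniform(0.0, 9.0)
    rows.append((x, x ** 0.5 + data_rng.gauss(0.0, 0.1)))

trees = fit_bagged_trees(rows, n_trees=30)
prediction_at_4 = predict_bagged(trees, 4.0)
```

Each tree sees a different resampled data set, so their individual errors partly cancel in the average; this is the variance reduction that the bagging step relies on.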

Comparison of F7/C2 NmF2 and GOLD NmF2
In this study, we accumulate F7/C2 RO NmF2 and GOLD NmF2 observations from 1 August 2019 to 30 October 2022. The observation area of F7/C2 RO is within 45° latitude due to the 24° inclination angle. Figure 4 displays the co-located NmF2 distributions of GOLD and F7/C2 observations over the 12 months of 2020. Co-located F7/C2 and GOLD data are defined as follows: the GOLD observation must fall within 30 min before or after the F7/C2 observation time, and it must be the GOLD observation closest to the F7/C2 location, with a separation of no more than two degrees. The distributions demonstrate that the low-latitude structures observed by GOLD and F7/C2 are similar. However, the GOLD NmF2 is significantly larger than the F7/C2 NmF2 south of 15°S, especially in May, June, and July.

Figure 5 displays the GOLD OI-135.6 nm intensity, GOLD NmF2, and F7/C2 NmF2 at 22 UT in June and December 2020. Note that the area of F7/C2 NmF2 is limited to the FOV (field of view) of GOLD NmF2. The GOLD intensity distributions show slightly enhanced intensity in the northern and southern mid- and high-latitude regions in December and June, respectively. This might be due to the broad regions of faint airglow excited by photoelectrons on the night side of the terminator in the winter Hemisphere (the Southern in June and the Northern in December) (Kil et al., 2020; Solomon et al., 2020). Solomon et al. (2020) indicate that photoelectrons are generated in magnetically conjugate areas of the other Hemisphere that are still illuminated, are transported along field lines, and then precipitate into the atmosphere. This slight enhancement in intensity becomes non-negligible after inversion to peak density, as shown in the GOLD NmF2 maps. However, the F7/C2 NmF2 maps do not show any enhancement in the corresponding latitude regions. The F7/C2 electron density profile is inverted from the calibrated TEC profile by using the Abel inversion, and the TEC calculated from dual-frequency GNSS signals is affected only by the electrons along the line of sight between the F7/C2 and GNSS satellites. By contrast, conjugate photoelectrons of sufficient energy can cause electron impact excitation of atomic oxygen at F-region altitudes (e.g., Duboin et al., 1968; Meier, 1971), which could enhance the 135.6 nm intensity beyond what any increase in NmF2 would produce.
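The co-location rule described above (a 30 min time window, nearest GOLD observation, separation of at most two degrees) can be sketched as follows. This is an illustrative reading rather than the authors' code: the dictionary keys, the function names, and the use of great-circle separation as the distance measure are assumptions, since the text does not say how the two-degree limit is computed.

```python
import math


def angular_separation_deg(lat1, lon1, lat2, lon2):
    """Great-circle separation in degrees (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def colocate(ro, gold_pixels, max_minutes=30.0, max_deg=2.0):
    """Return the GOLD pixel nearest to one RO sounding, or None.

    `ro` and every GOLD pixel are dicts with hypothetical keys:
    't_min' (minutes since a common epoch), 'lat', 'lon' (degrees).
    """
    best, best_sep = None, max_deg
    for px in gold_pixels:
        if abs(px["t_min"] - ro["t_min"]) > max_minutes:
            continue  # outside the +/- 30 min window
        sep = angular_separation_deg(ro["lat"], ro["lon"], px["lat"], px["lon"])
        if sep <= best_sep:  # keep the closest pixel within the limit
            best, best_sep = px, sep
    return best


ro_sounding = {"t_min": 0.0, "lat": -10.0, "lon": -40.0}
pixels = [
    {"t_min": 10.0, "lat": -10.5, "lon": -40.5, "name": "near"},
    {"t_min": 50.0, "lat": -10.0, "lon": -40.0, "name": "too_late"},
    {"t_min": 5.0, "lat": -15.0, "lon": -40.0, "name": "too_far"},
]
match = colocate(ro_sounding, pixels)
no_match = colocate({"t_min": 0.0, "lat": 40.0, "lon": 20.0}, pixels)
```

Here the pixel labeled `near` is selected, while the other two are rejected by the time and distance limits, respectively.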
Figure 6 displays the scatter plots of F7/C2 RO NmF2 and GOLD NmF2 for each of the 12 calendar months, using data from 1 August 2019 to 30 October 2022. Rajesh et al. (2011) illustrate that the correlation between peak electron density and nighttime airglow intensity varies from month to month; therefore, the correlation is calculated separately for each month. The correlation coefficients are relatively high, above 0.8, in most months, whereas those in May-August fall below 0.8. The low correlation in those months is due to the overestimated NmF2 induced by the conjugate photoelectrons.
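A per-month correlation of this kind can be computed as in the sketch below. It assumes the Pearson correlation coefficient (the text does not name the estimator) and a hypothetical input layout of (month, GOLD NmF2, F7/C2 NmF2) triples; the numbers are toy values, not observations.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def monthly_correlation(pairs):
    """Group (month, gold_nmf2, ro_nmf2) triples by month and correlate."""
    by_month = defaultdict(lambda: ([], []))
    for month, gold, ro in pairs:
        by_month[month][0].append(gold)
        by_month[month][1].append(ro)
    return {m: pearson_r(g, r) for m, (g, r) in sorted(by_month.items())}


# Toy values (assumed): month 1 is perfectly linear, month 6 is not.
toy_pairs = [
    (1, 1.0, 2.0), (1, 2.0, 4.0), (1, 3.0, 6.0),
    (6, 1.0, 3.0), (6, 2.0, 1.0), (6, 3.0, 2.0),
]
r_by_month = monthly_correlation(toy_pairs)
```

Pooling all calendar months of the multi-year record into 12 groups, as done here, keeps the seasonal dependence visible while retaining enough samples per group.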

NmF2 Transform by Using Bagged Tree
The comparison of peak density (Figure 4) shows that GOLD NmF2 and F7/C2 NmF2 are consistent except in the boundary region affected by photoelectrons generated in magnetically conjugate areas of the other Hemisphere. The GOLD peak density is inverted from the nighttime airglow intensity, which is influenced by the conjugate photoelectrons, so the retrieved electron density is overestimated. Therefore, a conversion factor calculated directly between NmF2 and intensity might not be suitable for establishing an intensity-to-NmF2 model based on the F7/C2 RO observations. This study instead uses the machine learning technique bagged tree to investigate the relation between GOLD OI-135.6 nm intensity and F7/C2 RO NmF2. The bagged tree model that converts 135.6 nm airglow intensity to NmF2 is built with the Regression Learner App in the MATLAB Statistics and Machine Learning Toolbox (MathWorks, 2020). In the training process, we use the GOLD airglow intensity, location (longitude, latitude), and time (year, month, UT) as the training inputs and the F7/C2 RO NmF2 as the training target (output). We accumulated co-located GOLD OI-135.6 nm intensity and F7/C2 RO NmF2 from 1 August 2019 to 30 October 2022, for a total of 55,453 co-located data points. Each tree is trained on about 35,000 randomly sampled data points and has about 9,800 nodes, including 4,900 leaf nodes.

In this study, the model performance is validated with k-fold cross-validation because the data set is small. K-fold cross-validation partitions the data into k randomly selected, roughly equal-sized subsets. One subset is used to validate the model, which is trained on the remaining subsets. This process is repeated k times so that each subset is used exactly once for validation, and the average error across all k partitions is reported as the validation error. Here k is set to five: the data are partitioned into five random subsets, one of which (20% of the data) validates the model while the remaining four (80%) train it, and the training and validation are repeated five times. The average root mean square deviation (RMSD) of the cross-validation is 2.05 × 10⁵ cm⁻³.

Figure 7a displays the test result of the bagged tree regression, where the blue dots denote the true response (F7/C2 NmF2) and the red dots mark the predicted response (bagged tree NmF2). The test result illustrates that the airglow intensity is related to the square of the predicted NmF2, which agrees with Hanson (1969). Figure 7a also displays the regression function calculated from the airglow intensity and the predicted NmF2. Figures 7d-7f show the GOLD intensity, GOLD NmF2, and NmF2 derived by the bagged tree process at 23 UT on 15 June 2020. The GOLD NmF2 displays significantly enhanced peak density over the southern Atlantic Ocean, an enhancement caused by the conjugate photoelectrons. By contrast, the bagged tree NmF2 is generally smaller than the GOLD NmF2 in the faint airglow region. Figure 7b shows the RMSD between the airglow-derived NmF2 (GOLD and bagged tree) and the NmF2 from two digisondes (Ascension Island, AS00Q, 7.95°S, 14.4°W; Cachoeira Paulista, CAJ2M, 22.70°S, 45.00°W). The digisonde NmF2 data are obtained from the Global Ionospheric Radio Observatory (GIRO, 2011; Reinisch & Galkin, 2011). The RMSD between the bagged tree NmF2 and the digisondes is generally smaller than that of the GOLD NmF2 in the Southern Hemisphere winter (May-July), indicating that the bagged tree model can reduce the influence of the faint airglow produced by conjugate photoelectrons.
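The five-fold cross-validation and the RMSD score can be sketched in a few lines. The example below is not the Regression Learner workflow: the fold assignment, the function names, and the stand-in model (a least-squares fit of NmF2 = c·sqrt(I) to synthetic, noise-free data) are assumptions made so that the procedure is easy to follow.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmsd(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def cross_validate(xs, ys, fit, k=5, seed=0):
    """Average validation RMSD over k folds; `fit` returns a predictor."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        pred = [model(xs[i]) for i in held_out]
        scores.append(rmsd(pred, [ys[i] for i in held_out]))
    return sum(scores) / k


def fit_sqrt_scale(xs, ys):
    """Stand-in model (assumed): least-squares c in y = c * sqrt(x)."""
    c = sum(y * math.sqrt(x) for x, y in zip(xs, ys)) / sum(xs)
    return lambda x: c * math.sqrt(x)


# Synthetic, noise-free data (an assumption): y = 2 * sqrt(x).
intensity = [float(i) for i in range(1, 101)]
density = [2.0 * math.sqrt(x) for x in intensity]

folds = k_fold_indices(len(intensity), k=5)
cv_rmsd = cross_validate(intensity, density, fit_sqrt_scale, k=5)
```

With 100 samples and k = 5, each fold holds 20 samples (20%) for validation while the other 80 train the model, mirroring the 20%/80% split described in the text.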

Discussion and Summary
The comparison between co-located GOLD NmF2 and F7/C2 RO NmF2 shows that they are largely similar, despite differences in sounding techniques and algorithmic assumptions. This suggests that the GOLD OI-135.6 nm intensity algorithm is capable of accurately retrieving peak density. Nevertheless, the retrieved peak density depends on the intensity of the airglow emission, so any external contamination in the measurement can degrade its accuracy, as revealed by the NmF2 enhancements caused by small noise in the mid-latitude intensity (Figures 7d and 7e). A small enhancement (e.g., noise) in intensity can thus yield a significant enhancement in peak density through the algorithm used, and the results might not be suitable for further analysis.
To find a better way to invert the peak density, we applied the machine learning method bagged tree, using GOLD and F7/C2 data to train an inversion model (Lin, 2022). This model is not based on any chemical formulations or assumptions but purely on observation data. Note that the predicted NmF2 generated by the bagged tree model differs from the true-response NmF2 from F7/C2 (Figure 7a). This does not mean that the model is poorly trained. On the contrary, the airglow intensity is related to the square of the predicted NmF2, in agreement with the literature (e.g., Hanson, 1969). It is noteworthy that we did not supply any prior relation between airglow intensity and NmF2 during the training process: the bagged tree method infers the relation from the training data and reduces the impact of extreme values. For the equatorial region, we can expect the electron density structure retrieved by the bagged tree to be similar to the original result (Figure 7f), because both are based on the same source (radiative recombination). However, the overestimated NmF2 induced by conjugate photoelectrons in the mid-latitudes of the winter Hemisphere and the significant density disturbance generated by intensity noise over the Atlantic Ocean are significantly reduced. Moreover, Figure 7c demonstrates that the NmF2 computed by the bagged tree model generally has less deviation than the GOLD NmF2 in the mid-latitude regions (Lon: 30°W-55°W; Lat: 25°N-45°N and 27.5°S-47.5°S), especially in May, June, July, and August. The results provide evidence that the bagged tree model trained on actual observation data can reduce the influence of intensity noise on the NmF2. On the other hand, OI-135.6 nm emission observations should be excluded at magnetically conjugate points to avoid the influence of photoelectrons (e.g., Wautelet et al., 2021). The bagged tree model can mitigate the influence of measurements possibly contaminated by conjugate photoelectrons, but the effect is limited in cases of extreme intensity values. In the future, the inversion model can also include more observations, such as those from ICON, to develop a global OI-135.6 nm airglow intensity inversion model.
The bagged tree NmF2 presented in this study combines the peak density distribution from GOLD with the peak density magnitude from F7/C2. The wide range of NmF2 values reflects not only the distribution of peak electron density (critical frequency) but also the horizontal structure of the ionosphere. Hence, it is important to observe the extensive variations in NmF2 during nighttime ionospheric conditions. For example, Figure 8 reveals the ionospheric peak density disturbance during a minor geomagnetic storm from 13 to 18 January 2022. The minimum Dst index (Papitashvili & King, 2020) reached −91 nT at 22:00 UTC on 14 January 2022. The nighttime equatorial peak density shows an X-shape on the night of 15 January, during the geomagnetic storm recovery phase. Aa et al. (2022) mentioned that the X-shape on 15 January 2022 is strongly related to the Tonga volcanic eruption. The GOLD nighttime airglow observation is essential for monitoring the state of the ionosphere, and it covers regions that lack ground measurements, such as the Atlantic Ocean. Moreover, the bagged tree peak density is less influenced by image noise, so it could be assimilated into an ionospheric nowcasting model such as the global ionospheric specification (Lin et al., 2017). The GOLD peak density can provide dense coverage in the Atlantic Ocean region, supplementing the sparse observations by F7/C2 RO soundings.
To summarize, this study compares the ionospheric peak density from GOLD NmF2 and F7/C2 RO NmF2 over the Americas and the Atlantic Ocean regions. The results illustrate good agreement between the two satellite measurements in the low-latitude area, with correlation coefficients of about 0.7-0.9; the lower values occur around the southern Atlantic Ocean region. This discrepancy arises because the peak density inverted from GOLD airglow intensity is significantly larger than that from F7/C2 due to conjugate photoelectrons in the winter Hemisphere. This study also describes a new method to estimate the ionospheric peak density from nighttime airglow intensity using the Bagged Trees machine learning technique. The bagged trees method uses GOLD airglow intensity, location, and time as the input and the F7/C2 NmF2 as the output to fit the model. The test result shows that the measured intensity is related to the square of the predicted NmF2, agreeing with the literature. The validation demonstrates that the bagged tree NmF2 has a smaller RMSD than GOLD with respect to digisonde measurements in May, June, and July because the effect of conjugate photoelectrons on the bagged tree NmF2 is reduced. In conclusion, this study demonstrates that combining the measurements of the F7/C2 and GOLD missions with machine learning techniques for geophysical data processing offers excellent prospects for accurately monitoring the nighttime ionosphere, and that the results have the potential to improve ionospheric forecasting.

Figure 1. (left) The night disk scans of 135.6 nm airglow intensity observed by Global-scale Observations of the Limb and Disk (GOLD) at 22 UT on 15 June 2020. (right) The colored dots mark the locations of all the F7/C2 radio occultation soundings on the same day.

Figure 2. The flow chart of the bagged tree.

Figure 3. Part of the tree structure. The black triangles denote the nodes, and the red circles indicate the leaves. Radiance denotes the 135.6 nm airglow intensity, Lon the longitude, and Lat the latitude.

Figure 4. The distributions of Global-scale Observations of the Limb and Disk (GOLD) NmF2 and F7/C2 radio occultation NmF2 for all co-located data points from January to December 2020.

Figure 5. The median value of 135.6 nm intensity (top) and NmF2 (middle) from the Global-scale Observations of the Limb and Disk (GOLD) mission at 22 UT in June and December 2020. The bottom panels show the NmF2 distributions at 22 UT in June and December 2020 from the FORMOSAT-7/COSMIC-2 radio occultation mission. Note that the area of F7/C2 NmF2 is limited to the field of view of GOLD NmF2.

Figure 6. The scatter plots and correlation coefficients of F7/C2 radio occultation NmF2 and Global-scale Observations of the Limb and Disk (GOLD) NmF2 for each of the 12 calendar months, using data from 1 August 2019 to 30 October 2022.

Figure 7. (a) The test result of the bagged trees model. Blue dots denote the true response (F7/C2 NmF2), and red dots denote the predicted response (bagged tree NmF2). The black line denotes the regression function calculated from the airglow intensity and the predicted NmF2. (b) The monthly root mean square deviation between Global-scale Observations of the Limb and Disk (GOLD) and two ground digisonde observations from August 2019 to September 2021. (c) The root mean square deviation of NmF2 in the mid-latitude areas in 2021. The red line indicates bagged tree NmF2, and the blue line indicates GOLD NmF2. The night disks of (d) GOLD 135.6 nm airglow intensity, (e) GOLD NmF2, and (f) bagged tree NmF2 at 23 UT on 15 June 2020.