Prediction of Atmospheric Profiles With Machine Learning Using the Signature Method

An array of atmospheric profile observations consists of three‐dimensional vectors representing pressure, temperature, and humidity, with each profile forming a continuous curve in this three‐dimensional space. In this paper, the Signature method, which can quantify a profile's curve, was adopted for the atmospheric profiles, and the accuracy of profile representations was investigated. The description of profiles by the signature was confirmed with adequate accuracy. The machine‐learning‐based model, developed using the signature, exhibited a high level of annual accuracy with minimal absolute mean differences in temperature and water vapor mixing ratio (<2.0 K or g kg−1). Notably, the model successfully captured the vertical structure and atmospheric instability, encompassing drastic variations in water vapor and temperature, even during intense rainfall. These results indicate the Signature method can comprehensively describe the vertical profile with information on how ordered values are correlated. This concept would potentially improve the representation of the atmospheric vertical structure.


Introduction
Temperature and humidity are essential properties of the atmosphere that characterize atmospheric phenomena, and their changes can affect weather and climate. The atmospheric profile is generally constituted from a series of values (e.g., temperature and humidity) obtained at multiple levels, and it can describe the thermodynamic stability of the atmosphere. We sometimes diagnose the atmospheric condition with vertical integrals regarding a path that connects the point data from the surface to the upper air of the profile data.
However, we can grasp its properties more comprehensively with integral quantities (iterated integrals) concerning any combination of variables along the path using the Signature method (e.g., Chevyrev & Kormilitzin, 2016). The signature is a mathematical concept originating in rough path theory (e.g., Chevyrev & Kormilitzin, 2016; Lyons et al., 2007). It can be computed using a mathematical operation called the iterated integral, which integrates a path over a sequence of intervals to generate a sequence of higher-level functions. Consequently, in the atmospheric profile, the signature of a path captures a sequence of high-order statistical information; it can offer a more detailed description representing consecutive paths in a multidimensional space of temperature, humidity, and pressure.
Since the signature can capture the shape and temporal dynamics of the profiles, it can help to build models such as machine-learning (ML) algorithms more effectively than using the individual values in each layer. Recently, various fields have applied this method for prediction (e.g., Fermanian, 2021; Li et al., 2019; Moore et al., 2019; Morrill et al., 2019; Perez Arribas et al., 2018; Xie et al., 2018). In ocean science, Sugiura and Hosoda (2020; hereafter referred to as SH2020) carried out a notable application of the Signature method to oceanic profiles. Using ML techniques, they effectively conducted a quality check and assessment of oceanic profiles, demonstrating the efficacy of the Signature method in this context.
In atmospheric science as well, the concept of the signature can potentially extend traditional vertical profile representations with more detail and improve the expression of the shape of atmospheric profiles, such as thermodynamic features. In this paper, we investigated how accurately the signature describes atmospheric profiles and the efficacy of signatures when used in machine-learning-based models. A comparison with a model that predicts the raw values is also discussed.

Signature of Atmospheric Vertical Profile
In this study, we calculated the signature of each profile using the procedure described in SH2020. The atmospheric profile consists fundamentally of pressure (P), temperature (T), and relative humidity (R). Although SH2020 utilized oceanic profiles composed of pressure, salinity, and temperature, our focus is on atmospheric profiles; therefore, salinity was replaced by humidity. The concept of the Signature method is shown in Text S1 of Supporting Information S1. The profile can be represented as a sequence of three-dimensional vectors (P, T, R), forming a continuous curve in three-dimensional space like the example in Figure S1a of Supporting Information S1.
The Signature method is premised on the mathematical concept that all information about a path $X : [0, 1] \to \mathbb{R}^d$ is contained in its iterated integrals,

$$S^{i_1,\cdots,i_k}(X) = \int_{0<t_1<\cdots<t_k<1} dX^{i_1}_{t_1} \cdots dX^{i_k}_{t_k},$$

where $i_1,\cdots,i_k$ denote the component numbers and $t_1,\cdots,t_k$ are the parameters along the path. Rather than directly calculating these iterated integrals, the signature is efficiently derived through the following algebraic procedure. For a linear path $v$, represented by a vector $v \in \mathbb{R}^d$, the signature $S$ is computed as follows:

$$S(v) = \exp(v) = \sum_{k=0}^{n} \frac{1}{k!} v^{\otimes k}, \tag{1}$$

where $\otimes k$ is the $k$-times tensor product and $n$ is the order of the signature. For a piecewise-linear path $v_1 * \cdots * v_m$, which is made by concatenating linear paths $v_1, \cdots, v_m$ one after the other, the signature is computed as follows:

$$S(v_1 * \cdots * v_m) = \exp(v_1) \otimes \cdots \otimes \exp(v_m), \tag{2}$$

owing to Chen's identity (Chen, 1958). Here, the tensor product is extended to the product in the truncated tensor algebra by

$$\left(\sum_{k=0}^{n} a_k\right) \otimes \left(\sum_{k=0}^{n} b_k\right) = \sum_{k=0}^{n} \sum_{l=0}^{k} a_l \otimes b_{k-l},$$

where the subscript represents the order of the term.

The atmospheric data were obtained from the analysis values of the operational Mesoscale Model (MSM) of the Japan Meteorological Agency (JMA, 2022), which employs a horizontal resolution of 10 km. The three-hourly data set consists of the surface and 16-layer profiles (at 1000, 975, 950, 925, 900, 850, 800, 700, 600, 500, 400, 300, 250, 200, 150, and 100 hPa) in Fukuoka, Japan (33.6°N, 130.4°E). Note that the relative humidity above the 250 hPa level was consistently set to zero throughout the analysis. The vector sequence was scaled to dimensionless values using the sequence of divisors (1000, 100, 1), chosen as typical scales for the components.
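The algebraic procedure described above (tensor exponential of each linear segment, combined through Chen's identity) can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions, not the Esig implementation used in the study: the function names (`tensor_exp`, `tensor_mul`, `signature`, `flatten`) and the three-level profile values are hypothetical, and relative humidity is assumed here to be expressed as a fraction.

```python
import numpy as np

def tensor_exp(v, order):
    """Truncated tensor exponential of a vector: levels v^{(x)k} / k! for k = 0..order."""
    levels = [np.array(1.0)]
    for k in range(1, order + 1):
        # v^{(x)k} / k! is obtained from the previous level by one more outer product.
        levels.append(np.multiply.outer(levels[-1], v) / k)
    return levels

def tensor_mul(a, b):
    """Product in the truncated tensor algebra: level k is sum_l a_l (x) b_{k-l}."""
    order = len(a) - 1
    return [sum(np.multiply.outer(a[l], b[k - l]) for l in range(k + 1))
            for k in range(order + 1)]

def signature(points, order):
    """Signature of the piecewise-linear path through `points` (shape (m + 1, d))."""
    increments = np.diff(points, axis=0)
    sig = tensor_exp(increments[0], order)
    for v in increments[1:]:
        sig = tensor_mul(sig, tensor_exp(v, order))  # Chen's identity
    return sig

def flatten(sig):
    """Concatenate all levels (including the constant zeroth level) into one vector."""
    return np.concatenate([np.ravel(level) for level in sig])

# Hypothetical profile: pressure (hPa), temperature (K), relative humidity (fraction).
profile = np.array([[1000.0, 300.0, 0.8],
                    [850.0, 290.0, 0.6],
                    [500.0, 265.0, 0.3]])
scaled = profile / np.array([1000.0, 100.0, 1.0])  # divisors quoted in the text
sig = flatten(signature(scaled, order=5))
```

In this sketch the first-order terms are simply the total increment of the scaled path from the lowest to the highest level, while the higher-order terms encode how the components vary together along the curve.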
The signature was computed using the Python library Esig (Kormilitzin, 2017), and the order number in the iterated integrals was set to five. An example of the atmospheric physical aspects of the iterated integrals is as follows: the second-order iterated integrals from pressure and water vapor would indicate the precipitable water vapor, or those from pressure and temperature would indicate the total heat content in the atmosphere (e.g., SH2020). Examples of iterated integrals are shown in Figure S1b of Supporting Information S1.
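To make this physical reading concrete, the second-order term can be written out explicitly. The block below is a worked illustration rather than part of the original derivation; the symbols $q$ (water vapor mixing ratio), $p$ (pressure), $g_0$ (gravitational acceleration), and $\rho_w$ (density of liquid water) are introduced here as assumptions, and the correspondence is qualitative because the profiles in this study carry relative humidity rather than mixing ratio.

```latex
% Second-order iterated integral of components i and j of the path X:
S^{(i,j)}(X) = \int_{0<t_1<t_2<1} dX^{i}_{t_1}\, dX^{j}_{t_2}
             = \int_{0}^{1} \left( X^{i}_{t} - X^{i}_{0} \right) dX^{j}_{t}

% With i = q and j = p this has the form \int (q - q_0)\, dp, which shares
% its structure (up to sign, offset, and constant factors) with precipitable water:
PW = \frac{1}{\rho_w g_0} \int_{p_{\mathrm{top}}}^{p_{\mathrm{sfc}}} q \, dp
```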

Decoding the Atmospheric Profile From the Signature
To decode the signature into a profile, the following estimation was applied. For a given signature $g$, we want to find a path with the "closest" signature $S$ to $g$, which can be implemented by minimizing the cost function (Equation 3), where $\pi_k$ is the projection onto the $k$th-order tensors, $|\cdot|_F$ is the Frobenius norm, the exponent $1/k$ is used for the homogeneity through dilation, $\bar{v}_j$ is the first guess for the vector $v_j$, which is set to $\pi_1(g)/m$, and $\sigma$ represents a prescribed variability of the vector increment. The second term is added to avoid the non-uniqueness of the solution.
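One form of the cost function that is consistent with the symbol definitions above is sketched below. It is offered as an assumption rather than as the authors' exact expression: in particular, the squaring of the norms and the factors of 1/2 are guesses, chosen so that the first term measures the signature misfit order by order and the second term penalizes departures from the first guess.

```latex
J(v_1, \cdots, v_m) =
  \frac{1}{2} \sum_{k=1}^{n}
    \left( \left| \pi_k\!\left( S(v_1 * \cdots * v_m) - g \right) \right|_F^{1/k} \right)^{2}
  + \frac{1}{2\sigma^{2}} \sum_{j=1}^{m} \left| v_j - \bar{v}_j \right|^{2}
```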
This minimization problem was solved with the BFGS method as implemented in SciPy (Virtanen et al., 2020). The gradient was computed by automatic differentiation of the signature transformation process.
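The decoding step can be sketched with `scipy.optimize.minimize` and `method="BFGS"`. Everything specific in the sketch is an assumption: the exact form of the cost (squared misfit with exponent 1/k per order, plus a quadratic penalty), the hypothetical three-segment path, the value of `sigma`, and the reduced order of three used to keep it fast. The study computed the gradient by automatic differentiation, whereas this sketch lets SciPy approximate it by finite differences.

```python
import numpy as np
from scipy.optimize import minimize

ORDER = 3  # kept low so the sketch runs quickly; the text uses order five

def signature_levels(increments, order=ORDER):
    """Levels 1..order of the signature of a piecewise-linear path (Chen's identity)."""
    def texp(v):
        out = [np.array(1.0)]
        for k in range(1, order + 1):
            out.append(np.multiply.outer(out[-1], v) / k)
        return out
    sig = texp(increments[0])
    for v in increments[1:]:
        e = texp(v)
        sig = [sum(np.multiply.outer(sig[l], e[k - l]) for l in range(k + 1))
               for k in range(order + 1)]
    return sig[1:]

def cost(x, target, first_guess, sigma):
    """Assumed cost: per-order signature misfit plus a penalty toward the first guess."""
    v = x.reshape(first_guess.shape)
    misfit = sum(np.linalg.norm(np.ravel(s - g)) ** (2.0 / k)
                 for k, (s, g) in enumerate(zip(signature_levels(v), target), start=1))
    penalty = np.sum((v - first_guess) ** 2) / sigma ** 2
    return 0.5 * (misfit + penalty)

# Hypothetical "true" increments of a short curved path in scaled (P, T, R) space.
true_v = np.array([[-0.15, -0.10, -0.2],
                   [-0.35, -0.25, -0.3],
                   [-0.20, -0.30, 0.1]])
g = signature_levels(true_v)                  # signature to be decoded
m = true_v.shape[0]
first_guess = np.tile(g[0] / m, (m, 1))       # first guess: pi_1(g) / m for every segment
sigma = 1.0                                   # hypothetical variability of the increments
result = minimize(cost, first_guess.ravel(), args=(g, first_guess, sigma), method="BFGS")
decoded_v = result.x.reshape(m, -1)           # decoded increments of the path
```

The first guess is a straight line with the correct end point, so its first-order misfit is zero and the optimizer only has to bend the path until the higher-order terms agree with the target.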

Machine Learning Model Design Using the Signature
The signature enables the quantification of the profile shape. We constructed a supervised ML model based on a neural network (NN) to predict the signature. The model was designed to predict the signature from limited, routinely available data, without special observations.
The input vector consisted of meteorological values at the surface based on operational observations, vertically integrated values for water vapor based on satellite observations (Fujita & Sato, 2017), and the brightness temperature derived from the JMA third-generation geostationary meteorological satellite Himawari-8 (Bessho et al., 2016), as listed in Table 1. The brightness temperatures on the nearest grid at the target point (33.6°N, 130.4°E) were used. In this paper, these surface and vertically integrated values were obtained from MSM. The predictand (the quantity to be predicted) is the signature of the atmospheric profile, as elucidated in Section 2.1. As depicted in Figure S2 of Supporting Information S1, the input vector is passed through two densely connected hidden layers, the first with 32 nodes and the second with 128 nodes, and the output layer consists of 364 nodes, representing the predictand (=signature).
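The size of the output layer can be cross-checked against the signature order. The sketch below assumes that the constant zeroth-order term is counted among the outputs, which is an inference from the arithmetic rather than something stated in the text; `signature_length` is a hypothetical helper name.

```python
def signature_length(dim, order):
    """Number of signature terms up to `order`, including the constant zeroth-order term."""
    return sum(dim ** k for k in range(order + 1))

# Three components (P, T, R) and order five: 1 + 3 + 9 + 27 + 81 + 243 terms.
n_outputs = signature_length(dim=3, order=5)
```

Under this assumption the count equals the 364 output nodes quoted above.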
The value at each individual node of a NN is generally calculated by weighting the sum of the inputs and incorporating an additional bias term,

$$h_j = \sum_{i} w_{ij} x_i + b, \tag{4}$$

where $i$ is the node from the preceding layer, and $j$ corresponds to the node representing the value in the current layer. In this equation, $w_{ij}$ signifies the interconnection weight between nodes $i$ and $j$, $x_i$ represents the value associated with node $i$, and $b$ denotes the supplementary bias term. The weights and biases underwent iterative updates until the training process reached completion, which corresponded to the minimization of the loss function. The performance and accuracy of the model were evaluated using the loss function based on the mean-squared error (Keras; Chollet, 2015). The rectified linear unit (ReLU; Agarap, 2018) was applied to the hidden nodes ($h_j$) to incorporate nonlinear transformations when the functions (Equation 5) were finally determined:

$$f(h_j) = \max(0, h_j). \tag{5}$$

The data set used in this study encompassed profiles spanning the period from 2019 to 2020, consisting of a total of 5,848 samples. To ensure an unbiased analysis, the data were randomly shuffled along the temporal dimension. Subsequently, the data set was divided into separate validation and training data sets, with a ratio of 2:8. The final validation RMSE was 7.5 × 10−5, indicating a close fit to the validation data.
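The weighted sum with bias and the ReLU activation can be traced through the layer sizes given above with a plain NumPy forward pass. This is a sketch only: the study used Keras, the number of inputs (`n_inputs = 10`) is hypothetical because Table 1 is not reproduced here, the linear output layer is an assumption, and the weights are random rather than trained.

```python
import numpy as np

def relu(h):
    """Rectified linear unit: max(0, h) applied element-wise."""
    return np.maximum(0.0, h)

def dense(x, w, b):
    """Weighted sum of the inputs plus a bias term for every node in the layer."""
    return x @ w + b

def forward(x, params):
    """Two ReLU hidden layers followed by a linear output layer (assumed)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(x, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return dense(h2, w3, b3)

rng = np.random.default_rng(0)
n_inputs = 10                        # hypothetical length of the input vector
sizes = [n_inputs, 32, 128, 364]     # hidden and output sizes quoted in the text
params = [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(4, n_inputs))   # four hypothetical input vectors
predicted_signature = forward(x, params)
```

Training would adjust `params` to minimize the mean-squared error between `predicted_signature` and the signatures of the training profiles.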
The model was utilized to perform predictions on a three-hourly basis throughout 2021. A total of 2,920 input vectors were fed to the model, and the corresponding signatures were predicted. These predicted signatures were subsequently decoded to obtain the respective physical values, namely air pressure, temperature, relative humidity, and water vapor mixing ratio, which were employed for further analysis.

Results
First, the accuracy of the decoding process from the signature to the atmospheric profile was confirmed. Each profile in 2021 was signature-transformed and decoded back to compare with the original profile. Figure 1 shows the difference between the original MSM meteorological value and the decoded value from the signature. The mean differences in temperature and water vapor were quite small for the annual, summer, and winter mean data.
The root-mean-square error tends to be larger in the upper layers, especially for temperature. However, these errors were relatively small and <2.0 K or g kg−1. Similar tendencies were found for the other seasons (spring and autumn; Figure S3 in Supporting Information S1).
We then assessed the accuracy of the predicted profiles obtained by ML. Figure 2a illustrates the annual mean difference and root-mean-square error for each layer. Both the temperature and water vapor mixing ratio exhibited an absolute mean difference of <2.0 K or g kg−1, indicating there was minimal systematic bias across all layers. Notably, the temperature difference tended to be negative, whereas the water vapor difference tended to be positive. The root-mean-square error for water vapor was also below 2.0 g kg−1, with smaller errors in the upper layers, where the absolute values were generally lower. Conversely, the root-mean-square error for temperature was smaller near the surface and remained below 4.0 K below the 500 hPa level. The temperature errors were relatively larger in the upper layers, yet these errors were still <6.0 K.
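The per-layer statistics discussed here (mean difference, root-mean-square error, and the interquartile range shown in the figures) can be computed as in the following sketch. The function name `layer_statistics` and the small synthetic arrays are hypothetical; the array layout (time along the first axis, layer along the second) is an assumption.

```python
import numpy as np

def layer_statistics(predicted, truth):
    """Per-layer error statistics for arrays of shape (n_times, n_layers)."""
    diff = predicted - truth
    return {
        "mean_difference": diff.mean(axis=0),          # systematic bias in each layer
        "rmse": np.sqrt((diff ** 2).mean(axis=0)),     # root-mean-square error
        "p25": np.percentile(diff, 25, axis=0),        # lower quartile of the difference
        "p75": np.percentile(diff, 75, axis=0),        # upper quartile of the difference
    }

# Synthetic example: layer 0 has a constant +1 bias, layer 1 has unbiased errors of size 2.
truth = np.zeros((4, 2))
predicted = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 2.0], [1.0, -2.0]])
stats = layer_statistics(predicted, truth)
```

The synthetic case shows why both statistics are reported: the second layer has zero mean difference but a nonzero root-mean-square error.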
The seasonal variations in prediction accuracy were examined and are shown for summer (June to August; Figure 2b) and winter (December to February; Figure 2c). The observed patterns in the mean difference and the root-mean-square error appear to be consistent with the results obtained for the entire year (Figure 2a). In particular, the temperature error was relatively larger during winter, which is likely attributable to the model's sensitivity to the low water vapor content in that season. Given that a significant proportion of the input vector values is affected by water vapor and surface atmospheric phenomena (Table 1), errors in the upper layers during the dry winter season may be more pronounced. Similar tendencies were observed in the accuracy assessment for the spring and autumn seasons (Figure S4 in Supporting Information S1). Based on these results, the ML prediction errors tend to be larger than the decoding errors. Moreover, the tendencies of the ML prediction errors, such as larger errors for temperature and in the winter season, resembled those of the decoding errors.
We then focused on the atmospheric features in the predicted profiles for summer, when the temporal variation is more significant than in the other seasons. The temporal evolution of atmospheric variables in August is presented in Figure 3. Notably, the vertical structure of the atmosphere exhibited significant variations throughout the study period, particularly above the atmospheric boundary layer with respect to water vapor, as depicted in Figure 3b.
The middle atmospheric layer and the surface air were occasionally characterized by warmth and moisture, with the maximum precipitable water vapor reaching 74.7 mm that month. The corresponding true vertical structure is shown in Figure S5 of Supporting Information S1.
To assess atmospheric instability, computations were performed based on the temperature and water vapor profiles. The calculation of the K-index and convective available potential energy (CAPE) instability metrics is described in Text S2 of Supporting Information S1. These time series analyses indicate that our model is capable of capturing variations in atmospheric instability. During the heavy rain event observed on August 13-14, the CAPE value was relatively low, whereas the K-index was notably high. This indicates that the atmospheric conditions were characterized by an abundance of moisture rather than thermal instability (e.g., Takemi & Unuma, 2019).
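For reference, the K-index in its commonly used form combines temperatures and dewpoints at 850, 700, and 500 hPa, as in the sketch below. The exact procedure of this study is given in Text S2 and may differ in detail; the Magnus-type dewpoint helper, its coefficients, and the sample values are assumptions added for illustration.

```python
import math

def dewpoint_celsius(temp_c, rh_percent):
    """Dewpoint (degC) from temperature (degC) and relative humidity (%), Magnus-type formula."""
    a, b = 17.625, 243.04  # commonly used Magnus coefficients (assumed here)
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)

def k_index(t850, td850, t700, td700, t500):
    """K-index (degC): (T850 - T500) + Td850 - (T700 - Td700)."""
    return (t850 - t500) + td850 - (t700 - td700)

# Hypothetical moist profile: a large lapse rate and small dewpoint depressions.
k = k_index(t850=20.0, td850=15.0, t700=10.0, td700=5.0, t500=-10.0)
```

Because the index rewards low-level moisture (`td850`) and a moist 700 hPa layer, it can be high in humid conditions even when CAPE is modest, which is the situation described for the heavy rain event.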
The accuracy scores (e.g., Thornes & Stephenson, 2001) for the atmospheric instability indices, specifically the K-index and CAPE, were then evaluated. To compute these scores, a threshold was established to determine whether a severe event occurred: values exceeding 30.0 for the K-index and 100.0 for CAPE. The accuracy values (percent correct) of the K-index and CAPE were found to be 0.87 and 0.81, respectively, with a miss rate of less than 0.1 for both indices. Furthermore, the hit rates were high (>0.85). These results support the reliability of the model utilizing the Signature method. However, the false alarm rates for the K-index and CAPE were relatively high, at 0.36 and 0.23, respectively. This elevated false alarm rate may be attributed to a minor positive bias observed near the surface and lower layers. The upper-layer temperature error after 25 August (Figure 4c) would also affect the accuracy of the indices. Overall, these results indicate that our method enables the prediction of atmospheric profiles and could support more precise modeling of atmospheric processes. The calculation of the scores is summarized in Text S3 of Supporting Information S1.
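Scores of this kind are usually derived from a 2 × 2 contingency table of predicted versus observed threshold exceedances. The sketch below uses one common set of definitions; the definitions actually used in the study are given in Text S3 and may differ (for example, a false alarm ratio relative to all predicted events instead of the false alarm rate used here). The function name and the sample values are hypothetical.

```python
import numpy as np

def contingency_scores(predicted, observed, threshold):
    """Categorical scores for events defined as values exceeding `threshold`."""
    p = np.asarray(predicted) > threshold
    o = np.asarray(observed) > threshold
    hits = np.sum(p & o)
    misses = np.sum(~p & o)
    false_alarms = np.sum(p & ~o)
    correct_negatives = np.sum(~p & ~o)
    total = hits + misses + false_alarms + correct_negatives
    return {
        "percent_correct": (hits + correct_negatives) / total,
        "hit_rate": hits / (hits + misses),
        "miss_rate": misses / (hits + misses),
        "false_alarm_rate": false_alarms / (false_alarms + correct_negatives),
    }

# Hypothetical K-index values from the model and from the reference profiles.
predicted_k = [35.0, 28.0, 32.0, 25.0, 31.0]
observed_k = [33.0, 31.0, 29.0, 20.0, 36.0]
scores = contingency_scores(predicted_k, observed_k, threshold=30.0)
```

With these definitions the hit rate and miss rate sum to one, so a high hit rate and a high false alarm rate can coexist when the model tends to overpredict exceedances.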

Discussion
We have demonstrated above that the model based on the Signature method is able to predict atmospheric profiles with reasonable accuracy. In this section, we shift our focus toward evaluating the influence of the signature on the accuracy of the predictive model. To accomplish this, we developed an alternative baseline model that predicts the raw profile values instead of the signature. The raw value vector comprises a sequence of the three profile values (pressure, temperature, and humidity) arranged in order from the surface to the top of the 16 atmospheric layers; the design was kept simple to clarify the influence of the signature. Employing the same input vector as in the original model configuration (as described in Section 2), we constructed a supervised ML model. The training procedure closely followed the original, and the model was set with two hidden layers; the first has 16 nodes, the second has 32 nodes, and the output layer has 48 nodes. The time series analysis of the errors in each layer reveals distinct pattern differences, as shown in Figures 4c, 4d, 4g, and 4h. In the case of the baseline model, pronounced errors are observed in water vapor (Figure 4h). Large biases are evident near the surface, accompanied by substantial variability, whereas a persistent positive bias is observed in the middle layer (from 975 to 800 hPa). The temperature bias (Figure 4g) exhibits a comparable tendency, with distinct variations near the surface and upper layers. These errors may stem from correlations within the ML model between the input vector and the values in each layer categorized as the surface layer, planetary boundary layer, or free atmosphere.
In contrast, the model trained on the signature can capture these layer-dependent characteristics. The errors in the signature model exhibit no discernible boundaries across layers (Figures 4c and 4d). These tendencies highlight the challenges in predicting the structure of the vertical profile and serve as evidence that the signature is a robust representation of the shape of the vertical profile.
Some deep ML methods, such as recurrent neural networks, have a significant advantage in capturing the correlations between layers and output sequences. A comparison with deep ML, or an adaptation of the Signature method to deep ML, will be undertaken in future studies.

Conclusions
In this paper, we investigated how accurately the signature describes atmospheric profiles and how well a machine-learning-based model can predict it. The description of profiles by the signature was confirmed with adequate accuracy. Moreover, it was confirmed that the developed model could predict the signature from surface-based and satellite data. The annual accuracy of both temperature and the water vapor mixing ratio, which are decoded from the predicted signature, exhibited an absolute mean difference of <2.0 K or g kg−1. In addition, we confirmed that the seasonal accuracy was almost the same as the annual result. Even during the intense rainy season, the model successfully captured the vertical structure of the atmosphere, representing the pronounced variations in water vapor and temperature within the mid-troposphere. It yielded reasonably accurate estimates of atmospheric instability, as indicated by the K-index and CAPE computed from the predicted profiles, although the false alarm rates remained relatively high.
The atmosphere exhibits distinct variations across different layers, namely the surface layer, planetary boundary layer, and free atmosphere. The Signature method offers a notable advantage in capturing the shape of the three-dimensional curve; consequently, the signature enables comprehensive modeling of the atmosphere, transcending the variability of individual layers. These results indicate that the Signature method can accurately describe the vertical profile representations, and the predictability of the signature has also been confirmed. This concept would potentially improve the representation of the atmospheric vertical structure.

Figure 1. Accuracy of the atmospheric profiles after signature transformation and decoding. (a) Annual air temperature (blue) and water vapor mixing ratio (red) accuracy. Bold lines show the mean differences from the true value of the Mesoscale Model, and horizontal bars show the range between the 25th and 75th percentiles of the difference in each layer. Gray lines indicate zero. Dashed lines indicate the root-mean-square error for each layer. (b) Same as (a), but during summer (JJA), whereas (c) is during winter (DJF).

Figure 2. Same as Figure 1, but for the predicted accuracy using ML. The ML predicted the signature, which was then decoded into the atmospheric profiles.

Figure 4 presents a comparative assessment of prediction accuracy in August 2021 for the model trained by the signature and the baseline model. The monthly accuracy, depicted by mean difference and root-mean-square values (Figures 4a, 4b, 4e, and 4f), clearly reveals the distinctions between the two models. The accuracy with the signature resembles the seasonal mean results (Figure 2b) in all layers. Conversely, notable errors arise in temperature and water vapor predicted by the baseline model, particularly in the lower layers. An evident negative mean difference is observed around the 1000 hPa level, with a positive bias in water vapor above the boundary layer. Furthermore, the root-mean-square values are considerable, particularly in the lowermost layer.

Figure 3. Time series of atmospheric vertical profiles and indices of atmospheric stability in August 2021. ML simulated vertical profiles of (a) air temperature and (b) water vapor mixing ratio. Atmospheric stability in terms of the (c) K-index and (d) convective available potential energy calculated from both ML simulated results (colored line) and true values (black line). (e) Observed three-hourly precipitation.

Figure 4. Improvement of profile prediction using the signature in August 2021. Upper panels (a-d) show the results simulated by the ML model trained on the signature data. In contrast, lower panels (e-h) show the results simulated by the ML model trained using the actual physical values (baseline model). Panels (a), (b) and (e), (f) are the same as Figure 2, but during August only; (c) and (g) show the difference in air temperature from the true value, and (d) and (h) are the difference in water vapor mixing ratio.

Table 1
Values of Input Vector for the Neural Network FUJITA ET AL.