Combining Dynamical and Statistical Modeling to Improve the Prediction of Surface Air Temperatures 2 Months in Advance: A Hybrid Approach

Oettli, Pascal; Nonaka, Masami; Richter, Ingo; Koshiba, Hiroyuki; Tokiya, Yosuke; Hoshino, Itsumi; Behera, Swadhin K.

doi:10.3389/fclim.2022.862707

ORIGINAL RESEARCH article

Front. Clim., 30 March 2022
Sec. Predictions and Projections
Volume 4 - 2022 | https://doi.org/10.3389/fclim.2022.862707

Combining Dynamical and Statistical Modeling to Improve the Prediction of Surface Air Temperatures 2 Months in Advance: A Hybrid Approach

Pascal Oettli¹^*

Masami Nonaka¹

Ingo Richter¹

Hiroyuki Koshiba²

Yosuke Tokiya²

Itsumi Hoshino²

Swadhin K. Behera¹

¹Application Laboratory Research Institute for Value-Added-Information Generation (APL VAiG), Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokohama, Japan
²JERA Co., Inc., Tokyo, Japan

A new type of hybrid prediction system (HPS) of the land surface air temperature (SAT) is described and its skill evaluated for one particular application. This approach utilizes sea-surface temperatures (SST) forecast by a dynamical prediction system, SINTEX-F2, to provide predictors of the SAT to a statistical modeling system consisting of a set of nine different machine learning algorithms. The statistical component is aimed to restore teleconnections between SST and SAT, particularly in the mid-latitudes, which are generally not captured well in the dynamical prediction system. The HPS is used to predict the SAT in the central region of Japan around Tokyo (Kantō) as a case study. Results show that at 2-month lead the hybrid model outperforms both persistence and the SINTEX-F2 prediction of SAT. This is also true when prediction skill is assessed for each calendar month separately. Despite the model's strong performance, there are also some limitations. The limited sample size makes it more difficult to calibrate the statistical model and to reliably evaluate its skill.

Introduction

For the last two decades, seasonal climate prediction (SCP) has become an area of study in its own right (Pepler et al., 2015), similar to weather prediction and climate change projection. Continuous improvements (Doblas-Reyes et al., 2013) have increased the value of SCP in decision-making, though quantifying its benefits remains a challenge (Kumar, 2010; Soares et al., 2018).

In the agricultural sector, decision-making that adapts to climate variability is essential for food security (Hansen, 2002; Meza et al., 2008). For example, in Europe, SCP are helpful in the selection of winter wheat and planning of agro-management practices (Ceglar and Toreti, 2021). In the energy sector, the sudden weather changes and climate variability impact energy consumption (Auffhammer and Mansur, 2014) and, for example, the use of air conditioners (Auffhammer and Aroonruengsawat, 2012; Auffhammer, 2014).

SCP is also used in risk management (Troccoli et al., 2008) and crop insurance (Osgood et al., 2008; Carriquiry and Osgood, 2012), and is viewed as a viable alternative to traditional crop insurance in countries with rainfed agriculture (Leblois and Quirion, 2013), although its full adoption to the benefit of local farmers has still been slow so far (Leblois and Quirion, 2013). Also, combining dynamical and statistical modeling often increases the value of SCP information for decision makers and end-users, as it improves the skill of the forecast and can serve to downscale forecasts to relevant spatial scales. For example, working on the improvement of the Australian seasonal rainfall, Schepen et al. (2012) have confirmed that merging statistical and dynamical forecasts maximizes spatial and temporal coverage of skillfulness. This in turn enhances the value for end-users. However, Darbyshire et al. (2020) concluded that the practical value of SCP is still relatively low and inconsistent for seven Australian case studies and call for improvement of forecasts, in accordance with similar findings by Gunasekera (2018) in the energy sector, stressing the need to create more skillful seasonal forecasts. There are different ways to achieve improvements, like increasing the spatio-temporal resolution of seasonal forecast systems, correcting/adjusting the systematic errors (or bias) of such a system by downscaling or by constructing a statistical model linking a predictand and its predictors.

Oceans are major drivers of the climate system (Shukla, 1998) and the main source of seasonal predictability (Barnston, 1994 and references therein; Goddard et al., 2001; Shukla and Kinter, 2006) for temperature and precipitation. While dynamical prediction systems can accurately predict seasonal climate in the tropics due to the strong coupling between ocean and atmosphere, prediction skills of the SAT are limited in many terrestrial parts of the world (Figure 1), partly because dynamical forecast systems cannot fully reproduce the relevant atmospheric processes (Shukla, 1985; Branković et al., 1990; Livezey, 1990; Milton, 1990; Barnston, 1994; Sheffield et al., 2013a,b; Henderson et al., 2017). Also, many of the teleconnections between the tropics/sub-tropics and the mid-latitudes are not well-represented, with the exception of those arising from the El Niño–Southern Oscillation (ENSO). As monthly-scale SAT in the extra-tropics, including our target region, can be strongly affected by atmospheric teleconnections (e.g., Oettli et al., 2021), there is a potential gain in skill that could be realized by representing teleconnections between the tropics and the mid-latitudes through statistical modeling, by using sea-surface temperature (SST) as a predictor of the surface air temperature (SAT).

FIGURE 1

Figure 1. Anomaly correlation coefficient between the 2-month lead SINTEX-F2 ensemble mean 2-meter temperature and the GHCN CAMS 2-meter temperature (Fan and van den Dool, 2008) for the period 1983–2015 (396 months).

The conventional approach involves multivariate linear statistical models (Barnston, 1994; Drosdowsky and Chambers, 2001). While these models are easy to construct, they are rather limited because they miss some of the important non-linear links between SST and SAT. Machine learning (ML), a subset of artificial intelligence that improves algorithms automatically through experience (Mitchell, 1997; Hastie et al., 2009; James et al., 2013; Kuhn and Johnson, 2016), can capture such non-linearities through statistical modeling.

Artificial Neural Networks (ANN) and Support Vector Machines (SVM) have been successfully used to forecast SAT at different time scales (Cifuentes et al., 2020; Tran et al., 2021). The predictability comes from the temperatures only (e.g., Ustaoglu et al., 2008; Chattopadhyay et al., 2011), or in association with other atmospheric variables (e.g., Mori and Kanaoka, 2007; Smith et al., 2009), oceanic modes (Salcedo-Sanz et al., 2016), or SST indices (Ratnam et al., 2021b). But the skills of such forecast systems are highly variable (Cifuentes et al., 2020) and dependent on past observations. Could the skill of such models be improved by including oceanic conditions forecast by a dynamical model?

In the present study, we attempt to harness the skills from both dynamical and statistical modeling approaches by adopting a hybrid approach to SCP, which has been demonstrated to yield skillful seasonal precipitation forecasts in northwest America (Gibson et al., 2021). In this approach, a dynamical prediction system, SINTEX-F2 (Doi et al., 2016, 2017), is used to provide the SST anomaly field, from which potential predictors of the SAT are extracted. The extracted predictors are used to construct a statistical prediction model using the following nine ML algorithms: Artificial Neural Networks with single- (SLP) and multi-layer perceptrons (MLP); Support Vector Machines with linear (SVML) and radial kernels (SVMR); Random Forests (RF); relatively recent approaches based on the gradient boosting of decision trees, namely the extreme gradient boosting with linear (XGBL) and tree-based (XGBT) approaches, and CatBoost (CBST); and Bayesian additive regression trees (BART). These boosting techniques are largely used in classification problems (Hancock and Khoshgoftaar, 2020; Ibrahim et al., 2020; Xia et al., 2020; Zhang et al., 2020; Jabeur et al., 2021; Luo et al., 2021), but less so for regression problems. This study provides a good opportunity to evaluate the skills of boosting procedures in the prediction of SAT.

In this study, we focus on SAT because it is a crucial climatic factor in agriculture (because it conditions the crop growth and its physiology) and in the energy sector (because it has a large impact on the fuel consumption for heating and cooling systems), two give two examples. Thus, the seasonal prediction of SAT may help to mitigate impacts of climate variations by allowing to take appropriate action in advance.

Motivated by an ongoing project in the energy sector, the target lead time for our hybrid model is 2 months. Forecasts between monthly and seasonal time steps are important for mid-term planification, such as the water and energy management (Oludhe et al., 2013) or the indoor residual spraying for malaria transmission control and elimination (World Health Organization, 2015). Therefore, we use the 2-month lead SST forecasts from the dynamical model as input for the hybrid prediction system (HPS). Details of the HPS and associated data are presented in Section Materials and Methods, and then the hybrid system skills are assessed in Section Application to the Kantō Region, using the Kantō region (a geographical area of the main island of Japan, comprising the Kantō plain to the central eastern part and mountainous region to the North and the West; Figure 2A), as an example of application. Finally, results are discussed in Section Summary and Discussion.

FIGURE 2

Figure 2. (A) Location of the Kantō region (black box) in Japan. (B) Example of normalized information flow calculation, (C) grid points complying with the selection criterion (NIF ≥ 30%) for the global region, and (D) areas complying with the selection criteria (NIF ≥ 30%, surface area ≥ 300,000 km²) for the global region.

Materials and Methods

SINTEX-F2 Seasonal Prediction System

The SINTEX-F2 prediction model (Doi et al., 2016), and its upgraded version, the SINTEX-F2 3DVAR system (Doi et al., 2017) forecast monthly mean SST at various lead times.

Our goal is the prediction of the SAT anomaly in the Kantō region, 2 months in advance using the information coming from the SINTEX-F2 SST anomaly prediction fields. To achieve this purpose, the predicted SST anomalies at 2 months lead time are taken for all 24 members plus the ensemble mean (hereafter 2mlEMs). The monthly anomalies we use are calculated by subtracting the 1983–2015 monthly climatology and removing the linear trend calculated over the same time period.

The SINTEX-F2 prediction model has high skill in the prediction of the ENSO (Luo et al., 2005, 2008b; Jin et al., 2008; Doi et al., 2016), a fundamental requirement for any seasonal prediction system (Stockdale et al., 2011) to be considered skillful. SINTEX-F2 is also successful in predicting the Indian Ocean Dipole (Luo et al., 2007, 2008a; Doi et al., 2016) and the subtropical dipole modes (Yuan et al., 2014).

Surface Air Temperatures in the Kantō Region

From the Automated Meteorological Data Acquisition System (AMeDAS) network maintained by the Japanese Meteorological Agency (2018), monthly mean SAT from 102 stations are extracted to cover the Kantō region (138–141°E, 35–37°N, Figure 2A), for the period March 1983 (due to the 2 months difference)–July 2020. Except for the initial quality check performed at JMA, no further quality check or homogenization has been performed before the analyses. An SAT index for the Kantō region is calculated by arithmetically averaging the values of the 102 stations, without weighting by latitude, considering the rather small region used in this study. To comply with SINTEX-F2 forecasts, departures from the 1983 to 2015 monthly climatology are calculated, and the linear trend calculated over the same period is subsequently removed.

Cause-and-Effect Relationships

The natural way to find linear relationships between time series is the calculation of the correlation coefficient. Correlation, however, does not imply causation (Barnard, 1986) and is not sufficient to establish causality (Sugihara et al., 2012). Therefore, different statistical concepts have been developed to identify the mutual information (Shannon, 1948) between time series and the underlying cause-and-effect relationships between them. Popular statistical concepts include the Granger causality (Granger, 1969), the transfer entropy (Schreiber, 2000), the climate networks (Tsonis and Roebber, 2004), the convergent cross mapping (Sugihara et al., 2012), or the information flow (Liang and Kleeman, 2007a,b; Liang, 2008, 2014, 2016).

The information flow (IF) measures the rate of information flowing from X₂ to X₁ and the maximum likelihood estimator of the rate of IF (Liang, 2014) is:

T_{2 \to 1} = \frac{C_{11} C_{12} C_{2, d 1} - C_{12}^{2} C_{1, d 1}}{C_{11}^{2} C_{22} - C_{11} C_{12}^{2}}

with C_ij the sample covariance between X_i and X_j, and C_{i, dj} the sample covariance between X_i and a series derived from X_j using the Euler forward differencing scheme: $Ẋ_{j, n} = \frac{(X_{j, n + k} - X_{j, n})}{(k t)}$ , with k = 1 or k = 2, some integer defining the time lag of the Euler forward scheme (i, j = 1, 2), n∈N (the sample size), and Δt the time step (Liang, 2014, 2019).

The IF has been shown to be efficient for detecting causality in linear and non-linear systems (Stips et al., 2016). In order to assess the relative importance of an identified causality, a normalized version of the information flow (NIF) has been developed (Liang, 2015; Bai et al., 2018), as:

τ_{2 \to 1} = \frac{a b s (T_{2 \to 1})}{a b s (T_{2 \to 1}) + a b s (\frac{d H_{1}^{n o i s e}}{d t})}

with abs(T_{2 → 1}) the absolute IF and $| (\frac{d H_{1}^{n o i s e}}{d t}) |$ the rate of change of the stochastic effects, defined as (adapted from Bai et al., 2018):

\frac{d H_{1}^{n o i s e}}{d t} = \frac{- 1}{2} E (\frac{1}{ρ_{1}} \frac{\partial^{2} g_{11} ρ_{1}}{\partial x_{1}^{2}}) - \frac{1}{2} E (g_{11} \frac{\partial^{2} log ρ_{1}}{\partial x_{1}^{2}})

with E the mathematical expectation, g₁₁ the perturbation amplitudes of X₁ and ρ₁ the marginal probability density function of X₁ (Liang, 2019).

The normalization avoids the relative causality to become too small (Bai et al., 2018).

Design of the Hybrid Prediction System

Since the model should be evaluated against a data set independent of the one used to train the prediction system (Davis, 1976; Chelton, 1983; Dijkstra et al., 2019), the available data are divided into three subsets. The calibration subset covers the period March 1983 to February 2010 (for a total of 27 years), the validation subset covers the period March 2010 to February 2020 (10 years), and the evaluation subset covers March 2020 to August 2021 (1 year). While it not possible to calculate skill metrics on the evaluation subset, we consider it is a good way to see whether a totally independent subset (i.e., not used either during the training step or during the optimization step) is able to reproduce observed anomalies for different seasons. All subsets start from March because of the 2 months lead initialization of SINTEX-F2.

Defining the Potential Predictors

For each month, the NIF is used to quantify the flow of information from each grid point of each of the 25 SINTEX-F2 2mlEMs into any SAT time series of interest (see Figure 2B for an illustration with the SAT index of the Kantō region) and delineate areas of interest as potential sources of predictability, and thus as potential predictors to use in our hybrid prediction model (upper part of Figure 3). For each 2mlEMs, only grid points with an NIF significant at 99% (Liang, 2015) are kept, reducing the amount of data. Subsequently, grid points with an NIF larger or equal to 30% are kept as a first group of potential predictors (Figures 2C, 3, “GR-Z” block). A second class is then defined by aggregating adjacent retained grid points and calculating the convex hull (i.e., the smallest convex area containing them; Figure 2D), in order to determine regions that potentially offer predictability. Only areas with a surface area larger or equal to 300,000 km² are kept in the second group (Figure 3, “GR-Z” block). The extraction is done globally (Figure 3, “GR-Z” block), but also for the tropical region only (30°N−30°S), where the SINTEX-F2 has high skill (Doi et al., 2016).

FIGURE 3

Figure 3. Workflow of the hybrid prediction system (HPS). For each month, the general process consists of NIF calculation, followed, in order, by the selection of geographical region and zones (GR-Z), the selection of predictors (PredS), the calibration of models through different machine learning algorithms (Alg), and the optimal configuration (OC) ending to the creation of HPS. SLP and MLP are artificial neural network with single- and multi-layer perceptron, respectively; SMVL and SVMR are the support vector machine with linear and radial kernels, respectively; RF is the random forest; XGBL and XGBT are extreme gradient boosting with linear and tree-based approaches, respectively; CBST is CatBoost and BART is Bayesian additive regression trees.

Selecting the Predictors

A set of predictors is designed by performing a feature selection (Figure 3, “PredS” block). First, we test for multicollinearity to reduce the redundancy contained in the pool of potential predictors. The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each of the two variables with all the other variables and removes the variable with the largest mean absolute correlation. The cutoff is fixed at 0.8.

Second, the feature selection itself is performed by using the Boruta method (Kursa and Rudnicki, 2010; Kursa et al., 2010), which iteratively compares the importance of attributes with their shuffled versions of the original attribute (i.e., their “shadows”). At each iteration, the importance of original attributes is compared to their respective “shadow” versions. All original attributes that have smaller importance than the “shadow” are dropped. Shadows are re-created in each iteration. For this study, we fixed the maximum number of iterations at 10,000. A Random Forest algorithm (Breiman, 2001) is used to provide the variable importance measure necessary for the Boruta method to achieve the selection.

Finally, the variance inflation factor (VIF; Marquardt, 1970; Fox and Monette, 1992) is calculated for the remaining predictors and those with values larger or equal to 5 are kept, to create the final selection of predictors. All the three steps of the feature selection are performed over the calibration period and for the two classes of predictors previously defined (i.e., grid points and convex hull areas).

A second set of predictors (Figure 3, “PredS” block), including all the potential predictors without the prior selection, is also defined. Here, the idea is to let the machine learning (ML) algorithms manage the whole data and create an optimal model.

Performance Measurement

The accuracy of the predictions is assessed by calculating several metrics between the observations and the predictions.

• Anomaly Correlation Coefficient (ACC): While the ACC is sensitive to outliers, it remains a simple metric to evaluate the co-variability of two time series. A positive value (as close as possible to +1) indicates similar variability and is targeted;

A C C = \frac{\sum_{i = 1}^{n} (f_{i} - \bar{f}) (a_{i} - \bar{a})}{\sqrt{\sum_{i = 1}^{n} {(f_{i} - \bar{f})}^{2} \sum_{i = 1}^{n} {(a_{i} - \bar{a})}^{2}}}

• Root Mean Square Error (RMSE): A popular measurement of prediction model accuracy. RMSE should be as small as possible (with a perfect model, RMSE would be equal to 0);

R M S E = \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(f_{i} - a_{i})}^{2}}

• Mean Absolute Error (MAE; Willmott and Matsuura, 2005): Another measurement of model error;

M A E = \frac{1}{n} \sum_{i = 1}^{n} | f_{i} - a_{i} |

• Mean Absolute Scaled Error (MASE; Hyndman and Koehler, 2006): Calculated by scaling the error based on the in-sample MAE from the random walk forecast method. MASE is below 1 when the forecast is better than the average one-step naïve forecast. MASE has been shown (Franses, 2016) to be consistent with the standard statistical procedures for testing equal forecast accuracy (Diebold and Mariano, 1995);

M A S E = \frac{\frac{1}{n} \sum_{i} | e_{i} |}{\frac{1}{T - 1} \sum_{t = 2}^{T} | a_{t} - a_{t - 1} |}

• Linear Error in Probability Space (LEPS; Ward and Folland, 1991): Provides an equitable scoring system, but can tend to assign better score to poorer forecasts near the extremes (“bend back”). It is a negatively oriented score, where 0 is a perfect score;

L E P S = \frac{1}{n} \sum_{i = 1}^{n} | C D F_{a} (f_{i}) - C D F_{a} (a_{i}) |

• Skill Error: A positively oriented score that is considered an enhancement of the LEPS (Potts et al., 1996) because it solves the “bend back” problem. Because this score is not widely used, we show the SK together with the LEPS. SK is defined as

\begin{array}{l} S K = 3 (1 - | C D F_{a} (f_{i}) - C D F_{a} (a_{i}) | + C D F_{a}^{2} (f_{i}) - C D F_{a} (f_{i}) \\ + C D F_{a}^{2} (a_{i}) - C D F_{a} (a_{i})) - 1 \end{array}

with f the predicted values, a the actual values, overbar the mean of all samples of the variable, n the total number of samples and e is the forecast error (i.e., a–f). In the MASE equation, the denominator is the mean absolute error of 1-month-ahead forecasts from each data point in the sample (Hyndman, 2006) i.e., f_t = a_t−1, with T the training period and t a time step during T. In LEPS and SK equations, CDF_a is the cumulative density function of actual values.

Outcomes of the hybrid approach are compared to the SAT index for the Kantō region calculated from the SINTEX-F2 ensemble mean (i.e., the arithmetic average of the 24 members). The 2-month persistence is also considered as a one-period-ahead naïve forecast (i.e., the SAT anomalies taken 2 months before are considered as good predictors of the SAT anomalies 2 months after). Hybrid prediction systems must have better performance than this naïve forecast to show skills in predicting SAT in the Kantō region.

Machine Learning

To predict an SAT index from 2mlEMs, we need to map the routes connecting SAT and SST, to create a general rule, i.e., a statistical model able to accurately predict the SAT from the 2mlEMs. ML can efficiently describe underlying processes between input and output data, mapping the paths from the former to the latter, and is thus applied to our purpose.

Tuning parameters (or hyperparameters) of the algorithm are generated by a random search (Bergstra and Bengio, 2012), which generates 10 combinations of them. The details of the tuning parameters for each method are introduced below.

Artificial Neural Network

The architecture of the artificial neural network (ANN) is designed to mimic the human brain and how the neurons interact with each other, how they are activated and how they learn. ANN consists of an input layer, one or multiple hidden layers and an output layer. A hidden layer is constructed by interconnecting artificial neurons (or nodes) with a specific weight and threshold (which determine whether a node is activated or not). ANNs are often used for pattern recognition but are increasingly used in a broad range of applications, including ENSO forecasts (Ham et al., 2019; Yan et al., 2020).

We use a simple feed-forward multi-layer perceptron with a single hidden layer (i.e., single-layer perceptron) for its ease of use and robustness. The tuning parameters consist of the number of artificial neurons of the hidden layer and the weight regularization (via an exponentially decay to zero), which both to helps the optimization process and avoids over-fitting (Venables and Ripley, 2002). The internal number of iterations to hit the local minima, i.e., the improvement of accuracy through training, is set at 100.

In addition, we also use a multi-layer perceptron with three hidden layers, instead of one, and threshold activation, to alter the results of the ANN. The tuning parameters are the number of nodes in each hidden layer.

Random Forest

A random forest fits a number of decision tree classifiers (which learn simple decision rules inferred from the data features) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The tuning parameter is the number of variables randomly sampled as candidates at each split. The number of trees to grow is set to 500, to ensure that every input row gets predicted at least a few times.

Support Vector Machine

Support vector machine minimizes the coefficients (instead of the squared error in the case of linear regression) in order to reach an acceptable error of the model. A choice of different kernel functions (i.e., a pattern analysis to identify recurrent relationship in datasets) can be specified for the decision function (i.e., decide the acceptable error). Expecting the results being sensitive to the use of different kernels, we choose linear and radial kernels. The former is the dot product between two given observations while the latter creates complex regions within the feature space. The tuning parameter for the linear kernel is the cost regularization parameter, C, which controls the smoothness of the fitted function. In addition to the C parameter, the tuning parameter s (the inverse kernel width) is available in radial kernels.

Gradient Boosting

Gradient boosting (Figure 3, “Alg” block) is an ensemble learning technique to build a strong classifier from several weak classifiers (typically decision trees), through iteration (Breiman et al., 1984; Friedman, 1999a,b). Here we used the XGBoost implementation (Chen and Guestrin, 2016), which gives better performance and a better control of over-fitting issues. Two types of boosters are utilized to create the prediction system, a linear model, with squared loss as objective function, and a tree-based model.

The maximum number of boosting iterations (trees to be grown) is common to both boosters. The linear model has two other hyperparameters: the L1 regularization (lasso regression which adds “absolute value of magnitude” of coefficient as a penalty term to the loss function) and the L2 regularization (ridge regression which adds “squared magnitude” of coefficient as a penalty term to the loss function). In the case of the tree-based model, the maximum depth of a tree, the learning rate (scaling the contribution of each tree, to prevent overfitting by making the boosting process more conservative), the minimum loss reduction (to make a further partition on a leaf node of the tree), subsample ratio of columns when constructing each tree, the minimum sum of instance weight (to stop further partitioning of the tree), and the subsample percentage (random collection of the data instances to grow trees and prevent overfitting).

The CatBoost implementation (Dorogush et al., 2018; Prokhorenkova et al., 2019), a more recent gradient boosting of decision trees, is also used. It provides a different kind of boosting through a symmetric procedure. This approach imposes that all nodes at the same level test the same predictor with the same condition, allowing for a simple fitting scheme and efficiency on CPUs. Also, to avoid overfitting, the optimal solution is found by the regularization operated by the tree structure itself. In this algorithm, the number of trees, the depth of trees, the learning rate, the coefficient of the L2 regularization, the percentage of features to use at each iteration and the number of splits for numerical features are the tuning parameters.

Bayesian Additive Regression Trees

Inspired by the boosting approach i.e., summing the contribution of sequential weak learners to build a strong learner, Bayesian Additive Regression Trees (BART; Chipman et al., 2010) depends on an underlying Bayesian probability model (Figure 3, “Alg” block). The bartMachine implementation (Kapelner and Bleich, 2016) is used in our prediction framework.

The associated hyperparameters are the number of trees to be grown in the sum-of-trees model, the prior probability that E(Y |X) is contained in the interval (y_min, y_max) based on a normal distribution, the base and power hyperparameters in tree prior probability for whether a node is non-terminal or not, and the degrees of freedom for the inverse χ² prior.

Limitations

Machine learning approaches have recently gained large popularity in climate sciences (Monteleoni et al., 2013; Lakshmanan et al., 2015; Voyant et al., 2017; Chantry et al., 2021), but concerns and constraints still exist (Makridakis et al., 2018). First, it is hard to derive physical understanding through the use of ML, even though some attempts have been made to address this issue (McGovern et al., 2019). The computational complexity of the algorithms, particularly the most recent ones, and the need of lengthy batch training can be a limitation to the use of ML. Also, specially trained algorithms are still required to perform specific tasks. The risk of overfitting is also an important concern and must be kept in mind. Closely related, a balance between interpretability and accuracy must be found when using ML. Relationships between variables that do not exist in the training data are likely to be missed in an independent data set, creating a limitation in the use of ML. Finally, the length of the training data may be an issue, as small datasets may limit pattern recognition (Raudys and Jain, 1991; van der Ploeg et al., 2014).

Training Process

To address the issue of the small sample size of the calibration period, bootstrapping (Efron, 1979) is used during the calibration phase to estimate the extra-sample prediction error (Figure 3, “Alg” block). Bootstrapping involves the generation of random sampling with replacement to estimate the error. It also introduces a bias because the replacement introduces repetitions of the same observation. In order to reduce the bias, the 0.632 bootstrap estimator (Efron, 1983; Efron and Tibshirani, 1997) is used, because on average each bootstrap sample contains about two thirds of observations. The optimal regression model is selected as the one which minimizes the MASE, after intermediate models have been built by resampling the calibration datasets 500 times.

The validation subset is subsequently used to evaluate the performance of the model previously trained, as well as to construct optimized forecast systems. Finally, an evaluation subset is used to assess the skills of the HPS.

Optimizing the Time Series

For each month in our training data, nine (9) ML algorithms are used to construct prediction models from two (2) geographical regions (global and tropical), two (2) classes of zones (grid points and convex hull area), and two (2) sets of predictors (selection of features and all features). As a result, a total of seventy-two (72) prediction models (hereafter ML configurations) associating Geographical Region-Zone-Predictor Set-Algorithm (GR-Z-PredS-Alg) are constructed for the calibration period and evaluated over the validation period (Figure 3).

Using the same predictors, each prediction system can be viewed as an ensemble member with its own internal condition and uncertainties. The ensemble mean of the 72 prediction models is calculated by averaging the outcome, as it is usually, but not always, more skillful than using any of the 72 models individually (Fritsch et al., 2000; Hagedorn et al., 2005; Eade et al., 2014).

We also assume that an individual model can outperform all other models for a specific calendar month. Thus, from the monthly 72 ML configurations, an optimized prediction is also calculated, according to the monthly performance measured over the validation period (Figure 3, “OC” block). For each month, the ACC (hereafter ACC_crit) is calculated between the prediction of each model and the observed SAT anomalies, and the model with the largest positive ACC is kept. This is also done according to the smallest RMSE (hereafter RMSE_crit) and the smallest MASE (hereafter MASE_crit). As a result, three optimized predictions are constructed, one for each of those three methods.

Finally, the long-term and monthly linear trends based on the observed SAT index (1983–2015 period) are added back to the observed and forecasted SAT anomalies.

In the present prediction system, we introduce an original hybrid approach that combines dynamical and statistical (machine learning) methods, in a new way. The major characteristic of this hybrid system is that predictors for the statistical part are not taken from observation (with or without time lag), but are provided by the 2-month lead SST forecasts from the SINTEX-F2 seasonal prediction system. In this way the system benefits from the ability of the dynamical prediction system to predict SST anomalies 2 months in advance. The hybrid approach also aims at statistically inferring teleconnections between remote SST anomalies and mid-latitudes, which are generally misrepresented in the dynamical prediction system.

In the case study below, all calculations have been performed on a personal computer with 8 cores [Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50 GHz] using the R programming language (R Core Team, 2019) and packages “caret” (Kuhn, 2020), “neuralnet” (Fritsch et al., 2019), “randomForest” (Liaw and Wiener, 2002), “RSNNS” (Bergmeir and Benítez, 2012), “kernlab” (Karatzoglou et al., 2004), “catboost” (Dorogush et al., 2018), “bartMachine” (Kapelner and Bleich, 2016), and “xgboost” (Chen et al., 2020) and required about 120 h to complete.

Application to the Kantō Region

The Kantō region is a highly populated area (43 million in 2015; Ministry of Internal Affairs and Communications, 2020) with high energy demand, particularly during boreal summer and winter. In summer, the heatstroke death rate increases with higher temperature and humidity (Akihiko et al., 2014) in the Kantō region. Akihiko et al. (2014) suggest that early warning systems based on SCP can be developed to reduce the risk of heatstroke, by forecasting the probability of heatwave occurrence a few months in advance.

At the same time, a significant part of the electricity demand in the region is met by thermal power plants, requiring good planning for the fuel management and logistics sufficiently ahead of time. SCP could potentially help in the management of fuel and in the planning of the operation to reduce the cost of the operation. For example, the Northern Illinois University saved approximately $500,000 in natural gas purchase with the help of climate information and forecast tools developed by a faculty member and a group of undergraduate meteorology students (Changnon et al., 1999).

We apply our hybrid prediction system (Figure 3) to the Kantō region. To put it simply, for each month of the year, the NIF is first calculated between the 25 2mlEMs and the Kantō SAT index during the calibration period. Then predictors are extracted and used to construct the 72 GR-Z-PreS-Alg. Finally, monthly SAT anomalies for the Kantō region are calculated over the validation period, using the GR-Z-PreS-Alg applied on the information provided by the 2mlEMs.

Evaluation of the Training Stage

The performance of the 72 monthly models (GR-Z-PreS-Alg) is evaluated by the ACC and RMSE calculated during the resampling phase of the training stage, between observed and forecasted SAT anomalies (Figure 4). In terms of ACC, performances are very similar across calendar months, with most of the coefficients ranging between 0.25 and 0.90. Performances are more variable with RMSE, some trained models being less accurate in July and February with RMSE >1.0. But overall performances are similar between months.

FIGURE 4

Figure 4. Comparison of model performances during the training phase, in terms of anomaly correlation coefficient (ACC) and root mean square error (RMSE). Shapes correspond to the different combination geographical region/zone (Gr-Z). Colors are the ML algorithms (Alg) with a filling (contour) corresponding to the existence (absence) of feature selection. Vertical and horizontal segment are the dispersion of the performance across the samples.

Across months, XGBL (dark blue) and XGBT (orange) without selection features (open symbol) appear to perform less well-compared to SVML (soft blue), SVMR (lime green), and SLP (gray) with feature selection (closed symbol). Other models are distributed between these two groups of models, following a curved line (better models have both lower RMSE and higher ACC). MLP (black) is a noticeable exception with most of the time high ACC associated to higher RMSE.

However, these results are biased because the evaluation of models is done against the data used to calibrate those models, with a high risk of overfitting, making them less generalizable. In order to possibly overcome this issue, forecasts from all the available monthly GR-Z-PreS-Alg are calculated and the monthly optimization of the time series is performed for the validation period.

Evaluation of the Optimizations

Over the validation period, monthly ACC are calculated between the output of each GR-Z-PreS-Alg and the observed SAT anomalies of the Kantō region (Figure 2A) over the validation period. It is clear from Figure 5 that some ML configurations performed better than others for the same month (some ACC are even negative). In January, the optimal GR-Z-PreS-Alg is the CaBoost (CBST) using all the selected areas of the global region (dark blue bar in the Jan/CBST panel, “Yes” line) with an ACC of 0.50. Other monthly optimal ML configurations can be found in Table 1. Interestingly, all ACC values are around 0.70 or above, ranging between 0.68 (in March) and 0.93 (in July), with the exception of January, February, and November in which a much smaller accuracy (ACC = 0.50, 0.44, and 0.48, respectively) of the optimal choice is obtained, implying that the hybrid model is struggling to accurately predict the SAT anomalies during these months. Another interesting result is the difference between the ACC of models calculated with the calibration subset (Figure 4) and the ACC shown in Figure 5. Support vector machines have good skills during the calibration step, but have generally poor performances with the validation subset, with the exception of boreal summer (SVML is selected as the best approach in August; Table 1). By contrast, gradient boosting approaches were not performing well with the bootstrapping validation, but are more skillful with the validation subset (XGBL/T are selected 6 months out of 12). Results also confirm that the search for an optimal GR-Z-PreS-Alg for each month is a useful approach.

FIGURE 5

Figure 5. Monthly anomaly correlation coefficients (ACC) between the observed SAT index anomalies in the Kantō region and the forecasted SAT index anomalies during the validation period (March 2010 to February 2020). Each row is the results for a machine learning (ML) algorithm for each month (column). “SLP” is the single-layer artificial neural network, “MLP” is the multi-layer (3 layers) version, “RF” is the random forest, “SVML” is the support vector machine with a linear kernel, “SVMR” is the version with a radial kernel, “BART” is the bartMachine implementation of the BART algorithm, “XGBL” is the linear XGBoost implementation of the gradient boosting and “XGBT” is the tree-based XGBoost implementation. ML are run with the selected predictors (“Yes”) or with all the predictors (“No”). The origin of the predictors is provided by the colors. Vertical black lines denote the best coefficient my month.

TABLE 1

Table 1. Optimal selection of monthly GR-Z-PredsS-Alg based on anomaly correlation coefficient criterion (ACC_crit).

We test an alternative approach for finding optimal configurations. In this approach, we use the RMSE_crit, rather than the ACC_crit, as our skill metric (Figure 6). For example, in January, the optimal combination is the linear gradient boosting using all the predictors (convex hull areas) of the global region (RMSE = 0.9164; dark blue bar in the Jan/XGBL panel, “No” line). Other combinations are summarized in Table 2. With this approach, March has the worst score (1.0952) while June has the best (0.4259). Similar to ACC_crit, SVML/R have poorer performances with the RMSE_crit compared to the calibration step, but are selected more often (three times). Again, XGB(L/T) are selected six times.

FIGURE 6

Figure 6. Same as Figure 5 but for the root mean square error (RMSE).

TABLE 2

Table 2. Same as Table 1 but from the root mean square error criterion (RMSE_crit).

The same method of optimized selection is applied with the MASE_crit as selection metric (Figure 7) and the final optimal combination for each month is summarized in Table 3. It is worth noting that with this criterion, most of the models are based on extreme gradient boosting (XGBL/T; 6 models) and support vector machine (SVML/R; 5 models). The exception is April with the single-layer artificial neural network.

FIGURE 7

Figure 7. Same as Figure 5 but for the mean absolute scaled error (MASE).

TABLE 3

Table 3. Same as Table 1 but from the mean absolute scaled error criterion (MASE_crit).

Together with the ensemble mean (Figure 8A; orange line), three optimized predictions of SAT index anomalies in the Kantō region are constructed, respectively, from the ACC_crit (bright yellow line), RMSE_crit (soft blue line), and MASE_crit (dark blue line). The observed SAT index is also shown for comparison (Figure 8A; black line). Overall interannual variability appears to be quite well-predicted by the various simulated SAT indices. In 2015–2016 or in 2018–2019, models are able to capture the seasonal variability. But sometimes, like in 2012–2013 or 2017 for example, all models have difficulties to capture correct values of SAT, particularly in boreal winter. The three optimized systems have anomaly correlation coefficients larger than 0.5 (Figure 9A), which is better than that of the SINTEX-F2 ensemble mean and the persistence (r < 0.2). Based on other accuracy metrics, the three optimized hybrid systems also systematically outperform the SINTEX-F2 ensemble mean and the 2-month lead persistence with smaller RMSE, MASE, MAE, and LEPS, and larger SK (Figure 9A).

FIGURE 8

Figure 8. (A) Observed (black line) and predicted SAT index anomalies in the Kantō region for the validation period. Time series of the prediction consist of the average of the models (orange line) and optimized time series according to the anomaly correlation coefficient (yellow line), the root mean square error (cyan line) and the mean absolute scaled error (blue line). Long-term and monthly linear trends are restored. (B) Same as (A) but for the evaluation period. (C) Monthly stratification of scatter plots of the observed anomalies (x-axis) vs. the predicted anomalies (y-axis). Black dots are the forecasts for the validation period and black triangles for the evaluation period. Rows correspond to the different types of prediction system and columns to the month.

FIGURE 9

Figure 9. Evaluation of the different prediction systems performance including the 2-month persistence (pink) and SINTEX-F2 (lime green). (A) Overall performance over the complete validation period, showing the anomaly correlation coefficient (ACC), root mean square error (RMSE), mean absolute scaled error (MASE), mean absolute error (MAE), linear error in probability space (LEPS), and skill error (SK). (B–F) Individual monthly performance metrics. Monthly performance according to (B) ACC, (C) RMSE, (D) MASE, (E) MAE, (F) LEPS, and (G) SK.

Interestingly, while all three optimized models perform equally well over the validation period (with the model based on ACC_crit slightly left behind), the average of the 72 models performs less well though still better than the SINTEX-F2 ensemble mean and the 2-month lead persistence (Figure 9A). The selective ensemble mean technique (Ratnam et al., 2021a) may help to improve the average of the 72 models by carefully only picking the models with enough skills, before calculating the average.

The monthly stratification reveals the abilities and limitations of the hybrid prediction systems, and their disparities. In late spring, summer, and fall, simulated SAT are close to the observed ones (i.e., along the 1:1 line in Figure 8B), indicating that the hybrid systems perform quite well. ACCs are close to 0.75 during these seasons for most of the optimized models (Figure 9B). This is confirmed by other metrics, RMSE (Figure 9C), MASE (Figure 9D), MAE (Figure 9E), and LEPS (Figure 9F) are very low compared to other months. Also, the SKs are maximum during the same months (Figure 9G). This clearly indicates that the hybrid systems perform better than the SINTEX-F2 ensemble mean as well as the persistence predictions during these months. In late fall, winter (with the exception of December) and early spring, on the other hand, performance as measured by the ACC and other metrics is not as high as in other seasons. Thus, it seems the hybrid models are less able to predict the sign of the interannual anomalies, as well as their intensity during the cold period. Nevertheless, even for the non-ACC metrics, the hybrid model still outperforms the SINTEX-F2 ensemble mean and the persistence. Along the year, the ensemble average of the hybrid models struggles to reproduce the intensity of the anomalies (“Avg.” line, Figure 8C), with a weak forecast amplitude most of the time. Individual hybrid systems have better results (but suffer from other problems, such as overestimating anomalies in February). It is also confirmed that SAT in February and November are not very well-simulated, with the lowest ACC (Figure 9B) among all months, and poor performance in other metrics as well (Figures 9C–G).

Summary and Discussion

We have presented a novel approach that combines dynamical forecasts with machine learning to predict SAT anomalies over the Kantō region, a part of the central region of Japan. In this approach, the role of the machine learning is to represent the influences of remote SST, which are not well-simulated in the dynamical model.

Results of this HPS are promising, particularly because they outperform both the 2-month lead persistence and the SINTEX-F2 forecasts of the SAT in the Kantō region, indicating the (linear and non-linear) teleconnections have been (partially) restored in this method. Results also suggest that the hybrid approach is able to improve prediction skill in mid-latitudes. The HPS was able to quite accurately forecast the SAT of the evaluation subset, particularly the rapid change of sign between March 2020 and July 2020 (Figure 8B). This is very encouraging, as the HPS seems to address the overfitting issue and to create a good compromise between interpretability and accuracy (Section Limitations). It also performed quite well in winter, but missed the correct sign of October 2020 and the intensity of August 2020 and March 2021 (while correct in sign).

Another interesting result is the share (about 50%) of boosting methods in the HPS, outperforming seminal methods such as artificial neural networks. It is also worth noting that the support vector machine algorithms are also picked up during the optimization phase. The remaining selected methods are distributed random forest and CatBoost. The Bayesian approach is never picked up as the best solution (at least in our study design).

Some previous studies have already discussed the prediction of SAT in Japan, either by selecting the best from the SINTEX-F2 predictions the best ensemble members with regard to the SAT (Ratnam et al., 2021a), or by using past observations of SST to predict winter SAT with the help of ANN (Ratnam et al., 2021b). While a direct comparison of the results is difficult due to the differing approach and study regions, the forecast skills in winter are quite comparable. We recognize a potential application could be to combine these prediction systems, for example by selecting the best systems (based on accuracy measurement) or by weighting the forecasts (Bates and Granger, 1969; Aiolfi and Timmermann, 2006; Aiolfi et al., 2010; Hsiao and Wan, 2014). This will be explored in a future study.

On the other hand, the results also show some limitations of this approach. Even though the HPS outperforms the dynamical forecast system, it can only reproduce around 35–40% of the variance of the Kantō SAT index between 2010 and 2020.

This suggests that model improvement may be obtained by including additional variables, such as the geopotential height, to take into account the atmospheric dynamics, which are likely to play a role in the SAT variability, as well as soil moisture, which is a key variable in the occurrence of temperature extremes (Seneviratne et al., 2010; Quesada et al., 2012). In some years, as seen in Figure 8A, the HPS could not correctly predict the SAT index in the Kantō region. Exploring the reasons of these forecast failures could help to improve the HPS by adding supplementary information, such as atmospheric processes. Adding more variables and introducing time lags will likely increase the amount of information to analyze, and using deep learning algorithms may help to improve the seasonal forecast by revealing unknown interactions among variables. We would also like to add that deep-learning could in fact be used as a tool for prediction (with feature selection to avoid redundancy in the information). But it could also be used as an analysis tool to document the relationships between variables, particularly through the use of layer-wise relevance propagation (LRP) or heat maps (Bach et al., 2015; Zhou et al., 2016; Montavon et al., 2019; Toms et al., 2020). LRPs quantifies the contributions of the predictors to the predictand, providing a physical interpretation.

The HPS approach is based on the strong assumptions that (1) the explanatory power of the predictors selected in each monthly model is robust in time i.e., we assume that the selection during the calibration of the models is valid for the validation period and, most importantly, for the future. (2) Although the model seems to be robust over different time periods, it relies on the accurate seasonal prediction of SST, particularly outside the tropical region where SINTEX-F2 has more limited skills.

ML algorithms used in this study have a relatively low risk of over-fitting, but because of the limited sample size for calibration, over-fitting cannot be totally avoided, which may explain the limited skills of the hybrid model in some months.

A possible drawback of the HPS is the limitation in deriving physical interpretation from the monthly models as by construction, potential sources of predictability from the global ocean are included, regardless of physical distance to the target region. Due this limitation, it is still difficult to derive physical mechanisms from ML (Section Limitations). Such physical understanding will eventually help to improve prediction skill and should be therefore addressed in future studies.

Data Availability Statement

The datasets presented in this article are not readily available because of confidentiality agreements. Supporting SINTEX-F2 model data can only be made available to bona fide researcher's subject to a non-disclosure agreement. Details of the data and how to request access are available from the authors. Monthly mean surface air temperatures are available by contacting the Japanese Meteorological Agency (JMA). The list of the stations used in this work is available from the authors. Requests to access the datasets should be directed to SINTEX-F2: Takeshi Doi, takeshi.doi@jamstec.go.jp; Monthly mean surface air temperatures: https://www.jma.go.jp/jma/indexe.html.

Author Contributions

HK, YT, and IH initiated the internal project. PO, MN, IR, and SB contributed to conception and design of the study. PO performed the statistical analysis and wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

Funding

This research was partly supported by JERA Co., Inc.

Conflict of Interest

HK, YT, and IH were employed by the company JERA Co., Inc. This study received funding from JERA Co., Inc. The funder had the following involvement with the study: decision to publish and preparation of the manuscript.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

The authors would like to thank Dr. Takeshi Doi (APL VAiG, JAMSTEC, Japan) for providing the SINTEX-F2 data. The authors are also grateful to Japanese Meteorological Agency for providing dataset.

References

Aiolfi, M., Capistrán, C., and Timmermann, A. (2010). Forecast Combinations. Rochester, NY: Social Science Research Network. doi: 10.2139/ssrn.1609530

ORIGINAL RESEARCH article

Combining Dynamical and Statistical Modeling to Improve the Prediction of Surface Air Temperatures 2 Months in Advance: A Hybrid Approach

Introduction

Materials and Methods

SINTEX-F2 Seasonal Prediction System

Surface Air Temperatures in the Kantō Region

Cause-and-Effect Relationships

Design of the Hybrid Prediction System

Defining the Potential Predictors

Selecting the Predictors

Performance Measurement

Machine Learning

Artificial Neural Network

Random Forest

Support Vector Machine

Gradient Boosting

Bayesian Additive Regression Trees

Limitations

Training Process

Optimizing the Time Series

Application to the Kantō Region

Evaluation of the Training Stage

Evaluation of the Optimizations

Summary and Discussion

Data Availability Statement

Author Contributions

Funding

Conflict of Interest

Publisher's Note

Acknowledgments

References

This article is part of the Research Topic

People also looked at