Statistical seasonal forecasting of tropical cyclones over the western North Pacific

Forecasting tropical cyclone (TC) activities has been a topic of great interest and research. Many studies and existing seasonal forecasting models have examined and predicted the number of TCs (including geneses and landfalls) mainly based on the environmental factors in the peak TC season. However, these predictions can be time-consuming, computationally expensive and uncertain, depending on the efficiency and predictability of the dynamical models. Therefore, here we propose an effective statistical seasonal forecasting model, namely the Sun Yat-sen University (SYSU) Model, for predicting the number of TCs (intensity at tropical storm or above) over the western North Pacific based on the environmental factors in the preseason. The nine categories comprising 103 candidate predictors in 1980–2015 (36 years) are systematically investigated. The best subset selection regression shows that the sea surface temperatures at the tropical North Atlantic and eastern North Pacific in April, the 500 hPa geopotential height difference between April and January at the open ocean southwest of Australia and the 700 hPa geopotential height at the North Pacific in April are the most significant predictors. The correlation coefficient between the modeled results and observations reaches 0.89. The model is successfully validated by leave-one-out, nine-fold cross-validations, and later 5 year (2016–2020) observations. The prediction of the SYSU Model exhibits a 95% hit rate in 1980–2020 (39 out of 41), suggesting an operational potential in the seasonal forecasting of TCs over the western North Pacific.


Introduction
A tropical cyclone (TC) is one of the most destructive natural phenomena in the world. It can lead to heavy precipitation, flooding, strong winds and storm surges to the coastal regions , causing great economic and human losses. On average, about 80 TCs form globally each year. Among the basins, the western North Pacific is the region with the most active tropical cyclogenesis. About one third of global TCs, 28 on annual average, form over the western North Pacific (Gray 1968), which could make landfall on Southeast Asia as well as East Asia. East Asia, including China, Korea, Japan, etc, accommodates around 25% of the world's total population with a large portion dwelling in the coastal regions. Thus, accurate seasonal forecasting on predicting the TC number in the western North Pacific is enlightening. Camargo et al (2007), Zhan et al (2012) and Klotzbach et al (2019) reviewed all the existing seasonal forecasting models around the world. In general, there are three types of forecasting: statistical, dynamical and hybrid statistical-dynamical. Each has its advantages and disadvantages. Dynamical and hybrid forecastings comprise full model physics and a series of parameterizations so that they can provide physical reasoning. However, as a matter of fact, the model predictability decreases with simulation time because of the butterfly effect. Small changes in initial conditions, physics schemes and parameterizations can lead to large-scale and unpredictable variations in the future state of the system. Thus, the prediction performance highly depends on both the data and model predictability. Model ensembles are therefore often adopted to minimize the uncertainties. However, doing so would be computationally expensive and time-consuming. This is why statistical forecasting is still competitive nowadays although its physical reasoning is relatively weak.
Nonetheless, most of the existing statistical models for the seasonal forecasting of TCs over the western North Pacific are insufficient. They use many predictors to build their models, which could cause model overfitting and complexity (Chan et al 1998, Fan 2007a, Xia et al 2014. Moreover, many forecasts are closed-source, not documented, and/or for internal use only (Klotzbach et al 2019). Most of them do not provide model verifications or evaluations. In addition, it is noted that most of the existing dynamical or hybrid models make use of the environmental factors in the peak TC season to make predictions (e.g. Au-Yeung and Chan 2012, Zhang et al 2017a, Camp et al 2019, while very few achieve this using the pre-season environmental factors (Chan et al 1998). A study by Tian and Fan (2019) utilized predictors from the preceding year (e.g. the preceding boreal summer SST) and adopted a year-to-year increment method to predict the number of landfalling TCs on China during June to August. In practice, some of the pre-season environmental factors could have a 'memory effect' , suggesting that their seasonal or annual signals are relatively sustainable and transmittable through certain time periods. However, some of them may not be persistent (see figure 2), but they can store their signals in the ocean via air-sea interaction. Compared to the atmosphere, the memory effect of the ocean is more stable. Thereby, through selecting the particular factors in the preseason, it is plausible to predict the number of TCs in the upcoming TC season.
Hence, this study aims to propose an effective (no more than four predictors), open-source (free public disclosure, i.e. this paper) and competitive (at least 80% hit rate) operational statistical seasonal forecasting model that can predict the number of TCs over the western North Pacific in the upcoming TC season using pre-season environmental factors. The forecast is targeted to be issued by every middle of May. Section 2 describes the data and methodology. Section 3 introduces the statistical seasonal forecasting model with model validation. Section 4 summarizes the study with discussion.

Data and methodology
The best-track data from the Japan Meteorological Agency (JMA), China Meteorological Administration (CMA) and Joint Typhoon Warning Center (JTWC) are adopted. The TCs reaching tropical storm (TS) intensity or above (maximum sustained wind speed ⩾34 kt) over the western North Pacific in 1980-2020 (41 years) are counted and mainly examined in this study. As the recorded number of TSs (NTS) is similar among the three sets of best-track data (figure 1), and the JMA is the World Meteorological Organization's official Regional Specialized Meteorological Center for the western North Pacific basin, the JMA besttrack data are used for the major discussion hereafter. On a 41 year average , there are around 25-26 TSs appearing over the western North Pacific per year. The corresponding standard deviation is about 4.2. A slight but insignificant decrease in the trend of NTS is found (−0.78 decade −1 , p = 0.17). It is noted that the number of typhoons (NTY; TCs reaching typhoon intensity or above (maximum sustained wind speed ⩾64 kt)) is not studied here because the benchmarks are divergent. They vary significantly among the three sets of best-track data. Although the absolute diversity of NTY among them seems not apparent, the proportion of diversity in NTY is obviously higher than that in NTS (figure 1).
As the forecast is targeted to be issued by every middle of May, the candidate predictors before May, that is, those from December of the preceding year to April of the following year are mainly examined. Based on previous studies, correlation analysis and data availability, nine categories, comprising 103 candidate predictors in total are selected in this study. The categories that have been suggested to be conducive to tropical cyclogenesis are put into a pool of predictors. For instance, the SST indices (Xiao and Li 2007, Zhou andCui 2011, Zhan et (Fan 2007b), ocean heat content (Tu et al 2011), stratosphere (Chan 1995) and atmospheric chemistry (Takahashi et al 2017). Table 1 summarizes all the categories, predictors and their corresponding data. In order to achieve the objective of this study, the data should cover from 1980 to the present and should be operationally ongoing. The data descriptions are tabulated in table 2. The monthly fifth generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalysis of the global climate (ERA5) is primarily used in this study, including direct data retrievals and indices calculations. The predictors reaching correlation coefficients >0.32 with the NTS (p < 0.05) are chosen as the candidate predictors (figure 2).
The best subset selection regression is then adopted to build the model in this study. In this approach, the selection procedure is first performed. Based on the sequential replacement method, the predictors are sequentially replaced. Replacements that improve the performance are retained. Next, all combinations of predictors are examined to obtain the best combination (highest adjusted R squared). In order to reduce the instability brought by different categories of predictors, such as solar activity and atmospheric chemistry, we iterate the first two steps nine times by adding one category to the consideration each time so that more predictors would be considered. Last, we get 90 regression models (nine considerations and ten models each). After screening out the models having high collinearity or high squares of residuals in 2016-2020, the model that has the best prediction competence (highest R squared) is selected as the best subset.
Standardization of the predictors is applied before the regression analysis so that all the predictors are in the form of indices. The data before 2016 are used for model training. In order to further justify the model skills, the leave-one-out, nine-fold cross-validations and the recent 5 year (2016-2020) observations are performed for the model validation.

Statistical seasonal forecasting model
The correlation matrices between the 103 candidate predictors and the predictand NTS are shown in figure 2. Notably, about 60% of the predictors in the pool reached the peak of absolute correlation with the NTS in April. This makes sense because April is the month closest to the TC season among the five pre-season months (December of the preceding year to April of the following year) so that the signals acquired in April should have more influence on the later months. Meanwhile, some factors have high correlations over a long period of time, for example, SST, meridional modes and ocean heat content. As will be shown, although some candidate predictors have high correlations with the NTS independently, they are not selected as our final predictors because of their high mutual correlations and collinearity.
The best subset selection regression suggests that the SSTs at the tropical North Atlantic ( To avoid the uncertainties or arbitrary conclusions given by a single data source, another set of mainstream reanalysis data (the Climate Forecast System Reanalysis (CFSR) and the Extended Reconstructed Sea Surface Temperature version 5 (ERSSTv5) from the National Oceanic and Atmospheric Administration (NOAA), USA) are employed. Table 3 shows that the mutual correlations between the four selected predictors and the NTS are comparable between the two, ECMWF and NOAA, datasets. Although Z700 NP,4 and Z500 SWA,4-1 have a moderate correlation, the variance inflation factor values of SST TNA,4 , SST ENP,4 , Z700 NP,4 and Z500 SWA,4-1 are 1.11, 1.12, 1.15 and 1.15, respectively. Since they are all below 2, this means the collinearity between the four predictors is low.
The best subset selection regression model based on the ECMWF data (hereafter Model E), NTS E = 25.50 − 1.89SST TNA,4 − 1.89Z500 SWA, 4−1 − 1.33Z700 NP, 4 + 1.02SST ENP,4 explains 83% of variance. The correlation coefficient reaches 0.91. The root-mean-square error (RMSE) and mean absolute error is 1.77 and 1.40, respectively. explains 79% of variance. The correlation coefficient reaches 0.89. The RMSE and mean absolute error is 1.98 and 1.59, respectively. It is noted that the regression coefficients of the predictors among the two models are highly similar and all pass the significance test (p < 0.01). In addition, the weak mutual correlations among the predictors imply that the predictors are mostly independent of each other (table 3). These results suggest that both models are consistent with each other and consolidate the robustness of the predictors. The models are validated by leave-one-out and nine-fold cross-validations. The correlations between the observations and Model E and Model N in the leave-one-out cross-validations reach 0.89 and 0.85, respectively. Those in the nine-fold cross-validations reach 0.89 and 0.86, respectively. In addition, the models are further validated by the recent 5 year (2016-2020) observations. The corresponding absolute errors are ⩽3 in both models. All these findings suggest that the models are effective.
To bring the results from research into practice and application, a new model, namely the Sun Yatsen University (SYSU) Model is proposed. It is composed of Model E and Model N. The RMSEs of Model E (RMSE E ) and Model N (RMSE N ) are utilized to set the prediction range. The prediction range issued by the SYSU Model in a particular year i is given by, In plain text, it is the maximum range of NTS E,i ± RMSE E and NTS N,i ± RMSE N by rounding down the minimum and rounding up the maximum to the nearest integers. Figure 4 shows that the SYSU Model catches the highs and lows well. The range of the annual prediction (maximum prediction minus minimum prediction) in 1980-2020 is about 5-6 (the green shading shown in figure 4), which is operationally reasonable and meaningful. Most of the observations fall within the prediction. Once the observed NTS falls within the prediction range, we call the model hits. The hit rate is encouragingly promising since it reaches 95% in 1980-2020 (39 years of observations fall within the prediction range out of the total 41 years; figure 4).  Table 3. Mutual correlation coefficients among the predictors and predictand NTS using the ECMWF and NOAA data. Single and double asterisk(s) denote the statistical significance at the 95% and 99% confidence levels, respectively. ECMWF (1980ECMWF ( -2015 NOAA  SST

Summary and discussion
In this study, an effective, open-source and competitive statistical seasonal forecasting model for predicting the NTS over the western North Pacific is proposed, namely the SYSU Model. Different to most of the existing seasonal forecasting models that consider the peak-season environment, the pre-season environmental factors are utilized to build the model. Four predictors are optimally selected from the 103 candidate predictors using the best subset selection regression. They are the SSTs at the tropical North Atlantic (SST TNA,4 ) and eastern North Pacific (SST ENP,4 ) in April, the 500 hPa geopotential height difference between April and January in the open ocean southwest of Australia (Z500 SWA,4-1 ) and the 700 hPa geopotential height in the North Pacific in April (Z700 NP,4 ). The correlation reaches 0.89 above in 1980-2015. The regression model is successfully validated by the leave-one-out, nine-fold crossvalidations, and later 5 year (2016-2020) observations. The SYSU Model exhibits a 95% hit rate in 1980-2020, which is satisfactorily promising for a statistical model that is only composed of four predictors. It is effectively superior to many existing dynamical and statistical-dynamical models, in terms of time and computational power. These four predictors show notably high and significant (p < 0.05) correlations with some climatic signals. The contemporaneous correlation coefficients between SST TNA,4 and Atlantic Meridional Mode (AMM 4 ), SST ENP,4 and Pacific Meridional Mode (PMM 4 ), Z700 NP,4 and North Pacific Oscillation (NPO 4 ), and Z500 SWA,4-1 and the Southern Annular Mode difference between April and January (SAM 4-1 ) are 0.812, 0.832, −0.635 and 0.409, respectively. These suggest the SYSU Model is implicitly physical, and could possess relevant climate dynamics mechanisms.
The high correlations between the two oceanic predictors, SST TNA,4 and SST ENP,4 , and the meridional modes imply that the mechanisms of the meridional modes could be highly related to the TC activity in the western North Pacific. Zhang et al (2017b) found that the positive phase of AMM leads to a stronger descending branch in the tropical eastern and central Pacific, which intensifies the ascending branch of the Walker circulation in the western North Pacific. The stronger westerly in the upper troposphere and the stronger easterly in the lower troposphere associated with the enhanced Walker circulation result in stronger zonal shear, which suppresses the TC genesis. In addition, Wang et al (2019), Wu et al (2021) and Zhang et al (2020) investigated the roles of the PMM and the El Niño-Southern Oscillation (ENSO; in terms of the anomalous Walker circulation) in TC activity. They found that when the PMM is associated with the in-phase ENSO event, the positive relationship between the PMM and NTS becomes stronger.
Compared with the oceanic predictors, the mechanisms of the other two atmospheric predictors, Z700 NP,4 and Z500 SWA,4-1 , are relatively uncertain. Newman et al (2016) suggested that the NPO signal in the atmosphere can be stored as the oceanic Pacific decadal oscillation (PDO) signal which can last for a few months. When the Z700 NP,4 is positive, the PDO signal tends to be in negative phase (not shown). The negative PDO signal in the spring can be characterized by the horseshoe SST distribution in the North Pacific (anomalous cooling at high (∼50 • N) and low latitudes (∼10 • N) and anomalous warming at middle latitudes (20 • -30 • N)) (Chen et al 2015). This pre-season oceanic signal can last into the peak season and then initiate three climate correspondences. First, it enhances the Pacific Walker circulation, and thus, strengthens the zonal shear, which is detrimental to the TC genesis. Second, the anomalous easterly in the lower troposphere weakens the monsoon trough activity. Finally, the anomalously warm SST at middle latitudes corresponds to the stronger subtropical high, which also suppresses the TC activity. This agrees with Zhao et al (2018) that the negative phase of PDO can result in a decrease in TC activity in the western North Pacific. On the other hand,  proposed a mechanism without the oceanic correspondence. They suggested that when the NPO is in negative phase (i.e. positive Z700 NP,4 ), the midlatitude westerly jet stream is weaker, which leads to an anomalous cyclonic circulation in the tropical western North Pacific, and thus, strengthens the vertical wind shear and suppresses the TC activity.
Finally, the high correlation between the Z500 SWA,4-1 and SAM could suggest that the Southern Hemisphere Rossby wave at middle latitudes modulates the Madden-Julian Oscillation (MJO) activity with the upper-tropospheric zonal wind (Lin 2019), and consequently affects the TC activity (Nakazawa 1988, Liebmann et al 1994. However, it is uncertain how these signals can exist, propagate and/or transform in the atmosphere for a few months. Air-sea interaction should be involved in between. Meanwhile, Ho et al (2005),  and Choi et al (2010) gave disputing arguments about the relationship between the SAM and TC activity over the western North Pacific. Further study is urged on the linkage between the Z500 SWA,4-1 , SAM, Rossby wave and MJO.
No model is perfect. Although the physics and underlying mechanisms could not all be clearly corroborated, which is indeed an intrinsic deficiency of the statistical model, the performance of the SYSU Model demonstrates high potential in forecasting and merits consideration. The discussed mechanisms are the auxiliary, which suggests possible physical reasoning for the model. Further research is encouraged. Last but not least, in the near future the SYSU Model will be officially implemented and kept up-to-date, targeting the issuing of an open access forecast by every middle of May.
All data that support the findings of this study are included within the article (and any supplementary files).