Evaluation of Tropical Cyclone Intensity Forecasts from Five Global Ensemble Prediction Systems During 2015-2019

This study presents an evaluation of tropical cyclone (TC) intensity forecasts from five global ensemble prediction systems (EPSs) over the western North Pacific during 2015-2019. Notable error features include an underestimation of TC intensity by the ensemble mean forecasts and under-dispersion of the probability forecasts. The root mean square errors of the ensemble mean forecasts (Brier scores of the probability forecasts) generally decrease year by year at long lead times during the five years, but fluctuate within a certain range at short lead times. Positive forecast skill relative to climatology forecasts appears in the most recent two years (2018-2019) at lead times of 120 h or longer. However, there is no obvious improvement in the intensity change forecasts during the 5-year period, and abrupt intensity change remains a major challenge. The probability forecasts show no skill for strong TCs at any lead time. Among the five EPSs, ECMWF-EPS ranks best for the intensity forecast and NCEP-GEFS ranks best for the intensity change forecast, according to the evaluation of the ensemble mean and dispersion. For the other probability forecast evaluations, ECMWF-EPS ranks best at lead times shorter than 72 h, while NCEP-GEFS ranks best thereafter.


INTRODUCTION
Tropical cyclones (TCs) are among the natural disasters with the most severe impacts on life and property. In recent years, economies in coastal areas have become increasingly prosperous, and the economic losses caused by TCs are rising rapidly (Chen et al. [1]; Wu et al. [3]; Yu and Chen [4]). The prediction of TC track and intensity has always drawn significant attention because of its importance in TC disaster preparedness and prevention. Over the past decades, the capability of TC track forecasting has improved significantly, while progress in the skill of intensity forecasts has been relatively slow (Alley et al. [5]; Chen et al. [6]).
The significant progress in TC track forecast capability is mainly attributable to the development of numerical weather prediction (NWP) techniques. However, current NWP models cannot accurately describe the core structures of TCs or the physical processes related to intensity changes, so intensity prediction is still dominated by statistical and statistical-dynamical models that use climatology and/or persistence factors as predictors (Zhang et al. [7]), including linear statistical models, nonlinear statistical prediction models, and statistical-dynamical models.
The accuracy of TC intensity prediction is closely related to the configuration of the NWP model (Aksoy et al. [8]; Zhang et al. [9]; Cavallo et al. [10]), the initial intensity (Emanuel and Zhang [11,12]; Bhatia et al. [13]; Qin et al. [14]), and various synoptic-scale environmental parameters (Yu et al. [15]). Compared with deterministic forecasts, ensemble forecasts contain multiple possible future developments, which may increase the predictability of the atmosphere and the reliability of the forecasts; this is of great significance for disaster prevention and reduction in operational prediction (Bauer et al. [16]; Titley et al. [17]). Ensemble prediction systems (EPSs) have recently been used more and more often in operational weather forecasting and warning practice, including TC track forecasting (Qian et al. [18]; Zhang and Yu [19]).
However, how the EPS can help improve TC intensity forecast skill remains a question that needs further study. To address this issue, verification of the TC intensity forecasts from the main EPSs has gradually been carried out in recent years. According to Yang et al. [20], the EPS of the European Centre for Medium-Range Weather Forecasts (ECMWF-EPS) generally underestimated TC intensity during 2009-2015, and the accuracy of its forecasts gradually decreases as TC intensity increases. Hodges et al. [21] found that the mean error (ME) over 0-7 days of the ensemble mean intensity forecast of the UK Met Office EPS (UKMO-EPS) is greater than that of the deterministic forecast and that the ensemble is severely underdispersive. Chen et al. [22] compared the forecast performance of the new and old versions of the National Centers for Environmental Prediction Global Ensemble Forecast System (NCEP-GEFS) and found that the new generation of the NCEP-GEFS has significantly higher forecast accuracy in both track and intensity predictions. Chen et al. [23] evaluated the TC intensity forecasts in 2016 from several EPSs, including the ECMWF-EPS, the Japan Meteorological Agency Weekly EPS (JMA-WEPS), the JMA Typhoon EPS (JMA-TEPS), the Meteorological Service of Canada Canadian Ensemble System (MSC-CENS), and the NCEP-GEFS, and found that the Brier score (BS) differences among the EPSs narrowed at longer lead times. Magnusson et al. [24] compared ECMWF's 18-km operational ensemble with a 5-km ensemble for intensity forecasts of Irma, and the results show that the forecast error of each member is significantly smaller at the higher horizontal resolution.
The studies listed above, and several others not mentioned here (Lei et al. [25]; Yu et al. [26]), focused on either one specific EPS or a single year of performance of several EPSs. Meanwhile, efforts to improve the EPSs at major forecasting centers have never stopped. This paper aims to present updated information on the performance of TC intensity forecasts from several major global EPSs for the most recent five years (2015-2019) in the western North Pacific (WNP) region. Special attention is paid to abrupt intensity change processes and very strong TCs, both of which have been recognized as major challenges in operational forecasting (DeMaria et al. [27]; Na et al. [28]). Section 2 introduces the five global EPSs being evaluated and the data sources. The performance of the EPSs is evaluated from the deterministic (ensemble mean) and probabilistic (ensemble spread) perspectives in Sections 3 and 4, respectively. Conclusions and discussion are presented in Section 5.

Basic information of the global EPSs
A total of five global EPSs are selected for this study: the ECMWF-EPS, the Japan Meteorological Agency Global Ensemble Prediction System (JMA-GEPS; JMA-WEPS for data before 2017, JMA-GEPS for 2017 onward), MSC-CENS, NCEP-GEFS, and UKMO-EPS. Basic information on these EPSs, including the major updates since 2015, is summarized in Table 1.
The horizontal resolution of the MSC-CENS was updated from 66 km to 50 km in 2015 and further to 39 km in 2018, while the number of vertical levels increased from 40 to 45 in 2018. It has 21 ensemble members, with initial perturbations generated by the ensemble Kalman filter (EnKF) (Whitaker et al. [35]) and model uncertainty represented by the SPPT-SKEB methods.
The NCEP-GEFS had no major updates during 2015-2019. It has 21 ensemble members, with a horizontal resolution of 34 km and 64 vertical levels. The initial perturbation method is the EnKF, and the model uncertainty method is stochastic total tendency perturbation (STTP) (Hou et al. [36]).
The UKMO-EPS has 36 ensemble members and, since 2015, a horizontal resolution of 50 km with 70 vertical levels (Walters et al. [37]); the horizontal resolution was changed to 30 km in 2017. The initial perturbation method is the ensemble transform Kalman filter (ETKF) (Bowler et al. [38]) and the model uncertainty method is SPPT-SKEB.

Data description
All the TCs reaching tropical depression (TD) intensity or above in the WNP region from 2015 to 2019 are considered in this study, with the intensity data downloaded from the best-track database of the China Meteorological Administration (CMA) (http://tcdata.typhoon.org.cn/zjljsjj_zlhq.html) (Ying et al. [39]; Lu et al. [40]). The TC intensity data include both the maximum surface wind speed (V max) and the minimum central pressure (P min). In most cases, the variation characteristics of V max and P min during 2015-2019 are essentially the same, and V max receives more attention than P min. Therefore, the evaluation of strong TCs and intensity change is carried out for V max, and the results for V max are presented unless stated otherwise.
TC intensity forecast data of the five global EPSs are obtained from the THORPEX Interactive Grand Global Ensemble (TIGGE) database (http://tparc.mri-jma.go.jp/cxmldata/dat/ensemble/). Forecasts with at least four ensemble members and corresponding best-track data are evaluated for lead times up to 168 h (7 days) at a 12-h interval. The total sample size of the five EPSs ranges between 1500 and 2500 for the 24-h V max forecasts in each year, decreasing with lead time to between 200 and 800 for the 168-h forecasts (Fig. 1a). The sample size differs among the EPSs, but for most of them it varies between 200 (50) and 600 (150) for the 24-h (168-h) forecasts each year (Fig. 1b).
A statistical probabilistic intensity forecast scheme, the probability climatology-based intensity forecast (PCIF), is selected as a reference to assess the skill of the EPSs relative to climatology-based forecasts. The PCIF scheme was proposed by Chen et al. [41] and is the probabilistic baseline used by the CMA to assess the skill of TC intensity probability forecasts. It operates by searching for analogous historical cases that meet criteria developed from climatological persistence predictors. The ensemble size of PCIF varies between 20 and 51, determined by the number of historical TCs meeting the PCIF requirements.
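The analog-search idea behind PCIF can be sketched as a simple filter over a historical record. The predictors, tolerances, and values below are purely hypothetical illustrations; the operational criteria of Chen et al. [41] are not reproduced here.

```python
def find_analogs(current, history, tolerances):
    """Select historical cases whose climatological-persistence
    predictors fall within the given tolerance of the current case.
    The Vmax values of the selected analogs would then form the
    PCIF 'ensemble', whose size varies with the number of matches."""
    analogs = []
    for case in history:
        if all(abs(case[k] - current[k]) <= tolerances[k] for k in tolerances):
            analogs.append(case)
    return analogs

# Hypothetical current state and matching criteria.
current = {"vmax": 30.0, "lat": 18.5, "dvmax_12h": 5.0}
tolerances = {"vmax": 5.0, "lat": 2.5, "dvmax_12h": 3.0}
```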

Verification methods
The ensemble mean refers to the equally weighted average of the ensemble member forecasts. We select the mean error (ME, i.e., bias) and the root mean square error (RMSE) to measure the difference between the ensemble mean forecasts and the observations, as defined in Equations (1) and (2). The skill score (SK) defined in Equation (3) represents the forecast skill of the ensemble mean of the EPSs relative to that of PCIF:

ME = (1/N) Σ (F_i - O_i) (1)
RMSE = [(1/N) Σ (F_i - O_i)^2]^(1/2) (2)
SK = 1 - RMSE/RMSEc (3)
where F represents the forecast value, O represents the observed value, i represents the ith sample, N represents the total number of samples, and RMSEc represents the RMSE of the PCIF ensemble mean.
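The three metrics above can be sketched as follows; the SK form assumes the conventional 1 - RMSE/RMSEc, parallel to the BSS in Equation (6). The V max values are hypothetical.

```python
import math

def mean_error(forecasts, observations):
    """Equation (1): mean bias of the forecasts against observations."""
    return sum(f - o for f, o in zip(forecasts, observations)) / len(forecasts)

def rmse(forecasts, observations):
    """Equation (2): root mean square error."""
    return math.sqrt(sum((f - o) ** 2
                         for f, o in zip(forecasts, observations)) / len(forecasts))

def skill_score(rmse_eps, rmse_climatology):
    """Equation (3): skill of the EPS ensemble mean relative to PCIF.
    Positive values mean the EPS beats the climatology reference."""
    return 1.0 - rmse_eps / rmse_climatology

# Hypothetical Vmax values (m/s): ensemble-mean forecasts vs best track.
obs = [25.0, 30.0, 42.0, 18.0]
fc = [20.0, 28.0, 35.0, 17.0]       # typical underestimation
me = mean_error(fc, obs)            # negative -> intensity underestimated
sk = skill_score(rmse(fc, obs), 5.0)  # vs an assumed PCIF RMSE of 5 m/s
```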
According to Table 1, each EPS has its own operational configuration, with different horizontal resolutions, vertical levels, and initial perturbation schemes. A weighted mean is therefore defined as follows to obtain the overall performance of all the EPSs:

W_E = Σ (N_i × E_i) / Σ N_i (4)
where i varies from 1 to 5, referring to the five EPSs, and N_i represents the sample size of the ith EPS. E_i represents a selected verification indicator, such as ME or RMSE, for the ith EPS. The weighted means of ME and RMSE are labeled W_ME and W_RMSE, respectively. When the TC intensity is greater than TD, some EPSs have no intensity forecast data while others do. We assume that the larger the sample size, the greater the contribution of an EPS to the TC intensity forecast; therefore, in the overall assessment, EPSs with larger sample sizes receive proportionally larger weights. During 2018-2019 (Fig. 2b), the difference between the largest and smallest MEs of the five EPSs is stable at different lead times, varying between 6 and 8 m s-1. The ECMWF-EPS outperforms the other four EPSs, with a mean underestimation of 5-6 m s-1 between 12 h and 84 h that improves quickly at later lead times to become an overestimation of 1 m s-1 at 168 h. The MEs of the ECMWF-EPS are approximately 2-5 m s-1 greater than the weighted mean MEs.
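The sample-size weighting of Equation (4) can be sketched as follows; the per-EPS error values and sample sizes are hypothetical.

```python
def weighted_mean(errors, sample_sizes):
    """Equation (4): sample-size-weighted mean of a verification
    indicator (e.g. ME or RMSE) over the five EPSs, so that EPSs
    with more verified forecasts contribute proportionally more."""
    total = sum(sample_sizes)
    return sum(n * e for n, e in zip(sample_sizes, errors)) / total

# Hypothetical per-EPS RMSEs (m/s) and sample sizes at one lead time.
rmses = [8.0, 10.0, 9.5, 9.0, 11.0]
sizes = [520, 300, 260, 410, 330]
w_rmse = weighted_mean(rmses, sizes)
```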

ME for intensity and intensity changes
Another notable feature in Fig. 2b is that all five EPSs underestimate the TC intensity at the initial time, with W_ME being -6.7 m s-1 during 2018-2019. Despite such a significant initial underestimation, the EPSs could still be helpful if they correctly predict the intensity change of a TC; in that case, forecasters can refer to the forecasts of intensity change instead of intensity itself. Thus, the MEs are also calculated for the intensity change forecasts. The intensity change is simply calculated by subtracting V max at 0 h from that at each lead time. From Fig. 2c, it can be seen that the W_MEs for intensity change (ΔV max) remain almost stable at 24 h during the five years, varying between -3 and -1 m s-1. The W_MEs stand out in 2016, with a more severe underestimation at longer lead times. In the other four years, especially 2018 and 2019, the underestimation of intensity change weakens with lead time and turns into an overestimation at longer lead times. A closer look at the MEs of 2018-2019 (Fig. 2d) shows that the MSC-CENS has no bias for intensity change at 12 h, better than the other four EPSs, which all have negative biases; the largest negative bias at 12 h occurs for the ECMWF-EPS and NCEP-GEFS (-3.5 m s-1). For all five EPSs, the MEs for intensity change increase with lead time and become positive at a certain changing point. For example, the changing point of the ECMWF-EPS is 120 h: it underestimates the intensity change before that time but overestimates it afterward. The W_MEs increase from -2.0 to 6.9 m s-1 over the 168 h, with the changing point at around 72 h. The difference between the largest and smallest MEs grows with lead time from 12 h (3.5 m s-1) to 72 h (7 m s-1), then stabilizes around 7-8 m s-1 after 72 h. During 2018-2019 (Fig.
4b), the ECMWF-EPS has the best performance at all lead times except 168 h, with RMSEs varying between 7 and 12 m s-1. The difference between the largest and smallest RMSEs of the five EPSs decreases with lead time from 10 m s-1 to 5 m s-1. The UKMO-EPS is the one with RMSEs closest to the weighted mean values. The skill relative to the climatology forecast (Fig. 5a) shows significant negative skill for all five EPSs at short lead times, turning positive at longer lead times; the weighted mean skill changes from negative to positive at 120 h. The ECMWF-EPS has the highest skill among the five EPSs at lead times shorter than 72 h, after which the NCEP-GEFS is the most skillful. The SKs of the ECMWF-EPS and NCEP-GEFS change from negative to positive at 60 and 72 h, respectively.
As for the intensity change during 2015-2019, the W_RMSEs at 24 h are significantly smaller than those at 72, 120, and 168 h in all five years, fluctuating between 7 and 11 m s-1. The W_RMSEs at 72 h and beyond are close to each other, exceeding 12 m s-1 most of the time with a maximum of 19 m s-1, comparable to the W_RMSEs for V max (Fig. 4a). The NCEP-GEFS has the best performance for intensity change during 2018-2019 (Fig. 4d), except at the short lead times of 12 and 24 h, and its RMSEs increase quickly from 6 m s-1. The errors are further examined as a function of the observed intensity change (Fig. 6): the bias is generally weak (-3 to 1 m s-1) when the observed intensity change is between -5 and 5 m s-1, and the RMSEs for that group are less than 7.5 m s-1 at all lead times, smaller than those for larger intensity changes. The EPSs tend to underestimate the intensity change when the observed intensity change exceeds 5 m s-1, and this worsens as the intensity change increases.
If an increase of V max beyond 20 m s-1 over a 24-h period is defined as a rapid intensification (RI) case, the W_ME and W_RMSE of the intensity change are similar in magnitude at 0-24 h; for example, the W_ME of the 20-25 m s-1 band is about -21 m s-1, with a similar RMSE. For weakening TCs, there is also an overestimation of the intensity change values, but to a lesser extent than for the intensifying cases. For intensity change at different lead times, the difference in the W_MEs (W_RMSEs) between lead times is generally smaller than 5 (10) m s-1 when the observed intensity change is less than 20 m s-1. It grows to more than 10 (20) m s-1 when the observed intensity change is larger than 20 m s-1, and the error at longer lead times is smaller (figures not shown). This result is in accordance with the better performance for V max at longer lead times, as shown in Fig. 2b and Fig. 4b.
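The intensity change and the RI definition used above can be sketched as follows; the track values are hypothetical, and the 12-h sampling matches the TIGGE verification interval.

```python
RI_THRESHOLD = 20.0  # m/s increase of Vmax over 24 h, as defined above

def intensity_change(vmax_series, hours, interval=12):
    """ΔVmax between lead time `hours` and 0 h, for a Vmax series
    sampled every `interval` hours."""
    return vmax_series[hours // interval] - vmax_series[0]

def is_rapid_intensification(vmax_series):
    """True if the observed 0-24 h Vmax increase meets the RI threshold."""
    return intensity_change(vmax_series, 24) >= RI_THRESHOLD

# Hypothetical best-track Vmax (m/s) at 0, 12, 24, 36 h.
obs_track = [18.0, 28.0, 40.0, 45.0]
ri_case = is_rapid_intensification(obs_track)  # 40 - 18 = 22 m/s -> True
```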

Verification methods
Probabilistic forecast evaluations include the BS, the Brier skill score (BSS), the ratio of ensemble spread to RMSE (SPR/RMSE), the reliability diagram, and the relative operating characteristic (ROC) diagram (Nurmi [42]). This article evaluates the probability forecast for each category of TC listed in Table 2, in which V max follows the national standard of China ICS 07.060. Unless a certain category is specified, the evaluation result is the average over the six categories; for example, the final BS is the average of the BS values of the six levels of TD, TS, STS, TY, STY, and Super TY.
The BS is one of the most commonly used probabilistic forecast verification metrics, calculated as follows:

BS = (1/N) Σ (P_i - Q_i)^2 (5)

where N is the total number of ensemble members, P_i is the forecast probability of ensemble member i, and Q_i is the observed probability: when the event occurs, Q_i is set to 1, and when it does not occur, Q_i is set to 0. For example, in the BS evaluation of the TD level, if an EPS member forecasts the TD level at some lead time but the observation is not at the TD level, then P_i is 1 and Q_i is 0. The lower the BS, the more accurate the probability forecast (Wilks [43]). The BSS is commonly used to evaluate the skill of probabilistic forecasts relative to climatology forecasts:

BSS = 1 - BS/BSc (6)

where BSc represents the BS of the climatology forecast, which here is PCIF.
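The BS and BSS above, together with the ensemble spread used in the SPR/RMSE ratio listed among the metrics, can be sketched as follows; all input values are hypothetical.

```python
import math

def brier_score(probs, outcomes):
    """Equation (5): mean squared difference between forecast
    probability P_i and observed occurrence Q_i (1 if the event
    occurred, 0 otherwise)."""
    return sum((p - q) ** 2 for p, q in zip(probs, outcomes)) / len(probs)

def brier_skill_score(bs, bs_climatology):
    """Equation (6): BSS = 1 - BS/BSc. Positive values indicate more
    skill than the PCIF climatology reference."""
    return 1.0 - bs / bs_climatology

def ensemble_spread(members):
    """Ensemble spread (SPR): standard deviation of the members about
    the ensemble mean; SPR/RMSE < 1 indicates under-dispersion."""
    n = len(members)
    mean = sum(members) / n
    return math.sqrt(sum((f - mean) ** 2 for f in members) / n)

# Hypothetical occurrence forecasts of one intensity category and the
# matching 0/1 observations.
bs = brier_score([1.0, 0.0, 0.5, 0.5], [1, 0, 1, 0])   # 0.125
bss = brier_skill_score(bs, 0.25)                      # vs an assumed PCIF BS
spr = ensemble_spread([22.0, 24.0, 25.0, 27.0, 22.0])  # member Vmax in m/s
```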
The ensemble spread (SPR), as defined in Equation (7), describes the uncertainty in the forecasts. Its magnitude is comparable to the RMSE, and the ratio SPR/RMSE is usually used to measure whether the ensemble spread is reasonable (Ma et al. [44]). An ideal EPS is expected to have SPR/RMSE close to 1: if SPR/RMSE is smaller than 1, the ensemble forecast is underdispersive, and vice versa.

SPR = [(1/N) Σ (F_i - F_M)^2]^(1/2) (7)
where N is the total number of ensemble members, F_i is the forecast value of each member, and F_M is the ensemble mean forecast. Reliability diagrams measure the degree of agreement between forecast probability and observed frequency: the x-axis and y-axis of a reliability diagram represent the forecast probability and the observed frequency (%), respectively, and the forecast probability can be divided into multiple intervals between 0 and 1.
In this article, we calculate the observed frequency of samples in the forecast probability intervals [0-0.1, 0.1-0.2, ..., 0.9-1.0]. If the forecast probability is consistent with the observed frequency of the event, the points will lie along the 45° diagonal of the diagram, indicating that the probabilistic forecast system is reliable. The ROC diagram is an effective way to compare the skill of a probabilistic forecast with random forecasts. The abscissa and ordinate of the ROC diagram represent the false alarm rate and the hit rate of a forecast, respectively, calculated as in Equations (8) and (9):

hit rate = a / (a + c) (8)
false alarm rate = b / (b + d) (9)

where a is the number of hits, c is the number of misses, b is the number of false alarms, and d is the number of correct rejections. We set a series of probability thresholds, here (0, 0.1, 0.2, ..., 1.0): when the forecast probability is greater than a given threshold, the event is considered to be forecast to occur, otherwise not, and the hit rate and false alarm rate can then be calculated. Each threshold determines one point on the ROC diagram. If the verification result lies in the lower-left corner, the system fails to predict the event; if it lies on the 45° diagonal, no forecast skill is indicated; if it lies in the upper-left corner, all forecasts are hits with no false alarms, and the closer the result is to the upper-left corner, the higher the forecast skill. The area enclosed by the ROC curve and the lines y = 0 and x = 1, i.e., the area under the curve (AUC), can be used to represent the forecast skill; for a perfect forecast, the AUC equals 1.
The gaps between the five EPSs are largest at 0 h, implying a significant discrepancy in the initial fields of the different systems; the difference between the maximum and minimum BSs reaches more than 0.1.
Such a discrepancy decreases quickly with lead time, falling below 0.02 after 72 h. The ECMWF-EPS and NCEP-GEFS have the smallest BSs before and after 84 h, respectively, being 0.05-0.6 less than the W_BSs (Fig. 7b). Fig. 7c shows the skill of the probability forecasts relative to the climatology forecasts at different lead times during 2018-2019: the W_BSS is basically negative at lead times shorter than 120 h and becomes positive thereafter. The ECMWF-EPS has the largest BSSs at 0-72 h, and only small differences exist among the five EPSs for forecasts beyond 72 h. For Super TY (Fig. 8d), the gap between the five EPSs is small, and the JMA-GEPS has the smallest BS among them; for STY (Fig. 8c), the differences between the five EPSs are larger at longer lead times, and the JMA-GEPS and NCEP-GEFS have the best probability forecast accuracy at 0-84 h and 84-168 h, respectively. From the BSS in Fig. 9a, it can be seen that the W_BSSs at the different levels have an increasing trend at shorter lead times and are all smaller than 0 at 0-48 h. Although Super TY and STY have the highest probability forecast accuracy, the W_BSSs of TY, STY, and Super TY are negative at every lead time in 2018-2019, meaning that their forecast skill is inferior to the climatology forecast, especially for Super TY. The W_BSSs of TD (TS/STS) turn positive at 50 (120) h, indicating a skill more comparable to the climatology forecasts at longer lead times. The BSS gaps between the five EPSs at the Super TY level are small, except that the BSSs of the JMA-GEPS are larger than those of the other EPSs at 96-168 h (Fig. 9b). 120 h with a range of 0.02-0.17. As for the reliability in Fig. 10b, the observed frequency tends to be underestimated when the forecast probability is less than 20% and overestimated when the forecast probability is greater than 30%. The reliability increases with increasing lead time.
The reliability curve of the ECMWF-EPS is closer to perfect reliability than those of the other EPSs at 24 h (Fig. 10c), while no obvious difference appears at longer lead times (figures not shown). According to Fig. 10d, the W_AUCs are generally larger than 0.5, implying positive forecast skill compared with random forecasts. The W_AUCs increase from 0.61 to 0.71 during 12-132 h and remain stable at 132-168 h. The ECMWF-EPS has the largest AUCs at lead times shorter than 60 h, with a range of 0.66-0.81; at 60-132 h, the NCEP-GEFS is the most skillful, with a range of 0.67-0.75; and after 132 h, the JMA-GEPS performs best, with a range of 0.76-0.78.
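The reliability and ROC constructions described in the verification methods can be sketched as follows. This assumes the standard definitions hit rate = a/(a+c) and false alarm rate = b/(b+d), with d the correct rejections, and a trapezoidal AUC; the probabilities and outcomes are hypothetical.

```python
def reliability_points(probs, outcomes):
    """Observed event frequency in each forecast-probability bin
    [0-0.1, 0.1-0.2, ..., 0.9-1.0], for a reliability diagram."""
    pts = []
    for k in range(10):
        lo, hi = k / 10, (k + 1) / 10
        sel = [q for p, q in zip(probs, outcomes)
               if lo <= p < hi or (hi == 1.0 and p == 1.0)]
        if sel:
            pts.append(((lo + hi) / 2, sum(sel) / len(sel)))
    return pts

def roc_points(probs, outcomes, thresholds=None):
    """One (false alarm rate, hit rate) point per probability threshold."""
    if thresholds is None:
        thresholds = [i / 10 for i in range(11)]   # 0, 0.1, ..., 1.0
    points = []
    for t in thresholds:
        a = sum(1 for p, q in zip(probs, outcomes) if p > t and q == 1)   # hits
        c = sum(1 for p, q in zip(probs, outcomes) if p <= t and q == 1)  # misses
        b = sum(1 for p, q in zip(probs, outcomes) if p > t and q == 0)   # false alarms
        d = sum(1 for p, q in zip(probs, outcomes) if p <= t and q == 0)  # correct rejections
        hit = a / (a + c) if a + c else 0.0
        far = b / (b + d) if b + d else 0.0
        points.append((far, hit))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule:
    0.5 means no skill over a random forecast, 1.0 a perfect forecast."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

For a perfectly separated sample (all event probabilities above all non-event probabilities), this AUC evaluates to 1, matching the perfect-forecast limit described above.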

CONCLUSIONS AND DISCUSSION
In this study, the intensity ensemble forecasts from five operational global EPSs (the ECMWF-EPS, JMA-GEPS, MSC-CENS, NCEP-GEFS, and UKMO-EPS) were evaluated over the WNP for a 5-year period (2015-2019), with the intention of presenting the change characteristics of the weighted mean of the five EPSs over the five years at different lead times, as well as the current performance of the five EPSs based on the samples of the last two years (2018-2019) at 0-168 h lead times. The evaluation involved verification of both the ensemble mean and the probability forecasts; the performance of the ensemble mean for abrupt intensity changes and of the probability forecasts for strong TCs was also assessed. The conclusions and discussion are outlined below:
(1) For the TC intensity forecasts, the weighted mean errors of the ensemble mean and probability forecasts, as represented by the RMSE and BS, decrease year by year at long lead times during 2015-2019, but fluctuate within a certain range at short lead times. The TC intensity tends to be underestimated by the ensemble mean forecasts, and the ensembles are severely underdispersive; the degrees of underestimation and under-dispersion decrease significantly as the lead time increases. In terms of ensemble mean and dispersion, the ECMWF-EPS behaves better than the other EPSs, while for the other probability forecast evaluations, the ECMWF-EPS ranks best at lead times shorter than 72 h and the NCEP-GEFS ranks best thereafter.
(2) The EPSs have poor forecast skill at short lead times but become more comparable to the climatology forecasts at longer lead times, with the forecast skill of both the ensemble mean and the probability forecasts becoming positive at 120 h and beyond in 2018-2019. This shows that the EPSs already provide useful reference for operational forecasts at longer lead times and are likely to contribute further to the improvement of TC intensity forecast capabilities in the future.
(3) As for intensity change, no obvious improvement appears over the five years. As the lead time increases, the ME changes from underestimation to overestimation, and the RMSE first increases and then remains stable in 2018-2019; positive forecast skill appears only at shorter lead times. The ME and RMSE are similar in magnitude for abrupt intensity changes, especially at 24 h, and the errors for RI show that the ensemble mean cannot provide an accurate forecast of the occurrence of RI. The NCEP-GEFS has the best intensity change forecast performance among the five EPSs in 2018-2019.
(4) TCs at the STY and Super TY levels have the highest probability forecast accuracy but negative forecast skill relative to the climatology forecast at every lead time, indicating that strong TCs remain a big challenge for EPS forecasting.
As indicated in the introduction, such evaluations can help forecasters make better use of the EPSs. Encouraging improvements in TC intensity forecast performance have been obtained for the EPSs during the past few years, which may be attributed to the development of high-resolution EPSs (Magnusson et al. [24]; Zhang [45]). However, the performance for abrupt intensity changes and strong TCs is not yet satisfactory. In this article, only a simple preliminary assessment of intensity changes has been carried out; a classification of intensity changes and a further examination of abrupt intensity changes in the probability forecasts are needed, and an analysis of the influence of model updates on the intensity forecasts is also desirable.
The evaluation results show that obvious systematic deviations exist in the intensity forecasts of the current EPSs, so post-processing corrections could be applied to the ensemble forecasts to improve TC intensity forecast capability. In future work, the correlations between the ensemble mean error and the characteristics of the TC itself, the large-scale environmental field, and the underlying surface will be analyzed. Furthermore, statistical correction schemes for TC intensity ensemble forecasts will be investigated to further improve TC intensity forecast skill.