HOW SKILLFUL ARE THE MULTIANNUAL FORECASTS OF ATLANTIC HURRICANE ACTIVITY?

AFFILIATIONS: Caron and Lledó: Barcelona Supercomputing Center, Barcelona, Spain; Hermanson: Met Office Hadley Centre, Exeter, United Kingdom; Dobbin and Imbers: Risk Management Solutions, London, United Kingdom; Vecchi: Princeton University, Princeton, New Jersey
CORRESPONDING AUTHOR: Louis-Philippe Caron, louis-philippe.caron@bsc.es


Despite high-profile storms like Sandy and Matthew, the 2012–2016 period has been relatively quiet in terms of hurricane activity over the northern Atlantic compared to the previous two decades. Because the available observational record of Atlantic hurricanes shows this basin alternating between decadelong periods of high and low activity since the end of the nineteenth century (Vecchi and Knutson 2011; Chenoweth and Divine 2014), this has led to some speculation as to whether we have entered a new prolonged period of low hurricane activity similar to what was observed from the early 1970s to the mid-1990s (Klotzbach et al. 2015). This question is of great interest, not only for the academic community, but also for other sectors, such as policy-makers, nongovernmental organizations (e.g., disaster relief agencies), and the insurance industry. For example, for the property and casualty (PC) insurance industry, which typically underwrites annual contracts that may have several automatic renewals, the quantification of hurricane risk on short to medium time scales is of great economic relevance.
In theory, multiannual forecast systems could be used to give the odds of basinwide hurricane activity remaining low for the foreseeable future. However, in comparison to seasonal hurricane forecasts, which originated in the mid-1980s (Gray 1984), the field of multiannual forecasting is very much in its infancy. Until recently, this type of long-term forecast was exclusively produced using statistical models, wherein hurricane activity is derived by first forecasting a subset of the climate conditions deemed to control hurricane activity (e.g., sea surface temperature over certain key regions) and then combining that forecast with a statistical model linking past climate conditions and past hurricane activity to produce a prediction of upcoming hurricane activity (Jewson et al. 2009). Statistical approaches of varying complexity have been adopted by the risk modeling industry (Bonazzi et al. 2014) because, up until very recently, no viable alternatives existed.
The advent of climate prediction (also referred to as decadal forecasting) (Doblas-Reyes et al. 2013; Meehl et al. 2014), wherein climate models are initialized with the contemporaneous states of the atmosphere, ocean, and sea ice, has allowed the development of similar multiannual forecasts based on climate model simulations. These climate simulations can be used either to replace the first step of a statistical forecast (Vecchi et al. 2013; Caron et al. 2014, 2015) (so-called hybrid forecasts) or to do without empirical models altogether (Smith et al. 2010; Hermanson et al. 2014) (so-called dynamical forecasts). The latter technique involves directly tracking tropical cyclone-like disturbances in climate output using an automated detection and tracking algorithm. These dynamical forecasts are the most demanding in terms of resources because they require an infrastructure in place to detect and track the storms (Ullrich and Zarzycki 2017) as well as high-frequency data, which, in a decadal forecasting context, can be prohibitive in terms of the amount of data storage required for such analysis. These restrictions also limit the possibilities for multimodel ensemble analyses.
By combining aspects of both dynamical and statistical forecasts, the hybrid forecast offers a compromise between the first two approaches. In such forecasts, the large-scale conditions expected to modulate hurricane activity are derived from climate model simulations, and hurricane activity is inferred using a statistical relationship between these large-scale fields and past hurricane activity. Although hurricane activity is implicit in this case, hybrid forecasts have the advantage of relying on large-scale features of the atmosphere-ocean system (usually large areas of sea surface temperature), which the climate models can be expected to be better at simulating and forecasting than smaller-scale features, such as hurricanes. Furthermore, such forecasts are usually computed using seasonal or yearly means, thus greatly reducing the amount of data required and, incidentally, making desirable multimodel analyses more affordable. Both the dynamical and the hybrid approaches are used regularly in the seasonal forecasting and climate communities in order to derive hurricane statistics from climate model simulations (Vecchi et al. 2011; Vitart et al. 2007; Camargo et al. 2007).
Two hybrid techniques have so far been investigated to forecast hurricane activity at the multiannual time scale. The first of these techniques relies on predicting the weighted difference in sea surface temperature (SST) of the tropical Atlantic with respect to that of the wider tropics (Vecchi et al. 2013; Caron et al. 2014). In this case, a relatively warm (cold) Atlantic with respect to the rest of the tropics will lead to higher (lower) hurricane activity due to more (less) conducive dynamic and thermodynamic conditions over the Atlantic. The second technique relies on forecasting a proxy index for the Atlantic multidecadal oscillation (AMO) (Klotzbach and Gray 2008; Caron et al. 2015), a slow oscillation in Atlantic SST that is thought to modulate hurricane activity at long time scales (Zhang and Delworth 2006; Knight et al. 2006; Goldenberg et al. 2001). A positive index is usually associated with increased hurricane activity.
Here, we present and compare the different approaches (statistical, hybrid, dynamical) currently available to provide multiyear forecasts of North Atlantic hurricane activity, starting with a short description of the different forecast systems. These systems are also summarized in Table 1.
FORECASTING SYSTEMS. Climate model data. All climate simulations used here are initialized using contemporaneous observations, thus aligning the simulated natural variability with the observed variability. External forcings (greenhouse gases, solar activity, stratospheric aerosols associated with volcanic eruptions, and anthropogenic aerosols) are taken from observations for start dates ranging from 1961 (first forecast period: 1961–65) to 2005 and from the representative concentration pathway (RCP) 4.5 scenario (Meinshausen et al. 2011) from 2006 to 2014 (last forecast period: 2010–14). Systematic climate model drift in these simulations is addressed by computing a lead-time-dependent climatology for each individual model by first averaging the predicted variable for all of its members across the start-date dimension and then subtracting that climatology from each hindcast to obtain the anomalies over the whole predicted period (García-Serrano and Doblas-Reyes 2012).
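As a rough illustration of the drift correction described above, the following Python sketch computes a lead-time-dependent climatology for one model and removes it from each hindcast. The nested-list layout `hindcasts[start][member][lead]` and the function name are assumptions made for this example and do not reflect the data format used in the study.

```python
from statistics import fmean

def drift_corrected_anomalies(hindcasts):
    """Remove a lead-time-dependent climatology from one model's hindcasts.

    hindcasts[s][m][l] is the predicted value for start date s, member m,
    and lead time l (hypothetical layout chosen for this sketch).
    """
    n_leads = len(hindcasts[0][0])
    # Climatology at each lead time: mean over all members and start dates.
    climatology = [
        fmean(member[lead] for start in hindcasts for member in start)
        for lead in range(n_leads)
    ]
    # Subtract the climatology of the matching lead time from every value.
    return [
        [[member[lead] - climatology[lead] for lead in range(n_leads)]
         for member in start]
        for start in hindcasts
    ]

# Two start dates, two members, two lead times (made-up numbers).
example = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
anomalies = drift_corrected_anomalies(example)
```

By construction, the anomalies at each lead time average to zero across members and start dates, so any systematic drift that grows with lead time is removed.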
Observational data. The hurricane time series used as reference is derived from the revised National Hurricane Center "best track" hurricane database (HURDAT2) (Landsea and Franklin 2013) and includes only hurricanes that formed between 5° and 25°N during the period 1961–2014 and survived at least 48 h at tropical storm strength (or above). The geographical limitation is introduced in order to allow for comparison with the dynamical forecasts, which limit tracking to that region. (Knight et al. 2014).
Initial conditions are generated every year between 1961 and 2010 by relaxing the coupled model to analyses of atmosphere and ocean following the anomaly initialization approach, except for 10 members of the CMIP5 ensemble, which rely on full field initialization.
The number of long-lived minima is then counted for each year, and the anomalies are subsequently computed by removing the mean and dividing by the standard deviation. To allow for comparison with observations, we then multiply the time series by the standard deviation of the observed time series. Variance adjustment is necessary to account for the much larger number of tropical disturbances detected by this technique compared to observations. The three model means are then averaged together and the variance is adjusted a second time. Additional information on this technique can be found in Smith et al. (2010), Dunstone et al. (2011), and Hermanson et al. (2014).

(Matei et al. 2012) (10 members)], for a total of four forecast systems. The systems were selected from a larger pool of available systems by choosing those with start dates available every year from 1961 to 2010. The multimodel ensemble-mean hurricane anomalies are computed by giving an equal weight to each model mean, regardless of the number of ensemble members available for a particular model, and the variance of the ensemble mean of both series of reforecasts has been adjusted to match that of the observed time series.
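The standardization and variance adjustment described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import fmean, pstdev

def variance_adjust(series, observed):
    """Standardize `series`, then rescale it to the observed standard deviation."""
    mean, std = fmean(series), pstdev(series)
    obs_std = pstdev(observed)
    return [(x - mean) / std * obs_std for x in series]

def multimodel_mean(model_means, observed):
    """Equal-weight average of variance-adjusted model means, adjusted a second time."""
    adjusted = [variance_adjust(m, observed) for m in model_means]
    combined = [fmean(values) for values in zip(*adjusted)]
    return variance_adjust(combined, observed)

# Made-up yearly counts for two models and a made-up observed series.
observed = [10.0, 20.0, 30.0]
single = variance_adjust([1.0, 2.0, 3.0], observed)
combined = multimodel_mean([[1.0, 2.0, 3.0], [2.0, 4.0, 9.0]], observed)
```

The second adjustment matters because averaging several series reduces their variance, so the multimodel mean would otherwise have a smaller amplitude than the observations.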
Table 1. Summary of the forecast systems.
Technique | Type | Summary
Cyclone tracking | Dynamical | Hurricane numbers are obtained directly by tracking local minima in surface pressure over the tropical Atlantic.
Relative SST | Hybrid | Hurricane numbers are estimated through a statistical model with two predictors: mean Jun–Nov SST over 1) the tropical Atlantic and 2) the entire tropics. The parameters are derived from the sensitivity of a high-resolution atmospheric GCM to a number of SST perturbations.
AMO index | Hybrid | Hurricane activity is estimated using a climate index correlated with low-frequency North Atlantic hurricane variability. Positive (negative) index values are associated with conditions more (less) conducive to hurricane formation.
Statistical combination | Statistical | Basin hurricane numbers are estimated by combining six statistical models that either use 1) regression relationships between mean Jun and Nov SST over the MDR or between the MDR and tropical Pacific region or 2) averages of historical hurricane counts in the basin in active or inactive periods.

Hurricane numbers from relative sea surface temperature. With this technique (Vecchi et al. 2011), frequencies of North Atlantic hurricanes are estimated based on the weighted difference in sea surface temperature between the tropical Atlantic and the tropics at large. More specifically, the annual Atlantic hurricane frequency λ is derived from a statistical model formulated as a Poisson regression model with two predictors and is given by

$$\lambda = \exp\left(\beta_0 + \beta_1\,\mathrm{SST}_{\mathrm{Atl}} - \beta_2\,\mathrm{SST}_{\mathrm{Trop}}\right), \qquad (1)$$

where SST_Atl and SST_Trop are the mean SST anomalies over the tropical Atlantic (in the region 10°–25°N, 80°–20°W) and over the entire tropics (between 30°N and 30°S), respectively, during the period June–November, and β_0, β_1 > 0, and β_2 > 0 are the model parameters. In this model, an increase in SST over the main development region (MDR) leads to an increase in Atlantic hurricane numbers, while an increase in SST over the tropics at large leads to a decrease in hurricane activity. The parameters in Eq. (1) are derived from the sensitivity of the hurricane response to a number of SST perturbations in a high-resolution atmospheric GCM (Vecchi et al. 2011). To be commensurable with the other techniques, the variance of the ensemble-mean reforecasted time series is adjusted to that of the hurricane time series.
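To make the behavior of the two-predictor Poisson model concrete, here is a small Python sketch of the rate calculation. The function name is hypothetical and the coefficient values `B0`, `B1`, and `B2` are placeholders chosen for illustration; they are not the parameters fitted by Vecchi et al. (2011).

```python
import math

def hurricane_rate(sst_atl, sst_trop, b0, b1, b2):
    """Expected annual hurricane count from a log-linear (Poisson) model.

    sst_atl and sst_trop are Jun-Nov SST anomalies; b1 and b2 are taken as
    positive so that Atlantic warming raises the rate and tropics-wide
    warming lowers it.
    """
    return math.exp(b0 + b1 * sst_atl - b2 * sst_trop)

# Placeholder coefficients (assumed for illustration only).
B0, B1, B2 = 1.7, 1.4, 1.5

baseline = hurricane_rate(0.0, 0.0, B0, B1, B2)
warm_atlantic = hurricane_rate(0.5, 0.0, B0, B1, B2)
warm_tropics = hurricane_rate(0.0, 0.5, B0, B1, B2)
uniform_warming = hurricane_rate(0.5, 0.5, B0, B1, B2)
```

With these placeholder values, a warming confined to the Atlantic increases the expected count, whereas a uniform warming of the whole tropics leaves it nearly unchanged, which is the sense in which only the relative SST matters.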
AMO index. In this case, we make use of the relationship between Atlantic hurricane activity and the AMO, also referred to as the Atlantic multidecadal variability (AMV), at decadal time scales and estimate hurricane activity using an AMO-proxy index developed by Klotzbach and Gray (2008). The index is constructed using the difference in standardized SST anomalies over the North Atlantic subpolar gyre (50°–60°N, 50°–10°W) and the standardized mean sea level pressure anomalies over the tropical and extratropical Atlantic (0°–50°N, 70°–10°W). To translate the forecasted index values into hurricane anomalies, we adjust the variance of the reforecasted index time series to that of the hurricane time series. Additional information on this technique can be found in Camp and Caron (2017) and Caron et al. (2015).
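The construction of the proxy index and its conversion into hurricane anomalies can be sketched as follows. This is a simplified illustration under stated assumptions: the inputs are taken to be already area-averaged yearly series, the function names are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import fmean, pstdev

def standardize(series):
    """Remove the mean and divide by the standard deviation."""
    mean, std = fmean(series), pstdev(series)
    return [(x - mean) / std for x in series]

def amo_proxy_index(sst_subpolar, slp_atlantic):
    """Standardized subpolar-gyre SST minus standardized Atlantic sea level pressure."""
    return [s - p for s, p in zip(standardize(sst_subpolar), standardize(slp_atlantic))]

def index_to_hurricane_anomalies(index, observed_hurricanes):
    """Rescale the index so that its variance matches that of the hurricane series."""
    scale = pstdev(observed_hurricanes)
    return [z * scale for z in standardize(index)]

# Made-up area-averaged series: warm SST coincides with low pressure.
index = amo_proxy_index([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
anomalies = index_to_hurricane_anomalies(index, [4.0, 6.0, 8.0])
```

Because pressure enters with a negative sign, warm subpolar SST and low Atlantic pressure both push the index, and hence the inferred hurricane activity, upward.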
Combined statistical techniques. With this technique, a weighted combination of six statistical models is used to reforecast the number of hurricanes in the Atlantic basin. Model weights are based on the past performance of each model and evolve with each prediction year. Four of the statistical models use generalized linear regression models to determine the relationship between hurricane counts and either the MDR sea surface temperature or the difference in sea surface temperature between the main development region and the tropical Pacific region. Similar to the relative sea surface temperature hybrid method, a local increase in SST over the MDR leads to an increase in Atlantic hurricane numbers, while an increase in SST over the tropical Pacific leads to a decrease in hurricane activity.
The other two component models are averages of the past hurricane counts in the Atlantic basin in either active or inactive conditions, with the activity state being determined using a changepoint detection technique (Jewson et al. 2009). One model includes the probability of shifting from an active to inactive state or vice versa, while the other model does not. Basin hurricane count data from 1950 to the year prior to each forecast year are used to produce a given forecast. Because the basin record is considered incomplete before the 1940s and because we require at least 30 years of data for building a reliable regression model, reforecasts cannot be made prior to 1980 with this technique. Finally, the variance of the reforecasted time series is also adjusted to match that of the hurricane time series.

DETERMINISTIC FORECASTS.
Figure 1 shows that the systems capture the U shape in activity, reforecasting high activity in the 1960s (when forecasts are available), lower activity from the early 1970s to mid-1990s, and higher activity for the period that followed. There are large disagreements between the methods in the 1960s, which might be linked to the quantity and the quality of the ocean data used to initialize the climate model during those years. In terms of skill, the reforecasts generally return significant correlation coefficients for the linear (Pearson) correlation, but only the AMO index technique returns a significant ranked (Kendall) correlation coefficient. The AMO index technique also returns the smallest root-mean-square error (rmse), thus suggesting an overall edge for this particular approach over the others evaluated here.
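For readers who want to reproduce this kind of deterministic verification, the three measures can be computed as in the Python sketch below. It is a minimal illustration: Kendall's coefficient is implemented in its simplest form (tau-a, with no correction for ties), and the significance testing with an effective sample size used in the study is not included.

```python
from math import sqrt
from statistics import fmean

def pearson(x, y):
    """Linear (Pearson) correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a): concordant minus discordant pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

def rmse(forecast, observed):
    """Root-mean-square error of the forecast."""
    return sqrt(fmean((f - o) ** 2 for f, o in zip(forecast, observed)))

# Made-up forecast and observed anomalies.
forecast = [1.0, 2.0, 3.0, 4.0]
observed = [1.0, 3.0, 2.0, 4.0]
scores = (pearson(forecast, observed), kendall_tau(forecast, observed), rmse(forecast, observed))
```

The rank correlation only rewards getting the ordering of periods right, which is why a system can show a significant Pearson correlation without a significant Kendall correlation.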
A standard technique to evaluate the reliability of ensemble forecasts consists of comparing the rmse and the spread of the ensemble. In well-calibrated forecast systems, the rmse of the ensemble mean should match the average spread of that ensemble (Fortin et al. 2014); that is, the uncertainty of the forecast should be a good measure of the error of the predictions. Here, the average model spread is defined as the square root of the time-averaged variances and the ensemble-mean spread is the square root of the sum of the average model variances, weighted according to the number of members provided by each model. In this particular case, all three systems relying on climate models are overdispersive (underconfident): the uncertainty is significantly larger than the rmse (inset, Fig. 1). Furthermore, only one observation falls outside the prediction range with the tracking technique and none with the other two hybrid techniques (not shown). Underconfident systems will systematically give probabilities that are too low for any climate signal, thus reducing the odds that the necessary actions will be taken. It should be pointed out that the AMO index-based technique reduces the ensemble spread compared to the other techniques, both for the ensemble and for the individual models of the ensemble (not shown). In contrast, the spread of the statistical model is too small and underestimates the actual uncertainty. Such systems are said to be overconfident and underestimate the probability of extreme events.
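The spread-error comparison for a single model can be sketched as follows. This is an illustrative simplification: the function name and data layout are assumptions, the sample variance is an assumed estimator, and the member-weighted combination across models described in the text is not reproduced.

```python
from math import sqrt
from statistics import fmean, variance

def rmse_and_spread(ensembles, observations):
    """Return (rmse of the ensemble mean, average ensemble spread).

    ensembles[t] holds the member values for verification time t; the spread
    is the square root of the time-averaged ensemble variance.
    """
    means = [fmean(members) for members in ensembles]
    error = sqrt(fmean((m - o) ** 2 for m, o in zip(means, observations)))
    spread = sqrt(fmean(variance(members) for members in ensembles))
    return error, spread

# Made-up two-member ensembles whose means match the observations exactly.
error, spread = rmse_and_spread([[0.0, 2.0], [1.0, 3.0]], [1.0, 2.0])
overdispersive = spread > error
```

A spread larger than the rmse, as in this toy case, corresponds to the overdispersive (underconfident) behavior described above; a spread smaller than the rmse would indicate overconfidence.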
Forecasts can also be evaluated with respect to a baseline, which in this case is a cheaper and simpler forecast, such as climatology or 10-yr persistence. The skill score (SS) is given as $1 - (\mathrm{MAE}_{\mathrm{forecast}}/\mathrm{MAE}_{\mathrm{baseline}})$, where MAE is the mean absolute error. A skill score of 1 represents a perfect reforecast and a skill score of 0 (or lower) represents no improvement over the baseline. All the techniques return a positive skill score when compared to climatology, but only the skill score of the AMO index technique is significantly different from 0. The same holds when measured against a 10-yr persistence forecast, although this second baseline appears more difficult to improve upon. The better performance of the AMO index technique is likely related to the fact that the index is constructed using sea surface temperature over the northern North Atlantic, which is the region where initialization of climate models consistently returns an improvement over noninitialized climate simulations (Doblas-Reyes et al. 2013; Meehl et al. 2014). This improvement has itself been linked to the ability of the initialized climate models to reproduce the ocean dynamics of the Atlantic meridional overturning circulation (AMOC) (Robson et al. 2012, 2014). A recent study suggests a long and robust link between the Atlantic meridional overturning circulation and the AMO (McCarthy et al. 2015). It could be argued that for the hybrid and dynamical forecast systems, much of the skill originates from the first forecast year, but as shown in the online supplemental material (https://doi.org/10.1175/BAMS-D-17-0025.2), where the results of the same analysis are repeated with forecast years 2–5 only, this does not appear to be the case.
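The mean absolute error skill score is simple enough to compute directly, as in this short Python sketch. The function names and the example numbers are hypothetical; the baseline here is a constant (climatology-like) forecast chosen only for illustration.

```python
from statistics import fmean

def mean_absolute_error(forecast, observed):
    """Average absolute difference between forecasts and observations."""
    return fmean(abs(f - o) for f, o in zip(forecast, observed))

def mae_skill_score(forecast, baseline, observed):
    """SS = 1 - MAE_forecast / MAE_baseline; 1 is perfect, 0 or less is no gain."""
    return 1.0 - mean_absolute_error(forecast, observed) / mean_absolute_error(baseline, observed)

# Made-up anomalies: the forecast misses one value by 1, the baseline is constant.
observed = [1.0, 2.0, 4.0]
forecast = [1.0, 2.0, 3.0]
baseline = [2.0, 2.0, 2.0]
skill = mae_skill_score(forecast, baseline, observed)
```

In this toy case the forecast error is one-third of the baseline error, so the skill score is two-thirds; a forecast that did worse than the baseline would return a negative value.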
We further evaluated whether each forecast system could accurately anticipate 5-yr periods of below-normal (lower tercile), near-normal, and above-normal (upper tercile) hurricane activity. Each such accurate prediction is identified with a colored circle at the bottom of Fig. 1. All the different techniques have a similar success rate, in the 60%–65% range (bottom left, Fig. 1). It is worth noting that all the techniques tend to reforecast the appropriate tercile once a pattern of low or high activity has solidly been established. Around tipping points (late 1960s and mid-1990s), they tend to be less skillful, which suggests that a certain level of skill comes from persistence in the initial conditions.

WEATHER ROULETTE. To make full use of all the ensemble members and their distribution, we also adopt a probabilistic approach to reforecasting the proper terciles. And while a series of tools for probabilistic forecast evaluation exists, few are intuitive in communicating the skill to nonexperts. One diagnostic that stands out in that respect is weather roulette (Hagedorn and Smith 2009), where the skill of a forecast is quantified using an effective yearly interest rate representing the cumulative advantage obtained from using that forecast over a baseline. A game of weather roulette is illustrated in Fig. 2 and a formal description is given in the appendix.

Fig. 2. An example of weather roulette. Two players bet on whether the hurricane seasons are going to be below average, near average, or above average. Both players start the game with the same amount of money (in this case $10) and spread their initial bet according to the probability given by their respective forecast. One player always bets according to climatology (top-left wheel) and always distributes 33% of the capital in each of the three categories. The second player uses a hurricane forecast system and distributes the money according to the proportion of ensemble members in each category (top-right wheel). For this player, the distribution will vary for each round. At the beginning of round 1, the player using predictions from the forecast systems puts 29% of the money on the winning category, whereas the player using the climatological forecast puts 33%. In this case, climatology gives better results and the player using a forecast system ends up with less money. This player starts round 2 with a capital of $10 × (0.29/0.33) = $8.70, whereas the other player continues with $10. In round 2, the forecast system predicts the winning category with 88% probability, thus resulting in betting 88% of the money in the right category. This player ends round 2 with $22.97 as opposed to $10 for the other player. After n rounds, the net gains associated with each strategy can be assessed.
Weather roulette is played between two opponents (a forecast and a baseline), with each player starting with the same initial capital and the roulette slots representing each of the possible terciles. The players start the first round by distributing their initial capital proportionally to the odds given for each tercile by their respective forecast. For each technique, the odds are given by the percentage of members forecasting a given tercile, while for a climatological baseline, the money is divided equally between the three categories. The money that is bet on the wrong terciles is lost (for both players), while the money that is bet on the verifying tercile is multiplied by the inverse of the probability of the baseline for that tercile [if the baseline is climatology, 1/(1/3) = 3] and returned to each player.
The ratio of the forecast probability and the baseline probability is called the return ratio, and when the probability of the winning tercile is larger for the forecast than for the baseline, that return ratio is greater than 1 and the player betting according to the forecasts starts the next round with more money than when the round began (and vice versa). All the money is reinvested by both players in the second round (second start date), and the game is repeated for all the start dates. The skill of the forecast R is given by the geometric average of the return ratios, and the effective yearly interest rate is given by R − 1. A forecast that is more skillful than the baseline will return R > 1 and a positive interest rate.
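The game described above can be simulated in a few lines of Python. The sketch below uses the two rounds quoted in the Fig. 2 example (forecast probabilities of 29% and 88% for the verifying tercile against a climatological 1/3); the function name and return layout are assumptions made for this illustration.

```python
from math import prod

def play_weather_roulette(forecast_probs, baseline_probs, capital=10.0):
    """Play one game and return (capital after each round, effective interest rate).

    forecast_probs[j] and baseline_probs[j] are the probabilities that each
    player assigned to the tercile that verified in round j.
    """
    history, ratios = [], []
    for p_forecast, p_baseline in zip(forecast_probs, baseline_probs):
        ratio = p_forecast / p_baseline   # return ratio for this round
        capital *= ratio                  # all the money is reinvested
        ratios.append(ratio)
        history.append(capital)
    skill = prod(ratios) ** (1.0 / len(ratios))  # geometric average R
    return history, skill - 1.0

# The two rounds of the Fig. 2 example, played against climatology.
history, interest_rate = play_weather_roulette([0.29, 0.88], [1 / 3, 1 / 3])
```

Rounded to the cent, the capital after the two rounds reproduces the $8.70 and $22.97 quoted in the Fig. 2 example, and the interest rate is positive because the gain in round 2 more than offsets the loss in round 1.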
Because the weather roulette requires a sufficient number of ensemble members, we can evaluate only forecast systems that rely on climate simulations (dynamical and hybrid), the three of which are measured against climatology and a combination of climatology and persistence. Figure 3 (top) compares the probability of the verifying category for each start date. The probability for the climatological forecast is always 0.33 and the verifying probability of the three forecast systems is usually greater than this value. The return ratios between the three forecast systems and the climatological forecast are given in Fig. 3 (middle). Although there is much year-to-year variation, the return ratios are usually greater than one. This is confirmed by the effective yearly interest rate, which is greater than 0 for all three systems. The skill decreases when a mix of persistence and climatology is used (Fig. 3, bottom), but again the interest rate is greater than 0 for all three forecast systems, indicating an overall better performance than the baseline. The forecast system based on the AMO index returns the highest interest rate, partly due to the high confidence in accurate forecasts calling for a higher level of activity during the later period.

CONCLUDING REMARKS. So, how skillful are the multiannual forecasts of hurricane activity originating from initialized climate models? While their current skill is still low compared to that of seasonal hurricane forecasts, these forecasts are better than climatological forecasts and at least as good as, but probably better than, 10-yr persistence forecasts. The constant improvement in climate models, combined with the ever-growing network of observations available to initialize them, offers hope that these forecasts will follow a path similar to that of seasonal forecasts and start providing reliable, skillful information in the not-so-distant future. Further calibration (Doblas-Reyes et al. 2005) and improvement in the correction of the climate model drift (Kharin et al. 2012) offer additional and immediate avenues by which the current skill level can be raised. How these forecasts can be integrated into a decision-making process given the intrabasin variability (Kossin 2017) is, however, an entirely different matter.

Using a purely dynamical approach, Hermanson et al. (2014) suggested that hurricane activity will remain low for the upcoming years. Unfortunately, most of the data available for our study originated from CMIP5, which was completed in 2012. As such, the series of simulations do not cover the upcoming 5-yr period, which prevents us from using the hybrid techniques to validate that prediction. Nonetheless, there are international initiatives in the works, such as the CMIP6-endorsed Decadal Climate Prediction Project (DCPP), which will soon provide the new data required to suggest an answer to that question.

APPENDIX: STATISTICAL EVALUATION. Anomaly correlation coefficient. Anomaly correlation coefficients (ACCs) are computed by correlating the 5-yr ensemble-mean anomalies with the observed 5-yr-mean hurricane anomalies. ACCs are computed using both standard Pearson's correlation and Kendall's rank correlation. The latter describes the ability of the forecast system to identify the relative ordering of 5-yr periods correctly and is used since we do not necessarily expect the ensemble-mean forecast anomalies and the observed hurricane anomalies to follow a Gaussian distribution.
Autocorrelation in the time series is accounted for by considering an effective sample size n_eff, which approximates the number of independent data points in the time series. The effective sample size is defined such that

$$n_{\mathrm{eff}} = \frac{N}{1 + 2\sum_{\tau=1}^{N-1}\left(1 - \frac{\tau}{N}\right)\rho(\tau)}, \qquad (\mathrm{A1})$$

where N is the actual sample size and ρ(τ) is the autocorrelation function as a function of lag τ (von Storch and Zwiers 2001; Guemas et al. 2014). Whereas the actual sample size is the number of start dates (50), the effective sample size for the 5-yr-mean hurricane time series is much lower (10). Correlations are considered significant if the p values (shown in Table A1) are below 0.05.
Improvement over a baseline forecast. The mean absolute error skill score (SS) is used to measure improvement with respect to a baseline, taken here as either 10-yr-mean persistence forecasts or climatological forecasts. Climatology here is defined as the average from 1900 to the year prior to the forecast, but using a different start point to compute the climatology does not impact the results. The mean absolute skill score is defined such that

$$\mathrm{SS} = 1 - \frac{\mathrm{MAE}_{\mathrm{forecast}}}{\mathrm{MAE}_{\mathrm{baseline}}}, \qquad (\mathrm{A2})$$

where MAE_forecast and MAE_baseline are the mean absolute errors of the forecast and the baseline, respectively. The mean absolute error is defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{k=1}^{n}\left|y_k - o_k\right|, \qquad (\mathrm{A3})$$

where (y_k, o_k) is the kth of n pairs of forecasts and observations.
An SS greater (less) than 0 means that the forecast offers a better (worse) performance than the reference. An SS of 1 means a perfect forecast. The confidence interval of the score was computed using the bootstrap percentile method with 10,000 replicates and a fixed block length given by 1/n_eff. The SS was considered statistically significant if the confidence interval did not include 0.
Weather roulette. The weather roulette, as developed by Hagedorn and Smith (2009), is defined as a bet between two opponents (an actual forecast and a baseline), with each player betting that the odds of her/his forecast are better. The roulette slots represent each of the three possible categories (terciles). Both players start the game with the same initial capital c_0 and spread all of their capital over the categories according to the probabilities given by their respective forecast.
The odds o(i) of the ball falling into each of the slots (i.e., that a tercile will verify) are given by

$$o(i) = \frac{1}{p(i)}, \qquad (\mathrm{A4})$$

where i = 1, 2, and 3 and p(i) is the probability of the ith outcome. The sum of probabilities over all possible outcomes is, of course, 1:

$$\sum_{i=1}^{3} p(i) = 1. \qquad (\mathrm{A5})$$

For each forecasting system, the odds are given by the percentage of members forecasting a given tercile. Because one model (HadCM3) has twice as many members available to produce hybrid forecasts compared to the other models, we limit the number of members for this model to 10. This will prevent this model from being overrepresented in the ensemble.
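Turning ensemble members into tercile probabilities amounts to counting members in each category, as in the Python sketch below. The function name is hypothetical, and the tercile boundaries `lower` and `upper` are assumed to be supplied by the user (for example, from the observed record); how they were computed in the study is not specified here.

```python
def tercile_probabilities(members, lower, upper):
    """Fraction of ensemble members in the below-, near-, and above-normal terciles.

    Members below `lower` count as below normal, members above `upper` as
    above normal, and the rest as near normal.
    """
    n = len(members)
    below = sum(m < lower for m in members)
    above = sum(m > upper for m in members)
    return below / n, (n - below - above) / n, above / n

# Ten made-up member anomalies with assumed tercile boundaries of -0.5 and 0.5.
members = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 0.5, 0.2, -0.3, 1.5]
probabilities = tercile_probabilities(members, -0.5, 0.5)
```

Note that a small ensemble can assign a probability of exactly 0 to a tercile; if that tercile verifies, the player loses all capital, which is one reason a sufficient number of members is needed for this diagnostic.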
For a climatological forecast, the probabilities p_clim forecasted for each tercile are 1/3 = 33.3%. The probabilities p_persis of the persistence forecast are 60% for the forecasted tercile and 20% for the other two terciles. This is necessary to prevent the persistence forecast from going bust if an event that is not forecasted does occur. That being said, it has been shown that combinations of persistence and climatological forecasts usually perform better than either of the two standards taken individually (Buell 1958; Murphy 1992). The probabilities of the mix forecast (persistence and climatology) are constructed such that (A6)

After the outcome of the first round is established, the capital that was bet on the wrong terciles is lost (for both players), while the capital c_1 that was bet on the verifying outcome ν is returned to each player such that

$$c_1 = \frac{p(\nu)}{p_{\mathrm{baseline}}(\nu)}\,c_0, \qquad (\mathrm{A7})$$

where p(ν) is the probability that the player assigned to the verifying tercile. The ratio of the probabilities of the forecast over the baseline is defined as the return ratio r:

$$r = \frac{p_{\mathrm{forecast}}(\nu)}{p_{\mathrm{baseline}}(\nu)}.$$

When the probability of the winning tercile is larger for the forecast system than for the baseline, the return ratio is greater than 1 and that player starts the next round with more money than when the round began (and vice versa). At the end of each round, all the money is reinvested in the following round and the game is repeated until the last start date. The skill of the forecast R is given by the geometric average of the return ratios, which is given by

$$R = \left(\prod_{j=1}^{n} r_j\right)^{1/n},$$

where n is the total number of rounds, which in this case is the number of forecasts produced (50). Finally, the effective yearly interest rate (IR) is given by R − 1.
A forecast that is more skillful than the baseline will return R > 1 and a positive interest rate. It can also be shown that IR is related to the ignorance score (IS), which is a proper score, by the following transformation:

$$\mathrm{IS} = -\log_2 R = -\log_2(\mathrm{IR} + 1),$$

where IS is measured relative to the baseline. Note that there was not a sufficiently large number of ensemble members in the statistical model to evaluate that technique with the weather roulette.
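The weather roulette described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the three example rounds are hypothetical, the baseline shown is the climatological forecast (any other baseline probabilities could be passed instead), and the last line assumes that the ignorance score is expressed in bits relative to the baseline.

```python
import math


def return_ratios(p_forecast, p_baseline, verifying):
    """Return ratio per round: forecast over baseline probability
    of the tercile that verified (terciles coded 0, 1, 2)."""
    return [pf[v] / pb[v] for pf, pb, v in zip(p_forecast, p_baseline, verifying)]


def skill_and_interest(ratios):
    """Geometric mean R of the return ratios and interest rate IR = R - 1."""
    if any(r == 0 for r in ratios):
        # The player went bust: all capital was lost in one round.
        return 0.0, -1.0
    n = len(ratios)
    R = math.exp(sum(math.log(r) for r in ratios) / n)
    return R, R - 1.0


# Hypothetical example: three rounds played against climatology.
CLIM = [1 / 3, 1 / 3, 1 / 3]
forecasts = [[0.1, 0.3, 0.6], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
outcomes = [2, 1, 2]  # verifying tercile in each round

ratios = return_ratios(forecasts, [CLIM] * len(forecasts), outcomes)
R, IR = skill_and_interest(ratios)
relative_ignorance = -math.log2(R)  # assumed: IS in bits, relative to the baseline
```

In this made-up example, the forecast wins the first two rounds but puts only 10% on the verifying tercile in the third, which is enough to pull R below 1 and make the interest rate negative.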

Fig. 1. Deterministic forecasts. Time series of 5-yr-mean hurricane anomalies in observations (black) and for the various forecast systems. These include forecasts made by tracking storms directly (red), forecasts based on the relative SST of the Atlantic with respect to the rest of the tropics (green), forecasts produced using a proxy for the AMO (blue), and forecasts produced using a statistical model (magenta). The 5-yr forecasts are aligned with the third year of the prediction. For observations, we consider only storms forming between 5° and 25°N. The inset table shows various measures of forecast quality: i) linear correlation index (Corr), ii) Kendall ranked correlation (Rank), and the mean absolute SS with respect to iii) a climatological forecast (Clim) and iv) a 10-yr persistence forecast (Pers). Statistically significant values for the correlations and the mean absolute SS are shown (boldface). The full circles along the x axis show the 5-yr periods for which each system's prediction landed in the right tercile, and the four numbers at the bottom left give the percentage of times that each system managed to do so. The inset plot in the bottom right compares the rmse and the spread of each technique, showing that all three forecast systems relying on climate models are underconfident and the statistical forecast system is overconfident. The asterisk denotes the statistical model skill (see inset table), which is given for the 1980–2010 period, whereas the other models are scored based on the 1961–2010 period.

Fig. 3. Probabilistic forecast verification. (top) The probability of the verifying tercile as predicted by each of the forecast systems, the climatological forecast, and the mix climatology-persistence forecast. (middle) Return ratio for each forecast system when playing against the climatological forecast. The effective interest rate is given in the upper-left corner. (bottom) As in (middle), but when the forecast systems are measured against the mix of climatology and persistence. A return ratio >1 means that the forecast system outperformed the baseline for that year (dot sitting over the white background), while an effective yearly interest rate >0 means that the cumulative effect of using this system over the 50-yr period compared to the simpler alternative is positive.

ACKNOWLEDGMENTS.
The first author would like to thank Isadora Jimenez for providing the necessary material for Fig. 2. The first author would like to acknowledge the financial support from the Ministerio de Economía, Industria y Competitividad (MINECO; Project CGL2014-55764-R), the Risk Prediction Initiative at BIOS (Grant RPI2.0-2013-CARON), and the EU [Seventh Framework Programme (FP7); Grant Agreement GA603521]. We additionally acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. LPC's contract is cofinanced by the MINECO under the Juan de la Cierva Incorporacion postdoctoral fellowship number IJCI-2015-23367. Finally, we thank the National Hurricane Center for making the HURDAT2 data available. All climate model data are available at https://esgf-index1.ceda.ac.uk/projects/esgf-ceda/.
forecast systems. Both hybrid forecasts rely on a multimodel ensemble (MME) of multiannual reforecasts performed within the context of CMIP5