Quantifying statistical uncertainty in the attribution of human influence on severe weather

Event attribution in the context of climate change seeks to understand the role of anthropogenic greenhouse gas emissions on extreme weather events, either specific events or classes of events. A common approach to event attribution uses climate model output under factual (real-world) and counterfactual (world that might have been without anthropogenic greenhouse gas emissions) scenarios to estimate the probabilities of the event of interest under the two scenarios. Event attribution is then quantified by the ratio of the two probabilities. While this approach has been applied many times in the last 15 years, the statistical techniques used to estimate the risk ratio based on climate model ensembles have not drawn on the full set of methods available in the statistical literature and have in some cases used and interpreted the bootstrap method in non-standard ways. We present a precise frequentist statistical framework for quantifying the effect of sampling uncertainty on estimation of the risk ratio, propose the use of statistical methods that are new to event attribution, and evaluate a variety of methods using statistical simulations. We conclude that existing statistical methods not yet in use for event attribution have several advantages over the widely-used bootstrap, including better statistical performance in repeated samples and robustness to small estimated probabilities. Software for using the methods is available through the climextRemes package available for R or Python. While we focus on frequentist statistical methods, Bayesian methods are likely to be particularly useful when considering sources of uncertainty beyond sampling uncertainty.


Introduction
Over the past decade, there has been increasing interest in the climate change research community in describing the role of anthropogenic greenhouse gas emissions in specific extreme weather events, commonly referred to as "event attribution" (Stott et al., 2013;National Academies of Sciences, Engineering, and Medicine, 2016;Herring et al., 2016).This increased scientific interest has been motivated by public interest, as the global warming signal both becomes more noticeable and easier to analyse, and an expectation that further understanding of the role of anthropogenic emissions might inform adaptation activities.Concurrently, observationally-based data products, numerical climate models, and computational resources have developed to the point where they can be usefully applied to the analysis of long-term trends in extreme weather.Allen (2003) first noted the potential public demand for event attribution information and suggested that concepts from epidemiology and environmental law would be appropriate in the climate change setting as well.In this setting, one compares the probability of an extreme weather event under a factual scenario of recent and current conditions to the probability in a counterfactual scenario in which anthropogenic emissions had never occurred but other factors (e.g., the eruption of Mt.Pinatubo) had still influenced the climate system.Stone and Allen (2005) formally introduced the concepts of fraction attributable risk (FAR) and risk ratio (RR) and proposed how they might be estimated given available tools.(Note that 'risk' in risk ratio inherits from its usage in biostatistics/epidemiology and is unrelated to statistical risk.)Unlike in epidemiology, in which repeated observations (e.g., multiple patients) are available with different exposures to potential health risks, in the climate context we do not have available repeated samples of the world, particularly under the counterfactual scenario.Analyses therefore use simulations of numerical models of the climate system as surrogates.Stott et al. (2004) presented the first study using this approach, making use of a small number of simulations of a climate model representing the entire climate system, but the small number of simulations meant that they had to assume a relationship between the average summer temperature and the frequency of extremely hot summers.Pall et al. (2011) extended the method to take advantage of an exceptionally large number of simulations that removed the need for the assumption of a mean-extreme relationship, at the expense of an incomplete model of the climate system.Together, these two variants of the "probabilistic event attribution" method have become the most popular approach for event attribution research in recent years (e.g., Peterson et al., 2012Peterson et al., , 2013;;Herring et al., 2014Herring et al., , 2015Herring et al., , 2016)).
Despite this popularity, there has been little formal statistical development of the technique since Stone and Allen (2005).The exception is the work of Hansen et al. (2014), who considered the problem of estimating the FAR as the ratio of means of two independent Poisson processes and proposed use of a confidence interval well-established in the statistical literature (Wilson, 1927).They also considered the case when the estimated counterfactual probability of an event is zero and developed a one-sided interval for FAR.
Underlying probabilistic event attribution are the existence of observational data and relevant climate model simulations, along with the assumption that both are adequate for performing an analysis.Guidelines for such adequacy tests are still being considered (Stott et al., 2004;Pall et al., 2011;Lott and Stott, 2016;Angélil et al., 2016).We leave the question of how to evaluate the input data sources for further discussion elsewhere and focus instead on statistical methods to estimate the RR (conditional on the data sources) and to quantify the uncertainty in the estimate due to limited statistical sampling because of finite model ensemble sizes.
This paper presents a formal frequentist statistical framework for probabilistic event attribution.Previous work has frequently involved interpretation of frequentist-based analysis using a Bayesian perspective (e.g., Stott et al., 2004;Pall et al., 2011), so part of our goal is to clarify ways to implement standard frequentist statistical methods in this context.We focus on estimating uncertainty from limited sampling, for which a frequentist approach is well-suited.We present several statistical methods for estimating uncertainty via confidence intervals, assess performance in a statistical simulation study, provide an example analysis, and make recommendations for the practice of probabilistic event attribution.We highlight that some methods allow estimation of a confidence interval even when the estimated counterfactual probability is zero and the risk ratio is infinity.

Statistical framework
We consider the approach of estimating the risk (probability) ratio, as the ratio of the probability of a specified event under a factual scenario (F), p F , to that probability under a counterfactual scenario (CF), p C : RR = p F p C .
In most climate attribution studies (and a case study in Section 6), the factual scenario is intended to represent the climate we have experienced, i.e., in which historical emissions of greenhouse gases and other drivers of climate change occurred as they did.In contrast, the counterfactual is intended to represent a climate that might have been in the absence of human interference, i.e., in which the anthropogenic drivers are held at some "pre-industrial" level (e.g., year 1850 values) even as the natural drivers (volcanic eruptions and solar luminosity) remain varying as we have experienced.Given estimates, pF for p F and pC for p C , one simply estimates RR = pF pC .We focus on RR rather than FAR because of its use in epidemiology and related fields, its interpretability, and because in our statistical development we generally work with log(RR), which is expressed simply as the difference in log probabilities and which improves statistical performance.Usefully, on the log scale, increases and decreases from anthropogenic influence are symmetric about zero, unlike for the FAR.However, a confidence interval for FAR can be trivially calculated from a confidence interval for RR.
In this work we lay out a classical (or frequentist) statistical framework that treats p F and p C (and therefore RR) as fixed, non-random quantities that are properties of the climate system.Uncertainty arises in estimating these quantities, and our goal is to quantify the uncertainty in RR as an estimate of the unknown RR based on the (sampling) probability density of RR.In contrast, a Bayesian treatment would consider these quantities to be random variables with probability densities and would allow one to make probabilistic statements directly about RR.Note that in previous event attribution work, researchers have frequently interpreted results, including bootstrap-based analyses, in a Bayesian fashion despite largely relying on frequentist methods (e.g., Stott et al., 2004;Pall et al., 2011).

Basic probabilistic framework
We use R to denote the continuous variable of interest (e.g., runoff or rainfall) and I(R > c) to denote the occurrence of the event of interest, defined by whether the variable exceeds some cutoff, c.I(•) is the indicator function that is 1 if the condition occurs and 0 if not. Let be the probability of the event, where E(•) denotes expected value and f (r) is the probability density function of R. While in principle we can estimate this quantity in the real world using observations, we only have a single observational series.If that series were stationary (which may be a reasonable assumption only for short time periods), we might use multiple years as replicates, but we cannot estimate this quantity under the counterfactual scenario.Therefore researchers rely on climate model simulations to estimate p ∈ {p F , p C }. (However the methods in this paper can be used for a RR contrasting probabilities under two different time periods.) In a single climate model simulation, R for a pre-specified time and location is not random since the model is deterministic and therefore R (and I(R > c)) is a fixed value.In this case, the use of expectation and probability above is not meaningful.However, not only can models be run under multiple scenarios, but they can be run multiple times with each realization differing in the initial state and thus the subsequent weather produced.Let W (for "weather") be a (very highdimensional) random variable indicating the state of the earth system and f (w) the probability density function of W .The individual realizations provide a simple random sample, w 1 , . . ., w nw , of size n w from f (w).By using the initial condition ensemble, passed through the deterministic GCM, to represent f (w), we rely on the strong assumption that use of the sample of initial conditions to initialize the GCM can approximate the true f (w) under a given scenario.This ergodic assumption is reasonable for the atmospheric model simulations used in the approach of Pall et al. (2011) after a spin-up period of weeks to over a year but requires decades for atmosphere-ocean GCMs.W is random and induces the randomness in R. Thus R(W ) is our random variable of interest.Now we have (2) The discussion above assumes one value of R per model realization, which would often be the case for annual or seasonal extremes or for short-term extremes for specific calendar dates.If one has multiple values of R for a given realization (e.g., analyzing daily rainfall), one would generally benefit from having a much larger sample size than is the focus here (particularly the simulation results of Section 5), but one would need to account for any correlation between the multiple values of R from a given realization (e.g., the daily values in a given season or year), particularly when estimating uncertainty.
In Section 3 we present methods to estimate p ∈ {p F , p C }, while in Section 4 we discuss how to quantify the uncertainty in estimating RR.Before doing so, we consider the additional complexities involved in estimating RR that are not characterized in the basic framework above.

Sources of uncertainty
Sources of uncertainty in estimating the RR using GCMs (see also National Academies of Sciences, Engineering, and Medicine ( 2016)) include: • variability in the earth system: Uncertainty arises from limited sampling of the variability of the system, which can range from time scales of days to years, decades, and centuries.This is often operationalized via an ensemble of GCM simulations initialized with different initial conditions and quantified based on the notion of sampling uncertainty as discussed in the basic framework above.
• boundary condition uncertainty: This includes aspects of the system that can vary in time but that are prescribed in the model and thus not simulated by it, such as the concentrations of radiatively-active trace constituents in the atmosphere and changes in land use.Modification of one or more of these boundary conditions constitutes the distinction between the factual and counterfactual scenarios; thus uncertainty in the distinction is implicit in this source of uncertainty.When such forcing terms are based on observational data, there is observational uncertainty.
• model parametric uncertainty: This represents uncertainty in the appropriate values for parameters in the GCM that are used in approximations for various processes not directly simulated by the GCM.This could include fundamental physical or chemical constants, but generally uncertainty in those constants is negligible relative to uncertainty in the appropriate value of bespoke parameters.
• model structural uncertainty: This is another component of uncertainty inherent to the climate model -the uncertainty in how to represent a complex physical system as a mathematical model, but not the uncertainty in tuning of that approximation.Note that determination of whether a model is fit for purpose for event attribution is a binary determination of this broader component of uncertainty.
We draw a fundamental distinction between uncertainty arising from variability in the system and the remaining sources of uncertainty.Uncertainty from variability is a statistical sampling problem, with uncertainty decreasing with increasing sample size.Critically, this uncertainty is quantifiable using well-established frequentist or Bayesian statistical methods.If we had an infinitely large ensemble, our sampling uncertainty would be zero.However, there would still be bias from the fact that the climate produced by the model is not the same as the climate produced by the real system, even in equilibrium, due to the additional factors listed above.In this work, we focus on uncertainty from variability, in part because frequentist methods would appear to be of limited use for the other sources.We make some comments on these other components of uncertainty in the discussion.
Finally, many event attribution analyses use atmospheric-land GCMs, as in Pall et al. (2011), rather than coupled models, for reasons of computational efficiency, a partial conversion of (oceanic) model structural uncertainty to better-understood boundary condition uncertainty, and in some cases a desire to more strongly condition the analysis on known features of the real world.With such analyses, model simulations do not sample from the longer-time-scale internal variability of the system.In Supplemental Material B we describe the additional uncertainty related to this longer-time-scale variability and present statistical methods for averaging over results from multiple years.

Estimating event probabilities
We next provide an overview and comparison of methods for estimating p ∈ {p F , p C }.We discuss approaches that simply count exceedances and rely on binomial sampling statistics, parametric fitting of the variable of interest, and extreme value analysis (EVA) of the variable of interest.
While EVA is often used for estimating probabilities of extreme events with observational data, serious difficulties can arise in the context of event attribution analyses that use model ensembles, and simple nonparametric estimators are often a good choice.First note that EVA is generally used on long time series and applied to short-term (e.g., daily) extremes.In this context, one often uses a block maximum (or minimum) approach, blocking by year, or a peaks-over-threshold approach that only uses observations over a high threshold, such as the 99th percentile of the observations.The statistical theory that supports EVA relies on taking maxima (or minima) over a large number of observations per block or setting a high threshold.However, for analysis of annual or seasonal extremes or short-term extremes for specific calendar dates, model ensembles generally only provide moderate sample sizes such as 50, 100, or 400, violating the assumptions of EVA.Furthermore, in event attribution, the event of interest may be extreme only in one scenario, so EVA may not be applicable for one scenario.And for short model simulations with initial conditions that favor extreme weather, the entire sample may be extreme in an absolute sense, in which case the event of interest may, in a relative (i.e., conditional) sense, not be extreme and EVA would not be appropriate (e.g., Pall et al., 2017).
Thus, while there are situations in which EVA has advantages for model-based event attribution (as seen in the case study in Section 6 and discussed further in our recommendations in Section 7), we focus more on the binomial approach because of its greater generality.

Nonparametric (binomial sampling of events)
One straightforward approach to estimating p is a nonparametric Monte Carlo (MC) estimate based on the available sample.An MC estimate of the expectation in (2) involves drawing n w samples of W from f (w) (based on the model simulations) and using the estimator: which is justified by the law of large numbers because p is a statistically-consistent estimator for p as n w → ∞.Consistency here means that as n w gets large, p is guaranteed to get very close to p.This estimator makes no assumptions about the distribution of R (hence the term 'nonparametric', although it is equivalent to Bernoulli sampling and a resulting binomial distribution for the number of events) and is thus robust, but there are drawbacks to the approach.First, the estimator may have more uncertainty than parametric estimators that assume a particular distribution.Furthermore, if R > c does not occur in the sample, our estimator is p = 0 even when we have substantive expertise suggesting that p is non-zero.Zeros for pF , pC , or both result in RR estimates of zero, infinity, or an undefined value.

Parametric
If one is willing to assume a particular parametric form for the distribution of R, such as a normal or log-normal distribution, statistical theory tells us that one may be able to obtain an estimator for p that has lower variance (less uncertainty) than the nonparametric estimator above.For example if we assume normality, we can estimate the mean and variance of R as r and s nw−1 , where r i = R(w i ) and r is the simple average of r 1 , . . ., r nw .Then p = ´∞ c f (r; r, s 2 R )dr where f (r; r, s 2 R ) is the normal density with mean r and variance s 2 R .The methodology allows us to estimate probabilities for events far in the tail of the distribution, but it relies crucially on the assumption that the chosen distribution well approximates the true distribution even far in the tail.Put another way, all the data values are used to estimate the parameters of the assumed distribution and thereby infer the behavior of the tail of the distribution.
Finally, note that unless the chosen distribution has a finite bound (either a maximum for extremes in the upper tail or a minimum for the lower tail), one will not obtain p = 0.

Extreme value analysis (EVA)
A compromise between the fully parametric and nonparametric approaches is to use extreme value techniques developed in the statistical literature; Coles (2001) provides an excellent overview.This approach applies when the event of interest is sufficiently far in the tail of the distribution of R. The approach involves fitting a three-parameter statistical distribution only to the extreme observations.While the approach takes a parametric form, statistical theory provides strong theoretical support for the particular distributional form.We provide an overview of these methods, including a discussion of the statistical and implementational challenges of using the methods for annual or seasonal extremes in Supplemental Material C .

Estimating uncertainty in the risk ratio
In this section we consider several methods for estimating uncertainty using frequentist confidence intervals.Our focus is on confidence intervals for RR, and we introduce each method in the simplest context of the nonparametric estimator (Section 3.1) before describing how the method could be used in the context of EVA.After describing the methods we carry out a simulation study to assess which performs best.
4.1 The frequentist interpretation of uncertainty about p and RR.
As discussed briefly in Section 2, we cannot generate a distribution of p or of RR = p F /p C using the methods discussed here, as this is not part of the frequentist statistical framework used here.RR has a distribution, called the sampling distribution, that is induced by the distributions of pF and pC , but it is not particularly useful to plot it, as it represents the variability of RR around RR.The danger in presenting such a plot is that it will often be viewed incorrectly as representing the distribution of RR, interpreted in a Bayesian fashion even though it has not been derived in a Bayesian fashion.For example, suppose the sampling distribution has a long right tail.This indicates that the estimator, RR, might be much larger than the true value, RR, and suggests that the true value may be much smaller than the our estimate.But if this sampling distribution were incorrectly interpreted as a Bayesian posterior, one would conclude that the true value may be much larger than the point estimate.(However, in the case of a simple parametric model and large sample size, the Bayesian central limit theorem states that the Bayesian posterior distribution will be approximately the same as the frequentist sampling distribution (Gelman et al., 2013)).
Instead, the standard statistical approach to quantifying uncertainty is to estimate a confidence interval.For the sake of illustration we will discuss a 90% confidence interval here, but the ideas are the same for other levels of confidence.The basic principle of a 90% confidence interval is that it is a random interval that should include the true value of the RR in 90% of the datasets that we might observe/collect.It is a random interval because it varies depending on the data collected, while the true RR is considered to be fixed.Therefore given a procedure for producing a confidence interval we can assess the procedure based on its coverage probability (the probability it contains the true value), assessed using synthetic datasets developed to mimic the real data-generating mechanism.Our goal is to use a procedure for calculating a confidence interval that produces intervals as short as possible while still having the specified coverage probability of 90%.

Methods for calculating confidence intervals
In this section, we explore several ways to estimate a confidence interval for the RR.We note that the statistical literature on estimating the RR (and the related odds ratio) in the biomedical literature is large and in particular considers a number of methods applied to binomial counts (see Fagerland et al. (2015) for an overview), and we consider some of those methods in addition to the bootstrap and a basic normal-theory method.Many of these methods are implemented in the climextRemes package available for R and Python, which builds upon the extRemes R package (Gilleland and Katz, 2011).
In general for probabilities (especially those near 0 or 1) and for ratios the sampling distribution of a given estimator is not a normal distribution, often being skewed.In this setting, working on the log scale often gives better statistical performance, with confidence intervals that come closer to having the desired coverage (e.g., Katz et al., 1978).One would compute the confidence interval for the quantity on the (natural) log scale (e.g., the log risk ratio) and then exponentiate the endpoints of the interval to get a confidence interval on the original scale.

Normal-theory confidence intervals
Basic statistical theory provides a standard confidence interval based on the standard error of the estimator when the sampling distribution of the estimator is approximately normally distributed.The approach is appropriate for larger sample sizes but can perform poorly for smaller sample sizes and when either p F or p C is close to zero.Furthermore, it involves some mathematical derivation, with use of the delta method in the context of the RR.For completeness we provide details in Supplemental Material D.1 .
We turn next to the bootstrap, which aims to overcome some of these difficulties by relying on computation.

Bootstrap
The bootstrap is a widely-used, asymptotically-justified statistical tool for estimating the uncertainty in statistical estimates, particularly in cases where one cannot derive the standard error of or confidence interval for an estimator in closed form (Efron and Tibshirani, 1994;Davison and Hinkley, 1997).To introduce the bootstrap we present it to develop a confidence interval for p for simplicity.However the same techniques work on the log scale for log p and for logRR.
A common version of the bootstrap involves resampling values of R with replacement from the sample, r 1 , . . ., r nw to generate n b different bootstrap datasets, each of size n w .With each bootstrap dataset we estimate p, giving us p(1) , . . ., p(n b ) .Note that the method does not provide us with a distribution to represent uncertainty in p and should not be interpreted as a Bayesian posterior distribution for p (though there are connections between Bayesian methods and the bootstrap).Rather the sample p(1) , . . ., p(n b ) provides us with an estimate of the sampling distribution of p.The idea of the bootstrap is that the empirical distribution of the p(i) values around p generally behaves similarly to the distribution of p around p, which is the true sampling distribution, f (p; p).If we knew the true sampling distribution, it would be simple to calculate a confidence interval.E.g., with normal data, we know that x ∼ N(µ, σ 2 n ).Given knowledge of this sampling distribution, we can analytically derive a 90% confidence interval for µ as x plus/minus 1.64 times the standard error, σ/ √ n, without needing to use the bootstrap.In the absence of a known sampling distribution, there are a variety of ways to use the p(i) values to estimate a bootstrap confidence interval for p: • Bootstrap interval using bootstrap standard error estimate: use the empirical standard deviation of the p(i) values around their mean, p.If we are happy to assume normality, we can use the empirical standard deviation of the p(i) values as the standard error estimate, which we denote se, to form a 90% confidence interval in the usual way, • Bootstrap percentile interval: use percentiles of the empirical distribution of the p(i) values to avoid the assumption of normality needed when using the bootstrap-based standard error estimate.This gives the following 90% interval, involving the 5th (p 5% ) and 95th (p 95% ) percentiles of the p(i) values, p5% , p95% . (5) • Basic bootstrap interval: the interval also involves p5% and p95% and is It is useful to consider the intuition behind why the upper quantile is involved in calculating the lower endpoint of the interval and vice versa, in contrast to the percentile method.The basic intuition is that if p95% is much larger than p, this indicates the plausibility of estimated values that are much larger than the true value.Given this, our lower limit for p should be much lower than p, because our actual estimate, p, calculated using our single dataset may be much larger than the true value, p.Note that if the sampling distribution is symmetric, with p − p5% = p95% − p then the percentile and basic intervals will be the same.
• Studentized bootstrap interval: this approach improves upon the basic interval by standardizing by an estimate of the standard error in each bootstrap sample.Let i) where se (i) is a (possibly rough) estimate of the standard error of p(i) that is calculated from the ith bootstrap sample.Let z 5% and z 95% be the 5th and 95th percentiles of the z (i) values.The studentized confidence interval is then where se is the (possibly rough) standard error estimate based on the actual dataset.
• Adjusted percentile (BC a ) bootstrap interval: this approach seeks to improve upon the percentile interval by estimating a transformation that brings the sampling distribution closer to normality (Davison and Hinkley, 1997).
Confidence intervals for RR In general, the ensemble members under the two scenarios are unrelated.If they were related, we would want to resample from the data in a way that reflected the relationship between the simulations in the two scenarios.Given the lack of relationship, a bootstrap procedure is to obtain n b resampled datasets from each scenario and to randomly pair the datasets to get n b pairs, from which one can calculate log RR (1) , . . ., log RR (n b ) .This pertains to estimates obtained from any of the nonparametric, parametric, or EVA methods.
Drawbacks to the bootstrap An important difficulty with the bootstrap occurs when RR = ∞ (the following discussion holds equivalently for RR = 0).In this case in which pC = 0 the bootstrap fails because there is no variability in the data; all bootstrap datasets will have p(i) C = 0. Another case is when some of the bootstrap samples have p(i) C = 0 or p(i) F = 0 and therefore log RR (i) ∈ {−∞, ∞}.In this case, the bootstrap standard error estimate (and therefore the bootstrap normal interval) cannot be calculated, while the ad hoc approach of removing such bootstrap samples before calculating confidence intervals using the various bootstrap methods has no clear justification and could affect performance of the resulting confidence interval.Finally when using EVA, it is not uncommon to be unable to estimate pF or pC because the optimization for the EVA parameters does not converge, and this often occurs with a subset of the bootstrapped datasets as well.It is unclear how to handle this, although if the number of times this occurs is small, ad hoc calculation of confidence intervals based on ignoring these values may not cause serious bias in the uncertainty quantification.
More generally than just in this context of estimating the RR, the percentile bootstrap method is known to perform poorly in practice (Hesterberg, 2015).However, while the logic behind the swapping of the percentiles in the basic bootstrap method relative to the percentile method suggests we might favor the basic bootstrap, both theoretical results (Davison and Hinkley, 1997;Hesterberg, 2015) and our simulation results (Section 5) show that the basic bootstrap also does not perform well.Furthermore, our simulation results suggest that all of the bootstrap-based methods have serious drawbacks.This is not surprising, particularly for values of pF or pC near zero.The bootstrap is known to perform poorly when the resampling gives a discrete distribution with few values.An extreme manifestation of this occurs when pN or pA are zero.However, even when this does not occur, the sampling distribution often has probability mass at ∞, 0 and 0/0, and the bootstrap estimate of the sampling distribution of log RR − log RR can provide a poor approximation to the true sampling distribution.

Inverting a hypothesis test using the likelihood ratio statistic
This approach is appealing because it can be used when pF = 0 (or pC = 0), providing a onesided confidence interval that gives a lower (upper) bound on plausible values of RR, as done in Jeon et al. (2016).The usefulness of providing a bound on the RR should be apparent, because the bound allows us to assess the magnitude of the anthropogenic influence in light of uncertainty.The basic idea is to find a confidence interval by inverting a hypothesis test for the estimate of interest.A standard hypothesis test that is commonly applicable is a likelihood ratio test (Casella and Berger, 2002), which compares the likelihood of the data based on the maximum likelihood estimate to the likelihood of the data when constraining the parameters to represent a simpler setting in which a null hypothesis is assumed true.Confidence intervals based on likelihood ratio tests are widely used and well described in the statistical literature but have not been used in event attribution, so we provide details in Supplemental Material D.2.

Binomial-based intervals
For the case where the two probabilities in the risk ratio are estimated using the nonparametric approach, but not the EVA approach, the statistical literature provides a wide variety of methods to calculate a confidence interval for the risk ratio from independent binomial proportions, motivated by the vast number of analyses of risk ratios in the biomedical literature (Fagerland et al., 2015).Once again, a standard framework involves inverting a hypothesis test.Of the wide array of possibilities, some methods our literature search suggested to be promising are: • Wilson's method: Hansen et al. (2014) propose to find an interval for the risk ratio by conditioning on the sum of events in the two scenarios to produce a binomially-distributed quantity and using an approximate confidence interval for a binomial probability proposed by (Wilson, 1927).
• Wang and Shan's method: Wang and Shan (2015) seek to improve upon existing methods by an inductive process to determine the ordering of how extreme different data values are with respect to providing evidence against the null hypothesis and then computing p-values by computing probabilities of data as or more extreme than observed, using the worst case value of the unknown nuisance parameter.One can then invert the test to obtain a confidence interval.More details are in Supplemental Material D.3.

Simulation study to evaluate the methods
Given the variety of methods available, the potential for small sample sizes, the fact that p F or p C are often near zero, and the difficulty that the normality-based method and the bootstrap methods have when RR = ∞, we explored the performance of the methods using a simulation study.For simplicity and given the limitations of EVA discussed earlier, the simulation study focuses on the context of estimating the RR based on nonparametric approach rather than EVA.The key factors that affect the statistical performance of the methods are the ensemble size, the true value of the RR, and the true probability of the event.

Design
For a given scenario, defined by the ensemble size (n), the RR, and the event probability (p F ), we generated 5000 synthetic datasets of y F and y C values, each from a binomial distribution with n ensemble members and probability of event p F and p C = p F /RR, respectively.We considered n ∈ {25, 50, 100, 400}, where 50 is the minimum suggested in the C20C+ Detection and Attribution project protocol (http://portal.nersc.gov/c20c)and 400 the number available for our case study.Our results focus on ensembles of size 100 as intermediate between the high uncertainty with smaller sizes and the fact that sizes as large as 400 will often not be available.We considered RR ∈ {1, 2, 4, 8, 16} representing the range from a world where there is no anthropogenic influence to one with very strong influence.We considered p F ∈ {0.01, 0.025, 0.05, 0.10, 0.20} to represent a range in how extreme the event is.Given all possible combinations of values of n, RR, and p F , we have 100 scenarios.Note that while we only consider risk ratios greater than or equal to one, the results are equivalent for p C ∈ {0.01, 0.025, 0.05, 0.10, 0.20} with RR ∈ {1, 1/2, 1/4, 1/8, 1/16} because the same calculations can be done for RR −1 ≡ p C p F , followed by taking one over the resulting interval endpoints to get a confidence interval for RR.
We then applied the following methods to each synthetic dataset, thereby producing a set of 5000 confidence intervals for each method for each scenario.The methods were: • Wilson's method (Section 4.2.4), • Koopman's asymptotic score test inversion (Section 4.2.4), • Wang-Shan exact test inversion (Section 4.2.4), • normal-theory with delta method (Section 4.2.1), • likelihood-ratio (LR) test inversion (Section 4.2.3), • bootstrap normal (Section 4.2.2), • percentile bootstrap (Section 4.2.2), • basic bootstrap (Section 4.2.2), • bootstrap-t (Section 4.2.2), and We report results for 90% confidence intervals, considering the lower and upper endpoints separately because in real event attribution analyses, the focus will often be on the lower bound for the RR so as to assess the plausibility of no anthropogenic influence.Thus we focus on 95% one-sided confidence intervals.However we note that some of the methods were not developed as two separate one-sided intervals.Therefore, a two-sided interval may cover the true RR with the correct probability but when interpreted as two separate one-sided intervals not have the correct one-sided coverage probability.
We assessed performance using the following metrics: • coverage probability of the intervals (proportion of times the 5000 intervals included the true RR), • length of the interval (focusing on the magnitude of the lower bound), and • proportion of intervals that could not be calculated because pF = 0 or pC = 0.

Results
Figure 1 shows the coverage probability for a one-sided lower confidence interval, (RR L , ∞).We see that for the LR, basic bootstrap, and bootstrap-t intervals, the intervals fail to include the true value at least 95% of the time.In contrast, the other methods are overly conservative (the intervals include the true value more than 95% of the time) for most or all values of RR.While values higher than 95% may sound appealing, they provide conservative results with intervals that are overly long and thereby increased uncertainty in estimating the RR.Note that in this and other figures we omit results for the bootstrap normal interval.In almost all of the scenarios the standard deviation of the RR estimates in the bootstrap samples could not be calculated because one or more estimates were zero, infinity or 0/0, and omitting those estimates has no statistical justification.
Figure 2 shows the coverage probability for a one-sided upper confidence interval, (0, RR U ).Here the Wang-Shan method is conservative, the LR and Koopman methods show some undercoverage, and the bootstrap, Wilson, and delta methods have widely variable results, with substantial undercoverage in some cases.Figure 3 shows the median value of RR L , with higher values corresponding to shorter intervals and more statistical certainty about the value of RR.As expected, in general intervals are wider for smaller values of p F (and therefore also of p C ), corresponding to fewer observed events.While the LR, basic bootstrap, and bootstrap-t methods have higher lower bounds, these are the methods showing undercoverage (particularly the bootstrap methods), so the shorter intervals are only obtained by violating the condition that coverage probability be correct.In contrast, the Koopman, Wilson, and Wang-Shan methods perform well, with the Koopman and Wilson approaches showing shorter intervals for smaller values of p F .
Finally, Figure 4 shows the proportion of simulated datasets for which a lower confidence bound could not be calculated, illustrating how commonly the bootstrap and delta methods fail, particularly for larger values of RR and lower values of p F , which correspond to scenarios with fewer events.Figures 7-10 in the Supplemental Material A show coverage results for n = 400.In general these results are similar to those for n = 100, but for n = 400 we see that coverage probability is generally closer to 95%, as we would expect with increasing sample size.Results for n ∈ {25, 50} were produced but are not shown; the results are similar quantitatively in terms of coverage and qualitatively in terms of the median bound values, but as expected show smaller lower bounds and higher proportions of datasets in which the bound(s) could not be calculated.
In summary, the bootstrap methods perform poorly.The LR method, which can be used both when analyzing binomial counts and when using EVA, provides some advantages, in particular its ability to provide intervals in most situations, including when pC = 0.In contrast, the bootstrap intervals require pC > 0 and do not provide meaningful intervals when too many of the bootstrap samples result in p(i) C = 0.However, the LR interval shows undercoverage; in contrast, if one is using binomial counting, then the Koopman and Wang-Shan methods are possibilities and avoid the undercoverage of the LR method.However, the upper endpoint of Koopman shows some undercoverage and both methods can be too conservative, leading to overly-long intervals.Thus there is a tradeoff here, and a user may wish to consider how conservative they wish to be in their analysis.The Wilson method performs similarly to the Koopman method for the lower endpoint but has extreme undercoverage for the upper endpoint.The Koopman method is easy to compute, although the endpoints for Wang-Shan (whose calculation is computationally-intensive) can be pre-computed and saved as a lookup table.
6 Case study: Texas drought/heatwave 6.1 The event Our case study focuses on the heat and dryness over the US state of Texas in the summer of 2011, the hottest and driest (in terms of precipitation deficit) on record dating back over a century (Rupp and Mote, 2012;Hoerling et al., 2013).Rupp and Mote (2012) performed a probabilistic event attribution analysis, but with counterfactual conditions estimated from previous years with similar anomalous temperature patterns in the Pacific Ocean (but cooller ocean temperatures generally).They concluded that anthropogenic emissions had increased the chance of the anomalously low rainfall and especially of the anomalous heat over Texas.While Hoerling et al. (2013) also found evidence for an anthrogenic contribution to the anomalous heat, they did not see evidence of a contribution to the precipitation deficit.Revealing possible inconsistencies in methods, however, these estimates based on climate models are at odds with the lack of an observed long-term summer warming over Texas (Stone et al., 2013;Hoerling et al., 2013).In this case study, we follow Rupp and Mote (2012) in considering March-August growing season temperature and rainfall averaged over the state of Texas, though the methods discussed in this work apply to longer or shorter periods.

Data and methods
We use data from CAM5.1 contributed to the C20C+ Detection and Attribution Project (Folland et al., 2014).CAM5.1 simulates the processes of the atmosphere and land surface, given prescribed ra- diative and ocean surface conditions, run at a spatial resolution of approximately 1 • (Neale et al., 2012).Simulations have been run under a factual scenario of observed boundary conditions (e.g., greenhouse gas concentrations, aerosol burdens, and sea surface temperatures) and under a counterfactual scenario with greenhouse gas and other radiative boundary conditions set to year-1855 values and ocean conditions set to the benchmark estimate of Stone and Pall (prep).This removes a spatio-temporal pattern of anthropogenic warming but preserves month-to-month and year-to-year anomalous variability.We use data from 50 simulations of both scenarios covering a 1961-2010 reference period, 100 simulations (including the longer 50) covering the 1997-2010 period, and 400 simulations (including the longer 100) covering 2011-2013 (Angélil et al., 2017).While multiple observational datasets are available, we use the CRU-TS-3.22observationallybased data set (Harris et al., 2014).We work with anomalies of both observations and model output against the 1961-2010 period, by subtracting the historical mean for temperature and dividing by it for precipitation.For the model simulations, the factual scenario reference using the 50 simulations covering the full 1961-2010 period is used for estimating anomalies for both of the scenarios.
When using EVA we set the threshold, u, as the 90th percentile for temperature of the values being analyzed (both scenarios).For precipitation we consider the 20th percentile.
All code and data are available at https://bitbucket.org/lbl-cascade/event_attribution_uq_paper.We use the climextRemes R package version 0.2 (also available for Python through Conda) to estimate risk ratios and uncertainty, in some cases based on binomial counts and in others on EVA.

Uncertainty analysis for temperature
Fig. 5a shows the distribution of temperature anomalies for 2011 for the two scenarios.
The actual event of a 2.62 • C anomaly in 2011 is very extreme relative to both the factual and counterfactual distributions, particularly so with respect to the counterfactual.For the binomial approach, the estimated RR is ∞.Using the likelihood ratio-based confidence interval, we have a one-sided 95% confidence interval (CI) of (1.04, ∞), while using the Koopman method we have (0.74, ∞); both are quite uncertain because of how extreme the event is.Here EVA provides us with the ability to use the information in the sample more effectively because the event is so extreme.The RR estimate is still ∞, with pA = 0.0067 and pN = 0, but the 95% one-sided CI using the likelihood ratio approach is (12.8, ∞), providing strong evidence for a large RR.Next, note that for less extreme event definitions, the event can be fairly common in the factual scenario and still not observed or very uncommon in the counterfactual.Given that the event is not extreme in the factual, EVA is not appropriate for that scenario, and we focus on the results based on the binomial approach.Table 1 shows results for a variety of definitions of the event.Note that we report a two-sided 90% CI by considering two one-sided 95% intervals.
Note that the factual distribution is shifted so substantially relative to the counterfactual distribution that the RR estimates are very large for extreme events and necessarily less so for less extreme events because pF has an upper bound at 1 and as the event becomes less extreme, pC increases and is no longer negligible.
Regardless, the evidence is strong that the probability of an extreme heatwave is much greater with anthropogenic influence than without, and for events defined by anomaly values larger than one, we conclude the risk ratio is at least 10, having accounted for sampling uncertainty.An important caveat is of course whether the model is able to capture the climatology of the event of interest, which is outside the scope of this work.

Uncertainty analysis for precipitation
Fig. 5b shows the distribution of precipitation (relative) anomalies for 2011 for the two scenarios.For the actual event, an anomaly value of 0.40 (i.e., 40% as much precipitation as the historical mean), we have zero exceedances in both scenarios.Furthermore EVA is of limited use as we get pF = 0.000088 and pC = 0, with high uncertainty: the likelihood-ratio based lower bound for RR is less than 0.01.
Given this, we consider several values for the event definition that are less extreme than the actual drought.Fig. 6 shows results from both the binomial count approach and EVA.In general we see evidence for a RR greater than one, with our best estimate of a RR of about two at a variety of event definitions.However the uncertainty in the lower bound at more extreme event definitions Figure 6: Estimated risk ratio and 90% confidence intervals for both binomial counts and EVA for various definitions of the event in terms of March-August precipitation anomaly over Texas (so lower values are more extreme).The three event definitions to the right are the 1 in 20, 1 in 10 and 1 in 5 year events based on the CRU observations.The upper bound for EVA for the event of 0.5 is greater than 100 but truncated at 100 for plotting purposes.Note that for less extreme event definitions, EVA is less appropriate as EVA relies on the cutoff being in the tail of the distribution.
limits our ability to make a more robust attribution statement.

Discussion and recommendations
We have presented a statistical framework for estimating the risk ratio to quantify anthropogenic influence on extreme events, focusing on proposing, implementing, and evaluating standard frequentist statistical methods for accounting for sampling uncertainty in event attribution.Our methodological recommendations are as follows: 1.When using binomial counting, we recommend either the Koopman or Wang-Shan methods.
If one is interested in the upper endpoint as well as the lower endpoint for small sample sizes, then Koopman does not preserve coverage probability, and we suggest Wang-Shan.These methods can be conservative, so the likelihood ratio approach might also be considered despite it producing intervals that can be too short.
2. When using EVA, only the normal-theory, likelihood ratio, and bootstrap methods are feasible.We recommend the likelihood ratio approach to calculate confidence intervals for the RR.This approach can provide a CI even when the estimate of the RR is infinity and avoids the problem that the bootstrap can fail to varying degrees when bootstrapped values of the RR estimate cannot be calculated or are infinity.
3. With the likelihood ratio approach, confidence intervals can be too short in a variety of circumstances and should be interpreted cautiously.
4. While appealing in its simplicity, bootstrap methods can perform poorly for quantifying uncertainty in RR, particularly when either p F or p C are near zero.
5. Bootstrapping provides a methodology to calculate confidence intervals, not Bayesian probability statements/intervals about RR.Unless an explicit Bayesian analysis is done, researchers should provide confidence intervals as their measure of uncertainty and avoid plotting the bootstrap distribution, as it represents the sampling distribution for RR not the distribution of RR.
6.Although we have not investigated its performance in the simulation study, EVA may provide estimates that use the data more efficiently than binomial counting in some cases, particularly when sample sizes are not too small and the event definition is fairly extreme under both scenarios.In this situation, the small counts of extreme events lead to high uncertainty for the binomial counting approach, but EVA can borrow information from data values below the event definition cutoff.However, with small sample sizes, EVA suffers from having few observations with which to fit the distribution and often from difficulties in numerical optimization.
7. EVA is not appropriate if the event is not extreme in the scenario being analyzed (e.g., often the case for cold events in a natural counterfactual scenario).Furthermore simple binomial counts are effective and straightforward as sample sizes increase provided the event is not too rare in at least one of the scenarios.EVA and binomial counting could be used in the same analysis when the event is rare in one scenario and not the other.However, in this case it is clear that the RR is far from zero and there would be limited benefit from reducing uncertainty in the probability for the scenario in which the event is rare, so analysis using binomial counting for both scenarios should be very effective.
Bayesian methods can also be useful and can be used to account for sampling uncertainty.Perhaps most usefully, a Bayesian perspective is probably the only option for accounting for uncertainty from uncertain boundary conditions (including uncertain counterfactual conditions in conditional attribution studies) and model parametric and structural uncertainty.For example, if one has an ensemble of model simulations based on varying the parameters of the climate model, this is most naturally seen as having drawn from a (prior) distribution over model parameters.This is naturally viewed in a Bayesian context but is difficult to conceptualize in a frequentist statistics framework as variation in the data that one might observe under the hypothetical of repeating an experiment.
Similarly, the use of multiple climate models and the possibility of an ensemble of simulations of possible aerosol forcing or other forcings could be considered in a Bayesian framework (e.g., Smith et al., 2009).Critically, unlike with sampling variability, additional simulations do not reduce uncertainty from these sources or give results that converge in a frequentist statistical context to the true RR.In fact, from a statistical perspective one might characterize these uncertainties as bias rather than variance.Given that conditioning on the ensemble output has limitations in terms of quantifying these uncertainties, any statistical treatment of these uncertainties is likely to represent some form of sensitivity analysis or subjective Bayesian analysis.Consideration of how to quantify parametric and structural uncertainty is an active area of climate research generally (e.g., Knutti et al., 2010).One common question in event attribution analyses is how to define the event, which is often based on choosing the cutoff beyond which an 'extreme event' is considered to have occurred.When the motivation for a study is damage to society, one may be able to choose a specific cutoff (or small range of cutoffs) to use.If analysis is motivated by an event occurring and the cutoff is determined based on observations, one might consider the definition to be a source of uncertainty.However, our perspective (originally developed in Jeon et al. (2016)) is that this is most naturally considered as a sensitivity analysis, reporting the RR and uncertainty for a variety of cutoff values, as done in our case study and in Angélil et al. (2017) and Pall et al. (2017).In some cases, such as temperature in our Texas example, while the results vary somewhat with the event definition, the lower bound of the confidence interval provides evidence for a robust attribution statement in light of uncertainty.

B Accounting for uncertainty in simulations without ocean dynamics B.1 Framework and additional sources of uncertainty
In this section we discuss the complications induced when doing event attribution using atmosphereland-only model simulations (e.g., Pall et al., 2011).
First, we highlight the conditional nature of the event attribution analyses that use atmosphereland GCMs and prescribed sea surface temperatures (SSTs).This approach retains anomalous SST features, such as the existence of an observed El Niño event, in both the factual and counterfactual scenarios.Thus it conditions on the SST state and so provides a probability ratio estimate that is conditional on the observed SST pattern and the estimated difference in magnitude of the SSTs between the two scenarios.This means that the framework ignores the possibility that anthropogenic factors influence the likelihood of a given SST state.I.e., we are quantifying the ratio of      Simulations in which the lower bound could not be computed were excluded.Figure 10: Proportion of simulated datasets for which lower confidence bound could not be calculated for n = 400.For the Wilson, Koopman, Wang-Shan, and likelihood ratio methods this occurs only when no events occur in both scenarios.
for particular SST states s F and s C .In this conditional context, there are two additional sources of uncertainty in event attribution analyses: • longer-term internal variability of the system: If we are sampling from simulations of atmospheric-land GCMs, then variability in the longer-term state of the earth system at yearly and multi-year time scales is not included in a simple initial condition ensemble of simulations in the same way that short-term variability is.This distinction does not exist if we are sampling from long simulations of GCMs that also simulate ocean processes.
• ocean and ice counterfactual boundary conditions: the lack of ocean and ice dynamics adds an additional aspect of uncertainty -the uncertainty of the SST and sea ice boundary conditions under the counterfactual scenario.Pall et al. (2011) took one approach to try to address this, but treatment of this uncertainty shares similarities to the uncertainties induced by use of a model discussed in Section 2.2.

B.2 Estimating a partly unconditional risk ratio from conditional analyses
If we want an unconditional risk ratio, we seek to calculate: which averages conditional probabilities over the marginal distribution of the state of the system (S) (i.e., the SST state), under the factual, f F (s), and counterfactual, f C (s), scenarios.To estimate the integrals using Monte Carlo methods we need to be able to draw from the SST distributions.Fortunately, the presence of multiple years in simulation of an atmosphere-land GCM provides some ability to account for internal variability.Note that there remains some conditionality though, in the sense that the sample of SST states is still defined as identical for both scenarios.
Here we discuss a basic frequentist analysis suitable for this scenario.An alternative Bayesian statistical strategy is developed in Risser et al. (2017).
The basic approach is to assume that the SST state is approximately independent from one year to the next such that we can treat the years as a simple random sample over the internal variability of the system.Given an initial condition ensemble, we can consider a probability, p t , for year t, that conditions on the fixed SST state: Here f t (w) is the probability density of weather for a given year (i.e., in a given setting of the internal variability of the system).We can estimate p t by pt nonparametrically using a standard Monte Carlo estimator applied to the ensemble output for year t as in (3) or using techniques discussed in Section 3.
If we are interested in a single probability, p, that represents the probability of the event averaging over internal variability (we would carry out these calculations separately for the factual and counterfactual scenarios), then our quantity of interest is where S is a random variable representing the state of the system and f (s) is the probability density of S.Here the distribution of weather, W , can be decomposed into the conditional distribution of W given the state of the system, S, multiplied by the distribution of the state of the system.The quantity just above can also be expressed as p s is equivalent to p t above when time and the state of the system are considered to be equivalent, and hence we can think of p as being the average of p t over the distribution of the state of the system.
A Monte Carlo estimate of p, averaging over internal variability as represented by the multiple years, n t , can be obtained as follows.
which is simply the proportion of exceedances of the cutoff across the full ensemble by time sample.
Note that this approach assumes that the sample of time reasonably approximates a simple random sample of the internal variability of the system.At the least, this requires a sufficient number of years to capture the different system states, such as ENSO regimes.When this is not the case, we recommend reporting estimates of pt (and the resulting RR t ) for the various years as a sensitivity analysis rather than reporting a single time-averaged estimate, p (and resulting RR).

B.3 Confidence intervals accounting for internal variability
When considering methods for finding confidence intervals in this conditional context of fixed ocean surface conditions, we must account for the fact that our estimator of p is not based on a simple random sample of {W, V } from the joint distribution.Rather we sample the internal variability (with year as the proxy) and then sample n w weather samples, with all n w samples sharing the same state of the internal variability.This complicates construction of confidence intervals, and we cannot directly use the methods described so far except when estimating separate RR values for different time points.
One can derive an analogous normal-theory confidence interval to that discussed in Section 4.2.1 by first calculating the variance of the time-averaged estimator p , and then applying the delta method to the log RR calculated based on the time-averaged estimates, pF and pC .The likelihood ratio-based method is not straightforward to apply when estimating p as an average of time-specific p t values, because we would need to find the constrained maximum likelihood estimators under the complicated constraint A bootstrap sample consistent with the actual sampling mechanism is to resample n t years with replacement and to resample n w ensemble members with replacement.One then uses the same n t resampled years from each of the n w resampled ensemble members.Note that our resampling procedure will show a lot of variability in RR from one resampled dataset to the next when there is a lot of variability across years and the number of years simulated by the model is limited.If we did a bootstrap where we independently resampled pairs from the ensemble members and years, we would see less variability than we should as we would not respect the dependence structure of the original Monte Carlo procedure, namely that all ensemble members share the same set of years.As a side note, an equivalent approach would be to resample n w ensemble members and to calculate pt for t = 1, . . ., n t for the n t years of model simulations and then resample with replacement from {p 1 , . . ., pnt }.Finally, since the SST conditions under the two scenarios are related, one should use the same resampling of years under the two scenarios.
Given that we are averaging over multiple years, it is less likely that the normal-theory and bootstrap intervals cannot be calculated, but this can still occur.

B.4 Case study
Here we extend the precipitation portion of the case study described in Section 6 to account for internal variability in conditional analyses.Using the period 1997-2013, for which we have at least 100 ensemble members available in each year, we consider accounting for internal variability, with the extreme event defined as the 1 in 20 year event based on the 1961-2010 CRU data, which is 282 mm.In Fig. 11, we show the time-varying estimates, RR t , with uncertainty based on the binomial count approach using the Koopman method to estimate confidence intervals.Results using EVA (not shown) are fairly similar, though the interval endpoints differ somewhat from the binomialbased intervals.Unfortunately for two of the years, the EVA likelihood maximization does not converge, failing to give a result.
Texas drought tends to be associated with La Niña conditions in the tropical Pacific, as occurred in 2011 as well as in 1999, 2000, and 2008; these years do not stand out as unusual for the estimated RR though.Given the uncertainty, it is difficult to say with certainty whether the RR varies substantially over time, although the results do suggest that RR is generally greater than one.Risser et al. (2017) tackle this problem in depth, developing a Bayesian methodology and reporting the importance of internal variability on attribution conclusions for a variety of event definitions and regions across the globe.
Finally we estimate the overall RR across time, treating the years as a sample of the internal variability of the system using the methods described in the previous section.The estimate of the time-averaged RR is 2.0 with 95% confidence intervals of (1.6, 2.6) [delta method], (1.4,3.1) [percentile bootstrap], (1.3, 2.8) [basic bootstrap], (1.2, 2.8) [bootstrap-t], and (1.3, 2.9) [BCA bootstrap].While the intervals vary somewhat, the results indicate some confidence that after accounting for sampling uncertainty and internal variability, there is evidence that anthropogenic influence has increased the risk of drought.

C Extreme value analysis C.1 Overview
We will discuss extremes in the right tail of a distribution, but all of what follows can also be used for extremes in the left tail by taking the negative of all the values.
Extreme value analysis is often introduced as a methodology for analyzing maxima (or minima if considering the left tail) of observations within blocks (Coles, 2001, Chapter 3), where a block is often taken as a year in climate science.A standard climatological extreme value analysis would analyze the yearly maxima of daily data in an observational dataset of multiple years.Assuming that the maximum is taken over many individual observations, the distribution of maxima follows a three-parameter extreme value distribution called the generalized extreme value (GEV) distribution, whose cumulative distribution function (CDF) takes the form, for observation y such that 1 + ξ(y − µ)/σ > 0. One can estimate parameters of the distribution (the location parameter, µ; the scale parameter, σ; and the shape parameter, ξ) using maximum likelihood (or in some cases using a technique called L-moments).Given parameter estimates, one can estimate return periods and return values using F (y).For example the probability of an observation exceeding a value z (the return value) in a given block of observations (e.g., a year) is p = 1 − F (z), namely one minus the probability that the maximum in the block does not exceed z.To estimate the probability of exceeding a specific cutoff c in our earlier notation, we simply set z = c.The return period is T = 1/p.The return value for a return period T is found using the inverse CDF, . An alternative approach to EVA that we favor is the peaks-over-threshold approach based on a point process (PP) model (Coles, 2001, Chapter 7).Instead of using block maxima, this approach more effectively uses the information in the data by making use of all the values exceeding a high threshold, u, often chosen as a high percentile of the empirical distribution of all the observations, such as the 95th or 99th percentile.Provided the threshold is sufficiently large, the distribution of the observations follows a three-parameter extreme value distribution that is equivalent to the GEV distribution.The likelihood over the n exc observations that exceed the threshold (out of the total of n observations taken across all years of data) is When one constructs the likelihood appropriately, as is the case above, the parameters are equivalent to those of the GEV distribution for maxima over blocks of size n/n t , where n t is the number of blocks.The number of blocks would generally be the number of years in climatological observational analysis and the number of total years across all ensemble members in a model-based analysis.Based on this equivalence, one can use the GEV CDF (8) with the parameter estimates, θ = {μ, σ, ξ} obtained from maximizing this likelihood (9) to calculate return periods and times as discussed in the previous paragraph.In other words, one uses the point process likelihood to estimate the parameters and then interprets the parameters as those of the GEV distribution for maxima.
All of what we've discussed above can be directly applied to short-term events such as extreme temperature or precipitation over a day or limited number of days.However, adjustments need to be made to the peaks-over-threshold approach to account for daily data not being independent from day to day, such as the use of declustering techniques (Coles, 2001, p. 99).

C.2 Estimating return values with seasonal/yearly data
It is more complicated when the event of interest occurs over a longer period such as a season or a year.We are often interested in such extreme events, e.g., seasonal drought or heat waves.Here we assume that we have one or a small number of observations in a year, e.g., a single summer precipitation value per year.A basic approach using block maxima would be to group the data into multi-year blocks and apply GEV analysis to the block maxima.The PP alternative would be to choose a high threshold and maximize the likelihood.In both cases, we have much less data than we would with short-term events.Given this, the greater efficiency of the PP approach compared to the GEV approach is particularly important.One must be careful in choosing the threshold, u, balancing the desire for more data (leading to a lower threshold) versus the need for the threshold to be high enough for the extreme value distribution to be appropriate.One might use a moderately large percentile of the data, such as the 75th or 80th percentile.Methods discussed in (Coles, 2001, Section 4) can be used to guide the choice of threshold but need to recognize that for seasonal events there are generally few data points available.In addition, care is needed to find return periods/values that can be interpreted in terms of probabilities of exceedance in a year in this context, as follows.
In the PP likelihood (9), if one takes n t to be 1 (in essence treating all the years in a single block), then a probability calculated using the GEV CDF based on θ would be the probability of exceeding a certain value in a period as long as the entire time period of the observational dataset used.This is because the equivalent GEV is the distribution of block maxima for blocks of size n/n t .Instead, one can make an adjustment so that any probability that is calculated is the probability of exceedance over a year, which can be directly converted to standard return periods.The adjustment simply sets n t to be the number of years of data (using the total number of years across all ensemble members in a model-based analysis).(Note that this has the odd interpretation for the equivalent GEV model of being the maxima over a single year, which will be the maximum over a single observation in many cases.)Alternatively, assuming independence between years, one can convert between probabilities based on multi-year blocks and single-year blocks using the relationship P , where m b is the maximum over a block of b year(s).
Finally we discuss how to set the input arguments in various R packages such that results can be interpreted in terms of probability of exceedance in a year.In the climextRemes package, please see the help information for the fit_pot() function.If using fevd() in the extRemes package, then the time.unitsargument will be relevant.In particular, if one passes in all of the observations regardless of whether they exceed the threshold, the implied value of n t is the number of observations divided by the npy value in the code.When time.units is days (the default), npy = 365.25 and when time.units is years, npy = 1.Similarly with pp.fit() in the ismev package, the implied value of n t is the number of observations divided by the argument npy.To accomplish having n t be the number of years or seasons in the dataset, for extRemes this means setting time.units to be years (or if one has multiple observations in year, setting time.units to be x/year where x is the number of observations in a year).If using the blocks argument for a stationary model in extRemes, set blocks=list(nBlocks = X) where X is the number of years.And for ismev set npy to be 1 (or the number of observations in a year if greater than one).When n w is sufficiently large, then by the Central Limit Theorem, p is approximately normally distributed and we can calculate a 90% confidence interval for p in the usual fashion (plus/minus 1.64 times the standard error).However, particularly when p is close to zero, we are unlikely to have n w large enough for this approach to perform well.

D Statistical details on various confidence interval methods
Confidence intervals for the RR To derive a confidence interval for the log of the risk ratio, which is a nonlinear function of p F and p C , we make use of a standard statistical method call the delta method (also known as the method of propagation of errors), which provides valid standard errors in the limit as the sample size gets large, relying on the central limit theorem to support the assumption of normality of the estimator.Consider the log RR expressed as a function of parameters, log RR = f (θ) (where θ = {p F , p C } in the simple nonparametric binomial count estimation context).The delta method says that Var(log RR) = Var(f ( θ)) ≈ ∇f (θ) ⊤ Cov( θ)∇f (θ), where ∇f (θ) is the gradient vector of f (•) with respect to the elements of θ and Cov( θ) is the covariance matrix of θ.Note that if θ is the maximum likelihood estimator, the standard asymptotic estimate of Cov( θ) is based on the inverse of the second derivative matrix (the Hessian) of the log-likelihood, taken with respect to the elements of θ, and evaluated at θ.
This can be most simply applied when using the nonparametric binomial count approach to estimating p.In this case, since log RR = log p F − log p C we have ∇f (θ) = { 1 p F , − 1 p C } , which can be estimated by plugging in pF and pC .Cov( θ) is where the diagonal elements can be calculated as ( 10) and the off-diagonal elements are 0 because the simulations for the two scenarios are unrelated.Hence we have se from which we can derive a confidence interval for RR in similar fashion to that for p described above.
For situations in which either of the p estimates is itself a nonlinear functional of other estimates, as in the parametric and EVA approaches, one would apply the delta method directly to log RR expressed as a function of the original parameters.For example when using EVA, one has log RR expressed as a function of θ = {µ F , σ F , ξ F , µ C , σ C , ξ C } where we have estimated GEV parameters for both factual and counterfactual scenarios that determine pF and pC .

D.2 Inverting a likelihood ratio test
A standard approach to finding a confidence interval is to invert a test statistic (Casella and Berger, 2002).The basic intuition is that for a hypothesized parameter value, θ 0 , if we cannot reject the null hypothesis that θ = θ 0 based on the data, then that θ 0 is a plausible value for the true θ and should be included in a confidence interval for θ.A confidence interval is then constructed by taking all values of θ 0 such that a null hypothesis test of θ = θ 0 is not rejected.For a 90% interval, the hypothesis test is done at the α = 0.10 (10%) significance level.
A standard statistical test that is commonly applicable is a likelihood ratio test (Casella and Berger, 2002), which compares the likelihood of the data based on the MLE (i.e., the maximized likelihood) to the likelihood of the data when constraining the parameters to represent a simpler setting in which the null hypothesis is assumed true (which in the notation above would be expressed as setting θ = θ 0 ).If the null hypothesis were actually true then as the sample size goes to infinity, twice the log of the ratio of these two likelihoods has a chi-square distribution with ν degrees of freedom.ν is equal to the difference in the number of parameters when comparing the original parameter space to the restricted space.The hypothesis that θ = θ 0 is thus rejected when twice the log of the likelihood ratio exceeds the 1 − α quantile of the chi-square distribution.
Specifically, we are often interested in the plausibility of p F = p C versus the alternative that p F > p C , so it would be natural to derive a one-sided confidence interval, RR ∈ (RR L , ∞), that gives a lower bound, RR L , on the risk ratio.We illustrate the ideas in the nonparametric binomial count context.The likelihood ratio test we use here is one where the restricted parameter space sets RR = RR 0 , so there is one free parameter, p F , with p C ≡ p F /RR 0 .The unrestricted parameter space has two free parameters (p F and p C ) so ν = 1 in this setting.For any value RR 0 we fail to reject the null hypothesis at level α when twice the log of the likelihood ratio is less than the 1 − α percentile of a chi-square distribution with one degree of freedom.The minimum over the values of RR 0 that are not rejected gives us RR L .Mathematically, this can be framed as setting RR L as the value of RR 0 such that λ(RR 0 ) = 2(log L(p F , pC ; y) − log L(p F,0 , pC,0 ; y, RR = RR 0 )) = q 1−α , which can be found via simple one-dimensional optimization or root-finding.Here pF,0 and pC,0 in the second log likelihood are the maximum likelihood estimates under the constraint that RR = RR 0 , which can be found in closed form as the solution to a quadratic equation (Farrington and Manning, 1990).q 1−α is the 1 − α quantile of the chi-square distribution with one degree of freedom.
While we have derived a lower bound using a one-sided confidence interval, we could derive an analogous upper bound using a one-sided interval from the other direction, RR ∈ {0, RR U }.
Extreme value analysis setting The approach above can be generalized to the case where one is estimating p F and p C by EVA, as proposed in Jeon et al. (2016).In this setting, we have parameters of an extreme value model under the factual scenario (denoted θ F ) and parameters under the counterfactual (θ C ).Each model will generally be a stationary three-parameter extreme value distribution with θ = {µ, σ, ξ}.We can find the unconstrained likelihood by maximizing with respect to both θ C and θ F .For the constrained likelihood, it will often be most simple to express the constraint, RR = RR 0 as p C = p F /RR 0 .Under the PP approach, we can constrain p C by letting σ C and ξ C be free parameters and setting µ C ≡ c + σ C ξ C (1 − (− log(1 − p F RR 0 )) −ξ C ) where c is the value of the event for which we are finding the probability of exceedance.This constrained maximization can be done in a manner similar to maximizing the PP likelihood, but one needs to sum the log likelihood terms for both of the scenarios.The numerical optimization to find RR L can be done as described in the previous section.

D.3 Inverting hypothesis tests for the ratio of binomial proportions
Here we provide an overview of methods from the extensive statistical literature on confidence intervals for the ratio of two binomial proportions.
In general for a confidence interval based on inverting a test, one can frame the inversion of the test based on the p-value associated with the test.In order to determine the endpoint of a one-sided 1 − α level interval, one seeks to find the value of the risk ratio for which the probability of data as or more extreme than observed (i.e., the p-value) is equal to α. (Two-sided intervals can be found by constructing two one-sided intervals.)The challenges are (1) determining the data values that are to be considered 'more extreme', (2) computing the probability of the data in light of the presence of a second, nuisance, parameter (namely that there is still a free proportion after fixing the risk ratio), and (3) carrying out the optimization necessary to find the value of the risk ratio such that the probability is α.The goal is to find a method for constructing an interval that keeps the coverage probability greater than or equal to 1 − α, while giving as short an interval (e.g., as large a lower bound) as possible.Fagerland et al. (2015) assess the various methods and find that the approximate method of Koopman (1984) performs reasonably well in simulations.They also assess various so-called 'exact' methods that seek to exactly calculate the p-values used in inverting the test statistic to derive the interval.Because of the nuisance parameter (p F ) in addition to the parameter of interest (RR), the exact methods maximize the p-value over possible values of p F .Unfortunately this maximization can involve maximization of a non-monotonic function for some data values, which makes the optimization difficult to implement.Furthermore, by ensuring that coverage is preserved for all values of the nuisance parameter (even values that are unlikely given the observed data), intervals based on exact methods can be overly conservative.Wang and Shan (2015) present a new exact approach that hopes to improve upon existing approaches discussed in Fagerland et al. (2015).They determine what data values are more extreme than observed using an inductive process.First they specify a simple, well-defined partial ordering of possible data values in terms of extremeness (e.g., 5 of 10 exceedances in F with 0 of 10 exceedances in CF is more extreme than 5 of 10 exceedances in F with 1 of 10 exceedances in CF).However the logic behind the partial ordering does not describe a full ordering -e.g., how does 5/10 vs 0/10 compare to 10/10 vs 1/10?Thus to determine the full ordering, they sequentially consider all possible orderings that satisfy the partial ordering and determine the more extreme data values as those that produce a larger lower bound.For a given ordering, one can compute the p-value by optimization with respect to the nuisance parameter, finding the worst case (corresponding to the most conservative confidence interval) probability over possible values of the nuisance parameter.The computation then does an optimization to find the smallest possible value of the risk ratio such that that worst case probability is equal to α.One difficulty is the afore-mentioned non-monotonicity, which they address using a grid search.The other is that the computations are time-consuming when the binomial sample sizes are large.

Figure 3 :
Figure 3: Median value of lower confidence bound for n = 100 for various methods and values of RR and p F .Higher values are better as they correspond to shorter confidence intervals.Simulations in which the lower bound could not be computed were excluded.

Figure 4 :
Figure4: Proportion of simulated datasets for which lower confidence bound could not be calculated for n = 100.For the Wilson, Koopman, Wang-Shan, and likelihood ratio methods this occurs only when no events occur in both scenarios.

Figure 5 :
Figure 5: Histograms of temperature (degrees Celsius) (a) and precipitation (b) anomalies for March-August 2011 over Texas from 400-member ensembles, with actual event indicated by black line.

Figures 7 -
Figures 7-10 present simulation results for a larger (n = 400) sample size, for comparison with the results in the main body for n = 100.

Figure 9 :
Figure 9: Median value of lower confidence bound for n = 400 for various methods and values of RR and p F .Higher values are better as they correspond to shorter confidence intervals.Simulations in which the lower bound could not be computed were excluded.

Figure 11 :
Figure 11: Estimated risk ratios and 90% confidence intervals for March-August precipitation over Texas by year based on binomial counts with Koopman-based confidence intervals.
σ 2 = Var(I(R > c)) = p(1 − p) is the standard Bernoulli variance.Thus the standard error for p is p(1 − p)/n w , which can be estimated by se(p) = p(1 − p) n w .
Figure 1: Coverage probability of 95% lower confidence bound for n = 100 for various methods and values of RR and p F .Values of 0.95 are optimal, while values less than 0.95 indicate undercoverage and values greater than 0.95 indicate conservativeness (overcoverage).Values lower than 0.8 are set to 0.8 for display purposes.Simulations in which the lower bound could not be computed were excluded.
Figure 2: Coverage probability of 95% upper confidence bound for n = 100 for various methods and values of RR and p F .Values of 0.95 are optimal, while values less than 0.95 indicate undercoverage and values greater than 0.95 indicate conservativeness (overcoverage).Values lower than 0.8 are set to 0.8 for display purposes.Simulations in which the upper bound could not be computed were excluded.

Table 1 :
RR estimates and 90% CI for a variety of event definitions based on binomial count approach.The last three event definitions are based on the quantiles of the CRU observations.
Figure 7: Coverage probability of 95% lower confidence bound for n = 400 for various methods and values of RR and p F .Values of 0.95 are optimal, while values less than 0.95 indicate undercoverage and values greater than 0.95 indicate conservativeness (overcoverage).Values lower than 0.8 are set to 0.8 for display purposes.Simulations in which the lower bound could not be computed were excluded.
Figure 8: Coverage probability of 95% upper confidence bound for n = 400 for various methods and values of RR and p F .Values of 0.95 are optimal, while values less than 0.95 indicate undercoverage and values greater than 0.95 indicate conservativeness (overcoverage).Values lower than 0.8 are set to 0.8 for display purposes.Simulations in which the upper bound could not be computed were excluded.