Climate Response and Sensitivity: Timescales and Late Tipping Points

Climate response metrics are used to quantify the Earth's climate response to anthropogenic changes of atmospheric CO2. Equilibrium Climate Sensitivity (ECS) is one such metric that measures the equilibrium response to CO2 doubling. However, both in their estimation and their usage, such metrics make assumptions on the linearity of climate response, although it is known that, especially for larger forcing levels, response can be nonlinear. Such nonlinear responses may become visible immediately in response to a larger perturbation, or may only become apparent after a long transient. In this paper, we illustrate some potential problems and caveats when estimating ECS from transient simulations. We highlight ways that very slow timescales may lead to poor estimation of ECS even if there is seemingly good fit to linear response over moderate timescales. Moreover, such slow timescale might lead to late abrupt responses ("late tipping points") associated with a system's nonlinearities. We illustrate these ideas using simulations on a global energy balance model with dynamic albedo. We also discuss the implications for estimating ECS for global climate models, highlighting that it is likely to remain difficult to make definitive statements about the simulation times needed to reach an equilibrium.


Introduction
The central question as to how the climate is likely to change as a function of anthropogenic CO 2 emissions can be posed as 'How does an observation of the climate system respond to changes in its radiative forcing induced by changes in atmospheric CO 2 ?'. This question has been studied in various ways for at least over a century [1,2], although efforts to answer it became more intense and in-depth over the last decades. Amongst early efforts was the pioneering work by Charney et al in 1979, who made the first estimates of expected equilibrium warming after doubling of atmospheric CO 2 (while keeping vegetation and land ice fixed at present-day values) using a numerical Global Climate Model (GCM) [3]. This metric has later been named the Equilibrium Climate Sensitivity (ECS) and is still widely used. Since then, researchers have developed a number of different metrics that measure climate response to different scenarios of anthropogenic change in CO 2 and have incorporated information from other sources besides computer models, including historical observations and data from palaeoclimate records. Recently, these efforts were summarised in an assessment of the World Climate Research Programme [4] that synthesised different quantifications of climate response using these different lines lines of evidence and led to the headline that the Earth's ECS is likely between 2.6K and 3.9K.
One of the hurdles for this assessment was the variety of definitions of (the quantification of) climate sensitivity -and ECS especially -in the literature. The root of this problem can be attributed to the lack of data on equilibrium climate states or detailed long-term transient data. This can be due to low time resolutions in proxy data, lack of observational data or insufficient computing power to equilibrate modern GCMs. Consequently, equilibrium properties need to be estimated from incomplete data sets, leading to many slightly different ways to quantify climate sensitivity. Common to them all, however, is the need to extrapolate long-term dynamics from data on shorter time scales. In this paper, we describe and discuss this extrapolation process in detail, hereby focusing on estimates of ECS using (idealised) experiments in climate models for the sake of mathematical simplicity. Of particular interest here is the exploration of linear, and non-linear, dynamics that can emerge in multiscale dynamical systems that can cause problems with extrapolation.
The common way to obtain estimates of ECS in climate models involves the use of extrapolation and regression methods on non-equilibrated transient simulations -typically of 150 year long runs. Values for ECS obtained in this way are now often referred to as the effective climate sensitivity [5] signalling that it might not encompass all long-term climate change. Although there are many different ways to perform such extrapolation, common is that it is usually based on linear concepts and frameworks. A recent review [6] of climate sensitivity highlighted that it is a key challenge to study the limits of such linear frameworks. Here, we will investigate these limits and in the process highlight the trade-offs that need to be made when designing experiments to quantify ECS: in order to measure a clear signal of warming in relation to the noise of natural variations, large perturbations are desirable but precisely in the case of larger perturbations the nonlinear behaviour becomes important and linear frameworks break down.
One of the most important tools to study past and future climate change are the GCMs as used in Coupled Model Intercomparison Projects (CMIP, e.g. [7]), because they provide a globally complete and detailed representation of the climate state while (approximately) satisfying the physical laws. However, specifically for these large models there is no way to determine whether a model really has arrived in the linear regime near an equilibrium, or even if such an equilibrium exists. In this paper we explore some simple conceptual examples of the potential nonlinear dynamics of the climate. We also make a number of observations that we hope illuminate some of the limitations of linear frameworks. (i) We highlight cases where there may be strong dependence on the climate background state and the forcing levels. (ii) We highlight examples where there may be a good fit to transient data but poor extrapolation preventing an accurate estimation of the ECS. (iii) We show that nonlinear systems can have slow tipping points. When these are crossed the tipping dynamics play out on slow time scales, and it can take arbitrarily long times before nonlinear and/or asymptotic behaviour is observed. (iv) We demonstrate how in the presence of multiple-timescales with nonlinear feedbacks a late tipping can occur in which fast processes suddenly dominate after arbitrarily long slow transient behaviour. This highlights the potential for slow and/or late tipping points to be particular obstructions to estimating ECS.
The rest of this paper is organised as follows: in the remainder of this section we discuss in general the response of a nonlinear system to forcing. In section 2, we consider the equilibrium response and equilibrium climate sensitivity of the climate system in terms of limiting behaviour. Moreover, we point out the challenges that arise when estimating those from short time series, highlighting the trade-offs that emerge in terms of perturbation size and required simulation time. In section 3, we examine the nonlinear effects that may appear as a result of climate dynamics on multiple timescales, including slow tipping which may in turn lead to late but rapid tipping. We illustrate these effects using multi-scale global energy balance models with dynamic albedo and/or chaotic variability and an example from a LongRunMIP abrupt8xCO2 run [8]. Finally, we briefly discuss these results, and the influence of time-varying forcing on estimation of climate response and sensitivity in Section 4.

Response of nonlinear models to forcing
Consider a notional state of the climate system y(t) for t > t 0 that evolves in response to various (unknown) forcings, with a partially known initial state y 0 and an input of atmospheric CO 2 generating a radiative forcing ∆F (t) that is specified for t > t 0 . We write this climate state at time t as where Y t is an evolution operator that evolves forward the initial state y 0 (at time t 0 ) up to time t according to a climate model Y with (possibly time-dependent) radiative forcing ∆F . Given a scalar observable O that maps the full climate state y(t), the response clearly depends on the choice of observable O, the choice of model Y , the forcing ∆F experienced by the system, the initial climate state y 0 at time t 0 and the time moment t > t 0 of interest.
At the level of a single initial state y 0 starting at t 0 of which we have perfect knowledge and subject to deterministic forcing ∆F , the response in the observable O at time t > t 0 is the difference in the observable's value at times t 0 and t, i.e.
This corresponds to a two-point response in the terminology of [9]. As this is often the easiest response type to think about mathematically (and extensions to other types are possible albeit more technical), it is this response type we will be referring to throughout this paper. However, often we are interested not in specific trajectories but rather in the distribution of possible responses for a probability distribution µ 0 of initial states and forcing ∆F . In this case we write the response as This corresponds to a distributional response, namely it is a random variable with some distribution determined by the "pushforward" of the initial probability distribution µ 0 by the dynamics. Furthermore, there are different interpretations of (3), depending on the choice of probability function. These include: • A physical measure on a climate attractor [10,9]. This can be an observable measured in a long palaeoclimate time series, or an observable in a model, where the attractor is (partly) known from the underlying model equations.
• An ensemble of initial conditions that are thought to sample subgrid processes in a model (or observational data).
• An empirical measure for a finite segment of trajectory, i.e. a choice of states on {Y t (y 0 , t 0 , 0) : t ∈ [t 0 , t 1 ]} over some finite interval with t 0 < t 1 , with equal weight to any given time instant. Such a measure can be approximated from a finite length time series of a palaeoclimate record.
Note that µ 0 is a physical measure means that for typical initial conditions the empirical measures converge to one and the same distribution: for a more precise definition of a physical measure, see for example [11,12]. If there are multiple attractors then there can be several physical measures, and typical initial conditions converge to one of these depending which basin of attraction they are in.

Equilibrium Response and ECS as limiting behaviour
While the response on any time scale can be relevant, often the asymptotic, or equilibrium, response as t → ∞ is considered first. This response is typically easy to analyse and understand in simple models. Taking the limit t → ∞ of (2), the equilibrium response is: Of course, this begs the question of whether the limit exists. In particular, one cannot expect such limit to hold for any forcing ∆F . For instance, if the forcing specifies uninhibited and constant emission of greenhouse gases, the climate system will not evolve to any equilibrium. Hence it makes sense to limit ourselves to forcing scenarios that have constant forcing levels as t → ∞ (i.e. ∆F (t) → ∆F * as t → ∞). In practical model studies of equilibrium climate sensitivity, often the forcing is just taken as a constant throughout the whole simulation.
Of particular interest is the equilibrium response to an instantaneous and abrupt doubling of atmospheric CO 2 , which we indicate by the forcing ∆F abrupt2xCO2 . Then, the equilibrium climate sensitivity (ECS) is defined as the response of global mean surface temperature (GMST) to such forcing, i.e.

ECS(y
Even for such idealised forcing, such a limit may not be well-defined. In any but the simplest models, the asymptotic climate state will have stationary internal variability, for which the limit of the two-point response is not well-defined without first averaging for long enough that any internal variability is averaged out. In such cases, a distributional response may have a well-defined limit, although it can happen that even these do not converge in cases where there is non-ergodic behaviour [13]. It is difficult to say anything definitive about the convergence of climate response in state-of-the-art GCMs. These models are numerical representations of the underlying physical equations, which have been developed to include many physical processes and ever-improving parametrizations of sub-grid scale processes; they are very high-dimensional and complex. We do not have access to the attractors of these models and so cannot exclude the possibility of poor or no convergence. These models are roughly calibrated only by assessing how well they can reproduce the present day climate, including the historical period. However, in practice, reaching the true equilibrium may also be less relevant with such a model; the physical state of the climate systems after a few centuries or even millennia could be difficult to predict anyway because of incomplete knowledge of the initial state y 0 , model details and forcing. For these reasons, a pragmatic Effective Climate Sensitivity [5,14] is often taken, in which response over a few centuries or millennia is taken, ignoring dynamics on longer time scales. However, we focus here on cases where the limit in (5) is well-defined.

Background State, Forcing Scenario and ECS
In (5), it is clear that the equilibrium climate sensitivity depends on the initial condition y 0 or background state where the latter refers to the initial climate attractor. However, often ECS is given without explicitly stating initial conditions. This can lead to ambiguity about what is meant by ECS when comparing simulations of current and palaeoclimates. Because of the possibility of multistability of the climate system, even for the same CO 2level, may support multiple climate states. In physical terms, the dependence on the background state originates from feedback processes that changes as the forcing is applied [10], necessitating a proper communication of the background state considered when computing the ECS of that background state.
Further, in the definition of ECS (5) a doubling of atmospheric CO 2 is given as forcing scenario. However, in practice, ECS is often used as a measure of temperature increase per CO 2 doubling. So by assuming linearity of the climate response to forcing levels, ECS is employed to estimate warming for other CO 2 forcing levels.
Specifically, for an abrupt 2 γ xCO 2 forcing, an assumption of linear response would mean that warming of γ times the ECS is expected: lim t→∞ R GMST,Y (t; t 0 , y 0 ; ∆F abrupt2 γ xCO2 ) = γ ECS(y 0 ). (6) Certainly, this assumption will fail when γ is large enough that a tipping point is crossed, but even when that does not happen such linear assumption only holds when the forcing is small enough that nonlinear terms can be ignored. It has been shown that this linearity assumption in fact does break down in GCMs. For instance, palaeoclimate simulations with a wide range of CO 2 -concentrations suggest such linearity can be broken [15] and multi-millennial experiments in the model intercomparison project LongRunMIP [8] also show deviations from linearity; it was found that abrupt4xCO2 experiments lead to more than twice the warming of an abrupt2xCO2 experiment in the same GCM. Further, abrupt8xCO2 experiments led to less than twice the warming of an abrupt4xCO2 experiment. Hence, the usage of ECS as a linear predictor for warming based on CO 2 levels can easily lead to over-or underestimations of warming.

Challenges to estimating ECS from timeseries
It is computationally expensive to run state-of-the-art GCMs and in principle, millennial length simulations may be needed to get close to equilibrium (see e.g. the LongRunMIP [16]). Because there exists variability on many time scales and spatial feedback patterns in these models, there is no a priori method to determine when or indeed whether a nonlinear model has reached equilibrium. This means that the equilibrium response of a climate model cannot be directly found from time evolution of the model; instead, one needs to derive and extrapolate the equilibrium properties of the model from possibly relatively short transient data.
In general, estimation of ECS for a model (such as a GCM) involves four steps:  Table 2] that lists 11 different methodologies. However, the most common standard for estimating ECS uses a technique by Gregory et al [17]. Typically, a single abrupt CO 2 -forcing experiment is run (starting from pre-industrial forcing levels, standard is to use an abrupt 4xCO2 forcing) for some years (150 years is the benchmark for CMIP6 models). The transient data on change in the yearly and globally averaged observables near-surface-temperature ∆T and top-of-atmosphere radiative imbalance ∆N is fitted to the linear model ∆N = λ∆T + f . Then, equilibrium warming ∆T * is estimated setting ∆N = 0 in this linear model (since, in equilibrium, there should be radiative balance), yielding ∆T * est = −λ −1 f . Albeit its predominant use in climate sensitivity analyses in GCMs, it is clear that GCMs are not well-approximated by this simple linear model over all time scales; because climate feedback processes operate at quite different timescales, ∆N and ∆T will have a non-linear relationship that has non-zero curvature over the course of a long simulation, and the linear relationship only holds approximately for certain time intervals [18,8,19,20]. Better fits to the response over all the time scales can be found by considering a combination of several linearly decaying modes, i.e. by viewing the climate system as a combination of linear processes with quite different time scales [21,18,22,23].
Other protocols use results from the literature of linear response theory directly [24,25,26,27,28,29,30,31,32]. That is, in relative generality, the response (of an observable O) in the linear regime of a (non-linear) system to a forcing can be characterised via a (causal linear observational) Green's function G [O] (t). Specifically, the yearly and globally (and ensemble) average near-surface-temperature increase ∆T at time t under a certain forcing scenario ∆F is given by the relation Using this relationship, transient data can be used to estimate the Green's function G [T ] from which the equilibrium response can be extrapolated -which can be done through fitting to some prescribed function (typically a sum of decaying exponential functions) or through a discrete Fourier transform algorithm.
For all the fitting and extrapolation protocols, the optimal choices in the protocol are not always obvious as certain trade-offs need to be made:  Figure 1 -Schematic diagrams illustrating trade-offs between perturbation amplitude and integration time when computing ECS on perturbing a linearly stable state of a nonlinear climate model. The light blue regions illustrate the trade-off needed to give a good signal-to-noise ratio of the estimate of ECS. The pink region illustrates the trade-off needed to ensure the system has entered the linear regime. The green "Goldilocks zone" shows points where accurate prediction of ECS is possible. (a) illustrates a case where the state is globally stable while (b) shows a case where a large enough perturbation (above the red bar) pushes the system out of the linear regime -perturbations above this may in principle give super-long transients and/or convergence to another stable state. Finally, (c) shows a case where accurate estimation of ECS is not possible.
1. The simulation time needs to be as long as possible to ensure (a) we are in the linear response of the final equilibrium state and (b) fluctuations caused by natural variability can be averaged out. However, long simulations for GCMs are computationally expensive and even these will not be able to detect slow timescales beyond the length of simulation time.

2.
A large ensemble and/or a long time period for fitting needs to be chosen to reduce noise caused by internal variability. However, each additional ensemble member increases the simulation effort and the time period for fitting needs to start as late as possible to maximise the chance of being in a linear regime.
3. The perturbation needs to be as large as possible to maximise the signal-to-noise ratio for the fitting procedure. However, large perturbations may result in nonlinear effects, including tipping into different climate states. Figure 1 illustrates two important trade-offs between perturbation size and integration time. In particular, the figure highlights the need to find a "Goldilocks Zone" where the perturbation is neither too small nor too big. Examples of these trade-offs in a nonlinear setting using an conceptual energy balance model are discussed within Section 3.

Slow linear responses and ECS
We start by illustrating some challenges that already arise in the linear response regime of a model. In such setting, extrapolation can be difficult if the time scale of the slowest response exceeds the length of timeseries available. To illustrate this, we now consider the evolution of a linear observable O of a finite M -dimensional linear system. In the absence of repeated eigenvalues, the Green's function will be a sum of exponential functions (with exponents being the eigenvalues) with the following functional form: where λ j ∈ C represent eigenvalues of the linear system and β [O] j ∈ R depends on corresponding eigenvector and observable; often λ j is restricted to the negative reals but more generally they may be complex with oscillatory decay (see e.g. [33]).
Estimating the Green's function for high (or infinite) dimensional systems can be extremely challenging -not least because linear operators in infinite dimensions may have a continuous (operator) spectrum. Nonetheless, one can assume a functional form for G [O] (t), and fit parameters from transient data. This approach has been applied successfully to many response problems in the climate system, see e.g. [32,33,29,34].
Let us now assume that (7) holds for the Green's function, and restrict to λ ∈ R. Even then, the number of modes M needs to be determined, and that comes with its own problems as shown in Figure 2. This figures compares responses of an observable given by one of the following:  Figure 2 -Examples of the response of observables ∆O1 (black), ∆O2 (blue) and ∆O3 (red), sums of exponential functions as defined in (8). Note that the t-axis is given in log-scale to highlight how long these different equations stay almost indistinguishable: the red case corresponds to a linearly unstable setting, i.e. a 'run-away' scenario..
These three examples differ only by the absence/presence of an eigenvalue λ with |λ| small: only the first two are bounded and these have different asymptotic values; the third describes a 'run-away' response. Nonetheless, Figure 2 shows that all three observables are indistinguishable at first; only over longer time scales does the effect of the small eigenvalue become apparent. It is practically impossible to determine which of these functional forms is correct from short-time transient data only. In the climate system, the dynamics play out over many different time scales [35,36]. Hence, this should play an important role in understanding GCM experiments. In particular, it is important to try to determine time scales on which the constructed estimations and extrapolations can be trusted, as there seems to be no way to completely rule out slow warming, or even slow tipping, on all slow time scales. GCMs very often do not include the very slow climate components such as land ice sheets dynamically, but still need very long spin-up times and almost never are integrated to full equilibrium. For example palaeoclimate experiments with GCMs typically show considerable drifts in the globally averaged ocean temperature after several millennia of simulation, while already in good radiative balance (e.g. [37]).

Nonlinear response and ECS for climate models
In the previous section, we discussed potential problems associated with timescales that can affect estimation of ECS even for linear systems. In this section, we turn our attention to issues related to non-linear response. Here, a particular challenge are tipping points where fast dynamics can suddenly take over even after long, very slowly evolving transient periods; this is impossible in a purely linear system.
To make our considerations in this section more explicit, we consider a global energy balance model (GEBM) that has dynamics on two timescales and the possibility of tipping phenomena on a slow or a fast timescale. We introduce the model in subsection (a). Then, we consider tipping-related effects in this model due to time-scale separation of physical processes in subsection (b) or due to internal variability in subsection (c).

A fast-slow energy balance model
We consider a GEBM of Budyko-Sellers-Ghill type [38,39,40], which describes the evolution of GMST T according to the model where C is the specific heat capacity, Q 0 is the incoming (predominantly short wave) solar radiation, α is the planetary albedo (so that Q 0 α is the reflected solar radiation) and εσT 4 is the outgoing (predominantly longwave) Planck radiation (with planetary emissivity ε and Boltzmann constant σ). Further, µ represents the mean radiative forcing due to increases in CO 2 and µ N V (t) models variability in radiative forcing, assumed to have zero mean. Following [41], we assume is the concentration of atmospheric CO 2 at time t, and µ 0 is a reference radiative forcing level for a CO 2 concentration of ρ(0). When albedo and/or emissivity are taken to be temperature-dependent, i.e. α = α(T ) and/or ε = ε(T ), the model can have multiple stable climate states each with different climate sensitivity. In this paper we assume there is relaxation towards an equilibrium albedo α 0 (T ) at a rate τ α ≥ 0 We assume a temperature-dependent equilibrium albedo α 0 (T ) given by (12) and an instantaneously settling emissivity ε(T ) given by Both of these functional forms are of sigmoid-type, and change from one constant to another as T moves through a range of temperatures near T α,ε [9]; α 0 (T ) models the (relatively slow) lowering of albedo in the presence of land ice sheets, while ε 0 (T ) models a (relatively fast) transition from a clear to a cloudy planet with large quantities of low cloud. Each of them on their own can lead to a bistability between a colder and a warmer climate state but we include both to allow the possibility of independent slow and fast tipping points. In fact, we believe that both the (slowly settling) temperature-dependent albedo and emissivity are required to have some of the later illustrated phenomena -late tipping in particular -that do not present themselves in models with constant albedo or emissivity. We include natural variability of the energy input at the surface represented by chaotic forcing through a Lorenz-63 model, i.e. natural variability µ N V is given by where x adheres to the Lorenz-63 model, which conceptually represents the chaotic dynamics of weather processes so that ν NV is a measure for the strength of the variability and τ N V is the characteristic timescale of chaotic variability. Parameter values used in the simulations in this paper are given in Table 1, except where stated otherwise.
There are two special parameter settings that we distinguish. We say there is dynamic albedo if τ α > 0; in the case τ α = 0, albedo settles instantaneously so that we can eliminate (11) and set α = α 0 (T ). We say there is chaotic variability if ν N V = 0; in the case ν N V = 0, there is no internal variability and we can eliminate the chaotic Lorenz-63 model (15).
It is well known that in the case of no internal variability, equations (9) can be bistable [38,40]. Due to the functional forms of temperature dependent albedo and emissivity, the model (9) can have one, two or three stable equilibria depending on the parameter values. This is organized by a fifth order "butterfly" singularity [43]: see Appendix A for a verification and in-depth analysis of the bifurcation structure of this model. Nonetheless, for the parameter values given in Table 1, the model is bistable for a certain range of values of the parameter µ: in this bistable region, the model supports a stable cold "icehouse" and a warm "hothouse" climate state (see Figure 3).
Note that the ECS of both type of states (for the same CO 2 -level) differs between branches as albedo and emissivity are different between branches. However, the ECS within a branch is also not constant: Figure 3(b) shows variation between initial points y 0 that lie on the same branch (intra-branch differences). In the climate literature, these variations are not well-quantified, mainly because they depend on a multitude of physical feedback processes, which are difficult to observe and model numerically in full [4,44]. Still, it is good to keep in mind that observed or estimated ECS might vary as the (initial) climate state changes.

Nonlinear response: slow and/or late tipping and ECS
As discussed in section 2, if the transient relaxation dynamics of a climate model is approximated well by a linear system this can be used to estimate ECS. This thus works for nonlinear systems with small enough forcings. For example Figure 4(a) shows a simulation of (9) with parameters C in Table 1 subjected to an abrupt2xCO2   Table 1. The bifurcation parameter µ represents radiative forcing due to atmospheric CO2. Solid lines correspond to stable equilibria, and dashed lines to unstable equilibria. There are two different branches of stable equilibria: one that corresponds to a cold climate (blue) and one that corresponds to a warm climate (red). (b) Equilibrium 'two-point' response for different forcing levels corresponding to 2 γ ×CO2 starting from an initial state T0 = 293K corresponding to equilibrium temperature before perturbation. The red part of the figure correspond to end states on the warm branch; the blue part to end states on the cold branch. The dashed line indicates the location of a tipping point. (c) ECS (i.e. equilibrium two-point response to CO2 doubling) as function of the initial temperature T0. Blue lines indicate starting points on the cold branch; red lines indicate starting point on the warm branch; the grey region corresponds to unfeasible initial temperatures (i.e., they lie on the unstable branch in (a)). The large peak in the blue line corresponds to tipping from the cold branch to the warm branch; the location of this tipping point is indicated with a dashed line.   Table 1) subjected to an abrupt2xCO2 forcing. The initial forcing µ0 is chosen such that there is an initial equilibrium is T0 = 255K.  C D units C 5 × 10 8 · · · · · · · · · Jm −2 K −1 Q 0 341.3 · · · · · · · · · W m −2 σ 5.67 × 10 −8 · · · · · · · · · W m −2 K −4 α 1 0.7 · · · · · · · · · α 2 0.289 · · · · · · · · · T α 274.5 · · · · · · · · · K K α 0.1 · · · · · · · · · K −1 ε 1 0.5 · · · · · · · · · ε 2 0.41 · · · · · · · · · T ε 288 · · · · · · · · · K K ε 0.5 · · · · · · 0.1  (11) and chaotic variability (15) used in the numerical simulations. We take the standard choice for the Lorenz parameters: σ = 10, ρ = 28 and β = 8/3. For the simulations, time t is rescaled to years. The equilibrium albedo is given by (12) and (equilibrium) emissivity by (13). The forcing µ is given by (10) and νNV represents the amplitude of a chaotic forcing via (14). The use of "· · · " indicates that values of column A are also used in this case.
forcing. The initial forcing µ 0 is chosen such that there is an equilibrium at T 0 = 255K. There is a clear two-stage exponential decay to equilibrium with ∆T * ≈ 3.1K: the right panel of Figure 4(a) shows that there is a nearby equilibrium attractor. Figure 4(b) shows estimates using the Gregory method on rolling windows of 150 years. Within a time window, we regress the time series of ∆T and ∆N := C d∆T dt to the linear model ∆N = f + λ∆T , which gives estimates for the forcing f and the dominant feedback parameter λ. The regression is performed using the MATLAB fit to linear model fitlm; standard errors for best fit are shown, and the bottom panel shows the adjusted R 2 -statistic for this window, where R 2 = 1 implies all variance in the signal is described by the model within the window ending at that time-point. Equilibrium warming is derived from these fits by extrapolation of the linear model, giving ∆T * est = −f /λ. Note that the initial 150 year fit is already good. Indeed one can see decreasing signal-tonoise ratio and R 2 for fits taken later in the time series, as noise dominates the dynamics of the state this late in the simulation.
A second equilibrium estimation protocol is shown in Figure 4(c) in which blocks of 150 years are fitted to a decaying exponential function T (t) = T ∞ + be λt using the MATLAB fit to nonlinear model fitnlm. This gives an estimate for T ∞ , b and λ with standard errors and is a direct approximation of a linear response to a Heaviside input; we show λ and ∆T * est = T ∞ − T 0 . Observe that, similarly to the Gregory fits, these fits also become degenerate for later time frames.
To contrast with Figure 4, Figure 5 shows the case for an abrupt 4xCO2 forcing but otherwise identical parameters and initial condition, in which the transient dynamics are not approximated well by a linear system, although a long transient period (due to the crossing of a slow tipping point) conceals the nonlinear dynamics. Figure 5(a) shows that the run seems to rapidly approach an equilibrium, but warming then continues slowly as albedo slowly decreases. Then, around t = 1500 years, there is a surprising and rapid "late tipping" followed by a relaxation to the final equilibrium. From the fits in Figure 5(b) and (c), approximately linear behaviour can be seen at first; however, we are near (but beyond) a fold bifurcation on the stable part of the slow manifold where the blue and red nullclines become tangent (i.e. a slow tipping point), and for this forcing the nullclines are barely detached. As the state passes this point (sometimes called a ghost attractor), the dynamics on the slow manifold speed up before tipping over a fold in the slow manifold, causing a rapid late tipping event to another stable branch of this slow manifold. Figure 5(b) shows estimates using a Gregory fit. It can be seen that a fast decay is picked up initially, and slower decay dominates from about t = 250 years. At around t = 500 years, the fitted value for λ passes through zero, suggesting a linearly unstable climate, and the estimated warming becomes unreliable. Only after the late tipping event, from t = 1750 years onwards, the fits make sense again, with negative λ and sensible warming estimates corresponding to the actual equilibrium warming of the simulation. Similarly, for the exponential fit shown in (c), λ ≈ −0.05 corresponding to the initial fast decay but this quickly decays to pick up the slow decay with λ ≈ −0.001 by about t = 250. The fit remains good up to t ≈ 500 years but after this the estimated errors   Table 1) subjected to an abrupt4xCO2 forcing. The initial equilibrium is T0 = 255K. Here, a late tipping event happens as the dynamics drive the system over a fold point of the slow manifold. (a) Time series for ∆T and ∆α as well as a the trajectory through phase space (cyan). The red dotted curve in the right panel denotes the nullcline on which dα dt = 0 and the blue dotted curve denotes the nullcline on which dT dt = 0, which also acts as a slow manifold. Fits (b,c) as in Figure 4.
on ∆T est increase rapidly as the fit attempts to fit a decaying exponential to something that is actually growing slowly but exponentially. At t ≈ 1500 years the system passes through the late rapid tipping before settling to a fit to ∆T * est ≈ 72. Clearly, in both of these fitting approaches the true equilibrium warming is not estimated accurately at all until after the late tipping event when the system is again approximately linear. From the fits up to about t = 500 years there are no obvious hints that anticipate this late tipping and the fit results seem to indicate convergence to a noisy equilibrium state (hence for example the low R 2 score for the Gregory method as it is mostly noise at this point). Only after t = 500 years there start to be some signs of the passing of a slow tipping point (λ > 0 in the Gregory method and large uncertainties in the exponential fit method) in this example, as the almost-equilibrium (ghost attractor) on the slow manifold is passed around this time.
Comparing Figures 5 and 4, we see very similar fits and estimates up to t = 500 years, further indicating the difficulty of distinguishing scenarios with and without late tipping. Moreover, the perturbation that exceeds the threshold shown in Figure 1(b) lies somewhere between 2xCO2 and 4xCO2 for this model and parameters. Figure 6 shows an analogous simulation of (9) under abrupt4xCO2 forcing but parameters D of Table 1. Again, the initial forcing µ 0 is such that there is an initial equilibrium at T 0 = 255K. For these parameters there is no fold in the critical manifold meaning that there is not a rapid late tipping (in the bifurcation sense).However, similarly to Figure 5, the initial (linear) warming is not representative of the equilibrium warming and the transient means one can only see evidence of the final state after t = 1500 years.
This indicates that even in the absence of (late) tipping points, an initial good fit cannot exclude a later rapid warming phase in systems that have dynamics on multiple time scales. For all three simulations presented in this section, extrapolations from fits to the initial few hundred years look very similar, although their long-term behaviour is very different, again highlighting that extrapolations may only be accurate after long transients that bring the system into a linear regime.

Ensemble variability and ECS
When estimating ECS in models with internal variability, one of the ingredients is the precise choice of the initial conditions y 0 . In Section 22.1, we already discussed that the background climate state (i.e., the initial attractor A 0 ) influences the transient and equilibrium response to forcings. However, also the precise initial state y 0 on the initial attractor will impact the observed transient dynamics and can potentially also change the final equilibrium state. We illustrate such situations in this subsection. Figure 7 (a,b) show an abrupt4xCO2 experiment for an ensemble of different initial states on the same initial attractor (a warm climate state), for a simulation of (9) with parameters B of Table 1 -note the presence of chaotic variability. There is potential variation in the warming of the different ensemble members during the transient, which stems from different realisations of the natural variability, corresponding to the different initial states. Gregory fits over a time window starting at time 0 up to time t are shown in Figure 8(a). The associated regression to individual ensemble members (black) are poor, but the regression to the ensemble average (red) is much better as the noise (internal variability) is averaged out.
Another example is given in Figure 7(c,d) for a different initial attractor (a cold climate state), where natural variability pushes the state over a tipping point at different times during the simulation of each ensemble member. The simulations initially suggest relaxation towards a state close to the original colder state, but later they consistently exhibit tipping to a different (and much warmer state). In this example the colder state is almost at equilibrium. As long as the natural variation in forcing is small enough, the system remains close to the colder state. For larger fluctuations the system tips into the warmer state. Figure 8(b) shows that even the ensemble average is not adequate to estimate ECS in this case; accurate estimates can only be made if the model has been run until (almost) all individual ensemble members have tipped. Nevertheless, the ensemble averaged response is still much better than the other approaches, because data from tipped and non-tipped ensemble members leads otherwise to very unreliable bimodal estimates with high variance.
Even worse, for non-constant forcing the equilibrium response may depend more drastically on the precise initial state y 0 ; some part of the initial attractor A 0 can be attracted to a final attractor A 1 , while the rest is attracted to a different final attractorÃ 1 . This effect has been called a partial tipping of the attractor and is studied abstractly in [45,9,46]. Because of the relative simplicity of the chaotic GEBM (9), we cannot show this behaviour for constant forcing, but we can illustrate this phenomenon by forcing the model temporarily with an abrupt4xCO2 forcing, after which the initial CO 2 -levels are restored at time t = 75 years. Figure 9 shows the results of this experiment. One can clearly see that some ensemble members experience tipping but others do not. In this situation (details not shown), partial tipping means that none of the ECS estimation techniques will paint a full picture. The ensemble-average does contain some information on the number of tipping and non-tipped states but we suggest more meaningful estimates would need to be made for the attractors separately, first by categorising each individual ensemble member as tipped or not, and using estimation techniques on these   Table 1) subjected to an abrupt4xCO2 forcing. The initial equilibrium is T0 = 255K. The left panel shows time series for ∆T and ∆α. The right panel shows the trajectory through phase space (cyan). The red dotted curve in the right panel denotes the nullcline on which dα dt = 0 and the blue dotted curve denotes the nullcline on which dT dt = 0, which also acts as a slow manifold. Note that for the parameters D there is no longer a late tipping in T of the speed as seen in Figure 5

Evidence of late tipping within GCM runs
For GCM runs with conditions corresponding to the relatively stable conditions of the Holocene pre-industrial climate, the accepted wisdom is that we do not expect to find any major global tipping effects as extreme as the ice-house to hothouse transitions explored above. Nonetheless there are hints that we may be close to regional tipping points such as changes in the Atlantic Meridional Overturning Circulation (AMOC) or West Antarctic icesheet collapse, and some emissions scenarios are likely to take us over these tipping points. Crossings of these regional tipping points can result in a global signal, such as changes in the AMOC leading to global climatic changes [47,48]. Further, as emission reduction scenarios may take us over tipping points only temporarily [49], also the possibility of a partial tipping of an attractor may be very relevant to study in GCMs.
Initial conditions for GCM runs are notoriously difficult to set -they are typically taken as the end of a spin-up simulation, or as a state at some time during a control experiment (in both of which atmospheric CO 2 is kept fixed at the starting levels). In ensemble runs, variation of initial states on the initial attractor are sometimes explored either by sightly perturbing an initial state (called 'micro-perturbations'), or by taking several states of a control run, typically separated by a few months up to a few years, depending on the time scale of the internal variability that is being considered (called 'macro-perturbations') [50,51]. Nonetheless, even after substantial spin-up there may be continued variability that can cause extrapolations such as Effective Climate Sensitivity to continue varying over centennial timescales [5]. For example, [52,53] find multi-century changes in an atmosphere-ocean GCM, mostly to do with the strength of the AMOC, depending on the magnitude of the CO 2 perturbation.
The response of GCMs can also include late rapid changes. An example of such a late warming event is visible around year 2, 300 of the abrupt8xCO2 run in the model CESM 1.0.4 within LongRunMIP [8]. Figure 10 shows features of this run, along with associated abrupt2xCO2 and abrupt4xCO2 runs of the same model for comparison. In (a), the time series for the increase in (yearly averaged) global mean near-surface temperature is shown. For the abrupt8xCO2 experiment a late and sudden increase can be seen around t = 2300 years (highlighted in red in the figure), which is not present in the other experiments. We have analysed this data using the Gregory method on millennia-long rolling windows (to suppress the natural variability on shorter time scales) in (c-d). We found an increase in the feedback parameter λ around the same time, and also an underestimation of the equilibrium warming for t < 2400 years. This is similar to our findings in a conceptual energy balance model ( Figure 5) albeit less distinct. Hence, we suggest that this late warming event in the abrupt8xCO2 run could be an example of a late tipping event in a GCM. Appendix B.1 illustrates that this tipping behaviour is probably due to a qualitative regional tipping of the AMOC which appears for the 8xCO2 run, but is not present in the 2xCO2 or 4xCO2 runs. However, we note that unlike in Figure 5, the tipping for the 8xCO2 run is of transient nature: the final state is an "AMOC on" state in all cases.

Conclusion and discussion
Although many authors have pointed out deficiencies with estimating and using equilibrium climate sensitivity (ECS), it clearly remains an important metric for understanding the response of climate models to changes in forcing CO 2 . In particular, although there may be problems with timescales, low frequency variability and lack of linearity, ECS and variants of it are key metrics that find their way (for example, via integrated assessment models of the socioeconomic impact of an emissions pathway such as used in [54,55,56]) into decision making about climate change and its likely impact on human activities. In this paper, we have illustrated how such linear concepts could break down in many different ways, even after long transient periods in which they seem valid, when nonlinear dynamics start to play a role. Although we have focused in this paper on climate response to idealised abrupt CO 2 forcing scenarios, we also want to stress that in multistable nonlinear systems the precise outcome can also depend on the pathway taken -that is, not only the amount of emissions but also the moment of emissions can be important. This further complicates and challenges too simplistic linear frameworks. See for instance [57,49] for examples in conceptual settings, as well as [58] for a discussion on how this can strongly influence integrated assessments.
Climate is a multiscale process that takes place on many fast and slow time scales, so it is unrealistic to assume that all dynamics can be modelled by a univariate linear model. Moreover, we also cannot expect to estimate processes that take place over substantially longer time scales than the simulated duration. As we have illustrated in this paper, this means that even in the case of pure linear response, ECS cannot be accurately estimated unless the simulation times are long enough to resolve the slow timescales such as those common in large scale ocean dynamics or land ice sheets. On top of that, with the examples in Section 3 we have illustrated how nonlinear effects in a multiscale climate model can lead to additional warming effects -such as slow and late tipping, with       long transients without obvious hints of these late events. These examples demonstrate that even if a fit is very good for a long period of time, there still may be large and abrupt late tipping points.
In section 2 and in Figure 1, we have introduced several trade-offs that need to be made when estimating ECS for a climate model. It would be of great interest to locate the "Goldilocks Zone" in which reliable and accurate estimates ECS are possible, in order to give suitable protocols for experiments with GCMs. In particular, it would be good to understand (a) the minimum times and ensemble sizes needed to reliably estimate ECS and (b) the thresholds in perturbation size for general GCMs that lead to tipping behaviour. This will depend not just on the current climate state but also on the processes that are included in the model and the form of the forcing. We suggest there is a need to find criteria that imply that an estimation protocol will work -and on which time scales. For instance, the Gregory method when applied on data from one decade can typically predict a few decades but is unlikely to be predictive on the scale of centuries; similarly, if only 150 years of data is available, it is unlikely to obtain an accurate estimation on millennial time scales.
The most drastic examples of nonlinear response given in this paper concern tipping phenomena. This begs the question of how relevant this is for future projections with GCMs. After all, in these models the GMST response is typically fairly linear to changes in forcing levels, and the transient response seems linear over quite long timescales. This might suggest that tipping points for GMST are not very relevant. However, the parameter space of such models has not been sufficiently explored to capture past and future tipping [59] and, consequently, under standard settings (optimised for stable Holocene pre-industrial climates) those GCMs may operate in a too stable manner [60]. Simultaneously, local or regional tipping has been observed more frequently in GCMs [61], and can be observed in past climate records [35]. Tipping effects at regional levels may give only a small signal in the global average (although e.g. the AMOC restoration in the abrupt8xCO2 experiment in Figure 10 is visible in GMST). Indeed, one might conjecture a global redistribution that almost averages out in data of GMST -similar to what is described in [62,63]. However, such regional tipping is much more problematic than a global mean signal might indicate, as local impacts can be very dramatic. Moreover, when several regional tipping elements are involved, cascading effects may occur [64,65] opening the possibility of an eventual global response for example through triggering of additional carbon cycle feedbacks. It remains an important issue for future climate projections to determine which tipping points may be crossed on time scales of centuries to millennia. It also highlights the importance of going beyond classifying climate response only via GMST, to look at spatial responses and other observables.

Data statement
Simulation data from models in LongRunMIP data.iac.ethz.ch/longrunmip/, including the here used model CESM 1.0.4, requests for access can be made to the coordinators of longrunMIP. More information and details of the simulations can be found on longrunmip.org and in [8]. The numerical code to simulate and subsequently analyse the conceptual energy balance model introduced in equations (9), (11), (14) is available from https://github.com/peterashwin/late-tipping-2022

Author contributions
All authors designed the study. RB and PA undertook the computer simulations and analysis. All authors edited the final text.
A Bifurcation structure of the GEBM In the absence of chaotic forcing (ν N V ≡ 0), the energy balance model (9) has equilibria that are independent of the value of τ α ≥ 0; hence we study here the equilibria for the case of no dynamic albedo. That is, we consider with α 0 and ε 0 as in (12), (13). To study the bifurcation structure of this equation, we apply the scalings y := K α T and s := σ(ε 1 + ε 2 )/(2K 3 a C)t, and we introduce the following (composed) parameters Then (16) becomes This system has an equilibrium at y = y r when f (y r ) = 0 for a given set of parameter values. However, bifurcations can occur as parameters change. At such bifurcation points, not only f (y r ) = 0, but also derivatives of f with respect to y vanish. How many derivatives vanish denotes the co-dimension and degeneracy of the bifurcation. If only the first derivative vanishes, it is a saddle-node bifurcation in which two equilibria collide and disappear. If the first two derivatives vanish, it is a cusp bifurcation, in which three equilibria meet -or, in other words, two saddle-node bifurcations. If the first three derivatives vanish, it is called a swallowtail point, at which four equilibria meet (or two cusp bifurcations). If the first four derivatives vanish, it is called a butterfly catastrophe or butterfly singularity [66,43], where five equilibria or two swallowtail points meet (or three cusp bifurcations, or four saddle-node bifurcations).
To study the bifurcation structure, it is therefore useful to look at the Taylor expansion of f around a reference point y r . For this we set y = y r + z and define z α := y α − y r , z ε := y ε − y r . Then we obtain Using computer algebra software such as Mathematica the expansion around z = 0 of this equation can be computed as with expressions for f i given in Supplementary Material B.2.
A.1 Cusp bifurcations when c = 0 or a = 0 We first inspect some limit cases, starting with the limit case in which c = 0. That is, ε 1 = ε 2 , indicating the emissivity does not change with temperature. In this setting, it can be shown that f 0 = f 1 = f 2 = 0, whenever the following conditions hold simultaneously: Since it is required that a ≥ 0 and y r ≥ 0, from these expressions it can be seen that cusps bifurcations can only occur if z α > 0. Further, higher order degeneracies cannot occur (no choice for z α leads to f 0 = f 1 = f 2 = f 3 = 0 in the case c = 0).
These results can be brought back to the original scaling by a series of substitutions and manipulations. For instance, fixing Q 0 = 341.3, σ = 5.67 · 10 −8 , K a = 0.1, T a = 274.5, ε 1 = ε 2 = 0.7, α 1 = 0.7 it can be shown that a cusp bifurcation occurs when α 2 ≈ 0.5071 for µ ≈ 91.663 at T ≈ 273.9529. For lower values of α 2 , the bifurcation diagram (µ, T ) has two saddle-node bifurcations; for higher values it has no saddle-node bifurcations. See figure 11 for numerical continuation of the saddle-node lines in (µ, α 2 )-parameter space, and bifurcation diagrams (µ, T ) for various choices of α 2 below, above and at the critical value for which the cusp bifurcation occurs. Another limit case arises when a = 0. In this case, α 1 = α 2 , indicating that the albedo does not change with temperature. In this setting, it can be shown that f 0 = f 1 = f 2 = 0 occurs whenever the following conditions hold: In this case, to ensure c ≥ 0 and y r > 0, it is necessary to take z ε < 0. Observe that higher order degeneracies cannot occur.
A.2 Butterfly catastrophe in the full system When a > 0 and c > 0, both albedo and emissivity change with temperature. In this case, the cusp bifurcations from both degenerate settings are present. In these bifurcations, there is a transition from two stable and one unstable equilibria to one stable equilibrium. Next to these, there is also another cusp bifurcation in the full system. In this additional cusp bifurcation, two unstable and one stable equilibria meet and become one unstable equilibrium. It is possible that these cusp bifurcations meet, and hence the five potential equilibria of the full system meet. This happens for parameter values for which f 0 = f 1 = f 2 = f 3 = f 4 = 0. Here, a bifurcation of codimension 4 occurs, which is sometimes called a butterfly catastrophe.  Using the expressions found before, it is possible to find locations of butterfly catastrophes in the full system. For instance, fixing c = 0.003 and d = 5, using a numerical root finding algorithm we obtained the solution ν = 717271, a = 96965.4, y r = 28.9109, z α = 0.212802, z ε = −0.2214.
For parameters close to this point, the degeneracy unfolds into four saddle-node branches, with three cusp points. In Figure 13, we show such a unfolding where we follow the fold loci as parameters α 2 and µ vary, and all the other parameters are taken close to above found butterfly singularity. In the diagrams the three cusps are located where the fold curves meet. If parameters would be taken closer to the butterfly singularity, these cusps points move together and meet up precisely at the singularity.  Figure 14 shows that the abrupt addition of atmospheric CO 2 causes a rapid weakening of the AMOC in the model CESM 1.0.4. In the abrupt2xCO2 and abrupt4xCO2 experiments this weakening gets restored gradually over time, but in the abrupt8xCO2 the system lingers around in a weakened state for long and then suddenly restores rapidly around t = 2400 years, which seem to be the cause of the rapid increase in global warming this late in the run.   [8]. This strength is measured as the (yearly averaged) maximum of the global stream function of depths below 500m in the Northern Hemisphere. The insets show a 50 year rolling average for the first 200 years of each experiment. In all abrupt CO2 forcing experiments, a weakening can be seen at the start of the runs. In the 2xCO2 and 4xCO2 experiments, this is restored gradually; in the 8xCO2 experiment, the AMOC lingers in a weakened state, and suddenly restores around t = 2400 years.