Are general circulation models obsolete?

Traditional general circulation models, or GCMs—that is, three-dimensional dynamical models with unresolved terms represented in equations with tunable parameters—have been a mainstay of climate research for several decades, and some of the pioneering studies have recently been recognized by a Nobel prize in Physics. Yet, there is considerable debate around their continuing role in the future. Frequently mentioned as limitations of GCMs are the structural error and uncertainty across models with different representations of unresolved scales and the fact that the models are tuned to reproduce certain aspects of the observed Earth. We consider these shortcomings in the context of a future generation of models that may address these issues through substantially higher resolution and detail, or through the use of machine learning techniques to match them better to observations, theory, and process models. It is our contention that calibration, far from being a weakness of models, is an essential element in the simulation of complex systems, and contributes to our understanding of their inner workings. Models can be calibrated to reveal both fine-scale detail and the global response to external perturbations. New methods enable us to articulate and improve the connections between the different levels of abstract representation of climate processes, and our understanding resides in an entire hierarchy of models where GCMs will continue to play a central role for the foreseeable future.

Traditional general circulation models, or GCMs-that is, three-dimensional dynamical models with unresolved terms represented in equations with tunable parameters-have been a mainstay of climate research for several decades, and some of the pioneering studies have recently been recognized by a Nobel prize in Physics. Yet, there is considerable debate around their continuing role in the future. Frequently mentioned as limitations of GCMs are the structural error and uncertainty across models with different representations of unresolved scales and the fact that the models are tuned to reproduce certain aspects of the observed Earth. We consider these shortcomings in the context of a future generation of models that may address these issues through substantially higher resolution and detail, or through the use of machine learning techniques to match them better to observations, theory, and process models. It is our contention that calibration, far from being a weakness of models, is an essential element in the simulation of complex systems, and contributes to our understanding of their inner workings. Models can be calibrated to reveal both fine-scale detail and the global response to external perturbations. New methods enable us to articulate and improve the connections between the different levels of abstract representation of climate processes, and our understanding resides in an entire hierarchy of models where GCMs will continue to play a central role for the foreseeable future.
climate modeling | machine learning | model calibration | model hierarchy The general circulation model, or GCM, is a mainstay of research into the evolving state of the Earth system over a range of timescales. The term dates back to the very origin of numerical simulation of the atmosphere (e.g., refs. 1 and 2). The equations governing the general circulation of fluids on a spinning sphere use the basic Navier-Stokes equations, whose form specialized for the planetary circulation was first formulated at the turn of the 20th century (e.g., refs. 3 and 4). However, closed-form solutions are not readily available, and their use as research and prediction tools had to await the advent of numerical solution in the 1950s (5).
The 2021 Nobel prizes in Physics honor some of the work done with GCMs. The first formal global warning of anthropogenic climate change, the Charney Report (6), was substantially based on the pioneering work of Syukuro Manabe, who confirmed 19th century speculations on the warming effect of adding CO 2 . While normally attributed to Tyndall and Arrhenius, the earlier work of Eunice Foote has recently come to light (7). She, in fact, presciently remarked, "An atmosphere of that gas would give to our earth a high temperature" (8). While Foote and others were talking principally about the radiative effects of CO 2 , it was Manabe and others who included dynamical considerations, the transport of heat vertically through convection (9), as well as from the equator poleward through atmospheric and oceanic circulation (e.g., ref. 10). Besides, GCMs also play a central role in the work of another of the 2021 winners, Klaus Hasselmann, who laid the groundwork for the statistical methods behind the field of detection and attribution of climate change (e.g., ref. 11). The detection of climate change requires extracting the signal of forced response in simulations from natural variability, and the attribution of it to external climate forcing agents, such as CO 2 emissions, again requires counterfactual runs of a GCM where that particular forcing is absent.
It may seem an odd juncture, when a Nobel prize has just been awarded for GCM-based work, to speculate on the obsolescence of the GCM. However, there has been a considerable body of literature, for a while, arguing that the limitations of GCMs require a major overhaul for further progress in climate modeling. It has been noted (see, e.g., ref. 12) that the bounds of uncertainty on equilibrium climate sensitivity (ECS: the asymptotic response of a model climate to a doubling of CO 2 concentration) has not significantly diminished since the Charney Report (6). Furthermore, a systematic synthesis of multiple lines of evidence to constrain ECS in ref. 13 indicates, in several places, a diminishing role for GCMs relative to other sources of information. Some have taken a leap from here to assert that the entire project of parameterization-the discovery of parsimonious representation through insight or mathematical methods-may have no future (e.g., refs. 14 and 15), and that large-scale computation is the way forward.
It is perhaps no accident that this debate takes place at a particular inflection point in the history of computing (16), where it is now possible to marshal and extract information from data at an unprecedented scale, the era of big data and machine learning (ML). These methods have led to some spectacular successes in various fields: AlphaFold, for example, can decipher the structure of complex molecules directly from data (17). This has led to speculation that we might have entered the era of "post-theory science" (18). This is a fierce debate in many fields: whether large-scale structure emerges directly from the addition of detail and data, and where the limits of models built from data might lie. We explore this debate here in the context of the modeling of climate, a complex system undergoing slow but inexorable global changes, but where the details matter as well.
The debate poses questions that resonate across all fields of science that use large-scale data and computation as a pillar of the scientific method, alongside theory and observations. In the discussion at the end, Section 6, we will point out certain parallels, particularly with the debate in neuroscience surrounding the Human Brain Project (HBP). Leading up that discussion, the paper will begin with an account of the structure of the GCM from the time of Manabe's pioneering studies to the present day (Section 1), followed by an analysis of some aspects of climate modeling which have exposed GCMs to criticism (Section 2), interrogating the role of model resolution (Section 3), model calibration (Section 4), and the generation of counterfactuals (Section 5). It is our assertion that parameterized GCMs can expect to get a new lease of life at this moment through sophisticated approaches to model calibration based on methods borrowed from ML. We begin with our account of the GCM.

The Structure of the GCM, from Manabe to Present Day
The general circulation of the atmosphere and ocean can be described by equations of fluid flow and thermal energy transport and exchange. The numerical solution of a discretized form of these coupled partial differential equations can be written in the form where x is a state vector consisting of mass, momentum, and energy associated with fluid elements, as well as other quantities that can contribute to changes in state, which could include various phases of water in the atmosphere, salinity in the ocean, and trace elements, CO 2 , methane, dust, and other species, of natural or human origin. R(x) + U(x) represents the dynamics, the Navier-Stokes equation.
Because of the discretized nature of the numerical model, only part of the fluid motions are explicitly resolved (R). The unresolved (U) fluid motions (with scale smaller than one or several grid points) are represented in the form of a closure, that is, a representation of subgridscale dynamics in terms of resolved-scale state variables. P represents other processes that contribute to the thermodynamics; these can include diabatic processes associated with phase changes of water in the atmosphere, leading to the formation of clouds and rain, and the influence of solar radiation, and the equation of state of a complex fluid with many constituents. U and P are often collectively referred to as the physics. Finally, F represents the terms that are considered external (not simulated by the model) influences on the system, known as forcings; these can include solar radiation, volcanoes, anthropogenic emissions of CO 2 , geothermal heating and other radiatively active quantities, or particulate aerosols that can play a role in cloud formation. Both atmosphere and ocean are extremely shallow compared to Earth's radius R (H/R ≈ 10 −3 , where H is the fluid height), so that the numerical treatment of vertical dynamics is generally different from the horizontal. In ref. 9, Manabe and Wetherald considered the timeasymptotic balance between the destabilization of an atmospheric column by radiation (warming at the surface, cooling aloft) and stabilization by convection, which transports heat vertically. Radiative-convective equilibrium (RCE) mediated by water vapor and CO 2 yields the basic behavior of global warming in a single atmospheric column. Later work (e.g., refs. 10 and 19) extended it to include horizontal transport of heat as well, from the equator to the poles, confirming the single-column result. To give a sense of the computational size, a typical GCM resolution today is about 50 km or less, compared to 500 km in ref. 10, and vertical resolutions have increased by a comparable factor; at this resolution, the required temporal resolution is measured in minutes. At these scales, the spatial grid may encompass 2 × 10 7 points, and a simulation of 100 y in length requires 3 × 10 6 time steps. In the decades since the pioneering work of Manabe, considerable ingenuity has gone into creating fast and accurate solutions to the dynamics and increasing its resolution.
An even more significant site of creativity has been in the physics (U(x) + P(x)) which now encompasses myriad processes in the atmosphere and ocean contributing to the transport of heat and other quantities; these include processes associated with clouds and precipitation, mixing by unresolved motions in the ocean, and an increasingly sophisticated treatment of the terrestrial and marine biosphere, with the contribution of trace elements and aerosols, whose natural and anthropogenic emissions also play a role in understanding the response of the climate system to changes in F. The structure of the GCM (Fig. 1) now consists of a number of parameterizations representing individual processes and feedbacks, where λp represents a set of parameters that may be empirically set through comparison with observations or theory. Typically, these are bulk quantities, representative values at the resolution of the model discretization. The climate is a multiscale system, encompassing processes from microscales (cloud droplet formation, stomatal exchange) to megascales (orbital changes, plate tectonics). Transport of heat and CO 2 into the ocean abyss occurs in the planetary scale on times measured in millennia. We introduce a vocabulary for problems addressed with numerical models, where x in Eq. 1 represents the state of the Earth system, and for identifying sources of error and uncertainty in predictions from these models.
• Note, first, that Eq. 1 represents the time trajectory of x.
The trajectory itself is the weather. The system is chaotic (21), which represents the first source of uncertainty. Numerical weather prediction (NWP) has improved over decades with better models, observations, and the techniques of data assimilation, which constrain trajectories to stay close to observations in a least-squares sense (22). • The climate is the set of preferred states of the system, its attractors, discovered by running trajectories for a long time, and averaging over the weather. If F is held constant, the climate should be stationary, and the fluctuations around that state are its internal variability. • The forcing F is time varying, with both natural and anthropogenic contributions, and the climate change problem consists in separating the forced response from internal variability. For purposes of policy related to the climate emergency, this constitutes studying the response to scenarios (trajectories of F; see, e.g., ref. 23). • Abrupt transitions, or tipping points, are changes in climate which are very fast relative to the rate of change of F. For instance, orbital changes are known to lead to long-term changes in climate, such as the glaciationdeglaciation cycles. Yet, the climate record contains transitions such as Heinrich events (24,25), a rapid collapse and resumption of the planetary-scale meridional overturning circulation (MOC). We shall return several times to the MOC in this article as a canonical problem of concern for climate. The long intrinsic timescales of the forces shaping the MOC pose a unique set of challenges. Studies of individual processes (e.g., ref. 26) also show that the system may be capable of abrupt (relative to the rate of change of F) transitions. • Parametric uncertainty arises from imperfect constraints, either from observations or theory, on λp. Structural uncertainty and structural error arise when no values of λp can fit the known constraints, leading to the conclusion that the form of the equations in M could be improved. (It is, of course, possible to formulate structural uncertainty as parametric, by having a parameter that chooses one equation structure over another!) It has been noted that averaging over different GCMs (individual formulations of R, U, and P) can disguise structural error given statistical independence among GCMs (e.g., ref. 27). However, some biases remain common and persistent across many generations of models, suggesting gaps in understanding, or epistemic uncertainty (28).
For the purposes of this Perspective, we define the GCM as the tool used to study the response of the climate system to changes in F. The tool is broadly similar to (and, in many cases, is based on the same model code as) the models used for NWP. In both cases, we follow trajectories of x in time from a specified initial state. For weather, the trajectory itself is the solution, while, for climate, we are interested in the attractors of the landscape where the trajectories lie. Besides chaotic uncertainty, we must also contend with the structural and parametric uncertainty associated with the physics terms U and P. Finally, F and its time rate of change (usually ignored for weather, although not for climate) is also uncertain, as emissions trajectories are unpredictable (29). For timescales of interest to address the climate emergency, including the possibility of abrupt transitions, we generally need to run simulations of at least O(100) simulated years (SY) in length. The exploration of the different sources of uncertainty will require sampling at least O(100) model settings, as we shall show below in Section 5. A GCM is defined, for the purpose of this study, as an engine for simulating the Earth system, capable of running 100 instances of 100 SY in a reasonable amount of time and given adequate computing resources. The same tool may be configured for other purposes (e.g., process studies on a limited domain or in highly idealized settings, or higher resolution but shorter duration), but these constitute the minimum requirements for studying the response to changes in F.

What Ails the GCM?
There are several diagnoses of the weaknesses of GCMs. There is, first, the argument that the column abstraction breaks down in the presence of large-scale organization: the mesoscale organization of cloud systems in the atmosphere (30). A similar argument can be made for ocean dynamics as well, where mesoscale turbulence in the form of persistent eddies is able to deposit energy and momentum away from the source (e.g., ref. 31), in the form of Agulhas rings, for example (32). Such nonlocal effects of unresolved terms call into question the structure of the GCM that has been used since Manabe's pioneering calculations. Some aspects of the climate system have resisted efforts at representation in models, with stubborn biases and uncertainties. This has led some to question whether such processes are parameterizable at all (e.g., ref. 15). Following this line of thinking, it is now often contended that nothing short of resolving finer-scale motions, coupled with assimilation of presentday observations to control model biases, will, in fact, address these shortcomings, and that a future generation of models will address these issues through substantially higher resolution.
A second criticism of GCMs is around the consideration of the "tuning," or calibration, of climate models. As noted above, unresolved physics (U + P) is represented using equations with parameters constrained within some range by observations or theory. The coupled system is then further subjected to global constraints such as top-ofatmosphere energy balance (33). The fact that the models are tuned to reproduce some features of the observed planet is, in some quarters, viewed as rendering the results suspect.
Finally, GCMs are now numerous [122 models from 44 institutions at the time of counting (34)]. Viewed as an ensemble of simulations, they embody considerable structural or epistemic uncertainty, in the form of differing representations of the unresolved scales. The GCMs are not all statistically, or in terms of model code, independent of each other (28), and different evaluation metrics yield different, and contradictory, measures of model quality. The uncertainty bounds have, if anything, increased between the last two climate model assessment cycles, and the recently published Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6) (35) notes that many models now produce ECS values outside the assessed "very likely" range, leading to an enhanced role for emulators (e.g., ref. 36), as discussed in Section 5. Models used in service of decision-making and policy, including those used in the recently concluded IPCC AR6, rely on emulators allowing for rapid exploration of multiple future scenarios or, through the use of statistical techniques, for the correction of biases. Furthermore, it is increasingly noted that models pegged to present-day climate do not do a good job of representing the climate fluctuations of the past, including past warm climates that may hold lessons for the climate emergency (e.g., ref. 37). We address these concerns in turn.

What Resolution Is Enough?
We begin with the question of resolution. As noted above, one suggested remedy for the weakness of GCMs is to increase the resolution until some of the nonlocal phenomena alluded to above are, in fact, resolved (ref. 38, for instance). The computational expense of such a model would require a substantial boost to computing capability (39).
Atmospheric and ocean dynamics fit within the broad contours of geophysical turbulence. At very large (planetary) scales, this looks like two-dimensional (2D) turbulence, known to have an energy spectrum with a k −3 spectrum. At smaller scales, baroclinicity starts to play a role, and 3D turbulence with a k −5/3 spectrum. This is, in fact, observed in the atmosphere, seen, for example, in ref. 40. Similar spectra are observed in ocean turbulence, as well (41,42).
The key feature to underline for this discussion is that the 3D energy cascade continues all the way down to molecular scales. There is no fundamental scale separation in turbulence. Any truncation applied in order to create a discrete representation for numerical purposes is an arbitrary one, usually constrained by the available computing power.
We can look beyond turbulent energy spectra to the specificities of certain dynamical phenomena. In the atmosphere, one of the key phenomena of interest is moist convection. The pioneering work of Rayleigh and others shows that fluids heated from below will overturn, with the overturning motion in the form of "cells" that roughly scale with the height of the convecting fluid. In the case of atmospheric convection, this includes deep convection, which roughly scales with the height of the tropopause (∼10 km; see, e.g., ref. 43), such as thunderstorms or tropical cyclones, and shallow convection within the planetary boundary layer (depth ∼1 km), which can take many forms, including large stratocumulus decks thousands of kilometers in extent, playing a significant role in the planetary albedo and heat balance. GCMs have traditionally tried to represent these in parameterizations of the vertical transport of momentum, heat, and moisture, as well as other tracers, by convection. Subgrid closures of deep convective processes can be based on an assumption of quasi-equilibrium between synopticscale destabilization of the column and the stabilization by convection (e.g., ref. 44, and its descendants). The slow rate of advances in these methods (45) has led to efforts where parameterizations are replaced with embedded cloud-resolving models (CRMs), known as "superparameterizations" (46), or, going even further, replacing shallow convection as well, with large-eddy simulation (LES) models (47). It is not clear whether, since their inception, this class of models have justified their extreme computational expense in terms of improved climate simulation.
Many aspects of clouds, such as the representation of moist convection, require us to invoke microphysical processes involved in the condensation of water vapor and the formation of falling hydrometeors. But the motions themselves can be captured in nonhydrostatic models. Limited area modeling of deep convection dates back to the 1970s, and advances in computing capacity in the intervening decades make global CRMs (GCRMs, also known as global cloud system resolving models, global convectionpermitting models, global storm-resolving models, etc.; see ref. 48 for a defense of the term despite not quite resolving cloud dynamics) within reach. The DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains (DYAMOND) experiment (49) is a recent systematic comparison across multiple GCRMs. These models typically have horizontal resolutions in the range of 1 km to 5 km, considered sufficient to capture mesoscale convective organization and, at least marginally, resolve individual convective events. Aspects of convective organization, such as the formation of gust fronts with downstream surface density currents (50), remain below the resolution of GCRMs, and must be parameterized, as they are responsible for the initiation of new convective events. Boundary layer convection is responsible for shallow clouds, and requires at least 10× higher resolution; this is, in fact, one of the largest sources of uncertainty in the current generation of models (51). This class of clouds will not be resolved by kilometer-scale models. And, of course, water as vapor or condensate is radiatively active, and particulates play a role in cloud formation, as well. Surface exchanges at very fine scale mediate aerosol emissions (e.g., ref. 52). Many of these aerosolradiation-cloud processes take place at micrometer scale and will be forever outside any conceivable resolution of a numerical model on any known computational technology in the literature today.
CRMs (and their global incarnation, GCRMs) and LES models of the turbulent and cloudy boundary layer are widely used to study processes that cannot be resolved in GCMs, but perhaps we can learn from them, to inform the development of parameterizations. Such studies are usually mediated by single-column models (see, e.g., ref. 53). Such studies often show that the disparity between different CRM formulations remains comparable to those between GCMs. In a set of RCE (the same problem treated in a column by ref. 8) comparisons across both CRMs and GCMs (54), the uncertainty spreads across CRMs and GCMs with parameterized convection were quite comparable (55). In the first extensive comparison of GCRMs (necessarily short runs, only 40 d compared to the GCM timescales outlined in Section 1) (49), there was considerable intermodel variability (56). In Fig. 2, we show a comparison of the relationship between precipitable water and outgoing top-of-atmosphere longwave radiation and albedo and low cloud cover from ref. 56 with the same quantities from the 6th Coupled Model Intercomparison Project (CMIP6) experiment. While the CRMs are closer to observations for the precipitable water and outgoing longwave radiation, the spread in low cloud cover and albedo is just as wide in GCRMs as in GCMs. The model spread in DYAMOND has been traced to differences in the treatment of boundary layer convection (57).
LES structural uncertainty tells a similar story: Comparisons across LES models of the boundary layer with or without clouds still show some reduction in spread in certain cases (58), but considerable spread in response to changes in configuration in specific aspects such as microphysics (59), turbulence closures (60, 61) and numerics (62).
Similarly, the representation of oceanic fronts, eddies, and currents is markedly different depending on the type of model that is used to represent them, whether turbulent mixing by eddies is resolved or parameterized. The ocean depth varies widely between pelagic zones and the abyss, and, consequently, the Rossby deformation radiusthe length scale representing when mesoscale eddy mixing is significant-varies from O(100) km in the equatorial open ocean to O(1) km or less near coastlines and toward the poles (63). Attempts to build a "scale-aware" parameterization of eddy mixing (e.g., refs. 64 and 65) must contend with these variations.
Other features such as vegetation, not discussed in this article, also exhibit heterogeneity at any conceivable resolution. While there is no doubt that, in general, increasing resolution reduces the number of processes needing to be parameterized (although placing increasing resolution demands on observations as well), and, in many cases, increases the fidelity of simulations, particularly in the short term, in the climate context, this must be weighed against the expense, which limits the simulation duration and consequently the ability of models to realize the aspects that regulate the climate on long timescales. The absence of any target resolution where one could argue that everything of interest to climate is resolved implies that these trade-offs are always with us.

Are There "Untuned" Models?
The question of the "tuning," or calibration, of GCMs has been a fact of life since their inception, often not clearly described in the literature, until recent efforts to document the role of tuning in model development (33,66). On the scientific level, these attempts highlight the central role of tuning in modeling, and open up new avenues in the use of automatic calibration techniques from ML.
In broad terms, we can define the process of tuning as one of finding suitable values of λp in Eq. 2 that best fit observations or theory, identifying it as an intrinsic and universal aspect of model development. We seek to achieve fidelity to each process M in the model, as well as respecting global constraints across the coupled system: for example, conserving energy and mass, including of individual agents in the climate system, such as water. The global constraints must be applied when coupling models at any resolution. Tuning is thus a multistep process, where individual parameterizations M are first tuned separately to within a desired tolerance, but which may then be refined in a second stage after the coupled model is built. Thus, tuning, which is often seen as an optimization of a loss function, may be redefined as identifying the subspace of parameters compatible with a number of constraints. While the procedures can be onerous, the process of calibration is central to model development and the way teams learn how parts of the coupled system respond to changes in others. The coupled system can yield surprises: In one example [the National Oceanic and Atmospheric Administration's Geophysical Fluid Dynamics Laboratory (NOAA/GFDL) model (67,68)], the coupled system had an ECS higher than was expected during the development of individual components. Thus, model calibration is not a weakness of models; it, in fact, holds the key to how model developers learn how their model behaves, and, consequently, how the Earth system regulates itself.
Historically, the tuning of models has been found to be expensive, if one uses the GCM itself as the forward model. In particular, key circulation features such as the MOC may be sensitive to tuning in ways that reveal themselves after simulations of O(100) SY (67). Tuning such models "by hand" can be an inefficient scattershot exploration of a small amount of the possible space of parametric uncertainty. This is, perhaps, one reason why some observers hold tuning in low regard. There is also the risk of tuning to a state Fig. 3. A schematic of the use of LES models for parameter estimation, adapted from ref. 72. LES models of different observed cloud regimes are used to build emulators to estimate parameters governing the behavior of column-averaged quantities. Successive waves of HM deliver the minimal parameter space consistent with the LES data, the NROY space. The effect of clouds on absorbed shortwave radiation is reduced by a comparable amount to the previous laborious tuning "by hand" using the same target. ARM, atmospheric radiation measurement; RICO, Rain In Cumulus over Ocean experiment.
containing compensating errors that satisfy the constraints but for the wrong reasons.
As noted earlier, computing technology, at the present time, favors the algorithms of ML, the ability to add or extract patterns seen in large datasets. Tuning is a constant concern in the construction and optimization of methods, such as choosing the width, depth, and structure of a neural net, aspects often referred to as hyperparameters, as they are of the network, not of the process being emulated. They are chosen to meet the requirements of fidelity against the training dataset. Some data are often withheld to guard against overfitting to noisy data, and the withheld ("out of sample") data can be used to validate the result as generalizable to novel situations not seen during training. "Physics-informed" ML (69), where the loss function can be made to penalize violations of global constraints such as conservation laws, also has a parallel to the tuning process described above.
In the HighTune project, an atmospheric boundary layer convection scheme based on the eddy-diffusivity mass-flux (EDMF) approach (70,71) is calibrated to match results from an LES simulation (72,73). The EDMF scheme, which has both upgradient (organizing) and downgradient (dissipative) components, is structured for problems such as cloudy convection and turbulence, but has parameters that must be empirically determined in a variety of boundary layer regimes. The process is illustrated in Fig. 3. A variety of cloud regimes is simulated in an LES constrained by observations; the LES serves as the "truth" for tuning. A single-column representation of EDMF is then compared against LES output using the "history-matching" (HM) method of ref. 74, exploring the range of uncertainty simultaneously across multiple parameters at a number of points. The full range of parametric uncertainty is then explored using emulators, as described in ref. 72. Rather than seeking a single optimum λp, HM seeks only to remove implausible regions of parametric space from consideration, leaving a "not ruled out yet" (NROY) region for consideration. Any parameter values within NROY are possible valid values, and new constraints can be progressively added as needed. Finally, the HM method can also serve as a means of diagnosing structural error: A null NROY space (i.e., no possible values of λp permit M to meet the desired constraints within the chosen tolerance) indicates that the representation in M needs to be refined.
An independent effort from the Climate Modeling Alliance (75) project uses a somewhat different approach, but with the same ends, namely, to calibrate an EDMF scheme with a set of parameters and their uncertainty bounds (76,77). There are also efforts to build a "library" of cloud regimes (78) for use in training. As noted above in Section 3, LES models carry their own structural uncertainty. The Development and Evaluation of PHYsical parameterizations of atmospheric models (79) project seeks to build systematic benchmarks and training datasets for similar efforts elsewhere.
This represents the first step toward a more systematic and objective exploration of parametric uncertainty than is possible when GCMs are directly used as the forward model. These methods are now being extended to consider the tuning of slower processes in the ocean, following ref. 80. Some of the subsequent problems of tuning coupled systems with multiple timescales are still being explored using highly simplified models (81). Preliminary results indicate that compensating errors and other defects inherent in the tuning of coupled systems may still be present, but it may be possible to diagnose those within the HM method.

Why Do We Need Counterfactual Earths, and Where Do They Come from?
In Sections 3 and 4, we have outlined a procedure for creating a tool for numerically simulating a very complex multiscale system at a maximal level of detail given computing limits, and calibrating the system within those limits to resemble the observed Earth as closely as possible. We use this system to study Earth's climate and its history; the appearance of life and the maintenance of an atmosphere, ocean, and land surface suitable for life; the fluctuations of the past and the recent period of relative stability that allowed for the possibility of settlements and agriculture; and the consequences of the Industrial Revolution leading to the current climate emergency. As an object of science, Earth poses a key problem, in that there is only one instance of it, and one temporal trajectory of its state, that we can observe. Simulated Earths remain our only means of exploring different hypotheses about how the system works, and other trajectories it might have taken and might take in the future. It also remains our only means of exploring responses to the climate emergency, and understanding and predicting the impacts of different policy choices of mitigation, adaptation, and resilience building.
Using techniques pioneered by Hasselmann, another of 2021's Nobels in Physics (e.g., ref. 11), the techniques of detection and attribution have helped us understand the role of different forcing agents in F, and their contribution to a changing climate. First, a comparison of present-day climate with a counterfactual simulation known as the "pre-Industrial control" (where F is held constant at its value in 1850 CE), allows us to unequivocally detect that the climate has changed, beyond the bounds of simulated random internal variability. Attribution of climate change to particular forcing agents is performed using either single-forcing runs, where all contributors to F are held constant save one, or agents grouped into greenhouse gases only, natural only, etc. (82), to tease apart the contribution of different forcings to climate change, as well as their nonlinearities (as the contributions may not be simply additive). The number of individual forcings considered now number over 10 (see, e.g., figure 2c in IPCC AR6 Summary for Policymakers (35)]. In addition, ref. 23 considers at least eight pathways of future evolution of forcing. Together, such simulations represent a requirement for GCMs to explore hundreds of counterfactual pathways of climate evolution.
We draw attention to two major implications of this for the use of GCMs. First, GCMs are used to explore a counterfactual space, not directly constrained by observations. This includes chaotic uncertainty (different trajectories of x under changes in initial conditions, which provide a measure of the intrinsic internal variability), and to counterfactual settings of F, where various actual observed forcings are turned on or off. An individual GCM (a particular formulation of the terms R, U, and P) is first calibrated against theory and observations as described in Section 4 and, once this is satisfactory, is used to explore an even vaster counterfactual space (see, e.g., figure 4 from ref. 29). The requirement to be able to simulate counterfactuals must be taken into account in the context of ML as well, as we shall discuss below in Section 6.
Second, in view of the computational expense of GCMs outlined in Section 1 (see also ref. 83), it has proved prohibitively expensive to explore all the forcings, and their potential future pathways, using GCMs. The IPCC has turned, instead, to an extended use of emulators. Note that, unlike the emulators in Section 4, which attempt to mimic individual climate processes, these are emulators of the whole climate system, which attempt to predict the response of the entire system, usually in the form of an integrated measure such as the global mean surface temperature, to changes in F. Regional patterns of climate change can be inferred by coupling with techniques such as pattern scaling (84). The emulators are all reduced-complexity models of various flavors, ranging from relatively simple regressions trained on recent historical data, to dynamical systems models such as impulse-response models, to highly simplified, and usually 1D, GCMs (85). Their advantage is that they are typically millions of times faster than GCMs (86), although their lack of internal physical consistency poses epistemic risk.
One of the key measures of the climate response to CO 2 , the ECS, is itself a counterfactual, as it is based on an asymptotic equilibrium that is never observed in nature. It is nonetheless useful as a measure in order to project a range of responses to scenarios (possible future trajectories of F). Furthermore, while the recent (and most precise) observational record of the satellite era is too short to constrain GCMs adequately, there are other indirect means of placing limits on ECS, including paleoclimate data and constraints on individual processes contributing to the ECS (13). The recent IPCC concluded that many GCMs were providing ECS outside the "very likely" range, and used emulators where ECS is a tunable parameter, to refine the consensus projections and their uncertainty bounds (86).
Reduced complexity models have also been used extensively to study the potential for abrupt transitions in the climate system, for which there is some evidence in the paleoclimate record (87). The MOC (see Fig. 4) is a canonical feature of the climate system with the potential for metastability: Under sufficient fresh water input in the North Atlantic, from retreating continental ice sheets, for instance, the MOC can "collapse." Yet, GCMs exhibit metastability of the MOC less readily than reduced-complexity models, and eddy-resolving models even less so (88). A concern often expressed is that GCMs are too stable to perturbations (37).
The ability to run a wide variety of century-or millennialscale simulations is essential for an evaluation of many aspects of climate risk, such as tipping points. High-resolution model trajectories constrained by assimilation of recent observations will do little to mitigate this concern. The discussion above in Section 4 points to ways forward using objective methods of emulation of GCMs, just as GCMs themselves learn to emulate CRMs and LES models. This would alleviate the epistemic risk associated with emulators.

What Might a Future Modeling Landscape Look Like?
In a celebrated 1972 essay "More is different" (89), Philip Anderson, another Nobel laureate in Physics, argued that many of our sciences abstract reality at different levels of complexity, and struck a cautionary note on the limits of assuming that one level of explanation is "nothing more" than an expression of aggregate behavior of the elements at a deeper level of abstraction. The understanding of such emergent behavior of complex networked systems (CNS) happens at multiple levels, and the reduction of the properties at one level to the one below may not be computationally tractable or of practical use, even if it is true in principle. The Earth system is an exemplar of a CNS, as discussed here: bringing together domains as different as fluid dynamics, radiative transfer, chemistry, biology, and ecology, over a range of time and space scales. Yet the assembly as a whole appears to persist in stable states for millennia. The scientific puzzles related to the climate emergency center on understanding the emergent balance of terms perpetuating stable states, and what thresholds of these balances we are exceeding with anthropogenic perturbations.
As simulation is now a pillar of the scientific method, one temptation at this point is to turn over the problem of emergence to the computer itself. Many diverse fields of science have attempted to simulate a CNS at a high level of network fidelity, explicitly incorporating all the interactions and feedbacks. The question is then, Can such a numerical system spontaneously exhibit higher levels of organization? In neuroscience for example, we can pose this question in the context of the emergence of higher-level brain function from the details expressed or simulated at the level of individual neurons. The HBP (90) attempts to do just this, and others have questioned the limits of this approach (e.g., ref. 91, which examines limits to this approach in the context of vision, or ref. 92, which looks at "top-down" effects where the large-scale structure regulates the behavior at the neuronal level).
The equivalent question to pose in climate science is whether the stable states of the climate, where x from Eq. 1 remains near an attractor despite fluctuations around it, emerge directly from the assembly of the system with all its details. We also would like to see whether such a detailed simulation can accurately capture the responses of the system to changes in F. Despite the "fast" physics of the atmosphere and planetary surface, we have seen that the system contains natural and forced variability at "slow" timescales, regulated by emergent features of the general circulation such as the MOC. The timescales of changes in forcing, such as anthropogenic emissions since the Industrial Revolution, and of responses in features such as MOC demand modeling tools capable of century-long simulations under many possible forcing pathways, 100 simulations of 100 SY each, as argued in Section 1. We have reviewed the arguments for kilometer-scale models in Section 2, which are quite limited in capability compared to the GCMs [optimistically, perhaps capable of O(10) simulations each of O(10) SY in length on the largest available computing platform, an order of magnitude below that (39,93)]. Simultaneously, those exploring policy responses require very fast models for exploring many policy options under many forcing scenarios.
The sense that GCMs may be obsolete comes from these conflicting demands: very high resolution for some key processes, which restrict the ability to explore and quantify uncertainties and study low-frequency variability; the need to explore many counterfactual scenarios for constructing climate policy, which cannot be guided simply by presentday observations; and the long simulation times needed to understand prior episodes of abrupt climate change.
As noted in the comparison with neuroscience, this is a debate, across many fields, on how much reliance to place on the most complex models capturing detail using the elements of the finest representation of all the elements in a complex system. Such simulations are sometimes now called "digital twins," borrowing a term from manufacturing (e.g., ref. 94), where it originally meant an exact digital copy of an engineered system and its specifications. It has been increasingly used to describe complex systems, including living systems (95). But, as noted in ref. 95, for sufficiently complex systems, we need simulations at multiple levels of abstraction and complexity.
The MOC is a case in point. At the broadest level, Earth is warmed by radiation from the sun around the tropics, and loses heat near the poles, implying an equator-to-pole transport of heat in the ocean-atmosphere system. In the ocean, for the current planetary topography, this results in what is referred to as the ocean's "conveyor belt" for heat and energy (96). Looking a little closer, we see important features such as the horizontal gyres that also contribute to the meridional transport of heat northward in the North Atlantic (97), an important feature regulating the climate on the continents around (98). Zooming farther in, we can see that turbulent eddies continue to transport heat across the meridional flow in the North Atlantic (99). All levels of explanation are broadly consistent with data, and, while each rung of the ladder of complexity can be described in terms of residuals of a fine-scale balance from a level below, the balance is regulated by global constraints such as the equator-to-pole radiative imbalance. In short, it is the entire hierarchy that constitutes our understanding of the MOC.
In the modeling of complex multiscale systems, calling one of these layers a "twin" at the expense of others appears to be a rhetorical overreach (100). Indeed, in speaking of the HBP in ref. 101, the ambition is sharply circumscribed: In constructing a 'digital twin' of a living organ, one is confronted by important challenges over and above those encountered when constructing the digital twin of an inanimate object. Therefore, the concept of the 'digital twin' in this context needs to be carefully defined to provide clarity on its limitations and to avoid creating unrealistic expectations of exact fidelity . . . The digital twin is thus a copy in the practical sense, usually associated with a model of a function or process, and its power lies in its usefulness in dealing with relevant problems faced by its physical counterpart without the need (and certainly not the claim) of capturing every single detail thereof.
In the climate context, this could describe any numerical model dating back to ref. 5! We have demonstrated (Section 4) that there are methods for making the model hierarchy traceable, by showing how to derive a parameterization on the basis of a model farther up what Charney called the model "ladder" (16). It might be worth including such high-resolution benchmarks for the training of parameterizations in future model intercomparison projects for a new generation of GCMs calibrated using the methods outlined here. Similar methods can be imagined to link the reduced-complexity models of ref. 85 to GCMs, from which we can carefully explore how to remove the biases in the GCMs themselves, such as those outlined in Section 2.
We contend, in this article, that the GCM remains the indispensable meeting point of these divergent directions. Its structure encapsulates our fundamental understanding of how the climate works, and represents an astute assembly of choices and trade-offs that are versatile enough to meet the challenges outlined here. The GCM's column structure reflects the importance of separating the vertical in the model topology, and the importance of convection and its timescales in atmosphere and ocean. The structural independence of columns has been seen as a limitation, but new methods can use nonlocal predictors (102). The stochastic parameterizations mentioned in Section 3 also impose nonlocal (in space and time) coherence to the stochasticity (103).
Just as data assimilation techniques for constraining trajectories toward a time series of observations led to major improvements in NWP (22), these new methods, based on data from process models such as CRMs, or observations, as well as imposing physical constraints, hold out the possibility of efficiently approaching the attractors of the system, the key feature distinguishing the climate problem from weather, as outlined in Section 1. The recommendation is to adopt a rigorous, transparent, and reproducible tuning process, rather than assuming tuning will simply disappear when we simulate at a high enough level of detail.
Finally, we do not wish to forget the use of models as pedagogical tools, for students to explore climate response, to have "fun"-a point repeatedly made by Manabe after receiving the Nobel Prize, for example, in his first Nobel press conference (104). Models must, in addition, be easy to use to explore their sensitivity to counterfactual changes, and explore novel and risky ideas on how it responds to perturbations.
The future modeling landscape must rest on the principle of a traceable model hierarchy (16). Models at every level of the hierarchy have their own forms of structural uncertainty, as noted above; this uncertainty does not vanish at any conceivable model resolution possible on any known computational technology. Each model can be put to multiple uses and subject to diverse physical and computational constraints. The traditional GCM, with its ability to combine resolved dynamics with unresolved physics for a nonstationary Earth system, remains the crossroads between models built for other purposes: models that can resolve some of the physical uncertainties, but in limited settings; models that can be used to study transitions in the climate system that are abrupt events between millennia-long stable states; and emulators that produce corrected data for downstream users. Each of these involve trade-offs sacrificing accuracy in one part of the climate system against another. The GCM will remain the essential element ensuring that these tradeoffs remain within reasonable limits for the entire Earth system. 6.1. Data, Materials, and Software Availability. There are no data underlying this work.