Tropospheric Ozone Assessment Report: Assessment of global-scale model performance for global and regional ozone distributions, variability, and trends

The goal of the Tropospheric Ozone Assessment Report (TOAR) is to provide the research community with an up-to-date scientific assessment of tropospheric ozone, from the surface to the tropopause. While a suite of observations provides significant information on the spatial and temporal distribution of tropospheric ozone, observational gaps make it necessary to use global atmospheric chemistry models to synthesize our understanding of the processes and variables that control tropospheric ozone abundance and its variability. Models facilitate the interpretation of the observations and allow us to make projections of future tropospheric ozone and trace gas distributions for different anthropogenic or natural perturbations. This paper assesses the skill of current-generation global atmospheric chemistry models in simulating the observed present-day tropospheric ozone distribution, variability, and trends. Drawing upon the results of recent international multi-model intercomparisons and using a range of model evaluation techniques, we demonstrate that global chemistry models are broadly skillful in capturing the spatio-temporal variations of tropospheric ozone over the seasonal cycle, for extreme pollution episodes, and changes over interannual to decadal periods. However, models are consistently biased high in the northern hemisphere and biased low in the southern hemisphere, throughout the depth of the troposphere, and are unable to replicate particular metrics that define the longer term trends in tropospheric ozone as derived from some background sites. When the models compare unfavorably against observations, we discuss the potential causes of model biases and propose directions for future developments, including improved evaluations that may be able to better diagnose the root cause of the model-observation disparity. Overall, model results should be approached critically, including determining whether the model performance is acceptable for the problem being addressed, whether biases can be tolerated or corrected, whether the model is appropriately constituted, and whether there is a way to satisfactorily quantify the uncertainty.


Introduction
Tropospheric ozone is a greenhouse gas and pollutant detrimental to human health and crop and ecosystem productivity (LRTAP Convention, 2011;REVIHAAP, 2013;US EPA, 2013;Monks et al., 2015). Since 1990 a large portion of the anthropogenic emissions that react in the atmosphere to produce ozone have shifted from North America and Europe to Asia (Granier et al., 2011;Cooper et al., 2014;Zhang et al., 2016). This rapid shift, coupled with limited ozone monitoring in developing nations, has led scientists to ask some basic questions: Which regions of the world have the greatest human and plant exposure to ozone pollution? Is ozone continuing to decline in nations with strong emission controls? To what extent is ozone increasing in the developing world? How can the atmospheric sciences community facilitate access to ozone metrics necessary for quantifying ozone's impact on climate, human health and crop/ecosystem productivity?
To answer these questions the International Global Atmospheric Chemistry Project (IGAC) developed the Tropospheric Ozone Assessment Report (TOAR): Global metrics for climate change, human health and crop/ecosystem research (www.igacproject.org/TOAR). Initiated in 2014, TOAR's mission is to provide the research community with an up-to-date scientific assessment of tropospheric ozone's global distribution and trends from the surface to the tropopause. TOAR's primary goals are to: 1) produce the first tropospheric ozone assessment report based on all available surface observations, the peer-reviewed literature and new analyses, and 2) generate easily accessible, documented data on ozone exposure and dose metrics at thousands of measurement sites around the world (urban and non-urban). Through the TOAR Surface Ozone Database (https://join.fz-juelich.de), these ozone metrics are freely accessible for research on the global-scale impact of ozone on climate, human health and crop/ecosystem productivity (Schultz et al., 2017, hereinafter referred to as TOAR-Surface Ozone Database).
The assessment report is organized as a series of papers in a Special Feature of Elementa: Science of the Anthropocene, with this paper (hereinafter referred to as TOAR-Model Performance) providing an assessment of the skill of current-generation global chemistry models in simulating the observed present-day tropospheric ozone distribution, variability, and trends. To understand the implications of any ozone changes on the Earth system one must have accurate knowledge of its global distribution, from the Earth's surface into the stratosphere and above. For example, ozone impacts on human health, agriculture, and natural ecosystems are primarily driven by near-surface concentrations, whereas radiative forcing, and thus climate change, is most sensitive to ozone in the (tropical) upper troposphere and lower stratosphere (Lacis et al., 1990;Stevenson et al., 2013;Monks et al., 2015). In situ and satellite observations provide a substantial amount of information on the present day tropospheric ozone distribution and its variability and trends over the recent past (Tarasick et al., 2017, hereinafter referred to as TOAR-Observations;Gaudel et al., 2017, hereinafter referred to as TOAR-Climate), but there are important gaps in our knowledge (Cooper et al., 2014). Many regions of the world, including remote oceans and continental areas like Africa, South America, the Middle East, and India, remain under-sampled leading to incomplete knowledge of the horizontal, vertical and temporal distribution of ozone (Oltmans et al., 2013;Cooper et al., 2014;Lin et al., 2015a;Sofen et al., 2016a). Furthermore, observational estimates of the preindustrial ozone burden are highly uncertain (TOAR-Observations) making it difficult to accurately quantify the preindustrial to present day ozone changes and the resulting radiative forcing on climate and air quality impacts. Global atmospheric chemistry models not only fill in these observational knowledge gaps, but are also tools to interpret the observations, to identify the key processes and variables that determine ozone distributions, variability and trends, and to project future tropospheric ozone and trace gas distributions for different anthropogenic or natural perturbations.
A global atmospheric chemistry model is a numerical synthesis of the complex physical and chemical processes that describe the state of the atmosphere and is designed to simulate the distribution and evolution of chemical species on regional to global scales (different types are discussed in Section 2). Figure 1 summarizes the modeled processes necessary for simulating tropospheric ozone at these scales. These include representation of natural and anthropogenic ozone precursor emissions, such as nitrogen oxides (NO + NO 2 = NO x ), carbon monoxide † † † Department of Atmospheric and Oceanic Sciences and Laboratory for Climate and Ocean-Atmosphere Studies, School of Physics, Peking University, CN (CO), methane (CH 4 ), and non-methane volatile organic compounds (NMVOCs); the photochemical reactions that lead to ozone formation and destruction, and the actinic flux that drives this chemistry; the transport of ozone and its precursors away from the source by advection, convection and mixing; and loss of chemical species via wet and dry deposition. The detail to which these processes are represented depends on the intended application of the model, on the availability of observations or results from laboratory experiments to constrain the processes, the knowledge of processes that influence ozone, and the available computing power.
Models are numerical approximations of the real atmosphere, but since they are based on incomplete parameterizations of real-world processes, they will never be perfect representations of the real world (Box, 1976; Hargreaves and Annan, 2014). However, as we will show, they can provide useful information for understanding the distribution and evolution of tropospheric ozone at a range of spatial and temporal scales. Confidence in model projections of ozone would be demonstrated by the ability of models to reproduce the past and present observations of ozone on a range of different spatial and temporal scales, along with their ability to simulate the relationship of ozone to its precursors and to atmospheric physical and dynamical processes. Identification, investigation and quantification of model discrepancies with observations help inform model development and support improvement in process understanding.
TOAR-Model Performance assesses the performance of current-generation global chemistry models in simulating the observed present-day tropospheric ozone distribution, its variability and trends, drawing mainly on the results from major international multi-model intercomparison Figure 1: Schematic of chemical and physical processes included in a typical global chemistry model to simulate tropospheric ozone. The Earth is divided into a 3-dimensional grid, with latitude and longitude as the horizontal coordinates, and altitude or pressure as the vertical coordinates. Physical processes include transport by advection, convection, turbulence, and boundary layer mixing, as well as temperature, humidity, cloud cover, sun angle/latitude and time of year. Chemical processes include photochemical ozone production and destruction, aerosol-cloud interactions, wet and dry deposition and precursor emissions from anthropogenic and natural sources. Ozone precursors undergo similar physical processes as ozone itself. DOI: https://doi.org/10.1525/elementa.265.f1 projects for both surface and free tropospheric ozone in the last decade (see Table 1). We acknowledge the exclusion of regional models from the discussion, but point the interested reader to large regional model intercomparison projects such as the Air Quality Model Evaluation International Initiative (AQMEII) (Im et al., 2015). We begin with an overview of the types of global models and their nomenclature, and a summary of the international assessments that have evaluated their performance (Section 2). We then describe the evaluation methods commonly applied to models (Section 3), and assess model performance for present day ozone levels (Section 4), including extreme episodes (Section 5), before focusing on interannual variability (Section 6) and multidecadal trends (Section 7). For quick reference, Sections 4-7 each have a short summary section that summarizes model performance for these different temporal scales. We then discuss the potential causes of biases in models (Section 8). We conclude by summarizing the current state of model performance, and propose directions for future developments (Section 9).

Nomenclature of global chemistry models and international assessments
Motivated by the severe smog in Los Angeles, air pollution events were linked to sunlight and NO x -and VOCdependent chemistry, resulting in the generation of ozone, as early as the 1950s (Haagen- Schmidt, 1950). Further details of the global importance of this photochemistry were beginning to be understood by the 1970s (e.g., Levy II, 1971;Chameides and Walker, 1973;Crutzen, 1973), and throughout the 1970s and early 1980s there were several efforts to synthesize this information into simple tropospheric chemistry model studies. These focused on atmospheric profiles (Levy II, 1973), or on the hemispheric and global scales (Fishman and Crutzen, 1978;Fishman et al., 1979;Peters and Jouvanis, 1979;Logan et al., 1981). In the 1980s and early 1990s, tropospheric chemistry models became increasingly more sophisticated in their design, with greater chemical detail, improved parameterizations for atmospheric transport and removal processes, and better estimates for trace gas emissions (see Peters et al. (1995) for a review of developments in tropospheric modeling up until this time). One key model result from these earlier studies was confirmation that in situ photochemical tropospheric ozone production was important at a global scale as well as in polluted urban centers, and that, globally, the net influx of ozone from the stratosphere was of secondary importance. This resolved debates on the origin of tropospheric ozone that began in the 1970s (see Monks et al., 2015;Archibald et al., 2017, hereinafter referred to as TOAR-Ozone Budget).
Beginning around the late 1990s, models of global tropospheric chemistry developed along two parallel tracks. Broadly these involve: (1) incorporation of tropospheric chemistry into models designed to simulate the physical climate, so that chemistry-climate interactions can be explored, and (2) inclusion of chemistry in models driven by pre-calculated meteorology fields, optionally constrained to observed meteorology, allowing more detailed investigation of chemistry processes and comparison to specific measurements. Figure 2 shows schematics of the different model configurations, which are described below.

Atmospheric chemistry in global climate models
This first type represents the most complex models (Figure 2a and 2b), where atmospheric chemistry processes are embedded within a general circulation model (GCM): i.e., climate models, where physical atmospheric processes are calculated online by solving equations that describe fluid flow and radiative transfer, which can not only respond to changes in greenhouse gas concentrations, solar output or other forcings, but also generate their own internal meteorological variability (e.g., Flato et al., 2013).
Chemistry-climate models, or composition-climate models (CCMs), represent the most complex models in this family (Figure 2a), where the chemically-driven changes in radiatively active gases and aerosols (e.g., ozone, methane, sulfates) influence the model's radiation scheme, thus coupling composition directly to climate (e.g., Sudo et al., 2002;Watanabe et al., 2011;Naik et al., 2013;Shindell et al., 2013;O'Connor et al., 2014). More routine use of CCMs with tropospheric chemistry and aerosols is a relatively recent phenomenon (see Morgenstern et al., 2017 and references therein for a recent review), whereas coupling of upper atmosphere chemistry to climate has a much longer history due to the increased importance of chemically active compounds for heating rates in the stratosphere and above (e.g., Pyle, 1980;Garcia and Solomon, 1994;de Grandpre et al., 1997;see Morgenstern et al., 2010 for a review). A less complex model than the CCM is the chemistry GCM (Figure 2b), where the chemistry is affected by the climate changes from the radiative and dynamical parts of the model but the chemically-driven changes in radiatively active gases and aerosols do not subsequently affect climate. This type of model was the first step in coupling tropospheric chemistry to physical climate models (e.g., Roelofs and Lelieveld, 1997;Johnson et al., 1999;Doherty et al., 2006;Zeng et al., 2008), and it is still occasionally used (e.g., Lamarque et al., 2013).
Although fully interactive ocean-atmosphere-chemistry simulations have been performed (e.g., Collins et al., 2011;John et al., 2012;Shindell et al., 2013;Nowack et al., 2015), the computational expense of simulating atmospheric chemistry means that both CCMs and chemistry GCMs are typically run without interactive ocean and sea ice components. Instead simulations are often run with a boundary condition of prescribed sea surface temperatures (SSTs) and sea ice concentrations (SICs) (usually time varying) in lieu of an ocean model. If the SSTs and SICs follow observations, rather than being taken from another model simulation, these are referred to as AMIP (atmospheric model intercomparison project) simulations (Gates, 1992). If time-varying observed sea-surface temperatures are used as the boundary condition, then the model will "see" observed large scale climate variability, such as the phase of the El Niño-Southern Oscillation (ENSO) (e.g., Zeng and Pyle, 2005;Lin et al., 2014), but precise meteorological conditions (e.g., temperature and winds) driven by internal atmospheric variability will not be reproduced (e.g., Barnes et al., 2016). Finally, the representation of the stratosphere, in terms of resolution (e.g., vertical extent completely or partially covering the stratosphere) and chemistry (e.g., lumped halogen chemistry, individual halogen gases or prescribed stratospheric ozone concentrations), varies substantially between different CCMs or chemistry GCMs (e.g., Iglesias-Suarez et al., 2016;Morgenstern et al., 2017).

Atmospheric chemistry in offline global models
The second type represents those models where the physical atmospheric processes are taken from pre-calculated three-dimensional, time-dependent meteorological data (such as temperature and winds), from either meteorological reanalyses (e.g., Kanamitsu et al., 2002;Dee et al., 2011) or from prior simulations of a global climate model (Figure 2c). These are "offline" models, in that the transport is pre-calculated and the chemistry cannot affect the radiation or dynamics. Models differ greatly in which physical variables are read directly from the offline data, and which are directly calculated (e.g., convective fluxes and cloudiness).
The most computationally efficient of these models are chemistry transport models (CTMs), which are developed solely for coupling offline meteorological fields with a chemical mechanism (e.g., Law et al., 1998;Bey et al., 2001;Horowitz et al., 2003;Emmons et al., 2010). A more recent innovation in this area has produced specified dynamics CCMs (SD-CCMs) (e.g., Lamarque et al., 2012). These models are based on a full CCM framework, but overwrite  the GCM-calculated meteorology with reanalysis data to constrain the dynamics (normally just the temperature and horizontal winds). These models have an advantage over CTMs since they allow the coupling of the offline meteorology with other model components, such as biogeochemistry modules or components that simulate natural emissions. Nudged CCMs represent a hybrid of the CCM and SD-CCM, and sit somewhere between Figure 2a and 2c. In these models the meteorology calculated by the GCM is nudged towards reanalysis fields (e.g., temperature and horizontal winds, although this varies by model), rather than overwriting them, using techniques such as Newtonian relaxation (e.g., Jeuken et al., 1996;Telford et al., 2008;Uhe and Thatcher, 2015). These models are widely used for tropospheric chemistry studies, including model evaluation and validation (Pozzoli et al., 2011;Lin et al., 2012a;Fiore et al., 2014a;Brown-Steiner et al., 2015;Jöckel et al., 2015). While there is a difference in the model formulation, the term "SD-CCM" is often used to describe both nudged CCMs and the SD-CCMs as described above. In practice, nudging greatly reduces the role of model-generated internal climate variability, yielding a close resemblance between the model and reanalysis meteorology (Jeuken et al., 1996;Telford et al., 2008). Nudging may also be applied only to parts of the atmosphere. For example, nudging of tropical stratospheric winds in stratosphere-resolving CCMs ensures a realistic periodicity of the quasi-biennial oscillation (QBO) (Morgenstern et al., 2010). If the nudging is limited to the QBO, then these models are still referred to as CCMs.
CTMs and SD-CCMs (and nudged CCMs) are often used for performing process-oriented analysis, including interpretation of short-term field measurements (e.g., Law et al., 1998;Liang et al., 2007;Zhang et al., 2008;Telford et al., 2010;Lin et al., 2012a;Wespes et al., 2012) and understanding the causes of ozone variability and long-term trends in observational records, by isolating the roles of emissions and meteorology (Koumoutsaris and Bey, 2012;Lin et al., 2014Lin et al., , 2015Lin et al., , 2017Strode et al., 2015). These models are also used to make chemical forecasts as part of flight planning for field missions (e.g., Fast et al., 2007). In addition, global CTMs often provide boundary conditions for regional CTMs that are used for air quality planning purposes.

Model intercomparison projects (MIPs) with tropospheric chemistry
There is a long history of international model intercomparison projects (MIPs) involving global tropospheric chemistry models, largely motivated by the need to inform major international assessment activities. Table 1 summarizes some notable examples of these projects since ~2 000, and those most recently completed (ACCENT, ACCMIP, CMIP5, TF-HTAP, and POLMIP) are used to inform the assessment in this paper. Several individual models have taken part in many or all of these projects, so they are not independent samples of model performance.
As with their physical climate modeling counterparts (e.g., Hawkins andSutton 2009, 2012), these projects have been used to explore uncertainty, particularly the structural uncertainty associated with the different representations of the physical system in the different models, such as the chemical, photolysis and deposition schemes. However, because these projects represent "ensembles of opportunity", i.e., a collection of simulations from modeling centers or groups that were able to complete the simulations, they are unlikely to capture the full structural uncertainty and this remains a research area deserving of more investigation.
The experimental design of MIPs is typically based around the use of many different models to conduct simulations for the same conditions, such as the same ozone precursor emissions, and the same meteorology (for CTMs and SD-CCMs) or greenhouse gas concentrations, aerosol and solar forcings (for CCMs and chemistry GCMs). This is to ensure that simulations are directly comparable, and to allow assessment of the ozone (etc.) levels that result from given scenarios or conditions. In practice the different formulation and chemical complexity in different models means that some processes remain poorly constrained (e.g., natural emissions are often left unspecified) and others may be omitted entirely (e.g., higher VOC chemistry). An important consequence is that there are first order differences in model implementation which need to be considered when comparing and evaluating model simulations (Shindell et al., 2008;Fiore et al., 2009;.

Model evaluation methods
This section summarizes the range of different techniques currently used for evaluating modeled ozone, with additional discussion of issues that should be considered for model-observation comparisons. The purpose of evaluating model performance is to quantify our confidence in their output, given the particular application or experiment. There is no single metric that captures model skill, and the choice of evaluation method needs to be targeted for the application, while also considering the available observational constraints.
Although the focus here is on evaluation of tropospheric ozone, confidence in a model also depends on its performance for other parameters and processes. For GCMs and CCMs, the performance of the chemistry component is only part of the evaluation of the model, which will include a suite of atmospheric, oceanic, cryospheric and biogeochemical parameters (Flato et al., 2013). To a lesser degree, this is also the case for CTMs, where some of the physical climate variables may not be provided directly from the driving meteorological fields (e.g., convective transport). We direct the reader elsewhere for discussions on other aspects of model evaluation (e.g., Flato et al., 2013 and refs. therein).
In addition to whole model evaluation, subcomponents of models may be evaluated (or benchmarked) against more detailed models. One example is comparing production and temporal evolution of ozone from (necessarily) simplified global model chemistry mechanisms against that from a more detailed, nearexplicit chemical mechanism using box models (e.g., Emmerson and Evans, 2009;Archibald et al., 2010a;Squire et al., 2015). These techniques are not discussed here, and we instead focus on the use of observational constraints on model performance. Table 2 provides a summary of different evaluation techniques, type of observations and metrics used, and the model skill or process evaluated. Direct comparisons of simulated ozone with measurements provide a measure of model skill in capturing the spatial and temporal distribution of ozone (techniques 1-3 in Table 2). Comparisons at time scales other than monthly mean and seasonal cycle are now becoming more routine, with the use of both high frequency or long term ozone measurements. Such comparisons can provide additional information on model performance and potentially additional clues into the process drivers of model biases (e.g., the diurnal cycle might point to issues with the evolution of the boundary layer over the day).

Summary of evaluation techniques
Other evaluation approaches target the processes controlling ozone rather than ozone itself (techniques 4-5 in Table 2). These techniques can provide greater process-oriented understanding of a model's simulation  Collins et al. (2017) of ozone (e.g., evaluation of CO and ozone correlations can point to issues in simulated emissions, chemistry and mixing timescales). By extension, such methods can also provide insight into a model's usefulness for a range of possible (future) environmental conditions. However, care needs to be exercised when employing process-oriented model evaluation approaches because the observed relationships between ozone and the co-measured meteorological field or tracer may be complicated by other competing influences (e.g., Steiner et al., 2010;Brown-Steiner et al., 2015;Shen et al., 2016)., or may change over time (e.g., Hassler et al., 2016).

Considerations for model-observation comparisons
Aside from any errors and uncertainties in observational data, a major issue for direct model-observation comparisons is the representativeness of the available measurements. One issue of representativeness relates to the fact that the atmosphere is not completely sampled (Sofen et al., 2016a(Sofen et al., , 2016bTOAR-Observations), meaning that there are important locations and times where there are no constraints for the model (e.g., a globally sparse distribution of monitoring stations, coarse vertical data from satellites, poor constraints on pre-industrial ozone). Given the incomplete sampling, the primary issue for representativeness where we have observations relates to the fact that the spatial and temporal resolutions of global models are necessarily coarse. This means that modeled mixing ratios reflect regional averages over grid scales of 100 × 100 km or more. However, in polluted regions the chemical lifetime of ozone is sufficiently short for ozone to vary over much shorter spatial and temporal scales, and thus much finer grid scales would be required to resolve its variations. The wide range in spatial and temporal scales is a particular problem for model comparisons against  site-based measurements from a single geographical location. In the absence of precursor emission sources driving rapid chemical formation or sub-grid dynamical processes such as convection introducing fine structure, ozone may be sufficiently well mixed for a given measurement site to be representative at model grid scales, and for comparison of observations and models to be meaningful. But if an observation site has local emission sources or meteorological features associated with mountain or coastal locations, direct comparison with models may be less meaningful, and model "biases" may actually reflect differences in assumptions of site representativeness rather than incorrect process treatment. Similarly, a given grid cell may mix pollution sources (e.g., urban areas) over the whole grid, meaning that what is a remote, unpolluted site in the real world may have higher pollution levels in the model. The most basic way to avoid representativeness issues is to evaluate global chemistry models against baseline sites: i.e., those not heavily influenced by urban-scale fast chemical processes and certain dynamical regimes (such as convection). In addition, site representativeness can sometimes be partly addressed through conditional analysis approaches. This includes comparing observed and modeled ozone under particular wind directions, such as marine flow at Mace Head on the west coast of Ireland (Simmonds et al., 1997); by selecting free tropospheric air masses, such as at the mountaintop site of Jungfraujoch in Switzerland (Cui et al., 2011); or filtering observed and modeled ozone based on the concomitant measurements of other trace gases, such as CO, to separate ozone into polluted and unpolluted categories (e.g., Auvray et al., 2007;Arnold et al., 2015;Lin et al., 2017). Spatially-averaged measurements can provide a more appropriate comparison for coarse resolution models, and allow assessment of bulk behavior at the expense of some loss of spatial detail. Measurements may also be classified and aggregated through data-driven techniques such as cluster analysis or objective evaluation of site characteristics, allowing an assessment of model skill against particular ozone regimes (rural, urban etc.) when the model output is analyzed in the same way (Lyapina et al., 2016;TOAR-Surface Ozone Database).
Representativeness considerations are also important for satellite data. While these data generally provide a greater spatial coverage than in situ observations, they may be representative of a particular satellite overpass time, their spatial and temporal sampling may be biased, and their measurements are generally representative of a broad vertical region of the atmosphere (TOAR-Climate). These comparability issues can be mitigated by applying the instrument averaging kernel to the model output, and by saving the model output at the overpass time (Zhang et al., 2010;Aghedo et al., 2011). Comparison of modeled and retrieved tropospheric ozone columns requires consideration of how to define the troposphere and stratosphere in the model. Using a reanalysis climatology of tropopause heights might give consistency with the satellite product, but can bias a model's true tropospheric column if its own tropopause position is biased .
A more general issue with satellite measurements is that the instruments measure irradiances, which are then transformed into ozone abundances by a complex retrieval process that relies on various assumptions and models (Liu et al., 2005;Bowman et al., 2006). Although not applied to atmospheric chemistry observations, a recent innovation to deal with this issue is the development of instrument simulators for models. Here the model state is translated into the irradiances measured by a particular satellite instrument, allowing direct comparability with the satellite measurements (Bodas-Salcedo et al., 2011), but with additional uncertainties from translating the model state.
Independent of issues of representativeness, the assumption that individual model grid boxes are wellmixed shortens the chemical and transport time scales. This typically biases modeled ozone high in source regions, since dynamical limits on chemical production are absent and the artificially well-mixed conditions favors more efficient ozone production, missing localized ozone titration due to intense NO emissions for instance (Grewe et al., 2001;Wild and Prather, 2006;Hodnebrog et al., 2011). For convective mixing in the presence of strong concentration gradients, vertical transport may be biased low at coarse resolutions (Kiley et al., 2003;Wang et al., 2004;Wild et al., 2004). These scale-related biases are important under many conditions and need to be considered if model-observation comparisons are used to assess model performance or to identify weaknesses in specific model processes.
Finally, caution needs to be applied when comparing modeled and observed metrics at long time scales, particularly for free-running chemistry GCMs and CCMs. These models generate their own climate and weather variability, meaning that they will be unlikely to capture the decadal-scale (and shorter-term) variations and the timing of anomalous years seen in observations (e.g., the 1997/1998 El Niño). The lack of synchronized natural variability between the models and observations can then likely lead to a bias in their trends, even if sampled over the same time periods (e.g., Lamarque et al., 2010;Parrish et al., 2014), as discussed by recent studies (Lin et al., 2014;2015a, 2015bBarnes et al., 2016).

Evaluation of present day ozone climatology:
Whole troposphere, free troposphere and surface

State of knowledge
Evaluation of simulated ozone climatology against observations provides a measure of the skill of models to accurately represent the physical and chemical processes shaping the observed ozone distribution, building confidence that the model will be able to capture processes that lead to changes in ozone levels. Comparison with climatology also has the advantage of reducing uncertainties in observations. Below we assess the ability of global models to represent the mean present-day tropospheric ozone burden and budget, and the distribution of free troposphere and surface ozone, drawing upon results from past and recent multi-model intercomparisons ( Table 1) augmented by results from a few single-model studies. The focus is on results from multi-model assessments as these broadly reflect accepted understanding within the research community, although we acknowledge that individual model studies may better reflect the state of current knowledge in treatment of particular processes and in their approach to evaluation.
Our primary focus in this section is the evaluation of the mean climatology, while the capability of models to capture ozone episodes and variability is addressed in Sections 5 and 6. We note that it is difficult to track the improvements in the simulation of ozone across different model intercomparison projects because of inconsistencies in experimental setup, stages of model development, evaluation approach and metrics applied, and inconsistencies in observational datasets used for model evaluation.

Global tropospheric ozone burden and budget
The global tropospheric ozone burden and budget hide the complexity of all the different processes important for ozone abundance, but provide a useful first order metric for some limited observational comparisons, as well as model benchmarking (e.g., Wild, 2007;Wu et al., 2007) and intercomparisons .
The tropospheric ozone burden is simply the mass of ozone in the troposphere, and the budget is defined from four main terms: chemical production (P), chemical loss (L), loss to the surface through deposition (D), and a net influx resulting from stratosphere-troposphere exchange (S). There is considerable variation in how modeling groups define these budget terms, which can lead to ambiguity when comparing different models . This is particularly the case for the P, L and D terms, which may consider a complete range of chemical and depositional fluxes involving NO y species (e.g., Horowitz et al., 2003), or be more limited to production from peroxy radical plus NO reactions and direct loss of ozone (via HO 2 , OH, alkenes, and OH production from O( 1 D) and water vapor). The stratospheric influx can be diagnosed in some models, but is often determined indirectly by assuming budget closure over a year (i.e., P + S = L + D) , and thus depends on how the other terms are defined. Variations in tropopause definitions are another source of model variability for all the budget terms (Prather et al., 2011). Finally, a mean tropospheric ozone lifetime can also be defined, by dividing the burden by the total production (P + S) or loss (L + D) fluxes. Figure 3 summarizes the modeled values for these terms using results from the models that took part in ACCENT  and ACCMIP , as well as recent single model studies (after Myhre et al., 2013;their Table 8.1). The figure highlights the considerable range of budget terms calculated by different models, with the burden ranging by a factor of ~1 .5, the chemical terms by a factor of ~2 , and D and S by a factor of ~3 . However, aside from deposition, the range of values from the central 50% of models (boxes) are comparatively small compared to the full range (whiskers), although the models are not fully independent from one another.
Observationally derived estimates are only available for the burden, based on in situ measurement climatologies (see Wild, 2007) or satellites (Osterman et al., 2008;Ziemke et al., 2011), and net stratospheric influx, based Chemical production and loss, (c) Dry deposition, and (d) Net stratospheric influx. Model results are shown as box (interquartile range)-whisker (full range) plots, also indicating the median (horizontal line) and mean (filled circle) values of ~5 0 models for the burden and ~3 0 models for the fluxes (numbers of models indicated next to the boxes).
The interquartile range of model results for both terms lies within the central observation estimates, providing some confidence that models generally simulate their correct magnitude. However, we note the uncertainty in the net stratospheric influx term from observations, as well as the fact that the magnitude of this term is strongly influenced by interannual and longer-term variability (e.g., Neu et al., 2014). The lack of observational constraint for the P and L terms (at least for their global value) means that there has been little progress in defining their "true" magnitude, although the consensus is towards P minus L being positive (i.e., net chemical production), at least for model studies post 2000 Wild, 2007;Myhre et al., 2013;Hu et al., 2017;TOAR-Ozone Budget). The magnitudes of these terms are strongly dependent on ozone precursor emissions in the model (Wild, 2007;Wu et al., 2007). Some of the inter-model difference may be explained by the capacity of different chemical schemes to simulate more complex NMVOCs, which would tend to increase the ozone production efficiency (e.g., Jenkin et al., 2008;Porter et al., 2017). This would also account for some inter-model differences in the burden. Global models geared for long simulations might adopt a simpler chemistry scheme to reduce the computational cost, and this needs to be considered when evaluating this metric. The spread of model estimates for deposition is comparatively large, and reflects considerable uncertainty in this process due to the lack of observational constraints beyond a few land cover types for very few sites, as well as inter-model spread in lower tropospheric ozone levels (Hardacre et al., 2015). Furthermore, while the oceans have a comparatively low deposition velocity, they account for about two thirds of the Earth's surface which means that they are an important sink. Some model estimates have the oceans accounting for about one third of total ozone deposition (Ganzeveld et al., 2009;Hardacre et al., 2015), although there is some debate as to whether its importance is overestimated (Luhar et al., 2017) or underestimated (Sarwar et al., 2016). Nevertheless, the magnitude of the oceanic deposition flux is the dominant driver of intermodel differences (Hardacre et al., 2015). Modeling of deposition fluxes to grassland and tropical forest surface types has also been identified as an important uncertainty in this budget term (Hardacre et al., 2015).

Free troposphere
The ability of models to accurately simulate the free tropospheric ozone distribution is key to studies of longrange transport and changes in ozone radiative forcing. We present an extension of the comparison of ACCMIP model present-day (year 2000) simulation to ozonesonde data  in Figure 4 showing model bias and correlation coefficients against an ozonesonde climatology covering 1995-2009  grouped into 12 different regions with similar ozone distributions and sampled at three vertical levels. Model biases range from positive to negative for each region and altitude (Figure 4b), although greatest positive and negative biases are found for Northern Hemisphere (NH) extratropics and the Southern Hemisphere (SH) tropics, respectively. A closer look at comparisons in the NH extratropics previously indicated that global models overestimate wintertime ozone in the low and midtroposphere (Stevenson et al., 2013;Eyring et al., 2013a;, although the models are within one standard deviation as estimated from the observed variability. These findings are supported by comparisons of satellite-derived tropospheric emission spectrometer (TES) (Bowman et al., 2013) ozone profiles and tropospheric column observations (Ziemke et al., 2011) against the ACCMIP models . Most models capture the ozone seasonal cycle in the free troposphere for most regions (median r ≥ 0.6; Figure 4c), although there are exceptions, notably including the Equatorial Americas, located in the path of the Intercontinental Tropical Convergence Zone (ITCZ).
If the majority of models are biased in the same direction, it could be indicative of a common issue with the simulations, with a likely candidate being precursor emissions; at least in the case of NO x and CO emissions, these are reasonably similar across the models . However, shared shortcomings in vertical mixing, deep convection, representation of stratospheric ozone and a host of other drivers cannot be excluded as possible explanations of model-observation discrepancies from this simple analysis (see Section 8 for more discussion).
Based on comparisons against ozonesonde observations and other in situ measurements, particular regional features of free tropospheric ozone can be generally captured by current global chemistry models (Zhang et al., 2010;Tilmes et al., 2016;Hu et al., 2017), although these have not been systematically investigated in all models. Such features include the ozone maximum west of southern Africa over the South Atlantic Ocean (e.g., Jonquières et al., 1998;Sauvage et al., 2006), the mid-Pacific minimum, which describes the well-characterized "wave-1" pattern in the tropics (Thompson et al., 2003;Ziemke et al., 2010), and the summertime free tropospheric ozone maximum over the Eastern Mediterranean (e.g., Kalabokas et al., 2013;Zanis et al., 2014).
A new compilation of long-term measurements conducted aboard commercial aircraft of internationally operating airlines (MOZAIC-IAGOS: see Petetin et al., 2016 and TOAR-Climate) provides another means to evaluate the free tropospheric ozone in models. Figure 5 compares the ozone annual cycle over Frankfurt, Germany from this dataset against the ACCMIP models, at pressure levels from 950 to 300 hPa. Data above 800 hPa at Frankfurt is considered to be representative of the European background free troposphere (Logan et al., 2012;Parrish et al., 2012;Petetin et al., 2016). The seasonality and vertical gradient of the observations is in general agreement with the multi-model mean of ACCMIP models. However, the ensemble mean is biased high throughout the troposphere (by 5-20%) with biases strongest in fall and winter, consistent with evaluations against ozonesondes over NH extratropics described above and previous evaluations . Recent individual model evaluations against observations have also found similar biases in the free troposphere (e.g., Zhang et al., 2010;Kim et al., 2013;Tilmes et al., 2016).
While models are largely able to capture the ozone climatology in the free troposphere, they have difficulty in simulating ozone episodes related to long-range transport    assess possible impacts (e.g., Anenberg et al., 2009;Tong et al., 2009;Avnery et al., 2011;TOAR-Vegetation). Here we discuss model evaluation of the mean spatial and temporal distribution of surface ozone while model skill in simulating extreme events is discussed in Section 5. Model evaluation relies on the availability of high quality observations with high spatial and temporal coverage. Reasonably comprehensive "baseline" (TOAR-Observations) surface ozone observations over the U.S., Canada, Europe, and Japan augmented with data over numerous polluted sites in these regions have facilitated a thorough evaluation of global chemistry models over those regions (e.g., Fiore et al., 2009;Reidmiller et al., 2009;Schnell et al., 2015;Sofen et al., 2016b). Evaluation elsewhere is limited by poor data availability. To alleviate data limitation, a comprehensive database of global surface ozone measurements was compiled within the TOAR framework (TOAR-Surface Ozone Database). This was achieved by collating in situ hourly ozone data over the time period 1970-2015 from regional or national air quality monitoring networks, multi-national programs and data from individual researchers. A gridded product was generated from this dataset for comparison with models. Data from stations with elevations greater than 2 km were not included in this gridded dataset. Mean climatological present-day observations were constructed by averaging data over 1996-2005 (Figure 6b), and gridded to 5 degree latitude × 5 degree longitude to facilitate comparison with coarse resolution model data.
We evaluate 15 ACCMIP model simulations of presentday annual mean surface ozone mixing ratios and their seasonal cycle (Figure 6a) against the TOAR-Surface Ozone Database using only data from stations that are classified as "rural" (locations as shown by Figure 6b). The ACCMIP multi-model mean generally captures the observed largescale spatial pattern of annual mean surface ozone: higher in the NH and lower in the SH. The multi-model mean generally overestimates ozone (biases range from -5 to +24 nmol mol -1 ; Figure 6c) with a mean bias of +7.1 nmol mol -1 globally, and 7.7 nmol mol -1 in NH and +3.5 nmol mol -1 in the SH. Over the U.S. and Europe, the mean biases of +7 nmol mol -1 and +5.6 nmol mol -1 , respectively, are similar to the 5 nmol mol -1 bias in HTAP CTMs over these regions . For the U.S, the positive bias ranges from 5-15 nmol mol -1 over the eastern U.S. but exceeds 15 nmol mol -1 over North American coastal regions compared with observations. The mean seasonal cycle is generally reproduced with correlation coefficients for monthly mean values greater than 0.6 (Figure 6d), although the multi-model mean tends to peak later in the year over the eastern U.S., consistent with previous model evaluations ( Murazaki and Hess 2006;Fiore et al., 2009;Reidmiller et al., 2009;Lamarque et al., 2012;Naik et al., 2013;Brown-Steiner et al., 2015;Strode et al., 2015;Travis et al., 2016).
In Europe, the annual multi-model mean performs better over northern Europe (biases range from -2 to +10 nmol mol -1 ) as compared to southern Europe, particularly, over the Mediterranean region where biases exceed 20 nmol mol -1 . A previous study found the multi-model mean of a subset of ACCMIP models, combined with regional and global CTMs, to generally have greater biases in summertime ozone over northern versus southern Europe in comparison to site-level observations (Colette et al., 2015). This conflicts with the analysis presented here possibly due to a combination of different model ensemble size and observational dataset. The multi-model mean generally captures the seasonal cycle over Europe with correlations greater than 0.6 ( Figure 6d) consistent with the comparison over North America.
For grid-cells with measurements in Asia (chiefly Japan), the multi-model mean performs better (biases in the range of 5-10 nmol mol -1 ) and simulates the seasonal cycle accurately (r > 0.8), although a recent in-depth evaluation of the seasonal cycle simulated by global CCMs over marine boundary layer sites on the west coast of Japan indicated that models have difficulty in simulating the seasonal cycle over this region . For the handful of grid-cells in the SH with observations, the mean model bias ranges from -5 to +5 nmol mol -1 with a good simulation of the seasonal cycle (r > 0.8).
Model skill in simulating mean surface ozone distribution varies by the type of model and simulations, evaluation approach (e.g., individual sites versus regional averages), and the availability and quality of observations (Fiore et al., 2009;Reidmiller et al., 2009;Lamarque et al., 2012;Doherty et al., 2013;Naik et al., 2013;Strode et al., 2015;Brown-Steiner et al., 2015;Monks et al., 2015;Schnell et al., 2015;Colette et al., 2015;Tilmes et al., 2015). Overall, current generation global CCMs reproduce the spatial patterns in annual mean surface ozone based on available observations but are generally biased high in the NH and, except for eastern Australia and New Zealand, biased low in the SH. The observed seasonal cycle is generally captured at mid-latitude land areas but there are biases, the cause of which can be inferred with in-depth analyses like those conducted in recent studies (e.g., Derwent et al., 2016;Parrish et al., 2016).

Summary and assessment of model skill
Comparison of simulated and observed annual and monthly mean ozone climatologies provides a first order evaluation of model skill for this species. The current generation of global models simulates a tropospheric ozone burden and net stratospheric influx that compares well against the available observational estimates, whereas simulated total chemical production and loss fluxes, and deposition (which lack observational constraints) show a broad range between the different models. For the chemical fluxes, the spread is likely related to the complexity of the different chemical reaction schemes, particularly the ability to accommodate a range of VOCs. For deposition, the spread is related to uncertainty and model spread in boundary layer dynamics and surface uptake coefficients (after controlling for different near surface ozone levels). Additional measurements may help to narrow the spread, but would be required for several land surface types.
Breaking down the evaluation to a regional level reveals the models are biased high in the northern hemisphere and low in the southern hemisphere. Ozonesonde, satellite, aircraft and surface monitoring data show that these biases generally persist throughout the depth of the troposphere. Models also have difficulty in reproducing observations at sites influenced by long-range transport of ozone and its precursors. As these biases are typical amongst models it suggests a common cause, making emissions a likely candidate, as well as potentially deposition or any other processes where models share similar representations of a process. An evaluation of emissions data as well as targeted model sensitivity simulations could make progress on this issue.
The simulated seasonal cycle of surface and free troposphere ozone compares favorably against observations for most locations, giving confidence that the seasonal variation in meteorology and emissions (chiefly from biomass burning and natural sources) and their impact on ozone is well simulated. There are some exceptions in the free troposphere, including for sonde sites over the Equatorial Americas and, to a lesser extent, over Japan and high latitude northern hemisphere. The reasons for these biases could reflect poorer simulation of local dynamics or missing chemical processes, and requires additional study.

State of knowledge
Extreme pollution typically arises during specific meteorological events, such as heat waves and stagnation episodes, favorable to production from local and regional emissions of ozone precursors (Kirtman et al., 2013). These events may occur on local and regional scales, and can persist over multiple days. Modeling future changes in extreme pollution events requires accurate representation of the underlying synoptic-scale meteorology, but confidence in projecting changes in blocking events, often associated with the most persistent observed regional-scale events (e.g., 2003 European heat wave), is currently poor (Kirtman et al., 2013 and references therein). Feedbacks from local anthropogenic and biogenic emissions and atmospheric chemistry during these meteorologically driven events will influence the severity of the event, and add another layer of uncertainty to projecting future changes. Whether or not our projections are hindered, several studies argue that global chemistry models are able to capture the impact of such large-scale synoptic processes on ozone levels (Fiore et al., 2003(Fiore et al., , 2015Jacob and Winner, 2009). Figure 7 presents a further example, showing a favorable comparison of extreme ozone levels from a global CTM simulation (Murray, 2016) against the same ozone metric from the TOAR-Surface Ozone Database during a heatwave over the U.S. The evaluation of model skill in representing the frequency, intensity and duration of extreme ozone episodes relies on metrics derived from dense, high frequency, long-term, and reliable measurements of surface ozone (TOAR-Observations). Availability of such datasets over Asia and particularly the U.S. and Europe has facilitated modeling studies of extreme ozone pollution over these regions, while sparse observational records limit extreme episode analysis in other regions of the world. The focus here is on the U.S. and Europe as these are the regions with high-quality, long-term observations used in most evaluations of model skill, although there are growing examples of studies from Asia (e.g., Liu et al., 2010;Zhang et al., 2012;Huang et al., 2015).
Evaluating extreme ozone episodes requires a definition of "extreme". There are four approaches to define ozone extremes which have been described in the literature.  These metrics have proven useful to evaluate process representation and representativeness of global model simulations. Given that models are biased, evaluations that avoid absolute definitions of an extreme event (e.g., a particular mixing ratio) are preferable; three of the four approaches meet this criterion. These approaches are: 1. Specific (high) percentiles (e.g., Lei et al., 2012;Pfister et al., 2014). 2. The number and/or frequency of events above a fixed value that is considered extreme at present, including values that are relevant to attaining ambient air quality standards (AAQS) (e.g., Murazaki and Hess, 2006;Wu et al., 2008;Gao et al., 2013;Pfister et al., 2014;Rieder et al., 2015). 3. Statistical methods from extreme value theory (EVT) to analyze ozone extremes in observations and CCM simulations (Rieder et al., 2013(Rieder et al., , 2015. The EVT approach is useful as it combines the frequency and intensity aspect of ozone extremes by focusing on so-called 'T-year ozone return values', which describe the probability of exceeding a value of intensity × within a time window T. 4. The spatial distribution and connectedness of a fixed number of climatologically extreme events at each grid cell (e.g., 100 days in a decade, ~9 7.3 percentile) (Schnell et al., 2014(Schnell et al., , 2015. This approach avoids complications through systematic biases present in many CTMs and CCMs (Dawson et al., 2008) by highlighting the times at each location when ozone pollution is at its worst, regardless of the absolute ozone abundance.
Metrics described in (1) and (2) are available from the TOAR-Surface Ozone Database. The fourth approach has been applied to develop metrics characterizing the climatology of extreme ozone episodes (e.g., annual and interannual variability, areal extent, duration), and has enabled the evaluation of both hindcast and free-running global chemistry model simulations (Schnell et al., 2014). An evaluation of extreme episodes over the U.S. and Europe for a selection of the ACCMIP models showed that, although generally biased high, most models were able to reproduce the observed climatological mean annual ozone cycle, the frequency of extremes, as well as the persistence, spatial extent and observed distribution of pollution episode sizes (Schnell et al., 2015). Thus, despite biases relative to observed ozone levels, global chemistry models do capture day-to-day variability and thus contain information regarding the frequency of extreme events and their spatial extent. However, some models were not able to reproduce the largest episodes, likely a result of too coarse resolution of the synoptic meteorology fields in these models. Additionally, trends in the observations can complicate this analysis (Schnell et al., 2015). While evaluations applying a range of the approaches above show that global models are generally able to represent the salient features of extreme ozone episodes including extent, duration, frequency and year-to-year variability (e.g., Fiore et al., 2003;Schnell et al., 2015), there are systematic regional biases in the intensity of ozone extremes, especially for the summertime eastern U.S. (e.g., Rasmussen et al., 2012;Pfister et al., 2014;Rieder et al., 2015). As stated above, these biases are most problematic when focusing on ozone extremes defined as number or frequency above a fixed threshold or AAQS metric. Despite these biases, global chemistry models simulate larger decreases in the upper tail (relative to other parts of the overall surface ozone distribution) following regional nitrogen oxide emission reductions, consistent with observations (Rieder et al., 2015). This suggests that these models can represent the response of extreme ozone levels to changing emissions but suffer from a mean state bias.
In an effort to address this systematic mean bias, recent studies have applied statistical bias-correction techniques to derive threshold-based metrics (e.g., Rieder et al., 2015). However, bias-correction techniques are model dependent and can only reduce systematic biases intrinsic to the model (e.g., Kang et al., 2008). Improvements in model physics and chemistry, and the implemented emissions inventories can reduce both systematic and unsystematic errors thereby improving the global model simulation of extreme ozone episodes and their metrics.
Computational advances allow current-generation CCMs to perform global simulations at around 1° × 1° to 2° × 2° horizontal resolution (e.g., Lamarque et al., 2013 and references therein), with higher resolutions possible for shorter (~1 year) simulations (e.g., Lin et al., 2012aLin et al., , 2012bPfister et al., 2014;Stock et al., 2014;Zhang et al., 2014). Where high and low resolution versions of a given global model have been compared, the general picture is of reduced biases and better agreement with the probability distribution for extreme episodes, both due to changes in the effective timescales of mixing for the chemistry (see also Section 3.4), and representation of the meteorology (Wild and Prather 2006;Pfister et al., 2014;Stock et al., 2014).
While most extreme events in polluted regions are fueled by regional anthropogenic emissions, extreme ozone events are in some cases produced by wildfire emissions (e.g., Jaffe and Wigder, 2012), transport from the lower stratosphere to the lower troposphere (e.g., Langford et al., 2009;Lin et al., 2012a;Trickl et al., 2014), and intercontinental transport (e.g., Jaffe et al., 1999;Jacob and Winner, 2009). Impacts from these sources vary in space and time and models differ in their level of process representation and estimates of the relative importance of these background sources (e.g., Fiore et al., 2014aFiore et al., , 2014b. Model evaluation is confounded by the lack of source information in direct ozone measurements. While co-located measurements of additional species can provide information for source attribution, in most national, longterm networks, only surface ozone measurements are available (with only a couple studies now documenting long-term precursor measurements in North America; see Pollack et al., 2013;Hassler et al., 2016). Progress in model evaluation for such episodes is anticipated with availability of multiple chemical measurements, such as occurs during field campaigns and perhaps in the future from space-based platforms.

Summary and assessment of model skill
Notwithstanding a mean state bias (see also Section 4), the current generation of global models shows a degree of skill in simulating the timing and spatial distribution of extreme ozone concentrations, as associated with higher pollutant concentrations or particular meteorological conditions. Model-observational differences are generally reduced in higher resolution models, for both the concentration bias and the probability distribution of ozone concentrations. This is due to a more realistic simulation of mixing and chemistry timescales, as well as better representation of the meteorology. Nevertheless, there are still fundamental biases in climate model simulations of meteorological conditions relevant for air quality (e.g., stagnation), which will impact the CCM simulations of ozone extremes. Furthermore, due to the limited availability of appropriate data, these evaluations have been limited to North America, Europe and East Asia, leaving a currently unresolvable gap in our understanding of model skill for extreme episodes in many key regions known to be experiencing high pollution levels (i.e., Africa, Middle East, Asia outside Japan, Central America and South America).

State of knowledge
The lifetime of ozone in the free troposphere is on the order of several weeks, sufficiently long for ozone to be affected by climate variability and associated changes in large-scale atmospheric circulation patterns on interannual to decadal time scales. Quantification of this natural variability in tropospheric ozone is critical to not only understand year-to-year changes in ozone, but also to assess the magnitude of emissions-driven trends (e.g., Lin et al., 2014;Verstraeten et al., 2015;Wespes et al., 2017). Evaluating the fidelity of global model simulations of the relationship between variability in tropospheric ozone levels and climate variability provides an assessment of these models' ability to capture large-scale circulation changes that impact hemispheric transport of ozone pollution in the troposphere, stratosphere-troposphere exchange, and regional meteorological conditions conducive to pollution accumulation.
ENSO is the dominant mode of interannual variability in tropical climate. As shown in many observational and modeling studies, tropospheric column ozone decreases in the eastern tropical Pacific and increases in the western tropical Pacific in response to circulation and convective changes during El Niño conditions (Doherty et al., 2006 and refs. therein). A few studies show that CCMs driven by observed SSTs (AMIP mode; see Section 2.1) are able to capture the overall pattern and magnitude of the tropical ozone response to ENSO obtained from satellite measurements (Oman et al., 2013;Sekiya and Sudo, 2014), including the observed ozone-ENSO index (Ziemke et al., 2010;Oman et al., 2011). This is calculated as the difference between observed monthly mean column ozone over two broad regions in the western and eastern Pacific ocean; see Figure 8 for an example. Reproducing the observed magnitude of ozone enhancements over the western tropical Pacific/Indonesia during El Niño requires simulations with interannually-varying biomass burning emissions (Doherty et al., 2006;Nassar et al., 2008;Inness et al., 2015;Voulgarakis et al., 2015), but the same is not true for the eastern Pacific where ENSO-related ozone variability is largely controlled by changes in dynamics as opposed to emissions (Oman et al., 2013;Lin et al., 2014).
A second important source of tropical variability is the Madden-Julian Oscillation (MJO). The MJO is responsible for variability of ~± 5 DU in total column ozone in the subtropics that arises primarily through vertical movement of the subtropical tropopause (Li et al., 2011). It also drives variability of ~± 1-2 DU in tropospheric column ozone in the tropics through a combination of large-scale advection, convective uplift, and lightning NO x production, and can account for up to 47% of the total variability in the tropical tropospheric column on daily to interannual timescales (Sun et al., 2014). Two different CTMs driven by analyzed meteorological fields have successfully reproduced satellite-observed MJO variability   (Sun et al. 2014;Ziemke et al., 2015), but this has not been demonstrated for CCMs. A single CCM study, using observed SSTs, reproduced the tropical ozone response to ENSO variability but not shorter timescales related to the MJO (Ziemke et al., 2015). Over northern mid-latitudes, ENSO events can affect the interannual variability of hemispheric pollution transport by modulating the strength and position of the subtropical jet stream, particularly in the Pacific-North America sector (Trenberth et al., 1998;Koumoutsaris et al., 2008;Li and Lau, 2012;Lin et al., 2014). Continuous ozone measurements at Mauna Loa Observatory (located in the subtropical North Pacific) since 1974 (Oltmans et al., 1996) can provide a benchmark for evaluating the ability of CCMs to represent tropospheric ozone variability in response to mid-latitude ENSO teleconnections. An ensemble of simulations from a single CCM with constant emissions, driven by observed SSTs and sea ice over 1960-2012, captured the observed springtime ozone variability at Mauna Loa. These simulations attribute this variability as a response to shifts in the position of the subtropical jet stream, coherent with ENSO on interannual time scales and the Pacific Decadal Oscillation on decadal time scales (Lin et al., 2014). These same simulations also demonstrated model skill in capturing the observed autumnal ozone increase at Mauna Loa from the mid-1990s onwards, attributing the increase to a shift to the positive phase of the Pacific North American (PNA) pattern. Simulations constrained by meteorology (nudged) showed more skill in simulating the autumnal ozone interannual variability than simulations only constrained by SSTs. Note that these results are from a single CCM and a wider evaluation of global models is required.
Interannual variability of stratosphere-to-troposphere transport (STT) is an important driver of extratropical tropospheric ozone variability in hemispheric winter and spring. It is thought to have strong connections with several modes of climate variability, including ENSO (Langford, 1999;Zeng and Pyle, 2005;Neu et al., 2014;Lin et al., 2015b), the Arctic Oscillation (Hess and Lamarque, 2007), the North Atlantic Oscillation (Sprenger and Wernli, 2003;Pausata et al., 2012), and the stratospheric quasi-biennial oscillation (QBO) (Hsu and Prather, 2009;Neu et al., 2014), as well as (episodic) volcanic eruptions (Oltmans et al., 1998;Fusco and Logan, 2003;Tang et al., 2013;Lin et al., 2015b). The ability of global models to represent tropospheric ozone variability associated with the STT ozone flux varies depending on model representation of stratospheric chemistry and its dynamical coupling with the troposphere (see Table S1 of Lin et al., 2015b).
Overall, available studies suggest that global chemistry models with a parameterized stratospheric ozone source for the troposphere tend to underestimate extratropical ozone variability. For instance, a simulation over 1987-2005 with assimilated meteorology but a parameterized stratospheric ozone source was unable to match the observed interannual variability of extratropical ozone (Koumoutsaris et al., 2008). Furthermore, the interannual variability (standard deviation) of midtropospheric ozone derived from ozonesonde and aircraft measurements is three times larger than that simulated in a CTM with a parameterized stratospheric ozone source (Hess and Zbinden, 2013). Both models in these studies were unable to reproduce the observed lowozone anomaly following the 1991 Pinatubo volcanic eruption, although they do capture some aspects of the observed 1998-1999 high-ozone anomaly at northern mid-latitudes, which was associated at least in part with an increase in the STT ozone flux following the 1997-8 El Niño (Koumoutsaris et al., 2008;Hess and Zbinden, 2013). A few studies show that CTMs with a climatological stratosphere underestimate observed ozone abundances at high northern latitudes in winter-spring (Hu et al., 2017) and in stratospheric intrusions that penetrate into the lower troposphere (Hudman et al., 2004;Zhang et al., 2011). These modeling analyses suggest the need for more sophisticated simulations that include detailed representations of stratospheric chemistry and circulation, as well as its dynamical coupling with the troposphere. Indeed, a study using a CCM relaxed to observed SSTs and QBO, and with a detailed representation of stratospheric chemistry (but simplified tropospheric chemistry), was able to capture up to 36% of the observed NH midtropospheric ozone interannual variability, depending on location (Hess et al., 2015).
A small number of CCM studies have evaluated tropospheric ozone and STT using simulations that include interactive stratospheric and tropospheric chemistry and aerosols (e.g., Neu et al., 2014;Lin et al., 2015a,b;Strode et al., 2015). A recent study applied observational constraints from the Tropospheric Emission Spectrometer (TES) and Microwave Limb Sounder (MLS) on-board NASA's Aura satellite over the 2005-2010 period to assess the ability of one CCM to reproduce the tropospheric ozone response to stratospheric circulation changes (Neu et al., 2014). The model was able to reproduce the observed relationship between stratospheric circulation changes and tropospheric ozone, and showed that the strength of lower stratospheric circulation varies by 40% from year-to-year which leads to changes in northern midlatitude tropospheric ozone of about 2%: a modest but important contributor to tropospheric ozone changes from climate change (Neu et al., 2014). Over western North America, where the world's deepest stratospheric intrusions are found (Langford et al., 2009;Lin et al., 2012a;Škerlak et al., 2014), a nudged CCM simulation for 1980-2012 reproduced enhancements of upper tropospheric ozone during El Niño as measured at the Trinidad Head ozonesonde station, and simulated greater mid-tropospheric and surface ozone enhancements during La Niña consistent with a suite of observations (Lin et al., 2015b). Since STT occurs as discrete multi-day events, appropriate evaluation of this process requires daily observations. Figure 9 shows that a model simulation with constant emissions captures much of the observed April-May interannual variability in ozone at western U.S. high-elevation sites (r = 0.75), including more frequent high-ozone events in surface air following La Niña and fewer events in the two springs after the 1991 volcanic eruption of Mt. Pinatubo (after Lin et al., 2015b). Similar hindcast simulations with a different model (but with similar chemistry scheme) captured some of the observed springtime western U.S. ozone interannual variability, but with a correlation coefficient of 0.4 (Strode et al., 2015).

Summary and assessment of model skill
Detailed comparisons of modeled and measured sub-decadal variability of ozone is a relatively recent evaluation method, facilitated by the availability of longer observation records. Patterns of variability, such as the ozone-ENSO relationship, emerge from the combination of several underlying processes (e.g., STT, biomass burning emissions, convection etc.) and thus provide a reasonably thorough test of model skill. While these evaluations have only been applied to a small selection of CTMs and CCMs, studies indicate broadly successful model-observation comparisons for the ozone-ENSO relationship, decadal ozone variability driven by the PDO and PNA, and -if the model has the ability to simulate stratospheric chemistry and dynamics -the role of interannual variability of STT on tropospheric ozone. Extending these evaluations to a wider range of global models is now possible with the availability of long, transient CCM simulations of the last ~f ive decades as part of CCMI (Table 1). When completed, this evaluation will allow a better assessment of the ability of individual models to simulate chemistry-climate interactions.

State of knowledge
Global model simulations of long-term (>decade) changes of tropospheric ozone are required to assess the radiative forcing and air quality impacts resulting from those changes and to project them in the future in response to anthropogenic and natural perturbations. Such simulations are challenging because ozone changes result from complex and mutually dependent interactions between precursor emissions, meteorological variability, ozone photochemical production and in situ loss, surface deposition, atmospheric circulation, and long-range transport including stratosphere-troposphere exchange (Lelieveld and Dentener, 2000). Given this complexity, it is perhaps not surprising that simulated ozone changes over a range of time scales disagree significantly with observations and diverge widely between different global models (Lamarque et al., 2010;Parrish et al., 2014). Many global models have been applied to simulate the increase of the tropospheric ozone distribution from preindustrial times to the present. Evaluation of the simulated preindustrial tropospheric ozone levels has been problematic because of extremely limited and highly uncertain measurements of preindustrial tropospheric ozone abundance (TOAR-Observations). Surface ozone measurements made from the Municipal Observatory at the Parc de Montsouris located on the southern edge of Paris during 1876-1910, the oldest quantitative record of ozone, suggest that past ozone values were about 1/5th of present day values (Volz and Kley, 1988;TOAR-Observations). Global models typically have had difficulty in reproducing such low preindustrial ozone levels when compared to the quantitative measurements at Montsouris or to the semi-quantitative, but highly uncertain, Schönbein method observations from a handful of other sites in the late nineteenth and early twentieth century (e.g., Wang and Jacob, 1998;Mickley et al., 2001;Horowitz, 2006;Stevenson et al., 2013). However, a reevaluation of the Montsouris record as part of TOAR (TOAR-Observations) has called into question its reliability, finding that while the observations were made with a valid technique, they were likely impacted by significant sulfur dioxide and ammonia emissions from nearby coal burning and livestock facilities, making the data unsuitable for comparison with global model output. Lack of reliable ozone measurements has made it difficult to provide a robust assessment of the increases in tropospheric ozone levels from preindustrial to present-day (e.g., Staehelin et al., 1994;Cooper et al., 2014;TOAR-Observations).
Nevertheless, appropriate observation-derived metrics with well-defined confidence limits, well-designed model simulations, and novel analysis methods are needed to help identify model deficiencies and resolve modelmeasurement disagreements, thus enhancing confidence in our knowledge of historical changes in tropospheric ozone. One effort to identify model-measurement disagreement in the global CCM simulations of longterm changes in tropospheric ozone has focused on the application of quantitative measurement-derived metrics that describe long-term changes in lower tropospheric baseline ozone from the mid-twentieth century to the present to evaluate models . These metrics span the period during which, presumably, most of the increase of ozone since preindustrial times occurred, but are not derived from the limited and uncertain 19 th century measurements. The derived metrics are the coefficients of polynomial fits to seasonally averaged baseline lower tropospheric ozone mixing ratios at northern mid-latitudes normalized to year 2000 that allow comparisons of long-term ozone changes (see TOAR-Metrics, their Section 2.4.3 for more details). Figure 10 compares these observationally derived metrics with model-derived values, to assess model skill in representing the measured long-term ozone changes at northern mid-latitudes.
The individual seasonal averages exhibit significant variability about the polynomial fits (see Fig. 4c of Parrish et al., 2014). Since the polynomial fits largely remove this interannual variability, the derived values can be interpreted as the seasonally averaged, near surface, baseline ozone levels compared to the year 2000 in the absence of interannual variability. However, the polynomial fits do not remove potential variability on decadal time scales induced by shifts in atmospheric circulation patterns which can confound efforts to obtain estimates of emission-driven ozone trends, as discussed in detail below (see also Lin et al., 2015b). Notwithstanding, a comparison of these measurement-derived polynomial metrics with those derived from three free-running CCMs (which produce their own meteorology and therefore may not reproduce the actual meteorological variability) suggests that those models overestimate absolute ozone abundances for these sites, and capture only about half of the long-term changes in ozone that occurred at northern mid-latitudes over the past five to six decades (Figure 10 derived from Figures 4 and 7 of Parrish et al., 2014).
Unforced, low-frequency climate variability can confound comparisons of long-term trends between observations and the CCMs that simulate their own meteorology (such as those analyzed by Lamarque et al., 2010 andParrish et al., 2014). Studies show that 20-year trends driven by internal climate variability can be as large as emission-driven trends (Lin et al., 2014;2015b;Barnes et al., 2016). An approach to overcome this complication is to compare long-term ozone observations with model hindcasts forced (or nudged) with observed meteorology. This better isolates model-observation disagreements, and could potentially suggest model improvements (e.g., Pozzoli et al., 2011;Koumoutsaris and Bey, 2012;Brown-Steiner et al., 2015;Lin et al., 2015b;Strode et al., 2015;Tilmes et al., 2016;Lin et al., 2017). Overall, such hindcast simulations capture observed decreases in summertime surface ozone in the populated regions of North America and Europe during 1990-2010, but have difficulties simulating the ozone increases measured at remote baseline sites.
In addition to climate variability, sparse in situ measurements on both spatial and temporal scales can complicate the evaluation of ozone trends simulated by global models. One study found that global model hindcast simulations were able to reproduce observed ozone increases in the free troposphere over the western U.S. when the model was co-sampled in time and space with observations, as opposed to continuous temporal and spatial sampling of model results (Lin et al, 2015a). In a follow-up study, when hindcast model results were filtered for hemispheric-scale baseline conditions using simulated regional CO-like tracers, the model was able to successfully reproduce the observed springtime ozone increases at western U.S. sites as well as at a high-elevation site in Japan that is strongly influenced by Asian pollution outflow (Lin et al., 2017). These studies highlight the need to consider representativeness of measurements and provide examples of analysis/sampling approaches to address the model-observation discrepancy in long-term trends in tropospheric ozone.
Discrepancies between modeled and observed ozone trends reported in the published literature thus reflect a combination of factors. Factors related to the measurements or their use include: uncertainties in early measurements (e.g., Staehelin et al., 1994;Logan et al., 2012;TOAR-Observations); sampling biases in modelmeasurement comparisons, such as the representativeness of the trends derived from sparse measurements (Cooper et al., 2010;Lin et al., 2015a); and the presence of trends driven by low-frequency climate variability that freerunning CCMs are not expected to reproduce exactly (Lin et al., 2014(Lin et al., , 2015bBarnes et al., 2016;Garcia-Menendez et al., 2017). Factors related to the models include: errors in the trends incorporated in the underlying emission inventories used in the model (Granier et al., 2011;Hassler et al., 2016); limitations of coarse-resolution models in resolving observed baseline conditions (Lin et al., 2017); and weaknesses in the model representation of the processes (chemical, physical, and dynamical) that control the observed trend at a given location. It is also the case  1930 1940 1950 1960 1970 1980 1990 2000 2010 1980 1990 2000 2010 Europe that a model's ability, or not, to reproduce ozone trends may be different for mean or median values, as opposed to other percentiles. This needs further exploration and quantification with the current generation of global models.

Summary and assessment of model skill
Assessment of model skill in simulating ozone trends over long time periods is hampered by a range of factors, related to basic uncertainty in the available measurements, sampling biases, and the impact of low frequency variability on ozone concentrations that may influence the observed trend (and would not be simulated in a free running CCM). Nevertheless, the poor comparison of a range of models against long term ozone measurements is worthy of deeper exploration, to understand the extent to which weaknesses in model processes or inputs are responsible, compared to issues related to measurement uncertainties and sampling biases. Progress can be achieved by analyzing model simulations constrained by the observed meteorology and sampled to the same spatial and temporal pattern as the measurements.

Explaining and addressing model ozone biases
In the previous sections, we have discussed model evaluation approaches and model skill in simulating tropospheric ozone distributions, variability and trends. Model parameterizations, observation limitations (uncertainties, spatial and temporal coverage), and an incomplete understanding of physical and chemical processes introduce errors into model simulations as we have described in the above sections. Here, we outline specific processes or parameters that lead to uncertainties in the simulation of tropospheric ozone at various spatiotemporal scales. Our list largely focuses on processes that are centrally important for gas-phase chemistry, omitting a detailed discussion of biases arising from treatment of atmospheric physics or the terrestrial biosphere, for instance, which are all important for simulation of ozone in chemistry GCMs, CCMs and Earth system models. Issues related to representativeness of measurement-model comparisons as discussed in previous sections.

Emissions of ozone precursors
A substantial proportion of the uncertainty in the spatial and temporal distribution of ozone simulated by global models arises from uncertainties in emissions of ozone precursors, which are chiefly NO x , CO, methane, and NMVOCs (e.g., Granier et al., 2011). These emissions may be from anthropogenic sources, which are typically prescribed in global models as inputs, or from natural sources, which may be either prescribed or parameterized based on their dependence on land surface properties and meteorological or other model variables.

Anthropogenic emissions
Anthropogenic emission sources include fuel production, industrial and domestic combustion of fossil fuel and biofuel, transportation, waste disposal, industrial processes, solvent production and use, and agriculture. Emission inventories used by global models are generally derived from a bottom-up statistical approach, which estimates emissions as the product of activity levels (such as fuel consumption and number of vehicles) and emission factors (emissions of trace species per unit activity) (Granier et al., 2011;Hoesly et al., 2017). Recent inventories provide gridded monthly anthropogenic emissions (e.g., Hoesly et al., 2017) although previously they only considered annual values (Lamarque et al., 2010). Temporal variability on diurnal, daily, and weekly scales is not typically included in these inventories.
Developing emission inventories at global and regional scales requires detailed activity information at national or provincial levels, and their historical and future trends (e.g., Xing et al., 2013). The accuracy of emission estimates is thus limited by the completeness of activity data and accuracy of emission factors. Significant differences exist in commonly used global and regional anthropogenic emission inventories for ozone precursors and aerosols (Granier et al., 2011;Monks et al., 2015). Inverse methods, such as Bayesian inversion, Kalman filter, and model adjoints have been used to provide top-down constraints to improve the bottom-up emission estimates through combining models and observations of atmospheric trace species (e.g., Arellano et al., 2004;Heald et al., 2004;Kopacz et al., 2009;Miyazaki et al., 2012).
Several studies have examined the sensitivity of ozone simulations to emission inventories, often focusing on Asia where differences in emission estimates are particularly large (Ma and van Aardenne, 2004;Streets et al., 2006;Amnuaylojaroen et al., 2014;Jena et al., 2015;Zhong et al., 2016;Saikawa et al., 2017). For example, large discrepancies between emission inventories in NO x emissions over urban areas in India and China have been shown to produce significant differences in model simulated surface ozone mixing ratios (Jena et al., 2015;Zhong et al., 2016). Furthermore, one reason for the apparent mismatch in observed and modeled surface ozone trends in the northern mid-latitudes could be due to inaccuracies in precursor emission trends . Analyses of long-term observations of NO x and CO ratios in U.S. and European megacities indicate that current global emission inventories fail to capture the observed trends in NO x /CO enhancement ratios due to inadequate regional knowledge of emissions source information (Hassler et al., 2016). Most recently, global model overestimation of surface ozone in the Southeast U.S. has been partly attributed to overestimated NO x emissions in the U.S. EPA emission inventory (Travis et al., 2016). These studies highlight the need for a careful evaluation of precursor emissions and innovative ways to constrain them through tracer-tracer correlations (see Section 3.3), the use of satellite data sets (e.g., Richter et al., 2005;Martin et al., 2006, Lin et al., 2010Lamsal et al., 2011, Duncan et al., 2016 or data assimilation techniques that employ models and satellite observations to provide top-down emission estimates (e.g., Miyazaki et al., 2017).

Biomass burning emissions
Biomass burning or fire emissions present a large source of ozone precursors to the atmosphere. Estimates of biomass burning emissions are commonly calculated as products of burned area, fuel consumption, combustion completeness, and emission factors of various species (van der Werf et al., 2006(van der Werf et al., , 2010Wiedinmyer et al., 2011), all subject to large uncertainties. Emissions are typically input as seasonally varying monthly means for longer simulations, but inventories with daily variability are also used (e.g., Young et al., 2012).
Satellite observations of fire activity have been applied to estimate burned area at global or regional scales, constraining the timing and locations of fires (Sukhinin et al., 2004;Giglio et al., 2006). Some recent inventories have started to use satellite retrievals of fire radiative power instead of burnt area which appears superior at least in some world regions (Kaiser et al., 2012). However, a comparison of five global biomass burning emission inventories based on different satellite fire or burned area products showed a large range of 365-1422 Tg CO emissions for the year 2003 (Stroppiana et al., 2010;Reddington et al., 2016). Different estimates of biomass burning emissions are also highly variable in the spatial patterns, temporal variability, and long-term trends (Schultz et al., 2008). This means that construction of an emissions dataset for decadal-to-centennial time scale model simulations is challenging, with additional uncertainties likely arising from discontinuities in the observational record (van Marle et al., 2017).
Considerable uncertainty exists in model simulation of ozone production from biomass burning emissions. Most observations have shown that ozone is produced from fire emissions, yet some have reported no ozone enhancement or even ozone depletion in fire plumes depending on plume age, effects of co-emitted aerosols, and mixing with urban pollution (Singh et al., 2010(Singh et al., , 2012Jaffe and Wigder, 2012;Parrington et al., 2013;Baylon et al., 2015). Observations of ozone enhancements in fire plumes indicated by the O 3 /CO enhancement ratio show a wide range of -0.1 to 0.9 nmol mol -1 /nmol mol -1 , and tend to increase with plume age (Jaffe and Wigder, 2012;Wigder et al., 2013). Quantifying ozone production from fires in models is difficult, and uncertainties arise not only from the magnitude and variability of NO x and nonmethane volatile organic compounds (NMVOCs) emitted (Andreae and Merlet, 2001;Akagi et al., 2011), but also from the sub-grid non-linear photochemistry associated with aerosols that affects both chemistry and radiation in fresh fire plumes (Alvarado et al., 2009;Jiang et al., 2012). Global models may overestimate ozone production in fresh fire plumes and underestimate the regional ozone enhancements from transport of peroxyacetyl nitrate (PAN) due to inadequate chemistry and coarse grid resolution (Zhang et al., 2014;Lu et al., 2016).
In addition, observations show that under favorable atmospheric conditions strong fires can enhance deep convection, injecting plumes more than 5 km into the free troposphere (Cammas et al., 2009;Fromm et al., 2010;Val Martin et al., 2010;Sofiev et al., 2013). Model studies have also found that limiting fire emissions to the boundary layer underestimates their influence in downwind regions (Cook et al., 2007;Freitas et al., 2007;Brioude et al., 2009;Chen et al., 2009;Jian and Fu, 2014). More effort is needed to evaluate model simulation of ozone production from biomass burning emissions, in particular during the evolution of fire plumes and over regions that are dominated by biomass burning emissions (e.g., Mauzerall et al., 1998;Alvardo et al., 2010;Singh et al., 2010;Arnold et al., 2015).

Natural emissions
Lightning NO x emissions: NO x emissions from lightning have a large influence on tropospheric ozone, particularly in the tropics, because they occur much higher in the troposphere where ozone production is more efficient (Pickering et al., 1998;Sauvage et al., 2007;Murray et al., 2013;Barth et al., 2015). Model representation of lightning NO x emissions relies on parameterizations of lightning flash rate based on deep convection properties (Price and Rind, 1992;Price et al., 1997;Allen and Pickering, 2002). However, these lightning flash parameterizations generally have difficulty in reproducing the satellite lightning flash observations from the Optical Transient Detector (OTD) and the Lightning Imaging sensor (LIS) (Christian et al., 2003;Tost et al., 2007;Murray, 2016). Scaling the spatial and temporal distribution of lightning flashes in models to match OTD/LIS observations has been shown to deliver notable improvements in simulating tropical ozone (Sauvage et al., 2007;Murray et al., 2012). The sensitivity of ozone to lightning NO x , poor constraints on the magnitude and vertical distribution of NO x emissions, and the potential sensitivity of lightning NO x emission changes to climate change (e.g., Banerjee et al., 2014), all point to the need for an improved mechanistic understanding of lightning NO x generation. For instance, flash parameterizations based on the upward flux of ice particles in clouds have been shown to suggest a reduction in lightning NO x emissions under climate change (Finney et al., 2016). Ultimately, the magnitude of lightning NO x emissions will remain dependent on the ability of the parent meteorological model to reproduce the strength, timing and distribution of major convective events.
Biogenic emissions: The biosphere releases a large quantity and variety of NMVOCs to the atmosphere that far exceeds anthropogenic sources, with isoprene (C 5 H 8 ) and monoterpenes (C 10 H 16 ) being the most abundant compounds emitted (Arneth et al., 2008;Guenther et al., 2012). Several studies have demonstrated that biogenic NMVOC emissions are important for simulating the tropospheric ozone budget and distribution (e.g., Pfister et al., 2008;Williams et al., 2009Williams et al., , 2013. Empirical emission algorithms such as the Model of Emissions of Gases and Aerosols from Nature (MEGAN;Guenther et al., 2006Guenther et al., , 2012 have been applied to calculate biogenic NMVOC emissions as a function of vegetation type, surface temperature, solar radiation, leaf age, soil moisture, leaf area index and other activity factors. Vegetation type may be from satellite-derived maps of vegetation classes or from vegetation classes simulated by the land surface component of the model. For reasons of computational expediency, the land surface components of models typically have far fewer different vegetation types represented than the satellite products, meaning that their vegetation classes often represent the emitting capacity of several different species (although such a setup allows a coherent simulation of climate-vegetationatmospheric chemistry interactions).
While the use of empirical emission algorithms has been argued to be unsatisfactory, as they do not capture the fundamental biochemical processes that underlie biogenic emissions (Monson et al., 2012), these are widely used in current global chemistry models for convenience in estimating global-scale emissions. Estimates of biogenic NMVOC emissions are highly sensitive to the meteorological, soil, and vegetation conditions (Arneth et al., 2011;Guenther et al., 2012;Henrot et al., 2017), with annual global isoprene emissions ranging from 350 to 769 Tg (as reported by Guenther et al. (2012), based on different inputs). Satellite observations of formaldehyde columns may provide valuable information for optimizing global and regional biogenic NMVOC emissions using inverse modeling methods (e.g., Palmer et al., 2006;Fu et al., 2007;Millet et al., 2008;Stavrakou et al., 2009;Marias et al., 2014;Bauwens et al., 2016). However, satellite-derived formaldehyde columns are themselves subject to large uncertainties which combined with uncertainties in the oxidation mechanisms add to the uncertainty in inferred biogenic NMVOC emissions estimates (e.g., Millet et al., 2006;Dufour et al., 2009;Barkley et al., 2013;Zhu et al., 2016). Moreover, the wide variety of higher hydrocarbons and oxygenated NMVOC released from biogenic sources remains poorly characterized, and this may constitute an additional source of chemically-active species that are typically neglected in current global modeling studies.
Soil emissions: NO x emissions from microbial nitrification and denitrification in soils are estimated to account for ~1 5% of the present-day global NO x emissions. These emissions are subject to large uncertainties with global above-canopy estimates ranging from 4.7 to 16.8 Tg N year -1 (Hudman et al., 2012, Vinken et al., 2014. Implementations of soil NO x emissions in global models are generally based on empirical (e.g., Yienger and Levy, 1995;Steinkamp and Lawrence, 2011) and process-based models (e.g., Potter et al., 1996;Parton et al., 2001). Soil NO x emissions are highly sensitive to fertilizer application methods and timing, climate and soil conditions, such as temperature, soil moisture, and nitrogen availability, and show large pulses of emissions following soil wetting by rain (e.g., Jaeglé et al., 2004;Stehfest and Bouwman, 2006;Hudman et al., 2010). Better physical parameterizations of these processes are needed for models to represent the spatiotemporal variability of soil NO x emissions (Hudman et al., 2012).

Methane emissions
Atmospheric methane concentrations have been increasing steadily since preindustrial times, and after remaining flat for about a decade in the mid-1990s to early 2000s, their growth has resumed since 2007 (Saunois et al., 2016). Increases in methane abundance lead to increases in global background tropospheric ozone concentrations (Prather et al., 2001;Lin et al., 2017). In most current-generation global CTMs and CCMs, atmospheric methane concentration is prescribed, either globally uniform or with latitudinal variation, at the surface but allowed to undergo chemical processing above the surface layer Lamarque et al., 2013) to avoid the computational expense of running long simulations to reach steady state. The assumption of uniform methane concentrations would have implications for the simulated ozone spatial distribution. As models begin to incorporate more realistic representation of atmospheric methane (e.g., Szopa et al., 2013;Dalsøren et al., 2016) with emissions prescribed from inventories and/or calculated interactively for natural sources, effort is needed to quantify the impact of this update on the spatio-temporal distribution of tropospheric ozone.

Chemistry
In addition to the uncertainty in reaction rate coefficients and absorption cross sections, and numerical uncertainties from the chemical solver code (e.g., Sandu et al., 1997), the main uncertainty in global chemical models comes from limitations in the representation of tropospheric chemistry. The computational expense of calculating chemical tendencies and tracer transport means that chemical mechanisms in global models are necessarily simplified, providing a relatively parsimonious description of gas phase tropospheric oxidation with respect to organic molecules and their oxidation pathways in particular (~50-250 species, ~5 00-1000 reactions) (e.g., Lamarque et al., 2013). More complete descriptions of tropospheric chemistry are available, for example from the Master Chemical Mechanism (MCM; ~6 700 species; ~1 7000 reactions) (Jenkin et al., 1997;Saunders et al., 2003). Standard simplifications in global chemistry models involve grouping ("lumping") chemically similar species together, such as using a single species to represent all higher (≥C4) hydrocarbons (Emmons et al., 2010), or simply omitting certain classes of compounds. Mechanisms are often developed and expanded to address particular research questions, and they are typically optimized for ozone production rather than for generation of semivolatile species important for secondary aerosol formation for instance. There are also isolated examples of mechanisms that are partially (Pöschl et al., 2000) or wholly (Jenkin et al., 2008) traceable to more complex antecedents. However, few global models resolve the urban scales where the fast reactions of higher hydrocarbons are important, and it is therefore unclear if implementation of complex chemistry schemes would improve representation of global scale ozone distributions and the chemistry important for climate.
There have been several efforts to benchmark global model chemical mechanisms against each other and the MCM (e.g., Pöschl et al., 2000;Emmerson and Evans, 2009;Archibald et al., 2010a;Squire et al., 2015). Using box models to isolate the differences to just the chemistry, these studies reveal substantial differences between the mechanisms and against the MCM, variously attributable to peroxy radical (HO 2 and RO 2 ) production, chain termination (HNO 3 and ROOH production), and the treatment of organic nitrogen (NO y ) chemistry (e.g., alkyl nitrate production), depending on the particular mechanism or prescribed chemical conditions. However, while these differences are undoubtedly important in understanding global model limitations and biases, box model studies themselves are not sufficient to understand how the chemical mechanism interacts with other global model components (deposition, photolysis etc.). A step forward would be to run a range of global models, each with a variety of different chemical mechanisms in order to explore these interactions.
An additional source of bias is the neglect of whole classes of compounds from model chemistry schemes. Tropospheric halogen (Cl, Br and I) chemistry is a particular example of this, and is only routinely simulated by a few models (none of the ACCMIP models include it). Halogens from inorganic (sea-salt) and organic (photodissociation of biogenic compounds) sources, particularly of marine origin, are thought to be important for the tropospheric ozone budget, as they take part in efficient ozone loss catalytic cycles (Read et al., 2008;Saiz-Lopez and von Glasow, 2012;Wang et al., 2015). Studies with single models have shown that this chemistry may have a notable impact on the ozone budget and associated radiative forcing in the troposphere Sarwar et al., 2015;Schmidt et a., 2016;Sherwen et al., 2016aSherwen et al., , 2016b, and the inclusion of tropospheric bromine chemistry has been suggested as a way of partially accounting for the low preindustrial ozone levels suggested by the measurements (Parrella et al., 2012) (although these measurements are uncertain; see TOAR-Observations).
Other aspects of chemistry that are often omitted include the full range of peroxy radical cross reactions (incomplete in even the MCM; Saunders et al., 2003); peroxy radical recycling from isoprene oxidation (e.g., Crounse et al., 2011;Peters et al., 2014), which some studies indicate can affect OH levels and VOC lifetimes over low-NO x forested areas (Archibald et al., 2010b(Archibald et al., , 2011Taraborrelli et al., 2012); nitryl chloride (ClNO 2 ), which has been found to be important in oxidative chemistry, particularly in coastal regions (e.g., Osthoff et al., 2008); and nitrous acid (HONO) formation other than from NO + OH (+M), including from other gas phase sources (Bejan et al., 2007: Li et al., 2008Li et al., 2014), bacteria (Oswald et al., 2013), aerosol reactions (Ammann et al., 1998Stemmler et al., 2007) and heterogeneous processes (Zhou et al., 2001;Stemmler et al., 2006;Su et al., 2011;Mao et al., 2013). In general, heterogeneous processes (Ravishankara, 1997;Jacob, 2000) are simulated in most models, although typically only for a few species (e.g., heterogeneous formation of N 2 O 5 and loss of HO 2 ) and with substantial variation in uptake coefficients, which can have notable effects on modeled abundances and chemical budgets (e.g., Evans and Jacob, 2005;Macintyre and Evans, 2010). See TOAR-Ozone-Budget for further discussion of these processes.
Much of atmospheric photochemistry is initiated by photolysis processes, which are represented in models in a number of different ways. Solution of the equations of radiative transfer is computationally expensive, and simplifications include restricted geometry (e.g., a twostream approach), optimization over a reduced number of wavelength intervals, infrequent calculation, or precalculation using lookup tables (e.g., Wild et al., 2000;Voulgarakis et al., 2009). Light scattering and absorption by cloud droplets and aerosol particles affect photolysis rates greatly, and are sensitive to both how these radiative processes are represented and the simulated cloud and aerosol distributions. Treatment of surface albedo (Laepple et al., 2005) and cloud overlap (Neu et al., 2007) have also been identified as significant sources of bias in simulating tropospheric oxidants. These issues add substantial uncertainty to photolysis calculations, and are in addition to the large uncertainties in observationallyderived absorption cross sections and quantum yields.
Overall, limitations and uncertainty in the chemistry mechanism and associated processes are well known sources of biases in global models. Yet the importance of their impact depends on the question being studied. For example, a simplified mechanism with just methane and isoprene chemistry may be sufficient for capturing decadal-centennial scale variability in long climate simulations, but would do poorly in simulating the ozone production and extreme pollution episodes in an urban environment. Additionally, different models likely differ in sensitivity to additions and improvements to their chemical schemes. Without a well-designed study we do not know the relative size of the uncertainty from chemical mechanisms compared to other structural uncertainties in these complex models.

Wet and dry deposition
Removal at the Earth's surface through wet and dry deposition processes provides the ultimate sink of many atmospheric constituents. Ozone is chemically reactive and is therefore readily removed from the atmosphere by a wide variety of processes occurring on surfaces. Although its solubility is relatively low, and its direct removal by precipitation is small, these processes strongly influence the levels of several key precursor species.
Dry deposition is a major removal pathway for ozone in the boundary layer. Global model estimates of annual deposition range from 710 to 1470 Tg yr -1 (Wu et al., 2007;Hu et al., 2017), with multi-model studies suggesting 1000 ± 200 Tg yr -1 ; see also Section 4.2). Most current models use resistance-based deposition schemes (based upon Wesely et al., 1989), which use observationally-derived ozone resistances for different surface types. There are large differences in the deposition velocity of ozone to different surface types, ranging from orders of 1 cm s -1 over forest to less than 1 mm s -1 over snow and ice (Fowler et al., 2009). This partly reflects the range of different processes involved, from adsorption and chemical take-up in plant stomata (Fowler et al., 2009) to iodine-mediated removal at the ocean surface (Prados-Roman et al., 2015). Analysis of observational studies of the processes controlling ozone deposition shows that the importance of different deposition pathways differs by location as well as by season and from year to year at each location (Rannik et al., 2012;Clifton et al., 2017). Comparison of dry deposition fluxes from 15 global models involved in the TF HTAP model intercomparison project (Table 1) found that differences in ozone dry deposition are strongly influenced by differences in land cover classification used in models (Hardacre et al., 2015). Comparison of modeled and observed ozone fluxes showed substantial biases at individual locations, particularly where the vegetation at the measurement site was not representative of the wider region, but there was no evidence for a systematic bias in deposition velocities across all sites when considered together (Hardacre et al., 2015). For the oceanic surface, comparison of deposition velocity simulated by a single CCM with observations suggests that the Wesely et al. (1989) scheme overestimates dry deposition by approximately a factor of two (Luhar et al., 2017). However, another study with a regional model has shown that comparison against observations is improved with higher deposition velocities, driven by the addition of chemical interactions at the air-water interface in the model (Sarwar et al., 2016). Overall, reducing model uncertainties associated with dry deposition requires tighter constraints on model deposition velocities through measurements over a wider range of land cover types and surfaces.
Wet scavenging of soluble species in precipitation affects the tropospheric ozone budget through removal of species such as nitric acid, which alters the lifetime of ozone precursors and hence tropospheric ozone abundances (Neu and Prather, 2012). Substantial differences in the distribution and intensity of rainfall between models affect the magnitude and timing of precursor removal, indirectly affecting ozone. The influence of wet deposition on modeled ozone distributions has not been explored thoroughly, but studies with a single CTM suggest that halving the wet deposition of precursors leads to a 10 Tg (~3%) increase in the tropospheric ozone burden and a smaller (~2%) increase in surface ozone (Wild, 2007).

Representation of the stratosphere
Input from the stratosphere is an important source of tropospheric ozone, but its contribution to the tropospheric ozone distribution and budget is highly uncertain. Stratosphere-troposphere exchange follows the Brewer-Dobson circulation: a downward air flow with a net input of ozone from the stratosphere to the troposphere in the extratropics, while a slow upward flow from the troposphere to the stratosphere occurs above the tropics, all varying in strength by season (Haynes et al., 1991;Holton et al., 1995;Stohl et al., 2003 and references therein). As well as this large scale process, transport of stratospheric ozone occurs through synoptic scale events at mid-and high latitudes (e.g., Stohl et al., 2003;Langford et al., 2009).
Limited direct measurements of the stratospheric ozone source (e.g., Olsen et al., 2013) and diversity in the model representation of the dynamical and chemical processes needed to account for this source make it difficult to accurately quantify the stratospheric contribution to tropospheric ozone (Lin et al., 2012a and references therein). Models differ in the complexity of representation of stratospheric chemistry, ranging from a passive ozonelike tracer or linearized ozone chemistry (McLinden et al., 2000), to detailed stratospheric chemistry coupled with the troposphere (e.g, Lin et al., 2012a;Neu et al., 2014;Iglesias-Suarez et al., 2016;Tilmes et al., 2016). There are also differences in the location and magnitude of stratosphere-to-troposphere transport among models that lead to model ozone discrepancies in the upper troposphere as well as for episodic ozone events near the surface (Fiore et al., 2014a).
As discussed in section 6, a realistic representation of stratospheric processes (including dynamical variability and ozone depletion/recovery) has been realized to be key in capturing the impact of interannual variability in stratosphere-troposphere transport on tropospheric ozone, and also to simulate realistic long-term trends. Recent developments in global chemistry models towards representing tropospheric and stratospheric chemistry as a single entity (Morgenstern et al., 2017 and references therein) will allow the uncertainty in the stratospheric impact on the troposphere to be evaluated more comprehensively (e.g., CCMI). Extensive evaluation of these models against more meaningful diagnostics that provide information on transport between the stratosphere and troposphere (e.g., Neu et al., 2014;Orbe et al., 2015;Liu et al., 2016) will help build confidence in the model simulated stratospheric contribution to tropospheric ozone.

Model dynamics and meteorology
Atmospheric dynamics and meteorology play an important role in controlling the spatial distribution of tropospheric ozone and its precursors, and biases in model meteorology can lead to substantial biases in simulation of tropospheric ozone. Ozone and its precursors can be lifted and transported over regional and intercontinental scales by dynamical features such as the warm conveyor belts within mid-latitude cyclones and deep convection (e.g., Liu et al., 2003;Cooper et al., 2004;Liang et al., 2004;Kiley et al., 2006;Brown-Steiner and Hess, 2011;Knowland et al., 2015). Accurate representation of these features in models is thus important to simulate long-range pollution transport. Model simulations of CO, an effective tracer of pollution transport, are often used to assess transport biases, and studies have clearly demonstrated that differences in the strength and timing of convection and in boundary layer mixing lead to large differences in pollutant transport (Kiley et al., 2003;Liu et al., 2010;Hoyle et al., 2011), with 10-30% differences in simulated CO levels due to differences in model transport alone (Arellano and Hess, 2006). Even models driven by the same meteorological fields can show substantial differences in large-scale tropospheric transport due to differences in the parameterization of convection (Orbe et al., 2017).
Meteorological nudging provides one method to reduce the errors associated with biases in model meteorology. Nudging has been shown to permit better representation of large-scale dynamical features and their impacts on ozone, in particular the stratospheric Quasi-Biennial Oscillation (Randel et al., 2009), North Atlantic Oscillation (Pausata et al., 2012), the Brewer-Dobson circulation (Jöckel et al., 2006), and meteorological changes driven by the eruption of Mount Pinatubo in 1991 (Telford et al, 2009;Shepherd et al., 2014). Model biases in predicting ozone levels near the tropopause have been shown to be sensitive to the chemistry-climate model used and the source of meteorological reanalysis data employed for nudging (e.g., NCEP versus ECMWF), and these biases can be in opposite directions (Aghedo et al., 2011). A study using two global models driven by different assimilated meteorological data found that the modeled ozone differences in the western U.S. were very sensitive not only to natural ozone sources, but also to stratosphereto-troposphere transport and vertical mixing in the troposphere (Fiore et al., 2014). Resolving these dynamical biases remains a challenge for the wider weather and climate modeling communities, but advances made through improvements in numerical weather prediction can ultimately be expected to benefit model simulations of tropospheric ozone.

Temporal and spatial resolution
Model biases against observations may also arise from numerical issues associated with the underlying model design and formulation. Spatial and temporal discretization of the atmosphere, the ordering of processes within a model time step, and the numerical methods associated with process parameterization may all introduce biases. Spatial resolution can be a particular problem for comparison of global models with observations in urban regions, where strong precursor sources lead to large variations in ozone on spatial scales far smaller than the model grid scale. Models typically overestimate ozone in these locations, partly because numerical mixing shortens chemical production timescales and misses localized ozone titration from intense NO sources, and partly because observation sites may be representative of regions no larger than a few kilometers (Wild and Prather, 2006;Hodnebrog et al., 2011;Stock et al., 2014; see also Section 3.4). Numerical separation of chemical environments through the use of plume-in-grid treatments has been explored in some studies (Sillman et al., 1990) and have been used to represent the chemistry of aircraft and ship plumes (e.g., Cariolle et al., 2009;Charlton-Perez et al., 2009), but these approaches are not widely used in global models.
How can these biases be addressed? Increasing model spatial resolution to match the chemical and dynamical timescales of interest has important benefits, bringing model scales closer to the region of representativeness for measurement sites and additionally reducing biases caused by numerical mixing. Studies with regional models, simulating only a limited area and thus able to achieve much higher resolutions than global models, suggest resolutions on the order of 10 km are necessary to capture the strong gradients in emissions and photochemical processes over large urban areas (Tie et al., 2010). Simulations at these scales have also been shown to have important impacts on the estimation of human health benefits arising from emission controls (Thompson et al., 2014).
Yet increasing model resolution is currently not computationally feasible for models covering a global domain and designed for decadal-or centennial-scale runs. Global models are being developed to run with "regional refinement" (variable resolution GCMs), where a geographical region of interest has enhanced resolution (e.g., Huang et al., 2016), and there are several examples of running a more highly resolved model nested within a coarser grid (e.g., Misenis and Zhang, 2010;Yan et al., 2016). But even when simulating much smaller domains, the horizontal and vertical resolution may be insufficient to capture features of interest, such as intercontinental transport of pollution plumes, mainly due to numerical diffusion in Eulerian grid models (Rastigejev et al., 2010;Eastham and Jacob, 2017). Sub-grid scale treatment of processes occurring at scales much smaller than the model grid-size is needed, but evaluation of variables against observations at a single site when sub-grid scale variability is high can be misleading unless the representativeness of the site of the model grid scale has been reliably assessed. New methods applying a more probabilistic approach to model evaluation are needed to account for sub-grid scale spatial variability and to avoid attributing biases associated with observational representativeness to model weaknesses. This may provide a better assessment of overall model behavior while avoiding some of the problems associated with the minimum scales resolved in the model.

Missing processes
Models will always remain incomplete in their inclusion of the physical and chemical processes affecting atmospheric composition. This reflects both the boundaries of scientific understanding and the computational constraints imposed by ever increasing model complexity. The challenge is to include or approximate the effects of all known processes that impact ozone substantially for the purpose at hand, contingent on the temporal and spatial scales of interest, and to identify where unknown processes may be needed to explain biases against observations. The timescales involved in the process of scientific discovery, from conception of an idea and demonstration of its importance through to wider scientific acceptance and inclusion in models, is often long, and there is consequently a time lag between process identification and implementation in models. Examples of recent advances in understanding that have yet to be included in most models include the chemistry of higher VOC and halogens, gas-aerosol interactions and heterogeneous chemistry, rapid urban photochemistry and dynamics, and vegetation canopy processes occurring close to the Earth's surface. Many of these advances relating to ozone are summarized in a recent review (Monks et al., 2015), and omission or oversimplification of these processes in current models lead to biases that have yet to be adequately quantified. In many cases there is a reliance on more detailed process models working on much smaller spatial and temporal scales to demonstrate the importance of new processes before they are included in large-scale models. Computational constraints impose a threshold for inclusion of processes in these models, governed by the need to achieve an optimum balance of accuracy and usability, as well as understanding new uncertainties the processes may bring. Careful exploration of model biases against observations is vital for identification of gaps in current understanding and can lead to scientific advances through identification of missing processes, such as OH recycling over tropical forests (Lelieveld et al., 2008), and rapid surface ozone loss in the Arctic due to Br chemistry (Yang et al., 2008). Progress in model development is contingent on this observational ground-truthing, and it is thus important that model biases are seen as an opportunity to improve scientific understanding as well as a reminder of the limitations of current modeling tools.

Conclusions and future outlook
TOAR-Model Performance has discussed, summarized and assessed global chemistry models with a focus on their simulation of tropospheric ozone. We have discussed their development history and nomenclature, and the results from their participation in multi-model assessments; summarized common model evaluation strategies; assessed their performance for the global ozone distribution, for extreme pollution events, and how they simulate changing abundances at sub-decadal to multidecadal time scales; and presented some reasons why we might expect models to deviate from the real world.
A key conclusion in terms of assessing model performance is that great care needs to be taken when considering how meaningful a model-measurement comparison is: i.e., how representative are the model or measurement are of a given time period and location, in light of the model and measurement sampling patterns, measurement error and model resolution? Additionally, is the evaluation aligned with the purpose of the study? If the model is being used to quantify potential health impacts, is the model being assessed against extreme values? If we are interested in the role of the stratosphere, does the model have a realistic meridional circulation? As the literature stands, it is difficult to unambiguously identify the drivers of model-observation disparity, and in particular how these drivers might vary across the range of global models. We therefore recommend systematic and thorough evaluation of the current generation of global models, to the degree that is currently done for isolated, single model studies.
New ways forward in model evaluation include analysis in a regime-focused or process-based way (e.g., does the model simulate the expected ENSO-ozone response, does it capture the relative time scales?), using multiple constraints (e.g., simultaneous measurements of a number of species) and relationships (e.g., CO/ozone ratios), and making better use of observational data beyond monthly means. With the caveats related to representativeness and observational uncertainty in mind, such analyses show the most promise for quantifying, characterizing and understanding global chemistry model performance in future assessments. Beyond that, we would urge the community to collaborate with observational scientists to better understand the uses and limitations of the data, as well as data scientists to take advantage of their specialized knowledge of advanced statistical techniques.
Assuming that evaluations against measurements are valid, why do global models sometimes not perform well? We have reviewed several potential causes of model-measurement discrepancy here, including inputs (chiefly emissions), chemical scheme, physical processes (deposition, transport/meteorology), temporal and spatial resolution, and (potential) missing processes. Some of the effects of these limitations and uncertainties have been explored in model studies, mostly with single models, but we currently lack a comprehensive assessment of their relative importance for tropospheric ozone and chemistry in general. We recommend designing simulations that target specific uncertainties (e.g., long term ozone trends, the NH versus SH model bias pattern), preferably completed by a large range of models, in order to improve our understanding.
Where is the development of global chemistry models headed? In the case of understanding chemistry as part of the climate system, the next-generation of CCMs is advancing towards modeling the full terrestrial-oceanatmosphere biogeochemical cycle to represent the whole Earth System; termed as Earth System Models (ESMs) (Heavens et al., 2013). The primary feature that distinguishes ESMs from CCMs is their ability to simulate the interactions between land, ocean and atmosphere in a fully coupled sense. While this might facilitate improvements in the representation of, for example, natural precursor emissions (e.g., interactive oceanic halogen emissions or biogenic emissions) and deposition (e.g., interactive dry deposition), these advances in modeling will require concurrent advances in our ability to observe ozone and ozone precursors for thorough model evaluation. We should also note that advances in coupling processes might result in other simplifications: the level of plant differentiation in land surface models of ESMs is far less detailed than that used by MEGAN for instance (Henrot et al., 2017). Moreover, yet more complexity means more sources of uncertainty, due to more processes having to be modeled, each described by equations and parameters that are known to different degrees.
Dealing with the myriad sources of uncertainty in global models is one of the most challenging aspects to their use for science and policy-relevant questions. Currently, uncertainty is mainly assessed by considering model-measurement agreement and inter-model spread using "ensembles of opportunity" ( Table 1). While useful snapshots of the state-of-the-science, such experiments are not designed to explore uncertainty systematically, and the inter-model spread likely underestimates the true structural uncertainty. Ongoing research is exploiting advanced statistical techniques (after Carslaw et al., 2013;Lee et al., 2013) to more fully quantify the drivers of uncertainty in modeled tropospheric chemistry. Such work is important to better target our limited scientific resources onto the most important uncertainties, to improve the models and increase the reliability of the projections they make.
How then should modelers, observation scientists, analysts and other users approach model results? The obvious answer is the same as it is across science: critically. One should ask whether the model performance is acceptable for the problem being addressed, and whether biases can be tolerated or corrected; whether a model is appropriately constituted, including if it has appropriate chemical complexity, resolution etc.; and if there is a way to assess the likely uncertainty.

Data Accessibility Statement
Surface ozone observations shown in Figures 6 and 7 are available from the TOAR data portal (http://toardata.fz-juelich.de/). All other data can be obtained from the corresponding authors: Paul J. Young (paul.j.young@ lancaster.ac.uk) and Vaishali Naik (Vaishali.Naik@noaa. gov). Meteorological Organization (WMO). This work was greatly assisted by the comments and feedback of several participants of the International Global Atmospheric Chemistry (IGAC) TOAR project, via workshops and on earlier drafts. In particular, the authors would like to thank (in alphabetical order) Bill Collins, Owen Cooper, Pat Dolwick, James Hemby, Barron Henderson, Terry Keating, Jingqiu Mao, Norm Poissel and Heather Simon for their in-depth reviews and suggestions. The authors also thank Catherine Raphael for refining the figures and preparing Figure 1. Finally, the authors would like to acknowledge the scientists who completed the model simulations for the results that we draw on, particularly for the ACCMIP simulations. ACCMIP was organized under the auspices of the IGAC and Stratospheretroposphere Processes And their Role in Climate (SPARC) projects, which fall under FutureEarth and World Climate Research Program (WCRP) respectively. The authors are grateful to the British Atmospheric Data Centre (BADC), which is part of the NERC National Centre for Atmospheric Science (NCAS), for collecting and archiving the ACCMIP data, and which also hosts the simulation output from the next generation Chemistry-Climate Model Intercomparison (CCMI) activity.

Funding information
A portion of the work was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the NASA Aeronautics and Space Administration. A portion of the work was carried out the National Center for Atmospheric Research, which is operated by the University Corporation for Atmospheric Research under sponsorship of the National Science Foundation. PY acknowledges support from the Faculty of Science and Technology, Lancaster University. JB and UI acknowledge NordForsk under the Nordic Programme on Health and Welfare Project #75007: Understanding the link between air pollution and distribution of related health impacts and welfare in the Nordic countries (NordicWelfAir); and the H2020-LCE project: Role of technologies in an energy efficient economy -model based analysis policy measures and transformation pathways to a sustainable energy system (REEEM), Grant agreement no.: 691739. GZ

Competing interests
The authors have no competing interests to declare.

Author contributions
• VN and PY lead the overall writing and editing of the manuscript and contributed equally ("coordinating lead authors"); AM, AG, JG, ML, JN, DP, HR, JS, ST, OW, JZ and LZ all made substantial contributions to the text or figures and took lead roles in compiling individual sections of the manuscript ("lead authors"); and JB, AD, RD, CG, MH, LH, UI, RK, AL, LM, DP, JR, AS-L, MS, MW, and GZ all contributed text ("contributing authors"). All authors helped put together the final manuscript.