Tropospheric Ozone Assessment Report: Tropospheric ozone from 1877 to 2016, observed levels, trends and uncertainties

David Tarasick*, Ian E. Galbally†,‡, Owen R. Cooper§,‖, Martin G. Schultz¶, Gerard Ancellet**, Thierry Leblanc††, Timothy J. Wallington‡‡, Jerry Ziemke§§, Xiong Liu‖‖, Martin Steinbacher¶¶, Johannes Staehelin***, Corinne Vigouroux†††, James W. Hannigan‡‡‡, Omaira García§§§, Gilles Foret‖‖‖, Prodromos Zanis¶¶¶, Elizabeth Weatherhead§,‖, Irina Petropavlovskikh§,‖, Helen Worden‡‡‡, Mohammed Osman****,††††,‡‡‡‡, Jane Liu§§§§,‖‖‖‖, Kai-Lan Chang§,‖, Audrey Gaudel§,‖, Meiyun Lin¶¶¶¶,*****, Maria Granados-Muñoz†††††, Anne M. Thompson§§, Samuel J. Oltmans‡‡‡‡‡, Juan Cuesta‖‖‖, Gaelle Dufour‖‖‖, Valerie Thouret§§§§§, Birgit Hassler‖‖‖‖‖, Thomas Trickl¶¶¶¶¶ and Jessica L. Neu******


Introduction
Tropospheric ozone is a greenhouse gas and pollutant detrimental to human health and plant growth (Monks et al., 2015;WMO Reactive Gases Bulletin, 2018). Large changes after 1990 in the global distribution of the anthropogenic emissions that produce ozone have been reported, including reductions in North America and Europe and increases in Asia (Richter et al., 2005;Granier et al., 2011;Russell et al., 2012;Hilboll et al., 2013;Cooper et al., 2014;Zhang et al., 2016). This rapid shift, coupled with limited ozone monitoring in developing nations, has left scientists unable to answer the most basic questions: Which regions of the world have the greatest human and plant exposure to ozone pollution? How is ozone changing in nations with strong emission controls? To what extent is ozone increasing in the developing world? How can the atmospheric sciences community facilitate access to the ozone metrics necessary for quantifying ozone's impact on climate, human health and crop/ecosystem productivity?
To answer these questions, the International Global Atmospheric Chemistry Project (IGAC) developed the Tropospheric Ozone Assessment Report (TOAR): Global metrics for climate change, human health and crop/ecosystem research (www.igacproject.org/activities/TOAR). Initiated in 2014, TOAR's mission is to provide the research community with an up-to-date scientific assessment of tropospheric ozone's global distribution and trends from the surface to the tropopause. TOAR's primary goals are (1) to produce the first tropospheric ozone assessment report based on the peer-reviewed literature and new analyses, and (2) to generate easily accessible, documented data on ozone exposure metrics at thousands of measurement sites around the world (Lefohn et al., 2018). Through the TOAR surface ozone database (hereinafter TOAR-Surface Ozone Database), these ozone metrics are freely accessible for research on the global-scale impact of ozone on climate, human health (Fleming et al., 2018) and ecosystem productivity.
The assessment report is organized as a series of papers in a special feature of Elementa: Science of the Anthropocene (https://collections.elementascience.org/toar), with this paper comprising the Tropospheric Ozone Assessment Report: Tropospheric ozone from 1877 to 2016, observed levels, trends and uncertainties, subsequently abbreviated as TOAR-Observations. This paper describes the different tropospheric ozone measurement techniques used from the late 19th century to the present, and characterizes the uncertainty in the measurements and the spatial and temporal information obtained from each instrument type.
Knowledge of the uncertainties associated with tropospheric ozone measurements is important for reconciling measurements from different methods and platforms and for accurate and realistic model evaluation. It is also essential for the evaluation of trends. Historical ozone observations, those made before the widespread deployment of UV-based ozone instruments, are important for climate models. The global average radiative forcing of ozone (0.4 ± 0.2 W m⁻²; IPCC, 2013) is approximately 1/5 of the radiative forcing due to CO₂, and slightly less than the radiative forcing due to methane (NOAA, 2018). This estimate has large uncertainty due to limited knowledge of preindustrial concentrations of tropospheric ozone and its present-day spatial distribution (IPCC, 2013). Additional uncertainty arises from the detrimental impact of ozone on plant productivity, which, through feedbacks on CO₂ uptake, produces an indirect forcing (Sitch et al., 2007). Past efforts to evaluate 19th century ozone measurements have concluded that ozone in pre-industrial times was as low as 1/5 of its present concentration (e.g. Marenco et al., 1994;Volz and Kley, 1988;Bojkov, 1986;Staehelin et al., 1994), based primarily on observations at Montsouris, Paris, France in the late 19th century. However, the validity of the early Montsouris measurements as representative of the regional atmosphere has been challenged (Calvert et al., 2015;Staehelin et al., 2017), and global atmospheric chemistry models have difficulty reproducing such a large historical increase from pre-industrial times (e.g. Wang and Jacob, 1998;Mickley et al., 2001;Lamarque et al., 2005;Young et al., 2013;Parrish et al., 2014;Young et al., 2018). It is therefore important to quantify uncertainties for these older measurement methods, to establish confidence limits for reproducibility and bias, and to answer the question: how well do we know historic levels of tropospheric ozone?
Section 2 of this paper describes the many methods that have been used to measure tropospheric ozone. Section 3 is an in-depth re-evaluation of the record of ozone in surface air away from cities and other interferences. Section 4 addresses the measurement of ozone in the free troposphere, beginning with the relatively few historical measurements. Section 5 discusses several aspects of representativeness, and uncertainties associated with sampling of ozone in the troposphere. The paper concludes with a discussion of knowledge gaps and recommendations for future measurements.

Standards for the measurement of ozone in the atmosphere
Ozone is a highly reactive gas, with strong absorption bands in the IR and the UV. Three broad sets of techniques based on chemical reaction, UV absorption and IR absorption and emission have been used to measure ozone in the atmosphere. The methods derived from these techniques and their first use to measure ozone in the atmosphere are presented in Table 1. These methods have different measurement uncertainties and the results obtained from paired measurements using either the same or different techniques may differ from each other both systematically and randomly.
As a reactive gas, ozone cannot currently be kept in containers, nor does it persist in snow without ongoing loss. Hence past concentrations cannot be measured directly today (although they may be inferred from isotopic measurements of oxygen trapped in ice (Yeung et al., 2019)). It is also not possible to transport a sample of gas containing a known concentration of ozone from one location to another without ozone loss occurring within the container. Therefore, some other form of standard for ozone calibration is required to ensure world-wide traceability of measurement results. The current standard for tropospheric ozone measurement is based on its ultraviolet absorption cross-section at 253.65 nm of 1.148 × 10⁻¹⁷ cm² molecule⁻¹. This standard originates from Hearn (1961); it was adopted by the International Ozone Commission in 1984 and by the International Standards Organisation (ISO, 2017), is specified by the World Meteorological Organisation in its Guidelines for Continuous Measurement of Ozone in the Troposphere, and is used by the International Bureau of Weights and Measures (BIPM) for ozone calibrations (BIPM, 2019). To propagate this standard for surface and aircraft ozone measurements, specially designed ozone photometers, which incorporate an ozone generator and measure the absorption of 253.65 nm UV radiation by ozone-containing sample air within short cells (1 m or less in length), have been used as ozone transfer standards (ISO, 2017;Paur et al., 2003;Viallon et al., 2006a). By referencing ambient measurements to these standards, well-understood and traceable observations of tropospheric ozone are made (Tanimoto et al., 2007;Viallon et al., 2006a, b).
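The photometric principle behind these transfer standards can be sketched with the Beer-Lambert law: the ozone number density is the absorbance divided by the cross-section and the cell length, and dividing by the ideal-gas number density of air gives a mole fraction. The Python below is a minimal illustration, not the algorithm of any particular instrument; it assumes a single cell of hypothetical length, ideal-gas air and the 1.148 × 10⁻¹⁷ cm² molecule⁻¹ cross-section quoted above, and all function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23   # J K^-1 (exact SI value)
SIGMA_253_65 = 1.148e-17   # cm^2 molecule^-1, standard value quoted in the text


def air_number_density_cm3(pressure_pa, temperature_k):
    """Ideal-gas number density of air in molecules cm^-3."""
    return pressure_pa / (BOLTZMANN * temperature_k) * 1e-6  # m^-3 -> cm^-3


def ozone_from_transmittance(i_over_i0, path_cm, pressure_pa, temperature_k,
                             sigma_cm2=SIGMA_253_65):
    """Ozone mole fraction (nmol/mol) from I/I0 measured in one absorption cell.

    Beer-Lambert: I/I0 = exp(-sigma * n * L), so n = -ln(I/I0) / (sigma * L).
    """
    if not 0.0 < i_over_i0 <= 1.0:
        raise ValueError("transmittance must lie in (0, 1]")
    n_ozone = -math.log(i_over_i0) / (sigma_cm2 * path_cm)  # molecules cm^-3
    return n_ozone / air_number_density_cm3(pressure_pa, temperature_k) * 1e9


def transmittance_for(ozone_nmol_mol, path_cm, pressure_pa, temperature_k,
                      sigma_cm2=SIGMA_253_65):
    """Forward model: expected I/I0 for a given ozone mole fraction."""
    n_ozone = ozone_nmol_mol * 1e-9 * air_number_density_cm3(pressure_pa, temperature_k)
    return math.exp(-sigma_cm2 * n_ozone * path_cm)


# Hypothetical 38 cm cell at 1013.25 hPa and 25 degC containing 40 nmol/mol of ozone.
t = transmittance_for(40.0, 38.0, 101325.0, 298.15)
print(f"I/I0 = {t:.6f}, retrieved = {ozone_from_transmittance(t, 38.0, 101325.0, 298.15):.3f} nmol/mol")
```

In this example the attenuation is well under 0.1%, which illustrates why lamp stability and careful zeroing matter for UV photometers.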
The numerical value of the ozone absorption cross-section is currently under review (Hodges et al., 2019;Orphal et al. 2016), with a recommendation that the value should be decreased by approximately 1.23% (Hodges et al., 2019). If accepted by the appropriate agencies (BIPM, WMO, ISO), this change will require all tropospheric ozone measurements on the current UV standard scale to be increased by approximately 1.23%. This will not affect trends, but it will have a small effect on estimates that depend on the absolute ozone amount, such as calculations of ozone radiative forcing. This change will also improve agreement of the UV scale with gas phase titration (GPT) and with the potassium iodide (KI)-based ECC ozonesondes.
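Because a photometer retrieves concentration as absorbance divided by σL, every value on the UV scale scales as 1/σ. The short Python sketch below (hypothetical function name, invented example values) rescales a series for a 1.23% decrease in σ. The exact factor is 1/(1 - 0.0123) ≈ 1.0125, marginally more than 1.23% because the dependence is inverse rather than linear; relative (percentage) changes between values are unaffected by such a uniform rescaling.

```python
SIGMA_HEARN = 1.148e-17  # cm^2 molecule^-1, current standard value


def rescale_to_new_cross_section(values, sigma_old, sigma_new):
    """Rescale UV-scale ozone values for a change in reference cross-section.

    Retrieved concentration is proportional to 1/sigma (Beer-Lambert),
    so every value is multiplied by sigma_old / sigma_new.
    """
    factor = sigma_old / sigma_new
    return [v * factor for v in values]


# Hypothetical example: a cross-section 1.23% smaller and two annual means.
sigma_proposed = SIGMA_HEARN * (1.0 - 0.0123)
old_series = [30.0, 33.0]  # nmol/mol, illustrative only
new_series = rescale_to_new_cross_section(old_series, SIGMA_HEARN, sigma_proposed)

relative_change_old = (old_series[1] - old_series[0]) / old_series[0]
relative_change_new = (new_series[1] - new_series[0]) / new_series[0]
print(new_series, relative_change_old, relative_change_new)
```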
A second ozone standard is gas phase titration of ozone against nitric oxide gas standards. Differences between GPT and standard UV photometry have been investigated by Tanimoto et al. (2006) and Viallon et al. (2006b, 2016) and found to be very small (~0.3%) when the newer values of the ozone absorption cross-sections (see Section 2.2.1) are used. Thus GPT supports the proposed decrease in the ozone absorption cross-section at 253.65 nm. Because GPT is utilized as a standard and has not been used for ambient ozone measurements in either the historical record or the TOAR database, it is not listed in Table 1. Further information on GPT is included in the Supplemental Material (Text S-1).
These standards are propagated, via international coordination of the adoption of ozone absorption coefficients for UV and visible light (Orphal et al., 2016), to the communities using remote sensing methods for ozone measurement in the free troposphere. There is a recommendation to extend this co-ordination to infrared methods (Orphal et al., 2016).
The available record of surface ozone measurements divides into two periods: the modern period covers approximately 1975 to the present and is defined by the widespread availability of sensitive UV photometers for surface ozone measurements; the historical period covers the years before this, back to the late 19th century, and is defined by the use of other techniques and the absence of these UV photometers. There are a few years of overlap between the periods during the uptake of the modern technology.
A set of four criteria was developed and applied to select data for the historical reconstruction: (1) the measurement methods used should be related, through intercomparisons, to the current standard UV absorption photometer method; (2) the likelihood of significant contamination of the ozone measurements by interfering pollutants in the sampled atmosphere should be low; (3) for surface ozone, the air sampled should be representative of the well-mixed boundary layer; and (4) recognizing the uncertainties associated with all of the historical data sets, the measurements should be free from major artifacts or inconsistencies. The datasets that pass these four criteria are explicitly documented.
To commence the reconstruction of a historical tropospheric ozone record, the first of the four criteria must be addressed for each of the ozone data sets examined. As many historical measurement methods pre-date the current UV method, traceability may be derived through published side-by-side intercomparisons with an intermediate method.
The surface ozone method/instrument intercomparisons found in the literature are presented in Table 2. The ratio for each pair of methods corresponds either to the ratio of the mean values from each method or to the slope of a regression whose intercept is assumed to be zero. Often only this number is recorded. In some cases an uncertainty is cited; where not explicitly stated, we have assumed that this corresponds to one standard deviation. Table 2 is divided into 5 sections, the first four commencing with the comparison of a method with the UV method and then proceeding to other comparisons of that or closely related methods. The methods separated are Levy, Ehmert, Electrochemical Concentration Cell (ECC), and Colorimetric. The fifth section is for other relevant method comparisons not included in the first four sections. Two conclusions can be drawn from Table 2: (a) in the absence of other information, the relative bias of a historical set of ozone observations with respect to the current UV standard lies in the range 0.7 to 1.2 at approximately the 90% confidence level; and (b) the uncertainty in the bias between different studies of apparently identical instruments can be as large as 50%. Consequently, except in special cases of traceability, each past set of observations should be seen as having a substantial unknown bias, and the historical record at a particular location cannot necessarily be meaningfully related to current measurements at the same location. Because of such inconsistencies between historical and modern ozone observations, TOAR-Observations estimates historical ozone levels on regional or zonal scales using all available data sets. The rationale for this new approach is that the normalised biases of multiple sets of observations, if random, will tend to cancel out when averaged across multiple stations. A formal description of this approach is given in the Supplemental Material (Text S-2).
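The two ways of expressing a method ratio in Table 2, and the rationale for averaging across stations, can be illustrated in a few lines of Python. This is a sketch under stated assumptions: the paired data are invented, function names are hypothetical, and the uniform bias distribution between 0.7 and 1.2 is an illustrative choice, not the statistical treatment used in TOAR-Observations (which is described in Text S-2). For exactly proportional data the two ratio estimators agree; with scatter they generally differ. If station biases are independent, the spread of their average shrinks roughly as 1/√N, whereas a bias shared by all stations would not average out.

```python
import random
import statistics


def ratio_of_means(reference, candidate):
    """Candidate/reference ratio computed from the mean of each method."""
    return statistics.fmean(candidate) / statistics.fmean(reference)


def slope_through_origin(reference, candidate):
    """Least-squares slope b of candidate = b * reference (zero intercept)."""
    num = sum(x * y for x, y in zip(reference, candidate))
    den = sum(x * x for x in reference)
    return num / den


def spread_of_mean_bias(n_stations, trials=2000, low=0.7, high=1.2, seed=1):
    """Std. dev. of the average multiplicative bias across n_stations records.

    Illustrative assumption only: each record's bias is independent and
    uniformly distributed between `low` and `high`.
    """
    rng = random.Random(seed)
    means = [statistics.fmean(rng.uniform(low, high) for _ in range(n_stations))
             for _ in range(trials)]
    return statistics.stdev(means)


uv = [20.0, 30.0, 40.0]        # invented paired measurements, nmol/mol
historic = [22.0, 33.0, 44.0]
print(ratio_of_means(uv, historic), slope_through_origin(uv, historic))
print(spread_of_mean_bias(1), spread_of_mean_bias(16))
```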
The average of multiple sets of these past observations for multiple years and a particular geographic region is more appropriate for comparison with current observations. Therefore, we recommend that the
Table 2: Comparisons of (a) various surface ozone measurement methods against in-situ UV ozone measurements and other key methods and (b) ozonesonde responses in the lower troposphere. Comparisons were undertaken either sampling ambient air, (A), or via laboratory studies, (L). NBKI = neutral-buffered potassium iodide solution. Where required, older measured ratios have been adjusted to reflect the current standard UV absorption cross-sections (Hearn, 1961). DOI: https://doi.org/10.1525/elementa.376.t2
Each of the techniques used in ozone data sets either selected or rejected in this study will be discussed in turn.

Potassium Iodide measurement techniques
While ozone was originally identified and investigated by its odour (Schönbein, 1840;Rubin, 2001), the first quantitative measurements were based on the reaction of ozone with potassium iodide:
$$\mathrm{2KI + O_3 + H_2O \rightarrow I_2 + O_2 + 2KOH} \qquad (1)$$
The basis of this measurement is the assumption that for each ozone molecule reacted, one molecule of iodine is produced; this ratio is the stoichiometry of the reaction. In most methods the amount of iodine produced is measured and, in mole units, taken to equal the amount of ozone in the air volume sampled. A number of techniques based on the KI reaction have been developed over the last two centuries, and the ozone-KI reaction is still in use in balloon-borne ozonesondes (Table 1).
The stoichiometry of the reaction is crucial and has been studied extensively. Many studies, however, were made at ozone concentrations much higher than those in the troposphere, because of the difficulty at the time of working with low concentrations of ozone (Saltzman and Gilbert 1959a;Boyd et al., 1970;Hodgeson et al., 1971;Kopczynski and Bufalini, 1971;Dietz et al., 1973). Byers and Saltzman (1959), using GPT, found a reaction stoichiometry of 1.00 (with unquantified uncertainty) at pH 7, and found that the stoichiometry varied with pH, being 50% lower at pH 14. This indicates that the chemistry of the KI reaction with ozone is complex, involving reactions other than (1) that produce additional iodine, as well as reactions that cause loss of iodine (Byers and Saltzman, 1959;Staehelin and Hoigné, 1985).
Without buffering, the reaction will drive the solution alkaline, so in most methods the KI solution is buffered (NBKI, for neutral-buffered KI). Dietz et al. (1973) found an NBKI/UV ratio of 1.00 ± 0.03 at pH 7 for two measurements at 100 and 400 nmol mol⁻¹. Pitts et al. (1976a) found that the 2% NBKI method gave NBKI/IR = NBKI/UV = 1.23 ± 0.06 at 50% relative humidity for 0.1 to 1 ppm ozone, and NBKI/IR = NBKI/UV = 1.14 ± 0.04 at 3% relative humidity. They had no explanation for the apparent water vapour dependence. However, the dependence was reduced when potassium bromide was added (Lanting, 1979;Bergshoeff et al., 1980), as is the case in ozonesondes. Slow side reactions involving the phosphate buffer may also change the stoichiometry from 1.0 (Saltzman and Gilbert, 1959a;Flamm, 1977;Johnson et al., 2002).
Other compounds present in air can interfere with the KI-ozone reaction. NO₂ and H₂O₂ give positive interferences, NO₂ at a level of 5-10% (Pitts et al., 1976), although this appears to be quite variable (Cherniak and Bryan, 1965;Tarasick et al., 2000). SO₂ causes a negative interference of 1:1, i.e. a quantitative reduction in the ozone detected (Pitts et al., 1976;Schenkel and Broder, 1982;Volz and Kley, 1988). NH₃ is also a negative interferent (Anfossi et al., 1991), which increases the pH of the solution as well as reacting directly with iodine, although the stoichiometry is not quantified (Downs and Adams, 1973).
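The interference magnitudes quoted above can be turned into a rough correction envelope. The Python sketch below is hypothetical: it assumes SO₂ removes ozone signal 1:1 and that NO₂ adds 5-10% of its own concentration, and it ignores H₂O₂ and NH₃, whose effects are not quantified here. Given the reported variability of the NO₂ interference, the output should be read as a bound under those assumptions, not as a calibrated correction.

```python
def ki_ozone_bounds(ki_reading, so2, no2, no2_fraction=(0.05, 0.10)):
    """Range of ozone values consistent with a KI reading (all in nmol/mol).

    Assumed model: reading = O3 - SO2 + f * NO2, with f within no2_fraction.
    H2O2 and NH3 are ignored because their effects are not quantified here.
    """
    f_low, f_high = no2_fraction
    without_so2_loss = ki_reading + so2  # undo the 1:1 negative SO2 interference
    # A larger NO2 interference means more of the reading was spurious,
    # so it gives the lower bound on ozone.
    return without_so2_loss - f_high * no2, without_so2_loss - f_low * no2


# Hypothetical example: KI reading 30, SO2 5 and NO2 10 nmol/mol.
print(ki_ozone_bounds(30.0, 5.0, 10.0))
```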
Losses can occur in the inlet to the sampler, but even early experimenters appear to have been aware of this, and strove to avoid it. Inlet tubes (where described) were usually of glass (e.g. Dauvillier, 1934;Glückauf et al., 1944) and the type of glass was found to be important (Carbenay and Vassy, 1953). Other materials such as polyvinyl chloride became available later (e.g. Vassy, 1958), and may have caused negative biases in some cases (Altshuller et al., 1961;Potter and Duckworth, 1965) before Teflon was introduced (Gudiksen et al., 1966). In one case a cotton wool filter was used in the inlet (Edgar and Paneth, 1941b). No information is available with which to estimate inlet losses, but they could have negatively biased some of the KI measurements.
Loss due to evaporation of the iodine produced can also occur (Brewer and Milford 1960;Kley et al., 1988). There are a number of methods based on the KI reaction (Table 1), and while all are similarly subject to interfering gases, they differ in terms of potential for iodine and/or ozone loss, and side reactions. In the Cauer method, the evaporation of the iodine is part of the analytical technique (Warmbt, 1964). The efficiency of the sampler needs to be considered for each case.
The contributions of each of these interfering or modifying factors cannot always be separately quantified. The best available summary of how a technique performs is obtained from comparisons in unpolluted ambient air with the UV method or with a method traceable to the UV method, as presented in Table 2.

Schönbein papers
The Schönbein paper method uses a KI- and starch-impregnated paper. Ozone diffusing to the paper surface reacts with the iodide, and the iodine produced forms a strongly blue-colored complex with the starch. Alternatively, a pH indicator is present and the colour change is due to the alkalinity resulting from reaction (1), see Hartley (1881). Following exposure, the paper strips can be compared to a standard color scale to give a semi-quantitative ozone measurement (Fox, 1873). There are several variations on this technique, named after their inventors/developers: Schönbein, Sallon, de James, Therry and Houzeau. These methods are described in limited detail (Houzeau, 1857;Fox 1873;Hartley 1881;Linvill et al. 1980;Bojkov 1980;Kley et al. 1988;Anfossi et al. 1991).
Interest in ozone was very high in the late 19th century, in part because of its role as an "air purifier" and the erroneous belief that it could eliminate pathogens, particularly cholera (Fox, 1873). Measurements were therefore made with Schönbein or related papers at hundreds of sites in Europe, the Americas, Australia, Asia, Africa, and Antarctica (Smyth, 1858;Fox, 1873;Royal Society, 1908;Bojkov, 1986;Galbally and Paltridge, 1989;Sandroni et al., 1992;Sandroni and Anfossi, 1994;Pavelin et al., 1999;Nolle et al., 2005).
There are two laboratory test chamber studies of the color development response to time and ozone concentration that either directly or indirectly relate the filter paper method to the current UV-absorption standard. Linvill et al. (1980) found a filter paper response to ozone exposure in which color development depended very strongly on the relative humidity (RH) in the chamber. A change from 3 to 4 (of 10) color units corresponds to a 10 nmol mol⁻¹ ozone change at 80% RH and a 30 nmol mol⁻¹ ozone change at 60% RH. Kley et al. (1988) found that the papers, on exposure to a constant ozone level, gave an initial linear color increase continuing for 3 hours or more, followed by a plateau and then a color decrease. Further exposure to ozone increased this color loss. Because of this complex behaviour, there is a region between 6 and 10 hours of exposure where ozone values between 10 and 50 nmol mol⁻¹ correspond to less than 1 unit difference on the Schönbein scale. For longer exposures the responses overlap and reverse order, i.e. longer exposures at high concentrations for selected conditions give lower color responses than some shorter exposures at lower concentrations. Consequently, it is impossible to uniquely relate a color development on these filter papers to an ozone concentration on the UV scale.
The filter paper method of measuring ozone concentrations is a passive measurement method without a controlled diffusion barrier, the absence of which creates a wind speed dependence. There does not appear to be any information on the repeatability and reproducibility of these techniques in the field. It appears that the colour development (the signal) depends on ozone concentration, time of exposure, relative humidity, wind speed and light. The colour development has negative responses to ammonia and sulphur dioxide (Fox 1873;Linvill et al. 1980;Bojkov 1980;Kley et al. 1988). There may also be differences in response depending on the type of paper and method of preparation.
In 1876-1877, 289 parallel ozone measurements at the Montsouris Observatory in Paris were undertaken using KI papers and the more quantitative KI-arsenite method (Albert-Lévy, 1877). This study was repeated in the two following years (Marenco et al. 1994). The individual data are not available, but a frequency table is available from which a linear regression relationship can be determined. Several authors have used such a relationship in attempts to calibrate the KI paper measurements (Bojkov, 1986;Anfossi et al., 1991;Lisac and Grubišić, 1991;Sandroni et al., 1992;Marenco et al., 1994;Anfossi and Sandroni, 1997;Pavelin et al., 1999;Weidinger et al., 2011). These either use the Montsouris comparison directly, or use it to scale the chamber results of Linvill et al. (1980). In all cases they find that ozone was much lower in the 19th century than now. Evidently, this conclusion is entirely dependent on the calibration at Montsouris, a single site on the edge of an urban centre, not necessarily representative of the regional background atmosphere (the Montsouris measurements are discussed in section 3.1 below). Moreover, the results of the scaling are not consistent with the chamber measurements (compare Figure 1 of Pavelin et al., 1999 to Figure 1 of Linvill et al., 1980 or Figure 3 of Kley et al., 1988).
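Deriving a calibration line from a frequency table, as was done with the Montsouris intercomparison, amounts to least squares with the bin counts as weights. The Python sketch below is a minimal illustration with a hypothetical table of (paper score, KI-arsenite concentration, count) rows; the numbers are invented and are not the Montsouris data. Regressing concentration on paper score and regressing paper score on concentration give different lines when the scatter is large, so the direction of the fit is itself a methodological choice.

```python
def weighted_linear_fit(table):
    """Fit y = a + b*x to binned (x, y, count) rows; returns (a, b).

    Each bin counts as `count` identical observations, i.e. ordinary least
    squares with the counts as weights.
    """
    w_sum = sum(c for _, _, c in table)
    x_bar = sum(c * x for x, _, c in table) / w_sum
    y_bar = sum(c * y for _, y, c in table) / w_sum
    sxy = sum(c * (x - x_bar) * (y - y_bar) for x, y, c in table)
    sxx = sum(c * (x - x_bar) ** 2 for x, _, c in table)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical rows: (paper colour score, KI-arsenite ozone, number of days).
# These numbers are invented and are NOT the Montsouris frequency table.
HYPOTHETICAL_TABLE = [(2, 8.0, 5), (4, 14.0, 12), (6, 20.0, 9), (8, 26.0, 3)]
intercept, slope = weighted_linear_fit(HYPOTHETICAL_TABLE)
print(intercept, slope)
```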
The KI papers appear to have been useful as a relative measure of ozone concentration, and showed many aspects of ozone variation and distribution that are now well known (Bojkov, 1986;Anfossi et al. 1991). However, given the high sensitivity of KI papers to relative humidity (greater than their sensitivity to ozone concentration), exposure time, wind speed and other factors, and the radically different results from intercomparisons, the filter paper measurements cannot be related to modern ozone measurements with any degree of confidence, and are not recommended for quantitative use. The same recommendation was made by Fox (1873), Hartley (1881) and Kley et al. (1988).
In the Albert-Lévy technique, the sampling solution contains iodide and arsenite. Ozone bubbling through the solution reacts with the iodide, producing iodine. The iodine produced reacts with the arsenite (AsO₃³⁻), converting it to arsenate (AsO₄³⁻). Two titrations are performed to determine the amount of arsenite in a vessel of the solution that has had air bubbled through it, and in an identical vessel that has not been exposed to bubbling. The titrations are conducted in an alkaline medium with a volumetric standard solution of iodine. The quantity of ozone in the air is calculated from the difference between the amounts of arsenite in the two vessels. Measurements were continuous, giving 24-hour sampling averages. Volz and Kley (1988) replicated the apparatus and method of Albert-Lévy and found agreement with the UV method in the laboratory to within ±2%. Dauvillier (1935) undertook an intercomparison of the UV and the KI-arsenite methods using atmospheric samples over a snow surface in Abisko, Sweden. Between 22 December 1934 and 7 March 1935 there were 50 simultaneous measurements, and the ratio of the derived ozone amounts (KI-arsenite/UV) was 0.78. This suggests that in practice the KI-arsenite method may underestimate atmospheric ozone levels.
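The KI-arsenite calculation reduces to the difference of two titrations divided by the moles of air sampled. The Python sketch below assumes unit stoichiometry at each step (one I₂ per ozone molecule, one arsenite oxidised per I₂, one I₂ of titrant per remaining arsenite), no iodine loss, and ideal-gas air; the function names and example volumes are hypothetical.

```python
R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def moles_of_air(volume_m3, pressure_pa, temperature_k):
    """Ideal-gas moles of air in the sampled volume."""
    return pressure_pa * volume_m3 / (R * temperature_k)


def albert_levy_ozone(iodine_mol_per_l, titrant_blank_ml, titrant_exposed_ml,
                      air_volume_m3, pressure_pa, temperature_k):
    """Ozone mole fraction (nmol/mol) from the KI-arsenite back-titration.

    Assumes 1:1 at every step: O3 -> I2, each I2 oxidises one arsenite, and
    one I2 of titrant reacts with each arsenite remaining in a vessel.
    """
    arsenite_blank = iodine_mol_per_l * titrant_blank_ml / 1000.0      # mol
    arsenite_exposed = iodine_mol_per_l * titrant_exposed_ml / 1000.0  # mol
    ozone_mol = arsenite_blank - arsenite_exposed
    return ozone_mol / moles_of_air(air_volume_m3, pressure_pa, temperature_k) * 1e9


# Hypothetical day: 0.001 mol/L iodine titrant, 10.00 ml (blank) versus
# 9.00 ml (exposed), 2.2414 m^3 of air at 273.15 K and 101325 Pa.
print(albert_levy_ozone(0.001, 10.00, 9.00, 2.2414, 101325.0, 273.15))
```

With these values, about 100 mol of air and a 1 × 10⁻⁶ mol arsenite difference give roughly 10 nmol mol⁻¹.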
In the Ehmert technique (Ehmert 1951, 1952, 1959;Galbally 1969), the sampling solution is neutral buffered and contains iodide and thiosulfate (S₂O₃²⁻). Ozone bubbling through a vessel of the solution reacts with iodide, producing iodine, which converts the thiosulfate to tetrathionate (S₄O₆²⁻). With the Ehmert technique the quantity of thiosulfate in the vessel is determined electrochemically using a coulometric analysis, in which the chemical transformation is equated to electron flow (Ehmert, 1959;Galbally, 1969). The thiosulfate loss (two thiosulfate ions per iodine molecule, and hence per ozone molecule) gives the amount of ozone sampled. Dividing this amount by the air volume sampled gives the ozone concentration. Measurements can be made with air sampling periods as short as 30 minutes.
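For the Ehmert technique the thiosulfate consumed must be halved, since each iodine molecule oxidises two thiosulfate ions. The Python sketch below assumes unit KI stoichiometry, complete capture of ozone in the vessel and ideal-gas air; the function name, flow rate and example values are hypothetical.

```python
R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def ehmert_ozone(thiosulfate_start_mol, thiosulfate_end_mol,
                 flow_l_per_min, minutes, pressure_pa, temperature_k):
    """Ozone mole fraction (nmol/mol) from thiosulfate consumed while sampling.

    O3 + 2 I- + H2O -> I2 + O2 + 2 OH- (unit stoichiometry assumed), then
    I2 + 2 S2O3^2- -> 2 I- + S4O6^2-, so ozone = thiosulfate lost / 2.
    """
    ozone_mol = (thiosulfate_start_mol - thiosulfate_end_mol) / 2.0
    air_m3 = flow_l_per_min * minutes / 1000.0
    air_mol = pressure_pa * air_m3 / (R * temperature_k)
    return ozone_mol / air_mol * 1e9


# Hypothetical 30-minute sample at 1 L/min, 25 degC and 1013.25 hPa, 40 nmol/mol.
air = 101325.0 * 0.03 / (R * 298.15)
loss = 2 * 40e-9 * air
print(loss, ehmert_ozone(1e-5, 1e-5 - loss, 1.0, 30.0, 101325.0, 298.15))
```

In this example the thiosulfate loss is only about 1 × 10⁻⁷ mol, which suggests why a sensitive coulometric determination is used.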
As presented in Table 2, the Ehmert, ECC and UV methods had ratios indistinguishable from 1.0, with an uncertainty of approximately ±10%. Other variations of the Ehmert method involve injection of known amounts of thiosulfate into the reacting solution, which allows continuous or semi-continuous measurements (Paneth and Glückauf 1941;Glückauf et al., 1944;Bowen and Regener 1951;Carbenay and Vassy, 1953;Regener 1959). An ozonesonde based on this method was developed (Kobayashi and Toyama 1966a).
A number of the KI methods accumulate iodine in solution while the air is sampled. The first approach to this is the colorimetric method of measuring ozone, where the iodine produced in the neutral buffered 1% KI solution is subsequently measured spectroscopically at 352 nm and quantified by comparison against iodine standards (Byers and Saltzman 1959;Saltzman and Gilbert 1959a). Saltzman and Gilbert (1959a) demonstrated that the iodide concentration, the pH of the solution and the time delay before measuring the iodine absorbance were all critical parameters. They also demonstrated the effectiveness of SO₂ as a negative interferent in the KI method. There have been multiple comparisons of the colorimetric NBKI method against other methods, as listed in Table 2. The colorimetric NBKI method shows a high bias with respect to UV methods, with some exceptions (e.g. Cherniak and Bryan, 1965). Before the mid-1970s, benchtop UV photometers and chemiluminescent ozone analysers were calibrated to an external calibration source, typically NBKI (e.g. Clements, 1975;Pitts et al., 1976;Torres and Bandy, 1978).
The second approach involves removal of the iodine produced in the solution. In the simplest case the anode and cathode are chemically separated by the steady one-way flow of the sensing solution. The iodine produced from ozone is measured via the current from a platinum cathode downstream of the air mixing zone. The reaction at the cathode is:
$$\mathrm{I_2 + 2e^- \rightarrow 2I^-} \qquad (2)$$
The charge flow in the external circuit is proportional to the ozone reacted. This "coulometric" method is used in the transmogrifier (Brewer and Milford 1960) and its commercial adaptation as the Mast Ozone Meter (Mast and Saunders, 1962), as well as the instrument developed at the Max Planck Institute for Aeronomy (Pruchniewicz 1970, 1973), which will subsequently be described as MPI-Pruch. This "coulometric" method is also used in the Brewer-Mast and Electrochemical Concentration Cell (ECC) ozonesondes and their adaptations for use in surface air. In the case of the ECC sonde, an ion bridge is used instead of relying on the one-way flow of the sensing solution. The Mast ozone meter has undergone testing in addition to that in Table 2 (e.g. Gudiksen et al., 1966;Potter and Duckworth, 1965), indicating that under field conditions it gave responses of 50-70% of the NBKI result. A correction factor of 0.8 is used here, corresponding to the Mast Ozone Meter ratio to the UV method based on information in Table 2 and the references above.
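The coulometric principle can be made quantitative. Assuming one I₂ per ozone molecule and two electrons per I₂ at the cathode, the ozone flux into the cell is I/(2F), and the ozone partial pressure follows from the air flow through the ideal gas law. The Python sketch below assumes 100% conversion efficiency unless told otherwise and expresses the pump flow as seconds per 100 ml; function and parameter names are hypothetical. Combining R/(2F) with these units gives a constant of about 4.31 × 10⁻⁴ mPa per (µA K s), close to the constant familiar from ozonesonde processing formulae.

```python
R = 8.314462618        # molar gas constant, J mol^-1 K^-1
FARADAY = 96485.33212  # Faraday constant, C mol^-1


def ecc_partial_pressure_mpa(current_ua, pump_temp_k, seconds_per_100ml,
                             background_ua=0.0, efficiency=1.0):
    """Ozone partial pressure (mPa) from a coulometric KI cell current.

    One I2 per ozone molecule and two electrons per I2 at the cathode give an
    ozone flux of I / (2F); dividing by the volumetric air flow and
    multiplying by R*T gives the partial pressure (ideal gas).
    """
    ozone_mol_per_s = (current_ua - background_ua) * 1e-6 / (2.0 * FARADAY)
    flow_m3_per_s = 1e-4 / seconds_per_100ml  # 100 ml = 1e-4 m^3
    p_pa = ozone_mol_per_s / flow_m3_per_s * R * pump_temp_k / efficiency
    return p_pa * 1e3


# Combined constant R / (2F) in mPa per (uA K s):
print(R / (2.0 * FARADAY) * 10.0)
# Hypothetical reading: 1.5 uA, pump at 300 K, 28 s per 100 ml.
print(ecc_partial_pressure_mpa(1.5, 300.0, 28.0))
```

With these values the sketch gives about 5.43 mPa, roughly 54 nmol mol⁻¹ at 1000 hPa.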
There was widespread use of the surface ECC methods in the 1970s (Oltmans, 1981) and there are comparisons against the UV and Ehmert methods, see Table 2. The HP-KI method, used for surface ozone observations at Hohenpeissenberg Observatory between 1971 and 1986, is a variation of the ECC method as developed by Kobayashi and Toyama (1966b). The HP-KI method was compared with the Ehmert method via an ozone generator supplied by V. H. Regener and gives a ratio of 1.00 ± 0.02 (Attmannspacher and Hartmannsgruber 1982).
The MPI-Pruch analyser (Pruchniewicz 1973) is another KI method relying on one-way flow of the sensing solution. Comparisons give 1.0 ± 0.05 for ambient ozone measurements with the MPI-Pruch analyser and the Ehmert method at 4 sites, and also for laboratory comparisons against an ozone generator supplied by V. H. Regener (Pruchniewicz, 1973). However, new evaluations conducted for TOAR-Observations cast doubt on the reliability of this method in the field. A comparison of 15 months of overlapping observations of ambient surface ozone by the MPI-Pruch and HP-KI methods at the Hohenpeissenberg Observatory from January 1971 to May 1972 shows that the MPI-Pruch method indicates 0.50 ± 0.04 of the ozone level given by the HP-KI method. Similar results were obtained comparing an MPI-Pruch analyser at Zugspitze with nearby ozonesonde results from the same height in the atmosphere. These comparisons are in sharp contrast with previously described comparisons of the HP-KI method and the MPI-Pruch method (Pruchniewicz, 1973;Attmannspacher and Hartmannsgruber, 1982). One possible explanation is that the instruments, due to some unknown factor, were far more variable in their efficiency at measuring ozone than the initial tests indicated. Given the information available, no conclusion about the cause can be drawn. This is discussed further in the Supplemental Material (Text S-3).

The Cauer ozone measurement method
The Cauer technique (Cauer 1935, 1951) involves bubbling a large volume of air (100 to 300 litres) through a 10 ml solution buffered by sodium acetate and containing 50 µg iodine as KI. The ozone in the air converts the iodide to iodine, and the iodine evaporates into the air stream. At the end of the sampling the remaining iodide is determined, and the loss of iodide is equated to the quantity of ozone in the air sampled. When divided by the air volume sampled, this gives the ozone concentration. Cauer (1951) writes "For experienced chemists, this analysis requires 20 minutes, but necessitates great care, and is difficult for non-chemists." The Cauer method was used in multiple studies between 1935 and 1955 and in the national network of the then East Germany from 1952 until 1982 (Cauer 1951;Teichert 1955;Warmbt 1964;Feister and Warmbt 1987). It was compared with the Ehmert method by Warmbt (1964). An initial study, with the two systems presumably in their normal operating conditions, gives a Cauer/Ehmert ratio of 0.66. A more intensive study, standardising various components and including blank corrections, gives a Cauer/Ehmert ratio of 0.90. However, at low ozone concentrations, when the Cauer method measured less than 5 nmol mol⁻¹, the Ehmert method often measured 10-20 nmol mol⁻¹ (Warmbt, 1964). These results cause considerable uncertainty concerning what corrections should be applied to Cauer data from different measurement periods. Here we use a correction factor corresponding to the 0.9 Cauer/Ehmert ratio in Table 2 combined with the agreement of the Ehmert method with the UV method.
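Corrections such as those applied here to Cauer data (via the Ehmert method) or to Mast data chain ratios through an intermediate method. The Python sketch below assumes the two intercomparisons are independent, so relative uncertainties add in quadrature; function names are hypothetical. The ±0.05 on the Cauer/Ehmert ratio in the example is a hypothetical value, since no uncertainty is quoted above, while ±0.10 for Ehmert/UV follows the approximate ±10% given for the Table 2 comparisons.

```python
import math


def chain_ratio(r_ab, u_ab, r_bc, u_bc):
    """Combine ratios A/B and B/C (1-sigma uncertainties) into A/C.

    Relative uncertainties are added in quadrature, which assumes the two
    intercomparisons are independent.
    """
    r_ac = r_ab * r_bc
    u_ac = r_ac * math.hypot(u_ab / r_ab, u_bc / r_bc)
    return r_ac, u_ac


def to_uv_equivalent(values, method_to_uv_ratio):
    """Put historical readings on the UV scale by dividing by method/UV."""
    return [v / method_to_uv_ratio for v in values]


# Cauer/Ehmert = 0.90 (the +/-0.05 uncertainty is hypothetical);
# Ehmert/UV = 1.00 +/- 0.10.
cauer_uv, cauer_uv_unc = chain_ratio(0.90, 0.05, 1.00, 0.10)
print(cauer_uv, round(cauer_uv_unc, 3))
print(to_uv_equivalent([18.0], cauer_uv))
```

With the 0.9 ratio, a Cauer value of 18 nmol mol⁻¹ becomes 20 nmol mol⁻¹ on the UV scale, with a combined relative uncertainty of roughly 11% under these assumptions.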

Cryotrapping
An unusual technique, used by only one group, was the cryotrapping of ozone on silica gel, with subsequent distillation to remove NO₂ and analysis for ozone by KI and UV methods (Edgar and Paneth, 1941a). This provided a sound measurement method, and atmospheric ozone concentrations were obtained at sites in the UK on multiple days (Edgar and Paneth, 1941b).

Other methods of measuring tropospheric ozone
Early ultraviolet absorption methods
The ultraviolet absorption method for measuring ozone is based on the strong optical absorption of ozone in the Hartley and Huggins (320-360 nm) bands. Measurements involve an artificial light source (a mercury or hydrogen lamp) and a prism-based spectrograph with a detector, typically a photographic plate, that records the intensity of light at multiple wavelengths. In early measurements, the source and detector were separated by a long atmospheric path of hundreds of metres to several kilometres. The long distances were needed to obtain adequate attenuation of the UV radiation, owing to the limited sensitivity of the detectors. The ozone concentration was calculated from the ratio of the intensities measured, at night, at long and short distances from the light source (Fabry, 1950). These measurements were difficult, and the associated uncertainties (e.g. Kay, 1953) do not appear to have been quantified.
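The two-distance calculation follows from the Beer-Lambert law: the ratio of intensities at the short and long distances cancels the lamp output and leaves only the absorption over the extra path. The sketch below assumes, as an idealisation, that beam spreading, Rayleigh and aerosol extinction have already been removed from the ratio; the function names and the illustrative numbers (a 1 km extra path and a cross-section of about 1.15 × 10⁻¹⁷ cm², roughly the accepted value near 253.7 nm) are hypothetical, not taken from Fabry (1950).

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J K^-1


def ozone_number_density(i_short, i_long, path_short_m, path_long_m, sigma_cm2):
    """Ozone number density (molecules cm^-3) from the two-distance method.

    Beer-Lambert: I_long / I_short = exp(-sigma * n * (L_long - L_short)).
    Assumes identical lamp output for both readings and that other
    path-dependent losses have already been removed from the ratio.
    """
    delta_cm = (path_long_m - path_short_m) * 100.0
    return math.log(i_short / i_long) / (sigma_cm2 * delta_cm)


def number_density_to_nmol_mol(n_cm3, temp_k=288.15, pressure_pa=101325.0):
    """Convert molecules cm^-3 to a mole fraction in nmol/mol."""
    n_air_cm3 = pressure_pa / (K_B * temp_k) * 1e-6
    return n_cm3 / n_air_cm3 * 1e9


# Illustrative numbers only: 1 km extra path, sigma ~1.15e-17 cm^2
n = ozone_number_density(1.0, 0.42, 0.0, 1000.0, 1.15e-17)
print(f"{number_density_to_nmol_mol(n):.1f} nmol/mol")
```

Because the retrieved amount scales as 1/σ, any error in the cross-section used maps directly into the ozone value, which is why the cross-section history matters for these early data.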
Because the early UV methods are in principle the same as current methods, the main issue in comparing ozone data is identifying the wavelengths and absorption cross-sections used. The values used for the ozone absorption cross-sections in the Hartley and Huggins bands have changed over time, particularly before the determination by Hearn (1961).
To illustrate these variations, Table 3 extends in time, and expands in wavelength, the information presented in the final table of Hearn (1961). Ozone absorption cross-sections are presented in units of cm² molecule⁻¹, with earlier units converted. Between 1913 and 2015 there is a 20% range in absorption cross-sections at 253.7 nm, and a range that varies from <1% to nearly 30% at wavelengths up to 334.2 nm. These wavelengths cover the range used to measure tropospheric ozone by the UV method between 1929 and 1960.
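Correcting an early UV value to a newer cross-section is a simple rescaling: the observed optical depth τ = σnL is fixed, so the retrieved amount n is proportional to 1/σ. A minimal sketch, with a hypothetical function name and hypothetical values (Table 3 itself is not reproduced here); where a measurement used the difference between two wavelengths, the same reasoning would apply to the cross-section difference rather than to a single σ.

```python
def rescale_for_cross_section(ozone_reported, sigma_used, sigma_current):
    """Rescale a UV-derived ozone amount to a different absorption cross-section.

    The measured optical depth tau = sigma * n * L is fixed by the observation,
    so the retrieved amount n is proportional to 1 / sigma.
    """
    if sigma_used <= 0 or sigma_current <= 0:
        raise ValueError("cross-sections must be positive")
    return ozone_reported * sigma_used / sigma_current


# Hypothetical case: the original analysis used a cross-section 10% larger
# than the current value, so the reported ozone is too low by that factor.
print(f"{rescale_for_cross_section(20.0, sigma_used=1.10, sigma_current=1.00):.1f}")
```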
Early UV ozone measurements discussed in the following sections are corrected to the current standard absorption cross-sections (Hearn, 1961), following Table 3.

Recent ultraviolet absorption method (1970-present)
In the early 1970s there was a revolution in tropospheric ozone measurements. A newly developed UV absorption photometer (Bowman and Horak, 1972) provided a stable, low-maintenance, continuous ozone analyser based on an absolute method of measuring ozone, with an incremental sensitivity of 1 nmol mol⁻¹. UV photometers measure the absorption of UV light in the Hartley band (220-310 nm), where ozone is a strong absorber. A mercury lamp emitting at 253.7 nm is usually used as the light source.
UV absorption photometers are currently the most commonly used instruments for in-situ ozone observations. The method satisfies the key requirements of analyser performance, such as signal-to-noise ratio, detection limit, stability of sensitivity, and negligible interferences when measuring in clean air. Moreover, it requires little maintenance during operation. Consequently, it is also the recommended technique for continuous ozone observations in ambient air, e.g. in the WMO Global Atmosphere Watch Programme, in the United States (US EPA, 2013), Europe (European Union, 2012), and India (Central Pollution Control Board, 2009). Data quality and consistency of records within and across networks have improved over time, as shown by comparisons of time series (Logan et al., 2012; Parrish et al., 2012) and by station audits by the World Calibration Centre for Surface Ozone, Carbon Monoxide, Methane and Carbon Dioxide (WCC-Empa) (Buchmann et al., 2009). Of the more than 9000 sites in the TOAR database that operated for 3 years or more between approximately 1975 and 2015, the great majority of in-situ records were obtained with UV absorption photometers.
A comprehensive uncertainty evaluation was undertaken by the WMO in the Guidelines for Continuous Measurement of Ozone in the Troposphere. There, the total expanded (95% confidence) measurement uncertainty for UV absorption photometers was estimated as a function of the ozone reading of the analyser, given in nmol mol⁻¹. Thus, at a mean level of 30 nmol mol⁻¹ the total expanded uncertainty (95% confidence) is ±1.7 nmol mol⁻¹. Figure 1 compares this bottom-up uncertainty analysis with field results from 559 calibrations of UV absorption photometers in the Swiss National Air Pollution Monitoring Network. If the assumptions underlying the uncertainty estimate are correct, 95% of the calibration results shown in Figure 1 should lie within the central region bounded by the grey shaded area. In fact, less than 1.5% of the observed calibration results lie outside the estimated uncertainty range, so for this extensive dataset of field calibrations the results fall well within the uncertainty estimate (Equation 4). This reflects the robustness of properly conducted ozone measurements with the UV photometric method.
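The consistency check behind Figure 1 amounts to counting how many calibration deviations fall outside ±U at each reading and comparing that fraction with the nominal 5%. Because the WMO uncertainty expression is not reproduced in this section, the sketch below takes U as a caller-supplied function; the constant 1.7 nmol mol⁻¹ used in the example is only a placeholder (the value quoted at 30 nmol mol⁻¹), and the calibration points are hypothetical.

```python
from typing import Callable, Sequence


def fraction_outside(readings: Sequence[float],
                     deviations: Sequence[float],
                     expanded_uncertainty: Callable[[float], float]) -> float:
    """Fraction of calibration points whose deviation exceeds +/-U(reading).

    readings: analyser readings (nmol/mol) at each calibration point.
    deviations: analyser minus reference (nmol/mol) at the same points.
    expanded_uncertainty: function returning U (95 %) for a given reading.
    """
    if len(readings) != len(deviations) or not readings:
        raise ValueError("need equal-length, non-empty inputs")
    outside = sum(1 for x, d in zip(readings, deviations)
                  if abs(d) > expanded_uncertainty(x))
    return outside / len(readings)


def placeholder_uncertainty(reading: float) -> float:
    """Placeholder U: constant 1.7 nmol/mol, NOT the WMO expression."""
    return 1.7


frac = fraction_outside([30.0, 31.0, 29.5, 30.2], [0.4, -1.1, 0.9, 2.3],
                        placeholder_uncertainty)
print(f"{frac:.1%} outside (about 5% expected if U is a 95% interval)")
```

With hundreds of points, a fraction well below 5%, as found for the Swiss network, indicates that the bottom-up estimate is, if anything, conservative.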
The foregoing analysis does not cover all sources of uncertainty; it omits those associated with, for example, inlet losses and potential interferences. The uncertainty of the absorption cross-section is not included, because the calibrations are performed with primary standard analysers based on the same UV absorption cross-section. For the purposes of the subsequent analyses, the total expanded uncertainty (95% confidence) in modern ozone measurements at a mean level of 30 nmol mol⁻¹ is taken to be <2 nmol mol⁻¹.
There are multiple chemiluminescent methods for measuring tropospheric ozone, of which the methods involving the reaction of ozone with ethylene, rhodamine B or nitric oxide are relevant to this paper.
The ethylene chemiluminescent ozone analyser (Warren and Babcock, 1970) is based on the reaction of ozone with ethylene, which produces electronically excited formaldehyde. As the formaldehyde returns to the ground state, it emits light in a band centred at 430 nm, which is detected by a photomultiplier. The count rate varies linearly with ozone concentration, provided the cell pressure and temperature, the gain of the detector (the photomultiplier tube), the ethylene and sample flows, and the composition of the other components of the air sample are unchanged. The ethylene chemiluminescent ozone analyser is a sensitive (~1 nmol mol⁻¹) and stable instrument with a fast response (~1 s) and, importantly, is not subject to SO₂ or NO₂ interference. The analyser requires regular calibration against a standard ozone analyser, which in the 1970s was either a KI or a UV instrument. When calibrated, the ethylene chemiluminescent analyser gives results that closely match those of the ultraviolet method, as seen in the intercomparisons at Hohenpeissenberg, Germany (Attmannspacher and Hartmannsgruber, 1982) and Cape Grim, Australia (Elsworth and Galbally, 1984). These analysers were widely used in North America in the 1970s (Heidorn and Yap, 1986), following the discovery that tropospheric ozone was damaging tobacco crops (Macdowall et al., 1964; Mukammal, 1965). Although not in common use today, chemiluminescence is listed in the TOAR surface ozone database as the method of ozone measurement at 627 of more than 9000 sites.
The rhodamine-B solid chemiluminescent ozone analyser was developed as an ozonesonde (Regener, 1960), which was calibrated both against a surface-based analyser, traceable to a UV measurement, prior to launch and against the total column ozone measured by a UV (Dobson) instrument at the same location near the time of the ozonesonde flight. The system was also developed into aircraft and surface-based analysers (Regener, 1964). The analyser was regularly calibrated with an internal ozone generator that in turn had been calibrated against UV-based ozone measurements (Regener, 1964). This chemiluminescent ozone analyser was used in the 1960s in the USA and Antarctica. The Regener ozonesonde was subject to large changes in sensitivity due to the effects of ozone, humidity and temperature (Regener, 1964; Chatfield and Harrison, 1977; Hering and Dütsch, 1965), and the surface-based analyser may have had similar deficiencies.

Differential Optical Absorption Spectrometry (DOAS)
The DOAS technique measures the spectrally resolved absorption features in a beam of light returned by a retro-reflector located at some distance from the instrument (Platt et al., 1979). A telescope is used to send and receive a beam of white light, typically from a xenon arc lamp. A photodiode array detector is used for simultaneous detection of the UV spectrum. The absorption features measured in the returning light beam are a superposition of all the absorption bands of the molecules present in the beam path. The concentrations of ozone and other absorbing species are extracted using well-characterised absorption cross-section data. The precision of a DOAS system for O₃ is estimated at 3%, owing to uncertainties in the absorption cross-section (~1%) and stray light in the spectrometer (~2-3%), while noise and unexplained spectral structures determine biases and detection limits (2-4 nmol mol⁻¹), which scale inversely with the path length (Stutz and Platt, 1996). A field comparison of DOAS with UV photometric analysers found differences of ±7%, attributed to spatial and temporal atmospheric inhomogeneities, including the fact that the DOAS beam followed a path at a higher altitude than the sampling point of the UV instrument (Williams et al., 2006). In the TOAR database DOAS is listed as the method of ozone measurement at 39 sites.
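The extraction step can be illustrated as a linear least-squares fit of the differential optical depth to the differential cross-sections of all absorbers at once, which is how overlapping bands are separated. This is a minimal sketch under simplifying assumptions: the broadband part of the spectrum is assumed to have been removed beforehand, instrument slit effects are ignored, and the sinusoidal "cross-sections" and number densities in the example are synthetic stand-ins, not real spectra. `numpy.linalg.lstsq` is the only library call used.

```python
import numpy as np


def doas_fit(diff_optical_depth, diff_cross_sections, path_length_cm):
    """Least-squares DOAS retrieval of path-averaged number densities.

    diff_optical_depth: shape (n_wavelengths,), ln(I0/I) with the broadband
        (slowly varying) part already removed.
    diff_cross_sections: shape (n_wavelengths, n_species), differential
        absorption cross-sections in cm^2 on the same wavelength grid.
    path_length_cm: total light path (out and back), in cm.
    Returns number densities (molecules cm^-3), one per species.
    """
    design = np.asarray(diff_cross_sections, dtype=float) * path_length_cm
    tau = np.asarray(diff_optical_depth, dtype=float)
    densities, *_ = np.linalg.lstsq(design, tau, rcond=None)
    return densities


# Synthetic example: two absorbers with overlapping (stand-in) structure
wl = np.linspace(300.0, 340.0, 200)                  # wavelength, nm
sigma = np.column_stack([1e-19 * np.sin(wl / 2.0),   # stand-in for O3
                         5e-20 * np.cos(wl / 1.3)])  # stand-in for a second absorber
true_n = np.array([7.5e11, 2.0e11])
path = 2 * 1.0e5                                     # 1 km to reflector and back
tau = sigma @ true_n * path
print(doas_fit(tau, sigma, path))
```

Since the fitted densities are divided by the path length, a fixed noise level in optical depth translates into a detection limit that shrinks as the path grows, consistent with the inverse scaling noted above.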

Ozone measurements in surface air 1870s-1970s
For the purpose of reconstructing the historical surface ozone record, an extensive literature search was conducted, which unearthed several data sets that had not previously been analysed together in a single study. As discussed in Section 2, the approach used in TOAR-Observations is to reconstruct regional or zonal average ozone values based on all available historical data sets, rather than relying on individual data sets as in previous studies. The application of the four data selection criteria to the historical reconstruction is described below.
Criterion 1: all measurements made with the Schönbein and related filter paper methods are rejected; the early UV measurements are corrected to the currently accepted values of absorption cross-sections (Hearn, 1961); and the observations made with the Mast Ozone Meter and related transmogrifier, and with the Cauer method, have been adjusted following Table 2.
Criterion 2: It is known that SO₂ causes quantitative negative interference with the KI measurement methods (Albert-Lévy, 1907; Glückauf, 1941, 1944; Saltzman and Gilbert, 1959a), with a reported stoichiometry of 1.0 (Schenkel and Broder, 1982; Volz and Kley, 1988). SO₂ is present in the background atmosphere at very low concentrations (<<1 nmol mol⁻¹; Seinfeld and Pandis, 2000) and is thus not a significant interferent there. However, in urban areas in Europe following the industrial revolution, coal burning was widespread (Mylona, 1996; Smith et al., 2011) and the resulting sulfur-based acid pollution is widely documented (Smith, 1872; Ladureau, 1883; Witz, 1885). This criterion can lead to some ambiguity, as "clean" sites such as Arkona, on the Baltic coast, and the hilltop site of Hohenpeissenberg in southern Germany are subject to modest levels of SO₂ interference (Feister and Warmbt, 1987; Low et al., 1990, 1991). KI-based ozone measurements therefore need to be scrutinised for possible SO₂ interference before being accepted. The interference should be quantified and small compared with the ozone signal, and the uncertainty introduced into the ozone reading by the correction should ideally be ≤5%.
Criterion 3: The ozone sampled in the surface layer should be representative of the unpolluted planetary boundary layer. At times of good turbulent mixing in the lower atmosphere, due either to convection driven by solar radiation or to mechanical turbulence from wind shear, vertical gradients of ozone mole fraction diminish and ozone levels in near-surface air are representative of the planetary boundary layer (Auer, 1939; Glückauf, 1944; Teichert, 1955; Galbally, 1968, 1972; Garland and Derwent, 1979; Fabian and Pruchniewicz, 1977; Galbally et al., 1986).
At rural sites with plant and soil surfaces on flat plains, away from fresh anthropogenic sources of NO, a night-time decrease of ozone occurs when turbulence diminishes so that the rate of ozone supply from above is less than the rate of ozone destruction (i.e. dry deposition) at the underlying surface (Auer, 1939; Glückauf, 1944; Teichert, 1955; Galbally, 1968, 1971, 1972; Garland and Derwent, 1979; Fabian and Pruchniewicz, 1977; Galbally et al., 1986). Consequently, night-time or 24-h average measurements of surface ozone at such continental sites are not representative of the well-mixed boundary layer. Over ice, snow and water surfaces, this decrease under weak mixing does not occur, because the rate of ozone loss at the underlying surface is much smaller. The diurnal cycle is different at mountain-top sites in steep terrain. At night the slopes cool and typically a downslope wind develops that entrains air from above, as described for Mauna Loa (Price and Pales, 1963). During the day the slopes warm and, away from urban pollution, an upslope wind brings air from below that has been depleted in ozone. The data selected in this study take topography and surface characteristics into account, to ensure that only data from well-mixed conditions, or conditions reasonably assumed to be well mixed, are selected.
Criterion 4: Published data may be suspect for a variety of reasons, including inconsistency with other observations, or artifacts or outliers that suggest instrument problems. The outstanding example is Pring (1914), whose measurements appear to be of excellent quality and well documented, except that they differ by a factor of 100 from current measurements. As Fabry (1950) said, the results are absurd. Serious artifacts and inconsistencies associated with individual datasets are discussed in the Supplemental Material (Text S-3).
For every site, the method used, the possible pollution sources, the topography, surface cover and prevailing meteorology, and the credibility of the results have been examined to assess the suitability of the observations for inclusion in the historical record. One excluded site is nonetheless discussed explicitly below: Montsouris, Paris, which has been used as a central data set in a number of previous historical reconstructions of tropospheric ozone. For the other historical data discussed below, the method is valid in all cases and, with one exception, the sites are presumed free of interfering pollutants and representative of the well-mixed boundary layer. In a few cases some of the data are excluded on credibility grounds, and some are accepted with caution; the reasons for these decisions are discussed. Before discussing the available historical data that passed the selection criteria, it is worth repeating the cautionary note from Section 2: for comparison with current observations, these data should be treated as a group, because there are substantial unknown uncertainties associated with the bias corrections of each individual set of observations.
Some other features of the data are important to note. Some studies involve measurements at multiple sites; others have long records at one site that have been broken into distinct periods. Thus the number of published studies and the number of data sets do not correspond. The 60 data sets accepted into the record, with their site names, locations, data references etc., are presented in Tables 4, 5 and 6. From the 1870s to 1950, most of the available measurements were made using KI techniques, with a small number of spectroscopic measurements by UV absorption. The studies were occasional scientific experiments, usually of limited duration and initially in central and northern Europe. From 1950 to the 1970s the coverage became global, and measurements at a given location extended to a year or more.
The standard deviations presented here are calculated differently for the periods 1890-1950 and 1950-1970 because of the paucity of data in the earlier period. For 1890-1950 the standard deviation is calculated from daily ozone values when there are 7 or more days of data. For 1950-1970 it is calculated from the monthly mean ozone values over an annual cycle or longer when there is at least one year of data. For more information see the Supplemental Material (Text S-5).
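The two standard-deviation rules can be written as one small function. The details the text leaves open are filled here with explicit assumptions: data sets before 1950 use the daily rule, and "at least one year of data" is interpreted as 12 or more monthly means. The function name `historical_sd` is hypothetical.

```python
from statistics import stdev


def historical_sd(year, daily_values=None, monthly_means=None):
    """Standard deviation following the two rules described in the text.

    Assumptions (not specified in the source): data sets before 1950 use the
    daily rule; "at least one year" means 12 or more monthly means.
    Returns None when the data do not meet the minimum requirement.
    """
    if year < 1950:
        if daily_values is None or len(daily_values) < 7:
            return None
        return stdev(daily_values)
    if monthly_means is None or len(monthly_means) < 12:
        return None
    return stdev(monthly_means)


print(historical_sd(1938, daily_values=[22.0, 25.0, 27.0, 30.0, 24.0, 21.0, 26.0]))
print(historical_sd(1938, daily_values=[22.0, 25.0]))  # too few days -> None
```

The daily-value spread includes synoptic variability, whereas the monthly-mean spread is dominated by the seasonal cycle, so the two kinds of standard deviation are not directly comparable.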
The observations are grouped into the tropical (0°-30°), temperate (30°-60°) and polar (60°-90°) regions of each hemisphere. They are presented in Tables 4-6 and discussed in the following text for different periods. Ozone in each region is summarised using a weighted mean, with weights proportional to the number of days with observations at each site.
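The zonal grouping and day-weighted mean can be sketched as follows. Boundary handling is an assumption not stated in the text (30° and 60° are assigned to the poleward zone, and the equator to the Northern Hemisphere), the function names are hypothetical, and the example site summaries are invented, not values from Tables 4-6.

```python
def zone(latitude_deg):
    """Assign a latitude to one of the six zones used here.

    Assumption: 30 and 60 degrees go to the poleward zone, and the equator
    is counted as Northern.
    """
    hemisphere = "Northern" if latitude_deg >= 0 else "Southern"
    a = abs(latitude_deg)
    if a < 30:
        band = "tropical"
    elif a < 60:
        band = "temperate"
    else:
        band = "polar"
    return f"{hemisphere} {band}"


def weighted_zonal_mean(sites):
    """Day-weighted mean ozone per zone.

    sites: iterable of (latitude_deg, mean_ozone_nmol_mol, n_days).
    """
    totals = {}
    for lat, ozone, days in sites:
        key = zone(lat)
        weighted_sum, day_sum = totals.get(key, (0.0, 0))
        totals[key] = (weighted_sum + ozone * days, day_sum + days)
    return {k: s / d for k, (s, d) in totals.items() if d > 0}


# Hypothetical site summaries: (latitude, mean ozone, days of observation)
example = [(46.5, 31.0, 5), (46.6, 18.0, 5), (70.5, 23.0, 225)]
print(weighted_zonal_mean(example))
```

Weighting by days means a single long record can dominate a zone, which is one reason the sensitivity of the results to large networks is tested separately.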

The Montsouris Observatory ozone measurements 1876-1910
The ozone measurements made from 1876 to 1910 at the Municipal Observatory in the Parc de Montsouris, on the southern edge of Paris (Albert-Lévy, 1877), are the oldest known ozone measurements made by active sampling. As noted in Section 2.1.1, they are pivotal to the conclusion of previous work that ozone in pre-industrial times was much lower than at present (e.g. Marenco et al., 1994). Hence the Montsouris observations are examined in detail here.
The technique used was the KI-arsenite method, which is related to the UV method as described in Section 2.1.2 and is a valid method. Volz and Kley (1988) examined daily measurements of ozone at Montsouris in 1876-86 and 1905 and correlated the results with wind direction. The area to the southwest of Montsouris had no significant urban centres in 1876-1905, and Volz and Kley concluded that air from the southwest would be free of urban air pollutants that might interfere with the analysis. They found that with wind from the southwest the average ozone levels measured in 1876-86 and 1905 were approximately 8 and 10 nmol mol⁻¹, respectively, and that with wind from other directions the ozone levels were lower by 2 and 5 nmol mol⁻¹, respectively. The lower ozone levels observed when the wind was from Paris were attributed to the presence of SO₂ in the Paris urban plume. The corrected ozone data gave an average ozone level of 11 nmol mol⁻¹ over the period 1876-1910. This is much lower than the levels found at clean mid-latitude sites today.
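The wind-sector stratification used by Volz and Kley (1988) can be sketched as a split of daily ozone values by wind direction. The default 180°-270° "southwest" sector is an assumption (the exact sector they used is not given here), the function names are hypothetical, and the daily pairs in the example are invented.

```python
from statistics import mean


def in_sector(direction_deg, start_deg, end_deg):
    """True if a wind direction lies in [start, end), allowing wrap past 360."""
    d = direction_deg % 360.0
    start, end = start_deg % 360.0, end_deg % 360.0
    if start <= end:
        return start <= d < end
    return d >= start or d < end


def sector_means(obs, start_deg=180.0, end_deg=270.0):
    """Mean ozone for winds inside and outside a sector.

    obs: iterable of (wind_direction_deg, ozone_nmol_mol).
    The default 180-270 deg "southwest" sector is an assumption.
    """
    obs = list(obs)
    inside = [o for wd, o in obs if in_sector(wd, start_deg, end_deg)]
    outside = [o for wd, o in obs if not in_sector(wd, start_deg, end_deg)]
    return (mean(inside) if inside else None,
            mean(outside) if outside else None)


# Hypothetical daily (wind direction, ozone) pairs
print(sector_means([(225.0, 10.0), (200.0, 8.0), (45.0, 4.0), (90.0, 6.0)]))
```

A difference between the two means indicates an effect of the upwind source region, but it cannot by itself show that the "clean" sector is free of interference, which is the central difficulty with the Montsouris data.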
Observers in the 19th century were aware that reducing gases could interfere with the KI reaction, and, as noted by Volz and Kley (1988), an attempt was made to quantify these interferences. Average values of "gaz réducteurs" for 1905-07 correspond to ~3 nmol mol⁻¹, with weekly values as high as 16 nmol mol⁻¹ (Albert-Lévy, 1907, 1908). However, exactly how to interpret these measurements remains uncertain, as SO₂ does not react strongly with KI in the absence of ozone. The period from the 1870s to the 1900s was one of rapid change in Paris. During 1870-1880, Paris was a city of 2 million people, and coal supplied 58% and wood 42% of the city's total energy needs (Kim and Barles, 2012). Coal burning releases SO₂, and two approaches are taken to estimate SO₂ levels in Paris at this time. Firstly, measurements of ambient SO₂ in Paris began in the 1950s, and records of coal use in the city are available from 1875. These concentration and emission data have been combined to estimate SO₂ levels of 55 nmol mol⁻¹ in Paris in the early 20th century (Ionescu et al., 2012). However, this approach neglects activities outside the city boundary, where building construction and the gasworks that supplied gas to Paris were located (Kim and Barles, 2012; Kesztembaum and Rosenthal, 2014). Secondly, sulphate in rainwater was measured at Montsouris, with an average of 13.9 mg l⁻¹ (range 3.5-37.0 mg l⁻¹) as SO₃ (Albert-Lévy, 1907, 1908). Sulphate in rainwater is closely associated with atmospheric sulphur dioxide; such sulphate values are typical of highly polluted areas and correspond to an SO₂ level of ~25-75 nmol mol⁻¹ (e.g. Sequeira, 1975; Davies, 1979; Aas et al., 2007; Gonçalves et al., 2007). Both approaches indicate that SO₂ interference would have biased the Montsouris measurements low.
The inferred SO₂ levels suggest a large degree of interference (Section 2.1), but in the absence of actual measurements there is insufficient knowledge to quantify the problem.
Other measurements at the Observatory (Albert-Lévy, 1877; Hartley, 1881) indicate an average of 12 nmol mol⁻¹ of oxides of nitrogen (measured as nitric acid) during this period. This measured value is lower than the estimate of 28 nmol mol⁻¹ based on coal and other fuel use in the early 20th century (Ionescu et al., 2012). However, it compares well with values of ~8 nmol mol⁻¹ found in London by Reynolds (1930) and Edgar and Paneth (1941b). Because nitrogen oxides are emitted as both NO and NO₂, they can interact with O₃ in the gas phase, through NO titration removing ozone and through NO₂ photolysis producing O₃, particularly if reactive organic compounds are present. NO₂ can also act as a positive interferent in the KI ozone method. It is therefore not immediately apparent whether these nitrogen oxides increased or decreased the measured ozone values.
Paris at the end of the 19th century was home to large numbers of horses and dairy cattle (Barles, 2012). Measurements of NH₃ were also made at the Observatory (Hartley, 1881; Albert-Lévy, 1903); an average of 28 nmol mol⁻¹ is reported for the period 1883-1901. As already discussed, NH₃ is a negative interferent in the KI ozone measurement method, although the stoichiometry of its interference with the arsenite method is not known.
Overall, the Montsouris observations fail Criterion 2, which requires that the likelihood of significant contamination of the ozone measurements by interfering pollutants in the sampled atmosphere be low. Correction for SO₂ and other interferences is not feasible, owing to their estimated magnitude, the lack of a full understanding of the interference, and the absence of reliable atmospheric observations of the interfering compounds at Montsouris at that time.
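For a site where the interferent was actually measured, which is what was lacking at Montsouris, the Criterion 2 test can be sketched as a correction plus an acceptance check. This is an illustration under assumptions: the interference is taken as purely negative with the 1:1 stoichiometry quoted earlier, uncertainties are combined in quadrature as if independent, the function names are hypothetical, and the example values are invented.

```python
import math


def correct_ki_for_so2(o3_measured, so2, so2_uncertainty,
                       stoichiometry=1.0, stoichiometry_uncertainty=0.0):
    """Add back the ozone consumed by SO2 in a KI measurement.

    Assumes each SO2 molecule removes `stoichiometry` ozone equivalents and
    that the uncertainty terms are independent. All values in nmol/mol.
    Returns (corrected ozone, uncertainty of the correction).
    """
    corrected = o3_measured + stoichiometry * so2
    u = math.hypot(stoichiometry * so2_uncertainty,
                   so2 * stoichiometry_uncertainty)
    return corrected, u


def meets_criterion_2(corrected, u, max_relative=0.05):
    """Accept if the correction uncertainty is at most 5 % of the ozone value."""
    return u / corrected <= max_relative


# Hypothetical clean site: little SO2, small correction -> accepted
print(meets_criterion_2(*correct_ki_for_so2(28.0, 2.0, 0.5)))
# Hypothetical urban site: SO2 comparable to ozone -> rejected
print(meets_criterion_2(*correct_ki_for_so2(20.0, 30.0, 15.0)))
```

When the SO₂ term is comparable to the ozone signal, as inferred for Paris, even a modest relative uncertainty in the SO₂ estimate pushes the correction uncertainty far beyond 5%.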
Criterion 3 asks whether the measurements are representative of the well-mixed boundary layer. The low sensitivity of the KI-arsenite method required 24-hour averages. The sampling was from a balcony of the Observatory building, 5 m above the ground. Given a typical diurnal cycle due to ozone dry deposition, as discussed earlier, 24-hour averages at a clean-air site would be biased low, at 0.8 of the daytime average and 0.7 of the daytime maximum. Furthermore, wind speed and direction observations in Paris between March and October 1890-1896 show an average diurnal pattern of northeasterly winds at night between 2000 and 0400 h and southwesterly winds in the afternoon between 1100 and 1600 h (Angot, late 1890s). These overnight winds would have brought SO₂-enriched air to the Observatory, causing chemical interference with the measurements, while cleaner air in the afternoon would have reduced the cumulative SO₂ effect.
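The low bias of 24-hour sampling can be quantified from an hourly diurnal cycle by comparing the 24-hour mean with the daytime mean and maximum. In the sketch below the daytime window (10:00-17:00 local time) and the idealised cosine cycle (mean 25, amplitude 10 nmol mol⁻¹, peak at 15 h) are assumptions chosen for illustration; with these choices the ratios come out at roughly 0.77 and 0.71, of similar size to the 0.8 and 0.7 quoted above, but they are not the basis of those values.

```python
import math
from statistics import mean


def sampling_bias_ratios(hourly, day_start=10, day_end=17):
    """Ratios of the 24-hour mean to the daytime mean and daytime maximum.

    hourly: 24 ozone values for hours 0-23 local time.
    Daytime is day_start <= hour < day_end (an assumed window).
    """
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    daily_mean = mean(hourly)
    daytime = hourly[day_start:day_end]
    return daily_mean / mean(daytime), daily_mean / max(daytime)


# Idealised clean-site cycle: mean 25, amplitude 10, peak at 15 h (hypothetical)
cycle = [25.0 + 10.0 * math.cos(2 * math.pi * (h - 15) / 24) for h in range(24)]
to_day_mean, to_day_max = sampling_bias_ratios(cycle)
print(f"24-h/daytime mean = {to_day_mean:.2f}, 24-h/daytime max = {to_day_max:.2f}")
```

A larger night-time depletion (greater amplitude relative to the mean) lowers both ratios, so the bias of a 24-hour sample depends on local surface and mixing conditions.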
At that time W.M. Hartley, who used the KI-arsenite method in his laboratory studies and was familiar with the Montsouris work, expressed concern over possible ozone loss in the sampling system, writing: "It is impossible, therefore, to accept the figures given in the Annuaire de L'Observatoire de Montsouris as indicating anything like the true proportion of ozone usually present in country air …" (Hartley, 1881). Indeed, the Montsouris measurements are consistent with other historical measurements in urban areas (Text S-6, Table S-1 and Figure S-1).
In summary, the ozone measurements at Montsouris were made with a valid technique; however, it is very likely that they contain a large negative bias, of comparable magnitude to the observed ozone concentration, due to the presence of SO₂ and ammonia, and a further bias of uncertain sign due to nitrogen oxides. In addition, the 24-hour sampling includes low night-time ozone values caused by nocturnal inversions and ozone dry deposition, as well as chemical interference due to the recirculation of air over Paris. Consequently, the measured ozone concentrations for Montsouris for 1876-1910 are biased low, are not representative of the regional atmosphere and are not used in this assessment of historical ozone concentrations.

Other surface measurements: 1896-1901
During 1896-1901 other surface ozone measurements were made in clean air with the KI-arsenite method (Albert-Lévy, 1877); these are listed in Table 4. De Thierry (1896) made measurements at Chamonix (~1 km asl) and Grands-Mulets (~3 km asl) on Mont Blanc in southeastern France on 3 days in August and September 1896. Lespieau (1906) made 13 measurements over 4 days in 1900 and 6 days in 1901 on glaciers at three levels (~1.25 km, ~3 km and ~4.8 km) on Mont Blanc. The mean of these observations weighted by the number of measurement days (hereinafter the "weighted mean") for the temperate zone over Europe is 25 nmol mol⁻¹, with a range of 20-63 nmol mol⁻¹, for the period 1896-1901.

1929-1934
The first measurements of surface ozone with the UV method were made in 1929 (Fabry and Buisson, 1931; Götz and Ladenberg, 1931). In Europe there were multiple measurements with long-path UV (Götz and Maier-Leibnitz, 1933; Götz and Penndorf, 1941). One study involved two groups making simultaneous UV measurements at two alpine sites and swapping sites midway through the experiment. No significant differences were observed in this early instrument intercomparison. The ozone mixing ratio, corrected for the Fabry and Buisson (1931) UV absorption coefficients, was 31 nmol mol⁻¹ at Jungfraujoch, Switzerland (3450 m) and 18 nmol mol⁻¹ at nearby Lauterbrunnen (800 m), showing an increase with altitude.
The weighted mean of these observations in the temperate zone over Europe is 25 nmol mol⁻¹, with a range of 18-32 nmol mol⁻¹, for the period 1929-1934.
Dauvillier (1934) used the KI-arsenite method at Scoresby Sund on the east coast of Greenland from November 1932 to August 1933 (Table 5). His daily data (reported as 24-hour means) can be notionally divided into two sets: 225 days of data show a background value of about 50 µg m⁻³, with less than 5% above 100 µg m⁻³ (approximately 47 nmol mol⁻¹), and 45 days of data show major events in which ozone reached 570 µg m⁻³ (approximately 270 nmol mol⁻¹). Ozone during December 1932 was particularly high, averaging approximately 100 nmol mol⁻¹. This monthly mean is more than twice the highest December monthly mean at Arctic sites in the TOAR Surface Ozone Database. Dauvillier (1934) recorded these high events as interesting and remarkable. He associated them with aurora and suggested enhanced transport from the stratosphere or ionosphere to surface air. Dauvillier (1934) also states that during the winter the inlet was close to or at the snow surface. There are two possible explanations for these high ozone levels: either (a) the KI-arsenite method was subject to large positive interference in the calm conditions recorded during the polar night, or (b) enhanced stratosphere-troposphere exchange or tropospheric ozone production led to high ozone levels in surface air that have not recurred in the Arctic during the period of UV surface ozone measurements, approximately 1980 to the present.
Two points are worth noting. First, Dauvillier (1935) subsequently validated his method in the Arctic in winter against UV surface ozone observations. Second, the ozonesonde record from Resolute Bay, in the Canadian Arctic, shows occasional very high values at the surface, only in winter, including a value of 164 nmol mol⁻¹ in 1966, and Chung and Dann (1985) recorded surface ozone levels of up to 228 nmol mol⁻¹ in December in Saskatchewan, Canada, measured with an ethylene chemiluminescent analyser. With the evidence available it is impossible to determine which explanation is correct. The considered judgement is that the 45 days of observations during the major events are not credible and are omitted from the subsequent analysis, while the 225 days of background values are retained as credible. Dauvillier (1935) presents measurements from Abisko, Sweden, between December 1934 and March 1935, giving average ozone concentrations of 41 µg m⁻³ from 68 UV measurements and 33 µg m⁻³ from 56 measurements with the KI-arsenite method.
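Converting the historical mass concentrations to mole fractions requires the temperature and pressure to which they refer, which the early reports do not always state. The sketch below assumes 0 °C and 1013.25 hPa; under that assumption it gives about 47 nmol mol⁻¹ for 100 µg m⁻³ and about 266 nmol mol⁻¹ for 570 µg m⁻³, close to the approximate values quoted above. The function name is hypothetical.

```python
R = 8.314462618   # molar gas constant, J mol^-1 K^-1
M_O3 = 47.997     # molar mass of ozone, g mol^-1


def ug_m3_to_nmol_mol(c_ug_m3, temp_k=273.15, pressure_pa=101325.0):
    """Convert an ozone mass concentration to a mole fraction (nmol/mol).

    The result depends on the temperature and pressure of the air to which
    the mass concentration refers; 0 degC and 1013.25 hPa are assumed here.
    """
    molar_volume = R * temp_k / pressure_pa   # m^3 of air per mol
    return c_ug_m3 * 1e-6 / M_O3 * molar_volume * 1e9


for c in (50.0, 100.0, 570.0, 41.0, 33.0):
    print(f"{c:6.1f} ug m-3 -> {ug_m3_to_nmol_mol(c):6.1f} nmol/mol")
```

Referring the same values to 20 °C instead of 0 °C would raise the mole fractions by about 7%, which is small compared with the other uncertainties in these early data.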
Considering the qualifications above, the methods used are valid, and the sites are presumed free of interfering pollutants and representative of the well-mixed boundary layer. The weighted mean of these observations in the Northern Polar region is 22 nmol mol⁻¹, with a range of 15-24 nmol mol⁻¹, for the period 1932-1935.

1938-1941
The only UV measurements of surface ozone found for this period were made at Mt. Ventoux (1912 m, southern France) by A. Vassy in October 1938, as reported by Fabry (1950). After correction for the Fabry and Buisson (1931) UV absorption coefficients, these observations average 26 nmol mol⁻¹.
The KI method became more widely used during this period. At Jungfraujoch (3450 m) on 5 days in August 1938, Regener (1938b) observed 30 (range 24-43) nmol mol⁻¹ of ozone, and in September at Friedrichshafen (400 m, near the shore of Lake Constance, southern Germany), 21 nmol mol⁻¹ (range 15-24). Ehmert and Ehmert (1949) made measurements using the Ehmert method on Pfänder Mountain (1060 m, western Austria) in September 1940. The data have been reinterpreted by Volz et al. (1988): the original ozone measurements gave 15 nmol mol⁻¹, and the revised value is 22 nmol mol⁻¹. Edgar and Paneth (1941b) present measurements of ozone made from February 1938 to July 1939 in a street of South Kensington, London, UK, from the rooftop of the Royal College of Science building in South Kensington, at the Kew Observatory outside London, and at Southport on the northwest coast of England. They used a cryogenic trapping and purification method followed by both UV and KI analysis (Edgar and Paneth, 1941a, 1941b). Considering only the non-urban observations at Kew and Southport, the mean and range are 24 (17-29) nmol mol⁻¹. Glückauf (1944), using an automated KI-thiosulfate method, presents occasional ozone data in a meteorological analysis of observations near Durham, UK. The following results are extracted from his paper. On 21 March 1941, in wind from a clean sector, the ozone mixing ratio was 31 ± 2 nmol mol⁻¹. The level in November 1940 is reported as half of this. Daily maximum values varied from 24 nmol mol⁻¹ in November to 68 nmol mol⁻¹ in May. On 28 and 29 August 1941, in prolonged strong steady winds, ozone reached daily maxima of 25 and 27 nmol mol⁻¹, with a 37-hour average of 21.5 nmol mol⁻¹. Analyses of five warm fronts indicated ozone mole fractions of 25 to 34 nmol mol⁻¹, and analyses of four cold fronts indicated ozone mole fractions of 27 to 68 nmol mol⁻¹.
The high values, likely due to subsidence behind the front, may have been partly of stratospheric origin. If the maximum values are treated as 95th percentiles and an annual cycle is fitted to the monthly observations, the mean mole fraction of ozone in clean, well-mixed conditions is 27 nmol mol⁻¹.
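The text does not describe how the percentile-to-mean step or the annual-cycle fit was carried out, so the following is only one plausible reading, not the procedure used for the 27 nmol mol⁻¹ value. It assumes normally distributed values (so a mean is the 95th percentile minus about 1.645 standard deviations, with the standard deviation supplied by the user) and a first-harmonic annual cycle fitted by least squares with `numpy.linalg.lstsq`. The function names, the standard deviation and the monthly maxima in the example are all hypothetical.

```python
import numpy as np


def p95_to_mean(p95, sd):
    """Mean implied by a 95th percentile if values are normally distributed."""
    return p95 - 1.645 * sd


def fit_annual_cycle(months, values):
    """Least-squares first-harmonic fit of an annual cycle.

    Model: value = a0 + a1*cos(2*pi*m/12) + b1*sin(2*pi*m/12), with m the
    month number (1-12). Needs observations in at least three distinct months.
    Returns (a0, a1, b1); a0 is the annual mean of the fitted cycle.
    """
    m = np.asarray(months, dtype=float)
    y = np.asarray(values, dtype=float)
    phase = 2.0 * np.pi * m / 12.0
    design = np.column_stack([np.ones_like(phase), np.cos(phase), np.sin(phase)])
    (a0, a1, b1), *_ = np.linalg.lstsq(design, y, rcond=None)
    return a0, a1, b1


# Hypothetical monthly maxima treated as 95th percentiles, with an assumed sd
months = [3, 5, 8, 11]
maxima = [35.0, 45.0, 40.0, 28.0]
means = [p95_to_mean(x, sd=5.0) for x in maxima]
a0, a1, b1 = fit_annual_cycle(months, means)
print(f"annual mean of fitted cycle: {a0:.1f} nmol/mol")
```

With observations in only a few months, the fitted annual mean is sensitive to both the assumed spread and the months sampled, so a result of this kind carries a substantial uncertainty.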
Accepting the specific analysis of Glückauf (1944) above, the methods used in this period are valid, and the sites, or air masses sampled, are presumed free of interfering pollutants and representative of the well-mixed boundary layer. The weighted mean of these observations in the temperate zone over Europe is 26 nmol mol -1, with a range of 21-30 nmol mol -1, for the period 1938-1941. These results are indistinguishable from those both 5 years and 4 decades earlier.
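The regional summaries in this section are reported as a weighted mean plus the range of the individual records. The weights are not specified in this passage, so the sketch below assumes, purely for illustration, that each record is weighted by a user-supplied count such as its number of observation days; the function name `weighted_summary` is hypothetical.

```python
def weighted_summary(values, weights):
    """Weighted mean and (min, max) range of per-record mean ozone values.

    weights: hypothetical per-record weights (e.g. number of observation days);
    the weighting actually used in the assessment may differ.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(v * w for v, w in zip(values, weights)) / total
    return mean, (min(values), max(values))
```

Note that the range is taken over the record means themselves, so it is unaffected by the choice of weights, while the weighted mean is pulled toward the longer records.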

1951-1970s
The period from the 1950s to the 1970s is the first in which there were globally distributed measurements of ozone in surface air. Series of measurements lasting a year or more became common, and short-term campaigns fewer. The surface ozone records during this period include: the upsurge in observations during the International Geophysical Year of 1957-1958; measurements made initially at Dresden-Wahnsdorf in 1954 and then expanded in 1956 to a network of 7 surface stations across Germany (Warmbt, 1964; Feister and Warmbt, 1987); and, from 1969 to 1975, the Troposphärisches Ozon (TROZ) network of 16 surface ozone stations along a meridional zone from Norway to South Africa, run by the Max Planck Institute for Aeronomy (Fabian and Pruchniewicz, 1977). Several other sites used here were run on an individual or national network basis.
For the TROZ network, the largest network of this period, a coulometric KI method (Section 2.1.3) was used (Pruchniewicz, 1970, 1973). Detailed consideration of the TROZ data reveals issues with (a) its traceability to the modern UV standard, (b) interferences or pollution at the sites, and (c) special features in the selection of representative surface ozone data (reported ozone values are representative of daytime maximum values rather than daytime means). These issues are discussed in the Supplemental Material (Text S-3). Because this data set includes data from 15 stations (one station was omitted for reasons described below), located between 90°N and 30°S, it potentially dominates the historical record for the 1950s-1970s. To understand the sensitivity of the historical analysis to the TROZ data, and to provide robust conclusions, the change in ozone from the historical to the modern period is quantified both with and without the TROZ data (see Section 3.6).
Five surface ozone observing sites operating in the Northern Polar region in the 1950s to 1970s are presented in Table 5. Three sites are from the TROZ network and two are in Alaska: College (137 m, in a suburb of Fairbanks in the center of the state) and Barrow (15 m, near the shore of the Beaufort Sea, now called Utqiaġvik) (Wilson et al., 1952; Kelley, 1973). The College data, taken with the KI-arsenite method, show spikes of very high values like the Dauvillier (1934) record; these are truncated on credibility grounds (see Text S-3 for detailed discussion). The Barrow data are well documented, but they show odd seasonal behaviour and some surprisingly high values; they are therefore of questionable credibility, but are accepted (Text S-3). With these qualifications, the methods used are valid, and the sites are presumed free of interfering pollutants and representative of the well-mixed boundary layer. The weighted mean of the accepted observations in the Northern Polar region (1950s-1970s) is 24 nmol mol -1, with a range of 19-34 nmol mol -1. If the College and Barrow data are omitted entirely, this average becomes 19 nmol mol -1, but this value relies entirely on the TROZ network.
In the Northern Temperate region there are multiple data sets. Bowen and Regener (1951) present ozone data from Capillo Peak Observatory in New Mexico (approximately 2800 m asl). Four years of ozone data were obtained at the Arosa Light Observatory (on the northern edge of the town of Arosa in a Swiss valley, 1810 m; Staehelin et al., 2018) using the Ehmert technique (Perl, 1965). In August 1953 and in June/July and September 1954, ozone observations were made at Lindenberg (98 m) in northeastern Germany from an 80 m tower, using the Cauer method (Section 2.1.4) (Teichert, 1955). Measurements were made at Hohenpeissenberg (975 m) from 1971 to 1975 using the HP-KI method. Six other sites, all from the TROZ network, are listed in Table 4. The TROZ station Lindau is omitted because of the influence of high levels of local pollution (Fabian and Pruchniewicz, 1976). The long-term record at Arkona on the north coast of Germany, using the Cauer method, commenced in 1956 (Warmbt, 1964; Feister and Warmbt, 1987); with a correction for SO₂ interference (Warmbt, 1964; Feister and Warmbt, 1987), the Arkona data are accepted (see note #12 in Text S-7). Measurements made with the Ehmert method at three sites in Japan are also included (Miyake et al., 1962; Kawamura and Sakurai, 1966). The methods used are valid and, after the correction of the Arkona data, all sites are presumed free of the influence of interfering pollutants and representative of the well-mixed boundary layer. The weighted mean of these observations in the Northern Temperate region is 22 nmol mol -1, with a range of 19-32 nmol mol -1.
In the Northern Tropics, the surface ozone observing sites in the 1950s to 1970s are Fort Lamy in Chad, Africa (12°N), from the TROZ network (Fabian and Pruchniewicz, 1977), Pune (Poona, 19°N) and Ahmedabad (24°N) in India (Tiwari and Sreedharan, 1973; Naja and Lal, 1996), and Mauna Loa, Hawaii (20°N) (Price and Pales, 1963) (Table 6). The weighted mean of these observations in the Northern Tropics is 23 nmol mol -1, with a range of 16-31 nmol mol -1, for the period 1950-1975.
In the Southern Tropics, the surface ozone sites in the 1950s to 1970s are Luanda (9°S), Sa da Bandeira (15°S), Windhoek (23°S) and Alexander Bay (28°S), all from the TROZ network and all in Africa (Fabian and Pruchniewicz, 1977). The weighted mean of these observations in the Southern Tropics is 18 nmol mol -1, with a range of 14-24 nmol mol -1, for the period 1970-1975.
In the Southern Temperate region, the surface ozone sites in the 1950s to 1970s are Hermanus, South Africa (34°S) (Fabian and Pruchniewicz, 1977), 20 days of measurements during campaigns at locations in southeastern Australia (~34°S) (Galbally, 1968, 1970, 1971, 1972), and two years of measurements at Macquarie Island (54°S), south of New Zealand (Galbally and Roy, 1981). The weighted mean of these observations in the Southern Temperate region is 22 nmol mol -1, with a range of 21-25 nmol mol -1, for the period 1967-1975.
Six surface ozone observing sites operated in the Southern Polar region in this period, all in Antarctica. These are Little America (Wexler et al., 1960), Halley Bay (MacDowall, 1962), Hallett and the Amundsen-Scott South Pole Station (Aldaz, 1965; Oltmans and Komhyr, 1976), Base Roi Baudouin (Wisse and Meerburg, 1969), and Mirny (Kolbig and Warmbt, 1978) (Table 5). The 1958 data from Halley Bay have questionable credibility (Roscoe and Roscoe, 2006; see Text S-3). The Amundsen-Scott South Pole Station is at 2835 m altitude, so ozone levels there are expected, and observed, to be higher than at the lower-altitude stations. There are several other sites at southern high latitudes that operated in this period (e.g. Oltmans and Komhyr, 1976), but their data are currently not available. With the qualifications for Halley Bay above, the methods used are valid, and the sites are presumed free of interfering pollutants and representative of the well-mixed boundary layer. The weighted mean of these observations, including the Halley Bay data, is 24 nmol mol -1, with a range of 16-30 nmol mol -1, for the period 1957-1966. Without the Halley Bay data, the range is 19-30 nmol mol -1, but the weighted mean remains 24 nmol mol -1.

Changes in surface ozone
The sets of observations that pass the four data selection criteria imposed here for historical surface ozone observations are presented in Tables 4-6 and Figures 2-6, grouped by region. To quantify the changes in ozone levels from the historical (pre-1975) to the modern period, the historical data are compared with all available modern ozone observations in the same regions, according to the criteria described below. The modern data are extracted from the TOAR Surface Ozone Database and averaged across 5 × 5 degree grid cells at 5-year intervals. The historical observations have been selected to be representative of the well-mixed boundary layer, and the comparable metrics from the TOAR Database for rural stations are daytime average values and daily 8-hour maxima (DMA8). Because the "rural" designation in the gridded average product from the TOAR database excludes sites at elevations above 2000 m (owing to the typical increase of ozone with altitude), a proper comparison should also exclude the measurements at higher mountain sites. In Figure 2 these are Grands-Mulets, Mont Blanc, Jungfraujoch and Zugspitze in Europe, Mt. Norikura in Japan, and Capillo Peak in North America. Except for the value of 63 nmol mol -1 in 1896 at Grands-Mulets, disregarding these eight points does not substantially change the picture in Figure 2 (see Figure S-2).
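The gridding step described above (rural sites only, elevations up to 2000 m, 5 × 5 degree cells, 5-year intervals) can be sketched as a simple binning and averaging of station records. This is a minimal illustration under stated assumptions: the record schema (`lat`, `lon`, `elev_m`, `year`, `ozone`, `rural`) and the function names are hypothetical, and the sketch averages individual records directly, whereas the TOAR gridded product may first aggregate by station or apply data-capture rules.

```python
import math
from collections import defaultdict


def grid_key(lat, lon, year, cell_deg=5.0, period_years=5):
    """Return the (lat cell, lon cell, period start year) bin for one record."""
    lat_cell = math.floor(lat / cell_deg) * cell_deg
    lon_cell = math.floor(lon / cell_deg) * cell_deg
    period = (year // period_years) * period_years
    return lat_cell, lon_cell, period


def gridded_rural_means(records, max_elev_m=2000.0):
    """Average rural, low-elevation records into 5 x 5 degree, 5-year bins.

    records: iterable of dicts with hypothetical keys
    lat, lon, elev_m, year, ozone (nmol/mol) and rural (bool).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        # Skip non-rural sites and mountain sites above the elevation cut-off.
        if not r["rural"] or r["elev_m"] > max_elev_m:
            continue
        key = grid_key(r["lat"], r["lon"], r["year"])
        sums[key] += r["ozone"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}
```

Using `math.floor` rather than integer truncation keeps cells aligned for negative latitudes and longitudes, so a site at 0.5°W falls in the 5°W-0° cell.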
Figure 2. Northern Temperate (Europe): historical surface ozone measurements (mean ozone versus year, 1900-2010). […] Asian, one North American and one North African data sets; see Table 4 for details. Error bars represent standard deviations of the measurement averages (atmospheric variability), not uncertainty of the measurement. Five-year averages of modern UV measurements at sites below 2000 m, classified as "rural", in the 5 × 5 degree gridded product from the TOAR database are also shown, both daytime averages (day) and daily 8-hour maxima (DMA8). DOI: https://doi.org/10.1525/elementa.376.f2
In Figure 2 there is no evidence of a change in rural background ozone during the historical record, as noted in Section 3.4. The averages for the four historical time periods in Table 4 differ by less than 3 nmol mol -1. This is in sharp contrast with previous analyses of the historical period and arises mainly from the application of the four criteria used to select valid historical data. The difference is due to (a) the rejection of the Montsouris data; (b) the corrections detailed in Table 4, which have raised values for the early UV measurements by as much as 11%; (c) the inclusion of some early data that are not frequently cited; and (d) other corrections, as noted, for Mast Ozone Meter, transmogrifier and Cauer method data. Interestingly, the overall historical average found here is similar to older well-informed estimates: for example, Fabry (1950) remarks that surface ozone is typically about 20-25 nmol mol -1, with only modest daily variation. Figure 2 indicates an increase of about 12 or 16 nmol mol -1 between the historical record and the modern period, depending on which metric (12-hour daytime average or DMA8) is applied to the modern data. The diurnal variation of surface ozone depends on both the underlying surface and the topography. At remote sites, such as those considered here, measurements over flat continental surfaces show a distinct daytime maximum, while measurements over water, snow and ice show little diurnal variation, and measurements from hilltops or from elevated sites in valleys often show night-time maxima. Afternoon averages were chosen for the historical data where available (see Tables 4-6 and notes above), but in a number of cases only daytime or 24-hour averages were available, and the UV measurements were all made at night, so it is difficult to ascertain whether DMA8 or the 12-hour daytime average is the better metric for comparison. However, both show the same behaviour.
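The two modern metrics compared above can be computed from one day of hourly values as sketched below. The 08:00-20:00 local window for the 12-hour daytime average, the handling of missing hours as `None`, and the restriction of DMA8 to 8-hour windows that lie entirely within the day are simplifying assumptions for illustration; regulatory and TOAR definitions also handle windows that cross midnight and data-capture thresholds.

```python
def daytime_average(hourly, start_hour=8, end_hour=20):
    """Mean of hourly values for start_hour <= h < end_hour (12 hours by default).

    hourly: 24 hourly mean values for one local day; None marks a missing hour.
    The 08:00-20:00 window is an assumption for illustration.
    """
    window = [v for v in hourly[start_hour:end_hour] if v is not None]
    return sum(window) / len(window) if window else None


def daily_max_8h(hourly):
    """Daily maximum of running 8-hour means (DMA8), simplified.

    Only the 17 complete windows that start and end within the 24 values are
    used; windows containing a missing hour are skipped.
    """
    means = []
    for start in range(len(hourly) - 7):
        window = hourly[start:start + 8]
        if None in window:
            continue
        means.append(sum(window) / 8)
    return max(means) if means else None
```

For a day with a pronounced daytime peak, DMA8 necessarily exceeds the 12-hour daytime average, which is consistent with the larger increases obtained when DMA8 is used as the modern metric.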
In the Northern Polar region (Figure 3) and in the Tropics (Figure 5), there is also some evidence of an increase, although there are fewer historical data sets with which to make the comparison. In the southern hemisphere (Figures 4 and 6), there is no clear evidence of a change.
Figure 3. […] see Table 5 for details. Error bars represent standard deviations of the measurement averages (atmospheric variability), not uncertainty of the measurement. Five-year averages of modern UV measurements at sites below 2000 m, classified as "rural", in the TOAR 5 × 5 degree gridded average product are also shown, for both daytime averages (day) and daily 8-hour maxima (DMA8). DOI: https://doi.org/10.1525/elementa.376.f3
Figure (caption not recovered): Southern High Latitudes: historical surface ozone measurements, mean ozone versus year (1900-2010).
Figure 5. […] see Table 6 for details. Error bars represent standard deviations of the measurement averages (atmospheric variability), not measurement uncertainty. Five-year averages of modern UV measurements at sites below 2000 m, classified as "rural", in the TOAR 5 × 5 degree gridded average product are also shown, both daytime averages (day) and daily 8-hour maxima (DMA8). DOI: https://doi.org/10.1525/elementa.376.f5
Figure (caption not recovered): Southern Midlatitudes: historical surface ozone measurements, mean ozone versus year.
A quantification of the change in ozone mole fraction from the historical to the modern measurements has been performed for each latitude band in Tables 4-6 and is presented in Table 7. Table 7a presents changes of surface ozone from the historical period to the modern period using all available historical observations below 2000 m elevation. The modern period is based on rural ozone observations below 2000 m for the years 1990-2014, and the metric for the modern data is the 12-hour daytime average. The modern data were extracted from the TOAR Surface Ozone Database and reduced to monthly means […] (see Text S-8 and Tables S-2 and S-3), indicate that surface ozone has increased by 30-70% from historical levels to the present in rural air in the temperate and polar regions of the Northern Hemisphere, and has changed negligibly at the remote locations in the Southern Hemisphere. Statistical tests were performed using Welch's generalization of Student's t-test for unequal variances (Welch, 1947). As described above and in the Supplemental Material (Text S-3), some of the historical data sets have questionable reliability due to a variety of documented issues. To understand the sensitivity of the results in Table 7a to these less reliable data sets, analyses were also performed with all questionable historical data sets omitted (see Text S-8 and Tables S-2 and S-3 for details), and also with all KI-based data omitted (this limits the historical measurements to seven long-path UV data sets in the Northern Temperate region and one in the Northern High Latitude zone). These results (Table 7b) are very similar, indicating that the results in Table 7a are robust with respect to the choice of historical data sets included in the analysis. Other tests, omitting only the observations from the TROZ network (Fabian and Pruchniewicz, 1977) or using the non-parametric Wilcoxon rank-sum test (Mann and Whitney, 1947), also yielded similar results (Table S-2), with smaller p-values for the non-parametric test. Comparisons were also made with the modern data using the daily 8-hour maximum (DMA8) metric; as expected, these increases are about 20% larger (Table S-3).
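Welch's test, mentioned above, compares two sample means without assuming equal variances. The sketch below computes the Welch t statistic, the Welch-Satterthwaite degrees of freedom and the percentage change of the modern mean relative to the historical mean, using only the Python standard library. Treating each data set (or monthly mean) as one sample value is an assumption about how the inputs are organized; the function names are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(historical, modern):
    """Welch's t statistic (modern minus historical) and Welch-Satterthwaite df."""
    va = variance(historical) / len(historical)  # squared standard error, sample a
    vb = variance(modern) / len(modern)          # squared standard error, sample b
    t = (mean(modern) - mean(historical)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(historical) - 1)
                           + vb ** 2 / (len(modern) - 1))
    return t, df


def percent_change(historical, modern):
    """Change of the modern mean relative to the historical mean, in percent."""
    return 100.0 * (mean(modern) - mean(historical)) / mean(historical)
```

A p-value follows from the t distribution with `df` degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind(historical, modern, equal_var=False)` performs the same test, with the sign of the statistic reversed because it tests the first sample minus the second.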
The modern set of measurements used for this analysis comprises all rural sites with surface ozone measurements within the geographical boundaries of the region. There is considerable variability in the behaviour […] (…; Lyapina et al., 2016). To test the sensitivity of these calculated changes to the choice of modern data sets, two domains were selected to represent the modern ozone observations for the Northern Temperate region. The first domain covers most of Western Europe and uses monthly mean observations from 16 grid cells, each 5° × 5°. The second domain focuses on a single 5° × 5° grid cell that encompasses most of the historical ozone observations from Western Europe, especially those based on UV methods. Within this single grid cell, monthly means from the individual rural sites were used. The differences in the estimated changes for the Northern Temperate region based on these two modern domains are notable, both in Table 7a and in Table 7b. However, for both modern domains the differences between Table 7a and Table 7b are quite small. This suggests that the uncertainty in the estimated increases in Table 7 depends more on the modern region chosen for comparison than on the historical data; data representativeness thus appears to be the more important source of uncertainty.
It is therefore not surprising that the increases determined here differ from those of some past analyses (e.g. Parrish et al., 2012, 2014). Past analyses have used data from a few selected stations with long-term records, whereas this analysis uses all available historical and modern measurements. A more detailed matching of co-located historical and modern sites will provide additional insight and is being undertaken.
In the southern hemisphere, there is little evidence for an increase of ozone from the historical to the modern period.
It is worth emphasizing that these results do not depend on particular individual records, as the tests omitting the data sets judged to be of questionable credibility show; indeed, if only the long-path UV measurements are retained, the increase in Europe is in the range 33-48%, with p-values between 0.01 and 0.03.
In summary, our best estimate for the increase of ozone at northern temperate and high latitudes is 32-53%, based on all historical measurements and using the 12-hour daytime average as the modern metric, and 43-71% using the daily 8-hour maximum metric. These increases are significantly different from zero at the 95% confidence level. Similar results are found using non-parametric statistical tests and for calculations using the most reliable subsets of the historical data.
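The non-parametric check referred to above, the Wilcoxon rank-sum (Mann-Whitney U) test, can be sketched with the standard library as follows. This is a minimal version that assigns average ranks to ties and uses the large-sample normal approximation for the p-value, without the tie correction to the variance or a continuity correction; exact p-values for very small samples, as provided by `scipy.stats.mannwhitneyu`, can differ noticeably.

```python
import math
from statistics import NormalDist


def rank_sum_test(a, b):
    """Wilcoxon rank-sum / Mann-Whitney U test, normal approximation.

    Returns (U statistic for sample b, two-sided p-value). Ties receive
    average ranks; tie and continuity corrections are omitted.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for this block of ties
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    r2 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 1)
    u2 = r2 - n2 * (n2 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u2 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u2, p
```

Because the test uses only ranks, a few anomalously high or low historical records shift the result much less than they shift a comparison of means.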

Free tropospheric measurements
Many of the measurement techniques described in Section 2 have been applied to measurements in the free troposphere through balloon and aircraft profiling and, more recently, through ground- and satellite-based remote sensing. Recent studies have examined free tropospheric ozone data quality issues by comparing time series of surface observations with commercial aircraft and ozonesonde profiles from nearby locations (Logan et al., 2012; Tanimoto et al., 2015). In some cases, laboratory and field intercomparisons can provide information about changes in instrument response with time (e.g. Attmannspacher and Dütsch, 1970, 1978; Hilsenrath, 1986; Kerr et al., 1994; Smit et al., 2007). Each method is reviewed below, together with its biases and uncertainties as established through intercomparisons, and related to the UV standard.

Early measurements
Historical observations of free tropospheric ozone that pass the relevant data selection criteria are presented in Table 8 and Figure 7. The first direct measurement of ozone from a balloon was made on three flights near Stuttgart, Germany, in 1934 (Regener and Regener, 1934). A quartz UV spectrograph was used to observe the change with altitude of the total amount of ozone above the balloon. The derived profile increases very smoothly from a value equivalent to 40 nmol mol -1 near the ground to ~180 nmol mol -1 at 10 km; this integrates to about 40 Dobson units (DU) below 10 km.
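Converting a mixing-ratio profile into a partial column in Dobson units, as in the ~40 DU figure above, amounts to integrating the mixing ratio over pressure under hydrostatic balance. The sketch below applies the trapezoidal rule on pressure levels with constant gravity and the molar mass of dry air; it is a generic illustration, not the procedure used by Regener and Regener (1934), and the level values in any usage are up to the reader.

```python
N_A = 6.02214076e23   # Avogadro constant, 1/mol
G0 = 9.80665          # standard gravity, m/s^2 (assumed constant with height)
M_AIR = 0.0289644     # molar mass of dry air, kg/mol
DU = 2.6867e20        # one Dobson unit, molecules per m^2


def partial_column_du(pressures_hpa, mixing_ratios_nmol_mol):
    """Ozone partial column (DU) between the first and last pressure levels.

    Integrates dN = chi * N_A / (g * M_air) * dp with the trapezoidal rule,
    assuming hydrostatic balance. Levels may be ordered top-down or bottom-up.
    """
    total = 0.0
    for k in range(len(pressures_hpa) - 1):
        dp_pa = abs(pressures_hpa[k] - pressures_hpa[k + 1]) * 100.0
        chi = 0.5 * (mixing_ratios_nmol_mol[k] + mixing_ratios_nmol_mol[k + 1]) * 1e-9
        total += chi * N_A / (G0 * M_AIR) * dp_pa  # molecules per m^2 in the layer
    return total / DU
```

The constants combine to roughly 0.79 DU per ppmv per hPa, so a uniform 50 nmol mol -1 between 1000 and 200 hPa corresponds to about 31.6 DU.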
A similar differential method was used on the manned flight of the high-altitude balloon Explorer II on 11 November 1935 in the western USA. This showed very little ozone below the tropopause: less than 10 DU (O'Brien et al., 1936; reported in Craig, 1950 and Fabry, 1950). This flight also showed an unusually sharp tropopause.
Additionally, ozone measurements using wavelengths of 310-330 nm were made on two unmanned balloon flights in Germany, on 30 October 1937 and 11 December 1937, by Regener (1938a). These appear to have been more sensitive (reduced stray light) and found a good deal of variation in the troposphere, with maximum values that correspond to ~110 nmol mol -1 at 3 km. These high values are questioned by Fabry (1950), who also points out that the differencing method is subject to large errors below the tropopause, where changes are small. However, the 0-10 km column amounts of 23 and 37 DU on the ascent and descent of 30 October 1937, and of 15 DU on both the ascent and descent of 11 December 1937 (Regener, 1938a), compare well, on average, with November averages of 1-10 km columns measured over Europe by ozonesondes since 1990 (27-35 DU). These estimates of historical ozone levels include a 10% upward correction for the ultraviolet absorption coefficients used (Lauchli, 1928, 1929). The large variation between the ascent and descent of 30 October 1937 probably indicates the uncertainty of the differencing method, since it is unlikely that ozone changed by that much in less than a day. Coblentz and Stair made a series of 19 balloon launches in 1938-1940 from the eastern USA with a filter-based optical ozonesonde using wavelengths of 290-330 nm (Coblentz and Stair, 1939, 1941); average values correspond to ~24 nmol mol -1 in the lower troposphere and ~96 nmol mol -1 near 10 km. These estimates include a 13% upward correction for the ultraviolet absorption coefficients used (Ny and Choong, 1933; Fabry and Buisson, 1931).
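The differential method and the absorption-coefficient correction described above reduce to simple arithmetic, sketched below with hypothetical function names and example numbers. The first function subtracts the overhead column measured at the upper altitude from that measured at the lower altitude and applies a fractional upward correction; the second shows why Fabry's concern is justified: a small relative error in each large overhead column becomes a large relative error in the small tropospheric difference (worst case, errors of opposite sign).

```python
def layer_column_du(column_above_lower_du, column_above_upper_du, xsec_correction=0.0):
    """Ozone amount (DU) in the layer between two balloon altitudes.

    Differential method: layer = overhead column at the lower altitude minus
    overhead column at the upper altitude. xsec_correction is a fractional
    upward correction for the absorption coefficients used (e.g. 0.10).
    """
    layer = column_above_lower_du - column_above_upper_du
    return layer * (1.0 + xsec_correction)


def worst_case_relative_error(column_lower_du, column_upper_du, rel_error_each):
    """Worst-case relative error of the layer amount when each overhead column
    carries a relative error of rel_error_each, with opposite signs."""
    abs_error = rel_error_each * (column_lower_du + column_upper_du)
    return abs_error / (column_lower_du - column_upper_du)
```

For illustration, overhead columns of 330 and 300 DU give a 30 DU layer, or 33 DU after a 10% correction, but a 1% error in each column could shift that layer by about 21%.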
In the early 1950s Paetzold (1955a, b) conducted 32 balloon flights with a UV spectrograph using wavelengths of 295-318 nm over Weissenau, Germany, of which 25 gave results in the 0-10 km region. From the cross-sections he quotes for 306 and 318 nm, an estimated 24% upward correction is required, so that these flights show a seasonal maximum in April of 27 DU, a minimum in October of 15 DU, and an extreme range of 0-50 DU for the 0-10 km ozone amounts. In 1942, ozone sampling with a wet chemical technique was conducted on 6 aircraft flights over Germany, giving ozone profiles up to 9 km (Ehmert, 1949). Subsequently, Kay (1953) and Brewer (1955) presented data using the Ehmert method from aircraft flights over the UK and Norway. The former are quite low and failed to detect the tropopause, possibly because of losses in the intake (Brewer, 1955); these data are excluded based on criterion 4. The latter show an average profile increasing from 30 nmol mol -1 at launch to ~95 nmol mol -1 just below the tropopause, very similar to a typical contemporary profile.
While some data sets show large variability (e.g. Paetzold, 1955a, b), the approach here is to include all of the early measurements that meet the relevant selection criteria (1, 2 and 4). The weighted mean of the observations in Table 8 is 23 DU, with a range of 0-50 DU for the period 1934-1955.
The corresponding modern average 0-10 km tropospheric column amount of ozone is 36 DU, for both the northern Europe and the eastern US regions in Figure 7. When the historical datasets are treated as separate observations with equal weight (as for the surface data in Section 3.6), the estimated increase in free tropospheric ozone in the temperate Northern Hemisphere is 47.4 ± 30% (t-test) or 47.9 ± 28% (Wilcoxon test), where the uncertainty indicates a 95% confidence interval. The increase is consistent with the increases inferred for surface ozone at northern midlatitudes. This free tropospheric increase is significant for climate studies, since it is primarily ozone in the upper troposphere that contributes to radiative forcing (IPCC, 2001).

Umkehr
Umkehr measurements rely on the fact that the scattering intensity of solar UV (through the effective scattering height) changes with solar zenith angle (SZA). Since most of the ozone in the atmospheric column is in the stratosphere, the information content of the tropospheric part of the retrieval is limited: only ~50% or less of the information in layer 1 (1000-250 hPa in the standard retrieval) comes from the troposphere (Stone et al., 2015; Text S-9), while the rest comes from the adjacent stratospheric layers.
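The limited tropospheric information content of Umkehr layer 1 can be illustrated with the standard linear smoothing relation for retrievals, in which a retrieved layer equals its a priori value plus a kernel-weighted sum of the departures of the true profile from the a priori. The sketch below is a hypothetical illustration: the kernel values are invented so that half of the layer's sensitivity is tropospheric, and measuring sensitivity as the sum of absolute kernel elements is our choice of metric, which may differ from how Stone et al. (2015) quantify information content.

```python
def tropospheric_fraction(kernel_row, is_troposphere):
    """Fraction of a retrieved layer's sensitivity coming from tropospheric layers.

    kernel_row: averaging-kernel row for the retrieved layer (hypothetical values).
    is_troposphere: booleans flagging which true layers lie in the troposphere.
    """
    total = sum(abs(k) for k in kernel_row)
    trop = sum(abs(k) for k, t in zip(kernel_row, is_troposphere) if t)
    return trop / total


def smoothed_layer(prior, truth, kernel_row, index):
    """Retrieved value of layer `index`: x_a[i] + sum_j A[i][j] * (x[j] - x_a[j])."""
    return prior[index] + sum(a * (x - xa) for a, x, xa in zip(kernel_row, truth, prior))
```

With such a kernel, an ozone change confined to the lower stratosphere still alters the retrieved layer-1 value, which is why Umkehr tropospheric estimates carry substantial stratospheric influence.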
In 1932-33 a series of 46 Umkehr profiles was made at Arosa, Switzerland, to determine the atmospheric profile of ozone. These yielded estimates of ~30-70 nmol mol -1 at 2 km, ~80-100 nmol mol -1 at 5 km and ~180-220 nmol mol -1 at 10 km. These values are high compared with modern measurements, but they carry a substantial uncertainty due to stratospheric influence and are therefore excluded from the analysis of Section 4.1.
The accuracy of Umkehr profile retrievals has improved with time as a result of modifications to the retrieval algorithm and better a priori information from ozonesonde and satellite-derived climatologies. Early comparisons (Kulkarni and Pittock, 1970) indicate a high bias of ~45% in the troposphere (likely due in part to the low response of the Brewer-Mast sondes; see Section 4.3). A 1989 field study (Komhyr et al., 1995) comprising 6 morning and 6 afternoon coincident Umkehr and ozonesonde measurements found tropospheric ozone in Umkehr profiles to be on average 16% low relative to ECC sondes launched in the morning, and 29% high in the afternoons. In a 2004-2005 comparison, 60 coincident Umkehr and ozonesonde profiles taken at Belsk, Poland, showed on average a slight (~2% ± 25%) low bias of the Umkehr partial ozone column below 250 hPa relative to the sonde column (Krzyścin and Rajewska-Więch, 2007). Umkehr profile measurements are currently made twice daily at ~16 sites worldwide (Figure 8).
Figure 7 (caption not recovered): mean 0-10 km ozone column (DU) versus year (1900-2010) from the historical measurements of Regener and Regener (1934), O'Brien et al. (1936), Regener (1938a), Coblentz and Stair (1939, …), Paetzold (1955a, b) and Brewer (…), with a modern average for Northern Europe (40-55°N, 0-15°E).

Ozonesondes
Ozone soundings were undertaken at a global network of 11 sites from 1962 to 1966 by the US Environmental Science Services Administration. This network operated in parallel with a North American network of 13 sites, coordinated by the US Air Force Cambridge Research Laboratories from 1963 to 1965. Together these networks released over 2000 Regener, Brewer-Mast and carbon-iodine sondes (… Sticksel, 1967, 1968; Hering, 1964; Hering and Borden, 1964, 1967). Regener chemiluminescent sondes were used regularly for only a brief period in the 1960s, as they showed a somewhat erratic response, with an average bias of about -40% in the troposphere (Chatfield and Harrison, 1977; Wilcox, 1978; Hering and Dütsch, 1965). Regular chemical ozone soundings at a number of sites in Europe, North America, Australia and Antarctica began in the latter half of the 1960s. Balloon-borne ozonesondes therefore provide the longest time series of the vertical ozone distribution throughout the troposphere. However, ozone soundings are limited in spatial and temporal coverage. Routine ozonesonde launches have been made at fewer than 100 stations worldwide; these are unevenly distributed, although coverage has improved considerably since the introduction of the Southern Hemisphere ADditional OZonesondes (SHADOZ) network in the 1990s. Launch frequency is typically weekly, and at most 2-3 times per week. Vertical resolution is high: the e-folding response time of the ozone sensor, about 25-40 seconds, gives the sonde a vertical resolution of about 100-200 metres for a typical balloon ascent rate of 4-5 m s -1 in the troposphere. All "modern" sonde types (ECC, Brewer-Mast (BM), Brewer-GDR (GDR), Indian and the Japanese KC) detect ozone through its reaction with aqueous potassium iodide (KI), assuming that two electrons are produced for each molecule of ozone. As discussed in Section 2.1, there can be variations in the stoichiometry of the ozone-iodide reaction.
Also, there may be losses of ozone in the pump and of ozone or iodine to the walls of the sensor chamber, as well as iodine evaporation and possibly adsorption to the platinum cathode (Tarasick et al., 2002). Both the GDR and Indian sondes are similar in design to the BM sonde (Brewer and Milford, 1960), while the KC sonde is similar to the early Komhyr carbon-iodine (CI) sonde (Komhyr, 1964; …). The ECC (Komhyr, 1969) is an electrochemical concentration cell with two chambers connected by an ion bridge. These different instrumental layouts cause differences in response (Smit, 2002), and ozone losses depend both on these differences and on sonde preparation. Losses can be as large as 40% in poorly prepared sondes, although such issues are much less common after about 1980. Slow side reactions in the sensing solution can cause excess iodine to be produced (Saltzman and Gilbert, 1959a; Flamm, 1977; Johnson et al., 2002). This effect is modest except where there are sharp ozone gradients, as it causes a slow (~20 min) second-order time response. As for surface KI monitors, interference from other gases can be a problem in polluted areas, although it is generally restricted to the boundary layer (Schenkel and Broder, 1982; Tarasick et al., 2000). Pump rate or temperature errors, as well as radiosonde pressure biases, also produce ozone measurement errors (positive or negative), but these are generally small (<1%) in the troposphere. Background currents cause ozone offsets that are typically up to 5% in the upper troposphere. The use of sensing solutions other than those recommended for each type of ECC sonde can introduce additional biases of 2-8% (Smit and ASOPOS panel, 2011), although these can be corrected. Ozonesondes can therefore, under certain conditions, significantly underestimate ozone concentrations, but it is difficult to explain positive errors larger than about 10-20%.
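The error sources listed above (stoichiometry, pump flow rate, pump temperature, background current) all enter through the conversion of the measured cell current into an ozone partial pressure. The sketch below derives that conversion from Faraday's law and the ideal gas law, assuming exactly two electrons per ozone molecule, complete collection and no pump-efficiency correction; `t100_s` (seconds for the pump to move 100 mL) and the function name are illustrative, and operational processing applies further corrections.

```python
R = 8.314462618    # molar gas constant, J/(mol K)
F = 96485.33212    # Faraday constant, C/mol


def ozone_partial_pressure_mpa(cell_current_ua, background_ua, pump_temp_k, t100_s):
    """Ozone partial pressure (mPa) from a KI cell current, ideal assumptions.

    Assumes 2 electrons per O3 molecule and 100% collection; t100_s is the time
    for the pump to move 100 mL of air. Pump-efficiency corrections are omitted.
    """
    flow_m3_s = 100e-6 / t100_s                                  # volumetric flow rate
    mol_o3_per_s = (cell_current_ua - background_ua) * 1e-6 / (2 * F)
    p_pa = mol_o3_per_s * R * pump_temp_k / flow_m3_s            # ideal gas: p = nRT/V
    return p_pa * 1e3
```

Because the background current is subtracted from the measured current, its relative effect grows as the cell current falls: with illustrative values of 1.0 µA measured and 0.05 µA background, ignoring the background would change the result by 5%. The result is also directly proportional to the pump temperature and to `t100_s`, so relative errors in either propagate one-for-one.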
Ozonesondes record a surface measurement at release time, but this is subject to negative errors if the sonde is not allowed to run for a few minutes after removal of the ozone filter. Also, the sonde is released about 1 metre above the ground, and strong gradients often exist in the first few metres above the ground (Galbally, 1968). Aliasing from diurnal cycles is also an issue in the planetary boundary layer if release times are variable (Tarasick et al., 2005; Thompson et al., 2014). Ozonesonde profiles are nevertheless very useful for detecting the transition from the boundary layer to the free troposphere.
Numerous field and laboratory intercomparisons of ozonesondes have attempted to characterize ozonesonde biases and uncertainties. These show considerable variability (see Text S-10 and the Supplementary figures). This is partly because of differences in preparation, and partly because in a number of studies no UV reference photometer was available, so that the results are relative to an average profile that differs from study to study. A clearer picture emerges if only the comparisons with a UV reference (the modern standard) are retained. This is done for ECC sondes in Figures 9 and 10. Several adjustments for consistency are described in the figure caption.
Figure 9: Bias (%) of ECC sondes in the lower troposphere relative to a UV reference, 1960-2015 (plotted results from Torres & Bandy (1978), Barnes et al. (1985), Hilsenrath et al. (1986), Reid et al. (1996), Smit and Sträter (2000), Deshler et al. (2008) and Johnson et al. (2008); weighted mean = 1.0 ± 4.4%). The 1985 point has been adjusted to account for a positive bias due to the use of 1.5% KI solution (6.2 ± 1%, from Barnes et al. (1985) and Peterson (1978)). The 1980 point and that for Torres & Bandy (1978) were not adjusted for this bias, as it apparently was not observed before the early 1980s (as discussed by Barnes et al. (1985)). Instead, following JCGM 100:2008, an additional uncertainty of 6.2/√3 = 3.6% has been added to these points. The data from Hilsenrath et al. (1986) are for the NOAA (1% KI) sondes. Data from Smit and Kley (1998) are averaged over all ECC results, and those from Smit and Sträter (2004) are an average of results for the two types using the recommended solutions (1% for SciPump and 0.5% for EnSci). Sondes were operated according to the standard operating procedures of each participating agency. Data after about 1995, with some exceptions, have not been normalized to a total ozone measurement. DOI: https://doi.org/10.1525/elementa.376.f9
Despite the fact that there have been several models of ECC sondes, the running averages show no significant trend in bias over the 50-year period. This result is independent of normalization to coincident total ozone measurements, as normalization factors at most long-term sites show no trend. A mean offset, weighted by the standard error of each point, is therefore computed (shown as a red dashed line). Using the ECC sondes as a transfer standard, this value is then used to refer the bias results for other sonde types to the UV reference, for intercomparisons without a UV reference photometer (Figures 11-14). The calculated uncertainty of this weighted mean is also included; we note that it may be an underestimate, as sample sizes in intercomparisons are small, and experimental conditions may not reflect the full range of conditions found in long-term field operations. The result (Figures 11 and 12) indicates that the tropospheric response of the BM sonde increased by about 20% between the 1970s and the 1990s. Improved preparation procedures for BM sondes (Attmannspacher and Dütsch, 1978; Claude et al., 1987; DeBacker, 1999; Favaro et al., 2002; Tarasick et al., 2002) may have contributed to this, and there may have been minor changes in sonde manufacture over the long period of record (World Climate Research Programme, 1998). This is consistent with the conclusion of Schnadt Poberaj et al. (2009), who compared European BM sondes with UV measurements from the GASP and MOZAIC flights (Sections 4.7 and 4.8, below).
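The two uncertainty steps used for the ECC bias points (the rectangular-distribution term from JCGM 100:2008 and the weighted mean offset) can be illustrated with a minimal Python sketch. It assumes inverse-variance weighting (weights 1/SE²), which is one common reading of a mean "weighted by the standard error"; the bias values are hypothetical placeholders, not the data plotted in Figure 9, and the ±4.4% quoted there may have been computed differently from the standard error returned here.

```python
import math

def rectangular_uncertainty(half_width):
    """Standard uncertainty for a value known only to lie within +/- half_width
    (rectangular distribution, JCGM 100:2008): half_width / sqrt(3)."""
    return half_width / math.sqrt(3.0)

def add_in_quadrature(*terms):
    """Combine independent standard uncertainties."""
    return math.sqrt(sum(t * t for t in terms))

def weighted_mean(values, std_errors):
    """Inverse-variance weighted mean (weights 1/se**2) and its standard error."""
    weights = [1.0 / (se * se) for se in std_errors]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, 1.0 / math.sqrt(total)

# Extra uncertainty for early points not corrected for the possible +6.2% KI bias.
u_ki = rectangular_uncertainty(6.2)  # about 3.6 %

# Hypothetical bias results (%) and standard errors (%), NOT the Figure 9 data.
biases = [3.0, -1.0, 1.5, 0.5]
errors = [2.0, 2.0, 1.0, 1.5]
errors[0] = add_in_quadrature(errors[0], u_ki)  # treat the first point as an early one
mean_bias, se_mean = weighted_mean(biases, errors)
print(f"weighted mean bias = {mean_bias:.1f} +/- {se_mean:.1f} %")
```

Inflating the uncertainty of the uncorrected early points, rather than adjusting their values, lowers their weight in the mean without assuming the bias was actually present.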
Over the same period, the KC sondes also show a modest increase in tropospheric response (Figures 13 and 14).
In addition, although almost all current ozonesonde data are from ECC sondes, the transition to ECC sondes has been gradual (Figures 15 and 16). Since the other sonde types historically show negative biases in the troposphere relative to the ECC sonde (see also Tables S-7 and S-8), this transition may itself introduce an apparent trend in free tropospheric ozone derived from ozonesondes if data are combined without adjustment. BM sondes were used extensively in the 1970s, and in Europe in the 1980s, and are currently in use at only one site (Hohenpeissenberg). Large amounts of data prior to the early 1990s exist from the Brewer-GDR, Indian, and Japanese KC sondes.
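How a changing sonde mix can masquerade as a trend can be shown with a minimal Python sketch. It assumes constant true ozone (50 nmol mol⁻¹), a hypothetical linear phase-in of ECC sondes from 1970 to 1999, and a hypothetical -10% tropospheric bias for the non-ECC sondes; none of these numbers are taken from the records discussed here.

```python
def network_mean(true_ozone, ecc_fraction, other_bias):
    """Mean of soundings when a fraction ecc_fraction are unbiased ECC sondes and
    the rest read low by other_bias (e.g. -0.10 for -10 %)."""
    return true_ozone * (ecc_fraction + (1.0 - ecc_fraction) * (1.0 + other_bias))

def ols_slope(x, y):
    """Ordinary least-squares slope of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

years = list(range(1970, 2000))
# Hypothetical linear phase-in of ECC sondes, from 0 % in 1970 to 100 % in 1999.
ecc_share = [(y - 1970) / 29 for y in years]
# True free-tropospheric ozone is held constant at 50 nmol mol-1.
measured = [network_mean(50.0, f, -0.10) for f in ecc_share]
spurious_trend = ols_slope(years, measured)  # nmol mol-1 per year
```

The combined record rises from 45 to 50 nmol mol⁻¹ although true ozone is unchanged, so the fitted slope (5/29, about 0.17 nmol mol⁻¹ per year) is entirely an artifact of the changing sonde mix.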
Currently, station records are being re-evaluated for artifacts introduced by changes of sonde type, manufacturer, sensing-solution strength, or preparation procedure, under the Ozonesonde Data Quality Assessment activity. This has resulted in changes to the Canadian record of 2-5% in the 1980-2015 period, and of as much as 20% to the pre-1980 BM data. Tropical stations in the SHADOZ network show changes of up to 8%. A large part of this is due to the fact that after about 1995, stations using the new EnSci sonde with 1% KI solution show a 4-8% positive bias in the lower troposphere and 2-6% in the upper troposphere (Table S-…).

Ozonesonde derived data products
Ozonesonde data are global and long-term, but sparse in space and time. Several products attempt to remedy this deficiency by combining ozonesonde data with satellite ozone data or meteorological information. The ML climatology (McPeters and Labow, 2012; McPeters et al., 2007) uses Microwave Limb Sounder (MLS) data to produce a zonally averaged climatology in 10° latitude bands from 0 to 65 km. BSVertOzone (Bodeker et al., 2013) uses several satellite data sources and a sophisticated regression-interpolation scheme to produce a zonally averaged climatology in 5° latitude bands from 0 to 70 km. The Trajectory-mapped Ozonesonde dataset for the Stratosphere and Troposphere (TOST), used elsewhere in this paper and in TOAR-Climate, uses the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model (Draxler and Hess, 1998) and meteorological fields from National Centers for Environmental Prediction (NCEP) reanalyses to fill the gaps between ozonesonde stations, by extending each ozone record forward and backward along its trajectory path for 4 days. Ozone production and loss over this 4-day period are assumed to be negligible. Ozone values along these trajectory paths are binned into a three-dimensional grid of 5° × 5° × 1 km (latitude, longitude, and altitude), from sea level or ground level up to 26 km (Figure 17). Tropospheric column ozone (TCO) is calculated from the ozone mole fractions below the tropopause, which is located using the WMO 2 K/km lapse-rate definition applied to the NCEP reanalysis data.
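The TCO step can be sketched for a single profile. The WMO definition places the tropopause at the lowest level where the lapse rate falls to 2 K/km or less, provided the average lapse rate between that level and all higher levels within 2 km does not exceed 2 K/km. This minimal Python sketch assumes temperature, pressure and ozone given on the same levels, trapezoidal integration in pressure, and a hypothetical 5 km lower search limit to skip surface inversions; it is an illustration, not the TOST code.

```python
import numpy as np

N_A = 6.02214076e23  # molecules per mole
G = 9.80665          # gravitational acceleration, m s-2
M_AIR = 0.0289644    # molar mass of dry air, kg mol-1
DU = 2.6867e20       # molecules m-2 in one Dobson unit

def wmo_tropopause_index(z_km, t_k, min_z_km=5.0, max_lapse=2.0, depth_km=2.0):
    """Index of the lowest level (at or above min_z_km) where the lapse rate is
    <= max_lapse K/km and the mean lapse rate to every higher level within
    depth_km also stays <= max_lapse. Returns None if no level qualifies."""
    z = np.asarray(z_km, float)
    t = np.asarray(t_k, float)
    for i in range(len(z) - 1):
        if z[i] < min_z_km:
            continue  # skip boundary-layer inversions (assumed cutoff)
        if (t[i] - t[i + 1]) / (z[i + 1] - z[i]) > max_lapse:
            continue
        above = (z > z[i]) & (z <= z[i] + depth_km)
        mean_lapse = (t[i] - t[above]) / (z[above] - z[i])
        if np.all(mean_lapse <= max_lapse):
            return i
    return None

def column_du(p_hpa, x_nmol):
    """Column (DU) of a gas with mole fraction x (nmol mol-1) over pressure levels p."""
    p_pa = np.asarray(p_hpa, float) * 100.0
    x = np.asarray(x_nmol, float) * 1e-9
    # Trapezoidal integral of x dp; abs() because p decreases with height.
    integral = abs(np.sum(0.5 * (x[1:] + x[:-1]) * np.diff(p_pa)))
    return integral * N_A / (G * M_AIR) / DU

def tropospheric_column_du(z_km, p_hpa, t_k, x_nmol):
    """Integrate ozone from the lowest level up to the lapse-rate tropopause."""
    i = wmo_tropopause_index(z_km, t_k)
    if i is None:
        raise ValueError("no lapse-rate tropopause found")
    return column_du(p_hpa[: i + 1], x_nmol[: i + 1])
```

For a uniform 50 nmol mol⁻¹ between 1000 and 200 hPa, `column_du([1000, 200], [50, 50])` gives about 31.6 DU, i.e. roughly 0.79 DU per (µmol mol⁻¹ × hPa).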
TOST has been evaluated against individual ozonesondes excluded from the mapping, by backward and forward trajectory comparisons, and by comparisons with aircraft profiles and surface monitoring data. Differences are typically about 10% or less, but there are larger biases in the UTLS, in the boundary layer, and in areas where ozonesonde measurements are sparse. The accuracy of the TOST product depends largely on the accuracy of HYSPLIT and of the meteorological data on which it is based. Data products are available at http://woudc.org/data/products/#related-ozone-products.

Tropospheric ozone lidar
The UV DIAL (DIfferential Absorption Lidar) technique is well established for tropospheric ozone monitoring, from as low as 100 m up to the tropopause, either from ground-based sites (Figure 8; Table S-9) or from aircraft (Browell et al., 1983; Ancellet and Ravetta, 2003). Existing lidar instruments differ in the choice and number of wavelength pairs used for the ozone retrieval, the size of the telescope, and the laser power. Vertical resolution varies with wavelength choice, signal strength and integration time, but can be as fine as 50-100 m in the lower troposphere. Temporal resolution can be very high (1 min). Temporal coverage is generally limited by the need for human operators, although the recent deployment of an autonomous system removes this handicap. DIAL errors may arise from electronic interference in the detected signals, unaccounted effects of particulate backscatter and extinction, errors in ozone absorption cross sections, insufficient knowledge of the near-range geometrical overlap function, and errors in the cross sections or a priori concentrations of interfering atmospheric molecules (Eisele and Trickl, 2005; Leblanc et al., 2016b). Errors due to beam misalignment can be avoided by comparing the ozone profiles obtained for different wavelength pairs. Relative uncertainties can be as low as 2-5% in the lower and middle troposphere, for averaging times of ~1 min, if the "on" (i.e. more strongly absorbed) wavelength selected is below 280 nm (Trickl, 2019). For longer "on" wavelengths, used for higher altitudes, the uncertainties grow, and significantly longer signal averaging or a reduction of the vertical resolution must be applied.
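At its core, DIAL uses the standard relation n(z) = [d/dz ln(P_off(z)/P_on(z))] / (2Δσ), where P is the range-resolved backscatter signal and Δσ = σ_on - σ_off is the differential ozone absorption cross section. The minimal Python sketch below applies it to synthetic signals. It assumes identical backscatter and extinction at both wavelengths, so the differential aerosol and interfering-gas terms listed above as error sources vanish. The cross sections and extinction are hypothetical illustrative values.

```python
import numpy as np

def dial_ozone(z_m, p_on, p_off, delta_sigma_m2):
    """Ozone number density (m-3) from the basic DIAL equation:
    n = d/dz[ln(P_off / P_on)] / (2 * delta_sigma)."""
    log_ratio = np.log(np.asarray(p_off, float) / np.asarray(p_on, float))
    return np.gradient(log_ratio, np.asarray(z_m, float)) / (2.0 * delta_sigma_m2)

# Synthetic profile: constant background plus an ozone layer near 3 km (illustrative).
z = np.arange(500.0, 5000.0, 15.0)
n_true = 1e18 * (1.0 + 0.5 * np.exp(-((z - 3000.0) / 500.0) ** 2))
# Ozone column between the first range gate and z (trapezoidal).
col = np.concatenate(([0.0], np.cumsum(0.5 * (n_true[1:] + n_true[:-1]) * np.diff(z))))
sigma_on, sigma_off = 6e-23, 1e-23  # hypothetical cross sections, m2
alpha = 1e-4                        # extinction assumed equal at both wavelengths, m-1
p_on = np.exp(-2.0 * (sigma_on * col + alpha * (z - z[0]))) / z**2
p_off = np.exp(-2.0 * (sigma_off * col + alpha * (z - z[0]))) / z**2
n_ret = dial_ozone(z, p_on, p_off, sigma_on - sigma_off)
```

In real data the derivative amplifies signal noise, which is why longer averaging or coarser vertical resolution is needed where the signal is weak.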
The accuracy of tropospheric ozone measurements using lidar systems has been analysed using three different approaches: 1) a system uncertainty analysis based on estimated uncertainties from component sources, 2) simultaneous differences with other techniques (ozonesonde, UV instruments on aircraft and mountain stations) during intercomparison campaigns, 3) differences between seasonal averages from two instruments over a long time period.
The system uncertainty analysis considers random uncertainty (altitude, resolution and ozone-dependent signal-to-noise ratio), as well as the sources of systematic uncertainty noted above. Details may be found in Leblanc et al. (2016a, b). Table 9 summarizes results of the small number of published intercomparison campaigns. The weighted mean of the quoted lidar bias in Table 9 is approximately 0% ±2% in the lower troposphere, and -3% ±3% in the upper troposphere. The bias is also small below 3 km, but uncertain, as lidar systematic errors increase at low altitude (high aerosol load, lidar misalignment).
Several sites (Boulder, Observatoire de Haute Provence, La Réunion (France), Garmisch-Partenkirchen, Huntsville) currently have DIAL running in parallel with other techniques (Figures 8 and 16), and so further comparisons are possible.
One example of lidar-ECC comparison results is shown in Figure 18 for a set of 13 co-located and simultaneous measurements with the Table Mountain Facility (TMF) lidar during the SCOOP campaign. Average agreement is excellent: throughout the profile it is within the theoretical total uncertainty, which for the lidar includes detection noise and the systematic errors described above, and for the ozonesonde profiles is assumed to be 5%. Gaudel et al. (2015) examined differences between 5-year seasonal ozone averages, using regular, not necessarily simultaneous, lidar and ECC profiles. Over a 20-year period, the ECC sondes averaged about 1 nmol mol⁻¹ higher in the free troposphere above 4 km. Seasonal differences generally fluctuated within ±5 nmol mol⁻¹, with a maximum of 11 nmol mol⁻¹ at 6-8 km. This was shown to be due to significant transport differences, correlated with the requirement for clear-sky conditions for lidar measurements, which induces a meteorological bias in lidar sampling.
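The agreement test used for the coincident profiles can be sketched in Python. It assumes profiles already interpolated to common levels, and that the lidar and sonde uncertainties (the 5% sonde value quoted above) are independent and combine in quadrature. The function name and the example numbers are hypothetical.

```python
import numpy as np

def compare_profiles(lidar, sondes, lidar_unc_pct, sonde_unc_pct=5.0):
    """Mean lidar-minus-sonde relative difference (%) at each level over a set of
    coincident profiles, and whether it lies within the combined uncertainty."""
    lidar = np.asarray(lidar, float)    # shape (n_profiles, n_levels)
    sondes = np.asarray(sondes, float)  # same shape, same levels
    rel_diff = 100.0 * (lidar - sondes) / sondes
    mean_diff = rel_diff.mean(axis=0)
    combined = np.sqrt(np.asarray(lidar_unc_pct, float) ** 2 + sonde_unc_pct ** 2)
    return mean_diff, np.abs(mean_diff) <= combined

# Two hypothetical coincidences on two levels (nmol mol-1), lidar uncertainty 3 %.
mean_diff, within = compare_profiles([[51.0, 62.0], [49.0, 58.0]],
                                     [[50.0, 60.0], [50.0, 60.0]], [3.0, 3.0])
```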
The three different analyses yield results consistent with a precision better than 10% and a slight negative bias of 0-3% relative to ECC sondes. When adjusted for the mean ECC biases found in Section 4.3, this corresponds to a mean bias of +1% ±8% in both the lower and upper troposphere. These results depend on good aerosol and cloud screening based on analysis of the backscatter lidar signal; under these conditions the lidar accuracy found in the published intercomparison campaigns is better than that calculated from the system uncertainty analysis.

Ground-based Fourier Transform Infra-Red (FTIR)
Global FTIR observations are coordinated by the Infrared Working Group of NDACC (Network for the Detection of Atmospheric Composition Change; https://www2.acom.ucar.edu/irwg, www.ndacc.org). Calibration and retrievals for ozone are standardized across the network (Vigouroux et al., 2015). The retrieval follows Optimal Estimation (OE) theory (Rodgers, 2000) and requires a priori data for the atmospheric state and other forward model parameters. A priori atmospheric composition profiles are constant for all retrievals in the time series for a given site. They are derived from climatological runs of the WACCM V4 (Whole Atmosphere Community Climate Model). Daily a priori temperature and pressure profiles are from NCEP. The instruments are solar viewing, so observations are biased to clear-sky daytime conditions, with seasonal limitations at high latitudes. Typically, several observations are taken per day if there is a clear line of sight to the sun.
Table 9: Lidar bias and precision derived from intercomparison campaigns between lidar, ozonesonde, aircraft and mountain-site simultaneous measurements. Bias and Precision represent means and standard deviations of differences from the ECC sonde profiles. Theoretical accuracy is the expected lidar accuracy from the system uncertainty analysis.
Typical degrees of freedom for signal (DOFS) are 4-5 (see Text S-11). Averaging kernels, which represent the contribution of the measurement to the retrieval, are shown in Figure S-11a (see also Figure 1 of Vigouroux et al. (2015)). Table S-10 lists stations that obtain at least 0.8 DOFS for the ground-8 km layer; the remainder of the information in this layer comes from the a priori best estimate. Those stations with sufficiently long and dense time series are used in TOAR-Climate, where most have DOFS (to 8 km) > 0.9. Only a small portion of this first summed partial kernel is sensitive to the lower stratosphere (Figure S-11b). Further details about the information content of ozone retrievals can be found in Vigouroux et al. (2015), and in Wespes et al. (2012), who focus on tropospheric ozone.
Figure 19 shows time series of the partial columns of ozone measured by the FTIR at the Izaña Atmospheric Observatory (IZO), together with columns derived from coincident ozonesonde profiles. The sondes were launched weekly at Santa Cruz de Tenerife (35 km northeast of IZO) from 1999 to 2006, and at Guimar station (15 km east of IZO) since October 2006. The FTIR data are averaged in a temporal window of 6 hours around the ECC launch time (12 UTC). Table 10 gives the ozone uncertainty budgets estimated by the OE approach and from ECC sonde intercomparisons at IZO. Uncertainties are shown for the tropospheric columns (from the station altitude, 2.37 km for IZO). The total theoretical random uncertainty is estimated by combining the random parameter uncertainty and the smoothing uncertainty (the latter associated with the limited vertical sensitivity of the FTIR technique). Significant contributors to the random parameter uncertainty are uncertainties in the temperature profiles, the instrumental line shape, and the measurement noise, while the systematic uncertainty is dominated by spectroscopic uncertainties (García et al., 2012). The overall theoretical uncertainty of ~11% in tropospheric ozone partial columns is dominated by the smoothing uncertainty.
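In the OE framework (Rodgers, 2000), uncertainty components are usually combined as covariance matrices: the smoothing error covariance is (A - I) S_a (A - I)ᵀ, where A is the averaging kernel and S_a should describe the natural variability of the profile, and the uncertainty of a partial column h·x is √(hᵀ S h). The minimal Python sketch below uses a hypothetical 3-layer retrieval and assumed covariances; it does not reproduce the IZO budget in Table 10.

```python
import numpy as np

def smoothing_error_cov(A, S_a):
    """Smoothing error covariance (A - I) S_a (A - I)^T (Rodgers, 2000)."""
    A = np.asarray(A, float)
    D = A - np.eye(A.shape[0])
    return D @ np.asarray(S_a, float) @ D.T

def partial_column_uncertainty(S, h):
    """Standard uncertainty of the partial column h . x, given covariance S of x."""
    h = np.asarray(h, float)
    return float(np.sqrt(h @ np.asarray(S, float) @ h))

# Hypothetical 3-layer example, state x = layer partial columns (DU).
A = np.array([[0.6, 0.2, 0.0],
              [0.2, 0.5, 0.1],
              [0.0, 0.1, 0.7]])
S_a = np.diag([4.0, 3.0, 2.0])      # assumed natural variability (DU^2)
S_param = np.diag([0.5, 0.5, 0.3])  # assumed random-parameter error (DU^2)
S_total = smoothing_error_cov(A, S_a) + S_param  # independent errors: covariances add
h = np.array([1.0, 1.0, 0.0])       # partial column = sum of the two lowest layers
u_col = partial_column_uncertainty(S_total, h)
```

Because the off-diagonal covariance terms enter through hᵀ S h, layer uncertainties generally do not add simply when forming a partial column.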
Smoothing uncertainty is also important in sonde-FTIR comparisons: the scatter (one standard deviation) of the relative differences is reduced from ~9 % to ~7 % when comparing ECC sonde profiles smoothed with the FTIR averaging kernels.
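Smoothing a sonde profile with the FTIR averaging kernels simulates what the FTIR would retrieve if the sonde profile were the true state: x_s = x_a + A(x_sonde - x_a). The minimal Python sketch below assumes the sonde profile has already been regridded to the retrieval layers and that A applies in the same units (some retrievals are performed in log space, in which case the same formula is applied to ln x). Partial DOFS for a layer range are taken as the sum of the corresponding diagonal elements of A.

```python
import numpy as np

def smooth_with_kernel(x_true, x_apriori, A):
    """Profile the FTIR would retrieve for true profile x_true: x_a + A (x - x_a)."""
    x_true = np.asarray(x_true, float)
    x_a = np.asarray(x_apriori, float)
    return x_a + np.asarray(A, float) @ (x_true - x_a)

def dofs(A, layers=None):
    """Degrees of freedom for signal: trace of A, or the partial sum of its
    diagonal over the given layer indices (e.g. ground to 8 km)."""
    diag = np.diag(np.asarray(A, float))
    if layers is not None:
        diag = diag[list(layers)]
    return float(diag.sum())
```

Comparing the FTIR retrieval with `smooth_with_kernel(sonde, a_priori, A)` rather than with the raw sonde profile removes the smoothing term from the comparison, which is why the scatter of the relative differences decreases.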
The mean bias of 4% between ozone partial columns from FTIR and ECC sondes (up to ~6% with alternative retrieval strategies; García et al., 2012) is in excellent agreement with the positive bias of 1-5% (±5%) for ECC sondes found from UV-referenced sonde intercomparison studies (Section 4.3), and with the positive bias of 5-8% found from MOZAIC/IAGOS comparisons (Section 4.8).
Recent laboratory work recommends simultaneous measurements of ozone absorption coefficients in the IR and the UV (Orphal et al., 2016), which will lend further confidence in FTIR methods by tying them directly to the UV standard.

Ozone measurements from aircraft
In addition to the early measurements described in Section 4.1, observations were made with a KI method by Fabian and Pruchniewicz (1977) on 34 regular airline flights. The first major program of ozone measurements from regular passenger aircraft began after high ozone levels were observed inside the cabins of aircraft flying over the US (V. Mohnen, personal communication). Ozone levels over 350 nmol mol⁻¹ were observed on some flights over the US in 1973, and as high as 600 nmol mol⁻¹ on polar flights (Bischof, 1973). The problem was exacerbated in 1975 when the long-range Boeing 747 SP was introduced, as this aircraft flew higher and farther north, and so frequently well into the lower stratosphere. Ozone levels over 600 nmol mol⁻¹ were observed frequently, and passengers and crew complained of severe headaches and nosebleeds. This became a crisis for airlines, which was initially dealt with by pilot advisories (FAA, 1977) and by flight planning to avoid areas of expected high ozone (based on the measurements available in the 1970s). New FAA regulations (AC_120-38; FAA, 1980), which restrict maximum cabin ozone levels to 250 nmol mol⁻¹ (peak) and 100 nmol mol⁻¹ (3-hour average), were developed and are still in effect. Most passenger jet aircraft now have ozone destruction filters on the cabin air intakes, but not all, as avoidance is still an option. Avoidance is not always successful, however, as even these high limits are sometimes exceeded (Bekö et al., 2015). This urgent issue, combined with concern about adverse effects of aircraft exhaust emissions on the atmosphere, led to a collaboration between NASA and several US airlines to operate the Global Atmospheric Sampling Program (GASP). From March 1975 to June 1979, GASP provided the first representative ozone measurements from regular aircraft (Falconer and Holdeman, 1976; Nastrom, 1977). A commercially available 253.7 nm UV photometer was modified for automated operation.
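The two cabin limits quoted above (250 nmol mol⁻¹ peak, 100 nmol mol⁻¹ 3-hour average) can be expressed as a simple check on a recorded time series. This is a minimal Python sketch, assuming evenly spaced samples and a running 3-hour mean over complete windows. For records shorter than 3 hours it averages the whole record, which is an assumption of this sketch. The function name and the handling of short records are illustrative and are not taken from AC_120-38.

```python
def cabin_ozone_compliant(samples, interval_min, peak_limit=250.0,
                          avg_limit=100.0, window_h=3.0):
    """True if an evenly spaced cabin ozone record (nmol mol-1) stays within the
    peak limit and its running window_h-hour mean stays within avg_limit."""
    if not samples:
        raise ValueError("empty record")
    n = max(1, round(window_h * 60.0 / interval_min))
    n = min(n, len(samples))  # short record: use its overall mean (assumption)
    window_sum = sum(samples[:n])
    worst_avg = window_sum / n
    # Slide the window one sample at a time, tracking the highest mean.
    for i in range(n, len(samples)):
        window_sum += samples[i] - samples[i - n]
        worst_avg = max(worst_avg, window_sum / n)
    return max(samples) <= peak_limit and worst_avg <= avg_limit
```

For example, a steady 120 nmol mol⁻¹ over four hours passes the peak limit but fails the 3-hour average limit.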
The air sample from the inlet for gas-phase measurements was pressurized well above cabin pressure by a PTFE (Teflon)-coated diaphragm pump. Losses in the inlet and pump were as large as 16% in 1975-76, and were reduced to <6% in 1977. Overall uncertainty is estimated at ±8.4% in 1975-76 and ±3.3% from 1977, with a known bias of +9% from the calibration via NBKI (Schnadt Poberaj et al., 2007). Four passenger B-747 aircraft made a total of 6149 flights, with measurements at altitudes between 6 and 13.7 km. The program mostly covered the North Atlantic and Pacific Oceans and North America, but also, to a lesser extent, Europe (Schnadt Poberaj et al., 2009), with a few flights to India, Singapore, Australia, New Zealand, and Brazil.
NOXAR (Nitrogen Oxides and Ozone along Air Routes) provided measurements onboard a B-747 flying from Zurich (Switzerland) to Atlanta, Boston, New York and Chicago, and also to Beijing, Bombay and Hong Kong. The system was operated from May 5, 1995, until May 13, 1996, and from August 12 until November 23, 1997 (104 flights). The analyzer was a modified Environics S-300 253.7 nm UV absorption instrument. The air sampling system consisted of an aerofoil-sectioned aluminum boom with a PTFE core, located just forward of the aircraft's rearmost starboard door. The boom extended 23 cm into the slipstream from the aircraft skin. Ambient air at ~200 hPa was compressed to aircraft cabin pressure (~800 hPa) using a PTFE-coated diaphragm pump. Losses in the inlet and pump were as large as 7%, adding ±3.5% to the overall uncertainty. Precision is estimated at ±0.5 nmol mol⁻¹, and overall uncertainty at ±5 nmol mol⁻¹ ±6% (Dias-Lalcaca et al., 1998; Brunner et al., 2001).
More than three decades of dedicated research aircraft measurements, mostly with UV absorption instruments and also with some O 3 -NO chemiluminescence measurements, are also archived (e.g. LARC, 2019; BADC, 2019; ESRL, 2019; NCAR, 2019). Typically these flights were directed at observing particular atmospheric phenomena (e.g. biomass burning plumes), and so sampling may be biased accordingly.

MOZAIC/IAGOS
In-service Aircraft for a Global Observing System (IAGOS) and its predecessor Measurement of Ozone and water vapor by Airbus in-service airCraft (MOZAIC) have been making automatic and regular measurements of O3, water vapour and standard meteorological parameters onboard long-range commercial Airbus A330/A340 aircraft since August 1994.
Ozone measurements are made by dual-beam UV absorption monitors with a response time of 4 s, a detection limit of 2 nmol mol⁻¹, and an uncertainty of ±2 nmol mol⁻¹ ±2%, including a 1% uncertainty in the reference instrument (Nédélec et al., 2015). As in GASP and NOXAR, the sampled air is compressed by a Teflon-coated diaphragm pump before entering the UV photometer, but losses in the inlet and pump are estimated at less than 1%, based on laboratory and ground tests. Quality assurance procedures have not changed since the beginning of the record in 1994. No calibration drift has been observed, nor any inconsistency between MOZAIC and IAGOS instruments (Blot et al., in preparation). Ozone monitors are calibrated annually against a reference analyser at the Bureau International des Poids et Mesures (BIPM), and are also checked every 2 hours against an in-flight ozone calibration source. Because of its known calibration history, MOZAIC can be considered a reference dataset (e.g. Thouret et al., 1998, 2006; Schnadt Poberaj et al., 2009; Logan et al., 2012). Previous comparisons of MOZAIC/IAGOS data with ozonesondes show negative biases of a few per cent (sonde values higher), with larger differences in the earlier part of the MOZAIC record (Staufer et al., 2013, 2014). Recent results also show small (6% or less) negative biases against ECC sondes (Zbinden et al., 2013; Tanimoto et al., 2015). Despite the large number of profiles in each dataset, coincidences between aircraft profiles and sonde launches are few. However, a comparison (Figure 20) of trajectory-mapped averages of ozonesonde and MOZAIC/IAGOS profile data (see the description of TOST, above) indicates that over 1994-2012 sonde measurements are about 5 ± 1% higher in the lower troposphere and 8 ± 1% higher in the upper troposphere, consistent with the 1 ± 5% and 5 ± 5% average biases found for ECC sondes in Section 4.3 from UV-referenced sonde intercomparison studies.
In addition, some of the routine soundings during this period will be from EnSci sondes used with 1% KI solution, which may account for the additional bias (Section 4.3).
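The statement that the sonde-minus-aircraft offsets are consistent with the UV-referenced ECC biases can be checked by expressing their difference in units of the combined standard uncertainty. This minimal Python sketch assumes the two estimates are independent and that MOZAIC/IAGOS itself is unbiased relative to the UV standard. The threshold of about 2 mentioned in the comment is a common rule of thumb, not a criterion from the source.

```python
import math

def normalized_difference(a, u_a, b, u_b):
    """|a - b| in units of the combined standard uncertainty of two independent
    estimates; values below about 2 are usually read as agreement."""
    return abs(a - b) / math.hypot(u_a, u_b)

# Sonde-minus-aircraft offsets (Figure 20, %) vs UV-referenced ECC biases (Section 4.3, %).
z_lower = normalized_difference(5.0, 1.0, 1.0, 5.0)  # about 0.78
z_upper = normalized_difference(8.0, 1.0, 5.0, 5.0)  # about 0.59
```

Both values are well below 1, so the differences lie within one combined standard uncertainty. Most of that uncertainty comes from the ±5% spread of the intercomparison results.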
Although airports, unlike ozonesonde sites, are typically urban, MOZAIC/IAGOS ozone data do not appear to be strongly affected by local boundary layer chemistry (Petetin et al., 2016), as Figure 20 also indicates. The IAGOS database (http://www.iagos.org) currently contains data from more than 100,000 vertical profiles of tropospheric ozone, measured during takeoff and landing at 148 airports around the world since August 1994. The data sampled from ascents and descents at these airports are unevenly distributed in both space and time, because the frequency with which aircraft taking part in MOZAIC/IAGOS visit each airport varies with commercial airlines' operating constraints. In particular, data are sparser in the southern hemisphere (Figure 21).
A year-by-year comparison (Figure S-10) shows considerable variability (almost certainly due to sampling differences) but no overall trend if the first two years, 1994-95, are excluded. Excluding 1994-95 also reduces the apparent high bias of the sondes to ~4% in the lower troposphere and ~7% in the upper troposphere. These larger differences in the first two years of the MOZAIC/IAGOS record have been noted previously (Logan et al., 2012; Staufer et al., 2014) and may also be due to sampling differences.

Tropospheric ozone satellite and residual measurements
The measurement of tropospheric ozone from space is a challenge, because of the large stratospheric ozone burden that satellite instruments must look through. Typical variations in the stratosphere at mid-latitudes (more than 10%) are larger than the entire amount of ozone in the troposphere. A number of techniques have been developed to derive information about tropospheric ozone from nadir-pointing spectrometer data. The tropospheric ozone residual (TOR) technique (Fishman et al., 1990, 1996, 2003; Ziemke et al., 1998; Chandra et al., 2003) uses height-resolved ozone information from the Solar Backscatter Ultraviolet (SBUV/2), Stratospheric Aerosol and Gas Experiment (SAGE), Halogen Occultation Experiment (HALOE) or Microwave Limb Sounder (MLS) instruments to subtract stratospheric ozone from the total column ozone measured by the Total Ozone Mapping Spectrometer (TOMS) or, more recently, the Ozone Monitoring Instrument (OMI) (Ziemke et al., 2006; Jing et al., 2006). An extension of this technique uses forward trajectory model calculations or potential vorticity mapping with MLS data (Schoeberl et al., 2007; Yang et al., 2007) to produce stratospheric column estimates with higher horizontal resolution, suitable for producing a daily TOR product. Winds are from NASA's Modern-Era Retrospective analysis for Research and Applications (MERRA). Similarly, the Global Modeling and Assimilation Office (GMAO) assimilated ozone product is produced by using OMI and MLS retrievals as input to the global data assimilation system used to produce MERRA (Wargan et al., 2015). Tropospheric ozone has also been derived from TOMS data alone by assuming the longitudinal distribution of stratospheric ozone (Kim et al., 1996; Hudson and Thompson, 1998). Cloud differential methods exploit the fact that, particularly in the tropics, the tops of the highest clouds lie essentially at the tropopause, so that the tropospheric ozone column can be found from the difference in total ozone measured in adjacent cloudy and cloud-free pixels. The Convective Cloud Differential (CCD) method (Ziemke et al., 1998) and the Cloud-Clear Pair (CCP) method (Newchurch et al., 2003) use this approach with TOMS, and it has since been applied to OMI, GOME, GOME-2 and SCIAMACHY data (Ziemke et al., 2017; Valks et al., 2014; Leventidou et al., 2016). A second method, called "cloud slicing" (Ziemke et al., 2001, 2009, 2017), uses measurements of above-cloud column ozone together with cloud-top pressure data to derive ozone column amounts in the upper troposphere. Used in combination, these methods can estimate the lower tropospheric (1000-400 hPa) ozone column. Similarly, lower tropospheric ozone amounts near mountainous regions have been derived from TOMS data using a topographic contrast method (Jiang and Yung, 1996; Kim et al., 1996; Newchurch et al., 2001), and tropical tropospheric ozone has been derived from TOMS data alone, based on differences in ozone-column retrieval sensitivity as a function of scan angle.
Figure 20: Average (1994-2012) relative differences (%) of trajectory-mapped MOZAIC/IAGOS profile data minus trajectory-mapped ozonesonde data (Osman et al., paper in preparation). Variations with latitude in the left-hand plot are likely due to differences in sonde type and preparation, which may cause biases of several percent. When averaged over latitude, sonde measurements are about 5 ± 1% higher in the lower troposphere and 8 ± 1% higher in the upper troposphere, consistent with the average biases found from UV-referenced sonde intercomparison studies (Section 4.3). DOI: https://doi.org/10.1525/elementa.376.f20
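The residual and cloud-based methods described above reduce to simple arithmetic on column amounts, sketched below in Python. This is a simplified illustration, not an operational algorithm:

- TOR is total minus stratospheric column.
- The CCD sketch takes the mean clear-sky total column minus the mean column over deep convective clouds in the same region. The reflectivity thresholds are hypothetical.
- Cloud slicing converts the slope of above-cloud column versus cloud-top pressure into a mean mole fraction. The factor g·M_air·DU/N_A, about 1.27 × 10³ nmol mol⁻¹ per (DU/hPa), follows from hydrostatics.

Operational CCD and cloud-slicing algorithms add screening and corrections not shown here.

```python
import numpy as np

N_A, G, M_AIR, DU = 6.02214076e23, 9.80665, 0.0289644, 2.6867e20
# Mole fraction (nmol mol-1) per unit slope of column (DU) against pressure (hPa).
PPBV_PER_DU_PER_HPA = 1e9 * G * M_AIR * DU / (N_A * 100.0)

def tropospheric_residual(total_du, stratospheric_du):
    """TOR: total column minus an independently measured stratospheric column."""
    return np.asarray(total_du, float) - np.asarray(stratospheric_du, float)

def ccd_column(total_du, reflectivity, clear_max=0.2, cloudy_min=0.9):
    """Simplified convective cloud differential: clear-sky mean total column minus
    the mean column above deep convective clouds (thresholds are hypothetical)."""
    total = np.asarray(total_du, float)
    r = np.asarray(reflectivity, float)
    clear, cloudy = total[r < clear_max], total[r > cloudy_min]
    if clear.size == 0 or cloudy.size == 0:
        raise ValueError("need both clear and deep-convective pixels")
    return clear.mean() - cloudy.mean()

def cloud_slicing_mixing_ratio(cloud_top_hpa, above_cloud_du):
    """Mean mole fraction (nmol mol-1) between the cloud tops, from the slope of
    above-cloud column ozone against cloud-top pressure."""
    slope = np.polyfit(cloud_top_hpa, above_cloud_du, 1)[0]
    return PPBV_PER_DU_PER_HPA * slope
```

For example, hypothetical clear-sky columns of 260 and 262 DU beside deep-convective columns of 235 and 233 DU give a CCD tropospheric column of 27 DU.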
More recent satellite instruments with higher spectral resolution and broader spectral coverage retrieve tropospheric ozone directly from the backscattered radiances in the Hartley and Huggins (320-350 nm) bands. Information on the ozone vertical distribution is derived from the effective scattering depth at different wavelengths, and also from the temperature dependence of the ozone absorption cross sections in the Huggins bands, which separates ozone in the warmer troposphere from colder stratospheric ozone (Chance et al., 1997). GOME profiles have been retrieved at 20 or more layers from the surface to ~60 km using the OE technique (Munro et al., 1998; Hoogen et al., 1999; van der A et al., 2002; Liu et al., 2005), Tikhonov-Phillips (TP) regularization (Hasekamp and Landgraf, 2001), and neural networks (Del Frate et al., 2002; Müller et al., 2003). Layer values are not independent, however: the total degrees of freedom for signal (DOFS) are about 5-6.5, mostly in the stratosphere, and only about one independent piece of information is retrieved in the troposphere. These algorithms are reviewed and compared in Meijer et al. (2006). Similar methods have been applied to GOME-2 (Cai et al., 2012; van Peet et al., 2014; Miles et al., 2015; van Oss et al., 2015), SCIAMACHY (Sellitto et al., 2012a, b), OMI (Kroon et al., 2011; Liu et al., 2010a, b; Mielonen et al., 2015; Sellitto et al., 2011; Di Noia et al., 2013), and OMPS (Bak et al., 2017). Table S-11 compares GOME and OMI retrievals. The Tropospheric Emission Spectrometer (TES) is a Fourier transform infrared (FTIR) spectrometer that uses the 9.6 µm ozone absorption band to retrieve ozone concentrations. Its 0.1 cm⁻¹ spectral resolution is fine enough to resolve the pressure broadening of ozone absorption lines in the lower troposphere, providing the vertical information needed to separate tropospheric from stratospheric ozone. Like ground-based FTIR, it uses OE algorithms to retrieve vertical profiles of ozone concentration (Bowman et al., 2006).
It is mostly sensitive to tropospheric ozone between 700 and 300 hPa, owing to the higher thermal contrast there with respect to the surface.
The Infrared Atmospheric Sounding Interferometer (IASI) is also an FTIR spectrometer, operating in a similar spectral range (3.7 to 15.5 µm) but with 0.5 cm⁻¹ spectral resolution (see Table S-11 for details). Ozone profiles are retrieved with vertical resolution in the troposphere similar to that of TES. Table 11 summarizes the characteristics of these satellite data products. More detailed descriptions are found in the Supplemental Material (Text S-12).
To enhance sensitivity in the lower troposphere, IASI profiles may also be retrieved using an altitude-dependent TP regularization (Eremenko et al., 2008), which optimizes the retrieval constraints to maximize the DOFS in the lower troposphere (Dufour et al., 2010, 2012, 2015). These IASI (TP) retrievals are able to depict the horizontal distribution of ozone plumes within the lower troposphere (Figure 22), with a relative maximum of sensitivity typically between 3 and 4 km in cases of positive thermal contrast (i.e. over land during summer), but with limited sensitivity to near-surface ozone. Multispectral measurements can also enhance retrieval sensitivity to lower tropospheric ozone. Examples include UV radiance + polarization (Hasekamp and Landgraf, 2002), UV + IR (Worden et al., 2007b; Landgraf and Hasekamp, 2007), UV + IR + VIS (Natraj et al., 2011), and VIS + IR (Hache et al., 2014) measurements using OE or TP regularization techniques. Polarization measurements in the UV are more sensitive to ozone in the troposphere than to ozone in the stratosphere.
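A linear Tikhonov-Phillips retrieval can be sketched as x̂ = x_a + (KᵀS_ε⁻¹K + R)⁻¹KᵀS_ε⁻¹(y - Kx_a), with averaging kernel A = GK and DOFS = trace(A). The minimal Python sketch below assumes the following:

- a linearized forward model K;
- diagonal measurement noise S_ε;
- a first-difference smoothness constraint R = Lᵀ diag(w) L, whose hypothetical weights w may vary with altitude.

The actual constraint matrices of Eremenko et al. (2008) and Dufour et al. are not reproduced here. The sketch shows how strengthening the constraint lowers the DOFS.

```python
import numpy as np

def first_difference(n):
    """(n-1) x n first-difference operator L, with (L x)_i = x_{i+1} - x_i."""
    L = np.zeros((n - 1, n))
    for i in range(n - 1):
        L[i, i], L[i, i + 1] = -1.0, 1.0
    return L

def tp_retrieval(K, y, x_a, noise_sd, weights):
    """Linear Tikhonov-Phillips retrieval with an altitude-dependent smoothness
    constraint. Returns the estimate, the averaging kernel and the DOFS."""
    K = np.asarray(K, float)
    y = np.asarray(y, float)
    x_a = np.asarray(x_a, float)
    se_inv = np.diag(1.0 / np.asarray(noise_sd, float) ** 2)
    L = first_difference(K.shape[1])
    R = L.T @ np.diag(np.asarray(weights, float)) @ L  # one weight per layer interface
    gain = np.linalg.solve(K.T @ se_inv @ K + R, K.T @ se_inv)
    x_hat = x_a + gain @ (y - K @ x_a)
    A = gain @ K
    return x_hat, A, float(np.trace(A))

# Hypothetical, invertible 4-layer forward model and noise-free measurements.
n = 4
K = np.tril(np.ones((n, n)))
x_true = np.array([30.0, 40.0, 50.0, 60.0])
y = K @ x_true
for w in (0.0, 1.0, 100.0):
    _, _, d = tp_retrieval(K, y, np.zeros(n), np.ones(n), np.full(n - 1, w))
    print(f"constraint weight {w:6.1f}: DOFS = {d:.2f}")
```

With a first-difference constraint the mean of the profile is never penalized, so the DOFS cannot fall below 1. Making the weights small in the lowest layers is one way to preserve information where it is most wanted.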
IASI+GOME2: A multispectral satellite approach that simultaneously fits IASI measurements in the thermal IR and co-located UV spectra from GOME-2, at the IASI spatial resolution (12 × 25 km), has allowed the spaceborne observation of ozone plumes below 3 km, both over land and over ocean (Cuesta et al., 2013). Sensitivity in the surface-3 km layer peaks at 2 to 2.5 km asl over land (Figure 22), while the DOFS for this layer are 0.35, 40% more than for IASI (TP). Validation with ozonesondes in 2009-10 shows that ozone is retrieved in this layer with a mean bias of 4% and a precision of 17% when the sonde profiles are smoothed by the retrieval vertical sensitivity (9% mean bias and 27% precision for direct comparisons).
TES+OMI: A similar multispectral approach combines radiances from TES and OMI (Fu et al., 2013). The joint TES/OMI retrieval provides 2 DOFS in the troposphere, with approximately 0.4 DOFS for near-surface ozone (surface to 700 hPa).
Figures 23 and 24 display bias and uncertainty information for satellite retrievals and data products from published validation studies. Systematic biases, however, can vary by region (see TOAR-Climate, Gaudel et al., 2018). In all cases evaluations were made with respect to ECC sondes; there are very few comparisons with other tropospheric data sources (e.g. Safieddine et al., 2016). It is apparent from Figures 23 and 24 that while biases are fairly modest (between -10% and +20%, and often much smaller), standard deviations are large compared with those of the other measurement systems discussed above: about 10-30%, versus 5-10% for sondes, aircraft, lidar and ground-based FTIR. Nevertheless, as satellite data products offer global or near-global coverage with few gaps, their value is correspondingly large. The other measurement systems suffer from errors of representativeness when point measurements are interpolated or extrapolated to infer information at places and times other than those of the original measurement.

Representativeness
Ozone is a highly reactive secondary pollutant whose atmospheric mole fractions are determined by many processes, such as photochemical formation, deposition and titration. Although several thousand ground-based stations measure ozone at high temporal frequency world-wide (TOAR-Surface Ozone Database), the globe is nevertheless undersampled, since surface ozone over land may vary on scales of a few kilometres or less. (Here the spatial representativeness of observations down to the urban/regional scale is considered. The question of whether measurements are representative of the well-mixed boundary layer, addressed earlier with Criterion 3, concerns a smaller scale and is not addressed in this section.) Sites are unevenly distributed, with relatively few in the tropics and southern mid-latitudes (Sofen et al., 2016; TOAR-Surface Ozone Database). Spatial representativeness is therefore often the largest source of uncertainty in the use of ground-based data. For a well-calibrated analyzer, measurement uncertainty is of the order of 1 nmol mol⁻¹ (Section 2.2.2), while two locations with different site characteristics (e.g. urban vs rural) may show average differences an order of magnitude larger. Land use may also change with time, complicating trend analysis by inducing changes that do not reflect background ozone trends. Remote baseline sites with minimal influence of local processes are usually representative of a larger area than sites located to capture plumes from particular sources (e.g. curbside monitoring stations), which reflect local conditions. The application of the data may also determine the area of representativeness, as correlation lengths are typically longer for monthly averages than for daily data (e.g. Sofen et al., 2016).
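A correlation length of the kind mentioned above can be estimated from station pairs. First correlate the ozone time series of each pair, then fit how the correlation decays with separation distance. The minimal Python sketch below assumes an exponential decay r(d) = exp(-d/L), fitted as a line through the origin in ln r. The exponential form is an assumption of this sketch, not necessarily the model of Sofen et al. (2016), and the example data are synthetic.

```python
import numpy as np

def correlation_length(distances_km, correlations):
    """Fit r(d) = exp(-d / L) by least squares on ln r (line through the origin).
    Non-positive correlations cannot enter the log fit and are dropped."""
    d = np.asarray(distances_km, float)
    r = np.asarray(correlations, float)
    keep = r > 0
    d, log_r = d[keep], np.log(r[keep])
    slope = np.sum(d * log_r) / np.sum(d * d)
    return -1.0 / slope

# Synthetic station-pair correlations that decay with L = 300 km (hypothetical).
d_pairs = np.array([50.0, 100.0, 200.0, 400.0, 800.0, 1000.0])
r_pairs = np.exp(-d_pairs / 300.0)
r_pairs[-1] = -0.1  # an uncorrelated distant pair, excluded from the fit
L_km = correlation_length(d_pairs, r_pairs)
```

Applying the same fit separately to daily and to monthly-mean series shows how the averaging period changes the effective area of representativeness.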
Standard attempts to assess the representativeness of in-situ air quality monitoring stations classify the sites into categories such as urban, suburban, rural or remote, in terms of their exposure to sources and sinks (e.g. European Union, 2008). These classifications are somewhat qualitative. More quantitative approaches use station metadata. Ozone concentrations at urban stations are strongly controlled by NOx titration, so population density (which can be seen as a proxy for NOx emissions) is often used to classify sites as urban, suburban or rural. The intensity and nature of sources (traffic, industrial or background) can be used to refine the classification. The TOAR-Surface Ozone Database has developed a globally consistent classification scheme based on population density, nighttime light intensity, OMI tropospheric column NO2 and station altitude. Because thresholds must be defined for each of these parameters, this metadata-based classification is still partly subjective.
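To illustrate why a threshold-based metadata classification remains partly subjective, the following sketch assigns a site category from the four proxies named above by a simple majority vote. All threshold values and the voting rule are hypothetical and are not those of the TOAR-Surface Ozone Database; shifting any single threshold moves stations between categories.

```python
from dataclasses import dataclass


@dataclass
class StationMetadata:
    population_density: float  # inhabitants per km^2
    nightlight: float          # nighttime light intensity (arbitrary units)
    omi_no2: float             # OMI tropospheric NO2 column (1e15 molecules cm^-2)
    altitude_m: float          # station altitude above sea level (m)


# HYPOTHETICAL thresholds, chosen only for illustration. They are NOT the
# values used by the TOAR-Surface Ozone Database.
URBAN_MIN = {"population_density": 2000.0, "nightlight": 50.0, "omi_no2": 5.0}
RURAL_MAX = {"population_density": 100.0, "nightlight": 10.0, "omi_no2": 2.0}
HIGH_ALTITUDE_M = 1500.0


def classify(meta: StationMetadata) -> str:
    """Assign a coarse site category by majority vote over the emission proxies."""
    # Sparsely populated high-altitude sites are treated as remote first.
    if meta.altitude_m >= HIGH_ALTITUDE_M and meta.population_density < RURAL_MAX["population_density"]:
        return "remote/high-elevation"
    # Count how many proxies exceed the urban limits or fall below the rural limits.
    urban_votes = sum(getattr(meta, key) >= limit for key, limit in URBAN_MIN.items())
    rural_votes = sum(getattr(meta, key) < limit for key, limit in RURAL_MAX.items())
    if urban_votes >= 2:
        return "urban"
    if rural_votes >= 2:
        return "rural"
    return "suburban"
```

A station that falls between the urban and rural limits on every proxy ends up "suburban" by default, which is exactly the kind of outcome that depends on where the thresholds are placed.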
Beyond the use of ozone monitoring station data for trends, their broader use for model evaluation and data assimilation raises the question of objectively quantifying station spatial representativeness, i.e., how a single measurement is related to its spatial surroundings (Spangl et al., 2007). Methods have been developed to characterize station representativeness with more objective criteria. Janssen et al. (2012) show that using a classification parameter based on land use improves model validation results by ~20%. Henne et al. (2010) have proposed a classification based on an explicit estimation of the emissions, deposition and transport influencing a particular station. They use population density as a proxy for emissions, land cover from the Wesely (1989) dry deposition parameterization to derive deposition fluxes, and a Lagrangian trajectory approach to evaluate the impact of transport. Methods based solely on the characteristics of the measurements themselves, especially the amplitude of the diurnal profile, have also been employed (Flemming et al., 2005; Tarasova et al., 2007; Gaubert et al., 2014). Urban stations exhibit a larger diurnal amplitude (due to strong nighttime ozone loss and strong daytime photochemical production), while remote and high-elevation stations show much flatter diurnal profiles. Joly and Peuch (2012) have refined this approach by adding other parameters, such as the weekend effect, to describe station characteristics. These methods are used to evaluate models with objectively classified station data (Marécal et al., 2015) and to build data assimilation systems. Solazzo and Galmarini (2015) have proposed an alternative approach employing spectral analysis of the ozone time series and correlation analysis of the different spectral components. They find that the area of representativeness is generally strongly anisotropic and quite heterogeneous (as also shown by the catchment areas of Henne et al. (2010)).
Noting that certain spectral components of ozone variability showed discontinuities between countries (Europe) and networks (North America), they discard those components as noise, and report an improvement in evaluated model performance of ~5%. Schutgens et al. (2016) have shown how high sub-grid variability, which is prominent in ozone observations, can result in imperfect matches between individual stations and regionally averaged values.
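The measurement-based classification described above, using the amplitude of the mean diurnal cycle and a weekend effect, can be sketched as follows. The peak-to-trough definition of amplitude, the weekend-minus-weekday statistic and the `synthetic_site` generator are illustrative assumptions, not the exact metrics of Flemming et al. (2005) or Joly and Peuch (2012).

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_diurnal_cycle(records):
    """Mean ozone for each hour of day from (datetime, value) records; None marks missing data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, o3 in records:
        if o3 is None:
            continue
        sums[t.hour] += o3
        counts[t.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(counts)}


def diurnal_amplitude(records):
    """Peak-to-trough range of the mean diurnal cycle."""
    cycle = mean_diurnal_cycle(records)
    return max(cycle.values()) - min(cycle.values())


def weekend_effect(records):
    """Mean weekend (Saturday/Sunday) ozone minus mean weekday ozone."""
    weekday, weekend = [], []
    for t, o3 in records:
        if o3 is None:
            continue
        (weekend if t.weekday() >= 5 else weekday).append(o3)
    return sum(weekend) / len(weekend) - sum(weekday) / len(weekday)


def synthetic_site(am_value, pm_value, weekend_offset, start=datetime(2020, 1, 6), days=14):
    """HYPOTHETICAL hourly series: am_value before noon, pm_value after, plus a weekend offset."""
    records = []
    for i in range(days * 24):
        t = start + timedelta(hours=i)
        value = am_value if t.hour < 12 else pm_value
        if t.weekday() >= 5:
            value += weekend_offset
        records.append((t, value))
    return records
```

With these made-up inputs, an urban-like site, `synthetic_site(20, 40, 5)`, has a diurnal amplitude of 20 nmol mol⁻¹ and a weekend effect of +5 nmol mol⁻¹, while a remote-like site, `synthetic_site(38, 42, 0)`, has an amplitude of only 4 nmol mol⁻¹ and no weekend effect.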
All these methods produce groupings significantly different from those of the metadata approach, and demonstrate both the value and the challenge of representative station classification. These findings also suggest that trend analyses using the large number of observations available from surface ozone sites would benefit from applying objective classification methods (TOAR-Surface Ozone Database; Fleming et al., 2018, hereinafter TOAR-Health).
In the free troposphere local effects are less important and areas of representativeness are larger. Typical correlation lengths in the free troposphere, for individual measurements, are about 500 km (Nastrom, 1977; Liu et al., 2009). However, the distances between ground-based observing sites are usually larger than this (Figures 8, 15, 16, 21), and, as with surface data, sites are unevenly distributed. Observations are also less frequent, so the ozone distribution is in general undersampled in both space and time. Several authors have noted that this raises representativeness issues (Logan et al., 2012; Tilmes et al., 2012), which can be serious when comparing with model fields or attempting to determine global or regional trends (Lin et al., 2015b). MacDonald (2005) illustrates how infrequent observations in the free troposphere directly add uncertainty to monthly averages and can limit the ability to detect trends.
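The effect of infrequent sampling on a monthly mean can be estimated with an elementary formula. The sketch below assumes, as a simplification, that the daily values within a month are independent with standard deviation σ and that sampled days are drawn without replacement; because ozone decorrelates only over a few days, this is approximate and understates the error when samples are closely spaced. The function name and its parameters are hypothetical.

```python
import math


def monthly_mean_sampling_se(sigma, n_samples, n_days=30):
    """Approximate sampling standard error of a monthly mean built from n_samples days.

    Assumes (hypothetically) independent daily values with standard deviation
    sigma. The finite-population factor sqrt((N - n) / (N - 1)) makes the
    error vanish when every day of the month is sampled.
    """
    if not 1 <= n_samples <= n_days:
        raise ValueError("n_samples must be between 1 and n_days")
    if n_days == 1:
        return 0.0
    fpc = math.sqrt((n_days - n_samples) / (n_days - 1))
    return sigma / math.sqrt(n_samples) * fpc
```

With a hypothetical day-to-day standard deviation of 10 nmol mol⁻¹, weekly soundings (about four per month) give a sampling error of roughly 4.7 nmol mol⁻¹ on the monthly mean, several times the instrumental uncertainty of a well-calibrated sonde.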
To address the problem of uneven site distribution, sites are sometimes grouped according to geographic region or ozone characteristics (Stauffer et al., 2016). Alternatively, a subset of sites is chosen (e.g. Oltmans et al., 2013). The generalized additive mixed model (GAMM) technique has also been used to derive trends for large regions with uneven monitoring networks. Linear interpolation between widely separated sites yields unsatisfactory results (Logan, 1999), but interpolation methods that take meteorological representativeness into account can produce better regional estimates from limited sampling. Such methods have their limitations, however: in Figure S-10 the interannual differences in average bias, from thousands of profiles annually, are much larger than can be expected from instrumental uncertainty, and so must result from sampling differences.
Sampling differences can add uncertainty to regional trends. This will be independent of the metric used, as most of the metrics described in TOAR-Metrics are linear transformations of the measured data. The ozone trend found by Cooper et al. (2010), based on all available mid-tropospheric ozone measurements over western North America during 1984-2008 (more than 1200 data points per year on a 0.2° × 0.2° × 200 m grid), passed a number of statistical tests for robustness. However, in springtime, meteorological variability in ozone over western North America is large and heterogeneous in space and time (e.g. Lin et al., 2015a; Stauffer et al., 2016). The Cooper et al. (2010) dataset was re-examined by Lin et al. (2015b) using chemistry-climate model hindcast simulations driven by observed meteorology. The GFDL-AM3 model, co-sampled in space and time with the observations, reproduces the observed ozone trend (0.65 ± 0.32 nmol mol⁻¹ year⁻¹) over 1995-2008, while the model with continuous temporal and spatial sampling indicates a smaller trend (0.25 ± 0.32 nmol mol⁻¹ year⁻¹). This comparison suggests that the sampling frequency and distribution of ozone profile measurements do not capture the full interannual and spatial variability of ozone across western North America. Lin et al. (2015b) noted that, if the meteorology of the model forced with reanalysis winds is approximately correct, then the difference between the model median and the median of model points co-sampled with observations can be used as a measure of the "data representativeness uncertainty". When this representativeness uncertainty is added to the statistical uncertainty of the trend from observations, the trend estimate for 1995-2008 becomes 0.65 ± 0.57 nmol mol⁻¹ year⁻¹, which overlaps with the model trend of 0.25 ± 0.32 nmol mol⁻¹ year⁻¹.
Spatial correlation can exist on different scales. Sofen et al. (2016), using monthly averages, found much longer spatial correlation lengths than Liu et al. (2009), who examined individual ozone soundings. Similarly, Eriksson and Chen (2002) found vertical correlation lengths of 2-5 km in ozonesonde data, while Sofieva et al. (2004) found vertical correlation lengths of ~1 km for small-scale deviations from a smoothed profile.
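The effect reported by Lin et al. (2015b), where a co-sampled series yields a larger trend than the fully sampled one, can be mimicked with purely synthetic data. In the sketch below a seasonal cycle stands in for meteorological variability, and the sampling pattern (autumn in early years, spring in later years) is a deliberately extreme, hypothetical case; it is not a reconstruction of the Cooper et al. (2010) sampling or of the Lin et al. (2015b) method.

```python
import math


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def synthetic_ozone(year, month, trend=0.25, base=40.0, amplitude=8.0):
    """HYPOTHETICAL monthly ozone (nmol/mol): linear trend plus a cycle peaking in May.

    month runs 0..11 (January = 0); the seasonal term stands in for
    meteorological variability.
    """
    seasonal = amplitude * math.cos(2 * math.pi * (month - 4) / 12)
    return base + trend * (year - 1995) + seasonal


def annual_means(sampled_months):
    """Annual means of synthetic_ozone over the months sampled in each year."""
    yrs = sorted(sampled_months)
    means = [sum(synthetic_ozone(y, m) for m in sampled_months[y]) / len(sampled_months[y]) for y in yrs]
    return yrs, means


years = range(1995, 2009)
full = {y: range(12) for y in years}  # every month sampled
# Hypothetical uneven sampling: autumn months early, spring months late.
uneven = {y: (8, 9, 10) if y < 2002 else (3, 4, 5) for y in years}

slope_full = ols_slope(*annual_means(full))
slope_uneven = ols_slope(*annual_means(uneven))
```

The fully sampled annual means recover the imposed trend of 0.25 nmol mol⁻¹ year⁻¹, while the unevenly sampled ones give a trend several times larger, although the underlying series is identical. The gap between the two slopes is only loosely analogous to the representativeness uncertainty of Lin et al. (2015b), which was derived from model medians.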
Like spatial correlation, temporal autocorrelation in ozone exists on a variety of timescales. Hourly averaged ozone measurements are uncorrelated if more than a few days apart (Galbally et al., 1986; Lehman et al., 2004; Liu et al., 2009), but monthly averages nevertheless show significant autocorrelation (e.g. Tarasick et al., 2005; Oltmans et al., 2006). The temporal persistence, or autocorrelation, can differ by location and can be closely linked to weather patterns and variations in sources and sinks. Frequent observations allow temporal autocorrelation to be quantified at most relevant timescales. A number of research groups have developed techniques to optimize and evaluate proposed network changes that can be applied to tropospheric ozone monitoring. Spatial coherence has been studied by McBratney (1981) and Yost (1982), who identified "areas of influence". Dantzig et al. (1963) and Cressie (1985) pioneered research on network optimization, work that has been carried forward with more attention to the specific goals of environmental monitoring by Nychka and Salzman (1998), among others. Weatherhead et al. (2017) have addressed the challenge of designing monitoring systems when realistic constraints, including budget constraints, must be considered.
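Temporal autocorrelation matters for trend detection because it reduces the number of effectively independent observations. A common approximation for the mean of an AR(1) process with lag-1 autocorrelation φ is n_eff = n(1 - φ)/(1 + φ). The sketch below removes the mean seasonal cycle from a monthly series, estimates φ from the anomalies and applies this approximation; it is a minimal illustration under the AR(1) assumption, not a procedure prescribed by the studies cited above.

```python
def monthly_anomalies(values, start_month=0):
    """Subtract each calendar month's mean (a simple deseasonalization).

    values: consecutive monthly values; start_month: calendar month of values[0] (0 = January).
    """
    by_month = {}
    for i, v in enumerate(values):
        by_month.setdefault((start_month + i) % 12, []).append(v)
    climatology = {m: sum(vs) / len(vs) for m, vs in by_month.items()}
    return [v - climatology[(start_month + i) % 12] for i, v in enumerate(values)]


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a series."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def effective_sample_size(n, phi):
    """Approximate number of independent values for an AR(1) series with lag-1 autocorrelation phi."""
    return n * (1 - phi) / (1 + phi)
```

For example, 120 monthly means with φ = 0.5 carry roughly the information of 40 independent values, which widens trend confidence intervals accordingly.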

Conclusions and recommendations for design of a future global observational program
From the earliest measurements in the 19th century, both measurement methods and the portion of the globe observed have evolved and changed significantly. The historical methods have different uncertainties and biases, and the data records differ with respect to coverage (space and time), information content and representativeness. There are significant uncertainties in the 19th and early 20th century measurements with regard to representativeness and interfering gases. SO2 levels in particular appear to have been quite high in urban areas, and may have negatively biased urban ozone measurements. There is therefore no unambiguous evidence of very low ozone values in the 19th century.
There are 49 and 11 sets of surface ozone measurements, by KI and spectroscopic methods respectively, made before the mid-1970s and suitable for this historical analysis. Values of ozone absorption coefficients used before 1960 varied, however, and caused ozone to be underestimated by up to 11%. Overall, the 60 available datasets during 1896-1975 indicate an ozone mole fraction in the well-mixed unpolluted boundary layer in the range 22 to 26 nmol mol⁻¹. Comparison with modern measurements from the TOAR database suggests that surface ozone has increased by 32-71%, with large uncertainty, in rural air in the temperate and polar zones of the Northern Hemisphere, and by much smaller amounts in the Southern Hemisphere. This estimate depends much more on the modern region chosen for comparison than on the historical data: when some of the historical datasets judged less reliable are omitted, the results are quite similar. The ±20% range of the estimated increase comes primarily from the variability of present-day surface ozone. Data representativeness thus seems to be the more important source of uncertainty. Based on a more limited, but completely independent, set of data, free tropospheric ozone in the mid-latitudes of the Northern Hemisphere appears to have changed by a similar amount. In spite of the extensive efforts to identify and evaluate early ozone data records, other data may exist that could pass the selection criteria of this study.

Table 12
… (Hardacre et al., 2015; Bariteau et al., 2010; Luhar et al., 2017, 2018).

Chemical data assimilation
Moderate accuracy and precision, preferably at the 3-5% level. Vertically resolved measurements desirable. Daily or better time resolution.
Many sites in different regions. Choice of sites should be guided by objectively quantified site spatial representativeness. Satellite, surface monitor and aircraft data. Can we increase the impact of sparse measurements? Aircraft, lidar and ozonesonde data have small measurement errors relative to model error, so their impact should be significant.

Satellite ozone data validation
High accuracy and high precision, preferably at the 2-3% level. Profile (free tropospheric) information required. (… aerosol) Data quality of prime importance; periodic re-evaluation needed.

How do ozone levels in the free troposphere affect levels in the planetary boundary layer (PBL)?
Measurement campaigns with vertical sounding at a time resolution down to a few hours: lidar, satellite, sonde and other meteorological measurements, possibly at multiple sites.
Sites in different latitude bands. Sites with multi-year measurement records are of value for background climatology.
More sites at lower latitudes. Important for interpreting satellite measurements, which are primarily sensitive to ozone above the PBL (Crawford and Pickering, 2014; Martins et al., 2015).

Model processes: Can we properly model and quantify long-range transport (e.g. Stohl and Trickl, 1999) and stratosphere-troposphere transport events, and their effects on ozone levels in both the free troposphere and the PBL? How well is plume mixing represented? (Osman et al., 2016; Trickl et al., 2014, 2016; Eastham and Jacob, 2017)
Measurement campaigns with lidar, global satellite observations, meteorological sondes or ozonesondes (daily launches).
Sites in different latitude bands. Sites near tropopause breaks. Sites with multi-year measurement records are of value for background climatology.
Global flux is moderately well estimated by models; regional flux varies (Stohl et al., 2003a, b) and is not well characterized by scattered observations. Models show great improvement in capturing tropopause folding events (e.g. Trickl et al., 2010, 2011; He et al., 2011; Langford et al., 2015; Lin et al., 2015a), in part due to better spatial resolution, but effects on surface ozone may be underestimated (Lin et al., 2012a, b; Hess and Zbinden, 2013; Zanis et al., 2014; Lefohn et al., 2014; Akritidis et al., 2016).

Climate change effects: How will temperature increases affect ozone photochemistry and transport? Will more forest fires increase photochemical ozone production? Will the Brewer-Dobson circulation increase stratospheric O3 transport to the troposphere? What is causing the observed trend in Arctic surface ozone depletion events (ODEs)?
Detection of long-term changes in ozone distribution and ozone transport. Need vertical resolution of ~0.2 km or better, to separate PBL, mid-troposphere and UTLS processes. Need global coverage, and long-term stability of ~1 nmol mol⁻¹.
Long-term measurements of ozone precursors in the background atmosphere.
Several sites in different latitude bands. Sites with long-term measurement records are preferable.
Climate change is expected to increase planetary wave activity and so produce a stronger Brewer-Dobson circulation (e.g. Butchart et al., 2006; Butchart, 2014). In the lower troposphere, increases will be offset by losses due to reaction with water vapour to produce OH.
At some Arctic sites a long-term increase is observed in surface ODEs (Tarasick et al., 2014), possibly related to changes in snow and sea ice cover (Simpson et al., 2007).

Oil and gas extraction: Will increased activity in oil and/or gas fracking release NOx and VOCs, leading to high ozone formation?
Measurement campaigns with surface monitors and measurements of other species, possibly at multiple sites. Vertical profiling would be useful.
Near extraction areas.
Sites with multi-year measurement records are of value for background climatology.

Representativeness, especially for surface sites, is a potential source of significant biases, which are difficult to quantify. Recent research into objective methods of determining areas of representativeness has made valuable progress in reducing this source of uncertainty.
The great majority of validation and intercomparison studies of free tropospheric ozone measurement methods are undertaken with ECC ozonesondes. ECC sondes have been compared with UV-absorption measurements in a number of intercomparison studies. The sondes show a modest (~1-5%) high bias in the troposphere, with an uncertainty of 5%, but no evidence of an instrumental change with time. Other methods (Umkehr, lidar, FTIR and UV instruments on commercial aircraft) all show modest low biases relative to the ECCs, so, using ECC sondes as a transfer standard, all appear to agree with the UV-absorption standard to within 1σ.
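The transfer-standard reasoning can be written out explicitly. If an instrument X has fractional bias b₁ relative to ECC sondes, and ECC sondes have fractional bias b₂ relative to the UV standard, then X/UV = (1 + b₁)(1 + b₂), and for small, independent errors the 1σ uncertainties combine approximately in quadrature. The numbers in the sketch are hypothetical values chosen within the ranges quoted above, not results from a specific intercomparison.

```python
import math


def chain_bias(bias_x_vs_ref, bias_ref_vs_std):
    """Fractional bias of instrument X against the standard via a transfer reference.

    Biases are fractional (0.03 = +3 %): X/std = (X/ref) * (ref/std).
    """
    return (1 + bias_x_vs_ref) * (1 + bias_ref_vs_std) - 1


def chain_uncertainty(u_x_vs_ref, u_ref_vs_std):
    """Combine fractional 1-sigma uncertainties in quadrature (assumes independence, small errors)."""
    return math.hypot(u_x_vs_ref, u_ref_vs_std)


# HYPOTHETICAL numbers for illustration: ECC +3 % (+/-5 %) vs the UV standard,
# and an instrument X at -2 % (+/-3 %) vs ECC sondes.
b = chain_bias(-0.02, 0.03)
u = chain_uncertainty(0.03, 0.05)
agrees_within_1sigma = abs(b) <= u
```

Here the chained bias is about +0.9%, well inside the combined 1σ uncertainty of about 5.8%, so the two offsetting biases leave X consistent with the UV standard.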
Relative to the UV standard, BM (Brewer-Mast) sondes show a 20% increase in sensitivity to tropospheric ozone from 1970 to 1995. The KC sondes show a smaller increase of 5-10%. In combination with the gradual shift of the global network to ECC sondes, this will, if uncorrected, introduce a spurious positive trend into analyses of free tropospheric ozone based on sonde data.
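The size of such a spurious trend can be estimated with simple arithmetic. Assuming, purely for illustration, that BM sonde sensitivity increased linearly by 20% between 1970 and 1995 while true ozone stayed constant at a hypothetical 50 nmol mol⁻¹, an ordinary least-squares fit to the uncorrected readings returns a trend of 0.4 nmol mol⁻¹ year⁻¹ (0.8% per year) that is entirely instrumental.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def drifting_sonde(true_o3, year, start=1970, end=1995, total_increase=0.20):
    """Reading of a sonde whose sensitivity rises linearly by total_increase over start..end.

    The linear shape of the drift is a HYPOTHETICAL assumption for illustration.
    """
    return true_o3 * (1 + total_increase * (year - start) / (end - start))


years = list(range(1970, 1996))
true_o3 = 50.0  # nmol/mol, held constant: there is no real trend
measured = [drifting_sonde(true_o3, y) for y in years]
spurious = ols_slope(years, measured)      # nmol/mol per year, purely instrumental
spurious_pct = 100 * spurious / true_o3    # percent per year
```

An apparent trend of this size is comparable to the tropospheric ozone trends being sought, which is why homogenization of the sonde records matters.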
Satellite biases are often larger than those of other free tropospheric measurement systems, ranging between -10% and +20%, and standard deviations are large: about 10-30%, versus 5-10% for sondes, aircraft instruments, lidar and ground-based FTIR. Although measurement drift has been examined extensively for satellite measurements of stratospheric ozone (Harris et al., 2015;Hubert et al., 2016), there is relatively little information on temporal changes of bias for satellite measurements of tropospheric ozone. This is an evident area of concern, and one that must be addressed if satellite retrievals are used for trend studies (TOAR-Climate, Gaudel et al., 2018).
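One straightforward way to look for temporal changes of bias is to regress matched satellite-minus-sonde relative differences on time and compare the slope with its standard error. The sketch assumes independent residuals, which is optimistic for autocorrelated differences (the effective sample size would then be smaller); the function name is hypothetical.

```python
import math


def drift_with_se(times, rel_diff):
    """OLS drift of relative satellite-minus-sonde differences, with its standard error.

    times: decimal years of matched pairs; rel_diff: (sat - sonde) / sonde.
    Returns (slope per year, standard error of the slope). Assumes independent
    residuals, which autocorrelated differences would violate.
    """
    n = len(times)
    if n < 3:
        raise ValueError("need at least three matched pairs")
    mt, md = sum(times) / n, sum(rel_diff) / n
    stt = sum((t - mt) ** 2 for t in times)
    slope = sum((t - mt) * (d - md) for t, d in zip(times, rel_diff)) / stt
    intercept = md - slope * mt
    rss = sum((d - intercept - slope * t) ** 2 for t, d in zip(times, rel_diff))
    se = math.sqrt(rss / (n - 2) / stt)
    return slope, se
```

A slope larger than about twice its standard error would suggest a drift worth investigating before the retrieval is used for trend studies.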
The importance of ECC sondes as a transfer standard for satellite validation means that more effort should be placed on understanding and reducing their uncertainties. The overall accuracy of the global ozonesonde network has improved: at many important sites the historical record has been homogenized by correcting for known changes in station records (e.g. Tarasick et al., 2016; van Malderen et al., 2016; Witte et al., 2017, 2019). In addition, spatial inhomogeneity has been reduced by adopting strict standard operating procedures (Smit and ASOPOS panel, 2011).
These continuing efforts should proceed in tandem with research to better quantify systematic and random uncertainty in ECC data and to understand changes therein. The global network is also unevenly distributed (Figures 15 and 16), so additional sites in southern mid-latitudes, North Africa, Asia and other areas with limited coverage are recommended, possibly as measurement campaigns (Lelieveld et al., 2002).
Planning future observations of ozone will need to make careful use of known spatial and temporal coherence. Decisions concerning spatial coverage and temporal frequency need to be made with consideration for measurement accuracy and co-location with other observations, including NOx, wind speed and direction, and other information needed to understand both ozone and its sources. The integration or merging of data from different platforms, which has received little attention to date, can improve coverage.
Although tropospheric ozone monitoring has evolved from sporadic measurements at a few locations to extensive, well-calibrated networks with formal international collaboration (e.g. Schultz et al., 2015), as well as global satellite observations, it is neither comprehensive nor evenly distributed. It is recommended that the design of the future global observational program be guided by several current and emerging scientific issues (Table 12). Each method of observation has its inherent advantages and limitations, so different techniques will continue to complement and support each other. For example, satellite observations are likely to be of great importance to ozone data assimilation, but ozonesonde and lidar profiles are required for satellite product evaluation. It is to be hoped that commercial aircraft monitoring will be expanded, to close monitoring gaps with reliable, well-calibrated measurements. Modest improvements in the distribution of ground-based observing sites could yield significant benefits in global coverage. International cooperation and data sharing will be of paramount importance, as the TOAR project has demonstrated.

Data Accessibility Statement
No new measurements were made for this review article. All datasets discussed in the text were obtained from the published scientific literature. Lidar, FTIR, Umkehr and ozonesonde data used in this publication are publicly available from the World Ozone and UV Data Centre (http://www.woudc.org) and the Network for the Detection of Atmospheric Composition Change (http://www.ndacc.org), and MOZAIC/IAGOS data are available at http://www.iagos.org. Satellite data products are available at the URLs listed in Supplemental Material, Text S-12.

Supplemental files
The supplemental files for this article can be found as follows:
• Text S-1. Instrumental method: Gas phase titration (GPT

Funding information
C. Vigouroux was supported financially by the EU H2020 project GAIA-Clim (No 640276). M. Steinbacher acknowledges funding from the GAW Quality Assurance/Science Activity Centre Switzerland (QA/SAC-CH), which is supported by MeteoSwiss and Empa. OHP observations are funded by the NDACC French program. The Laboratoire Inter-universitaire des Systèmes Atmosphériques (LISA) acknowledges the support from CNES (Centre National des Etudes Spatiales)/TOSCA (Terre Océan Surface Continentale Atmosphère), PNTS (Programme National de Télédétection Spatiale) and ANR (Agence Nationale de la Recherche, project ANR-15-CE04-0005) for the development and production of ozone observations from IASI+GOME-2 and IASI. The MLS, OMI and TES projects are supported by the National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Aura Program. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with NASA. The National Center for Atmospheric Research is sponsored by the National Science Foundation. J.W. Hannigan is supported under contract by NASA.