The Greenhouse Gas Climate Change Initiative (GHG-CCI): comparative validation of GHG-CCI SCIAMACHY/ENVISAT and TANSO-FTS/GOSAT CO2 and CH4 retrieval algorithm products with measurements from the TCCON

Column-averaged dry-air mole fractions of carbon dioxide and methane have been retrieved from spectra acquired by the TANSO-FTS (Thermal And Near-infrared Sensor for carbon Observations-Fourier Transform Spectrometer) and SCIAMACHY (Scanning Imaging Absorption Spectrometer for Atmospheric Cartography) instruments on board GOSAT (Greenhouse gases Observing SATellite) and ENVISAT (ENVIronmental SATellite), respectively, using a range of European retrieval algorithms. These retrievals have been compared with data from ground-based high-resolution Fourier transform spectrometers (FTSs) from the Total Carbon Column Observing Network (TCCON). The participating algorithms are the weighting function modified differential optical absorption spectroscopy (DOAS) algorithm (WFMD, University of Bremen), the Bremen optimal estimation DOAS algorithm (BESD, University of Bremen), the iterative maximum a posteriori DOAS algorithm (IMAP, Jet Propulsion Laboratory (JPL) and Netherlands Institute for Space Research (SRON)), the proxy and full-physics versions of SRON's RemoTeC algorithm (SRPR and SRFP, respectively) and the proxy and full-physics versions of the University of Leicester's adaptation of the OCO (Orbiting Carbon Observatory) algorithm (OCPR and OCFP, respectively). The goal of this algorithm inter-comparison was to identify strengths and weaknesses of the so-called round-robin data sets generated with the various algorithms, so as to determine which of the competing algorithms would proceed to the next round of the European Space Agency's (ESA) Greenhouse Gas Climate Change Initiative (GHG-CCI) project: the generation of the so-called Climate Research Data Package (CRDP), the first version of the Essential Climate Variable (ECV) "greenhouse gases" (GHGs). For XCO2, all algorithms reach the precision requirements for inverse modelling (< 8 ppm), with only WFMD having a poorer precision (4.7 ppm) than the other algorithm products (2.4–2.5 ppm).


B. Dils et al.: The Greenhouse Gas Climate Change Initiative (GHG-CCI)
For XCH4, the precision of both SCIAMACHY products (50.2 ppb for IMAP and 76.4 ppb for WFMD) fails to meet the < 34 ppb threshold for inverse modelling, but note that this work focusses on the period after the 2005 SCIAMACHY detector degradation. The GOSAT XCH4 precision ranges between 14.0 and 18.1 ppb. Looking at the seasonal relative accuracy (SRA), all GOSAT algorithm products reach the < 10 ppb threshold (values ranging between 5.4 and 6.2 ppb). For SCIAMACHY, IMAP and WFMD have an SRA of 17.2 and 10.5 ppb, respectively.

Introduction
According to the IPCC 2007 report (Solomon et al., 2007), based on estimates of radiative forcing between 1750 and 2005, carbon dioxide and methane combined account for over 80 % of the anthropogenic greenhouse gas warming effect. It is therefore important to understand the magnitude and distribution of the CO2 and CH4 sources and sinks. Despite their importance, our knowledge of the sources and sinks still has significant gaps (e.g. Stephens et al., 2007; Canadell et al., 2010). For instance, it is still unclear why methane levels in the atmosphere were rather stable between ∼2000 and 2006 (Simpson et al., 2012), while before and after this period they were rising (currently by about 7-8 ppb year⁻¹; e.g. Rigby et al., 2008; Schneising et al., 2011).
Currently, surface in situ trace gas concentration measurements are the primary data used to constrain inverse model estimates of surface fluxes (Baker et al., 2006), but these measurements only cover a fraction of Earth's atmosphere. Global satellite observations, sensitive to the near-surface CO2 and CH4 variations, are therefore important data sets to improve these flux estimations (Chevallier et al., 2007; Bergamaschi et al., 2009). However, given the long atmospheric lifetimes of both gases (30-95 years for CO2, ∼12 years for CH4; e.g. Jacobson, 2005; Prather, 1994; Prather et al., 2001), the fluxes are small compared to the resident quantity in the atmosphere. The satellite accuracy requirements are therefore very demanding, since small errors in the retrieved total column concentrations may result in significant errors in the derived fluxes (e.g. Meirink et al., 2006; Chevallier et al., 2007).
Currently only two satellite instruments, SCIAMACHY (Scanning Imaging Absorption Spectrometer for Atmospheric Cartography) on board ENVISAT (ENVironmental SATellite) (Bovensmann et al., 1999) and TANSO-FTS (Thermal And Near infrared Sensor for carbon Observations-Fourier Transform Spectrometer) on board GOSAT (Greenhouse gases Observing SATellite; Kuze et al., 2009), deliver, or have delivered (SCIAMACHY operation ended in April 2012), measurements that are sensitive to near-surface CO2 and CH4 concentration variations. Both make use of the near-infrared/short-wave-infrared (NIR/SWIR) spectral region to analyse the reflected solar radiation in a nadir-looking configuration.
The aim of the European Space Agency's (ESA) Greenhouse Gas Climate Change Initiative (GHG-CCI) project is to provide a single high-quality satellite product for each trace gas retrieval (four satellite-species combinations in total): the so-called Essential Climate Variables (ECVs). In the round-robin (RR) evaluation phase of the project, a number of different algorithms are competing to proceed into the next phase of the project, which is the development of the aforementioned ECV records. Here we will present the validation results of these algorithms, using retrievals from spectra acquired by ground-based high-resolution Fourier transform spectrometers (FTSs) in the Total Carbon Column Observing Network (TCCON). All the algorithms discussed in this paper have already been validated to some extent at various stages in their development, often using the very same TCCON data. However, choices such as the collocation area and time, the averaging of data over time, etc., often vary between studies. Here we will present a comparative validation study, using a uniform strategy, focussing on the inter-algorithm differences and the significance thereof. The decision reached at the end of the round-robin analysis was based on more than this study alone. A general overview of the project's complete quality assessment results is given in Buchwitz et al. (2013).

Instruments
SCIAMACHY is a grating spectrometer on board the European environmental satellite ENVISAT, which was launched on 1 March 2002 into a sun-synchronous polar orbit. After a decade in orbit, contact with the satellite was finally lost on 8 April 2012. The SCIAMACHY instrument measured reflected, transmitted and backscattered solar radiation with a 0.2-1.4 nm resolution (Bovensmann et al., 1999). Its spectral band pass was divided into 8 channels. The first 6 covered the 214-1750 nm region, while channels 7 and 8 covered the 1940-2040 nm and 2265-2380 nm intervals, respectively. Unfortunately, NIR/SWIR channels 7 and 8 suffered from in-flight ice deposition on the detector. Therefore, despite the fact that these channels featured many CO2 and CH4 absorption features, the retrieval algorithms discussed in this paper make use of channel 6. A problem of channel 6 is that the number of dead and bad detector pixels continued to increase in the spectral region used for methane retrieval during the instrument's lifetime.
GOSAT was launched on 23 January 2009 by the Japan Aerospace Exploration Agency (JAXA) as a dedicated greenhouse-gas-monitoring satellite (Kuze et al., 2009). It is equipped with two instruments: TANSO-FTS and TANSO-CAI (the latter being a Cloud and Aerosol Imager that supports the FTS measurements). The TANSO-FTS instrument has four spectral bands with a resolution of 0.3 cm⁻¹, of which three operate in the SWIR (around 760, 1600 and 2000 nm) and one (between 5500 and 14 300 nm) in the thermal infrared. The first three provide sensitivity to the entire column including good near-surface sensitivity, while the latter is sensitive to the mid-troposphere. ENVISAT/SCIAMACHY retrieval algorithms are typically associated with the instrument (i.e. SCIAMACHY), while GOSAT/TANSO-FTS algorithms typically use the satellite (i.e. GOSAT) identifier. For the sake of consistency, we use the above-mentioned convention in this paper. Therefore, if we refer to GOSAT, we are implying the TANSO-FTS instrument on board GOSAT.

Retrieval algorithms
In total, 10 retrieval algorithm products (listed in Table 1 together with their version number and appropriate references) have been compared in four separate comparison pools for the four ECVs, namely SCIAMACHY XCH4, SCIAMACHY XCO2, GOSAT XCH4 and GOSAT XCO2. The data used in this study contain over-land measurements only. The features of all algorithms have already been reported in several peer-reviewed publications, and in the GHG-CCI Algorithm Theoretical Basis Document (ATBD; Reuter et al., 2012), so we will only give a very brief overview. Several algorithms come in a full-physics (typically tagged by FP in their four-letter acronym) and proxy (PR) version. The proxy method uses a "reference gas" to derive the dry-air column-averaged mole fraction (XCO2 and XCH4). This reference gas (in the case of CH4, CO2 is used as the reference; in the case of CO2, O2 is used) needs to have a far lower variability (in space and time) than the species of interest. This method allows for a very fast but still at least reasonably accurate retrieval in which many of the retrieval errors are cancelled in the CH4/CO2 or CO2/O2 ratio. On the downside, some error components do not cancel out and, in the case of XCH4, one needs to correct for the remaining variability of the CO2 reference gas, typically by using a global model (see for instance Frankenberg et al., 2005, 2011; Parker et al., 2011; Schneising et al., 2009, 2011; Schepers et al., 2012). The full-physics algorithms, on the other hand, model all relevant physical effects and derive the dry-air column-averaged mole fractions from the retrieved surface pressure or meteorological data. They are computationally more demanding than their proxy counterparts, but their dependence on models is reduced (Butz et al., 2011). All algorithms are still under continuous development, and indeed in some cases have already released an updated version (e.g. Guerlet et al., 2013). This paper deals with the versions submitted to the GHG-CCI round-robin data pool.
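The proxy approach outlined above can be illustrated with a minimal Python sketch. It assumes the commonly used form of the proxy relation, in which the retrieved CH4/CO2 column ratio is multiplied by a model XCO2. The function name, the column values and the model XCO2 of 390 ppm are hypothetical and are not taken from any of the algorithms compared here.

```python
def proxy_xch4_ppb(ch4_column, co2_column, model_xco2_ppm):
    """Proxy XCH4 (ppb) from retrieved columns and a model XCO2 (ppm).

    Assumed relation: XCH4 = (CH4 column / CO2 column) * XCO2_model.
    Both columns must be in the same unit (e.g. molecules cm^-2).
    """
    ratio = ch4_column / co2_column         # many retrieval errors cancel here
    return ratio * model_xco2_ppm * 1000.0  # convert ppm to ppb


# Hypothetical columns in molecules cm^-2 and a hypothetical model XCO2.
print(proxy_xch4_ppb(3.8e19, 8.3e21, 390.0))
```

Any error in the model XCO2 propagates directly into the proxy XCH4, which is the model dependence referred to in the text.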

SCIAMACHY XCO2 algorithms
Here the weighting function modified (WFM) differential optical absorption spectroscopy (DOAS) algorithm (henceforward referred to as WFMD) competes with the Bremen optimal estimation DOAS (BESD) algorithm, both developed at the University of Bremen. For WFMD we refer to Buchwitz et al. (2000, 2005, 2007), Schneising et al. (2008, 2009, 2011, 2012) and Heymann et al. (2012a). The version validated in this paper is described by Heymann et al. (2012b). For BESD, a more recent product, we refer to Reuter et al. (2010, 2011). WFMD is a proxy least-squares method based on a fast look-up table (LUT) scheme and uses a single constant atmospheric prior. BESD, on the other hand, is a full-physics algorithm based on optimal estimation (Rodgers, 2000) and uses on-line radiative transfer (RT) model simulations. Note that WFMD is the only XCO2 retrieval algorithm that did not feature a bias-correction post-processing step based on TCCON (which would improve its validation parameters).

GOSAT XCO2 algorithms
Here we have two full-physics algorithms: one developed at the University of Leicester (UoL), referred to in this article as OCFP, and one at SRON, the Netherlands Institute for Space Research, referred to as SRFP. The first is UoL's implementation of the OCO (Orbiting Carbon Observatory; Crisp et al., 2004) full-physics algorithm. The second is a development of SRON's RemoTeC algorithm (Butz et al., 2011). Both algorithms adjust parameters of a surface-atmosphere state vector and other parameters to the satellite observations, but differ in many other aspects such as their inversion scheme (optimal estimation versus Tikhonov-Phillips), RT models, pre- and post-processing, etc. For more information we refer to Cogan et al. (2012) and Butz et al. (2011). Note that both algorithms feature a post-processing bias-correction scheme. The bias-corrected products are henceforward referred to as SRFC and OCFC, to distinguish them from the non-bias-corrected SRFP and OCFP products.

SCIAMACHY XCH4 algorithms
Again we have the WFMD algorithm, although this time the version described in Schneising et al. (2011), together with the IMAP (iterative maximum a posteriori) DOAS algorithm (in this article further referred to as IMAP). Both algorithms are fairly mature but have primarily focussed on the first three years of SCIAMACHY retrievals, up until the 2005 SCIAMACHY detector degradation in the methane spectral region. Extending the time series beyond 2005 remains a challenge (see Frankenberg et al., 2011; Schneising et al., 2011, for details). Both are proxy algorithms. Apart from calibration, pre- and post-filtering differences, WFMD uses a method in which a linearized radiative transfer model (chosen from a look-up table) plus a low-order polynomial is linear least-squares fitted to the logarithm of the measured sun-normalized radiance. IMAP, on the other hand, uses an optimal estimation inversion method, which minimizes both the least-squares difference between forward model and measurement and that between the a priori and a posteriori state vector.

GOSAT XCH4 algorithms
Here we have both the full-physics and proxy versions of the UoL (OCFP & OCPR) and SRON (SRFP & SRPR) algorithms mentioned above in Sect. 3.2. We refer to Parker et al. (2011) for information on OCFP and OCPR, to Butz et al. (2011) for SRFP and Schepers et al. (2012) for SRPR.

TCCON
The Total Carbon Column Observing Network (TCCON) (Wunch et al., 2011a) is a network of ground-based FTSs that provide long and quasi-continuous time series of precise and accurate column abundances of CO2, CH4, N2O and CO, retrieved from NIR solar absorption spectra using a nonlinear least-squares fitting algorithm called GFIT. Rather than retrieving the entire profile, GFIT scales an a priori profile to produce a synthetic spectrum that provides the best match with the measured spectrum. TCCON also makes use of the retrieved O2 columns to derive the corresponding dry-air column-averaged mole fractions:
$$\mathrm{XCO_2} = 0.2095\,\frac{\mathrm{CO_2\ column}}{\mathrm{O_2\ column}} \qquad (1)$$
$$\mathrm{XCH_4} = 0.2095\,\frac{\mathrm{CH_4\ column}}{\mathrm{O_2\ column}} \qquad (2)$$
Note that the TCCON O2 retrieval uses the 1.27 micron band of O2, not the O2 A band used in satellite retrievals. An important aspect of TCCON is that aircraft measurements have been performed over many sites, which allows for an empirical scaling to calibrate the TCCON measurements to the WMO standard reference scale (Deutscher et al., 2010; Geibel et al., 2012; Washenfelder et al., 2006). The scaling factor is uniform for all sites: 0.989 ± 0.001 (1σ) and 0.978 ± 0.002 for XCO2 and XCH4, respectively. The uncertainty on the TCCON/aircraft ratio also yields information on the total (station-to-station) network consistency (1σ uncertainty of 0.4 ppm for XCO2 and 3.5 ppb for XCH4; see Wunch et al., 2010). There is a continuous effort to decrease any station-to-station biases by improving the network-wide compatibility of the instrumental line shape (ILS) of the spectrometers. The ILS is monitored by performing regular lamp measurements with a low-pressure HCl gas cell. Another issue which could contribute to the uncertainty is the bias caused by faulty laser sampling boards in the Bruker 125HR instruments (Messerschmidt et al., 2010). These have all since been replaced, but the historical data set remains somewhat compromised. Dohe et al. (2013) have devised a correction scheme, but this still needs to be implemented. In the meantime the TCCON community offers estimated bias corrections for various stations and periods in time. The strongest suggested correction is −1.2 ppm XCO2 for pre-17 June 2009 Bremen data, while other stations were unaffected.
We have not applied these corrections, since all the retrieval algorithms in this study used the equally uncorrected TCCON data for the assessment of their bias-correction procedures, and the impact of any such correction on the reported network accuracy (which applies to the uncorrected data set used and should be taken into account when interpreting the validation results) is still unknown. The 10 TCCON stations employed in this study, together with their coordinates and periods of operation, are listed in Table 2. It is clear that not only the time at which these stations became operational differs, but also the amount of data obtained within a given time period. Because solar absorption FTS measurements can only be made under clear-sky conditions, site location, and the corresponding occurrence of clear-sky days, has a large impact on the number of available measurements.
The TCCON data used in this paper were analysed with the GGG2012 version of the standard TCCON retrieval algorithm.
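As a small worked example of Eqs. (1) and (2), the following Python sketch converts a gas column and an O2 column into a dry-air column-averaged mole fraction. The function name and the column values are hypothetical; the aircraft-derived scaling factors mentioned above are deliberately not applied here.

```python
O2_DRY_AIR_MOLE_FRACTION = 0.2095  # factor used in Eqs. (1) and (2)


def column_averaged_dmf(gas_column, o2_column):
    """Dry-air column-averaged mole fraction following Eqs. (1) and (2).

    Both columns must be in the same unit (e.g. molecules cm^-2).
    """
    return O2_DRY_AIR_MOLE_FRACTION * gas_column / o2_column


# Hypothetical CO2 and O2 columns in molecules cm^-2.
xco2_ppm = column_averaged_dmf(8.4e21, 4.5e24) * 1e6
print(f"XCO2 = {xco2_ppm:.1f} ppm")
```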

Methodology
The scope of the round-robin algorithm-TCCON comparisons was to identify any remaining shortcomings in the data products generated with the competing algorithms and determine any inter-algorithm quality differences. Therefore the methodology has been kept straightforward and simple, but identical for all algorithms involved.
Complicating the validation is the fact that both TCCON and satellite measurements provide best estimates of the true atmospheric state, based on their own individual sensitivities and a priori information. According to Rodgers (2000), one can correct for the different a priori profiles used in the TCCON and satellite retrieval algorithms. Here we have opted to use the TCCON a priori as the common a priori profile for all measurements. Using Rodgers (2000),
$$x_{\mathrm{cor}} = x + \frac{1}{m_0} \sum_i m_i \,(1 - A_i)\,\left(\mathrm{ap}_{T,i} - \mathrm{ap}_{x,i}\right),$$
in which x_cor and x are the a-priori-corrected and original column-averaged dry-air mole fractions; i is the vertical layer index; and m_i corresponds to the mass of dry air in layer i, which is directly derived from p_i/g_i. Here p_i is the dry-air pressure difference over layer i and g_i the gravitational acceleration in that layer. m_0 is the sum of m_i over all layers. A_i corresponds to the satellite algorithm's column-averaging kernel, while ap_x and ap_T are the algorithm and TCCON a priori dry-air mole fractions in layer i, respectively. The impact of the a priori correction is fairly limited. For XCO2, most algorithms exhibit a quasi-constant correction (a priori corrected minus original) over all stations, ranging between −0.68 and 0.63 ppm. Only WFMD exhibits a stronger and more erratic a priori correction, no doubt due to the single constant a priori it uses in its retrieval scheme (see Fig. 1). For XCH4, we again notice a quasi-constant correction, apart from OCFP and WFMD at Darwin and the SRON products at Lauder. OCFP uses an a priori directly from the TM3 model, while for OCPR a stratospheric adjustment is made using GEOS-Chem model simulations. As the OCPR data exhibit a far smaller correction at Darwin compared to OCFP, an offset in the TM3 stratospheric output is probably the cause. SRON, on the other hand, uses an XCH4 a priori derived from the TM4 model (Meirink et al., 2006).
Also noticeable in Fig. 2 is the gradual increase in the WFMD correction factor as we move from north to south, while the SRON products show a slight decrease (apart from Lauder). All the XCH4 a priori corrections range between −8.6 and 13.7 ppb.
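The a priori correction described above can be sketched in a few lines of Python. The sketch assumes the usual form of the adjustment, in which the dry-air-mass-weighted term (1 − A_i)(ap_T − ap_x) is summed over all layers and added to the retrieved value. The function and argument names are hypothetical, and all profiles are assumed to be given on the same vertical layers.

```python
def apriori_corrected(x, dry_air_mass, avg_kernel, apriori_sat, apriori_tccon):
    """Replace the satellite a priori by the TCCON a priori.

    x             : retrieved column-averaged dry-air mole fraction
    dry_air_mass  : m_i, dry-air mass per layer (any consistent unit)
    avg_kernel    : A_i, column-averaging kernel per layer
    apriori_sat   : ap_x, satellite a priori mole fraction per layer
    apriori_tccon : ap_T, TCCON a priori mole fraction per layer
    """
    profiles = (dry_air_mass, avg_kernel, apriori_sat, apriori_tccon)
    if len({len(p) for p in profiles}) != 1:
        raise ValueError("all profiles must have the same number of layers")
    m0 = sum(dry_air_mass)  # total dry-air mass of the column
    adjustment = sum(
        m * (1.0 - a) * (ap_t - ap_x)
        for m, a, ap_x, ap_t in zip(*profiles)
    ) / m0
    return x + adjustment
```

With an averaging kernel of exactly 1 in every layer the retrieval does not depend on the a priori, and the function returns x unchanged.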
www.atmos-meas-tech.net/7/1723/2014/
Table 3. BESD and WFMD XCO2 validation results for all individual stations and using all data combined (ALL). All results apart from R and N are in ppm units. The station flagged by * has been excluded from the relative accuracy calculation.
Note that we only corrected for the a priori difference and not for the difference in vertical sensitivity. That is, even with the same a priori profile, its relative contribution to the end result still depends on the averaging kernels. Considering this aspect in the TCCON-satellite comparisons, both of which yield only total column information, requires a reasonable estimate of the true atmospheric variability, which is not available on a global scale. In Wunch et al. (2011b) a detailed assessment of this issue was made, comparing ACOS-GOSAT XCO2 with TCCON measurements. The study was limited to data taken at the Lamont station only, where the real atmospheric variability could be derived from regular aircraft observations. They found that smoothing the TCCON profile with the ACOS-GOSAT averaging kernel at Lamont induced a bias of about 0.6 ppm with no significant seasonal cycle or airmass dependence. The a priori correction, on the other hand, did feature a seasonal and latitudinal dependence, as expected, given that, contrary to the TCCON a priori, the ACOS a priori does not feature a seasonal cycle. No such evaluation has yet been made for an XCH4 retrieval product, nor can it currently be made for any other station. With no ad hoc information on what best represents the true state for all stations, we limit ourselves to the a priori correction described above.
After the a priori correction, all available time series have been trimmed so as to work, in each given comparison round, with data that have matching temporal coverage. As with every satellite versus FTS comparison, we defined a collocation time and area in which satellite and ground-based measurements can be paired. Ideally these criteria are as strict as possible in order to minimize the impact of spatial and temporal variability on the comparison. Here we have set the collocation time to ±2 h. The spatial collocation criterion was set at a 500 km radius around the TCCON site. Smaller collocation areas have been tested (100, 350 km) but often yielded unstable results, due to insufficient data. All FTS data points that fall within the temporal overlap criteria of a single satellite measurement (that fell in the spatial overlap area) are then averaged to obtain a unique satellite-FTS data pair. The typical variability (1σ), including random errors and real atmospheric variability, of the FTS measurements within this 4 h overlap time frame is on average 2.5 ppb for XCH4 and 0.4 ppm for XCO2. Relaxing the overlap criteria does have a significant impact on the variability, and at ±6 h the variability increases to 3.5 ppb (XCH4) and 0.5 ppm (XCO2).
From these data pairs we derived various statistical parameters. In the figures and tables within this article, N corresponds to the number of collocated data pairs; R is the Pearson's r correlation coefficient; Bias is the average satellite-FTS difference,
$$\mathrm{Bias} = \mathrm{mean}\left(X_{\mathrm{sat}} - X_{\mathrm{FTS}}\right),$$
while the scatter corresponds to the standard deviation of said difference,
$$\mathrm{Scatter} = \mathrm{std}\left(X_{\mathrm{sat}} - X_{\mathrm{FTS}}\right).$$
Note that the single-measurement precision requirement for inverse modelling, set forward by the users, is < 8 ppm for XCO2 and < 34 ppb in the case of XCH4.
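The pairing procedure can be sketched as follows in Python. This is an illustrative reading of the collocation criteria (±2 h, 500 km radius), not the code used in this study: the great-circle distance is computed with the haversine formula on a spherical Earth, and the function names and data layout are hypothetical.

```python
import math
from datetime import timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def collocate(sat_obs, fts_obs, site_lat, site_lon,
              max_km=500.0, max_dt=timedelta(hours=2)):
    """Return one (satellite value, mean FTS value) pair per satellite sounding.

    sat_obs : iterable of (datetime, lat, lon, value)
    fts_obs : iterable of (datetime, value) for a single ground station
    """
    fts_obs = list(fts_obs)
    pairs = []
    for t_sat, lat, lon, x_sat in sat_obs:
        if great_circle_km(lat, lon, site_lat, site_lon) > max_km:
            continue  # outside the spatial collocation area
        matches = [x for t, x in fts_obs if abs(t - t_sat) <= max_dt]
        if matches:  # average all FTS points in the +/- max_dt window
            pairs.append((x_sat, sum(matches) / len(matches)))
    return pairs
```

Each satellite sounding yields at most one pair, so a single FTS measurement may contribute to several pairs when soundings are close in time.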
All these parameters have been calculated using the individual data pairs as well as daily and monthly means. Note that both the daily and monthly means are derived from the individual data pairs; thus the ± 2 h collocation criterion still applies. In the analysis all data pairs are considered to have equal weight. In this article we will show the results of the individual data pairs only, except for the correlation coefficient R, which is based on the daily averages. Also the time series plots shown are daily averages.
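A minimal Python sketch of these parameters is given below. It assumes the sample (N − 1) form of the standard deviation, which the text does not specify, and it computes all four parameters from the same input, whereas in this article R is based on daily averages. The function name is hypothetical.

```python
import math


def validation_stats(sat, fts):
    """N, Pearson R, Bias and Scatter for paired satellite and FTS values."""
    n = len(sat)
    if n != len(fts) or n < 2:
        raise ValueError("need at least two matched data pairs")
    diff = [s - f for s, f in zip(sat, fts)]
    bias = sum(diff) / n
    # Sample standard deviation (N - 1) of the differences: an assumption.
    scatter = math.sqrt(sum((d - bias) ** 2 for d in diff) / (n - 1))
    mean_s, mean_f = sum(sat) / n, sum(fts) / n
    cov = sum((s - mean_s) * (f - mean_f) for s, f in zip(sat, fts))
    var_s = sum((s - mean_s) ** 2 for s in sat)
    var_f = sum((f - mean_f) ** 2 for f in fts)
    r = cov / math.sqrt(var_s * var_f)  # undefined if either series is constant
    return {"N": n, "R": r, "Bias": bias, "Scatter": scatter}
```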
One of the important quality criteria put forward by the users is the so-called "relative accuracy" (RA). This parameter is an indication of the variability of the bias in space and time. The relative accuracy user requirements (1σ standard deviation) put forward by the inverse modelling community are 10 ppb for XCH4 and 0.5 ppm for XCO2, based on 1000 km² monthly averages. For inverse modelling purposes this parameter is more important than the overall bias, since a consistent bias can easily be corrected for. While this parameter cannot be exactly replicated in our analysis, we calculate an RA that attempts to yield some information on the station-to-station variability of the bias. We define RA as the standard deviation of the overall biases (derived from individual data) obtained at each station.
The "seasonal relative accuracy" (SRA) is the standard deviation over all seasonal bias results (40 in total: 4 seasonal biases over 10 stations). The seasonal bias results for each station are constructed from all data pairs which fall within the months of January to March (JFM), April to June (AMJ), July to September (JAS) or October to December (OND), regardless of the year the measurements are taken. Some stations feature only limited data during certain seasons, which sometimes results in erratic bias results. To avoid the inclusion of these results in the RA and SRA calculation, we do not include those bias results that are derived from fewer than 10 individual data points or have a standard error (σ/√N) which exceeds the user relative accuracy requirements (0.5 ppm XCO2, 10 ppb XCH4). RA and SRA are also derived from a common data set; thus if one algorithm in the validation round fails to meet the quality requirements for station x and season y, the corresponding bias result is also excluded from the SRA and RA calculation of its competitor.
In the case of all four seasonal biases for a station meeting the quality requirements, we also derive the standard deviation on these four results as an indicator of their variability. This parameter is referred to as the "seasonality" (Seas).
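The definitions of RA, SRA and Seas can be summarised in the following Python sketch for a single algorithm. The data layout and function names are hypothetical, and the requirement that competing algorithms share a common set of station-season results is not implemented here.

```python
import math
import statistics


def reliable_bias(diffs, max_stderr, min_n=10):
    """Mean satellite-FTS difference, or None if the quality filters fail."""
    n = len(diffs)
    if n < min_n:
        return None  # fewer than 10 individual data points
    stderr = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) if stderr <= max_stderr else None


def accuracy_metrics(diffs_by_station, max_stderr):
    """RA, SRA and per-station Seas.

    diffs_by_station : {station: {season: [satellite - FTS differences]}}
    max_stderr       : user RA requirement (0.5 ppm XCO2, 10 ppb XCH4)
    Note: statistics.stdev raises an error if fewer than two biases remain.
    """
    station_biases, seasonal_biases, seasonality = [], [], {}
    for station, seasons in diffs_by_station.items():
        all_diffs = [d for ds in seasons.values() for d in ds]
        overall = reliable_bias(all_diffs, max_stderr)
        if overall is not None:
            station_biases.append(overall)
        valid = [b for b in (reliable_bias(ds, max_stderr) for ds in seasons.values())
                 if b is not None]
        seasonal_biases.extend(valid)
        if len(valid) == 4:  # Seas only when all four seasons qualify
            seasonality[station] = statistics.stdev(valid)
    return {
        "RA": statistics.stdev(station_biases),
        "SRA": statistics.stdev(seasonal_biases),
        "Seas": seasonality,
    }
```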

Results
Shown in each section are overview figures (Figs. 4, 7, 10 and 13) and tables (Tables 3 through 12) that list the statistical parameters obtained at each station, and for all station data combined (ALL). Given the uneven distribution of data among the 10 TCCON stations, stations with high data density such as Lamont have a higher impact on the "all data" results. For practical purposes we will only show an example time series of a single European, North American and Oceanian station.
The overview Tables 13 and 14 also list the 95 % confidence interval of the overall parameters. The confidence intervals on the scatter, RA, Seas and SRA are inferred from the chi-squared (χ²) distribution,
$$\frac{(N-1)\,s^2}{\chi^2_{1-\alpha/2,\,N-1}} \le \sigma^2 \le \frac{(N-1)\,s^2}{\chi^2_{\alpha/2,\,N-1}},$$
with σ the population standard deviation, s the sample standard deviation, N the number of data points in the sample, α determining the confidence level (here 0.05 for 95 % confidence) and χ²_{p, N−1} the p quantile of the χ² distribution with N − 1 degrees of freedom. We also performed a so-called F test, to assess whether the statistical parameters of two competing results could stem from the same population (Snedecor and Cochran, 1989). The null hypothesis (H0) of the test states that the variances of the two populations are equal (σ₁² = σ₂²). The result of the test is a P value; the lower this number, the more likely it is that the obtained parameters, such as RA and SRA, of two competing algorithms are different.
Note that the F test relies on the presumption that the population exhibits a normal distribution. One could invoke the central limit theorem, as the data from which the RA, Seas and SRA are drawn are sample means from the overall population. However, the sampling itself can hardly be called random. Thus, to test for normality, we performed a Shapiro-Wilk normality test (Shapiro and Wilk, 1965), at the 0.05 significance level, on all the relevant data samples. All data samples passed the test apart from the SRA samples from OCPR XCH4 and BESD XCO2. However, an analysis of the quantile-quantile probability showed no clear departure from normality.
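For illustration, a two-sided F test for equal variances can be written with the Python standard library alone, as sketched below. Whether the test in this study was one- or two-sided is not stated, so the two-sided form is an assumption, and all function names are hypothetical. The F cumulative distribution is evaluated through the regularized incomplete beta function, computed here with a standard continued-fraction expansion; in practice a statistics library would normally be used instead.

```python
import math


def _betacf(a, b, x, max_iter=200, eps=3e-14, tiny=1e-300):
    """Continued fraction of the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for aa in (even, odd):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_test_equal_variance(s1, n1, s2, n2):
    """Two-sided P value for H0: sigma_1^2 == sigma_2^2.

    s1, s2 : sample standard deviations (e.g. the RA of two algorithms)
    n1, n2 : number of values each standard deviation is based on
    """
    f = (s1 / s2) ** 2
    d1, d2 = n1 - 1, n2 - 1
    cdf = _reg_inc_beta(d1 / 2.0, d2 / 2.0, d1 * f / (d1 * f + d2))
    return min(1.0, 2.0 * min(cdf, 1.0 - cdf))
```

Identical sample standard deviations based on equal sample sizes give a P value of 1, i.e. no evidence against equal variances.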

SCIAMACHY XCO2
The two competing algorithms are BESD and WFMD. Table 3 and Fig. 3a show the evolution of the bias over the different stations. The error bars in Fig. 3a correspond to the 95 % confidence bands of the bias. Note that there are no data for Karlsruhe since the TCCON measurements there commenced in 2010, while there were no post-2009 WFMD data at the time of this analysis. The overall bias is slightly smaller for BESD, but the variability of the bias (i.e. relative accuracy) is almost identical (1.28 versus 1.29 ppm).
The most significant differences between both data sets are the scatter and data density (Fig. 3b and d). While the overall scatter for BESD is significantly lower (2.5 vs. 4.7 ppm for WFMD), its data density is also lower (9674 vs. 31 818 data pairs). Interestingly, this makes the uncertainty on the overall bias, i.e. the standard error (σ/√N), very similar (0.025 versus 0.026 ppm for BESD and WFMD, respectively). The higher scatter for WFMD also reveals itself in the generally lower correlation coefficients. Note that for the Lauder station, situated in New Zealand, BESD only offers three data pairs, all of which are measured on the same day (hence the lack of a daily correlation coefficient for this station). Both algorithms fail the above-stated quality requirements at this site, and if we thus exclude the Lauder station from our analysis, the relative accuracy (RA in the table) of BESD improves to 0.63 ppm, while that of WFMD (slightly) deteriorates to 1.36 ppm.
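The standard errors quoted above follow directly from the overall scatter and the number of data pairs, as this short Python check shows (the function name is hypothetical):

```python
import math


def standard_error(scatter, n):
    """Standard error of the mean bias: sigma / sqrt(N)."""
    return scatter / math.sqrt(n)


besd = standard_error(2.5, 9674)   # BESD: 2.5 ppm scatter, 9674 data pairs
wfmd = standard_error(4.7, 31818)  # WFMD: 4.7 ppm scatter, 31 818 data pairs
print(f"BESD: {besd:.3f} ppm, WFMD: {wfmd:.3f} ppm")
# prints "BESD: 0.025 ppm, WFMD: 0.026 ppm"
```

The roughly threefold larger WFMD sample compensates for its almost twofold larger scatter.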
The time series in Figs. 4 (BESD) and 5 (WFMD) are collocated daily averaged FTS and satellite measurements from Bialystok (a), Lamont (b) and Darwin (c). Comparing these figures, it is clear that BESD features substantially fewer data than WFMD, due to its more restrictive filtering process (particularly a very strict MERIS cloud mask). Also clearly visible is the extremely limited (if any) seasonal cycle in the Darwin data. BESD data clearly exhibit lower scatter, but some outliers can be identified. This has been identified as an issue related to the SCIAMACHY Level 1 version 7 consolidation product (L1v7u), used in the retrieval. Tests with the new L1v7w data show that these outliers are eliminated, which should further increase BESD's precision.
The seasonality at Lamont and Darwin, the overall seasonality and the SRA value are all slightly in favour of BESD (see Table 4). Keep in mind, however, that these parameters are derived from a limited data sample. Neither the seasonality nor the SRA difference is significant: the P value of the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$, i.e. the probability that both samples are from populations with equal variances, is 0.55 and 0.42, respectively. The P value for the RA hypothesis $H_0: \sigma_1^2 = \sigma_2^2$, on the other hand, is 0.06.

GOSAT XCO2
Here we have two competing algorithms, OCFC and SRFC, which are the full-physics, bias-corrected versions of University of Leicester's OCO and SRON's RemoTeC algorithms, respectively. As one can see in Fig. 6 and Tables 5 and 6, the differences concerning all parameters are extremely small. The number of data points, scatter and correlation coefficients are never consistently in favour of one algorithm. Note that the correlation coefficients are quite low for the Southern Hemisphere stations of Darwin, Wollongong and Lauder, which is attributed to the limited seasonal XCO2 variability at these sites. The RA is slightly in favour of OCFC (0.64 vs. 0.84 ppm for SRFC). Again we have a large uncertainty on the bias values for Lauder. Excluding this station from the relative accuracy calculation yields an RA equal to 0.53 and 0.74 ppm for OCFC and SRFC, respectively. The probability that both sample RA values stem from distributions with equal variance is 0.32.
Looking at the time series for Orleans, Lamont and Wollongong (Figs. 7 and 8), there is hardly any difference between the two algorithms. However, OCFC does feature several strong outliers at all three stations. In contrast to the station-to-station bias variability, SRFC has the lower variability in the overall seasonal bias (see Table 6). For both algorithms the autumn-winter (October through March) biases seem to be more negative than their spring-summer counterparts. This is also the case for the BESD algorithm. While the difference in overall seasonality (0.74 for OCFC vs. 0.33 for SRFC) is somewhat distinct ($P(H_0: \sigma_1^2 = \sigma_2^2)$ is 0.22), the difference in SRA (1.08 for OCFC vs. 0.89 for SRFC) is very small ($P(H_0: \sigma_1^2 = \sigma_2^2)$ is 0.68).
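The equal-variance probabilities quoted here come from an F test. The Python sketch below shows one way to compute a two-sided F-test P value from two sample standard deviations. It is not the code used for this study, and it rests on assumptions: the overall seasonality is taken to be the standard deviation of four seasonal biases (so each sample has 3 degrees of freedom), the test is taken to be two-sided, and the helper names are illustrative. The F distribution is evaluated through the regularized incomplete beta function, implemented with a standard continued-fraction expansion so that only the standard library is needed.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction for step m.
        for num in (
            m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)),
        ):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_test_equal_variance(s1, n1, s2, n2):
    """Two-sided F test of H0: sigma_1^2 = sigma_2^2 from sample std devs."""
    f = (s1 / s2) ** 2
    d1, d2 = n1 - 1, n2 - 1
    cdf = betainc(d1 / 2.0, d2 / 2.0, d1 * f / (d1 * f + d2))
    return f, 2.0 * min(cdf, 1.0 - cdf)

# Overall seasonality of OCFC (0.74) and SRFC (0.33); n = 4 seasons is an assumption.
f_stat, p_value = f_test_equal_variance(0.74, 4, 0.33, 4)
print(f"F = {f_stat:.2f}, P = {p_value:.2f}")
```

Under these assumptions the result is close to the 0.22 quoted above for the seasonality difference. The SRA probabilities depend on the number of station-season samples actually available, which is not restated here, so they are not reproduced in this sketch.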

SCIAMACHY XCH4
Both IMAP and WFMD are fairly mature proxy-type algorithms. Note, however, that since November 2005 the SCIAMACHY XCH4 retrievals have suffered from a detector degradation in channel 6. Most of the TCCON stations (apart from Park Falls, Darwin and Lauder) commenced their measurements after this event. The quality assessment in this paper is therefore primarily representative of this post-decay period.
We also have to note that during the course of the validation, we detected strong biases in the January-through-March IMAP seasonal values. This turned out to be a processing error in IMAP (one of the clusters used incorrect settings from a previous IMAP run). All the data derived from that processing unit have been removed from the IMAP data set. This reduced the amount of overlapping data from 55 626 to 42 320 points (or almost 24 %).
The differences between the algorithms are fairly distinctive (see Fig. 9 and Table 7). Most obvious is the far larger scatter (see Fig. 9b) in the WFMD data (overall 76 ppb compared to 50 ppb for IMAP). This also translates to an inferior correlation coefficient over all stations except for Wollongong (which features a negative correlation for both algorithms) and Bremen (by a very small amount). Unlike the BESD-WFMD XCO2 comparison, WFMD's higher scatter is not offset by a superior data density. So on these parameters alone IMAP seems to outperform WFMD. The larger scatter of the WFMD data is likely due to the fact that WFMD is based on unconstrained linear least squares using a single constant methane a priori profile, whereas IMAP is based on optimal estimation using methane model data as a priori information. Other factors may also contribute to the differences. For example, IMAP and WFMD differ greatly in their pre-processing steps, targeted at dealing with the problematic SCIAMACHY instrument degradation. IMAP, for instance, uses SRON's own specifically calibrated input spectra, while WFMD uses the official standard SCIAMACHY level 1 data. IMAP uses one single pixel filter (the so-called "Dead and Bad detector Pixel Mask", or DBPM), while WFMD uses several masks, each one optimized for a certain time period. However, looking at the IMAP bias distribution, all three Southern Hemisphere stations (Darwin, Wollongong and Lauder) exhibit a considerable negative bias. Also for WFMD, these three stations feature a more negative bias than their Northern Hemisphere counterparts, but not as distinctively as for IMAP. For IMAP the difference between the mean Southern and Northern Hemisphere bias is 26 ppb, while for WFMD it is 13 ppb. These values are larger than what could reasonably be explained by the TCCON laser sampling error. This also reflects itself in the relative accuracy, which is 7.8 ppb for WFMD and 14.7 ppb for IMAP.
Given the large scatter, it is difficult to assess any systematic seasonality errors in the time series plots (see Figs. 10 and 11). The IMAP underestimation at Darwin is clear, as well as the stronger scatter in WFMD. Table 8 lists the overall seasonal biases. As with the RA, we see a higher SRA in the IMAP data, although the difference is far less distinctive. For RA, $P(H_0: \sigma_1^2 = \sigma_2^2)$ is 0.09, while for SRA it equals 0.28. The difference in overall seasonality (4.0 for IMAP vs. 6.6 for WFMD) is even less significant ($P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.43).
Table 14. Overview table, listing the "relative accuracy" (RA), overall "seasonality" (Seas) and "seasonal relative accuracy" (SRA), together with their 95 % confidence interval (RA 95 %, Seas 95 % and SRA 95 %) and the probability that the obtained sample variances stem from the same population ($P(H_0: \sigma_1^2 = \sigma_2^2)$).

GOSAT XCH4
Concerning the bias (see Fig. 12 and Tables 9 and 10), as with SCIAMACHY XCH4, the Southern Hemisphere bias values tend to be somewhat lower (in absolute values) than their Northern Hemisphere counterparts, although only consistently so for SRPR and OCFP. The average Northern Hemisphere-Southern Hemisphere bias difference is 3.3 ppb for OCPR, 8.8 ppb for OCFP, 11.0 ppb for SRPR and 6.7 ppb for SRFP, all of which are considerably lower than that observed in IMAP (26 ppb). This is also reflected in the RA numbers, which range from 2.7 ppb (for OCPR) to 6 ppb (OCFP) (see Tables 9 and 10). The overall bias values themselves range from −2.5 (SRFP) to 7 ppb (OCPR). Only OCFP has a somewhat lower precision (18.1 ppb), while the overall scatter of the other algorithms ranges between 14 and 14.9 ppb. None of the algorithms is consistently better or worse across all stations involved, though (see Fig. 12 and Tables 9 and 10). Similar observations can be made about the correlation coefficients, where each algorithm comes out with the best R value at one station or more (see Fig. 12c). OCFP has the worst overall scatter, correlation and relative accuracy of all the algorithms involved, while OCPR has the best scatter, data density, relative accuracy and correlation results (the latter a tie with SRPR). The difference between the best (OCPR = 2.7 ppb) and worst (OCFP = 6 ppb) relative accuracy result is significant on a 95 % level ($P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.03). However, the difference between the two best results (OCPR and SRFP at 3 ppb) clearly is not ($P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.76). The two SRON RA values have a $P(H_0: \sigma_1^2 = \sigma_2^2)$ of 0.33. The strong difference between the Leicester algorithms can be attributed to the fact that the full-physics version of Leicester's OCO algorithm is a more recent development than its more mature proxy counterpart.
Turning to the seasonality, the full-physics algorithms outperform their respective proxy counterparts by a small margin. Interestingly, OCPR, which so far featured the best overall statistics, performs worst when looking at the seasonality. This is also somewhat evident from the time series plot (Fig. 13), where OCPR seems to underestimate the XCH4 seasonal amplitude (obvious in Lamont). The differences between the other algorithms are less obvious (Figs. 14 to 16). Of course, since OCPR is a proxy algorithm, some of the effects might come from the model used in the dry-air conversion (i.e. CarbonTracker CT2010; Peters et al., 2007). SRFP has the best seasonality, although the difference between the OCPR and SRFP seasonality is not conclusive ($P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.18). All SRA values range between 5.4 and 6.2, and no inter-algorithm difference is significant in this respect (lowest $P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.45).

Summary
Tables 13 and 14 list the overview results using all combined data as well as their 95 % confidence intervals and the equal-variance hypothesis probabilities. The results in Table 13 correspond to the overall (ALL) results in Tables 3, 5, 7 and 9. The reported errors are also derived from this complete data set (using all available data pairs). Keep in mind that the station-to-station range in bias, scatter and correlation often far exceeds the error boundaries in Table 13. That said, looking at Table 13, we see that distinctive differences between algorithms do exist; however, Table 14, which features the analysis results of the inter-station and seasonal variability, is far more ambiguous. This is of course a direct result of the difference in sample size from which the parameters in Tables 13 and 14 are obtained. Also, the inter-algorithm differences between RA values are more significant than those of seasonality and SRA, even though the latter is probably the best quality estimator of what the users have defined as the relative accuracy. To remedy this ambiguity, one would need to increase the number of sample data. One way would be to use monthly instead of seasonal means. However, this would greatly increase the number of unstable samples (due to the limited amount of correlative data from which these averages are constructed). The sample size could also be increased by using dynamic collocation criteria, using either the free tropospheric temperature as a proxy for XCO2 (Wunch et al., 2011b) or CTM model data. This would result in more data and thus a more robust data set. To ensure the robustness of our data we need to reject results which do not meet our quality criteria. A dynamic collocation method would probably result in fewer rejections and a larger data set. Such an approach was not feasible due to practical considerations. The most desirable option would be to expand the TCCON.
Note, however, that, for instance, to reach 0.95 confidence in the SRA difference between BESD and WFMD XCO2 (1.19 vs. 1.43 ppm), one would need 115 data samples (currently 21). Alternatively, with an ideal sample size of 40 (4 seasons × 10 stations), we can currently only distinguish, with 95 % confidence, an SRA of 0.5 ppm (the threshold XCO2 quality) from that of 0.68 ppm. With our best actual sample size (31) the latter becomes 0.72 ppm. Note also that in the case of XCO2 none of the algorithms' RA or SRA values reach said relative accuracy threshold value of 0.5 ppm as set forward by the users, nor do the SRA 95 % confidence bands encompass this value. What we obtain is the combined TCCON-satellite accuracy, and according to Wunch et al. (2010) the current TCCON XCO2 network accuracy (1σ, station to station) is 0.4 ppm. Adding further uncertainty due to collocation and smoothing errors, and the above-mentioned uncertainty on the analysis itself, leaves little room for an accurate assessment of such a demanding threshold value for inverse modelling purposes. Efforts to decrease the station-to-station biases between TCCON stations are thus desirable and ongoing (Dohe et al., 2013). For XCH4, SRA reaches the 10 ppb user quality threshold for all GOSAT algorithms, while SCIAMACHY WFMD's SRA approaches this number (10.5 ppb). IMAP would probably also meet this threshold if not for the Southern Hemisphere bias.
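The order of magnitude of these sample-size figures can be illustrated with a simple approximation. The Python sketch below assumes that the logarithm of a sample variance is approximately normally distributed with variance 2/(n − 1), so that the log variance ratio of two equally sized samples has variance 4/(n − 1), and that a two-sided 95 % criterion is used. This is a textbook approximation chosen for illustration; it is not necessarily the procedure behind the numbers in the text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95 % quantile, about 1.96

def samples_needed(s1, s2, z=Z95):
    """Samples per product needed to separate two standard deviations s1 and s2.

    Assumes Var[ln(s^2)] ~ 2 / (n - 1), so ln(s2^2 / s1^2) has
    variance ~ 4 / (n - 1) for two samples of equal size n.
    """
    log_ratio = 2.0 * abs(math.log(s2 / s1))
    return math.ceil(1.0 + 4.0 * z ** 2 / log_ratio ** 2)

def smallest_distinguishable(s_ref, n, z=Z95):
    """Smallest standard deviation above s_ref that n samples per product can separate."""
    return s_ref * math.exp(z / math.sqrt(n - 1))

print(samples_needed(1.19, 1.43))                   # BESD vs. WFMD SRA (ppm)
print(round(smallest_distinguishable(0.5, 40), 2))  # 4 seasons x 10 stations
print(round(smallest_distinguishable(0.5, 31), 2))  # best actual sample size
```

Under these assumptions the sketch gives 115 samples and thresholds of about 0.68 and 0.72 ppm, in line with the values quoted above. The agreement does not by itself confirm that those values were derived this way.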
Even taking into account these uncertainties, at least in certain comparison rounds, the differences between the algorithm products were distinct enough to draw binding conclusions as to which one would proceed into the next round of the GHG-CCI project. Again it must be stressed that the decisions reached were not based on the comparisons with TCCON alone (see Buchwitz et al., 2013). In the case of SCIAMACHY XCO2, we see that BESD has a superior bias, scatter and correlation compared to WFMD. Its RA, Seas and SRA values are also consistently better, albeit only the RA with reasonable confidence ($P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.06). So in this round the conclusion was to proceed with BESD.
The GOSAT XCO2 comparisons, on the other hand, yielded no clear winner. Both algorithms have comparable scatter and correlation values (in fact, using different collocation criteria yielded different winners in this category). OCFC's RA value is slightly better, while its seasonality and SRA are slightly worse. None of these differences is distinctive. As discussed in Buchwitz et al. (2013), global analysis of the data does yield, contrary to the TCCON locations, significant inter-algorithm differences. Certainly in areas with high (e.g. the Sahara) or low (e.g. Amazon forest) surface albedo, which are not covered by TCCON, differences become significant. This observation triggered the development of a new algorithm based on ensemble medians, the ensemble median algorithm (EMMA; see Reuter et al., 2012). While the EMMA algorithm might be the best solution at hand, it does not negate the pressing need for expanding the TCCON into key areas, enlarging the surface albedo range and geographical distribution of the network.
The SCIAMACHY XCH4 comparisons between IMAP and WFMD showed that in many aspects IMAP was the best-performing algorithm (scatter, data density, correlation). However, the inter-station bias difference, certainly between the Northern and Southern Hemisphere, appears to be large. This results in an inferior RA and SRA value (although the statistical certainty of the latter parameter is far less distinct, and the RA difference reaches a 0.9 confidence level only). WFMD also shows an inter-hemispheric bias difference, although a less pronounced one. Also, neither algorithm reaches the threshold single observation precision (< 34 ppb) set forward by the users. Since these issues need to be resolved first, both algorithms proceeded to the next round. Note that the results are only representative of the time period after 2005. That year featured a SCIAMACHY detector degradation in the spectral region used for methane retrievals, causing a significant deterioration of the methane retrieval quality (e.g., Buchwitz et al., 2013, and references given therein).
For GOSAT XCH4, it is the less mature OCFP algorithm that stands out in a negative way. It has distinctively more scatter and a lower correlation coefficient, and its RA value is distinctly worse than that of its proxy OCPR counterpart. OCPR, on the other hand, has the lowest scatter and highest data density, as well as the lowest RA value (although hardly distinct from its SRFP competitor). None of the algorithms, including OCFP, has a distinct SRA value. The margin by which OCPR stands out from its SRON competitors is, however, very small, and in terms of seasonality it seems to perform worse (although with $P(H_0: \sigma_1^2 = \sigma_2^2)$ = 0.18, only with little more than 80 % certainty). Given this small margin and the fact that the comparison between the proxy and full-physics SRON products shows that the full-physics method is a viable option, it was decided to proceed with both OCPR and SRFP.

Conclusions
We have analysed 10 retrieval products produced by the BESD, WFM-DOAS, IMAP-DOAS, RemoTeC and Leicester OCO algorithms. We focussed specifically on the inter-product differences. It was found that for SCIAMACHY (both XCO2 and XCH4) the competing algorithms yielded significantly different products, especially in terms of single measurement precision (i.e. scatter). In both XCO2 and XCH4, WFMD featured higher scatter than its competitor, being BESD for XCO2 and IMAP for XCH4. The latter, on the other hand, seems to suffer (more) from a significant Northern vs. Southern Hemisphere bias, an issue which requires more analysis. One reason for the larger scatter of the WFMD data product is that WFMD is based on unconstrained linear least squares, whereas BESD and IMAP are based on optimal estimation. However, there are several other retrieval properties (e.g. meteorological profiles, cloud filtering and consideration of light scattering) which influence the retrieval scatter.
Differences between all the competing GOSAT products are far less striking. For XCH4, apart from the full-physics version of Leicester's OCO algorithm (OCFP), the other algorithms (OCPR and SRON's RemoTeC full-physics (SRFP) and proxy (SRPR) products) are very alike. In terms of precision, the proxy versions, especially OCPR, seem to have a slight edge, but in terms of inter-station bias variability and capturing the seasonal cycle, the SRON full-physics algorithm is more than competitive. In fact, there are indications that OCPR underestimates the XCH4 seasonal amplitude, but this might be due to the CarbonTracker CT2010 model (Peters et al., 2007) used in the dry-air conversion instead of the proxy algorithm itself. For XCO2, the competing products are closer still. Differences are small for all obtained statistical parameters, and no one algorithm betters the others across the board. This does not imply that these products feature no differences at all. In some regions (e.g. South America, Africa, China) differences between algorithms can be substantial, but there are no TCCON data available in these regions to discriminate between algorithm performance (Reuter et al., 2012).
The relative accuracy and single observation precision threshold quality criteria for inverse modelling (10 ppb and 34 ppb XCH4, respectively) have been reached by all GOSAT XCH4 products. If the inter-hemispheric bias difference is mitigated (in a future version of the product or by using in situ data; see Bergamaschi et al., 2009), the SCIAMACHY XCH4 products will in all likelihood also reach the relative accuracy criterion. However, both IMAP and WFMD XCH4 still do not reach the precision user requirement. Again it needs to be pointed out that the validation results presented here are dominated by data generated after 2005, when SCIAMACHY suffered from detector degradation in the spectral region used for methane retrieval. The results presented here are therefore not representative of 2005 and earlier years, when the quality of the SCIAMACHY methane retrievals was much better (e.g., Buchwitz et al., 2013, and references given therein).
For XCO2, all algorithms reach the single observation precision threshold (8 ppm), but none of the algorithms meet the relative accuracy user requirement (0.5 ppm). Unfortunately, given the current constellation of TCCON measurements, the assessment of whether an algorithm product has indeed reached this demanding value contains considerable uncertainty by itself. An expansion of the TCCON into key geographic areas and efforts to even further reduce the TCCON station-to-station biases would be most welcome in this respect.