Comprehensive laboratory and field testing of cavity ring-down spectroscopy analyzers measuring H2O, CO2, CH4 and CO

To develop an accurate measurement network of greenhouse gases, instruments in the field need to be stable and precise and thus require infrequent calibrations and a low consumption of consumables. For about 10 years, cavity ring-down spectroscopy (CRDS) analyzers have been available that meet these stringent requirements for precision and stability. Here, we present the results of tests of CRDS instruments in the laboratory (47 instruments) and in the field (15 instruments). The precision and stability of the measurements are studied. We demonstrate that, thanks to rigorous testing, newer models generally perform better than older models, especially in terms of reproducibility between instruments. In the field, we see the importance of individual diagnostics during the installation phase, and we show the value of calibration and target gases that assess the quality of the data. Finally, we formulate recommendations for use of these analyzers in the field.


Introduction
The Integrated Carbon Observation System (ICOS) is a European research infrastructure project that is currently reaching its operational phase after a 5-year preparatory phase (http://www.icos-infrastructure.eu/). Its goal is to provide the high-quality observations needed to understand the long-term trend and spatial distribution of greenhouse gas emissions. For this purpose, ICOS is setting up monitoring networks of greenhouse gases over Europe in the atmosphere, in ecosystems and at the surface of the oceans. In addition to the networks, central facilities have been designed during the preparatory phase to coordinate and standardize field operations and measurement protocols. Different methods will be developed to infer continental and regional carbon budgets from these measurements. The atmospheric approach using inverse modeling (Carouge et al., 2010a, b; Bousquet et al., 2011, 2013) relies on our ability to characterize very precisely the regional gradients of greenhouse gases over Europe. The efficient mixing of air masses in the troposphere has the advantage of integrating the signals from highly variable surface sources and sinks, but it may also have the drawback of smoothing the regional gradients very quickly. Previous studies have shown that the monthly mean gradients over Europe are generally lower than 10 ppm for CO2 (Ramonet et al., 2009; Xueref-Remy et al., 2011a, b) and 100 ppb for CH4 (from ICOS data). Moreover, the interannual variability of CO2 is even lower, just a few parts per million (Ramonet et al., 2009). This typical greenhouse gas variability over Europe provides the most important data quality constraint for ICOS-like observations. However, other parameters such as robustness, stability and low maintenance requirements also influence the choice of instruments to be used in field stations like the sites in the ICOS infrastructure. Gas chromatography (GC) systems (Yver et al., 2009; Schmidt et al., 2014) or monitors based on nondispersive infrared (NDIR) sensors (Xueref-Remy et al., 2011a; Andrews et al., 2014; Schmidt et al., 2014) have been widely used over the last decades for high-precision measurements of CO2, as well as CH4 and N2O for GC. Both technologies require hourly to daily calibrations to produce high-precision measurements, and GC additionally requires a relatively high level of expertise. This usually leads to high maintenance and difficult installation of these instruments in remote locations. For about 10 years, cavity ring-down spectroscopy (CRDS) analyzers have been developed and commercialized by a few companies (Crosson, 2008; Peltola et al., 2014). This new generation of sensors for greenhouse gases can be easily deployed in the field and requires less maintenance and fewer consumables than GC and NDIR technologies.
Published by Copernicus Publications on behalf of the European Geosciences Union.
In the framework of ICOS, the ICOS Atmospheric Thematic Centre (ATC) metrology laboratory (MLab hereafter), based at the Laboratoire des Sciences du Climat et de l'Environnement (LSCE), is responsible for testing every instrument within the ICOS atmospheric network. Here, we present the results of tests for CRDS instruments that measure CO2, CH4 and CO (models ESP1000, G1301, G1302, G2301, G2302 and G2401; Picarro, Inc., Santa Clara, CA, USA). We show the results for 47 instruments tested in the laboratory and for 15 field instruments that have been measuring for at least a year at sites instrumented for greenhouse gases. These instruments have been dispatched to stations in various countries and environments: several stations in France within the SNO-RAMCES (National Observation Service - ICOS France, https://icos-atc.lsce.ipsl.fr/?q=stations), stations in harsh environments in Côte d'Ivoire, French Guiana and Bolivia as part of the CarboAfrica project (http://www.carboafrica.net/index_en.asp) or the ICOS-INWIRE project (http://www.icos-inwire.lsce.ipsl.fr/), and other ICOS stations in Europe. Other types of analyzers using different technologies or measuring different components such as N2O or carbon isotopes have also been tested by the MLab but are not discussed in this study.
In the first section, we describe the analysis technique and the different models of instruments. Then, the protocols and the metrics used to assess the performance of the instruments are defined. In the last section, results for each species (CO2, CH4 and CO) are presented and discussed. Finally, in the conclusion, we sum up recommendations for the use of these instruments in the field.

Instruments
All the results presented in this study come from tests performed at the manufacturer, at the MLab and in the field on CRDS analyzers manufactured by Picarro, Inc. between 2008 and 2014. They cover five different instrument models and up to four species (CO2, CH4, CO, H2O).
The CRDS technique can be described as follows. A laser source is used to excite a measurement cell, which consists of a low-loss optical resonant cavity composed of at least two concave high-reflectivity mirrors. As the injected laser light propagates back and forth between the mirrors, a portion of the light is transmitted through the mirror after each pass. A photosensitive detector located behind one of the mirrors monitors the time decay of the laser light. The decay (or ring-down) time depends on the cavity loss but also on the presence of any absorbing species inside the cavity. Thus, higher concentrations of the target analyte molecule in the cavity correspond to shorter ring-down times. This technique and the specifics of each model are described in Crosson (2008), Chen et al. (2010, 2013) and Rella et al. (2013). For the instruments considered here, near-infrared telecom lasers are used. It is important to note that although the chosen CO2, CH4 and H2O absorption lines are fairly well separated from other absorption features, the absorption line for CO lies between CO2 and H2O lines and has a relatively low strength compared to the others, as can be seen in Fig. 1. It is therefore challenging to measure CO at a high precision.
The three models ESP1000, G1301 and G2301 measure CO2, CH4 and H2O using two lasers (one for CO2, one for CH4 and H2O). The G1302 and G2302 instruments measure CO2, CO and H2O, and the G2401 instruments measure all four species (using one more laser for CO). It should be noted that three of the G1302 instruments tested at the MLab were the first instruments of this model, and two of them (CKADS04 and CKADS07) were tested only before they were upgraded for H2O-CO2-CO cross-talk. The actual performance of these instruments should therefore be better than demonstrated here. The instruments considered in this analysis are listed in Table 1 with their serial number, model, time of purchase, database identifier when existing, field sites if used here, measured species (except H2O) and the main results of the different repeatability tests.

Protocols and metrics used in the study
The metrics defined in the study follow the International Vocabulary of Metrology guidelines (VIM, http://www.bipm.org/en/publications/guides/vim.html) and the Global Atmosphere Watch guidelines (GAW, http://gaw.empa.ch/glossary/glossary.html#section_2). They are usually calculated under repeatability conditions of measurement, where all conditions stay identical over a short period of time. The continuous measurement repeatability, commonly called precision, is a repeatability measure applied to continuous measurements. The long-term repeatability, commonly called reproducibility in the atmospheric community, is a repeatability measure over an extended period of time (Andrews et al., 2014; Schmidt et al., 2014). What is hereafter called short-term repeatability is what is defined as the repeatability in the VIM and GAW glossaries. In the present study we compare the results of tests done at the factory before delivery of the analyzers, at the MLab before deployment in the field and at a selection of monitoring sites. Ideally, we would like to apply exactly the same protocols at the three locations, but due to time and design constraints not all tests can be performed, or the protocols have to be adapted. The different protocols are described hereafter. The common abbreviations used throughout the text and figures are redefined in Table A1 in the Appendix.
Table 1. The 47 analyzers considered in this study. Their serial number, model, ICOS number when attributed and period of purchase are indicated in the first columns. If field results are presented here, the site as well as the number of months the instrument was used in the span of this study (+ indicates that the instrument is still running) and the percentage of valid (V) and invalid (I) data are detailed. For the MLab site, these are the reference instruments used for comparison. Comparable results (continuous measurement repeatability (CMR), short-term repeatability (STR) and long-term repeatability (LTR)) between factory, MLab and field are shown for all species.

Factory tests
All instruments are tested before leaving the factory, and a certificate of compliance is provided with the instrument. For all instruments and species (except H2O), the continuous measurement repeatability, the short-term drift, the accuracy and the short-term repeatability are tested. The statistical parameters used to characterize each instrument are estimated by measuring dry reference gases (targets) and are defined as follows.

Continuous measurement repeatability
The continuous measurement repeatability (CMR, called precision in the certificate of compliance) is calculated as the average over 30 h of 5 min interval SD of raw data (frequency about 0.5 Hz).
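The factory definition above can be sketched in a few lines of Python. This is a minimal illustration, not the manufacturer's code: the function name, the use of consecutive non-overlapping windows and the choice of the sample SD are assumptions, since the certificate of compliance does not specify them.

```python
from statistics import mean, stdev


def continuous_measurement_repeatability(times_s, values, window_s=300.0):
    """Average of the per-window SDs of raw data (hypothetical helper).

    times_s: sample times in seconds, ascending; values: raw mixing ratios.
    Assumption: consecutive, non-overlapping windows of window_s seconds
    (300 s = 5 min) and the sample (n - 1) standard deviation.
    """
    if len(times_s) != len(values):
        raise ValueError("times_s and values must have the same length")
    windows = {}
    t0 = times_s[0]
    for t, v in zip(times_s, values):
        windows.setdefault(int((t - t0) // window_s), []).append(v)
    # A standard deviation needs at least two samples in the window.
    sds = [stdev(w) for w in windows.values() if len(w) >= 2]
    return mean(sds)
```

With raw data at about 0.5 Hz, each 5 min window holds roughly 150 samples, and a 30 h record yields 360 windows to average.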

Short-term drift
For the first two generations of instruments (ESP1000 and G1301), the short-term drift was the peak-to-peak amplitude of the 5 min averaged data over 30 h. From the third generation on, the drift is defined as the peak-to-peak amplitude of the 50 min averaged data over 30 h. For CO, the drift is again defined differently, as the peak-to-peak amplitude of the 5 min averaged data over 24 h.
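As an illustration of these definitions, the sketch below computes a peak-to-peak drift from window averages. The function name and the use of consecutive, non-overlapping averaging windows are assumptions; the averaging window (5 or 50 min) and the record length (24 or 30 h) are left to the caller.

```python
from statistics import mean


def short_term_drift(times_s, values, window_s=3000.0):
    """Peak-to-peak amplitude of consecutive window averages.

    Assumption: non-overlapping windows of window_s seconds
    (3000 s = 50 min; pass 300.0 for the 5 min variants).
    """
    windows = {}
    t0 = times_s[0]
    for t, v in zip(times_s, values):
        windows.setdefault(int((t - t0) // window_s), []).append(v)
    averages = [mean(w) for w in windows.values()]
    return max(averages) - min(averages)
```

Because the three definitions differ in averaging window and duration, drift values are only directly comparable between instruments tested under the same definition.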

Short-term repeatability
The short-term repeatability (STR) is measured by cycling two gases at 10 min intervals for 2 h. The data from 8 min 50 s to 9 min 50 s of each interval are averaged, and the SD of the averages is calculated.

Accuracy
The accuracy against the factory internal scale was first evaluated with the mean of the raw data over 30 h, then later over 30 min only. The results from this test are not discussed in this study.

MLab tests
Since the beginning of the MLab operation in 2008, the protocols and metrics used to evaluate the instruments have evolved. However, for our analysis, most of the data sets could be reanalyzed and the latest version of the protocols applied.
Under the present protocol, the MLab keeps each instrument for about 1 month to perform all tests, which use dry reference gases calibrated in agreement with WMO scales, comparisons with reference instruments, and a drying/humidifying system to evaluate the sensitivity to water vapor content. A detailed report is provided for each instrument. The cylinders are aluminum tanks (older ones, used only as test cylinders: LUXFER, UK, aluminum alloy 6061 with VTI Ventil Technik GmbH stainless steel valves; newer ones: LUXFER, UK, aluminum alloy 6061 with Rotarex membrane valves (D200 type with PCTFE seal) with brass or stainless steel body) fitted with brass or stainless steel pressure regulators from Air Liquide America Specialty Gases LLC (previously Scott). They are filled either with dry natural air or with dry synthetic air. The isotopic composition of the synthetic air cylinders is controlled to correct for any bias compared to natural air. Indeed, as the CRDS instruments are sensitive only to the major isotopologue (12C16O2 or 12CH4), a bias in the isotopic composition would lead to a bias in the mixing ratio. The synthetic air cylinders are used for linearity and calibration purposes; their gas concentrations span the range of ambient air concentrations and are calibrated against the international WMO scales. The natural air cylinders are filled at the MLab using a RIX Industries oil-free compressor and used as test cylinders; their concentrations are close to the average ambient air concentrations and allow us to evaluate the performance of the instruments as defined hereafter. The gases in these cylinders are referred to as target gases.
For the first instruments, in order to develop meaningful tests, the stabilization time of each instrument was evaluated to determine how long a cylinder had to be measured before the reading was stable. This stabilization period has been found to vary between 5 and 15 min depending on the water content of the previous sample and the length of its analysis (not shown). It also depends on the length of the sampling lines and on the design of the whole system (dead volume, flush volume, etc.), which is relatively uniform in the MLab but can vary from site to site. These first tests have helped to define the length of the measurement intervals for the now standard protocols. Inlet pressure tests have also been carried out to evaluate the influence of the inlet pressure on the measurements. Results (not shown) highlight the importance of keeping the difference between the calibration gases and the rest of the samples below 0.4 bar to avoid a significant inlet pressure influence, which leads to systematic biases.
As part of the latest MLab protocol, 11 criteria are now evaluated for each instrument. These are defined below. Target gases are used to evaluate most of the metrics, except the calibration, linearity and comparison with reference instruments.

Continuous measurement repeatability
The continuous measurement repeatability is evaluated with the SD of the continuous measurements of a cylinder over 24 h as described above.

Short-term drift
The short-term drift is defined as the peak-to-peak amplitude of the same measurements.
These two metrics are evaluated for different integration times (typically, raw data, 1 min and 1 h). Usually, in the synthesis report, we provide values for 1 min and 1 h averages.

Allan deviation
The Allan deviation, which shows the stability as a function of the integration time and informs about the optimal integration time, is also calculated and provided in the synthesis report.
These three metrics are illustrated in the Appendix in Fig. A1.
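For readers unfamiliar with the Allan deviation, a minimal sketch of the non-overlapping estimator is given below. It is an illustration only: the text does not state which estimator the MLab uses, and the function name and the assumption of evenly sampled data are ours. The averaging time is m times the sampling interval (about 2 s for raw data at 0.5 Hz).

```python
from math import sqrt
from statistics import mean


def allan_deviation(values, m):
    """Non-overlapping Allan deviation for an averaging factor of m samples.

    Assumption: values are evenly spaced in time, so the averaging time
    is m times the sampling interval.
    """
    n_bins = len(values) // m
    if n_bins < 2:
        raise ValueError("need at least two complete bins")
    # Average the data in consecutive bins of m samples.
    bins = [mean(values[k * m:(k + 1) * m]) for k in range(n_bins)]
    # Half the mean squared difference between successive bin averages.
    diffs_sq = [(b1 - b0) ** 2 for b0, b1 in zip(bins, bins[1:])]
    return sqrt(sum(diffs_sq) / (2 * (n_bins - 1)))
```

Evaluating the function over a range of m and plotting the result against the averaging time gives the kind of curve from which an optimal integration time can be read.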

Short-term repeatability
The short-term repeatability is defined as the repeated measure of a sample over a short period of time (about 3 h). In the laboratory, a target gas is measured 10 times in 15 min sequences bracketed by 5 min of wet ambient air measurements. For each measure, only the last 9 min are averaged. The repeatability is then expressed through the mean and SD of these averaged measures.
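The protocol above can be sketched as follows, assuming each target-gas injection is available as a pair of time and value lists. The function name and data layout are hypothetical, and the sample SD is an assumption.

```python
from statistics import mean, stdev


def short_term_repeatability(injections, keep_s=540.0):
    """Mean and SD of per-injection averages (hypothetical helper).

    injections: list of (times_s, values) pairs, one per injection.
    Only the last keep_s seconds of each injection are averaged
    (540 s = 9 min); the sample SD is an assumption.
    """
    averages = []
    for times_s, values in injections:
        t_end = times_s[-1]
        kept = [v for t, v in zip(times_s, values) if t > t_end - keep_s]
        averages.append(mean(kept))
    return mean(averages), stdev(averages)
```

The same function applies to the long-term repeatability test by passing the 30 min injections of a 3-day exercise with keep_s=600.0.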

Long-term repeatability
The long-term repeatability (LTR) is comparable to the short-term repeatability but on a longer timescale (3 days). In the laboratory, a target gas is measured for 30 min bracketed by around 5 h of wet ambient air over 72 h of total measurements. For each measure, only the last 10 min are averaged. The long-term repeatability is then expressed through the SD of these averaged measures. Typically, several 3-day exercises are performed and the results compared and aggregated at the end of the 1-month instrument test period. Fig. A2 shows an example of short-term and long-term repeatability. For each species the mean, the SD and the drift are calculated.

Ambient temperature and pressure dependence
For the latest instruments, since 2013, the temperature and pressure dependencies have also been tested at the MLab. For the pressure, we plot the target gas measurements made during the long-term repeatability test against the atmospheric pressure over several days and evaluate the correlation between the two. For the temperature dependence, the room temperature has until now been varied using the room air conditioning system, and we plot the target gas measurements against this varying temperature. Plans have been made to acquire a temperature-controlled chamber. As for the pressure, the correlation between the two is calculated. Two examples are shown in Fig. A3, with the linear regression and the correlation coefficient calculated for each case.
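The regression and correlation used for these dependence tests can be sketched with an ordinary least-squares fit. The function name is hypothetical, and plain OLS is an assumption; the text only states that a linear regression and a correlation coefficient are calculated.

```python
def dependence_fit(x, y):
    """Least-squares fit y = slope * x + intercept, with R squared.

    x: ambient temperature or pressure; y: target gas mixing ratio.
    Assumption: ordinary least squares; x must not be constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # A constant y has no variance to explain; report R squared as 0.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2
```

The slope gives the sensitivity (for example in ppb per hPa), while R squared indicates how much of the target gas variability follows the ambient parameter.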

Water vapor correction
An important test for the CRDS instruments is the evaluation of the water vapor correction. The applied factory correction is the same for all instruments, even though not all of them have the same response to water vapor. It is also not always possible to measure only dry air in the field. Over the years, different tests have been applied to evaluate this correction, from comparing two instruments measuring the same air, one with a drying system and one without, to the latest tests that progressively increase the humidity of the measured gas stream. The complete methodology and the results of these tests will be treated in a separate publication.

Calibration
For an operational network, it is crucial to report data that are not only precise but also accurate and linked to each other by a common scale. Although each instrument is usually calibrated at the factory, the calibration scale used there is not linked to the international WMO standards. Moreover, a regular calibration allows us to correct for long-term drift in the instrument. In the laboratory, at least four calibration sequences are done to determine the calibration function that links the measured values to the assigned values. In each calibration sequence, three to four standard gases are measured one after the other for 30 min each, at least four times (each set of measurements of the four cylinders is hereafter called a cycle; see Fig. A4). Then the calibration function is calculated using a linear fit. The calibration standards are themselves calibrated against the international primary scale of each species (WMO X2007 for CO2, NOAA04 for CH4 and WMO CO X2004 for CO; Zhao and Tans, 2006; Dlugokencky et al., 2005; Novelli et al., 1994). Since 2008, six different secondary scales have been used. These calibrations are used for the comparison tests at the MLab.
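A minimal sketch of such a calibration function is given below. The function names and the dictionary layout are hypothetical, and pooling all cycles into one mean per cylinder before fitting is an assumption; the text only states that a linear fit links measured to assigned values.

```python
from statistics import mean


def fit_calibration(measured_by_cylinder, assigned_by_cylinder):
    """Return (a, b) such that calibrated = a * measured + b.

    measured_by_cylinder: {cylinder_id: [cycle averages]}
    assigned_by_cylinder: {cylinder_id: assigned value on the WMO scale}
    Assumption: cycles are pooled into one mean per cylinder.
    """
    ids = sorted(measured_by_cylinder)
    x = [mean(measured_by_cylinder[c]) for c in ids]
    y = [assigned_by_cylinder[c] for c in ids]
    n = len(ids)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def apply_calibration(raw, a, b):
    """Convert a raw instrument reading to the calibration scale."""
    return a * raw + b
```

Once a and b are known, apply_calibration converts ambient air or target gas readings to the common scale until the next calibration sequence.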

Linearity
The linearity of the instrument is also evaluated. For the first instruments, the same cylinders as for calibration (four cylinders) were used. Then, two cylinders (one with low and one with high concentrations; see Fig. A5) were added to the set. The residuals from the fit are calculated, and their values along with the correlation coefficient allow us to judge the linearity of the instrument against the calibration scale. It is important to note that the validity of this test depends strongly on the proper assignment of the concentrations of each calibration cylinder, hence the importance of the link to international scales and of the regular recalibration of the MLab calibration cylinders against a "master" set of cylinders provided by the central calibration laboratories.
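The residual calculation can be illustrated as follows, given a slope a and intercept b from a linear calibration fit. The function name and the sign convention (assigned minus fitted) are assumptions.

```python
def calibration_residuals(measured, assigned, a, b):
    """Residuals of the linear calibration fit, one per cylinder.

    Assumption: residual = assigned value - (a * measured value + b).
    """
    return [y - (a * x + b) for x, y in zip(measured, assigned)]


# Illustrative numbers only, not data from the study.
residuals = calibration_residuals(
    [380.0, 400.0, 420.0], [381.5, 401.0, 420.5], a=1.0, b=1.0
)
largest = max(abs(r) for r in residuals)
```

Comparing the largest residual with the instrument precision indicates whether the deviation from the fit is distinguishable from measurement noise.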

Comparison with reference instruments
Finally, ambient air measurements from each instrument are compared with those of reference instruments maintained by the MLab. The MLab is located in Gif-sur-Yvette, about 50 km southwest of Paris. We are thus sampling suburban air, with large variability as we are looking at 1 min averages. Initially, the CRDS analyzers were compared to the gas chromatograph system and, if available, to another CRDS analyzer under test. Since the end of 2011, most of the instruments have been tested against the same CRDS reference instrument for CO2 and CH4 (CFCDS03). For CO, a CRDS reference instrument has also been chosen since the end of 2013 (CFKADS2127). The tested instrument measures wet and dry air and is compared to the MLab reference instrument, which measures ambient air dried through a cryogenic water trap. This allows us to check the factory and MLab water vapor corrections and to estimate the biases. In Fig. A6, the comparison for CH4 is shown. The H2O and target gas measurements allow a quality check of the tests. The histogram can point out outliers if the distribution is strongly non-Gaussian. The difference between the wet corrected air and the dry air in the left panel (about 1.2 ppb on average, compared to −0.03 ppb when both instruments measure dry air) is due to the automatic H2O correction, which is here not sufficient to correct all the bias introduced by H2O.
C. Yver Kwok et al.: Laboratory and field testing of CRDS analyzers

Field tests
In the field, we use the calibration and target gases to estimate instrument performance. Usually, one or two target gases are measured regularly for quality control purposes. If only one cylinder is used, this cylinder is a so-called short-term target and is measured once or twice a day. If two cylinders are in use, the second one is the long-term target and is measured at the same time as the calibrations, usually every 2-4 weeks. The short-term target tank lasts about 1-2 years; the long-term targets in principle last 10 years or more, allowing continuity throughout the station lifetime. It is important to note that the analysis chain setup has an influence that can be hard to separate from the instrument performance. At the sites in this study, the ambient air sampling lines are usually Synflex 1300 (EATON) outside of the shelter and stainless steel tubing inside. Ambient air can be dried using either a cryocooler or a Nafion membrane. A Valco (VICI) valve is used to distribute the different gases to the instrument. In the following tests, however, we try to evaluate the influence of the setup, either by looking at specific data or by using the MLab tests as a comparison. We investigate the stabilization time of the instrument for each sample measurement interval and for each calibration sequence. We also look at the evolution of the calibration equation and its residuals, which give insights into the linearity and stability of the instrument. We estimate the field continuous measurement repeatability, short-term repeatability and long-term repeatability using the target gas measurements. Finally, we study the instrumental drift as a function of gas pressure or temperature.

Stabilization time within one measurement interval
For the stabilization time within one measurement interval, we select the last measurement interval of the last tank of the calibration sequence, to avoid the influence of water from potentially humid ambient air samples and to ensure the flush and equilibration of the tank pressure regulators. We are indeed trying to look only at the performance of the instrument itself, independently of the analysis chain setup. We calculate the minute averages within the interval and then the difference between each average and that of the last minute of analysis.
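The calculation described above can be sketched as follows. The function names are hypothetical, the last (possibly partial) minute bin is taken as the reference, and the tolerance used to declare the signal stable is an illustrative parameter, not a value from the study.

```python
from statistics import mean


def stabilization_profile(times_s, values):
    """Minute averages minus the average of the last minute."""
    t0 = times_s[0]
    minutes = {}
    for t, v in zip(times_s, values):
        minutes.setdefault(int((t - t0) // 60), []).append(v)
    averages = [mean(minutes[k]) for k in sorted(minutes)]
    return [avg - averages[-1] for avg in averages]


def minutes_to_stabilize(profile, tolerance):
    """First minute from which all differences stay within tolerance.

    tolerance is a hypothetical threshold chosen by the user.
    """
    for k in range(len(profile)):
        if all(abs(d) <= tolerance for d in profile[k:]):
            return k
    return len(profile)
```

Choosing the tolerance close to the instrument precision ties the stabilization time to what the analyzer can actually resolve.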

Stabilization time within one calibration sequence
For the stabilization time within one calibration sequence, we compare the average of the last 10-15 min of each interval for the last cylinder to the last measurement interval.

Instrument long-term stability
We also look at the stability of the instrument by following the evolution of the calibration equation with time, and we evaluate whether the periods between calibrations allow us to capture this evolution. Finally, we look at the evolution of the linear fit residuals to investigate the linearity of the instrument over time.

Field continuous measurement repeatability equivalent
The target gases in the field are not measured continuously for 24 h. However, the short-term target is measured at least once a day for 20 to 30 min. Here, as an equivalent of the continuous measurement repeatability, we calculate the monthly average of the SDs of raw data over 1 min intervals.

Field long-term repeatability
For this value, we calculate the SD of the averaged target measurement intervals over 3 days as in the MLab; we then calculate the monthly average of this number for graphical legibility.

Pressure vs. temperature as a source of instrumental drift
By studying the drift of the calibration constants of the instruments over time, we have an opportunity to study the behavior of this population of instruments. The quantities evaluated are CO2frac and CH4frac, the fractional changes of the CO2 and CH4 concentrations relative to reference ambient air mixing ratios of 390 ppm and 1900 ppb, computed from the slopes a and intercepts b of the calibration fits. We consider two different sources of drift: gas pressure and gas temperature. Both quantities affect the number density of molecules in the CRDS optical cavity as well as the shape of the spectral line from which the mole fraction is derived. To measure the effect of temperature and pressure on the fractional change in carbon dioxide and methane, experiments have been performed at the factory on a single CRDS instrument (model G2301) in which the temperature and pressure of the gas sample were changed and the fractional changes in carbon dioxide and methane were measured. These experiments show that temperature drift has a fundamentally different character than pressure drift. That is to say, when the temperature drifts, the fractional change of CH4 vs. the fractional change of CO2 has a slope of 5.3/4.6 = 1.15; for pressure drifts, this ratio is 3.5/1.3 = 2.7. This difference in ratios means that we can discriminate between the two mechanisms by looking at the slope for each instrument data set.
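The discrimination between the two mechanisms can be illustrated with the short sketch below, which takes series of fractional changes as input. The function names are hypothetical, and assigning the mechanism whose reference ratio is closest to the fitted slope is our simplification, not the procedure used in the study.

```python
TEMPERATURE_RATIO = 5.3 / 4.6  # about 1.15, from the factory experiment
PRESSURE_RATIO = 3.5 / 1.3     # about 2.7, from the factory experiment


def drift_slope(co2_frac, ch4_frac):
    """Least-squares slope of CH4 vs. CO2 fractional changes."""
    n = len(co2_frac)
    mx, my = sum(co2_frac) / n, sum(ch4_frac) / n
    sxx = sum((x - mx) ** 2 for x in co2_frac)
    sxy = sum((x - mx) * (y - my) for x, y in zip(co2_frac, ch4_frac))
    return sxy / sxx


def likely_mechanism(slope):
    """Assumption: pick the mechanism whose ratio is nearest the slope."""
    if abs(slope - TEMPERATURE_RATIO) < abs(slope - PRESSURE_RATIO):
        return "temperature"
    return "pressure"
```

A slope far from both reference ratios would suggest a mixture of the two mechanisms or another source of drift, which this simple classification cannot identify.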

Allan deviation
The first test in the MLab is the continuous measurement repeatability measurement, which allows us to plot the Allan deviation vs. the averaging time and already gives good insight into the stability of the instrument. In Fig. 2, the Allan deviation is plotted for all instruments and for the three species. Averaging times of 1 min and 1 h are plotted as vertical solid black lines. For all instruments, the Allan deviation decreases and on average reaches its minimum around 1 h of averaging time. We generally observe that the first generation of instruments, or the first instruments of the next generation, perform less well for the smaller averaging times. For example, we see for CO2 and CH4 that below 1 min averaging time, ESP1000 and G1301 have a higher Allan deviation than the G2301, except for one G2301 which happens to be the first purchased and tested in the laboratory. In the same way, the first G2401 performed less well than the next instruments. For CO, we see the same pattern, with the first CO instruments (G1302) performing less well than the next generation. For the two instruments that were upgraded for water-CO2-CO cross-talk only after the tests, we also found that for CO2 they performed less well than the previous generation (Allan deviation above 0.03 at 2 s). Looking at the upgraded instrument (Allan deviation of 0.02 at 2 s), we see that this has probably been corrected with the upgrade. For longer averaging periods between 1 min and 1 h, the results differ much less, except, as said before, for the two G1302 instruments tested before upgrade. It is also interesting to note that for CH4 the first models ESP1000 and G1301 performed better than the following models, which is thought to be due to a change in the electronic design. Since this study, a thorough investigation concerning the electronic design changes and their effect on CH4 averaging has been conducted by the manufacturer. The issue has been identified and corrective actions have been taken, restoring the expected performance in terms of CH4 averaging according to the manufacturer. This will have to be confirmed when testing new instruments. It is important to note that for a given model the performance is very consistent, with very few outliers, especially for the latest models.

Comparison of the results from the factory, the MLab and the field
In this section, we look at the performance of the groups of instruments using some of the metrics defined in the previous sections. Results from the factory, the MLab and the field sites are shown in Table 1 and in Figs. 3-6. Only the MLab continuous measurement repeatability using raw data is presented, in order to compare the three values (first panel of Figs. 3-5). The short-term repeatability cannot be calculated in the field but is evaluated at the factory and at the MLab (second panel). The long-term repeatability is not evaluated at the factory but is calculated at the MLab and in the field (third panel). Finally, the fourth panel highlights the dry ambient air comparison (mean difference) between the instruments tested at the MLab and the reference instruments. For the continuous measurement repeatability, we observe that for CO2 and CH4 the factory values are usually slightly lower than in the MLab but with a negligible difference (−0.003 ppm for CO2 and −0.06 ppb for CH4 on average when the three first outliers are excluded). These three outliers are the first three ESP1000 instruments, for which the factory values are about 2 times higher than the MLab values. For CO, the sign of the difference is not systematic and the difference is small on average (0.14 ppb) if we exclude one outlier (one G1302). For the short-term repeatability, we observe higher values at the factory than at the MLab (on average 0.024 ppm for CO2, 0.15 ppb for CH4 and 0.5 ppb for CO, with one outlier excluded for CO), which can be due to the different protocols (6 times 10 min with only 1 min kept against 10 times 15 min with 9 min kept). The difference is indeed relatively constant over the instruments, especially for CO2, where it is about twice the noise of the instruments as measured at the MLab. For the long-term repeatability, the field results are very close to the MLab results, showing that the instruments kept their performance over time on average and that the linear regression for the field calibrations stays valid with time.
Finally, in terms of bias to the reference instrument, the average of the difference is usually within the WMO compatibility goals.
For all species, and especially CO2 and CO, we can see that the manufacturer did improve the performance of its instruments, which is shown in the continuous measurement, short-term and long-term repeatability tests. Moreover, the performance is more reproducible, with a smaller dispersion of the results. In general, the precision has been improved by a factor of 2 to 3, as can be seen from Fig. 6, which presents the synthesis of the instrument performance averaged by model.

Ambient temperature and pressure dependence
Most of the instruments tested at the MLab show a limited sensitivity to temperature and pressure.However, some instruments present a higher dependence.In the case of the temperature changes, if these changes are slow and within the range guaranteed by the manufacturer (10-35 • C), the instruments are temperature independent.However in case of rapid temperature changes, like in an airplane, a temperature dependence can appear for CO 2 and CH 4 .It is caused by the fact that the cavity cannot regulate its temperature as fast as the outside changes.As the mixing ratios are calculated assuming a fixed temperature, small changes in this temperature lead to biases.Of all the 29 instruments tested for temperature dependence, only 2 presented a significant temperature dependence (R 2 > 0.5) using rapid changes.This is shown on one mobile instrument (CFKBDS2132) tested at the MLab as described above and using a temperaturecontrolled chamber that allows us to set different ramps and steps (see Fig. 7).In the shown experiment (lower panel of Fig. 7), a target gas was measured for 16 h with the temperature varying from 10 to 35 • C. The temperature was increased by 5 • C / 30 min ramps with 1 h steps.We observe a clear dependence (R 2 > 0.5) for CO 2 and CH 4 with 0.01 and 0.03 ppb • C −1 respectively.With longer ramps (not shown), the dependence was smaller but still the species were highly correlated with the room temperature.With the usual rapid change test (upper panel of Fig. 7), this dependence was also clear and comparable.
In the case of atmospheric pressure variations, CO2 and CH4 are not significantly affected, but for some instruments CO presents a significant dependence. At the MLab, out of 16 instruments measuring CO that underwent the pressure dependence tests, 5 presented a dependence higher than 4 ppb hPa⁻¹. One of them (CFKADS2084) presented a dependence of more than 7 ppb hPa⁻¹ at the first test and was then sent back to the factory for upgrade. After the upgrade, the dependence was significantly reduced to 0.04 ppb hPa⁻¹, as can be seen in Fig. 8.

Calibration and linearity
To evaluate the linearity of the instruments, we have to be confident in the assignment of our cylinders. In the past, we have used different instruments (GC, Fourier transform infrared spectroscopy) and sets of cylinders to assign the values of the MLab calibration sets. We are still in the process of reevaluating these values when possible. Here, to assess whether the amplitude of the observed residuals is due to the instruments or to the calibration scales, we have plotted the residuals of each cylinder for the 14 instruments that used the latest set of calibration cylinders. In Fig. 9, we see that the residuals behave the same way for every instrument. For each assigned concentration, we found similar residuals for every tested instrument. Either all instruments behave exactly the same, or (the most probable hypothesis) the MLab calibration cylinders are not perfectly assigned. However, for this scale at least, we see that the residuals are about the same as the instrument precision, especially for CO and CH4. This allows us to be confident in the linearity of the instruments, as the difference between the linear fit and the assigned values (the residuals) is on the order of magnitude of the precision of the instruments, and part of the amplitude of the residuals is due to the assignment of the calibration cylinders.
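The reasoning above (identical residual patterns across instruments point to the cylinder assignment rather than to the analyzers) can be sketched as follows. The function names, the sign convention for the residuals (assigned minus fitted) and all numerical values are assumptions chosen for illustration only.

```python
import numpy as np

def calibration_residuals(measured, assigned):
    """Fit assigned = a * measured + b and return (a, b, residuals).

    Residuals are assigned minus fitted values; this sign convention
    is an assumption of the sketch.
    """
    measured = np.asarray(measured, dtype=float)
    assigned = np.asarray(assigned, dtype=float)
    a, b = np.polyfit(measured, assigned, 1)
    return a, b, assigned - (a * measured + b)

def residual_pattern(residuals_by_instrument):
    """Per-cylinder mean and spread of the residuals across instruments."""
    stacked = np.vstack(residuals_by_instrument)
    return stacked.mean(axis=0), stacked.std(axis=0)

# Hypothetical CO2 cylinders (ppm): the second one is mis-assigned by 0.05 ppm.
true_values = np.array([380.0, 400.0, 420.0, 450.0])
assigned = true_values + np.array([0.0, 0.05, 0.0, 0.0])

# Two hypothetical, perfectly linear instruments with different gain and offset.
raw_1 = 0.998 * true_values + 0.30
raw_2 = 1.003 * true_values - 0.80

res_1 = calibration_residuals(raw_1, assigned)[2]
res_2 = calibration_residuals(raw_2, assigned)[2]
mean_residual, spread = residual_pattern([res_1, res_2])
```

In this toy case both instruments return the same nonzero residual for each cylinder, so the spread across instruments is essentially zero while the mean residual is not, which is the signature attributed above to imperfect cylinder assignment.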

Evolution of the metrics with time in the field
Out of the 47 instruments tested at the MLab, we present results for 13 that have been installed in the field at sites instrumented for at least 1 year (see Tables 1 and A1, Fig. 10), plus two more instruments that were not tested in the laboratory. One is running in parallel at MHD with one tested instrument, and the other has been installed at a site (PUJ) for several years without major trouble. The complete measurement setup (length of sampling lines, buffer volume, drying system, measuring time) is not uniform and can differ from site to site. However, all instruments are regularly calibrated with cylinders linked to the WMO scales and use at least one target tank. In addition, all the raw data have been processed with the same algorithm (Hazan et al., 2015). The earliest data set begins in September 2008, and we end our study on 30 September 2014. As detailed in Table 1, we have analyzed field data sets from 11 three-species and 4 four-species instruments. These last instruments have been installed for a year or less to replace previous instruments. The instruments have been running at nine different stations: four in France, three in Europe, one in Africa and one in the Indian Ocean. At two stations, MHD and OPE, two instruments are running in parallel, while at the other stations there is either a single instrument since the beginning (AMS, IVI, PUY, TRN, PUJ) or several instruments that replaced one another in succession (three instruments at BIS, two at LTO). At OPE, one of the three instruments was replaced.
Figure 6. Whisker boxplots summarizing the CMR, STR and LTR tests for CO2 (first column), CH4 (second column) and CO (third column) by model. The mobile instruments are grouped with their main model. G1302 and G2302 are grouped together. The middle horizontal line shows the median, the limits of the box the 25th and 75th percentiles, and the ends of the whiskers the lowest datum still within 1.5 interquartile range (IQR, the difference between the first and third quartiles) of the lower quartile and the highest datum still within 1.5 IQR of the upper quartile. The numbers under the boxplots in the CO2 column indicate the number of instruments tested in each case.

In Fig. 10, we see that invalid data occur regularly. At the daily scale, the invalid data are mostly due to the flushing time of the measurement lines, especially when the site samples at several heights. However, in this graph, we plot daily averages. If more than 50 % of the data are valid, then the day is flagged as valid; if not, then the day is flagged invalid. With this criterion, routine flushing alone does not produce invalid days, so an invalid day most likely reflects a problem. Such problems can be caused by several factors in the setup (leaks in the lines, frozen water traps, local contamination, etc.) but also by an instrument failure. Usually, short periods of invalid data are due to problems in the setup, while longer invalid periods are linked to an instrument failure. We have listed in Table 2 the main instrument failures or problems that happened for the field instruments shown as well as for the MLab references, the models to which they happened and the usual solution. Failures from other instruments, with their proposed solutions, are compiled in the ICOS-INWIRE Report GA N313169 (http://www.icos-inwire.lsce.ipsl.fr/documents/10179/19756/ICOS-INWIRE+report+D2.1/). For the instruments installed on site for more than 6 months (11 of them), on average, more than 80 % of the data are valid.
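The daily flagging rule described in this section can be written in a few lines. This is a minimal sketch: the function names, the per-record boolean flags and the treatment of a day without any record are assumptions, and the strict "more than 50 %" comparison follows the wording of the main text.

```python
def day_is_valid(record_flags, threshold=0.5):
    """Return True when more than `threshold` of the day's records are valid."""
    flags = list(record_flags)
    if not flags:
        return False  # assumption: a day without any record counts as invalid
    valid_fraction = sum(1 for flag in flags if flag) / len(flags)
    return valid_fraction > threshold

def percent_valid_days(days):
    """Percentage of valid days for a mapping {date: iterable of record flags}."""
    if not days:
        return 0.0
    n_valid = sum(1 for flags in days.values() if day_is_valid(flags))
    return 100.0 * n_valid / len(days)

# Hypothetical hourly flags for three days.
example = {
    "2014-09-28": [True] * 20 + [False] * 4,   # mostly valid
    "2014-09-29": [True] * 12 + [False] * 12,  # exactly 50 %: not "more than"
    "2014-09-30": [],                          # no data at all
}
share = percent_valid_days(example)
```

Only the first of the three example days passes the criterion, which shows how an exactly half-valid day is treated under the strict comparison chosen here.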
In the next figures, the data from these instruments are compiled. We first use data from calibration gases. Data from the same site are in the same shade of color. In Fig. 11, we investigate the measurement interval and calibration cycle stability of the instruments. In the database, each instrument measurement protocol is configured according to the station setup. For example, the first minutes (usually half the length of the measurement interval, i.e., 10 to 15 min) are invalidated; for a calibration sequence, the first cycle is invalidated and at least three cycles are needed to have a valid calibration. In the left panel of Fig. 11, we show the minute averages of the mean difference to the last minute. The first point is at 3 min after the beginning of the interval. We see that, for CO2, already after 5 min the difference to the last minute is below 0.05 ppm. For CH4, this difference is below 0.2 ppb. For CO, however, the difference does not always decrease with time and can be above 2 ppb after 20 min of measurements. It seems that, for this species, the noise is greater than the flushing influence. Indeed, in the MLab, the typical CMR is around 5 ppb.
In the right panel of Fig. 11, we look at the calibration cycles. Each point is the average of the validated minutes of one measurement interval, with one interval lasting 15 to 30 min depending on the site. For CO2 and CH4, we see a clear decrease of the difference with the cycles, while for CO, as before, this is not as clear. At the second cycle, the difference for CO2 is below 0.03 ppm and for CH4 below 0.3 ppb. For CO, the difference is variable but stays under 0.6 ppb for any cycle. This indicates that we could use fewer calibration cycles for the same quality, either to lower costs (less gas consumption) or, at sites that sample from several heights, to leave more time for ambient air. For example, to be under 0.02 ppm error for CO2 and 0.2 ppb error for CH4, in most of the cases, two cycles are enough.
For CO2, we also see that there is no large difference between the instruments except for LTO192, TRN108 and AMS111, which show a longer stabilization time. For LTO192, this can be explained by the setup of the station, with all the gases going through a Nafion dryer, which dries to 0.1 % H2O. The first measurement interval of calibration gases is thus still wet, and it was shown in the MLab that, after the H2O correction, there was still a bias. For AMS111, the short-term repeatability is 0.04 ppm, so the difference to the last cycle is within the instrument STR. For TRN108, we observe the same long stabilization time for both sets of cylinders; as the difference for the first cycle is twice the instrument STR, this seems to be due to the setup, but there is no easy explanation for LTO. For CH4, we observe a longer stabilization time again for LTO192, as could be expected, and also for MHD54 but only for one set of cylinders. For this last instrument, the longer stabilization time is most probably due to a leak in the tubing that was fixed when changing the cylinders. For CO, we also see that one instrument (OPE187) performs less well than the others for both the interval and the cycle stabilizations. We do not see the effect of the Nafion dryer on CO at LTO, and indeed in the case of CO no bias was observed after H2O correction. For all three species, we see that the stabilization time seems to depend more on the setup than on the initial performances of the instruments as could be measured by the STR.

Figure 12. Top: CO2. Middle: CH4. Bottom: CO. Left: temporal drift of the instrument calculated using a virtual tank with a fixed value after calibration and the calibration equations; the concentrations are normalized using the first calibration. Right: temporal evolution of the residuals from the calibration linear fit. The vertical solid lines show a change of calibration scales. For all the panels, each data point is a 3-month average.
In Fig. 12, we plot on the left panels the instrumental setup drift over time and on the right panels the evolution of the residuals from the linear fit of the instruments. The first plot allows us to investigate the need for more or less frequent calibrations, while the second checks whether the instrument response stays linear over time. For legibility, we have excluded one site where the values were so different that the scale was too large to see the trends (BIS24 for CO2).
To evaluate the drift, using the calibration equation, we calculate the raw values of a virtual cylinder that once calibrated would have a fixed value (390 ppm for CO2, 1900 ppb for CH4 and 150 ppb for CO; see Eq. 7):

$$C_{\mathrm{raw}} = \frac{C_{\mathrm{cal}} - b}{a} \qquad (7)$$

with $C_{\mathrm{raw}}$ the raw values, $C_{\mathrm{cal}}$ the calibrated values we assigned, $a$ the slope of the linear regression and $b$ the intercept. We normalized the concentrations by subtracting the first value. Over time, for most of the instruments, the variations are slow and regular but still significant (up to 0.15 ppm yr⁻¹ for CO2 for OPE75, up to 2.2 ppb yr⁻¹ for CH4 for the same instrument; for CO, there is not enough data to observe trends), and the regular calibrations allow us to follow these changes. These rates are comparable to but lower than rates from other studies such as Richardson et al. (2012) and Karion et al. (2013), which found drift rates of 0.15 ppm yr⁻¹ for CO2 in the first case and 0.25 ppm yr⁻¹ for CO2 and 3.4 ppb yr⁻¹ for CH4 in the second study. For some instruments, however, there were some rapid changes, and we can question the calibrated values for these periods. For example, for OPE91, we observe a sharp decrease in January 2013. Looking at Fig. 10, we see that this instrument was not providing data before this modification. Indeed, the instrument had a failure; after repair, the calibration response was different. It is notable that, even after a simple restart, the instrument response can differ from before due to some startup processes that are not reinitialized. The sharp changes in the LTO23 instrument are also usually linked to problems either in the instrument or in the setup that modified the calibration response. It is important to note that the drift is not only due to the instrument but could also be due to leaks in the lines, drifts in the calibration tanks, etc. However, we observe a general trend toward an increase of the concentrations for CH4, which seems to indicate an instrumental drift for this species.
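The virtual-tank calculation described above can be sketched as follows, assuming that each calibration yields a linear relation of the form C_cal = a · C_raw + b, so that the raw value of the virtual tank is (C_cal − b) / a. The function names, the linear-trend estimate of the drift rate and the calibration coefficients in the example are hypothetical.

```python
import numpy as np

def virtual_tank_raw(slopes, intercepts, c_cal):
    """Raw value that each calibration (a, b) would map onto c_cal.

    Assumes a linear calibration C_cal = a * C_raw + b.
    """
    a = np.asarray(slopes, dtype=float)
    b = np.asarray(intercepts, dtype=float)
    return (c_cal - b) / a

def drift_rate(times_yr, raw_values):
    """Normalize to the first calibration and fit a linear trend (units per year)."""
    raw_values = np.asarray(raw_values, dtype=float)
    normalized = raw_values - raw_values[0]
    rate = np.polyfit(np.asarray(times_yr, dtype=float), normalized, 1)[0]
    return normalized, rate

# Hypothetical CO2 calibrations at 0, 1 and 2 years: constant slope,
# intercept decreasing by 0.1 ppm per year.
times = [0.0, 1.0, 2.0]
raw = virtual_tank_raw([1.0, 1.0, 1.0], [0.0, -0.1, -0.2], c_cal=390.0)
normalized, rate = drift_rate(times, raw)
```

Here the fitted rate is expressed in ppm per year because the virtual tank is a CO2 tank at 390 ppm; the same functions apply unchanged to CH4 or CO with their own fixed values.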
On the right panels of Fig. 12, we plot the largest residuals from the linear regression to evaluate the linearity of the instruments over time and whether the linear approximation stays valid for the tested instruments. We see that for all three species, for most of the sites, the answer is yes, with stable residuals over time. However, for AMS111, BIS38 and MHD41, we observe a significant drift over time. After the change of the calibration scales, the drift is corrected for MHD41. For BIS38, after the change, there are only three calibrations before the instrument was sent back to the factory and three after. It seems, however, that after repair there was no longer any drift. For AMS111, there are also only three calibrations, which present no drift or an inverse drift. It seems then that the observed drifts are most probably due to a drift in the calibration scale and not in the linearity of the instruments.
In Fig. 13, we show the average field continuous measurement repeatability and the long-term repeatability. For this, we use data from the target gas measurements. For all species, the CMR and LTR show very few variations over time (not shown), and thus we have chosen to show the data using boxplots summarizing the instrument performances over the whole period of measurement. For comparability, we have added the MLab continuous measurement repeatability and long-term repeatability. We observe that, for most of the sites, the laboratory values and the field values agree well. The field continuous measurement repeatability tends to be higher than the long-term repeatability, as can be expected.
In Fig. 14, we plot the fractional change in the CH4 concentration vs. the fractional change in the CO2 concentration. The data from the different instruments are normalized to the first measurement in the time sequence and are offset horizontally from one another for clarity. We only considered continuous sequences when the calibration standards were not changed. Changing calibration standards caused jumps in the observed ratios, which are due to differences in assigned values for the standards. Whenever the standards were changed, a new data set was generated (labeled "a", "b", etc. in the figure legend). In addition, we created a new data set in instances when the instrument was sent back to the manufacturer for repair or was repaired on site, since the repair work can affect the instrument calibration.
We see that the fractional change is on the order of 0.001. In other words, the drifts of the CRDS instruments are typically about 0.1 % over a year. The magnitude of the fractional change is larger for methane than for carbon dioxide, by about a factor of 2.7. For CO2, the mean fractional change is 0.05 %, or 0.2 ppm; for CH4, for the same instruments over the same period of time, the mean fractional change is 0.13 %, or 2.5 ppb. These values are on the order of the compatibility target for these two gases. In nearly all cases, the drift in the fractional change in methane is well correlated with the drift in the fractional change in carbon dioxide. We can see this in the linear fits to the individual data sets, which have a median R² of 0.79. This high degree of correlation implies that the drift is caused by drift in some quantity or set of quantities that affect carbon dioxide and methane proportionally. Finally, looking at the ratios, the grey dashed lines follow the pressure-derived slope of 2.7, and the pink dashed lines follow the temperature-derived slope of 1.15. It is clear from the figure that the data tend to follow the pressure-derived slope rather than the temperature-derived slope; this observation is supported by the fact that the mean slope from the linear fits in the figure is 2.4, which is close to the predicted value of the slope for a pressure drift. A more careful analysis of the data, by vector decomposition along the pink (temperature) and grey (pressure) axes, implies that 85 % of the observed drift in this population of instruments is due to pressure-dependent drift. It is important to note that this analysis is not conclusive; drifts other than pressure or temperature might be at play, which might invalidate these results. However, these results point to the tantalizing possibility that improving the performance of the pressure sensor could lead to a much higher degree of stability in these instruments. It is to be noted that, from the dependences evaluated at the factory and the instrument data shown, the mean pressure drift rate on a yearly basis is 0.3 Torr yr⁻¹, which is well within the expected drift rates for the sensing technology used in the instrumentation.
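One way to picture the vector decomposition mentioned above is to express an observed pair of fractional changes (CO2, CH4) as a sum of a component along the pressure axis (slope 2.7) and a component along the temperature axis (slope 1.15). The sketch below does this by solving a 2 × 2 linear system. The function name and, in particular, the definition of the "pressure share" as a ratio of component lengths are assumptions of this example; the text does not specify how the 85 % figure was computed, and this sketch does not attempt to reproduce it.

```python
import numpy as np

PRESSURE_SLOPE = 2.7      # CH4 vs. CO2 fractional-change slope for pressure drift (from the text)
TEMPERATURE_SLOPE = 1.15  # same slope for temperature drift (from the text)

def decompose_drift(frac_co2, frac_ch4):
    """Split a (CO2, CH4) fractional change into pressure and temperature parts.

    Solves p * (1, 2.7) + t * (1, 1.15) = (frac_co2, frac_ch4) and returns
    (p, t, pressure_share), where pressure_share is the length of the
    pressure component divided by the sum of both component lengths
    (an assumed definition, chosen for illustration).
    """
    basis = np.array([[1.0, 1.0],
                      [PRESSURE_SLOPE, TEMPERATURE_SLOPE]])
    p, t = np.linalg.solve(basis, np.array([frac_co2, frac_ch4], dtype=float))
    pressure_part = abs(p) * np.hypot(1.0, PRESSURE_SLOPE)
    temperature_part = abs(t) * np.hypot(1.0, TEMPERATURE_SLOPE)
    total = pressure_part + temperature_part
    share = pressure_part / total if total > 0 else float("nan")
    return p, t, share

# Mean fractional changes quoted in the text: 0.05 % for CO2, 0.13 % for CH4.
p, t, pressure_share = decompose_drift(0.0005, 0.0013)
```

Because the two axes are not orthogonal, other definitions of the share (for example, based on projections) would give different numbers, so the output should be read as qualitative support for a pressure-dominated drift rather than as a check of the published value.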

Conclusions
Between 2008 and fall of 2014, 47 non-isotopic Picarro instruments were tested at what is now the MLab. The goals of this work were to give insight into the MLab testing procedures that are also applied to the other instruments as well as to provide an evaluation of the tested instruments. We show that over time the instruments tend to have more reproducible performances. However, the first instruments of a new model tend to differ more from one another than the last instruments of the previous model. This conclusion holds for CO even though its measurement is challenging. We also see that the results from the factory, at the MLab and in the field generally agree well with each other; in the case of the field, the performances stay relatively unchanged over time. The laboratory test can then be used to prioritize the location of the instruments according to their performances and the needs of the stations before installation. This also shows that, for instruments that could not be tested at the laboratory, the field estimate could be an acceptable proxy if measured over a long enough period of time using the same protocols as described here. We can conclude that the instruments tested are well designed for field study (with an average of more than 80 % of valid data over the instruments tested in the field in this study). The troubleshooting list provided in this study is representative only of the observed failures for the tested instruments. Within SNO-RAMCES, a troubleshooting logbook is being developed to allow every station to consult and add failures and solutions. This could be extended to the whole ICOS network. We would also like to add a short list of recommendations for use of these analyzers in the field. These recommendations are most likely valid for other instruments as well.
- Instruments should be tested in the laboratory before being installed on site. Indeed, some important tests such as the temperature, pressure and water vapor dependence tests are only done at the laboratory. It is also convenient to be able to verify the performances of an instrument with a standardized and recognized protocol.
- Measurement interval duration should be at least 10 min to allow for stabilization and should in any case be tested for the specific setup of the station, as we have shown that the stabilization time seems to be mostly station-specific and not instrument-specific. Indeed, to be able to reach the WMO comparison goals, we need biases as small as possible for every source of bias. Here, we aim for a difference of less than 0.05 ppm for CO2, 0.2 ppb for CH4 and 1 ppb for CO.
- Pressure difference between the different samples should not exceed 0.4 bar to avoid significant inlet pressure influence, which leads to systematic biases.
- Calibration sequences should have at least two cycles; in most cases, two cycles could be enough.
- Calibrations need to be run regularly to follow the instrument and setup drift. In particular, calibrations have to be run after each restart of the instrument. Each station setup being different, we cannot recommend a specific frequency for calibration, but we recommend that during the first 6 months these calibrations are run at least every 2 weeks. Then, after analysis of the data, the frequency should be optimized.
- Despite these findings, we highly recommend carrying out a thorough test of the instrument at the station to take into account specificities that would require a higher number of calibration cycles or a longer interval time.

Figure 1. Spectral fits of absorption data (black solid dots) for CO2 (four overlapping lines) only, H2O (three lines) only, CO only, excluding CO, and all data. From Chen et al. (2013).

Figure 2. Allan deviation for CO2, CH4 and CO. The vertical solid black lines show 1 min and 1 h averaging times. The x and y axes are in logarithmic scales. The black dotted line shows the white noise $\frac{1}{2\sqrt{\mathrm{time}}}$.

Figure 3. Results for CO2. First panel: continuous measurement repeatability (CMR) at the factory (as defined), at the MLab (on the raw data) and in the field (as defined). Second panel: short-term repeatability (STR) at the factory and the MLab. Third panel: long-term repeatability (LTR) at the MLab and in the field. Fourth panel: comparison with reference instrument, average difference at the MLab. The horizontal lines show the WMO compatibility goals.

Figure 4. Results for CH4. First panel: continuous measurement repeatability (CMR) at the factory (as defined), at the MLab (on the raw data) and in the field (as defined). Second panel: short-term repeatability (STR) at the factory and the MLab. Third panel: long-term repeatability (LTR) at the MLab and in the field. Fourth panel: comparison with reference instrument, average difference at the MLab. The horizontal lines show the WMO compatibility goals.

Figure 7. CO2 and CH4 temperature dependence of the same mobile instrument (CFKDBS2132) tested at the MLab (top four panels) and in a temperature-controlled chamber (bottom four panels) using in both cases rapid variations of temperature. Among the tested instruments, another one also showed a strong temperature dependence.

Figure 8. CO pressure dependence before (left) and after (right) repair at the factory for CFKADS2084. Among the tested instruments, four others showed a dependence higher than 4 ppb hPa⁻¹.

Figure 9. Calibration residuals for one set of cylinders measured on 14 different instruments. Each point is the average residual for each instrument for one calibration cylinder. For each cylinder, the reference concentration is indicated.

Figure 10. Daily average of data availability at each site (if 50 % of data are valid for 1 day, then the day is valid; if less, the day is invalid). Valid data are in color; invalid data are in black (% are indicated in Table 1). ESP1000 are in red, G1301 in magenta, G2301 in green and G2401 in blue.

Figure 11. Top: CO2 (ppm). Middle: CH4 (ppb). Bottom: CO (ppb). Left: mean difference to the last minute for one measurement interval. Right: mean difference to the last cycle and SD of a calibration sequence. Dashed, dotted or dash-dotted lines show a different set of calibration cylinders for the same site.

Figure 13. Top: CO2. Middle: CH4. Bottom: CO. For each field instrument, we show the statistics as defined in Fig. 6 for the CMR and LTR during the measurement period. The MLab initial values are added with diamond shapes.

Figure 14. Data showing the fractional change in CH4 vs. the fractional change in CO2, as derived from instrument calibrations over time. Each data set has been offset horizontally for clarity and normalized to the first calibration data. The linear fit is plotted over the data. The pink dashed lines indicate the slopes corresponding to temperature-dependent drift, and the grey dashed-dotted lines those corresponding to pressure-dependent drift.

Figure A1. CH4 continuous measurement repeatability. First panel: measurements averaged over different time intervals. Second panel: Allan deviation.

Figure A2. Short-term and long-term repeatability for the three species. First panel: short-term repeatability. Second panel: long-term repeatability.

Figure A3. CO pressure and temperature dependency. First panel: pressure dependency. Second panel: temperature dependency. On the right of the lower plot, the slope (I1), intercept (I0) and the coefficient of correlation (R²) are indicated.

Figure A4. Schematics of the calibration procedure at the MLab. With a measurement interval time of 20-30 min, a full sequence lasts between 5 and 8 h.

Figure A6. Comparison with the reference instrument for CH4. First panel: dry air vs. dry air. Second panel: wet air corrected for H2O vs. dry air. From top to bottom, the concentrations for both instruments and the difference of the two are plotted, then the water vapor concentrations for both instruments, then the evolution of the target for both instruments and finally a histogram of the distribution of the differences along with statistics.

Table 2. Instrument failures, models concerned and solutions for the field instruments of this study.