Evaluation of CarbonTracker's Inverse Estimates of North American Net Ecosystem Exchange of CO2 From Different Observing Systems Using ACT‐America Airborne Observations

Quantification of regional terrestrial carbon dioxide (CO2) fluxes is critical to our understanding of the carbon cycle. We evaluate inverse estimates of net ecosystem exchange (NEE) of CO2 in temperate North America, and their sensitivity to the observational data used to drive the inversions. Specifically, we consider the state‐of‐the‐science CarbonTracker global inversion system, which assimilates (a) in situ measurements (IS), (b) the Orbiting Carbon Observatory‐2 (OCO‐2) v9 column CO2 (XCO2) land nadir and land glint retrievals (LNLG), (c) OCO‐2 v9 XCO2 ocean‐glint retrievals (OG), and (d) a combination of all these observational constraints (LNLGOGIS). We use independent CO2 observations from the Atmospheric Carbon and Transport‐America (ACT‐America) aircraft mission to evaluate the inversions. We diagnose errors in the flux estimates using the differences between modeled and observed biogenic CO2 mole fractions, influence functions from a Lagrangian transport model, Bayesian inference, and root‐mean‐square error (RMSE) and bias metrics. The IS fluxes have the smallest RMSE among the four products, followed by LNLG. Both IS and LNLG outperform the OG and LNLGOGIS inversions with regard to RMSE. Regional errors do not differ markedly across the four sets of posterior fluxes. The CarbonTracker inversions appear to overestimate the seasonal cycle of NEE in the Midwest and Western Canada, and overestimate dormant season NEE across the Central and Eastern US. The CarbonTracker inversions may overestimate annual NEE in the Central and Eastern US. The success of the LNLG inversion with respect to independent observations bodes well for satellite‐based inversions in regions with more limited in situ observing networks.

• In situ measurements and the land nadir/land glint inversions are the most reliable products of CarbonTracker in temperate North America, superior to ocean-glint or LNLGOGIS inversions
• Errors in these CarbonTracker regional flux estimates are not strongly dependent on the observational data sources
• CarbonTracker overestimates seasonal net ecosystem exchange (NEE) for the Eastern and Central US, thus the annual NEE may underestimate continental uptake of CO2

Supporting Information:
Supporting Information may be found in the online version of this article.

Introduction
Accurate quantification of carbon dioxide (CO2) fluxes from different sources is an important input to the design of climate policies (e.g., Ciais et al., 2014; Keller et al., 2008; Rogelj et al., 2018). The CO2 flux related to terrestrial net ecosystem exchange (NEE) is one of the major components. CO2 NEE fluxes are challenging to quantify because of complex biosphere processes and biosphere-atmosphere interactions (e.g., Tian et al., 2016). Both bottom-up and top-down approaches (e.g., Hayes et al., 2012; Hu et al., 2019; Liu et al., 2017; Pan et al., 2011; Thompson et al., 2020) have been used to characterize and quantify CO2 NEE fluxes using data from a wide range of observation platforms.
The top-down approach is an optimization framework that improves a priori flux estimates, which are informed, for example, by ecosystem carbon-stock inventories or carbon flux models (e.g., Haynes et al., 2019). Atmospheric CO2 measurements, on which the top-down method relies, can contribute powerful constraints to the bottom-up methods (e.g., Ogle et al., 2015). Different atmospheric CO2 measurement platforms, such as boundary-layer CO2 mole fractions from ground-based networks (e.g., Andrews et al., 2014; Miles et al., 2012) and column-averaged CO2 mole fractions (XCO2) from satellites (e.g., Liu et al., 2020), aim to complement each other. Measurement biases, atmospheric transport errors, or representation errors, however, may make it difficult to assimilate these measurements within the optimization process.
Evaluating current top-down CO2 flux estimates from the different platforms with independent observations is a promising avenue to improve them. Chevallier et al. (2019) compared six global atmospheric CO2 inversions based on combinations of three measurement platforms (i.e., Orbiting Carbon Observatory-2 (OCO-2) or Greenhouse Gas Observing Satellite (GOSAT) column retrievals, and boundary-layer in situ measurements) using a large number of independent aircraft measurements in the free troposphere. They provided a cross-comparison among the different inversion estimates as well as mole fraction-based comparisons between the inversions and the aircraft measurements. They found the overall performance of inversions based on in situ data and on OCO-2 XCO2 observations to be similar; however, they showed that the posterior fluxes diverge for the northern and tropical parts of the continents. Seasonal, regional evaluation of the posterior fluxes is therefore needed. The global inversions are temporally and spatially resolved products, and many aircraft field campaigns take place at a regional scale. This opens up the opportunity for further in-depth regional evaluations.
The Atmospheric Carbon and Transport-America (ACT-America) mission conducted flights east of the Rocky Mountains in the United States (US) during Summer 2016, Winter 2017, Fall 2017, Spring 2018, and Summer 2019 (Davis et al., 2018; Davis et al., 2021). The multi-seasonal aircraft CO2 sampling of ACT-America provides a unique opportunity for regional evaluation of CO2 flux estimates. Extensive atmospheric CO2 measurements from the atmospheric boundary layer (ABL) to the upper free troposphere during four seasons enable researchers to rigorously assess, and potentially distinguish, the biases and accuracy of different inversion estimates for temperate North America.
OCO-2 gathers XCO2 measurements globally using nadir and glint observations over land, and glint observations over the oceans (Eldering, O'Dell, et al., 2017;. The OCO-2 retrievals are continually being improved (e.g., Miller & Michalak, 2020; O'Dell et al., 2018). Independent observation campaigns can test the ability of the OCO-2 v9-based inversions to estimate regional-scale fluxes with accuracy and precision. Temperate North America has one of the densest in situ-based greenhouse gas monitoring networks in the world. Evaluating the OCO-2 v9-based flux estimates together with the in situ-based CO2 flux estimates can be used to assess the complementary role of the two platforms. Additionally, a multi-platform strategy that combines in situ- and satellite-based platforms to constrain CO2 NEE is promising but requires independent evaluation.
In this study, we implement a method to evaluate the in situ-based, OCO-2 v9-based, and two-system-combined inversions of CO2 NEE in temperate North America using airborne observations from the ACT-America mission. Details of the evaluation framework are described in Section 2. Results and discussion are presented in Section 3. We conclude in Section 4.

ACT-America Aircraft Campaign
We use CO2 measurements from the Summer 2016, Winter 2017, Fall 2017, and Spring 2018 ACT-America campaigns. These are the periods for which CO2 flux products are available from CarbonTracker as part of the OCO-2 v9 MIP. Maps illustrating the seasonal average of CO2 NEE fluxes and the spatial coverage of in situ and OCO-2 LNLG/OG data during the ACT-America campaign periods are shown in Figures S1 and S2. Each ACT-America campaign flew over the same three sub-regions of the United States (US): the Mid-Atlantic, Midwest, and Gulf Coast. On most flight days, two aircraft (a NASA Langley B200 and a NASA Wallops C130) flew together, measuring atmospheric CO2 mole fractions and other atmospheric variables in patterns designed to sample the variability in atmospheric greenhouse gases (GHGs) within mid-latitude weather systems and the associated regional surface fluxes. All flights were conducted during midday hours (15-00 UTC) in order to sample well-mixed ABL conditions. The instruments, deployment, and data set of ACT-America are described in detail in Wei et al. (2021). The calibration of the CO2 measurements is described by Baier et al. (2020). About 35% of the flight time was within the ABL, the portion of the atmosphere most sensitive to regional GHG surface fluxes. In this study, we use the ABL measurements excluding the takeoff and landing portions, and aggregate these CO2 measurements across 30-s intervals (Figure 1, Table 1) to construct the receptors in the Lagrangian particle dispersion modeling described in Section 2.3.

Influence Functions for ACT Flight Data
Upwind fluxes influence the aircraft samples. We explicitly quantify the source-receptor relationship (i.e., influence function) using a Lagrangian particle dispersion model (FLEXPART-WRF) (Brioude et al., 2013) in a backward mode. The simulations of FLEXPART-WRF are driven by the 27-km WRF-Chem simulated meteorology from the base line simulation described in ; Feng, Lauvaux,  which were nudged to the 25-km ECMWF-ERA5 reanalysis data (Hersbach et al., 2020).
For the flux evaluation in this study, we aggregated the set of influence functions to 1° × 1° resolution.
We computed a suite of influence functions across 98 flight days, at the same spatial and temporal resolution as the meteorological driver (27 km and hourly), covering the entire domain (Figure 2). Each receptor of the influence function is a 30-s interval along the flight tracks, characterized by a box bounded by the maximum and minimum latitude/longitude and by the maximum and minimum heights during the 30-s interval. For each receptor box, 5,000 particles were released and their transport and dispersion were simulated backward for 10 days (Cui et al., 2015, 2017, 2019). We validated the suite of influence functions as follows. Using the same flux inputs, boundary conditions, and meteorological fields, we compared the FLEXPART-WRF simulated CO2 mole fractions with the WRF-Chem forward simulations along the flight tracks to evaluate the ability of the current influence function setup to reproduce the corresponding WRF-Chem simulations in the domain. We found that they agreed well. The suite of influence functions plays a key role in our evaluation described in Section 2.4.2. Evaluation of the WRF transport fields has been performed in other ACT studies (e.g., . Additional evaluation using the ACT airborne data is underway.
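The receptor construction described above can be sketched as follows. This is a minimal illustration, not the processing code used in the study: the tuple layout of the samples, the `Receptor` class, and the `build_receptors` function are hypothetical names introduced here, and time is assumed to be given in seconds along a single flight.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Receptor:
    """Bounding box and mean CO2 for one aggregation interval (hypothetical layout)."""
    t_start: float
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    z_min: float
    z_max: float
    co2_mean: float


def build_receptors(samples, interval_s=30.0):
    """Group (time_s, lat, lon, height_m, co2_ppm) tuples into fixed-length intervals.

    Each interval becomes one receptor box spanning the min/max latitude,
    longitude, and height sampled within it, with the mean CO2 mole fraction.
    """
    bins = {}
    for t, lat, lon, z, co2 in samples:
        bins.setdefault(int(t // interval_s), []).append((lat, lon, z, co2))

    receptors = []
    for k in sorted(bins):
        lats, lons, zs, co2s = zip(*bins[k])
        receptors.append(Receptor(
            t_start=k * interval_s,
            lat_min=min(lats), lat_max=max(lats),
            lon_min=min(lons), lon_max=max(lons),
            z_min=min(zs), z_max=max(zs),
            co2_mean=fmean(co2s),
        ))
    return receptors
```

Each returned box would then serve as the release volume for the backward particle simulation; the particle release itself is handled by the dispersion model and is not sketched here.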

Background Determination
To evaluate the surface fluxes in our domain, we subtract the CO2 background values from the ACT CO2 measurements to obtain an estimate of the CO2 mole fraction enhancements and depletions caused by surface fluxes in the domain. The CO2 boundary conditions in the WRF-Chem configuration are from CarbonTracker. We interpolate the boundary values along the flight tracks to determine the background-value elements in y_bkg. For the ACT Summer 2016 campaign, we used the 4-D simulations of atmospheric CO2 mole fractions from the CarbonTracker 2017 product, while for the rest of the campaigns we used values from the CarbonTracker 2019-Near Real Time version 2 product. Upper free tropospheric mole fractions can provide another estimate of continental background conditions (Baier et al., 2020). We compare the simulated background mole fractions along ACT-America flight tracks above 4,000 m above mean sea level with the corresponding ACT-America measurements and find good agreement (Figure S3). We do not explicitly compute the uncertainty in the background in this study, but this comparison, and the work of Feng, Lauvaux,  suggests that the uncertainty is less than about 1 ppm.
CUI ET AL.

ACT Referenced Biogenic CO 2
The continental enhancements and depletions of atmospheric CO2 mole fractions include the influence of different fluxes: biogenic, fossil fuel, fire, and oceanic. To focus on the land biogenic CO2 component, we remove the influence of the fossil fuel, fire, and oceanic sources on total CO2 (y) by subtracting the background (y_bkg) and the component mole fraction enhancements simulated using the influence functions and flux estimates:

$$y_{ACTbio} = y - y_{bkg} - H \, (E_{ff} + E_{fire} + E_{ocn}) \qquad (1)$$

where H represents the influence functions (see details in Section 2.3), which are used with the fluxes to produce the atmospheric CO2 mole fractions along the flight tracks, and E_ff, E_fire, and E_ocn represent CO2 fluxes from the fossil fuel, fire, and oceanic sources in the domain. E_ff, E_fire, and E_ocn are obtained from the CarbonTracker system as part of the OCO-2 v9 MIP. As described in Section 2.1, E_ff is obtained from the ODIAC 2018 fossil fuel emission inventory and E_fire is from the GFED4.1s wildfire inventory. E_ocn is from the posterior ocean fluxes of the IS, LNLG, OG, or LNLGOGIS experiments, respectively. ACT-America campaigns were designed to fly over multiple productive ecoregions in the Central and Eastern US and usually avoided urban areas, and wildfires in this region are not abundant. We modeled the CO2 enhancements/depletions from the four source components in the ACT-America boundary layer data space. The fire and ocean sources make trivial contributions to the ACT-America data compared with the biological and fossil fuel sources. Fossil fuels have a modest influence on the ACT-America data. Oda et al. (2018) estimated the annual uncertainty of fossil fuel emissions from ODIAC 2016 over temperate North America to be 3.7%. Moreover, we convolved two fossil fuel emission inventories with the influence functions into the ACT-America boundary layer data space and found the relative errors of the mean values to be 2%-11% (Figures S4-S5). The uncertainties from the fossil fuel, fire, and ocean fluxes used in Equation 1 are much smaller than the uncertainty of the NEE of CO2.
The modeled biogenic CO2 enhancements/depletions along the ACT flight tracks are calculated from each of the four CO2 NEE flux products (E_bio, see Section 2.1):

$$y_{modelbio} = H \, E_{bio} \qquad (2)$$
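The two quantities defined in this section can be illustrated with a short sketch. It assumes, for illustration only, that the influence function is available as a dense list-of-rows matrix with one row per receptor and one column per flux state, and that all flux vectors share that state ordering; the function names are hypothetical and are not part of any ACT-America or CarbonTracker software.

```python
def apply_influence(H, flux):
    """Return the matrix-vector product H @ flux (mole fractions at each receptor)."""
    return [sum(h * e for h, e in zip(row, flux)) for row in H]


def act_biogenic(y, y_bkg, H, e_ff, e_fire, e_ocn):
    """Observation-referenced biogenic signal.

    Subtracts the background and the simulated fossil fuel, fire, and
    ocean contributions from the measured total CO2 at each receptor.
    """
    non_bio_flux = [ff + fi + oc for ff, fi, oc in zip(e_ff, e_fire, e_ocn)]
    y_non_bio = apply_influence(H, non_bio_flux)
    return [yi - bi - ni for yi, bi, ni in zip(y, y_bkg, y_non_bio)]


def model_biogenic(H, e_bio):
    """Modeled biogenic signal from one NEE flux product."""
    return apply_influence(H, e_bio)
```

The difference between the outputs of `model_biogenic` and `act_biogenic` at each receptor is the model-data residual that the evaluation metrics operate on.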

Evaluation Framework and Experimental Design
To distinguish and rank the different flux products, we calculate the root-mean-square error (RMSE) between y_modelbio and y_ACTbio. The value of y_modelbio is calculated using the influence functions and the flux products at 3-hourly temporal and 1° × 1° spatial resolution. A smaller RMSE indicates better performance of the flux product, and vice versa. The RMSE analysis is applied to all data from each campaign as well as to the data from all four campaigns combined.
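A minimal sketch of the mole fraction-based ranking is given below. The function names and the dictionary keyed by product name are illustrative assumptions; the inputs are taken to be equal-length sequences of modeled and observation-referenced biogenic mole fractions in ppm.

```python
from math import sqrt


def rmse(model, obs):
    """Root-mean-square error between modeled and observed values."""
    return sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def mean_bias(model, obs):
    """Mean of (model - obs); positive when the model is too high on average."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def rank_products(products, obs):
    """Return product names ordered from smallest to largest RMSE."""
    return sorted(products, key=lambda name: rmse(products[name], obs))
```

Applying `rank_products` once per campaign and once to the pooled data would mirror the per-season and all-campaign comparisons described above.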
The mole fraction-based analysis above reflects the net result of upwind biogenic fluxes. It is therefore difficult to identify the sub-regional and ecosystem-specific sources of the divergences between the aircraft observations and the simulations from the flux products without further diagnosis (Rayner, 2020). We therefore also conduct a flux-based evaluation to diagnose the errors of the flux products at the sub-regional scale. In the equations used for this evaluation (Equations 3-5), H (dimension: m × n, where m is the number of receptors and n is the number of states, that is, spatial clusters associated with time intervals) is the influence function, and R (dimension: m × m) and B (dimension: n × n) represent the covariances of the model-data mismatch and of the prior flux errors, respectively. x denotes the optimized flux (the mode of the posterior distribution, dimension: n × 1) obtained using ACT-America data, and x_0 denotes the flux product being evaluated, which serves as the prior information.
We apply the Bayesian solution to optimize the flux products using the ACT-America data (Equation 3), and use the differences between the flux products and their optimizations by ACT-America (Equations 4 and 5) to evaluate the flux products.
ɛ (dimension: n × 1) is a spatially and temporally resolved quantity that represents the errors in the flux product relative to the ACT-America referenced fluxes. ɛ is in units of μmol/m²/s and can be positive or negative. A lower magnitude of ɛ indicates that the flux product is closer to the ACT referenced value. Positive values of ɛ identify grid clusters where the flux product overestimates the NEE of CO2, and vice versa.
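The display equations referenced as Equations 3 to 5 do not appear in this text. For orientation, the standard analytical Bayesian update that is consistent with the symbols defined above would take the form below. This is an assumed reconstruction: the sign convention for ɛ is inferred from the statement that positive ɛ marks overestimated NEE, and the arrangement may differ from the published equations.

```latex
% Assumed standard form; not copied from the published article.
% Posterior (optimized) flux given the flux product x_0 as the prior:
x = x_0 + B H^{T} \left( H B H^{T} + R \right)^{-1} \left( y_{ACTbio} - H x_0 \right)

% Error of the flux product relative to the ACT-referenced flux
% (positive where the product overestimates NEE):
\epsilon = x_0 - x
         = B H^{T} \left( H B H^{T} + R \right)^{-1} \left( H x_0 - y_{ACTbio} \right)
```

In this form, H x_0 plays the role of the modeled biogenic mole fraction, so ɛ is driven directly by the residual between the modeled and ACT-referenced biogenic signals.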
R is assumed to be the variance of the residuals between y_modelbio and y_ACTbio, and is treated as a diagonal matrix. We consider this a conservative assumption for R. B is initially set to 100% relative uncertainty of the flux product (x_0); we then apply a regularization parameter to B to tune the balance between the contribution of the model-data mismatch and the constraint of the prior estimate in Equation 3 (Cui et al., 2015, 2017). Given R and the tuned B, we solve explicitly for ɛ in Equation 5. Because this study focuses on seasonal-level evaluation, we combine all data from each campaign (i.e., each season) as one case and derive the corresponding spatially and temporally resolved values of ɛ. We focus on the grid cells associated with large values of the influence functions for each campaign (Figure 2), and aggregate these grid cells within each sub-domain (i.e., R1, R2, and R3 in Figure 3) according to the ecoregions classified in the CarbonTracker system, obtaining totals of 36, 36, 37, and 33 grid clusters for the four cases, respectively (more details in SI and Figure S6). We aggregate the time intervals from the native 3-hourly intervals to daytime (14-01 UTC) and nighttime (02-13 UTC) periods of each day, and apply an e-folding temporal correlation scale of 20 days between the same period of day in the prior flux errors. We then calculate the weighted average of ɛ (with or without its sign) during each campaign, weighted by the temporal information constrained by H^T H, for each domain (i.e., R1, R2, and R3), to identify the seasonal error levels of the flux products.
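One way to picture the prior error covariance described above is the following sketch for the time series of a single grid cluster. It is an illustration under stated assumptions rather than the configuration used in the study: correlations are applied only between states in the same period of day, no spatial correlation between clusters is included, and the `scale` argument stands in for the regularization parameter. All names are hypothetical.

```python
from math import exp


def prior_covariance(x0, day_index, is_daytime,
                     rel_unc=1.0, tau_days=20.0, scale=1.0):
    """Build B for one grid cluster's sequence of day/night flux states.

    x0         : prior fluxes (the flux product being evaluated)
    day_index  : day number of each state
    is_daytime : True for daytime states, False for nighttime states
    rel_unc    : relative uncertainty (1.0 corresponds to 100%)
    tau_days   : e-folding temporal correlation scale in days
    scale      : regularization factor applied to the whole matrix
    """
    n = len(x0)
    sigma = [rel_unc * abs(v) for v in x0]
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Correlate only states from the same period of day.
            if is_daytime[i] != is_daytime[j]:
                continue
            corr = exp(-abs(day_index[i] - day_index[j]) / tau_days)
            B[i][j] = scale * sigma[i] * sigma[j] * corr
    return B
```

With a diagonal R, the matrix returned here is the only source of temporal smoothing in the update, which is why the choice of `tau_days` and `scale` controls how strongly neighboring days share information.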

Results and Discussion
As described in Section 2.5, we use both mole fraction-based and flux-based metrics to evaluate the four sets of NEE inversion products (i.e., IS, LNLG, OG, and LNLGOGIS). First, the mole fraction-based RMSE analysis is shown in Figure 4. We found that the IS flux product has the best performance among the four products during the summer, fall, and spring, and the second-best performance during the winter.
The LNLG flux product ranks second in most seasons and best in the winter. The OG flux product has the worst performance in the winter, fall, and spring, which is consistent with previous studies (e.g., Crowell et al., 2019; O'Dell et al., 2018). The RMSE values integrated over the four campaigns show that IS has the best aggregate performance at the annual level, followed by LNLG, OG, and LNLGOGIS. The multi-platform product (LNLGOGIS) performs similarly to the OG flux inversion.
We calculated the averaged absolute values of ɛ by campaign (Figure 5), based on Equations 3-5, to identify the spatial distribution of errors in the flux products. In general, the four flux products show similar spatial patterns during all four campaigns. The similar spatial patterns indicate that the spatial distributions of errors in the estimates of NEE of CO2 are not strongly dependent on the observational system used. All flux inversions show the largest errors in the Central and Eastern US during the summer. There are larger errors in the Southern and Eastern US than in other areas during the spring. The inversions show the smallest errors in the winter. Although the overall spatial patterns of errors are similar, some differences among the flux products can still be observed at the sub-regional scale. For example, LNLG and LNLGOGIS perform similarly to IS in the Eastern and Southern US, but much worse than IS in the Midwest and Western Canada.
We further calculate the seasonally averaged ɛ, including its sign, for the three sub-domains (Figure 6; the corresponding spatial maps are shown in Figure S7) to identify the seasonal errors for these regions in the flux products. Again, the spatial patterns of the seasonal errors in these CarbonTracker regional flux estimates are not strongly dependent on the observational data sources. During the summer, we found that all inversions overestimate NEE of CO2 in the Eastern US (so the magnitude of net photosynthesis is underestimated), but the LNLG and LNLGOGIS products significantly underestimate the flux (net photosynthesis is too large in magnitude) in the Midwest US and Western Canada. The LNLGOGIS product also underestimates NEE
Extrapolating these results across seasons suggests that the inversions generally amplified the seasonal cycle of NEE in the Midwest and Western Canada by underestimating summer NEE or overestimating dormant season NEE, especially for the LNLG products. When we consider the ɛ results across the four campaigns, we find that the annual NEE of CO2 from the IS and LNLG fluxes has positive errors in the Midwest and Western Canada and in the Eastern US, but the LNLG fluxes show negative errors in the Southern US. The IS fluxes have the best seasonal performance and LNLG has the best annual performance across the three areas (i.e., Central and Eastern temperate North America).
The seasonally averaged ɛ for daytime and nighttime is also calculated for each case (Figures S8 and S9, respectively). We note that these day-night analyses might be influenced by biases in atmospheric transport, including the simulation of the nocturnal atmospheric boundary layer. The spatial patterns of the errors during the daytime and nighttime largely match those found for the daily NEE error estimates in Figure 6. During the summer, opposing patterns of ɛ (negative values during the daytime and positive values during the nighttime) in the Midwest and Western Canada suggest that both nighttime respiration and net daytime photosynthesis are overestimated in the area. Positive biases during both daytime and nighttime in the Eastern US suggest overestimated biogenic respiration in this region. During the winter, positive biases seen in day and night from IS and LNLG in the Midwest and Western Canada indicate that respiration is overestimated in the region. The magnitudes of errors in day and night from all flux products are small in the Eastern US.
Opposing patterns of ɛ (negative values during the daytime, and positive values during the nighttime) are seen in the Southern US. Consequently, the overall daily errors in these areas are small in Figure 6. In the fall, opposing patterns of ɛ (negative values during the daytime, and positive values during the nighttime) are seen again in the Southern US. In the spring, opposing patterns of ɛ (negative values during the daytime, and positive values during the nighttime) in the three domains suggest that both nighttime respiration and net daytime photosynthesis are overestimated in these areas.

Conclusions
We implement a framework to evaluate estimates of the NEE of CO2 across the Central and Eastern United States and part of Western Canada. We apply this approach to the posterior fluxes from the CarbonTracker global flux inversion system, which, for the OCO-2 v9 MIP, was run with four different atmospheric CO2 data sources.
This study suggests that, in terms of regional variability in NEE of CO2, the IS inversion and the inversion using the LNLG observations from OCO-2 v9 are likely to be the most reliable products of the CarbonTracker system, superior to inversions based on the OCO-2 v9 OG data or on all data platforms combined (LNLGOGIS). We found, using an error diagnosis metric, that IS generally outperforms the inversions based on OCO-2 v9 observations, but the differences between the IS inversion and the LNLG inversion are relatively small. The OG and LNLGOGIS inversions are clearly inferior to the IS and LNLG inversions with respect to this error metric, and warrant further investigation. The strong performance of the LNLG inversion relative to the IS inversion is encouraging when considering inverse flux estimates in regions of the world where the in situ observing network is sparse.
The spatially resolved errors for the regional fluxes in CarbonTracker are not strongly dependent on the observational data source. Our results suggest that CarbonTracker overestimates seasonal NEE for the Central and Eastern US, and that, as a result, the annual NEE from CarbonTracker may underestimate continental uptake of CO2 (annual mean NEE too positive). Summer NEE is positively biased in the Eastern US and negatively biased in the Midwest and Western Canada, yielding relatively little total seasonal bias across the continent in summer. In the dormant seasons, the CarbonTracker inversions appear generally to overestimate NEE. It is possible that the FLEXPART-WRF transport model used in our evaluation system is biased. The differences between the two transport systems (FLEXPART-WRF and TM5-ERA-interim) would also contribute uncertainty to the flux evaluation. Conclusive assessment of the magnitude of the errors in seasonal NEE from CarbonTracker will depend on a more rigorous assessment of the transport models, which is currently being conducted. Nevertheless, we demonstrate that this continental-scale, multi-season airborne data set provides sufficient data to distinguish among inverse flux estimates and to identify posterior flux biases, resulting in a better understanding of the true NEE from North America.
We propose to extend this evaluation framework to other flux products from both top-down and bottom-up methods, such as other members of the OCO-2 v9 MIP and any available continental-scale biogenic CO2 flux estimates. We hypothesize that these studies will yield insights that are applicable across the globe, especially in midlatitude ecosystems.