Objectified quantification of uncertainties in Bayesian atmospheric inversions

. Classical Bayesian atmospheric inversions process atmospheric observations and prior emissions, the two being connected by an observation operator picturing mainly the atmospheric transport. These inversions rely on prescribed errors in the observations, the prior emissions and the observation operator. When data pieces are sparse, inversion results are very sensitive to the prescribed error distributions, which are not accurately known. The classical Bayesian framework experiences difﬁculties in quantifying the impact of mis-speciﬁed error distributions on the optimized ﬂuxes. In order to cope with this issue, we rely on recent research results to enhance the classical Bayesian inversion framework through a marginalization on a large set of plausible errors that can be prescribed in the system. The marginalization consists in computing inversions for all possible error distributions weighted by the probability of occurrence of the error distributions. The posterior distribution of the ﬂuxes calculated by the marginalization is not explicitly describable. As a conse-quence,

Abstract. Classical Bayesian atmospheric inversions process atmospheric observations and prior emissions, the two being connected by an observation operator picturing mainly the atmospheric transport. These inversions rely on prescribed errors in the observations, the prior emissions and the observation operator. When data pieces are sparse, inversion results are very sensitive to the prescribed error distributions, which are not accurately known. The classical Bayesian framework experiences difficulties in quantifying the impact of misspecified error distributions on the optimized fluxes. In order to cope with this issue, we rely on recent research results to enhance the classical Bayesian inversion framework through a marginalization on a large set of plausible errors that can be prescribed in the system. The marginalization consists in computing inversions for all possible error distributions weighted by the probability of occurrence of the error distributions. The posterior distribution of the fluxes calculated by the marginalization is not explicitly describable. As a consequence, we carry out a Monte Carlo sampling based on an approximation of the probability of occurrence of the error distributions. This approximation is deduced from the welltested method of the maximum likelihood estimation. Thus, the marginalized inversion relies on an automatic objectified diagnosis of the error statistics, without any prior knowledge about the matrices. It robustly accounts for the uncertainties on the error distributions, contrary to what is classically done with frozen expert-knowledge error statistics. Some expert knowledge is still used in the method for the choice of an emission aggregation pattern and of a sampling protocol in order to reduce the computation cost. The relevance and the robustness of the method is tested on a case study: the in-version of methane surface fluxes at the mesoscale with virtual observations on a realistic network in Eurasia. Observing system simulation experiments are carried out with different transport patterns, flux distributions and total prior amounts of emitted methane. The method proves to consistently reproduce the known "truth" in most cases, with satisfactory tolerance intervals. Additionally, the method explicitly provides influence scores and posterior correlation matrices. An in-depth interpretation of the inversion results is then possible. The more objective quantification of the influence of the observations on the fluxes proposed here allows us to evaluate the impact of the observation network on the characterization of the surface fluxes. The explicit correlations between emission aggregates reveal the mis-separated regions, hence the typical temporal and spatial scales the inversion can analyse. These scales are consistent with the chosen aggregation patterns.

Introduction
Characterizing the global biogeochemical cycles of greenhouse gases requires a reliable understanding of the exchanges at the surface-atmosphere interface. The description of these exchanges must encompass the absolute amounts of gas released to and removed from the atmosphere at the surface interface, the spatial distribution and the temporal variability of the fluxes, and the determination of the underlying physical processes of emissions and sinks. Such an integral depiction is still missing for most greenhouse gases (Ciais et al., 2013). One of the possible approaches to Published by Copernicus Publications on behalf of the European Geosciences Union.
inquire into the surface fluxes is the analysis of the atmospheric signal. The drivers of the spatial and temporal variability of the atmospheric composition are atmospheric transport, chemistry and surface fluxes. Therefore, monitoring the atmospheric composition and using a representation of the atmospheric transport and chemistry with global circulation models (GCMs) or chemistry-transport models (CTMs) can help in inferring information on the fluxes (Bousquet et al., 2006;Bergamaschi et al., 2010). This approach, called atmospheric inversion, suffers from two practical issues in its implementation. First, the atmospheric composition is still laconically documented, though the number of global monitoring projects with extensive surface observation networks and satellite platforms has been increasing in the last decades (e.g. Dlugokencky et al., 1994Dlugokencky et al., , 2009. Indeed, the satellite platforms have a global coverage but the observed atmospheric composition is integrated over the vertical column, while the surface sites can provide continuous observations but only at fixed point locations. Second, the atmosphere behaves as an integrator and the air masses are mixed ambivalently through the transport (Enting et al., 1993). Thus, the inverse problem of tracking back the fluxes from the variability of the atmospheric composition cannot be solved univocally. The Bayesian formalism allows for statistical analyses of the atmospheric signal, so that one can identify confidence intervals of fluxes compatible with the atmospheric composition (Tarantola, 1987).
Bayesian inversions have been extensively used at the global scale, providing insights on the greenhouse gas budgets (e.g. Gurney et al., 2002;Kirschke et al., 2013;Bergamaschi et al., 2013). However, non-compatible discrepancies in the results appear between the possible configurations of atmospheric inversion systems (Peylin et al., 2013). The various configurations include the choice of the atmospheric transport, its spatial and temporal resolutions, the meteorological driving fields, the type and density of the observations, etc. In the Bayesian formalism, some assumptions also have to be made on the transport model error statistics, on the errors made when comparing a discretized model to observations (Geels et al., 2007) and on the confidence we have on the prior maps and time profiles of emissions (Enting, 2002). All these choices are based on technical considerations and on the expert perception of the problem to solve. Comparing results based on different choices that are physically adequate, but subjective, is difficult, especially to track inconsistencies, which enlarge the range of flux estimates.
In the following, we focus on the development of an enhanced Bayesian method that objectifies the assumptions on the statistics of the errors and that takes the unavoidable uncertainties generated by our lack of knowledge on these error statistics into account. In this approach, the confidence ranges of the optimized surface fluxes are computed by a Monte Carlo marginalization on all the possible error statistics, which is more general than the usual Bayesian approach deducing posterior uncertainties from a single error statistic combination only. The weight function for the marginalization is inferred from an already-tested maximum likelihood approach (e.g. Dee, 1995;Michalak et al., 2005), processing the pieces of information carried by the differences between the measurements and the prior simulated concentrations. The potential and consistency of the method is tested through observing system simulation experiments (OSSEs) on a realistic configuration of atmospheric inversion.
The case study is the quantification of methane fluxes in the Siberian Lowlands with a network of surface observation sites that have been operated for a few years by the Japanese National Institute for Environmental Studies (Sasakawa et al., 2010) and the German Max Planck Institute (Winderlich et al., 2010). The characterization of the region is challenging, with co-located massive methane emissions from anthropogenic activity (oil and gas extraction) and from wetlands in summer. Moreover, the wetland emissions have a very high temporal variability (due to their sensitivity to the water table depth and to the temperature; e.g. Macdonald et al., 1998;Hargreaves and Fowler, 1998). Their quantification is then difficult. In order to catch the influence of the sampling bias due to non-regularly distributed observation sites and non-continuous measurements, we produce virtual observations from a known "truth" at locations where real observations are carried out and at dates when the logistical issues do not prevent the acquisition of measurements. We then check the capability of our method to reproduce consistent flux variability and distribution with seven degraded inversion configurations (perturbed transport, flat flux distributions, etc.).
In Sect. 2, we describe the theoretical framework of our method of marginalization. The enhancements on the general theoretical framework need a cautious definition of the problem to be implementable in terms of computational costs and memory limits. In Sect. 3, guidelines for a suitable definition of the problem are developed. The whole structure of the method is summarized in Sect. 4.1. In Sect. 4, we present the particular set-up of the OSSE carried out for proving the robustness of the method. The specific Siberian configuration we test our method on is detailed in Sect. 5. The OSSEs are evaluated along defined objective statistical scores in Sect. 6.

Marginalized Bayesian inversion
We first describe the motivations for using a marginalized inversion in Sect. 2.1. In Sect. 2.2, we describe the marginalization itself and the Monte Carlo approach chosen in order to compute it.

Bayesian inversion framework
The surface-atmosphere fluxes, through transport, cause a variability in the atmospheric mixing ratios of the species we are interested in. The atmospheric inversion relies on the processing of the atmospheric variability in order to infer the surface-atmosphere fluxes. Since the atmosphere is diffusive and irreversibly mixes air masses from different origins, it is physically impossible to infer univocal information on the fluxes from the integrated atmospheric signal alone (Tarantola, 1987;Enting, 2002). We then pursue a thorough characterization of the pdf (probability density function) of the state of the system x (e.g. the spatial and temporal distribution of the surface fluxes, but also background concentrations and baselines in some cases), assuming some prior knowledge on the system and having some observations of the atmospheric physical variables related to our problem. That is to say, we want to calculate the pdf p(x|y o − H (x b ), x b ) for all possible states x; y o is a vector gathering all the available observations, x b is the background vector including the prior knowledge on the state of the system and H is the observation operator converting the information in the state vector to the observation space. Typically, H embraces the atmospheric transport and the discretization of the physical problem. In the scope of applications of the atmospheric inversions, the observation vector y o gathers measurements of dry air mole fraction. As for the observation operator, it is computed with a model which simulates mixing ratios. As we are interested in trace gases, we will consider that the dry air mole fractions can be treated as mixing ratios. In all the following, we also consider that H is linear; hence, H is represented by its Jacobian matrix H and H (x b ) = Hx b . This approximation is valid for all non-reactive atmospheric species at scales large enough, so that the treatment of the local-scale turbulence by the model does not generate numerical non-linearity. When the atmospheric chemistry must be taken into account (for instance with methane), either the window of inversion must be short compared with the typical lifetime in the atmosphere for the linear assumption to be valid, or the concentration fields of the reactant species (e.g. OH radicals for methane) must be accurately known.
In general, the characterization of the pdf is built within the Bayesian formalism with the assumption that all the involved pdfs are normal distributions (Enting et al., 1993). The pdfs are then explicitly described through their mode and their matrix of covariance. In this case, the pdf p(x|y o − Hx b , x b ) ∝ N (x a , P a ) is defined by its mode, x a , the posterior state, and its matrix of covariance, P a . In addition to the linear assumption, we also consider that the uncertainties are unbiased. That is to say, p(x − x b ) ∝ N (0, B) and p(y o − Hx t ) ∝ N (0, R) where x t is the true state of the system. The uncertainty matrix B (resp. R) encompasses the uncertainties on the background x b (resp. on the measurements and on the model, including representation errors, i.e. the errors made when approximating the real world by a numerical gridded model). Under these assumptions, we can explicitly write the posterior vector and the posterior matrix of covariance: with K = BH T (R + HBH T ) −1 the Kalman gain matrix.

Ambivalent uncertainty set-up
Atmospheric inversion is straightforward (apart from technical issues in the numerical implementation of the theory) as long as the uncertainty matrices R and B are defined. Some of their components can be calculated unambiguously, such as measurement errors in matrix R. Other errors are derived, in most cases, following expert knowledge on, e.g. the behaviour of the atmospheric transport and of the surface fluxes. This expert knowledge is acquired, for example, through extensive studies on the sensitivity of the transport model to its parametrization and forcing inputs (e.g. Denning et al., 1999;Ahmadov et al., 2007;Lauvaux et al., 2009;Locatelli et al., 2013), or by comparing prior fluxes to measured local fluxes (e.g. Chevallier et al., 2006). Some studies also rely on pure physical considerations (e.g. Bergamaschi et al., 2005Bergamaschi et al., , 2010. However, the complex and unpredictable structure of the uncertainties is hard to reproduce accurately from the expert knowledge alone and an ill-designed couple of uncertainty matrices (R, B) can have a dramatic impact on the inversion results (e.g. Berchet et al., 2013;Cressot et al., 2014). The discrepancies between the possible configurations of inversion can also reveal some biases, η, in the models: in that case p(y o − Hx t ) ∝ N (η, R) instead of p(y o − Hx t ) ∝ N (0, R), which would require a different handling of Eq. (1). For example, the horizontal wind fields can be biased or the vertical mixing in the planetary boundary layer systematically erroneous. That makes it difficult to compare simulated concentrations in the boundary layer to measurements (e.g. Peylin et al., 2002;Dee, 2005;Geels et al., 2007;Williams et al., 2014;Lauvaux and Davis, 2014). Biases can have critical impacts on inversion results and must be inquired into independently (e.g. Bocquet, 2011). Nevertheless, for our study, we decide to neglect the biases in the inversion. We discuss in Sect. 6.3 the potential impacts of biases that are not significant in our specific application. We then focus only on the mis-specification of the uncertainty matrices R and B.

Possible uncertainty handling
In order to address the uncertainty issue in atmospheric inversions, efforts are carried out towards objectifying the way the error statistics are chosen (e.g. Schwinger and Elbern, ] (as defined in Sect. 2.2). The red curve represents the normal distribution with the same mode and tolerance interval; the green one stands for a normal distribution with the same mode and the same standard deviation; the black one is the posterior distribution computed with the maximum likelihood couple of uncertainty matrices, presenting under-estimated skewness compared with the Monte Carlo distribution.
2010; Winiarek et al., 2012;Berchet et al., 2013). These efforts focus on specific algebraic properties of the uncertainty matrices (e.g. Desroziers and Ivanov, 2001;Desroziers et al., 2005) or more generally on understanding the likelihood of the prior innovation vector, y o − Hx b , as a function of the uncertainty matrices (Dee, 1995). Under Gaussian assumptions, the likelihood of the innovation vector can be written with d the dimension of the observation space and | · | the determinant operator.
In the likelihood framework, the couple of uncertainty matrices (R, B) that maximizes Eq. (2) is considered as optimal and will be hereafter referred to as the maximum likelihood. This maximum likelihood optimally balances the observation and prior state error variances and covariances according to the prior innovation vector y o − Hx b (Chapnik et al., 2004). A direct algorithm computing the maximum likelihood (applied to atmospheric inversion in, e.g. Winiarek et al., 2012;Berchet et al., 2013) is then supposed to provide a good approximation of the couple of optimal matrices (R max , B max ) which can be used forward in the inversion (Eq. 1). In order to dampen the computation cost of the maximum likeli-hood estimation, most studies just maximize the likelihood on hyperparameters (e.g. correlation lengths), describing the couple of matrices (R, B) in a more simple way.
Though general, the estimation of the innovation vector maximum likelihood relies on strong assumptions, it can suffer from strong numerical errors and it is not necessarily univocal. More explicitly, as showed in previous works, the pdf of the uncertainty matrices p(R, B) behaves as a χ 2 distribution with d degrees of freedom, d being the dimension of the observation space. Thus, the likelihood is highly dominated by the mode of p(R, B), co-located with the maximum likelihood. However, the peaked likelihood argument may be too rough in some cases. As the number of observations decreases compared to the number of state dimensions, this optimal case becomes less univocal. In the frameworks where observations are too scarce, the maximum likelihood may lead to flawed results. To assess the validity of the peak assumption, estimations of the Hessian matrix of the likelihood at its maximum have been used (e.g. Michalak et al., 2005;Wu et al., 2013). Hessian matrices give the magnitude of the uncertainties on the computation of the uncertainty matrices. Nevertheless, to our knowledge, no atmospheric inversion accounts for the impact of the Hessian matrix of the likelihood on the inversion results.
In addition, even when the pdf p(R, B) is intensely peaked at its maximum, the inferred inversion results from Eq. (1) with a direct maximum likelihood algorithm would erroneously under-estimate uncertainties on the result (see Fig. 1 and, e.g. Berchet et al., 2013). Indeed, at the maximum likelihood, all the pieces of information in the system are considered perfectly usable by the inversion, which then gives too optimistic posterior uncertainties in this case.

Theoretical formulation
Here, we compute the pdf p(x|y o − Hx b , x b ) by a marginalization on the uncertainty matrices to comprehensively account for the uncertainties in the characterization of the uncertainties and to quantify the impact of ill-specified uncertainty matrices. In statistics, marginalizing a pdf p(x) consists in rewriting it as a sum of conditional probabilities p(x|z) weighted by p(z).
Thus, the complete pdf p(x|y o − Hx b , x b ) classically described by Eq.
In Eq. (3), (.) depicts a dependency to the couple (R, B). The complete pdf p(x|y o − Hx b , x b ) then has the shape of an infinite sum of weighted normal distributions. This infinite sum could be described as a multi-variate T-distribution (Bocquet, 2011).
The general expression of Eq.
(3) encompasses the classical case with only one couple of matrices (R, B) which considers p(R, B|y o − Hx b , x b ) as a Dirac-like distribution (centred at the maximum likelihood or at any expert-based couple of uncertainty matrices). More generally, p(R, is not so well known as discussed in Sect. 2.1.3 above.

Monte Carlo sampling
Hereafter, a direct Monte Carlo characterization of Eq. (3) is carried out to deduce p(x|y o − Hx b , x b ). The Monte Carlo ensemble is to be defined along the pdf p(R, B), but the exact distribution of the error statistics is intricate. In all the following, we then approximate the pdf p(R, B) by a multi-variate χ 2 distribution with d (the number of observations) degrees of freedom, centred at the maximum likelihood of the prior innovation vector (following Dee, 1995). The Monte Carlo marginalization is consequently a direct extension of the maximum likelihood estimation now classically used in the atmospheric inversion framework.
The maximum likelihood can be estimated first by a quasi-Newtonian descent method. However, descent methods have high computation costs and thus require a reduced number of hyperparameters (variances, correlation lengths, etc.) to describe the full uncertainty matrices. From here, we decide to reduce the distribution of the matrices (R, B) to the subspace of the diagonal positive matrices. Using a subspace of the possible error statistics can dampen the generality of the method. In particular, error correlations will be excluded with diagonal uncertainty matrices. Correlations can be used in some frameworks to detect the biases in the system (Berchet et al., 2013). But, more importantly, correlations of observation or background errors can indicate redundant pieces of information in the inversion system. For instance, an inversion computed with no observation correlation tries to use too much information and is expected to give too optimistic a reduction of uncertainties on the fluxes. Nevertheless, in Sect. 3, we reduce the observation and state spaces in order to numerically compute the Monte Carlo marginalization. The reduction of the observation and state spaces indirectly depicts correlations in the full-resolution system. In this configuration, the correlation issue is then attenuated and the diagonal assumption is valid.
At the end, for each diagonal term of the uncertainty matrices (R, B), we prescribe a χ 2 distribution with d (i.e. the dimension of the observation space) degrees of freedom, rescaled so that its average equals the associated term in the computed maximum likelihood couple (R max , B max ). That is to say, for each diagonal element r i,i of the matrix R (equivalently of the matrix B): as the mean of the χ 2 distribution with d degrees of freedom, The χ 2 distributions are then sampled on a large ensemble -the Monte Carlo approach stabilizes after tens of thousands of draws in our case study -to characterize the final output not from the ensemble of posterior fluxes x a , but from a perturbed ensemble of ( x), with each x a random sample of N ( x a , P a ).

Processing the Monte Carlo posterior ensemble
In Fig. 1, we draw an example of the distribution of the Monte Carlo posterior vector ensemble along one component of the state space. The black curve depicts the posterior distribution inferred from the maximum likelihood, with underestimated spread compared to the Monte Carlo distribution. On the opposite, as illustrated by the green curve, a normal distribution with the same mode and the same standard deviation gives a misleading flat shape. As for a Gaussian, we then define the symmetric tolerance interval, so that 68.27 % of the samples are in the range (the hatched portion of the histogram in Fig. 1). This interval is equivalent to the Gaussian ±σ interval, with σ the standard deviation. One must remember that the computed tolerance interval does not depict a normal distribution. A normal distribution with the same tolerance interval (the red curve in Fig. 1) is still misleadingly flat. In all the following, we will write the tolerance interval TI 68 , [x low , x high ]. To summarize (as represented in the block diagram of Fig. 2), the maximum likelihood is first estimated using a quasi-Newtonian algorithm, similarly to what has been done in the literature (e.g. Winiarek et al., 2012;Berchet et al., 2013). We deduce from this maximum likelihood a plausible distribution of the uncertainty matrices (R, B). Through a Monte Carlo sampling of uncertainty matrices (R, B) along the deduced distribution, we compute an ensemble of possible posterior vectors ( x a (R,B) ). We can then define the tolerance intervals TI 68 and a posterior covariance matrix filled by the covariances of the ensembles of state components with each other.
Posterior covariance matrices are not always easy to compute in the atmospheric inversion framework. Here, the posterior covariance matrix is computed explicitly and objectively. The explicit definition of this matrix can give valuable information on the ability of the inversion to separate co-located emissions and emissions at different periods and

Informed definition of the problem
The general approach defined in Sect. 2 applies a Monte Carlo method on tens of thousands of individual inversions after the completion of a maximum likelihood algorithm. This procedure requires extensive amounts of memory and computation power that cannot be afforded in most real cases. For instance, the explicit computation of H with a CTM closely depends on the dimension of the state space: every column of the observation operator needs one model simulation (Bousquet et al., 1999). Additionally, each step of the algorithm to compute the maximum likelihood of the prior innovation vector and each step of the Monte Carlo method relies on matrix products, matrix determinants and matrix inverses. At first sight, all these operations are as many technical issues in high-dimension problems. As a consequence, the application of the theoretically simple framework developed in Sect. 2 relies closely on an informed definition of the problem. The dimensions of the observation and state spaces should be reduced to dampen the numerical obstacles, but one shall keep resolutions physically relevant for the system we are analysing. By synthesizing the recent literature on the subject, we show in the following that approximations can be reasonably applied to the full-resolution problem while not impacting the quality of the marginalized inversion results. Applying the Monte Carlo marginalized inversion is then technically feasible in a problem defined with a reduced dimension from the fulldimension problem.

Motivations and definition
In the observation space, more and more surface observation sites nowadays provide quasi-continuous measurements (at least a few data points per minute in the data set we use; Sasakawa et al., 2010;Winderlich et al., 2010). For long windows of inversion at the regional scale (of a few weeks or months), such a frequency of acquisition generates a number of data points technically impossible to assimilate all together in our framework. Concerning the fluxes, one shall aim at a characterization of the fluxes on very fine pixels and at a high temporal resolution. As the window of inversion lengthens and the domain widens, the number of flux unknowns grows dramatically.
In the inversion framework, the most straightforward way of minimizing the dimension of a problem is to reduce the dimensions of the observation and state spaces. Aggregating components of the state space and sampling observations are classically used for this purpose. In most studies, the reduction of the problem is carried out arbitrarily. However, ag-gregation can generate large errors (Kaminski et al., 2001;Bocquet et al., 2011), which would mitigate the benefits of the Monte Carlo marginalized approach compared to more classical ones applied in other atmospheric inversion studies with no aggregation (e.g. variational inversions; Courtier et al., 1994;Bergamaschi et al., 2005;Pison et al., 2009). Here, we propose a more objective way to do so following recent literature.
Using the formalism from Bocquet et al. (2011), we aim at defining a representation ω that encompasses the horizontal and temporal resolution of the fluxes, the choice of the regions of aggregation and the temporal sampling of the observations. The representation ω is defined through two operators ω and ω , which projects respectively the fullresolution state and observation space to smaller ones. After the state space "projection" with ω , the inversion applies corrections on regions of aggregation with fixed emission distributions, instead of on single pixels. The adjoint of this operator, T ω , then redistributes total emissions on finer scales with the same fixed emission distribution. The choice of ω impacts both the state vector x and the observation operator H. The observation sampling ω can consist in averaging or picking one value per time step (chosen accordingly to the physical resolution inquired into). For instance, one can decide to average the observations by day in order to study the synoptic variability of the atmosphere, related to the fluxes at the mesoscale. The observation sampling applies to both the observation vector y o and the observation operator H. The observation operator H computes the contribution from single sources to single observations. The adjoint of the observation sampling, T ω , will then redistribute an average or a sample for each chosen time step along this same time step. The redistribution will follow the raw observed temporal profile within the processed time step.

Mathematical formulation
At first glance, choosing the aggregation pattern and the sampling protocol can be considered as two independent physical problems. However, as they both influence the dimension of the observation operator H, they cannot be fixed separately. More explicitly, we can derive a formula which links ω and ω . Indeed, our final objective is to compute total posterior fluxes for each aggregated region that are as close as possible to the posterior fluxes from a full-resolution inversion aggregated afterwards. That is to say, we want to confine the norm of x a ω − ω x a t with x a ω , the posterior state vector resolved in the representation ω and x a t the posterior state vector computed with a full-resolution representation of the problem. Algebraic manipulations lead to where x t is the true state of the system, I is the identity matrix.
In Eq. (5), R and B are the full-resolution matrices of the error statistics.
For the aggregation errors to be limited, E ω (Eq. 6a) must tend towards 0. Then, S (Eq. 6b) and S ω (Eq. 6c) must be as close as possible to each other and the impact of P ω (Eq. 6d) and of the sandwich product with ω , T ω (·) ω , must be as small as possible. T ω extrapolates the fluxes from the aggregated regions to a finer resolution following an a priori repartition. The matrix P ω then redistributes the fluxes over a region with respect to the prior repartition, but keeping the same total emissions on the region.
In Sect. 3.2 below, we explain how to reduce these terms. The exact estimation of Eq. (5) is complicated and requires extensive numerical resources (e.g. Wu et al., 2011). In the following, we use physical considerations towards minimizing Eq. (5). The errors that are intrinsic to the aggregation process and that are unavoidable are controlled so that the benefit from the general marginalization is not wasted. We show in Sect. 6.3 that the physical considerations for choosing the representation ω in our case do not depreciate the inversion results compared to what would have been obtained with the exact resolution of Eq. (5).
Considering the computer resources we use, all the principles we define are applied in order to limit the size of the observation space (resp. the state space) to a dimension of roughly 2000 (resp. 1500). For instance, in the mesoscale Eurasian case study described in Sect. 5, these considerations lead to the aggregation patterns displayed in Figs. 2 and 6. With these problem dimensions, the ensemble used in the Monte Carlo sampling consists of 60 000 draws.
When the observation and the state space aggregation are chosen, the operator H can be computed with the so-called "response functions", based on forward simulations of the transport for each state component (Bousquet et al., 1999).

Observation space sampling
The sandwich product with ω , T ω (·) ω , aggregates the errors in the observation space and diffuses them back within each aggregate along a prescribed temporal profile. For example, it can typically compute the average error per day; then, it allocates for each subdaily dimension an error pro-portional to the contribution of the related component of y o to the daily mean. However, a daily averaging would be dominated by the outliers (e.g. singular spikes or night-time observations when the emissions remain confined close to the surface due to weak vertical mixing) that are generally associated with very high observation errors (due to fine-scale misrepresentations of the transport and erroneous night vertical mixing). For this reason, we decide to define ω as the sampling operator, which, for each day and observation site, picks the component of the observation vector when the daily minimum of concentrations within a planetary boundary layer higher than 500 m is observed. Below this threshold, the vertical mixing by the model is known to be possibly erroneous (e.g. Berchet et al., 2013). The daily resolution is chosen in order to keep a representation of the transport relevant to the mesoscale expectations on flux characterization. Higher time resolution would not improve the inversion efficiency due to strong within-day temporal correlations of errors (Berchet et al., 2013).

Observational constraints
One can notice that far from the observational constraints, the atmospheric dispersion (depicted by the sandwich product with H, H(·)H T ) makes the potential errors negligible compared to the errors generated in the areas close to the stations. Indeed, gathering two close hotspots of emissions thousands of kilometres away from the observation sites is not problematic since the air masses coming from the two spots will be well mixed. On the contrary, two hotspots that are as distant from each other as the first two, but close to an observation site, will generate plume-like air masses with a very high sensitivity to the errors of mixing and transport in the model. We use an estimation of the observation network footprints (approximating H T ) in order to fix the typical regions constrained by the network and avoid unfortunate grouping. At this step, approximate footprints are preferred to the heavy computation of the complete H T and are sufficient for our physical considerations. Within the constrained regions, we use a small spatial resolution for the fluxes and the transport and fine aggregation patterns; outside of them, we choose a coarse resolution and large aggregation patterns. These guidelines for using footprints prior to an inversion can be applied more systematically, as what is done in Thompson and Stohl (2014). An illustration of aggregation patterns in our case study can be looked at in Fig. 6.

Flux aggregation
Some terms in Eq. (5) are directly related to the aggregation of the fluxes. The term HA ω H T in Eq. (6c) depicts the aggregation errors coming from the uncertain distribution and temporal profile of the fluxes within each aggregation region, then transported to the observation sites. It must be close to 0. In our application below, this is particularly Geosci. Model Dev., 8, 1525-1546, 2015 www.geosci-model-dev.net/8/1525/2015/ important for hotspots of emissions, the locations of which are poorly known. The term HP ω BP ω H T in Eq. (6c) must be as close as possible to HBH T . The factors of divergence between these two terms come from the areas that are not well constrained by the observations. If, within a region of aggregation, a part is upwind the observation sites, while the other is not seen, then the aggregation errors will scatter on the unseen part of the region. The main sources of errors can then be separated into two different types: (1) the resolution/representation type, and (2) the constraint type. The type-1 errors are mainly related to the resolution of the observation operator. The models consider that the fluxes and the simulated atmospheric mixing ratios are uniform on a subgrid basis and neglects subgrid processes. This discretization contributes to type-1 errors, as "representation" errors (Tolk et al., 2008). Additionally, the distribution within each aggregation region is fixed and subregion rescaling is forbidden. The fine resolution close to the observation network is bound to confine type-1 errors (e.g. Wu et al., 2011). Additionally, the representation error is critical for co-located emissions, especially when the typical temporal and spatial scales of these emissions are different. For instance, grouping hotspots from oil extraction emissions with widespread wetland emissions that quickly vary in time is hazardous. We then aggregate the emissions along their typical time and space scale, hence according to the underlying physical process. An in-depth analysis of the footprints and the small patterns of aggregation close to the observation sites should limit the type-2 constraint errors. Areas under high observational constraints should not be grouped with underconstrained areas.
The resolution and aggregation choices can be computed objectively, but at a very high cost and only within a framework of prescribed frozen error matrices (Bocquet, 2009;Wu et al., 2011;Koohkan et al., 2013). For our purpose, we cannot afford such computation costs and rely on heuristic choices: small resolution and aggregation patterns close to the observation sites, aggregation by type of emission, separation of constrained/under-constrained areas by analysing the footprints. These non-optimal subjective choices may damp the efficiency of our method and must be carried out cautiously. Nevertheless, in our case, checking our choices after the computation shows that they did not have a critical impact on the inversion results.

Numerical artefacts
In addition to the need of defining a well-sized problem, smart adaptations can be applied to the computation of the method in order to enhance the efficiency of the algorithm. We face several sources of numerical artefacts in the computation of the method. In the quasi-Newtonian maximum likelihood algorithm, numerical artefacts are generated by the under-constrained regions. After a few steps, the computed gradient of the likelihood is dominated by these regions and the algorithm stays stationary. This issue could be partly related to the under-optimality of the chosen representation ω as suggested by the optimality criteria described in Bocquet et al. (2011). The stagnation of the maximum likelihood algorithm could then be used to detect too small regions of aggregation.
The under-constrained regions perturbing the maximum likelihood algorithm can be diagnosed using the diagonal terms of the influence matrix KH (with K defined in Eq. (1) and following Cardinali et al., 2004). This matrix represents the sensitivity of the inversion to elementary changes in the observations. Strong observation constraints are related to high sensitivity. After stagnation, the regions with a diagnosed KH < 0.5 are flagged out and the algorithm is carried on. This way, only the sufficiently constrained components of the state vector are processed until the algorithm converges. A third to half of the regions are flagged out this way in our case study.
The detection of the misrepresentation of hotspot plumes should also be enhanced. Despite the minimum daily sampling and the fine resolution close to the observation network, the plume issue can still generate strong temporal and spatial mismatches. For example, a point source can influence a station in the real world but not in the model because it has been mis-located, and vice versa. This creates significant differences between the simulated and the observed concentrations. The maximum likelihood algorithm attributes such mismatches to prior errors and/or observation errors. High diagnosed errors in the maximum likelihood algorithm are then a criterion for plausible mismatches. We know such plumes must be flagged out from the inversion to avoid irrelevant high influence from very local sources hard to represent. Since we notice that the observation and prior computed errors seem to follow a Fischer-Snedecor distribution, we choose to flag out the observations that are within the 95 % tail of the distribution.

Validation experiments
In Sect. 2, we described our modified atmospheric inversion by marginalization. In Sect. 3, we proposed some essential rules to follow in order to properly define the problem, so that the rather simple theoretical framework is not hindered by finite numerical resources. The marginalization method has to be validated along objective criteria. In the following, we summarize the general structure of the method in order to identify the critical points to test in the method (Sect. 4.1). We deduce from these points some OSSEs to carry out. In Sect. 4.2, we define the scores according to which the method will be evaluated.

Method summary
The method described in Sects. 2 and 3 is condensed in the block diagram in Fig. 2. The marginalized inversion takes the same input as any other atmospheric inversion: some atmospheric measurements and prior maps of fluxes with specified resolution and temporal profiles. In Sect. 3, we gave recommendations on the processing of the "raw" inputs, so we get an observation vector y o , a prior state vector x b and an observation operator H that are small enough to be computable by the method. These highlights are mainly the sampling of the observations per day (in accordance with our objective of characterizing mesoscale fluxes in our case study) and the aggregation of the fluxes by regions (based on physical considerations and footprint analysis). The maximum likelihood algorithm processes y o , x b and H in order to find a couple of optimal diagonal error matrices (R max , B max ). This maximum likelihood is found by a quasi-Newtonian descent method. We then infer from (R max , B max ) the approximate χ 2 shape of the distribution of all the possible error matrices (R, B). We carry out a Monte Carlo sampling on these distributions of errors and get an ensemble of posterior state vectors ( x a ). The processing of this ensemble provides the final output of the method: a tolerance interval TI 68 of the posterior state and the posterior correlations between the components of the state space. The method also allows for the explicit computation of the influence matrix K max H in order to analyse the constrained regions of emissions only.
To summarize, the marginalized inversion processes two vectors and one operator: y o , x b , and H, as any other atmospheric inversion. The main difference with most other atmospheric inversions resides in the objective and automatic computation of the influence of ill-specified error statistics, in contrast with the traditional assigning of frozen error matrices based on expert knowledge and with the more recent online computations of error hyperparameters. Thus, we do not have to inquire into the sensitivity of our method to the prescribed spatial correlations of flux errors, or to the error variances. Such a sensitivity is transposed to the choice of the aggregation patterns and the sampling protocol, as defined in Sect. 3.1. The chosen configuration of aggregation and the sampling protocol are checked afterwards to be relevant in our case study. OSSEs are then to be carried out to evaluate the sensitivity of the method to y o , x b , H.

Test strategy
We assume that, in our case, the method is not sensitive to errors in y o . Indeed, in all the following, we consider that the measurement errors are negligible compared to transport errors; this is true for surface sites that fulfil the World Meteorological Organization's strict recommendations for accuracy and precision (WMO/GAW, 2011). This approximation does not hold for satellite total column measurements, for which the transport errors are smoothed over the vertical atmospheric column, and the instrument errors are larger. In addition, representativeness errors may also impact y o . OSSEs should account for these errors. However, OSSEs may face difficulties in explicitly highlighting these errors. Therefore, we do not perturb y o in order to represent the instrumental uncertainties and representativeness errors in the OSSEs.
The OSSEs are then based on perturbations of x b and H. The discrepancies between the background x b and the "truth" x t are of two types: (1) the erroneous distribution and temporal profile of the fluxes within aggregation regions, and (2) incorrect total emissions by region. For example, in Eurasia, the maps of the distribution of the wetlands differ drastically from one database to another (Frey and Smith, 2007). Apart from the distribution, the amount of gas emitted by each process is uncertain, due to mis-parametrizations or, for anthropogenic emissions, mis-specified activity maps (e.g. Rypdal and Winiwarter, 2001). The transport H differs from the "true" transport mainly because of the resolution of the model, the parametrization of subgrid processes (such as vertical turbulent mixing in the planetary boundary layer or deep convection), and the meteorological forcing fields (which are not necessarily optimized for transport applications).
The main sources of errors in the inversion are then (1) a wrong representation of the transport (highly dependent of the transport model used, its resolution, its parametrization and the exactitude of forcing wind fields), (2) an erroneous distribution of the fluxes within aggregation regions (each inventory and database has different statistical methods and parameters to reproduce surface fluxes), and (3) incorrect total emissions by regions. In order to evaluate the impact of each point on the inversion result, we carry out OSSEs with perfect synthetic observations from a nature run (i.e. with "true" emissions and "true" transport, as defined in the setup in Sect. 5). We test the ability of the marginalized inversion to reproduce the "true" fluxes or, at least, to consistently include the "truth" within the tolerance intervals. There are eight possible combinations of correct or perturbed phases of the three parameters. The "all true" combination is not relevant: y o − Hx b = 0 and the maximum likelihood algorithm is stationary. Seven combinations remain, detailed in Table 1. We run the marginalized inversion for the seven OSSEs and evaluate them along the scores defined in Sect. 4.2 below.

Scoring system
We expect an atmospheric inversion to provide reliable ranges of uncertainties for surface fluxes. That is to say, for as many components of the state vector x i as possible, the "truth" Inversion inputs: Optimal r max 0.5 0.5 0.5 0.5 0.6 0.5 0. ability of producing consistent fluxes, we define a relative score z rel for each component of the state vector: Hereafter, all the scores will be expressed in percentages for better readability. Statistically, z rel has no upper bound. Relative scores bigger than 100 % are not statistically inconsistent, but, for the method to be validated, we expect that the proportion of state components with z rel < 100 % is dominant.
Furthermore, the atmospheric inversion is supposed to reveal pieces of information to the understanding of the system. Then, we also expect that a correct relative score below 100 % is not reached by specifying huge tolerance intervals. To evaluate the ability of the marginalization of getting close to the reality, i.e. providing valuable information on the state of the system, we define an absolute score z abs : The smaller the absolute score, the more accurate the marginalized inversion. An inversion also must be able to evaluate the observation constraints on the regions. An objective estimator of the constraints on the regions is the influence matrix KH defined in Sect. 3. The Kalman gain matrix depends on the couple (R, B). Amongst all the Monte Carlo draws, we compute the influence matrix K max H for the couple associated with the maximum likelihood. The diagonal terms of this matrix range from 0 to 1. They give for all components of the state space the constraint given by the observations. We then define the influence score: (z infl ) i = (K max H) i . The closer these terms are to 100 %, the more constraints the inversion provides. We can then deduce the typical range of influence of the observation sites and detect the blind spots of the used network.
For each component i of the state space, we then have defined three indicators:

Posterior correlation processing
Another point most inversions do not compute explicitly and objectively is the typical temporal and spatial scales the inversion can differentiate in the fluxes, considering the atmospheric transport and the density of the observations. Our marginalized inversion gives access to an explicit matrix of correlations as defined in Sect. 3. Strong positive and negative correlations between two components of the state space indicate that the inversion cannot separate the contributions from the two components. For example, air masses observed at a station and coming from two regions upwind the station will have a mixed atmospheric signal difficult to analyse. Colocated emissions are also not necessarily differentiated in the atmospheric signal. Moreover, in a regional framework, when a model of limited area is coupled to lateral boundary conditions (LBC), the inversion must explicitly alert on the regions that cannot be separated from the boundary conditions, i.e. from the baseline signal.
In the case of strong correlations between two components of the state space in the posterior covariance matrix, we consider that it is not relevant to try to infer specific information for the two separate components. Then, we group the state space components according to their posterior correlations. We define a threshold of correlation r max and associate couples of regions (i, j ) within groups such that |r i,j | > r max . If we prescribe r max = 0, all the regions will be grouped; conversely, if r max = 1, no group will be formed. The optimal correlation threshold is not evident. We test the grouping for all possible r max values. We flag out from the processing of the results all the groups, which include some contributions from the LBC. Thus, from this post-processing, we only keep the regions that are clearly constrained by the observation sites, with no interference from the LBC. Moreover, we can infer the spatial and temporal scale that the inversion can resolve from the grouping patterns.
In Table 1, the three scores defined in Eq. (7) are averaged on the whole domain of interest for the optimal correlation threshold r max (as discussed in Sect. 6.1).

Set-up of the OSSEs
We compute the OSSEs that we described in Sect. 4 in a realistic mesoscale case. We focus on a domain spanning over Eurasia, from Scandinavia to Korea. At this scale, the air masses' residence time is typically of days to a few weeks. This timescale is small compared to the 8-10-year lifetime of methane in the atmosphere (mainly due to oxidation by OH radicals; Dentener et al., 2003). Hence, the observation operator can be considered linear. We apply the method on a region characterized by significant fluxes, with collocation of different sources with different emission timescales: Siberia. We describe the region of interest and the chosen "truth" for the experiments in Sect. 5.1. We use two transport models in order to simulate atmospheric transport. The technical details on these models are summarized in Sect. 5.2. In Sect. 5.3, we explain how we choose and compute the synthetic observations for our experiments.

State space components
In the region of interest, the emissions of methane are dominated by wetland, anthropogenic (here, mainly related to the oil and gas industry) and wildfire emissions. In Fig. 3, the distributions of the wetlands and of the oil and gas industry in the region are displayed. Anthropogenic emissions of methane in the region are mainly hotspots related to the intense oil and gas industry in the Siberian Lowlands and to the leaks in the distribution system in population centres in the south part of Siberia. Wetland emissions are mainly confined in the lower part of Siberia in the west Siberian plain, half of which is lower than 100 m above sea level.
The spatial distribution of the associated fluxes is deduced from the (1) EDGAR database v4.2 (http://edgar.jrc. ec.europa.eu) for year 2008 for anthropogenic emissions, (2) LPX-Bern v1.2 process model at a monthly scale for wetland emissions (Spahni et al., 2011), and (3) GFED database at daily scale for wildfires (Giglio et al., 2009). The EDGAR inventory uses economic activity maps by sectors and con-volves them with emission factors estimated in laboratories or with statistical studies (Olivier et al., 2005). LPX-Bern is an update of process model LPJ-Bern (Spahni et al., 2011). It includes a dynamical simulation of inundated wetland areas, dynamic nitrogen cycle, and dynamic evolution of peatlands (Spahni et al., 2013;Ringeval et al., 2014). The model uses CRU TS 3.21 input data (temperature, precipitation rates, cloud cover, wet days) and observed atmospheric CO 2 for each year for plant fertilization. GFED v4 is built from the burnt-area satellite product (MCD64A1). CH 4 emissions at monthly and daily scales are deduced from the burnt areas using the Carnegie-Ames-Stanford Approach (CASA model; Potter et al., 1993) and emission factors (van der Werf et al., 2010). Wildfire emissions can be very strong and are punctual in time and space; they are then difficult to analyse by the inversion. Wildfires are included as inputs to the marginalized inversion but are automatically filtered out during the computation. In all the following, we evaluate the OSSEs only in terms of anthropogenic and wetland emissions.
In addition, at the mesoscale, we use a CTM (see Sect. 5.2.2) with a limited area domain. Initial and lateral boundary conditions (IC and LBC) are then also to be optimized in the system to correct the atmospheric inflow in the domain. Lateral concentrations are deduced from simulations at the global scale by the general circulation model LMDZ with the assimilation of surface observations outside the domain of interest (Bousquet et al., 2006). We aggregate the LBC along four horizontal components and two vertical ones (1013-600 and 600-300 hPa).

Generation of a perturbed reference state x t
The EDGAR fluxes are given at the yearly scale and the LPX fluxes are calculated at a monthly scale. Additionally, LPX monthly fluxes exhibit smoothed patterns while wetland emissions can vary drastically from a point to another. We want the nature run for OSSEs to reproduce the potential spatial and temporal variability of the emissions. To do so, we intensify the spatial and temporal contrasts from the databases to the nature run. We then compute the "true" state vector x t by perturbing EDGAR emissions on a monthly basis and LPX on a 10-day basis. That is to say x t = α ⊗ x data , with the vector α depicting the scaling factors by state space component, ⊗ the point-wise multiplication operator and x data the emissions from the databases. The perturbations in α from original EDGAR and LPX databases applied to get the "truth" are scaling factors of up to 10. These scaling factors could have been chosen randomly, but we prefer inferring them with a raw expert-knowledge-based inversion using real data. The purpose of using real data for computing x t is to generate potentially realistic variations within the state space.
For both anthropogenic and wetland emissions, the scaling factors can significantly differ from a period of inversion to another. We can then evaluate the ability of the marginalized inversion to catch quick variations. The distribution of the The colour bar shows the altitude above sea level (from ETOPO1 database; Amante and Eakins, 2009). Red dots (resp. orange triangle) depict hotspots of CH 4 emissions (based on EDGAR v4.2 inventory; see Sect. 5.1) related to oil welling and refineries (resp. gas extraction and leaks during distribution in population centres). Purple squares highlight the observation site localization. Blueish shaded areas represent average inundated regions, wetlands and peatlands (from the Global Lakes and Wetlands Database; Lehner and Döll, 2004) scaling factors α is shown in Fig. 4. These factors are plausible, knowing the uncertainties on the wetland emissions and gas leakage (e.g. Reshetnikov et al., 2000). Such target scaling factors are at the edge of the validity of the Gaussian assumption and of the positivity of methane fluxes. The ability of the marginalization to recover such correction factors will prove its robustness.
As for anthropogenic and wetland emissions, we apply the scaling factors α on the components of x t related to LBC by periods of 10 days.
The OSSEs rely on x b perturbed from x t , or not. We apply two types of perturbations as summarized in Table 1. In OSSE 1, 4, 5 and 7, we only implement prior fluxes with different total emissions on the regions of aggregation. We take the emissions of the raw inventories as background to test the ability of recovering "true" fluxes from realistic background fluxes without assigning frozen prior errors. In OSSE 2, 4, 6 and 7, the distribution of the prior fluxes is modified from the "truth". We choose all flat flux distributions for each region of aggregation as prior fluxes.

Simulation of the observation operator H
The observation operator H is deduced from simulations of atmospheric transport. We use two different transport models in order to evaluate the impact of the transport on the inversion. We define H FLEXPART with the Lagrangian dispersion model FLEXPART and H CHIMERE with the Eulerian chemistry-transport model CHIMERE. Any transport model can be considered at some point biased compared with the reality. Confronting the results from FLEXPART to those from CHIMERE will allow us to test the robustness of our method to the biases.

The Lagrangian model: FLEXPART
With the Lagrangian dispersion model FLEXPART (Stohl et al., 2005), we can compute the footprints of the observations, hence H T FLEXPART . We use FLEXPART version 8.2.3 to compute numerous back trajectories of virtual particles from the observation sites. The model is forced by the European Centre for Medium-range Weather Forecast (ECMWF) ERA-Interim data at a horizontal resolution of 1 • × 1 • , with 60 vertical levels and 3 h temporal resolution (Uppala et al., 2005). Virtual particles are released in a 3-D box centred around each observation site with a 10-day lifetime backwards in time. The footprints are computed on a 0.5 • × 0.5 • horizontal grid, following the method of Lin et al. (2003), taking the boundary layer height at each particle location into account. The footprints only have to be convolved with the emission fields in order to get simulated concentrations at the observation sites. The method for computing the footprints considers that only the particles within the boundary layer are influenced by surface emissions and that the boundary layer is well-enough mixed to be considered as uniform. The uniform vertical mixing of the mixing layer can generate a bias on the surface-simulated concentrations. Such a bias is critical in the classical inversion framework and consequently in the one we describe since all the uncertainties are considered unbiased.
FLEXPART can easily compute an estimation of the adjoint of the full-resolution observation operator before choosing the representation ω. Hence, despite the expectable biases, we use this model to estimate the footprints of the network and deduce the aggregation patterns needed to compute H CHIMERE . This same model FLEXPART may also be used to compute explicitly and rigorously the representation ω according to objective criteria (Koohkan and Bocquet, 2012).

The Eulerian model: CHIMERE
Using the Eulerian mesoscale chemistry transport model CHIMERE Menut et al., 2013) constrained by non-hydrostatic meteorological fields, we explicitly define the observation operator H CHIMERE by computing the forward atmospheric transport from the emission aggregated regions (defined according to Sect. 3 criteria) to the observation sites. This model was developed in a framework of air quality simulations (Schmidt et al., 2001;Pison et al., 2007) but is also used for greenhouse gas studies (Broquet Table 2. Eurasian site characteristics (Sect. 5.3). The altitudes of the sites are given as metres above sea level (a.s.l.) and the inlet height is in metres above ground level (a.g.l.  Berchet et al., 2013). We use a quasi-regular horizontal grid zoomed near the observation sites after the considerations of Sect. 3. The domain of interest is of limited area and spans over the mainland of the Eurasian continent (see Fig. 3). The average side length of the grid cells near the stations is 25 km, while it spans over 150 km away from the observation sites. The 3-D-domain roughly embraces all the troposphere, from the surface to 300 hPa (∼ 9000 m), with 29 layers geometrically spaced. The model time step varies dynamically from 4 to 6 min depending on the maximum wind speed in the domain. The model is an offline model which needs meteorological fields as forcing. The forcing fields are deduced from interpolated meteorological fields from ECMWF with a horizontal resolution of 0.5 • × 0.5 • every 3 h.

Synthetic observations y o
We compute the nature run, i.e. the synthetic observations, from the "true" state vector, with the CTM CHIMERE. That is to say, in all the following, we consider that y o = H CHIMERE x t . The site and date of available observations are chosen according to the operated observation sites in the region. Thirteen Eurasian surface sites have been selected. These sites are maintained by NIES (Tsukuba, Japan; Sasakawa et al., 2010), IAO (Tomsk, Russian Federation), MPI (Iena, Germany; Winderlich et al., 2010), NOAA-ESRL (Boulder, United States of America; Dlugokencky et al., 2009), and KMA (Seoul, Korea). The description of the sites is given in Table 2. The observation sites do not carry out measurements all year-round due to logistical issues and instrument dysfunctions. In order to reproduce this sampling bias, we generate synthetic observations only when real measurements are available from January to December 2010.

Results and discussion
After the description of the set-up in Sect. 5, we now have a "true" state x t and some reference observations y o . We also have two observation operators H CHIMERE and H FLEXPART and several possible prior fluxes x b as inputs for the marginalized inversion developed in Sect. 2. In order to evaluate the method, we now carry out the OSSEs described in Table 1 following the complete procedure in Fig. 2. In Sect. 6.1, we examine the average robustness of the method. Then, in Sect. 6.2, we explore the spatial efficiency of the marginalized inversion in our case study. In Sect. 6.3, we discuss the enhancement provided by our method compared to the classical Bayesian framework, despite some limitations.

Impact of the correlation processing
The marginalization should consistently reproduce the nature run in the OSSEs or, at least, it should detect its inability in characterizing the fluxes from the given atmospheric constraints. As detailed in Sect. 4.2, the aggregation regions may have strong posterior correlations after the marginalized inversions. This denotes the difficulties that the inversion encounters in separating some emissions. The aggregation regions can be grouped along correlation thresholds r max arbitrarily chosen in order to explicitly take the emission dipoles into account. In Fig. 5, we plot the profiles of the scores defined in Sect. 4.2 along the possible correlation thresholds r max for grouping the regions. Specifying a correlation threshold r max allows identifying the typical temporal and spatial scales that the inversion can separate. In the case of a limited domain CTM, the influence of the LBC and of the inside fluxes can be mis-separated. The correlations take this issue into account and the correlation threshold specifies the tolerance to such mis-separations. For all OSSEs, the influence score z infl increases with r max . In the correlation processing after the computation of the marginalized inversion, the threshold r max depicts the degree of tolerance to mis-separation between inside fluxes and LBC. The higher the threshold of tolerance r max , the fewer inside fluxes are considered inseparable from the LBC. Hence, fewer aggregation regions are eliminated from the inversion and more fluxes are corrected by the inversion. As the number of constraints increases, we notice that the absolute and relative scores, z abs and z rel , also tend to increase with r max . That is to say, if we only try to get average information on big under-resolved regions, the posterior fluxes can be expected to be closer to the "truth". On the contrary, if we try to process too much spatial information from the inversion, we must expect more discrepancies with the "truth".
In particular, in Fig. 5, one can notice some outlier peaks for low r max . For low r max , very few regions are computed in the inversion. The peaks are created by the regions that are not any more considered as mis-separated from the LBC when r max increases. For some OSSEs, these newly computed regions have very wrong scores and dominate upon the other few, computed regions. For this reason, one should be very careful in the chosen correlation threshold. In order to avoid the score instability, the optimum threshold should be chosen higher than 0.4. Above 0.5, in our mesoscale case study, as described above, the inversion is limited by the temporal and spacial variability of the fluxes to optimize and by the transport biases. Then, it cannot reach the requirement of consistent reproduced fluxes.
One should find a balance between the physical scales one wants to separate and the consistency of the results. In Table 1, we summarize the scores of every OSSE for a chosen correlation threshold with respect to result consistency.

Hotspots and large-area emissions
Both in Table 1 and Fig. 5, looking at a given correlation threshold r max , one would expect influence, relative and absolute scores that get worse when the inversion condition degrades.
The fossil fuel influence score follows this trend: the more perturbed the transport and the prior fluxes are, the more state space components are considered un-invertible. The hotspot regions of emissions are broadly filtered out and the remaining regions can be well characterized by the inversion even with wrong distribution and transport patterns. Some effects in the degrading conditions of the inversion can however compensate each other. For example, the absolute scores of OSSEs 5 and 7 are better than the one of OSSEs 3 and 6.
The situation for wetland emissions is different. These emissions are smoother than oil and gas emissions and are then not excluded because of wrong transport or distributions. For this reason, the influence score does not exhibit a clear trend with degrading inversion conditions. For wetland regions, transport seems to be the predominant factor of errors. OSSEs 3, 5, 6 and 7 do not consistently reproduce the "truth" with relative scores higher than 100 % when r max ≥ 0.4. These discrepancies can be attributed to the very high variability prescribed in the "true" wetland emissions. An erroneous transport will fail in detecting brutal changes of emissions at the synoptic scale. The wetland emissions should then be grouped temporally and spatially in order to average the point releases of methane.
The erroneous tolerance intervals can also be attributed to the biased transport in FLEXPART compared with CHIMERE. Since we filtered out most of the plumes with spatial and temporal mismatches with the observations, the horizontal biases in the transport are confined. Concerning the vertical bias, a wrong simulated vertical mixing in the planetary boundary will affect all the fluxes. This bias will then have an impact on the atmospheric concentrations that is relatively smoothed, uniform and constant. Therefore, an accurate detection of such a bias is very difficult. Any inversion . Left: influence correlation z infl profile. Centre: relative score z rel correlation profile. Right: absolute score z abs correlation profile. The red arrows depict the direction from lowest scores to best ones. The blue arrows denote the direction of grouping, from all grouped ("G", r max = 0) to all separated ("S", r max = 1). The OSSEs are indexed along Table 1 numbering. Thin (resp. thick) lines stand for correct (resp. perturbed) subtotal emissions. Green (resp. brown) lines depict correct (resp. perturbed) emission distributions. Solid (resp. dotted) lines represent correct (resp. perturbed) transport. As in Sect. 4.2, the scores are noted in percentages. relies on the unbiased assumption of the errors. The inversion will attribute the biases to the flux for wetland regions, impacting the result of the inversion. As other inversions, despite the marginalization, it appears that the results on wetland regions may be sensitive to vertical transport biases in the models (see discussion in Sect. 6.3.2).
Thus, the marginalized inversion seems to be sensitive to transport biases and to fluxes varying too quickly, as any other inversions. Nevertheless, post-processing is made possible by the explicit and objective computation of the posterior covariances and of the influence matrix. This postprocessing proves that the atmospheric inversion is not able to inquire into very fine scales in our case study. The corre-lation grouping of indifferentiable regions allows for an accurate analysis of the best possible signal detectable by the inversion. In the following, we take a correlation threshold of 0.5 as a good balance between sufficient constraints on the system and consistent posterior fluxes.

Spatial evaluation
We have chosen a threshold of correlation grouping the regions so that the averaged scores on the whole domain of interest are optimal. The scores are not uniformly distributed. In Fig. 6, the distributions of the three scores are displayed for fossil fuel regions and wetlands for OSSE 1 (transport Geosci. Model Dev., 8, 1525-1546, 2015 www.geosci-model-dev.net/8/1525/2015/ (a) Fossil fuels (b) Wetlands Figure 6. Map of the average scores as defined in Sect. 4.2 for OSSE 1 (see Table 1) projected on the aggregation grid defined in Sect. 3. Top: influence score z infl . Middle: relative score z rel . Bottom: absolute score z abs . The colour maps have been chosen so that redder regions correspond to better scores (denoted by and ⊕ symbols). The resolution and physical projection of the maps are the same as in Fig. 3. and distribution of the fluxes same as the "truth", perturbed masses by regions; see Table 1). We choose the "easiest" OSSE configuration in order to evaluate the behaviour of the marginalized inversion in the best configuration possible, thus getting the upper bound for the expectable quality of the results. Any more realistic set-up likely gives results that are not as good. In the figure, the scores are projected on the aggregation grid built on the considerations in Sect. 3. Most of the observation sites are located in the centre of the domain (see Fig. 3). Then, the influence score is on average better close to the core of the network for the wetlands. For the fossil fuel regions, the influence score is relatively high also upwind the monitoring network (dominant winds blow from west to east in the region). In addition to the network density, the inversion suffers from mis-separation of side regions and LBC. For this reason, side regions tend to be less constrained than centre ones. However, one can notice in both wetland and fossil fuel maps that some centre regions are, in general, significantly less constrained than the core of the domain. These are regions of very high and dense emissions close to the observation sites (< 500 km). The air masses coming from these regions to the observation sites are plume-shaped air masses. The inversion has troubles in assimilating single plumes. In Sect. 3, filters have been implemented in order to detect these problematic regions. The marginalized inversion effectively filtered out these regions. The absolute and relative scores also show unexpected patterns. The regions of Scandinavia and China own some of the best absolute and relative scores. These two side regions are filtered out most of the time because of strong correlations with the LBC components of the state space (confirmed by their low influence score). Consequently, when not filtered out, these regions are very well and unambiguously constrained, thus the good relative and absolute scores. For the rest of the domain, the scores are mostly the better, the closer to the observation network.

Promising computation of the uncertainties
The marginalized inversion provides an objectified quantification of the errors in the inversion system. With the Monte Carlo approach we implemented, we are able to consistently take the sources of uncertainties in the inversion process into account, especially those from the prescribed error covariance matrices. As evaluated through OSSEs, the method proved to consistently catch "true" fluxes on average in the particular Siberian set-up. Moreover, the Siberian set-up is a difficult case study for atmospheric inversions, with colocated intense fluxes that vary at temporal and spatial scales smaller than the mesoscale. The processing of hotspots, critical in most inversion configurations, is consistently managed through filters on the plume-shaped air masses. An in-depth analysis of the temporal variability of the fluxes is carried out in a sister publication with the Siberian set-up and real observations (Berchet et al., 2014). Additionally, as a comparison, we carried out the same OSSEs on the same particular Siberian set-up, but with expert-knowledge frozen error matrices (diagonal matrices with the same representation ω as for the other OSSEs). The correlation profiles and the spatial structures of the scores with the expert-knowledge matrices are not shown because the general patterns are very similar to what is described for the marginalized inversion. Though similar in patterns, the values of the scores are significantly depreciated from the marginalized inversion to the expertknowledge one. The expert-knowledge relative and absolute scores are several times bigger than the ones from the marginalized inversion, thus statistically incompatible with the "truth".
The marginalized inversion explicitly and objectively computes the posterior covariance matrix and the influence matrix. The physical interpretation of the inversion results are then enhanced by a clear analysis of the observation constraints to the fluxes. The processing of the posterior correlations makes the detection of the dipoles and of indistinguishable regions possible. The influence of the lateral boundary conditions, specific to the mesoscale and to the use of limited area CTMs, is estimated. Thus, the regions upwind the observation sites and mixed with lateral air masses can be excluded from the inversion. From the correlations, the grouping of regions gives an estimate of the typical spatial and temporal scale the method can compute. In our case, with few and distant observation sites, the groups of regions cover very large areas. Thus, a grid-point high-resolution inversion would not have given deep insights into the fluxes we are looking at. The reduced problem approach described in Sect. 3 is then relevant when computed cautiously.

Subjective choices and biases
Despite all these benefits compared with the classical Bayesian framework, our method still has limitations. The technical implementation of the method needs extensive computation power and memory requirements. For this reason, we have to drastically reduce the size of the problem to solve. The size reduction relies on rigorous considerations that are difficult to formulate analytically. Therefore, we applied heuristic principles in order to choose the aggregation patterns of the observations and the fluxes. This subjective procedure can modify the results of the inversion and must be carried out very cautiously. The way we group the regions after the marginalized inversion in order to physically interpret the results is also subjective. We choose a correlation thresh-old of 0.5 in order to counterbalance the need of useful constraints from the inversion and the requirements of consistently reproducing the "true" fluxes. Other thresholds could have been chosen and the typical distinguishable temporal and spatial scales would slightly differ from one threshold to another. But, in any chosen correlation threshold, we notice that most aggregation regions are grouped within bigger ensembles, suggesting that the chosen aggregation patterns are small enough to have reduced impact on the inversion postprocessed results.
The marginalized inversion suffers from transport biases as any other inversion. However, the maximum likelihood algorithm considers the biases as random errors and includes them into the error matrix R max . The biases are then taken into account in the marginalized inversion, though as random errors. Biases can be represented, or at least detected, with non-diagonal matrices as suggested by Berchet et al. (2013), but a non-diagonal framework would make the computation of the marginalized inversion critically complicated. Despite the implicit inclusion of the biases as random error in R max , we reduced the impact of the horizontal transport biases through filters on the plume-shaped air masses. The vertical biases are smoother and more difficult to detect. This issue must be inquired into in further works. Biases can be studied through marginalizations on the input vectors (e.g. Bocquet, 2011). Coupled marginalizations on the input vectors and on the error statistics would provide a more complete view on atmospheric inversion uncertainties.

Conclusions
At the mesoscale, inconsistencies between inversion configurations appear in the classical Bayesian framework. One of the main sources of inconsistencies is the specification of the error matrices and the non-inclusion of the tenacious uncertainties on these matrices. Synthesizing the recent literature, we developed an updated Bayesian method of inversion from the classical Bayesian framework based on a marginalization on the error matrices and on an objectified specification of the probability density function of the error matrices. This new method makes the comprehensive inclusion of the impact of ill-specified uncertainty matrices possible for the first time, to our knowledge, in atmospheric inversion. In principle, this method needs very high computation power and memory resources. To avoid technical limitations, we reduce the size of the problem by aggregating the fluxes by region, following objective principles for reducing aggregation errors. We test this method through OSSEs on methane in a domain of interest spanning over Eurasia with significant emissions of different types and different time and space scales. The OSSEs are based on synthetic observations generated from a nature run. We evaluate the consistency and robustness of the method on OSSEs with inversion configurations from the more favourable to the most disadvantageous one Geosci. Model Dev., 8, 1525-1546, 2015 www.geosci-model-dev.net/8/1525/2015/ (perturbed atmospheric transport, flat flux distribution and wrong total masses). The method produces very consistent and satisfactory results. In most cases, the tolerance intervals given by the inversion include the "true" fluxes and the results remain close to the "truth". The method also provides an explicit computation of the constraints on the regions and allows flagging out regions critically mis-separated from the lateral boundary condition. We hence have developed a robust and objectified method able to consistently catch "true" greenhouse gas emissions at the mesoscale and to explicitly group the regions that are physically un-distinguishable with the atmospheric signal only. In addition, we developed a method that explicitly produces posterior tolerance intervals on the optimal distinguishable time and space flux scales and that computes the observation network influence on the fluxes. The robustness of our method on the Siberian case with a biased transport proves that it can be generically applied to other mesoscale frameworks. The high spatial and temporal variability of the fluxes in Siberia ensures the possibility of using the system in an "easier" inversion set-up. Actual observations from the sites we used for the validation of the method are exploited in further steps of our work in order to quantify the "real" methane fluxes in the Siberian Lowlands (Berchet et al., 2014).