1 Introduction

The water stored on land is a key variable controlling numerous processes and feedback loops within the climate system (see, e.g., Dirmeyer 2000; Koster et al. 2004a, b; Seneviratne et al. 2010). It constrains plant transpiration and photosynthesis and thus is of major relevance for the Earth’s water and energy cycles and impacts the exchanges of trace gases on land, including carbon dioxide. Figure 1, from IPCC (2007), provides an overview of the main terrestrial components and exchanges within the climate system. This shows the complexity of land processes and feedbacks, to a large extent owing to the high spatial variability in soils, vegetation and topography (ranging from metres to kilometres). Processes affecting the amount of water stored on land, for example, precipitation and radiation, have spatial scales of kilometres (e.g., associated with weather fronts) and have high temporal variability (hours).

Fig. 1
figure 1

Global climate system. Figure from IPCC (2007)

The amount of water stored in the unsaturated soil zone is generally referred to as soil moisture, although the exact definition can vary depending on the context. Soil moisture is one of the key geophysical variables for understanding the Earth’s hydrological cycle. It is classed as an essential climate variable of the Global Climate Observing System (GCOS) (GCOS-107, 2006).

Soil moisture determines the partitioning of incoming water into infiltration and run-off. It directly affects plant growth and other organic processes and thus connects the water cycle to the carbon cycle. Run-off and base flow from the soil profile determine river flows and flooding, which connects hydrology with hydraulics. Soil moisture also has a significant impact on the partitioning of water and heat fluxes (latent and sensible heat), thereby connecting the hydrological (i.e. water) cycle with the energy cycle.

Soil moisture is a source of water for the atmosphere through evapotranspiration from land. Evapotranspiration is a major component of the continental water cycle, as it returns as much as 60 % of the whole land precipitation back to the atmosphere (e.g., Oki and Kanae 2006). Furthermore, evapotranspiration is also an important energy flux (Trenberth et al. 2009) and is connected to the surface skin and soil temperature, which make up other important state variables of the land surface system. Together, soil moisture, temperature and their impacts on the water, energy and carbon cycle play a major role in climate-change projections (IPCC 2007; Seneviratne et al. 2010). Snow on land is another important variable affecting the global energy and water budgets, because of its high albedo, low thermal conductivity, considerable spatial and temporal variation and medium-term capacity for water storage.

Quantifying the land state and fluxes and understanding soil moisture–temperature and soil moisture–precipitation couplings allow a better representation of hydrological processes in climate models and significantly help to reduce uncertainties in future climate scenarios, in particular regarding changes in climate variability and extreme events, and ecosystem/agricultural impacts (Seneviratne et al. 2010). This understanding is also crucially important for improving short-range numerical weather prediction (NWP) capabilities, in particularly regarding prediction of convective precipitation (Sherwood 1999; Adams et al. 2011, and references therein).

Hydrological observations are prone to errors and are discrete in space and time with the result that the information provided by these observations has gaps. Figure 2 shows an example of gaps in satellite observations. It is desirable to fill gaps in the observed information using additional information and computational techniques. Algorithms or models to fill in information gaps should organize, summarize and propagate the information from observations in an objective and consistent way. A simple approach such as linear interpolation could be a reasonably accurate “model”, when observations are dense enough. However, linear interpolation may not be consistent with our advanced understanding of how the land surface behaves. A more realistic approach would be to fill in the gaps using a land surface model (LSM). While observations give an instantaneous view of the land surface, LSMs provide continuous estimates, based on physical laws that are derived from historical observations. These models are not perfect, and gaps in their structure, parametrization or initialization can be filled in with observations.

Fig. 2
figure 2

Plot representing retrieved soil moisture data from the Soil Moisture Ocean Salinity (SMOS, Kerr et al. 2010) mission for August 3, 2012 (top left panel), August 10, 2012 (top right panel), August 17, 2012 (bottom left panel), and August 25, 2012 (bottom right panel), based on the observational geometry from ascending orbits from SMOS (units of m3 m−3). Blue denotes relatively wet values; red denotes relatively dry values. The uncoloured (i.e. grey) areas over land represent gaps between the satellite orbits. Noteworthy are the sparse SMOS observations over Scandinavia, where retrievals from remotely sensed observations are particularly difficult, when the land is covered with snow, ice, forest, water bodies or rocks

Data assimilation (Kalnay 2003) provides an intelligent method to fill in the observational gaps using a model or to steer models using observations. By intelligent, it is meant an “objective” way which makes use of quantitative concepts (e.g., mathematical) for combining imperfect information. By combining observational and model information, data assimilation can be used to test the self-consistency and error characteristics of this information (Talagrand 2010b).

In this paper we focus on off-line land data assimilation, where the LSM is uncoupled from an atmospheric model. By using an uncoupled LSM, it can be forced with more observation-based forcings, rather than often inaccurate atmospheric analyses, and less computational resources are needed. The uncoupled approach can be regarded as a first step towards the land data assimilation goal of coupling an LSM to an atmospheric model to improve predictions at weather, seasonal and climate timescales (Palmer et al. 2008).

In this paper we discuss observations (Sect. 2), models (Sect. 3) and data assimilation methods (Sect. 4) used in the studies of the hydrological cycle and provide illustrative examples, with a focus on soil moisture. We pay special attention to the conceptual problems and key challenges associated with making use of observational and model information of the land surface in data assimilation systems (Sect. 5). We finish by providing conclusions (Sect. 6).

2 Observations of the Hydrological Cycle

Observations of the hydrological cycle are commonly divided into conventional observations (e.g., in situ ground-based measurements such as screen-level relative humidity) and remotely sensed observations (e.g., satellite or aircraft microwave observations). These data sets are complementary: conventional observations have relatively high spatio-temporal resolution (order metres and minutes) but only have local coverage, so have poor representativity for a large area; satellite observations have relatively low spatio-temporal resolution but have global coverage, so have good representativity for a large area. In situ observations are typically used as ground truth for calibration and validation of remote sensing products, and model and assimilation results.

Table 1 gives an overview of satellite sensors and missions that contribute to our current understanding of the hydrological cycle or may potentially contribute to this understanding in the near future. Depending on the observed wavelengths, the orbit altitude and design details, there are large differences in horizontal, vertical and temporal resolution of each observation type. For example, satellite-based observations of soil moisture are made using passive and active microwave instruments. The horizontal resolution of these sensors ranges from 50 to 10 km; the temporal resolution is about one observation every 2–3 days, depending on the location on Earth. These instruments typically penetrate the first few millimetres to centimetres of the soil: a few millimetres for the X-band (8–12 GHz, e.g., Advanced Microwave Sounding Radiometer for EOS, AMSR-E; Njoku and Chan 2006); ~1 cm for the C-band (4–8 GHz, e.g., AMSR-E; Advanced SCATterometer, ASCAT; Bartalis et al. 2007); and ~5 cm for the L-band (1–2 GHz, e.g., Soil Moisture Ocean Salinity, SMOS; Kerr et al. 2010). An immediate conceptual problem is to estimate soil moisture of actual interest in the root zone (1 m) at a finer resolution. For this, observational information needs to be transferred from the surface layer to the root zone (e.g., Calvet et al. 1998; Sabater et al. 2007; De Lannoy et al. 2007a; Draper et al. 2012) and downscaled from the coarse scale to finer scales (Reichle et al. 2001a; Pan et al. 2009; De Lannoy et al. 2010; Sahoo et al. 2013) typically using a land surface model (discussed in Sect. 3) and/or land data assimilation (discussed in Sect. 4).

Table 1 Characteristics of hydrological observations potentially available within the next decade (see “Appendix” for details of sensor acronyms)

With the design of new sensors, one aims to gain resolution, increase the sensitivity to the variables of interest and reduce instrument errors (USGEO 2010). Examples of new missions for soil moisture observations are the SMOS and SMAP (Soil Moisture Active and Passive) missions, both using L-band sensors and designed with a target uncertainty lower than that of earlier missions, like AMSR-E and ASCAT. In November 2009, the ESA Earth Explorer mission SMOS was launched followed by another L-band mission, NASA/CONAE Aquarius (Le Vine et al. 2006), in June 2011. Aquarius measures various elements of the hydrological cycle, and its coarse resolution makes it less attractive for soil moisture estimation. The NASA mission SMAP is focused on soil moisture and freeze–thaw detection and is scheduled for launch in 2014 (Entekhabi et al. 2010a). To illustrate the importance of soil moisture information, Table 2 identifies key benefits from satellite soil moisture measurements.

Table 2 Key benefits expected from satellite soil moisture observations

A special issue on soil moisture from the SMOS mission has recently appeared in the IEEE Transactions on Geoscience and Remote Sensing (Kerr et al. 2012a). The papers in this special issue describe the SMOS mission (Mecklenburg et al. 2012); the radiometric performance (Kainulainen et al. 2012); the SMOS soil moisture retrieval algorithm (Kerr et al. 2012b; Mattar et al. 2012); the impact of radio frequency interference (RFI) on the SMOS soil moisture measurements (Castro et al. 2012; Misra and Ruf 2012; Oliva et al. 2012); soil processes in boreal regions (Rautiainen et al. 2012); disaggregation of SMOS data (Merlin et al. 2012); and various aspects of the validation of SMOS soil moisture data (Al Bitar et al. 2012; Bircher et al. 2012; dall’Amico et al. 2012; Jackson et al. 2012; Lacava et al. 2012; Mialon et al. 2012; Peischl et al. 2012b; Rowlandson et al. 2012; Sanchez et al. 2012; Schlenz et al. 2012; Schwank et al. 2012).

In view of the applications discussed later in this paper, we briefly mention that snow measurements are often provided by AMSR-E to measure snow water equivalent (SWE), and MODIS (MODerate resolution Imaging Spectroradiometer, Morisette et al. 2002) to give a picture of the snow-covered area. Finally, it is worth to mention GRACE (Gravity Recovery And Climate Experiment, Tapley et al. 2004) for its ability to measure an integrated water quantity of soil moisture and snow, as well as water in deeper layers.

Satellite instruments do not measure directly hydrological parameters. What they measure is photon counts (level 0 data). Algorithms then transform the level 0 data into radiances (level 1 data). Subsequently, using retrieval techniques (Rodgers 2000), retrievals of layer quantities (e.g., of soil moisture) or integrated amounts (e.g., total water storage) are derived (level 2 data). Fields derived from manipulation of level 2 data, for example, by interpolation to a common grid are termed level 3 data. Analyses derived from the assimilation of level 1 and/or 2 data are termed level 4 data.

Satellite observations (from level 0 and up) have associated with them a number of errors, including random and systematic errors in the measurement, and the error of representativeness (or representativity). Random errors (sometimes termed precision) have the property that averaging the data can reduce them. This is not the case of the systematic error or bias (sometimes termed accuracy). The error of representativeness is associated with the extent to which the measurement represents a point or volume in space. In land surface measurements, the error of representativeness is important to consider in comparisons of coarse-scale satellite data with point data.

Satellite-based hydrological data are becoming increasingly available, although little progress has been made in understanding their observational errors. Evaluation of the accuracy of land surface satellite data is a challenge, and novel methods to characterize their errors are being applied. Examples include triple collocation (e.g., Scipal et al. 2008; Dorigo et al. 2010; Parinussa et al. 2011); the R-metrics approach (Crow 2007; Crow and Zhan 2007; Crow et al. 2010); and data assimilation (Houser et al. 2010, and references therein).

A number of in situ network and airborne hydrological studies have been set up in the last decade for evaluation of satellite data. Examples of in situ networks include SMOSMANIA in France (Calvet et al. 2007; Albergel et al. 2009); NVE (Norges vassdrags-og energidirektorat, Norwegian Water Resources and Energy Directorate) in Norway (http://www.nve.no/en/); several large-scale (larger than 10,000 km2) networks in the USA and elsewhere (see Table 1 in Crow et al. 2012); and several local- to regional-scale (larger than 100 km2, smaller than 10,000 km2) networks in the USA and elsewhere (see Table 2 in Crow et al. 2012). In situ soil moisture data from various networks across the world are consolidated in the International Soil Moisture Network (ISMN; http://www.ipf.tuwien.ac.at/insitu). As of January 2013, the ISMN includes data from 37 networks—Table 3 provides details. An example of an airborne study on evaluation of satellite data is the Australian Airborne Cal/Val Experiments for SMOS (AACE, Peischl et al. 2012a). An example of in situ ground-based station data used to evaluate satellite data is SMOSREX (de Rosnay et al. 2006). The temporal scale of in situ platforms ranges from minutes to hours; the spatial scale of in situ platforms ranges from tens of metres (individual stations) to thousands of kilometres (regional-scale networks). Along with the availability of dense in situ data for validation, it is also important to select appropriate validation measures (Entekhabi et al. 2010b).

Table 3 Contributing networks to the International Soil Moisture Network (ISMN)

The assimilation of satellite data for land surface applications has only gained significance in the last decade; it started later than atmospheric and oceanographic data assimilation (see various chapters in Lahoz et al. 2010a). This can be attributed to: (1) a lack of dedicated land surface state (water and energy) remote sensing instruments; (2) inadequate retrieval algorithms for deriving global land surface information from remote sensing observations; and (3) a lack of mature techniques to objectively improve and constrain land surface model predictions using remote sensing data.

3 Models of the Hydrological Cycle

As discussed above, observational information has gaps in space and time. It is desirable to fill in these observational gaps using a model. Such models can range from simple linear interpolation to full land surface models (LSMs). Land surface processes are part of the global processes controlling the Earth, which are typically represented in global general circulation models (GCMs). The land component in these models is represented in (largely physically based) LSMs, which simulate the water and energy balance over land using simple algebraic equations or more complex systems of partial differential equations. The main state variables of these models include the water content and temperature of soil moisture, snow and vegetation. These variables are referred to as prognostic state variables. Changes in these state variables account for fluxes, for example, evapotranspiration and run-off, which are referred to as diagnostic state variables.

Most land surface models used in GCMs view the soil column as the fundamental hydrological unit, ignoring the role of, for example, topography on spatially variable processes (Stieglitz et al. 1997) to limit the complexity and computations for these coupled models. During the last decades, LSMs have become increasingly complex to accommodate for better understood processes, like snow and vegetation. Along with a more complex structure often comes a more complex parametrization, and several authors (Beven 1989; Duan et al. 1992) have stated that LSMs are over-parametrized given the data typically available for calibration. At larger scales, these models often rely on satellite-observed parameters, such as greenness and LAI (leaf area index). For field-scale studies, the LSMs are usually calibrated to specific circumstances to limit systematic prediction errors. Model calibration or parameter estimation relies on observed data and can be defined as a specific type of data assimilation (Nichols 2010).

Many LSMs have been developed and enhanced since the mid-1990s, with varying features, such as sub-grid variability, community-wide input, advanced physical representations and compatibility with atmospheric models (Houser et al. 2010). Some examples of widely used LSMs are the NCAR Community Land Model (CLM) (Oleson et al. 2010); the Variable Infiltration Capacity (VIC) Model (Liang et al. 1994); the Noah Model (Ek et al. 2003); the Catchment LSM (Koster et al. 2000); the TOPMODEL-based Land Atmosphere Transfer Scheme (TOPLATS) model (Famiglietti and Wood 1994); the Hydrology-Tiled European Centre for Medium-range Weather Forecasts (ECMWF) Scheme for Surface Exchange over Land (H-TESSEL) model (Balsamo et al. 2009); the SURFEX model (Le Moigne 2009); the Interaction between Soil Biosphere and Atmosphere (ISBA) model (Noilhan and Mahfouf 1996); and the Joint UK Land Environment Simulator (JULES) model (Best et al. 2011; Clark et al. 2011a). An example of an integrated system is the NASA Land Information System (LIS), which offers the capability to simulate with different models, observations and data assimilation techniques (Kumar et al. 2008).

An LSM has several elements, including a soil moisture scheme, a snow scheme, a rainfallrun-off scheme and a routing/hydraulic scheme. The soil moisture scheme can take several forms, such as explicit numerical solutions of Richards’ equations over multiple discretized layers (e.g., in CLM), or using a force-restore method (e.g., Deardorff 1977, used in SURFEX), or other more non-traditional approaches, such as a soil moisture calculation as a deviation from the equilibrium soil moisture profile between the surface and the water table (Catchment LSM). The different profile structures involve different state variables, for example, describing soil moisture at the surface (superficial volumetric water content) or describing soil moisture over the root zone (mean volumetric content of the root zone). The coupling strength between the surface and deeper soil layers is a sensitive point for successful propagation of surface observations to deeper layers (Kumar et al. 2009).

The presence of snow covering the ground and vegetation can greatly influence the energy and mass transfers between the land surface and the atmosphere. Notably, the snow layer modifies the radiative balance by increasing the albedo. Furthermore, the amount of water stored in the snowpack has an important impact on water availability in the spring time. The prognostic variables in most snow schemes include variables related to snow water equivalent (SWE), including snow depth and density, and the snow heat content. These variables most often determine the diagnostics such as snow area extent and albedo.

The snow scheme can have one layer, or several layers. In a one-layer scheme, the evolution of the snow water equivalent of the snow reservoir depends on the precipitation of snow (a source) and the snow sublimation from the snow surface (a sink). Multi-layer schemes are often designed to have intermediate complexity, having simplified physical parametrizations based on those of highly detailed internal-process snow models, while having computational requirements resembling those of single-layer schemes (Loth et al. 1993; Lynch-Stieglitz 1994; Sun et al. 1999).

A number of approaches have been implemented for rainfallrun-off schemes. Water that cannot be stored in the soil profile either runs off over land (Horton run-off, e.g., Decharme and Douville 2006) or gravitationally drains out of the profile (Mahfouf and Noilhan 1996; Boone 1999). The TOPMODEL run-off approach combines key distributed effects of channel network topology and dynamic contributing areas for run-off generation (Beven and Kirkby 1979; Silvapalan et al. 1987). This formalism takes explicit account of topographic heterogeneities (Decharme et al. 2006; Decharme and Douville 2006, 2007). Run-off and drainage exiting from hydrological models can be used as a boundary to hydraulic models that predict river flow and potential flooding (Matgen et al. 2010).

A hydraulic flood routing scheme uses numerical methods to solve simultaneously the equations of continuity and momentum for a fluid (see, e.g., Guo 2006). It is often applied to a river network, typically in a hierarchy including hillslope routing, sub-network routing and main channel routing. An example of a routing scheme is the river transport model developed for the NCAR Community Land Model (CLM) (Branstetter and Erickson III 2003, and references therein). A river transport model is also useful because it can be used to evaluate the performance of an LSM against gauge station data.

Most LSMs are soilvegetationatmosphere transfer (SVAT) models, where the vegetation is not a truly dynamic component. Recently, coupling of hydrological or SVAT models with vegetation models has received some attention, to serve more specific ecological, biochemical or agricultural purposes. Dynamic vegetation models are used to simulate the evolution of vegetation cover, photosynthesis, carbon and nutrient inventories and the fluxes of water, CO2, CH4, N2O, volatile organic carbon and fire-related emissions between the land surface and atmosphere. Illustrative examples of vegetation models are the Land biosphere Process and eXchange (LPX) model (Wania 2007; Spahni et al. 2010) and the CoupModel (Gustafsson et al. 2004; Jansson and Karlberg 2004; Jansson et al. 2005, 2008; Karlberg et al. 2006, 2007; Klemedtsson et al. 2008; Norman et al. 2008; Svensson et al. 2008).

There are a number of potential problems with LSMs that can cause errors in the forecast. These include components that cause “model error” and components that cause “predictability error”. Components that cause “model error” are as follows: incomplete description of physical processes perhaps done for computational efficiency, perhaps a reflection of incomplete knowledge; inaccurate parameters; and inaccurate forcings. Components that cause “predictability error” are inaccurate initial states and boundaries. All these problems are the subject of research in the land surface modelling and assimilation community.

4 Data Assimilation of the Hydrological Cycle

4.1 Introduction

The only practical way to observe the land surface on continental to global scales is by satellite remote sensing. However, this cannot provide information on the entire system, and measurements only represent a snapshot in time. Land surface models can predict spatial/temporal land system variations, but these predictions are often poor, due to model initialization, parameter and forcing errors and inadequate model physics and/or resolution. A way forward is to merge the observational and model information through data assimilation (Kalnay 2003).

Mathematics provides rules for combining information objectively, based on principles which aim to maximize (or minimize) a quantity (e.g., a “penalty function”) or on established statistical concepts (e.g., Bayesian methods) that relate prior information (understanding, which comes from prior combination of observations and models), with new information (e.g., an extra observation). The merged product, termed the posterior estimate or an analysis, adds value to both observational and model information. The data assimilation methodology takes account of the different nature (e.g., spatio-temporal resolution) of the observational and model information, using an observation operator (see, e.g., Talagrand 2010a).

Assimilation of land surface observations is at an earlier stage than, for example, assimilation of atmospheric observations (see various chapters in Lahoz et al. 2010a). However, during the past decade, land data assimilation has been a very active field of research. Land data assimilation considers both ground-based in situ data and satellite data. Often, satellite land surface data are assimilated and the process validated using in situ measurements. Assimilated satellite observations include retrievals of land surface temperature, soil moisture, snow water equivalent (SWE) and snow cover area (e.g., Van den Hurk et al. 2002; Andreadis and Lettenmaier 2006; Slater and Clark 2006; Bosilovich et al. 2007; Dong et al. 2007; Drusch 2007; Ni-Meister 2008; Reichle et al. 2008; Houser et al. 2010). Houser (2003) discusses the assimilation of land surface retrieved quantities and radiances. Early reviews of land data assimilation have been provided by McLaughlin (2002), Reichle (2008), Moradkhani (2008) and Houser et al. (2010).

Land data assimilation uses observations to constrain the physical parametrizations and initialization of land surface states critical for seasonal-to-interannual prediction. These constraints can be imposed in four ways: (1) by forcing the land surface primarily by observations (such as precipitation and radiation), often severe atmospheric NWP land surface forcing biases can be avoided (e.g., Saha et al. 2010; Reichle et al. 2011); (2) by employing innovative land surface data assimilation techniques, observations of land surface storages (such as snow, soil temperature and moisture) can be used to constrain unrealistic simulated storages (e.g., Houser et al. 2010; Reichle et al. 2013); (3) by tuning adjustable parameters (e.g., Pauwels et al. 2009; Vrugt et al. 2012); and (4) the land surface physical structure itself can be improved through the data assimilation process when the constant confrontation of model states against observations returns useful information about structural deficits. Integration of soil moisture information from satellite instruments, and ground-based and in situ observations of the land surface, using land data assimilation, provides a comprehensive picture of the state and variability of the land surface.

4.2 Data Assimilation Methods

Three methods are commonly used for land data assimilation (Houser et al. 2010): variational (3- and 4-dimensional, 3D-Var and 4D-Var); sequential (Kalman filter (KF) and Extended Kalman filter (EKF)); and ensemble (Ensemble Kalman filter, EnKF). Bouttier and Courtier (1999) provide details of these methods. Talagrand (2010a) and Kalnay (2010) discuss more recent developments in variational methods and ensemble methods, respectively.

In the 3-D variational (3D-Var) method, a minimization algorithm is used to find a model state, x, that minimizes the misfit between x and the background state x b, and also between the observation predictions H(x) and the observations y. The observation operator H maps the model state x to the measurement space, where y resides. In 3D-Var, we seek the minimum with respect to x of the penalty function, J, given by Eq. (1). The first term on the right hand side (J b) quantifies the misfit to the background term, and the second term (J o) is the misfit to the observations. If the observation operator is linear (written H), the penalty function, J, is quadratic and is guaranteed to have a unique minimum.

$$ J = \frac{1}{2}[x - x^{b} ]^{\rm T} {\mathbf{B}}^{ - 1} [x - x^{b} ] + \frac{1}{2}[y - H(x)]^{\rm T} {\mathbf{R}}^{ - 1} [y - H(x)] $$
(1)

4-D variational (4D-Var) assimilation is an extension of 3D-Var in which the temporal dimension is included, that is, 4D-Var is a smoother. In 4D-Var, observations are used at their correct time. 4D-Var has two new features compared to 3D-Var. First, it includes a model operator, M, that carries out the evolution forward in time. The first derivative, or differential, of M, M, is the tangent linear model (if M is linear, represented by M, its derivative is M). The transpose of the tangent linear model operator, M T, integrates the adjoint variables backward in time. The tangent linear model is only defined under the condition that the function J defined by Eq. (1) be differentiable—this is the tangent linear hypothesis. Second, J can include an extra term in which the model errors associated with the model’s temporal evolution are accounted for. In the formulation of Zupanski (1997), an analogous term involving Q −1 is included in J, where Q is the model error covariance. Examples for the land surface using variational methods include Calvet et al. (1998) and Reichle et al. (2001a). Note that variational methods are very common for parameter estimation (e.g., Dumont et al. 2012), but with replacement of the misfit to the background with a misfit to prior parameter guesses.

In the Kalman filter (KF), a recursive sequential algorithm is applied to evolve a forecast, x f, and an analysis, x a, as well as their respective error covariance matrices, P f and P a. The KF equations are (subscripts denote the time step) as follows:

$$ x_{n}^{f} = {\mathbf{M}}_{n - 1} x_{n - 1}^{a} ; $$
(2a)
$$ {\mathbf{P}}_{n}^{f} = {\mathbf{M}}_{n - 1} {\mathbf{P}}_{n - 1}^{a} {\mathbf{M}}_{n - 1}^{\rm T} + {\mathbf{Q}}_{n - 1} ; $$
(2b)
$$ x_{n}^{a} = x_{n}^{f} + {\mathbf{K}}_{n} [y_{n} - {\mathbf{H}}_{n} x_{n}^{f} ]; $$
(2c)
$$ {\mathbf{K}}_{n} = {\mathbf{P}}_{n}^{f} {\mathbf{H}}_{n}^{\rm T} [{\mathbf{R}}_{n} + {\mathbf{H}}_{n} {\mathbf{P}}_{n}^{f} {\mathbf{H}}_{n}^{\rm T} ]^{ - 1} ; $$
(2d)
$$ {\mathbf{P}}_{n}^{a} = [{\mathbf{I}} - {\mathbf{K}}_{n} {\mathbf{H}}_{n} ]{\mathbf{P}}_{n}^{f} . $$
(2e)

Equation (2a) represents the forecast of the model fields from time step n − 1 to n, while Eq. (2b) calculates the forecast error covariance from the analysis error covariance P a and the model error covariance Q. Equations (2c) and (2e) are the analysis steps, using the Kalman gain defined in Eq. (2d). Q and P a are assumed to be uncorrelated. For optimality, all errors must be uncorrelated in time.

The KF can be generalized to nonlinear H and M operators, although in this case neither the optimality of the analysis nor the equivalence with 4D-Var holds. The resulting equations are known as the Extended Kalman filter (EKF) as, for example, used for the land surface by Boulet et al. (2002), Reichle et al. (2002b), Matgen et al. (2010), Rüdiger et al. (2010) and de Rosnay et al. (2012b).

The Ensemble Kalman filter, EnKF, uses a Monte Carlo ensemble of short-range forecasts to estimate P f. The estimation becomes more accurate as the ensemble size increases. The EnKF is more general than the EKF to the extent that it does not require validity of the tangent linear hypothesis. Evensen (2003) provides a comprehensive review of the theory and numerical implementation of the EnKF. Examples for the land surface are identified in Table 4 (see below).

Table 4 Selected studies on land surface data assimilation, sorted by assimilated observation type

The Particle Filter (PF) is also an ensemble method. It does not require a specific form for the state distribution but, typically, a re-sampling algorithm needs to be applied (van Leeuwen 2009). Because PF methods typically make no assumptions of linearity in the model equations or that model and observational errors are Gaussian, they are well suited to deal with the land surface where model evolution is highly nonlinear, and model and observational errors can be non-Gaussian. The PF has been applied in hydrology to estimate model parameters and state variables (e.g., Moradkhani et al. 2005a; Weerts and El Serafy 2006; Plaza et al. 2012; Vrugt et al. 2012).

4.3 Representation of Errors

Representation of errors is fundamental to data assimilation. One needs to consider errors in observations, background information and model (see Eqs. (1, 2ae) above for identification of the error covariance matrices mentioned in the following). R, the observational error covariance matrix, is typically assumed to be diagonal, although this is not always justified. R includes errors of the measurements themselves, E, and errors of representativeness, F; R = E + F. B is the background error covariance matrix in variational methods (the analogue in the KF and ensemble methods is P f); its diagonal elements determine the relative weight of the forecasts, and its off-diagonal elements determine how information is spread spatially. Estimating B or P f is a key part of the data assimilation method (Bannister 2008a, b). Estimating model error Q is a research topic.

In the EnKF, the background (or forecast) errors are represented by the spread of the ensemble. This simplifies the computation of P f, implicitly accounts for the model error Q and avoids the calculation of Eq. (2b). For land data assimilation, the relative fraction of the observation error R and the model error Q (associated with the temporal evolution of the model) is often tuned or adaptively updated (e.g., Desroziers et al. 2005; Reichle et al. 2008).

In general, in data assimilation, errors are assumed to be Gaussian. The most fundamental justification for assuming Gaussian errors, which is entirely pragmatic, is the relative simplicity and ease of implementation of statistical linear estimation under these conditions. Because Gaussian probability distribution functions are fully determined by their mean and variance, the solution of the data assimilation problem becomes computationally practical. Note that the assumption of a Gaussian distribution is often not justified in land data assimilation applications.

Typically, there are biases between different observations, and between observations and model (see, e.g., Ménard 2010). These biases are spatially and temporally varying, and it is a major challenge to estimate and correct them. Despite this, and mainly for pragmatic reasons, in data assimilation it is often assumed that errors are unbiased. For NWP many assimilation schemes now incorporate a bias correction, and various techniques have been developed to correct observations to remove biases (e.g., Dee 2005); these methods are now being applied to land data assimilation (De Lannoy et al. 2007a, b).

4.4 Advantages and Disadvantages of Assimilation Methods

The feasibility of 4D-Var has been demonstrated in NWP systems (see, e.g., Simmons and Hollingsworth 2002). Its main advantage is that it considers observations over a time window that is generally much longer than the model time step, that is, it is a smoothing algorithm. This allows more observations to constrain the system and, considering satellite coverage, increases the geographical area influenced by the data. For nonlinear systems (as is generally the case for the land surface), this feature of 4D-Var, together with the non-diagonal nature of the adjoint operator which transfers information from observed regions to unobserved regions, reduces the weight of the background error covariance matrix in the final 4D-Var analysis compared to the KF analysis (for linear systems, the general equivalence between 4D-Var and the KF implies that the same weight is given to all data in both systems).

In contrast to the above advantages of 4D-Var, three weaknesses must be mentioned. First, its numerical cost is very high compared to approximate versions of the KF or ensemble methods. Second, its formalism cannot determine the analysis error directly; rather, it has to be computed from the inverse of the Hessian matrix (again, this procedure is prohibitive in both computation time and memory). Finally, its formalism requires the calculation of the adjoint model, which is time-consuming and may be difficult for a system such as the land surface which exhibits nonlinearities and on–off processes (e.g., presence or lack of snow).

The EKF is capable of handling some departure from Gaussian distributions of model errors and nonlinearity of the model operator. However, if the model becomes too nonlinear or the errors become highly skewed or non-Gaussian, the trajectories computed by the EKF will become inaccurate.

The EnKF is attractive as, for example, it requires no derivation of a tangent linear operator or adjoint equations and no integrations backward in time, as for 4D-Var (see Evensen 2003). The EnKF also provides a cost-effective representation of the background error covariance matrix, P f. Several issues need to be considered in developing the EnKF: (1) ensemble size; (2) ensemble collapse; (3) correlation model for P f, including localization (see, e.g., Kalnay 2010); and (4) specification of model errors.

The major drawback of the above techniques is the underlying assumption that the model states have a Gaussian distribution. The PF does not require a specific form for the state distribution, but its major drawback is that distribution of particle weights quickly becomes skewed, and a re-sampling algorithm needs to be applied.

The EnKF and PF are complementary. This complementarity makes a hybrid EnKF/PF version highly attractive for systems that can exhibit nonlinear and non-Gaussian features, an example being the land surface. For example, the EnKF could be used as an efficient sampling tool to create an ensemble of particles with optimal characteristics with respect to observations. The PF methodology could then be applied on that ensemble afterwards to resolve nonlinearity and non-Gaussianity in the system. This method is getting increased attention (see, e.g., Kotecha and Djurić 2003).

4.5 Example of a Land Data Assimilation System

For illustrative purposes, we describe the elements of the NILU SURFEX-EnKF land data assimilation system (Lahoz et al. 2010b). These elements are the following: (1) a data assimilation scheme (mainly variants of the EnKF, but also variants of the PF, and the EKF); (2) a land surface model (SURFEX model developed at Météo-France, Le Moigne 2009); (3) observations; (4) the observation operator; and (5) error characteristics for the model and the observations.

The SURFEX model used at NILU (and at Météo-France) can be run in uncoupled or coupled mode. It includes the following elements:

  • A soil and vegetation scheme: ISBA and ISBA-A-gs;

  • A water surface scheme: COARE/ECUME (Coupled Ocean–Atmosphere Response Experiment/Exchange Coefficients from Unified Multi-campaign Estimates) for the sea; FLAKE for inland water;

  • Urban and artificial areas: Town Energy Balance—TEB model;

  • A surface boundary layer (SBL) scheme;

  • Chemistry and aerosols;

  • A land use database: ECOCLIMAP.

Figure 3 illustrates how SURFEX works. During a model time step, each surface grid box receives from the atmosphere the following information: upper air temperature, specific humidity, horizontal wind components, pressure, total precipitation, long-wave radiation, short-wave direct and diffuse radiation and, possibly, concentrations of chemical species and dust. In return, SURFEX computes averaged fluxes of momentum, sensible and latent heat, and, possibly, chemical species and dust fluxes. These fluxes are then sent back to the atmosphere with the addition of radiative terms like surface temperature, surface direct and diffuse albedo, and surface emissivity.

Fig. 3
figure 3

Exchanges between the atmosphere and land surface implemented in the SURFEX LSM. See text. Based on Le Moigne (2009)

The above information transferred to the atmosphere from the land surface provides the lower boundary conditions for the radiation and turbulent schemes in an atmospheric model coupled to SURFEX or forced by SURFEX output. In SURFEX, each grid box is made up of four adjacent surfaces: one for nature, one for urban areas, one for sea or ocean and one for lake, identified by the global ECOCLIMAP land database. The SURFEX fluxes are the average of the fluxes computed over nature, town, sea/ocean or lake, weighted by their respective fraction.

The assimilation system at NILU is illustrated in Fig. 4 with reference to the EnKF. It can assimilate the following data: (1) 2-m screen-level temperature (T 2m) and 2-m screen-level relative humidity (RH2m) provided, for example, by the SYNOP/CANARI (Code d’Analyse Nécessaire à Arpege pour ses Rejets et son Initialisation; Taillefer 2002) analysis; and (2) superficial soil moisture content data from satellites (e.g., from ASCAT, AMSR-E and SMOS). The control variables (Nichols 2010) of the NILU land DA system are the following:

Fig. 4
figure 4

Schematic of the NILU SURFEX-EnKF land DA system methodology. From Lahoz et al. (2010b)

  • Surface temperature;

  • Mean surface temperature;

  • Superficial volumetric water content;

  • Mean volumetric water content of the root zone.

4.6 Data Assimilation Research Applications

Table 4 shows a selection of studies using a variety of observation types to improve the land surface state or the state in a hydraulic, vegetation or snow model coupled to it. Because of its success in highly nonlinear land surface modelling (Reichle 2008), the EnKF has gained a lot of attention. Therefore, state estimation studies using an EnKF or EnKS (Ensemble Kalman Smoother, where the time integration is done forwards and backwards) are organized separately from those that use any other assimilation technique (e.g., variational, optimal interpolation). Also shown are a few examples on parameter estimation in land surface or forward models. While this review focuses on state estimation, parameter estimation and forcing correction are of utmost importance in land surface models. Land surface models are not chaotic and thus benefit less from state estimation than atmospheric or oceanic applications. By contrast, parameters and forcings determine the major part of the land surface model uncertainty, and great advances can be expected from combining state, bias, parameter and forcing estimation (Moradkhani et al. 2005b; De Lannoy et al. 2006; Vrugt et al. 2012). Here, we discuss a number of soil moisture and snow-related studies done mainly for state updating, with particular attention to the conceptual problems they address. Examples on evapotranspiration, surface or skin temperature, LAI (leaf area index), discharge and water stage assimilation are also provided in Table 4, but not discussed in detail.

4.6.1 Single-column Applications

To explore the possibilities and limitations of assimilation schemes, numerous studies have first explored single point-scale or grid cell-scale applications. For soil moisture assimilation, conceptual problems include the propagation of information from the surface to the entire soil profile; the optimization of assimilation techniques and update frequencies; and the identification of an allowable level of uncertainty in surface observations to be useful in a data assimilation scheme, mostly in view of satellite sensor design.

Georgakakos and Baumer (1996) performed a sensitivity study to document the impact of observation noise on Kalman filter (KF) results. Calvet et al. (1998) and Wingeron et al. (1999) assimilated surface soil moisture data from a soil profile in the highly instrumented field site of the Monitoring the Usable soil Reservoir EXperiment (MUREX) in France to update root zone soil moisture using variational approaches and investigated the importance of assimilation windows and observation frequencies. Similarly, Li and Islam (1999) studied the effect of assimilation frequency while directly inserting gravimetric measurements as surrogates for remote sensing data, and Aubert et al. (2003) suggested that a 1-week soil moisture update is sufficient. Walker et al. (2001a) showed in a synthetic profile study that the KF was superior to direct insertion. In a subsequent study with real data from the Nerrigundah catchment in Australia, Walker et al. (2001b) articulated the idea that soil moisture assimilation can solve issues with errors in forcings or initial conditions, but not errors caused by problems in the physics of the soil model.

De Lannoy et al. (2007a) used an EnKF to study vertical information propagation, and the effect of assimilation depth and frequency for an extensive set of soil profiles in an USDA field in Beltsville, USA. This study highlighted the effect of bias propagation through the profile and the need for bias estimation, a conceptual problem that was addressed with a two-stage forecast and bias filter (De Lannoy et al. 2007a, b). At the same time, Sabater et al. (2007) studied the concept of propagating surface observations to deeper model layers using different types of filtering, using ground data from the Surface Monitoring of the Soil Reservoir EXperiment, SMOSREX. Camporese et al. (2009) set up synthetic soil profile assimilation experiments studying the effect of uncertainties, ensemble size, bias and other factors with an EnKF. Because of the large impact of parameters and forcings on soil moisture errors and biases, assimilation schemes have paid increasing attention to including parameter estimation along with state updating, as, for example, illustrated in Monsivais-Huerteroet et al. (2010). At present, EnKF filtering experiments are being conducted at point-scales to further identify and address conceptual problems with soil profile estimation, using surface observations (see, e.g., Han et al. 2012a).

Another important conceptual problem with soil moisture assimilation, initially addressed in a point-scale setting, is the direct assimilation of radiances or assimilation using an observation operator. This is done to avoid inconsistencies between auxiliary information that would be used in retrievals and that used in the land surface models. Entekhabi et al. (1994) estimated 1-m-deep bare soil moisture profiles using synthetic microwave brightness temperatures. This work was extended by Galantowicz et al. (1999) using eight days of L-band brightness temperature (Tb) data collected from a test plot in Beltsville, USA. Pathmathevan et al. (2003) assimilated microwave observations with a variational technique, but using a heuristic optimization, rather than an adjoint. Crosson et al. (2002) tested Tb assimilation at the point-scale with an EKF and showed that biases could not be overcome through assimilation. Crow (2003) successfully assimilated Tb for soil moisture and showed improvements at the plot-scale, using either synthetic or real field data. Crow analysed the EnKF performance in terms of the assumptions that underlie the KF. Crow and Wood (2003) also used the EnKF at two sites within the Southern Great Plains 1997 (SGP97) experimental domain and reported that Tb data assimilation was able to correct for rainfall errors. Wilker et al. (2006) highlighted the difficulty in mapping heterogeneous soil moisture into T b using a forward operator and identified the representativeness errors associated with these data. Similar to the above studies, Hoeben and Troch (2000) used a KF including a forward backscatter model to explore the direct assimilation of radar microwave signals to estimate soil moisture profiles.

Snow data assimilation has conceptual problems inherent to the cumulative and temporary nature of this variable. Slater and Clark (2006) illustrated how a square root EnKF could improve the snow state at in situ sites in Colorado during the accumulation and melt phase. They also identified the temporal correlation in snowpacks and showed how it could limit the efficiency of filtering if not accounted for properly. In a synthetic study, Liston and Hiemstra (2008) proposed a technique to update snow retroactively, which would be useful for re-analysis applications, if observations would only be available at the end of the snow season. In situ snow data assimilation is performed operationally (see Sect. 4.7 below), usually with simple assimilation techniques. An example where both the snow state and parameters were estimated using an EnKF in a 1-D setting is given by Su et al. (2011).

A number of point- or single-grid-scale studies have tried to relate brightness temperature data to snowpack characteristics (Durand et al. 2008; Andreadis et al. 2008), in preparation for Tb assimilation. Many of these studies highlight the large sensitivity of snowpack estimates to model parameters (Davenport et al. 2012), which makes both forward simulation and inversion of Tb observations for SWE estimation a difficult task.

4.6.2 Distributed Applications

The most obvious advantage of remotely sensed observations is the possibility of performing large-scale and spatially distributed assimilation. It should be recognized, however, that despite the spatial coverage of data, for computational reasons assimilation is often performed per column, that is, using a 1-D filter. When the vertical columns (of snow or soil) are horizontally connected through the model physics or assimilation statistics, this is referred to as 3-D assimilation.

The assimilation of catchment-distributed soil moisture has often focused on the improvement of the state or initial conditions (Pauwels et al. 2001, 2002) and parameters in order to improve spatially integrated fluxes, such as discharge. However, it is also possible to use soil moisture assimilation to correct rainfall estimates (Crow and Ryu 2009). At the global scale, soil moisture assimilation will become increasingly important when coupled to the atmosphere for climate and seasonal predictions.

Spatially distributed studies initially focused on assimilation of retrievals with simple techniques and gradually developed towards more complex schemes, with the inclusion of forward models (observation operators) to directly assimilate, for example, microwave observations. Initial soil moisture retrieval studies explored the performance of different filter techniques, such as Newtonian nudging, statistical correction and statistical interpolation (Houser et al. 1998; Pauwels et al. 2001; Paniconi et al. 2003; Hurkmans et al. 2006), while during the last decade, variational and KF-based assimilation largely dominated this research field because of the proven robustness and flexibility of these latter techniques (Reichle et al. 2002a, b).

A typical conceptual problem with spatially distributed assimilation is the use of coarse-scale remotely sensed data to infer fine-scale information. There are many static disaggregation techniques that use auxiliary information to perform such a downscaling outside the assimilation scheme. Performing dynamic disaggregation within the assimilation scheme remains a research challenge. The latter concept consists of a 3-D filter with inclusion of spatially correlated (fine-scale) state and (coarse-scale) observation prediction errors and has been addressed in EnKF frameworks by Reichle et al. (2001b, 2013), Reichle and Koster (2003), Pan et al. (2009), De Lannoy et al. (2010) and Sahoo et al. (2013).

An important issue connected to 3-D filtering for disaggregation is the use of local observations to update neighbouring locations, for example, to propagate from observed swaths to unobserved locations. Often, this problem is solved with spatial interpolation or by relying on horizontal connections in the model equations (Walker et al. 2002). Alternatively, such horizontal information propagation can be done within an assimilation scheme that provides accurate error correlations between observed and non-observed observations and forecasts (Reichle and Koster 2003; De Lannoy et al. 2012). De Lannoy et al. (2009) used an adaptive KF to identify such spatial correlations, along with the magnitude of the forecast error, to optimize filter performance. Han et al. (2012b) studied the effect of spatial correlations in an OSSE (observing system simulation experiment) with a local ensemble transform Kalman filter. Filter technical issues such as update frequency (Walker and Houser 2004) and error estimation have also been addressed in a spatial context. Reichle and Koster (2005) demonstrated the validity of the concept that assimilation results should be better than either the model or observations alone. After re-scaling satellite observations from AMSR-E and SMMR to take bias out of the system, Reichle et al. (2007) showed that satellite observations can contribute valuable information, even if they are not accurate. Reichle et al. (2009) further assessed the quality of assimilation products as a function of retrieval and land surface model uncertainty in an OSSE and showed that soil moisture retrievals can have slightly less skill than the land surface model and still contribute to an overall higher skill in the assimilation product. This was confirmed in a real data assimilation study by Draper et al. (2012).

The importance of correctly specifying random errors and biases is a major conceptual challenge in the optimization of distributed assimilation systems. Bias mitigation has become a regular part of most soil moisture data assimilation systems (Reichle and Koster 2004; Drusch et al. 2005; Kumar et al. 2012; Sahoo et al. 2013), and random error specifications for soil moisture data assimilation have been studied through adaptive filtering (Crow and van Loon 2006; Reichle et al. 2008).

Another idea with potential benefit is multi-sensor assimilation for soil moisture estimation. As an example, Draper et al. (2012) showed how both active (ASCAT) and passive (AMSR-E) microwave retrievals can contribute to a similar improvement in assimilation results. Combining improved precipitation data with soil moisture retrieval assimilation (Liu et al. 2011) and combining discharge (Pauwels and De Lannoy 2006), temperature or LAI with soil moisture assimilation are other avenues that have been exploited for hydrological assimilation.

As already indicated for single-column applications, a major conceptual problem is the direct assimilation of brightness temperatures (T b) or backscatter observations from satellite missions for soil moisture estimation. Reichle et al. (2001a, b) presented pioneering studies with a 3-D variational scheme to assimilate and disaggregate synthetic or real brightness temperatures over the SGP97 study area, while Margulis et al. (2002) used an EnKF and Dunne and Entekhabi (2006) compared an EnKF with an EnKS for the same T b assimilation problem. Walker et al. (2002) also assimilated T b directly, but from SMMR and using an EKF over Australia. Using a variational scheme, and with inclusion of both a land surface temperature and microwave brightness temperature observation operator, Barrett and Renzullo (2009) showed that both thermal (AVHRR) and microwave (AMSR-E) satellite observations can provide effective observational constraints on the modelled profile and on surface soil moisture. There are only a few studies on spatially distributed backscatter assimilation, but in a recent OSSE using an EnKF, Flores et al. (2012) showed the potential of the L-band radar information expected from the future SMAP mission.

For snow, spatially distributed assimilation studies include snow cover area (or snow cover fraction) and snow water equivalent (SWE) assimilation. A correct specification of the snow-covered area is important to represent feedbacks from the land to the atmosphere, while a good estimate of the actual amount of snow in the snowpack is of crucial importance for flood, drought and discharge predictions (He et al. 2012). Snow cover observations are typically fine-scale visible/near infrared observations that are only available in cloud-free areas, while SWE measurements are typically more inaccurate retrievals from T b observations at a coarse scale (see Table 1). It can be expected that multi-sensor assimilation could help to further snow estimation (De Lannoy et al. 2012).

Because of its binary nature, snow cover in terms of the presence or absence of snow cannot be assimilated with filters that rely on continuous variables. Instead, rule-based algorithms have been proposed (Rodell and Houser 2004; Zaitchik and Rodell 2009; Roy et al. 2010). However, the snow cover fraction (SCF) is a more continuous variable that has been assimilated with KF-based algorithms (Clark et al. 2006; Su et al. 2008; De Lannoy et al. 2012). When assimilating SCF with a Kalman filter, there is a need to relate SCF to the actual SWE state variable through an observation operator, often defined as a snow depletion curve. It is also possible to use visible/near infrared snow albedo observations to update snow parameters such as grain size (Dumont et al. 2012).

The two dominant conceptual problems with satellite-based SWE assimilation are the coarse-scale nature and high uncertainty of the measurements. Initial attempts to assimilate SMMR or AMSR-E SWE retrievals only yielded marginal success (Andreadis and Lettenmaier 2006; Dong et al. 2007), because of retrieval errors due to signal saturation, presence of liquid water in the snowpack and multiple other factors. To address the coarse-scale issue, De Lannoy et al. (2010) proposed several 3-D filter options to disaggregate SWE data and propagate data from observed swaths to unobserved regions. These techniques showed great benefit in a synthetic data study. When using real AMSR-E retrievals (De Lannoy et al. 2012), and with bias mitigation through re-scaling added to the system, the assimilation analyses were affected by a lack of a realistic interannual signal in the retrievals.

To address the problems with SWE retrieval accuracy, the potential of direct radiance assimilation has been investigated (Durand and Margulis 2006; Andreadis et al. 2008; Durand et al. 2009; DeChant and Moradkhani 2010). However, these efforts rely on a good description of the snowpack in the land surface model, which is not always available for large-scale applications. To address this, Forman et al. (2013) developed an artificial neural network as a computationally attractive forward model in readiness for large-scale radiance assimilation. In preparation for the future SMAP mission, freeze–thaw assimilation (Bateni et al. 2013) has been investigated, because of its importance in understanding the carbon cycle.

The above studies update either snow or soil moisture separately. A major challenge for land data assimilation is making use of total water storage (TWS) observations from GRACE, which include soil moisture, snow and other water components at a very coarse scale (Table 1). Total water storage can be decomposed into soil and snow components and disaggregated to finer scales (Zaitchik et al. 2008; Su et al. 2010; Forman et al. 2012; Li et al. 2012; Reichle et al. 2013).

4.7 Towards Operational Land Data Assimilation

Land surface processes and their initialization are of crucial importance to address the challenge of seamless prediction from weather to seasonal and climate timescales (Palmer et al. 2008). It is well established that high skill in short- and medium-range forecasts of temperature and humidity over land requires proper initialization of soil moisture (Beljaars et al. 1996; Douville et al. 2000; Mahfouf et al. 2000; Drusch and Viterbo 2007; van den Hurk et al. 2008). A similar impact from soil moisture has been established for seasonal forecasts (Koster et al. 2004a, b, 2011; Weisheimer et al. 2011). Initialization of snow conditions also has a significant impact on forecast accuracy at weather timescales (Brasnett 1999; Drusch et al. 2004). Operational land data assimilation has initially focused on ingesting precipitation observations (e.g., Saha et al. 2010; Reichle et al. 2011), but improved snow and soil moisture state updates are now emerging, as documented, for example, for the ECMWF Integrated Forecasting System by de Rosnay et al. (2012a).

An unprecedented operational land data assimilation product will be provided by the Global Modeling and Assimilation Office (NASA GMAO) in the form of a level 4 satellite-based soil moisture product (Reichle et al. 2012; De Lannoy et al. 2013). The assimilation of SMAP brightness temperatures into the Goddard Earth Observing System land surface model will yield a global root zone soil moisture product.

5 Conceptual Problems and Key Challenges

To summarize, the conceptual problems in our understanding of the hydrological cycle over land can be grouped by observing, modelling and data assimilation systems. These are outlined below.

5.1 Assimilated Observations

  • To be useful for model development and assimilation, the dominant modes (in space and time) of the land system must be sampled;

  • To be efficient for state updating, observations need to be available at a reasonable time interval to capture short-term dynamical variations (cf. the importance of satellite overpass frequency; Walker and Houser 2004; Pan and Wood 2010);

  • Observations must be collected in long enough historical records to identify long-term, climatological, statistics for bias mitigation (Reichle and Koster 2004) or trend identification;

  • Observations need to be sampled at different spatial scales to capture both local and global processes;

  • There is a need to have a reasonable signal-to-noise ratio (e.g., SMAP’s target of brightness temperature uncertainty is 1.3 K; Entekhabi et al. 2010a), and an uncertainty in the error description appropriate for scientific studies;

  • There is a need to relate observations to key system state variables, that is, there needs to be system observability.

5.2 Forward and Retrieval Models, with Particular Reference to Radiances and Backscatter Processes

  • To achieve appropriate retrieval accuracy, there is a need to use advanced methods to describe physical processes in radiative transfer models (RTMs);

  • When assimilating radiances at large scales (e.g., from microwave sensors), there is a need for calibration of RTMs (De Lannoy et al. 2013; Forman et al. 2013).

5.3 Land Surface Models

  • There is a need to use advanced methods to describe physical processes (this limits structural uncertainty) and couple land surface models with models describing more specialized processes such as run-off routing, dynamic vegetation or snow (Pauwels et al. 2006);

  • There is a need for consistent global parameter datasets to limit predictive uncertainty due to parameter uncertainty;

  • There is a need for high-quality forcing data (this limits input uncertainty), mainly for precipitation (Maggioni et al. 2011; Reichle et al. 2011).

5.4 Data Assimilation Challenges

  • There is a need to fill in the spatial and temporal gaps in observations (Reichle and Koster 2003; De Lannoy et al. 2012);

  • There is a need to disaggregate data in space and time and into their individual components (Forman et al. 2012; Reichle et al. 2013);

  • There is a need to ingest directly radiances or backscatter information (as opposed to retrievals) to avoid inconsistencies between auxiliary information in retrievals and land surface models (Crow and Wood 2003; Durand et al. 2009; Flores et al. 2012);

  • There is a need to exploit the simultaneous use of multiple sensors (Pan et al. 2008; Draper et al. 2012) and explore the capabilities of new sensors (Andreadis et al. 2007; Durand et al. 2008);

  • There is a need to combine state and input (forcing) information with parameter updates (Moradkhani et al. 2005b; Liu et al. 2011; Vrugt et al. 2012);

  • There is a need to explore advanced filtering techniques, for example, the use of the particle filter to account for non-Gaussian errors (Plaza et al. 2012);

  • There is a need to improve the representation of observation and forecast errors, and to specify biases in observational and model information (De Lannoy et al. 2007b; Crow and Reichle 2008; Reichle et al. 2008; De Lannoy et al. 2009; Crow and van den Berg 2010);

  • There is a need to preserve water balance in the land system (Pan and Wood 2006; Yilmaz et al. 2011) and draw lessons from the information in the assimilation increments;

  • There is a need to have access to adequate computational resources.

5.5 Validation

  • Needs ground observations with substantial spatial and temporal coverage;

  • Needs tools to address scaling and representativeness errors (Crow et al. 2012);

  • Needs appropriate and effective validation metrics (Entekhabi et al. 2010b).

6 Conclusions

To understand the hydrological cycle over land, we need to make observations and develop models that encapsulate our understanding. These models have a basis on the information gathered from observations, as well as on previous experience, and are used to project our understanding into the future by making predictions. A crucial element in this procedure is confronting models with observations. Data assimilation, which combines observational and model information, provides an objective method to confront models against observations and add value to both the model and the observations. Data assimilation adds value to observations by filling the gaps between them and adds value to models by constraining them with observations. In this paper, we touch on the main conceptual problems that limit a full integration of land surface models and observations by reviewing progress in land surface data assimilation research over the last decade.

Collectively, the advent of new satellite missions, the increasing attention to forecast uncertainty due to errors in the land surface model structure, parameters and input, and the development of advanced assimilation techniques will eventually close the largest gaps in our understanding of the hydrological cycle over land.