Gap-filling continuously-measured soil respiration data: A highlight of time- series-based methods

Soil respiration is an important ecosystem process that releases carbon dioxide into the atmosphere. While soil respiration can be measured continuously at high temporal resolutions, gaps in the dataset are inevitable, leading to uncertainties in carbon budget estimations. Therefore, robust methods used to fill the gaps are needed. The process-based non-linear least squares (NLS) regression is the most widely used gap-filling method, which utilizes the established relationship between the soil respiration and temperature. In addition to NLS, we also implemented three other methods based on: 1) artificial neural networks (ANN), driven by temperature and moisture measurements, 2) singular spectrum analysis (SSA), relying only on the time series itself, and 3) the expectation-maximization (EM) approach, referencing to parallel flux measurements in the spatial vicinity. Six soil respiration datasets (2017–2019) from two boreal forests were used for benchmarking. Artificial gaps were randomly introduced into the datasets and then filled using the four methods. The time-series-based methods, SSA and EM, showed higher accuracies than NLS and ANN in small gaps (<1 day). In larger gaps (15 days), the performance was similar among NLS, SSA and EM; however, ANN showed large errors in gaps that coincided with precipitation events. Compared to the observations, gap-filled data by SSA showed similar degree of variances and those filled by EM were associated with similar first-order autocorrelation coefficients. In contrast, data filled by both NLS and ANN exhibited lower variance and higher autocorrelation than the observations. For estimations of the annual soil respiration budget, NLS, SSA and EM resulted in errors between −3.7% and 5.8% given the budgets ranged from 463 to 1152 g C m−2 year−1, while ANN exhibited larger errors from −11.3 to 16.0%. Our study highlights the two time-series-based methods which showed great potential in gap-filling carbon flux data, especially when environmental variables are unavailable.


Introduction
The quantification of carbon (C) exchanges between natural ecosystems and the atmosphere, in the form of carbon dioxide (CO 2 ), has attracted great attention over the last few decades in the context of climate change (IPCC, 2013). With the development of automatic and continuous measurements at high temporal resolutions (e.g., eddy covariance), it is now possible to estimate the C budget of the ecosystems to a certain degree of accuracy and evaluate their contributions to climate change (e.g., Lu et al., 2017;Reichstein et al., 2007).
Among the key components of the C cycle in an ecosystem, soil respiration is the process that releases the CO 2 produced by roots, soil organisms and chemical oxidation into the atmosphere (Lloyd and Taylor, 1994). Soil respiration is usually measured using either static or dynamic chambers (Bekku et al., 1997;Jensen et al., 1996;Rochette et al., 1992). Automated soil respiration measurements have been performed in many studies at a temporal interval of 10-60 min (e.g., Suh et al., 2006;Wu et al., 2016). Among the approaches, the forced diffusion (FD) technique was developed based on the theory of dynamic chambers but has the advantage of low power consumption (see details in Risk et al., 2011). At the same time, FD chambers require no mechanical movement between or during measurements which allows them to operate even under snow cover (Lavoie et al., 2015). The FD chambers have been demonstrated to produce comparable results to conventional chambers under the same conditions (Risk et al., 2011). The FD technique, by producing data continuously throughout the year, provides a firm ground for estimating budgets of soil respiration in an ecosystem.
As with all automatic field measurements, missing data, due to e.g., power failure, instrumentation malfunction, disturbance during maintenance, data corruption, etc. are almost inevitable. Depending on the size and number of gaps, large uncertainties may be introduced to the https://doi.org/10.1016/j.agrformet.2020.107912 Received 10 October 2019; Received in revised form 18 December 2019; Accepted 23 January 2020 estimated C budget (Falge et al., 2001). Therefore, robust and suitable gap-filling methods are critical to ensure the accuracy of C budget estimations.
The gap-filling of carbon flux data usually takes auxiliary variables measured within or near the flux footprint to generate models for predicting missing values in the gaps. For soil respiration, its exponential relationship with soil temperature has long been acknowledged (Lloyd and Taylor, 1994). This relationship is thus most widely used as the basis of the non-linear least squares (NLS) method for soil respiration gap-filling (Gomez-Casanovas et al., 2013). Apart from soil temperature, soil moisture is another driver of soil respiration (Cook and Orchard, 2008). However, the relationship between soil moisture and soil respiration is more complicated because other factors are heavily entangled in it (e.g., time of soil rewetting) (Cook and Orchard, 2008). Many types of models have been used to describe the relationship, such as linear functions (Cook and Orchard, 2008), quadratic polynomials (Hursh et al., 2017), or the moisture effect was even included in the temperature-based model as a limitation term (Reichstein et al., 2003). Thus, it is challenging to choose the proper mathematical form to incorporate soil moisture in the models for soil respiration gap-filling. As a data-driven alternative, artificial neural networks (ANNs) use machine learning algorithms to simulate the output variable (soil respiration) based on the inputs (e.g., soil temperature and moisture) without explicitly specifying the mathematical relationships (Zhang et al., 1998). ANNs have been successfully implemented in gap-filling the ecosystem CO 2 exchange data from eddy covariance observations with multiple environmental variables (including soil moisture) (Moffat et al., 2007); however, their performance in soil respiration data gap-filling has yet to be evaluated.
Despite low temperatures in the winter, the soil respiration usually remains active under the snow cover (Brooks et al., 2011). Due to the long snow covering period at high latitudes, winter soil respiration can represent a substantial proportion of the annual ecosystem CO 2 budget (Alm et al., 1999;Aurela et al., 2002;Groffman et al., 2006;Zhao et al., 2016). However, since soil temperature and moisture exhibit very limited variations under the snow cover, their control over soil respiration is usually not significant in the winter (Alm et al., 1999;Liptzin et al., 2009;Merbold et al., 2012). As a result, gap-filling methods driven by environmental variables might perform poorly when being applied to winter time fluxes. This has not been investigated before.
Alternatively, methods using time series analysis can predict the missing data based on the measured soil respiration time series itself, independent of auxiliary variables. For example, singular spectrum analysis (SSA) (Ghil et al., 2002) implements nonparametric spectral algorithms to analyze time series for forecasting and has been applied in diverse fields such as hydrometeorology, remote sensing and space sciences for data analysis and gap-filling purposes (Golyandina and Osipov, 2007;Kondrashov et al., 2010;Mahecha et al., 2011;Schoellhamer, 2001;Shun and Duffy, 1999;Wang and Liang, 2008). Compared to other time series analysis methods, SSA is fully dataadaptive and has no prior knowledge requirements on the periodicities of the data, making it rather flexible (Golyandina et al., 2018). While the performances of all gap-filling methods are to some degree influenced by the gap size (Kunwor et al., 2017;Moffat et al., 2007), the accuracy of the SSA method is expected to be especially sensitive to gap size increase, due to the self-referencing nature of this method. In contrast, methods that uses the expectation-maximization (EM) algorithm can take other soil respiration time series measured in parallel as reference and impute the missing values based on the relationships between the multiple time series (Junger and Ponce de Leon, 2015). This method can be effective when multiple chambers are deployed at the same site as replicates. Overall, these time-series-analysis-based methods have never been applied to C flux gap-filling but may have great potential for future applications, especially in cases when the environmental variables are not measured or missing.
All the gap-filling methods require a sampling window around the gap. Using the available data within the window, the corresponding models are trained to predict the data in the gap. The sizes of this sampling window are usually arbitrarily chosen, ranging from one halfmonth to a full year (e.g., Kunwor et al., 2017;Moffat et al., 2007;Zhao and Huang, 2015). Theoretically, a large sampling window could contain more information on the dynamics and patterns of the time series investigated but may produce an overgeneralized model that hardly represents the local gaps. At the same time, a small sampling window that captures the local conditions close to the gap may nonetheless cause large uncertainties in the model itself due to less training data. Thus, an optimal sampling window should have the smallest size possible that minimizes the prediction errors for the missing data. Whether the optimal window size varies among different methods, seasons and gap sizes needs to be further studied.
In this study, we implemented the four gap-filling methods described above, namely NLS, ANN, SSA, and EM, on artificially created data gaps in six soil respiration datasets from two boreal forest sites in Norway. The aims of the study are 1) to determine the best sampling window size for each gap-filling method, 2) to evaluate the performance of the gap-filling methods in gaps of different sizes (1-, 6-h, 1and 15-day) and seasons (winter vs. non-winter), and 3) to test the suitabilities of the four methods in estimating annual soil respiration budgets under mixed-gap scenarios. Through these analyses, we accessed the strengths and weaknesses of these methods in soil respiration data gap-filling and enrich the toolbox for future C flux data gap-fillings.

Datasets
The soil respiration datasets used in this study were collected at two forest sites in Hurdal (60°22′21″ N, 11°04′41″ E, 284 m a.s.l.) and Løten (60°51′33″ N, 11°26′08″ E, 262 m a.s.l.), southern Norway. Both sites are dominated by Norway spruce (Picea abies (L) Karst.). At Hurdal, the  The plot number is included in the models as random effects. SS: type III sums of squares with Satterthwaite approximation for the degrees of freedom. RMSE: root mean squared error.
terrain is slightly sloping, and the soils are podzolic in the upper part and rather hydrogenic (partially waterlogged) in the lower part. The site at Løten is rather flat and contains a bog surrounded by mixed forests. The soils are loamy with a high water-holding capacity. Both sites are typically covered with snow for 4-5 months per year, with a long-term  annual mean air temperature of 3.9°C for Hurdal and 3.1°C for Løten, respectively according to the seNorge2018 database (Lussana et al., 2019). The soil respiration measurements were carried out automatically every 10-min using forced diffusion (FD) chambers (eosFDCO2, Eosense Inc., Dartmouth, NS, Canada). Specifically, each chamber is equipped with membranes allowing gases to freely diffuse into it. With different compartments, it separately measures the CO 2 concentrations that represent the ambient air (C ambient ) and the air that mixes the ambient air with gases from the soil (C mix ). The soil respiration rate (R s ) is then calculated as (Risk et al., 2011): where G is an empirical positive constant (a function of the gas diffusion rate through the membrane and measured soil surface area; unit: m s −1 ) provided by the manufacturer. More details of the method are outlined in Risk et al. (2011). The C mix is larger than the C ambient under typical circumstances, implying a loss of CO 2 from the soil (i.e., a positive flux). Since the FD chamber measures the fluxes with the same CO 2 source from the soil (i.e., soil respiration) as static chambers and produces comparable soil respiration data (Risk et al., 2011), we suggest that the methods and results in this study can also be applied to the soil respiration measured by the widely used static chambers. Along with the soil respiration, corresponding soil temperature and moisture (TMS-4, TOMST inc., Prague, Czech Republic) at 10, 20 and 50 cm depths were measured within <1 m from each chamber at the same temporal resolution (i.e., 10-min). In addition, the ambient temperatures were measured at~5 cm above the soil surface, which represented either the temperature of the snowpack during winter time or air temperature when snow was absent. Data from a total of 6 plots (4 from Hurdal and 2 from Løten) were used for analysis. The datasets overall covered the period between July 2017 and April 2019 but varied among the plots (see details in Table 1). The soil respiration data were quality-checked before being analyzed and were rejected (marked as missing values): 1) during instrument malfunction or power failure, and 2) with implausible values (soil respiration rate < −2 or >10 µmol CO 2 m −2 s −1 ). In total, 11% of the data were discarded (see Table 1 for the summary for each plot).

Artificial gaps
In order to compare the gap-filled data to the original observations, we created artificial gaps (flagged) in the soil respiration time series in all the 6 datasets. Specifically, we used each dataset to create four separate benchmark datasets: one with 80 one-hour gaps (6 observations per gap), a second one with 40 six-hour gaps (36 observations per gap), a third one with 30 one-day gaps (144 observations per gap), and the last one with 10 fifteen-day gaps (2160 observations per gap). The gaps were added at random locations of the time series while avoiding overlapping with each other as well as with the original gaps in the dataset. These benchmark datasets were used to test the performance of the gap-filling methods in gaps of different sizes.
In addition, to evaluate the overall influences of different gap-filling methods on the estimations of annual soil respiration budget, we selected the datasets that were longer than 1 year (i.e., H1, H3 and H4 in Hurdal and L2 in Løten, Table 1) and extracted the data of a one-year period each from April 30, 2018 to April 29, 2019. Based on each of the extracted datasets, we created a random number of gaps with random lengths (uniformly distributed) up to 15 days (mixed gaps) that resulted in benchmark datasets with 5 levels of the gap proportion (10, 20, 30, 40 and 50%). For each of the 5 levels, 10 replicate datasets were created with different random gap arrangements (i.e., 200 benchmark datasets in total for the 4 plots, see Fig. A1 in the Appendices for an example of gap summary in the dataset). These mixed gap benchmark datasets were thereafter used for the estimation of the annual soil respiration sum after the artificial gaps have been filled. (1 h, a-d), 6-h. (6 h, e-h), 1-day (1d, i-l) and 15-day (15d, m-p) by NLS (a, e, i, m), ANN (b, f, j, n), SSA (c, g, k, o) and EM (d, h, l, p) as a function of the corresponding observations. Non-winter and winter seasons are plotted in red and blue, respectively. Total least squares regression was used to accommodate the uncertainty of both dependent and independent variables. The black solid and dashed lines indicate the regressions for non-winter and winter seasons, respectively. The gray dashed lines indicate the 1:1 ratio. The coefficients of the regressions are presented in Table 3. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3
Coefficients of the linear regressions between the observations and the gap-filled data that were predicted by the four methods for different gap sizes and seasons. Standard errors of the coefficients are shown in the parentheses.

Sampling window length
Each of the gap-filling methods has its own algorithm to build models trained by available data around each gap, and the models were then used to predict the values in the gap. Since the data sampled around the gap for training the models could have a significant influence on the accuracy of the gap-filled values, choosing a sampling window length that includes proper training data is important. In this study, we compared the performance of the gap-filling methods using sampling windows of 5, 10, 20, 30, 40, 50 and 60 days (i.e., 720, 1440, 2880, 4320, 5760, 7200, 8640 observations, respectively), arranged symmetrically around the gap. A sampling window of 10 days, for example, included data of 5 days before and after the gap, regardless of the gap length. The shortest window length that achieved best accuracy was chosen for each method. This sampling window is more flexible than the traditional fixed window (Moffat et al., 2007), which also includes the gap itself and may suffer from insufficient training data if the gap size approaches the sampling window size.

Gap-filling methods
In this study, we used four methods to fill the artificially created gaps.
Firstly, the artificial gaps were filled with the predictions from the nonlinear least squares (NLS) regression based on the relationship between soil temperature and soil respiration. Since NLS is the most commonly used method for gap-filling (Gomez-Casanovas et al., 2013;Kunwor et al., 2017;Moffat et al., 2007), we took it as a reference to compare other methods with. The Lloyd-Taylor function was used for making the models (Lloyd and Taylor, 1994): where R 10 and E 0 are parameters to be estimated, T ref and T 0 take the values of 283.15 and 227.13 K, respectively, following Lloyd and Taylor (1994), and T s is the soil temperature measured at 10 cm depth (K). The NLS regression was performed by applying the Levenberg-Marquardt algorithm using the "minpack.lm" package (Elzhov et al., 2016) in the program R 3.5.2 (R Development Core Team, 2018). In cases when the variance of soil temperature in the winter was too low for the NLS function to fit with, the mean value of the soil respiration rate within the sampling window was taken to fill the gap (71% of the gaps in the winter). Secondly, to include environmental variables other than the soil temperature, artificial neural network (ANN) models similar to Moffat et al. (2010) were implemented to fill the gaps. The inputs of the ANN models were soil temperature, soil moisture and ambient temperature (i.e., either snow temperature in the winter or air temperature in non-winter periods), and the output was soil respiration rate. The input variables were normalized to be centered around 0 with a standard deviation of 1 before used to train the models. We included one hidden layer with two neurons in the models and a back-propagation algorithm was performed using the R package "neuralnet" (Fritsch et al., 2019).
Thirdly, being independent of the variations of other environmental variables, the singular spectrum analysis (SSA) method analyzes the structure of the soil respiration time series and predicts the missing values in the gap. In this study, we used the iterative approach suggested by Kondrashov and Ghil (2006) to determine the major periodic component of the time series in the sampling windows and projected this periodic component into the gaps. In addition to the major periodic (i.e., low frequency) component, we also extracted the high frequency component (i.e., the eigentriples with periodicity band < 10, unit: 10min) of the time series in the sampling window and added its average diurnal variations to the projected low frequency values in the gap at the corresponding time of the day. This approach takes both low and high frequency components derived from the time series into account for the gap-filled data. The SSA was conducted using the R packages "spectral.methods" (Buttlar, 2015) and "Rssa" (Golyandina et al., 2015).
Lastly, we implemented the method developed by Junger and Ponce de Leon (2015) based on the expectation-maximization (EM) algorithm (Dempster et al., 1977). This method takes multiple other parallelly measured soil respiration time series as references for the target gapfilling series and makes predictions accounting for both the correlation among the series and temporal structure of individual series. In this study, all the parallel soil respiration time series (no artificial gaps) that reached a certain degree of correlation with the target series (i.e., r > 0.75; see Table A1 for the correlations among the plots) from the same site (Hurdal or Løten) were used as references. We used the "spline" method to fit the time series at each iteration with a degree of freedom of 10 (chosen by cross validation). Due to the unequal data temporal range among the plots, the gaps located out of the temporal range of the references were not filled and excluded in further analyses (<5% in total). Similarly, EM was not applied to gap-fill the mixed gap benchmark datasets from the plot L2 in Løten for determining the annual budget, due to the lack of a full year data at plot L1 as the reference. The EM gap-filling was performed using the R package "mtsdi" (Junger and Ponce de Leon, 2018).

Statistical analysis
The performance of the gap-filling methods was investigated separately in the winter and other periods of the year. Since the snow was not monitored at the sites, we defined the winter as the period with soil temperature <1.5°C and the remaining period as non-winter. This criterion well separated each year into two seasons and characterized the winter by low soil temperature with a narrow variation range (mostly 0-1°C).
To select the optimal sampling window length for each gap-filling method, we calculated the root-mean-squared error (RMSE) between gap-filled values and the original observations for each gap. RMSE has been widely used as an indication of the model accuracy and it was The plot number is included in the models as random effects. SS: type III sums of squares with Satterthwaite approximation for degrees of freedom. RMSE: root mean squared error. AR (1): first order autocorrelation coefficient. The sampling window length of 5 days was used.
calculated as: where n is the number of data points in the gap, and P i and O i are the predicted (gap-filled) and observed soil respiration rates (µmol CO 2 m −2 s −1 ), respectively. A linear mixed-effects model was used to investigate the effects of sampling window size, gap-filling method, season, gap size and the interactions between window size and each of the other three factors on RMSE. Plot ID was included as random effect in the model. Normality and homoscedasticity of the model were evaluated visually by plotting residuals. The post-hoc Tukey HSD test was used to further investigate the differences among the groups of the significant effects. The smallest window size associated with the lowest RMSE was chosen for each method for further analyses.
With the gaps filled using the optimal sampling windows, we further calculated relative error, variance and first order autocorrelation coefficient (AR (1)), in addition to RMSE (Eq. (3)), for each gap to evaluate the performance of the gap-filling methods. They are given as follows: where PO i is either the predicted (gap-filled) or observed soil respiration rate (µmol CO 2 m −2 s −1 ) and PO is the mean of the rates in each gap. AR (1) is the Pearson Correlation Coefficient (r) between the time series in the gap (OP t ) and the series itself with a lag of one-time step (i.e., 10-min) (PO t+1 ). Variance and AR (1) were calculated separately for the gap-filled and observed values. The gaps that were filled with the mean value of the sampling window in the winter were excluded in the AR (1) calculations for NLS. Linear mixed-effects models were established to determine the effects of the method, season, gap size and their two-way interactions on RMSE, relative error, variance and AR (1) with the random effect of plot ID. The relationships between the observations and the gap-filled values were fitted by linear regressions using the total-least-squares approach to account for variances of both the dependent and independent variables. We assumed that the residue variance of the gap-filled values was the same as the observations (i.e., the variance ratio was set to 1). The regression slopes were used to indicate systematic bias of each method in different gap sizes and seasons. The standard errors of the regression coefficients were estimated by bootstraps (1000 random replicates).
In addition, we also calculated the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe, 1970) and Kling-Gupta Efficiency (KGE) (Gupta et al., 2009;Kling et al., 2012) in the gaps to evaluate the overall performance of the four methods in different gap sizes and seasons, as follows: where r is the Pearson correlation coefficient, cv is the ratio of the coefficient of variation between the predicted and the observed values, and β is the ratio of the means between the predicted and the observed values. The calculations were conducted using the R package "hy-droGOF" (Zambrano-Bigiarini, 2017).
To investigate the influences of gap-filling methods on the annual flux budget, the four methods were also applied to the one-year benchmark datasets with mixed gaps. We then compared the annual sum of soil respiration including gap-filled values to those summed from original observations using the Wilcoxon tests. The original gaps in the datasets were not taken into account for the budget estimations.
All the data processing and analyses were carried out using the program R 3.5.2 (R Development Core Team, 2018). In addition, the packages "lme4" (Bates et al., 2015) and "lmerTest" (Kuznetsova et al., 2016) were used to estimate linear mixed-effects models, the "emmeans" package was used for the post-hoc Tukey HSD test (Lenth, 2019), the "MethComp" package was used to perform regressions with the total-least-squares approach (Carstensen, 2015), and "ggplot2" package was used for plotting graphs (Wickham, 2016). The R code to perform the gap-filling with the four methods in this study is incorporated into the R package "FluxGapsR" freely available at https://github.com/junbinzhao/FluxGapsR/.

Effect of sampling window size
According to the mixed-effects model (Table 2), the sampling window size showed a significant effect on the RMSE of the gap-filling results (P < 0.01). This window size effect differed among the methods (P = 0.01), but not among the gap sizes (P = 0.57) or the seasons (P = 0.14). Of the four methods, NLS, ANN and EM showed slight increases in RMSE along the window size gradient from 5 to 60 days with significantly higher values of RMSE being present in the window of ≥30, 50 and 60 days, respectively, compared to the 5-day window (P < 0.05) (Fig. 1a, b and d). In contrast, SSA showed no significant changes in RMSE with different sampling window sizes (P > 0.05) (Fig. 1c). Therefore, the 5-day window, which is the smallest window size associated with the lowest RMSE for all the four gap-filling methods, was chosen in this study for further analysis.

Performance of the gap-filling methods
The scatterplots of the gap-filled (predicted) soil respiration values versus the original observations generally followed the 1:1 line in the gaps ≤ 1 day (Fig. 2). While the corresponding regression slopes for SSA and EM were between 0.95 and 1.03, the slopes for NLS and ANN occasionally dropped down to 0.92 (Table 3), indicating a slight underestimation. For the 15-day gaps, all the methods generally overestimated the values in the non-winter gaps (i.e., regression slope > 1.1, Fig. 2m-p and Table 3). For the winter gaps, SSA and EM exhibited slopes of 0.98 but NLS and ANN slightly underestimated the flux values in the gaps (i.e., slopes were 0.91 and 0.92, respectively).
According to the mixed-effects model (Table 4), the RMSE differed among different seasons (P < 0.01) and gap sizes (P < 0.01) but not among different methods in general (P = 0.16). However, differences of the RMSE were present among the methods when considering the interaction with the gap size (P < 0.01) but not when considering the season (P = 0.13). The difference was present in the 1-h gap where the results from SSA showed significantly lower RMSE than those from the other 3 methods (P < 0.01) (Fig. 3a). While the RMSE of EM was also significantly lower than that of NLS (P < 0.05), ANN resulted in similar RMSE as NSL (P > 0.05). For the gap sizes of 6-h, 1-and 15-day, the RMSE from ANN, SSA and EM were not significantly different from NLS (p > 0.05) (Fig. 3b-d).
All the four methods generally showed significantly greater RMSE as gap size increased from 1 h to 15 days (Fig. 3e-h). For NLS and ANN, the RMSE remained similar in gaps ≤ 1 day (P > 0.05) while the RMSE became significantly higher in the 15-day gaps (P < 0.01) (Fig. 3e and  f). The RMSE of SSA showed much higher sensitivity to gap size changes with a significant RMSE increase with every step of gap enlarging from 1-h to 15-day (P < 0.01) (Fig. 3g). The RMSE of EM showed a similar pattern with significantly lower values exhibited in 1-h gaps than in 15day gaps (P < 0.01) while 15-day gaps had the greatest RMSE (P < 0.01) (Fig. 3h).
The relative error was significantly different among seasons (P < 0.01) and gap sizes (P < 0.01), but not among the methods (P = 0.12) ( Table 4). The interactions of method with season (P = 0.39) and gap size (P = 0.06) were not significant either. Furthermore, even though the RMSE was generally lower in the winter gaps compared to those in the non-winter (P < 0.01) (Fig. 4a), the relative errors were significantly higher in the winter than in the nonwinter for the 1-h, 1-and 15-day gaps (P < 0.05) (Fig. 4b). The greatest relative error was present in the winter 15-day gaps.
The mixed-effects model for the variance showed significant effects of method, season, gap size as well as the two-way interactions of method with season and gap size (P < 0.01) ( Table 4). Among the four methods, only SSA resulted in similar variance in the gap-filled values compared to the original observations across the seasons and the gap sizes (P > 0.05) (Fig. 5). The variances of the values filled by other 3 methods were overall much lower than those of the observations (P < 0.05), particularly for the winter gaps (Fig. 5e-h). Exceptions are the non-winter 15-day gaps filled by ANN, which also showed similar level of variance as the observations (P > 0.05) (Fig. 5d).
The AR (1) were affected by method, season, gap size as well as the two-way interactions of method with season and gap size (P < 0.01) Table 4). Of the four methods, EM outputted data series with similar AR (1) to the original observations (P > 0.05), except for the 6-h and 1-day gaps in the winter (Fig. 6). NLS also resulted in similar AR (1) structure as the observations in the winter gaps of ≤1 day (P > 0.05) when excluding those filled using the mean values in the sampling windows ( Fig. 6e-g). AR (1) of the data series generated by SSA were similar to the observations only in 1-h gaps (Fig. 6a and e) and the winter 6-h gaps (Fig. 6f). ANN produced data series with significantly higher AR ((1) than the observations in all cases (P < 0.01) (Fig. 6).
The NSE were >0.85 in the gaps of ≤ 1 day for all the methods, among which the NSE values of SSA and EM were >0.95 in the 1-h gaps ( Table 5). The NSE values dropped below 0.8 in the 15-day gaps for all the methods with the values being <0.7 for ANN in the winter and EM in the non-winter periods and being even <0.6 for ANN in the non-winter periods. The more comprehensive index KGE, which considers accuracy, bias and variance of the gap-filled data, showed a similar pattern as the NSE (Table 5). Specifically, KGE values were >0.9 for gaps ≤ 1 day for all the methods and the values were even ≥ 0.95 for SSA and EM in the 1-and 6-h gaps. In the 15-day gaps, lower KGE values were present, which ranged between 0.8 and 0.9.

Fig. 5.
The variance of the gap-filled values predicted by different methods in gaps of different sizes (1h: 1-hour, 6 h: 6-h, 1d: 1-day, 15d: 15-days) as compared to variance of the observations. The variances are shown separately for the winter (a-d) and the non-winter (e-h) with different scales. The boxplots are plotted as in Fig. 1. Different lowercase letters indicate significant differences in variances among different methods (P < 0.05).

Annual budget estimation
The annual soil respiration budget was estimated to be 1152, 463, 627 and 865 g C m −2 year −1 , respectively, at the plot H1, H3, H4 in Hurdal and L2 in Løten during April 30, 2018-April 29, 2019 based on the observations (Fig. 7). By filling the mixed artificial gaps in the datasets, the corresponding annual soil respiration budget estimations ranged 1108-1136, 449-475, 598-646 and 767-955 g C m −2 year −1 , respectively, for the four plots. Overall, the errors of budget estimations tended to be larger as the proportion of gaps increased. For different gap-filling methods, the overall budget errors were −3.7 to 3.6%, −11.3 to 16.0%, −3.7 to 5.8% and −3.2 to 5.3%, respectively, for NLS, ANN, SSA and EM. Significant biases only occurred in the H3 and L2 datasets with 50% gaps (Fig. 7b and d). Specifically, NLS and ANN overestimated the budget in the L2 dataset (P < 0.05), while NLS and SSA underestimated the budget in the H3 dataset (P < 0.05).

Fig. 6.
The AR (1) coefficient of the gap-filled values predicted by different methods in gaps of different sizes (1 h: 1-h, 6 h: 6-h, 1d: 1-day, 15d: 15-days) as compared to variance of the observations. The AR (1) coefficients are shown separately for the winter (a-d) and the non-winter (e-h) with different scales. The boxplots are plotted as in Fig. 1. Different lowercase letters indicate significant differences in AR (1) coefficient among different methods (P < 0.05). Note that the gaps failed to be filled with NLS (but filled with the average of the sampling window) in the winter are not included in the AR (1) calculations (e-h).

Discussion
Our study evaluated the performances of four gap-filling methods on soil respiration datasets from two boreal forest sites. For carbon flux studies, gap-filled data are mostly used for estimating the sum of the flux budget, usually on the scale of one year. According to our estimations, NLS, SSA and EM resulted in annual budget errors that ranged from −3.7 to 5.8%, which are fully acceptable given that this result represents a wide range of soil respiration budget (i.e., 463-1152 g C m −2 year −1 ) in high latitude ecosystems (Bahn et al., 2010;Raich and Schlesinger, 1992). Among the methods used, SSA and EM are based on the time series analysis and had never been used for gap-filling in carbon flux data. Particularly, they are implemented independent of other environmental measurements (e.g., soil temperature). With this advantage and their accurate annual budget estimations, we suggest SSA and EM to be considered as alternatives to the more conventional methods (e.g., NLS) for gap-filling in soil respiration data and possibly in other carbon flux data (e.g., net ecosystem CO 2 exchange measured by eddy covariance, methane fluxes, etc.), especially when the controlling environmental variables are unavailable, missing or unknown.

The accuracy of the gap-filling methods
In our study, we selected the sampling window size of 5 days, because this window size already achieved the most accurate results for all the methods. Previous studies commonly use sampling windows ranging from half a month up to a full year depending on the method used (e.g., Kunwor et al., 2017;Moffat et al., 2007;Zhao and Huang, 2015), which might not be necessary according to our result. Our result implies that a smaller sampling window that captures the local data structures or relationships around the gap can be more representative for the gap than one that includes data of longer periods. Using a smaller sampling window also significantly reduces the computing time for the methods involving iterations (e.g., ANN). However, we do not recommend windows that are smaller than 5 days, because they may simply underrepresent the diurnal variation of the variables in the gap if e.g., only 1 day before and 1 day after the gaps are considered. In addition, it is worth to note that our datasets have a temporal resolution of 10 min, which is three times as many as the commonly used 30-min interval flux datasets in a 5-day window. Thus, we further aggregated the 10-min-interval data by averaging data within each half hour and generated datasets with 30-min intervals. The gapfilling results on the aggregated datasets also indicated that the 5-day sampling window had the lowest RMSE among the window sizes for all the methods (see Fig. A2 in the Appendices). Therefore, we suggest a 5day sampling window could be suitable for most gap-filling procedures on carbon flux data.
Among the four methods, NLS is the most widely accepted method for carbon flux gap-filling since it is usually supported by well-acknowledged relationships between the flux and environmental factors, such as the one between soil respiration and temperature used in this study (Gomez-Casanovas et al., 2013;Lloyd and Taylor, 1994). This method provides the basis to evaluate the other methods. Compared to NLS, ANN models were trained against more environmental variables (i.e., soil moisture and ambient temperature); however, it showed no improvement over NLS in terms of the RMSE (Fig. 3a-d). A previous study that included soil moisture in soil respiration gap-fillings also reached the same conclusion (Gomez-Casanovas et al., 2013). This is mainly because the controls of soil moisture over soil respiration are complicated and depend on many other factors (e.g., time, microbial communities, etc.) (Cook and Orchard, 2008). Particularly in this study, we found large prediction errors by ANN associated with the pulse in soil moisture induced by precipitation events that occurred in the gaps or the sampling windows (see Fig. 8 for examples). Depending on when those pulses occurred as well as their magnitude, gap-filled values predicted by ANN could greatly over-or under-estimate the fluxes in the gaps due to a mismatch of the soil moisture magnitudes in the gaps to those in the sampling windows. These large errors were also reflected in the lower NSE values in the 15-day gaps (Table 5) and the large uncertainties in soil respiration budget estimations by ANN (Fig. 7). Yet, models that better represent soil moisture in soil respiration gapfilling are still needed.
The SSA yielded the smallest errors among the methods in the small gaps (i.e., 1-hour) and outcompeted NLS (Fig. 3a), suggesting that SSA is a highly robust method for filling small gaps. In larger gaps, SSA could also capture the diurnal pattern and well construct the data series in the gap close to the observations if the variation pattern persisted throughout the gap (see Fig. 9a for an example). However, drifts in the diurnal patterns and magnitudes of soil respiration, which are driven by environmental factors, often occur in larger gaps (e.g., Fig. 9b). Without  Fig. 1. The asterisks (*) indicate significant differences of the estimated annual sum from the observed sum (P < 0.05). The datasets from L2 were not gap-filled by EM due to the lack of reference for the full year. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) referencing to environmental variables, these drifts in the gap could be hardly predicted by SSA (but could be predicted by NLS and ANN), and, thus, introduced large errors in the gap-filled values. In addition to the gap size, the completeness of the data in the sampling windows also plays a big role in the SSA predictions. Without sufficient data to construct the diurnal pattern, a "blind guess" by SSA can deviate well away from the observations (e.g., Fig. 9c). A high gap proportion (e.g., 50%) significantly increases the chances of incomplete data in the sampling window. This can increase the errors of the budget estimation with SSA, as shown, for example, in the H3 mixed-gap dataset (Fig. 7b). Therefore, caution needs to be taken when applying SSA for large gaps or highly fragmented datasets.
EM is a method whose performance relies on the relationship between the target time series and the references (Junger and Ponce de Leon, 2015). For our datasets, the accuracy of EM was higher than that of NLS in the 1-hour gaps but not in larger gaps (Fig. 3a-d). This may be partly due to the heterogeneity of our sites where time series considerably deviated from each other (r < 0.85, Table A1). A better accuracy from EM is expected in cases when more closely related time series are available to be used as references. Nevertheless, EM provides a practical approach to gap-fill missing data when parallel measurements are available, which is common for measurements such as soil respiration when multiple replicates are needed.
Due to the limited variations under the snow cover, soil temperature and soil respiration usually decouple from each other during the winter (Alm et al., 1999;Liptzin et al., 2009;Merbold et al., 2012). As expected, the NLS could not be applied to up to 71% of the small winter gaps in this study because of the low variance of the soil temperature. Compared to NLS, the time-series-analysis-based methods (i.e., SSA and EM), which were implemented independent of the soil temperature, achieved a better accuracy in gaps ≤ 6 h and less biased predictions in the 15-day gaps in the winter. Even though the RMSE in the winter was generally much lower than in the non-winter season, larger relative errors were present in the winter gaps. This suggests that the errors introduced by filling winter gaps can account for a large fraction of the annual budget in higher latitude ecosystems when the winter contribution is a significant component (Alm et al., 1999;Aurela et al., 2002;Groffman et al., 2006;Zhao et al., 2016). According to our results, gap-filling methods that are based on time series analysis (i.e., SSA and EM), rather than the more conventional NLS approach, are recommended in ecosystems with substantial winter contributions.

Data structure restoration
The time series analyses are useful tools in ecology studies; however, they are usually very sensitive to gaps in the data (Barnea et al., 2008). In addition to achieving low errors, gap-filled data that restore the original structure of the time series can potentially be further used in time series analysis for, e.g., forecasting (Kohn and Ansley, 1986). Variance and AR (1) are two important structural properties of time series data (Box et al., 2015). In our study, the conventional method NLS as well as ANN largely underestimated the variances, especially during the winter time (Fig. 5). This is because the structures of their predicted data series closely followed the corresponding environmental variables, which usually have much lower variances than the flux data (Kunwor et al., 2017). The data series produced by EM were also low in variances and this is mainly due to the relatively low degrees of freedom (i.e., 10) we chose for spline constructions to achieve better accuracy. In contrast, only SSA, which considered the high frequency signal of the data in the sampling window, restored the variance to the same level as the observations.
For AR (1), we found the gap size showed a significant effect, which was mainly due to the systematic errors introduced to the autocorrelation coefficient estimations when sample size was small, known as finite size effects. The finite size effect is most evident in the observations of 1-hour gaps (n = 5) which had large variances that covered the full range from −1 to 1, and, thus, hardly represented the AR (1) structure of the original time series. Apart from the 1-h gaps, only EM attained the same level as the observations in most cases (Fig. 6) mainly because of the similar AR (1) structures between the target series and the references. In contrast, the other 3 methods all overestimated the AR (1). It is worth to note that even though the predictions from NLS showed lower AR (1) in the winter time to roughly the same degree as the observations, this low AR (1) was largely caused by the low winter soil temperature variances and, also, that 71% of the gaps with extremely low temperature variances were excluded in AR (1) calculations for NLS. Given that the AR (1) became much larger than the observations in the non-winter season, we suggest the capability of NLS to reproduce the AR (1) is limited, which agrees with Kunwor et al. (2017).
Overall, SSA and EM can restore the original structure of the data in certain respects. Nevertheless, to restore the full structural properties of the data is challenging and new methods are still needed.

Conclusions
All the four studied methods achieved good accuracy in estimating the annual soil respiration budget. However, they all have different strengths and weaknesses which need to be considered when used for gap-filling in different situations. NLS is a widely used method; however, due to its poor performance in the winter (i.e., failing to fit the equation and lack of variance), ecosystems with substantial winter contributions should consider other alternatives for gap-filling, at least during the winter. In contrast, SSA has great advantages (i.e., good accuracy, variance and less bias) for winter gaps, especially the small ones. Also, because SSA is independent from environmental variables, it can be implemented even when the driving variables are unavailable, missing or even unknown. Compared to SSA, implementation of EM is more limited by the availability of the corresponding reference data. With a highly related reference dataset, EM is also a reliable option to be used without involving environmental variables. ANN is a flexible approach to include other potential driving factors in the gap-filling models without specifying the exact mathematical formulation. However, its performance may suffer under certain conditions (e.g., a precipitation event). At the same time, the long computing time of ANN compared to its (usually disproportional) improvement in the accuracy over NLS must also be considered.

Funding
This work has been supported by the Norwegian Research Council (grant no. NFR-255061). All measured data used in this paper are produced within this project.

Appendices
Figs. A1-A2 and Table A1.  9. Examples of the gap-filled soil respiration rates as compared to the original observations (gray lines) for gaps > 10 days derived from the mixed gap benchmark datasets at the plot H3 in Hurdal in 2018. The examples include scenarios where soil respiration diurnal pattern in the gap followed (a) and did not follow (b) the patterns in the sampling windows and where missing data were present in the sampling windows (c). Values predicted by NLS, ANN, SSA and EM are plotted in pink dotted, green dashed, blue solid and purple solid lines, respectively. The insets show the RMSE calculated for the gap-filled values predicted by different methods using the same color scheme. The black lines are observed values out of the gap. The red boxes indicate the corresponding sampling window for each gap. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)   A2. The RMSE for the gap-filled values as a function of the sampling window size using different methods for the datasets that were aggregated to achieve a temporal resolution of 30-min. The bottom and top of a box indicate the lower and upper quartiles, respectively, and the horizontal line within the box is the median. The lower and upper whiskers are the 10th and 90th percentiles, respectively. Different lowercase letters indicate significant differences in the RMSE among different sampling window sizes (P < 0.05).