China’s larch stock volume estimation using Sentinel-2 and LiDAR data

ABSTRACT Forest Stock Volume (FSV) is one of the key indicators in forestry resource investigation and management on local, regional, and national scales. Limited by the saturation problems of optical satellite remote-sensing imagery in the retrieving of stock volume, and the high cost of Light Detection And Ranging (LiDAR) data, it is still challenging to estimate FSV in a large area using single-sensor remote-sensing data. In this paper, a method integrated multispectral satellite imagery and LiDAR data was developed to map stock volume in a large area. A random forest model was adopted to estimate the stock volume of larch forest in China based on the training samples from the Airborne Laser Scanning (ALS)-derived stock volume and corresponding Sentinel-2 imagery. Validation using National Forest Inventory (NFI) data, ALS-derived stock volume and ground investigation data demonstrated that the estimated stock volume had a high accuracy (R2 = 0.59, RMSE = 59.69 m3/ha, MD = 39.96 m3/ha when validated with NFI data; R2 ranged from 0.77 to 0.85, RMSE ranged from 38.68 m3/ha to 67.38 m3/ha, MD ranged from 24.90 m3/ha to 37.27 m3/ha when validated with ALS stock volume; R2 = 0.42, RMSE = 79.10 m3/ha, MD = 62.06 m3/ha when validated with field investigation data). Results of this paper indicated the applicability of estimating stock volume of larch forest in a large area by combining Sentinel-2 data and airborne LiDAR data.


Introduction
Forest Stock Volume (FSV), defined as the total volume of the stems of all living trees with height over 2 m and canopy cover more than 20% per unit forest area (m 3 /ha), is often used in wood fiber production, forest resource investigation and management (National Forestry Administration 2014).In addition, FSV is well correlated with the AboveGround Biomass (AGB) and carbon stocks (Astola et al. 2019;Alberdi 2021), which is one of the key variables reflecting the total number of forest resources in regional and national scales (Mura et al. 2018;Alberdi et al. 2020;Venäläinen et al. 2020;Santoro et al. 2015).Traditionally, FSV is estimated by sampling plots, which cost substantial manpower, materials and financial resources (Astola et al. 2019).Remote-sensing technology provides a feasible and flexible way to estimate FSV, and is widely used in the field of forest resources inventory.Airborne Light Detection And Ranging (LiDAR) (Hu et al. 2019;Chen et al. 2019;Tian et al. 2019;Coops et al. 2021) data measurements enable FSV inventory and monitoring in high accuracy, but airborne acquisitions are expensive, limited in temporal and spatial coverage, and usually contain some data gaps, which restricting their utility for wall-to-wall mapping of FSV in large areas (Chang et al. 2019;Xu et al. 2021).In this condition, open satellite imagery becomes appealing for FSV mapping in regional and national scales.
Developments in the earth observation in recent years increased the potential to improve the efficiency of retrieving regional or global forest attributes including stock volume.Many researches have been carried out on the estimate of FSV based on multi-scale optical remote-sensing images, such as Moderate Resolution Imaging Spectroradiometer (MODIS) (Giree et al. 2013;Senf et al. 2013;Frantz et al. 2016), Landsat (Griffiths et al. 2014;Zhu, and Woodcock 2014;Morresi et al. 2019;Hawryło et al. 2020;Obata et al. 2021), SPOT (Bochenek et al. 2018;Ehlers et al. 2018;Zhou et al. 2020;Huong 2022), and Quickbird (Abdollahnejad et al. 2017;Stournara, Patias, and Karamanolis 2017).The high spatial resolution, short revisiting period, rich spectral information, and free accessing have made Sentinel-2 images a superb data source for FSV estimation in recent years.For instance, Chrysafis et al. (2017) estimated FSV based on Sentinel-2 images using the Random Forest (RF) model and found that Sentinel-2 data provided relatively better accuracy than Landsat-8 Operational Land Imager (OLI) images.Hu et al. (2020) estimated FSV in Hunan Province in China by integrating in situ plot data and Sentinel-2 images using machinelearning models.Combined use of optical multispectral imagery and LiDAR data could integrate the strengths and complement the limitations of different sensors, which may acquire better accuracy and improve the applicability of the retrieving models to estimate FSV (Mauya et al. 2019;Coops et al. 2021).For example, Hawryło et al. (2018) predicted FSV of Scots Pine using Sentinel-2 images and airborne point clouds from a multiple Linear regression (LR) and RF models.Chen et al. (2021) mapped FSV by Global Ecosystem Dynamics Investigation (GEDI) LiDAR data, and imagery from Sentinel and ALOS, and found that point-line-polygon framework integrating LiDAR as a linear bridge to estimate FSV performed better than traditional point-polygon method.Although some attempts have been made, how to integrate the LiDAR data and multispectral satellite imagery to build a robust model to estimate FSV is still challenging.
As for the methods to estimate FSV, some researchers have shown interests in non-parametric models, such as kernel-based regression methods (Verrelst et al. 2019).Relying on less available data and not requesting any fixed functional form, non-parametric methods are considered to be more versatile than traditional parametric methods (Tuia et al. 2011).Therefore, non-parametric methods usually perform better in the description of the nonlinear relationship between the remote-sensing features and FSV.Machine-learning methods and deep-learning methods, such as RF, Support Vector Machine (SVM), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), have also been widely used in the FSV estimation (Astola et al. 2019;Lang, Schindler, and Wegner 2019;Hu et al. 2020;Rees et al. 2021).One of the main factors affecting the accuracy of these models are the quality and quantity of the training dataset (Chen et al. 2021).The restricted number of field plots has limited their use in building machine-learning or deep-learning models in FSV estimation.Therefore, studies to acquire plentiful high-quality training samples, and then build a FSV estimation model with strong applicability and robustness in a large area becomes significant.
Larch forest covers about 6% of the total forest area in the world, which is about 250 million hectares (Gauthier et al. 2015;Gurmesa et al. 2022).Among which, China larch forest covers about 4% of the global larch forest area (about 10 million hectares) (National Forestry and Grassland Administration 2021).Some studies have been conducted to retrieve the stock volume of larch forest (Li et al. 2019(Li et al. , 2020;;Hao et al. 2022), but few researches focus on the estimation of China larch FSV, which accounted for 7.42% of the total volume of China planted forest.In this case, studying the stock volume in China could provide some information for the global and China forest carbon stock.
The primary objectives of this study are to: (i) explore the possibility and applicability of estimating larch stock volume by combining Sentinel-2 data and LiDAR data in a large area; (ii) build a robust model to generate a FSV map of larch forest in China.

Field data
We collected some field plot data of larch FSV in typical larch distribution regions, as shown in Figure 1.These plots were tallied in during 2017-2021 for multiple research purposes, as shown in Table 1.Within each plot area, diameter at breast height (DBH, where breast height = 1.30m) were measured for all the trees of DBH over 5 cm.Tree height was measured for all the tallied trees or typical trees.The stock volume was calculated using local allometric equation (Xu, Wu, and Ye 2009;Doyog et al. 2017) for each tally tree and summarized together as the FSV of each plot.Then, the FSV density of each plot was calculated from plot size.

Sentinel-2 data
A pixel-based compositing algorithm (Griffiths et al. 2013) was adopted to generate a cloud-free and radiometrically consistent datasets from Sentinel-2 images.Three parameters, i.e. acquisition year, day of year (DOY), and distance to clouds, were taken into account and a parametric weighting scheme was used in this algorithm, which allowed for utilizing different pixel characteristics for optimized compositing.We obtained the composited reflectance of 10 bands (Blue: 458 nm-523 nm, Green: 543-578 nm; Red: 650 nm-680 nm; Vegetation red edge: 698 nm-713 nm; Vegetation red edge: 733 nm-778 nm; Vegetation red edge: 773 nm-793 nm; NIR: 785 nm-900 nm; Vegetation red edge: 855 nm-875 nm; SWIR: 1565 nm-1655 nm; SWIR: 2100 nm-2280 nm) with a spatial resolution of 20 m from Sentinel-2 data in 2020 in the growing season (from May to October) for the whole distribution area of larch forest.We also obtained the reflectance of 10 bands mentioned above in the growing season in Mengjiagang in 2017 and 2020, in Saihanba in 2018 and in Dagujia in 2018.For the three forest farms shown in Figure 1, we also composited Sentinel-2 data in the year when airborne data were collected.Vegetation indexes, Normalized Difference Vegetation Index (NDVI), and Enhanced Vegetation Index (EVI) were also calculated in the growing season.NDVI and EVI were described by the following formula:   where NIR is the reflectance of near-infrared band, Red is the reflectance of red band, and Blue is the reflectance of blue band.
Considering that kernel NDVI (kNDVI) is more resistant to saturation, bias and complex phenological cycles, and shows better correlations than NDVI in applications of biomass estimation, kNDVI was also used as a variable to evaluate the larch stock volume (Camps-Valls et al. 2021).kNDVI could be described as (Camps-Valls et al. 2021): which is a length-scale parameter reflects the sensitivity of the index to sparsely or densely vegetated areas.
To evaluate accuracy of composited Sentinel-2 data, we analyzed the spectral consistency of the composited image with the realistic reference image by randomly selecting the samples, and found that a good correlation existed between them.The accuracy could be as high as 90%.

Stock volume maps of larch forest from airborne laser scanning data
We collected the airborne LiDAR data in Mengjiagang Forest Farm in June 2017 and September 2020, Dagujia Forest Farm in July 2018 and Saihanba Forest Farm in September 2018.The CAF-LiCHy airborne observation system, which integrated the RieglLMS-Q680i scanner, was used for the Airborne Laser Scanning (ALS) data collection (Pang et al. 2016).Wavelength of this system was 1550 nm, with a 3 ns pulse length and 0.5 m rad beam divergence.The max pulse repetition rate of the sensor is 400 kHz.The ALS datasets included the point density, flight time, relative flight height, and average flight speed.The point density varied from 10 pts/m 2 to 35 pts/m 2 , the above ground flight height varied from 800 m to 1000 m, and the flight speed varied from 130 km/h to 180 km/h (Pang et al. 2021).
Steps to obtain the stock volume maps of larch forest from ALS data (Figure 2) were as follows: (1) LiDAR metrics, including height and density variables, were extracted from height normalized point cloud.Size of the cells to derive the metrics was 20 m; (2) regression models were built between the extracted metrics and the stock volume accumulated at the field plot; (3) the models were applied to areas deemed as larch forest by a tree species map so as to derive FSV raster map for larch forest.Stock volume in 70% field plots was used to train the regression models in Mengjiagang, Saihanba, and Dagujia, respectively.Cross-validation with the stock volume in the rest 30% field plots demonstrated that the stock volume map from ALS data had high accuracy (R 2 ranged from 0.76 to 0.95, RMSE ranged from 32.66 m 3 /ha to 65.40 m 3 /ha).

Larch forest map in China
Multi-temporal Landsat 8 images in 2020, Digital Elevation Model (DEM), forest resource investigation data were applied in the classification process using the RF, SVM, and neural network.Spectral features (reflectance, NDVI and EVI), textural features (variance, average value, angular second moment, contrast, homogeneity, dissimilarity, correlation and entropy) of larch forest were derived from the multitemporal Landsat images, and terrain features (slope and aspect) were derived from DEM.Then, Gini coefficient was used to select the main characteristic factors from those features, which were used to train the machine-learning models.Validation with the investigation data in Daxing'anling and Heilongjiang showed that accuracy of larch forest classification map was higher than 70% (Ministry of Science and Technology of China, 2021).

Methods
A method integrating multispectral Sentinel-2 imagery and LiDAR data was developed to map stock volume in China.Flowchart of larch stock volume estimation was demonstrated in Figure 3.The ALSderived stock volume of larch forest was used as a bridge to relate field plots and full-cover Sentinel-2 multispectral imagery.Firstly, the training datasets were obtained from the homogeneous ALS stock volume and corresponding Sentinel-2 pixels (surface reflectance, vegetation indexes) using a stratified sampling method.The Sentinel-2 data were consistent with the year when the airborne data flew.Then, an RF model was adopted to estimate the stock volume map in China.At last, the results were validated using ALS-derived stock volume and ground investigation data.

Estimation of larch stock volume in China from Sentinel-2 data
To obtain some high-quality samples to train the machine-learning model, homogeneous pixels were selected using the coefficients of variation of ALSderived stock volume in 100 m × 100 m pixels.The coefficient of variation was defined as the ratio of the standard deviation to the mean value in the 100 m × 100 m pixel.The 100 m × 100 m pixels with the minimum 20% coefficient of variation in each forest farm (Mengjiagang, Saihanba and Dagujia) were selected as the pure pixels.In this paper, the coefficient of variation to define pure pixels was 0.42 (in Mengjiagang in 2017), 0.40 (in Mengjiagang in 2020), 0.35 (in Saihanba), 0.38 (in Dagujia), respectively.Then, the training datasets were built based on the ALS-derived stock volume in Mengjiagang, Dagujia, Saihanba and corresponding surface reflectance of 10 bands (band2, band3, band4, band5, band6, band7, band8, band8A, ban11, and band12) and vegetation indexes (NDVI, EVI, and kNDVI) from Sentinel-2 data in the homogeneous pixels.Noting that the surface reflectance and vegetation indexes from Sentinel-2 data in the homogeneous pixels were also aggregated to 100 m × 100 m pixels.To resist the saturation of machine-learning models and obtain high accuracy when the larch stock volume in high values, logarithms were used in the ALS-derived stock volume.Specifically, log(FSV) was used as the response variable and then backtransformed the output.To make the training datasets more representative, a stratified sampling method was adopted in this paper according to the ALS-derived stock volume.Stratum number is 6 according to the ASL ALS FSV.To make a crossvalidation, 70% of the pure pixels was used to train the model, and the rest 30% of the pure pixels was used to validate the model.As RF has been extensively used in forest parameter mapping from remote-sensing data and has achieved good accuracy by comparative assessments due to being less sensitive to noise in training samples (Zhao et al. 2019;Chen et al. 2021), the RF model was used in our research to estimate the stock volume based on the Sentinel-2 data (surface reflectance, vegetation indexes) at the original 20 m resolution.At last, the map of larch stock volume was obtained based on the larch forest mask in China.

Validation of the larch stock volume from remote sensing data
To assess the accuracy of the Sentinel-2-derived stock volume of larch forest in China, first, we compared the average growing stock in each province with the provincial National Forest Inventory (NFI) 9 stock volume.Then, the results were compared with corresponding ALS-derived stock volume pixel by pixel at Mengjiagang, Dagujia, and Saihanba.Besides, the estimated stock volume was also validated against the ground investigation data.Noting that the mean stock volume values in a 5 pixels *5 pixels window around each field plot were used to compare with the ground investigated data to reduce the co-registration errors between images and field plot sites.The coefficient of determination (R 2 ) and root-mean-square error (RMSE) were used to quantify the accuracy of the stock volume of larch forest from remote-sensing data, and the Mean Difference (MD) (Valbuena et al. 2017) was used to evaluate the degree of under or over prediction of the results.In Xinjiang, Heilongjiang, and Inner Mongolia, stock volume of larch forest derived from Sentinel-2 was more than 140 m 3 /ha, which were the highest.Followed by Jilin, Beijing, Hebei, Shanxi, and Hubei with the larch FSV ranging from 100 m 3 /ha to 120 m 3 / ha.In Chongqing, Qinghai, Gansu, Henan, and Shandong, the larch FSV were less than 90 m 3 /ha.

Validation with national forest inventory data
We compared the provincial Sentinel-2 stock volume with the provincial NFI 9 stock volume, as shown in Figure 6.A good linear relationship existed between the NFI 9 stock volume per hectare and the Sentinel-2 stock volume per hectare (R 2 = 0.59, RMSE = 59.69 m 3 /ha, MD = 39.96m 3 /ha).But we could also find that larch stock volume from Sentinel-2 data was overestimated when the NFI 9 volume was less than 100 m 3 /ha.A relative lower relationship existed between the Sentinel-2 volume and the NFI 9 volume.Reason leading to this may be the errors of larch forest area from remote-sensing data.

Validation with ALS stock volume
We compared the Sentinel-2-derived stock volume of larch forest with corresponding ALS-derived stock volume pixel by pixel (stock volume from Sentinel-2 data and from ALS data were aggregated to 100 m × 100 m) in Mengjiagang, Dagujia, and Saihanba, as shown in Figure 7.In general, a good linear relationship exists between Sentinel-2-derived stock volume and ALS-derived stock volume (R 2 ranged from 0.77 to 0.85, and RMSE ranged from 38.68 m 3 /ha to 67.38 m 3 /ha, MD ranged from 24.90 m 3 /ha to 37.27 m 3 /ha), indicating that a Sentinel-2-derived larch stock volume achieved a high accuracy in Mengjiagang, Dagujia, and Saihanba.

Validation with field investigation data
Validation with the ground investigated larch stock volume shown that the Sentinel-2-derived larch stock volume had a high accuracy (R 2 = 0.42, RMSE = 79.10 m 3 /ha, MD = 62.06 m 3 /ha) (Figure 8).However, we could find that most of the plots were distributed above the 1:1 line when the ground measurement stock volume was less than 200 m 3 /ha, indicating that the larch stock volume from Sentinel-2 data was overestimated.While most plots were located under the 1:1 line when the ground investigated stock volume was higher than 200 m 3 /ha, which demonstrate that Sentinel-2 derived stock volume of larch forest was underestimated.

Advantages of integrating LiDAR data and satellite data to estimate FSV
Compared to traditional stock volume modeling approaches directly related to sample plot data with satellite data, integrating airborne LiDAR data and Sentinel-2 satellite images to derive stock volume has some advantages.Specifically, maps of stock volume derived from airborne LiDAR data acted as a link bridge to relate field plots and fullcover Sentinel-2 multispectral imagery, which could greatly enrich the training datasets to build the machine-learning models.The relative error of traditional point-polygon method which directly related to sample plot data with satellite data was about 20% to 40% (Immitzer et al. 2016;Xu, Manley, and Morgenroth 2018;Dos Reis et al. 2019;Liu, Wang, and Wang 2019;Hawryło et al. 2020), while the relative error was less than 30% of this paper, indicating the applicability and feasibility of mapping larch stock volume in a large area by combining Sentinel-2 data and LiDAR data.

Influence of data saturation problem on FSV estimation
The data saturation problem of Sentinel-2 images may have some effects on the accuracy of the estimated FSV.Studies have indicated that the spectral reflectance of satellite images tend to be saturated and become less sensitive to high FSV due to the complex forest canopy structure in dense vegetation (Hu et al. 2020;Li et al. 2021), which would result in the underestimation of satellite-derived FSV.The application of various vegetation indexes, especially the indexes related to red edge, has the ability to overcome the saturation problem and increases the accuracy of estimated FSV (Frampton et al. 2013;Nuthammachot et al. 2022).In our research, reflectance of three red edge bands (698 nm-713 nm, 733 nm-78 nm, 773 nm-793 nm) was also used to train the machinelearning model.Besides, a new vegetation index kNDVI, which was believed to be more resistant to saturation and complex canopy structure (Camps-Valls et al. 2021), was also adopted to improve the performance of machine-learning model in the high FSV estimation to some extent.Another way to mitigate the data saturation problem was the application of stratification approaches (Zhao et al. 2016).But it was still challenging to collect sufficient number of samples in each stratum in the traditional methods.In this paper, the ALS-derived FSV was used to train the sample, which could increase the number of samples effectively.In the future, the stratified sampling strategy may be improved to reduce the impact of data saturation problem in FSV mapping.

Influence of compositing algorithm of Sentinel-2 images on FSV estimation
Satellite images for large-scale mapping of tree species or relevant parameters should not only rely on single high-quality cloud-free images, but also should be supported by seasonal composites of multi-images (Kollert et al. 2021).Studies have demonstrated that seasonal Sentinel-2 data are often used for temporal composites and land surface phenology, which could be as the inputs of machine-learning model (Kollert et al. 2021;Shen et al. 2021).The composites used in this paper were based on the growing season (from May to October) Sentinel-2 images, and a pixel-based compositing algorithm (Griffiths et al. 2013) which was applicable for integrating different pixel characteristics for optimized compositing.Cloud free and radiometrically consistent datasets could be obtained although the data acquired time was mixed.But the accuracy may be influenced when the image was far from the center time of the growing season, which may bring some errors in the compositing, and to the subsequent FSV estimation.

Limitations and future work
Errors of the training datasets and the limitation of RF model would bring some uncertainties to the results.First, the quality and accuracy of training samples were well related to the performance of RF model.In this paper, the training samples were obtained from the pure pixels from ALS larch stock volume map and corresponding Sentinel-2 data, and the errors of ALS larch stock map would have some influence on the accuracy of the results.To improve this, we may develop some methods to derive stock volume map with higher accuracy with LiDAR data, or to select more representative training samples in the future.Second, the selection of the variables may impact the results of RF model.In this paper, reflectance of 10 Sentinel-2 bands and VIs was used as the independent variables of the RF model.We may enrich the kinds of variables, such as spectral features, textural features, and topographic features, and analyze the of the importance of several kinds of variables to avoid overfitting and to build more robust machine-learning models to estimate FSV.Third, errors in the classification of larch forest would bring some uncertainty when validating with NFI data.Finally, limitations of the RF model may also lead to some errors of the results.More validation work may be carried out to assess the larch forest map used in this paper.Studies have demonstrated that overestimation in the low values and the underestimation in the high values was a common problem in the estimation of FSV or biomass from RF model using multispectral remote-sensing data (Ou et al. 2019;Zhang et al. 2019;Hu et al. 2020;Li et al. 2020).In the future, some deep-learning methods based on multiple optical remote-sensing data and LiDAR data may be used, and the parameters of the deep-learning models could be optimized to improve the accuracy.

Conclusions
In this paper, a method of integrating airborne LiDAR data and Sentinel-2 satellite images was developed to estimate stock volume of larch forest in China.LiDARderived stock volume of larch forest acted as a link to relate field plots and Sentinel-2 multispectral imagery.A machine-learning model with high accuracy was built to map the stock volume of larch forest in China based on the LiDAR-derived stock volume and corresponding Sentinel-2 data in the pure pixels.Comparison with the NFI 9 provincial data, with ALS stock volume, with field investigation data demonstrated that the Sentinel-2 stock volume gained a high accuracy.Results of this paper indicated that combining Sentinel-2 data and airborne LiDAR data could greatly enrich the training datasets and improve the accuracy of machine-learning models of stock volume estimation. of China (grant number: 41871278 & 32071759) and forest parameter inversion by integrating LiDAR and multiple angle optical data for the terrestrial ecosystem carbon inventory satellite (2016K-10 & YGD-202100105737-006-001).

Notes on contributors
Tao Yu received PhD degree of cartography geographic information system from Beijing Normal University in 2019.He is currently a research assistant at Institute of Forest Resource Information Techniques, Chinese

Figure 1 .
Figure 1.Verification sites and airborne flight area.
Number of samples Number of samples for ALS modeling Daxing'anling, Heilongjiang 2019, 2021 Circle with a radius of 14 m, 25 m*25

Figure 2 .
Figure 2. Larch stock volume from ALS data.

Figure 3 .
Figure 3. Flowchart of larch stock estimation from Sentinel-2 and LiDAR data.

A
cross-validation method was used to validate the estimated FSV of larch forest from RF model, which was shown in Figure 4. High accuracy (R 2 = 0.84, RMSE = 54.68 m 3 /ha, MD = 30.54m 3 /ha) was obtained when validated with ALS stock volume, which indicate that the model had a good stability and applicability.Map of stock volume of larch forest in China is shown in Figure 5. Generally, larch stock volume was high (more than 200 m 3 /ha) at northeastern areas in China, such as Daxing'anling and Xiaoxing'anling areas, while the larch stock volume was low in the central part of China, such as Hubei province.Total area of larch forest in China was about 2.18 × 10 7 ha, and the total stock volume of larch forest in China was about 2.76 × 10 9 m 3 .

Figure 4 .
Figure 4. Accuracy of the larch stock estimation model.

Figure 5 .
Figure 5. Larch stock map in China from Sentinel-2 data.

Figure 6 .
Figure 6.Comparison of the provincial volume per hectare.

Figure 8 .
Figure 8. Validation with field investigated field plot stock volume data.

Table 1 .
Information of the field plot data.