Mapping soil organic matter content using Sentinel-2 synthetic images at different time intervals in Northeast China

ABSTRACT Mapping soil organic matter (SOM) content has become an important application of digital soil mapping. In this study, we processed all Sentinel-2 images covering the bare-soil period (March to June) in Northeast China from 2019 to 2022 and integrated the observation results into synthetic materials with four defined time intervals (10, 15, 20, and 30 d). Then, we used synthetic images corresponding to different time periods to conduct SOM mapping and determine the optimal time interval and time period before finally assessing the impacts of adding environmental covariates. The results showed the following: (1) in SOM mapping, the highest accuracy was obtained using day-of-year (DOY) 120 to 140 synthetic images with 20 d time intervals, as well as with different time intervals, ranked as follows: 20 d > 30 d > 15 d > 10 d; (2) when using synthetic images at different time intervals to predict SOM, the best time period for predicting SOM was always within May; and (3) adding environmental covariates effectively improved the SOM mapping performance, and the multiyear average temperature was the most important factor. In general, our results demonstrated the valuable potential of SOM mapping using multiyear synthetic imagery, thereby allowing detailed mapping of large areas of cultivated soil.


Introduction
Soil organic matter (SOM) can promote the formation of soil aggregates, improve the conditions of physical, chemical and biological soil processes, and improve the absorption and buffering of soil (Page, Dang, and Dalal 2020).In addition, the carbon stored in SOM is an important part of the global carbon cycle and is thus critical for the global carbon balance (Coppola et al. 2022).The accuracies of soil quality and regional carbon cycle assessments are also affected by the lack of high-precision spatial distribution maps of SOM content (Meng et al. 2020;Poggio et al. 2021).Accurate and rapid mapping of regional SOM content is critical for high-quality management of cultivated land and sustainable agricultural development.
The SOM content is one of the main factors affecting soil spectral characteristics (Angelopoulou et al. 2019;Moura-Bueno et al. 2019).Laboratory hyperspectral studies have demonstrated that the SOM content is related to the visible shortwave infrared (350-2500 nm) and mid-infrared (2.5-25 nm μm) wavelengths (Wang et al. 2022a).This is mainly due to the absorption of polar covalent bonds (OH, CH and NH functional groups) to the infrared frequency (Bahureksa et al. 2021).Many researchers have developed SOM estimation algorithms based on this finding.Although these methods can accurately predict the SOM content for a single point, they cannot be used to obtain SOM contents on a regional scale (Bao et al. 2020;Conforti et al. 2015;Liu et al. 2019).Most commonly-used satellites include visible shortwave infrared (350-2500 nm) bands, making it possible to use remote sensing images to map regional SOM contents.
Based on the relationship between SOM and the soil spectrum, many researchers have begun to use common satellites to map regional SOM content (Biney et al. 2022;John et al. 2022).In recent years, the prediction methods used to map SOM have included linear regression and nonlinear regression (support vector machines, random forest regression, and Cubist models) (Gomes et al. 2019;Lamichhane, Kumar, and Wilson 2019;Luo et al. 2022a;Mahmoudzadeh et al. 2020;Pouladi et al. 2019;Tajik, Ayoubi, and Zeraatpisheh 2020).In recent years, some researchers have used ANN deep-learning algorithms to map SOM contents, showing good prediction performances and model portability (Meng et al. 2022b).In addition to improving SOM prediction methods, image selection is also an important part of SOM mapping (Demattê et al. 2018;Silvero et al. 2021).Multitemporal synthetic images have been used to obtain pure bare-soil pixels to avoid the impact of soil environmental factors on SOM mapping and have their advantages over single images have been demonstrated (Luo et al. 2022b;Luo et al. 2022c;Wang et al. 2022b).Finally, the addition of environmental covariates has proven effective (Wang et al. 2022b).
Some recent signs of progress have enabled us to predict the spatial distributions of large-scale SOM in the bare-soil period.On the one hand, a number of free and high-performance satellites have been successfully launched (Landsat-9, Sentinel-2) (Claverie et al. 2018;Masek et al. 2020), allowing us to obtain near real-time, high-spatial-resolution satellite images around the world.On the other hand, the development of the GEE has enabled many researchers to gain vast computing power (Dong et al. 2016;Gorelick et al. 2017).Compared to traditional remote sensing image processing methods, the biggest advantage of the GEE platform is that it can easily process multitemporal images (Huang et al. 2017;Xiong et al. 2017).GEE has been used to obtain multitemporal synthetic images for crop classifications, wetland extractions, land use classifications and other research (Jin et al. 2019;Mahdianpari et al. 2019;Teluguntla et al. 2018;You and Dong 2020).With the continuous improvements in data availability and computing platforms, more researchers have attempted largescale mapping of the spatial distribution of SOM (Luo et al. 2022a;Luo et al. 2022b).The key to largescale SOM mapping is selecting an appropriate time window corresponding to bare-soil conditions.
The temporal and spatial resolutions of remote sensing images are mutually restricted (Emilien, Thomas, and Thomas 2021).Building cloud-free and gap-filled images with proper and regular time intervals is essential for using high-spatial-resolution images to reconstruct bare-soil time-series image sets (Griffiths, Nendel, and Hostert 2019).In crop-mapping research, crop types differ from year to year, so remote sensing image data collected in the same year are usually used (Luo et al. 2022b;Tomppo, Antropov, and Praks 2019).In cultivated soils, the relative stability of the organic matter content can be ensured without large-scale land remediation (Dou et al. 2019).Therefore, when constructing regular time interval cloud-free and gap-filled bare soil images, we can use the same time period synthesis method for many years (Demattê et al. 2018;Silvero et al. 2021).The key to constructing a bare-soil synthetic image set is selecting the appropriate time interval (Belgiu and Csillik 2018;You and Dong 2020).The shorter the time interval is, theoretically, the smaller the soil variation among different years are.However, performing image synthesis with too narrow an interval may lead to more gaps (due to cloud pollution) (Dong et al. 2016;Griffiths, Nendel, and Hostert 2019).Therefore, determining the appropriate time interval for performing image synthesis to reconstruct the time series of cloudless images in the bare-soil period is very important.
In this study, farmland in Northeast China was the research object and we used all available Sentinel-2 images obtained from 2019 to 2022, soil sampling points, different image synthesis strategies, random forest (RF) regression and GEE platforms to determine the best Sentinel-2 synthetic image (the best time interval and the best time period).The specific objectives are as follows: (1) to determine the image synthesis time interval that yields the highest SOM spatial distribution map accuracy among the following choices: 10, 15, 20 d or 30 d; (2) to identify the time window distributions of the synthetic images at different time intervals; (3) to determine whether adding environmental covariate data improves the soil mapping accuracy compared to using Sentinel-2 alone.The resulting moderatespatial-resolution (10-m) SOM content spatial distribution map of Northeast China can provide the latest soil chemical quality information and aid in assessments of terrestrial carbon cycle simulations.

Study areas
The research area considered in this study is the northeast region of China, including three provinces and four leagues in eastern Inner Mongolia.The range is 38.72°N to 53.56°N and115.52°E to 135.09°E, covering an area of nearly 788000 km².The main soil types in Northeast China include Cambisols, Phaeozems, Chernozems, Luvisols, Arenoslos, and Kastanozems.These soil types are generally suitable for agricultural production (Ou et al. 2017).Crops are planted once a year and the main food crops include corn, soybean, and rice (You et al. 2021).The commercial rate of food is very high, and the region is the main food production base in China.According to previous studies, the bare-soil period in Northeast China lasts from the beginning of March to the end of June every year.Except for some areas covered by snow, the region is in the bare-soil state, which is conducive to using remote sensing satellites to monitor the physical and chemical properties of the soil (Figure 1).

Soil sample acquisition and treatment
We designed 6 sampling lines and collected 941 topsoil samples in April 2021 and May 2021.The soil sampling scheme was based on the soil type map obtained from the second soil survey, which was stratified and random.In each section, multiple points were sampled to ensure the representativeness of the samples.The central position of the sampling points was recorded with a global positioning system.All soil samples were ground through a 2-mm sieve (O'Kelly 2004).The soil organic carbon content was determined by the potassium dichromate heating method (Nelson and Sommers 2013) and then the value was multiplied by the coefficient to calculate the SOM content.

Sentinel-2 image acquisition and treatment
Sentinel-2 images can be repeated for 5 days, providing obvious time-resolution advantages over other commonly used high-spatial-resolution images (Drusch et al. 2012b).In this study, Sentinel-2 SR imagery covering Northeast China from March to June (Days 60-180) in the four years from 2019 to 2022 was used.The dataset was described as 'Sentinel-2 MSI: MultiSpectral Instrument, Level-2A' in GEE.The dataset was atmospherically corrected using the 'sen2cor' program and represented surface reflectance data.Subsequently, all filtered Sentinel-2 images were cloud masked using the 'QA60' band in the image to obtain a cloud-free time-series Sentinel-2 image set.
Many studies have demonstrated that median synthesis is an efficient method for obtaining the best pixels in time-series image collection.Compared to the average synthesis method, the pixels obtained by median synthesis are accurate and not easily affected by extreme values (You and Dong 2020;You et al. 2021).According to the needs of this study, the Sentinel-2 image set was median-synthesized at different time intervals to obtain median synthetic images in different time periods (see Table 1 for details).We used the visible and near-infrared bands (B2/B3/B4/ B8), four red-edge bands (B5/B6/B7/B8a) and two shortwave infrared bands (B11/B12) of the Sentinel-2 image as the remote sensing inputs and resampled all the bands to a 10-m spatial resolution.

Selection and treatment of environmental covariates
Many previous studies have used supplemental topographic and climate data to perform soil mapping, and the results have shown that the elevation, slope, average temperature and average precipitation are the most useful factors for improving the soil mapping accuracy with good interpretability (Wang et al. 2021;Wang et al. 2022b;Zhou et al. 2020).Therefore, in this study, we selected these four covariates for SOM mapping.
For the elevation and slope data, we used NASADEM Digital Elevation 30-m data in GEE; these are the latest 30-m-spatial-resolution data available and are optimized based on multiple DEM data (NASA 2020).The average temperature and precipitation data were processed by 'era5 monthly aggregates' (Cucchi et al. 2020).We used the average data from 1979 to 2020 to avoid the impacts of abnormal years on the data.

Random forest (RF) regression
A random forest (RF) is a concrete implementation of the bagging method (Breiman 2001).Multiple decision trees are trained and then the individual results are integrated to obtain the final result.RFs can  be used for splitting or regression (Svetnik et al. 2003).The method mainly involves the selection of decision tree types and the selection of s categories of decision trees according to specific tasks (Cutler et al. 2007).The random forest regression algorithm is used in scenarios in which the data dimension is relatively low (tens of dimensions) and the accuracy is high (Strobl, Malley, and Tutz 2009).
The RF regression model in GEE was used and the 'ee.Classifier.smileRandomForest()' program was used to predict SOM.The number of trees was set to 400, mainly to balance the accuracy and operation efficiency; for VariablesPerSplit, we used the default value of 'null'; for minLeafPopulation, we use dthe default value of 1; for bagFraction, we used the default value of 0.5; for maxNodes, we used the default value of 'null'; and for seed, we used the default value of 0.

Accuracy verification
We evaluated the models via independent verification methods.First, we used the Kennard stone algorithm to divide 941 soil samples into training samples and verification samples in the ratio of 3:1 (Saptoro, Tadé, and Vuthaluru 2012;Stevens and Ramirez-Lopez 2014).A total of 707 soil samples were used as training samples, and 234 soil samples were used as independent verification samples to ensure that the SOM was based on the same training and test data segmentation scheme to make facilitate comparisons between different models.Using the training dataset, the relationships between the SOM contents, remote sensing data, and environmental covariates are analyzed and the SOM prediction model is established.The same training and validation datasets were used for all models.We then used the RMSE and determination coefficient (R 2 ) to evaluate the accuracies of different models.

SOM mapping using synthetic images at different time intervals
The R 2 and RMSE changes in the SOM prediction accuracies for four different time-interval scenarios (10, 15, 20 d and 30 d) are shown in Tables 2-5.In all four scenarios, the SOM prediction accuracy reached the highest value around May.In the S1 scenario, DOY 130 to 140 had the highest accuracy, with an R 2 value of 0.445 and RMSE of 1.430%.In the S2 scenario, DOY 120 to 135 had the highest accuracy, with an R 2 of 0.467 and RMSE of 1.397%.In the S3 scenario, DOY 120 to 140 had the highest accuracy, with an R 2 of 0.511 and RMSE of 1.341%.In the S4 scenario, DOY 120 to 150 had the highest accuracy, with an R 2 of 0.478 and RMSE of 1.382%.In conclusion, the SOM prediction accuracies in different time intervals first increased and then decreased, and the highest SOM was obtained when using the 20-d time interval.

SOM prediction accuracy after increasing environmental covariates
The SOM prediction accuracies obtained by adding environmental covariates based on the best time period determined for different time interval scenarios are shown in Table 6.The SOM prediction accuracies for different scenarios were greatly improved.The highest SOM prediction accuracy was obtained from the 120 to 140-d synthetic images with a 20-d interval, R 2 of 0.674 and RMSE of 1.098%; the prediction accuracy of the 130 to 140-d synthetic image with a 10-d interval exhibited the largest improvement, 0.192 higher than that of the remote sensing image R 2 only, and the RMSE decreased by 0.279%.The SOM prediction accuracy was significantly improved by adding environmental covariates to the SOM prediction model.

Error analysis of SOM mapping under different scenarios
The absolute SOM prediction error was assessed using the best-time-period synthetic images for different time-interval scenarios.Figure 2 shows that the differences in the absolute SOM prediction error at different time intervals were mainly high values, and the prediction errors of the synthetic image at the 10-d time interval were relatively large in the high-value area (SOM > 6%).After adding environmental covariates, the absolute SOM prediction error decreased greatly for different SOM content intervals.

Spatial distribution of SOM in Northeast China
The scenario with the highest accuracy in the above analysis was used to map the spatial distribution of SOM in cultivated lands in the study area, and a 10-m-resolution spatial distribution map of SOM was obtained.The average SOM content in Northeast China was 3.42%.Figure 3 shows that the spatial distribution of SOM in Northeast China was higher in the northern region and lower in the southern region, similar to the soil type trend.The details of the 10-m spatial resolution SOM distribution map were relatively clear and could thus provide better agricultural precision management and carbon cycle model services.

Importance of selecting appropriate synthetic images with appropriate time intervals for SOM mapping
Different studies have used synthetic images obtained in different time intervals, including synthetic images spanning more than one year or single months in many years; it was found that the best option is to use synthetic images spanning multiple years or single months for mapping SOM (Luo et al. 2022b;Wang et al. 2022b).However, while monthly synthesis data are commonly used in related research, this time interval is not necessarily the optimal time period for synthesis (Chong et al. 2021;Griffiths, Nendel, and Hostert 2019;You et al. 2021).In studies related to crop mapping, images from the same year are usually used for synthesis, and the shorter the time interval is, the better the mapping effect.Soil is more stable than crops, and many studies have used synthetic images taken over many years to obtain more useful pixels (Mendes et al. 2021;Silvero et al. 2021).The shorter the time period of the synthetic image is, the closer the collection time of the images in the interval is, but the large-scale phenology, snowmelt time, and plowing time differ, leading synthetic images with different states in different regions, for example, in mid-April in Northeast China (Yang et al. 2022).At this time, the Sanjiang Plain in the northern area of the region has just completed the snowmelt period, and the soil moisture content is very high, while the soil on the Liaohe Plain in the south of Northeast China is relatively dry (Qi et al. 2021).For relatively long synthesis time intervals, the median value of the image set was used to obtain an image with the same relative state in different regions of Northeast China, but longer time intervals may introduce more redundant information, thus resulting in deviations in the optimal state of the synthetic image.We found that when different time intervals were used, the 20-day interval synthetic image had the best SOM prediction accuracy, possibly because the 20-day interval balances the above two problems.

Time window of SOM predictions in Northeast China
The results showed that, although the images are all taken in the bare-soil period, the SOM prediction performances using different bare-soil-period images were very different, and the SOM prediction accuracies obtained using synthetic images with 15-d, 20-d and 30-d time intervals all increased first and then decreased.Using the 10-d time interval synthetic images to predict SOM, the accuracy exhibited a fluctuating trend (Figure 4).The best time period for predicting SOM was always within May, mainly because in May, spring plowing has generally been completed in the cultivated lands in Northeast China, the straw coverage of the cultivated lands is low, and the high-soil-moisture period caused by snowmelt has ended (Wu et al. 2022).

Effects of environmental covariate data and remote sensing data on SOM predictions
Figure 5 shows that climate covariates and terrain covariates are of high importance in SOM predictions.Pedogenesis postulates that natural soil-forming factors and man-made soil-forming factors work together to affect soil physical and chemical properties, including climate, biology, parent material, topography, hydrology, time, cultivation, fertilization, and irrigation factors (Lin 2011;Minasny, McBratney, and Salvador-Blanes 2008).Environmental covariates can optimally represent natural soil formation factors, and our research also showed that environmental covariates were very important for predicting SOM (Table 6 and Figure 5).However, it is difficult to use environmental covariates to monitor anthropogenic soil-formation factors (Hengl et al. 2014).
Remote sensing image data can be used to directly observe the surface of bare soil, especially in Northeast China, where the bare-soil period is long (Dou et al. 2019;Luo et al. 2022c).The spatial resolution of the Sentinel-2 remote sensing images was better than those of the environmental covariates, and these two data types can be combined to obtain a SOM spatial distribution map with a relatively high spatial resolution (Drusch et al. 2012a).Therefore, remote sensing data and environmental covariate data have their own advantages in SOM predictions and combining environmental covariate and remote sensing data is a future direction of SOM prediction research.

Limitations and future opportunities
In this study, we evaluated the SOM prediction performance in Northeast China by synthesizing Sentinel-2 images over multiple years at different time intervals.When mapping soils in largescale areas, multitemporal image synthesis is a feasible means of obtaining stable images of large-scale areas, and this method is also used in many crop-mapping and wetland-mapping applications (Gorelick et al. 2017;Jin et al. 2019;Mahdianpari et al. 2019).However, in this study, we used only a single synthetic image to predict SOM.Some studies have demonstrated that using multitemporal remote sensing images can provide more useful information for SOM mapping (Shafizadeh-Moghadam et al. 2022).In addition, in this study, we used only Sentinel-2 images and did not implement multisource remote sensing data fusion to improve the SOM prediction accuracy.Temporal resolution, spatial resolution and spectral resolution are all very important for SOM mapping (Meng et al. 2022b).In future research, it is first necessary to consider extracting the temporal information of time-series images as the SOM prediction inputs to avoid random errors that may occur in single-phase images (Luo et al. 2022b).As more commercial satellites with high spatial resolutions are launched (Kulu 2021), it becomes possible to obtain meter-scale SOM maps, and highspatial-resolution images (Planet Labs, Jilin 1 commercial satellite) will play an increasingly important role in future SOM mapping research.In addition, the successful launch of hyperspectral satellites such as Gaofen 5 in China, HYSIS in India, DEsis jointly developed by the United States and Germany, PRISMA in Italy and HISUI in Japan has increased the potential of SOM mapping based on hyperspectral satellites (Qian 2021).Many laboratory studies have proven the advantages of hyperspectral images for SOM mapping (Bao et al. 2020;Wang et al. 2022a).Fully fusing the advantages of different satellite sensors is necessary to obtain higher-spatial-resolution, more accurate SOM maps (Meng et al. 2022a).In the future, accurate SOM mapping results can better serve the accurate management of regional agriculture and the construction of global carbon cycle models.

Conclusion
The results presented herein indicated that the 20-d synthetic image interval is a suitable time interval for mapping SOM, and the accuracy of the results obtained with an excessively long or short time interval was decreased.In addition, the best time period for predicting SOM was always within May.The consideration of environmental covariates can effectively improve the SOM mapping accuracy, and information regarding environmental covariates should be fully incorporated in future research.Based on the 10-m spatial resolution SOM map obtained in this study, researchers can improve the quality of agricultural ecosystem modeling, and government managers can rely on these high spatial resolution SOM spatial data.The distribution map obtained herein can be applied to optimize farmland protection and agricultural development strategies in different regions.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1 .
Figure 1.Distribution map of sampling points: (a) false-color composite; (b) soil type map; and (c) elevation map.

Figure 3 .
Figure 3. Spatial distribution of SOM in Northeast China.

Table 1 .
Synthetic images at different time intervals.

Table 2 .
SOM prediction accuracies at different time intervals with 10-d increments.

Table 3 .
SOM prediction accuracies at different time intervals with 15-d increments.

Table 4 .
SOM prediction accuracies at different time intervals with 20-d increments.

Table 5 .
SOM prediction accuracies at different time intervals with 30-d increments.

Table 6 .
SOM prediction accuracies after adding environmental covariates.