Mapping annual 10-m soybean cropland with spatiotemporal sample migration

China, as the world’s biggest soybean importer and fourth-largest producer, needs accurate mapping of its planting areas for global food supply stability. The challenge lies in gathering and collating ground survey data for different crops. We proposed a spatiotemporal migration method leveraging vegetation indices’ temporal characteristics. This method uses a feature space of six integrals from the crops’ phenological curves and a concavity-convexity index to distinguish soybean and non-soybean samples in cropland. Using a limited number of actual samples and our method, we extracted features from optical time-series images throughout the soybean growing season. The cloud and rain-affected data were supplemented with SAR data. We then used the random forest algorithm for classification. Consequently, we developed the 10-meter resolution ChinaSoybean10 maps for the ten primary soybean-producing provinces from 2019 to 2022. The map showed an overall accuracy of about 93%, aligning significantly with the statistical yearbook data, confirming its reliability. This research aids soybean growth monitoring, yield estimation, strategy development, resource management, and food scarcity mitigation, and promotes sustainable agriculture.

Liaoning (Fig. 2).These provinces are recognized as the top soybean-producing regions in China, collectively accounting for more than 80% of soybean production 3 .To effectively map the soybean annual planting area in China, we categorized them into three main regions: (a) Northeast China, encompassing Heilongjiang, Jilin, Liaoning, and the northeastern Inner Mongolia; (b) Huang-Huai-Hai Plain and the Middle-Lower Yangtze Plain, covering Shandong, Henan, Anhui, Jiangsu, and Hubei; and (c) Sichuan Basin.
Data.The Sentinel-2 satellite, with a spatial resolution of 10 meters and a revisiting period of 5 days, offers optimal support for the comprehensive and long-term identification of crops.Equipped with the Multispectral Instrument (MSI), Sentinel-2 can effectively even the slightest variations between different crops.For our study area, we acquired all available Sentinel-2A/B (S2) Level-2A surface reflectance (SR) data from 2019 to 2022 through the Google Earth Engine (GEE) platform.To enhance data quality, cloud masking and Savitzky-Golay (SG) filtering techniques were applied to the acquired data.
Two categories of spectral data were utilized to classify soybeans and other crop types: (1) reflectance from five spectral bands and (2) the values of nine spectral indices (refer to Table 1).The five bands selected for classification are red edge 1 (RE1), red edge 2 (RE2), red edge 3 (RE3), shortwave infrared 1 (SWIR1) and Fig. 1 Workflow for mapping soybean planting areas using the sample-generation and pixel-based algorithm.Sentinel-2 SR, sentinel-2 surface reflectance products in Google Earth Engine; Sentinel-1 SAR, a dualpolarization C-band Synthetic Aperture Radar data at 5.405 GHz; S-G filter, Savitzky-Golay filter; ESA, European Space Agency; VIs, Vegetation Indices; RMSE, root-mean-square error; MAE, mean absolute error; R2, R-squared; OA, overall accuracy; PA, producer's accuracy; UA, user's accuracy.
shortwave Infrared 2 (SWIR2).The red-edge bands are important indicator bands that reflect plant pigments and health status.The short-wave infrared bands can reflect changes in moisture and other biochemical components in crop leaves 42 .Previous research has confirmed the significant role they play in distinguishing between soybeans and corn 13,20,43 .Additionally, nine commonly employed spectral indexes were computed: Enhanced Vegetation Index (EVI) 44 , Green Chlorophyll Vegetation Index (GCVI) 45 , Land Surface Water Index (LSWI) 46 , Red Edge Position Index (REPI) 47 , Red Edge Normalized Difference Vegetation Index (RENDVI) 48 , Normalized Difference Phenology Index (NDPI) 49 , and Soil-Adjusted Vegetation Index (SAVI) 50 , Optimized Soil-Adjusted Vegetation Index (OSAVI) 51 , Transformed Chlorophyll Absorption in Reflectance Index (TCARI) 52 .The use of NDVI and EVI time series is widespread for extracting temporal characteristics and phenological indicators of various crops.LSWI can effectively differentiates and classifies rice due to the heightened responsiveness of corn and soybeans to leaf and soil moisture.RENDVI and REPI, which leverage the S2 Red Edge bands, are particularly suited for estimating canopy chlorophyll II and nitrogen content.OSAVI proficiently mirrors the dynamic growth of crops while simultaneously minimize the impact of background soil 51 .There exists a high correlation between crop OSAVI and their canopy chlorophyll content, which displays significant variations throughout the growth season of the crops.In crops with high chlorophyll content, such as soybeans, corn, and rice, changes in TCARI are comparatively slow.Therefore, TCARI/OSAVI demonstrates considerable sensitivity to flux in chlorophyll content 52 .Cash crops such as peanuts, cotton, potatoes, and sunflowers, potentially outside of soybeans, are derived using TCARI/OSAVI 25 .
The availability of suitable Sentinel-2 images was limited due to frequent cloud cover and rain during the soybean growing season.This presented challenges in creating the necessary time-series spectral features for classification.To address this issue, Sentinel-1 SAR (Synthetic Aperture Radar) data was utilized to establish the required time-series spectral features for classification.Sentinel-1 is equipped with a C-band synthetic aperture radar operating at a center frequency of 5.045 GHz.It provides four imaging modes: Stripmap, Interferometric Wide swath, Extra Wide swath, and Wave modes.Sentinel-1 offers dual-polarization SAR data (HH+HV, VV+VH) and has four product specifications: RAW Level-0, SLC (Single-Look Complex), GRD (Ground Range Detected), and OCN (Ocean).Due to data storage limitations, Sentinel-1 images in GEE are accessible in the GRD format, which lacks phase information.The data in GEE undergo several pre-processing steps, which     www.nature.com/scientificdatawww.nature.com/scientificdata/include: 1) removal of thermal noise, 2) radiometric calibration, 3) terrain correction using SRTM or ASTER DEM data, and 4) conversion of terrain-corrected backscattering coefficients to decibel values.Given the potential adverse effects of SAR active microwave imaging on image quality, this study applied Refine Lee filtering and straightforward incidence angle normalization into the processing of Sentinel-1 images.
The present study employs Sentinel-1 VV/VH dual-polarization imagery to distinguish five SAR parameters for extracting soybean features.These parameters comprise the backscattering ratio for VV and VH, denoted as σ VH 0 and σ VV 0 respectively, in addition to three combinations of polarization channels: the cross-polarization ratio ( / VH 0 VV 0 σ σ ), the cross-polarization sum (σ +σ VH 0 VV 0 ), and the Radar Vegetation Index (RVI).These parameters are itemized in Table 2.
Training and validation data.Ground survey samples, include those we collected from various provinces across different years, are denoted as yellow points in Fig. 2 and list in Table 3.The sample locations and crop types were recorded during fieldwork using mobile Geographic Information System (GIS) devices.Post-field surveys, we conducted a visual inspection of all ground samples utilizing high-resolution images from Google Earth and Sentinel-2 RGB composite images.Any samples displaying evident errors, such as the misclassification of natural vegetation as crops, were discarded.Samples located close to roads or field boundaries were also excluded.In addition, the sample data was enhanced by using existing data products 53 .
Sample generation and migration.This paper introduces a method for generating samples that employs existing samples to facilitate the spatiotemporal migration of soybean samples, even amidst constraints in sample sizes and temporal coverage (Fig. 3).The strategy used in this study for generating samples involves sifting out soybean and non-soybean specimens from randomly collected cropland samples.In our study area, the primary    www.nature.com/scientificdatawww.nature.com/scientificdata/crops grown encompass soybeans, corn, rice, wheat, and other staple crops, the planting area of which comprises up to 62% of the total cultivated land area 3 .Peanuts, rapeseed, cotton, potatoes, sunflowers, and other cash crops are also cultivated.The production of winter wheat and winter rapeseed in China accounts for more than 90% of the total wheat and rapeseed production, respectively 54 .As the growth periods of these two do not overlap with that of soybeans, they are not considered when filtering sample.To distinguish soybeans from the aforementioned non-soybean samples, our method of generating samples is categorized into three parts.Initially, the ESA WorldCover 55 is used to generate random cropland samples with unspecified crop types.Subsequently, based on the findings of Huang et al. 25 , we devised the Concave-Convexity Index (CCI), which segregates random crop samples into potential soybeans and non-soybeans based on the chlorophyll content change in the crop canopy.By charting the time series curves of band reflectivity and crop vegetation indexes, it is possible to discern accurate and reliable samples from potential ones, achieved through the analysis of the typical distribution of the area under the curve .a) Generate crop samples.Different forms of land cover, including tree cover, grassland, water bodies, and buildings, are often proximate to agricultural areas.These elements may influence the integrity of the crop sample, as illustrated in Fig. 4. In our approach, a hexagonal automated sampling technique is utilized to generate random crop points 56 , which involves scattering a substantial number of points randomly within a hexagonal grid and determine the point's classification by appraising the proportion of land cover categories within a 50-meter buffer.If a single land cover type comprises more than 90% of the area within this buffer, it is assigned as the type for that point.To verify the accuracy of samples for uncultivated area, we performed visual analysis using high-resolution Google Earth images and Sentinel-2 data from the corresponding year.b) Filter potential soybean samples.We initiated a preliminary filtering process of random crop points, aimed at identifying potential soybeans and non-soybeans.Crops' OSAVI is highly correlated with their canopy chlorophyll content, showing significant variations as the crops grow 51 .In crops with high chlorophyll content, such as soybeans, corn, and rice, TCARI changes relatively gradually 52 .As a result, the TCARI/OSAVI trend typically displays a concave pattern in these crops 25 .Conversely, crops with low chlorophyll content, such as peanuts, cotton, potatoes and sunflowers exhibit an inverse pattern, marked by noteworthy alterations in TCARI and culminating in a convex temporal curve, as illustrated in Fig. 5.This study leverages the TCARI/OSAVI association to evaluate the concavity or convexity of the temporal growth curves of crops by establishing the CCI for the start of growing season (SOS), peak of growing season (POS), and end of growing season (EOS) 25 .Points exhibiting a CCI ≥ 0 are marked as potential non-soybean points, while those with CCI < 0 are earmarked as possible soybean points.These points are subsequently employed in the sample selection process.The dates when the EVI temporal curve attains its peak are deemed as the POS of the crop, while the SOS and EOS are calculated using the median method 57 .
The formulation for CCI calculation (Eq. 1) is provided below:

POS S OS EOS
c) Confirm soybean samples.Several time series curves of band reflectivity and vegetation indexes, accurately exhibiting the growth characteristics of soybeans, as presented in Fig. 6.The curve integration of a specific parameter represents its cumulative value throughout the entire crop growing season.Soybeans, during their peak growth phase, manifest increased dryness and greener foliage, distinguishing them from other crops 27 .Consequently, soybeans tend to exhibit elevated values for EVI and SWIR2, while their LSWI values are comparatively lower.During the peak growing season, the REPI and RENDVI for corn distinctly surpass those of soybeans 20 .Conversely, soybeans demonstrate a relatively high red edge reflectance.Accordingly, we formulated two parameter sets for soybean samples screening, which includes a high-value group (EVI, RE2, and SWIR2; see Fig. 6a-c), and a low-value group (LSWI, RENDVI, and REPI; see in Fig. 6d,e).As a result, soybeans can be clearly distinguished from other crops through curve integration.The integration limits align with the crop's SOS and EOS.www.nature.com/scientificdatawww.nature.com/scientificdata/The band reflectivity or vegetation index of a certain crop in the peak growing season follows a one-dimensional Gaussian distribution 32 .The integral values of the aforementioned two sets of time series curves of soybeans are assumed to follow a multivariate Gaussian distribution that is connected to their dimensions.The probability density function (Eq.2) is given by: Where x is the integral vector of time series curve from ether the high-value or low-value group, μ is the mean vector of the verified soybean points, and Σ is the covariance matrix.To quantify the resemblance between random crop points and verified soybean points, we employed the Mahalanobis distance measurement.This computation is performed between the randomly selected points and a multivariate Gaussian distribution, which is composed of verified soybean points.The Mahalanobis distance gauges the deviation between data points and distributions, factoring in the correlation among different dimensions.This method mitigates the effects of varying dimensions and variances, thereby yielding a more accurate representation of the correlation between two sets of data.The Mahalanobis distance (Eq. 3) for a multivariate vector, with a mean of μ and a covariance matrix Σ is defined as follows: depicts the scatter distributions of soybeans, corn, and rice in three-dimensional spaces for high-value and low-value groups.Three crops clearly form distinct groups.The soybean samples are clustered in the upper-right and lower-left corners of two feature spaces.This highlights the excellent ability of the area under the curve employed in this study to distinguish soybeans from other crops.In the previous sections, we computed two sets of Mahalanobis distances between filtered points and points representing soybeans from ground survey.We hypothesize that shorter distances are indicative of soybeans, therefore emphasising the need to determine the categorization threshold.The threshold is determined by analysing the multivariate Gaussian distributions using soybean samples as the basis.Points with a high probability density are indicative of the most salient features of soybeans, and these points are concentrated at the centre of the distribution.To identify these robust soybean points, we employed the Monte Carlo method to calculate the probability density p 50 at which the cumulative probability of the multivariate Gaussian distribution (which is symmetric around the centre) hits 50%.The sought-after points are those soybean points with a probability density higher than p 50 .The formula (Eqs.4, 5) for p 50 is given as: 50 Subsequently, we computed the Mahalanobis distances from robust soybean points to the multivariate Gaussian distributions of high-value group and low-value group.The 90th percentile of these distances was selected as the threshold for confirming reliable soybean points from potential soybean, namely Where D High and D Low denote the Mahalanobis distances between arbitrary points and the high-value and low-value classes of soybean ground surveys, respectively.Figure 7 depicts the ellipsoidal clustering pattern of soybean points within a three-dimensional feature space.This study shows that the robust soybean points extracted through probability density occupy the central region of the ellipsoid.The percentiles of Mahalanobis distance from these points to the distribution guarantee that the filtered soybean points are also located within the ellipsoid, thereby assuring their accuracy and reliability.It is noteworthy that the computation of Mahalanobis distance involves dimension independence and standardization, which convert the ellipsoid into a sphere in the Mahalanobis distance space, thereby expanding the precision of the point filtering process.
In terms of migration strategy, we categorized the strategy into three types based on the spatiotemporal relationship between ground survey samples and migrated ones.These categories include: (1) Migrating samples from different regions in the same year to the target area (spatial migration); (2) Migrating samples from different years in the same region to the target year (temporal migration); (3) Migrating samples from different years and regions to the target year in the intended region (spatiotemporal migration).We give priority to using ground survey samples from the same region for migration (temporal migration), because soybean in the same area have relatively little inter-annual changes in planting habits, varieties, and soil texture 58 .Therefore, we think the temporal heterogeneity of soybeans in spectral parameters is smaller than the spatial heterogeneity.If there are no ground surveys in a certain area, samples from similar climate zones will be used for migration, with priority given to those from the same years, then adjacent years.Figure 8 describes the provinces and years to which the ground survey samples used for sample migration belong.
To evaluate the effectiveness of sample generation, we employed the GLAD maize and soybean map (GLAD) 13 to assess the accuracy of the generated soybean and non-soybean samples.Since the timeline of GLAD is constrained to 2019, we performed experiments and computed accuracies only for the sample points generated in 2019, implying all target years are set as 2019.
Features selection and classification.We utilized soybean phenological characteristics and spectral indexes to differentiate between soybean and non-soybean crops, as these are efficient in capturing the seasonal fluctuations in surface spectra.To compile these indicators, we annually selected data from April 1 st to November 15 th , considering the crop calendars of various regions.Table 4 presents the candidate features we employed.Statistical features of five reflectance bands -RE1-3, SWIR1, and SWIR2 -were analysed during the growing season (DOY: 90-318).These encompass the minimum, maximum, and standard deviation, as well as the 15th, 50th, and 90th percentiles.Phenological parameters obtained from the EVI time series, harmonic fitting parameters 20,59 , and accumulative biomass attributes 60,61 are also taken into account.We utilized harmonic fitting (discrete Fourier transform, Eq. 8) analysis to the original effective observational data to extract time-series curves, as demonstrated in the following formula: Where f(t) represents fitted vegetation index value at the time instance t.The constant term is represented as a, while b corresponds to the coefficient of the first-order term.M signifies the quantity of harmonic components, www.nature.com/scientificdatawww.nature.com/scientificdata/and C and D stand for the coefficients of cosine and sine functions, respectively.The variable ω is the reciprocal of the number of days in a year (1/365), t represents a specific day within a year as denoted by the DOY, and e corresponds to the residual value.For temporal feature extraction, phase and amplitude are utilized with amplitude defined as the magnitude of a two-dimensional vector [C M ,D M ], and phase as the angle of the same two-dimensional vector [C M ,D M ].
In assessing biomass using EVI, the EVI is systematically calculated for every time series data point throughout the growth season.The subsequent step involves aggregating the EVI values over unique time intervals to derive the cumulative biomass features, a crucial factor for soybean classification.To accommodate potential inconsistencies stemming from diverse pixel observation frequencies in regions with incomplete image data, a trilinear interpolation approach is incorporated in this study.This method effectively corrects missing data points, thereby ensuring a uniform computation of cumulative biomass features.The formula for trilinear interpolation (Eq.9) is provided below: Within this context, y 0 denotes the sought-after interpolation outcome, with y 2 denoting known function values.Here, x 0 designates the point slated for interpolation, whereas x 1 , x 2 , and x 3 stand as the abscissas for three established reference points.The notation f[x 1 ,x 2 ,x 3 ] represents the third-order divided difference calculated at positions x 1 , x 2 , and x 3 .
In addressing the five SAR feature parameters (Table 5), we harnessed SAR image time series to extract pivotal phenological characteristics of crops.This process comprised both statistical features and principal component features.The statistical features mirrored the approach adopt for optical data, incorporating the maximum, minimum, and variance, along with the 15th, 50th, and 90th percentiles of the five SAR parameters.These statistical attributes are instrumental in conveying the average levels and temporal fluctuations within time series curves for diverse crops.Furthermore, we carried out a Principal Component Analysis (PCA) on the Sentinel-1 image time series in the temporal domain, selecting the initial three principal components as the principal component features of SAR data 62 .
Considering the 'Hughes' Phenomenon, the current number of features is copious.Consequently, in each classification process, we incorporate feature selection, choosing to maintain the top 50% of features.This decision is influenced by the ranking provided by the random forest model for the final classification.
We employed local random forest classifiers for soybean planting areas identification in each province.This non-parametric machine learning classifier exhibits a higher error tolerance compared to certain parametric classifiers and has been extensively utilized in classification and recognition research.In terms of dividing training set and testing set, half of the samples within each province were randomly selected for training the classifier and mapping soybean cropland, while the remainder were utilized for validation.In this study, we implement the random forest classification model within the GEE.On the GEE platform, we vary the number of decision trees from 50 to 500 at 50-unit intervals.The chosen number of decision trees as the parameter for ensuring classification is the one that surpasses 100 and achieves the initial local maximum in classification accuracy.To counteract minor result variations in each experimental repetition due to the inherent randomness in random forest sampling, we set a random seed of 999.All additional parameters are left at their default values.
To assess the precision of soybean distribution mapping, we take two approaches: (1) on-site validation through the collection of ground truth samples, which involves conducting ground surveys and generating samples, and (2) comparing the results with agricultural statistical data obtained from administrative units.www.nature.com/scientificdatawww.nature.com/scientificdata/Confusion matrices were generated using both soybean samples and non-soybean samples for each provincial soybean map.These matrices were employed to calculate the producer's accuracy (PA), user's accuracy (UA) and F1-score (F1) for soybean samples (Eqs.10-12), assessing the precision of the approaches.The overall success of this strategy was assessed by calculating the overall accuracy (OA).The Kappa coefficient was used to assess the level of agreement between the classification results and sample labels.In addition, we assessed the soybean planting area identified in this study by comparing it to agricultural statistical data at the provincial and prefectural levels.This comparison was done using the coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE).

PA TP TP FP
(10) Post-processing.For large-scale and high-resolution crop mapping, the speckle noise is inevitable, and the same goes for soybean mapping 63 .Errors may arise during sensor imaging, soybean sample generation, image preprocessing, and feature classification, etc., resulting in soybean patches composed of just one or two pixels in the mapping results.In most cases, they are considered speckle noise and should be eliminated.We performed post-processing on the results using eight-neighborhood majority filtering.This processing can filter independent, unconnected soybean pixels, and non-soybean pixels in soybean plots will also be filled, making the mapping results more accurate and reasonable.

Data Records
Between 2019 and 2022, we generated four soybean cropland maps encompassing China's key soybean-producing regions, all at a 10-meter spatial resolution (ChinaSoybean10).The datasets, formatted to Geotiff, are available for access at the Zenodo repository (https://doi.org/10.5281/zenodo.10068402) 64.Structured under the ESPG: 4326 (WGS_1984) spatial reference system, the maps incorporate only one values: 1 to denote soybean planting areas, and null value to indicate non-soybean planting areas (inclusive of other landcover).These maps can be scrutinized and visualized using software such as ArcGIS, QGIS, or their alternatives.

Technical Validation
Precision Assessment of Sample Spatiotemporal Migration.Employing GLAD maize and soybean map 13 , we evaluated the sample generation accuracy for both soybeans and non-soybeans.GLAD maize and soybean map (https://glad.earthengine.app/view/china-crop-map) is a 2019 national maize and soybean map produced using field survey samples and binary random forest, in which the R 2 between soybean mapping area and statistical yearbook area can reach 0.93.It is considered to be a reliable reference for accuracy validation.
Using the actual samples from 2019 to 2021, we generated soybean and non-soybean samples for different regions in 2019 via three different methods.Subsequently, we calculated the sample generation accuracy for each region, as delineated in Fig. 9. Broadly speaking, with the exception of Liaoning and Jiangsu, the generation accuracy exceeds 80% for soybean samples and 95% for non-soybean samples, indicating the efficacy of the method used.www.nature.com/scientificdatawww.nature.com/scientificdata/Among the three-generation methods applied for soybean samples-temporal migration, spatial migration, and spatiotemporal migration-the average accuracies were 87.32%, 86.49%, and 83.44%, respectively.Temporal migration within the same region proved to be superior, followed by spatial migration.The least accuracy occurs in spatiotemporal migration.We postulated that minor annual variations in climatic factors, such as temperature and precipitation, contribute less to negative effects on sample migration compared to spatial heterogeneity due to regional differences.Among all provinces, Heilongjiang demonstrated the best results with an average accuracy of 92.72%.Inner Mongolia, Anhui, and Henan, three major soybean-producing provinces, also reached approximately 90%.In these provinces, soybeans are extensively cultivated, leading to relatively continuous and dense area, resulting in high accuracy.In contrast, Liaoning and Jiangsu, characterized by complex planting structures and fragmented soybean planting areas, had an accuracy below 80%.This lower accuracy can be attributed to these agricultural complexities and the adverse impact of rainfall on image quality.Conversely, the generation of non-soybean samples illustrated commendable accuracy and robustness across various regions.In summary, our sample generation method demonstrated exceptional proficiency across diverse regions, delivering commendable outcomes in both temporal and spatial sample generation.

Soybean map and accuracy assessment. By harnessing Sentinel-2 remote sensing imagery, selected
Sentinel-1 SAR image data, ground surveys and generated samples, we mapped soybean planting areas for ten provinces nationwide (Fig. 10), denoted as ChinaSoybean10.We conducted accuracy assessments of the mapping results using both ground surveys and generated samples.The result indicates that in the northeastern region, the average overall accuracy for soybean planting areas mapping was 93.70%, with a prevailing Kappa coefficient of 0.8624.In crucial soybean cultivation areas of the Huang-Huai-Hai Plain, the middle-lower reaches of Yangtze River Plain, and Sichuan, the average overall mapping accuracy was 93.16%, accompanied by a Kappa coefficient of 0.7980 (Table 6).Moreover, we calculated both the producer's accuracy and the user's accuracy for each province.In the prominent soybean planting areas of the Northeast, the average producer's accuracy, user's accuracy, and F1-score were 92.23%, 88.70%, and 90.06%, respectively.For the Huang-Huai-Hai Plain, the Middle-Lower Yangtze Plain, and Sichuan, the average of these indicators were 80.15%, 89.59%, and 0.8434, respectively.www.nature.com/scientificdatawww.nature.com/scientificdata/We compared the mapped soybean cropland areas of various prefecture-level cities with the officially reported planting areas, and quantitatively analyzed the accuracy of our soybean map by calculating R-squared (R 2 ), root-mean-square-error (RMSE), and mean-absolute-error (MAE).The results demonstrate a high level of consistency between our annual soybean maps and official statistical data (R 2 > 0.85), with values of 0.91, 0.92, 0.87, and 0.88 for the years 2019 to 2022, respectively (Fig. 11).R 2 for 2019 and 2020 were both above 0.9, while there was a slight decrease in 2021 and 2022, likely due to the increased use of generated soybean samples in those years.In Fig. 11, we plotted the 1:1 line, and some prefecture-level cities with higher soybean production, such as Heihe, Qiqihar, and Hulunbuir, tended to cluster around this line.However, certain cities in the Huang-Huai-Hai and Yangtze River regions exhibited slight discrepancies compared to the statistical yearbook, possibly due to the more complex planting structures and the presence of numerous smallholders 65,66 , resulting in significant field-level heterogeneity.Nevertheless, our method consistently produces highly reliable estimates of planting area.

Visual comparison with other products and methods.
Compared with some existing soybean distribution products, ChinaSoybean10 has wider spatial and temporal coverage.It also merges multi-source remote sensing data to achieve superior classification accuracy.We rank ChinaSoybean10 alongside GLAD  maize-soybean map 13 , CDL (Crop Data Layer) 20 and soybean map produced by GWCCI 27 .Among them, GLAD covers the soybean planting areas across the country in 2019, CDL covers the Northeast region, and GWCCI uses a common threshold of 0.17 for soybean mapping in all regions across the country.As a demonstration reference in 2019, we selected examples from Heilongjiang (Fig. 12a,b), and Anhui (Fig. 12c) to illustrate the comparison between our soybean mapping results and those of state-of-the-art methods.For the first example (Fig. 12a), our soybean mapping results are in good agreement with CDL and GLAD, reflecting the accuracy of our results.
In the second example (Fig. 12b), our results are very similar to GLAD, while CDL has redundant soybean recognition results in the red-boxed area.The third example is in Anhui Province (Fig. 12c).CDL does not cover this area, so the GWCCI generated results are compared with our results.The soybean mapping effectiveness of GWCCI relies heavily on image quality and threshold selection.It can be found that the overall result contains more salt and pepper noise, as well as misclassified pixels in the red-boxed area (Fig. 12c3), which do not exist in our results and GLAD.The above comparison verifies the accuracy of our results.
Advantages of the sample migration method.The paper presents a soybean sample migration method, optimizing the acquisition of crop samples for soybean area mapping.The method is less financially demanding, automated, and provides accurate soybean and non-soybean samples using minimal pre-existing ones.Figure 13 displays samples taken in 10 provinces during 2019.The spatiotemporal migration is based on the disparity in crop's band reflectivity and vegetation indexes.Using the integration of this curve during the growing season improves migration accuracy.The method's potential lies in reducing the spatiotemporal heterogeneity of soybean phenology by using automatically determined growth season intervals.Currently, there has been considerable research on no sample crop mapping based on crop knowledge and rule thresholds 25,27,32 .However, the effectiveness of these methods is severely hampered by threshold selection and are susceptible to clouds and fog, thus posing difficulties for large-scale crop mapping.The method presented in this paper selects random crop points based on the feature distribution of soybean samples, automatically determining filtering thresholds through the distribution characteristics of ground survey samples, thus enhancing its generality.Additionally, the mapping strategy in this paper combines generated samples with supervised classification.The thresholds mainly restrict the quality and quantity of samples, rather than directly affecting the mapping results.Therefore, threshold calculation parameters can be more "extreme" to obtain a purer set of soybean samples.www.nature.com/scientificdatawww.nature.com/scientificdata/Compared to the sample migration method based on DTW distance developed by Zhang et al. 33 , our method eliminates the need to obtain samples of other major crops in the target area, hence offering a more versatile and convenient solution.Some researchers use crop samples generated by existing products for mapping 13,56 .Although this method is convenient and efficient, it is subject to considerable limitations in terms of both time and space.
In conclusion, the soybean sample migration methodology elucidated in this paper adeptly and efficiently procures both soybean and non-soybean samples for regions devoid of such samples.This significantly aids in the creation of comprehensive crop mapping products and offers myriad possibilities for crop mapping.

Usage Notes
China is the fourth largest soybean producer and the largest soybean importer in the world, and its soybean consumption relies heavily on imports.Mapping the distribution of soybean growing areas at the national scale is critical for food and energy security in the context of growing population and consumption.In this paper, we collected soybean field survey samples for many years and proposed a sample spatiotemporal migration method based on the temporal characteristics of vegetation index.Using field survey and generated samples, we create national 10 m soybean maps in China from 2019 to 2022.Through experiments, we found that the areas calculated by our soybean maps area consistent highly with the official statistical area at the prefecture-level.Therefore, our soybean mapping results can be used to support large-scale soybean yield estimates and quantitative analyzes of multi-year soybean-cultivated area changes.Furthermore, our datasets can serve as a reference and support uncertainty analysis for comparable products.
Uncertainty.Despite our stringent data processing measures, certain sources of uncertainty remain inherent.Though we employed time-series Sentinel-2 data for soybean planting area mapping, the length of the soybean growing season is geographically variable.The 5-day revisit cycle might not consistently yield complete time-series spectral curves due to obstacles such as cloudy conditions and rainfall.While we successfully integrated Sentinel-1 data in certain regions and years, completely eradicating the speckle noise remains a complex task.Further, the widespread practice of soybean intercropping with crops such as corn and sorghum presents a substantial challenge in accurately mapping soybean's spatial distribution within the Huang-Huai-Hai and Yangtze River regions.The potential mixed pixel effect arising from the 10-meter spatial resolution of Sentinel-2 data inevitably weakens the identification signal of specific crops, introducing uncertainty.Lastly, despite the validation of our sample generation method using ground-truth samples and existing products by achieving approximately 90% sample accuracy, potential deviations may still affect our results.Additionally, our methodology still relies upon terrestrial survey soybean samples.Ensuring minor phenological differences between these field survey samples, and generated samples is critical to the accuracy of sample generation.Looking forward, leveraging crop-specific maps, and highly resolved remote sensing products could offer solutions for the mixed pixel issue and enhance sample generation methods for optimal differentiation between soybean and non-soybean areas.Consequently, this could simplify the soybean extraction process.

Fig. 2
Fig. 2 Location of the major soybean-producing region in (a) Northeast China, (b) Huang-Huai-Hai Plain and the Middle-Lower Yangtze Plain, and (c) Sichuan Basin.

Fig. 3
Fig. 3 Sample generation process.SOS, start of growing season; POS, peak of growing season; EOS, end of growing season; CCI, Concave-Convexity Index.

Fig. 4
Fig. 4 Grid-based random sample point generation with ESA WorldCover on the base map, green checkmarks indicate that a sample point of that feature type is retained, red fork markers indicate that the sample is discarded, and blue star markers indicate that cropland samples are retained for subsequent sample generation.

Fig. 6
Fig. 6 Time-series vegetation index curves for soybeans and other major crops.

Fig. 8
Fig. 8 Ground survey samples used for samples migration in each province and year.

Fig. 12
Fig. 12 Comparison with Existing Methods and Results, the first column is a median-synthesized RGB image from Sentinel-2 after cloud removal (from DOY 200 to DOY 240); the second column represents ChinaSoybean10; The second column shows the results of CDL and the results extracted by GWCCI, where (c1-c2) are from CDL and (c3) is the extraction result of GWCCI.; the fourth column is the soybean map of GLAD, with (a-c) indicating different regions: (a,b) 2019 soybean map in Heilongjiang province; (c) 2019 soybean map in Anhui province.

Table 2 .
The five SAR-based features used in the study.

Table 3 .
The number of soybean samples collected by ground survey in different provinces and years.

Table 4 .
Summary of Optical Feature Parameters and Feature Extraction Methods.

Table 5 .
Summary of SAR Feature Parameters and Feature Extraction Methods.

Table 6 .
Soybean Classification Mapping Overall Accuracy, Kappa, PA & UA.*Includes ground survey samples and generated samples, and the rest only have generated samples; HHH-MLY represents the Huang-Huai-Hai Plain and the Middle-Lower Yangtze Plain.