Background & Summary

Planted forests are forest ecosystems established by artificial tree planting or seeding for the provision of income and goods, as well as for climate change mitigation and the restoration of ecosystem services and processes1,2. According to the Food and Agriculture Organization of the United Nations (FAO)2, planted forests globally increased by 41,000 km2 per year between 2000 and 2020 and currently amount to approximately 2,930,000 km2. Today, FAO estimates that 36% of the world’s planted forests are distributed in East Asian countries, namely China, Japan, the Republic of Korea (ROK), and the Democratic People’s Republic of Korea (DPRK)2. In East Asia, a large proportion of forest area is planted forests (39% in China, 41% in Japan, 36% in ROK, and 16% in DPRK in 2020, according to FAO2), while other regions in the world remain below 20% (19% in Africa, 7% in Europe, and 9% in the United States). Unlike Western countries, where planting was traditionally conducted for silvicultural practices, East Asian countries have planted trees for varying purposes, using local species and reflecting each country’s unique history3,4,5,6,7,8,9,10,11.

East Asian countries have implemented a variety of tree-planting policies at different spatial and temporal scales. China leads all countries worldwide with the largest estimated plantation area of about 840,000 km2. Since the end of the 1970s, China has established several afforestation projects, including the Three-North Forest Shelterbelt Program4, the Natural Forest Conservation Program (also known as Natural Forest Protection Program), and the Grain to Green Program (GGP; also known as the Sloping Land Conversion Program)5,6. Currently, China has committed to preserving and expanding forest cover, aiming at mitigating soil erosion, air pollution, and climate change in the coming decades7. Although hundreds of tree species have been used for plantation establishment in China, a few species dominate the planted forests across the country, such as Chinese fir (Cunninghamia lanceolata) and eucalyptus (Eucalyptus spp.)8. In Japan, most planted forests were established after World War II to meet the growing demand for timber and other wood products. Thus, fast-growing and highly productive species, such as Japanese cedar (Cryptomeria japonica) and Hinoki cypress (Chamaecyparis obtusa), were extensively planted9. ROK underwent severe deforestation and forest degradation during World War II and the Korean War (1950–1953), followed by active conversion of forests to agricultural lands due to post-war poverty10. In response, the government implemented five National Forest Development Plans from 1973 to 2017. A variety of fast-growing species were planted during this period, and the successful recovery of healthy forests and ROK’s sustainable management strategies are internationally recognized10,11.

With active tree planting being implemented throughout the world for climate change mitigation, forest restoration, and biological conservation, it has become urgent to establish cost-effective guidelines for all ongoing and upcoming tree-planting projects. Assessment of the costs and benefits of planted forests, the key to the development of such cost-effective guidelines, is contingent on knowing where the existing planted forests are distributed12,13,14,15,16 and which tree species are planted17. The geospatial distribution of planted forests in East Asia remains unclear due to a scarcity of complete, transparent, and publicly accessible data records. National governments have published some planted forest maps based on site visits, forest inventory, and satellite data. Yet, the spatial coverage is incomplete for Japan12, and the map produced by the Chinese Forest Inventory remains unverified and largely inconsistent with independent studies13,14. The existing large-scale maps of planted forests are based on inconsistent data sources with varying reliability and scale13 or solely based on satellite images14. Because existing datasets differ in spatial extent, underlying data sources, and methods, a consistent and harmonized database is needed that provides complete, ground-truth-based records of the geographic distribution of planted forests and their dominant tree species across East Asia.

Here, we produced a spatial database of planted forests in East Asia at a 1-km resolution and identified the dominant tree species in these planted forests to the genus level. Our planted forest map encompasses forests of all ages planted for various purposes, including forest restoration, commercial plantation, and disaster prevention. These mapping products are based on ensemble machine learning models, data fusion, and multi-source data of planted forests. Our multi-source data comprised ~7,000 ground-truth inventory plots in China, five independent digitized maps across the study region, as well as 57 auxiliary datasets and layers, including satellite data such as the Global Ecosystem Dynamics Investigation (GEDI)18 and Moderate Resolution Imaging Spectroradiometer (MODIS) data to account for potential differences in forest structure and vegetation characteristics between planted and natural forests. In addition to the main products, we also estimated the upper and lower bounds of potential planted forest extent to account for the uncertainty associated with the varied quality of multi-source training data. With previous records of planted forests being inconsistent in resolution, quality, and accessibility, our map provides a complete, consistent, and in situ data-based estimation of the extent and species distribution of planted forests in East Asia.

Methods

To estimate the spatial distribution of planted forests over East Asia, we integrated multi-source planted-natural forest data from multiple in situ inventories and digitized data sources in a high-level data fusion algorithm (Fig. 1). For each observation, we first created a response variable explicitly labeled as either “planted” or “natural” forests. We then obtained data on 57 potential predictor variables encompassing forest structure, vegetation characteristics, bioclimate, topography, anthropogenic information, and soil characteristics, and merged these layers with the response variable layer based on spatial coordinates. The training dataset was then masked to the forested area in 2020 and separated into three biomes based on the Nature Conservancy Terrestrial Ecoregions map19. For each biome, we selected the optimal machine learning classification model and fine-tuned hyperparameters. Finally, we mapped planted forest distribution and the distribution of the dominant tree species in these forests to the genus level. Our study area covers China, Japan, ROK, and DPRK.

Fig. 1

Workflow for developing the spatial database of planted forests. The top section (yellow) represents the data fusion algorithm we used to integrate multi-source data into coherent training datasets. The bottom section (green) represents the ensemble model we developed to predict the spatial patterns of planted forests.

Data fusion

We collected and integrated in situ and digitized planted-natural forest data from multiple independent sources using a high-level data fusion algorithm (Fig. 1). Observations from China came from published literature20–265 (Fig. 2a), which included 2,542 and 4,394 in situ records of confirmed locations of planted and natural forests, respectively. The in situ planted forest observations include plantations of commercial species, such as pine (Pinus spp.) and eucalyptus (Eucalyptus spp.), and forests planted for restoration purposes. We also obtained the national planted forest map of China (Fig. 2b)15, which depicts the distribution of planted forests in 2000. Data specific to Japan was obtained from the national vegetation map created based on site visits and satellite images, where “planted forest” was one of the attributes of vegetation types (Fig. 2b)12. This “planted forest” attribute includes restoration-oriented forests composed of broadleaf species, commercial forests dominated by productive species like Japanese cedar (Cryptomeria japonica) and Hinoki cypress (Chamaecyparis obtusa), and disaster prevention plantings, such as Japanese black pine (Pinus thunbergii) planted against coastal erosion and tropical species (e.g., Acacia confusa) planted as windbreaks. The national vegetation map has been gradually developed and improved since 2005. Finally, data specific to ROK consisted of a polygon map of planted and natural forests from the national forest cover map (Fig. 2b)16. The ROK maps depict the distribution of planted and natural forests from 2009 to 2013, depending on the province. In addition to the country-specific data, we obtained the Spatial Database of Planted Trees covering China, Japan, and ROK (SDPT version 1.0; Fig. 2c)13 and a global extent of planted trees 201514, which includes the land use classes of planted forest, woody plantations, and agroforestry of the global forest management map266 (Fig. 2d). No data specific to DPRK was used in this study because none was available.

Fig. 2

Training data consists of a series of in situ and digital maps of planted-natural forest data from multiple independent sources. (a) The in situ data in China encompass 2,542 and 4,394 ground observations20–265, which represent locations of planted and natural forests, respectively, confirmed in previously published articles. (b) National maps of planted forests were obtained for China15, Japan12, and ROK16. (c) The Spatial Database of Planted Trees (SDPT version 1.0)13. (d) An estimated Global Extent of Planted Trees 201514. (e) Distribution of the three biomes in our study area. We developed a machine learning classification model for each biome to predict planted forests. Note that forests are distributed in the Temperate Grassland according to the FAO’s definition of forest (≥5 m tree height)2,270, although the area is limited. (f) Distribution of planted forests was estimated mainly for China, DPRK, and small areas in Japan. For the ROK and the majority of Japan, the national planted forest maps12,16 (b) were used as a final label.

To prepare a training dataset for the machine learning classification models, we created a 0.009° by 0.009° grid (approximately 1 km2 per cell) for the study region in East Asia. National planted forest maps of China15, Japan12, and ROK16, as well as SDPT13 and the Global Extent of Planted Trees14, were extracted to the centroid of each grid cell using the “sf” or “raster” packages in R267,268. China’s in situ observations were associated with each grid cell by taking the majority vote of in situ points within the cell to determine whether that cell is a planted or natural forest. Grid cells with a 50/50 vote were removed from the training dataset. We then derived the response variable, a label of “planted” or “natural” forest, from these underlying datasets following the quality-oriented data integration (QODI) approach.
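The majority-vote step for the in situ points can be illustrated with a minimal sketch. The study itself used R; the plain Python below is only an illustration, and the function names (`cell_index`, `majority_labels`) and the floor-based cell indexing are assumptions, not the authors' implementation.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.009  # grid resolution stated in the text


def cell_index(lon, lat, cell=CELL_SIZE_DEG):
    """Return the (column, row) index of the grid cell containing a point.

    Assumption: cells are aligned to the origin (0, 0) of the lon/lat grid.
    """
    return (math.floor(lon / cell), math.floor(lat / cell))


def majority_labels(points):
    """Assign each grid cell the majority label of its in situ points.

    points: iterable of (lon, lat, label), label being 'planted' or 'natural'.
    Returns {cell_index: label}; cells with a tied (50/50) vote are dropped.
    """
    votes = {}
    for lon, lat, label in points:
        votes.setdefault(cell_index(lon, lat), Counter())[label] += 1

    labels = {}
    for cell, counts in votes.items():
        planted, natural = counts["planted"], counts["natural"]
        if planted == natural:
            continue  # tied vote: excluded from the training dataset
        labels[cell] = "planted" if planted > natural else "natural"
    return labels


# Hypothetical example: three points fall in one cell, two in another.
example_points = [
    (110.0005, 30.0005, "planted"),
    (110.0010, 30.0010, "planted"),
    (110.0015, 30.0008, "natural"),
    (110.0200, 30.0005, "planted"),
    (110.0205, 30.0010, "natural"),
]
example_labels = majority_labels(example_points)
```

In this made-up example the first cell is labeled planted (two votes to one) and the second cell is discarded because its vote is tied.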

Quality-oriented data integration (QODI)

Since the underlying datasets differed in data sources and estimation methods, we developed a quality-oriented data integration approach in which the response variable was defined in three different levels of integration (Fig. 3). For each level of integration, we trained a separate set of machine learning models, so that we can quantify the potential range in estimated planted forest areas.

Fig. 3

The response variable (“planted” or “natural” forest) was defined in a quality-oriented data integration approach based on multiple underlying data sources. Underlying datasets a-d correspond to Fig. 2a–d. Upper and lower bound models represent the most liberal and conservative approaches in labeling planted forest, respectively. The grey area was removed from the respective training dataset. All areas outside of the Venn diagrams were labeled natural forest. DPRK is not included in this figure due to the absence of training data associated with the country.

The first level of integration took the most conservative approach in deriving the lower bound of our estimation. Since China’s in situ observations20–265, Japan’s national vegetation map12, and ROK’s national planted forest map16 were largely based on in situ observations, we labeled a unit forest area (i.e., grid cell) as planted if and only if the grid cell was identified as a planted forest by any of these in situ-based datasets or by at least three other datasets.

The second level of integration took a midway approach in which, in addition to planted forests identified in the first level of integration, a given grid cell was also labeled as a planted forest if two of the three remaining datasets (the national planted forest map of China15, SDPT13, and the Global Extent of Planted Trees14) agreed on that label.

The third level of integration took the most liberal approach in deriving the upper bound of our estimation, in which we assumed all underlying data sources were equally reliable and labeled a given grid cell as planted forest if it was identified as a planted forest by any of these datasets.
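The three integration levels can be written as a small labeling rule. The sketch below is a hedged reading of the text in plain Python (the study used R): the function name `qodi_labels` and its arguments are hypothetical, and it does not model the cells that Fig. 3 shows as removed from a given training dataset.

```python
def qodi_labels(in_situ_based, other_maps):
    """Label one grid cell under the three levels of integration.

    in_situ_based: list of booleans, True where an in situ-based source
        (China's in situ records, Japan's vegetation map, ROK's forest
        cover map) flags the cell as planted.
    other_maps: list of three booleans for the remaining sources (China's
        national planted forest map, SDPT, Global Extent of Planted Trees).
    Returns a dict of labels for the lower bound, midpoint, and upper bound.
    """
    n_other = sum(bool(flag) for flag in other_maps)
    any_in_situ = any(in_situ_based)

    lower = any_in_situ or n_other >= 3   # most conservative
    midpoint = lower or n_other >= 2      # adds two-map agreement
    upper = any_in_situ or n_other >= 1   # any source counts

    def to_label(flag):
        return "planted" if flag else "natural"

    return {
        "lower": to_label(lower),
        "midpoint": to_label(midpoint),
        "upper": to_label(upper),
    }
```

For example, a cell flagged only by SDPT would be natural in the lower bound and midpoint datasets but planted in the upper bound dataset, which is how the three levels produce a range of estimated planted forest area.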

We also compiled 57 predictor variables for the supervised learning of the classification models (Fig. 1, Supplementary Table S1). The predictor variables consisted of five forest structure attributes269,270, seven MODIS-derived vegetation characteristics, 21 bioclimatic attributes271,272,273,274, 13 topographic attributes275, four anthropogenic attributes276,277,278,279, and seven soil attributes280. We obtained four forest structure attributes from the most recent Global Ecosystem Dynamics Investigation (GEDI) dataset, namely canopy height (rh100), plant area index (pai), foliage height diversity (fhd_normal), and total canopy cover (cover) (see Supplementary Table S1)18,269. We downloaded the raw footprint-level GEDI data (L2B), among which only full-power lasers were used in this study to ensure the accuracy of the measurement. GEDI data was processed using the “rGEDI” package in R281. Another forest structure attribute, tree height270, represents the 90th or 95th percentile of energy return height relative to the ground.

We extracted predictor variables to the centroid of each grid cell using the “sf” or “raster” packages in R267,268. GEDI footprint-level data was associated with each grid cell by taking the mean value of each attribute. We kept only grid cells with a minimum of 5 m tree height in accordance with FAO’s definition of “forest”2,270. Our final training dataset encompassed more than 1.5 million grid cells for the upper bound dataset, 1.0 million grid cells for the midpoint dataset, and 0.9 million grid cells for the lower bound dataset, consisting of one response variable labeled as either “planted” or “natural” and 57 predictor variables. Finally, to account for the differences in terrestrial ecoregions, we divided the overall training dataset into three biomes (Fig. 2e). Based on the global terrestrial biome map19, Temperate Grassland/Savanna and Montane and Flooded Grassland were grouped into “Temperate Grassland”. Temperate Broadleaf and Mixed and Temperate Conifer were grouped into “Temperate Forest”, and Tropical Moist, Tropical Dry, and Tropical Grassland/Savanna were grouped into “Tropical Forest and Savanna.” The three biomes remained separated for the upper bound dataset, but Temperate Grassland and Temperate Forest were merged for the midpoint and lower bound datasets to form the “Temperate Forest and Grassland” biome due to low sample size in Temperate Grassland.
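The forest mask and biome grouping described above amount to a lookup plus two simple rules. The following plain Python sketch (the study used R) encodes them; the dictionary keys repeat the biome names given in the text, while the function names and the dataset keys `"upper"`, `"midpoint"`, and `"lower"` are illustrative assumptions.

```python
# Grouping of terrestrial biomes into the three modeling biomes named in the text.
BIOME_GROUPS = {
    "Temperate Grassland/Savanna": "Temperate Grassland",
    "Montane and Flooded Grassland": "Temperate Grassland",
    "Temperate Broadleaf and Mixed": "Temperate Forest",
    "Temperate Conifer": "Temperate Forest",
    "Tropical Moist": "Tropical Forest and Savanna",
    "Tropical Dry": "Tropical Forest and Savanna",
    "Tropical Grassland/Savanna": "Tropical Forest and Savanna",
}

MIN_TREE_HEIGHT_M = 5.0  # forest threshold used in the text


def is_forest(tree_height_m):
    """Keep only grid cells with at least 5 m tree height."""
    return tree_height_m >= MIN_TREE_HEIGHT_M


def modeling_biome(biome, dataset):
    """Return the biome used for modeling in a given training dataset.

    dataset: 'upper', 'midpoint', or 'lower' (hypothetical keys). The two
    temperate groups are merged for the midpoint and lower bound datasets.
    """
    group = BIOME_GROUPS[biome]
    if dataset in ("midpoint", "lower") and group.startswith("Temperate"):
        return "Temperate Forest and Grassland"
    return group
```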

For mapping purposes, we prepared another 0.009° by 0.009° grid (approximately 1 km2 per cell), covering forested area (≥5 m tree height)2,270 in the study region with all predictor variables (new data; Fig. 2f). We chose the resolution 0.009° to align with most of the predictor variables (Supplementary Table S1). After a machine learning classification model was trained, estimation was made for each grid cell of this new data. For ROK and a majority of areas in Japan, however, we utilized the existing planted forest maps, namely the national forest cover map of ROK16 and the national vegetation map of Japan (Fig. 2b)12, respectively, to label the grid cells. Since reliable planted forest data already exist for these areas, we used our estimation only for the remaining areas in China, DPRK, and a small portion of Japan (Fig. 2f). Nevertheless, the existing data for ROK and a majority of areas in Japan were converted to the 0.009° resolution within the forested area for consistency. For the areas where our estimation is used, we imputed missing values in predictor variables of the new data using the “Hmisc” package in R282 to provide a spatially continuous map. For the GEDI attributes (Supplementary Table S1), however, we imputed missing values by training random forest (RF) models (see below for details of RF) with seven MODIS attributes due to a large number of missing values (22%, 34%, and 44% of the sample size for the upper bound, midpoint, and lower bound dataset, respectively). For the midpoint and lower bound datasets, we used the average predicted values from 10 repetitions of random forest models using 200,000 data points to minimize computational time (Table 1). To assess the performance of the RF model in imputing missing values in GEDI attributes, we performed cross-validation using bootstrapping. For the upper bound dataset, we randomly split the dataset into training (90%) and testing (10%) sets, sampling with replacement. For the midpoint and lower bound datasets, we randomly sampled 200,000 data points for the training sets with replacement, and the remainder was used as the testing dataset (Table 1). Based on 20 random iterations, we calculated the 95% confidence interval (CI) of the root mean square error (RMSE) and R-squared (R2). We calculated the 95% CI using the t0.975 value with 19 degrees of freedom.
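As a worked illustration of the interval used here, the sketch below computes a t-based 95% CI from a set of per-iteration scores. It is plain Python rather than the R used in the study; the function name `ci95` is hypothetical, and the critical values are the standard two-sided t0.975 table values for 19 and 9 degrees of freedom (20 and 10 iterations).

```python
import math
import statistics

# Standard t-table critical values t_{0.975, df}, keyed by degrees of freedom.
T_CRIT_975 = {9: 2.262, 19: 2.093}


def ci95(values):
    """Return (lower, upper) bounds of the t-based 95% CI of the mean.

    values: one score (e.g., RMSE or R2) per cross-validation iteration.
    Only sample sizes with a tabulated critical value are supported.
    """
    n = len(values)
    t_crit = T_CRIT_975[n - 1]  # KeyError if df is not tabulated above
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    return mean - t_crit * std_err, mean + t_crit * std_err


# Hypothetical scores from 20 iterations, for illustration only.
example_scores = [0.50 + 0.01 * i for i in range(20)]
example_ci = ci95(example_scores)
```

The interval is the mean plus or minus t0.975 times the standard error, so with 20 iterations the half-width is 2.093 times the sample standard deviation divided by the square root of 20.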

Table 1 Summary of tasks conducted in this study.

Ensemble machine learning model

We developed an ensemble model to estimate the spatial distribution of planted forests, with three candidate machine learning models: RF, support vector machines (SVM), and XGBoost. RF is a non-parametric ensemble learning approach283, which combines a variant of decision trees and an additional level of randomness by bootstrapping sub-data and different sets of predictor variables to mitigate potential multicollinearity issues often encountered in multidimensional machine learning models284. We used the “randomForest” package in R285. SVM is a supervised learning model that constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space to separate classes286. We used the “e1071” package in R287. XGBoost is a gradient-boosted decision tree algorithm designed to accommodate large data at high speed. We used the “xgboost” package in R288. The three candidate models are frequently used in ecological and biological research with satisfactory performance266,289. Other potential candidate models include artificial neural networks, k-nearest neighbor, and Naïve Bayes, which are not necessarily superior290. All modeling processes were conducted in R291.

To assess the performance of the three candidate models in estimating planted forests, we conducted cross-validation using bootstrapping. Due to data size, in each of the ten repetitions we randomly sampled 50,000 points (25,000 for each class) for the upper bound and midpoint datasets and 80% of the sample points for the lower bound dataset to create the training set, and the rest composed the testing set (Table 1). Default hyperparameter values were used for the three candidate models. Based on 10 iterations, we calculated the 95% CI of classification accuracy and F1 score. We calculated the 95% CI using the t0.975 value with 9 degrees of freedom. Classification accuracy shows the proportion of overall correct predictions. While accuracy is the most widely used and intuitive evaluation metric of a classification problem, it can overestimate performance on imbalanced data. F1 score is the harmonic mean of precision and recall and is more appropriate for imbalanced data292. Precision represents the proportion of correct predictions of the positive class (i.e., planted) among all positive predictions, and recall represents the proportion of correct predictions of the positive class among all actual positive cases293. Since precision and recall typically trade off against each other, the combined metric, F1 score, provides a better evaluation perspective of incorrectly predicted cases. Using both accuracy and F1 score, we present a suite of evaluation metrics of our candidate models for both correct and incorrect predictions of an imbalanced dataset. Other potential evaluation metrics include Cohen’s Kappa. However, we did not use it in our study because its use is controversial294. Compared with SVM and XGBoost, the RF model was 0.7–8.1% more accurate in terms of overall classification accuracy and 1.4–4.5% more reliable in terms of F1 score (Fig. 4). Thus, we chose RF as the final model.
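To make the difference between accuracy and F1 score concrete, the following plain Python sketch (illustrative only; the study used R) computes both from predicted and true labels, with “planted” as the positive class. The function name `classification_metrics` and the toy data are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive="planted"):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Ratios are defined as 0 when their denominator is 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced case: 90 natural and 10 planted cells, and a classifier
# that predicts "natural" everywhere.
toy_true = ["natural"] * 90 + ["planted"] * 10
toy_pred = ["natural"] * 100
toy_metrics = classification_metrics(toy_true, toy_pred)
```

In this toy case accuracy is 0.9 even though no planted cell is detected, while the F1 score is 0, which is why both metrics are reported for imbalanced data.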

Fig. 4

Performance of three candidate machine learning models to map planted forests. Classification accuracy and F1 score of random forest (RF), support vector machine (SVM), and XGBoost imputation models are shown. Mean values from 10 repetitions and 95% confidence intervals are shown for each biome. RF outperformed SVM and XGBoost in all cases, and thus RF was used to model planted forests in our study.

To improve the performance of the model while minimizing computation time, we adjusted two hyperparameters of the RF algorithm: the number of decision trees and the number of predictor variables (i.e., the number of variables randomly sampled as candidates at each split). Similar to the cross-validation described above, in each of the ten repetitions we randomly sampled 50,000 points (25,000 for each class) for the upper bound and midpoint models and 80% of the sample points for the lower bound model to assess RF performance using different hyperparameter values (Table 1). Specifically, we calculated the classification accuracy and F1 score for different hyperparameter values. Based on 10 iterations, we chose 100 decision trees for the upper bound and midpoint models and 200 for the lower bound model, the values at which both accuracy and F1 score converged (Fig. 5). We used the default number of predictor variables (seven) for all biomes for the upper bound model. We chose 26 and 42 for Temperate Forest and Grassland and Tropical Forest and Savanna, respectively, for the midpoint model (Fig. 6). We chose 20 and 40 for Temperate Forest and Grassland and Tropical Forest and Savanna, respectively, for the lower bound model (Fig. 6).
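The default of seven predictor variables follows from the rule used by the “randomForest” package for classification, which takes the floor of the square root of the number of predictors. The plain Python sketch below (the study used R) reproduces that arithmetic and collects the chosen settings in one place; the dictionary layout and key names are illustrative assumptions.

```python
import math

N_PREDICTORS = 57  # number of predictor variables in the study


def default_mtry(n_predictors):
    """Default variables tried per split for classification: floor(sqrt(p))."""
    return max(1, math.floor(math.sqrt(n_predictors)))


# Hyperparameter values reported in the text, keyed by model and biome.
RF_SETTINGS = {
    "upper": {
        "n_trees": 100,
        "mtry": {"all biomes": default_mtry(N_PREDICTORS)},
    },
    "midpoint": {
        "n_trees": 100,
        "mtry": {"Temperate Forest and Grassland": 26,
                 "Tropical Forest and Savanna": 42},
    },
    "lower": {
        "n_trees": 200,
        "mtry": {"Temperate Forest and Grassland": 20,
                 "Tropical Forest and Savanna": 40},
    },
}
```

With 57 predictors the square root is about 7.55, so the default is seven.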

Fig. 5

Performance of random forest models in terms of classification accuracy and F1 score with different numbers of decision trees. For each biome (Fig. 2e), we tested a different number of decision trees in the random forest ranging from 2 to 750. The solid lines represent the mean of 10 repetitions, and the bands represent the standard deviation. We chose 100 trees for the upper bound and midpoint models and 200 trees for the lower bound model to maximize model performance while minimizing computational time.

Fig. 6

Performance of random forest models in terms of classification accuracy and F1 score with different numbers of predictor variables. For each biome (Fig. 2e), we tested a different number of predictor variables in the random forest ranging from 2 to 56. The solid lines represent the mean of 10 repetitions, and the bands represent the standard deviation. We used the default number of predictor variables (seven) for all biomes for the upper bound model. We chose 26 and 42 for Temperate Forest and Grassland and Tropical Forest and Savanna for the midpoint model. We chose 20 and 40 for Temperate Forest and Grassland and Tropical Forest and Savanna for the lower bound model.

For the final RF model, we ensured that the training set had an equal number of points for each class (i.e., 50% planted forest and 50% natural forest) by randomly under-sampling the dominant class. The prediction of our classification model was the percent planted forest based on how many decision trees returned the “planted” prediction. We built 20 models to derive the mean percentage for each biome and model (upper bound, midpoint, and lower bound) (Table 1). Finally, we calculated the mean percentage of the three models as a final value, while upper and lower bounds serve as a potential range (Fig. 7). Grid cells with a predicted percentage ≥50% are considered planted forest (Fig. 8). Using the spatially continuous dataset of 57 predictor variables (see Data fusion), we created a map covering the entire forested area in East Asia using model prediction.
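The class balancing, vote-based percentage, and final thresholding can be sketched as follows. This is plain Python for illustration (the study used R), with hypothetical function names; it assumes the percentage is stored as a proportion between 0 and 1 and that each tree casts one “planted” or “natural” vote.

```python
import random
from statistics import mean


def undersample_balanced(rows, label_of, seed=0):
    """Randomly under-sample so every class has as many rows as the rarest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced


def percent_planted(tree_votes):
    """Share of decision trees that returned the 'planted' prediction."""
    return sum(vote == "planted" for vote in tree_votes) / len(tree_votes)


def final_label(pct_upper, pct_midpoint, pct_lower, threshold=0.5):
    """Average the three models and apply the 50% threshold."""
    pct = mean([pct_upper, pct_midpoint, pct_lower])
    return pct, ("planted" if pct >= threshold else "natural")
```

For instance, model outputs of 0.8, 0.6, and 0.4 average to 0.6, so the cell would be mapped as planted, while the individual upper and lower bound values remain available as a potential range.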

Fig. 7

Spatial distribution of percent planted forest in East Asia. Our main prediction was the mean percent planted forest from the three models (upper bound, midpoint, and lower bound), while upper and lower bounds present potential ranges. Prediction was made for China, DPRK, and small portions of Japan. National planted forest maps of Japan12 and ROK16 were used for the remaining areas in ROK and the majority of areas in Japan, indicated in gray. The data is in a vector format with each polygon representing a 0.0090° by 0.0090° (approximately 1 km) grid in the WGS84 datum.

Fig. 8

Spatial distribution of planted forests in East Asia. The map shows the estimated areas where the percent planted forest is at least 50%. For ROK and most areas in Japan, national planted forest maps12,16 were used to determine the distribution of planted forest. The data is in a vector format with each polygon representing a 0.0090° by 0.0090° (approximately 1 km) grid in the WGS84 datum.

Mapping dominant tree species of the planted forests

Over the planted forest expanse in East Asia identified by the final RF classification model, we predicted the dominant tree species (to the genus level) of the planted forest for each criterion (Fig. 9). For the training set, we combined 2,481 in situ records in China20–265 with the tree-level records of the Japan295 and ROK296 National Forest Inventories (NFI). Specifically, we calculated the importance value of each species for each NFI plot within the predicted planted forest expanse and identified the species with the highest importance value as the dominant species for the given plot. Importance value is the sum of the percent basal area and the percent number of individuals of each species and represents the overall dominance of the species297,298. After identifying the dominant species for each NFI plot, we aggregated the plots into the 0.009° by 0.009° grid cells by taking the majority vote of the dominant species. We retained the genus names of the dominant species, and only genera with 60 or more samples were included to ensure a sufficient size of training data.
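The importance value calculation for a single plot can be illustrated with a short sketch. It is written in plain Python (the study used R); the function names and the example tree records are hypothetical, and the genus is taken as the first word of the species name.

```python
from collections import defaultdict


def importance_values(trees):
    """Importance value per species for one plot.

    trees: list of (species_name, basal_area) records, one per tree.
    Importance value = percent of plot basal area + percent of individuals.
    """
    basal_area = defaultdict(float)
    n_trees = defaultdict(int)
    for species, area in trees:
        basal_area[species] += area
        n_trees[species] += 1

    total_area = sum(basal_area.values())
    total_trees = len(trees)
    return {
        species: 100 * basal_area[species] / total_area
        + 100 * n_trees[species] / total_trees
        for species in basal_area
    }


def dominant_genus(trees):
    """Genus of the species with the highest importance value in the plot."""
    values = importance_values(trees)
    dominant_species = max(values, key=values.get)
    return dominant_species.split()[0]


# Hypothetical plot: two large cedars and three small oaks.
example_plot = [
    ("Cryptomeria japonica", 0.30),
    ("Cryptomeria japonica", 0.20),
    ("Quercus serrata", 0.10),
    ("Quercus serrata", 0.05),
    ("Quercus serrata", 0.05),
]
example_genus = dominant_genus(example_plot)
```

In this made-up plot the cedars hold about 71% of the basal area and 40% of the stems, giving an importance value of about 111 against about 89 for the oaks, so the plot would be assigned the genus Cryptomeria.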

Fig. 9

Spatial distribution of dominant tree species to the genus level across the planted forest range in East Asia (Fig. 8).

We trained an RF classification model using the same package in R, with the default hyperparameter setting and an identical set of predictor variables, except for roadless areas and GEDI attributes, which were excluded due to a substantial number of missing values (86% and 34% of the sample size, respectively). We ensured that the training set had an equal number of points for each class (i.e., genus) by combining random under-sampling and oversampling using the “UBL” package in R299. To assess the performance of the RF model in mapping dominant genera across the planted forest expanse in East Asia, we performed a 90/10 cross-validation using bootstrapping. In each iteration, we used stratified sampling to split the entire training dataset into training (90%) and testing (10%) sets using the “caret” package in R300 and conducted a combination of under-sampling and oversampling of the training set to address the class imbalance (Table 1). Based on 100 random iterations, we calculated the 95% CI of overall classification accuracy and of precision, recall, and F1 score for each class.
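A stratified 90/10 split keeps the proportion of each genus roughly the same in the training and testing sets. The sketch below shows one simple way to do this in plain Python; it is not the “caret” implementation used in the study, the function name `stratified_split` and the rounding rule are assumptions, and the resampling step performed with “UBL” is not reproduced.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Split record indices into train/test sets, stratified by class label.

    Assumption: each class contributes round(n * test_fraction) records to
    the test set, with a minimum of one.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Hypothetical genus counts, each above the 60-sample minimum.
example_labels = ["Pinus"] * 60 + ["Larix"] * 80 + ["Eucalyptus"] * 100
example_train, example_test = stratified_split(example_labels)
```

Repeating the split with different seeds yields the per-iteration scores from which the confidence intervals are computed.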

Data Records

The spatial database of planted forests consists of maps of estimated planted forest distribution (Figs. 7, 8) and dominant tree species (Fig. 9) of East Asia, available on figshare301 (https://doi.org/10.6084/m9.figshare.21774725.v3). The database is provided as shapefiles in which each polygon is 0.009° by 0.009° in size within the forested area of 2020 (≥5 m tree height) based on the FAO’s definition of “forest”2,270. Each polygon contains the following attributes:

ID: Polygon ID

Biome: Biome classes used in the study

Country: Country

Prc_Pln: Percent planted forest. The values represent the average of the three models (upper bound, midpoint, and lower bound). NA for ROK and a majority of areas in Japan, where national planted forest maps12,16 were used as the final planted/natural label (Fig. 2f).

Prc_P_U: Percent planted forest predicted by the upper bound model. NA for ROK and a majority of areas in Japan, where national planted forest maps12,16 were used as the final planted/natural label (Fig. 2f). Note that values are not always higher than Prc_Pln.

Prc_P_L: Percent planted forest predicted by the lower bound model. NA for ROK and a majority of areas in Japan, where national planted forest maps12,16 were used as the final planted/natural label (Fig. 2f). Note that values are not always lower than Prc_Pln.

Type: “Planted” or “Natural” forests based on the main result (i.e., the average of the three models). For our predicted percent planted forest, “Planted” if Prc_Pln ≥ 0.5 and “Natural” if Prc_Pln < 0.5. For Prc_Pln = NA, the national planted forest maps12,16 determine the label: “Planted” if the given polygon is mapped as planted forest, and “Natural” otherwise.

Typ_Upp: “Planted” or “Natural” forests based on the upper-bound model.

Typ_Lwr: “Planted” or “Natural” forests based on the lower-bound model.

Genus: For Type = “Planted”, this attribute indicates the predicted dominant genus. NA for Type = “Natural”.

Gns_Upp: For Typ_Upp = “Planted”, this attribute indicates the predicted dominant genus. NA for Typ_Upp = “Natural”.

Gns_Lwr: For Typ_Lwr = “Planted”, this attribute indicates the predicted dominant genus. NA for Typ_Lwr = “Natural”.

Besnard_Yr: Estimated planted year based on forest age302 (https://doi.org/10.17871/ForestAgeBGI.2021). See Usage Notes.

Du_Yr: Estimated planted year based on the map of planting year of plantations303,304 (https://doi.org/10.6084/m9.figshare.19070084.v2). A value of 1981 indicates the planting year was before 1982, and values from 1982 to 2019 correspond to the planting years. See Usage Notes.

Area_m2: Area of the planted forest polygons in square meters.

Raster layers are also available for percent planted forest, type (planted or natural forest), and dominant genus on figshare301 (https://doi.org/10.6084/m9.figshare.21774725.v3).
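The labeling rule for Type and the coding of Du_Yr can be illustrated with a short Python sketch. The function names and the `national_map_planted` argument are hypothetical; the shapefile already stores the resulting labels, so this only restates the rules given in the attribute list, with Prc_Pln compared against 0.5 as written there.

```python
import math


def is_missing(value):
    """True for None or NaN, the usual ways an NA attribute is read in Python."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def forest_type(prc_pln, national_map_planted=None):
    """Planted/natural label following the rule stated for the Type attribute.

    national_map_planted is a hypothetical flag: True if the national planted
    forest map marks the polygon as planted. It is used only when Prc_Pln is NA.
    """
    if is_missing(prc_pln):
        return "Planted" if national_map_planted else "Natural"
    return "Planted" if prc_pln >= 0.5 else "Natural"


def describe_du_year(du_yr):
    """Readable planting year for Du_Yr, where 1981 encodes 'before 1982'."""
    if is_missing(du_yr):
        return "unknown"
    if int(du_yr) == 1981:
        return "before 1982"
    return str(int(du_yr))
```

For example, `forest_type(0.5)` gives "Planted", `forest_type(float("nan"), national_map_planted=True)` falls back to the national map, and `describe_du_year(1981)` returns "before 1982" rather than a literal year.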

Based on our prediction, the total area of planted forests in East Asia was 948,863 km2, with a range of 600,529 to 1,277,549 km2. China accounted for 87% of the planted forest area in East Asia, most of which is in the lowland subtropical and tropical regions and the Sichuan Basin (Fig. 8). More than half of China’s planted forest area was dominated by Cunninghamia (Table 2), in the subtropical region and the Sichuan Basin (Fig. 9). Larch (Larix spp.), black locust (Robinia spp.), and pine (Pinus spp.) were widely observed in northern and central China, and eucalyptus dominated planted forests in tropical regions.

Table 2 Predicted area of planted forest for each dominant genus based on the main model.

In Japan and ROK, planted forests were uniformly distributed across each country (Fig. 8). More than half of Japan’s total planted forest area was Chamaecyparis- or Cryptomeria-dominant (Table 2), while other coniferous genera (e.g., Abies and Pinus) covered northern planted forests (Fig. 9). ROK’s planted forests were characterized by diverse genera; more than half of the planted forest area was dominated by pine, followed by deciduous trees including oak (Quercus spp.) and chestnut (Castanea spp.). DPRK’s planted forests were mainly distributed in the south and were largely composed of oak, larch, and pine.

The input training data used in this study, including the response variable and predictor variables, are available on figshare305 (https://doi.org/10.6084/m9.figshare.21774812.v2). The underlying data include in situ and digitized planted-natural forest data:

The in situ observational data of China20–265

The Japan Vegetation Map12 (http://gis.biodic.go.jp/webgis/sc-025.html?kind=vg67)

The national planted forest map of China15

The national planted forest map of ROK16

SDPT version 1.013 (https://www.wri.org/research/spatial-database-planted-trees-sdpt-version-10)

Global planted trees extent 201514 (https://doi.org/10.5281/zenodo.3931930)

Japan National Forest Inventory295 (http://forestbio.jp/datafile/datafile.html)

ROK National Forest Inventory296

The predictor variables used in this study are all available through open sources as follows:

GEDI L2B269 (https://search.earthdata.nasa.gov/search)

Tree height (https://glad.umd.edu/dataset/GLCLUC2020)

MODIS (https://modis.gsfc.nasa.gov/)

Corrected precipitation: PBCOR271 (http://www.gloh2o.org/pbcor/)

Bioclimate data: CHELSA272,273 (https://chelsa-climate.org/bioclim/)

Global aridity index and potential evapotranspiration: CGIAR-CSI v.2274 (https://doi.org/10.6084/m9.figshare.7504448.v3)

Topography: EarthEnv275 (http://www.earthenv.org/topography)

Global cattle distribution276 (https://doi.org/10.7910/DVN/GIVQ75)

Roadless area277 (https://doi.org/10.1126/science.aaf7166)

Protected area: UNEP-WCMC278 (https://www.protectedplanet.net/en)

Human footprint279 (https://doi.org/10.5061/dryad.052q5)

Soil characteristics: WISE30sec v1.0280 (https://www.isric.org/explore/wise-databases)

Other data used in this study include:

The Nature Conservancy (TNC) Terrestrial Ecoregions map19 (https://geospatial.tnc.org/datasets/b1636d640ede4d6ca8f5e369f2dc368b/about)

All the data listed above are open access, except the national planted forest map of China15, the national planted forest map of ROK16, and the ROK National Forest Inventory296. The sensitive information in these datasets will be available upon request via Science-i (https://science-i.org/) and approval from data contributors.

Technical Validation

Model validation in imputing GEDI missing values

We conducted cross-validation with bootstrapping to evaluate the model used to impute the missing values in GEDI attributes for the high-latitude areas (Supplementary Table S1; see Quality-Oriented Data Integration (QODI) in Methods). R2 ranged from 31% to 42% for all the GEDI attributes in Temperate Grassland and Temperate Forest (Table 3). For Tropical Forest and Savanna, canopy height had an R2 of 22%, and the remaining attributes had an R2 of almost 30%. Foliage height diversity showed the highest R2 and total canopy cover showed the lowest root mean square error (RMSE) among all GEDI attributes in all groups (Table 3).
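For readers who want to reproduce the two evaluation measures on held-out data, a minimal Python sketch follows. It assumes R2 is the coefficient of determination (one minus the ratio of residual to total sum of squares), which the text does not state explicitly, and the observed and predicted values are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the units of the attribute."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical held-out canopy heights (m) and their imputed values.
obs = [10.0, 12.0, 14.0, 16.0]
pred = [11.0, 12.0, 13.0, 17.0]
example_r2 = r_squared(obs, pred)    # 0.85 for these hypothetical values
example_rmse = rmse(obs, pred)       # about 0.87 m
```

Because RMSE carries the units of each attribute, it is comparable across biome groups for the same attribute but not across attributes with different scales, whereas R2 is unitless.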

Table 3 Evaluation in imputing missing data of GEDI attributes for mapping purposes.

Model validation in estimating planted forests

To evaluate the performance of our mapping product of East Asia, we compared our main prediction (Fig. 8) with the planted/natural labels of the midpoint dataset for China. We calculated classification accuracy, precision, recall, F1 score, and the four elements of the confusion matrix in percentage (true positive, false positive, false negative, and true negative, where the positive class is planted forest and the negative class is natural forest). Our prediction is characterized by a high recall (0.99), indicating that 99% of the observed planted forests were correctly predicted as planted forest (Table 4). Our precision was 0.63, which indicates that approximately two out of three positive predictions are actually planted forests. This level of accuracy is similar to those of other large-scale forest mapping studies (0.60–0.80)306,307,308.

Table 4 Evaluation metrics and elements of confusion matrices of the main prediction of planted forest distribution.

While precision is often negatively associated with recall, the F1 score, 0.77, indicates that our model is well-balanced between precision and recall. The low precision is attributable to the imbalanced distribution of positive and negative classes in the validation set (the midpoint dataset for China). The number of samples for natural forests was almost 10 times greater than that of planted forests in our validation set (Table 4). While we maximized the predictive performance by balancing the training data, high accuracy and low precision are inevitable due to the imbalanced validation set.
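The relationship between the reported scores, and the effect of class imbalance on precision, can be checked with a short Python sketch. The precision and recall passed to `f1_from` are the values reported above; the confusion-matrix counts are hypothetical numbers chosen only to mimic a roughly 10:1 natural-to-planted ratio and are not the values in Table 4.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Reported precision (0.63) and recall (0.99) reproduce the reported F1 of 0.77.
reported_f1 = f1_from(0.63, 0.99)

# Hypothetical counts: 1,000 planted and 10,000 natural validation samples,
# with 99% of planted and about 94% of natural samples classified correctly.
imbalanced = metrics(tp=990, fp=581, fn=10, tn=9419)

# Same per-class error rates, but with only 1,000 natural samples.
balanced = metrics(tp=990, fp=58, fn=10, tn=942)
```

With identical per-class error rates, precision in this illustration is about 0.63 on the imbalanced set and above 0.9 on the balanced one, because the number of false positives scales with the number of natural samples while the number of true positives does not.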

To further validate the quality of our prediction, we also compared our estimated total area of planted forests against the reported values from the FAO Global Forest Resources Assessment (FRA)2 and the National Forest Inventory dataset from China309 (Table 5). Our total predicted area of planted forests in East Asia was 948,863 km2 with a range between 600,529 and 1,277,549 km2, which is consistent with the FRA estimate (981,390 km2). The predicted area of China’s planted forests was 825,751 km2 (475,566–1,159,009 km2), while the FRA reports 846,960 km2 and the Ninth National Forest Inventory of China reports 795,428 km2. For Japan, the range of estimated areas of planted forests was between 103,447 and 105,633 km2, while the FRA reported value is 101,840 km2. Our estimated area of planted forests in DPRK was 7,986 km2 (5,601–11,648 km2), while the FRA reported value is 9,870 km2. Overall, our estimate was consistent with those reported by the FRA and the National Forest Inventory of China.
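The comparison above amounts to simple arithmetic, sketched here in Python using only the areas quoted in this paragraph (km2). The dictionary layout and function names are assumptions for illustration; Japan is omitted because only a range, not a point estimate, is given for it in the text.

```python
# (predicted, lower, upper, {reference source: reported area}), all in km2,
# copied from the paragraph above.
AREAS = {
    "East Asia": (948_863, 600_529, 1_277_549, {"FRA": 981_390}),
    "China": (825_751, 475_566, 1_159_009, {"FRA": 846_960, "China NFI": 795_428}),
    "DPRK": (7_986, 5_601, 11_648, {"FRA": 9_870}),
}


def relative_difference(predicted, reference):
    """Signed difference of the prediction from the reference, in percent."""
    return 100 * (predicted - reference) / reference


def compare(areas):
    """Rows of (region, source, relative difference %, reference within range?)."""
    rows = []
    for region, (pred, low, high, refs) in areas.items():
        for source, ref in refs.items():
            rows.append((region, source, relative_difference(pred, ref), low <= ref <= high))
    return rows


comparison = compare(AREAS)
for region, source, diff, inside in comparison:
    print(f"{region:10s} vs {source:10s} {diff:+6.1f}%  reference within range: {inside}")
```

For these three regions, every reference value falls inside the predicted range. The point estimates differ from the references by a few percent for East Asia and China, and by roughly 19% for DPRK, where the absolute areas are much smaller.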

Table 5 Predicted area of planted forest for each country and the entire region and estimated area by other sources.

Model validation in estimating dominant tree species

Our 90/10 bootstrapping cross-validation in estimating the dominant tree species across planted forests showed an overall classification accuracy of 0.396 (±0.003 95% CI). Among all the planted genera, Cunninghamia and Eucalyptus had the highest F1 scores (0.745 and 0.733, respectively), with high recall (0.893 and 0.802, respectively) and satisfactory precision (0.644 and 0.403, respectively) (Table 6). Meanwhile, Carpinus and Castanea showed the lowest F1 scores (0.124 and 0.136, respectively), which likely resulted from their small sample sizes compared to other genera. Acer, Alnus, Betula, Cryptomeria, Picea, Pinus, Quercus, and Tilia showed low recall compared to precision, indicating that samples truly belonging to these genera tended to be classified as other genera. Abies, Carpinus, Castanea, Castanopsis, Chamaecyparis, Cunninghamia, Eucalyptus, Fagus, Ilex, Larix, and Robinia had lower precision than recall due to the overprediction of these genera (Table 6).

Table 6 Evaluation of the random forest classification model in mapping the dominant tree species across the planted forest expanse in East Asia.

Uncertainties

While this study advances the current understanding of planted forests in East Asia based on multi-source data consisting of in situ, digitized, and modeled datasets, uncertainties arose from two main sources. First, limited in situ data, especially from Japan, ROK, and DPRK, constitute one of the largest sources of uncertainty and could lower the accuracy of our planted forest prediction in these countries. To mitigate this uncertainty, we integrated different data sources for modeling (e.g., SDPT13 and the Global Planted Trees Extent 201514), and the final map product for these countries relied on external sources12,16.

Second, our map of planted tree species depicts the spatial distribution of the dominant tree species to the genus level across the range of planted forests. However, it is beyond the scope of this study to distinguish monoculture planted forests from mixed-species planted forests, the latter of which are common in certain regions310. This uncertainty in tree species richness can be mitigated by integrating the mapping products presented here with recent global high-resolution maps of local tree species richness and co-limitation289. Furthermore, some genera predicted in our study had low F1 scores, which can be improved by increasing the sample size for these genera. Nevertheless, it is not realistic to achieve perfectly balanced data, and differences in predictive performance among genera are inevitable.

Usage Notes

Our final maps of planted forest range (Fig. 8) for Japan and ROK consist of data directly obtained from the national planted forest maps of Japan12 and ROK16. Users of these particular maps should cite these sources accordingly.

Planted forests in this study include forests of all ages that have been planted for ecological restoration, commercial plantation, and other purposes, such as landscape and disaster prevention.

Since the underlying training datasets differ in planting years, we were only able to quantify a rough range of planting years. Specifically, we overlaid our final map with two existing map layers of estimated forest age302 and planted year303,304. Based on these two sources, some planted forests were planted more than 100 years ago, while others are less than five years old (Fig. 10). The estimate based on forest age302 is consistent with the planting history of each country: the majority of planted forests were established post-war in Japan, followed by efforts in the Korean peninsula, while planted forests in China come from more recent planting (Fig. 10a). We included planted year information in our map product (see Data Records).

Fig. 10 Density plot showing the concentration of estimated planted forests in the range of planted year for each country. (a) planted year was estimated based on forest age302 with a maximum value of 2010. (b) planted year was estimated based on the map of planting year of plantations303,304.