Data on tree height and diameter for Pinus kesiya in Zambia

Forest inventories in plantations of non-native trees are conducted every five years in Zambia. Characteristics of the data collected through these inventories are presented here. The data includes diameter at breast height (d), total tree height (h) and rotation categories for the trees sampled. This data supported the development of robust h-d models for planted Pinus kesiya in the country. We also present graphical visualizations of the composition and trends of the data by site and rotation. The datasets were filtered and cleaned and are ready to be used for other purposes to improve understanding of P. kesiya growth. For more insight please see “Modeling the height-diameter relationship of planted Pinus kesiya in Zambia” (Ng’andwe et al., 2019).

Specifications Table
Subject area: Forestry growth modeling
More specific subject area: Height-diameter model for tropical non-native Pinus kesiya
Type of data: Table, graph, figure
How data was acquired: Data was collected during forest plantation inventories in the Copperbelt province of Zambia. We sampled 7,691 trees from temporary random sample plots for model development and 5,301 trees for model validation. Data collection for model development and model validation was conducted at different measurement occasions five years apart.
Data format: Raw, filtered, analyzed
Experimental factors: The data presented constitute pairs of diameters and heights of trees. For the development data, we present data as: (i) first and second rotation categories, (ii) site-specific data and (iii) combined data (i.e. data irrespective of site and rotation categories). Height-diameter model development was based on the combined data of P. kesiya. The validation data presented does not include additional categories apart from d.
Experimental features: We selected from the literature eight popular theoretical functions used in forest growth modeling, on the basis of simplicity, biological logic and reliability. These models were fitted to the datasets in order to choose the most appropriate function for the development of robust h-d models for P. kesiya in Zambia.

Value of the data
This data will enhance the development and comparison of tropical pine height-diameter models for prediction in the region and globally.
The data presented includes the first and second rotations of P. kesiya and is suitable for tree growth modeling of successive plantations.
This data also creates an opportunity to further improve the developed h-d model for P. kesiya.
The approach used is simplified, with diameter as the only predictor variable; hence forest managers are likely to find this data and the developed models user friendly.
The data can be used for generating height-diameter curves for different rotations, for site quality assessments and for developing biomass equations for P. kesiya.

Data
The data presented include the number of trees sampled, mean tree height and mean diameter at breast height in four different plantation sites. The data were categorized as (i) first rotation, (ii) second rotation and (iii) combined (Table 1). First rotation refers to the first P. kesiya trees that were planted after removing the native vegetation; these trees are usually above 25 years old. Second rotation refers to the P. kesiya trees that were planted immediately after the first rotation trees were harvested; these trees are less than 25 years of age. Data on rotation is related to age obtained from administrative records, i.e. the period from the year when the trees were planted in the field to the year when the inventory was conducted. We used 7,691 trees with complete h and d pairs in model development [1]. The data composition in each group is presented in Fig. 1. The combined data was used to develop the model parameter estimates (Table 3).

The fit of the country-level model (equation (1)) to the combined dataset, the resulting h-d curve and the associated plot of residuals against predicted height are presented in Fig. 2, and the normality checks in Fig. 3. We also fitted the country-level model to site data and generated site-specific h-d models (Fig. 4) and homoscedasticity diagnostic checks (Fig. 5). Parameter estimates for the site-specific models and fit statistics are presented in Table 4. Data presented at site level includes plots of residuals versus predicted height to check for normality and homoscedasticity of errors that could influence parameter estimates and fit statistics. A megaphone pattern would reveal heteroscedasticity, which is more related to the response variable h [2]. Data related to the comparison of the country-level model and the site-specific models on the basis of the mean relative error (MRE) and mean absolute percent error (MAPE) is also presented in Table 4.

Table 1. Characteristics of data used in this study for Pinus kesiya in Zambia. Numbers in brackets represent the standard deviation of the mean. Data irrespective of first and second rotation is indicated as 'combined' and was used in model development. First rotation refers to characteristics of data collected from trees above 25 years old and second rotation from trees below 25 years old. Data used for validation of models was only available as 'combined', irrespective of site and rotation.
a N is the number of trees, d is diameter at breast height, and h is tree height.
b V/data refers to independent validation data.
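The rotation rule described in this section (stand age taken as inventory year minus planting year, with first rotation above 25 years and second rotation below 25 years) can be sketched in a few lines. This is an illustrative Python sketch rather than the authors' procedure: the function name and the example years are hypothetical, and because the text does not say how a stand of exactly 25 years is treated, assigning it to the second rotation is an assumption.

```python
def classify_rotation(planting_year: int, inventory_year: int,
                      threshold_years: int = 25) -> str:
    """Return 'first' or 'second' rotation from stand age at inventory."""
    if inventory_year < planting_year:
        raise ValueError("inventory year precedes planting year")
    age = inventory_year - planting_year
    # The text gives > 25 years for first rotation and < 25 for second.
    # An age of exactly 25 is not specified; 'second' here is an assumption.
    return "first" if age > threshold_years else "second"


# Hypothetical (planting year, inventory year) pairs.
examples = [(1980, 2011), (1995, 2016), (1991, 2016)]
labels = [classify_rotation(planted, surveyed) for planted, surveyed in examples]
print(labels)
```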

Data acquisition
The data presented was collected during the forest plantation inventories in 2011 and 2016. All compartments were assessed. The equipment used included diameter tapes (for measuring d) and Suunto clinometers (for measuring h). The data presented was filtered from the main inventory database and prepared for modeling. We present 7,691 trees of P. kesiya for model development and 5,301 trees for validation.

Data exploration
The collected raw data was cleaned and preliminary descriptive statistics were generated in Microsoft Excel; the data was then saved in CSV (comma-delimited) format. We used R to develop basic graphical and numerical diagnostics [3]. Using both graphical and numerical measures, we checked the normality of the data to confirm that the assumptions for parametric tests were met (Fig. 3).
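The kind of preliminary descriptive statistics mentioned above (number of trees, mean and standard deviation of d and h per group) can be illustrated as follows. The authors used Microsoft Excel and R; this is only a Python sketch, and the column names (site, rotation, d, h), the site labels and all values are hypothetical rather than taken from the published files.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical rows; the real data files may use different column names.
SAMPLE_CSV = """site,rotation,d,h
SiteA,first,30.0,22.0
SiteA,first,34.0,24.0
SiteA,second,18.0,14.0
SiteA,second,22.0,16.0
SiteB,first,28.0,20.0
SiteB,first,32.0,23.0
"""


def summarise(csv_text):
    """Group trees by (site, rotation) and compute n, mean and SD of d and h."""
    groups = defaultdict(lambda: {"d": [], "h": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["site"], row["rotation"])
        groups[key]["d"].append(float(row["d"]))
        groups[key]["h"].append(float(row["h"]))
    summary = {}
    for key, values in groups.items():
        # stdev() needs at least two trees per group.
        summary[key] = {
            "n": len(values["d"]),
            "mean_d": mean(values["d"]),
            "sd_d": stdev(values["d"]),
            "mean_h": mean(values["h"]),
            "sd_h": stdev(values["h"]),
        }
    return summary


summary = summarise(SAMPLE_CSV)
for key, stats in sorted(summary.items()):
    print(key, stats)
```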

Height-diameter model development
We selected from the literature eight model functions popular in forestry modeling (i.e. Näslund, Power, Curtis, Chapman-Richards, Weibull, Modified Logistic, Exponential and Hossfeld) for model development [4-8] (Table 2). These functions were fitted to the P. kesiya data using the nls function in R. The actual datasets used are stored in separate raw data files (pkesiya_fitdata.csv and pkesiya_validationdata.csv). We followed established procedures during fitting, parameterization and validation [5,9,10]. For more information please see "Modeling the height-diameter relationship of planted Pinus kesiya in Zambia" [1].
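To make the fitting step concrete, the sketch below fits a Weibull-type h-d curve by least squares. Several things here are assumptions and not taken from the article: the functional form h = 1.3 + a(1 - exp(-b d^c)) is a commonly used Weibull h-d form with breast height assumed to be 1.3 m (the article's Table 2 may define it differently), the data are synthetic, and a coarse grid search stands in for the nls fitting that the authors performed in R.

```python
import math
from itertools import product

BREAST_HEIGHT = 1.3  # metres; assumed, not stated in the text


def weibull_height(d, a, b, c):
    """Assumed Weibull-type h-d form: h = 1.3 + a * (1 - exp(-b * d**c))."""
    return BREAST_HEIGHT + a * (1.0 - math.exp(-b * d ** c))


def sum_squared_errors(params, d_values, h_values):
    """Least-squares objective minimised during fitting."""
    a, b, c = params
    return sum((h - weibull_height(d, a, b, c)) ** 2
               for d, h in zip(d_values, h_values))


def grid_fit(d_values, h_values, a_grid, b_grid, c_grid):
    """Pick the (a, b, c) combination with the smallest sum of squared errors."""
    return min(product(a_grid, b_grid, c_grid),
               key=lambda p: sum_squared_errors(p, d_values, h_values))


# Synthetic, noise-free data generated from hypothetical parameters.
true_params = (30.0, 0.02, 1.2)
d_obs = [5.0, 10.0, 20.0, 30.0, 40.0, 50.0]
h_obs = [weibull_height(d, *true_params) for d in d_obs]

best = grid_fit(d_obs, h_obs,
                a_grid=[20.0, 25.0, 30.0, 35.0],
                b_grid=[0.01, 0.02, 0.03],
                c_grid=[1.0, 1.2, 1.4])
print(best)
```

A grid search only illustrates the objective being minimised. For real data, an iterative nonlinear least-squares routine such as nls would be the appropriate tool.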

Performance analysis
All models were subjected to statistical and graphical performance tests [11,12]. Consistent with recommended practices in forestry modeling [2,9], we also conducted model diagnostic checks, such as testing for normality and homoscedasticity of residuals, for the different models fitted to the data. The graphical performance of the best model when fitted to the combined data is shown in Fig. 2a and the plot of residuals against predicted height in Fig. 2b. The data was not split for model development and validation; instead, independent data was collected for validation purposes. The performance of the developed models was evaluated numerically, using relative error (RE), mean relative error (MRE), absolute percent error (APE), mean absolute percent error (MAPE), root mean square error (RMSE) and model prediction accuracy (MPA) (Table 3), and graphically (Fig. 2a and b). The best model was selected on the basis of its consistency and its final ranking under the MAPE, RMSE and MPA goodness-of-fit criteria (Table 3). Parameter estimates for the models, performance evaluation and model ranks are presented in Table 3. Where rotation is distinguished in the figures, grey and black shades represent first and second rotation data, respectively.
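Some of the numerical criteria named above can be sketched as follows. The article's exact formulas are in its Table 5, which is not reproduced in this text, so the sign convention for RE (predicted minus measured, divided by measured) and the use of n as the RMSE denominator are assumptions. MPA is omitted because its formula cannot be recovered from the text. The heights are hypothetical.

```python
import math


def relative_errors(measured, predicted):
    """RE_i, assumed here to be (predicted - measured) / measured."""
    return [(p - m) / m for m, p in zip(measured, predicted)]


def mean_relative_error(measured, predicted):
    """MRE: sum of RE_i divided by the number of measured trees."""
    errors = relative_errors(measured, predicted)
    return sum(errors) / len(errors)


def absolute_percent_errors(measured, predicted):
    """APE_i: absolute error as a percentage of the measured height."""
    return [100.0 * abs(p - m) / m for m, p in zip(measured, predicted)]


def mean_absolute_percent_error(measured, predicted):
    """MAPE: average of APE_i."""
    errors = absolute_percent_errors(measured, predicted)
    return sum(errors) / len(errors)


def root_mean_square_error(measured, predicted):
    """RMSE with n in the denominator (assumed; the article may use n - k)."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical measured and predicted heights in metres.
h_measured = [10.0, 20.0]
h_predicted = [11.0, 18.0]

mre = mean_relative_error(h_measured, h_predicted)
mape = mean_absolute_percent_error(h_measured, h_predicted)
rmse = root_mean_square_error(h_measured, h_predicted)
print(mre, mape, rmse)
```

The example shows why MRE and MAPE are reported together: an over-prediction and an under-prediction of equal relative size cancel in MRE but not in MAPE.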
On the basis of model ranking, the best h-d model for P. kesiya was based on the Weibull function (equation (1)). Further model tests on the country-level model (equation (1)) were performed to evaluate the influence of site and/or rotation on the prediction accuracy using ANOVA. Prior to conducting ANOVA, residuals were subjected to normality and homogeneity of variance tests. In this regard, we performed the Shapiro-Wilk test and also visualized the distribution of residuals using histograms where necessary. Normally distributed residuals would be indicated by a Shapiro-Wilk statistic (W) close to 1 and a non-significant p-value (p > 0.05). However, any model with a high number of observations may yield a significant p-value (p < 0.05) for the Shapiro-Wilk test [12]. Therefore, we also inspected the histogram visually and, if it was skewed, the data was transformed to comply with the assumptions of ANOVA.
In some cases where variances were not homogeneous according to the Bartlett homogeneity test, a Welch t-test for unequal variance was used instead [12]. Multiple pairwise comparisons among the levels of site were conducted using least-squares means (lsms) procedures for all significant effects on RE and MAPE for datasets with unequal variance [12,13]. Depending on the outcome of the analysis, site-specific or rotation-specific h-d models were developed as submodels of equation (1) (Table 4). We again used the mean relative error (MRE) and MAPE to evaluate the performance of the site-specific models. Models that passed this final step were considered for h estimation at the site and/or rotation level for P. kesiya in Zambia [1]. Equations used in the evaluation process are detailed in Table 5. The R packages that we utilized included Metrics for statistical performance tests, ggplot2 and gridExtra for graphics, and dplyr for subsampling of data, among others [3].
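The two test statistics mentioned above can be written out directly from their textbook definitions. This Python sketch is not the authors' R workflow: it computes only the Bartlett statistic and the Welch t statistic with its Welch-Satterthwaite degrees of freedom, and leaves out p-values because they need chi-square and t distribution functions that the Python standard library does not provide. The two samples are hypothetical.

```python
import math
from statistics import mean, variance


def bartlett_statistic(groups):
    """Bartlett's statistic for equal variances across k groups.

    Under the null hypothesis it follows a chi-square distribution
    with k - 1 degrees of freedom.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [variance(g) for g in groups]
    n_total = sum(sizes)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances))
    correction = 1.0 + (sum(1.0 / (n - 1) for n in sizes)
                        - 1.0 / (n_total - k)) / (3.0 * (k - 1))
    return numerator / correction


def welch_t(sample_a, sample_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = variance(sample_a) / n_a  # squared standard error of mean a
    var_b = variance(sample_b) / n_b  # squared standard error of mean b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1)
                                  + var_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical error values for two sites.
site_a = [1.0, 2.0, 3.0, 4.0, 5.0]
site_b = [2.0, 4.0, 6.0, 8.0, 10.0]

bartlett_value = bartlett_statistic([site_a, site_b])
t_value, t_dof = welch_t(site_a, site_b)
print(bartlett_value, t_value, t_dof)
```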

In Table 5, RE_i is the relative error and MRE is the mean relative error, obtained by dividing the sum of RE_i by the total number of measured trees, n; APE_i is the absolute percent error and MAPE is the mean absolute percent error (i.e. an average of APE_i); h_i is the measured tree height for the ith tree; ĥ_i is the predicted tree height for the ith tree; MPB is the mean prediction bias (i.e. the error associated with prediction for the ith tree, which reflects the deviation of the model with respect to the measured value); SD is the standard deviation of the prediction bias; RMSE is the root mean square error; MPA is the model prediction accuracy, which combines the mean prediction bias and the standard deviation of residuals; k is the number of fixed model parameters.

Acknowledgments
We are grateful to the Zambia Forestry and Forest Industries Corporation (ZAFFICO) for facilitating access to forest plantations and for organizing the inventories. We also thank the Copperbelt University for the administrative and financial support, as well as the anonymous reviewers for their valuable comments, which greatly improved the manuscript.