Data-Driven Methods for the Estimation of Leaf Water and Dry Matter Content: Performances, Potential and Limitations

Leaf equivalent water thickness (EWT) and dry matter content (expressed as leaf mass per area, LMA) are two critical traits for vegetation function monitoring, crop yield estimation, and precision agriculture management. Data-driven methods, such as vegetation index (VI)-based and machine learning (ML)-based methods, are widely used for remote sensing of leaf EWT and LMA because of their simplicity, satisfactory accuracy, and computational efficiency. However, most data-driven methods are applied at the canopy level, and comparisons of their performance at the leaf level have not been well documented. Moreover, ML-based methods generally take leaf optical properties directly as inputs, which may reduce their ability to retrieve leaf biochemical constituents, and the performance of ML-based methods combined with VI has rarely been evaluated. Using the independent LOPEX and ANGERS datasets, we compared the performances of three data-driven methods, namely the VI-based, ML-reflectance-based, and ML-VI-based methods, for the estimation of leaf EWT and LMA. Three sampling strategies were also used to evaluate the generalization of these methods. Our results showed that the ML-VI-based methods were the most accurate. Compared to the ML-reflectance-based and VI-based methods, the ML-VI-based model with support vector regression reduced the overall error by 5.7% (41.5%) and 1.8% (12.4%), respectively, for the estimation of leaf EWT (LMA). The ML-VI-based model inherits the advantages of vegetation indices and ML techniques, which make it sensitive to changes in leaf biochemical constituents and capable of solving nonlinear tasks. It is thus recommended for the estimation of EWT and LMA at the leaf level. Moreover, its performance can be further enhanced by improving its generalization ability, for example through better wavelength selection and the definition of new vegetation indices. These results provide prior knowledge of the data-driven methods and can be helpful for future studies on the remote sensing of leaf biochemical constituents.


Introduction
Leaf water and dry matter content are among the most important biochemical indicators that determine plant photosynthetic capacity and ecosystem processes [1][2][3]. Quantifying changes in these leaf biochemical indicators is critical for plant function monitoring, crop yield estimation, and precise agriculture management. By definition, leaf water content is parameterized as equivalent water thickness (EWT), whereas leaf dry matter content is specified as leaf mass per area (LMA) [4]. EWT and LMA can be calculated as the amount of water and dry mass per leaf area, respectively. They could be measured by both laboratory destructive measurements and remote sensing techniques. Being fast, effective, and nondestructive, remote sensing techniques using leaf optical properties have become a popular approach for the estimation of EWT and LMA [5][6][7][8].
With the development of hyperspectral instrumentation and remote sensing theory, quite a few approaches have been proposed for the estimation of EWT and LMA. These approaches can be broadly categorized into two types: physical-based and data-driven methods [9]. Physical-based methods minimize the difference between the measured and modeled leaf optical properties using radiative transfer models. Several radiative transfer models can be used for this purpose, such as the PROSPECT [10,11], LIBERTY [12,13], and SLOPE [14] models. Among them, PROSPECT is the most widely used because of its relative simplicity and satisfactory performance. Several inversion algorithms have also been proposed, including look-up-table [15] and iterative optimization methods [4,16]. These algorithms assume that the leaf specific absorption spectra are fixed for all vegetation species, and therefore cannot account for the spectral variability of leaf biochemical constituents [17]. Moreover, they may suffer from ill-posed problems and high computation cost.
Data-driven methods are built on the statistical relationship between leaf biochemical constituents and leaf optical properties. The statistical relationship is generally calibrated using a training dataset with regression techniques, ranging from simple linear regression [18] and partial least squares regression (PLSR) [19] to complex machine learning (ML) techniques, such as support vector machine regression [2], artificial neural networks [20], and random forest regression [21]. Leaf optical properties include, but are not limited to, leaf reflectance and its derivatives and combinations, such as vegetation indices (VI) [22,23] and red-edge positions [24]. Data-driven methods are generally simple to use, accurate, and computationally efficient, and thus are preferred for fast estimation of leaf biochemical constituents in agricultural applications.
Vegetation indices for the estimation of EWT and LMA are combinations of reflectances at two or more wavebands in the 900-2400 nm spectral region [18], where strong or weak absorption by water and/or dry matter occurs. Therefore, the wells and shoulders of water absorption around 970, 1200, 1500, and 2200 nm are frequently used for the estimation of EWT [25]. As for LMA, its accurate estimation is challenging because of the predominant water absorption [6,26]. Studies usually adopt the absorption of the C-H bond stretch at around 1700 nm to suppress the influence of water [27]. Such characteristics explain why the EWT-related indices usually involve reflectance at around 970, 1200, 1500, or 2200 nm [28][29][30][31][32], whereas the LMA-related indices usually involve reflectance at around 1700 nm [22,23,27]. These indices take various forms, from simple ratios and normalized differences to complex mathematical combinations. Simple-ratio and normalized-difference indices are generally easy to understand and use, and are therefore widely used for the estimation of EWT and LMA.
In the framework of data-driven methods, a linear or an exponential function built between VI and leaf biochemical constituents is among the simplest means (i.e., the VI-based method), whereas a nonlinear relationship built using ML techniques is among the most complicated. ML techniques have the advantage of solving nonlinear tasks, and have been widely used in remote sensing of leaf biochemical constituents in recent years. Notably, most of the data-driven methods focused on the canopy level [33][34][35], and only a few studies focused on the leaf level [2,7]. Accurate estimation of leaf biochemical constituents using leaf optical properties is fundamental for understanding what happens inside the leaf. To the best of our knowledge, the evaluation of these data-driven methods for the estimation of EWT and LMA at the leaf level has not been well documented.
Moreover, most of the ML techniques were adopted by using leaf optical properties at a wide spectral region as inputs (i.e., ML-reflectance-based method). For instance, Ref. [7] built a neural network, and Ref. [2] utilized a support vector regression (SVR) implementation for the estimation of EWT and LMA using leaf reflectance and transmittance at wavelengths covering the 900-2400 nm spectral region. Using leaf optical properties at such a wide spectral region may not always be helpful.
One study reported that including leaf optical properties in the 900-1300 nm spectral region could decrease the accuracy of LMA estimation [2]. Leaf optical properties in a spectral region are highly correlated [18], indicating that the leaf reflectance and/or transmittance at a specific wavelength may be enough for the representation of that at a wide spectral region. Combinations of leaf optical properties at two or more wavelengths can enhance the sensitivity to leaf biochemical constituents, as the VI-based method does. Another study suggested that incorporating ML techniques with VI can help to improve the performance of remote sensing of leaf biochemical constituents. Ref. [21] adopted 45 established VI as inputs to the random forest (RF) for the estimation of leaf chlorophyll content, and found that the error could be significantly reduced compared to the standard regression. However, incorporating ML techniques with VI for the estimation of EWT and LMA has rarely been reported.
Therefore, this paper focuses on the remote sensing of EWT and LMA at the leaf level using data-driven methods. The objectives of this paper are to: (1) intercompare the performances of the most widely used data-driven methods, i.e., the VI-based method and the ML-reflectance-based method, and of a new method that incorporates ML techniques with VI (ML-VI-based method); (2) explore the potential and limitations of the data-driven methods. The most widely used VI that are sensitive to leaf EWT and LMA, and the most popular ML techniques, are used in this paper. The LOPEX and ANGERS datasets were adopted for the evaluation of these data-driven methods at the leaf level. This study provides prior knowledge of the data-driven methods and can be applied to future studies on the remote sensing of leaf biochemical constituents.
This paper is organized as follows. A general description of the LOPEX and ANGERS datasets is given in Section 2.1. The most widely used VI and ML techniques are introduced in Sections 2.2 and 2.3. To achieve the objectives of this paper, three experiments are designed, as described in Section 2.4. Section 3 presents the results obtained with these experiments. Section 4 discusses the performance, potential, and limitations of the data-driven methods. Finally, the concluding remarks are provided in Section 5.

Description of the Experimental Datasets
In this study, two independent datasets were adopted, i.e., the LOPEX and ANGERS datasets. Both were collected with synchronous measurements of leaf optical properties and leaf biochemical constituents. They are among the most popular and accessible resources for remote sensing of leaf biochemistry, and have been widely used across the world. The LOPEX dataset was collected at the Joint Research Center of Italy (Ispra, Italy) in 1993 over 320 fresh leaf samples from 45 different species [36]. The ANGERS dataset was collected at INRA (National Institute of Agronomy) in Angers, France in 2003 over 276 leaf samples from 43 different species [37]. In the LOPEX and ANGERS experiments, fresh weights of leaf discs were measured before drying them in an oven at 85 °C for 48 h. After drying, the discs were reweighed to determine the corresponding EWT and LMA [37]. Together, the two datasets comprise 596 leaves of multiple herbaceous and woody species spanning a variety of spectra, vegetation structures, and biological components. The datasets are publicly available via http://opticleaf.ipgp.fr/index.php?page=database. A detailed description of the two datasets is documented in [36,37].
In both datasets, leaf reflectance and transmittance in the 400-2500 nm spectral range with a 1 nm step were measured using laboratory spectrophotometers or field spectroradiometers equipped with integrating spheres [2]. In this study, we focused on EWT and LMA inversion and, therefore, the 900-2400 nm spectral range was selected for its higher sensitivity to changes in EWT and LMA [2,16]. Leaf reflectance was used for the calculation of vegetation indices and as input for the data-driven models. Table 1 summarizes the statistical information of the LOPEX and ANGERS datasets, including the number of samples and species, and the minimum, maximum, mean, and standard deviation of the EWT and LMA. Figure 1 illustrates the spectrum of leaf reflectance from the LOPEX and ANGERS datasets. Each gray line represents the reflectance of a specific leaf, whereas the dashed black line is the median spectrum.
Sensors 2020, 20, 5394

Vegetation Indices
Ten vegetation indices, which have been reported to be most sensitive to leaf- and canopy-level EWT and LMA, were selected in this study. Table 2 shows these vegetation indices. Generally, these indices follow the type of simple ratio or normalized difference. They were calculated using leaf reflectances at two wavebands in the 900-2400 nm spectral region, where strong or weak absorption by water and/or dry matter occurs.

Vegetation Indices Sensitive to EWT
Based on Colombo et al. [25], six vegetation indices proposed for EWT estimation were selected in this study. These EWT-related indices are water index (WI) [28], normalized difference water index (NDWI) [29], simple ratio water index (SRWI) [30], normalized difference infrared index (NDII) [38], moisture stress index (MSI) [32] and difference water index (DWI) [25]. The WI, SRWI, and MSI follow the type of simple ratio, whereas the rest follow the type of normalized difference. They were calculated by exploiting wells and shoulders of water absorption around 970, 1200, 1500, and 2200 nm, as shown in Table 2. All these indices have been widely used in the estimation of EWT at both leaf and canopy levels, as documented in [18,25,31].

Vegetation Indices Sensitive to LMA
Four vegetation indices that are related to LMA estimation were selected in this study. These LMA-related indices are normalized difference for LMA (NDLMA) [23], normalized dry matter index (NDMI) [27], normalized difference (ND) [23], and ratio index (RI) [22]. Estimation of LMA has been reported to be challenging because of the predominant absorption of water [2,6]. These indices usually adopt a leaf reflectance around 1700 nm because of the absorption of C-H bond stretch, and another reflectance at other wavebands to suppress the influence of water. These indices have been reported to provide satisfactory performance when utilized for LMA estimation [22,23].
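The two index types described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`band`, `simple_ratio`, `normalized_difference`), the synthetic spectrum, and the band pairs used in the example are all hypothetical, since the actual wavebands of each index are those listed in Table 2.

```python
from bisect import bisect_left

def band(wavelengths, reflectance, target_nm):
    """Return the reflectance at the sampled wavelength closest to target_nm.

    Assumes `wavelengths` is sorted in ascending order.
    """
    i = bisect_left(wavelengths, target_nm)
    # Compare the neighbours on either side of the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(wavelengths)]
    best = min(candidates, key=lambda j: abs(wavelengths[j] - target_nm))
    return reflectance[best]

def simple_ratio(wavelengths, reflectance, nm_a, nm_b):
    """Simple-ratio index: R(a) / R(b)."""
    return band(wavelengths, reflectance, nm_a) / band(wavelengths, reflectance, nm_b)

def normalized_difference(wavelengths, reflectance, nm_a, nm_b):
    """Normalized-difference index: (R(a) - R(b)) / (R(a) + R(b))."""
    r_a = band(wavelengths, reflectance, nm_a)
    r_b = band(wavelengths, reflectance, nm_b)
    return (r_a - r_b) / (r_a + r_b)

# Hypothetical spectrum: 1 nm steps from 900 to 2400 nm with synthetic values.
wavelengths = list(range(900, 2401))
reflectance = [0.5 - 0.0001 * (w - 900) for w in wavelengths]

# Placeholder band pairs chosen only for illustration (not taken from Table 2).
sr = simple_ratio(wavelengths, reflectance, 1600, 1000)
nd = normalized_difference(wavelengths, reflectance, 1700, 1100)
print(f"simple ratio = {sr:.4f}, normalized difference = {nd:.4f}")
```

Both index types reduce a full spectrum to a single number per leaf, which is why an index can be fed either to a simple regression function or to an ML technique.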

K-Nearest Neighbor (KNN)
KNN is a lazy supervised learning regression technique. It is fast and effective for high-dimensional data regression problems [39,40]. The basic idea behind KNN is to find the set of K samples that are closest to the unknown sample based on a similarity measurement (e.g., the Euclidean distance used in this study) and to predict the value of the unknown sample as the average of the response variables of these K nearest neighbors [41][42][43]. The parameter K has a significant impact on the performance of the KNN technique. A small K indicates that only the small portion of the training data closest to the unknown sample contributes to the prediction, so the prediction can be affected by noise or uncertainty within the training data. A large K can suppress the noise or uncertainty, but may introduce many unrelated learning samples.
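The averaging rule described above can be illustrated with a short sketch. The function name `knn_predict` and the toy data are hypothetical; the study itself used the Scikit-Learn implementation rather than this hand-written version.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict by averaging the responses of the k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy data (hypothetical): two features per sample and one response value.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

pred_k1 = knn_predict(train_X, train_y, (0.1, 0.1), k=1)  # 1.0: single closest sample
pred_k3 = knn_predict(train_X, train_y, (0.1, 0.1), k=3)  # 2.0: mean of three close samples
pred_k4 = knn_predict(train_X, train_y, (0.1, 0.1), k=4)  # 4.0: pulled up by a distant sample
print(pred_k1, pred_k3, pred_k4)
```

The jump from K = 3 to K = 4 shows the trade-off mentioned in the text: a larger K smooths noise but can include samples that are unrelated to the query.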

Partial Least Squares Regression (PLSR)
PLSR is an extension of multiple linear statistical techniques. It integrates the advantages of principal component analysis, canonical correlation analysis, and linear regression analysis [22,44,45]. It can effectively provide good predictions in multivariate regression, even with few training data and multiple correlated input variables. The basic idea behind PLSR is to reduce a large number of reflectances or their derivatives to a few principal components (PCs), and to build the regression using several selected PCs [19,46].
The key of PLSR is to build a linear model as follows,
$$y = x\beta + \varepsilon$$
Here, y is the mean-centered vector of dependent variables (EWT and LMA), x represents the mean-centered vector of independent variables (reflectance), and β and ε are the regression coefficient and residual, respectively. In PLSR, the above principles are applied to the PCs of x.
The number of selected PCs has a great influence on the performance of the PLSR technique. A small number of selected PCs may cause data loss and thus underfitting occurs. These problems can be addressed by increasing the number of selected PCs, but may consequently cause overfitting and higher computation cost. In this paper, the PLSR2 model with the NIPALS algorithm was used.

Support Vector Regression (SVR)
SVR is an important branch of support vector machines (SVMs) [20]. It can provide good regression performance because it transfers a low-dimensional nonlinear input to a high-dimensional linear output. The basic idea behind the SVR is finding a hyperplane that can fit all data (that is, all sample points have the smallest total deviation from the hyperplane) [47][48][49].
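The key of SVR is to solve the following optimization problem, written here in the standard ε-insensitive form,
$$\min_{\omega,\, b,\, \xi,\, \xi^{*}} \; \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i} \left( \xi_i + \xi_i^{*} \right)$$
$$\text{subject to} \quad y_i - \left( \omega \Phi(x_i) + b \right) \le \varepsilon + \xi_i, \quad \left( \omega \Phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^{*}, \quad \xi_i,\ \xi_i^{*} \ge 0$$
Here, ω is the normal vector of the linear function and b is the intercept. Φ represents a nonlinear transformation from the current dimension to a high-dimensional space, which can be specified by a kernel function. The slack variables $\xi_i$ and $\xi_i^{*}$ correspond to the upper and lower deviations by which (ωΦ(x) + b) is allowed to exceed the error tolerance, ε, at a cost, C. Finally, x is leaf reflectance and y is leaf EWT or LMA in this study.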
A kernel function determines the distribution of sample points in the high-dimensional space, and is thus important for the SVR. In this study, the radial basis function (RBF) was selected as the kernel function, which implies that two critical parameters need to be optimized, C and γ. C is the cost parameter that is related to the tolerance for error, whereas γ is a parameter unique to the RBF kernel function and affects the speed of model prediction. A large C implies that a large error cannot be tolerated, which consequently may result in overfitting. With a small C, a large error is acceptable and underfitting may occur. The choice of γ adjusts the number of support vectors and thereby affects the speed of training and prediction. A large γ indicates fewer support vectors, whereas a small γ indicates more support vectors.
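To make the role of γ concrete, the sketch below evaluates the standard RBF kernel, exp(-γ‖x - z‖²), for one pair of points at several γ values. The function name `rbf_kernel` and the two sample points are hypothetical; the γ values are a subset of the grid reported in the cross-validation section.

```python
import math

def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical input vectors (e.g., two reflectance values per leaf).
x = (0.40, 0.35)
z = (0.50, 0.30)

for gamma in (1e-4, 1e-2, 1.0, 10.0):
    # Larger gamma makes the similarity fall off faster with distance.
    print(f"gamma = {gamma:g}: K(x, z) = {rbf_kernel(x, z, gamma):.6f}")
```

With a small γ the kernel value stays close to 1 for all pairs, so samples look alike to the model; with a large γ only very close samples are treated as similar.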

Random Forest (RF)
RF is a nonparametric ensemble machine learning algorithm based on multiple decision trees to train samples and achieve estimation. It is popular in the field of remote sensing due to its high accuracy and stability [50]. The basic idea behind the RF regression is that each decision tree is calculated separately on the dataset, the results are transmitted and the average thereof selected as the final prediction result [21].
The key to RF regression is to split regression trees. This process is done by choosing the input variable with the minimum Gini index, i.e.,
$$I_G\left(t_{x_i}\right) = 1 - \sum_{j} f\left(t_{x_i}, j\right)^{2}$$
Here, $f(t_{x_i}, j)$ represents the proportion of observations with value $x_i$ belonging to leaf j at node t, and $I_G$ is the corresponding Gini index.
Three parameters need to be optimized in the RF regression process: the number of decision trees, the maximum depth, and the terminal nodes. The number of decision trees should be large enough to form a dense forest. The maximum depth of the decision tree is limited to avoid overfitting. The terminal nodes determine when the tree growth should be stopped. A large number of terminal nodes implies that tree growth is stopped after a few splits, which would result in underfitting, whereas a small number of terminal nodes could cause overfitting.

Cross-Validation
To select the best set of parameters, all the ML techniques were tuned using a five-fold cross-validation procedure. The procedure randomly divided the training data into five equal subsamples, among which four were used to calibrate the ML models while the remaining one was held out to calculate the prediction error. This process was repeated five times so that each subsample was used exactly once for the calculation of the prediction error. In this study, the prediction error was parameterized as the RMSE, and the set of parameters that provided the smallest RMSE was selected. A detailed description of K-fold cross-validation is documented in [51]. For the KNN, K was optimized within [1, 3, 5, 7, 10]. For the PLSR, the number of PCs was optimized within [2, 3, 4]. For the SVR, C and γ were optimized within [10⁻², 10⁻¹, 1, 10, 100] and [10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10], respectively. For the RF, the number of decision trees, maximum depth, and terminal nodes were optimized within [10, 20, 30, 40, 50, 60, 70], [5, 7, 9, 11, 13, 15], and [1, 2, 3, 4, 5], respectively. All parameters of these ML techniques were optimized by the grid search function of the Scikit-Learn package in Python 3.6.
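The procedure above can be sketched without any external package. The example below is only an illustration under stated assumptions: it tunes K for a hand-written KNN on synthetic data, averages the RMSE over the five folds (the text does not state whether fold errors were averaged or pooled), and uses hypothetical helper names; the study itself relied on the grid search function of Scikit-Learn.

```python
import math
import random

def knn_predict(train_X, train_y, query, k):
    """Average the responses of the k nearest training samples."""
    ranked = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle the sample indices and deal them into nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def cv_rmse(X, y, k, folds):
    """Mean RMSE over the folds; each fold is held out exactly once."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        tr_X = [X[i] for i in range(len(X)) if i not in held]
        tr_y = [y[i] for i in range(len(y)) if i not in held]
        preds = [knn_predict(tr_X, tr_y, X[i], k) for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / len(scores)

def grid_search_k(X, y, k_grid=(1, 3, 5, 7, 10), n_folds=5, seed=0):
    """Return the K with the smallest cross-validated RMSE and all scores."""
    folds = five_fold_indices(len(X), n_folds, seed)
    scores = {k: cv_rmse(X, y, k, folds) for k in k_grid}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Synthetic data (hypothetical): a noisy linear response on one feature.
rng = random.Random(1)
X = [(i / 50.0,) for i in range(50)]
y = [2.0 * x[0] + rng.gauss(0.0, 0.1) for x in X]

best_k, scores = grid_search_k(X, y)
print("best K:", best_k)
```

The same loop applies to the other techniques by replacing `knn_predict` and iterating over every combination of the parameter lists given in the text.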

Design of Experiments
In this study, three data-driven methods were applied to the LOPEX and ANGERS datasets, and their performances were compared to each other. These methods aimed at building an empirical relationship between leaf optical properties (i.e., leaf reflectance and its combinations) and leaf biochemical constituents (i.e., EWT and LMA). These three data-driven methods are:

Method 1 (M1):
VI-based method, which builds a relationship between VI and EWT or LMA using a linear or an exponential function.

Method 2 (M2):
ML-reflectance-based method, which builds a relationship between leaf reflectance and EWT or LMA using ML techniques.

Method 3 (M3):
ML-VI-based method, which builds a relationship between VI and EWT or LMA using ML techniques.
It is notable that linear and exponential functions were selected in M1 because of their wide application in remote sensing of biochemical constituents, as documented in [18,33]. For each pair of leaf biochemical constituent and vegetation index, the regressed model with the highest accuracy was selected as the optimal model. The performances of the data-driven methods depend on the similarity between the training and validation datasets. Generally, the performances are calculated after splitting the experimental dataset into two subsets, one for training and the other for validation, so the regressed models are not validated on a completely independent dataset. This raises the question of whether the regressed models can be applied to another independent dataset and still provide satisfactory performance. To answer this question, three sampling strategies were designed:
Sampling strategy 1 (S1): using the LOPEX dataset as the training dataset and the ANGERS dataset as the validation dataset.
Sampling strategy 2 (S2): using the ANGERS dataset as the training dataset and the LOPEX dataset as the validation dataset.
Sampling strategy 3 (S3): mixing the LOPEX and ANGERS datasets, randomly taking 80% of the mixed dataset as the training dataset and the remaining 20% as the validation dataset.
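The three sampling strategies can be written down as a small splitting routine. This is a sketch under assumptions: the function name `make_splits`, the random seed, and the placeholder sample identifiers are hypothetical, and the rounding used for the 80% share is a choice made here, not one stated in the text.

```python
import random

def make_splits(lopex, angers, train_fraction=0.8, seed=0):
    """Return {strategy: (training set, validation set)} for S1, S2, and S3."""
    mixed = list(lopex) + list(angers)
    random.Random(seed).shuffle(mixed)
    n_train = int(round(train_fraction * len(mixed)))
    return {
        "S1": (list(lopex), list(angers)),        # train on LOPEX, validate on ANGERS
        "S2": (list(angers), list(lopex)),        # train on ANGERS, validate on LOPEX
        "S3": (mixed[:n_train], mixed[n_train:]), # random 80/20 split of the mixed data
    }

# Placeholder identifiers; the real datasets hold spectra and measured traits.
lopex = [("LOPEX", i) for i in range(320)]
angers = [("ANGERS", i) for i in range(276)]

splits = make_splits(lopex, angers)
for name, (train, valid) in splits.items():
    print(name, len(train), len(valid))
```

Only S1 and S2 validate on a dataset that played no part in training, which is why they probe generalization more strictly than S3.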
Their performances were evaluated using the root-mean-square error (RMSE) and the coefficient of determination (R²). The ML techniques (KNN, PLSR, SVR, and RF) were implemented using the Scikit-Learn package in Python 3.6.
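For reference, the two metrics can be computed as follows. The text does not state which definition of R² was used; this sketch assumes the common form 1 - SS_res/SS_tot (the squared correlation coefficient is another frequent choice), and the measured and estimated values are made up for illustration.

```python
import math

def rmse(measured, estimated):
    """Root-mean-square error between measured and estimated values."""
    n = len(measured)
    return math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, estimated)) / n)

def r_squared(measured, estimated):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and estimated trait values.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]

print(f"RMSE = {rmse(measured, estimated):.4f}, R2 = {r_squared(measured, estimated):.4f}")
```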

VI-Based Method (M1) for the Estimation of EWT and LMA
The VI presented in Table 2 were employed to build the statistical relationship between VI and EWT or LMA using a linear or an exponential function. The regressed model with the highest accuracy was selected as the optimal model. Table 3 shows these optimal models under the three sampling strategies, i.e., S1, S2, and S3. The results indicate that the selection of the training dataset is highly important for the VI-based model, because different sampling strategies can result in quite different optimal regressed models. For example, the MSI-based method yields three exponential models for the estimation of EWT under the S1, S2, and S3 sampling strategies, and, as Table 3 shows, the parameters of these models vary significantly. The optimal VI-based models were further validated using the validation dataset of the corresponding sampling strategy (i.e., S1, S2, and S3). Table 4 shows the validation performances of the VI-based models. Generally, these models provided satisfactory performance, as the estimated and measured EWT and LMA were highly correlated and the corresponding RMSE was relatively small. Among these VI, MSI and NDMI were the most sensitive indices to EWT and LMA, respectively. The VI-based models adjusted using MSI or NDMI generally provided the best performance in terms of RMSE and R² (i.e., a lower RMSE and a higher R²). Figure 2 illustrates the correlation between the estimated and measured EWT and LMA using the two VI-based models (i.e., using MSI for EWT estimation and NDMI for LMA estimation).
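The model selection step of the VI-based method can be sketched as below. Several choices here are assumptions rather than the authors' procedure: the exponential form y = a·exp(b·x) is fitted by least squares on log(y), accuracy is measured as the RMSE on the training data, and the index and trait values are synthetic.

```python
import math

def fit_linear(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) via a linear fit on log(y); requires y > 0."""
    log_a, b = fit_linear(x, [math.log(yi) for yi in y])
    return math.exp(log_a), b

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def select_model(x, y):
    """Fit both candidate functions and keep the one with the lower RMSE."""
    a_lin, b_lin = fit_linear(x, y)
    a_exp, b_exp = fit_exponential(x, y)
    candidates = {
        "linear": (rmse(y, [a_lin + b_lin * xi for xi in x]), (a_lin, b_lin)),
        "exponential": (rmse(y, [a_exp * math.exp(b_exp * xi) for xi in x]), (a_exp, b_exp)),
    }
    name = min(candidates, key=lambda key: candidates[key][0])
    error, params = candidates[name]
    return name, params, error

# Synthetic index/trait pairs (hypothetical), generated from an exponential law.
vi = [0.2, 0.4, 0.6, 0.8, 1.0]
trait = [0.05 * math.exp(-2.0 * v) for v in vi]

name, params, err = select_model(vi, trait)
print(name, params, err)
```

Refitting on a different training set changes `params`, which mirrors the sensitivity to the sampling strategy reported in Table 3.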

ML-Reflectance-Based Method (M2) for the Estimation of EWT and LMA
The machine learning techniques presented in Section 2.3 were utilized for the estimation of EWT and LMA. All the leaf reflectances in the 900-2400 nm spectral region were selected as the inputs to the ML techniques, i.e., each leaf sample has 1501 variables. Figure 3 and Table 5 show the performances of EWT and LMA estimation using the ML-reflectance-based model. Among these ML techniques, SVR provided the most accurate estimation of EWT under every sampling strategy. Notably, it significantly outperformed the other ML techniques when the training and validation datasets were independent, i.e., under the S1 and S2 sampling strategies. For the S3 sampling strategy, although SVR provided the best performance, the differences among the RMSEs of these ML techniques were not significant, as shown in Figure 3 and Table 5. As for the estimation of LMA, the four ML techniques provided overall comparable performances.
Moreover, the R² values calculated using the ML-reflectance-based method for EWT estimation were higher than those for LMA estimation. For EWT estimation, the R² values were all higher than or equal to 0.8. For LMA estimation, however, the R² values rarely exceeded 0.8 and were typically around 0.7 or less.

Table 5. Performance of the ML-reflectance-based models for the estimation of EWT and LMA. The italic bold number gives the best performance under a given sampling strategy.

ML-VI-Based Method (M3) for the Estimation of EWT and LMA
The VI presented in Table 2 were employed as inputs of the ML techniques, i.e., the six vegetation indices sensitive to EWT (WI, NDWI, SRWI, NDII, MSI, and DWI) were used for EWT estimation, whereas the four vegetation indices sensitive to LMA (NDLMA, NDMI, ND, and RI$_{1368,1722}$) were used for LMA estimation. Figure 4 and Table 6 show the performances of EWT and LMA estimation using the ML-VI-based model. Generally, the performance presented in Table 6 is similar to that in Table 5. With the exception of the S1 sampling strategy, SVR provided the most accurate estimation of EWT, i.e., under the S2 and S3 sampling strategies. As for the estimation of LMA, PLSR, KNN, and SVR provided the best performances under the S1, S2, and S3 sampling strategies, respectively. RF worked less well under these three sampling strategies.
Moreover, the R² values calculated using the ML-VI-based model for EWT estimation were slightly higher than those for LMA estimation. Compared to the ML-reflectance-based model, the R² values calculated using the ML-VI-based model were improved, especially for the estimation of LMA, as shown in Figure 4 and Table 6.

Performances of the Data-Driven Methods
The performances of the three data-driven methods (M1, M2, and M3) on the remote sensing of EWT and LMA at the leaf level were assessed under three sampling strategies (S1, S2, and S3) in this study. The widely used LOPEX and ANGERS datasets were adopted for the evaluation. Overall, the regressed models using MSI and NDMI provided the best performances in M1 for the estimation of leaf EWT and LMA, respectively (Table 4 and Figure 2); SVR was the most effective technique for the estimation of leaf EWT in M2 and M3; as for the estimation of LMA, the four ML techniques (i.e., KNN, PLSR, SVR, and RF) provided comparable performances in M2 and M3 (Tables 5 and 6, Figures 3 and 4).
Theoretical and experimental studies have suggested that the influence of leaf biochemical constituents on leaf reflectance and its combinations generally does not follow a linear function [18]. According to the optimal VI-based models for the estimation of EWT and LMA in Table 3, an exponential function generally provided the best performance, and a linear function performed better in only a few cases. The exponential functions calibrated using MSI and NDMI were the most recommended functions for the estimation of leaf EWT and LMA, respectively. They provided satisfactory performances under almost every sampling strategy, which was attributed to their strong sensitivity to the corresponding biochemical constituents and insensitivity to other confounding factors [27].
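As an illustration of how an exponential relationship between a single index and a leaf trait can be calibrated, the sketch below fits y = a·exp(b·x) by ordinary least squares on ln(y). The fitting procedure is an assumption (the study may have used a different optimizer), and the index and trait values are synthetic, not taken from LOPEX or ANGERS.

```python
import math

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by ordinary least squares on ln(y)."""
    n = len(x)
    ln_y = [math.log(v) for v in y]
    mean_x = sum(x) / n
    mean_ln = sum(ln_y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_ln) for xi, li in zip(x, ln_y))
    b = sxy / sxx
    a = math.exp(mean_ln - b * mean_x)
    return a, b

def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

# Synthetic calibration data: a hypothetical index and a trait that decays
# exponentially with it (true a = 0.05, true b = -2.0).
index_values = [0.4, 0.6, 0.8, 1.0, 1.2]
trait_values = [0.05 * math.exp(-2.0 * v) for v in index_values]

a, b = fit_exponential(index_values, trait_values)
fitted = [a * math.exp(b * v) for v in index_values]
print(a, b, rmse(fitted, trait_values))
```

Fitting in log space minimizes relative rather than absolute errors, so on noisy data the coefficients can differ from those of a direct nonlinear least-squares fit.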
M2 aimed to build a nonlinear relationship between the leaf reflectance in the 900-2400 nm spectral region and the leaf biochemical constituents. Comparison between Tables 4 and 5 suggested that M2 provided slightly better performance than M1 for the estimation of EWT. This improvement was attributed to two factors: more useful spectral information was included in the training and validation processes in M2, and the ML techniques could represent a much more complex nonlinear relationship between the inputs and outputs than an exponential function [35]. However, for the estimation of LMA, M2 provided much worse performance than M1. This result was likely caused by the predominant water absorption in the 1300-2400 nm spectral region and by leaf reflectance in the 900-1300 nm spectral region that carries little information for LMA estimation [6,17]. It has been reported that excluding the 900-1300 nm spectral region could improve the performance of LMA estimation [2].
The difference between M2 and M3 was the input to the ML techniques. Table 7 presents the RMSE of M2 and M3 for the estimation of leaf EWT and LMA. Using VI as input to these ML techniques provided improved performances. For EWT estimation, M3 outperformed M2 in 9 of 12 cases, and the average RMSE was reduced by 5.7% (=(2.5147 − 2.3703)/2.5147). For LMA estimation, M3 outperformed M2 in all 12 cases, and the average RMSE was reduced by 41.5% (=(2.8337 − 1.6581)/2.8337). Vegetation indices are mathematical combinations (ratios, differences, normalized differences, etc.) of reflectances at several bands, with at least one band at which the leaf biochemical constituent strongly absorbs radiation. The mathematical combination could also suppress the sensitivity to other confounding factors, such as the leaf surface reflection [52]. Therefore, vegetation indices were preferred over the reflectance of the whole 900-2400 nm spectral region as the inputs to the ML techniques.
Table 8 presents the comparison of M1 and M3 for the estimation of leaf EWT and LMA. The VI (i.e., MSI and NDMI) and the ML technique (i.e., SVR as a representative) that showed good performances in M1 and M3 were selected. Overall, M3 outperformed M1, i.e., using several VI as inputs to the SVR provided better performance than using a single vegetation index as input to the exponential function for the estimation of both EWT and LMA. The average RMSE indicated that M3 could reduce the error by 1.8% (=(2.3515 − 2.3090)/2.3515) and 12.4% (=(1.6655 − 1.4585)/1.6655) for the estimation of EWT and LMA, respectively. These results can be explained by two factors: the SVR used several vegetation indices, which provided additional information about the biochemical constituents, and it represented a more complex nonlinear relationship between the vegetation indices and the biochemical constituents than an exponential function could. Therefore, M3 is suggested for future estimation of EWT and LMA when using the data-driven methods.
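The relative error reductions quoted above can be reproduced from the average RMSE values given in the text. The short sketch below does only this arithmetic; the function name and dictionary labels are illustrative.

```python
def relative_reduction(baseline_rmse, new_rmse):
    """Percentage by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse * 100.0

# Average RMSE values as quoted in the text (Tables 7 and 8).
reductions = {
    "EWT, M3 vs M2": relative_reduction(2.5147, 2.3703),
    "LMA, M3 vs M2": relative_reduction(2.8337, 1.6581),
    "EWT, M3 vs M1": relative_reduction(2.3515, 2.3090),
    "LMA, M3 vs M1": relative_reduction(1.6655, 1.4585),
}

for label, value in reductions.items():
    print(f"{label}: {value:.1f}%")
```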

Potential and Limitations of the Data-Driven Methods
Great attention has been devoted to data-driven methods in past decades for the remote sensing of leaf biochemical constituents, especially methods involving ML techniques [17,53]. This study confirmed the good performance of the ML-VI-based method for remote sensing of EWT and LMA. The method integrates the advantages of VI and ML techniques, which makes it insensitive to potential confounding factors and able to capture the complex nonlinear relationship between VI and leaf biochemical constituents. It thus provides a promising tool for further studies on the remote sensing of leaf biochemical constituents.
Studies have reported that the most sensitive bands may differ with vegetation types and experimental conditions [20,23,45], which consequently affects the values of VI. Therefore, selection of the best wavelengths is critical for the ML-VI-based method. Investigations are needed to identify more consistent bands that could be applied to a wide range of vegetation types and experimental conditions. Moreover, with the development of hyperspectral instrumentation and radiative transfer theory, new VI that are sensitive to changes in leaf biochemical constituents may be found and defined [27]. These new VI can introduce more useful information on leaf biochemical constituents, and thus are likely to provide further improvements in the ML-VI-based method.
A common drawback of the data-driven methods is that their performance is determined by data quality and by discrepancies between the training and validation data, which limits their generalization ability when a trained method is applied to different vegetation types or experimental conditions. The discrepancies include, but are not limited to, noise level, spectral resolution, and leaf species. It has been reported that the measurements of leaf reflectance in the NIR spectral region might be affected by experimental uncertainties [2,54], such as the noise present at longer wavelengths in the ANGERS dataset (Figure 1). Different sampling strategies could result in discrepancies between the training and validation datasets [35]. Noise level, presence of outliers and biases, and erroneous data might differ between the training and validation datasets, and thus the three methods (i.e., M1, M2, and M3) gave different performances under the three sampling strategies (i.e., S1, S2, and S3), as shown in Tables 4-6. Many strategies, such as using expert knowledge to enhance data quality and using simulated data during the training stage to reduce data discrepancies, might help to overcome this drawback [2,35]. However, such work has not been well documented and further investigations are needed.
The data-driven methods presented in this study were evaluated using leaf reflectance and its corresponding biochemical constituents at the leaf level. This lays the foundation for studies that adopt signals collected at the canopy level, such as those from the Hyperion [55] and AVIRIS [56] hyperspectral spectroradiometers. The applicability of these data-driven methods, especially the ML-VI-based method, at the canopy level needs to be further evaluated. Canopy reflectance models, such as the SAIL [57,58], DART [59,60], and stochastic [61,62] models, could be helpful because they bridge the optical properties at different scales, scaling from the leaf level up to the canopy level. At a higher scale, additional factors, such as the canopy structure, act as confounding factors and should be carefully accounted for. The directional area scattering factor is a canopy structure parameter defined as the canopy BRF if the canopy does not absorb any radiation [63,64]. It can be retrieved from the measured canopy BRF in the 710-790 nm spectral region without any ancillary information about leaf optics. The directional area scattering factor has been reported to be useful for suppressing the influence of canopy structure on the remote sensing of leaf biochemical constituents [65,66]. It is thus key to evaluating the data-driven methods on the remote sensing of leaf biochemical constituents at the canopy level.
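One way the retrieval of the directional area scattering factor (DASF) is commonly formulated can be sketched as follows. The sketch assumes the spectral-invariant relation BRF(λ)/ω(λ) = p·BRF(λ) + R in the 710-790 nm region, where ω is a fixed reference leaf albedo spectrum, p is a recollision probability, and R is a wavelength-independent term; setting ω = 1 (no absorption) then gives DASF = R/(1 − p). This relation, the reference albedo values, and the "true" parameters below are assumptions introduced for illustration and are not taken from this study; the reference spectrum is a fixed curve, not a measurement of the observed leaves.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def dasf_from_brf(brf, ref_albedo):
    """Estimate DASF by regressing BRF/albedo on BRF: intercept / (1 - slope)."""
    ratio = [b / w for b, w in zip(brf, ref_albedo)]
    slope, intercept = fit_line(brf, ratio)
    return intercept / (1.0 - slope)

# Hypothetical reference leaf albedo at 710, 730, 750, 770, and 790 nm.
ref_albedo = [0.55, 0.75, 0.88, 0.93, 0.95]

# Synthetic canopy BRF generated from the assumed relation with p = 0.6 and
# R = 0.2, so the expected DASF is 0.2 / (1 - 0.6).
p_true, r_true = 0.6, 0.2
brf = [r_true * w / (1.0 - p_true * w) for w in ref_albedo]

dasf = dasf_from_brf(brf, ref_albedo)
print(round(dasf, 4))
```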

Conclusions
In this study, the performances of three types of data-driven methods with different sampling strategies were compared for the estimation of EWT and LMA using leaf reflectances in the 900-2400 nm spectral region. The data-driven methods included the VI-based method (which built a linear or an exponential relationship between VI and leaf biochemical constituents), the ML-reflectance-based method (which built a nonlinear relationship between leaf reflectances and leaf biochemical constituents using ML techniques), and the ML-VI-based method (which built a nonlinear relationship between VI and leaf biochemical constituents using ML techniques). VI that have been reported to be sensitive to leaf EWT and LMA were used, which resulted in the selection of six EWT-related indices (WI, NDWI, SRWI, NDII, MSI, and DWI) and four LMA-related indices (NDLMA, NDMI, ND, and RI1368,1722). Four ML techniques, i.e., KNN, PLSR, SVR, and RF, were selected as representatives of the most widely used ML techniques. The independent LOPEX and ANGERS datasets collected over multiple herbaceous and woody species were adopted for the evaluation.
Our results showed that the ML-reflectance-based method outperformed the VI-based method for the estimation of EWT. However, it provided a less accurate estimation of LMA than the VI-based method, possibly because leaf reflectance in the 900-1300 nm spectral region carries little information for LMA estimation. The ML-VI-based method generally provided better estimations of leaf EWT and LMA than the VI-based method and the ML-reflectance-based method. It inherited the advantages of vegetation indices and ML techniques, which made it sensitive to changes in leaf biochemical constituents and capable of solving nonlinear tasks. Overall, compared to the ML-reflectance-based and VI-based methods, the ML-VI-based model with SVR reduced errors by 5.7% (41.5%) and 1.8% (12.4%) for the estimation of leaf EWT (LMA), respectively.
To improve the accuracy and generalization ability of the data-driven methods, further investigations are needed on the selection of better wavelengths, the definition of new vegetation indices, the enhancement of data quality, and the reduction of data discrepancies. Moreover, the performance of the ML-VI-based method for the estimation of EWT and LMA at the canopy level needs to be investigated. In such investigations, special attention should be paid to additional confounding factors, such as the canopy structure, which may significantly affect the performance of the data-driven methods for remote sensing of leaf biochemical constituents [63,67].