Machine learning based estimation of land productivity in the contiguous US using biophysical predictors

Estimation of land productivity and availability is necessary to predict land production potential, especially for the emerging bioenergy crop production, which may compete land with food crop production. This study provides land productivity estimates in the contiguous United States (CONUS) through a machine learning approach. Land productivity is defined as the potential in producing agricultural outputs given biophysical properties including climate, soil, and land slope. The land productivity is approximated by the potential yields of six major crops in the CONUS, i.e. corn, soybean, winter wheat, spring wheat, cotton, and alfalfa. This quantitative relationship is then applied to estimating the availability of marginal land for bioenergy crop production in the CONUS. Furthermore, the levels of uncertainties associated with land productivity and marginal land estimates are quantified and discussed. Based on the modeling results, the total marginal land of the CONUS ranges 55.0–172.8 mha, but the 95% inter-percentile distance of the estimated productivity index reaches up to 60% of its expected value in data-scarce regions. Finally, in a cross-check analysis, marginal lands estimated based on biophysical criteria are found to be comparable to those based on an economic criterion.


Introduction
The high precision of agricultural management nowadays calls for accurate estimation of land productivity, i.e. the capability of a piece of land in supporting agricultural production based on its biophysical conditions (e.g. climate, land slope, and soil condition) [1,2]. A pressing need of land productivity assessment is the identification of marginal land that is not highly productive (that should be used to produce food crops for the humanity) but can be used for growing dedicated bioenergy crops (e.g. miscanthus, switchgrass, and sweet sorghum). Existing studies that estimate land productivity and land availability for biomass production are mostly subject to significant uncertainties due to incomplete data [3,4], insufficient resolution [5], and even some unknown factors [5,6]. This study provides land productivity estimates in the contiguous United States (CONUS) through a machine learning approach, based on which marginal lands available for bioenergy production are estimated using both biophysical and economic criteria.
A simple method to estimate land productivity is to classify the land through simple soil taxonomy rules (i.e. empirical knowledge about land productivity and soil taxonomy features). Although the resulting land productivity index from even such a simple method is proved to correlate well with the observed county level crop yield [3], uncertainty treatment in the estimation requires advanced methods [6][7][8]. For example, national commodity crop productivity index (NCCPI) is another land productivity index that is derived through fuzzy logic models [4], which estimates the land productivity through a set of fuzzy if-then rules based on empirical knowledge and has been widely used for deriving land use decisions [9,10]. Limitations on uncertainty treatment exist in previous studies, including subjective components (e.g. the subjective choice of membership function and its parameter values in a fuzzy logic model [7]), unrecognized uncertainties in the input data for generating those indices [11][12][13], and complex relations between uncertainty sources in the land observation data (e.g. correlation between soil and land slope [6]). To address these limitations, this study updates our previous research on fuzzy logic models based land productivity [6] through a set of specially designed ML models, which (i) avoid relying on empirical knowledge and subjective judgements that can occur with the use of a completely data-driven approach; (ii) provide an estimate of the uncertainty associated with the ML model prediction; and (iii) automatically extract the information embedded in data by assuming no pre-defined relationship between the predictors and the predictand. In addition, this study is benefited from updated remote sensing data with improved spatial resolutions (up to~250 m resolution, see table 1). We use longterm yields of six major crops (corn, soybean, winter wheat, spring wheat, cotton, the five most grown crops in CONUS [14]; and alfalfa, a major crop in high slope regions [14]) in the contiguous United States (CONUS) as proxies of land productivity, and use machine learning (ML) algorithms to estimate the potential yields of these six major crops and their associated uncertainties at~250 m resolution. A productivity index is then derived based on the estimated long-term average potential yields of the six major crops during 2008-2017. To the best of our knowledge, this is the first productivity index that links directly to the actual crop yields through advanced ML techniques. This quantified relationship together with a set of biophysical and economic criteria are applied to estimating potential land available for bioenergy crop production (here referred as 'marginal land') at~250 m resolution in the CONUS. Overall, this study is expected to contribute to the existing literature of agricultural land management and biofuel development in the following aspects: (i) accurate estimation of land productivity through a machine learning approach, (ii) improved identification of marginal lands for producing bioenergy crops, and iii) reliable quantification of uncertainty involved in land productivity and marginal land estimations. Figure 1 provides a general flow chart for deriving the land productivity index and potential marginal lands. The first step is to collect and preprocess spatial data from table 1. Detailed characteristics of data from table 1 and their preprocessing procedures are introduced in Section I in the supporting information (SI), while their usage in this study is introduced in the subsequent sections. All the spatial data are projected to the World Geodetic System (WGS) 84 coordinate and resampled to 250 m resolution using the 'Bilinear' (for continuous data) and 'Majority' (for categoric data) resampling methods in ArcGIS [24]. In addition, all time-series of the data in table 1 are collected during 2008-2017, and their average values over these years are calculated and used in the ML models.

Model overview
We then adopt a two-step ML approach for each of the six major crops: we first downscale the county-level crop yield data (table 1) to a~250 m resolution based on the remote sensing gross primary productivity (GPP) data (equation 2). Second, we estimate the pixel-level potential yields and yield uncertainties across the CONUS based on a model trained with the downscaled crop yield data from equation 1 as the target and climate, land slope, and soil properties as inputs. Finally, we estimate the land productivity index and marginal land through a set of rules. Figure 2 provides more details about the estimation of crop yields and their associated uncertainties through the two-step ML approach. The first step crop yield downscaling is implemented through a Gaussian process model (GPM), and the second step yield estimation through a random forest model (RFM). GPM is a Bayesian inference method that has been proved to be able to provide excellent accuracy estimations as well as error bars (i.e. uncertainties) for geophysical studies [25][26][27]. The solid statistical foundation and nonparametric structure of GPM make it appealing especially for applications with small datasets (e.g. our case for downscaling county-level crop yield data), where the information-to-noise ratio is low [25,28]. For larger datasets (e.g. the downscaled crop yield at~250 m resolution), GPM is no longer suitable because of its high computational burden [29]; while neural network (especially deep neural network) and RFM are among the most widely used ML models and those with the best performing [25,[30][31][32][33]. Our preliminary test shows neural network and RFM have a similar performance. RFM is selected because of its simpler structure and lower computation burden. The tree-based RFM also holds an advantage of 'built-in' resistance to overfitting because of its 'bagging' process [34], while other popular ML models require major modifications to achieve such performance [35,36].

Crop yield downscaling and modeling
The first step GPM is used to estimate the following variable α at pixel-level (~250 m), which is further used to estimate downscaled crop yield: where α i is the ratio between crop yield YD i and the growing season GPP value of a specific crop i (GPP i ). The GPMs are trained with long-term (2008-2017) average county level crop yield data as the target and county average temperature and water availability data as inputs, as suggested by other studies [37][38][39].
The water availability information is represented by two variables: the ratio of evapotranspiration (ET) over precipitation and the percentage of land being irrigated. GPM assumes α i to follow a normal distribution as shown in figure 2, a common assumption in crop yield estimation models [40]. Section II in the SI provides details about the development of GPM for estimating α i , and table 1 shows the sources of input data. A 5-fold cross-validation as suggested by Yadav and Shukla [41] is used to validate the GPMs trained in this study. It should be noted that, when estimating pixel-level α i , the ratio of irrigated land (one GPM input to represent the water availability information) is set to zero, which will decrease the land productivity estimate for irrigated land. By this treatment, the land productivity index would only represent the 'natural' productivity of land based on the biophysical properties, but not the 'artificial' increase of yield from irrigation. The GPMs are trained at a county level, where all the inputs and targets reflect the countyaveraged values. Then the GPMS are reapplied at a pixel level to estimate the pixel-level crop yields. Since the temperature and water infrastructure conditions (the GPM inputs) are usually homogeneous within a county (especially with the large spatial scope of the  Percentages of the area falling into eight levels of slope classes: 0%-0.5%, 0.5%-2%, 2%-5%, 5%-10%, 10%-15%, 15%-30%, 30%-45%, and >45% Soil Soil depth, soil available water storage and soil organic carbon in 6 levels of soil: 0-5 cm, 5-20 cm, 20-50 cm, 50-100 cm, 100-150 cm, and >150 cm GPMs), the additional uncertainty caused by the difference between the resolutions in the GPM training and application phases would be limited. As illustrated in figure 1, the second step RFM in our two-step ML approach is then trained with downscaled crop yield data from the first step GPM. To further capture the uncertainty associated with the GPM, for each crop, an ensemble of random forests are developed, each trained with a stochastic realization of normally distributed downscaled crop yields in GPM (figure 2); the variances among the ensemble of random forests are included as one source of uncertainty in the final RFM crop yield estimation. We sample an ensemble of downscaled crop yields from the GPM to prepare the training targets for the subsequent RFM in figure 2 by equation (2): where variables with hat ' ' refer to their pixel-level estimated values from a trained GPM, and the superscript in parentheses ' (j) ' refers the ensemble number. Inputs to the RFM include the climate variables, slope information, and soil information (table 2). Table 1 provides references for the data sources of the variables shown in table 2.
To account for the uncertainty associated in RFM, we adopt a modified bootstrapping approach (MBA) [42] (figure 2). We train a separate random forest with each of the stochastic realizations of the downscaled crop yields (equation 2), where each random forest consists of a small number of regression trees. To balance the estimation accuracy and computation burden, the number of separate random forest and the number of regression trees in each random forest are selected as 50 and 10 respectively according a previous study [42], which results in 500 separate regression trees for each RFM and a 20% Monte Carlo error for variance estimation according to the MBA [42]. The expected yield for each crop i YRFM i t ha −1 is calculated as the average of all the 500 regression tree outputs; the variance of YRFM i calculated based on the within/between random forest variances.
A separate RFM is developed for each of the major crops. Data in table 2 are split into training (50%) and testing (50%) sets, and around 36.8% of the training data are unused in the training process and are retrieved as an out-of-bag (oob) validation set for hyperparameter selection [43]. Details about the development of RFM and hyperparameter selection are provided in section III in the SI.

Estimation of productivity
We then approximate the land productivity using the maximum value of the normalized potential yield across all six major crops, as estimated in section 2.2. To make the potential yields of different crops comparable, we normalize the deterministic RFM estimated yields for each crop with its 99th percentile: where NY i is the normalized potential yield for crop i, and YRFM 99 i the 99th percentile of YRFM i t ha −1 . The land productivity index, defined as maximum productivity (MP), is then calculated as the maximum value of NY i over all major crops for each pixel: where VS denotes a viable set of crops for a particular pixel. A specific crop i is said to be 'viable' for land pixels in counties that have a record of growing crop i (identified based on the CDL data). Since the performance of data-driven models like RFM tend to deteriorate for input combinations not seen during the training phase, the introduction of a viable set concept helps avoid such a problem. If the viable set VS is empty for a particular pixel, MP is assigned the NY i value of winter wheat at that pixel because of its wide spread over the CONUS (figure S2 in the SI). Such an operation might result in overestimation/underestimation of the MP value of that particular pixel, but we do not anticipant a significant effect on the overall pattern of the productivity in the CONUS, given that it is uncommon that a pixel would have empty VS [14]. Also, as all the row crops considered in this study are mostly grown under low slope conditions (more than 99.9% are grown at lands with average slope classes smaller than 4, while the slope class ranges from 1 (most flat) to 8 (most steep), as shown in table 2), MP of grid cells with average slope class larger than 4 are assigned the NY i values of a pasture crop alfalfa (which is more likely to be grown in slope lands) to avoid overestimating productivity at those regions [44]. We quantify the uncertainty of MP through a Monte Carlo approach, and calculate the MP value of one Monte Carlo realization ( NY i (j) ) as: where the superscript j is the realization number, is sampled based on the RFM estimated variance of NY i (equation S8, section III in the SI). We randomly generate 1000 realizations of MP (j) and calculate its 95% inter-percentile distance (IPD), the distance between 2.5th and 97.5th percentile of the aforementioned 1000 realizations of MP (j) ) to represent the uncertainty associated with MP. A larger value of IPD represents a higher level of uncertainty.

Identification criteria for potential land available for energy crop production
One potential application of the derived MP index is to identify marginal land available for bioenergy crop production. We develop a series of scenarios for marginal land identification based on biophysical criteria, and cross-check and justify the results using an economic criterion.
Two aspects of land properties are considered to develop biophysical criteria: the productivity and current land use [6]. For land productivity, the MP values are classified into three categories: high, medium, and low, based on the distribution of MP values of currently cultivated land. For current land use, similar to our previous study [6], two land use scenarios are used in combination with the productivity categories to identify marginal land: one scenario constrains marginal land to only current cultivated land, including crop land and pasture land, with low to medium productivity; the other scenario is bolder and assumes current grassland and shrubland with medium productivity could also be identified as marginal land for growing energy crops. However, the current study does not consider existing forest lands with improved land use classification that differentiates forest land from a coarse classification of mixed cropland, forest, grassland, and shrubland [19] as potential land for bioenergy production due to a growing concern on producing energy crops at the expense of deforestation.
The thresholds for breaking the productivity categories could be identified based on fuzzy logic rules [6], or by constraining the productivity estimate with other variables, e.g. land cover [45] or environmental vulnerabilities [46]. Since such constraining variables are not directly available, we test a series of thresholds via a trial and error method, and identify two sets of thresholds (which will be further validated with profit estimate) on productivity classification: criterion I assigns P25 and P50 (i.e. 25th and 50th percentiles) of the current crop land MP values over the CONUS as the break points from low to median and from medium to high productivity, respectively; criterion II assigns P10 and P25 as those break points. It should be noted that many other combinations of thresholds could be used, but the above two combinations are shown here to represent a realistic range for marginal land acreage. Criterion I represents an aggressive scenario that the maximum MP value of marginal land is higher than the MP values of the vast majority (i.e. 90th percentile) of existing grassland and shrubland (table S1 stacks.iop.org/ERL/15/074013/ mmedia); criterion II represents a conservative scenario that the maximum MP value of marginal land reaches just the medium MP values (i.e. 50th percentile) of existing grassland and shrubland. According to the productivity classification criteria and land use criteria as specified above, four scenarios of marginal land estimation are formulated, as displayed in table 3 in the result section.
We use the commonly used economic criterion of positive profit to cross-check the marginal land estimation based on the biophysical criteria [47,48], i.e. the land that does not pay off the investments and costs when growing regular crops is identified as marginal land. The maximum potential profit (MPF) is calculated as: where F i is the potential profit of growing crop i  (7) is identical to that in equations (4) and (5). If the MPF of a land pixel is negative, the land might be marginal from an economic perspective. The marginal land estimate based on the economic criterion is used as a cross-check with those based on the biophysical criteria. Theoretically, the estimate of marginal land conforms to the economic theory on farmers' land use decision in real life. However, the national level economic data available (crop prices and land rent rates) only has a state-level resolution and may involve large unidentified uncertainty. Also, crop prices and land rents are dynamic and change with the evolution of market conditions, and it is difficult to predict their change beforehand. Therefore, the economically identified marginal land cannot be used directly to identify candidate locations for energy crop production; instead, the marginal land acreage and its spatial distribution are only used to cross-check if any major inconsistencies exist between the marginal land estimated from the biophysical and economic criteria. After the marginal lands are identified, we apply an irrigation filter to exclude any land pixels with existing irrigation infrastructures (refer to figure S3 for the regions with existing irrigation infrastructures). This follows our assumption that marginal land for bioenergy crops will not include irrigated land, i.e. the current land with irrigation facilities is used for food and fiber crops, especially high-valued crops such as vegetable and fruit in arid and semiarid regions; future land development for bioenergy production usually does not consider irrigation due to existing water stress around the world [49].

Machine learning performances
The GPM for crop yield downscaling has good performance in the 5-fold cross-validation results shown in figure S3 in the SI. In terms of the coefficients of determination (R 2 ) and root mean square error (RMSE), the GPM has at least the same level of performance as the work by Marshall et al [37], who developed a crop production efficiency model to adjust yield estimates of corn, soybean, and wheat using GPP. The GPM performances for cotton and alfalfa are also comparable with other studies [50,51].  figure S6) of 95% confidence intervals of the crop yield estimations are 0.95, 0.90, 0.92, 0.93, 0.86, and 0.83 for corn, soybean, winter wheat, spring wheat, cotton, and alfalfa, respectively, which are considered acceptable given the 20% Monte Carlo error allowed for our study (see section 2.2).

Spatial distribution of productivity
The derived MP value shows similar general patterns as other productivity indices (e.g. NCCPI [4] and the soil productivity index [3]). The value appears to be high in the Midwest corn-belt region ( figure 3(a)) that is usually considered as the most productive region in the US. Major attributes of the corn belt region include flat topography, medium temperature, sufficient amount of rainfall (the new figures 3(c)-(e)), and deep and fertile soil, representing high land productivity given appropriate drainage facilities to reduce extra soil moisture before the crop growing season. The western US along and west to the Rocky Mountains has low MP values, mostly due to the high slope ( figure 3(e)) and low precipitation ( figure 3(c)). Between the corn belt and the western US is a corridor with moderate MP values. The lower Mississippi region and California Central Valley are major agricultural regions in the US but show moderate MP values as compared to that in the corn belt. The major reason for relatively low MP values in these two regions is the water deficit from precipitation and evapotranspiration, and irrigation is necessary (figure 3(f)). Note that the irrigation impact is removed when calculating the MP value (section 2.1).
The productive corn-belt region shows relatively small IPD values (which accounts for 10%-20% of the expected MP values in these pixels, figure 2(b)). In comparison, this ratio between IPD and MP in the western US and high slope regions reach to over 60% ( figure S7). Such high IPD values could mainly be attributed to the smaller number of observed crop cultivation in those regions (and thus more unseen input combinations in the RFM model) as is shown in figure S2, and the complex impact of slope on the potential crop yields [52,53].
The MP values of current cultivated land are higher than other land covers (as can be seen in figure S8, showing the histograms of MP values of current land uses). The ranking of MP values for each current land use category from high to low (based on their percentiles shown in table S3) is: cultivated, developed (urban), forest, herbaceous, shrubland, and barren land.

Spatial distribution of marginal land available for bioenergy crop production
The marginal lands identified according to different biophysical criteria S1-S4 show a considerable level of variability in both their spatial distributions (figure 4(a)) and total area (table 3). The total area of marginal land ranges from 55.1 (from scenario S2) to 175.6 mha (from scenario S3), which fall in the range of 43-179 mha reported by other studies [6,47,54,55]. Several marginal land 'hot spots' that appear in all scenarios S1-S4 are identified, including the corridor region between the Midwest corn belt and the western US, part of the lower Mississippi region and California central valley (that is cultivated but not irrigated according to our land use and irrigation data), and southeastern states.
As can be seen in figure 4(b), the marginal lands identified in S1 and S2 mostly associate with high probability, suggesting high confidence in identifying these lands as marginal. Among all four scenarios, the 'hot spots' of marginal lands identified in figure 4(a) are associated with high probabilities, which suggests an agreement of the marginal land 'hot spots.' Less confidence is found with the marginal lands identified in the western US from S3 and S4. This is mainly because of the lack of crop data and the impact of complex landscapes (i.e. frequently varying slopes) in these regions. We then conduct a cross-check by comparing the marginal land results (S1 and S2) based on the biophysical criteria to those derived from the economic criterion (MLEC). Figure 5 shows a similar spatial pattern of marginal land identified from the three criteria; the total area of marginal land of MLEC is 73.8 mha, which falls in the range between those of S1 and S2 (table 3). This supports the productivity thresholds as tested in section 2.4. Nevertheless, we acknowledge that land productivity is one among many factors (e.g. the economic profitability, environmental vulnerability, and accessibility of land) for identifying actual marginal land potentially available for bioenergy crop production.

Implications and limitations
This study contributes to the existing literature of land productivity estimation by adopting a datadriven approach that incorporates the relationships embedded in the data in estimating land productivity. This study is also the first to provide an estimate of uncertainty associated with the land productivity and marginal land, which is proved to be non-trivial. Though our scope of land productivity estimation and marginal land identification is limited within the CONUS, the ML model we have developed is general and could be applied to other regions and for solving other issues (e.g. hydroclimate modeling [56] and geochemistry analysis [57]) in the world (see a recent example of ML based estimation of land suitable of growing cassava [58]).
The high IPD values in some regions (e.g. those with a high slope) can result in a chance to significantly overestimate or underestimate the land productivity in these areas. Therefore, we anticipate substantial uncertainty with the marginal land estimates in those regions ( figure 4(b)). The uncertainty must be assessed with caution before the land estimate is used for land use decisions, especially for energy crops, considering the crop market risk that can be caused by the land availability uncertainty. Usually the model performance could be improved when the model scope moves from a national level to a local level (state or county) because of the inclusion of locally relevant data. The model performance could also be improved by incorporating the technique of transfer learning, i.e. inserting the knowledge learned from a large dataset to a model trained with a smaller dataset. If the processes represented by the two datasets are similar, transfer learning could significantly lower down the uncertainty of the model trained with the small dataset [59].
Our quantification of uncertainty in land productivity and marginal land estimation might be affected by several assumptions made with the twostep ML approach and crop yield post-processing procedures. First, errors in uncertainty estimation might propagate through the two-step ML approach, and one potential source of this error could be the Monte Carlo sampling error (MC error) for the RFM. In general, the Monte Carlo error is smaller if more individual regression trees are included in an RFM, and our choice (500 trees) results from a balance of accuracy (about 20% MC error) and the computation burden. Future studies could consider to include more regression trees or adopt advanced variance reduction methods (e.g. Wager et al [60]) to potentially reduce the MC error. Second, our calculation of MP value is based on the assumption that all the six major crops are equally good indicators of land productivity. Such an assumption is made to allow practically meaningful calculation of MP value for agricultural decisions, but it might underestimate some uncertainties associated with the potentially different yield-environment responses for different crops. Incorporating crop specific processes and criteria for more accurate land productivity uncertainty estimation goes beyond the scope of this study and will be one of the future research issues following this study.
This study trains the model for calculating MP values with 10-year average crop yields for the major crops in the CONUS, and the potential impact of yield increase as a result of technology improvement [61,62] is not considered as a factor. Since the MP is calculated in a comparative manner, we expect our estimate of MP to be stable from time to time. However, the MP value could potentially be changed as a result of climate change or new crop development. For example, the northwestern US might be more productive in the future as a result of more humid climate [8]. Also, the southwestern US might be more productive if new, reliable drought-tolerant crops [63] are developed and popularized.
The marginal lands in this study are identified assuming no irrigation and no deforestation for bioenergy crop. The land acreage would be increased if irrigation is allowed for growing bioenergy crop, especially for the intensively irrigated areas in the lower Mississippi and California central valley. Also, this study identifies marginal lands with biophysical properties potentially suitable for solving the competition between food and bioenergy production in a long-term manner, but the actual lands available for growing bioenergy crop in a short-term could largely be affected by a series of factors: economic viability [64], farmer's willingness [65], environmental vulnerability [46], infrastructure availability [66], agricultural policies [67], etc. It could be further complicated by some potential environmental consequences of large-scale energy crop plantation, e.g. loss of biodiversity [68], degradation of soil quality [69], and additional use of water [70]. Local criteria for marginal land identification could be developed to reflect the region-specific biophysical and socio-economic conditions. The marginal land identified in this study could be used as a base to consider these conditions for further exploration of land availability for bioenergy crop production.

Acknowledgments
This work was funded by the DOE Center for Advanced Bioenergy and Bioproducts Innovation (U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under Award Number DE-SC0018420). Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the U.S. Department of Energy. We acknowledge the suggestions on the presentation of methods and results from three anonymous reviewers. We thank Avin Arefzadeh for collecting the crop yield and economic data, and Professor Madhu Khanna, Professor Kaiyu Guan, Dr. Chongya Jiang, and Dr. Deepayan Debnath for commenting on the methods and results.

Data availability
The data that support the findings of this study are openly available. The soil property data is stored at www.nrcs.usda.gov/wps/portal/nrcs/detail/ soils/home/?cid=nrcs142p2_053628. The slope data is available at https://webarchive.iiasa.acat/Research/ LUC/Products-Datasets/global-terrain-slope.html. Monthly temperature and precipitation data are available at https://daymet.ornl.gov/. Annual evapotranspiration data is available at www.esrl. noaa.gov/psd/data/gridded/data.narr.html.
Cropland Data Layer land use data is available at www.nass.usda.gov/Research_and_Science/Cropl and/SARS1a.php, and National Land Cover Database land use data is available at www.mrlc.gov/. The irrigation map can be found at https://earlywarning.usgs.gov/USirrigation. The gross primary productivity data is available at www.ntsg.umt.edu/project/landsat/landsat-pr oductivity.php. Crop price, crop yield, land rent data are available at https://quickstats.nass.usda.gov. Crop specific production cost data can be found at www.ers.usda.gov/. The maximum productivity index and the identified marginal lands for the biophysical and economic criteria, as well as their associated uncertainty are freely available at the Illinois Data Bank https://databank.illinois.edu/datasets/IDB-4584681.