Probabilistic global maps of crop-specific areas from 1961 to 2014

Agriculture has substantial socioeconomic and environmental impacts that vary between crops. However, information on how the spatial distribution of specific crops has changed over time across the globe is relatively sparse. We introduce the Probabilistic Cropland Allocation Model (PCAM), a novel algorithm to estimate where specific crops have likely been grown over time. Specifically, PCAM downscales annual and national-scale data on the crop-specific area harvested of 17 major crops to a global 0.5-degree grid from 1961 to 2014. To do this, pixels are assigned into probability clusters based upon crop-specific pixel suitability (based on mean climate and soil characteristics) and gridded historical agricultural areas. PCAM maps compare relatively well with an existing gridded dataset of crop-specific areas circa 2000 (simple matching coefficient value >0.8 for all crops). PCAM estimates compare less well with time series county-level agricultural census data for the United States. Importantly, deviations between census data and PCAM benchmark estimates (driven by soil and climate suitability) can be used to infer the importance of other factors of agricultural production (e.g. labor, agricultural policy, extreme climate) in future work. Our results provide new insights into the likely changes in the spatial distribution of major crops over the past half-century.


Introduction
Agriculture is responsible for feeding the growing global population and is also one of the dominant ways in which humans impact the Earth system [1,2]. Covering approximately 12% of the land surface as of 2000 [3], croplands now comprise one of the largest terrestrial biomes and continue to grow [4]. Among other impacts, croplands alter biogeochemical cycling [5], lead to forest clearing [6,7], consume vast quantities of water that far exceed use by any other human activity [8], alter local, regional, and global climate [9][10][11], and degrade soil quality [12]. Specific crops impact natural resources and Earth surface processes in unique ways. For example, fertilized corn production in the US Midwest is responsible for increased nitrogen loads to the Gulf of Mexico [13], while wheat production in Northern India is rapidly depleting groundwater aquifers [14]. Similarly, rice production throughout Asia is a significant contributor to global atmospheric methane [15], while expanding soybean cultivation has converted large areas of South America's savannas and dry forests to cropland, and appears set to do the same in Southern Africa [16]. Estimating how the distribution of specific crops has likely evolved over time will enable better assessments of environmental system dynamics.
Agricultural management practices have changed substantially over the past several decades [17]. Although total global cropped area increased by about 18% since the mid-1900s, yields increased by 28% during the same time period [4]. This intensification of production was enabled by the increased use of fertilizers, pesticides, mechanization, irrigation, and cultivar improvements that are associated with the 'Green Revolution' [18,19]. These productivity gains are substantial and have enabled production to outpace population growth. However, yields of the 'Big 4' (i.e. maize, rice, wheat, and soy) are increasing slower than the 2.4% per year rate required to double global production by 2050 [20]. Additionally, extreme weather events have negatively impacted cereal production over the last several decades [21,22]. Extreme events are projected to increasingly impact agriculture in the future, with different impacts depending on the crop and climate hazard. To determine historical and future exposure of crops to extreme events, it is important to identify where specific crops have been grown.
Despite their large consequences, understanding of these changes and their impacts is primarily confined to the national scale, as the only globally comprehensive, annual crop area dataset is the country-level statistics provided by the United Nations Food and Agriculture Organization (FAO) [4]. FAO provides an open-access source of global agricultural statistics which is widely utilized and provides a standard for national level agricultural statistics. FAO provides data on the area harvested [hectares] for each crop-country-year. Areas harvested multiple times in a single year are counted more than once in the national total. This means that the harvested area may exceed the physical area of the cropland that they are grown on. FAO also provides information on the production [tonnes] for each crop-country-year.
Previous studies have created sub-national, cropspecific distributions and production maps. Monfreda et al [23] developed a statistical disaggregation technique to downscale the FAO crop-specific areas to a 5 min resolution, using a gridded cropland area map as the disaggregation target [3]. Portmann et al [24] built on that effort by using crop calendars to temporally disaggregate the dataset to a monthly time step. However, both datasets represent the year 2000 crop distributions, and the disaggregation procedure does not account for distributional variations between crops below the scale of the administrative unit providing the crop area statistics [23]. You et al [25] developed the Spatial Production Allocation Model (SPAM), which is a crop allocation model that incorporates comparative advantage and potential economic value to spatially distribute crop production. SPAM uses a host of input data and a cross-entropy approach to make plausible estimates of the distribution of 42 crops under two production systems [25,26]. As with the model presented by Monfreda et al [23], the SPAM dataset is also circa 2000 and does not vary in time, although model output for the years 2005 and 2010 has recently been generated [27,28].
A number of datasets provide gridded agricultural land use histories, but not for specific crops. These include a modeled reconstruction of various land use distributions, including agriculture, for the past 300 [29,30] and 12 000 years [31,32]. These historical reconstructions were developed under Land Use Harmonization projects to create merged historical and future projected land use time series, which provided the forcings for Climate Model Inter-comparison Projects [33]. Additionally, Ramankutty and Foley [34] presented a historical reconstruction of total cropland for the period 1700-1992. Pongrantz et al [35] modeled cropland and pasture from 800 to 1992.
Despite this substantial progress in understanding agricultural geography, there is relatively sparse information on the spatial distribution of individual crops over time. The goal of this study is therefore to estimate crop-specific agricultural geography in time.
Half degree spatial resolution maps of the annual distributions of 17 major crops for the period 1961-2014 are created. To generate these maps, a new cropland allocation algorithm was developed-the Probabilistic Cropland Allocation Model (PCAM). PCAM disaggregates the FAO's national census information [4] onto the global agricultural land use grids provided by the History Database of the Global Environment (HYDE) [32,36]. The presented approach differs from previous disaggregation methods by using crop-specific land suitability surfaces provided by the Global Agro-Ecological Zones (GAEZ) database [37] to probabilistically allocate crop areas into grids. GAEZ provides gridded crop-specific suitabilities based upon mean climate and soil characteristics. Additionally, information on crop-specific calendars-including multi-cropping-is used to allocate crops in space.
The potential utility of these probabilistic maps is demonstrated by using them to answer two key questions: (1) how has the spatial distribution of key crops changed globally over the study period? (2) Which countries, crops, and time periods have experienced significant area changes? This dataset is expected to provide deeper insights into how environmental dynamics have changed when used as an input in Earth system models, which currently rely on temporally static or generic crop types in their land inputs(e.g., [38][39][40]).

Methods
The PCAM was developed to estimate gridded and annual crop-specific harvested area. Figure 1 outlines the framework. The approach relies on several key input data that are at different spatial resolutions (refer to table 1). In general, national census data is downscaled to geographic grids. To do this, gridded data on the crop-specific suitability of each pixel (based upon climate and soil characteristics) is incorporated with the fraction of each pixel that has historically been cropland. Then, national level information was probabilistically allocated to each pixel. The approach incorporates a Monte Carlo algorithm, with random selection across multiple trials to obtain a likelihood estimate. This novel algorithm enables researchers to estimate a global crop-specific, gridded product in time. The data sources, algorithm, and key assumptions are described in more detail below.   [4]. The FAO census data was used as the national allocation constraint in the PCAM algorithm (see next section). Crop-specific suitability index (SI) rasters were obtained from GAEZ [37]. GAEZ is developed by the FAO and the International Institute for Applied Systems Analysis and provides information on the suitability of each pixel for specific crops based on mean climate and soil characteristics. Soil characteristics are principally based on the Harmonized World Soil Database with supplemental sub-national and regional information [41]. Mean climate information is derived from four general circulation models [42]. GAEZ has information for up to 280 crops/land utilization types under alternative input and management levels for historical, current and future climate conditions [37]. The SI values were selected based on intermediate agricultural inputs for both rainfed and irrigated water supply systems. The intermediate input level takes into account both subsistence and commercial production [42]. This GAEZ data provides an integral, gridded product for the PCAM allocation routine.
The History Database of the Global Environment (HYDE) provides 12 000 years (10 000 BC to 2017) of land use data at 5 arc minute spatial resolution [31,32,43]. HYDE provides gridded estimates of total cropland based on historical patterns of human populations. It does not specify the particular crops grown, with the exception of rice. From HYDE, time series information on the fraction of each grid cell that is cropland (rainfed and irrigated) as well as the fraction that contains rice was retrieved. This information is available decadally from 1960 to 2000 and annually from 2000 to 2014. Information on the total cropland of each pixel is used to constrain area accounting in the PCAM algorithm.
The FAO counts multi-cropped areas multiple times in the national statistic. This means that if the same land parcel is used twice in the same year, the area of this parcel will be counted twice [4]. For this reason, those pixels with multi-cropping were determined for better accounting in the algorithm. To incorporate pixel-scale information on multi-cropping, the secondary crop calendars provided by Sacks et al [44] were used for barley, maize, oats, rice, sorghum, and wheat. For example, if a pixel multi-crops rice in the data provided by Sacks et al [44], then the algorithm will count the pixel area twice if it is selected in the PCAM model.
The spatial scale, temporal domain, and crops considered were based upon the available input data.
The study period was restricted to 1961-2014, since this is the time period for which FAO national agricultural statistics were available. The 0.5-degree spatial resolution was chosen for comparison with several existing global, gridded agricultural datasets (e.g. [23]). Finally, 17 crops were selected for consideration in this study (see SI for list). These 17 crops are selected because all of the required input data is available for them. These 17 crops represent 48%-62% of global agricultural production [tonnes] and 68%-75% of global harvested area for the years 1961-2014 (see figure S1 for details available online at stacks.iop.org/ ERL/14/094023/mmedia).
The gridded data (i.e. GAEZ, HYDE, and crop calendars) are interpolated from a 5 arc minute grid to a 0.5-degree grid using a nearest neighbor approximation. A panel dataset was constructed from these three gridded data layers. To do this, information at the pixel level was combined by using each pixel's unique latitude-longitude pair. This enables researchers to construct a panel that has crop-specific information and varies in time. Each pixel with multicropping activities is assigned an indicator variable for the specific crops that are multi-cropped in that location.
In accordance with the UN, national boundaries from 1961 to 1990 were fixed to the 1990 boundaries. Dynamic country boundaries were obtained from the Cshapes v.0.6 package in R [45,46]. These boundaries allow researchers to capture key events such as the dissolution of the Soviet Union. Pixels are paired with countries by using the boundaries active on 31 December for each year during the study time domain. For countries not found in Cshapes, the Database of Global Administrative Areas [47] was used. The pixel is labeled using the International Organization for Standardization (ISO) 3166-1 alpha-3, which provides each country with a unique 3-character code. These codes are used to match national statistics to pixels during the allocation process.

Algorithm development
The PCAM downscales national census data to a global grid. To do this, PCAM follows 5 major steps.
• Step 1: Assign pixels to probability clusters so that national information can be assigned to the most likely and suitable pixels.
• Step 2: Sort national crop-specific production (and corresponding harvested area) information for each year.
• Step 3: Assign the harvested area of crops to pixels in descending order of national crop production until the national harvested area statistic is met. This step is repeated for each produced crop.
• Step 4: Perform probabilistic downscaling many times using a Monte Carlo framework to obtain likely outcomes. • Step 5: Repeat Steps 2-4 for all country-years in the study domain.
Additional details on these steps are provided here.
Step 1: Construct probability clusters. The goal of PCAM is to assign national information on the total area of cultivation for specific crops to pixels. Thus, it is essential to determine how to distribute national (lumped) information in space. A major novelty of the approach is the classification of pixels into probability clusters. These underlying probabilities are then used to probabilistically allocate national values. Gridded probabilities are based on a combination of GAEZ and HYDE data. First, pixels with HYDE agricultural areas >0 are available to allocate to. Then, pixels are assigned to probability clusters according to their underlying climate and soil SI derived from GAEZ data. SI values as provided by GAEZ are shown in table 2. Note that these values are pixel-crop specific. The SI categories were then clustered into six probability clusters in which the greatest probability (i.e. '1') is assigned to 'very high' quality pixels and the lowest probability (i.e. '6') is assigned to pixels that are 'not suitable'. Since FAO does not differentiate between varieties of rice (dryland versus wetland) and millet (foxtail versus pearl), generic rice and millet values were derived.
GAEZ information on the crop-specific SI of each pixel for rainfed and irrigated water supply systems is used to further refine probability clusters. Many locations that are not well-suited for a specific crop under rainfed conditions may be well-suited for that crop if irrigation is available. Unfortunately, time-varying irrigation infrastructure information is not available. Time-varying irrigation maps would enable researchers to determine where and when irrigation infrastructure was available to meet crop water demands. In the absence of such information, high quality rainfed pixels were assigned a more likely probability cluster than its irrigated counterpart, since it was not known if access to irrigation is available. In this scheme, moderate-very high suitability rainfed pixels were prioritized over comparable irrigation pixels. 'Marginal', 'very marginal', and 'not suitable' lands were treated as being equal regardless of water supply. Figure 2 displays how probability clusters vary across crops within a country. In general, irrigation improves suitability as shown by an expansion of areas containing lower SI values. For example, the map of SI values for rainfed maize (see figure 2(A)) fairly closely resembles the cornbelt in the United States. Now, if irrigation was provided to maize, then many more locations throughout the US would be highly suitable for maize (figure 2(B)). The probability cluster assignment assumes that crops will be preferentially grown using rainfall (rather than irrigation) when the soil and climate characteristics are conducive to it. The right column in figure 2 presents an example of the resulting probability clusters for the four major crops in the United States. As figure 2(C) illustrates, the probability clusters for maize in the US will preferentially allocate maize to pixels that are most suitable under rainfed conditions (i.e. pixels with value of '1'), with less likelihood of allocation to pixels that are suitable only when irrigation is available (i.e. pixels with value of '3').
HYDE provides space and time-varying pixel-level information on rice harvest areas. However, the total harvest area as reported by HYDE is an order of magnitude less than what is reported by FAO during the study period (see SI). Thus, the HYDE rice information was used as a baseline to identify 'active' rice pixels. Then a nearest neighbor approximation is applied to identify the three nearest neighbors to these known active rice pixels. These pixels are also labeled as active rice pixels to accommodate the excess rice area in the Table 2. Global Agro-Ecological Zones (GAEZ) suitability index (SI) values and resulting probability clusters for use in the Probabilistic Cropland Allocation Model (PCAM). Clustering applies to pixels in countries where the HYDE cropland fraction is greater than zero for both rainfed and irrigated water supply systems across all crops. Final clustering is based on prioritizing rainfed over irrigation water types. FAO database. For rice, the pixels that have been identified as active either from HYDE or the nearest neighbor identification are automatically placed in the first cluster (value of '1').
Step 2: National statistics analysis. The production [tonnes] and harvested area [hectares] for each country-year in the study domain were retrieved from FAO. First, the number of crops produced by each country for each year was obtained. The allocation algorithm begins by ranking the production information for all crops in a country-year in descending order. If a country has rice production, their rankings were adjusted to have rice be the first allocated crop. The remainder of the crops are then allocated in order of their production ranking. In other words, the algorithm is initialized with the known rice spatial locations and then sequentially moves through the rest of the crops in decreasing order of production. For example, maize is the most produced crop in the United States (in tonnage). However, the United States also produces rice; it is the seventh most produced crop in the year 2000. Rice would be allocated first, and then maize. This approach ensures that known gridded information is incorporated while still ensuring that the crops that are produced in the greatest quantity are prioritized for allocation.
The rice harvest area was adjusted if HYDE information is available. Rice cropland is aggregated for each country and compared to the data provided by FAO. If FAO indicates more rice harvest areas then HYDE at the country-level, the HYDE-based rice area was deducted from FAO. This net area becomes the rice harvest area constraint as outlined in Step 3.
Step 3: Harvest area assignment. National harvested area data is used as an adding-up constraint in the algorithm. Pixels are assigned harvested area until the total harvested area allocated reaches the national value reported by FAO or the HYDE adjusted value in the case of rice. The pixel-level cropland fractions provided by HYDE form a spatial distribution of generic crop-related activities across both space and time. This distribution was used as the random pixel selection probabilities during harvest area assignment. Thus, pixels are probabilistically selected until the following harvested area constraint is met: where p is pixel, c is crop, y is year, and i is country. Thus, the harvest area assigned to all pixels in a country-year should approximately sum to the FAO national harvested area statistic. An approximation is used because the random accumulation of pixel areas will not necessarily be perfectly equal to the national harvested area. The PCAM algorithm was constructed such that it can be consistently run for all countries in the entire time domain. There are countries in which the number of crops produced exceeds the number of pixels available to be allocated to. In these instances, each pixel was divided into equivalent area slices. These slices are determined by a time series of maximum crops produced by countries where the number of crops exceeds the number of country-labeled pixels. Each annual value is used to determine the number of slices a pixel is divided into for PCAM (see figure S2(a)). For example, in the year 2000, the maximum number of crops (from the PCAM crop list) produced across all countries is 11. Therefore, and regardless of country, each pixel will be divided into 11 slices, which provides a pixel 11 opportunities to be randomly selected. The slices available are adjusted if a pixel is identified by HYDE for rice. The area attributed to rice is converted into slices. These slices are then deducted from the total slices available. For example, a known rice pixel in the year 2000 may have a rice area that is equivalent to two slices. Therefore, it has a net of 9 slices available for random selection across all crops.
Two steps are taken when a pixel is randomly selected. First, the area of the pixel's slice is added to the accumulated harvest area (as in equation (1)). Second, the pixel's selection counter is reduced by one across all crops. However, the probability of selection was not altered as the selection counter is adjusted. When the pixel's selection counter equals zero, it is removed from the potential pool of pixels across all crops. Note that the slice area is doubled if the selected pixel has been designated as multi-cropped for a specific crop. With this approach, it was assumed that each pixel with multi-cropping capabilities will multicrop in a given year. For this reason, the area in those pixels was double counted (since this is what FAO does).
In the event that the national adding-up constraint has not been satisfied and all clustered pixels have been allocated, pixels with a HYDE cropland fraction equal to zero are then considered. These pixels are clustered based on their original cluster assignment. For example, a pixel with zero cropland but an original cluster assignment of '1' would be placed in cluster 7. Since the selection probability is equal to the cropland fraction, these extra pixels are assigned a probability equivalent to the ratio of pixels in the cluster eligible for allocation to the total number of outstanding pixels. Thus, the probability for these zero cropland pixels is dynamically calculated each time a pixel is selected.
Step 3 was run many times in a Monte Carlo framework. Since crops were probabilistically assigned to pixels, it is important to perform this operation multiple times in order to determine the most likely outcome across many trials.
For this reason, the model was run 500 times and keep track of the allocation in each model run. A counter is used to track the number of times a pixel is selected for a given crop across all cycles. The number of crop-specific probabilistic selections was converted into an estimate of the crop-specific harvested area with the following equation: where f is fraction, p is pixel, c is crop, i is country, and y is year. N p c i y , , , is the number of crop-specific selections for a pixel in a country for a given year across all cycles. N total is the number of total cycles run. N slice y , is the number of slices the pixel is divided into for a given year. Area p c y , , is the pixel area.
Step 5: Apply the framework to all countries. The core of the algorithm (Steps 2-4) is repeated for all country-years in the study domain. Upon completion, individual country results are bound together for each year to construct annual global PCAM results. The annual area is checked to determine if it is balanced with FAOSTAT data for each country in a given year. In the event that the areas do not balance, then the results were scaled so that each crop's total harvest area matches with the FAO (see SI for scaling factors used).

Model assumptions
The statistical downscaling algorithm makes several key assumptions. First, the algorithm relies on the national census data of FAO. Errors in this database would significantly alter the results. This is especially problematic for countries which may have political incentives to under or over-report their agricultural areas. For example, Seto et al [48] discuss China's propensity to over-report agricultural areas for political motives. Similarly, national standards of agricultural data collection remain poor in many regions, such as in sub-Saharan Africa [49]. Second, errors in the gridded data layers would carry over to the dataset. Third, a probabilistic downscaling approach was employed. The approach assumes that mean climate and soil characteristics are the most important factors in determining where specific crops are grown. This neglects several other important factors (e.g. extreme climate, agricultural labor, policies and regulations, etc). The probabilistic approach means that estimates of likely crop-specific areas for each year in the study period were produced. However, PCAM maps should not be misinterpreted as providing actual information on where each crop was grown in each year.
The GAEZ data on the suitability of each pixel was used for specific crops. This is a key gridded data layer that, unfortunately, does not vary in time. Ideally, the SI classes provided by GAEZ would vary in time. However, these SI values estimate the crop-specific suitability of each pixel based on average climate and soil conditions, with 1961-1990 as the baseline. It is likely that these underlying mean climate and soil characteristics remain relatively stable over the study period. However, the suitability of some locations for specific crops has changed in this time domain; northern China has become increasingly suitable for maize cultivation since the 1980s [50]. This would mean that the approach which is based on these time-invariant SI classes is appropriate for characterizing mean suitability and allocation. However, these SI classes will become increasingly problematic to use as the climate diverges from the 1961-1990 base period for which the SI classes were developed.
Certain crops are more likely to be grown in rainfed and irrigated conditions. Unfortunately, the available data does not enable a determination of which crops are grown in rainfed versus irrigated conditions over time. Modeled crop-specific irrigated areas circa 2000 are available [51], but not at the annual temporal domain necessary for this study. Additionally, crop-specific pixels from Monfreda et al [23] were used to force the irrigation model of Portmann et al [51]. The agricultural maps that underpin the irrigation spatial model are fixed in time and thus inconsistent with the PCAM approach. For this reason, the circa 2000 irrigation maps were not used, but instead assume that crops are preferentially assigned to high-quality rainfed locations. However, it is likely that certain crops, such as high-value crops, will be preferentially grown where more certain irrigation water supplies are available. It was anticipated that this is particularly likely to be the case for high-value crops (e.g. produce, tree and vine crops, etc), and should not be as problematic for the major staples considered in this study, with the exception of rice [51].

Model assessment
Quantitative methods were used to assess PCAM performance against existing agricultural maps. The four types of metrics used are: (1) R 2 ; (2) mean absolute error (MAE); (3) the Jaccard coefficient (J); and (4) the simple matching coefficient (SMC). The first two metrics consider the 'intensity' of the pixels. These two metrics were considered to be the most strict in assessing model performance. The R 2 value is from a Type II linear regression of the estimated cropspecific pixel values of harvest area fractions on existing maps. The R 2 regression also includes pixels where the harvest fraction is zero. J and SMC are focused on estimating spatial similarity in terms of the presence and absence pixel-level harvest areas.
The Jaccard (J) [52] and SMC [53] are defined according to equations (3)-(4), respectively. J gives an approximation of how similar PCAM results are for identifying crop-specific areas. Similarly, SMC allows consideration of both the presence and absence of harvest areas. In other words, it is possible to quantify if PCAM is allocating crops to known harvest areas as well as not allocating to known non-harvest areas. The comparisons were performed by using identical spatial resolutions across datasets. For the global comparison, the results from Monfreda et al [23] were interpolated from its native 5 arc minute spatial resolution to the 0.5-degree resolution of PCAM. On the sub-national scale, PCAM was compared to the United States Department of Agriculture's (USDA) census for the years 1997 and 2012 [54]. County-level boundaries were obtained from the Newberry Library [55] for 1997 and the United States Census Bureau [56] for 2012. These boundaries and their associated Federal Information Processing Standard codes are used to match data from the USDA to its spatial corollary. Then, PCAM results were aggregated from the 0.5degree resolution to the county-level to match the spatial resolution of the USDA census data. Cassava, groundnut, rapeseed, rye, and yam were excluded from sub-national assessment in the United States due to poor data availability for the assessment years.

Sensitivity analysis
PCAM heavily relies on the HYDE and GAEZ gridded datasets. In particular, the hierarchical clustering that boosts allocation towards the most suitable pixels is largely dependent on GAEZ. A combination of each crop's rainfed and irrigation values were used with both set for intermediate input levels. However, rainfed suitability is also available from GAEZ for low and high agricultural inputs, while irrigation data is also available for high agricultural inputs. GAEZ defines low agricultural inputs to be management practices aligned with subsistence farming, whereas high agricultural inputs use advanced management practices for complete commercial production [42].
To determine the sensitivity of PCAM estimates to GAEZ inputs, various combinations of probability clusters were run based on different GAEZ input scenarios. Specifically, the impact of changing rainfed input levels on PCAM results at the global scale for the year 2000 was considered. The rainfed inputs' variability was the focus since rainfed cropland historically dominates our time domain (see figure S2(b)). The metrics defined in section 2.4 were used to understand how PCAM maps change relative to the Monfreda et al [23] dataset.

Results and discussion
The   provides additional validation of the GAEZ-based approach to allocation, since PCAM does a good job at replicating sub-national patterns that are not incorporated in it. There are moderate differences in the global area of each crop between PCAM and Monfreda et al [23], driven by the underlying FAO data which serves as a constraint in PCAM (see table 3). For example, PCAM has 12.4% more millet harvested area globally than does Monfreda et al [23].
Differences between FAO and Monfreda et al [23] harvested area mean that PCAM results should not be expected to perfectly match Monfreda et al [23]. This is because PCAM total agricultural area is equivalent to FAO harvested area, by design. This is confirmed in tables 3 and 4, where the FAO and PCAM values are the same. Figure S2(c) shows a time series of Goldewijk [31], FAO [4], and PCAM area. Note that Goldewijk [31] reports much less cropland area than does FAO [4]. PCAM areas have been calibrated to match FAO [4] which is why PCAM and FAO values align in tables 3 and 4. Surprisingly, Monfreda et al [23] and FAO [4] do not contain the same quantity of harvested area. Table 4 provides country-level analysis for the top 10 countries by harvest area for the four major crops. The starting national area from FAO deviates from the total area in Monfreda et al [23], as at the global scale. In some countries, the performance of PCAM is not as good as at the global scale. For example, the global SMC for maize is 0.87 (see table 3), but 9 of the top 10 countries do not approach a value this high at the national scale (see table 4). The United States has the highest SMC for maize with a value of 0.85. However, PCAM demonstrates higher R 2 values at the country scale than it does globally. For example, the global R 2 value for maize is 0.18 versus 0.62 and 0.46 for the United States and Argentina, respectively.
Harvest area discrepancies between FAO and Monfreda et al [23] exist across a supermajority of country-crop pairs in a non-uniform manner. Thirteen of the 17 crops analyzed showed more reported area by FAO than by Monfreda et al [23] at the global scale. However, when country-level harvest areas are considered, larger area discrepancies were observed. For example, FAO/PCAM has 33.3% more maize harvested area in South Africa than does Monfreda et al [23]. These differences in national harvested area contribute to the differences between PCAM and Monfreda et al [23] (see figure 3).

National scale
The United States has agricultural census data available at the sub-national scale for several points in time from the USDA. The harvest fraction estimates provided by equation (2) are compared to county-level information provided by the USDA. Cassava and yams were not compared as they are not grown in the United States. Census information was obtained for 1997 and 2012 for 12 of our 17 assessed crops [54]. These years were chosen as 1997 is the first readily available electronic census with county-level information and 2012 is the most recent census available in the time domain. Table 5 compares PCAM with USDA data. PCAM provides the best pixel-level intensity matching for maize and rice as demonstrated by their high R 2 Table 3. Global comparison between PCAM and Monfreda et al [23] for the year 2000. Note that PCAM estimated areas match FAO data by design. R 2 , MAE, J and SMC metrics are presented to quantify the similarity between PCAM and Monfreda et al [23]. Area (%) displays the percent difference in area between PCAM and Monfreda et al [23]. For example, PCAM has 12.7% more millet harvested area globally than does Monfreda et al [23], due to the underlying FAO data. Differences between FAO and Monfreda et al [23] harvested area suggest that PCAM results will never perfectly match Monfreda et al [23].   [23] for the year 2000 for the top 10 countries by harvest area. R 2 , MAE, J, and SMC metrics are presented to quantify the similarity between PCAM and Monfreda et al [23]. Area (%) displays the percent difference in area between PCAM and Monfreda et al [23]. For example, PCAM has 19.8% less rice harvested area in India than does Monfreda et al [23].
Harvest area (10 6 hectares)  table 5), which means the PCAM algorithm is being run with a different national constraint. Maps of changes in harvest area for maize, soybeans, rice, and wheat from 1997 to 2012 are presented in figure 4 for the United States. PCAM spatial trends are disappointing and typically estimate gains in harvest area that the USDA data does not support. This disparity highlights a potential shortcoming of the approach, which is that PCAM allocated to suitable pixels based on mean climate and soil characteristics. This means that PCAM may be capturing crop planted area spatial trends better than harvested area. This is because the planted area decision will largely be driven by mean suitability, but extreme weather may impact the eventual locations of crop harvest. This is what would have happened in the year 2012 in the US, which was an extreme drought year in the US corn and soy belt [57]. This extreme event will be captured by the USDA 2012 census data on harvested area. However, the drought would not show up in the stationary pixel-scale suitability driving PCAM spatial allocation. The national harvested area statistic should capture this information, yet FAO and USDA data diverge for the US for this year.
Note that PCAM tends to underestimate declining areas compared to USDA data when considering whole country pixel trends. For example, PCAM projects that 2.1% of maize-related pixels will decline in intensity versus 14.6% based on USDA. Similarly, PCAM estimates that 5.4% soy-related pixels will decline in intensity versus 10.5% for USDA. There is also an underperformance of wheat in estimating the decline of wheat-related pixel (19.4% for PCAM versus 33.6% from USDA). However, PCAM is comparable to USDA when examining areas where pixels increase in intensity. For rice, PCAM and USDA are extremely close approximations (0.34% versus 0.67%). Thus, PCAM is able to closely capture the loss of rice harvest area in the Mississippi Embayment region. For soybeans, 13.8% of pixels are projected to increase according to PCAM compared to the 17.0% based on USDA. Figure 4 highlights that area harvested in the United States are driven by factors other than mean climate and soil. Many factors are likely important when determining where to grow crops, particularly in countries with advanced agricultural systems, such as the United States. However, only gridded information on cropspecific suitability driven by mean climate and soil was used. Other factors-such as labor, access to machinery, agricultural policies [58]-influence where specific crops are grown as much, or even more than, the physical variables that were used to guide the downscaling. Thus, divergences between PCAM estimates and census information, such as those provide in figure 4 may actually prove useful in future research that aims to identify other important factors guiding agricultural geography. Determining these factors is unfortunately beyond the scope of this project, but future work may aim to analyze these deviations and compare them with maps of other potentially important factors of agricultural production.

Crop-specific changes in time and space
First, changes in total cropland over time are examined. Figure 5 presents PCAM estimates of total cropland for the beginning, middle, and end of our study period for the 17 crops considered. The common scale across sub-figures makes it clear that both extensification and intensification of crop area have occurred over time. The intensities in total crop area increase significantly in 2010, driven by reported gains in crop area from FAO (see figure S2(c)). In 1970, 28.2% of the global land area was estimated to be cultivating at least one of the 17 crops with a mean harvest fraction of 0.09. By 2010, PCAM estimates a slight expansions of the global area that is harvested to 30.2%, but the mean cropland fraction has increased to 0.11. The observed trends in both modest intensification and extensification is driven largely by differences in crop area reported by FAO and HYDE (as shown in figure S2(c)). Table 6 summarizes the PCAM area estimates at the beginning and end of the study period by crop (see  [23] as discussed above). PCAM's output of pixel-level harvest fractions was treated as proxies for the likelihood of a crop being present in a given pixel. By extension, PCAM enables estimates of projected changes of a crop's likelihood over time to be determined as the relative change in these harvest fractions. As a result, it is possible to further identify two extremes of crop-specific pixel behavior. The first extreme is if a pixel converts from having a non-zero to a zero likelihood. These pixels are then considered inactive. The other extreme is where a pixel is estimated to increase its likelihood by several orders of magnitude. This latter case is a surrogate of pixel-level crop intensification. These extreme pixels are referred to as 'super pixels'. Figure 6 shows changes in time for each of the four major crops across both domains (see figures S12-S14 for the rest of the crops). Table 7 quantifies these spatial changes by presenting the fraction of pixels that have demonstrated crop-related increases and decreases over each sub-domain. A majority of pixels show no change over either period. In other words, these pixels have likely remained identically zero over time. Between 1970 and 2000, a majority of the crops studied (nine of 17) have more pixels that increased their likelihood than decreased. The remaining crops show a greater percentage of pixels that declined. Similarly, for the second period, 10 crops were estimated to have a net gain in active pixels while the remaining seven will have a net decline. Sweet potatoes is the additional crop demonstrating these gains in pixel activation.
The number of pixels activated for maize is estimated to increase by 6.95% by the year 2000. This increase is being driven by the 35% of pixels who significantly increase their likelihood. China (21%), Mozambique (8.4%), India (8.1%) are drivers of this increase. In particular, China and Mozambique see their maize likelihoods increase by 54% and 169%, respectively. For the 2000-2010 period, maize is projected to have 53% more pixels that increase their likelihood, rather than decrease. China is the estimated leading driver of this change. Its maize pixels increase their likelihood by 30.5% and it is home to 25% of the super gaining pixels.
For soybeans, there are 53% more pixels that are estimated to have increased than decreased the likelihood with a mean relative change of 283% for 1970-2000. This large increase in likelihood is being driven by 35% of its pixels being super gaining. The United States (22.3%), Brasil (12.5%), and Argentina (9.1%) are where most of this extreme extensification is occurring for 1970-2000. For Argentina, these projections highlight a case of both extensification and intensification as their mean likelihood increased by 3000%. Paraguay also illustrates substantial extensification of soy (i.e. gaining pixels), which indicates that PCAM is able to capture areal expansion of crops. For 2000-2010, Brasil (27.6%), the United States (13.6%), and Russia (7%) lead the global expansion of soybean pixels.
Wheat is another crop that is anticipated to have a continuous pixel-level expansion over both domains. It has 7.65% and 8.6% more pixels that increase rather than decrease in likelihood for Cassava is projected to have 19.7% pixels increase in likelihood than decrease, while the mean pixel increases by 5.15% from 1970 to 2000. Nearly 25% of cassava pixels are expected to become inactive. Brasil (18.6%), Congo (15.1%), and Indonesia (10.4%) are the leading countries for pixel loss. Brasil sees a modest increase in pixel likelihood of 3.23%, which coincides with a dip in global ranking from sixth to seventh. For 2000-2010, cassava experiences a nominal 2.4% increase in mean likelihood change with 20.1% more pixels estimated to increase their likelihood than decrease. These latter changes are being driven by Brazil (15.3%), Congo (10.1%), and Nigeria (7.0%).
Groundnuts are considered to have 14.98% more pixels gain likelihood than lose by 2000. The global mean change in likelihood improves by 7.48%. Approximately 33% and 31% of pixels experience extreme gain and loss, respectively. India (21.3%), Brasil (8.4%), and the United States (7.5%) are home to most of these losses. Note that Brasil is not a top 10 country by area for groundnuts between 1970 and 2000. However, India contained the most harvest area globally during this time. While India lost pixels, the remaining active pixels are projected to have a mean Table 6. Global cropland area changes in time. The global area is provided by crop for the start (1961) and end (2014) of the study time period as estimated by PCAM. The percent change in area over this time period is also shown. The percent of total cropland is shown for each crop for the start and end of the study time period. Note that soybean increases from 3.2% to 12% of all harvested area during this period.  Russia is the leading driver of heterogeneous change in the global sunflower distribution as it is home to 23.9% of pixels that alter their sunflower activity level. Conversely, the United States (27.6%) is the leading driver of sunflower pixel loss.
Yams are estimated to have 38% and 32.5% more of its pixels increase than decline in likelihood with changes of mean likelihood increasing by 3.67% and 1.03% for 1970-2000 and 2000-2010, respectively. These are considered to be negligible changes over the course of the study domain. Nigeria is the leading country over the entire domain.
It is important to note that PCAM spatial patterns are likely most representative of planted area. This is because PCAM uses SIs to probabilistically allocate crops to pixels. This is most likely capturing the locations that a crop is suitable to be planted in. However, the algorithm downscaled harvest area information. This means that there is a potential mismatch in spatial locations. These differences are likely to be small in locations and time periods in which crops were both planted and harvested in the most suitable locations. However, in time periods and places that experienced significant divergence-say, due to drought-these spatial estimates may be problematic. Recent work has shown that extreme weather events have reduced cereal production in recent decades, with drought events impacting both harvested area and yield, whereas temperature extrema have only impacted grain yield [22]. As such, this distinction between planted and harvested area estimates is important to keep in mind when using PCAM estimates, particularly in times of drought. Future work could bring PCAM area estimates together with information on climate extremes to better estimate harvested area.

PCAM sensitivity
The sensitivity of PCAM model output was analyzed using different scenarios from GAEZ. Specifically, the GAEZ inputs used were varied to evaluate the pixelscale suitability of specific crops. All PCAM results presented thus far are based upon the 'intermediate' input scenario produced by GAEZ. Here, PCAM was run with the 'low' and 'high' input scenarios for rainfed agriculture. Again, PCAM model output was compared with Monfreda et al [23]. Both the low and high rainfed inputs produce PCAM outputs that are similar (SMC>0.8 ) to the Monfreda et al [23] data (see table 8). Note that the SMC of each crop was weighted by its global harvest area fraction for the year 2000 to obtain an overall weighted SMC. The low versus intermediate and high versus intermediate scenarios differ by 0.12% and 1.24%, respectively. Therefore, there is limited impact on spatial similarity as a result of altering agricultural input scenarios.
Each crop's statistical distribution was examined to further elucidate the effect of changing GAEZ inputs on PCAM estimates. Table S4 shows how these properties and the number of contributed pixels fluctuate. Figure S4 presents box plots of harvest fraction for maize, rice, soybeans, and wheat as a function of these GAEZ inputs (see figure S15 for the remainder crops). A suppression in pixel intensity was observed for the high rainfed results due to significant increases in the number of pixels selected. Similarly, the low inputs scenario produces the largest pixel intensities and is generally based on having selected the fewest number of pixels. This makes sense, as more pixels are suitable for selection when many inputs are available in agriculture; fewer pixels are suitable when fewer inputs are used. There is a comparable number of outliers (i.e. excessively high harvest fractions) across the four dominant crops and inputs scenarios (see figure  S4). Overall, the intermediate scenario is nestled between the two inputs extrema. Thus, the intermediate rainfed PCAM outputs was treated as baseline values, whereas the low and high form the upper and lower bounds, respectively. Many additional sensitivity analyses can be performed by future researchers. Future work can evaluate the 'truthfullness' of national harvested area and run PCAM with multiple national constraints and weight the most truthful statistic. For example, FAO was relied upon for the truth in agricultural statistics. However, national values presented by [23] or USDA may be better to use. Future research could determine the best national statistic for each country-year and then re-run PCAM to obtain better results. Additionally, future work could vary the probability clusters that underpin PCAM for each country-year. This could be done using an optimization algorithm that would determine the probability cluster assignment that most closely represents a 'target' database (e.g. [23] or USDA) and then use this cluster for the algorithm. More sophisticated versions of this would involved varying the probability clusters in time in a way that reflects the underlying suitability of pixels. Other factors that influence spatial allocation could be added, such as infrastructure (e.g. irrigation and/or road network) and proximity to urban centers. These extensions would build upon and improve the results presented here.

Conclusions
The PCAM model was introduced to downscale national agricultural census information to a geographic grid. This model enabled the creation of estimated gridded areas of individual crops at the annual time scale. Remarkably, this probabilistic approach performs reasonably well against a global dataset assembled with sub-national agricultural census information. PCAM global gridded maps of specific crops for each year from 1961 to 2014 are provided in the supplementary information, available at https:// doi.org/10.13012/B2IDB-7439710_V1 in order to ensure transparency and enable future research.
PCAM employs a probabilistic framework that enables estimation of likely locations of specific crops for each year for the last half century. There are many potential applications of both the algorithm and the maps that have been produced. In its current form, PCAM provides a unique opportunity to determine the other factors of production that are important in the spatial distribution of agriculture. This can be done by analyzing deviations between census data and PCAM estimates, which are based on the underlying suitability of the mean climate and soil of pixels. This means that PCAM provides a useful benchmark of where crops were likely grown taking only climate and soil into consideration. PCAM has the additional advantage of requiring relatively few data inputs. The PCAM algorithm may also be useful in allocating national variables to a global grid in other settings.
It is important to note that the methodology is designed to provide a probabilistic assessment of crop distributions. For this reason, PCAM's maps should not be considered to be actual data. Rather, they should be used to determine likely locations of major crops in time. For locations in which sub-national census data exists, such as it does in the United States, then the census data is the preferred source of information. However, agricultural census information is often not available for many locations and time periods. This means that PCAM estimates may be suitable for certain uses in these instances. Future work could improve upon the approach, especially as new datasets and computational techniques emerge. For example, machine learning shows much promise to leverage satellite data for estimating crop-specific areas [59,60]. However, this approach would be restricted to the time domain of necessary satellite data. Future work could assemble available sub-national census data and fuse it with PCAM estimates for locations without data. Insights into additional determinants of crop-specific suitability could be used to improve the suitability mesh. Additionally, future work could integrate information on extreme climate events to better determine locations of harvest as opposed to planted areas.