An open-access dataset of crop production by farm size from agricultural censuses and surveys

This dataset is a cross-country convenience sample of primary data measuring crop production and/or area by farm size for 55 countries that underlies the article entitled "How much of the world's food do smallholders produce?" (DOI: https://doi.org/10.1016/j.gfs.2018.05.002). The harmonized dataset is nationally representative with subnational resolution, sourced from agricultural censuses and household surveys. The dataset covers 154 crop species and 11 farm size classes, and is ontologically interoperable with other global agricultural datasets, such as the Food and Agriculture Organization's statistical database (FAOSTAT) and the World Census of Agriculture (WCA). The dataset includes estimates of the quantity of food, feed, processed agricultural commodities, seed, waste (post-harvest loss), or other uses; and potential human nutrition (i.e., kilocalories, fats, and proteins) generated by each farm size class. We explain the details of the dataset, the inclusion criteria used to assess each data source, the data harmonization procedures, and the spatial coverage. We detail assumptions underlying the construction of this dataset, including the use of aggregate field size as a proxy for farm size in some cases, and crop species omission biases resulting from converting local species names to harmonized names. We also provide bias estimates for commonly used methods for estimating food production by farm size: use of constant yields across farm size classes when crop production is not available, and relying on nationally representative household sample surveys that omitted non-family farms. Together, this dataset represents the most complete empirically grounded estimate of how much food and nutrition smallholder farmers produce from crops.

© 2018 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject area
Agriculture, Food Security, Environmental Studies

More specific subject area
Crop Production, Crop Diversity, Farm Size, Smallholders

Type of data
CSV file

How data was acquired
All data were compiled via agricultural censuses or nationally representative household surveys.

Data format
Aggregated to sub/national level resolution.

Experimental factors
We describe the survey instruments used to build this harmonized dataset and the methods of harmonization. We also test four assumptions made in building the dataset: 1) using a constant yield across all farm size classes when crop production was not available, 2) using aggregate field size as a proxy for farm size, 3) relying on nationally representative household sample surveys that omitted non-family farms, and 4) crop species omission biases resulting from converting local species names to harmonized names. Finally, we test for regional biases resulting from our global convenience sample.

Experimental features
We describe key components of the data harmonization process and the dataset characteristics. Each of the four assumptions was tested in countries whose datasets contained both the assumed and the actual data. For example, we tested the constant yield bias in countries with datasets containing both the agricultural area and the actual crop production per farm size class. We then applied a constant yield across all farm size classes to the crop area variable and tested the difference between using the actual production versus the constant yield to calculate the production. Similar within-country tests were conducted for each assumption.

Data source location
Sample containing 55 countries. See data coverage section for spatial coverage.

Data accessibility
Data accompanies article.

Value of the data
The first open-access dataset containing food production by farm size at the global scale.
The dataset can be used as a baseline for other global farm size datasets that do not contain direct measurements of smallholder food production.
This dataset is harmonized across crop species, country, and year to link with the FAOSTAT and World Census of Agriculture databases.
Contains 154 unique crop species, macro-nutrient conversion factors, and food, feed, and other production conversion factors that can be subset by farm size.
This dataset is spatially explicit at the subnational level and is accompanied by a shapefile with political boundaries for mapping.

Data
This dataset was built to provide estimates of the percentage of food produced by farms of different sizes globally. We constructed this dataset by harmonizing agricultural censuses and nationally representative household sample surveys that directly measured crop production and/or cropping area by farm size. This dataset is a convenience sample of 55 countries, 45 of which have subnational resolution.
Our dataset captures approximately 51.1% of global crop production and approximately 52.9% of global cropland area (i.e., arable land and permanent crop area as reported in the Food and Agriculture Organization's statistical database (2017) [FAOSTAT hereafter]) [1]. The primary sources are agricultural census data (the majority of which used exhaustive sampling of the farming population, although not all response rates were 100%) or nationally representative sample surveys (i.e., with stratified random sampling of households in a country). These data were available either at the aggregated level by administrative unit (34 countries) or at the non-aggregated, microdata level, where data are available as anonymized individual household level records (21 countries, of which 18 were sample surveys and 3 were complete agricultural censuses) (Fig. 1). We document the source information, detail the methods for building this dataset, and describe its characteristics in this article to enable its use by the research community.
This database was harmonized across countries, 154 crop species, and farm size categories. Crop species and country names were matched with FAOSTAT by year to integrate with its extensive variable lists. The median year of the source data was 2013, with the oldest source dataset from 2001 and the newest from 2015; each administrative unit contains data for the most recently available time point. We harmonized the farm size categories to match the World Census of Agriculture (WCA) farm size categories: 0 to 1 ha, 1 to 2 ha, 2 to 5 ha, 5 to 10 ha, 10 to 20 ha, 20 to 50 ha, 50 to 100 ha, 100 to 200 ha, 200 to 500 ha, 500 to 1000 ha, and above 1000 ha.
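As an illustration of the eleven WCA farm size classes listed above, the following minimal Python sketch assigns a farm size in hectares to a class label. The function name, the label strings, and the boundary convention (a farm exactly on a boundary goes to the lower class) are assumptions made for this example; the source does not state how boundary values were handled.

```python
from bisect import bisect_left

# Upper bounds (ha) of the first ten WCA classes; the eleventh is open-ended.
WCA_BREAKS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
WCA_LABELS = [
    "0-1", "1-2", "2-5", "5-10", "10-20", "20-50",
    "50-100", "100-200", "200-500", "500-1000", ">1000",
]


def wca_class(farm_size_ha):
    """Return the WCA farm size class label for a farm size in hectares.

    Assumption: a farm exactly on a boundary (e.g. 2.0 ha) is placed in
    the lower class, so only farms strictly above 1000 ha fall in ">1000".
    """
    if farm_size_ha < 0:
        raise ValueError("farm size must be non-negative")
    return WCA_LABELS[bisect_left(WCA_BREAKS, farm_size_ha)]


print(wca_class(3.0))
```

A lookup of this kind is only needed for microdata, where individual farm sizes are known; tabulated census data arrive already grouped into classes.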
We ran into several methodological issues when harmonizing the underlying data needed to construct this dataset. In this article, we outline the assumptions made and test the bias they introduce. These assumptions are: applying constant yields across farm size classes to estimate production when only cropping area was available (approximately 60% of our data), omitting non-family farms when relying on household sample surveys (22.5% of our data), using aggregated plot size as a proxy for farm size (approximately 5% of our data), and omitting crop species that we were unable to harmonize across countries or with the FAOSTAT crop species list.
In this article, we also provide details on the data collection and inclusion process, summary statistics, and spatial coverage, along with sensitivity tests and/or detailed explanations of each of the data harmonization assumptions we made. Our goal is to be transparent about our dataset's limitations, offer insight for other data harmonization projects relying on these same assumptions, and offer guidance for people wishing to use these data in their own work.

Inclusion criteria
We prescribed four inclusion criteria for this project. First, datasets needed to contain variables for farm size (where farm size was not available we relied on aggregate field size) cross-tabulated with production per crop or cropping area per crop. Second, datasets needed to be nationally representative. Agricultural censuses or household sample surveys were used only when their sampling methodology was transparent and/or these datasets were used by the country's government for official statistics. We required the household surveys' sampling designs to be transparent, randomized at the appropriate administrative unit, and to provide sampling weights and expansion factors with details on their creation and intended application. Third, national numbers calculated from these datasets needed to be comparable with official national statistics. For many agricultural censuses, the sampling design and response rates were not available. Fourth, we only focused on surveys which included disaggregated data on crop species so that they could be matched to FAOSTAT crop names and item codes. No aggregate categories were used (e.g., 'roots and tubers' or 'fruit and vegetables').
We systematically searched several locations for agricultural datasets to compile our dataset. These sources included the World Bank microdata archives, EarthStat metadata, Living Standards Measurement Study (LSMS) surveys, and the Accelerated Data Program (see Table 1 for full data repository list). We conducted our search on a per country basis either through each data archive's search capabilities where available, detailed search of each data archive's metadata, or via webscraping the archive to identify pertinent variables. Due to the multilingual nature of the datasets, variables were translated using the Google Translate Application Programming Interface (API) and we cross-checked any ambiguous or unknown colloquial crop name against several sources [2,3] and/or with colleagues who work in each region of interest. For each country in each data archive, we searched for variables that directly linked 'farm size' or 'plot area' with 'production' or gross 'plotted'/'cropped'/'planted'/'harvested' area by 'crop type'. If there were multiple eligible datasets available per country, we included the most recent year. Nearly all the source data were freely obtained and all are used according to their user agreements.
Of the censuses we included that had detailed sampling information (25 countries), 15 relied on either an exhaustive sampling design or a design that was exhaustive for farms above a set number of employees and/or annual revenue, combined with a sample survey for smaller farms. The exhaustive censuses had a median response rate of 80%; the remaining censuses relied on stratified randomized sampling and applied resampling weights and expansion factors before making their aggregated data available (see dataset's metadata).

Farm size harmonization
For tabulated census data, we adjusted the reported farm size classes to match those in the WCA to enable consistent analyses across all countries. In some instances, census farm size classes could simply be aggregated to match those reported in the WCA. In other instances, census classes needed to be disaggregated into two or more WCA classes. For countries that had both tabulated census data and microdata available, the area data in the microdata were aggregated into WCA classes, and the proportion represented by each class was used to distribute the census data. For countries that reported agricultural area by farm size classes that differed from those in the WCA, the proportion of area in each class was used to disaggregate subnational census data classes where necessary. For example, Paraguay reported a farm size class of 1-5 ha, whereas the WCA reports classes of 1-2 ha and 2-5 ha. The total area in the 1-5 ha class was split between the two smaller classes based on their relative size, so 25% of the area was assigned to the 1-2 ha class and 75% was assigned to the 2-5 ha class. For all other countries, the simplest solution was to aggregate classes to match the WCA farm size classes. There were instances where two different methods were used for the same country. Additionally, there were situations where a country's largest farm size class differed from the WCA's largest farm size class, yet encompassed all farm sizes over a certain threshold. For example, in countries that only reported the largest farm size class to be over 100 ha, all farms over 100 ha were entered into the WCA's corresponding 100-200 ha class. While this is a limitation of the data harmonization process, we were not able to assume a distribution for a country's largest farm size class through which we could disaggregate it into several of the larger WCA classes. Fig. 2 shows a subsection of reported farm size classes for tabulated census data (all European countries reported in Eurostat had the same classes, represented by the Europe category in Fig. 2). The WCA classes, which were used in our analyses, are also shown. Corrections were made for the following countries: Austria, Belgium, Brazil, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Ethiopia, Finland, France, Germany, Greece, Hungary, Iceland, India, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Montenegro, Netherlands, Norway, Paraguay, Poland, Portugal, Romania, Slovakia, Slovenia, South Africa, Spain, Sweden, Switzerland, United Kingdom, United States of America (Fig. 2).
Figure caption: The x-axis shows each farm size class (ha). The y-axis shows the percent of global production. The red line is the average percent of production by farm size class. The gray line indicates 95% confidence intervals.
Figure caption: Map showing countries requiring the assumption of constant yield across farm sizes. For many countries, our dataset contained a mix of actual production values and area-only measurements per crop per farm size; percentages are given for each country according to how much of total crop production was calculated using the constant yield assumption (indicated as percent bias in the legend). Darker orange indicates that a greater percentage of the country's data was based on constant yields.
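The disaggregation step described for Paraguay can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example area of 1000 ha are hypothetical. The first function splits a reported class by the relative width of the target classes, and the second distributes a total using externally observed shares, as was done where microdata were available.

```python
def split_by_width(total_area, source_class, target_classes):
    """Split the area of one reported class across narrower target classes
    in proportion to each target class's width in hectares."""
    low, high = source_class
    if target_classes[0][0] != low or target_classes[-1][1] != high:
        raise ValueError("target classes must span the source class")
    width = high - low
    return {
        (t_low, t_high): total_area * (t_high - t_low) / width
        for t_low, t_high in target_classes
    }


def split_by_shares(total_area, shares):
    """Distribute a total across classes using observed shares
    (e.g. area shares per WCA class computed from microdata)."""
    share_sum = sum(shares.values())
    if share_sum <= 0:
        raise ValueError("shares must sum to a positive number")
    return {cls: total_area * s / share_sum for cls, s in shares.items()}


# Hypothetical 1000 ha reported in a 1-5 ha class, split into 1-2 and 2-5 ha.
print(split_by_width(1000.0, (1, 5), [(1, 2), (2, 5)]))
```

With the hypothetical 1000 ha, the width-based split gives 250 ha to the 1-2 ha class and 750 ha to the 2-5 ha class, matching the 25% and 75% proportions in the text.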

Construction of conversion factors
Conversion factors for kilocalories, fats, and proteins (in grams per capita) and for the percentage of each crop grown for food, animal feed, processed commodity, seed, and wastage due to transportation and storage (but not home consumption) were calculated using FAOSTAT. FAOSTAT provides actual values for each of these variables at the national level per year with detailed definitions. For example, if a country produced soybeans in a given year, we took the ratio of the amount of soybean production allocated towards food divided by the total soybean production in that country to obtain the conversion factor for that country and year. We would repeat for feed, processed goods, seed, and waste, then apply these conversion factors to the amount of production each farm size produced per administrative unit in that country, and for each crop type. Hence, each estimate for these macronutrient and production variables assumes the national allocations are homogeneous across all administrative units and across all farm sizes. This is a largely untested assumption, and to our knowledge there are no sub-national datasets nor farm size specific datasets covering these variables, and therefore the bias introduced by it is unknown (unlike for some other assumptions for which we were able to estimate bias, see Section 4). To enable future researchers to accommodate adjusting these conversion factors, we provide the actual amount of production per farm size per administrative unit in addition to the conversion factors and converted values.
Fig. 6. Verifying our constant-yield assumption by comparing production calculated using constant yields versus actual production for countries where we had both area and production data by farm size. A) Log-log plot between constant yield calculated production and actual production. The black line represents the 1-to-1 line. The green line is the linear regression line when using constant yield derived production to predict actual production. B) Compares production using constant yields (orange) to actual (green) production on a log-scale, while C) shows this relationship for each farm size class.
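The conversion factor calculation described above can be illustrated with a short Python sketch. The function names and all quantities are hypothetical and chosen only to show the arithmetic: each factor is the national quantity allocated to a use divided by total national production of the crop, and the same factor is then applied to every farm size class, which is exactly the homogeneity assumption noted in the text.

```python
def conversion_factors(use_quantities, total_production):
    """Share of national production of one crop going to each use."""
    if total_production <= 0:
        raise ValueError("total production must be positive")
    return {use: qty / total_production for use, qty in use_quantities.items()}


def allocate(production_by_class, factors):
    """Apply the national factors to production in each farm size class.

    Assumes, as the dataset does, that allocations are identical across
    farm sizes and administrative units.
    """
    return {
        cls: {use: prod * f for use, f in factors.items()}
        for cls, prod in production_by_class.items()
    }


# Hypothetical national soybean figures (tonnes), for illustration only.
factors = conversion_factors({"food": 20.0, "feed": 50.0, "seed": 5.0}, 100.0)
by_use = allocate({"0-1": 10.0, "1-2": 30.0}, factors)
print(by_use["1-2"])
```

Because the dataset also provides unconverted production per farm size, a user with better local allocation data could substitute their own factors in the second step.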

Dataset descriptive statistics
Our dataset includes primary datasets ranging from 2001 to 2015, with a median year of 2013. It includes 55 countries, 45 of which have subnational resolution, 18 of which have fine scale (i.e., farm level) resolution. Fig. 3 shows the data's spatial resolution and distribution of the 154 unique crop species represented; on average (mean), there were 30.8 crop species per country (Standard Deviation (SD) = 20.3). Crop species were aggregated to major commodity groups according to FAOSTAT definitions of cereals, fruit, oil crops, pulses, roots and tubers, tree nuts, vegetables, and other. Relying on the FAOSTAT classification has its limitations. For example, soy was classified as an oil crop, but it is also a pulse; therefore, this classification should be used as a guideline (Fig. 4). Due to the aggregated nature of a large number of the sources used, we were only able to present gross agricultural area, not net agricultural area or the number of farmers by farm size class.

Constant yields
For 33 countries in our dataset, representing 59.7% of the total production (in kcal), we could not find crop production by farm size, but we did find either gross cropped area, harvested area, planted area, or plot area by farm size per crop (Fig. 5). For these data, we used FAOSTAT's national yield estimates for the given country, year, and crop to estimate production per farm size. This assumes that all farm sizes within a country had the same yields for a given crop and year. However, as there is a widely observed inverse yield to farm size relationship where smaller farms typically have higher yields [4][5][6], we explored how using a constant yield across farm sizes may bias our production estimates.
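A minimal Python sketch of the constant yield estimate follows. The function name, the units (hectares and tonnes per hectare), and the numbers are assumptions for illustration. It also makes one consequence of the assumption visible: with a single yield for all classes, each class's share of production equals its share of area.

```python
def constant_yield_production(area_by_class, national_yield):
    """Estimate production per farm size class as area times one
    national yield (assumed units: ha and tonnes per ha)."""
    if national_yield < 0:
        raise ValueError("yield must be non-negative")
    return {cls: area * national_yield for cls, area in area_by_class.items()}


# Hypothetical areas (ha) for one crop in one administrative unit.
areas = {"0-1": 100.0, "1-2": 300.0}
production = constant_yield_production(areas, 2.5)

# Under a constant yield, production shares equal area shares.
area_share = areas["0-1"] / sum(areas.values())
production_share = production["0-1"] / sum(production.values())
print(production, area_share, production_share)
```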
We tested the presence of a constant yield bias in eight countries for which we had both an area measurement (i.e., harvested, cropped, planted, or plot area) per crop per farm size and crop production by farm size measurement. For these countries, we regressed known production values against production values calculated from constant yields with countries and crop type as random effects, and we report the intercept and slope for this relationship to indicate the level of bias introduced by the constant yield assumption. Fig. 6A is a log-log plot that shows a high correlation between production computed using constant yields and actual production. We used the natural log of production values to plot this due to long-tailed distributions in the data. We found that using constant yields slightly overestimates actual production for administrative units with smaller production but converges at administrative units with larger production (Intercept: −0.79, SE = 0.11; Slope: 1.03, SE = 0.001). This bias can be corrected for by predicting out of the model shown in Table 2. In Fig. 6B, we also show boxplots to illustrate this overestimation for all farm size classes, and in Fig. 6C we show the differences for each farm size. The plots indicate that overestimation of production from using constant yield is higher for smaller farm sizes, which is expected due to their higher yields; in general, the FAO yields were higher than the reported yields in our dataset (see section 2.2.2 for details).
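As a rough sketch of what predicting from the fitted relationship could look like, the Python snippet below applies only the fixed-effect intercept and slope reported in the text on the natural-log scale. Several things here are assumptions: the function name is hypothetical, the country and crop random effects of the full model are ignored, no retransformation correction is applied, and the input must be expressed in the same production units as the fitted model, which this section does not state. The full model in Table 2 should be preferred for actual corrections.

```python
import math

# Fixed-effect estimates reported in the text (natural-log scale).
INTERCEPT = -0.79
SLOPE = 1.03


def corrected_production(constant_yield_production):
    """Predict actual production from constant-yield production using
    ln(actual) = INTERCEPT + SLOPE * ln(constant-yield production).

    Assumption: random effects for country and crop are ignored.
    """
    if constant_yield_production <= 0:
        raise ValueError("production must be positive to take a logarithm")
    return math.exp(INTERCEPT + SLOPE * math.log(constant_yield_production))


# The two lines cross where INTERCEPT + (SLOPE - 1) * ln(x) = 0.
crossover = math.exp(-INTERCEPT / (SLOPE - 1))
print(corrected_production(1000.0) < 1000.0, crossover)
```

Below the crossover value the predicted actual production is lower than the constant-yield figure, which is consistent with the reported overestimation for units with smaller production.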
Where country level yields were not available for certain crops and/or years, regional or global yields were used. Regional and global yields were used for 0.02% of all administrative units in our dataset (and had a Spearman rank correlation of 0.86 with the FAO country level yields), so we expect them to have small effects on production values estimated across the sample. These are included in the constant yields assumption and the above bias analysis, and the use of constant yields is flagged in the dataset for future researchers.

Calibrating with FAOSTAT
To calibrate our dataset with FAOSTAT, we regressed our estimates of country production against theirs for matching crops and years. Our data consistently underestimate production relative to FAOSTAT (Intercept: 15.39, SE = 1.67; Slope: 0.92, SE = 0.08; Fig. 7). Future researchers interested in using these data can use this relationship to calibrate them against FAOSTAT. As we matched crop lists exactly with the FAO, this is perhaps surprising. It is possible that some of this variation reflects differences in survey instruments: we included different datasets from those FAOSTAT used, because we needed crop production by farm size and FAOSTAT does not provide this cross-tabulation. Another way of looking at this discrepancy is that our dataset provides an independent and transparent estimate of the amount of crops produced by different countries across the world.

Family farms bias
For 17 countries in our dataset, representing 22.5% of the total production (in kcal), we could not find agricultural census data, but we did find nationally representative (often with sub-national resolution) agricultural household surveys (Fig. 1). One bias that stems from household surveys is that they only capture family farms, which are often associated with smaller farms. The household surveys miss non-family commercial enterprises and thus do not represent the full population of farms in a country. A proper test of the bias introduced by the use of household surveys would require both census and household survey data for the same countries, which we did not have for the countries in our dataset. Moreover, the census and household survey data covered different ranges and magnitudes of production (e.g., household survey data covered countries with smaller aggregate production; see Fig. 7).

Plot size as a farm size proxy
For 8 countries in our dataset, representing 4.8% of the total production (in kcal), farm size was not explicitly reported, so we calculated a proxy farm size using the sum of either harvested, cropped, planted, or plot area (Fig. 8). This assumption may influence estimates of global crop production by farm size by underestimating farm areas in some farm size classes, because the aggregation process did not capture all fallow plots, water sources, unused areas, and on-farm structures. We think the main effect of this would be to introduce noise into the production by farm size signal (by mixing data using the field size proxy with real farm sizes). Due to data constraints, we were not able to explore how much noise this introduced. It does stand to reason that larger fields need to belong to larger farms, but it is unclear whether smaller fields are part of a large farm with several small fields or part of a small farm. However, because these countries represent less than 5% of the total production covered in our dataset, they do not greatly influence gross estimates of crop production by farm size estimated from these data. When the 8 countries for which we used a proxy indicator for farm size were omitted from the dataset, there was minimal influence on the distribution of food production by farm size (mean absolute difference = 0.26; SD = 0.19).
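The proxy calculation amounts to summing the recorded plot areas for each household. The Python sketch below illustrates this with hypothetical household identifiers and areas; the function name and the record layout are assumptions, not the structure of the underlying surveys.

```python
from collections import defaultdict


def proxy_farm_sizes(plot_records):
    """Sum plot areas (ha) per household to obtain a proxy farm size.

    The result is a lower bound on true farm size, since fallow plots,
    unused land, and on-farm structures are not recorded as plots.
    """
    totals = defaultdict(float)
    for household_id, area_ha in plot_records:
        if area_ha < 0:
            raise ValueError("plot area must be non-negative")
        totals[household_id] += area_ha
    return dict(totals)


# Hypothetical plot-level records: (household id, plot area in ha).
records = [("hh1", 0.5), ("hh1", 0.75), ("hh2", 3.0)]
print(proxy_farm_sizes(records))
```

Here household hh1 receives a proxy farm size of 1.25 ha from two plots, which would place it in the 1 to 2 ha class even if its true holding, including unrecorded land, were larger.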

Regional bias
Our dataset accounts for around 51% of the total global harvest area, with representation across country types (e.g., spatial and economic). However, since our dataset is a convenience sample, we were not able to control for spatial coverage nor the countries included, and there were large data gaps for Australasia and Asia (Fig. 9).
An important question for researchers interested in this dataset is how much the global estimates of crop production by farm size are influenced by the omission of particular countries. While this coverage error is difficult to compute directly, we can explore how sensitive global estimates are to any one country included in the dataset. To do this we computed jackknife samples, where one country was omitted with each iteration, shown in Fig. 10. The vertical black line is the mean kilocalories (kcal) of food produced for a given farm size class when no countries were omitted. Each blue dot represents the mean when a corresponding country was omitted. If a country's dot is to the left of the black line, omitting that country lowers the global average. The vertical lines are the upper and lower quartiles for food production. For each plot, we labelled four countries as examples, but all countries are present.
Fig. 11. Two examples of countries that deviated from the global distribution of total crop production by farm size: Germany (purple) and South Africa (orange) have different distributions than the global average (green).
There is substantial variation when a country is omitted, indicating that countries' farm size distributions can heavily influence the global averages (see Tables 3-5 for per country distributions of gross agricultural area, total production (kcal), and food production (kcal)). This high variation in the percentage of food produced in different farm size classes indicates that the relationship between farm size and food production is highly contextual; Fig. 11 shows two examples, South Africa and Germany.
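The leave-one-country-out procedure described in this section can be sketched as follows. This is a minimal Python illustration with hypothetical country labels and values for a single farm size class; the function name is an assumption and the statistic shown is a simple mean across countries.

```python
def leave_one_out_means(value_by_country):
    """For each country, return the mean across the remaining countries
    when that country is omitted (one jackknife sample per country)."""
    n = len(value_by_country)
    if n < 2:
        raise ValueError("need at least two countries")
    total = sum(value_by_country.values())
    return {
        country: (total - value) / (n - 1)
        for country, value in value_by_country.items()
    }


# Hypothetical kcal values for one farm size class in three countries.
values = {"A": 10.0, "B": 20.0, "C": 60.0}
full_mean = sum(values.values()) / len(values)
jackknife = leave_one_out_means(values)
print(full_mean, jackknife)
```

In this toy example the full mean is 30, and omitting country C pulls the mean down to 15, so C would appear far to the left of the reference line and would be identified as a country that strongly raises the global average.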