Large survey dataset of rice production practices applied by farmers on their largest farm plot during 2018 in India

This dataset provides detailed information on the rice production practices applied by farmers during the 2018 rainy season in India. Data were collected through computer-assisted personal interviews of farmers using the digital platform Open Data Kit (ODK). The dataset, n = 8355, covers eight Indian states, viz., Andhra Pradesh, Bihar, Chhattisgarh, Haryana, Odisha, Punjab, Uttar Pradesh and West Bengal. Sampling frames were constructed separately for each district within states and farmers were selected randomly. The survey was deployed in 49 districts with a maximum of 210 interviews per district. The digital survey form was available on the mobile phones of trained enumerators and was designed to minimize data entry errors. Each survey captured approximately 225 variables on the rice production practices of the farmer's largest plot, starting with land preparation, establishment method, crop variety and planting time through to crop yield. Detailed modules captured fertilizer application, irrigation, weed management, and biotic and abiotic stresses. Additional information was gathered on household demographics and marketing. Geo-points were recorded for each surveyed plot with an accuracy of < 10 m. This dataset was generated to bridge a data gap in the national system; it provides information about the adoption of technologies and enables prediction and other analytics. It can potentially be the basis for evidence-based agriculture programming by policy makers.



Value of the Data
• This dataset is unique in the data ecosystem of India as it records farmers' current rice production practices in detail. It can be used by the national agricultural system as a monitoring tool or feedback mechanism for site-specific technology targeting.
• The dataset is large and covers many different geographies and agroecological zones. It provides adequate information to learn how rice cultivation practices vary from place to place within India.
• Farmers were selected for interview by random sampling, so the information, including yields, can be generalized to a larger geopolitical domain. If the survey is replicated, the agency concerned can generate panel data to assess changes in practices and productivity gains over time.
• The data provide information on location-specific usage rates of farm inputs and the adoption status of agricultural technologies. They are thus valuable to private sector firms dealing in seeds, fertilizers, herbicides and machinery for market development and expansion.
• Crop modelers can layer this dataset with other datasets (weather, soil, topography, etc.) because every data point in it is geo-referenced. It can then be reused in developing algorithms for yield prediction.

Data Description
This survey dataset [3] from India has a wide spatial distribution and provides complete details about rice production practices. The dataset is diverse because the surveyed crop was spread over different cropping and production systems. The survey form was kept consistent across states and sites to ensure uniformity of data [1]. A similar dataset was generated earlier for the wheat crop in Bihar and Eastern Uttar Pradesh [2].
Map 1 shows the geo-locations of the survey on a map of India and their broader distribution within state boundaries. This map was developed with the QGIS Desktop App (version 3.22.7) using the geo-coordinates of each surveyed rice plot, which were captured at the end of the survey.
Table 1 describes how the samples were distributed across states, farmers' typology, and the characteristics of the surveyed rice plots. It shows that the largest number of samples came from Bihar, followed by West Bengal; Punjab and Haryana had limited samples. Female farmers made up 3.7% of the total sample. The educational status of sampled farmers shows that one in four had never attended school. The majority of farmers were either from the Other Backward Caste (OBC) category or from the General category. Most of the surveyed plots were owned by the interviewed farmers. The soil of these plots was mostly medium textured, and their water retention capacity was also in the middle range. This table was drawn using RStudio (version 1.4.1106), an open-source data analytical tool.
Table 2 provides information on the rainy season calendar months and the corresponding rainfall in millimeters (mm) for the surveyed states [13]. It also shows the rainfall received during this rainy season as a percentage of the total annual rainfall (mm) of the year 2018.
Farmers were found to practice seven different methods of planting or transplanting rice: transplanting seedlings randomly; transplanting seedlings in line; broadcasting seed on a wet field; broadcasting a higher rate of seed followed by uprooting poor seedlings almost a month later (beushening); sowing seed with a seed drill machine; transplanting specially grown seedlings by machine; and the system of rice intensification method. Of these seven methods, the most frequent was transplanting seedlings randomly, used by 84% of the sample, followed by transplanting seedlings in line, used by 9% of farmers.
Fig. 2 shows the categories of rice seed used by farmers. Of the surveyed farmers, 72% were using improved open-pollinated varieties, followed by 20% using rice hybrids. The Basmati (scented) group of varieties was limited to Haryana and Punjab.
Fig. 3 shows the distribution of overall rice grain yields, as reported by the surveyed farmers, through a density plot, which is a smoothed version of the histogram [7]. The mean rice yield of sampled farmers was 4.7 tons per hectare, as indicated by the dotted vertical blue line. This figure was drawn using the 'ggpubr' R package in RStudio.
Fig. 4 presents the distribution of rice grain yields by state through merged histograms drawn with the 'ggplot2' R package in RStudio [5]. For states like Punjab and Haryana, samples are concentrated on the right side of the plot (higher yield levels). For states such as Chhattisgarh, Odisha and Bihar, yield samples lie more towards the left side of the respective plots, denoting lower rice yields for most farmers.

Experimental Design, Materials and Methods
The district was taken as the survey unit, so sampling frames were constructed separately for all 49 districts. A two-stage cluster sampling approach (a type of probability sampling) was followed to select farmers within districts. In the first stage, 30 villages were selected randomly in each district using probability-proportionate-to-size (PPS) sampling, where size refers to the number of households in the village. The PPS method is well suited to cases where the sampling units (villages in this case) vary in population size [12]. It reduces standard error and bias by increasing the likelihood that a sampling unit with a larger population will be chosen over a sampling unit with a smaller population. In the second stage, after the villages had been selected, seven households were selected randomly in each village for personal interviews. Accordingly, the ideal sample size for each survey unit (district) was 30 × 7 = 210.

Village Selection
To construct the sampling frame for village selection, publicly available data from the 2011 Census of India [4] were used. All villages within a district were listed together with their number of households. Extremely small villages (< 50 households) and extremely large villages (> 5000 households) were discarded, along with villages categorized as urban habitat [10]. The final village list was then used to apply PPS. Steps followed to draw 30 villages in a district:
• In the column next to the number of households, generate the cumulative number of households, starting from number '1'. The last row in this column should match the total number of households (say N) in the sampling frame of villages.
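The following Python sketch illustrates one common way to draw villages by PPS from a cumulative household column. It is an assumption, not the authors' documented procedure: it uses systematic PPS (a sampling interval of N / 30 with a random start), and the function names, record fields (`name`, `households`, `urban`) and village data are hypothetical and made up for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def build_frame(villages, min_hh=50, max_hh=5000):
    """Drop urban villages and those with < min_hh or > max_hh households."""
    return [v for v in villages
            if not v["urban"] and min_hh <= v["households"] <= max_hh]


def pps_systematic(frame, n_draws, rng):
    """Systematic PPS draw (assumed variant, not confirmed by the source)."""
    # Cumulative household column; its last value is N, the frame total.
    cumulative = list(accumulate(v["households"] for v in frame))
    total = cumulative[-1]
    interval = total / n_draws          # sampling interval N / n
    start = rng.uniform(0, interval)    # random start within the first interval
    chosen = []
    for k in range(n_draws):
        point = start + k * interval
        # First village whose cumulative count reaches the selection point.
        idx = min(bisect_left(cumulative, point), len(frame) - 1)
        chosen.append(frame[idx]["name"])
    return chosen


rng = random.Random(2018)
# Made-up district of 400 villages (illustrative values only, not survey data).
villages = [{"name": f"village_{i}",
             "households": rng.randint(20, 6000),
             "urban": i % 25 == 0}
            for i in range(400)]
frame = build_frame(villages)
selected = pps_systematic(frame, 30, rng)
```

Under this variant, a village's chance of selection is proportional to its household count. A village larger than the sampling interval could be drawn more than once, which the sketch does not handle specially.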

Household Selection
Seven households were selected in each of the 30 villages through simple random sampling [11]. To make this selection, a list of villagers was generated from the voter lists available on the election commission websites of the respective states [8]. Each house number attached to a voter was treated as one household [10]. For example, if house number 71 appeared against six different names in the voter list, those six voters were treated as one household.
Map 2 illustrates the sample distribution (locations of the surveyed farmers' largest rice plots) in 14 districts of Bihar. The map shows that the sampling methodology applied in this survey generated uniformly distributed samples within a district.
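The household selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the voter list is represented as hypothetical `(voter_name, house_number)` pairs, the function names are invented for this example, and the data are made up. Voters sharing a house number are collapsed into one household before seven households are drawn by simple random sampling.

```python
import random


def households_from_voter_list(voters):
    """Group voter records by house number; one house number = one household."""
    households = {}
    for voter_name, house_no in voters:
        households.setdefault(house_no, []).append(voter_name)
    return households


def select_households(voters, n, rng):
    """Draw n households by simple random sampling without replacement."""
    households = households_from_voter_list(voters)
    chosen = rng.sample(sorted(households), n)
    return {house_no: households[house_no] for house_no in chosen}


# Made-up voter list for one village: 100 house numbers, three voters each,
# except house number 71, which has six voters (mirroring the example above).
voters = [(f"voter_{house_no}_{j}", house_no)
          for house_no in range(1, 101)
          for j in range(6 if house_no == 71 else 3)]

grouped = households_from_voter_list(voters)
sample = select_households(voters, 7, random.Random(71))
```

Sampling house numbers rather than individual voters gives every household the same chance of selection regardless of how many voters are registered at it.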

ODK Collect Application
Training sessions were conducted for enumerators separately in each state to set up their Android devices with the ODK server, discuss the survey questions and explain the sampling frame (list of villages and house numbers/member names). Mock interviews were organized on the second day of the training. The ODK Collect app [9] was downloaded on each enumerator's device and linked to the ODK server hosted in New Delhi by the Indian Agriculture Statistics Research Institute (IASRI). Credentials were given to enumerators so that they could download the digital survey form [6], named 'Landscape Diagnostic Survey', from the server and deploy it. Completed survey forms were sent to the server by enumerators.
Map 2. Map of Bihar state (one of the surveyed states) highlighting districts covered in the survey with red boundaries and sample distribution within districts by black dots.

Ethics Statements
The survey was conducted under the Cereal Systems Initiative for South Asia (CSISA) project of the International Maize and Wheat Improvement Center (CIMMYT). The project obtained formal approval from CIMMYT's Internal Research Ethics Committee (IREC) to collect and use farmers' data. Before each interview, we explained the purpose and use of the data to the respondent. All interviewees gave their prior informed consent to participate in the survey and were informed that they could withdraw at any point. The dataset was adequately anonymized so that neither individual participants nor their surveyed farm plots can be identified.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Large-scale data of crop production practices applied by farmers on their largest rice plot during 2018 in eight Indian states (Original data) (Dataverse).