Crop Selection and Yield Prediction using Machine Learning Approach

In recent years, Agriculture sector has been researched a lot with the advancements in technologies like machine learning and smart computing. With the dynamic economics of Agri-produce, it is becoming challenging for farmers to utilize the land efficiently to get maximum profit in the specific landscape. Crop Yield Prediction (CYP) is crucial and is greatly dependent on environmental factors like soil contents, humidity, rainfall as well as area under cultivation and other required metrics. Due to insufficient incorporation of the multiple environmental circumstances, a number of existing tools and techniques used for CYP, such as historical averages, tend to produce inaccurate findings. In such situation, with multiple options of crop, it is essential for farmers to plan the crop strategy in advance. If the farmer can get estimate of the crop yield in advance, cultivation can be done accordingly. To solve this problem, machine learning approach is implemented as a base for accurate predictions. Crop prediction is done by classification model and yield prediction uses regression models to learn from the data. Multiple ML models are analyzed based on performance metrics. Best performer model is incorporated in backend. Among the used models for yield prediction, Random Forest Regression gives best results with MAE of 0.64 and R2 score of 0.96. For crop prediction, Naïve Bayes classifier gives most accurate results with accuracy of 99.39. The study emphasizes how machine learning could revolutionize crop management techniques by giving farmers insights about optimizing resource allocation and boost overall crop yield.


Introduction
The field of machine learning is advancing day by day.Learning is important when we need sample data or experience rather than the ability to immediately design a computer program to solve a particular problem.When there is no human knowledge or when people are unable to express their expertise, learning becomes important.Computers are programmed with machine in order to improve performance criteria based on actual or hypothetical facts.Computer program learns to optimize the parameters used for the model using training input or previous information.The model may be descriptive to draw conclusions based on model data or predictive which estimates trends in future. 1 A subset of artificial intelligence (AI), machine learning (ML) enables computers to learn for a specific dataset such as playing chess or making recommendations on social networks without having to be explicitly programmed.Precision farming and Agri-technology, now referred to as Digital Agriculture, are evolving into emerging fields in research that employ highly data-driven techniques to boost productivity in agriculture while shrinking the adverse effects on the environment.Machine learning (ML), alongside big data technology and robust computing infrastructure, has arisen to create potential solutions for unravelling, quantifying, and comprehending data-intensive processes in agricultural operational environments.Data analysis, as an evolved scientific discipline, is essential to the development of a wide range of crop management applications.Many times, it is possible to efficiently use ML without having integrating data from many sources.There tends to be less emphasis on data integration when large datasets are easily available, especially on a major scale.The main force behind this development is the complexity of data preprocessing and analytical processes, as opposed to the machine learning models' generally straightforward implementation. 2Agriculture sector has a major contribution of almost 20% in India's GDP in year 2019-20. 3Also, it is the principal source of employment in India.In addition to being a significant part of the global economy, it is crucial for the continued existence of humanity.Weather, pests, and the readiness of harvesting operations are the main factors that influence agricultural production.For managing agricultural risk, it's essential to have accurate crop history information. 4nethical practices are being used to produce higher yields of less-nutritious hybrid cultivars as the population grows.These techniques tend to harm soil quality.It results in environmental loss.Given the changing patterns of weather conditions and also economics, it is getting difficult to choose right crop for farmer.The use of various fertilizers is also unclear because of seasonal climate variations and changes in the availability of fundamental resources like soil, water, and air.The agricultural yield rate is continuously decreasing in this situation. 5Farmers today cultivate crops based on knowledge gained from earlier generations.Since the traditional method of cultivation has been refined, there are either excessive or insufficient yields without really meeting the need. 6If the producer knows yield estimates in advance, it would help to form the crop strategy.Machine learning is a rapidly expanding methodology that supports and provides a guide in decision process in various applications of multiple different industries.The majority of modern gadgets benefit from models being examined before deployment.The primary idea is to increase the efficiency and profits of the agriculture industry by using data as a tool with models.Precision farming, which prioritizes quality above unfavorable environmental variables, would be the main focus. 7L has advanced its applications in agriculture in areas like predicting soil properties, rainfall analysis, yield prediction, disease and weed detection, ML based computer-vision and many more.8 The use of computer vision, machine learning, and IoT applications will assist boost productivity, enhance quality, and ultimately increase the profitability of farmers and related industries.To increase the overall harvesting output, precision farming is crucial in the world of agriculture.9 For example, smart irrigation systems, crop disease prediction, crop selection, weather forecasting, and determining the minimal support price are all examples of techniques employed in agriculture.These methods will increase field productivity while requiring less work from farmers. 10 Crop yield estimation may be used for a variety of purposes, including helping farmers enhance production, optimizing the supply-demand cycle for fertilizers, insecticides, and other agricultural products, predicting prices, and calculating the risk levels for agricultural insurance.Another study by Anakha Venugopal, Jinsu Mani, Aparna S Rima Mathew, Prof. Vinu Williams 13 uses several machine learning approaches to forecast the agricultural production.By taking into account variables like temperature, rainfall, area, and other characteristics, Farmers will be able to select the crop that will provide the highest produce by using the forecasts made by ML models.The study is focused on Kerala's Agri-produce. Ag the classifier models utilized here, Random Forest has the highest accuracy, followed by Logistic Regression and Naive Bayes.
A Research 14 A smartphone app which is used in the proposed method connects farmers to the internet.GPS helps user in locating his location.The user enters the location and soil type.The most profitable crop list can be picked using machine learning algorithms, and they can also forecast crop yields for user-selected crops.Machine learning models, including random forest (RF), artificial neural network (ANN), support vector machine (SVM), multivariate linear regression (MLR), and k-nearest neighbor (KNN), are used to estimate crop productivity.Random forest demonstrated the best outcomes with 95% accuracy.The algorithm also makes recommendations on when to apply fertilizers to increase yield.This research focused on the limitations of present approaches and their applicability for yield prediction.The suggested approach then connects the farmers with an effective yield forecasting system via an app for smartphones.To assist them in selecting a crop, people may select from a number of attributes.The integrated prediction system assists farmers in estimating the crop produce.A user may research possible crops and their yield using the integrated recommendation system in order to make better educated judgements.Based on data from states of Maharashtra and Karnataka, several ML models like RF, KNN, MLR, SVM, and ANN were built and compared for accuracy.Results confirm RF Regressor, which has a 95% accuracy rate, is the best standard algorithm when applied to the presented datasets.
In, 15 the Random Forest Algorithm is used.In spite of extensive research into challenges and topics like weather, temperature, humidity, and rainfall, there are still no acceptable remedies or ideas to deal with the difficulty we face.In nations like India, there are numerous different sorts of rising economic growth, including in the agriculture sector.Additionally, crop yield predictions can be made using the processing.The current study proved the value of data mining techniques for predicting agricultural output based on input features related to the climate.All new grains and regions chosen for the investigation should have accuracy of prediction above 75%, demonstrating improved predictive performance.The produced website is user-friendly.The website was developed utilizing data from that area to predict crop yield.
According to a study, 16 selecting the best crop before sowing will increase agricultural yield.It depends on a variety of factors, such as the soil type and its composition, climate, local terrain, crop yield, market prices, etc. Techniques like Decision Trees, K-nearest Neighbors, and Artificial Neural Networks have a position in the crop selection framework, which depends on a variety of different factors.Machine learning has been used to choose crops based on how natural disasters like hunger could affect them.Researchers have employed artificial neural networks to choose crops depending on soil and climate with success.
When attempting to create a high-performance predictive model, ML studies face a variety of difficulties.To tackle the issue at hand, it is essential to choose the appropriate algorithms, and both the algorithms and the supporting platforms must be able to handle the sheer amount of data. 17study 18 suggested a method for unsupervised fuzzy categorization that identifies crop kinds with springtime harvests.The categorization outcomes likely to get better with time.Strategy used in 19 made use of the Bayesian network categorization supervised learning model.Crop information is analyzed with environmental parameters like temperature and rainfall to categorize crops.
A study by D. A. Reddy, B. Dadore, and A. Watekar 20 highlights how despite being one of the nations with the highest agricultural output, India's agriculture productivity is still fairly low.Productivity needs to be increased so that farmers may get better profit from decreased costs.In order to reliably and successfully propose a suitable crop based on soil data, it offers solutions such as offering a recommender utilizing an ensemble approach with a large proportion of voting methods employing random tree, CHAID, kNN, and naive bayes classifier.Soil types, soil characteristics, and crop yield data collection are taken into consideration when advising the farmer on the best crop to grow.The majority voting process, which is the most popular assembly technique, is used in this system.Any number of primary learners may be used in the voting process.A minimum of two base learners are required.The chosen learners complement one another and impart knowledge to the others.With more competition, a better forecast may be made.The specified training data set is used to train the model.When a new record has to be categorized, each model chooses the class independently.Class predicted by consensus of learners is chosen as class label for current record.
A study 21 says Building a random forest, a group of decision trees that considers two-thirds of the records in the datasets, takes into account data sets on temperature, production, perception, and rainfall.These decision trees are then applied to the remaining data to ensure accurate categorization.For accurate crop production prediction based on the input qualities, the test data may be applied to the generated training sets.The RF method and the dataset were used to evaluate the efficacy of this technique.The advantage of the random forest approach is that overfitting is less of an issue with random forests than it is with decision tree-based model.The random forest does not need to be trimmed.The loaded data sets are divided into train, test data of 67 or 33 percentage points, or 0.67 or 0.33 respectively.In order to enable the mapping of attribute values to appropriate values and list placement, the training data must be categorized.By contrasting the initial data with model predictions, the probability is determined.Based on the result, the highest likelihood is utilized to make a forecast.The accuracy may be calculated by comparing the generated class value with the test data set.
According to a different study, 22 agriculture has positive economic effects on the country.It falls short, nevertheless, in terms of using modern machine learning techniques.As a result, our farmers ought to be knowledgeable with all of the most recent machine learning technology and fresh approaches.The productivity of agriculture is increased by using these methods.To increase agricultural productivity rates, a number of machine learning approaches are used.These techniques can help with agricultural problems.We may also assess the accuracy of the yield by looking at several ways.Thus, we may perform better by contrasting the accuracy of several crops.In agriculture, sensor technology is widely used.The study helps increase agricultural yield rates. helps choose the right crop for the chosen site and season.

Materials and Methods Data Pre-processing
A technique called data pre-processing transforms unprocessed, uncleaned data ready for further analysis.Data may be gathered from multiple sources, but as they are collected in raw form, analysis is not possible.We convert data into a comprehensible format by using several strategies, such as substituting missing values and null values.Fields in the dataset which are insignificant for label prediction are eliminated.If required, One-Hot Encoding is performed on the dataset to have dataset ready for regression model fitting.The division of train and test data is the final stage in the data preparation process.As training the machine learning algorithm usually requires as much data points as possible, the data typically has uneven distribution.Training dataset, which in this case makes up 70% of total data, is used to train machine learning models and make accurate predictions.

Factors Affecting Crop Yield
The yield of every crop is impacted by a wide range of variables.These are essentially the characteristics that aid in estimating a crop yield.For crop yield prediction, this study includes parameters such as temperature, rainfall, area, humidity, soil nutrients, pH, and AUC (Area under Cultivation).

Comparison and Selection of ML Algorithm
We first must assess and compare different algorithms before selecting the one that best matches this particular dataset.Machine Learning is an effective way to solve crop prediction problem as it learns from past data and gives predictions on current parameters.In order to make precise predictions and stand by erratic patterns in weather conditions like temperature and rainfall, various machine learning classifiers like Logistic Regression, Naïve Bayes, Random Forest, KNN are used and compared for the performance metrics and the model with best accuracy is selected for crop prediction.For Yield prediction, regressors like Linear Regression, Random Forests and Decision Trees Regression are compared for metrics like MAE, Median Absolute Error and R2 Score.The model with best values is selected for predicting yield.

Naïve Bayes
Based on Bayes' theorem, Naïve Bayes model is frequently employed in many classification tasks.The multinomial, Bernoulli, and Gaussian algorithms make up the three Naive Bayes algorithms.Naive Bayes Algorithm is mostly employed for classification problems.It operates under the presumption that each feature has an equal chance of occurring and that the likelihood of each feature occurring is independent of the probabilities of the occurrence of all other features.The Bayes theorem determines likelihood of an event happening when another event is occurred.Multi-class classification makes use of Bayes theorem.Also, in comparison to other ML techniques, it is quicker and simpler to construct.Additionally, it doesn't need a lot of training data.
Both discrete and continuous data may be used with it.It is extremely scalable and unaffected by insignificant features.

Decision Trees
A decision tree is a type of tree structure that resembles a flowchart and is frequently employed in supervised machine learning for classification and prediction.A DT may be transformed into a set of rules, with each path serving as a different rule, with each path travelling from the root node to each leaf node.In a decision tree, each leaf node has a class that may be reached if an attribute matches the prerequisite for the branch that leads to it.In a decision tree, each internal node corresponds to a test, condition, or attribute. 6

KNN
The machine learning approach known as kNN, which is supervised and nonparametric, is used to solve classification and regression issues.Labeled data is used with supervised algorithms.The technique relies on the distances between the points, which may be calculated in a few different ways.The fact that the distance must always be either zero or positive should be taken into account.The distance is squared, raised to a given power, or the absolute values are used to do this.Pre-processing of all the labelled data is necessary before we apply the kNN algorithm.All of the data must first be normalized.As kNN struggles to function when there are too many features present, feature selection must then be used to eliminate the insignificant features.Missing data must be filled in.Else, that particular record must be eliminated.The performance can be enhanced by including more train samples.The fundamental drawback of KNN is that as the size of dataset grows, cost of computing rises, the algorithm's speed decreases.

Random Forests (RF)
The RF technique is a perfect example of ensemble learning in action since it connects several classifiers to tackle the challenging problem and improve a model's efficiency.The "forest" created with this approach is actually a collection of decision trees.In each decision split, RF characteristics are chosen at random.Picking traits that encourage prediction and lead to increased efficiency reduces the correlation across trees.The Random Forest ML classification approach generates the final output by combining the results of all the decision trees after segmenting the dataset into smaller subsets or trees.The Bagging subcategory of ensemble learning methods includes Random Forest.A Sample of rows and features from the primary dataset are selected at random and fed into the Random Forest Technique's decision trees.It can also carry out jobs requiring both regression and classification.It also works well with huge, highly dimensional data sets, and most significantly, it greatly improves the model's accuracy and fixes the overfitting problem.

System Architecture
System architecture is represented in Figure 1.End user interacts with web user interface (UI) which is hosted on a server.Open Weather Map API is connected to server to deliver weather data.The machine learning models are trained and tested by admin and are loaded in the server for predicting crop and yield in tone per hectare of land area.

Datasets
The public datasets have been chosen because they are readily available and easily accessible.Kaggle is a popular platform for finding and sharing datasets, so we were able to find datasets that met our criteria.We selected 3 datasets namely India Agriculture Crop Production 23 The dataset has following features: State, District, Crop, Year, Season, Area, Area Units, Production, Production Units, Yield.This dataset is used to build regression model for yield prediction. Yield is the required label.It has total 12176 records containing 27 unique crops and 4 unique seasons.Crops are as follows: Arhar/tur, bajra, castor seed, gram, groundnut, jowar,linseed, maize, moong, niger seed, other cereals, kharif pulses, rabi pulses, summer pulses, ragi, rapeseed and mustard, rice, safflower, sesamum, millets, soyabean, sugarcane, sunflower, tobacco, urad, wheat, oilseeds.

District Wise Rainfall Normal 23
This dataset is used for collecting district wise rainfall data to predict yield.It is used for extracting district wise average annual rainfall for each district of Maharashtra, India.This feature is combined with Yield dataset mentioned above to get estimates of production for particular crop in the given season.

Crop Recommendation 23
The dataset is used for crop prediction.It has features like N, P, K, rainfall, humidity, pH and crop.N, P, K stands for Nitrogen, Phosphorous and Potassium nutrients in soil.It has 2200 total records containing 22 unique crops.Data consists 100 records for each of the following crops: rice, maize, chickpea, kidney beans, pigeon peas, moth beans, mung beans, black gram, lentil, pomegranate, banana, mango, grapes, watermelon, muskmelon, apple, orange, papaya, coconut, cotton, jute, coffee.

Data Pre-Processing Crop Prediction
Prior to start modelling the data, we need to carry out data-pre-processing.It is done in following steps as shown in Figure.

Handling Missing Values
The line df = pd.read_csv('Crop_recommendation2.csv', na_values='=') reads the CSV file into a DataFrame (df), replacing any occurrences of '=' with NaN values, which are commonly used to represent missing data in pandas.

Separating the Target Variable
The line b = df['label'] extracts the target variable column ('label') from the DataFrame df and assigns it to the variable b.

Creating a Preprocessing Pipeline
The code creates a pipeline (my_pipeline) using scikit-learn's Pipeline class.The pipeline consists of two steps

Imputation
The missing values in the DataFrame are imputed using the SimpleImputer transformer with a strategy set to "mean".This strategy replaces the missing values with the mean of the corresponding column.

Standardization
The features are standardized using the Standard Scaler transformer.This step scales the features to have zero mean and unit variance.

Applying the Preprocessing Pipeline
The line X = my_pipeline.fit_transform(df)applies the preprocessing pipeline (my_pipeline) to the entire DataFrame (df).It fits the pipeline on the data to learn the mean values (for imputation) and the standardization parameters.Then, it transforms the data by applying the learned transformations.

Train-Test Split
The train_test_split() function from scikit-learn is used to split the processed features (X) and the target variable (b) into training and testing sets.The stratify=b parameter ensures that the class distribution is maintained in both the training and testing sets.The split is performed with a test size of 30% (test_size=0.3)and a random state of 42 (random_state=42).These pre-processing steps help handle missing values, standardize the features, and split the data into training and testing sets for further analysis and model training.

Yield Prediction
In data preprocessing, we did clean the data containing missing values, outliers, or errors that need to be addressed before the data can be used for machine learning.Also, we did data integration of District wise rainfall normal 23 and India Agriculture Crop Production 23 as we required it to be merged for passing it to the machine learning model.We did data transformation for India Agriculture Crop Production 23 dataset as it had categorical variables which need to be encoded as numerical values to pass to the machine learning model.We used One Hot Encoding for data transformation.We had to do data reduction to limit the dataset to the state of Maharashtra otherwise the dataset would have been too large in terms of the rows and columns.

Feature Selection
A machine learning model's performance can be improved through feature selection, which is the process of choosing a subset of the relevant features from available data.For Crop Prediction, following features were selected: Nitrogen, Phosphorous, Potassium, Temperature, Humidity, pH and Rainfall.For Yield Prediction, features selected are as follows: City, Crop, Annual Rainfall (in mm), Season.

Train Test Splitting of Data
We have split the data in the ratio 70:30 using random sampling and stratification.Choosing an appropriate train-test split is important in ML, because it can affect the accuracy and generalization of the resulting model.

API Integration
The city of user taken as an input is given to API call as a parameter.The temperature and humidity fields from API response are given to crop prediction model as input along with other data.This helps the system to give real-time predictions."OpenWeatherMap" API is used for the same.For creating an API URL, base URL and API key is used which is unique with each subscription.User's city name is passed in complete URL as a parameter and response is collected.From the collected response, required fields i.e., temperature and humidity are passed to ML model for predicting crop.

Training and Evaluation of Models
The crop prediction uses the multi-class classification machine learning model to predict the crop for a set of given input features.Whereas the yield prediction incorporates the regression model to predict the yield for a given set of input features.For training and evaluation of models, Google Colab Platform is used.While the User Interface for the project is built using ReactJs, the backend is built using Python Flask framework.

Application and Advantages over existing versions
The model can be used to create an impact on right crop selection as the user would get fair prediction on yield as well as crop.Also yield prediction would be important in financial assessment of crop strategy.Model is useful if the user wants to compare yield for multiple crop options and then select the best one.It could also be used in a wide geography to estimate the yield for a particular crop.This project can be used directly by end users as farmers for taking predictions for their conditions.Instead, it can also be used by government agencies for planning and policy making if modified with wider access to reliable closed source government data.It can also be used by NGOs which work for educating farmers in adopting new technologies and precision agriculture.Also, it can be used in fields where monetary calculations come in picture as it is dependent on how much yield could be produced like in insurance claims or loan policies.
The project improves the prediction accuracy by suitable data gathering cleaning and selecting best accurate model.Also, the project incorporates both crop as well as yield prediction.So, the project is using classification as well as regression models for necessary functionality.It adds value to the modern agriculture setup by providing a way to add to the reliability of crop selection which in turn improves the yield and financial stability.

Results and Discussions Crop Prediction
First, datasets are loaded and cleaned from insignificant features.After Data Preparation, data is split into training and testing data and various models are fitted and tested for accuracy.Feature Importance is calculated to determine the relative significance or contribution of individual features in ML model.For crop predication model, Drop Column Importance, also called as "permutation importance" or "feature importance by feature shuffling," is calculated.
Drop Column Importance = Baseline Metric − Shuffled Metric Drop column importance is based on the idea that removing a feature that is crucial to the performance of the model would cause it to perform less significant than before.It is calculated in following steps 1. Train a model with all features 2. Measure baseline performance with a validation data 3.One feature is determined of which importance is to be calculated 4. Train a model with all other features except the selected one 5. Calculate performance with a validation data 6.The feature importance is the drop in performance from baseline 7. Follow same steps 3 through 6 for every feature

Fig. 4: Feature Importance for Crop Prediction
As shown in above Figure 4, rainfall is most important feature for crop prediction followed by humidity, Potassium(K), Phosphorous(P), Nitrogen(N) and pH.
Classification Models' results for Crop prediction as depicted in Figure 5.   Train the Random Forest Regressor 2.
Access Feature Importances: Random Forest Regressor has built in attribute named feature_importances_.

3.
Map Feature Importances to Original Features: Every one-hot encoded feature is mapped with its original feature.4.
Aggregate Feature Importances: By aggregating we get every categorical feature's importance 5.
Rank Feature Importances in descending order of importance.
As shown in Figure 7, Crop is most important feature in order to predict yield followed by District, Rainfall and Season.

Fig. 6 :
Fig. 6: Code Snippet for Feature Importance in Yield Prediction

Fig.
Fig. 8: Regression Results for Yield Prediction

8: Regression Results for Yield Prediction Random
Linear Regression has Mean Absolute Error of 1.08, Median Absolute Error of 0.47 and R2 Score of 0.92.ConclusionCrop yield prediction is a complex process which relies on several different factors including weather, soil, fertilizers, pest infestations, etc.In this paper, we predict the crop yield using weather and soil parameters.The research is based on the datasets limited to districts in Maharashtra.The system incorporates regression techniques to estimate yield and multi-class classification to predict type of the crop.Among the used models for yield prediction, Random Forest Regression gives best results with MAE of 0.64 and R2 score of 0.96.For crop prediction, Naïve Bayes classifier gives most accurate results with accuracy of 99.39.The suggested method aids farmers in choosing which crop to plant in the field and how much yield any crop would give in that specific environment.Dataset used in the research can be improved by taking real time data through IoT devices.Also, various factors like irrigation and fertilizers use can be included for better prediction.Mobile App can be developed for mobile devices with added services like price estimates in accordance with current market prices.Paid datasets may bring more reliable and accurate data which in turn might help in model accuracy.They may contain more features which may help correlate more with label.
Forest Regressor gives most reliable results when given required inputs with Mean Absolute Error of 0.64, Median Absolute Error is 0.16 and R2 score of 0.96.Decision Trees Regressor has Mean Absolute Error of 0.80, Median Absolute Error of 0.18, R2 Score of 0.94.