Optimization of Crop Recommendations Using Novel Machine Learning Techniques

Abstract: A farmer can use machine learning to decide what crops to sow, how to care for those crops throughout the growing season, and how to predict crop yields. According to the World Health Organization, agriculture is essential to the nation's rapid economic development. Food availability, access, and utilization are the three cornerstones of food security. Without a doubt, the main priority is ensuring that there is enough food for everyone, and increasing agricultural yield can help secure a sufficient supply. Crop yields vary substantially across the country, and this variation is the foundation for investigating whether cluster analysis can be used to identify crop yield patterns in a field. Previous study investigations were only marginally successful in accomplishing their primary intended objectives because of unstable conditions and imprecise methodology. The vast majority of farmers base their predictions of crop yield on prior observations of crop growth on their farms, which can be deceptive, and the literature shows that standard preprocessing methods and random selection of cluster values are not always reliable. The proposed study overcomes the shortcomings of conventional methodology by highlighting the significance of machine learning-based classification/partitioning and hierarchical approaches in offering a trained analysis of yield prediction in the state of Karnataka. The dataset used for the study was collected from the ICAR-Taralabalu Krishi Vigyan Kendra, Davangere, Karnataka. Crop area and crop yield are the significant variables in the two dataset analysis techniques the study employs to detect anomalies. The study emphasizes the importance of a mathematical model and algorithm for identifying yield trends, which can assist farmers in selecting crops whose seasonal timing has a large impact on yield productivity.


Introduction
Precision agriculture, as it is now known, was pioneered by environmentally conscious farmers, and the practice predates the invention of computers. These farmers identified both the actions required to increase crop yields and the variables that contributed to a field's unpredictability. They achieved this by taking field notes during the planting and harvesting seasons and then, based on the information gathered, selecting the most effective plan of action for the following year. Data-generating equipment and sensors have long been on the rise in agriculture [1], enabling farmers to make data-driven decisions; this type of farming is known as smart farming. The author of [2] provides a thorough overview of the various goals and strategies used in smart farming. One of the major problems in precision agriculture is making agricultural production predictions, and various models have been proposed and tried so far. Because crop yield is affected by a variety of factors, such as soil, weather, fertilizers, and seeds, multiple datasets must be used [3]. As a result, estimating agricultural production is a difficult task. Agricultural productivity forecasting models make it simple to estimate the actual yield, but improving yield prediction accuracy remains the goal [4,5].
The majority of climate change simulations are based on deterministic biophysical crop models [6]. Built on detailed representations of plant physiology, these models can still be used to assess response mechanisms and potential adaptation strategies [7]. Statistical models, on the other hand, outperform them when making predictions at larger spatial scales [8]. Several studies [9] have found a strong link between excessive heat and poor crop performance; this correlation was demonstrated using statistical models built on traditional econometric methods. In recent research, crop model output and crop model insights have both been incorporated into statistical model parameterization, among other attempts to merge crop models with statistical models [10]. These efforts have been made to better understand how statistical and crop models may interact [11].
Numerous studies detail the various challenges in developing high-performance forecasting models. Choosing the best algorithm for high performance therefore becomes a time-consuming and important task. Furthermore, the chosen systems and algorithms must be extremely efficient at handling large amounts of data [12]. Locating zones within a region that have behaved similarly over time is also more useful than predicting specific yields within a sector. However, factors that can affect yield, such as soil type, climate, and harvesting techniques, may vary from season to season; consequently, even if crop yield remains roughly constant from year to year, the yield of one season cannot account for differences within the field [13]. The large deviations in area and yield make it difficult to measure variation precisely. Yield estimates for many crops are built into the agricultural planning process. To protect the vested interests of farmers and the government, the research work categorizes the divisions based on yield and region. The government may otherwise act rashly on some major agricultural issues, whereas this approach assists farmers in selecting the appropriate crop for various areas in order to secure adequate yields. This is accomplished by taking into account cluster values based on the heuristic scores described in the following section. The work also elaborates on the importance of avoiding the cultivation of unnecessary crops in order to protect vital resources such as time and money. Most farmers plant crops based on past experience, resulting in lower yields. Various clustering algorithms are used in the study to identify the appropriate clusters, and comparative analyses are performed to determine the optimal cluster values. The section that follows contains a detailed description of the aforementioned study.
The research performs the following activities:
• A hierarchical and partitioning approach to developing clusters based on factors such as location, output, and productivity;
• A comparative analysis to identify the best method for structuring zones into clusters;
• A recommendation of areas or fields with the potential to produce high crop yields based on the specified scale value.

Dataset Overview
The data have been collected from various sources, including the Krishi Kendra (agricultural office) in Davangere district, Karnataka. Area and production statistics are collected from the Ministry of Agriculture and Farmers Welfare, Karnataka, India [14]. The dataset source is accessible in the records of the Karnataka Government [12,15]. The data span the years 1998 to 2018, and the preliminary data collection covers various districts in Karnataka. The dataset consists of 26,906 observations with the following variables: state, district, year, season, crop, yield (in tons), area (in hectares), and production (in tonnes).
A summary of the statistics of each variable is presented in Figure 1. Lines 5 and 6 represent the details of the variable "crop". The total number of crops available is 43, and each is identified by its specific name. Only the state of Karnataka is used for the study; as the total number of locations is 30, the names of the different districts must be known. Data are collected from 1997-1998 to 2017-2018, giving 21 years of data, over which the crops were studied consistently. Lines 20 to 23 represent the details of the variable "season". Crops are classified into Kharif, Rabi, and summer crops. The Kharif season begins in June and lasts until September, the Rabi season lasts from October to February, and the summer season lasts from March to May. Whole_year denotes crops cultivated throughout the year irrespective of the three seasons, and the value "Total" in the season column specifies the overall yield of a crop. Lines 25 to 29 represent the details of the variable "area", which specifies the approximate extent of land in hectares used for agricultural purposes. Lines 31 to 36 represent the details of the variable "production", which gives the output of each crop over the aggregate area. Missing values may be due to withheld reporting, non-production, or unentered data. Lines 38 to 42 represent the details of the variable "yield", which gives the quantity of crop harvested in tons per hectare.

Proposed Framework for Determination of Yield Trend
The flow chart of the working model, shown in Figure 2, starts by loading the dataset and clearing unwanted data. The main data are then summarized. Once this is performed, the next step is extracting the numeric variables, which enables generating a correlation table showing the correlation coefficients between variables such as area, production, and yield. If correlated variables are found, the redundant ones are filtered out, as they would give the same results when used. The data are then passed to the data inspection methods (probability density plot and boxplot). If the density plots reveal multi-modal data, the corresponding districts are filtered out, and the data are sent to the boxplot to check for the presence of outliers.
Afterward, the data are passed to the outlier detection methods (bpRule and Grubbs' test). Outliers are removed only when both methods identify them.
Once the data are free from outliers, the next step of the flowchart is picking the optimal k for clustering, driven by either area or yield. For area, the data are grouped by location, and mean partitions by area and by yield are obtained; for yield, the data are grouped by location, and a mean partition by yield is obtained. After the clustering, the output is categorized season-wise and presented, and inferences are drawn.
Most of the existing methods use multiple clustering algorithms to analyse large datasets. CLARA (Clustering Large Applications) is an extension of the k-medoids (PAM) technique for dealing with data comprising many objects (more than a few thousand observations). Its purpose is to reduce processing time and RAM storage problems, which it achieves through sampling.
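As an illustration of the idea (a sketch, not the authors' implementation, which appears to use R's cluster package), a CLARA-style procedure can be written as follows: draw random samples, run a PAM-like k-medoids on each sample, and keep the medoid set with the lowest total distance over the full dataset. Function names and parameter values are hypothetical.

```python
import numpy as np

def pam_like(X, k, rng, n_iter=20):
    """Very small k-medoids (PAM-like Voronoi iteration) on the rows of X."""
    idx = rng.choice(len(X), k, replace=False)
    for _ in range(n_iter):
        # assign each point to its nearest medoid
        d = np.linalg.norm(X[:, None] - X[idx][None, :], axis=2)
        labels = d.argmin(axis=1)
        new_idx = idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # new medoid = member minimizing summed distance to its cluster
            dm = np.linalg.norm(X[members][:, None] - X[members][None, :], axis=2)
            new_idx[j] = members[dm.sum(axis=1).argmin()]
        if np.array_equal(new_idx, idx):
            break
        idx = new_idx
    return idx

def clara(X, k=2, n_samples=5, sample_size=40, seed=0):
    """CLARA: PAM on samples, best medoids judged on the FULL dataset."""
    rng = np.random.default_rng(seed)
    best_idx, best_cost = None, np.inf
    for _ in range(n_samples):
        sub = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        medoids = sub[pam_like(X[sub], k, rng)]
        cost = np.linalg.norm(X[:, None] - X[medoids][None, :],
                              axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_idx, best_cost = medoids, cost
    return best_idx, best_cost

# demo on two well-separated synthetic groups
rng_demo = np.random.default_rng(1)
X_demo = np.vstack([rng_demo.normal(0.0, 0.5, (50, 2)),
                    rng_demo.normal(5.0, 0.5, (50, 2))])
medoids, total_cost = clara(X_demo, k=2)
```

Because each PAM run touches only a sample, the quadratic distance computations stay small even when the full dataset is large, which is exactly the time and memory saving CLARA targets.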

Algorithm for Determination of Yield Trends
Let D represent the training set of 26,906 tuples, each presented by an 8-dimensional feature vector, X = (x1, x2, . . . , x8), covering eight measurements: state, district, year, season, crop, yield (in tons), area (in hectares), and production (in tonnes).
Firstly, Algorithm 1 considers the numeric variables and checks whether they change together at a constant rate using the Pearson correlation [16] given in Equation (1):

r = Σi (xi − x̄)(yi − ȳ) / √( Σi (xi − x̄)² · Σi (yi − ȳ)² ) (1)
Then, linearly related variables are removed from the attribute list. The algorithm calls DetectionAndInspection(), which computes and draws kernel density estimates and detects multi-modal distributions. It also removes the locations, and their observations, for which the method determines the existence of a multi-modal distribution.
The simplest univariate outlier detection method, the Boxplot Rule, is also applied to the numeric variable area; it tags as an outlier any value outside the interval [Qtl1 − 1.5 × IQR, Qtl3 + 1.5 × IQR], where Qtl1 (Qtl3) is the first (third) quartile and IQR = Qtl3 − Qtl1 is the interquartile range.
A related method, Grubbs' test, starts by calculating the z-score of each observation x, z = |x − x̄| / s_x, where x̄ is the sample mean of the variable x and s_x is its sample standard deviation. Using this score, an outlier is declared if Equation (2) holds:

z > ((N − 1) / √N) · √( t²_{α/(2N), N−2} / (N − 2 + t²_{α/(2N), N−2}) ) (2)

where N is the sample size and t_{α/(2N), N−2} is the value of the t-distribution at the significance level of α/(2N) [8].
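A sketch of both outlier tests in Python (scipy supplies the t-distribution quantile; the synthetic data, variable names, and α = 0.05 default are illustrative, not the study's settings):

```python
import numpy as np
from scipy import stats

def boxplot_rule(x, factor=1.5):
    """Tag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for the single most extreme observation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    g = z.max()                                        # Grubbs' statistic
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2  # critical t, squared
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t2 / (n - 2 + t2))
    return g > g_crit, float(x[z.argmax()])

# synthetic "area" values with one implausible entry
rng = np.random.default_rng(0)
area = np.append(rng.normal(50.0, 5.0, 60), 120.0)
flagged = boxplot_rule(area)
is_outlier, value = grubbs_test(area)
```

Consistent with the flowchart, a point would be removed only when both methods flag it.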
The key issues with any clustering algorithm are:
• Cluster validation: deciding whether the obtained solution is precise;
• Obtaining an appropriate number of clusters for the yield dataset (compactness and cluster separation).
Internal validation methods, such as the Calinski-Harabasz Index and the Average Silhouette Width Index, are adopted to overcome these clustering issues.

Calinski-Harabasz Index
The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters; the higher the score, the better the performance.
For a set of data D of size n_D, which has to be clustered into k groups, the Calinski-Harabasz score s is given by

s = [tr(B_k) / tr(W_k)] × [(n_D − k) / (k − 1)],

where tr(B_k) is the trace of the between-cluster dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, given by

W_k = Σ_{q=1..k} Σ_{x ∈ C_q} (x − c_q)(x − c_q)^T,   B_k = Σ_{q=1..k} n_q (c_q − c_D)(c_q − c_D)^T,

where C_q represents the set of points in cluster q, c_q the centre of cluster q, c_D the centre of D, and n_q the number of points in cluster q [17].

Average Silhouette Width Index
For each observation i, the average distance to all objects in the same group as i is obtained and called a_i. For each observation, the average distance to the cases belonging to the other groups is also calculated and called b_i. Finally, the silhouette coefficient of any observation, s_i, is given by s_i = (b_i − a_i) / max(a_i, b_i) [18]. Initially, on the dataset of 26,906 instances, the algorithm calls the partitioning method on the variable area to form clusters. The area division obtained is further partitioned into yield-wise groups using the hierarchical method with linkage criteria.
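Both indices are available in scikit-learn, so the "best k over a range" procedure can be sketched as follows. This is a hedged illustration: KMeans stands in for the paper's partitioning (PAM) method, and the three well-separated synthetic groups stand in for district means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
# three clearly separated synthetic groups in place of the yield dataset
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in (0.0, 4.0, 8.0)])

ch, asw = {}, {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch[k] = calinski_harabasz_score(X, labels)   # higher is better
    asw[k] = silhouette_score(X, labels)         # in [-1, 1], higher is better

best_k_ch = max(ch, key=ch.get)
best_k_asw = max(asw, key=asw.get)
```

On such clean data both criteria agree on k = 3; on real yield data they can disagree (as they do in the paper, where ASW and CH suggest different k), which is why the study inspects several candidate values before fixing k.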
Given datasets with huge numbers of observations, computing the partitioning and hierarchical methods would be extremely computationally expensive [19]. To reduce the computation in evaluating these methods, without the locations and their associated observations splitting up and falling among different clusters [20], the mean() of the observations of each location is recorded.

• The maximum or complete_linkage clustering measures the dissimilarity between two groups by the largest distance between any two observations, one in each group; it is given mathematically in Equation (5) as the distance D(X, Y) between clusters X and Y: D(X, Y) = max_{x∈X, y∈Y} d(x, y);
• The minimum or single_linkage clustering measures the dissimilarity between two groups by the smallest distance between any two observations, one in each group; it is given mathematically in Equation (6) as D(X, Y) = min_{x∈X, y∈Y} d(x, y);
• The average_linkage measures the dissimilarity between two groups by the average distance between any two observations, one in each group; it is given mathematically in Equation (7) as the mean distance between elements of each cluster: D(X, Y) = (1 / (|X| · |Y|)) Σ_{x∈X} Σ_{y∈Y} d(x, y);
• Ward's method aims to minimize the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged, with the distance given mathematically in Equation (8) as the squared Euclidean distance between points.

All the linkage methods discussed above are applied to the dataset to determine the yield patterns, and the complete details of the methods are explained in Figures 15 and 16. From Table 6, the highest-scoring instance is considered for model construction.
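The four linkage criteria are all implemented in scipy, so applying them to the same data can be sketched as below (synthetic one-dimensional per-district mean yields, not the study's data; on well-separated values every linkage recovers the same two groups, whereas on real data the criteria can disagree, which is what Table 6 scores).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two groups of synthetic per-district mean yields (tonnes/hectare)
yields = np.concatenate([rng.normal(1.0, 0.1, 20), rng.normal(3.0, 0.1, 20)])
X = yields.reshape(-1, 1)

results = {}
for method in ("complete", "single", "average", "ward"):
    Z = linkage(X, method=method)                 # Euclidean distances
    results[method] = fcluster(Z, t=2, criterion="maxclust")
```

The cut at two clusters (criterion="maxclust") plays the role of the k chosen by the validation indices discussed above.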

Data Analysis
Data preprocessing is a significant part of the data mining process. Examining data that have not been thoroughly inspected can yield false results, owing, for instance, to issues such as out-of-range values, unlikely combinations of values, and missing values. If such obsolete and redundant data are present, the exploration of evidence becomes more complicated during the modeling process.
The following are the critical problems recognized during preprocessing.

Eliminating the Unwanted Observations
There are a few districts where multi-year data have not been recorded consistently over the years, which could lead to incorrect output because crops in such districts fall below the threshold value (40%). These districts have been identified and removed, as shown in Figure 3. Table 1 shows that the value "Total" is not a season but rather the combination of the various seasons in which a specific crop is grown, considering location, production, and output (for example, Kharif + summer). Rows with a cumulative value are highlighted and deleted.
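Removing the cumulative "Total" rows can be sketched in pandas as follows (the column names and values are hypothetical stand-ins for the dataset's layout):

```python
import pandas as pd

df = pd.DataFrame({
    "district":   ["Davangere", "Davangere", "Davangere"],
    "season":     ["Kharif", "Summer", "Total"],
    "production": [120.0, 40.0, 160.0],   # "Total" merely duplicates the sum
})

# keep only genuine seasons; the "Total" row is an aggregate, not a season
df = df[df["season"].str.lower() != "total"].reset_index(drop=True)
```

Dropping the aggregate rows before clustering prevents each crop's total from being double-counted alongside its seasonal entries.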

Fill in the Missing Values
It is important to note that the production column (Figure 1, line 32) has sixty missing values. An in-depth examination of the data in Table 2 reveals that the yield column has zero values, implying that no crop yield occurred; furthermore, the harvested area is limited, so there was no agricultural production. Zeros are therefore used to replace the missing values in the production column. Figures 4 and 16 show a side-by-side comparison of boxplot results for a portion of the region: the boxplot with the outlier is on the left, and the boxplot without the outlier is on the right. The graphicPlot() function generates a boxplot with the given parameters. The boxplot is also given a rug, which shows the parameter's concrete values, and a horizontal dotted line at the mean value [18]. By comparing this dotted line with the solid line inside the box marking the median, we can infer that anomalies have skewed the mean value. The Boxplot Rule and the Grubbs' test are the two methods used to identify outliers in districts with defined anomalies; as a result, these anomalies can be detected and eliminated.
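The zero-fill described above can be sketched as follows (hypothetical column names and values; the rule assumed is the paper's: a missing production alongside a zero yield means nothing was produced):

```python
import pandas as pd

df = pd.DataFrame({
    "district":   ["A", "B", "C"],
    "area":       [120.0, 3.0, 450.0],
    "yield":      [2.5, 0.0, 2.0],
    "production": [300.0, None, 900.0],   # missing where yield was zero
})

# replace the missing production values with zeros
df["production"] = df["production"].fillna(0.0)
```

Filling with zeros, rather than dropping the rows, preserves the record that the district cultivated the crop but obtained no output that season.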

Results and Discussion
Eight variables are used in the presented work, of which area, production, and yield are numeric. To avoid unnecessary analysis, a table of correlated variables is generated (Table 3) [20]. From this table, area and production appear to be highly correlated (about 95%), so the area and yield variables are retained for the analysis. The shaded density plot of area in Figure 4 suggests that four districts exhibit multi-modal distributions. From these plots, it is easier for the two outlier detection methods to identify the outliers correctly.
The two outlier detection methods, the boxplot rule and Grubbs' test [18], have identified one or two outliers in three districts, as shown in Figure 5. The outliers are very few, although they distorted the mean value, so their elimination will not cause any misleading analysis. The work divides the yield zones, which poses an actual clustering problem. Because of the lack of external information, knowing the value of the k parameter in advance is critical for proper groupings. The partitioning method was used in conjunction with two criteria to estimate the best number of clusters:
• Calinski-Harabasz Index ("CH");
• Average Silhouette Width Index ("ASW").
Scores over a range of k values are used to estimate the best k: "ASW" suggested 5 clusters, and "CH" suggested 10 clusters (Figure 6). "ASW" shows the sign of an exemplary cluster configuration. Out of clusters 2, 4, 5, and 6, whose cluster criterion values are nearly the same, the k value of 2 is selected, as shown in Table 4; it also makes ranking the clusters easier. The k value of 10 proposed by the criterion "CH" is not chosen because it indicates overfitting of the data. After setting the k value to 2, cluster analysis is performed twice: once on mean(area) and once on mean(yield), with Euclidean distance as the metric. We use the mean() of the variables (area and yield) because we do not want locations and their associated observations to split up and fall into different clusters, as this distorts the distribution and leads to false interpretations [20].
The partitioning method optimizes its objective function in two phases, a construct phase and an exchange phase [21]. In the construct phase, the algorithm looks for a good initial set of medoids, and in the exchange phase, it fine-tunes the initial estimates given by the rough clusters determined in the construct phase. Looking at the values of the objective function in Figure 7 (construct phase: 22350; exchange phase: 16357), the function changed significantly from the construct phase to the exchange phase. The partitioning method selected Bangalore_rural and Mandya as the two reference medoids. After dividing the areas into the two groups area 1 and area 2, these clusters are identified and renamed as small and large areas based on the mean values obtained in Table 5. The second part focuses on the variable "yield". The density plot shown in Figure 8 indicates a multi-modal distribution in one district, Bagalkot, and Figure 9 identifies three districts, namely Hassan, Koppal, and Shimoga. The outlier detection methods detected two outliers in small areas, as shown in Figure 10, and three outliers in large areas, as shown in Figure 11. Picking the k value in small and large areas uses the "ASW" metric along with hierarchical linkage criteria such as complete_linkage, average_linkage, single_linkage, and ward.D2 to estimate the best k. The small area picks up two clusters, whereas the large area picks up four clusters; the resultant scores are presented in Table 6 [18]. After determining the best k value, cluster analysis is carried out on the small and large areas using agglomerative nesting (Agnes). Figure 12 depicts the results obtained from Agnes. Lines 2 and 11 specify the call to the Agnes function. The first argument to this function is the Euclidean distance matrix of the variable yield from the small and large area clusters, respectively.
The second argument specifies the criterion, in this case "average", used to select the two groups for merging at each step. Lines 3 and 12 define the agglomerative coefficient, quantifying the amount of clustering structure discovered (values closer to 1 suggest a strong clustering structure).
We obtained 0.88 and 0.93 for the small and large areas, respectively, indicating a fairly reasonable clustering structure for both groups. Lines 4 and 13 specify the order of objects, i.e., a vector containing a permutation of the original observations to allow plotting. Lines 6 and 15 define the height, a vector containing the distances between merging clusters at each stage.
A dendrogram is a tree-representation diagram. This representation, widely used in hierarchical clustering, shows how the clusters generated by the relevant analysis are arranged. In this research, we used the Average Silhouette Width (ASW) Index, the Calinski-Harabasz (CH) Index, and hierarchical linkage criteria such as complete_linkage, average_linkage, single_linkage, and ward.D2 to estimate the best k. Because we want to know how well the original distance matrix is approximated in the cluster space, a measure of the cophenetic correlation is also useful, and the concordance with ward.D2 hierarchical clustering gives an idea of the stability of the cluster solution. In the current research, the two clusters found in Figure 6 were used for the investigation, as shown in Figure 13. A dendrogram is plotted and compared to a banner plot to determine whether the data have been properly clustered; Figures 13-16 give comparative interpretations of the dendrogram and banner plot. The banner plot's x and y axes represent the height and the order in which objects are clustered in Figure 14. The banner plot's white area (Figure 14) indicates the unclustered data [22], while the white lines (Figure 14) mark the red blocks where the clusters were arranged. It can be seen that objects 9 and 13 have a larger bar than objects 1 and 9. Moreover, between objects 2 and 5 there is no bar at all, indicating that objects 1, 9, and so on, up to 5, belong to one cluster, which is completely dissimilar to objects 2, 12, and so on, up to 15, which belong to another cluster [23]. The dendrogram shows a similar pattern, as shown in Figure 13.
The unclustered data are indicated by the white area of the banner plot (Figure 16), and the white lines (Figure 16) mark the red blocks where the clusters were arranged. As a result, objects 1 and 7 have a larger bar than objects 1 and 3. Furthermore, there is no bar between objects 3 and 2, indicating that objects 1, 7, and 3 belong to one cluster, while objects 2 and 9 belong to a different cluster. Figure 15 depicts a similar pattern in the dendrogram.
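The agglomerative coefficient and cophenetic correlation discussed above can be reproduced with scipy; the sketch below uses synthetic data, and the coefficient follows the definition in R's cluster::agnes (which the study's output format suggests was used), so treat the helper as an assumption rather than the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def agglomerative_coefficient(Z, n):
    """Mean of 1 - h(i)/h_max over observations i, where h(i) is the
    height at which observation i first merges (as in cluster::agnes)."""
    first_merge = np.zeros(n)
    for left, right, height, _ in Z:
        for node in (int(left), int(right)):
            if node < n:                 # original observation's first merge
                first_merge[node] = height
    return float(np.mean(1.0 - first_merge / Z[:, 2].max()))

rng = np.random.default_rng(0)
# two tight synthetic groups standing in for the yield clusters
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(5.0, 0.3, (15, 2))])

Z = linkage(X, method="average")
ac = agglomerative_coefficient(Z, len(X))   # closer to 1 = stronger structure
coph_corr, _ = cophenet(Z, pdist(X))        # fit of tree to original distances
```

High values of both, as with the 0.88 and 0.93 coefficients reported above, indicate that the dendrogram is a faithful summary of the original distance matrix.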
After dividing the districts of the small and large areas into different clusters, the two clusters from the small areas, zone 1 and zone 2, are identified and renamed as low- and high-yield districts based on the pie plot obtained in Table 7. Similarly, the four clusters from the large areas, zone 1 to zone 4, are identified: zone 1 is low yield, zone 2 is high yield, and zones 3 and 4 are considered moderate-yield districts. The average yield distributions of the small and large areas are presented in Figures 17 and 18, respectively. Table 8 shows that rice is cultivated and harvested during all three seasons. Moreover, larger areas are allocated to this crop during the Kharif season, which has a higher impact on production than the summer season, followed by the Rabi season.

Conclusions
The work presented aims to recommend and prioritize the relevant crops for different districts, eliminating risks and investments, in order to achieve yield benefits instead of harvesting or sowing flop crops (crops irrelevant to the regions). The work further helps government sectors and farmers by providing detailed information on the types of seasons and crops that have a high impact on yield and production, and by resolving issues related to agricultural activities in weaker locations. Two dataset inspection methods (probability density function and boxplot) are used to detect multi-modal distributions and to identify and eliminate outliers. The research observations were conducted on the agricultural plains of Karnataka, using the dataset furnished by the Ministry of Agriculture and Farmers Welfare, Karnataka, India, and the ICAR-Taralabalu Krishi Vigyan Kendra, Davanagere, Karnataka, India. The proposed work uses linkage parameters along with two heuristic methods, applied to both hierarchical and partitioning algorithms, to estimate the best value of k. In addition, dendrograms are created and compared to the banner plots to guarantee that the cluster formation is accurate. The algorithm designed in this work computes the mean score on the area and yield variables, which speeds up execution and assists in carrying out a precise analysis to draw the best yield patterns that can be recommended. Finally, based on economic significance, the work recommends the best crops while eliminating risks and investments.

Acknowledgments:
The authors thank the Bapuji Institute of Engineering & Technology (BIET) for providing support and infrastructure.

Conflicts of Interest:
The authors declare no conflict of interest.