Identification of soil type in Pakistan using remote sensing and machine learning

Soil study plays a significant role in the cultivation of crops. To increase the productivity of any crop, one must know the soil type and properties of that soil. The conventional soil type identification, grid sampling and hydrometer method require expert intervention, more time and extensive laboratory experimentation. Digital soil mapping, while applying remote sensing, offers soil type information and has rapidity, low cost, and spatial resolution advantages. This study proposes a model to identify the soil type using remote sensing data. Spectral data of the Upper Indus Plain of Pakistan Pothwar region and Doabs were acquired using fifteen Landsat eight images dated between June 2020 to August 2020. Bare soil images were obtained to identify the soil type classes Silt Loam, Loam, Sandy Loam, Silty Clay Loam and Clay Loam. Spectral data of band values, reflectance band values, corrective reflectance band values and vegetation indices are practiced studying the reflectance factor of soil type. Regarding multi-class classification, Random Forest and Support Vector Machine are two popular techniques used in the research community. In the present work, we used these two techniques aided with Logistic Model Tree with 10-fold cross-validation. The classification with the best performance is achieved using the spectral data, with an overall accuracy of 86.61% and 84.41% for the Random Forest and Logistic Model Tree classification, respectively. These results may be applied for crop cultivation in specific areas and assist decision-makers in better agricultural planning.


INTRODUCTION
Agriculture is the backbone of Pakistan's economy, which heavily depends on its cash crops. In Pakistan, there are two main seasons for growing crops: Rabi (winter) and Kharif (summer). Crops of Rabi are sown in the month of October-November and Kharif in the months of April-May. Cash crops such as cotton, rice, and sugarcane are sown in May and RS is a valuable technique for describing the spatial, spectral, and temporal properties noted from earth surface characteristics for spatially separated variables (Kneubühler et al., 2014;Pande et al., 2018;Pande et al., 2020). It is a technique of acquiring data through earth observation satellites. These satellites employ sensors that acquire reflectance from the earth's surface in frequency bands known as frequency spectrums. The reflectance is recorded using active as well as passive sensors. Active sensors emit light through a laserbased system and record its reflectance providing information about earth terrain, whereas passive sensors, which are most common, use sunlight reflected from the soil surface (Bagheri, 2020;Hengl et al., 2015). Near infrared reflectance analysis (NIRA) is used to compute the land characteristics in shortwave and near infrared (Ben-Gera, 1968). The reflectance is recorded into different spectra, which provides a hyperspectral image that can be used to extract spectral signatures (Mohamed et al., 2018). It helped to find and validate the relationship between chemical data and spectral data using prediction equations on analysis of multiple linear regression (MLR). In a laboratory environment (Ben-Dor, Irons & Epema, 1999) applied this technique to the visible region to predict specific surface area, carbonate (CaCO 3 ) contents, residual loss-on-ignition and total silica (SiO 2 ) by simulating the bands of the Landsat satellite.
Remotely sensed images provide information on different visible and invisible bands, which is then used to calculate indices such as NDVI, Leaf area index (LAI), soil reflectance index, and other useful indices (Xue & Su, 2017). With high-resolution sensors and advanced technology, the bands obtained by remote sensing are increasing, thus helping us calculate better indices with more representation power (Khan et al., 2018). RS and spectroscopy are provide us with hyper spectral images (HSI). These images provide us with image texture information and spectral information. Image texture information depends on the relationship of neighboring pixels intensities and helps us classify soil type (Xue & Su, 2017). The spectral signature of a material is characterized by its form, the intensity of its reflectance and absorption bands which are electrical and vibrational transitions (Gomez & Lagacherie, 2016).
These spectral signatures are provided to machine learning algorithms, which learn the relationship of these indices with the target variable. These are automated computer algorithms and the provision of representative data is important for them to learn these relationships. There are three types of machine learning algorithms: supervised, unsupervised, and reinforcement learning (Mahesh, 2020). The choice of algorithms depends on your data types and what task you want to automate. Supervised learning algorithms are spoon-fed versions of machine learning as we label the data corpus, and the algorithm works with strict boundaries. This process is like "connecting the dots" because we select the type of information to feed the algorithm and we know the kind of desired results (Mahesh, 2020).
Soil texture comprises silt, clay, and sand contents and is a valuable property for many soil functions. Water retention, soil structure, aeration, capacity, susceptibility to erosion, cation exchange capacity, organic matter content, and others are swayed by soil texture, which is an important property in soil classification systems (Bousbih et al., 2019;Jia et al., 2017). In the selection and growth of crops, soil texture is one of the most important factors (Chakraborty & Mistri, 2015).
RS techniques for the estimation of soil texture have been used in many studies. In the laboratory, texture quantification spectral data models were created (Coleman, Agbu & Montgomery, 1993). Spectral standards attained in the laboratory help us recognize satellite standards; meanwhile, the perception of energy interaction matches what occurs in the laboratory, altering only the scale and configuration of the apparatus. Consequently, studies have been performed using laboratory data to simulate satellite data (Rehman et al., 2015). Studies (Hengl et al., 2015;Okparanma & Mouazen, 2013;Pham, Prakash & Bui, 2018) have showed the analytical capacity of the satellite data in relation to the proximal data.
Dynamic spatial panels of RS data and topographic, bioclimatic and soil conditions features are used to apply data-mining techniques. Xu, Wang & Chen (2018) proposed a spatio-temporal crop disease prediction framework based on ensemble learning techniques and spatio-temporal recurrent neural network (STRNN), which is an extension of recurrent neural networks (RNN) in time and space. For prediction precision analysis, they used a specific dataset based on reported cases of wheat yellow rust outbreaks in the Longnan city.
In recent years, Jia et al. (2017) used hyperspectral images to measure total soil nitrogen and classify the soil types. Spectral soil data emanating from the satellite are influenced by pixel size of images, atmospheric attenuations, soil cover, sensor spectral resolution, geometric view, and increased signal-to-noise ratio since variations are essential in the methods and models to enhance spectral responses. For instance, Liao et al. (2013) proposed a method for spatial estimation of soil surface texture using kriging, co-kriging and stepwise multiple regression and co-kriging gave them best prediction. da Silva Chagas et al. (2016) recently predicted the clay, silt, and sand contents, based on Landsat TM 5 bands and various indexes which not only helped them enhance accuracy of the models but also classify bare soil using the RF and MLR methods.
More recently, Demattê et al. (2018) proposed a quantification model for soil texture based on reflectance information from a range of bare soils acquired by intersecting multitemporal satellite images. This model was applied to another region to calculate its pertinency. They used only bare soil spectral data collected using Landsat TM 7 satellite images of two distinct regions of Brazil. The model was trained using spectral data (acquired from six bands) and laboratory data (having size 0.00-0.20 m layer) of one region and extrapolated to another. Sandy loam, clayey loam, sandy, clayey, and very clayey soil are the differentiated textual classes.
Similarly, Forkuor et al. (2017) investigated six soil properties-silt, clay, sand, soil organic carbon, nitrogen, and cation exchange capacity of area 580 km 2 situated in southwestern Burkina Faso, by using the high-resolution satellite data (Landsat 8 and RapidEye), topography data and laboratory examined. They tested and compared four prediction models: random forest regression, support vector machine, stochastic gradient boosting, and multiple linear regression. Prediction of the model was validated on an independent set of similar soil samples (Forkuor et al., 2017).
In this context, the present study aims to identify the soil type by using the spectral signature of land through remote sensing data. This analysis is based on satellite measurements acquired over the Upper Indus Plain in Pakistan. We analyze the sensitivity of several indicators to soil texture and use the machine algorithms to determine soil type classifications. The objective of this study is to find out the answers to the following questions: I. Is it possible to identify the soil type using remote sensing data?
II. How can we map spectral signature with soil type?
III. Which index calculated from remotely sensed images is more useful for soil type identification?
IV. Which machine learning algorithm is more effective at mapping the spectral signature to the relevant soil type?

Study site
The study area selected for this research is the Upper Indus Plain of Pakistan, having thirty sites in the Pothwar region and sixty-six in Doabs (the area between two rivers). These sites are taken from the research report, Soil Physical and Hydraulic of the Upper Indus Plain of Pakistan (Malik et al., 2019). The northeastern areas of Pakistan, including the districts of Jhelum, Chakwal, Rawalpindi, Islamabad and Attock, consist of 2.23 Mha called the Pothwar plateau, as shown in Fig. 1A. This plateau is geographically surrounded by a salt range in the south, River Indus and Jhelum in the west and east, respectively. The surface of this land has waves and dissected ditch belts. The dominant soil texture is silt loam, loam, and Sandy loam in Pothwar. The main cultivated crops of this area are barley, wheat, legumes, sorghum, and vegetables (Khan et al., 2018). Doab is a land enclosed among two merging rivers. The most populous province of Pakistan, Punjab, has four doabs because it has five rivers streaming. These four Doabs are Thal, Rachna, Bari and Chaj Doab, as shown in Fig. 1B. Thal Doab is an area enclosed between Jhelum and Indus Rivers. It is the largest doab geographically, having area of 37,477 km 2 . Khushab, Mianwali, Layyah, Bhakkar, and Muzaffargarh are major districts of Thal Doab. Soil type of twenty sites were determined and each site covers 1,873 km 2 area in Thal Doab (Liao et al., 2013). Rachna Doab is an area enclosed between Chenab and Ravi Rivers and is geographically the second largest doab having an area of 31,331 km 2 . It comprises the central districts of Punjab. Gujranwala, Sialkot, Narowal, Sheikhupura, Faisalabad, Nankana Sahib and Toba Tek Singh are major districts in Rachna Doab. Rachna Doab determined the soil type of sixteen sites; each covers a 1,958 km 2 area. Bari Doab is an area enclosed between Sutlej, Bias and Ravi Rivers, having an area of 29,649 km 2 . Major districts of Bari Doab are Okara, Khanewal, Sahiwal, Multan, Lahore, and Kasur. The soil type of twenty sites was determined in this Doab; each site covers 1,482 km 2 area. Chaj Doab is an area between Chenab and Jhelum Rivers and the geographically smallest Doab with an area of 13,660 km 2 . Sargodha, Mandi Bahauddin and Gujrat are major districts of Chaj Doab. The soil type of ten sites was determined in Chaj Doab; each site covers 1,366 km 2 area.
The Pits were denser in Pothwar than in Doabs because of the extra diversity in the geology of Pothwar. Pothwar has 22,254 km 2 geographical area and 27.5 km × 27.5 km was the grid size. While 41.5 km × 41.5 km was the grid size of Doabs and 113,085 km 2 was the geographical area (Khan et al., 2018). The distribution of soil classes in the surface of Pothwar and Doab regions is shown in Figs The soil type can be identified using the spectral signature of land with the help of remote sensing data. In this work, we used Landsat 8 data (Pahlevan et al., 2014) and supervised classification methods to achieve the results. Spectral images acquired through remote sensing satellites are used to identify suitable frequency bands to calculate spectral signatures.

Satellite data
In 2013, the National Aeronautics and Space Administration (NASA) launched Landsat 8. It was a recently initiated Landsat satellite that housed the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS) instruments. OLI collects data for the visible, near-infrared, panchromatic band, and short-wave infrared spectral bands, whereas TIRS gathers images in the thermal region. The satellite completes its one cycle in 16 days and more than 1.8 million scenes are publicly available for the research community that can be utilized in several applications. Fifteen cloud-free Landsat 8 images acquired from USGS Earth Explorer website (https://earthexplorer.usgs.gov) between June 2020 and August 2020 were used for the present study. Several preprocessing steps were performed on each image to get the spectral band values shown in Table 1. We calculated these spectral bands' reflectance and corrective reflectance values. Soil signatures are commonly used in agricultural remote sensing and are known as vegetation indices. We also calculated the NDVI, EVI and SAVI vegetation indices (Rezaei et al., 2016) to investigate the impact of spectral bands on soil type classes. We have excluded Band 8-Panchromatic (0.50-0.68 µm) value due to the short range and most affected by clouds.

Derived spectral band values
Spectral signature is a combination of spectral bands' reflectance of material concerning wavelengths. It is used to help classify remote sensing images. We collected spectral band values, as shown in Table 1 for 82 out of 96 instances of the understudy site (Malik et al., 2019). We focused on five types of soil types: Silt Loam, Silty Clay Loam, Clay Loam, Sandy Loam and Loam which are dominant soil types in these regions, so we ignored fourteen instances that belonged to other soil types. On the base of Spectral band values, we have calculated reflectance band values, corrective reflectance band values and vegetation indices to generate our dataset. Our dataset has 27 attributes and a class label named soil type. We expanded our dataset by considering the nearest points and the 3 km distance from the actual points. Actual and extended instances of the dataset are given in Table 2. This labeled data is used to train a supervised machine learning algorithm that models the specific land map into different land types.

Classifiers algorithms
Several classification methods have already been applied to measure soil type; among them, RF and SVM are commonly used. In this study, we have selected three multi-class classification methods to train and compare the soil type classification; more detail on these methods is given in the following subsection.

Random forest
The random forest contains several decision trees. Each tree gives the class a vote, and the majority vote determines the final class. This implies that RF is an enhanced version of the decision tree classification. RS has also been extensively used in remote sensing classifications, where it has been applied to multitemporal images, multispectral data, optical data, and Synthetic Aperture Radar (Bousbih et al., 2019). In the most classical form of RF, a randomly built multiple decision tree is used, where each tree is trained on a random subset of data (named features). In recent decades, it has been known to be one of the most efficient algorithms and has been extensively reported in remote sensing literature. It has been widely used for various applications in the agricultural sector, including the analysis and classification of soil texture. The construction of classification and regression trees (CART) from the samples are involved in developing the RS classifier (Chen et al., 2017). A subset of input features (attributes) is randomly selected to form a set of predictor variables. It is best used for classification when there is a multi-class problem. The prediction of the RF depends on the predictions of the individual trees in the forest. The class that is classified by the random forest is the most predicted class in the individual trees. The important parameters that affect its performance: are the number of trees (k), maximum depth of the tree (not exceeding 25) and the minimum number of samples in each node (Bousbih et al., 2019). Typically, as the value of k increases, accuracy becomes better. Limiting the number of features reduces a RF's computational complexity. In the present study, the parameters were set as shown in Table 3.

Support vector machine
The support vector machine (SVM) is a supervised algorithm, and it is used for the classification of both linear and nonlinear data. SVM is generally employed in classification, pattern recognition, and regression analysis (Bousbih et al., 2019). It has been frequently used in remote sensing classification applications (Mohamed et al., 2018). The classifier uses a nonlinear mapping to transform the original data into a higher dimension and essential training tuples called support vectors are used to find a hyperplane for the data separation. In the case of multi-class classification, the SVM classifiers use kernel function to the input data, where the data are not linearly separable. The polynomial kernel is used among the different kernel functions that have been already proposed. In the current study, the Polynomial kernel was picked. Therefore, any complex problem can be solved by choosing an appropriate kernel function in SVM.

Logistic model tree
A logistic model tree (LMT) combines logistic regression (LR) and decision tree learning methods. In this model, information gain is used for splitting, the Logitboost algorithm is used to produce an LR model at every node in the tree and the tree is pruned using the CART algorithm (da Silva Chagas et al., 2016). LMT used cross-validation to find several Logitboost iterations to prevent training data overfitting. The Logitboost algorithm uses additive logistic regression, with each class Ci having a least square fit (Doetsch et al., 2009).
where n is the number of factors, whereas β i is the coefficient of the ith component of the vector x. In the LMT model linear logistic regression method is used to calculate posterior probability in leaf nodes (Chen et al., 2017).
D is the number of classes.

Evaluation
In this section, we compare the three approaches using their parameterization with three different classifiers (RF, SVM and LMT). The classification performance should be validated under conditions as close to the production environment. The results obtained are averaged over ten random, equally disseminated trails from the training and validation samples and not influenced by any specific split between the training and validation samples. For optimization of these classifiers, we used 10-fold cross-validation. The fundamental idea of this validation is to divide the training dataset into 10 subsets of equal size, with the ratios 9/10 and 1/10 being used for the training and validation sets, respectively. A classifier is trained on nine subsets and a remaining subset is used for validation. The split criteria of 10-fold cross-validation are shown in Fig. 3. Validation is an essential step to evaluate the quality of supervised classification results and is achieved by evaluating the classified data with the reference data. The confusion matrix facilitates the understanding of results and provides detailed information concerning the effects of incorrect predictions. The confusion matrix is constructed on each experiment which exhibits the predicted and actual classes as shown in Table 4. The square array represents the confusion matrix where the column corresponds to the predicted values while rows represent the reference data. Matrix diagonal values represent the correctly classified values, whereas other elements correspond to confusion.
To evaluate the influence of the class label, using RF, SVM and LMT, the Accuracy, Precision, Recall and F1 Score are computed from each confusion matrix over a set of different simulations.

Landsat-8 sensitivity analysis
We analyzed the impact of band values, reflectance band values, corrective reflectance band values and vegetation indices on soil type classes. We found that the reflectance of Band 1, Band 4 and Band 9 almost remains the same for all types of soil type classes. Furthermore, Band 2, Band 3, Band 5, Band 6 and Band 7 reflectance are more sensitive toward soil type, as shown in Fig. 4A. Reflectance band values of Band 5, Band 6 and Band 7 efficiently differentiate the soil types. Spectral signatures of reflectance band values helped us to map the soil type as shown in Fig. 4B. Clay loam has the highest reflectance values for corrective reflectance bands from Band 1 to Band 7. Corrective reflectance values for Band 6 and Band 7 have a significant impact on soil type classes, as shown in Fig. 4C. Silt Loam has minimum reflectance. However, Silty clay loam has the highest reflectance on NDVI and SAVI vegetation indices. From NDVI, SAVI and EVI vegetation indices, NDVI and SAVI can be seen to have sensitivity to soil types, as shown in Fig. 4D.

Classification scheme
Twenty-seven indicators were used to estimate the soil type: Landsat 8 images in eight different bands (B1-B7 and B9), the reflectance of band values, corrective reflectance of band values and vegetation indices. From sensitivity analysis, we found that all these indicators are sensitive to soil type class, as shown in Fig. 4. Consequently, all spectral bands were selected for the classification process. The workflow of the present study with its three modules to perform the soil type identification using remote sensing data is shown in Fig. 5.

Validation
A 10-fold cross-validation procedure is used to obtain predictions from the different classifiers. The accuracies obtained from the confusion matrices with these simulations are given in Fig. 6

DISCUSSION
The first research problem was "Is it possible to identify the type of soil using remote sensing data?" We have identified the soil type by using spectral signatures using remote sensing data. The different bands values of the pixel differentiate the soil type as shown in Figs. 4A-4D. Results also show us that, soil types are identified using remote sensing data with 86.61% accuracy by applying a random forest machine learning model as shown in Fig. 6. In the identification of soil type, clay loam was misclassified with silt clay loam and loam, sandy loam with silt loam and loam, silty clay loam with clay loam and silt loam, loam with sandy loam and silt loam, silt loam with sandy loam and loam. Our second research question was "How can we map spectral signature with soil type?" For this, we have studied the impact of band values as in Fig. 4A, reflectance values shown in Fig. 4B, corrective reflectance values shown in Fig. 4C and vegetation indices shown in Fig. 4D  Our third research question was "Which index calculated from remotely sensed images is more useful for soil type identification?" Based on our research work NDVI and SAVI are more useful indices among NDVI, SAVI and EVI vegetation indices as we discussed in this work. NDVI and SAVI index values are more helpful to identify the soil type as shown in Fig. 4D.
Fourth and last research question was "Which machine learning algorithm is more effective at mapping the spectral signature to relevant soil type?" For this, we have implemented different machine learning techniques like Random Forest, support vector Machine, decision trees, logistic model tree and bagging to identify the soil type. Accuracies of these approaches are elaborated in "Discussion"; you can clearly see the Random Forest, which is an ensemble learning algorithm, gave us the highest accuracy 86.61% as shown in Fig. 6.

CONCLUSION
Soil type plays an important role in the cultivation of the crop in a specific area. In the present study, we identified the soil type using spectral signatures based on remote sensing data. We also analyzed the impact of band values, reflectance values, corrective reflectance values, and vegetation indices on soil type. Band 5, reflectance of Band 5, reflectance of Band 6, corrective reflectance of Band 5, and corrective reflectance of Band 6 are more sensitive for soil type classes. Spectral signatures of NDVI and SAVI are more helpful in identifying the soil type. Three different machine learning techniques, Random Forest, support vector machine and logistic model tree, have been used to identify soil type classes and achieved 86.61% accuracy using Random Forest. Government organizations and decision-makers can use these results for agricultural planning. In future work, we can increase the number of soil type classes to map more areas. Consequently, the web-based application can be designed to determine the soil type by giving the location. Many machine learning classification techniques can be used to improve this multi-class classification problem. For example, we can use neural networks as deep learning algorithms to get better results from this dataset.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
Research was funded by NRPU Project No. 17006 and Project No. 9201. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: NRPU: 17006, 9201.