Artificial intelligence–based models for the qualitative and quantitative prediction of a phytochemical compound using HPLC method

Isoquercitrin is a flavonoid compound that can be extracted from different plant species such as Mangifera indica (mango), Rheum nobile, Annona squamosa, Camellia sinensis (tea), and coriander (Coriandrum sativum L.). It possesses various biological activities, including the prevention of thromboembolism as well as anticancer, anti-inflammatory, and antifatigue activities. Therefore, there is a critical need to elucidate and predict the qualitative and quantitative properties of this phytochemical compound using the high-performance liquid chromatography (HPLC) technique. In this paper, three different nonlinear models, namely artificial neural network (ANN), adaptive neuro-fuzzy inference system (ANFIS), and support vector machine (SVM), in addition to a classical linear model [multilinear regression analysis (MLR)], were used for the prediction of the retention time (tR) and peak area (PA) for isoquercitrin using HPLC. The simulation uses concentration of the standard, composition of the mobile phases (MP-A and MP-B), and pH as the corresponding input variables. The performance efficiency of the models was evaluated using root mean square error (RMSE), mean square error (MSE), determination coefficient (DC), and correlation coefficient (CC). The obtained results demonstrated that all four models are capable of predicting the qualitative and quantitative properties of the bioactive compound. A predictive comparison of the models showed that M3 had the highest prediction accuracy among the three models. Further evaluation of the results showed that ANFIS–M3 outperformed the other models and serves as the best model for the prediction of PA. On the other hand, ANN–M3 proved its merit and emerged as the best model for tR simulation. The overall predictive accuracy of the best models showed them to be reliable tools for both qualitative and quantitative determination.

to model the retention factors of 16 different kinds of amino acids using reversed-phase liquid chromatography through the application of different elution modes. The accuracy of ANN prediction was far better than predictions obtained by applying the same data to retention models based on the solution of the fundamental equation of gradient elution [19].
The contributions and novelty of this study are as follows. First, to the best of our knowledge, no study in the technical literature has reported the chromatographic application of these methods to the coriander plant since the inception of AI-based techniques. Second, the experimental aim of the current research is to determine the qualitative and quantitative properties of isoquercitrin from the coriander plant. These properties were further simulated using data-driven models, and to the best of our knowledge, this is the first study in the literature depicting the application of AI-based models to predict the qualitative and quantitative properties of the phytochemical compound. Finally, the study proposes three different nonlinear models (ANN, ANFIS, and SVM) and a classical linear model (MLR) for the prediction of tR and PA for isoquercitrin using HPLC method development.

Materials
All chemicals used in the study are HPLC grade and were purchased from Sigma-Aldrich (Sigma-Aldrich Corp., St. Louis, MO, USA).

Instrumentation
An HPLC instrument (Agilent Technologies 1200 series, USA) with a diode-array detector (DAD) was utilized in this experimental study. The bioactive compound was analyzed on an Eclipse XDB-C18 (150 mm × 4.6 mm, 5 µm) reversed-phase column. The mobile phase consisted of a gradient elution system composed of deionized (DI) water with formic acid (A) and methanol (B). The flow rate was set at 0.9 mL min⁻¹ with a column temperature of 35 °C and an injection volume of 10 μL. The analytical wavelength was set at 254 nm for isoquercitrin, and the identification of isoquercitrin was performed by comparing the retention times of the real samples with those of pure standards. Quantitative results were calculated by external standardization based on measured peak areas. The limit of detection (LOD) and limit of quantification (LOQ) were calculated from the ratio of the standard deviation to the slope of the calibration curve.
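The standard-deviation-over-slope calculation can be sketched as follows. This is a minimal illustration, not the authors' procedure: the multipliers 3.3 (LOD) and 10 (LOQ) follow the common ICH Q2 convention and are an assumption, since the text does not state the factors used, and the numerical values are hypothetical.

```python
def detection_limits(sigma, slope, lod_factor=3.3, loq_factor=10.0):
    """Return (LOD, LOQ) from the response standard deviation and calibration slope.

    Assumption: the factors 3.3 and 10 follow the common ICH Q2 convention;
    the paper only states "standard deviation over slope".
    """
    if slope == 0:
        raise ValueError("calibration slope must be non-zero")
    ratio = sigma / slope
    return lod_factor * ratio, loq_factor * ratio


# Hypothetical values for illustration only (not measured data).
lod, loq = detection_limits(sigma=0.5, slope=25.0)
```

The results carry the concentration units of the calibration curve, because the slope is expressed as response per unit concentration.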

Extraction of isoquercitrin from coriander leaves
The coriander leaves were collected from Fresh Farm Company in Alayköy, Northern Cyprus. The leaves were then dried, ground, powdered, and weighed. The weighed samples were further extracted using 100 mL methanol and stirred using a magnetic stirrer for 2 h. The obtained extract was then evaporated using a rotary evaporator. The residue was dissolved in water prior to HPLC analysis.

Proposed methodology
For any data-driven method, knowledge of data science and analysis is crucial; for this reason, the data were collected from our experimental studies. This study proposes the application of four data-driven algorithms including the classical and most commonly used linear model, MLR, and three nonlinear models: ANN (the most widely used data-driven model), ANFIS (a hybrid learning algorithm), and SVM. These models were used to predict the qualitative (i.e., tR) and quantitative (i.e., PA) properties of isoquercitrin from the coriander plant. Concentration of the standards, pH, and composition of the mobile phase (methanol and water) were used as the corresponding input variables. The basic motivation for employing different data-intelligence models is the difficulty of knowing in advance whether a specific model is superior to others in practice. Choosing appropriate models for a particular case can be challenging for predictors. This difficulty can only be overcome by selecting and comparing different data-driven models, including linear models, despite their weaknesses in handling highly nonlinear and complex data. Figure 1 presents a flowchart of the methods used for the development of the current study. The input data were collected, preprocessed, and normalized based on Eq. (1) (Figure 1). Normalization of the data was conducted before model training, and this is usually performed to increase the accuracy and speed of the model. Predictive models such as regression and data-intelligence models are usually evaluated using different numerical indicators, as presented in Section (2.6) below. In Eq. (1), $y$ is the normalized data, $x$ is the measured data, and $x_{max}$ and $x_{min}$ are the maximum and minimum values of the measured data, respectively.
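The normalization step can be illustrated with a short sketch. The exact form of Eq. (1) is not reproduced in the text, so the code below assumes plain min-max scaling to the range [0, 1], which is consistent with the symbols defined for it (measured value, maximum, and minimum); the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale measured values to [0, 1] using their minimum and maximum.

    Assumption: standard min-max scaling; the paper's Eq. (1) may use a
    different target range.
    """
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("values must not all be identical")
    span = x_max - x_min
    return [(x - x_min) / span for x in values]


# Hypothetical pH settings, for illustration only.
ph_values = [2.5, 3.0, 3.5, 4.5]
normalized = min_max_normalize(ph_values)
```

Scaling every input to a common range keeps variables with large magnitudes, such as peak area, from dominating the training of the nonlinear models.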

Artificial neural networks (ANNs)
Generally, ANNs, whose design is motivated by the configuration of the human brain, are used to build models that aid in understanding the process itself. The ANN is made of artificial neurons that are considered robust and interconnected processing systems that act together in order to elucidate a certain problem [20]. Artificial neural networks are mostly applied in complex situations that cannot be solved using classical computational methods. The ANN model is a learning algorithm in which the relationships that exist between the predictor and output elements are generated by the data itself. It is important to note that an ANN is able to learn from a wide range of examples [19]. An ANN is also able to handle an incomplete task and still estimate the outcomes. These two considerable properties of ANNs distinguish them from other data-driven models and lead to their high level of applicability in diverse research areas. The overall architecture of an ANN is shown in Figure 2. The network is composed of various layers of neurons. In order to predict the measurable relationship that exists between input and output variables to a suitable degree of accuracy, it is highly recommended to have at least one hidden layer consisting of a number of nodes [17]. Generally, this model consists of three steps: calibration, validation, and verification.
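A forward pass through a network with one hidden layer, as described above, can be sketched as follows. This is an illustrative sketch only: the tanh activation, the linear output node, and all weights are assumptions, since the text does not specify the network's activation functions or trained parameters.

```python
import math


def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute the output of a single-hidden-layer network.

    Assumptions: tanh activation in the hidden layer and a linear output node.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Four normalized inputs (concentration, MP-A, MP-B, pH); hypothetical values.
sample = [0.2, 0.7, 0.3, 0.5]
prediction = mlp_forward(
    sample,
    hidden_weights=[[0.4, -0.1, 0.3, 0.2], [-0.3, 0.5, 0.1, -0.2]],
    hidden_biases=[0.0, 0.1],
    output_weights=[0.8, -0.6],
    output_bias=0.05,
)
```

During calibration, the weights and biases would be adjusted to minimize the error between the predicted and measured tR or PA values.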

Adaptive neuro-fuzzy inference system (ANFIS)
The ANFIS is considered a general estimator that can respond to all kinds of complex problems. The ANFIS is a hybrid of adaptive multilayer and feed-forward networks and consists of input-output variables together with fuzzy rules of the Takagi-Sugeno type. The fuzzifier and defuzzifier are the major parts of the fuzzy database system. Fuzzy logic involves the conversion of input data into fuzzy values through the application of membership functions (MFs). Nodes work as MFs and permit modeling of the relations between input and output. There are various types of membership functions, such as triangular, sigmoid, Gaussian, and trapezoidal [21]. Assume the FIS contains two inputs, $x$ and $y$, and one output, $f$; a first-order Sugeno fuzzy model has the following rules:
$$\text{Rule 1: if } \mu(x) \text{ is } A_1 \text{ and } \mu(y) \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1 \quad (2)$$
$$\text{Rule 2: if } \mu(x) \text{ is } A_2 \text{ and } \mu(y) \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2 \quad (3)$$
where $A_1$, $B_1$, $A_2$, and $B_2$ are membership functions for the inputs $x$ and $y$, and $p_1$, $q_1$, $r_1$, $p_2$, $q_2$, $r_2$ are outlet function parameters. The structure and formulation of ANFIS follows a five-layer neural network arrangement. Refer to Lu et al. [9] for more information about ANFIS.
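A two-rule, first-order Sugeno inference step of the kind described above can be sketched as follows. The sketch assumes Gaussian membership functions, the product operator for the "and" in each rule, and a weighted-average output; these are common ANFIS choices but are assumptions here, and all parameter values are hypothetical.

```python
import math


def gaussian_mf(value, center, width):
    """Gaussian membership grade of value for a fuzzy set (assumed MF type)."""
    return math.exp(-0.5 * ((value - center) / width) ** 2)


def sugeno_inference(x, y, rules):
    """Evaluate first-order Sugeno rules and return their weighted average.

    Each rule holds membership parameters "A" and "B" as (center, width)
    and linear outlet parameters p, q, r with f = p*x + q*y + r.
    """
    strengths, outputs = [], []
    for rule in rules:
        # Firing strength: product of the two membership grades.
        strengths.append(gaussian_mf(x, *rule["A"]) * gaussian_mf(y, *rule["B"]))
        outputs.append(rule["p"] * x + rule["q"] * y + rule["r"])
    total = sum(strengths)
    if total == 0.0:
        raise ValueError("no rule fired for this input")
    return sum(w * f for w, f in zip(strengths, outputs)) / total


# Hypothetical rule base, for illustration only.
rules = [
    {"A": (0.0, 1.0), "B": (0.0, 1.0), "p": 1.0, "q": 1.0, "r": 0.0},
    {"A": (1.0, 1.0), "B": (1.0, 1.0), "p": 2.0, "q": 0.0, "r": 1.0},
]
output = sugeno_inference(0.5, 0.5, rules)
```

At the midpoint between the two sets, both rules fire equally, so the output is the plain average of the two linear outlet functions.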

Multilinear regression (MLR)
Generally, regression models predict the extent of correlation between input and output parameters as well as the relationship that exists between them. Linear regressions are generally fitted using a least squares approach, although other methods could be employed, such as minimizing the lack of fit in some other norm or minimizing a penalized version of the least squares loss function, as in ridge regression. Primarily, linear regression is categorized into two major divisions: multiple and simple linear regression. A linear regression is considered simple if it is aimed at predicting the correlation between a single output and a single input variable. However, if the aim is to use two or more input variables in order to determine a single criterion variable, the model is referred to as MLR. In MLR each value of the input parameter is associated with a value of the output variable. Multilinear regression is the most widely utilized form of linear regression and has been used in various areas of study. It is worth mentioning that MLR expresses the correlation in terms of a straight line that best estimates all the data points involving both output and target variables [20]. The general form of the MLR model is shown in Eq. (4):
$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_i x_i \quad (4)$$
where $x_i$ is the value of the $i$th predictor, $b_0$ is the regression constant, and $b_i$ is the coefficient of the $i$th predictor.
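The MLR form in Eq. (4), a regression constant plus one coefficient per predictor, can be fitted by ordinary least squares. The sketch below solves the normal equations with Gauss-Jordan elimination in plain Python; it is an illustration under stated assumptions, not the EViews procedure used in the study, and the data are hypothetical.

```python
def fit_mlr(rows, targets):
    """Fit y = b0 + b1*x1 + ... by least squares via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 carries b0
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = []
    for i in range(k):
        line = [sum(r[i] * r[j] for r in design) for j in range(k)]
        line.append(sum(r[i] * t for r, t in zip(design, targets)))
        aug.append(line)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def predict_mlr(coefficients, row):
    """Evaluate the fitted linear model for one set of predictor values."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Hypothetical data generated from y = 1 + 2*x1 - 0.5*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1.0, 3.0, 0.5, 2.5, 4.5]
coefficients = fit_mlr(rows, targets)
```

Because the hypothetical targets lie exactly on a plane, the fit recovers the generating coefficients; with measured data the residuals would be non-zero.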

Support vector machine (SVM)
In 1995 Vapnik proposed the idea of learning in the context of the SVM, which provides the desired mechanism for solving problems that involve classification, prediction, pattern recognition, and regression. The SVM works according to the concept of machine learning and is a data-driven model [21]. The two major foundations of SVM are statistical learning theory and structural risk minimization. This helps SVM offer insight that differs from ANN, as it reduces error, redundancy of data, and complexity and increases the general performance of the system. Support vector machines can be classified into linear support vector regression and nonlinear support vector regression [22]. Support vector regression (SVR) is a form of SVM based on two basic structural layers: the first layer is a kernel function weighting on the input variable, while the second layer is a weighted sum of kernel outputs [23]. In SVM the linear regression is first fitted on the data, and then the outputs go through a nonlinear kernel to identify the nonlinear pattern of the data. The calibration data is $\{(x_i, d_i)\}_{i=1}^{N}$ ($x_i$ is the input vector, $d_i$ is the actual value, and $N$ is the total number of data points), and the overall SVM function, written in its standard form, is given as:
$$y = f(x) = w\,\varphi(x_i) + b$$
where $\varphi(x_i)$ indicates feature spaces, nonlinearly mapped from input vector $x$, and $w$ and $b$ denote the weight vector and bias of the regression function.
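The two-layer structure described above, kernel evaluations followed by a weighted sum of the kernel outputs, can be sketched as a prediction function. The radial basis function (RBF) kernel is an assumption, since the text does not name the kernel used, and the support vectors, weights, and bias below are hypothetical placeholders rather than fitted values.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2) (assumed kernel choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, weights, bias, gamma):
    """First layer: kernel outputs; second layer: their weighted sum plus bias."""
    kernel_outputs = [rbf_kernel(sv, x, gamma) for sv in support_vectors]
    return sum(w * k for w, k in zip(weights, kernel_outputs)) + bias


# Hypothetical support vectors and weights, for illustration only.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
weights = [2.0, -1.0]
estimate = svr_predict((0.0, 0.0), support_vectors, weights, bias=0.5, gamma=1.0)
```

In a trained SVR, the weights and bias come from solving the optimization problem on the calibration data; only the prediction step is shown here.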

Evaluation criteria for data-driven models
Generally, for any form of data-driven approach, performance accuracy is evaluated using various criteria based on a comparison between the predicted and measured values. In this study, the determination coefficient (DC) as a goodness-of-fit measure, correlation coefficient (CC), and two statistical errors, root mean-squared error (RMSE) and mean-squared error (MSE), were used for the evaluation of the models.
where $N$, $Y_{obs_i}$, $\bar{Y}$, and $Y_{com_i}$ are the number of data points, observed data, average value of the observed data, and computed values, respectively.
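The four indicators can be computed as in the sketch below. The defining equations are not reproduced in the text, so the code assumes the standard definitions: MSE as the mean squared difference between observed and computed values, RMSE as its square root, DC as one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean, and CC as the Pearson correlation coefficient. The sample values are hypothetical.

```python
import math


def evaluate(observed, computed):
    """Return DC, CC, MSE, and RMSE under their standard definitions (assumed)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_com = sum(computed) / n
    residual_ss = sum((o - c) ** 2 for o, c in zip(observed, computed))
    total_ss = sum((o - mean_obs) ** 2 for o in observed)
    computed_ss = sum((c - mean_com) ** 2 for c in computed)
    covariance = sum((o - mean_obs) * (c - mean_com) for o, c in zip(observed, computed))
    mse = residual_ss / n
    return {
        "DC": 1.0 - residual_ss / total_ss,
        "CC": covariance / math.sqrt(total_ss * computed_ss),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Hypothetical observed and computed values, for illustration only.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

DC and CC approach 1 for a good fit, while MSE and RMSE approach 0.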

Data set description and validation of the models
In a data-driven method, the primary objective is to fit the models to a given data set based on the employed indicators in order to produce a reliable prediction of the unknown data set. Because of issues such as overfitting, satisfactory training performance is not always matched by the testing performance. In the validation process, different types of validation approaches can be applied, including k-fold cross-validation, holdout, leave-one-out, and so on [24]. The major advantage of the k-fold cross-validation mechanism is that in every single round, the validation set and the training sets are independent [25]. As stated above, the data were further divided into two subsets: 75% for the calibration (training) stage and 25% for the testing (verification) stage. Besides k-fold cross-validation, it is important to note that other validation methods can be applied to the data set [26]. Furthermore, the data in the current study were collected over a period of two months, and the data set was composed of 64 instances for each variable.
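The 75%/25% division and a k-fold partition of the 64 instances can be sketched as follows. The random shuffling, the seed, and the choice of k = 4 are assumptions made for illustration; the text does not state how instances were assigned or which k was used.

```python
import random


def holdout_split(n_samples, train_fraction=0.75, seed=0):
    """Return (training, testing) index lists after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# 64 instances, as reported for the data set.
train_idx, test_idx = holdout_split(64)
fold_pairs = list(k_fold_indices(64, k=4))  # k = 4 is an assumed value
```

With 64 instances, the holdout division gives 48 calibration and 16 verification samples, and each of the four folds validates on 16 samples that are excluded from that round's training set.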

Experimental results
The chromatogram obtained from the HPLC analysis above showed a well-resolved peak with little or no interference for the standard as well as the real sample. other natural sources, particularly in plants, that contain large amounts of isoquercitrin. In the abovementioned studies, 11 different plant families together with other natural sources were examined, and interestingly, most of them contained the analyte in high quantities. The major sources were onions, tea, and tartary buckwheat bran [27]. The isoquercitrin content revealed by their research is similar to the findings of this study.

Artificial intelligence results
Data-driven approaches (ANN, ANFIS, SVM, and MLR) were analyzed in order to predict the qualitative and quantitative properties of isoquercitrin using the HPLC technique. Before model calibration, data were analyzed statistically, as shown in Table 1. Statistical analysis is generally used to understand the science of the data in order to navigate common problems that can lead to incorrect results and to facilitate decision-making based on raw data.
The Spearman-Pearson correlation shows the relationship between variables and their fitness using a linear function. The correlation strength is independent of the sign or direction. A positive coefficient shows that, as the first variable increases, the second variable also increases; a negative correlation shows an inverse relationship between the parameters, in other words, when the first parameter increases, the second parameter decreases, and vice versa [28]. As seen in Table 2, there is a strong correlation between the concentration of the standard and the peak area (PA) (R=0.998501), while a lesser correlation was observed between concentration and tR. Table 2 also shows a positive correlation between tR and MP-B, pH, and PA, while there is a negative correlation between tR and concentration and MP-A.
For development of the data-driven models, MATLAB 9.3 (R2017a) was used for the ANN, SVM, and ANFIS models, while the deterministic linear MLR model was developed using the simulation tool in EViews 9.5 software. According to Usman et al. [25], the appropriate number of nodes in the hidden layer ranges from $(2n^{1/2} + m)$ to $(2n + 1)$, where $n$ is the number of input neurons and $m$ is the number of output neurons. Hence  aspect in any ANN modeling, as it helps to prevent overfitting caused by different factors. As reported in several works in the field of science and engineering, there is no specific standard method for determining the appropriate number of hidden neurons [29]. Furthermore, appropriate and optimal determination of the parameters (C, ε, γ) in the SVM models is very important in choosing the best structure for the models. In this research, optimal values were obtained by employing the kernel function of the grid procedure, as suggested by Pham et al. [30]. For ANFIS modeling, various types of MFs and epoch iterations were explored using trial and error to identify the best structure. Table 3 shows the results of the performance analysis for the four data-driven models. It is clear that the nonlinear models (ANN, SVM, and ANFIS) outperformed the traditional linear regression model (MLR) in the predictive comparison among the four models. Table 3 further demonstrates that all four models are capable of predicting the qualitative (tR) and quantitative (PA) properties of the bioactive compound. It is important to note that the performance efficiency of all four models, in terms of DC, CC, MSE, and RMSE, shows a satisfactory and reliable accuracy. This may be due to the cross-validation process conducted before model calibration, which is a very significant component of model evaluation [31].
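The hidden-node range attributed to Usman et al. [25], from 2n^(1/2) + m to 2n + 1, can be evaluated directly. In the sketch below, rounding the lower bound up to a whole node and the single-output setting (m = 1, one model per target) are assumptions; n = 4 corresponds to the four inputs named in the study.

```python
import math


def hidden_node_range(n_inputs, n_outputs):
    """Return (lower, upper) hidden-node counts from 2*sqrt(n) + m to 2*n + 1.

    Assumption: a fractional lower bound is rounded up to a whole node.
    """
    lower = math.ceil(2 * math.sqrt(n_inputs) + n_outputs)
    upper = 2 * n_inputs + 1
    return lower, upper


# Four inputs (concentration, MP-A, MP-B, pH) and one output (assumed).
lower, upper = hidden_node_range(n_inputs=4, n_outputs=1)
```

Under these assumptions the candidate range is 5 to 9 hidden nodes, which bounds the trial-and-error search rather than fixing a single architecture.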
Table 3 also shows that M3, with four input parameters, provided the highest prediction accuracy among the four data-driven methods in terms of DC, CC, MSE, and RMSE, for the prediction of PA and tR. Among these models, ANFIS-M3 with DC (0.9998), RMSE (0.0002), MSE (0.0001), and CC (0.9999) values in the verification phase had the highest level of accuracy when compared to ANN-M3, SVM-M3, and MLR-M3 for the prediction of PA. However, for the prediction of tR in the verification phase, ANN-M3 outperformed the other three models (ANFIS-M3, SVM-M3, and MLR-M3) in DC (0.9987), RMSE (0.0063), MSE (0.0036), and CC (0.9993) values, and this result is in line with previous findings [31][32][33]. The predictive precision relating to DC showed that ANN outperformed the other three models (ANFIS, SVM, and MLR) and increased the prediction accuracy by up to 0.07%, 8%, and 7%, respectively. The predicted results were displayed graphically using a scatter plot in order to demonstrate the goodness-of-fit between the experimental and predicted values for the two best models (ANN and ANFIS) (Figure 6). It is apparent from the scatter plots that both ANN and ANFIS models demonstrated good fitting agreement between the experimental and predicted values. The higher prediction skill for PA can be attributed to the high correlation values between variables (Table 2). Figure 7 shows the response time series plot for PA and tR. According to the plot, the extent of spread values between the experimental and predicted values proved the Table 3

Conclusion
This work examined the qualitative and quantitative properties of isoquercitrin from the coriander plant using the HPLC method to simulate tR and PA, respectively. The analyte was determined from a leaf extract, and a considerable amount was found. The work further explored the application of data-driven methods including a classical linear model (MLR) and three AI-based models (ANFIS, ANN, and SVM), in order to simulate tR and PA of isoquercitrin from the coriander plant using the HPLC technique. The concentration of the standard, composition of the mobile phases, and pH were used as input variables. The results obtained indicated that all four data-driven approaches are capable of simulating the tR and PA of the analyte. Furthermore, comparison of the models demonstrated that M3 offered the highest level of predictive accuracy, as shown in Table 3. Further comparative analysis of the results showed that ANFIS-M3 outperformed the other models for the prediction of PA. However, ANN outperformed the other three models (ANFIS, SVM, and MLR) and increased accuracy by up to 0.07%, 8%, and 7%, respectively, for the prediction of the qualitative property of the analyte (tR).