Multiple Modeling Techniques for Assessing Sesame Oil Extraction under Various Operating Conditions and Solvents

This paper compares four modeling techniques: Response Surface Method (RSM), Linear Radial Basis Functions (LRBF), Quadratic Radial Basis Functions (QRBF), and Artificial Neural Network (ANN). The models were tested by monitoring their performance in predicting the optimum operating conditions for sesame seed oil extraction yields. Experimental data obtained with three solvents (hexane, chloroform, and acetone) at varying solvent-to-seed ratios, temperatures, rotational speeds, and mixing times were modeled by the four techniques. The quality of the model predictions was examined using the performance indicators $R^2$, $R^2_{adj}$, and RMSE. Results showed that the applied modeling techniques agreed well with the experimental data regardless of the efficiency of the solvents in oil extraction. However, the ANN model consistently gave more accurate predictions for all tested solvents under all operating conditions. This consistency is demonstrated by $R^2$ and $R^2_{adj}$ values equal to one and very low RMSE values ($2.23 \times 10^{-3}$ to $3.70 \times 10^{-7}$), leading to the conclusion that ANN possesses a universal ability to approximate nonlinear systems in comparison to the other models.


Introduction
Sesame seed oil has many applications in health and food that have been known for several thousand years. Because sesame seeds have a higher oil content than rival oilseeds, mechanical extraction of their oil has always been easier than for other seeds. Over the years the extraction process has undergone numerous developments, and the principle of simply "squeezing the oil out" has been largely superseded by solvent extraction.
Sesame seeds have a higher oil content (around 50%) than most known oilseeds. Sesame oil is known to be a high-priced, high-quality oil. It is also among the most stable edible oils despite its high content of unsaturated fats [1,2]. Sesame oil is rich in monounsaturated and polyunsaturated fatty acids [3]. The most abundant fatty acids in sesame oil are oleic, linoleic, palmitic, and stearic acids, which together comprise about 96% of the total fatty acids. Oil content and fatty acid composition vary significantly between oilseed crops and within the same crop collected from different geographical locations. It has been reported that the oil content of sesame seeds ranges from 44.6% to 53.1%, and that the contents of oleic acid, linoleic acid, linolenic acid, palmitic acid, and stearic acid vary between 36.12-43.63%, 39.13-46.38%, 0.28-0.4%, 8.19-10.26%, and 4.63-6.35%, respectively [4].
average particle sizes (2, 1.5, 1, 0.8, and 0.5 mm) after roasting at different temperatures (100, 120, 140, 160, 180, and 200 °C) as a pre-treatment process. Different ratios of sesame seed mass to solvent mass (1:1, 1:2, 1:3, 1:4, and 1:5) and contact times of 6, 12, and 24 hours with stirring speeds of 0, 150, 300, and 700 rpm were examined, and samples were heated at different temperatures (25, 30, 35, 40, 45, and 50 °C) during the extraction contact period; the data obtained at 40 °C were used in this work because they gave the maximum extraction yield [6]. The extracted oil was then separated by distillation. Oil yield was calculated as the ratio of extracted oil to seed weight. The experimental results used for modeling are presented in Appendix A.

Modeling Techniques
Four modeling techniques are used in this paper: Response Surface Method (RSM), Linear Radial Basis Function (LRBF), Quadratic Radial Basis Function (QRBF), and Artificial Neural Network (ANN). Each technique was applied to the available experimental data, and the resulting predictions of oil extraction yield were compared to evaluate the models' performance.

Response Surface Methodology (RSM)
Response surface methodology originated in earlier work [24], a collaboration initiated at a chemical company to solve the problem of determining optimal operating conditions for chemical processes. It is used in many practical applications in which the goal is to identify the levels of design factors or variables that optimize a response. Despite its simplicity, RSM provides efficient and accurate solutions, and it has therefore been applied successfully to many engineering problems [32-35].
RSM is a higher-order polynomial model; a second-order (quadratic) polynomial equation is developed after an ANOVA test to express the value of the variable Y (oil yield) as a function of the independent variables ($X_1$, $X_2$, and $X_3$), as follows [16]:
$$Y = \beta_0 + \sum_{i=1}^{3} \beta_i X_i + \sum_{i=1}^{3} \beta_{ii} X_i^2 + \sum_{i<j} \beta_{ij} X_i X_j$$
where $\beta_0$, $\beta_i$, $\beta_{ii}$, and $\beta_{ij}$ are the regression coefficients for the intercept, linear, quadratic, and interaction terms, and $X_1 = A$, $X_2 = B$, and $X_3 = C$ are the independent variables, as presented in Table 1. A least-squares method can be used to determine the parameters of the RSM model as follows:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$
where $\mathbf{X}$ is the design matrix whose columns hold the model terms and $\mathbf{y}$ is the vector of measured responses. All regression models were developed using the Design of Experiment (DOE) and statistical toolbox in MATLAB™.

Radial Basis Function (RBF)
A Radial Basis Function (RBF) is a real-valued function that depends only on the distance from the origin. Any function $\phi$ that satisfies the property $\phi(\mathbf{x}) = \phi(\lVert \mathbf{x} \rVert)$ is a radial function. Although the norm is usually the Euclidean distance, other distance functions are also possible [36]. RBF uses a series of basis functions that are symmetric and centered at each sampling point, and it was originally developed for scattered multivariate data interpolation [25]. RBF has found applications in medical imaging, ocean depth measurement, altitude measurement, rainfall interpolation, surveying, mapping, geography and geology, and image warping [37].
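To make the least-squares fit of the second-order response surface model concrete, the following is a minimal Python sketch rather than the MATLAB workflow used in the paper. The function names (`quadratic_design_matrix`, `fit_rsm`, `predict_rsm`) are hypothetical, and the data are synthetic values generated only for illustration, not the sesame extraction measurements.

```python
import numpy as np
from itertools import combinations


def quadratic_design_matrix(X):
    """Columns: intercept, X_i, X_i^2, and X_i * X_j for i < j."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]                 # linear terms
    cols += [X[:, i] ** 2 for i in range(k)]            # quadratic terms
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]  # interactions
    return np.column_stack(cols)


def fit_rsm(X, y):
    """Least-squares estimate of the regression coefficients."""
    A = quadratic_design_matrix(X)
    beta, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return beta


def predict_rsm(beta, X):
    """Evaluate the fitted second-order polynomial at new points."""
    return quadratic_design_matrix(X) @ beta


# Synthetic illustration only (assumed coefficients, not the paper's data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(30, 3))
true_beta = np.array([2.0, 0.5, -1.0, 0.3, -0.4, 0.2, 0.1, 0.6, -0.7, 0.05])
y_demo = quadratic_design_matrix(X_demo) @ true_beta
beta_hat = fit_rsm(X_demo, y_demo)
```

With three independent variables the model has ten coefficients (one intercept, three linear, three quadratic, and three interaction terms), so at least ten well-spread design points are needed for the fit to be determined.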
Let $f(\mathbf{x})$ be the true objective or response function and $f'(\mathbf{x})$ its approximation obtained from a classical RBF with the general form:
$$f'(\mathbf{x}) = \sum_{i=1}^{n} \lambda_i \, \phi(\lVert \mathbf{x} - \mathbf{x}_i \rVert)$$
where $n$ is the number of sampling points, $\mathbf{x}$ is the vector of design variables, $\mathbf{x}_i$ is the vector of design variables at the i-th sampling point, $\lVert \mathbf{x} - \mathbf{x}_i \rVert$ is the Euclidean distance, $\phi$ is a basis function, and $\lambda_i$ is the unknown weighting coefficient. An RBF is therefore a linear combination of $n$ basis functions with weighted coefficients. Some of the most commonly used basis functions include:
• Linear Radial Basis Function (LRBF): $\phi(r) = r$.
An RBF using highly nonlinear basis functions does not work well for linear responses [38]. To solve this problem, the RBF is augmented with a polynomial function:
$$f'(\mathbf{x}) = \sum_{i=1}^{n} \lambda_i \, \phi(\lVert \mathbf{x} - \mathbf{x}_i \rVert) + \sum_{j=1}^{m} c_j \, p_j(\mathbf{x})$$
where $m$ is the total number of terms in the polynomial, $p_j(\mathbf{x})$ is the j-th polynomial term, and $c_j$ (j = 1, 2, ..., m) is the corresponding coefficient. A detailed discussion of the polynomial functions that may be used can be found in a previous study [38]. An RBF passes through all the sampling points exactly, meaning that the values of the approximate function equal the true function values at the sampling points. It is therefore not possible to check RBF model fitness with ANOVA, which is the main drawback of RBF.
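The augmented form can be illustrated with a short Python sketch. It assumes a linear basis function $\phi(r) = r$ and a first-order polynomial (constant plus linear terms), and it closes the system with the usual side condition $\sum_i \lambda_i p_j(\mathbf{x}_i) = 0$, which the text above does not state explicitly. The names `fit_lrbf` and `predict_lrbf` are hypothetical, and the data are synthetic.

```python
import numpy as np


def _pairwise_dist(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)


def _poly(X):
    """First-order polynomial terms [1, x_1, ..., x_k] (an assumption)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_lrbf(X, f):
    """Solve for the weights lambda_i and polynomial coefficients c_j."""
    X = np.asarray(X, dtype=float)
    f = np.asarray(f, dtype=float)
    n = X.shape[0]
    Phi = _pairwise_dist(X, X)          # phi(r) = r
    P = _poly(X)
    m = P.shape[1]
    # Block system: interpolation conditions plus the side condition P^T lambda = 0.
    K = np.block([[Phi, P], [P.T, np.zeros((m, m))]])
    rhs = np.concatenate([f, np.zeros(m)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]


def predict_lrbf(X_train, lam, c, X_new):
    """Evaluate the augmented RBF at new points."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    return _pairwise_dist(X_new, X_train) @ lam + _poly(X_new) @ c


# Synthetic illustration only: a purely linear response.
rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 1.0, size=(20, 3))
f_linear = 1.0 + 2.0 * X_train[:, 0] - 0.5 * X_train[:, 2]
lam, c = fit_lrbf(X_train, f_linear)
X_new = rng.uniform(0.0, 1.0, size=(5, 3))
f_new = predict_lrbf(X_train, lam, c, X_new)
```

In this sketch the fitted model reproduces the training values at the sampling points, and for a purely linear response the polynomial part carries the whole fit while the weights $\lambda_i$ vanish, which is the motivation for the augmentation.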
Although RBFs have been claimed to create better models than RSM with a limited number of samples [31], the literature does not establish which RBF or RBFs are highly accurate in general for linear, quadratic, and high-order nonlinear responses. A study of the accuracy of RBF models is needed before RBF can be used to create high-fidelity global models, because the type of response is unknown in most situations.

Artificial Neural Network (ANN)
An ANN is made up of two parts: nodes and connections. The nodes are neurons, each of which applies a transfer function that takes the argument S and produces the scalar output of that neuron. The most widely used transfer functions for solving linear and nonlinear regression problems are purelin, logsig, and tansig [39].
For the case of logistic output, the logsig transfer function may be written as:
$$f(S) = \frac{1}{1 + e^{-S}}$$
The architecture of the neural network describes how the neurons' inputs and outputs are connected. The neurons are divided into several groups, called layers. A multi-layer neural network has hidden and output layers consisting of hidden and output neurons, respectively. Frequently, the inputs are considered as an additional layer. The most common neural network architecture used for solving nonlinear regression problems is the multi-layer feed-forward neural network, also known as the Multi-Layer Perceptron (MLP), as shown in Figure 1.
A technique called "Early Stopping" was used during model training to avoid overfitting and the resulting poor generalization. The data sets were divided into a 70% training set, a 20% testing set, and a 10% validation set, giving 42 training samples, 12 testing samples, and 6 validation samples. The MATLAB Neural Network Toolbox, version 6, was used to design and implement all the ANNs.
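The logsig transfer function, the 70/20/10 data division, and the early-stopping rule can be sketched in Python as follows. This is an illustrative sketch, not the MATLAB toolbox procedure used in the paper: the random shuffling, the `seed`, the `patience` value, and the function names are all assumptions introduced here.

```python
import numpy as np


def logsig(s):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-s))


def split_indices(n_samples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle sample indices and split them into train/test/validation sets.

    Random shuffling and the seed are assumptions; the paper only gives
    the proportions.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_test = int(round(fractions[1] * n_samples))
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    val = order[n_train + n_test:]
    return train, test, val


def early_stopping_epoch(val_losses, patience=6):
    """Return the epoch with the lowest validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs.
    The patience value is a hypothetical choice."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# 60 samples split as 42 / 12 / 6, matching the proportions in the text.
train_idx, test_idx, val_idx = split_indices(60)
```

The design intent of early stopping is that the network weights from the best validation epoch are kept, so training error may continue to fall afterwards without affecting the retained model.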

Model Validation and Evaluation
To evaluate the goodness of fit and prediction accuracy of the constructed models, $R^2$ and error analyses were performed between the experimental and predicted data for the LRBF, QRBF, RSM, and ANN models. Many validation approaches from the literature are used for error analysis, some of which are listed in a previous study [36].
In this paper, techniques that use the error as a performance index to measure model accuracy are introduced. There are a number of different measures of model accuracy. The first two are the root mean square error (RMSE) and the $R^2$ value, both computed from the observed values $y_i$, the predicted values $\hat{y}_i$, and the mean of the observed values $\bar{y}$. In general, the larger the values of $R^2$ and $R^2_{adj}$, and the smaller the value of RMSE, the better the fit. In situations where the number of design variables is large, it is more appropriate to look at $R^2_{adj}$, because $R^2$ always increases as the number of terms in the model is increased, while $R^2_{adj}$ actually decreases if unnecessary terms are added to the model [19].
The four techniques proposed in this study are used to examine experimental data for solvent extraction of sesame seeds using three solvents: chloroform, acetone, and hexane. The experiments were conducted under different operating conditions (temperature, mixing speed, and solvent/seed ratio); the experimental results are presented in previous work [6,16]. Different statistical analysis techniques, e.g., the ANOVA test, can be used to check the fitness of an RSM model and hence identify the main effects of the design variables. However, main effect analysis is not the focus of this study and will not be discussed here.
The major statistical parameters used for evaluating model fitness are $R^2$, adjusted $R^2$, and the Root Mean Square Error (RMSE). Note that these parameters are not totally independent of each other. They can be calculated as:
$$R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2), \qquad RMSE = \sqrt{\frac{SSE}{n-p-1}}$$
where $n$ is the number of design points, $p$ is the number of non-constant terms in the RSM model, SSE is the sum of square errors, and SST is the total sum of squares.
SSE and SST are calculated as:
$$SSE = \sum_{i=1}^{n} (f_i - \hat{f}_i)^2, \qquad SST = \sum_{i=1}^{n} (f_i - \bar{f})^2$$
where $f_i$ is the measured function value at the i-th design point, $\hat{f}_i$ is the function value calculated from the polynomial at the i-th design point, and $\bar{f}$ is the mean value of $f_i$.
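The fit statistics can be computed with a few lines of Python. This sketch assumes the common definitions $R^2 = 1 - SSE/SST$, $R^2_{adj} = 1 - (1 - R^2)(n-1)/(n-p-1)$, and $RMSE = \sqrt{SSE/(n-p-1)}$; some authors divide by $n$ in the RMSE instead, so the normalization is an assumption. The function name `fit_statistics` and the sample values are hypothetical.

```python
import numpy as np


def fit_statistics(f_measured, f_predicted, p):
    """Return SSE, SST, R^2, adjusted R^2, and RMSE.

    p is the number of non-constant terms in the model; the RMSE uses the
    n - p - 1 normalization (an assumption, see the lead-in).
    """
    f = np.asarray(f_measured, dtype=float)
    f_hat = np.asarray(f_predicted, dtype=float)
    n = f.size
    sse = float(np.sum((f - f_hat) ** 2))        # sum of square errors
    sst = float(np.sum((f - f.mean()) ** 2))     # total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    rmse = float(np.sqrt(sse / (n - p - 1)))
    return {"SSE": sse, "SST": sst, "R2": r2, "R2_adj": r2_adj, "RMSE": rmse}


# Made-up values for illustration only (not experimental data).
stats = fit_statistics([1.0, 2.0, 3.0, 4.0, 5.0],
                       [1.1, 1.9, 3.2, 3.8, 5.0], p=1)
```

Because $R^2_{adj}$ penalizes extra terms through the factor $(n-1)/(n-p-1)$, it never exceeds $R^2$ for a model with at least one non-constant term.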

$R^2$ and $R^2_{adj}$
In situations where the number of design variables is large, it is more appropriate to look at $R^2_{adj}$, because $R^2$ always increases as the number of terms in the model is increased, whereas $R^2_{adj}$ actually decreases if unnecessary terms are added to the model.

Results and Discussion
The model predictions were developed using MATLAB 2017a with the Model-Based Calibration Toolbox™ on a Windows 7 platform with an i5 processor and 8 GB of RAM. This toolbox uses Design of Experiments (DoE), statistical modeling, and optimization techniques to efficiently produce high-quality calibrations for the oil yield model. The performance evaluation functions described above are used as indicators of the accuracy of the developed models. Small values of $R^2_{adj}$ and $R^2$, as well as large values of RMSE, indicate poor fits for the RSM models. Using the same experimental data samples, RBF models with the linear function (LRBF) and multi-quadric functions (QRBF) and the ANN model were also developed.

Modeling Experimental Data Using RSM
A second-order (quadratic) polynomial equation was selected as optimal after testing the feasibility of other possible orders (orders 2 to 6; see Figure 2). To express the value of the variable Y (oil yield) as a function of the independent variables ($X_1$, $X_2$, and $X_3$), the following models were obtained.

Acetone
The second order equation for acetone is:

Chloroform
The second order equation for Chloroform is:

Modeling Experimental Data Using LRBF
The response surface model for LRBF for the three solvents is shown in Figure 3, which plots the predicted oil extraction yields against the experimental data. The results showed better agreement between predicted and experimental data than was obtained with the RSM model. Moreover, of the three solvents, hexane showed the best agreement between the LRBF predictions and the experimental oil yields. The LRBF model gave $R^2$ and $R^2_{adj}$ values close to 1 (Table 4), whereas hexane had the highest RMSE value, 0.116, among the solvents. Table 2 shows the optimum LRBF parameters.
Figure 4 shows the difference between the three solvent models. The plot of experimental versus predicted oil yield shows that hexane produced a more robust model than the other solvents. The QRBF parameter values can be seen in Table 2.

Modeling Experimental Data Using ANN
The optimum neural network configuration has two hidden layers: the first contains 10 neurons and the second 5 neurons. Different back-propagation (BP) algorithms were compared to select the best-suited one. The Levenberg-Marquardt learning algorithm was used with the Mean Squared Error (MSE) as the performance function. Table 3 shows the optimum ANN parameter values used for the three solvents. The ANN model and its parameters were selected based on the minimum MSE of the training dataset. The 3D response surface plot using ANN for all solvents, together with a graph of experimental versus predicted oil yield, is shown in Figure 5. The results show an identical match between experimental and predicted data; ANN thus outperformed all the aforementioned models in terms of low RMSE (acetone = $3.7 \times 10^{-5}$, chloroform = $3.3757 \times 10^{-5}$, and hexane = $2.23 \times 10^{-3}$) and $R^2$ and $R^2_{adj}$ equal to one.
The results summarized in Table 4 show that the $R^2$ and $R^2_{adj}$ values for the RSM quadratic model indicate good agreement for the hexane solvent, with both values near one, whereas the RMSE value for hexane was relatively large (0.225) compared to the other models. ANN, on the other hand, has a smaller RMSE and $R^2$ and $R^2_{adj}$ equal to one, indicating the most accurate modeling for all three solvents. The LRBF gave better $R^2$ and $R^2_{adj}$ values than the RSM quadratic model for all three solvents and showed better responses for chloroform. The QRBF gave $R^2$ and $R^2_{adj}$ values equal to one for all modeled solvents.
In summary, the results show the superiority of the ANN over the other modeling techniques in terms of minimum RMSE and $R^2$ and $R^2_{adj}$ values near one. This result agrees with those of many researchers who found that the ANN model gives the best predictions [17,18,20,36].

Conclusions and Future Work
The systematic comparative study presented in this paper has provided insight into the performance of various meta-modeling techniques. It has shown that a properly trained ANN model consistently gave more accurate predictions than the RSM, linear (LRBF), and multi-quadric (QRBF) models in all aspects. This accuracy is reflected in $R^2$ and $R^2_{adj}$ values equal to one and very low RMSE values (hexane $2.23 \times 10^{-3}$, chloroform $3.3757 \times 10^{-5}$, and acetone $3.7 \times 10^{-5}$) for the ANN compared with the other models. This confirms that the ANN model displays a significantly higher generalization capacity than the rest of the models, which can be attributed to the universal ability of ANNs to approximate the nonlinearity of the system. The plots of experimental against predicted data likewise show that the ANN is superior to RSM, LRBF, and QRBF. QRBF ranked second, proving more accurate and showing better prediction capability than LRBF and RSM. Artificial neural networks can therefore be applied to on-line state estimation and control of sesame oil extraction.
The statistical indices show that the models gave competitive results in predicting the experimental extraction data. It is recommended that these techniques be applied to further systems, such as green solvent systems. Moreover, experimental testing of different solvent mixtures, together with analysis of extracted oil quality through properties such as pH, acidity, and peroxide value, could be introduced as additional functions of the operating conditions to be modeled.