Hybrid Modeling of Machine Learning and Phenomenological Model for Predicting the Biomass Gasification Process in Supercritical Water for Hydrogen Production

Abstract: Process monitoring and forecasting are essential to ensure the efficiency of industrial processes. Although it is possible to model processes using phenomenological approaches, these are not always easy to apply and generalize because of the complexity of the processes and the high number of unknown parameters. This work presents a hybrid modeling architecture that combines a phenomenological model with machine learning models. The proposal is to use a simplified phenomenological model to explain the basic principles behind a phenomenon and then let a data-oriented model correct the deviations of the simplified model's predictions. The research hypothesis is that integrating prior chemical engineering knowledge simplifies data-based models, enhances their generalization, and improves their interpretability. The gasification of lignin biomass with supercritical water was used as a case study for this methodology, and the variable to be observed was the production of hydrogen. The real experimental data of this process were augmented using Gibbs energy minimization with the Peng-Robinson equation of state, thus generating a larger database that was considered as real process data. The ideal gas model was used as the simplified model, producing significant deviations in predictions (relative deviations greater than 20%). The deviations ($\Delta H_2 = H_2^{\mathrm{real}} - H_2^{\mathrm{predict}}$) were used as the target variable for the machine learning model. Linear regression models (LASSO and simple linear regression) were used to predict $\Delta H_2$, and this variable was added to the forecast of the simplified model. The result is the hybrid prediction of the hydrogen formation ($H_2^{\mathrm{predict}}$). Among the models tested, simple linear regression fitted the values of $\Delta H_2$ best ($R^2$ = 0.985, MAE smaller than 0.1).
Thus, the proposed hybrid architecture allowed for the prediction of the formation of hydrogen during the gasification process of lignin biomass, despite the thermodynamic limitations of the ideal gas model. Hybridization proved to be robust as a process monitoring tool, providing the abstraction of non-idealities of industrial processes through simple, data-oriented models, without losing predictive power. The objective of the work was fulfilled, presenting a new possibility for the monitoring of real industrial processes.


Introduction
Modern engineering seeks the optimized use of raw materials and resources, and a common way to achieve such goals is through rigorous process monitoring. In chemical processes in general, control and monitoring systems play a significant role in ensuring proper operation, avoiding waste, and maximizing process efficiency [1].
For processes with chemical reactions, control is often hampered by their nonlinear nature. This may be related to the complexity of the reaction system, in which numerous intermediate products can be formed throughout the reaction process. Biomass gasification processes in supercritical water are examples of such reaction systems. Note the formation of intermediate components. The reaction mechanism presented (Equations (1)-(12)) carries a certain inaccuracy with respect to the process steps, since the possible by-products generated during the reactions are not fully known, which may make the construction of monitoring tools very challenging.
A common approach in the modeling literature in such cases of incomplete knowledge is to build empirical data-driven models. Industries are an abundant source of data, and these data must be used to leverage a company's capacity for self-improvement [13].
Because of the growing complexity of industrial processes, the need for more sophisticated modeling techniques has increased proportionally. Machine learning and artificial intelligence techniques are among the top approaches of interest because of their predictive power and wide area of applicability. Ge et al. [13] and Venkatasubramanian [14] present complete reviews of how these techniques have been applied to help solve chemical engineering problems.
Machine learning techniques have been widely applied by chemical process researchers to monitor process parameters [15][16][17]. Marciej et al. [18] used the Extreme Gradient Boosting (XGBoost) model for data regression in order to predict the carbon straightening capacity in mixtures. Yang et al. [19] constructed a multi-feature fusion convolutional neural network and Light Gradient Boosting Machine (LightGBM) to monitor the safety of oil and gas pipelines. Zhang et al. [20] used Multilayer Perceptron and Random Forest to model the spontaneous combustion tendencies of coal with respect to crossing point temperature. Azarpour et al. [21] proposed a hybrid model combining a first-principle model and an artificial neural network, with the aim of predicting the kinetic constant of deactivation of catalysts in a fixed bed. Lei Y. et al. [22] presented a hybrid model using four machine learning models (Artificial Neural Networks, Random Forest, XGBoost, and LightGBM) for the prediction of hydrogen and methane in raw coke oven gas, reporting coefficients of determination equal to 0.99952 and 0.99964 for the prediction of hydrogen and methane concentrations, respectively, for the best model (LightGBM). Shahbaz et al. [23] constructed an ANN for the prediction of the palm kernel bark steam gasification process using CaO as adsorbent and coal ash as a catalyst. The authors used the backpropagation algorithm to train seven neurons in the hidden layer. The gas composition predicted by the ANN was compared with real data from the pilot scale process, showing high agreement with $R^2$ = 0.998 for almost all cases.
Despite the several applications of data models for the prediction of chemical processes already reported in the literature, a disadvantage of data-oriented models is the difficulty of generalizing correlations outside the original range of the training data. This is a particular issue in process monitoring because processes naturally evolve over time due to changing operating conditions. The current work proposes a modeling architecture that takes advantage of both approaches: phenomenological and data-driven. Through their union, a hybrid model is built. The work demonstrates an application of this methodology to the modeling of the gasification reaction system with supercritical water. For the prediction of the biomass gasification process, the thermodynamic approach of minimization of Gibbs energy (minG) will be used. Any system reaches its thermodynamic equilibrium if the total Gibbs free energy has the smallest possible value, so this objective function is widely applied to verify processes in the equilibrium condition [24].
The Gibbs energy minimization approach has greater advantages because it is a direct minimization method that predicts the formation of the system phases and describes the equilibrium compositions adequately, as shown in the works of Rocha and Guirardello [25], Voll et al. [26], and Hantoko et al. [27]. This method has the advantage of considering, in addition to the conservation of masses and equality of fugacity, the minimum Gibbs energy of the system, making it unnecessary to worry about predicting the possible phases that the system may form [28].
For reactive systems with multiple components conditioned at constant pressures and temperatures, the thermodynamic equilibrium condition can be formulated as a Gibbs energy minimization problem, with the Gibbs energy described by Equation (13).
The direct minimization of Equation (13), considering the restrictions of mass balance and stoichiometry, results in a combined chemical and phase equilibrium point. For the system to reach an adequate solution, it is necessary to add two constraints. The first constraint is the non-negativity of the number of moles, Equation (14), of each of the components in each of the phases [28].
The second constraint is the atom balance, represented by Equation (15). It arises from the non-stoichiometric formulation, which does not consider the individual reactions that may occur but instead seeks the best arrangement of atoms among the components throughout the optimization process.
When the conservation of matter equation is satisfied, the Gibbs free energy expression obtains its minimum value when a multicomponent system reaches chemical equilibrium [29].
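For reference, the constrained problem described by Equations (13)-(15) can be written compactly as shown below. This is the usual non-stoichiometric formulation, given here as a sketch rather than a transcription of the original equations. The notation is assumed: $n_i^k$ is the number of moles of component $i$ in phase $k$, $\mu_i^k$ its chemical potential, $a_{mi}$ the number of atoms of element $m$ in component $i$, $n_i^0$ the number of moles of component $i$ in the feed, and $NC$, $NF$, and $NE$ the numbers of components, phases, and elements.

```latex
\begin{aligned}
\min_{n_i^k}\quad & G = \sum_{i=1}^{NC}\sum_{k=1}^{NF} n_i^k\,\mu_i^k \\
\text{s.t.}\quad & n_i^k \ge 0, \\
& \sum_{i=1}^{NC} a_{mi}\Big(\sum_{k=1}^{NF} n_i^k\Big) = \sum_{i=1}^{NC} a_{mi}\,n_i^0,
\qquad m = 1,\dots,NE .
\end{aligned}
```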
Bearing in mind that gasification processes in supercritical media occur under high pressure and temperature conditions, it is estimated that components in the liquid phase will not be formed; even so, both phases will be considered in the modeling process. Equation (13) can be rewritten in terms of chemical potentials and molar amounts of solid, liquid, and vapor phase components, as described in Equation (16).
The standard chemical potential can be calculated from Equations (17) and (18). These results are necessary for estimating the Gibbs energy, as shown in Equation (16).
Eng 2023, 4 To facilitate the thermodynamic modeling of the process, the solid phase will be considered ideal (Equation (19)), so it will not be necessary to estimate non-idealities. This consideration seems to be reasonable, considering that throughout the gasification process with supercritical water, high levels of water are inserted in the reaction system, hindering the formation of components in the solid phase [3,4,28].
Contrary to the hypothesis adopted regarding the ideality of the solid phase, the vapor phase cannot be considered ideal since the conditions of the process in question make this consideration impossible. Equation (20) describes the chemical potential of the components in the vapor phase written as a function of the standard chemical potential, temperature, molar composition in the vapor phase, pressure, and coefficient of fugacity of the components considered.
Equation (21) presents the chemical potential of the components in the liquid phase. This is written as a function of the standard chemical potential, temperature, molar composition in the liquid phase, pressure, and fugacity coefficient of the considered components.
The chemical potential of the liquid phase components is calculated as a function of the saturation pressure, and the Antoine equation (Equation (22)) will be used to calculate this property.
The Peng-Robinson cubic equation of state (EoS) will be applied to estimate the nonidealities of the liquid and vapor phases [30]. The next section presents in more detail the estimation of fugacity coefficients using the Peng-Robinson EoS.
The molar partial enthalpy of each liquid or gaseous component i is calculated as a function of its heat capacity, which is a function of temperature, as shown in Equation (23).
For solids, the heat capacity is calculated according to Equation (24).
The parameters for calculating the saturation pressures and formation properties of the considered components are presented in Table 1. The parameters for calculating the heat capacities of the solid and vapor phase components are presented in Tables 2 and 3, respectively. The reference state of a species in the gas phase is given by the pure substance at 1 bar and system temperature. Liquids and solids use the liquid itself or pure solid at 1 bar [31].

Estimation of Fugacity Coefficients Using the Cubic Peng-Robinson Equation
For the prediction of the biomass gasification process from the phenomenological point of view, the thermodynamic approach to minimization of Gibbs energy (minG) will be used. Any system reaches its thermodynamic equilibrium if the total Gibbs free energy has the smallest possible value, so this objective function is widely applied to verify processes in the equilibrium condition [24]. This methodology has great advantages as it is a direct minimization method that predicts the formation of the system phases and satisfactorily describes the equilibrium compositions in reaction systems.
The equations of state can be presented as cubic equations, in the form of the compressibility factor Z, generally described by Equation (25).
where A and B are dimensionless parameters that depend on temperature, pressure, and phase composition, as shown in Equations (26) and (27). The parameters u and w are 2 and −1, respectively, as tabulated for the Peng-Robinson equation of state.
where $a_m$ and $b_m$ are mixture properties determined from Equations (28) and (29), respectively.
The $k_{ij}$ is a binary interaction parameter, and $a_i$ and $a_j$ are parameters that depend on a predetermined constant for each equation of state, the critical properties ($P_c$ and $T_c$), the gas constant ($R$), and the acentric factor ($\omega_i$) of each component $i$ and $j$. In this way, $a_i$ and $a_j$ are represented by Equation (30).
The parameter $\alpha_i$ is given by Equation (31).
The $b_i$ parameter also depends on the critical properties, gas constant, and acentric factor of each component $i$, as shown in Equation (32).
With these data, it is possible to calculate the roots of the cubic equation. A single real root of the compressibility factor (Z) indicates that the mixture exists in a single phase, liquid or vapor. If there are three real roots, the largest represents the vapor phase and the smallest the liquid phase. The intermediate root has no physical meaning, as it violates the mechanical stability criterion [34]. Knowing the roots of Equation (25) for both phases, Equation (33) is used to estimate the fugacity coefficients for the vapor and liquid phases.
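The root-finding and fugacity steps described above can be sketched in Python for the simpler case of a pure component. This is a minimal illustration under stated assumptions, not the mixture formulation solved in GAMS in this work: the function names are hypothetical, the expressions are the standard pure-component Peng-Robinson relations (u = 2, w = −1), and the critical constants used for water in the usage lines are common literature values.

```python
import math
import numpy as np

R = 8.314462618  # universal gas constant, J/(mol K)

def pr_pure_AB(T, P, Tc, Pc, omega):
    """Dimensionless Peng-Robinson A and B for a pure component.
    T and Tc in K; P and Pc in Pa."""
    kappa = 0.37464 + 1.54226 * omega - 0.26992 * omega**2
    alpha = (1.0 + kappa * (1.0 - math.sqrt(T / Tc)))**2
    a = 0.45724 * R**2 * Tc**2 / Pc * alpha
    b = 0.07780 * R * Tc / Pc
    return a * P / (R * T)**2, b * P / (R * T)

def pr_real_roots(A, B):
    """Real roots Z > B of the Peng-Robinson cubic, in ascending order."""
    coeffs = [1.0, -(1.0 - B), A - 3.0 * B**2 - 2.0 * B, -(A * B - B**2 - B**3)]
    roots = np.roots(coeffs)
    return sorted(float(r.real) for r in roots if abs(r.imag) < 1e-8 and r.real > B)

def pr_ln_phi_pure(Z, A, B):
    """Natural log of the pure-component fugacity coefficient at root Z."""
    s2 = math.sqrt(2.0)
    return (Z - 1.0 - math.log(Z - B)
            - A / (2.0 * s2 * B)
            * math.log((Z + (1.0 + s2) * B) / (Z + (1.0 - s2) * B)))

# Usage: pure water at 873.15 K and 300 bar (illustrative conditions).
A, B = pr_pure_AB(T=873.15, P=300e5, Tc=647.1, Pc=220.64e5, omega=0.344)
roots = pr_real_roots(A, B)           # a single root: one supercritical phase
phi = math.exp(pr_ln_phi_pure(roots[-1], A, B))
print(f"Z = {roots[-1]:.4f}, phi = {phi:.4f}")
```

At these conditions the cubic has a single real root, consistent with a single supercritical phase. For mixtures, A and B would be built from the mixing rules of Equations (28) and (29), and the fugacity expression gains composition-dependent terms.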

Mathematical Formulation and Solution of the Equilibrium Problem
Equation (25) is known as the cubic equation of state. This equation provides an approximation of the actual behavior of the liquid and vapor regions for a series of fluids [31]. Solving this equation produces one or three real roots, which can later be used to calculate the fugacity coefficients, in the approach known as phi-phi, which will be used in this work. Kamath, Biegler, and Grossmann [34] determined that the first derivative of the cubic equation of state with respect to Z must be positive to avoid selecting the intermediate root. Furthermore, the second derivative ensures that the liquid and vapor phase roots are correctly identified: at the largest root, which determines the vapor phase, the second derivative must be greater than or equal to zero, and at the smallest root, which determines the liquid phase, it must be less than or equal to zero. Equations (34)-(37) represent these constraints for the Peng-Robinson equation.
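The derivative criteria above can be checked numerically on the roots of the cubic. The sketch below is only an illustration of the sign tests, not the constrained formulation of Kamath, Biegler, and Grossmann [34]; the function name and the values A = 0.2 and B = 0.02 are assumptions chosen so that the cubic has three real roots.

```python
import numpy as np

def classify_pr_roots(A, B, tol=1e-8):
    """Label each real root of the Peng-Robinson cubic f(Z) using f'(Z) and f''(Z)."""
    c2 = -(1.0 - B)
    c1 = A - 3.0 * B**2 - 2.0 * B
    c0 = -(A * B - B**2 - B**3)
    labels = {}
    for r in np.roots([1.0, c2, c1, c0]):
        if abs(r.imag) > tol:
            continue  # complex roots are not physical
        Z = float(r.real)
        d1 = 3.0 * Z**2 + 2.0 * c2 * Z + c1   # first derivative f'(Z)
        d2 = 6.0 * Z + 2.0 * c2               # second derivative f''(Z)
        if d1 < 0.0:
            labels[Z] = "intermediate (rejected)"
        elif d2 >= 0.0:
            labels[Z] = "vapor-like"
        else:
            labels[Z] = "liquid-like"
    return labels

# Usage with illustrative A and B that give three real roots.
for Z, label in sorted(classify_pr_roots(0.2, 0.02).items()):
    print(f"Z = {Z:.4f}: {label}")
```

The intermediate root is the only one with a negative first derivative, so the sign of the second derivative is enough to tell the remaining two roots apart.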
To avoid selecting a root without physical significance when one of the phases disappears and the system has only one phase (gaseous or liquid), Kamath, Biegler, and Grossmann [34] added slack variables ($\sigma_v$ and $\sigma_l$), which allow the program to calculate derivatives when they are equal to zero, as in Equations (33) and (34). This yields Equations (38) and (39), with modifications for the gaseous and liquid phases, respectively. M is a large positive value. In this work, M was set to 10, as in the work of Dowling et al. [35].
Initially, 12 components will be considered (H₂, H₂O, CH₄, CO₂, CO, O₂, N₂, CH₄O, C₂H₆, C₃H₈, NH₃, C₂H₄) as representative of the main compounds that may form during the biomass gasification process in supercritical water. The selection of these components was based on results reported in the literature, which indicate that these are the components formed in considerable amounts during the gasification of biomass from different sources [3][4][5]28,[36][37][38][39][40][41].
The formulated NLP problems will be solved with the aid of the GAMS software and the CONOPT 4 solver, which has some advantages for the type of approach used here. It is suitable for models with highly nonlinear constraints, is designed for large models, and can be applied to models that do not have differentiable functions [42]. This approach has demonstrated great accuracy and efficiency and has been used successfully by our research group over the last few years for a wide range of systems under conditions of combined chemical and phase equilibrium [3,4,6,25,26]. Figure 1 presents the proposed algorithm for obtaining the equilibrium compositions throughout the reaction using the Gibbs energy minimization methodology associated with the Peng-Robinson cubic equation of state.
Figure 2 describes the proposed hybrid modeling architecture, which associates simulated data, or data obtained through rigorous modeling, with data obtained from a simplified phenomenological model (ideal cases or simplifying hypotheses). The hybrid architecture from Figure 2 is based on the concept of boosting, which consists of a set of sequentially organized weak estimators, i.e., models that perform only a little better than random predictions. Each new estimator is trained to correct the errors made by the previous estimator [43]. The main gain of the proposed approach is the reduction of the overall prediction bias.

Hybrid Architecture Proposed for the Hybrid Modeling of the Problem
The first step of the proposed architecture consists in predicting the production of hydrogen in the equilibrium condition while considering the system as ideal, i.e., using the Clapeyron equation (Equation (40)). Note that ideal behavior is not consistent with the conditions studied, since the critical pressure of water exceeds 220 bar [6,8,36].
The simplified model uses basic inputs to calculate the variable of interest; in this case, the production of hydrogen in the equilibrium condition. Using real process data or data simulated by a more rigorous phenomenological equation, the error of the predictions will be calculated using Equation (41). The second part of the proposed architecture corresponds to the use of a data model that will receive several input values, which may be the same as those used in the ideal first-principle model, and use them to predict the errors calculated previously.
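The two-stage architecture described above can be sketched as follows. Everything in this example is synthetic and assumed for illustration: `ideal_model` is a placeholder function, not the ideal-gas equilibrium calculation, and the "real" values are generated artificially rather than by Gibbs energy minimization. Only the structure follows the text: the deviation between real and simplified predictions is computed, a linear model is fitted to that deviation, and the hybrid prediction is the simplified prediction plus the predicted deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, scaled inputs standing in for temperature and pressure.
n = 200
T = rng.uniform(773.0, 1073.0, n)   # K, illustrative range
P = rng.uniform(220.0, 500.0, n)    # bar, illustrative range
X = np.column_stack([(T - T.min()) / (T.max() - T.min()),
                     (P - P.min()) / (P.max() - P.min())])

def ideal_model(X):
    """Placeholder simplified model (hypothetical, not the ideal-gas calculation)."""
    return 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

# Placeholder "real" data: simplified model plus a systematic deviation and noise.
h2_ideal = ideal_model(X)
h2_real = h2_ideal + 0.3 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0.0, 0.01, n)

# Stage 1: deviation of the simplified model, in the sense of Equation (41).
delta = h2_real - h2_ideal

# Stage 2: ordinary least squares fit of the deviation on the inputs.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, delta, rcond=None)

# Hybrid prediction: simplified prediction plus predicted deviation.
h2_hybrid = h2_ideal + design @ coef

mae_ideal = np.mean(np.abs(h2_real - h2_ideal))
mae_hybrid = np.mean(np.abs(h2_real - h2_hybrid))
print(f"MAE ideal: {mae_ideal:.3f}, MAE hybrid: {mae_hybrid:.3f}")
```

The errors here are computed in-sample for brevity; a realistic evaluation would hold out data or use cross-validation.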
A set of experimental data reported by Basu [39] will be used for the gasification process of lignin biomass in supercritical water at 30 MPa. Experimental data will be used to validate the methodology described in Section 2.1, using the Gibbs energy minimization methodology associated with the cubic Peng-Robinson equation to calculate non-idealities. Figure 3 presents a comparison of the experimental data reported by Basu [39] with results calculated using the methodology described in Section 2.1.
As seen in Figure 3, the thermodynamic modeling applying the minimization of the Gibbs energy associated with the cubic Peng-Robinson equation presents an excellent fit to the experimental data, with a mean relative deviation of less than 1.0%. The results obtained with the ideal model also follow the trend of the molar fraction of hydrogen as a function of temperature; however, the fit is not as precise, with an average relative deviation of 22.032%. Hence, from this point on, the results obtained by minimizing the Gibbs energy with the Peng-Robinson equation will be considered as real data.
Considering that the Gibbs energy minimization methodology combined with the Peng-Robinson cubic equation fitted the experimental data of Basu [39] well, additional data were generated using different conditions of pressure, temperature, and biomass composition in the feed. Figure 4 represents the described data set expansion procedure. With the deviations of the ideal predictions available, the following steps are all aimed at applying the machine learning model for KPI prediction.
The database that will be applied to the machine learning model contains the variables shown in Figure 5.
The methodology used to expand the data set, as shown in Figure 4, is widely applied to simulations of complex reaction systems. Works reported by Mitoura et al. [6], Gomes et al. [8], and Freitas [28] applied the Gibbs energy minimization methodology to simulate gasification processes of different biomass sources and methane thermal cracking, presenting excellent results.

Data Modeling
Considering that one of the objectives of this work is to show the advantages of less complex data approaches, two modeling algorithms were chosen as the first options to model the errors of the ideal model concerning the real data. Two linear regression approaches were selected because they have good generalizability and are easy to interpret [44]. Linear regression models will be used through the LinearRegression class and LASSO regression from the Lasso class, both from the scikit-learn library. Equation (42) presents the generalized form of a linear model.
where y is the objective variable to be modeled, B i are the angular coefficients referring to attribute i, B 0 is the intercept, and x is a predictor variable.
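For reference, the linear model described above, together with the objective minimized by LASSO, can be written as shown below. The first expression follows the verbal description of Equation (42). The second is the LASSO objective in the form documented for scikit-learn's Lasso class, where $\alpha$ is the regularization strength, $n$ the number of samples, and $p$ the number of attributes; the indexing is notation assumed here.

```latex
y = B_0 + \sum_{i=1}^{p} B_i\,x_i ,
\qquad
\min_{B_0,\,B}\;\frac{1}{2n}\sum_{j=1}^{n}\Big(y_j - B_0 - \sum_{i=1}^{p} B_i\,x_{ji}\Big)^2
+ \alpha\sum_{i=1}^{p}\lvert B_i\rvert .
```

Setting $\alpha = 0$ recovers ordinary least squares, which is what the LinearRegression class fits.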

Attribute Selection, Data Standardization, Model Selection, and Validation
An important procedure in machine learning modeling consists in selecting the attributes that contribute the most to the prediction of a target variable. The main reasons include the existence of multicollinearity effects, which cause redundant information to be inputted in the model. In this work, a simplified approach of feature selection was employed, using only linear correlation as the measure of importance of each feature.
Through the SelectKBest class of Python's scikit-learn library, a linear regression is fitted for each attribute/target pair, and the F statistic is calculated by measuring the goodness of the linear fit. The model selects the attributes that have the highest F statistics [45].
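The univariate scoring described above can be illustrated with plain NumPy. This is a minimal sketch assuming the F statistic of a one-variable linear fit with an intercept, computed from the Pearson correlation $r$ as $F = r^2/(1 - r^2)\,(n - 2)$; the function names and the synthetic data are hypothetical, and the sketch is not a replacement for the SelectKBest class.

```python
import numpy as np

def univariate_f_scores(X, y):
    """F statistic of a one-variable linear fit (with intercept) for each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between each attribute and the target.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return r**2 / (1.0 - r**2) * (n - 2)

def select_k_best(X, y, k):
    """Indices (ascending) of the k attributes with the highest F statistics."""
    scores = univariate_f_scores(X, y)
    return np.sort(np.argsort(scores)[::-1][:k])

# Usage on synthetic data in which only the second attribute drives the target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)
print(select_k_best(X_demo, y_demo, k=1))
```

Because each attribute is scored on its own, this criterion does not account for interactions or redundancy between attributes, which is why the correlation matrix is inspected separately.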
Considering that the attributes have very different scales, another crucial step is to scale the data, which helps to avoid model biases towards features with the widest ranges of variation.
The MinMaxScaler class from the scikit-learn library will be used, which normalizes all features to a single scale (0-1) while preserving the shape of their distributions. Equation (43) presents the scaling of the data based on their maximum and minimum values.
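The scaling described above maps each attribute to $(x - x_{\min})/(x_{\max} - x_{\min})$. A minimal NumPy sketch is given below; the function name and the sample values are illustrative assumptions, and the sketch assumes that no column is constant.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to the range [0, 1] using its own minimum and maximum."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Usage: columns with very different ranges (illustrative temperature and pressure values).
X_raw = np.array([[773.0, 220.0],
                  [873.0, 300.0],
                  [1073.0, 500.0]])
X_scaled = min_max_scale(X_raw)
print(X_scaled)
```

In a modeling workflow, the minimum and maximum would be computed on the training data only and then reused for new data.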
For the selection of hyperparameters, the RandomizedSearchCV class from the Python scikit-learn package was used, together with the cross-validation strategy using the KFold class from the scikit-learn package. The algorithm was defined to generate 1000 combinations of hyperparameter values. The model selection metric was the mean absolute error (MAE) (Equation (44)), and the coefficient of determination $R^2$ (Equation (45)) was also used as a model selection criterion. Figure 6 presents the machine learning model pipeline with the descriptions presented.
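One possible arrangement of the classes named above is sketched below. The data are synthetic, and the search space, the number of folds, and the reduced number of iterations (20 instead of 1000) are assumptions made to keep the example short; they are not the settings used in this work.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: four attributes, of which two drive the target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 0.01 * rng.normal(size=120)

# Scaling, univariate attribute selection, and the regression model in one pipeline,
# so that every step is refitted inside each cross-validation fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Lasso(max_iter=10_000)),
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "select__k": [1, 2, 3, 4],                  # assumed search space
        "model__alpha": loguniform(1e-5, 1e-1),     # assumed search space
    },
    n_iter=20,                                      # reduced for illustration
    scoring="neg_mean_absolute_error",              # MAE as the selection metric
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_mae = -search.best_score_
print(search.best_params_, f"cross-validated MAE = {best_mae:.4f}")
```

Placing the scaler and the selector inside the pipeline avoids leaking information from the validation folds into the preprocessing steps.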
The following sections present the results of applying the data-based model for predicting the error between real data and those calculated by the ideal model (Equation (41)). With the estimated error, the corrected hydrogen production will be calculated based on the values predicted by the ideal model, following Equation (46).

Presentation of the Database
As mentioned previously, the experimental data from Basu [39] were used to validate the proposed methodology, and after validation, the data set was augmented. Figure 7 shows, as an example, the formation of hydrogen as a function of temperature, fixing 1 mole of biomass with 5 moles of water in the feed for pressures of 300 and 500 bar.
Analyzing Figure 7, the ideal model follows the trend of the real process, although with perceptible deviations. The mean absolute error values are equal to 0.281 and 0.322 for pressures of 300 and 500 bar, respectively. These errors are considerable, and they are relevant because the objective of this work is to reduce the bias of a simple first-principle model with the aid of a machine learning model.
To verify the linear correlations between the variables, Figure 8 presents the correlation matrix of the data set. This was an important step because of the types of machine learning models employed (Linear Regression and LASSO).  The temperature has a high positive correlation with the target variable (represented here as "Hydrogen_real"), indicating that the increase in temperature favors the formation of hydrogen throughout the process. This result is expected since it agrees with the kinetic model of Whitag et al. [46], where it is described that in gasification systems in supercritical water, the temperature increase favors the water-gas displacement reactions that form large amounts of hydrogen.
In addition to the effect of temperature, note that the pressure and the biomass feed disfavor the formation of hydrogen. This result is predicted by the model of Whitag et al. [46], which describes that an increase in pressure disfavors the formation of the products of interest throughout the process. The explanation is that, according to Le Chatelier's principle, an increase in pressure disfavors the water-gas shift reactions and favors the methanation reaction; thus, hydrogen is largely consumed, forming methane and carbon dioxide.
The models presented by Whitag et al. [46] and Yan et al. [40] describe that an increase in the biomass fraction of the feed reduces the formation of hydrogen, while the amount of methane increases. This behavior is explained by the fact that a higher biomass concentration disfavors the water-gas shift reactions, which produce the larger amounts of hydrogen, and in turn favors the methanation reaction, forming methane. The low amount of carbon monoxide formed helps to confirm this hypothesis.
Since an increase in the biomass fraction reduces the formation of hydrogen, an increase in the amount of water in the feed is expected to favor it, a result that is verified in Figure 8. Adding water to the reaction system favors the water-gas shift reactions, increasing the formation of hydrogen, as previously mentioned.
All the above conclusions follow what was predicted by the models presented by Guan et al. [12], Yan et al. [40], Castello and Fiori [47], Goodwin and Rorrer [48], and Tang and Kitagawa [38] for the behavior of biomass gasification processes in supercritical water. In addition to the listed models, recent work reported by Chen et al. [49] and Gomes et al. [8] studying gasification processes of biomass sources using supercritical water as a reaction medium presented results in agreement with those presented in this text.
With the hydrogen production of the real process and of the ideal model available as functions of the other variables (temperature, pressure, and biomass/water composition in the feed), the deviations of the ideal model defined in Equation (41) were calculated and their correlation matrix was built, as Figure 9 shows. The quantity of hydrogen produced according to the ideal model (Hydrogen_ideal) has a high correlation with the temperature; that is, the temperature and the molar quantity of ideal hydrogen are collinear. Multicollinearity is a problem in model fitting because it can affect the estimation of the parameters [50]. Given this multicollinearity, the Hydrogen_ideal variable was removed from the data set.
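The screening step described above can be sketched as follows. The data are synthetic, the helper `collinear_pairs` is a hypothetical name, and the 0.9 threshold is an assumption; the text does not state which cut-off, if any, was used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

temperature = rng.uniform(500.0, 900.0, n)
pressure = rng.uniform(300.0, 500.0, n)

# Toy stand-ins: the "ideal" yield is driven almost entirely by temperature.
hydrogen_ideal = 0.004 * temperature + rng.normal(0.0, 0.02, n)
hydrogen_real = hydrogen_ideal - 0.001 * pressure + rng.normal(0.0, 0.02, n)

# Target for the data-driven model: deviation of the ideal model.
delta_h2 = hydrogen_real - hydrogen_ideal

features = {
    "Temperature": temperature,
    "Pressure": pressure,
    "Hydrogen_ideal": hydrogen_ideal,
}


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for predictor pairs with |Pearson r| >= threshold."""
    names = list(columns)
    corr = np.corrcoef(np.vstack([columns[k] for k in names]))
    return [(names[i], names[j], float(corr[i, j]))
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if abs(corr[i, j]) >= threshold]


flagged = collinear_pairs(features)

# Drop the second member of each flagged pair (here: Hydrogen_ideal).
to_drop = {name_b for _, name_b, _ in flagged}
kept = {k: v for k, v in features.items() if k not in to_drop}
print("flagged:", flagged)
print("kept predictors:", list(kept))
```

Removing one member of a collinear pair keeps the remaining regression coefficients easier to estimate and to interpret.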

Process Monitoring with the Hybrid Model
Simple linear regression was first applied directly, taking as its target the actual production of hydrogen throughout the process (Hydrogen_real). The regression used temperature, pressure, and the composition of the biomass/water feed stream as predictor variables. Figure 10 presents the result. It indicates that simple linear regression alone does not fit this problem adequately, since the phenomenon is non-linear.
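The limitation of a direct linear fit can be illustrated with a small sketch. The response below is a toy saturating function, not the gasification equilibrium, and the helper names are hypothetical; ordinary least squares is solved with `numpy.linalg.lstsq`.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


def predict_linear(intercept, coef, X):
    """Evaluate the fitted linear model on the rows of X."""
    return intercept + X @ coef


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(2)
n = 400
# Predictors scaled to [0, 1]: temperature, pressure, biomass, water (toy values).
X = rng.uniform(0.0, 1.0, size=(n, 4))

# Toy saturating (non-linear) response in the first predictor.
y = (3.0 * (1.0 - np.exp(-3.0 * X[:, 0]))
     - 0.5 * X[:, 1] - 0.4 * X[:, 2] + 0.3 * X[:, 3])

b0, b = fit_linear(X, y)
r2_direct = r2_score(y, predict_linear(b0, b, X))
print(f"direct linear fit R^2 = {r2_direct:.3f}")
```

Even without noise, the straight-line fit cannot follow the curvature of the response, so the coefficient of determination stays visibly below one.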
The next step is to apply the hybrid modeling methodology, summing the deviation prediction and the value predicted by the ideal model (Hydrogen_ideal). Figure 11 presents the results obtained after the hybridization process.
Figure 11. Hybrid modeling of biomass gasification process with supercritical water to predict hydrogen production.
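The hybridization step can be sketched in three lines of arithmetic once a deviation model is fitted. In the sketch below both the "ideal" and the "real" series are synthetic stand-ins, under the assumption that the deviation is roughly linear in the inputs; no Gibbs energy minimization is performed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))  # scaled T, P, biomass, water (toy values)

# Toy "ideal" model output: captures the non-linear trend in temperature.
h2_ideal = 3.0 * (1.0 - np.exp(-3.0 * X[:, 0]))
# Toy "real" process: ideal trend plus a deviation that is roughly linear in the inputs.
h2_real = (h2_ideal + 0.2 - 0.5 * X[:, 1] - 0.4 * X[:, 2] + 0.3 * X[:, 3]
           + rng.normal(0.0, 0.03, n))

# Step 1: the regression target is the deviation of the ideal model.
delta = h2_real - h2_ideal

# Step 2: ordinary least squares on the process inputs (intercept + 4 slopes).
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, delta, rcond=None)

# Step 3: hybrid prediction = ideal prediction + predicted deviation.
h2_hybrid = h2_ideal + A @ beta

mae_ideal = np.mean(np.abs(h2_real - h2_ideal))
mae_hybrid = np.mean(np.abs(h2_real - h2_hybrid))
print(f"MAE ideal = {mae_ideal:.3f}, MAE hybrid = {mae_hybrid:.3f}")
```

The linear model only has to describe the deviation, while the non-linear trend is carried by the phenomenological part. For a real assessment the metrics would be computed on data held out from the fit, which this sketch omits for brevity.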
The results presented in Figure 11 indicate an excellent fit to the real data. The hybrid model associating the ideal model with the simple linear regression showed better statistics, with a coefficient of determination equal to 0.985 and a mean absolute error equal to 0.07. Table 4 presents a summary of the statistical metrics of the verified models. Figure 12 presents a comparison between real data, simulated data considering the reaction system as ideal, and the results obtained from the hybrid modeling, with the simple linear regression model fixing 1 mole of biomass with 5 moles of water in the feed for pressures of 300 and 500 bar.
As can be seen in Figure 12, the hybrid modeling proposal considerably improves on the ideally simulated data. The ideal model has limitations that prevent it from predicting the behavior of the system well at high pressures, which is verified in Figure 12: the distance between real and calculated data grows when the pressure increases from 300 to 500 bar. For both pressures, the proposed hybrid model presents excellent results, with coefficients of determination equal to 0.968 and 0.984 for 300 and 500 bar, respectively.

Figure 12. Comparison between real data, simulated data considering the reaction system as ideal, and results obtained from the hybrid modeling, with the simple linear regression model fixing 1 mole of biomass with 5 moles of water in the feed for pressures of 300 and 500 bar.
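Reporting the coefficient of determination separately for each pressure, as done above, amounts to grouping the predictions by pressure before scoring them. The following sketch uses a handful of hypothetical records; the values are illustrative and are not taken from Figure 12.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical records: (pressure in bar, real H2, hybrid-model H2).
records = [
    (300, 1.0, 1.1), (300, 2.0, 1.9), (300, 3.0, 3.0),
    (500, 0.8, 0.8), (500, 1.6, 1.7), (500, 2.4, 2.4),
]

# Collect the real and predicted series for each pressure level.
grouped = {}
for pressure, real, hybrid in records:
    reals, hybrids = grouped.setdefault(pressure, ([], []))
    reals.append(real)
    hybrids.append(hybrid)

r2_by_pressure = {p: r2_score(reals, hybrids)
                  for p, (reals, hybrids) in grouped.items()}

for p, r2 in sorted(r2_by_pressure.items()):
    print(f"{p} bar: R^2 = {r2:.4f}")
```

Scoring each pressure on its own makes it visible whether the fit quality degrades at a particular operating condition, which a single pooled value could hide.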

Conclusions about the Approach and Gains from the Point of View of Process Engineering
The problem used as an example throughout this text deals with a complex reaction system with strong non-ideality due to the high temperatures and pressures it requires, which disfavors the application of simple models such as the ideal gas model. As seen in Figure 7, the ideal model does not fit the data set well, and the deviations tend to grow with increasing pressure. However, the hybrid model associating the simple linear regression model with the ideal gas model fitted the formation of hydrogen well under the minimum (300 bar) and maximum (500 bar) pressure conditions verified in this study, thus demonstrating the robustness of this methodology.
Considering that monitoring the formation of hydrogen with the system treated as ideal can be written in a few lines of code, the hybrid modeling proposed throughout this text has the potential to be applied as an online monitoring tool.

Another advantage is that knowledge of the non-idealities is abstracted away. Process engineering systems often involve complex relations and phenomena that are hard to model with a purely rigorous first-principles approach without incurring a high cost of parameter estimation. The hybridization methodology abstracts these difficulties in the modeling process without losing predictive power.
This work fulfilled the objective of presenting the hybrid modeling architecture as a tool for application in the prediction of industrial processes where a phenomenological model is known that describes the process of interest. The main gain resides in the fact that a data-oriented model can help to correct the deviations caused by the non-ideality of the real phenomena, allowing the use of simplified equations.

Conclusions
This work proposed and developed a hybridization methodology of engineering models together with data-based models as an alternative to building tools for monitoring and forecasting industrial phenomena. Depending on process complexity, a rigorous approach may be too expensive due to the difficulty in finding adequate parameters that generalize the behavior observed in the plant, or due to the uncertainty associated with the estimates of these parameters.
The case study used as a basis for the development of the methodology was the biomass gasification process using supercritical water as the reaction medium. The proposal is to use linear models, which are simpler and more interpretable, to correct the errors made by an idealized phenomenological model.
Using experimental data, a complex model based on the minimization of Gibbs energy with the cubic Peng-Robinson equation of state was applied, which presented an excellent fit to the real data, with a mean relative deviation below 1.0%. The adjusted phenomenological model was used to augment the database by calculating the equilibrium compositions for different conditions of temperature, pressure, and biomass/water composition in the process feed.
Hydrogen production was adopted as the target variable, and the next step was the attempt to describe this variable with a simplified model. The assumption was that the reaction system would behave ideally; thus, the ideal model was applied to the verified process. It fitted the real data poorly, with mean absolute errors equal to 0.281 and 0.322 for pressures of 300 and 500 bar, respectively. Since the ideal model did not fit the actual hydrogen production data well, the hybrid modeling proposal was applied, using a linear machine learning model to correct the simplified model in the prediction of the variable of interest. From this point on, the variable of interest became the error between the values calculated by the ideal model and the actual values of hydrogen production.
Two linear regression models were tested for predicting the deviations of the ideal model: simple linear regression and LASSO linear regression. The simple linear regression model showed a better fit when associated with the ideal model for calculating hydrogen production. The deviation values predicted by the data model were added to the hydrogen predictions of the ideal model, and the resulting sum fitted the real data well. For pressures of 300 and 500 bar, the proposed hybrid model presents excellent results, with determination coefficients equal to 0.968 and 0.984, respectively, thus improving on the simplified ideal approach.
For comparison purposes, a simple linear regression was applied directly to the variable of interest, the formation of hydrogen. This model presented a coefficient of determination equal to 0.834 and a mean absolute error equal to 0.225, which makes the prediction gains obtained with the hybrid model clearer.
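The difference between the two regressions can be illustrated with a small sketch on synthetic deviations. LASSO is implemented here by cyclic coordinate descent on the objective (1/(2n))·||y − b0 − Xw||² + α·||w||₁; this implementation, the function names, and the penalty α = 0.01 are assumptions made for illustration, since the text does not state which solver or penalty was used.

```python
import numpy as np


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)


def lasso_fit(X, y, alpha, n_iter=200):
    """LASSO by cyclic coordinate descent; returns (intercept, coefficients)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring handles the unpenalised intercept
    w = np.zeros(p)
    z = (Xc ** 2).sum(axis=0) / n            # per-feature scale, fixed across sweeps
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that leaves out the contribution of feature j.
            partial = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial / n
            w[j] = soft_threshold(rho, alpha) / z[j]
    return y_mean - x_mean @ w, w


rng = np.random.default_rng(4)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))       # toy scaled predictors
# Synthetic deviation: the first predictor is deliberately irrelevant.
delta = (0.2 - 0.5 * X[:, 1] - 0.4 * X[:, 2] + 0.3 * X[:, 3]
         + rng.normal(0.0, 0.03, n))

# Simple linear regression (ordinary least squares).
A = np.column_stack([np.ones(n), X])
beta_ols, *_ = np.linalg.lstsq(A, delta, rcond=None)
mae_ols = np.mean(np.abs(delta - A @ beta_ols))

# LASSO with an illustrative penalty strength.
b0_lasso, w_lasso = lasso_fit(X, delta, alpha=0.01)
mae_lasso = np.mean(np.abs(delta - (b0_lasso + X @ w_lasso)))

print("OLS coefficients:  ", np.round(beta_ols[1:], 3))
print("LASSO coefficients:", np.round(w_lasso, 3))
```

The penalty shrinks every coefficient and can set irrelevant ones exactly to zero, which aids interpretability but introduces bias; when all predictors matter and data are plentiful, the unpenalized regression can therefore fit the training data more closely.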
The possibility of using simplified models such as the Clapeyron equation, which are easier to interpret and implement, is a considerable gain, as complex phenomenological models usually demand significant experimental work to determine parameters and have limited generalization, as their reliability is only guaranteed within the limits of experimental conditions.
It was possible to demonstrate how data-based approaches and artificial intelligence can help to improve and give more efficiency to the field of process engineering, allowing the construction of better tools for process monitoring and predictive approaches.
The major challenge found in the process industry is having enough data of sufficient quality to train machine learning models. In this work, this obstacle was overcome by augmenting the data with a rigorous equation-of-state (Peng-Robinson) model. However, the lack of a satisfactory amount of data is not a rare situation in the process industry.
In addition, the quality of the available data presents a further challenge. Industrial data often contain the effects of multiple phenomena, noise, and measurement uncertainty. This can make the modeling more difficult because it adds to the incompleteness of the knowledge about the studied processes.
Finally, all the objectives of the work are considered fulfilled, even knowing that there is still much to be done and researched to implement the proposed tools and observe the expected gains.

Future Work
Industrial digitization is a field of increasing exploration and research, with many opportunities for chemical and process engineers to take a more data-driven view and strengthen the evidence base of their arguments.
Possible future work related to this work includes the application of the methodology in real streaming process data and its adaptation to self-learning applications. This could leverage the value generation from industrial data analytics.
This work opens opportunities to explore hybrid methodologies for the use and construction of digital tools for the industry. Opportunities are focused on exploring how the model behaves against real data on the conversion of biomass into hydrogen during the process of supercritical biomass gasification.

Data Availability Statement:
The data used in this work were obtained from simulations based on the thermodynamic approach as described. Similar results can be obtained in any process simulator and the treatment from a machine learning point of view can be easily replicated. We encourage everyone to use the architecture described in any possible problem where you have knowledge of data from any process (real or rigorously simulated) and data obtained from simplified modeling. The purpose of the text is not the verified system but the hybrid approach that allows associating machine learning models with phenomenological models for monitoring processes.

Acknowledgments:
The authors would like to thank the entire faculty of the State University of Campinas for their contribution to the personal and professional development of countless lives and all the professors who support the development of society. In addition, the authors thank Radix Engineering and Software for providing the necessary time and tools demanded by the development of this methodology.

Conflicts of Interest:
The authors declare no conflict of interest.