Machine-learned digital phase switch for sustainable chemical production



Introduction
Reaction, the transformation of basic molecules into high-value products, has been a centrepiece of the chemical process industries since the early days of the industrial revolution (Aftalion, 2001; Scott Fogler, 1987). Chemical reaction analysis drives many important industrial applications, such as drug discovery (Schneider, 2018), polymer production (Suleimanov et al., 2016), pharmaceutical product development (Hansen et al., 2011), and further chemical process synthesis applications (Siirola and Rudd, 1971). In recent years, the development of reaction paths and networks has shifted towards computational approaches (Fooshee et al., 2018; Schwaller et al., 2019), which popularly utilize machine learning to reduce costs (Dewyer et al., 2018; Ulissi et al., 2017). For bio-based chemical production, machine learning has been particularly beneficial for real-time, dynamic control of biochemical processes such as bio-fermentation and anaerobic digestion (Pomeroy et al., 2022). Recent challenges in photochemical reaction multiscale modelling (Kovačič et al., 2020) can also be tackled using machine learning models to speed up computation (Sun et al., 2022). Despite such successes, Kovács et al. (2021) recently argued that model interpretation is a stumbling block for black-box machine learning models in reaction prediction. That work also highlighted that model interpretability is important because chemical reactions are highly contextual. Gale and Durand (2020) likewise discussed that future chemical reaction prediction should be enabled in high-throughput reaction systems, allowing for in-batch or in-flow analysis. Further efforts should also be made to analyse dynamically controlled reactions (Armstrong and Teixeira, 2020) and to assist chemists in industrial applications (Nair et al., 2019).
The analysis of scaled-up reaction systems also has large potential for improved production value, as such reactions operate under conditions of elevated energy consumption, catalyst amount, reactant capacity, and production rate. Machine learning can improve the operation of such industrial systems, giving multi-objective improvements (Tarafder et al., 2005) such as better energy efficiency, environmental impact minimization, and improved product quality. On an industrial scale, these energy savings and environmental impact reductions contribute to the 7th (Affordable and Clean Energy) and 12th (Responsible Consumption and Production) goals of the United Nations Sustainable Development Goals (SDG, 2019). This global effort fosters the need for a more sustainable processing industry (Klemeš et al., 2011). For multi-objective considerations, Schweidtmann et al. (2018) utilized a Thompson Sampling Efficient Multi-Objective algorithm (TS-EMO) to obtain Pareto fronts for productivity, environmental impact and product purity. Genetic algorithms have also been shown to be effective for multi-objective optimization (Silva and Biscaia, 2003), assuming that the reaction models are well constructed. An example of a rigorously developed reactor model can be found in the work of Park et al. (2018), where computational fluid dynamics was used as the reactor model. Alternatively, data-driven models such as neural networks and response surface methodology can also be used to construct effective reactor models (Venkatesh Prabhu and Karthikeyan, 2018).
Apart from elements related to energy, environment, and materials, there is also research interest in reducing the batch times of reaction systems via machine learning methods. Here, terminating a reaction prematurely causes deterioration of product quality due to incomplete reaction, while late termination causes a delay in production and wastage of energy. One approach is to detect the endpoint of the reaction using spectroscopy (Mockel and Thomas, 1992) for immediate discharge to prepare for the next batch of reactions. In-line spectroscopy has even been used for real-time endpoint detection (Lin et al., 2006). While spectroscopy methods require capital investment for implementation, Wang et al. (2019) proposed an interesting idea of developing a catalyst that changes colour at the reaction endpoint. This was demonstrated for the trans-hydrogenation of alkynes with ethanol; however, it requires more development for other types of reactions. A kinetic matching approach was also proposed (Magalhães et al., 2012), which requires frequent experimental sampling and analysis of the solution. More recent developments in predicting the endpoint of reaction systems focus on machine learning models that predict from physical sensors, such as extreme learning machines (Han and Liu, 2014), support vector machines (Wang et al., 2010), random forests and gradient boosted trees (Sala et al., 2019). This approach is highly effective and has lower implementation costs.
Works that consider product grade transition include Kim et al. (2005), who detected different grades of polypropylene product within a process using multiple local and global partial least squares (PLS) models. Moreover, Kaneko et al. (2011) modelled different grades of polymeric product occurring in the process using K-nearest neighbours, a range-based approach, support vector machines and PLS models. However, to the authors' best knowledge, none of these works has considered the transition of operational phases within the reaction with the purpose of shortening reaction batch time and lowering energy consumption. Furthermore, the transition of such reaction operational phases commonly requires the verification of a human operator in industrial settings, requiring more human-centric efforts. Thus, this work proposes a novel approach using explainable human-in-the-loop machine learning to quantify and minimize the energy consumption and batch time of a reaction system from a statistical process control perspective. In this work, the role of the human is to verify the outcome of the machine learning algorithm and carry out the phase switch, allowing for trustworthy operation in a high-stakes production environment. This also prevents the possibility of catastrophic errors caused by the machine learning algorithm, which might happen in rare cases.

Models and methods
The purpose of this work is to utilize the physical parameters of a reaction unit and unravel the overall batch completion and operational phase classification from them. First, the data were split into training (60%), validation (20%) and testing (20%) datasets. The training dataset is used to train the weights of the models (partial least squares or neural network), the validation set is used to optimize the hyperparameters in the pipeline, and the testing dataset provides an independent test of model performance. In the overall method for this work (see Fig. 1), the optimal pre-processing method was first determined via a partial least squares (PLS) testbed. The considered pre-processing methods comprise a total of 129 distinct Wavelet (Lee et al., 2019) and Hampel (Davies and Gather, 1993) filters to remove data spikes. Next, a neural architecture search (NAS) approach is carried out over a progressive depth of the neural network from shallow to deep, finding the optimal neural architecture for predicting reactor overall batch completion. Optimal transfer learning is then deployed to re-use the pre-trained layers of the optimal architecture for reaction operational phase classification (blue dotted lines in Fig. 1). Finally, the phase switch indicator is derived from the phase classification neural network and can be used for statistical process control. The quantification of improvements in operator time, reaction batch time, energy consumption and carbon emissions is also presented. These elements were tested on a real case study of a polymerization reactor (see Section 2.1).
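A batch-wise 60/20/20 split as described above can be sketched as follows; the batch IDs and counts mirror the case study (45 usable batches) but the shuffling seed is arbitrary:

```python
# Sketch of the batch-wise 60/20/20 train/validation/test split described
# above. Batch IDs are illustrative, not the actual plant data.
import numpy as np

rng = np.random.default_rng(0)
batch_ids = np.arange(45)          # 45 usable batches (batch 14 removed)
rng.shuffle(batch_ids)

n = len(batch_ids)
n_train, n_valid = int(0.6 * n), int(0.2 * n)
train = batch_ids[:n_train]
valid = batch_ids[n_train:n_train + n_valid]
test = batch_ids[n_train + n_valid:]

assert len(train) + len(valid) + len(test) == n
```

Splitting by whole batches (rather than by individual timestamps) avoids leaking within-batch correlations between the training and testing sets.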

Polymerization reaction system
The methods and analysis for this work were tested on a polymerization reaction unit in a polyester production facility located in the Netherlands. The reactor unit is part of the polymer production process, where monomers are converted into polymers via a reactor operation of 7 phases (see Fig. 2): filling, heating, pressurization, reaction, vacuum, cooling and discharge. Thermal oil is used for temperature control in the reactor, for both heating and cooling. Vapour from the reactor is sent to a distillation column to separate water and solvent and to recycle the polymer liquid by backflow. Two prediction tasks were required for this work: (i) real-time prediction of the overall reaction completion, and (ii) real-time classification of the reactor operational phases. The overall reaction completion was featured as the ratio of the current time (in minutes) to the final time (in minutes), normalized from 0 to 1. In this case, the current time divided by the predicted overall reaction completion produces the predicted final time of the reactor batch, and this can be done in real-time.
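The inversion of the completion ratio into a predicted final batch time can be illustrated with a toy calculation (the numbers here are made up, not plant data):

```python
# Toy illustration of the completion-ratio construction: if the model
# predicts a completion of 0.625 at minute 750, the implied final batch
# time is 750 / 0.625 = 1200 min.
def predicted_final_time(current_time_min: float, predicted_completion: float) -> float:
    """Invert completion = t / t_final to estimate t_final in real time."""
    return current_time_min / predicted_completion

print(predicted_final_time(750.0, 0.625))  # 1200.0
```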
The reaction operational phases are encoded and used as the classification labels.
In this work, 8 variables were selected as model inputs, mainly comprising the real-time temperatures, pressures and flows of chemicals and heating oil. A real-time inline viscosity meter also provides information about the viscosity in the reactor. The variables are listed in Table 1. For this work, 46 batches of reaction data were used, with batch 14 removed as an anomaly, leaving 45 batches each containing between 1020 and 1426 timestamps of data. The data are obtained directly (1-min sampling rate) from the process information system of the polymerization facility with real-time sensors per batch of production.

Filter selection for noise removal
Noise removal is essential for removing fluctuations in realistic chemical reaction data, as the physical sensors can be easily affected by elements such as bubbles, fluid flow, etc. The filter set contains 129 denoising algorithms consisting mainly of (i) Hampel filters (Davies and Gather, 1993) and (ii) Wavelet filters (Lee et al., 2019). For the Hampel filter, window sizes of 2 to 25 were considered. Seven families of wavelet functions were considered in this work: (i) Haar, (ii) Daubechies (order = 1 to 20), (iii) Symlets (order = 2 to 20), (iv) Coiflets (order = 1 to 5), (v) Biorthogonal, (vi) Reverse Biorthogonal and (vii) Discrete Meyer. The biorthogonal scaling function orders for the Biorthogonal and Reverse Biorthogonal wavelet functions include the pairs (1.1), (1.3), (1.5), (2.2), (2.4), (2.6), (2.8), (3.1), (3.3), (3.5), (3.7), (3.9), (4.4), (5.5), (6.8). To select the best filter from the collection, each algorithm was coupled with a partial least squares (PLS) model (Geladi and Kowalski, 1986) to evaluate its cross-validated error. The PLS model is used as a pre-processing testbed to save computational time, rather than optimizing the filter directly on a neural network, which would cause an exponential increase in computational time.
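A minimal sketch of the Hampel-filter half of the testbed is given below: slide a window over the signal and replace any point more than a few scaled MADs from the window median. The synthetic spiky signal and the outlier threshold are illustrative assumptions, not the plant data or the paper's exact implementation:

```python
# Minimal Hampel filter sketch: replace points more than n_sigmas
# scaled-MADs away from the rolling window median.
import numpy as np

def hampel(x: np.ndarray, window: int, n_sigmas: float = 3.0) -> np.ndarray:
    """Replace outliers with the rolling median (window points on each side)."""
    y = x.copy()
    k = 1.4826  # scale factor: MAD -> std for Gaussian data
    for i in range(window, len(x) - window):
        seg = x[i - window:i + window + 1]
        med = np.median(seg)
        mad = k * np.median(np.abs(seg - med))
        if np.abs(x[i] - med) > n_sigmas * mad:
            y[i] = med
    return y

t = np.linspace(0, 1, 200)
clean = np.sin(2 * np.pi * t)
spiky = clean.copy()
spiky[[50, 120]] += 5.0                  # sensor spikes (e.g. bubbles)
denoised = hampel(spiky, window=6)
```

In the full testbed, each candidate filter (Hampel window size or wavelet family/order) would be scored by the cross-validated error of a PLS model fitted on the filtered signals.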

Neural network with variable architecture and depth
A neural network (see Fig. 3) can be expressed as a model (f) with multiple layers of activation functions for a target value (y), giving a predicted target (y′) represented as a network of activation functions (Lecun et al., 2015; McCulloch and Pitts, 1943). The inner parameters within a neural network comprise the weights on the edges of the network (w) and the biases (b). For hyperparameters, d is the depth of the neural network (number of layers), h is the number of neurons in each layer expressed as a sequence, and a is the type of activation function in each layer expressed as an encoded sequence. During the training phase of a neural network, the hyperparameters d, h and a are predefined to fix the network structure, and a gradient descent algorithm such as Adam (Kingma and Ba, 2015) is used for error backpropagation (Rumelhart et al., 1986), optimizing both weights (w) and biases (b) to minimize the loss function L(y′_train, y_train). Nevertheless, the choice of d, h and a for the network's structure is important, as it can drastically affect prediction effectiveness (Elsken et al., 2019a; Tang et al., 2021). Such hyperparameters should be optimized by neural architecture search (NAS) (Elsken et al., 2019b). This work uses an evolutionary optimization approach for NAS, and the algorithm is described in Section 2.3.1.

Neural architecture search
Neural architecture search is deployed to obtain the optimal structure and combination of activation functions in a neural network. In this work, a progressive neural architecture search approach is used (Liu et al., 2018). The algorithm, Progressive Depth Swarm Evolution (PDSE) (Teng et al., 2019), utilizes a two-step optimization approach to carry out neural architecture search for deep neural networks. The higher-level problem progressively searches through the depth (d) of the neural network (from low to high), with the motivation to search shallower neural networks before deeper ones. The lower-level problem uses a modified particle swarm optimization algorithm to optimize both the number of neurons in each hidden layer (h) and the types of activation functions in each hidden layer (a), with the objective of minimizing the cross-validated loss on the validation dataset, L(y′_valid, y_valid). The PDSE algorithm can be mathematically expressed as the two-step optimization in Eqs. (1) and (2):

d* = arg min_d [ min_(h,a) L(y′_valid, y_valid | d) ] (1)

(h*, a*) = arg min_(h,a) L(y′_valid, y_valid | d = d*) (2)

where d* is the optimal depth of the neural network, a* is the optimal activation functions in each hidden layer expressed as an encoded sequence, and h* is the optimal number of neurons in each hidden layer.
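The two-level structure of the search can be sketched as follows. Here random search stands in for the particle-swarm inner optimizer and a toy surrogate loss replaces actual network training and cross-validation; all names and values are illustrative, not the PDSE implementation:

```python
# Structural sketch of a two-level (progressive depth) architecture search:
# the outer loop walks depth from shallow to deep, the inner loop optimizes
# (h, a) for that depth.
import random

ACTIVATIONS = ["relu", "tanh", "selu", "sigmoid"]

def surrogate_valid_loss(depth, hidden, acts):
    # Toy stand-in for the cross-validated loss of a trained network
    # (activations are ignored by this surrogate).
    target = [64] * depth
    return sum((h - t) ** 2 for h, t in zip(hidden, target)) / depth + 0.1 * depth

def inner_search(depth, n_samples=200, seed=0):
    # Random search stands in for the modified particle swarm optimizer.
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        hidden = [rng.randint(1, 128) for _ in range(depth)]
        acts = [rng.choice(ACTIVATIONS) for _ in range(depth)]
        loss = surrogate_valid_loss(depth, hidden, acts)
        if best is None or loss < best[0]:
            best = (loss, hidden, acts)
    return best

def progressive_depth_search(max_depth=6):
    results = [(d, *inner_search(d)) for d in range(1, max_depth + 1)]
    return min(results, key=lambda r: r[1])  # (d*, loss*, h*, a*)

d_star, loss_star, h_star, a_star = progressive_depth_search()
```

The outer minimum over depths corresponds to Eq. (1) and the inner search corresponds to Eq. (2).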

Optimal transfer learning
Transfer learning is a technique to re-use part of a neural network trained for Task A on a different but related Task B (Weiss et al., 2016; Zhuang et al., 2021). Using inductive transfer learning (Lu et al., 2015), the knowledge from Task A can be transferred to Task B in the same source domain by re-using the latent space of the model. In this work, inductive transfer learning was carried out from real-time reaction completion prediction (Task A) to real-time reaction operational phase classification (Task B), while the same reaction recipe (source) is maintained.
For the algorithmic procedure, we first train a neural network (considering neural architecture search) for real-time reaction completion prediction, represented as f_A(d = d*), where the depth of the neural network is the optimal depth d* found by the neural architecture search in Section 2.3.1. To start the transfer learning procedure, the final layer of the neural network is discarded and only the first d* − 1 layers are kept, giving the transfer neural network f_T = f_A(d = d* − 1). Next, a new single layer g(a_T, h_T) with an activation function (a_T) and number of neurons (h_T) is added to f_T as the penultimate layer. A softmax layer g′, with the number of neurons equal to the number of reaction operational phases (7), is added as the final classification layer. This forms the untrained neural network for reaction operational phase classification, f_T • g(a_T, h_T) • g′. The weights and biases in the transferred layers f_T are frozen, while the weights and biases in g(a_T, h_T) • g′ are trained with the Adam algorithm (Kingma and Ba, 2015) to minimize the loss function, a sample-weighted multi-class cross-entropy loss on the training data. The procedure is repeated for all possible activation functions (a_T) and numbers of neurons (h_T) within the search range. The optimal transfer-learned network, with the optimal activation function (a_T*) and optimal number of neurons (h_T*), is the network with minimal cross-validated loss on the validation dataset L′(y′_valid, y_valid), as shown in Eq. (3):

(a_T*, h_T*) = arg min_(a_T, h_T) L′(y′_valid, y_valid) (3)

After determining the hyperparameters a_T* and h_T*, the previously frozen weights and biases in f_T are allowed to be fine-tuned. Specifically, the weights and biases in the full neural network for reaction operational phase classification, f_T • g(a_T*, h_T*) • g′, are optimized using the Adam algorithm (Kingma and Ba, 2015) with a small learning rate of 10⁻⁵ to allow better specialization on the new task of reaction operational phase classification (Mormont et al., 2018).
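The head-only training step of the transfer procedure can be sketched with plain numpy: the first d* − 1 layers act as a frozen feature extractor f_T, and only the new penultimate layer g and the 7-class softmax head g′ are updated. The tiny random "pretrained" weights, the plain gradient descent (in place of Adam), and the synthetic data are all placeholder assumptions:

```python
# Numpy sketch of the transfer step: frozen feature extractor f_T plus a
# trainable head g (5 neurons) and a 7-class softmax g'.
import numpy as np

rng = np.random.default_rng(1)

# Frozen feature extractor f_T: two "pretrained" layers (weights held fixed).
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)

def f_T(x):
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

# Trainable head: g followed by the 7-class softmax g'.
Wg, bg = rng.normal(scale=0.1, size=(16, 5)), np.zeros(5)
Ws, bs = rng.normal(scale=0.1, size=(5, 7)), np.zeros(7)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(64, 8))       # 8 process variables (synthetic)
y = rng.integers(0, 7, size=64)    # 7 operational phases (synthetic labels)
Y = np.eye(7)[y]

lr = 0.5
for _ in range(300):               # train the head only; f_T stays frozen
    Z = f_T(X)
    H = np.tanh(Z @ Wg + bg)
    P = softmax(H @ Ws + bs)
    dL = (P - Y) / len(X)          # softmax cross-entropy gradient
    dH = (dL @ Ws.T) * (1 - H ** 2)
    Ws -= lr * (H.T @ dL); bs -= lr * dL.sum(0)
    Wg -= lr * (Z.T @ dH); bg -= lr * dH.sum(0)
```

In the paper's procedure this inner training is repeated over the (a_T, h_T) search range, and the frozen layers are only unfrozen for fine-tuning after the best head is chosen.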

Explainable machine learning and interpretation
The interpretation of information between machine learning models and the human user is crucial for a human-centric machine learning application.For model explainability and interpretability, principal component analysis visualization, Kernel Shapley Additive Explanations (Kernel SHAP), and Accumulated Local Effects (ALE) plots were used.

Principal component analysis
Multivariate data visualization is performed using principal component analysis (PCA) (Jolliffe, 1986). The multivariate data of the reaction system are dimensionally reduced using PCA to 3 dimensions (the maximum for visualization). The reaction system data points are projected in a 3D plot with batch completion as the colour gradient for effective multivariate data visualization.
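The 3-component projection can be sketched with a plain SVD so no extra libraries are needed; the random matrix below stands in for the reactor's multivariate time series (rows = timestamps, columns = sensors):

```python
# PCA to 3 components via SVD on the centred data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))              # 8 process variables (synthetic)

Xc = X - X.mean(axis=0)                    # centre before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores3 = Xc @ Vt[:3].T                    # 3-D scores for the 3-D plot

explained = (s ** 2) / (s ** 2).sum()      # variance ratio per component
```

The `scores3` columns are what would be plotted on the three axes, coloured by batch completion.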

Kernel SHAP
For the global explainability of the neural network, the Kernel Shapley Additive Explanations (Kernel SHAP) method was used (Lundberg and Lee, 2017). Kernel SHAP uses the LIME model (Ribeiro et al., 2016) as a local linear explanation model while utilizing the classic Shapley regression value (Lipovetsky and Conklin, 2001). Suppose f_x is the trained model using the set of features x, the set of all features is F, and feature subsets are denoted S, such that S ⊆ F. The Shapley value for feature i is computed as the absolute average of Eq. (4):

φ_i = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] [ f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) ] (4)

Lundberg and Lee (2017) demonstrated that the regression coefficients of the LIME surrogate model estimate the SHAP values using a weighted linear regression model and a weighting kernel. With this, sample efficiency is achieved for model explainability via approximated feature importance.
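The Shapley sum in Eq. (4) can be enumerated exactly for a tiny model, which is useful for building intuition (Kernel SHAP exists precisely because this enumeration is infeasible for many features). The toy linear model, instance and zero baseline below are illustrative assumptions:

```python
# Brute-force Shapley values for a 3-feature toy linear model.
from itertools import combinations
from math import factorial

import numpy as np

w = np.array([2.0, -1.0, 0.5])             # toy model: f(x) = w . x
x = np.array([1.0, 1.0, 1.0])              # instance to explain
baseline = np.zeros(3)                     # "missing" features -> baseline

def f_masked(subset):
    z = baseline.copy()
    for j in subset:
        z[j] = x[j]
    return float(w @ z)

n = 3
def shapley(i):
    others = [j for j in range(n) if j != i]
    total = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (f_masked(S + (i,)) - f_masked(S))
    return total

phi = [shapley(i) for i in range(n)]       # for a linear model, phi_i = w_i * (x_i - baseline_i)
```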

ALE plots
To explain the local effects of the variables on the prediction, Accumulated Local Effects (ALE) plots were used (Apley and Zhu, 2020). ALE quantifies local feature effects by calculating the difference in prediction based on the conditional distribution of the features. The centred ALE can be computed by Eq. (5):

f̂_ALE(x_1) = Σ_{k=1}^{k(x_1)} (1/n(k)) Σ_{i: x_1^(i) ∈ N(k)} [ f(z_k, x_2^(i)) − f(z_{k−1}, x_2^(i)) ] − c (5)

where N(k) denotes the interval [z_{k−1}, z_k), z is a fine grid over the feature x_1, n(k) is the number of points within the interval N(k), the bracketed term evaluates the difference in prediction when x_1 is replaced with the right interval edge z_k and the left interval edge z_{k−1} respectively, and c is a constant that centres the plot.
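The ALE estimator can be implemented directly for a toy two-feature model: within each grid interval, average the prediction change when x_1 is moved from the left to the right interval edge, then accumulate and centre. The model, data and grid size are illustrative assumptions:

```python
# Direct ALE estimate for feature x1 of a toy two-feature model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
f = lambda X: X[:, 0] ** 2 + X[:, 1]       # true local effect of x1 is 2*x1

K = 10
z = np.quantile(X[:, 0], np.linspace(0, 1, K + 1))   # fine grid over x1
local = np.zeros(K)
for k in range(K):
    in_k = (X[:, 0] > z[k]) & (X[:, 0] <= z[k + 1]) if k else (X[:, 0] <= z[1])
    Xr, Xl = X[in_k].copy(), X[in_k].copy()
    Xr[:, 0], Xl[:, 0] = z[k + 1], z[k]
    local[k] = (f(Xr) - f(Xl)).mean()      # mean prediction difference in N(k)

ale = np.cumsum(local)                     # accumulate local effects
ale -= ale.mean()                          # centre the ALE curve
```

For this toy model the centred ALE curve grows monotonically, matching the increasing local effect of x1².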

Knowledge distillation into a decision tree
Neural networks can capture high-dimensional features and provide good prediction accuracy; however, they are black-box models and difficult to interpret. Frosst and Hinton (2018) proposed using a decision tree to model the input-output behaviour of a neural network, transferring the knowledge from a large neural network to an easily interpretable decision tree with slightly lower accuracy. This knowledge distillation strategy (Gou et al., 2021) allows a neural network to be represented as a tree diagram for human understanding.
In this work, the neural network was uniformly randomly sampled 10,000 times as a training set and 10,000 times as a test set. A decision tree classifier was then trained on these data with the target variable binned. To optimize the hyperparameters of the decision tree classifier, Bayesian optimization is used with a Matérn kernel, 45 initial points and 100 iterations. The hyperparameter optimization extracts the most information possible from the neural network to be represented in the tree structure.
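The distillation step can be sketched as follows: sample the (frozen) teacher uniformly over the input domain, bin its output, and fit a small decision tree student. A simple analytic function stands in for the trained neural network, and a fixed tree depth replaces the Bayesian hyperparameter search; all values are illustrative:

```python
# Sketch of teacher-to-tree knowledge distillation with binned targets.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def teacher(X):                            # stand-in for the neural network
    return 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))

X_train = rng.uniform(-3, 3, size=(10_000, 2))
X_test = rng.uniform(-3, 3, size=(10_000, 2))
bins = np.linspace(0, 1, 6)                # bin the continuous target
y_train = np.digitize(teacher(X_train), bins[1:-1])
y_test = np.digitize(teacher(X_test), bins[1:-1])

student = DecisionTreeClassifier(max_depth=6, random_state=0)
student.fit(X_train, y_train)
acc = student.score(X_test, y_test)        # fidelity of the distilled tree
```

The test-set score measures fidelity to the teacher rather than accuracy on real labels, which is the relevant quantity for an explanation surrogate.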

Reaction process quantification for human-in-the-loop
The quantification of the reaction process in terms of time savings, energy savings and carbon emission reductions is presented in this section. The time savings can be expressed as two types: (i) time savings potential for the reactor operator (human), and (ii) time savings potential for the reaction batch.

Operator time savings
During product discharge and reaction batch ending, the operator has to be present at the reactor. The expected time window during which the operator has to be ready for this operation is the standard deviation of the total reaction batch time. Due to the accurate endpoint prediction by the neural network, the time window (95% confidence interval) of the expected reaction endpoint can be decreased relative to the reaction batch endpoint standard deviation (σ_b), which yields time savings for the operator. This time-saving potential for the operator, t_o, has an expected value represented in Eq. (6):

E(t_o) = σ_b − CI_95%(R) (6)

where CI_95%(R) is the 95% confidence interval of the prediction residuals for the reaction endpoint, considering alternative batches as single samples.
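A numeric sketch of this construction: the saving is the batch-endpoint standard deviation minus the 95% confidence width of the model's endpoint-prediction residuals. This reading of Eq. (6), and all numbers below, are assumptions on synthetic data:

```python
# Toy operator time-saving estimate from synthetic batch data.
import numpy as np

rng = np.random.default_rng(0)
batch_times = rng.normal(1200, 60, size=45)   # minutes; endpoint std ~ 60
residuals = rng.normal(0, 10, size=45)        # endpoint prediction errors

sigma_b = batch_times.std(ddof=1)
ci95 = 1.96 * residuals.std(ddof=1)           # 95% interval of the residuals

t_o = sigma_b - ci95                          # expected operator saving (min)
```

The saving is positive whenever the model's residual spread is tighter than the natural batch-to-batch endpoint variation.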

Reaction batch time savings
The time savings potential for the reaction batch (t_{r,ρ}), for each operational phase switching point ρ, is given in Eq. (7). Note that N(B) is the total number of batches.

Energy savings
The rate of heat energy required, Ḣ(t), at time t for the reaction system follows a common heat transfer equation, expressed in Eq. (8):

Ḣ(t) = ṁ(t) · c_p,oil · (T_oil,in(t) − T_oil,out(t)) (8)

where ṁ(t) is the logged mass flowrate of the heating oil to the reactor, c_p,oil is the sensible specific heat capacity of the heating oil, T_oil,in(t) is the logged inlet temperature of the heating oil before the reactor, and T_oil,out(t) is the logged outlet temperature of the heating oil from the reactor.
From this, the heat energy saving potential H(ρ) at a certain phase switching point (ρ) can be expected as the average rate of heat energy across all batches, multiplied by the expected time saving, as shown in Eq. (9):

H(ρ) = [ (1/N(B)) Σ_{b∈B} Ḣ_b ] · t_{r,ρ} (9)

where Ḣ_b denotes the time-averaged rate of heat energy of batch b.
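The sensible-heat computation in Eq. (8) and the saving estimate in Eq. (9) can be sketched numerically; the oil heat capacity, flow, temperatures and time saving below are illustrative assumptions, not plant values:

```python
# Heat rate from logged oil flow and temperature drop, then saving =
# average rate times expected time saving at a switch point.
import numpy as np

cp_oil = 2.1e3                    # J/(kg K), assumed thermal-oil value

def heat_rate(m_dot, T_in, T_out):
    """Hdot(t) = m_dot(t) * cp_oil * (T_in(t) - T_out(t)), in watts."""
    return m_dot * cp_oil * (T_in - T_out)

m_dot = np.full(100, 2.0)         # kg/s, logged oil flow over 100 samples
T_in = np.full(100, 280.0)        # degC, oil into the reactor
T_out = np.full(100, 260.0)       # degC, oil out of the reactor

Hdot = heat_rate(m_dot, T_in, T_out)
t_saving_s = 15 * 60              # expected time saving at switch point (s)
H_saving_J = Hdot.mean() * t_saving_s
```

Here 2.0 kg/s × 2100 J/(kg K) × 20 K = 84 kW, so 15 min of avoided heating saves about 75.6 MJ for this switch point.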

Carbon emission reductions
The heat energy is produced by a natural gas boiler. Savings in the heat energy usage of the reactor therefore reduce the energy that must be produced in the natural gas boiler, giving a reduction in carbon emissions. The reduction in carbon emission (ς) is calculated as the average expected energy saving across all phase switching points ρ ∈ Ῥ, multiplied by the carbon dioxide equivalent coefficient Κ according to EN 15603 (BSI, 2008; Zottl et al., 2011). Note that N(Ῥ) is the total number of phase switching points, and the carbon emission savings can be scaled up to an operation time in years with a factor n. This expression is shown in Eq. (10):

ς = n · Κ · (1/N(Ῥ)) · Σ_{ρ∈Ῥ} H(ρ) (10)
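The scaling in Eq. (10) reduces to a short calculation; the per-switch-point savings, the CO₂-equivalent coefficient and the batch counts below are placeholders, not the EN 15603 figure or this plant's data:

```python
# Average per-batch energy saving over switch points, convert to CO2-eq,
# then scale to a multi-year operation.
energy_savings_MJ = [75.6, 40.2, 12.5]   # H(rho) per switch point, per batch
kappa = 0.056                            # kg CO2-eq per MJ (assumed value)
batches_per_year, years = 300, 5

mean_saving = sum(energy_savings_MJ) / len(energy_savings_MJ)
co2_reduction_kg = years * batches_per_year * kappa * mean_saving
```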

Neural network-based phase switch indicator as a digital phase switch
Neural network-derived phase switch indicators are developed as digital phase switches to assist human operators in carrying out better operational phase changes and batch-ending procedures. This human-in-the-loop approach allows machine learning algorithms to assist human operators in making operational phase switch decisions.
First, the transfer-learned reaction phase classifier neural network (see Section 2.4) is used for prediction on batches of reactions. The confusion matrix for the resulting classification is a matrix M of size P × P, containing the number of samples in the corresponding elements. We now assume that the prediction of the neural network can approximate the ground truth for the operational phase change, as it has learned the batch-to-batch variance of the reactions. The number of samples where the operator switches the reaction operational phase accurately according to the neural network (α) can then be quantified as the sum of the diagonal elements:

α = Σ_{p=1}^{P} M_{p,p}

Next, the number of samples where the operator switches the reaction operational phase late or early according to the neural network (β) can be quantified as the sum of the off-diagonal elements:

β = Σ_{p≠q} M_{p,q}

The unnormalized phase switch indicator (γ) is then the ratio of accurately switched samples to samples with late or early switching of operational phases, γ = α/β. Each phase switch indicator corresponds to a single batch of reaction.
The phase switch indicator is then normalized over the batches of reactions and the modulus is taken. This ensures that the indicator is comparable across batches, and the modulus operation combines late and early operational phase switches, making the indicator one-sided instead of two-sided:

γ̂ = | (γ − γ̄) / σ_γ |

where γ̂ is the normalized phase switch indicator, γ̄ is the batch mean of the unnormalized phase switch indicator, and σ_γ is the batch standard deviation of the phase switch indicator. The normalized phase switch indicator is representative (larger is good) of how the neural network model evaluates the quality of the operational phase switches of a single batch reaction.
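The indicator construction can be sketched end-to-end with numpy. The exact α/β definitions below (diagonal versus all off-diagonal confusion-matrix mass) are a plausible reading of the text, and the per-batch confusion matrices are synthetic:

```python
# Per-batch phase switch indicator from a confusion matrix, then
# batch-wise normalization with the modulus taken.
import numpy as np

def phase_switch_gamma(M: np.ndarray) -> float:
    alpha = np.trace(M)               # accurately switched samples
    beta = M.sum() - alpha            # late + early switched samples
    return alpha / beta

rng = np.random.default_rng(0)
gammas = []
for _ in range(45):                   # one confusion matrix per batch
    M = np.diag(rng.integers(100, 200, size=7)).astype(float)
    M += rng.integers(0, 5, size=(7, 7))   # off-diagonal misalignments
    gammas.append(phase_switch_gamma(M))

gammas = np.array(gammas)
normalized = np.abs((gammas - gammas.mean()) / gammas.std(ddof=1))
```

Each entry of `normalized` is the one-sided indicator for a single batch, comparable across the campaign.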

Results and discussion
The methodology described in Section 2 was performed on a polymerization reactor for the production of polymers in an industrial facility located in the Netherlands. The subsequent sections discuss the results for pre-processing, predicting overall batch time, operator time savings, reaction operational phase classification, the phase switch indicator, and the resulting process improvements (in terms of time, energy and environmental impact).

Data pre-processing via Wavelet and Hampel filters
Heterogeneous reactions often produce gases within liquids, leading to bubbling in the liquid reactants. These bubbles can attach to sensor probes, causing spikes in the readings that can be misinterpreted by data-driven models. To handle this artefact within the data, we proposed using either Wavelet filters or Hampel filters to remove such reading spikes (see Fig. 4).

Fig. 4. Plot of all the considered de-noising methods for filtering spikes within the reaction system. For the rank of the pre-processing method, a lower rank is better with lower validation mean absolute error.

S.Y. Teng et al.

The criterion for selecting the filter method is to minimize the mean absolute error on the validation set, as the mean absolute error is directly proportional to the deviation of the reading. The optimization of the de-noising method is carried out using the partial least squares (PLS) method as a testbed, as it sufficiently deals with simple non-linearity within the data and is computationally inexpensive for such an optimization procedure. From this approach, the optimal de-noising method is the Hampel filter with a window size of 6 (minutes), with a validation MAE of 0.0713 on the PLS testbed. Overall, Hampel filters with a suitably small window have the best de-noising effects. Wavelets with a Daubechies function also produced stable results competitive with the Hampel filters. Reversed Biorthogonal, Haar and Discrete Meyer wavelets produced relatively worse de-noising results, with validation MAE of 0.076-0.077.

Predicting overall batch time
The de-noised data from the physical variables of the reaction system were then fed into a neural network with no fixed architecture. Instead, a neural architecture search (NAS) algorithm, progressive depth swarm evolution, is used to search for the optimal architecture. The obtained neural network has an optimal architecture (depth = 8) of (508, 261, 177, 49, 234, 284, 6, 1) with activation functions of (elu, selu, hard sigmoid, hard sigmoid, softmax, softmax, hard sigmoid, linear). Observing the error diagram of the best neural networks at each depth, the error of the optimal architecture drops drastically from depth 2 to depth 3. This reflects shallow versus deep learning: the reaction system is sufficiently complex that a higher level of data abstraction benefits the model through deep learning (Lecun et al., 2015; Poggio et al., 2017). However, the differences across depths of 3 to 10 must also be studied. From Fig. 5(a,b), a performance increase (RMSE, MAE and R² improved, while MBE stayed within an acceptable range) can be observed up to a depth of 8 throughout the range of 3 to 10. This performance increase was observable for both the training dataset and the testing dataset, suggesting minimal overfitting and a true improvement in prediction. This observation is further confirmed by the testing set demonstrating performance similar to the validation and training sets (see Fig. 5(c)). It is interesting to note the white gap within the testing set, which corresponds to the time of manual sampling operation in the reactor. From Fig. 5(d), the multivariate visualization of the dataset can be related to batch completion. It is observed that artefacts in the multivariate space cause a slight deviation in prediction.

Model explainability
Using Kernel SHAP, the global explainability of the optimally evolved neural network models was deduced. Fig. 6(a) shows the individual SHAP values of each sampled point in the model. The viscosity of the liquid in the reactor (the most important variable) has a very large range of individual SHAP values, while the distribution of points (shown as the thickness of the clustered points in the plot) shows both significantly negative and positive impacts on the model. The next process variable, the reaction temperature, also has a large range of individual SHAP values, but its points are mainly distributed near zero. Overall, the global feature importance from the Kernel SHAP method shows that the viscosity of the liquid in the reactor is the most important variable, giving the highest impact on the overall reaction batch time (see Fig. 6(b)). The next most important variables explained by Kernel SHAP are the reaction temperature and the reaction pressure. This is physically indicative of the reaction, as the temperature and pressure of the reactor define the reactants' physical state.
The local effects of each variable on the overall reaction batch time (target variable) are analysed using ALE plots (see Fig. 7). The 95% confidence intervals within the ALE plots show the standard deviation of the local effect of each physical variable on the overall reaction batch time, as explained by the neural network. In other words, the confidence interval is a measure of how much the neural network trusts the physical variable to carry out the prediction. The viscosity (Fig. 7(a)) is fully utilized, as its ALE plot shows a very tight confidence interval; moreover, viscosity becomes increasingly important for the prediction over time. In Fig. 7(b-g), the confidence interval is segregated into "chunks", showing the distinct behaviour of the neural network with respect to the different operational phases. The reactor temperature (Fig. 7(b)) increases in effect on the prediction over time, while the reactor pressures (Fig. 7(c,e)) and vapour temperature decrease in effect over time. Interestingly, the least important variable, the flowrate in Fig. 7(h), has a decreasing effect over time and a slightly larger confidence interval only during the discharge phase. These ALE plots explain the local effects of each variable and unravel the neural network's prediction in a human-explainable manner.
Using knowledge distillation into a decision tree classifier (see Fig. 8), it can be observed that the decision tree mainly preserves information to classify the last stage of the process (0.8-1.0 batch completion) and the earliest stage of the process (0.0-0.2 batch completion). This shows that the neural network uses most of its high-dimensional space to map the "filling" and "discharge" phases of the operation, as these phases contain the most variance (see Fig. 5(d)). A combination of reactor temperature in the mixture (T3), viscosity (V), thermal oil flowrate (F1) and vacuum control pressure (P2) is used to align and model the large variances in the "filling" and "discharge" phases.

Operator time savings
Under normal circumstances, the operator would be required to be on standby at the reactor within an expected time window (standard deviation) near the endpoint. However, with the more accurate endpoint prediction from the neural network, the endpoint of the reaction operation can be more precisely quantified. The time saving is the original standard deviation of the endpoint minus the expected prediction deviation. The overall reaction batch time prediction therefore statistically saves operational time for the reaction unit's operator. From Fig. 9(a), the deviation of the neural network's prediction starts to stabilize after 800 min of operation; this is the time from which the end-time prediction is sufficiently accurate. Next, from Fig. 9(b), the maximum time-saving potential for the reactor operator is about 6 h, which is reliably achievable as the prediction stabilizes after 800 min.
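As a worked illustration of this bookkeeping (the two deviation values below are hypothetical placeholders, chosen only to be consistent with the reported ~6 h maximum):

```python
# Expected operator time saving = historical endpoint standard deviation
# minus the stabilized deviation of the neural network's prediction.
# Both values are hypothetical, not measured data from the paper.
sigma_endpoint_min = 420.0  # historical endpoint spread (minutes)
sigma_model_min = 60.0      # prediction deviation after 800 min (minutes)

saving_min = sigma_endpoint_min - sigma_model_min
saving_h = saving_min / 60.0  # 6.0 h under these assumed inputs
```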

Reaction operational phase classification
Having constructed a neural network model for overall batch time prediction, the information on the overall batch time can be used as prior knowledge for reaction operational phase classification. The optimal transfer learning approach allows the re-use of physical variables and reaction operational information for operational phase classification, hence stabilising model prediction and reducing training time. The penultimate prediction layer of the classification neural network is optimized for its activation function and number of hidden neurons based on the cross-validated class-weighted multiclass cross entropy (see Fig. 10). The optimal penultimate layer for the classification neural network achieves a cross entropy of 0.1941 with 5 neurons and an exponential activation function.
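The selection metric in Fig. 10, a class-weighted multiclass cross entropy, can be sketched as follows; the function name, argument layout and the small numerical floor are our own conventions, not the authors' code:

```python
import numpy as np

def weighted_cross_entropy(y_true, p_pred, class_weights):
    """Class-weighted multiclass cross entropy.

    y_true: (n,) integer class labels.
    p_pred: (n, k) predicted class probabilities.
    class_weights: (k,) weight per class (e.g. inverse class frequency).
    """
    w = class_weights[y_true]  # weight assigned to each sample
    # Probability assigned to the true class, floored for stability.
    p_true = p_pred[np.arange(len(y_true)), y_true] + 1e-12
    return -np.sum(w * np.log(p_true)) / np.sum(w)
```

Evaluating this metric under cross-validation for each candidate (activation, neuron count) pair yields a grid like Fig. 10, from which the minimum is selected.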
After optimal transfer learning, the classification performance of the neural network was evaluated. Firstly, the confusion matrix shows that the neural network predicts the phase classification accurately, with most data lying on the diagonal cells of the matrix (over 90% accuracy), giving correct predictions (see Fig. 11(a)). The majority of the misaligned data were found one operational phase earlier or later, with the exception of the filling phase, as it has significant uncertainty (e.g. reactant splash, the operator stopping the fill, etc.). The classification report matrix (Fig. 11(b)) shows that the predictions for each individual operational phase were accurate (more than 0.85 for precision, recall and F1-score), with no strong bias towards any particular operational phase. Looking further into the prediction using PCA visualization (see Fig. 11(c-e)), it is observed that the misaligned points were actually points near the transitions of the reaction operational phases. An important consideration is that the neural network predicts the phase classification from the physical variables of the reactor, while the actual labels themselves are the classifications from the control system. The classification of the control system of the reactor requires confirmation from the human operator to perform the operational phase switch; otherwise, the reactor will be idle. Therefore, the misalignment for such points in Fig. 11(e) is the misalignment between the human operation and the readiness (to progress to the next operational phase) of the physical variables within the reactor. Therefore, if the inherent prediction error of the neural network is small and negligible, these misaligned points can be used to quantify the timeliness of the phase switch operation.
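The evaluation in Fig. 11(a-b) corresponds to standard tooling; a sketch with scikit-learn, using entirely hypothetical phase labels (the integer encoding, the phase names and the single early switch are invented for illustration):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical six-phase encoding of the operational phases.
phases = ["filling", "heating", "pressurization",
          "vacuum", "cooling", "discharge"]

# Control-system labels vs. neural-network predictions (toy data):
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5]  # one early phase switch

cm = confusion_matrix(y_true, y_pred)            # Fig. 11(a)-style matrix
report = classification_report(y_true, y_pred,   # Fig. 11(b)-style table
                               target_names=phases)
```

Off-diagonal entries adjacent to the diagonal correspond exactly to the "one phase earlier or later" misalignments discussed above.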

Phase switch index
The Phase Switch Index is derived from the misalignment points between the operator control and the neural network prediction. In this work, we show that it can be used as a simple and effective indicator to study the effects of the digital phase switch in the reaction unit. From Fig. 12(a), it can be observed that the phase switch index is a direct effect of the batch time: whenever there is a large increase in batch time, there is a corrective effect from (an increase in) the Phase Switch Index. This shows the cause-and-effect relationship between the batch time and the Phase Switch Index during statistical process control. Moreover, the Phase Switch Index has a decreasing relationship with the batch time (Fig. 12(b)), showing that the Phase Switch Index is logically affected by the batch time of the process.
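A minimal sketch of the misalignment measurement underlying such an index: count the time during which the model already calls for the next phase while the operator has not yet switched. The function name, integer phase encoding and fixed sampling interval are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def phase_switch_delay(actual, predicted, dt_min=1.0):
    """Total minutes where the classifier already indicates the next
    operational phase (predicted == actual + 1) but the operator has
    not yet performed the switch. dt_min is the sampling interval."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    return dt_min * np.sum(predicted == actual + 1)
```

Aggregating this delay per batch (and normalizing) yields a per-batch indicator that can be tracked alongside batch time, as in Fig. 12(a).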

Time, energy and environmental savings
From the analysis of heating temperatures, the heating profile of the reaction can be determined (see Fig. 13(a)). The heating rate of the reactor rises during the heating phase and then generally decreases. The confidence interval exhibits different ranges at different operational phases, while large sudden spikes occur at the operational phase switching points. The heating rate for each batch of reaction at the operational phase switching points is shown in Fig. 13(b). The heating rate at the switching point between the cooling and the discharge phase has the largest variance, as the control of cooling within the reaction has complicated dynamics due to heat transfer back to the thermal oil (and heat loss to the environment).
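The heating rate in Fig. 13 is the time derivative of the measured temperature profile; a sketch with `numpy.gradient` on hypothetical, uniformly sampled data (real logs would be unevenly sampled and noisier):

```python
import numpy as np

# Hypothetical temperature log: time in minutes, temperature in deg C.
t = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
T = np.array([20.0, 40.0, 60.0, 80.0, 100.0])

# Heating rate dT/dt in deg C per minute; gradient handles uneven t too.
rate = np.gradient(T, t)
```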
The energy saving potential, operational time-saving potential, and carbon emission saving potential are presented (see Fig. 14(a-c)) to quantify the process improvement achieved by the neural network analytics. The operational phase with the highest process improvements is the switch from discharge to end. This shows that it is essential to end the reaction and start a new batch (or shut down) promptly, as valuable energy and time are otherwise wasted. Apart from the endpoint control, two operational phases which can benefit from the neural network prediction are the heating-to-pressurization and pressurization-to-vacuum phases. These results are evidence that optimally switching operational phases in a reactor has comparable importance to predicting the reaction endpoint. Better tracking and prediction analytics for reactors can allow for significant savings in operational time, energy, and environmental impacts. Using the neural network analytics of this work, the expected overall batch time saving is 5.4%, the overall carbon emission reduction is 10.5%, and the overall energy consumption saving is 10.6% (see Fig. 14(d)).
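The headline percentages follow from a simple relative-saving calculation over before/after totals; the baseline and improved values below are hypothetical placeholders chosen only to reproduce a 5.4%-style figure:

```python
def saving_pct(before: float, after: float) -> float:
    """Relative saving in percent between a baseline and improved value."""
    return 100.0 * (before - after) / before

# Hypothetical batch-time totals (minutes), not the paper's raw data.
batch_time_saving = saving_pct(1000.0, 946.0)  # 5.4% under these inputs
```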
These results in Fig. 14 show that there is a strong motivation for utilizing the proposed method as a trigger for reactor operational phase switching, as it significantly elevates the improvements in batch time, energy, and carbon emissions. Furthermore, it shows the effectiveness of human-centric machine learning in leveraging process savings in reactor operations.

Further applications and extensions
It is evident that such an approach brings benefits to generic multi-step batch or semi-batch reaction processes in terms of costs, sustainability and time. There are clear strengths, weaknesses, opportunities and threats which arise from the newly proposed digital phase switch technology (see Fig. 15). The use of explainable AI techniques in the digital phase switch allows interpretable actions from the model, improving operator understanding to achieve improvements in costs, environmental impact and time. The advantage of using a human-in-the-loop machine learning approach is that operations remain trustworthy. An additional benefit is that mistakes caused by a rare but catastrophic failure of the machine learning model will not impact the production system, as the human operator acts as the fallback. However, this also implies that the operation is still prone to human errors and small delays in phase switches. Nevertheless, when more data becomes available and the model becomes more robust, the implementation of a fully autonomous phase switch by this approach can be explored. Although high prediction performance can be achieved, the main workhorse of the application is neural architecture search, which is computationally expensive and can be challenging for many researchers. Since neural architecture search is only required once during training of the model, the authors recommend balancing training time and the computational power of the hardware (also considering cloud computation) to achieve performance. Furthermore, the authors also suggest that adding inline spectroscopy within the reactor can provide additional data to improve the robustness of the model. However, this additional device comes with additional investment costs and might be explored in future works. Overall, this approach can be applied to many generic processing or manufacturing industries. Evidently, the digital phase switch technology is only applicable to multi-step or multi-stage processes, where there are distinct physical phases, and not to fully continuous processes. As the digital phase switch is still a relatively new technology, we expect that operators might still face socio-technological barriers to this approach. More research will be required to fully establish the digital phase switch as a concrete technology for cleaner production.

Conclusions
In conclusion, this work has proposed a machine learning workflow which considers elements of optimal pre-processing, neural architecture search, transfer learning, and the proposal of a digital phase switch. The approach has the advantage of high accuracy while circumventing black-box applications of neural networks by using explainable machine learning methods. With this, operators can incorporate and recognise system knowledge in the model related to the various operational phases. Also, the neural network functions as a trigger for operators in carrying out phase switching for a multi-step reaction unit, promoting human-centric and responsible machine learning. Specifically, an evolutionary neural network was deployed on optimally de-noised (selected from 129 de-noising Wavelet and Hampel filters) polymerization reactor operational data. A first neural network was built to predict the overall reaction batch time, which provides stable reaction end-time prediction from ~800 min after the start of the reaction. This saves up to 6 h of operator time during the end discharge of the reaction. Furthermore, the information within the first neural network was optimally transferred to a second neural network for classification of the reaction operational phase. From the confusion matrix of this neural network, a phase switch indicator was developed to act as a digital phase switch and was shown to be inversely proportional to the overall batch time. Further quantification of the effects of using such an approach for the polymerization reactor case study reveals that a proper operational phase switch is critical for improving batch time, carbon emissions and energy consumption. The proposed method was able to improve the batch time by 5.4%, carbon emissions by 10.5%, and energy consumption by 10.6%. The results of this work demonstrate a promising and sustainable paradigm for human-centric machine learning to be deployed as a digital phase switch in chemical and reaction facilities. This work can be extended to various other important industries such as plastics recycling, pharmaceutical production, petrochemical facilities, etc.

Fig. 1. The overall method for prediction of overall batch completion and tracking the operational phase switch of a reactor.

Fig. 2. Illustrative process flow diagram of the polymerization reaction unit.
tional phase switch ρ) is quantified by taking the prediction of the transfer-learned reaction phase classifier neural network as the ground truth. When the neural network classifies (i.e. C′(x) = P) that a change of reaction operational phase (P) is required and the operational phase is not physically changed by the operator (i.e. C(x) = P − 1), this time difference is calculated and expressed as Δt_b,p(C′(x) = C(x) − 1) for each batch b ∈ B. The expected time savings potential for the reaction batches is calculated by averaging the time difference across the reaction batches, shown in Eq. (

Fig. 5. Error metrics including MAE, RMSE, MBE, and R-squared for the optimal neural architecture at progressive depths for cross-validation using the (a) training dataset, (b) validation dataset, (c) testing dataset. (d) PCA visualization of the relation of process phases with respect to batch completion.

Fig. 6. (a) Feature SHAP values for reactor physical variables. (b) Feature importance of reactor physical variables by evaluating the mean absolute SHAP value.

Fig. 7. Accumulated Local Effects (ALE) plots for physical reaction variables in the sequence of global importance. Ticks at the bottom of the chart show the distribution of the data within the dataset. Variables are abbreviated as V: viscosity, T: temperature, P: pressure, F: flow rate (see Table 1).

Fig. 8. Knowledge-distilled decision tree classifier representing the decision of the neural network.

Fig. 9. (a) Deviation with 95% confidence interval of prediction time relative to the time after the start of reaction operation. (b) Average potential time saving of the operator relative to the time after the start of reaction operation.

Fig. 10. Transfer learning layer optimization matrix. The optimal point is circled in orange; a lower value is better.

Fig. 11. Reaction operational phase classification: (a) confusion matrix; (b) classification report matrix relative to operational phases; (c) predicted timepoints coloured according to the predicted phases in a PCA plot; (d) actual timepoints coloured according to the prediction phases in a PCA plot; (e) misclassified timepoints coloured according to the prediction phases in a PCA plot.

Fig. 12. (a) Phase switch index (normalized) and batch time throughout each reaction batch in sequence. (b) Batch time against phase switch index.

Fig. 13. (a) Heating profile as the heating rate provided to the reactor with respect to time. (b) Heating rate during operational phase switching points.

Fig. 14. (a) Energy saving potential, (b) time-saving potential and (c) carbon emission savings during reactor operational phase switching using neural network analytics. (d) Comparison of overall batch time, overall carbon emission, and overall energy consumption before and after using neural network analytics.

Table 1
Variable names of analysis (anonymized) and description.
Fig. 3. Illustrative figure of a neural network with variable architecture and depth.