Automated Classification of Bioprocess Based on Optimum Compromise Whitening and Clustering

Introduction
Biotech unit operations are often characterized by a large number of inputs (operating parameters) and outputs (performance parameters) along with complex correlations amongst them. A typical biotech process starts with the vial of the cell bank, ends with the final product, and has anywhere from 15 to 30 such unit operations in series. The aforementioned parameters can impact process performance and product quality, as well as interact amongst each other. Chemometrics presents one effective approach to gather process understanding from such complex data sets. The increasing use of chemometrics is fuelled by the gradual acceptance of quality by design and process analytical technology among the regulators and the biotech industry, which require enhanced process and product understanding.
From the standpoint of industrial operation, it is important to note that the type of metabolism used by the cultivated microorganism for the processing of substrates has a decisive impact on process performance as measured by indicators such as productivity and yield. Therefore, the design of bioprocess control strategies should focus not only on the control of the cell environment, but should ideally also aim at the control of the cell physiology itself. This issue has been addressed by the introduction of a control concept referred to as physiological state control by Konstantinov and Yoshida 1. In contrast to conventional control strategies, which operate in closed loop with respect to the cell environment, the physiological state control scheme creates a closed loop with respect to the cell state. Consequently, the environment is not a goal but a tool for manipulating cell physiology.
Among the subtasks of a general physiological state control strategy, the on-line recognition of the physiological state of the cultivated microorganism is of key importance. The classification schemes involved in this task are usually based to some extent on expert knowledge representation and are frequently implemented using various artificial intelligence techniques 1,2.

Review of current state
The chemical industry was early in recognizing and adopting chemometrics as a quick and economical method of extracting real-time information from data, thus leading to improved process monitoring and control. Visible spectroscopy, near-infrared (NIR) spectroscopy, mid-infrared (MIR) spectroscopy, nuclear magnetic resonance (NMR) spectroscopy, and Fourier transform infrared (FTIR) spectroscopy are some of the process analyzers commonly used in the chemical industry. Principal component analysis (PCA), partial least squares (PLS) regression, principal component regression (PCR), canonical variable analysis (CVA), and modified soft independent modeling of class analogy (SIMCA) are some of the statistical tools that have been used to facilitate analysis and modeling of the abundant data provided by these process analyzers.
Chemometric tools have been used in cell culture operations over the last decade. PCA has been used for the detection and diagnosis of abnormal process conditions in an industrial fed-batch cell culture process. The model successfully detected abnormal process conditions resulting from three known fault types, namely irregular thermal heating, elevated dissolved oxygen values, and large variation in agitation 3. PLS calibration models of NIR spectra have been utilized for the measurement of glucose, lactate, glutamine, and ammonia in undiluted serum-based cell culture media 4. Robust, analyte-specific models were generated, and the low standard errors of prediction for each analyte demonstrate that the models can be used to determine (off-line) the important nutrient and byproduct content in a serum-based cell culture medium. A novel PLS approach called evolving PLS has been compared with traditional PLS using data from an industrial fed-batch mammalian cell culture process for the prediction of intermediate and final quality variable values 5. The use of in situ 2D fluorometry in combination with chemometrics has been evaluated for monitoring the concentration of viable cells and the concentration of recombinant proteins in mammalian cell culture 6. PCA was used to filter the large volumes of redundant spectral data, while PLS correlated the reduced data with the target state variables. Both viable cell density and glycoprotein concentration were accurately estimated, which strongly suggests that the combination of 2D fluorometry with suitable chemometric techniques is a consistent technique for the monitoring of a cell culture medium. Modeling and monitoring of batch processes using neural networks, where principal component analysis is used for dimensionality reduction, is described in Kulkarini et al. 7.

Experimental setup
The fed-batch cultivations (Pseudomonas putida KT2442) were carried out under the following conditions: temperature 30 °C, pH = 7, stirrer speed 900 min⁻¹, air flow 9.5 L min⁻¹. Base (14 % NH₄OH) and acid (17 % H₃PO₄) solutions were added to the cultivation medium to control pH. Following the initial batch phase, octanoic acid was continually supplied with a feeding rate set by the operator. Feeding strategies varied between individual cultivation runs; generally, there was a phase of exponential feeding followed by underfeeding and starvation, respectively.
All cultivations were carried out in a 7-litre laboratory bioreactor (newMBR, Switzerland) at the Bioprocess Control Laboratory at the Department of Computing and Control Engineering of the University of Chemical Technology in Prague (UCT Prague). The bioreactor was equipped with an IMCS 2000 analogue control unit (temperature, pH, stirrer speed, antifoam level, and airflow control), a programmable logic controller (Modicon Compact PC-E984-265, Schneider Electric, France), and the proprietary Biogenes II control system (based on Factory Suite 2000 software package, Wonderware, USA). The dissolved oxygen tension was measured by an oxygen probe (Mettler Toledo); the oxygen and carbon dioxide concentrations in the off-gas were measured by SERVOMEX 1100 and 1440 analysers, respectively. For the substrate supply to the bioreactor, a DP200 peristaltic pump (New Brunswick) was used. The control variables (feeding rate and acid, base, and antifoam addition) were also recorded.

Classification of technological states
The main aim of this paper was the automated classification of biotechnological system states. However, the proposed classification methodology is more general and consists of several steps. […] of a simple but optimal robust linear smoother 10. The final result of data preprocessing is the pattern vector $x_k \in \mathbb{R}^N$, which is shifted by r steps. Therefore, the series of patterns is […], where M = m - 2r. The time delay of smoothing is $T_s r$, which must be approximately equal to the large time constant of the given technological process.

Dimensionality reduction and Data Whitening
Dimensionality reduction is based on Principal Component Analysis (PCA) 11. Firstly, we calculate the mean value vector […]. While PCA does not perform data standardization, data whitening (DWH) does, but it also increases the noise level of the higher-order components. Therefore, we suppose that compromise whitening (CWH), as a compromise between these two disadvantages, can be a more efficient tool than PCA and DWH of the same dimension D in particular cases related to cluster analysis as the final step of data processing.
[…] which is useful for determining the optimum cluster number, i.e. the number of technological states in our case. The cluster number that minimizes the Akaike information criterion (AIC) is declared to be optimal. The choice of AIC is motivated by its cross-validation properties 14. Other criteria of clustering quality are not recommended due to their poor relation to the cross-validation process.
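The relationship between PCA, DWH, and CWH described in this section can be illustrated with a minimal NumPy sketch. The exact CWH formula is not reproduced in the text, so the scaling of the i-th principal component by the eigenvalue raised to the power -a/2 is an assumption, chosen only because it reduces to plain PCA for a = 0 and to full whitening for a = 1. The function name `compromise_whitening` and the synthetic data are hypothetical.

```python
import numpy as np

def compromise_whitening(X, D, a):
    """Project rows of X onto D principal axes, scaled by eigenvalue**(-a/2).

    ASSUMPTION: this interpolation is illustrative, not the paper's formula.
    a = 0 gives PCA scores, a = 1 gives whitened (unit-variance) scores.
    """
    x0 = X.mean(axis=0)                      # mean value vector
    C = np.cov(X - x0, rowvar=False)         # sample covariance matrix
    lam, V = np.linalg.eigh(C)               # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:D]        # indices of the D largest eigenvalues
    lam, V = lam[order], V[:, order]
    W = V * lam ** (-a / 2.0)                # column i scaled by lam_i**(-a/2)
    return (X - x0) @ W, lam

rng = np.random.default_rng(0)
# Synthetic patterns with very different variances per coordinate (hypothetical data).
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.1])

Y_pca, lam = compromise_whitening(X, D=3, a=0.0)
Y_dwh, _ = compromise_whitening(X, D=3, a=1.0)
Y_cwh, _ = compromise_whitening(X, D=3, a=0.5)

var_pca = Y_pca.var(axis=0, ddof=1)   # equals the eigenvalues
var_dwh = Y_dwh.var(axis=0, ddof=1)   # equals 1 for every component
var_cwh = Y_cwh.var(axis=0, ddof=1)   # equals the square roots of the eigenvalues
```

Under this assumed scaling, an intermediate a keeps some of the variance ordering of PCA while damping the noise amplification that full whitening applies to the low-variance components.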

Optimum parameter setting
The proposed classification method of technological states has the following adjustable parameters: the pattern vector length N, the order of smoothing r, the number of whitened components D, the CWH index a, and the number of states H. However, the parameters N and r are strongly connected with the biotechnological process, its monitoring, and its time constants. Therefore, they are not subject to optimization. The remaining parameters H, D, and a are optimized with three aims:
- optimal clustering with the minimal value of AIC, as declared above,
- the maximal number of clusters H for a detailed analysis of the technological states,
- the maximal number of CWH components D to preserve the data dimensionality.
The optimization task seems to be a multi-criteria one, but its solution would be sensitive to the choice of the AIC, H, and D weights in the compromise objective function. We therefore prefer a three-stage hierarchical optimization process driven by the three optimization aims. To avoid multiplicities, we define the left minimum (left maximum) as the lowest possible value of the scalar variable that attains the minimal (maximal) value of the given function.
The optimum values of H, D, and a can be obtained for the given data set and fixed N, r as follows:
Optimum H*(D,a) is obtained by the minimization of AIC(H) as the left minimum for constant D, a.
Optimum D*(a) is obtained by the maximization of H*(D,a) as the left maximum for constant a.
Optimum a* is obtained by the maximization of D*(a) as the left maximum.
Therefore, a*, D*(a*), and H*(D*(a*),a*) are the obtained optimal values, which form the main result of our novel approach to the automated state classification of a biotechnological process from measured time series. […] Therefore, the classification error can be defined as ERR = E/T, which is a good post-optimization indicator of the parameter setting quality in the case of a single technological experiment.
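The three-stage hierarchical selection of H*, D*, and a* can be sketched in a few lines of Python. The AIC values below are made-up numbers used only to exercise the logic, and the helper names `left_argmin`, `left_argmax`, and `optimize` are hypothetical.

```python
def left_argmin(values):
    """Return the smallest key whose value equals the minimum (the left minimum)."""
    best = min(values.values())
    return min(k for k, v in values.items() if v == best)

def left_argmax(values):
    """Return the smallest key whose value equals the maximum (the left maximum)."""
    best = max(values.values())
    return min(k for k, v in values.items() if v == best)

def optimize(aic):
    """aic[a][D][H] -> AIC value; returns (a*, D*, H*) by the three-stage rule."""
    # Stage 1: H*(D, a) minimizes AIC over H for constant D and a.
    h_star = {a: {D: left_argmin(by_h) for D, by_h in by_d.items()}
              for a, by_d in aic.items()}
    # Stage 2: D*(a) maximizes H*(D, a) over D for constant a.
    d_star = {a: left_argmax(h_by_d) for a, h_by_d in h_star.items()}
    # Stage 3: a* maximizes D*(a) over a.
    a_star = left_argmax(d_star)
    return a_star, d_star[a_star], h_star[a_star][d_star[a_star]]

# Hypothetical AIC table, NOT measured data: aic[a][D][H].
aic = {
    0.0: {1: {2: 5.0, 3: 4.0, 4: 4.5}, 2: {2: 3.0, 3: 3.5, 4: 3.9}},
    0.5: {1: {2: 5.0, 3: 4.8, 4: 4.9}, 2: {2: 4.0, 3: 3.0, 4: 2.0}},
    1.0: {1: {2: 2.0, 3: 2.0, 4: 2.5}, 2: {2: 1.0, 3: 1.5, 4: 1.9}},
}
result = optimize(aic)
```

For this hypothetical table the rule selects a* = 0.5, D* = 2, and H* = 4. Ties are always resolved toward the smaller value, which is what the left minimum and left maximum are meant to guarantee.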

Final Cross-validation Strategy
The final cross-validation methodology is based on general assumptions 15. Having data from two independent technological experiments, we can easily perform traditional cross-validation, which is based on a separate parameter setting from both experiments. We define the inner classification as ICVS using the data from the given experiment and the optimum values H*, D*, a*. The outer classification is defined as ICVS using the data from the other experiment, whose vectors $x_0, t_1, \ldots, t_H$ and matrices L, W are applied to the data from the given experiment to obtain a potentially different state classification.
The final cross-validation strategy consists of:
(i) Optimization of H, D, and a in the case of the first experiment.
(ii) Evaluation of CWH and cluster analysis vectors and matrices in the case of the first experiment.
(iii) Inner classification of technological states according to (ii) in the case of the first experiment.
(iv) Evaluation of CWH and cluster analysis vectors and matrices in the case of the second experiment.
(v) Inner classification of technological states according to (iv) in the case of the second experiment.
(vi) Outer classification of technological states according to (ii) in the case of the second experiment.
(vii) […] D×D contingency table.
(viii) Evaluation of the classification error from the contingency table.
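The last steps of the strategy, cross-tabulating the inner and outer classifications and reading the error from the table, can be sketched as follows. This is a minimal sketch assuming that both classifications use the same state numbering, so that the misclassified points are the off-diagonal entries, and that E and T in ERR = E/T are the misclassified and total counts. The labels are made up and the function names are hypothetical.

```python
def contingency_table(inner, outer, n_states):
    """Count how often inner state i coincides with outer state j."""
    table = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(inner, outer):
        table[i][j] += 1
    return table

def classification_error(table):
    """ERR = E / T with E the off-diagonal count and T the total count.

    ASSUMPTION: both classifications use the same state numbering.
    """
    total = sum(sum(row) for row in table)
    agree = sum(table[k][k] for k in range(len(table)))
    return (total - agree) / total

# Made-up labels for ten time samples and three states (0, 1, 2).
inner = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
outer = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
table = contingency_table(inner, outer, n_states=3)
err = classification_error(table)
```

If the cluster labels of the two classifications were not aligned, they would first have to be matched (for example by pairing the nearest cluster centers) before the diagonal could be interpreted as agreement.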

Results
The general methodology was directly applied to the real data from two experiments with the given biotechnological process. The sampling period was 60 s. The order of smoothing r was set to 10, which corresponds to the time constant of the given biotechnological process. The first data set consists of 2096 points. The influence of the CWH parameter a on the parameter D is demonstrated in Table 1. The dimensionality D of CWH varies between 1 (poor) and 4 (rich), with the maximum at a* = 0.54, which is recommended as the optimum value for the best dimensionality D* = 4 and, consequently, H* = 5 as the optimum number of technological states. The traditional cases of PCA (a = 0) and DWH (a = 1) yield the poor dimensionality D = 1. Therefore, the novel CWH technique obtains better results than the reference methods on the given task. The time dependency of these four components in the first experiment is depicted in Fig. 1 as traditional PCA. The corresponding classification into the five technological states is demonstrated in Fig. 2. As seen in Table 1, the classification error ERR obtained via ICVS varies between 2.96 % and 19.76 %, but this criterion was not subject to minimization. The optimum CWH has an interleaved cross-validation error ERR = 19.08 % in the first experiment.
Therefore, the technological process was classified into five states. The first state describes the beginning of the experiment, i.e., the batch phase and the beginning of the fed-batch phase. The second and third states describe the production phase of the experiment, where the optimal feeding represents 90 % of the span, of which underfeeding is represented by 10 %. The beginning of the fourth state comes with the concentration of dissolved oxygen at 17 % and with an evident decrease in the dissolved oxygen difference. Finally, the fifth state is represented by the dissolved oxygen and the difference of dissolved oxygen both equal to zero. In the case of the second experiment, the data set consists of 1923 points. The optimum classification parameters a* = 0.54, D* = 4, and H* = 5 from the first experiment were used for the inner and outer classification of data points in the second experiment. While the inner classification formed new CWH components and cluster centers from the second data set, the outer classification applied the old CWH components and cluster centers from the first experiment to the data from the second one. The results of the final cross-validation are collected in Table 2 in the form of a contingency table. There are only 416 misclassified data points, meaning a cross-validation error of 21.63 % using independent data sets. This quality measure is very similar to the interleaved cross-validation error of 19.08 %. Therefore, the novel methodology of optimum CWH with cluster analysis is able to generalize the relationships from one experiment so that they are applicable to another one.
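The reported final cross-validation error can be checked against the definition ERR = E/T, assuming that E is the number of misclassified points and T is the total number of points in the second data set:

$$
\mathrm{ERR} = \frac{E}{T} = \frac{416}{1923} \approx 0.2163 = 21.63\,\%
$$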

Conclusions
The general methodology, based on compromise data whitening, cluster analysis, interleaved cross-validation, and cross-validation on the second data set, was applied to two data sets from the biotechnological process of Pseudomonas putida fed-batch cultivation on octanoic acid. The optimum compromise data whitening outperformed both PCA and traditional data whitening in the given cases. The resulting four-component system localized five technological states with an interleaved cross-validation error of 19.08 % and a final cross-validation error of 21.63 % on the second data set. The resulting states have a biotechnological interpretation. The proposed methodology of biotechnological state analysis can be used in similar cases.