A novel imputation based predictive algorithm for reducing common cause variation from small and mixed datasets with missing values

ranges of one or more process inputs. A set of values for optimal process inputs is generated from operating ranges discovered by a recently proposed quality correlation algorithm (QCA) using a Bootstrap sampling method. The odds ratio, which represents a ratio between the probability of occurrence of desired and undesired process output values, is used to quantify the effect of a confirmation trial. The limitations of the underlying PCA based linear model have been discussed and the future research areas have been identified.


Introduction
A manufacturing process produces products with consistent quality when it is capable of operating with acceptable variability around the desired process response or key product characteristic (KPC) values. The variation around a target process output is natural and is inherently present in any manufacturing process. This natural variation is referred to as background noise, which results from unavoidable or unknown causes known as common causes (Montgomery, 2009). The other kind of variation, which leads to undesired output, is defined as special or assignable cause variation. The common cause variation is an allowable variation and hence the process response values remain within the upper and lower process specification limits (USL and LSL) (Steiner & MacKay, 2004). Statistical process control (SPC) methods (George et al., 2005; Montgomery, 2009) are normally employed to discover the existence of special cause variation. Fig. 2 illustrates a schematic relationship between process input variation A and process output variation B. The input variation A is referred to as special cause variation as it produces a much wider spectrum of variation B on the output. The corresponding variations C and D are within process specifications and represent common cause variation.

With the technological enhancement of machines, advanced feedback controls and sensors, it is proposed that it may be possible to further reduce variation D by adjusting variation C in Fig. 2. A foundry process is a complex process with many sub-processes such as pattern making, mould and core making, melting and pouring. There are a number of casting processes; the investment casting process is discussed as an example in this research.
For an investment casting process, the mould (or shell) making process is further divided into sub-processes such as coating and drying. Investment casting foundries produce complex shaped, super-alloyed components such as turbine blades for the aerospace and power industries and turbocharger wheels for the automotive industry. It can take weeks to produce a turbine blade from the initial wax processing stage to the final casting. A typical continual process improvement study may have over a hundred measurable process inputs that govern the quality of the final turbine blade.
On average, precision foundries lose about 3%-5% of their revenue in rejected or reworked castings. In a foundry environment such a process is referred to as a stable and capable process and is approved by the customer during the product validation stage. Many foundries have a higher internal rejection rate. The challenge for foundry process engineers is to be able to make changes to several process parameters at once (e.g. slight adjustments to the operating ranges of alloy compositions at various stages of the melting and pouring process, pouring temperature, and moulding parameters); undertaking one change at a time is not sufficient. Even for experts, it is not easy to choose the critical process variables that can be shown to be responsible for the 3%-5% rejection rate, which is representative of common cause variation. The aforementioned rejection rate is a cumulative rate which gives a general indication about the process and is usually constant. In order to understand the variation of process defects, detailed statistics on rejection rates are needed.
Recently, many methods have been developed to derive predictions from available manufacturing process data. A data based prediction model is presented for casting surface related defects (Chen & Kaufmann, 2022), where six regression methods are used. An Extremely Randomized Trees regression model showed the best prediction performance of the six, whereas the maximum prediction error was obtained with ridge regression. The effect of data features (factors) on metal penetration in an iron casting has been studied; three factors out of 282 showed a significant impact on the output (Uyan et al., 2022). A cloud-based process variable measurement system has been developed to extract data; the work included a supervised machine learning model, the extreme boosted decision tree (XGBoost), to predict porosity defects in an aluminum low-pressure die casting process. The obtained results indicate that the model accuracy is 87 percent for predicting good parts and 74 percent for defective parts (Sika & Ignaszak, 2020). A knowledge discovery based approach has been introduced to reduce the defects of selected iron castings, using data acquisition and data mining methods to manage production parameters and to discover parameters that increase or decrease the occurrence of defects; the objective was to use the results to discover process knowledge. For surface monitoring and control applications, a novel 3D point cloud surface monitoring method has been proposed. It uses an Earth Mover's Distance (EMD) based control chart to measure the deviation of a cloud sample from the nominal sample, which helps to locate process shifts when data is collected with the laser point cloud technique during an inspection process (Zhao, Lui, Du, Di, & Shao, 2023).
There are many applications of machine learning methods to process control. In these applications, a target value for the process variables is generally known, and the control method is used to bring the process variable value as close to the target value as possible. What if the target value aimed at by the process control algorithms is sub-optimal? Can machine learning algorithms detect this situation by observing in-process data and suggest optimal target values and their tolerance limits for multiple process variables? This challenge is addressed in this paper. The major difference of the proposed algorithm is that it is designed to discover optimal target values with corresponding limits for multiple continuous and categorical variables using small observational datasets with missing values, and it can predict the combined effect of the optimal values on the process response. After discussions with foundry process engineers, it was discovered that they do try to reduce the common cause variation by further fine tuning the process manually using their experience and expertise. A quality correlation algorithm (QCA) has been developed recently (Batbooti, Ransing, & Ransing, 2017; Ransing, Batbooti, Giannetti, & Ransing, 2016) to fine tune the process inputs in order to reduce the deviation from the desired process response (output) values. The developed algorithm uses the co-linearity index (Giannetti et al., 2014; Ransing, Giannetti, Ransing, & James, 2013) as a measure to discover correlated variables. The principal component analysis (PCA) scores are projected on all variables and responses, and the scores for a correlated variable are collected based on the direction of the variable and the response. These scores relate to either optimal or avoid settings with reference to the correlated variable. The observations corresponding to the collected scores generate a new operating range, which is considered optimal or avoid depending on the factor correlation direction. The obtained range is considered an optimal range if the variable is positively correlated with low penalty values for the response vector, and an avoid range if the variable is positively correlated with high penalty values for the response vector. The new operating ranges obtained by the QCA are equivalent to range E in Fig. 3. One of the objectives of this work is to develop a data based model to predict the response corresponding to the range E discovered by the QCA for all input factors. A schematic presentation of this problem is shown in Fig. 3.
In this work a typical example of an investment casting foundry manufacturing Nickel based superalloy castings is used. The variation in the number of castings rejected due to shrinkage related defects per melt is observed and noted as the process response (i.e. castings are produced per fixed amount of molten metal, and the rejection rate and system parameters are observed for each melt or batch).

Fig. 4. The variation in rejection rate and the variation in values for an input factor %C for a Nickel based alloy for a dataset with 60 batches.

Fig. 5. Data set partition induced by an observation with missing values (Folch-fortuny, Arteaga, & Ferrer, 2015).
The variation in the rejection rate with reference to the variation of one process input, e.g. factor %C, is shown in Fig. 4 together with the lower and upper operating range limits. With reference to Fig. 2, the variation in the rejection rate corresponds to variation D and the variation in %C in the middle corresponds to variation C. The narrower upper and lower limits on the left correspond to variation E in Fig. 3.
The overall aim of this work is to develop a data based predictive model to quantify the response corresponding to the operating ranges discovered by the QCA and to estimate the QCA operating ranges in the presence of missing data. This includes the following objectives:
1. Development of a missing data algorithm to impute missing values for mixed and small manufacturing datasets.
2. Prediction of the process response for any given choice of operating limits on selected input factors of mixed data types.
This paper is structured as follows. Section 2 reviews missing data machine learning methods. Section 3 describes two classes of PCA based methods: iterative PCA algorithms and regression based PCA algorithms. This is followed by the proposed new algorithm for mixed datasets in Section 4. Section 5 discusses the use of the proposed missing data algorithm as a predictive tool to estimate the effect of operating limits on the response values of mixed datasets. The paper is concluded in Section 6.

Missing data imputation methods
The occurrence of missing data is a common problem in many industrial data sets. This may be due to many reasons, such as data collection errors, incorrect measurements, measuring instrument errors, or any other cause of lost information. Several methods have been proposed in the literature to deal with missing data. Mean imputation is a very common method based on replacing the missing entry by the attribute or variable mean. This method is very simple, but its limitation is that it underestimates the real variance of the variable (Little & Rubin, 2003). Laaksonen (2000) introduced an imputation method based on a nearest neighbour algorithm, referred to as regression-based nearest neighbour hot decking. A measure of the distance between observations is used to group the data into clusters, and the missing observation is replaced with the mean of the nearest neighbour cluster. A family of K-nearest neighbour algorithms has been developed, based on K-nearest neighbour imputation (Batista & Monard, 2003), weighted K-nearest neighbour imputation (Troyanskaya et al., 2001) and fuzzy K-means clustering imputation (Li, Deogun, Spaulding, & Shuart, 2004). Schneider (2001) adapted the expectation and maximization (EM) algorithm (Dempster et al., 1977) to analyse an incomplete climate data set. The missing values were imputed from a conditional probability model: the mean and the covariance matrix were determined (Expectation step), followed by an estimate of the mean and the covariance matrix from the observed and imputed observations (Maximization step). The Maximization step took into account the conditional estimate of the covariance matrix on the imputation error. The iteration between the two steps continued until convergence occurred. Nelwamondo, Mohamed, and Marwala (2007) compared the EM algorithm with an algorithm based on neural networks and genetic algorithms developed by Mussa and Tshilidzi (2005). The difference between the target and actual output was used as an objective function, and the genetic algorithm was used to estimate the missing values by minimizing this objective function. The input to the objective function is represented by both the missing and observed values. The combined input derived from the imputed and observed values is supplied to an auto-encoder neural network. The genetic algorithm used in this work is based on a population of string chromosomes, each of which corresponds to a point in the search space. A multi-layer perceptron (MLP) network is used to construct the auto-encoder neural network and is trained with the back-propagation algorithm. The comparison of results from four historical data sets showed that the EM algorithm performs better when there is little or no interdependency between the variables, while the auto-associative neural network and genetic algorithm combination is preferable where there is a non-linear relationship between some of the variables. However, genetic algorithms typically require large datasets.
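The variance-shrinking behaviour of mean imputation noted above is easy to demonstrate. The sketch below uses illustrative synthetic data, not the paper's: it deletes 30% of a sample at random, fills the gaps with the observed mean, and compares variances.

```python
import numpy as np

def mean_impute(x):
    """Replace NaN entries with the mean of the observed entries."""
    x = np.asarray(x, dtype=float).copy()
    mask = np.isnan(x)
    x[mask] = np.nanmean(x)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x_missing = x.copy()
x_missing[rng.random(1000) < 0.3] = np.nan  # 30% missing at random

imputed = mean_impute(x_missing)
# Every imputed value sits exactly at the mean, so the imputed sample
# variance is systematically smaller than that of the original data.
print(np.var(x), np.var(imputed))
```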
Many missing value imputation methods in the literature are based on Principal Component Analysis (PCA) for continuous data. The principal components of the complete dataset provide a new low rank subspace that maximizes the variability of the projected data. The projection aims to find two matrices, the n × k score matrix T and the p × k loading matrix P, such that the following reconstruction error is minimized (Diamantaras & Kung, 1996):

‖X − M − TPᵀ‖²   (1)

where X is the n × p data matrix whose rows represent the observations and whose columns correspond to the variables, and M is the n × p mean matrix that repeats the column means of X in each row.
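The minimization of this reconstruction error is solved by a truncated SVD of the centred data. A minimal numpy sketch on synthetic data (the symbols T, P and M follow the text; the data and rank are illustrative):

```python
import numpy as np

def pca_reconstruct(X, k):
    """Best rank-k reconstruction of X in the least squares sense:
    centre the data, take a truncated SVD, and form the n x k score
    matrix T and the p x k loading matrix P."""
    M = X.mean(axis=0)                       # column means (the mean matrix)
    U, s, Vt = np.linalg.svd(X - M, full_matrices=False)
    T = U[:, :k] * s[:k]                     # scores
    P = Vt[:k].T                             # loadings
    return M + T @ P.T, T, P

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
Xhat2, T, P = pca_reconstruct(X, 2)          # rank-2 approximation
Xhat6, _, _ = pca_reconstruct(X, 6)          # all components: exact
```

With all six components the reconstruction reproduces X exactly; with fewer components it is the closest rank-k approximation.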
Christoffersson (1970) presented a procedure for missing data based on optimizing a least squares problem for a one component PCA. The loading matrix P was held constant while the score matrix T was optimized; the score matrix was then fixed to optimize the loading matrix. The optimization target was to minimize the cost function given in Eq. (1) over the observed data. This resulted in update rules for the principal components (the loading matrix P) and the mapping (score) matrix T. This procedure was extended by Grung and Manne (1998) to include more than one principal component for missing data problems. The obtained results were further improved by Ilin and Raiko (2010) by updating the bias term in the update rules.
It should be noted that the alternating algorithm is not efficient for a large number of principal components (Roweis, 1998) and has been shown to have good convergence properties only for a limited number of principal components (Ilin & Raiko, 2010). The computational cost of the alternating algorithm can be improved by using a gradient descent algorithm or Newton's method for the optimization (Ilin & Raiko, 2010; Raiko, Ilin, & Karhunen, 2008).
A PCA imputation method that is widely used to impute missing values is referred to as the iterative PCA algorithm. It is based on the minimization of the cost function and introduces a weighted matrix whose elements are zero if the corresponding value in the original dataset is missing and one otherwise. The missing values are initialized with the mean, or any other value; PCA is then performed on the completed data, and the missing values are reconstructed from the PCA projection space in an iterative procedure (Husson & Josse, 2013; Josse & Husson, 2012a). The iterative PCA is equivalent to an expectation maximization algorithm associated with the PCA model (Ilin & Raiko, 2010; Josse & Husson, 2012a) and has been called the EM-PCA algorithm (Josse & Husson, 2012a). Ilin and Raiko (2010) showed that the reconstruction step of the imputation algorithm corresponds to the E-step of the EM algorithm and that the M-step is equivalent to performing PCA on the completed dataset. It was also shown that minimizing the cost function with respect to the noise variance, as assumed in the probabilistic PCA (PPCA) model, has no effect on the imputation algorithm steps. PPCA provides a Bayesian treatment of PCA that can be combined with the EM algorithm to estimate the PCA model parameters iteratively. A factorial variational approximation of PPCA, called VBPCA, has been introduced to deal with high-dimensional sparse datasets with a high percentage of missing values. VBPCA showed better performance compared to the standard EM-PCA and iterative algorithms, but it is computationally expensive. On the other hand, the iterative algorithm has the ability to adapt to different missing data methods, such as regression based methods.
Regression based methods substitute the missing values by regressing the unknown data on the observed data. Such methods have been developed, compared and studied in the presence of missing data in multivariate problems (Arteaga & Ferrer, 2002, 2005); the study included the standard imputation algorithm and other algorithms. Recently, Folch-fortuny et al. (2015) compared PCA regression based methods, namely the trimmed score regression (TSR) method and the known data regression (KDR) method, with other iterative algorithms (IA). The study built PCA models based on iterative algorithms and applied the developed algorithms to real case studies from the literature. The regression based methods (TSR and KDR) were faster and performed better than other methods such as the standard iterative imputation algorithm.
Another point to note is that the PCA iterative algorithm has a recently developed mixed data version, which can be used to impute missing values using methods based on factorial analysis for mixed data (FAMD) (Audigier, Husson, & Josse, 2016). FAMD is a principal components method to describe and visualize a multidimensional mixed data matrix by studying the similarities between observations, the relationships between mixed variables, and the contribution of each variable. Similar to the PCA imputation methods, an iterative FAMD procedure developed by Josse and Husson (Audigier et al., 2016) imputes missing values for mixed datasets. The FAMD algorithm is similar to the iterative algorithms for continuous data; for categorical variables it has a scale step that converts categorical variables to continuous variables, which gives the algorithm the ability to impute mixed data. The method has been compared to a random forest based method (Stekhoven & Bühlmann, 2012) and showed an enhanced ability to impute mixed missing observations.
In general, all PCA iterative methods consist of an initiate step, followed by the scale step in the case of FAMD, a PCA step and a reconstruct step. The regression based methods perform better and faster than the standard iterative algorithms (IA); however, these methods have not yet been shown to impute mixed data. FAMD is the PCA iterative algorithm for mixed data, and the iterative approach has shown inferior performance against PCA regression based methods on continuous data. In order to introduce a new mixed data imputation algorithm with better performance, a new procedure based on FAMD and the regression based methods (TSR and KDR) is needed to impute missing data in mixed matrices. Such a procedure has been developed in this work, without considering the effect of outliers on the imputation methods.

PCA iterative methods
In the PCA imputation algorithm, the least squares criterion of Eq. (1) is minimized after introducing a weighted matrix W, whose elements are zero where the original dataset value is missing and one otherwise, which results in the following cost function:

‖W ∘ (X − M − TPᵀ)‖²   (2)

where ∘ denotes the element-wise product.
The iterative PCA algorithm consists of the following steps (Husson & Josse, 2013; Josse & Husson, 2012a): impute initial values for the missing cells to complete the data, then update the missing values from the PCA reconstruction of the resulting full data matrix. The iteration between the update and reconstruction steps continues until the desired convergence is achieved.
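The loop above can be sketched in a few lines of numpy. This is a minimal sketch: the rank k, iteration cap and tolerance are illustrative choices, not the paper's settings.

```python
import numpy as np

def iterative_pca_impute(X, k, n_iter=200, tol=1e-8):
    """EM-style iterative PCA imputation: initialize missing cells with the
    column mean, then alternate a rank-k PCA reconstruction with an update
    of the missing cells only (observed cells are never changed)."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    X[miss] = np.take(col_mean, np.where(miss)[1])   # initial imputation
    for _ in range(n_iter):
        M = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - M, full_matrices=False)
        Xhat = M + (U[:, :k] * s[:k]) @ Vt[:k]       # rank-k reconstruction
        delta = np.linalg.norm(X[miss] - Xhat[miss])
        X[miss] = Xhat[miss]                         # update missing cells only
        if delta < tol:
            break
    return X

# Exactly low-rank synthetic data recovers well despite missing entries.
rng = np.random.default_rng(2)
X_true = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 5))
X_obs = X_true.copy()
X_obs[rng.random(X_true.shape) < 0.1] = np.nan       # 10% missing
X_imp = iterative_pca_impute(X_obs, k=2)
```

On this rank-2 example the iterative update beats plain mean imputation by a wide margin, which is the behaviour the comparison studies cited above report.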
For mixed data, Audigier et al. (2016) adapted the alternating algorithm by converting each categorical variable into dummy variables, taking a unit value if the corresponding category occurs and zero otherwise. Each continuous variable is then standardized by dividing by its standard deviation, and each dummy variable is divided by the square root of the proportion of its category. This iterative FAMD can be implemented as follows (Audigier et al., 2016):
1. Step 0: impute an initial value for each missing cell (the mean for quantitative variables and the proportion of the category for each category). Calculate the scale matrix D_0 and the mean matrix M_0.
2. For step i:
(a) Apply the SVD to the global matrix (X_{i-1} − M_{i-1})(D_{i-1})^(−1/2) to obtain the score matrix U_i (left singular vectors) and the loading matrix V_i (right singular vectors) as well as the singular values (Λ_i)^(1/2).
(b) Reconstruct X̂_i from the fitted model; the imputed data set takes the reconstructed values at the missing cells and the observed values elsewhere.
(c) From the resulting complete data set, update X_i, D_i and M_i.
3. Repeat steps 2a, 2b and 2c until convergence occurs between the imputed and original observed values.
Where: D_i is a diagonal matrix containing the square of the standard deviation of each continuous variable and the proportion of the category for each category of a categorical variable; M (n × p) is the mean matrix that repeats the column means of the data matrix X in each row; p_1 is the number of quantitative variables; p is the total number of columns of X, with p = p_1 + Σ_{j=1}^{p_2} q_j, where q_j is the number of categories of variable j and p_2 is the number of categorical variables; n is the number of observations, i.e. the number of rows of the data matrix X; and X̂ is the reconstructed data matrix.
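The scale step above can be sketched as follows. This is an illustrative numpy version of the D^(−1/2) weighting (dummy coding, continuous columns divided by their standard deviation, dummy columns divided by the square root of the category proportion), not the authors' implementation.

```python
import numpy as np

def famd_scale(X_cont, X_cat):
    """FAMD-style scaling sketch: continuous columns divided by their
    standard deviation; each categorical column dummy-coded, with each
    dummy divided by the square root of its category proportion. The
    result is centred and ready for the SVD step."""
    blocks = [X_cont / X_cont.std(axis=0)]
    for col in X_cat.T:                        # one categorical variable at a time
        cats = np.unique(col)
        dummies = (col[:, None] == cats[None, :]).astype(float)
        props = dummies.mean(axis=0)           # category proportions
        blocks.append(dummies / np.sqrt(props))
    Z = np.hstack(blocks)
    return Z - Z.mean(axis=0)                  # centred, scaled matrix

# One continuous variable and one two-category variable over four rows.
X_cont = np.array([[1.0], [2.0], [3.0], [4.0]])
X_cat = np.array([["a"], ["a"], ["b"], ["b"]])
Z = famd_scale(X_cont, X_cat)
```

The resulting matrix has one column per continuous variable plus one per category, with every column centred, which is the form the SVD in step 2a operates on.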

PCA regression based methods
The PCA regression based methods for missing data partition the variables of an observation with missing values into two parts: the missing part and the observed part. Suppose the observation z_i has some missing values; without loss of generality, these are taken as the first r elements of the row vector. This partitions the vector z_i into z_i = [z_i^# z_i^*]. As a result, the data matrix becomes Z = [Z^# Z^*], and the loading matrix P can be partitioned accordingly.
Where: z_i^# denotes the missing elements; z_i^* the observed elements; Z^# is the submatrix containing the first r columns of Z (corresponding to the missing variables in z_i); Z^* contains the remaining columns, corresponding to the observed values in z_i; P^# is the submatrix with the first r rows of P; and P^* contains the remaining (p − r) rows.
For both methods, observed values are not changed.
The imputed data matrix takes the reconstructed values at the missing cells and the observed values elsewhere; the mean matrix is then updated from the imputed data set (step 2-b).
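The idea shared by the regression based reconstructions can be sketched as below: estimate the scores of an incomplete row from its observed part and the corresponding loading rows, then regress the missing part from those scores. This is a least-squares projection variant for illustration only; the exact TSR and KDR regression matrices differ (see Folch-fortuny et al., 2015).

```python
import numpy as np

def impute_row_from_loadings(x, P):
    """Given a row x with NaNs and a p x k loading matrix P fitted on
    complete (centred) data, estimate the scores from the observed part
    by least squares and reconstruct only the missing part:
        t_hat = argmin_t || x* - P* t ||^2 ,   x#_hat = P# t_hat.
    A projection-based sketch of the TSR/KDR idea, not the published
    regression matrices."""
    x = np.asarray(x, dtype=float).copy()
    miss = np.isnan(x)
    P_star, P_hash = P[~miss], P[miss]       # observed / missing loading rows
    t_hat, *_ = np.linalg.lstsq(P_star, x[~miss], rcond=None)
    x[miss] = P_hash @ t_hat                 # observed values are not changed
    return x

# Recovery is exact when the row truly lies in the loading subspace.
rng = np.random.default_rng(3)
P, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal 5 x 2 loadings
t_true = np.array([1.5, -0.5])
x_full = P @ t_true
x_miss = x_full.copy()
x_miss[0] = np.nan
x_rec = impute_row_from_loadings(x_miss, P)
```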
Fig. 6 shows the comparison of the regression based algorithms (TSR and KDR) with the iterative and alternating algorithms for four standard deviation values of σ (0.25, 0.5, 0.75 and 1). The regression based methods showed better performance than the other iterative algorithms, such as the PCA iterative algorithm and the alternating algorithm. Missing values were generated randomly at levels from 5% to 30%. The methodology used for generating the data in this comparison is explained in Section 4.1.1.

A PCA regression based imputation algorithm for mixed data
All the PCA based missing data algorithms discussed above, the iterative FAMD and the regression based iterative algorithms TSR and KDR, have two main steps: a PCA step and an update-from-reconstruction step. The reconstruction step is implemented in a different manner in each case. The FAMD algorithm has a scale step, which gives it the ability to impute mixed data (quantitative and categorical), whereas the regression based methods impute missing quantitative values without the scale step that FAMD uses to adapt categorical values for imputation.
In the present work, the regression based TSR and KDR methods are adapted to impute mixed data by adding a scale step similar to the one used in FAMD. In other words, the reconstruction step of the iterative FAMD algorithm (step 2-b, Eq. (3)) is changed. In TSR and KDR, the reconstruction step updates the missing values according to Eqs. (7) and (10), where the loading matrix is used for TSR and is replaced by the identity matrix for KDR. In order to use TSR and KDR for mixed data, the reconstruction step (step 2-b in the FAMD algorithm) is changed to coincide with the TSR and KDR requirements, so the missing part of the mixed data is updated from the scaled matrix Z = (X − M)(D)^(−1/2), where Z^# and Z^* are related to Z in the same way that X^# and X^* are related to X, as shown in Fig. 5, and M is the mean matrix of X as defined in Eq. (3).
The steps for the proposed algorithm are described below:
1. Step 0: impute an initial value for each missing cell (the mean for quantitative variables and the proportion of the category for each category). Calculate the matrices D_0 and M_0 (the mean of X_0), and calculate Z_0 = (X_0 − M_0)(D_0)^(−1/2).
2. For step i:
(a) Compute S_{i-1}, the covariance matrix of (X_{i-1} − M_{i-1}), and apply the SVD to Z_{i-1} to find the loading matrix, i.e. the right singular vectors V_{i-1}.
(b) For each row with missing values, estimate the missing part from the corresponding regression equation: for the TSR method use Eq. (7), and for the KDR method use Eq. (10). For both methods, observed values are not changed. The reconstructed Ẑ matrix takes the regressed values at the missing cells and the observed values elsewhere, and the imputed data matrix is recovered by inverting the scaling.
(c) From the resulting complete data set, update X_i, D_i and M_i.
For the convergence test, two criteria are used, one for continuous variables and one for categorical variables, with thresholds ε_1 = 10^−6 and ε_2 = 10^−10, for example. The algorithm, with all details depicted in the table below, is referred to as Algorithm 1.
Algorithm 1: A PCA regression based imputation algorithm.

Missing data simulations
Two simulations are conducted to compare the proposed algorithms with the FAMD algorithm. The strategy is to generate missing data from a complete dataset at the following incremental levels in the first simulation (5%, 10%, 15%, 20%, 25%, 30%), extended to 40% with a 10% incremental step in the second simulation. For each level, the missing data is generated randomly. The performance of the present work is assessed by calculating the normalized root mean squared error (NRMSE) for continuous variables and the proportion of falsely classified (PFC) entries for categorical variables. NRMSE values account for the variance of each variable: the imputed values coincide with the original values when the NRMSE equals zero, and they coincide with the initial values (the mean, if the mean is used for initialization) when the NRMSE equals one.
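The two error measures can be written directly from their definitions; the sketch below uses toy values purely for illustration.

```python
import numpy as np

def nrmse(x_true, x_imp, miss):
    """Normalised RMSE over the imputed cells of one continuous variable:
    0 means perfect imputation; about 1 means no better than the mean."""
    err = np.mean((x_imp[miss] - x_true[miss]) ** 2)
    return np.sqrt(err / np.var(x_true))

def pfc(c_true, c_imp, miss):
    """Proportion of falsely classified categories among imputed cells."""
    return np.mean(c_imp[miss] != c_true[miss])

x_true = np.array([1.0, 2.0, 3.0, 4.0])
x_imp = np.array([1.0, 2.5, 3.0, 4.0])
miss = np.array([False, True, False, False])
c_true = np.array(["a", "b", "a", "b"])
c_imp = np.array(["a", "a", "a", "b"])
print(nrmse(x_true, x_imp, miss), pfc(c_true, c_imp, miss))
```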
where the normalizing count in the PFC formula is the number of missing categorical values.

Model based simulation
In this section, several data sets are generated according to the model based procedure proposed by Josse and Husson (2012b), in which the matrices A and B are generated from a standard normal distribution with zero mean and identity covariance. Each column of the product matrix ABᵀ is divided by its standard deviation. Noise is added by drawing ε from a normal distribution with zero mean and variance σ². The values of the mean matrix are assumed to be zero. The signal to noise ratio is defined as 1/σ. In the current work, four data sets are generated to compare the performance of the proposed algorithm with the FAMD algorithm. The generated data sets consist of two quantitative variables and two categorical variables with four categories per variable. The categories are generated from continuous data by dividing each variable into four segments. The number of observations is 100 and two principal components are selected to reconstruct the data. Four values of σ were tested (0.25, 0.5, 0.75, 1) and the results are displayed in Fig. 7. The obtained results show a very good performance of the mixed data KDR method for all categorical variables, with small PFC errors, and a good imputation ability for the quantitative variables as well. For all σ values, the mixed data TSR method and FAMD showed slightly different NRMSE and PFC errors, with TSR for mixed data giving better performance, i.e. smaller NRMSE and PFC values than the FAMD method.
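The generation procedure can be sketched as follows, assuming a low-rank product of standard normal matrices plus N(0, σ²) noise and quantile-based discretization of two columns into categories. The function name and default values are illustrative, not the paper's exact settings.

```python
import numpy as np

def simulate_mixed(n=100, p=4, k=2, sigma=0.5, n_cats=4, seed=0):
    """Model-based simulation sketch: low-rank signal A @ B.T with columns
    scaled to unit standard deviation, plus N(0, sigma^2) noise; the last
    two columns are then cut into n_cats equal-frequency categories,
    mimicking the mixed data sets used in the comparison. The signal to
    noise ratio is 1/sigma."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
    signal /= signal.std(axis=0)                     # unit-variance columns
    X = signal + sigma * rng.normal(size=(n, p))
    cats = np.empty((n, 2), dtype=int)
    for j in range(2):                               # discretize two columns
        col = X[:, p - 2 + j]
        edges = np.quantile(col, np.linspace(0, 1, n_cats + 1))
        cats[:, j] = np.clip(np.searchsorted(edges, col) - 1, 0, n_cats - 1)
    return X[:, : p - 2], cats

cont, cats = simulate_mixed()
```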
In order to check the ability of the algorithm to impute non-linear data, two non-linear variables were generated by applying non-linear functions to the two variables generated from the model in Eq. (23) with σ = 1. First, the variables were treated as continuous. The imputation of the missing data is depicted in Fig. 8, which shows a very high error value close to 1 for 10% or more missing data. To improve the performance of the imputation, each variable was divided into categories based on the probability plot in Fig. 9, which is usually used to check the linearity of the distribution of the data. As can be seen from the plot, each variable can be divided into three categories. As a result, the imputation of the missing values was performed for two categorical variables with three categories each. The results of the imputation are shown in Fig. 10 and give a better prediction than the continuous variable assumption.
Finally, another test was conducted to check the effect of imputing non-linear variables together with linear ones. The same two non-linear simulated variables were merged with two linear variables to constitute a four-variable data set. Fig. 11 shows the PFC error of the two non-linear variables in this last simulation, which indicates a smaller imputation error compared to the imputation in Fig. 10 where only non-linear variables were used.

Manufacturing data set
The data set consists of 37 factors affecting the defects of a casting process. Of these, 21 factors are categorical, such as the percentage of Mn with categories of less than 0.001, between 0.001 and 0.002, and 0.002 and above. The other factors are continuous, such as the percentage of Co.
The data set contains 20733 observations. The results of the comparison of the proposed algorithm with the FAMD algorithm are shown in Figs. 12 and 13.

A novel imputation based predictive algorithm for mixed data
The main aim of prediction is to estimate the process response for a new batch for given values of the input factors. In other words, the aim is to determine the response value corresponding to the factor vector of observation j (j = 1, ..., n). In the PCA context, this is similar to projecting a new observation with a missing value (the missing response) onto the lower-dimensional subspace predefined by the PCA loading matrix, i.e. the problem of new observations with missing data. The scores of such an observation can be obtained from the original loadings using an iterative procedure introduced by Arteaga and Ferrer (2002) for continuous data, which alternates between estimating the scores of the observation from Eq. (24) and reconstructing the missing values from Eq. (25); the first step includes initializing a value for each missing value. These two steps are repeated until convergence occurs.

Fig. 16. Odds ratio for original ranges, QCA limits and QCA limits with uncertainty estimation.

Fig. 17. Odds ratio for interaction of three factors with 0.2 penalty threshold for optimal response values, where the data is generated from QCA optimal limits.
x# =  # t (25) Arteaga and Ferrer (2002) showed that the obtained scores at convergence can be expressed as: For the case of mixed dataset, the same concept is used with the following modification: 1-Add the scale step before perform PCA on the original data matrix.
2-Reconstruction step will be in terms of  instead of  : 3-Estimate the missing value from the below equation: Where  # is the mean vector of elements in  # .The score of new observation is obtained from Eq. ( 26) instead of the iterative procedure.Full steps of this algorithm are depicted in the table below.
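A minimal sketch of the projection step for a new observation with missing entries, following the closed-form score estimate of Arteaga and Ferrer (2002); the loading matrix and mask below are toy values, and `predict_missing` is a hypothetical helper name.

```python
import numpy as np

def predict_missing(x_new, P, known):
    """Project a new observation with missing entries onto a PCA model.

    The scores are regressed from the known elements only,
        t = (P*'P*)^-1 P*' x*,
    and the missing elements are then reconstructed as x# = P# t.
    `P` is the (variables x components) loading matrix and `known`
    is a boolean mask marking the observed elements of `x_new`.
    """
    P_known = P[known, :]       # P*: loading rows of observed variables
    P_miss = P[~known, :]       # P#: loading rows of missing variables
    x_known = x_new[known]
    t, *_ = np.linalg.lstsq(P_known, x_known, rcond=None)  # least-squares scores
    return P_miss @ t           # reconstructed missing values

# one-component toy model: both variables proportional to a latent score
P = np.array([[0.6], [0.8]])        # hypothetical loadings
x_new = np.array([1.2, np.nan])     # the response (second entry) is missing
known = np.array([True, False])
print(predict_missing(x_new, P, known))  # -> [1.6]
```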
The above procedure can be used within the quality correlation algorithm (QCA) to check the behaviour of the discovered operating limit ranges by estimating the corresponding response values. After estimating the responses over the operating limits, this tool can be used to compare the performance of the process before and after applying the operating limits. In other words, it allows one to estimate the probability of occurrence of the desired (optimal) and undesired (avoid) response values for a confirmation trial plan and for the original plan.
The comparison of two proportions of occurrence, such as success and failure, can be conducted by calculating the odds of the two proportions; another applicable method is the likelihood ratio test (see Giannetti & Ransing, 2016). The odds of success is defined as the probability of success divided by the probability of failure (Agresti, 2002; Liberman, 2005):

theta = p_s / p_f = p_s / (1 - p_s)

where theta is the odds of success, p_s is the probability of success, and p_f = 1 - p_s is the probability of failure.
In terms of manufacturing defects, success represents the occurrence of desired response values, such as a lower percentage of defects in batches (optimal response values), while a higher percentage of defects, i.e. the occurrence of undesired response values (avoid response values), represents failure. As a result, theta becomes the odds ratio representing the odds of an optimal response, and the probabilities of success and failure are replaced by the probability of optimal response values (p_opt) and the probability of avoid response values (p_avoid) respectively. The odds equation above is then rewritten as:

theta = p_opt / p_avoid

A relative odds ratio can also be defined between any two odds ratios, such as the odds ratio of the confirmation trial plan and that of the original plan:

theta_rel = theta_trial / theta_original

In general, the proposed missing data algorithm is applied first to complete the data matrix, followed by the QCA to estimate the optimal operating limits. The next step creates a set of new examples from the obtained operating limit ranges to study the influence of the factors. Each new example is generated using Bootstrap sampling with replacement from the optimal range of each factor. Finally, the odds ratio of the optimal range is compared with the odds ratio of the original range.
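The odds ratio and relative odds ratio calculations described above can be sketched as follows; the penalty values are illustrative, the 0.2 threshold matches the one used in the text, and the helper name is an assumption.

```python
import numpy as np

def odds_ratio(penalties, threshold=0.2):
    """Odds of an optimal response: P(optimal) / P(avoid).

    A batch whose penalty value is at or below `threshold` is counted
    as optimal; a batch above it is counted as a response to avoid.
    """
    optimal = np.sum(penalties <= threshold)
    avoid = np.sum(penalties > threshold)
    return optimal / avoid

# hypothetical penalty values for the original plan and a trial plan
original = np.array([0.1, 0.3, 0.15, 0.5, 0.05, 0.4])
trial = np.array([0.1, 0.18, 0.15, 0.3, 0.05, 0.12])

theta_orig = odds_ratio(original)    # 3 optimal / 3 avoid = 1.0
theta_trial = odds_ratio(trial)      # 5 optimal / 1 avoid = 5.0
relative = theta_trial / theta_orig  # relative odds ratio between the plans
print(theta_orig, theta_trial, relative)  # -> 1.0 5.0 5.0
```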

Discussion of results
A Nickel-based alloy data set used by Ransing et al. (2016) and Batbooti et al. (2017) to estimate the optimal limits is discussed here. In the current simulation, the QCA algorithm with six principal components is used. This resulted in nine correlated factors for which optimal operating limits were identified. The Bootstrap method is used to generate 1000 examples from the combination of the optimal operating limits of the factors, and these are compared with the original range by estimating the odds ratios for the original nine factors and the odds ratio based on the operating limits. The predicted odds ratio depends on the number of principal components chosen for the data set defined by the optimal factors (e.g. the nine correlated factors in this case). The actual odds ratio, based on the original data set, is 2.125 (Fig. 14). 100 bootstrapped examples were created to test the dependence of the predictive ability of the algorithm on the number of principal components chosen, and the real odds ratio of the data was compared with the predicted ones. Each bootstrap example gave a slightly different odds ratio value for the same chosen number of principal components. The most frequently occurring value is chosen and compared with the actual odds ratio value in Fig. 14. The number of principal components for which the most frequently occurring odds ratio value is closest to the actual one is chosen for the analysis. The response histograms and odds ratio values for the original range and the bootstrapped operating limits are shown in Figs. 15 and 16 respectively. The response histogram on the left of Fig. 15 is similar to the response histogram in Fig. 4, but the current one is based on the penalty value method used by the QCA and on Bootstrap sampling, instead of the original rejection rate and original range used in Fig. 4.
This approach can be extended to study the effect of interaction between factors, for example by bootstrapping three factors over their suggested optimal ranges while the values for the remaining factors are bootstrapped from the original range. This procedure is repeated for all factors. The calculated odds ratio values are displayed in Figs. 17 and 18 for the QCA limits and the QCA limits with uncertainty respectively. A threshold of 0.2 on the penalty values is used to classify an optimal process response. A high odds ratio value resulting from a combination of factors indicates the existence of an interaction among those factors. In these two figures, the odds ratio value shown in a cell corresponds to the factor names in the associated row and column of the table for the given factor. For example, in the table for factor %C (top left table in Fig. 17), the odds ratio value of 1.32 represents the effect of the interaction among %C, %Ti and %Co: for the bootstrapping, the values of these three factors are drawn from their optimal limits, whereas the values for the remaining factors are drawn from their original ranges. It can also be seen that if the values of factor %C alone are taken from its optimal range, the resulting odds ratio is 0.75, whereas the %Co factor shows a much higher odds ratio value of 1.35. Both %C and %Co have co-linearity indices of similar strength as determined by the QCA; however, the corresponding odds ratio values differ significantly. This is probably because of the linearity assumption of the prediction model: the predictive algorithm has underestimated the odds ratio values for %C. The histogram for %C is skewed to the left, whereas the distribution of %Co is close to Gaussian, as shown in Fig. 19. The lower and upper optimal limits are shown as red lines. The skewness in the data is also visible in the box plot in the same figure and in the penalty matrices in Fig. 20. The first quartile of %C in the penalty matrix has 13 observations with high penalty values but only three observations with lower penalty values; this range needs to be avoided. However, the performance of the process when %C is in quartile ranges 2, 3 and 4 remains similar. The skewness, or non-linearity, is defined by this step change in process performance between quartile 1 and quartiles 2, 3 and 4. For %Co, in contrast, quartiles 1 and 2 are optimal (correlated with lower penalty values), quartile 3 values are associated with higher penalty values, and quartile 4 shows the worst performance, demonstrating a strong correlation with higher penalty values. The association varies linearly from low to high penalty values as %Co goes from its minimum to its maximum value. It may also be possible that the low odds ratio of the %C factor arises from the contribution of other factors.
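The generation of confirmation-trial examples by bootstrapping from per-factor optimal limits can be sketched as follows; the data, the limits, and the `bootstrap_examples` helper name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_examples(data, optimal_limits, n_examples=1000):
    """Generate confirmation-trial examples by sampling with replacement.

    For each factor, values are drawn (with replacement) from the
    observed values that fall inside that factor's optimal operating
    limits; factors without limits are sampled from the full range.
    `optimal_limits` maps a column index to a (low, high) pair.
    """
    n, p = data.shape
    out = np.empty((n_examples, p))
    for j in range(p):
        col = data[:, j]
        if j in optimal_limits:
            lo, hi = optimal_limits[j]
            col = col[(col >= lo) & (col <= hi)]  # restrict to the optimal range
        out[:, j] = rng.choice(col, size=n_examples, replace=True)
    return out

data = rng.normal(size=(50, 3))  # toy stand-in for the process data
examples = bootstrap_examples(data, {0: (-0.5, 0.5)}, n_examples=200)
print(examples.shape)  # -> (200, 3)
```

The odds ratio of each bootstrapped plan can then be estimated from the predicted responses of these examples and compared with that of the original range.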

Conclusions
A single imputation procedure to predict the process response by selecting input factor values from any given range has been described. The procedure is designed to work on mixed data sets comprising quantitative and categorical variables with missing values, including data sets where the number of observations is smaller than, or similar to, the number of input factors. It uses a dimensionality reduction method based on FAMD and investigates relationships between pairs of variables with an improved PCA regression based method. The proposed algorithm was used to impute real and model-generated data; the generated data included linear and non-linear simulations. The imputation of non-linear data was improved by dividing the variable range into categories and converting the quantitative (continuous) variables into categorical variables. It is also shown that imputing non-linear variables together with linear ones improves the performance of the algorithm. The obtained results showed good performance: the error of the proposed PCA regression based methods for mixed data (KDR and TSR for mixed data) was lower than the error of the FAMD based PCA imputation. The imputation of missing values in new observations based on the FAMD method was conducted and used to estimate the response of in-process data with known factors. The prediction methodology is based on bootstrapping from the original data to predict the behaviour of the process once the operating limits are discovered by the QCA (or any other equivalent method). The odds ratio values are used to quantify the ratio of the desired to the undesired response values and to compare the behaviour of the process over the original range and the optimal range. The odds ratio values for a real Nickel-based alloy data set were estimated by bootstrapping from the original and optimal ranges respectively. The limitations of the linearity assumptions in potentially underestimating the odds ratio values are discussed.

Fig. 1 .
Fig. 1. Visualization of variation in factor values with reference to control limits in a manufacturing process.

Fig. 2 .
Fig. 2. The influence of input variation on the process output with reference to special (A&B) and common (C&D) cause variations.

Fig. 7 .
Fig. 7. Model-based simulation errors: the plots on the left show the quantitative variable error (NRMSE) and the plots on the right show the PFC error for different parameter values.

Fig. 8 .
Fig. 8. Quantitative error for a non-linear variable imputed as a continuous variable.

Fig. 9 .
Fig. 9. Normal probability plots for four variables; the two at the top are linear and the two at the bottom are non-linear.

Fig. 10 .
Fig. 10. Categorical error for non-linear variables imputed by dividing each variable into three categories.

Fig. 11 .
Fig. 11. PFC error for two categorical variables imputed with two quantitative linear variables.

Fig. 18 .
Fig. 18. Odds ratio for interaction of three factors with a 0.2 threshold for optimal response values, where the data is generated from the QCA with uncertainty optimal limits.

Fig. 19 .
Fig. 19. The histogram and box plot of factors %C and %Co.