Interface State Density Prediction between an Insulator and a Semiconductor by Gaussian Process Regression Models for a Modified Process

During data-driven process condition optimization on a laboratory scale, only a small-size data set is accessible and should be utilized effectively. At the same time, during process development, new operations are frequently inserted or current operations are modified, so the accessible data sets are related but not exactly of the same type. In this study, we focus on predicting the quality of the interface between an insulator and GaN as a semiconductor, for potential application in GaN power semiconductor devices. The quality of the interface was represented as the interface state density, Dit, and the operation inserted into the process was the ultraviolet (UV)/O3-gas treatment. Our retrospective evaluation of model-building approaches for Dit prediction from a process condition revealed that, for the UV/O3-treated interfaces, data of interfaces without the treatment contributed to performance improvement. Such performance improvement was not observed when using a data set with Si as the semiconductor. As a modeling method, automatic relevance determination (ARD)-based Gaussian process regression with a prior distribution on the length-scale parameters exhibited a relatively high predictive performance and represented a reasonable uncertainty of prediction, reflecting the distance to the training data set. This feature is a prerequisite for a potential application of Bayesian optimization. Furthermore, the hyperparameters in the prior distribution of the length-scales could be optimized by leave-one-out cross-validation.


■ INTRODUCTION
Predictive machine learning (ML) models are frequently used in data-driven material discovery. Their usefulness has been proven in several material development projects, such as polymers, perovskites, and concrete mixtures. 1−3 Although "big data" generated by high-throughput, combinatorial material synthesis experiments can sometimes be obtained, 4 at the laboratory scale usually only a limited number of samples, i.e., a small data set, is accessible due to experimental cost and/or the difficulty of automation.
When utilizing a small data set for material discovery or process condition optimization, Bayesian optimization (BO) techniques can be used in combination with ML models to minimize the number of evaluations needed to find the optimal materials or process conditions. 5,6 In BO, a set of candidate process conditions is proposed based on an acquisition function that evaluates the process conditions quantitatively. For successful BO, a surrogate ML model should be predictive and should accurately represent the uncertainty of prediction, on which the acquisition functions are based. As the ML method for BO, Gaussian process regression (GPR) is frequently used. 7−9 A GPR model predicts the distribution of the objective variable as a Gaussian distribution. The distribution reflects the similarity between the test process condition and the conditions used for training through the kernel function in the independent variable space; thus, similar process conditions tend to show similar predicted property values. In practical applications, GPR models can be employed with the automatic relevance determination (ARD) technique 10 to automatically reflect the importance of the variables in the length-scale parameters of the kernel function. However, when the data-set size is small, ARD might make the prediction sensitive to only a few variables, while the rest of the variables are almost ignored (see the Effect of the Prior of Length-scales in GPR Section).
For predicting a material's property from a set of process variables for fabricating the materials, ML models are normally built on all samples with all of the variables in the process.
However, changing the process by inserting or removing operations makes the ML model construction procedure difficult. In general, there are two approaches to address this challenge. In the first approach, without data integration, samples from the original process are discarded or ignored completely, and only samples from the new process are used for model building. This can be effective and safe when the number of samples from the new process is large and the measurement (synthesis) cost is low. In the second approach, data from the original process are merged with those from the new process to build an ML model. During process development for material design, the insertion or deletion of operations occurs frequently; thus, a way to integrate samples from different but similar processes is required; otherwise, many data points would have to be measured every time the process changes.
In the study of semiconductor devices, new operations are frequently inserted and current operations are frequently modified. Because the quality of the interface between the gate dielectric and the semiconductor, such as the metal-oxide-semiconductor (MOS) interface, is important, various process technologies have been introduced, e.g., surface treatment of semiconductor substrates and heat treatment inserted in the middle of the process. While various process techniques have already been established for silicon semiconductors, process development is still required for wide-band-gap semiconductors. In this study, our target is GaN power semiconductor devices, which have attracted much attention in recent years, although many issues still remain at the MOS interface. 11,12 One of the challenges is how to reduce the electrical interface defects measured by the interface state density, Dit, which represents the quality of an interface. Dit is strongly affected by the surface condition of the GaN substrate and by the deposition conditions of plasma-enhanced chemical vapor deposition (PECVD), which deposits SiO2. For improving the GaN surface, previous research reported a significant reduction of Dit values by introducing a UV/O3 treatment of the GaN surface. 13 As deposition conditions for PECVD, five process variables exist: temperature, pressure, radio frequency (RF) power, oxygen flow rate, and precursor material (TEOS) flow rate. Since UV/O3 treatment can be regarded as an operation inserted before the PECVD process on the GaN substrate, an effective way of integrating the data with and without UV/O3 treatment to make a highly predictive Dit model is important.
In this study, to identify effective model-building approaches for Dit prediction when the training data size is small and similar data sets are accessible, Dit measurement data of interfaces between a semiconductor and a SiO2 insulator deposited by PECVD were prepared. The Dit values were measured by the Hi-Lo and conductance methods, and the semiconductors were Si and GaN. For some of the GaN samples, UV/O3 treatment was additionally applied. These samples were utilized for building models that predict Dit from the process variables. The employed ML methods were ARD-based GPR with and without a prior distribution on the length-scale parameters. The ARD-based GPR models were evaluated in terms of whether the desired characteristics of a surrogate model for BO were achieved: the predictive ability of the model and the ability to reflect the distance to a training data set in the predicted distributions. A data integration strategy using a simple representation, one-hot encoding, was tested using different but highly related process data sets of interfaces. We found that for predicting Dit for the UV/O3-treated SiO2/GaN interfaces, the data set of SiO2/GaN interfaces without the treatment contributed to performance improvement. Such performance improvement was not observed when integrating a data set of SiO2/Si. Furthermore, the ARD-based GPR models with the prior distribution of length-scales showed the desired characteristics of a surrogate model for BO.

■ MATERIALS AND METHODS
Metal-Oxide-Semiconductor (MOS) Capacitor Fabrication. Si-doped GaN epitaxial layers (donor density of 5.0 × 10^16 cm^−3) of 4 μm thickness on an n-GaN(0001) substrate were used. The GaN surfaces were cleaned with buffered hydrofluoric acid (BHF) for 5 min and with hydrochloric acid (0.02 M) for 3 min. The samples were then rinsed with ultrapure water before the deposition of SiO2. After that, a thin film of SiO2 (thickness of ∼80 nm) was deposited by PECVD with TEOS and O2 gas. In the case of the surface-oxidized GaN sample (OxGaN), the GaN substrate was subjected to UV/O3 cleaning for 20 min at 100 °C before the deposition of SiO2. SiO2/Si samples were fabricated using the same methods. After forming the Al electrodes, postmetallization annealing was carried out for 30 min at 400 °C under a N2 atmosphere. The capacitance−voltage (C−V) characteristics were evaluated, and the interface state density (Dit) was calculated by the Hi-Lo and conductance methods for the SiO2/Si samples and the SiO2/GaN (OxGaN) samples, respectively. A schematic diagram of the fabricated sample is provided in Figure 1.
Data Sets for Machine Learning (ML). Dit of the interface between SiO2 and a semiconductor was the objective variable y in the ML models. Two N-type substrates, GaN and Si, were employed for the MOS capacitors. An operation with O3 gas was further applied to some of the GaN samples, leading to three types of interfaces corresponding to three data sets: SiO2/Si, SiO2/GaN, and O3 gas-treated SiO2/GaN (OxGaN). The process variables of PECVD were the independent variables (x) in the ML models, which thus predict a Dit value from the process variable values. The ranges of Dit and the process variables for the three data sets are reported in Table 1. The number of samples for Si was 26, while that for OxGaN was 12. The Dit and process variable values for the three data sets are provided in Table S1 in the Supporting Information. The values of the process variables were determined by our previous attempts to optimize the process variables targeting lower Dit using BO approaches; 14 thus, relatively diverse process variable values had been tested beforehand.
Gaussian Process Regression (GPR). GPR is a regression algorithm based on the Gaussian process, where any finite collection of outputs of a function follows a multivariate Gaussian distribution. A trained GPR model can predict a Gaussian distribution of the objective variable for an arbitrary input x given a set of samples as a training data set. Thus, probabilistic operations, e.g., taking the expectation and the variance of the objective variable for x, can be naturally applied to GPR outputs. In general, a GPR model is controlled by two functions, the mean function m and the covariance function k, as shown in eq 1:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \quad (1)$$

The mean and covariance functions give, respectively, the mean vector and the covariance matrix of the multivariate Gaussian distribution for a collection of function outputs, governing the Gaussian process. Given a set of n data points D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the multivariate Gaussian distribution becomes

$$(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top \sim \mathcal{N}(\mathbf{m}, K), \quad \mathbf{m} = (m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n))^\top, \quad K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \quad (2)$$

where m(x_i) represents the mean of f(x_i) and k(x_i, x_j) represents the covariance between f(x_i) and f(x_j) for the i-th and j-th samples. In this study, m(x) takes a constant value. The covariance matrix in eq 2 is also called the kernel matrix (K); it determines the shape (smoothness) of the function over x. If two points x_1 and x_2 take a high value of k(x_1, x_2), the corresponding output values f(x_1) and f(x_2) become similar. In this study, the Matern52 kernel with ARD 10 was used as the kernel function:

$$k(\mathbf{x}_1, \mathbf{x}_2) = \left(1 + \sqrt{5}\,r + \frac{5}{3} r^2\right) \exp\left(-\sqrt{5}\,r\right), \quad r = \sqrt{\sum_{i=1}^{d} \frac{(x_{1,i} - x_{2,i})^2}{l_i^2}} \quad (3)$$

where l = (l_1, l_2, ..., l_d) is a set of hyperparameters called length-scales, d is the dimension of x, and r is the length-scale-weighted Euclidean distance between x_1 and x_2. ARD sets one length-scale parameter per variable. A variable with a large length-scale value after optimization is regarded as unimportant to the model because the covariance of the model becomes almost independent of the input values of that variable. 15 In our study, the Matern52 kernel was further combined with a constant kernel to scale the output of the kernel:

$$k'(\mathbf{x}_1, \mathbf{x}_2) = \tau\, k(\mathbf{x}_1, \mathbf{x}_2) \quad (4)$$

where τ is the scale parameter. The derivation of the distribution of y for a new point is provided in Supporting Note S1.
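The ARD kernel above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the GPyTorch code used in this study, and the function name is ours:

```python
import math

def matern52_ard(x1, x2, lengthscales):
    """Matern 5/2 kernel with ARD: one length-scale per input dimension."""
    # Length-scale-weighted Euclidean distance between the two points.
    r = math.sqrt(sum((a - b) ** 2 / l ** 2
                      for a, b, l in zip(x1, x2, lengthscales)))
    s = math.sqrt(5.0) * r
    # (1 + sqrt(5) r + 5 r^2 / 3) exp(-sqrt(5) r); note s^2 = 5 r^2.
    return (1.0 + s + s * s / 3.0) * math.exp(-s)
```

Setting one length-scale to a very large value demonstrates the ARD effect: displacements along that dimension barely change the covariance, so the variable is effectively ignored by the model.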
Parameter Estimation for GPR. Parameters in GPR, such as the length-scales in eq 3 and the Gaussian noise σ², are normally determined by maximizing the marginal likelihood for the observed data points. In this study, a prior distribution for each hyperparameter was introduced to avoid overfitting, in particular when the data points are few. To estimate the hyperparameters θ = (l, τ, σ²), maximum a posteriori (MAP) estimation was conducted based on the following equation:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \left[ \log p(\mathbf{y} \mid X, \theta) + \log p(\theta) \right] \quad (5)$$

Assuming that the length-scales as well as the other parameters are independent, the prior distribution becomes

$$p(\theta) = p(\tau)\, p(\sigma^2) \prod_{i=1}^{d} p(l_i) \quad (6)$$

In this study, the prior distribution of each hyperparameter is given in the form of the gamma distribution, Gamma(α, β), where α is the shape parameter and β is the inverse scale parameter (rate parameter):

$$p(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x} \quad (7)$$

Each length-scale parameter takes the same gamma distribution as its prior. The logarithm of the prior (eq 6) can then be represented as follows:

$$\log p(\theta) = \sum_{i=1}^{d} \left[ \alpha_l \log \beta_l + (\alpha_l - 1) \log l_i - \beta_l l_i - \log \Gamma(\alpha_l) \right] + \alpha_\tau \log \beta_\tau + (\alpha_\tau - 1) \log \tau - \beta_\tau \tau - \log \Gamma(\alpha_\tau) + \alpha_{\sigma^2} \log \beta_{\sigma^2} + (\alpha_{\sigma^2} - 1) \log \sigma^2 - \beta_{\sigma^2} \sigma^2 - \log \Gamma(\alpha_{\sigma^2}) \quad (8)$$

where Γ represents the gamma function and (α_l, β_l), (α_τ, β_τ), and (α_{σ²}, β_{σ²}) are the shape and inverse scale parameters for the length-scales, the scale parameter, and the noise, respectively. In this study, α_l = 3.0, β_l = 6.0, α_τ = 2.0, β_τ = 0.15, α_{σ²} = 1.1, and β_{σ²} = 0.05 were used, if not specified otherwise. The probability density functions for the length-scale, scale parameter, and noise are provided in Figure S1. These values are the default settings in the BoTorch library, 16 which also showed relatively stable performance for the data sets in this study (see the Predictive Performance of ML Models Section).
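The gamma log-density and the summed log prior added to the marginal likelihood in MAP estimation can be sketched as follows; this is a minimal plain-Python illustration with the hyperparameter values quoted above, and the function names are ours, not from the paper's code:

```python
import math

def gamma_logpdf(x, alpha, beta):
    """log of the Gamma(alpha, beta) density, with beta the rate parameter."""
    return (alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
            - beta * x - math.lgamma(alpha))

def log_prior(lengthscales, tau, sigma2):
    """Sum of independent gamma log-priors over all GPR hyperparameters."""
    lp = sum(gamma_logpdf(l, 3.0, 6.0) for l in lengthscales)  # length-scales
    lp += gamma_logpdf(tau, 2.0, 0.15)                         # kernel scale
    lp += gamma_logpdf(sigma2, 1.1, 0.05)                      # Gaussian noise
    return lp
```

In MAP estimation, this quantity is simply added to the log marginal likelihood before maximization, pulling the length-scales toward the prior mode (α_l − 1)/β_l.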
Software and Implementation. GPR models with/without prior distributions were implemented in an in-house Python script with the help of the GPyTorch (version 1.9.0) and BoTorch modules. 17

ML Modeling Algorithms for Comparison. Random Forest Regressor. The random forest (RF) algorithm is an ensemble learning approach based on bagging. 18 A set of bootstrap samples is prepared for each decision tree, and each tree is trained solely on its bootstrap set using randomly selected features. The prediction of an RF regressor (RFR) is the average of the outputs of all trees. In this study, the scikit-learn implementation of RF was used: RandomForestRegressor (version 1.1.2). The hyperparameters of the RF models were optimized using the GridSearchCV module of scikit-learn with the parameter grid provided in Table S2.
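The grid-search procedure can be sketched with scikit-learn as follows. The data here are synthetic stand-ins for the process-variable table, and the two-parameter grid is hypothetical (the actual grid is in Table S2):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Toy stand-in: 26 samples, 7 features (5 process variables + 2 one-hot flags)
X = rng.uniform(size=(26, 7))
y = 2.0 * X[:, 2] + rng.normal(scale=0.1, size=26)  # synthetic target

# Hypothetical small grid; the study's grid is given in Table S2
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_model = search.best_estimator_
```

Changing `cv` here corresponds to the fold-size sensitivity analysis discussed later (Figure S3).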
Multi-task Neural Networks. A simple multi-task neural network (multi-task NN) was used as a control. A multi-task NN can be trained on a data set mixed with multiple objective variables; in this study, the multiple objective variables correspond to Dit for the three interface types: GaN, Si, and OxGaN. The multi-task NN architecture, consisting of a sequence of input, hidden, and output layers, is provided in Figure S2. During model training, the number of epochs was chosen to give the best predicted Dit values for OxGaN, as shown in Table S3.
Evaluation Metrics. For evaluating Dit prediction models, the coefficient of determination (R²), the root-mean-square error (RMSE), and the mean absolute error (MAE) were used. These values are calculated by the following equations:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

where n is the number of data points, y_i and ŷ_i are the observed and predicted values for the i-th data point, and ȳ is the average of the observed data.
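The three metrics follow directly from their definitions; a small self-contained sketch (the function name is ours):

```python
import math

def evaluate(y_true, y_pred):
    """Return (R2, RMSE, MAE) for paired observed/predicted values."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n
    return r2, rmse, mae
```

Note that R² can be negative, as seen later for some strategies: a model predicting worse than the mean of the observed data gives ss_res > ss_tot.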
Data Preprocessing. Before building ML models, the independent variables (x) were scaled based on the theoretical maximum and minimum values of each variable. The logarithm of Dit (log Dit) was used as the objective variable, followed by range scaling using the maximum and minimum values of the training data. The logarithm transformation allowed us to compare a wide range of possible Dit values spanning different orders of magnitude. For comparison, we also built GPR models using untransformed (original) Dit values.
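The two preprocessing steps can be sketched as follows; this is a minimal illustration assuming base-10 logarithms (the base is not stated above) and illustrative bounds:

```python
import math

def scale_x(value, theo_min, theo_max):
    """Scale a process variable to [0, 1] by its theoretical bounds."""
    return (value - theo_min) / (theo_max - theo_min)

def scale_log_dit(dit_train, dit_value):
    """log10-transform Dit, then range-scale by the training min/max."""
    logs = [math.log10(v) for v in dit_train]
    lo, hi = min(logs), max(logs)
    return (math.log10(dit_value) - lo) / (hi - lo)
```

Because the scaling of the objective uses only training-set statistics, test samples outside the training range map outside [0, 1], which is intended behavior.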
■ RESULTS AND DISCUSSION
Study Design. There were two objectives in this study for predicting Dit values from a limited number of samples: (1) revealing an effective data integration strategy for ML models and (2) proposing the best modeling method that can be further utilized for BO. A limitation of the study is that all data were collected in a single laboratory with a well-established protocol and a small number of samples (12 for evaluation), meaning that experimental errors were expected to be small and that the derived conclusions hold for the small-data-set setting.
For the first objective, when targeting Dit prediction for OxGaN samples, four integrated training data sets were prepared: GaN and OxGaN (GaN/OxGaN), Si and OxGaN (Si/OxGaN), GaN, Si, and OxGaN (GaN/Si/OxGaN), and only OxGaN. In ordinary ML approaches, training data sets follow the same distribution as test data sets, i.e., only OxGaN samples are used for both training and test (Figure 2a). However, different but related samples, such as GaN for OxGaN, might contribute to increasing the predictive performance of the model. This hypothesis was systematically tested in this study. To represent which type of interface was measured for Dit, one-hot encoding of the interface type was introduced (Figure 2b). A sample that was treated with O3 gas has a value of 1 for the UV/O3-processed variable; otherwise, it is 0. Similar labeling was used for Si-based interfaces. To evaluate the performance of the prediction models, leave-one-out cross-validation (LooCV) was conducted due to the relatively small number of OxGaN samples. The evaluation metrics were R², RMSE, and MAE. To compare various ML models fairly, GPR, RFR, and multi-task NN were used as regression models. For all ML models, a sample was represented by the five process variables and the two one-hot encodings (Figure 2b).
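The one-hot representation and the LooCV splitting can be sketched as follows; the sample values are illustrative, not taken from Table S1:

```python
def featurize(process_vars, interface):
    """Append two one-hot flags: UV/O3-processed and Si-substrate."""
    uv_o3 = 1.0 if interface == "OxGaN" else 0.0
    si = 1.0 if interface == "Si" else 0.0
    return list(process_vars) + [uv_o3, si]

def loo_splits(samples):
    """Yield (training set, held-out sample) pairs for leave-one-out CV."""
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], samples[i]
```

With this encoding, samples from all three interface types share one feature space, so a single model can be trained on any of the four integrated data sets.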
For the second objective, ARD-based GPR models with/without the prior distribution of length-scales were examined. Introducing the prior into GPR modeling was expected to reduce the possibility of overfitting when the data-set size is small. In addition to comparing the predictive ability of the GPR models with/without the prior, the length-scale values optimized by maximizing the marginal likelihood were scrutinized. Since BO is extensively utilized for material design when the number of samples is small, an appropriate quantification of the design space reflecting the training data distribution is important. In this respect, the predicted distributions for grid points in the design space were evaluated, which provide important information for further process variable optimization.
Predictive Performance of ML Models. The LooCV-based performances of the prediction models for OxGaN samples are reported in Table 2. These models were combinations of the four integration strategies and the four modeling methods (except for multi-task NN). Overall, the combinations of the GaN/OxGaN data set and GPR with/without the prior worked best. The R² values for these two strategies were over 0.6, much higher than for the other tested strategies. On the other hand, for the integration strategies involving Si, the performance deteriorated drastically. For example, when using the Si/OxGaN data set, GPR with the prior had an R² value of 0.27 and GPR without the prior had an R² value of −0.46. These performance differences imply that there might be a relationship between the samples of GaN and those of O3-treated GaN, but not with Si. However, for the six pairs of samples having the same process variable values between GaN and OxGaN, the correlation coefficient of log Dit was −0.20, and one pair was found having the same process variable values between Si and OxGaN: 10.83 for Si and 10.64 for OxGaN. RFR showed its best performance when only the OxGaN data set was used, although the R² values were much smaller than those of the GPR models for the GaN/OxGaN data set. The hyperparameters of RFR were optimized by the inner grid-search CV. In the RFR models, when changing the number of folds in the CV, unstable prediction performance was observed except for the OxGaN data set (Figure S3). This could explain the low predictive ability of the RFR models when using the integrated data sets. There was also a discrepancy between GPR with and without the prior. Without the prior, prediction performance was unstable except for the GaN/OxGaN data set (Si/OxGaN: −0.46 in R²; OxGaN: −0.36). With the prior, such performance deterioration was not observed (Si/OxGaN: 0.27 in R²; OxGaN: 0.29).
Thus, introducing the prior had a clear merit for the GPR models in this study. Multi-task NN showed a relatively good prediction performance (R²: 0.50), but it was lower than that of GPR with the prior trained on the GaN/OxGaN data set.
The effect of the logarithm transformation of the objective variable on the models' predictive ability was also investigated. As shown in Table S4, GPR with the prior but without the log-transform performed poorly overall. On the other hand, without the prior distribution in GPR, the predictive performance of the model was almost as good as that of the model built on the log-transformed variable. To compare the evaluation metrics fairly for the models with/without the logarithm transformation (GPR without the prior, GaN/OxGaN data set), the predicted values (logarithm scale) were transformed back to the original scale. Although trained on the logarithm-scaled variable, this model gave slightly better metric values (MAE: 1.52 × 10^10, RMSE: 1.99 × 10^10, R²: 0.66) than the model using the untransformed variable.

Effect of the Prior of Length-scales in GPR.
To further reveal the difference between GPR models with and without the prior, the optimized length-scale parameters (ARD) were calculated using all of the data in each training data set (Table 3). A large length-scale value for a variable indicates that the variable is unimportant for the GPR prediction. Without the prior, when the data set was OxGaN, the only effective variable was RF power. This might be a result of the model overfitting to the small data set (12 samples). On the other hand, introducing the prior avoided relying on a single variable; rather, the weights were distributed more equally among the process variables. For the GaN/OxGaN data set, the difference in the length-scale parameters between GPR with and without the prior was smaller.
To compare the model stability between GPR with and without the prior, the convergence of the parameters under maximization of the marginal likelihood was tested by shuffling the training data set 10 times and building 10 GPR models for each OxGaN test sample. In theory, the 10 models should predict the same distribution for each test sample. With the prior, the average standard deviation of the predictions across the shuffled training data sets was 9.0 × 10^−5, while without the prior it was 9.5 × 10^−2, meaning that the distributions predicted by GPR without the prior depended on the order of the training samples (Table S5).
To further investigate the characteristics of the distribution of predicted Dit derived by the GPR models with the optimized parameters, 24.3 million grid points in the process variable space were evaluated; these were the exhaustive combinations of 30 equally spaced points for each of the five process variables (30^5). For the UV/O3-processed variable, a value of 1 was set when using the GaN/OxGaN data set. Figure 3 shows the predicted mean and standard deviation of Dit by the GPR models for the grid points. The horizontal axes in Figure 3 show the minimum Euclidean distance to the training data points: a distance to the training data set. With the prior, the predicted mean and standard deviation values gradually converged as the distance to the training data set increased (Figure 3a,c). In particular, the average of the standard deviation values became larger as the distance increased. Most acquisition functions in BO depend solely on the predicted distributions, so these characteristics of the GPR model with the prior seem reasonable for BO. On the other hand, without the prior, the predicted mean and standard deviation values did not converge as the distance increased (Figure 3b,d). In particular, the GPR model without the prior trained on the OxGaN data set predicted the same mean and standard deviation values irrespective of the distance. This was a consequence of the fact that only the length-scale for RF power had a small value in comparison to the other process variables (Table 3). When conducting BO with this model, the process variable values except for RF power would be determined almost at random.
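The grid construction and the distance-to-training-set metric can be sketched as follows; a 3-point-per-axis toy grid stands in for the 30-point grid used above, and the function names are ours:

```python
import itertools
import math

def grid_points(bounds, n_per_axis):
    """Exhaustive grid: n_per_axis equally spaced points per variable."""
    axes = [[lo + (hi - lo) * k / (n_per_axis - 1) for k in range(n_per_axis)]
            for lo, hi in bounds]
    return itertools.product(*axes)

def distance_to_training(point, training_points):
    """Minimum Euclidean distance from a grid point to the training set."""
    return min(math.dist(point, t) for t in training_points)
```

With 30 points per axis and five process variables, the full grid has 30^5 = 24.3 million points, matching the count quoted above; the toy grid keeps the test fast.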
Hyperparameters of the Length-scales Prior. Robustness of Hyperparameter Values on the Predictive Ability. The analyses in the previous sections revealed that setting a prior distribution on the length-scales contributed to a higher predictive ability of the models and avoided predicting the same distribution for different test points. The prior for the length-scales was a gamma distribution with parameters α_l and β_l. The predictive ability of GPR models with the prior was systematically evaluated for exhaustive combinations of α_l and β_l from 1.0 to 50 with an interval of 0.5. Figure 4 reports the LooCV-based R² surfaces for the two data sets: GaN/OxGaN (Figure 4a) and OxGaN (Figure 4b). The top 10 performing combinations of α_l and β_l are also reported in Table 4. Extreme values of the parameters did not perform well for either data set. The well-performing parameter combinations differed between the two surfaces in Figure 4: for GaN/OxGaN, α_l and β_l took similar values, while for OxGaN, α_l larger than β_l worked better. Since the gamma distribution with parameters α and β has a mean value of α/β, restricting each length-scale to a proper range of values might be key to building a better predictive model. In fact, the optimized length-scale values for the top 10 performing models in Table 4 are provided in Table S6. Although the scale of the length-scales seemed restricted by the prior mean, the optimized length-scale values varied among the process variables. For the GaN/OxGaN data set, the length-scale of the Temp. variable was the smallest, while that of the TEOS flow was the largest (Table S6).
The case of a gamma distribution relatively close to the uniform distribution is illustrated in Figure S4.

Figure 3. Predicted values for grid points in the experimental condition space. For each grid point of the exhaustive experimental conditions, the predicted properties (mean and standard deviation) of Dit (y axis) are plotted against the distance to the training data set (x axis). The distance metric is the minimum Euclidean distance to the training data points. The property values are predicted by GPR with the prior (a) and GPR without the prior (b) using GaN/OxGaN, and by GPR with the prior (c) and GPR without the prior (d) using OxGaN.

While the combination of the GaN/OxGaN data set and the ARD-based GPR with the prior was found to be suitable in terms of predictive ability (Table 2) and showed a preferable characteristic for BO, reflecting the distance from the training data set in the predictive distributions (Table 3), these GPR models used prior hyperparameter values predetermined by the BoTorch module, α_l = 3.0 and β_l = 6.0. The important question was how to optimize these hyperparameters solely from a training data set. Thus, we conducted double CV, or nested CV, to optimize the hyperparameters. 19 As the inner CV loop, LooCV was employed to determine the best hyperparameter values based on RMSE. Exhaustive combinations of α_l and β_l from 1.0 to 10 with an interval of 1.0 were searched in this process. Table 5 shows the predictive performance in LooCV when using the optimized combination of α_l and β_l for each OxGaN test sample. For each test sample, the fit to the training samples inside the inner CV and the optimized combination of α_l and β_l are reported in Table S7. Furthermore, the predicted Dit value for each test sample by GPR with the prior with optimized α_l and β_l is provided in Table S8. The double CV-based validation for GPR with the prior reached almost the same performance as using the predetermined hyperparameter values in Table 2.
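The double (nested) CV procedure can be sketched generically. In this sketch a simple distance-weighted-mean predictor stands in for the GPR model, and its bandwidth h stands in for the (α_l, β_l) combination, since the point here is the structure of the loops rather than the model; all names are ours:

```python
import math

def rmse(pairs):
    """RMSE over (observed, predicted) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

def loo_splits(samples):
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], samples[i]

def double_cv(samples, hyper_grid, fit_predict):
    """Outer LooCV measures performance; inner LooCV picks the hyperparameter."""
    outer = []
    for train, (x_test, y_test) in loo_splits(samples):
        def inner_rmse(h):
            return rmse([(y, fit_predict(tr, x, h))
                         for tr, (x, y) in loo_splits(train)])
        best_h = min(hyper_grid, key=inner_rmse)      # selection on train only
        outer.append((y_test, fit_predict(train, x_test, best_h)))
    return rmse(outer)

def fit_predict(train, x, h):
    """Stand-in model: Gaussian distance-weighted mean with bandwidth h."""
    w = [math.exp(-((x - xt) ** 2) / h) for xt, _ in train]
    return sum(wi * yt for wi, (_, yt) in zip(w, train)) / sum(w)
```

Because the hyperparameter is chosen inside each outer split using only that split's training samples, the outer RMSE is an unbiased estimate of the whole selection-plus-fitting procedure, which is the property exploited above.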
Thus, the hyperparameter values of α_l and β_l could be determined solely based on a training data set. However, when using only the OxGaN data set, fitting to the training data set sometimes failed, possibly due to the small data-set size.

■ CONCLUSIONS
For data-driven material discovery or process optimization, an efficient ML model-building scheme is required when the data-set size is small and when additional operations are introduced into the process. In this study, three types of interfaces fabricated by PECVD were prepared to evaluate their characteristics: SiO2/Si, SiO2/GaN, and O3 gas-treated SiO2/GaN (OxGaN).

■ SUPPORTING INFORMATION
Brief explanation of the predicted distribution of the output for a new point in GPR (Note S1); Dit and the process variables (Table S1); hyperparameters of RFR (Table S2); predictive performance for OxGaN by the multi-task NN models trained with different numbers of epochs (Table S3); predictive performance for OxGaN by GPR without the logarithm transformation of the objective variable (Table S4); stability of the GPR models (Table S5); length-scale values in the ARD-based GPR models using the top 10 performing combinations of α_l and β_l in the prior (Table S6); optimized α_l and β_l values in the double CV trial (Table S7); predicted Dit value for each test sample using the optimized α_l and β_l in the double CV trial (Table S8); probability density functions (PDFs) for the length-scale, scale parameter, and noise (Figure S1); multi-task neural network architecture (Figure S2); effect of fold sizes on the predictive ability of RFR models (Figure S3); probability density functions (PDFs) with small α and β values (Figure S4)