Machine Learning Assisted Classification of Aluminum Nitride Thin Film Stress via In-Situ Optical Emission Spectroscopy Data

In this study, we submit a complex set of in-situ data collected by optical emission spectroscopy (OES) during the process of aluminum nitride (AlN) thin film. Changing the sputtering power and nitrogen(N2) flow rate, AlN film was deposited on Si substrate using a superior sputtering with a pulsed direct current (DC) method. The correlation between OES data and deposited film residual stress (tensile vs. compressive) associated with crystalline status by X-ray diffraction spectroscopy (XRD), scanning electron microscope (SEM), and transmission electron microscope (TEM) measurements were investigated and established throughout the machine learning exercise. An important answer to know is whether the stress of the processing film is compressive or tensile. To answer this question, we can access as many optical spectra data as we need, record the data to generate a library, and exploit principal component analysis (PCA) to reduce complexity from complex data. After preprocessing through PCA, we demonstrated that we could apply standard artificial neural networks (ANNs), and we could obtain a machine learning classification method to distinguish the stress types of the AlN thin films obtained by analyzing XRD results and correlating with TEM microstructures. Combining PCA with ANNs, an accurate method for in-situ stress prediction and classification was created to solve the semiconductor process problems related to film property on deposited films more efficiently. Therefore, methods for machine learning-assisted classification can be further extended and applied to other semiconductors or related research of interest in the future.


Introduction
Aluminum nitride (AlN) is a typical semiconductor material that belongs to group III-V with a hexagonal, close-packed wurtzite structure [1]. AlN is a material with high mechanical, thermal, and chemical stability. From an electrical point of view, AlN is well known for its large energy bandgap, high thermal conductivity, high breakdown energy, high resistivity, and high surface acoustic wave (SAW) velocity [2]. AlN has some clearly labeled crystal orientations, such as (100), (101), and (002). The latter parameter is the highest orientation in the (002) plane, prompting research interest in the piezoelectric area [3]. Various methods can produce thin films but sputtering, with the low-cost and low-temperature requirements, is fast becoming the most suitable fabrication method [4]. During the DC sputtering, the arcing phenomena can lead to target poisoning and damage to the target as well as affect thin film deposition afterwards. Differently, RF sputtering inches in diameter with 6 mm thickness in 99.999 wt% purity. During the processes, the mixture gas of nitrogen (N2) and argon (Ar) in a ratio from 1:1 to 4:1 was used. The sputtering power of the process is from 400 W to 1000 W, and the pulse frequency was 250 kHz with a 0.6μs reverse time. Samples were cut into squares with a side length of 2 cm and cleaned in the order of acetone (ACE), isopropanol (IPA), and de-ionized (DI) water. Before the sample was loaded into the chamber, diluted 1.2% hydrofluoric acid was used to remove the oxide from the samples.

Data Acquirement
During the 60-min processes, under the electric field, N2 dissociates, and Ar impacts the target to sputter Al from the target. Atoms are partially ionized and emit light. The plasma consists of some radicals, e.g., Al + , Al 0 , Ar + , Ar 0 , N + , and N 0 , and spectra were collected by OES (SE2020-025-FUVN, OtO Photonics, Taoyuan, Taiwan) presented by the emission line, but N + is not detected due to low intensity [8]. A waterfall plot was plotted to show the raw data of one experiment in Figure 2. OES supports the spectral analysis of a 2048-pixel charge-coupled device (CCD). The emission light is transmitted through a diffraction grating and then collected by CCD. The OES device has a spectral decomposition capability of wavelength between 180 and 850 nm, and the resolution between these wavelengths is 1900 pixels.
On the right side of Figure 2, every data "point" reflects an arbitrary unit (a.u.) density in a specific time and wavelength. Spectral intensity is recorded numerically on these pixels, and we call each intensity value a "point". We use OES to continuously collect the plasma emission spectrum at the interval of 3 s, and the integration time of the spectrum is 25 ms. An integration time (25 ms) can be easily understood as the exposure time of a camera. Over 25 ms of integration time, an OES spectrum is obtained. We call a spectrum at a time point as a "set". These sets reflect the spectrum at a particular time and are fed into the PCA one by one as input data. After a 60-min experiment, we will receive 1200 sets, which we will call a "group". Before using this set of data, we used PCA to denoise this set of data.

Data Acquirement
During the 60-min processes, under the electric field, N 2 dissociates, and Ar impacts the target to sputter Al from the target. Atoms are partially ionized and emit light. The plasma consists of some radicals, e.g., Al + , Al 0 , Ar + , Ar 0 , N + , and N 0 , and spectra were collected by OES (SE2020-025-FUVN, OtO Photonics, Taoyuan, Taiwan) presented by the emission line, but N + is not detected due to low intensity [8]. A waterfall plot was plotted to show the raw data of one experiment in Figure 2. OES supports the spectral analysis of a 2048-pixel charge-coupled device (CCD). The emission light is transmitted through a diffraction grating and then collected by CCD. The OES device has a spectral decomposition capability of wavelength between 180 and 850 nm, and the resolution between these wavelengths is 1900 pixels.
On the right side of Figure 2, every data "point" reflects an arbitrary unit (a.u.) density in a specific time and wavelength. Spectral intensity is recorded numerically on these pixels, and we call each intensity value a "point". We use OES to continuously collect the plasma emission spectrum at the interval of 3 s, and the integration time of the spectrum is 25 ms. An integration time (25 ms) can be easily understood as the exposure time of a camera. Over 25 ms of integration time, an OES spectrum is obtained. We call a spectrum at a time point as a "set". These sets reflect the spectrum at a particular time and are fed into the PCA one by one as input data. After a 60-min experiment, we will receive 1200 sets, which we will call a "group". Before using this set of data, we used PCA to denoise this set of data.

Flim Charaterization and Stress Calculation
After films being deposited, XRD (D8 Advance, Bruker, Billerica, MA, USA) was used to measure the film texture, CFE-SEM (SU8200, HITACHI, Tokyo, Japan) was used to take the cross-section images of the thin film and the interface between AlN and Si, and HR-TEM (JEM2100, JEOL Ltd., Tokyo, Japan) was used to take the bright-field images and selected area electron diffraction (SAED) pattern. We calculate the residual stress on the thin film using the XRD measurement results and the following using the side-inclination method [13,14]: where Young's modulus E = 308 GPa and Poisson's ratio υ = 0.287 for AlN [26,27], σ is stress, θ is diffraction angle, θ0 is the diffraction angle without stress, and Ψ is the inclination angle of the sample. After changing the inclination angle of the platform several times, we can obtain the slope curve and an estimated value of the thin film stress [28].

Data Processing and Machine Learning
We combined the different experimental data and analyzed them using PCA, then classified them by calculating stress. In machine learning exercises, PCA is classified as dimension reduction to extract features. In principle, the data dimension is reduced, but the overall performance of the data remains good or not much different. As shown in Figure 3a, PCA can distinguish data through different PCs (principal components). The directions with the largest and the second-largest data differences are extracted from PCA, respectively, called principal component 1 (PC1) and principal component 2 (PC2). Then, we select PC1 and PC2 and input the processed data into MLP. MLP includes at least three nodes (input layer, hidden layers, and output layers), as shown in Figure 3b. The hidden layers are performing the feature extraction of input layer data, which can reduce as well as increase dimension. This process is not designed by the rule of thumb but learned by data. The final output layer is the classified stress (tensile or compressive) [29]. In the end, we have received a machine-determined taxonomy for PC1 and PC2, as shown in Figure  3c. This kind of classification is not necessarily visible, only visible in limited dimensions or selected dimensions. However, the data can be classified to make predictions using this model, whether it is visible or not.

Flim Charaterization and Stress Calculation
After films being deposited, XRD (D8 Advance, Bruker, Billerica, MA, USA) was used to measure the film texture, CFE-SEM (SU8200, HITACHI, Tokyo, Japan) was used to take the cross-section images of the thin film and the interface between AlN and Si, and HR-TEM (JEM2100, JEOL Ltd., Tokyo, Japan) was used to take the bright-field images and selected area electron diffraction (SAED) pattern. We calculate the residual stress on the thin film using the XRD measurement results and the following using the side-inclination method [13,14]: where Young's modulus E = 308 GPa and Poisson's ratio υ = 0.287 for AlN [26,27], σ is stress, θ is diffraction angle, θ 0 is the diffraction angle without stress, and ψ is the inclination angle of the sample. After changing the inclination angle of the platform several times, we can obtain the slope curve and an estimated value of the thin film stress [28].

Data Processing and Machine Learning
We combined the different experimental data and analyzed them using PCA, then classified them by calculating stress. In machine learning exercises, PCA is classified as dimension reduction to extract features. In principle, the data dimension is reduced, but the overall performance of the data remains good or not much different. As shown in Figure 3a, PCA can distinguish data through different PCs (principal components). The directions with the largest and the second-largest data differences are extracted from PCA, respectively, called principal component 1 (PC1) and principal component 2 (PC2). Then, we select PC1 and PC2 and input the processed data into MLP. MLP includes at least three nodes (input layer, hidden layers, and output layers), as shown in Figure 3b. The hidden layers are performing the feature extraction of input layer data, which can reduce as well as increase dimension. This process is not designed by the rule of thumb but learned by data. The final output layer is the classified stress (tensile or compressive) [29]. In the end, we have received a machine-determined taxonomy for PC1 and PC2, as shown in Figure 3c. This kind of classification is not necessarily visible, only visible in limited dimensions or selected dimensions. However, the data can be classified to make predictions using this model, whether it is visible or not.  The above study methods are illustrated as a fishbone diagram, as shown in Figure  4. Through data acquirement, data processing, and thin film analysis (step I-III), a machine learning method is built for stress classification (step IV). Validation experiments after model training can verify if the model can be widely used. The correlation between process parameters and OES has been addressed by varying some of the experimental parameters, which will result in changes in the whole spectrum of OES data that are difficult to quantify or describe. The prediction combined with machine learning is also an innovation of this study which solves the complex problems of plasma composition and energy associated with film deposition and film property.

Stress and X-ray Diffraction Analysis
As shown in Figure 5, there are two kinds of stress acting in the film. In a recent review [15], thin film stress was discussed. Reducing the stress of thin films can improve the electron mobility of devices and reduce film cracking or peeling from the substrate. Thin film stress can be affected by many complex factors like microstructures and deposition parameters. Several complex ex-situ measurements and analyses can be ignored through this study of machine learning exercises. The above study methods are illustrated as a fishbone diagram, as shown in Figure 4. Through data acquirement, data processing, and thin film analysis (step I-III), a machine learning method is built for stress classification (step IV). Validation experiments after model training can verify if the model can be widely used. The correlation between process parameters and OES has been addressed by varying some of the experimental parameters, which will result in changes in the whole spectrum of OES data that are difficult to quantify or describe. The prediction combined with machine learning is also an innovation of this study which solves the complex problems of plasma composition and energy associated with film deposition and film property.  The above study methods are illustrated as a fishbone diagram, as shown in Figure  4. Through data acquirement, data processing, and thin film analysis (step I-III), a machine learning method is built for stress classification (step IV). Validation experiments after model training can verify if the model can be widely used. The correlation between process parameters and OES has been addressed by varying some of the experimental parameters, which will result in changes in the whole spectrum of OES data that are difficult to quantify or describe. The prediction combined with machine learning is also an innovation of this study which solves the complex problems of plasma composition and energy associated with film deposition and film property.

Stress and X-ray Diffraction Analysis
As shown in Figure 5, there are two kinds of stress acting in the film. In a recent review [15], thin film stress was discussed. Reducing the stress of thin films can improve the electron mobility of devices and reduce film cracking or peeling from the substrate. Thin film stress can be affected by many complex factors like microstructures and deposition parameters. Several complex ex-situ measurements and analyses can be ignored through this study of machine learning exercises.

Stress and X-ray Diffraction Analysis
As shown in Figure 5, there are two kinds of stress acting in the film. In a recent review [15], thin film stress was discussed. Reducing the stress of thin films can improve the electron mobility of devices and reduce film cracking or peeling from the substrate. Thin film stress can be affected by many complex factors like microstructures and deposition parameters. Several complex ex-situ measurements and analyses can be ignored through this study of machine learning exercises. In Figure 6a, compressive and tensile stress are obtained after experiments, and we can roughly judge that the residual stress has a noticeable difference with the change of N2 flow rate and power. Apparently, even with the increase of N2 flow rate and process power (see Table 1), the compressive stress changes to tensile stress (see Figure 6a). With the increase of the N2/Ar flow ratio and sputtering power, the thickness of AlN increases, indicating that the transition really exists from compressive stress to tensile [30]. Increasing the sputtering power energetically enhances the deposition rate, with the increased kinetic energy transferring to the sputtered atoms. With the sputtering power induced energy transfer, ionized radical mobility, and surface diffusion length also increase, leading to the increased possibility of atomic diffusion [31]. Furthermore, when Ar remains constant, with the increase of N2/Ar flow ratio, the proportion of N + or N2 + increases with the increase of deposition rate [32]. Therefore, the above power and flow rate changes will be highly reflected in OES data through plasma composition and energy changes. In combination with the film measurements by XRD, SEM, and TEM, we can attempt to establish a connection between the OES data and the film results. Thus, machine learning becomes a great way to process complex OES data.  T1  15  15  800  T2  15  30  800  T3  15  45  800  T4  15  60  800  T5  15  45  400  T6  15  45  600  T7  15  45  800  T8 15 45 1000 Figure 6b shows the OES data and XRD of two representative groups. The plot in the red dotted line represents group T6 with tensile stress, and the plots in the blue dotted line patterns represent group T8 with compressive stress. The overall emission intensity T8 is relatively strong from the OES data plots, but it is still challenging to find many differences. In group T8, only AlN (002) orientations that require energy accumulation appear at the 2θ diffraction angle between 30° and 40° due to the higher deposition power from the XRD plot. In contrast, in the T6 group, there is an apparent peak in the orientation of AlN (100). In Figure 6a, compressive and tensile stress are obtained after experiments, and we can roughly judge that the residual stress has a noticeable difference with the change of N 2 flow rate and power. Apparently, even with the increase of N 2 flow rate and process power (see Table 1), the compressive stress changes to tensile stress (see Figure 6a). With the increase of the N 2 /Ar flow ratio and sputtering power, the thickness of AlN increases, indicating that the transition really exists from compressive stress to tensile [30]. Increasing the sputtering power energetically enhances the deposition rate, with the increased kinetic energy transferring to the sputtered atoms. With the sputtering power induced energy transfer, ionized radical mobility, and surface diffusion length also increase, leading to the increased possibility of atomic diffusion [31]. Furthermore, when Ar remains constant, with the increase of N 2 /Ar flow ratio, the proportion of N + or N 2 + increases with the increase of deposition rate [32]. Therefore, the above power and flow rate changes will be highly reflected in OES data through plasma composition and energy changes. In combination with the film measurements by XRD, SEM, and TEM, we can attempt to establish a connection between the OES data and the film results. Thus, machine learning becomes a great way to process complex OES data.  T1  15  15  800  T2  15  30  800  T3  15  45  800  T4  15  60  800  T5  15  45  400  T6  15  45  600  T7  15  45  800  T8 15 45 1000 Figure 6b shows the OES data and XRD of two representative groups. The plot in the red dotted line represents group T6 with tensile stress, and the plots in the blue dotted line patterns represent group T8 with compressive stress. The overall emission intensity T8 is relatively strong from the OES data plots, but it is still challenging to find many differences. In group T8, only AlN (002) orientations that require energy accumulation appear at the 2θ diffraction angle between 30 • and 40 • due to the higher deposition power from the XRD plot. In contrast, in the T6 group, there is an apparent peak in the orientation of AlN (100).

Microstructure of Thin Film
The influence of process parameters on the film stress was mentioned above. However, the microstructure is more intuitive because we can discern the classification of stress and the correlation with the microstructure.
In Figure 7, cross-section CFE-SEM images of group T8 (tensile) and T6 (compressive) are shown. The film thickness of T8 is thicker than that of T6 due to its higher deposition power, and the columnar structure in the CFE-SEM image can be easily observed. Differently, in Group T6, compared with T8, the columnar structure of AlN in CFE-SEM is not apparent [9]. This phenomenon can be explained by adatom mobility. At lower power, the radicals have less kinetic energy, resulting in lower adatom surface mobility. In general, higher adsorbed atom mobility promotes the growth of C-axis oriented AlN thin films because the higher the mobility, the more likely it is that low-energy binding sites leading to crystal growth will be found [10]. In Figure 8, the corresponding electron diffraction (SAED) patterns of a tensile group T8 and a compressive group T6 are shown. In the HR-TEM images, the constitution of columnar grains was analyzed using bright field and SAED patterns. Along the c-axis, a

Microstructure of Thin Film
The influence of process parameters on the film stress was mentioned above. However, the microstructure is more intuitive because we can discern the classification of stress and the correlation with the microstructure.
In Figure 7, cross-section CFE-SEM images of group T8 (tensile) and T6 (compressive) are shown. The film thickness of T8 is thicker than that of T6 due to its higher deposition power, and the columnar structure in the CFE-SEM image can be easily observed. Differently, in Group T6, compared with T8, the columnar structure of AlN in CFE-SEM is not apparent [9]. This phenomenon can be explained by adatom mobility. At lower power, the radicals have less kinetic energy, resulting in lower adatom surface mobility. In general, higher adsorbed atom mobility promotes the growth of C-axis oriented AlN thin films because the higher the mobility, the more likely it is that low-energy binding sites leading to crystal growth will be found [10].  (Table 1) of the experiment for MLP model training; (b) OES data plot (colored line: mean value of data, shadowed space: complete data) and XRD plot from 30 to 40 degree in 2θ diffraction angle of specific groups (T6 and T8).

Microstructure of Thin Film
The influence of process parameters on the film stress was mentioned above. However, the microstructure is more intuitive because we can discern the classification of stress and the correlation with the microstructure.
In Figure 7, cross-section CFE-SEM images of group T8 (tensile) and T6 (compressive) are shown. The film thickness of T8 is thicker than that of T6 due to its higher deposition power, and the columnar structure in the CFE-SEM image can be easily observed. Differently, in Group T6, compared with T8, the columnar structure of AlN in CFE-SEM is not apparent [9]. This phenomenon can be explained by adatom mobility. At lower power, the radicals have less kinetic energy, resulting in lower adatom surface mobility. In general, higher adsorbed atom mobility promotes the growth of C-axis oriented AlN thin films because the higher the mobility, the more likely it is that low-energy binding sites leading to crystal growth will be found [10]. In Figure 8, the corresponding electron diffraction (SAED) patterns of a tensile group T8 and a compressive group T6 are shown. In the HR-TEM images, the constitution of columnar grains was analyzed using bright field and SAED patterns. Along the c-axis, a In Figure 8, the corresponding electron diffraction (SAED) patterns of a tensile group T8 and a compressive group T6 are shown. In the HR-TEM images, the constitution of columnar grains was analyzed using bright field and SAED patterns. Along the c-axis, a textured growth can be observed. Like other wurtzite-type structures, planar defects like stacking and dislocation faults are observable in the bright field image. It can also be inferred from the SAED patterns that the arrangement of columnar grains is not perfect. Displacement of the (002) reflection can be found [12]. In the bright field image of group T6, compared with the T8, the texture seems disorderly. The SAED pattern is also a more apparent polycrystalline structure consistent with the result of complicated crystallization in XRD.
Materials 2021, 14, x FOR PEER REVIEW 8 textured growth can be observed. Like other wurtzite-type structures, planar defects stacking and dislocation faults are observable in the bright field image. It can also be ferred from the SAED patterns that the arrangement of columnar grains is not per Displacement of the (002) reflection can be found [12]. In the bright field image of gr T6, compared with the T8, the texture seems disorderly. The SAED pattern is also a m apparent polycrystalline structure consistent with the result of complicated crystalliza in XRD. It is not difficult to conclude that there is a correlation between the complex mi structure and differing film stress. It can be seen from the SEM results that the growt the crystal can lead to the change of stresses, and TEM results (Figure 8) show that XRD measurement is effective. Apparently, this is a merit that, from this study, we encouraged to bypass these complex ex-situ film analyses and resolve film stress iss through machine learning without taking too much time for analysis and validation.

Model Preparing and Performing
After completing the stress measurement, we package eight OES data groups divide them into compressive stress and tensile stress for PCA (principal component a ysis) and MLP (multilayer perceptron). We need to consider which PCs (principal c ponents) are needed to feed to MLP and what kind of hidden layers structure we nee define our training model. To deal with the input neurons of PCA, we first calculate contribution of these data in different PCs. The percentage of variance of these group data from PC1 to PC5 are 88.0%, 8.9%, 1.5%, 1.0%, and 0.4%, respectively. It can be fo that the cumulative rate of the first three PCs can reach 98.4%. The first three PCs enough to represent complete data, and in order to avoid irrelevant OES noise being u for classification, we only consider using PC1 to PC3 [22].
It has been shown that MLP is a universal approximator. As long as there are eno hidden nodes, MLP can approximate any needed function. In practice, it is also hard know how many nodes are needed and to be used in hidden layers of a given probl We do not know how to learn or set its weight effectively. However, the single hid layer solution is very inefficient [29]. Through many tests, we select  (which me six nodes in the first hidden layer, three nodes in the second hidden layer) and [9-6-3 our consideration of hidden layers as nodes in the first, second, and third hidden laye the process of training, we applied the Levenberg-Marquardt algorithm (LMA) as training model. Although this method requires more computing power, it is the fas one for training, and it is worthy for thousands of training times [33]. It is not difficult to conclude that there is a correlation between the complex microstructure and differing film stress. It can be seen from the SEM results that the growth of the crystal can lead to the change of stresses, and TEM results (Figure 8) show that our XRD measurement is effective. Apparently, this is a merit that, from this study, we are encouraged to bypass these complex ex-situ film analyses and resolve film stress issues through machine learning without taking too much time for analysis and validation.

Model Preparing and Performing
After completing the stress measurement, we package eight OES data groups and divide them into compressive stress and tensile stress for PCA (principal component analysis) and MLP (multilayer perceptron). We need to consider which PCs (principal components) are needed to feed to MLP and what kind of hidden layers structure we need to define our training model. To deal with the input neurons of PCA, we first calculate the contribution of these data in different PCs. The percentage of variance of these groups of data from PC1 to PC5 are 88.0%, 8.9%, 1.5%, 1.0%, and 0.4%, respectively. It can be found that the cumulative rate of the first three PCs can reach 98.4%. The first three PCs are enough to represent complete data, and in order to avoid irrelevant OES noise being used for classification, we only consider using PC1 to PC3 [22].
It has been shown that MLP is a universal approximator. As long as there are enough hidden nodes, MLP can approximate any needed function. In practice, it is also hard to know how many nodes are needed and to be used in hidden layers of a given problem. We do not know how to learn or set its weight effectively. However, the single hidden layer solution is very inefficient [29]. Through many tests, we select  (which means six nodes in the first hidden layer, three nodes in the second hidden layer) and [9-6-3] for our consideration of hidden layers as nodes in the first, second, and third hidden layer. In the process of training, we applied the Levenberg-Marquardt algorithm (LMA) as our training model. Although this method requires more computing power, it is the fastest one for training, and it is worthy for thousands of training times [33].
As shown in Table 2, after identifying several preferred neurons (PCs or complete spectrum) as well as input and hidden layer structures, we want to know how the formed MLPs perform. After processing these formed MLPs, we obtain some indicators for further consideration. At the same time, we also confirm whether we should feed the data into MLP after PCA processing. We used the complete spectrum (1900 raw spectrum point) that PCA did not process as input neurons. These models are trained by MATLAB 2021a in this study using build-in ANNs functions. The percentage of exemplars for training is 70%, 15% is for training cross-validation, and 15% is for testing. We use accuracy (ACC), root mean squared error (RMSE), and epoch to judge to find the most suitable training model for our experiment [33]. ACC is the probability that the training model can correctly classify the input contents, and the equation is as follows: where true positive (TP) and true negative (TN) represent cases where the prediction results are consistent with the actual situation, and the value of P + N represents the whole training samples. RMSE is a convenient method to measure "average error" and is calculated by the difference between the true condition and the predicted condition, and the sum is averaged and then is rooted. Equation of RMSE is given as follows: where y is an estimated parameter andŷ is an estimator, and the value can be 0 or 1 based on whether it is compressive or tensile. The smaller the RMSE value is, the better the accuracy of the prediction model to describe experimental data [34]. When complete data passes through the MLP once and returns, this is called an epoch. In these pieces of training, the maximum epochs are set to 1000, and the "epoch #" in the table indicates how many times you have trained to achieve the best performance. However, there is no conclusion regarding how many epochs are suitable, but as the updating time neural network increases, the curve to classify changes from underfitting to overfitting.
Although the import of complete spectrum (1900 neurons without PCA) has a fragile advantage, we also found that it takes about 250-300 times more in actual model training time and about 25 times more in model building time (building time includes the time of data loading data processing, model training, and model output). The above time results were tested on the Apple M1 SoC (silicon on chips) platform with 16 GB memory, and the training time can be saved on the better performing device. However, we believe that the general performance device is more in line with the actual application situation of the semiconductor or related industries as well as cost considerations.

Stress Perdition and Verification Result
After various considerations, we finally chose the [9-6-3] MLP with PC1 and PC2 input for our model. Fortunately, the input of PC1 and PC2 can be well rendered by using a plot. As shown in Figure 9, the junction of the red area and the blue area is the classification result trained by machine learning. Because of the dense data point, we also plot a local enlarged image on the right. The blue dots represent the neurons with compressive input and the red dots represent tensile input. If dots appear in the area of the same color, this model is correct for the input neuron classification. Otherwise, the model is not able to distinguish correctly.
semiconductor or related industries as well as cost considerations.

Stress Perdition and Verification Result
After various considerations, we finally chose the [9-6-3] MLP with PC1 and PC2 input for our model. Fortunately, the input of PC1 and PC2 can be well rendered by using a plot. As shown in Figure 9, the junction of the red area and the blue area is the classification result trained by machine learning. Because of the dense data point, we also plot a local enlarged image on the right. The blue dots represent the neurons with compressive input and the red dots represent tensile input. If dots appear in the area of the same color, this model is correct for the input neuron classification. Otherwise, the model is not able to distinguish correctly. Figure 9. Performed model using a [9-6-3] MLP with PC1 and PC2 input, and successfully classified the stress as compressive or tensile.
However, we desired to know whether this model can be applied under other experimental conditions. We have two variables (nitrogen flow rate and sputtering power) in the training experiment, each with four values. In designing the verification experiment, we randomly combined the two variables. We collected the spectrum in the following experiments with different N2 flow rates and power. After the film measurement, we randomly selected spectrums from verification experiment V1-V16, as shown in Table 3, to evaluate the model's actual performance. Among them, the prediction of 60 random experiments was compressive, and the prediction of the other 60 random experiments was tensile. We use PCA to process these random cross-validation spectrums with the previous OES data, then plot them on the previously trained model, as shown in Figure 10. Green points represent tensile perditions and orange point represents compressive predictions. If the prediction is correct, the points are plotted as circles. The points are plotted as crosses on the contrary. However, we desired to know whether this model can be applied under other experimental conditions. We have two variables (nitrogen flow rate and sputtering power) in the training experiment, each with four values. In designing the verification experiment, we randomly combined the two variables. We collected the spectrum in the following experiments with different N 2 flow rates and power. After the film measurement, we randomly selected spectrums from verification experiment V1-V16, as shown in Table 3, to evaluate the model's actual performance. Among them, the prediction of 60 random experiments was compressive, and the prediction of the other 60 random experiments was tensile. We use PCA to process these random cross-validation spectrums with the previous OES data, then plot them on the previously trained model, as shown in Figure 10. Green points represent tensile perditions and orange point represents compressive predictions. If the prediction is correct, the points are plotted as circles. The points are plotted as crosses on the contrary. The table in Figure 10 presents a confusion matrix on the cross-validation test performance. A confusion matrix is typically used in supervised learning with visualization. It is so named because it is easy to see whether the machine has confused two different classes through this matrix. We can receive the values of recall, precision, and F1 score to score this model through calculation. The recall is the proportion of positive samples that can be predicted among all positive samples, and the precision is the proportion of positive samples among all positive samples. F1 score is the harmonic average of the two, which can be regarded as a good general index to see the performance of this model [21]. We are pretty satisfied with the performance of this model. At present, there are only 10,000 sets of training data. We believe that thin film stress classification will become more effective if more data are used during training of the model. The table in Figure 10 presents a confusion matrix on the cross-validation test performance. A confusion matrix is typically used in supervised learning with visualization. It is so named because it is easy to see whether the machine has confused two different classes through this matrix. We can receive the values of recall, precision, and F1 score to score this model through calculation. The recall is the proportion of positive samples that can be predicted among all positive samples, and the precision is the proportion of positive samples among all positive samples. F1 score is the harmonic average of the two, which can be regarded as a good general index to see the performance of this model [21]. We are pretty satisfied with the performance of this model. At present, there are only 10,000 sets of training data. We believe that thin film stress classification will become more effective if more data are used during training of the model.

Conclusions
In this study, we have carried out eight groups of experiments to study the effect of stress obtained using XRD measurement and correlating with TEM microstructures on AlN thin films processed by a pulsed DC reactive sputtering with an in-situ OES. With different experimental conditions, we observed the transformation of stress from compressive to tensile. Stress is a very complicated physical characteristic affected by many factors, such as deposition parameters (i.e., power and gas flow) and film microstructures. However, we have proven that machine learning combining PCA with ANNs can solve the semiconductor process problems more efficiently. Combined with PCA, the prediction time is highly preserved with a shallow loss of accuracy. In recent years, machine learning has become a significant trend. Apparently, this is a merit that we have bypassed with these complex ex-situ measurements and analyses on thin films to create a method for insitu stress prediction and classification with in-situ OES data. We also expect that these methods can be further extended and could be beneficial to the semiconductor and related industries in the future.

Conclusions
In this study, we have carried out eight groups of experiments to study the effect of stress obtained using XRD measurement and correlating with TEM microstructures on AlN thin films processed by a pulsed DC reactive sputtering with an in-situ OES. With different experimental conditions, we observed the transformation of stress from compressive to tensile. Stress is a very complicated physical characteristic affected by many factors, such as deposition parameters (i.e., power and gas flow) and film microstructures. However, we have proven that machine learning combining PCA with ANNs can solve the semiconductor process problems more efficiently. Combined with PCA, the prediction time is highly preserved with a shallow loss of accuracy. In recent years, machine learning has become a significant trend. Apparently, this is a merit that we have bypassed with these complex ex-situ measurements and analyses on thin films to create a method for in-situ stress prediction and classification with in-situ OES data. We also expect that these methods can be further extended and could be beneficial to the semiconductor and related industries in the future.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the data sets are too large in this study.