Unsupervised Pre-Training of Imbalanced Data for Identification of Wafer Map Defect Patterns

Visual defect inspection and classification are significant steps in most manufacturing processes in the semiconductor and electronics industries. Known and unknown defects on wafer maps tend to cluster, and these spatial patterns provide valuable process information that helps manufacturers determine the root causes of abnormal processes. In previous studies, data augmentation-based deep learning (DL) techniques were most commonly used for the identification of wafer map defect patterns (WMDP). Data augmentation is an effective technique for improving the accuracy of modern image classifiers. However, current data augmentation implementations were manually designed for the WMDP problem. In this study, we propose a DL-based method with automatic data augmentation for the WMDP task. It focuses on learning effective discriminative features from wafer maps through a deep network structure. The network combines a convolution-based variational autoencoder (CVAE) and a classifier that are trained sequentially. First, we pre-trained the CVAE on large training data in an unsupervised manner. Second, we fine-tuned the encoder of the CVAE, which was followed by a neural network (NN) classifier, in a supervised manner. Additionally, we describe a simple procedure for automatically searching for improved data augmentation policies. The policy mainly consists of five image processing functions: rotation, flipping, shifting, shearing range, and zooming. The effectiveness of the proposed method was demonstrated through experimental results obtained from a simulation dataset and a real-world wafer map dataset (WM-811K). This study provides guidance for the application of deep learning in semiconductor manufacturing processes to improve product quality and yield.


I. INTRODUCTION
In conjunction with the fourth industrial revolution, the semiconductor market has been expanding rapidly [1]-[3]. Semiconductor demand has been exploding in areas such as smartphones, virtual reality, automobiles, wearable devices, internet of things (IoT), and robotics [4]-[6]. Many diverse products are in demand. Semiconductor lines have become diverse, and the semiconductor fabrication process is complicated. Semiconductor manufacturers need to produce semiconductor products with high yields and high quality to ensure market competitiveness. Semiconductor processes increase productivity through facility diagnosis, process control, stabilization of yield rate, and so on. In addition, the semiconductor fabrication process has been continually refined, and design complexity has increased to enhance productivity and semiconductor integration [7]-[9].
The associate editor coordinating the review of this manuscript and approving it for publication was Fan Zhang.
Semiconductor fabrication is conducted in two processes, from wafer fabrication to manufacturing the finished product. The first is the fabrication of the integrated circuits on the wafer surface. The second is the testing process, performed on each unit die or chip after fabrication, which produces the wafer map. As the fabrication process becomes more challenging and complicated, the number of defects increases. The processed wafer is tested after the fabrication process, as detailed later, which helps identify several defects [10]-[12]. As semiconductor manufacturing becomes more complicated and the difficulty of the refined process techniques increases, new types of wafer map defects appear. This is because each defect pattern of the wafer map has a different generating mechanism. It is crucial to classify wafer maps automatically to eliminate the cause of defects.
Most of the steps used in semiconductor fabrication are conducted using a wafer map. If there are abnormalities in the manufacturing process, defects will occur on the wafer. There are various types of defect patterns, depending on the manufacturing methods or the features of the abnormal unit processes. These defect patterns can be detected using wafer map data from the test step of a wafer. To identify the abnormal process that causes wafer defects at an early stage and to take steps to recover the yield rate, it is necessary to analyze the wafer map [13]. Sorting out defective items during semiconductor fabrication involves electrical die sorting (EDS) [14], which tests the electrical operation of each semiconductor chip on a wafer. To improve the yield rate, engineers define and classify the forms of defective wafers and identify the wafer maps resulting from the EDS test [15]. Fig 1 shows an example of a wafer map. The large circle indicates a wafer, and the small rectangles inside represent individual dies. White indicates that a die passed all the tests without any error, and other colors indicate that a die did not pass the test. In ordinary semiconductor manufacturing companies, skilled experts classify and analyze defect patterns of wafer maps manually. However, with this method, the classification performance for wafer map defect patterns can differ depending on the ability of the experts. Additionally, as production increases with the growth of semiconductor demand, it is difficult to cope by relying on experts alone [16]. Correspondingly, extra capacity is needed so that the inspection system can cope during periods of high productivity. Using a machine learning model that learns the knowledge of experts is one solution to increase capacity. Therefore, there is much research on handling these issues using machine learning or deep learning techniques. However, previous research faced some limitations.
For example, some methods could classify only the defect patterns seen in the learning step. Additionally, as fabrication becomes more refined and complicated, the defect patterns of wafer maps will vary. Therefore, it is necessary to develop a model that recognizes new types of defective wafer map patterns.
Another common problem in many data-oriented real-world semiconductor applications is class imbalance [17]. Previous studies have classified wafer map defect patterns while handling imbalanced data and irrelevant features. In previous work, class imbalance was also handled by data augmentation [18]. Data augmentation is the process of extending the training data by applying class-preserving transformations, such as rotation, flipping, shifting, shearing range, and zooming of an image, to the original data. These transformations have become important tools for improving the accuracy of modern machine learning algorithms. Data augmentation is a popular technique in deep learning applications because of its simplicity. However, applying multiple transformations to the entire dataset can increase its total size tenfold, so data growth can be an expensive process. This may have some advantages in terms of reducing overfitting, but it increases the overall training data. It can also significantly increase the cost of data storage and training time, which can scale linearly or superlinearly with the training set size. Semiconductor engineers have applied many different techniques for wafer defect classification, such as manual visual inspection and machine learning algorithms that use manually extracted features. Therefore, automatic wafer map identification systems need to be developed by taking advantage of machine learning and deep learning methods.
In this study, we consider the data imbalance problem by developing a deep learning-based method. It automatically classifies wafer map defect patterns without manual data augmentation or feature extraction. We employed a convolutional neural network (CNN) to extract visual features from the wafer map images. A generative variational autoencoder (VAE) was used to learn the data distribution and sample augmented data. The data augmentation function includes transformations such as rotation, flipping, shifting, shearing range, and zooming. First, we pre-trained the convolutional variational autoencoder to learn training samples and generate augmented data. Then, we fine-tuned only the encoder part, followed by the neural network (NN) classifier for the classification of wafer map defect patterns.
The contributions of this paper are summarized as follows:
1. We proposed an automatic classification method that employs deep learning techniques, such as CNN and VAE, for wafer map defect patterns without manual data augmentation and feature engineering.
2. We designed a convolutional variational autoencoder (CVAE) that learns the distributions of visual data. It also samples various data transformations to solve data imbalance problems.
3. We automated the process of finding an effective data augmentation policy for a wafer map dataset. Each policy expresses several choices and orders of possible augmentation functions, such as rotation, flipping, shifting, shearing range, and zooming.
4. Comprehensive experiments demonstrate that the proposed method can obtain good results for identifying wafer map defect patterns. By combining convolutional operations and a generative model, we can obtain results competitive with other state-of-the-art deep learning methods. Additionally, we generated wafer map images with various transformations for each non-defect and defect class.
The remainder of this paper is organized as follows. We first review related works in Section II. In Section III, we introduce the proposed method in detail. Section IV reports the experimental settings and results and provides a discussion and analysis. Finally, conclusions and future work are provided in Section V.

II. RELATED WORKS
Research has been conducted to classify defective wafers into each pattern using wafer map information. In this section, we review some recently published research that uses machine learning and deep learning.
In the early stages, research was conducted to extract features from wafer maps and classify defect patterns using machine learning techniques. Machine learning classification algorithms classify the defect patterns based on pre-defined visual features from the wafer map. Features obtained manually or automatically using feature extraction techniques have been explored in computer vision [19]. For example, features were extracted from the wafer map using the Hough transformation, and the defect ratio at the center of the wafer was calculated. A variety of machine learning classification algorithms were then applied to the extracted features to classify defect patterns [20], [21]. In addition, transformation of the wafer map into spatial correlation and dynamic time warping [22] were used for feature extraction, and the resulting defect patterns were classified using the k-nearest neighbor classifier [23]. In a domain analysis study, principal component analysis (PCA) was used to introduce a pattern index of the wafer map, which produces indices and variables focused on the structural features of the wafer map [24]. However, when the produced items fluctuate widely, the model must be rebuilt. Additionally, singular value decomposition was used to transform the wafer map into regularized singular values, followed by a k-nearest neighbor classifier [25]. However, k-nearest neighbors has some problems in real-world situations: it is difficult to achieve good performance when training data are insufficient, and it requires high computation time. In another study, the wafer map was projected with the Radon transform, the data were summarized into four feature subsets, namely, maximum, minimum, mean, and standard deviation, and an ensemble model was proposed based on a decision tree for each feature subset [26].
Moreover, a model using the extracted features of the wafer map was suggested by applying density-based and geometry-based Radon techniques to ensemble classifiers constructed with logistic regression, random forest, gradient boosting machine, and artificial neural network [27].
Recently, various techniques have been proposed for the identification of wafer map defect patterns by taking advantage of deep learning. For example, CNNs, which take the intact original images as input without feature extraction or spatial filtering of wafer maps, have been widely studied. In CNNs, the features necessary for classification are learned automatically through the convolution layers [18]. In another recent study using a CNN, wafer maps were constructed according to 22 defect patterns defined in advance, and the patterns were then classified by a CNN and used for image retrieval. Even though the classification model showed an accuracy of 98% on the artificial data, it showed an accuracy of 68% on some patterns extracted from the real data. This demonstrates the limitations of artificial data [28].
Moreover, Kyeong and Kim [29] proposed a CNN-based classification model to classify mixed-type defect patterns in wafer bin maps separately for each pattern: circle, ring, scratch, and zone. Cheon et al. [30] proposed an automatic defect classification method based on deep learning that was designed to achieve high classification performance for known defect classes and also classify unknown defects. Jin et al. [31] proposed a clustering-based defect pattern detection and classification framework, based on the density-based spatial clustering of applications with noise. Ishida et al. [32] proposed a deep learning-based failure pattern recognition framework that only uses data augmentation techniques with noise reduction, without accessing a large amount of training data. Shen and Yu [33] integrated wafer map defect recognition with deep transfer learning, which reduces the training time and improves the feature learning performance. It also addresses the problem of class imbalance. Wang and Chen [34] used extracted features based on three types of masks: polar masks, line masks, and arc masks. These masks extract rotation-invariant features for classifying defect patterns. Yu [35] proposed an enhanced stacked denoising autoencoder with manifold regularization techniques to generate discriminative features from wafer maps. Yuan-Fu [36] used automatic optical inspection to visualize defect patterns and identify the root causes of die failures. Then, CNN and extreme gradient boosting methods were employed for wafer map retrieval and defect pattern classification. Shawon et al. [37] also modified the CNN architecture to improve the classification performance and used data augmentation techniques to solve the data imbalance problem. Nakazawa and Kulkarni [38] proposed a deep convolutional encoder-decoder neural network architecture for detecting wafer map defect patterns, as well as segmentation. Yu et al. [39] proposed a stacked convolutional sparse denoising auto-encoder for wafer map pattern recognition and a feature learning method to learn discriminative features from wafer maps. Yu and Liu [40] proposed a deep neural network, which is a two-dimensional PCA-based convolutional auto-encoder for wafer map defect recognition. Alawieh et al. [41] used a deep selective learning technique and featured an integrated reject option where the model chooses to abstain from predicting a class label when the misclassification risk is high. Thus, there is a tradeoff between the prediction coverage and the risk of misclassification. Jang et al. [42] proposed an ensemble model of a one-versus-one method that uses a CNN as the base classifier for wafer map classification, and then examined the open set recognition problem, in which wafer maps must be classified using major defect patterns. Tsai and Lee [43] proposed a CNN encoder-decoder-based data augmentation and depth-wise separable convolution-based defect classification. They also developed a classifier with a reduced-weight architecture based on depth-wise separable convolutions [44]. Yu et al. [45] addressed the problem of insufficient labeled images with various defects. They proposed a semi-supervised deep-learning-based transfer learning algorithm by joining features and labels in an adversarial network. Jin et al. [46] presented an image-based classification method for wafer map defect patterns without any specific preprocessing. They extracted high-level features with a CNN and fed them to a combination of error-correcting output codes and support vector machines for the classification of wafer map defect patterns. Wang and Chen [47] used polar mapping to transform the circular wafer map into a matrix before training the CNN. They also applied a data augmentation technique to eliminate the effects of rotation. Saqlain et al. [47] addressed the data imbalance and irrelevant features problem using data augmentation techniques such as rotation, flipping, shifting, shearing range, and zooming applied to the original data.
Owing to the limitations of previous studies, we developed a novel classification technique by modifying the CVAE. The modified CVAE automatically performs data augmentation without manual rules or large data generation. In addition, pseudo-data are generated from the distribution of each class label. The experimental results demonstrate the efficiency of the proposed method.

III. PROPOSED METHOD
In this section, we discuss the basic structure of the proposed method in detail. We also provide the training procedure and hyperparameter settings.

A. ARCHITECTURE
Wafer maps provide important information when represented as images for engineers to identify the root causes of die failures during semiconductor manufacturing processes. In computer vision, the CNN is a deep learning-based technique commonly applied to analyzing visual imagery. In real-world problems, data imbalance is a critical issue. As discussed, the CNN is the basic technique adopted in the identification of wafer map defect patterns, and data augmentation techniques are generally used for data imbalance problems. In this study, we employed a CNN as our base feature learner. Instead of relying on manual data augmentation, we use a generative model, which learns the data distribution of a high-dimensional dataset and generates new samples from the learned distribution. We designed a CVAE that is enhanced with image operations such as rotation, flipping, shifting, shearing range, and zooming for more effective image generation. We then used a basic NN for the classification of defect patterns. It calculates the probability distribution over the class labels, and the class with the maximum probability is chosen as the final prediction. First, we pre-train the CVAE model by minimizing the reconstruction loss, for which the mean squared error is used. Second, we train the NN classifier by minimizing the cross-entropy loss. An overview of the proposed method is presented in Fig 2. As shown, wafer map images are input to the proposed method, which identifies whether they are defective or not. The common defect patterns are edge ring, edge local, center, local, scratch, random, donut, and near-full. In the following sections, we explain the proposed method in detail.

1) CONVOLUTIONAL NEURAL NETWORK
A CNN is a type of deep neural network with the capability of extracting useful features by utilizing several convolutional operators. It is particularly suitable for two-dimensional data structures; therefore, it is a popular pattern recognition classifier in image processing.
In a CNN, as a weighted kernel K slides over every position of the input data x, the convolution of the input data and the kernel is computed, resulting in a feature map:

$$S(i, j) = (x * K)(i, j) = \sum_{m}\sum_{n} x(m, n)\, K(i - m, j - n)$$

where S is the feature map resulting from input data x and kernel K, and * denotes the convolution operation. Typically, the kernel size is smaller than the input data size, but with greater depth. This means that several different kernels are applied to the input data at the same time, resulting in the same number of feature maps. The weights of the kernels are adjusted during training.
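As an illustration of how a kernel sliding over the input produces a feature map, the following is a minimal pure-Python sketch, not the implementation used in this study. It follows the convention of most deep learning libraries, which compute the sliding-window sum without flipping the kernel (cross-correlation) and without padding. The toy wafer map, the 2 × 2 kernel, and the function name `conv2d_valid` are assumptions made for the example.

```python
def conv2d_valid(x, kernel):
    """Slide `kernel` over every valid position of `x` and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the input patch under the kernel
            # (no kernel flip, as in most deep learning libraries).
            s = 0.0
            for m in range(kh):
                for n in range(kw):
                    s += x[i + m][j + n] * kernel[m][n]
            row.append(s)
        feature_map.append(row)
    return feature_map


# Toy 4x4 "wafer map" (assumed data): 1 marks a failed die, 0 a passed die.
wafer = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical 2x2 kernel that counts failed dies in each patch.
kernel = [[1, 1], [1, 1]]
feature_map = conv2d_valid(wafer, kernel)
```

In this toy case the largest response appears where the failed dies cluster, which is the kind of local spatial evidence that stacked convolution layers build on. In a trained CNN the kernel weights are learned rather than fixed by hand.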
Although CNNs are mostly applied for the identification of wafer map defect patterns, they have also been successfully explored in fault classification and diagnosis in semiconductor manufacturing processes [48]. Because wafer map defect patterns have the same two-dimensional data structure as images, a CNN designed for analyzing images is suitable for their identification.

VOLUME 9, 2021

2) VARIATIONAL AUTOENCODER
The VAE, an important generative model, has a network structure similar to that of an autoencoder, which consists of two parts: an encoder and a decoder. In the autoencoder, the encoder defines a mapping from input data $x \in \mathbb{R}^{d_x}$ to a latent variable $z \in \mathbb{R}^{d_z}$, while the decoder defines a mapping back from the latent variable z to the input space, which outputs the reconstruction $\hat{x}$. The training objective of the autoencoder is to make the reconstruction $\hat{x}$ as close as possible to the original x, forcing the autoencoder to learn the latent features of normal data. In the VAE, the latent variable z is constrained to be distributed according to a prior distribution $p_\theta(z)$, usually a multivariate unit Gaussian $\mathcal{N}(0, I)$, forcing the model to learn the distribution of the input data. However, when mapping from the input data x to the latent variable z, the posterior $p_\theta(z \mid x)$ in Equation (3) is usually intractable because $p_\theta(x)$ is also intractable.

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)} \tag{3}$$
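Hence, variational inference techniques are used to solve this problem in a tractable manner by finding an approximate posterior $q_\phi(z \mid x)$:

$$q_\phi(z \mid x) = \mathcal{N}\left(z;\, \mu_z,\, \sigma_z^2 I\right)$$

where the mean $\mu_z$ and standard deviation $\sigma_z$ of the approximate posterior $q_\phi(z \mid x)$ are derived by the encoder. Given an inference model $q_\phi(z \mid x)$, the evidence lower bound (ELBO) can be derived as follows:

$$\log p_\theta(x) = \mathcal{L}(\theta, \phi; x) + D_{KL}\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right) \tag{8}$$

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right)$$

In Equation (8), the first term is the ELBO, and the second term is the Kullback-Leibler (KL) divergence of the approximate posterior $q_\phi(z \mid x)$ from the true posterior $p_\theta(z \mid x)$. To ensure that $q_\phi(z \mid x)$ gets closer to $p_\theta(z \mid x)$, the KL divergence between them has to be minimized. Because $\log p_\theta(x)$ does not depend on $\phi$, minimizing this KL divergence is equivalent to maximizing the ELBO. Therefore, the loss function of the VAE can be expressed as follows:

$$\mathcal{L}_{VAE} = -\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right)$$

The VAE has been successfully applied in different domains. With a sliding window, the VAE can be used for the clustering of wafer map patterns [49]. However, the standard VAE with a CNN has not been used to classify wafer map defect patterns. Hence, the standard VAE needs to be modified to identify wafer map defect patterns by addressing imbalanced data problems.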
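The VAE loss described above combines a reconstruction term with a KL term that pulls the approximate posterior toward the unit Gaussian prior. The sketch below shows one common way to compute it for a single sample, assuming a diagonal Gaussian posterior (for which the KL term has a closed form) and the mean squared error as the reconstruction term. The log-variance parameterization, the `kl_weight` argument, and all function names are assumptions for illustration rather than details taken from this study.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def mse(x, x_hat):
    """Mean squared reconstruction error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Negative-ELBO-style loss: reconstruction error plus weighted KL term.

    `kl_weight` is a hypothetical knob; 1.0 gives the plain sum.
    """
    return mse(x, x_hat) + kl_weight * kl_to_unit_gaussian(mu, log_var)


# A posterior identical to the prior contributes no KL penalty.
loss_perfect = vae_loss([1.0, 0.0], [1.0, 0.0], mu=[0.0], log_var=[0.0])
# Shifting the posterior mean away from zero is penalized.
kl_shifted = kl_to_unit_gaussian([1.0], [0.0])
```

Using the mean squared error corresponds to a Gaussian likelihood up to a scaling constant, so the relative weight of the two terms is a design choice in practice.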
The search algorithm used in our experiment is based on reinforcement learning, inspired by [50]-[54]. It has two components: a controller, which is a recurrent neural network (RNN), and a training algorithm, which is the proximal policy optimization algorithm [55]. At each step, the controller predicts a decision produced by a softmax, and the prediction is then fed into the next step as an embedding. In total, the controller has 46 softmax predictions to predict policies, each requiring an operation type and a probability. The controller is trained with a reward signal that measures how much a policy improves the generalization of a ''child model'' (a neural network trained as part of the search process). In our experiments, we set aside a validation set to measure the generalization of the child model. A child model is trained using the augmented data generated by applying the policies to the training set. For each example in the minibatch, one of the policies is chosen randomly to augment the image. The validation accuracy of the child model is used as the reward signal to train the RNN controller. As shown in Fig 3, the RNN controller predicts an augmentation policy from the search space. A child network with a fixed architecture is trained to convergence, and its accuracy serves as the reward. The reward is used, with the policy gradient method, to update the controller so that it can generate better policies over time.
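The way a policy is applied to a minibatch can be illustrated with a small sketch. It assumes that a policy is a list of (operation name, probability) pairs and that, for each image, one policy is drawn at random and each of its operations is applied with its probability. The three toy operations, the `OPERATIONS` table, and the function names are hypothetical; the RNN controller and the reward computation are not shown.

```python
import random


def flip_horizontal(img):
    return [row[::-1] for row in img]


def flip_vertical(img):
    return img[::-1]


def shift_right(img, pixels=1):
    # Zero-filled shift by a small number of columns (pixels <= image width).
    return [[0] * pixels + row[: len(row) - pixels] for row in img]


# Hypothetical operation table; the study also lists rotation, shearing, zooming.
OPERATIONS = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "shift_right": shift_right,
}


def apply_policy(img, policy, rng):
    """Apply each (operation, probability) pair of one policy in order."""
    for op_name, prob in policy:
        if rng.random() < prob:
            img = OPERATIONS[op_name](img)
    return img


def augment_minibatch(batch, policies, rng):
    """For each example, choose one policy at random and apply it."""
    return [apply_policy(img, rng.choice(policies), rng) for img in batch]


rng = random.Random(0)
policies = [
    [("flip_horizontal", 0.8), ("shift_right", 0.3)],
    [("flip_vertical", 0.5)],
]
batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
augmented = augment_minibatch(batch, policies, rng)
```

In a search loop, the validation accuracy of a child model trained on such augmented minibatches would be returned to the controller as the reward.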

4) NEURAL NETWORK CLASSIFIER
To establish a predictive model, we attach a simple NN classifier downstream of the CVAE and fine-tune the CVAE encoder part ($f_{CVAE(encoder)}$) and feature extraction layers in an end-to-end manner for the identification of wafer map defect patterns. The predictor function ($f_{NN}$) can be summarized in Equation (10) as follows:

$$\hat{y} = f_{NN}\left(f_{CVAE(encoder)}(x)\right) \tag{10}$$

The objective of the NN classifier is to predict the true class labels by minimizing the cross-entropy loss between the predicted distribution and the ground truth distribution. The objective function of the predictor network (classification loss) is shown in Equation (11):

$$\mathcal{L}_{CE} = -\sum_{c} y_c \log \hat{y}_c \tag{11}$$

where y is the ground truth value, $\hat{y}$ is the predicted value, and the sum runs over the class labels c. The supervised NN classifier network predicts each wafer map as one of the given defect patterns or as non-defect.
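The classification step described above, a probability distribution over class labels followed by a cross-entropy loss against the true label, can be sketched as follows. This is a generic softmax and cross-entropy computation in pure Python, not the study's code. The nine classes (eight defect patterns plus non-defect) follow the patterns listed earlier, and the function names are assumptions.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Cross-entropy between the softmax output and a one-hot true label."""
    return -math.log(softmax(logits)[true_class])


def predict(logits):
    """Return the index of the class with the maximum probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Uninformative scores over nine classes give a loss of ln(9).
uniform_loss = cross_entropy([0.0] * 9, true_class=3)
# A confident, correct score vector gives a small loss.
confident_loss = cross_entropy([8.0] + [0.0] * 8, true_class=0)
```

The loss of an untrained, uniform predictor, ln(9) ≈ 2.197, is a useful reference point when reading a training curve for a nine-class problem.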

B. TRAINING
To train a CNN model directly, we need large-scale image data such as the WM-811K dataset [56], which contains more than a hundred thousand images but is highly imbalanced. If large-scale training data are required, the problems to which a CNN can be applied are very limited. To avoid such situations and to make a CNN effective even for small-scale data, two important steps are performed sequentially. The first step is to pre-train the generative model and replay the data samples for downstream tasks. The second step is to fine-tune the encoder of the pre-trained model, followed by a supervised classifier, to perform the prediction.

1) GENERATIVE PRE-TRAINING
During training, the gradients of the loss function are required for the optimization of the ELBO. However, it is not easy to differentiate the loss with respect to the variational parameters φ because the gradients cannot be back propagated through the latent variable z. Hence, the re-parameterization trick, following the work in [57], is applied to overcome this problem.
The latent variable z is assumed to be a deterministic function of x and a random variable ε sampled from a fixed distribution, N(0, 1). Hence, the non-differentiable random variable z is converted to a differentiable function of x and a random ε:

$$z = \mu_z + \sigma_z \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

where $\mu_z$ and $\sigma_z$ are the variational parameters derived from the encoder. The sampling number L during training was set to 1 because one sample was already sufficient. With the negative ELBO as the model loss, we trained the model using the Adam optimizer [58] to update the model weights.
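The re-parameterization trick described above can be sketched in a few lines: the randomness is moved into an auxiliary noise variable ε, so that z becomes a deterministic function of the encoder outputs. The sketch assumes that the encoder outputs a mean and a log-variance per latent dimension (a common parameterization, not confirmed by the text) and draws a single sample, matching L = 1. The function name and the explicit `rng` argument are assumptions.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        eps = rng.gauss(0.0, 1.0)    # noise independent of the parameters
        sigma = math.exp(0.5 * lv)   # standard deviation from log-variance
        z.append(m + sigma * eps)
    return z


rng = random.Random(42)
# Hypothetical encoder outputs for a two-dimensional latent space.
z = reparameterize(mu=[1.0, -2.0], log_var=[0.0, math.log(4.0)], rng=rng)
```

Because ε does not depend on the encoder parameters, gradients can flow through the mean and standard deviation during backpropagation in an automatic differentiation framework.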

2) FINE-TUNING FOR CLASSIFICATION
Fine-tuning involves tuning the parameters pre-trained with large-scale data using small-scale data. We fine-tuned the encoder of the pre-trained CVAE, pre-trained with an imbalanced large amount of data. We added a supervised NN classifier after the encoder of the CVAE, ignoring the decoder part.
With the cross-entropy as the model loss, we again trained the model using the Adam optimizer [58] to update the model weights.

C. HYPERPARAMETERS
In this study, we constructed a CNN-based VAE model for WMDP, which has an encoder and decoder, each consisting of one input layer, eight convolution layers each with batch normalization, padding, and rectified linear unit (ReLU) activation, and five pooling layers (four stacking pairs of convolution-pooling-convolution). The supervised classification layer has one dropout layer, two fully connected layers, and one output layer. For a fair comparison, we used the same convolution-based neural network architecture for all the methods. In this model, each convolution and pooling layer consists of subsampling filters of size 3 × 3 and 2 × 2, respectively. The first convolution layer extracts the features from the input training wafer images of size 224 × 224 pixels. Each convolution layer contained a set of learnable filters to extract unique feature maps. The number of filters increases with increasing depth of the convolution layer, and thus the number of feature maps also increases. However, feature maps become smaller and more complex due to the pooling layer in a deeper network. The proposed CNN-WDI model adopts 16, 32, 64, and 128 feature maps for the first, second, third, and fourth stacking pairs, respectively. The model parameters used in this study are listed in Table 1.
Zero padding was applied to all convolutional layers to ensure that the dimensions of the input and output feature maps were the same. The Softmax activation function was applied to the output layer of the model. In addition, the Adam optimization method, which combines the concepts of Momentum optimization and root mean squared prop (RMSProp), was selected as the optimizer. This optimizer helps achieve a higher accuracy and improves the training process. In addition, after many attempts, other parameters such as batch size, learning rate, and number of pre-training and training epochs were assigned as 128, 0.001, 500, and 20, respectively. A smaller batch size improves the generalization ability by computing an approximation of the gradient value and then updating the other parameters.
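As a worked example of the layer arithmetic implied above, the sketch below tracks the spatial size of a 224 × 224 input through five 2 × 2 pooling layers, assuming that the zero-padded convolutions keep the spatial size unchanged and that each pooling layer uses a stride of 2. The stride and the helper names are assumptions; the exact layer order is given only in the model parameter table.

```python
def same_conv_size(size):
    """Zero-padded ('same') convolution keeps the spatial size unchanged."""
    return size


def pool_size(size, window=2, stride=2):
    """Output size of a pooling layer without padding."""
    return (size - window) // stride + 1


size = 224  # input wafer image is 224 x 224 pixels
sizes = []
for _ in range(5):  # five pooling layers
    size = pool_size(same_conv_size(size))
    sizes.append(size)
```

Under these assumptions the feature maps shrink from 224 to 112, 56, 28, 14, and finally 7 pixels per side, while the number of feature maps grows with depth.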

IV. EXPERIMENTS
In this section, we first describe the experimental dataset used in this study. Then, we show the metrics used for evaluating all the methods. Finally, we provide the comprehensive experimental results.

A. DATASET
The WM-811K dataset is a semiconductor dataset consisting of 811,457 real wafer map images [56]. The wafer images were collected from 46,293 lots in a circuit probe test of the semiconductor fabrication process. A single lot contains 25 wafer maps, so there should be 1,157,325 wafer maps in total (i.e., 46,293 lots × 25 wafer/lot). Not all lots have exactly 25 WMs, due to sensor faults or other unknown reasons, and they were pruned from the dataset. The dataset also contains additional information about each wafer map, such as lot name, die size, wafer index number, failure type, and training and test labels. This is the largest publicly available wafer map dataset that can be accessed on the Multimedia Information Retrieval (MIR) laboratory website [59]. Different sizes of wafer images exist because of their two-dimensional nature and different pixel values along the length and width of the image. We found a total of 632 wafer images of various sizes ranging from 6 × 21 to 300 × 202.
We split the experimental dataset into training, validation, and testing sets, as shown in Table 2.

B. EVALUATION MEASURES
The measurements obtained from the confusion matrix were compared with the classification results reported in similar studies to demonstrate the accuracy of the method. Accuracy, precision, recall, and F1 measure values were obtained from the confusion matrix.
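The abbreviations TP (true positive), FP (false positive), FN (false negative), and TN (true negative) in the confusion matrix in Table 1 have the following meanings: TP is the number of samples of a class that are correctly assigned to that class, FP is the number of samples of other classes that are incorrectly assigned to that class, FN is the number of samples of the class that are incorrectly assigned to other classes, and TN is the number of samples of other classes that are not assigned to that class. The accuracy, precision, recall, and F1 measure were calculated according to the confusion matrix in Table 1. The accuracy was calculated according to Equation (13).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

Precision is the fraction of the samples predicted as a given class that truly belong to that class. The precision was calculated using Equation (14).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{14}$$

Recall is the fraction of the samples of a given class that are correctly classified. This value was calculated according to Equation (15):

$$\text{Recall} = \frac{TP}{TP + FN} \tag{15}$$

Another metric, the F1 measure, combines the precision and recall values in a single measurement. Its value lies between 0 and 1, and it equals 1 if the classifier correctly classifies all samples. The F1 measure is given in Equation (16); a value close to 1 indicates good classification performance.

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{16}$$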
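The metrics described above can be computed directly from a confusion matrix. The following is a minimal pure-Python sketch with a made-up two-class matrix; the data, the row and column convention (rows are true classes, columns are predicted classes), and the function name are assumptions for illustration. In practice, Scikit-Learn provides equivalent routines.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                       # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, results


# Made-up example: class 0 = non-defect (majority), class 1 = defect (minority).
confusion = [
    [90, 10],
    [5, 45],
]
accuracy, metrics = per_class_metrics(confusion)
```

On imbalanced data, the per-class precision, recall, and F1 values are usually more informative than the overall accuracy, which is dominated by the majority class.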
All experiments were executed on a machine with an Intel Xeon E5-2698 v4 CPU @ 2.20GHz, 256GB of memory, an NVIDIA Tesla V100 32GB GPU, and the Ubuntu 18.04 operating system. We also used the Scikit-Learn and PyTorch libraries with the Python programming language for all analyses.

C. RESULTS AND DISCUSSIONS
In this section, we present experimental results, including an analysis of the features learned by the CVAE. We then present a comparative analysis with other baseline methods and discuss the efficiency of the proposed method.

1) GENERATION OF DEFECT PATTERNS
First, we pre-trained the unsupervised CVAE model on the entire training set and validated it on the validation set, as discussed previously. A CNN was used to extract visual features, and a VAE was used to learn the distribution of each class label. During training, we minimized the reconstruction loss, measured as the mean squared error, on the training set. The reconstruction error over 500 epochs on the training set is shown in Fig 5. It decreases steadily, which shows the learning capability of our pre-trained model. During training, we also searched for the optimal augmentation policy, composed of several image processing operations: rotation, flipping, shifting, shearing, and zooming. Fig 6 illustrates examples of each operation applied to the generated samples.
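As an illustration of the pre-training objective, the following minimal sketch computes a VAE-style loss on flattened pixel lists. It assumes the standard VAE formulation, in which the mean squared reconstruction error is combined with a KL regularizer toward a standard normal prior; the function names and the weight `beta` are hypothetical and are not taken from the experiments.

```python
import math

def mse(x, x_hat):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus a weighted KL term (beta is an assumed weight)."""
    return mse(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)

# Toy example: a 2 x 2 wafer map flattened to four pixels and its reconstruction.
x = [0.0, 1.0, 1.0, 0.0]
x_hat = [0.1, 0.9, 0.8, 0.2]
loss = vae_loss(x, x_hat, mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(loss)
```

When the latent mean is zero and the log-variance is zero, the KL term vanishes and the loss reduces to the reconstruction error alone.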
As shown in the figure, the generated images were transformed automatically by image processing operations rather than by manual data augmentation. We used rotation angles from 5 to 45 degrees and horizontal and vertical flipping. These transformations do not change the size of the generated images. In contrast, the other transformations, namely shifting, shearing, and zooming, change the size of the generated images. For example, we zoomed images by between 1% and 20%. The hybrid method, which sequentially integrates image generation and various transformations, can also address the data imbalance problem efficiently.
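The search over augmentation policies can be sketched as follows. This is a minimal illustration that uses plain random search as a stand-in, since the actual search procedure is not detailed in this section. The rotation and zoom ranges are those stated above; the shift and shear ranges, the function names, and the scoring function are assumptions made for the example.

```python
import random

def sample_policy(rng):
    """Draw one candidate augmentation policy."""
    return {
        "rotation_deg": rng.uniform(5.0, 45.0),  # range stated in the text
        "horizontal_flip": rng.random() < 0.5,
        "vertical_flip": rng.random() < 0.5,
        "zoom": rng.uniform(0.01, 0.20),         # 1% to 20%, as stated in the text
        "shift": rng.uniform(0.0, 0.10),         # assumed range
        "shear": rng.uniform(0.0, 0.10),         # assumed range
    }

def random_search(score_fn, n_trials=50, seed=0):
    """Keep the sampled policy with the highest score."""
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(n_trials):
        policy = sample_policy(rng)
        score = score_fn(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score

def toy_score(policy):
    """Hypothetical stand-in for a validation score; favors mild rotation and zoom."""
    return 1.0 - abs(policy["rotation_deg"] - 15.0) / 45.0 - policy["zoom"]

best_policy, best_score = random_search(toy_score)
print(best_policy, best_score)
```

In practice, `toy_score` would be replaced by a function that trains or fine-tunes the classifier with the candidate policy and returns a validation measure such as the F1-score, which is what makes the search computationally expensive.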

2) PERFORMANCE EVALUATION
Second, we fine-tuned only the encoder part, followed by a simple neural network classifier, for the WMDP identification task. We trained the supervised classifier on the training dataset and evaluated it on the validation set. During training, we minimized the cross-entropy loss, and the classification loss decreased steadily over all 20 epochs. We evaluated the proposed method on the validation set using standard measures: accuracy, precision, recall, and F1-score. The classification performance on the validation set is shown in Fig 7-10, respectively. We achieved satisfactory results within the first ten epochs. The first ten and last ten epochs are highlighted as solid pink and dashed black lines, respectively. Accuracy (Fig 7) does not give clear information for the imbalanced dataset. We achieved the highest precision of 98.05% at the 5th epoch (Fig 8) and the highest recall of 96.83% at the 8th epoch (Fig 9). The model reached a satisfactory F1-score of 95.82% at the 9th epoch (Fig 10).
We compared the proposed method with baseline methods, namely the SVM [60], ANN [61], VGG-16 [62], and CNN-WDI [18] algorithms. Among the previous works, CNN-WDI [18] shows the highest performance. Because the split of the testing dataset differs, we re-implemented the CNN-WDI method for a fair comparison, and it achieved comparable results, as shown in Table 3. As shown in this table, the methods with manual data augmentation show high results. In this paper, we developed an automatic WMDP identification method without any manual augmentation, because manual data augmentation is very time-consuming and memory-inefficient. Our hybrid method, with the generative model and automatic image transformation operations, can reduce memory usage and human effort. We first developed the CVAE method without any image transformation, only generating data samples. It improved the classification performance by 6%. We then applied automatic image transformation with the policy search strategy to the CVAE method. It shows the highest classification performance among the methods without manual data augmentation and results comparable to those of manual data augmentation techniques. In conclusion, the experimental results in Table 3 highlight the efficiency of our proposed method. Saqlain et al. [47] achieved an F1-score of 87.7% on the original imbalanced data and an F1-score of 96.2% on the manually balanced data. Our proposed method, CVAE with automatic image transformation and the policy search strategy, achieved an F1-score of 95.1% without any human effort. Surprisingly, the proposed CVAE method achieves the highest recall of 96.9%. It is highly competitive with the manual augmentation methods in terms of predictive performance and can reduce much human effort.
Table 4 provides the confusion matrix obtained by our proposed method, CVAE with image transformation. We achieved accuracy higher than 90% for every defect pattern except the Donut pattern.
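Per-class results of this kind can be derived directly from a confusion matrix, as in the following minimal sketch. The counts are hypothetical and are not taken from Table 4; the class names are three of the WM-811K defect patterns, used here only as labels.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(labels)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # rest of row k
        fp = sum(confusion[i][k] for i in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical counts for illustration only.
labels = ["Center", "Donut", "Edge-Ring"]
confusion = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 5, 95],
]
metrics = per_class_metrics(confusion, labels)
for label in labels:
    print(label, metrics[label])
```

The per-class recall is the diagonal entry divided by its row sum, which is the quantity usually read off a row-normalized confusion matrix.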
In this paper, we addressed the issue of manual data augmentation, which requires much human effort. Instead of manually transforming training data, we automatically generated synthetic data similar to the original images and added an image transformation function with a policy search strategy. For a fair comparison, we selected the same image transformation techniques used in previous works. This approach removes many preprocessing steps and is highly scalable, because more image transformation techniques can be added. As shown in Table 3, the performance of the proposed CVAE method is lower than that of the best manually augmented method. However, we expect that it can be improved quickly by adding other image transformation techniques.
The policy search algorithm is very efficient in finding the best augmentation policy from many possible states, even when there are many transformation techniques. It may increase the computation time and memory usage, but this is not critical in this research, and it can be reduced at the application level for real-world scenarios.

V. CONCLUSION
In this study, we developed a DL-based method, CVAE for WMDP, which employs a CNN as a feature extractor and exploits the full connection between the features and the subsequent convolved images in an unsupervised manner. A simple NN classifier was used to identify the defect patterns from input images in a supervised manner. Robust and discriminative features can be extracted from wafer maps through this network to improve WMDP identification. Additionally, an automatic policy search procedure was defined for improved data augmentation, instead of using manual functions. CVAE achieves better recognition results on real-world wafer map datasets than traditional WMDP methods and other DL models that do not use manual data augmentation. The comprehensive experimental results verify that the CVAE is capable of learning effective features from wafer maps. This study provides a new method for the identification of WMDP using generative DL models with an automatic data augmentation procedure. It addresses the problems of data imbalance and limited training data, which lead to overfitting of DL-based methods.
The limitations of the proposed method are as follows. In general research on wafer map defect patterns, most methods use the limited datasets that are publicly available. More challenging data are needed in this semiconductor manufacturing research field. We proposed automatic techniques, namely a generative model and image transformation with the policy search strategy, to reduce human effort. However, these techniques increase the computational cost, although it can be reduced. We considered only five transformations in the image transformation phase: rotation, flipping, shifting, shearing, and zooming. In addition, there is no exact value for the size of the augmented data required for training.
In the future, we will seek more data covering more challenging issues in this research field. We will also carry out further research on other generative models, such as generative adversarial networks, and on improved deep network architectures to disclose the properties of CVAE. Additionally, fast and adaptive algorithms for searching data augmentation policies will be considered. We will improve the proposed method in terms of both computational cost and predictive performance for developing real-world applications. To increase its capability, we will employ more image transformation techniques and investigate the characteristics of augmented data.

ACKNOWLEDGMENT
(Ho Sun Shon and Erdenebileg Batbaatar contributed equally to this work.)
VOLUME 9, 2021
HO SUN SHON received the B.S. and M.S. degrees in statistics from Sungshin Women University, Seoul, South Korea, in 1986 and 1992, respectively, and the Ph.D. degree in computer science from Chungbuk National University, Cheongju, South Korea, in 2010. She is currently a Researcher with the Research Institute for Computer and Information Communication, Chungbuk National University. Her research interests include machine learning, data mining, pattern recognition, and bioinformatics.
ERDENEBILEG BATBAATAR received the M.S. and Ph.D. degrees in data mining, medical informatics, and computer science from the Database and Bioinformatics Laboratory, Chungbuk National University, South Korea. He is currently a Postdoctoral Researcher of bioinformatics and computer science with Chungbuk National University. His research interests include software engineering, data mining, big data analysis, bioinformatics, machine learning, deep learning, and their applications.
WAN-SUP CHO received the B.S. degree from Kyeongbuk National University, in 1985, and the M.S. and Ph.D. degrees from KAIST, South Korea, in 1987 and 1996, respectively. He is currently a Professor with the Department of Management Information Systems, Chungbuk National University. His research interests include big data platform and data governance with AI and the IoT for smart factories and smart healthcare.
SEONG GON CHOI received the B.S. degree in electronics engineering from Kyeongbuk National University, in 1990, and the M.S. and Ph.D. degrees from Information Communications University, South Korea, in 1999 and 2004, respectively. He is currently a Professor with the College of Electrical and Computer Engineering, Chungbuk National University. His research interests include smart grid, the IoT, mobile communication, high-speed network architecture and protocol.