Crystal Growth Characterization of WSe 2 Thin Film Using Machine Learning

Materials characterization remains a labor-intensive process, with a large amount of expert time required to post-process and analyze micrographs. As a result, machine learning has become an essential tool in materials science, including for materials characterization. In this study, we perform an in-depth analysis of the prediction of crystal coverage in WSe 2 thin film atomic force microscopy (AFM) height maps with supervised regression and segmentation models. Regression models were trained from scratch and through transfer learning from a ResNet pretrained on ImageNet and MicroNet to predict monolayer crystal coverage. Models trained from scratch outper-formed those using features extracted from pretrained models, but fine-tuning yielded the best performance, with an impressive 0.99 R 2 value on a diverse set of held-out test micrographs. Notably, features extracted from MicroNet showed significantly better


Introduction
Great advances are being made in the synthesis of two-dimensional materials (2D), [1][2][3] since the successful isolation of graphene in 2004. 4The transition metal dichalcogenides (TMD) is a major class of 2D materials that have gained much attention due to their interesting properties and potential for applications in areas including electric and optoelectronic, energy, and sensing. 3,5Several synthesis methods, including mechanical exfoliation, 3 powder vaporization, 6,7 pulsed laser deposition, 8 chemical vapor deposition (CVD), and metal-organic chemical vapor deposition (MOCVD) [9][10][11][12] are being deployed in a bid to improve both the quality and scalability of the grown TMDs.Associated with the materials synthesis is the need for an efficient characterization technique to determine the various features of the samples, ranging from the basic crystal qualities to the determination of the properties and potential applications of the materials. 13,149][20][21] Given the fundamental role the information from the AFM image analysis plays in determining the grown sample's quality, even before further characterization to determine their properties and potential applications, the fidelity and efficiency of the analysis are of major priority in the workflow to accelerate the 2D materials qualitative and quantitative synthesis and exploration.
3][24][25][26] This interest stems from its potential to automate processes, reduce human intervention, and swiftly handle large volumes of images.Numerous software tools, such as ImageJ, 27 Gwyddion, 28 WSxM, 29 NanoScope Analysis, 27 Mountain, 30 and MIPAR, 31 have been developed to address image correction, processing, and analysis needs.Some of these tools support automation and batch analysis, enabling the processing of many images.Recently, deep learning strategies have also become a mainstay of this field, and some software, like MIPAR, now support deep learning workflows natively.This raises questions about the performance and reliability of machine learning schemes for materials characterization.Specifically, there is a need to explore the behavior of regression models compared to segmentation models for characterizing micrographs, such as determining the thin film crystal growth on a substrate, quantified by crystal coverage.Furthermore, investigating how pretraining domains and different modes of transfer learning impact the capabilities and reliability of models at inference time is crucial.Analyzing these aspects will contribute to a better understanding of the overall potential of different learning schemes and, more specifically, their suitability for high-throughput characterization.This is essential for accelerating the exploration of TMDs.
Several studies have been reported on the deployment of ML models to the AFM image analysis.Among them are the segmentation of the molecular resolved AFM images, 24 classification of quasi-planar molecules that spans relevant structural and compositional moieties in organic chemistry based on AFM images, 32 identification of self-organized nanostructures, 33 extraction of molecule graphs of samples from AFM images, 34 atomic structure recovery from AFM images, 35 and quantitative analysis of MoS 2 thin film micrographs. 36Crucial to the determination of the quality of the materials synthesis is the domain size and thickness, and surface coverage, 12,18,19,[37][38][39] isolation of the grown crystal from the substrate on which it is grown.
The crystal coverage is a basic metric that indicates the extent to which the thin film has grown on the substrate.A rapid and automated determination of the crystal coverage can enhance materials synthesis as the growth parameters can be optimized based on this figure of merit.In our present study, convolutional regression models are developed to be deployed in determining the crystal coverage of 2D WSe 2 grown using MOCVD. 9Additionally, robust semantic segmentation models [40][41][42][43][44] which give a pixel-wise classification of the grown samples AFM images, as either belonging to the substrate or the crystals, are trained.Our models exhibit excellent results with R 2 exceeding 0.99 in the quantification of the crystal coverage in held-out test samples.Furthermore, we have systematically evaluated the efficacy of different transfer learning schemes, namely feature extraction and fine-tuning.We also include the effects of different pretraining domains, specifically materials micrographs compared to miscellaneous everyday objects.Our results have some important and counter-intuitive implications for the practical implementation of these computer vision models in materials characterization workflows.

Method Dataset
The WSe 2 AFM data used in this research were grown by Eichfeld et al. 9 and stored in the Lifetime Sample Tracking (LiST), a database hosted by the 2D Crystal Consortium (2DCC), 45 while the processed data are available to download from Ref. 46 The 52 WSe 2 thin film samples were synthesized using the metal-organic chemical vapor deposition (MOCVD) technique.The samples were grown at various conditions, including the growth time, chamber inner temperature, and pressure (Fig. 1), resulting in significant variations in the morphological features of the AFM micrographs obtained.Additionally, different imaging conditions were employed for the samples, with characterization obtained at the centers and edges of the wafer and different resolutions.This resulted in a total of 221 micrographs from the 52 grown samples.
Eichfeld et al., 9 which grew the samples, postprocessed the micrographs using NanoScope Analysis 27 that performed flattening (which we could have done automatically as part of our workflow), inserted a color bar, and annotated the images with text labels and a scale bar.We therefore retrieved flattened images which were stored as TIF files such as those shown in Fig. 2. One important consequence of using the flattened images is that our models were trained not on height maps, but on height-normalized images.That is, the relationship between pixel intensity and the original height measurement was different within each image.The same was true of the length scale, where pixels represented different sample areas within each image.We believe this better represents the practical use case for these models compared to carefully controlled height and length scales.For the deployment of our models on raw (unflatten) images, the flattening step can be implemented as part of an automated workflow (e.g., pySPM 47 ).
The figure of merit for these thin film samples is the monolayer coverage, which can be computed from an AFM height map according to the fraction of pixels in the foreground

Regression Models
We consider two variants of the ML task: regression (predicting the coverage label directly from the image) and segmentation (predicting the binary mask and then computing the coverage from the mask).Within the regression task, we further consider three training paradigms: training from scratch using end-to-end learning (i.e., with randomly initialized weights), transfer learning by fine-tuning (i.e., initializing the model with pretrained weights),  For all the regression models, the Adam optimizer, ReLU activation function, and mean squared error (MSE) loss functions were used.10% of the data samples, grown under different growth parameters than the rest of the data and/or obtained under different imaging conditions, were held out to determine how well the models generalize to out-of-distribution data (Table 1).Additionally, about 80% and 10% were used for the training and validation, respectively.
We started by training a small Convolutional Neural Network from scratch (CNNsc).
The architecture of the CNNsc network was optimized using Bayesian hyperparameter tuning implemented in the ax-platform package 49 which leverages a Gaussian-process-based Bayesian optimization. 50After each of the convolutional layers, a max pooling and ReLU activation function were applied to downsize the feature maps and extract the most important features, and introduce non-linearity, respectively.This network was deliberately simplified compared to the pretrained models to evaluate whether fewer trainable weights would be more robust in extrapolating to the test domain.
We also explored the application of pretrained models, specifically ResNet18 architecture pretrained on ImageNet 51 and MicroNet 22 datasets, to predict the coverage of WSe 2 thin films.We chose ResNet18 as it is among the shallowest standard computer vision architectures available today, which we felt was important given our low data volume.The features were extracted from the average pool layer of the pretrained models, given 512 features.
Multilayer perception (MLP) models were then built to learn the crystal coverage from the image features obtained from the ResNet18 pretrained on the ImageNet and MicroNet.The MLP models are hereafter referred to as MLP-I and MLP-M, respectively.MLP model hyperparameters were tuned using ax-platform as in the case of the CNNsc.
For completeness, we also employed the fine-tuning paradigm of transfer learning.This allowed us to assess the performance of these pretrained models in our specific context and evaluate their potential for accurate thin film coverage prediction.The pretrained models' classifiers were replaced with 2 FC (fully connected) layers of 512 and 100 neurons and an output layer.Between the 2 FC layers is a ReLU activation function to introduce nonlinearity and a dropout of 0.25 to minimize over-fitting.The sigmoid activation function was additionally placed before the output layer to ensure only values between 0.0 and 1.0 (range of coverage values) are predicted.The models were then tuned with our data to learn the crystal coverage.The fine-tuning was carried out for the ResNet18 pretrained on the ImageNet (CNN-I) and another on the MicroNet (CNN-M).

Segmentation Models
Separately from the regression task, we attempt to solve the problem using segmentation models to work natively with the binary mask.Similar to the regression models, encoders pretrained on MicroNet by Stuckner et al. 22 were used.In their report, they found ResNeXT, 52 SE, 53 Inception, 54 and EfficientNet 55 encoder architectures to give better performances.
Additionally, Unet 56 and Unet++ 25 decoders were found to outperform others.Specifically, SE ResNeXt-50 32x4d and SE ResNeXt-101 32x4d encoders pretrained on MicroNet coupled with Unet++ decoders gave, on the average, the best intersection over union (IoU) accuracy for models trained on the full sets of 2 different SEM images (nickel-based superalloys and environmental barrier coatings).We therefore used SE ResNeXt-50 32x4d and SE ResNeXt-101 32x4d encoders pretrained on MicroNet coupled with Unet++ decoders in our study.These segmentation models are termed SEG50 and SEG101, respectively.
For us to compare the performance of the segmentation and regression models from the same pretrained architectures, we have additionally trained segmentation models based on the ResNet18 pretrained encoder and using the Unet++ decoder.Both encoders pretrained on the ImageNet and MicroNet were used, and termed SEG18-I and SEG18-M, respectively.
The Adam optimizer, 1e-4 learning rate, and a batch size of 6 were used in the training.We utilized an early stopping after 30 epochs of training without further improvement on the IoU accuracy of the validation set, while the loss function was a weighted sum of balanced cross entropy (BCE) and dice loss with a 70% weighting towards BCE.

Regression Models Training from Scratch
The architecture of the CNNsc network found by hyperparameter tuning consisted of four convolutional layers and three fully connected (FC) layers (Fig. 6).The kernel size was 5 with a stride of 1 and zero padding.This model was trained to minimize the MSE loss between the target and the predicted coverage.A stochastic behavior is observed in the learning resulting in the fluctuation in losses with the training iterations both for the training and  1).These correspond to R 2 values of 0.997, 0.984, and 0.979 for train, validation, and test, respectively.Only a few scattered points were observed in the validation and test parity plots, indicating a minimal over-fitting.

Feature Extraction
The MLP architectures were tuned (to minimize the validation loss) to yield 2 hidden layers with (120, and 84) neurons in the MLP-I and MLP-M.The trained MLP-I exhibited an R 2 value of 0.873 on the test set (Fig. 7 and Table 1).MLP-M performs better than the MLP-I, though still slightly worse than the CNNsc.A better performance observed in the MLP-M than the MLP-I might be due to the proximity of the data for the pretraining and our data; MicroNet consists of grayscale micrographs while ImageNet is made up of the macroscale color images of natural objects.The features extracted from the former may therefore be more relevant in learning our image features than those from the latter.
The superior performance of the CNNsc may be due to its smaller size or its on-the-fly data augmentations; random rotations and flips were applied to the data while training.To verify if the data augmentations applied to the CNNsc made a significant difference to the model performance, we trained the same architecture of CNN with the same hyperparameters without the augmentations (CNNsc*).The result shows that the augmentations indeed significantly enhance the performance of the CNNsc (Fig. 7 and Table 1).Overfitting is observed to set in soon after the first few epochs of training on data without augmentation.
The model accurately predicts the coverage for the train set but a worse performance than both MLP-I and MLP-M is observed in the validation and test sets.
However, the on-the-fly augmentation cannot be readily applied in the feature extraction case as data are not seen by the model more than once.The closest we can get to the on-the-fly augmentation is to obtain different features for the rotated and horizontal and vertically flipped images, then train the MLP model on all of these at once.We also tried average pooling on these variants as input to the model rather than trying to learn a manyto-one mapping.Both of these approaches gave worse performance compared to the vanilla MLP models, with the augmentation giving the R 2 values of 0.86 for the MLP-I and 0.93 for the MLP-M, while the pooling strategy was worse.These results underscore a fundamental difference in the static augmentation of the data for the MLP models and the on-the-fly augmentation for the CNN models.

Fine-Tuning
Finally, we examined the fine-tuning of the pretrained model to predict the crystal coverage.
This approach needs to be explored especially because we observe the significant impact data on the validation and test set with R 2 value of 0.99 (see Fig. 8 and Table 1).
Interestingly, while a significantly better performance is observed from features extracted from the model pretrained on MicroNet than that from ImageNet, the fine-tuning shows the reverse.This means that the filters pretrained on the MicroNet extract much more useful features from the AFM than those pretrained on the ImageNet.However, the latter scenario seems to provide more generic image features in which case fine-tuning on sufficient target   the segmentation models have a minimal improvement over the regression models as shown in Table 2 and Fig. 9.To be specific, the best model from the regression models, CNN-I (Fig. 8 and Table 1) exhibits a test RMSE of 0.029, whereas SEG18-M and SEG50 both obtain 0.020 RMSE.
Based on the patches of the images, it seems that segmentation models provide higher performance in determining the crystal coverage than regression models.Additionally, segmentation models offer the advantage of giving impressive performances even with a much smaller data set for training 22,23,26 since each pixel is in effect a training data point.In our present study, the total image patches used in the segmentation models are half of that used in the regression models.
In addition to the coverage value determination, segmentation models provide pixel-wise classification of the image, classifying each pixel in the AFM images of WSe 2 samples as either belonging to the substrate or the crystal.This has some additional utility in determining not only how much crystal is present, but its location in the micrograph.The intersect over union (IoU) metric shows high performance even on the pixel-level classification task, with 92% (SEG50) and 90% (SEG101) IoU on held-out test images.It is worth noting that similar performances are observed on both the train and test sets, indicating low memorization.
This level of generalization, despite the held-out test set samples being grown at different conditions and/or obtained at different imaging conditions, underscores the potential of the models to produce reliable results in practical applications.

Inference on Full Images
The test set discussed in the previous sections is based on patches created from the full image test set.However, it is important to characterize the held-out test set in its original full image format, as this is the real measure of the practical value of our trained models.
For this test, we are using SEG50 and SEG101 and only the best regression models: CNN-I and CNN-M.While SEG50 gives the best performance on the held-out test set among the    on the full images compared to the patches.Creating the tiles for the full image inference requires processing that could result in the loss of some parts of the original images.The resizing involved in the patches created for training the models is also inevitably not the same as that for the tiles.The sensitivity of the different models to the different image processing and the image morphological features have therefore resulted in the observed variation in the model performances.Also worthy of note is the fact that significant variations in the segmentation model performances have been observed depending on the encoder and/or decoder architecture. 22erall, the results on full images show an important distinction between the training protocol and the real-world application of CNNs.Deep CNNs such as SEG101 may not be robust in practical micrograph analysis despite excellent performance even on held-out test data due to the image augmentation scheme.Meanwhile, even though the calculation of crystal coverage is natively a segmentation problem, the regression models perform well on the full images, suggesting that they may be more robust to changes of scale, dimension, or other factors compared to the segmentation models.

Conclusion
In this study, we conduct a comprehensive analysis of crystal coverage (the proportion of the substrate covered with grown crystal) in WSe 2 thin film atomic force microscopy (AFM) micrographs using regression and segmentation models.Regression models were trained to predict the monolayer crystal coverage from image patches.Models were trained from scratch and using transfer learning from ResNet pretrained on ImageNet and MicroNet.MicroNet consists of grayscale micrographs while ImageNet is made up of the macroscale color images of natural objects.For transfer learning, both feature extraction and fine-tuning approaches were used.
Our analysis revealed that the CNN models trained from scratch outperform MLP mod-els trained on features extracted from the pretrained models, while fine-tuning gave the best performance with up to 0.99 R 2 value on the held-out test set.Interestingly, while a significantly better performance is observed from feature extraction using MicroNet than that from ImageNet, the fine-tuning shows the reverse.This means that the filters pretrained on the MicroNet extract more useful features from the AFM than that pretrained on the ImageNet.However, the latter scenario seems to provide more generic image features in which case fine-tuning on sufficient target data has yielded a better result.
Beyond the prediction of crystal coverage over entire patches, segmentation models provide pixel-wise classification of the image, classifying each pixel in the AFM images of WSe 2 samples as either belonging to the substrate or the crystal.This has some additional utility in determining not only how much crystal is present, but its location in the micrograph.
Based on the patches of the images, the segmentation models provide higher performance in determining the crystal coverage than regression models.The intersection over union (IoU) metric shows high performance even on the pixel-level classification task, with up to 92% IoU on held-out test images.
The results on full images show an important distinction between the training protocol and the real-world application of the models.Contrary to the results from image patches, the regression models performed better than the segmentation models at predicting the monolayer crystal coverage of the full images of the held-out test set, giving the R 2 values of 0.98 and 0.90, respectively, from the best models.The average IoU on the full held-out test images reduced to 86% from the 92% obtained for the patch images.Our finding suggests that the regression models may be more robust to changes in scale, dimension, or other factors compared to the segmentation models.Overall, these results highlight the efficacy of machine learning for automated, high-throughput sample characterization, demonstrating its potential for accelerating the high-throughput development of chalcogenides for technological applications.At the same time, it provides practical guidelines for implementing standard computer vision workflows in real-world materials characterization applications.

Figure 1 :
Figure 1: WSe 2 samples used in the study showing the growth parameters space.T is the growth chamber inner temperature, P is the pressure, and time is the growth time.Multiple micrographs are obtained for each sample, so there are fewer unique conditions than images in our study.Bolder circles indicate more samples at the same point.Some samples in the test set occupy unique points in the parameter space, such as the samples at the lowest T .

Figure 2 :
Figure 2: Sample AFM images of WSe 2 thin film in our dataset.Data ingested in our workflow have already been preprocessed by other software and include dimensional scale bar, color scale, and text annotations.

Figure 3 :
Figure 3: Sample images with their lightness histograms demonstrating how segmentation could be performed based on a bimodal lightness distribution (one for foreground, one for background).This assumption is often violated due to imperfect flattening, texture, or artifacts.

Figure 4 :
Figure 4: Data augmentation and image patching schematic scheme.The original AFM image (top left) is thresholded to produce a mask (top right).Random image patches are jointly taken from both the image and mask to yield new (image, mask, coverage) sets where all patches are of size 224 × 224 but represent different portions of the original image.
-to-end learning

Figure 5 :
Figure 5: Schematic of different transfer learning paradigms.Feature extraction is a scheme that only modifies the trainable weights in the fully connected layer (or other shallow models) while leaving the pretrained weights in the convolutional layers unchanged.In fine tuning, all the trainable weights from the pretrained model are adjusted to improve the model's fit to the new task.In end-to-end learning, the entire model is trained from scratch, without any knowledge transfer.

Figure 6 :
Figure 6: (a) is the CNN architecture built from scratch (CNNsc) showing the convolutional (Conv1, Conv2, Conv3, and Conv4), the pooling (MaxPool), and fully connected layers (FC1, FC2, FC3), as well as the feature maps and channel sizes for each of the convolution layer and the neurons connecting the FC layers.(b) is the root mean squared error value (RMSE) on the train and validation (val) data against the learning iteration (epochs).(c) is the parity plot of the predicted and target coverage.The R 2 and RMSE values in (c) are for the test set.

Figure 7 :
Figure 7: (a), (c), and (e) are the root mean squared error value (RMSE) on the train and validation (val) data against the learning iteration (epochs) for the multilayer perceptron model (MLP) trained with features extracted using the ResNet18 pretrained on the Ima-geNet data (MLP-I), MLP trained with features extracted using the ResNet18 pretrained on the MicroNet data (MLP-M), and for the CNN built from scratch without on-the-fly data augmentation (CNNsc*), respectively.(b), (d), and (f) are the parity plots of the predicted and target coverage corresponding to (a), (c), and (e), respectively.The R 2 and RMSE values in (b), (d), and (f) are for the test set.
augmentation has on CNN model performance.Fine-tuning is carried out for the ResNet18 pretrained on the ImageNet and another on the MicroNet.These models are termed CNN-I and CNN-M, respectively.As observed in the CNNsc, capturing the grokking effect is important in obtaining the optimally trained model; the training and validation losses were closely monitored, and the training halted once the minimal obtainable validation loss was reached.The validation loss associated with the grokking point was determined by initial training of the models for a few thousand epochs.The performance of CNN-I and CNN-M are quite similar, with CNN-I giving a marginally better result.Both have accurate predictions

Figure 8 :
Figure 8: (a), and (c) is the root mean squared error value (RMSE) on the train and validation (val) data against the learning iteration (epochs) for the fine-tuned ResNet18 pretrained on the ImageNet data (CNN-I), the fine-tuned ResNet18 pretrained on the MicroNet data (CNN-M), respectively.(b) and (d) is the parity plots of the predicted and target coverage corresponding to (a) and (c), respectively.The R 2 and RMSE values in (b) and (d) are for the test set.

Figure 9 :
Figure 9: (a) and (b) are the parity plots of coverage predicted, using the segmentation model pretrained on the MicroNet, and the target coverage.The encoder for the SEG50 and SEG101 models are SE ResNeXt-50 32x4d and SE ResNeXt-101 32x4d, respectively.(c) are sample images, the corresponding ground truth mask, and the predicted mask by the SEG50 and SEG101 models.The R 2 and RMSE values are for the test set.

Figure 10 :
Figure 10: The original (whole) images in the hold-out test set (first rows), and the pixelwise classification, as either belonging to crystal or substrate (second rows) obtained from the SEG50 model.The intersection over union (IoU) accuracy for each image is given below the classification.

Figure 11 :
Figure 11: Coverage analysis and segmentation of the original (whole) test images.Results obtained using the segmentation models, SEG50 and SEG101, and the best regression models, CNN-I and CNN-M are shown.The S/No. corresponds to the image # shown in Fig. 10.

Table 1 :
RMSE and R 2 values for the predicted coverage on the train, validation (val), and test sets for models trained from scratch and through transfer learning.CNNsc and CNNsc* are the CNNs trained from scratch with and without on-the-fly data augmentation, respectively.MLP-I and MLP-M are the MLPs trained using the features extracted with ResNet18 architecture pretrained on ImageNet and MicroNet, respectively.CNN-I and CNN-M are the fine-tuning models of the ResNet18 architecture pretrained on ImageNet and MicroNet, respectively.The best performance in each row is shown in bold, including ties and near-ties.