Data-driven approaches to optical patterned defect detection

Computer vision and classification methods have become increasingly widespread in recent years due to ever-increasing access to computation power. Advances in semiconductor devices are the basis for this growth, but few publications have probed the benefits of data-driven methods for improving a critical component of semiconductor manufacturing: the detection and inspection of defects in such devices. As defects become smaller, intensity threshold-based approaches eventually fail to adequately discern differences between faulty and non-faulty structures. To overcome these challenges we present machine learning methods, including convolutional neural networks (CNN), applied to image-based defect detection. These images are formed from the simulated scattering of realistic geometries with and without key defects while also taking into account line edge roughness (LER). LER is a known and challenging problem in fabrication as it yields additional scattering that further complicates defect inspection. Using simulated images of an intentional defect array, a CNN approach is applied to extend detectability and enhance classification of these defects, even those that are more than 20 times smaller than the inspection wavelength.


Introduction
Compared to the Apollo 11 onboard guidance computer, a modern cellphone is about 1,400 times faster and has 4,000,000 times more memory [1]. These dramatic increases illustrate the substantial impact of the trend observed by Gordon E. Moore: as production costs decrease, the number of transistors in a dense integrated circuit doubles about every two years [2]. Transistor count is the most common measure of integrated circuit complexity and is closely related to computational performance [3], the main force driving the feasibility and widespread availability of data-driven methods. The manufacturing of these integrated circuits had become a $400 billion industry as of 2017 [4], and even as the semiconductor industry struggles to perpetuate Moore's law [5], crucial challenges exist in monitoring the production process for decreasing feature sizes [6].
One of the most pressing challenges is the detection of so-called "killer defects", i.e., deviations that would lead to device failure due to shorted or broken electrical connections within the layers that are printed using photolithography [7,8]. If undetected, such defects can lead to systematic imperfections within other devices on subsequent wafers, resulting in multiple failed devices that are only observed after fabrication is complete. Examples of such observations include electrical testing and the X-ray based detection of die warpage after packaging [9]. The latter, however, is limited by long scan times for laboratory-based X-ray sources.
Defect metrology concentrates on locating and identifying these defects during manufacturing to increase yield. Optical tools, such as scatterometry [10] or imaging techniques [11], are the only way to successfully inspect these defects non-destructively at high speeds over the area of the typical 300 mm diameter wafer. As killer defects decrease in size with shrinking device dimensions, the scattered intensity from these defects becomes harder to detect; thus for either approach a large amount of data needs to be processed. Converting these low-intensity data into meaningful results requires exploiting the very increase in computation power that results from successfully producing more powerful devices. In this work, this virtuous cycle is illustrated by adding several key aspects from machine learning to image-based defect detection, comparing a contemporary deep ultraviolet (DUV) inspection wavelength against proposed and potential vacuum- and extreme-ultraviolet (VUV, EUV) wavelengths.
While machine learning (ML) has successfully been reported in the analysis of patterns of poor device yield across such wafers after electrical testing [12][13][14] with some even using convolutional neural networks (CNN) [15][16][17], only recently has the imaging of defects been treated in a ML setting, more specifically by using principal component analysis [18]. This work broadens the application of ML to improve localized, image-based defect metrology by comparing linear classifiers and CNNs. Note, image-based defect detection with machine learning has been realized in other industries e.g., textiles [19][20][21][22], steel [23], and wood [24], but a key difference is that due to the decreased dimensions in semiconductors these defects must be detected even as they are often unresolved.

Simulation details
Shown schematically in Fig. 1 are two types of bridging defects and two types of line extensions, which in general are harder to detect. These layouts are based upon public information about recent manufacturing processes [25] of the fins for 3-D field effect transistors (finFETs) and also upon an intentional defect array defined by SEMATECH [26]. The SEMATECH array provides the naming convention for the defects, see panels (b) to (e) in Fig. 1.
The simulations were performed using a well-verified [28][29][30] in-house implementation of the finite-difference time-domain [31] (FDTD) method to model the electromagnetic field scattered from the patterned layout and its defects. The incident angle of the illumination is chosen to be normal to the substrate for clarity; prior simulation results indicate that the defect detection often varies when using oblique illumination [29,32]. The linear polarization basis within the simulation is defined with respect to the long axis of the nominal pattern. (Note, the x and y directions do not correspond to the defect naming convention.) The scattered and reflected fields are converted to images through an idealized modeling of the Fourier optics assuming a collection numerical aperture of 0.95 for simplicity. To account for the measurement noise, Poisson (shot) noise [33] is applied to the raw images. Throughout the remainder of this paper the pixel size at the sample is assumed to be 10 nm × 10 nm. This number has been proven to be a good compromise between the changes in the noise model due to an increased photon count for larger pixels and aliasing effects.
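The shot-noise step described above can be illustrated with a minimal sketch. It assumes, hypothetically, that the expected photon count in a pixel is the relative intensity multiplied by the incident photon density and the pixel area; the source does not spell out this conversion, and the function and parameter names are illustrative only.

```python
import numpy as np

def add_shot_noise(rel_intensity, photon_density, pixel_nm=10.0, rng=None):
    """Apply Poisson (shot) noise to an image of relative intensities.

    rel_intensity  : 2-D array, intensity relative to the incident light
    photon_density : incident photons per nm^2 (assumed meaning of rho_ph)
    pixel_nm       : pixel edge length at the sample in nm
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: incident photons per pixel = density * pixel area
    photons_per_pixel = photon_density * pixel_nm**2
    expected_counts = np.clip(rel_intensity, 0.0, None) * photons_per_pixel
    noisy_counts = rng.poisson(expected_counts)
    # Convert the integer counts back to relative intensity
    return noisy_counts / photons_per_pixel

# Example: a flat 71 x 63 image at a photon density of 10 photons per nm^2
image = np.full((71, 63), 0.5)
noisy = add_shot_noise(image, photon_density=10.0, rng=np.random.default_rng(0))
```

Under this assumption a larger pixel collects more photons and therefore shows less relative noise, which is the trade-off against aliasing mentioned in the text.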
We have previously utilized intensity and area thresholding to extract a signal-to-noise ratio (SNR) from differential images. In [30,34,35] the SNR was strongest for bridging defects when applying a wavelength of λ = 47 nm for the simulated structures, while the SNR varied notably across the remaining wavelengths. While SNR is straightforward and convenient it does require an informed choice of these intensity and area thresholds to exclude the shot noise, and for sufficiently low photon densities a SNR may not be reportable.
In this work three wavelengths are employed, 193 nm as a common inspection wavelength, 47 nm that performed best for the SNR metric, and 13 nm which was proven to be a challenging wavelength in our prior reports. The photon densities that are applied here are based on current estimates for the intensity of the three wavelengths used in this study, see Refs. [36,37], and can be found in Tab. 1 below.
In addition to measurement noise, "wafer noise" due to process variations is included [38][39][40][41][42]. Line edge roughness (LER) is known to be present in every lithographically manufactured device, either reducing the signal or increasing the noise [43]. The geometries in these simulations include LER [44,45] that is based on the current state of the art with $3\sigma_{\mathrm{LER}} = 0.6$ nm, i.e., 10% of the line width, and a correlation length of ξ = 10 nm [46]. Both types of noise are applied separately to the defect and no-defect images that are used to form the absolute value differential images (AVDI). These AVDIs form the basis of the investigations and are realized by subtracting the two images and taking the absolute value of the resulting pixel values, see Fig. 2 for examples. The intensities in the defect and no-defect images are given relative to the intensity of the incident light. For use in the machine learning algorithms, the AVDIs are converted to 8-bit integers and normed to their respective maximum values. This subtraction almost completely removes the background at the cost of combining two realizations of shot noise.
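The construction of an AVDI can be sketched as follows. The order of operations (normalizing to the maximum before quantizing to 8 bits) and the function name are assumptions made for illustration; the source only states that both steps are applied.

```python
import numpy as np

def make_avdi(defect_img, no_defect_img):
    """Absolute value differential image, scaled to its maximum as uint8."""
    diff = np.abs(np.asarray(defect_img, dtype=float)
                  - np.asarray(no_defect_img, dtype=float))
    peak = diff.max()
    if peak == 0:
        # Identical inputs: nothing to normalize
        return np.zeros(diff.shape, dtype=np.uint8)
    # Assumed order: norm to the maximum first, then quantize to 0..255
    return np.round(255.0 * diff / peak).astype(np.uint8)

defect = np.array([[1.0, 5.0], [2.0, 2.0]])
reference = np.array([[1.0, 1.0], [3.0, 2.0]])
avdi = make_avdi(defect, reference)
```

Because each input image carries its own noise realization, the difference contains the noise of both images, which is the cost noted above.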
Separating the shot noise from the noise due to LER for experimental data is a very challenging task for optical tools, more easily achieved in scatterometry-based critical dimension metrology [47,48] than for imaging defect inspection tools. For this simulation study, by definition we have perfect knowledge of the distribution of the shot noise and removing this would lead to unrealistically good results. Noise filtering in this work is therefore limited only to wavelet-based compression techniques.

Implementation of ML algorithms
Two types of ML algorithms are applied to this classification problem. While the first type, linear classifiers (LC) [49] can in some sense be seen as extensions of our previously used SNR metric, the second one, convolutional neural networks (CNNs) [50] are a class of algorithms that are widely used across a vast number of image recognition tasks. Each algorithm requires a set of features to operate on, and the selection of these features is an integral part in any ML setup. Limited computation resources, especially memory, have guided the selection of these methods. The linear classifier uses histograms, while the CNN processes wavelet-compressed AVDIs.
The histogram of pixel intensities is an easily obtained image feature. Even though it discards the spatial information of the image, applications in fields as diverse as wood [24] and fabric inspection [51] have shown that histograms can be a very valuable feature for classification. The intensities of each image are normed relative to its maximum value, and a total of 100 bins is used to create the histogram (although not shown, setting the number of bins to less than 50 has a negative effect on the performance of the ML algorithms used in this work). The histograms are normalized to show the relative pixel frequencies. A training set is created that consists of $n_t = 10000$ histograms $x^{(i)} \in \mathbb{R}^{100}$, $i = 1, \ldots, n_t$. One half are from simulations without a defect and are labeled $y^{(i)} = [0, 1]^\top$, $i = 1, \ldots, 5000$; the other half are from simulations with a defect and are labeled $y^{(i)} = [1, 0]^\top$, $i = 5001, \ldots, 10000$. Figure 3 presents the histograms for two critical wavelength/defect combinations. Note the clear difference in the histograms for λ = 193 nm, while the histograms for λ = 13 nm are virtually indistinguishable.
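A minimal sketch of this feature extraction is given below. The function names are hypothetical; the bin count, the normalization, and the label convention follow the description above.

```python
import numpy as np

def histogram_feature(avdi, n_bins=100):
    """Relative pixel-frequency histogram of an image normed to its maximum."""
    img = np.asarray(avdi, dtype=float)
    peak = img.max()
    if peak > 0:
        img = img / peak
    counts, _ = np.histogram(img, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()

def one_hot_labels(n_no_defect, n_defect):
    """Labels as in the text: no defect -> [0, 1], defect -> [1, 0]."""
    labels = np.zeros((n_no_defect + n_defect, 2))
    labels[:n_no_defect, 1] = 1.0
    labels[n_no_defect:, 0] = 1.0
    return labels

feature = histogram_feature(np.arange(12).reshape(3, 4))
labels = one_hot_labels(5000, 5000)
```

Each image is thereby reduced to a 100-element vector regardless of its pixel dimensions, which keeps the memory footprint of the training set small.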
The classifier applied to these histograms is a simple linear classifier (LC). Figure 4 shows the classification success rates (CSRs) for this algorithm when applied to histogram data generated using realistic photon densities as given in Tab. 1. Ten optimizations have been performed to determine the mean CSRs. The corresponding standard deviations were below 0.01 in all cases and are therefore not plotted. The LC performs quite well for λ = 193 nm, yielding a CSR of approximately 0.98, i.e., on average it successfully classifies an image in 98% of all cases. For the λ = 13 nm data with a low photon count of $\rho_{\mathrm{ph}} = 1\,\mathrm{nm}^{-2}$ to $10\,\mathrm{nm}^{-2}$, however, the LC performs poorly, with a CSR just above 0.5. With increasing $\rho_{\mathrm{ph}}$, the CSR increases to 0.82 for the Bx defect and 0.98 for the By defect at $\rho_{\mathrm{ph}} = 1000\,\mathrm{nm}^{-2}$. It has been reported that the SNR is a good metric for defect detection at λ = 47 nm, so it is not surprising that the LC also performs very well for the reference photon density of $10\,\mathrm{nm}^{-2}$. Even for photon counts that are an order of magnitude lower, a CSR of around 0.9 is achieved. For real-world process control, however, this value is not satisfactory. The same can be said for the LC at λ = 13 nm, even for an increased photon count. Therefore a CNN is applied, which needs a different feature that ideally contains more information without requiring too much memory.
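For illustration, a linear classifier and the CSR can be sketched as below. The source does not specify the loss function or the optimizer, so softmax regression trained by plain gradient descent is an assumption, and the learning rate, step count, and synthetic data are illustrative only.

```python
import numpy as np

def train_linear_classifier(x, y, lr=0.5, n_steps=500, seed=0):
    """Softmax regression by gradient descent (assumed form of the LC)."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.standard_normal((x.shape[1], y.shape[1]))
    b = np.zeros(y.shape[1])
    for _ in range(n_steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - y) / x.shape[0]  # gradient of the mean cross-entropy
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def classification_success_rate(w, b, x, y):
    """Fraction of samples whose predicted class matches the label."""
    predicted = np.argmax(x @ w + b, axis=1)
    return float(np.mean(predicted == np.argmax(y, axis=1)))

# Synthetic, well-separated stand-in for histogram features
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(3.0, 1.0, (200, 5)), rng.normal(-3.0, 1.0, (200, 5))])
y = np.zeros((400, 2))
y[:200, 1] = 1.0  # first half labeled [0, 1]
y[200:, 0] = 1.0  # second half labeled [1, 0]
w, b = train_linear_classifier(x, y)
csr = classification_success_rate(w, b, x, y)
```

Repeating the training with different seeds and averaging the CSRs would mirror the ten optimizations mentioned in the text.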
Memory constraints arise when using the full images with a pixel size of 10 nm × 10 nm, as the images consist of 71 × 63 pixels at λ = 13 nm and λ = 47 nm and of 107 × 95 pixels at λ = 193 nm. Building a sufficiently large library of training data with these images is not possible given our resources; hence the information provided by the images is condensed by applying a two-step wavelet-based image compression using a 'db1' wavelet [52]. The original images are first high-pass filtered. The resulting images are then low-pass filtered and downscaled, yielding approximated subimages for which the procedure is repeated. This approach reduces the image sizes to 18 × 16 pixels for λ = 13 nm and λ = 47 nm and to 27 × 24 pixels for λ = 193 nm, and hence increases the pixel size at the sample from 10 nm to 39.4 nm. Even with these larger pixels, the wavelet-compressed images preserve details of the original defect images that tend to disappear if one simply rebins, see Fig. 5 for an example.
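The size reduction can be reproduced with a short sketch of the low-pass part of a 'db1' (Haar) decomposition. The sketch keeps only the approximation subband at each level and extends odd dimensions by replicating the last row or column; both choices are assumptions, and the filtering used in the source may differ in detail.

```python
import numpy as np

def haar_approximation(img):
    """One level of the 2-D Haar ('db1') approximation subband."""
    img = np.asarray(img, dtype=float)
    # Assumption: odd dimensions are extended by repeating the last sample
    pad_rows, pad_cols = img.shape[0] % 2, img.shape[1] % 2
    img = np.pad(img, ((0, pad_rows), (0, pad_cols)), mode="edge")
    # Orthonormal Haar low-pass in both directions: 2 x 2 block sum / 2
    return (img[0::2, 0::2] + img[0::2, 1::2]
            + img[1::2, 0::2] + img[1::2, 1::2]) / 2.0

def compress(img, levels=2):
    """Repeat the approximation step for the given number of levels."""
    for _ in range(levels):
        img = haar_approximation(img)
    return img

small = compress(np.ones((71, 63)))    # images at 13 nm and 47 nm
large = compress(np.ones((107, 95)))   # images at 193 nm
pixel_nm = 10.0 * 71 / small.shape[0]  # effective pixel size after two levels
```

With two levels the 71 × 63 image becomes 18 × 16 and the 107 × 95 image becomes 27 × 24, matching the sizes quoted above, and the effective pixel size is 10 nm × 71/18 ≈ 39.4 nm.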
With the 16-fold data reduction due to the compression, it is now possible to train a convolutional neural network that uses the spatial information contained in the compressed images as features to classify the defect/no-defect AVDIs. A known, fundamental architecture that has proven successful in detecting defects in a slightly different field [23] is used; it is shown in Fig. 6 and implemented using the TensorFlow toolbox [54]. Just as in the histogram case, a training set is created for each combination of λ, $\rho_{\mathrm{ph}}$, defect type, and polarization.
Initially the LC and the CNN are both used for binary classification, i.e., defect versus no-defect. The capabilities of the CNN are evaluated further by using an order of magnitude less light at λ = 193 nm and by attempting classification among the Bx, By, Cx, and Cy defects and the no-defect case.

CNN results
Starting with the λ = 47 nm case and the reduced photon density of $\rho_{\mathrm{ph}} = 1\,\mathrm{nm}^{-2}$, recall that the histogram approach resulted in CSRs of approximately 0.9 for both the Bx and By defects. Using the CNN approach yields CSRs of almost 1, cf. the red and blue '×'s in Fig. 4. The same improvement in performance is observed for the Bx defect at λ = 193 nm, with the CSR increasing from 0.88 for the LC to essentially 1 for the CNN. For λ = 13 nm with current photon densities, even the CNN approach cannot detect these defects. Therefore the change in the CSRs is determined for increased $\rho_{\mathrm{ph}}$. With a photon density of $\rho_{\mathrm{ph}} = 1000\,\mathrm{nm}^{-2}$, the CSRs are close to 1 if the CNN is used. That is about an order of magnitude less in photon density than would be needed for the SNR metric to successfully separate defect from no-defect images [35].
Finally the presented approaches are applied to smaller defects, excluding λ = 13 nm due to the difficulties this wavelength presented for larger defects. Figure 1 d) and e) show a schematic representation of the non-bridging defects that, following the SEMATECH convention, are denoted by Cx and Cy and are investigated here using λ = 47 nm and λ = 193 nm. Although not shown, neither defect could be classified adequately using the SNR metric for any wavelength, further motivating the use of machine learning based methods. The results for applying the LC and CNN approaches are presented in Fig. 4 as light blue and orange triangles, respectively. The shorter wavelength does not have any problems detecting the Cx and Cy defects at the current $\rho_{\mathrm{ph}}$ if the CNN is used, but only reaches a maximum CSR of 0.92 for the Cx defect for lower photon densities. As expected the LC is not sufficient to detect either defect at λ = 47 nm for any given photon density due to the very small scattering volume. On the other hand, λ = 193 nm performs very well on those small defects, especially given their size, that is approximately less than …

For processing images for high-volume manufacturing, however, the image's size in memory may need to be decreased beyond the 16-fold reduction from the two-step wavelet compression. Figure 7 shows the effect that further compression of the images has on the obtained CSRs for λ = 13 nm. Increased compression does indeed have a negative impact on CSRs > 0.6, decreasing the values for both defect types at the two larger investigated photon densities. For the smaller photon densities, better CSRs might occur for higher compression; however, with CSR < 0.6 this is of negligible impact on defect detection. It is nevertheless surprising that the drop in CSRs is not as dramatic as expected; for example, the CSR decreased from 0.985 to 0.938 for the Bx defect at $\rho_{\mathrm{ph}} = 10000\,\mathrm{nm}^{-2}$.
While the CSRs from the highly compressed data are insufficient for practical application, the size of the image in memory is one variable of many that must be optimized for data-driven defect inspection.
One advantage of simulating an intentional defect array is the perfect prior knowledge of each image's defect type, and this enables further testing of the CNN beyond binary classification as shown in Fig. 4 using training and test data unique to each defect type. In the following we therefore train the same CNN architecture for a five-class classification to distinguish among the no-defect case and the four types of defects as shown in Fig. 2. Specifically we use AVDIs that were generated for a wavelength of λ = 193 nm and an order of magnitude less photon density than the benchmark value to represent possible inefficiencies in source strengths and faster data acquisition, with examples shown in Fig. 8. From each of these five classes 4000 images have been generated, leading to a total of $n_t = 20000$ images that are separated into 16000 training and 4000 test images.
The classification works quite well with the confusion matrix presented in Tab. 2 showing the accurate classification of three of the four defect types and of the no-defect case. For X polarization only 14 % of the By images were misclassified as Cx defects; for Y polarization (not shown) the trends are the same with 16 % of By images misclassified similarly and an overall CSR of 0.968. Note, that despite the small error rate, all defects were accurately flagged as a defect for both linear polarizations. Another key result from these data is that the success of the CNN does not depend upon polarization optimization. Contrast this with Fig. 4 where these highly directional defects are illuminated using each defect's optimal linear polarization axis, e.g., Bx with Y polarization. While defects encountered in nanoelectronics fabrication often defy such straightforward classification, the presented results demonstrate the versatility of a CNN approach to addressing the ever-pressing challenge of detecting killer defects.
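The quantities discussed here (confusion matrix, overall CSR, and the fraction of defects missed entirely) can be computed as in the following sketch. The class ordering with index 0 as the no-defect case and the toy labels are assumptions for illustration; they are not the data behind Tab. 2.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return cm

def overall_csr(cm):
    """Fraction of all images on the diagonal, i.e. classified correctly."""
    return np.trace(cm) / cm.sum()

def missed_defect_rate(cm, no_defect_class=0):
    """Fraction of defect images that were labeled as no-defect."""
    defect_rows = np.delete(cm, no_defect_class, axis=0)
    return defect_rows[:, no_defect_class].sum() / defect_rows.sum()

# Toy example with three classes: 0 = no defect, 1 and 2 = defect types
true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(true, pred, n_classes=3)
csr = overall_csr(cm)
missed = missed_defect_rate(cm)
```

In the toy example one defect is assigned to the wrong defect type, which lowers the CSR, yet no defect is missed, which is the distinction drawn in the text between misclassification and failing to flag a defect.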

Conclusion
We have applied two data-driven approaches to defect detection, namely linear classifiers and convolutional neural networks to simulated images computed using normally incident illumination using three wavelengths. As expected, CNN outperforms both the linear classifier and the SNR, due to the conservation of spatial information of the images. A very straightforward CNN approach can be used to extend the defect detectability to smaller defects, even as some are more than 20 times smaller in one dimension than the inspection wavelength of λ = 193 nm. Successful classification from an intentional defect array has been demonstrated for these data at this longer wavelength. However, the prospects for defect metrology remain challenging at λ = 13 nm despite the implementation of ML algorithms, partly due to the very low photon density realistically expected at this wavelength. It has been demonstrated that an increase in photon density can help to improve the detectability significantly using longer wavelengths for "killer defects" and further improvements in defect detectability through optimal combinations of illumination angle, polarization, photon density, defect type, and wavelength can be expected.

Fig. 1. Schematic representation of a) ideal layout, b) Bx and c) By bridging defects, d) Cx and e) Cy line extension defects, and f) and g) key dimensions of the unit cell. The lighter color is simulated as crystalline silicon and the blue as amorphous silicon. For clarity two 2 nm thick conformal layers that coat the amorphous silicon are not shown. Geometry and materials details are available at [27].

Fig. 2. Example images with Poisson noise generated using photon densities from Tab. 1, (left) no defect, (center) defect, (right) AVDI. (top row) Bx defect, Y polarization, λ = 13 nm, (middle row) Cx defect, Y polarization, λ = 47 nm, (bottom row) By defect, X polarization, λ = 193 nm. While the longer wavelengths are able to identify the defect, it is almost indistinguishable from noise at λ = 13 nm.

Fig. 3. Histograms of pixel intensities as used as features in the linear classifier, $\rho_{\mathrm{ph}}$(193 nm) = $10^5\,\mathrm{nm}^{-2}$, $\rho_{\mathrm{ph}}$(13 nm) = $100\,\mathrm{nm}^{-2}$.

Fig. 4. CSRs as a function of photon density, capital letter after defect type denotes polarization of incident light, CNN denotes convolutional neural network, LC denotes linear classifier.

Fig. 5. Effect of compression and binning for the Cx defect (circled), λ = 47 nm, Y-polarization, $\rho_{\mathrm{ph}} = 10\,\mathrm{nm}^{-2}$, a) original AVDI, b) rebinned image, c)-e) wavelet-based compression for two, three, and four steps.

Fig. 6. Schematic representation of used CNN, the filter size for the convolution layers was set to 5 × 5 pixels. For supplementary information see [53].

Fig. 8. Defect AVDIs at λ = 193 nm and $\rho_{\mathrm{ph}} = 10000\,\mathrm{nm}^{-2}$. While many defects are easily identified by eye, some defect and polarization combinations yield difference images that are visually similar.

Tab. 1. Benchmark photon densities $\rho_{\mathrm{ph}}$ from the literature [36,37] for the wavelengths used in this study.

λ (nm)                                       13      47      193
$\rho_{\mathrm{ph}}$ ($\mathrm{nm}^{-2}$)    10      10      $10^5$