Holistic Optimization of Trap Distribution for Performance/Reliability in 3-D NAND Flash Using Machine Learning

A machine learning (ML) method was used to optimize the trap distribution of the charge trap nitride (CTN) to simultaneously improve its performance/reliability (P/R) characteristics, which are tradeoffs in 3-D NAND flash memories. Using an artificial neural network (ANN), we modeled the relationship between trap distributions and P/R characteristics. The ANN was trained using a large experimentally-calibrated technology computer-aided design (TCAD) simulation dataset. The gradient descent method was adapted to optimize the trap distribution, achieving the best P/R characteristics based on the well-trained ANN. Eventually, we found the best trap profile distributed in both space and energy. In particular, the energetic trap distribution had a larger impact on the P/R characteristics than that of the spatial trap distribution. Furthermore, in terms of the P/R characteristics, it was generally preferable to increase all inputs of the energetic trap distribution. However, the acceptor-like trap energy level ( $E_{TA}$ ) and its standard deviation ( $\sigma _{EA}$ ) caused a tradeoff between P/R characteristics; therefore, ML was used to determine their optimal points. The proposed ML method allows the optimization of trap distribution to obtain the best P/R characteristics rapidly and quantitatively. Our findings could be used as a guideline for determining the physical properties of CTN in 3-D NAND flash cells.


I. INTRODUCTION
NAND flash memory is a typical nonvolatile memory that is widely used in data servers and portable electronic devices. The principal advantage of NAND flash is its low bit cost owing to its high density, which has been facilitated by the The associate editor coordinating the review of this manuscript and approving it for publication was Cristian Zambelli . conversion of conventional 2-D architectures to 3-D architectures and the adoption of charge trap nitride (CTN) as a storage layer [1], [2], [3], [4]. In particular, CTN has enabled the fine scaling of devices and improved their endurance and reliability compared with conventional floating-gatebased devices [5]. However, as 3-D NAND cells are more stacked and scaled, precise control is required to deposit CTN. In addition, uncertainty in determining the physical VOLUME 11, 2023 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ properties of CTN can result in tradeoffs between their erase/program (ERS/PGM) speed and retention characteristics [4], [6], [7], [8].
In our previous study, we only considered the ERS/PGM characteristics with the energetic properties of traps in CTN [9], [10]. However, in this study, we have attempted to solve more complicated problems combined with retention characteristics by optimizing the critical properties of CTN, that is, uneven spatial and energetic trap distributions, by using a deeper neural network. Unlike previous studies, new trap properties were investigated by observing all characteristics simultaneously rather than individually. Previously, the spatial and energetic trap distributions were extracted using the trap spectroscopy by charge injection and sensing (TSCIS) method [11], [12] and the retention model [13], [14]. By optimizing these controllable parameters comprehensively, we first attempted to simultaneously improve the performance, incremental step pulse erasing/programming (ISPE/ISPP) characteristics [15], and the reliability, retention characteristics in the 3-D NAND cell, which exhibit a tradeoff relationship.
Machine learning (ML) in this study can help in determining the optimal values exactly and rapidly. ML has recently been used to predict and optimize nanoscale transistors [16], [17], [18]. An artificial neural network (ANN) modeled the relationship between the trap distributions as input parameters and the absolute values of the threshold voltage shift (| V th |) as the output parameters. The ANN that was trained using a validated simulation dataset determined the best trap distribution exhibiting the best performance/reliability (P/R) characteristics at the same time. Consequently, we proposed using ML methods to rapidly and quantitatively determine the optimal trap distributions. Furthermore, we analyzed sensitive trap parameters to determine the P/R characteristics. The remainder of this paper is organized as follows. Section II describes the methods used to train and optimize the ANN. Section III presents the results of the optimization and trends using ML-based analysis. Finally, the conclusions are presented in Section IV. Fig. 1 shows a part of the 3-D NAND flash string used in the simulation. The cylindrical structure was simulated using Sentaurus technology computer-aided design (TCAD) [19]. Both ends of the string were connected to a bit line (BL) and a source line (SL), and V th was extracted from the BL current vs. gate voltage curves of the selected word line (WL Sel. ). Unselected WLs (WL Unsel. s) were stacked around the WL Sel. to consider adjacent cells. The gate stack consisted of a metal gate, blocking oxide (BOX), charge trap nitride (CTN), bandgap-engineering tunneling oxide (BE-TOX), polysilicon channel, and Macaroni filler. No grain boundaries in the polysilicon channel were assumed to focus on the CTN materials. Drift-diffusion transport, Shockley-Read-Hall recombination, mobility (doping, high-field, and interface-dependent), and Hurkx band-to-band tunneling models were implemented in the simulation. A nonlocal tunneling model was used to describe the movement of charges between the channel and the CTN. In addition, the Poole-Frenkel model was considered, and the trapped charges were calculated in the CTN. The features of the simulation structure were obtained from an actual 3-D NAND cell [9]. In this study, the geometric features of the actual device were fixed only to see the effect of trap distributions in the CTN on the P/R characteristics.

II. SIMULATION STRUCTURE AND METHODS
All traps were spatially and energetically distributed in the CTN. In addition, each distribution was composed of donor-and acceptor-like traps, which captured the holes and electrons, respectively. Moreover, Gaussian trap distributions of the peak, mean, and standard deviations were assumed. Fig. 2(a) shows the spatial trap distributions. The spatial trap distribution assumes that the volume trap density is not uniform in the CTN. Each distribution is composed of peak spatial depths (D TD and D TA ) and standard deviations of the spatial traps (σ SD and σ SA ). σ SD and σ SA contain values that sufficiently reveal the deviation of the volume trap density. A thickness of 0 nm indicates the BE-TOX/CTN interface; the closer trapped charges are to zero, the more they are near the BE-TOX/CTN interface. In other words, the trap has a spatial distribution based on the CTN thickness.  indicates a shallow trap. Furthermore, the value of the volume trap density in Fig. 2(a) is the integral of the energetic trap distributions at a specific location on the CTN thickness. Fig. 3(a) shows V th as a function of the erasing/programming voltage; the ISPE/ISPP schemes. The ISPE/ISPP scheme is a basic ERS/PGM method that rapidly narrows the V th distribution by increasing the operating voltage. Therefore, the higher | V th |, the better the performance at the same operating voltage. The ISPE scheme starts from the PGM state, whereas the ISPP scheme starts from the ERS state. Sixteen V th values (open circle) are used as outputs at each erasing/programming voltage of 12 V to 19 V. Fig. 3(b) shows V th as a function of time. The experiments were conducted at a high temperature (125 • C) because V th barely changed at room temperature. The initial V th was in the PGM state to verify the charge transport, and WL Unsel. s were in the ERS state. In terms of reliability, the retention Under the same bias conditions, the simulation data (lines) were calibrated using the experimental data (symbols), as shown in Fig. 3. They are in good agreement when adjusting the inputs of the trap distributions to calibrated values, as shown in Table 1. Table 1 also summarizes the ranges of the input parameters to be modeled in the artificial neural network (ANN). The inputs of the energetic trap distributions, that is, N TD , E TD , σ ED , N TA , E TA , and σ EA , were referenced from [13], [14], [20], [21], and [22]. In addition, the inputs of the spatial trap distributions, that is, D TD , σ SD , D TA , and σ SA , were extracted from [12] and [23]. The ranges of each input were sufficiently feasible, and trap parameters could be determined by adjusting the gas flow ratios [24]. Consequently,   (Fig. 3). MLP was used to train the nonlinear function. After training, an optimization process was performed to find the inputs that made the outputs large by the gradient descent method.
the controllable input parameters were randomized between their minimum and maximum values, having different | V th | values accordingly. Fig. 4 shows a schematic of the ANN used in the ML process. The inputs are the 10-feature of the trap distributions, as shown in Fig. 2, and the outputs are the 22-feature of | V th | (open circle), as shown in Fig. 3. A multilayer perceptron (MLP) was used to model this relationship. The MLP has been widely used to build complex nonlinear models [25]. The feedforward network had one hidden layer with 30 nodes, and its activation function was a hyperbolic tangent. The Levenberg-Marquardt method was adopted for the training algorithm, and the cost function was the mean squared error (MSE). After successful training, the gradient descent method was used as an optimization algorithm in the inverse direction. Furthermore, by using ML-based optimization, we can precisely find the specific inputs that are challenging for engineers to find empirically. In the outputs, | V th | values were used to provide the same direction and were logged to prevent abnormal values (such as < 0). However, the | V th,R1∼R6 | of the retention characteristics were negative because they had different directions for improvement. These results will be discussed in Section III. In the inputs, N TD and N TA were logged, and then all parameters were standardized for rapid and accurate training. For example, it takes 4741 seconds on average to predict one device using TCAD, whereas only 21 seconds using ML. All ML operations were run in MATLAB [26] using a personal desktop computer with AMD Ryzen 7 5800X (8-Core, 3.8 GHz) and 32 GByte RAM. Table 2 summarizes the Spearman correlations between the inputs and outputs. They have a positive correlation as space changes to red and a negative correlation as it changes to blue. For the ISPE/ISPP characteristics, outputs are good to be larger, but the opposite is true for the retention characteristics. Therefore, correlation values with the same sign are likely to have a tradeoff between the P/R characteristics. For example, when E TA increases, the | V th,E1∼E8 | of the early phase is likely to decrease (degrade), whereas | V th,R1∼R6 | is likely to decrease (improve). It can also be observed in the N TA and σ EA . However, it is difficult to analyze the effect of multiple inputs on multiple outputs with only correlations. Therefore, in this study, we attempted to analyze the complex physical properties of the CTN in a 3-D NAND flash device using ML and suggested the best set of inputs for the tradeoffs.

III. RESULTS AND DISCUSSION
A. | V TH | PREDICTION AND OPTIMIZATION USING ARTIFICIAL NEURAL NETWORK In this section, we present the results of training and optimization using the ANN. We want | V th | of ISPE/ISPP characteristics to be larger and | V th | of retention characteristics to be smaller, but they inherently have tradeoffs. Therefore, the goal of optimization is to find the best set of inputs that make the best P/R characteristics as simultaneously as possible according to an estimated value, which is our evaluation criterion.   Fig. 5(a) shows the MSEs of the train/validation/test sets with epochs. After preprocessing, the total number of samples was 2290, and V th could be extracted from the 57250 BL current vs. gate voltage curves, resulting in the ISPE/ISPP and retention curves of Figs. 1 and 2. The samples were divided into train, validation, and test sets using weights of 70, 15, and 15, respectively. The train sets were used to train the models, and the validation sets were used to measure the performance of models built on the train set. A specific trained model with the best performance was selected using the validation sets. Finally, the test sets were used to measure the expected performance of the model after the validation sets were determined. Furthermore, the smaller the MSE, the better the ANN training. Subsequently, the ANN exhibited the best performance, with an MSE = 4.5903 × 10 −2 in 24 epochs after the set validation epochs are over in 34 epochs (10 epochs away). Fig. 5(b) shows the agreement between the targets and predicted values in the sets. The targets are the existing simulated values, and the corresponding predicted values lie on the yaxis. A regression of the test sets shows an ideal alignment in which R = 0.98 (close to 1). It implies that the trained ANN successfully modeled the nonlinear function between the inputs and outputs. Consequently, a well-trained ANN can predict the correct outputs for random inputs within a finite range.
Figs. 6(a), (b), and (c) show comparisons of the ISPE/ISPP and retention characteristics between the simulated and optimized values. Solid lines denote the results of simulated outputs from the randomized inputs shown in Table 1. The open symbols indicate the improved outputs predicted using the ANN. The loss function for the superior P/R characteristics can be expressed as follows: where the predictionŷ j is the matrix of ln(| V th,E1∼E8 |), ln (| V th,P1∼P8 |), and − ln (| V th,R1∼R6 |) in order (Fig. 4); thus, the value of feature j is for j = 1 to 22. The solution of (1) can be found in the direction where the gradient becomes small. In our case, the gradient decreases asŷ j increases, according to (1). Therefore, | V th,E1∼E8 | and | V th,P1∼P8 | increased, but | V th,R1∼R6 | decreased because they were made negative based on the previous training. The best ISPE, ISPP, and retention characteristics were determined by only one set of inputs, and the selection method will be discussed in the next paragraph. Fig. 7 shows the estimated values of optimization are usually greater than those of conventional simulation. We VOLUME 11, 2023 FIGURE 7. Cumulative probability of the estimated value of (2), the sum of enhancement for each characteristic between simulation and optimization. The estimated values of optimization are larger than those of simulation, which means the former is better than the latter. From this perspective, the largest estimated value means the best result.
initially set up randomized inputs and performed 2000 iterations using the well-trained ANN to optimize the trap distributions. In addition, we calculated the estimated values for 2000 samples using the following equation: The estimated value of simulation was calculated as its own value, notŷ j . Equation (2) is our evaluation criterion. In (2), the former indicates the | V th,E1∼E8 | and | V th,P1∼P8 | of ISPE/ISPP characteristics, and the latter indicates the | V th,R1∼R6 | of retention characteristics; the larger the former and the smaller the latter, the better the P/R characteristics. Therefore, the largest estimated value among the optimizations was selected as the best P/R characteristics, which is the final goal of optimization process. As a result, the best result increased by 6.19% from the calibrated one. Also, the average of cumulative distribution function for optimization increased by 13.87% from the simulated one. These values are sufficiently large considering the small size of the retention characteristics and the fast processing speed (just 40 minutes). Finally, we analyzed the best set of inputs that made the best P/R characteristics in the following section. Furthermore, the sensitive inputs that significantly affected the outputs were also investigated in detail.

Figs. 8(a) and (b)
show the qualitative direction of the trap distributions for the best P/R characteristics and the best donor/acceptor-like traps showing trap profiles with the largest estimated value. Fig. 8(a) shows the best spatial trap distributions compared with the calibrated distributions. If the volume trap density was large near the BE-TOX/CTN interface, we had a question of which one was larger, the amount of charge tunneled from the channel in the ISPE/ISPP characteristics or the amount of charge leaked to the channel in the retention characteristics. As a result of ML, the best distributions have larger volume trap densities with wide σ SD and narrow σ SA , and then D TD and D TA are closer to the BE-TOX/CTN interface than the calibrated distributions. In general, these results suggest that the spatial trap distributions should be large and close to the BE-TOX/CTN interface to improve the P/R characteristics. Fig. 8(b) shows the best energetic trap distributions compared with the calibrated distributions. As a result of ML, the best distributions have larger N TD and N TA and deeper E TD and E TA values, but σ ED and σ EA decrease slightly compared with the calibrated values. These results suggest that the large trap densities and deep energy levels of the energetic trap distributions can improve the P/R characteristics, generally. The values of the best inputs are listed in Table 3. Also, the dominant factors for the best are briefly summarized. Especially, some inputs are close to their minimum/maximum values, whereas others have ambiguous values, such as E TA and σ EA. It means that all inputs should not necessarily be their minimum or maximum for the best. Furthermore, the difference among the optimizations is insignificant (Fig. 6), indicating that several optimized inputs can also improve the P/R characteristics because their results would not differ much from the best results. Therefore, we analyzed the box plot for the optimized inputs to get the sensitivity of each input parameter, as shown in Fig. 9. Fig. 9 shows the standardized values of all inputs are drawn in a box plot because of the different feature sizes. The box size of the energetic trap distributions (N TD , E TD , σ ED , N TA , E TA , and σ EA ) is smaller than that of the spatial trap distributions (D TD , σ SD , D TA , and σ SA ). The median lines are close to the maximum, implying that the sensitivity of the energetic trap distributions is greater than that of the spatial trap distributions. This result could be expected from the relatively small correlation between the spatial trap distributions and | V th | (Table 2), and it was verified through the MLbased analysis. In addition, the boxes for N TD and N TA contain the maximum. We confirmed that as N TD and N TA increased, | V th,E1∼E8 | and | V th,P1∼P8 | increased because of the many available trap sites, but | V th,R1∼R8 | also increased with the increasing electric field in the CTN [22], [27]. However, the maximum values of N TD and N TA are preferred because the degree of increase in | V th,E1∼E8 | and | V th,P1∼P8 | are greater than that of | V th,R1∼R8 |.  Table 3. Fig. 10 shows the output trends for the remaining four inputs of the energetic trap distributions. Each of the remaining four inputs was split, leaving the other inputs fixed at their calibrated values. Additionally, | V th,E2 |, | V th,E8 |, | V th,P8 |, and | V th,R6 |, representative outputs of each characteristic, were used for the analysis. As E TD increased, | V th,E8 | and | V th,P8 | remained almost the same, whereas | V th,R6 | decreased and then was saturated, as shown in Fig. 10(a). This is because deeper energy reduces the attemptto-escape factor for traps [28].
In Fig. 10(b), as σ ED increased, | V th,E8 | remained the same, | V th,P8 | increased slightly, and | V th,R6 | decreased after σ ED = 0.16 eV. Initially, | V th,R6 | increased slightly due to the shallow hole traps in the WL Sel. region, but after σ ED = 0.16 eV, | V th,R6 | decreased because the holes already filled in the gate space were rather reduced due to the increase in the shallow hole traps. Therefore, the holes that invade the WL Sel. region decreased, positively affecting the retention characteristics. Fig. 10(c) shows the results of increasing E TA . We found that the early (| V th,E2 |) and late (| V th,E8 |) phases of the ISPE exhibited different trends. In the early phase, the escape of trapped electrons begins first, and so the lower E TA results in the higher ERS efficiency [29]. However, this phenomenon was reversed in the late phase because of the same reason for increasing E TD . In addition, as the E TA increased, | V th,R6 | decreased rapidly but started to saturate. Therefore, when considering the tradeoffs between the ISPE and retention characteristics, the ML determined the best value of E TA to be 1.92 eV (Table 3). Fig. 10(d) shows the results of increasing σ EA . The tradeoff between the P/R characteristics is clearly demonstrated. As σ EA increased, the integral of the energetic trap distributions also increased, resulting in an increase in | V th,E8 | and | V th,P8 |. However, | V th,R6 | increased because the shallow electron traps increase and then escape easily. Therefore, when considering the tradeoff between the ISPE/ISPP and retention characteristics, the ML determined the best value of σ EA to be 0.30 eV (Table 3).

IV. CONCLUSION
ML-based analysis was used to determine the optimal trap distributions of CTN for improving the P/R characteristics simultaneously in 3-D NAND flash memory. The ANN could model the relationship between the trap distributions and | V th | and was trained using an experimentally-calibrated TCAD simulation dataset. The trained ANN exhibited high accuracy and predicted specific inputs for the superior P/R characteristics. As a result, we found the values of the best inputs using our evaluation criterion and analyzed the trends of each sensitive input parameter. In general, the influence of the energetic trap distributions on | V th | was greater than that of the spatial trap distributions. Moreover, increasing all inputs of the energetic trap distributions proved to be better for the P/R characteristics. However, it was found that E TA and σ EA should not be maximized because they cause a tradeoff in the P/R characteristics. Therefore, even if the actual process was difficult, ML could determine their optimal values. Therefore, this study is helpful in determining the complex physical properties of CTN in the 3-D NAND flash.

ACKNOWLEDGMENT
The EDA tool was supported by the IC Design Education Center (IDEC).