Characterization, Modeling, and Test of Intermediate State Defects in STT-MRAM

The manufacturing process of STT-MRAM requires unique steps to fabricate and integrate magnetic tunnel junction (MTJ) devices, which are the data-storing elements. Thus, understanding the defects in MTJs and their faulty behaviors is paramount for developing high-quality test solutions. This article applies the advanced device-aware test to intermediate (IM) state defects in MTJ devices based on silicon measurements and circuit simulations. An IM state manifests itself as an abnormal third resistive state, which differs from the two bi-stable states of the MTJ. We performed silicon measurements on MTJ devices with diameters ranging from 60 nm to 120 nm; the results show that the occurrence probability of the IM state strongly depends on the switching direction, device size, and bias voltage. We demonstrate that the conventional resistor-based fault modeling and test approach fails to appropriately model and test such a defect. Therefore, device-aware test is applied. We first physically model the defect, incorporate it into a Verilog-A MTJ compact model, and calibrate it with silicon data. Thereafter, this model is used for a systematic fault analysis based on circuit simulations to obtain accurate and realistic faults in a pre-defined fault space. Our simulation results show that an IM state defect leads to intermittent write transition faults. Finally, we propose and implement a device-aware test solution to detect the IM state defect.


INTRODUCTION
Spin-transfer torque magnetic random access memory (STT-MRAM) is one of the most promising emerging memory technologies, thanks to its advantageous features: non-volatility, fast access speed, high endurance, nearly zero leakage power, and CMOS compatibility [1]. The flexible trade-off between write speed, endurance, and retention also allows it to be tailored to different layers of the present memory hierarchy, ranging from high-retention storage to high-performance caches [2]. Therefore, STT-MRAM has stimulated several start-ups (e.g., Everspin [3], Avalanche [4]) and major global semiconductor companies (e.g., Intel [5], Samsung [6], TSMC [2]) to commercialize this technology. Nevertheless, to enable high-volume production of STT-MRAM, high-quality test solutions are paramount to meet the increasingly stringent quality requirements of IC chips being shipped to end customers. The STT-MRAM manufacturing process involves not only the conventional CMOS process but also magnetic tunnel junction (MTJ) fabrication and integration [6]. The latter is more vulnerable to defects as it requires deposition, etch, and integration of magnetic materials with new tools [7]. A blind application of conventional tests for existing memories such as SRAM and DRAM to STT-MRAM may lead to test escapes and yield loss. Hence, understanding MTJ-internal defects and their resultant faulty behaviors is crucial for developing high-quality STT-MRAM test solutions.
STT-MRAM testing is still an ongoing research topic. Several fault models such as multi-victim, kink, and write destructive faults [8,9] were proposed for field-driven MRAMs. However, these fault models are not applicable to current-driven STT-MRAMs. Chintaluri et al. [10] derived fault models such as transition faults and read disturb faults in STT-MRAM arrays by simulating the impact of resistive defects in the presence of process variations; a March algorithm and its built-in-self-test implementation were also introduced. Nair et al. [11] performed layout-aware defect injection and fault analysis, whereby they observed dynamic incorrect read faults. Nevertheless, all these papers assume, without any justification, that STT-MRAM defects including those in MTJ devices are equivalent to linear resistors. Recently, Wu et al. [12] presented both experimental data and simulation results of pinhole defects in MTJ devices, and demonstrated that modeling pinhole defects as linear resistors is inaccurate and results in wrong fault models. To address the limitations of the traditional fault modeling and test approach, Fieback et al. [13] proposed the concept of Device-Aware Test (DAT), a step beyond cell-aware test. The DAT approach models physical defects accurately by incorporating the impact of such defects into the technology parameters and subsequently into the electrical parameters of the device. With the obtained defective device model, a systematic fault analysis can be conducted to develop realistic fault models; these fault models are then used to develop high-quality test solutions.
In this paper, we characterize intermediate (IM) state defects in STT-MRAMs and apply the DAT approach to model this defect, obtain accurate and realistic fault models, and develop an appropriate test. Normally, an MTJ device only has two bi-stable resistive states representing logic '0' and '1'. However, due to physical imperfections such as unreversed magnetic bubbles [14], inhomogeneous distribution of stray field [15], or even skyrmion generation [16], a third resistive state may arise, leading to unintended faulty memory behaviors. The main contributions of this paper are as follows.
• Characterize the IM state defect in MTJs with diameter ranging from 60 nm to 120 nm based on silicon measurements.
• Demonstrate that the conventional resistor-based fault modeling and test approach fails to derive effective fault models and test solutions to detect such a defect.
• Develop a Verilog-A compact model for a defective MTJ device suffering from IM state defect, and calibrate the model with silicon data.
• Perform device-aware fault modeling to develop accurate and realistic fault models induced by the IM state defect.
• Propose and implement an effective test solution with weak write operations.
The remainder of this paper is organized as follows. Section 2 introduces the fundamentals of STT-MRAM and device-aware test. Section 3 presents characterization results of the IM state defect. Section 4 discusses the limitations of testing the IM state defect using conventional resistive defect models. Section 5, Section 6, and Section 7 apply the DAT approach to physically model the IM state defect, derive accurate fault models, and develop a test solution, respectively. Section 8 concludes this paper.

MTJ Device Technology
The magnetic tunnel junction (MTJ) is the most important component in STT-MRAMs, as it is the data-recording element which encodes two bi-stable magnetic states into one-bit data. Fig. 1a shows the schematic of a simplified MTJ device; its Critical Diameter (CD) is typically 20 nm-150 nm. The cross-sectional area A_0 = (1/4)πCD^2 is a key technology parameter of the device. Fundamentally, the MTJ consists of three layers.
1) Free Layer (FL): This is the top layer, typically made of CoFeB-based materials (t_FL ≈ 1.5 nm [17]). The magnetization of the FL can be switched by a spin-polarized current going through it or by an external perpendicular magnetic field. The saturation magnetization M_s and magnetic anisotropy field H_k are two key technology parameters determining the thermal stability factor ∆ as well as the switching characteristics of the FL [18], as listed in Table 1.
2) Tunnel Barrier (TB): This is the MgO dielectric layer below the FL. As the TB layer is ultra-thin, typically t_TB ≈ 1 nm [19], electrons have a chance to tunnel through it, making the device behave as a tunneling-like resistor. To compare the sheet resistivity of different MTJ designs, the Resistance-Area (RA) product [18] is used. This figure of merit is commonly used in the MRAM community, and it is independent of device size.
3) Pinned Layer (PL): This is the bottom CoFeB-based layer (t_PL ≈ 2.5 nm) with its magnetization strongly pinned to a certain direction by a synthetic anti-ferromagnetic structure [19]. As a result, the FL's magnetization can be either parallel (P state) or anti-parallel (AP state) to the PL's.
The MTJ's resistance depends on both t_TB and the magnetic state (i.e., P or AP). This is well known as the tunneling magneto-resistance (TMR) effect [18], which is characterized by the TMR ratio, defined as TMR = (R_AP − R_P)/R_P, where R_AP and R_P are the resistances in AP and P states, respectively.
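The relations among CD, cross-sectional area, RA product, and TMR ratio introduced above can be sketched numerically. The following is a minimal illustration only: the function names are hypothetical, and the relation R = RA/A_0 is the usual reading of the RA product rather than a formula stated in this paper.

```python
import math


def cross_section_area_nm2(cd_nm: float) -> float:
    """Cross-sectional area A_0 = (1/4) * pi * CD^2, in nm^2."""
    return 0.25 * math.pi * cd_nm ** 2


def resistance_from_ra(ra_ohm_um2: float, cd_nm: float) -> float:
    """Resistance in ohms, assuming R = RA / A_0 (assumption, see lead-in)."""
    area_um2 = cross_section_area_nm2(cd_nm) * 1e-6  # 1 nm^2 = 1e-6 um^2
    return ra_ohm_um2 / area_um2


def tmr_ratio(r_ap: float, r_p: float) -> float:
    """TMR ratio (R_AP - R_P) / R_P as defined in the text."""
    return (r_ap - r_p) / r_p
```

Because RA is independent of device size, halving CD in this sketch quadruples the resistance, while the TMR ratio is unchanged.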
Similar to other non-volatile memory technologies, STT-MRAMs must retain their data for an expected period of time that depends on the target application. An STT-MRAM retention fault occurs when the magnetization of the MTJ's FL flips spontaneously to the opposite direction due to thermal fluctuation. Thus, the STT-MRAM retention time is generally characterized by the thermal stability factor (∆) [18]. The higher the ∆, the longer the retention time.

1T-1MTJ Cell Design
Fig. 1b shows a bottom-pinned 1T-1MTJ memory cell and its corresponding read/write (R/W) operations. The three-terminal cell includes an MTJ device (storage element) and an NMOS transistor (access selector). The three terminals are connected to a bit line (BL), a source line (SL), and a word line (WL), as shown in the figure. The voltages on the BL and SL control R/W operations on the cell when the WL is asserted. For instance, a write '0' operation requires the BL at V_DD and the SL grounded, which leads to a current I_w0 flowing from BL to SL. In contrast, a current I_w1 with the opposite direction goes through the cell during a write '1' operation. To guarantee a successful transition of the MTJ state, the magnitude of the write current (both I_w0 and I_w1) has to be larger than the critical switching current I_c. The larger the current above I_c, the faster the switching. It is worth noting that the actual switching time t_w under a fixed pulse varies from one cycle to another since the STT-induced magnetization switching is intrinsically stochastic. During a read operation, a voltage V_read significantly smaller than V_DD is applied on the BL to draw a read current I_rd, which can be as small as ∼10 µA or 0.06·I_c [20], to read the resistive state (R_P or R_AP) of the MTJ device by a sense amplifier.
Table 1 lists the key technology parameters of MTJ device to be used for defect modeling.

Device-Aware Test
In conventional tests or cell-aware tests, fault models are derived based on defect injection and circuit simulations at netlist or layout level. All defects, irrespective of their physical nature, in both interconnects and devices are modeled as linear resistors; e.g., a device-internal defect is typically modeled as a resistor either in parallel to or in series with a defect-free device model, as can be found in the prior work [10,11]. However, it has been demonstrated in recent years that this defect modeling approach is inaccurate for pinhole defects in MTJs [12], forming defects in RRAM devices [13], and gate oxide pinhole defects in transistors [21]. Moreover, conventional memory faults are typically described by the fault primitive notation [22], where only '0' and '1' states exist. However, in emerging non-volatile memories such as STT-MRAM and RRAM, undefined and extremely low/high resistive states may occur due to defects [17]. This calls for an expansion of the memory fault space. To address the above limitations, Device-Aware Test (DAT) [13] was proposed to provide a systematic framework for appropriate fault modeling and test of device-internal defects. DAT consists of three steps, as illustrated in Fig. 1c. First, manufacturing defects in devices are characterized and modeled physically; the impact of the defect on the technology parameters of the defective device is determined. Subsequently, this impact is incorporated into the device's electrical parameters to obtain a parameterized defective device compact model, which can be calibrated by silicon data if available. Second, the defect-free model of the device used in the netlist (simulation model) is replaced with the defective device model obtained in step 1; a systematic fault analysis is then performed to validate realistic faults within a pre-defined complete fault space. Third, based on the fault modeling results in step 2, appropriate test solutions are developed; e.g., March tests, Design-for-Testability (DfT), stress tests, etc.

DEFECT CHARACTERIZATION
Electrical characterization with pulses is a common practice to evaluate the write performance of STT-MRAM devices. When we performed comprehensive characterization on devices with CD ranging from 60 nm to 120 nm, some devices showed an abnormal third resistive state in addition to the two bi-stable P and AP states. As the resistance of this unexpected state is always between R_P and R_AP, we refer to it as the intermediate (IM) state in this article. In this section, we first introduce the experimental set-up for measuring IM state defects. Thereafter, the measured results of an MTJ device without any IM states and an MTJ device with a single IM state are presented and compared. Then, we elaborate on the dependence of the IM state occurrence probability on bias voltage, device size, and switching direction. Finally, we briefly review the related work in the literature and discuss the potential causes of IM state defects.

Measurement Set-up
Fig. 2a and 2b show the pulse configurations in each cycle for AP→P and P→AP switching characterization, respectively. For AP→P switching characterization, a positive voltage pulse (V_p = 0.6 V, t_p = 50 ns) was applied to the MTJ device under test to initialize it to AP state, as illustrated in Fig. 2a. The pulse was followed by a read operation using a relatively long but small voltage pulse (V_p = 10 mV, t_p = 0.7 ms) to check whether the device had been initialized to AP state successfully. After the read, a negative pulse with t_p = 15 ns was applied to the device to study AP→P switching. Similarly, a second read was applied to read out the resistive state of the device. As the switching behavior is intrinsically stochastic, we repeated these four operations for 10k cycles to obtain a statistical result. To cover the switching probability P_sw from 0% to 100%, we swept the pulse amplitude V_p of the second pulse in a carefully tuned range. For P→AP switching characterization, a similar measurement was conducted with the polarity of both write pulses reversed, as shown in Fig. 2b.

Identification of IM State Defects
Fig. 3a and 3b show the measured results of a representative normal MTJ A (nominal CD = 100 nm) for AP→P switching and P→AP switching, respectively; each point represents a readout resistance of the second read pulse in Fig. 2. It can be seen that when V_p = −0.74 V, the AP→P switching probability is 100% in the measured 10k cycles. When V_p = 0.45 V, the P→AP switching probability is 99.2%, meaning that 0.8% of the 10k cycles experience failed transitions (marked with red triangles) due to the STT-switching stochasticity. In both cases, no third resistive state is observed. In contrast, Fig. 3c and 3d show the measurement data of a typical device with an IM state (MTJ B) with the same size and experimental conditions. It is clear that a line of unexpected orange points (i.e., the IM state) shows up between the two lines representing AP and P states. The occurrence probability of the IM state in the AP→P switching direction is 1.6% when V_p = −0.74 V, while it is 0.6% in the opposite switching direction when V_p = 0.45 V. It is also worth noting that the probability of failed transition of MTJ B is much higher than that of MTJ A under the same applied pulses. The disparity of R_P (red lines) and R_AP (green lines) between these two devices is attributed to process variations; the slight TMR drop of this defective MTJ relative to good MTJs was not observed consistently across all defective MTJs with IM states.
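The occurrence probabilities quoted above are obtained by sorting the readout resistances of many cycles into P, AP, and IM bands. A minimal sketch of such a classification is shown below; the function names and the relative tolerance `tol` around R_P and R_AP are illustrative assumptions, not the thresholds used in the measurements.

```python
from collections import Counter


def classify_readout(r_ohm, r_p, r_ap, tol=0.1):
    """Label one readout resistance as 'P', 'AP', 'IM', or 'other'.

    tol is a hypothetical relative window around R_P and R_AP.
    """
    if abs(r_ohm - r_p) <= tol * r_p:
        return "P"
    if abs(r_ohm - r_ap) <= tol * r_ap:
        return "AP"
    if r_p < r_ohm < r_ap:
        return "IM"  # strictly between the two bi-stable states
    return "other"


def state_probabilities(readouts, r_p, r_ap, tol=0.1):
    """Fraction of cycles ending in each state."""
    counts = Counter(classify_readout(r, r_p, r_ap, tol) for r in readouts)
    n = len(readouts)
    return {s: counts.get(s, 0) / n for s in ("P", "AP", "IM", "other")}
```

For a P→AP experiment, the fraction labeled 'AP' would correspond to the successful transition probability, 'P' to failed transitions, and 'IM' to the IM state occurrence probability.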

Dependence of IM State Defects
We observed that the occurrence of the IM state significantly depends on the applied bias voltage, switching direction (i.e., AP→P or P→AP), and device size in our experiments. Fig. 4 shows the successful transition probability and the occurrence probability of the IM state of four different MTJ devices in the AP→P and P→AP switching directions; the nominal CD of MTJ C and D is 100 nm while it is 120 nm for MTJ E and F. It can be seen that the successful transition probability (P_ST) between P and AP states (marked with green square points corresponding to the left y-axis) increases from 0% to 100% as the amplitude of V_p increases in both switching directions. The orange circle points represent the occurrence probability of the IM state (P_IM), corresponding to the right y-axis, at various V_p points. One can observe that P_IM increases with the amplitude of V_p until reaching a peak at P_ST ≈ 50% (marked with the horizontal dashed line), then decreases as V_p further increases; this rule applies to all four devices in both switching directions, although the peak height of P_IM varies from one device to another. Furthermore, even for the same device, there is a large difference in the peak height of P_IM between the AP→P and P→AP switching directions. This indicates that the occurrence probability of the IM state also depends on the switching direction.
To investigate whether the MTJ size plays a role in determining the occurrence probability of the IM state, we repeated the same measurements on MTJ devices with four different sizes, i.e., CD = 60 nm, 75 nm, 100 nm, and 120 nm. For each size, we measured 60 devices; the number of devices with an IM state is shown with the blue histogram (left y-axis) in Fig. 5. It is clear that the smaller the MTJ device (i.e., smaller CD), the less likely it is to exhibit IM states. More specifically, 57 out of the 60 measured devices with CD = 120 nm exhibit IM states, whereas the number is 5 and 0 for MTJs with CD = 75 nm and 60 nm, respectively. Among the devices with observed IM states, the median of the maximum occurrence probability of the IM state (i.e., the peak height of P_IM in Fig. 4) becomes smaller when CD decreases, as shown with the two orange curves corresponding to the right y-axis in Fig. 5. It is also worth noting that the median of the maximum P_IM in the AP→P switching direction is slightly smaller than that in the P→AP switching direction for a given MTJ size. This is probably because AP→P switching generates more Joule heating than the opposite switching direction, which reduces the retention time of the IM state; thus, the number of IM states captured on average is smaller in the AP→P switching direction under the same measurement set-up. Interestingly, Intel also presented similar measurement results in [15]. Based on the above observations, it can be inferred that STT-MRAM technology down-scaling is helpful in reducing IM state defects in MTJs, thus leading to a more deterministic and uniform transition between the bi-stable AP and P states.

Related Work and Potential Causes
There are several prior works studying IM states in MTJ devices based on experiments and/or simulations, as listed in Table 2. Yao et al. [23] observed stable IM states in both P→AP and AP→P switching directions after the removal of write pulses with a measurement set-up similar to ours; the read pulse width is 200 ms, indicating that the retention time of the IM state (RT_IM) is at least 200 ms. They attributed the physical cause of the IM state to the multi-structure of the FL induced by the dipole field and large device size. Aoki et al. [24] also observed IM states during STT-switching with sub-10 ns pulses and claimed that those IM states are metastable, meaning that they disappear after the removal of write pulses; the claimed physical cause is similar to the above one. Subsequently, more research works [14,15,25] were conducted and reported that the observed IM states are metastable due to the inhomogeneous distribution of stray field at the FL and unreversed magnetic bubbles, as elaborated in the table. In the past two years, studies in [16,26] have revealed that IM states in MTJ devices can arise from skyrmion formation and that their retention time can be as long as that of the bi-stable P and AP states. In this work, our measurement data also clearly demonstrates the existence of IM states in MTJ devices, especially for large sizes (CD > 75 nm). The IM state manifests as a third resistive state between P and AP states. Its occurrence is probabilistic, depending on the switching direction, applied bias voltage, and device size. In addition, we swept the read pulse width from 50 µs to 10 ms in our measurements; the results show that the IM states occur in all these configurations, indicating that RT_IM is larger than 10 ms after the removal of write pulses. The root causes can be attributed to physical imperfections such as unreversed magnetic bubbles, inhomogeneous distribution of stray field, or even skyrmion generation. To accurately describe the faulty behavior of an STT-MRAM cell in the presence of an IM state defect, we need an accurate defect model.

LIMITATIONS OF CONVENTIONAL TEST APPROACH
In conventional memory testing, manufacturing defects are typically modeled as linear resistors, namely opens, shorts, and bridges [27]. The resistance value represents the defect strength. This approach has also been carried over to testing emerging non-volatile memories such as STT-MRAM, as can be found in the prior art [8-11,17]. Any defect in the MTJ device is modeled as a linear resistor either in parallel to (R_pd) or in series with (R_sd) a defect-free MTJ model, as illustrated in Fig. 6. The physical mechanism of the defect is never taken into account, so different defects are not distinguished in the defect model.
To verify the effectiveness of resistive models in modeling the IM state defect, we injected R_sd and R_pd separately into our STT-MRAM simulation circuits and performed static fault analysis. A static fault is defined as a fault that can be sensitized by at most one operation. To describe static memory faults in a systematic way, we adopted the fault primitive (FP) notation [22]. An FP is denoted as a three-tuple <S/F/R>, where
• S (sensitization) denotes the operation sequence that sensitizes the fault. S ∈ {0, 1, 0w0, 0w1, 1w0, 1w1, 0r0, 1r1}; '0' and '1' are logic values, 'r' and 'w' denote a read and a write operation, respectively.
• F (faulty effect) describes the value of the faulty cell after S is performed; F ∈ {0, 1}.
• R (readout value) describes the output of a read operation in case the last operation in S is a read. R ∈ {0, 1, −} where '−' denotes that R is inapplicable.
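The FP notation defined above spans a small, finite space that can be enumerated mechanically. The sketch below lists every <S/F/R> combination and drops the ones that describe fault-free behavior; the helper names are illustrative and not part of the notation in [22].

```python
SENSITIZERS = ["0", "1", "0w0", "0w1", "1w0", "1w1", "0r0", "1r1"]


def fault_free(s):
    """Expected (F, R) of a defect-free cell for sensitizing sequence s."""
    if len(s) == 1:              # state only, no operation
        return s, "-"
    init, op, val = s[0], s[1], s[2]
    if op == "w":                # a write stores the written value
        return val, "-"
    return init, val             # a read keeps the state and returns it


def enumerate_fps():
    """All single-cell static fault primitives as '<S/F/R>' strings."""
    fps = []
    for s in SENSITIZERS:
        readouts = ["0", "1"] if "r" in s else ["-"]
        for f in "01":
            for r in readouts:
                if (f, r) != fault_free(s):
                    fps.append(f"<{s}/{f}/{r}>")
    return fps
```

Counting the result gives 2 state faults, 4 write faults, and 6 read faults, i.e., 12 FPs in total.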
For example, <0w1/0/-> denotes that a w1 operation to a cell containing '0' (S=0w1) fails: the cell remains in its initial value '0' (F=0), and the read output is not applicable (R=−). Using the above FP notation, the entire fault space for single-cell static faults can be defined; it can be easily derived that it consists of 12 FPs [28]. The fault modeling results are shown in Table 3. It can be seen that four different FPs were sensitized; they are IRF0, IRF1, TF1, and TF0. Note that a single defect may cause different FPs, depending on its strength (i.e., resistance in this case). These four FPs can be used to generate test solutions such as March algorithms. First, each sensitized FP is assigned its own detection condition. For instance, IRF0=<0r0/0/1> requires a read operation on the faulty cell at state '0' to guarantee its detection, denoted as ⇕(...0, r0, ...), where ⇕ means that the detection condition does not depend on the addressing direction [22]. The detection condition for TF1=<1w0/1/-> is ⇕(...1, w0, r0, ...), meaning that a down-transition write followed by a read is enough to detect this fault, regardless of the addressing direction. The detection conditions of all sensitized FPs are then compiled into an optimal March test with three march elements. Note that different versions of March tests can be generated (e.g., with two march elements) as long as the test satisfies all the detection conditions.
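To make the detection conditions concrete, the sketch below simulates a small memory with one faulty cell and runs a three-element March test, {⇕(w0); ⇕(w1, r1); ⇕(w0, r0)}. This particular test is an assumption chosen because it satisfies the four detection conditions; it is not necessarily the test given in the paper. The fault names TF0=<0w1/0/-> and IRF1=<1r1/1/0> are inferred by symmetry from TF1 and IRF0 as defined in the text.

```python
class Cell:
    """One memory cell with an optional static fault (illustrative model)."""

    def __init__(self, fault=None):
        self.state = 0
        self.fault = fault

    def write(self, value):
        if self.fault == "TF0" and self.state == 0 and value == 1:
            return  # up-transition fails, cell stays at 0
        if self.fault == "TF1" and self.state == 1 and value == 0:
            return  # down-transition fails, cell stays at 1
        self.state = value

    def read(self):
        if self.fault == "IRF0" and self.state == 0:
            return 1  # incorrect read, state unchanged
        if self.fault == "IRF1" and self.state == 1:
            return 0
        return self.state


# Assumed March test; addressing order is irrelevant for these faults.
MARCH = [["w0"], ["w1", "r1"], ["w0", "r0"]]


def run_march(cells, march=MARCH):
    """Return True if any read deviates from its expected value."""
    for element in march:
        for cell in cells:
            for op in element:
                kind, value = op[0], int(op[1])
                if kind == "w":
                    cell.write(value)
                elif cell.read() != value:
                    return True
    return False


def detects(fault, n_cells=4, faulty_index=2):
    cells = [Cell(fault if i == faulty_index else None) for i in range(n_cells)]
    return run_march(cells)
```

Each of the four faults in this model is deterministic, so a single pass suffices. A fault that appears only in a fraction of write cycles could pass the same test undetected.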
Based on our measurement results in the previous section, one can easily observe that the four FPs sensitized using the conventional fault modeling approach cannot cover the faulty behaviors of IM state defects in MTJ devices. This is because an IM state defect manifests itself as a resistive state between R_P and R_AP with a certain occurrence probability. This means that the defect may turn an MTJ device into the undefined state 'U' and that this faulty behavior occurs intermittently. The conventional fault modeling and test approach considers the MTJ device as an ideal black box (only states '0' and '1'). Therefore, it fails to capture the above-mentioned characteristics of the IM state defect. As the four FPs are inappropriate for representing IM state defects, March tests that target these faults cannot guarantee the detection of such defects. Therefore, we need to apply DAT to IM state defects for accurate defect and fault modeling, which will eventually lead to the high-quality test solutions that we desire.

DEVICE-AWARE DEFECT MODELING
In order to investigate the faulty behavior of a memory cell in the presence of an IM state defect, an appropriate physics-based defect model first needs to be developed. In this section, we follow the device-aware defect modeling approach proposed in [7], which consists of three steps: 1) physical defect analysis and modeling, 2) electrical modeling of the defective MTJ device, and 3) fitting and model optimization. Next, we work out these three steps for the IM state defect.

Physical Defect Analysis and Modeling
Based on the characteristics and potential forming mechanisms of the IM state, as presented with silicon measurements in Section 3, we physically model the IM state in three key aspects as follows.

Partial switching behavior of the FL
As explained in the previous section, the most probable cause of the IM state in MTJ devices is that some parts of the FL switch to the intended state under a write pulse while the rest remain in their initial state due to unreversed magnetic bubbles, inhomogeneous distribution of stray field at the FL, or even skyrmion generation. Therefore, we model this partial switching behavior by splitting the FL into two regions: 1) a P-state region and 2) an AP-state region, with the assumption that these two regions are independent magnetically and electrically. Fig. 7a and Fig. 7b show the vertical and horizontal cross-section schematics of an MTJ device with both P-state and AP-state regions, respectively. As a result, we can derive:

$$A_0 = A_P + A_{AP}, \qquad A_{IMP} = \frac{A_P}{A_0}, \qquad A_{IMAP} = \frac{A_{AP}}{A_0} = 1 - A_{IMP} \tag{1}$$

where A_0 is the entire cross-sectional area of the MTJ, and A_P and A_AP are the cross-sectional areas of the P-state and AP-state regions, respectively. A_IMP and A_IMAP are the corresponding normalized areas with respect to A_0; they can take any value in the range [0, 1]. Note that this model also covers the defect-free case where only P and AP states can exist exclusively; i.e., A_IMP = 0 represents AP state whereas A_IMP = 1 means P state.

Probabilistic occurrence of IM state
As introduced previously, the IM state does not show up in all write cycles. Instead, we observed experimentally that it has a certain occurrence probability depending on the applied bias voltage V_p, the MTJ size CD, and the switching direction. Apart from that, it is expected that the FL thickness (t_FL) also plays a role in determining the IM occurrence probability, as it significantly influences the thermal stability of the device [18]. We define a discrete random variable X indicating whether or not the IM state occurs. For a given V_p, CD, and t_FL, X obeys a Bernoulli distribution. Its probability mass function Pr(X) is:

$$\Pr(X) = \begin{cases} P_{IM}, & X = 1 \ \text{(IM state occurs)} \\ 1 - P_{IM}, & X = 0 \ \text{(IM state does not occur)} \end{cases} \tag{2}$$

As shown in Fig. 4, the correlation between P_IM and V_p exhibits a curve which is quite similar to a Gaussian function (bell curve). Thus, we model the V_p dependence of P_IM with a Gaussian function (Equation (3)), where V_pk is the applied bias voltage at which P_IM reaches its peak H_IM, and V_wd is a parameter controlling the width of the bell curve. Note that the polarity of V_p determines the switching direction; a negative V_p results in an AP→P transition while a positive V_p leads to the reverse transition.
Since H_IM shows a linear scaling trend with CD, as shown in Fig. 5, it can be modeled as a piecewise linear function of CD (Equation (4)), where S_lp is the slope of the curve. Since all the measurements we performed were on MTJ devices with the same t_FL, it is assumed that t_FL has no impact on P_IM. However, for a generic model covering devices with different t_FL, such impact should be incorporated. Combining Equations (2)-(4), S_lp, V_pk, and V_wd are three fitting parameters which can be tuned and fitted to measurement data, as covered later.
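The probabilistic occurrence model can be sketched as a Gaussian-shaped P_IM followed by a Bernoulli draw per write cycle. The exact normalization of the exponent is an assumption here (the standard form with 2·V_wd^2 in the denominator is used), and the peak height H_IM is passed in directly rather than computed from CD. The function names are illustrative.

```python
import math
import random


def p_im(v_p, h_im, v_pk, v_wd):
    """Bell-shaped IM occurrence probability versus bias voltage.

    ASSUMPTION: Gaussian form H_IM * exp(-(V_p - V_pk)^2 / (2 * V_wd^2)).
    """
    return h_im * math.exp(-((v_p - v_pk) ** 2) / (2.0 * v_wd ** 2))


def im_occurs(v_p, h_im, v_pk, v_wd, rng):
    """One Bernoulli trial: True if the IM state occurs in this write cycle."""
    return rng.random() < p_im(v_p, h_im, v_pk, v_wd)


def empirical_p_im(v_p, h_im, v_pk, v_wd, cycles=10_000, seed=0):
    """Fraction of simulated write cycles that end in the IM state."""
    rng = random.Random(seed)
    hits = sum(im_occurs(v_p, h_im, v_pk, v_wd, rng) for _ in range(cycles))
    return hits / cycles
```

With the fitted P→AP values reported later for MTJ C (V_pk = 0.4369, V_wd = 0.0145) and a hypothetical peak height such as H_IM = 0.02, `empirical_p_im` mimics the 10k-cycle measurement at a chosen V_p.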

Retention time estimation of IM state
The retention time of the IM state (RT_IM) indicates how long the IM state remains after the removal of write pulses; it determines the time period during which the faulty memory behavior appears in the presence of the IM state. Thus, it is important to estimate RT_IM of our devices and integrate it into the defect model if necessary. Conventionally, the following static model is used to roughly estimate the retention time of the AP or P state for a given ∆ [29]:

$$RT = \tau_0 \exp(\Delta) \tag{5}$$

where τ_0 is the inverse of the attempt frequency (∼1 ns). However, the retention time of STT-MRAMs is intrinsically stochastic, as the magnetization flip induced by thermal fluctuation is unpredictable. This static model fails to capture the stochastic property. In fact, the retention time calculated using Equation (5) corresponds to the time after which the MTJ state has flipped with a probability of 63%, as pointed out in [30]. As an alternative, a statistical model derived from the switching model in the thermal-activation regime is widely used, as can be found in [18,30,31]:

$$P_{RT} = 1 - \exp\left(-\frac{RT}{\tau_0 \exp(\Delta)}\right) \tag{6}$$

where P_RT is the switching probability of a certain MTJ state due to thermal fluctuation after time RT (i.e., the confidence in the estimation of RT). Next, we model the retention time of the IM state RT_IM based on this statistical model. As illustrated in Fig. 7, the IM state takes place when some parts of the FL switch while the rest remain in their initial state. Thus, RT_IM is the time period before the magnetization of the P-state or AP-state region spontaneously flips to the opposite direction under the influence of thermal perturbation, such that the two regions merge again into a single one. In other words, RT_IM is the smaller of the retention times of the P-state region and the AP-state region.
In the above equations, ∆_P and ∆_AP are the thermal stability factors of the normal P and AP states of the MTJ, respectively. RT_IMP and RT_IMAP are the retention times of the P-state and AP-state regions in the IM state, respectively. The modeling principle for RT_IMP and RT_IMAP is based on the observation from device-level silicon measurements that ∆ scales linearly with CD (i.e., √A) when CD > 40 nm [32]. Fig. 8 shows the estimated retention time in the IM state RT_IM as a function of A_IMP. It can be seen that RT_IM increases with A_IMP until reaching a peak at A_IMP = 0.64, after which it goes down. The maximum RT_IM can be up to one day for both P_RT = 63.0% and 99.9%. However, it is still more than three orders of magnitude smaller than RT_P; note that RT_P is smaller than RT_AP due to the existence of stray field at the FL. Furthermore, the large amount of Joule heating generated under switching pulses may increase the junction temperature by more than 50 °C [33]. This will further reduce RT_IM in practice.
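The retention-time reasoning above can be sketched as follows. The statistical model is rearranged to give RT for a chosen P_RT, and each region's thermal stability is assumed to scale with the square root of its normalized area, following the stated ∆ ∝ √A principle. The specific form ∆_P·√A_IMP and ∆_AP·√(1 − A_IMP), the function names, and the ∆ values are illustrative assumptions, not the paper's fitted equations or parameters.

```python
import math

TAU_0 = 1e-9  # inverse attempt frequency in seconds (~1 ns, as in the text)


def retention_time(delta, p_rt, tau_0=TAU_0):
    """Time after which the state has flipped with probability p_rt."""
    return -tau_0 * math.exp(delta) * math.log(1.0 - p_rt)


def rt_im(a_imp, delta_p, delta_ap, p_rt):
    """Retention time of the IM state: the weaker of the two regions.

    ASSUMPTION: each region's Delta scales with sqrt(normalized area).
    """
    rt_imp = retention_time(delta_p * math.sqrt(a_imp), p_rt)
    rt_imap = retention_time(delta_ap * math.sqrt(1.0 - a_imp), p_rt)
    return min(rt_imp, rt_imap)


# Hypothetical thermal stability factors; the 4/3 ratio is chosen only so
# that the peak of this sketch lands at A_IMP = 0.64.
DELTA_P = 40.0
DELTA_AP = DELTA_P * 4.0 / 3.0

grid = [i / 100 for i in range(1, 100)]
best_a = max(grid, key=lambda a: rt_im(a, DELTA_P, DELTA_AP, 0.63))
```

In this sketch the peak sits where the two regions have equal retention time: below it the P-state region is the weaker one, above it the AP-state region is.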

Electrical Modeling of Defective MTJ Device
With the obtained physical model of the IM state, we can map it to the three key electrical parameters R, I_c, and t_w to reflect its impact on the device's electrical behavior.
As we model the IM state by splitting the FL into AP-state and P-state regions (see Fig. 7), electrons can go through either the P-state region or the AP-state region under an electric field. Therefore, the overall conductance of the IM state is the sum of the conductances of these two parallel regions:

$$G_{IM} = A_{IMP} \cdot G_P + (1 - A_{IMP}) \cdot G_{AP} \tag{10}$$

where G_P and G_AP are the conductances when the entire FL is in P and AP states, respectively. A_IMP is the normalized area of the P-state region in the IM state with respect to the entire cross-sectional area of the FL. By replacing conductance with resistance (G = 1/R) in the above equation, we can derive:

$$R_{IM} = \frac{R_P \, R_{AP}}{A_{IMP} \cdot R_{AP} + (1 - A_{IMP}) \cdot R_P} \tag{11}$$

R_P and R_AP are both dependent on the bias voltage V_MTJ applied across the MTJ device. Fig. 9a shows the measured R-V loop of MTJ C, the same device shown in Fig. 4; the red solid curves are fitting curves used to extract the exact resistance at a given bias voltage with the physical model in [12]. With R_P and R_AP extracted from measurement data at different bias voltages, we can calculate R_IM for different A_IMP values using Equation (11); the results are shown in Fig. 9b for V_p = 10 mV, 300 mV, and 700 mV.
Conventionally, the switching spectrum between P and AP states in STT-MRAMs can be divided into two regimes: 1) the precessional regime for short pulses (<∼40 ns for our devices), and 2) the thermal activation regime for long pulses [12,18]. The switching behavior in the precessional regime is dominated by the STT effect, while the thermal effect plays a major role in determining the switching behavior in the thermal activation regime. To model the switching behavior between P, AP, and a third IM state, we modify the equation of the critical switching current I_c in the STT-switching model [18].
In this equation, $\eta$ is the STT efficiency, $\alpha$ the magnetic damping constant, $e$ the elementary charge, and $\hbar$ the reduced Planck constant. The rest of the parameters have already been introduced. When $A_{IMP} = 1$ (indicating the P state), the above equation collapses to the original equation for $I_c$(P→AP). When $A_{IMP} \in (0, 1)$ (indicating the IM state), $I_c$(IM→AP) is smaller than $I_c$(P→AP), as only the P-state region in the FL needs to flip. A similar interpretation holds for IM(AP)→P switching. Note that the switching from the P or AP state to the IM state is governed by the aforementioned statistical model in Equations (2)-(4).
Furthermore, the switching time $t_w$ in the precessional regime (namely, switching driven by the STT effect) can be estimated using Sun's model as follows [12]:

Here, $C_E \approx 0.577$ is Euler's constant, $\Delta$ the thermal stability factor in P, AP, or IM depending on the switching direction, $\mu_B$ the Bohr magneton, $P$ the spin polarization, and $m$ the FL magnetic moment. $V_p$ is the bias voltage applied across the MTJ device to switch its state. $R(V_p)$ is the resistance of the MTJ device; it shows a non-linear dependence on $V_p$ (see Fig. 9a). In addition, we assume that $t_w$ obeys a normal distribution for a given $V_p$ as a model for the switching stochasticity [34].
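The parallel-conductance model for the IM-state resistance described earlier in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the resistance values are hypothetical and are not the extracted values of MTJ C. The second function inverts the model, which is the kind of calculation needed to obtain $A_{IMP}$ from a measured $R_{IM}$.

```python
def r_im(a_imp, r_p, r_ap):
    """Resistance of the IM state from the parallel-conductance model."""
    if not 0.0 <= a_imp <= 1.0:
        raise ValueError("a_imp must lie in [0, 1]")
    # G_IM = A_IMP * G_P + (1 - A_IMP) * G_AP, with G = 1/R
    g_im = a_imp / r_p + (1.0 - a_imp) / r_ap
    return 1.0 / g_im


def a_imp_from_r_im(r_im_measured, r_p, r_ap):
    """Invert the model: P-state area fraction that yields a measured R_IM."""
    g_p, g_ap = 1.0 / r_p, 1.0 / r_ap
    return (1.0 / r_im_measured - g_ap) / (g_p - g_ap)


if __name__ == "__main__":
    # Hypothetical resistances at one fixed read bias (assumed values).
    R_P, R_AP = 800.0, 1600.0
    for a in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"A_IMP={a:.2f} -> R_IM={r_im(a, R_P, R_AP):.1f} ohm")
```

Because $R_P$ and $R_{AP}$ depend on the bias voltage, both functions should be evaluated with resistances extracted at the same bias as the measurement being interpreted.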

Fitting and Model Optimization
In the third step of our device-aware defect modeling approach, fitting and model optimization can be conducted if silicon data is available. With the measured data presented in Section 3, we illustrate this step by fitting the obtained model to a specific device, MTJ C, as an example. Note that our MTJ compact model is generic, and device-to-device variations due to process variations can be modeled by assigning a Gaussian distribution to the key technology parameters of the MTJ.
First, $R_P$ and $R_{AP}$ of MTJ C can be extracted from its R-V loop, as shown in Fig. 9a. As the measured $R_{IM}$ = 1050 Ω (see Figs. 3c and 3d) and the read bias is 10 mV, we can calculate the $A_{IMP}$ value based on our model. The result is marked with the blue point ($A_{IMP}$ = 0.48) in Fig. 9b. Second, the fitting results of $P_{ST}$ and $P_{IM}$ are shown in Fig. 10. On the positive side ($V_p > 0$) for P→AP switching, $S_{lp}$ = 1e-3, $V_{pk}$ = 0.4369, and $V_{wd}$ = 0.0145. On the negative side ($V_p < 0$) for AP→P switching, $S_{lp}$ = 3.9e-4, $V_{pk}$ = -0.7096, and $V_{wd}$ = 0.0182. Third, the critical switching current $I_c$ is not directly measurable; thus, $I_c$ fitting is not applicable here. In addition, the switching time $t_w$ changes with $V_p$ as well. The fitting process and results are presented in [12] and are therefore not repeated here.
The output of device-aware defect modeling is a calibrated Verilog-A MTJ compact model. After verifying and calibrating the MTJ model in Python as presented previously, we moved this model to Verilog-A to make it compatible with circuit simulators such as Cadence Spectre, which is adopted in this article for subsequent fault modeling. Fig. 11 shows the verification results of the MTJ compact model integrating the following three variation sources affecting the switching behavior, taking P→AP switching under pulses with $t_p$ = 15 ns as an example.
• Switching stochasticity (STO): In Fig. 11a, only the switching stochasticity (cycle-to-cycle variation) is enabled, while process and temperature variations are disabled. We swept the bias voltage $V_p$ from 0.3 V to 0.5 V in 50 steps, each of which involved a 5k-cycle Monte Carlo simulation to obtain statistical switching results. It can be seen that the circuit simulation results accurately emulate the measurement and fitting results shown in the positive part of Fig. 10.
• Process variation (PV): Process variations in the MTJ's geometrical parameters (e.g., CD, $t_{FL}$, $t_{TB}$) and magnetic properties (e.g., $H_k$ and $M_s$) greatly contribute to the device-to-device variation in the switching behavior on top of the intrinsic switching stochasticity, as shown with silicon data in [35,36]. Our MTJ model takes process variation into account by assigning a Gaussian distribution to each of the above parameters. Fig. 11b shows the switching statistics with only PV enabled; we set the 3σ corner at 10% away from the average (i.e., 3σ = 0.1µ) in our simulations. One can observe that PV on this scale introduces a slightly wider distribution in both $P_{ST}$ and $P_{IM}$ than STO in Fig. 11a.
• Temperature variation (TV): The operating temperature also has a large impact on the switching behavior in STT-MRAM, as demonstrated in [36,37]. In our simulations, we took temperature variation into account by assigning a uniform distribution to the operating temperature from −40 °C to 125 °C (a typical industrial range). Fig. 11c shows the switching statistics with only TV enabled; it is clear that TV contributes as much as STO and PV to the switching variation of STT-MRAM.
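The way PV and TV enter a Monte Carlo run, as described in the list above, can be sketched as follows. This is a minimal sketch under stated assumptions, not the Verilog-A model itself: the function name, the dictionary layout, and the nominal parameter values are hypothetical; only the 3σ = 0.1µ Gaussian spread and the −40 °C to 125 °C uniform temperature range come from the text.

```python
import random
import statistics


def sample_mtj_instance(nominal, rng, three_sigma_frac=0.10,
                        temp_range_c=(-40.0, 125.0)):
    """Draw one device instance: Gaussian PV per parameter, uniform TV."""
    sigma_frac = three_sigma_frac / 3.0  # 3*sigma = 0.1*mu -> sigma = mu/30
    inst = {name: rng.gauss(mu, sigma_frac * abs(mu))
            for name, mu in nominal.items()}
    inst["temp_c"] = rng.uniform(*temp_range_c)
    return inst


if __name__ == "__main__":
    # Hypothetical nominal values, for illustration only.
    nominal = {"cd_nm": 100.0, "t_fl_nm": 2.0, "t_tb_nm": 1.0}
    rng = random.Random(0)
    draws = [sample_mtj_instance(nominal, rng) for _ in range(5000)]
    cds = [d["cd_nm"] for d in draws]
    spread = 3.0 * statistics.stdev(cds) / statistics.mean(cds)
    print(f"empirical 3-sigma/mean for CD: {spread:.3f}")
```

Each sampled instance would then be simulated over many write cycles, so that the cycle-to-cycle stochasticity (STO) is layered on top of the device-to-device and temperature variation.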
Fig. 11d shows the switching statistics combining all three sources of variation. It shows that the switching voltage $V_p$ may span more than 0.2 V from 0% to 100% switching probability; across the entire switching curve, the IM state appears with varying probability, as shown in the figure. Due to the large variation in the switching behavior, it is unwise in practice to adopt a fixed overdrive pulse amplitude and duration in order to obtain 100% switching in all cells, all cycles, and at all operating temperatures for write operations.

DEVICE-AWARE FAULT MODELING
Device-aware fault modeling consists of two sub-steps: 1) fault space definition, and 2) fault analysis. The former defines all possible faults theoretically. The latter validates realistic faults in the presence of the defect under investigation in a pre-defined fault space using SPICE-based circuit simulations. Next, we work out these two sub-steps for IM state defects in MTJ devices and compare the fault modeling results with those of the conventional resistive model. Finally, we study the distribution of the observed memory faults over write voltage and time for the purpose of test development.

Fault Space Definition
In device-aware fault modeling, we expand the fault space to cover all possible memory faults that we have observed in STT-MRAMs based on measurement data. The upgraded FP notation is $\langle S/F_n/R \rangle$, where $S$ (sensitizing sequence) remains the same as the one described in Section 4, and $F_n$ and $R$ are explained as follows.
• $F_n$ (faulty effect). $F \in \{0, 1, U, L, H\}$, where the additional states 'U', 'L', and 'H' denote undefined, extremely low, and extremely high resistive states, respectively, as have been observed in real fabricated devices [17]. In STT-MRAMs, data is stored in MTJ devices whose pre-defined resistance ranges determine the logic states '0' and '1'. Due to defects or extreme process variations, the MTJ's resistance can fall outside of these ranges, as demonstrated with the measurement data presented in [17]. The subscript 'n' specifies the nature of the faulty effect: $n \in \{p, i, t\}$, where 'p', 'i', and 't' denote permanent, intermittent, and transient faults, respectively [38]. When $n = p$, it is omitted for compatibility with the conventional notation.
• $R$ (readout value). $R \in \{0, 1, ?, -\}$, where the additional '?' denotes a random readout value in case the sensing current is very close to the sense amplifier's reference current (e.g., the cell under read is in a 'U' state).
For example, the write transition fault W0TFU = $\langle 1w0/U/- \rangle$ means that a down-transition operation ($S$ = 1w0) turns the accessed memory cell into an undefined state ($F_n$ = U) permanently; more details about the FP notation and naming scheme can be found in [13,17]. Based on the above FP definition, the entire fault space can be redefined. It can be derived that the static fault space consists of 52 single-cell faults.
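The size of the static single-cell fault space can be checked with a short enumeration. The sketch below assumes a counting rule that is not spelled out in the text: every combination of $F$ and $R$ that differs from the fault-free outcome of a static sensitizing sequence counts as one FP, with $R$ = '-' for non-read sequences and $R \in \{0, 1, ?\}$ for read sequences. The nature subscript $n$ is not enumerated. Under this assumed rule the count comes to 52, matching the number stated above. ASCII angle brackets stand in for the FP brackets.

```python
from itertools import product

STATES = ("0", "1", "U", "L", "H")   # faulty-effect values F
READOUTS = ("0", "1", "?")           # readout values for read operations
SEQUENCES = ("0", "1", "0w0", "0w1", "1w0", "1w1", "0r0", "1r1")


def expected(seq):
    """Fault-free (final state, readout) for a static sensitizing sequence."""
    if "w" in seq:
        return seq[-1], "-"      # cell holds the written value, no readout
    if "r" in seq:
        return seq[0], seq[-1]   # state preserved, expected value read out
    return seq, "-"              # state-only sequence


def static_single_cell_fps():
    """Enumerate every <S/F/R> that deviates from fault-free behavior."""
    fps = []
    for seq in SEQUENCES:
        good = expected(seq)
        readouts = READOUTS if "r" in seq else ("-",)
        for f, r in product(STATES, readouts):
            if (f, r) != good:
                fps.append(f"<{seq}/{f}/{r}>")
    return fps


if __name__ == "__main__":
    fps = static_single_cell_fps()
    print(len(fps), "static single-cell fault primitives")
```

The breakdown under this rule is 8 state faults, 16 write faults, and 28 read faults.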

Fault Analysis
After IM state defects are accurately modeled and a complete fault space is defined, the STT-MRAM netlist with and without an IM state defect can be simulated in a SPICE-compatible circuit simulator to validate the corresponding faults in the space. Our fault analysis consists of seven steps [17]: 1) circuit generation, 2) defect injection, 3) stimuli generation, 4) circuit simulation, 5) fault analysis, 6) FP identification, and 7) defect strength sweeping and repetition of steps 2 to 6 until all defects and their sizes are covered.

Simulation Setup
The simulation circuit was taken from [17] and comprises a 3×3 1T-1MTJ array and peripheral circuits (e.g., write driver and sense amplifier). All transistors in the netlist were built with the 90 nm predictive technology model (PTM) [39]. Process variations in transistors were lumped into the variation in the threshold voltage $V_{th}$, with the 3σ corners at 10% away from its nominal value. For the nine MTJ devices in the memory array, our Verilog-A MTJ compact model with CD = 100 nm was adopted; variations in MTJ performance were covered by enabling the STO, PV, and TV options in the MTJ model, as detailed in Section 5.3. The defect injection was executed by replacing the defect-free MTJ model (with only P and AP states) located in the center of the array with a defective one (with P, AP, and IM states) presented in the previous section. The defect strength was configured by assigning a floating-point number to $A_{IMP} \in (0, 1)$ as an input parameter of the Verilog-A MTJ model; it was swept from 0 to 1 in 100 steps in the simulations. The remaining eight MTJs surrounding the central one were always defect-free.
In terms of stimuli, we simulated $S \in \{0, 1, 0w0, 1w1, 0w1, 1w0, 0r0, 1r1\}$, i.e., all static operations. $V_{DD}$ was set to 1.6 V and $V_{WL}$ to 1.8 V. Note that boosting the voltage on the WL is a common practice in the MRAM community due to the source degeneration (i.e., $V_{GS} < V_{DD}$) of NMOS selectors [5,40]. The write pulse width was set to 20 ns and the read pulse width to 5 ns. Due to the large variation in the switching behavior induced by STO, PV, and TV, we conducted 2k Monte Carlo simulations for each sensitizing sequence $S$.
Since the simulation overhead is immense due to the Monte Carlo simulations (2k cycles), we performed the circuit simulations on a cluster with eight compute nodes to speed up the simulation by exploiting job-level parallelism. We first ran the simulation with a defect-free netlist. Thereafter, the whole simulation process was repeated after injecting an IM state defect with a certain $A_{IMP}$ value into the netlist. Finally, fault analysis and FP identification were conducted by comparing the simulation results of the defect-free and defective cases.
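The final comparison step can be illustrated with a small post-processing sketch. It is a hypothetical helper, not the flow of [17]: it assumes each Monte Carlo cycle has already been reduced to a (final state, readout) pair, attributes to the defect only those outcomes that never occur in the defect-free run, and labels an outcome intermittent when it occurs in some but not all cycles. Transient faults are not distinguished here.

```python
from collections import Counter


def identify_fps(seq, defect_free, defective):
    """Return (FP, occurrence rate) pairs seen only in the defective run.

    Each run is a list of (final_state, readout) tuples, one per cycle.
    """
    baseline = set(defect_free)
    counts = Counter(defective)
    total = len(defective)
    fps = []
    for (state, readout), n in sorted(counts.items()):
        if (state, readout) in baseline:
            continue  # also occurs without the defect: not attributed to it
        nature = "" if n == total else "_i"  # permanent is omitted
        fps.append((f"<{seq}/{state}{nature}/{readout}>", n / total))
    return fps


if __name__ == "__main__":
    # Made-up outcome counts for S = 0w1 over 2000 cycles (illustrative only).
    good = [("1", "-")] * 1990 + [("0", "-")] * 10
    bad = [("1", "-")] * 1500 + [("U", "-")] * 480 + [("0", "-")] * 20
    print(identify_fps("0w1", good, bad))
```

In this made-up example the failed transitions that also appear in the defect-free run are filtered out, which mirrors the exclusion of purely stochastic write failures from the fault list.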

Fault Modeling Results
Table 4 lists the fault modeling results for IM state defects. When $A_{IMP} \in [0.30, 0.61]$, two FPs were observed: $\langle 0w1/U_i/- \rangle$ and $\langle 1w0/U_i/- \rangle$. The intermittent write transition fault W1TFU$_i$ = $\langle 0w1/U_i/- \rangle$ means that an up-transition operation on a memory cell with initial state '0' transforms the memory cell into a 'U' state with a certain probability (i.e., intermittently). Similarly, the intermittent write transition fault W0TFU$_i$ = $\langle 1w0/U_i/- \rangle$ was also observed. Since these two FPs both involve the 'U' state and are intermittent, they belong to the hard-to-detect faults [17]. Their detection cannot be guaranteed by March tests and thus requires DfT solutions. Note that transition failures due to switching stochasticity are typically not considered memory faults induced by defects [11]; thus, they are excluded here.

Comparison to the Conventional Resistive Model
Fig. 12 shows a Venn diagram that compares the fault modeling results obtained with our device-aware (DA) defect model and with the conventional resistive model. Clearly, the DA model leads to two hard-to-detect faults, while the resistive model results in four easy-to-detect faults. There is no overlap between the two circles. This means that IM state defects in MTJ devices exhibit unique faulty behaviors that cannot be covered by resistor-based defect models. The two FPs sensitized using our DA model are intermittent and involve the 'U' state, which makes them hard to detect with March tests. In contrast, the resistive models resulted in only easy-to-detect faults, since the MTJ device was considered an ideal black box and thus only '0' and '1' states were observed in the simulations.

Fault Distribution vs. Write Voltage and Duration
To investigate the dependence of the observed write transition faults on write voltage and duration, we swept $V_{WL}$ from 1.4 V to 2.2 V and $t_p$ from 10 ns to 40 ns in our circuit simulations. Fig. 13 shows the simulation result statistics of $S$ = 0w1 at varying $V_{WL}$ and $t_p$ in the defect-free case.

The successful transition probability $P_{ST}$ rises from 0% (red area) to 100% (blue area) as $V_{WL}$ and $t_p$ increase. However, one can observe that the transition region occupies a large area of the contour map, which poses a big design challenge for reliable and deterministic write operations in STT-MRAMs. This clearly indicates that write schemes with a fixed configuration of write voltage and duration are unwise in practice, with four drawbacks: 1) large energy consumption, 2) long write latency (performance loss), 3) higher susceptibility to the back-hopping effect [41,42], and 4) reduced endurance or even early breakdown induced by aggressively wearing out the ultra-thin MgO tunnel barrier under a large switching current. This has led to the introduction of more flexible write schemes, such as the write-verify-write scheme by Intel [5] and the self-write-termination scheme by TSMC [43].

Fig. 14 shows the IM state statistics in $S$ = 0w1 operations at varying $V_{WL}$ and $t_p$ in the defective case ($A_{IMP}$ = 0.48 as an example). It can be seen that the IM state shows up with different probability $P_{IM}$ in a large area of the contour plot, especially in the area where $P_{ST}$ is near 50%. The closer to the top-right corner, the less likely it is to see an IM state and the more likely it is to have a successful transition. However, large $V_{WL}$ and $t_p$ incur the aforementioned four drawbacks. Hence, in practice, a trade-off has to be made, and a flexible and self-adaptive write scheme is more desirable. The simulation results for $S$ = 1w0 are similar and are therefore omitted due to space limitations.

DEVICE-AWARE TEST DEVELOPMENT
The last step of DAT is to develop appropriate test solutions for the derived faults: W1TFU$_i$ and W0TFU$_i$. In this section, we first explain the test philosophy. Thereafter, a test solution with weak write operations is introduced. Its circuit implementation is also presented and discussed.

Test Philosophy
To detect IM state defects, the following two key steps are crucial: 1) fault sensitization, and 2) fault detection. The former forces a defective MTJ into the IM state so that it exhibits faulty behavior, whereas the latter distinguishes it from the normal memory behavior. Fig. 15a illustrates the energy barrier diagram of a defect-free MTJ with bi-stable AP and P states. The energy barrier in AP→P switching is larger than that of the opposite switching direction, due to the existence of the stray field, which favors the AP state. Fig. 15b illustrates the energy barrier diagram of a defective MTJ with AP, P, and IM states. As already discussed in previous sections, the IM state can be set by write operations with a certain occurrence probability $P_{IM}$; the peak of $P_{IM}$ occurs at the bias voltage where $P_{ST} \approx 0.5$ (see Fig. 4). Once the IM state is set, the device may stay in the IM state without external interference for a certain period of time (i.e., the retention time of the IM state) or fall back to the AP or P state in an accelerated process under external interference. This is because the energy barrier from IM to P (or AP) is much smaller than that between the P and AP states, as illustrated by the height of the two-way arrows in Fig. 15b. Hence, to distinguish the IM state from the P and AP states, a feasible solution is to provide sufficient external energy to push a device in the IM state back to the P (or AP) state while avoiding disturbing devices in the AP (or P) state.

Typically, there are three main sources of external energy that can be provided to affect the thermal stability factor $\Delta$ of the MTJ: thermal energy, reflected as temperature ($T$), electric current ($I$), and magnetic field ($H$). The quantitative correlation between these three variables and $\Delta$ can be approximately expressed as follows [18,44]:

First, the above equation indicates that $\Delta$ can be reduced by heating up the MTJ devices (i.e., burn-in test). The elevated temperature leads to an increase in thermal perturbation, which in turn increases the chance of a spontaneous flip from one state to another. Although this approach is effective in kicking an MTJ device out of the IM state, the switching direction (i.e., IM→P or IM→AP) is not controllable. Thus, burn-in test is an unsuitable approach to detect IM state defects. Second, applying an electric current $I$ through the MTJ is also an approach to reduce $\Delta$, due to its Joule heating effect. After being spin-polarized, the current is also used to switch the magnetization in the FL. More importantly, current-induced switching is bipolar, meaning that the switching direction is controlled by the current direction. Third, an external magnetic field $H$ has a large influence on $\Delta$. It is widely used in the characterization test of MRAM and serves as the write method in the first generation of MRAM technology, also referred to as Toggle MRAM. Field-induced switching is also bipolar, as the direction of $H$ determines the switching direction of the magnetization in the FL.

In summary, the detection of IM state defects can be achieved by applying a weak write current or field, which provides moderate energy to push a defective MTJ out of its IM state without disturbing the bi-stable P and AP states of defect-free MTJs. Next, we elaborate on the test process with weak write operations.
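To make the qualitative discussion of $T$, $I$, and $H$ concrete, the sketch below evaluates one commonly used approximate form, $\Delta = \frac{E_B}{k_B T}\,(1 - I/I_c)\,(1 - H/H_k)^2$. This specific expression is an assumption chosen for illustration and may differ from the expression the text refers to; the function name, the normalized arguments, and the nominal $\Delta$ = 60 are likewise hypothetical.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K


def thermal_stability(e_b, temp_k, i_ratio=0.0, h_ratio=0.0):
    """Assumed form: Delta = E_B/(k_B*T) * (1 - I/I_c) * (1 - H/H_k)**2.

    i_ratio = I/I_c and h_ratio = H/H_k are taken along the direction that
    destabilizes the current state and must lie in [0, 1).
    """
    if not (0.0 <= i_ratio < 1.0 and 0.0 <= h_ratio < 1.0):
        raise ValueError("ratios must lie in [0, 1)")
    return e_b / (K_B * temp_k) * (1.0 - i_ratio) * (1.0 - h_ratio) ** 2


if __name__ == "__main__":
    e_b = 60.0 * K_B * 300.0  # hypothetical barrier giving Delta = 60 at 300 K
    print("300 K, no bias  :", round(thermal_stability(e_b, 300.0), 2))
    print("400 K, no bias  :", round(thermal_stability(e_b, 400.0), 2))
    print("300 K, I = Ic/2 :", round(thermal_stability(e_b, 300.0, i_ratio=0.5), 2))
    print("300 K, H = Hk/2 :", round(thermal_stability(e_b, 300.0, h_ratio=0.5), 2))
```

Under this assumed form, temperature lowers the barrier of every state alike, whereas the sign of $I$ or $H$ selects which state is destabilized, which is consistent with the argument that only current and field offer a controllable switching direction.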
The march test consists of three march elements: (w0), (w1, r1), and ($\hat{w}0$, r1), where $\hat{w}0$ denotes a weak write '0' operation. The first march element (w0) initializes all memory cells to state '0' in normal mode. The second march element is composed of two operations in normal mode; the first one is an up-transition write and the second one is a read. For a defect-free MTJ, the MTJ state switches from '0' to '1' as intended and the readout is logic '1'. Note that we do not take into account failed transitions caused by the switching stochasticity, since they can be mitigated by circuit-level designs such as write-verify-write, as mentioned previously. For a defective MTJ with an IM state, the w1 operation may result in a transition to the '1' (AP) or 'U' (IM) state. If the device ends up in the 'U' state, the readout value can be random ('?'), i.e., sometimes '0' and sometimes '1', unpredictably. The third march element consists of a weak down-transition operation in DfT mode and a read operation in normal mode. The weak write operation can be implemented as a relatively weak current ($\hat{w}0$) or field ($\hat{w}0_H$) with reduced amplitude or duration in comparison to normal write operations. The weak write operation induces an IM→P transition, while it is not strong enough to change the AP state. As a result, the readout is expected to be logic '1' for the MTJs that are in the AP state before the weak write. However, the readout of those MTJs that are in the IM state before the weak write is logic '0'. Since the occurrence of the IM state is probabilistic, this test cannot guarantee the detection of IM state defects with a single shot. To increase the detection confidence, the above march test can be repeated a certain number of times.
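The march test described above can be mimicked with a small behavioral model. The sketch below is a simplified illustration under several assumptions that are not from the text: the class and function names are hypothetical, only the up-transition can land in the IM state (with a fixed probability `p_im`), the weak write always moves IM to P and never disturbs AP, and repetitions are statistically independent. The helper `repetitions_needed` then follows from requiring $1 - (1 - p)^N$ to reach a target confidence.

```python
import math
import random


class Cell:
    """Behavioral MTJ cell with states 'P' (0), 'AP' (1), and 'IM' (U)."""

    def __init__(self, p_im=0.0, rng=None):
        self.p_im = p_im  # chance that a w1 ends in the IM state (0 if defect-free)
        self.rng = rng or random.Random()
        self.state = "P"

    def write(self, value):
        if value == 1 and self.rng.random() < self.p_im:
            self.state = "IM"
        else:
            self.state = "AP" if value else "P"

    def weak_write0(self):
        if self.state == "IM":  # enough energy for IM -> P only
            self.state = "P"

    def read(self):
        if self.state == "IM":
            return self.rng.choice((0, 1))  # random readout '?'
        return 1 if self.state == "AP" else 0


def march_detects(cell):
    """Run (w0); (w1, r1); (weak w0, r1). True if the last read returns 0."""
    cell.write(0)
    cell.write(1)
    cell.read()  # unreliable for a cell in IM, so not used as the verdict here
    cell.weak_write0()
    return cell.read() == 0


def repetitions_needed(p_im, confidence):
    """Smallest N with 1 - (1 - p_im)**N >= confidence."""
    if not (0.0 < p_im <= 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("need 0 < p_im <= 1 and 0 < confidence < 1")
    if p_im == 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_im))


if __name__ == "__main__":
    rng = random.Random(42)
    hits = sum(march_detects(Cell(p_im=0.25, rng=rng)) for _ in range(10000))
    print("single-shot detection rate:", hits / 10000)
    print("repetitions for 99% confidence:", repetitions_needed(0.25, 0.99))
```

In a real device the per-shot sensitization probability depends on the write voltage, pulse width, and defect strength, so `p_im` would have to be taken from measurement or circuit simulation.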
The implementation of weak write operations requires dedicated DfT. Since STT-MRAM exploits an electric current for w0 and w1 operations in normal mode, adding a DfT circuit to the write drivers to tune the write voltage or duration provides a feasible solution with minimal area overhead. For example, if a weak write voltage on the WL ($\hat{V}_{WL}$) is utilized for the DfT circuit, it has to meet the following requirement: $V_{WL}(P_{SIM} = 1) < \hat{V}_{WL} < V_{WL}(P_{ST} = 0)$, where $P_{SIM}$ is the switching probability from the IM state to either the P or AP state and $P_{ST}$ is the switching probability between the P and AP states. This ensures that defective memory cells are detected while defect-free ones are not overkilled. Given this consideration, $\hat{V}_{WL}$ can be set to a point on the black curve in the bottom-left corner of Fig. 13; it marks the boundary of the area where $P_{ST} = 0$. Hamdioui et al. [45] proposed a programmable DfT scheme for weak write operations to detect open defects in RRAMs; this DfT scheme can also be adopted here to configure the weak write operations for STT-MRAMs. In addition, Naik et al. [37] proposed an internal bias control design for setting optimal write bias voltages in STT-MRAM in order to adapt to different operating temperatures. This bias control design for normal write operations can also be reused to select $\hat{V}_{WL}$ in DfT mode.
We implemented the above March test and verified the design based on circuit simulations. Fig. 17 shows the waveforms of five key signals in both the defect-free and defective cases. First, both the defect-free and defective MTJs are initialized to state '0' (P), as shown by the MTJ resistance ($R_{MTJ}$) waveform. The normal w1 operation turns the defect-free MTJ into the AP state as intended and the defective MTJ into the IM state (sensitizing the W1TFU$_i$ fault). Note that $V_{DD}$ = 1.6 V, whereas $V_{WL\_en}$ and $V_{WL}$ are both boosted to 1.8 V. Next, the r1 operation reads out the MTJ state on the signal $V_{out}$. The readout of the IM state is unpredictable; on the waveform, it outputs a fake '1'. The third operation is a weak write '0' operation $\hat{w}0$ in DfT mode, with $V_{WL}$ lowered to 1.4 V and $t_p$ unchanged at 20 ns. It switches the defective MTJ from the IM state to the P state, while the defect-free MTJ remains in the AP state, as the provided energy is not high enough to invoke a full transition from the AP state to the P state. The last r1 operation detects the IM state defect, since the defective MTJ outputs a '0' while the defect-free one outputs a '1', as illustrated in the figure.

CONCLUSION
This paper presents a comprehensive characterization of IM state defects in STT-MRAM devices. The occurrence probability of the IM state depends on the switching direction, device size, bias voltage, and FL thickness. The paper also demonstrates that the traditional fault modeling and test approach based on linear resistors fails to accurately model the functional behavior of this defect; hence, it fails to detect such a defect during manufacturing tests. The use of device-aware test suggests that an IM state defect leads to intermittent write transition faults. To detect them, we propose and implement a test solution based on weak write operations.
Emerging memory technologies such as STT-MRAM, RRAM, and PCM require unique manufacturing steps, which could cause unique defects. These may not be detected by traditional memory tests, nor can they be modeled with traditional fault modeling approaches. This calls for a better understanding of new defect mechanisms and for better fault modeling and test approaches such as device-aware test.

Fig. 6. Resistive models for MTJ-internal defects in the conventional test.

Fig. 7. MTJ schematics with both P-state and AP-state regions in the FL simultaneously.

Fig. 9. (a) R-V loop experimental data vs. fitting curves to extract $R_P$ and $R_{AP}$ at varying voltage, (b) $R_{IM}$ vs. $A_{IMP}$ with respect to three voltages.

Fig. 10. Curve fitting of $P_{ST}$ and $P_{IM}$ to measurement data.

Fig. 11. Verification of the Verilog-A MTJ compact model with Cadence Spectre: (a) switching stochasticity (STO) enabled only, (b) process variation (PV) enabled only, (c) temperature variation (TV) enabled only, and (d) all three sources of variation enabled simultaneously.

Fig. 12. Comparison of sensitized fault primitives using the device-aware defect model (left) and the conventional resistive model (right).

Fig. 13. Successful transition probability $P_{ST}$ statistics in 0w1 operations at varying WL voltage $V_{WL}$ and pulse width $t_p$ in the defect-free case.

Fig. 15. Comparison of energy barriers between: (a) a defect-free MTJ with bi-stable AP and P states and (b) a defective MTJ with AP, P, and IM states.

TABLE 2
Related work on IM state defects in MTJ devices in the literature.
Multi-domain structure of the FL induced by the dipole field and large device size
RT IM = RT P /RT AP
Skyrmions formation in MTJs without the DMI effect

TABLE 3
Static fault modeling results for IM state defects using resistive models.

TABLE 4
Fault modeling results of IM state defects using our device-aware (DA) defect model.