Predictions and Uncertainty Estimates of Reactor Pressure Vessel Steel Embrittlement Using Machine Learning

An essential aspect of extending safe operation of the active nuclear reactors is understanding and predicting the embrittlement that occurs in the steels that make up the Reactor pressure vessel (RPV). In this work we integrate state of the art machine learning methods using ensembles of neural networks with unprecedented data collection and integration to develop a new model for RPV steel embrittlement. The new model has multiple improvements over previous machine learning and hand-tuned efforts, including greater accuracy (e.g., at high-fluence relevant for extending the life of present reactors), wider domain of applicability (e.g., including a wide-range of compositions), uncertainty quantification, and online accessibility for easy use by the community. These improvements provide a model with significant new capabilities, including the ability to easily and accurately explore compositions, flux, and fluence effects on RPV steel embrittlement for the first time. Furthermore, our detailed comparisons show our approach improves on the leading American Society for Testing and Materials (ASTM) E900-15 standard model for RPV embrittlement on every metric we assessed, demonstrating the efficacy of machine learning approaches for this type of highly demanding materials property prediction.


Introduction:
Nuclear power is a key component of global clean energy production, producing roughly 20% of the power in the US and 25% of the power in the EU as of 2021. For perspective, as of 2019 this represents approximately half of the low-carbon electricity generation in the EU. 1,2 Given the urgent need to shift from fossil fuels to green energy sources to mitigate the numerous negative effects of global climate change, 3 the role of nuclear energy as a source of clean electricity will be even more important in the future. In the near term, nuclear energy's contribution will depend on life extension of the existing fleet of reactors.
A key component of light water nuclear reactors (LWRs) is the massive heavy section reactor pressure vessel (RPV), which houses the nuclear fuel core and pressurized reactor coolant. The RPV is fabricated from low-alloy steels. Since RPVs are individually custom-built, they contain a wide range of Cu, Ni, Mn, P, Si, C and other dissolved solutes. During reactor operation, the neutron flux leaking from the reactor core impinges the walls of the RPV, resulting in embrittlement of the steel, manifested as an upward shift in the ductile to brittle transition temperature (TTS). Embrittlement is mainly the result of radiation-enhanced formation of Curich and Mn-Ni-Si-rich precipitates (CRPs and MNPs). [4][5][6] The precipitates impede dislocation glide, resulting in irradiation hardening manifested as an increase in the yield stress (∆ y), which in turn results in an increase of the TTS. Replacement of the RPV is not feasible. Thus, it is essential to understand, and accurately predict, how the RPV steels embrittle as a function of the neutron fluence (with typical units of neutrons n/cm 2 ), which is the corresponding flux (n/cm 2 -s) integrated over time, up to 80 years, or more, of extended reactor life. Note, a detailed review of embrittlement mechanisms and associated predictive TTS models can be found in the review from Odette et al., 6 hence, there is no attempt to cover the many details therein in this machine learning focused paper.
There are two main sources of embrittlement data. The first and most relevant is from plant-specific surveillance programs. A set of surveillance capsules, mounted on the inner vessel wall, contain test samples what were believed to be the most embrittlement-sensitive welds and base metals in the corresponding RPV. The capsules, which contain tensile and Charpy V-notch impact test specimens (and in some cases fracture toughness specimens), are periodically removed and tested to monitor embrittlement as a function of fluence. The fluence of these specimens closely tracks, though is modestly higher than, the exposure of the vessel itself. The corresponding tensile tests are used to track the evolution of the yield stress, ultimate tensile strength and ductility of the material. Impact tests, which measure the energy absorbed in breaking Charpy V-notch specimens as a function of temperature, are used to evaluate the TTS.
The transition temperature is typically indexed at an absorbed energy of 41J. [5][6][7] The other source of embrittlement data is accelerated irradiations in test reactors, generally at significantly higher fluxes than in the surveillance database. Thus, the potential effect of flux must be considered in the interpretation and use of test reactor embrittlement data.
A number of physically-motivated, semiempirical models have been developed to predict TTS based on surveillance data. 6,8 The three most commonly used models are the E900 model (which is the ASTM standard), 9 the Eason-Odette-Nanstad-Yamamoto (EONY) model, 4 and the recent Odette-Wells-Almirall-Yamamoto (OWAY) model. 6 The E900 and EONY models consist of physically-motivated analytical functions with parameters that were fitted to surveillance data as new data became available and understanding of the underlying physics matured. The OWAY model uses surveillance data, or TTS predictions, at intermediate fluence combined with a set of slightly flux-adjusted test data at fluence somewhat greater than the peak at 80-year life, to interpolate the TTS in between low and high fluence.
Distinct from these physics-informed semiempirical models of RPV steel embrittlement are fully data-driven approaches which leverage machine learning (ML) techniques. The use of ML techniques in materials science has exploded in recent years. [10][11][12][13][14] ML models differ from nondata-driven approaches because they do not rely on fitting multiple parameters for complex, ideally physical, models to experimental data. Thus, ML can significantly reduce the time and materials domain-specific expertise needed to extract empirical information. In addition, many ML models allow for determination of uncertainty estimates on new predictions, and can provide additional insight on the effects of variable combinations not represented in current physicsbased models. However, the RPV ML models generally have little or no physical constraints and can be difficult to interpret (an issue sometimes captured by referring to the models as being a "black-box"), which can lead to lack of trust in predictions. To balance the lack of physical constraints and black box nature of ML models, it is necessary for them to be rigorously assessed to verify that the model produces acceptably accurate predictions when data is available for comparison, that the predictions follow known physical trends, and that the model uncertainty estimates are trustworthy. The development and validation of accurate ML models to predict RPV embrittlement would be especially useful for informing 80-to-100-year life extension conditions, associated with low flux (e.g., 3×10 10 n/cm 2 -s) and relatively high fluence (e.g., approximately 10 20 n/cm 2 for 100-year life extension) conditions, which are sparsely available, potentially requiring extrapolation from accelerated high flux, relatively high fluence experiments. It is worth noting that the exact fluence incurred by a reactor at 100-year life extension will vary based on the reactor type and operational conditions and may vary in the approximate range of mid-10 19 to 10 20 n/cm 2 . We use the value of 10 20 n/cm 2 as an approximate upper bound.
There have been a number of data-centric ML studies aimed at predicting embrittlement of irradiated steel alloys over the past 15 years. These studies include prediction of irradiated yield stress and TTS of reduced activation ferritic/martensitic (RAFM) alloys for use as materials in high dose applications like fusion plants, [15][16][17][18] as well as ∆ y and TTS prediction for RPV steel alloys. [19][20][21][22][23][24][25][26][27][28] A detailed summary of these previous studies is given in the recent review from Morgan et al. 29 Here, we summarize key findings of the three studies most relevant for the present work. First, the work of Liu et al. 24 used kernel ridge regression models to predict ∆ y for the Irradiation Variable (IVAR), Belgian Reactor 2 (BR2) and Advanced Test Reactor 1 (ATR1) data comprising 1501 data entries, which they refer to as "IVAR+" in their work, and which is included in what we refer to as the "UCSB" dataset in this work (see Section 2 for more information on the datasets used in this work). They fit ∆ y to a 9-element feature vector consisting of atomic fractions of Cu, Ni, Mn, Si, P, and C, temperature, flux and effective fluence and performed a suite of detailed random and targeted grouped cross validation (CV) tests. From random 5-fold CV, they obtained a low root-mean-squared error (RMSE) of 14.7 MPa, and RMSE of at most 25 Based on the results of Liu et al., 24 Ferreño et al., 25 Mathew et al., 22 and other studies, [19][20][21]23,26,30 a variety of ML (e.g., neural network, gradient boosting, kernel ridge regression, depending on the study) models have been successfully fit to ∆ y and TTS RPV embrittlement data. These fits have been conducted on various sets of RPV embrittlement data both from test reactors (IVAR, BR2, ATR1) and surveillance data (PLOTTER-15), as well as the test reactor RADAMO database 20 and some combination of test reactor and surveillance data, 22  Ferreño et al. 25 explicitly pointed out that one shortcoming of their model was the lack of low flux, high fluence data in the PLOTTER-15 database, potentially limiting the ability of the ML model to accurately extrapolate to life extension conditions. In addition, there has been very little work devoted to quantitatively assessing the accuracy of ML model uncertainty estimates (i.e., error bars on predictions) specifically for the application of predicting RPV steel embrittlement.
Having an ML model with well-validated uncertainty estimates would foster greater confidence in new predictions, such as those for low flux-high fluence conditions. We also note that models to date have fit to a range of databases, but none have been fit to all of the available databases (e.g., no single paper uses even the union of the databases that can be found in these papers), which we attempt to do here. Finally, at present there is no accessible ML model that can be easily used by the broader community. Such access is important not only to encourage rigorous testing by domain experts, but for the continual improvement of the model as new data and ML techniques become available.
In this study, we work to address all of the issues raised above. We do so by (1)

Data and Methods:
Here, we describe the sources of RPV embrittlement data used in this work. The first source contains hardening data comprising the Irradiation Variables (IVAR), Belgian Reactor (BR2), and Advanced Test Reactor 1 and 2 (ATR1 and ATR2) data obtained by researchers at the University of California Santa Barbara, which we collectively refer to as the "UCSB" database throughout this work. 32 A detailed description of this database was provided in the previous work of Liu et al., 24 and here the database has been updated to include the latest ATR2 data, resulting in a total of 1556 data entries. The UCSB database includes the composition information of Cu, Mn, Ni, P, Si, and C for each RPV steel alloy, together with its flux, fluence, product form (i.e., plate, weld, forging, or standard reference material (SRM)) and measured hardening ∆ y. More details regarding the UCSB database can be found in the works of Liu et al. 24,28 Figure 1 shows the ranges of flux and fluence included in the UCSB database, where the pink, blue, green, and red points denote data for the ATR1, ATR2, BR2 and IVAR data groups, respectively. The ATR1 data consists of very high flux and fluence (2×10 14 n/cm 2 -s and 10 21 n/cm 2 , respectively), while the ATR2 data consists of high fluence of ≈1.4×10 20 n/cm 2 but at lower fluxes of ≈3.7×10 12 n/cm 2s. While this fluence level is consistent with fluence of 80 to 100-year life extension in some PWRs, the flux level is still about two orders of magnitude higher than a typical commercial RPV. We note that, due to their extremely high flux and fluence values relative to the rest of the database, the ATR1 points are left out of the model fit, reducing the database by only 6 data points, but resulting in improved model fits in the flux-fluence range of interest. The BR2 data have flux levels that are high, lying between the ATR1 and ATR2 data. For BR2, the fluence levels range from 10 19 n/cm 2 to slightly above 10 20 n/cm 2 . The IVAR data are at consistently lower flux and fluence levels, with fluxes ranging from 7×10 10 n/cm 2 -s to ≈10 12 n/cm 2 -s, while the fluence levels range from very low (below 10 17 n/cm 2 ) up to ≈3×10 19 n/cm 2 -s. A clear limitation of the test reactor data is the lack of low flux, high fluence data approaching the 40 to 100-year life extension conditions. A summary of detailed statistics of the UCSB database, split out by data group, is provided in Table S1 of the SI.  Table S1 of the SI.
We also include additional test reactor data from the RADAMO database (black points in Figure 1) and test reactor data present in the PLOTTER-15 database, which we denote as "PLOTTER-15 MTR" (orange points in Figure 1). Both the RADAMO and PLOTTER-15 MTR data tend to have higher fluence levels consistent with the upper end of the PLOTTER-22 and IVAR fluence levels, but again at higher flux levels more consistent with the BR2 and ATR2 test reactor data. The RADAMO hardening data takes the form of ∆ y from tensile tests, while all but 15 of the PLOTTER-15 MTR points are TTS from Charpy impact tests.
As discussed above, the UCSB database, RADAMO database, and a small portion of the PLOTTER-15 MTR database contains hardening in the form of ∆ y values for each RPV data entry, while the PLOTTER-22 and nearly all of the PLOTTER-15 MTR databases contain TTS values. Therefore, in order to combine these databases for the purposes of constructing ML models of TTS, we need to convert the ∆ y values into TTS values. We use TTS values in this work because TTS is used in the ASTM E900-15 and EONY models, and is the basis for embrittlement regulations. The relationship between ∆ y and TTS is dependent on alloy composition as well as flux and fluence. Detailed models to predict the conversion factor for a given alloy and irradiation condition are presently being developed. 34 The previous work of Mathew et al. 22 made the assumption that there is a linear relationship between ∆ y and TTS as: TTS = cc×∆ y , where ∆ y has units of MPa, and TTS has units of ˚C, and cc is the conversion coefficient. Mathew et al. used a cc value of 0.6 ˚C/MPa. 22 While they did not cite a reference for this value, our independent analysis of the literature, described below, yielded a value of cc = 0.61 ˚C/MPa. In this work, data of instances where both ∆ y and TTS were obtained for the same material at a given irradiation condition were collected from the literature. A total of 311 data points were collected from the works of Odette et al., 35 Lee et al., 36 and Nanstad et al. 37 The WebPlotDigitizer tool was used to extract the data available only in graphical form. Of the 311 total collected data points, 8 were flagged and removed as outliers based on their ratio of TTS/∆ y either being negative or greater than 10 (suggesting totally unphysical behavior), leaving 303 points. These points, which are plotted along with a quadratic fit, are shown in Figure 2. We used this fitted quadratic form to convert all of the ∆ y values in the various databases to TTS values. All ML model fits and assessments in this work were performed using the MAterials Simulation Toolkit for Machine Learning (MAST-ML). 38 Briefly, MAST-ML simplifies and accelerates the use of machine learning models in materials science, enabling users to quickly perform assessments of multiple models and cross validation tests, with codified analysis data and figure output. MAST-ML leverages widely used ML libraries like scikit-learn 39 and Keras 40 (Tensorflow 41 backend). To evaluate our ML models, we perform three types of tests. The first test is referred to as a "full fit", which fits to all of the data. A full fit is useful for deployment of a final model, however, due to the potential for overfitting, full fit errors may not represent the typical model error on new test data. The second test is random cross validation, where random subsets of the data (here, 20%) are used as test data while the remaining 80% is used as training.
This test provides a reliable assessment of model performance on test data that is similar to (i.e., within the domain) of the training data. The third test is leave out group cross validation, where the test data is an entire distinct group of data (e.g., PLOTTER-22 surveillance data), and the training data is everything else (e.g., UCSB, RADAMO, PLOTTER-15 MTR). This test is more demanding than random CV and is meant to mimic how the model may perform when given test data that is significantly different (i.e., partially or totally outside the domain) from the training data. MAST-ML also includes a robust uncertainty quantification framework with error bar recalibration using the methods of Palmer et al. 31 In this work, we focus on using an ensemble of fully-connected neural networks as our model. We construct an ensemble of 10 fully connected neural networks using the EnsembleModel class in MAST-ML, which builds upon scikit-learn's BaggingRegressor class. Initial tests of 1, 2, 5, 10, and 15 networks in the ensemble saw saturation of performance by 10 networks. For all fits, the features used are Cu, Ni, Mn, P, Si and C content (in weight percent), temperature, and the base-10 logarithm of flux and fluence, for a total of 9 features. For PLOTTER-22 entries that contain blank entries for Si or C data, values of 0 wt% were used. In Figure S1 of the SI, we plot the true (i.e., measured) TTS values as a function of each feature.
Each network in the ensemble was fit using all 9 features in the dataset, and the different networks are fit on bootstrapped samples of the training data. Approximately 25 different model architectures were manually constructed and tested to find a model which resulted in low 5-fold CV RMSE while not being too computationally expensive. This search of model architecture is by no means exhaustive, and while a more ideal model architecture may exist, the present model was found to give accurate full fit and 5-fold CV predictions, have well-calibrated uncertainty estimates, and demonstrate reasonable prediction trends with known physics. The neural network used in the present work was built using Keras, run in the MAST-ML package, and consists of a sequential (fully-connected) neural network with 5 layers. The first layer has an input dimension of 9 and is a dense layer with 1024 nodes. The second layer is a dropout layer with dropout rate equal to 0.3. The third layer is another 1024-node dense layer. The fourth layer is another dropout layer with dropout rate of 0.3. The fifth layer is the single-node prediction layer.
The rectified linear activation function was used throughout, and the model was optimized using the Adam optimizer with mean squared error as the metric.
Similar to work from Liu et al., 24 we added artificial data points to our fits that, for each alloy, correspond to a TTS = 0 ˚C value for fluence of 10 15 and 10 16 n/cm 2 at fluxes of 10 8 , 10 9 , 3. Results and Discussion:

Machine learning fits on individual databases
In this section, we first evaluate the performance of our ML model for predicting TTS for the UCSB and PLOTTER databases individually. We focus on these two databases for several reasons. First, UCSB and PLOTTER represent large test reactor and surveillance databases, respectively, allowing comparison across these two different classes of data. Second, within our sets of test reactor data, UCSB was based on controlled single and combined variable experiments resulting in a consistent, high-resolution map of embrittlement dependencies, covering a very wide range of alloy and irradiation conditions. This consistency minimizes contributions from non-physical factors in assessing the model. Finally, the ML predictions on these two data sets can be readily compared to previous research. Then, in Section 3.2, we assess performance of our ML model on the combined database. Here, we evaluate the performance of our ML model based on full fits to all of the data, and random 5-fold CV. Figure 3 contains parity plots assessing these fits, and Table 1 contains a summary of the per-group RMSE values for all fits. Figure 3A and Figure 3B show  Figure 3C and Figure 3D show

Machine learning fits on combined databases
Here, we assess the quality of our ML model when the UCSB and PLOTTER databases are joined together into a single database, with the RADAMO and PLOTTER-15 MTR data included as well (see Section 2 for details on how the databases were joined). Figure 4 contains parity plots assessing our ML model on a full fit to the entire database ( Figure 4A), random 5-fold CV ( Figure   4B), and the more demanding leave out group CV ( Figure 4C).  Figure 3, which is reasonable considering the higher error on the PLOTTER-15 MTR subset and slightly higher BR2 errors in the combined database fit. Similarly, the 5-fold CV RMSE of 12.2 +/-0.7 ˚C as shown in Figure 4B is within the error bars of the individual PLOTTER database 5-fold CV RMSE value and slightly higher than the individual UCSB database 5-fold CV RMSE value. The leave out group CV test in Figure 4C shows a significantly higher RMSE of 29.3 ˚C (note this is averaged over each left-out group).
Interestingly, the PLOTTER data is still predicted quite well for this test, with an RMSE of 18.2˚C, an impressive result considering no PLOTTER data was used in training for that data split. In this case, we speculate the robustness of our model predictions on PLOTTER data is due to the inclusion of the IVAR data, which covers similar, though smaller, flux and fluence ranges compared to PLOTTER. We discuss the merits of our ML model on high fluence data in Section 3.3. In Figure S2 of the SI, we plot the TTS residuals as a function of each input feature. Generally, we find that the residuals are highly symmetric about zero, and there is no appreciable trend in the residuals with any feature. This is an important aspect of the ML model predictions, as any trend of the residuals with, for example, fluence, would constitute undesirable behavior of the model.

Assessments of model predictions for high fluence data
Developing ML models to accurately predict high fluence, extended-life embrittlement is a primary objective of this study. In this section, we discuss the ability of our ML model to predict TTS of high fluence embrittlement. Here, we again evaluate our ML model for full and random 5fold CV fits. We then assess the role of low fluence data on the high fluence predictions. This involves comparing fits to only the high fluence data subset to fits for all fluences. We define "high fluence" as data at, or above, a fluence of 6×10 19 n/cm 2 ; all other data is categorized as "low fluence".    Table 2).  Table 3 and Figure 6C). Overall, the above results show that our ML model generally performs better than the E900 model on the target RMSE metrics for PLOTTER. We note here that while we focus on reporting and comparing RMSE values in our analysis, the mean absolute error (MAE) is also important, and we provide a table of MAE values for our model and others in the literature in  Figure 4, where an RMSE of 22.0 ˚C was obtained on IVAR data when the PLOTTER data was used as training data.
The E900 model shows significantly higher errors on the BR2 and ATR2 data groups, suggesting that E900 may not accurately predict TTS for combined high flux and high fluence conditions.
Again, this is not unexpected given the flux and fluence conditions used in the E900 training data.
These results suggest that the ML model has a much wider domain of applicability than E900. We have also plotted the full residual histograms of our ML model and the E900 and EONY models in Figure 6D. Note that the data shown in Figure 6D are for PLOTTER data only. Another histogram showing the distribution of residuals of the high fluence PLOTTER data only is shown in Figure S3 of the SI. Regardless if we examine the residual distributions on all PLOTTER data or just the high fluence portion of PLOTTER data, our ML model has both the smallest mean error (i.e., smallest bias) and the smallest standard deviation of residuals. In addition, our ML model is fit to far more data, which, as discussed above, gives it a larger domain of applicability than E900.
Finally, we discuss the advantages and shortcomings of our ML model vs. the E900 model on the merits of (i) model errors, applicability domain and uncertainty estimates, (ii) prediction of physical trends, (iii) model interpretability, (iv) model evolution and re-use. Regarding (i) (model errors), as discussed above, compared to E900, our ML model generally has lower errors across all databases, and, importantly, the high fluence portion of the PLOTTER database. The level of improvement of our ML model over E900 on the total and high fluence PLOTTER database is notable, with a maximum of 1.8 ˚C and 2.6 ˚C (13.2 % and 14.5 % reductions), respectively. In addition, we found modest performance improvement for our ML model vs. E900 for data outside of PLOTTER when our ML model did not have access to that data. Therefore, we have some evidence that the improved performance of our ML model on a wider domain is primarily due to its having access to more data for fitting, although it is possible that the more flexible ML models do offer advantages over E900 in fitting a larger database. An important limitation of the E900 model is that it does not provide an estimate of uncertainty (i.e., an error bar) with a given prediction. As described in Section 3.5, a key advantage of using an ensemble of neural networks is it provides well-calibrated error bars on our predictions. Regarding (ii) (prediction of physical trends), as described in Section 3.6, our ML model shows no signs of unphysical behavior or problematic overfitting. This behavior is both a strength and weakness of our ML model vs. E900.
It is a strength because our ML model appears able to capture the detailed known physical trends without assuming an explicit functional form, thus capturing correct embrittlement physics without needing deep domain expertise. However, since no physics-based functional form is used for our ML model, detailed testing of physical trend assessment is required to foster confidence in the model. Regarding (iii) (model interpretability), in favor of E900, it should be noted that it has far fewer parameters (26 + explicit functional form vs. ~1M for a single neural network in the ensemble). While there is no evidence that the extra parameters in the ML are introducing problematic overfitting, having fewer parameters is generally desirable. Furthermore, E900 has a simple explicit analytical form, while the functional behavior of the neural networks used in the ML model cannot be understood intuitively and only probed indirectly through methods such as cross plots. Regarding (iv) (model evolution and re-use), our ML model takes advantage of modern ML tools and is therefore easy to fit and update (i.e., re-fit) as new data becomes available. This stands in contrast with the E900 model, which requires significant manual retuning of the functional form and its underlying parameters and significant domain expertise to do so. Putting all these aspects together suggests the following. First, our and similar ML models are likely to provide notable improvements in performance vs. E900 within the domain of PLOTTER and likely even life-extension conditions on existing plants, but at the cost of increased complexity and decreased transparency. The greatest advantages of using ML models may come from the ability to quickly fit and improve them, particularly when integrating large amounts of data, and to take advantage of ML-associated technologies like the error estimation methods we have employed and discuss in Section 3.5.    Figure 7A, we plot the distribution of normalized residuals, defined as the residual value divided by the corresponding model uncertainty estimate (sometimes referred to as an "r-statistic" 31,43 or a "Z-score" 44 ). A wellcalibrated model will have a normalized residual distribution with mean of zero and standard deviation of one, i.e., the normalized residuals will follow a unit-normal distribution. From Figure   7A, we can see the uncalibrated values have a mean near zero but slightly negative, and a large standard deviation of about 3.2. In contrast, after calibration, the calibrated values have mean closer to zero and standard deviation very close to 1, indicating the recalibration was successful when examining the distribution of normalized residuals. Figure 7B contains the same data as While the analysis shown in Figure 7A and Figure 7B demonstrate the ML model uncertainties are well-calibrated in the sense that they produce the expected distribution of values, it does not provide an indication of the direct trend between actual model errors (residuals) and their corresponding predicted errors (uncertainty estimates). To better quantify the relationship between true and predicted errors, in Figure 7C we make a residual vs. error (RvE) plot with quadratic rescaling of model uncertainties. RvE plots with constant, linear, and quadratic rescaling with unequal and equal sample binning are shown in Figure S4 in the SI for comparison. In addition, we have plotted the distribution of normalized residuals and RvE plots with quadratic recalibration for data separated into low fluence and high fluence groups in Figure   S5 of the SI. This method of representing true vs. predicted model errors was introduced by Morgan and Jacobs 10    Next, we examine the trends of our ML model predictions as a function of four key features for which embrittlement trends are well-known: Cu content (in weight percent), Ni content (in weight percent), fluence and flux. These results are presented in Figure 9 in the form of cross-plots. We note that a version of these cross-plots with shifted data points overlaid is provided in Figure S6 of the SI. Because we are plotting predicted TTS as a function of a single variable, we hold the other composition and temperature values fixed to correspond to median values in the full database. These median values correspond to a fictious alloy with temperature = 290 ˚C, wt% Cu = 0.12, wt% Ni = 0.69, wt% Mn = 1.4, wt% P = 0.009, wt% Si = 0.22, wt% C = 0.14.
The flux and fluence values were chosen to correspond a flux of 3×10 10 n/cm 2 -s and fluences of 4×10 19 n/cm 2 (black dashed lines) and 1×10 20 n/cm 2 (green dashed lines). The exception to this is the plot of predicted TTS vs. fluence in Figure 9C, where the ML model was evaluated at different flux levels. In Figure 9, the ML predictions are shown as dashed lines, where the shaded regions are the corresponding one-standard deviation calibrated error bars. The TTS increases significantly with increasing Cu content (Figure 9A), which is consistent with physical understanding due to increased volume fractions of CMP dislocation glide obstacles, resulting in increased Δσy and TTS. The leveling off at about 0.3 wt.% Cu is due to pre-precipitation during RPV heat treatment. 46,47 The further decrease in TTS at even higher Cu up to ≈0.8 wt% Cu is believed to be due to enhanced pre-precipitation driven by higher supersaturation of this insoluble element. 47,48 The increased TTS with Ni in Figure 9B is also due to an increase in the precipitate volume fractions, since precipitates typically contain more than 30% Ni. The predicted TTS with fluence in Figure 9C at lower flux (black dashed curve for typical RPV service conditions at ≈3×10 10 n/cm 2 -s) flatten at lower fluence due to Cu depletion from the matrix, but subsequently increase significantly at fluences of greater than 10 19 n/cm 2 ; this is due to continuing precipitation of Ni, Mn and Si which is slower than that for Cu. 6 In Figure 9C, since we are plotting as a function of fluence, the ML predictions are given at the low flux of 3×10 10 n/cm 2s, while the green dashed curve in Figure 9C is for a very high flux conditions of 1×10 14 n/cm 2 -s and shows a generally lower TTS up to high fluence. It is well-established that high flux delays precipitation because vacancy and interstitial recombination lead to less efficient radiationenhanced diffusion. 4,6,49,50 However, it is also well known that the effect of flux decreases with increasing fluence in a way that depends on the steel composition. 6,28 At very high fluence greater than 10 20 n/cm 2 there is a crossover with higher TTS at higher flux. However, such a crossover is not observed at lower fluxes closer to service conditions, and physical considerations suggest that it is likely not reliable. 6,28 The predicted TTS shown in Figure 9D shows that the TTS continuously decreases with flux at an intermediate fluence of 4x10 19 n/cm 2 . In contrast, at a higher fluence of 10 20 n/cm 2 , the predicted TTS varies non-monotonically as a function of flux; however, the overall variation in TTS is modest, and falls well within the uncertainty band, ranging from a low to high value of 98 to 110 °C, respectively, over the entire range of flux. Most significantly, there is now strong empirical evidence that flux effects are significant at low fluence, particularly in sensitive Cu bearing steels, but they decrease, or disappear, at high fluence. 6,28,60 For example, this conclusion was specifically supported both in the extensive hand analysis of test reactor data, 6 and in the ML study by Liu et al. 28 Specifically, the ML predictions in this study (see Figure 10) clearly show increasing flux delays, thus decreases TTS at low to intermediate fluence, except in low Cu steels. However, at higher fluence, ML predicts a cross-over in Cu bearing steels, where higher flux leads to higher TTS. As shown in Figure 10, the predicted crossover fluence is very low in low Cu (< ≈ 0.075%) and low to medium Ni (≤ 0.75%) steels. At higher Cu (> ≈ 0.1%) the crossover fluence increases with Ni. Although details differ, a qualitatively similar trend was observed in the test reactor ML study of ∆ y in Liu et al. 28 Note the crossover fluence also depends on the specific values of the high and low fluxes. Thus, while the fluence at a specified ∆ y was well-predicted by the earlier ML model in domains with flux-fluence data, significant extrapolations of the simpler curve shapes were taken as being unreliable. Based on these considerations, the previous ML study from Liu et al. assumed there is no effect of flux beyond the crossover fluence. 28 The most relevant high and low fluxes are the ATR2 values of 3.68x10 12 n/cm 2 -s, and a typical PWR vessel flux of ≈3-4x10 10 n/cm 2 -s, which reaches a fluence of 10 20 n/cm 2 in about 80-100 years. In the current study the crossover fluence in Cu bearing steels is near, or more generally above, 10 20 n/cm 2 -s (see Figure 10). In contrast, the crossover is at very low fluence, and the TTS decreases with decreasing flux, in low Cu steels. It is beyond the scope of the current paper to explore the physics that might support this observation in low Cu steels, such as the role of so-called unstable matrix defects, which have opposing effects of increasing recombination, while adding a small increment of hardening. [51][52][53][54] Overall, additional study to better assess the applicability domain of our model and more deeply analyze the flux effects of specific alloys is warranted. In summary, our ML model TTS predictions are broadly consistent with both previously observed and key physical mechanisms, which mediate embrittlement of RPV steels.

TTS predictions for life extension
Having established that our ML model provides accurate embrittlement predictions with well-calibrated uncertainties, and that the embrittlement predictions follow known physical trends, we next predict TTS values and their uncertainties encompassing approximate LWR life extension conditions. Figure 11A contains a heatmap grid of predicted TTS values as a function of flux and fluence for the same fictitious "median" alloy used for the TTS predictions in Figure 9.
In Figure 11A, the colors denote the magnitude of TTS value (blue is lower, red is higher), the large numbers in the first row of each box are the predicted TTS in ˚C, and the smaller number underneath is the corresponding calibrated one standard deviation uncertainty estimate. The values corresponding to a fluence of 10 20 n/cm 2 are outlined with a black box. From these predictions, we see that a hypothetical median alloy has a TTS of about 100-101 ˚C +/-13-15 ˚C.       Table S2. Table S2. Summary of linear fit statistics for the residual vs. feature plots shown in Figure S3. To aid in interpreting the impact of the slopes, the change in residual TTS over the dynamic range of the data is also provided, and is small in all cases, indicating negligible trend of residuals vs. each feature.