MultOpt++: a fast regression-based model for the development of compositions with high robustness against scatter of element concentrations

Alloys-by-design is a term used to describe new alloy development techniques based on numerical simulation. These approaches are extensively used for nickel-base superalloys to increase the chance of success in alloy development. During alloy production of numerically optimized compositions, unavoidable scattering of the element concentrations occurs. In the present paper, we investigate the effect of this scatter on the alloy properties. In particular, we describe routes to identify alloy compositions by numerical simulations that are more robust than other compositions. In our previously developed alloy development program package MultOpt, we introduced a sensitivity parameter that represents the influence of alloying variations on the final alloy properties in the post-optimization process, because the established sensitivity calculations require high computational effort. In this work, we derive a regression-based model for calculating the sensitivity that only requires one-time calculation of the regression coefficients. The model can be applied to any function with nearly linear behavior within the uncertainty range. The model is then successfully applied to the computational alloys-by-design work flow to facilitate alloy selection using the sensitivity of a composition owing to the inaccuracies in the manufacturing process as an additional minimization goal.


Introduction
The conventional trial-and-error-based property testing method to develop new alloys demands large experimental effort. Especially for modern high performance alloys containing more than eight alloying elements, this approach is relatively expensive because of the high number of different alloy compositions [1]. The numerical alloy-by-design approach [2,3], which is supported by the calculation of phase diagrams (CALPHAD) method [4], is the state of the art to predict new highly complex alloys.
During the manufacturing process of nickel-base superalloy components, various processes can lead to scattering of the element concentrations. In the course of the heat production of an alloy by vacuum inductive melting such a scattering occurs due to measurement inaccuracies of the scales, low purity of the master alloys, chemical reactions with the crucible or contaminated atmosphere [14]. In the state of the art process for casting of nickel-based superalloys called high rate solidification (HRS), the molten alloy is poured into a preheated mould under high vacuum and slowly withdrawn from the heating zone into the cooling zone [15]. Due to the liquid phase dwell time of several hours, volatile elements such as Al and Cr can evaporate [16]. The recently developed FCBC process is performed under a low pressure atmosphere and at higher withdrawal rates, preventing evaporation of these elements [17], but industrial application is still pending. Figure 1(a) shows an exemplary distribution of the element concentrations 5 along the withdrawal direction in a single crystal HRS casting process. In addition, the root mean squared error (RMSE) of the achieved element concentration from the target concentrations and the standard deviation of six independentantly produced castings of one experimental alloy with separately produced heats are shown. Besides a scattering of the Mo concentration and a systematic deviation of the Ti content, the decrease of the Al and Cr concentrations along the solidification path is visible. That means, that both the dispersion of the concentrations in the component (RMSE) and the reproducibility (standard deviation) of various heats and castings D  c must be considered regarding the sensitivity of the alloy properties.
The concept of sensitivity is shown schematically in figure 1(b). The slope of a plot of a property model function f against the element concentration indicates how strong a change in the concentration affects the spread of the respective property. Thus, a low sensitivity alloy exists in a concentration range where the slope is small. Therefore, it will only exhibit small variations in its properties for a given variation of the concentrations of the alloying elements. 5 Measured by the optical emission spectrometry on Ametek Spectromaxx (SPECTRO Analytical Instruments GmbH, Kleve, Germany).
The sensitivity of the alloy properties is rarely been taken into account [18] during complex optimization processes because of the lack of fast models for the property robustness, which is defined as the inverse of the sensitivity. The general concept of multi-objective robust optimization was initially developed by Deb et al [19] from single-objective robust optimization [20,21], and it is has been summarized in detail by Beyer et al [22]. Rettig et al [5] applied sensitivity analysis of the alloy properties in a post-optimization process using the alloy development program package MultOpt. An improvement of MultOpt called MultOpt++ has been briefly summarized by Markl et al [23].
In the current work, a new regression-based model for online sensitivity evaluation was derived. This will allow robust optimization of the alloy properties by including the property sensitivity as an additional minimization goal.

Alloy property prediction
Consider a multicomponent alloy consisting of n alloying elements. The chemical composition of this alloy can be represented as a n-dimensional vector: is the concentration of ith alloying element. The typical design space for alloy development is given by a 2×n matrix containing the concentration range defined for a certain optimization problem max are n-dimensional vectors containing the minimum and maximum concentration values of the ith alloying elements, respectively. In some cases, the considered design space is extended by the temperature range  r , which is usually defined as T , r min max , where T min and T max are the minimum and maximum temperatures relevant for alloy development.

op op
Here, the property model is a functional description of the alloy properties, which can be obtained during a model development process. This model can be constructed based on known physical relationships or using statistical regression analysis with experimental data or empirical rules. In this study, the applied models are based on CALPHAD calculations, combining both of the previously mentioned approaches. CALPHAD calculations are a very powerful tool for alloy development and allow prediction of several properties, even for complex compositions with more than eight alloying elements, such as Ni-base superalloys. The common thermodynamic calculations performed by the CALPHAD method during the optimization routine in this work are the calculation of transformation temperatures such as liquidus, solvus and solidus temperature and the amount of phases and their compositions. The databases used for such calculations are being continuously improved and show good agreement with experimental data [24][25][26] and less agreement for the prediction of topologically close packed (TCP) phases [27,28].

Sensitivity
In general, the sensitivity where k is the property index, describes the variation of the objective function  ( ) f c T , k with the variation or uncertainty of one Δc ( i) or all D  c design variables with n elements. The widely used approaches for local sensitivity calculation, such as  are summarized in [29]. Equation (2) is recommended for sensitivity analysis, but it does not take into account the interactions between the design variables c ( i) (i=1, K, n) and it is only valid for nearly linear functions [30]. To include these effects in sensitivity calculations, we focus on a variation-based method (3): The sensitivity of a multi-objective optimization problem (MOOP) has to combine the sensitivities of each objective function in a similar way to that proposed by Wang et al [31]: , whose calculation during the optimization process is time consuming because of the large number of design variables. By ignoring the interactions between n variables, the required number of function calls can be reduced from 2 n in the simplest case of permutating all boundary values  D   c c 0 to only 2n. This causes difficulties in locating good solutions for complex optimization problems, such as single-crystal Ni-base superalloy development. Depending on the number of optimization goals and input variables, each objective function has to be called for a typicall alloys-bydesign routine in our approach up to 10 4 −10 9 times. A simple optimization including interactions will then take 10 4 ×2 8 ×5 ms≈4 h using surrogate models [5]. The use of direct Thermo-Calc 6 (TC) calculations significantly increases the optimization time (in our alloy-by-design tool by a factor > 50).
In our approach, we first calculate the spatial distribution of s f k within reasonable concentration and uncertainty ranges, as summarized in table 1. Additionally, some implicit and explicit constraints for development of single-crystal nickel-based superalloys for turbine blade production are defined to prevent calculation of impractical solutions. Extensive regression analysis of approximately 50 000 calculated points is then performed to obtain the free parameters of the model and select the most appropriate regression model from a set of considered models for each alloy property distribution. The number of required calculations is comparable with the number of function calls for one complex alloy development problem (∼10 7 calls), but they only have to be performed once and can be more easily parallelized because no communication between processes is required.
During statistical analysis, three different linear regression models (5)-(7) with and without stepwise variable selection [32] are applied and compared with each other to identify the most appropriate model (MAM) for prediction of the sensitivity of each alloy property: , T is the vector of the merged input parameters and a ( i) (i=1, K, p) are unknown parameters to be defined from the precalculated data points. In the first place, the differences between these models lie in their complexity. The goal is to find a simple model, which allows to describe the true function with as few coefficients as possible. Here, L, I, and Q correspond to the linear, interaction, and quadratic terms involved in the model construction, and S and F in the superscript denote regression models with and without stepwise selection of variables. For example, SLIQ describes a model with linear, interaction, and quadratic terms calculated by stepwise regression analysis.
By substituting the property distribution s  ( ) The value s can be also used to predict the property variation within a certain specification limit D  c s by replacing the uncertainty D  c with D  c s .

Optimization
The computer-based development process for indentifying good alloy candidates is usually applied in the objective space (properties), where the algorithm attempts to achieve the best property value available in the whole design space (concentrations). Simultaneous optimization of several goals has to be able to handle so-called MOOPs, which also contain a certain number of constraints. The result of such a calculation is a Pareto-front describing the best possible combinations of non-dominated function values for a given optimization problem. The effectivity of evolutionary algorithms (EAs) for solving MOOPs has already been shown by many researchers in a wide range of approaches [21,[33][34][35][36][37].
The model for prediction of the sensitivity derived by regression analysis was applied to the optimization problem described in [5], which was slightly modified and extended by

Functions constraints
Concentrations constraints is not performed in the current work and has to be applied in post-optimization for the whole Pareto-front or selected alloys only with direct and therefore extensive sensitivity calculations as given by equations (1) or (2), if required.

Implementation details
Regression analysis with stepwise variable selection and testing of the null hypothesis H 0 (an extended review of the method is given by Nickerson [38]) were performed with the built-in functions of MATLAB 7 software (version 2016a 64-bit), such as stepwise [32], swtest [39], lillietest [40,41], adtest [42], and jbtest [43]. The in-house-developed alloy-by-design tool MultOpt++ (for details the readers are referred to [5] and [23]) was used for the CALPHAD calculations with the programming library TC-API version 2017b (64-bit) based on Thermo-Calc version 2017b 8 . All of the Thermo-Calc calculations were performed with the TTNi8 9 database and deactivated global minimization. The optimization kernel of MultOpt++ is the Geneva Ivrea-Via Arduino 1.6.1 10 optimization library [44] and it was modified to enable storage of a whole Pareto-front during the optimization process. For multi-criteria optimization, the EA was applied.

Model selection
One of the main aims of statistical analysis was to identify the MAM of each alloy property to be used in MOOP. Usually, physical-based models are preferred to any formal mathematical Table 2. Definition of the optimization problem for the current work. I SSS , ρ, g¢, d g g¢ , W ht , p, g¢ T , T sol , and T liq are the solid solution strengthening index, density, misfit between the γ and g¢ phases, g¢ fraction, heat treatment window, cost, g¢ solvus, and solidus and liquidus temperatures, respectively.  c r represents the whole optimization range.
description, such as linear regression, and, ideally, the investigator who collects or generates the data should select such a model. However, if it is not possible to determine which model is the MAM based on expert knowledge, some quantitative statistical goodness-of-fit measure can be applied to compare competing models, such as the Akaike information criterion (AIC) (for detailed information refer to [32]), RMSE, or the coefficient of determination (R 2 ). Compared with the well-known RMSE and R 2 measures, application of the AIC for model comparison has the advantage that the number of model parameters is taken into account as a penalty term and it can thus avoid the so-called over-fitting problem. Taking into account all of the considered criteria, the MAM is the model with the smallest values of the AIC and RMSE statistics and the highest R 2 value. Thus, for all of the property distributions described with regression-based models using stepwise variable selection, we prefer the model that can achieve at least the same quality of prediction according to the best values of the considered statistical goodness-of-fit measures with full models but reduces the number of regression coefficients. Therefore, the s  ( ) x f SLIQ k model from equation (7)

Distribution calculation
The assumption of a normal distribution of objective function values s s owing to variation of the design variables is satisfied for more than 90% (see table 3) of the CALPHAD-based functions used in the current work by the null hypothesis H 0 rej test. This means that the function values of the remaining compositions have either a nonlinear trend close to the selected points within the D  c range or the distribution of randomly selected alloys is already a non-Gaussian distribution, for example, because of cutting off negative concentration values owing to limitation by the lower boundary.
For simulation of the uncertainty of a manufacturing process, the content of each element (i) is assumed to be normal distributed  m s ( ) , with a mean value μ ( i) equal to the nominal alloy composition c ( i) . Here, the standard deviation σ ( i) is defined as half of the respective process inaccuracy Δc In this case, the relative error in comparison with the distribution width calculated with 25 000 points is ≈2%. The inaccuracy defined in equation (8) can be adapted to a particular manufacturing process. So, e.g. a uniform distribution of elements can also be assumed within the inaccuracy range or the specification limits. Since such values of a process are usually known in weight percent, the concentration units in our model are also in weight percent. The variations in molar fraction can be recalculated by the CALPHAD method, if required. The property distribution calculation was performed with the TTNi8 11 database, which can differ for other databases but can still be used as a guidance value.

Regression coefficients
In general, for a given linear function where the regression coefficients α i indicate the influence of a single design variable on the function variation. For further details, see [45]. The correctness of a linear model is correlated with the determination value R 2 and should ideally be close to 1 [29]. The mean R 2 value for all of the SLIQ property models calculated in the current work is 0.958±0.013, while for FL it is 0.851±0.051 (table 4). Nevertheless, as previously mentioned, the smaller number of coefficients in the FL model simplifies interpretation of the effect of each element and their uncertainty with respect to the whole property distribution s For better comparison of each predictor and thus its influence on the models in weight and atomic fractions, we used the standardized regression coefficients. Each input parameter x ( i) , as defined in equations (5)- (7), was standardized by the mean parameter value ⟨ ⟩ The coefficients for each function were normalized by the absolute coefficient value. From a materials science point of view, the alloy designer can directly derive the effect of each element on the alloy sensitivity and its variation from the coefficient value, as shown for figure 2(a)) and atomic (figure 2(c)) fractions. According to Table 3. Rejected fraction of null hypothesis H 0 rej tests of the composite normality (Matlab functions: swtest for the Shapiro-Wilk parametric hypothesis test [39], lillietest for the Lilliefors test [40,41], adtest for the Anderson-Darling test [42], and jbtest for the Jarque-Bera test [43] . I SSS , ρ, g¢, d g g¢ , p, g¢ T , T sol , and T liq are the solid solution strengthening index, density, misfit between the γ and g¢ phases, g¢ fraction, cost, g¢ solvus, and solidus and liquidus temperatures, respectively. value. According to figure 2(b), the most effective elements for increasing the I SSS index are Al and Ti, because they reduce the matrix phase fraction, and Re, Mo, and W, which act as solid solution strengthening elements. However, high amounts of these elements increase the sensitivity of the s value, which can be seen in figure 2(a) (red squares). A slightly positive effect on the sensitivity can be achieved by increasing the amount of Cr (see figure 2(a)) (blue squares), which also has a negative effect on the solidus and liquidus temperatures ( figure 2(b), blue squares). Table 4. Comparison of the regression models. The best values are shown in italics and the selected model for further calculations is shown in bold. For better comparability of the RMSE, the standardized function values were used. I SSS , ρ, g¢, d g g¢ , W ht , p, g¢ T , T sol , and T liq are solid solution strengthening index, density, misfit between the γ and g¢ phases, g¢ fraction, heat treatment window, cost, g¢ solvus, and solidus and liquidus temperatures, respectively. According to figure 2, the variations of Al and Ti cause the strongest property deviations for almost all of the properties, so their contents should be set as precisely as possible. Conversely, variation of cobalt has almost no effect because it exhibits similar characteristics to nickel. It should be noted that the variation of one element is balanced by the variation of the main element nickel, and therefore the coefficients describe the effects of both elements. In addition, the effect of ruthenium for CALPHAD-based calculations is insignificant, partly because of an incomplete description in the TTNi8 database, which has previously been shown by Matuszewski et al [46] and Ritter et al [26].
The evaluated coefficients for the SL model are given in table 5. These coefficients can be stored in thermodynamic databases for faster prediction of the distribution width of available properties. Direct calculation of distributions is already implemented in Thermo-Calc 2016b software, but it requires a high number of samples (several hundred for a complex Ni-based superalloy) to be calculated and cannot be included in the optimization routine.
wt . Normalization was performed with the maximal absolute value of the coefficients for a certain function. The influence of each element on a given property is given by the color of the square: blue and red indicate increasing and reducing effect of the property distribution or property value, respectively. I SSS , ρ, g¢, d g g¢ , W ht , p, g¢ T , T sol , and T liq are the solid solution strengthening index, density, misfit between the γ and g¢ phases, g¢ fraction, heat treatment window, cost, g¢ solvus, and solidus and liquidus temperatures, respectively.

Optimization
The sensitivity of the optimization problem defined in equation (4) can be extended by any number of constraint functions n con , as well as user-defined weighting factors w k for each sensitivity depending on the requirements of the optimization goals  figure 3. With this additional information, more robust alloys can be selected accepting somewhat inferior values of I SSS , ρ or both. The density sensitivity only has a minor influence on the S op p value. In the case of equal weighting factors of w I SSS and w ρ , the preferred alloy will be the one with a low I SSS value because this will lead to lower sensitivity. Because the S op p value is the squared sum of all of the single sensitivities, the sensitivity of each S f p k value can be recalculated in a post-optimization process, if required. Alternatively, each sensitivity S f k can be used as an additional optimization goal accepting a longer optimization time.
In the first instance, the different function slopes of the property functions f k will have an effect on the property sensitivity S f k , as shown in figure 1(b). In addition, switching between interpolation functions within a CALPHAD database independently of the different temperature ranges, especially if temperature deviation ΔT is also considered, which is ignored in the current work, also influences S f k . Therefore, the sensitivity S f k can also be interpreted as the maximum relative error for a given input parameter distribution of optimized properties, giving the alloy designer the ability to select candidates with properties that are more stable or predicted more accurately by CALPHAD. This is indicated in figure 3 by the expected  [24,25,47]. Especially the description of the TCP phases is quite poor [27,28,46]. Nevertheless, despite this insufficient accuracy, one has the advantage of separating potentially TCP-forming from potentially non-forming alloys. The predictive accuracy obtained is sufficient to test the stability of a large number of highly promising alloys determined by optimizing well-known properties and to reduce the postprocessing effort in experimental validation significantly, taking into account the possible neglection of even better alloys. For some alloy properties, the CALPHAD prediction inaccuracy s f