Predicting length of fatigue cracks by means of machine learning algorithms in the small-data regime

In the last decade repair intervals of certain parts of gas turbines, which are installed in the combustor or turbine sections, have been extended by 100%. Constantly growing energy needs force operators of petrochemical and power plants to reduce downtime and maximize production. Many companies sign a Long-term Service Agreement with the engine’s OEM to assure seamless execution of the optimized maintenance plan, with availability and productivity guarantees. The value delivered by the manufacturer is based on the knowledge, experience and data accumulated over the years. Combining this with selected operating parameters of the units allows to develop predictive models, which support data-driven decisions and provide actionable insights to the end users [6]. Due to the complexity of the analyzed system, numerous estimators are built and utilized, e.g., for diagnostic and anomaly detection [17, 13], classification [2], regression, or as a synthetic sensors [33]. A structured collection of such models reflecting part-to-part interactions and fed with operational data is referred to as the digital twin of the gas turbine [30]. Survival analysis for rotating equipment is typically carried out on limited datasets due to the low occurrence of failures, long duration and high costs associated with the destructive tests. These constraints set Weibull analysis for years as the default method for forecasting failure probability and damage size [1]. Scientific papers on the correct execution of regression analysis in the small-data regime [28] have not disappeared in the big data era. New contributions have been published [23, 14] and are needed to provide guidance on how to use machine learning algorithms, which were not so commonly used in the past, for predictive modeling based on small samples. Such studies will be executed regardless of the growth of big data, but with the development of new data infrastructures the small data might be pooled and scaled more frequently [26]. Fatigue is among the most frequent causes of failures in mechanical systems. Deep knowledge of the fatigue crack growth mechanism is needed for the design and maintenance of gas turbines, especially with rising expectations regarding the efficiency and comprehensive optimization of the production process. Since the groundbreaking contribution of Paris and Erdogan [32], numerous fatigue crack growth equations have been published and these have been summarized well in [4]. For the most part, these models are based on experimental data and vary in complexity. There is no universal approach to fatigue life prediction; thus, for a particular problem, the optimal method should Predicting length of fatigue cracks by means of machine learning algorithms in the small-data regime Maciej Badora *, Marzia Sepe , Marcin Bielecki , Antonino Graziano , Tomasz Szolc a


Introduction
In the last decade repair intervals of certain parts of gas turbines, which are installed in the combustor or turbine sections, have been extended by 100%. Constantly growing energy needs force operators of petrochemical and power plants to reduce downtime and maximize production. Many companies sign a Long-term Service Agreement with the engine's OEM to assure seamless execution of the optimized maintenance plan, with availability and productivity guarantees. The value delivered by the manufacturer is based on the knowledge, experience and data accumulated over the years. Combining this with selected operating parameters of the units allows to develop predictive models, which support data-driven decisions and provide actionable insights to the end users [6]. Due to the complexity of the analyzed system, numerous estimators are built and utilized, e.g., for diagnostic and anomaly detection [17,13], classification [2], regression, or as a synthetic sensors [33]. A structured collection of such models reflecting part-to-part interactions and fed with operational data is referred to as the digital twin of the gas turbine [30].
Survival analysis for rotating equipment is typically carried out on limited datasets due to the low occurrence of failures, long duration and high costs associated with the destructive tests. These constraints set Weibull analysis for years as the default method for forecasting failure probability and damage size [1]. Scientific papers on the correct execution of regression analysis in the small-data regime [28] have not disappeared in the big data era. New contributions have been published [23,14] and are needed to provide guidance on how to use machine learning algorithms, which were not so commonly used in the past, for predictive modeling based on small samples. Such studies will be executed regardless of the growth of big data, but with the development of new data infrastructures the small data might be pooled and scaled more frequently [26].
Fatigue is among the most frequent causes of failures in mechanical systems. Deep knowledge of the fatigue crack growth mechanism is needed for the design and maintenance of gas turbines, especially with rising expectations regarding the efficiency and comprehensive optimization of the production process. Since the groundbreaking contribution of Paris and Erdogan [32], numerous fatigue crack growth equations have been published and these have been summarized well in [4]. For the most part, these models are based on experimental data and vary in complexity. There is no universal approach to fatigue life prediction; thus, for a particular problem, the optimal method should Predicting length of fatigue cracks by means of machine learning algorithms in the small-data regime

Article citation info:
In this paper several statistical learning algorithms are used to predict the maximal length of fatigue cracks based on a sample composed of 31 observations. The small-data regime is still a problem for many professionals, especially in the areas where failures occur rarely. The analyzed object is a high-pressure Nozzle of a heavy-duty gas turbine. Operating parameters of the engines are used for the regression analysis. The following algorithms are used in this work: multiple linear and polynomial regression, random forest, kernel-based methods, AdaBoost and extreme gradient boosting and artificial neural networks. A substantial part of the paper provides advice on the effective selection of features. The paper explains how to process the dataset in order to reduce uncertainty; thus, simplifying the analysis of the results. The proposed loss and cost functions are custom and promote solutions accurately predicting the longest cracks. The obtained results confirm that some of the algorithms can accurately predict maximal lengths of the fatigue cracks, even if the sample is small.
properly balance the accuracy and computational costs. Usually, the finite element method is used to solve the crack growth equation. Nevertheless, discretization of the domain at each iteration is a challenge, thereby alternative approaches are utilized, e.g.: by the extended finite element method, the boundary element method, and their hybrids and meshless methods [36].
The growth in computational power, the availability of open-source software and simple, user-friendly libraries has resulted in the enormous popularity of machine learning in recent years. These methods have also been adapted to estimate the fatigue life, while the artificial neural network (ANN) is the most widely used algorithm [21]. The ANN was successful in including the effect of mean stress for fatigue damage prediction [12] [22] or the effects of load sequences and temperatures on the fatigue life [20]. Attempts to utilize regression trees [27], random forests [43], kernel-extreme learning machine [15], or Bayesian network [37] can be found in recent publications. Typically, the models are based on experimental data, though outputs of a finite element analysis were used for the training [10] [41]. Data-driven estimators are effective for complex, high-dimensional problems, where analytical solutions do not exist. A fully specified model responds quickly, therefore it can be installed in the edge devices or utilized for real-time monitoring of the damage. However, these methods also have certain drawbacks, e.g., a variety of challenges caused by small samples, time-consuming optimization, poor extrapolation abilities, or a lack of interpretability [31]. In this contribution the authors try to address these problems, providing a tutorial on the execution of regression analysis by means of machine learning algorithms in the small-data regime.

The Analyzed Object and Problem Setup
This paper focuses on the 1 st stage Nozzle (S1N) of a heavy-duty gas turbine. Three different engine configurations are analyzed and referenced as Type A, Type B and Type C. The Nozzle assembly has 18 segments cast from a cobalt-based alloy FSX-414. A compressor discharge airflow is used to cool the part by means of a cooling insert in the airfoils and cooling holes at the leading and trailing edges. The analyzed object is shown in Fig. 1 and Fig. 2 [38].
Failure modes recorded during intermediate inspections of the component are as follows: cracks, oxidation and erosion of airfoils/platforms, corrosion of airfoils due to contaminants, deformation of airfoils due to creep. -A degradation of the surface of S1N due to oxidation, erosion, deformation, or corrosion results in the gradual loss of efficiency of the gas turbine. However, these negative effects should be captured by the operator, analyzed by a cross-functional team, and properly addressed. In the case of cracks, the standard instrumentation provides no direct indications of the presence and size of the damage. If the engine is not maintained as per the OEM guidelines, S1N cracks may even lead to an unplanned outage. Examples of trailing edge cracks of the analyzed Nozzle are shown in Fig. 3. The damage was found in locations where the trailing edges have the highest temperatures. The main cause of this damage are thermo-mechanical stresses, which are maximal during transient states, caused, e.g., by engine startups and shutdowns. Some locations are also subjected to tension during steady-state operations.
The ability to evaluate the length of crack without having to stop the engine and inspect the parts may result in substantial benefits for the owner, such as: avoiding component repair by using the Condition Based Main-tenance approach,

An Overview of Available Empirical Data
The utilized data are the proprietary property of Baker Hughes Company LLC and cannot be published. Data provided in this paper allow the reader to understand the context and the decisions made by the authors. The analyzed positions on trailing edges of the Nozzles are referenced as Position 1, 2, 3 and 4, without disclosing any further details.
The first piece of utilized input data contains the following: numbers of fired hours (FH), fired starts (FS) and emergency shutdowns (ESD) accumulated by each segment since the last repair and since the part was manufactured, measurements taken and damages observed during repair ac-tivities, after the operation, configuration of the Nozzle and the engine(s) where the part previously operated.
Trailing edge cracks were found on 754 out of 868 (87%) S1N segments subjected to the repair. The longest cracks, or less than 5 mm shorter than the longest, were recorded at Position 2 for 640 out of 754 segments (85%). Modeling of the crack lengths at Position 2 shall be prioritized with respect to the remaining positions (Fig. 4).
Gas turbines are equipped with a wide range of measuring instruments. Remote Monitoring and Diagnostic services are used to record the time series of operational parameters. These random variables are utilized to predict damage to parts. The operational data are available for 555 out of 868 segments (31 Nozzle sets) subjected to the repairs. The sampling interval is set to 1 hour. The optimal interval allows for the correct modeling of dynamics of the failure mode, but does not simultaneously enlarge the dataset excessively. A preprocessing focuses on the removal of erroneous data and variables correlated with each other. The second piece of utilized input data contains the following time series: pressure, temperature, and relative humidity of ambient air, pressure and temperature of air at the discharge of the axial compressor, pressure losses in the inlet and exhaust ducts, axial compressor pressure ratio, mean temperature of gases at the inlet and outlet of the S1N assembly, and output of the engine (these are calculated parameters), temperatures of exhaust gases and their spread, speeds of the high-pressure and low-pressure shafts, positions of the Inlet (or Nozzle) Guide Vanes. -Missing readings of ambient air parameters are completed with the data published online [44]. For each of the variables a value range is defined beyond which the reading is considered either incorrect or related to transient states. Such records are removed from the dataset.
The set of operational data is composed of 1,029,215 records in total, although for some periods of the Nozzles' service time the data are unavailable. The coverage varies from 26% to 100% with the average equal to 73%. The Nozzles operated in units that drive centrifugal compressors in the process of natural gas liquefaction. Such engines operate at base loads with a very stable operational profile. Therefore, it is assumed that the available data also describe the missing periods well.

Feature Selection
Long cracks of the Nozzles may jeopardize the availability of the gas turbine; hence, the maximal crack size at Position 2 is selected as the dependent variable. Each Nozzle set is labeled with just one value, composing the sample with 31 data points. The complete input dataset has 32 independent variables; thus, to avoid overfitting, to reduce dimensionality and to make the models interpretable [18,24], feature selection is executed. The maximal number of predictors is set to 5, which gives ~6 subjects per feature. The predictors are the same, regardless of the regression method used, thus simplifying the interpretation of the results.
The time series of operating parameters related to each observation are simplified to medians. This transformation greatly reduces the dataset and is performed to pool together the different types of inputs previously described. It is justified for the analyzed units, which operate in steady-state conditions.
The features are selected using the Scikit-learn library [34] based on the simultaneous evaluation of mutual information, Pearson's and Spearman's correlation coefficients, and an analysis of chi-square tests. The removal of the least important features is done iteratively (i.e. ~20% of remaining variables are removed after each iteration) until 10 independent variables are identified.
The next step concerns feature engineering. The features are computed based on the time series of the operating parameters previously filtered. The variables reflect the distributions of these parameters in a simplified way. For each of the time series, considering all the records, certain statistical measures are calculated (i.e. 50 th , 70 th , and 90 th percentile). The new features are computed as the number of service hours with a reading below the 50 th percentile, between the 50 th and 70 th percentiles, between the 70 th and 90 th percentiles, and higher than the 90 th percentile. The enlarged set is again iteratively reduced to 10 variables.
The last step is about the application of a wrapper method to find the optimal combination of the features but using a specific algorithm. A random forest is chosen given its data-driven nature of decorrelated trees and leveraging the law of large numbers [5]. Each iteration removes one variable based on the importance of the features. Additionally, a 5-fold cross-validation (CV) with 50 repetitions is used during each iteration. The following features are selected for the modeling: median of average exhaust temperatures, -EXH T  , number of fired starts accumulated by the part since the manu-facturing, TFS, number of service hours with the mean temperature of gases at the inlet of the S1N assembly between the 50 th and 70 th percentiles, Tin P50-P70 , median of ambient air temperatures, -AMB T  , median of gas turbine output, -P  . These predictors are consistent with the physical phenomena that cause the damage. Thermal stresses σ are proportional to the Nozzle's material temperature change, ∆T MATL : where E is the Young's modulus, α is the thermal expansion coefficient, φ COLD and φ HOT are the cooling effectiveness coefficients on the cold/hot side of the part, Tin is the mean temperature of gases at the S1N assembly inlet, and T COOL is the coolant temperature. A specific range of Tin is present in the set of predictors, while T COOL is strongly correlated with the gas turbine output and the ambient temperature. TFS is considered as the number of load cycles, an inherent term in any fatigue crack growth equation.

Analysis Setup
As per [19], the expected mean squared error (MSE), for a given test value x, can be decomposed to the sum of variance of ( ) f x , the squared bias of ( ) f x , and the irreducible error Var ε ( ) : Minimization of the MSE leads to simultaneous minimization of the variance and bias to find a trade-off between them. A robust approach to regression analysis is to split the data into three subsets: the training set used to fit the models, the validation set used to tune the parameters of the models, and the test set used to assess the generalization capabilities and performance of the fully specified model. The sample has just 31 records, therefore an appropriate data splitting method is required to properly balance the sets and to avoid errors that may affect the entire analysis [42]. To facilitate the decision, the observations are separated into three classes: "short" class composed of cracks with a length lower than L -LOW (9 out of 31 observations, 29%), "medium" class composed of cracks with a length between -L LOW and L HIGH (10 observations, 32%), "long" class composed of cracks with a length higher than or equal to L HIGH (12 observations, 39%).
Then the dataset composition can be described as follows: 24 records (77%) correspond to gas turbines that operate in the marine environment (salty, onshore), while in the case of "short" and "medium" classes the fraction is equal to 95% (18 out of 19 observations); Records related to engines that operate in the tropical environ-ment (damp, onshore) make up 58% of the "long" class (7 out of 12 observations); 19 records (61%) correspond to Type C units and make up the majority of the "medium" and "long" classes (90% and 58% respectively); 7 records (23%) correspond to Type A units and make up 33% of the "short" and "long" classes; 5 records (16%) correspond to Type B units and make up 33% of the "short" class and minority of the remaining classes.
Regression models estimating damage to the parts of the gas turbines should be accurate across the entire range of observations, to support decisions about the necessity to execute maintenance. Consequently, the main constraint on the test set composition is to properly reflect the split between the "short", "medium", and "long" classes. The test subset is composed as follows: Each class of crack lengths represents 1/3 of the set; - To avoid excessive reduction of the training and validation sub-sets, the test set is composed of 6 data points (19% of the dataset) with 2 observations from each of the classes; 5 out of 6 records correspond to gas turbines that operate in marine (salty, onshore) environment; 4 out of 6 records correspond to Type C units, while the remain-der correspond to Type A and Type B; The "short" class is represented by the longest observations as-signed to this class; Records characterized by moderate values of the predictors are selected to the test subset, while the rest of the sample (with higher variance) is used for training and validation.
In the regression analysis a loss function is the prediction error for a single data point, while a cost function represents the error for all observations in the dataset. A squared error is commonly used as the loss function; thus, the MSE or root mean squared error (RMSE) are the most popular cost functions. The configuration of the cost function affects all subsequent steps of the analysis, i.e.: the tuning of the models' hyperparameters, the interpretation of results, and the selection of the optimal model. Also, it influences contributions that the deployed model brings when decisions are made. To obtain satisfactory results, the learning objective should be defined carefully with a deep understanding of the problem and user requirements.
For the analyzed Nozzle cracks, a certain value of absolute error is acceptable for the "short" class, which do not require immediate action. The same error is unacceptable for the "long" class, so cracks requiring careful evaluation due to the severity of potential consequences. The MSE does not meet this requirement, thereby custom loss and cost functions are used in this paper. The loss function has a variable width of the scoring bounds dependent on the length of observations (Fig. 5). The interval between the bottom bound b(x) and the top bound t(x) decreases as the size of actual cracks increases. The minimal width of the interval is equal to ~40% of the maximal width. The loss function is defined as follows:

Fig. 5. Loss function scoring bounds on the "Predicted vs Observed" plot
The loss function is positive only if the predicted crack length is within the bounds. The correct predictions of cracks belonging to the "long" class are scored 2x higher with respect to the remaining classes. This bonus favors regression models that accurately predict the longest cracks. The magnitude of the bonus should quantify the difference of importance of one aspect over others and should not artificially promote certain solutions. The authors noticed heavily biased predictions for short and medium observations when the bonus was too high.
The cost function is defined as the mean loss: General constraints on the composition of the validation set are as follows: To correctly reflect the proportions between the "short", "me-dium", and "long" classes; To The training, validation and test subsets were created in a fully controlled manner, based on the analysis of composition and clusters of the entire dataset. The authors observed that this approach gives better outcomes, simplifies the interpretation of obtained results, and improves the accuracy of the model. The loss and cost functions are custom, favor models that predict correctly in a specific range of the domain and reflect well the requirements of the user. Use of these functions significantly improved the accuracy of the prediction with respect to the MSE or the RMSE.

Description of Mathematical Model
Multiple linear regression (MLR) models assume that the response variable depends linearly on the independent variables. In scalar form it is represented as follows: where k is the number of predictors, x ik is the value of the kth predictor for the ith observation, β 0 is the intercept, β k are the regression equation coefficients, and ε is the error term. The Nozzles are not damaged prior to the service, thereby β 0 = 0. In matrix notation, (5) simplifies to: where  β is composed of coefficients of the regression equation (5) and X is the matrix of features of size i × k.
Polynomial regression models assume that the response variable depends nonlinearly on the independent variables. In scalar form it is represented as follows: where p is the degree of the polynomial equation. If β 0 = 0, then (7) in matrix notation has the same form as (6), although the size of the matrix of features X depends on the degree p. X is complemented by all the possible interaction features of the degree j (j = 2, …, p) calculated as products of distinct independent variables. The degree of the polynomial equation p ϵ {2, 3, 4} is the only hyperparameter tuned during the analysis.
The support vector regression (SVR) formulates (6) as an optimization problem aimed at finding the narrowest margin around the approximated surface [3]. The maximum error ε sets the width of the margin. The objective is to minimize the Euclidean norm of the coefficients' vector  β that is normal to the approximated surface, subject Only predictions outside the margin, called support vectors, are penalized. The solution of this constrained optimization problem is as follows: For non-linear relations, data are mapped into a higher dimension feature space  using a similarity function called a kernel ( ) The explicit mapping : Φ →   is not required if the kernel satisfies: In the space  the solution of the optimization problem is as follows: The kernel function is also used with the ridge regression (KRR), which is a linear model with a regularization parameter λ 2 ≥ 0. If the squared error is the cost function, then the minimized objective function has this form: with the following solution: where K = XX T is the Gramian matrix and the kernel function, α α = + ( ) − K I n λ 2 1  y is the dual variable. The solution of the optimization problem is as follows: A random forest is an ensemble of decision trees. A single tree divides the space of inputs  into j high-dimensional rectangles R j , in order to minimize the error at each tree split: where ˆj R y is the mean value of the dependent variable in the R j rectangle. A random subsample is drawn at each split; thus, the decision trees are decorrelated. Responses from all the trees comprising the random forest are averaged to obtain the final estimate: where m is the number of decision trees and  ( ) m f x  is the response of the mth decision tree.
In the AdaBoost.R2 algorithm [11] an ensemble of m weak learners, one-node decision trees, is created. Their training focuses on observations with the most inaccurate predictions obtained at the preceding iteration. The prioritization is based on weights assigned to each i x  that depend on the confidence in the weak learner θ, being a function of the average loss of this weak learner. For an unseen vector of predictors x  , the response is calculated as the weighted median: ( ) where the meanings of m and  ( ) m f x  are the same as in (16).
The extreme gradient boosted algorithm (XGBoost) [7,45] is an ensemble of gradient boosted decision trees, whose responses are summed to get the final estimate: The following function is minimized at each iteration t: where j = 1,2, …, T is the leaf's ordinal number, G j is the gradient and H j is the Hessian of the loss function, w j is the similarity score assigned to the jth leaf, λ 2 is the regularization parameter and γ is the minimum loss reduction to split a node. Pruning of the decision trees is based on the gain value: where L and R are the scores on the new left/right leaf, and N is the score on the new node. The new branch is removed if the gain is negative.
Artificial neural networks are non-linear statistical models. The output from a neuron Z comprising a hidden layer of the network is a linear combination of inputs x i : where m = 1, 2, …, M denotes the neuron's ordinal number, σ is the activation function, α 0m is the bias term and m α  is the vector with weights. The output of the entire network with a single hidden layer is as follows: where β 0 is the bias term,  β is the vector with weights and Z = (Z 1 , Z 2 , …, Z M ) is composed of neuron outputs. Function g is the identity function in the case of regression problems. (Stochastic) gradient descent and back-propagation algorithms are used to calculate the weights.

Results of the Regression Analysis
Python programming language is utilized to create the regression models. NumPy [16] and Pandas [29] libraries are used for the data manipulation, Keras library [8] is used to build artificial neural networks, while Scikit-learn library is used to build the remaining models. The z-score is utilized to standardize the input data.
The Differential Evolution [39] method available in the SciPy library [40] is used to solve the optimization problem and to find the optimal vector  β in the multiple linear and polynomial regression equations. The verification and validation of the models are presented in Fig. 6 and Fig. 7. The observed and the predicted values are normalized by dividing them by the maximal measured crack length. The dimensionless quantities generalize the results and make them more scientifically meaningful. It applies to all the figures in this section of the paper. The verification is done to assess the behavior of the model on the training set, while the validation evaluates the performance on the unseen, test dataset. The observations related to the obtained results are as follows: Regardless of the degree of the polynomial equation, the mean value of the cost function is equal to 1.200, therefore p = 2 is chosen to reduce the complexity of the model; No further efforts are taken by the authors to simplify the regression equation; However, an analysis of variance should be done to determine whether all the independent variables are statistically significant, especially if some components of the  β vector are close to zero.

EXH T  -
and Tin P50-P70 are the most important features in the MLR model, while their product and squared TFS are essential in the polynomial model (i.e. the modulus of the β k coefficients corresponding to these features is higher than the modulus of the remaining coefficients of the regression equation). The polynomial model has a lower normalized RMSE than the -MLR model (0.529 and 0.612, respectively), but this comes with the higher complexity and lower interpretability of the model. Responses of the models are not sufficiently sensitive to chang-es of the input parameters; hence, the length of short cracks is overpredicted and the longest cracks are underestimated. The accuracy is good only for the "medium" class. To reduce the space of hyperparameters a fixed number of parameter values is sampled using the RandomizedSearchCV class. The entire space is iteratively limited to the subspaces in which the cost function is maximized. When the space is sufficiently small, the Grid-SearchCV class is used to evaluate all the remaining combinations of parameter values. The highest CV scores are achieved with the polynomial kernel: The verification and validation of the models are presented in Fig.  8 and Fig. 9. The observations related to the obtained results are as follows: The mean values of the cost function (1.126 and 1.097 for the -SVR and the KRR, respectively) are comparable with the mul-tiple linear regression (1.133), so does the normalized RMSE (0.635 and 0.611, respectively). Responses of the models are sensitive to changes of the input parameters; hence, the predictions and the observations are in the same range. An exhaustive hyperparameter tuning can be done, since the mean time to fit these models is very short. The accuracy is unsatisfactory for the "long" class. -In the case of random forest regression, the AdaBoost.R2 and XG-Boost algorithms, the following parameters are tuned: the maximum number of decision trees -m, the maximum depth of decision trees, whether to build the trees on the entire training set or on its subset, or whether to use bootstrap samples, the number of predictors used during each split, the minimum loss reduction γ, or the minimum decrease in im-purity required to split a node, the minimum sum of the instance weight -H j required to split a node, or the minimum number of samples on the leaf after the split, the minimum number of observations required to split a node -(not applicable to the XGBoost), the learning rate η (not applicable to random forest regres-sion).
These hyperparameters are common for the considered tree-based algorithms. The size of the space of hyperparameters varies depending on the type of algorithm. Additionally, in the case of XGBoost regression, the following parameters are tuned: the booster type ϵ {gbtree, dart}, the L1 regularization parameter -  gain for each decision tree in the XGBoost model or based on the mean decrease in impurity (Gini importance) in the remaining models.
The models based on random forest and AdaBoost algorithms - give low values of the normalized RMSE (0.544 and 0.500, respectively). However, based on the CV scores (0.962 and 0.980, respectively), these models are ranked lower than the remaining ones. The XGBoost regression model has a significantly higher CV score (1.096), but the normalized RMSE calculated against the test set is the highest (1.000). The tuning of hyperparameters aimed at reducing the error (i.e. to increase prediction accuracy for the longest crack) results in a substantial decrease of the cross-validation score. Despite some regularization parameters having non-zero values, estimates for the training observations are almost perfect. A stronger regularization or reduction of the number of decision trees has a detrimental effect on the CV score and does not improve the normalized RMSE. The responses of the models are sensitive to changes in the in-put parameters; hence, the predictions and the observations are in the same range. Due to the higher complexity of these algorithms and the nu-merosity of user-defined constants, the tuning of hyperparameters requires more time in comparison with the previously discussed methods. The models underestimate the longest observations and none of them have a high cross-validation score or good results of the validation against the test set.  The highest cross-validation score (0.910) is obtained with the Adamax algorithm [25], the rectifier linear unit (relu) activation function defined as σ(x) = max(0, x), and the Lecun initializer that draws the initial weights from the uniform distribution The artificial neural network has 3 hidden layers each with 15 neurons.
The verification and validation of the model are presented in Fig.  13. The observations related to the obtained results are as follows: the model gives low value of the normalized RMSE (0.519), although it has the worst mean value of the cost function; the model is a black-box and it is unknown which independent variable is the most important; responses of the models are sensitive to changes in the input parameters; hence, the predictions and observations are in the same range; regardless of the dataset size, the tuning of hyperparameters is very time-consuming, due to the quantity of user-defined constants and the length of time needed to fit a single artificial neural network; the model underestimates the longest observations. - The objectives of this study are to apply the statistical learning algorithms to a real technical problem, to share the approach and learnings with other researchers, and to evaluate if these algorithms are effective in the small-data regime. The results presented herein are sufficient to achieve the objectives. In the authors' opinion, the inclusion of additional results (e.g.: the scripts, the results before the normalization, or the final form and coefficients of the regression models) do not increase the scientific value of the paper. A quantitative summary of the obtained results is reported in Table 1.

Conclusions
The main outcomes of this study are as follows: The utilized algorithms can accurately predict the crack lengths based on operational parameters of the engine, i.e.: the temperature of exhaust gases, the number of fired starts accumulated by the part, the temperature of gases at the Nozzle's inlet, the gas turbine output and ambient air temperature. Based on the analysis of the linear/polynomial regression equations and the  values of gain/Gini importance, the median of the average exhaust temperatures EXH T  is the most important feature in the majority of the regression models. The polynomial regression model is the best model, considering the cross-validation score and the normalized RMSE evaluated against the test set. Nevertheless, it is not sufficiently sensitive to changes in the input parameters. It underestimates the longest observations as well, which is a common drawback with the created models. The AdaBoost regression model predicts these cracks with the lowest normalized RMSE. Before being used as a standalone support for data-driven decisions, the model should be subjected to further validation. The split into training, validation and test subsets was done in a fully controlled manner, considering the clusters and the composition of the entire dataset. Each subset represents the sample in a quantitative and qualitative way. Such a consistent approach reduces ambiguity during the cross-validation, testing and interpretation of obtained results, especially in the case of small datasets. The (root) mean squared error should not be automatically cho-sen as the cost function, as the results might be suboptimal from a business, risk management or other relevant perspective. A custom cost function better reflects the requirement of the user/ customer and may favor certain solutions. In this paper, accurate predictions of the longest cracks are awarded with a bonus. The structure of the cost function drives the form and capabilities of the model. A variant of leave-p-out cross-validation is utilized, since the usage of k-fold cross-validation provides unclear and hard-tointerpret results for the training set composed of 25 records. This is computationally feasible because of the effective feature selection and the small number of observations. None of the models outperforms the remainder. Some of them are accurate for the "short" class, while others are better for the "long" class. Combining these models into an ensemble could improve the overall accuracy and better support data-driven decisions. None of the analyzed algorithms surpasses the others because of certain advantages or characteristics. The order of regression models in Table 1, resulting from the mean cross-validation score, will change for a different dataset. Nevertheless, finding the optimal structure and hyperparameters of the artificial neural network was the most complicated and timeconsuming. In general, the short time needed to fit a model allows for an extensive tuning of hyperparameters and is one of the few benefits of small samples. Building conclusions on a small dataset can be reliable, but it requires a rigorous approach and understanding of the decisions made throughout the entire analysis. The applicability domain must be defined for such models to limit extrapolation attempts and application of the model to items not covered by the training set. If possible, results of the data-driven model should be compared with a different approach, e.g., the results of the finite element analysis. A staggered implementation of the regression model is suggested, preferably connected with checks of the hardware (e.g., borescope inspection of the analyzed Nozzles), data gathering, and subsequent update of the model.
The authors applied several known machine learning algorithms to predict the maximal length of fatigue cracks. The analyzed dataset was not created synthetically or obtained during an experiment, but it describes components that operated in the industrial gas turbines under variable operating conditions. The purpose was to assess if the data-hungry methods can provide valuable results for the inhomogeneous sample composed of 31 observations only. The authors proved that the selected algorithms are effective in the small-date regime. The experience shared herein is universal and can support other researchers and professionals working with similarly sized sets.
It is the first attempt of the authors to apply machine learning algorithms in the small-data regime. Further research will focus on techniques that combine physics-based descriptions of the analyzed phenomenon with empirical models, staying in the small-data regime. Physics-informed neural networks [35,9] are examples of such hybrid models. This approach should assure the consistency of the outputs with the laws of physics, and improve the accuracy of the predictions and extrapolation capabilities. The ultimate objective of the researchers is to propose a method of scaling the hybrid models and transferring knowledge between domains.