Mechanical assessment of defects in welded joints: morphological classification and data augmentation

We develop a methodology for classifying defects based on their morphology and induced mechanical response. The proposed approach is fairly general and relies on morphological operators (Angulo and Meyer in 9th international symposium on mathematical morphology and its applications to signal and image processing, pp. 226-237, 2009) and spherical harmonic decomposition as a way to characterize the geometry of the pores, and on the Grassman distance evaluated on FFT-based computations (Willot in C. R., Méc. 343(3):232–245, 2015), for the predicted elastic response. We implement and detail our approach on a set of trapped gas pores observed in X-ray tomography of welded joints, that significantly alter the mechanical reliability of these materials (Lacourt et al. in Int. J. Numer. Methods Eng. 121(11):2581–2599, 2020). The space of morphological and mechanical responses is first partitioned into clusters using the “k-medoids” criterion and associated distance functions. Second, we use multiple-layer perceptron neural networks to associate a defect and corresponding morphological representation to its mechanical response. It is found that the method provides accurate mechanical predictions if the training data contains a sufficient number of defects representing each mechanical class. To do so, we supplement the original set of defects by data augmentation techniques. Artificially-generated pore shapes are obtained using the spherical harmonic decomposition and a singular value decomposition performed on the pores signed distance transform. We discuss possible applications of the present method, and how medoids and their associated mechanical response may be used to provide a natural basis for reduced-order models and hyper-reduction techniques, in which the mechanical effects of defects and structures are decorrelated (Ryckelynck et al. in C. R., Méc. 348(10–11):911–935, 2020).


Introduction
Our ability to design and produce materials with desired properties has dramatically improved, and commonly integrates sensors, control, simulation data and computerized pre-dictions (see reviews in [4][5][6]). These techniques combine material imaging [7] and digitaltwins frameworks [8] for manufacturing and evaluating material properties. In mechanics, numerous applications concern control, defects and anomaly detection [9] or fatiguedesign of materials [10,11]. Machine learning algorithms, notably, have been proposed in the aeronautical field [12], fabrication process [13] or pipelines applications [14], and have been combined with numerical computations to study defects in ball bearings [15]. Simulation-driven machine learning methods based on existing mechanical models are especially attractive as they avoid the explicit parametrization of defects being modeled. Assessing and certifying the mechanical properties of structures containing defects nevertheless requires advanced micro-mechanical models, as well as non-destructive imaging techniques such as X-ray tomography [16] or ultrasonic measurements [17]. This is due to the recognition that composites (or, for that matter, porous) microstructures, often exhibit widely-varying effective responses, as demonstrated by homogenization theories [18,19] and in optimal-design problems [20]. A broad range of mechanical properties may be achieved by tailoring the inner geometrical arrangement of microstructures, as surveyed in e.g. [21][22][23]. Aside for a few rigorous results obtained for particular geometries, e.g. the Eshelby [24] or Vigdergauz [25] inclusions, the effect of the shapes of pores on the overall mechanical response is difficult to quantify even for linearly-elastic media, and usually involves sophisticated mathematical tools. In plane strain, the presence of corners [26], up to the limiting case of a crack tip [27], bottlenecks [28,29], and high-aspect ratios are known to be mechanically-determining factors, as highlighted by studies based on conformal mappings [30] or radon transforms techniques [31].
Although these rigorous and (semi-)analytical results are useful as guides, they are restricted to linear media under plane strain or stress, with notable exceptions [32,33]. They do not allow one to explore the links between morphology and mechanical response, important in industrial problems. The latter often involve inverse-design problems within a given class of microstructures or morphologies, that results from manufacturing constraints [34,35]. In energetic granular materials, for instance, particles shape and size depend on crystallography and may be controlled by surface treatment, to some extent [36]. Furthermore, the overall material response alone, characterized by an effective stiffness tensor, is insufficient for determining the full mechanical response. The local response, sensitive to the internal microstructure arrangement of a given material system, must be accounted for. Damage localization, which leads to brittle or ductile fracture in composite materials, is driven by the local stress state in the microstructure, which is itself a complex result of the load distribution within the material.
Lately, shape statistics based on morphological operators have been devised [1,37] and so-called shape spaces [38] have gained attraction as a versatile method for quantifying shapes, seen as points in a high-dimensional metric space representing Fourier-based expansions. On a sphere, the Laplace-Beltrami eigenfunctions have an explicit form in terms of spherical harmonics [39,40], which can then be used to represent continuous shapes [41], seen as deformation of the sphere, or as mapping between a sphere and an arbitrary shape. This decomposition is especially useful for modeling data on a regular grid (i.e. on images), in computer graphics [42], medical image analysis [43] or material science [44]. In the context of mechanics, sophisticated image analysis approaches based on machinelearning methods have already been employed to detect, and more generally classify, "critical" defects as exemplified in several industrial problems [13,14]. Other approaches have sought to infer the mechanical response of materials using temperature fields [45]. These methods can be supplemented by transfer learning [46] and shape explorations techniques to determine mechanically-relevant criteria for assessing the effect or criticality of defects.
The purpose of the present work is twofold. First, a methology is developed to explore and classify defects based on shapes and mechanical response, using morphological transforms, spherical harmonics, and the Grassmann distance. To illustrate our approach, a set of defects previously observed in tomography images of welded joins is analyzed to serve as a model problem. Second, use is made of machine learning methods to compare and correlate the resulting classifications. The adequacy of the method for detecting critical defects, and other possible applications, are discussed, as well as future works, such as those exploring the dependency of our results on the choosen metrics.
The present article is organized as follows. Section 2 presents the set of defects used as the basis for the present study, whereas Sect. 3 deals with the various distances used for clustering, including full-field mechanical computations. Our main results, which concern the mechanicaly-based clustering of shapes, are given in Sect. 4. These results are compared to those obtained after data augmentation of the initial set of defect in Sect. 5. We conclude in Sect. 6.

Data set of defects and goal of the present work
The present work is based on a data set of defects obtained in L. Lacourt's PhD thesis [47]. These defects have been extracted from a segmented X-ray tomography image of welded joints, see [2]. The data set consists in 1288 defects in total, each containing between 500 and 100,000 voxels. Smaller defects present in the original image have been discarded in the present study. Slightly more than half of the defects are close to spheres, whereas the rest of them display various convex and non-convex shapes ( Fig. 1), see [48]. After segmentation, each defect is embedded in a bounding box in 3D, with edges aligned with the axis (e 1 , e 2 , e 3 ) of a Cartesian coordinates system. The shape has been rotated so that its first ans second principal axis are aligned with e 1 and e 2 . A reflection with respect to the plane (e 1 , e 2 ) is carried out so that the highest absolute coordinate along e 3 is positive. Finally, a homothety is performed so that the dimension of the shape along axis e 1 is 1/4 that of the embedding box. For all shapes, the embedding box is a cube containing L 3 = 80 3 voxels. Accordingly, the shapes have varying volume fractions, but the same diameter with respect to their bounding box,. This is so that cracks or pores with very high aspect ratios can be discretized with similar resolution.
The effects of such defects on the mechanical response of a structure can be efficiently estimated using the two-scales hyper-reduction method proposed in [2] for fatigue. In this method, schematized in Fig. 2, the effect of the overall structure and of defects are dissociated, whereas interactions between the two are taken into account by the far-field   [49,50]. In practice, a reduced basis is computed for the structure without defect and another one for the defect. A "global reduced basis" is then computed by transferring both reduced basis on the real mesh containing the defect and concatenating them. Using this reduced basis, a "hyper-reduced" simulation is performed on a reduced domain of integration (Fig. 2, orange and grey regions). As such, numerical computations are carried out on a sane structure without pores, and on isolated defects, rather than on the entire structure containing defects. Generally speaking, the method is most efficient when dealing with complex and time-consuming constitutive-laws. Speed-up as high as 10 2 and 10 3 have been obtained in fatigue in mechanics, for elastoplastic behavior, and about 10 in linear-elastic cases [3].
While the hyper-reduced method improves on standard finite element techniques, computing the reduced basis of defects can be time-consuming. This point needs to be addressed in industrial applications where the effect of defects must be quantified in near real-time. Often, the pores shape is random, but follows a certain probability distribution that needs to be estimated. The mechanical responses of shapes close to one another need not be computed twice, in general. However, as noted in Sect. 1, while different shapes may yield similar mechanical response, small difference such as the presence of corners, could induce different mechanical responses. The goal of this work is to investigate whether one may pick an appropriate reduced basis for a defect by learning the mechanical responses of a set of other defects, and how they relate to their shapes. To do this, the mechanical computations for the fields around a defect (rectangle a, Fig. 2) are replaced by statistical learning, making use of pre-computed mechanical fields used as training data (rectangle b, Fig. 2). The full scheme in Fig. 2 will not be implemented in the present work. Instead we focus on the task in rectangle (b) of the same graph, and consider linear elasticity as a proof of concept for our approach.

Mechanical and morphological distances
In the following, we make use of a Fourier-based scheme with rotated discrete Green operator [51] to carry out mechanical computations. The method uses periodic boundary conditions, relevant for quasi-isolated defects and has been found to be efficient when compared to finite element, both in terms of memory computations, accuracy and CPU time [52]. For each defect, six FFT computations with prescribed overall strain ε are carried out, corresponding to the six independent strain loadings, in our case E i = ε i , (i = xx, xy, xz, yy, yz, zz) with E j = 0 (j = i). Accordingly, the data consists in a fourth-order tensorial field, denoted localization tensor in homogenization theories, which has both minor and major symmetry. Figure 3 shows as an example two strain components obtained under uniaxial extension. The fluctuation of the strain field inside the pore depends on the choice of the Green operator and has no physical meaning, except for the mean of the strain in the pore. Accordingly, the strain field inside the pore is replaced by its mean in all mechanical computations.
In the rest of this study, use is made of the Grassman distance [53,54]  and V 2 is given by: where · F is the Frobenius distance and is a diagonal matrix with eigenvalues θ i obtained from the singular value decomposition: and I is the identity, and W 1 , W 2 ∈ R N × R N are orthogonal matrices. Distance (1) measures the dissimilarity between two defects by considering the subspaces (Grassman manifolds) generated by the set of strain responses for each loading to the two defects [55]. The distance is appropriate for mechanical responses with the same number of applied loadings. In the more general case of time-varying loadings, with different number of time steps for each defect, the Schubert distance [56] may be used instead.
The Grassmann distance involves a singular value decomposition performed on matrix V t 1 · V 2 (see Eq. (2)) of size N × N . These computations become time-consuming when a large number of objects (more than 1200 here) must be compared to one another. To improve on the computation of the Grassmann distances, we define a subdomain of size (L/2) 3 , included in the bounding box, and containing all defects. We define two alternative pseudo-Grassmann distances, computed as in (1) with the data for V 1 and V 2 restricted to either or its boundary ∂ . The computations of the pseudo-Grassmann distances in and ∂ is much more efficient as the bounding box has a volume eight times smaller compared to the entire domain. Histograms for the (pseudo-)Grassmannn distances between 400 defects are represented in Fig. 5(a). The distribution of distances for the pseudo-Grassmann distance computed using ∂ is strongly different from that of the Grassmann distance, indicating that the former can not be substituted to the latter. However, this is not so for the pseudo-Grassmann distance computed on the entire subdomain which is close to the results obtained for the Grassmann distance, see Fig. 5(b). Accordingly, in the rest of this study, the Grassman distance is evalued on the subdomain only. The shape of defects is quantified by two means, a morphological and spectral decomposition. Consider first the morphological transform based on the signed distance function: for a pore P, with boundary ∂P. This distance is obtained by propagating a distance function with quasi-Euclidean metric, and leads to spherical iso-lines far from the defect. The signed distance fields for two defects P 1 , P 2 is vectorized into two arrays denoted F 1 and F 2 ∈ R L 3 and we denote "morphological distance" the distance: where · 2 is the Euclidean distance. We also define a distance based on the spectral decomposition for the Laplace-Baltrami expansion [44,57]. This expansion can be conveniently written in terms of spherical harmonics in the case of the sphere [58]. The latter form a basis for square-integrable functions on the unit sphere and this decomposition can accordingly be used to characterize star-shaped defects. We briefly recall how this spectral decomposition is estimated on digital images (the reader is refered to [58] for a detailed discussion). The decomposition reads, in spherical coordinates (θ , φ): where P are Legendre polynomials and x k are the coordinates (k = 1, 2, 3) of points along the surface of the defect. In practice, a set of 25 × 25 pixels are picked along the surface of the object, distributed uniformly along all directions from the center, providing values for the x k (θ , φ). The center is the minimial of the signed distance function. The double sum in (5) is truncated to |m| ≤ ≤ max = 10 and a least-square optimization procedure is used to determine the coefficients c m k . The latter are used to define the distance: Conversely, Eq. (5) can be used to reconstruct a shape, for a given set of values c m . Two shapes and their associated reconstitution are shown in Fig. 6. The difference between  Fig. 1 the two are a consequence of the truncation of the spectral decomposition, and of the way interpolation points on the surface are chosen, i.e. uniformly distributed along all directions on the sphere rather than uniformly-distributed on the surface of the object. This reconstruction is imperfect and only captures some of the features of each shape.

Clustering analysis
In this section, we consider the k-medoids clustering algorithm, which provides us with a set of classes as well as a most-central point (the "medoid") in each class, that is present in the data set. The classification algorithm, which minimizes distances to the medoids, is based on the matrix of distances between points, and does not require the coordinates of each point [59]. Additionaly, since the medoid is present in the data set, its pre-computed reduced basis can be used in hyper-reduced methods for taking into account defects that belong to a known mechanical class. We split the data set into two groups, a training set of 936 defects and a testing set containing 508 defects. The training set corresponds to the data collected on two-third of the welded joint and has been obtained on two tomography images. The test data corresponds to the rest of the welded joint, and has been obtained by a third tomography. Accordingly, the data in the two sets are not randomly drawn from a collection of defects, but instead are obtained from different sources, as would be expected in industrial applications. Our results are shown in Fig. 7 for the Grassman distance as well as the two shape distances. The points cloud representation in three dimensions is obtained by multidimensional scaling [60]. Partly-overlapping clusters on this representation may actually be separated when additional dimensions are considered. In this view with reduced dimension, the data form a continuous cloud of points. The medoids (right) exhibit spherical, oblate and non-convex shapes. The amount of information contained in the multidimensional scaling is plotted as a function of the dimension in Fig. 8(a) in the case of the Grassman distance, showing a strong decrease of the amount of unknown information up to d ≈ 8 and a slower decrease after that. The effect of the number of clusters is shown in Fig. 8(b), which represents the intra-clusters distance, i.e. the sum of the distances of each shape to its medoid. This distance decreases with the number of clusters. The "typical" number of clusters corresponding to this decrease is about 5, at which point the curve displays an elbow.
Shape clustering as determined by the k-mdeoids analysis can not be used directly to assign a defect to its mechanical cluster, as shown in Fig. 9. This figure represents the confusion matrix that summarizes the number of shapes that belong to a given mechanical cluster and to a cluster based on either the morphological of spherical harmonics distance. Cluster labels are the same as in Fig. 7. The color scale indicates a concentration of shapes from a geometrical cluster into a specific mechanical cluster. Assigning a mechanical cluster to a shape based on its morphological or spherical-harmonics cluster would result in 74% and 87% erroneous labeling, respectively. Instead, we consider a classifier based on a dense neural networks (Fig. 10). The input to the network are the distances to the medoids based on the morphological distance. The network is trained to predict the label of the cluster corresponding to the Grassmann distance. It contains three hidden layers of 15 neurons each and is optimized on the log-loss function:  representing 90% of the initial set, and a validation set representing the remaining 10% of the initial set of data. The validation set provides a stop criterion. Loss and accuracy curves, computed using (7), are shown in Fig. 11. Figure 12 shows the confusion matrix representing the predicted and true labels, that summarizes the assignement by the network of a mechanical cluster for the various shapes, either in the training or test sets. The percentage of misclassified shapes by cluster is Figure 11 Loss function and accuracy vs. number of epochs for the training of the neural network, with (b, d) and without (a, c) data augmentation, and using the morphological (a, b) and spherical harmonics (b, d) distances given on the right column. To quantify these results, we define an error on the training set e tr = M tr /N tr as the ratio of correct label predictions M tr divided by the total number of predictions N tr in the training set. We consider likewise a similarly-defined error e te for the testing set. We also introduce a second error criterion e tr , equal to the mean of the proportion of misclassified shapes in each (non-empty) mechanical cluster for the training data, and likewise e te for the testing set. These various errors highlight sub-optimal performances of e tr = 13%, e tr = 10% for the training data, as well as e te = 18% and e te = 26% for the testing set. Higher errors e tr = 16%, e tr = 19%, e te = 29% and e te = 34% are observed when using spherical harmonics instead of the morphological distance (Fig. 12, rows 3  and 4). These results may be attributed to the small number of defects in some classes, as will be investigated in the next section.

Data augmentation
To improve on the results presented in Sect. 5, we focus on data augmentation. Both the morphological and spherical harmonics distances are defined as Euclidean distances of vectors in multi-dimensional spaces. As explored in [1], these types of representations can be used for data interpolation as well. Let us consider the linear interpolations (0 ≤ s ≤ 1): with respect to two shapes P 1 and P 2 , where F and c m k are defined in Eqs. (3), (4) (6) and (5). The vectors F(s) and c m k (s) provide continuous interpolations between the two shapes, as illustrated in Fig. 13. An alternative approach consists in using a singular value decompo- Figure 12 Left column: confusion matrices for the training (rows a, c) and testing sets (rows b, d) between Grassmann clustering (true label) and the label predicted by the dense neural network using the signed (rows a, b) and spherical harmonics distances (rows c, d). Right: percentage of misclassified shapes in each cluster sition on the matrix containing as columns the spherical harmonics decomposition c m k (P i ) for all defects P i . Considering as an example the first three singular values, a new shape may be represented as a point in a three-dimensional space. By paving this space with a set of points, one generates new defects that interpolate between the shapes corresponding to  Fig. 14(a). Figure 14(b) shows random shapes generated in this space.
We now generate 3128 artificial shapes with the above data augmentation techniques. The linear interpolation method in Eq. (8) is used preferentially on set of shapes that belong to mechanical clusters with few shapes. We then classify the shapes according to the k-medoids method, as described in Sect. 4. Results corresponding to the mechanical, morphological and spherical harmonics clustering are shown in Fig. 15. The points cloud are much more dense and homoegneous as compared to the same results obtained without data augmentation (Fig. 7) and suggest the latent space is better represented. Despite this, mechanical clusters can not be predicted using either the morphological or spherical harmonics clustering (Fig. 16): their respective errors read e te = 77% and e te = 67%.
Again, use is made of a classifier based on a dense neural network that is trained to predict the mechanical cluster using distances to the medoids. Figure 17 shows the confusion matrix obtained for the training and testing sets, when either the morphological or spherical harmonics distances are considered. The proportion of misclassified shapes by cluster is shown on the right. In the case of the morphological distance, the errors for the training set read e tr = 6%, e tr = 5% and e te = 6%, e te = 19% for the testing set, When spherical harmonics are considered, errors are slighly higher. They read e tr = 8%, e tr = 8% for the training data and e te = 9%, e te = 27% for the testing set (see Table 1 for a summary of the various errors. In any case, these errors are significantly lower than that obtained without Clustering provided by the k-medoids algorithm using the Grassmann, morphological and spherical harmonics distances, after data augmentation. Medoids on right data augmentation (Fig. 12), highlighting the benefits of data augmentation. Furthermore, the errors of the neural network consist most often in predicting a label which is a neighbor of the correct mechanical cluster, with similar mechanical response. This materializes into a band-diagonal structure for the confusion matrices. In any case, these errors are significantly lower than that obtained without data augmentation (Fig. 12), highlighting the benefits of data augmentation. Furthermore, the errors of the neural network consist most often in predicting a label which is a neighbor of the correct mechanical cluster, with similar mechanical response. This materializes into a band-diagonal structure for the confusion matrices.

Concluding remarks
In the present work, data-analysis and clustering methods have been proposed for classifying the mechanical properties of porous defects. Although the approach is restricted to linear elasticity, it is fairly general and can be adapted to nonlinear or time-dependant mechanical responses. While we use conventional clustering and data analysis tools, the methods rely on distances defined in the space of the defects mechanical responses (i.e. a 3D tensorial field) and on geometrical distances based on a morphological transform and a spectral decomposition. Such distances allow us to explore a wide space of defects, and perform data augmentation, without the need for explicit parametrization of shapes.
Our methodology is detailed on a set of defects observed in welded joints. It is found that reliable results on clustering require a large number of shapes in each mechanical class.
Furthermore, a simple neural network was able to link mechanical and geometrical clusters with a satisfying accuracy, within the space of defects close to that observed in welded joints. Nevertheless, the method applies to arbitrary shape, and may be extended to other types of defects. These results should be useful in particular for a refined two-scale hyper reduction method, as outlined in the introduction, where mechanical properties of defects may be selected on the fly, without solving balance equations.
Possible improvements and future works include hierarchical clustering, extension of the spherical harmonics decomposition to non-star shaped defects, and data augmenta- Figure 17 Left column: confusion matrices for the training (rows 1, 3) and testing sets (rows 2, 4) showing the true Grassmann clustering label vs. the label predicted by the neural network using the morphological (rows 1, 2) and spherical-harmonics distance (rows 3, 4). Right: percentage of misclassified shapes in each cluster after data augmentation tion with shape extrapolation, instead of interpolation. In particular, the spherical harmonics decomposition provides a natural basis for data augmentation as well as mechanical clustering. Table 1 Percentage of wrongly-assigned labels, averaged over all shapes, or over mechanical clusters, for the training data, with either the morphological or spherical harmonics distances, using data augmentation