From Classification to Segmentation with Explainable AI: A Study on Crack Detection and Growth Monitoring

Monitoring surface cracks in infrastructure is crucial for structural health monitoring. Automatic visual inspection offers an effective solution, especially in hard-to-reach areas. Machine learning approaches have proven their effectiveness but typically require large annotated datasets for supervised training. Once a crack is detected, monitoring its severity often demands precise segmentation of the damage. However, pixel-level annotation of images for segmentation is labor-intensive. To mitigate this cost, one can leverage explainable artificial intelligence (XAI) to derive segmentations from the explanations of a classifier, requiring only weak image-level supervision. This paper proposes applying this methodology to segment and monitor surface cracks. We evaluate the performance of various XAI methods and examine how this approach facilitates severity quantification and growth monitoring. Results reveal that while the resulting segmentation masks may exhibit lower quality than those produced by supervised methods, they remain meaningful and enable severity monitoring, thus reducing substantial labeling costs.


Introduction
Cracks may appear in various types of structures, including walls [1,2,3], road surfaces [4,5], bridges [6,7,8,9], tunnels [10], dams, beams [11], pipes [12,13], railway sleepers [14,15] and slabs [16], made from different materials such as masonry, concrete, brick, stone and wood. The detection and monitoring of these cracks are essential for ensuring structural safety and play a significant role in structural health monitoring. Traditional methods to detect surface cracks involve manual visual inspections on a regular basis. However, these inspections have several drawbacks, including limited availability of human resources, service interruption (e.g., closures of railway sections or bridges), inspector subjectivity, high costs, and challenges in accessing hazardous or contaminated areas. Automatic visual inspection provides a solution to overcome these issues by enabling efficient, cost-effective and safe structural health condition monitoring of surface cracks [17,18]. This can be achieved through the use of Unmanned Aerial Vehicles (UAVs), robots or other vehicles equipped with imaging or video capturing capabilities [19,20,21]. These technologies, in conjunction with data-driven approaches based on computer vision and machine learning, enable the automatic processing of large volumes of data, thereby making the assessment more objective. Image-based approaches can, naturally, only detect surface cracks. Cracks present deep inside the material, whose detection requires other types of sensing such as ultrasound [22] or X-ray [23], are not considered in this work.
The automated detection and segmentation of cracks in images pose challenges due to the diverse appearance of cracks, the complexity and diversity of material textures, and irregular illumination conditions. Various approaches have been proposed for crack detection, including classification approaches based on supervised machine learning [24,8] and deep learning, notably based on convolutional neural networks (CNNs) [4,1,2,9,25].
A significant concern in structural health monitoring is the development and propagation of cracks over time, which can lead to increased stress and eventual failure of the structure. Once damage has been detected, it becomes crucial to monitor the evolution of its severity to trigger timely maintenance actions and prevent catastrophic consequences. The severity of a surface crack can be quantified by measuring its width, length or area [26,16,27,28]. These measurements can be derived from the binary segmentation mask of the crack through crack profiling techniques such as skeletonization, thinning, tracking, labeling, and width measurement [29]. However, achieving a precise segmentation of the damage is necessary for accurate measurements. Several approaches have been developed for segmentation, primarily based on image processing techniques [30]. These methods typically involve two steps: (1) image enhancement, including noise reduction, filtering, shading correction, etc., and (2) image binarization (thresholding) to obtain the crack segmentation [31,11,10,32,25,33,26]. Other approaches use Fourier or wavelet transforms, edge detection [6] or the percolation method [34]. Crack depth is another severity metric, estimated in [35] using optical reflection properties. Nevertheless, image processing methods usually involve complex pipelines with multiple steps, providing handcrafted solutions for specific use cases. Moreover, these methods often struggle to handle complex cases like low contrast, distracting elements, diverse material textures and complex backgrounds.
Data-driven approaches based on supervised deep learning have demonstrated excellent performance in pixel-level semantic segmentation of cracks. Numerous approaches have leveraged CNNs [29,36,5,37], including the popular architectures U-Net [38,19,39,40,27] and DeepLabv3+ [28,41]. In [42], the authors propose a fusion of saliency cues with the image within a U-Net. However, these approaches require extensive pixel-level annotations of a large number of images, which is a labor-intensive and tedious process. As an alternative approach for severity estimation, [16] proposed training a classifier to directly classify severity levels without the need for a segmentation step. However, this method requires images to be labeled beforehand based on their severity level and provides more limited information.
An important barrier in the development of deep learning-based automated crack segmentation systems is the high cost associated with pixel-level annotation of large sets of images. To circumvent this issue, unsupervised (requiring no labels) and weakly-supervised (requiring only image-level labels) segmentation methods have received growing attention [43]. As an example of an unsupervised approach, Chow et al. [44] tackled crack segmentation through anomaly detection using an autoencoder. However, the lack of any supervisory signal hinders the performance of such approaches, as our experiments will also demonstrate.
In our work, we focus on weakly-supervised approaches that leverage explainable artificial intelligence (XAI). Explainable AI aims at enhancing the transparency and trustworthiness of AI systems, which is crucial for safety-critical applications [45]. Numerous methods have been developed to explain the decisions made by machine learning-based systems, particularly for deep neural network models and image data. These methods differ in the types of explanation and in the techniques utilized to produce these explanations. Feature attribution methods explain a model indirectly by computing, for each variable or feature handled by the model (such as a pixel in an image), an importance score with respect to the predicted outcome, resulting in so-called attribution maps. The main idea of our approach is to train a classifier to classify the damages, extract explanations of the classifier's decisions in the form of per-pixel attribution maps and derive segmentation masks from these maps. In other words, the principle is to approximate segmentation masks by explanations. The advantage of this approach is that while annotating images for segmentation is tedious, classification labels can be obtained at a fraction of the cost.
Numerous feature attribution methods exist in the literature, differing in their approach to compute relevance scores, the desired properties or constraints to be satisfied, and consequently, the quality of the resulting explanations. Seibold et al. [46] proposed to use an XAI technique called Layer-wise Relevance Propagation (LRP) [47] to segment damages in magnetic tiles and sewer pipe images. However, this study focused on only one such method, namely the LRP technique, limiting the range of applicable network architectures (for instance, LRP cannot be used in models with skip-connections). Other explanation methods were not evaluated. Furthermore, [46] solely evaluated the resulting segmentation quality in terms of F1 score, precision and recall, but the damage severity was not further assessed, which is a major requirement in many structural health monitoring applications.
In this paper, we aim to build upon this initial study, offering several key contributions. Our main contributions include proposing a comprehensive methodology and evaluating the abilities of various explainability methods in generating high-quality segmentation masks for crack detection in masonry building walls. Additionally, we assess whether this framework enables quantification and monitoring of damage severity, such as crack width measurement, to facilitate timely decision-making. Following the XAI methods taxonomy from the literature review by Arrieta et al. [45], our focus is primarily on post-hoc feature attribution methods (also known as feature relevance methods) and architecture modification methods suitable for convolutional neural networks and image data. Post-hoc explainability methods aim at explaining the decisions of a given black-box model without inherent transparency. In this work, we evaluate and compare six post-hoc techniques: Input×Gradient [48], Layer-wise Relevance Propagation (LRP) [47], Integrated Gradients [49], DeepLift [50], DeepLiftShap and GradientShap [51]. These techniques are widely used by practitioners; they vary in terms of the relevance computation method and exhibit different, yet reasonable, computational runtimes. Additionally, our study includes one recent inherently explainable architecture modification method (B-cos networks [52]), and one recent post-hoc method that utilizes an auxiliary network to generate attribution maps (Neural Network Explainer [53]). For the latter, we also propose an extension specifically tailored for classification tasks in which one class represents the absence of a foreground object. This adaptation differs from its original formulation and is particularly relevant in damage classification scenarios. Importantly, all methods compared in our study generate attribution maps at the input resolution. Indeed, a high resolution is necessary to capture the thin structure of cracks. For this reason, we did not include methods producing class activation maps (CAM) at the feature level such as CAM [54] and Grad-CAM [55], as well as weakly-supervised approaches based on global pooling layers for heatmap generation [56,57,58].
The contributions of this work are summarized as follows:
1. We propose a comprehensive methodology and evaluate the performance of various explainable artificial intelligence (XAI) methods in generating high-quality segmentation masks for cracks in masonry building wall surfaces, without requiring pixel-level annotation of images. These masks are derived from the explanations of a classifier.
2. We extend the Neural Network Explainer method [53] to accommodate classification tasks in which one class represents the absence of a foreground object (i.e., damage-free image samples in the scenario of damage classification).
3. We propose to use damage-free images as baselines in the Integrated Gradients and DeepLift-based methods, improving the quality of their explanations.
4. We investigate the applicability of these methods in quantifying damage severity and monitoring its progression, thereby facilitating timely decision-making. To this aim, we evaluate crack severity quantification and growth monitoring abilities using severity metrics such as the number of cracks, maximum crack width and crack area.
The remainder of this paper is structured as follows. In Section 2, we introduce the proposed methodology to produce crack segmentation masks based on classifier explanations, as well as the explainable AI techniques used throughout our work. In addition, we derive an adaptation of the Neural Network Explainer method suitable for damage classification. Section 3 presents the data, crack classification models, and experimental settings. Section 4 reports the results of our experiments. Finally, Section 5 concludes the paper and discusses the outcomes and perspectives of this research.

Proposed methodology
In this section, we introduce the general methodology for generating crack segmentation masks without pixel-level segmentation labels. This involves utilizing classifier explanations of the explainable AI methods that are evaluated in this study and performing post-processing on the resulting attribution maps. Additionally, we propose an adaptation of the Neural Network Explainer method suitable for damage classification.

Overview of the proposed methodology
We propose the following methodology to generate crack segmentation masks based on classifier explanations, without the need for pixel-level annotation of ground-truth segmentation masks:
1. Collect and label a dataset of positive (containing a crack) and negative (damage-free, without any crack) image patches.
2. Train a binary classifier using the labeled training images.
3. Perform inference on unseen test images. For each positive prediction, extract attribution maps from the classifier corresponding to the positive class, using an XAI method able to extract attribution maps at input resolution.
4. Post-process the resulting attribution maps:
   (a) Binarize the continuous attributions to obtain binary masks.
   (b) Apply morphological operations to close gaps in the masks and remove noisy attributions.
5. Compute crack severity levels based on the resulting masks.
The main principle is to approximate segmentation masks by classifier explanations. In cases where a well-performing classifier can be obtained for distinguishing between damage-free and damaged image samples, the explanations provided by an XAI approach, especially the pixel-level attribution maps, become valuable for generating accurate segmentation masks of damages. These maps aim at highlighting the discriminative regions that contribute to the target class. In the context of crack detection, a pixel should contribute to the crack class if and only if it is part of a cracked region. Therefore, there is an expected correspondence between explanations and segmentations, as the attribution maps should accurately highlight the regions where cracks are present.
The proposed methodology is illustrated in Figure 1b and compared to the standard supervised semantic segmentation workflow (Figure 1a). The data labeling cost of step 1 is orders of magnitude smaller than the pixel-level labeling required for training a supervised segmentation algorithm, as it only requires image-level labels. In step 2, we apply a convolutional neural network as the classifier. It is important to note that while our study focuses on the binary case, the methodology can be easily extended to handle multi-class and multi-label classification, accommodating different types of damages that may co-occur in the same image. In the multi-class case, attribution maps can be extracted for each damage class. For step 3, we assume throughout this work that the crack segmentations are generated for unseen future images, which are represented as an independent hold-out test set. However, it is also possible to generate the segmentation masks for the training images. In practice, we observed no significant difference in the results. Therefore, we report the results solely based on the test set. After the post-processing (step 4), the resulting binary masks provide approximate segmentations of the cracks, allowing the quantification of crack severity in step 5.
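As an illustration of step 5, the following minimal sketch derives two of the severity metrics considered in this work (number of cracks and crack area) from a binary mask. The function name `crack_severity`, the use of 8-connectivity to count cracks, and the reuse of the 0.43 mm/pixel ratio reported for our dataset are assumptions made for this example and do not describe the exact implementation used in our experiments.

```python
import numpy as np
from scipy import ndimage

MM_PER_PIXEL = 0.43  # ratio reported for the dataset; adjust for other data


def crack_severity(mask, mm_per_pixel=MM_PER_PIXEL):
    """Return (number of connected crack regions, crack area in mm^2).

    Hypothetical helper: `mask` is a 2D binary array where True marks
    crack pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    # Assumption: 8-connectivity, so diagonally touching pixels are
    # counted as part of the same crack.
    structure = np.ones((3, 3), dtype=bool)
    _, n_cracks = ndimage.label(mask, structure=structure)
    # Each pixel covers mm_per_pixel x mm_per_pixel square millimetres.
    area_mm2 = float(mask.sum()) * mm_per_pixel ** 2
    return n_cracks, area_mm2


# Toy mask with two separate crack segments (4 and 2 pixels).
toy_mask = np.zeros((10, 10), dtype=bool)
toy_mask[2, 1:5] = True
toy_mask[6:8, 6] = True
n_cracks, area_mm2 = crack_severity(toy_mask)
```

Tracking these quantities for the same wall region over successive acquisitions gives a simple growth indicator; width measurements would additionally require a profiling step such as skeletonization.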

Explainable AI methods
In this section, we present the explainable AI methods evaluated in our study. We have included several popular XAI methods, each employing distinct approaches to compute relevance scores. All methods are able to generate attribution maps at the input resolution, which is a requirement to capture the thin structures of cracks. Moreover, these methods have easy-to-use available implementations, and a reasonable computational runtime. We intentionally avoided methods that output lower-resolution maps at the feature level, e.g., CAM [54], Grad-CAM [55] and heatmap network-based approaches with global pooling layers [56,57,58]. We also omitted very computationally expensive methods in this study, such as perturbation-based approaches.
Input×Gradient [48] is one of the earliest gradient-based explanation methods. It operates by computing the gradient for each input dimension at the current input value and then multiplying it with the input itself. This process reveals the change in the output resulting from an infinitesimal change of the input, thus indicating the local importance of each input dimension. However, its effectiveness is limited to the immediate local information provided by the gradient.
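The computation can be sketched on a toy differentiable model. The logistic model below, with its analytic gradient, is an assumption chosen only to keep the example self-contained; in our experiments the gradient is obtained by backpropagation through the CNN classifier.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w, b):
    """Toy stand-in for a classifier: f(x) = sigmoid(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)


def input_x_gradient(x, w, b):
    """Input x Gradient attributions for the toy model above."""
    p = model(x, w, b)
    grad = p * (1.0 - p) * w  # analytic gradient df/dx
    return x * grad           # element-wise product with the input


x = np.array([1.0, 2.0, 0.0])
w = np.array([0.5, -1.0, 3.0])
attributions = input_x_gradient(x, w, b=0.0)
```

Note that the third feature receives a zero attribution because its input value is zero, even though its gradient is the largest: the method only reflects local information at the current input.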
Layer-wise Relevance Propagation (LRP) [47,59] is a popular technique for decomposing a decision into pixel-wise relevance scores. It utilizes a backward propagation process and relies on access to the model internals such as weights and activation functions. LRP operates backward and propagates the relevance scores from the upper to the lower layers of the neural network, using specifically designed propagation rules such as LRP-0, LRP-ϵ, LRP-γ, LRP-αβ and the z^B-rule. For the best explanation quality, the rules are adjusted for different types of layers and activations.
Integrated Gradients [49] calculates and accumulates the gradients along a straight-line path that interpolates between a reference input, called baseline, and the input. The method satisfies two desirable properties known as completeness (i.e., the sum of the attributions over all features equals the difference between the model's output at the input and at the baseline) and implementation invariance (i.e., two networks with different implementations but the same outputs for all inputs produce identical attributions).
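The path integral and the completeness property can be illustrated with a short numerical sketch. The toy logistic model, the midpoint Riemann sum and the number of steps are assumptions made for this example; they are not the settings of the library implementation used in our experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w, b):
    """Toy stand-in for a classifier: f(x) = sigmoid(w . x + b)."""
    return sigmoid(np.dot(x, w) + b)


def integrated_gradients(x, baseline, w, b, n_steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    # Interpolation coefficients in (0, 1), one per step.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    # Points on the straight line from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    p = sigmoid(path @ w + b)
    grads = (p * (1.0 - p))[:, None] * w  # analytic gradient at each point
    # Average gradient along the path, scaled by (input - baseline).
    return (x - baseline) * grads.mean(axis=0)


x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
w = np.array([0.5, -1.0, 2.0])
b = 0.1
attr = integrated_gradients(x, baseline, w, b)
# Completeness: the attributions sum to f(x) - f(baseline), up to the
# discretization error of the Riemann sum.
gap = attr.sum() - (model(x, w, b) - model(baseline, w, b))
```

In the crack setting, the baseline would be an image rather than a zero vector, for instance a damage-free patch as proposed in this work.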
The DeepLift method [50] computes relevance scores by comparing the network activations with a reference activation obtained on a baseline input. The contributions of each feature are computed as the differences between each neuron's activation and its reference activation, and propagated in a backward pass using a recursive algorithm similar to backpropagation. Being based on activations rather than gradients, it avoids shortcomings of gradient-based methods such as zero or discontinuous gradients. DeepLift's attribution quality is typically comparable to Integrated Gradients, but it runs significantly faster.
DeepLiftShap (also called DeepSHAP) is an application of DeepLift that uses SHAP (Shapley additive explanation) values as a measure of contribution [51]. Its attributions are estimated by sampling random images from a baseline distribution and averaging the resulting DeepLift attributions. Additionally, GradientShap approximates SHAP values by computing an expected gradient instead of an integral, as performed in Integrated Gradients, and can be seen as an approximation of the latter. Under model linearity and feature independence assumptions, SHAP values are approximated by the gradient expectation. The feature attributions are estimated by sampling random images from a baseline distribution and averaging their gradients multiplied by the difference between the input and the baseline.
The Neural Network Explainer (NN-Explainer) [53] is a recent method that trains an auxiliary network, referred to as the explainer, to generate attribution maps for a trained classifier, referred to as the explanandum (i.e., the model to be explained). These maps take the form of masks, denoted as m ∈ [0, 1]^{W×H}, predicted for each class, where W and H represent the input image's width and height, respectively. Concretely, the explainer's architecture is similar to that of a segmentation network. The explainer is trained to minimize the cross-entropy of the explanandum within the image region selected by the mask and to maximize entropy (i.e., uncertainty) outside of the mask. The loss function also incorporates multiple regularization terms that penalize the area of the mask while encouraging smoothness. However, this formulation is valid for images containing one or multiple foreground objects and is not directly applicable in the context of damage classification, where one class represents the absence of foreground objects. To adapt the NN-Explainer to the context of crack detection, we propose a modification of the approach, which is introduced in Section 2.4.
The final method considered in this study, B-cos networks [52], is an explainable-by-design approach that aims at making the learned model inherently transparent. The authors propose to replace all linear transformations in the network, including convolutions, with a so-called B-cos transform. This transform promotes alignment between weights and inputs during training, and the authors demonstrate that the resulting network can be faithfully summarized by a single input-dependent linear transformation. Attribution maps for a given class and input can then be obtained simply by visualizing the corresponding dimensions in the (input-dependent) matrix associated with this linear transform.

Post-processing steps
The attribution maps undergo post-processing in two stages. In the first stage, we binarize the continuous attribution maps through thresholding. In the second stage, three morphological operations are applied as follows: (1) Closing, which involves dilation followed by erosion, to merge dense regions in the attribution maps; (2) Area opening with a minimum area threshold to remove remaining noise; (3) Second closing to close larger gaps in the resulting mask. Choosing an appropriate radius for the morphological closing is crucial, as a too large value will merge noisy attributions, while a too small value will result in holes in the mask. Generally, closing increases the mask area, thereby increasing recall with respect to the ground-truth segmentation. However, it may also introduce false positive pixels, thereby reducing the precision. The post-processing steps are visualized step-by-step in Figure 2 for two different attribution methods (Integrated Gradients in the top row and LRP in the bottom row). Depending on the attribution method, different steps of the post-processing may or may not improve the segmentation quality, as measured by the F1 score relative to the ground-truth segmentation.
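The two stages can be sketched as follows with `scipy.ndimage`. The threshold, the radii, the minimum area and the square structuring elements are placeholder assumptions for this example, as are the helper names `remove_small_regions` and `postprocess`; they are not the values or the code used in our benchmark.

```python
import numpy as np
from scipy import ndimage


def remove_small_regions(mask, min_area):
    """Area opening: drop connected regions smaller than min_area pixels."""
    labeled, _ = ndimage.label(mask)
    sizes = np.bincount(labeled.ravel())
    keep = sizes >= min_area
    keep[0] = False  # label 0 is the background
    return keep[labeled]


def postprocess(attr, threshold, radius1=1, min_area=3, radius2=2):
    """Binarize an attribution map, then apply closing, area opening, closing."""
    def square(r):
        # Assumption: square structuring element of side 2r + 1.
        return np.ones((2 * r + 1, 2 * r + 1), dtype=bool)

    mask = np.asarray(attr) > threshold                               # stage 1
    mask = ndimage.binary_closing(mask, structure=square(radius1))   # (1)
    mask = remove_small_regions(mask, min_area)                       # (2)
    mask = ndimage.binary_closing(mask, structure=square(radius2))   # (3)
    return mask


# Toy attribution map: a horizontal line with a one-pixel gap, plus one
# isolated noisy pixel.
attr = np.zeros((20, 20))
attr[10, 3:9] = 1.0
attr[10, 10:17] = 1.0
attr[3, 3] = 0.9
mask = postprocess(attr, threshold=0.5)
```

On this toy input, the first closing bridges the gap in the line and the area opening discards the isolated pixel, which mirrors the trade-off between recall and precision discussed above.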
Please note that the quality of the segmentation could potentially be further optimized by tuning the post-processing specifically for each attribution method and final application. However, for the sake of simplicity and to ensure a fair comparison, we used identical post-processing steps across all methods in our benchmark. Furthermore, it is important to mention that the evaluated explainability techniques and the post-processing steps do not incorporate specific knowledge about the structural characteristics of cracks, such as pixel connectivity or other effective regularization techniques used in the literature [20,37,3]. While these techniques are not the focus of our study, they can be used in combination with the proposed methods to improve the resulting crack segmentation. This makes the proposed methodology general and applicable in various contexts, while enabling it to be further extended and improved for each specific application.

Adaptation of the NN-Explainer for damage classification
In this section, we derive our adaptation of the NN-Explainer [53] in the context of (K + 1)-class multi-label classification, where K is the number of different damage types, indexed from 1 to K. In this setting, multiple damages can occur in the same image, while the negative class 0 represents the damage-free class. The illustration of our method can be found in Figure 3. The original NN-Explainer method proposes the following loss function to train the explainer network:

$$\mathcal{L}(x, Y, S) = L_C(x, Y, m) + \lambda_E L_E(x, \tilde{m}) + \lambda_A L_A(m, n, S) + \lambda_{TV} L_{TV}(m, n)$$

where x ∈ ℝ^{C×W×H} represents the input image (C, W, H denoting the image channels, width and height, respectively), Y is the set of positive target classes present in the image (defined at the image level), m ∈ [0, 1]^{W×H} is the aggregated mask generated for the target classes, m̃ = 1 − m is the inverse target mask, S is a set of per-class masks and n ∈ [0, 1]^{W×H} is the aggregated mask produced for non-target positive classes. The hyperparameters λ_E, λ_A and λ_TV are used for balancing the loss terms. Only positive samples containing at least one damage are used to train the explainer. In the remaining sections, we also adopt the notations from [53], where E represents the explainer and F denotes the explanandum. We now discuss the different terms of the loss function and present our proposed modification.
Classification loss: L_C(x, Y, m). This loss function encourages the target mask m to highlight the regions in the input that are correctly classified by the trained model. It is computed as the sum of binary cross-entropies for each positive class present in the image, using the probabilities output by F when applied to the masked input: p = F(x ⊙ m), where ⊙ represents element-wise multiplication. The expression for the classification loss is as follows:

$$L_C(x, Y, m) = -\sum_{c \in Y} \log p_c, \qquad p = F(x \odot m)$$

It is important to note that in the case of single-label classification, such as in our crack detection study, this loss is equivalent to the traditional cross-entropy.
Negative entropy loss: L_E(x, m̃). We propose to modify this loss term in order to adapt it for damage detection. The original formulation of NN-Explainer [53] was designed for multi-label classification tasks where every input image contains one or several objects. In such cases, where there is no specific class for the absence of objects, NN-Explainer incorporates a loss term that maximizes the classification entropy (i.e., uncertainty) when the objects of interest are masked out, representing only background to the classifier. This is achieved by computing the negative entropy of the model probabilities on the inverse-masked input p̃ = F(x ⊙ m̃) across all positive classes. The formulation for the negative entropy loss is as follows:

$$L_E(x, \tilde{m}) = \sum_{c=1}^{K} \tilde{p}_c \log \tilde{p}_c, \qquad \tilde{p} = F(x \odot \tilde{m})$$

However, in our case of damage detection, the classifier distinguishes between the presence and absence of objects (damages), with the absence (background) being represented by the negative (damage-free) class. In other words, the damage-free class is also a target class per se. Therefore, we replace the entropy term with the cross-entropy against the negative class, as an input containing only damage-free regions should be classified as the negative class. Thus, we propose the negative classification loss L_NC as a replacement for the negative entropy loss:

$$L_{NC}(x, \tilde{m}) = -\log \tilde{p}_0, \qquad \tilde{p} = F(x \odot \tilde{m})$$

Area loss: L_A(m, n, S) and Smoothness loss: L_TV(m, n). The area loss plays a crucial role in penalizing the size of the mask, ensuring that it remains as small as possible and preventing the trivial solution of a target mask with 1 values everywhere. It also sets constraints on the minimum and maximum allowed areas beyond which mask areas are penalized. The smoothness loss, based on the total variation, promotes smooth and artifact-free masks. No modifications have been made to these two losses from the original paper [53]. Finally, our modified loss is expressed as follows:

$$\mathcal{L}'(x, Y, S) = L_C(x, Y, m) + \lambda_{NC} L_{NC}(x, \tilde{m}) + \lambda_A L_A(m, n, S) + \lambda_{TV} L_{TV}(m, n)$$
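The motivation for replacing the entropy term can be checked numerically on the binary crack/no-crack case. The sketch below assumes a two-class probability vector with the damage-free class at index 0 and computes the entropy over the full distribution; the function names `negative_entropy` and `negative_class_loss` are hypothetical and the code is not the training implementation.

```python
import numpy as np


def negative_entropy(p):
    """Negative entropy of a probability vector (lower = more uncertain)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))


def negative_class_loss(p, negative_index=0):
    """Cross-entropy of the prediction against the damage-free class."""
    p = np.asarray(p, dtype=float)
    return float(-np.log(max(p[negative_index], 1e-12)))


# Predictions of the explanandum on an inverse-masked image, i.e. an image
# from which the crack has been removed.
p_confident = np.array([0.99, 0.01])  # confidently damage-free
p_uncertain = np.array([0.50, 0.50])  # maximally uncertain

losses = {
    "entropy_term": (negative_entropy(p_confident), negative_entropy(p_uncertain)),
    "negative_class_term": (negative_class_loss(p_confident), negative_class_loss(p_uncertain)),
}
```

Under these assumptions, the entropy term is lowest for the uncertain prediction, so it would penalize a classifier that correctly recognizes the masked image as damage-free, whereas the negative classification term is lowest for the confident damage-free prediction.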

Experiments
This section first presents the experimental data used in our experiments and the networks adopted for the crack classification task. We then provide details on the compared explainable AI methods and the experimental settings. The following paragraphs report the results of our experiments and studies on the segmentation quality, crack severity quantification, and growth monitoring.

Data
We conducted experiments on the Experimental DIC (Digital Image Correlation) cracks dataset [60,3], which consists of 530 image patches of 256×256 pixels from stone masonry walls that were damaged in a shear-compression loading experiment conducted at the EESD laboratory at EPFL. The mm/pixel ratio is 0.43. All cracked image patches have annotated ground-truth segmentation masks. To perform the classification, we augmented this dataset with 874 additional negative patches taken from the same walls. The training and validation sets consist of 767 patches (301 positive/466 negative) and 328 patches (129 positive/199 negative), respectively. These patches were extracted from 17 high-resolution images and randomly split. The test set, on which all results are reported, contains 309 patches (100 positive/209 negative) extracted from three different high-resolution images. Examples of images can be seen in Figure 4. The complete dataset will be made available online, along with the code.

Crack classifier
The second step in our framework (see Figure 1b in Section 2) requires training a well-performing classifier to distinguish between positive and negative image patches.We use a VGG11 [61] architecture with 128 neurons in the fully-connected layers, which we refer to as VGG11-128.We chose the VGG architecture because it is a widely-used standard CNN architecture in the related literature [62,3], and it allows us to implement LRP rules [46] (no skip-connections as in residual networks) as well as B-cos networks [52].
To adapt the network to our small dataset, we reduced the number of neurons in the fully-connected layers from 4096 to 128 compared to the original VGG11. We trained the VGG11-128 model from scratch, using the cross-entropy loss and the Adam optimizer with a learning rate of 10⁻⁴, β₁ = 0.9, β₂ = 0.999 and a weight decay of 10⁻⁸. The only data augmentations applied are random horizontal and vertical flipping with probability 0.5. We employed early stopping based on the validation fold to select the best-performing model.
On the test set, the classifier achieved a balanced accuracy of 89%, with a true positive rate (TPR) of 79% and a true negative rate (TNR) of 99%, as reported in Table 1. While the performance is sufficient to demonstrate the approach, it is worth noting that a more advanced classifier could potentially be trained by incorporating additional data augmentations.
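For reference, the balanced accuracy is the mean of the TPR and the TNR. The short sketch below reproduces the arithmetic; the confusion-matrix counts are hypothetical values chosen to be consistent with the reported rates on the 100 positive and 209 negative test patches, not the actual counts of our experiment.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Return (balanced accuracy, TPR, TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # true positive rate (recall on crack patches)
    tnr = tn / (tn + fp)  # true negative rate (recall on damage-free patches)
    return (tpr + tnr) / 2, tpr, tnr


# Hypothetical counts: 79 of 100 positives and 207 of 209 negatives correct.
ba, tpr, tnr = balanced_accuracy(tp=79, fn=21, tn=207, fp=2)
```

Unlike the plain accuracy, this metric is not dominated by the larger negative class, which matters here since negatives outnumber positives roughly two to one in the test set.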
For the B-cos network variant (VGG11-128 B-cos), we implemented the version with MaxOut units [63] and B = 2, as recommended in [52]. The training parameters are identical, except for the learning rate, which is reduced to 10⁻⁵. It is worth noting that the performance of the B-cos classifier model is significantly lower than that of the standard classifier model (VGG11-128). While the authors in [52] observed only a small decrease in performance on the CIFAR-10 benchmark, the performance gap is more pronounced in our task.

Compared methods and experimental settings
In this study, we benchmark the ability of a total of eight different explainable AI methods to generate crack segmentation masks based on the pre-trained crack classifiers introduced in the previous section.
First, we evaluate six widely used post-hoc attribution methods for images, as introduced in Section 2: Input×Gradient, Integrated Gradients (IntGrad), DeepLift, DeepLiftShap, GradientShap, and Layer-wise Relevance Propagation (LRP). The model to be explained is the trained VGG11-128 classifier. We used the PyTorch implementations of the captum library [64] for the first five methods, using the default parameters unless specified. For LRP, we adopt the segmentation rules proposed in [46]. The z^B-rule is used for the first convolutional layer, the LRP-αβ rule is used for the two lower convolutional layers, the LRP-γ rule is used for all following convolutional layers, and the LRP-ϵ rule for all fully connected layers. The library zennit [65] was used to implement these rules in our network.
The NN-Explainer method has been adapted to the setting with a negative class for damage-free samples, as explained in Section 2.4. The explanandum is frozen and consists of the trained VGG11-128 classifier. For the explainer, a natural choice of architecture is the U-Net11 [62], as it uses an encoder similar to the one in the classifier for feature extraction. The weighting hyperparameters in the loss function are set to λ_NC = λ_A = λ_TV = 0.1. The area loss also has two additional parameters to constrain the area range. We set the minimum and maximum values to 0.001 and 0.15, respectively, which roughly corresponds to the distribution of crack areas in the data, instead of the values 0.05 and 0.3 used in [53].
For the selection of class labels for the explainer training, we chose the ground-truth image-level training labels.As explained in [53], it is more suited to handle attributions for false negative predictions.In the case of significant false positives, using the explanandum's predictions as labels might be better suited, but that is not the case here.We train the explainer model from scratch, using the Adam optimizer with a learning rate of 10 −5 , β 1 = 0.9, β 2 = 0.999 and no weight decay.
For the B-cos network, we extract the visualizations by following the procedure described in the experimental section of the paper [52].
We also include unsupervised and supervised methods, not based on XAI, for comparison.
The "Raw method" simply uses the raw pixel intensities, with the image only converted to grayscale before post-processing. We also evaluated different image enhancement techniques, such as Gaussian blurring, shading correction [31] and Min-Max Gray Level Discrimination [11], before the binarization. However, these methods did not improve the results, as they primarily address uneven illumination of the image and are unable to distinguish between the material patterns and the cracks in our dataset. As a second unsupervised comparison method, we trained a convolutional autoencoder (CAE) to reconstruct only the damage-free training images, as done in [44]. Then, the pixel-wise reconstruction error is used to generate attribution maps for the testing images. The CAE has a VGG11 encoder followed by a fully-connected layer with 128 neurons and a 100-dimensional bottleneck, and a symmetric decoder. The mean squared error loss is used to train the CAE.
Finally, as a supervised oracle method, we trained a U-Net11 [62] on the ground-truth pixel-level segmentation labels of the training set. This serves as an upper bound on the performance that could be achieved by a supervised segmentation model with access to the fully labeled dataset at pixel-level. The U-Net was trained from scratch using the Dice loss for 100 epochs, with the Adam optimizer (learning rate $10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$). It is important to note that a fair comparison cannot be made between our methodology and the U-Net, as the U-Net relies on direct supervision from the pixel-level labels, which necessitates extensive annotation prior to training.
For the post-processing, we employ the simple or GMM thresholding strategies described in [46] for the binarization of attribution maps. In the second stage, we used an elliptical kernel for the morphological closing operations. The radius is set to r = 5 px in the first closing. The area opening operation uses a minimum area of 50 px$^2$. Finally, the second closing uses a radius of r = 25 px to close larger gaps in the masks. The parameters were fixed empirically and were kept identical across all methods in our benchmark, without specific tuning, to enable a fair comparison.

Importance of the baseline image
Some relevance methods require a baseline image (such as IntGrad and DeepLift) or a baseline image distribution (such as DeepLiftShap and GradientShap) as a reference point for computing changes or gradients compared to the input. The commonly used standard baselines for these methods are a zero matrix (i.e., a black image) or a random normal baseline distribution. However, we observed that with these standard baselines the methods performed poorly, as the attributions tend to be spread over large areas surrounding the crack. To address this issue, we propose an alternative approach of sampling a set of images from the damage-free class (without cracks) as baselines. For methods like IntGrad and DeepLift, we use the mean of the sampled damage-free images as the baseline. For DeepLiftShap and GradientShap, we use these damage-free images as the baseline distribution, with 10 samples in order to keep the runtimes reasonable. This modification significantly improves the quality of attributions, as the focus of the attribution shifts more towards the crack itself rather than the healthy regions of the image. An example illustrating this improvement is visualized in Figure 5. Quantitatively, when using the standard baseline, IntGrad obtained an F1-score of only 11.13% (after simple thresholding), whereas it increased to 18.91% when using the mean damage-free image baseline (see next paragraph for complete results). All results presented in this paper for IntGrad, DeepLift, DeepLiftShap, and GradientShap employ our proposed baseline method.
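To make the role of the baseline concrete, the following minimal sketch applies a Riemann-sum approximation of Integrated Gradients to a toy model. Everything in it is an assumption made for illustration: the four-pixel "images", the sigmoid-of-linear-score model and its weights, and the helper names are hypothetical and unrelated to the VGG11-128 classifier or to the captum implementation used in our experiments. It only illustrates the mechanism: with a black baseline, attributions scale with pixel brightness and land mostly on the background, whereas with the mean damage-free baseline only the pixel that deviates from the healthy appearance receives relevance.

```python
import math


def model(x, w, b):
    """Toy 'crack score': sigmoid of a linear function of the pixels."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    y = model(x, w, b)
    return [y * (1.0 - y) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, model_grad(point, w, b))]
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


def mean_baseline(samples):
    """Pixel-wise mean of a set of damage-free samples."""
    return [sum(values) / len(samples) for values in zip(*samples)]


# Hypothetical toy data: bright pixels are healthy material, pixel 1 is a dark crack.
w, b = [-2.0, -2.0, -2.0, -2.0], 5.0
damage_free = [[0.7, 0.8, 0.6, 0.7],
               [0.8, 0.7, 0.7, 0.6],
               [0.6, 0.6, 0.8, 0.8]]
x = [0.7, 0.1, 0.7, 0.7]

attr_zero = integrated_gradients(x, [0.0] * 4, w, b)                   # black baseline
attr_mean = integrated_gradients(x, mean_baseline(damage_free), w, b)  # proposed baseline

print("zero baseline:", [round(a, 3) for a in attr_zero])
print("mean baseline:", [round(a, 3) for a in attr_mean])
```

In both cases the attributions sum approximately to the difference between the model output at the input and at the baseline, so the choice of baseline determines what the explanation is measured against.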

Segmentation quality evaluation
This section presents the results of our benchmark in terms of the segmentation quality of the generated masks. While localization performance is commonly used to evaluate attribution methods in a quantitative way, such as the grid pointing game [66,67,52], segmentation quality provides a more fine-grained measure of localization performance. However, evaluating segmentation quality requires access to ground-truth segmentation masks [46,53]. Fortunately, in our case, we have access to the true segmentation masks of the DIC images, allowing us to evaluate segmentation quality metrics such as F1 score (also known as Dice score), Precision, Recall and Intersection-over-Union (IoU or Jaccard index) on the test set. These metrics are calculated based on the per-pixel true positives (TP), false positives (FP) and false negatives (FN) in relation to the ground-truth segmentation mask:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1} = \frac{2\,TP}{2\,TP + FP + FN}, \quad \text{IoU} = \frac{TP}{TP + FP + FN}.$$

For each evaluated method, we provide the results using various combinations of thresholding and morphological post-processing, as outlined in the methodology proposed in Figure 1b in Section 2. In Table 2, we present the quantitative findings. Methods are grouped by type of supervision: XAI-based (weakly-supervised, following the proposed methodology), unsupervised, and fully supervised.
The two best-performing XAI methods are DeepLiftShap and LRP, achieving F1/IoU scores of 38.1%/23.60% and 37.43%/23.03%, respectively, when used in conjunction with simple thresholding and morphological post-processing. The next two best methods are DeepLift and the B-cos network. Although these scores may appear relatively low, particularly when compared to the performance of the supervised U-Net model (83.67%/71.93%), it is important to note that these segmentations were generated without explicit supervision, and only required a trained binary classifier for cracked and non-cracked image patches. Unlike the U-Net oracle model, no pixel-level labels were necessary. The unsupervised approaches, using raw pixels and the CAE, both obtain very low performance (around 5% F1 after post-processing). In both cases, the resulting masks comprise most of the dark pixels in the images, without separating the crack from the material texture. The noisy textures of the images in our dataset, common for construction materials, hinder the autoencoder from producing accurate reconstructions. Since there is no supervisory signal, the CAE cannot separate the discriminative signal from noise and only reconstructs the mean of the image distribution. While the CAE performs well on less noisy images as shown in [44], it struggles with more complex cases like ours. In terms of post-processing, the GMM thresholding strategy generally yields the best performance. However, when morphological operations are applied, the simple strategy produces superior results. The selection of the post-processing techniques, particularly the choice of morphological operations, involves a trade-off between precision, recall and the visual aspect of the resulting mask (e.g., presence of noise or holes in the mask). Therefore, the preference for specific post-processing strategies should be based on the end user's preference and the requirements of downstream tasks.
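As a concrete illustration of how the per-pixel TP, FP and FN counts translate into the four reported metrics, the sketch below evaluates a small hypothetical predicted mask against a hypothetical ground-truth mask. The function name, the toy masks and the convention of reporting an undefined 0/0 ratio as 0 are assumptions made for this example, not a description of our evaluation code.

```python
def segmentation_metrics(pred, truth):
    """Per-pixel precision, recall, F1 (Dice) and IoU for two binary masks.

    `pred` and `truth` are equally sized 2-D lists of 0/1 values.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1

    def ratio(num, den):
        # Assumed convention: an undefined 0/0 ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
    }


# Hypothetical 3x4 example: a one-pixel-wide vertical crack in column 1.
truth = [[0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0]]
# The predicted mask is too wide at the top and misses the bottom pixel.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

metrics = segmentation_metrics(pred, truth)
print(metrics)
```

In this toy case TP = 2, FP = 2 and FN = 1, so a mask that is only one pixel too wide on a one-pixel-wide crack already halves the precision, which is why thin structures are penalized heavily by these metrics.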
Visualizations for five examples are shown in Figure 6, depicting the results after binarization (with the simple thresholding strategy) in Figure 6a, and after the application of morphological operations in Figure 6b. The figure displays the image and ground-truth mask in the first two columns, followed by masks obtained by XAI methods, unsupervised methods, and the supervised U-Net. In terms of qualitative assessment, DeepLift, DeepLiftShap, and LRP produce visualizations that closely resemble the ground-truth segmentation. LRP and the B-cos network yield cleaner masks with less noise, although the attributions of the B-cos network are incomplete. Other methods exhibit more scattered and noisy attributions, resulting in masks that are too spread out. This qualitative observation corroborates the quantitative results, which show higher recall but lower precision. The main source of error lies in the omission of very thin cracks when multiple cracks are present in the image (as observed in examples 4 and 5 in Figure 6b). In such cases, the attributions of the thinner cracks are overshadowed by noise and are lost during the post-processing stage. It is worth noting that even the supervised U-Net fails to capture the two thinner cracks in the last example. The NN-Explainer method shows promising results but produces masks that are too wide, resulting in a high recall but low precision. We believe this may be due to two main reasons. Firstly, the formulation of the explainer's training loss fails to effectively penalize the mask area, making it difficult to tightly fit around the crack. The mean total area penalty used in the loss formulation is not suitable for capturing thin, skeleton-like structures such as cracks. Secondly, masking out image regions with zero values (i.e., black pixels) may not be ideal for cracks that are composed of dark pixels. An in-distribution masking operation might be more appropriate and could be explored in future work. Lastly, B-cos networks also show promising results but suffer from partial coverage of attributions over the object, failing to highlight the full length of the cracks and thereby compromising the segmentation quality. B-cos explanations are also constrained by the inferior classification performance of the B-cos network. As explained previously, the unsupervised methods are unable to separate the crack from the background texture, resulting in masks that cover the entire image.

Augmentation smoothing
Augmentation smoothing (AugSmooth) is a technique that involves averaging multiple attribution maps obtained using Test-Time Augmentations (TTA). Originally introduced with Grad-CAM [55], AugSmooth improves localization at the expense of the additional computation incurred by TTA, since a prediction and an attribution map must be computed for each augmented input [68]. In our methodology, AugSmooth is applied during the generation of attribution maps, prior to post-processing. We conducted experiments using LRP+AugSmooth, and the results are presented in Table 3. For TTA, we used 6 combinations of random horizontal and vertical flipping, as well as random intensity scaling by factors 0.9, 1.0 or 1.1, following the approach described in [68]. When combined with simple thresholding, substantial improvements were observed, with an absolute increase of +2.01% in F1 score, primarily attributed to an increased recall (+4.22%). However, it is worth noting that LRP+AugSmooth slightly degraded results (-0.35%) when used in conjunction with GMM thresholding, which already exhibited high recall, due to a decrease in precision. Ultimately, when incorporating post-processing, LRP+AugSmooth achieved an F1 score approaching 40%.
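The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_attribution` is a hypothetical stand-in for an XAI method (it is not LRP), the fixed list of six augmentations replaces the random sampling used in our experiments, and images are plain nested lists. The essential points are that each augmented input gets its own attribution map, and that geometric transforms are undone before averaging so that the maps are aligned with the original image.

```python
def hflip(img):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2-D list top to bottom."""
    return img[::-1]


def scale(img, factor):
    """Multiply all pixel intensities by a constant factor."""
    return [[v * factor for v in row] for row in img]


def toy_attribution(img):
    """Hypothetical attribution: darkness relative to the image mean,
    plus a spurious bias that grows with the column index."""
    flat = [v for row in img for v in row]
    mean = sum(flat) / len(flat)
    return [[max(0.0, mean - v) + 0.05 * j for j, v in enumerate(row)]
            for row in img]


# Assumed fixed set of six (horizontal flip, vertical flip, intensity factor) combinations.
AUGMENTATIONS = [
    (False, False, 1.0),
    (True, False, 0.9),
    (False, True, 1.1),
    (True, True, 1.0),
    (True, False, 1.1),
    (False, True, 0.9),
]


def aug_smooth(img, attribute, augmentations=AUGMENTATIONS):
    """Average attribution maps over test-time augmentations."""
    height, width = len(img), len(img[0])
    total = [[0.0] * width for _ in range(height)]
    for do_h, do_v, factor in augmentations:
        aug = scale(img, factor)
        if do_h:
            aug = hflip(aug)
        if do_v:
            aug = vflip(aug)
        attr = attribute(aug)
        # Undo the flips so the map is aligned with the original image.
        if do_v:
            attr = vflip(attr)
        if do_h:
            attr = hflip(attr)
        for i in range(height):
            for j in range(width):
                total[i][j] += attr[i][j]
    n = len(augmentations)
    return [[v / n for v in row] for row in total]


# Hypothetical 3x3 image with a dark vertical crack in the middle column.
image = [[0.8, 0.2, 0.8] for _ in range(3)]
single = toy_attribution(image)
smoothed = aug_smooth(image, toy_attribution)
print("single  :", [round(v, 3) for v in single[1]])
print("smoothed:", [round(v, 3) for v in smoothed[1]])
```

In this toy setting, the left-right bias of the single attribution map is evened out by the horizontal flips, while the response on the crack column is preserved. The cost is one attribution pass per augmentation.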

Crack severity quantification
To evaluate the severity of the damage, we calculated the number of cracks per patch (CPP) [3], the total crack area per patch, and the maximum crack width. The width estimation method from [26] was used to estimate the maximum crack width. In Table 4, we report the mean absolute error (MAE) and mean absolute percentage error (MAPE, in %) of each method compared to the corresponding ground-truth metric extracted from the ground-truth mask. Results for the Raw and CAE methods have been omitted as the resulting masks do not allow meaningful estimation of crack severity. When estimating the number of CPP, most methods achieve a mean absolute error of less than 1, which is close to the performance of the U-Net supervised oracle model (0.74). The two best-performing methods are DeepLift (0.72) and DeepLiftShap (0.78), followed by LRP and NN-Explainer. However, all methods exhibit high errors in estimating crack area and width. This is primarily due to the methods overestimating the extent of the cracks or missing some cracks, as observed in Figure 6b. The choice of post-processing, which favors clean masks but increases their size, also contributes to this issue. Specifically, the simple thresholding is more suitable compared to the GMM strategy, which tends to overestimate the crack size. The LRP method provides the most accurate assessment of crack area and width. On average, the error is 91% for crack area, and 163% for maximum crack width, which corresponds to a 2× to 3× range. Although these errors may seem high, it is important to consider that we are dealing with very thin structures that typically cover about 1% of the image area, and mistakes of a few pixels result in a high percentage of error. In conclusion, our XAI-based methodology offers a rough estimate of crack severity metrics, but it is not as accurate as a fully supervised segmentation approach (the U-Net obtains 20% MAPE on both area and width). However, it is worth noting that for tasks involving extracting the skeleton from the binary mask, the overestimation of the crack width is not a critical issue. Furthermore, if this overestimation is consistent, results can be calibrated and still be valuable for crack growth monitoring, which we study in the following section.
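The sketch below illustrates how two of these severity metrics and their errors could be computed from binary masks. It is a simplified example under assumptions: cracks are counted as 8-connected components (the connectivity is our choice for this illustration), the toy masks are hypothetical, and the maximum width estimation of [26] is not reproduced here.

```python
def crack_count_and_area(mask):
    """Count 8-connected components and foreground pixels in a binary mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    area = 0
    for i in range(height):
        for j in range(width):
            if mask[i][j] and not seen[i][j]:
                count += 1
                seen[i][j] = True
                stack = [(i, j)]
                while stack:  # flood fill of one component
                    r, c = stack.pop()
                    area += 1
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            rr, cc = r + dr, c + dc
                            if (0 <= rr < height and 0 <= cc < width
                                    and mask[rr][cc] and not seen[rr][cc]):
                                seen[rr][cc] = True
                                stack.append((rr, cc))
    return count, area


def mae(estimates, truths):
    """Mean absolute error."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def mape(estimates, truths):
    """Mean absolute percentage error, in %."""
    return 100.0 * sum(abs(e - t) / abs(t) for e, t in zip(estimates, truths)) / len(truths)


# Hypothetical patch A: two thin cracks; the estimate widens one and misses the other.
truth_a = [[1, 0, 0, 0, 1],
           [1, 0, 0, 0, 1],
           [1, 0, 0, 0, 0]]
est_a = [[1, 1, 0, 0, 0],
         [1, 1, 0, 0, 0],
         [1, 1, 0, 0, 0]]
# Hypothetical patch B: one thin crack; the estimate is three pixels wide.
truth_b = [[0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
est_b = [[0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0]]

true_stats = [crack_count_and_area(m) for m in (truth_a, truth_b)]
est_stats = [crack_count_and_area(m) for m in (est_a, est_b)]

cpp_mae = mae([s[0] for s in est_stats], [s[0] for s in true_stats])
area_mape = mape([s[1] for s in est_stats], [s[1] for s in true_stats])
print("CPP MAE:", cpp_mae, "| area MAPE (%):", round(area_mape, 1))
```

The second patch shows why percentage errors are large for thin structures: widening a one-pixel-wide crack by one pixel on each side already triples its area.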

Crack growth monitoring
In this section, we investigate the slightly different task of crack growth monitoring, which involves studying the evolution of crack size or severity over time. This task is equally, if not more, important than accurately estimating severity metrics. If we observe the same damage at different time points, our methodology would be valuable if it can also highlight a potential difference in estimated severity, even if the absolute value may not be very precise. Since obtaining real images showing the growth of cracks over time is challenging, we have designed an artificial experiment as a proof-of-concept.
For this experiment, we randomly sample 100 damage-free images and 100 crack masks from the DIC dataset. For each pair of samples, we simulate a linear growth trajectory by generating a sequence of five images using the original crack skeleton, growing it linearly by repeatedly applying a dilation operation (using an elliptical kernel with radius r = 5 px), and overlaying the grown crack onto the damage-free image. Note that this experiment does not cover crack branching during growth or the initiation of new cracks. We then follow the methodology outlined in Figure 1, using the same pre-trained classifier as in previous experiments. We derive the segmentation masks from the attribution maps of the positive class for each XAI method compared in this study. Finally, we calculate the crack areas and maximum widths based on the true and estimated masks in each growth trajectory.
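The generation of one artificial trajectory can be sketched as below. This is a toy illustration, not the code used in our experiments: it works on a small hypothetical grid with a discrete disk of radius 1 instead of an elliptical kernel of radius 5 px, it assumes that the first frame is the undilated skeleton, and it overlays the crack by setting crack pixels to a constant dark intensity, which is an assumption about the overlay step.

```python
def disk_offsets(radius):
    """Pixel offsets of a discrete disk, a stand-in for an elliptical kernel."""
    return [(dr, dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if dr * dr + dc * dc <= radius * radius]


def dilate(mask, radius):
    """Binary dilation of a 2-D 0/1 list with a disk-shaped kernel."""
    height, width = len(mask), len(mask[0])
    offsets = disk_offsets(radius)
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        out[rr][cc] = 1
    return out


def overlay(background, mask, crack_value=0.1):
    """Paint crack pixels onto a damage-free image (assumed constant dark value)."""
    return [[crack_value if m else px for px, m in zip(bg_row, m_row)]
            for bg_row, m_row in zip(background, mask)]


def growth_trajectory(skeleton, background, steps=5, radius=1):
    """Sequence of images and masks obtained by repeated dilation of a skeleton."""
    masks = [skeleton]  # assumption: the first frame is the undilated skeleton
    while len(masks) < steps:
        masks.append(dilate(masks[-1], radius))
    images = [overlay(background, m) for m in masks]
    return images, masks


HEIGHT, WIDTH = 7, 11
skeleton = [[1 if c == WIDTH // 2 else 0 for c in range(WIDTH)] for _ in range(HEIGHT)]
background = [[0.8] * WIDTH for _ in range(HEIGHT)]

images, masks = growth_trajectory(skeleton, background)
areas = [sum(map(sum, m)) for m in masks]
print(areas)
```

For this straight, full-height skeleton, each dilation adds one column on each side, so the ground-truth area grows by a constant amount per step, which is the linear growth the experiment relies on.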
One example of a generated growth trajectory is illustrated in Figure 7. The generated images and their corresponding ground-truth masks are displayed in the first two columns. LRP explanations are displayed after thresholding (third column), and after final post-processing with morphological operations (last column). Visually, we observe that the resulting mask accurately follows the growth of the crack, except for the first image in the sequence. In this case, the crack was too thin and the classifier exhibited low confidence in its prediction (0.66 softmax probability score for the crack class), resulting in a poor corresponding explanation. For comparison, the softmax confidence scores were equal to 1.00 for the other images in the sequence.

Figure 8: Evolution of true and estimated severity metrics as a function of crack growth, using our methodology with LRP, simple thresholding and morphological closing. Even if the area and width are over-estimated, they allow the growth of the crack to be monitored, as long as the classifier's explanation is relevant. The classifier exhibited low prediction confidence in the first growth step, leading to a poor explanation and severity quantification.

In Figure 8, we illustrate the evolution of true and estimated severity metrics as crack growth progresses. With the exception of image 1, the metrics consistently exhibit a monotonically increasing trend and closely align with the true values, particularly as the crack becomes larger. While there is some overestimation in both area and width, they still provide effective means for monitoring crack growth, as long as the classifier's explanation remains relevant. Hence, it is crucial to obtain an accurate and robust classifier with good generalization abilities, so that its explanations can be relied upon.
To quantitatively evaluate and compare the growth monitoring abilities of the different XAI methods, we first assess whether the estimated severity metrics exhibit linear variations. This is done by computing the average r-value, representing the correlation coefficient of a linear fit of the estimated severity as a function of time. Secondly, we analyze whether the slopes of the estimated severity are in close agreement with the slopes of the ground-truth severity metric. This is assessed by calculating the mean absolute percentage error (MAPE) between the slopes. We filter the dataset to include only images classified as cracks by the classifier, and discard trajectories with fewer than three elements. This filtering process results in 87 retained growth trajectories out of the initial 100. The results for the area and maximum width metrics are reported in Table 5.
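The two growth monitoring scores can be sketched as follows, assuming that the frame index is used as the time variable and that true and estimated severities are aligned frame by frame. The trajectories below are hypothetical numbers chosen for illustration, and the function names are ours; the classifier-based filtering of individual images is not reproduced.

```python
import math


def linear_fit(times, values):
    """Least-squares slope and Pearson correlation coefficient (r-value)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_vv = sum((v - mean_v) ** 2 for v in values)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    # A constant sequence has no defined correlation; report 0.0 by convention.
    r_value = s_tv / math.sqrt(s_tt * s_vv) if s_vv > 0 else 0.0
    return slope, r_value


def growth_scores(trajectories, min_length=3):
    """Average r-value of the estimates and MAPE (in %) between slopes."""
    r_values, slope_errors = [], []
    for true_vals, est_vals in trajectories:
        if len(est_vals) < min_length:
            continue  # discard trajectories with fewer than three elements
        times = list(range(len(est_vals)))
        true_slope, _ = linear_fit(times, true_vals)
        est_slope, r_value = linear_fit(times, est_vals)
        r_values.append(r_value)
        slope_errors.append(abs(est_slope - true_slope) / abs(true_slope))
    mean_r = sum(r_values) / len(r_values)
    slope_mape = 100.0 * sum(slope_errors) / len(slope_errors)
    return mean_r, slope_mape


# Hypothetical (true, estimated) crack areas per growth step.
trajectories = [
    ([10, 20, 30, 40, 50], [30, 40, 56, 64, 80]),
    ([5, 10, 15, 20], [12, 16, 20, 24]),
    ([8, 16], [20, 41]),  # too short: discarded
]

mean_r, slope_mape = growth_scores(trajectories)
print("mean r-value:", round(mean_r, 4), "| slope MAPE (%):", round(slope_mape, 1))
```

In this toy example the estimates are strongly offset from the true areas, yet the r-values stay close to 1 and the slope error is moderate, which is the situation where growth can be monitored despite inaccurate absolute severity.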
We observe that only DeepLiftShap and LRP consistently exhibit a strong linear correlation for both the area and width metrics, with high r-values. IntGrad and the B-cos network achieve a high r-value for the area metric only. For the other methods, there was no clear positive correlation between crack growth and the metrics extracted from the XAI-based segmentation masks, indicating that the growth of the crack did not translate into the resulting attribution maps. However, we also observed a general degradation in the quality of the explanations for artificially generated cracks because their appearance does not closely resemble the real data distribution, which is a limitation of our study. In terms of slope, representing the severity growth rate, LRP achieves the best accuracy with an approximate 35% MAPE.
Finally, we summarize this experiment with a two-dimensional plot (Figure 9) representing the error in severity estimation on the x-axis and the error in severity growth rate estimation (i.e., linear slope) on the y-axis, for each of the XAI methods. A good method should be located in the upper right corner, indicating low errors in both dimensions. Among all the XAI methods compared in our study, LRP demonstrated the most consistent behavior in terms of severity quantification and growth monitoring. The second-best method is DeepLiftShap.

Runtime comparison
In this section, we compare the computational runtime performance of the evaluated methods. Experiments were performed on a server with an Intel Xeon E5-2620 CPU and an NVIDIA GeForce RTX 2080 Ti GPU, running Ubuntu 22.04. The processing times per image in seconds are reported in Figure 10.
Among the XAI methods, the fastest is Input×Gradient at 0.251 s/image. The NN-Explainer requires a single forward pass of the explainer network, making it the second-fastest method at 0.664 s/image. DeepLift, GradientShap, B-cos networks and LRP each take around one second per image. DeepLiftShap requires applying DeepLift for each sample of the baseline distribution, making it naturally slower and scaling linearly with the size of the baseline distribution (here with 10 samples). Integrated Gradients is the slowest method in our study, as it requires multiple steps to interpolate between the baseline and the input. We kept the default number of steps in the captum library, equal to 50. The post-processing time is negligible when using the simple thresholding strategy (0.018 s/image, including the morphological operations), but increases by one order of magnitude with the GMM thresholding (0.372 s/image).
In conclusion, the LRP method provides the best compromise in terms of segmentation quality, growth monitoring ability and computational runtime. In cases where LRP cannot be used, for instance, with classifier architectures containing skip-connections, DeepLiftShap provides an effective solution. Overall, the entire approach is slower than the inference of a supervised segmentation model such as U-Net, but remains of a similar order of magnitude.

Conclusion
Automated segmentation and severity quantification of cracks in images are crucial tasks in structural health monitoring. However, deep learning-based semantic segmentation algorithms require extensive pixel-level labeling of large datasets for supervision. In this work, we have proposed a methodology and benchmarked the performance of various explainable artificial intelligence (XAI) methods, as well as post-processing techniques, in generating high-quality segmentation masks for cracks in masonry building wall surfaces. These masks are derived from the explanations of a binary classifier, trained on damage-free and damaged samples. Moreover, we have proposed a modification of the Neural Network Explainer method suitable for damage classification applications, where a negative class represents the absence of damage. Additionally, we have proposed using damage-free images as baselines in the Integrated Gradients and DeepLift-based XAI methods to enhance the quality of their explanations. The results of our benchmark study have demonstrated that this methodology allows us to approximate segmentation masks with promising performance (around 0.4 F1-score). While this performance falls below that of fully supervised segmentation approaches such as U-Net, it outperforms purely unsupervised approaches such as autoencoders.
Finally, we have investigated the applicability of these methods in quantifying damage severity and monitoring its progression. We evaluated severity metrics such as the number of cracks, maximum crack width and crack area. The experimental results demonstrated that accurately estimating the severity metrics was challenging, primarily due to overestimation of the crack extent. However, we found that it was possible to effectively monitor the severity evolution over time. One key takeaway from this study is the effectiveness of the Layer-wise Relevance Propagation (LRP) method, which excelled in terms of segmentation quality, growth monitoring ability, and computational efficiency.
It is worth mentioning that there are other variations of this methodology that can be explored. For instance, the resulting masks could serve as approximate, coarse labels for training a supervised segmentation model. If ground-truth pixel-level labels are available, these coarse labels could be fine-tuned or used in a semi-supervised setting. These strategies are left for future work.
We believe that one of the potential breakthroughs of this approach will be to enable the more rapid development of automated crack detection and monitoring systems, by avoiding the time and cost of pixel-level image annotation. As part of future work, we also plan to apply the methodology to different types of defects and different types of infrastructure, such as railway sleepers. Moreover, we aim to evaluate the approach using real crack growth data. Finally, we intend to investigate other families of explainable AI methods, beyond feature attribution explanation methods.

Figure 1: Workflows for automated image-based crack segmentation and severity quantification. (a) Supervised semantic segmentation workflow. (b) Workflow of the proposed weakly-supervised methodology based on XAI and classifier explanations. This methodology makes it possible to generate approximate segmentation masks and quantify severity while circumventing the high cost of pixel-level labeling of training images.

Figure 2: Illustration of the post-processing steps using Integrated Gradients (top) and LRP (bottom) attribution maps: (1) binarization, (2) first morphological closing, (3) area opening, (4) second closing. The curve on the right shows the evolution of F1 score at each post-processing step.

Figure 3: Overview of the NN-Explainer method [53] for damage classification with a negative (damage-free) class and K positive (damage) classes. The explainer E learns to predict masks S for each class. Class 0 represents damage-free samples. Masks of positive classes present in the input are merged using their element-wise maximum (denoted by ∨ in the diagram) to create the target mask m (and its inverse), while masks of other positive classes that are not present form the non-target mask n. In our application, there is a single damage class for cracks, and the non-target mask does not play a role. The explanandum F (i.e., the model to be explained) is frozen.

Figure 5 panels: (b) Zero baseline and resulting DeepLift attributions; (c) Ground-truth mask; (d) Mean damage-free baseline and resulting DeepLift attributions.

Figure 5: Comparison between DeepLift attributions obtained with (b) the standard zero baseline and (d) our proposed baseline based on a sample of damage-free images. Attributions are more focused on the crack with our baseline. The behavior is similar for other attribution methods using a baseline image or distribution (e.g., Integrated Gradients).

Figure 7: Visualization of artificial linear crack growth trajectories for crack growth monitoring. The generated images and their corresponding ground-truth masks are displayed in the first two columns. LRP explanations are displayed after thresholding (third column), and after final post-processing with morphological operations (last column). In the first image, the crack is too thin and the classifier has a low prediction confidence (66% softmax score), explaining the poor corresponding explanation.

Figure 9: Comparison of the XAI methods used in our study in terms of crack severity quantification performance and crack severity growth rate estimation performance, using area and maximum width as severity metrics. Errors are reported as the mean absolute percentage error compared with the ground-truth severity. Best methods lie in the upper right corner.

Table 1: Crack classification performance on the DIC dataset (values in %).

Table 2: Crack segmentation quality obtained with the proposed weakly-supervised methodology using different explainable AI methods and post-processing techniques (values in %). Best and second-best scores in bold underlined and bold, respectively.

Table 4: Assessment of crack severity estimation using number of cracks per patch (CPP), crack area per patch and maximum crack width. The table reports the mean absolute error (MAE) or mean absolute percentage error (MAPE, in %) with the ground-truth severity measure (lower is better). Best and second-best scores in bold underlined and bold, respectively.

Table 5: Assessment of crack growth monitoring abilities of different XAI methods in our proposed methodology, using 100 artificially generated linear growth trajectories. Severity metrics are the crack area and maximum width. The table reports the average r-value of a linear fit of the estimated severity as a function of time, and the mean absolute percentage error (MAPE, in %) between the estimated slope (i.e., growth rate) and the ground-truth slope. Best and second-best scores in bold underlined and bold, respectively.