Automatic crack classification and segmentation on masonry surfaces using convolutional neural networks and transfer learning

Masonry structures represent the highest proportion of building stock worldwide. Currently, the structural condition of such structures is predominantly manually inspected which is a laborious, costly and subjective process. With developments in computer vision, there is an opportunity to use digital images to automate the visual inspection process. The aim of this study is to examine deep learning techniques for crack detection on images from masonry walls. A dataset with photos from masonry structures is produced containing complex backgrounds and various crack types and sizes. Different deep learning networks are considered and by leveraging the effect of transfer learning crack detection on masonry surfaces is performed on patch level with 95.3% accuracy and on pixel level with 79.6% F1 score. This is the first implementation of deep learning for pixel-level crack segmentation on masonry surfaces. Codes, data and networks relevant to the herein study are available in: github.com/dimitrisdais/crack_detection_CNN_masonry.


Introduction
Brick masonry is one of the main structural components in modern and historical structures around the world. Numerous old masonry buildings still exist, proving that, when well preserved, the life cycle of such structures may be significantly extended [1]. In many cases, historical masonry structures have been found to be vulnerable to seismic excitations and thus thorough damage assessment is required to propose suitable restoration schemes, when necessary [2]. Moreover, masonry has been widely used in modern structures in countries with low or no seismicity. Masonry structures subjected to induced seismicity, like the ones in the north of The Netherlands, have been found to be susceptible to seismic excitations given that they were constructed without any seismic design [3]. As another example of vulnerable masonry structural systems, arch bridges are the most common single bridge type on the UK rail network; most of them are now more than 100 years old and show significant signs of distress. The importance of developing improved analysis and assessment methods for these bridges was highlighted in [4]. In brief, masonry structures need to be properly inspected to detect any defects at an early stage or after an extreme event in order to safeguard them.
Manual inspection is the most common practice due to its simplicity and the lack of reliable alternatives. Nevertheless, this practice is rather laborious, slow and expensive when accounting for the man-hours required in the field and at the office to process the obtained data. On top of that, the quality of the process can be subjective since it heavily relies on the skills and the physical condition of the inspector; lack of experience or tiredness could easily lead to ill-reported damage. Manual inspection can also raise safety concerns since some parts of structures have access restrictions or are difficult to reach. Manual inspection becomes particularly difficult in post-event cases, such as in the catastrophic aftermath of a strong earthquake, when a high number of buildings need to be inspected with limited resources in a short time. Apart from efficiency, reliability is another aspect to be considered when inspecting masonry structures manually. Significant variability in the routine inspection documentation of structural conditions was previously reported [5,6]. Discrepancies were observed both in the assignment of condition ratings and in the prepared documents, e.g. field inspection notes, photographs, etc. Specifically, on average between four and five different condition rating values were assigned to each structural component, with a maximum of six being assigned [5].
In order to address the drawbacks of manual inspection, vision-based assessment and monitoring of civil infrastructures are gaining ground [7]. In particular, computer vision for crack detection has interested researchers for quite some time. Vision-based crack detection is a perfect example of non-destructive assessment technique, which can be useful especially for historical structures where strict regulations apply and even simple interventions, such as placing crack-rulers, are not permitted by the conservation authorities.
Deep Learning (DL), which is a subfield of artificial intelligence, and its representative tool, namely Convolutional Neural Network (CNN) have proven their efficacy in object detection [8]. Unlike traditional machine learning approaches, DL does not require any hand-crafted features and thus provides end-to-end classifiers which internally learn features and can automatically detect objects [9]. This attribute of DL algorithms along with the recent development of the graphics processing units (GPU) which allowed for very fast computations have boosted their usage in different fields. For the case of crack detection from images, the user only provides as input different photos and receives as output any detected cracks in these photos without the necessity for any manual intervention. DL for crack detection has found different case studies such as on inspection of bridges [10], gas turbines [11] and asphalt surfaces [12].
The aim of this paper is to examine different DL techniques for crack detection on images from masonry walls. Recent developments in DL for crack detection and successful techniques are highlighted in Section 2.1 while studies for vision-based assessment on masonry surfaces found in the literature are presented in Section 2.2. In order to address the lack of data in the literature, a dataset with photos from masonry structures is produced containing complex backgrounds and various crack types and sizes (Section 3). Since little work has been done on crack detection for masonry structures, it is deemed beneficial to train networks both for patch classification (Section 4) and pixel-level segmentation (Section 5) in order to examine the efficacy of different techniques and demonstrate the feasibility of DL methods for crack detection on masonry surfaces. To the authors' best knowledge, this study is the first implementation of DL for pixel-level crack segmentation on masonry surfaces. The technique of transfer learning is also leveraged in order to improve the performance of the DL networks for crack detection on patch and pixel level. Finally, a comparative study is performed where a segmentation network trained on masonry images is tested on photos with cracks taken from concrete surfaces in order to evaluate the ability of CNNs to generalize over different materials (Section 6). Codes, data and networks relevant to the herein study can be found in the GitHub repository: github.com/dimitrisdais/crack_detection_CNN_masonry.

Convolutional neural networks for crack detection
Image classification with CNN can be categorized into three types: image patch classification, boundary box regression and semantic segmentation [13]. In image patch classification the image is divided into patches and each patch is labelled with a class (Fig. 1a). When boundary box regression is considered, a box bounds the detected object, that is a crack, and reveals its position and boundaries (Fig. 1b). These two classification techniques have been extensively used to detect cracks and other defects, and have shown promising results [11,14-16]. Nevertheless, these techniques are implemented at block level rather than at pixel level. On the contrary, semantic segmentation provides information about the exact location, width or length of any defects/cracks since each pixel is assigned to a class label (Fig. 1c) [17-20]. Pixel-wise image segmentation has gained ground in recent years over image patch classification and boundary box regression. A review on DL methods for semantic segmentation applied to various application areas was presented in [21].
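To make the difference between the three output types concrete, the following minimal sketch derives patch labels and a bounding box from a toy pixel-level mask. The function names, the toy mask and the rule "a patch is a crack patch if it contains at least one crack pixel" are illustrative assumptions, not the labelling procedure of any of the cited studies.

```python
def patch_labels(mask, patch):
    """Label each patch x patch tile 1 if it contains any crack pixel, else 0."""
    rows, cols = len(mask), len(mask[0])
    labels = []
    for r0 in range(0, rows, patch):
        row_labels = []
        for c0 in range(0, cols, patch):
            tile_has_crack = any(
                mask[r][c]
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            )
            row_labels.append(1 if tile_has_crack else 0)
        labels.append(row_labels)
    return labels


def bounding_box(mask):
    """Return (top, left, bottom, right) of the crack pixels, or None if there are none."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rs = [r for r, _ in coords]
    cs = [c for _, c in coords]
    return (min(rs), min(cs), max(rs), max(cs))


# Hypothetical 4 x 6 mask with a short diagonal crack (1 = crack pixel).
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

print(patch_labels(toy_mask, 2))  # block-level output, one label per 2 x 2 patch
print(bounding_box(toy_mask))     # box-level output
```

The pixel mask itself is the semantic-segmentation output; the other two representations can be derived from it, but not the other way around, which is why only segmentation gives direct access to crack width and length.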
Recently Fully Convolutional Networks (FCNs), which are end-to-end networks, have been extensively used for semantic segmentation [22]. FCNs performed as an extended CNN where the final prediction was an image with semantic segmentation instead of a class identification. In a recent study, FCNs have been implemented for semantic segmentation on concrete crack images by evaluating several pre-trained network architectures serving as the backbone of the FCN encoder [23]. FCNs were also used by Yang et al. [24] to semantically identify and segment crack pixels with different scales and were combined with morphological operations to extract geometric characteristics, such as length and width, directly from images without manual measurement. An automatic crack segmentation method based on CNN and consisting of the extended FCN and the Deeply-Supervised Nets (DSN) was introduced by Liu et al. [17]. Special care was given to produce a dataset of photos from asphalt and concrete surfaces with cracks in multi-scale and multi-scene to evaluate the crack detection systems. A modified FCN obtained by fine-tuning DenseNet-121 (a densely connected CNN) was implemented by Li et al. [25] to provide pixel-level detection of multiple damages, i.e. crack, spalling, efflorescence and holes, found on concrete surfaces. The suggested method outperformed a method based on SegNet (a deep convolutional encoder-decoder architecture trained to classify urban street pictures at pixel level) while producing smaller trained models as well. An FCN was implemented to simultaneously identify material type (concrete, steel, asphalt), as well as fine (cracks, exposed rebar) and coarse (spalling, corrosion) structural damage [26].
U-net is a deep FCN that was developed for biomedical image segmentation and outperformed other state of the art networks [27]. Since then, U-net has become a benchmark for image segmentation in different fields and its efficacy in detecting thin edges resulted in its wide implementation in the inspection of structures. In particular, U-net was implemented for crack detection on pavement surfaces [28,29]. Another showcase for U-net was presented by Liu et al. [30] for concrete crack detection, where the U-net performed better than the other FCN methods [23,24] while being trained on significantly smaller training sets. Note that in order to solve the sample imbalance problem, the focal loss function was selected in the study by Liu et al. [30]. A DL algorithm based on U-Net and a CNN with alternately updated clique (CliqueNet), called U-CliqueNet, was proposed to separate cracks from background in tunnel images [31]. The system obtained promising results and was able to separate cracks from images with noise similar to cracks, such as patchwork joints, wires, etc. It is noted that, while other studies were based on datasets of a couple of hundred images, the proposed network was trained on an extensive dataset consisting of 50,000 and 10,000 images of 496 × 496 pixels for training and testing respectively.
Feature Pyramid Networks (FPN) [32] is a typical model architecture to generate pyramidal feature representations for object detection. This architecture extracts features at different scales and then fuses them, which reportedly provides predictions of higher accuracy [32]. FPN achieved state of the art single-model results on the COCO detection benchmark and has been implemented as a generic feature extractor in several applications such as object detection and instance object segmentation [33,34]. FPN was combined with a hierarchical boosting module to perform pavement crack segmentation obtaining high accuracy and generalizability [35]. The cutting-edge single-stage object detector YOLOv3 adopting FPN was utilized to detect multiple concrete damages of highway bridges [36]. Multiscale feature maps were obtained by a generic pretrained CNN model and fused by implementing FPN in order to apply crack segmentation on concrete images [37].
Furthermore, transfer learning in DL has been extensively implemented on different fields of computer vision with remarkable results and is considered suitable when the training dataset is small allowing for better performance and less computational effort [9,38]. The intuition behind transfer learning for image classification is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world [39]. CNNs utilizing transfer learning have been used extensively for image classification and semantic segmentation in the field of crack detection [19,25,40,41]. Transfer learning was implemented in image-based structural recognition to perform component type identification, spalling condition check, damage level evaluation, and damage type determination [42].
Lately, different studies obtained noteworthy results in crack segmentation by implementing region proposal networks followed by algorithms for pixel-level crack detection [43,44]. In particular, such a hybrid method was proposed by Kang et al. [44] where crack regions detected by Faster R-CNN were processed by a modified tubularity flow field algorithm to segment the crack pixels. As reported by Kang et al. [44], the advantages of this method are that Faster R-CNN can detect crack regions very well even on complex backgrounds and that only a dataset of images with bounding boxes of cracks is required, which drastically reduces the time to prepare a dataset. As stated by Kang et al. [44], the proposed method is useful for concrete surfaces only and its applicability on different materials might be limited. Moreover, Chen et al. [45] implemented an encoder-decoder architecture and proposed a switch method to distinguish between negative and positive samples automatically and skip the decoder module when the sample is negative in order to save inference time.

Vision-based assessment on masonry surfaces
As shown in Section 2.1, vision-based and, in particular, DL methods for crack detection have been widely applied to concrete surfaces or asphalt. On the contrary, little research has been done on vision-based assessment and specifically on defect detection applied to masonry surfaces. Inarguably, the surface of masonry is less homogeneous and significantly noisier as compared to concrete or asphalt [46]. On top of that, studies have shown that DL models are sensitive to material. In other words, DL models that were trained on a specific surface type failed to achieve the same accuracy when the material was different. The development of DL models that could be robustly applied to infrastructure inspection images for both concrete and asphalt pavement was attempted, but crack detection models trained on one material did not necessarily work on other materials and significant performance degradation would be expected when testing on other materials [47]. In another study, various CNNs were trained on images from concrete structures for crack detection and the transferability of their learned features to photos from different materials was examined by Özgenel and Sorguç [48]. Brickwork images were found to be the most challenging among the tested cases since brickwork jointing and background textures constitute challenging noise. Moreover, it was concluded that the level of variance in the dataset was more important than the number of samples.
Point clouds were obtained with laser scanning and photogrammetry techniques and were combined to detect different types of defects on ashlar masonry walls by using machine learning classification based on geometry and colour information [49]. U-net [27] was used for brick segmentation in masonry walls [50]. McCormick et al. [51] used a system that combined different types of sensors (multiple high-resolution cameras, laser scanning and inertial measurement unit) for tunnel inspection. Digital image correlation techniques were utilized to automatically trace any changes between consecutive inspections and subsequently an operator would appropriately classify them among a list of defects [51]. Thus, the defect detection process is not fully automated and human intervention is still required.
CNN to classify different defect types, such as cracking, spalling, mortar loss, and vegetation, from images of masonry structures was used by Brackenbury et al. [46]. In detail, photos were taken from masonry bridges and corrected for perspective distortion and then resized to ensure a constant resolution. Defect classification was implemented on patches of 100 × 100 pixels. It was suggested that by firstly detecting and segmenting mortar joints, and then classifying defects, defected and defect-free areas of the masonry could be all predicted with more confidence and better accuracy.
A novel damage identification architecture to detect two types of damages (efflorescence and spalling) in historic masonry buildings based on the Faster R-CNN ResNet101 model was proposed by [52]. In particular, two orthophotos were extracted from a historical structure and were segmented into small patches of 500 × 500 pixels. The produced patches were annotated with bounding boxes marking the investigated damage types. Quick identification and detection of the surface damage was achieved. The necessity for the expansion of the database with wider range of distances and angles and more types of structural samples was reported.
An automatic vision-based crack detection system using CNN was proposed by Ali et al. [53] to ease the inspection of masonry structures. The feature extraction process was done by CNN from colour images and three classifiers were studied, namely the CNN itself, SVM and Random Forest. False negative areas were found since the system would confuse the grout lines with cracks. Finally, since the cracks on masonry structures could not be easily identified, extreme care was needed when annotating the dataset.
A common limitation observed in the existing literature for vision-based assessment on masonry surfaces is that the developed methods regarded only a single structure and therefore their ability to generalize when tested on more diverse data remains to be evaluated.

Dataset preparation
In order to address the lack of data in the literature, a dataset with photos from masonry structures is produced containing complex backgrounds and various crack types and sizes. DL networks are data-driven techniques, thus they heavily rely on the quality and amount of data [54]. Before preparing the masonry dataset for this study, an extensive literature review is performed to spot good and bad practices when collecting data for crack detection. It is highlighted that the goal of training a network is to enhance its ability to generalize when fed with diverse data.
Special care is frequently taken when collecting data so that photos are taken in a homogeneous way keeping constant conditions, such as distance, angle etc. [25,55,56]. Moreover, it is common for datasets for crack detection to be custom-made and manually pre-processed to exclude noisy background and for images to be carefully selected to focus on the cracks [56]. Nevertheless, a common criticism of developed DL methods is that they attain remarkable results when tested on monotonous backgrounds, but their accuracy severely drops when deployed on images with complex backgrounds. Choi and Cha [19] observed that when a CNN was trained on images of monotonous background and subsequently tested on a more complex dataset, the performance drastically decreased; precision dropped from 0.874 to 0.231. Several studies have emphasized the necessity for more complex datasets [13,57,58]. The issue they raised is particularly important for the context of this paper since masonry surfaces consist of brick or stone materials, possibly with mortar joints, with several complex objects around, such as windows, doors, ornaments, labels, lamps, cables, vegetation etc., which can be characterized as noise for the crack detection process. Other materials, such as concrete or asphalt, for which crack detection methods have been widely investigated, provide a relatively smooth and flat surface. On the contrary, a masonry surface is usually rough and uneven since mortar might protrude around the bricks or some gaps might exist at the interface between mortar joints and brick units. These anomalies might create shadows in the photos, especially when the photos are taken at acute angles, causing the network to falsely consider these regions as cracks. Moreover, cracks are usually covered with dust or paint. Therefore, it is deemed that a database as generic as possible would lead to higher chances of developing a tool that is able to perform accurately in real cases.
Taking all these into consideration a masonry dataset is prepared for this study. Photos were collected from different sources. Various images of masonry walls containing cracks were obtained from the Internet. Additionally, photos were taken from different masonry buildings in the Groningen region, The Netherlands. In fact, in order to simulate the scenario where different users will contribute in the data collection by taking photos with devices of different characteristics, various members of our research group were asked to photograph cracks from masonry walls with their phones or DSLR cameras after providing them with simple guidelines. It is noted that photos from masonry surfaces with ( Fig. 2a-b) and without ( Fig. 2c and d) cracks were taken under similar conditions (angle, distance, etc.) in order to enrich further the non-crack class.
The herein created dataset will be referred to as "masonry dataset". In total 351 photos containing cracks and 118 without any crack were gathered from masonry surfaces. These photos were divided into patches of 224 × 224 pixels, which leads to 4057 patches containing cracks, while an extra 7434 non-crack patches were randomly selected from the gathered photos. A sample of photos from the masonry dataset with cracks is presented in Fig. 3. A wide range of scales and resolutions was considered. The crack patches depict fields of view ranging from small (a couple of bricks) to large (whole masonry walls). Cracks might extend over the joints, the bricks or both. Cracks appearing as straight lines, zigzags or complex shapes were examined. Diverse types of cracks in terms of length, width and shape were included in the masonry dataset. Moreover, the crack patches included different types of noisy background, such as windows, plants, lamps and signs (Fig. 3). Further examples of objects that typically exist around masonry façades and are included in the non-crack patches are shown in Fig. 4.
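The division of a photo into 224 × 224 patches can be sketched as follows. This is only an illustration under stated assumptions: a non-overlapping grid is used and incomplete tiles at the right and bottom borders are discarded; the source does not specify how borders or overlaps were handled, and `patch_origins` is a hypothetical helper name.

```python
def patch_origins(height, width, patch=224):
    """Top-left (row, col) of every full patch x patch tile on a non-overlapping grid.

    Assumption: tiles that would extend past the image border are dropped.
    """
    return [
        (r, c)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]


# Hypothetical photo of 1000 x 1500 pixels (rows x columns).
origins = patch_origins(1000, 1500)
print(len(origins), origins[0], origins[-1])
```

For this hypothetical photo size the grid has 4 rows and 6 columns of patches; each origin can then be used to crop the corresponding 224 × 224 region from the image array.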
Along the development of this study, while collecting new data the different networks were run with the available dataset each time. It was observed that the metrics were improving as the masonry dataset was being enriched. The greatest improvement was recorded in the precision value; while extra types of background objects were included in the dataset the easier it was for the networks to learn to accurately negate them. Therefore, by improving how closely the dataset represents the real world the better would be the performance of the networks.

Convolutional neural networks for crack image classification
Image patch classification for crack detection was implemented by leveraging the effect of transfer learning via fine-tuning. The technique of fine-tuning was implemented by discarding the fully connected (FC) layers at the top of a pretrained network and training new, freshly initialized FC layers on the new data with a low learning rate [9]. In detail, a FC layer with 128 features and rectified linear unit (ReLU) activation was added followed by batch normalization and a dropout layer with a probability of 0.5. Batch normalization is a technique that improves the speed, performance, and stability of artificial neural networks and was used to normalize the input layer by adjusting and scaling the activations while dropout temporarily disconnects the neural connections between connected layers during training. Finally, a FC layer with softmax activation was placed to classify the images as crack or non-crack.
Different state of the art CNNs pretrained on ImageNet (1.2 million images with 1000 categories) were examined herein for their efficacy in classifying images from masonry surfaces on patch level as crack or non-crack. The considered networks were: VGG16 [59], MobileNet [60], MobileNetV2 [61], InceptionV3 [62], DenseNet121 [63], DenseNet169 [63], ResNet34 [64], ResNet50 [64]. The configuration of ResNet34 and the pre-trained weights were obtained from Yakubovskiy [65], while for the rest of the networks the configuration and the weights were extracted from Keras [66]. The details of the different networks are shown in Table 1. All the models were deposited in the GitHub repository: github.com/dimitrisdais/crack_detection_CNN_masonry.
At this point the architecture of MobileNet is highlighted since it obtained the best results, as will be shown below (Section 4.3). MobileNet is a lightweight network designed to run on computationally limited platforms; it achieved accuracy comparable to VGG16 on ImageNet with only 1/30 of the computational cost and model size [67]. A standard convolution both filters and combines inputs into a new set of outputs in one step. MobileNet is based on depthwise separable convolutions, which are a form of factorized convolutions (see Fig. 5); the depthwise convolution applies a single filter to each input channel and the pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. This factorization (Fig. 5) has the effect of drastically reducing computation and model size. MobileNet comprises multiple factorized layers with depthwise convolution, 1 × 1 pointwise convolution, batch normalization and ReLU activation (Fig. 6a) instead of layers of regular convolutions followed by batch normalization and ReLU activation (Fig. 6b). The MobileNet architecture has two hyper-parameters, namely the width and resolution multipliers, which make it easy to produce smaller versions of the network. Herein, for both hyper-parameters the default value is selected, that is 1, which means that no shrinking is applied to the model [60,66].
MobileNet or networks that made use of depthwise separable convolution have been implemented in recent studies for crack detection. Single Shot MultiBox Detector [68], an object detection framework, was combined with MobileNet to detect different damage types on road surfaces [69]. MobileNet performed as the encoder of a semantic segmentation network based on DeepLab [18] for real-time tunnel crack analysis [70]. The depthwise separable convolution was used to reduce computational complexity and improve computational efficiency of image classification for crack detection [71]. Depthwise convolutions have been successfully used for pixel-level segmentation of cracks on concrete surfaces [19].
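The saving obtained from the factorization can be illustrated with a short multiply-accumulate count, following the cost model given in the MobileNet paper [60]. The layer dimensions used below (3 × 3 kernel, 64 input channels, 128 output channels, 56 × 56 feature map) are illustrative assumptions and do not refer to a specific layer of the network used in this study.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-accumulates of a k x k standard convolution.

    k: kernel size, m: input channels, n: output channels, f: feature map side.
    """
    return k * k * m * n * f * f


def separable_conv_cost(k, m, n, f):
    """Multiply-accumulates of a depthwise (k x k) plus pointwise (1 x 1) convolution."""
    depthwise = k * k * m * f * f   # one k x k filter per input channel
    pointwise = m * n * f * f       # 1 x 1 convolution combining the channels
    return depthwise + pointwise


# Hypothetical layer dimensions, chosen only for illustration.
k, m, n, f = 3, 64, 128, 56
ratio = separable_conv_cost(k, m, n, f) / standard_conv_cost(k, m, n, f)
print(f"cost ratio separable/standard: {ratio:.4f}")
```

The ratio reduces analytically to 1/n + 1/k², so for 3 × 3 kernels the factorized layer needs roughly 8 to 9 times fewer operations than the standard convolution once the number of output channels is large.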

Training configuration
The networks for image classification are allowed to train for a great number of epochs, with a minimum of 50 epochs, until the accuracy (see Eq. (7)) on the validation set does not increase any further. The data are fed to the network with a batch size of 10.
Optimization in DL networks updates the weight parameters to minimize the loss function. The Adam method (Adaptive Moment Estimation) was found to outperform other stochastic optimization methods [72], i.e. it converges faster, and is selected as the optimizer of the network herein. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments [72]. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters [72]. The hyperparameters have intuitive interpretations and typically require little tuning. The weight update with the Adam optimizer is described as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \quad (1)$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \quad (2)$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad (3)$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (4)$$
$$w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \quad (5)$$
where $t$ is the timestep, $g_t$ is the gradient vector, $m_t$ and $v_t$ are the first (mean) and second (uncentered variance) biased moment estimates of the gradients respectively, $\hat{m}_t$ and $\hat{v}_t$ are the first (mean) and second (uncentered variance) bias-corrected moment estimates of the gradients respectively, $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates, $\alpha$ is the learning rate, $w$ is the model weights and $\varepsilon = 10^{-8}$.
The default values 0.9 and 0.999 are taken for β 1 and β 2 respectively [66,72]. The networks are trained with a constant learning rate α equal to 0.001.
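A minimal scalar sketch of one Adam update, using the hyperparameter values stated above (β1 = 0.9, β2 = 0.999, α = 0.001, ε = 10⁻⁸), is given below. The function `adam_step` and the toy objective f(w) = (w − 3)² are illustrative assumptions; in practice the update is applied by the optimizer of the DL framework to every weight of the network.

```python
import math


def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight w with gradient g at timestep t >= 1."""
    m = beta1 * m + (1 - beta1) * g          # biased first moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # biased second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); purely illustrative.
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 4):
    g = 2 * (w - 3)
    w, m, v = adam_step(w, g, m, v, t)
    history.append(w)
print(history)
```

Because the update is normalized by the square root of the second moment, the first steps have a magnitude close to the learning rate α regardless of the scale of the gradient, which is the property referred to as invariance to rescaling of the gradients.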
In the context of an optimization algorithm, a loss function is used to evaluate a candidate solution (i.e. a set of weights) that will minimize the prediction error. The cross entropy (CE) loss function ($L_{CE}$) is utilized herein and is given as:
$$L_{CE} = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right] \quad (6)$$
where $y$ is the ground truth and $\hat{y}$ is the prediction. $y$ can take values equal to 0 (non-crack) or 1 (crack) while $\hat{y}$ can be in the range of 0 to 1. The performance of the networks is evaluated based on the values of accuracy, which is defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$
where TP, TN, FP and FN correspond to true positive, true negative, false positive and false negative, respectively. The classification is binary with non-crack and crack cases corresponding to the negative and positive class respectively. Thus, TP implies that a crack image is correctly classified while TN means that a non-crack image is predicted accurately. While accuracy performs as an average of the performance of the two classes, TP and TN provide a better insight into the classification error for each class individually and thus are reported as well.
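The cross entropy loss and the accuracy described above can be sketched in a few lines, assuming the standard binary form of the cross entropy for a single sample. The clipping constant `eps` is an assumption added for numerical safety, and the confusion counts in the usage lines are made-up numbers for illustration only.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross entropy for one sample; y is 0 (non-crack) or 1 (crack)."""
    # Clip the prediction to avoid log(0); eps is an illustrative assumption.
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# A confident correct prediction is penalized less than an uncertain one.
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.5))
# Made-up confusion counts, for illustration only.
print(accuracy(tp=90, tn=95, fp=5, fn=10))
```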
For the training of the image classification networks the 4057 crack and the 7434 non-crack patches of the masonry dataset are used. 60% and 40% of the patches are used for training and validation respectively. The networks are implemented on Keras [66], a high-level neural network API, written in Python and by utilizing TensorFlow as back-end. The networks are run on a laptop with Intel i7 processor with 2.20 GHz, 16 GB RAM and Nvidia GPU GeForce(R) RTX 2060 with 6 GB.
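A 60/40 split of the 11,491 patches can be sketched as below. The source does not state whether the split was purely random or stratified by class, so the seeded random shuffle and the helper name `split_indices` are assumptions made only for illustration.

```python
import random


def split_indices(n_items, train_fraction=0.6, seed=0):
    """Shuffle item indices reproducibly and split them into training and validation sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


n_patches = 4057 + 7434  # crack + non-crack patches of the masonry dataset
train_idx, val_idx = split_indices(n_patches)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set identical across the compared networks, so that differences in the reported metrics are not caused by a different partition of the data.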

Results for crack image classification
In this section the results from the trained networks for image classification are presented. The metrics obtained by the trained models on the validation set are listed in Table 1 for the epoch at which the highest accuracy is reached for each case. While all the considered networks obtain high accuracy on the validation set, that is 88% or more, MobileNet outperforms the rest by scoring an accuracy of 95.3% (Table 1). In order to examine the benefit of transfer learning, MobileNet is also evaluated without pretraining, with its weights randomly initialized [66]. Indeed, the accuracy of MobileNet drops from 95.3% to 89.0%, which reveals that transfer learning offers a significant boost to the performance of the network. In more detail, when random initialization is considered, the ratio of TN remains high, that is 96.4%; however, TP declines considerably from 89.8% to 75.8%. Consequently, without pretraining the network struggles to differentiate edges corresponding to the crack class and tends to label them as non-crack.
In Figs. 3 and 4 representative images of the masonry dataset are presented. Based on the accuracy of the model it can be concluded that the network learns rich features that allow for correct classifications on the dataset produced. A closer look at the performance of MobileNet is provided by the produced confusion matrix (Fig. 7). It is inferred that MobileNet excels in predicting correctly the non-crack case with only 1.6% error, while the error in the crack class is higher, that is 10.2% of the crack images are classified as non-crack. Different cases of FP and FN predicted with MobileNet from the validation set are displayed in Figs. 8 and 9 respectively. Part of a pipe (Fig. 8a), joints without mortar (Fig. 8b, g-h), edges around doors (Fig. 8e-f), and blurry or dark edges (Fig. 8c-d) are wrongly classified as cracks. Evidently, a further expansion of the masonry dataset should take into consideration a better representation of the cases that yielded FP so that the network will learn their features and correctly classify them. On the other hand, crack images taken at an acute angle (Fig. 9a) or with a large field of view capturing thin cracks (Fig. 9b-d) are misclassified. Moreover, there are cases of close-up images of thin (Fig. 9e-j) or well-shaped cracks (Fig. 9k-m), a crack with missing mortar (Fig. 9n) and a crack on a dark background (Fig. 9o) that the network wrongly assigns to the non-crack class.
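As a rough consistency check, the overall accuracy can be recovered from the per-class rates reported above. The sketch assumes that the validation set has the same crack to non-crack ratio as the full masonry dataset (4057 to 7434 patches), which is not stated in the source; `overall_accuracy` is a hypothetical helper.

```python
def overall_accuracy(tp_rate, tn_rate, n_crack, n_non_crack):
    """Overall accuracy from per-class correct rates and class sizes."""
    correct = tp_rate * n_crack + tn_rate * n_non_crack
    return correct / (n_crack + n_non_crack)


# Per-class rates as reported in the text; class sizes taken from the full dataset (assumption).
pretrained = overall_accuracy(0.898, 0.984, 4057, 7434)   # 10.2% and 1.6% class errors
random_init = overall_accuracy(0.758, 0.964, 4057, 7434)  # without transfer learning
print(f"pretrained: {pretrained:.3f}, random initialization: {random_init:.3f}")
```

Under this assumption the weighted averages come out at roughly 0.95 and 0.89, in line with the reported accuracies of 95.3% and 89.0%. The calculation also shows why the drop in TP dominates the loss in accuracy even though crack patches form the minority class.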

Convolutional neural networks for crack segmentation
As per Long et al. [22] "semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where". Recently FCNs [22], which are end-to-end networks, have been extensively used for semantic segmentation and in particular for crack segmentation, as highlighted above (Section 2.1). An FCN acts as an extended CNN whose final prediction is a semantically segmented image instead of a class label. FCNs adopt architectures with pyramidal shapes; they follow the usual contracting path (encoder) of image classification networks and replace any FC layers with convolutional layers, while on top of the encoder an expanding path (decoder) is added with successive convolutional layers followed by upsampling operators. The encoder captures context while the decoder enables precise localization. In order to avoid loss of low-level information, skip connections are used to allow the decoder to access the low-level features obtained by the encoder branch. A schematic representation of the encoder-decoder architecture of FCNs is shown in Fig. 10. U-net [27], a deep FCN, and FPN [32], a generic pyramid representation, are considered herein and combined with different CNNs performing as the backbone of the encoder part of the network. FPN in fact adopts an architecture similar to that of U-net, but FPN performs predictions independently at different stages of the expanding path and subsequently concatenates these predictions, while U-net only produces predictions at the last stage. The implementation of the U-net and FPN based models with different CNNs as backbone is in accordance with the work of Yakubovskiy [65] and is further elucidated in the next paragraphs. Furthermore, different networks that were successfully used in the literature for crack segmentation are examined in an extensive comparative study.
U-net [27] built upon the original implementation of FCN [22] by increasing the number of feature channels in the upsampling part, which allows the network to propagate context information to higher resolution layers. As a result, in U-net the expansive path is almost symmetric to the contracting path, yielding a U-shaped architecture. In the encoder there are repeated blocks of two 3 × 3 convolutional layers, each of which is followed by batch normalization and ReLU activation. These blocks are referred to as ConvBlock. Each ConvBlock is followed by a 2 × 2 max pooling layer with stride 2, which halves the dimensions of the feature maps, and the number of feature channels is doubled at each step; this process is called downsampling. In the decoder, a 2 × 2 deconvolution layer succeeds each ConvBlock. The deconvolution layer, usually referred to as transpose convolution layer, upsamples the feature maps, meaning it doubles their size and halves the number of feature channels. The final deconvolution layer restores the original size of the image. Then, a 1 × 1 convolution with sigmoid activation follows, which yields the final prediction for each pixel of the image. In total the network has 23 convolutional layers. Same-level ConvBlocks between the encoder and the decoder are merged with skip connections (Fig. 11).
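The bookkeeping of feature-map sizes and channels described above can be traced with a short sketch. The input size of 256, the 64 base channels and the four downsampling steps are assumptions made for illustration (they follow the original U-net layout and are not stated in this section), and the function names are hypothetical.

```python
def unet_shapes(input_size=256, base_channels=64, depth=4):
    """Return (encoder, bottleneck, decoder) as (spatial_size, channels) tuples."""
    encoder = []
    size, channels = input_size, base_channels
    for _ in range(depth):
        encoder.append((size, channels))  # output of the ConvBlock at this level
        size //= 2                        # 2x2 max pooling with stride 2
        channels *= 2                     # the next ConvBlock doubles the channels
    bottleneck = (size, channels)
    decoder = []
    for _ in range(depth):
        size *= 2                         # 2x2 transpose convolution doubles the size
        channels //= 2                    # and halves the number of channels
        decoder.append((size, channels))
    return encoder, bottleneck, decoder


def count_conv_layers(depth=4):
    """Count convolutional layers under the assumed layout."""
    conv_blocks = 2 * depth + 1           # encoder blocks + bottleneck + decoder blocks
    return 2 * conv_blocks + depth + 1    # two 3x3 convs each, up-convs, final 1x1


encoder, bottleneck, decoder = unet_shapes()
print(encoder, bottleneck, decoder)
print(count_conv_layers())
```

Under these assumptions each decoder level has the same shape as the encoder level it is merged with through a skip connection, and the layer count agrees with the 23 convolutional layers mentioned in the text.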
FPN [32] is a typical model architecture to generate pyramidal feature representations for object detection. FPN is independent of the backbone network and its architecture makes it easily configurable to receive different CNNs as the backbone of the encoder. In particular, FPN adopts a convolutional architecture as its backbone, typically designed for image classification, and builds a feature pyramid with a bottom-up pathway, a top-down pathway and lateral connections. The high-level features, which are semantically strong but lower resolution, are upsampled and combined with higher resolution features to generate feature representations that are both high resolution and semantically strong. The upsampling layer repeats the rows and columns of the input features by 2 × 2 and fills in the new rows and columns by using the nearest neighbour algorithm [66]. The bottom-up pathway which is the feed-forward computation of the backbone CNN produces a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. Layers producing output maps of the same size are considered in the same network stage and for each stage one pyramid level is defined. The top-down pathway obtains higher resolution features by upsampling by a factor of 2 spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced by element-wise addition with features from the bottom-up pathway which undergo a 1 × 1 convolutional layer to reduce channel dimensions. Further on, 3 × 3 convolutions are appended on each merged feature map and the produced maps from the different stages are concatenated. A schematic representation of FPN is displayed in Fig. 12.
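A single merge step of the top-down pathway described above can be sketched as follows for one channel. It is assumed that the lateral feature map has already passed through its 1 × 1 convolution, so that both inputs have matching channels; the function names and the numerical values are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Repeat every row and column twice (nearest neighbour upsampling)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat columns
        out.append(wide)
        out.append(list(wide))                             # repeat rows
    return out


def merge_top_down(coarse, lateral):
    """Upsample the coarser map by 2 and add the lateral map element-wise."""
    upsampled = upsample_nearest_2x(coarse)
    return [
        [u + l for u, l in zip(up_row, lat_row)]
        for up_row, lat_row in zip(upsampled, lateral)
    ]


coarse = [[1, 2],
          [3, 4]]                       # semantically strong, low resolution
lateral = [[10] * 4 for _ in range(4)]  # higher resolution, same channel
merged = merge_top_down(coarse, lateral)
for row in merged:
    print(row)
```

In a full FPN the merged map would subsequently pass through a 3 × 3 convolution, which is omitted here for brevity.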
The CNNs that were tested for image classification in Section 4 are utilized as the encoder for U-net and FPN in order to perform crack segmentation, this time on pixel level. In particular, the considered networks are: VGG16 [59], MobileNet [60], MobileNetV2 [61], InceptionV3 [62], DenseNet121 [63], DenseNet169 [63], ResNet34 [64] and ResNet50 [64]. It is noted that U-net is also considered as a standalone network configured as explained above (Fig. 11). For further reference, the models based on U-net and FPN are named with the base-model followed by the backbone network, e.g. U-net-MobileNet uses U-net as base-model with MobileNet as backbone. Moreover, apart from U-net, other networks found in the literature that performed well in crack segmentation are examined as well, namely DeepLabv3+ [73], DeepCrack [17], and FCN based on VGG16 (referred to as FCN-VGG16) [22]. All the networks used in the herein study for segmentation are listed in Table 2 and can be found in the GitHub repository: github.com/dimitrisdais/crack_detection_CNN_masonry.

Training configuration
The segmentation networks are allowed to train for a large number of epochs, with a minimum of 100 epochs, until the F1 score (see Eq. (11)) on the validation set does not increase any further. The data are fed to the network with a batch size of 4. Similar to the image classification

Datasets for image segmentation on crack detection are characterized by severe class imbalance, i.e. the background class occupies the greatest part of the photos while cracks extend over a limited number of pixels. Due to this imbalance, if special measures are not taken, the network tends to become overconfident in predicting the background class, which could lead to misclassifications of cracks and numerous false negatives. To overcome this, the weighted cross entropy (WCE) loss function is implemented herein. In particular, misclassifications of the crack class are penalized with a higher weight. The WCE loss function ($L_{WCE}$), utilized here, is given as:

$$L_{WCE} = -\left[\beta \, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right]$$

where $y$ is the ground truth, $\hat{y}$ is the prediction, and $\beta$ is the weight of the positive class (i.e. crack), chosen as 10. Also, $y$ can take values equal to 0 (background) or 1 (crack), while $\hat{y}$ can be in the range of 0 to 1. In order to evaluate the effect of the loss function on the performance of the network, different loss functions are examined, i.e. CE, F1 score and focal loss. CE and F1 score correspond to the loss functions obtained from Eqs. (6) and (11) respectively, while focal loss reshapes CE to downweight easy examples and thus focus training on hard negatives [74]. It is noted that the focal loss is implemented with the default values suggested by Lin et al. [74].

The evaluation of the network is based on the values of precision, recall and F1 score. These metrics are given as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

where TP, FP, TN, FN correspond to true positive, false positive, true negative and false negative, respectively. Another common metric in image segmentation is accuracy, which denotes the correctly predicted pixels over the total number of pixels.
When there is class imbalance, accuracy is not considered suitable to evaluate the performance of the network since accuracy will yield a score close to its maximum value, that is 1, even if the whole image is classified as the prevailing class (i.e. background). Therefore, accuracy is ignored and is not reported.
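The weighted cross entropy described above, with the crack class weighted by β = 10, can be sketched in plain Python as follows. Averaging over pixels and clipping the predictions with a small `eps` are assumptions made for this illustration, and the function name is hypothetical; this is not the implementation used in the study.

```python
import math


def weighted_cross_entropy(y_true, y_pred, beta=10.0, eps=1e-7):
    """Mean weighted cross entropy over a flat list of pixels.

    y_true holds 0 (background) or 1 (crack); y_pred holds probabilities.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total += -(beta * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A missed crack pixel versus an equally confident false alarm on background.
missed_crack = weighted_cross_entropy([1], [0.2])
false_alarm = weighted_cross_entropy([0], [0.8])
print(missed_crack, false_alarm, missed_crack / false_alarm)
```

With β = 1 the expression reduces to the ordinary cross entropy; with β = 10 a missed crack pixel costs ten times as much as an equally confident false alarm, which counteracts the tendency of the network to favour the background class.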
Precision is the ratio of correct positive predictions to the total number of positive predictions and measures the ability of the network to reject crack-like objects in the background. Recall considers the completeness of the positive predictions, i.e. how many of the actual crack pixels are correctly classified. Precision and recall frequently conflict with each other [75]. In other words, high recall values usually lead to low precision and vice versa. F1 score is the harmonic mean of precision and recall.
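The definitions above can be illustrated with a short sketch. The pixel counts are hypothetical and serve only to show why accuracy is uninformative under class imbalance and how the harmonic mean penalizes a gap between precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from pixel counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Correctly predicted pixels over the total number of pixels."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical image: 10,000 pixels of which 500 are crack.
# Predicting background everywhere still yields 95% accuracy.
all_background_acc = accuracy(tp=0, fp=0, fn=500, tn=9500)
all_background_prf = precision_recall_f1(tp=0, fp=0, fn=500)

# Hypothetical prediction with precision 0.9 and recall 0.6.
precision, recall, f1 = precision_recall_f1(tp=450, fp=50, fn=300)
arithmetic_mean = (precision + recall) / 2

print(all_background_acc, all_background_prf)
print(precision, recall, f1, arithmetic_mean)
```

In the second case the F1 score is lower than the arithmetic mean of the two metrics, since the harmonic mean is pulled towards the smaller value.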
Requesting the model to segment the exact width of the crack has been found to be rather strict and hard to achieve. Different approaches have been implemented in order to overcome this limitation. In particular, connectivity constraints were incorporated in the loss function to take into consideration the relationship among annotations of neighbouring pixels [56]. Another suggested solution was to apply postprocessing to isolate noisy parts [76]. A common approach was to allow for some tolerance in the evaluation of the crack detection. Thus, background pixels predicted as cracks (FP) were considered as TP if they were a few pixels apart from the annotated cracks [56,[76][77][78]. The latter approach is followed in the herein study.
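The tolerance-based evaluation described above can be sketched as follows. The tolerance value, the square neighbourhood used to measure the distance and the strict counting of false negatives are assumptions made for this illustration; the function name is hypothetical and the sketch is not the evaluation code of the study.

```python
def tolerant_counts(truth, pred, tol=2):
    """Count TP, FP and FN, treating predictions within tol pixels of an
    annotated crack pixel as TP. Masks are 2D lists of 0/1 of equal shape."""
    height, width = len(truth), len(truth[0])

    def near_crack(r, c):
        # Search a (2*tol+1) x (2*tol+1) window around the pixel (assumed shape).
        for rr in range(max(0, r - tol), min(height, r + tol + 1)):
            for cc in range(max(0, c - tol), min(width, c + tol + 1)):
                if truth[rr][cc]:
                    return True
        return False

    tp = fp = fn = 0
    for r in range(height):
        for c in range(width):
            if pred[r][c]:
                if near_crack(r, c):
                    tp += 1
                else:
                    fp += 1
            elif truth[r][c]:
                fn += 1  # annotated crack pixel that was not predicted
    return tp, fp, fn


# Hypothetical 5 x 7 example: a vertical crack in column 3, a prediction
# shifted by one pixel to column 4, and one stray prediction in the corner.
truth = [[1 if c == 3 else 0 for c in range(7)] for _ in range(5)]
pred = [[1 if c == 4 else 0 for c in range(7)] for _ in range(5)]
pred[0][0] = 1

print(tolerant_counts(truth, pred, tol=0))
print(tolerant_counts(truth, pred, tol=1))
```

With zero tolerance every shifted prediction counts as a false positive, whereas a tolerance of one pixel accepts the shifted crack and still rejects the stray pixel.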
For the training of the segmentation networks the 4057 crack patches of the masonry dataset were used. In particular, 60% and 40% of the patches were used for training and validation respectively. The crack patches were fed to the networks along with pixel-level annotated labels. Similar to the classification networks, the segmentation models were implemented in Keras [66] with TensorFlow as back-end and were run on the same laptop (see Section 4.2 for details).

Table 2. Details and metrics of the networks used for segmentation. The metrics are presented for the validation set.

Results for crack segmentation
In this section the segmentation results from the trained networks are presented. The metrics obtained by the trained models on the validation set are shown in Table 2 for the epoch at which the highest F1 score is reached in each case. As Table 2 shows, a high value of recall does not necessarily imply high precision and vice versa. Thus, F1 score, the harmonic mean of recall and precision, is deemed the most indicative metric to decide which networks perform better. On this basis, U-net-MobileNet and FPN-InceptionV3 attain the highest F1 score, that is 79.6%, and FPN-MobileNet follows with 79.5%.
Firstly, the effect of the loss function on the performance of the networks was evaluated. U-net-MobileNet was trained, apart from WCE, with CE, F1 score and focal loss as loss function. It is noted that similar results were obtained for the other networks, but for brevity only results for U-net-MobileNet are presented. The performance of U-net-MobileNet for the different loss functions is displayed in Table 2 and the evolution of the metrics is shown in detail in Fig. 13. As shown in Table 2, the best performance is reached when WCE is utilized; the obtained F1 score is 79.6%, 76.6%, 78.2% and 71.2% for WCE, CE, F1 score and focal loss respectively. Precision is in the range of 90% for CE (Fig. 13b) and focal loss (Fig. 13d), while recall remains significantly lower, i.e. in the range of 60% to 70%. Thus, these two loss functions are not able to handle the class imbalance problem for crack segmentation, since the network becomes overconfident in predicting background while neglecting the minority class, that is crack. When WCE (Fig. 13a) and F1 score (Fig. 13c) are used as loss function, the discrepancy between precision and recall is less pronounced. Specifically for WCE, in the first epochs recall is approximately 90%, later converges to 80%, and from the 80th epoch onwards decreases to 70%. Precision, on the other hand, follows the opposite path, starting from 50% and gradually increasing up to 85% in the final epochs. F1 score is 60% at the beginning of training and then converges to a value slightly below 80%. The highest F1 score is attained in the 45th epoch. The evolution of the three metrics indicates that in the beginning the system is overconfident in predicting cracks and, in this process, misclassifies background as cracks. Similar behaviour was reported by [30,56].

As Zhang et al. [75] pointed out, precision and recall frequently conflict with each other and a compromise between recall and precision is made to select the best model. In order to visualize the meaning of different values of recall and precision, predictions with U-net-MobileNet for different images are exhibited for epochs 3 and 45, which correspond to the highest recall and the highest F1 score respectively (Fig. 14). All the examples in Fig. 14 reach a recall value close to 100% (i.e. the maximum value) at epoch 3. Nevertheless, precision and F1 score remain significantly lower. Taking a closer look at the predictions at epoch 3, large parts of the background have been misclassified as cracks (Fig. 14). Regarding the predictions at epoch 45, recall slightly drops while precision significantly increases, since the network learns to reject greater parts of the background (Fig. 14).
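The compromise between precision and recall when selecting the best model can be sketched as choosing the epoch with the highest validation F1 score. The per-epoch values below are illustrative only; they loosely follow the trend described above and are not taken from Table 2 or Fig. 13, and the variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# epoch -> (precision, recall); illustrative values, NOT measured results.
history = {
    3: (0.50, 0.95),
    45: (0.80, 0.80),
    100: (0.85, 0.70),
}

best_recall_epoch = max(history, key=lambda epoch: history[epoch][1])
best_f1_epoch = max(history, key=lambda epoch: f1_score(*history[epoch]))

print(best_recall_epoch, best_f1_epoch)
```

Selecting on recall alone would favour an early epoch in which large parts of the background are labelled as cracks, whereas selecting on F1 score favours the epoch in which the two metrics are balanced.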
Furthermore, the networks found in the literature, that is DeepCrack, DeepLabv3+, FCN-VGG16 and U-net, have similar performance in terms of F1 score, i.e. from 74% to 75.7%. U-net outperforms the other networks obtained from the literature, achieving an F1 score of 75.7%, with FCN-VGG16 following closely with 75.6%. Moreover, for the networks found in the literature, except for FCN-VGG16, a significant discrepancy is observed between the recall and precision values; the networks favour recall, which leads to lower values of precision. The models based on U-net and FPN with a pretrained CNN as backbone attain F1 scores from 77.2% to 79.6%, which means that they surpass the best F1 score, that is 75.7%, of the models found in the literature, which are implemented without pretraining. Furthermore, it can be observed in Table 2 that U-net and U-net-MobileNet without pretraining reach similar F1 scores, that is 75.7% and 75.4% respectively, while the pretrained U-net-MobileNet yields an F1 score of 79.6%. This observation highlights the effect of pretraining on the performance of the networks; the F1 score of U-net-MobileNet is boosted by 4.2 percentage points when pretraining is considered. In terms of F1 score, the U-net-MobileNet without pretraining performs similarly to FCN-VGG16 and U-net and outperforms DeepCrack and DeepLabv3+. The models based on FPN in general score higher than the corresponding ones built on U-net, while the highest F1 score is obtained with U-net-MobileNet and FPN-InceptionV3 (Table 2). It is noted that the models based on FPN have almost half the size of the ones with U-net in terms of model parameters and memory size of the stored weights (Table 2). Thus, FPN models match the performance of their U-net counterparts while being significantly more lightweight networks.
In Fig. 15 different examples from the validation set are presented with predictions obtained with DeepCrack, DeepLabv3+, U-net, U-net-MobileNet (with and without pretraining) and FPN-InceptionV3. In particular, images with edges around openings (Fig. 15a-e), crack-like mortar joints (Fig. 15f-i), shadows (Fig. 15k) and dark spots (Fig. 15l) are displayed. While the pretrained U-net-MobileNet and FPN-InceptionV3 are able to reject different types of noisy background, the rest of the networks (Fig. 15) score lower in terms of precision.
Images from the validation set with predictions obtained using U-net-MobileNet have already been presented in Fig. 14 and Fig. 15, while further predictions are shown in Fig. 16. The network successfully segments cracks with different crack size, scale and background complexity: close-up photos (Fig. 16a-c), images with a larger field of view (Fig. 16d-f) and images with unwanted objects, i.e. windows and colour-paints (Fig. 16g-i). Nevertheless, there are cases where the network failed to perform crack segmentation accurately. For example, in Fig. 16j-k the network fails to detect parts of the cracks. Moreover, Fig. 16l-r displays examples where the model does not manage to reject noisy types of background.

Comparative study
In a previous study on crack detection on concrete surfaces it was concluded that when a DL network was trained on images with monotonous background and subsequently tested on a more complex dataset, the performance drastically decreased [19]. In more detail, precision fell from 87.4% to 23.1%. Moreover, DL networks trained on concrete images were found to perform poorly when tested on masonry images, because masonry surfaces are rather complex [48]. This behaviour of CNNs was explained by [79]: the transferability of features decreases as the distance between the base task (i.e. training dataset) and the target task (i.e. testing dataset) increases. To build on these findings, U-net-MobileNet trained on the masonry dataset is tested on images from concrete surfaces in order to evaluate the ability of CNNs to generalize over different materials.
In particular, the open source dataset prepared by Yang et al. [24] is selected and will be referred to as the "concrete dataset". The dataset consists of 776 concrete images containing different crack types. Examples of images in the concrete dataset with their labelled cracks are presented in Fig. 17. In the original study [24], an FCN was trained and morphological transformations were applied to further improve the crack segmentation. The reported F1 score, recall and precision were 80%, 79% and 82% respectively [24].
When U-net-MobileNet is tested on the concrete dataset it attains 74.7%, 70.9% and 91.2% for F1 score, recall and precision respectively. The network does not perform satisfactorily in terms of recall, while it excels in terms of precision. These results can be explained by taking a closer look at the predictions on the concrete dataset (Fig. 17). In fact, the network performs exceptionally well in segmenting cracks with complicated shapes (Fig. 17a-d), obtaining 79% recall or above and a minimum of 94% in terms of precision. On the other hand, the network fails to detect cracks like those in Fig. 17e-f, but it is noted that these defects look like spalling and do not have a typical crack-like shape; similar defects do not exist in the masonry dataset. Additionally, precision is high, which implies that the network can easily reject the background. This could be attributed to the fact that concrete surfaces are rather homogeneous and less complex than masonry surfaces. Consequently, the performance of U-net-MobileNet trained on the masonry dataset deteriorates when tested on the concrete dataset, i.e. the F1 score declines from 79.6% to 74.7%, but not as drastically as reported in the literature when networks trained on concrete images were subsequently tested on masonry photos. As explained above (Section 2.2), this is attributed to the fact that masonry surfaces are more complex than concrete ones. It is noted that in the literature there are various datasets of concrete surfaces while only limited data for masonry exist. Thus, when crack segmentation on concrete surfaces is required, it is recommended to train a model solely on concrete images instead of relying on models trained on masonry data. On the other hand, for cases where only few data exist, e.g. timber surfaces, a model trained on a dataset with complex backgrounds like the masonry dataset produced herein could be an alternative.

Conclusions
In this study the feasibility of DL techniques for crack detection on images from masonry walls is investigated. Even though masonry surfaces have been reported to be rather complex for CNNs on crack detection, this study showcases that DL algorithms are able to accurately detect cracks from images of real masonry surfaces. In order to address the lack of data in the literature, a dataset with photos from masonry structures was produced containing complex backgrounds and various crack types and sizes. Different DL networks are considered and, by leveraging the effect of transfer learning, crack detection on masonry surfaces is performed both on patch and pixel level. To the authors' best knowledge, this is the first implementation of DL for pixel-level crack segmentation on masonry surfaces. State of the art CNNs pretrained on ImageNet are examined for their efficacy to classify images from masonry surfaces on patch level, with MobileNet obtaining the highest accuracy, that is 95.3%. U-net, a deep FCN, and FPN, a generic pyramid representation, are combined with different pretrained CNNs performing as the backbone of the encoder part of the network to perform pixel-level crack segmentation. U-net-MobileNet and FPN-InceptionV3 attain the highest F1 score, that is 79.6%, and outperform other networks for crack segmentation from the literature. In particular, for U-net-MobileNet, when the backbone CNN is considered without pretraining, the F1 score declines from 79.6% to 75.4%, which demonstrates the beneficial effect of transfer learning. The ability of CNNs to generalize over different materials is evaluated. The performance of U-net-MobileNet trained on the masonry dataset deteriorates when tested on concrete images, i.e. the F1 score declines from 79.6% to 74.7%, but not as drastically as reported in the literature when networks trained on concrete images were subsequently tested on masonry photos.
Codes, data and networks relevant to the herein study can be found in the GitHub repository: github.com/dimitrisdais/crack_detection_CNN_masonry.
Although the proposed DL algorithms achieved promising results, further improvements are required to achieve a fully automated vision-based assessment of masonry surfaces. The current study focuses on the detection of cracks, but in the future the annotations of the masonry dataset could be updated to consider other defect types as well. The results of DL methods heavily rely on the quality of data. Thus, the expansion of the current masonry dataset is highly recommended, with special care for the inclusion of even broader background types. In particular, including photos under low-lighting conditions and further evaluating the accuracy of the crack detection is highly recommended. With the increasing accessibility of high quality camera sensors, it is advised that the research community develops ways to further mobilize engineers, practitioners and citizens to contribute to the data collection process and provides them with guidelines and automatic procedures that will render the gathered data reliable. Significant research has been devoted to the automatic semantic segmentation of photos of building façades, a technique known as façade parsing. Further studies are advised to evaluate whether façade parsing could be utilized to preliminarily detect objects like doors, ornaments, etc. and exclude them so that the network would search for defects only on masonry surfaces. Herein, networks based on U-net and FPN architectures were implemented. Recent studies have come up with updated versions of these architectures which outperformed the original implementations. A further investigation of whether these updated versions could improve the accuracy of the herein suggested DL algorithms for crack detection is encouraged. The best performing networks implemented herein scored better than other networks which have already been successfully used in the literature for crack segmentation on concrete or asphalt surfaces. Thus, it is highly recommended that the best architectures used herein are implemented on other types of surfaces as future research.

Declaration of Competing Interest
None.

Acknowledgements
Several photos were obtained by inspectors for Helifix, UK, and were kindly offered to expand our masonry dataset; their contribution to this study is highly appreciated. Nektarios Lianos, Computer Vision specialist at Geomagical Labs, is acknowledged for his insightful comments on the implementation of the deep learning networks. The work has been partially funded by RVO within the project "SafeGO - Seismic Monitoring, Design And Strengthening For thE GrOningen Region", Grant No: RAAK.MKB09.021. The crack detection method was developed alongside the project "Seismic Monitoring of Historical Buildings in Groningen" funded by Rijksdienst voor het Cultureel Erfgoed, Grant No: 126761.