Historical-crack18-19: A dataset of annotated images for non-invasive surface crack detection in historical buildings

This article presents the Historical-crack18-19 dataset, which contains around 3886 annotated concrete surface images from historical buildings. The dataset comprises about 40 raw images collected from an ancient mosque (Masjid) in Historic Cairo, Egypt, with about 757 cracked and 3139 non-cracked surface instances. The images were captured with a Canon EOS REBEL T3i digital camera at 5184 × 3456 resolution over two years (2018 and 2019). They were annotated with the help of an expert and are intended for training and validating automated non-invasive crack detection, crack severity recognition, and crack segmentation approaches based on Machine Learning (ML) and Deep Learning (DL) models. Because of the environmental conditions under which the dataset was collected, crack detection and segmentation systems face several challenges in surface images of historical buildings (illumination, crack-like patterns, separators, dust, blurring, deep texture, etc.). Researchers can also use the dataset to benchmark state-of-the-art methods designed for related image classification and object detection problems. The Historical-crack18-19 dataset is freely available at [https://data.mendeley.com/datasets/xfk99kpmj9/1].



Value of the Data
• The Historical-Crack18-19 dataset can be used to train and validate algorithms for crack detection, severity recognition, and crack segmentation in historical buildings.
• The dataset can be used to develop new ML and DL architectures with improved efficiency for crack detection and severity recognition in historical buildings.
• The dataset supports visual tracking of cracks in surface images of historical buildings. It therefore enables engineers to carry out architectural examinations for the early identification of structural health problems.
• The dataset supports numerous feature selection and feature extraction methods based on the color, texture, and shape descriptors of different cracks.
• The original images of the historical surfaces are taken from a wide view, so they may be useful to architects for in-depth analysis.
• The dataset was collected in a natural environment with inconsistent light intensity and weather. Hence, it poses challenges for researchers trying to identify crack severity with the naked eye.

Data Description
The Historical-Crack18-19 image dataset contains about 3886 annotated images of cracked and intact walls of a historical building. It is intended for training, validating, and benchmarking autonomous crack detection, severity recognition, and segmentation algorithms based on computer vision, machine learning, deep convolutional neural networks (DCNN), or other techniques. Such techniques are now widely used in structural health monitoring. For valuable historical buildings, continued progress in crack detection, severity recognition, and segmentation requires a varied, annotated image dataset, which has not been available until now. The images show real cracks in an ancient building that suffers from cracking: the Mosque (Masjed) of Amir Al-Maridani, located in Sekat Al Werdani, El-Darb El-Ahmar, in Cairo Governorate, Egypt. It was built in 1339-40 CE, during the era of the Mamluk Sultanate of Cairo. It is distinguished by its octagonal minaret and its large dome and is considered one of the most distinctively decorated historical buildings.
The dataset includes 40 raw RGB images of cracked and non-cracked (intact) surfaces, to which researchers can apply various image processing and computer vision algorithms for crack detection and segmentation. Table 1 lists the number of intact and cracked walls included in Historical-Crack18-19.

Camera Specification
The dataset was captured using a Canon EOS REBEL T3i, an advanced DSLR camera with a CMOS sensor and a resolution of 5184 × 3456 pixels. The sensor measures 22.3 × 14.9 mm, with a diagonal of 26.82 mm (1.06 in) and a surface area of 332.27 mm².
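The sensor figures quoted above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the function name `sensor_geometry` is hypothetical, and the pixel pitch it reports is a derived estimate that is not stated in this article. It assumes the full sensor width and height map onto the full 5184 × 3456 pixel grid.

```python
import math

# Values quoted in the camera specification above.
SENSOR_W_MM, SENSOR_H_MM = 22.3, 14.9
IMAGE_W_PX, IMAGE_H_PX = 5184, 3456


def sensor_geometry(w_mm, h_mm, w_px, h_px):
    """Return the diagonal, area, and approximate pixel pitch of a sensor.

    The pixel pitch is an estimate (assumption): it treats the whole
    sensor area as covered by the recorded pixel grid.
    """
    diagonal_mm = math.hypot(w_mm, h_mm)      # Pythagorean diagonal
    return {
        "diagonal_mm": diagonal_mm,
        "diagonal_in": diagonal_mm / 25.4,    # 1 inch = 25.4 mm
        "area_mm2": w_mm * h_mm,
        # Horizontal and vertical pitch in micrometres.
        "pitch_um": (w_mm / w_px * 1000.0, h_mm / h_px * 1000.0),
    }


geometry = sensor_geometry(SENSOR_W_MM, SENSOR_H_MM, IMAGE_W_PX, IMAGE_H_PX)
print(f"diagonal: {geometry['diagonal_mm']:.2f} mm "
      f"({geometry['diagonal_in']:.2f} in)")
print(f"area: {geometry['area_mm2']:.2f} mm^2")
print("pixel pitch (um): "
      f"{geometry['pitch_um'][0]:.2f} x {geometry['pitch_um'][1]:.2f}")
```

The computed diagonal and area agree with the quoted 26.82 mm (1.06 in) and 332.27 mm². The pitch estimate gives a rough sense of the physical sampling density on the sensor, not of the scale on the wall, which also depends on lens and shooting distance.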

Processing
In general, the performance of ML and DL techniques is strongly affected by factors such as the size of the training dataset, the number of features, the algorithm parameters, and, in some cases, the nature and complexity of the problem under study. In some domains, such as natural imaging, datasets with millions of images (for example ImageNet) are available. In many application areas, however, including cracks in historical buildings, the data available to train an accurate and robust classifier is insufficient because of the nature of the problem [3, 4]. As a result of this scarcity, classification accuracy decreases significantly, and some classification models suffer from overfitting. A sufficiently large training dataset is therefore a prerequisite for successful classification.
To tackle this problem, the original images are divided into 256 × 256 pixel sub-images at 96 dpi. As shown in Table 1, the final Historical-Crack18-19 dataset consists of 3886 images, which is still a very limited training set. The dataset is divided into two classes, intact and cracked, with 3139 and 757 images, respectively. The training dataset therefore needs to be enlarged for training purposes. To achieve this, data augmentation can be applied to increase the size of the training dataset by generating new samples similar to the training samples. The most common types are augmentation based on basic image manipulations (geometric or color space (photometric) transformations) and augmentation based on deep learning (feature space augmentation or GAN-based augmentation). Example adjustments include translating, cropping, scaling, rotating, and changing brightness and contrast. Unfortunately, selecting unsuitable augmentation methods is likely to add insufficiently informative samples, which have no effect, or even a harmful effect, on the classifier's accuracy and robustness [5]. The augmentation process may include spatial and intensity transformations such as: