Deep Learning-based Engraving Segmentation of 3-D Inscriptions Extracted from the Rough Surface of Ancient Stelae

Ancient stelae are considered important historical sources. However, it is a challenge to recognize the inscriptions carved on stelae whose surfaces have become rough through prolonged weathering. In this paper, we propose a deep learning-based method to extract engraved regions from the 3D scanned mesh of a stela. First, the unevenly distributed vertices of the mesh are redistributed uniformly using a mesh subdivision method. Then, surface features (depth, concave features, and local surface features) are extracted from the subdivided mesh. The depth represents the basic shape of the mesh and is obtained from the aligned mesh. The concave features represent concave regions by using a Frangi filter, and the local surface features use the spin image technique to describe the fine shapes of the vertices neighboring each vertex. The mesh and the surface features are rasterized into feature images, and engraved regions are segmented from the feature images by using an FC-DenseNet. Our experiments confirm that the proposed method effectively extracts engraved regions of the inscriptions from the rough surface of a stela and is robust to noisy and extremely abraded characters. The proposed method outperformed the second-best method by approximately 2.95%, 3.65%, and 7.53% in F1 score, IoU, and SIRI, respectively.


I. INTRODUCTION
Archeological stelae recording events of the past have important value when it comes to studying political and cultural history. However, stelae have usually been exposed to prolonged physical and chemical weathering from the environments where they are located, making it difficult to recognize the characters engraved on them. Weathering causes damage of various sizes and shapes, such as scratches, cracks, dents, scaling, and spalling. The damaged regions are morphologically similar to engraved regions, and are difficult to distinguish from the strokes of the characters. Moreover, the colors of the surface, such as stains, moss marks, and stone patterns, make it more difficult to recognize character strokes. The direction of light can be used to detect thin character strokes through shadows, but it often makes them harder to find. To recognize characters, it is necessary to separate external factors such as colors and light from the surface and to effectively remove noise so as to extract only the engraved regions. Note that concave regions consist of not only engraved regions but also noise due to damage.
Rubbing is a traditional method applied directly to the surface of stelae to extract engraved regions [1]. An image is obtained by rubbing ink onto paper placed on the surface of a relic. The areas where the ink and paper touch the surface become black; elsewhere, the color of the paper is maintained, showing results similar to a binary image.
Although external factors are excluded, and concave regions are extracted, the results can vary depending on the materials used, the environment, and the proficiency of the person using the rubbing method. In particular, rubbing causes direct physical and chemical damage to the surface of the relic from ink contamination.
Recently, 3D scanning-based methods have been introduced to solve the problems associated with the rubbing method [2], [3]. The 3D scanning process does not require physical contact with the relic, and as such, the surface of the relic is not damaged, and the features of the surface are preserved, excluding the external factors. The transformed digital data can be easily reproduced and distributed.
The 3D scanning-based methods can be divided into those that focus on enhancing visualization and those for automatic extraction of engraved regions. The methods relating to visualization are shading-, curvature-, and morphology-based. The shading-based methods emphasize thin strokes by dynamically adjusting the light position or by adapting lighting effects to reveal surface details of 3D objects [4]-[6]. The curvature-based methods represent in color the convexity or concavity of the surface [7]-[9]. The modified curvature-based method estimates valleys and ridges of the surface using improved curvatures, and emphasizes the surface through adaptive filtering and colorization [10]. The morphology-based method highlights concave regions by subtracting a mesh smoothed with a Laplacian filter [11]. Several methods combine multiple visualizations to improve the detection of engravings [12]-[14]. However, the visualization results capture both engraved regions and noise, so archeological knowledge is required to analyze and interpret the results.
The methods that automatically extract engraved regions without subjective evaluation are as follows. The valley/ridge-based methods directly extract valleys and ridges from the surface [15]-[17]. The depth estimation-based relief extraction (DRE) method [18] estimates the relative depth of each vertex by using normal vectors of the virtual base surface, and then separates engraved regions with a depth threshold determined by an expectation-maximization (EM) algorithm [19]. The curvature-based methods [20], [21] estimate engraved regions by detecting canal-like shapes with a Frangi filter [22], which was introduced to identify blood vessels in medical images using curvature. However, since rough surfaces do not follow a Gaussian distribution, and since these methods lack the ability to distinguish between engraved regions and noise, a lot of noise is also extracted in the results.
Machine learning-based methods were proposed to distinguish engraved regions from noise. Texture-based methods are widely used [23]-[25]. The 2D texture-based method segments an input image containing multiple textures given a patch of a reference texture [24]. The 3D texture-based method [25] extracts texture features on local 3D surfaces, and classifies engraved regions using a support vector machine (SVM) [26]. In the deviation map-based method, a 3D mesh is projected into a 2D image to reduce the computational complexity and generate an enhanced deviation map [27]. Then, pixel-wise segmentation is performed using a random forest model trained using the local surface patches [28]. The segment-based method involves the initial extraction of concave segments from the 3D mesh [29]. Features including local extrema, cross-sectional, and appearance information are extracted from the segments. The engravings are identified from the classification results obtained with an SVM model. The machine learning-based methods showed higher performance than rule-based methods, but they are limited in that they mainly use local surface information. In other words, they classify engraved regions and noise mainly from the local context and cannot utilize the global context.
Despite the previous work, it is still a challenge to separate engraved regions from noise on rough surfaces. It is difficult to extract engraved regions from the rough surfaces of stelae because the reference surface is not flat and contains a lot of noise. Although engraved regions can be separated from noise by considering the local surface characteristics, noise has local characteristics similar to those of the engraved regions. Noise appears on the surface of the stela as a result of prolonged weathering, which has also worn the engraved regions. In contrast to the noise, the regions of engraved characters were originally carved on the surface of the stela. The strokes of the characters are short, shallow, and thin, and easily confused with, or interpreted as, noise. Thus, it is necessary to extract not only local shape features of engraved regions but also the global context for classification.
In this paper, we propose a method to segment the engraved regions of 3D inscriptions using deep learning. First, 3D scanned data are preprocessed by mesh subdivision to evenly distribute the positions of vertices in the mesh. Then, surface features consisting of depth, concave features, and local surface features are extracted. The mesh and the surface features are rasterized into feature images, and the engraved regions are segmented using an FC-DenseNet. We select surface features that effectively represent the surface and utilize both local and global contexts via the FC-DenseNet to classify engraved regions pixel-wise.
The contributions of this paper are as follows.
1) We propose a method of segmenting the characters on a very rough, high-resolution 3D mesh by applying CNN-based 2D segmentation while preserving the fine features.
2) The proposed method shows the highest performance, both objectively and subjectively, when compared to conventional methods.
3) Multiple surface features are combined instead of using a single surface feature, which yields better performance.
4) The proposed method utilizes both local and global contexts, unlike conventional methods, to classify engraved regions, achieving higher performance.
5) The proposed method not only shows the best performance, but also performs well on extremely abraded characters. The experiments show the proposed method is more robust to noise than conventional methods.

This paper is organized as follows. We review related studies in Section II. Details of the proposed method are described in Section III. We present evaluations of our results in Section IV, and conclude the paper in Section V.

II. RELATED WORKS
In this section, we briefly introduce the studies relevant to the proposed method. For clarity, these previous studies are divided into subsections relating to the Frangi filter [22], the spin image [30], and FC-DenseNet [31].

A. FRANGI FILTERS
The curvature-based relief extraction (CRE) method applies a Frangi filter to extract engraved regions from relics [20]. CRE exploits the advantages of the Frangi filter by using the principal curvatures to extract canal-like concave regions from a 3D mesh. The concavity, $C$, of vertex $v_i$ is given by (1) in terms of $k_1$ and $k_2$ ($|k_1| \geq |k_2|$), the maximum and minimum principal curvatures at the vertex. The vertex index, $i$, is omitted for clarity. The parameters $\Gamma$ and $\Pi$ represent the ratio of the principal curvatures, $|k_2 / k_1|$, and the magnitude, $\sqrt{k_1^2 + k_2^2}$, respectively. When $k_1 \geq 0$ (concave), the first term indicates how similar the local surface is to a canal by using the ratio of the principal curvatures. The second term represents the depth of the local surface using the magnitude of the principal curvatures. Calculating the principal curvatures requires the second derivatives of the surface. In the proposed method, the principal curvatures are obtained using Rusinkiewicz's method [9], which utilizes a tensor representation.
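The two-term concavity described above can be sketched as follows. This is only an illustration that assumes the standard 2D Frangi vesselness form, with `alpha` weighting the ratio term and `beta` weighting the magnitude term; that assignment, the function name, and the default values are assumptions rather than details confirmed by the text.

```python
import math

def concavity(k1, k2, alpha=0.6, beta=0.1):
    """Frangi-style concavity of one vertex from its principal curvatures.

    Assumed form (standard Frangi vesselness, not confirmed by the paper):
    C = exp(-G^2 / (2 alpha^2)) * (1 - exp(-P^2 / (2 beta^2))) when k1 > 0,
    and C = 0 otherwise, with G = |k2 / k1| and P = sqrt(k1^2 + k2^2).
    """
    # Order the curvatures so that |k1| >= |k2|.
    if abs(k1) < abs(k2):
        k1, k2 = k2, k1
    # Convex or perfectly flat vertices get zero concavity.
    if k1 <= 0.0:
        return 0.0
    ratio = abs(k2 / k1)            # 0 for an ideal canal, 1 for a bowl
    magnitude = math.hypot(k1, k2)  # overall strength of the bending
    canal_term = math.exp(-ratio ** 2 / (2.0 * alpha ** 2))
    depth_term = 1.0 - math.exp(-magnitude ** 2 / (2.0 * beta ** 2))
    return canal_term * depth_term
```

Under these assumptions, a canal-like vertex (one curvature near zero) scores close to 1, a bowl-like vertex scores lower, and a convex vertex scores 0.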
However, CRE is not as effective at extracting engravings with coarse surfaces as it is in extracting engravings of flat murals. In particular, the principal curvature is obtained in quadratic differential form, which is weak against noise, resulting in very messy boundaries of the extracted regions.
The modified curvature-based relief extraction (MCRE) method applies Gaussian smoothing to reduce the noise of the principal curvature and increase the performance of the Frangi filter [21]. MCRE extracts engraved regions by applying a threshold value to the obtained concavity $C$. Subsequently, the dual curvature-based relief extraction (DCRE) method extracts only deeply engraved regions by using two Frangi filters with different parameters [21]. MCRE and DCRE can effectively extract concave regions. Nevertheless, the methods are less accurate in the detection and removal of noise, and it is difficult to adjust the parameters according to the degree of weathering on each character. The segment-based relief extraction (SRE) method uses MCRE to obtain segments [29]. Features consisting of appearance-based, cross section-based, and local extrema-based characteristics are extracted for each segment. Then, SRE classifies each segment as either an engraved region or noise by using an SVM [26], and it does not need to adjust the parameters depending on the degree of weathering. The result obtained by SRE is more accurate than those of the other methods.

B. SPIN IMAGES
A spin image is a 3D shape descriptor that is invariant to rotation and translation [30]. The descriptor was mainly introduced for object recognition, surface matching, and facial feature point detection [32], [33]. We apply the spin image as a 2D histogram to represent the distribution of peripheral vertices at each vertex. The projection function, $S_i(j)$, which gives the position $(\rho, \gamma)$ of a neighboring vertex $v_j$ with respect to a vertex $v_i$, is as follows:

$$S_i(j) = (\rho, \gamma) = \left(\sqrt{\|v_j - v_i\|^2 - \left(n_i \cdot (v_j - v_i)\right)^2},\; n_i \cdot (v_j - v_i)\right)$$

where $n_i$ is the normal vector of $v_i$. The neighborhood vertices are mapped relative to reference vertex $v_i$. Then, spin image $S_i$ is obtained as a 2D histogram of $S_i(j)$ over a cylindrical subspace having $n_\gamma$ rows and $n_\rho$ columns in the $\gamma$ and $\rho$ directions, respectively. The bin sizes of the 2D histogram are denoted as $\Delta\gamma$ and $\Delta\rho$. Fig. 1 shows a representation of a spin image. A space with a reference vertex and neighboring vertices is illustrated in Fig. 1 (a). The geometric structure of the spin image is shown in Fig. 1 (b), where $n_\gamma = 4$, $n_\rho = 3$, and the bin sizes, $\Delta\gamma$ and $\Delta\rho$, are arbitrary. The generated spin image is a 4×3 (h×w) histogram, as shown in Fig. 1 (c). Thus, the spin image captures the distribution of vertices located within a cylindrical subspace with a radius of $n_\rho \Delta\rho$ and a height of $n_\gamma \Delta\gamma$.
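As a rough illustration of how such a histogram could be accumulated, the sketch below projects each neighbor into (ρ, γ) coordinates and counts it in a bin. The function names are hypothetical, the normal is assumed to be unit length, and centering the cylinder vertically on the reference vertex is an assumption made for this example.

```python
import math

def spin_coordinates(v_i, n_i, v_j):
    """Return (rho, gamma) of neighbor v_j relative to v_i with unit normal n_i."""
    d = [b - a for a, b in zip(v_i, v_j)]
    gamma = sum(n * c for n, c in zip(n_i, d))      # signed height along the normal
    rho_sq = sum(c * c for c in d) - gamma * gamma  # squared distance from the normal axis
    return math.sqrt(max(rho_sq, 0.0)), gamma

def spin_image(v_i, n_i, neighbors, n_rho=8, n_gamma=10, d_rho=0.35, d_gamma=0.15):
    """Count neighbors in an n_gamma x n_rho histogram (rows: gamma, columns: rho)."""
    image = [[0] * n_rho for _ in range(n_gamma)]
    half_height = n_gamma * d_gamma / 2.0  # assumption: cylinder centered on v_i
    for v_j in neighbors:
        rho, gamma = spin_coordinates(v_i, n_i, v_j)
        col = int(rho // d_rho)
        row = int((half_height - gamma) // d_gamma)  # row 0 is the top of the cylinder
        if 0 <= col < n_rho and 0 <= row < n_gamma:
            image[row][col] += 1
    return image
```

Neighbors that fall outside the cylinder are simply ignored, so the histogram only describes the local surface around the reference vertex.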

C. FC-DENSENET
Image segmentation is considered an essential element in scene understanding, and is a topic being addressed significantly in computer vision and machine learning [34], [35]. Image segmentation means classifying images on a pixel-wise basis, and it is being researched and utilized in various environments, such as face recognition, medical imaging, and autonomous driving. In pixel-wise classification, the global context is very important because the local context alone is not sufficient.
Recently, the advances in deep learning-based convolutional neural networks (CNNs) have achieved high performance in semantic segmentation applications. After the development of the fully convolutional network [36], many models based on encoder-decoder structures were proposed, such as DeconvNet [37], SegNet [38], and U-Net [39]. These models extract features using the encoder structure, and restore them to high-resolution images through the decoder structure.
In particular, U-Net was proposed for cell image segmentation in medical images, and achieved high performance despite being trained on a small data size [39]. Bezmaternykh et al. applied U-Net to segment characters from historical documents [40]. Their research has characteristics similar to the subject of this paper because the characters were extracted from ancient materials that contain a lot of noise due to weathering. Although the data available for the study were not sufficient owing to a limited number of ancient artifacts, U-Net had acceptable results from character segmentation.
Increasing the number of layers in a CNN can increase performance. However, information from previous layers can be lost due to vanishing gradients, which occur when networks are deep. DenseNet was presented to solve this problem [41]. DenseNet includes a dense block architecture in which feature maps of previous layers are concatenated with feature maps in the next layers. Through this structure, a dense block preserves information with a strong gradient flow. Since the features of each layer are connected and preserved, DenseNet has fewer channels, reducing the number of parameters. The dense block also has a regularizing effect, which reduces overfitting.
FC-DenseNet combines U-Net with DenseNet, providing a deep network for image segmentation [31]. FC-DenseNet has very deep networks but a small number of parameters. The performance of FC-DenseNet is promising in the extraction of local and global contexts and has the advantages of both U-Net and DenseNet.

III. THE PROPOSED METHOD
In this section, we describe the proposed method for segmentation of engravings by using deep learning. An overview of the proposed method is shown in Fig. 2. The process is divided into mesh subdivision, surface feature extraction, rasterization, and image segmentation. First, we perform mesh subdivision that transforms the vertices in the mesh from nonuniform to uniform distribution. We then extract features consisting of depth, and concave and local features to characterize the surface. Thereafter, the rasterization method is applied to convert the mesh and features to feature images. Finally, the engraved regions are segmented and identified using FC-DenseNet.
The 3D scanned data are expressed as a polygon mesh, $M = \{V, E, F\}$. Set $V$ contains the list of vertices representing the position of the 3D surface, and it is expressed as $V = \{v_1, \ldots, v_{n_V}\}$, where each vertex is determined by sampling the surface of a real 3D target through a 3D scan. Set $E$ consists of a list of edges connecting two vertices and is expressed as $E = \{e_1, \ldots, e_{n_E}\}$. The set containing all triangulated faces connecting three edges is expressed as $F = \{f_1, \ldots, f_{n_F}\}$. In these notations, $n_V$, $n_E$, and $n_F$ are the numbers of vertices, edges, and faces, respectively, in the mesh. The regions of interest (ROIs) where the characters can exist are manually selected. The proposed method operates on the vertices in the ROIs.

A. MESH SUBDIVISION
The 3D scanned mesh, $M_{scattered}$, has an uneven distribution of vertices. During the 3D scanning process, the vertices of well-scanned areas are dense, and the vertices of poorly scanned areas are sparse. The unequal distribution of vertices is not suitable for feature extraction. In particular, because a spin image is a histogram, it is highly dependent on the number of vertices. Thus, mesh subdivision is essential to equalize the distribution so that the vertex density is the same around every vertex. To transform $M_{scattered}$ into a mesh with evenly distributed vertices ($M_{gridded}$), linear triangular interpolation [42] is applied. After the faces are regenerated, $M_{gridded}$ is grid-shaped, and the vertices are uniformly distributed. This preprocessing is done before the extraction of features. Fig. 3 illustrates an example of mesh subdivision. The surface of a raw mesh and the corresponding mesh with the positions of vertices are shown in Fig. 3 (a) and Fig. 3 (b), respectively. Observe the uneven distribution of vertices in the poorly scanned areas of the mesh in Fig. 3 (b). The mesh in Fig. 3 (c) is the output after mesh subdivision. The sparse areas in the mesh are removed, and the vertices are uniformly distributed, resulting in the same mesh resolution across the surface of the stela.
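A minimal sketch of the resampling idea follows, assuming the surface is treated as a height field z(x, y) and that grid vertices are placed at integer multiples of a grid step. The helper names and the per-triangle loop are illustrative only; they are not taken from [42] or from the authors' implementation.

```python
import math

def interpolate_z(p, a, b, c):
    """Linearly interpolate z at 2D point p inside triangle (a, b, c); None if outside."""
    x, y = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        return None  # degenerate (zero-area) triangle
    # Barycentric weights of p with respect to the three corners.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < -1e-9:
        return None  # p lies outside the triangle
    return w1 * z1 + w2 * z2 + w3 * z3

def resample_triangle(a, b, c, step=0.2):
    """Return the regular-grid vertices (x, y, z) covered by one triangle."""
    xs = (a[0], b[0], c[0])
    ys = (a[1], b[1], c[1])
    points = []
    for i in range(math.ceil(min(xs) / step), math.floor(max(xs) / step) + 1):
        for j in range(math.ceil(min(ys) / step), math.floor(max(ys) / step) + 1):
            z = interpolate_z((i * step, j * step), a, b, c)
            if z is not None:
                points.append((i * step, j * step, z))
    return points
```

Running `resample_triangle` over every face of the scanned mesh would yield vertices on a common grid, from which a grid-shaped mesh can be rebuilt; grid points on shared edges would need de-duplication in practice.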

B. SURFACE FEATURE EXTRACTION

1) Depth
The depth of the mesh represents the basic surface feature of the stela. We apply principal component analysis to the vertices in the ROI of a character to remove the tilt of mesh $M_{gridded}$ relative to the z-axis, and we approximate the surface as a plane. The z-value of each vertex in the aligned mesh is used as the depth, which is normalized with a scale factor $s$ that preserves the scale of the depth and maps it into the range $[0, D_{max}]$. Since the depth deviation is different for each character, scaling is applied. However, certain characters are located on the outer boundary of the stela, and the depth decreases rapidly along the outside boundary, making the value of $D_{max}$ large. Therefore, the depth is limited by Equation (6), which holds the maximum value to 1 and clips values below 0 so that the distribution falls in the range [0, 1].
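One way the scaling and clipping could look is sketched below. The exact expressions are not reproduced here, so the form used (the highest point of the aligned patch maps to 1, depth below it is scaled by `s`, and the result is clipped to [0, 1]) is purely an assumption for illustration, as is the function name.

```python
def normalize_depth(z_values, s=1.0 / 6.0):
    """Scale aligned z-values by s and clip the result to [0, 1].

    Assumed form, not the paper's equations: the highest vertex maps to 1,
    and a vertex lying 1/s units below it (or deeper) maps to 0.
    """
    z_top = max(z_values)
    depths = []
    for z in z_values:
        d = 1.0 - s * (z_top - z)              # scaled depth relative to the top
        depths.append(min(1.0, max(0.0, d)))   # clip to [0, 1]
    return depths
```

With `s = 1/6`, a vertex 3 units below the top maps to 0.5, and anything 6 or more units below the top is clipped to 0, so a steep drop at the boundary of the stela no longer stretches the range.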

2) Concave features
We extract all concave regions of the mesh using MCRE [21]. The Frangi filter effectively extracts concave regions using the principal curvature. However, since the principal curvature is obtained in quadratic differential form, it is vulnerable to noise, and the process requires denoising. Gaussian smoothing is applied to the mesh to address this problem, but the proper parameters must be applied because a high value for standard deviation from Gaussian smoothing removes engraved regions. After Gaussian smoothing, the principal curvatures of the mesh are obtained using Rusinkiewicz's method [9]. Then, concave feature C i of vertex v i is obtained using (1) and has a value in the range [0, 1].

3) Local surface features
To extract a local surface feature, we apply the spin image because of its effectiveness in describing the surface features of 3D objects [30]. The local surface feature denoted as $S_i$ is obtained for every vertex $v_i$. Spin images generated on the surface of a mesh are illustrated in Fig. 4. Depending on the shape of the local surface, spin images are classified into four main forms. For a vertex located on a flat surface, a straight spin image is obtained, while for a vertex located on a convex surface, a downward spin image is generated, and for a vertex in concave areas, an upward spin image is acquired. The last form is for a vertex located in mixed surface areas, where a mixed-form spin image is obtained. However, the number of vertices in a spin image shell increases as the area covered by the shell expands. Because the bins therefore cover different volumes, their counts must be brought to a common unit. The spin image is normalized by the outside shell volume of the cylinder for each bin $(x, y)$, where $x \in \{1, 2, 3, \ldots, n_\rho\}$ and $y \in \{1, 2, 3, \ldots, n_\gamma\}$.
As discussed previously, the spin image is a 2D histogram, so we flatten it to represent it as a feature vector for each vertex. We apply min-max normalization to all $S_i$ to adjust the values into the range [0, 1]. The spin image bins of the mesh are divided into $n_\gamma n_\rho$ channels such that each channel of $S_i$ provides a different local surface image representation of the mesh.
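The normalization and flattening steps can be sketched as follows. The shell volume formula is the geometric volume of a cylindrical shell for one radial bin and one row, and the min-max step uses global extrema over all vertices; both choices, like the function names, are assumptions for illustration rather than the paper's exact expressions.

```python
import math

def shell_volume(x, d_rho, d_gamma):
    """Volume of the cylindrical shell for radial bin x (1-based) and one row."""
    outer = x * d_rho
    inner = (x - 1) * d_rho
    return math.pi * (outer ** 2 - inner ** 2) * d_gamma

def normalize_by_shell(image, d_rho=0.35, d_gamma=0.15):
    """Divide each bin count by the volume of the shell it covers."""
    return [
        [count / shell_volume(x, d_rho, d_gamma) for x, count in enumerate(row, start=1)]
        for row in image
    ]

def flatten(image):
    """Turn an n_gamma x n_rho histogram into one feature vector (row-major)."""
    return [value for row in image for value in row]

def min_max(vectors):
    """Rescale all feature vectors into [0, 1] using global minimum and maximum."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    if hi == lo:
        return [[0.0 for _ in v] for v in vectors]
    return [[(value - lo) / (hi - lo) for value in v] for v in vectors]
```

After these steps each vertex carries a vector with one entry per spin image bin, which is what allows every bin to become its own image channel.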
Representations of the three surface features extracted from the raw mesh in Fig. 3 (a) are illustrated in Fig. 5 (a) to Fig. 5 (c). It is important to note that a single channel representing the image of the local surface feature is shown in Fig. 5 (c).

C. RASTERIZATION
The mesh and the three corresponding features extracted from the mesh are transformed into feature images $I$ so that the engraved regions can be extracted using FC-DenseNet. The position of vertex $v_i$ of the mesh is mapped to the pixel position, and the surface features are mapped to the pixel intensity. When the resolution of the mesh is maintained, the computation cost increases due to the large number of pixels. However, if mesh resolution is reduced, fine features cannot be extracted. Therefore, feature images are generated by applying subsampling to reduce the size of the input images for FC-DenseNet while preserving the resolution of the features.
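A simple sketch of this mapping for a single feature channel is given below. Centering the ROI in a fixed-size image, averaging the vertices that share a pixel as the subsampling step, and leaving empty pixels at zero are assumptions made for this example, and `rasterize` is a hypothetical helper name.

```python
import math

def rasterize(vertices, features, pixel_size=0.6, size=128):
    """Map per-vertex feature values into a size x size image centered on the ROI.

    Vertices that fall in the same pixel are averaged; pixels that receive
    no vertex stay 0, which acts as zero-padding around the ROI.
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    sums = [[0.0] * size for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for (x, y, _z), value in zip(vertices, features):
        col = math.floor((x - cx) / pixel_size + size / 2)
        row = math.floor((cy - y) / pixel_size + size / 2)  # image rows grow downward
        if 0 <= row < size and 0 <= col < size:
            sums[row][col] += value
            counts[row][col] += 1
    return [
        [s / c if c else 0.0 for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]
```

Calling the function once per feature channel with the same vertices produces aligned channels that can be stacked into one multi-channel feature image.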
In addition, the characters on an ancient stela have different sizes, so multiple characters can fall within the kernel of the CNN. This leads to learning imbalances, because small characters are then learned in disproportionately large numbers. Another problem is that characters located on the boundary of the kernel are truncated. Consideration of the global context is significant in removing the noise that shares features similar to the engraved regions, and training on truncated characters degrades the performance of FC-DenseNet because the global context is difficult to extract. We experimentally confirmed that segmentation results are better when a single character exists in the CNN's kernel without truncation. Therefore, we apply zero-padding to the outer region of the ROI so that only one character exists in the kernel. Fig. 6 shows the result of the rasterization process on a sample mesh.

D. IMAGE SEGMENTATION
Engraved regions are segmented by using FC-DenseNet in the image domain. Segmenting engravings directly in the mesh domain is difficult because the large number of vertices in the mesh demands large amounts of memory and incurs high computational costs. A mesh patch of a character contains about 70k vertices, while the majority of existing 3D segmentation methods work on small point clouds, such as those with 4,096 points [43]. Furthermore, sufficient training datasets are not available because the number of ancient artifacts is limited. To address the computational complexity, we convert the mesh to the image dimension, similar to [27], and use FC-DenseNet [31] for image segmentation. FC-DenseNet shows excellent performance in the extraction of local and global contexts, and has the advantages of both U-Net and DenseNet.
The structure of the modified version of FC-DenseNet that we used in this study is shown in Fig. 7. The FC-DenseNet consists of an encoder structure, used for the contracting path, and a decoder structure, used for the expanding path. Details of the FC-DenseNet structure are presented in Fig. 8. The input first passes through the 3×3 convolution layer. Then, encoding is performed on the input data through the dense blocks and transition-down layers, and decoding is applied via the dense blocks and transition-up layers. In the contracting path, all feature maps of the dense block are linked to the feature maps of the next stages, but in the expanding path, only the output channels of the dense block are linked to the feature maps of the next stages. After the last 1×1 convolution layer, the sigmoid function is applied to segment the engraved regions by pixel-wise classification.
The architecture of the dense block is presented in Fig. 9. The number of channels produced by each convolution in the dense block is referred to as the growth rate. Since the feature map of a dense block is concatenated with the result of each convolution, the number of channels in the feature map increases as an arithmetic sequence whose common difference is the growth rate. The growth rate of the dense block is 16.
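The channel growth inside a dense block can be checked with a few lines of arithmetic. The helper below is illustrative only; the input channel count and the number of layers in the example are hypothetical values, not settings reported for the network.

```python
def dense_block_channels(in_channels, n_layers, growth_rate=16):
    """Channel count of the concatenated feature map after each layer of a dense block."""
    return [in_channels + growth_rate * (layer + 1) for layer in range(n_layers)]
```

For instance, `dense_block_channels(256, 4)` gives `[272, 288, 304, 320]`: each convolution adds exactly one growth rate's worth of channels to the running concatenation.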
Transition down reduces the resolution through max pooling, and transition up increases the resolution through transposed convolutions with a stride of 2 while maintaining the same number of channels.
We removed the dropout layers and, unlike the original model, used an initial convolution with 256 channels owing to the large number of channels in our input images. Dice loss [44] is applied during training with the proposed method. In the loss function, $p_{true}$ denotes the ground-truth labels, $p_{pred}$ denotes the predicted labels, and $\epsilon$ is a very small value that prevents division by zero. Generally, the character image has a small proportion of character regions compared to background regions, so there are many true negatives (TNs). The loss is suitable for segmenting the characters because it does not consider TNs. The similarity between the regions for $p_{true}$ and $p_{pred}$ is well represented by the loss, and when the loss decreases, the result of the segmentation becomes closer to $p_{true}$. After prediction, the $p_{pred}$ are converted into meshes using linear triangular interpolation for visualization [42].
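For reference, one common form of the Dice loss is sketched below on flat lists of pixel values. Whether the original formulation uses squared terms in the denominator or places ε differently is not stated here, so this particular variant and its default `eps` are assumptions.

```python
def dice_loss(p_true, p_pred, eps=1e-7):
    """Dice loss for flat lists of labels/probabilities in [0, 1].

    Assumed variant: 1 - (2 * sum(t * p) + eps) / (sum(t) + sum(p) + eps).
    """
    intersection = sum(t * p for t, p in zip(p_true, p_pred))
    total = sum(p_true) + sum(p_pred)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)
```

Pixels where both the label and the prediction are 0 contribute nothing to either sum, which is why adding more correctly predicted background does not change the loss.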

IV. EXPERIMENTS

A. DATASET
In this paper, we automatically extract only the character regions (except for noise) from the stela called Musul-ojakbi, a Korean treasure (No. 516) that was created in the year of "Musul" (578 AD) during the Silla dynasty. The stela was built to commemorate construction of a reservoir, and records information that includes the construction period, the location and size of the reservoir, the number of workers, the names and appointments of the construction managers, and the name of the inscriptor. The information was written in Chinese in the Idu script. Musul-ojakbi is regarded as a highly valuable source in various archeological fields, such as Korean history, facility history, and Idu script history during the Silla period. However, despite its historical value, it is difficult to read the characters due to prolonged weathering.
The reference surface of Musul-ojakbi was not polished. Therefore, the characters were carved on the unflattened surface of natural stone. As a result, some strokes were even carved upon noise. Musul-ojakbi and its very rough surface are shown in Fig. 10. Moreover, the strokes of the characters are short, shallow, and thin. In the 3D scanned data from Musul-ojakbi, the dimensions of the front area, excluding the sides and back of the stela, are 982.4×650.8×24.3 mm (H×W×D). The total number of characters is estimated at 169, with an average character size of 30.6×29.1×2.0 mm. However, Musul-ojakbi contains a lot of noise due to prolonged weathering. The noise regions have features similar to the engraved regions, making it difficult to segment the engraved regions. The noise sometimes crosses engraved regions and covers strokes entirely. An example of a character in Musul-ojakbi is shown in Fig. 11. It contains many dents and cracks, which makes it difficult to distinguish the engraved region. In addition, a lot of noise connects engraved regions, so the boundaries of inscriptions are often ambiguous. Therefore, a global context is required to separate the engraved regions from the noise.

B. ENVIRONMENTS
For the recognition of a character, the centerline of a stroke is more important than the boundaries of the stroke, much as in a character font, provided the centerline is not broken. Furthermore, the importance of the outer boundaries is low, because the boundaries of engraved regions are ambiguous due to noise and abrasions.
In the experiments, we used a 3D scanned mesh of Musul-ojakbi. The mesh consists of approximately 23 million vertices with an average resolution of 0.25 mm.
We analyzed and classified the possible characters originally engraved on the mesh based on weathering. The number of characters in each classification is presented in Table 1. The characters were classified based on human observation.

Fig. 12. Segmentation results: (a) depth color map, (b) DRE [18], (c) CRE [20], (d) MCRE [21], (e) DCRE [21], (f) SRE [29], (g) the proposed method, and (h) ground truth.
With the guidance of an archeologist, we classified the estimated 169 characters on the stela into four classes according to the surface state: high, medium, low, and bad. In particular, bad characters are difficult for users with no experience in epigraphy to recognize without guidance from an expert. Since the distinction of engraved regions is ambiguous due to prolonged weathering, labeling of ground truth (GT) is subjective. To reduce subjective bias while generating GT, one person produced GT for all 169 characters with the guidance of the archeologist.
For evaluation, 134 of the 169 characters were selected. The remaining 35 characters were excluded because they have overlapping ROIs, which makes them difficult to use for learning and evaluation. For the qualitative evaluation, we excluded the characters of bad quality from the 134 characters, because the extremely damaged characters deteriorate the performance of the model. For the noise robustness evaluation, we used all 134 characters, including those of bad quality.
Since the dataset is small, we applied fivefold cross-validation to investigate the reliability of the proposed method. The dataset was randomly divided into five subsets such that the characters of each quality were included in each subset at the same rate. Four subsets were used as the training set, and the remaining subset was used as the validation set to tune the hyperparameters of the model. The process was repeated five times, leading to five results for each model. The scores of each model were averaged to obtain the cross-validation results.
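The splitting scheme can be sketched in a few lines. This is a minimal stratified split written for illustration; the function names, the seed, and the round-robin assignment are assumptions and not the authors' procedure.

```python
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into n_folds so each class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def cross_validation_splits(folds):
    """Yield (train_indices, validation_indices) with each fold held out once."""
    for k, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, validation
```

Each of the five splits trains on four folds and validates on the fifth, and the per-split scores can then be averaged.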
The hyperparameters used in this experiment are as follows. Since the average resolution of the mesh is 0.25 mm, and dense areas of the mesh have a finer resolution than the average, a lower value is required for the mesh subdivision, so 0.2 mm was selected for the grid resolution. The scale s = 1/6 was adopted to obtain depth. To obtain concave features, the values σ = 1.0, α = 0.6, and β = 0.1 were selected. For local surface features, we obtained 10×8 spin images by using ∆γ = 0.15, n_γ = 10, ∆ρ = 0.35, and n_ρ = 8.
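To illustrate how the spin image parameters yield a 10×8 descriptor (80 values per vertex), a simplified sketch is given below. It rests on assumptions not stated in this section: γ is taken to be the signed elevation along the vertex normal, ρ the radial distance from the normal axis, the γ bins are centered on the tangent plane, and plain binning is used instead of interpolation. The function name `spin_image` is hypothetical.

```python
import numpy as np


def spin_image(p, normal, neighbors,
               d_gamma=0.15, n_gamma=10, d_rho=0.35, n_rho=8):
    """Accumulate neighbors of an oriented point into an n_gamma x n_rho grid."""
    normal = normal / np.linalg.norm(normal)
    d = neighbors - p
    # Assumed meaning of gamma: signed elevation along the normal.
    elev = d @ normal
    # Assumed meaning of rho: distance from the axis through p along the normal.
    radial = np.sqrt(np.maximum((d * d).sum(axis=1) - elev ** 2, 0.0))

    gi = np.floor(elev / d_gamma + n_gamma / 2).astype(int)
    ri = np.floor(radial / d_rho).astype(int)
    keep = (gi >= 0) & (gi < n_gamma) & (ri >= 0) & (ri < n_rho)

    img = np.zeros((n_gamma, n_rho))
    np.add.at(img, (gi[keep], ri[keep]), 1.0)  # unbuffered histogram update
    return img


p = np.zeros(3)
normal = np.array([0.0, 0.0, 1.0])
neighbors = np.array([
    [0.1, 0.0, 0.0],    # on the tangent plane, close to p
    [0.0, 0.0, -0.2],   # directly below p
    [10.0, 0.0, 0.0],   # outside the support, ignored
])
img = spin_image(p, normal, neighbors)
print(img.shape, img.reshape(-1).size)
```

Flattening the 10×8 grid gives the 80 local surface feature values associated with one vertex.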
The mesh and features were then converted to 128×128 images via rasterization, which were then subsampled to 0.6 mm. The size of the depth feature image is 128×128×1, that of the concave feature image is 128×128×1, and that of the local surface feature image is 128×128×80. However, certain channels in the local surface feature images are all-zero matrices, and they were removed. Consequently, the size of the local surface feature image was reduced to 128×128×64, and the size of the combined surface feature image is 128×128×66. Since the proposed method is scale-dependent, we applied data augmentation by rotation and translation only. The weight initialization method in FC-DenseNet is HeUniform. The model was trained with a learning rate of 0.0001 and a mini-batch size of 8 using the Adam optimizer.
The performance of the proposed method was compared with engravings obtained with DRE [18], CRE [20], MCRE [21], DCRE [21], and SRE [29]. However, these conventional methods operate in the mesh dimension, making it difficult to directly apply a quantitative evaluation owing to the different resolutions. We therefore performed the quantitative comparison against the proposed method by transforming the estimated engraving results of each method into the image dimension. The boundary accuracy of a character has low importance when recognizing the character, so evaluation in the image dimension with a low resolution can be more accurate for comparison.
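The channel arithmetic above (1 + 1 + 80 channels, reduced to 66 after dropping empty spin image channels) can be checked with a short sketch. The feature values are random placeholders, and which 16 spin image bins are empty is a made-up choice, since the text does not say which channels were removed.

```python
import numpy as np

H = W = 128
rng = np.random.default_rng(0)

# Placeholder feature images; real values come from the rasterized mesh.
depth = rng.random((H, W, 1))
concave = rng.random((H, W, 1))
spin = rng.random((H, W, 80))

# Hypothetical: pretend 16 of the 80 spin image bins are never populated.
empty_bins = np.arange(0, 80, 5)
spin[..., empty_bins] = 0.0

# Keep only channels that contain at least one nonzero pixel.
keep = np.any(spin != 0, axis=(0, 1))
spin_kept = spin[..., keep]

features = np.concatenate([depth, concave, spin_kept], axis=-1)
print(spin_kept.shape, features.shape)
```

In practice the set of removed channels would have to be fixed across the whole dataset, not chosen per image, so that every input to the network has the same 66 channels in the same order.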

C. THE QUALITATIVE EVALUATION
The segmentation results for seven characters selected from the Musul-ojakbi stela are shown in Fig. 12. Fig. 12 (a) visualizes the roughness of the surfaces by applying color maps to the depth differences. DRE shows unsatisfactory results in Fig. 12 (b). DRE estimates the relative depth of each vertex and then separates the engraved regions using the EM algorithm, which assumes a Gaussian mixture model. However, due to weathering, the depth deviation is so severe that the depth distribution does not follow a Gaussian distribution. As a result, the threshold obtained from the EM algorithm was too high or too low for certain characters, and DRE produced unacceptable engravings for them. CRE was significantly affected by noise in the principal curvatures, so its engravings were extracted along with noise, as shown in Fig. 12 (c). The engravings identified with MCRE, shown in Fig. 12 (d), were more accurate than those detected with CRE; the improvement is attributable to the Gaussian smoothing in MCRE, which effectively reduces the noise of the principal curvatures. DCRE, shown in Fig. 12 (e), retained only the deeper concave regions by using two Frangi filters. This considerably reduced the noise compared to MCRE, but it also reduced the stroke thickness, so DCRE failed to detect shallow or thin strokes. SRE classifies engraved regions using an SVM, and its results are illustrated in Fig. 12 (f). SRE eliminated more noise than MCRE. However, since it classifies engraved regions based on candidate segments, it showed unsatisfactory results when the candidate segments were not well obtained. In SRE, some noise remained connected to the engraved regions, and some regions were misclassified because the classification relied primarily on the local context. The results from the proposed method, presented in Fig. 12 (g), are the most similar to the GT in Fig. 12 (h).
Compared with SRE, the proposed method correctly classified the regions that SRE had incorrectly detected as engraved. The proposed method is also more reliable than MCRE, DCRE, CRE, and DRE. The visualization results therefore show that the proposed method, which uses both local and global contexts, performed better than SRE and the other existing methods.
[Fig. 13 panels include (b) MCRE [21], (c) SRE [29], and (d) the proposed method.]

D. THE QUANTITATIVE EVALUATION
The character regions of the rasterized feature images account for 3.4% of the pixels on average in the 128×128 images. This occurs because the size of the characters varies greatly. Because accuracy also counts true negatives (TNs), this imbalance results in very high accuracy for every method. Though higher accuracy is better, it is not an appropriate metric for a proper comparison. A higher precision indicates that less background noise was detected, and a higher recall means that more of the engraved regions were segmented. If precision and recall are both high, without bias toward either, the backgrounds are less noisy and the engravings are well extracted, making the character more recognizable. It is important that neither precision nor recall is biased in character recognition. The F1 score, the harmonic mean of precision and recall, is therefore a good evaluation indicator. Intersection over union (IoU) represents the degree to which the regions overlap, as the ratio of the intersection to the union of the predicted regions and the true regions. We also considered the segmented inscription recognition index (SIRI) [45], a metric that quantifies the subjective recognition score from character recognition. Conventional SIRI divides engraved regions into inner and outer regions [46]. Noise is divided into regions close to and far from the engraved regions. Breaking the centerlines of strokes has a great impact on character recognition, but conventional SIRI simply divides the engraved regions into inner and outer regions and assigns the same weight throughout each region. To address this problem, generalized SIRI [45] applies the level set function to the engraved regions, assigning high weight values to the center of the stroke and decreasing the weight values with distance from the centerline.
The equations of generalized SIRI are given in [45]. The generalized SIRI thus assigns stiff penalties for breaking the strokes and character boundaries. The parameters w_FN and w_FP were 18.4 and 2.7, respectively. The quantitative results from fivefold cross-validation are presented in Table 2. The proposed method outperformed the other methods in accuracy, recall, F1 score, IoU, and SIRI [45] by margins of 0.0013, 0.0238, 0.0295, 0.0365, and 0.0753, respectively. The highest accuracy indicates that the engraving segmentation of the proposed method is better than that of the conventional methods. In terms of precision, the proposed method shows a score similar to that of SRE. However, since our data are skewed, precision does not change even if the strokes of the characters are thinned or disappear. The proposed method achieves the highest recall, indicating that the strokes are extracted nearly as fully as in the GTs. The highest IoU indicates that the predicted regions and the GT regions overlap closely, and the highest SIRI indicates that the engraving segmentation yields the highest subjective recognition. Therefore, although the proposed method detects a similar amount of background noise to the second-best method, it extracts the strokes of the characters more completely. The proposed method shows the best performance, both objectively and subjectively, compared to the conventional methods.
The scores of each feature were evaluated. Each single feature outperformed SRE in all metrics except precision. The result of the single depth feature shows that the proposed deep learning-based method using both local and global contexts performed better than SRE using the local context only. This indicates that the global context must also be used to extract engraved regions of inscriptions with coarse surfaces.
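The pixel-wise metrics used in this section, and the reason accuracy is uninformative when only about 3.4% of pixels are engraved, can be illustrated with a small sketch. The confusion counts below are invented for illustration and are not results from Table 2; the function name `segmentation_scores` is hypothetical.

```python
def segmentation_scores(tp, fp, fn, tn):
    """Pixel-wise accuracy, precision, recall, F1 score, and IoU from counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
    }


total_pixels = 128 * 128
foreground = round(0.034 * total_pixels)  # about 3.4% engraved pixels

# A prediction that marks every pixel as background.
empty = segmentation_scores(tp=0, fp=0, fn=foreground,
                            tn=total_pixels - foreground)

# A hypothetical reasonable prediction (made-up counts).
pred = segmentation_scores(tp=450, fp=100, fn=107, tn=15727)

for name, scores in (("all background", empty), ("hypothetical", pred)):
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

The all-background prediction still reaches an accuracy above 0.96 while its F1 score and IoU are zero, which is why accuracy differences between methods are small and F1 score, IoU, and SIRI are more informative.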

E. ROBUSTNESS TO NOISE
The F1 scores of engraving segmentation according to the degree of abrasion are compared in Table 3. The proposed method showed the highest performance for all states of abrasion. Although the proposed method and SRE produced similar scores for high-quality characters, the performance of the proposed method was significantly higher for the other states, regardless of the degree of abrasion. In particular, for engraving segmentation in the bad state, we used a model from the fivefold cross-validation that was trained using only characters in the high, medium, and low states. The bad characters included characters that are difficult for an ordinary person to recognize. Nevertheless, the performance of the proposed method on characters in the bad state was acceptable. The proposed method therefore outperformed the existing methods both quantitatively and qualitatively, and its satisfactory performance is linked to its robustness to noise.

F. THE PREDICTED ENGRAVED REGIONS FOR THE ENTIRE STELA
The engraving segmentation for all 169 characters on the Musul-ojakbi stela is illustrated in Fig. 13. Compared to the other methods, the engraving prediction from the proposed method showed the best performance. The proposed method also performed very well compared to rubbing, the most common investigation method in archeology. Compared to MCRE, the proposed method extracted the engraved regions accurately, with less influence from noise. In addition, the proposed method correctly identified engraved regions that SRE had misclassified.

V. CONCLUSION
In this paper, we proposed a deep learning-based approach to the extraction of engravings from a 3D scanned mesh of a severely weathered stela. The deep learning model was trained using a basic depth feature, a concave feature, and a local surface feature. Frangi filters and the spin image were employed in the extraction of concave features and local surface features, respectively. The local and global contexts of the features were considered in FC-DenseNet for the segmentation. Through experiments, we confirmed that each subfeature of the proposed features is effective in describing the surface of the stela. The proposed method outperformed conventional methods objectively and subjectively. The F1 score, IoU, and SIRI of the proposed method exceeded those of the second-best method by approximately 2.95%, 3.65%, and 7.53%, respectively. Furthermore, the proposed method showed robustness to noise and achieved an F1 score approximately 5.2% higher than that of the second-best method on extremely abraded characters.