OCR with the Deep CNN Model for Ligature Script-Based Languages like Manchu

Manchu is a low-resource language that is rarely addressed by text recognition technology. Because its letters are joined into ligatures, conventional text recognition practice requires segmentation before recognition, which limits recognition accuracy. In this paper, we propose a Manchu text recognition system divided into two parts: text recognition and text retrieval. First, a deep CNN model is used for text recognition, with a sliding window replacing manual segmentation. Second, text retrieval finds similarities within the image and locates the position of the recognized text in the database; this process is described in detail. We conducted comparative experiments on the FAST-NU dataset using different quantities of sample data, as well as comparisons with the latest models. The experiments revealed that the optimal accuracy of the proposed deep CNN model reached 98.84%.


Introduction
Optical character recognition (OCR) is the key technology for digitizing many modern scripts, and it is the mainstream technology in text recognition. Indeed, Tausheck obtained a patent on OCR technology 89 years ago. Around the 1950s and 1960s, research on OCR technology began to increase in countries around the world. The system for identifying postal codes in Japan that was developed then is still in use today. Since the 1970s, Chinese character recognition has undergone decades of research and development. In 1984, Japanese researchers designed a device capable of recognizing Chinese characters in multiple typefaces. The recognition rate was as high as 99.98%, with a recognition speed greater than 100 characters per second. At present, the methods and technologies of Chinese character recognition research have matured and have been applied in product design.
Used by ethnic minorities such as the Manchu and Xibe in China, Manchu is a spoken and written language with a phonetic script. Due to the letter concatenation and deformation specific to the Manchu alphabet system, its writing rules are completely different from those of modern Chinese and more closely resemble Mongolian and ancient Chinese. Manchu is read and written from top to bottom and from left to right, and it can also be written in Pinyin characters that are divided into monophthongs, diphthongs, consonants, and digraphs, with the lengths of radicals corresponding to these letters. This can, however, differ across Manchu scripts. During the translation and interpretation of Manchu books, some radicals are difficult to identify due to faults such as consecutive writing, deformation, misprints, scratches, and cracks, which in turn make the corresponding text difficult to identify quickly and accurately. Some of these faults are caused by preservation measures visible in the scanned book image, while for other reasons as well, the false-detection rate of existing Manchu recognition methods on Manchu images has greatly increased. Recognition errors often occur when the Manchu source text is handwritten, so using OCR on handwritten Manchu remains inconvenient.
Manchu recognition methods often require segmentation of Manchu into basic units (e.g., letters) first, followed by recognition [1]. Improvements in Manchu recognition are often simply improvements in segmentation accuracy, which do not solve the fundamental problem of low recognition accuracy caused by letter concatenation and deformation. Manchu words are composed of one or more letters connected along the vertical axis.
There is no gap between letters in the same word [2]. The initial point of each letter lies on the central axis of the Manchu word image, so it is difficult to quickly and accurately recognize Manchu characters from positioned images or handwriting using traditional segmentation methods [3].
This article presents a methodology and system for Manchu text recognition. The recognition method can quickly recognize parts of letters in a text image without word segmentation. Based on this recognition component, we extended the text retrieval system to the full database to find similar texts. The system identifies all of the similar letters in turn in images of Manchu text through a sliding window. The partially recognized letters and the locally adjusted image in the sliding window are compared against the standard images of the letters. Following letter recognition, the associated Manchu characters can be found quickly in the database. The system can output the corresponding letters with their database numbers, as well as locate the standard images of all letters marked as associated letters in the database. The sliding window was proposed to reduce computational complexity and improve recognition accuracy; it can index part of the letter area with higher accuracy, and this local recognition ensures the reliability of letter identification and reduces the probability of false detection.

Background and Existing Work
To create a fully functional OCR system, it is necessary to understand the scripting background of the language and its methodology. The following sections describe the deep learning model in detail, along with various convolutional neural network (CNN) models used in image classification. The glyph structure and preprocessing method for Manchu are also analyzed in detail, in light of the latest methods for recognizing Manchu script.

CNN Models.
As a subfield of machine learning, deep learning has achieved good results in image classification. A variety of CNN models have sprung up that show promising ability in shallow and deep feature mining, continuously improving classification accuracy. These models have a wide range of applications, such as face [4], speech [5], and scene text recognition [6]. A general CNN is usually composed of an input layer, convolutional layers, pooling layers, and a fully connected network. The depth of CNN-related models has gradually increased over the past decade, and the models have gradually become larger. AlexNet [7] has 60 million parameters and 650,000 neurons; it consists of five convolutional layers, with max pooling layers, and three fully connected layers ending in a 1000-way softmax. VGG [8], an architecture with very small (3×3) convolution filters, is used to evaluate networks of increasing depth; significant improvements over prior configurations can be achieved by pushing the depth to 16-19 weight layers. Other studies have focused on reducing model parameters while maintaining high classification accuracy and improving ease of deployment. SqueezeNet [9] proposed reducing the size of the model while achieving the same results; it employs three main strategies: smaller filters, fewer input channels to the filters, and delayed downsampling. ResNet [10] uses a residual learning framework in which each layer learns a residual function of the layer input instead of an unreferenced function. MobileNet [11] uses depthwise separable convolutions based on a streamlined structure and introduces two simple hyperparameters to trade off between speed and accuracy; it is suitable for mobile terminals and embedded devices. DenseNet [12] introduces the densely connected convolutional network to shorten the connections between the input and output layers and to strengthen feature propagation between layers.

Manchu Language OCR.
To explain the text processing procedure during text recognition, we start with a brief overview of the structure and alphabet of Manchu. According to the National Standard of the People's Republic of China, Information Technology — Universal Multiple-Octet Coded Character Set — Sibe and Manchu characters, the same letter in Manchu generally has four different forms: independent, initial, medial, and final. A Manchu word is composed of one or more letters.
Research on Manchu character recognition is still in its infancy, and most studies are based on character segmentation. According to the structural characteristics of the text, Manchu words are divided into individual characters by projection methods or the stroke-by-stroke growth method. Then, backpropagation neural network, statistical pattern recognition, and support vector machine (SVM) methods [13] are used to recognize the characters or strokes. Finally, the individual characters or strokes are combined into characters according to specific rules. The advantage of this approach is that the dataset is smaller than a word-level dataset, which further compresses the training data and reduces the amount of computation [14]. A classifier with a relatively simple structure can also be used for recognition to improve efficiency [15]. However, due to the complexity of the Manchu word structure, correct segmentation of Manchu letters cannot be fully realized, which restricts the accuracy of subsequent character recognition. The character recombination step after recognition also remains to be resolved. For example, Yi et al. built an offline handwritten Manchu text recognition system in 2006 [16]. They first extracted and preprocessed the recognition target, and then they segmented the extracted Manchu text to form Manchu stroke primitives. Next, they performed statistical pattern recognition on the stroke primitives to obtain the stroke sequence. The stroke sequence was then converted into a radical sequence, and a fuzzy string matching algorithm was used to produce a Manchu-Roman transliteration as output, with a recognition rate of 85.2%. Finally, a hidden Markov model method was used to process the recognition results for the stroke primitives, further improving the recognition rate to 92.3% [17]. In 2017, Akram and Hussain proposed a new Manchu stroke extraction method [18].
After preprocessing the recognition target, the main text is determined, and the text growth method is used to extract the Manchu strokes automatically; the stroke extraction accuracy was 92.38%, and the stroke recognition rate was 92.22%. In 2019, Arafat and Iqbal worked on the recognition of handwritten Manchu text [19]. First, the handwritten Manchu text is scanned, and the image is preprocessed. The full-text elements are then segmented. Next, the projection features, endpoints, and intersections are extracted from the Manchu text elements. Together with the chain code features, the three types of features are classified and recognized, and combinations of the three feature types are recognized simultaneously. Finally, to further improve the recognition rate, the hidden Markov algorithm is used to process the recognition results, which yielded the highest recognition rate (89.83%) [20]. These articles on Manchu character recognition are all based on character segmentation and use manually designed shallow feature extraction. However, the segmentation-based OCR method depends on accurate character segmentation technology [21]. For historical document images, correct segmentation of characters becomes challenging due to the complexity of various typefaces and styles, inconsistent lighting during image capture, noise from capture equipment, and variable background colors, among other idiosyncrasies. Moreover, a hand-designed feature extractor requires significant engineering skill and domain expertise. Recognition methods based on character segmentation restrict the recognition accuracy for Manchu words because of errors introduced during character segmentation. This paper therefore proposes a method for the recognition of Manchu words without segmentation, using a CNN to recognize and classify unsegmented Manchu words.
The proposed method also includes improvements to the traditional CNN so that it can train on unsegmented Manchu word images of any size, thereby reducing the influence of normalization preprocessing on the recognition rate.

Methodology
In our present work, we use the deep CNN model to recognize the text and then build a Manchu recognition system. The deep CNN model uses four convolutional layers to mine different image features. In the Manchu recognition system, the sliding window method is used to identify the same characters in the database. These two approaches are discussed below.

Manchu Recognition Algorithm.
This paper builds a CNN to identify and classify Manchu words without segmentation [22,23]. The CNN architecture proposed in this article is shown in Figure 1.
The CNN model constructed in this paper includes nine layers in total: four convolutional layers, two max pooling layers, and the classification layers, which consist of a flattening layer, a fully connected layer, and the output layer [24]. The first layer is a convolutional layer; the number of convolution kernels is set to 32, and the size of each convolution kernel is set to n × n (n = 5, 3, or 2; the appropriate kernel size is selected through experimentation). This layer convolves the data of the input layer to obtain 32 feature maps. Each feature map consists of (28 − n + 1) × (28 − n + 1) neurons, and each neuron has an n × n receptive field. The activation function used is ReLU. The second layer is another convolutional layer, with the same settings as the first layer and ReLU activation. The third layer is a pooling layer, which performs subsampling and local averaging. The receptive field size of each neuron is set to 2 × 2, with one trainable bias, one trainable coefficient, and a sigmoid activation function. The fourth and fifth layers are additional convolutional layers, with the same settings as the first layer and ReLU activation. The sixth layer is another pooling layer, with the same settings as the third layer. The seventh layer is the flattening layer: because the output of the previous layer is multidimensional, the flattening layer reduces it to a one-dimensional feature vector. The eighth layer is a fully connected layer containing 256 neurons, each fully connected to the previous layer. The ninth and final layer is the output layer, which outputs a one-dimensional vector of length 671 with the softmax activation function. Because the word images in the multilevel Manchu dataset are of arbitrary length, the original image must be scaled to a fixed size of 28 × 28.
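The layer sizes described above can be checked with a minimal random-weight forward pass. The NumPy sketch below is only a shape-flow illustration for n = 3 (28 → 26 → 24 → 12 → 10 → 8 → 4, then 512 → 256 → 671); it uses plain max pooling and ReLU in place of the trainable-coefficient sigmoid pooling described for the third layer, and it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_relu(x, w):
    """Valid 2-D convolution followed by ReLU. x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k, cout = w.shape[0], w.shape[3]
    H, W = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(x[i:i+k, j:j+k], w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)

def maxpool2(x):
    """2x2 max pooling with stride 2."""
    H, W, C = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2, C).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(img):
    """Forward pass of the nine-layer architecture (n = 3, 32 filters per conv layer)."""
    w1 = rng.standard_normal((3, 3, 1, 32)) * 0.1
    w2 = rng.standard_normal((3, 3, 32, 32)) * 0.1
    w3 = rng.standard_normal((3, 3, 32, 32)) * 0.1
    w4 = rng.standard_normal((3, 3, 32, 32)) * 0.1
    x = conv2d_relu(img, w1)        # layer 1: 28x28x1 -> 26x26x32
    x = conv2d_relu(x, w2)          # layer 2: -> 24x24x32
    x = maxpool2(x)                 # layer 3: -> 12x12x32
    x = conv2d_relu(x, w3)          # layer 4: -> 10x10x32
    x = conv2d_relu(x, w4)          # layer 5: -> 8x8x32
    x = maxpool2(x)                 # layer 6: -> 4x4x32
    v = x.reshape(-1)               # layer 7: flatten -> 512
    wf = rng.standard_normal((v.size, 256)) * 0.01
    h = np.maximum(v @ wf, 0.0)     # layer 8: fully connected, 256 units
    wo = rng.standard_normal((256, 671)) * 0.01
    return softmax(h @ wo)          # layer 9: 671-way softmax

probs = forward(rng.standard_normal((28, 28, 1)))
```

The 4 × 4 × 32 = 512 flattened features confirm why any input must first be scaled to 28 × 28 under this fixed architecture.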

Manchu Recognition System.
The proposed Manchu recognition system is divided into ten units, and the process is illustrated in Figure 2.
(1) The letter image-reading unit reads the standard image of each letter.
(2) The Manchu image acquisition unit collects the Manchu images and creates binarized images.
(3) The image-preprocessing unit filters the binarized image, extracts the salient area after filtering, and performs edge detection on the salient area to obtain the image for recognition.
(4) The parameter initialization unit sets the initial values of variables i and j.
(5) The sliding image extraction unit searches the image to be recognized using the sliding window method.
(6) The standard line segment extraction unit filters the binarized image of the j-th letter standard image.
(7) The contour line extraction unit scales the i-th contrast image.
(8) The connection strength output unit calculates the connection strength between the vector contour line of the i-th contrast image and the j-th letter standard image and then jumps according to the value of i.
(9) The window sliding unit increases the values of i, j, and W and then jumps according to the value of j.
(10) The result output unit renders the characters and numbers of the letters corresponding to all of the letter standard images marked as associated letters in the database.
The first step in our proposed Manchu recognition system is to read the standard image of each letter. The standard letter images are images of the Manchu alphabet prestored in the database, 114 letter images in total; the database also stores the characters and numbers of the letters corresponding to the standard images, as well as the component numbers of the constituent letters of each Manchu image. The corresponding output can be based on one or more letters in a Manchu character, and the standard image of each letter covers its different glyph forms. The second step is to collect the Manchu images and obtain binarized images.
The Manchu image is acquired by using a line-scan camera, handwriting input, or scanning Manchu books with a scanner. The third step is to filter the binarized image and, after filtering, to extract the salient area and perform edge detection to obtain the image to be recognized. The filtering method could be mean, median, or Gaussian filtering; the salient area could be extracted using the AC [25], histogram-based contrast [26], LC [27], or frequency-tuned saliency detection [28] algorithms. Edge detection on the salient area determines the pixel-area boundaries of the Manchu text in the image. The fourth step is to set the initial value of variable i to 1, where the value range of i is [1, N], i is a natural number, and N is the ratio of the size of the image to be recognized to the size of the sliding window; and to set the initial value of variable j to 1, where the value range of j is [1, M], M is the total number of letter standard images, and the width of the sliding window is W. The fifth step is to search the image to be recognized using the sliding window method. The longest line segment within the sliding window is found through the Hough transform and taken as the i-th line segment, and the clockwise angle between the i-th line segment and the vertical axis (y-axis) of the image matrix is calculated as the i-th angle. When the length of the i-th line segment is greater than or equal to K, the image in the sliding window is rotated counterclockwise by the i-th angle to obtain the i-th contrast image (when the length is less than K, there are no Manchu letters to be recognized in the area of the sliding window), where K = 0.2 × W.
The sliding window method scans the image to be recognized with a window that slides in steps equal to the window size; the size of the sliding window is between [0.01, 1] times the size of the image to be recognized. That is, the height and width of the sliding window are set to W × W, and the value of W is [0.01, 1] times the width of the image to be recognized. This can be adjusted according to the number of characters in the image to be recognized, and the distance of each slide is the width of the sliding window. Each time a row of the image matrix has been swept, the window automatically jumps to the next row according to the height of the sliding window (scanning the image horizontally from the pixel area of the image matrix not yet covered by the sliding window). Note that the height and width are measured in pixels. Figure 3 illustrates a schematic diagram of the scanning slide described in this step. The sixth step is to filter the binarized image of the j-th letter standard image; after filtering, the salient area is extracted, and edge detection is performed on the salient area to obtain the j-th letter standard image to be recognized. The longest line segment in this image, found by the Hough transform, is taken as the standard line segment. The eighth step is to calculate the connection strength between the vector contour line of the i-th contrast image and the j-th letter standard image. To do so, first increase the value of j by 1; when the value of j is greater than M, go to Step 9; when the connection strength is greater than the strength threshold, mark the j-th letter standard image as an associated letter and go to Step 9; and when the value of j is less than or equal to M, go to Step 6.
The intensity threshold is 0.5-0.8 times PNum, or 0.5-0.8 times the number of corner points on the contour line in the vector contour line of the j-th letter standard image. See Appendix B for more details. The ninth step is to increase the value of i by 1, reset the value of j to 1, and slide the window by the distance W. When the value of i is less than or equal to N, go to Step 5; when the value of i is greater than N, go to Step 10. The tenth step is to output the characters and numbers of the letters corresponding to all of the letter standard images marked as associated letters in the database. The output characters and numbers are used to output the Manchu characters, including any character corresponding to more than one standard image marked as part of an associated letter in the database.
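The window scan of steps four, five, and nine above can be sketched as follows. This is a minimal illustration, sliding a W × W window in steps of W and applying the K = 0.2 × W minimum-segment-length test; for brevity it substitutes the longest vertical run of ink pixels for the paper's Hough-transform longest-segment search and skips the rotation step, and all function names are illustrative.

```python
import numpy as np

def longest_vertical_run(window):
    """Length of the longest vertical run of ink pixels in the window.
    (A crude stand-in for the Hough-transform longest-segment search.)"""
    best = 0
    for col in window.T:
        run = 0
        for v in col:
            run = run + 1 if v else 0
            best = max(best, run)
    return best

def slide_windows(binary, W):
    """Scan the binarized image with a W x W window, stepping by W per slide
    and jumping rows by W, and yield only windows passing the K = 0.2 * W test."""
    K = 0.2 * W
    H, Wimg = binary.shape
    for top in range(0, H - W + 1, W):          # jump rows by the window height
        for left in range(0, Wimg - W + 1, W):  # slide by the window width
            win = binary[top:top + W, left:left + W]
            if longest_vertical_run(win) >= K:
                yield top, left, win

# Example: a 20x20 page with one vertical stroke of 7 ink pixels;
# with W = 10, only the top-left window passes the K = 2 test.
page = np.zeros((20, 20), dtype=int)
page[2:9, 3] = 1
hits = list(slide_windows(page, W=10))
```

Because empty windows are rejected before any contour comparison, the expensive per-letter matching of steps six through eight runs only on windows that plausibly contain ink.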

Experiments
We tested the proposed system experimentally. First, the experiment verified the recognition performance of the CNN and deep CNN on different types of unsegmented Manchu word data. The unsegmented Manchu word dataset contains 671 categories; each category has 1,000 samples, so the total size is 671,000. In the training process, the 1,000 sample images of each category were shuffled, 900 images were randomly selected for training, and the remainder was used for testing. The CNN model and deep CNN model were evaluated on 100, 200, 300, 400, 500, 600, and 671 categories. When using the CNN to recognize Manchu words, the original image was normalized to a uniform size of 28 × 28. We tested three networks with convolution kernels of 5 × 5, 3 × 3, and 2 × 2, and the 3 × 3 convolution kernel yielded the best results. The convolution kernels of the two networks were therefore set to 3 × 3, the sliding window size of the max pooling layer was set to 2 × 2, the number of filters was set to 32, the dropout ratio was set to 0.25, and there were four convolutional layers. The experimental results are shown in Table 1.
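The per-category shuffle and 900/100 split described above can be sketched as follows; `split_category`, its seed, and the dummy data are illustrative choices, not the authors' code.

```python
import numpy as np

def split_category(images, labels, n_train=900, seed=0):
    """Shuffle the 1,000 samples of one category, then split 900 for training
    and 100 for testing, as in the experiment setup."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    train, test = idx[:n_train], idx[n_train:]
    return images[train], labels[train], images[test], labels[test]

# Example with dummy data standing in for one 1,000-sample category
images = np.arange(1000).reshape(1000, 1)
labels = np.zeros(1000, dtype=int)
x_tr, y_tr, x_te, y_te = split_category(images, labels)
```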
Under the same parameters, in the recognition and classification of different categories of unsegmented Manchu words, the recognition rate of the deep CNN is 0.76%, 0.61%, 1.00%, 0.89%, 0.82%, 0.73%, and 1.23% higher for the respective category counts than that of the traditional CNN. This indicates that the CNN was improved by the spatial pyramid pooling layer, which has a certain inhibitory effect on the distortions caused by image normalization. At the same time, it can be seen that, even for all 671 categories, the CNN can obtain a high recognition rate.
We collected papers on Manchu text recognition and summarized the classification accuracy in Table 2. Li et al. [29] used a spatial pyramid pooling layer in the CNN to replace the last max pooling layer and proposed a classifier accepting inputs of any size to recognize Manchu words without word segmentation. Zheng et al. [30] put forward the idea of segmentation-free recognition of Manchu, replacing Manchu characters with words; an end-to-end nine-layer CNN was also proposed to extract the deep features of Manchu text images automatically. Xu et al. [31] improved the traditional projection segmentation method, effectively improving the accuracy of segmentation. The method proposed in this paper is more accurate than these methods by 1.16%, 3.84%, and 11.44%, respectively. Next, we considered the influence of different numbers of convolutional layers on recognition and classification. The number of convolutional layers was set to 2, 4, and 6 for both the CNN and the deep CNN. The convolution kernel of the two networks was set to 3 × 3, the sliding window size of the max pooling layer was set to 2 × 2, the number of filters was set to 32, and the dropout ratio was set to 0.25. The experimental results are shown in Table 3.
As can be seen from Table 3, as the number of convolutional layers increases, the accuracy increases. At the same time, for different numbers of convolutional layers, the recognition rate of the deep CNN is higher than that of the traditional CNN by 0.50%, 1.23%, and 1.25% for 2, 4, and 6 layers, respectively. This further shows that the CNN is improved by using the spatial pyramid pooling layer.
The experimental results show that, under the same parameters, the recognition rate of the deep CNN model designed in this paper for the recognition and classification of different categories of unsegmented Manchu words is higher than that of the traditional CNN. Compared with the methods proposed in other papers, the method proposed here has clear advantages. For different numbers of convolutional layers, the recognition rate of the deep CNN is higher than that of the traditional CNN, and the deep CNN model proposed in this paper avoids the feature expression problem caused by image normalization. It was tested in a recognition experiment with Manchu words of different lengths and obtained higher recognition accuracy than the traditional CNN model.

Conclusion
Traditional CNNs require input images of a consistent size. Manchu is a phonetic script, and its word length is not fixed. Preprocessing to ensure a uniform size is therefore required before recognition and classification, and such preprocessing reduces the recognition rate. To reduce the effect of image normalization preprocessing on the recognition rate, this paper improves the traditional CNN and constructs a new nonsegmented Manchu word recognition network model: the deep CNN. We also proposed a Manchu identification system, which locates the position of the recognized text in a database. The model solves the image normalization problem, realizes the extraction of deep features from unsegmented Manchu word images of any size, and recognizes and classifies different types of unsegmented Manchu word data. Using the deep CNN to recognize and classify unsegmented Manchu words in categories of 100, 200, 300, 400, 500, 600, and 671, the recognition rates were 99.85%, 99.88%, 99.24%, 99.35%, 99.13%, 98.98%, and 98.84%, respectively. The experimental results indicated that the deep CNN reduced the effects caused by image normalization preprocessing and obtained higher recognition accuracy than the traditional CNN model.

A. Extraction Method
This section briefly describes the extraction method for the vector outline of the j-th letter standard image in the seventh step. Starting from the end of the standard line segment closest to the ordinate axis, the curvature value of each edge point on the standard line segment is calculated. The average of all the curvature values is then calculated, and all of the edge points whose curvature value is greater than the average are identified as corner points; these corner points constitute a large-curvature point set. Each corner point in the large-curvature point set is connected in turn, according to the steps outlined in the following:
(1) Let the coordinates of the corner point with the smallest abscissa (x-axis) value in the large-curvature point set be (Xmin, Ymin), and set the [...] Ymax). Set the spacing formed by multiple columns of pixels on the y-axis to span pixels, where span is an integer between 10 and 100. Set the initial value of variable h to 0 and the initial value of variable r to 1, where both h and r are natural numbers. [...] go to Step 10 (that is, the connection process ends). If there is no corner point with Linkmark = 3 in the to-be-connected interval of the (r+1)-th layer, then judge whether the connection mark Linkmark of all of the corner points in the large-curvature point set is equal to 0 (that is, whether the connection process has ended); if so, go to Step 10; if not, go to Step 6.

Table 2: Classification accuracy of Manchu recognition methods.
Reference | Accuracy (%)
Li et al. [29] | 97.68
Zheng et al. [30] | 95
Xu et al. [31] | 87.4
Present paper | 98.84
(6) Input the corner points with Linkmark = 0 in the connected interval of the r-th layer into an array as the connected array; input the corner points with Linkmark = 1 in the to-be-connected interval of the (r+1)-th layer into another array as the to-be-connected array; the corner points in the connected array and the to-be-connected array are sorted by ordinate value from small to large.
(9) When the array mark ArrayMark of all of the corner points in the connected array or the to-be-connected array is equal to 0 (that is, all corner points in any range within the spacing between the two coordinate axes have been connected), increase the variables h and r by 1, set the array mark ArrayMark of all corner points in the to-be-connected array to 1, set the Linkmark of all the corner points in the to-be-connected array and the connected array to 0, and go to Step 3 (i.e., connect the next coordinate-axis spacing range); otherwise, go to Step 6 (i.e., continue to connect the corner points in the connected array and the to-be-connected array).
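The curvature selection at the start of this appendix can be sketched as follows. The paper does not fix a discrete curvature estimator, so the sketch approximates curvature by the turning angle between successive contour segments and keeps points above the average, per the description; `corner_points` is an illustrative name.

```python
import numpy as np

def corner_points(contour):
    """Keep contour points whose discrete curvature (approximated here by the
    turning angle between successive segments) exceeds the average curvature,
    forming the large-curvature point set of Appendix A."""
    pts = np.asarray(contour, dtype=float)
    v_in = pts - np.roll(pts, 1, axis=0)     # incoming segment at each point
    v_out = np.roll(pts, -1, axis=0) - pts   # outgoing segment at each point
    ang_in = np.arctan2(v_in[:, 1], v_in[:, 0])
    ang_out = np.arctan2(v_out[:, 1], v_out[:, 0])
    turn = np.abs(np.angle(np.exp(1j * (ang_out - ang_in))))  # wrapped to [0, pi]
    return pts[turn > turn.mean()]

# Example: on a square traced through its corners and edge midpoints,
# only the four true corners have above-average turning angle.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
corners = corner_points(square)
```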

B. Calculation of Connection Strength
This section briefly describes the method for calculating the connection strength between the vector contour lines of the i-th contrast image and the j-th letter standard image in the eighth step.
(1) Let the vector contour line of the i-th contrast image be P and the vector contour line of the j-th letter standard image be Q; superimpose P and Q about their common center of gravity. Here, P = {p_1, p_2, ..., p_k | k > 0}, where k is the number of corner points on the contour line of the i-th contrast image; Q = {q_1, q_2, ..., q_n | n > 0}, where n is the number of corner points on the contour line of the j-th letter standard image; and p_k and q_n are corner points on the contour lines. p_1 and q_1 are the corner points on P and Q with the smallest distance to the ordinate axis, and p_1, p_2, ..., p_k and q_1, q_2, ..., q_n are ordered by increasing distance of the index number from the ordinate axis.
(2) Connect the nearest corner points on P and Q with vector lines in sequence; the set of connected edges is V = {(p_FC, q_FC)}, where FC indexes the corner points, its value range is [1, PNum], and PNum is a constant equal to the smaller of the two values k and n.
(3) Calculate the distances from the endpoints p_FC of all connected edges in V to p_1 (that is, the distances from all corner points p_1, p_2, ..., p_PNum to p_1), and take the calculated distances in turn as the first distance set. Calculate the distances from the endpoints q_FC of all connected edges in V to q_1 (that is, the distances from all corner points q_1, q_2, ..., q_PNum to q_1), and take the calculated distances in turn as the second distance set.
Calculate the difference between each distance element in the first distance set and the corresponding distance element in the second distance set in sequence; count how many of the differences are positive and how many are negative; when the number of positive differences is greater than the number of negative differences, go to Step 4; otherwise, go to Step 5 (a distance element is each distance value in the set).
(4) Calculate the connection strength S of the connected edge set V and go to Step 6, where S = Σ_{FC=1}^{PNum} Coeffi(|p_FC − q_FC|_x), Coeffi is the similarity function, and |p_FC − q_FC|_x denotes the difference between the abscissa value of point p_FC and the abscissa value of point q_FC.
(5) Calculate the connection strength S of the connected edge set V analogously using the ordinate differences and go to Step 6, where |q_FC − p_FC|_y denotes the difference between the ordinate value of point q_FC and the ordinate value of point p_FC.
(6) Output the connection strength S.
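A minimal sketch of the computation above, assuming the corner points are already ordered by distance from the ordinate axis. The paper's similarity function Coeffi is not reproduced here, so an exponential decay exp(−d) is assumed in its place; `connection_strength` is an illustrative name.

```python
import numpy as np

def connection_strength(P, Q, coeffi=lambda d: np.exp(-d)):
    """Connection strength between two vector contour lines, per Appendix B.
    P: (k, 2) corner points of the contrast image; Q: (n, 2) corner points of
    the letter standard image. Coeffi here is an assumed exponential decay."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    P = P - P.mean(axis=0)                    # superimpose about the centers of gravity
    Q = Q - Q.mean(axis=0)
    pnum = min(len(P), len(Q))                # PNum = min(k, n)
    dx = np.abs(P[:pnum, 0] - Q[:pnum, 0])    # |p_FC - q_FC|_x per connected edge
    return float(np.sum(coeffi(dx)))          # S = sum over FC of Coeffi(|p - q|_x)

# Example: identical contours give the maximum strength PNum * Coeffi(0) = PNum,
# which is why the strength threshold can be set relative to PNum.
P = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
S = connection_strength(P, P)
```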

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.