Natural Scene Chinese Recognition Based on Deep Learning

Text recognition is one of the research focuses in the field of computer vision. In recent years, many researchers have proposed recognition methods based on encoder-decoder frameworks, such as CRNN, but most of these methods target English, and relatively few address Chinese. Because Chinese has a large number of character classes and contains many visually similar characters, using only a CNN to extract features loses a large amount of information. Meanwhile, combining attention with sequence modeling is an emerging direction. In this paper, we study Chinese text recognition and introduce attention into the BiLSTM to obtain more contextual information. Compared with the original algorithm, the proposed algorithm achieves improved performance on a synthetic dataset.


Introduction
In deep learning approaches, text recognition largely boils down to a sequence prediction problem. At present, there are two main text recognition frameworks: the CRNN framework and the attention-based encoder-decoder framework. Both align an image feature sequence with a label sequence.
However, the CRNN framework still has room for improvement. In CTC-based methods, the loss of feature detail in the feature extraction network is more serious: when recognizing lines of text that contain small or complex characters, some characters are dropped from the recognition result, or visually similar characters cannot be distinguished. Therefore, this paper combines BiLSTM with attention to obtain more context information and improve the final recognition rate.

Materials and Methods
In this paper, an attention mechanism is inserted between the two BiLSTM layers so that the model attends to the important feature sequences and obtains more context information. Let h_n denote the final hidden state of the first-layer BiLSTM, which is used to generate an attention matrix w; output1 denotes the feature sequence output by the first-layer BiLSTM; a_ij denotes the context vector obtained by attention; and b_ij denotes the context vector output by the first layer, which is then fed into the second-layer BiLSTM. The structure is shown in Figure 1. The feature extraction part is unchanged from the original algorithm. Validation was then performed on the synthetic dataset, and the impact of two-layer versus one-layer BiLSTM on the final recognition results was compared.
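Since the exact attention formulation is not fully specified above, the following PyTorch sketch is only one plausible reading of the described structure: the final hidden state h_n of the first-layer BiLSTM is used to score each time step of output1, and the reweighted sequence is fed to the second-layer BiLSTM. The projection layer and scoring function are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnBiLSTM(nn.Module):
    """Two-layer BiLSTM with an attention step in between (a sketch;
    the exact formulation used in the paper is not fully specified)."""

    def __init__(self, in_dim, hidden):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)  # hypothetical scoring projection
        self.lstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):
        out1, (h_n, _) = self.lstm1(x)                  # out1: (B, T, 2H)
        # Query built from the final hidden states of both directions (h_n).
        query = torch.cat([h_n[0], h_n[1]], dim=-1)     # (B, 2H)
        scores = torch.bmm(self.proj(out1), query.unsqueeze(-1)).squeeze(-1)  # (B, T)
        w = F.softmax(scores, dim=-1)                   # attention weights over time steps
        context = w.unsqueeze(-1) * out1                # reweighted sequence: (B, T, 2H)
        out2, _ = self.lstm2(context)                   # second-layer BiLSTM
        return out2

x = torch.randn(4, 25, 512)        # e.g. 4 text lines, 25 time steps, 512-dim CNN features
y = AttnBiLSTM(512, 256)(x)
print(y.shape)                     # torch.Size([4, 25, 512])
```

In CRNN, such a module would sit between the convolutional feature extractor and the CTC transcription layer, replacing the plain two-layer BiLSTM.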
The RCTW dataset consists of 12,263 natural scene images containing Chinese text; most are captured images and a few are synthetic. The dataset is divided into three parts: training, validation, and test sets. The 10,000 images used in this paper were selected from the images cropped out according to the annotation information.
The synthetic dataset in this paper was generated with the TextRecognitionDataGenerator tool. The background images were collected from the Internet, including wood, iron, wall, and glass backgrounds; for fonts, regular script, Song, and italics are used; the synthesized text comes from domestic and international news crawled from Sina.com, folded into lines of 8 characters each. Blurring and skewing were also applied. Using this method, 300,000 samples with detailed text labels were synthesized. Each sample is fixed at 1000 pixels wide and 128 pixels high, and the label length is fixed at 8 characters; in the annotation format, the image name is the label corresponding to the image. Figure 2 shows some samples from the synthetic data. The data are split into training, validation, and test sets in an 8:1:1 ratio: the training set contains 240,000 samples, and the validation and test sets contain 30,000 samples each.
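The text-folding step above can be sketched in a few lines of Python. This is only an illustration of the described preprocessing (crawled news text chunked into fixed 8-character labels); the actual crawling and the TextRecognitionDataGenerator rendering calls are omitted, and the sample sentence is an invented placeholder.

```python
def fold_text(text, width=8):
    """Fold crawled news text into fixed-length lines of `width` characters,
    dropping whitespace and any trailing fragment shorter than `width`."""
    text = "".join(text.split())  # strip spaces and newlines from the crawl
    return [text[i:i + width] for i in range(0, len(text) - width + 1, width)]

# Placeholder sentence standing in for crawled Sina.com news text.
lines = fold_text("今天北京举行了一场重要的国际新闻发布会现场气氛热烈")
print(lines)  # three 8-character labels; the leftover 25th character is dropped
```

Each resulting 8-character string would then be rendered onto a background image and used as both the sample content and its label.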

Results & Discussion
As the benchmark algorithm, the CRNN network proposed in [3] is used in this work.
We use the line recognition accuracy (perfect match rate) as the evaluation metric: a line is counted as correct only if the predicted string exactly matches the ground-truth label, and the accuracy is the number of correct lines divided by the total number of lines. Table 3 compares CRNN and the algorithm in this chapter on the synthetic dataset. It can be seen that the algorithm in this chapter improves the accuracy, which verifies the validity of the proposed algorithm.
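The perfect match rate can be computed directly from paired predictions and labels; the short helper below illustrates the metric (the example strings are invented placeholders):

```python
def perfect_match_rate(preds, labels):
    """Line recognition accuracy: fraction of predicted lines that
    exactly match their ground-truth labels, character for character."""
    assert len(preds) == len(labels)
    correct = sum(p == t for p, t in zip(preds, labels))
    return correct / len(labels)

# One of the two predictions differs from its label by a single character.
rate = perfect_match_rate(["早间新闻摘要概览", "国际财经快讯速递"],
                          ["早间新闻摘要概览", "国际财经快报速递"])
print(rate)  # 0.5
```

Note that a single wrong character makes the whole line count as incorrect, so this metric is stricter than per-character accuracy.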
Tab. 3 Accuracy of CRNN and the algorithm in this chapter on the synthetic dataset

Method                      Perfect match rate
CRNN                        0.5820
Algorithm of this chapter   0.5837

As shown in Table 4, recognition results of CRNN and the algorithm in this chapter on some samples of the synthetic dataset are demonstrated.

Tab. 4 Partial demonstration of CRNN and the algorithm in this chapter on the synthetic dataset
[Sample images and their recognition outputs are not reproduced here.]
As shown in Table 5, the impact of two-layer versus one-layer BiLSTM on the final recognition results on the synthetic dataset is demonstrated.
Tab. 5 Accuracy of different numbers of BiLSTM layers on the synthetic dataset

Method                      Perfect match rate
CRNN (Two-layer BiLSTM)     0.5820
CRNN (One-layer BiLSTM)     0.5663

Table 6 shows the results of the models trained with different numbers of BiLSTM layers on the synthetic dataset when tested directly on RCTW. A significant decrease in accuracy can be seen, which we conjecture is because the synthetic data are far less complex than the real data.
Tab. 6 Partial results of CRNN model on RCTW.

Conclusion
To address the deficiencies of Chinese natural scene text recognition, the CRNN model is improved: in the sequence modeling phase, attention is introduced to obtain more context information. The CRNN algorithm is then compared with the algorithm proposed in this chapter on the synthetic dataset, and the effectiveness of the improvement is verified. However, this method increases the complexity of the model and requires more training time. Comparing the two-layer BiLSTM with the single-layer BiLSTM, the final accuracy of the latter is slightly lower, possibly because the single-layer BiLSTM has insufficient capacity for the large multi-class classification problem posed by Chinese character recognition.