Natural Scene Chinese Character Text Detection Method Based on Improved CTPN

Text detection in natural scenes plays an important role in many research fields, but it remains a challenging topic because of variations in text size, font, line orientation, and lighting conditions, low text contrast, and complex image backgrounds. We improve the CTPN text detection network by redefining the scaling mechanism of the Side-refinement detection-box merging step, and, based on experiments, we replace the LSTM network with a GRU network. On the natural scene Chinese character text dataset released by Meituan, our method reaches an F1-Measure of 0.78, and it reaches F1-Measures of 0.89 and 0.61 on the ICDAR 2013 and ICDAR 2015 datasets respectively. Compared with the 0.88 and 0.61 reported in the CTPN paper, this is an improvement.


Introduction
With the rapid development of Internet technologies and portable mobile devices, more and more application scenarios require the use of textual information in images. At present, natural scene text detection has become a research hotspot in the fields of computer vision, pattern recognition, and document analysis and recognition. Effective scene text detection can enhance the performance of many multimedia applications, such as mobile visual search, content-based image retrieval, and automatic sign translation.
However, research on Chinese character text detection in natural scenes is comparatively scarce, which gives it great research significance. Because natural scene datasets of Chinese characters are few, writing styles vary widely, and many fonts look similar, related research is limited and recognition performance is poor. Yet many natural scene images require Chinese character recognition. A breakthrough in Chinese character recognition in natural scenes would contribute greatly to content-based image retrieval, driverless vehicles, intelligent transportation systems, visual perception assistance, and other fields that handle Chinese text. In addition, progress in this technology will help advance other fields, giving it far-reaching theoretical significance and broad application prospects.
In recent years, one of the stronger text detection methods has been CTPN (Connectionist Text Proposal Network), proposed by Zhi Tian et al. in 2016. This method was the first to combine a convolutional neural network with a bidirectional recurrent neural network (BiLSTM), making clever use of the semantic sequence characteristics of text, and it reached F1-Measures of 0.88 and 0.61 on the ICDAR 2013 and 2015 datasets respectively.

Introduction to CTPN Network
Before the CTPN model was proposed, the stronger model was Faster R-CNN [8], and at that time many text localization algorithms were optimizations of it. However, Faster R-CNN did not consider the characteristics of text itself. Text lines generally appear as horizontally long rectangles, each word in a text line is separated by an interval, and there is semantic association between the characters. In response to these characteristics, CTPN introduced a novel idea: split the task of text detection. The first step is to detect a small portion of a text box and decide whether it belongs to a piece of text. After all the small text boxes in a picture are detected, the small boxes belonging to the same text line are merged; after merging, a complete large text box is obtained and the text detection task is finished. In this detection process, CTPN therefore introduces an idea mathematically similar to "differentiation", as shown in Figure 1 [3]: first detect small text segments of fixed width, then connect these segments in a post-processing step to obtain text lines.

Figure 1. The "differential" idea of CTPN

The specific implementation process of CTPN is:
• Use the first 5 Conv stages of VGG16 to obtain the feature map, of size W*H*C;
• Use a 3*3 sliding window to extract features from that feature map, and use these features to predict multiple anchors. Here the anchor definition is the same as in Faster R-CNN, i.e. it defines a target candidate region;
• Feed the features from the previous step into a bidirectional LSTM, output a result of size W*256, and then feed that result into a 512-dimensional fully connected layer (FC);

• Finally, the classification and regression outputs are divided into three parts (from top to bottom in the figure): 2k vertical coordinates, giving the height and the center y-coordinate of each proposal box; 2k scores, giving the category information of the k anchors (text or not); and k side-refinement values, giving the horizontal offset of the proposal box. In the experiment the horizontal width of an anchor is 16 pixels, i.e. the minimum proposal box is 16 pixels wide;
• Using a text-line construction algorithm, the resulting thin rectangles (shown in Figure 2 [3]) are merged into sequence boxes of text.
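The fine-scale proposal scheme above can be made concrete with a small sketch. This is not the authors' code; it only illustrates fixed-width anchors of varying heights, assuming the commonly cited CTPN setting of k = 10 heights growing geometrically from 11 to 273 pixels:

```python
def ctpn_anchor_heights(k=10, h_min=11, ratio=0.7):
    # Heights grow geometrically: each next height = previous / ratio,
    # giving the CTPN range of roughly 11 to 273 pixels for k = 10.
    heights = [float(h_min)]
    for _ in range(k - 1):
        heights.append(heights[-1] / ratio)
    return [round(h) for h in heights]

def anchors_at(cx, cy, heights, width=16):
    # Every anchor shares the same fixed 16-pixel width and the same
    # center; only the height varies. Boxes are (x1, y1, x2, y2).
    return [(cx - width / 2, cy - h / 2, cx + width / 2, cy + h / 2)
            for h in heights]
```

At each feature-map position the network then scores these k candidates and regresses only their vertical extent, since the horizontal extent is fixed by construction.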

Improved CTPN network
Our improvements include improvements to the Side-refinement detection frame merging mechanism, taking height information into the detection location and com-bining, and also changing the BiLSTM network to GRU, thereby accelerating network training and application runtime and improving network efficiency.

Side-refinement merge mechanism improvement
Several parts of the authors' implementation of CTPN deserve special attention: Detecting Text in Fine-scale Proposals (selecting the anchors, i.e. the candidate "rectangular differential boxes"), Recurrent Connectionist Text Proposals (the RNN process that exploits context information), and Side-refinement (text-line construction, combining multiple proposals into a line). Among them, the task of the Side-refinement stage is to combine the located "small rectangles" to obtain the position of the required text. Only small rectangles with score > 0.7 are kept; the small red rectangles in the figure are merged, and finally a large yellow rectangle is produced, as shown in Figure 3 [3].

Figure 3. Results of Side-refinement

The main idea is that every two similar proposals (i.e. candidate regions) form a pair, and different pairs are merged until no further merging is possible. Two proposals Bi and Bj form a pair when Bi->Bj and Bj->Bi. Because the regression box has a fixed width of 16 pixels, some position error is introduced, so Side-refinement is defined by the following formula [3]:

o = (x_side - cx_a) / w_a,    o* = (x*_side - cx_a) / w_a

Here the quantities marked with * denote the ground truth, x_side represents the regressed left or right boundary, cx_a represents the x-coordinate of the anchor center, and w_a is the fixed anchor width of 16 pixels. The definition of o is therefore equivalent to a scaling ratio that stretches the regressed box so that it better matches the position of the actual text.
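The side-refinement offset and its inverse can be written down directly from the definition above. This is a minimal sketch of the arithmetic, not the authors' implementation:

```python
ANCHOR_WIDTH = 16.0  # fixed proposal width in pixels

def side_offset(x_side, cx_anchor, w_anchor=ANCHOR_WIDTH):
    # o = (x_side - cx_a) / w_a : the true left/right text boundary,
    # normalized by the anchor width relative to the anchor center.
    return (x_side - cx_anchor) / w_anchor

def apply_side_offset(o, cx_anchor, w_anchor=ANCHOR_WIDTH):
    # Inverse mapping: recover the refined boundary coordinate
    # from a predicted offset at test time.
    return o * w_anchor + cx_anchor
```

During training the network regresses o toward o* computed from the ground truth; at test time `apply_side_offset` converts the prediction back into a pixel coordinate.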
However, for the block-character structure of Chinese text, CTPN only refines the left or right boundary of the bounding box. Since Chinese text in natural scenes generally sits at a consistent height, this paper extends the Side-refinement definition to take the text-height boundary into account, forming the following new definition of the scaling ratio o.
Here the quantities marked with * again denote the ground truth, and y_side represents the regressed top or bottom boundary; the rest is the same as in the original paper. y_side brings the text height information into the scaling ratio, exploiting the relatively consistent height of scene Chinese text so that the merged detection boxes are more accurate in the height direction. Detection results on Chinese text are shown in Figure 4: the green boxes are the improved CTPN results and the red boxes are the original CTPN results. It can be seen that CTPN does not perform well when the text height varies, while the improved CTPN, by considering the height information (y_side), achieves clearly better detection of Chinese text.
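By analogy with the horizontal formula, the extended refinement can be sketched as computing a vertical offset alongside the horizontal one. The normalization constant `h_a` below is an illustrative choice, not a value taken from the paper:

```python
def refine_offsets(x_side, y_side, cx_a, cy_a, w_a=16.0, h_a=16.0):
    # Horizontal offset as in the original CTPN Side-refinement,
    # plus an analogous vertical offset: the regressed top/bottom
    # boundary normalized by a reference height h_a (hypothetical).
    o_x = (x_side - cx_a) / w_a
    o_y = (y_side - cy_a) / h_a
    return o_x, o_y
```

The point is only that the merging step now has a scaled vertical term to match against the ground truth, in the same way the original had a scaled horizontal term.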

BiLSTM changed to GRU
In the CTPN paper, the Recurrent Connectionist Text Proposals stage applies a bidirectional LSTM to the features produced by the VGG16 feature extraction layers and the 3*3 sliding window, performs feature sequence prediction, and finally obtains a feature sequence of depth 256. When training this network on the Chinese character text dataset published by Meituan with a GTX 1060 GPU, the training loss decreases slowly; the loss curve is shown in Figure 5, and training still took 17 hours even in the GPU-accelerated environment.

Figure 5. Loss curve when training with BiLSTM

Because the loss declines slowly and convergence is slow, this paper replaces the BiLSTM network with a GRU (Gated Recurrent Unit) network in the improved CTPN network [9]. The comparison between LSTM and GRU is shown in Figure 6 [9].
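The speed-up from this swap has a simple structural explanation: an LSTM cell has four gated transformations (input, forget, and output gates plus the cell candidate) while a GRU has three (update and reset gates plus the candidate state), so a GRU of the same hidden size carries roughly three quarters of the parameters. A back-of-the-envelope count, assuming standard single-layer cells without projection:

```python
def rnn_param_count(input_size, hidden_size, gates):
    # Each gate has an input weight matrix (input_size x hidden_size),
    # a recurrent weight matrix (hidden_size x hidden_size), and a bias.
    return gates * (input_size * hidden_size
                    + hidden_size * hidden_size
                    + hidden_size)

def lstm_params(i, h):
    # LSTM: input, forget, output gates + cell candidate = 4 gates.
    return rnn_param_count(i, h, gates=4)

def gru_params(i, h):
    # GRU: update and reset gates + candidate state = 3 gates.
    return rnn_param_count(i, h, gates=3)
```

Since both counts are linear in the number of gates, the ratio is exactly 3/4 regardless of the input and hidden sizes, which is consistent with the faster training observed here.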

Experimental description and comparison
This section includes an introduction to the Chinese character text dataset, comparative experiments on ICDAR 2013 and ICDAR 2015, a comparison between CTPN and the improved CTPN, and the evaluation indicators, which include precision P, recall R, F1-Measure F, and run time T. The images were collected by different people, using different devices, at different locations, times, and environments. The dataset is mainly Chinese text, and the annotations are relatively complete: each picture is labeled with the position and content of every single character, as well as the position and content of every string. Among them, 20,000 pictures were used for training, 2,000 for validation, and 3,000 for testing.

Data set and positioning effect display
The model obtained by training the improved CTPN network is used to detect the position of Chinese text. The effect is shown in Figure 8:

Comparison of evaluation indicators
In addition to the Chinese text dataset described above, we also conducted experiments on the ICDAR 2013 and ICDAR 2015 datasets. For evaluation indicators we selected, as in the original CTPN paper, precision (P), recall (R), F1-Measure (F), and the predicted localization time T. The comparison of the ICDAR 2013 indicators is shown in Table 1 and Table 2; the other methods are the results reported in the CTPN paper.
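For completeness, the F1-Measure used throughout is the harmonic mean of precision and recall; a minimal sketch:

```python
def f1_measure(precision, recall):
    # Harmonic mean of precision and recall;
    # defined as 0 when both inputs are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high F simply by trading recall away for precision or vice versa.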

Conclusion
We improved the CTPN text detection network for the characteristics of natural scene Chinese character data, combining the height information of the ground truth box into the detection-box merging mechanism to form a new detection and merging mechanism. In addition, BiLSTM is replaced with a GRU network, which greatly reduces the running time of network training and testing and effectively improves the efficiency of the algorithm. We apply the improved CTPN network to Chinese character text detection in natural scenes.