A Novel Scene Text Recognition Method Based on Deep Learning

Scene text recognition is one of the most important techniques in pattern recognition and machine intelligence because of its numerous practical applications. Scene text recognition is also a sequence modeling task, and the recurrent neural network (RNN) is commonly regarded as the default starting point for sequence models. However, because of non-parallel prediction and the vanishing gradient problem, the performance of RNNs is difficult to improve substantially. In this paper, a new network architecture named TRDD, based on dilated convolution and residual blocks, is proposed; it uses convolutional neural networks (CNNs) instead of RNNs to recognize sequence texts. Our model has three advantages over existing scene text recognition methods. First, the text recognition speed of the TRDD network is much faster than that of state-of-the-art scene text recognition networks based on recurrent neural networks. Second, TRDD is easier to train because it avoids the exploding and vanishing gradient problems, which are major issues for RNNs. Third, the receptive field size can be changed flexibly, either by using larger dilation factors or by increasing the filter size. We benchmark TRDD on four standard datasets: it achieves higher recognition accuracy and faster recognition speed with a smaller model, which makes it promising for real-time applications.

Erroneous splitting or merging during word segmentation directly affects recognition accuracy. In addition, these methods adopt isolated character classification, recognize each character separately, and discard the meaningful context information of the text, which reduces their reliability and robustness. To solve this problem, sequence text recognition [Españaboquera, Castrobleda, Gorbemoya et al. (2016); Bissacco, Cummins and Netzer (2013); Xiong, Wang, Zhu et al. (2018)] has been proposed. For scene text, no segmentation is needed: the whole text is directly recognized as a sequence. The strong sequence features extracted by a deep neural network (DNN) ensure robustness to distorted text and cluttered backgrounds. Sequence text recognition has become the mainstream approach to scene text recognition; representative systems such as CRNN [Shi, Bai and Yao (2015)], DTRN [He, Zhang, Ren et al. (2015)] and FAN [Cheng, Bai, Xu et al. (2017)] generally use an RNN to learn the contextual information of the text. The RNN was almost the only choice for sequence models, but it has disadvantages such as lack of parallelism, unstable gradients, and high memory requirements during training [Bai, Kolter and Koltun (2018)], so researchers have been looking for better models to replace it. In recent research, the temporal convolutional network (TCN) has been applied across sequence tasks, outperforming canonical recurrent architectures such as LSTM [Hochreiter and Schmidhuber (1997)], GRU [Jozefowicz, Zaremba and Sutskever (2015); Dey and Salemt (2017)] and the RNN on 11 sequence tasks [Bai, Kolter and Koltun (2018)]. The TCN is essentially a CNN that integrates dilated causal convolution and residual blocks. Motivated by the TCN design, this paper proposes a new network model, TRDD (Text Recognition based on Dilation and residual blocks).
The model uses two basic residual modules: one composed of dilated convolutions and the other of ordinary convolutions. The TRDD network has the following characteristics. First, in both training and evaluation, a long input sequence is processed as a whole, instead of sequentially as in an RNN. Second, the receptive field size of the sequence features can be increased by using larger dilation factors or by increasing the filter size. Third, residual blocks are used to speed up network training and to enrich the semantic features of the text. Fourth, the filters in TRDD are shared across a layer and the backpropagation path depends only on the network depth, whereas gated RNNs in practice are likely to take up much more memory.
The goal of supervised network training is to find a network $f$ that minimizes some expected loss between the predictions and the actual outputs: $L(y_0, y_1, \ldots, y_T, f(x_0, x_1, \ldots, x_T))$.
The RNN was once considered the only option for processing sequence data. The RNN architecture is shown in Fig. 1: $x_t$ is the input, $h_t$ is the hidden layer unit, $o_t$ is the output, $L$ is the loss function, and $y$ is the label of the training set. $h_t$ represents the state at time $t$, which is determined not only by the input $x_t$ but also by $h_{t-1}$. $U$, $V$ and $W$ are the weights, and connections of the same type share the same weights.
The BPTT (back-propagation through time) algorithm is a commonly used method for training RNNs. It is essentially the BP algorithm with back-propagation through time, which continuously searches along the negative gradient direction until the model converges. The partial derivatives of the weights at time $t$ are given in formulas (9, 10). A major issue with the RNN is the vanishing gradient problem, which is caused by its architecture. The activation function of an RNN is generally the sigmoid function or the $\tanh$ function (formulas (11, 12)); the function graphs are shown in Figs. 2(a) and 2(b). In the back-propagation gradient calculation, it can be seen from formulas (9, 10) that the gradient contains a product of the derivatives of sigmoid or $\tanh$ taken over the time series.
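The shrinking product of bounded derivatives is easy to see numerically. The following sketch (illustrative only; the pre-activation values are randomly made up) multiplies 50 per-step derivative factors, as they appear in the BPTT chain rule:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # bounded above by 0.25

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # bounded above by 1.0

# Product of T per-step derivative factors, as in the BPTT chain rule.
rng = np.random.default_rng(0)
pre_activations = rng.normal(size=50)  # hypothetical pre-activations over 50 time steps

sig_product = np.prod(sigmoid_grad(pre_activations))
tanh_product = np.prod(tanh_grad(pre_activations))

print(sig_product)   # vanishes very fast: every factor is at most 0.25
print(tanh_product)  # vanishes more slowly: every factor is at most 1.0
```

Because each sigmoid factor is at most 0.25, the 50-step product is bounded by $0.25^{50} \approx 8 \times 10^{-31}$, which illustrates why gradients over long sequences effectively disappear.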

Figure 2: Activation function and its derivatives
It can be seen from Figs. 2(c) and 2(d) that the derivative of the sigmoid function lies in (0, 0.25) and the derivative of the $\tanh$ function lies in (0, 1]. The product of many such derivatives therefore becomes smaller and smaller until it approaches zero, which is the phenomenon of "gradient vanishing". The second problem of the RNN is that predictions for later time steps are performed sequentially: $\hat{y}_t$ depends only on $x_0, x_1, \ldots, x_t$ and not on any "future" inputs $x_{t+1}, \ldots, x_T$. The prediction for a later time step must wait for its predecessors to complete, so it cannot be done in parallel as CNN predictions can. Finally, the RNN takes up too much memory during training. For a long input sequence, an RNN can easily consume a large amount of memory to store temporary and partial results; because its backpropagation path depends not only on the network depth but also on the length of the sequence, an RNN requires more memory than a CNN. Jozefowicz et al. [Jozefowicz, Zaremba and Sutskever (2015)] searched through more than ten thousand different RNN architectures and evaluated their performance on various sequence modeling tasks. They concluded that if there were "architectures much better than the LSTM", then they were "not trivial to find". Yet recent results indicate that the temporal convolutional network (TCN) can outperform recurrent networks on sequence modeling tasks. The distinguishing characteristics of the TCN are that the architecture is built from convolutions and that it maps a sequence of any length to an output sequence of the same length, just as an RNN does. The TCN architecture is shown in Fig. 3(a): a dilated causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3, whose receptive field covers all values from the input sequence. Fig. 3(b) shows an example of a residual connection [He, Zhang, Ren et al. (2015)] in a TCN.
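The receptive-field claim for the stack in Fig. 3(a) can be checked with a small simulation. The sketch below (a toy causal dilated convolution with all-ones weights, not the actual TCN parameters) counts how many input positions influence the last output of a stack with d = 1, 2, 4 and k = 3:

```python
import numpy as np

def causal_dilated_conv1d(x, k, d):
    """Causal 1-D convolution with kernel size k, dilation d, all-ones weights.
    The output at t depends only on x[t], x[t-d], ..., x[t-(k-1)*d]."""
    T = len(x)
    y = np.zeros(T)
    for t in range(T):
        for j in range(k):
            idx = t - j * d
            if idx >= 0:
                y[t] += x[idx]
    return y

# Stack with dilation factors 1, 2, 4 and filter size k = 3, as in Fig. 3(a).
T, k = 32, 3
impulses = np.eye(T)  # one impulse per input position
receptive = np.zeros(T, dtype=bool)
for i in range(T):
    h = impulses[i]
    for d in (1, 2, 4):
        h = causal_dilated_conv1d(h, k, d)
    receptive[i] = h[-1] != 0  # does input position i influence the last output?

print(receptive.sum())  # 15 = 1 + (k - 1) * (1 + 2 + 4)
```

The count matches the closed form for the receptive field of a stacked dilated causal convolution, $1 + (k-1)\sum_i d_i$.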
(a) Causal convolution (b) Residual connection module

Inspired by the TCN, the TRDD network proposed in this paper makes full use of dilated convolutions and residual modules in its architecture. The dilated convolution expands the receptive field, and the residual network enhances the semantic information of the sequence features.

TRDD network for Scene text recognition
The network architecture of TRDD, shown in Fig. 4, mainly consists of two parts: the feature extraction layers and the transform layer. The feature extraction layers use dilated convolutions and a residual network to extract robust sequence features that are consistent with the order of the text in the image. The transform layer translates the per-frame predictions of the feature extraction layers into a label sequence. TRDD absorbs the design ideas of the TCN for sequence modeling, abandons the RNN, fuses dilated convolution and residual modules into the network, and achieves large improvements.

Feature extraction layers
Traditional text recognition aims at taking a cropped image of a single word and recognizing the word it depicts, but such methods cannot be applied effectively to scene text recognition because of the variable foreground and background textures. Scene text is no longer segmented into single characters; instead, features are extracted directly from the text image to form sequence features. Assume that $(x_0, \ldots, x_t, \ldots, x_T)$ are the feature vectors extracted from the text image through the CNN. From the CNN receptive field analysis, the receptive field of each sequence feature corresponds to a range of the input text image, as shown in Fig. 5.

Figure 5: Receptive field of the sequence feature

The RNN is one approach to increasing the receptive field size of sequence features. In this paper, we present a new module that uses dilated convolutions to extract sequence features from the input text image, based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. Let $F_0, F_1, \ldots, F_{n-1}$ be discrete feature maps and let $k_0, k_1, \ldots, k_{n-2}$ be discrete 3×3 filters, applied with exponentially increasing dilation: $F_{i+1} = F_i *_{2^i} k_i$ for $i = 0, 1, \ldots, n-2$.
Define the receptive field of an element $p$ in $F_{i+1}$ as the set of elements in $F_0$ that modify the value of $F_{i+1}(p)$, and let the size of the receptive field be the number of these elements. It is easy to see that the receptive field of each element in $F_{i+1}$ is a square of size $(2^{i+2}-1) \times (2^{i+2}-1)$: the receptive field grows exponentially. As shown in Fig. 6: $F_1$ is produced from $F_0$ by a 1-dilated convolution, and each element in $F_1$ has a 3×3 receptive field; $F_2$ is produced from $F_1$ by a 2-dilated convolution, and each element in $F_2$ has a 7×7 receptive field; $F_3$ is produced from $F_2$ by a 4-dilated convolution, and each element in $F_3$ has a 15×15 receptive field.

In the TRDD model, we use two basic unit modules to extract sequence features: (1) a residual module consisting of three dilated convolutions with dilation factors $d = 1, 2, 4$, called the "A-Module", as shown in Fig. 7(a); (2) a residual module consisting of three convolutions with filter sizes 1×1, 3×3 and 1×1, called the "B-Module", as shown in Fig. 7(b). The network architecture is shown in Fig. 8. Before text images are fed into the network, they are scaled to the same height; the input is a color image with a height of 32. The image features are extracted through two paths, one based on the A-Module and the other on the B-Module. The features are represented as $C \times W \times H$ ($W > H$), where $C$ is the number of channels, $W$ is the width and $H$ is the height of the feature map. After several pooling operations, the height of the feature map is reduced to 1 ($H = 1$), and the three-dimensional tensor $C \times W \times 1$ is converted into the two-dimensional matrix $W \times C$, which is the sequence-feature matrix of the text image. For example, for a color text image of height 32 and width 280, the sequence-feature matrix extracted by the feature extraction layers is 36×512.
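The exponential growth of the square receptive field (3×3, 7×7, 15×15) can be verified by pushing an impulse through three all-ones 3×3 dilated convolutions. This is an illustrative sketch with made-up weights, not the TRDD parameters:

```python
import numpy as np

def dilated_conv2d_ones(x, d):
    """3x3 dilated convolution with all-ones weights, dilation d, zero padding."""
    H, W = x.shape
    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            for di in (-d, 0, d):
                for dj in (-d, 0, d):
                    if 0 <= i + di < H and 0 <= j + dj < W:
                        y[i, j] += x[i + di, j + dj]
    return y

# Impulse response: by symmetry of the all-ones kernels, the nonzero footprint
# after each layer equals the receptive field of a single output unit.
N = 31
x = np.zeros((N, N))
x[N // 2, N // 2] = 1.0
sizes = []
h = x
for d in (1, 2, 4):
    h = dilated_conv2d_ones(h, d)
    rows = np.where(h.any(axis=1))[0]
    sizes.append(int(rows[-1] - rows[0] + 1))

print(sizes)  # [3, 7, 15], i.e. (2**(i+2) - 1) for i = 0, 1, 2
```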
It should be noted that the feature vectors of a feature sequence are produced column by column on the feature maps, from left to right; each column of the feature maps corresponds to a range of the input image, termed its receptive field. Experiments show that the receptive fields of sequence features extracted in this way are large, generally wider than half the image width.
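The conversion from a $C \times W \times 1$ feature map to a $W \times C$ feature sequence amounts to a squeeze and a transpose. A minimal sketch, with random values standing in for real features:

```python
import numpy as np

# Hypothetical output of the feature extraction layers: C channels, height 1, width W.
C, H, W = 512, 1, 36
feature_map = np.random.rand(C, H, W)

# Collapse the unit height and transpose: one C-dimensional vector per column,
# ordered left to right to match the reading order of the text.
sequence = feature_map.squeeze(axis=1).T  # shape (W, C)

print(sequence.shape)  # (36, 512), the example matrix from the text
```

Frame $x_t$ of the sequence is exactly column $t$ of the feature map, so `sequence[10]` equals `feature_map[:, 0, 10]`.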

Transform layer
The transform layer transforms the sequence features $X = (x_0, \ldots, x_T)$ extracted from the text image into a sequence over the label set $S$, which includes Chinese characters, punctuation, English characters, digits, spaces and all other characters. This conversion process is shown in Fig. 9: predictions are made by selecting the label sequence that has the highest probability.
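Per-frame prediction before the CTC mapping can be sketched as an argmax over each frame's output distribution. The label set and probabilities below are made up for illustration:

```python
import numpy as np

LABELS = ["-", "a", "b", "c"]  # "-" stands for the CTC blank; a toy label set

def greedy_frame_predictions(probs):
    """Pick the highest-probability label independently at each time step."""
    return [LABELS[i] for i in np.argmax(probs, axis=1)]

# Hypothetical per-frame distributions for a 4-frame feature sequence.
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],  # frame 0 -> 'a'
    [0.6, 0.2, 0.1, 0.1],  # frame 1 -> blank
    [0.1, 0.1, 0.7, 0.1],  # frame 2 -> 'b'
    [0.1, 0.1, 0.6, 0.2],  # frame 3 -> 'b'
])
print(greedy_frame_predictions(probs))  # ['a', '-', 'b', 'b']
```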

Calculation of the loss function
We utilize the conditional probability defined in Connectionist Temporal Classification (CTC) [Graves, Santiago and Gomez (2006); Graves (2008)] to calculate the loss in the training phase, so TRDD can be trained with the maximum likelihood of this probability as the objective function.
where $T$ is the sequence length.
$(y^1, y^2, \ldots, y^T)$ is a sequence of probability distributions over the set $S' = S \cup \{\text{blank}\}$, where $S$ contains all labels in the recognition task and "blank" denotes no label. Since the probabilities of the labels at each time step are conditionally independent given $X$, the conditional probability of a path $\pi \in S'^T$ is given by

$p(\pi \mid X) = \prod_{t=1}^{T} y^{t}_{\pi_t}$   (15)

where $y^{t}_{k}$ is the activation of output unit $k$ at time $t$.
Paths output by the model are mapped onto labellings $l \in S'^{\leq T}$ by an operator $B$ that removes first the repeated labels and then the blanks, so that several distinct paths $\pi_1, \pi_2, \pi_3, \pi_4$ can all yield the same labelling $l$ under $B$. Since the paths are mutually exclusive, the conditional probability of a labelling $l \in S'^{\leq T}$ is the sum of the probabilities of all paths corresponding to it:

$p(l \mid X) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid X)$
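The mapping $B$ (remove repeated labels first, then blanks) can be written in a few lines. The 'a'/'b' paths below are illustrative, not from the paper:

```python
def ctc_collapse(path, blank="-"):
    """CTC mapping B: drop consecutive repeats first, then drop blanks."""
    deduped = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in deduped if p != blank]

# Three distinct paths that all map to the same labelling ('a', 'b'):
print(ctc_collapse(list("aab-")))  # ['a', 'b']
print(ctc_collapse(list("a-bb")))  # ['a', 'b']
print(ctc_collapse(list("-ab-")))  # ['a', 'b']
# A blank is needed to keep a genuine double letter separate:
print(ctc_collapse(list("ab-b")))  # ['a', 'b', 'b']
```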
where the probability of each path $\pi$ is defined as in formula (15). The fact that different paths map onto the same labelling is what allows CTC to use unsegmented data: the model only has to learn the order of the labels, not to align them with the input sequence one by one. A naive calculation of $p(l \mid X)$ is infeasible, since the number of paths corresponding to a labelling grows exponentially with the sequence length. However, $p(l \mid X)$ can be computed efficiently with the forward-backward algorithm described in Graves et al. [Graves, Santiago and Gomez (2006)].
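The equivalence between the efficient recursion and the sum over all paths can be checked on a tiny example. The sketch below implements the standard CTC forward pass over the blank-interleaved label sequence and compares it with brute-force enumeration, which is feasible here only because the example is tiny (3 labels, 5 time steps):

```python
import itertools
import numpy as np

def brute_force_prob(y, label, blank=0):
    """Sum the product probabilities of every path that collapses to `label`."""
    T, S = y.shape
    total = 0.0
    for path in itertools.product(range(S), repeat=T):
        collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
        collapsed = [p for p in collapsed if p != blank]
        if collapsed == list(label):
            total += np.prod([y[t, path[t]] for t in range(T)])
    return total

def forward_prob(y, label, blank=0):
    """CTC forward algorithm over the blank-interleaved label sequence."""
    ext = [blank]
    for c in label:
        ext += [c, blank]
    L = len(ext)
    alpha = np.zeros((len(y), L))
    alpha[0, 0] = y[0, blank]
    alpha[0, 1] = y[0, ext[1]]
    for t in range(1, len(y)):
        for s in range(L):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip connection: allowed between distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    return alpha[-1, L - 1] + alpha[-1, L - 2]

rng = np.random.default_rng(1)
y = rng.dirichlet(np.ones(3), size=5)  # 5 time steps over {blank, 1, 2}
p = forward_prob(y, [1, 2])
print(p)  # equals the brute-force sum over all 3**5 paths
```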

Network training
Denote the training dataset by $D = \{(I_i, l_i)\}$, where $I_i$ is a training text image and $l_i$ is its ground-truth label sequence. The CTC objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:

$O = -\sum_{(I_i, l_i) \in D} \log p(l_i \mid X_i)$
where $X_i$ is the sequence of feature vectors produced by the feature extraction layers from $I_i$. The objective is calculated directly from the input image and its ground-truth label sequence, so the model can be trained end-to-end on pairs of images and label sequences.

Experiments
In this section, we perform a series of experiments to verify the effectiveness of TRDD from three aspects: the receptive field of the sequence features, the convergence speed and prediction accuracy of the network, and the recognition speed and accuracy of the network.

Receptive field analysis
The concept of the receptive field is crucial for understanding and analyzing how deep networks work. Since regions of an input text image outside the receptive field of a unit do not affect the value of that unit, it is necessary to control the size of the receptive field to ensure that it covers all relevant image regions. State-of-the-art scene text models are basically based on CNN and RNN modules, such as CRNN and DTRN: the CNN extracts the sequence features from the text image and the RNN learns contextual information. From the viewpoint of the receptive field, the features extracted by the CNN already cover a large region; if that range meets the needs of text recognition, the LSTM module plays little role and can be removed. Feature vectors $x_1, \ldots, x_t, \ldots, x_T$ are extracted from the input image through the network. The receptive field of $x_t$ in the input image is measured as follows: from left to right, the pixel values of each column of the text image are set to zero in turn, and the resulting variation of the feature vector $x_t$ is recorded. The magnitude of these changes reflects how strongly each column affects $x_t$, from which the size of its receptive field in the input image is derived. The input image resolution is 3×280×32 and the extracted feature vectors are $(x_1, \ldots, x_t, \ldots, x_{36})$. We calculate the receptive field size of $x_{10}$ extracted by CRNN and by TRDD; the X-axis represents the width of the text image and the Y-axis the average response intensity. As shown in Fig. 11: (1) the sizes of the receptive fields of the sequence features extracted by TRDD and CRNN in the input image are very close along the X-axis.
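The column-zeroing probe described above can be sketched with a stand-in feature extractor (two width-5 one-dimensional convolutions replacing the real CNN); the columns whose removal changes the feature at position $t$ delimit its receptive field:

```python
import numpy as np

def toy_feature_extractor(img):
    """Stand-in for the CNN: two 1-D convolutions over image columns (all-ones, width 5)."""
    cols = img.mean(axis=0)                 # collapse height: one value per column
    k = np.ones(5) / 5.0
    h = np.convolve(cols, k, mode="same")
    return np.convolve(h, k, mode="same")   # one feature value per column

rng = np.random.default_rng(2)
img = rng.random((32, 40))                  # hypothetical 32x40 grey text image
t = 20                                      # probe the feature at column t
base = toy_feature_extractor(img)[t]

response = np.zeros(img.shape[1])
for c in range(img.shape[1]):               # zero one column at a time, left to right
    masked = img.copy()
    masked[:, c] = 0.0
    response[c] = abs(toy_feature_extractor(masked)[t] - base)

influential = np.where(response > 0)[0]
print(influential.min(), influential.max())  # the receptive field: columns 16..24
```

Two width-5 convolutions give a combined kernel of width 9, so exactly the nine columns $t \pm 4$ produce a nonzero response, which matches the measured range.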
(2) The receptive field sensitivity of TRDD is larger than that of CRNN along the Y-axis. For Asian scripts such as Chinese, Japanese or Korean, local information is particularly important for recognition, so the sequence-feature representation extracted by TRDD is more effective than that of CRNN.

Figure 11: Response range of $x_{10}$ on the input image
The experimental results are shown in Fig. 12. As can be seen from Fig. 12(a), the TRDD network converges faster than DTRN, and the training error is reduced by 2%. As can be seen from Fig. 12(b), the TRDD network is 3% more accurate than DTRN.

Network convergence and accuracy

Network predictive speed test
60,000 text images from Syn90K are used to evaluate the prediction speed, model size and prediction accuracy of the two models. The results of the comparison are shown in Tab. 1, where "Pre. Time" is the average prediction time, "Mod. Size" is the size of the model file, and "Accuracy" is the average test accuracy. As can be seen from Tab. 1, compared with the DTRN network, the prediction time of the TRDD network is greatly improved: TRDD predicts about 2.5 times faster, its model size is reduced by 27%, and its average prediction accuracy is higher than that of the DTRN model.

Network recognition accuracy experiment

Datasets
The following datasets are used in our experiments.

Conclusion
In this work we present a new network model, TRDD, based on the TCN. Compared with traditional sequence text recognition models, removing the RNN module avoids the vanishing and exploding gradient problems in the training phase. Moreover, the prediction speed improves fundamentally compared with other networks, since prediction can be processed in parallel. The dilated convolution increases the receptive field of the sequence features, and the residual network enriches their semantic expression. Experiments show that the convergence speed, prediction speed and model size of TRDD are better than those of other networks; in prediction speed especially, TRDD outperforms previous state-of-the-art scene text recognition systems.