Focal CTC Loss for Chinese Optical Character Recognition on Unbalanced Datasets

In this paper, we propose a novel deep model for character recognition on unbalanced distributions by employing a focal-loss-based connectionist temporal classification (CTC) function. Previous works use the traditional CTC to compute prediction losses. However, some datasets, such as Chinese text, consist of extremely unbalanced samples. In other words, both the training and testing sets contain large numbers of low-frequency samples, which have very limited influence on the model during training. To solve this issue, we modify the traditional CTC by fusing focal loss with it, thus making the model attend to the low-frequency samples at the training stage. To demonstrate the advantage of the proposed method, we conduct experiments on two types of datasets: synthetic and real image sequence datasets. The results on both datasets demonstrate that the proposed focal CTC loss function achieves the desired performance on unbalanced datasets. Specifically, our method outperforms the traditional CTC by 3 to 9 percentage points in accuracy on average.


Introduction
Recently, Deep Convolutional Neural Networks (DCNN) have achieved great success in various computer vision tasks, such as image classification and object detection [1][2][3][4][5][6]. This success is attributed to large-scale data, dropout [7], and regularization [8][9][10][11][12] techniques. Image sequence recognition, which can be regarded as a variant of object detection, remains challenging due to the difficulty of detecting sequence-like objects. Unlike classification and detection problems, which predict one label for an entire image or a region, sequence recognition must compute a sequence of labels for an input image, as shown in Figure 1.
In such cases, we cannot readily apply DCNN [13,14] to sequence-like recognition tasks, since DCNN can only generate label sequences of fixed length depending on the input sequences. This limitation constrains its application in scenes that require predicting sequences of various lengths.
Traditional methods, including [15,16], are based on a detection-recognition strategy. Individual characters are first detected and then recognized to form a full sentence.
However, detecting a single character is challenging, especially for Chinese words. Unlike English, many Chinese words are composed of structural parts arranged in left-right order. This phenomenon restricts the application of detection-recognition methods.
A commonly used alternative is to overcut the input sequence into slices and recognize them through recurrent neural networks (RNN). Compared with the aforementioned detection-recognition methods, this method cuts the input into many slices before feeding them into an RNN-based recognition model. Because RNN models are powerful at remembering past information, they do not need to locate individual characters. The final results are predicted by fusing memories into the current state information.
There is another challenge for Chinese image-based sequence recognition, namely, the unbalance of the training set. Unlike small-lexicon language datasets, large-lexicon language datasets, such as Chinese, suffer from a severely unbalanced sample distribution. Apart from a small portion, most words are rarely used in everyday scenes. In this paper, we refer to the commonly used samples as easy samples and the others as hard samples.
Existing methods for sequence recognition can be classified into two branches: seq2seq models [17,18] and CTC loss function based models [19]. None of these works take unbalanced datasets into consideration, especially in Chinese image-based sequence recognition tasks. The unbalance of a dataset results in severe overfitting on easy samples and underfitting on hard samples. To remedy the unbalance between easy and hard samples during training, we propose a focal CTC loss function that prevents the model from neglecting the hard samples.
To the best of our knowledge, this is the first work attempting to solve the unbalance problem for sequence recognition.
[20]. Neumann proposed new oriented strokes for character detection and classification in [21]. Manually designed image features are usually confined to low-level information, which limits performance. Lee developed a mid-level representation of characters with discriminative feature pooling in [22]. Yao developed a mid-level feature named Strokelets to describe the parts of characters in [23]. Other interesting works brought insightful ideas by proposing to embed word images in a common vectorial subspace and convert word recognition into a retrieval problem [24,25]. Some works take advantage of RNNs by converting a set of geometrical or image features extracted from handwritten texts into a sequence of image features [26]. Some other approaches [27] treat scene text recognition as an image classification problem and assign a class label to each English word (90K words in total).

CTC Loss Function Based Sequence Recognition.
Connectionist Temporal Classification (CTC) was proposed in [19], which presents a CTC loss function to train RNNs to label unsegmented sequences directly. CTC is widely used for speech recognition [28,29]. In this paper, we focus on applying CTC to image-based sequence recognition. Graves made the first attempt to combine recurrent neural networks and CTC for off-line handwriting recognition in [30]. After the revival of neural networks, deep convolutional neural networks have been applied to image-based sequence recognition. Hasan applied a Bidirectional Long Short-Term Memory (BLSTM) architecture with CTC to recognize printed Urdu texts [31]. Shi proposed a novel neural network architecture that integrates feature extraction, sequence modeling, and transcription into a unified framework [32,33]. Deep TextSpotter [34] trains both text detection and recognition in a single end-to-end pass. He developed a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labeling problem [35].

Seq2seq Based Sequence Recognition.
Seq2seq and attention frameworks are prevalent in machine translation research [36]. Recently, such frameworks have been adopted for image-based sequence recognition. Ba proposed a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of an input image; the model learns to both localize and recognize multiple objects despite being given only class labels during training [37]. Lee presented recursive recurrent neural networks with attention modeling (R2AM) for lexicon-free optical character recognition in natural scene images [18, 41]. Xu introduced an attention-based model that automatically learns to describe the content of images [38]. Another interesting work is Spatial Transformer Networks [39]. Focal loss down-weights the loss assigned to well-classified examples in an object detection framework [40].

Architecture
We design our model by extending the architecture proposed by [32], which consists of three components: convolutional, recurrent, and transcription layers. An overview of the architecture is illustrated in Figure 2. We adopt the 51-layer Residual Network of [42] as the convolutional layers and bidirectional Long Short-Term Memory [43][44][45] as the recurrent layers. The transcription layer is based on either CTC or our focal CTC function.
As in many CNN-based computer vision applications, the convolutional layers are employed to generate feature maps of the given image. As the detection literature, such as Fast R-CNN and Faster R-CNN, has already demonstrated, it is possible to locate and recognize objects with feature maps from a pretrained CNN model. Hence, this strategy can be readily adopted for the image-based sequence recognition task. Specifically, we first overcut the feature maps into multiple slices. Each slice can be regarded as a representation of a bounding box. These slice feature maps form a fixed-length sequence. We then feed this sequence into an RNN-based layer to predict a variable-length label sequence. Note that the deep structure of ResNet provides a receptive field large enough for each small area to cover the corresponding character, so an RNN is not strictly necessary for image-based sequence recognition. Despite this fact, we still use LSTM to predict label sequences due to its excellent ability to retain and discard previous information.
The transcription layer is responsible for translating the output of the LSTM hidden units into labels and finding the label sequence that has the highest probability conditioned on the per-slice predictions. We employ a fully connected layer to interface with the output of the LSTM hidden units. Then we calculate the probability of each label through a softmax layer. Finally, we calculate the CTC loss through our focal loss based CTC layer with the predicted and ground-truth label sequences as input.

Feature Extraction.
To obtain a suitable representation of a given image, many CV models use a pretrained CNN to generate feature maps. In fact, one advantage of CNNs over traditional feature extraction methods is that they can capture local textures with different filters. For this reason, a well-trained CNN model can be readily adopted for image sequence recognition tasks. In our case, we use the Residual Network, known as ResNet, to extract visual feature maps. It was first proposed by He et al. in [42] to solve classification problems and won first place in the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tasks. Compared with previously proposed deep CNN models, ResNet has more layers but lower complexity. The elegant performance of ResNet can be attributed to the introduction of the deep residual learning framework. We briefly illustrate this structure in Figure 3. The formulation of residual learning can be viewed as inserting shortcut connections into a plain feed-forward network. These shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Figure 3). Supposing that the input of a residual learning block is denoted as $x$, the learning process can be formulated as optimising a residual function $\mathcal{F}(x)$:

$$\mathcal{F}(x) = \mathcal{H}(x) - x, \tag{1}$$

where $\mathcal{H}(\cdot)$ denotes an underlying mapping to be fit by several stacked layers. Note that the size of $x$ should be consistent with the output of $\mathcal{F}(\cdot)$.
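As an illustration of the shortcut connection, the following minimal Python sketch computes the block output as the residual plus the identity-mapped input. The names `residual_block` and `toy_residual` are hypothetical and chosen only for this example; a real ResNet block would use convolutional layers for the residual function.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, where F is the residual function fit by the stacked layers."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        # The identity shortcut requires F(x) and x to have the same size.
        raise ValueError("F(x) must have the same size as x")
    return [f + xi for f, xi in zip(fx, x)]


def toy_residual(x: Vector) -> Vector:
    """Hypothetical stand-in for the stacked layers: a small correction of x."""
    return [0.1 * xi for xi in x]


out = residual_block([1.0, -2.0, 3.0], toy_residual)
identity = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))
```

When the residual function outputs zeros, the block reduces to the identity mapping, which is the case the shortcut connection makes easy to learn.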
We take the output of the last CNN layer of ResNet as the feature maps that correspond to the entire image. We overcut these feature maps into slices. Similar to Fast R-CNN or Faster R-CNN, each slice contains information about a local area of the original image.
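The slicing step can be sketched as follows. This is a minimal sketch under the assumption that the feature map is stored as nested lists of shape (channels, height, width) and that one slice is taken per column; the function name `slice_feature_map` is hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def slice_feature_map(feature_map: FeatureMap) -> List[List[float]]:
    """Cut a (channels, height, width) feature map into one slice per column.

    Each slice is flattened into a vector of length channels * height, so the
    result is a sequence of `width` feature vectors ordered from left to right.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    sequence = []
    for w in range(width):
        column = [feature_map[c][h][w] for c in range(channels) for h in range(height)]
        sequence.append(column)
    return sequence


# Toy feature map with 2 channels, 1 row, and 3 columns.
fmap = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]
slices = slice_feature_map(fmap)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

The resulting sequence has a fixed length determined by the width of the feature map, which is what the recurrent layers consume.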

Label Sequence Prediction.
Given the feature maps of the image slices, we predict the label sequence with a bidirectional LSTM network. RNN [43] is a class of neural networks that efficiently models the dynamic temporal behavior of sequences through directed cyclic connections between units. Each unit retains an internal hidden state, which is considered to contain information about the previous units. Generally speaking, RNN can be viewed as an extension of hidden Markov models. Despite its advantages in dealing with sequential signals, the traditional RNN unit suffers from the vanishing gradient problem [46], which limits the range of context it can store and thus makes the training process difficult.
To solve this issue, Long Short-Term Memory (LSTM), a variant of RNN, was proposed. It captures both long-term and short-term temporal dependencies better than the traditional RNN unit. Specifically, LSTM extends RNN by adding three gates to the RNN neuron: a forget gate $f$, which controls to what extent the previous cell state is retained; an input gate $i$, which decides how much effect the current input should have on the hidden state; and an output gate $o$, which constrains how much of the current memory is exposed through the hidden state. These gates enable LSTM to handle long-term dependency problems in sequence recognition tasks. More importantly, LSTM is easier to optimize, as these gates help the input signal propagate effectively through the recurrent hidden states $h(t)$ without affecting the output. Figure 4 is a schematic illustration of an LSTM unit. LSTM also alleviates the gradient vanishing or exploding issues of RNN [47]. In our case, we formulate the operation of an LSTM unit as follows, omitting the indication of forward or backward layers for convenience:

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right), \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the sigmoid activation function, which maps the activations of all gates to values in $(0, 1)$. $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate at the $t$-th step, respectively. $c_t$ and $c_{t-1}$ store the information of the current and previous cell states, and $\tilde{c}_t$ is the candidate cell state. $h_t$ and $h_{t-1}$ represent the hidden units of the two successive steps, $x_t$ is the input at the $t$-th step, and $\odot$ denotes element-wise multiplication. $W_*$ and $b_*$ are the weights and biases, which transform the two vectors $h_{t-1}$ and $x_t$ into one common space. In the literature [48], rectified linear units (ReLU) are also employed as the activation function.
In our case, the label sequence is predicted through a bidirectional LSTM. The size of the hidden unit is 128. Each label is calculated by fusing the hidden states of the forward and backward hidden layers. At the $t$-th step, the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$ are combined through a concatenation layer followed by a fully connected layer. The final results, which take the form of probability distributions, are obtained through a softmax layer.
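To make the gate mechanics concrete, the following is a minimal single-step sketch in plain Python. It assumes the standard LSTM formulation in which every gate is computed from the concatenation of the previous hidden state and the current input; the parameter names (`W_f`, `b_f`, and so on) and the helper `affine` are hypothetical and do not come from the original implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W v + b with plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM step; returns the new hidden state and the new cell state."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate state
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy parameters: hidden size 1, input size 1, all weights and biases zero,
# so every gate equals sigmoid(0) = 0.5 and the candidate state is tanh(0) = 0.
zeros: Dict[str, list] = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_o", "W_c")}
zeros.update({name: [0.0] for name in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step([1.0], [0.0], [2.0], zeros)
```

With all-zero parameters, the cell state is simply halved by the forget gate, which makes the role of each gate easy to check by hand. A bidirectional layer would run one such recurrence from left to right and another from right to left, then concatenate the two hidden states at each step.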

Transcription
Transcription converts the per-slice predictions made by the bidirectional LSTM into a label sequence. In this section, we first briefly review the definition of the CTC loss and then introduce our proposed focal CTC loss. The CTC loss function, first proposed in [19], aims to model the conditional probability of label sequences given the probability distribution of each predicted label. In essence, CTC is a loss function rather than a network layer. For this reason, the term "CTC layer" is not accurate and may lead to misunderstandings about CTC. The focal CTC loss function is mainly inspired by the focal strategy in object detection applications. The main contribution of this paper is to show that, by employing the focal strategy, the CTC loss function can be more effective in optimising the entire model.

Probability of Label Sequences.
Let $\mathbb{R}$ and $L$ be the set of real numbers and a label set, respectively; the latter is usually called the lexicon. Let $\mathcal{X} = \mathbb{R}^{m \times T}$ be the feature space of the input and $\mathcal{Y} = L^{U}$ be the label space, where the superscripts $m$, $T$, and $U$ represent the feature dimension, the sequence length in time steps, and the label length, respectively. Following the previous method, the input is overcut into slices. Each slice is supposed to contain a fraction of a single label character, implying $T \ge U$. The CTC loss function can be regarded as modeling the joint probability distribution over $\mathcal{X}$ and $\mathcal{Y}$, which is denoted as $\mathcal{D}_{\mathcal{X} \times \mathcal{Y}}$.
A CTC loss function takes the output of a softmax layer as its input [49]. We add a blank label B to $L$ and hence obtain a new label set $L^{+} = L \cup \{\mathrm{B}\}$. An input sequence $\mathbf{x} \in \mathbb{R}^{m \times T}$ is transformed through the softmax layer into a sequence $\mathbf{y}$ of $T$ probability distributions over $L^{+}$. We denote the activation of output unit $k$ at time $t$ as $y_k^t$. Then $y_k^t$ is interpreted as the probability of observing label $k$ at time $t$, which defines a distribution over the set $L^{+T}$ of length-$T$ sequences over the lexicon $L^{+}$. Reference [19] refers to the elements of $L^{+T}$ as paths and denotes them as $\pi$. We assume that the outputs of the network at different time steps are conditionally independent. Then the probability of a path $\pi$ can be expressed as follows:

$$p(\pi \mid \mathbf{y}) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \quad \forall \pi \in L^{+T}. \tag{3}$$

A sequence-to-sequence mapping function $\mathcal{B}$ is defined on sequences $\pi \in L^{+T}$. $\mathcal{B}$ maps $\pi$ onto a label sequence $l$ by first merging repeated labels and then removing blank labels. For example, $\mathcal{B}$ maps "B1BB1B220" onto "1120". The conditional probability is then calculated by summing the probabilities of all $\pi$ that are mapped by $\mathcal{B}$ onto $l$:

$$p(l \mid \mathbf{y}) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid \mathbf{y}). \tag{4}$$

The time complexity of the naive way to compute the conditional probability in (4) is exponential, as $|L^{+}|^{T}$ paths exist. Reference [19] provides an efficient dynamic programming algorithm to compute the conditional probability. The CTC loss is differentiable, since the conditional probability $p(l \mid \mathbf{y})$ only contains addition and multiplication operations.
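The mapping and the two ways of computing the conditional probability can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, the blank is represented by the character "B" as in the example above, and the dynamic programming routine follows the usual forward recursion over the blank-extended label in the style of [19].

```python
from itertools import groupby, product
from typing import List, Sequence

BLANK = "B"


def collapse(path: str) -> str:
    """Mapping from paths to labels: merge repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != BLANK)


def path_probability(path: Sequence[str], probs: List[List[float]], alphabet: str) -> float:
    """Product over time of the softmax output for the label chosen at each step."""
    p = 1.0
    for t, label in enumerate(path):
        p *= probs[t][alphabet.index(label)]
    return p


def brute_force_probability(label: str, probs: List[List[float]], alphabet: str) -> float:
    """Sum over all len(alphabet) ** T paths; exponential, for tiny examples only."""
    total = 0.0
    for path in product(alphabet, repeat=len(probs)):
        if collapse("".join(path)) == label:
            total += path_probability(path, probs, alphabet)
    return total


def forward_probability(label: str, probs: List[List[float]], alphabet: str) -> float:
    """Same quantity via dynamic programming over the blank-extended label."""
    ext = [BLANK]
    for ch in label:
        ext += [ch, BLANK]
    idx = [alphabet.index(s) for s in ext]
    size = len(ext)
    fwd = [0.0] * size
    fwd[0] = probs[0][idx[0]]
    if size > 1:
        fwd[1] = probs[0][idx[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            acc = fwd[s]
            if s >= 1:
                acc += fwd[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                acc += fwd[s - 2]
            new[s] = acc * probs[t][idx[s]]
        fwd = new
    return fwd[-1] + (fwd[-2] if size > 1 else 0.0)


print(collapse("B1BB1B220"))  # 1120
```

On small inputs the brute-force sum and the forward recursion agree, while only the latter scales to realistic sequence lengths.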

Focal CTC Loss.
In [40], the focal loss based on cross entropy is defined as follows:

$$FL(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \tag{5}$$

where $p_t$ is the probability of the ground truth in the softmax output distribution, and $\alpha$ and $\gamma$ are hyperparameters used to balance the loss. Intuitively, the focal loss can be viewed as multiplying the cross entropy by $\alpha (1 - p_t)^{\gamma}$ (the minus sign belongs to the cross entropy, $-\log(p_t)$). It is easy to see that the closer $p_t$ is to 1, the smaller the focal loss will be. The focal loss therefore reduces the effect of easy examples and pays more attention to hard negative samples during training. Following the focal strategy, we define our focal CTC loss as follows:

$$L_{\mathrm{FCTC}}(l \mid \mathbf{y}) = -\alpha \left(1 - p(l \mid \mathbf{y})\right)^{\gamma} \log p(l \mid \mathbf{y}), \tag{6}$$

where $p(l \mid \mathbf{y})$ is the conditional probability mentioned above. The negative log function converts the optimization from a maximization problem to a minimization problem so that the gradient descent algorithm can be adopted. In this way, we can focus the loss on undertrained samples and "ignore" overtrained samples.
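As a small numerical illustration, the focal CTC loss can be computed directly from the sequence probability. The sketch below is not the original implementation; the function names are hypothetical, and the second helper assumes a CTC routine that returns the negative log-likelihood of the label sequence, from which the probability is recovered by exponentiation. The default values of `alpha` and `gamma` are the setting recommended in the conclusion of this paper.

```python
import math


def focal_ctc_loss(p: float, alpha: float = 0.25, gamma: float = 0.5) -> float:
    """Focal CTC loss -alpha * (1 - p)**gamma * log(p) for a probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def focal_ctc_from_ctc_loss(ctc_loss: float, alpha: float = 0.25, gamma: float = 0.5) -> float:
    """Same loss computed from ctc_loss = -log(p), assumed to come from a CTC routine."""
    # 1 - exp(-ctc_loss), computed with expm1 for accuracy when the loss is small.
    one_minus_p = max(0.0, -math.expm1(-ctc_loss))
    return alpha * one_minus_p ** gamma * ctc_loss


easy = focal_ctc_loss(0.9)  # well-recognized sequence, strongly down-weighted
hard = focal_ctc_loss(0.1)  # poorly recognized sequence, keeps most of its loss
```

Setting `alpha = 1` and `gamma = 0` recovers the traditional CTC loss, which makes the modulating factor easy to isolate in experiments.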

Datasets.
In this section, we evaluate the focal CTC loss on both synthetic and real datasets. We establish two synthetic datasets by concatenating 5 MNIST [13] images, each resized to 32 × 32, side by side into a long image with a resolution of 32 × 160. We split the alphabet "0-9a-z" into two subalphabets, "0-9a-h" and "i-z", of the same size. The first dataset, with an unbalance ratio of 10:1, consists of two parts: one containing 1,000,000 long images, each concatenated from 5 images randomly sampled from "0-9a-h", and the other containing 100,000 long images, each concatenated from 5 images randomly sampled from "i-z". The second dataset, with an unbalance ratio of 100:1, consists of 1,000,000 long images, each concatenated from 5 images randomly sampled from "0-9a-h", and 10,000 long images, each concatenated from 5 images randomly sampled from "i-z". We use a dataset containing 10,000 images to test the accuracy. The ratio of high-frequency to low-frequency characters used in the training phase is set to 1:1. We present the label distribution of both datasets in Figure 5.

Figure 5: Distribution of labels for the two synthetic datasets. Each bar represents the frequency of two characters. For example, the first bar of (a) represents how many 0 and 1 exist in the dataset with unbalance ratio 10:1.

We also test the focal CTC loss on a real Chinese-ocr dataset [50], which consists of 3,607,567 training and 5,000 test samples. Each sample is a 32 × 280 pixel image with a 10-Chinese-character label. The frequencies of words are illustrated in Figure 6.
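The label side of the synthetic datasets can be sketched as follows. This is only an illustration under stated assumptions: we read the two subalphabets as "0-9a-h" and "i-z" (18 characters each), we generate label strings only and omit the image concatenation, and the function name `sample_labels` and the scaled-down sizes are hypothetical.

```python
import random
import string
from collections import Counter
from typing import List

HIGH = string.digits + "abcdefgh"  # assumed high-frequency subalphabet "0-9a-h"
LOW = "ijklmnopqrstuvwxyz"         # assumed low-frequency subalphabet "i-z"


def sample_labels(n_high: int, n_low: int, length: int = 5, seed: int = 0) -> List[str]:
    """Draw n_high label strings from HIGH and n_low label strings from LOW."""
    rng = random.Random(seed)
    high = ["".join(rng.choice(HIGH) for _ in range(length)) for _ in range(n_high)]
    low = ["".join(rng.choice(LOW) for _ in range(length)) for _ in range(n_low)]
    return high + low


# Scaled-down version of the 10:1 dataset (1,000 versus 100 sequences).
labels = sample_labels(n_high=1000, n_low=100)
counts = Counter("".join(labels))
high_total = sum(counts[ch] for ch in HIGH)
low_total = sum(counts[ch] for ch in LOW)
```

Counting characters in this way reproduces the kind of per-label frequency histogram that Figure 5 reports, with the high-frequency characters outnumbering the low-frequency ones by the chosen ratio.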

Training Strategy.
We implement our focal loss function in the TensorFlow framework, which is known as a flexible architecture supporting complex computations in machine learning and deep learning. A typical CTC loss function can be formulated as follows:

$$L_{\mathrm{CTC}}(l, \mathbf{y}) = -\log p(l \mid \mathbf{y}), \tag{7}$$

where $l$ and $\mathbf{y}$ denote the ground-truth label sequence and the sequence output by the RNN units, respectively. Both are constrained in length by an integer scalar $T$. This function has already been implemented in the TensorFlow framework, so we first compute $p(l \mid \mathbf{y})$ by calling (7) as defined in TensorFlow. Then we calculate the focal loss according to (5). We summarize the entire process as follows:

$$p(l \mid \mathbf{y}) = e^{-L_{\mathrm{CTC}}(l, \mathbf{y})}, \tag{8a}$$

$$L_{\mathrm{FCTC}} = \alpha \left(1 - p(l \mid \mathbf{y})\right)^{\gamma} L_{\mathrm{CTC}}(l, \mathbf{y}). \tag{8b}$$

Before training our model, we set the learning rate and batch size to 0.001 and 128, respectively. All parameters except those of the CNN are initialized by sampling from a Gaussian distribution. The weights of the CNN are copied from ResNet and kept unchanged during training. We optimise our model using stochastic gradient descent (SGD) with Nesterov momentum [51] set to 0.9. We run all the experiments on a single NVIDIA M40 GPU. The entire training process is described in Algorithm 1.

Evaluation Metrics.
We evaluate our model in terms of two metrics: naive accuracy and soft accuracy. Under naive accuracy, a predicted sequence counts as correct only when it is identical to the ground truth. Soft accuracy tolerates an edit distance of 1 between the prediction and the label. The edit distance between two sequences $a$ and $b$ is defined as the minimum number of insertions, substitutions, and deletions required to change $a$ into $b$.
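Both metrics can be computed with a standard Levenshtein distance. The following sketch is illustrative only; the function names `edit_distance` and `accuracies` are hypothetical, and sequences are represented as Python strings.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, substitutions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def accuracies(preds: Sequence[str], truths: Sequence[str]) -> Tuple[float, float]:
    """Return (naive accuracy, soft accuracy) over paired predictions and labels."""
    n = len(truths)
    exact = sum(p == t for p, t in zip(preds, truths))
    soft = sum(edit_distance(p, t) <= 1 for p, t in zip(preds, truths))
    return exact / n, soft / n


naive_acc, soft_acc = accuracies(["1120", "1121", "99"], ["1120", "1120", "1120"])
```

In this toy example, only the first prediction matches exactly, while the second differs by a single substitution and therefore also counts under soft accuracy.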

Results.
Results for various $\alpha$ and $\gamma$ are shown in Tables 1 and 2 for the synthetic datasets with unbalance ratios of 100:1 and 10:1, respectively. The best gains on the two datasets, as highlighted in bold in Tables 1 and 2, are 9% and 6.7%, respectively. Moreover, the focal CTC not only improves the low-frequency accuracy, which is a natural result, but also enhances the high-frequency accuracy on the 10:1 dataset. The improvement on the 100:1 dataset is mainly due to the enhancement on low-frequency data. However, some choices of $\alpha$ and $\gamma$ perform poorly, such as $\alpha = 0.5$, $\gamma = 2$; $\alpha = 0.75$, $\gamma = 2$; and $\alpha = 0.5$, $\gamma = 1$ on the 100:1 dataset. For $\alpha = 0.75$, the accuracy on high-frequency samples drops dramatically. As for $\alpha = 0.5$, the accuracy on low-frequency samples is not so good. The 10:1 dataset achieves promising results for both high-frequency and low-frequency samples. Similarly, the same issue occurred in the 100:1 dataset. In addition, some choices of $\alpha$ and $\gamma$ also result in poor performance, such as $\alpha = 0.99$, $\gamma = 1$; $\alpha = 0.75$, $\gamma = 1$; and $\alpha = 0.75$, $\gamma = 2$. Although a bad choice of $\alpha$ and $\gamma$ can impair the accuracy, we find that $\alpha = 0.25$, $\gamma = 0.5$ achieves an exciting improvement of at least 5% on both datasets, despite the presence of both high-frequency and low-frequency data. We highlight these results in bold italic font in Tables 1 and 2.
We present the results on the real dataset in Table 3. The best improvement in accuracy, as highlighted in bold, is 4.1%. This is an exciting improvement for a real-life application. Additionally, we observe that $\alpha = 0.25$, $\gamma = 0.5$ yields an improvement of 3.6%, which is also a considerable enhancement.
To observe the convergence behavior for different $\gamma$, we plot the test accuracy and soft accuracy curves on the real dataset for $\alpha = 0.99$. As shown in Figure 7, throughout the training process the focal CTC loss with $\gamma = 0.5$ achieves the best convergence rate for both accuracy and soft accuracy, while the focal CTC loss with $\gamma = 2$ performs poorly all the time.
From the above results on both the synthetic and real datasets, we conclude that the focal CTC loss with $\alpha = 0.25$ and $\gamma = 0.5$ gives a considerable improvement over the CTC loss. Some other choices of $\alpha$ and $\gamma$ may achieve even larger improvements. Nevertheless, in real-life applications, $\alpha = 0.25$ and $\gamma = 0.5$ are a reasonable choice for unbalanced datasets.

Qualitative Results.
We provide some examples from both the synthetic and real datasets in Figure 8. The sequences predicted by CTC and focal CTC are marked in red and green, respectively. All images are sampled from the test split. Generally, the predictions are clearly improved when the focal CTC loss is employed. Due to the extremely unbalanced distribution of Chinese words, it can be seen from Figure 8(a) that some uncommonly used words cannot be recognized effectively by the CTC-based model but are recognized by our proposed focal CTC based model. As for the synthetic datasets, it is interesting that CTC and focal CTC both work well on the 10:1 dataset. However, the performance of CTC drops when we make the distribution of characters more unbalanced.

Conclusion
In this paper, we propose a focal CTC loss function, which balances the loss between easy and hard samples during training. We test various values of the hyperparameters $\alpha$ and $\gamma$ on both the synthetic and real datasets. The results show that setting $\alpha = 0.25$ and $\gamma = 0.5$ achieves a considerable improvement on both the synthetic and real datasets. We also point out that some choices may result in poor performance. To some extent, our proposed focal CTC loss function alleviates the unbalance problem in large-lexicon sequence recognition.

Figure 1: Distinct from object detection, which aims to locate and recognize important objects, sequence recognition is required to output a sequence of labels of various lengths. However, the prediction can be challenging if the distribution of labels is unbalanced.

Figure 2: The network architecture. The architecture consists of three parts: convolutional layers, which extract a feature sequence from the input image; recurrent layers, which predict a label distribution for each frame; and a transcription layer, which translates the per-frame predictions into the final label sequence.

Figure 3: Convolutional layers (ResNet), which are used to extract image feature sequences. The basic building block is the residual learning unit, surrounded by the green dashed box.

Figure 4: A schematic illustration of an LSTM neuron. Each LSTM neuron has an input gate, a forget gate, and an output gate.

Figure 6: Distribution of labels for the Chinese-ocr dataset. The first bar represents the 300 most frequently used words in the dataset. Most Chinese words account for only a small part of all word occurrences.

Figure 7: Comparison of convergence speed on the real data for different $\gamma$ with the same $\alpha = 0.99$.

Figure 8: Prediction results for Chinese words are presented in (a). Examples from the synthetic datasets with unbalance ratios 10:1 and 100:1 are shown in (b) and (c), respectively.