Deep Template Matching for Offline Handwritten Chinese Character Recognition

Convolutional neural networks (CNNs), which have achieved remarkable results in many computer vision tasks, provide a highly successful end-to-end solution for handwritten Chinese character recognition (HCCR). However, learning discriminative features for image recognition is difficult when little data is available. In this paper, we propose a novel method for learning a siamese neural network, which employs a special structure to predict the similarity between handwritten Chinese characters and template images. The optimization of the siamese neural network can be treated as a simple binary classification problem. Once training has finished, the learned discriminative features allow the predictive power to generalize not just to new data, but to entirely new classes that never appear in the training set. Experiments on the ICDAR-2013 offline HCCR datasets show that the proposed method has very promising generalization ability to new classes that never appear in the training set.


I. INTRODUCTION
Offline handwritten Chinese character recognition (HCCR) has been an important research area since the early works of the 1980s [1]. Due to the great diversity of handwriting styles, the confusion between similar characters and the large number of character classes, offline HCCR remains a challenging problem. In the last few years, there has been a significant amount of work on improving HCCR performance. The typical recognition model for HCCR consists of three parts: preprocessing, feature extraction and classification. Although researchers have proposed many methods to improve the Chinese character recognition rate, the best traditional methods based on the modified quadratic discriminant function (MQDF) are still far from human performance. Benefiting from the rapid growth of computational power, massive amounts of training data and better training techniques, deep convolutional neural networks (CNN) [8] have achieved significant improvements in many computer vision tasks. Nowadays, deep CNN-based approaches have become the technology of choice for solving HCCR problems.
Most Chinese character recognition methods focus on a balanced dataset that contains the 3755 frequently used characters of the GB2312-80 level-1 set, with hundreds of samples per character. All test characters are seen at training time, which is known as a closed set recognition problem. However, a more complete set for modern Chinese texts would contain about 7000 characters, and the number of characters exceeds 54000 for historical and scholarly collections, which corresponds to an open set recognition problem. To obtain satisfactory recognition performance, the training samples for each character should be sufficient, especially for deep CNN-based methods. Therefore, current approaches can handle only a limited number of documents satisfactorily.
In this paper, we propose a method that treats Chinese character recognition as a template matching problem. The advantages of using template matching to recognize Chinese characters are: (1) In current methods, the size of the model is proportional to the number of categories. Instead of predicting the probability that a character image falls into each category, our method has only one output unit, which represents the similarity between a template and the input character image. Therefore, the size of our model is fixed no matter how many categories need to be classified. (2) We use sample pairs to train the template matching network, which augments the training data naturally. For a $c$-class classification task with $n$ samples per category in the training dataset, we can generate $nc^2$ sample pairs for the template matching problem by pairing each of the $nc$ samples with each of the $c$ templates. The number of training samples thus increases to $c$ times that of the original dataset. (3) Template matching can be used to recognize Chinese characters that do not appear in the training set without any additional training process. This is well known as the zero-shot learning problem.
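The counting argument in point (2) can be checked on a toy example. The sketch below is illustrative only: the function name `count_pairs` and the toy sizes are assumptions, not part of the original work.

```python
def count_pairs(num_classes: int, samples_per_class: int) -> dict:
    """Count sample-template pairs when every sample is paired with every template."""
    num_samples = num_classes * samples_per_class   # n * c handwritten images
    total = num_samples * num_classes               # each image meets c templates: n * c^2
    positive = num_samples                          # exactly one matching template per image
    negative = total - positive                     # n * c * (c - 1) mismatched pairs
    return {"total": total, "positive": positive, "negative": negative}


# Toy setting: c = 4 classes, n = 3 samples per class.
stats = count_pairs(num_classes=4, samples_per_class=3)
print(stats)
```

Only one pair in every $c$ is a positive match, so the full set of pairs is heavily imbalanced toward negatives.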
The rest of this paper is organized as follows. Section II reviews related work on HCCR. Section III introduces the details of the proposed method. Section IV reports the experimental results. The conclusions of this study and our future work are summarized in Section V.

II. RELATED WORKS
Due to the extraordinary achievements of deep learning in computer vision tasks [8]-[11], research on offline HCCR has shifted to convolutional neural networks (CNN). Multi-column deep neural networks (MCDNN) [13], [14] were the first reported successful use of CNN for offline HCCR. After that, a research team from Fujitsu developed a CNN-based method and won the ICDAR-2013 competition [15]. A voting scheme of alternately trained relaxation convolutional neural networks (ATR-CNN) was proposed by the same team in [16]. Zhong et al. [17] combined traditional feature extraction methods with the inception architecture proposed in GoogLeNet [10] and achieved very high accuracy for offline HCCR, which became the first model to surpass human performance. In [18], Zhang et al. added an adaptation layer to a pre-trained CNN to adapt to the new handwriting styles of particular writers, which set new benchmarks for offline HCCR. Recently, some researchers have focused on the high computational cost and large storage requirements of CNN-based models. Xiao et al. [19] proposed a Global Supervised Low-rank Expansion and an Adaptive Drop-weight technique to address the problems of speed and storage capacity. Li et al. [20] designed an efficient CNN architecture and implemented a cascaded model in a single network, which achieved state-of-the-art results for offline HCCR.
However, all of the methods mentioned above consider each Chinese character as a single class and train a multiclass classifier for HCCR (a character-based classifier). These approaches require a large number of labeled samples to optimize the models, and the learned model cannot recognize new characters that do not appear in the training datasets. In real applications, the number of Chinese characters is very large, and there will not be sufficient labeled data for rarely used characters to optimize the models. Therefore, it is very difficult to obtain good performance with a character-based classifier in real-world applications.

A. Siamese Network
Siamese neural networks are a class of neural network architectures that contain two subnetworks. The subnetworks accept different inputs but share the same configuration, with the same parameters and weights. An energy function is added at the top layer to compute a metric between the feature representations on each side. Parameter updates are mirrored across both subnetworks. A typical siamese neural network is shown in Fig. 1.
The siamese neural network was first introduced by Bromley and LeCun in the 1990s to solve signature verification [22], and it became popular for tasks that involve finding the similarity between two inputs. The network is trained so that two similar images are mapped to nearby locations in feature space. In [23], LeCun et al. proposed a contrastive energy function that contains symmetric terms to decrease the energy of matching pairs and increase the energy of non-matching pairs.

B. Template Matching
When we were learning Chinese, we always practiced writing by following the template characters in the textbook. Once we had learned how to write a character, we remembered it forever. Inspired by this, we treat the HCCR task as a template matching problem. The template characters are generated from the Microsoft YaHei font (msyh.ttf). Some examples are shown in Fig. 2. Instead of using an energy function to compute a metric between the feature representations, we use the $L_1$ distance between the twin feature vectors $f_1$ and $f_2$ to predict the similarity between the templates and the input character images. More precisely, the prediction is given by $p(I_x, I_c) = \sigma(w\,|f(I_x) - f(I_c)| + b)$, where $f(I)$ represents the feature vector for image $I$ extracted by the neural network, and $\sigma$ is the sigmoid activation function, which maps the output onto the range [0,1]. Thus the template matching task can be treated as a binary classification problem, and the binary cross entropy objective is a natural choice for training the network.
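The prediction rule and the training objective can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: the shared feature extractor $f$ is replaced by random feature vectors, $w$ is assumed to be a weight vector applied to the element-wise absolute difference, and the values of `w` and `b` are hand-picked rather than learned.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def match_probability(f_x, f_c, w, b):
    """p(I_x, I_c) = sigmoid(w . |f(I_x) - f(I_c)| + b) for a batch of feature pairs."""
    abs_diff = np.abs(f_x - f_c)        # element-wise L1 components, shape (batch, d)
    return sigmoid(abs_diff @ w + b)    # one similarity score per pair, shape (batch,)


def binary_cross_entropy(p, y, eps=1e-7):
    """Mean binary cross entropy between predictions p and 0/1 labels y."""
    p = np.clip(p, eps, 1.0 - eps)      # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


rng = np.random.default_rng(0)
d = 128                                 # feature size of the fully connected layer
f_x = rng.normal(size=(4, d))           # stand-in features of four handwritten images
f_c = f_x.copy()                        # first two templates match their images exactly
f_c[2:] = rng.normal(size=(2, d))       # last two templates are unrelated
w = -np.ones(d)                         # assumed weights: larger distance, lower similarity
b = 5.0                                 # assumed bias
y = np.array([1.0, 1.0, 0.0, 0.0])      # 1 = same character, 0 = different character

p = match_probability(f_x, f_c, w, b)
loss = binary_cross_entropy(p, y)
print(p, loss)
```

Because both inputs pass through the same feature extractor, a single set of weights produces `f_x` and `f_c`, and only `w` and `b` are specific to the similarity head.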

C. Classification
As shown in Fig. 3, once we have finished optimizing the siamese network as a binary classification task, we can use the discriminative capacity of the learned features for the recognition problem. Suppose we are given a test image $I_x$, which we wish to classify into one of $C$ characters. We can generate the character templates $\{I_c\}_{c=1}^{C}$ and query the network using $\{I_x, I_c\}$ as input pairs for $c = 1, 2, \ldots, C$. We then predict the class with the maximum similarity, $\hat{c} = \arg\max_c p(I_x, I_c)$.
Since a convolutional neural network usually involves a large amount of computation, we can extract the features of the $C$ templates in advance and stack them to form a matrix $F_C$. When a new image $I_x$ is to be classified, we perform just one feedforward pass to extract its features $F_x$. We then compute the similarity between $F_x$ and $F_C$ at negligible computational cost.
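The cached-template classification step can be sketched as below. The names `classify`, `F_C` and `f_x` are illustrative, the features are random stand-ins for the outputs of the trained network, and `w` and `b` are assumed values rather than learned parameters.

```python
import numpy as np


def classify(f_x, F_C, w, b):
    """Score one image feature vector against all cached template features."""
    abs_diff = np.abs(F_C - f_x)                 # broadcast (C, d) - (d,) -> (C, d)
    scores = 1.0 / (1.0 + np.exp(-(abs_diff @ w + b)))
    return int(np.argmax(scores)), scores        # best template index, all similarities


rng = np.random.default_rng(0)
C, d = 10, 128
F_C = rng.normal(size=(C, d))                    # template features, computed once
f_x = F_C[7] + 0.05 * rng.normal(size=d)         # feature of a test image close to template 7
w = -np.ones(d)                                  # assumed similarity-head weights
b = 5.0                                          # assumed bias

pred, scores = classify(f_x, F_C, w, b)
print(pred)
```

The convolutional network runs once per test image; the comparison against all $C$ templates reduces to one matrix-vector product.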

A. Datasets
We use the offline CASIA-HWDB1.0 and CASIA-HWDB1.1 datasets to train our neural network, and evaluate the model performance on the ICDAR-2013 offline competition datasets. All datasets were collected by the Institute of Automation of the Chinese Academy of Sciences. The number of character classes is 3755 (the level-1 set of GB2312-80). The training set contains about 2.67 million samples contributed by 720 writers, and the test set contains about 0.22 million samples contributed by another 60 writers [21].

B. Architecture
Limited by the hardware, we design a very compact architecture consisting of seven convolutional layers and one fully connected layer as the base network. Each of the first three convolutional layers is followed by a max-pooling layer. Next, two convolutional layers are followed by a max-pooling layer. Finally, another two convolutional layers are followed by a fully connected layer that contains 128 neurons. In our baseline network, batch normalization [12] is added to all convolutional layers so that the network converges more easily and the risk of overfitting is reduced. The detailed architecture is shown in Fig. 4.
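The layer ordering described above can be traced to see how the spatial resolution shrinks before the fully connected layer. The sketch below is an assumption-laden illustration: the text does not state kernel sizes, padding or pooling windows, so it assumes size-preserving convolutions and 2 × 2 max-pooling with stride 2, applied to the 64 × 64 input used in the experiments.

```python
# Layer order taken from the description: three conv+pool stages,
# then conv-conv-pool, then conv-conv (followed by the 128-unit FC layer).
LAYERS = ["conv", "pool", "conv", "pool", "conv", "pool",
          "conv", "conv", "pool", "conv", "conv"]


def trace_spatial_size(input_size, layers):
    """Return (layer, output side length) for each layer in order."""
    size = input_size
    trace = []
    for layer in layers:
        if layer == "pool":
            size //= 2          # assumed 2x2 max-pooling with stride 2
        # "conv" is assumed to preserve the spatial size (stride 1, same padding)
        trace.append((layer, size))
    return trace


trace = trace_spatial_size(64, LAYERS)
final_size = trace[-1][1]
print(trace)
```

Under these assumptions the feature map entering the fully connected layer has a side length of 4.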

C. Experimental Settings
The datasets provide gray images with white background pixels. We process the gray images in three steps. First, we invert the gray levels by $I = 255 - I$, so that the background is 0 and the foreground lies in the range [1,255]. Then we crop the redundant background area and preserve the true boundary of the character strokes. Finally, the image is resized to 64 × 64 by contour center linear normalization with the square root of sine of aspect [4]. Since batch normalization is added to all convolutional layers, we initialize the learning rate at 0.1 and multiply it by 0.1 whenever the validation performance stops improving. We use the stochastic gradient descent algorithm to train our model. The mini-batch size is set to 256, with a momentum of 0.9. The regularization strategy is weight decay with an $L_2$ penalty. The multiplier for weight decay is set to $10^{-4}$ during the training process. We conduct all experiments in TensorFlow [24] using a GTX 1060 (3 GB) card.
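The first two preprocessing steps can be sketched in NumPy as below. The final step is a deliberate simplification: a plain nearest-neighbour resize stands in for the contour center linear normalization of [4], so the aspect-ratio handling of the original pipeline is not reproduced. The function name `preprocess` is illustrative.

```python
import numpy as np


def preprocess(gray, out_size=64):
    """Invert gray levels, crop to the stroke bounding box, and resize."""
    gray = np.asarray(gray, dtype=np.uint8)
    inverted = 255 - gray                              # background -> 0, strokes -> [1, 255]

    rows = np.flatnonzero(inverted.max(axis=1) > 0)    # rows containing foreground
    cols = np.flatnonzero(inverted.max(axis=0) > 0)    # columns containing foreground
    if rows.size == 0:                                 # blank image: nothing to crop
        return np.zeros((out_size, out_size), dtype=np.uint8)
    cropped = inverted[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

    # Nearest-neighbour resize (stand-in for the normalization method of [4]).
    r_idx = np.arange(out_size) * cropped.shape[0] // out_size
    c_idx = np.arange(out_size) * cropped.shape[1] // out_size
    return cropped[np.ix_(r_idx, c_idx)]


# Toy image: white page with a black rectangular "stroke".
page = np.full((100, 80), 255, dtype=np.uint8)
page[20:60, 30:50] = 0
out = preprocess(page)
blank = preprocess(np.full((10, 10), 255, dtype=np.uint8))
print(out.shape, out.dtype)
```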

D. Performance on Open Set
In these experiments, we split the Chinese characters into two parts, $C_s$ (seen during training) and $C_u$ (unseen), which are two non-overlapping subsets of the 3755 Chinese characters in the CASIA-HWDB dataset. The datasets are divided accordingly into two parts, $D_s$ and $D_u$. For a single Chinese character in $C_s$, we use all samples labeled as this character to generate positive pairs. For the remaining Chinese characters, we randomly select $n$ samples of each character to generate negative pairs. The specific sampling method is listed in Algorithm 1. We use three sizes of $C_s$ and three sampling ratios, yielding nine different training sets. To monitor performance during training, we use the recognition accuracy on unseen characters in $D_u^{train}$ as the validation criterion to determine when to reduce the learning rate or stop training. After training is finished, we evaluate the recognition rate on $D_s^{test}$, $D_u^{test}$ and $D^{test}$ respectively. The final classification results for each of the nine training sets are listed in Table I. These results show that our proposed method has promising generalization ability to new Chinese characters.
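One possible reading of the pair sampling procedure is sketched below. Algorithm 1 is not reproduced in this text, so the details are assumptions: for each template character, all of its own samples form positive pairs, and `n_neg` randomly chosen samples of every other character form negative pairs. The function name `sample_pairs` and the toy sample identifiers are hypothetical.

```python
import random


def sample_pairs(samples_by_char, n_neg, seed=0):
    """Build (sample_id, template_char, label) triples for the seen characters."""
    rng = random.Random(seed)
    pairs = []
    for template_char, positives in samples_by_char.items():
        # Positive pairs: every sample of this character with its own template.
        pairs.extend((s, template_char, 1) for s in positives)
        # Negative pairs: n_neg random samples of each other character.
        for other_char, others in samples_by_char.items():
            if other_char == template_char:
                continue
            k = min(n_neg, len(others))
            pairs.extend((s, template_char, 0) for s in rng.sample(others, k))
    return pairs


# Toy seen set with three characters and three samples each.
toy = {
    "一": ["a1", "a2", "a3"],
    "二": ["b1", "b2", "b3"],
    "三": ["c1", "c2", "c3"],
}
pairs = sample_pairs(toy, n_neg=2)
num_pos = sum(label for _, _, label in pairs)
num_neg = len(pairs) - num_pos
print(num_pos, num_neg)
```

Varying `n_neg` changes the ratio of negative to positive pairs, which corresponds to the sampling ratios varied in the experiments.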

E. Performance on Close Set
In this section, we conduct experiments on all 3755 Chinese characters. For fair comparison, we train the siamese network ...
Fig. 4. Baseline convolutional architecture for the template matching problem.
Our proposed template-matching-based classifier provides lower accuracy than the current methods because the template-matching-based neural network learns a similarity metric on the characters, which is very difficult for handwritten Chinese characters, especially for similar characters. Fig. 5 shows the top-10 false predictions of our template-matching-based classifier. As can be seen, the recognized characters are very similar to the ground truth.

V. CONCLUSIONS
In this paper, we present a new approach to handwritten Chinese character recognition that learns deep siamese convolutional neural networks for template matching. The character templates are machine-printed images rendered in the Microsoft YaHei font. We evaluate our template-matching-based recognition on the CASIA-HWDB dataset. The results show that our proposed method can recognize characters that do not appear in the training set. To the best of our knowledge, no previous research has focused on "open set" handwritten Chinese character recognition. In our future work, we will focus on better understanding the error cases, further improving the model and reducing the performance gap with state-of-the-art methods.