Few Shots Are All You Need: A Progressive Few Shot Learning Approach for Low Resource Handwritten Text Recognition

Handwritten text recognition in low resource scenarios, such as manuscripts with rare alphabets, is a challenging problem. The main difficulty comes from the scarcity of annotated data and the limited linguistic information (e.g. dictionaries and language models). Thus, we propose a few-shot learning-based handwriting recognition approach that significantly reduces the human annotation effort, requiring only a few images of each alphabet symbol. The method consists in detecting all the symbols of a given alphabet in a text line image and decoding the obtained similarity scores into the final sequence of transcribed symbols. Our model is first pretrained on synthetic line images generated from any alphabet, even one different from the target domain. A second training step is then applied to reduce the gap between the source and target data. Since this retraining would require the annotation of thousands of handwritten symbols together with their bounding boxes, we propose to avoid such human effort through an unsupervised progressive learning approach that automatically assigns pseudo-labels to the non-annotated data. The evaluation on different manuscript datasets shows that our model can lead to competitive results with a significant reduction in human effort. The code will be publicly available in this repository: https://github.com/dali92002/HTRbyMatching


Introduction
Training data-hungry deep learning-based models in low resource scenarios is challenging due to the scarcity of labeled data. This is particularly the case for modern Handwritten Text Recognition (HTR) systems when applied to manuscripts with rare scripts or unknown alphabets. For example, ancient civilizations used alphabets that are no longer in use (e.g. cuneiform, Egyptian hieroglyphs), and historical ciphers (used in diplomatic reports, secret societies, or private letters) often used invented alphabets to hide their contents [1].
Recognizing and extracting information from these documents is important to understand our cultural heritage, since it helps to shed new light on and (re-)interpret our history [2]. However, manual transcription is unfeasible due to the amount of manuscripts, and automatic recognition is difficult due to the very limited availability of annotated data for training. Moreover, the problem becomes harder in ciphers because, when the alphabet is invented, no dictionaries or language models are available.
Contrary to deep learning models, human beings are able to learn new concepts from one or a few examples. To imitate this ability, recent research is being conducted in the field called few-shot learning [3]. For this reason, we explored in our previous work [4] whether few-shot detection could be adapted for recognizing enciphered manuscripts. The main motivation is that typical HTR models must be trained on the particular alphabet to be recognized, and whenever the alphabet changes, the system must be retrained from scratch with samples from the new script. We therefore treated recognition as a symbol detection task: given one or a few examples of each alphabet symbol, the system locates them in the manuscript. As a result, the model was able to work across multiple scripts, while requiring only a few labeled samples from each new cipher alphabet. The first experimental results showed good performance on enciphered manuscripts compared to the typical methods, while reducing the amount of labelled data needed for fine-tuning.
Nevertheless, the labelled data required by our few-shot model still implied a significant human effort: labelling a few enciphered pages for fine-tuning means manually transcribing thousands of symbols together with their corresponding bounding boxes. To alleviate this problem, in this paper we minimize the manual labeling stage by proposing an unsupervised learning approach that automatically and progressively annotates the data by assigning pseudo-labels to the unlabeled handwritten text lines. As a result, our method requires only a few shots of the desired alphabet: the user just crops a few examples (preferably 5) of each symbol to perform the pseudo-labeling, without having to annotate text lines (including symbol bounding boxes) from the same alphabet for fine-tuning. This means that the pseudo-labeled data used to fine-tune our model is obtained automatically, with no further manual effort.
The main contributions of our work are as follows: (i) We propose a few-shot learning model for transcribing manuscripts in low resource scenarios with minimal human effort: it only requires labeling five examples of each symbol of a new alphabet, instead of labeling entire text lines. (ii) We propose an unsupervised, segmentation-free method to progressively obtain pseudo-labelled data, which can be applied to cursive texts with touching symbols. (iii) We propose a generic recognition and pseudo-labeling model that can be applied across different scripts. (iv) We demonstrate the effectiveness of our approach through extensive experimentation on different datasets, reaching a performance similar to the one obtained with manually labelled data.

Low Resource Manuscript Recognition
Low resource handwritten text recognition applied to historical manuscripts is an active field in the document analysis community. However, the research on the transcription of enciphered manuscripts with invented alphabets is quite recent. The first attempt towards transcribing this type of handwritten text was proposed in [5] using MultiDimensional Long Short-Term Memory (MDLSTM) Recurrent Neural Networks [6]. The performance was satisfactory, but at the cost of a time-consuming data labeling effort. Of course, for each new cipher alphabet, a similar annotation stage was required. For this reason, some unsupervised methods were introduced [7,8]. In those approaches, the enciphered document is first segmented into isolated symbols, then a clustering algorithm is applied to group the visually similar symbols. The main drawback of such methods is the segmentation stage, because the symbol segmentation was often inaccurate, provoking transcription errors. Similarly, and given the lack of labelled data, some researchers have opted for learning-free symbol spotting approaches [9,10] for ancient manuscripts (e.g. Egyptian hieroglyphs, cuneiform, or runes).
In summary, supervised methods obtain good performance but require a large amount of labeled data. In contrast, unsupervised or learning-free methods can be applied when labelled data is not available, but they lead to lower performance. Thus, to maintain good performance while reducing the manual annotation, few-shot learning for manuscript recognition seems preferable [4], since it reaches a performance similar to supervised methods while requiring only a few annotated text lines. A similar approach based on character matching was proposed in [11], although the experiments were mostly carried out on synthetic data rather than on real historical or cursive manuscripts.

Handwritten Text Pseudo-Labeling
Pseudo-labeling models aim to benefit from unlabeled data during training. In semi-supervised learning [12,13], a small amount of labeled data is used to start the process. For instance, in the label propagation approach based on distances [14,15], labels (called pseudo-labels) are assigned to the unlabeled data and then used to reinforce the training. Similarly, in [16], the training started with true labels and was gradually extended with pseudo-labels. In [17], a shared backbone extracted features from the labeled, pseudo-labeled and unlabeled data at each iteration. Then, in the feature space, the reliable labels were estimated according to their distance to the true labels, while the non-trusted labels were pushed away with an exclusive loss. Moreover, a pseudo-labeling curriculum approach for domain adaptation [18] used a density-based clustering algorithm. The idea was to annotate data with the same label set, but taken from a different domain.
In HTR, this strategy has rarely been applied, mainly due to the difficulties in character segmentation, since touching characters are common in cursive texts. In [19], labels were guessed at word level using keyword spotting. A confidence score was used to assign new labels to the retrieved words and enlarge the dataset. Furthermore, a text-to-image alignment was proposed in [20] following this strategy.

Proposed Approach
In this section we describe our approach for few-shot handwritten text recognition. First, our model is trained on synthetic data, i.e. text line images created using various Omniglot symbol alphabets [21]. Afterwards, the model is fine-tuned using the pseudo-labelling approach with the specific alphabet from the target domain (real manuscript). These steps are described next.

Few-shot Manuscript Matching
As stated before, few-shot object detection has been shown to be suitable for recognizing manuscripts in low resource scenarios. Formally, in few-shot detection, if the size of the alphabet is N and we provide k examples of each alphabet symbol (named shots or supports), the task is considered an N-way k-shot detection problem. In such a setting, the model can be trained on certain alphabets with sufficient labelled data and later tested on new alphabets (classes) with few labeled data.
Our few-shot learning model, illustrated in Fig. 1, is segmentation-free and works at line level. As input, it takes the text line image with an associated alphabet in the form of isolated symbol images. In this step, one or a few examples (usually up to five) of each alphabet symbol should be given. The two inputs are propagated through a shared backbone to obtain the feature maps. The feature maps are used in the Region Proposal Network (RPN) with an attention mechanism, which performs the depth-wise cross correlation between them, as illustrated in Fig. 2. Region of Interest (ROI) pooling is applied to the RPN proposals and the support image to provide well-cropped symbol image candidates. Thus, we obtain two feature maps representing the regions that are candidates to contain the support image. Those are combined and passed to the final stage, where the bounding boxes are produced with the class 1 (similar to the support symbol) or 0 (different from the support symbol). For each labeled bounding box, a confidence score between 0 and 1 is predicted according to its degree of similarity with the support image. We repeat this process for all supports (all the alphabet symbols) and keep only the bounding boxes with a high confidence score (higher than a given threshold) to construct a similarity matrix between the alphabet symbols and the line image regions. This matrix is the input of the decoding algorithm, which provides the final transcription.
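To make the last step more concrete, the following is a minimal sketch of how a similarity matrix could be assembled from the filtered detections. It is not the released implementation: the detection tuple layout `(symbol_idx, x_min, x_max, score)`, the function name `build_similarity_matrix`, the one-row-per-symbol and one-column-per-pixel layout, and the default `min_score` value are all assumptions made for illustration.

```python
def build_similarity_matrix(detections, num_symbols, line_width, min_score=0.5):
    """Build a (num_symbols x line_width) matrix of similarity scores.

    detections: iterable of (symbol_idx, x_min, x_max, score) tuples, one per
    predicted bounding box (hypothetical layout, horizontal extent only).
    min_score: hypothetical confidence threshold below which boxes are dropped.
    """
    sim = [[0.0] * line_width for _ in range(num_symbols)]
    for symbol_idx, x_min, x_max, score in detections:
        if score < min_score:
            continue  # discard low-confidence boxes
        lo = max(0, int(x_min))
        hi = min(line_width, int(x_max))
        for x in range(lo, hi):
            # keep the best score seen for this symbol at this pixel column
            sim[symbol_idx][x] = max(sim[symbol_idx][x], score)
    return sim


detections = [(0, 2, 5, 0.9), (1, 4, 8, 0.6), (1, 0, 3, 0.2)]
sim = build_similarity_matrix(detections, num_symbols=2, line_width=10)
```

In this toy example the third detection is discarded because its score falls below the assumed threshold, so row 1 is only non-zero on columns 4 to 7.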

Similarity Matrix Decoding
The decoding algorithm shown in Fig. 1 takes the similarity matrix, traverses the columns from left to right, and decides, for each pixel column, the final transcribed symbol class among the candidate symbols. Concretely, for each time step, it chooses the symbol having the maximum similarity score. To minimize errors, a symbol is only transcribed if its bounding box is not overlapped by another symbol with a higher similarity value for a certain number of successive pixels (in our case, we used 15 pixels as a threshold). Despite its simplicity, this decoding method presented in Algorithm 1 is effective for transcribing sequences of symbols. It can be considered also as a modified version of the Connectionist Temporal Classification (CTC) algorithm [22].
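The column-wise decoding idea can be sketched as follows. This is a simplified stand-in and not a reproduction of Algorithm 1: it assumes the similarity matrix is a list of rows (one per symbol) over pixel columns, takes the per-column argmax, and emits a symbol once it has been the uninterrupted winner for `min_run` successive columns (15, following the threshold quoted above). The function name and the handling of ties and empty columns are assumptions.

```python
def decode_similarity_matrix(sim, alphabet, min_run=15):
    """Greedy left-to-right decoding of a (symbols x pixel columns) matrix."""
    if not sim:
        return []
    width = len(sim[0])
    transcription = []
    current, run_length, emitted = None, 0, False
    for x in range(width):
        column = [row[x] for row in sim]
        best_score = max(column)
        # columns where no symbol has a positive score count as background
        winner = column.index(best_score) if best_score > 0 else None
        if winner == current:
            run_length += 1
        else:
            current, run_length, emitted = winner, 1, False
        # emit each winning run once, as soon as it is long enough
        if current is not None and not emitted and run_length >= min_run:
            transcription.append(alphabet[current])
            emitted = True
    return transcription


row_a = [0.9] * 20 + [0.0] * 20   # symbol "a" detected on columns 0..19
row_b = [0.0] * 15 + [0.8] * 25   # symbol "b" detected on columns 15..39
decoded = decode_similarity_matrix([row_a, row_b], alphabet=["a", "b"])
```

One limitation of this simplification is that two adjacent occurrences of the same symbol with no background column between them are merged into a single output symbol.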
As mentioned before, our few-shot model is first trained on the Omniglot dataset: we synthetically construct lines to learn the matching in different alphabets. Then, at testing time, it can be used to recognize unseen alphabets, requiring only a support set composed of a few examples of each alphabet symbol. However, in our previous work [4], experiments showed that the predictions can be significantly improved when we fine-tune the model using some real text lines, because there is a domain difference between the synthetic Omniglot symbols and the real historical symbols.
Fig. 3. Our pseudo-labeling approach. At the beginning, synthetic lines are generated using the support set. Then, the pseudo-labeling phase starts. Initially, there is no pseudo-labeled data, so only synthetic lines are used for retraining the model. Then, the model predicts symbols from the real unlabeled lines of the same script. The symbols with the highest confidence scores, namely the pseudo-labels, are labelled and added with their predicted bounding boxes. Next, the model is retrained using the synthetic lines and the pseudo-labeled symbols from the real lines. The process is repeated until the full dataset is annotated.

Progressive Pseudo Labeling
Since low resource manuscripts are mostly unlabeled, fine-tuning would normally require the user to provide the label of each symbol together with its corresponding bounding box. Thus, to reduce the human labor, we propose to automatically annotate the manuscripts that will be used for fine-tuning the model described above. Our proposed progressive data pseudo-labeling strategy consists of the following two stages.

Synthetic Data Generation
Our few-shot model needs to be fine-tuned using data from the target domain (often with an unseen alphabet) to reduce the gap between the source and target domains. However, since we aim to minimize the user effort, we restrict the requirement to a support set of a few examples of each new alphabet symbol. Hence, the user only has to select up to 5 samples per symbol, called shots. From those shots, we automatically generate synthetic lines by randomly concatenating them into a line image. We tried to make those synthetic lines as realistic as possible. To do so, the space between characters is chosen randomly between 0 and 30 pixels, and each character is rotated by a random angle between -5 and 5 degrees before concatenation. Moreover, we add some artifacts to the upper and lower parts of the line to simulate a realistic segmentation of a handwritten line. The lines created in this way compose our starting labeled set. Since our model was only pretrained on a different data domain, i.e. the synthetic Omniglot lines, this technique significantly improves the model prediction for unseen alphabets or scripts.
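The layout part of this generation step can be sketched without any image library. The snippet below only plans a synthetic line: it draws random shots, gaps (0 to 30 pixels) and rotation angles (-5 to 5 degrees), and computes the bounding box that each rotated shot would occupy, which is what serves as the label. The function name, the `shot_sizes` input format and the output dictionary keys are assumptions, and rendering the pixels and adding the upper and lower artifacts is left to an image library.

```python
import math
import random


def plan_synthetic_line(shot_sizes, length, max_gap=30, max_angle=5.0, rng=None):
    """Plan a synthetic line of `length` symbols.

    shot_sizes: dict mapping a symbol name to a list of (width, height) sizes
    of its cropped shots (hypothetical input format).
    Returns a list of placements with the chosen shot, angle and bounding box.
    """
    rng = rng or random.Random()
    symbols = sorted(shot_sizes)
    placements, x = [], 0
    for _ in range(length):
        symbol = rng.choice(symbols)
        shot_idx = rng.randrange(len(shot_sizes[symbol]))
        w, h = shot_sizes[symbol][shot_idx]
        angle = rng.uniform(-max_angle, max_angle)
        rad = math.radians(angle)
        # size of the axis-aligned box enclosing the rotated shot
        rot_w = math.ceil(abs(w * math.cos(rad)) + abs(h * math.sin(rad)))
        rot_h = math.ceil(abs(w * math.sin(rad)) + abs(h * math.cos(rad)))
        x += rng.randint(0, max_gap)  # random inter-symbol space
        placements.append({
            "symbol": symbol,
            "shot": shot_idx,
            "angle": angle,
            "box": (x, 0, x + rot_w, rot_h),  # (x_min, y_min, x_max, y_max)
        })
        x += rot_w
    return placements


line = plan_synthetic_line(
    {"a": [(20, 30)], "b": [(25, 30), (22, 28)]},
    length=10,
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` instance makes the generated layout reproducible, which is convenient when comparing training runs.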

Pseudo-Labeling Process
After retraining our model with synthetic lines, we begin labeling the non-annotated data. The process is illustrated in Fig. 3. At the beginning, the pseudo-labeled set is empty (no labels are available), so only the synthetic lines can be used for training. Then, we pass the real text lines through our model to get the predictions, which include the bounding boxes of the regions that are similar to the input alphabet images as well as the assigned similarity scores. Since the higher the score, the more credible the label, we choose the top-scored predictions as pseudo-labels at this iteration. We experimentally found that the best option is to set the number of new pseudo-labels at each iteration to 20 % of the training data size. The obtained pseudo-labeled set is joined to the synthetic set for the next training iteration.
This process is repeated until the whole unlabeled set (all text lines) is annotated, or until it is no longer possible to add new pseudo-labels with a credible confidence score (we set a threshold of 0.4 as the minimum confidence score for assigning pseudo-labels). In fact, whenever the score is below this threshold, it is better not to label the symbol. Note that we label the handwritten lines without the need to segment them into isolated symbols. In this way, the symbols that remain unlabeled in the different lines at each iteration are considered as background during the next training. Fig. 4 shows an example of a handwritten line during the pseudo-labeling process. At the beginning, the whole image is considered as background. Then, the symbols with the highest confidence scores are labeled in the first iteration, while the hardest ones are labelled in the following iterations.
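The iterative procedure can be summarized by the sketch below. It is only a schematic under stated assumptions: `predict_fn` and `retrain_fn` are hypothetical callables standing in for the detector, each prediction is assumed to be a dictionary with a unique region `id` and a `score`, and "20 % of the training data size" is interpreted as a fixed per-iteration budget computed from a `training_size` argument.

```python
def select_pseudo_labels(predictions, training_size, fraction=0.2, min_score=0.4):
    """Keep the top-scored predictions, up to fraction * training_size."""
    budget = max(1, int(fraction * training_size))
    credible = [p for p in predictions if p["score"] >= min_score]
    credible.sort(key=lambda p: p["score"], reverse=True)
    return credible[:budget]


def progressive_pseudo_labeling(synthetic_set, predict_fn, retrain_fn,
                                training_size, max_iterations=20):
    """Alternate retraining and pseudo-label selection until nothing is added."""
    pseudo_labeled = []
    for _ in range(max_iterations):
        # retrain on synthetic lines plus everything pseudo-labeled so far
        retrain_fn(synthetic_set + pseudo_labeled)
        already = {p["id"] for p in pseudo_labeled}
        candidates = [p for p in predict_fn() if p["id"] not in already]
        new_labels = select_pseudo_labels(candidates, training_size)
        if not new_labels:
            break  # everything is labeled, or no credible prediction remains
        pseudo_labeled.extend(new_labels)
    return pseudo_labeled


# Toy run with a fixed set of predictions instead of a real model.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.3, 0.2, 0.1, 0.05]
toy_predictions = [{"id": i, "score": s} for i, s in enumerate(scores)]
retrain_sizes = []
labels = progressive_pseudo_labeling(
    ["synthetic line"],
    predict_fn=lambda: list(toy_predictions),
    retrain_fn=lambda data: retrain_sizes.append(len(data)),
    training_size=10,
)
```

In a real run the scores returned by `predict_fn` would change after each retraining, which is what allows harder symbols to cross the 0.4 threshold in later iterations; the toy predictor above keeps them fixed for simplicity.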

Experimental Results
In this section we present the experiments that we performed to validate our approach. We begin by presenting the different datasets (corresponding to low resource scenario), and then, we present and discuss the results.

Datasets
As low resource handwritten text, we chose historical enciphered manuscripts and Codex Runicus.

Enciphered Manuscripts
As we said before, ciphers are a typical form of low resource handwritten data. Many ciphers use a large variety of invented symbols instead of common alphabets. In this work we choose two enciphered manuscripts, namely Borg and Copiale. Both are described in detail and are publicly accessible online. In our experiments, we exclude the symbols with very few occurrences (once or twice), so we use 24 symbols from the Borg manuscript and 78 from the Copiale one. Fig. 5 shows some examples of these handwritten ciphers. As can be observed, in the Borg cipher the symbol segmentation is difficult because of the frequent touching symbols, which is one of the main motivations for our segmentation-free method. For Copiale, the size of the alphabet is large, so it can be used to test our approach with a higher number of classes. We took a few pages of each document for fine-tuning, i.e. to perform the progressive pseudo-labeling process. A performance comparison with manually produced labels is also provided.

Codex Runicus
The Codex Runicus is a historical manuscript, written on 100 parchment folios (leaves) around 1300 AD in the province of Scania, in medieval Denmark. We took 10 pages to perform our experiments. Those pages were transcribed by an expert so that the transcription could be compared with our automatic labeling. An example of this manuscript is illustrated in Fig. 5. We chose this manuscript because it uses a rare alphabet and fits our low resource handwriting recognition problem well.

Experimental Setup and Metrics
To carry out the experiments, we first trained our proposed few-shot handwriting recognition model using lines created from the Omniglot dataset only. Then, we retrained the model using synthetic lines, created by randomly concatenating the 5 selected shots of each symbol and applying some transformations (including rotation, resizing, thickness modification, etc.), henceforth called Synthetic Data (SD). This step is performed to reduce the domain gap between the Omniglot lines and the real lines. Afterwards, we start predicting the labels and obtaining the Pseudo Labeled Data (PSD) by using the approach detailed in Subsection 3.3. We finally fine-tune the model with this data and compare its performance to a method that uses Real Labeled Data (RLD) for training. The evaluation is done according to the Symbol Error Rate (SER) metric, which is the same as the Character Error Rate used in HTR. Formally, $\mathrm{SER} = \frac{S + D + I}{N}$, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the length of the ground truth. The lower the value, the better the performance.
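For reference, the SER defined above can be computed with a standard edit-distance (Levenshtein) dynamic program, whose result is the minimum value of S + D + I. The sketch below is a generic implementation rather than the evaluation script used in the experiments; the function name is an assumption.

```python
def symbol_error_rate(reference, hypothesis):
    """SER = (S + D + I) / N, with N the length of the reference sequence."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[m] / n


ser = symbol_error_rate(list("abcd"), list("abxd"))  # one substitution out of four
```

Sequences are passed as lists of symbols so that multi-character symbol names, as found in cipher alphabets, are treated as single units.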
We compare our approach with our previous few-shot model [4], the unsupervised [7,8] and supervised [5] approaches for ciphered manuscript recognition.

Results
In the Borg cipher, 117 non-annotated training lines, containing 1913 symbols, are used to learn the pseudo-labels, whereas 273 lines are used for testing. This manuscript is considered a hard case because of the overlapping symbols, which makes predicting correct bounding boxes challenging. Also, the writing style is variable. From the training lines, we crop 5 examples of each Borg symbol class.
The obtained results are shown in Table 1. As can be seen, using a few-shot method with real labels leads to a SER of 0.21, which we consider the upper bound in performance. However, this result is costly, since a user must manually annotate 1913 symbols, including their labels and bounding boxes. We also notice that the supervised MDLSTM with a larger training set, annotated at line level (no bounding boxes are required), obtains a moderate result because of the difficulties of this manuscript mentioned before. We notice that the unsupervised methods are only useful when the segmentation of lines into isolated symbols is accurate, which is a costly task as well. Our few-shot model, trained only on Omniglot and tested on Borg, also leads to a poor result (a SER of 0.53). The reason is the difference between the training and test domains. On the other hand, when using the pseudo-labeled data provided by our approach, we obtain an acceptable result of 0.24 SER, with a large saving in user effort because we only require 5 examples of each symbol, avoiding a costly manual annotation.
The Copiale manuscript contains easy-to-segment symbols but has a larger alphabet size. As can be seen in Table 1, the MDLSTM performs better on this dataset because of the larger number of labelled training lines and the single handwriting style. However, our model achieves a competitive result using less data (the few-shot model is trained with 176 lines containing 7197 symbols). Nevertheless, annotating these lines is costly, so a better choice is to automatically produce pseudo-labels. By using our pseudo-labeling process, we reach a competitive performance compared to the manually labeled data (a SER of 0.15 versus 0.11).
Finally, we tested our method on the Runicus manuscript, as an example of an ancient document with a rare alphabet. This manuscript can be considered easier than the ciphers because the symbol segmentation is easy and the alphabet size is moderate. For this reason, an unsupervised clustering method can also be appropriate. Since labeled datasets of this specific historical manuscript do not exist, an expert took 56 lines belonging to 4 pages and annotated the 1583 symbols they contain, to be used for fine-tuning. As expected, when using real labelled data the results are better (a SER of 0.05) than without any fine-tuning (a SER of 0.40). When we compare the quality of our produced pseudo-labels against the manually created ones, we observe that, by using pseudo-labeling, we reach a competitive result of 0.09 SER. This demonstrates the suitability of our method, because the performance is close to the one obtained with manual labels, while significantly reducing the annotation effort.
All in all, we can conclude that our proposed pseudo-labeling method achieves good results when recognizing low resource handwritten texts, with an important decrease in the user effort for data annotation. The analysis of the human effort is detailed next.

Annotation Time Consumption
Manually annotating data is a time-consuming task and should be taken into account when using HTR models. Thus, in this section, we measure the time needed to label the three datasets to illustrate the manual labeling effort. As shown in Table 2, the more lines and the bigger the alphabet size, the more time is required to label the symbols with their bounding boxes. For reference, we measured the time required to provide the shots for our method and compared it with the manual annotation time. We found that locating and cropping the 5 examples of a symbol takes approximately 40 seconds per symbol of the alphabet. Thus, the user needed to spend 16 minutes for Borg, 17 minutes for Runicus and 52 minutes for Copiale to provide the shots for our approach. So, we can conclude that automatically providing pseudo-labels significantly reduces the manual effort, with a minimal loss in recognition performance compared to the manual annotation.
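The reported shot-cropping times for the two ciphers are consistent with reading the 40 seconds as a cost per alphabet symbol, using the alphabet sizes given in the dataset description (24 symbols for Borg, 78 for Copiale). This reading is an assumption checked by the short calculation below; the Codex Runicus alphabet size is not stated in the text, so it is left out.

```python
SECONDS_PER_SYMBOL = 40  # approximate time to locate and crop 5 shots of one symbol


def shot_annotation_minutes(alphabet_size, seconds_per_symbol=SECONDS_PER_SYMBOL):
    """Estimated time, in minutes, to provide the shots for a whole alphabet."""
    return alphabet_size * seconds_per_symbol / 60


alphabet_sizes = {"Borg": 24, "Copiale": 78}
estimated = {name: shot_annotation_minutes(size) for name, size in alphabet_sizes.items()}
```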

Pseudo-labeling Performance Analysis
Our proposed method progressively labels the dataset: we start by labelling easy symbols and progressively label the complicated ones. As a consequence, the accuracy of correctly labeling bounding boxes decreases as we select new pseudo-labels at each iteration. We evaluated the quality of our pseudo-labeling approach on the three tested datasets by comparing the predicted bounding boxes and their corresponding pseudo-labels to the manually annotated ones. A predicted bounding box is considered a correct detection if it has a minimum overlap (Intersection over Union, IoU) of 0.7 with the ground-truth box. We found that the more difficult the dataset is (in terms of segmentation, alphabet size and similarity between symbols), the lower the performance of our pseudo-labeling approach and the more iterations are needed in the labeling process. For example, the Borg labeling accuracy reaches 74 % after obtaining all the labels. In Copiale, where symbols are easy to segment, the result was 85 %. In Codex Runicus, we obtain the highest pseudo-labeling accuracy (94 %) because the symbol segmentation is easier than in Borg and the number of classes is lower than in Copiale. It is worth mentioning that, during our experiments, we found that it is better to continue the labeling process despite a decrease in the labeling performance. The reason is that, although we might add some wrong labels, in general the incorporation of hard examples benefits the training, and even a bounding box with a wrong label still helps in the segmentation part. Moreover, the experiments show that there is only a small difference between the manually annotated labels and our automatically produced ones, which encourages further improvements in our labeling process.
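The detection criterion used in this analysis can be written down directly. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples and only checks the overlap; comparing the predicted pseudo-label with the ground-truth label would be an additional equality test. The function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_detection(predicted_box, ground_truth_box, threshold=0.7):
    """A prediction counts as correct if its IoU reaches the threshold."""
    return iou(predicted_box, ground_truth_box) >= threshold


shifted_by_one = is_correct_detection((0, 0, 10, 10), (1, 0, 11, 10))   # IoU = 90 / 110
shifted_by_five = is_correct_detection((0, 0, 10, 10), (5, 0, 15, 10))  # IoU = 50 / 150
```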

Threshold Selection Study
In our experiments we set a threshold of 0.4 before adding a character to the labeled set. This threshold was chosen after testing other values and finding that 0.4 performs best. Table 3 shows the results of this experiment, in which we tested different thresholds for selecting the pseudo-labels. The experiments were done on the Borg dataset.
In this paper, we address handwriting recognition in low resource scenarios, that is, when there is little labeled data, or only unlabeled data, to train on. So far, we have opted for an unsupervised approach that starts from a few shots of the desired alphabet. However, labeling some real lines as a start and pseudo-labeling the rest is also possible. We tested this strategy, as presented in Table 4. The obtained results show that starting with 50 % of labeled lines (i.e. 58 lines) leads to a good result, which is even better than the standard supervised training with 117 lines. This is due to the curriculum learning, which improves the model convergence. However, when reducing the starting lines to 30 % or 20 %, the performance decreases and becomes similar to starting from only a few shots. Note also that starting with more manually labeled lines reduces the number of unlabeled lines to be pseudo-labeled, which also decreases the training time. Overall, we can conclude that starting from a few shots is a better solution in terms of avoiding the costly annotation effort, since the SER is only slightly affected (we obtain a SER of 0.24 using our unsupervised pseudo-labeling).

Conclusion
In this paper, we presented a novel pseudo-labeling few-shot transcription method for low resource scenarios (manuscripts with rare alphabets and very few labelled data). We show that we can significantly reduce the human labor of annotating handwritten datasets while maintaining the performance. The experiments performed on enciphered and historical manuscripts confirmed its usefulness, with a significant reduction in user effort and a minimal loss in recognition performance.
Our pseudo-labeling few-shot model is a significant extension of our previous work [4]. In fact, its simplicity makes it applicable even on top of other methods, like [11]. Also, for common scripts with few labeled data, pseudo-labels can be predicted to train standard HTR models, which may lead to better results than the few-shot ones.
As future work, we aim to enhance the quality of the provided labels to keep reducing the need for manual intervention. We also plan to extend our approach to work at paragraph or page level, and to cover more low resource scripts as well as other scripts.