In-domain versus Out-of-domain transfer learning for document layout analysis



Introduction
Document layout analysis refers to the task of identifying the different semantically meaningful regions of a document page and grouping them based on a set of pre-defined classes such as text, decorations and titles. Understanding the layout of a document page is of paramount importance for both humanities scholars and computer scientists, as it represents the first step towards the extraction and further analysis of its contents, enabling other tasks such as optical character recognition [1], automatic text transcription [2] and writer identification [3]. In recent years there has been an increasing effort to automate this process; however, differently from printed, well-structured documents, for which very promising results have been obtained, performing this task on heavily edited, handwritten documents has proven particularly challenging, especially when it comes to ancient manuscripts. These are characterized by various degrees of degradation, inconsistent capture conditions across dataset instances, and large amounts of additions and corrections that are heavily intertwined with the main text. These characteristics make it nearly impossible to rely on the most popular techniques adopted for printed documents, which typically rely on bounding boxes to group together the different elements of the page layout [4].
The alternative is represented by the adoption of pixel-level segmentation maps, which allow for the higher degree of precision needed to overcome this problem. However, compared to typical segmentation tasks in natural scenes, the regions of ancient manuscript layouts are typically very small and have jagged edges, which makes producing the corresponding segmentation maps very time-consuming [5]. Furthermore, an understanding of the content of the pages is typically needed to make sure that the different regions are classified correctly, meaning that the segmentation process must be supervised by an individual with appropriate domain knowledge. Both these requirements lead to a scarcity of data for this type of task, with the majority of available datasets providing at most a couple dozen images as the corresponding training set. Examples of such datasets are Diva-HisDB [6], Bukhari [7] and the very recent U-DIADS-BiB [8].
In the past few years, this problem has been tackled by various authors [9][10][11], who developed few-shot-learning-oriented frameworks specifically aimed at leveraging the small amount of available data to generate increasingly accurate predictions for the task at hand, producing results that are on par with or even surpass previous state-of-the-art approaches that relied on much more data. In the present paper, we tackle the problem from another point of view by exploring different transfer learning approaches as a way to make good use of alternative data sources to pre-train our models. In particular, we analyze the effectiveness of different initialization approaches for the encoder component of a selected semantic segmentation model: training from scratch, in-domain transfer learning, cross-domain transfer learning, as well as a combination of the last two. The rest of the paper is organized as follows: Chapter 2 reviews the related works in this field of research, Chapter 3 provides an overview of the adopted methodology, and Chapter 4 discusses the experimental setup and the corresponding results. Finally, in Chapter 5 we summarize our findings and propose directions for future work.

Related works
The scarcity of extensively labeled data in the field of ancient manuscript analysis can be attributed to the specialized knowledge and substantial time and financial resources required for its creation, particularly when dealing with documents featuring intricate layouts. Consequently, a logical progression involves the development of systems capable of delivering commendable performance with limited annotated data. Nevertheless, the literature currently provides only a limited number of works showcasing such systems.
In [12], a few-shot learning technique named Deep & Syntax, designed for segmenting historical handwritten registers, is presented. Few-shot learning, a paradigm enabling models to generalize from a limited number of examples, proves beneficial in scenarios with restricted annotated data. The suggested method utilizes a hybrid system that relies on recurrent patterns to delineate individual records. This hybrid system integrates U-shaped neural networks, typically employed in image segmentation tasks, with logical rules such as filtering and text alignment.
Another example of a few-shot learning strategy for document layout segmentation has been introduced in [13]. In this approach, only two ground truth images per manuscript are employed to train the segmentation model, yielding results comparable to supervised models that currently represent the state of the art for this task. The proposed framework combines a novel data augmentation method with a segmentation refinement module employing a traditional computer vision approach for local thresholding. This integration fully exploits the limited dataset available while still achieving competitive performance compared to other supervised methods, as highlighted in their in-depth analytical work [9]. A further, more recent approach is the one presented in [11], where a one-shot learning approach is introduced for the layout segmentation of ancient Arabic documents. In this paper, the authors introduce an efficient framework that, despite being trained on only one labeled page per manuscript, achieves state-of-the-art performance compared to other approaches tested on a challenging dataset of ancient Arabic manuscripts. This method consists of 3 main components, a semantic segmentation backbone, a dynamic instance generation module, and a segmentation refinement module, and aims to overcome the limitation of requiring extensive manual labeling for training machine learning models in this field.
Finally, in [10] the authors tackle the challenge of limited ground truth availability by proposing an unsupervised deep learning approach for page segmentation. Their method involves the use of a Siamese neural network to differentiate between patches based on quantifiable properties, with a specific emphasis on the count of foreground pixels. The goal is to ensure that spatially adjacent patches demonstrate similarities in their measurable characteristics. Following the training of the network, the acquired features are then utilized for the task of page segmentation.
Transfer learning approach. Transfer learning from pre-trained deep networks is an approach to derive advantages from the representations learned on a large, general-purpose database when relatively few examples are available to train a model [14,15]. In the literature, numerous works employ transfer learning techniques, as they have wide application in many domains, such as the medical [16,17], biometric [18], agricultural [19], industrial [20] and robotic [21] fields. Conversely, in the field of ancient document layout analysis, the effectiveness of transfer learning techniques has not been extensively explored, as there are only a few works in the literature addressing this topic.
In an investigation conducted in [22], it was determined that the outcomes of the semantic segmentation problem, whether employing training from scratch or cross-domain learning from a pre-existing model, are contingent upon the specific characteristics of the test dataset. The accuracy of segmentation varies significantly across datasets, irrespective of the model architecture or training methodology employed. To arrive at these findings, the researchers initialized their model's encoder with pre-trained weights from ImageNet and systematically compared its performance with models trained from scratch on an ancient document dataset.
In a more recent study [23], the authors present an overview of domain-specific transfer learning for document layout segmentation. They demonstrate that utilizing document-related images for pre-training yields consistently enhanced performance and faster convergence compared to training from scratch or relying on a large, general-purpose dataset like ImageNet.
One limitation common to both these works, however, is that they explore only the use of ImageNet as a potential out-of-domain dataset for pre-training, which focuses on the classification of instances at the image level rather than the pixel level.
In this paper, we extend the examination of the efficacy and potential advantages of employing different transfer learning strategies for the layout analysis of ancient manuscripts, in contrast to training the model from the ground up. Specifically, we provide a more in-depth analysis of what kind of data is best suited for the pre-training process in the context of document layout segmentation. For this reason, we introduce the use of an additional cross-domain dataset, namely MS-COCO, specifically tailored towards semantic segmentation in images, and we also expand the analysis to hybrid strategies involving the combination of both cross-domain and in-domain data for pre-training our model.

Segmentation architecture
As the semantic segmentation model for our experiments, we opted for the popular DeepLabv3 [24] architecture. DeepLabv3 is a ResNet-based architecture that employs atrous (dilated) convolutions in cascade or in parallel with different dilation levels. This approach allows for retaining a larger spatial resolution for the feature maps throughout the network architecture compared to models relying heavily on striding and pooling layers. The key aspect that makes atrous convolutions effective in the context of semantic segmentation tasks is that they allow the creation of deeper neural network architectures while providing output feature maps that are larger than those of a traditional deep CNN architecture, without any increase in the amount of computation needed. While we are aware that there are more recent and sophisticated models for semantic segmentation, the focus of this work is not on obtaining the best possible results or on setting a new state of the art, but on better understanding which transfer learning strategies lead to the best results when all the other conditions are kept as consistent as possible. For this reason, we chose DeepLabv3, as it represents a well-tested, reliable choice.
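The key property of atrous convolutions described above can be verified directly: a dilated 3×3 convolution enlarges the receptive field without adding parameters or shrinking the feature map. A minimal sketch in PyTorch (the channel and spatial sizes below are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation r has an effective receptive field of
# (2r+1)x(2r+1) while keeping exactly the same number of parameters.
# With padding=r and stride=1 the spatial resolution is fully preserved,
# which is the property DeepLabv3 exploits instead of striding/pooling.
x = torch.randn(1, 64, 84, 126)  # a dummy backbone feature map

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

print(standard(x).shape)  # torch.Size([1, 64, 84, 126])
print(atrous(x).shape)    # torch.Size([1, 64, 84, 126]) -- same resolution
print(sum(p.numel() for p in standard.parameters())
      == sum(p.numel() for p in atrous.parameters()))  # True
```

Stacking such layers with increasing dilation rates, as DeepLabv3 does in its ASPP module, grows the receptive field multiplicatively while the feature map size stays constant.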

Transfer learning strategies
Transfer learning refers to the process through which the knowledge learned on one task (upstream) is re-used to boost the performance on a related task (downstream). In this work, we explore three different pre-training strategies, characterized by the different types of upstream data employed to initialize the segmentation model.

Table 1: Compact overview of key information regarding the datasets used for our analysis.
Furthermore, we studied the effects of two transfer learning pipelines to adapt the trained models to the target data.An overview of the whole setup is provided in Fig. 2.

Cross-domain transfer learning
As previously mentioned, in our analysis we explore three different pre-training strategies. The first is cross-domain pre-training, meaning that the dataset characterizing the upstream task is substantially different from the one characterizing the downstream task, either in content, objective, or both. This has been one of the most popular approaches to transfer learning since the release of large-scale general-purpose datasets such as ImageNet1K, and it has proven very effective in a wide variety of application fields. The main advantage of this strategy is that this type of dataset is nowadays easily available in an already structured format. Furthermore, deep learning models pre-trained on these datasets are provided by many open-source libraries, making them easily accessible. However, when working on downstream tasks characterized by data not belonging to the domain of natural images, it becomes harder to learn effectively transferable features through this approach. A common example is the medical imaging field [25].

In-domain transfer learning
The second pre-training strategy we analyze is the in-domain one, meaning that for the pre-training of our model we employ a dataset that shares the same domain as the one we will use for the downstream task we are interested in, even though the specific instances are different. While the features learned through this approach are typically more applicable to the downstream task, making it more reliable for domain-specific applications, the required data is not always easily available, as its limited scope is a deterrent to the very time-consuming task of building a specific dataset for this purpose.

Combining cross-domain and in-domain transfer learning
The final strategy we explore is the combination of cross-domain and in-domain pre-training. We combine the two previously described approaches by performing an initial training of our model on large-scale cross-domain datasets to learn a general set of features. As a second step, we perform a fine-tuning process on the in-domain dataset, in our case U-DIADS-Bib, to learn a set of domain-specific features.
During this second phase, we do not freeze any of the model's weights, allowing the feature extractor to be tailored to the specific nature of the features characterizing the in-domain dataset. Besides the potential improvement in performance, we believe this approach could also represent a way to reduce the amount of domain-specific data needed compared to relying exclusively on an in-domain transfer learning strategy.

Transfer learning pipelines
The two transfer learning pipelines we studied are shown in Fig. 2(b) and (c) and differ in the way the feature extraction module of the pre-trained model is employed. In (b) this module is frozen when performing the training process on the downstream task, while in (c) it is fine-tuned on the target data, allowing it to learn the data's peculiarities and therefore improve its effectiveness on the target task, at the cost of a higher computational effort during the training process. An important thing to notice is that in both pipelines the weights of the decoder and classifier modules of the model are initialized randomly before the training step on the downstream task. When working with either cross-domain or in-domain data individually as the input to pre-train our model, we explore the effects of both pipelines to better understand their respective impact on the final performance.
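Pipeline (b) can be sketched as follows. This is a hypothetical illustration, not the paper's code: the `ToySegModel` class, the module names `backbone`/`classifier` and the Kaiming initialization scheme are our own assumptions.

```python
import torch
from torch import nn

def prepare_frozen_encoder(model):
    """Pipeline (b): freeze the feature extractor, randomly re-initialize
    the decoder/classifier, and return only the trainable parameters."""
    for p in model.backbone.parameters():
        p.requires_grad = False             # encoder stays fixed
    for m in model.classifier.modules():    # decoder/classifier: random init
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    return [p for p in model.parameters() if p.requires_grad]

class ToySegModel(nn.Module):
    """Stand-in for a pre-trained segmentation model (e.g. DeepLabv3)."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv2d(8, num_classes, 1)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = ToySegModel()
trainable = prepare_frozen_encoder(model)
# only the still-trainable parameters are handed to the optimizer
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Pipeline (c) differs only in skipping the freezing loop, so the optimizer receives every parameter of the model.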
When combining both types of data, we perform a first pre-training step on the cross-domain data following pipeline (a); then we fine-tune the entire model on the in-domain data of the upstream task following pipeline (c); finally, we perform the transfer learning step on the target downstream data by once again exploring the effects of both freezing the feature extractor and fine-tuning the whole model.
Tab. 1 provides a schematic overview of all the datasets employed in our analysis.

Pre-training datasets
In this section, we give a brief description of the 3 datasets employed to pre-train the selected model architecture. We decided to analyze the effectiveness of the features learned by training on datasets with different characteristics. In particular, we selected 2 large cross-domain datasets, namely ImageNet-1K [26] and COCO [27], which are employed for classification and semantic segmentation tasks respectively and represent our cross-domain data sources, as well as a small, recently published in-domain dataset focusing on semantic segmentation specifically applied to the layout analysis of ancient manuscripts, called U-DIADS-Bib [8].

ImageNet-1k
ImageNet-1k is a hierarchically structured dataset consisting of 1,281,167 natural images focused on the classification task, with its instances organized in 1000 different categories. This dataset is probably the most popular resource for pre-training computer vision models and has been successfully employed in a wide variety of application fields since its release in 2012.

COCO
The COCO (Common Objects in Context) dataset, introduced in 2014, contains over 200k labeled images covering 80 different object categories, which appear in around 1.5 million individual instances in the dataset images. While this dataset is much smaller than the previously presented ImageNet-1k, its main advantage is that it focuses mainly on the tasks of object detection and semantic segmentation, making it more relevant to the downstream task we are analyzing in this work.

U-DIADS-Bib
U-DIADS-Bib [8] is a recently published dataset focusing specifically on the layout segmentation of ancient manuscripts. It consists of a total of 200 images representing the pages of 4 different manuscripts written either in Latin or Syriac. Each of the dataset instances can contain up to 6 different segmentation classes: the main text of the manuscript, decorations, titles, chapter headings, additional paratexts and, finally, the background of the pages. The key characteristics of this dataset are the improved precision in the definition of the ground truths compared to previously available ones, as well as its heterogeneity, which made it an ideal candidate for our analysis of in-domain transfer learning. Fig. 1 shows a sample page from 2 of the documents characterizing the dataset, together with the corresponding segmentation maps.

Evaluation datasets
In this section we present the two datasets selected to evaluate the different initialization strategies for our segmentation model. We relied on the two most popular datasets for layout analysis of handwritten documents, namely the DIVA-HisDB dataset [6] and the Bukhari dataset [7].

DIVA-HisDB
The first dataset selected to evaluate our model is the DIVA-HisDB dataset [6], a historical document dataset consisting of a total of 150 high-resolution, pixel-annotated pages coming from 3 different medieval manuscripts, identified as CSG18, CSG863 and CB55, and characterized by complex and heterogeneous layouts as well as different levels of degradation. Each page can contain up to 4 different segmentation classes, categorized as main text, comments, decorations and background. For each of the documents, 20 images are typically reserved for training, 10 for validation and 20 more for testing.

Bukhari dataset
The second dataset we selected for the evaluation process is the one presented by Bukhari et al. [7], which represents the most popular one for the task of document layout segmentation on historical Arabic manuscripts. It consists of 32 images, each representing a page from one of three different Arabic historical manuscripts. Of all the samples, 24 are typically used for training while the remaining 8 are used for testing.

Training setup
As previously stated, for all our experiments we relied on the DeepLabv3 architecture for semantic segmentation, trained following one of the three pipelines reported in Fig. 2. The training process on the target datasets was performed for a total of 200 epochs, using the Adam optimizer with a learning rate of 1e-3 and a batch size of 20.
Furthermore, an early-stop condition was introduced after the first 50 epochs, triggered if the model loss on the validation set did not decrease over the last 20 epochs. All the instances of the employed datasets were resized to 672 × 1008, keeping the aspect ratio of the original images intact to avoid artifacts.
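The early-stop rule above can be sketched as a small helper. The exact bookkeeping (comparing the best loss of the last 20 epochs against the best loss observed before them) is our assumption about one reasonable reading of the rule; the 50-epoch warm-up and 20-epoch patience values are the ones stated in the text.

```python
def should_stop(val_losses, warmup=50, patience=20):
    """Return True once training may halt: we are past the warm-up epochs
    and the validation loss has not improved in the last `patience` epochs."""
    if len(val_losses) <= warmup:
        return False
    recent_best = min(val_losses[-patience:])
    earlier_best = min(val_losses[:-patience])
    return recent_best >= earlier_best  # no new minimum in the last window

# illustrative loss curves (made up for demonstration)
improving = [1.0 / (e + 1) for e in range(60)]  # still decreasing at epoch 60
plateaued = [1.0] * 80                           # flat: no improvement at all
print(should_stop(improving))  # False
print(should_stop(plateaued))  # True
```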

Losses
The loss we adopted to train our models is a combination of the dice loss and the weighted cross-entropy loss, where the weight of each class is the square root of the inverse frequency of that class in the instances belonging to the corresponding document, as proposed in [9]. This weighting accounts for the substantial class imbalance characterizing all the document datasets, detailed in Tab. 3.
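A minimal sketch of this combined loss is given below. The sqrt-inverse-frequency weighting follows the description above; the specific dice formulation, the smoothing term `eps` and the dummy class frequencies are our own assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def class_weights(pixel_freqs):
    """Square root of the inverse per-class pixel frequency."""
    freqs = torch.as_tensor(pixel_freqs, dtype=torch.float32)
    return torch.sqrt(1.0 / freqs)

def dice_loss(logits, target, eps=1e-6):
    """Soft dice loss averaged over classes (one common formulation)."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def combined_loss(logits, target, pixel_freqs):
    wce = F.cross_entropy(logits, target, weight=class_weights(pixel_freqs))
    return wce + dice_loss(logits, target)

# dummy batch: 2 patches, 6 layout classes, 8x8 pixels
logits = torch.randn(2, 6, 8, 8)
target = torch.randint(0, 6, (2, 8, 8))
freqs = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]  # made-up class frequencies
loss = combined_loss(logits, target, freqs)
```

The weighting makes mistakes on rare classes (titles, decorations) cost more than mistakes on the dominant background class, which would otherwise dominate the gradient.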

Metrics
To evaluate the different initialization strategies we relied on 4 popular metrics in the field of document layout segmentation, namely precision, recall, Intersection over Union (IoU) and F1-Score, defined in terms of the per-class true positives (TP), false positives (FP) and false negatives (FN) as: Precision = TP/(TP+FP), Recall = TP/(TP+FN), IoU = TP/(TP+FP+FN) and F1 = 2·Precision·Recall/(Precision+Recall). Each metric was calculated individually for each document class of the 2 target datasets. Furthermore, a macro average of the scores obtained for the different segmentation classes was performed, to ensure that each of them contributes equally to the final score.
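The per-class computation and the macro average can be sketched as follows, on flat arrays of predicted and ground-truth pixel labels (the helper name and the toy labels are ours, for illustration only):

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes):
    """Compute (precision, recall, IoU, F1) per class, then macro-average."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, iou, f1))
    # macro average: every class contributes equally, regardless of its size
    return np.mean(scores, axis=0)

pred = np.array([0, 0, 1, 1])  # toy predicted pixel labels
gt = np.array([0, 1, 1, 1])    # toy ground-truth pixel labels
macro_p, macro_r, macro_iou, macro_f1 = per_class_metrics(pred, gt, 2)
print(round(macro_p, 4))  # 0.75
```

The macro average matters here for the same reason as the class weighting in the loss: with heavily imbalanced layouts, a micro average would be dominated by the background pixels.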

Transferability of the learned features
Tab. 2 shows the performances, in terms of the 4 selected evaluation metrics, obtained by our model when initialized with different strategies and trained on the target dataset following each of the three aforementioned pipelines. In this table, the best and second-best performing systems for each transfer learning pipeline are highlighted in bold and underlined respectively. Furthermore, we mark in red the scores achieved by pre-trained models that do not improve over the random initialization baseline. From this analysis, we can observe that pre-training on out-of-domain datasets with no further fine-tuning of the encoder module consistently leads to a drop in performance compared to random initialization, meaning that the features learned during the training on the upstream task have no real overlap with the ones needed to perform the downstream task. In contrast, pre-training on U-DIADS-Bib, the in-domain source dataset, consistently leads to an increase in performance compared to the baseline, with improvements in the scores for the selected metrics ranging from 1% to 6% for all the classes of the target datasets.
The only exception is the score obtained on the recall metric for the CSG18 document class of the DIVA-HisDB dataset. On the other hand, when fine-tuning the whole model on the downstream task data, every pre-training strategy consistently leads to an improvement over the baseline approach. This implies that even if the source and target datasets belong to very different domains, pre-training on the former still provides a better starting point for the training on the latter compared to random initialization. Furthermore, we can clearly observe how combining the cross-domain and in-domain initialization strategies consistently leads to the best overall results, with the setup represented by pre-training on COCO and fine-tuning on U-DIADS-Bib outperforming all the other initialization strategies on all the selected metrics, regardless of the way the encoder weights are treated during the final training step, with the only exception of the recall for the CSG18 class. It is interesting to observe how, even though pre-training on natural image datasets does not provide any real benefit over random initialization when used individually while freezing the feature extractor module of the segmentation model, it actually represents a valid strategy when combined with a fine-tuning step on an in-domain dataset such as U-DIADS-Bib, especially when using COCO as the source dataset. In fact, we can observe how the hybrid strategy involving the combination of the COCO and U-DIADS-Bib datasets consistently achieves better results compared to relying exclusively on the latter. This means that, even in this scenario, pre-training on COCO still leads to a more robust initialization than a random one. On the other hand, the results obtained by combining ImageNet1K and U-DIADS-Bib are overall comparable to the simple in-domain transfer learning strategy, with only slight improvements on some of the metrics.
We can also observe how, regardless of the pre-training strategy employed, fine-tuning the whole model on the target dataset consistently leads to better performance compared to training only the decoder and segmentation modules while keeping the encoder module frozen. This behavior is expected, as the model can more effectively learn a set of bespoke features on the target dataset following this approach.

Impact on convergence time
As a further result, we show in Fig. 3 and Fig. 4 the learning curves representing the evolution of the validation loss throughout the 200 training epochs for pipelines Fig. 2(b) and (c) respectively. As we can observe, in both scenarios, all the strategies involving the use of an in-domain dataset, either on its own or combined with a pre-training step on a natural image dataset, lead to faster convergence of the model on all the document classes characterizing the target datasets, compared to both random initialization and pre-training exclusively on out-of-domain datasets. Additionally, in-domain and hybrid pre-training strategies consistently allow for a much more stable learning process, drastically reducing the performance spikes characterizing the other strategies. On the other hand, when pre-training exclusively on the cross-domain datasets and freezing the encoder module, the convergence time is comparable to that of the model trained from scratch, with the downside of a higher final validation loss. When fine-tuning the whole model on the target data, we can observe a marked instability of the training process during the first 50 to 75 epochs for the models pre-trained exclusively on cross-domain or in-domain data; after that point they substantially stabilize, leading to a lower final loss compared to the model trained from scratch. As previously mentioned, this phenomenon does not occur when a mixed pre-training strategy is employed: the corresponding validation loss curves are very stable throughout the training process and consistently lead to the best overall performance compared to all the other approaches.
For completeness, in Appendix A we also provide the evolution of the Intersection over Union metric on the validation set of each selected manuscript class across the different training epochs, for both transfer learning strategies.

Performance in low-data regimes
Finally, we provide an analysis of the performance achieved through the different initialization strategies when artificially limiting the amount of data available from the target datasets to train the model. In particular, in Fig. 5 we show the results obtained by our models when trained on increasing percentages of the available target data, starting from only 20%.

Conclusions
In this work, we provided an overview of different transfer learning strategies in the context of document layout segmentation. We compared 4 different initialization approaches, namely random initialization, cross-domain initialization through the popular ImageNet1K and COCO datasets, in-domain initialization through the U-DIADS-Bib dataset, and hybrid initialization combining pre-training on either ImageNet1K or COCO with a fine-tuning step on the in-domain dataset. Furthermore, we explored 2 different fine-tuning strategies, involving respectively the training of the whole model and the training of exclusively the decoder and segmentation modules on the downstream task datasets. We tested the different approaches on 2 publicly available target datasets for document layout analysis, DIVA-HisDB and Bukhari. We found that, differently from other application areas, pre-training on large-scale, general-purpose datasets consisting of natural images does not bring any real benefit and is actually detrimental when working on downstream tasks revolving around document layout segmentation, both in terms of convergence speed and in terms of the overall performance of the model on the target dataset, unless the entire model is fine-tuned on the target data. This leads to the intuition that the features learned from cross-domain data are not directly transferable to the domain of manuscript analysis, although they still represent a better starting point than random initialization of the model weights. On the other hand, transfer learning strategies revolving around the use of in-domain source data, as well as hybrid strategies that make use of both in-domain and out-of-domain data, consistently lead to increased effectiveness and efficiency, in terms of the amount of data needed, on the downstream segmentation task, regardless of the training strategy employed on the target dataset.
In particular, we have shown how pre-training on the COCO dataset followed by a fine-tuning process on U-DIADS-Bib led to the best overall performance on the target task, while at the same time substantially speeding up the convergence of the model and leading to a much more stable training process. Furthermore, we have shown how this approach allows for a much more efficient use of the available data, achieving better performance when trained on only 20% of the data than the randomly initialized model trained on the entire dataset available for the downstream task. To conclude, while we focused on the specific task of document layout segmentation, we believe our findings are likely applicable to other tasks involving the analysis of documents, both in printed and handwritten form. As a future effort, we plan to expand our analysis in this direction, to gain deeper insight into the effect of in-domain transfer learning strategies across different tasks.

Fig. 1 :
Fig. 1: Samples from the classes Latin2 and Syriaque 341 of the U-DIADS-Bib dataset, together with their corresponding ground truth mask in which each color represents a different semantic class of the document layout.

Fig. 2 :
Fig. 2: Visual representation of the different training strategies we explored in this study. (a) represents the traditional training-from-scratch approach, which we used as our baseline; (b) shows a transfer-learning approach in which the feature extractor component of the network is frozen during the training process on the target downstream task; finally, (c) shows a further transfer-learning strategy in which the entire model is fine-tuned on the downstream task dataset. For both (b) and (c) we explored the use of in-domain and cross-domain data to pre-train the selected model, as well as a combination of the two.

Fig. 3 :
Fig. 3: Learning curves representing the evolution of the validation loss for the 4 target document classes using different initialization strategies and keeping the encoder frozen at the time of fine-tuning on the target data.

Fig. 4 :
Fig. 4: Learning curves representing the evolution of the validation loss for the 4 target document classes using different initialization strategies and fine-tuning the entire model on the target data.

Fig. 5 :
Fig. 5: Performance (IoU) of the segmentation model on the test set of the 4 document classes when initialized with different strategies while relying on increasing percentages of the available data for the training. Only the decoder and classifier modules of the model were trained on the target data, while the feature extractor was kept frozen.

Fig. 6 :
Fig. 6: Performance (IoU) of the segmentation model on the test set of the 4 document classes when initialized with different strategies while relying on increasing percentages of the available data for the training. The model was fine-tuned entirely on the target data, with no frozen modules.

Fig. A1 :
Fig. A1: Learning curves representing the evolution of the performance of the model with respect to the Intersection over Union (IoU) metric on the validation set for the 4 target document classes using different initialization strategies with only the decoder and segmentation head modules fine-tuned on the target dataset.

Fig. A2 :
Fig. A2: Learning curves representing the evolution of the performance of the model with respect to the Intersection over Union (IoU) metric on the validation set for the 4 target document classes using different initialization strategies with the whole model fine-tuned on the target dataset.

Table 2 :
Tabular overview of the performances obtained by fine-tuning the selected model on the document classes of the target datasets when initialized with different strategies. The best and second-best results for each transfer learning pipeline are reported in bold and underlined respectively, while the red values represent the instances in which a pre-trained model performed worse than the baseline model trained from scratch.

Table 3 :
Classes distribution (%) at pixel level for each manuscript class of the three datasets employed for the analysis.