HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition

Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Probably the single most important identifier for linking is the personal name. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples hereof, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models automatically transcribing the personal names from the scanned documents. By making more challenging large-scale databases publicly available, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.


I Introduction
As part of the global digitization of historical archives, the present and future challenges are to transcribe these efficiently and cost-effectively. We hope that the scale, quality, and structure of the HANA database can offer opportunities for researchers to test the robustness of their handwritten text recognition (HTR) methods and models on more challenging, large-scale, and highly unbalanced databases. The availability of large-scale databases for training and testing HTR models is a core prerequisite for constructing high-performance models. While several databases based on historical documents are available, only a few have been made available for personal names. For linking, matching, or genealogy, the personal name of an individual is one of the most important pieces of information, and being able to read personal names across historical documents is of great importance for linking individuals across, e.g., censuses: see, for example, Abramitzky, Boustan, and Eriksson (2012), Abramitzky, Boustan, and Eriksson (2013), Abramitzky, Boustan, and Eriksson (2014), Abramitzky, Boustan, and Eriksson (2016), Abramitzky, Boustan, Eriksson, Feigenbaum, and Pérez (2020), Abramitzky, Mill, and Pérez (2020), Feigenbaum (2018), Massey (2020), Bailey, Cole, Henderson, and Massey (2020), and Price, Buckles, Van Leeuwen, and Riley (2019). Importantly, Abramitzky et al. (2020) and Bailey et al. (2020) both discuss the rather low matching rates when linking transcriptions of the same census together. This is partly due to low transcription accuracy of names. Furthermore, both papers are concerned about the lack of representativeness of the linked samples. These two observations strongly motivate the HANA database, i.e., collecting and sharing more data of higher quality, and the work on improving transcription methods in order to reduce the potential biases in record linking.
In total, the HANA database consists of more than 1.1 million personal names written on single-line images, with each personal name consisting of an average of three names. All original images are made electronically available by Copenhagen Archives, and the processed database described here is made freely available. While most of the existing databases contain single isolated characters or isolated words, such as the names available in the Handwriting Recognition database at Kaggle, one of the important features of our database is its resemblance to other challenging historical documents, where source data often contain general image noise, different writing styles, and varying traits across the images. The rest of the paper is organized as follows: In Section II, we describe the database and the data acquisition procedure in detail. Section III presents the benchmark results on the database using a ResNet-50 deep neural network in three different model settings. In addition, this section provides examples validating how the HANA database can be used for transfer learning on, e.g., Danish and US census data. In Section IV, we discuss the database and the benchmark methods and results, in addition to some considerations for future research. Section V concludes.

II Constructing the HANA Database
This section describes the HANA database in detail and the image-processing procedures involved in extracting the handwritten text from the forms. In 1890, Copenhagen introduced a precursor to the Danish National Register. This register was organized and structured by the police in Copenhagen and has been digitized and labelled by hundreds of volunteers at Copenhagen Archives. In Figure I, we present an example of one of these register sheets.
The Register Sheets In total, we obtain 1,419,491 scanned police register sheets from Copenhagen Archives. All individuals above the age of 10 residing in Copenhagen in the period 1890 to 1923 are registered in these forms. Children between 10 and 13 were registered on their father's register sheet; once they turned 14, they obtained their own sheet. Married women were recorded on their husband's register sheet, while single women were recorded on their own sheet. This most likely biases our database towards including more men than women, as we focus on the main individual on the register sheets. Based on the number of spouses registered, the male-to-female ratio in the final database lies somewhere between 1 and 1.5, with the most likely value being 1.3 (56% men relative to 44% women).
Prior to 1890, the main registers used by the police were the census lists, which go back to 1865 and lasted until 1923. However, because the census lists were only compiled twice a year, in May and November, some migration across addresses was not recorded, and individuals residing only shortly in the city would not have been recorded (Copenhagen Archives, 2022a).
A wealth of information is recorded in the police register sheets, including birth date, occupation, address, name of spouse, and more, all of which is systematically structured across the forms. While this paper focuses on extracting and creating benchmark results for the personal names, the remaining information can be constructed using similar procedures to those presented in this paper and may serve as additional databases for HTR models. Rare information is also included in the register sheets, such as whether the individuals were wanted, had committed prior criminal offences, or owed child benefits. This kind of information is written as notes and is therefore typically found under special remarks in the documents. As opposed to the censuses, which were sorted by streets and dates, the police register sheets were sorted by personal names. This made it easier for the police to control the migration of citizens of Copenhagen and track individuals over time. Once an individual died, they were transferred to the death register (Danish National Archives, 2022).
In 1923, the Danish National Register replaced the police register sheets as the system for registering all citizens in Copenhagen (Copenhagen Archives, 2022b). Since 1924, the Danish National Register has registered all individuals in all municipalities in Denmark (Konow, 2009).

Data Extraction and Segmentation
To segment the data, we use point set registration. Point set registration refers to the problem of aligning the point space of an input image to that of a template image (Besl and McKay, 1992). To find point spaces that roughly correspond to each other across semi-structured documents, we extract horizontal and vertical lines from the document. We use the intersections as the point space, which we align with the template points. We briefly outline the method below; see Dahl, Johansen, Sørensen, Westermann, and Wittrock (2021) for more details.
To start the process of extracting the personal names from the forms, we binarize the images. We extract horizontal and vertical lines from the documents by performing several morphological transformations, see, e.g., Szeliski (2010). The intersections are subsequently found using Harris corner detection (Harris and Stephens, 1988). Once we have the point space defined, we use Coherent Point Drift (Myronenko and Song, 2010), which coherently aligns the point space of the input image to the point space of the template image. This yields a transformation function that maps the points found in the input images to the points in the template image. To improve the segmentation performance of the database, we add several restrictions to the transformations such that all extreme transformations are automatically discarded. This reduces the size of the database to just over 1.1 million images with attached labels. Even though this removes more than 20% of the data, we believe the gain from more reliable data outweighs the cost associated with a smaller database.
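The alignment-and-filtering step just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits a plain least-squares affine map between matched points as a simple stand-in for Coherent Point Drift, and the scale bounds used to flag extreme transformations are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map from src points to dst points.
    A simple stand-in for Coherent Point Drift; returns a (3, 2) matrix P
    such that [x, y, 1] @ P approximates the corresponding template point."""
    n = len(src)
    A = np.hstack([src, np.ones((n, 1))])           # homogeneous coordinates, (n, 3)
    P, *_ = np.linalg.lstsq(A, dst, rcond=None)     # (3, 2) affine parameters
    return P

def is_extreme(P, scale_bounds=(0.8, 1.25)):
    """Flag transformations whose scale falls outside plausible bounds.
    The bounds are hypothetical thresholds for illustration only."""
    sx = np.linalg.norm(P[:2, 0])   # effective scale along x
    sy = np.linalg.norm(P[:2, 1])   # effective scale along y
    lo, hi = scale_bounds
    return not (lo <= sx <= hi and lo <= sy <= hi)
```

A near-identity transformation (pure translation) passes the check, while a sheet whose fitted map rescales the page drastically would be discarded.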
Once we have prepared the images, we clean the labels to fit into a Danish context, which implies that all non-Danish variations of letters are replaced with their Danish equivalents. A few of these might be incorrect, e.g., if the individuals are foreigners, but we expect the level of misclassification arising from this to be smaller than the number of characters labelled incorrectly by the volunteers at Copenhagen Archives. In addition, we restrict the sample to names that contain only alphabetic characters and have a length of at least two characters, yielding a final database of 1,105,904 full names.
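As an illustration of this cleaning step, a single name token could be normalized as below. The substitution table is hypothetical, since the paper does not list the exact replacements used; only the two filters (alphabetic characters only, length of at least two) follow the description above.

```python
import re

# Hypothetical mapping of non-Danish letters to Danish equivalents;
# the actual substitutions used by the authors are not listed in the paper.
DANISH_EQUIV = {"ä": "æ", "ö": "ø", "ü": "y", "é": "e", "è": "e"}

# Keep only alphabetic Danish characters, at least two of them.
VALID = re.compile(r"^[a-zæøå]{2,}$")

def clean_label(name):
    """Normalize one name token; return None if it fails the restrictions."""
    name = name.lower()
    for foreign, danish in DANISH_EQUIV.items():
        name = name.replace(foreign, danish)
    return name if VALID.match(name) else None
```

Tokens containing digits, punctuation, or only a single character would be dropped by this filter.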
It is possible to increase the number of extracted names for each sheet by considering the spouse and children of an individual. However, this would entail lowering the quality of the data, as the last name is not necessarily present for these individuals and the quality of the segmentation is also lower. Hence, we leave this for future work.
The personal name labels are categorized as either first or last names by Copenhagen Archives. Most commonly, the last name is written as the first word on the image, while the subsequent words are the first and middle names (in that order). However, some exceptions occur, and there are other rules that may interfere with the structure of the ordering, such as underlining and numbering. The structure of the database can therefore be challenging for HTR models, as this structuring complication has to be overcome by the models.

Figure II: Examples from the HANA database with the corresponding labels written above. The last name is typically written as the first word, followed by the first and middle names, which is the case for all images shown.

Train and Test Splits
The test database consists of 5% of the total database and is randomly selected. The training data consists of 1,050,082 documents, while the test data consists of 55,822 documents. A total of 2,129 surnames are represented only in the test sample, which contains 10,228 unique last names relative to the overall total of almost 70,000 unique last names.
As mentioned previously, the database is highly unbalanced due to vast differences in the commonness of names. Only the 604 most common surnames in the database occur at least 100 times, and only the 3,463 most common surnames occur more than 20 times. The latter group covers slightly more than 85% of the data, meaning that almost 15% of the images contain names that occur fewer than 20 times. This naturally leads to challenges for any HTR model, as it needs to learn to recognize names with very few or even zero examples in the training data. However, this is also an important and indeed crucial goal to work towards.
Labelling While transcribers at Copenhagen Archives were instructed to make accurate transcriptions of the register sheets, there exist human-introduced inconsistencies in the labels. The same points made by Deng, Dong, Socher, Li, Li, and Fei-Fei (2009) apply here, as there are two issues in particular to consider. First, humans make mistakes, and not all users follow the instructions carefully. Second, users are not always in agreement with each other, especially for harder-to-read cases where the characters of an image are "open to interpretation".
With respect to the first point, we perceive this as part of the challenge of constructing any digital handwriting database, as they are all based on human transcriptions. For this database, Copenhagen Archives used super users to validate the transcriptions. In addition, it is possible to send requests for corrections on the website of Copenhagen Archives and thereby change incorrect labels. With respect to the second point, a number of considerations should be taken into account. A common labelling error found in the database is the confusion of subtly similar characters, similar names, or phonetic spellings. Characters or names that are often misread are, e.g., Pedersen versus Petersen, Christensen versus Christiansen, and Olesen versus Olsen. Solutions for these complications are difficult, as it is in many cases a judgement call by the transcriber. To reduce the number of incorrect labels in the training database, one could consider combining similar names, but we refrain from adopting that strategy.
Further Characteristics of the Database Despite there being 69,906 unique surnames and 48,394 unique first and middle names, the total number of unique names amounts to only 105,607, as there is an overlap between the two sub-groups. There are fewer than 50 thousand examples of each of the characters q, w, x, z, å, and æ. For q and å, there are fewer than five thousand examples. The vast majority of names contain four to nine characters, with only 6.35% of the names being shorter or longer. A figure frequently reported for Danish last names is the fraction of names ending with sen. In this database, 710,117 surnames end with sen, which corresponds to 64.21% of all last names. Appendix B provides additional characteristics of the names in the database.

III Benchmark Results
This section describes the benchmark results published together with the HANA database and illustrates the value of transfer learning. We use a variant of a ResNet-50 network for estimating the benchmark results. We transcribe the surnames in a character-by-character classification fashion. The predictions are subsequently matched to the closest existing name. One could also consider the surnames as entities and classify each word in a holistic sense. We imagine that this could be problematic due to the unbalanced nature of the database and the fact that the training sample does not contain all unique names. We train three neural networks: one to predict the last name, one to predict the first and last name, and one to predict the entire name, i.e., first, middle, and last names.
We start by describing the architecture, optimization, and other details of the neural networks used in the paper. Scripts for the implementations are all in Python (Van Rossum and Drake, 2009) using PyTorch (Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, et al., 2019).
Network Architecture Each neural network uses a ResNet-50 with bottleneck building blocks (He, Zhang, Ren, and Sun, 2016) as its feature extractor; the weights of the PyTorch version of ResNet-50 pretrained on ImageNet (Deng et al., 2009) are used as the initial weights. The neural networks differ only insofar as their classification heads differ. Here, a method similar to the one described in Goodfellow, Bulatov, Ibarz, Arnoud, and Shet (2013) is used, with the exception that the sequence length is never estimated. The weights (and biases) of the classification heads are randomly initialized. For the last name network, 18 output layers are used (names are at most 18 letters long), each with 30 output nodes (the letters a-å as well as a "no character" option). For the first and last name network, 36 output layers are used (2 names of at most 18 letters), each with 30 output nodes. For the full name network, 180 output layers are used (up to 10 names of at most 18 letters), each with 30 output nodes.
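As a sketch of this head structure, the snippet below builds 18 per-position classification heads with 30 classes each, matching the last name network described above. The tiny convolutional backbone is a stand-in for the pretrained ResNet-50 (kept small so the sketch is self-contained), and its layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class LastNameModel(nn.Module):
    """One classification head per character position (18 positions,
    30 classes each: 29 letters plus a 'no character' option).
    The backbone below is a small stand-in for the paper's ResNet-50."""

    def __init__(self, max_len=18, n_classes=30, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dims to one feature vector
            nn.Flatten(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, n_classes) for _ in range(max_len)
        )

    def forward(self, x):
        h = self.backbone(x)                                   # (batch, feat_dim)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (batch, max_len, n_classes)
```

The first and last name network would use 36 heads in the same pattern, and the full name network 180.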
Optimization All neural networks are optimized using stochastic gradient descent with momentum of 0.9, weight decay of 0.0005, and Nesterov acceleration based on the formula in Sutskever, Martens, Dahl, and Hinton (2013). The batch size used is 256 and the learning rate is 0.05. The networks are trained for 100 epochs and the learning rate is divided by ten every 30 epochs. The loss of each classification head is the negative log likelihood loss of the head, and the total loss is the average of the negative log likelihood losses of the heads.

Image Preprocessing Images are resized to half width and height for computational reasons (resulting in images of width 522 and height 80). The images are normalized using the ImageNet means and standard deviations (to normalize similarly to the pretrained ResNet-50 feature extractor). During training, image augmentation in the form of RandAugment with N = 3 and M = 5 is used (Cubuk, Zoph, Shlens, and Le, 2020); the implementation is based on Kim (2020).
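The optimization settings stated above map directly onto PyTorch's built-ins; a minimal sketch, with a dummy parameter standing in for the network weights:

```python
import torch

# Dummy parameter standing in for the network weights.
params = [torch.nn.Parameter(torch.zeros(1))]

# SGD with Nesterov momentum, weight decay, and the stated learning rate.
optimizer = torch.optim.SGD(params, lr=0.05, momentum=0.9,
                            weight_decay=0.0005, nesterov=True)

# Divide the learning rate by ten every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```

Calling `scheduler.step()` once per epoch reproduces the stated schedule: 0.05 for epochs 0-29, 0.005 for epochs 30-59, and so on.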
Prediction of Networks Some post-processing of predictions is performed. Each output layer is mapped to its corresponding character (the 29 letters and the "no character" option). Then, for each name (i.e., a sequence of 18 output layers), the "no character" predictions are removed and the remaining letters form the prediction. Letting θ_i denote the "no character" option at position i, this means that both [h, a, n, θ_4, s, θ_6, ..., θ_18] and [h, a, n, s, θ_5, ..., θ_18] will be transformed to hans.
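Concretely, the decoding step can be sketched as follows; the character set and the index layout (29 letters followed by a final "no character" index) are assumed from the description above.

```python
# 29 letters (a-z plus æ, ø, å); index 29 is the "no character" option.
CHARSET = "abcdefghijklmnopqrstuvwxyzæøå"
NO_CHAR = 29

def decode_name(indices):
    """Drop 'no character' predictions; the remaining letters form the name."""
    return "".join(CHARSET[i] for i in indices if i != NO_CHAR)
```

Both index sequences from the example in the text decode to the same string, since the position of the blanks is irrelevant after removal.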
Matching As an additional step, we also test the performance when we refine the predictions of the networks by matching. In some cases, a list of possible names (i.e., a lexicon of valid outcomes) may be available, in which case it can be used to match predictions that are not valid to the nearest valid name. Specifically, we use the procedure in the difflib Python module to perform this matching.
For the last name network, the predictions that do not fall within the list of valid last names are matched to the nearest last name. For the first and last name network, a similar procedure is used separately for the first name and the last name.
For the full name network, a similar procedure is used separately for the first name, the up to eight middle names, and the last name.
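A minimal version of this lexicon matching with difflib (the module named in the text; the exact call and cutoff below are our assumption) could look like:

```python
import difflib

def match_to_lexicon(prediction, lexicon):
    """Return the prediction if it is a valid name,
    otherwise the closest name in the lexicon."""
    if prediction in lexicon:
        return prediction
    closest = difflib.get_close_matches(prediction, lexicon, n=1, cutoff=0.0)
    return closest[0] if closest else prediction
```

Valid predictions pass through unchanged; invalid ones snap to their nearest neighbour in the lexicon.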
Performance Measures To measure the performance of our networks, we focus on the word accuracy (WACC) of our models. Thus, a name prediction is considered incorrect if one or more characters of the name are transcribed incorrectly, and thus the character error rates are significantly lower than the word error rates implicitly reported. We consider performance both with and without matching to a lexicon as a post-processing step. Further, we report performance at different levels of data coverage. As our networks report a measure of their confidence for each prediction, we can rank all predictions by this measure. Then, we can calculate the WACC at, e.g., 90% data coverage by removing the 10% of predictions where the network is least certain. We believe this metric is interesting, as it might be used to, e.g., (1) select the predictions where a sufficient WACC is reached or (2) let humans assist in transcribing images that the network is particularly uncertain about.
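The coverage-based metric can be computed as follows; this is a sketch, assuming confidence scores where higher means more certain.

```python
def wacc_at_coverage(predictions, labels, confidences, coverage):
    """Word accuracy on the most confident `coverage` fraction of predictions."""
    order = sorted(range(len(predictions)),
                   key=lambda i: confidences[i], reverse=True)
    keep = order[:max(1, round(coverage * len(predictions)))]
    correct = sum(predictions[i] == labels[i] for i in keep)
    return correct / len(keep)
```

At coverage 1.0 this is plain word accuracy; lowering the coverage discards the least certain predictions first, which typically raises the reported accuracy.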

Results
Table I shows an overview of the performance. The data coverage is either 100% or 90%, comparing the word accuracy when we use all transcriptions in the test set to the accuracy when testing only on the 90% of the test data on which the model is most certain. There is a trade-off between data coverage and accuracy, which motivates also showing the results using a threshold at the 90th quantile. The table represents three different models for character-by-character recognition. The first model predicts only the characters in the last name, the second model predicts the first name and the last name, and the third model predicts the full name sequence. All of them are trained on the full database. For the full name model, the number of names present in a person's predicted name is equal to the number of names in the corresponding label in 96.85% of the cases. Using the Levenshtein distance to calculate the character error rate of the predictions (without matching), we find error rates of 1.48% for the last name network, 1.66% for the first and last name network, and 11.82% for the full name network. The word accuracy for the last name model is 94.33% without matching; this drops to 93.52% for the first and last name model. The full name model is evaluated on the full names of the individuals and has to take into account the correct ordering of all names in order to get each name correct, which we believe explains the performance deterioration for our rather simple network, where the word accuracy drops to 67.44%.
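For reference, the character error rate computation uses the Levenshtein distance; a standard dynamic-programming sketch of our own, normalized by label length:

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(prediction, label):
    """Character error rate: edit distance normalized by label length."""
    return levenshtein(prediction, label) / max(1, len(label))
```

A single-character confusion such as Pedersen versus Petersen yields a distance of one, i.e., a character error rate of one eighth for an eight-letter name.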

Figure III: Performance on the HANA database: Last Name
The figure shows the performance on the test set from the HANA database for the model trained on last names. The performance with matching to the closest name is very similar to the unmatched performance until the 80th percentile. From this point on, the two lines diverge and the matched predictions clearly outperform the unmatched predictions.
For the last name and the first and last name networks, Figures III and IV provide additional measures of the performances. Figure III shows the performance of the last name model over the entire range of data coverage, both with and without the use of matching. While both models achieve higher accuracy at lower data coverage, what is particularly interesting is the difference between the two: at full data coverage, matching improves the performance substantially, while this difference quickly disappears the moment the most uncertain predictions are sorted away. Figure IV shows the corresponding performance for the first and last name network. Here, the first and last name performances are illustrated separately. As for the last name model, it is evident that matching is particularly helpful at full data coverage and for last names. Interestingly, while the network is better at transcribing first names at full data coverage, at around 80% data coverage this changes, and it becomes better at transcribing last names. The last name transcription performance of this model is, however, lower than that of the last name model, suggesting that estimating a separate model only for first names and using it in conjunction with the last name model might be superior to a model estimating both jointly.

Figure IV: Performance on the HANA database: First and Last Name
The figure shows the performance on the test set from the HANA database for the model trained on first and last names. The network obtains a higher accuracy on the first names relative to the last names (reversing once below around 80% data coverage), and the last name accuracy is lower than the performance of the model that is trained only on last names.
Transfer Learning By publishing the database, we aim to establish a foundation for transfer learning to handwritten names from other data sources. This in turn can help others transcribe handwritten names more accurately, while also reducing costs, as less manual labelling will be needed. To motivate the usefulness of the HANA database for transfer learning, we present results for three separate transfer learning examples: transcription of handwritten surnames from Danish and US census data (see Figure V for some examples of these images), and transcription of the handwritten names from the Handwriting Recognition database from Kaggle, which contains transcriptions of 410,000 handwritten names. In this section, we provide details from our experiments for the two census datasets; Appendix A provides details for our third example. In all cases, our results demonstrate that adopting a transfer learning strategy based on the HANA database can increase transcription accuracy, even when large amounts of training data are available.
For both the Danish and US census data, we present two sub-cases: we analyze the performance when a relatively small number of training images is available (approximately 10,000) and when a larger number of training images is available (approximately 50,000). By training networks both with and without transfer learning on these datasets, we can infer the magnitude of the performance boost achieved by using the HANA database for transfer learning.

Figure V: Examples of handwritten surnames from the Danish and US census data. The width-to-height ratio of the HANA database is 6.5, which is similar to the Danish census ratio of 7.2, while the US census ratio is 3.7.

Figure VI: Performance gain from adopting a transfer learning strategy based on the HANA database. Panel VIa shows the performance on the Danish census data and Panel VIb on the US census data. The performance gain is larger for the smaller training sets but still substantial with more than 50,000 training examples. The increase is also larger for names that more closely mimic the original HANA training examples, which is part of the reason for the better performance on the Danish census data; most likely, the handwriting is also more similar across these datasets. Even so, the generalizability of the HANA database seems to be largely validated by the performance increase in both transfer learning exercises on the US census data.
In total, we train eight new networks. Due to the difference in the number of labelled images for each dataset and the use of transfer learning from the HANA database last name network, we expect that the optimal learning rate for each network might differ substantially. For this reason, we perform a grid search on a validation set, consisting of five percent of the training data for each network, to tune the learning rate.
All other training settings are similar to those we used to train models on the HANA database. Thus, all new models are similar to the last name model on the HANA database, and training differs only with respect to the learning rate used and the starting weights.

Figure VI shows our main findings. The performance based on the Danish census is illustrated in Panel VIa, while the performance based on the US census is illustrated in Panel VIb. The data coverage is gradually increasing along the first axis, and at a data coverage of 100% it is clear that the worst performing model for each census is the network trained on the small database without transfer learning, while the best performing model is the network trained on the large database with transfer learning. Quite interestingly, there is a difference between which model is second best for the Danish and US censuses. For the Danish census, the model trained on the small database with transfer learning is better than the model trained on the large database without transfer learning, while this is reversed for the US census. This is likely due to the larger similarity between the HANA database and the Danish census compared to the similarity between the HANA database and the US census (see Figures V and II). However, we also find large performance gains for the US census, particularly for the small database, which seems intuitive, as smaller datasets have less information to learn from and thus benefit more from transfer learning.
The US census images differ from those of the HANA database and the Danish census in that they contain only the surname. This might contribute to the smaller performance gain we see when applying transfer learning, compared to the Danish census. Further, the performance both with and without transfer learning is worse on the US census. We use a large test sample from the US census to validate the performance, with more than 60,000 test examples, while for the Danish census data we have approximately 6,000 test examples. In general, it seems that the US census data is more difficult to transcribe, making the reduction in error rates from transfer learning even more promising. At full data coverage on the Danish census, the WACC increases from 77.8% to 92.2% for the small training set and from 86.1% to 94.6% for the large training set. On the US census, the WACC increases from 72.8% to 78.7% for the small training set and from 84.7% to 86.8% for the large training set.
We believe that transfer learning from the HANA database can provide large gains when transcribing handwritten names from other data sources. These gains are particularly large when transferring to a domain that is close to the HANA database and when relatively few labelled images are available. The gains can also be substantial when transferring to a domain that is further away, provided relatively many labelled images are available. Using transfer learning with more than 50,000 training sample points, we achieve an error rate reduction of 61.4% for the Danish census and 14.2% for the US census data. This equates to 21,772 corrections of falsely transcribed images when transcribing one million handwritten US names. For the Danish names, and for the US names when only 10,000 labelled images are available, the increase in transcription accuracy is much larger. Thus, while transfer learning leads to smaller gains when more labelled images are available, the benefits are still tangible. Further, we find that most currently available datasets with handwritten names contain fewer than 50,000 labelled handwritten names, and labelling thousands of images is both time-consuming and expensive. This means that transfer learning from the HANA database not only helps improve transcription accuracy, it could also reduce costs, as fewer labelled images will be needed.

IV Discussion
Table I and Figures III and IV summarize our results on the HANA database. Due to computational constraints, we only tested the performance of relatively few models. Yet, our models still achieved impressive performance, being able to transcribe names with high accuracy. As these models provide the first results on this database, there are currently no comparable results available, and we hope that other researchers can use these results as a benchmark and transfer learn from this database. To show the validity of such a strategy, we presented two transfer learning exercises in detail (with a third one discussed in Appendix A), showing that the use of the HANA database can significantly increase the transcription accuracy of names from both the Danish and US censuses.
We believe that large-scale databases are a necessary prerequisite for achieving high accuracy when transcribing handwritten text. This database proves to be sufficiently large for models to read handwritten names with high accuracy. The high performance is achieved despite several stated complications.
The most common complications with the labels and the corresponding images are the structure of the personal names on the images relative to the labels, confusion of certain characters, and general typos. We emphasize that the labels are not perfect, and we find that this is especially true for harder-to-read cases where certain characters are "open to interpretation".
As a robustness check, one could also test the models using phonetically spelled versions of the names, e.g., Christian versus Kristian. We choose not to do this in our benchmark models, as there exist labels that are very similar but have different meanings. Therefore, by allowing for small discrepancies in the names, one could easily create mislabeled training data across very similar names. We realize that it could to some extent mitigate the complication from the harder-to-read cases where the transcribers possibly made mistakes, but we leave this as an open question for future work.

V Conclusion
This paper introduces the HANA database, which is the largest publicly available database of handwritten personal names. The large-scale HANA database is based on Danish police register sheets, which have been made freely available by Copenhagen Archives. The final processed database contains a total of 3,355,033 names distributed across 1,105,904 images. Benchmark results for transcription-based deep learning models are provided for the database on the last name, first and last name, and full name.
Our goal is to create and promote a more challenging database that in many ways is more comparable to other historical documents. First, historical documents are often tabulated and can therefore be cropped into single-line fragments, which should make it easier to train HTR models and to produce transcriptions more efficiently. Second, the naturalism of the police register sheets is in our opinion quite comparable to that of many widely used historical documents, such as census lists, parish registers, and funeral records. This makes performance on these documents more representative of the performance that would be obtained in custom applications. To validate this point, we showed examples of models transfer learning from the HANA database to handwritten names in Danish and US censuses. We find that transfer learning increases the word accuracy from 77.8% to 92.2% (86.1% to 94.6%) for the Danish census and from 72.8% to 78.7% (84.7% to 86.8%) for the US census when 10,000 (50,000) training examples are available.
We want to highlight two important features of our database. First, despite the challenges associated with labelling errors and unstructured images, the size of the database appears to compensate, making high-performance models for automatically transcribing handwritten names possible. Second, and related to the prior point, the commonness of names is far from evenly distributed, resulting in a highly unbalanced sample of the represented names, with 65,020 of the 105,607 unique names appearing only once; despite this, the models still generalize well. We view this as very encouraging, suggesting that high-performance automatic transcription is possible even in difficult and realistic scenarios.
We have performed image-processing procedures to make the database useful for training single-line learning systems. Further, the code for replicating our results and transfer learning from our models is made freely available. We strongly encourage other researchers to use the HANA database and to improve on our procedures in order to continuously increase the size and quality of the database. Ultimately, we believe this can help make automatic transcriptions of personal names and other handwritten entities much more precise and cost-efficient, in addition to making the transcriptions fully end-to-end reproducible. By improving existing linking methods through lower transcription error rates, this could further incentivize the construction and use of reliable long-run historical databases spanning multiple generations.

A Additional Transfer Learning Illustration
As an additional illustration, we transfer learn from our last name model to the Handwriting Recognition database available from Kaggle, which contains roughly 410,000 images. To do so, we make a few changes to fit the data into our current framework. First, we split names containing hyphens and keep only the last part of each name. This alters 806 labels in the test set, potentially biasing our results upward. In addition, we remove empty name labels and names containing special characters, which reduces the test set from 41,370 to 41,264 images. Finally, we remove three images from the training set and one image from the validation set because the corresponding names exceed 18 characters. In total, 329,982 training images, 41,252 validation images, and 41,264 test images remain after these corrections.
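The cleaning rules above can be sketched as a single label filter. This is a minimal illustration, not our released preprocessing code; in particular, the use of `str.isalpha` as the "special character" test is an assumption, since the paper only states that empty labels and labels with special characters are removed.

```python
def clean_label(label: str, max_len: int = 18):
    """Apply the cleaning rules described above to one Kaggle HTR label.

    Returns the cleaned last name, or None if the label should be dropped.
    The isalpha() whitelist is an illustrative assumption.
    """
    label = label.strip()
    if "-" in label:
        # Keep only the part after the last hyphen.
        label = label.split("-")[-1]
    if not label or not label.isalpha():
        # Drop empty labels and labels with special characters.
        return None
    if len(label) > max_len:
        # Drop names longer than 18 characters.
        return None
    return label
```

For example, `clean_label("Smith-Jones")` keeps only `"Jones"`, while an empty string or a label containing an apostrophe is dropped.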
The purpose of this exercise is to show the performance gain from transfer learning in a setting with a very large training sample consisting of hundreds of thousands of observations. We create two training sets: a "small" set, where only the validation images are used for training (41,252 images), and a "large" set, where both the training and validation images are used for training (371,234 images). In both cases, we use the test images to evaluate our models. We train two models for each set: one where we transfer learn from the HANA database last name model and one where we do not use transfer learning.
We proceed similarly to the transfer learning examples discussed in Section III. The only differences are that (1) the image size is smaller, here around 388 by 40 pixels, which is the resolution we train at, and (2) for computational reasons, we only search for the learning rate for the two models trained on the small set and then reuse the learning rates found there for the two models trained on the large set.
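The learning-rate search reused across the two sets can be sketched generically as follows. The `train_fn` and `eval_fn` callables are placeholders for the actual training and validation pipeline, which is not shown here.

```python
def search_learning_rate(candidates, train_fn, eval_fn):
    """Pick the learning rate with the best validation score.

    `train_fn(lr)` trains a model at that learning rate and returns it;
    `eval_fn(model)` returns a validation score (higher is better).
    Both callables are hypothetical stand-ins for the real pipeline.
    """
    best_lr, best_score = None, float("-inf")
    for lr in candidates:
        score = eval_fn(train_fn(lr))
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```

The learning rate found on the small set is then simply passed unchanged to the training runs on the large set, avoiding a second, far more expensive search.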
As expected, the performance increase from transfer learning is largest for the small training set, where accuracy increases from 81.33% to 83.24%. For the large training set, accuracy increases from 87.48% to 87.83%. While these increases are smaller than in our other transfer learning exercises, it is important to note that the training sets are much larger: the "small" training set used in this example is almost as large as the large training set used in our other examples, and the "large" training set used in this exercise contains 371,234 samples, a scale often not realistically obtainable.
We also study the performance of these models at different levels of data coverage, shown in Figure VII. Much in line with our earlier results, the models trained on the small set perform worse than those trained on the large set, but in both cases the model transfer learning from the HANA database outperforms the model not utilizing it. The figure shows that the performance gain from adopting a transfer learning strategy based on the HANA database is larger for smaller training sets, but still present even when nearly 400,000 observations are available for training.

B Further Characteristics of the Database
In this appendix, we present additional characteristics of the HANA database. Panel VIIIb shows the length of names per image, where a name is defined as each word in a full name, i.e., either a first, middle, or last name. The longest name consists of 18 characters, with most names being between 3 and 10 characters long. Panel VIIIc shows the distribution of characters in the names. The most frequent character is e, which appears approximately 3.4 million times, while both q and å appear fewer than 5,000 times. All panels aggregate across all individuals present in either the train or test data.
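A character distribution of this kind can be computed directly with a counter over the name strings. This is a minimal sketch on a hypothetical three-name sample; the database statistics above aggregate over millions of names.

```python
from collections import Counter

def character_distribution(names):
    """Count character frequencies across a list of names (one word each)."""
    counts = Counter()
    for name in names:
        counts.update(name.lower())
    return counts

# Hypothetical mini-sample of Danish surnames for illustration.
dist = character_distribution(["Jensen", "Hansen", "Nielsen"])
```

On this toy sample, `n` is the most frequent character, mirroring how `e` dominates in the full database.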

Figure I: Example of a Police Register Sheet
Figure II: Examples from the HANA Database
Figure IV: Performance on the HANA Database: First and Last Name

Figure V: Examples from the Danish and US Censuses. (a) Danish Census (b) US Census
Figure VI: Transfer Learning Performance on Danish and US Census. (a) Danish Census (b) US Census
Figure VII: Transfer Learning Performance on Kaggle HTR

Figure VIII shows the distribution of individual names for each full name, the length of the names, and the distribution of characters. Panel VIIIa shows the number of names per image. While most individuals have just a first and last name, potentially in combination with a single middle name, a significant number have more than one middle name. Panel VIIIb shows the length of each name, interpreted as the length of either the first, middle, or last name. The longest single name in our database is 18 characters. Note that this does not represent the full name sequence length, which could potentially be 10 × 18 characters long. The final distribution plot, Panel VIIIc, shows the character distribution aggregated across all names.
Figure VIII: Further Database Characteristics. (a) Distribution of Names Per Image (b) Distribution of Length of Names (c) Distribution of Characters. Panel VIIIa shows the distribution of names per image file; the majority of images contain two to four names, and the longest full name consists of 10 separate names.

Table I: WACC on the HANA Database
The table shows the test performance of the HTR models as measured by word accuracy (WACC). The data coverage is defined as the fraction of the test database the model is tested on (keeping predictions where the network is most confident). For the models with 90% data coverage, we remove the 10% of the test sample where the model is most uncertain. All models are trained on the full train database, allowing the networks to learn primitives and characters from uncommon names.
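The coverage-restricted accuracy in the table can be sketched as follows. The `(confidence, is_correct)` pair format is an assumption for illustration; in practice the confidence would come from the network's output scores.

```python
def wacc_at_coverage(predictions, coverage):
    """Word accuracy on the most confident `coverage` fraction of the test set.

    `predictions` is a list of (confidence, is_correct) pairs; this format
    is a hypothetical simplification of the model's actual outputs.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    # Keep only the requested fraction, discarding the most uncertain ones.
    kept = ranked[: int(len(ranked) * coverage)]
    return sum(correct for _, correct in kept) / len(kept)
```

At 90% coverage, the 10% of test samples with the lowest confidence are discarded before computing WACC, which is why accuracy at reduced coverage is higher whenever the model's confidence correlates with correctness.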