How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources

The National Library of Finland has digitized the historical newspapers published in Finland between 1771 and 1910. This collection contains approximately 1.95 million pages in Finnish and Swedish. The Finnish part of the collection consists of about 2.40 billion words. The National Library's Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of the newspaper material (from 1771 to 1874) is also freely downloadable from The Language Bank of Finland provided by the FINCLARIN consortium. The collection can also be accessed through the Korp environment, which has been developed by Språkbanken at the University of Gothenburg and extended by the FINCLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield style information retrieval test collection has also been produced out of a small part of the Digi newspaper material at the University of Tampere. The quality of OCRed collections is an important topic in digital humanities, as it affects the general usability and searchability of collections. There is no single available method to assess the quality of large collections, but different methods can be used to approximate quality. This paper discusses different corpus analysis style methods to approximate the overall lexical quality of the Finnish part of the Digi collection. The methods include the use of parallel samples and word error rates, the use of morphological analyzers, frequency analysis of words and comparisons to comparable edited lexical data. Our aim in the quality analysis is twofold: firstly, to analyze the present state of the lexical data, and secondly, to establish a set of assessment methods that build up a compact procedure for quality assessment after e.g. new OCRing or post-correction of the material. In the discussion part of the paper we synthesize the results of our different analyses.


Introduction
Newspapers of the 19th and early 20th century were mostly printed in the Gothic (Fraktur, blackletter) typeface in Europe. It is well known that the typeface is difficult to recognize for OCR software (Holley 2008; Furrer and Volk 2011; Volk et al. 2011). Other aspects that affect the quality of the OCR recognition are the following, among others (cf. Holley 2008; Klijn 2008, for a more detailed list):
- quality of the original source and microfilm
- scanning resolution and file format
- layout of the page
- OCR engine training
- unknown fonts
- etc.
As a result of these difficulties, scanned and OCRed document collections have a varying amount of errors in their content. A quite typical figure is that of the 19th Century Newspaper Project of the British Library (Tanner et al. 2009): they report that 78 % of the words in the collection are correct.
This kind of quality is not very good, but quite realistic. The amount of errors depends heavily on the period and printing form of the original data. Older newspapers and magazines are more difficult for OCR; newspapers from the early 20th century are easier (cf. for example the data of Niklas 2010, which consists of a 200 year period of The Times of London from 1785 to 1985). There is no clear measure of the amount of errors that makes OCRed material useful or less useful for some purpose, and the use purposes and research tasks of the users of digitized material vary hugely (Traub et al. 2015). A linguist who is interested in the forms of words needs data that is as error-free as possible; a historian who interprets texts on a more general level may be satisfied with text data that has more errors. In any case, the quality of the OCRed word data is of crucial importance.
Digital collections may be small, medium sized or large, and different methods of quality assessment are useful or practical for different sizes of collections. Smallish and perhaps even medium sized collections may be assessed and corrected intellectually, by human inspection (cf. Strange et al. 2014). When the size of the collection increases, comprehensive human inspection becomes impossible, and human inspection can only be used to assess samples of the collection. In our case, the size of the collection makes comprehensive human inspection impossible: almost 2 million newspaper pages of varying quality cannot be assessed by human labour.
Thus quality assessment of OCRed collections is most of the time sample-based, as in the case of the British Library's 19th Century Newspapers Database (Tanner et al. 2009) [3]. A representative part of the collection is assessed e.g. by using a parallel digital clean collection, when such is available or can be produced cost effectively. Word and character level comparisons can then be made and error rates of the OCRed collections can be reported and compared. Holley (2008) gives the following, mainly practical, OCR accuracy figures and quality estimations that are based on discussions with OCR contractors and academic librarians:
- Good OCR accuracy: 98-99% accurate (1-2% of OCR incorrect)
- Average OCR accuracy: 90-98% accurate (2-10% of OCR incorrect)
- Poor OCR accuracy: below 90% accurate (more than 10% of OCR incorrect) [4]
[3] "To discover the actual OCR accuracy of the newspaper digitization program at the BL we sampled a significant proportion (roughly 1%) of the total 2 million plus pages..." This kind of approach, where clean parallel data for the OCRed sample is produced in house or by a contractor, is beyond our means.
[4] Unfortunately it is not clear whether accuracy here is on character or word level, but for the sake of discussion we'll suggest that the figures are word level accuracy figures, as even high character level accuracy can mean quite low word level accuracies (Tanner et al. 2009).
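The difference between character level and word level accuracy can be illustrated with a small sketch. It is not the evaluation code used for the Digi samples: the function names, the edit-distance based definition of accuracy and the example strings are all assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref: Sequence, hyp: Sequence) -> float:
    """Accuracy as 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# Invented example: a clean reference line and an OCR output
# that contains two single-character errors.
reference = "tämä on hyvä sanomalehti"
ocr_output = "tämä on byvä sanomalehtl"

char_acc = accuracy(reference, ocr_output)
word_acc = accuracy(reference.split(), ocr_output.split())
print(f"character accuracy: {char_acc:.3f}")
print(f"word accuracy:      {word_acc:.3f}")
```

In this invented example two wrong characters out of 24 damage two of the four words, so a character accuracy above 90 % coincides with a word accuracy of only 50 %.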
Another, fully automatic possibility to assess the quality of a collection is the use of digital dictionaries. Niklas (2010), for example, uses dictionary look-up to check the overall word level quality of The Times of London collection from 1785 to 1985 in his OCR post-correction work. This kind of approach gives a word accuracy approximation for the data (Strange et al. 2014).
Unfortunately the use of dictionaries suits only languages like English that have little inflection, so that the words in texts can be found as dictionary entries. A heavily inflected, morphologically complex language like Finnish needs other means: full morphological analysis of the material is needed for this type of language. We shall discuss this approach with our material later on.
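Why plain dictionary look-up falls short for Finnish can be shown with a toy sketch. The lexicon and the token list below are invented for illustration; the inflected forms are correct Finnish, yet they are not dictionary entries.

```python
# Hypothetical toy lexicon containing only dictionary entries (base forms).
LEXICON = {"talo", "kirja", "sanomalehti"}


def dictionary_lookup_rate(tokens, lexicon):
    """Share of tokens that are found verbatim in the lexicon."""
    if not tokens:
        return 0.0
    found = sum(1 for token in tokens if token.lower() in lexicon)
    return found / len(tokens)


# All five tokens are correctly spelled Finnish word forms:
# talossa = 'in the house', taloissa = 'in the houses', kirjan = 'of the book'.
tokens = ["talo", "talossa", "taloissa", "kirjan", "sanomalehti"]

rate = dictionary_lookup_rate(tokens, LEXICON)
print(f"found in dictionary: {rate:.0%}")
```

Only the two base forms are found, so look-up alone would report three error-free words as unknown. A morphological analyzer avoids this by relating inflected forms to their base forms.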
Some other methods could also be used. Baeza-Yates and Rello (2012) suggest a simple spelling error based look-up method to evaluate lexical quality of web content, based on the original idea of Gelman and Barletta (2008). We believe that this method might also be useful in analysis of our data, but we are not able to discuss its possible use at present. Ringlstetter 2013) show also promise, but their use at the present is beyond our means.

Quality assessment of the Digi
There has been ongoing work on the assessment of the quality of the Digi since 2014. Part of this work has been described in Kettunen et al. (2014) and Kettunen (2015). These publications describe mainly the first post-correction trials of the Finnish newspaper material. For that effort we set up, semi-automatically, seven smallish parallel corpora (ca. 212 000 words) on which post-correction trials were done. Results of the evaluation showed that the quality of the evaluation data varied from about 60 % word accuracy at worst to about 90 % accuracy at best, the mean being about 75 % word accuracy. The evaluation samples, however, were small, and on the basis of the parallel corpora it is hard to estimate what the overall quality of the data is. The scarce availability of edited 19th century parallel newspaper material also makes this approach hard to continue any further (Lauerma, 2012), and there are no resources to set up larger parallel data for evaluation purposes by ourselves. Thus another type of approach is needed.
Since the first trials we have done further work on the lexical level with our data. In winter 2015 we extracted the database of the Digi collection and collected the words from the page texts of the dump. Punctuation of the text was discarded in the dump, but the distinction between lower and upper case letters was kept in the resulting word lists.
We got two different word lists: the first and smaller one consists of all the Finnish newspaper and magazine word material up to year 1850. It has about 22 million word form tokens, which is less than 1 % of the whole data. The second and more interesting word list consists of the Finnish words in the newspapers during the period 1851-1910 and it contains about 2.39 billion word form tokens.
As the main volume of the lexical data of the collection is in the 1851-1910 section of the corpus, we shall concentrate mainly on the analysis of this part of the corpus, but will show also some results of the time period of 1771-1850.
As far as we know, there is no single method or IT system available that could be used for analyzing the quality of word data in a very large historical newspaper collection. Thus we ended up using a few simple ways to approximate the quality of our data. Firstly, we analyzed all the words of the index with two modern Finnish morphological analyzers, the commercial FINTWOL and the open source Omorfi. As there is no fully developed morphological analyzer of historical Finnish available, this is the only possible way to do morphological analysis for the data. A typical morphological analyzer consists of a rule component and a lexicon (Pirinen, 2015). If the analyzer can relate an input word, after application of rules, to a base form or forms in its lexicon, it has successfully recognized/analyzed the word. We ran our data through the analyzers and counted how many of the words were recognized or unrecognized by the analyzers. Obviously the unrecognized words contain both OCR errors and words that are historically or dialectally unknown to the modern Finnish analyzers (out-of-vocabulary, OOV, which also includes words in foreign languages). Nor does a positive recognition guarantee that the word is what it was in the original text. However, when the figures of our analyzed data are compared to analyses of existing edited dictionary and other data of the same period, we can approximate what amount of our data could be OCR errors.
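The counting step described above can be sketched as follows. The sketch does not call FINTWOL or Omorfi: `toy_recognize` and its tiny set of accepted forms are hypothetical stand-ins for a real analyzer, and any function that maps a word form to recognized/unrecognized could be passed in its place.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def recognition_rates(tokens: Iterable[str],
                      recognize: Callable[[str], bool]) -> Dict[str, float]:
    """Return token level and type level recognition rates (0..1)."""
    freqs = Counter(tokens)                       # word type -> frequency
    verdict = {w: recognize(w) for w in freqs}    # analyze each type only once
    n_tokens = sum(freqs.values())
    n_types = len(freqs)
    rec_tokens = sum(n for w, n in freqs.items() if verdict[w])
    rec_types = sum(1 for ok in verdict.values() if ok)
    return {
        "token_rate": rec_tokens / n_tokens if n_tokens else 0.0,
        "type_rate": rec_types / n_types if n_types else 0.0,
    }


# Hypothetical stand-in for a morphological analyzer.
KNOWN_FORMS = {"ja", "talossa", "kirjan"}


def toy_recognize(word: str) -> bool:
    return word.lower() in KNOWN_FORMS


# Invented token stream with two OCR-like errors (lalossa, sanomalebti).
tokens = ["ja", "ja", "ja", "talossa", "talossa", "lalossa", "sanomalebti"]
rates = recognition_rates(tokens, toy_recognize)
print(rates)
```

Because frequent words dominate the token count while each error form is often a type of its own, the token rate in this toy data (5 of 7) is clearly higher than the type rate (2 of 4).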
Secondly, we made frequency calculations of the word data and took different samples out of that data for further analysis with the morphological analyzers. These analyses show a more detailed picture of the data. Table 1 shows the initial recognition rates of all the word tokens and types in the Digi with the two morphological analyzers. In Table 2 we show results with a more history-aware version of Omorfi (we call this HisOmorfi) and a later version of Omorfi, v. 0.2. Omorfi 0.2 does not recognize words much better than version 0.1, but HisOmorfi improves recognition by 3 % units on the main part of the data. HisOmorfi improves recognition for every type of data, although for word types the improvement is small.
At this stage we also need comparable recognition rates for edited lexical data of the same period.
For comparison purposes we used material from the Institute for the Languages of Finland. From their web pages we collected two different word corpora from two different historical periods of Finnish and four different dictionaries from the 19th century. Figures of this data are shown in Table 3. Sizes of dictionaries refer to unique dictionary entries extracted from the data, not to all of the words in the material. Unless otherwise mentioned, the data consists of word types.
As can be seen from Tables 1-3 and Figure 2, type level recognition rates of the Digi data are very low compared to edited comparable material of the same period. The main reason for this is the high number of words that occur only once (hapax legomena), most of which are OCR errors, as will be shown later (cf. also Ringlstetter et al. 2006). Interestingly, there is no big variation in the recognition rates of the earlier and the late 19th century, although one would expect older data to contain more old vocabulary that is not recognized.
One reason for the quite good recognition of older data may be the simpler column structures and larger fonts in older publications, which could have decreased OCR errors. Towards the end of the 19th century the number of columns in newspapers increased and fonts also got smaller. Even if Finnish of the late 19th century as such should be easier to recognize for morphological analyzers (cf. also On the basis of the edited data analyses we can approximate that 56-76 % of the words on type level from the 19th century data can be recognized by modern language morphological analyzers. On token level the recognition rate can be up to 89 %. If there is older material in the data, recognition will drop, and the drop can be quite large.
Next we proceed to frequency analysis of parts of the data. In order to confirm occurrence of OCR errors in the least frequent word type classes we analyzed all word types that occur 1-10 times in the data.
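An analysis by frequency class can be sketched along the following lines. The recognizer and the frequency list are invented stand-ins, not the Digi data or the analyzers used in the study.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def recognition_by_frequency(freqs: Counter,
                             recognize: Callable[[str], bool],
                             max_freq: int = 10) -> Dict[int, Tuple[int, float]]:
    """For each frequency class 1..max_freq that occurs in the data, return
    (number of word types, share of those types that are recognized)."""
    totals = Counter()
    recognized = Counter()
    for word, n in freqs.items():
        if 1 <= n <= max_freq:
            totals[n] += 1
            if recognize(word):
                recognized[n] += 1
    return {f: (totals[f], recognized[f] / totals[f]) for f in sorted(totals)}


# Hypothetical stand-in for a morphological analyzer.
KNOWN_FORMS = {"ja", "talossa", "kirjan", "talo"}


def toy_recognize(word: str) -> bool:
    return word.lower() in KNOWN_FORMS


# Invented frequency list: word type -> number of occurrences.
freqs = Counter({"ja": 50, "talossa": 2, "kirjan": 2,
                 "lalossa": 1, "sanomalebti": 1, "kiija": 1, "talo": 1})

result = recognition_by_frequency(freqs, toy_recognize)
for freq, (n_types, rate) in result.items():
    print(f"frequency {freq}: {n_types} types, {rate:.0%} recognized")
```

In the toy data the once-occurring types are mostly unrecognized, which is the pattern the analysis of the rarest frequency classes is meant to detect.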

Other considerations
Orthography of Finnish was already reasonably stable in the mid-19th century, although there were phenomena that differ from modern language (cf.
- Samoinkuin +?: not recognized because written as a compound, correct otherwise
- ylöskannetaan +?: not recognized because written as a compound, correct otherwise
The amount and effect of these kinds of phenomena are hard to estimate, but it is clear that all these phenomena cause uncertainty in the results and make an estimation of error margins in the analysis hard to establish.
We have now reached a reasonably comprehensive result out of the quality assessment of our data.
We have three different parameters that affect the results: the amount of OOV vocabulary, the number of OCR errors proper and the effect of w/v variation in the data. The effect of the OOV factor in the clean VNS_Kotus data is on token level about 14 % and in the VKS_Kotus about 50 %. Their mean is 32 %, but a fair approximation could be 14-20 % in edited material of the latter part of the 19th century word data. We believe that in the Digi data OCR errors tend to vastly outnumber the OOV words. The variation of w/v has an effect of about 12 % among the unrecognized words.
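One way to estimate the effect of w/v variation is to replace w with v in the unrecognized words and to run them through the analyzer again. The sketch below assumes this replacement strategy; the text above does not spell out how the 12 % figure was obtained, and the recognizer and word list are invented.

```python
from typing import Callable, Iterable, List


def rescued_by_w_to_v(unrecognized: Iterable[str],
                      recognize: Callable[[str], bool]) -> List[str]:
    """Return the unrecognized words that become recognizable
    when w is replaced with v (assumed normalization)."""
    rescued = []
    for word in unrecognized:
        if "w" not in word.lower():
            continue
        variant = word.replace("w", "v").replace("W", "V")
        if recognize(variant):
            rescued.append(word)
    return rescued


# Hypothetical stand-in for a modern Finnish analyzer.
MODERN_FORMS = {"vanha", "vaan"}


def toy_recognize(word: str) -> bool:
    return word.lower() in MODERN_FORMS


# Invented list of unrecognized words: two old spellings and two OCR-like errors.
unrecognized = ["wanha", "Waan", "lalossa", "sanomalebti"]

rescued = rescued_by_w_to_v(unrecognized, toy_recognize)
share = len(rescued) / len(unrecognized)
print(rescued, f"{share:.0%}")
```

Words rescued this way are spelling variants rather than OCR errors, so their share can be subtracted from the estimate of erroneous words.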
The initial analysis without considering the w/v variation and effect of OOV's is shown concisely in Figure 5. We have 1.65 G of recognized words and 733 M of unrecognized words (cf. Table 1.).
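The recognition rate implied by these two counts can be checked with a few lines of arithmetic. The counts are the rounded figures given above, so the result is approximate.

```python
# Rounded token counts from the text above.
recognized = 1.65e9      # 1.65 G recognized words
unrecognized = 733e6     # 733 M unrecognized words

total = recognized + unrecognized
raw_rate = recognized / total

print(f"total tokens: {total / 1e9:.2f} G")
print(f"recognized:   {raw_rate:.1%}")
print(f"unrecognized: {1 - raw_rate:.1%}")
```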
The 69 % recognition rate can be called the raw recognition rate of the data.
- About 300 M could be easier OCR errors and perhaps also OOVs (Figure 6)
- unrecognized words tend to be longer than recognized words (Figs. 3 and 4)
The lexical quality approximation process we have set up is relatively straightforward and does not need complicated tools. It is based on frequency calculations and the use of off-the-shelf modern Finnish morphological analyzers. It can be fully automated, even if we have done it partly step by step, semi-automatically. It is apparent that we need to be cautious in our conclusions, as the different data are of different sizes, which may cause errors in estimations (Baayen 2001; Kilgariff 2001).
However, we believe that our analyses have shed considerable light on the quality of the Digi collection.
At this stage we can also reflect on the usefulness of the analysis procedure from the point of view of improving the OCR quality of the Digi collection. The main message of our analysis is that the collection has a relatively good quality part, about 69-75 %, and a very bad quality part, about 9-12 %. The set of about 13 % of the words that are not recognized is harder to estimate. As a part of them belongs to the most frequent part of the data, they could be at least partly easier OCR errors and OOVs. All in all, about a 25-30 % share of the collection needs further processing for the overall quality of the data to improve.
If correction of the data is performed, it should be focused on the 24-25 % unrecognized part of the data. Out of this, the ca. 300 M possibly easier part could be improved by post-correction of the material with algorithmic correction software. We have tried post-correction with a sample (Kettunen 2015), but the results were not good enough for realistic post-correction. If post-correction were focused only on the easier part of the Digi's erroneous data, it could work quite well. General experience from algorithmic post-correction of OCR errors shows that good quality word material can be corrected relatively well (e.g. Niklas 2010; Reynaert 2008). This may also apply to the medium quality word data. But the worst 9-12 % part of the Digi data cannot be corrected with post-processing; only re-OCRing could help with it, as there is so much of it.
Supposing that some action has been taken to improve the quality of the Digi data, we have to consider whether our procedure would be useful in showing the quality improvement, if such has been achieved.
We suggest that improvement of the lexical quality could be shown e.g. with the following analyses:
- clear improvement in the overall recognition rate of the data: at least 3-5 % units in both type and token level analyses
- recognition rate in the top 1 M of the most frequent word types should improve significantly, especially in the 100 K-1 M range, that is now beyond mean recognition rate of edited data
- a very large drop in the number of unrecognized hapax legomena and other rare word types; in practice this would mean tens of millions of word forms becoming recognizable
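A before/after comparison along these lines could be sketched as follows. The class, the field names and all the numbers are invented for illustration; only the threshold of 3 % units comes from the list above, and the criterion on the most frequent word types is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class QualityStats:
    token_rate: float          # token level recognition rate, in percent
    type_rate: float           # type level recognition rate, in percent
    unrecognized_hapaxes: int  # number of unrecognized once-occurring types


def improvement_report(before: QualityStats, after: QualityStats,
                       min_gain: float = 3.0) -> dict:
    """Compare two measurement rounds against the suggested criteria."""
    token_gain = after.token_rate - before.token_rate
    type_gain = after.type_rate - before.type_rate
    return {
        "token_gain": token_gain,
        "type_gain": type_gain,
        # at least min_gain % units on both token and type level
        "rates_improved": token_gain >= min_gain and type_gain >= min_gain,
        "hapax_drop": before.unrecognized_hapaxes - after.unrecognized_hapaxes,
    }


# Invented figures, not measurements from the Digi collection.
before = QualityStats(token_rate=69.0, type_rate=10.0, unrecognized_hapaxes=1000)
after = QualityStats(token_rate=72.5, type_rate=14.0, unrecognized_hapaxes=600)

report = improvement_report(before, after)
print(report)
```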

Conclusion
In this paper we have suggested how to assess the overall lexical quality of a mainly 19th century OCRed Finnish historical newspaper collection with circa 2.40 billion words. The procedure uses elementary corpus statistics and morphological analyzers of modern Finnish and is straightforward to use. We also propose how to measure quality improvement after correcting the corpus using the suggested procedure.
Advantages of the procedure are the following:
- coverage: the procedure gives an approximation of the quality of the whole corpus, and at the same time different parts of the whole can be analyzed; thus it is not based on samples only
- period sensitivity: comparable edited lexical corpora of the same period are used in the assessment, and thus the procedure is reasonably sensitive to time variation in the data; the method works reasonably well for the so-called period of early modern Finnish (ca. 1820-1870) and the beginning of modern Finnish (from ca. 1870 onwards), but would be more vulnerable with earlier material, as the lexical coverage of the morphological recognizers would be lower
- simplicity: available modern language technology tools and basic corpus statistics are used, and no high-level tools need to be developed
The main vulnerability of the proposed procedure at present is possible sampling error and its effects with the corpora used. This, however, can be taken into account by adding more advanced statistics to the procedure. They may sharpen the procedure, but at present we are satisfied with the current approach and believe that the measures the procedure produces are useful in quality assessment and in quality control after improvements in the word data.
The major reason for lexical quality assessment of our data is the fact that OCR errors in the data may have several harmful effects for users of the data. One of the most important effects of poor OCR quality, besides worse readability and comprehensibility, is worse on-line searchability of the documents in the collections (Taghva et al. 1996). In a recent study Savoy and Naji (2011), for example, showed how retrieval performance decreases quite severely with documents corrupted by OCR errors. With mean reciprocal rank as a performance measure, they showed that degradation in retrieval effectiveness is around 17% when dealing with an error rate of 5%. When the error rate is increased to 20%, the average decrease in retrieval effectiveness is around 46%. The same or a larger level of decrease in retrieval effectiveness is also shown in the results of the TREC-5 confusion track (Kantor and Voorhees 2000). The effect of errors is not clear cut, however. Tanner et al. (2009) suggest that word accuracy rates of less than 80 % are harmful for search, but when the word accuracy is over 80 %, the fuzzy search capabilities of search engines should manage the problems caused by word errors. Mittendorf and Schäuble's (2000) probabilistic model for data corruption seems to support this.
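Mean reciprocal rank, the measure mentioned above, is the mean over queries of 1/rank of the first relevant document, with 0 for queries where no relevant document is retrieved. A minimal sketch with invented rankings follows; it is not the evaluation setup of Savoy and Naji (2011).

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Mean of 1/rank of the first relevant document per query (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked_docs in rankings.items():
        wanted = relevant.get(query, set())
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in wanted:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Invented relevance judgements and result lists for two queries.
relevant = {"q1": {"d1"}, "q2": {"d7"}}
clean_run = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6", "d7"]}
# With OCR errors the relevant documents drop in rank or are not found at all.
corrupted_run = {"q1": ["d2", "d1", "d3"], "q2": ["d4", "d5", "d6"]}

clean_mrr = mean_reciprocal_rank(clean_run, relevant)
corrupted_mrr = mean_reciprocal_rank(corrupted_run, relevant)
print(clean_mrr, corrupted_mrr)
```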
Information retrieval is robust even with corrupted data, but IR works best with longer documents and long queries. Empirical results of Järvelin et al. (2015) with the Finnish historical newspaper search collection, for example, show that even impractically heavy use of fuzzy matching will help only to a limited degree in searching a low quality OCRed newspaper collection when short queries and their query expansions are used. Evershed and Fitch (2014), on the other hand, show that if OCR word errors are corrected and the word error rate is decreased by about 10 % units, recall in document retrieval may get a boost of about 9-10 % units with historical OCRed English documents.
Users of the Digi collection have complained relatively little about the poor OCR of the collection, but some of them have reported curious search results and been annoyed by the OCR quality (Hölttä, 2016; Kettunen, Pääkkönen, Koistinen, 2016). Based on the empirical search results with the evaluation collection derived from a small subset of the whole Digi material (Järvelin et al., 2016), it is evident that search results in the Digi collection itself are not optimal, and better OCR quality would probably improve them.
Besides retrieval performance effects poor OCR quality has an effect on ranking of the documents (Taghva et al. 1996;Mittendorf and Schäuble 2000). In practice these kinds of drops in retrieval and ranking performance mean that the user will lose relevant documents: either they are not found at all by the search engine or the documents are so low in the ranking list that the user may skip them.
Some examples of this in the work of digital humanities scholars are discussed e.g. in Traub et al. (2015).
Weaker searchability of the OCRed collection is only one dimension of poor OCR quality. Other effects of poor OCR quality may show in the more detailed processing of the documents, such as sentence boundary detection, tokenization and part-of-speech tagging, which are important in higher-level natural language processing tasks (Lopresti 2009). Part of the problems may be local, but part will accumulate through the whole NLP pipeline, causing errors. Thus the quality of the OCRed texts is the cornerstone for any kind of further use of the material, and we need to be able to assess the quality of the data in order to be able to improve it and to show the possible improvements meaningfully.