Estimating the Amount of Lithuanian Text Indexed by Global Search Engines

The aim of the paper is to estimate the number of words in Lithuanian texts indexed by selected global search engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. The low frequency of pivot words is crucial if the count of document matches reported by GSE is to be treated as an indicator of the word count. Statistical analysis has shown the following amounts of Lithuanian words as of April 2022: 56 billion words by Google, 29 billion words by Bing and 41 billion words by Yandex. Comparative results for the neighbouring Belarusian (∼0.31 × LT), Estonian (∼1.45 × LT), Finnish (∼2.4 × LT), Latvian (∼0.95 × LT), Polish (∼11 × LT), and Russian (∼49 × LT) languages have also been assessed.


Introduction
Global search engines (GSE) have become everyday tools that open a window to the vast realm of information on the web. The usage of GSE has become so common that some people even confuse them with the internet itself.
The information on the web is highly multimodal and multilingual consisting of textual, audio, and visual material. However, when we are looking for information with the help of GSE, we can only discover and access that part of information which has been previously indexed by GSE and presented to a user. It must also be said that GSE index only a tiny part of all information that is accessible on the web.
All different modalities for a particular language constitute the digital presence of that language on the web. A larger or smaller digital presence of a language may signify its vitality and importance in the global community. In addition, it can indirectly speak about the language community's economic development or even the level of adaptability to the modern world. The digital presence may also have a geopolitical importance, as it may have an impact on decision making processes of human societies.
There is no easy way to assess the size of the digital presence by using data of commercial search engines, as the main purpose of commercial tools is to generate profit and not to reflect the objective picture of the digital world. Many researchers warned about potential pitfalls when analysing commercial search engines (see e.g. Kilgarriff (2007); van den Bosch et al. (2016)). It needs to be acknowledged that the current research focuses only on that part of digital presence, which is represented as text, while recognizing that there exists a huge realm of audio-visual information, which cannot be assessed by our methods.
Presently, there are four major commercial global search engines with their own indices: Google, Bing, Yandex, and Baidu. In this paper, however, we will focus on the first three, as our test queries have shown that the Chinese Baidu segments Lithuanian words inappropriately, which adversely affects the results.
Before presenting the research, we deem it necessary to define the key terms "word" and "token" as used in our analysis, since their definitions and treatment vary. In this paper we use the following definitions:
- token: the smallest unit in a text corpus. A token normally refers to a word form, a punctuation mark, a digit, an abbreviation, a product name, or anything else between spaces;
- word: any token that is not a punctuation mark.
Such a definition of a word is also used by GSE, but not by corpus linguists. Corpus linguists tend to define a word as a token which begins with a letter of the alphabet and consists solely of letters, thus ignoring numbers or any mixed alphanumeric constructions. For this reason, the size in words of the same corpus may differ depending on the method of calculation.
It should be noted that we do not seek to estimate (in terabytes) the total size of indices operated by GSE nor to determine the total number of indexed URLs/documents, rather we seek to estimate the total amount of the text (in words) indexed by GSE. Also we do not consider the cleanness (deduplication) of corpora or GSE indexed texts as a factor to be accounted for. We can only speculate about the deduplication policy used by a particular GSE or corpus creators, so we seek to estimate the whole amount of text regardless of duplication.

Related research
The field of research that focuses on assessing the quantity and quality of information on the web is called webometrics (Björneborn and Ingwersen, 2004). The first research papers on webometrics were published some thirty years ago (Almind and Ingwersen, 1997) and since then many aspects of the web have been analysed, for instance, the index sizes of different search engines and different domains (e.g. Bharat and Broder (1998)), the link structure of the web (e.g. Hirate et al. (2006)), bias of search results (e.g. Gezici et al. (2021)), evaluation of ranking algorithms (e.g. Canca (2022)) and others. Two research papers, namely Kilgarriff (2007) and van den Bosch et al. (2016), are closely related to the present analysis, as in both cases the sizes of search engine indices for specific languages were estimated by extrapolating query frequency results from known corpora against GSE search results. Kilgarriff (2007) presented the analysis for the German and Italian languages. Kilgarriff's main idea was to look at texts indexed by Google as a "black box" corpus that can only be studied by queries. The queries are based on a selected list of words, which can be referred to as pivot words. By comparing the results of the same queries made on this "black box" with frequencies from a known reference corpus (RC), it is possible to infer the size of the "black box" corpus based on the average of count ratios for each tested word.
One of the recent attempts to estimate the size of Dutch and English indices was published by van den Bosch et al. (2016). The study presents a longitudinal observation of the size of Google and Bing indices based on frequencies of 28 pivot words. The unique feature of the study is its longitudinal aspect: the authors set up a system which has been monitoring Google and Bing indices daily since 2006 and is still ongoing.
In many ways, we followed the ideas in these two works, albeit with a very different approach to the selection of pivot words, with more consistent calculations, and without accounting for duplicate documents.

Methodology
Our main interest is the estimates for the Lithuanian language. All the efforts, knowledge and sample sizes are adjusted for this purpose. However, for the sake of comparison we have performed a limited scope analysis with less precise estimates (due to smaller test samples) for the neighboring Latvian, Polish, Belarusian, Russian, Estonian, and Finnish languages examining only queries by Google.
For this research we have used the 2nd version of the Corpus of Contemporary Lithuanian Language CCLL2 (Utka et al., 2017) by Vytautas Magnus University (VMU) and various corpora of the TenTen family by Sketch Engine (Jakubíček et al., 2013). The details of the corpora are provided in Table 1. As a reference corpus (RC) for Lithuanian we have chosen Sketch Engine ltTenTen14 because of its size, quality and more or less the same origin as GSE text. The CCLL2 corpus (about 5 times smaller than ltTenTen14 and built under a different policy) has been used in this research for selecting pivot words and for a "proof of concept" estimate of the size of ltTenTen14. For similar reasons we have also chosen the corpora gathered by Sketch Engine for other languages.

Criteria for the list of pivot words
The most important part of this research is the selection of a list of pivot words to be used to query GSE and a reference corpus in parallel. Unfortunately, GSE queries report only the approximate number of documents found and not the number of word matches, so in order to compare apples to apples, we should also count documents and not words in a reference corpus. Estimating the size ratio of two corpora on a docs-to-docs basis (instead of words-to-words) is an indirect measurement. Such an estimate may be highly susceptible to the text chunking policy of a particular corpus and as a result it can be nonlinear, e.g. a twofold increase in docs-count may not mean a twofold increase in corpus size. Let us consider a small example regarding the word the in the British National Corpus (BNC) and the Brown corpus (Francis and Kucera, 1979). In BNC its word-count is 6,054,939 and its docs-count is 4,050, while in the Brown corpus its word-count is 69,971 and its docs-count is 500 (i.e. in both cases every document contains the). The actual corpus size ratio is about 96 (i.e. BNC is about 96 times bigger than the Brown corpus). Thus, the words-to-words ratio of 86.5 gives a much more realistic estimate of the actual corpus size ratio than the docs-to-docs ratio of 8.1.
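The arithmetic of this example can be reproduced with a short sketch. The counts are those quoted above for the word the; the helper name `size_ratio` is only illustrative and not part of any tool used in the study.

```python
# Counts for the word "the", as quoted in the text above.
bnc = {"word_count": 6_054_939, "doc_count": 4_050}
brown = {"word_count": 69_971, "doc_count": 500}


def size_ratio(big: int, small: int) -> float:
    """Ratio of two counts, used as a naive estimate of the corpus size ratio."""
    return big / small


words_to_words = size_ratio(bnc["word_count"], brown["word_count"])
docs_to_docs = size_ratio(bnc["doc_count"], brown["doc_count"])

print(f"words-to-words ratio: {words_to_words:.1f}")  # 86.5
print(f"docs-to-docs ratio:   {docs_to_docs:.1f}")    # 8.1
```

For a word that occurs in every document, the docs-to-docs ratio only reflects how many documents each corpus was split into, which is why it falls so far from the words-to-words ratio here.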
Since our main interest is the number of words, and we can only measure the number of documents, we should keep the two as close to each other as possible. This raises one important requirement for pivot words: they should be hapax legomena in every document where a queried word was found (occurring no more than once per document). That means we should use infrequent words with low counts throughout the corpus while ensuring {word frequency count}≈{number of documents containing the word}. Adherence to this principle also avoids some of the peculiarities inherent in low-frequency words: they tend to cluster in certain documents, possibly because some authors are inclined to "invent" them and use them for very specific purposes.
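The requirement {word frequency count}≈{number of documents} can be checked mechanically once word-counts and docs-counts are available from a reference corpus. The following is a minimal sketch, not the authors' procedure: the `WordStats` record, the 5% tolerance, and the sample entries (`word_a`, `word_b`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    word: str
    word_count: int  # total occurrences in the reference corpus
    doc_count: int   # number of documents containing the word


def is_once_per_document(stats: WordStats, tolerance: float = 0.05) -> bool:
    """True if the word occurs (almost) exactly once in each document it appears in.

    The tolerance is a hypothetical threshold, not a value from the paper.
    """
    if stats.doc_count == 0:
        return False
    return stats.word_count / stats.doc_count <= 1 + tolerance


# Hypothetical entries: word_a is spread evenly, word_b clusters in few documents.
candidates = [
    WordStats("word_a", word_count=42, doc_count=41),
    WordStats("word_b", word_count=60, doc_count=12),
]
suitable = [c.word for c in candidates if is_once_per_document(c)]
print(suitable)
```

A word such as `word_b`, with five occurrences per document on average, would make the docs-count underestimate the word-count and is therefore rejected.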
On the other hand, extremely low frequency counts are statistically prone to greater sampling errors. Therefore, it is essential to select pivot words from a range between the very low and the high frequency counts. In order to assess this issue, we have evaluated the estimated size ratio of the two corpora, CCLL2 and ltTenTen14. 5,000 test words were selected from CCLL2 with frequency counts ranging from 1 to an extremely high 50,000. The sample of test words was divided into 30 intervals, and individual docs-to-docs ratios as well as means and medians per interval were calculated. The results, presented in Fig. 1, confirm our reasoning about the unsuitability of high frequency words, as well as of those with counts below 10. So for pivot words, we decided to choose words with frequency counts between 10 and 100 in CCLL2. Words with these frequencies in CCLL2 gave the most appropriate prediction of the size ratio of the two corpora, with a relative error of 12% (5.4 estimated versus 4.8 actual), suggesting that a comparison of ltTenTen14 versus GSE will also be feasible.
Fig. 1: Docs-to-docs ratio as a function of test word frequency. Corpora under investigation: CCLL2 and ltTenTen14. The shaded zone (frequency counts between 10 and 100) was chosen as the best compromise between statistical errors inherent to low counts and the apparently biased ratio estimate at high counts.
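The interval analysis described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say how the 30 intervals were spaced, so logarithmic spacing is assumed here, and the function names and the toy rows are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def bin_index(freq: int, n_bins: int, max_freq: int) -> int:
    """Map a frequency in [1, max_freq] to one of n_bins logarithmic intervals."""
    position = math.log(freq) / math.log(max_freq)  # 0.0 at freq=1, 1.0 at max_freq
    return min(int(position * n_bins), n_bins - 1)


def median_ratio_per_bin(rows, n_bins: int = 30, max_freq: int = 50_000):
    """rows: (frequency in small corpus, docs in small corpus, docs in large corpus).

    Returns the median docs-to-docs ratio for each non-empty frequency interval.
    """
    ratios = defaultdict(list)
    for freq, docs_small, docs_large in rows:
        if docs_small > 0:
            ratios[bin_index(freq, n_bins, max_freq)].append(docs_large / docs_small)
    return {b: median(r) for b, r in sorted(ratios.items())}


# Toy rows, invented purely to show the shape of the input.
toy_rows = [(1, 1, 9), (1, 1, 3), (50_000, 100, 500)]
print(median_ratio_per_bin(toy_rows))
```

Plotting the per-interval medians against frequency would give a curve of the kind shown in Fig. 1, from which a stable frequency band can be read off.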
Other important requirements for the pivot words are language specific. Pivot words should be able to slice the corpus of a particular GSE precisely into the subcorpus of documents in the specified language (e.g. Lithuanian). Pivot words cannot coincide with regular words of another language. For example, the word "imam" is of no use for examining Lithuanian-only content because it is also a regular word in English, French, Italian and other languages. Moreover, we should avoid words which have the following characteristics: shorter than 7 letters; international origin; foreign loanwords; proper names of any kind; headword forms; having accented characters; specific to a particular domain or time period; normalized (diacritics removed) variants of other words (e.g. Lithuanian sukuosi and šukuosi); common misspellings in the target or any other language (such as Lithuanian permetant and French permettant).
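Some of these criteria are mechanical and can be pre-screened automatically before manual review. The sketch below covers only three of them (minimum length, coincidence with a foreign word, and collision after diacritics are removed); the function names and the tiny word sets are assumptions for illustration, and the remaining criteria (origin, domain, misspellings) still require linguistic judgement.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'šukuosi' -> 'sukuosi'."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def passes_mechanical_checks(word, own_lexicon, foreign_lexicon, min_length=7):
    """Pre-screen a pivot word candidate against three mechanical criteria."""
    if len(word) < min_length or not word.isalpha():
        return False
    if word in foreign_lexicon:  # coincides with a word of another language
        return False
    folded = strip_diacritics(word)
    # Reject both members of a pair that differ only by diacritics.
    for other in own_lexicon:
        if other != word and strip_diacritics(other) == folded:
            return False
    return True


# Tiny illustrative word sets built from the examples in the text.
lithuanian = {"sukuosi", "šukuosi", "permetant"}
foreign = {"imam", "permettant"}

for candidate in ["imam", "sukuosi", "šukuosi", "permetant"]:
    print(candidate, passes_mechanical_checks(candidate, lithuanian, foreign))
```

Note that permetant passes these checks even though the text excludes it as a near-match of French permettant, which shows why the automatic screen can only narrow the list and not replace manual selection.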

Querying the GSE
In order to ensure comparable results across GSE and languages, we have adhered to specific criteria. All GSE and languages should be tested at the same time and in the same way. All queries have been performed using the "exact word form search" functionality, i.e. with double quotes surrounding the word to be searched. The querying was performed manually in order to avoid the anti-robot functionalities that some GSE apply behind the scenes. No linguistic, date, type or any other option that can narrow the search scope should be set. When analyzing search results, the "Documents found" number should be recorded regardless of any further circumstances (reports of possible duplicates, copyright issues etc.).

Querying reference corpora
All the reference corpora (RC) from Sketch Engine can be queried using the advanced features of Sketch Engine's built-in "Wordlist" functionality, which allow batch processing of the whole list of test words. The query returns word-counts and document-counts for each test word. The date of the query is not important because the content of Sketch Engine corpora does not change.

Statistical calculations
Following Kilgarriff (2007), we have used the docs-count ratio as an estimate of the size ratio of two corpora (or of an ordinary corpus versus GSE treated as a corpus). As explained earlier, we have used carefully selected test words to avoid biased estimates. The i-th estimate of the size in words of the Lithuanian text indexed by a particular GSE is therefore:
$$N_i = N_{RC} \cdot \frac{y_i}{x_i}$$
where $x_i$ is the document-count for the i-th test word in the Lithuanian RC, $y_i$ is the document-count for the i-th test word in the particular GSE, and $N_{RC}$ is the size in words of the RC.
Having the set of estimates $\{N_i\}$, we can calculate the mean, median, and outliers.
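The calculation can be sketched in a few lines: each estimate scales the RC size by the ratio of the GSE docs-count to the RC docs-count. The text does not state how outliers were identified, so the 1.5 × IQR rule below is an assumption, as are the function names and the toy counts.

```python
from statistics import mean, median, quantiles


def size_estimates(rc_docs, gse_docs, rc_size_words):
    """One size estimate per test word: RC size scaled by the docs-count ratio."""
    return [rc_size_words * y / x for x, y in zip(rc_docs, gse_docs) if x > 0]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; an assumed rule, not the paper's."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Toy counts invented for illustration; they are not the study's data.
rc_docs = [10, 10, 10, 10, 10, 10, 10, 10]
gse_docs = [90, 100, 100, 100, 110, 110, 120, 1000]
estimates = size_estimates(rc_docs, gse_docs, rc_size_words=1_000)

print("mean:    ", mean(estimates))
print("median:  ", median(estimates))
print("outliers:", iqr_outliers(estimates))
```

In the toy data a single inflated docs-count pulls the mean to roughly twice the median, which illustrates why reporting the median alongside the mean is useful when GSE counts are noisy.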

Results
The main results of this research will be published in the CLARIN-LT repository. Here we present the most important part of the results and some samples of test words for all languages. It should be noted that the list of 100 Lithuanian pivot words was prepared with great care and in accordance with the criteria laid out in Section 3.1. Due to the lack of deep and specific knowledge of the other languages, the corresponding lists of test words are substantially shorter and may be inconsistent with the principles listed in Section 3.1, so the results for these languages may be less precise.

Results for Lithuanian
The GSE measurements for Lithuanian were performed twice with an interval of approximately six months: first on September 27, 2021 and then on April 11, 2022. The counts for sample test words are presented in Table 2. Statistical analysis of all the test words is presented in Figure 2 and Table 3. Not surprisingly, the largest amount, 56 billion Lithuanian words, was indexed by Google, followed by Russian Yandex (41bn) and Microsoft Bing (29bn). It should be noted, however, that Yandex's scores raise reasonable doubts, as a significant portion of the reported "number of documents found" values is heavily rounded and appears to be suspiciously repetitive. Another interesting observation relates to indexing by Google: during the six months between the two measurements, the volume of indexed Lithuanian text there decreased. This could be explained by Google's recent policy of "cleaning up" its indices from junk, duplicates or intentionally misleading content, a policy also mentioned by Indig (2020). Rather large fluctuations of Google's index size were also reported by van den Bosch et al. (2016) in their longitudinal analysis. In contrast, our counts show that the amount of Lithuanian text indexed by Microsoft Bing and Yandex increased by more than 50% during the same six months, which raises some interesting questions. Such a large increase could have many different causes, for instance the technical advancement of web crawling algorithms, the proliferation of AI-generated texts, fluctuations similar to those observed for Google, or simply a tendency to enlarge indices. Further investigation is needed to answer these questions.

Results for other languages
A comparative assessment of the amount of indexed text in the neighboring languages was performed in April 2022, with Google only and with a limited set of pivot words. Sample pivot words and the corresponding docs-counts are presented in Table 4 through Table 9.
Interlingual assessment results are presented in Table 10, including calculated results per capita.

Conclusions
Given the current significance of GSE in everyday decision making, it is important to track changes in the volume of indexed texts, as these may signal important political, technological or social processes within the corresponding societies. On the other hand, the changes could simply reflect a technological or marketing decision of a GSE. In any case, the amount of accessible information influences our daily lives and shows the extent of the digital presence of the language that we speak. Is it possible to imagine the size of 56 billion words of Lithuanian text indexed by Google? Is this a really large number? It can be compared to books. As one book usually contains about 100 thousand words, Google's "assets" are comparable to about 0.5 million books, which roughly corresponds to the number of unique books published in Lithuania in about 100 years (at the current production rate). Even though the exact amount of unique text is difficult to estimate both on the web and in libraries due to duplicated material, we think that the calculated size is of interest to data scientists as well as linguists in establishing the order of magnitude of accessible Lithuanian text on the web.
Such an amount of Lithuanian text handled by GSE shows the language's vitality and allows us to expect rather good results from search queries. Of course, the results may be affected by deliberate text filtering, e.g. for justified reasons such as personal data protection or the removal of harmful or false information. Besides, access to the presented information is also influenced by hit ranking algorithms.
Among our future plans is setting up a monitoring system similar to the one designed by van den Bosch et al. (2016), first of all for monitoring changes in the volume of indexed Lithuanian text and, perhaps, eventually for monitoring other languages. It is also important to continue testing and validating the lists of pivot words for other languages together with linguists of these languages, in order to ensure comparable results.

Acknowledgements
We would like to thank Dr. Kristina Vaisvalavičienė for her valuable advice concerning the list of Latvian pivot words and Kalok Man for his consultation regarding Chinese Baidu.