Latvian Language in the Digital Age: The Main Achievements in the Last Decade

Ten years ago, when the META-NET Network of Excellence conducted a study on language technology support for European languages, Latvian was included in the category of languages with little or no support. During the last decade, notable progress has been made in the development of language resources and tools for Latvian, particularly regarding the creation of advanced datasets like speech corpora and treebanks, state-of-the-art neural language models, machine translation systems, speech technology, and technologies for natural language understanding and human-computer interaction. This paper provides an overview of the most recent activities in the language technology field in Latvia: national and international initiatives, key language resources and tools, and key projects. We summarize both the recent activities and the most significant achievements since the publication of the META-NET White Paper on Latvian.


Introduction
Ten years ago, the META-NET Network of Excellence conducted an extensive study of the level of language technology support for 31 European languages. The survey was published as a White Paper book series describing the technology landscape for the European languages (Rehm and Uszkoreit, 2012). Besides general facts about each language, the series describes developments in the general language resource and technology areas, as well as in the main application areas of language and speech technology. The series also presents a cross-language comparison within four key areas: text analysis, speech and text resources, machine translation, and speech processing. In this comparison, Latvian language support in all four key areas was assessed as weak (Skadiņa et al., 2012).
Since the publication of the META-NET White Paper series, the progress and achievements in the key language technology areas for Latvian have been periodically reported through the Baltic HLT conference series and other relevant venues (Skadiņa, 2019; Skadiņa et al., 2016). These reports document rapid progress in language technologies for Latvian, particularly with respect to machine translation, speech recognition and synthesis, natural language understanding and human-computer interaction.
Ten years after the META-NET White Paper series, another pan-European survey was conducted within the European Language Equality (ELE) project 1. This paper provides an overview of the most recent and significant activities in the language technology area in Latvia, highlighting and elaborating on the findings of the ELE project. It also provides a broader overview of the key achievements regarding Latvian language resources and technologies since 2012.

Language policy and major activities
In general, the research and development community, policy makers and government institutions broadly recognise that advancing Latvian language technologies is a critical prerequisite for the survival of the language in the digital age.
Research and development activities in Latvia are supported through different EU and national funding instruments: State research programmes, EU Structural Funds programmes (in particular, through the IT Competence Centre projects and the Industry-driven research projects), Latvian Council of Science grants for fundamental and applied research, and the EU Horizon 2020, Horizon Europe and CEF programmes.

National programmes and initiatives
The need for language technology support in the digital environment and the importance of language technologies for the long-term survival of the Latvian language have always been recognised in policy planning documents. However, until recently, there was no dedicated language technology research and development programme in Latvia. Thus, research and development activities in this area are, in many cases, fragmented and not always sufficiently funded.
Several policy planning documents for 2021-2027 stress the need to support the Latvian language in the digital environment.
The State Language Policy Guidelines 2 list several activities related to the creation and further development of Latvian language resources and tools. Since 2022, the guidelines have been implemented through the three-year State Research Programme "Letonica - Fostering a Latvian and European Society".
The Digital Transformation Guidelines for 2021-2027 3 include actions to enable Latvian citizens to access the European Digital Space in their native language and to support the development of the most important language resources for sustainable and wide use in digital services.
The information report "On the development of Artificial Intelligence solutions" also lists several directions of action related to the development of AI-based language technologies, such as machine translation, speech technologies, inclusive technologies and terminology databases.
In 2021, Latvia also approved the Recovery and sustainability plan 4. Investments are allocated for the development and implementation of high-level skills in three areas: language technology, quantum computing, and HPC. The establishment of an Excellence Centre for Language Technology is envisioned to prepare a curriculum for language technology teaching, to advance language resources and create platforms and tools for study and experimentation, and to conduct research involving young researchers. The main research activities planned within this centre are the development of language resources for speech and text processing, the creation of large pre-trained language models, the advancement of state-of-the-art speech technologies and machine translation, and the development of software platforms and a shared technical infrastructure for education and research.

International initiatives
During the last decade, Latvian language technologies have been part of research, innovation and deployment actions in several FP7, Horizon 2020 and CEF projects on automated translation, speech technologies, human-centred AI, and support for digital language equality.
The CEF programme has funded several projects in which language resources and tools were created for Latvian together with several other European languages. Numerous neural machine translation (NMT) engines were developed for translation between Latvian and other official EU languages in the CEF project NTEU (Bié et al., 2020). Domain-specific NMT systems were developed in the framework of the iADAATPA project (Castilho, 2019).
Anonymization techniques for Latvian were developed in the CEF programme project MAPA, which resulted in an open-source multilingual toolkit for public administrations (Ajausks et al., 2020). New Latvian terminology resources were collected, processed and published in the EuroTermBank database in the CEF projects eTranslation TermBank (Pesliakaitė, 2017) and Federated eTranslation TermBank Network (Lagzdiņš et al., 2022).

National initiatives and recent projects
Research and development activities at the national level are mostly supported through three funding instruments: State research programmes, EU Structural Funds programmes, and grants of the Latvian Council of Science.
In 2010, Latvian research institutions and major information technology companies founded the IT Competence Centre (ITCC). The goal of ITCC is to support long-term cooperation between research organisations and industry to create innovative technologies and prototypes of internationally competitive IT products. Since 2011, more than ten language technology projects have been implemented with support from the ITCC programme (Skadiņa et al., 2016). ITCC supported the creation of the first orthographically and phonetically transcribed Latvian speech corpus, followed by several projects on speech recognition. Several projects addressed machine translation, while others were devoted to intelligent human-computer interaction.
In 2016, the Cabinet of Ministers approved the implementation rules of the Industry-driven research programme of the European Regional Development Fund. Five large projects, mostly implemented through cooperation between research organisations and industry, have been supported under this programme. The topics of these projects include the development of language resources and tools for natural language understanding and generation (Gruzitis et al., 2018), the development of neural network solutions for less resourced languages, multilingual affective human-computer interaction (e.g., Nicmanis and Salimbajevs, 2021), the application of speech technologies for multilingual meeting management, and the transcription of medical speech.
Funding for the creation, development and maintenance of Latvian language resources and tools has also been received from different national research programmes for Latvian language support. Since 2022, the State Research Programme project "Research on Modern Latvian Language and Development of Language Technology" (LATE) has been under way. It aims to advance research on the grammatical, lexical-semantic, phonetic and phonological systems of the modern Latvian language and of Latvian sign language using data-driven methods, as well as to develop sustainable Latvian language resources and tools.
Since 2018, several projects have also been supported by the Latvian Council of Science, including the development of the Latvian Learner Corpus (Dargis et al., 2020a), the creation of a pilot Latvian WordNet and means for neural word-sense disambiguation, and research on natural language understanding and generation for human-computer interaction (Gosko et al., 2021).

Infrastructural development
The fundamental support for languages in a digital environment is provided through research infrastructures.
In 2016, Latvia joined the European Research Infrastructure Consortium CLARIN (Common Language Resources and Technology Infrastructure). 7 CLARIN-LV is a CLARIN node in Latvia: it supports and collaborates with the digital humanities, shares language resources developed by the Latvian academic community, and actively contributes to and participates in international CLARIN activities. CLARIN-LV focuses mainly on Latvian (and Latgalian) language resources and tools, without excluding other languages. The CLARIN-LV repository of language resources and tools 8 was set up in 2020. The most popular language resources are the open lexical database Tezaurs.lv, the Latvian Treebank, and the Balanced Corpus of Modern Latvian, followed by NLP-PIPE, the NLP pipeline as a service for Latvian. CLARIN-LV is a member of the Knowledge Center for Systems and Frameworks for Morphologically Rich Languages SAFMORIL 9 and actively participates in different CLARIN ERIC activities, such as CLARIN Resource Families, Teaching with CLARIN, and CLARIN ParlaMint (Erjavec et al., 2022).
Participants from Latvia are among the core members of the European Language Grid (ELG) and European Language Equality (ELE) projects. The objective of the Horizon 2020 project European Language Grid is to address fragmentation in the European language technology business and research landscape by establishing the ELG as the primary platform for language technology in Europe, and to strengthen European LT business against competition from other continents (Rehm et al., 2021). Various Latvian language processing tools and resources are already available and can be executed on the ELG platform, 10 including machine translation systems, text-to-speech and speech-to-text tools, and POS taggers. The ELG catalogue also has a comprehensive list of Latvian language resources, aggregated from META-SHARE, ELRC-SHARE, CLARIN and other repositories.
Latvia sets an example in making language technology services available to public administrations and the general public through the national language technology platform Hugo.lv. The broad application and high usage of this platform have inspired other countries, such as Estonia, Croatia, Iceland and Malta, to follow this example. Under the leadership of the Latvian partners Tilde and the Culture Information Systems Centre, they have joined forces to create and deploy similar systems in their countries in the framework of the CEF programme project National Language Technology Platform (Tadić et al., 2022).

Language resources
Since the publication of the META-NET White Paper on Latvian, various new language resources have been created, including advanced datasets and models for natural language understanding, speech recognition and synthesis.

Text and speech corpora
Text corpora have been developed for Latvian for several decades. In 2012, monolingual text corpora were already rather well represented, while the availability of parallel corpora, treebanks and other kinds of annotated corpora was weak. Moreover, speech corpora for Latvian were not yet available.
Today, many open-access text corpora are accessible through the Korpuss.lv platform, and most of them form the Latvian National Corpora Collection (LNCC), a diverse collection of corpora representing both written and spoken language (Saulīte et al., 2022). The LNCC comprises more than 20 corpora, with a total size currently exceeding 2 billion tokens, that represent different types and genres of Latvian written and spoken language. Although written language is dominant in LNCC, three relatively large spoken language text corpora are already available: the orthographic transcription of the balanced general-domain 100-hour speech corpus (Pinnis et al., 2014), the orthographic transcription of the balanced 30-hour medical speech corpus (Darģis et al., 2020b), and a large subtitle corpus (10M tokens) of public broadcasting.
LNCC is a continuous multi-institutional and multi-project effort, supported by the digital humanities and language technology communities in Latvia. All corpora of LNCC are annotated with a uniform morpho-syntactic annotation scheme, which enables federated search and consistent linguistic analysis across all corpora and allows various corpora to be selected and mixed for pre-training large Latvian language models like BERT. An open-access federated search facility is available through the LNCC website, giving an overview of the absolute and relative frequency of a given search term across all the LNCC corpora.
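To make the distinction between absolute and relative frequency concrete, the following is a minimal sketch in Python. It assumes that relative frequency is normalised to occurrences per million tokens, a common convention in corpus linguistics; the source does not state which normalisation the LNCC search uses. The class, the function, the corpus names and all counts are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class CorpusCount:
    """Hit count for one search term in one corpus (hypothetical structure)."""
    corpus: str
    hits: int    # absolute frequency of the search term
    tokens: int  # total size of the corpus in tokens

    @property
    def per_million(self) -> float:
        # Relative frequency, assumed here to be occurrences per million tokens.
        return self.hits / self.tokens * 1_000_000


def frequency_overview(counts):
    """Return (corpus, absolute, relative) rows, highest relative frequency first."""
    rows = [(c.corpus, c.hits, round(c.per_million, 2)) for c in counts]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Invented numbers, for illustration only.
sample = [
    CorpusCount("corpus_a", hits=120, tokens=10_000_000),
    CorpusCount("corpus_b", hits=45, tokens=1_500_000),
]

for row in frequency_overview(sample):
    print(row)
```

In this invented example the smaller corpus ranks first: 45 hits in 1.5 million tokens is a higher relative frequency than 120 hits in 10 million tokens. This is why a federated overview benefits from reporting both figures.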
Modern Latvian is primarily represented through the Balanced Corpus of Modern Latvian LVK2018 (Dargis et al., 2020c), which is being extended to 100 million words. For a balanced subset of LVK2018, syntactic and semantic annotation layers have been added to varying extents (12-17 thousand sentences at the time of writing): Universal Dependencies (UD), named entity and co-reference annotations, and frame-semantic annotations (Gruzitis et al., 2018). The multilayer corpus is being enhanced and extended through successive projects, with the aim of reaching at least 20k annotated sentences. Notably, the latest release of the UD layer contains nearly 17k sentences. It should also be noted that the Latvian UD treebank was already classified as a relatively big treebank within the CoNLL 2017 and 2018 shared tasks on UD parsing.
Many corpora are also openly accessible from the Opus platform (Tiedemann, 2016) and the ELRC-SHARE repository 11. Bilingual and multilingual corpora are also stored in the Tilde Data Library 12. The Tilde Data Library includes 12.35 billion parallel sentences and 23.85 billion monolingual sentences in 124 languages. Part of this content is publicly available from the ELRC and ELG platforms, while some of it is also browsable through Hugo.lv, the Latvian State Administration Language Technology Platform. However, domain-specific parallel corpora that would allow training and fine-tuning domain-specific MT engines are lacking.
The first Latvian speech corpus was created in 2012-2013 (Pinnis et al., 2014). The corpus contains 100 hours of transcribed speech, which was a key starting point for the rapid development of speech recognition solutions for Latvian. However, access to this speech corpus is limited, and currently the only open-access Latvian speech corpora are LaRKo and Common Voice Latvian, each of which contains about 8 hours of annotated speech data. In addition, several domain-specific speech corpora have been created (e.g., a medical speech corpus for the radiology domain (Dargis et al., 2020b)).
The development of a general, open-access Latvian speech corpus has recently started in the National Research Programme's "Letonica - Fostering a Latvian and European Society" project "Research on Modern Latvian Language and Development of Language Technology" 13 (LATE). In this project, a balanced open-access speech corpus of at least 100 hours will be created, as well as a high-quality speech corpus for text-to-speech synthesis.
Multimodal corpora are still not available for Latvian, although the development of a pilot sign language corpus is planned in the LATE project.
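Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with "#", and a blank line after each sentence. The sketch below is a minimal reader for that format, shown only to illustrate what a UD annotation layer looks like to a program. The sample sentence and its annotation were written by hand for this illustration and are not taken from the Latvian treebank; the helper names are hypothetical.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            # Keep plain word lines; skip multiword ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def make_row(idx, form, lemma, upos, head, deprel):
    """Build one CoNLL-U line, leaving the unused columns as '_' (illustration helper)."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


# Hand-made example sentence ("I am reading a book."), not treebank data.
SAMPLE = "\n".join([
    "# text = Es lasu grāmatu.",
    make_row(1, "Es", "es", "PRON", 2, "nsubj"),
    make_row(2, "lasu", "lasīt", "VERB", 0, "root"),
    make_row(3, "grāmatu", "grāmata", "NOUN", 2, "obj"),
    make_row(4, ".", ".", "PUNCT", 2, "punct"),
    "",
])

sentences = read_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

For real work, an established CoNLL-U library is preferable to a hand-written reader; the sketch omits validation and keeps all values as strings.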

Lexical resources
Latvian digital lexical resources have been under development for a long time. Tezaurs.lv is the largest open lexical dataset and on-line dictionary for Latvian (Spektors et al., 2016). The dictionary is popular not only among researchers but is also widely used by the general public: translators, journalists, students and many others. It receives 30-40M requests per year, generated by more than 100k unique users per month. It is regularly updated and currently contains more than 380,000 lexical entries compiled from more than 300 sources.
The development of another important lexical resource, Latvian WordNet, is currently underway. The chosen methodology for word sense splitting and linking is based on corpus evidence and the data from Tezaurs.lv, ensuring a theoretical foundation that has been fine-tuned for both the actual use of Latvian and the linguistic tradition. Furthermore, links between the synonym sets of Latvian WordNet and Princeton WordNet are also being added.
Different lexicons (mostly bilingual) are available from the Letonika.lv portal. It contains electronic dictionaries for widely used language pairs (Latvian and English, French, German and Russian), as well as dictionaries of the languages of the Baltic countries: Latvian and Lithuanian, Latvian and Estonian.
Latvian terminology is consolidated in the European Terminology Bank (Rirdance and Vasiļjevs, 2006) 14 and the Latvian national terminology portal. 15 Today, EuroTermBank contains about 3.5 million entries (14.5 million terms) from 463 collections in 44 languages. As part of the CEF project FedTerm, the Latvian national terminology portal is integrated with EuroTermBank in a federated network that interlinks terminology portals from different European countries.

Text analysis tools
Various basic text processing tools, such as tokenizers and sentence splitters, morphological analysers and taggers, and spelling and grammar checkers, have been available for Latvian for several decades. Spelling and grammar checking tools are available to users through Microsoft and Tilde products, as well as some open-source text processors. Various open-source Latvian NLP tools are integrated into NLP-PIPE: a modular pipeline for text tokenisation and sentence splitting, morphological tagging, named entity recognition, syntactic dependency parsing, etc. (Znotins and Cirule, 2018).
With the introduction of large pre-trained language models that can be fine-tuned for different NLP tasks, several part-of-speech taggers, dependency parsers (Znotiņš and Bārzdiņš, 2020) and named entity recognizers (Vīksna and Skadiņa, 2020) have been developed. Finally, several sentiment analysis tools have been created as well.
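The idea of a modular pipeline can be illustrated with a minimal Python sketch in which each processing step reads and extends a shared document structure. This is a generic illustration and does not reproduce the architecture or API of NLP-PIPE. The function names are hypothetical, and the regular-expression rules are deliberately naive stand-ins for real sentence splitters and tokenisers.

```python
import re


def split_sentences(doc):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [p for p in parts if p]
    return doc


def tokenize(doc):
    """Split each sentence into word tokens and single punctuation marks."""
    # \w matches Unicode letters, so Latvian diacritics stay inside words.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def run_pipeline(text, steps):
    """Apply the steps in order; each step adds its own layer to the document."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


result = run_pipeline(
    "Rīga ir Latvijas galvaspilsēta. Tā atrodas pie Daugavas.",
    [split_sentences, tokenize],
)
print(result["sentences"])
print(result["tokens"])
```

Because every step shares the same interface, further stages such as morphological tagging or named entity recognition could be appended to the list of steps without changing the existing ones, which is the main appeal of a modular design.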

Natural language understanding (NLU) and generation (NLG)
Regarding NLU and NLG, apart from neural transformer encoder, transformer decoder and transformer encoder-decoder models for Latvian, experiments with interlingual knowledge-based representations, namely FrameNet, Abstract Meaning Representation (AMR) and Grammatical Framework, have also been conducted for Latvian, English and other languages. This demonstrates the expertise in, and the potential of, combining machine learning and knowledge-based approaches for state-of-the-art NLG for both high-resourced and less-resourced languages in practical use cases where predictability and precision are as important as fluency and scalability (Ranta et al., 2020).

Machine translation
With respect to machine translation (MT), the situation for the Latvian language has changed considerably compared to 2012. Today, most global companies that offer machine translation services also support Latvian (e.g., Google Translate 16, Bing Microsoft Translator 17, DeepL 18, Amazon Translate 19, Watson Language Translator 20, and others). In 2017, Latvian was included as a competition language in the news translation shared task of the Conference on Machine Translation 21. Neural machine translation (NMT) systems developed by Tilde were recognised among the best systems (Bojar et al., 2017). Based on these results, Tilde, together with partners who provided language resources, developed the EU Council Presidency Translator, which has already been used in 8 countries (Pinnis et al., 2021). However, the lack of sufficient data for various narrower domains still limits the development of MT engines for specific domains and lesser resourced languages like Latvian.
In 2012, the dominant machine translation paradigm was phrase-based statistical machine translation (SMT). However, since late 2016, the state-of-the-art paradigm has been neural machine translation (Bojar et al., 2016). This paradigm shift has also affected machine translation research for Latvian. Prior to 2016, work focused on improving MT quality for Latvian with the use of external morphological taggers or parsers (e.g., Skadiņš et al., 2010), domain-specific terminology (e.g., Pinnis, 2015), domain adaptation (e.g., Pinnis et al., 2013; Pinnis and Skadiņš, 2012), and other methods. Although such methods are still relevant, the paradigm shift required them to be reworked completely, since NMT differs substantially from SMT.
The shift to NMT has also spurred research in areas that are specifically important for NMT, such as input representations for neural networks that improve word splitting consistency for morphologically rich languages such as Latvian (Pinnis et al., 2017), NMT system robustness (e.g., Bergmanis et al., 2020), and others.
Recent NMT research in Latvia and for Latvian has focused on the following topics: terminology integration in NMT Pinnis, 2021a, 2021b), analysis of biases in NMT systems, robustness of NMT systems, speech translation (Alves et al., 2020), and others.

Speech technology
For many years, speech technology support for Latvian was almost non-existent due to the lack of data for training speech recognition and synthesis models. Shortly after the transcribed 100-hour corpus of spoken Latvian was created, several automatic speech recognition (ASR) systems were developed (Salimbajevs and Strigins, 2015; Znotins et al., 2015). Today, the accuracy of these systems is comparable to the state of the art.
General-purpose Latvian speech synthesisers and recognisers developed by Tilde 22 and IMCS UL 23 are publicly available and are constantly being extended with new features (Nicmanis and Salimbajevs, 2021). Domain-specific speech transcription systems are also being developed, most notably for the medical domain: IMCS UL, together with Riga East University Hospital, has developed an ASR system focusing on radiological and histopathological examination reports, as well as medical case histories (Dargis et al., 2020b; Znotiņš et al., 2022). Tilde has researched methods for adapting ASR to the medical domain with untranscribed audio (Salimbajevs and Kapočiūtė-Dzikienė, 2022) and, together with the Children's Clinical University Hospital, has developed an ASR system focusing on psychiatry, paediatrics and radiology.
Work is ongoing on various online solutions that incorporate Latvian speech recognition, including live event transcription for people with hearing impairments, captioning of video recordings, and live transcription of online video meetings 24.

Human-computer interaction
With the renaissance of AI and the availability of computational resources that have made deep learning techniques applicable to natural language processing tasks, human-computer interaction with the help of virtual assistants has become a topical subject again.
Today, several task-oriented virtual assistants that help users find answers to their questions can communicate in Latvian. Virtual assistants are also used by public services. For example, at the time of writing, Hugo.lv 25 lists 15 virtual assistants for different public administration services, including the Latvian State Radio and Television Centre, the Courts Administration, the Bank of Latvia, the Rural Support Service, and many others. These have been developed using the capabilities of the conversational AI platform tilde.ai, which allows users to create their own virtual conversational agents for specific tasks. These agents support both text and voice input, can recognise intents expressed in the input, and deliver responses using text, visual media, or voice modalities.
However, natural language understanding is still not a solved problem, and much work remains to be done to create technologies for deeper language understanding and human-computer interaction. Several steps in this direction have already been made by researching intent detection (Balodis and Deksne, 2019; Kapočiūtė-Dzikienė et al., 2021), slot filling (Gosko et al., 2021) and next dialogue action prediction techniques Skadiņš, 2020, 2021; Skadiņa and Goško, 2020).

Conclusion
We have provided an overview of the current state of the Latvian language in the digital environment. Since the publication of the META-NET White Papers, notable progress has been made in the development of various language resources and tools for Latvian.
Although the Latvian language is used by a rather small number of speakers and is often categorised as less resourced, it is represented rather well not only by different language resources (digital libraries, text and speech corpora, lexicons, etc.) but also by core language technologies. Concerning more advanced technologies, Latvian has reasonably good support for machine translation, speech recognition and synthesis, while solutions that involve deep state-of-the-art natural language understanding, like virtual assistants and text summarisers, are less developed.
There are still significant gaps with respect to the availability, size and technology readiness level (TRL) of language resources. Significant gaps are identified for both monolingual and multilingual data of all forms: written, spoken and multimodal. For example, datasets that represent conversational data, question answering, knowledge bases, informal language or specific domains are small or non-existent. Almost no open spoken or multimodal data are available.
Domain-specific parallel and multilingual data that would allow training and fine-tuning MT engines are insufficient, while the current open-access monolingual text corpora are too small for training massive language models like GPT-3. Consequently, there is a lack of large pre-trained language models (both general and domain-specific) and a lack of benchmarks for specific NLP tasks, e.g., a Latvian GLUE or SQuAD.
The creation of such models is limited not only by the availability of the necessary data but also by insufficient hardware infrastructure, which could be solved through significant long-term support for research infrastructures.
Another important aspect is IPR and GDPR regulations that need to be more flexible, allowing wider use of IPR protected data for the development of language resources and technologies in a way that does not harm the interests of the authors.
Overall, as for many other languages of Europe, there are insufficient amounts of quality corpora, including monolingual corpora, currently available for Latvian, as well as insufficient computational resources, for training large-scale SOTA language models like GPT-3. However, there are resources and competence available for pre-training relatively smaller language models like BERT and GPT-2, and for fine-tuning large pre-trained multilingual models like mT5 and XLS-R for various downstream tasks.
Limited availability of human resources also leads to gaps and limitations in language technology development. Although the Latvian LT industry and research groups have demonstrated excellent results in LT adaptation for morphologically rich languages (which is not a trivial task), they are less present among the leaders in the development of world-class novel language technology solutions.
Finally, gaps and fragmentation in research and development activities related to LT are a result of short, project-based research and development funding (mostly 2-3 years, sometimes even 1 year, rarely 5 years) and of a disproportion between funding for research (TRL 1-4) and for industrial activities.
With respect to policies and instruments, strong national and international support is necessary for further Latvian language research and development activities, including dedicated long-term LT programs that provide equal support for both research and industrial activities. Close synchronisation between national and international activities is necessary, especially, with respect to research infrastructures and research priorities.