Minimalistic Approach to Coreference Resolution in Lithuanian Medical Records

Coreference resolution is a challenging part of natural language processing (NLP) with applications in machine translation, semantic search and other information retrieval, and decision support systems. Automatically identifying and resolving coreferring expressions requires linguistic preprocessing and rich language resources. Many rarer and under-resourced languages (such as Lithuanian) lack the required language resources and tools. We present a method for coreference resolution in the Lithuanian language and its application to processing e-health records from a hospital reception. Our novelty is the ability to process coreferences with minimal linguistic resources, which is important in linguistic applications for rare and endangered languages. The experimental results show that coreference resolution is applicable to the development of NLP-powered online healthcare services in Lithuania.


Introduction
Digital tools of medical informatics, especially those applying natural language processing (NLP), are indispensable for e-health and for the digitalization of medical records and processes [1]. The use of NLP has proved to be a lower-cost alternative to traditional medical methods in many cases, such as forecasting stress symptoms and suicide risk from free-text responses sent via a mobile phone [2], detecting seasonal disease outbreaks by monitoring search engine queries [3], and discovering healthcare knowledge from social media [4,5].
With the development of Semantic Web technology, web information retrieval (IR) is changing towards meaning-based IR. The quality of retrieved documents relevant to the user also highly depends on the information extraction (IE) methods applied. In general, IE focuses on automatic extraction of structured information from an unstructured source. Standard document text preprocessing steps used in IE are lexical analysis, morphological analysis, and named entity recognition (NER), which can be complemented by coreference resolution and semantic annotation. The main issue here is the ambiguity and complexity of natural language, which makes progress in IE dependent on the evolution of NLP techniques. While for widely used languages (such as English) IE-related NLP research has already reached maturity and practical application on a massive scale (e.g., the IBM Watson project) [6], resource-poor languages, such as Lithuanian [7], remain an open NLP research field. The baseline application is often steered towards automated concept extraction [8,9], often in combination with text mining [10,11].
NER, when applied to biomedical texts, is a critical step for developing decision support tools for smart healthcare. Examples are as follows: drug name recognition (DNR), which recognizes pharmacological substances in biomedical texts and classifies them for discovering drug-drug interactions [12,13]; biomedical named entity recognition (BNER), which extracts biomedical concepts of interest such as genes and proteins [14]; and medical entity recognition, which extracts information from unstructured electronic health records (EHRs) [15][16][17]. Such studies include mapping clinical descriptions to Systematized Nomenclature of Medicine codes [18] or other medical lexicons. Unstructured texts in the medical domain contain valuable medical information, but they also contain many errors, such as spelling errors, improper grammatical use, and semantic ambiguities, which hinder data processing and analysis [19]. Structuring medical domain knowledge using biomedical ontologies and controlled vocabularies provides support for data standardization and interoperability, healthcare administration, and clinical decision support [20]. Rich concepts linked by semantic relationships, such as in the Unified Medical Language System (UMLS), contribute to healthcare data integration, pattern mining from EHRs, medical entity recognition in clinical text, and clinical data sharing [21]. The development of online healthcare services as powerful platforms gives users an opportunity to address health concerns, for example by improving patient-centered care and supporting self-management. Users consider online healthcare services a vital source of health information but still need more powerful semantic search engines to arrive at informed decisions on their own health and to participate more actively in healthcare processes [22]. However, these services depend greatly upon the support provided by natural language processing.
Our previous work included the development of a semantic search framework [23] for answering questions presented in structured Lithuanian language, which is based on the Semantics of Business Vocabulary and Business Rules (SBVR) language. Our results in [23] showed that there is a strong need to complement the NLP pipeline of semantic search with coreference resolution. Such coreference resolution tools have not yet been developed for the Lithuanian language, and in particular not for digital medical applications, an important sector in which we apply our algorithm to process digital transcripts from a hospital reception. Therefore, the creation of such tools is a prerequisite for further improvement of NLP-supported health-oriented decision making in the Lithuanian language, while the experience gained could be extended to developing semantic search tools for other under-resourced languages as well. The rest of the paper is structured as follows. In Section 2, we analyse the related works in the coreference resolution field, including the state of the art and the methods proposed for languages that are grammatically similar to Lithuanian. Sections 3 and 4 present the coreference resolution algorithm and its experimental evaluation. Section 5 presents conclusions and discusses future work.

Related Works
Machine-learning and rule-based approaches are efficient methods in semantic processing, especially when enhanced with external knowledge and coreference clues derived from the structured document. They often still perform better in coreference resolution than classic implementations when provided with ground truth mentions [24], and they have been further expanded with scaffolding approaches [25]. Unsupervised methods can be applied to large-scale scenarios [26,27]. Alternatively, a hybrid strategy may be used based on a set of statistical measures and syntactical and semantic information [28]. "Off the shelf" IR algorithms can be utilized quite successfully in some scenarios, especially those with limited focus areas (in a medical sense) [29]. Accuracy can be further improved by analysing trigram frequencies [30] and applying graph-style algorithms [31] in context-sensitive corpus fragments.
In general, coreference resolution methods can be classified into knowledge-rich and knowledge-poor methods. Both kinds require substantial resources, such as semantic information, syntactic annotations, or preannotated corpora (in our case, transcripts from a hospital reception). Under-resourced, rarer languages, like Lithuanian, usually do not have such resources available.
To date, no research has addressed coreference resolution in Lithuanian, but many solutions have been proposed for other languages, mostly for English (Table 1). Note that the evaluation results are not directly comparable, as the authors used different corpora.
Considering languages that are grammatically similar to Lithuanian (which is one of the Baltic languages), we summarize the related work on Latvian (the only other Baltic language) and on Slavic languages such as Polish, Russian, and Czech in Table 2.
(i) For Latvian, the only solution is LVCoref [45]. It is a rule-based system that uses an entity-centric model. It focuses on named entity matches (exact matches, acronyms) and uses Hobbs' algorithm for pronouns.
(ii) For Polish, the rule-based Ruler [46] scores candidates using gender/number agreement and inclusion (removal of nested groups) rules, lemma and Wordnet rules for nominal expressions, and a pronoun rule specifically targeting pronouns. BARTEK [47] is an adaptation to Polish of BART, which was designed for English. A mixed Polish coreference resolution approach combines a neural network architecture with the sieve-based approach [48].
(iii) For Russian, RU-EVAL-2014 [49] was an evaluation campaign of anaphora and coreference resolution tools that employed a wide variety of approaches. The evaluation was performed on the Russian Coreference Corpus (RuCur). Machine learning approaches [50] were also used.
(iv) For Czech, coreferences are annotated in the tectogrammatical layer of the Prague Dependency Treebank (PDT), and the first coreference resolution approach was rule based [51]. First, all possible candidates are collected; their list is then narrowed down using 8 filters, and from the remaining candidates the one closest to the coreferring object is selected as the antecedent. Nguy et al. [52] adapted two older English-language approaches to Czech: the classifier-based approach used the C5 decision tree, while the ranker-based approach employed the averaged perceptron algorithm. Both approaches were trained and evaluated on PDT data, with the ranker-based approach providing better results. Treex CR [53] was developed for the Czech language and adapted to English, Russian, and German, although for Russian and German, English coreference labels were projected, which produced notably lower results [54].
In summary, rule-based solutions have the advantage of easier adaptability and provide comparable results when good training data are not available, as is the case for Lithuanian. Many of the more advanced solutions cannot be fully adapted for rarer and under-resourced languages due to the lack of available linguistic resources. For example, BART at the time supported 64 feature extractors, but due to the lack of language-specific resources for the Polish language, only 13 could be utilized. Solutions that are not heavy on linguistic resources can therefore be very useful for resource-poor languages in general.

A Rule-Based Coreference Resolution: A Lithuanian Case
3.1. Definition and Framework. A coreference (or anaphora) is an expression, the interpretation of which depends on another word or phrase presented earlier in the text (the antecedent). For example, "Tom has a backache. He was injured." Here the words "Tom" and "He" refer to the same entity. Without resolving the relationship between these two structures, it would not be possible to determine why Tom has the backache, nor who was injured. In such cases, semantic information would be lost. Anaphoric objects are expressed with pronouns and cannot be interpreted independently, without going back to their antecedents. In this work, such expressions are called coreferences, unless it is required to make a distinction. Usage of such expressions can vary depending on the type and the style of the text. Here we focus on texts from medical-related domains. The role of coreference resolution in the semantic search framework is to provide additional semantic information after named entity recognition and before semantic annotation (Figure 1).

Conceptual Model of Coreference Resolution.
In this section, a conceptualization of coreference resolution is presented. The model, which is expressed as a UML class diagram (Figure 2), specifies the concepts playing a certain role in the extraction of coreferences of a certain type. The model gives us an understanding of the following:
(i) What features of text, sentence, and word help us recognize the existence of coreference (they are specified in the package Concepts of Input Flow)
(ii) What kind of text preprocessing is required
(iii) What additional resources are required for resolution of certain types of coreferences (they are specified in the package Database of Public Persons and Classification of Professions)
For example, from the model provided, it is clear that, before coreference resolution starts, it is important to preprocess the text and obtain the following:
(i) A text segmented into sentences and lexemes
(ii) Morphological features of lexemes identified
(iii) Named entities recognized
Text preprocessing itself is not a task of coreference resolution, so it is out of the scope of this paper.

Table 1: Coreference resolution approaches proposed for other languages.
System                 Approach                                 Precision   Recall      F1
… [34]                 Modified centering theory                0.72-0.81   na          na
Mitkov [35]            POS tagger, antecedent indicators        0.897       na          na
RAP [36]               Salience factors                         0.85-0.89   na          na
Xrenner [37]           Syntactic and semantic rules             0.51-0.55   0.49-0.57   0.49-0.56
Probabilistic [38]     Bayesian rule                            0.82-0.84   na          na
MARS [39]              Genetic algorithms                       0.53-0.84   na          na
Soon et al. [40]       Machine learning (decision tree C5)      0.65-0.69   0.53-0.56   0.62
ILP [41]               Machine learning (logistic classifier)   0.78-0.89   0.47-0.58   0.61-0.68
Wiseman et al. [42]    Deep learning                            0.77        0.70        0.73
Lee et al. [43]        Deep learning                            0.81        0.73        0.77
Zitnik et al. [44]     Conditional random fields                0.68-0.94   0.30-0.87   0.41-0.87
It is worth mentioning that the model is quite abstract, language independent, and technology independent. Therefore, it is applicable not only to Lithuanian but to grammatically similar languages as well. Concepts of this model are used for the formalization of coreference resolution rules in the next section. The concepts are explained in more detail below. The main concepts of coreference resolution are Text, Lexical_Unit, and Named_Entity. The concept Text assumes a textual document whose content should be analysed. Each text has an associated publication date, which is important for solving coreferences. Each text consists of at least one Lexical_Unit, which includes paragraphs, sentences, words, and punctuation marks, classified into the Sentence and Lexeme categories. Lexeme assumes lexical units such as words, punctuation marks, and numbers. Each lexeme is characterized by a lemma and a part of speech, and some of them (nouns and pronouns) by grammatical gender and number. The lexeme can be specialized by POS category: Noun, Pronoun, and Other_Part_Of_Speech. Special cases of Other_Part_Of_Speech are Comma and Conjunction, which are required for the description of conditions of some coreference resolution rules.
The Named_Entity concept defines an object to which pronouns or certain nouns can refer. NER algorithms usually recognize three types of entities: a person (Person_NE), an organization (Organization_NE), and a location (Location_NE). Named entities of the person type require special attention because a person can be mentioned not only using pronouns but also using a position he/she holds (Position_Held) and a professional name (Profession). Additional information about a person could help to resolve such coreferences more precisely. As an example, a source of such information could be a Database of Public Persons, which includes Known_Person, a well-known person mentioned as Person_NE in the text. The output of a coreference resolution algorithm is a Coreference, a relationship between coreferents. For each coreference, its type (nominal and pronominal), subtype (relative pronoun and noun repetition), position (points backward, forward, or irrelevant in case of repetitions), and group (is singular, refers to the coreference group, or is ambiguous) are specified. Each referent refers to at least one coreferent (the concept Mention). Each Mention starts at a certain position in the text, is of a certain length, and fits at least one Lexeme. Some of them can fit a certain Named_Entity.
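The concepts above can be sketched as plain data classes. This is a minimal illustration rather than the authors' implementation: the class and field names follow the concept names in the text, but the exact attribute set, the string labels (such as "General"), and the helper `mention_of` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lexeme:
    text: str
    lemma: str
    pos: str                       # e.g. "Noun", "Pronoun", "Comma", "Conjunction"
    start: int                     # character offset in the text
    gender: Optional[str] = None   # only nouns and pronouns carry these
    number: Optional[str] = None


@dataclass(frozen=True)
class Mention:
    start: int
    length: int
    lexemes: Tuple[Lexeme, ...]    # a mention fits at least one lexeme


@dataclass(frozen=True)
class Coreference:
    type: str                      # "Pronominal" or "Nominal"
    subtype: str                   # assumed label set, e.g. "Relative", "General"
    position: str                  # "Backward", "Forward", or "Irrelevant"
    group: str                     # e.g. "Single"
    referent: Mention
    refers_to: Tuple[Mention, ...]


def mention_of(lexeme: Lexeme) -> Mention:
    """Hypothetical helper: wrap a single lexeme as a mention."""
    return Mention(lexeme.start, len(lexeme.text), (lexeme,))


# The example from Section 3.1: "He" points backward to "Tom".
text = "Tom has a backache. He was injured."
tom = Lexeme("Tom", "Tom", "Noun", text.index("Tom"), "Masculine", "Singular")
he = Lexeme("He", "he", "Pronoun", text.index("He"), "Masculine", "Singular")
link = Coreference("Pronominal", "General", "Backward", "Single",
                   mention_of(he), (mention_of(tom),))
```

Storing start positions and lengths rather than only strings mirrors the model's requirement that each Mention be located in the text, which lets the annotation be stored separately from the text itself.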

Coreference Resolution Algorithms.
The decision table with guidelines for the application of a particular resolution algorithm is shown in Figure 3. The conditions are checked consecutively on every lexeme in the text, and, if a condition is satisfied, the corresponding algorithm is activated. For example, if condition C2 is met, then algorithm A1 is activated immediately.
For resolution of specific types of references, we propose the following algorithms:
(i) A1: specific rules resolution algorithm for resolution of certain usages of pronouns
(ii) A2: general pronoun resolution algorithm, which focuses on the cases where pronouns refer to nouns (or noun phrases) that are recognized as named entities of the "person" class
(iii) A3: PRA (partial, repetition, and acronym) resolution algorithm for resolution of nouns recognized as named entities and their repeated usage in the same text
(iv) A4: HHS (hypernym, hyponym, synonymous) resolution algorithm for resolution of nouns recognized as profession names, including their synonyms and hypernyms/hyponyms
(v) A5: feature resolution algorithm for resolution of nouns that represent a certain feature (at the moment only a public position being held) of the named entity of a person
Coreference resolution starts from the sequential analysis of each lexeme, looking for a certain type of pronoun or noun. Depending on the identified features of the lexeme, a decision about further analysis is taken. In the decision table (Figure 3), the upper right quadrant shows the possible alternatives for the conditions of the corresponding row. In the upper right quadrant, the answer "na" stands for "not relevant." In the lower right quadrant, "✓" means that the algorithm should be applied and "✗" means that it should not be applied. The idea is that the pronoun-related coreferences should be solved first by sequentially checking the conditions C1, C2, and C3. Then noun-related coreference resolution should start by sequentially checking the conditions C4, C6, and C7.
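The sequential checking of conditions can be sketched as an ordered rule list in which the first satisfied condition selects the algorithm. The actual conditions C1 to C7 are defined in Figure 3 and are not reproduced in the text, so the predicates below are hypothetical stand-ins; only the general scheme (pronoun checks first, then noun checks) is taken from the description above.

```python
from typing import Callable, List, Optional, Tuple

Lexeme = dict  # assumed minimal shape, e.g. {"pos": "Pronoun", "pronoun_type": "Relative"}

# Hypothetical stand-ins for the conditions of Figure 3, in checking order.
RULES: List[Tuple[Callable[[Lexeme], bool], str]] = [
    # Pronoun-related checks come first.
    (lambda lx: lx.get("pos") == "Pronoun"
        and lx.get("pronoun_type") == "Relative", "A1"),
    (lambda lx: lx.get("pos") == "Pronoun"
        and lx.get("pronoun_type") != "Demonstrative", "A2"),
    # Noun-related checks follow.
    (lambda lx: lx.get("pos") == "Noun" and lx.get("ne") is not None, "A3"),
    (lambda lx: lx.get("pos") == "Noun" and lx.get("is_profession", False), "A4"),
    (lambda lx: lx.get("pos") == "Noun" and lx.get("is_position", False), "A5"),
]


def select_algorithm(lexeme: Lexeme) -> Optional[str]:
    """Return the name of the first algorithm whose condition holds, if any."""
    for condition, algorithm in RULES:
        if condition(lexeme):
            return algorithm
    return None
```

Keeping the conditions in a list makes the checking order explicit and lets a rule be added or reordered without touching the dispatch loop.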

Formal Description of Coreference Resolution Algorithms.
First-order logic (FOL) formulas are employed to define the main conditions the algorithms should check when resolving coreferences. The concepts of the coreference resolution model (Figure 2) became the predicates or constants in the FOL formulas: the classes became unary predicates with the same name as the class; the associations between classes became binary predicates with the same name as the association; the attributes of classes became binary predicates with the same name as the attribute, prefixed with the verb "has"; and the literals of enumerations became constants. The algorithms follow the grammar rules of the Lithuanian language, which are based on the analysis of morphological features of lexemes and their order in the sentence and text. Examples of Lithuanian language sentences were translated into English as closely as possible. All proper names were changed to generic abbreviations to comply with GDPR.

A1: Specific Rules Resolution.
In some cases, there exists a rather rigid structure for pronoun usage, and it can be easily defined by using specific rules, for example, when a relative pronoun directly follows a comma or follows a preposition that comes after the comma. In both cases, the pronoun "kuriuo" refers to the noun "vyras." In the first example, we do not have the optional preposition "su," while we have it in the second one.
A condition for the existence of such a reference is formally defined as follows: For every sentence s of text t and for every "Relative" type pronoun p, which is contained in the sentence s and has a start position sp1, is of length ln1, follows comma c or follows prepositional lexeme l1, which follows comma c, and for every noun l2, which has a start position sp2, is of length ln2, precedes comma c, is of the same gender g and of the same number n as the pronoun p, the only one coreference relation r, which is resolved in text t, is of "Pronominal" type, "Relative" subtype, "Backward" position and "Single" group between the pronoun p and the noun l2, its referent starts at position sp1 and has length ln1, and which fits only one lexeme p and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one lexeme l2, exists (Rule 1).
The relative pronoun might be plural and refer to multiple singular (or multiple plural) nouns. For such a case, a special condition must be defined: For every sentence s in text t and for every "Relative" type pronoun p of "Plural" number, which is contained in the sentence s and has a start position sp1, is of length ln1, follows comma c1 or follows prepositional lexeme l, which follows comma c1, and for every noun n1, which precedes comma c1, has a start position sp2, is of length ln2, follows conjunction j, and for every noun n2, which precedes conjunction j, has a start position sp3, is of length ln3, and for every existing noun n3, which follows comma c2, and for every existing noun n4, which precedes comma c2, has a start position sp4, is of length ln4, the only one coreference relation r, which is resolved in text t, is of "Pronominal" type, "Relative" subtype, "Backward" position and "Multiple" group, its referent starts at position sp1 and has length ln1, fits only one lexeme p, refers to only one mention m1, which starts at position sp2, has length ln2, and fits noun n1, refers to only one mention m2, which starts at position sp3, has length ln3, and fits only one noun n2, and refers at least to one mention m3, which starts at position sp4, has length ln4, and fits noun n4, exists (Rule 2).
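The singular case (Rule 1) can be sketched over a list of morphologically annotated tokens. This is a simplified illustration, not the authors' implementation: the `Tok` class and its labels are assumptions, the candidate noun is taken to be the token immediately before the comma, and the plural case of Rule 2 is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tok:
    text: str
    pos: str                            # "Noun", "Pronoun", "Preposition", "Comma", ...
    gender: Optional[str] = None
    number: Optional[str] = None
    pronoun_type: Optional[str] = None  # e.g. "Relative"


def resolve_relative(tokens: List[Tok]) -> List[Tuple[int, int]]:
    """Return (pronoun_index, noun_index) pairs for relative pronouns."""
    links = []
    for i, tok in enumerate(tokens):
        if tok.pos != "Pronoun" or tok.pronoun_type != "Relative":
            continue
        j = i - 1
        # An optional preposition may stand between the comma and the pronoun.
        if j >= 0 and tokens[j].pos == "Preposition":
            j -= 1
        # The pronoun (or preposition) must follow a comma with a token before it.
        if j < 1 or tokens[j].pos != "Comma":
            continue
        noun = tokens[j - 1]
        # The noun must agree with the pronoun in gender and number.
        if (noun.pos == "Noun" and noun.gender == tok.gender
                and noun.number == tok.number):
            links.append((i, j - 1))
    return links


# Illustrative token sequence (not a sentence from the paper): "vyras , su kuriuo"
tokens = [
    Tok("vyras", "Noun", "Masculine", "Singular"),
    Tok(",", "Comma"),
    Tok("su", "Preposition"),
    Tok("kuriuo", "Pronoun", "Masculine", "Singular", "Relative"),
]
```

Calling `resolve_relative(tokens)` links the pronoun at index 3 to the noun at index 0; removing the preposition token gives the variant without "su."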

A2: General Purpose Pronoun Resolution.
This algorithm focuses on the cases where pronouns refer to nouns (or noun phrases) that are recognized as named entities of the "person" class by NER. The algorithm starts from the identification of a non-demonstrative pronoun. In the given example, such a pronoun is in the second sentence: "Jis" ("He"). If the pronoun is in the relative clause, the algorithm moves backwards, analysing the words that precede the pronoun. In the given example, the pronoun is at the beginning of the sentence, so the remaining parts of the sentence are not analysed, and the algorithm moves one sentence backwards. The conditions for the existence of such a reference could formally be defined as three alternatives. The first one describes the conditions for a reference existing in the same sentence s1 before pronoun p: For each text's t sentence s1 and pronoun p not of Demonstrative type that is contained in sentence s1 and has gender g, number n, start position sp1 and length of ln1, and named entity e1 that is in the same sentence s1, is expressed by lexeme l, and has gender g, number n, start position sp2 and is of length ln2, and is before pronoun p (sp2 is lower than sp1), but closer to pronoun p than possible named entities e2 and e3 (sp2 higher than sp3 and sp4), the only one coreference relation r, which is resolved in text t, is of "Pronominal" type, "General" subtype, "Backward" position and "Single" group between the pronoun p and the named entity e1, its referent starts at position sp1 and has length ln1, and which fits only one pronoun p and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one named entity e1, exists (Rule 3).
Rule 5: ∀t, s1, s2, s3, p, l, e1, g, n, sp1, ln1, sp2, … [Coreference(r) ∧ resolved_in(r, t) ∧ has_type(r, Pronominal) ∧ has_subtype(r, General) ∧ has_position(r, Backward) ∧ has_group(r, Single) ∧ has_start_position(r, sp1) ∧ has_length(r, ln1) ∧ fits(r, p) ∧ Mention(m) ∧ refers_to(r, m) ∧ has_start_position(m, sp2) ∧ has_length(m, ln2) ∧ fits(m, e1) ∧ fits(m, l)]]
Another example presents a case in which the coreferent of the pronoun "man" (in English, "for me") is in the following sentence. If the algorithm does not find any named entities moving backwards, it moves back to the pronoun and proceeds forward. The algorithm continues moving forward until it locates the "J. Jonaitis" entity, which is recognized as a person. Since the gender of the pronoun "man" is ambiguous (it can refer to both female and male persons), only their grammatical numbers are compared. Both are singular; therefore, the algorithm picks "J. Jonaitis" as a postcedent of the coreferring object "man." Conditions for the existence of such a reference could formally be defined as two alternatives.
The first one describes the conditions for a reference existing in the same sentence s1 after the pronoun was mentioned: For each text's t sentence s1 and pronoun p not of Demonstrative type that is contained in sentence s1 and has gender g, number n, start position sp1 and length of ln1, and named entity e1 that is in the same sentence s1, is expressed by lexeme l, and has gender g, number n, start position sp2 and is of length ln2, and is after pronoun p (sp2 is higher than sp1), but closer to pronoun p than possible named entities e2 and e3 (sp2 lower than sp3 and sp4), the only one coreference relation r, which is resolved in text t, is of "Pronominal" type, "General" subtype, "Forward" position and "Single" group between the pronoun p and the named entity e1, its referent starts at position sp1 and has length ln1, and which fits only one pronoun p and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one named entity e1, exists (Rule 6).
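The backward-then-forward search described for A2 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Span` class is hypothetical, sentence boundaries are ignored (only character positions are compared), the search window is unbounded, and an ambiguous gender is represented as `None` so that only number is compared in that case.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    text: str
    start: int
    number: str
    gender: Optional[str] = None  # None means the gender is ambiguous


def agrees(pronoun: Span, entity: Span) -> bool:
    """Number must match; gender is compared only when both are known."""
    if pronoun.number != entity.number:
        return False
    return (pronoun.gender is None or entity.gender is None
            or pronoun.gender == entity.gender)


def resolve_pronoun(pronoun: Span,
                    persons: List[Span]) -> Optional[Tuple[Span, str]]:
    """Pick the nearest agreeing person entity, looking backward first."""
    before = [e for e in persons
              if e.start < pronoun.start and agrees(pronoun, e)]
    if before:
        return max(before, key=lambda e: e.start), "Backward"
    after = [e for e in persons
             if e.start > pronoun.start and agrees(pronoun, e)]
    if after:
        return min(after, key=lambda e: e.start), "Forward"
    return None
```

For instance, a pronoun `Span("man", 0, "Singular")` with the single candidate `Span("J. Jonaitis", 40, "Singular", "Masculine")` resolves forward, because no candidate precedes the pronoun and only number needs to agree.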

A3: PRA Resolution.
This algorithm is based on exact (or partial) string matches and several rules for acronyms. Once the first named entity that can be matched with an initial named entity is found, the algorithm stops, which keeps annotations simple: B ⟶ A, C ⟶ B, and D ⟶ C.
This allows the formation of coreference chains linking all mentions of the same entity in a text, which can later be reused for semantic analysis, for example, "In the afternoon, Tomaitis [named entity] has been taken to a surgery room."
In this example, two mentions of the same entity are made: "Tomaitis" and "Tomaitį." They are of different cases, but their lemmas are identical. A condition for the existence of such a reference is formally defined as follows: For each text's t sentence s1 that includes named entity e1, that has start position sp1 and is of length ln1, which is expressed by lexeme l1 that has lemma l and for each same text's t sentence s2 that includes named entity e2, that has a start position sp2 and is of length ln2, which is expressed by lexeme l2 that has lemma l, the only one coreference relation r, which is resolved in text t, is of "Nominal" type, "Repetition" subtype, "Irrelevant" position and "Single" group between the noun n1 and the noun n2, its referent starts at position sp1 and has length ln1, and which fits only one noun n1 and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one noun n2, exists (Rule 8).
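The chaining of repeated mentions (B ⟶ A, C ⟶ B) can be sketched by remembering the most recent mention of each lemma. This is a simplified illustration that covers only exact lemma repetition; partial matches and acronym rules are omitted, and the input format (start position paired with lemma) is an assumption.

```python
from typing import Dict, List, Tuple


def chain_repetitions(entities: List[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Link each named entity to the closest earlier entity with the same lemma.

    `entities` holds (start_position, lemma) pairs in text order; the result
    holds (later_start, earlier_start) pairs.
    """
    last_seen: Dict[str, int] = {}
    links: List[Tuple[int, int]] = []
    for start, lemma in entities:
        key = lemma.casefold()
        if key in last_seen:
            # Stop at the first match so that each mention gets a single link.
            links.append((start, last_seen[key]))
        last_seen[key] = start
    return links
```

With three mentions of the lemma "Tomaitis" at positions 0, 40, and 90, the function yields the links (40, 0) and (90, 40), so the whole chain can be recovered by following links without storing every pair.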

A4: HHS Resolution.
This algorithm is based on profession classification. It attempts to resolve the use of synonyms and hypernyms/hyponyms. The algorithm determines that "Doctor" in the professions classification is a hypernym of "Surgeon"; they also agree in gender and number; therefore, the algorithm adds their pair to the annotations. Conditions for the existence of such a reference are formally defined as follows:

(i) [LT] Gydytojai
For each text's t sentence s1 that has profession p1, which is either broader or narrower than profession p2, name v1 expressing noun n1, which has gender g, number m, start position sp1 and is of length ln1, and for each same text's t sentence s2 that has profession p2, which is either broader or narrower than profession p1, name v2 expressing noun n2, which has gender g, number m, start position sp2 and is of length ln2, the only one coreference relation r, which is resolved in text t, is of "Nominal" type, "Hypernym_hyponym" subtype, "Irrelevant" position and "Single" group between the noun n1 and the noun n2, its referent starts at position sp1 and has length ln1, and which fits only one noun n1 and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one noun n2, exists (Rule 9).
Rule 9: ∀t, s1, s2, n1, n2, sp1, ln1, sp2, ln2, v1, v2, p1, p2.[Text(t) ∧ Sentence(s1) ∧ Sentence(s2) ∧ consists_of(t, s1) ∧ consists_of(t, s2) ∧ Noun(n1) ∧ contains(s1, n1) ∧ has_start_position(n1, sp1) ∧ has_length(n1, ln1) ∧ Noun(n2) ∧ contains(s2, n2) ∧ has_start_position(n2, …
An example of a synonym is given as follows: "head surgeon" and "chief surgeon" are synonymous; therefore, the condition for the existence of such a reference could formally be defined as follows: For each text's t sentence s1 that has a profession's p name v1, which is expressed by noun n1 that has gender g, number m, start position sp1 and is of length ln1, and for each same text's t sentence s2 that has same profession's p name v2 expressed by noun n2 that has gender g, number m, start position sp2 and is of length ln2, the only one coreference relation r, which is resolved in text t, is of "Nominal" type, "Synonym" subtype, "Irrelevant" position and "Single" group between the noun n1 and the noun n2, its referent starts at position sp1 and has length ln1, and which fits only one noun n1 and refers to only one mention m, which starts at position sp2, has length ln2, and fits only one noun n2, exists (Rule 10).
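Rules 9 and 10 can be sketched with a toy profession classification. The dictionaries below are hypothetical stand-ins for the Classification of Professions resource (the real resource and its structure are not described in the text), English labels are used instead of Lithuanian lemmas, and only a direct broader/narrower relation is checked.

```python
from typing import Dict, Optional

# Hypothetical classification: narrower profession -> broader profession.
BROADER: Dict[str, str] = {"surgeon": "doctor", "cardiologist": "doctor"}
# Hypothetical synonym table: variant name -> canonical profession name.
SYNONYM_OF: Dict[str, str] = {"head surgeon": "chief surgeon"}


def canonical(name: str) -> str:
    """Map a profession name to its canonical form."""
    name = name.casefold()
    return SYNONYM_OF.get(name, name)


def relation(a: str, b: str) -> Optional[str]:
    """Return the coreference subtype linking two profession names, if any."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        # Identical strings are plain repetitions, which are left to A3.
        return "Synonym" if a.casefold() != b.casefold() else None
    if BROADER.get(ca) == cb or BROADER.get(cb) == ca:
        return "Hypernym_hyponym"
    return None


def hhs_subtype(n1: dict, n2: dict) -> Optional[str]:
    """Apply the gender and number agreement check before the lookup."""
    if (n1["gender"], n1["number"]) != (n2["gender"], n2["number"]):
        return None
    return relation(n1["lemma"], n2["lemma"])
```

Two sibling professions such as "surgeon" and "cardiologist" are deliberately not linked, since neither is broader than the other.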

A5: Feature Resolution.
This algorithm currently attempts to resolve only those cases in which a person is referred to by the public post (feature) that he or she holds; other types of features are not currently resolved. In the example, the noun "cardiologist" is selected, and the algorithm moves backwards until it reaches "S. Suskelis" and checks in the knowledge base whether he held the position of cardiologist at the time of the publication of the medical record. Since he did, the algorithm checks whether "S. Suskelis" and "cardiologist" agree in gender and number. They agree, and their pair is added to the annotation as a feature reference. A condition for the existence of such a reference is formally defined as follows: For each text's t sentence s1 that has known person k, who during publication date d had certain position h (publication date d is same or later than position h start date fd and same or earlier than position h end date td), mentioned as named entity e, that has a start position sp1 and is of length ln1, and for each same text's t sentence s2 mentioned noun n, that has a start position sp2 and is of length ln2, which is mentioned after named entity e (noun n has a higher start position sp2 than named entity's sp1), whose lemma l matches with position's h lemma l, number is Singular and gender g matches known person's k gender g, the only one coreference relation r, which is resolved in text t, is of "Nominal" type, "Feature" subtype, "Backward" position and "Single" group between the noun n and the named entity e, its referent starts at position sp2 and has length ln2, and which fits only one noun n and refers to only one mention m, which starts at position sp1, has length ln1, and fits only one named entity e, exists (Rule 11).
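The date check at the core of Rule 11 can be sketched as follows. This is an illustration under stated assumptions: the classes stand in for the Database of Public Persons, an open-ended position is represented by `end=None`, the dates used in any example are hypothetical, and the positional requirement (the noun follows the named entity) is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PositionHeld:
    lemma: str
    start: date
    end: Optional[date] = None  # None: still held (an assumption of this sketch)


@dataclass
class KnownPerson:
    name: str
    gender: str
    positions: List[PositionHeld]


def holds_position(person: KnownPerson, noun_lemma: str, noun_gender: str,
                   noun_number: str, published: date) -> bool:
    """Check whether the noun can refer to the person as a held position."""
    # The noun must be singular and agree in gender with the known person.
    if noun_number != "Singular" or noun_gender != person.gender:
        return False
    # The publication date must fall within the period the position was held.
    return any(
        p.lemma == noun_lemma
        and p.start <= published
        and (p.end is None or published <= p.end)
        for p in person.positions
    )
```

Comparing against the publication date is what makes the lookup time-sensitive: the same noun would not be linked for a text published after the position's end date.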
(Figure 1) because it requires lexical, morphological, and NE annotations of the text to be analysed. Solutions for other languages need not follow the same NLP pipeline architecture, but the coreference resolution component must be supplied with lexical, morphological, and NE information about the text.
Coreference resolution for Lithuanian was implemented using the Java programming language and the JSON data format for annotation storage. However, the proposed approach is not technology dependent, and for other languages, it can be implemented on any other platform. The evaluation was performed by analysing 100 articles that have been preannotated and are available in our Lithuanian Language Coreference Corpus [55], in addition to the transcribed records of a medical reception, which we cannot disclose due to privacy requirements.
For evaluation, we used precision, recall, and F1 metrics. Recall R is the ratio of correctly resolved anaphoric expressions C to the total number of anaphoric expressions T. Precision P is the ratio of correctly resolved anaphoric expressions C to the number of resolved anaphoric expressions F. F1 is the harmonic mean of P and R:

$$R = \frac{C}{T}, \qquad P = \frac{C}{F}, \qquad F_1 = \frac{2PR}{P + R} \qquad (1)$$

Five experiments were performed with different combinations of the coreferencing algorithms presented in Section 3. The results of the experiments are presented in Table 3. Note the following threats to validity of our results:
(i) The database of public persons must be constantly updated as new information becomes available. Otherwise, recall will get noticeably lower when annotating newer texts.
(ii) Plural pronouns and nouns are difficult to identify because of the many possible variations, which often ignore grammatical compatibility rules.
(iii) Linking the named entity to the position held by taking into account the publication date of the text is limited, considering that the text might be published today but written about things that happened in the past. There are no tools that can identify the timeframe of a certain part of the text.
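The metric definitions used in the evaluation can be written out directly from the counts C, F, and T. The function below is a minimal sketch; its name and the zero-division convention (returning 0.0) are assumptions, and any counts passed to it in an example are hypothetical rather than results from the experiments.

```python
from typing import Tuple


def precision_recall_f1(correct: int, resolved: int,
                        total: int) -> Tuple[float, float, float]:
    """Compute P = C / F, R = C / T, and F1 = 2PR / (P + R).

    correct:  correctly resolved anaphoric expressions (C)
    resolved: all resolved anaphoric expressions (F)
    total:    all anaphoric expressions in the gold annotation (T)
    """
    precision = correct / resolved if resolved else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With the hypothetical counts C = 80, F = 100, and T = 160, precision is 0.8 and recall is 0.5, which illustrates how a system can be precise while still missing many expressions.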

Conclusion
Medical entity recognition and coreferencing are difficult tasks in Lithuanian natural language processing (NLP). We proposed a coreference resolution approach for the Lithuanian language. The coreference resolution algorithm depends on morphological and named entity recognition (NER) annotations and preexisting databases. Because the proposed approach is detached from a specific implementation and its rules are formalized, it would not be difficult to adapt it for grammatically similar languages. Our novelty is the ability to process coreferences with minimal linguistic resources, which is very important in linguistic applications for under-resourced and endangered languages. The proposed method provides encouraging results when analysing transcribed medical records and other corpora, and these results are comparable to those achieved by other authors applying different resolution approaches to other languages. However, the method has certain limitations: it is domain specific, it is able to resolve only a subset of coreference types, and a relatively small dataset was used for the experiments. Nevertheless, we hope that our method can contribute to the sustainable development of NLP-powered online healthcare services in Lithuania.
Data Availability
The dataset used in this research is available upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.