Towards Automatic Distinction between Specialized and Non-Specialized Occurrences of Verbs in Medical Corpora

The medical field gathers people of different social statuses, such as students, pharmacists, managers, biologists, nurses and, above all, medical doctors and patients, who represent the main actors. Despite their different levels of expertise, these actors need to interact and understand each other, but communication is not always easy and effective. This paper describes a method for a contrastive automatic analysis of verbs in medical corpora, based on the semantic annotation of the verbs' nominal co-occurrents. The corpora used are specialized in cardiology and distinguished according to their levels of expertise (high and low). The semantic annotation of these corpora is performed using an existing medical terminology. The results indicate that the same verbs occurring in the two corpora show different specialization levels, which are indicated by the words (nouns and adjectives derived from medical terms) they occur with.


Introduction
The medical field gathers people of different social statuses, such as medical doctors, students, pharmacists, managers, biologists, nurses, imaging experts and of course patients. These actors have different levels of expertise, ranging from low (typically, the patients) up to high (e.g., medical doctors, pharmacists, medical students). Despite their different levels of expertise, these actors need to interact, but their mutual understanding might not always be completely successful. This situation specifically applies to patients and medical doctors, who are the two main actors within the medical field (McCray, 2005; Zeng-Treiler et al., 2007). Beyond the medical field, this situation can also apply to other domains (e.g., law, economics, biology).
The research question is closely linked to readability studies (Dubay, 2004), whose purpose is to address the ease with which a document can be read and understood, and also the ease with which the corresponding information can later be exploited by its readers. One source of difficulty may be the specific and specialized notions that are used: for instance, abdominoplasty, hymenorrhaphy, escharotomy in medical documents, or affidavit, allegation, adjudication in legal documents. This difficulty occurs at the lexical and conceptual level. Another difficulty may come from complex syntactic structures (e.g., coordinated or subordinated phrases) that can occur in such documents. This difficulty is of a syntactic nature. With very simple features, reduced to the length of words and sentences, the classical readability scores address these two aspects (Flesch, 1948; Dale and Chall, 1948; Bormuth, 1966; Kincaid et al., 1975). Typically, such scores do not account for the semantics of the documents. In recent readability approaches, semantics is taken into account through several features, such as: medical terminologies (Kokkinakis and Toporowska Gronostaj, 2006); stylistics of documents (Grabar et al., 2007; Goeuriot et al., 2007); lexicon used (Miller et al., 2007); morphological information (Chmielik and Grabar, 2011); and combinations of various features (Wang, 2006; Zeng-Treiler et al., 2007; Leroy et al., 2008; François and Fairon, 2013).
We propose to continue studying the readability level of specialized documents through semantic features. More precisely, we propose to perform a comparative analysis of verbs observed in medical corpora written in French. These corpora are differentiated according to their levels of expertise and thereby represent the languages of patients and of medical doctors. Our study focuses on verbs and their co-occurrents (nouns and adjectives derived from medical terms), and aims to investigate verb semantics according to the types of constructions and to the words with which the verbs occur in the corpora. To achieve this, we pay particular attention to the syntactic and semantic features of the verbs' co-occurrents in the studied texts.
Our method is based on the hypothesis that the meaning of a verb can be influenced or determined by its context of appearance (L'Homme, 2012) and by its arguments. Indeed, various studies on specialized languages have shown that the verb is not specialized by itself (L'Homme, 1998; Lerat, 2002). Rather, being a predicative unit that involves participants called arguments, the verb can be specialized or not, depending on its argument structure and the nature of these arguments.
In our study, the description of verbs is similar to the one performed in Frame Semantics (FS) (Fillmore, 1982), since we provide semantic information about the verbs' co-occurrents. The Frame Semantics framework is increasingly used for the description of lexical units in different languages (Atkins et al., 2003; Padó and Pitel, 2007; Burchardt et al., 2009; Borin et al., 2010; Koeva, 2010) and specialized fields (Dolbey et al., 2006; Schmidt, 2009; Pimentel, 2011). Among other things, Frame Semantics provides a full description of the semantic and syntactic properties of lexical units. FS puts forward the notion of "frames", which are defined as conceptual scenarios that underlie lexical realizations in language. A frame comprises frame-evoking lexical units (LUs) and Frame Elements (FEs), which represent the participants in the verbal process. For instance, in FrameNet (Ruppenhofer et al., 2006), the frame CURE is described as a situation that involves some specific Frame Elements (such as HEALER, AFFLICTION, PATIENT, TREATMENT), and includes lexical units such as cure, alleviate, heal, incurable, treat.
In our approach, an FS-like modeling should allow us to describe the semantic properties of verbs. Using this framework, we will be able to highlight the differences between the usages of the studied verbs through their various frames and, by doing so, uncover the linguistic differences observed in corpora of different levels of expertise. However, the FS framework will be adapted in order to fit our own objectives. Indeed, the automatic annotation of the verbs' co-occurrents into frames will rely on the use of a terminology (Côté, 1996) which provides a semantic category for each recorded term. These categories (e.g., anatomy, disorders, procedures, chemical products) typically apply to the verb co-occurrents and should be evocative of the semantics of these co-occurrents and of the semantic properties of verbs: we consider that the semantic categories represent the frame elements which are lexically realized by the terms, while the verbs represent the frame-evoking lexical units.
In a previous study, we looked at the behavior of four verbs (observer (observe), détecter (detect), développer (develop), and activer (activate)) in medical corpora written by medical doctors, in contrast to texts written by patients (Wandji Tchami et al., 2013). The results showed that in the corpus written by doctors some verbs tend to have specific meanings, according to the type of arguments that surround them. In the current work, we try to go further by enhancing our method (improved semantic annotation, automated analysis of verbs) and by distinguishing specialized and non-specialized occurrences of verbs.
In the next sections, we present the material used (section 2) and the method designed (section 3). We then introduce and discuss the results (section 4), and conclude with future work (section 5).

Material
We use several kinds of material: the corpora to be processed (section 2.1), the semantic resources (section 2.2), a resource with verbal forms and lemmas (section 2.3) and a list of stopwords (section 2.4).

Corpora
We study two medical corpora dealing with the specific field of cardiology (heart disorders and treatments). These corpora are distinguished according to their levels of expertise and their discursive specificities (Pearson, 1998):
Expert corpus contains expert documents written by medical experts for medical experts. This corpus typically contains scientific publications, and shows a high level of expertise. The corpus is collected through the CISMeF portal, which indexes French-language medical documents and assigns them categories according to the topic they deal with (e.g., cardiology, intensive care) and to their levels of expertise (i.e., for medical experts, medical students or patients).
Forum corpus contains non-expert documents written by patients for patients. This corpus contains messages from the Doctissimo forum Hypertension Problemes Cardiaques. It shows a low level of expertise, although technical terms may also be used.
The size of the corpora in terms of occurrences of words is indicated in Table 1. We can see that, in number of occurrences, the two corpora are of comparable size.
Corpus    Size (occurrences of words)
Expert    1,285,665
Forum     1,588,697
Table 1: Size of the two corpora studied.
Further to our previous work (Wandji Tchami et al., 2013), we have added another semantic axis, E STUDIES, that groups terms related to scientific work and experiments (e.g., méthode (method), hypothèse (hypothesis)). Such notions are quite frequent in the corpora, while they are missing in the terminology used. The only semantic category of Snomed that we ignore in this analysis contains modifiers (e.g., aigu (acute), droit (right), antérieur (anterior)), which are meaningful only in combination with other terms. Besides, such descriptors can occur within medical and non-medical contexts.
As stated above, we expect these semantic categories to be indicative of frame elements (FEs), while the individual terms should correspond to lexical realizations of those FEs, as in FrameNet. For instance, the Snomed category DISORDERS should allow us to discover and group under a single label terms that denote the same notion (e.g., hypertension (hypertension), obésité (obesity)) related to the FE DISORDER.
The existing terminologies may not provide complete coverage of the domain notions (Chute et al., 1996; Humphreys et al., 1997; Hole and Srinivasan, 2000; Penz et al., 2004). For this reason, we attempted to complete the coverage of the Snomed International terminology in relation to the corpora used. We addressed this question in two ways:
• We computed the plural forms of simple terms, that is, terms that contain one word only. The motivation for this processing is that terminologies often record terms in their singular forms, while the documents may contain both singular and plural forms of these terms.
• We tried to detect misspellings of the terms using the string edit distance (Levenshtein, 1966). This measure considers three operations: deletion, addition and substitution of characters. The cost of each operation is set to 1. For instance, the Levenshtein distance between ambolie and embolie is 1, which corresponds to the substitution of a by e. Only words of at least six characters are processed, because with shorter words the proposed matches contain too many errors. The motivation for this kind of processing is that misspelled words are frequent in real documents, especially in forum discussions (Balahur, 2013).
In both cases, the computed forms inherit the semantic type of the terms from the terminology. For instance, ambolie inherits the D DISORDER semantic type of embolie. Besides, we also added the medication names from the Thériaque resource. These are assigned to the C CHEMICAL PRODUCTS semantic type. The whole resource contains 158,298 entries.
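As an illustration, the two extensions above could be sketched as follows. This is a minimal sketch and not the implementation used in this work: the function names, the naive "add s" plural rule and the distance threshold of 1 are assumptions made for the example, since the text only specifies the unit operation costs and the six-character minimum.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost deletion, addition and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # addition
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


MIN_LENGTH = 6  # shorter words are not processed, as stated in the text


def inherit_type(word: str, terminology: Dict[str, str],
                 max_distance: int = 1) -> Optional[str]:
    """Return the semantic type of an exact or near match, else None.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    if word in terminology:
        return terminology[word]
    if len(word) < MIN_LENGTH:
        return None
    for term, semantic_type in terminology.items():
        if levenshtein(word, term) <= max_distance:
            return semantic_type
    return None


def add_plurals(terminology: Dict[str, str]) -> Dict[str, str]:
    """Add a naive plural (suffix "s") for single-word terms (assumed rule)."""
    extended = dict(terminology)
    for term, semantic_type in terminology.items():
        if " " not in term and not term.endswith(("s", "x", "z")):
            extended.setdefault(term + "s", semantic_type)
    return extended
```

With a toy terminology `{"embolie": "D DISORDER"}`, `inherit_type("ambolie", ...)` returns the type of embolie, and `add_plurals` adds embolies with the same type. A real system would need a more complete account of French plural morphology than the single suffix used here.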

Resource with verbal forms
We have built a resource with inflected forms of verbs: 177,468 forms for 1,964 verbs. The resource is built from information available online. It contains simple (consulte, consultes, consultons (consult)) and complex (ai consulté, avons consulté (have consulted)) verbal forms. This resource is required for the lemmatization of verbs (section 3.3).

List of stopwords
The list of stopwords contains grammatical units, such as prepositions, determiners, pronouns and conjunctions. It contains 263 entries.

Method
We first perform the description of verbs in a way similar to FS and then compare the observations made in the two corpora processed. The proposed method comprises three steps: corpora pre-processing (section 3.1), semantic annotation (section 3.2), and contrastive analysis of verbs (section 3.3). The method relies on some existing tools and on specifically designed Perl scripts.

Corpora pre-processing
The corpora are collected online from the websites indicated above and properly formatted. The corpora are then analyzed syntactically using the Bonsai parser (Candito et al., 2010). Its output contains sentences segmented into syntactic chunks (e.g., NP, PP, VP) in which words are assigned parts of speech, as shown in the example that follows:
Le traitement repose sur les dérivés thiazidiques, plus accessibles, disponibles sous forme de médicaments génériques.
(The treatment is based on thiazide derivatives, which are more easily accessible and available as generic drugs.)
The syntactic parsing was performed in order to identify the nominal and verbal syntactic chunks, to prepare the recognition and annotation of the terms they contain, and to improve the recognition of verbs. The Bonsai parser was chosen because it is adapted to French texts and provides several hierarchical syntactic levels within sentences and phrases. For instance, the phrase médicaments génériques (generic drugs) is syntactically analyzed as NP: (NP (NC médicaments) (AP (ADJ génériques)))) that contains one NP médicaments and two APs génériques and the final dot. The VP of the sentence contains the verb repose (is based). As we can observe, the output of the Bonsai parser provides neither the lemmas of the forms nor the syntactic dependencies between the constituents. Our study therefore concentrates on the co-occurrences of verbs with nouns, noun phrases and some relational adjectives. The further analysis of the corpora is based on this output.
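To make the chunk structure concrete, a bracketed analysis of this kind can be read into a tree and serialized as XML. The sketch below assumes a balanced, Penn-style bracketing like the NP fragment quoted above; the exact output format of the parser is not reproduced here, and the function name `bracketed_to_xml` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def bracketed_to_xml(parse: str) -> ET.Element:
    """Convert a balanced bracketed parse into an XML element tree."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse)
    root = None
    stack = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "(":
            i += 1                        # the token after "(" is the label
            node = ET.Element(tokens[i])
            if stack:
                stack[-1].append(node)
            else:
                root = node
            stack.append(node)
        elif token == ")":
            if not stack:
                raise ValueError("unbalanced parse: unexpected ')'")
            stack.pop()
        else:                             # a word inside the current node
            node = stack[-1]
            node.text = token if node.text is None else node.text + " " + token
        i += 1
    if stack or root is None:
        raise ValueError("unbalanced or empty parse")
    return root


tree = bracketed_to_xml("(NP (NC médicaments) (AP (ADJ génériques)))")
print(ET.tostring(tree, encoding="unicode"))
```

Each chunk label becomes an element and each word becomes the text of its part-of-speech element, so nested chunks can then be visited from the largest to the smallest.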

Semantic annotation
The Bonsai format is first converted into XML, and we work on the resulting XML tree structure. The semantic annotation of the corpora is done automatically. For this task, the Snomed International terminology was chosen because it is suitable for French and offers better coverage of the French medical language. We project the terms from the terminology onto the syntactically parsed texts:
• All the chunks (NPs, PPs, APs and VPs) are processed from the largest to the smallest, and within them we try to recognize the terminology entries that co-occur with the verbs in the corpora. At this stage, since our chunker does not provide dependency relations, we can only work on the nouns and noun phrases that co-occur with the verbs. For instance, the largest chunk (NP (NC médicaments) (AP (ADJ génériques)))) gives médicaments génériques (generic drugs), which is not known in the terminology. We then test médicaments (drugs) and génériques (generic), of which médicaments (drugs) is found in the terminology and tagged with the C CHEMICAL PRODUCTS semantic type.
• Those VPs in which no terms have been identified are considered to be verbal forms or verbs.
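The largest-to-smallest projection described above can be approximated by a longest-match lookup over the words of a chunk. The following is a simplified sketch under stated assumptions: it works on a flat list of tokens rather than on the XML chunk tree, and the function name and the dictionary-based terminology are hypothetical.

```python
from typing import Dict, List, Tuple


def project_terms(tokens: List[str],
                  terminology: Dict[str, str]) -> List[Tuple[str, str]]:
    """Annotate a chunk, trying the longest word sequences first."""
    annotations = []
    start = 0
    while start < len(tokens):
        matched = False
        for end in range(len(tokens), start, -1):   # longest span first
            candidate = " ".join(tokens[start:end]).lower()
            if candidate in terminology:
                annotations.append((candidate, terminology[candidate]))
                start = end                         # skip the matched words
                matched = True
                break
        if not matched:
            start += 1                              # no entry starts here
    return annotations


terminology = {"médicaments": "C CHEMICAL PRODUCTS"}
print(project_terms(["médicaments", "génériques"], terminology))
```

With this toy terminology, the two-word sequence médicaments génériques is tested first and rejected, after which médicaments alone is matched, as in the example given in the text.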
Examples of corpora enriched with the semantic information are shown in Figures 1 (expert corpus) and 2 (forum corpus). In these figures, verbs are in bold characters, and the semantic labels of the verbs' co-occurrents are represented by different colors: DISORDERS in red, FUNCTIONS in purple, ANATOMY in light blue. These semantic categories, provided by the terminological resource, label the words that are likely to correspond to FEs.

Figure 1: Examples of annotations in expert corpus
We can see that both corpora contain short and long sentences. Besides, the terms recognized are often atomic. For instance, we do not recognize the complex terms embolie pulmonaire and thrombose du tronc, but their simple atomic components embolie, pulmonaire, thrombose and tronc. Also, some terms are not recognized because they are part of VPs, such as cathéter in Figure 1.

Automatic analysis of verbs
For the analysis of the verbs, we extract information related to verbs and to the words with which they occur. Currently, only sentences with one VP are processed: 8,842 sentences for the expert corpus and 10,563 for the forum corpus.
• Lemmatization of verbs. As noted above, the syntactic parser's output does not provide lemmas. For the lemmatization of the verbs, we use the verbal resource described in section 2.3. The content of the verbal chunk is analyzed as follows:
- it may contain a simple or complex verbal form that exists in the resource, in which case we record the corresponding lemma;
- if the whole chunk does not appear in the resource, we check its atomic components: if all or some of these components are known, we record the corresponding lemmas. This case may apply to passive structures (a été conseillé (has been advised)), insertions (est souvent conseillé (is often advised)) or negations (n'est pas conseillé (is not advised)): in these cases, the lemmas are avoir être conseiller, être conseiller and être conseiller, respectively. These lemmas are normalized in a further step: the head verb is chosen automatically and considered as the main lemma of the verbal phrase;
- finally, the VPs may consist of words that are not known in the verb resource. These may be morphologically constructed verbs (réévaluer (reevaluate)) or words from other parts of speech erroneously considered as verbs (e.g., télédéclaration, artérielle, stroke). This is unfortunately a very frequent case.
• Extraction of information related to the verb co-occurrents. For the extraction of this information, we consider all the verbs appearing in sentences with one VP. For each verb, we distinguish between:
- semantically annotated co-occurrents, which are considered to be specialized;
- the remaining content of the sentence (except the words that are part of the stoplist), more precisely noun phrases, which is considered to contain non-specialized co-occurrents.
In both cases, for each verb, we compute the number and the percentage of words in each of the above-mentioned categories of co-occurrents.
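The three lemmatization cases listed above (whole form known, some components known, nothing known) can be illustrated with a dictionary lookup. This is a sketch under stated assumptions, not the scripts used in this work: the resource is reduced to a toy dictionary, and taking the last lemma as the head verb is an assumption, since the text does not specify how the head is selected.

```python
from typing import Dict, List, Optional


def lemmatize_chunk(chunk: str, verb_forms: Dict[str, str]) -> List[str]:
    """Return the lemma(s) found for a verbal chunk, or [] if none is known."""
    form = chunk.lower().strip()
    if form in verb_forms:                 # whole simple or complex form
        return [verb_forms[form]]
    # Fallback: look up each component (split on spaces and apostrophes).
    lemmas = []
    for token in form.replace("'", " ").split():
        if token in verb_forms:
            lemmas.append(verb_forms[token])
    return lemmas


def head_lemma(lemmas: List[str]) -> Optional[str]:
    """Assumed normalization: keep the last lemma as the head verb."""
    return lemmas[-1] if lemmas else None


# Toy extract standing in for the resource of inflected forms.
verb_forms = {"a": "avoir", "été": "être", "est": "être",
              "conseillé": "conseiller", "ai consulté": "consulter"}
print(lemmatize_chunk("a été conseillé", verb_forms))
print(head_lemma(lemmatize_chunk("n'est pas conseillé", verb_forms)))
```

On the examples from the text, the passive form yields avoir, être and conseiller, the negated form yields être and conseiller, and a non-verbal chunk such as télédéclaration yields no lemma.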
Finally, we provide a general analysis of the corpora. For each verb, we compute: the number of occurrences in each corpus, and the total, minimal, maximal and average numbers of co-occurrents, both specialized and non-specialized. On the basis of this information, we analyse the differences and similarities which may exist between the use of verbs in the two corpora studied. The purpose is to provide information about the specialized and non-specialized occurrences of verbs.
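The per-verb figures listed above could be computed along the following lines. The input representation (one record per single-VP sentence, with each co-occurrent paired with its semantic type or None) and all names are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# One record per processed sentence:
# (verb lemma, [(co-occurrent, semantic type or None), ...])
Sentence = Tuple[str, List[Tuple[str, Optional[str]]]]


def verb_statistics(sentences: List[Sentence]) -> Dict[str, dict]:
    """Aggregate specialized and non-specialized co-occurrents per verb."""
    per_verb = defaultdict(list)
    for lemma, cooccurrents in sentences:
        specialized = sum(1 for _, sem_type in cooccurrents
                          if sem_type is not None)
        per_verb[lemma].append((specialized, len(cooccurrents) - specialized))
    stats = {}
    for lemma, counts in per_verb.items():
        n_sp = sum(sp for sp, _ in counts)
        n_nsp = sum(nsp for _, nsp in counts)
        total = n_sp + n_nsp
        per_sentence = [sp + nsp for sp, nsp in counts]
        stats[lemma] = {
            "occurrences": len(counts),
            "cooccurrents": total,
            "specialized": n_sp,
            "non_specialized": n_nsp,
            "pct_specialized": 100.0 * n_sp / total if total else 0.0,
            "min": min(per_sentence),
            "max": max(per_sentence),
            "avg_specialized": n_sp / len(counts),
            "avg_non_specialized": n_nsp / len(counts),
        }
    return stats
```

Running the same function on each corpus separately gives two tables that can be compared verb by verb.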

Corpora pre-processing
The parsing, done with the Bonsai parser, provided the annotation of the corpora into syntactic constituents. We have noticed some limitations:
• The Bonsai parser does not perform the lemmatization of lexical units, whereas we needed to extract the verbs' lemmas. The use of external resources made it possible to overcome this limitation;
• The verbal chunks do not always contain verbal constituents, but can contain other parts of speech (e.g., télédéclaration, artérielle, stroke) and even punctuation. This is an important limitation for our work, mainly because we focus on verbs: if we cannot extract the verbs properly, this has a negative impact on the final results.
These limitations, resulting from the Bonsai parser, highlight some of the issues that characterize the state of the art in syntactic analysis for French. In future work, we plan to try other syntactic parsers for French.

Semantic annotation
Concerning the semantic annotation, we have made several observations:
• Some annotations are missing, such as site d'insertion (insertion site), which can be labeled as TOPOGRAPHY, or risque (risk) as FUNCTION. This limitation also affects the annotation of the forum corpus, which often contains misspellings or non-specialized equivalents of the terms. It must be addressed in future work in order to detect new terms or variants of the existing terms and make the annotation more exhaustive;
• Other annotations are erroneous, such as the English conjunction or (ou in French), which is annotated as CHEMICALS (or means gold in French) in English-language sentences. In the future, sentences in English will be filtered out beforehand at the processing stage;
• The terminological variation and the syntactic parsing provided by Bonsai make the recognition of several complex terms difficult. As we noticed previously, we mainly recognize simple atomic terms. For the current purpose, this is not a real limitation: the main objective is to detect the specialized and non-specialized words that co-occur with the verbs. Still, the number and semantic types of these words co-occurring with verbs can become biased. For instance, instead of one DISORDER term embolie pulmonaire (pulmonary embolism), we obtain one DISORDER term embolie (embolism) and one ANATOMY term pulmonaire (pulmonary).

Automatic analysis of verbs
The contrastive analysis of the words co-occurring with verbs provides the main results of the proposed study.
Table 2: General information related to the verbs and their co-occurrent words: total and average numbers of co-occurrents
In Table 2, we compute the total number of verbs ($Total_{V}$), the total number of words co-occurring with verbs per corpus ($Total_{coocc}$), the total number of non-specialized co-occurrents per corpus ($N_{\neg sp-coocc}$), the average number of specialized co-occurrents per verb ($A_{sp-coocc}/V$), and the average number of non-specialized co-occurrents per verb ($A_{\neg sp-coocc}/V$). We can notice that the forum corpus provides slightly more verbs than the expert corpus. This observation might be considered obvious, since the forum corpus is somewhat larger than the expert corpus. However, combined with the fact that the total and average numbers of co-occurrents (specialized and non-specialized) are higher in the expert corpus, the observation becomes meaningful: these results are in line with the finding of Condamines and Bourigault (1999) that nominal forms tend to be more frequent in specialized texts, whereas verbal forms tend to be more frequent in non-specialized texts. It is important to note, however, that some candidates in the list of non-specialized co-occurrents have to be filtered out, such as adverbs (conformément, régulièrement, précocément, partiellement) and non-relational adjectives (variables, inconscients, différents). The abundance of adverbs in the expert corpus (Table 4), by contrast with the forum corpus, where they seem to be less frequent, is consistent with previous work, which shows that non-specialized documents tend to have simpler syntactic and semantic structures (Wandji Tchami et al., 2013) and fewer adverbs (Brouwers et al., 2012).
Table 3: Information on some verbs that occur in the Expert (Ex) and Forum (Fo) corpora
In Table 3, we give similar information for individual verbs. For each verb, in each corpus, we compute the number of occurrences ($N_{occ}$), the number of words occurring with the verb ($N_{coocc}$), the number of specialized co-occurrents ($N_{sp-coocc}$), the percentage of specialized co-occurrents ($\%_{sp-coocc}$), the number of non-specialized co-occurrents ($N_{\neg sp-coocc}$), the percentage of non-specialized co-occurrents ($\%_{\neg sp-coocc}$), the average number of specialized co-occurrents ($A_{sp-coocc}$) and the average number of non-specialized co-occurrents ($A_{\neg sp-coocc}$). These verbs are chosen because they occur in the two corpora studied and because they are sufficiently frequent compared to others. In our opinion, these verbs may receive specialized and non-specialized meanings according to their usage. Indeed, Table 3 shows that these verbs behave differently according to the corpus. On the one hand, there are verbs (e.g., augmenter, favoriser, signaler, traiter, risquer) that occur with a large number of specialized co-occurrents in the Expert (Ex) corpus, while they have lower numbers of specialized co-occurrents in the Forum (Fo) corpus. On the other hand, there are verbs (e.g., causer, subir, prescrire) that have more specialized co-occurrents in the Forum corpus than in the Expert corpus. If we take into account the number of occurrences of these verbs, however, we notice that some of them (e.g., causer and subir) occur with more specialized co-occurrents per occurrence in the Expert corpus than in the Forum corpus, although their total number of specialized co-occurrents is lower there. This means that their frames involve different numbers of specialized co-occurrents, which are higher in the Expert corpus.
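The relation between the totals and the averages discussed above can be made explicit. The formulas below are our reading of the measures and are stated as assumptions: the text names the quantities but does not give their definitions, and in particular the denominator of the averages (the number of occurrences of the verb) is assumed.

```latex
\[
N_{coocc} = N_{sp\text{-}coocc} + N_{\neg sp\text{-}coocc}, \qquad
\%_{sp\text{-}coocc} = 100 \times \frac{N_{sp\text{-}coocc}}{N_{coocc}}, \qquad
A_{sp\text{-}coocc} = \frac{N_{sp\text{-}coocc}}{N_{occ}}
\]
```

Under these definitions, a verb can have a lower total and still a higher average in one corpus. With purely hypothetical counts, $N_{occ} = 10$ and $N_{sp\text{-}coocc} = 20$ give an average of 2.0, whereas $N_{occ} = 50$ and $N_{sp\text{-}coocc} = 60$ give an average of 1.2: the second case has the larger total but fewer specialized co-occurrents per occurrence.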
In Table 4, we show the frequent co-occurrents for five verbs. We can make two main observations:
• Some verbs involve a large number of specialized co-occurrents that have different semantic types in the Expert and Forum corpora. For instance, the verb augmenter provides a total of 88 specialized co-occurrents that belong to nine semantic types (D, P, S, J, C, F, T, L and A). The most frequent among them are F (27), D (18), T (15) and P (9), and they occur mostly in the Expert corpus. These might be more general verbs, with weaker selectional restrictions.
• Other verbs frequently occur with specialized terms that belong to a specific semantic type. The most frequent label can be specific to one corpus only or shared by the two. For instance, for the verb prescrire, the most frequent labels are the same in the two corpora: C, J, P and T terms. Traiter frequently occurs, in the two corpora, with C and D terms.
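The distribution of semantic types per verb, on which these two observations rest, can be obtained by simple counting. The sketch below assumes that the annotation step yields (verb lemma, semantic type label) pairs for one corpus; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def type_distribution(
        annotations: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count semantic-type labels per verb from (lemma, label) pairs."""
    distribution: Dict[str, Counter] = {}
    for lemma, label in annotations:
        distribution.setdefault(lemma, Counter())[label] += 1
    return distribution


def dominant_types(counter: Counter, k: int = 2) -> List[str]:
    """Return the k most frequent labels for one verb."""
    return [label for label, _ in counter.most_common(k)]
```

A verb whose counts are spread over many labels would correspond to the first observation, while a verb dominated by one or two labels would correspond to the second; comparing the dominant labels obtained for each corpus shows whether they are shared.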
The general observation is that, for a given verb, the Expert corpus shows more sophisticated syntactic structures with a higher number of specialized co-occurrents. Besides, some verbs may show similar or different behavior in the two corpora studied. According to the objectives of the proposed work, we consider that a strong presence of specialized terms in a sentence or corpus indicates a very specialized use and meaning of the verbs. Quantitative and qualitative analyses of the data support this first study and its results.

Conclusion
We have proposed an automatic method to distinguish between specialized and non-specialized occurrences of verbs in medical corpora. This work is intended to enhance the previous study (Wandji Tchami et al., 2013). Indeed, the method used has changed from semi-automatic to completely automatic, and a new task is performed in order to enhance the annotation process: the syntactic parsing of the corpora. Also, some new materials are used, namely the Bonsai parser, the resource of verbal forms and the stoplist. There is an increase in the quantity of data analyzed: all the verbs of the various corpora were considered in this study. The annotation is based on an approach similar to Frame Semantics, in that semantic information related to the verbs' co-occurrents is provided through the use of a medical terminology. Though our method is still under development, it has helped to notice that some verbs regularly co-occur with specialized terms in a given context or corpus, while in another the same verbs mostly occur with general language words. This observation takes us back to the issue of text readability, described in the introduction. Indeed, the verbs whose occurrences are characterized by the predominance of specialized terms can be considered as sources of reading difficulties for non-experts in medicine.

Future work
We plan to extend this study in different ways. The recognition of the verb neighbors must be improved, with the main objective of making the annotations more exhaustive. In this study, we have portrayed the behavior of verbs and their relations with the words with which they occur in the corpora. However, our aim is to automatically identify the verbs' arguments among their co-occurrents. We also plan to perform an automatic distinction between the syntactic functions (subject, object, etc.) of the verbs' arguments, and between core and non-core elements. We also plan to compute the dependency relations within sentences, either by using another chunker or by integrating into our processing chain a tool that can perform this task.
In addition, we will concentrate on the description of the semantic frames of the medical verbs and on the identification of other possible reading difficulties that might be related to the usages of verbs in the corpora.
As indicated above, we processed sentences that have only one verbal phrase (8,842 for the expert corpus and 10,563 for the forum corpus). In the future, we will process other sentences, coordinated or subordinated, which will be segmented into simple clauses before processing. Another point is related to the exploitation of these findings for the simplification of medical documents at two levels: syntactic and lexical. Finally, by working on fine-grained verbal semantics, we can distinguish the uses of verbs according to whether their semantics and frames remain close or indicate different meanings.

Figure 2: Examples of annotations in forum corpus.

Table 4: Description of the verbs' co-occurrents