Construction of Disease-Symptom Knowledge Graph from Web-Board Documents

Pechsiri, Chaveevan; Piriyakul, Rapepun

doi:10.3390/app12136615

by Chaveevan Pechsiri ¹,* and Rapepun Piriyakul ²

¹ College of Innovative Technology and Engineering, Dhurakij Pundit University, Bangkok 10210, Thailand

² Department of Computer Science, Ramkhamhaeng University, Bangkok 10240, Thailand

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(13), 6615; https://doi.org/10.3390/app12136615

Submission received: 27 May 2022 / Revised: 16 June 2022 / Accepted: 22 June 2022 / Published: 29 June 2022

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The aim of this research is to construct a disease-symptom knowledge graph (DSKG) as a cause-effect knowledge graph whose disease-symptom relations, a type of cause-effect relation, are determined from documents downloaded from medical web-board resources. Each disease-symptom relation connects a disease-name concept node (a causative-concept node) to a corresponding node holding a group of correlated symptom-concept/effect-concept features that are common among some disease-name concepts. The DSKG benefits non-professionals in preliminary diagnosis through a recommender web-board. There are three main problems: how to determine symptom concepts from sentences without annotation in documents whose topic names are disease-name concepts; how to determine the disease-symptom relations from documents with or without complications; and how to construct the DSKG when the union of the correlated symptom-concept groups yields high-dimensional symptom-concept features. We therefore apply a word co-occurrence pattern, together with medical-symptom expressions from Wikipedia and MeSH and the Lexitron Dictionary, to determine the symptom concepts. The Cartesian product is applied for automatic-supervised machine learning to determine the disease-symptom relation. We propose using Principal Component Analysis to construct the DSKG by reducing the dimensionality of the symptom-concept features with minimized information loss. In contrast to previous works, the proposed approach enables DSKG construction with precise and concise representation scores of 7.8 and 9, respectively.

Keywords: cause-effect relation; disease-symptom knowledge graph; word co-occurrence pattern

1. Introduction

The research objective is to construct a disease-symptom knowledge graph (DSKG) automatically from disease-symptom relations determined in documents downloaded from two medical web-board resources: a Thai-hospital web-board resource (e.g., http://www.si.mahidol.ac.th/sidoctor/e-pl/ (accessed on 15 February 2021); https://www.bumrungrad.com (accessed on 15 February 2021); etc.) and a Thai-Healthcare-Knowledge web-board resource (e.g., http://haamor.com/ (accessed on 15 February 2021); and https://www.doctor.or.th/doctorme/general (accessed on 15 February 2021)). The DSKG is formed as a cause-effect knowledge graph that represents disease-symptom relations, a cause-effect relation type between causative concept nodes and effect concept nodes, where each causative concept node is represented by a certain disease-name concept and each effect concept node is represented by a correlated symptom-concept group. According to [1], most patients with a certain disease have multiple symptoms rather than a single symptom, and some of these symptoms are correlated or co-occur as common symptoms among several diseases; e.g., a stuffy or runny nose and a cough are common symptoms among Cold, Flu, Airborne Allergy, and COVID-19 (https://newsinhealth.nih.gov/2022/01/it-flu-covid-19-allergies-or-cold (accessed on 15 February 2021)). The DSKG presents several disease-symptom relations determined or extracted from the downloaded documents, where each disease-symptom relation is a link connecting a disease-name concept node to a node containing a symptom-concept group as an occurrence of multiple symptoms.
Furthermore, this disease-symptom relation is a semantic relation, specifically the cause-effect relation type (called CErel), which links each disease-name concept (d_i; i = 1, 2, …, numofDiseaseElements) as the causative concept to the corresponding symptom-concept group (SymGroup_di) as the effect-concept group resulting from d_i in the d_i document (the downloaded disease document having d_i as its topic name), where d_i ∈ DS, with DS the disease-name concept set, and SymGroup_di ⊂ SG, with SG the set of symptom-concept groups resulting from the corresponding DS elements. Moreover, the symptom concepts of our research include sign concepts and are the basic symptoms of each d_i. Beyond CErel, a semantic relation in general is a directional link between two or more concepts, entities, or sets of entities that participate in the relation [2], as follows:

<Concept1>---(Relation)---<Concept2>

where “<…>” and “(…)” symbols represent a concept and a relation type, respectively.

Thus, CErel is the relation type which links Concept1, e.g., d_i, to Concept2, e.g., SymGroup_di, as follows:

<d_i>---(CErel)----> <SymGroup_di>.

where SymGroup_di = {Sym_i1, Sym_i2, …, Sym_i,last_i}_{of_di}; SYM is the universal symptom-concept set, so that Sym_ij ∈ SYM; i = 1, 2, …, numofDiseaseElements; j = 1, 2, …, last_i; SymGroup_d1 ∪ SymGroup_d2 ∪ … ∪ SymGroup_d_numofDiseaseElements ⊂ SYM; and SymGroup_di ∩ SymGroup_dl may be empty or non-empty, where i ≠ l and 1 ≤ l ≤ numofDiseaseElements. Moreover, concepts and relations are the foundation of knowledge and thought [2]: the concepts are the building blocks of knowledge, and the relations are the cement linking the concepts into knowledge structures. In line with this knowledge structure, the DSKG is formed by CErel connections which connect several different <d_i> nodes to a node containing several correlated s_c features (where s_c is a symptom concept expressed in the documents; s_c ∈ S, where S is obtained by the union of all SymGroup_di from all d_i documents; i = 1, 2, …, numofDiseaseElements; S ⊂ SYM; S = {s_c}; and c is an index, c = 1, 2, ..., m, with m the number of symptom-concept features; see Figure 1).
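The set relations above can be illustrated with a minimal Python sketch. The disease names and symptom concepts below are hypothetical toy values chosen only to mirror the Cold/Flu example in the text; they are not data from the corpus, and the variable names (`sym_group`, `cerel_edges`, `S`) are assumptions introduced for illustration.

```python
# Hypothetical toy data: each disease-name concept d_i maps to its SymGroup_di.
sym_group = {
    "cold": {"runnyNose", "cough", "soreThroat"},
    "flu": {"runnyNose", "cough", "fever"},
    "gastritis": {"stomachPain"},
}

# One <d_i> ---(CErel)---> <SymGroup_di> link per disease.
cerel_edges = [(d, frozenset(group)) for d, group in sym_group.items()]

# S is the union of all SymGroup_di.
S = set().union(*sym_group.values())

# The intersection of two groups may be non-empty (common symptoms) or empty.
common_cold_flu = sym_group["cold"] & sym_group["flu"]
common_cold_gastritis = sym_group["cold"] & sym_group["gastritis"]
```

Here `common_cold_flu` holds the symptom concepts shared by two diseases, which is the kind of group the DSKG places in a single multi-symptom node.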

With regard to Figure 1, the DSKG also presents a <s₁, s₈, …, s_β> node as the common symptom-concept features among d_i nodes.

In addition, identifying symptoms in terms of symptom concepts is vital to the diagnosis of diseases in the medical field. Approximately 70–90% of diagnostic information comes from a patient’s history and physical examinations that involve costly physical tests [1]. Although about one third of the identifiable common symptom concepts do not provide a conclusive disease-based explanation [1], the DSKG of our research can be used as an additional healthcare procedure for preliminary diagnosis of some potential diseases during the diagnostic process, which potentially reduces physical examination costs. Moreover, the DSKG will help healthcare practitioners to avoid tunnel vision and to stay aware of the presence of multiple symptom concepts. The DSKG can also be extended beyond healthcare practitioners to non-professionals, for preliminary diagnosis of possible diseases from actual symptom occurrences, through a web application system containing a healthcare recommender application that uses the DSKG on their mobile phones or computers.

Thus, the research focuses on constructing the DSKG from the determined cause-effect pairs, i.e., the d_i-SymGroup_di pairs having CErel, from the downloaded disease documents on the Thai medical web-board resources. The Sym_ij element expressed in the d_i document is mostly based on an event expression in an Elementary Discourse Unit (EDU, defined as a simple sentence or a clause by [3]). The event expression is conveyed by a verb with event semantics [4] in the EDU’s verb phrase, where each EDU expression is based on a general linguistic expression, e.g., a general Thai linguistic expression (see Figure 2), after word stemming and stop-word elimination.

In Figure 2, the concept of each element (called “an element concept”) in the Verb_weak, Verb_strong, Adv, Adj, and Noun sets is based on the medical-symptom-expression list of Wikipedia (https://en.wikipedia.org/wiki/List_of_medical_symptoms (accessed on 10 January 2022)) and MeSH (https://www.ncbi.nlm.nih.gov/mesh (accessed on 15 February 2021)), after translation from English to Thai by the Lexitron Dictionary (https://dict.longdo.com (accessed on 15 February 2021)) followed by Thai-to-English translation by the Lexitron Dictionary and WordNet [5] (https://word-net.princeton.edu/obtain (accessed on 15 February 2021)). An example of a downloaded disease document is shown in Figure 3, which contains seven different symptom-concept expressions based on the verb phrases of EDU2-EDU7 and EDU11.

There are several techniques in the literature [6,7,8,9,10,11,12,13,14,15] applied for determining the cause-effect/disease-symptom relation from the unstructured data, e.g., texts, without constructing the cause-effect/disease-symptom graph or network except [7,12,13,14,15] (see Section 2) where each cause-effect/disease-symptom relation in the cause-effect/disease-symptom graph or network of [7,12,13,14,15] is based on a causative-concept feature, e.g., a disease-name concept feature, connecting to one effect-concept feature, e.g., a symptom-concept feature. In contrast, unlike the aforementioned literature, our DSKG is constructed by several CErel connections where each CErel connection is the link between one causative-concept feature, e.g., a d_i feature, and a group of correlated effect-concept features (e.g., a group of the s_n₁, s_n₂, …, and s_n_η features which are the correlated s_c features as the common symptom-concept features) where n1, n2, …, and nη are the index (c) values; 0 < n1 < n2 < … < nη; η is numberOfCorrelatedSymtomConcepts; and η ≤ m is the number of symptom-concept features (see Figure 1).

In [12,13,14,15], determination of the CErel or disease-symptom relation from the documents involves a disease-name concept feature set and a symptom-concept feature set whose symptom-concept feature elements are mostly expressed by at least one term/word in NP1 or NP2 of a simple sentence (see Section 2). Another study [16] works only on symptom recognition from documents, without determining the disease-symptom relation, where the symptoms are based on either NP1 or VP. In contrast, the symptom-concept features of our DSKG construction are mostly expressed by at least two terms/words as a composite variable [17] in an EDU’s verb phrase (VP) together with NP1 of the EDU after word stemming and stop-word removal, e.g., “(คอ/throat)/NP1 ((เจ็บ/hurt)/Verb_strong)/VP” (a sore throat), “(ศรีษะ/head)/NP1 ((มี/has)/Verb_weak (อาการ/symptom)/Noun (บวม/swelling)/NP2)/VP” (a swollen head), and “(คนไข้/patient)/NP1 ((ปวด/pain)/Verb_strong (ศรีษะ/head)/NP2)/VP” (The patient has a headache). A composite variable combines two or more individual variables, called indicators, into a single variable [17]. Each indicator alone does not provide sufficient information, but together they can represent a more complex concept. The composite variable of the symptom-concept feature in our research consists of one or two terms from the EDU’s VP and one term from the EDU’s NP1 for obtaining a symptom/effect concept.

However, Thai documents have some specific characteristics, such as zero anaphora (implicit noun phrases) and the absence of word and sentence delimiters. These characteristics are involved in three main problems for constructing the DSKG from the documents: (1) how to determine the symptom concept of an EDU_h occurrence as EDU_h,Symij (an EDU_h occurrence with a symptom concept, Sym_ij, on a d_i document; h = 1, 2, …, endDocument_di) among several EDU_h occurrences with non-symptom concepts on the d_i document; (2) how to determine CErel between d_i and each SymGroup_di element (Sym_ij of d_i) from a downloaded d_i document in order to subsequently determine a d_i-SymGroup_di pair with CErel, where SymGroup_di appears as a symptom-concept EDU vector, i.e., 〈EDU_h1,Symi1, EDU_h2,Symi2, …, EDU_hrim,Symi,last_i〉_{of_di}, on the d_i document (0 < h1 < h2 < … < hrim ≤ endDocument_di), whilst some downloaded d_i documents contain the d_i symptom concepts of the primary disease with or without other symptom concepts resulting from complications as the secondary disease; and (3) how to construct the DSKG based on each <d_i> node (represented by the d_i feature) connecting to the <s_n₁, s_n₂, …, s_n_η> node (represented by a group of the correlated s_c features) given the high-dimensional feature space of s_c, where s_c ∈ S is obtained by the union of all SymGroup_di from the determined d_i-SymGroup_di pairs having CErel. S then has a high-dimensional feature space [18], which makes finding the correlated s_c features for the DSKG construction time-consuming.

We therefore need to develop a framework which combines statistical techniques, machine learning techniques, and linguistic phenomena to learn several EDU expressions for solving the research problems. With regard to Figure 3, most of the symptom-concept occurrences in the documents are based on Verb or VP (see Figure 2). Therefore, we apply a word co-occurrence (called “wc”) pattern to an EDU occurrence, where a wc pattern consists of three major term sets on an EDU occurrence: a predicate-verb term set, an agent term set, and a patient/information term set (see Section 3.1). Word co-occurrence [19,20] is expressed as a compound term with or without any pattern or restriction, depending on each research perspective, whilst the wc pattern of our research is used for obtaining an EDU’s wc expression as a composite variable along with determining a wc concept, particularly a symptom concept, which is called “wc_Sym” of a wc expression on an EDU occurrence. Thus, the symptom-concept feature, Sym_ij, occurring on the EDU_h,Symij occurrence of the d_i document, is represented by wc_Symij of the EDU_h,Symij’s wc expression without concept annotation. With regard to the d_i document, wc_Symij and d_i are then used as a wc_Symij feature and a d_i feature, respectively, for determining a d_i-SymGroup_di pair with CErel, where the concepts of wc_Symij and d_i are based on the Verb_weak, Verb_strong, Adv, Adj, and Noun sets (in Figure 2) prepared and collected from the medical-symptom-expression list of Wikipedia and MeSH after the English-Thai translations. The three contributions of this paper follow a statistically based approach involving machine learning.
The first one is how to determine a SymGroup_di element, Sym_ij, based on the wc pattern used for determining the wc_Symij feature as the composite variable without concept annotation, whereas the symptom concepts in previous research, e.g., [13,15], are based only on NP1/NP2, and NP1 is likely to be an ellipsis (NP1 has a null value) in our documents. The second one is how to determine d_i-wc_Symij pairs having CErel by machine learning techniques with automatic-supervised learning, where the positive/negative instances are formed by the Cartesian product of DS × SG [21] with consideration of the disease type (t or Type_t); see Section 3.2. According to the DSKG, each wc_Symt,ij feature is based on the basic symptom of d_t,i, where d_t,i is d_i in Type_t, and wc_Symt,ij is wc_Symij of d_t,i. Thus, the symptoms of the complications are excluded by means of the complicationTerm set, {‘ภาวะแทรกซ้อน/complications’, ‘ไม่รักษา/unTreat’, ...}. The third one is how to construct the DSKG by clustering or wrapping the determined S elements into the group of the correlated s_c features, as a dimensionality reduction of the feature space of S with minimized information loss. Thus, the constructed DSKG diagram also presents the <s_n₁, s_n₂, …, s_n_η> node as the multi-symptom concept node among some <d_i> nodes.

Therefore, we apply the wc pattern to obtain the wc expression of EDU_h,Symij, where the wc pattern relies on a predicate-argument pattern [22] (see Section 3.1). The wc_Symij feature is determined from the wc expression by the elements in the Verb_weak, Verb_strong, Adv, Adj, and Noun sets collected from the medical-symptom-expression list of Wikipedia and MeSH through the English-Thai translation, without concept annotation (see Section 3.1 and ii in Section 4.2). Moreover, all symptoms of the complications are excluded if they occur right after a complicationTerm set element. We also apply the machine learning techniques Support Vector Machine (SVM) [23], Naïve Bayes (NB) [24], and Linear Logistic Regression (LR) [25] to determine d_t,i-wc_Symt,ij pairs having CErel by automatic-supervised learning from the result of the Cartesian product of DS × SG aligned with the disease type (t or Type_t) on the learning corpus (see Section 3.2). Each d_t,i-SymGroup_dt,i pair with CErel of Type_t is subsequently obtained by grouping the determined d_t,i-wc_Symt,ij pairs having CErel with the same d_t,i from the test corpus. We then propose using principal component analysis (PCA) [26] to address the high-dimensional feature space of S by wrapping the S elements into the group of the correlated s_c features as the common features for constructing the DSKG (see Section 3.3).
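The idea of using PCA to find a group of correlated symptom-concept features can be illustrated with a minimal sketch. It is not the authors' procedure (which is described in Section 3.3); it assumes a hypothetical binary disease-by-symptom matrix `X`, computes only the first principal component by power iteration, and groups the features whose absolute loading exceeds an arbitrary threshold of 0.5. The feature names, the matrix, and the threshold are all assumptions made for illustration.

```python
import math

def covariance(X):
    """Sample covariance matrix of the columns of X (rows are observations)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    return [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in X) / (n - 1)
             for b in range(m)] for a in range(m)]

def first_component(C, iters=100):
    """Leading eigenvector of C by power iteration, started from a uniform vector.
    Assumption: the start vector is not orthogonal to the leading eigenvector."""
    m = len(C)
    v = [1 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance at all: keep the current vector
            break
        v = [x / norm for x in w]
    return v

# Hypothetical data: rows are diseases, columns are symptom features s1..s3.
features = ["s1", "s2", "s3"]
X = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 0],
     [0, 0, 0]]

loadings = first_component(covariance(X))
correlated_group = [f for f, w in zip(features, loadings) if abs(w) > 0.5]
```

In this toy matrix s1 and s2 always occur together while s3 never varies, so the first component loads equally on s1 and s2 and not at all on s3. A real feature space would need more components and a principled cut-off.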

Our research is organized into six sections. In Section 2, related works are summarized. Research problems in constructing the DSKG from the documents are described in Section 3, and Section 4 shows our framework for constructing the DSKG through CErel determination from the documents. In Section 5, we evaluate and discuss our proposed methodology and then present the conclusion in Section 6.

2. Related Works

Several strategies [6,7,8,9,10,11,12,13,14,15] have been proposed to determine the cause-effect/disease-symptom relation from documents as unstructured data, without addressing the construction of a cause-effect/disease-symptom knowledge graph, except [7,12,13,14,15]. Girju [6] determined a causal relation based on a lexico-syntactic pattern (NP1 causal-verb NP2) by decision tree learning. The cause/disease and effect/symptom occurrences are therefore based on noun phrases as NP1 and NP2, respectively, within one sentence. In contrast, the causal relation based on one complex sentence [7] was determined/extracted by using a cue-phrase set (a word, a phrase, or a word pattern) for connecting two NPs as a cause and an effect, including probabilities. The extracted causal relations [7] were used for constructing the causal network as a knowledge graph for the term protein with two relations, the causal relation and the hypernym relation, without addressing high-dimensional effect feature sets. Moreover, Riaz and Girju [8] used a set of linguistic features and Integer Linear Programming to learn a causal relation within one sentence from annotated verb_effect-noun_cause pairs on verb phrases, using expert annotations together with FrameNet and WordNet to generate a training corpus. For example: “A campaign has started to try to cut the rising number of children dying [_cause from solvent abuse].” where the pair “dying_solvent abuse” encodes causality by the annotation. Reference [9] applied the Restricted Hidden Naïve Bayes model to the lexico-syntactic pattern (NP1 ConnectiveVerb NP2, where NP1 and NP2 are noun phrases as a cause and an effect, respectively, or vice versa) of each sentence to learn the classes, annotated by experts on 26 feature templates categorized into four feature groups: the contextual, syntactic, positional, and connective features.
They extracted/determined the causality with a precision of 0.873 and a recall of 0.841 from English documents. Reference [10] also extracted/determined the causal relation within one sentence (where causes and effects were based on noun expressions) by using linguistic rules along with Bayesian inference to reduce the number of pairs produced by ambiguous patterns, whilst [11] used manually annotated syntactic patterns within one sentence or between two sentences having a cause-effect link/relation. The cause-effect links were extracted or determined by a syntactic pattern-based algorithm from scientific papers with an average precision and recall of 47% and 70%, respectively. The determined cause-effect links were applied to represent the core of scientific papers as a summarization. Reference [12] extracted disease-symptom relationships from texts by using syntactic patterns chosen through quality- and specificity-based selection from several determined syntactic patterns (where each syntactic pattern was determined on the dependency graph of a sentence containing both a disease entity and a symptom entity based on a noun term or a noun phrase). Reference [13] automatically constructed health knowledge graphs of disease-symptom connections by using logistic regression, naive Bayes, and a Bayesian network using noisy OR gates to learn and determine the connection or relation between the disease codes and the symptom concepts from textual data of patient medical records, with a precision of 0.85, a recall of 0.6, and an F1-score of 0.704, where the positive/negative classes were based on the actual symptom occurrence in the textual data. However, the symptom expressions in the textual data were based on noun phrases, e.g., “(The patient)/NP1 (complains of (a worsening cough)/NP2)/VP” and “(He)/NP1 (also has (a dry cough)/NP2)/VP.”.
Reference [14] determined each disease-symptom relation and also symptom names within one sentence by learning of a multi-column convolutional neural network (MCNN) based on a human annotator from 50,000 random sentences of the Japanese web texts. MCNN had an input sentence which was divided into five consecutive word sequences: a symptom candidate name (SYMname), a given disease name (DISname), the word sequence before the SYMname, the word sequence between the SYMname and the DISname, and the word sequence after the DISname. Their proposed symptom name extraction method achieved a 93.8% F1-score, and the disease–symptom relationship extraction method achieved an 88.3% F1-score where the extracted symptom names were based on noun or noun-phrase expressions. Reference [15] used the PubMed bibliographic literature database and the association between symptoms and diseases in the MeSH metadata fields of PubMed to determine the disease-symptom relationships where symptoms and diseases were based on noun phrases. They applied the term frequency-inverse document frequency to measure the strength of an association between symptom_i and disease_j for constructing the disease-symptom network without concerning the high dimensional symptom features. Moreover, Ref. [16] recognized only the medical symptom expressions on patient texts without determining the disease-symptom relation. They applied the sentence/phrase templates based on either a noun phrase or a verb phrase including the symptom concepts labeled by the experts to capture the surface of symptom expressions from the patient text. The machine learning techniques were applied for the multi-label classification of symptoms including the long tail symptoms from the surface of the symptom expressions. The [16] model achieved a 76% F1-score.

However, the causative-concept and effect-concept features of the previous works [6,7,8,9,10,11,12,13,14,15] are based on at least one word of either a verb-term expression or a noun-term/noun-phrase expression, without considering composite variables. The cause-effect relation/association determinations of [6,7,8,9,10,11,12,13,14,15], except [13], are based on machine learning with expert or human annotation, whereas CErel in our research is based on automatic supervised learning. In the few research works [7,12,13,14,15] on cause-effect graph/network construction from unstructured data, i.e., texts, the cause-effect graphs/networks are based on each CErel connection between the node of the causative-concept feature and the node of the effect-concept feature, without addressing the high dimensionality of the feature set, particularly the effect-concept feature set, even though their corpora are large. In contrast, our DSKG construction (which is based on the CErel connection between the <d_i> node and the <s_n₁, s_n₂, …, s_n_η> node containing the group of the correlated s_c features as an occurrence of multiple symptoms) involves the high-dimensional feature problem.

3. Problems of DSKG Construction

There are three main problems that must be solved: how to determine the wc_Symij features on the EDU_h,Symij occurrences without concept annotation on the d_i documents, how to determine the d_i-wc_Symij pairs having CErel with automatic supervised learning, and how to construct the DSKG based on the correlated s_c features with the high dimensional feature problem.

3.1. How to Determine wc_Symij Features on EDU_h,Symij Occurrences without Concept Annotation on d_i Documents

According to the corpus behavior study of the health-care domain, most of the symptom concepts on the EDU_h,Symij occurrences are event or state expressions conveyed by verb phrases, where each verb phrase contains a predicate verb/predicate-verb term (v_a; a = 1, 2, ..., numberOfpredicate) which is used to identify the EDU_h,Symij expressions among the EDU_h expressions, as shown in the following (a)–(d) examples (other than Figure 3). These examples also expose a further problem: the same concept can appear with different verb phrase expressions, resulting in different wc expressions. The examples include the phonetic expression from http://translate.google.com/ (accessed on 15 February 2021).

Example 1:

(a): EDU: “[คนไข้] ปวดกล้ามเนื้อ” ([A patient] has a muscle pain.)
“([(คนไข้/Khnk̄hị̂)/patient])/NP1
(((ปวด/pwd)/pain)/Verb_strong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun)/VP”
(b): EDU: “ผู้ป่วยมีอาการปวดกล้ามเนื้อเล็กน้อย” (A patient has a symptom of mild muscle pain.)
“((ผู้ป่วย/P̄hū̂ p̀wy)/patient)/NP1 (((มี/mī)/has)/Verb_weak ((อาการ/xākār)/symptom)/Noun ((ปวด/pwd)/pain)/Verb_strong ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun ((เล็กน้อย/lĕkn̂xy)/mild)/Adj)/VP”
(c): EDU: “ผู้สูงอายุมีอาการเหนื่อย” (An elder has a tired symptom.)
“((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/NP1 (((มี/mī)/has)/Verb_weak ((อาการ/xākār)/symptom)/Noun ((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verb_strong)/VP”
(d): EDU: “[ผู้ป่วย] เหนื่อยมาก” ([A patient] is very tired.)
“([(ผู้ป่วย/P̄hū̂ p̀wy)/patient])/NP1 (((เหนื่อย/h̄enụ̄̀xy)/be tired)/Verb_strong ((มาก/māk)/very)/Adv)/VP”

where: (a) and (b) examples and (c) and (d) examples have different verb phrase expressions with the same major symptom concepts of “(ปวดกล้ามเนื้อ/pwd kl̂ām neụ̄̂x)/pains in muscle” and “(เหนื่อย/h̄enụ̄̀xy)/be tired”, respectively; and the [..] symbol means ellipsis.

Following [22], the predicate-argument pattern is verb(agent_argument, patient/information_argument), where verb is an element of a predicate-verb term set; agent_argument is an element of an agent term set; and patient/information_argument is an element of a patient/information term set. Based on the predicate-argument pattern, we apply the following wc pattern to each EDU occurrence, following Figure 2, to obtain the wc expressions for automatically determining the wc_Symij features after the EDU_h,Symij identification from the d_i documents by the predicate-verb term set.

wc pattern: V + W1 + W2

where:

V is a predicate-verb term set; V = Verb_strong ∪ V_inf; v_a ∈ V. Since v_weak,b has a weak symptom concept, w_inf,c as an information word is added to v_weak,b to form a strong symptom concept which is an element of V_inf, i.e., (v_weak,b + w_inf,c) ∈ V_inf, where v_weak,b ∈ Verb_weak; w_inf,c ∈ (Noun ∪ Adj ∪ Verb_strong); w_inf,c is the word right after v_weak,b; and a, b, c are integer indices.

W1 is an agent term set; w_1,g ∈ W1; w_1,g is a head noun or a Noun element of NP1; and g is an integer index.

W2 is a linguistic patient/information term set; w_2,e ∈ W2; W2 = Noun ∪ Adj ∪ Adv; w_2,e is a word sequence right after v_a; w_2,e has a null value if it does not exist; and e is an integer index.

Moreover, the concept elements of the Verb_strong, Verb_weak, V_inf, Noun, Adj, and Adv sets of Figure 2 are first prepared and collected from the results of Thai-word and Thai-EDU segmentation on the translated terms (English to Thai by the Lexitron Dictionary) of the medical-symptom-expression list on the Wikipedia and MeSH web sites. The segmented Thai words are then translated from Thai to English by the Lexitron Dictionary and WordNet to collect a concept for each element in the Verb_strong, Verb_weak, V_inf, Noun, Adj, and Adv sets in Figure 2. If a segmented Thai word has several English word senses, the expert selects the corresponding symptom concept in English (see ii in Section 4.2).

For each corpus of our research, the wc_Symij feature (which is presented in the form of the predicate-argument pattern) is then determined from each term of the wc expression, including the Thai-to-English translation by the collected element concepts of the Verb_strong, Verb_weak, V_inf, Noun, Adj, and Adv sets. When obtaining the wc expression for the wc_Symij feature determination, a w_1,g ellipsis is resolved by the following rule: w_1,g is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis. To handle the problem of different verb phrase expressions with the same symptom concept, we apply another rule to obtain the wc expression with the actual symptom expression: if “v_weak,b + w_inf,c” is “มี/have + อาการ/symptom”, we take the next two words right after “v_weak,b + w_inf,c” to be v_a and w_2,e as the actual symptom expression instead, as shown below for the previous (b) and (c) in Example 1, including each term translation on the wc expression from Thai to English by using the collected element concepts in the Verb_strong, Verb_weak, V_inf, Noun, Adj, and Adv sets.

(b): V + W1 + W2 = ((ปวด/pwd)/pain)/Verb_strong +([(ผู้ป่วย/P̄hū̂p̀wy)/patient])/Noun + ((กล้ามเนื้อ/kl̂ām neụ̄̂x)/muscle)/Noun
(c): V + W1 + W2 = ((เหนื่อย/h̄enụ̄̀xy)/beTired)/Verb_strong + ((ผู้สูงอายุ/P̄hū̂ s̄ūngxāyu)/elder)/Noun + null

Moreover, some general-concept rules are applied to the acquired wc expressions for the general wc_Symij feature presentation: the w_1,g concept is “person” if w_1,g is in the Person set, {‘คนไข้,ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, …}. In addition, if the term in w_2,e is a concept of an element in the Symptom-Expression-Level set, {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}, w_2,e takes a null value for a general symptom concept. Thus, the wc_Symij features of the EDUs’ wc expressions of (a)–(d) in Example 1 are represented by the predicate-argument pattern as follows: (a) pain([person],muscle), (b) pain(person,muscle), (c) beTired(person), (d) beTired([person]), where examples (a) and (b), and examples (c) and (d), have different EDU verb-phrase expressions but the same symptom concepts of pain(person,muscle) and beTired(person), respectively.

Therefore, after each EDU_h,Symij occurrence on the d_i document has been identified by the predicate-verb term (v_a) followed by the w_1,g, and w_2,e of the wc pattern, the wc_Symij feature is automatically determined by translating all wc expression terms (v_a, w_1,g, w_2,e) from Thai to English by the collected concept elements of the Verb_strong, Verb_weak, V_inf, Noun, Adj, and Adv sets.
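As a minimal sketch of the general-concept rules described above (not the authors' implementation), the following Python fragment maps already-translated wc terms to the predicate-argument pattern. The function name `wc_sym_feature` and the small `PERSON` and `SYMPTOM_LEVEL` sets are hypothetical stand-ins for the collected concept sets.

```python
# Hypothetical, tiny stand-ins for the Person and
# Symptom-Expression-Level sets described in the text.
PERSON = {"patient", "elder", "infant", "child", "teenager"}
SYMPTOM_LEVEL = {"very", "little", "moderately"}


def wc_sym_feature(v_a, w_1g, w_2e=None):
    """Build a predicate-argument string from translated wc terms.

    v_a  : predicate-verb concept, e.g. "pain"
    w_1g : agent concept taken from NP1, e.g. "patient"
    w_2e : optional second term; None when the EDU has none
    """
    # Members of the Person set are generalized to the concept "person".
    arg1 = "person" if w_1g in PERSON else w_1g
    # Degree words carry no symptom concept, so they become null.
    arg2 = None if w_2e in SYMPTOM_LEVEL else w_2e
    args = [arg1] if arg2 is None else [arg1, arg2]
    return f"{v_a}({','.join(args)})"


feature_b = wc_sym_feature("pain", "patient", "muscle")
feature_c = wc_sym_feature("beTired", "elder")
```

Under these assumptions, expressions (b) and (c) of Example 1 map to pain(person,muscle) and beTired(person), matching the patterns listed above.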

3.2. How to Determine d_i-wc_Symij Pairs Having CErel with Automatic Supervised Learning

We apply SVM, NB, and LR to learn the d_i-wc_Symij pairs having CErel by automatic-supervised learning on the learning corpus, where the positive and negative instances (the CErel and nonCErel classes, respectively) are assigned by the Cartesian product of DS × SG aligned with the disease types. Thus, the downloaded disease documents are separated into two groups according to two disease types, an infectious disease type (Type_t = Type₁) and a non-infectious disease type (Type_t = Type₂). Each disease type contains the d_t,i-SymGroup_dt,i pairs having the CErel connections, which link the d_t,i features to the corresponding SymGroup_dt,i features determined from the d_t,i documents (where d_t,i is d_i in Type_t; SymGroup_dt,i is the SymGroup_di obtained for d_t,i from the d_t,i document; and wc_Symt,ij is wc_Symij of d_t,i). Thus, Sym_t,ij (which is a Sym_ij element in Type_t) is a SymGroup_dt,i element represented by wc_Symt,ij as follows:

SymGroup_dt,i = {wc_Symt,i₁, wc_Symt,i₂, .., wc_{Symt,ilast_i}}_{of_dt,i}

(1)

where i = 1, 2, ..., numofDiseaseElements_t; j = 1, 2, ..., last_i.

Moreover, there are some downloaded d_t,i documents containing both the d_t,i symptom expressions and the symptom expressions of the d_t,i complications. Therefore, if the element of the complicationTerm set is identified on the d_t,i document, then all symptoms, wc_Symt,ij, right after the occurrence of the complicationTerm set element, are excluded.

The positive/negative instances of the d_t,i wc_Symt,ij pairs from each d_t,i-SymGroup_dt,i pair are formed from the Cartesian product of DS × SG as follows: each d_t,i wc_Symt,ij pair on a given d_t,i-SymGroup_dt,i pair is a positive instance (the CErel class) if d_t,i and SymGroup_dt,i have the same Type_t. Conversely, each d_t,i wc_Symt,ij pair on a given d_t,i-SymGroup_dt,i pair is a negative instance (the nonCErel class), based on the Type_t of the d_t,i document, if d_t,i and SymGroup_dt,i have different Type_t (see Figure 4).

DS = {D_t}

(2)

where D_t is a disease name set in Type_t; t = 1, 2;

D_t = {d_t,i}

(3)

where i = 1, 2, ..., numofDiseaseElements_t;

SG = {SymGrp_t}

(4)

SymGrp_t = {SymGroup_dt,i}

(5)

where SymGroup_dt,i = {wc_Symt,i₁, wc_Symt,i₂, .., wc_{Symt,ilast_i}}_{of_ dt,i}.

According to the automatic-supervised learning by NB, SVM, and LR to each disease type from the learning corpus, we then determine the d_t,i wc_Symt,ij pairs having CErel on the d_t,i documents of each disease type from the test corpus. Thus, the d_t,i-SymGroup_dt,i pair with CErel is determined by grouping all determined d_t,i wc_Symt,ij pairs having CErel by the same d_t,i from the d_t,i document. All d_t,i-SymGroup_dt,i pairs having CErel are then used for constructing the DSKG.
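The labeling rule of this section can be illustrated with a short Python sketch. It reflects one possible reading of the rule, which is an assumption here: a disease paired with its own symptom-concept group yields positive (CErel) instances, and a disease paired with the symptom concepts of the other disease type yields negative (nonCErel) instances. The toy data and the name `build_instances` are hypothetical.

```python
from itertools import product

# Hypothetical toy data: disease -> symptom-concept group, per disease type.
SYM_GROUPS = {
    1: {"influenza": ["fever(person)", "cough(person)"]},       # Type_1
    2: {"diabetes": ["beThirsty(person)", "beTired(person)"]},  # Type_2
}


def build_instances(sym_groups):
    """Label (disease, symptom) pairs derived from the DS x SG product."""
    instances = []
    types = sorted(sym_groups)
    # Each (disease type, symptom-group type) combination of DS x SG.
    for d_type, s_type in product(types, types):
        for disease in sym_groups[d_type]:
            if d_type == s_type:
                # Same type: pair the disease with its own symptom group.
                symptoms = sym_groups[d_type][disease]
                label = "CErel"
            else:
                # Different type: pair it with symptoms of the other type.
                symptoms = [s for g in sym_groups[s_type].values() for s in g]
                label = "nonCErel"
            instances.extend((disease, s, label) for s in symptoms)
    return instances


instances = build_instances(SYM_GROUPS)
```

No manual annotation is needed in this sketch because the class label follows entirely from the disease types, which is the sense in which the learning is "automatic-supervised".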

3.3. How to Construct DSKG with the Problem of High Dimensionality of Symptom-Concept Features

After determining the d_t,i-SymGroup_dt,i pairs having CErel from all downloaded d_t,i documents, there are many different s_c features from the union of all SymGroup_dt,i as a high dimensional feature space of S (or the high dimensional symptom concepts) to form the DSKG from several different disease-name concepts (d_t,i). The high dimensional symptom concepts result in difficulty in finding the relevant disease symptom concepts for constructing the concise DSKG. Therefore, we propose using PCA to solve the high dimensionality problems. According to PCA by [27], a variance and a covariance are defined as follows:

The variance of a random variable is defined as

var (X_{1}) = E (X_{1}^{2}) - (E (X_{1}))^{2}

(6)

where (6) shows how variance measures the average deviation from the mean value. When we have more than one random variable, it is useful to analyze the covariance:

cov (X_{1}, X_{2}) = E (X_{1} X_{2}) - E (X_{1}) E (X_{2})

(7)

If the covariance is zero, which is equivalent to saying that the correlation coefficient is zero, the variables are said to be uncorrelated. The variances and covariances of the elements of a random vector x are often connected to a covariance matrix (C(x)) whose a,b-th element is simply the covariance of X_a and X_b:

C (x) = \begin{pmatrix} cov (X_{1}, X_{1}) & cov (X_{1}, X_{2}) & \dots & cov (X_{1}, X_{m}) \\ cov (X_{2}, X_{1}) & cov (X_{2}, X_{2}) & \dots & cov (X_{2}, X_{m}) \\ \vdots & \vdots & \ddots & \vdots \\ cov (X_{m}, X_{1}) & cov (X_{m}, X_{2}) & \dots & cov (X_{m}, X_{m}) \end{pmatrix}

(8)
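Equations (6)–(8) can be checked numerically with a few lines of plain Python. This is only an illustrative sketch: the two count vectors are hypothetical, and the population form E(XY) − E(X)E(Y) is used exactly as written in Equation (7).

```python
def mean(values):
    """Expected value E(X) of a finite sample."""
    return sum(values) / len(values)


def cov(xs, ys):
    """Covariance E(XY) - E(X)E(Y), as in Equation (7).

    With xs == ys this reduces to the variance of Equation (6).
    """
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def cov_matrix(columns):
    """Covariance matrix C(x) whose (a, b) entry is cov(X_a, X_b)."""
    return [[cov(a, b) for b in columns] for a in columns]


# Hypothetical counts of two symptom concepts across four documents.
s1 = [1, 2, 3, 4]
s2 = [2, 4, 6, 8]
C = cov_matrix([s1, s2])
```

For these toy vectors the matrix is symmetric, its diagonal holds the two variances, and the off-diagonal entry is non-zero because the two features are perfectly correlated.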

The diagonal of the covariance matrix gives the variances, so the covariance matrix is essentially a generalization of variance to random vectors. In our research, X_a and X_b are the s_a and s_b features of the m × m symptom-concept feature matrix (where m is the number of different symptom-concept features; m > 100). Therefore, 〈s₁, s₂, …, s_m〉 is a symptom-concept feature vector in which s_c ∈ S and c = 1, 2, …, m. The symptom-concept feature vector is then rotated to group its features into the minimum number of separated feature groups, where each separated feature group is called “Fgroup_z”; z = 1, 2, …, numofFeatureGroups, and numofFeatureGroups is less than m. After the rotation, a feature loading weight from an eigenvector is determined for each s_c feature with respect to Fgroup_z. A high feature loading weight of s_c on Fgroup_z implies that the correlation between s_c and Fgroup_z is high. The different s_c feature elements with high feature loading weights in a given Fgroup_z are wrapped into a factor (called “Factor_z”) together with its factor score (called “FactorScore_z”), which is determined by Equation (9) from the feature loading weights of the wrapped s_c feature elements.

FactorScore_{z} = \sum_{c = 1}^{k_{z}} w_{c} \left(\frac{x_{c} - \bar{x}_{c}}{SD_{c}}\right)

(9)

where

$w_{c}$ is a feature loading weight of $s_{c}$ from an eigenvector in $Factor_{z}$;
$s_{c}$ is a symptom-concept ($wc_{Symij}$) feature element within $Factor_{z}$;
$c$ = 1, 2, …, $k_{z}$;
$k_{z}$ is the number of different symptom-concept features in $Factor_{z}$;
$x_{c}$ is an original value of the number of each $s_{c}$, with its mean $\bar{x}_{c}$ and standard deviation $SD_{c}$.

Therefore, the symptom-concept feature space is reduced from m to numofFeatureGroups, which results in a less time consuming way to find the relevant s_c feature elements as the common symptom concepts among the certain d_i occurrences to construct the DSKG (see Section 4.5).
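A small worked instance of Equation (9) may help fix the notation. The sketch below simply evaluates the weighted sum of standardized values; the function name `factor_score` and all numbers are hypothetical, and the means and standard deviations are passed in rather than estimated, since the text does not say which estimator is used.

```python
def factor_score(weights, values, means, sds):
    """Evaluate Equation (9): sum over c of w_c * (x_c - mean_c) / SD_c.

    weights : feature loading weights w_c of the wrapped features
    values  : original values x_c for one disease name
    means   : mean of each feature
    sds     : standard deviation SD_c of each feature (non-zero)
    """
    return sum(
        w * (x - m) / sd
        for w, x, m, sd in zip(weights, values, means, sds)
    )


# Hypothetical factor wrapping two symptom-concept features.
score = factor_score(
    weights=[0.8, 0.5],
    values=[3.0, 1.0],
    means=[1.0, 1.0],
    sds=[2.0, 1.0],
)
```

Here the first feature lies one standard deviation above its mean and the second lies at its mean, so only the first loading weight contributes to the score.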

4. System Overview

There are five steps in our framework: Corpus Preparation, Determination of d_t,i and wc_Symt,ij Features, Automatic Supervised Learning of d_t,i wc_Symt,ij Pairs Having CErel, Determination of d_t,i wc_Symt,ij Pairs Having CErel for Collecting d_t,i-SymGroup_dt,i Pair with CErel, and DSKG Construction, as shown in Figure 5.

4.1. Corpus Preparation

This step is the preparation of two disease-symptom corpora from the downloaded disease documents having the disease names as the document topic names on two medical web-board resources; one disease-symptom corpus downloaded from the Thai-Healthcare-Knowledge web-board resource is used as the learning corpus, and the other one downloaded from the Thai-hospital web-board resource is used as the test corpus. Each disease-symptom corpus consists of the same disease name concepts from 70 different disease-name documents (or the 70 d_i documents) on each medical web-board resource where the documents associated with various infections, strokes, kidneys, diabetes, cardio- and vascular diseases are randomly selected from about 700 and 400 different disease-name documents of the Thai-Healthcare-Knowledge web-board resource and the Thai-hospital web-board resource, respectively. The selection of these diseases is motivated by the rapidly increasing number of patient cases in Thailand, e.g., diabetes [28]. This step involves using Thai word segmentation tools [29] including named entity recognition [30,31] on each disease-symptom corpus. After the word segmentation has been achieved on the corpora, EDU segmentation is then dealt with [32,33]. The learning corpus and the test corpus then have 12,000 EDUs and 10,000 EDUs, respectively. With regard to each medical web-board resource, each disease-symptom corpus consists of 33 different disease names in Type₁ (or 33 d_1,i documents of the infectious disease type) and 37 different disease names in Type₂ (or 37 d_2,i documents of the non-infectious disease type). Thus, a sample size by random sampling for evaluating the symptom concept determination on the test corpus is about 35 different d_i documents (which consisted of both disease types) by Equation (10) [34].

n = \frac{N}{1 + N E^{2}}

(10)

where

N is the population size;
E is the error;
n is the sample size.
If N = 70 and $E \approx 10\%$, then $n = 35$.
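Equation (10) is easy to evaluate directly. The sketch below assumes the formula is read as N divided by (1 + N·E²); the function name `sample_size` is hypothetical.

```python
def sample_size(population, error):
    """Evaluate Equation (10), read as n = N / (1 + N * E**2)."""
    return population / (1 + population * error ** 2)


n_at_10_percent = sample_size(70, 0.10)
n_at_12_percent = sample_size(70, 0.12)
```

Under this reading, N = 70 with E = 0.12 gives a value that rounds to 35, whereas E = 0.10 gives about 41, so the error quoted in the text as "about 10%" appears to be an approximate figure.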

Moreover, all instances from the results of the Cartesian product of DS × SG on the correct symptom-concept determination and the correct disease-name recognition [30,31] from the learning corpus are used for an automatic-supervised learning of the d_t,i wc_Symt,ij pairs having CErel within each disease type, Type_t: Type₁ and Type₂, based on tenfold cross validation (see Section 4.3). The learning results are used to determine the d_t,i wc_Symt,ij pairs having CErel from the test corpus, whilst the correct d_t,i wc_Symt,ij pairs having CErel are collected into the d_t,i-SymGroup_dt,i pair with CErel having the same d_t,i for the DSKG construction.

4.2. Determination of d_t,i and wc_Symt,ij Features

The objective of this step is to determine the d_t,i and wc_Symt,ij features from the learning corpus and the test corpus of each disease type (Type₁, Type₂) for Automatic-Supervised-Learning of d_t,i wc_Symt,ij Pairs Having CErel in Section 4.3 and Determination of d_t,i wc_Symt,ij Pairs having CErel for Collecting d_t,i-SymGroup_dt,iPairWithCErel in Section 4.4, respectively.

i.: Determine d_t,i Features

The disease name feature, d_t,i, from the d_t,i document having d_i as the topic name in Type_t is determined by the named-entity recognition [30,31] in each disease type of both the learning corpus and the test corpus from the previous step of “Corpus Preparation” (Section 4.1) and then the named-entity translation from Thai to English by using the Lexitron dictionary. The determined d_t,i features from the learning corpus are then collected into the D_t set for Automatic-Supervised-Learning in Section 4.3.

ii.: Determination of wc_Symt,ij Features

With respect to the wc pattern, it is necessary to prepare and collect the Verb_strong, Verb_weak, Noun, Adj, and Adv sets of Figure 2 before determining the wc_Symt,ij features as follows:

●: Preparation and collection of the Verb_strong, Verb_weak, Noun, Adj, and Adv sets: Each element concept in the Verb_strong, Verb_weak, Noun, Adj, and Adv sets on Figure 2 is prepared and collected from several terms on the medical-symptom-expression list of the Wikipedia and MeSH web sites after translating these terms from English to Thai by the Lexitron dictionary as shown in Table 1.
From Table 1, some translated terms of the noun expressions on the medical-symptom-expression list (from English to Thai by the Lexitron Dictionary) are presented as EDU or noun-phrase expressions in Thai, for example:
Example 2. “arrhythmia” = “หัวใจเต้นผิดจังหวะ/H̄ạwcı tên p̄hid cạngh̄wa” as a Thai sentence/EDU expression which is presented in the following with the part of speech after using word and EDU segmentation tools [29,32,33]:
EDU: ((หัวใจ/H̄ạwci)/Noun)NP1 ((เต้น/tên)/Verb_strong (ผิดจังหวะ/p̄hid cạngh̄wa)/Adv)/VP
These segmented words of the EDU segmentation are then translated from Thai to English by the Lexitron dictionary and WordNet for collecting the element concepts in the Verb_strong, Noun, and Adv sets (on Figure 2) as shown in the following:
((หัวใจ/H̄ạwci)/heart)/Noun, ((เต้น/tên)/pulse)/Verb_strong, and ((ผิดจังหวะ/p̄hid cạngh̄wa)/irregularly)/Adv,
Example 3. “palpitation” = “อาการใจสั่น/Xākār Cı s̄ạ̀n” as a Thai noun phrase expression which is presented in the following with the part of speech after using the word segmentation tool [29]:
((อาการ/Xākār)/Noun (ใจ/Cı)/Noun (สั่น/s̄ạ̀n)/Verb_strong)/NP
The results of this NP’s term translation from Thai to English by the Lexitron dictionary and WordNet for collecting the element concepts in the Noun and Verb_strong sets (on Figure 2) are ((อาการ/Xākār)/symptom)/Noun, ((ใจ/Cı)/heart)/Noun, and ((สั่น/s̄ạ̀n)/shake)/Verb_strong.
According to the Thai to English translation of Example 2 and Example 3, if the segmented Thai word has several English concepts, the expert will select the corresponding English concept for collecting the element concepts.
●: Determination of wc_Symt,ij features from each d_t,i document: After stemming words and eliminating stop words of either the learning corpus or the test corpus on each disease type, EDU_h,Symt,ij of the d_t,i document is identified by the predicate-verb term (v_a ∈ Verb_strong ∪ V_inf). The wc expression of EDU_h,Symt,ij is then obtained by V, W1, and W2 of the wc pattern (see Section 3.1) including the following general rules (R1, R2):
R1: w_1,g is a Noun element of the previous EDU’s NP1 if the current EDU’s NP1 is an ellipsis (where w_1,g ∈ W1).
R2: if (((v_weak,b + w_inf,c) ∨ (v_strong,f + w_2,e)) ∈ Symptom-Cue) ∧ (Symptom-Cue = {‘มี/have + อาการ/symptom’, ‘เป็น/be + อาการ/symptom’, ‘เกิด/occur + อาการ/symptom’}), then we take the two words immediately after “v_weak,b + w_inf,c” or “v_strong,f + w_2,e” as the new v_a + w_2,e expression, i.e., as the symptom expression, where v_weak,b ∈ Verb_weak; w_inf,c ∈ Noun ∪ Adj ∪ Verb_strong; v_strong,f ∈ Verb_strong; w_2,e ∈ W2; v_a ∈ Verb_strong ∪ V_inf; (v_weak,b + w_inf,c) ∈ V_inf; and a, b, c, e, and f are integer element indices.
The wc_symt,ij features of the wc expressions are automatically determined by R3, R4, and R5 as the concept rules and are represented by the predicate-argument pattern after the wc expressions of the disease-symptom documents are translated from Thai to English by the collected element concepts in the Verb_strong, Verb_weak, Noun, Adj, and Adv sets.
R3: if (w_1,g ∈ Person) ∧ (Person = {‘คนไข้, ผู้ป่วย/patient’, ‘ผู้สูงอายุ/elder’, ‘ทารก/infant’, ‘เด็ก/child’, ‘วัยรุ่น/teenager’, ……}), then the w_1,g concept is “person”.
R4: if w_2,e ∈ Symptom-Expression-Level, then w_2,e has a null value for a general concept (where Symptom-Expression-Level = {‘มาก/very’, ‘เล็กน้อย/little’, ‘ปานกลาง/moderately’,…}).
R5: if (v_weak,b = ‘รู้สึก/feel’) ∧ (w_inf,c = v_strong,f) ∧ (v_strong,f ∈ Verb_strong), then (v_weak,b + w_inf,c) = v_strong,f.
For example, the concept of (รู้สึก/feel)/Verb_weak + (ปวด/pain)/Verb_strong is equivalent to “pain” as shown in the following EDU with the wc_symt,ij feature as pain(person, stomach).
EDU: “คนไข้รู้สึกปวดกระเพาะอาหาร/The patient feels pain in the stomach”.
((คนไข้/Khnk̄hị̂)/patient)/NP1 (((รู้สึก/rū̂s̄ụk)/feel)/Verb_weak ((ปวด/pwd)/pain)/Verb_strong ((กระเพาะอาหาร/krapheāa xāh̄ār)/stomach)/Noun)/VP.
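Rules R2 and R5 can be sketched on already-translated verb-phrase concepts as follows. This is an illustrative simplification, not the authors' code: the function `resolve_predicate`, the small `VERB_STRONG` set, and the representation of a verb phrase as a flat list of English concepts are all assumptions.

```python
# Hypothetical subset of the Verb_strong set and the Symptom-Cue set of R2.
VERB_STRONG = {"pain", "shake", "pulse", "beTired"}
SYMPTOM_CUES = {("have", "symptom"), ("be", "symptom"), ("occur", "symptom")}


def resolve_predicate(vp):
    """Return (v_a, w_2e) from a list of translated VP concepts.

    R2: a leading symptom cue is skipped and the next two words are used.
    R5: 'feel' followed by a strong verb collapses to the strong verb.
    """
    if tuple(vp[:2]) in SYMPTOM_CUES:  # R2
        vp = vp[2:]
    if len(vp) >= 2 and vp[0] == "feel" and vp[1] in VERB_STRONG:  # R5
        vp = vp[1:]
    v_a = vp[0] if vp else None
    w_2e = vp[1] if len(vp) > 1 else None
    return v_a, w_2e


r5_example = resolve_predicate(["feel", "pain", "stomach"])
r2_example = resolve_predicate(["have", "symptom", "pain", "muscle"])
```

With these assumptions, the EDU "The patient feels pain in the stomach" yields the predicate "pain" with the argument "stomach", which together with R3 gives the pain(person, stomach) feature shown above.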

With regard to complications, if an element of the complicationTerm set is identified, then all wc_Symt,ij features right after the occurrence of that complicationTerm element are excluded.

Therefore, all wc_Symt,ij features of the wc expressions from each d_t,i document of the learning corpus are determined and grouped into the corresponding symptom-concept group of the d_t,i feature (as SymGroup_dt,i) for the automatic supervised-learning step of d_t,i wc_Symt,ij Pairs having CErel in Section 4.3.

4.3. Automatic-Supervised-Learning of d_t,i wc_Symt,ij Pairs Having CErel

Each d_t,i feature and the SymGroup_dt,i elements as the wc_Symt,ij features determined from the previous step are used for this step of the automatic supervised learning, where Type_t (t = 1) is the infectious disease type, and Type_t (t = 2) is the non-infectious disease type. The Type_t’s learning corpus, containing several instances of d_t,i wc_Symt,ij pairs resulted from the Cartesian product of DS × SG, is used for learning d_t,i wc_Symt,ij pairs having CErel by NB, SVM, and LR on each Type_t. The positive instance (CErel class) of each d_t,i wc_Symt,ij pair is formed if Type_t of D_t and Type_t of SymGrp_t are the same; otherwise, the negative instance (nonCErel class) is formed according to Type_t of D_t (see Section 3.2).

After the Cartesian product of DS × SG is computed on the learning corpus with the 70 d_i documents, the positive and negative instances of each Type_t are selected by random sampling so that the numbers of positive and negative instances are approximately equal and all 70 d_i features are covered. The sizes of the Type_1 learning sample and the Type_2 learning sample are 1878 instances and 2125 instances, respectively, each containing both positive and negative instances.
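The class-balanced sampling step might look like the following sketch. It is an assumption-laden illustration: the function `balanced_sample` and the fixed seed are hypothetical, and, unlike the procedure described above, this simple version does not enforce that every d_i feature is covered.

```python
import random


def balanced_sample(instances, per_class, seed=0):
    """Draw an equal number of CErel and nonCErel instances at random.

    instances : list of (disease, symptom, label) tuples
    per_class : requested number of instances per class
    seed      : fixed only to make this sketch reproducible
    """
    rng = random.Random(seed)
    positives = [x for x in instances if x[2] == "CErel"]
    negatives = [x for x in instances if x[2] == "nonCErel"]
    # Never ask for more instances than the smaller class can supply.
    k = min(per_class, len(positives), len(negatives))
    return rng.sample(positives, k) + rng.sample(negatives, k)
```

Balancing the two classes in this way keeps the classifiers from being dominated by the negative instances, which the Cartesian product tends to produce in larger numbers.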

NB learning [24]: the feature sets, D_t and SymGrp_t, exist in the d_t,i wc_Symt,ij pairs of the positive/negative instances with the CE-rel/nonCE-rel class, respectively, formed by the automatic supervised learning on each disease-type learning sample. The learning results of this step by using Weka [35] are the feature probabilities of d_t,i and wc_Symt,ij in Type_t where each wc_Symt,ij feature is represented by the symptom-concept code (see Table 2).

SVM learning [23] with the linear kernel: The linear function, f(x), of the input x = (x₁…x_n) assigned to the positive class if f(x) ≥ 0, and otherwise to the negative class if f(x) < 0, can be written as follows:

\begin{matrix} f (x) = \langle w \cdot x \rangle + b \\ = \sum_{k = 1}^{n} w_{k} x_{k} + b \end{matrix}

(11)

where

x is a dichotomous vector;
w is a weight vector;
b is a bias;
and (w,b)∈Rⁿ × R are the parameters that control the function.

With regard to each disease-type learning sample, the SVM learning determines w_k and b for d_t,i and wc_Symt,ij features (x_k) in each disease-symptom pair (d_t,i wc_Symt,ij) with either the positive class (CE-rel) or the negative class (nonCE-rel) formed by the automatic supervised learning on each disease-type learning sample.
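To make Equation (11) concrete, the sketch below evaluates the linear decision function on a dichotomous input vector. The weights, bias, and the one-hot layout of x (two disease indicators followed by two symptom indicators) are hypothetical and are not the values learned in this research.

```python
def svm_decision(w, x, b):
    """Evaluate f(x) = <w . x> + b of Equation (11)."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b


def svm_class(w, x, b):
    """Positive class (CE-rel) if f(x) >= 0, otherwise negative."""
    return "CE-rel" if svm_decision(w, x, b) >= 0 else "nonCE-rel"


# Hypothetical weights: [disease_A, disease_B, symptom_1, symptom_2].
W = [0.6, -0.4, 0.3, -0.7]
B = -0.5

pair_a1 = svm_class(W, [1, 0, 1, 0], B)  # disease_A with symptom_1
pair_b2 = svm_class(W, [0, 1, 0, 1], B)  # disease_B with symptom_2
```

Because x is dichotomous, f(x) reduces to the sum of the weights of the active disease and symptom features plus the bias.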

LR [25]: The logistic regression model of the research is based on the linear logistic regression with binary vector data. Usually, the input data with any value would be used to establish which attributions are influential in predicting the given outcome with values between 0 and 1, and hence can be interpreted as a probability. The logistic function can be written as:

F (x) = \frac{1}{1 + e^{- (β_{0} + β_{1} x_{1} + β_{2} x_{2})}}

(12)

F(x) is interpreted as the probability of the given outcome to be predicted, where x₁ and x₂ are attribute variables, and β₀, β₁, and β₂ are the model estimators which play the role of momentum for each attribute. The LR learning is to determine β₀, β₁, and β₂ for d_t,i and wc_Symt,ij as the x₁ and x₂ features, respectively, in each disease-symptom pair (d_t,i, wc_Symt,ij) with either the positive/CErel class or the negative/nonCErel class formed by the automatic supervised learning on each disease-type learning sample.

4.4. Determination of d_t,i wc_Symt,ij Pairs Having CErel for Collecting d_t,i-SymGroup_dt,i Pair with CErel

There are three steps in the CErel determination from the test corpus consisting of 70 d_i documents: d_t,i and wc_Symt,ij Determination, Determination of d_t,i wc_Symt,ij Pairs having CErel, and Collection of d_t,i-SymGroup_dt,i Pairs having CErel.

i.: d_t,i and wc_symt,ij Determination

The disease name concept, d_t,i, is determined from the d_t,i document having d_i as the topic name in Type_t by the named-entity recognition [30,31] followed by the named-entity translation from Thai to English by the Lexitron dictionary in each disease type of the test corpus. After stemming words and eliminating stop words of the test corpus on each disease type, the EDU_h,Symt,ij occurrence is identified by the predicate-verb term set (V; V = Verb_strong ∪ V_inf) to each EDU_h occurrence on the d_t,i document. According to R1–R5, each wc_symt,ij feature of the wc expression on the identified EDU_h,Symt,ij occurrence is then determined by translating all terms in the predicate-argument pattern from Thai to English by the collected element concepts in the Verb_strong, Verb_weak, Noun, Adj, and Adv sets. In addition to the complications, if the element of the complicationTerm set is identified, then all wc_Symt,ij features right after the occurrence of the complicationTerm set element are excluded.

ii.: Determination of d_t,i wc_symt,ij Pairs having CErel

The objective of this step is to determine the d_t,i wc_symt,ij pairs having the CErel class by NB, SVM, and LR of each disease type on the test corpus.

NB: The cause-effect relation between the d_t,i features and wc_Symt,ij features through test corpus of each disease type is solved by Equation (13) along with the probabilities of d_t,i and wc_Symt,ij on Table 2.

\begin{matrix} DSympPairClass = \underset{class \in Class}{\arg\max} P (class \mid d_{t,i} \, wc_{Symt,ij}) \\ = \underset{class \in Class}{\arg\max} P (d_{t,i} \mid class) P (wc_{Symt,ij} \mid class) P (class) \end{matrix}

(13)

where

DSympPairClass is a relation class between a disease-name concept and a symptom concept on a disease-symptom pair;
$d_{t,i} \in D_{t}$, where $D_{t}$ is a disease name set in $Type_{t}$;
$t = 1, 2$;
$i = 1, 2, \dots, numofDiseaseElements_{t}$;
$wc_{Symt,ij}$ is a wc concept, particularly a symptom concept of a $wc$ expression on the $d_{t,i}$ document;
$Class = \{\text{'CE-rel'}, \text{'nonCE-rel'}\}$.

If DSympPairClass of Equation (13) is the CE-rel/CErel class, the d_t,i wc_Symt,ij pair is collected into DSP_t (which is a list of disease-symptom pairs having CErel in Type_t) as shown in the algorithm of determining disease-symptom pairs having CErel (Algorithm 1).
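The arg-max of Equation (13) can be sketched in a few lines of Python. The probabilities below are hypothetical placeholders for the learned values of Table 2, the function name `nb_classify` is an assumption, and unseen features are given probability zero for brevity, without any smoothing.

```python
def nb_classify(disease, symptom, prior, p_disease, p_symptom):
    """Return the arg-max class of Equation (13) and the per-class scores."""
    scores = {}
    for cls in prior:
        # P(d | class) * P(wc_Sym | class) * P(class); unseen features get 0.
        scores[cls] = (
            p_disease[cls].get(disease, 0.0)
            * p_symptom[cls].get(symptom, 0.0)
            * prior[cls]
        )
    return max(scores, key=scores.get), scores


# Hypothetical probabilities standing in for the learned values.
PRIOR = {"CE-rel": 0.5, "nonCE-rel": 0.5}
P_DISEASE = {"CE-rel": {"influenza": 0.03}, "nonCE-rel": {"influenza": 0.03}}
P_SYMPTOM = {
    "CE-rel": {"fever(person)": 0.05},
    "nonCE-rel": {"fever(person)": 0.01},
}

label, scores = nb_classify(
    "influenza", "fever(person)", PRIOR, P_DISEASE, P_SYMPTOM
)
```

In this toy setting the pair is assigned to the CE-rel class because the symptom concept is five times more probable under that class while the other terms are equal.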

Algorithm 1 DeterminationOfDiseaseSymptomPairsHavingCErel Algorithm.
`Assume that each EDU is represented by (NP VP) after word stemming and stop-word removal; L_dti is a list of EDUs on d_ti; d_t,i is a disease name of Type_t (t = {1,2}), i = 1, 2, .., numofDiseaseElements_t; DSP_t is a list of disease-symptom pairs with CErel in Type_t; DName_t is a disease name set in Type_t;`
`DETERMINATION_OF_DISEASE_SYMPTOM_PAIRS_HAVING_CEREL`
`DName₁ ← ∅; DName₂ ← ∅;`
`ArrayList<String>[] DSP = new ArrayList[2]; String[][] d = new String[2][40]; /* DSP_t (t = {1,2}) contains two ArrayList data structures for Type₁ and Type₂ */`
`Set<String> complicationTermSet = new HashSet<String>(); /* complicationTermSet is a set of complication terms */`
`complicationTermSet.add(“ภาวะแทรกซ้อน/complications”);`
`complicationTermSet.add(“ไม่รักษา/unTreat”); /* add more elements into complicationTermSet */`
`Count = complicationTermSet.size(); /* the number of complicationTermSet elements */`
`String[] CTS = complicationTermSet.toArray(new String[complicationTermSet.size()]); /* convert complicationTermSet from a set structure to an array structure */`
`For (t = 1 to 2; t++)`
`{₁ If t = 1 then n = 33 else n = 37;`
`  For (i = 1 to n; i++)`
`  {₂ d_ti = getDiseaseNameConceptFromDocumentTopicName; /* get a disease name */`
`    DName_t = DName_t ∪ d_ti; h = 1; j = 0; complication = 0; /* each disease name element is collected into DName_t */`
`    while h ≤ length[L_dti] ∧ complication = 0 do`
`    {₃ For (k = 1 to Count; k++) /* check complications */`
`        If EDU_h.contains(CTS_k) then complication = 1;`
`      v_h = EDU_h.VP.verb; wrd_h = EDU_h.VP.word; /* verb is EDU_h’s main verb (a verb of EDU_h.VP); word is the word right after EDU_h’s main verb in EDU_h.VP */`
`      If (complication = 0) ∧ ((v_h ∈ Verb_strong) ∨ (v_h + wrd_h ∈ V_inf)) then /* V is the predicate-verb term set; V = Verb_strong ∪ V_inf */`
`      {₄ If EDU_h.headNounOfNP1 ∈ W1 ∧ EDU_h.firstWordOfNP2 ∈ W2 then /* W1 is an agent-term set; W2 is a linguistic-patient term set */`
`        {₅ j++;`
`          sym = wc_Symt,ijDetermination; /* based on the wc expression of EDU_h,Symt,ij by using R1–R5 */`
`          switch (choice) {`
`            Case 1: CErelDetermination(d_t,i, sym) by Equation (13); break; /* NB */`
`            Case 2: CErelDetermination(d_t,i, sym) by Equation (11); break; /* SVM */`
`            Case 3: CErelDetermination(d_t,i, sym) by Equation (12); break; /* LR */ }`
`          If (class = ‘Positive’) ∨ (class = ‘CE-rel’) then`
`            DSP_t.AddCauseEffectPairWithCErel(d_t,i + “-” + sym);`
`        }₅ }₄ h++ }₃ }₂ }₁`
`Return DName_t, DSP_t`
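The inner loop of Algorithm 1 can be paraphrased in Python as below. This is a simplified sketch under stated assumptions rather than a port of the listing: EDUs are plain token lists, and the symptom extraction (R1–R5) and the classifier are passed in as hypothetical callables `extract_symptom` and `classify`.

```python
# Translated cue terms taken from the listing; the real set may be larger.
COMPLICATION_TERMS = {"complications", "unTreat"}


def collect_pairs(disease, edus, classify, extract_symptom):
    """Scan EDUs in order and stop at the first complication term.

    edus            : list of token lists (stemmed, stop words removed)
    classify        : callable(disease, sym) -> class label
    extract_symptom : callable(edu) -> symptom concept or None
    """
    pairs = []
    for edu in edus:
        # The EDU holding the cue and everything after it are excluded.
        if any(term in edu for term in COMPLICATION_TERMS):
            break
        sym = extract_symptom(edu)
        if sym is None:
            continue
        if classify(disease, sym) in ("Positive", "CE-rel"):
            pairs.append(f"{disease}-{sym}")
    return pairs


# Toy run with stand-in callables.
toy_edus = [["pain", "patient", "muscle"], ["complications"], ["beTired"]]
toy_pairs = collect_pairs(
    "influenza",
    toy_edus,
    classify=lambda d, s: "CE-rel",
    extract_symptom=lambda edu: f"{edu[0]}(person)",
)
```

In the toy run, the symptom mentioned after the complication cue is never reached, mirroring the `complication = 0` guard of the while loop.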

SVM: The cause-effect relation between d_t,i and wc_Symt,ij of the d_t,i wc_Symt,ij pair from the test corpus of each disease type is solved by the weight vector from all d_t,i and wc_Symt,ij features. The weight vector and the bias obtained from the SVM learning by using Weka [35] are used to determine the d_t,i wc_Symt,ij pair with CE-rel by Equation (11). If f(x) ≥ 0, the d_t,i wc_symt,ij pair with CE-rel/CErel as the positive class occurs, otherwise the nonCE-rel/nonCErel occurs as the negative class. The d_t,i wc_Symt,ij pair with the positive class is collected into DSP_t as shown in the algorithm of determining disease-symptom pairs having CErel (Algorithm 1).

LR: The research applies Equation (12) to determine the DSympPair class which is a relation class, i.e., a CE-rel/nonCE-rel class, between a disease-name concept (d_t,i) and a symptom concept (wc_Symt,ij) on a d_t,i wc_Symt,ij pair from the test corpus of each disease type, whilst F(x) is interpreted as the probability of either “positive/CErel” as the CE-rel class or “negative/nonCErel” as the nonCE-rel class by the following rules.

Rule 1 (CE-rel_Class): If F(x)_{CE-rel_Class} ≥ 0.5, then the d_t,i wc_Symt,ij pair has CErel between d_t,i and wc_Symt,ij.

Rule 2 (nonCE-rel_Class): If F(x)_{nonCE-rel_Class} ≥ 0.5, then the d_t,i wc_Symt,ij pair has nonCErel between d_t,i and wc_Symt,ij.

DSympPair Class = Max (F(x)_{CE-rel_Class}, F(x)_{nonCE-rel_Class})

(14)

According to Equation (14), x₁ and x₂ are the attribute variable pair of each d_t,i wc_Symt,ij pair from the test corpus of each disease type, where β₀, β₁, and β₂ of d_t,i and wc_Symt,ij are obtained by the automatic supervised learning by LR on the learning sample of each disease type. The d_t,i wc_Symt,ij pair with the CE-rel class is collected into DSP_t as shown in the algorithm of determining disease-symptom pairs having CErel (Algorithm 1).
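Equations (12) and (14) together amount to evaluating the logistic function and choosing the more probable class. The sketch below assumes that the two class probabilities sum to one and that x₁ and x₂ are binary codes; the coefficient values and function names are hypothetical.

```python
import math


def logistic(x1, x2, b0, b1, b2):
    """F(x) of Equation (12) for two attribute values."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))


def lr_class(x1, x2, b0, b1, b2):
    """Pick the class with the larger probability, as in Equation (14)."""
    p_ce = logistic(x1, x2, b0, b1, b2)
    p_non = 1.0 - p_ce  # assumption: the two probabilities sum to 1
    return "CE-rel" if p_ce >= p_non else "nonCE-rel"


# Hypothetical estimators (beta_0, beta_1, beta_2).
BETA = (-1.0, 1.5, 0.5)

active_pair = lr_class(1, 1, *BETA)
inactive_pair = lr_class(0, 0, *BETA)
```

Under the sum-to-one assumption, taking the maximum of the two probabilities is equivalent to applying the 0.5 threshold of Rule 1 and Rule 2.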

iii.: Collection of d_t,i-SymGroup_dt,i Pairs having CErel

All correctly determined d_t,i wc_Symt,ij pairs having CErel in DSP_t from the previous step are grouped by the same d_t,i into SymGroup_dt,i resulting in the d_t,i-SymGroup_dt,i pair with CErel as shown in Figure 6.

Dname_t from the Algorithm 1 results is D_t in Equation (3); D_t = {d_t,1, d_t,2, …, d_t,num_t}, where num_t is numofDiseaseElements_t in Type_t.

According to Figure 6, all SymGroup_dt,i of the d_t,i-SymGroup_dt,i pairs having CErel are collected into SymGrp_t.

SymGrp_t = {SymGroup_dt,1, SymGroup_dt,2, …, SymGroup_dt,num_t};

D_t and SymGrp_t are used for the DSKG construction in the next step.
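The grouping step can be sketched as a simple dictionary build. The representation of DSP_t as a list of (disease, symptom) tuples is an assumption made for readability (Algorithm 1 stores each pair as a single string), and the function name `group_pairs` is hypothetical.

```python
from collections import defaultdict


def group_pairs(dsp_t):
    """Group disease-symptom pairs by disease name into SymGroup lists."""
    groups = defaultdict(list)
    for disease, symptom in dsp_t:
        if symptom not in groups[disease]:  # keep each concept once
            groups[disease].append(symptom)
    return dict(groups)


# Hypothetical DSP_1 content with one repeated pair.
DSP_1 = [
    ("influenza", "fever(person)"),
    ("influenza", "cough(person)"),
    ("dengue", "fever(person)"),
    ("influenza", "fever(person)"),
]
sym_grp_1 = group_pairs(DSP_1)
```

The keys of the resulting dictionary play the role of D_t and its values play the role of the SymGroup_dt,i elements of SymGrp_t.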

4.5. DSKG Construction

According to D_t and SymGrp_t, the subsets of SymGrp_t form a union into Stype_t by Equation (15). In addition, Stype_t of both disease types form a union into S (which is the union of all SymGroup_dt,i from both disease types) by Equation (16).

Stype_{t} = \bigcup_{i = 1}^{numofDname_{t}} SymGroup_{dt,i}

(15)

where

$Stype_{t}$ is a symptom-concept set of $Type_{t}$ (t = 1, 2);
$numofDname_{t}$ is the cardinality of the $D_{t}$ or $Dname_{t}$ set;
$SymGroup_{dt,i}$ is a symptom-concept group resulting from $d_{t,i}$.

S = Stype_{1} \cup Stype_{2} = \{s_{1}, s_{2}, \dots, s_{m}\}

(16)

where m is the cardinality of S.
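Equations (15) and (16) are plain set unions, as the following sketch with hypothetical symptom-concept groups shows. The helper name `stype` is an assumption.

```python
# Hypothetical symptom-concept groups per disease type.
SYM_GRP = {
    1: {
        "influenza": {"fever(person)", "cough(person)"},
        "dengue": {"fever(person)", "pain(person,muscle)"},
    },
    2: {
        "diabetes": {"beThirsty(person)", "beTired(person)"},
    },
}


def stype(groups):
    """Equation (15): union of all SymGroup sets of one disease type."""
    return set().union(*groups.values())


# Equation (16): S is the union over both disease types.
S = sorted(stype(SYM_GRP[1]) | stype(SYM_GRP[2]))
m = len(S)
```

Shared concepts such as fever(person) appear only once in S, so m counts distinct symptom concepts rather than occurrences.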

S is then the symptom-concept feature vector of size m, where m is 143 in this research. PCA (run in IBM SPSS Statistics for Windows, Version 21.0) is used to reduce the m features of the symptom-concept feature vector by wrapping each s_c feature element whose feature loading weight from the eigenvector has an absolute value of at least 0.4 (a threshold chosen for our corpus) within the corresponding Fgroup_z to become Factor_z (z = 1, 2, …, 39), as shown in Table 3.
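The wrapping of features into factors by their loading weights can be illustrated as follows. The loadings are hypothetical (they are not the SPSS output of this research), the function name `wrap_factors` is an assumption, and the threshold is read as an absolute loading of at least 0.4.

```python
def wrap_factors(loadings, threshold=0.4):
    """Assign each feature to every factor on which it loads strongly.

    loadings : {feature: [loading on factor 1, loading on factor 2, ...]}
    Returns  : {factor index: [features with |loading| >= threshold]}
    """
    n_factors = len(next(iter(loadings.values())))
    factors = {z: [] for z in range(1, n_factors + 1)}
    for feature, row in loadings.items():
        for z, weight in enumerate(row, start=1):
            if abs(weight) >= threshold:
                factors[z].append(feature)
    return factors


# Hypothetical rotated loadings for four features on two factors.
LOADINGS = {
    "fever(person)": [0.72, 0.10],
    "cough(person)": [0.65, -0.05],
    "beThirsty(person)": [0.08, -0.58],
    "rash(person,skin)": [0.20, 0.15],
}
factors = wrap_factors(LOADINGS)
```

A feature whose loadings all fall below the threshold, such as the last one here, is not wrapped into any factor.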

In Table 3, the number of s_c features is reduced from 143 to 39 groups of Fgroup_z. From Equation (16), we conclude D_i as in Equation (17).

D_{i} = (\cup_{t = 1}^{2} D_{t})

(17)

Therefore, each d_i (d_i ∈ D_i) is paired with Factor_z instead of SymGroup_di (see Table 4), where Factor_z consists of the correlated s_c feature elements and FactorScore_z is calculated for each disease name, d_i.

In Table 4, we then select the Factor_z occurrences as the common and relevant factors having the highest FactorScore_z of each d_i for constructing DSKG (see Figure 7).
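Selecting the highest-scoring factor for each disease name reduces to an arg-max over that disease's factor scores. The scores below are hypothetical and the function name `top_factor` is an assumption; the sketch keeps a single factor per disease, whereas the procedure above may retain several.

```python
def top_factor(scores_by_disease):
    """Return, for each disease, the factor with the highest FactorScore."""
    return {
        disease: max(scores, key=scores.get)
        for disease, scores in scores_by_disease.items()
    }


# Hypothetical FactorScore_z values per disease name.
FACTOR_SCORES = {
    "influenza": {"Factor_1": 1.8, "Factor_2": -0.3},
    "diabetes": {"Factor_1": -0.5, "Factor_2": 2.1},
}
edges = top_factor(FACTOR_SCORES)
```

Each resulting (disease, factor) entry corresponds to one disease-symptom edge of the kind drawn in the DSKG.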

5. Evaluation and Discussion

The research has three evaluations: Automatic Determination of wc_Symt,ij Features, Determination of d_t,i wc_Symt,ij Pairs Having CErel, and DSKG Construction.

5.1. Automatic Determination of wc_Symt,ij Features

The test sample of 35 different d_i documents from both disease types on the test corpus is used for evaluating the automatic determination of the wc_Symt,ij features as the SymGroup_dt,i elements without the concept annotation. The automatic determination of the wc_Symt,ij features is evaluated in terms of precision, recall, and F1-score in Table 5, based on the answer sets provided by three experts with max win voting. The number of different wc_Symt,ij (symptom-concept) features determined from the test sample on the test corpus shows that the Type₂ (non-infectious disease type) documents have more diverse symptom-concept features than the Type₁ (infectious disease type) documents, resulting in lower frequencies of most symptom-concept features (as shown in Figure 8).

From Table 5, the averages of the precision, the recall, and the F1-score for the wc_Symt,ij feature determination from the Type₁ and Type₂ documents are 0.890, 0.773, and 0.827, respectively. The reason for having the low average recall of the wc_Symt,ij feature determination from both disease types is the symptom with the long-tailed explanation, e.g.,

EDU: “คนไข้มีต่อมน้ำเหลืองบริเวณด้านหลังลำคอโตผิดปกติ/Khnk̄hị̂ mī t̀xmn̂ả h̄elụ̄xng briweṇ d̂ān h̄lạng lả khx to p̄hid pkti” (The patient has lymph nodes in the back of the neck being abnormally large.).

“(((คนไข้/Khnk̄hị̂)/patient)/Noun)/NP1 (((มี/mī)/has)/Verb_weak ((ต่อมน้ำเหลือง/t̀xmn̂ả h̄elụ̄xng)/lymph-nodes)/Noun ((บริเวณด้านหลัง/briweṇ d̂ān h̄lạng)/back area)/Noun ((ลำคอ/lả khx)/neck)/Noun ((โต/to)/large)/Verb_strong ((ผิดปกติ/p̄hid pkti)/abnormally)/Adv)/VP”.

where the wc pattern cannot cover the “(โต/to)/large” term as the main symptom concept. However, the result of the medical symptom recognition [16] from the patient text based on multi-label classification labeling by the expert is 0.76 F1-Score which covers the symptom with the long-tailed explanation.
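The evaluation measures used here can be sketched in plain Python. The majority vote stands in for the "max win voting" of the three experts, which is an interpretation, and the function names are hypothetical; the last line only re-derives the F1-score from the averaged precision and recall reported above.

```python
from collections import Counter


def majority_vote(votes):
    """Label chosen by most annotators (ties go to the first label seen)."""
    return Counter(votes).most_common(1)[0][0]


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted symptom concepts."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_f1 = f1_from(0.890, 0.773)
```

The harmonic mean of 0.890 and 0.773 rounds to 0.827, which agrees with the average F1-score quoted from Table 5.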

5.2. Determination of d_t,i wc_Symt,ij Pairs Having CErel

On the test corpus, the d_t,i wc_Symt,ij pairs having CErel that are correctly determined by the automatic supervised learning are grouped by the same d_t,i into the d_t,i-SymGroup_dt,i pairs having CErel. Thus, the correctly determined d_t,i wc_Symt,ij pairs having CErel also result in the correct d_t,i-SymGroup_dt,i pairs having CErel for the DSKG construction.

The determination of d_t,i wc_Symt,ij pairs having CErel by NB, SVM, and LR from the d_t,i documents (4700 EDUs of the d_1,i documents and 5300 EDUs of the d_2,i documents) with the automatic supervised learning is evaluated in terms of precision, recall, and F1-score, as shown in Table 6, based on three experts with max win voting. Because the data in our learning corpus are balanced, we also report the Receiver Operating Characteristic (ROC) curves produced by Weka to show the performance of each classifier, as shown in Figure 9. The use of ROC curves to evaluate the performance follows the study layout of [36].

According to Figure 8, the wc_Sym_1,ij features from the Type₁ documents have lower diversity and higher frequency than the wc_Sym_2,ij features from the Type₂ documents. As a result, NB, SVM, and LR determine d_t,i wc_Symt,ij pairs having CErel with better precision, recall, and F1-score for Type₁ than for Type₂, as shown in Table 6. The evaluation results depend on (1) the frequencies of the d_t,i and wc_Symt,ij features for NB, (2) the weight vector for SVM, and (3) the model estimators (β₁ and β₂), including the frequencies of the d_t,i and wc_Symt,ij features, for LR. Consequently, for each disease type the precision, recall, and F1-score of SVM are higher than those of NB, because most of the wc_Symt,ij features of each disease type occur with low frequency in the documents (see Figure 8). For Type₁, the F1-score of LR is 0.009 higher than that of SVM because some wc_Sym_1,ij features in the Type₁ documents occur with high frequency. For Type₂, the F1-score of LR is 0.001 lower than that of SVM because most wc_Sym_2,ij features in the Type₂ documents are low-frequency features. The LR and SVM performances differ little in terms of the F1-score, whereas they differ clearly in terms of the ROC Area, or AUC (Area Under the Curve), in both disease types (see Figure 9). According to [36], LR with ROC Area = 0.850 and 0.843 in Type₁ and Type₂, respectively, is considered to show excellent performance in classifying d_t,i wc_Symt,ij pairs having CErel in both disease types.
Therefore, we use the LR estimator to determine d_t,i wc_Symt,ij pairs having CErel from the external data of another corpus, e.g., Patient Care and Health Information from Mayo Clinic (https://www.mayoclinic.org/patient-care-and-health-information, accessed on 20 May 2022), after using WordNet and MeSH to determine the symptom concepts from noun phrases. The evaluation results are as follows: Type₁ has a precision of 0.750, a recall of 0.667, and an F1-score of 0.706; Type₂ has a precision of 0.857, a recall of 0.600, and an F1-score of 0.706. By comparison, our LR F1-scores of 0.761 and 0.743 for Type₁ and Type₂, respectively, are higher than those obtained on the external corpus because our symptom concepts are based on the composite variable.
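
The F1-scores quoted for the external corpus follow from the usual harmonic mean of precision and recall. The short check below uses only the precision and recall values stated above; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) reported for the external corpus.
external = {"Type1": (0.750, 0.667), "Type2": (0.857, 0.600)}

external_f1 = {t: round(f1_score(p, r), 3) for t, (p, r) in external.items()}
```

Both disease types round to 0.706, although Type₂ trades a lower recall for a higher precision.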

Therefore, the average F1-scores over Type₁ and Type₂ in determining d_t,i wc_Symt,ij pairs having CErel by NB, SVM, and LR are 0.743, 0.748, and 0.752, respectively, based on the automatic supervised learning on the disease feature set and the symptom feature set (the composite-feature set). By comparison, the previous work [14] determines the disease-symptom relation by neural network learning from each sentence containing a disease-name and symptom-term pair, with an F1-score of 0.883 based on human annotation, even though its symptom terms are based on nouns/noun phrases. In addition, another previous work [13] automatically determines the disease-symptom relation by NB, LR, and a Bayesian network using noisy-OR gates, with an F1-score of 0.704 from patient textual data (where the symptom occurrences are based on the noun/noun-phrase expression feature set). The work [13], based on specific cases, is useful for physicians in the curative system, whereas our research, based on two disease-symptom corpora randomly selected from two medical web-board resources, i.e., the Thai hospital and Thai-Healthcare-Knowledge web-boards, is useful for non-professionals in preliminary diagnosis.

Moreover, the performance of a machine-learning-based CErel classification model is generally evaluated by the F1-score and the ROC curve, which supports a statistical test following [37]: the area under the ROC curve (AUC) of LR = 0.843 (Type₂) yields a test statistic of 10.6 with a p-value < 0.001, indicating excellent discriminating ability.
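
As background for the AUC values, the area under the ROC curve can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it by direct pairwise comparison on hypothetical labels and scores; it does not reproduce the Weka output or the test statistic reported above.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores: 1 = pair has CErel, 0 = pair does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```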

5.3. DSKG Construction

We evaluate the DSKG construction with a 10-point score for concise representation and a 10-point score for precise representation, both assessed by three experts and two end-users who are non-professionals. The average scores given by the experts and end-users for the concise representation and the precise representation are 9 and 7.8, respectively. The precise-representation score of 7.8 reflects that the cause-effect network should also include some information about the patient's condition and its severity. Nevertheless, the DSKG construction by PCA enhances primary health care by supporting non-professionals in preliminary diagnosis problems with a concise and precise representation through an online recommender application system, whereas the cause-effect graphs/networks constructed from texts in previous works, e.g., [13,15], are mostly less concise because they contain all symptom terms of each disease term.

6. Conclusions

This research constructs a DSKG consisting of several CErel connections determined from the downloaded disease documents on the Thai hospital and Thai-Healthcare-Knowledge web-boards. Each CErel connection links the causative-concept node (the <d_i> node represented by the d_i feature) to the correlated effect-concept node, i.e., the <s_n₁, s_n₂, …, s_n_η> node represented by a group of correlated s_c features as the common symptom-concept features. Moreover, all s_c features of each d_i document in our research are based on the basic symptoms. Our proposed method of constructing the DSKG, which needs to reduce the high-dimensional feature space of S (S = {s_c}) for the graph construction, relies on the determination of the wc_Symt,ij features representing the s_c features and on the automatic supervised learning of d_t,i wc_Symt,ij pairs having CErel from the downloaded documents as unstructured data. PCA is then proposed for constructing the DSKG by reducing the dimensionality of the symptom-concept feature space with minimal information loss. In the evaluation of the proposed method, the conciseness and precision of the DSKG construction depend on the number of different symptom-concept features and the number of different disease-name concept features, whilst the accuracy of the CErel determination depends on the corpus behavior, e.g., the diversity and the frequencies of the wc_Symt,ij feature occurrences. In addition, the accuracy of the wc_Symt,ij feature determination depends on the number of symptoms with long-tailed explanations.
Our approach contrasts with previous works on the cause-effect/disease-symptom relation determination and the cause-effect/disease-symptom knowledge graph/network construction from documents in three respects: (1) our determination of the symptom-concept feature, wc_Symt,ij, as the composite variable (which relies on the predicate-argument pattern from NP1 and VP through the preparation and collection of the Verb_strong, Verb_weak, Noun, Adj, and Adv sets from the medical-symptom-expression list on Wikipedia including MeSH, without concept annotation) has a high F1-score, whereas the symptom-concept determination of the previous works is based only on either noun-phrase [12,13,14,15] or verb-phrase [16] concepts, obtained by either expert annotation or automatic string-matching to ICD-9 codes and UMLS concepts; (2) our determination of d_t,i wc_Symt,ij pairs having CErel is based on the Cartesian product of DS × SG as the automatic supervised learning, in which each d_i feature from a noun/noun-phrase expression is paired with each group of the wc_Symt,ij features from several EDUs’ wc expressions with/without complications, whereas the previous works determine the cause-effect/disease-symptom relation on noun/noun-phrase pairs with either supervised learning by experts [11] or automatic determination relying on ICD-9 codes [13], without considering that some d_t,i documents contain both the basic symptoms and the complications; (3) our DSKG construction by PCA, which reduces the high-dimensional symptom-concept features, presents the DSKG with preciseness and high conciseness, whereas the cause-effect/disease-symptom graphs/networks constructed from texts in previous works consist of CErel connections between a causative-concept node (represented by one causative-concept feature) and the corresponding effect-concept node (represented by one effect-concept feature), without considering the high-dimensional effect features, particularly the symptom features [7,13,15].
Moreover, the DSKG results, e.g., in Figure 7, were frequently found to align with scientific findings and with the objective of this research. For example, following the DSKG in Figure 7, vascular diseases were found to be associated with several diseases, including kidney disease and myocardial infarction, an association that is also reported in the literature (e.g., [38,39,40]). In the future, the temporal feature and the condition feature should be considered to increase the accuracy of the CErel determination, which in turn increases the preciseness of the DSKG for web-application development as in [41]. Moreover, the proposed method can also be applied to other languages, and the DSKG of our research enhances primary health care by supporting non-professionals with a knowledge structure for preliminary diagnosis problems through the recommender system.

Author Contributions

Conceptualization, C.P. and R.P.; Data curation, C.P. and R.P.; Formal analysis, C.P. and R.P.; Funding acquisition, C.P.; Investigation, C.P. and R.P.; Methodology, C.P. and R.P.; Project administration, C.P.; Resources, C.P. and R.P.; Software, C.P.; Supervision, C.P.; Validation, C.P. and R.P.; Visualization, C.P.; Writing—original draft, C.P.; Writing—review & editing, R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kroenke, K. A practical and evidence-based approach to common symptoms: A narrative review. Ann. Intern. Med. 2014, 161, 579–586.
2. Khoo, C.; Na, J.C. Semantic Relations in Information Science. Annu. Rev. Inf. Sci. Technol. 2006, 40, 157–228.
3. Carlson, L.; Marcu, D.; Okurowski, M.E. Building a discourse-tagged corpus in the framework of rhetorical structure theory. Curr. New Dir. Discourse Dialogue 2003, 22, 85–112.
4. Pustejovsky, J. The syntax of event structure. Cognition 1991, 41, 47–81.
5. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
6. Girju, R. Automatic detection of causal relations for question answering. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Workshop on Multilingual Summarization and Question Answering-Machine Learning and beyond, Sapporo, Japan, 7–12 July 2003; pp. 76–83. Available online: https://aclanthology.org/W03-1210.pdf (accessed on 25 June 2007).
7. Chang, D.-S.; Choi, K.-S. Incremental cue phrase learning and bootstrapping method for causality extraction using cue phrase and word pair probabilities. Inf. Process. Manag. 2006, 42, 662–678.
8. Riaz, M.; Girju, R. Recognizing Causality in Verb-Noun Pairs via Noun and Verb Semantics. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language, Gothenburg, Sweden, 6 April 2014; pp. 48–57. Available online: https://aclanthology.org/W14-0707.pdf (accessed on 25 May 2022).
9. Zhao, S.; Liu, T.; Zhao, S.; Chen, Y.; Nie, J.-Y. Event causality extraction based on connectives analysis. Neurocomputing 2016, 173, 1943–1950.
10. Sorgente, A.; Vettigli, G.; Mele, F. A Hybrid Approach for the Automatic Extraction of Causal Relations from Text. Emerg. Ideas Inf. Filter. Retr. 2017, 746, 15–29.
11. Cao, M.; Sun, X.; Zhuge, H. The contribution of cause-effect link to representing the core of scientific paper—The role of Semantic Link Network. PLoS ONE 2018, 13, e0199303.
12. Hassan, M.; Makkaoui, O.; Coulet, A.; Toussaint, Y. Extracting Disease-Symptom Relationships by Learning Syntactic Patterns from Dependency Graphs. In Proceedings of BioNLP 15; Association for Computational Linguistics: Beijing, China, 2015; pp. 71–80.
13. Rotmensch, M.; Halpern, Y.; Tlimat, A.; Horng, S.; Sontag, D. Learning a Health Knowledge Graph from Electronic Medical Records. Sci. Rep. 2017, 7, 5994.
14. Wada, S.; Iida, R.; Torisawa, K.; Takeda, T.; Manabe, S.; Matsumura, Y. Extracting Symptom Names and Disease-Symptom Relationships from Web Texts Using a Multi-Column Convolutional Neural Network. MedInfo 2019, 264, 423–427.
15. Zhou, X.; Menche, J.; Barabasi, A.; Sharma, A. Human symptoms–disease network. Nat. Commun. 2014, 5, 4212.
16. Mottaghi, A.; Sarma, P.K.; Amatriain, X.; Yeung, S.; Kannan, A. Medical symptom recognition from patient text: An active learning approach for long-tailed multilabel distributions. arXiv 2021, arXiv:2011.06874.
17. Song, M.-K.; Lin, F.-C.; Ward, S.E.; Fine, J.P. Composite Variables. Nurs. Res. 2013, 62, 45–49.
18. Zawbaa, H.M.; Emary, E.; Grosan, C.; Snasel, V. Large-dimensionality small-instance set feature selection: A hybrid bio-inspired heuristic approach. Swarm Evol. Comput. 2018, 42, 29–42.
19. Figueiredo, F.; Rocha, L.; Couto, T.; Salles, T.; Gonçalves, M.A.; Meira, W., Jr. Word co-occurrence features for text classification. Inf. Syst. 2011, 36, 843–858.
20. Sedighi, M. Application of word co-occurrence analysis method in mapping of the scientific fields (case study: The field of Informetrics). Libr. Rev. 2016, 65, 52–64.
21. Nykamp, D.Q. Cartesian Product Definition, from Math Insight. Available online: http://mathinsight.org/definition/cartesian_product (accessed on 25 May 2022).
22. Falke, T.; Gurevych, I. Utilizing Automatic Predicate-Argument Analysis for Concept Map Mining. In Proceedings of the 12th International Conference on Computational Semantics (IWCS), Montpellier, France, 19–22 September 2017; pp. 1–10. Available online: https://aclanthology.org/W17-6909 (accessed on 25 May 2022).
23. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines; Cambridge University Press: Cambridge, UK, 2000.
24. Mitchell, T.M. Machine Learning; The McGraw-Hill Co., Inc.: New York, NY, USA; MIT Press: Singapore, 1997; Available online: https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf (accessed on 25 September 2000).
25. Freedman, D.A. Statistical Models: Theory and Practice; Cambridge University Press: Cambridge, UK, 2009; Available online: https://scholar.google.co.th/citations?user=HntkbYMAAAAJ&hl=en&oi=sra (accessed on 25 June 2015).
26. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
27. Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis, A Wiley-Interscience Publication; John Wiley & Sons, Inc.: New York, NY, USA, 2001; Available online: https://www.cs.helsinki.fi/u/ahyvarin/papers/bookfinal_ICA.pdf (accessed on 25 April 2021).
28. Reutrakul, S.; Deerochanawong, C. Diabetes in Thailand: Status and Policy. Curr. Diabetes Rep. 2016, 16, 28.
29. Sudprasert, S.; Kawtrakul, A. Thai word segmentation based on global and local unsupervised learning. In Proceedings of the NCSEC 2003, Chonburi, Thailand, 28–30 October 2003; pp. 1–8.
30. Chanlekha, H.; Kawtrakul, A. Thai named entity extraction by incorporating maximum entropy model with simple heuristic information. In Proceedings of the IJCNLP 2004, Hainan, China, 22–24 March 2004; pp. 1–7. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.1449 (accessed on 25 December 2007).
31. Tongtep, N.; Theeramunkong, T. Pattern-based Extraction of Named Entities in Thai News Documents. Thammasat Int. J. Sci. Technol. 2010, 15, 70–81.
32. Chareonsuk, J.; Sukvakree, T.; Kawtrakul, A. Elementary discourse unit segmentation for Thai using discourse cue and syntactic information. In Proceedings of the NCSEC 2005, Bangkok, Thailand, 27–28 October 2005; pp. 85–90.
33. Ketui, N.; Theeramunkong, T.; Onsuwan, C. Thai elementary discourse unit analysis and syntactic-based segmentation. Information 2013, 16, 7423–7436.
34. Smith, J.G.; Duncan, A.J. Elementary Statistics and Applications: Fundamentals of the Theory of Statistics; McGraw-Hill Book Company Inc.: London, UK, 1944.
35. Eibe, F.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann 2016. Available online: https://www.cs.waikato.ac.nz/ml/weka/ (accessed on 25 May 2022).
36. Le, N.Q.K.; Kha, Q.H.; Nguyen, V.H.; Chen, Y.-C.; Cheng, S.-J.; Chen, C.-Y. Machine Learning-Based Radiomics Signatures for EGFR and KRAS Mutations Prediction in Non-Small-Cell Lung Cancer. Int. J. Mol. Sci. 2021, 22, 9254.
37. Mandrekar, J.N. Receiver Operating Characteristic Curve in Diagnostic Test Assessment. J. Thorac. Oncol. 2010, 5, 1315–1316.
38. Nakamura, R.; Tomiyoshi, G.; Shinmen, N.; Kuroda, H.; Kudo, T.; Doi, H.; Mine, S.; Machida, T.; Kamitsukasa, I.; Wada, T.; et al. An anti-deoxyhypusine synthase antibody as a marker of atherosclerosis-related cerebral infarction, myocardial infarction, diabetes mellitus, and chronic kidney disease. SM Atheroscler. J. 2017, 1, 1001.
39. Daniels, S.R.; Bates, S.; Lukin, R.R.; Benton, C.; Third, J.; Glueck, C.J. Cerebrovascular arteriopathy (arteriosclerosis) and ischemic childhood stroke. Stroke 1982, 13, 360–365.
40. González-Ramírez, J.; Martínez-Coronilla, G.; López-Rocha, L.D.; Leija-Montoya, A.G.; Hernández-Díazcouder, A.; Fontes-Garcia, Z.; Silva-Mancilla, M.; Sánchez-Muñoz, F. Acute Myocardial Infarction and Periodontitis: Importance of Awareness and Prevention in Latin America. Appl. Sci. 2022, 12, 3131.
41. Mazorchuck, M.; Dobriak, V.; Chumachenko, D. Web-Application Development for Tasks of Prediction in Medical Domain. In Proceedings of the 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine, 11–14 September 2018; Volume 1, pp. 5–8.

Figure 1. Present DSKG formed by several CErel connections on which each CErel connection links each <d_i> node to a <s₁, s₃, ..., s_β> node as an occurrence of multiple symptoms where s₁, s₃, ..., and s_β are the correlated s_c features; d₁-->s₁ ∧ s₂ ∧ s₈ … ∧ s_β; d₂-->s₁ ∧ s₃ ∧ s₄ ∧ s₈ … ∧ s_β; …; d_num-->s₁ ∧ s₅ ∧ s₈ … ∧ s_β.

Figure 2. A general Thai linguistic expression, including the Thai-to-English translation by the Lexitron Dictionary, where NP1 and NP2 are noun phrases; VP is a verb phrase; Verb_strong is a strong-verb concept set; Verb_weak is a weak-verb concept set; Adv is an adverb concept set; Adj is an adjective concept set; and Noun is a noun concept set. English translations of the non-English text are provided in the figure.

Figure 3. The EDU expressions with symptom concepts in the downloaded disease document on Diabetes (where a [...] symbol means an ellipsis of a word/words inside the symbol). English translations of the non-English text are provided in the figure.

Figure 4. The positive and negative sample instances from the result of the Cartesian product of DS × SG.

Figure 5. A Framework of DSKG Construction.

Figure 6. The d_t,i-SymGroup_dt,i pairs having CErel after grouping the correctly determined d_t,i wc_Symt,ij pairs having CErel by d_t,i.

Figure 7. The DSKG, in which each arrow is a CErel connecting d_i to Factor_z (or F_z), e.g., Factor18/F18 and Factor20/F20, which contain some common s_c features represented by the underlined s_c.

Figure 8. The frequencies of the correctly determined wc_Symt,ij features from 35 different d_i documents, comprising the 17 d_1,i documents in Type₁ and the 18 d_2,i documents in Type₂.

Figure 9. ROC curves of the NB, SVM, and LR classifiers for each disease type, i.e., Type₁ and Type₂ (TPR is True Positive Rate; FPR is False Positive Rate).

Table 1. Example terms of the medical-symptom-expression list used for collecting the element concepts in Verb_strong, Verb_weak, Noun, Adj, and Adv sets of the segmented-term translation.

Example Terms of Medical-Symptom-Expression List on Wikipedia Web Site	Medical-Symptom-Expression Term Translation from English to Thai by Lexitron Dictionary	Term/EDU Segmentation	Segmented-Term Translation from Thai to English by Lexitron Dictionary including WordNet
bleeding	เลือดออก/Leụ̄xd xxk	((เลือด/Leụ̄xd)/Noun)/NP1 ((ออก/xxk)/Verb_strong)/VP	(Leụ̄xd/blood)/Noun (xxk/come out)/Verb_strong
belching (a gerund of a belch verb)	เรอ/Rex	(เรอ/Rex)/Verb_strong	(Rex/belch)/Verb_strong
fatigue	อ่อนแรง/X̀xn ræng	(อ่อนแรง/X̀xn ræng)/Verb_strong	(X̀xn ræng/be weak)/Verb_strong
malaise	อ่อนเพลีย/X̀xnphelīy	(อ่อนเพลีย/X̀xnphelīy)/Verb_strong	(X̀xnphelīy/be tired)/Verb_strong
flatulence	อาการท้องเฟ้อ/Xākār tĥxngfêx	((อาการ/Xākār)/Noun (ท้องเฟ้อ/tĥxngfêx)/Adj)/NP	(Xākār/symptom)/Noun (tĥxngfêx/flatulence)/Adj
proctalgia	อาการปวดทวารหนัก/Xākār pwd thwār h̄nạk	((อาการ/Xākār)/Noun (ปวด/pwd)/Verb_strong (ทวารหนัก/thwār h̄nạk)/Noun)/NP	(Xākār/symptom)/Noun (pwd/pain)/Verb_strong (thwār h̄nạk/anus)/Noun
tinnitus	อาการมีเสียงในหู/Xākār mīs̄eīyng nı h̄ū	((อาการ/Xākār)/Noun (มีเสียง/mīs̄eīy ng)/V_inf (ใน/nı)/Preposition (หู/h̄ū)/Noun)/NP	(Xākār/symptom)/Noun (mīs̄eīy/have a ringing in)/V_inf (h̄ū/ear)/Noun
claudication	อาการปวดขา/Xākār pwd k̄hā	((อาการ/Xākār)/Noun (ปวด/pwd)/Verb_strong (ขา/k̄hā)/Noun )/NP	(Xākār/symptom)/Noun (pwd/pain)/Verb_strong (k̄hā/leg)/Noun
chorea	ประสาทชักกระตุก/Pras̄āth chạk kratuk	((ประสาท/Pras̄āth)/Noun)/NP1 ((ชักกระตุก/chạk kratuk)/Verb_strong)/VP	(Pras̄āth/nerve)/Noun (chạk kratuk/convulse)/Verb_strong
muscle weakness	กล้ามเนื้ออ่อนกำลัง/Kl̂ām neụ̄̂xx̀xn kảlạng	((กล้ามเนื้อ/Kl̂ām neụ̄̂x)/Noun)/NP ((อ่อนกำลัง/x̀xn kảlạng)/Verb_strong)/VP	(Kl̂ām neụ̄̂x/muscle)/Noun (x̀xn kảlạng/weaken)/Verb_strong
anorexia	เบื่ออาหาร/Beụ̄̀x xāh̄ār	((เบื่อ/Beụ̄̀x)/Verb_strong (อาหาร/xāh̄ār)/Noun)/VP	(Beụ̄̀x/be bored with)/Verb_strong (xāh̄ār/food)/Noun
arrhythmia	หัวใจเต้นผิดจังหวะ/H̄ạwcı tên p̄hid cạngh̄wa	((หัวใจ/H̄ạwcı)/Noun)/NP1 ((เต้น/tên)/Verb_strong (ผิดจังหวะ/p̄hid cạngh̄wa)/Adv)/VP	(H̄ạwcı/heart)/Noun (tên/pulse)/Verb_strong (p̄hid cạngh̄wa/irregularly)/Adv
……………	……………	……………	……………

Table 2. The probabilities of d_t,i and wc_Symt,ij for the CE-rel and nonCE-rel classes from the Type₁ and Type₂ disease documents (where the values of the d_t,i features and the wc_Symt,ij features with the symptom-concept codes are d_1,3 = HepatitisB, d_1,7 = German-measles, d_1,17 = TyphoidFever, d_1,58 = Dysentery, d_1,67 = Enteritis, d_2,18 = diabetes, d_2,41 = PulmonaryVascularDisease, d_2,54 = Myocarditis, d_2,63 = BrainDisease, d_2,70 = Anemia, S₆ = haveFever (person), S₇ = haveHeadache (person), S₁₀ = vomit (person), S₂₈ = weaken (person), S₃₁ = haveSweat (person)).

d_t,i (Type₁)	CE-rel	nonCE-rel	wc_Symt,ij	CE-rel	nonCE-rel
d_1,3	0.0590	0.0601	S6	0.0754	0.0237
d_1,7	0.0465	0.0474	S7	0.0344	0.0227
d_1,17	0.0755	0.0769	S10	0.0419	0.0350
d_1,58	0.0962	0.0980	S28	0.0084	0.0435
d_1,67	0.0631	0.0643	S31	0.0195	0.0047
…	…	…	…	…	…
d_t,i (Type₂)	CE-rel	nonCE-rel	wc_Symt,ij	CE-rel	nonCE-rel
d_2,41	0.0203	0.0216	S6	0.0247	0.0734
d_2,54	0.0365	0.0389	S7	0.0247	0.0327
d_2,63	0.0365	0.0389	S10	0.0277	0.0471
d_2,70	0.0397	0.0424	S28	0.0398	0.0064
d_2,18	0.0397	0.0424	S31	0.0068	0.0199
…	…	…	…	…	…
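
To illustrate how NB can combine the two feature probabilities in Table 2, the sketch below multiplies the disease and symptom entries per class and picks the larger product. Two points are assumptions: the table entries are read as class-conditional probabilities, and the class priors are taken as equal. The function name is hypothetical.

```python
# Type1 entries copied from Table 2 as (CE-rel, nonCE-rel).
DISEASE_PROB = {"d_1,3": (0.0590, 0.0601), "d_1,58": (0.0962, 0.0980)}
SYMPTOM_PROB = {"S6": (0.0754, 0.0237), "S28": (0.0084, 0.0435)}
PRIOR = (0.5, 0.5)  # assumption: equal priors for CE-rel and nonCE-rel

def classify_pair(disease, symptom):
    """Naive Bayes decision: compare prior * P(d | class) * P(s | class) for both classes."""
    ce = PRIOR[0] * DISEASE_PROB[disease][0] * SYMPTOM_PROB[symptom][0]
    non_ce = PRIOR[1] * DISEASE_PROB[disease][1] * SYMPTOM_PROB[symptom][1]
    return "CE-rel" if ce > non_ce else "nonCE-rel"

hepatitis_fever = classify_pair("d_1,3", "S6")     # HepatitisB with haveFever
dysentery_weaken = classify_pair("d_1,58", "S28")  # Dysentery with weaken
```

Because the disease probabilities are nearly equal across the two classes, the symptom probability drives the decision in both examples.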

Table 3. Fgroup_z with the feature loading weights of the s_c feature elements, where each yellow cell marks a feature loading weight with an absolute value ≥ 0.4.

Symptom	…	Fgroup18	Fgroup19	Fgroup20	…
…	…	…	…	…	…
S₈₂	…	0.937	−0.017	0.067	…
S₁₁₂	…	0.614	0.036	−0.020	…
S₃	…	0.573	−0.041	−0.366	…
S₁₃₆	…	0.479	−0.030	0.098	…
S₇₁	…	0.479	−0.077	−0.071	…
…	…	…	…	…	…
S₆₄	…	0.000	0.913	0.044	…
S₁₄	…	0.007	0.657	−0.007	…
S₃₃	…	−0.036	0.636	0.035	…
S₃₁	…	−0.030	0.484	−0.052	…
…	…	…	…	…	…
S₅₂	…	0.006	−0.012	−0.873	…
S₅₁	…	−0.023	−0.030	−0.858	…
S₂₈	…	0.079	−0.087	−0.478	…
…	…	…	…	…	…
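
The selection rule stated in the caption of Table 3 can be sketched as a threshold on the absolute loading. The loadings are the visible rows of Table 3; the elided rows and columns are not reconstructed, and the function name is illustrative.

```python
# Visible loadings from Table 3 for Fgroup18, Fgroup19, and Fgroup20.
LOADINGS = {
    "S82": {"Fgroup18": 0.937, "Fgroup19": -0.017, "Fgroup20": 0.067},
    "S112": {"Fgroup18": 0.614, "Fgroup19": 0.036, "Fgroup20": -0.020},
    "S3": {"Fgroup18": 0.573, "Fgroup19": -0.041, "Fgroup20": -0.366},
    "S136": {"Fgroup18": 0.479, "Fgroup19": -0.030, "Fgroup20": 0.098},
    "S71": {"Fgroup18": 0.479, "Fgroup19": -0.077, "Fgroup20": -0.071},
    "S64": {"Fgroup18": 0.000, "Fgroup19": 0.913, "Fgroup20": 0.044},
    "S14": {"Fgroup18": 0.007, "Fgroup19": 0.657, "Fgroup20": -0.007},
    "S33": {"Fgroup18": -0.036, "Fgroup19": 0.636, "Fgroup20": 0.035},
    "S31": {"Fgroup18": -0.030, "Fgroup19": 0.484, "Fgroup20": -0.052},
    "S52": {"Fgroup18": 0.006, "Fgroup19": -0.012, "Fgroup20": -0.873},
    "S51": {"Fgroup18": -0.023, "Fgroup19": -0.030, "Fgroup20": -0.858},
    "S28": {"Fgroup18": 0.079, "Fgroup19": -0.087, "Fgroup20": -0.478},
}

def factor_groups(loadings, threshold=0.4):
    """Assign each symptom to every factor whose absolute loading reaches the threshold."""
    groups = {}
    for symptom, row in loadings.items():
        for factor, weight in row.items():
            if abs(weight) >= threshold:
                groups.setdefault(factor, []).append(symptom)
    return groups

groups = factor_groups(LOADINGS)
```

Note that S₃ stays out of Fgroup20 because its loading of −0.366 falls below the threshold in absolute value.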

Table 4. The d_i Factor_z pairs having the highest FactorScore_z (yellow cell) for each d_i.

Case: d_i	…	Factor17	Factor18	Factor19	Factor20	Factor21	…
d₈	…	−0.2358	−0.3685	−0.3134	−0.4716	1.2682	…
…	…	…	…	…	…	…	…
d₂₇	…	−0.2358	2.2686	1.0790	0.3166	−0.2318	…
…	…	…	…	…	…	…	…
d₃₂	…	−0.2358	0.5180	−0.3134	−0.4716	−0.2318	…
…	…	…	…	…	…	…	…
d₄₁	…	−0.2358	−0.3685	−0.3134	3.0963	−0.2318	…
d₄₂	…	−0.2358	5.5921	−0.3134	−0.4716	−0.2318	…
d₄₃	…	−0.2358	−0.3685	−0.3134	−0.4716	−0.2318	…
d₄₄	…	−0.2358	−0.3685	−0.3134	3.0963	−0.2318	…
…	…	…	…	…	…	…	…
d₅₃	…	−0.2358	−0.3685	0.6168	3.0963	−0.2318	…
d₅₄	…	−0.2358	0.4446	−0.3134	0.3166	−0.2318	…
d₅₅	…	−0.2358	−0.3685	1.0790	−0.4716	−0.2318	…
d₅₆	…	−0.2358	1.8883	−0.3134	3.0963	−0.2318	…
…	…	…	…	…	…	…	…
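
Linking each d_i to the factor with its highest factor score can be sketched as an argmax over a row of Table 4. The rows below are copied from the table, but only the five visible factor columns are used; the elided columns may hold larger scores, so the result is illustrative rather than the published graph.

```python
FACTORS = ["Factor17", "Factor18", "Factor19", "Factor20", "Factor21"]

# Selected rows of Table 4 (visible columns only).
FACTOR_SCORES = {
    "d8": [-0.2358, -0.3685, -0.3134, -0.4716, 1.2682],
    "d27": [-0.2358, 2.2686, 1.0790, 0.3166, -0.2318],
    "d41": [-0.2358, -0.3685, -0.3134, 3.0963, -0.2318],
    "d55": [-0.2358, -0.3685, 1.0790, -0.4716, -0.2318],
    "d56": [-0.2358, 1.8883, -0.3134, 3.0963, -0.2318],
}

def best_factor(scores):
    """Return the name of the factor with the highest score in one row."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return FACTORS[index]

# One edge per disease: d_i -> Factor_z with the highest visible score.
edges = {disease: best_factor(row) for disease, row in FACTOR_SCORES.items()}
```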

Table 5. The wc_Symt,ij Feature Determination from the first test sample based on Medical-Symptom-Expression List on Wikipedia.

Disease Type Documents	Number of Different wc_Symt,ij Features	Precision	Recall	F1-Score
Type₁ Documents (17 d_1,i documents)	76	0.892	0.791	0.839
Type₂ Documents (18 d_2,i documents)	115	0.887	0.754	0.815

Table 6. Determination of d_t,i wc_Symt,ij Pairs Having CErel from the second test sample.

Classifier	Metric	Type₁ (4700 EDUs)	Type₂ (5300 EDUs)
NB	Precision	0.750	0.742
NB	Recall	0.739	0.737
NB	F1-score	0.745	0.740
SVM	Precision	0.757	0.745
SVM	Recall	0.747	0.742
SVM	F1-score	0.752	0.744
LR	Precision	0.762	0.743
LR	Recall	0.759	0.742
LR	F1-score	0.761	0.743
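
The averaged F1-scores and the LR versus SVM differences quoted in Section 5.2 can be reproduced from the F1-score rows of Table 6. The variable names below are illustrative.

```python
# F1-scores from Table 6 as (Type1, Type2).
F1 = {"NB": (0.745, 0.740), "SVM": (0.752, 0.744), "LR": (0.761, 0.743)}

# Unweighted mean over the two disease types for each classifier.
mean_f1 = {model: sum(values) / len(values) for model, values in F1.items()}

# LR minus SVM per disease type: positive for Type1, slightly negative for Type2.
lr_minus_svm = tuple(lr - svm for lr, svm in zip(F1["LR"], F1["SVM"]))
```

The NB mean is 0.7425, which corresponds to the 0.743 quoted in the text after rounding up; the SVM and LR means are 0.748 and 0.752.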

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

